VAERS Script — Processing Flow and Pseudocode

Overview

The script processes VAERS (Vaccine Adverse Event Reporting System) data releases called "drops". Each drop is a dated ZIP or set of CSV files published by the CDC. The script ingests drops sequentially, consolidates the three raw VAERS files per drop into a single flat record per VAERS_ID, then tracks every field-level change between consecutive drops to produce a cumulative audit trail output file.

Directory Structure

0_VAERS_Downloads/        Raw input: ZIP files or dated subdirectories of CSVs
1_vaers_working/          Scratch space: extracted CSVs for current drop only
1_vaers_consolidated/     Intermediate: one file per drop, multiple rows per VAERS_ID
3_vaers_flattened/        Intermediate: one file per drop, one row per VAERS_ID
2_vaers_full_compared/    Output: cumulative FLATFILE per drop with change tracking
VAERS_FINAL_MERGED.csv    Final output: copy of the latest FLATFILE
stats.csv                 Running statistics: one row per drop plus a totals row
ever_published_covid.txt  Persistent list of every COVID VAERS_ID ever seen
ever_published_any.txt    Persistent list of every VAERS_ID ever seen (all vaccines)

Three Raw VAERS File Types per Drop

*VAERSDATA.csv     One row per VAERS_ID. Patient, outcome, and narrative fields.
*VAERSVAX.csv      One or more rows per VAERS_ID. One row per vaccine dose/lot.
*VAERSSYMPTOMS.csv One or more rows per VAERS_ID. Up to 5 MedDRA symptom codes per row.

Top-Level Execution: run_all()

PROCEDURE run_all():

    IF --merge-only flag is set:
        call create_final_merged_file()
        RETURN

    call validate_dirs_and_files()
    IF validation fails:
        exit with error

    WHILE more_to_do() returns True:
        date = get_next_date()
        initialize stats for date
        success = open_files(date)
        IF success is False:
            mark date as done, CONTINUE to next date

        call consolidate(date)
        call flatten(date)
        call compare(date)

    call create_final_merged_file()
    print error summary

Stage 0: Setup and File Cataloguing

validate_dirs_and_files()

PROCEDURE validate_dirs_and_files():

    create directories if missing:
        0_VAERS_Downloads, 1_vaers_working, 1_vaers_consolidated,
        2_vaers_full_compared, 3_vaers_flattened

    scan 0_VAERS_Downloads recursively for *.zip and *.csv files
    IF no files found:
        log error, return False

    print count of files found
    return True

files_populate_information()

Called at the start of several stages to refresh the in-memory catalogue of what files exist in each directory.

PROCEDURE files_populate_information():

    IF files dict is empty:
        initialise keys: input, working, flattened, changes, consolidated
        each key maps to: _dir, date[], keyval{}, valkey{}, files[]

    FOR each category in [input, working, flattened, changes, consolidated]:

        walk the category's directory recursively
        collect all files whose names contain a date string (YYYY-MM-DD)
        keep only .csv and .zip files
        exclude files ending in _a.csv or _b.csv (legacy test artifacts)

        extract the date string from each filename
        store sorted unique list of dates

        build keyval dict  date -> filepath
            NOTE: if two files share the same date key, log an error;
                  the last file wins (see fixes.md Fix 4)
        build valkey dict  filepath -> date

    IF test mode active:
        replace input catalogue with flattened catalogue

    IF date_floor is set:
        remove input dates earlier than date_floor
    IF date_ceiling is set:
        remove input dates later than date_ceiling
    apply the same date filter to all sub-catalogues of input
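The catalogue-building step above can be sketched in Python. This is a minimal, self-contained version: the function name `catalogue_files` and the in-memory return shape are illustrative, not the script's actual API, but the filtering rules (date in filename, .csv/.zip only, `_a.csv`/`_b.csv` excluded, last file wins on a date collision) follow the description.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def catalogue_files(filepaths):
    """Build the date<->file lookups described above (sketch).

    Keeps only .csv/.zip files whose names contain a YYYY-MM-DD string;
    on a duplicate date key the last file wins, as in the script.
    """
    keyval, valkey = {}, {}
    for path in sorted(filepaths):
        if not path.endswith((".csv", ".zip")):
            continue
        if path.endswith(("_a.csv", "_b.csv")):  # legacy test artifacts
            continue
        m = DATE_RE.search(path)
        if m is None:
            continue
        date = m.group(0)
        if date in keyval:
            print(f"WARNING: duplicate date key {date}; last file wins")
        keyval[date] = path   # date -> filepath
        valkey[path] = date   # filepath -> date
    return sorted(keyval), keyval, valkey
```
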

Stage 1: Loop Control

more_to_do()

FUNCTION more_to_do() -> Boolean:

    call files_populate_information()

    IF no changes files exist yet:
        return True   (nothing processed, keep going)

    IF all four catalogues (input, flattened, consolidated, changes) contain
       exactly the same dates AND the latest changes date >= latest input date:
        trigger save_multi_csv() to produce an _A/_B split if output exceeds
        1,048,575 data rows (Excel's 1,048,576-row sheet limit minus the header)
        return False  (all drops processed)

    reset per-drop elapsed timer
    return True

get_next_date()

FUNCTION get_next_date() -> date string:

    dates_input   = sorted input dates NOT yet in files['done']
    dates_changes = sorted dates of completed changes files

    IF no flattened files exist yet (first ever run):
        return dates_input[0]   (oldest unprocessed input)

    IF changes files exist:
        high_changes_done = latest completed changes date
        candidates = input dates that are AFTER high_changes_done

        IF candidates exist:
            date_next = earliest candidate
            IF date_next > date_ceiling:  exit
            return date_next

        ELSE (no candidates — all inputs are behind or equal to changes):
            exit ("No more input files to process")

    ELSE (flattened files exist but no changes files yet):
        IF first input date already has a flattened file:
            copy that flattened file into dir_compared as a FLATFILE
            add cell_edits=0, status='', changes='' columns
            update ever_covid and stats for this date
            return dates_input[1]   (skip first, use it as baseline for compare)

        ELSE:
            return dates_input[0]

Stage 2: Input Extraction — open_files(date)

PROCEDURE open_files(date):

    call files_populate_information()

    IF date is already in the consolidated catalogue:
        clear working directory
        recreate working directory
        write date marker file into working directory
        return True  (nothing to extract, consolidation already done)

    IF date is already in the flattened catalogue:
        return True  (skip extraction, already fully processed)

    IF date not found in input catalogue:
        exit with error

    source = files['input']['keyval'][date]

    IF source is a CSV file:
        clear working directory
        copy all input files for this date into working directory

    ELSE IF source is a ZIP file:
        clear working directory
        open ZIP (using zipfile_deflate64 for deflate64 compression support)
        FOR each file inside the ZIP:
            IF filename matches year pattern (YYYY...) OR starts with "nond":
                extract to working directory
        IF test mode with files_limit set:
            remove extracted files not matching files_limit filter

    write date marker file into working directory
    return True on success, False if ZIP could not be opened or extracted
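The member-selection rule inside the ZIP branch can be sketched as below. The stdlib `zipfile` module is used here so the example is self-contained; the real script imports `zipfile_deflate64` instead, because CDC drops may use Deflate64 compression that stdlib `zipfile` cannot decompress. The function name `select_vaers_members` is illustrative.

```python
import io
import zipfile  # the script itself uses zipfile_deflate64 for Deflate64 support

def select_vaers_members(zip_bytes):
    """Return the member names open_files() would extract: names starting
    with a year (e.g. 2021VAERSDATA.csv) or with 'nond' (non-domestic files)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist()
                if name[:4].isdigit() or name.lower().startswith("nond")]
```
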

Stage 3: Consolidation — consolidate(date)

Combines the three raw VAERS file types into one file. Output still has multiple rows per VAERS_ID (one per dose).

PROCEDURE consolidate(date):

    call files_populate_information()

    IF date already in consolidated catalogue:
        print "already done", clear df_vax global, RETURN

    IF date already in flattened catalogue:
        print "skipping, flattened exists", RETURN

    --- Load raw files ---

    df_vax      = concatenate all *VAERSVAX.csv     files from working directory
    df_data_all = concatenate all *VAERSDATA.csv    files from working directory
    df_syms     = concatenate all *VAERSSYMPTOMS.csv files from working directory

    IF vids_limit set (test mode):
        filter all three dataframes to only those VAERS_IDs

    --- Identify earliest COVID VAERS_ID ---

    df_vax_covid = rows in df_vax where VAX_TYPE contains "covid" (case-insensitive)

    IF df_vax_covid is not empty:
        covid_earliest_vaers_id_now = minimum VAERS_ID in df_vax_covid
        IF covid_earliest_vaers_id not yet set:
            set covid_earliest_vaers_id = covid_earliest_vaers_id_now
        ELSE IF new minimum is lower than current:
            update covid_earliest_vaers_id to the lower value
    ELSE:
        print "no covid found in this drop"
        remove this date from input catalogue (cheap trick to skip it)
        RETURN

    --- Remove pre-COVID records ---

    filter df_data_all, df_vax, df_syms to VAERS_ID >= covid_earliest_vaers_id
    print count removed and count remaining

    update ever_any tracking dict with all VAERS_IDs in df_data_all
    call do_never_ever() to track IDs not yet seen

    --- Identify COVID reports ---

    Strategy: a report is "covid" if it meets ANY of:
        (a) Single-dose report (doses == 1) AND that dose is VAX_TYPE containing "COVID"
        (b) Multi-dose report (doses >= 2) AND at least one dose is VAX_TYPE "COVID"
        (c) VAERS_ID not in (a) or (b) but SYMPTOM_TEXT contains "Pfizer" or "Moderna"
            or "Janssen" AND also contains "Covid"

    vids_all_covid_list = union of (a+b) and (c) VAERS_IDs
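The three-part test collapses neatly in pandas, since (a) and (b) together amount to "at least one dose row has a COVID VAX_TYPE". A minimal sketch, assuming the raw column names `VAX_TYPE` and `SYMPTOM_TEXT` (the function name `covid_vaers_ids` is illustrative):

```python
import pandas as pd

def covid_vaers_ids(df_vax, df_data):
    """Return the sorted union of VAERS_IDs matching rules (a)+(b) and (c)."""
    # (a)+(b): any dose row whose VAX_TYPE contains "covid"
    is_covid_vax = df_vax["VAX_TYPE"].str.contains("covid", case=False, na=False)
    vids_vax = set(df_vax.loc[is_covid_vax, "VAERS_ID"])

    # (c): narrative names a COVID manufacturer AND contains "Covid"
    text = df_data["SYMPTOM_TEXT"].fillna("")
    manu = text.str.contains("Pfizer|Moderna|Janssen", case=False)
    covid_word = text.str.contains("covid", case=False)
    vids_text = set(df_data.loc[manu & covid_word, "VAERS_ID"]) - vids_vax

    return sorted(vids_vax | vids_text)
```
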

    --- Filter to COVID-only ---

    df_data = df_data_all filtered to vids_all_covid_list
    df_vax  = df_vax       filtered to vids_all_covid_list
    df_syms = df_syms      filtered to vids_all_covid_list

    df_data = df_data.drop_duplicates(subset='VAERS_ID')
        [one dedup only — the second redundant call was removed, see fixes.md Fix 1]

    --- Shorten VAX_NAME and VAX_MANU field values ---

    Apply regex replacements to df_vax:
        VAX_NAME: remove parentheses, collapse "COVID19 COVID19 ..." to "C19 ...",
                  shorten manufacturer strings (Pfizer-BionT, Moderna, Janssen, etc.)
        VAX_MANU: shorten to readable forms (Unknown, Pfizer-BionT, Moderna, etc.)

    Sort df_vax by [VAERS_ID, VAX_LOT, VAX_SITE, VAX_DOSE_SERIES, VAX_TYPE,
                    VAX_MANU, VAX_ROUTE, VAX_NAME]
        (sorting prevents row-order differences between drops from appearing as changes)

    --- Merge DATA into VAX ---

    df_data_vax = df_vax LEFT JOIN df_data using VAERS_ID
        (result has multiple rows per VAERS_ID — one per dose/lot)
    fill NaN with empty string

    --- Aggregate symptoms ---

    call symptoms_file_entries_append_to_symptom_text(df_syms):

        PROCEDURE symptoms_file_entries_append_to_symptom_text(df_symfile):

            keep only VAERS_ID and SYMPTOM1..SYMPTOM5 columns (drop SYMPTOMVERSION cols)

            join all five symptom columns into single string using "_|_" delimiter:
                symptom_entries = sort(SYMPTOM1..5) joined by "_|_"

            GROUP BY VAERS_ID, join all symptom_entries rows with "_|_"
                (handles reports with more than 5 symptoms spanning multiple rows)

            clean up consecutive delimiters from empty symptom columns
            ensure string starts and ends with "_|_"

            IF any symptom_entries exceed 32,720 characters (Excel cell limit):
                truncate and append "[truncated, Excel cell size limit 32,767]"

            return df_syms_flat  (one row per VAERS_ID)
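The core of that aggregation (joining SYMPTOM1..5 across multiple rows into one `_|_`-delimited string per VAERS_ID) can be sketched as follows. The cleanup and Excel-truncation steps are omitted, and `flatten_symptoms` is an illustrative name:

```python
import pandas as pd

DELIM = "_|_"

def flatten_symptoms(df_symfile):
    """Collapse SYMPTOM1..SYMPTOM5 across rows into one delimited string
    per VAERS_ID, sorted within each row, wrapped in leading/trailing DELIM."""
    cols = ["SYMPTOM1", "SYMPTOM2", "SYMPTOM3", "SYMPTOM4", "SYMPTOM5"]

    def join_row(row):
        # drop empty cells, sort the remaining symptom terms
        vals = sorted(v for v in row if isinstance(v, str) and v)
        return DELIM.join(vals)

    df = df_symfile.copy()
    df["symptom_entries"] = df[cols].apply(join_row, axis=1)
    return (df.groupby("VAERS_ID")["symptom_entries"]
              .apply(lambda s: DELIM + DELIM.join(v for v in s if v) + DELIM)
              .reset_index())
```
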

    --- Final consolidated merge ---

    df_consolidated = df_data_vax LEFT JOIN df_syms_flat using VAERS_ID
    fill NaN with empty string

    write to: 1_vaers_consolidated/DATE_VAERS_CONSOLIDATED.csv
        (multiple rows per VAERS_ID is expected and correct at this stage)

Stage 4: Flattening — flatten(date)

Reduces to exactly one row per VAERS_ID by aggregating all dose/lot fields.

PROCEDURE flatten(date):

    IF covid_earliest_vaers_id not yet set:
        mark date as done, RETURN

    call files_populate_information()

    IF date already in flattened catalogue:
        print "already done", RETURN

    --- Source data: in-memory or from consolidated file ---

    IF df_vax global is empty (consolidate did not just run in this session):
        read from saved consolidated file for this date

        df_vax      = consolidated[VAX columns]
        df_data     = consolidated[all other columns].drop_duplicates(VAERS_ID)
            [dedup immediately — consolidated file has N rows per VAERS_ID,
             must reduce to 1 before merge; see fixes.md Fix 2]
        remove symptom_entries from df_data (added back separately)
        df_syms_flat = consolidated[VAERS_ID, symptom_entries].drop_duplicates(VAERS_ID)
            [same reason — see fixes.md Fix 2]

    ELSE:
        df_vax, df_data, df_syms_flat are already in memory from consolidate()
        df_data is already 1 row per VAERS_ID

    --- Aggregate VAX fields to one row per VAERS_ID ---

    df_vax_flat = df_vax.groupby(VAERS_ID).agg(list).map("|".join)
        Each VAX field becomes a pipe-delimited string of all values:
            e.g. VAX_LOT = "EK9231|EL0140" for a two-dose report
    drop_duplicates on full row as safety net
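The groupby/pipe-join step can be sketched directly in pandas (`flatten_vax` is an illustrative name):

```python
import pandas as pd

def flatten_vax(df_vax):
    """One row per VAERS_ID: every VAX column becomes a '|'-delimited
    string of that report's per-dose values."""
    return (df_vax.groupby("VAERS_ID", as_index=False)
                  .agg(lambda s: "|".join(s.astype(str))))
```

Because df_vax was sorted before this step, the pipe-joined values come out in a stable order, which is what keeps row-order noise out of the later compare.
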

    --- Merge flattened VAX with DATA ---

    df_data_vax_flat = df_vax_flat LEFT JOIN df_data using VAERS_ID
        (should be 1:1 since both are now one row per VAERS_ID)
    drop_duplicates on full row as safety net

    --- Merge symptom_entries ---

    df_final = df_data_vax_flat LEFT JOIN df_syms_flat using VAERS_ID
    fill NaN with empty string
        (some VAERS_IDs have no entry in the symptoms file; this is expected)
    drop_duplicates on full row as safety net

    --- Deduplicate repeat sentences in SYMPTOM_TEXT ---

    call symptoms_dedupe_repeat_sentences(df_final):

        FOR each row:
            call symptoms_dedupe_repeat_sentences_each(VAERS_ID, SYMPTOM_TEXT):

                detect the most prevalent delimiter in the text:
                    candidates: ". "  "|"  "; "  " - "
                    winner = delimiter appearing >= 5 times
                IF no prevalent delimiter found: return text unchanged

                split text on delimiter into list_in
                IF no duplicates in list_in: return text unchanged

                FOR each segment in list_in:
                    IF segment already seen in this text AND len(segment) > 40:
                        replace with placeholder "`^`"
                        record byte count saved
                    ELSE:
                        keep segment

                update dedupe stats (count, bytes, max_bytes, which VAERS_ID)
                return cleaned text
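A minimal sketch of the per-row dedupe, without the stats bookkeeping (`dedupe_repeat_sentences` is an illustrative name; thresholds follow the description above):

```python
def dedupe_repeat_sentences(text, min_len=40, min_delim_count=5):
    """Replace repeated long segments with the '`^`' placeholder the script uses."""
    # detect the prevalent delimiter: first candidate appearing >= 5 times
    for delim in (". ", "|", "; ", " - "):
        if text.count(delim) >= min_delim_count:
            break
    else:
        return text  # no prevalent delimiter: leave text unchanged

    seen, out = set(), []
    for seg in text.split(delim):
        if seg in seen and len(seg) > min_len:
            out.append("`^`")   # repeated long segment: placeholder
        else:
            out.append(seg)
            seen.add(seg)
    return delim.join(out)
```
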

    write to: 3_vaers_flattened/DATE_VAERS_FLATTENED.csv
        (one row per VAERS_ID)

    store result in global df_flat_2 for use by compare() without re-reading disk

Stage 5: Comparison — compare(date)

Diffs the newly flattened file against the previous one. Tracks field-level changes, deletions, restorations, and new reports. Produces the cumulative output FLATFILE.

PROCEDURE compare(date_currently):

    IF covid_earliest_vaers_id not yet set: RETURN

    call files_populate_information()

    file_flattened_previous = most recent flattened file with date < date_currently
    file_flattened_working  = flattened file for date_currently

    --- First-drop special case ---

    IF no previous flattened file exists:
        copy current flattened to 2_vaers_full_compared/DATE_VAERS_FLATFILE.csv
        add columns: cell_edits=0, status='', changes=''
        update ever_covid and stats
        RETURN

    --- Load previous and current flat files ---

    IF df_flat_1 global is populated (carried over from previous loop iteration):
        df_flat_prv = df_flat_1   (avoid re-reading disk)
    ELSE:
        df_flat_prv = read file_flattened_previous from disk

    IF df_flat_2 global is populated (just produced by flatten()):
        df_flat_new = df_flat_2   (avoid re-reading disk)
    ELSE:
        df_flat_new = read file_flattened_working from disk

    ensure df_flat_prv has columns: cell_edits, status, changes
        (absent on first compare when sourced directly from a flattened file)
    sort df_flat_prv by [cell_edits, status, changes] descending
        (reports with most edits appear first)

    --- Build tracking union ---

    df_both_flat_inputs = concat(df_flat_prv, df_flat_new).drop_duplicates(VAERS_ID)
        used at the end to verify no records are lost

    --- Initialise df_edits ---

    df_edits        = copy of df_flat_new
        (all new records; will be modified in place with carried-over tracking data)
    df_changes_done = empty dataframe
        (accumulates records that need no further processing this pass)

    --- Carry forward change tracking from previous ---

    find VAERS_IDs present in both df_flat_prv and df_edits
    from df_flat_prv, find those with non-zero cell_edits or non-empty changes/status

    FOR those VAERS_IDs:
        copy cell_edits, changes, status from df_flat_prv into df_edits

    --- Absorb records only in previous (deleted candidates) ---

    rows in df_flat_prv NOT in df_edits (df_flat_new) are appended to df_edits
        (they may be truly deleted or just absent from this drop temporarily)

    --- IDENTICAL records ---

    df_merged = OUTER JOIN df_flat_prv and df_flat_new on all columns
    df_identical = rows where _merge == 'both'
        (field-for-field identical in prv and new)

    move identical rows to df_changes_done
    remove their VAERS_IDs from both df_flat_prv and df_flat_new
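The identical-record detection maps onto pandas' `merge(..., indicator=True)`: merging on every column flags field-for-field identical rows as `'both'`. A minimal sketch (`split_identical` is an illustrative name):

```python
import pandas as pd

def split_identical(df_prv, df_new):
    """Outer-merge on all columns; rows flagged 'both' are identical in
    prv and new and need no further comparison."""
    merged = df_prv.merge(df_new, how="outer", indicator=True)
    return merged[merged["_merge"] == "both"].drop(columns="_merge")
```
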

    --- DELETED records ---

    list_prv_not_in_new = VAERS_IDs in df_flat_prv but NOT in df_flat_new

    FOR each:
        IF already marked "Deleted YYYY-MM-DD" in status (trailing spaces marker):
            note as "deleted_prior" — already tracked
        ELSE:
            append "Deleted DATE_CURRENTLY    " to status
            add len(columns_vaers) to cell_edits (bulk penalty for deletion)
            record in stats['deleted']

    move all prv_not_in_new rows to df_changes_done
    remove their VAERS_IDs from df_flat_prv

    --- RESTORED records ---

    df_restored = rows in df_edits whose VAERS_ID is in df_flat_new
                  AND whose status contains "Deleted YYYY-MM-DD    " (trailing spaces)

    FOR each restored record:
        append "Restored DATE_CURRENTLY    " to status
        add len(columns_vaers) to cell_edits
        move to df_changes_done
        remove from df_flat_prv
        record in stats['restored']

    --- NEW records (VAERS_ID higher than max in previous) ---

    list_flat_new_gt = VAERS_IDs in df_flat_new > max(df_flat_prv.VAERS_ID)
        these are brand-new reports above the previous maximum ID

    --- DELAYED / GAPFILL records ---

    list_flat_new_lt = VAERS_IDs in df_flat_new < max(df_flat_prv.VAERS_ID)
    list_gapfills    = lt-IDs that are NOT in df_flat_prv AND NOT in restored list
        (IDs whose VAERS_ID falls within the previously seen range but only
         appear now — reports that were published late / throttled by CDC)

    FOR each gapfill:
        append "Delayed DATE_CURRENTLY    " to status
        record in stats['gapfill']
    move gapfills to df_changes_done
    remove from df_flat_new

    --- Remaining brand-new records (gt and any other new-only) ---

    rows in df_flat_new NOT in df_flat_prv:
        move directly to df_changes_done (no compare needed, first appearance)
        remove from df_flat_new

    --- MODIFIED records ---

    At this point df_flat_prv and df_flat_new contain only VAERS_IDs present in both,
    with the same count of rows. These are candidates for field-level changes.

    sort both by VAERS_ID, reset index so positional comparison is valid
    align df_flat_new columns to match df_flat_prv column order

    df_all_changed = df_flat_prv.compare(df_flat_new)
    cols_changed_list = unique column names that differ (excluding cell_edits, status, changes)
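The alignment and diff step corresponds to pandas' `DataFrame.compare()`, which requires identically labeled rows and columns and returns only the differing cells under a (column, self/other) MultiIndex. A minimal sketch (`changed_columns` is an illustrative name):

```python
import pandas as pd

def changed_columns(df_prv, df_new):
    """Sort both frames by VAERS_ID, align column order, and return the
    names of columns containing at least one changed cell."""
    df_prv = df_prv.sort_values("VAERS_ID").reset_index(drop=True)
    df_new = df_new.sort_values("VAERS_ID").reset_index(drop=True)[df_prv.columns]
    diff = df_prv.compare(df_new)  # columns: MultiIndex (column, self/other)
    return sorted(diff.columns.get_level_values(0).unique())
```
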

    FOR each changed column col:

        df_slice_prv = df_flat_prv[VAERS_ID, col]
        df_slice_new = df_flat_new[VAERS_ID, col]

        vids_changed = VAERS_IDs where col value differs between prv and new

        --- Bulk blanking shortcut ---

        IF >= 200 records in this column were non-empty in prv but empty in new:
            note in changes column in bulk without per-record iteration
            record in stats['cells_emptied']
            remove these from df_three_columns (handled separately)

        --- Trivial change filter ---

        IF col is a date column:
            remove rows where values are identical after stripping leading zeros
            (e.g. "12/03/2020" and "12/3/2020" treated as identical)
        ELSE:
            remove rows where values are identical after stripping all non-alphanumeric
            characters (ignores punctuation changes, case, whitespace)

        record count of trivial changes ignored in stats

        IF no non-trivial changes remain: CONTINUE to next column
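The trivial-change filter can be sketched as a pair of normalizations (`is_trivial_change` is an illustrative name; the case-insensitive comparison follows the parenthetical above):

```python
import re

def is_trivial_change(prv, new, is_date_col=False):
    """True if the change would be ignored: date columns compare with
    leading zeros stripped; other columns compare only the alphanumeric
    characters, case-insensitively."""
    if is_date_col:
        strip = lambda s: re.sub(r"\b0(\d)", r"\1", s)  # 12/03 -> 12/3
        return strip(prv) == strip(new)
    clean = lambda s: re.sub(r"[^0-9a-z]", "", s.lower())
    return clean(prv) == clean(new)
```
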

        --- Group identical changes ---

        group remaining changes by (prv_value, new_value) pairs
            so that many VAERS_IDs sharing the same change are processed once
            and the result distributed to all of them

        FOR each unique (prv_value, new_value) group:

            vids_list = all VAERS_IDs in this group
            val_prv   = prv_value
            val_new   = new_value

            classify the change:
                newly_cut      = val_prv non-empty AND val_new empty AND no "cut_" marker in prv
                continuing_cut = val_prv contains "cut_" AND val_new is empty
                restored       = val_prv contains "cut_" AND val_new is non-empty
                                 AND val_prv starts with val_new

            --- Diff context reduction ---

            IF both val_prv and val_new are non-empty:
                call diff_context(val_prv, val_new, col) to reduce to just the differing part:

                    FUNCTION diff_context(prv, new, col):
                        simplify both strings (remove unicode, collapse whitespace)

                        IF identical after stripping non-alphanumeric: return ('', '')

                        IF delimiter-separated content (symptom_entries or detected delimited string):
                            split both on delimiter
                            keep only items unique to each side
                            IF remaining is identical after stripping non-alphanumeric: return ('', '')
                            IF order is only difference (sorted equal): return ('', '')

                        WHILE up to 10 iterations:
                            IF either string is empty: return (a, b)
                            IF equal case-insensitively: return ('', '')
                            IF word-order is only difference: return ('', '')

                            find longest common substring of words
                            IF common substring > 6 characters:
                                replace it with " .. " in both strings
                                keep only words unique to each side
                            ELSE:
                                stop iterating

                        remove consecutive spaces, strip leading/trailing
                        IF identical after stripping non-alphanumeric: return ('', '')
                        return (reduced_prv, reduced_new)
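A simplified sketch of the reduction loop, using `difflib.SequenceMatcher` to find the longest shared run (the real function works word-wise and has extra delimiter handling; `diff_context` here keeps only the shape of the iteration):

```python
from difflib import SequenceMatcher

def diff_context(prv, new, min_common=6):
    """Repeatedly cut the longest shared run (> min_common chars) out of
    both strings, leaving ' .. ' markers, until only differences remain."""
    a, b = " ".join(prv.split()), " ".join(new.split())
    for _ in range(10):
        if not a or not b:
            return a, b
        if a.lower() == b.lower():
            return "", ""
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        if m.size <= min_common:
            break
        a = (a[:m.a] + " .. " + a[m.a + m.size:]).strip()
        b = (b[:m.b] + " .. " + b[m.b + m.size:]).strip()
    collapse = lambda s: " ".join(s.split())
    return collapse(a), collapse(b)
```
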

            FOR each vid in vids_list:

                determine val_to_keep (what to store in that field going forward):
                    IF newly_cut:
                        val_to_keep = original_val + " cut_DATE_CURRENTLY   "
                            (preserves deleted content with date stamp)
                        the_changes_note = "col cut_DATE_CURRENTLY    "
                        edits_add = 1
                    ELIF continuing_cut:
                        val_to_keep = val_prv_original (preserve existing cut tag)
                        edits_add = 0
                    ELIF both empty after diff: edits_add = 0
                    ELSE:
                        val_to_keep = val_new_original
                        the_changes_note = "col DATE: prv <> new    "
                        edits_add = 1

                IF edits_add:
                    df_edits[vid].cell_edits += 1
                IF the_changes_note:
                    df_edits[vid].changes   += the_changes_note
                IF val_to_keep:
                    df_edits[vid][col]       = val_to_keep

    --- Finalise df_edits ---

    df_edits = df_edits.drop_duplicates(subset='VAERS_ID', keep='last')
        [applied to df_edits directly, not a discarded copy; see fixes.md Fix 5]
    len_edits = count of rows in deduplicated df_edits

    move entire df_edits to df_changes_done

    df_changes_done = drop_duplicates on all columns (keep='last')
    df_changes_done = drop_duplicates on VAERS_ID     (keep='last')
        (two-pass: first remove fully identical rows, then enforce one row per VAERS_ID)

    fill NaN with empty string
    sort by [cell_edits, status, changes] descending
        (most-edited reports appear at top of output file)

    --- Add dose count metadata ---

    df_doses = rows where VAX_LOT contains "|" (multi-dose reports)
    df_doses['doses'] = count of "|" characters in VAX_LOT
        (pipe count = number of additional doses; +1 for the first dose)
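Since a flattened VAX_LOT holds one pipe-joined value per dose, the dose count is just the pipe count plus one. A minimal sketch (`add_dose_count` is an illustrative name):

```python
import pandas as pd

def add_dose_count(df):
    """Select multi-dose reports (VAX_LOT contains '|') and derive the
    dose count from the pipe count."""
    multi = df[df["VAX_LOT"].str.contains(r"\|", na=False)].copy()
    multi["doses"] = multi["VAX_LOT"].str.count(r"\|") + 1
    return multi
```
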

    --- Write output ---

    write df_changes_done to:
        2_vaers_full_compared/DATE_VAERS_FLATFILE.csv

    IF output exceeds 1,048,575 rows (Excel sheet limit):
        also write _A and _B split files via save_multi_csv()

    --- Carry forward for next loop iteration ---

    df_flat_1 = df_changes_done  (used as df_flat_prv in next compare())

    --- Tracking and verification ---

    call do_never_ever(all VAERS_IDs from df_both_flat_inputs, date, source)
    call do_ever_covid(all VAERS_IDs from df_both_flat_inputs)
    call stats_resolve(date)
    call verify_all_reports_present(df_both_flat_inputs, df_changes_done)

Stage 6: Final Output — create_final_merged_file()

PROCEDURE create_final_merged_file():

    scan 2_vaers_full_compared/ for *_VAERS_FLATFILE.csv files
    IF none found, try *_VAERS_FLATFILE_A.csv

    sort by filename (date is prefix, so lexicographic sort = chronological)
    latest_file = last in sorted list

    read latest_file into df_final

    write df_final to VAERS_FINAL_MERGED.csv
        (simple copy — no additional processing)

    print summary:
        total records
        total cell_edits sum (if column present)
        count of records with cell_edits > 0
        count of records with "Deleted" in status

Persistent Tracking Functions

do_ever_covid(vids_list)

PROCEDURE do_ever_covid(vids_list):

    load ever_published_covid.txt if it exists
        parse as list of integers, store in ever_covid dict

    find vids_new = vids in vids_list NOT already in ever_covid

    IF vids_new is non-empty:
        update ever_covid dict with all vids in vids_list
        write sorted ever_covid keys back to ever_published_covid.txt
            one VAERS_ID per line

stats_initialize(date) / stats_resolve(date)

PROCEDURE stats_initialize(date):
    reset stats dict for this drop with zero counters:
        drop_input_covid, comparisons, deleted, restored, modified,
        lo_ever, hi_ever, dedupe_count, dedupe_reports, dedupe_bytes,
        dedupe_max_bytes, dedupe_max_vid, gapfill, cells_edited,
        cells_emptied, trivial_changes_ignored
    columns counter = Counter()

PROCEDURE stats_resolve(date):
    load stats.csv if it exists
    remove any existing row for this date and any existing "All" totals row
    append new row for this date
    compute "All" totals row:
        columns starting with "lo_" -> minimum across all drops
        columns starting with "hi_" -> maximum across all drops
        all other numeric columns   -> sum across all drops
    append totals row
    write stats.csv

Data Type and Quality Functions

types_set(df)

PROCEDURE types_set(df):
    IF VAERS_ID column present:
        coerce to numeric, fill NaN with 0, cast to int64
    IF cell_edits column present:
        coerce to numeric, fill NaN with 0, cast to int64
    FOR each of [AGE_YRS, CAGE_YR, CAGE_MO, NUMDAYS, HOSPDAYS]:
        IF column present: coerce to numeric (leave NaN as NaN)

fix_date_format(df)

PROCEDURE fix_date_format(df):
    skip if df has no VAERS_ID column or has a 'gapfill' column
    FOR each date column [DATEDIED, VAX_DATE, RPT_DATE, RECVDATE, TODAYS_DATE, ONSET_DATE]:
        IF any values contain "/" (MM/DD/YYYY format):
            strip any trailing " cut_..." markers
            parse with pd.to_datetime, reformat to YYYY-MM-DD
            fill NaN with empty string
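For one column, the reformat step can be sketched as follows (`normalize_date_column` is an illustrative name; unparseable or empty cells come back as empty strings):

```python
import pandas as pd

def normalize_date_column(series):
    """Strip any trailing ' cut_...' marker, then rewrite MM/DD/YYYY
    values as YYYY-MM-DD; empty/unparseable cells become ''."""
    s = series.str.replace(r"\s*cut_.*$", "", regex=True)
    parsed = pd.to_datetime(s, format="%m/%d/%Y", errors="coerce")
    return parsed.dt.strftime("%Y-%m-%d").fillna("")
```
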

drop_dupes(df)

PROCEDURE drop_dupes(df):
    record len_before
    df = drop_duplicates on ALL columns, keep='last'
    IF any were dropped: print warning "SHOULD NOT HAVE HAPPENED"

check_dupe_vaers_id(df)

FUNCTION check_dupe_vaers_id(df) -> int:
    count VAERS_IDs appearing more than once
    IF any duplicates: print warning with line number, count, and sample IDs
    return 1 if duplicates exist, 0 otherwise

verify_all_reports_present(df_input, df_output)

PROCEDURE verify_all_reports_present(df_input, df_output):
    find VAERS_IDs in df_input NOT in df_output -> print error if any (orphans)
    find VAERS_IDs in df_output NOT in df_input -> print error if any (extras)
    IF orphans found: append them to df_output and write orphans.csv
    IF extras found: write extras.csv
    cross-check df_output against ever_covid tracking dict:
        report IDs in output not in ever_covid
        report IDs in ever_covid not in output

Column Structure of Output Files

Consolidated file (1_vaers_consolidated/)

Multiple rows per VAERS_ID. All raw VAERS columns plus symptom_entries.

Flattened file (3_vaers_flattened/)

One row per VAERS_ID. VAX fields are pipe-delimited strings. Columns: all from VAERSDATA + all from VAERSVAX (pipe-joined) + symptom_entries.

FLATFILE / Final output (2_vaers_full_compared/)

One row per VAERS_ID. All flattened columns plus three tracking columns prepended:

Column     | Type    | Description
cell_edits | integer | Cumulative count of non-trivial field changes across all drops
status     | string  | Space-separated event markers: "Deleted YYYY-MM-DD", "Restored YYYY-MM-DD", "Delayed YYYY-MM-DD"
changes    | string  | Concatenated change notes: "COLUMN DATE: old_value <> new_value" per changed field

Records are sorted by cell_edits descending so the most-amended reports appear first.


Key Conditions and Decision Points Summary

Condition | Where checked | Effect
Drop already consolidated | consolidate() entry | Skip consolidation entirely
Drop already flattened | consolidate() and flatten() entry | Skip that stage
No COVID reports in drop | consolidate() after VAX filter | Remove date from input catalogue, skip
VAERS_ID < covid_earliest_vaers_id | consolidate() | Record excluded from all processing
VAERS_ID in both prv and new, all fields identical | compare() identical block | Moved to done, no change note
VAERS_ID only in prv, no "Deleted" marker | compare() deleted block | Mark Deleted, add cell_edits penalty
VAERS_ID only in prv, already marked "Deleted" | compare() deleted block | Carry forward as-is
VAERS_ID in new with "Deleted" marker in prv | compare() restored block | Mark Restored, add cell_edits penalty
VAERS_ID in new > max(prv VAERS_ID) | compare() gt block | Brand new, move to done, no compare
VAERS_ID in new < max(prv VAERS_ID) but not in prv | compare() gapfill block | Mark Delayed, move to done
Field change is only punctuation/case/whitespace | compare() per-column | Counted as trivial, skipped
>= 200 records blanked in one column | compare() bulk blanking | Noted in bulk, not per-record
Field blanked (prv non-empty, new empty) | compare() per-record | Value preserved with "cut_DATE" suffix
Field previously cut, still empty | compare() per-record | Carry cut value forward, no new note
Field previously cut, now restored | compare() per-record | New value kept, "[restored]" noted
SYMPTOM_TEXT segment > 40 chars appears twice | symptoms_dedupe... | Second occurrence replaced with "`^`"
symptom_entries string > 32,720 chars | symptoms_file_entries... | Truncated with Excel limit notice
Output rows > 1,048,575 | save_multi_csv() | Written as _A and _B split files
ZIP uses deflate64 compression | files_from_zip() | Requires zipfile_deflate64 library
Date key collision in directory scan | files_populate_information() | Error logged, last file wins

Original Author: admin
