VAERS Script — Processing Flow and Pseudocode

Overview

The script processes VAERS (Vaccine Adverse Event Reporting System) data releases called "drops". Each drop is a dated ZIP or set of CSV files published by the CDC. The script ingests drops sequentially, consolidates the three raw VAERS files per drop into a single flat record per VAERS_ID, then tracks every field-level change between consecutive drops to produce a cumulative audit trail output file.

Directory Structure

0_VAERS_Downloads/        Raw input: ZIP files or dated subdirectories of CSVs
1_vaers_working/          Scratch space: extracted CSVs for current drop only
1_vaers_consolidated/     Intermediate: one file per drop, multiple rows per VAERS_ID
3_vaers_flattened/        Intermediate: one file per drop, one row per VAERS_ID
2_vaers_full_compared/    Output: cumulative FLATFILE per drop with change tracking
VAERS_FINAL_MERGED.csv    Final output: copy of the latest FLATFILE
stats.csv                 Running statistics: one row per drop plus a totals row
ever_published_covid.txt  Persistent list of every COVID VAERS_ID ever seen
ever_published_any.txt    Persistent list of every VAERS_ID ever seen (all vaccines)

Three Raw VAERS File Types per Drop

*VAERSDATA.csv     One row per VAERS_ID. Patient, outcome, and narrative fields.
*VAERSVAX.csv      One or more rows per VAERS_ID. One row per vaccine dose/lot.
*VAERSSYMPTOMS.csv One or more rows per VAERS_ID. Up to 5 MedDRA symptom codes per row.

Top-Level Execution: run_all()

PROCEDURE run_all():

    IF --merge-only flag is set:
        call create_final_merged_file()
        RETURN

    call validate_dirs_and_files()
    IF validation fails:
        exit with error

    WHILE more_to_do() returns True:
        date = get_next_date()
        initialize stats for date
        success = open_files(date)
        IF success is False:
            mark date as done, CONTINUE to next date

        call consolidate(date)
        call flatten(date)
        call compare(date)

    call create_final_merged_file()
    print error summary

Stage 0: Setup and File Cataloguing

validate_dirs_and_files()

PROCEDURE validate_dirs_and_files():

    create directories if missing:
        0_VAERS_Downloads, 1_vaers_working, 1_vaers_consolidated,
        2_vaers_full_compared, 3_vaers_flattened

    scan 0_VAERS_Downloads recursively for *.zip and *.csv files
    IF no files found:
        log error, return False

    print count of files found
    return True

files_populate_information()

Called at the start of several stages to refresh the in-memory catalogue of what files exist in each directory.

PROCEDURE files_populate_information():

    IF files dict is empty:
        initialise keys: input, working, flattened, changes, consolidated
        each key maps to: _dir, date[], keyval{}, valkey{}, files[]

    FOR each category in [input, working, flattened, changes, consolidated]:

        walk the category's directory recursively
        collect all files whose names contain a date string (YYYY-MM-DD)
        keep only .csv and .zip files
        exclude files ending in _a.csv or _b.csv (legacy test artifacts)

        extract the date string from each filename
        store sorted unique list of dates

        build keyval dict  date -> filepath
            NOTE: if two files share the same date key, log an error;
                  the last file wins (see fixes.md Fix 4)
        build valkey dict  filepath -> date

    IF test mode active:
        replace input catalogue with flattened catalogue

    IF date_floor is set:
        remove input dates earlier than date_floor
    IF date_ceiling is set:
        remove input dates later than date_ceiling
    apply the same date filter to all sub-catalogues of input
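The catalogue-building step above can be sketched in Python. This is a minimal, self-contained version: the function name `catalogue_files` and the in-memory return shape are illustrative, not the script's actual API, but the filtering rules (date in filename, .csv/.zip only, `_a.csv`/`_b.csv` excluded, last file wins on a date collision) follow the description.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def catalogue_files(filepaths):
    """Build the date<->file lookups described above (sketch).

    Keeps only .csv/.zip files whose names contain a YYYY-MM-DD string;
    on a duplicate date key the last file wins, as in the script.
    """
    keyval, valkey = {}, {}
    for path in sorted(filepaths):
        if not path.endswith((".csv", ".zip")):
            continue
        if path.endswith(("_a.csv", "_b.csv")):  # legacy test artifacts
            continue
        m = DATE_RE.search(path)
        if m is None:
            continue
        date = m.group(0)
        if date in keyval:
            print(f"WARNING: duplicate date key {date}; last file wins")
        keyval[date] = path   # date -> filepath
        valkey[path] = date   # filepath -> date
    return sorted(keyval), keyval, valkey
```
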

Stage 1: Loop Control

more_to_do()

FUNCTION more_to_do() -> Boolean:

    call files_populate_information()

    IF no changes files exist yet:
        return True   (nothing processed, keep going)

    IF all four catalogues (input, flattened, consolidated, changes) contain
       exactly the same dates AND the latest changes date >= latest input date:
        trigger save_multi_csv() to produce an _A/_B split if output exceeds
        1,048,575 data rows (Excel's 1,048,576-row sheet limit minus the header)
        return False  (all drops processed)

    reset per-drop elapsed timer
    return True

get_next_date()

FUNCTION get_next_date() -> date string:

    dates_input   = sorted input dates NOT yet in files['done']
    dates_changes = sorted dates of completed changes files

    IF no flattened files exist yet (first ever run):
        return dates_input[0]   (oldest unprocessed input)

    IF changes files exist:
        high_changes_done = latest completed changes date
        candidates = input dates that are AFTER high_changes_done

        IF candidates exist:
            date_next = earliest candidate
            IF date_next > date_ceiling:  exit
            return date_next

        ELSE (no candidates — all inputs are behind or equal to changes):
            exit ("No more input files to process")

    ELSE (flattened files exist but no changes files yet):
        IF first input date already has a flattened file:
            copy that flattened file into dir_compared as a FLATFILE
            add cell_edits=0, status='', changes='' columns
            update ever_covid and stats for this date
            return dates_input[1]   (skip first, use it as baseline for compare)

        ELSE:
            return dates_input[0]

Stage 2: Input Extraction — open_files(date)

PROCEDURE open_files(date):

    call files_populate_information()

    IF date is already in the consolidated catalogue:
        clear working directory
        recreate working directory
        write date marker file into working directory
        return True  (nothing to extract, consolidation already done)

    IF date is already in the flattened catalogue:
        return True  (skip extraction, already fully processed)

    IF date not found in input catalogue:
        exit with error

    source = files['input']['keyval'][date]

    IF source is a CSV file:
        clear working directory
        copy all input files for this date into working directory

    ELSE IF source is a ZIP file:
        clear working directory
        open ZIP (using zipfile_deflate64 for deflate64 compression support)
        FOR each file inside the ZIP:
            IF filename matches year pattern (YYYY...) OR starts with "nond":
                extract to working directory
        IF test mode with files_limit set:
            remove extracted files not matching files_limit filter

    write date marker file into working directory
    return True on success, False if ZIP could not be opened or extracted
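The member-selection rule inside the ZIP branch can be sketched as below. The stdlib `zipfile` module is used here so the example is self-contained; the real script imports `zipfile_deflate64` instead, because CDC drops may use Deflate64 compression that stdlib `zipfile` cannot decompress. The function name `select_vaers_members` is illustrative.

```python
import io
import zipfile  # the script itself uses zipfile_deflate64 for Deflate64 support

def select_vaers_members(zip_bytes):
    """Return the member names open_files() would extract: names starting
    with a year (e.g. 2021VAERSDATA.csv) or with 'nond' (non-domestic files)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [name for name in zf.namelist()
                if name[:4].isdigit() or name.lower().startswith("nond")]
```
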

Stage 3: Consolidation — consolidate(date)

Combines the three raw VAERS file types into one file. Output still has multiple rows per VAERS_ID (one per dose).

PROCEDURE consolidate(date):

    call files_populate_information()

    IF date already in consolidated catalogue:
        print "already done", clear df_vax global, RETURN

    IF date already in flattened catalogue:
        print "skipping, flattened exists", RETURN

    --- Load raw files ---

    df_vax      = concatenate all *VAERSVAX.csv     files from working directory
    df_data_all = concatenate all *VAERSDATA.csv    files from working directory
    df_syms     = concatenate all *VAERSSYMPTOMS.csv files from working directory

    IF vids_limit set (test mode):
        filter all three dataframes to only those VAERS_IDs

    --- Identify earliest COVID VAERS_ID ---

    df_vax_covid = rows in df_vax where VAX_TYPE contains "covid" (case-insensitive)

    IF df_vax_covid is not empty:
        covid_earliest_vaers_id_now = minimum VAERS_ID in df_vax_covid
        IF covid_earliest_vaers_id not yet set:
            set covid_earliest_vaers_id = covid_earliest_vaers_id_now
        ELSE IF new minimum is lower than current:
            update covid_earliest_vaers_id to the lower value
    ELSE:
        print "no covid found in this drop"
        remove this date from input catalogue (cheap trick to skip it)
        RETURN

    --- Remove pre-COVID records ---

    filter df_data_all, df_vax, df_syms to VAERS_ID >= covid_earliest_vaers_id
    print count removed and count remaining

    update ever_any tracking dict with all VAERS_IDs in df_data_all
    call do_never_ever() to track IDs not yet seen

    --- Identify COVID reports ---

    Strategy: a report is "covid" if it meets ANY of:
        (a) Single-dose report (doses == 1) AND that dose is VAX_TYPE containing "COVID"
        (b) Multi-dose report (doses >= 2) AND at least one dose is VAX_TYPE "COVID"
        (c) VAERS_ID not in (a) or (b) but SYMPTOM_TEXT contains "Pfizer" or "Moderna"
            or "Janssen" AND also contains "Covid"

    vids_all_covid_list = union of (a+b) and (c) VAERS_IDs
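The three-part test collapses neatly in pandas, since (a) and (b) together amount to "at least one dose row has a COVID VAX_TYPE". A minimal sketch, assuming the raw column names `VAX_TYPE` and `SYMPTOM_TEXT` (the function name `covid_vaers_ids` is illustrative):

```python
import pandas as pd

def covid_vaers_ids(df_vax, df_data):
    """Return the sorted union of VAERS_IDs matching rules (a)+(b) and (c)."""
    # (a)+(b): any dose row whose VAX_TYPE contains "covid"
    is_covid_vax = df_vax["VAX_TYPE"].str.contains("covid", case=False, na=False)
    vids_vax = set(df_vax.loc[is_covid_vax, "VAERS_ID"])

    # (c): narrative names a COVID manufacturer AND contains "Covid"
    text = df_data["SYMPTOM_TEXT"].fillna("")
    manu = text.str.contains("Pfizer|Moderna|Janssen", case=False)
    covid_word = text.str.contains("covid", case=False)
    vids_text = set(df_data.loc[manu & covid_word, "VAERS_ID"]) - vids_vax

    return sorted(vids_vax | vids_text)
```
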

    --- Filter to COVID-only ---

    df_data = df_data_all filtered to vids_all_covid_list
    df_vax  = df_vax       filtered to vids_all_covid_list
    df_syms = df_syms      filtered to vids_all_covid_list

    df_data = df_data.drop_duplicates(subset='VAERS_ID')
        [one dedup only — the second redundant call was removed, see fixes.md Fix 1]

    --- Shorten VAX_NAME and VAX_MANU field values ---

    Apply regex replacements to df_vax:
        VAX_NAME: remove parentheses, collapse "COVID19 COVID19 ..." to "C19 ...",
                  shorten manufacturer strings (Pfizer-BionT, Moderna, Janssen, etc.)
        VAX_MANU: shorten to readable forms (Unknown, Pfizer-BionT, Moderna, etc.)

    Sort df_vax by [VAERS_ID, VAX_LOT, VAX_SITE, VAX_DOSE_SERIES, VAX_TYPE,
                    VAX_MANU, VAX_ROUTE, VAX_NAME]
        (sorting prevents row-order differences between drops from appearing as changes)

    --- Merge DATA into VAX ---

    df_data_vax = df_vax LEFT JOIN df_data using VAERS_ID
        (result has multiple rows per VAERS_ID — one per dose/lot)
    fill NaN with empty string

    --- Aggregate symptoms ---

    call symptoms_file_entries_append_to_symptom_text(df_syms):

        PROCEDURE symptoms_file_entries_append_to_symptom_text(df_symfile):

            keep only VAERS_ID and SYMPTOM1..SYMPTOM5 columns (drop SYMPTOMVERSION cols)

            join all five symptom columns into single string using "_|_" delimiter:
                symptom_entries = sort(SYMPTOM1..5) joined by "_|_"

            GROUP BY VAERS_ID, join all symptom_entries rows with "_|_"
                (handles reports with more than 5 symptoms spanning multiple rows)

            clean up consecutive delimiters from empty symptom columns
            ensure string starts and ends with "_|_"

            IF any symptom_entries exceed 32,720 characters (Excel cell limit):
                truncate and append "[truncated, Excel cell size limit 32,767]"

            return df_syms_flat  (one row per VAERS_ID)
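The core of that aggregation (joining SYMPTOM1..5 across multiple rows into one `_|_`-delimited string per VAERS_ID) can be sketched as follows. The cleanup and Excel-truncation steps are omitted, and `flatten_symptoms` is an illustrative name:

```python
import pandas as pd

DELIM = "_|_"

def flatten_symptoms(df_symfile):
    """Collapse SYMPTOM1..SYMPTOM5 across rows into one delimited string
    per VAERS_ID, sorted within each row, wrapped in leading/trailing DELIM."""
    cols = ["SYMPTOM1", "SYMPTOM2", "SYMPTOM3", "SYMPTOM4", "SYMPTOM5"]

    def join_row(row):
        # drop empty cells, sort the remaining symptom terms
        vals = sorted(v for v in row if isinstance(v, str) and v)
        return DELIM.join(vals)

    df = df_symfile.copy()
    df["symptom_entries"] = df[cols].apply(join_row, axis=1)
    return (df.groupby("VAERS_ID")["symptom_entries"]
              .apply(lambda s: DELIM + DELIM.join(v for v in s if v) + DELIM)
              .reset_index())
```
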

    --- Final consolidated merge ---

    df_consolidated = df_data_vax LEFT JOIN df_syms_flat using VAERS_ID
    fill NaN with empty string

    write to: 1_vaers_consolidated/DATE_VAERS_CONSOLIDATED.csv
        (multiple rows per VAERS_ID is expected and correct at this stage)

Stage 4: Flattening — flatten(date)

Reduces to exactly one row per VAERS_ID by aggregating all dose/lot fields.

PROCEDURE flatten(date):

    IF covid_earliest_vaers_id not yet set:
        mark date as done, RETURN

    call files_populate_information()

    IF date already in flattened catalogue:
        print "already done", RETURN

    --- Source data: in-memory or from consolidated file ---

    IF df_vax global is empty (consolidate did not just run in this session):
        read from saved consolidated file for this date

        df_vax      = consolidated[VAX columns]
        df_data     = consolidated[all other columns].drop_duplicates(VAERS_ID)
            [dedup immediately — consolidated file has N rows per VAERS_ID,
             must reduce to 1 before merge; see fixes.md Fix 2]
        remove symptom_entries from df_data (added back separately)
        df_syms_flat = consolidated[VAERS_ID, symptom_entries].drop_duplicates(VAERS_ID)
            [same reason — see fixes.md Fix 2]

    ELSE:
        df_vax, df_data, df_syms_flat are already in memory from consolidate()
        df_data is already 1 row per VAERS_ID

    --- Aggregate VAX fields to one row per VAERS_ID ---

    df_vax_flat = df_vax.groupby(VAERS_ID).agg(list).map("|".join)
        Each VAX field becomes a pipe-delimited string of all values:
            e.g. VAX_LOT = "EK9231|EL0140" for a two-dose report
    drop_duplicates on full row as safety net
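The groupby/pipe-join step can be sketched directly in pandas (`flatten_vax` is an illustrative name):

```python
import pandas as pd

def flatten_vax(df_vax):
    """One row per VAERS_ID: every VAX column becomes a '|'-delimited
    string of that report's per-dose values."""
    return (df_vax.groupby("VAERS_ID", as_index=False)
                  .agg(lambda s: "|".join(s.astype(str))))
```

Because df_vax was sorted before this step, the pipe-joined values come out in a stable order, which is what keeps row-order noise out of the later compare.
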

    --- Merge flattened VAX with DATA ---

    df_data_vax_flat = df_vax_flat LEFT JOIN df_data using VAERS_ID
        (should be 1:1 since both are now one row per VAERS_ID)
    drop_duplicates on full row as safety net

    --- Merge symptom_entries ---

    df_final = df_data_vax_flat LEFT JOIN df_syms_flat using VAERS_ID
    fill NaN with empty string
        (some VAERS_IDs have no entry in the symptoms file; this is expected)
    drop_duplicates on full row as safety net

    --- Deduplicate repeat sentences in SYMPTOM_TEXT ---

    call symptoms_dedupe_repeat_sentences(df_final):

        FOR each row:
            call symptoms_dedupe_repeat_sentences_each(VAERS_ID, SYMPTOM_TEXT):

                detect the most prevalent delimiter in the text:
                    candidates: ". "  "|"  "; "  " - "
                    winner = delimiter appearing >= 5 times
                IF no prevalent delimiter found: return text unchanged

                split text on delimiter into list_in
                IF no duplicates in list_in: return text unchanged

                FOR each segment in list_in:
                    IF segment already seen in this text AND len(segment) > 40:
                        replace with placeholder "`^`"
                        record byte count saved
                    ELSE:
                        keep segment

                update dedupe stats (count, bytes, max_bytes, which VAERS_ID)
                return cleaned text
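A minimal sketch of the per-row dedupe, without the stats bookkeeping (`dedupe_repeat_sentences` is an illustrative name; thresholds follow the description above):

```python
def dedupe_repeat_sentences(text, min_len=40, min_delim_count=5):
    """Replace repeated long segments with the '`^`' placeholder the script uses."""
    # detect the prevalent delimiter: first candidate appearing >= 5 times
    for delim in (". ", "|", "; ", " - "):
        if text.count(delim) >= min_delim_count:
            break
    else:
        return text  # no prevalent delimiter: leave text unchanged

    seen, out = set(), []
    for seg in text.split(delim):
        if seg in seen and len(seg) > min_len:
            out.append("`^`")   # repeated long segment: placeholder
        else:
            out.append(seg)
            seen.add(seg)
    return delim.join(out)
```
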

    write to: 3_vaers_flattened/DATE_VAERS_FLATTENED.csv
        (one row per VAERS_ID)

    store result in global df_flat_2 for use by compare() without re-reading disk

Stage 5: Comparison — compare(date)

Diffs the newly flattened file against the previous one. Tracks field-level changes, deletions, restorations, and new reports. Produces the cumulative output FLATFILE.

PROCEDURE compare(date_currently):

    IF covid_earliest_vaers_id not yet set: RETURN

    call files_populate_information()

    file_flattened_previous = most recent flattened file with date < date_currently
    file_flattened_working  = flattened file for date_currently

    --- First-drop special case ---

    IF no previous flattened file exists:
        copy current flattened to 2_vaers_full_compared/DATE_VAERS_FLATFILE.csv
        add columns: cell_edits=0, status='', changes=''
        update ever_covid and stats
        RETURN

    --- Load previous and current flat files ---

    IF df_flat_1 global is populated (carried over from previous loop iteration):
        df_flat_prv = df_flat_1   (avoid re-reading disk)
    ELSE:
        df_flat_prv = read file_flattened_previous from disk

    IF df_flat_2 global is populated (just produced by flatten()):
        df_flat_new = df_flat_2   (avoid re-reading disk)
    ELSE:
        df_flat_new = read file_flattened_working from disk

    ensure df_flat_prv has columns: cell_edits, status, changes
        (absent on first compare when sourced directly from a flattened file)
    sort df_flat_prv by [cell_edits, status, changes] descending
        (reports with most edits appear first)

    --- Build tracking union ---

    df_both_flat_inputs = concat(df_flat_prv, df_flat_new).drop_duplicates(VAERS_ID)
        used at the end to verify no records are lost

    --- Initialise df_edits ---

    df_edits        = copy of df_flat_new
        (all new records; will be modified in place with carried-over tracking data)
    df_changes_done = empty dataframe
        (accumulates records that need no further processing this pass)

    --- Carry forward change tracking from previous ---

    find VAERS_IDs present in both df_flat_prv and df_edits
    from df_flat_prv, find those with non-zero cell_edits or non-empty changes/status

    FOR those VAERS_IDs:
        copy cell_edits, changes, status from df_flat_prv into df_edits

    --- Absorb records only in previous (deleted candidates) ---

    rows in df_flat_prv NOT in df_edits (df_flat_new) are appended to df_edits
        (they may be truly deleted or just absent from this drop temporarily)

    --- IDENTICAL records ---

    df_merged = OUTER JOIN df_flat_prv and df_flat_new on all columns
    df_identical = rows where _merge == 'both'
        (field-for-field identical in prv and new)

    move identical rows to df_changes_done
    remove their VAERS_IDs from both df_flat_prv and df_flat_new
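The identical-record detection maps onto pandas' `merge(..., indicator=True)`: merging on every column flags field-for-field identical rows as `'both'`. A minimal sketch (`split_identical` is an illustrative name):

```python
import pandas as pd

def split_identical(df_prv, df_new):
    """Outer-merge on all columns; rows flagged 'both' are identical in
    prv and new and need no further comparison."""
    merged = df_prv.merge(df_new, how="outer", indicator=True)
    return merged[merged["_merge"] == "both"].drop(columns="_merge")
```
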

    --- DELETED records ---

    list_prv_not_in_new = VAERS_IDs in df_flat_prv but NOT in df_flat_new

    FOR each:
        IF already marked "Deleted YYYY-MM-DD" in status (trailing spaces marker):
            note as "deleted_prior" — already tracked
        ELSE:
            append "Deleted DATE_CURRENTLY    " to status
            add len(columns_vaers) to cell_edits (bulk penalty for deletion)
            record in stats['deleted']

    move all prv_not_in_new rows to df_changes_done
    remove their VAERS_IDs from df_flat_prv

    --- RESTORED records ---

    df_restored = rows in df_edits whose VAERS_ID is in df_flat_new
                  AND whose status contains "Deleted YYYY-MM-DD    " (trailing spaces)

    FOR each restored record:
        append "Restored DATE_CURRENTLY    " to status
        add len(columns_vaers) to cell_edits
        move to df_changes_done
        remove from df_flat_prv
        record in stats['restored']

    --- NEW records (VAERS_ID higher than max in previous) ---

    list_flat_new_gt = VAERS_IDs in df_flat_new > max(df_flat_prv.VAERS_ID)
        these are brand-new reports above the previous maximum ID

    --- DELAYED / GAPFILL records ---

    list_flat_new_lt = VAERS_IDs in df_flat_new < max(df_flat_prv.VAERS_ID)
    list_gapfills    = lt-IDs that are NOT in df_flat_prv AND NOT in restored list
        (IDs whose VAERS_ID falls within the previously seen range but only
         appear now — reports that were published late / throttled by CDC)

    FOR each gapfill:
        append "Delayed DATE_CURRENTLY    " to status
        record in stats['gapfill']
    move gapfills to df_changes_done
    remove from df_flat_new

    --- Remaining brand-new records (gt and any other new-only) ---

    rows in df_flat_new NOT in df_flat_prv:
        move directly to df_changes_done (no compare needed, first appearance)
        remove from df_flat_new

    --- MODIFIED records ---

    At this point df_flat_prv and df_flat_new contain only VAERS_IDs present in both,
    with the same count of rows. These are candidates for field-level changes.

    sort both by VAERS_ID, reset index so positional comparison is valid
    align df_flat_new columns to match df_flat_prv column order

    df_all_changed = df_flat_prv.compare(df_flat_new)
    cols_changed_list = unique column names that differ (excluding cell_edits, status, changes)
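The alignment and diff step corresponds to pandas' `DataFrame.compare()`, which requires identically labeled rows and columns and returns only the differing cells under a (column, self/other) MultiIndex. A minimal sketch (`changed_columns` is an illustrative name):

```python
import pandas as pd

def changed_columns(df_prv, df_new):
    """Sort both frames by VAERS_ID, align column order, and return the
    names of columns containing at least one changed cell."""
    df_prv = df_prv.sort_values("VAERS_ID").reset_index(drop=True)
    df_new = df_new.sort_values("VAERS_ID").reset_index(drop=True)[df_prv.columns]
    diff = df_prv.compare(df_new)  # columns: MultiIndex (column, self/other)
    return sorted(diff.columns.get_level_values(0).unique())
```
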

    FOR each changed column col:

        df_slice_prv = df_flat_prv[VAERS_ID, col]
        df_slice_new = df_flat_new[VAERS_ID, col]

        vids_changed = VAERS_IDs where col value differs between prv and new

        --- Bulk blanking shortcut ---

        IF >= 200 records in this column were non-empty in prv but empty in new:
            note in changes column in bulk without per-record iteration
            record in stats['cells_emptied']
            remove these from df_three_columns (handled separately)

        --- Trivial change filter ---

        IF col is a date column:
            remove rows where values are identical after stripping leading zeros
            (e.g. "12/03/2020" and "12/3/2020" treated as identical)
        ELSE:
            remove rows where values are identical after stripping all non-alphanumeric
            characters (ignores punctuation changes, case, whitespace)

        record count of trivial changes ignored in stats

        IF no non-trivial changes remain: CONTINUE to next column
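The trivial-change filter can be sketched as a pair of normalizations (`is_trivial_change` is an illustrative name; the case-insensitive comparison follows the parenthetical above):

```python
import re

def is_trivial_change(prv, new, is_date_col=False):
    """True if the change would be ignored: date columns compare with
    leading zeros stripped; other columns compare only the alphanumeric
    characters, case-insensitively."""
    if is_date_col:
        strip = lambda s: re.sub(r"\b0(\d)", r"\1", s)  # 12/03 -> 12/3
        return strip(prv) == strip(new)
    clean = lambda s: re.sub(r"[^0-9a-z]", "", s.lower())
    return clean(prv) == clean(new)
```
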

        --- Group identical changes ---

        group remaining changes by (prv_value, new_value) pairs
            so that many VAERS_IDs sharing the same change are processed once
            and the result distributed to all of them

        FOR each unique (prv_value, new_value) group:

            vids_list = all VAERS_IDs in this group
            val_prv   = prv_value
            val_new   = new_value

            classify the change:
                newly_cut      = val_prv non-empty AND val_new empty AND no "cut_" marker in prv
                continuing_cut = val_prv contains "cut_" AND val_new is empty
                restored       = val_prv contains "cut_" AND val_new is non-empty
                                 AND val_prv starts with val_new

            --- Diff context reduction ---

            IF both val_prv and val_new are non-empty:
                call diff_context(val_prv, val_new, col) to reduce to just the differing part:

                    FUNCTION diff_context(prv, new, col):
                        simplify both strings (remove unicode, collapse whitespace)

                        IF identical after stripping non-alphanumeric: return ('', '')

                        IF delimiter-separated content (symptom_entries or detected delimited string):
                            split both on delimiter
                            keep only items unique to each side
                            IF remaining is identical after stripping non-alphanumeric: return ('', '')
                            IF order is only difference (sorted equal): return ('', '')

                        WHILE up to 10 iterations:
                            IF either string is empty: return (a, b)
                            IF equal case-insensitively: return ('', '')
                            IF word-order is only difference: return ('', '')

                            find longest common substring of words
                            IF common substring > 6 characters:
                                replace it with " .. " in both strings
                                keep only words unique to each side
                            ELSE:
                                stop iterating

                        remove consecutive spaces, strip leading/trailing
                        IF identical after stripping non-alphanumeric: return ('', '')
                        return (reduced_prv, reduced_new)
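A simplified sketch of the reduction loop, using `difflib.SequenceMatcher` to find the longest shared run (the real function works word-wise and has extra delimiter handling; `diff_context` here keeps only the shape of the iteration):

```python
from difflib import SequenceMatcher

def diff_context(prv, new, min_common=6):
    """Repeatedly cut the longest shared run (> min_common chars) out of
    both strings, leaving ' .. ' markers, until only differences remain."""
    a, b = " ".join(prv.split()), " ".join(new.split())
    for _ in range(10):
        if not a or not b:
            return a, b
        if a.lower() == b.lower():
            return "", ""
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        if m.size <= min_common:
            break
        a = (a[:m.a] + " .. " + a[m.a + m.size:]).strip()
        b = (b[:m.b] + " .. " + b[m.b + m.size:]).strip()
    collapse = lambda s: " ".join(s.split())
    return collapse(a), collapse(b)
```
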

            FOR each vid in vids_list:

                determine val_to_keep (what to store in that field going forward):
                    IF newly_cut:
                        val_to_keep = original_val + " cut_DATE_CURRENTLY   "
                            (preserves deleted content with date stamp)
                        the_changes_note = "col cut_DATE_CURRENTLY    "
                        edits_add = 1
                    ELIF continuing_cut:
                        val_to_keep = val_prv_original (preserve existing cut tag)
                        edits_add = 0
                    ELIF both empty after diff: edits_add = 0
                    ELSE:
                        val_to_keep = val_new_original
                        the_changes_note = "col DATE: prv <> new    "
                        edits_add = 1

                IF edits_add:
                    df_edits[vid].cell_edits += 1
                IF the_changes_note:
                    df_edits[vid].changes   += the_changes_note
                IF val_to_keep:
                    df_edits[vid][col]       = val_to_keep

    --- Finalise df_edits ---

    df_edits = df_edits.drop_duplicates(subset='VAERS_ID', keep='last')
        [applied to df_edits directly, not a discarded copy; see fixes.md Fix 5]
    len_edits = count of rows in deduplicated df_edits

    move entire df_edits to df_changes_done

    df_changes_done = drop_duplicates on all columns (keep='last')
    df_changes_done = drop_duplicates on VAERS_ID     (keep='last')
        (two-pass: first remove fully identical rows, then enforce one row per VAERS_ID)

    fill NaN with empty string
    sort by [cell_edits, status, changes] descending
        (most-edited reports appear at top of output file)

    --- Add dose count metadata ---

    df_doses = rows where VAX_LOT contains "|" (multi-dose reports)
    df_doses['doses'] = count of "|" characters in VAX_LOT
        (pipe count = number of additional doses; +1 for the first dose)
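Since a flattened VAX_LOT holds one pipe-joined value per dose, the dose count is just the pipe count plus one. A minimal sketch (`add_dose_count` is an illustrative name):

```python
import pandas as pd

def add_dose_count(df):
    """Select multi-dose reports (VAX_LOT contains '|') and derive the
    dose count from the pipe count."""
    multi = df[df["VAX_LOT"].str.contains(r"\|", na=False)].copy()
    multi["doses"] = multi["VAX_LOT"].str.count(r"\|") + 1
    return multi
```
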

    --- Write output ---

    write df_changes_done to:
        2_vaers_full_compared/DATE_VAERS_FLATFILE.csv

    IF output exceeds 1,048,575 rows (Excel sheet limit):
        also write _A and _B split files via save_multi_csv()

    --- Carry forward for next loop iteration ---

    df_flat_1 = df_changes_done  (used as df_flat_prv in next compare())

    --- Tracking and verification ---

    call do_never_ever(all VAERS_IDs from df_both_flat_inputs, date, source)
    call do_ever_covid(all VAERS_IDs from df_both_flat_inputs)
    call stats_resolve(date)
    call verify_all_reports_present(df_both_flat_inputs, df_changes_done)

Stage 6: Final Output — create_final_merged_file()

PROCEDURE create_final_merged_file():

    scan 2_vaers_full_compared/ for *_VAERS_FLATFILE.csv files
    IF none found, try *_VAERS_FLATFILE_A.csv

    sort by filename (date is prefix, so lexicographic sort = chronological)
    latest_file = last in sorted list

    read latest_file into df_final

    write df_final to VAERS_FINAL_MERGED.csv
        (simple copy — no additional processing)

    print summary:
        total records
        total cell_edits sum (if column present)
        count of records with cell_edits > 0
        count of records with "Deleted" in status

Persistent Tracking Functions

do_ever_covid(vids_list)

PROCEDURE do_ever_covid(vids_list):

    load ever_published_covid.txt if it exists
        parse as list of integers, store in ever_covid dict

    find vids_new = vids in vids_list NOT already in ever_covid

    IF vids_new is non-empty:
        update ever_covid dict with all vids in vids_list
        write sorted ever_covid keys back to ever_published_covid.txt
            one VAERS_ID per line

stats_initialize(date) / stats_resolve(date)

PROCEDURE stats_initialize(date):
    reset stats dict for this drop with zero counters:
        drop_input_covid, comparisons, deleted, restored, modified,
        lo_ever, hi_ever, dedupe_count, dedupe_reports, dedupe_bytes,
        dedupe_max_bytes, dedupe_max_vid, gapfill, cells_edited,
        cells_emptied, trivial_changes_ignored
    columns counter = Counter()

PROCEDURE stats_resolve(date):
    load stats.csv if it exists
    remove any existing row for this date and any existing "All" totals row
    append new row for this date
    compute "All" totals row:
        columns starting with "lo_" -> minimum across all drops
        columns starting with "hi_" -> maximum across all drops
        all other numeric columns   -> sum across all drops
    append totals row
    write stats.csv

Data Type and Quality Functions

types_set(df)

PROCEDURE types_set(df):
    IF VAERS_ID column present:
        coerce to numeric, fill NaN with 0, cast to int64
    IF cell_edits column present:
        coerce to numeric, fill NaN with 0, cast to int64
    FOR each of [AGE_YRS, CAGE_YR, CAGE_MO, NUMDAYS, HOSPDAYS]:
        IF column present: coerce to numeric (leave NaN as NaN)

fix_date_format(df)

PROCEDURE fix_date_format(df):
    skip if df has no VAERS_ID column or has a 'gapfill' column
    FOR each date column [DATEDIED, VAX_DATE, RPT_DATE, RECVDATE, TODAYS_DATE, ONSET_DATE]:
        IF any values contain "/" (MM/DD/YYYY format):
            strip any trailing " cut_..." markers
            parse with pd.to_datetime, reformat to YYYY-MM-DD
            fill NaN with empty string
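For one column, the reformat step can be sketched as follows (`normalize_date_column` is an illustrative name; unparseable or empty cells come back as empty strings):

```python
import pandas as pd

def normalize_date_column(series):
    """Strip any trailing ' cut_...' marker, then rewrite MM/DD/YYYY
    values as YYYY-MM-DD; empty/unparseable cells become ''."""
    s = series.str.replace(r"\s*cut_.*$", "", regex=True)
    parsed = pd.to_datetime(s, format="%m/%d/%Y", errors="coerce")
    return parsed.dt.strftime("%Y-%m-%d").fillna("")
```
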

drop_dupes(df)

PROCEDURE drop_dupes(df):
    record len_before
    df = drop_duplicates on ALL columns, keep='last'
    IF any were dropped: print warning "SHOULD NOT HAVE HAPPENED"

check_dupe_vaers_id(df)

FUNCTION check_dupe_vaers_id(df) -> int:
    count VAERS_IDs appearing more than once
    IF any duplicates: print warning with line number, count, and sample IDs
    return 1 if duplicates exist, 0 otherwise

verify_all_reports_present(df_input, df_output)

PROCEDURE verify_all_reports_present(df_input, df_output):
    find VAERS_IDs in df_input NOT in df_output -> print error if any (orphans)
    find VAERS_IDs in df_output NOT in df_input -> print error if any (extras)
    IF orphans found: append them to df_output and write orphans.csv
    IF extras found: write extras.csv
    cross-check df_output against ever_covid tracking dict:
        report IDs in output not in ever_covid
        report IDs in ever_covid not in output

Column Structure of Output Files

Consolidated file (1_vaers_consolidated/)

Multiple rows per VAERS_ID. All raw VAERS columns plus symptom_entries.

Flattened file (3_vaers_flattened/)

One row per VAERS_ID. VAX fields are pipe-delimited strings. Columns: all from VAERSDATA + all from VAERSVAX (pipe-joined) + symptom_entries.

FLATFILE / Final output (2_vaers_full_compared/)

One row per VAERS_ID. All flattened columns plus three tracking columns prepended:

Column     | Type    | Description
cell_edits | integer | Cumulative count of non-trivial field changes across all drops
status     | string  | Space-separated event markers: "Deleted YYYY-MM-DD", "Restored YYYY-MM-DD", "Delayed YYYY-MM-DD"
changes    | string  | Concatenated change notes: "COLUMN DATE: old_value <> new_value" per changed field

Records are sorted by cell_edits descending so the most-amended reports appear first.


Key Conditions and Decision Points Summary

Condition | Where checked | Effect
Drop already consolidated | consolidate() entry | Skip consolidation entirely
Drop already flattened | consolidate() and flatten() entry | Skip that stage
No COVID reports in drop | consolidate() after VAX filter | Remove date from input catalogue, skip
VAERS_ID < covid_earliest_vaers_id | consolidate() | Record excluded from all processing
VAERS_ID in both prv and new, all fields identical | compare() identical block | Moved to done, no change note
VAERS_ID only in prv, no "Deleted" marker | compare() deleted block | Mark Deleted, add cell_edits penalty
VAERS_ID only in prv, already marked "Deleted" | compare() deleted block | Carry forward as-is
VAERS_ID in new with "Deleted" marker in prv | compare() restored block | Mark Restored, add cell_edits penalty
VAERS_ID in new > max(prv VAERS_ID) | compare() gt block | Brand new, move to done, no compare
VAERS_ID in new < max(prv VAERS_ID) but not in prv | compare() gapfill block | Mark Delayed, move to done
Field change is only punctuation/case/whitespace | compare() per-column | Counted as trivial, skipped
>= 200 records blanked in one column | compare() bulk blanking | Noted in bulk, not per-record
Field blanked (prv non-empty, new empty) | compare() per-record | Value preserved with "cut_DATE" suffix
Field previously cut, still empty | compare() per-record | Carry cut value forward, no new note
Field previously cut, now restored | compare() per-record | New value kept, "[restored]" noted
SYMPTOM_TEXT segment > 40 chars appears twice | symptoms_dedupe... | Second occurrence replaced with "`^`"
symptom_entries string > 32,720 chars | symptoms_file_entries... | Truncated with Excel limit notice
Output rows > 1,048,575 | save_multi_csv() | Written as _A and _B split files
ZIP uses deflate64 compression | files_from_zip() | Requires zipfile_deflate64 library
Date key collision in directory scan | files_populate_information() | Error logged, last file wins

Original Author: admin
