Questioning Everything Propaganda


VAERS Script — Fixes Applied

Fix 1 — Removed Redundant Double drop_duplicates on df_data

Location: consolidate(), formerly lines 2027–2030. Issue ref: errors.md, Issue 1.

The second drop_duplicates(subset='VAERS_ID') call on df_data was an exact repeat of the call four lines above it, with only a print statement between them. Nothing in between could have reintroduced duplicates, so the second call was a guaranteed no-op on every run. On large datasets this wastes measurable compute time scanning the full dataframe a second time.

Removed:

len_before = len(df_data)
df_data    = df_data.drop_duplicates(subset='VAERS_ID')
if len(df_data) - len_before:
    print(f'{(len_before - len(df_data)):>10} duplicates dropped in df_data on VAERS_IDs, expected none')
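The no-op is easy to see in a minimal sketch (the frame below is illustrative, not the script's real data): once drop_duplicates(subset='VAERS_ID') has run, an immediate repeat must scan the whole frame again but can never remove another row.

```python
import pandas as pd

# Hypothetical frame with one duplicated VAERS_ID (values are illustrative).
df_data = pd.DataFrame({'VAERS_ID': ['1001', '1001', '1002'],
                        'AGE_YRS':  ['45',   '45',   '60']})

df_data = df_data.drop_duplicates(subset='VAERS_ID')   # 3 rows -> 2 rows
n_after_first = len(df_data)

# An immediate repeat rescans the frame but cannot drop anything further.
df_data = df_data.drop_duplicates(subset='VAERS_ID')
assert len(df_data) == n_after_first                   # guaranteed no-op
```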

Fix 2 — Dedup df_data and df_syms_flat at Extraction from Consolidated File

Location: flatten(), pull_vax_records path, ~line 2177. Issue ref: errors.md, Issue 2.

When flatten() runs without a prior in-memory consolidate() (i.e. when resuming from a saved consolidated file), it was pulling df_data and df_syms_flat directly from the consolidated CSV, which intentionally stores N rows per VAERS_ID — one per dose/lot. Merging that N-row df_data with df_vax_flat (already reduced to 1 row per VAERS_ID via groupby) causes the merge to re-expand to N rows per VAERS_ID. Two drop_duplicates calls downstream then had to clean up the inflation.

The in-memory path (when consolidate just ran) did not have this problem because df_data was already 1 row per VAERS_ID at that point. The two paths behaved differently with no indication.

Before:

df_data      = df_consolidated[columns_data].astype(str)
df_syms_flat = df_consolidated.astype(str)[['VAERS_ID', 'symptom_entries']]

After:

df_data      = df_consolidated[columns_data].drop_duplicates(subset='VAERS_ID').astype(str)
df_syms_flat = df_consolidated[['VAERS_ID', 'symptom_entries']].drop_duplicates(subset='VAERS_ID').astype(str)

Both paths now produce df_data and df_syms_flat with exactly 1 row per VAERS_ID before any merges occur. The downstream drop_duplicates calls are now true safety nets rather than load-bearing corrections.
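The re-expansion mechanics can be sketched with toy frames (column names and values below are illustrative stand-ins for the real consolidated data): merging an N-row-per-key frame with a 1-row-per-key frame inflates the result back to N rows per key, while deduplicating first keeps the merge at 1 row per VAERS_ID.

```python
import pandas as pd

# df_data as read back from the consolidated CSV: N rows per VAERS_ID,
# one per dose/lot.
df_data = pd.DataFrame({'VAERS_ID': ['1001', '1001', '1002'],
                        'AGE_YRS':  ['45',   '45',   '60']})

# df_vax_flat, already reduced to 1 row per VAERS_ID via groupby.
df_vax_flat = pd.DataFrame({'VAERS_ID': ['1001', '1002'],
                            'VAX_TYPE': ['COVID19', 'FLU']})

# Merging the raw N-row frame re-expands to N rows per VAERS_ID.
inflated = df_data.merge(df_vax_flat, on='VAERS_ID')

# Deduplicating at extraction keeps the merge at 1 row per VAERS_ID.
fixed = (df_data.drop_duplicates(subset='VAERS_ID')
                .merge(df_vax_flat, on='VAERS_ID'))
```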


Fix 3 — Removed Redundant Third Glob Pattern in validate_dirs_and_files()

Location: validate_dirs_and_files(), formerly lines 654–656. Issue ref: errors.md, Issue 3.

The input file discovery used three glob patterns:

input_files = glob.glob(f"{dir_input}/**/*.zip", recursive=True) + \
              glob.glob(f"{dir_input}/**/*.csv", recursive=True) + \
              glob.glob(f"{dir_input}/**/VAERS*.csv", recursive=True)  # removed

Every file matching VAERS*.csv also matches **/*.csv, so the third pattern produced a subset of the second. Every VAERS CSV file was included in input_files twice, making the Found N input files count reported to the user roughly double the real number for CSV-based drops. The redundant pattern was removed:

input_files = glob.glob(f"{dir_input}/**/*.zip", recursive=True) + \
              glob.glob(f"{dir_input}/**/*.csv", recursive=True)

This does not affect which files are actually processed — only the count displayed.
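A quick self-contained check of the overlap (using a throwaway temp directory and made-up file names, not the script's real drop layout) shows the double-counting: the VAERS*.csv pattern only ever rematches files the *.csv pattern already found.

```python
import glob
import os
import tempfile

# Throwaway input directory with illustrative file names.
dir_input = tempfile.mkdtemp()
for name in ('2021VAERSDATA.csv', 'VAERSDATA.csv', 'drop.zip'):
    open(os.path.join(dir_input, name), 'w').close()

# Original triple-pattern discovery: VAERS*.csv is a subset of *.csv,
# so every file matching it is counted twice.
triple = glob.glob(f"{dir_input}/**/*.zip", recursive=True) + \
         glob.glob(f"{dir_input}/**/*.csv", recursive=True) + \
         glob.glob(f"{dir_input}/**/VAERS*.csv", recursive=True)

# Fixed two-pattern discovery: each file appears exactly once.
double = glob.glob(f"{dir_input}/**/*.zip", recursive=True) + \
         glob.glob(f"{dir_input}/**/*.csv", recursive=True)

# triple holds 4 entries for only 3 distinct files; double holds 3.
```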


Fix 4 — Surface Silent File Loss on keyval Date Key Collision

Location: files_populate_information(), ~line 716. Issue ref: errors.md, Issue 4.

The date-to-filename mapping was built with a dict comprehension:

files[thing]['keyval'] = {date_from_filename(x): x for x in full}

If two files in the same directory share the same date string in their name, the comprehension silently overwrites the first entry with the second. The sorted(set(...)) on the date list suppresses any IndexError, so the lost file leaves no trace in the output. This is most likely to occur in test scenarios or malformed drop directories but is undetectable when it happens.

After: the dict is built iteratively so any collision triggers an error() call, which logs to both stdout and the errors summary printed at the end of the run:

keyval = {}
for x in full:
    date_key = date_from_filename(x)
    if date_key in keyval:
        error(f"Date key collision in '{thing}': {date_key} matches both "
              f"'{os.path.basename(keyval[date_key])}' and '{os.path.basename(x)}'. "
              f"Using {os.path.basename(x)}.")
    keyval[date_key] = x
files[thing]['keyval'] = keyval

The last file wins (consistent with the original dict comprehension behaviour), but now the collision is visible in the run log and the errors summary.
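The silent overwrite is inherent to the dict comprehension, as a minimal sketch shows (date_from_filename below is a simplified stand-in for the script's real parser, and error() is replaced by a collision list for demonstration):

```python
import re

def date_from_filename(path):
    # Simplified stand-in for the script's real date parser.
    return re.search(r'\d{4}-\d{2}-\d{2}', path).group()

full = ['drops/2024-01-15_a.csv', 'drops/2024-01-15_b.csv']

# Original comprehension: the second file silently replaces the first.
keyval = {date_from_filename(x): x for x in full}

# Iterative build: the collision is recorded before the overwrite
# (the script calls error() here instead of appending to a list).
collisions = []
keyval_iter = {}
for x in full:
    date_key = date_from_filename(x)
    if date_key in keyval_iter:
        collisions.append((date_key, keyval_iter[date_key], x))
    keyval_iter[date_key] = x
```

Both builds end with the same last-file-wins mapping; the iterative one just makes the loss observable.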


Fix 5 — Apply df_edits Dedup to the Data, Not Just the Count

Location: compare(), ~line 2903. Issue ref: errors.md, Issue 5.

A deduplicated view of df_edits was created solely to compute a row count for the len_edits stat, but the deduplicated result was then discarded:

df_edits_unique = df_edits.drop_duplicates(subset=columns_vaers)
len_edits += len(df_edits_unique)   # only the count was used
...
df_edits, df_changes_done = move_rows(df_edits, df_edits, df_changes_done)  # original df_edits

This meant the len_edits stat reflected the deduplicated count while the actual data flowing into df_changes_done was the non-deduplicated df_edits. Two later drop_duplicates calls on df_changes_done caught the result, but the intermediate state was inconsistent.

After: the dedup is applied directly to df_edits (using keep='last' to preserve the most recently accumulated cell_edits/changes values), and len_edits is derived from that same cleaned frame:

len_before_edits = len(df_edits)
df_edits = df_edits.drop_duplicates(subset='VAERS_ID', keep='last')
if len_before_edits - len(df_edits):
    print(f'{(len_before_edits - len(df_edits)):>10} duplicates removed from df_edits before final move')
len_edits += len(df_edits)

The stat and the data now reflect the same deduplicated state. The dedup key was also changed from subset=columns_vaers (a 40-column list) to subset='VAERS_ID' (the natural unique key for a flat file), which is both faster and semantically correct — a VAERS_ID should appear at most once in df_edits at this point in the pipeline.
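The keep='last' behaviour can be sketched with a toy df_edits (column names and values are illustrative): when a VAERS_ID was accumulated twice, the later row carries the most recent edit state, and that is the row the dedup preserves.

```python
import pandas as pd

# Toy df_edits: VAERS_ID 1001 was accumulated twice; the later row
# carries the more recent cell_edits value.
df_edits = pd.DataFrame({'VAERS_ID':   ['1001', '1002', '1001'],
                         'cell_edits': [1, 1, 2]})

# Dedup on the natural unique key, keeping the last-accumulated row.
df_edits = df_edits.drop_duplicates(subset='VAERS_ID', keep='last')

# The stat is derived from the same cleaned frame that flows onward.
len_edits = len(df_edits)
```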


Original Author: admin


Page History (1 revision): 2026-03-05 17:01:35