drop_duplicates on df_data
Location: consolidate(), formerly lines 2027–2030
Issue ref: errors.md Issue 1
The second drop_duplicates(subset='VAERS_ID') call on df_data was an exact
repeat of the call four lines above it, with only a print statement between them.
Nothing in between could have reintroduced duplicates, so the second call was a
guaranteed no-op on every run. On large datasets this wastes measurable compute
time scanning the full dataframe a second time.
Removed:
    len_before = len(df_data)
    df_data = df_data.drop_duplicates(subset='VAERS_ID')
    if len(df_data) - len_before:
        print(f'{(len_before - len(df_data)):>10} duplicates dropped in df_data on VAERS_IDs, expected none')
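The guarantee is easy to see in a minimal sketch (toy VAERS_ID values, not real data): once the frame has been deduplicated on VAERS_ID, an immediate second call cannot remove anything.

```python
import pandas as pd

# Toy frame with one duplicated VAERS_ID (hypothetical values).
df_data = pd.DataFrame({'VAERS_ID': ['100001', '100002', '100001'],
                        'STATE':    ['CA', 'TX', 'CA']})

df_data = df_data.drop_duplicates(subset='VAERS_ID')  # 3 rows -> 2 rows
len_after_first = len(df_data)

# Nothing mutates df_data in between, so the repeat is a guaranteed no-op:
df_data = df_data.drop_duplicates(subset='VAERS_ID')
assert len(df_data) == len_after_first
```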
df_data and df_syms_flat at Extraction from Consolidated File
Location: flatten(), pull_vax_records path, ~line 2177
Issue ref: errors.md Issue 2
When flatten() runs without a prior in-memory consolidate() (i.e. when resuming
from a saved consolidated file), it was pulling df_data and df_syms_flat directly
from the consolidated CSV, which intentionally stores N rows per VAERS_ID — one per
dose/lot. Merging that N-row df_data with df_vax_flat (already reduced to 1 row
per VAERS_ID via groupby) causes the merge to re-expand to N rows per VAERS_ID.
Two drop_duplicates calls downstream then had to clean up the inflation.
The in-memory path (when consolidate() had just run) did not have this problem
because df_data was already 1 row per VAERS_ID at that point, so the two paths
behaved differently with nothing to signal the divergence.
Before:
    df_data = df_consolidated[columns_data].astype(str)
    df_syms_flat = df_consolidated.astype(str)[['VAERS_ID', 'symptom_entries']]
After:
    df_data = df_consolidated[columns_data].drop_duplicates(subset='VAERS_ID').astype(str)
    df_syms_flat = df_consolidated[['VAERS_ID', 'symptom_entries']].drop_duplicates(subset='VAERS_ID').astype(str)
Both paths now produce df_data and df_syms_flat with exactly 1 row per VAERS_ID
before any merges occur. The downstream drop_duplicates calls are now true safety
nets rather than load-bearing corrections.
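A minimal sketch of the re-expansion, using toy frames in place of the real consolidated data:

```python
import pandas as pd

# df_data as read back from the consolidated CSV: N rows per VAERS_ID
# (one per dose/lot). Values are hypothetical.
df_data = pd.DataFrame({'VAERS_ID': ['1', '1', '2'],
                        'AGE_YRS':  ['40', '40', '55']})
# df_vax_flat already reduced to 1 row per VAERS_ID via groupby.
df_vax_flat = pd.DataFrame({'VAERS_ID':  ['1', '2'],
                            'VAX_DOSES': ['2', '1']})

# Merging the N-row frame re-expands the result to N rows per VAERS_ID:
inflated = df_data.merge(df_vax_flat, on='VAERS_ID')
# Deduplicating before the merge keeps it at 1 row per VAERS_ID:
fixed = df_data.drop_duplicates(subset='VAERS_ID').merge(df_vax_flat, on='VAERS_ID')
```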
Redundant Glob Pattern in validate_dirs_and_files()
Location: validate_dirs_and_files(), formerly lines 654–656
Issue ref: errors.md Issue 3
The input file discovery used three glob patterns:
    input_files = glob.glob(f"{dir_input}/**/*.zip", recursive=True) + \
                  glob.glob(f"{dir_input}/**/*.csv", recursive=True) + \
                  glob.glob(f"{dir_input}/**/VAERS*.csv", recursive=True)  # removed
Every file matching VAERS*.csv also matches **/*.csv, so the third pattern
produced a subset of the second: every VAERS CSV file landed in input_files
twice, making the 'Found N input files' count reported to the user roughly
double the real number for CSV-based drops. The redundant pattern was removed:
    input_files = glob.glob(f"{dir_input}/**/*.zip", recursive=True) + \
                  glob.glob(f"{dir_input}/**/*.csv", recursive=True)
This does not affect which files are actually processed — only the count displayed.
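The subset relationship is easy to demonstrate with a throwaway directory; the filenames below are hypothetical stand-ins for a real drop:

```python
import glob
import os
import tempfile

# Recreate a small drop directory (hypothetical filenames).
dir_input = tempfile.mkdtemp()
for name in ('VAERSDATA.csv', 'VAERSSYMPTOMS.csv', 'notes.csv'):
    open(os.path.join(dir_input, name), 'w').close()

all_csv = glob.glob(f"{dir_input}/**/*.csv", recursive=True)        # all 3 CSVs
vaers_csv = glob.glob(f"{dir_input}/**/VAERS*.csv", recursive=True)  # 2 of the same 3

# The third pattern's matches are a strict subset of the second's...
assert set(vaers_csv) <= set(all_csv)
# ...so concatenating both lists counts each VAERS CSV twice:
combined = all_csv + vaers_csv
```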
keyval Date Key Collision
Location: files_populate_information(), ~line 716
Issue ref: errors.md Issue 4
The date-to-filename mapping was built with a dict comprehension:
    files[thing]['keyval'] = {date_from_filename(x): x for x in full}
If two files in the same directory share the same date string in their name, the
comprehension silently overwrites the first entry with the second. The sorted(set(...))
on the date list suppresses any IndexError, so the lost file leaves no trace in the
output. This is most likely to occur in test scenarios or malformed drop directories
but is undetectable when it happens.
After: the dict is built iteratively so any collision triggers an error() call,
which logs to both stdout and the errors summary printed at the end of the run:
    keyval = {}
    for x in full:
        date_key = date_from_filename(x)
        if date_key in keyval:
            error(f"Date key collision in '{thing}': {date_key} matches both "
                  f"'{os.path.basename(keyval[date_key])}' and '{os.path.basename(x)}'. "
                  f"Using {os.path.basename(x)}.")
        keyval[date_key] = x
    files[thing]['keyval'] = keyval
The last file wins (consistent with the original dict comprehension behaviour), but now the collision is visible in the run log and the errors summary.
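The silent overwrite in the original comprehension is easy to reproduce; date_from_filename below is a simplified stand-in for the real helper, and the filenames are hypothetical:

```python
import os

def date_from_filename(path):
    # Stand-in for the real helper (assumption): take the leading date token.
    return os.path.basename(path).split('_')[0]

# Two hypothetical files in one directory sharing the same date string.
full = ['/drop/2023-01-06_VAERSDATA.csv', '/drop/2023-01-06_VAERSDATA(1).csv']

# Original comprehension: the second file silently replaces the first,
# leaving no trace of the collision.
keyval = {date_from_filename(x): x for x in full}
```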
df_edits Dedup to the Data, Not Just the Count
Location: compare(), ~line 2903
Issue ref: errors.md Issue 5
A deduplicated view of df_edits was created solely to compute a row count for
the len_edits stat, but the deduplicated result was then discarded:
    df_edits_unique = df_edits.drop_duplicates(subset=columns_vaers)
    len_edits += len(df_edits_unique)  # only the count was used
    ...
    df_edits, df_changes_done = move_rows(df_edits, df_edits, df_changes_done)  # original df_edits
This meant the len_edits stat reflected the deduplicated count while the actual
data flowing into df_changes_done was the non-deduplicated df_edits. Two later
drop_duplicates calls on df_changes_done caught the result, but the intermediate
state was inconsistent.
After: the dedup is applied directly to df_edits (using keep='last' to
preserve the most recently accumulated cell_edits/changes values), and len_edits
is derived from that same cleaned frame:
    len_before_edits = len(df_edits)
    df_edits = df_edits.drop_duplicates(subset='VAERS_ID', keep='last')
    if len_before_edits - len(df_edits):
        print(f'{(len_before_edits - len(df_edits)):>10} duplicates removed from df_edits before final move')
    len_edits += len(df_edits)
The stat and the data now reflect the same deduplicated state. The dedup key was
also changed from subset=columns_vaers (a 40-column list) to subset='VAERS_ID'
(the natural unique key for a flat file), which is both faster and semantically correct
— a VAERS_ID should appear at most once in df_edits at this point in the pipeline.
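A minimal sketch of the keep='last' behaviour, with toy values standing in for the real edit rows:

```python
import pandas as pd

# df_edits accumulates rows per VAERS_ID; later rows carry the newest
# cell_edits values (toy data, hypothetical column contents).
df_edits = pd.DataFrame({'VAERS_ID':   ['1', '1', '2'],
                         'cell_edits': ['stale', 'latest', 'only']})

# keep='last' retains the most recently accumulated row for each VAERS_ID:
df_edits = df_edits.drop_duplicates(subset='VAERS_ID', keep='last')
len_edits = len(df_edits)  # the stat is derived from the same cleaned frame
```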