Date: 2026-06-12 (analysis) / 2026-06-13 (row counts + documentation)
Directory: /home/pagetelegram/Server/RAIDER/vaers/VAERS/Gary
Reference outputs:
VAERS_FINAL_MERGED-gary.csv: produced with vaers-orig.py (Gary Hawkins' original logic)
VAERS_FINAL_MERGED-jason.csv: produced with vaers35.py
The two FINAL_MERGED CSVs (first column = VAERS_ID) were compared by:
cell_edits rows (the natural "first N" after the sort_values(['cell_edits', 'status', 'changes'], ascending=False) + set_columns_order that precedes every *_VAERS_FLATFILE.csv write)
Key factual results:
Jason has more total rows but a dramatically smaller file because long text fields (especially SYMPTOM_TEXT) are missing/blank on the high-edit reports that dominate the head of the file.
In the top 200 edited rows (head of both files):
SYMPTOM_TEXT content (>5 chars)
SYMPTOM_TEXT content
Of the ~66 common VIDs in the highest-edit overlap where Gary had content: Jason had '' (blank) in 35/35 cases. Several of the same VIDs also show slightly lower cell_edits counts and missing "Delayed ..." gapfill prefixes in status.
Column order (header + physical layout) also differs, making raw positional row diffs look like total data corruption until aligned by name.
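The name-aligned comparison mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the script that was actually used for the comparison: the helper names load_by_key and diff_by_name are hypothetical, and the two in-memory files hold made-up values.

```python
import csv
import io


def load_by_key(handle, key="VAERS_ID", limit=None):
    """Read a CSV into {key value: row dict}, keeping at most `limit` rows."""
    rows = {}
    for i, row in enumerate(csv.DictReader(handle)):
        if limit is not None and i >= limit:
            break
        rows[row[key]] = row
    return rows


def diff_by_name(rows_a, rows_b):
    """Yield (key, column, value_a, value_b) for cells that differ, matched by column name."""
    for vid in sorted(rows_a.keys() & rows_b.keys()):
        shared = rows_a[vid].keys() & rows_b[vid].keys()
        for col in sorted(shared):
            if rows_a[vid][col] != rows_b[vid][col]:
                yield vid, col, rows_a[vid][col], rows_b[vid][col]


# Made-up stand-ins for the two files: same report, different column order.
gary = io.StringIO("VAERS_ID,DIED,SPLTTYPE,SYMPTOM_TEXT\n1,,X1,Headache for two days\n")
jason = io.StringIO("VAERS_ID,SPLTTYPE,DIED,SYMPTOM_TEXT\n1,X1,,\n")

diffs = list(diff_by_name(load_by_key(gary), load_by_key(jason)))
print(diffs)  # [('1', 'SYMPTOM_TEXT', 'Headache for two days', '')]
```

Because rows are matched on VAERS_ID and cells on column name, the swapped DIED/SPLTTYPE positions produce no false differences; only the blank SYMPTOM_TEXT is reported. For the real files, open them with newline='' and pass a limit such as 200.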
The root cause is in vaers35.py: aggressive but incomplete parallel refactors (especially the deduplication step in flatten(), the set_columns_order preference list, and supporting parallel paths in compare()) introduced data loss for report-level text fields and small divergences in edit classification / carry-forward logic.
vaers36.py was created as the corrected version that preserves data fidelity while retaining the performance enhancements.
VAERS_FINAL_MERGED-gary.csv: 2.4 GB (ls -lh), 1,535,553 lines (pure-Python csv + line iteration, ~380 s)
VAERS_FINAL_MERGED-jason.csv: 676 MB, 1,691,667 lines (~85 s)
The size inversion despite Jason having ~156 k more rows is explained by the systematic absence of long SYMPTOM_TEXT (and related history) on the most-edited records.
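One caveat when reproducing these counts: physical line counts and CSV record counts differ whenever a quoted field contains a newline. The sketch below counts both with the standard library. The function name count_lines_and_records is hypothetical and the sample data is made up; it is not the exact counting code behind the numbers above.

```python
import csv
import io


def count_lines_and_records(handle):
    """Return (physical_lines, csv_records) for a seekable text handle."""
    physical = sum(1 for _ in handle)
    handle.seek(0)
    # csv.reader joins quoted fields that span several physical lines.
    records = sum(1 for _ in csv.reader(handle))
    return physical, records


# Made-up sample: one SYMPTOM_TEXT value contains an embedded newline.
sample = io.StringIO('VAERS_ID,SYMPTOM_TEXT\n1,"line one\nline two"\n2,short\n')
counts = count_lines_and_records(sample)
print(counts)  # (4, 3): four physical lines, three records including the header
```

For files on disk, open with newline='' so the csv module sees the original line endings. Very long text fields may also require raising the limit with csv.field_size_limit().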
Gary header prefix (after the shared metadata):
VAERS_ID,cell_edits,status,changes,AGE_YRS,SEX,STATE,DIED,SPLTTYPE,SYMPTOM_TEXT,...
Jason header prefix:
VAERS_ID,cell_edits,status,changes,AGE_YRS,SEX,STATE,SPLTTYPE,DIED,... (VAX_* fields also promoted earlier; SYMPTOM_TEXT later in the preferred list).
The header order is emitted by set_columns_order(), which is called immediately before the final write_to_csv of each per-drop *_VAERS_FLATFILE.csv; create_final_merged_file then copies that file for the "latest" date.
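The effect of a preference list on the header can be shown in a few lines. This is a sketch of the general technique only: order_columns is a hypothetical helper, not the body of set_columns_order(), and VAX_TYPE stands in for the VAX_* fields.

```python
def order_columns(columns, preferred):
    """Return `columns` with names from `preferred` first (in that order), the rest unchanged."""
    present = set(columns)
    head = [c for c in preferred if c in present]
    head_set = set(head)
    tail = [c for c in columns if c not in head_set]
    return head + tail


# Preference list taken from the Gary header prefix quoted above.
GARY_PREFIX = ["VAERS_ID", "cell_edits", "status", "changes", "AGE_YRS",
               "SEX", "STATE", "DIED", "SPLTTYPE", "SYMPTOM_TEXT"]

# A Jason-style layout: SPLTTYPE before DIED, SYMPTOM_TEXT after a VAX_* field.
jason_cols = ["VAERS_ID", "cell_edits", "status", "changes", "AGE_YRS",
              "SEX", "STATE", "SPLTTYPE", "DIED", "VAX_TYPE", "SYMPTOM_TEXT"]

reordered = order_columns(jason_cols, GARY_PREFIX)
print(reordered)
```

With a pandas DataFrame the same list can be applied as df[order_columns(list(df.columns), GARY_PREFIX)], which changes only the physical layout, not any cell values.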
cell_edits and status often match on the very highest-edit examples (e.g. 126 edits, same delete/restore dates).
SYMPTOM_TEXT present in Gary, '' in Jason for the affected records.
cell_edits deltas (e.g. 88 vs 85, 87 vs 86), changes text differences, occasional missing "Delayed ..." token in status.
SYMPTOM_TEXT.
cell_edits 50–126 range) and "Deleted ... Restored ..." status: exactly the reports that exercised bulk blanking, deduplication, restore paths, and subsequent "identical this drop" carry-forward.
Other columns (VAX_* fields, demographics, most flags/dates) largely matched when name-aligned, confirming the inputs were similar but the processing pipelines diverged on long-text preservation and some metadata.
vaers35.py is an ambitious parallelized enhancement (multi-core Pools/ThreadPoolExecutor, chunked I/O, numpy globals for workers, sub-chunking for huge columns like SYMPTOM_TEXT, fast .equals() + worker-based change detection, interactive/argparse mode, etc.). The high-level structure (consolidate → flatten → compare → final from last FLATFILE) and many comments/TODOs were preserved, but several hot paths were reimplemented.
Primary sources of the observed data loss and divergence:
Parallel deduplication in flatten() (main cause of blank SYMPTOM_TEXT)
symptoms_dedupe_repeat_sentences() (called unconditionally at the end of every flatten, before writing the FLATTENED file that feeds all later stages).
Pool(imap(_dedupe_single_for_pool, pairs)), builds content_map, then _data['SYMPTOM_TEXT'] = _data['VAERS_ID'].map(content_map) with no fallback.
fillna('').
vaers-orig.py: simple apply(..., axis=1) calling the each() function directly; every row always receives a string (original or cleaned).
Different set_columns_order preference list
SPLTTYPE before DIED, lists many VAX_* columns early, and places SYMPTOM_TEXT late.
Parallel / refactored paths in compare() and supporting flatten logic
logical_and.reduce instead of the original pandas merge(..., indicator=True, how='outer'). Can affect which rows keep "current" data from df_flat_new vs. prior slices.
_build_col_arg_global + subchunking + _compare_col_worker + reassembly of row_updates/bulk_blank, followed by a separate vectorized apply. Bulk-blank handling only touches metadata + changes note (data cell stays as the new '' from the base). The val_to_keep / cut_ / restore logic for text columns has more surface area for divergence.
flatten: ThreadPool split of consolidated data with drop_duplicates(subset='VAERS_ID') (order-dependent "first encountered"), parallel _agg_vax_chunk, _pool_astype_str + _merge_parallel (array_split + per-chunk merge + concat), later syms attach. Any of these can affect which value for a report-level column ends up in the one-row-per-VID flat, especially for VIDs with multi-dose VAX entries.
types_set differences, do_never_ever stub, gapfill .max() after transforms, etc.
Config / accumulation differences
date_floor in orig ('2016-02-13') vs. vaers35 default ('2020-12-13' for covid or user-supplied).
*_VAERS_FLATFILE.csv written at the very end of compare() (after identical/deleted/restored/gapfill moves + cell-by-cell edits on the remainder) is what becomes the FINAL. Any loss at that stage propagates.
The net effect on the provided CSVs: Jason's output has more reports but has lost the actual text content on the very records that the cell_edits sort brings to the front, while also using a different column layout.
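The map-without-fallback mechanism named as the main cause can be reproduced in miniature with pandas. The data below is made up, and the assumption that a worker result is missing for one report is purely illustrative; the snippet shows how the mechanism loses text, not which reports were affected in vaers35.py.

```python
import pandas as pd

# Made-up reports; 102 has text that needs no cleaning.
df = pd.DataFrame({
    "VAERS_ID": [101, 102, 103],
    "SYMPTOM_TEXT": ["Fever. Fever. Chills.", "Headache for two days.", "Rash on arm."],
})

# Assumption for illustration: the workers returned results for only two reports.
content_map = {101: "Fever. Chills.", 103: "Rash on arm."}

mapped = df["VAERS_ID"].map(content_map)  # a missing key becomes NaN
blanked = mapped.fillna("")               # NaN becomes '', so 102's text is gone

print(mapped.isna().tolist())  # [False, True, False]
print(blanked.tolist())        # ['Fever. Chills.', '', 'Rash on arm.']
```

Series.map with a dict never raises on a missing key, so the loss is silent unless the result is compared against the pre-map column.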
vaers36.py was created (cp vaers35.py vaers36.py) and then surgically edited for data fidelity. The goal was to keep the useful performance / UX enhancements while eliminating the mechanisms that produced missing long-text fields and inconsistent layout.
Specific changes (documented with comments in the source):
Module docstring — updated to describe vaers36 as the data-fidelity-corrected edition and to note the primary fixes.
set_columns_order() (exact match to vaers-orig.py)
cols_order list with the original short list that places DIED before SPLTTYPE, SYMPTOM_TEXT early, etc.
*_VAERS_FLATFILE.csv and FINAL_MERGED files will have the same physical column order and header as the Gary reference outputs.
Dedupe safety (the main data-loss guard)
_dedupe_single_for_pool(): added outer try/except; on any error or when a cleaned result would be empty but the original input had content, return the original content (with 0 replacements).
symptoms_dedupe_repeat_sentences() (the parallel wrapper): capture pre_sym before the map; after map(content_map), detect VIDs that went blank/NaN but had prior content and restore from the pre series; explicit fillna('') only for truly empty cases; added a print when any rescues occur.
Safer per-VID collapse in flatten() split
df_data and df_syms_flat from the consolidated file: changed the data/syms lambdas from drop_duplicates(subset='VAERS_ID') (order-dependent first-encountered) to groupby('VAERS_ID', sort=False, as_index=False).first().
Minor cleanup
check_dupe_vaers_id(df_vax_flat) → the just-built df_data_vax_flat after the data_vax merge).
Verification
python3 -m py_compile vaers36.py succeeds cleanly.
The result: when vaers36.py is used to process drops, the one-row-per-VID FLATTENED files (and therefore all downstream FLATFILEs and FINAL_MERGED outputs) will preserve the report-level text values (SYMPTOM_TEXT etc.) the same way the original reference logic does. Column layout will also match the Gary outputs for easy diffing of "first 20" etc.
Secondary parallel paths (identical detection, per-col change workers, vax/sym merges) may still produce small differences in exact cell_edits counts or "Delayed" status strings; those are optimization surfaces. The primary "missing data in fields" problem is closed by the guards above.
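For anyone aligning the identical-detection paths later, the two styles can be compared side by side on a toy example. This is a sketch with made-up frames, not the code from either script. One known way the approaches can disagree: pandas merge matches null keys against each other, while an elementwise NumPy == treats NaN as unequal to NaN.

```python
import numpy as np
import pandas as pd

prev = pd.DataFrame({"VAERS_ID": [1, 2, 3], "SYMPTOM_TEXT": ["a", "b", "c"]})
new = pd.DataFrame({"VAERS_ID": [1, 2, 4], "SYMPTOM_TEXT": ["a", "B", "d"]})

# Original style: outer merge on all shared columns, classified by the indicator.
merged = prev.merge(new, how="outer", indicator=True)
identical = sorted(merged.loc[merged["_merge"] == "both", "VAERS_ID"])
only_prev = sorted(merged.loc[merged["_merge"] == "left_only", "VAERS_ID"])
only_new = sorted(merged.loc[merged["_merge"] == "right_only", "VAERS_ID"])

# Vectorized style: align on VAERS_ID, then AND the per-column equality masks.
a, b = prev.set_index("VAERS_ID"), new.set_index("VAERS_ID")
ids = a.index.intersection(b.index)
cols = [c for c in a.columns if c in b.columns]
mask = np.logical_and.reduce(
    [a.loc[ids, c].to_numpy() == b.loc[ids, c].to_numpy() for c in cols]
)
identical_fast = ids[mask].tolist()

print(identical, identical_fast)  # [1] [1]
```

Running both detectors on the same pair of drops and diffing the resulting VAERS_ID sets is one conservative way to find where cell_edits counts start to diverge.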
# Same interface as vaers35.py
python vaers36.py # interactive setup
python vaers36.py --dataset covid --cores 8
python vaers36.py --dataset full --cores 16 --chunk-size 100000 --date-floor 2020-12-13
python vaers36.py --merge-only # just emit FINAL_MERGED from the latest FLATFILE
After a full run the latest *_VAERS_FLATFILE.csv (and the VAERS_FINAL_MERGED.csv it produces) will have correct data fields and the reference column order.
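For readers scripting around these commands, the flags shown above could be declared with argparse roughly as follows. This is a hypothetical reconstruction from the usage lines only: the types, defaults, and help strings are assumptions, and the real parser in vaers36.py may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the flags shown in the usage lines."""
    parser = argparse.ArgumentParser(prog="vaers36.py")
    parser.add_argument("--dataset", default=None, help="e.g. covid or full")
    parser.add_argument("--cores", type=int, default=None)
    parser.add_argument("--chunk-size", type=int, default=None)
    parser.add_argument("--date-floor", default=None, help="YYYY-MM-DD")
    parser.add_argument("--merge-only", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--dataset", "full", "--cores", "16",
     "--chunk-size", "100000", "--date-floor", "2020-12-13"]
)
print(args.dataset, args.cores, args.chunk_size, args.date_floor, args.merge_only)
```

argparse converts the dashes in --chunk-size, --date-floor, and --merge-only to underscores in the attribute names.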
vaers-orig.py: original reference implementation (Gary)
vaers35.py: enhanced but buggy version (produced the Jason CSV with missing fields)
vaers36.py: corrected edition (recommended for new runs)
changes.md: this document
VAERS_FINAL_MERGED-gary.csv / VAERS_FINAL_MERGED-jason.csv: the two outputs that were compared
backup/: older reference flatfile
symptoms_dedupe_repeat_sentences, _dedupe_single_for_pool, call site in flatten()
set_columns_order, move_column_forward, calls at end of compare()
compare(), identical detection block (~3210), per-col workers + apply (~3442–3573), bulk/row_updates handling, move_rows, restore/gapfill blocks
_merge_parallel + _pool_astype_str + final dedupe
create_final_merged_file(), write at end of compare()
This document plus the inline comments in vaers36.py fully describe the investigation and the corrective changes. Re-running the pipeline with vaers36.py on the same input drops should yield FINAL_MERGED outputs whose data cells (especially SYMPTOM_TEXT on high-history reports) match the fidelity of the Gary/orig reference while still benefiting from the parallel enhancements.
If further alignment of the identical-detection or change-application blocks is desired for bit-for-bit edit-count matching, additional conservative fallbacks can be added in a follow-up pass.