vaers_complete.py is a Python script for processing VAERS (Vaccine Adverse Event Reporting System) data. It features multi-core parallel processing, memory-efficient chunked data handling, and change tracking across CDC data releases.
Original Author: Gary Hawkins - http://univaers.com/download/
Enhanced Version: 2025 by Jason Page
pip install pandas numpy tqdm zipfile-deflate64
python vaers_complete.py [OPTIONS]
--dataset {covid,full}
Default: covid
Selects which dataset to process:
- covid: Process COVID-19 era data only (from 2020-12-13 onwards by default)
- full: Process the full historical VAERS dataset (from 1990-01-01 onwards by default)

Examples:
python vaers_complete.py --dataset covid
python vaers_complete.py --dataset full
--cores NUMBER
Default: Number of CPU cores available on the system
Specifies the number of CPU cores to use for parallel processing.
Examples:
python vaers_complete.py --cores 8
python vaers_complete.py --cores 16
python vaers_complete.py --dataset full --cores 4
--chunk-size NUMBER
Default: 50000
Sets the chunk size for processing large datasets. Larger chunks use more memory but may be faster. Smaller chunks are more memory-efficient.
Examples:
python vaers_complete.py --chunk-size 100000
python vaers_complete.py --chunk-size 25000
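To illustrate the memory trade-off, here is a minimal sketch of chunked processing using only the standard library. The helper name `iter_chunks` is hypothetical and the input is a stand-in range of row numbers; the script itself may chunk its data differently, for example through pandas.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable (hypothetical helper)."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Only one chunk is materialized at a time, so peak memory grows with chunk_size.
sizes = [len(chunk) for chunk in iter_chunks(range(120_000), 50_000)]
print(sizes)  # [50000, 50000, 20000]
```

A larger chunk size means fewer iterations and less per-chunk overhead, at the cost of holding more rows in memory at once.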
--date-floor DATE
Default: 2020-12-13 for the COVID dataset, 1990-01-01 for the full dataset
Sets the earliest date to process (format: YYYY-MM-DD). Records before this date will be excluded.
Examples:
python vaers_complete.py --date-floor 2021-01-01
python vaers_complete.py --dataset full --date-floor 2000-01-01
--date-ceiling DATE
Default: 2025-01-01
Sets the latest date to process (format: YYYY-MM-DD). Records after this date will be excluded.
Examples:
python vaers_complete.py --date-ceiling 2024-12-31
python vaers_complete.py --date-floor 2020-01-01 --date-ceiling 2023-12-31
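The floor and ceiling together define an inclusive date window. The sketch below shows one way such a check could look; the function name `in_date_window` is hypothetical, and which date field the script actually compares is not stated here, so the example takes a generic YYYY-MM-DD string.

```python
from datetime import date


def in_date_window(record_date, floor, ceiling):
    """Return True when floor <= record_date <= ceiling (all YYYY-MM-DD strings)."""
    day = date.fromisoformat(record_date)
    return date.fromisoformat(floor) <= day <= date.fromisoformat(ceiling)


# Dates before the floor or after the ceiling are excluded; the bounds themselves are kept.
print(in_date_window("2021-01-01", "2021-01-01", "2023-12-31"))  # True
print(in_date_window("2020-12-31", "2021-01-01", "2023-12-31"))  # False
```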
--test
Default: Not set
Uses the test cases directory (z_test_cases) instead of the main working directory. Useful for development and testing.
Example:
python vaers_complete.py --test
--no-progress
Default: Not set
Disables progress bars. Useful for logging output to files or when running in environments without terminal support.
Example:
python vaers_complete.py --no-progress > output.log
--merge-only
Default: Not set
Skips all processing and only creates the final merged file from existing processed data. Useful when you want to regenerate the final output without reprocessing everything.
Example:
python vaers_complete.py --merge-only
python vaers_complete.py --dataset covid --cores 8
python vaers_complete.py --dataset full --cores 16 --chunk-size 100000
python vaers_complete.py --dataset covid --date-floor 2021-01-01
python vaers_complete.py --date-floor 2021-01-01 --date-ceiling 2023-12-31 --cores 8
python vaers_complete.py --dataset covid --chunk-size 25000 --cores 4
python vaers_complete.py --merge-only
python vaers_complete.py --test --cores 4
python vaers_complete.py --dataset covid --no-progress > processing.log 2>&1
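For readers who want to see how the documented options fit together, the following is a rough argparse sketch. It is not the script's actual parser: the function names `build_parser` and `parse_args` are hypothetical, and only the flags and defaults listed above are taken from this document.

```python
import argparse
import os


def build_parser():
    """Declare the documented command-line options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="vaers_complete.py")
    parser.add_argument("--dataset", choices=["covid", "full"], default="covid")
    parser.add_argument("--cores", type=int, default=os.cpu_count())
    parser.add_argument("--chunk-size", type=int, default=50000)
    parser.add_argument("--date-floor", default=None)  # resolved per dataset below
    parser.add_argument("--date-ceiling", default="2025-01-01")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--no-progress", action="store_true")
    parser.add_argument("--merge-only", action="store_true")
    return parser


def parse_args(argv=None):
    """Parse argv and fill in the dataset-dependent default for --date-floor."""
    args = build_parser().parse_args(argv)
    if args.date_floor is None:
        args.date_floor = "2020-12-13" if args.dataset == "covid" else "1990-01-01"
    return args


args = parse_args(["--dataset", "full", "--cores", "4"])
print(args.date_floor, args.chunk_size)  # 1990-01-01 50000
```

Leaving `--date-floor` unset until after parsing is one simple way to make its default depend on `--dataset`.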
The script expects and creates the following directory structure:
.
├── 0_VAERS_Downloads/ # Input: Raw VAERS ZIP files from CDC
├── 1_vaers_working/ # Intermediate: Extracted CSV files
├── 1_vaers_consolidated/ # Intermediate: Consolidated data files
├── 2_vaers_full_compared/ # Output: Comparison results with change tracking
├── 3_vaers_flattened/ # Intermediate: Flattened data (one row per VAERS_ID)
├── stats.csv # Output: Processing statistics
├── never_published_any.txt # Output: VAERS IDs never published
├── ever_published_any.txt # Output: All VAERS IDs ever published
├── ever_published_covid.txt # Output: COVID-related VAERS IDs
├── writeups_deduped.txt # Output: Deduplicated symptom descriptions
└── VAERS_FINAL_MERGED.csv # Final output: Complete merged dataset
When using the --test flag:
z_test_cases/
├── drops/ # Input: Test VAERS data
├── 1_vaers_working/
├── 1_vaers_consolidated/
├── 2_vaers_full_compared/
├── 3_vaers_flattened/
└── [output files]
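The two layouts differ only in their base directory and input folder. A small sketch of how the paths could be resolved is shown below; `resolve_layout` and `ensure_layout` are hypothetical names, and the script may organize this differently.

```python
from pathlib import Path

# Intermediate and output stage directories named in the trees above.
STAGE_DIRS = [
    "1_vaers_working",
    "1_vaers_consolidated",
    "2_vaers_full_compared",
    "3_vaers_flattened",
]


def resolve_layout(root, test=False):
    """Return the base, input, and stage directories for normal or --test mode."""
    base = Path(root) / "z_test_cases" if test else Path(root)
    input_dir = base / ("drops" if test else "0_VAERS_Downloads")
    return {"base": base, "input": input_dir, "stages": [base / name for name in STAGE_DIRS]}


def ensure_layout(layout):
    """Create any missing stage directories; existing ones are left untouched."""
    for path in layout["stages"]:
        path.mkdir(parents=True, exist_ok=True)


print(resolve_layout(".", test=True)["input"])
```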
flowchart TD
A([Start]) --> B[Archive previous log\nvaers_processing_START_to_END.log]
B --> C[Parse args / Interactive setup\ndataset · cores · chunk-size · date range]
C --> D{dataset?}
D -->|covid| E[date_floor: 2020-12-13]
D -->|full| F[date_floor: 1990-01-01]
E & F --> G[Validate directories\nand input files]
G -->|fail| Z1([Exit with error])
G -->|ok| H[Scan all date drops\nfiles_populate_information]
H --> I{--merge-only?}
I -->|yes| M
I -->|no| LOOP
subgraph LOOP ["Main loop — one iteration per VAERS date drop"]
J[open_files\nUnzip drop into 1_vaers_working] --> K
K[consolidate\nMerge VAERSDATA + VAERSVAX\n+ VAERSSYMPTOMS → one CSV] --> L
L[flatten\nOne row per VAERS_ID\nAggregate doses · combine symptoms] --> N
N[compare\nDiff vs previous flatfile\nTag: new / modified / deleted] --> O{more\ndrops?}
O -->|yes| J
end
LOOP --> O
O -->|no| M[create_final_merged_file\nLatest compared flatfile\n→ VAERS_FINAL_MERGED.csv]
M --> P[Print error summary\nTotal runtime]
P --> Q[atexit: write end-time\nto vaers_processing.meta\nClose log]
Q --> R([Done])
At startup the script renames the previous vaers_processing.log to a file whose name
encodes the previous run's start and end times:
vaers_processing_20250922T202213_to_20250922T214530.log
A small sidecar file (vaers_processing.meta) is written at startup and updated on exit
to supply these timestamps. All print() output for the current run is mirrored into the
fresh vaers_processing.log via an internal _Tee object.
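The log rotation and output mirroring described above can be sketched as follows. This is an illustration under assumptions, not the script's code: the `Tee` class and `archive_previous_log` function are hypothetical stand-ins, the format of vaers_processing.meta is not documented here (so the timestamps are passed in directly), and an in-memory buffer stands in for the log file.

```python
import contextlib
import io
import os
import sys
import tempfile


class Tee:
    """Write everything to several text streams at once."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()


def archive_previous_log(directory, start, end):
    """Rename vaers_processing.log to vaers_processing_<start>_to_<end>.log, if present."""
    current = os.path.join(directory, "vaers_processing.log")
    if not os.path.exists(current):
        return None
    archived = os.path.join(directory, f"vaers_processing_{start}_to_{end}.log")
    os.rename(current, archived)
    return archived


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "vaers_processing.log"), "w") as fh:
        fh.write("previous run\n")
    archived = archive_previous_log(tmp, "20250922T202213", "20250922T214530")
    archived_name = os.path.basename(archived)

# Mirror print() output into a second stream while still showing it on the console.
buffer = io.StringIO()
with contextlib.redirect_stdout(Tee(sys.stdout, buffer)):
    print("processing drop 1")
```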
Combines the three VAERS data files for each data release in parallel (one Pool worker per file group):
- *VAERSDATA.csv: main report data (demographics, outcomes, narrative)
- *VAERSVAX.csv: vaccination details (one row per dose)
- *VAERSSYMPTOMS.csv: coded MedDRA symptom terms (up to 5 per row)

Coded symptoms are appended to SYMPTOM_TEXT using _|_ as a cell delimiter.
Output: 1_vaers_consolidated/<date>_VAERS_CONSOLIDATED.csv
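A simplified, single-process sketch of this step is shown below, using plain dictionaries instead of CSV files. The function body is an assumption about how the join could work, not the script's implementation: it omits the parallel Pool entirely, and the exact layout of the appended symptom terms within SYMPTOM_TEXT is a guess.

```python
DELIM = "_|_"


def consolidate(data_rows, vax_rows, symptom_rows):
    """Join report, dose, and symptom rows on VAERS_ID (illustrative only)."""
    # Collect coded terms per report; a report may span several symptom rows.
    terms = {}
    for row in symptom_rows:
        for i in range(1, 6):
            term = row.get(f"SYMPTOM{i}", "")
            if term:
                terms.setdefault(row["VAERS_ID"], []).append(term)

    # Group dose rows by report.
    doses = {}
    for row in vax_rows:
        doses.setdefault(row["VAERS_ID"], []).append(row)

    consolidated = []
    for report in data_rows:
        vid = report["VAERS_ID"]
        text = DELIM.join([report.get("SYMPTOM_TEXT", "")] + terms.get(vid, []))
        # One output row per dose; reports without dose rows still produce one row.
        for dose in doses.get(vid, [{}]):
            consolidated.append({**report, **dose, "SYMPTOM_TEXT": text})
    return consolidated


data = [{"VAERS_ID": "1", "SYMPTOM_TEXT": "Sore arm"}]
vax = [{"VAERS_ID": "1", "VAX_TYPE": "COVID19"}, {"VAERS_ID": "1", "VAX_TYPE": "FLU3"}]
symptoms = [{"VAERS_ID": "1", "SYMPTOM1": "Pain", "SYMPTOM2": "Fatigue"}]
rows = consolidate(data, vax, symptoms)
print(rows[0]["SYMPTOM_TEXT"])  # Sore arm_|_Pain_|_Fatigue
```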
Reduces to exactly one row per VAERS_ID because a single adverse event can involve
multiple vaccine doses (multiple rows in the consolidated file):
Per-dose values are aggregated and symptoms are combined into _|_-delimited strings.
Output: 3_vaers_flattened/<date>_VAERS_FLATTENED.csv
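The flattening idea can be sketched with plain dictionaries. The rules used here (keep distinct non-empty values in order of first appearance, then join them with `_|_`) are assumptions made for illustration; the script's actual aggregation rules may differ per column.

```python
DELIM = "_|_"


def flatten(rows, key="VAERS_ID"):
    """Collapse rows sharing the same key into one row per key (illustrative only)."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[key], {})
        for column, value in row.items():
            seen = target.setdefault(column, [])
            # Keep each distinct non-empty value once, in order of first appearance.
            if value and value not in seen:
                seen.append(value)
    return [
        {column: DELIM.join(values) for column, values in columns.items()}
        for columns in merged.values()
    ]


rows = [
    {"VAERS_ID": "1", "SEX": "F", "VAX_TYPE": "COVID19"},
    {"VAERS_ID": "1", "SEX": "F", "VAX_TYPE": "FLU3"},
    {"VAERS_ID": "2", "SEX": "M", "VAX_TYPE": "COVID19"},
]
flat = flatten(rows)
print(flat[0]["VAX_TYPE"])  # COVID19_|_FLU3
```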
Diffs the current and previous flattened files field-by-field for every shared VAERS_ID.
Three tracking columns are added / updated:
| Column | Meaning |
|---|---|
| cell_edits | Count of fields changed since first appearance |
| status | New, Modified, Deleted, or blank |
| changes | Human-readable audit trail: field: old → new |
Output: 2_vaers_full_compared/<date>_VAERS_FLATFILE.csv
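A minimal sketch of such a field-by-field diff is given below, operating on two mappings of VAERS_ID to row. It is an illustration under assumptions: the `; ` separator between audit entries, the carry-forward of earlier edits, and the handling of deleted reports are guesses, not the script's confirmed behavior.

```python
def compare(previous, current):
    """Diff two {VAERS_ID: row} mappings and add tracking columns (illustrative only)."""
    result = {}
    for vid, row in current.items():
        old = previous.get(vid)
        if old is None:
            result[vid] = {**row, "cell_edits": 0, "status": "New", "changes": ""}
            continue
        # Compare every data field of the current row against the previous release.
        diffs = [
            f"{column}: {old.get(column, '')} → {value}"
            for column, value in row.items()
            if old.get(column, "") != value
        ]
        history = [entry for entry in [old.get("changes", "")] + diffs if entry]
        result[vid] = {
            **row,
            "cell_edits": int(old.get("cell_edits", 0)) + len(diffs),
            "status": "Modified" if diffs else "",
            "changes": "; ".join(history),
        }
    # Reports present before but missing now are kept and flagged.
    for vid, old in previous.items():
        if vid not in current:
            result[vid] = {"cell_edits": 0, "changes": "", **old, "status": "Deleted"}
    return result


previous = {
    "1": {"AGE_YRS": "45", "cell_edits": 0, "status": "New", "changes": ""},
    "2": {"AGE_YRS": "30"},
}
current = {"1": {"AGE_YRS": "46"}, "3": {"AGE_YRS": "60"}}
compared = compare(previous, current)
print(compared["1"]["changes"])  # AGE_YRS: 45 → 46
```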
Picks the most-recent _VAERS_FLATFILE.csv and copies it to VAERS_FINAL_MERGED.csv.
Use --merge-only to re-run this step without reprocessing any drops.
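One way to pick and copy the latest file is sketched below. The function name matches the step shown in the flowchart, but the body is an assumption: in particular, it relies on the `<date>` prefix sorting chronologically as plain text (as an ISO-style date would), which this document does not confirm.

```python
import shutil
from pathlib import Path


def create_final_merged_file(compared_dir, destination):
    """Copy the last *_VAERS_FLATFILE.csv (by sorted name) to destination."""
    candidates = sorted(Path(compared_dir).glob("*_VAERS_FLATFILE.csv"))
    if not candidates:
        raise FileNotFoundError(f"no *_VAERS_FLATFILE.csv files in {compared_dir}")
    latest = candidates[-1]  # assumes the date prefix sorts chronologically
    shutil.copyfile(latest, destination)
    return latest
```

Called as `create_final_merged_file("2_vaers_full_compared", "VAERS_FINAL_MERGED.csv")`, this needs only the existing compared files, which is why the step can run on its own.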
VAERS_FINAL_MERGED.csv (tracking columns: cell_edits, status, changes)
stats.csv
never_published_any.txt
ever_published_any.txt
ever_published_covid.txt
writeups_deduped.txt
The final merged file contains the standard VAERS columns plus the tracking columns, including:
- VAERS_ID: Unique report identifier
- AGE_YRS, SEX, STATE: Demographic information
- DIED, L_THREAT, ER_VISIT, HOSPITAL, DISABLE: Serious outcomes
- VAX_TYPE, VAX_MANU, VAX_LOT: Vaccine information
- VAX_DATE, ONSET_DATE, RPT_DATE: Date information
- SYMPTOM_TEXT: Symptom description
- cell_edits: Count of cells modified across all releases
- status: Report status (new, modified, deleted)
- changes: Detailed log of all changes made to the report
- symptom_entries: Aggregated symptom entries

python vaers_complete.py --dataset covid --cores 16 --chunk-size 100000
python vaers_complete.py --dataset covid --cores 4 --chunk-size 25000
python vaers_complete.py --dataset full --cores 16 --chunk-size 50000
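The script includes error handling: it validates directories and input files before processing, exits with an error if validation fails, and prints an error summary at the end of the run.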
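By default (--dataset covid), the script filters to COVID-19 era data, from 2020-12-13 onwards.
With --dataset full, it processes the complete historical VAERS data, from 1990-01-01 onwards.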
The script tracks modifications to VAERS reports across CDC data releases and records them in the cell_edits, status, and changes columns.
Example change tracking entry:
2023-01-15: AGE_YRS changed from "45" to "46"
2023-01-15: SYMPTOM_TEXT appended with "Patient recovered"
- Set --chunk-size to 25000 or lower
- Lower --cores to use fewer parallel processes
- Narrow the date range with --date-floor and --date-ceiling
- Install tqdm: pip install tqdm
- Use --no-progress if progress bars are not needed
- Install zipfile-deflate64: pip install zipfile-deflate64
- Place the raw VAERS ZIP files in the 0_VAERS_Downloads/ directory

Original script by Gary Hawkins (http://univaers.com/download/).
Enhanced version with performance improvements and additional features.
For issues, questions, or contributions, refer to the original source or the repository where this script is maintained.