README VAERS


VAERS Complete - Enhanced Data Processing Script

Overview

vaers_complete.py is a comprehensive Python script for processing VAERS (Vaccine Adverse Event Reporting System) data with advanced features including multi-core parallel processing, memory-efficient chunked data handling, and comprehensive change tracking across CDC data releases.

Original Author: Gary Hawkins - http://univaers.com/download/
Enhanced Version: 2025 by Jason Page

Features

✓ Multi-core parallel processing for faster execution
✓ Memory-efficient chunked data handling for large datasets
✓ Command-line dataset selection (COVID-19 era or full historical data)
✓ Progress bars for all major operations
✓ Comprehensive error tracking and reporting
✓ Fixed statistics functionality
✓ Change detection and tracking across data releases
✓ Deduplication and data consolidation
✓ Complete audit trail of modifications to VAERS reports

Requirements

Python Dependencies

pip install pandas numpy tqdm zipfile-deflate64

pandas: Data manipulation and analysis
numpy: Numerical operations
tqdm: Progress bars (optional but recommended)
zipfile-deflate64: Enhanced ZIP file handling (optional, falls back to standard zipfile)
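
The optional-dependency behaviour described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the no-op `tqdm` stand-in is an assumption, and the `zipfile_deflate64` module name is the import name published by the zipfile-deflate64 package.

```python
# Optional dependencies: fall back gracefully when they are not installed.
try:
    from tqdm import tqdm  # progress bars
except ImportError:
    def tqdm(iterable, **kwargs):
        """Hypothetical no-op stand-in: iterate without drawing a bar."""
        return iterable

try:
    import zipfile_deflate64 as zipfile  # adds Deflate64 support to ZipFile
except ImportError:
    import zipfile  # standard library fallback

# Either way, the rest of the code can use tqdm(...) and zipfile.ZipFile.
for _ in tqdm(range(3)):
    pass
```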

System Requirements

Python 3.x
Multi-core CPU recommended for parallel processing
Sufficient RAM for large dataset processing (16GB+ recommended for full dataset)

Command-Line Options

Basic Syntax

python vaers_complete.py [OPTIONS]

Options Reference

`--dataset {covid,full}`

Default: covid

Selects which dataset to process:

covid: Process COVID-19 era data only (from 2020-12-13 onwards by default)
full: Process full historical VAERS dataset (from 1990-01-01 onwards by default)

Examples:

python vaers_complete.py --dataset covid
python vaers_complete.py --dataset full

`--cores NUMBER`

Default: Number of CPU cores available on system

Specifies the number of CPU cores to use for parallel processing.

Examples:

python vaers_complete.py --cores 8
python vaers_complete.py --cores 16
python vaers_complete.py --dataset full --cores 4

`--chunk-size NUMBER`

Default: 50000

Sets the chunk size for processing large datasets. Larger chunks use more memory but may be faster. Smaller chunks are more memory-efficient.

Examples:

python vaers_complete.py --chunk-size 100000
python vaers_complete.py --chunk-size 25000
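
To illustrate the memory trade-off, here is a minimal standard-library sketch of chunked reading. It is an assumption about the general technique, not the script's own reader (pandas offers the same idea through `read_csv(..., chunksize=...)`); `read_in_chunks` is a hypothetical helper name.

```python
import csv
import io
from itertools import islice

def read_in_chunks(handle, chunk_size=50000):
    """Yield lists of at most chunk_size rows, so only one chunk is in memory."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

# Tiny in-memory demo: five rows read two at a time.
sample = io.StringIO("VAERS_ID,SEX\n1,F\n2,M\n3,F\n4,M\n5,F\n")
chunk_sizes = [len(chunk) for chunk in read_in_chunks(sample, chunk_size=2)]
```

Peak memory grows with `chunk_size` (and with the number of workers holding a chunk at once), which is why lowering either value helps on constrained systems.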

`--date-floor DATE`

Default: 2020-12-13 for COVID dataset, 1990-01-01 for full dataset

Sets the earliest date to process (format: YYYY-MM-DD). Records before this date will be excluded.

Examples:

python vaers_complete.py --date-floor 2021-01-01
python vaers_complete.py --dataset full --date-floor 2000-01-01

`--date-ceiling DATE`

Default: 2025-01-01

Sets the latest date to process (format: YYYY-MM-DD). Records after this date will be excluded.

Examples:

python vaers_complete.py --date-ceiling 2024-12-31
python vaers_complete.py --date-floor 2020-01-01 --date-ceiling 2023-12-31
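
A minimal sketch of the date window, assuming both bounds are inclusive (records strictly before the floor or strictly after the ceiling are dropped). `in_window` is a hypothetical helper; this README does not say which date column the script filters on.

```python
from datetime import date

def in_window(report_date, floor, ceiling):
    """True when floor <= report_date <= ceiling (all YYYY-MM-DD strings)."""
    d = date.fromisoformat(report_date)
    return date.fromisoformat(floor) <= d <= date.fromisoformat(ceiling)
```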

`--test`

Default: Not set

Uses the test-case directory (z_test_cases) instead of the main working directory. Useful for development and testing.

Example:

python vaers_complete.py --test

`--no-progress`

Default: Not set

Disables progress bars. Useful for logging output to files or when running in environments without terminal support.

Example:

python vaers_complete.py --no-progress > output.log

`--merge-only`

Default: Not set

Skips all processing and only creates the final merged file from existing processed data. Useful when you want to regenerate the final output without reprocessing everything.

Example:

python vaers_complete.py --merge-only
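
The options above could be declared with `argparse` roughly as follows. This is a reconstruction from this README only, not the script's actual parser; `build_parser` and `parse_options` are hypothetical names, and resolving the date floor from the dataset after parsing is an assumption.

```python
import argparse
import os

# Default date floors per dataset, as listed in the options reference.
DATE_FLOORS = {"covid": "2020-12-13", "full": "1990-01-01"}

def build_parser():
    p = argparse.ArgumentParser(prog="vaers_complete.py")
    p.add_argument("--dataset", choices=["covid", "full"], default="covid")
    p.add_argument("--cores", type=int, default=os.cpu_count())
    p.add_argument("--chunk-size", type=int, default=50000)
    p.add_argument("--date-floor", default=None)  # resolved after parsing
    p.add_argument("--date-ceiling", default="2025-01-01")
    p.add_argument("--test", action="store_true")
    p.add_argument("--no-progress", action="store_true")
    p.add_argument("--merge-only", action="store_true")
    return p

def parse_options(argv):
    """Parse argv and fill in the dataset-specific date floor if not given."""
    args = build_parser().parse_args(argv)
    if args.date_floor is None:
        args.date_floor = DATE_FLOORS[args.dataset]
    return args
```

For example, `parse_options(["--dataset", "full"])` yields a date floor of 1990-01-01 unless `--date-floor` is passed explicitly.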

Usage Examples

Process COVID-19 data with 8 cores

python vaers_complete.py --dataset covid --cores 8

Process full historical dataset with 16 cores and larger chunks

python vaers_complete.py --dataset full --cores 16 --chunk-size 100000

Process COVID data from a specific start date

python vaers_complete.py --dataset covid --date-floor 2021-01-01

Process data for a specific date range

python vaers_complete.py --date-floor 2021-01-01 --date-ceiling 2023-12-31 --cores 8

Process with smaller chunks for memory-constrained systems

python vaers_complete.py --dataset covid --chunk-size 25000 --cores 4

Create final merged file only

python vaers_complete.py --merge-only

Run with test data

python vaers_complete.py --test --cores 4

Process without progress bars (for logging)

python vaers_complete.py --dataset covid --no-progress > processing.log 2>&1

Directory Structure

The script expects and creates the following directory structure:

.
├── 0_VAERS_Downloads/          # Input: Raw VAERS ZIP files from CDC
├── 1_vaers_working/            # Intermediate: Extracted CSV files
├── 1_vaers_consolidated/       # Intermediate: Consolidated data files
├── 2_vaers_full_compared/      # Output: Comparison results with change tracking
├── 3_vaers_flattened/          # Intermediate: Flattened data (one row per VAERS_ID)
├── stats.csv                   # Output: Processing statistics
├── never_published_any.txt     # Output: VAERS IDs never published
├── ever_published_any.txt      # Output: All VAERS IDs ever published
├── ever_published_covid.txt    # Output: COVID-related VAERS IDs
├── writeups_deduped.txt        # Output: Deduplicated symptom descriptions
└── VAERS_FINAL_MERGED.csv      # Final output: Complete merged dataset

Test Mode Directory Structure

When using the --test flag:

z_test_cases/
├── drops/                      # Input: Test VAERS data
├── 1_vaers_working/
├── 1_vaers_consolidated/
├── 2_vaers_full_compared/
├── 3_vaers_flattened/
└── [output files]

Processing Workflow

Flowchart

flowchart TD
    A([Start]) --> B[Archive previous log\nvaers_processing_START_to_END.log]
    B --> C[Parse args / Interactive setup\ndataset · cores · chunk-size · date range]
    C --> D{dataset?}
    D -->|covid| E[date_floor: 2020-12-13]
    D -->|full|  F[date_floor: 1990-01-01]
    E & F --> G[Validate directories\nand input files]
    G -->|fail| Z1([Exit with error])
    G -->|ok| H[Scan all date drops\nfiles_populate_information]
    H --> I{--merge-only?}
    I -->|yes| M
    I -->|no| LOOP

    subgraph LOOP ["Main loop — one iteration per VAERS date drop"]
        J[open_files\nUnzip drop into 1_vaers_working] --> K
        K[consolidate\nMerge VAERSDATA + VAERSVAX\n+ VAERSSYMPTOMS → one CSV] --> L
        L[flatten\nOne row per VAERS_ID\nAggregate doses · combine symptoms] --> N
        N[compare\nDiff vs previous flatfile\nTag: new / modified / deleted] --> O{more\ndrops?}
        O -->|yes| J
    end

    LOOP --> O
    O -->|no| M[create_final_merged_file\nLatest compared flatfile\n→ VAERS_FINAL_MERGED.csv]
    M --> P[Print error summary\nTotal runtime]
    P --> Q[atexit: write end-time\nto vaers_processing.meta\nClose log]
    Q --> R([Done])

Step-by-step description

1. Log archiving (new)

At startup the script renames the previous vaers_processing.log to a file whose name encodes the previous run's start and end times:

vaers_processing_20250922T202213_to_20250922T214530.log

A small sidecar file (vaers_processing.meta) is written at startup and updated on exit to supply these timestamps. All print() output for the current run is mirrored into the fresh vaers_processing.log via an internal _Tee object.
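
The sketch below shows one way the archive name and the output mirroring could work. It is an illustration under assumptions: the timestamp pattern is inferred from the example file name above, and `Tee` and `archived_log_name` are hypothetical stand-ins for the script's internal `_Tee` object and renaming logic.

```python
import contextlib
import io
import sys
from datetime import datetime

STAMP = "%Y%m%dT%H%M%S"  # inferred from the example file name

def archived_log_name(start, end):
    """Build the archive name from the previous run's start and end times."""
    return f"vaers_processing_{start.strftime(STAMP)}_to_{end.strftime(STAMP)}.log"

class Tee:
    """Mirror every write to several streams (for example console and log file)."""
    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()

# Demo: print() output reaches the console and an in-memory "log" at once.
log_buffer = io.StringIO()
with contextlib.redirect_stdout(Tee(sys.stdout, log_buffer)):
    print("Processing drop")
```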

2. Consolidation

Combines the three VAERS data files for each data release in parallel (one Pool worker per file group):

*VAERSDATA.csv — main report data (demographics, outcomes, narrative)
*VAERSVAX.csv — vaccination details (one row per dose)
*VAERSSYMPTOMS.csv — coded MedDRA symptom terms (up to 5 per row; a report with more terms spans several rows)

Coded symptoms are appended to SYMPTOM_TEXT using _|_ as a cell delimiter.

Output: 1_vaers_consolidated/<date>_VAERS_CONSOLIDATED.csv
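
As a rough illustration of the symptom step, the sketch below appends coded terms to the narrative with the `_|_` delimiter. The `SYMPTOM1` to `SYMPTOM5` column names come from the CDC file layout; `append_symptoms`, the de-duplication of repeated terms, and the exact ordering are assumptions, not the script's confirmed behaviour.

```python
DELIM = "_|_"

def append_symptoms(symptom_text, symptom_rows):
    """Append each distinct coded term from the symptom rows to the narrative."""
    terms = []
    for row in symptom_rows:
        for i in range(1, 6):  # SYMPTOM1 .. SYMPTOM5
            term = row.get(f"SYMPTOM{i}", "")
            if term and term not in terms:
                terms.append(term)
    return DELIM.join([symptom_text] + terms) if terms else symptom_text
```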

3. Flattening

Reduces to exactly one row per VAERS_ID because a single adverse event can involve multiple vaccine doses (multiple rows in the consolidated file):

Aggregates vax columns into _|_-delimited strings
Deduplicates data and symptom rows
Uses multi-core parallel joins for performance

Output: 3_vaers_flattened/<date>_VAERS_FLATTENED.csv
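
The flattening idea can be sketched with plain dictionaries. This is a single-process illustration under assumptions: `flatten` is a hypothetical helper, the three vax columns are only a sample, and the real script uses parallel joins.

```python
DELIM = "_|_"
VAX_COLUMNS = ("VAX_TYPE", "VAX_MANU", "VAX_LOT")  # illustrative subset

def flatten(rows):
    """Collapse consolidated rows to one row per VAERS_ID."""
    grouped = {}
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # drop exact duplicate rows
            continue
        seen.add(key)
        grouped.setdefault(row["VAERS_ID"], []).append(row)

    flat = []
    for group in grouped.values():
        merged = dict(group[0])  # non-vax fields taken from the first row
        for col in VAX_COLUMNS:
            merged[col] = DELIM.join(r.get(col, "") for r in group)
        flat.append(merged)
    return flat
```

A report with two doses therefore ends up with values such as `COVID19_|_FLU4` in VAX_TYPE, keeping the dose order aligned across the vax columns.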

4. Comparison

Diffs the current and previous flattened files field-by-field for every shared VAERS_ID. Three tracking columns are added / updated:

| Column | Meaning |
|---|---|
| `cell_edits` | Count of fields changed since first appearance |
| `status` | `New`, `Modified`, `Deleted`, or blank |
| `changes` | Human-readable audit trail: `field: old → new` |

Output: 2_vaers_full_compared/<date>_VAERS_FLATFILE.csv
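
A minimal sketch of the diff, assuming each flatfile is loaded as a dict keyed by VAERS_ID. `compare_releases` is a hypothetical helper; carrying `cell_edits` and `changes` forward from the previous file and joining log entries with "; " are assumptions about how the cumulative tracking could work.

```python
TRACKING = ("cell_edits", "status", "changes")

def compare_releases(previous, current):
    """previous/current map VAERS_ID -> dict of field values."""
    result = {}
    for vid, row in current.items():
        out = {k: v for k, v in row.items() if k not in TRACKING}
        old = previous.get(vid)
        if old is None:
            out.update(cell_edits=0, status="New", changes="")
        else:
            diffs = [f"{k}: {old.get(k, '')} → {v}"
                     for k, v in out.items() if old.get(k, "") != v]
            out["cell_edits"] = int(old.get("cell_edits", 0)) + len(diffs)
            out["status"] = "Modified" if diffs else ""
            out["changes"] = "; ".join(filter(None, [old.get("changes", "")] + diffs))
        result[vid] = out
    # Reports present before but missing now are kept and tagged as deleted.
    for vid, old in previous.items():
        if vid not in current:
            gone = dict(old)
            gone.setdefault("cell_edits", 0)
            gone.setdefault("changes", "")
            gone["status"] = "Deleted"
            result[vid] = gone
    return result
```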

5. Final Merge

Picks the most-recent _VAERS_FLATFILE.csv and copies it to VAERS_FINAL_MERGED.csv. Use --merge-only to re-run this step without reprocessing any drops.
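
The merge step amounts to "find the newest flatfile and copy it". The sketch below assumes the `<date>` prefix sorts chronologically as text (for example an ISO date); `copy_latest_flatfile` is a hypothetical name, not the script's `create_final_merged_file`.

```python
import shutil
from pathlib import Path

def copy_latest_flatfile(compared_dir, target):
    """Copy the most recent *_VAERS_FLATFILE.csv to target and return its path."""
    candidates = sorted(Path(compared_dir).glob("*_VAERS_FLATFILE.csv"))
    if not candidates:
        raise FileNotFoundError(f"no flatfiles found in {compared_dir}")
    latest = candidates[-1]  # last name in sorted order = newest date prefix
    shutil.copyfile(latest, target)
    return latest
```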

Output Files

Primary Output

VAERS_FINAL_MERGED.csv

Complete dataset with all VAERS reports
Includes all historical changes tracked across data releases
Contains columns: cell_edits, status, changes
One row per VAERS_ID with complete information

Statistics and Tracking Files

stats.csv

Processing statistics for each data release
Counts of new reports, modifications, deletions
Date ranges and record counts

never_published_any.txt

VAERS IDs that were never published in any release
Identifies gaps in the VAERS ID sequence
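
One plausible reading of "gaps in the VAERS ID sequence" is every integer between the lowest and highest published ID that never appeared. The sketch below implements that reading only; `never_published` is a hypothetical helper and the script's actual bounds may differ.

```python
def never_published(published_ids):
    """Return the IDs missing between the smallest and largest published ID."""
    ids = set(published_ids)
    if not ids:
        return []
    return [i for i in range(min(ids), max(ids) + 1) if i not in ids]
```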

ever_published_any.txt

Complete list of all VAERS IDs ever published
Includes all vaccine types

ever_published_covid.txt

List of COVID-19 vaccine-related VAERS IDs
Filtered by VAX_TYPE containing 'covid'

writeups_deduped.txt

Deduplicated symptom text descriptions
Useful for analysis of unique symptom patterns

Key Columns in Output

The final merged file contains all standard VAERS columns plus:

Standard VAERS Columns

VAERS_ID - Unique report identifier
AGE_YRS, SEX, STATE - Demographic information
DIED, L_THREAT, ER_VISIT, HOSPITAL, DISABLE - Serious outcomes
VAX_TYPE, VAX_MANU, VAX_LOT - Vaccine information
VAX_DATE, ONSET_DATE, RPT_DATE - Date information
SYMPTOM_TEXT - Symptom description
And many more...

Enhanced Tracking Columns

cell_edits - Count of cells modified across all releases
status - Report status (`New`, `Modified`, `Deleted`, or blank)
changes - Detailed log of all changes made to the report
symptom_entries - Aggregated symptom entries

Performance Tuning

For Fast Processing (High RAM)

python vaers_complete.py --dataset covid --cores 16 --chunk-size 100000

For Memory-Constrained Systems

python vaers_complete.py --dataset covid --cores 4 --chunk-size 25000

For Very Large Full Dataset

python vaers_complete.py --dataset full --cores 16 --chunk-size 50000

Error Handling

The script includes comprehensive error handling:

All errors are collected and displayed at the end of processing
Errors include timestamps for tracking
Processing continues when possible, skipping problematic files
Final error summary shows total errors encountered
Exit code 0 = success, 1 = errors occurred
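
The collect-and-continue pattern above can be sketched like this. `ErrorLog` and `process_all` are hypothetical names used for illustration, not the script's internals.

```python
from datetime import datetime

class ErrorLog:
    """Collect errors with timestamps and report them at the end of a run."""
    def __init__(self):
        self.entries = []

    def record(self, context, exc):
        stamp = datetime.now().isoformat(timespec="seconds")
        self.entries.append((stamp, str(context), str(exc)))

    def summary(self):
        lines = [f"[{stamp}] {context}: {message}"
                 for stamp, context, message in self.entries]
        lines.append(f"Total errors: {len(self.entries)}")
        return "\n".join(lines)

    def exit_code(self):
        return 1 if self.entries else 0

def process_all(items, worker, errors):
    """Run worker on each item, skipping (and recording) the ones that fail."""
    results = []
    for item in items:
        try:
            results.append(worker(item))
        except Exception as exc:  # keep going; the summary reports it later
            errors.record(item, exc)
    return results
```

A caller would print `errors.summary()` after the last step and pass `errors.exit_code()` to `sys.exit`.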

Data Filtering

COVID Dataset Mode

By default, filters to COVID-19 era data:

Automatically detects the earliest COVID VAERS_ID
Removes all reports prior to first COVID vaccine report
Typically starts from VAERS_ID ~896636 (first trial report)
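
A minimal sketch of this filter, assuming rows are dicts and that a COVID report is one whose VAX_TYPE contains "covid" (case-insensitive). `earliest_covid_id` and `keep_covid_era` are hypothetical helpers.

```python
def earliest_covid_id(rows):
    """Smallest VAERS_ID among rows whose VAX_TYPE mentions COVID, or None."""
    ids = [int(r["VAERS_ID"]) for r in rows
           if "covid" in r.get("VAX_TYPE", "").lower()]
    return min(ids) if ids else None

def keep_covid_era(rows):
    """Drop every report with an ID below the first COVID report."""
    first = earliest_covid_id(rows)
    if first is None:
        return []
    return [r for r in rows if int(r["VAERS_ID"]) >= first]
```

Note that under this reading, non-COVID reports filed after the first COVID report are kept; only earlier IDs are removed.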

Full Dataset Mode

Processes complete historical VAERS data:

Includes all vaccine types from 1990 onwards (or specified date-floor)
Significantly larger processing time and storage requirements

Change Tracking

The script tracks modifications to VAERS reports across CDC data releases:

New reports: First appearance in a data release
Modifications: Changes to any field in existing reports
Deletions: Reports removed from later releases
Cell edits: Count of individual cell changes
Change log: Detailed description of what changed

Example change tracking entry:

2023-01-15: AGE_YRS changed from "45" to "46"
2023-01-15: SYMPTOM_TEXT appended with "Patient recovered"

Troubleshooting

Out of Memory Errors

Reduce --chunk-size to 25000 or lower
Reduce --cores to use fewer parallel processes
Process smaller date ranges using --date-floor and --date-ceiling

Progress Bars Not Showing

Install tqdm: pip install tqdm
Or disable with --no-progress if not needed

ZIP File Errors

Install zipfile-deflate64: pip install zipfile-deflate64
Script falls back to standard zipfile if not available

Missing Input Files

Ensure VAERS data files are in 0_VAERS_Downloads/ directory
Check that files are in correct ZIP format from CDC

License and Attribution

Original script by Gary Hawkins (http://univaers.com/download/). Enhanced version (2025) by Jason Page, with performance improvements and additional features.

Notes

The script automatically handles mixed date formats (MM/DD/YYYY → YYYY-MM-DD)
Duplicate records are automatically identified and removed
String type handling is optimized for memory efficiency
All CSV files use UTF-8-sig encoding for compatibility
Progress tracking can be disabled for automated/batch processing
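
The date handling in the first note can be sketched as follows. `normalize_date` is a hypothetical helper, and leaving unrecognised values unchanged is an assumption rather than the script's confirmed behaviour.

```python
from datetime import datetime

def normalize_date(value):
    """Convert MM/DD/YYYY to YYYY-MM-DD; pass through anything else unchanged."""
    value = value.strip()
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value
```

For the encoding note, passing `encoding="utf-8-sig"` to Python's `open()` reads files with or without a byte-order mark and writes one on output, which spreadsheet tools tend to expect.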

Support

For issues, questions, or contributions, refer to the original source or the repository where this script is maintained.

