Synthetic Patient Data for Dev and Testing: Build Realistic FHIR Datasets Without Touching PHI

March 16, 2026

The PHI Problem Every Healthcare Developer Faces

You need realistic patient data to build and test your EHR integration, FHIR API, or clinical decision support tool. But real patient data is locked behind HIPAA's Privacy Rule, and for good reason. A single breach can cost your organization $1.3 million on average in fines alone, plus the reputational damage that follows.

De-identification under HIPAA's Safe Harbor method requires stripping 18 categories of identifiers. Expert Determination requires a qualified statistician to certify the risk is "very small." Both approaches are expensive, slow, and still carry residual re-identification risk, especially with genomic and longitudinal data. Even after de-identification, the data often loses the clinical richness you need for meaningful testing.

The answer? Synthetic patient data: clinically realistic records generated algorithmically, with zero connection to real people. No HIPAA obligations. No IRB approval. No data use agreements. Just realistic FHIR bundles you can spin up in minutes.

This guide walks you through the complete pipeline: generating synthetic patients with Synthea, loading them into a HAPI FHIR server, and building test scenarios that actually exercise your code. You will have working test data within 30 minutes.

Synthea: The Gold Standard for Synthetic Clinical Data

Synthea is an open-source synthetic patient generator developed by The MITRE Corporation. It is not a random data generator. Synthea simulates realistic patient lifecycles from birth to death using clinically validated disease progression models, real demographic distributions from US Census data, and standard medical coding systems (SNOMED CT, LOINC, RxNorm, ICD-10).

Each synthetic patient gets a complete medical history: encounters, conditions, medications, procedures, immunizations, lab results, vital signs, and care plans. The output is native FHIR R4 (or C-CDA, CSV, and other formats), ready to load into any FHIR-compliant system.

What Makes Synthea Data Realistic

Disease modules: Over 90 clinical modules model conditions like diabetes, hypertension, cancer, asthma, and COVID-19 with evidence-based state machines
Demographics: Population distributions match US Census data for age, gender, race, ethnicity, and geographic location
Temporal coherence: Lab values change over time in clinically plausible ways (e.g., HbA1c drifting upward in uncontrolled diabetes)
Standard codes: Every condition, medication, lab test, and procedure uses real SNOMED CT, LOINC, and RxNorm codes
Payer simulation: Insurance coverage, claims, and costs follow realistic patterns

Installation: Two Options

Option 1: From Source (Recommended for Customization)

Synthea requires Java 11 or higher. On macOS, install via Homebrew: brew install openjdk@17. On Ubuntu: sudo apt install openjdk-17-jdk.

# Clone the repository
git clone https://github.com/synthetichealth/synthea.git
cd synthea

# Build (Gradle wrapper included)
./gradlew build check test

# Verify installation (prints usage; arguments starting with "--" are treated as config overrides)
./run_synthea -h

Option 2: Docker (Fastest Start)

# Pull and run the Docker image
docker pull ghcr.io/synthetichealth/synthea:master

# Generate 100 patients from Massachusetts
docker run --rm -v "$(pwd)/output:/output" \
  ghcr.io/synthetichealth/synthea:master \
  -p 100 Massachusetts

Generating Your First Patient Population

The basic command generates FHIR R4 bundles by default:

# Generate 500 patients from Massachusetts
./run_synthea -p 500 Massachusetts

# Output lands in ./output/fhir/
ls output/fhir/ | head -10
# Abe604_Koss676_0a1b2c3d-4e5f-6789-abcd-ef0123456789.json
# Ada529_Mertz280_1a2b3c4d-5e6f-7890-bcde-f01234567890.json
# ...

Each file is a FHIR Bundle (type: transaction) containing all resources for one patient. A typical patient bundle includes 50-200 resources depending on age and clinical history.
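
To see what a bundle contains before loading it, a short script can tally its resource types. This is a minimal sketch that assumes the standard FHIR Bundle layout (an entry array whose items each carry a resource). The function name count_resource_types and the sample bundle are illustrative, not part of Synthea.

```python
from collections import Counter


def count_resource_types(bundle: dict) -> Counter:
    """Tally resourceType values across the entries of a FHIR Bundle."""
    counts: Counter = Counter()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        counts[resource.get("resourceType", "unknown")] += 1
    return counts


# Tiny hand-made bundle standing in for a real Synthea output file.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient"}},
        {"resource": {"resourceType": "Encounter"}},
        {"resource": {"resourceType": "Encounter"}},
        {"resource": {"resourceType": "Condition"}},
    ],
}

# Print one line per resource type, most frequent first.
for resource_type, n in count_resource_types(sample_bundle).most_common():
    print(f"{resource_type}: {n}")
```

For a real file, read it with json.load and pass the resulting dict to the same function.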

Key CLI Options

# Generate specific number of patients in a specific state
./run_synthea -p 1000 California

# Target a specific city
./run_synthea -p 200 Massachusetts Boston

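# Set a seed for reproducible output (quote multi-word state names)
./run_synthea -p 100 -s 12345 "New York"

# Limit exported history to the last 10 years
./run_synthea -p 500 --exporter.years_of_history 10 Massachusetts

# Generate only living patients (no deceased)
./run_synthea -p 500 --generate.only_alive_patients true Massachusetts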

# Restrict patient ages to a range (minAge-maxAge)
./run_synthea -p 100 -a 60-80 Massachusetts
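
When Synthea runs from a test harness or CI job, it can help to assemble the command as an argument list instead of a shell string. The sketch below is one way to do that, using only the -p, -s, and -a flags shown above. The helper name build_synthea_args is hypothetical, and the working directory in the usage note is an assumption about where the repository was cloned.

```python
from typing import List, Optional


def build_synthea_args(
    population: int,
    state: str,
    city: Optional[str] = None,
    seed: Optional[int] = None,
    age_range: Optional[str] = None,
) -> List[str]:
    """Assemble a run_synthea command line as a list of arguments."""
    args = ["./run_synthea", "-p", str(population)]
    if seed is not None:
        args += ["-s", str(seed)]      # fixed seed for reproducible output
    if age_range is not None:
        args += ["-a", age_range]      # for example "60-80"
    args.append(state)                 # positional: state, then optional city
    if city:
        args.append(city)
    return args


print(build_synthea_args(100, "Massachusetts", city="Boston", seed=12345))
```

Passing the list to subprocess.run(args, cwd="synthea", check=True) avoids shell quoting problems with names such as "New York", because each list item reaches the program as a single argument.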

Customizing Synthea for Your Use Case

The real power of Synthea is customization. You can control demographics, disease prevalence, and which clinical modules run.

Demographics Configuration

Edit src/main/resources/synthea.properties to control population characteristics:

# synthea.properties - key configuration options

# Export format (FHIR R4 is default)
exporter.fhir.export = true
exporter.ccda.export = false
exporter.csv.export = false

# Transaction bundles (true) vs individual resources (false)
exporter.fhir.transaction_bundle = true

# Include hospital/practitioner bundles
exporter.hospital.fhir.export = true
exporter.practitioner.fhir.export = true

# Limit history to recent years
exporter.years_of_history = 5

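# Gender ratio (default 50/50). Confirm these keys exist in your Synthea version;
# the -g M or -g F command-line flag is another way to restrict gender.
generate.demographics.gender.male = 0.5
generate.demographics.gender.female = 0.5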

Disease Module Targeting

Synthea includes 90+ disease modules. You can enable or disable specific ones to create focused datasets:

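# Run only the diabetes-related modules (quote the pattern so the shell does not expand it).
# This limits which disease modules run; it does not guarantee every patient develops diabetes.
./run_synthea -p 200 -m "diabetes*"

# Multiple modules (patterns are separated with the platform path separator, ":" on Linux/macOS)
./run_synthea -p 300 -m "diabetes*:hypertension:chronic_kidney_disease"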

# List available modules
ls src/main/resources/modules/
# allergic_rhinitis.json
# asthma.json
# atopy.json
# breast_cancer.json
# chronic_kidney_disease.json
# colorectal_cancer.json
# copd.json
# covid19.json
# diabetes.json
# heart_failure.json
# hypertension.json
# lung_cancer.json
# ...

Setting Up a Local HAPI FHIR Server

Generating FHIR bundles is step one. To actually query and test against them, you need a FHIR server. HAPI FHIR is the most widely used open-source FHIR server, written in Java, with full R4 and R5 support.

Launch with Docker (30 Seconds)

# Start HAPI FHIR R4 server on port 8080
docker run -d --name hapi-fhir \
  -p 8080:8080 \
  -e hapi.fhir.default_encoding=json \
  hapiproject/hapi:latest

# Verify it is running
curl -s http://localhost:8080/fhir/metadata | python3 -m json.tool | head -20

The server starts with an empty database. The /fhir/metadata endpoint returns the CapabilityStatement, confirming it is ready.
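
HAPI can take a while to finish starting, so test setup code usually polls /fhir/metadata before it loads data. The following is a minimal sketch of that readiness check using only the Python standard library. The function names, retry count, and delay are assumptions chosen for illustration.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str) -> dict:
    """GET a URL and parse the response body as JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/fhir+json"})
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def wait_until_ready(
    metadata_url: str,
    fetch: Callable[[str], dict] = fetch_json,
    attempts: int = 30,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll the metadata endpoint until it returns a CapabilityStatement."""
    for _ in range(attempts):
        try:
            if fetch(metadata_url).get("resourceType") == "CapabilityStatement":
                return True
        except (OSError, ValueError):
            # OSError covers refused connections, timeouts, and HTTP errors;
            # ValueError covers a body that is not valid JSON yet.
            pass
        time.sleep(delay_seconds)
    return False
```

A call such as wait_until_ready("http://localhost:8080/fhir/metadata") returns True once the server responds, or False after the attempts run out. The fetch parameter exists so the polling logic can be unit tested without a running server.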

Loading Synthea Bundles into HAPI

Each Synthea output file is a FHIR transaction bundle. Post them directly:

# Load a single patient bundle
curl -s -X POST http://localhost:8080/fhir \
  -H "Content-Type: application/fhir+json" \
  -d @output/fhir/Abe604_Koss676_0a1b2c3d.json | python3 -m json.tool

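# Bulk load all bundles with a bash loop.
# Shared hospital/practitioner bundles go first, since patient bundles may reference them.
post_bundle() {
  curl -s -X POST http://localhost:8080/fhir \
    -H "Content-Type: application/fhir+json" \
    --data-binary @"$1" > /dev/null
}
for bundle in output/fhir/hospitalInformation*.json output/fhir/practitionerInformation*.json; do
  if [ -e "$bundle" ]; then post_bundle "$bundle"; fi
done
for bundle in output/fhir/*.json; do
  case "$bundle" in */hospitalInformation*|*/practitionerInformation*) continue ;; esac
  echo "Loading: $bundle"
  post_bundle "$bundle"
done
echo "Done. Processed $(ls output/fhir/*.json | wc -l) bundles."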

For large datasets (1,000+ patients), consider parallel loading:

# Parallel load using xargs (4 concurrent uploads)
ls output/fhir/*.json | xargs -P 4 -I {} \
  curl -s -X POST http://localhost:8080/fhir \
  -H "Content-Type: application/fhir+json" \
  -d @{} -o /dev/null
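
The same parallel load can be driven from Python when the loader needs to live inside a test suite. This sketch assumes that, when hospital and practitioner export is enabled, Synthea writes hospitalInformation and practitionerInformation bundles to the same directory and that patient bundles may reference them, so it posts those sequentially before fanning out. The function names, base URL, and worker count are illustrative.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# File name prefixes assumed for Synthea's shared (non-patient) bundles.
SHARED_PREFIXES = ("hospitalInformation", "practitionerInformation")


def split_bundles(paths: Iterable[Path]) -> Tuple[List[Path], List[Path]]:
    """Separate shared bundles from patient bundles, each in sorted order."""
    shared: List[Path] = []
    patients: List[Path] = []
    for path in sorted(paths):
        target = shared if path.name.startswith(SHARED_PREFIXES) else patients
        target.append(path)
    return shared, patients


def post_bundle(path: Path, base_url: str = "http://localhost:8080/fhir") -> int:
    """POST one transaction bundle and return the HTTP status code."""
    request = urllib.request.Request(
        base_url,
        data=path.read_bytes(),
        method="POST",
        headers={"Content-Type": "application/fhir+json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.status


def load_all(
    paths: Iterable[Path],
    post: Callable[[Path], int] = post_bundle,
    workers: int = 4,
) -> List[int]:
    """Load shared bundles one by one, then patient bundles in parallel."""
    shared, patients = split_bundles(paths)
    statuses = [post(path) for path in shared]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses += list(pool.map(post, patients))
    return statuses
```

A typical call is load_all(Path("output/fhir").glob("*.json")). Note that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so a rejected bundle surfaces as an exception instead of a status code.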

Querying Your Loaded Data

# Count total patients
curl -s "http://localhost:8080/fhir/Patient?_summary=count" | python3 -c \
  "import sys,json; print(f'Patients: {json.load(sys.stdin)[\"total\"]}')"

# Search for diabetic patients
curl -s "http://localhost:8080/fhir/Condition?code=44054006" \
  | python3 -m json.tool | head -30

# Get a patient with all their data using $everything
curl -s "http://localhost:8080/fhir/Patient/PATIENT_ID/\$everything" \
  | python3 -m json.tool
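
FHIR search results are paged, so a test that needs every matching resource has to follow the "next" link in each searchset Bundle. The sketch below shows that loop under the assumption that the server returns standard Bundle.link entries. The function names are illustrative, and fetch is any callable that turns a URL into a parsed JSON dict.

```python
from typing import Callable, Iterator, Optional


def next_page_url(bundle: dict) -> Optional[str]:
    """Return the URL of the next results page, or None on the last page."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def iter_resources(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every resource from a paged FHIR search, following next links."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        for entry in page.get("entry", []):
            if "resource" in entry:
                yield entry["resource"]
        url = next_page_url(page)
```

With a real server, fetch can be a small wrapper around urllib.request.urlopen plus json.load, and first_url can be a search such as http://localhost:8080/fhir/Condition?code=44054006.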

SMART on FHIR Sandbox for App Testing

If you are building a SMART on FHIR application, launch.smarthealthit.org provides a free sandbox with pre-loaded synthetic patients and a full OAuth2 authorization flow. You can test your app's launch sequence, patient context, and scopes without deploying your own server.

Key features of the SMART sandbox:

EHR Launch simulation: Test the full EHR launch flow with patient/practitioner context selection
Standalone Launch: Test standalone patient app launches with authorization code flow
Pre-loaded patients: Dozens of synthetic patients with clinical data ready for testing
Scope negotiation: Test different SMART scopes (patient/*.read, launch/patient, openid, fhirUser)

# Launch your app from the SMART sandbox (no separate app registration step is needed)
# 1. Go to https://launch.smarthealthit.org/
# 2. Set your App Launch URL (e.g., http://localhost:3000/launch)
# 3. Set redirect URI (e.g., http://localhost:3000/callback)
# 4. Choose launch type: EHR Launch or Standalone Patient
# 5. Select patient and practitioner context
# 6. Click "Launch" to start the OAuth flow
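
For a standalone launch, the app itself builds the authorization request. The following sketch assembles that URL from the standard SMART App Launch parameters (response_type, client_id, redirect_uri, scope, state, aud, plus launch for an EHR launch). The endpoint and FHIR base URLs are placeholders; in practice the authorization endpoint is read from the server's .well-known/smart-configuration document. PKCE parameters, which newer SMART versions expect, are left out to keep the example short.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def build_authorize_url(
    authorize_endpoint: str,
    client_id: str,
    redirect_uri: str,
    scope: str,
    state: str,
    aud: str,
    launch: Optional[str] = None,
) -> str:
    """Build the redirect URL that starts the OAuth2 authorization code flow."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,   # random value the app verifies on the callback
        "aud": aud,       # base URL of the FHIR server being accessed
    }
    if launch is not None:
        params["launch"] = launch  # opaque token passed in by an EHR launch
    separator = "&" if "?" in authorize_endpoint else "?"
    return f"{authorize_endpoint}{separator}{urlencode(params)}"


# Placeholder URLs for illustration only.
url = build_authorize_url(
    "https://auth.example.org/authorize",
    client_id="my-app",
    redirect_uri="http://localhost:3000/callback",
    scope="launch/patient patient/*.read openid fhirUser",
    state="abc123",
    aud="https://fhir.example.org/r4",
)
print(parse_qs(urlsplit(url).query)["scope"][0])
```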

Building a Realistic Clinical Scenario

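Let us walk through what a realistic test patient looks like in FHIR. Meet Maria Rodriguez: a 65-year-old woman in Boston with Type 2 Diabetes, Hypertension, and a history of 12 office visits over 5 years. Synthea generates this kind of patient, although its output uses generated names (such as Abe604 Koss676) and UUID identifiers; the hand-written resources below use readable values for clarity.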

The Patient Resource

{
  "resourceType": "Patient",
  "id": "maria-rodriguez-test",
  "name": [{
    "use": "official",
    "family": "Rodriguez",
    "given": ["Maria"],
    "prefix": ["Mrs."]
  }],
  "gender": "female",
  "birthDate": "1961-03-15",
  "address": [{
    "line": ["123 Commonwealth Ave"],
    "city": "Boston",
    "state": "MA",
    "postalCode": "02115",
    "country": "US"
  }],
  "maritalStatus": {
    "coding": [{
      "system": "http://terminology.hl7.org/CodeSystem/v3-MaritalStatus",
      "code": "M",
      "display": "Married"
    }]
  }
}

A Condition Resource (Type 2 Diabetes)

{
  "resourceType": "Condition",
  "clinicalStatus": {
    "coding": [{
      "system": "http://terminology.hl7.org/CodeSystem/condition-clinical",
      "code": "active"
    }]
  },
  "verificationStatus": {
    "coding": [{
      "system": "http://terminology.hl7.org/CodeSystem/condition-ver-status",
      "code": "confirmed"
    }]
  },
  "code": {
    "coding": [{
      "system": "http://snomed.info/sct",
      "code": "44054006",
      "display": "Type 2 diabetes mellitus"
    }]
  },
  "subject": {
    "reference": "Patient/maria-rodriguez-test"
  },
  "onsetDateTime": "2018-07-22"
}

An Observation Resource (HbA1c Lab Result)

{
  "resourceType": "Observation",
  "status": "final",
  "category": [{
    "coding": [{
      "system": "http://terminology.hl7.org/CodeSystem/observation-category",
      "code": "laboratory"
    }]
  }],
  "code": {
    "coding": [{
      "system": "http://loinc.org",
      "code": "4548-4",
      "display": "Hemoglobin A1c/Hemoglobin.total in Blood"
    }]
  },
  "subject": {
    "reference": "Patient/maria-rodriguez-test"
  },
  "effectiveDateTime": "2025-09-15",
  "valueQuantity": {
    "value": 7.2,
    "unit": "%",
    "system": "http://unitsofmeasure.org",
    "code": "%"
  }
}
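
Observations like this one are what a test for a diabetes-related feature would typically consume. As a minimal sketch, the helper below pulls the HbA1c readings out of a list of Observation resources and orders them by date. It assumes every effectiveDateTime uses the same ISO 8601 format, so that string sorting matches chronological order. The function name and sample data are illustrative.

```python
from typing import List, Tuple

LOINC_SYSTEM = "http://loinc.org"
HBA1C_CODE = "4548-4"


def hba1c_series(observations: List[dict]) -> List[Tuple[str, float]]:
    """Return (date, value) pairs for HbA1c observations, oldest first."""
    points = []
    for obs in observations:
        codings = obs.get("code", {}).get("coding", [])
        is_hba1c = any(
            c.get("system") == LOINC_SYSTEM and c.get("code") == HBA1C_CODE
            for c in codings
        )
        if is_hba1c and "valueQuantity" in obs and "effectiveDateTime" in obs:
            points.append((obs["effectiveDateTime"], obs["valueQuantity"]["value"]))
    return sorted(points)


def _obs(code: str, date: str, value: float) -> dict:
    """Build a stripped-down Observation for the example."""
    return {
        "resourceType": "Observation",
        "code": {"coding": [{"system": LOINC_SYSTEM, "code": code}]},
        "effectiveDateTime": date,
        "valueQuantity": {"value": value},
    }


sample_observations = [
    _obs("4548-4", "2025-09-15", 7.2),
    _obs("2339-0", "2025-09-15", 110.0),  # a glucose reading, ignored by the filter
    _obs("4548-4", "2024-09-10", 6.9),
]

print(hba1c_series(sample_observations))
```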

MedicationRequest Resource (Metformin)

{
  "resourceType": "MedicationRequest",
  "status": "active",
  "intent": "order",
  "medicationCodeableConcept": {
    "coding": [{
      "system": "http://www.nlm.nih.gov/research/umls/rxnorm",
      "code": "860975",
      "display": "24 HR Metformin hydrochloride 500 MG Extended Release Oral Tablet"
    }]
  },
  "subject": {
    "reference": "Patient/maria-rodriguez-test"
  },
  "authoredOn": "2018-07-22",
  "dosageInstruction": [{
    "text": "Take 500mg twice daily with meals",
    "timing": {
      "repeat": {
        "frequency": 2,
        "period": 1,
        "periodUnit": "d"
      }
    },
    "doseAndRate": [{
      "doseQuantity": {
        "value": 500,
        "unit": "mg",
        "system": "http://unitsofmeasure.org",
        "code": "mg"
      }
    }]
  }]
}
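
Hand-written resources like the four above still need to be wrapped in a transaction Bundle before they can be posted to the server in one request. The sketch below shows one way to do that: resources that carry an id are written with PUT so references such as Patient/maria-rodriguez-test resolve, and the rest are created with POST. The function name is hypothetical, the base URL is the local HAPI address used in this guide, and the approach assumes the server accepts client-assigned ids.

```python
from typing import List


def to_transaction_bundle(
    resources: List[dict], base_url: str = "http://localhost:8080/fhir"
) -> dict:
    """Wrap resources in a FHIR transaction Bundle."""
    entries = []
    for resource in resources:
        resource_type = resource["resourceType"]
        entry = {"resource": resource}
        if "id" in resource:
            # Update-as-create keeps the readable id that other resources reference.
            entry["fullUrl"] = f"{base_url}/{resource_type}/{resource['id']}"
            entry["request"] = {"method": "PUT", "url": f"{resource_type}/{resource['id']}"}
        else:
            # Let the server assign an id.
            entry["request"] = {"method": "POST", "url": resource_type}
        entries.append(entry)
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}


patient = {"resourceType": "Patient", "id": "maria-rodriguez-test"}
condition = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/maria-rodriguez-test"},
}
bundle = to_transaction_bundle([patient, condition])
print([entry["request"]["method"] for entry in bundle["entry"]])
```

Serialize the result with json.dumps and POST it to the server base URL, the same way the Synthea bundles are loaded.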

Data Quality: Where Synthetic Falls Short

Synthea data is excellent for development and integration testing, but it has known gaps you should be aware of.

What Synthea Gets Right

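Standard coding: Conditions, labs, and medications carry codes from standard terminologies such as SNOMED CT, LOINC, and RxNorm
Temporal consistency: Disease progression follows clinically plausible timelines
Resource references: References between a patient's own resources resolve within the patient bundle; references to organizations and practitioners may depend on the separate hospital and practitioner bundles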
US Core compliance: Output aligns with the US Core Implementation Guide profiles

What Real-World Data Looks Like (That Synthea Doesn't Model)

Free-text notes: Real EHRs contain extensive unstructured clinical notes. Synthea generates minimal narrative text.
Missing codes: In production, up to 30% of conditions may use local codes or free text instead of SNOMED. Synthea always uses standard codes.
Data entry errors: Real data has typos, incorrect codes, contradictory information. Synthea data is pristine.
Partial records: Patients switch providers, creating gaps in history. Synthea generates complete histories.
Workflow artifacts: Duplicate entries, amended results, retracted orders are common in real EHRs but absent in Synthea.

If your application handles these edge cases (and it should), supplement Synthea data with hand-crafted FHIR bundles that deliberately include these imperfections.
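
One inexpensive way to produce such fixtures is to start from a clean Condition and deliberately degrade its coding. The sketch below covers two of the imperfections listed above: a local code in place of SNOMED CT, and a free-text-only diagnosis. The function names and the local code system URL are hypothetical placeholders.

```python
import copy

# Hypothetical local code system, used only for test fixtures.
LOCAL_SYSTEM = "http://example.org/fhir/CodeSystem/local-problem-codes"


def with_local_code(condition: dict, local_code: str) -> dict:
    """Return a copy whose diagnosis uses a local code instead of a standard one."""
    messy = copy.deepcopy(condition)
    original = condition.get("code", {}).get("coding", [{}])[0]
    messy["code"] = {
        "coding": [{"system": LOCAL_SYSTEM, "code": local_code}],
        "text": original.get("display", ""),
    }
    return messy


def with_free_text_only(condition: dict, text: str) -> dict:
    """Return a copy whose diagnosis is free text with no coding at all."""
    messy = copy.deepcopy(condition)
    messy["code"] = {"text": text}
    return messy


clean = {
    "resourceType": "Condition",
    "code": {
        "coding": [{
            "system": "http://snomed.info/sct",
            "code": "44054006",
            "display": "Type 2 diabetes mellitus",
        }]
    },
    "subject": {"reference": "Patient/maria-rodriguez-test"},
}

print(with_local_code(clean, "DM2-LOCAL")["code"])
print(with_free_text_only(clean, "type II diabetis")["code"])
```

Both helpers copy the input, so the clean resource stays available for the happy-path tests.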

When Synthetic Data Is Not Enough

Synthea covers the 80% case well, but some testing scenarios require different approaches:

Scenario	Why Synthea Falls Short	Better Approach
Rare diseases (e.g., Gaucher disease, ALS)	No disease module exists	Hand-craft FHIR bundles using Orphanet codes
Complex medication regimens (oncology protocols)	Simplified medication modeling	Build custom MedicationRequest bundles with real RxNorm codes
Genomic data (molecular testing results)	No genomic observations	Use FHIR Genomics Reporting IG examples
Social determinants of health (SDOH)	Limited SDOH data	Use Gravity Project SDOH IG examples
ML model training (statistical validity)	Synthetic distributions may not match	Use Gretel.ai or MOSTLY AI for privacy-safe synthetic data trained on real distributions
Performance/load testing	Need massive scale	Synthea with -p 100000 plus parallel HAPI loading

Putting It All Together: A Complete Testing Pipeline

Here is an end-to-end script that generates patients, loads them into HAPI, and validates that the data is queryable. It assumes Docker is running and that Synthea has been cloned and built in ./synthea:

#!/bin/bash
# synthetic-test-pipeline.sh
# Generates synthetic patients and loads into a local HAPI FHIR server

set -e

PATIENT_COUNT=${1:-500}
STATE=${2:-Massachusetts}
HAPI_URL="http://localhost:8080/fhir"

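echo "=== Step 1: Start HAPI FHIR Server ==="
# Remove any container left over from a previous run so the name is free
docker rm -f hapi-fhir-test > /dev/null 2>&1 || true
docker run -d --name hapi-fhir-test \
  -p 8080:8080 \
  hapiproject/hapi:latest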

echo "Waiting for HAPI to start..."
until curl -sf "$HAPI_URL/metadata" > /dev/null 2>&1; do
  sleep 2
done
echo "HAPI FHIR server is ready."

echo "=== Step 2: Generate $PATIENT_COUNT Patients ==="
cd synthea
./run_synthea -p "$PATIENT_COUNT" "$STATE"

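echo "=== Step 3: Load Bundles into HAPI ==="
# POST one transaction bundle and print the HTTP status code
post_bundle() {
  curl -s -o /dev/null -w "%{http_code}" \
    -X POST "$HAPI_URL" \
    -H "Content-Type: application/fhir+json" \
    --data-binary @"$1"
}

# Shared hospital/practitioner bundles first, since patient bundles may reference them
for bundle in output/fhir/hospitalInformation*.json output/fhir/practitionerInformation*.json; do
  if [ -e "$bundle" ]; then post_bundle "$bundle" > /dev/null; fi
done

LOADED=0
for bundle in output/fhir/*.json; do
  case "$bundle" in */hospitalInformation*|*/practitionerInformation*) continue ;; esac
  HTTP_CODE=$(post_bundle "$bundle")
  if [ "$HTTP_CODE" = "200" ]; then
    LOADED=$((LOADED + 1))
  else
    echo "WARN: Failed to load $bundle (HTTP $HTTP_CODE)"
  fi
done
echo "Loaded $LOADED patient bundles successfully."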

echo "=== Step 4: Validate ==="
TOTAL=$(curl -s "$HAPI_URL/Patient?_summary=count" | \
  python3 -c "import sys,json; print(json.load(sys.stdin)['total'])")
echo "Total patients in HAPI: $TOTAL"

CONDITIONS=$(curl -s "$HAPI_URL/Condition?_summary=count" | \
  python3 -c "import sys,json; print(json.load(sys.stdin)['total'])")
echo "Total conditions: $CONDITIONS"

OBS=$(curl -s "$HAPI_URL/Observation?_summary=count" | \
  python3 -c "import sys,json; print(json.load(sys.stdin)['total'])")
echo "Total observations: $OBS"

echo "=== Pipeline Complete ==="
echo "HAPI FHIR server running at $HAPI_URL"
echo "Query patients: curl '$HAPI_URL/Patient?_count=10'"

Quick Reference: Tools Compared

Tool	Data Quality	FHIR Support	Best For	Cost
Synthea	High (clinically modeled)	R4 native	Full dev/test, CI pipelines	Free / Open Source
HAPI FHIR	N/A (server only)	R4, R5	Local FHIR API testing	Free / Open Source
SMART Sandbox	Medium	R4	SMART app launch testing	Free
Gretel.ai	High (ML-generated)	Custom export	Privacy-safe ML training	Freemium
CMS Test Data	Medium	Bulk FHIR (limited)	Claims/billing testing	Free

Next Steps

Synthetic patient data eliminates the biggest bottleneck in healthcare software development: getting realistic test data without legal risk. With Synthea and HAPI FHIR, you can go from zero to a fully populated FHIR server in under 30 minutes.

Start with ./run_synthea -p 100 Massachusetts, load the bundles into HAPI, and build your tests against real SNOMED, LOINC, and RxNorm codes. When you hit edge cases that Synthea does not cover, hand-craft targeted FHIR bundles. The combination gives you comprehensive test coverage without ever touching PHI.

At Nirmitee, we build healthcare systems that handle real-world FHIR data at scale, from EHR integrations to clinical data pipelines. Synthetic data is how we test every feature before it touches production. If you are building in healthcare and need help with FHIR implementation, we would love to talk.

Frequently Asked Questions

What is synthetic patient data?

Synthetic patient data is clinically realistic patient records generated algorithmically, with zero connection to real people. Because no real patients are involved, there are no HIPAA obligations, no IRB approval, and no data use agreements required. Tools like Synthea produce complete medical histories, including encounters, conditions, medications, lab results, and vital signs, as native FHIR R4 bundles that healthcare developers can load into a test server within minutes.

What is Synthea and how does it generate realistic FHIR data?

Synthea is an open-source synthetic patient generator developed by The MITRE Corporation that simulates patient lifecycles from birth to death. It uses over 90 clinically validated disease modules, demographic distributions from US Census data, and standard coding systems including SNOMED CT, LOINC, RxNorm, and ICD-10. Lab values change over time in clinically plausible ways, such as HbA1c drifting upward in uncontrolled diabetes, and output aligns with US Core Implementation Guide profiles.

Why not just de-identify real patient data for testing instead of using synthetic data?

Because HIPAA de-identification is expensive, slow, and still carries residual re-identification risk, especially with genomic and longitudinal data. Safe Harbor requires stripping 18 categories of identifiers, while Expert Determination requires a qualified statistician to certify the risk is very small. Even then, de-identified data often loses the clinical richness needed for meaningful testing, and a single PHI breach averages $1.3 million in fines alone.

How do I load Synthea patients into a HAPI FHIR server?

Launch HAPI FHIR with Docker, then POST each Synthea output file directly to the server, since every file is a FHIR transaction bundle containing all resources for one patient, typically 50-200 resources depending on age and clinical history. HAPI FHIR is the most widely used open-source FHIR server, with full R4 and R5 support. For large datasets of 1,000 or more patients, parallel loading speeds up ingestion considerably.

What are the limitations of synthetic patient data for healthcare testing?

Synthea data is excellent for development and integration testing but does not model everything found in production EHRs. It generates minimal free-text narrative, while real EHRs contain extensive unstructured clinical notes, and in production up to 30% of conditions may use local codes or free text rather than clean standard codes. Healthcare engineering teams like Nirmitee's pair Synthea datasets with targeted edge-case fixtures so test suites cover both the clean and messy data paths.