The Data Scarcity Problem in Healthcare ML
Healthcare machine learning has a data problem that no amount of data engineering can solve. A rare disease, by the US definition, affects fewer than 200,000 people nationwide, yet a diagnostic model often needs thousands to tens of thousands of labeled examples. New hospitals opening their doors have zero historical data for predictive models. Research collaborations need datasets that can be shared across institutions, but IRB approvals take months and HIPAA de-identification is imperfect.
Traditional solutions — data augmentation, transfer learning, few-shot learning — help, but they have limits. You cannot augment your way to a representative dataset when you have 47 confirmed cases of a rare autoimmune condition. Transfer learning from a general medical dataset misses the specific patterns of your patient population. And few-shot learning, while promising, has not achieved clinical-grade performance for most tasks.
Synthetic data offers a fundamentally different approach: generate artificial patient records that preserve the statistical properties of real data without containing any actual patient information. Done correctly, a model trained on synthetic data can approach the performance of one trained on real data — while eliminating the privacy, regulatory, and access barriers that make real data so difficult to work with.
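This is distinct from data de-identification, which transforms real records to remove identifiers. Synthetic data is generated from scratch: each record is sampled from a simulation or a fitted model rather than copied from a patient. For learned generators this holds only if the model has not memorized training rows, which is why the privacy testing covered later in this article matters.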

When Synthetic Data Wins
Synthetic data is not a universal replacement for real data. It excels in specific scenarios where real data is insufficient, inaccessible, or prohibited.

Rare Disease Modeling (Class Imbalance)
A hospital system with 50,000 annual admissions might see 30 cases of Addison's disease per year. Over five years, that is 150 positive cases among 250,000 admissions, a 0.06% positive rate. Class weighting and SMOTE oversampling rarely produce a reliable model from 150 examples, because they only reweight or interpolate the same few cases. Synthetic generation can create thousands of statistically plausible Addison's cases, preserving the correlations between cortisol levels, electrolyte imbalances, and clinical presentations observed in the real cases. The generator still learns only from those 150 cases, so it adds variety around observed patterns rather than new clinical information.
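The arithmetic is easy to check, and the same calculation shows how many synthetic positives an augmentation run would need. The sketch below is illustrative: the function names and the 5% target prevalence are assumptions, not figures from any guideline.

```python
import math


def positive_rate(positives: int, total: int) -> float:
    """Fraction of records that are positive cases."""
    return positives / total


def synthetic_positives_needed(positives: int, total: int, target_rate: float) -> int:
    """Synthetic positives to add so positives reach target_rate of the augmented set.

    Solves (positives + s) / (total + s) = target_rate for s.
    """
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    needed = (target_rate * total - positives) / (1 - target_rate)
    # Round away floating-point noise before taking the ceiling.
    return max(0, math.ceil(round(needed, 6)))


cases, admissions = 150, 250_000
rate = positive_rate(cases, admissions)
# The 5% target is an assumed value chosen only for illustration.
extra = synthetic_positives_needed(cases, admissions, target_rate=0.05)

print(f"Observed positive rate: {rate:.2%}")
print(f"Synthetic positives needed for 5% prevalence: {extra}")
```

The right target prevalence depends on the downstream model and on how its predicted probabilities will be recalibrated against the true base rate.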
New Hospitals and Health Systems
When a new hospital opens or a health system deploys its first predictive analytics platform, there is no historical data to train on. Synthetic data generated from similar institutions' statistical profiles (not their raw data) can bootstrap initial models. These models are replaced with locally-trained versions as real data accumulates, but synthetic data eliminates the cold-start period.
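One way to read "statistical profiles" is a table of means and covariances that a peer institution shares instead of patient rows. The sketch below samples bootstrap records from such a profile. The profile values, the `sample_from_profile` name, and the multivariate-normal assumption are all illustrative; real clinical variables are rarely this well-behaved.

```python
import numpy as np

# Hypothetical profile a peer institution might share: summary statistics only.
PROFILE = {
    "columns": ["age", "systolic_bp", "hba1c"],
    "mean": [65.0, 130.0, 6.5],
    "cov": [
        [225.0, 90.0, 2.7],
        [90.0, 400.0, 7.2],
        [2.7, 7.2, 3.24],
    ],
    # Plausibility bounds applied after sampling.
    "bounds": {"age": (18, 100), "systolic_bp": (80, 220), "hba1c": (4.0, 14.0)},
}


def sample_from_profile(profile: dict, n: int, seed: int = 0) -> dict:
    """Draw n records from a Gaussian fitted to the shared means and covariances."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(profile["mean"], profile["cov"], size=n)
    records = {}
    for i, col in enumerate(profile["columns"]):
        low, high = profile["bounds"][col]
        records[col] = np.clip(draws[:, i], low, high)
    return records


bootstrap = sample_from_profile(PROFILE, n=5000)
print({col: round(float(values.mean()), 1) for col, values in bootstrap.items()})
```

A Gaussian profile preserves pairwise linear correlations but not skew, categorical fields, or temporal structure, which is one reason to swap in locally trained models as real data accumulates.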
Research Data Sharing
Multi-institutional research studies often stall during the data sharing agreement phase. IRB approvals, Data Use Agreements, and legal reviews can take 6-18 months. Synthetic datasets can usually be shared far sooner: a dataset that contains no real patient records often falls outside human-subjects review, although the institution that trained the generator on real data should confirm this with its own IRB or privacy office. Researchers can develop and validate methods on synthetic data, then run final validation on real data at each institution using federated learning.
Developer Testing and Education
Healthcare software developers need realistic data to test EHR integrations, build dashboards, and train new team members. Using real patient data for development often conflicts with HIPAA's minimum necessary standard. Synthetic data provides realistic clinical scenarios with far less compliance risk.
Synthetic Data Generation Tools

| Tool | Data Type | Approach | Healthcare Focus | Best For |
|---|---|---|---|---|
| Synthea | FHIR patient records | Rule-based simulation | Native (built for healthcare) | Realistic patient journeys, EHR testing |
| Gretel.ai | Tabular, text | Neural network (LSTM-based) | General (configurable) | Distribution-preserving synthesis at scale |
| CTGAN/TVAE | Tabular | GAN/VAE | General (open source) | Custom clinical tabular datasets |
| Stable Diffusion | Medical imaging | Diffusion model | Configurable with medical fine-tuning | Synthetic X-rays, pathology slides, dermatology |
Synthea: The Gold Standard for FHIR Data
Synthea is an open-source patient generator that creates realistic synthetic FHIR patient records. Unlike statistical models, Synthea uses clinically-validated disease modules that simulate patient journeys over time: a synthetic patient might develop Type 2 diabetes at age 45, progress to diabetic retinopathy at 52, and experience a cardiovascular event at 58 — following the actual clinical progression probabilities from published literature.
Synthea generates complete FHIR Bundles including Patient, Condition, Observation, MedicationRequest, Encounter, and Procedure resources. This makes it ideal for testing FHIR implementations and training developers on clinical data workflows.
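A quick way to sanity-check an exported Bundle is to count its resources by type. The sketch below uses a hand-written miniature Bundle as a stand-in for Synthea output (it is not real Synthea data), and `count_resource_types` is a hypothetical helper name.

```python
import json
from collections import Counter


def count_resource_types(bundle: dict) -> Counter:
    """Count the resources in a FHIR Bundle by their resourceType."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )


# Hand-written miniature Bundle for illustration only.
SAMPLE_BUNDLE_JSON = """
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1", "gender": "female"}},
    {"resource": {"resourceType": "Encounter", "id": "e1", "status": "finished"}},
    {"resource": {"resourceType": "Condition", "id": "c1",
                  "code": {"text": "Type 2 diabetes mellitus"}}},
    {"resource": {"resourceType": "Observation", "id": "o1",
                  "code": {"text": "Hemoglobin A1c"},
                  "valueQuantity": {"value": 7.2, "unit": "%"}}},
    {"resource": {"resourceType": "Observation", "id": "o2",
                  "code": {"text": "Systolic blood pressure"},
                  "valueQuantity": {"value": 138, "unit": "mm[Hg]"}}}
  ]
}
"""

bundle = json.loads(SAMPLE_BUNDLE_JSON)
counts = count_resource_types(bundle)
print(dict(counts))
```

To run it against a real export, load the file with `json.load` instead of parsing the inline string.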
CTGAN and TVAE: Deep Learning for Tabular Clinical Data
CTGAN (Conditional Tabular GAN) and TVAE (Tabular Variational Autoencoder) are two widely used deep learning approaches for generating synthetic tabular data. They learn an approximation of the joint probability distribution of all columns in a dataset and generate new rows that preserve correlations between variables.
For healthcare, CTGAN is particularly valuable because it handles mixed data types (continuous lab values, categorical diagnoses, binary flags) and can model the complex correlations that exist in clinical data — for example, the relationship between HbA1c levels, fasting glucose, BMI, and the probability of a diabetes diagnosis.
Building a CTGAN Training Pipeline for Clinical Data
Here is a complete example of training CTGAN on clinical tabular data and generating synthetic patient records for a readmission prediction dataset.

# synthetic_clinical_data.py — CTGAN for Healthcare Tabular Data
import pandas as pd
import numpy as np
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier


def prepare_clinical_dataset():
    """Prepare a clinical dataset for synthetic generation.

    In production, this loads from your EHR data warehouse.
    Here we create a realistic example structure.
    """
    np.random.seed(42)
    n_patients = 10000
    data = pd.DataFrame({
        # Demographics
        "age": np.random.normal(65, 15, n_patients).clip(18, 100).astype(int),
        "sex": np.random.choice(["M", "F"], n_patients, p=[0.48, 0.52]),
        "race": np.random.choice(
            ["White", "Black", "Hispanic", "Asian", "Other"],
            n_patients, p=[0.58, 0.22, 0.12, 0.05, 0.03]
        ),
        # Vitals at discharge
        "systolic_bp": np.random.normal(130, 20, n_patients).clip(80, 220).astype(int),
        "heart_rate": np.random.normal(78, 15, n_patients).clip(40, 150).astype(int),
        "spo2": np.random.normal(96, 2, n_patients).clip(85, 100).round(1),
        # Lab results
        "hba1c": np.random.normal(6.5, 1.8, n_patients).clip(4.0, 14.0).round(1),
        "creatinine": np.random.lognormal(0.1, 0.4, n_patients).clip(0.5, 12.0).round(2),
        "hemoglobin": np.random.normal(12.5, 2.0, n_patients).clip(5.0, 18.0).round(1),
        "wbc": np.random.normal(8.0, 3.0, n_patients).clip(2.0, 30.0).round(1),
        # Clinical history
        "prior_admissions_12m": np.random.poisson(1.2, n_patients),
        "ed_visits_12m": np.random.poisson(0.8, n_patients),
        "num_medications": np.random.poisson(5, n_patients),
        "has_diabetes": np.random.binomial(1, 0.30, n_patients),
        "has_chf": np.random.binomial(1, 0.15, n_patients),
        "has_copd": np.random.binomial(1, 0.12, n_patients),
        "length_of_stay": np.random.lognormal(1.0, 0.7, n_patients).clip(1, 60).astype(int),
    })
    # Generate correlated outcome (readmission)
    risk_score = (
        0.02 * data["age"]
        + 0.5 * data["prior_admissions_12m"]
        + 0.3 * data["ed_visits_12m"]
        + 0.8 * data["has_chf"]
        + 0.4 * data["has_diabetes"]
        + 0.1 * data["hba1c"]
        + 0.3 * data["creatinine"]
        - 0.05 * data["hemoglobin"]
        + np.random.normal(0, 1, n_patients)
    )
    # Label the top 18% of risk scores as 30-day readmissions
    data["readmitted_30d"] = (risk_score > np.percentile(risk_score, 82)).astype(int)
    return data


def train_ctgan_synthesizer(real_data: pd.DataFrame, epochs: int = 300):
    """Train CTGAN on clinical data."""
    # Define metadata (column types and constraints)
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)
    # Override auto-detected types for clinical accuracy
    metadata.update_column("age", sdtype="numerical")
    metadata.update_column("sex", sdtype="categorical")
    metadata.update_column("race", sdtype="categorical")
    metadata.update_column("has_diabetes", sdtype="categorical")
    metadata.update_column("has_chf", sdtype="categorical")
    metadata.update_column("has_copd", sdtype="categorical")
    metadata.update_column("readmitted_30d", sdtype="categorical")
    # Initialize and train CTGAN
    synthesizer = CTGANSynthesizer(
        metadata,
        epochs=epochs,
        batch_size=500,
        generator_dim=(256, 256),
        discriminator_dim=(256, 256),
        generator_lr=2e-4,
        discriminator_lr=2e-4,
        verbose=True,
    )
    synthesizer.fit(real_data)
    return synthesizer


def generate_synthetic_data(
    synthesizer, n_samples: int, conditions: dict = None
) -> pd.DataFrame:
    """Generate synthetic clinical records.

    Optionally condition on specific values (e.g., generate
    only diabetic patients for rare-condition augmentation).
    """
    if conditions:
        # Conditional generation for targeted augmentation
        condition_df = pd.DataFrame([conditions] * n_samples)
        synthetic = synthesizer.sample_remaining_columns(
            condition_df
        )
    else:
        synthetic = synthesizer.sample(n_samples)
    # Post-generation clinical validation
    synthetic = apply_clinical_constraints(synthetic)
    return synthetic


def apply_clinical_constraints(data: pd.DataFrame) -> pd.DataFrame:
    """Enforce clinical validity constraints on synthetic data.

    CTGAN may generate clinically impossible combinations.
    These rules catch and correct the most common issues.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    # HbA1c and diabetes must be consistent (6.5% is the diagnostic threshold)
    data.loc[
        (data["hba1c"] >= 6.5) & (data["has_diabetes"] == 0),
        "has_diabetes"
    ] = 1
    # SpO2 cannot exceed 100%
    data["spo2"] = data["spo2"].clip(upper=100.0)
    # Age must be >= 18 (adult model)
    data["age"] = data["age"].clip(lower=18)
    # Creatinine must stay above a physiologic floor (also rules out negatives)
    data["creatinine"] = data["creatinine"].clip(lower=0.3)
    return data


def evaluate_synthetic_quality(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    target_col: str = "readmitted_30d",
) -> dict:
    """Evaluate synthetic data across three dimensions."""
    results = {}
    # 1. UTILITY: Train model on synthetic, test on real
    # NOTE: for an unbiased estimate, fit the synthesizer only on the real
    # training split. If it saw the held-out rows, this ratio is optimistic.
    real_encoded = pd.get_dummies(real_data, drop_first=True)
    synth_encoded = pd.get_dummies(synthetic_data, drop_first=True)
    # Align columns (sorted so the feature order is deterministic)
    common_cols = sorted(
        set(real_encoded.columns) & set(synth_encoded.columns)
    )
    feat_cols = [c for c in common_cols if c != target_col]
    # Model trained on real data
    X_real, y_real = real_encoded[feat_cols], real_encoded[target_col]
    X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
        X_real, y_real, test_size=0.2, random_state=42
    )
    clf_real = GradientBoostingClassifier(n_estimators=100)
    clf_real.fit(X_train_r, y_train_r)
    auc_real = roc_auc_score(y_test_r, clf_real.predict_proba(X_test_r)[:, 1])
    # Model trained on synthetic data, tested on real
    X_synth = synth_encoded[feat_cols]
    y_synth = synth_encoded[target_col]
    clf_synth = GradientBoostingClassifier(n_estimators=100)
    clf_synth.fit(X_synth, y_synth)
    auc_synth = roc_auc_score(y_test_r, clf_synth.predict_proba(X_test_r)[:, 1])
    results["utility"] = {
        "auc_real_model": round(auc_real, 4),
        "auc_synthetic_model": round(auc_synth, 4),
        "utility_ratio": round(auc_synth / auc_real, 4),
    }
    # 2. FIDELITY: coarse univariate check (mean shift in units of real std)
    numeric_cols = real_data.select_dtypes(include=[np.number]).columns
    fidelity_scores = {}
    for col in numeric_cols:
        real_mean = real_data[col].mean()
        synth_mean = synthetic_data[col].mean()
        real_std = real_data[col].std()
        mean_diff = abs(real_mean - synth_mean) / (real_std + 1e-8)
        fidelity_scores[col] = round(1 - min(mean_diff, 1), 4)
    results["fidelity"] = {
        "per_column": fidelity_scores,
        "average": round(np.mean(list(fidelity_scores.values())), 4),
    }
    # 3. PRIVACY: distance from each synthetic row to its nearest real row,
    # measured in standardized feature space so no single column dominates
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(real_data[numeric_cols])
    real_numeric = scaler.transform(real_data[numeric_cols])
    synth_numeric = scaler.transform(synthetic_data[numeric_cols])
    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(real_numeric)
    distances, _ = nn.kneighbors(synth_numeric)
    results["privacy"] = {
        "min_distance": round(float(distances.min()), 4),
        "mean_distance": round(float(distances.mean()), 4),
        # Share of synthetic rows that sit almost on top of a real row
        "pct_below_threshold": round(
            float((distances < 0.1).mean()) * 100, 2
        ),
    }
    return results


if __name__ == "__main__":
    # Full pipeline
    print("1. Preparing clinical dataset...")
    real_data = prepare_clinical_dataset()
    print(f" Real data: {len(real_data)} records")
    print("2. Training CTGAN synthesizer...")
    synthesizer = train_ctgan_synthesizer(real_data, epochs=300)
    print("3. Generating synthetic data...")
    synthetic = generate_synthetic_data(synthesizer, n_samples=10000)
    print(f" Synthetic data: {len(synthetic)} records")
    print("4. Evaluating quality...")
    quality = evaluate_synthetic_quality(real_data, synthetic)
    print(f" Utility ratio: {quality['utility']['utility_ratio']}")
    print(f" Fidelity avg: {quality['fidelity']['average']}")
    print(f" Privacy risk: {quality['privacy']['pct_below_threshold']}%")

Synthetic Data Quality: The Trilemma
Evaluating synthetic data requires measuring three properties that are often in tension with each other. Optimizing for any two tends to degrade the third.
Utility
Can ML models trained on synthetic data perform as well as models trained on real data? This is measured by training identical models on real vs. synthetic data and comparing their performance on a held-out real test set. The utility ratio (synthetic AUC / real AUC) should be above 0.90 for the synthetic data to be useful for model development.

Privacy
Can an attacker determine whether a specific real patient was in the training set used to generate the synthetic data? This is measured through membership inference attacks: the attacker, given only the synthetic data, tries to distinguish records that were in the generator's training set from records that were not. An attack AUC close to 0.50 (random chance) means the synthetic data leaks little information about individual training records.
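A full membership inference attack trains a classifier, but a simpler distance-based stand-in captures the same idea: records that sit unusually close to a synthetic row are guessed to be training members. The sketch below is that simplified attack, not the classifier-based version. The function name and the toy Gaussian data are assumptions, and features are assumed to be numeric and already standardized.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors


def membership_inference_auc(members, non_members, synthetic) -> float:
    """AUC of a distance-based attack; about 0.5 means the attacker is guessing."""
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    dist_members, _ = nn.kneighbors(members)
    dist_non_members, _ = nn.kneighbors(non_members)
    # Smaller distance to a synthetic row means a higher membership score.
    scores = -np.concatenate([dist_members.ravel(), dist_non_members.ravel()])
    labels = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    return float(roc_auc_score(labels, scores))


rng = np.random.default_rng(0)
members = rng.normal(size=(500, 5))       # rows the generator was trained on
non_members = rng.normal(size=(500, 5))   # held-out rows, same distribution

# A "generator" that nearly copies its training rows versus one that does not.
leaky_synthetic = members + rng.normal(scale=0.01, size=members.shape)
independent_synthetic = rng.normal(size=(500, 5))

auc_leaky = membership_inference_auc(members, non_members, leaky_synthetic)
auc_independent = membership_inference_auc(members, non_members, independent_synthetic)
print(f"Leaky generator attack AUC:       {auc_leaky:.2f}")
print(f"Independent generator attack AUC: {auc_independent:.2f}")
```

Running the attack requires a held-out set of real records that the generator never saw, so that split has to be made before the synthesizer is trained.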

Fidelity
Do the statistical properties of the synthetic data match the real data? Fidelity is measured at multiple levels: univariate (each column's distribution matches), bivariate (pairwise correlations are preserved), and multivariate (higher-order relationships hold). High fidelity is necessary for utility but can conflict with privacy — perfectly faithful synthetic data might memorize real records.
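The gap between univariate and bivariate fidelity is easy to demonstrate. The sketch below computes a per-column Kolmogorov-Smirnov statistic with `scipy.stats.ks_2samp` and the largest absolute difference between the two correlation matrices. The `fidelity_report` name and the toy data are assumptions; the "synthetic" table is the real one with a single column shuffled, so every marginal matches exactly while the correlation is destroyed.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def fidelity_report(real: pd.DataFrame, synth: pd.DataFrame) -> dict:
    """Univariate (KS statistic) and bivariate (correlation gap) fidelity checks."""
    numeric = real.select_dtypes(include="number").columns
    ks = {col: float(ks_2samp(real[col], synth[col]).statistic) for col in numeric}
    corr_gap = (real[numeric].corr() - synth[numeric].corr()).abs().to_numpy().max()
    return {"ks_statistic": ks, "max_corr_gap": float(corr_gap)}


rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)   # correlation with x is about 0.8
real = pd.DataFrame({"x": x, "y": y})

# Shuffling one column keeps its distribution but breaks its link to the other.
shuffled = real.copy()
shuffled["y"] = rng.permutation(real["y"].to_numpy())

report = fidelity_report(real, shuffled)
print(report)
```

A dataset can pass every per-column test and still be useless for modeling, which is why the correlation check belongs alongside the marginal checks.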
| Metric | What It Measures | Target Value | How to Compute |
|---|---|---|---|
| Utility Ratio | Model performance: synthetic vs real training | Greater than 0.90 | AUC(synthetic-trained) / AUC(real-trained) |
| Membership Inference AUC | Can attacker identify training records? | Close to 0.50 | Train classifier on in-vs-out membership |
| Attribute Inference Accuracy | Can attacker infer sensitive attributes? | Close to random baseline | Predict withheld attribute from remaining |
| Statistical Fidelity | Distribution match (KS test, correlation) | Small KS statistic per column; KS p-value greater than 0.05 on moderate samples (p-values shrink as sample size grows) | Kolmogorov-Smirnov test per column |
| Nearest Neighbor Distance | How close is nearest real record? | Greater than 5th percentile of real-to-real distances | L2 distance in normalized feature space |
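
The nearest-neighbor row of the table can be sketched as follows: compute how far each real record sits from its closest other real record, take the 5th percentile as a threshold, and report the share of synthetic records that fall closer to a real record than that. The function name and toy data are assumptions; features are standardized with scikit-learn's `StandardScaler` so that distances are comparable across columns.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def share_closer_than_real_neighbors(real, synthetic, percentile: float = 5.0) -> float:
    """Fraction of synthetic rows nearer to a real row than real rows usually are to each other."""
    scaler = StandardScaler().fit(real)
    real_z = scaler.transform(real)
    synth_z = scaler.transform(synthetic)

    nn = NearestNeighbors(n_neighbors=2).fit(real_z)
    # Column 0 is each real row's distance to itself (0); column 1 is its nearest other row.
    real_to_real = nn.kneighbors(real_z)[0][:, 1]
    threshold = np.percentile(real_to_real, percentile)

    synth_to_real = nn.kneighbors(synth_z, n_neighbors=1)[0][:, 0]
    return float((synth_to_real < threshold).mean())


rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 4))
independent = rng.normal(size=(1000, 4))                        # fresh draws
near_copies = real + rng.normal(scale=1e-3, size=real.shape)    # memorized rows

share_independent = share_closer_than_real_neighbors(real, independent)
share_copies = share_closer_than_real_neighbors(real, near_copies)
print(f"Independent synthetic rows below threshold: {share_independent:.1%}")
print(f"Near-copy synthetic rows below threshold:   {share_copies:.1%}")
```

A share far above the chosen percentile suggests the generator is reproducing training rows and should be investigated before the data is released.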
FDA Guidance on Synthetic Data
The FDA has not issued definitive guidance specifically on synthetic data for SaMD validation, but several published frameworks and discussion papers indicate the agency's evolving position.

Common readings of FDA published materials (interpretations, not formal agency positions):
- Development and augmentation: Synthetic data is generally treated as acceptable for initial model development, hyperparameter tuning, and augmenting rare classes during training. FDA review centers on the final device's demonstrated performance, though submissions are still expected to describe the training data.
- Pre-submission testing: Synthetic data can supplement real-data testing for pre-submission evaluations, but the FDA expects to see real-data validation results as well. A model validated solely on synthetic data would be unlikely to satisfy 510(k) requirements.
- Clinical validation: Synthetic data cannot replace real patient data for pivotal clinical studies. The FDA expects evidence of clinical performance on actual patients in intended-use conditions. For more on FDA requirements for clinical AI, see our detailed guide.
- Post-market surveillance: Synthetic data may be used for ongoing monitoring and drift detection simulations, but real-world performance data must also be collected.
Practical Recommendations

- Start with Synthea for FHIR testing. If your need is developer testing and EHR integration, Synthea provides clinically realistic FHIR data with zero privacy risk and no model training required.
- Use CTGAN for ML training augmentation. When you have some real data but not enough (especially for rare conditions), CTGAN can generate additional training samples that preserve clinical correlations.
- Always validate on real data. Never report model performance based solely on synthetic test sets. The synthetic data trains the model; real data validates it.
- Measure all three quality dimensions. A synthetic dataset with high fidelity but poor privacy is dangerous. One with high privacy but poor utility is useless. Track the trilemma explicitly.
- Document everything for regulatory review. If your model training pipeline uses synthetic data, the model registry should record: which synthetic generator was used, what percentage of training data was synthetic, and the quality metrics of the synthetic dataset.
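
The last recommendation can be made concrete with a small provenance record stored next to each trained model. This is a minimal sketch: the `SyntheticDataProvenance` class, its field names, and the metric values are hypothetical placeholders, not a registry standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SyntheticDataProvenance:
    """What a model registry entry might record about synthetic training data."""
    generator: str                    # which synthetic generator was used
    generator_version: str
    synthetic_fraction: float         # share of training rows that were synthetic
    quality_metrics: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 0.0 <= self.synthetic_fraction <= 1.0:
            raise ValueError("synthetic_fraction must be between 0 and 1")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values for illustration only.
record = SyntheticDataProvenance(
    generator="CTGAN",
    generator_version="unspecified",
    synthetic_fraction=0.8,
    quality_metrics={
        "utility_ratio": 0.93,
        "membership_inference_auc": 0.52,
        "fidelity_average": 0.95,
    },
)
print(record.to_json())
```

Serializing to JSON keeps the record easy to attach as metadata in whatever model registry is already in use.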
Frequently Asked Questions
Is synthetic data considered PHI under HIPAA?
Generally no, provided it is generated properly. Fully rule-based data such as Synthea output is not derived from patient records at all. Data sampled from a model trained on real records (CTGAN, for example) is derived from PHI, so it falls outside HIPAA's definition only if it neither reproduces nor allows re-identification of individuals. If the generation process is poorly implemented and memorizes real records, those memorized records are still PHI. Always run membership inference tests to verify.
Can I share synthetic data freely between institutions?
Usually yes, with appropriate validation. Synthetic data that has been shown to contain no real patient information typically does not require Data Use Agreements, IRB approval, or BAAs for sharing, though institutional policies vary and are worth confirming with your privacy office. Best practice is to include a data sheet documenting the generation method, quality metrics, and intended use limitations.
How much real data do I need to train a good synthetic generator?
CTGAN typically needs at least 1,000 records to learn meaningful distributions for tabular data with 15-20 features. For complex datasets with many features and rare categories, 5,000-10,000 records produce better results. Below 500 records, rule-based generators like Synthea are more reliable than learned models.
What about synthetic medical imaging?
Diffusion models can generate synthetic medical images (chest X-rays, pathology slides, dermatology images), but the technology is less mature than tabular synthesis. The main challenge is ensuring clinical accuracy — a synthetic chest X-ray showing a pneumothorax must have radiologically correct features, not just visually plausible ones. Expert radiologist review of generated images is essential before use in training.
Does mixing synthetic and real data improve or degrade model performance?
Generally, mixing improves performance when real data is scarce. A 2023 study in JAMIA showed that augmenting 500 real records with 2,000 synthetic records improved AUC by 0.08 compared to training on 500 real records alone. However, beyond a 4:1 synthetic-to-real ratio, returns diminish and can even degrade performance as synthetic artifacts dominate the training signal.
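One way to apply the ratio guidance is to cap the synthetic share when assembling the training set. The sketch below assumes pandas DataFrames with matching columns; the `mix_real_and_synthetic` helper, the `is_synthetic` flag column, and the default cap of 4.0 (taken from the ratio quoted above) are illustrative choices.

```python
import pandas as pd


def mix_real_and_synthetic(
    real: pd.DataFrame,
    synthetic: pd.DataFrame,
    max_ratio: float = 4.0,
    seed: int = 0,
) -> pd.DataFrame:
    """Combine real and synthetic rows, capping synthetic rows at max_ratio x real rows."""
    cap = int(len(real) * max_ratio)
    if len(synthetic) > cap:
        synthetic = synthetic.sample(n=cap, random_state=seed)
    mixed = pd.concat(
        [real.assign(is_synthetic=0), synthetic.assign(is_synthetic=1)],
        ignore_index=True,
    )
    # Shuffle so real and synthetic rows are interleaved during training.
    return mixed.sample(frac=1.0, random_state=seed).reset_index(drop=True)


real_rows = pd.DataFrame({"age": range(500), "readmitted_30d": [0, 1] * 250})
synthetic_rows = pd.DataFrame({"age": range(5000), "readmitted_30d": [0, 1] * 2500})

training_set = mix_real_and_synthetic(real_rows, synthetic_rows)
print(len(training_set), int(training_set["is_synthetic"].sum()))
```

Keeping the `is_synthetic` flag makes it possible to drop synthetic rows from validation folds, so reported performance is still measured on real data only.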
Conclusion
Synthetic data is not a replacement for real clinical data; it is a complement that fills specific gaps where real data is insufficient, inaccessible, or prohibited. The technology has matured to the point where synthetic datasets can achieve 85-95% of real-data utility for most clinical ML tasks, while offering privacy protections that, once verified by testing, simplify compliance and accelerate research collaboration.
The practical path forward is clear: use Synthea for FHIR testing and development, CTGAN for tabular ML training augmentation, and always validate on real data. Measure the trilemma (utility, privacy, fidelity) explicitly, document synthetic data usage in your model registry, and stay informed on evolving FDA guidance. Organizations that build synthetic data capabilities now will have a significant advantage in scaling their healthcare AI programs — especially for rare diseases and multi-institutional research where real data limitations are the primary bottleneck.



