Synthetic Data for ML Training in Healthcare: When Real Data Isn't Enough (or Isn't Allowed)

March 16, 2026

14 min read

The Data Scarcity Problem in Healthcare ML

Healthcare machine learning has a data problem that no amount of data engineering can solve. In the US, a rare disease is by definition one that affects fewer than 200,000 people, yet building a diagnostic model can require tens of thousands of labeled examples. New hospitals open their doors with no historical data for predictive models. Research collaborations need datasets that can be shared across institutions, but IRB approvals take months and HIPAA de-identification is imperfect.

Traditional solutions — data augmentation, transfer learning, few-shot learning — help, but they have limits. You cannot augment your way to a representative dataset when you have 47 confirmed cases of a rare autoimmune condition. Transfer learning from a general medical dataset misses the specific patterns of your patient population. And few-shot learning, while promising, has not achieved clinical-grade performance for most tasks.

Synthetic data offers a fundamentally different approach: generate artificial patient records that preserve the statistical properties of real data without containing any actual patient information. Done correctly, a model trained on synthetic data can approach the performance of one trained on real data — while eliminating the privacy, regulatory, and access barriers that make real data so difficult to work with.

This is distinct from data de-identification, which transforms real records to remove identifiers. Synthetic data is generated from scratch — no real patient record exists in the synthetic dataset, even in modified form.

The synthetic data pipeline: extract statistical patterns from real data, generate new records, and validate that they preserve utility while protecting privacy.

When Synthetic Data Wins

Synthetic data is not a universal replacement for real data. It excels in specific scenarios where real data is insufficient, inaccessible, or prohibited.

Synthetic data is most valuable when real data volume is insufficient, sharing is restricted, or class balance is severely skewed.

Rare Disease Modeling (Class Imbalance)

A hospital system with 50,000 annual admissions might see 30 cases of Addison's disease per year. Over five years, that is 150 positive cases against roughly 250,000 negative cases, a 0.06% positive rate. Class weighting or SMOTE oversampling is unlikely to produce a reliable model from 150 examples. Synthetic generation can create thousands of statistically plausible Addison's cases, preserving the correlations between cortisol levels, electrolyte imbalances, and clinical presentations observed in the real cases.
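The arithmetic behind that estimate can be checked in a few lines. This is a minimal sketch: the function name `positive_rate` is illustrative, and the counts are the hypothetical figures from the paragraph above, not measurements from a real hospital.

```python
def positive_rate(cases_per_year: int, admissions_per_year: int, years: int) -> dict:
    """Summarize class balance for a rare condition over several years."""
    positives = cases_per_year * years
    total = admissions_per_year * years
    return {
        "positives": positives,
        "negatives": total - positives,
        "positive_rate_pct": round(100 * positives / total, 2),
    }


# Hypothetical figures from the text: 30 cases/year, 50,000 admissions/year, 5 years
summary = positive_rate(cases_per_year=30, admissions_per_year=50_000, years=5)
print(summary)
```

At this rate a classifier that always predicts "negative" is right more than 99.9% of the time, which is why accuracy alone says little about a rare-disease model.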

New Hospitals and Health Systems

When a new hospital opens or a health system deploys its first predictive analytics platform, there is no historical data to train on. Synthetic data generated from similar institutions' statistical profiles (not their raw data) can bootstrap initial models. These models are replaced with locally-trained versions as real data accumulates, but synthetic data eliminates the cold-start period.
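One way to picture bootstrapping from a statistical profile is sampling each column from shared summary statistics. The sketch below is an assumption-heavy illustration: `PEER_PROFILE` and `sample_from_profile` are hypothetical names, the numbers reuse the example distributions from the pipeline later in this article, and each column is sampled independently, so cross-column correlations are not preserved.

```python
import random

# Hypothetical aggregate profile shared by a peer institution:
# column -> (mean, standard deviation, lower bound, upper bound)
PEER_PROFILE = {
    "age": (65.0, 15.0, 18.0, 100.0),
    "systolic_bp": (130.0, 20.0, 80.0, 220.0),
    "hba1c": (6.5, 1.8, 4.0, 14.0),
}


def sample_from_profile(profile: dict, n: int, seed: int = 42) -> list[dict]:
    """Draw n records, sampling each column independently from a clipped normal."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        record = {}
        for column, (mean, sd, low, high) in profile.items():
            value = rng.gauss(mean, sd)
            # Clip to the plausible range supplied with the profile
            record[column] = round(min(max(value, low), high), 1)
        records.append(record)
    return records


bootstrap = sample_from_profile(PEER_PROFILE, n=1000)
print(bootstrap[0])
```

A profile that also carried a covariance matrix, or a trained generator shared in place of raw data, would be needed to keep relationships such as age versus blood pressure.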

Research Data Sharing

Multi-institutional research studies often stall during the data sharing agreement phase. IRB approvals, Data Use Agreements, and legal reviews can take 6-18 months. Synthetic datasets can typically be shared far sooner, because a properly generated dataset contains no real patient records, although institutional policy still determines whether any review is required. Researchers can develop and validate methods on synthetic data, then run final validation on real data at each institution in a federated setup, so the real records never leave their home site.

Developer Testing and Education

Healthcare software developers need realistic data to test EHR integrations, build dashboards, and train new team members. Using real patient data for routine development can conflict with HIPAA's minimum necessary standard. Synthetic data provides realistic clinical scenarios with far less compliance risk.

Synthetic Data Generation Tools

Four tools dominate healthcare synthetic data generation, each optimized for different data types and use cases.

Tool	Data Type	Approach	Healthcare Focus	Best For
Synthea	FHIR patient records	Rule-based simulation	Native (built for healthcare)	Realistic patient journeys, EHR testing
Gretel.ai	Tabular, text	Neural network (LSTM-based)	General (configurable)	Distribution-preserving synthesis at scale
CTGAN/TVAE	Tabular	GAN/VAE	General (open source)	Custom clinical tabular datasets
Stable Diffusion	Medical imaging	Diffusion model	Configurable with medical fine-tuning	Synthetic X-rays, pathology slides, dermatology

Synthea: The Gold Standard for FHIR Data

Synthea is an open-source patient generator that creates realistic synthetic FHIR patient records. Unlike statistical models, Synthea uses clinically-validated disease modules that simulate patient journeys over time: a synthetic patient might develop Type 2 diabetes at age 45, progress to diabetic retinopathy at 52, and experience a cardiovascular event at 58 — following the actual clinical progression probabilities from published literature.

Synthea generates complete FHIR Bundles including Patient, Condition, Observation, MedicationRequest, Encounter, and Procedure resources. This makes it ideal for testing FHIR implementations and training developers on clinical data workflows.
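To make the Bundle structure concrete, the sketch below walks a small hand-written Bundle and counts resources by type. The `SAMPLE_BUNDLE` content is a hypothetical stand-in, not actual Synthea output, and the helper names are illustrative; a real Synthea file has the same `entry[].resource.resourceType` layout but far more fields.

```python
import json
from collections import Counter

# Hand-written stand-in for a Synthea FHIR Bundle (hypothetical content)
SAMPLE_BUNDLE = json.loads("""
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1", "gender": "female"}},
    {"resource": {"resourceType": "Encounter", "id": "e1"}},
    {"resource": {"resourceType": "Condition", "id": "c1",
                  "code": {"text": "Type 2 diabetes mellitus"}}},
    {"resource": {"resourceType": "Observation", "id": "o1",
                  "code": {"text": "Hemoglobin A1c"},
                  "valueQuantity": {"value": 7.2, "unit": "%"}}},
    {"resource": {"resourceType": "Observation", "id": "o2",
                  "code": {"text": "Glucose"},
                  "valueQuantity": {"value": 140, "unit": "mg/dL"}}}
  ]
}
""")


def count_resource_types(bundle: dict) -> Counter:
    """Count the resources in a FHIR Bundle by resourceType."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )


def numeric_observations(bundle: dict) -> list[tuple[str, float, str]]:
    """Return (label, value, unit) for Observations that carry a valueQuantity."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        quantity = resource.get("valueQuantity")
        if resource.get("resourceType") == "Observation" and quantity:
            rows.append((resource["code"]["text"], quantity["value"], quantity["unit"]))
    return rows


resource_counts = count_resource_types(SAMPLE_BUNDLE)
lab_rows = numeric_observations(SAMPLE_BUNDLE)
print(resource_counts)
print(lab_rows)
```

A count like this is a quick sanity check when loading generated files into a test FHIR server: missing resource types usually point to an export or module configuration issue.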

CTGAN and TVAE: Deep Learning for Tabular Clinical Data

CTGAN (Conditional Tabular GAN) and TVAE (Tabular Variational Autoencoder) are two of the most widely used deep learning approaches for generating synthetic tabular data. They learn an approximation of the joint probability distribution of all columns in a dataset and generate new rows that preserve correlations between variables.

For healthcare, CTGAN is particularly valuable because it handles mixed data types (continuous lab values, categorical diagnoses, binary flags) and can model the complex correlations that exist in clinical data — for example, the relationship between HbA1c levels, fasting glucose, BMI, and the probability of a diabetes diagnosis.

Building a CTGAN Training Pipeline for Clinical Data

Here is a complete example of training CTGAN on clinical tabular data and generating synthetic patient records for a readmission prediction dataset.

The synthetic data trilemma: maximizing utility, privacy, and fidelity simultaneously is the core engineering challenge.

# synthetic_clinical_data.py — CTGAN for Healthcare Tabular Data
import pandas as pd
import numpy as np
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier


def prepare_clinical_dataset():
    """Prepare a clinical dataset for synthetic generation.
    
    In production, this loads from your EHR data warehouse.
    Here we create a realistic example structure.
    """
    np.random.seed(42)
    n_patients = 10000

    data = pd.DataFrame({
        # Demographics
        "age": np.random.normal(65, 15, n_patients).clip(18, 100).astype(int),
        "sex": np.random.choice(["M", "F"], n_patients, p=[0.48, 0.52]),
        "race": np.random.choice(
            ["White", "Black", "Hispanic", "Asian", "Other"],
            n_patients, p=[0.58, 0.22, 0.12, 0.05, 0.03]
        ),

        # Vitals at discharge
        "systolic_bp": np.random.normal(130, 20, n_patients).clip(80, 220).astype(int),
        "heart_rate": np.random.normal(78, 15, n_patients).clip(40, 150).astype(int),
        "spo2": np.random.normal(96, 2, n_patients).clip(85, 100).round(1),

        # Lab results
        "hba1c": np.random.normal(6.5, 1.8, n_patients).clip(4.0, 14.0).round(1),
        "creatinine": np.random.lognormal(0.1, 0.4, n_patients).clip(0.5, 12.0).round(2),
        "hemoglobin": np.random.normal(12.5, 2.0, n_patients).clip(5.0, 18.0).round(1),
        "wbc": np.random.normal(8.0, 3.0, n_patients).clip(2.0, 30.0).round(1),

        # Clinical history
        "prior_admissions_12m": np.random.poisson(1.2, n_patients),
        "ed_visits_12m": np.random.poisson(0.8, n_patients),
        "num_medications": np.random.poisson(5, n_patients),
        "has_diabetes": np.random.binomial(1, 0.30, n_patients),
        "has_chf": np.random.binomial(1, 0.15, n_patients),
        "has_copd": np.random.binomial(1, 0.12, n_patients),
        "length_of_stay": np.random.lognormal(1.0, 0.7, n_patients).clip(1, 60).astype(int),
    })

    # Generate correlated outcome (readmission)
    risk_score = (
        0.02 * data["age"]
        + 0.5 * data["prior_admissions_12m"]
        + 0.3 * data["ed_visits_12m"]
        + 0.8 * data["has_chf"]
        + 0.4 * data["has_diabetes"]
        + 0.1 * data["hba1c"]
        + 0.3 * data["creatinine"]
        - 0.05 * data["hemoglobin"]
        + np.random.normal(0, 1, n_patients)
    )
    data["readmitted_30d"] = (risk_score > np.percentile(risk_score, 82)).astype(int)

    return data


def train_ctgan_synthesizer(real_data: pd.DataFrame, epochs: int = 300):
    """Train CTGAN on clinical data."""
    # Define metadata (column types and constraints)
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)

    # Override auto-detected types for clinical accuracy
    metadata.update_column("age", sdtype="numerical")
    metadata.update_column("sex", sdtype="categorical")
    metadata.update_column("race", sdtype="categorical")
    metadata.update_column("has_diabetes", sdtype="categorical")
    metadata.update_column("has_chf", sdtype="categorical")
    metadata.update_column("has_copd", sdtype="categorical")
    metadata.update_column("readmitted_30d", sdtype="categorical")

    # Initialize and train CTGAN
    synthesizer = CTGANSynthesizer(
        metadata,
        epochs=epochs,
        batch_size=500,
        generator_dim=(256, 256),
        discriminator_dim=(256, 256),
        generator_lr=2e-4,
        discriminator_lr=2e-4,
        verbose=True,
    )

    synthesizer.fit(real_data)
    return synthesizer


def generate_synthetic_data(
    synthesizer, n_samples: int, conditions: dict = None
) -> pd.DataFrame:
    """Generate synthetic clinical records.
    
    Optionally condition on specific values (e.g., generate
    only diabetic patients for rare-condition augmentation).
    """
    if conditions:
        # Conditional generation for targeted augmentation
        condition_df = pd.DataFrame([conditions] * n_samples)
        synthetic = synthesizer.sample_remaining_columns(
            condition_df
        )
    else:
        synthetic = synthesizer.sample(n_samples)

    # Post-generation clinical validation
    synthetic = apply_clinical_constraints(synthetic)
    return synthetic


def apply_clinical_constraints(data: pd.DataFrame) -> pd.DataFrame:
    """Enforce clinical validity constraints on synthetic data.
    
    CTGAN may generate clinically impossible combinations.
    These rules catch and correct the most common issues.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # HbA1c and diabetes must be consistent
    data.loc[
        (data["hba1c"] >= 6.5) & (data["has_diabetes"] == 0),
        "has_diabetes"
    ] = 1

    # SpO2 cannot exceed 100%
    data["spo2"] = data["spo2"].clip(upper=100.0)

    # Age must be >= 18 (adult model)
    data["age"] = data["age"].clip(lower=18)

    # Creatinine must stay above a plausible floor (mg/dL), which also rules out negatives
    data["creatinine"] = data["creatinine"].clip(lower=0.3)

    return data


def evaluate_synthetic_quality(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    target_col: str = "readmitted_30d",
) -> dict:
    """Evaluate synthetic data across three dimensions."""
    results = {}

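    # 1. UTILITY: Train model on synthetic, test on real
    # Caveat: if the synthesizer was fit on all of real_data, it has seen the
    # real test rows below, so this comparison is optimistic. For a stricter
    # check, hold out the test set before fitting the synthesizer.
    real_encoded = pd.get_dummies(real_data, drop_first=True)
    synth_encoded = pd.get_dummies(synthetic_data, drop_first=True)

    # Align columns (sorted so feature order is deterministic across runs)
    common_cols = sorted(
        set(real_encoded.columns) & set(synth_encoded.columns)
    )
    feat_cols = [c for c in common_cols if c != target_col]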

    # Model trained on real data
    X_real, y_real = real_encoded[feat_cols], real_encoded[target_col]
    X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
        X_real, y_real, test_size=0.2, random_state=42
    )
    clf_real = GradientBoostingClassifier(n_estimators=100)
    clf_real.fit(X_train_r, y_train_r)
    auc_real = roc_auc_score(y_test_r, clf_real.predict_proba(X_test_r)[:, 1])

    # Model trained on synthetic data, tested on real
    X_synth = synth_encoded[feat_cols]
    y_synth = synth_encoded[target_col]
    clf_synth = GradientBoostingClassifier(n_estimators=100)
    clf_synth.fit(X_synth, y_synth)
    auc_synth = roc_auc_score(y_test_r, clf_synth.predict_proba(X_test_r)[:, 1])

    results["utility"] = {
        "auc_real_model": round(auc_real, 4),
        "auc_synthetic_model": round(auc_synth, 4),
        "utility_ratio": round(auc_synth / auc_real, 4),
    }

    # 2. FIDELITY: Statistical similarity
    numeric_cols = real_data.select_dtypes(include=[np.number]).columns
    fidelity_scores = {}
    for col in numeric_cols:
        # Score = 1 - |difference in means|, measured in real standard deviations
        real_mean = real_data[col].mean()
        synth_mean = synthetic_data[col].mean()
        real_std = real_data[col].std()
        mean_diff = abs(real_mean - synth_mean) / (real_std + 1e-8)
        fidelity_scores[col] = round(1 - min(mean_diff, 1), 4)
    results["fidelity"] = {
        "per_column": fidelity_scores,
        "average": round(np.mean(list(fidelity_scores.values())), 4),
    }

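    # 3. PRIVACY: Distance from each synthetic record to its nearest real record
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler
    # Standardize first so wide-range columns (e.g., systolic_bp) do not
    # dominate the L2 distance
    scaler = StandardScaler().fit(real_data[numeric_cols])
    real_numeric = scaler.transform(real_data[numeric_cols])
    synth_numeric = scaler.transform(synthetic_data[numeric_cols])
    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(real_numeric)
    distances, _ = nn.kneighbors(synth_numeric)
    results["privacy"] = {
        "min_distance": round(float(distances.min()), 4),
        "mean_distance": round(float(distances.mean()), 4),
        # 0.1 is an illustrative cutoff in standardized units, not a validated threshold
        "pct_below_threshold": round(
            float((distances < 0.1).mean()) * 100, 2
        ),
    }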

    return results


if __name__ == "__main__":
    # Full pipeline
    print("1. Preparing clinical dataset...")
    real_data = prepare_clinical_dataset()
    print(f"   Real data: {len(real_data)} records")

    print("2. Training CTGAN synthesizer...")
    synthesizer = train_ctgan_synthesizer(real_data, epochs=300)

    print("3. Generating synthetic data...")
    synthetic = generate_synthetic_data(synthesizer, n_samples=10000)
    print(f"   Synthetic data: {len(synthetic)} records")

    print("4. Evaluating quality...")
    quality = evaluate_synthetic_quality(real_data, synthetic)
    print(f"   Utility ratio: {quality['utility']['utility_ratio']}")
    print(f"   Fidelity avg:  {quality['fidelity']['average']}")
    print(f"   Privacy risk:  {quality['privacy']['pct_below_threshold']}%")

Synthetic Data Quality: The Trilemma

Evaluating synthetic data requires measuring three properties that are often in tension with each other. Optimizing for any two tends to degrade the third.

Utility

Can ML models trained on synthetic data perform as well as models trained on real data? This is measured by training identical models on real vs. synthetic data and comparing their performance on a held-out real test set. The utility ratio (synthetic AUC / real AUC) should be above 0.90 for the synthetic data to be useful for model development.
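The utility ratio itself is simple to compute once both models have scored the same held-out real test set. The sketch below uses a small rank-based AUC so it runs without dependencies; the function names and the toy scores are illustrative assumptions, and in practice `sklearn.metrics.roc_auc_score` would do the same job.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Ties count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def utility_ratio(labels, real_model_scores, synth_model_scores, threshold=0.90):
    """Compare a synthetic-trained model with a real-trained one on the same test set."""
    auc_real = roc_auc(labels, real_model_scores)
    auc_synth = roc_auc(labels, synth_model_scores)
    ratio = auc_synth / auc_real
    return {"auc_real": auc_real, "auc_synth": auc_synth,
            "ratio": ratio, "passes": ratio > threshold}


# Toy held-out labels and model scores (made up for illustration)
y_test = [1, 1, 1, 0, 0, 0]
scores_real_model = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
scores_synth_model = [0.9, 0.5, 0.7, 0.6, 0.3, 0.2]

report = utility_ratio(y_test, scores_real_model, scores_synth_model)
print(report)
```

Both models must be evaluated on the same real test rows; comparing AUCs from different test sets makes the ratio meaningless.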

Synthetic data typically achieves 85-95% of real data model performance, with the gap depending on task complexity and data volume.

Privacy

Can an attacker determine whether a specific real patient was in the training set used to generate the synthetic data? This is measured through membership inference attacks: training a classifier to distinguish between records that were in the training set and records that were not. A membership inference AUC close to 0.50 (random chance) means the synthetic data does not leak information about individual training records.
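A full attack trains a classifier, but the idea can be illustrated with a simpler stand-in. The sketch below swaps in a distance-based attack: the attacker guesses "member" when a candidate record lies unusually close to some synthetic record. All names and the two-dimensional toy records are assumptions for illustration.

```python
import math


def nearest_distance(record, pool):
    """Euclidean distance from a record to its closest neighbor in pool."""
    return min(math.dist(record, other) for other in pool)


def membership_inference_auc(members, non_members, synthetic):
    """AUC of a distance-based attack: closer to synthetic data means 'member'."""
    member_scores = [-nearest_distance(r, synthetic) for r in members]
    outsider_scores = [-nearest_distance(r, synthetic) for r in non_members]
    # Rank-based AUC with ties counted as half a win
    wins = sum(
        (m > o) + 0.5 * (m == o)
        for m in member_scores
        for o in outsider_scores
    )
    return wins / (len(member_scores) * len(outsider_scores))


# Toy records: training members vs. held-out patients from the same population
members = [(0.0, 0.0), (2.0, 2.0)]
non_members = [(0.0, 2.0), (2.0, 0.0)]

leaky_synthetic = [(0.0, 0.0), (2.0, 2.0)]   # generator copied its training rows
private_synthetic = [(1.0, 1.0)]             # equally far from every candidate

auc_leaky = membership_inference_auc(members, non_members, leaky_synthetic)
auc_private = membership_inference_auc(members, non_members, private_synthetic)
print(auc_leaky, auc_private)
```

The attack needs a set of real records that were deliberately kept out of generator training; without such a holdout there is nothing to compare members against.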

Three classes of privacy attacks target synthetic data; defenses include differential privacy, minimum population thresholds, and distance-based filtering.
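Of these defenses, a minimum population threshold is the easiest to sketch: before training a generator, fold any category value seen fewer than k times into a catch-all label so that very small groups cannot be singled out. The function name, the `zip3` column, and the default `k=10` are illustrative assumptions, not a regulatory standard.

```python
from collections import Counter


def enforce_minimum_population(records, column, k=10, fallback="Other"):
    """Replace category values seen fewer than k times with a fallback label."""
    counts = Counter(record[column] for record in records)
    rare_values = {value for value, n in counts.items() if n < k}
    return [
        {**record, column: fallback} if record[column] in rare_values else dict(record)
        for record in records
    ]


# Toy input: a 3-digit ZIP prefix with one well-populated and one sparse value
patients = [{"zip3": "021"}] * 12 + [{"zip3": "997"}] * 3

thresholded = enforce_minimum_population(patients, "zip3", k=10)
zip_counts = Counter(p["zip3"] for p in thresholded)
print(zip_counts)
```

The merged fallback group can itself end up below k, so a production version would re-check the counts after merging.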

Fidelity

Do the statistical properties of the synthetic data match the real data? Fidelity is measured at multiple levels: univariate (each column's distribution matches), bivariate (pairwise correlations are preserved), and multivariate (higher-order relationships hold). High fidelity is necessary for utility but can conflict with privacy — perfectly faithful synthetic data might memorize real records.
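The first two levels can be sketched without any libraries: a two-sample Kolmogorov-Smirnov statistic for a single column, and the gap between real and synthetic Pearson correlations for a pair of columns. The helper names and the five-row toy columns are assumptions for illustration; `scipy.stats.ks_2samp` would additionally return a p-value.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    norm_x = sum((xi - mean_x) ** 2 for xi in x) ** 0.5
    norm_y = sum((yi - mean_y) ** 2 for yi in y) ** 0.5
    return cov / (norm_x * norm_y)


# Toy columns (made-up values): HbA1c in % and fasting glucose in mg/dL
real_hba1c = [5.1, 5.6, 6.2, 7.0, 8.4]
real_glucose = [90, 100, 118, 140, 190]
synth_hba1c = [5.0, 5.7, 6.1, 7.2, 8.1]
synth_glucose = [92, 104, 115, 150, 176]

# Univariate: does the synthetic HbA1c distribution track the real one?
ks_hba1c = ks_statistic(real_hba1c, synth_hba1c)

# Bivariate: is the HbA1c-glucose correlation preserved?
correlation_gap = abs(
    pearson(real_hba1c, real_glucose) - pearson(synth_hba1c, synth_glucose)
)
print(ks_hba1c, correlation_gap)
```

Smaller is better for both numbers. Multivariate fidelity is harder to summarize in one statistic and is often judged indirectly, for example through the utility ratio.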

Metric	What It Measures	Target Value	How to Compute
Utility Ratio	Model performance: synthetic vs real training	Greater than 0.90	AUC(synthetic-trained) / AUC(real-trained)
Membership Inference AUC	Can attacker identify training records?	Close to 0.50	Train classifier on in-vs-out membership
Attribute Inference Accuracy	Can attacker infer sensitive attributes?	Close to random baseline	Predict withheld attribute from remaining
Statistical Fidelity	Distribution match (KS test, correlation)	KS p-value greater than 0.05	Kolmogorov-Smirnov test per column
Nearest Neighbor Distance	How close is nearest real record?	Greater than 5th percentile of real-to-real distances	L2 distance in normalized feature space
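The nearest-neighbor row in the table can be sketched as follows: normalize features using the real data's statistics, take the 5th percentile of real-to-real nearest-neighbor distances as the threshold, and flag synthetic records that sit closer to a real record than that. The function names and the six toy (age, HbA1c) records are illustrative assumptions.

```python
import math
import statistics


def zscore_columns(rows, reference):
    """Standardize each column of rows using the mean and std of reference."""
    columns = list(zip(*reference))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def flag_close_records(real, synthetic, percentile_cuts=20):
    """Flag synthetic rows closer to a real row than the 5th percentile
    of real-to-real nearest-neighbor distances."""
    real_z = zscore_columns(real, real)
    synth_z = zscore_columns(synthetic, real)
    # Nearest real neighbor of each real row, excluding the row itself
    real_nn = [
        min(math.dist(row, other) for j, other in enumerate(real_z) if j != i)
        for i, row in enumerate(real_z)
    ]
    # First of 20 cut points is the 5th percentile
    threshold = statistics.quantiles(real_nn, n=percentile_cuts, method="inclusive")[0]
    flags = [
        min(math.dist(row, other) for other in real_z) < threshold
        for row in synth_z
    ]
    return threshold, flags


# Toy (age, HbA1c) records, made up for illustration
real_rows = [(50, 6.0), (60, 7.0), (70, 8.0), (55, 5.5), (65, 9.0), (45, 6.5)]
synthetic_rows = [(60, 7.0), (90, 12.0)]  # an exact copy and a distant record

nn_threshold, too_close = flag_close_records(real_rows, synthetic_rows)
print(nn_threshold, too_close)
```

Flagged records can be dropped or regenerated. The brute-force distance loops are quadratic in the number of rows, so a tree-based index such as scikit-learn's `NearestNeighbors` is the practical choice beyond a few thousand records.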

FDA Guidance on Synthetic Data

The FDA has not issued definitive guidance specifically on synthetic data for SaMD validation, but several published frameworks and discussion papers indicate the agency's evolving position.

The FDA's position on synthetic data varies by use: acceptable for development, conditional for pre-submission, not acceptable as sole evidence for clinical validation.

Key positions from FDA published materials:

Development and augmentation: Synthetic data is acceptable for initial model development, hyperparameter tuning, and augmenting rare classes during training. The FDA does not regulate the training process — it regulates the final device's performance.
Pre-submission testing: Synthetic data can supplement real-data testing for pre-submission evaluations, but the FDA expects to see real-data validation results as well. A model validated solely on synthetic data would not satisfy 510(k) requirements.
Clinical validation: Synthetic data cannot replace real patient data for pivotal clinical studies. The FDA requires evidence of clinical performance on actual patients in intended-use conditions. For more on FDA requirements for clinical AI, see our detailed guide.
Post-market surveillance: Synthetic data may be used for ongoing monitoring and drift detection simulations, but real-world performance data must also be collected.

Practical Recommendations

A phased approach to synthetic data adoption minimizes risk while building organizational capability.

Start with Synthea for FHIR testing. If your need is developer testing and EHR integration, Synthea provides clinically realistic FHIR data with zero privacy risk and no model training required.
Use CTGAN for ML training augmentation. When you have some real data but not enough (especially for rare conditions), CTGAN can generate additional training samples that preserve clinical correlations.
Always validate on real data. Never report model performance based solely on synthetic test sets. The synthetic data trains the model; real data validates it.
Measure all three quality dimensions. A synthetic dataset with high fidelity but poor privacy is dangerous. One with high privacy but poor utility is useless. Track the trilemma explicitly.
Document everything for regulatory review. If your model training pipeline uses synthetic data, the model registry should record: which synthetic generator was used, what percentage of training data was synthetic, and the quality metrics of the synthetic dataset.
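The documentation recommendation above can be sketched as a small provenance record attached to each trained model. The class name, field names, and all values below are hypothetical placeholders, not a standard registry schema or real results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SyntheticDataProvenance:
    """Provenance of synthetic data used in one training run (hypothetical schema)."""
    generator: str
    generator_version: str
    n_real_records: int
    n_synthetic_records: int
    utility_ratio: float
    membership_inference_auc: float
    fidelity_score: float

    @property
    def synthetic_fraction(self) -> float:
        """Share of the training set that is synthetic."""
        total = self.n_real_records + self.n_synthetic_records
        return self.n_synthetic_records / total

    def to_json(self) -> str:
        """Serialize for storage alongside the model in a registry."""
        payload = asdict(self)
        payload["synthetic_fraction"] = round(self.synthetic_fraction, 4)
        return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder values for illustration only
provenance = SyntheticDataProvenance(
    generator="CTGAN",
    generator_version="placeholder-version",
    n_real_records=500,
    n_synthetic_records=2000,
    utility_ratio=0.93,
    membership_inference_auc=0.51,
    fidelity_score=0.95,
)
print(provenance.to_json())
```

Freezing the dataclass keeps a record from being edited after the model is registered, which is the property a reviewer is likely to care about.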

Frequently Asked Questions

Is synthetic data considered PHI under HIPAA?

Generally no. Properly generated synthetic data contains no real patient records and cannot be linked back to an individual, so it falls outside HIPAA's definition of PHI. However, if the generator is poorly implemented and memorizes real records, those memorized records would still be PHI. Always run membership inference tests to verify.

Can I share synthetic data freely between institutions?

Generally yes, with appropriate validation. Because properly generated synthetic data contains no real patient information, it typically does not require Data Use Agreements, IRB approval, or BAAs for sharing. Best practice is still to include a data sheet documenting the generation method, quality metrics, and intended use limitations.

How much real data do I need to train a good synthetic generator?

CTGAN typically needs at least 1,000 records to learn meaningful distributions for tabular data with 15-20 features. For complex datasets with many features and rare categories, 5,000-10,000 records produce better results. Below 500 records, rule-based generators like Synthea are more reliable than learned models.

What about synthetic medical imaging?

Diffusion models can generate synthetic medical images (chest X-rays, pathology slides, dermatology images), but the technology is less mature than tabular synthesis. The main challenge is ensuring clinical accuracy — a synthetic chest X-ray showing a pneumothorax must have radiologically correct features, not just visually plausible ones. Expert radiologist review of generated images is essential before use in training.

Does mixing synthetic and real data improve or degrade model performance?

Generally, mixing improves performance when real data is scarce. A 2023 study in JAMIA showed that augmenting 500 real records with 2,000 synthetic records improved AUC by 0.08 compared to training on 500 real records alone. However, beyond a 4:1 synthetic-to-real ratio, returns diminish and can even degrade performance as synthetic artifacts dominate the training signal.
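A mixing step that respects such a ceiling might look like the sketch below. The function name and the `max_ratio` parameter are illustrative; the default of 4.0 simply mirrors the 4:1 figure quoted above and should be tuned per task.

```python
import random


def mix_training_data(real, synthetic, max_ratio=4.0, seed=0):
    """Combine real and synthetic rows, capping synthetic at max_ratio x real."""
    cap = int(len(real) * max_ratio)
    rng = random.Random(seed)
    # Subsample synthetic rows only when they exceed the cap
    chosen = list(synthetic) if len(synthetic) <= cap else rng.sample(list(synthetic), cap)
    # Tag each row with its source so the split can be audited later
    mixed = [(row, "real") for row in real] + [(row, "synthetic") for row in chosen]
    rng.shuffle(mixed)
    return mixed


# Toy rows: 5 real and 50 synthetic, so the 4:1 cap keeps 20 synthetic rows
real_rows = list(range(5))
synthetic_rows = list(range(100, 150))

training_set = mix_training_data(real_rows, synthetic_rows)
n_synthetic = sum(1 for _, source in training_set if source == "synthetic")
print(len(training_set), n_synthetic)
```

Keeping the source tag on every row makes it straightforward to report the synthetic share of the training set and to rerun experiments at different ratios.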

Conclusion

Synthetic data is not a replacement for real clinical data — it is a complement that fills specific gaps where real data is insufficient, inaccessible, or prohibited. The technology has matured to the point where synthetic datasets can achieve 85-95% of real-data utility for most clinical ML tasks, while providing strong privacy guarantees that simplify compliance and accelerate research collaboration.

The practical path forward is clear: use Synthea for FHIR testing and development, CTGAN for tabular ML training augmentation, and always validate on real data. Measure the trilemma (utility, privacy, fidelity) explicitly, document synthetic data usage in your model registry, and stay informed on evolving FDA guidance. Organizations that build synthetic data capabilities now will have a significant advantage in scaling their healthcare AI programs — especially for rare diseases and multi-institutional research where real data limitations are the primary bottleneck.

Loading blogs...

Synthetic Data for ML Training in Healthcare: When Real Data Isn't Enough (or Isn't Allowed)

March 16, 2026

14 min read

The Data Scarcity Problem in Healthcare ML

The synthetic data pipeline: extract statistical patterns from real data, generate new records, and validate that they preserve utility while protecting privacy.

When Synthetic Data Wins

Synthetic data is not a universal replacement for real data. It excels in specific scenarios where real data is insufficient, inaccessible, or prohibited.

Synthetic data is most valuable when real data volume is insufficient, sharing is restricted, or class balance is severely skewed.

Rare Disease Modeling (Class Imbalance)

New Hospitals and Health Systems

Research Data Sharing

Developer Testing and Education

Synthetic Data Generation Tools

Four tools dominate healthcare synthetic data generation, each optimized for different data types and use cases.

Tool	Data Type	Approach	Healthcare Focus	Best For
Synthea	FHIR patient records	Rule-based simulation	Native (built for healthcare)	Realistic patient journeys, EHR testing
Gretel.ai	Tabular, text	Neural network (LSTM-based)	General (configurable)	Distribution-preserving synthesis at scale
CTGAN/TVAE	Tabular	GAN/VAE	General (open source)	Custom clinical tabular datasets
Stable Diffusion	Medical imaging	Diffusion model	Configurable with medical fine-tuning	Synthetic X-rays, pathology slides, dermatology

Synthea: The Gold Standard for FHIR Data

CTGAN and TVAE: Deep Learning for Tabular Clinical Data

Building a CTGAN Training Pipeline for Clinical Data

Here is a complete example of training CTGAN on clinical tabular data and generating synthetic patient records for a readmission prediction dataset.

The synthetic data trilemma: maximizing utility, privacy, and fidelity simultaneously is the core engineering challenge.

# synthetic_clinical_data.py — CTGAN for Healthcare Tabular Data
import pandas as pd
import numpy as np
from sdv.single_table import CTGANSynthesizer
from sdv.metadata import SingleTableMetadata
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import GradientBoostingClassifier


def prepare_clinical_dataset():
    """Prepare a clinical dataset for synthetic generation.
    
    In production, this loads from your EHR data warehouse.
    Here we create a realistic example structure.
    """
    np.random.seed(42)
    n_patients = 10000

    data = pd.DataFrame({
        # Demographics
        "age": np.random.normal(65, 15, n_patients).clip(18, 100).astype(int),
        "sex": np.random.choice(["M", "F"], n_patients, p=[0.48, 0.52]),
        "race": np.random.choice(
            ["White", "Black", "Hispanic", "Asian", "Other"],
            n_patients, p=[0.58, 0.22, 0.12, 0.05, 0.03]
        ),

        # Vitals at discharge
        "systolic_bp": np.random.normal(130, 20, n_patients).clip(80, 220).astype(int),
        "heart_rate": np.random.normal(78, 15, n_patients).clip(40, 150).astype(int),
        "spo2": np.random.normal(96, 2, n_patients).clip(85, 100).round(1),

        # Lab results
        "hba1c": np.random.normal(6.5, 1.8, n_patients).clip(4.0, 14.0).round(1),
        "creatinine": np.random.lognormal(0.1, 0.4, n_patients).clip(0.5, 12.0).round(2),
        "hemoglobin": np.random.normal(12.5, 2.0, n_patients).clip(5.0, 18.0).round(1),
        "wbc": np.random.normal(8.0, 3.0, n_patients).clip(2.0, 30.0).round(1),

        # Clinical history
        "prior_admissions_12m": np.random.poisson(1.2, n_patients),
        "ed_visits_12m": np.random.poisson(0.8, n_patients),
        "num_medications": np.random.poisson(5, n_patients),
        "has_diabetes": np.random.binomial(1, 0.30, n_patients),
        "has_chf": np.random.binomial(1, 0.15, n_patients),
        "has_copd": np.random.binomial(1, 0.12, n_patients),
        "length_of_stay": np.random.lognormal(1.0, 0.7, n_patients).clip(1, 60).astype(int),
    })

    # Generate correlated outcome (readmission)
    risk_score = (
        0.02 * data["age"]
        + 0.5 * data["prior_admissions_12m"]
        + 0.3 * data["ed_visits_12m"]
        + 0.8 * data["has_chf"]
        + 0.4 * data["has_diabetes"]
        + 0.1 * data["hba1c"]
        + 0.3 * data["creatinine"]
        - 0.05 * data["hemoglobin"]
        + np.random.normal(0, 1, n_patients)
    )
    data["readmitted_30d"] = (risk_score > np.percentile(risk_score, 82)).astype(int)

    return data


def train_ctgan_synthesizer(real_data: pd.DataFrame, epochs: int = 300):
    """Train CTGAN on clinical data."""
    # Define metadata (column types and constraints)
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)

    # Override auto-detected types for clinical accuracy
    metadata.update_column("age", sdtype="numerical")
    metadata.update_column("sex", sdtype="categorical")
    metadata.update_column("race", sdtype="categorical")
    metadata.update_column("has_diabetes", sdtype="categorical")
    metadata.update_column("has_chf", sdtype="categorical")
    metadata.update_column("has_copd", sdtype="categorical")
    metadata.update_column("readmitted_30d", sdtype="categorical")

    # Initialize and train CTGAN
    synthesizer = CTGANSynthesizer(
        metadata,
        epochs=epochs,
        batch_size=500,
        generator_dim=(256, 256),
        discriminator_dim=(256, 256),
        generator_lr=2e-4,
        discriminator_lr=2e-4,
        verbose=True,
    )

    synthesizer.fit(real_data)
    return synthesizer


def generate_synthetic_data(
    synthesizer, n_samples: int, conditions: dict = None
) -> pd.DataFrame:
    """Generate synthetic clinical records.
    
    Optionally condition on specific values (e.g., generate
    only diabetic patients for rare-condition augmentation).
    """
    if conditions:
        # Conditional generation for targeted augmentation
        condition_df = pd.DataFrame([conditions] * n_samples)
        synthetic = synthesizer.sample_remaining_columns(
            condition_df
        )
    else:
        synthetic = synthesizer.sample(n_samples)

    # Post-generation clinical validation
    synthetic = apply_clinical_constraints(synthetic)
    return synthetic


def apply_clinical_constraints(data: pd.DataFrame) -> pd.DataFrame:
    """Enforce clinical validity constraints on synthetic data.
    
    CTGAN may generate clinically impossible combinations.
    These rules catch and correct the most common issues.
    """
    # HbA1c and diabetes must be consistent
    data.loc[
        (data["hba1c"] >= 6.5) & (data["has_diabetes"] == 0),
        "has_diabetes"
    ] = 1

    # SpO2 cannot exceed 100%
    data["spo2"] = data["spo2"].clip(upper=100.0)

    # Age must be >= 18 (adult model)
    data["age"] = data["age"].clip(lower=18)

    # Creatinine cannot be negative
    data["creatinine"] = data["creatinine"].clip(lower=0.3)

    return data


def evaluate_synthetic_quality(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    target_col: str = "readmitted_30d",
) -> dict:
    """Evaluate synthetic data across three dimensions."""
    results = {}

    # 1. UTILITY: Train model on synthetic, test on real
    features = [c for c in real_data.columns if c != target_col]
    real_encoded = pd.get_dummies(real_data, drop_first=True)
    synth_encoded = pd.get_dummies(synthetic_data, drop_first=True)

    # Align columns
    common_cols = list(
        set(real_encoded.columns) & set(synth_encoded.columns)
    )
    feat_cols = [c for c in common_cols if c != target_col]

    # Model trained on real data
    X_real, y_real = real_encoded[feat_cols], real_encoded[target_col]
    X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
        X_real, y_real, test_size=0.2, random_state=42
    )
    clf_real = GradientBoostingClassifier(n_estimators=100)
    clf_real.fit(X_train_r, y_train_r)
    auc_real = roc_auc_score(y_test_r, clf_real.predict_proba(X_test_r)[:, 1])

    # Model trained on synthetic data, tested on real
    X_synth = synth_encoded[feat_cols]
    y_synth = synth_encoded[target_col]
    clf_synth = GradientBoostingClassifier(n_estimators=100)
    clf_synth.fit(X_synth, y_synth)
    auc_synth = roc_auc_score(y_test_r, clf_synth.predict_proba(X_test_r)[:, 1])

    results["utility"] = {
        "auc_real_model": round(auc_real, 4),
        "auc_synthetic_model": round(auc_synth, 4),
        "utility_ratio": round(auc_synth / auc_real, 4),
    }

    # 2. FIDELITY: Statistical similarity
    numeric_cols = real_data.select_dtypes(include=[np.number]).columns
    fidelity_scores = {}
    for col in numeric_cols:
        real_mean = real_data[col].mean()
        synth_mean = synthetic_data[col].mean()
        real_std = real_data[col].std()
        synth_std = synthetic_data[col].std()
        mean_diff = abs(real_mean - synth_mean) / (real_std + 1e-8)
        fidelity_scores[col] = round(1 - min(mean_diff, 1), 4)
    results["fidelity"] = {
        "per_column": fidelity_scores,
        "average": round(np.mean(list(fidelity_scores.values())), 4),
    }

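    # 3. PRIVACY: Distance from each synthetic row to its closest real row
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler
    # Standardize first so the fixed 0.1 threshold below means the same thing
    # for every feature (raw units would let large-scale columns dominate).
    scaler = StandardScaler().fit(real_data[numeric_cols])
    real_numeric = scaler.transform(real_data[numeric_cols])
    synth_numeric = scaler.transform(synthetic_data[numeric_cols])
    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(real_numeric)
    distances, _ = nn.kneighbors(synth_numeric)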
    results["privacy"] = {
        "min_distance": round(float(distances.min()), 4),
        "mean_distance": round(float(distances.mean()), 4),
        "pct_below_threshold": round(
            float((distances < 0.1).mean()) * 100, 2
        ),
    }

    return results


if __name__ == "__main__":
    # Full pipeline
    print("1. Preparing clinical dataset...")
    real_data = prepare_clinical_dataset()
    print(f"   Real data: {len(real_data)} records")

    print("2. Training CTGAN synthesizer...")
    synthesizer = train_ctgan_synthesizer(real_data, epochs=300)

    print("3. Generating synthetic data...")
    synthetic = generate_synthetic_data(synthesizer, n_samples=10000)
    print(f"   Synthetic data: {len(synthetic)} records")

    print("4. Evaluating quality...")
    quality = evaluate_synthetic_quality(real_data, synthetic)
    print(f"   Utility ratio: {quality['utility']['utility_ratio']}")
    print(f"   Fidelity avg:  {quality['fidelity']['average']}")
    print(f"   Privacy risk:  {quality['privacy']['pct_below_threshold']}%")

Synthetic Data Quality: The Trilemma

Evaluating synthetic data requires measuring three properties that are often in tension with each other. Optimizing for any two tends to degrade the third.

Utility

Models trained on synthetic data typically reach 85-95% of the performance of the same model trained on real data; the size of the gap depends on task complexity and data volume.

Privacy

Three classes of privacy attacks target synthetic data; defenses include differential privacy, minimum population thresholds, and distance-based filtering.
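
One common attack against synthetic data is membership inference: the attacker tries to tell whether a given real record was in the generator's training set. The sketch below is a simple distance-based variant, not a trained attack classifier: it scores each real record by how close it sits to its nearest synthetic record and reports the AUC of that score. The function name `distance_mia_auc` and the random demo arrays are illustrative assumptions, and inputs are assumed to be numeric and already scaled.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors


def distance_mia_auc(members, non_members, synthetic):
    """AUC of an attacker who guesses 'member' for real rows near a synthetic row.

    members:     real rows the generator was trained on
    non_members: real rows held out from generator training
    synthetic:   rows sampled from the generator
    """
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    d_in, _ = nn.kneighbors(members)
    d_out, _ = nn.kneighbors(non_members)
    # Negate distances so that a closer synthetic neighbor gives a higher score
    scores = -np.concatenate([d_in[:, 0], d_out[:, 0]])
    labels = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    return float(roc_auc_score(labels, scores))


# Random stand-in data for illustration only (not clinical data)
rng = np.random.default_rng(0)
members = rng.normal(size=(300, 4))
non_members = rng.normal(size=(300, 4))

leaky_synth = members.copy()               # generator memorized its training rows
fresh_synth = rng.normal(size=(600, 4))    # independent draws, no memorization

print("leaky generator AUC:", distance_mia_auc(members, non_members, leaky_synth))
print("fresh generator AUC:", distance_mia_auc(members, non_members, fresh_synth))
```

An AUC near 0.50 means this particular attacker does no better than guessing; it does not rule out stronger attacks, so it is best read as a lower bound on risk.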

Fidelity

Metric	What It Measures	Target Value	How to Compute
Utility Ratio	Model performance: synthetic vs real training	Greater than 0.90	AUC(synthetic-trained) / AUC(real-trained)
Membership Inference AUC	Can attacker identify training records?	Close to 0.50	Train classifier on in-vs-out membership
Attribute Inference Accuracy	Can attacker infer sensitive attributes?	Close to random baseline	Predict withheld attribute from remaining
Statistical Fidelity	Distribution match (KS test, correlation)	KS p-value greater than 0.05	Kolmogorov-Smirnov test per column
Nearest Neighbor Distance	How close is nearest real record?	Greater than 5th percentile of real-to-real distances	L2 distance in normalized feature space
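
Two rows of the table can be sketched in a few lines: the per-column Kolmogorov-Smirnov test and the nearest-neighbor check against the 5th percentile of real-to-real distances. This is a minimal sketch under stated assumptions: the helper names `ks_fidelity` and `nn_distance_check`, the column names, and the random demo data are illustrative, and only numeric columns without missing values are handled in the distance check.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def _shared_numeric(real: pd.DataFrame, synth: pd.DataFrame) -> list:
    """Numeric columns of `real` that also exist in `synth`."""
    return [c for c in real.select_dtypes(include="number").columns
            if c in synth.columns]


def ks_fidelity(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Two-sample KS test for each shared numeric column."""
    rows = []
    for col in _shared_numeric(real, synth):
        res = ks_2samp(real[col].dropna(), synth[col].dropna())
        rows.append({"column": col, "ks_stat": res.statistic,
                     "p_value": res.pvalue})
    return pd.DataFrame(rows)


def nn_distance_check(real: pd.DataFrame, synth: pd.DataFrame,
                      percentile: float = 5.0) -> dict:
    """Flag synthetic rows closer to a real row than real rows usually are."""
    cols = _shared_numeric(real, synth)
    scaler = StandardScaler().fit(real[cols])
    r = scaler.transform(real[cols])
    s = scaler.transform(synth[cols])

    nn = NearestNeighbors(n_neighbors=1).fit(r)
    # Called without arguments, kneighbors() excludes each point from its own
    # neighbor list, which gives real-to-real nearest-neighbor distances.
    real_to_real, _ = nn.kneighbors()
    synth_to_real, _ = nn.kneighbors(s)

    threshold = float(np.percentile(real_to_real, percentile))
    too_close = float((synth_to_real[:, 0] < threshold).mean() * 100)
    return {"threshold": threshold, "pct_too_close": too_close}


# Random stand-in data for illustration only (not clinical data)
rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(60, 10, 500),
                     "length_of_stay": rng.exponential(4, 500)})
synth = pd.DataFrame({"age": rng.normal(60, 10, 500),
                      "length_of_stay": rng.exponential(4, 500)})

print(ks_fidelity(real, synth))
print(nn_distance_check(real, synth))
```

With large samples the KS p-value falls below 0.05 even for clinically negligible differences, so it may be worth reading the KS statistic alongside the p-value rather than relying on the threshold alone.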

FDA Guidance on Synthetic Data

The FDA has not issued definitive guidance specifically on synthetic data for SaMD validation, but several published frameworks and discussion papers indicate the agency's evolving position.

The FDA's position on synthetic data varies by use: acceptable for development, conditional for pre-submission, not acceptable as sole evidence for clinical validation.

Key positions from FDA published materials:

Development and augmentation: Synthetic data is acceptable for initial model development, hyperparameter tuning, and augmenting rare classes during training. The FDA's review focuses on the final device's performance rather than prescribing how the training set is assembled.
Pre-submission testing: Synthetic data can supplement real-data testing for pre-submission evaluations, but the FDA expects to see real-data validation results as well. A model validated solely on synthetic data would not satisfy 510(k) requirements.
Clinical validation: Synthetic data cannot replace real patient data for pivotal clinical studies. The FDA requires evidence of clinical performance on actual patients in intended-use conditions. For more on FDA requirements for clinical AI, see our detailed guide.
Post-market surveillance: Synthetic data may be used for ongoing monitoring and drift detection simulations, but real-world performance data must also be collected.

Practical Recommendations

A phased approach to synthetic data adoption minimizes risk while building organizational capability.

Start with Synthea for FHIR testing. If your need is developer testing and EHR integration, Synthea provides clinically realistic FHIR data with zero privacy risk and no model training required.
Use CTGAN for ML training augmentation. When you have some real data but not enough (especially for rare conditions), CTGAN can generate additional training samples that preserve clinical correlations.
Always validate on real data. Never report model performance based solely on synthetic test sets. The synthetic data trains the model; real data validates it.
Measure all three quality dimensions. A synthetic dataset with high fidelity but poor privacy is dangerous. One with high privacy but poor utility is useless. Track the trilemma explicitly.
Document everything for regulatory review. If your model training pipeline uses synthetic data, the model registry should record: which synthetic generator was used, what percentage of training data was synthetic, and the quality metrics of the synthetic dataset.
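
The documentation items in the last recommendation can be captured as one structured record per trained model. The sketch below is a hypothetical registry entry, not the schema of any particular model registry: the class name, field names, and every value in the example are illustrative placeholders.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SyntheticDataProvenance:
    """Hypothetical registry entry describing synthetic data used in training."""
    model_name: str
    generator: str                  # which synthetic generator was used
    generator_params: dict          # settings needed to reproduce the run
    n_real_rows: int
    n_synthetic_rows: int
    quality_metrics: dict = field(default_factory=dict)

    @property
    def pct_synthetic(self) -> float:
        """Share of training rows that were synthetic, in percent."""
        total = self.n_real_rows + self.n_synthetic_rows
        return round(100 * self.n_synthetic_rows / total, 2) if total else 0.0

    def to_json(self) -> str:
        """Serialize the record, including the derived percentage."""
        record = asdict(self)
        record["pct_synthetic"] = self.pct_synthetic
        return json.dumps(record, indent=2, sort_keys=True)


# Placeholder values for illustration only; these are not measured results
entry = SyntheticDataProvenance(
    model_name="readmission-risk-example",
    generator="CTGAN",
    generator_params={"epochs": 300},
    n_real_rows=2000,
    n_synthetic_rows=8000,
    quality_metrics={"utility_ratio": None, "fidelity_avg": None,
                     "privacy_pct_below_threshold": None},
)
print(entry.to_json())
```

Storing the record as JSON next to the model artifact keeps the generator, the synthetic share, and the quality metrics retrievable when a reviewer asks how the training set was built.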