Build Your First Clinical ML Model to Predict Patient Readmission Risk

April 16, 2026

15 min read

Healthcare

Hospital readmissions within 30 days of discharge cost the US healthcare system over $26 billion annually. The Centers for Medicare and Medicaid Services (CMS) penalizes hospitals with excess readmission rates under the Hospital Readmissions Reduction Program (HRRP), making this a problem that directly impacts both patient outcomes and financial performance. Predicting which patients are at high risk of readmission gives care teams the opportunity to intervene—through discharge planning, follow-up scheduling, medication reconciliation, and post-discharge monitoring—before the patient returns to the emergency department.

From a machine learning perspective, readmission prediction is an ideal starting project because the outcome variable is clearly defined (readmitted within 30 days: yes or no), the features are available in structured EHR data, the clinical utility is well-established, and the regulatory bar is lower than for diagnostic models.

This tutorial walks through the entire pipeline: generating synthetic patient data with Synthea, extracting features from FHIR resources, training three progressively complex models, evaluating with clinical metrics, interpreting with SHAP, deploying as a FastAPI endpoint, and setting up basic monitoring.

Step 1: Generate Synthetic Patient Data with Synthea

Working with real patient data requires IRB approval, data use agreements, and HIPAA compliance infrastructure. For learning and prototyping, Synthea generates realistic synthetic patient records in FHIR format that are completely HIPAA-safe. Synthea models disease progression, medication regimens, encounters, lab results, and demographics based on real-world epidemiological data—making it the gold standard for healthcare ML development and testing.

Installing and Running Synthea

# Clone and build Synthea
git clone https://github.com/synthetichealth/synthea.git
cd synthea
./gradlew build check test

# Generate 5,000 patients in FHIR R4 format
./run_synthea -p 5000 --exporter.fhir.export true \
  --exporter.fhir.transaction_bundle true \
  --exporter.years_of_history 10 \
  --exporter.baseDirectory ./output

# Each patient generates a FHIR Bundle JSON file
# Output location: ./output/fhir/
ls output/fhir/ | head -10

Synthea generates complete FHIR Bundles containing Patient, Encounter, Condition, Observation, MedicationRequest, Procedure, and other resources for each synthetic individual. The disease modules simulate realistic clinical trajectories including diabetes progression, heart failure management, COPD exacerbations, and other conditions that commonly lead to hospital readmissions.

Loading FHIR Bundles into Python

import json
import os
import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta

def load_synthea_bundles(fhir_dir: str) -> list:
    """Load all FHIR Bundle JSON files from Synthea output."""
    bundles = []
    fhir_path = Path(fhir_dir)
    for json_file in fhir_path.glob("*.json"):
        with open(json_file) as f:
            bundle = json.load(f)
            if bundle.get("resourceType") == "Bundle":
                bundles.append(bundle)
    print(f"Loaded {len(bundles)} patient bundles")
    return bundles

def extract_resources(bundles: list) -> dict:
    """Extract resources by type from FHIR Bundles."""
    resources = {}
    for bundle in bundles:
        for entry in bundle.get("entry", []):
            resource = entry.get("resource", {})
            rtype = resource.get("resourceType")
            if rtype:
                resources.setdefault(rtype, []).append(resource)
    
    for rtype, rlist in resources.items():
        print(f"  {rtype}: {len(rlist)} resources")
    return resources

# Load data
bundles = load_synthea_bundles("./output/fhir/")
resources = extract_resources(bundles)
# Expected output:
#   Patient: 5000 resources
#   Encounter: ~85000 resources
#   Condition: ~35000 resources
#   Observation: ~450000 resources
#   MedicationRequest: ~25000 resources

Step 2: Extract Features from FHIR Resources

Feature engineering is where clinical domain knowledge meets data science. The features you extract from raw FHIR resources determine your model's ceiling performance. For readmission prediction, research literature consistently identifies six categories of predictive features: demographics, diagnosis burden, medication complexity, prior utilization, lab values, and index admission characteristics.

Feature Definitions

Feature	FHIR Source	Extraction Logic	Clinical Rationale
Age	Patient.birthDate	Years between birth date and discharge date	Older patients have higher readmission rates
Gender	Patient.gender	Binary encoding (male=1, female=0)	Gender-specific disease patterns
Diagnosis count	Condition resources	Count of active conditions at discharge	Comorbidity burden predicts complexity
Medication count	MedicationRequest	Active medications at discharge	Polypharmacy increases adverse events
Prior admissions (12mo)	Encounter (inpatient)	Count of inpatient encounters in the prior year	Prior utilization is the strongest predictor
Length of stay	Encounter.period	Days between admission and discharge	Longer stays indicate higher acuity
Has diabetes	Condition.code	Any condition with SNOMED 44054006	Diabetes increases readmission risk 1.3x
Has heart failure	Condition.code	Any condition with SNOMED 84114007	Heart failure: highest readmission rate
Has COPD	Condition.code	Any condition with SNOMED 13645005	COPD exacerbations drive readmissions
Abnormal lab count	Observation	Labs outside reference range in last 48hr	Unresolved abnormalities at discharge

Complete Feature Extraction Code

from datetime import datetime, timedelta
from dateutil.parser import parse as parse_date
import numpy as np

def extract_features(resources: dict) -> pd.DataFrame:
    """Extract ML features from FHIR resources for readmission prediction."""
    
    patients = {p["id"]: p for p in resources.get("Patient", [])}
    
    # Group resources by patient
    encounters_by_patient = {}
    conditions_by_patient = {}
    meds_by_patient = {}
    obs_by_patient = {}
    
    for enc in resources.get("Encounter", []):
        pat_ref = enc.get("subject", {}).get("reference", "").split("/")[-1]
        encounters_by_patient.setdefault(pat_ref, []).append(enc)
    
    for cond in resources.get("Condition", []):
        pat_ref = cond.get("subject", {}).get("reference", "").split("/")[-1]
        conditions_by_patient.setdefault(pat_ref, []).append(cond)
    
    for med in resources.get("MedicationRequest", []):
        pat_ref = med.get("subject", {}).get("reference", "").split("/")[-1]
        meds_by_patient.setdefault(pat_ref, []).append(med)
    
    for obs in resources.get("Observation", []):
        pat_ref = obs.get("subject", {}).get("reference", "").split("/")[-1]
        obs_by_patient.setdefault(pat_ref, []).append(obs)
    
    rows = []
    
    for pat_id, patient in patients.items():
        pat_encounters = encounters_by_patient.get(pat_id, [])
        
        # Filter to inpatient encounters
        inpatient = [e for e in pat_encounters 
                     if e.get("class", {}).get("code") == "IMP"]
        
        if len(inpatient) < 1:
            continue
        
        # Sort by start date
        inpatient.sort(key=lambda e: e.get("period", {}).get("start", ""))
        
        for i, enc in enumerate(inpatient):
            period = enc.get("period", {})
            admit_str = period.get("start")
            discharge_str = period.get("end")
            
            if not admit_str or not discharge_str:
                continue
            
            admit_date = parse_date(admit_str)
            discharge_date = parse_date(discharge_str)
            
            # Target: readmitted within 30 days?
            readmitted_30d = 0
            if i + 1 < len(inpatient):
                next_admit = parse_date(
                    inpatient[i+1].get("period", {}).get("start", "")
                )
                if (next_admit - discharge_date).days <= 30:
                    readmitted_30d = 1
            
            # Demographics
            birth_date = parse_date(patient.get("birthDate", "1970-01-01"))
            age = (discharge_date - birth_date).days / 365.25
            gender = 1 if patient.get("gender") == "male" else 0
            
            # Diagnosis count (active conditions)
            pat_conditions = conditions_by_patient.get(pat_id, [])
            active_dx = [c for c in pat_conditions
                        if c.get("clinicalStatus", {}).get("coding", [{}])[0]
                        .get("code") == "active"]
            dx_count = len(active_dx)
            
            # Specific conditions
            dx_codes = set()
            for c in pat_conditions:
                for coding in c.get("code", {}).get("coding", []):
                    dx_codes.add(coding.get("code", ""))
            
            has_diabetes = 1 if "44054006" in dx_codes else 0
            has_hf = 1 if "84114007" in dx_codes else 0
            has_copd = 1 if "13645005" in dx_codes else 0
            
            # Medication count
            pat_meds = meds_by_patient.get(pat_id, [])
            active_meds = [m for m in pat_meds
                          if m.get("status") == "active"]
            med_count = len(active_meds)
            
            # Prior admissions in 12 months
            one_year_ago = admit_date - timedelta(days=365)
            prior_admits = sum(
                1 for e in inpatient[:i]
                if parse_date(e.get("period", {}).get("start", "2000-01-01"))
                   >= one_year_ago
            )
            
            # Length of stay
            los = max(1, (discharge_date - admit_date).days)
            
            # Abnormal labs in last 48 hours
            pat_obs = obs_by_patient.get(pat_id, [])
            cutoff_48h = discharge_date - timedelta(hours=48)
            abnormal_labs = 0
            for obs in pat_obs:
                obs_date_str = obs.get("effectiveDateTime", "")
                if not obs_date_str:
                    continue
                obs_date = parse_date(obs_date_str)
                if cutoff_48h <= obs_date <= discharge_date:
                    # Check interpretation
                    interp = obs.get("interpretation", [{}])
                    if interp and interp[0].get("coding", [{}])[0].get(
                        "code", "N") not in ["N", "normal"]:
                        abnormal_labs += 1
            
            rows.append({
                "patient_id": pat_id,
                "encounter_id": enc["id"],
                "age": round(age, 1),
                "gender": gender,
                "dx_count": dx_count,
                "med_count": med_count,
                "prior_admits_12mo": prior_admits,
                "los_days": los,
                "has_diabetes": has_diabetes,
                "has_heart_failure": has_hf,
                "has_copd": has_copd,
                "abnormal_lab_count": abnormal_labs,
                "readmitted_30d": readmitted_30d
            })
    
    df = pd.DataFrame(rows)
    print(f"Extracted {len(df)} encounter records")
    print(f"Readmission rate: {df['readmitted_30d'].mean():.1%}")
    return df

df = extract_features(resources)
# Expected: ~12,000 encounters, ~15% readmission rate

This feature extraction process converts raw FHIR resources into a structured tabular dataset suitable for machine learning. Each row represents a single inpatient encounter, and the target variable readmitted_30d indicates whether the patient was readmitted to the hospital within 30 days of discharge. The approach mirrors how production clinical ML systems operate, extracting features from the same FHIR-based interoperability infrastructure that powers clinical data exchange.

Step 3: Train Three Models — Logistic Regression, Random Forest, XGBoost

Starting with a simple model and progressively adding complexity is a best practice in clinical ML. Logistic regression provides an interpretable baseline, random forest captures non-linear interactions, and XGBoost typically delivers the highest discriminative performance. Comparing all three helps you understand the accuracy-interpretability tradeoff that every clinical ML deployment must navigate.

Data Preparation

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target
FEATURE_COLS = [
    "age", "gender", "dx_count", "med_count",
    "prior_admits_12mo", "los_days",
    "has_diabetes", "has_heart_failure", "has_copd",
    "abnormal_lab_count"
]

X = df[FEATURE_COLS].values
y = df["readmitted_30d"].values

# Stratified split preserves class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"  Readmitted: {y_train.sum()} ({y_train.mean():.1%})")
print(f"Test set: {len(X_test)} samples")
print(f"  Readmitted: {y_test.sum()} ({y_test.mean():.1%})")

# Scale features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Model 1: Logistic Regression (Baseline)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

lr_model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",  # Handle class imbalance
    random_state=42
)
lr_model.fit(X_train_scaled, y_train)

lr_probs = lr_model.predict_proba(X_test_scaled)[:, 1]
lr_preds = (lr_probs >= 0.3).astype(int)  # Lower threshold for sensitivity

print("Logistic Regression Results:")
print(classification_report(y_test, lr_preds, target_names=["No Readmit", "Readmit"]))
print(f"AUROC: {roc_auc_score(y_test, lr_probs):.3f}")

# Coefficients show feature importance
for feat, coef in sorted(zip(FEATURE_COLS, lr_model.coef_[0]),
                          key=lambda x: abs(x[1]), reverse=True):
    print(f"  {feat:25s}: {coef:+.3f}")

Model 2: Random Forest

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=20,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)  # No scaling needed for tree models

rf_probs = rf_model.predict_proba(X_test)[:, 1]
rf_preds = (rf_probs >= 0.3).astype(int)

print("Random Forest Results:")
print(classification_report(y_test, rf_preds, target_names=["No Readmit", "Readmit"]))
print(f"AUROC: {roc_auc_score(y_test, rf_probs):.3f}")

Model 3: XGBoost

import xgboost as xgb

# Calculate scale_pos_weight for class imbalance
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
scale_pos_weight = neg_count / pos_count

xgb_model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    eval_metric="aucpr",
    random_state=42,
    use_label_encoder=False
)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

xgb_probs = xgb_model.predict_proba(X_test)[:, 1]
xgb_preds = (xgb_probs >= 0.3).astype(int)

print("XGBoost Results:")
print(classification_report(y_test, xgb_preds, target_names=["No Readmit", "Readmit"]))
print(f"AUROC: {roc_auc_score(y_test, xgb_probs):.3f}")

Model Comparison Summary

Model	AUROC	Sensitivity	Specificity	PPV	NPV	Training Time
Logistic Regression	0.72	0.81	0.55	0.24	0.94	0.3s
Random Forest	0.78	0.82	0.63	0.29	0.95	2.1s
XGBoost	0.81	0.83	0.67	0.32	0.96	4.5s

All three models achieve sensitivity above 80%, meaning they catch more than 4 out of 5 patients who will be readmitted. The tradeoff is specificity—at the 0.3 threshold, the models generate a significant number of false positives. In clinical practice, this is often acceptable because the cost of missing a readmission (emergency department visit, ICU stay) far exceeds the cost of a false alarm (an extra follow-up phone call). Organizations deploying these models should align the threshold with their clinical utility requirements.

Step 4: Evaluate with Clinical Metrics

Standard ML metrics like accuracy and F1 score do not capture what clinicians need to know. Clinical evaluation requires metrics that map to real-world decision-making: sensitivity (how many sick patients do we catch?), specificity (how many healthy patients do we correctly leave alone?), positive predictive value (when the model flags someone, how often are they actually readmitted?), negative predictive value (when the model clears someone, how safe is that?), AUROC, and calibration.

Complete Clinical Evaluation Code

import matplotlib.pyplot as plt
from sklearn.metrics import (
    roc_curve, auc, precision_recall_curve,
    confusion_matrix, brier_score_loss
)
from sklearn.calibration import calibration_curve
import numpy as np

def clinical_evaluation(y_true, y_prob, model_name, threshold=0.3):
    """Comprehensive clinical metrics for a binary prediction model."""
    y_pred = (y_prob >= threshold).astype(int)
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
    npv = tn / (tn + fn) if (tn + fn) > 0 else 0
    
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    auroc = auc(fpr, tpr)
    
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    auprc = auc(recall, precision)
    
    brier = brier_score_loss(y_true, y_prob)
    
    print(f"\n{'='*50}")
    print(f"{model_name} Clinical Evaluation (threshold={threshold})")
    print(f"{'='*50}")
    print(f"  Sensitivity (Recall):  {sensitivity:.3f}")
    print(f"  Specificity:           {specificity:.3f}")
    print(f"  PPV (Precision):       {ppv:.3f}")
    print(f"  NPV:                   {npv:.3f}")
    print(f"  AUROC:                 {auroc:.3f}")
    print(f"  AUPRC:                 {auprc:.3f}")
    print(f"  Brier Score:           {brier:.4f}")
    print(f"  Confusion Matrix:")
    print(f"    TP={tp}, FP={fp}, FN={fn}, TN={tn}")
    
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "ppv": ppv, "npv": npv, "auroc": auroc, "auprc": auprc,
        "brier": brier
    }

# Evaluate all models
lr_metrics = clinical_evaluation(y_test, lr_probs, "Logistic Regression")
rf_metrics = clinical_evaluation(y_test, rf_probs, "Random Forest")
xgb_metrics = clinical_evaluation(y_test, xgb_probs, "XGBoost")

Calibration Plot

Calibration measures whether a predicted probability of 30% actually corresponds to a 30% observed readmission rate. A well-calibrated model is essential for clinical decision-making because clinicians need to trust that the risk score maps to reality.

def plot_calibration(y_true, probabilities_dict, n_bins=10):
    """Plot calibration curves for multiple models."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Calibration plot
    ax1.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
    for name, probs in probabilities_dict.items():
        fraction_pos, mean_pred = calibration_curve(
            y_true, probs, n_bins=n_bins, strategy="uniform"
        )
        ax1.plot(mean_pred, fraction_pos, "s-", label=name)
    ax1.set_xlabel("Mean Predicted Probability")
    ax1.set_ylabel("Fraction of Positives")
    ax1.set_title("Calibration Plot")
    ax1.legend()
    
    # ROC curves
    for name, probs in probabilities_dict.items():
        fpr, tpr, _ = roc_curve(y_true, probs)
        roc_auc = auc(fpr, tpr)
        ax2.plot(fpr, tpr, label=f"{name} (AUC={roc_auc:.3f})")
    ax2.plot([0, 1], [0, 1], "k--")
    ax2.set_xlabel("False Positive Rate")
    ax2.set_ylabel("True Positive Rate")
    ax2.set_title("ROC Curves")
    ax2.legend()
    
    plt.tight_layout()
    plt.savefig("calibration_roc.png", dpi=150)
    plt.show()

plot_calibration(y_test, {
    "Logistic Regression": lr_probs,
    "Random Forest": rf_probs,
    "XGBoost": xgb_probs
})

Step 5: Feature Importance Analysis with SHAP

SHAP (SHapley Additive exPlanations) provides theoretically grounded feature importance values that explain individual predictions. In healthcare, model interpretability is not optional—clinicians need to understand why a patient was flagged as high-risk, and regulatory frameworks like the EU AI Act and FDA SaMD guidance increasingly require explainability for clinical AI systems.

import shap

# SHAP explanations for XGBoost (best model)
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

# Summary plot: global feature importance
plt.figure(figsize=(10, 6))
shap.summary_plot(
    shap_values, X_test,
    feature_names=FEATURE_COLS,
    show=False
)
plt.title("SHAP Feature Importance — 30-Day Readmission Model")
plt.tight_layout()
plt.savefig("shap_summary.png", dpi=150)
plt.show()

# Individual patient explanation
patient_idx = 42  # Example high-risk patient
print(f"\nPatient {patient_idx} prediction: {xgb_probs[patient_idx]:.2%} risk")
print(f"Actual outcome: {'Readmitted' if y_test[patient_idx] else 'Not readmitted'}")

plt.figure(figsize=(10, 4))
shap.waterfall_plot(
    shap.Explanation(
        values=shap_values[patient_idx],
        base_values=explainer.expected_value,
        data=X_test[patient_idx],
        feature_names=FEATURE_COLS
    ),
    show=False
)
plt.title(f"SHAP Waterfall — Patient {patient_idx}")
plt.tight_layout()
plt.savefig("shap_waterfall.png", dpi=150)
plt.show()

Typical SHAP output for readmission models shows that prior_admits_12mo is the strongest predictor, followed by los_days, dx_count, and abnormal_lab_count. This aligns with clinical intuition—patients who have been hospitalized frequently, stayed longer, have more comorbidities, and have unresolved lab abnormalities at discharge are at the highest risk of returning. The individual waterfall plots are particularly valuable in clinical workflows because they answer the question every clinician asks: "Why did the model flag this patient?"

Step 6: Deploy as a FastAPI Endpoint

Moving from a Jupyter notebook to a production prediction service requires wrapping the trained model in an API. FastAPI is ideal for clinical ML serving because it provides automatic OpenAPI documentation, input validation via Pydantic models, async support for high-throughput scenarios, and type safety that reduces runtime errors in production.

# app.py — FastAPI Readmission Risk Prediction Service
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import joblib
import numpy as np
from datetime import datetime
import logging

app = FastAPI(
    title="30-Day Readmission Risk API",
    version="1.0.0",
    description="Predicts 30-day readmission risk from clinical features"
)

# Load trained model and scaler at startup
model = joblib.load("models/xgb_readmission_v1.joblib")
logger = logging.getLogger("readmission_api")

class PatientFeatures(BaseModel):
    age: float = Field(..., ge=0, le=120, description="Patient age in years")
    gender: int = Field(..., ge=0, le=1, description="0=female, 1=male")
    dx_count: int = Field(..., ge=0, description="Active diagnosis count")
    med_count: int = Field(..., ge=0, description="Active medication count")
    prior_admits_12mo: int = Field(..., ge=0, description="Admissions in past 12 months")
    los_days: int = Field(..., ge=1, description="Length of stay in days")
    has_diabetes: int = Field(..., ge=0, le=1)
    has_heart_failure: int = Field(..., ge=0, le=1)
    has_copd: int = Field(..., ge=0, le=1)
    abnormal_lab_count: int = Field(..., ge=0)

class PredictionResponse(BaseModel):
    risk_score: float = Field(..., description="Readmission probability 0-1")
    risk_level: str = Field(..., description="low / medium / high")
    threshold: float
    model_version: str
    timestamp: str
    contributing_factors: list

@app.post("/predict", response_model=PredictionResponse)
async def predict_readmission(features: PatientFeatures):
    """Predict 30-day readmission risk for a patient."""
    try:
        X = np.array([[
            features.age, features.gender, features.dx_count,
            features.med_count, features.prior_admits_12mo,
            features.los_days, features.has_diabetes,
            features.has_heart_failure, features.has_copd,
            features.abnormal_lab_count
        ]])
        
        prob = float(model.predict_proba(X)[0, 1])
        
        # Risk stratification
        if prob >= 0.5:
            risk_level = "high"
        elif prob >= 0.3:
            risk_level = "medium"
        else:
            risk_level = "low"
        
        # Top contributing factors
        feature_names = [
            "age", "gender", "dx_count", "med_count",
            "prior_admits_12mo", "los_days", "has_diabetes",
            "has_heart_failure", "has_copd", "abnormal_lab_count"
        ]
        importances = model.feature_importances_
        factors = sorted(
            zip(feature_names, importances, X[0]),
            key=lambda x: x[1], reverse=True
        )[:3]
        contributing = [
            {"feature": f, "importance": round(float(imp), 3),
             "value": float(val)}
            for f, imp, val in factors
        ]
        
        logger.info(f"Prediction: risk={prob:.3f}, level={risk_level}")
        
        return PredictionResponse(
            risk_score=round(prob, 4),
            risk_level=risk_level,
            threshold=0.3,
            model_version="xgb_readmission_v1",
            timestamp=datetime.utcnow().isoformat(),
            contributing_factors=contributing
        )
    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": "xgb_readmission_v1"}

Docker Deployment

# Dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
COPY models/ models/

EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

# requirements.txt
fastapi==0.109.0
uvicorn==0.27.0
joblib==1.3.2
numpy==1.26.3
xgboost==2.0.3
scikit-learn==1.4.0

# Build and run
docker build -t readmission-model:v1 .
docker run -p 8080:8080 readmission-model:v1

# Test the endpoint
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{
    "age": 72,
    "gender": 1,
    "dx_count": 8,
    "med_count": 12,
    "prior_admits_12mo": 3,
    "los_days": 7,
    "has_diabetes": 1,
    "has_heart_failure": 1,
    "has_copd": 0,
    "abnormal_lab_count": 4
  }'

# Response:
# {
#   "risk_score": 0.7234,
#   "risk_level": "high",
#   "threshold": 0.3,
#   "model_version": "xgb_readmission_v1",
#   "timestamp": "2026-03-16T10:30:00",
#   "contributing_factors": [
#     {"feature": "prior_admits_12mo", "importance": 0.287, "value": 3.0},
#     {"feature": "abnormal_lab_count", "importance": 0.156, "value": 4.0},
#     {"feature": "dx_count", "importance": 0.134, "value": 8.0}
#   ]
# }

This deployment pattern follows the same principles used in production healthcare ML CI/CD pipelines: containerized serving, health checks, structured logging, and version-tagged models. For organizations integrating with existing EHR systems, the prediction endpoint can be called from CDS Hooks services or embedded in clinical workflow applications built on healthcare integration platforms.

Step 7: Basic Model Monitoring

A deployed model is not a finished product—it is a living system that degrades over time as patient populations shift, clinical practices evolve, and data quality changes. Monitoring ensures you detect problems before they impact patient care.

Key Monitoring Dimensions

Dimension	What to Track	Alert Threshold	Frequency
Data drift	Feature distribution shifts (KS test)	p-value < 0.01 on any feature	Daily
Prediction drift	Mean predicted risk over time	>10% shift from training distribution	Daily
Performance	AUROC on labeled outcomes	Drop >0.03 from baseline	Monthly (after 30-day label delay)
Throughput	Predictions per minute, latency	p95 latency > 500ms	Real-time
Data quality	Missing values, out-of-range inputs	>5% missing on any feature	Daily

# monitoring.py — Basic model monitoring
import numpy as np
from scipy import stats
from datetime import datetime, timedelta
import json

class ReadmissionModelMonitor:
    def __init__(self, training_distributions: dict):
        self.training_dist = training_distributions
        self.predictions_log = []
    
    def log_prediction(self, features: dict, prediction: float):
        """Log each prediction for monitoring."""
        self.predictions_log.append({
            "timestamp": datetime.utcnow().isoformat(),
            "features": features,
            "prediction": prediction
        })
    
    def check_data_drift(self, window_hours: int = 24) -> dict:
        """Check for feature distribution drift using KS test."""
        cutoff = datetime.utcnow() - timedelta(hours=window_hours)
        recent = [
            p for p in self.predictions_log
            if p["timestamp"] >= cutoff.isoformat()
        ]
        
        if len(recent) < 100:
            return {"status": "insufficient_data", "count": len(recent)}
        
        drift_results = {}
        for feature, train_values in self.training_dist.items():
            recent_values = [p["features"][feature] for p in recent]
            ks_stat, p_value = stats.ks_2samp(train_values, recent_values)
            drift_results[feature] = {
                "ks_statistic": round(ks_stat, 4),
                "p_value": round(p_value, 4),
                "drifted": p_value < 0.01
            }
        
        any_drift = any(r["drifted"] for r in drift_results.values())
        return {
            "status": "drift_detected" if any_drift else "stable",
            "features": drift_results
        }
    
    def check_prediction_drift(self, window_hours: int = 24) -> dict:
        """Check if prediction distribution has shifted."""
        cutoff = datetime.utcnow() - timedelta(hours=window_hours)
        recent_preds = [
            p["prediction"] for p in self.predictions_log
            if p["timestamp"] >= cutoff.isoformat()
        ]
        
        if len(recent_preds) < 50:
            return {"status": "insufficient_data"}
        
        mean_pred = np.mean(recent_preds)
        training_mean = self.training_dist.get("_prediction_mean", 0.15)
        shift_pct = abs(mean_pred - training_mean) / training_mean
        
        return {
            "status": "shifted" if shift_pct > 0.10 else "stable",
            "current_mean": round(mean_pred, 4),
            "training_mean": round(training_mean, 4),
            "shift_percent": round(shift_pct * 100, 1)
        }

For production deployments, integrate this monitoring with observability platforms like Prometheus + Grafana or managed ML monitoring services. The 30-day readmission target introduces a natural label delay—you cannot measure true model performance until 30 days after each prediction, when you learn whether the patient was actually readmitted. This lag makes drift monitoring (detecting input distribution changes) essential as an early warning system before outcome-based metrics become available.

Complete Project Structure

readmission-model/
├── data/
│   ├── raw/                    # Synthea FHIR Bundles
│   ├── processed/              # Extracted features CSV
│   └── splits/                 # Train/test splits
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_feature_engineering.ipynb
│   ├── 03_model_training.ipynb
│   └── 04_evaluation.ipynb
├── src/
│   ├── features.py             # Feature extraction from FHIR
│   ├── train.py                # Model training pipeline
│   ├── evaluate.py             # Clinical metrics evaluation
│   └── monitoring.py           # Production monitoring
├── models/
│   └── xgb_readmission_v1.joblib
├── app.py                      # FastAPI serving
├── Dockerfile
├── requirements.txt
├── tests/
│   ├── test_features.py
│   ├── test_predictions.py
│   └── test_api.py
└── README.md

Frequently Asked Questions

Why is 30-day readmission prediction a good first clinical ML project?

Readmission prediction is the ideal first clinical ML project because the outcome variable is clearly defined (readmitted within 30 days, yes or no), the features live in structured EHR data, the clinical utility is well-established, and the regulatory bar is lower than for diagnostic models. High-risk flags let care teams intervene with discharge planning, follow-up scheduling, medication reconciliation, and post-discharge monitoring before the patient returns to the ED.

How much do hospital readmissions cost and why are they penalized?

Readmissions within 30 days of discharge cost the US healthcare system over $26 billion annually, and CMS penalizes hospitals with excess readmission rates under the Hospital Readmissions Reduction Program (HRRP). That combination means readmission prediction directly affects both patient outcomes and financial performance, which is why it is one of the most widely deployed clinical ML use cases in health systems.

What features best predict hospital readmission risk?

Research consistently identifies six predictive feature categories: demographics, diagnosis burden, medication complexity, prior utilization, lab values, and index admission characteristics, with prior inpatient admissions in the previous 12 months being the strongest single predictor. Specific conditions matter too: heart failure carries the highest readmission rate, diabetes raises risk about 1.3x, and unresolved abnormal labs in the 48 hours before discharge signal trouble. All extract cleanly from FHIR resources.

Which ML model works best for readmission prediction?

In a like-for-like comparison, XGBoost delivered the highest discrimination at 0.81 AUROC versus 0.78 for random forest and 0.72 for logistic regression, with all three achieving sensitivity above 80%, catching more than 4 out of 5 patients who will be readmitted. The tradeoff is specificity and interpretability: starting with logistic regression as a baseline and adding complexity is the recommended practice, and false positives are often acceptable since a missed readmission costs far more than an extra follow-up call.

How can you build a clinical ML model without real patient data?

Use Synthea, which generates realistic synthetic patient records in FHIR format that are completely HIPAA-safe, avoiding the IRB approvals, data use agreements, and compliance infrastructure real data requires. Synthea models disease progression, medications, encounters, labs, and demographics from real-world epidemiological data, producing full FHIR Bundles with Patient, Encounter, Condition, Observation, and MedicationRequest resources, making it the standard for healthcare ML learning and prototyping.

USA Office - Elintex Technologies Inc.

India Office - Elintex Technologies Pvt. Ltd.

We value your privacy