Why 30-Day Readmission Prediction Is the Ideal First Clinical ML Project
Hospital readmissions within 30 days of discharge cost the US healthcare system over $26 billion annually. The Centers for Medicare and Medicaid Services (CMS) penalizes hospitals with excess readmission rates under the Hospital Readmissions Reduction Program (HRRP), making this a problem that directly impacts both patient outcomes and financial performance. Predicting which patients are at high risk of readmission gives care teams the opportunity to intervene—through discharge planning, follow-up scheduling, medication reconciliation, and post-discharge monitoring—before the patient returns to the emergency department.
From a machine learning perspective, readmission prediction is an ideal starting project because the outcome variable is clearly defined (readmitted within 30 days: yes or no), the features are available in structured EHR data, the clinical utility is well-established, and the regulatory bar is lower than diagnostic models. This tutorial walks through the entire pipeline: generating synthetic patient data with Synthea, extracting features from FHIR resources, training three progressively complex models, evaluating with clinical metrics, interpreting with SHAP, deploying as a FastAPI endpoint, and setting up basic monitoring.
Step 1: Generate Synthetic Patient Data with Synthea
Working with real patient data requires IRB approval, data use agreements, and HIPAA compliance infrastructure. For learning and prototyping, Synthea generates realistic synthetic patient records in FHIR format that are completely HIPAA-safe. Synthea models disease progression, medication regimens, encounters, lab results, and demographics based on real-world epidemiological data—making it the gold standard for healthcare ML development and testing.
Installing and Running Synthea
# Clone and build Synthea
git clone https://github.com/synthetichealth/synthea.git
cd synthea
./gradlew build check test
# Generate 5,000 patients in FHIR R4 format
./run_synthea -p 5000 --exporter.fhir.export true \
--exporter.fhir.transaction_bundle true \
--exporter.years_of_history 10 \
--exporter.baseDirectory ./output
# Each patient generates a FHIR Bundle JSON file
# Output location: ./output/fhir/
ls output/fhir/ | head -10Synthea generates complete FHIR Bundles containing Patient, Encounter, Condition, Observation, MedicationRequest, Procedure, and other resources for each synthetic individual. The disease modules simulate realistic clinical trajectories including diabetes progression, heart failure management, COPD exacerbations, and other conditions that commonly lead to hospital readmissions.
Loading FHIR Bundles into Python
import json
import os
import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta
def load_synthea_bundles(fhir_dir: str) -> list:
"""Load all FHIR Bundle JSON files from Synthea output."""
bundles = []
fhir_path = Path(fhir_dir)
for json_file in fhir_path.glob("*.json"):
with open(json_file) as f:
bundle = json.load(f)
if bundle.get("resourceType") == "Bundle":
bundles.append(bundle)
print(f"Loaded {len(bundles)} patient bundles")
return bundles
def extract_resources(bundles: list) -> dict:
"""Extract resources by type from FHIR Bundles."""
resources = {}
for bundle in bundles:
for entry in bundle.get("entry", []):
resource = entry.get("resource", {})
rtype = resource.get("resourceType")
if rtype:
resources.setdefault(rtype, []).append(resource)
for rtype, rlist in resources.items():
print(f" {rtype}: {len(rlist)} resources")
return resources
# Load data
bundles = load_synthea_bundles("./output/fhir/")
resources = extract_resources(bundles)
# Expected output:
# Patient: 5000 resources
# Encounter: ~85000 resources
# Condition: ~35000 resources
# Observation: ~450000 resources
# MedicationRequest: ~25000 resourcesStep 2: Extract Features from FHIR Resources
Feature engineering is where clinical domain knowledge meets data science. The features you extract from raw FHIR resources determine your model's ceiling performance. For readmission prediction, research literature consistently identifies six categories of predictive features: demographics, diagnosis burden, medication complexity, prior utilization, lab values, and index admission characteristics.
Feature Definitions
| Feature | FHIR Source | Extraction Logic | Clinical Rationale |
|---|---|---|---|
| Age | Patient.birthDate | Years between birthDate and discharge date | Older patients have higher readmission rates |
| Gender | Patient.gender | Binary encoding (male=1, female=0) | Gender-specific disease patterns |
| Diagnosis count | Condition resources | Count of active conditions at discharge | Comorbidity burden predicts complexity |
| Medication count | MedicationRequest | Active medications at discharge | Polypharmacy increases adverse events |
| Prior admissions (12mo) | Encounter (inpatient) | Count of inpatient encounters in prior year | Prior utilization is strongest predictor |
| Length of stay | Encounter.period | Days between admission and discharge | Longer stays indicate higher acuity |
| Has diabetes | Condition.code | Any condition with SNOMED 44054006 | Diabetes increases readmission risk 1.3x |
| Has heart failure | Condition.code | Any condition with SNOMED 84114007 | Heart failure: highest readmission rate |
| Has COPD | Condition.code | Any condition with SNOMED 13645005 | COPD exacerbations drive readmissions |
| Abnormal lab count | Observation | Labs outside reference range in last 48hr | Unresolved abnormalities at discharge |
Complete Feature Extraction Code
from datetime import datetime, timedelta
from dateutil.parser import parse as parse_date
import numpy as np
def extract_features(resources: dict) -> pd.DataFrame:
"""Extract ML features from FHIR resources for readmission prediction."""
patients = {p["id"]: p for p in resources.get("Patient", [])}
# Group resources by patient
encounters_by_patient = {}
conditions_by_patient = {}
meds_by_patient = {}
obs_by_patient = {}
for enc in resources.get("Encounter", []):
pat_ref = enc.get("subject", {}).get("reference", "").split("/")[-1]
encounters_by_patient.setdefault(pat_ref, []).append(enc)
for cond in resources.get("Condition", []):
pat_ref = cond.get("subject", {}).get("reference", "").split("/")[-1]
conditions_by_patient.setdefault(pat_ref, []).append(cond)
for med in resources.get("MedicationRequest", []):
pat_ref = med.get("subject", {}).get("reference", "").split("/")[-1]
meds_by_patient.setdefault(pat_ref, []).append(med)
for obs in resources.get("Observation", []):
pat_ref = obs.get("subject", {}).get("reference", "").split("/")[-1]
obs_by_patient.setdefault(pat_ref, []).append(obs)
rows = []
for pat_id, patient in patients.items():
pat_encounters = encounters_by_patient.get(pat_id, [])
# Filter to inpatient encounters
inpatient = [e for e in pat_encounters
if e.get("class", {}).get("code") == "IMP"]
if len(inpatient) < 1:
continue
# Sort by start date
inpatient.sort(key=lambda e: e.get("period", {}).get("start", ""))
for i, enc in enumerate(inpatient):
period = enc.get("period", {})
admit_str = period.get("start")
discharge_str = period.get("end")
if not admit_str or not discharge_str:
continue
admit_date = parse_date(admit_str)
discharge_date = parse_date(discharge_str)
# Target: readmitted within 30 days?
readmitted_30d = 0
if i + 1 < len(inpatient):
next_admit = parse_date(
inpatient[i+1].get("period", {}).get("start", "")
)
if (next_admit - discharge_date).days <= 30:
readmitted_30d = 1
# Demographics
birth_date = parse_date(patient.get("birthDate", "1970-01-01"))
age = (discharge_date - birth_date).days / 365.25
gender = 1 if patient.get("gender") == "male" else 0
# Diagnosis count (active conditions)
pat_conditions = conditions_by_patient.get(pat_id, [])
active_dx = [c for c in pat_conditions
if c.get("clinicalStatus", {}).get("coding", [{}])[0]
.get("code") == "active"]
dx_count = len(active_dx)
# Specific conditions
dx_codes = set()
for c in pat_conditions:
for coding in c.get("code", {}).get("coding", []):
dx_codes.add(coding.get("code", ""))
has_diabetes = 1 if "44054006" in dx_codes else 0
has_hf = 1 if "84114007" in dx_codes else 0
has_copd = 1 if "13645005" in dx_codes else 0
# Medication count
pat_meds = meds_by_patient.get(pat_id, [])
active_meds = [m for m in pat_meds
if m.get("status") == "active"]
med_count = len(active_meds)
# Prior admissions in 12 months
one_year_ago = admit_date - timedelta(days=365)
prior_admits = sum(
1 for e in inpatient[:i]
if parse_date(e.get("period", {}).get("start", "2000-01-01"))
>= one_year_ago
)
# Length of stay
los = max(1, (discharge_date - admit_date).days)
# Abnormal labs in last 48 hours
pat_obs = obs_by_patient.get(pat_id, [])
cutoff_48h = discharge_date - timedelta(hours=48)
abnormal_labs = 0
for obs in pat_obs:
obs_date_str = obs.get("effectiveDateTime", "")
if not obs_date_str:
continue
obs_date = parse_date(obs_date_str)
if cutoff_48h <= obs_date <= discharge_date:
# Check interpretation
interp = obs.get("interpretation", [{}])
if interp and interp[0].get("coding", [{}])[0].get(
"code", "N") not in ["N", "normal"]:
abnormal_labs += 1
rows.append({
"patient_id": pat_id,
"encounter_id": enc["id"],
"age": round(age, 1),
"gender": gender,
"dx_count": dx_count,
"med_count": med_count,
"prior_admits_12mo": prior_admits,
"los_days": los,
"has_diabetes": has_diabetes,
"has_heart_failure": has_hf,
"has_copd": has_copd,
"abnormal_lab_count": abnormal_labs,
"readmitted_30d": readmitted_30d
})
df = pd.DataFrame(rows)
print(f"Extracted {len(df)} encounter records")
print(f"Readmission rate: {df['readmitted_30d'].mean():.1%}")
return df
df = extract_features(resources)
# Expected: ~12,000 encounters, ~15% readmission rateThis feature extraction process converts raw FHIR resources into a structured tabular dataset suitable for machine learning. Each row represents a single inpatient encounter, and the target variable readmitted_30d indicates whether the patient was readmitted to the hospital within 30 days of discharge. The approach mirrors how production clinical ML systems operate, extracting features from the same FHIR-based interoperability infrastructure that powers clinical data exchange.
Step 3: Train Three Models — Logistic Regression, Random Forest, XGBoost
Starting with a simple model and progressively adding complexity is a best practice in clinical ML. Logistic regression provides an interpretable baseline, random forest captures non-linear interactions, and XGBoost typically delivers the highest discriminative performance. Comparing all three helps you understand the accuracy-interpretability tradeoff that every clinical ML deployment must navigate.
Data Preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Define features and target
FEATURE_COLS = [
"age", "gender", "dx_count", "med_count",
"prior_admits_12mo", "los_days",
"has_diabetes", "has_heart_failure", "has_copd",
"abnormal_lab_count"
]
X = df[FEATURE_COLS].values
y = df["readmitted_30d"].values
# Stratified split preserves class balance
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {len(X_train)} samples")
print(f" Readmitted: {y_train.sum()} ({y_train.mean():.1%})")
print(f"Test set: {len(X_test)} samples")
print(f" Readmitted: {y_test.sum()} ({y_test.mean():.1%})")
# Scale features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)Model 1: Logistic Regression (Baseline)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
lr_model = LogisticRegression(
max_iter=1000,
class_weight="balanced", # Handle class imbalance
random_state=42
)
lr_model.fit(X_train_scaled, y_train)
lr_probs = lr_model.predict_proba(X_test_scaled)[:, 1]
lr_preds = (lr_probs >= 0.3).astype(int) # Lower threshold for sensitivity
print("Logistic Regression Results:")
print(classification_report(y_test, lr_preds, target_names=["No Readmit", "Readmit"]))
print(f"AUROC: {roc_auc_score(y_test, lr_probs):.3f}")
# Coefficients show feature importance
for feat, coef in sorted(zip(FEATURE_COLS, lr_model.coef_[0]),
key=lambda x: abs(x[1]), reverse=True):
print(f" {feat:25s}: {coef:+.3f}")Model 2: Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(
n_estimators=200,
max_depth=10,
min_samples_leaf=20,
class_weight="balanced",
random_state=42,
n_jobs=-1
)
rf_model.fit(X_train, y_train) # No scaling needed for tree models
rf_probs = rf_model.predict_proba(X_test)[:, 1]
rf_preds = (rf_probs >= 0.3).astype(int)
print("Random Forest Results:")
print(classification_report(y_test, rf_preds, target_names=["No Readmit", "Readmit"]))
print(f"AUROC: {roc_auc_score(y_test, rf_probs):.3f}")Model 3: XGBoost
import xgboost as xgb
# Calculate scale_pos_weight for class imbalance
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
scale_pos_weight = neg_count / pos_count
xgb_model = xgb.XGBClassifier(
n_estimators=300,
max_depth=6,
learning_rate=0.05,
subsample=0.8,
colsample_bytree=0.8,
scale_pos_weight=scale_pos_weight,
eval_metric="aucpr",
random_state=42,
use_label_encoder=False
)
xgb_model.fit(
X_train, y_train,
eval_set=[(X_test, y_test)],
verbose=False
)
xgb_probs = xgb_model.predict_proba(X_test)[:, 1]
xgb_preds = (xgb_probs >= 0.3).astype(int)
print("XGBoost Results:")
print(classification_report(y_test, xgb_preds, target_names=["No Readmit", "Readmit"]))
print(f"AUROC: {roc_auc_score(y_test, xgb_probs):.3f}")Model Comparison Summary
| Model | AUROC | Sensitivity | Specificity | PPV | NPV | Training Time |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.72 | 0.81 | 0.55 | 0.24 | 0.94 | 0.3s |
| Random Forest | 0.78 | 0.82 | 0.63 | 0.29 | 0.95 | 2.1s |
| XGBoost | 0.81 | 0.83 | 0.67 | 0.32 | 0.96 | 4.5s |
All three models achieve sensitivity above 80%, meaning they catch more than 4 out of 5 patients who will be readmitted. The tradeoff is specificity—at the 0.3 threshold, the models generate a significant number of false positives. In clinical practice, this is often acceptable because the cost of missing a readmission (emergency department visit, ICU stay) far exceeds the cost of a false alarm (an extra follow-up phone call). Organizations deploying these models should align the threshold with their clinical utility requirements.
Step 4: Evaluate with Clinical Metrics
Standard ML metrics like accuracy and F1 score do not capture what clinicians need to know. Clinical evaluation requires metrics that map to real-world decision-making: sensitivity (how many sick patients do we catch?), specificity (how many healthy patients do we correctly leave alone?), positive predictive value (when the model flags someone, how often are they actually readmitted?), negative predictive value (when the model clears someone, how safe is that?), AUROC, and calibration.
Complete Clinical Evaluation Code
import matplotlib.pyplot as plt
from sklearn.metrics import (
roc_curve, auc, precision_recall_curve,
confusion_matrix, brier_score_loss
)
from sklearn.calibration import calibration_curve
import numpy as np
def clinical_evaluation(y_true, y_prob, model_name, threshold=0.3):
"""Comprehensive clinical metrics for a binary prediction model."""
y_pred = (y_prob >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
npv = tn / (tn + fn) if (tn + fn) > 0 else 0
fpr, tpr, _ = roc_curve(y_true, y_prob)
auroc = auc(fpr, tpr)
precision, recall, _ = precision_recall_curve(y_true, y_prob)
auprc = auc(recall, precision)
brier = brier_score_loss(y_true, y_prob)
print(f"\n{'='*50}")
print(f"{model_name} Clinical Evaluation (threshold={threshold})")
print(f"{'='*50}")
print(f" Sensitivity (Recall): {sensitivity:.3f}")
print(f" Specificity: {specificity:.3f}")
print(f" PPV (Precision): {ppv:.3f}")
print(f" NPV: {npv:.3f}")
print(f" AUROC: {auroc:.3f}")
print(f" AUPRC: {auprc:.3f}")
print(f" Brier Score: {brier:.4f}")
print(f" Confusion Matrix:")
print(f" TP={tp}, FP={fp}, FN={fn}, TN={tn}")
return {
"sensitivity": sensitivity, "specificity": specificity,
"ppv": ppv, "npv": npv, "auroc": auroc, "auprc": auprc,
"brier": brier
}
# Evaluate all models
lr_metrics = clinical_evaluation(y_test, lr_probs, "Logistic Regression")
rf_metrics = clinical_evaluation(y_test, rf_probs, "Random Forest")
xgb_metrics = clinical_evaluation(y_test, xgb_probs, "XGBoost")Calibration Plot
Calibration measures whether a predicted probability of 30% actually corresponds to a 30% observed readmission rate. A well-calibrated model is essential for clinical decision-making because clinicians need to trust that the risk score maps to reality.
def plot_calibration(y_true, probabilities_dict, n_bins=10):
"""Plot calibration curves for multiple models."""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Calibration plot
ax1.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
for name, probs in probabilities_dict.items():
fraction_pos, mean_pred = calibration_curve(
y_true, probs, n_bins=n_bins, strategy="uniform"
)
ax1.plot(mean_pred, fraction_pos, "s-", label=name)
ax1.set_xlabel("Mean Predicted Probability")
ax1.set_ylabel("Fraction of Positives")
ax1.set_title("Calibration Plot")
ax1.legend()
# ROC curves
for name, probs in probabilities_dict.items():
fpr, tpr, _ = roc_curve(y_true, probs)
roc_auc = auc(fpr, tpr)
ax2.plot(fpr, tpr, label=f"{name} (AUC={roc_auc:.3f})")
ax2.plot([0, 1], [0, 1], "k--")
ax2.set_xlabel("False Positive Rate")
ax2.set_ylabel("True Positive Rate")
ax2.set_title("ROC Curves")
ax2.legend()
plt.tight_layout()
plt.savefig("calibration_roc.png", dpi=150)
plt.show()
plot_calibration(y_test, {
"Logistic Regression": lr_probs,
"Random Forest": rf_probs,
"XGBoost": xgb_probs
})Step 5: Feature Importance Analysis with SHAP
SHAP (SHapley Additive exPlanations) provides theoretically grounded feature importance values that explain individual predictions. In healthcare, model interpretability is not optional—clinicians need to understand why a patient was flagged as high-risk, and regulatory frameworks like the EU AI Act and FDA SaMD guidance increasingly require explainability for clinical AI systems.
import shap
# SHAP explanations for XGBoost (best model)
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)
# Summary plot: global feature importance
plt.figure(figsize=(10, 6))
shap.summary_plot(
shap_values, X_test,
feature_names=FEATURE_COLS,
show=False
)
plt.title("SHAP Feature Importance — 30-Day Readmission Model")
plt.tight_layout()
plt.savefig("shap_summary.png", dpi=150)
plt.show()
# Individual patient explanation
patient_idx = 42 # Example high-risk patient
print(f"\nPatient {patient_idx} prediction: {xgb_probs[patient_idx]:.2%} risk")
print(f"Actual outcome: {'Readmitted' if y_test[patient_idx] else 'Not readmitted'}")
plt.figure(figsize=(10, 4))
shap.waterfall_plot(
shap.Explanation(
values=shap_values[patient_idx],
base_values=explainer.expected_value,
data=X_test[patient_idx],
feature_names=FEATURE_COLS
),
show=False
)
plt.title(f"SHAP Waterfall — Patient {patient_idx}")
plt.tight_layout()
plt.savefig("shap_waterfall.png", dpi=150)
plt.show()Typical SHAP output for readmission models shows that prior_admits_12mo is the strongest predictor, followed by los_days, dx_count, and abnormal_lab_count. This aligns with clinical intuition—patients who have been hospitalized frequently, stayed longer, have more comorbidities, and have unresolved lab abnormalities at discharge are at the highest risk of returning. The individual waterfall plots are particularly valuable in clinical workflows because they answer the question every clinician asks: "Why did the model flag this patient?"
Step 6: Deploy as a FastAPI Endpoint
Moving from a Jupyter notebook to a production prediction service requires wrapping the trained model in an API. FastAPI is ideal for clinical ML serving because it provides automatic OpenAPI documentation, input validation via Pydantic models, async support for high-throughput scenarios, and type safety that reduces runtime errors in production.
# app.py — FastAPI Readmission Risk Prediction Service
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import joblib
import numpy as np
from datetime import datetime
import logging
app = FastAPI(
title="30-Day Readmission Risk API",
version="1.0.0",
description="Predicts 30-day readmission risk from clinical features"
)
# Load trained model and scaler at startup
model = joblib.load("models/xgb_readmission_v1.joblib")
logger = logging.getLogger("readmission_api")
class PatientFeatures(BaseModel):
age: float = Field(..., ge=0, le=120, description="Patient age in years")
gender: int = Field(..., ge=0, le=1, description="0=female, 1=male")
dx_count: int = Field(..., ge=0, description="Active diagnosis count")
med_count: int = Field(..., ge=0, description="Active medication count")
prior_admits_12mo: int = Field(..., ge=0, description="Admissions in past 12 months")
los_days: int = Field(..., ge=1, description="Length of stay in days")
has_diabetes: int = Field(..., ge=0, le=1)
has_heart_failure: int = Field(..., ge=0, le=1)
has_copd: int = Field(..., ge=0, le=1)
abnormal_lab_count: int = Field(..., ge=0)
class PredictionResponse(BaseModel):
risk_score: float = Field(..., description="Readmission probability 0-1")
risk_level: str = Field(..., description="low / medium / high")
threshold: float
model_version: str
timestamp: str
contributing_factors: list
@app.post("/predict", response_model=PredictionResponse)
async def predict_readmission(features: PatientFeatures):
"""Predict 30-day readmission risk for a patient."""
try:
X = np.array([[
features.age, features.gender, features.dx_count,
features.med_count, features.prior_admits_12mo,
features.los_days, features.has_diabetes,
features.has_heart_failure, features.has_copd,
features.abnormal_lab_count
]])
prob = float(model.predict_proba(X)[0, 1])
# Risk stratification
if prob >= 0.5:
risk_level = "high"
elif prob >= 0.3:
risk_level = "medium"
else:
risk_level = "low"
# Top contributing factors
feature_names = [
"age", "gender", "dx_count", "med_count",
"prior_admits_12mo", "los_days", "has_diabetes",
"has_heart_failure", "has_copd", "abnormal_lab_count"
]
importances = model.feature_importances_
factors = sorted(
zip(feature_names, importances, X[0]),
key=lambda x: x[1], reverse=True
)[:3]
contributing = [
{"feature": f, "importance": round(float(imp), 3),
"value": float(val)}
for f, imp, val in factors
]
logger.info(f"Prediction: risk={prob:.3f}, level={risk_level}")
return PredictionResponse(
risk_score=round(prob, 4),
risk_level=risk_level,
threshold=0.3,
model_version="xgb_readmission_v1",
timestamp=datetime.utcnow().isoformat(),
contributing_factors=contributing
)
except Exception as e:
logger.error(f"Prediction error: {e}")
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health_check():
return {"status": "healthy", "model": "xgb_readmission_v1"}Docker Deployment
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
COPY models/ models/
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
# requirements.txt
fastapi==0.109.0
uvicorn==0.27.0
joblib==1.3.2
numpy==1.26.3
xgboost==2.0.3
scikit-learn==1.4.0# Build and run
docker build -t readmission-model:v1 .
docker run -p 8080:8080 readmission-model:v1
# Test the endpoint
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{
"age": 72,
"gender": 1,
"dx_count": 8,
"med_count": 12,
"prior_admits_12mo": 3,
"los_days": 7,
"has_diabetes": 1,
"has_heart_failure": 1,
"has_copd": 0,
"abnormal_lab_count": 4
}'
# Response:
# {
# "risk_score": 0.7234,
# "risk_level": "high",
# "threshold": 0.3,
# "model_version": "xgb_readmission_v1",
# "timestamp": "2026-03-16T10:30:00",
# "contributing_factors": [
# {"feature": "prior_admits_12mo", "importance": 0.287, "value": 3.0},
# {"feature": "abnormal_lab_count", "importance": 0.156, "value": 4.0},
# {"feature": "dx_count", "importance": 0.134, "value": 8.0}
# ]
# }This deployment pattern follows the same principles used in production healthcare ML CI/CD pipelines: containerized serving, health checks, structured logging, and version-tagged models. For organizations integrating with existing EHR systems, the prediction endpoint can be called from CDS Hooks services or embedded in clinical workflow applications built on healthcare integration platforms.
Step 7: Basic Model Monitoring
A deployed model is not a finished product—it is a living system that degrades over time as patient populations shift, clinical practices evolve, and data quality changes. Monitoring ensures you detect problems before they impact patient care.
Key Monitoring Dimensions
| Dimension | What to Track | Alert Threshold | Frequency |
|---|---|---|---|
| Data drift | Feature distribution shifts (KS test) | p-value < 0.01 on any feature | Daily |
| Prediction drift | Mean predicted risk over time | >10% shift from training distribution | Daily |
| Performance | AUROC on labeled outcomes | Drop >0.03 from baseline | Monthly (after 30-day label delay) |
| Throughput | Predictions per minute, latency | p95 latency > 500ms | Real-time |
| Data quality | Missing values, out-of-range inputs | >5% missing on any feature | Daily |
# monitoring.py — Basic model monitoring
import numpy as np
from scipy import stats
from datetime import datetime, timedelta
import json
class ReadmissionModelMonitor:
def __init__(self, training_distributions: dict):
self.training_dist = training_distributions
self.predictions_log = []
def log_prediction(self, features: dict, prediction: float):
"""Log each prediction for monitoring."""
self.predictions_log.append({
"timestamp": datetime.utcnow().isoformat(),
"features": features,
"prediction": prediction
})
def check_data_drift(self, window_hours: int = 24) -> dict:
"""Check for feature distribution drift using KS test."""
cutoff = datetime.utcnow() - timedelta(hours=window_hours)
recent = [
p for p in self.predictions_log
if p["timestamp"] >= cutoff.isoformat()
]
if len(recent) < 100:
return {"status": "insufficient_data", "count": len(recent)}
drift_results = {}
for feature, train_values in self.training_dist.items():
recent_values = [p["features"][feature] for p in recent]
ks_stat, p_value = stats.ks_2samp(train_values, recent_values)
drift_results[feature] = {
"ks_statistic": round(ks_stat, 4),
"p_value": round(p_value, 4),
"drifted": p_value < 0.01
}
any_drift = any(r["drifted"] for r in drift_results.values())
return {
"status": "drift_detected" if any_drift else "stable",
"features": drift_results
}
def check_prediction_drift(self, window_hours: int = 24) -> dict:
"""Check if prediction distribution has shifted."""
cutoff = datetime.utcnow() - timedelta(hours=window_hours)
recent_preds = [
p["prediction"] for p in self.predictions_log
if p["timestamp"] >= cutoff.isoformat()
]
if len(recent_preds) < 50:
return {"status": "insufficient_data"}
mean_pred = np.mean(recent_preds)
training_mean = self.training_dist.get("_prediction_mean", 0.15)
shift_pct = abs(mean_pred - training_mean) / training_mean
return {
"status": "shifted" if shift_pct > 0.10 else "stable",
"current_mean": round(mean_pred, 4),
"training_mean": round(training_mean, 4),
"shift_percent": round(shift_pct * 100, 1)
}For production deployments, integrate this monitoring with observability platforms like Prometheus + Grafana or managed ML monitoring services. The 30-day readmission target introduces a natural label delay—you cannot measure true model performance until 30 days after each prediction, when you learn whether the patient was actually readmitted. This lag makes drift monitoring (detecting input distribution changes) essential as an early warning system before outcome-based metrics become available.
Complete Project Structure
readmission-model/
├── data/
│ ├── raw/ # Synthea FHIR Bundles
│ ├── processed/ # Extracted features CSV
│ └── splits/ # Train/test splits
├── notebooks/
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ ├── 03_model_training.ipynb
│ └── 04_evaluation.ipynb
├── src/
│ ├── features.py # Feature extraction from FHIR
│ ├── train.py # Model training pipeline
│ ├── evaluate.py # Clinical metrics evaluation
│ └── monitoring.py # Production monitoring
├── models/
│ └── xgb_readmission_v1.joblib
├── app.py # FastAPI serving
├── Dockerfile
├── requirements.txt
├── tests/
│ ├── test_features.py
│ ├── test_predictions.py
│ └── test_api.py
└── README.mdFrequently Asked Questions
Can I use this model with real patient data?
This tutorial uses Synthea synthetic data for safe development. To deploy with real EHR data, you need IRB approval, a data use agreement, HIPAA-compliant infrastructure, and clinical validation on your institution's patient population. The code architecture transfers directly—only the data source changes. Expect to retrain because Synthea distributions differ from your patient population.
Why use a 0.3 threshold instead of 0.5?
The default 0.5 threshold optimizes accuracy, but in clinical readmission prediction, the cost of missing a high-risk patient (false negative) far exceeds the cost of unnecessary follow-up (false positive). A lower threshold increases sensitivity at the cost of more false alarms. The optimal threshold depends on your hospital's intervention capacity and cost model—use decision curve analysis to find it.
How often should I retrain the model?
Monitor for data drift and retrain when drift is detected or performance degrades. As a baseline, retrain quarterly using the most recent 12 months of data. Major events like pandemics, EHR system changes, or shifts in discharge policy should trigger immediate retraining and revalidation.
Is XGBoost always the best choice for clinical prediction?
For tabular clinical data with well-defined features, XGBoost and gradient-boosted trees consistently outperform deep learning models. However, if your features include unstructured data like clinical notes or imaging, deep learning approaches become necessary. For simple, highly interpretable models required by regulatory submissions, logistic regression may be preferred even with lower performance.
What about fairness and bias in the model?
Always evaluate model performance stratified by race, age, gender, and insurance status. Disparate performance across demographic groups can perpetuate healthcare inequities. Use fairness metrics like equalized odds and calibration across groups. The SHAP analysis shown in Step 5 helps identify whether demographic features are driving predictions inappropriately. Read more about this topic in our guide to healthcare ML metrics and clinical utility.



