The Journey Every Clinical Model Takes
An often-cited statistic in machine learning holds that 87% of models never make it to production. In healthcare, the picture is at least as bleak. A 2025 JMIR scoping review found that fewer than 15% of clinical ML models published in peer-reviewed literature are ever deployed in clinical settings. The gap between a promising Jupyter notebook and a model running at an ICU bedside is not a gap of engineering skill; it is a gap of process, governance, and operational maturity.
This article maps the complete lifecycle of a healthcare ML model across seven stages, from initial exploration to eventual retirement. At each stage, we will cover what happens, who is involved, what the healthcare-specific requirements are, and what typically goes wrong. Whether you are building a sepsis early warning system, a radiology triage model, or a readmission predictor, this lifecycle applies to every clinical ML project. For a foundational understanding of what MLOps is and why healthcare needs it, start with our introduction to MLOps for healthcare developers.
Stage 1: Exploration — The Notebook Phase
Every clinical model begins in a Jupyter notebook. This is the exploratory data analysis (EDA) phase where data scientists load clinical datasets, examine distributions, identify patterns, and test whether a predictive signal exists.
What Happens
- Load de-identified clinical data from your data warehouse or FHIR server (see our guide on FHIR for AI/ML pipelines)
- Perform exploratory data analysis — distributions, correlations, missing data patterns
- Test basic models (logistic regression, random forest) to establish whether a predictive signal exists
- Identify feature candidates and data quality issues
- Document initial findings and define the prediction task clearly
Healthcare-Specific Considerations
Data access requires IRB approval or a de-identification certification. Even in the exploration phase, you must work within a HIPAA-compliant environment — no clinical data on personal laptops, no screenshots of patient records in Slack. Data quality in healthcare is notoriously inconsistent: EHR data has systemic missingness (nurses chart differently by shift, weekends have fewer labs), temporal biases (admission patterns differ by day of week), and coding variations across departments.
```python
# Stage 1: Exploration - Initial EDA for sepsis prediction
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load de-identified clinical data
df = pd.read_parquet("deidentified_icu_stays_2024.parquet")

# Basic data quality assessment
print(f"Total ICU stays: {len(df):,}")
print(f"Sepsis cases: {df['sepsis_label'].sum():,} ({df['sepsis_label'].mean()*100:.1f}%)")
print("\nMissing data rates:")
for col in ['heart_rate', 'sbp', 'temperature', 'wbc', 'lactate', 'creatinine']:
    missing = df[col].isna().mean() * 100
    print(f"  {col}: {missing:.1f}%")

# Inspect temporal coverage as a starting point for the temporal leakage check
print(f"\nDate range: {df['admission_date'].min()} to {df['admission_date'].max()}")
print(f"Weekend admissions: {df['is_weekend'].mean()*100:.1f}%")

# Baseline model to check if signal exists.
# Imputation and scaling live inside the pipeline so they are fit on each
# training fold only, which keeps validation-fold statistics out of training.
features = ['heart_rate', 'sbp', 'temperature', 'wbc', 'age']
X = df[features]
y = df['sepsis_label']
baseline_model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
baseline_auc = cross_val_score(baseline_model, X, y, cv=5, scoring='roc_auc').mean()
print(f"\nBaseline logistic regression AUC: {baseline_auc:.3f}")
```
Stage 1 Checklist
| Item | Status | Notes |
|---|---|---|
| IRB approval or de-identification certification | Required | Before any data access |
| Data loaded in HIPAA-compliant environment | Required | No local laptops |
| Missing data analysis completed | Required | Document rates per feature |
| Temporal leakage check | Required | No future data in features |
| Baseline model confirms predictive signal | Required | AUC greater than 0.60 to proceed |
| Prediction task clearly defined | Required | Target, time horizon, population |
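The temporal leakage item in the checklist can be made concrete with a small check. The following is a minimal sketch on synthetic data, assuming hypothetical column names (`prediction_time`, `lactate_time`) that do not come from the dataset above: it splits stays by admission date and flags any feature measured after the moment the prediction would have been made.

```python
import pandas as pd

def temporal_split(df, date_col, cutoff):
    """Train on stays admitted before `cutoff`, test on stays admitted on or after it."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

def find_future_measurements(df, measured_col, prediction_col):
    """Return rows whose feature was measured after the prediction time."""
    return df[df[measured_col] > df[prediction_col]]

# Tiny synthetic example (not real patient data); column names are hypothetical
stays = pd.DataFrame({
    "stay_id": [1, 2, 3, 4],
    "admission_date": pd.to_datetime(
        ["2024-01-05", "2024-03-10", "2024-07-01", "2024-09-15"]
    ),
    "prediction_time": pd.to_datetime(
        ["2024-01-05 06:00", "2024-03-10 06:00", "2024-07-01 06:00", "2024-09-15 06:00"]
    ),
    "lactate_time": pd.to_datetime(
        ["2024-01-05 04:30", "2024-03-10 09:00", "2024-07-01 05:00", "2024-09-15 05:45"]
    ),
})

train, test = temporal_split(stays, "admission_date", "2024-06-01")
leaky = find_future_measurements(stays, "lactate_time", "prediction_time")
print(f"train={len(train)}, test={len(test)}, leaky stay_ids={leaky['stay_id'].tolist()}")
```

A time-based split is usually a better proxy for deployment than a random split, because the model will always be scoring patients admitted after its training data ends.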
Estimated duration: 2-4 weeks
Stage 2: Experimentation — Tracked Iterations
Once you have confirmed that a predictive signal exists, experimentation begins in earnest. This is the most iterative stage — dozens or hundreds of runs testing different feature sets, model architectures, hyperparameters, and preprocessing strategies.
What Happens
- Systematic hyperparameter search (grid search, Bayesian optimization)
- Feature engineering and selection experiments
- Model architecture comparison (gradient boosting vs. neural networks vs. ensemble)
- Every run tracked with MLflow: parameters, metrics, artifacts, and data version
- Cross-validation with clinically meaningful metrics (sensitivity at fixed specificity, calibration)
```python
# Stage 2: Experimentation - MLflow tracked hyperparameter search
import mlflow
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("sepsis-prediction-v2-experiments")

# X_train, y_train, X_val, y_val are assumed to come from an earlier
# train/validation split of the de-identified dataset.

def objective(trial):
    with mlflow.start_run(run_name=f"optuna-trial-{trial.number}"):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 500),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
            "subsample": trial.suggest_float("subsample", 0.6, 1.0),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 5, 50),
        }
        mlflow.log_params(params)
        mlflow.log_param("data_version", "v2.3")
        mlflow.log_param("feature_set", "vitals+labs+demographics")
        # The trial number identifies the run; it is not a performance metric
        mlflow.set_tag("trial_number", trial.number)

        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_train, y_train)

        y_pred = model.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        mlflow.log_metric("val_auc", auc)
        return auc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
```
Critical rule: Never log patient identifiers as experiment parameters. Log data version IDs, feature set names, and aggregate statistics — never individual patient records.
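One way to enforce this rule is a guard that runs before anything is sent to the tracking server. The sketch below is illustrative only: `PHI_KEY_PATTERN` and `assert_no_phi_keys` are hypothetical names, the deny-list is an assumption, and the check inspects parameter keys, not values.

```python
import re

# Hypothetical deny-list of key names; extend it to match your organization's PHI policy
PHI_KEY_PATTERN = re.compile(
    r"(mrn|patient_id|patient_name|ssn|dob|date_of_birth|encounter_id)",
    re.IGNORECASE,
)

def assert_no_phi_keys(params):
    """Raise ValueError if any parameter key looks like a patient identifier."""
    flagged = [key for key in params if PHI_KEY_PATTERN.search(key)]
    if flagged:
        raise ValueError(f"Refusing to log possible PHI fields: {flagged}")
    return params

# Safe: only dataset and feature-set identifiers
safe = assert_no_phi_keys(
    {"data_version": "v2.3", "feature_set": "vitals+labs+demographics"}
)

# Blocked: a patient identifier slipped into the parameters
try:
    assert_no_phi_keys({"patient_id": "12345", "max_depth": 5})
    blocked = False
except ValueError:
    blocked = True

print(f"safe keys: {sorted(safe)}, blocked: {blocked}")
```

Wrapping the parameters as `mlflow.log_params(assert_no_phi_keys(params))` would make the check hard to forget, although a key-name filter is a backstop and not a substitute for working only with de-identified data.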
Estimated duration: 4-8 weeks
Stage 3: Validation — The Gate Most Models Never Pass
Validation in healthcare is fundamentally different from validation in other industries. In e-commerce, if your recommendation model's accuracy drops by 5%, revenue dips slightly. In healthcare, a 5% accuracy drop in a sepsis model means patients who should have been flagged are being missed. Clinical validation is where most healthcare ML projects die — not because the model is bad, but because the validation process is rigorous, time-consuming, and involves stakeholders outside the engineering team.
Four Validation Tracks
Track 1: Statistical Validation
- AUC-ROC on held-out test set (never used during training or hyperparameter tuning)
- Sensitivity and specificity at clinically meaningful thresholds
- Calibration curves — does a 70% predicted probability actually mean 70% chance?
- Confidence intervals via bootstrapping
- Comparison against existing clinical scoring systems (qSOFA, MEWS, NEWS)
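Two of the items above, sensitivity at a fixed specificity and bootstrapped confidence intervals, can be sketched with NumPy alone. This is a minimal illustration on synthetic scores, not a validated statistical procedure; the function names and the 90% specificity target are assumptions.

```python
import numpy as np

def sensitivity_at_specificity(y_true, y_score, specificity=0.90):
    """Sensitivity when the threshold sits at the `specificity` quantile of negative scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    negatives = y_score[y_true == 0]
    positives = y_score[y_true == 1]
    threshold = np.quantile(negatives, specificity)
    return float(np.mean(positives > threshold))

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `metric(y_true, y_score)`."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample contains a single class; metric is undefined
        stats.append(metric(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Synthetic labels and scores (not real patient data)
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=500), 0, 1)

point = sensitivity_at_specificity(y_true, y_score, specificity=0.90)
low, high = bootstrap_ci(y_true, y_score, sensitivity_at_specificity, n_boot=500)
print(f"Sensitivity at 90% specificity: {point:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Reporting the interval alongside the point estimate gives the clinical review board a sense of how much the headline number could move on a different sample of patients.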
Track 2: Fairness and Bias Assessment
- Stratified performance by race, ethnicity, sex, age group, insurance status
- Disparate impact analysis — does the model systematically under-flag certain populations?
- Subgroup analysis on known vulnerable populations
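A stratified check of this kind can be sketched as follows. The example assumes binary alerts and defines the disparity ratio as the highest subgroup sensitivity divided by the lowest, which is one of several possible definitions; the group labels and data are synthetic.

```python
import numpy as np

def sensitivity(y_true, y_pred):
    """Fraction of true positives that were flagged; NaN if the group has no positives."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    if not positives.any():
        return float("nan")
    return float(np.mean(y_pred[positives] == 1))

def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity computed separately for each subgroup label."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        str(g): sensitivity(y_true[groups == g], y_pred[groups == g])
        for g in np.unique(groups)
    }

def disparity_ratio(per_group):
    """Highest subgroup sensitivity divided by the lowest (assumed definition)."""
    values = list(per_group.values())
    lowest = min(values)
    return float("inf") if lowest == 0 else max(values) / lowest

# Synthetic example: the model misses half of the true cases in group B
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]
groups = ["A"] * 6 + ["B"] * 6

per_group = sensitivity_by_group(y_true, y_pred, groups)
ratio = disparity_ratio(per_group)
print(per_group, f"disparity ratio: {ratio:.2f}")
```

In this toy case group B's sensitivity is half of group A's, which is the under-flagging pattern the disparate impact analysis is meant to surface.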
Track 3: Clinical Review Board
- Clinicians (typically the CMIO, clinical informaticists, and domain physicians) review model predictions on sample cases
- Case-by-case evaluation: "Would this alert have been clinically useful?"
- False positive analysis: "Would this alert cause alarm fatigue?"
- Integration assessment: "How does this fit into existing clinical workflows?"
Track 4: Edge Case Testing
- Performance on pediatric patients (if model trained primarily on adults)
- Performance on patients with multiple comorbidities
- Performance on rare conditions related to the prediction target
- Behavior on out-of-distribution inputs (what happens with missing data?)
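A simple guard for the last two kinds of edge case is to check each incoming record against the population and value ranges the model was trained on. The ranges below are illustrative placeholders, not clinical reference values, and `check_input` is a hypothetical helper.

```python
# Illustrative plausibility ranges only; these are assumptions, not clinical reference values
PLAUSIBLE_RANGES = {
    "heart_rate": (20, 300),
    "sbp": (30, 300),
    "temperature": (25.0, 45.0),
    "age": (18, 120),  # assumes a model trained on adults only
}

def check_input(record, ranges=PLAUSIBLE_RANGES):
    """Return a list of human-readable issues for missing or out-of-range fields."""
    issues = []
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing")
        elif not (low <= value <= high):
            issues.append(f"{field}: {value} outside [{low}, {high}]")
    return issues

adult = {"heart_rate": 92, "sbp": 118, "temperature": 37.4, "age": 64}
pediatric = {"heart_rate": 110, "sbp": 95, "temperature": None, "age": 6}

adult_issues = check_input(adult)
pediatric_issues = check_input(pediatric)
print(adult_issues)
print(pediatric_issues)
```

During edge case testing, records that fail such a check can be scored separately so the team sees how the model behaves on them instead of averaging them into the headline metrics.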
Estimated duration: 4-12 weeks (often the bottleneck of the entire lifecycle)
Stage 4: Packaging — Making the Model Deployable
A validated model in a Jupyter notebook is not deployable. Packaging transforms the model into a production-ready artifact that can be deployed consistently across environments.
What Gets Packaged
- Model artifact: The trained model file (ONNX, SavedModel, or serialized format)
- Dependencies: Exact pinned versions of every library (requirements.txt with ==, not >=)
- Preprocessing code: Feature transformation pipeline that matches training exactly
- Serving code: FastAPI/Flask endpoint that accepts input and returns predictions
- Health check endpoints: /health (is the container running?), /ready (is the model loaded?)
- Model card: Documentation of training data, performance, limitations, intended use
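The pinning requirement is easy to check automatically before an image is built. Here is a minimal sketch, assuming a plain `requirements.txt` with one package per line; `find_unpinned` is a hypothetical helper, the sample versions are only examples, and lines with pip options or environment markers would also be reported by this simplified pattern.

```python
import re

# A line counts as pinned only if it has the form "name==version" (extras allowed)
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[^=\s]+$")

def find_unpinned(requirements_text):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned

# Sample file contents for illustration only
sample = """
# serving stack
scikit-learn==1.4.2
uvicorn[standard]==0.29.0
fastapi>=0.110
numpy
"""

unpinned = find_unpinned(sample)
print(f"Unpinned requirements: {unpinned}")
```

Running a check like this in CI and failing the build when the list is non-empty keeps the packaged model reproducible across environments.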
```dockerfile
# Dockerfile for clinical ML model
FROM python:3.11-slim

# System dependencies: libgomp1 for scikit-learn, curl for the HEALTHCHECK below
# (curl is not included in the slim base image)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 \
    curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install exact pinned dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and serving code
COPY model/ ./model/
COPY serve.py .
COPY preprocessing.py .
COPY model_card.json .

# Non-root user for security
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

EXPOSE 8080
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8080"]
```
```python
# serve.py - FastAPI model serving endpoint
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

from preprocessing import preprocess_features

app = FastAPI(title="Sepsis Prediction Model v2.3")
model = joblib.load("model/sepsis_model_v2.3.joblib")

class PredictionRequest(BaseModel):
    heart_rate: float
    sbp: float
    temperature: float
    # Labs are optional because they are often not yet available
    wbc: Optional[float] = None
    lactate: Optional[float] = None
    creatinine: Optional[float] = None
    age: int

class PredictionResponse(BaseModel):
    risk_score: float
    risk_level: str
    model_version: str = "2.3"

@app.get("/health")
def health():
    # Liveness: the process is up and the model object exists
    return {"status": "healthy", "model_loaded": model is not None}

@app.get("/ready")
def ready():
    # Test prediction to verify model can serve (7 input features)
    test_input = np.zeros((1, 7))
    try:
        model.predict_proba(test_input)
        return {"status": "ready"}
    except Exception as e:
        raise HTTPException(status_code=503, detail=str(e))

@app.post("/predict", response_model=PredictionResponse)
def predict(req: PredictionRequest):
    # On Pydantic v2, use req.model_dump() instead of the deprecated req.dict()
    features = preprocess_features(req.dict())
    proba = float(model.predict_proba(features.reshape(1, -1))[0][1])
    risk_level = "LOW" if proba < 0.3 else "MEDIUM" if proba < 0.7 else "HIGH"
    return PredictionResponse(risk_score=round(proba, 4), risk_level=risk_level)
```
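The serving code imports `preprocess_features` from `preprocessing.py`, which is not shown in this article. The sketch below is one hypothetical way it could look, assuming median imputation consistent with the Stage 1 baseline. The `TRAINING_MEDIANS` values are placeholders and must be replaced with the medians saved from the actual training set.

```python
import numpy as np

# Feature order must match the order used at training time
FEATURE_ORDER = [
    "heart_rate", "sbp", "temperature", "wbc", "lactate", "creatinine", "age",
]

# Placeholder values for illustration; load the real training-set medians in practice
TRAINING_MEDIANS = {
    "heart_rate": 85.0,
    "sbp": 120.0,
    "temperature": 37.0,
    "wbc": 9.0,
    "lactate": 1.5,
    "creatinine": 1.0,
    "age": 65.0,
}

def preprocess_features(payload):
    """Turn a request dict into a 1-D feature vector, imputing missing values."""
    values = []
    for name in FEATURE_ORDER:
        value = payload.get(name)
        if value is None:
            value = TRAINING_MEDIANS[name]  # same imputation strategy as training
        values.append(float(value))
    return np.array(values, dtype=float)

example = preprocess_features(
    {"heart_rate": 112, "sbp": 88, "temperature": 38.9, "wbc": 15.2, "age": 71}
)
print(example)
```

Keeping the feature order and imputation values in one module that both training and serving import is a common way to avoid training/serving skew.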
Estimated duration: 1-2 weeks
Stage 5: Deployment — Shadow Mode First, Always
In healthcare, you never put a new model's predictions in front of clinicians on day one. The standard practice is shadow deployment: the new model receives real production data and makes predictions, but those predictions are logged, not shown to clinicians. This allows you to compare the new model's predictions against ground truth and the existing model (or existing clinical workflow) before going live.
```yaml
# Kubernetes deployment with shadow mode
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sepsis-model-shadow
  labels:
    app: sepsis-prediction
    mode: shadow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sepsis-prediction
      mode: shadow
  template:
    metadata:
      labels:
        app: sepsis-prediction
        mode: shadow
    spec:
      containers:
        - name: model
          image: registry.internal/sepsis-model:v2.3-shadow
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "1000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 45
            periodSeconds: 10
          env:
            - name: MODEL_MODE
              value: "shadow"  # Predictions logged, not served to clinicians
            - name: METRICS_ENDPOINT
              value: "http://monitoring:9090/push"
```
Shadow mode typically runs for 2-4 weeks. During this period, you compare the shadow model's predictions against both the existing model and actual clinical outcomes. Only after this validation period — and with clinical review board approval — does the model go live. For more on Kubernetes deployment patterns for healthcare models, see our upcoming guide on Docker and Kubernetes for healthcare ML.
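The comparison made during the shadow period can be sketched as a small report over the logged predictions once outcomes are known. This is an illustrative example on synthetic scores; `compare_shadow` is a hypothetical helper and the 0.5 alert threshold is an assumption.

```python
import numpy as np

def compare_shadow(y_true, current_scores, shadow_scores, threshold=0.5):
    """Compare the live model and the shadow model against observed outcomes."""
    y_true = np.asarray(y_true)
    current_flag = np.asarray(current_scores) >= threshold
    shadow_flag = np.asarray(shadow_scores) >= threshold
    positives = y_true == 1
    negatives = ~positives
    return {
        # How often both models make the same alert decision
        "agreement_rate": float(np.mean(current_flag == shadow_flag)),
        "current_sensitivity": float(np.mean(current_flag[positives])),
        "shadow_sensitivity": float(np.mean(shadow_flag[positives])),
        "current_false_alert_rate": float(np.mean(current_flag[negatives])),
        "shadow_false_alert_rate": float(np.mean(shadow_flag[negatives])),
    }

# Synthetic logged predictions for eight stays (not real patient data)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
current_scores = [0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.3, 0.2]
shadow_scores = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6, 0.2]

report = compare_shadow(y_true, current_scores, shadow_scores)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Cases where the two models disagree are often the most useful material for the clinical review board, since they show concretely what would change for clinicians after promotion.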
Estimated duration: 2-4 weeks in shadow mode before promotion
Stage 6: Monitoring — The Stage Nobody Plans For
Monitoring is where the MLOps lifecycle diverges most dramatically from traditional DevOps. In DevOps, monitoring means uptime, latency, and error rates. In MLOps, you must monitor accuracy, data drift, concept drift, fairness metrics, and prediction distribution — on top of the standard infrastructure metrics.
What to Monitor
| Metric Category | What to Track | Alert Threshold |
|---|---|---|
| Model Performance | AUC-ROC, sensitivity, specificity (against ground truth) | AUC drops below 0.85 |
| Data Drift | Distribution shift in input features vs. training data | Drift score greater than 0.3 |
| Concept Drift | Relationship between features and outcomes has changed | Calibration error greater than 0.1 |
| Prediction Distribution | Distribution of output scores vs. historical baseline | Greater than 20% shift in mean prediction |
| Fairness | Performance stratified by demographics | Disparity ratio greater than 1.25 |
| Infrastructure | Latency, throughput, error rate, memory usage | P99 latency greater than 500ms |
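```python
# Automated monitoring check (runs daily via cron)
# Written against Evidently's legacy Report API (0.4.x); newer releases
# may use different import paths and result structures.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import smtplib
from email.mime.text import MIMEText

def send_alert(msg, smtp_host="localhost", sender="ml-monitoring@example.org"):
    # smtp_host and sender are placeholders; use your organization's mail relay
    msg["From"] = sender
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)

def daily_drift_check(training_data, production_data, alert_email):
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=training_data, current_data=production_data)
    result = report.as_dict()

    dataset_result = result["metrics"][0]["result"]
    drift_detected = dataset_result["dataset_drift"]
    # In the legacy API, "drift_share" is the configured threshold; the observed
    # fraction of drifted features is reported as "share_of_drifted_columns".
    drift_share = dataset_result["share_of_drifted_columns"]

    if drift_detected:
        msg = MIMEText(
            f"DATA DRIFT ALERT - Sepsis Model v2.3\n\n"
            f"Drift detected in {drift_share*100:.0f}% of features.\n"
            f"Immediate review required. Check monitoring dashboard.\n"
            f"Consider initiating retraining pipeline."
        )
        msg["Subject"] = "[CRITICAL] Sepsis Model Data Drift Detected"
        msg["To"] = alert_email
        # Send alert to clinical informatics team
        send_alert(msg)

    return {"drift_detected": drift_detected, "drift_share": drift_share}
```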
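The calibration and prediction-distribution rows of the monitoring table can also be computed with a few lines of NumPy. The sketch below assumes that "calibration error" means expected calibration error (ECE) over equal-width bins and that the "shift in mean prediction" is measured relative to the baseline mean; both are interpretations, not definitions given in the table.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between predicted probability and observed rate per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap  # weight by the share of predictions in the bin
    return float(ece)

def mean_prediction_shift(baseline_probs, current_probs):
    """Relative change in mean predicted risk versus the historical baseline."""
    baseline_mean = float(np.mean(baseline_probs))
    return abs(float(np.mean(current_probs)) - baseline_mean) / baseline_mean

# Synthetic examples (not real patient data)
ece = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
shift = mean_prediction_shift([0.10, 0.20, 0.30], [0.15, 0.25, 0.35])

print(f"ECE: {ece:.3f} (table threshold: 0.1)")
print(f"Mean prediction shift: {shift:.0%} (table threshold: 20%)")
```

The prediction-distribution check has the practical advantage of needing no ground truth, so it can raise a flag days before outcome labels arrive for the performance metrics.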
For a non-technical explanation of why models degrade and what clinical leaders should ask their AI vendors, see our companion article on model drift for healthcare teams. For comprehensive observability tooling, see our guide on observability dashboards for healthcare AI.
Duration: Continuous, for the lifetime of the model in production
Stage 7: Retirement — Knowing When to Stop
Every model has a lifespan. Retirement is the least documented stage of the lifecycle, but it is critically important. A model should be retired when:
- A better model replaces it — the new model has been validated and shadow-tested
- The clinical context has changed — new guidelines, new treatment protocols, new patient population
- Accuracy has degraded beyond recovery — retraining does not restore performance to clinical thresholds
- Regulatory requirements have changed — the model no longer meets updated compliance standards
Retirement Checklist
- Archive all model artifacts, training data references, and experiment logs
- Preserve the complete audit trail for regulatory retention requirements
- Notify all clinical users and downstream systems
- Update clinical workflow documentation
- If FDA-regulated: file updated documentation with the agency
- Retain model card and performance history for future reference
Realistic Timeline: What to Expect
| Stage | Typical Duration | Key Bottleneck |
|---|---|---|
| 1. Exploration | 2-4 weeks | Data access and IRB approval |
| 2. Experimentation | 4-8 weeks | Compute resources and iteration speed |
| 3. Clinical Validation | 4-12 weeks | Clinical review board scheduling and feedback cycles |
| 4. Packaging | 1-2 weeks | Dependency management and security scanning |
| 5. Shadow Deployment | 2-4 weeks | Accumulating enough production data for comparison |
| 6. Monitoring | Continuous | Defining meaningful thresholds |
| 7. Retirement | 1-2 weeks | Stakeholder communication |
Total time to first production deployment: 4-8 months. The most common surprise for engineering teams is that clinical validation (Stage 3) often takes longer than all engineering stages combined. Planning for this from the start prevents frustration and missed deadlines.
Frequently Asked Questions
How many models typically make it from exploration to production in healthcare?
Industry data suggests roughly 10-15% of healthcare ML models that enter the exploration phase eventually reach production. The primary drop-off points are Stage 1 (no predictive signal in the data), Stage 3 (clinical validation failure — model accuracy is insufficient or bias is detected), and the gap between Stage 4 and 5 (organizational readiness — IT infrastructure, clinical workflows, and governance processes are not in place).
Can we skip shadow deployment if we are confident in the model?
No. Shadow deployment is non-negotiable in clinical settings. Even if your offline validation metrics are excellent, production data may differ from your test set in ways you did not anticipate. Shadow mode catches integration issues (data format differences, missing fields in production), performance differences (latency under real load), and distribution mismatches (production patient population differs from training population). Two to four weeks of shadow data is a small investment compared to the risk of deploying an untested model to clinicians.
What is the most common reason healthcare models fail in production?
Data drift — specifically, the distribution of input features shifting over time without any code changes. Common causes include: EHR system upgrades that change data formats, new clinical protocols that change lab ordering patterns, seasonal variations in patient demographics, and major events like pandemics that fundamentally alter clinical baselines. This is why continuous monitoring (Stage 6) is essential, not optional.
Who owns the ML model lifecycle in a health system?
Ownership is typically shared. The data science team owns Stages 1-2 (exploration and experimentation). The clinical informatics team owns Stage 3 (validation). The ML engineering or platform team owns Stages 4-6 (packaging, deployment, monitoring). The CMIO typically has final approval authority for the Stage 3 to Stage 5 transition. Retirement decisions are usually made jointly by clinical leadership and the engineering team.
How do we handle model versioning when we need to retrain monthly?
Use a model registry (MLflow Model Registry is the most common choice) with clear versioning semantics. Each retrained model gets a new version number, goes through an abbreviated validation process (Stages 3-5), and is promoted through stages: None to Staging to Production. Note that recent MLflow releases deprecate registry stages in favor of model version aliases and tags, so check which mechanism your MLflow version supports. The registry maintains the complete history of all versions, so you can roll back quickly to any previous version if a newly deployed model underperforms.
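The versioning and rollback semantics described here can be illustrated with a toy in-memory registry. This is a deliberately simplified stand-in to show the idea; it is not MLflow's API, and the class name and artifact paths are hypothetical.

```python
class ToyModelRegistry:
    """Minimal in-memory registry: numbered versions plus a production history."""

    def __init__(self):
        self.versions = {}            # version number -> artifact reference
        self.production_history = []  # versions promoted to production, oldest first

    def register(self, artifact_uri):
        """Store a new model version and return its version number."""
        version = len(self.versions) + 1
        self.versions[version] = artifact_uri
        return version

    def promote(self, version):
        """Mark a registered version as the current production model."""
        if version not in self.versions:
            raise KeyError(f"Unknown model version: {version}")
        self.production_history.append(version)

    @property
    def production(self):
        """Version currently serving, or None if nothing has been promoted."""
        return self.production_history[-1] if self.production_history else None

    def rollback(self):
        """Return to the previously promoted version."""
        if len(self.production_history) < 2:
            raise RuntimeError("No earlier production version to roll back to")
        self.production_history.pop()
        return self.production

registry = ToyModelRegistry()
v1 = registry.register("models/sepsis/january")   # hypothetical artifact paths
registry.promote(v1)
v2 = registry.register("models/sepsis/february")
registry.promote(v2)
live_before = registry.production

restored = registry.rollback()  # the February retrain underperforms
print(f"live before rollback: v{live_before}, after rollback: v{restored}")
```

The important property is that promotion never overwrites an earlier version, so rolling back is a pointer change and does not require retraining.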



