Why Healthcare AI Models Fail Silently in Production
A sepsis prediction model that was 94% accurate at deployment drops to 82% after six months. A readmission risk score that performed identically across racial groups develops a 15-point disparity gap. A chest X-ray classifier trained on pre-COVID data misreads pandemic-era pneumonia presentations that never appeared in its training distribution.
None of these failures triggered a single alert. No dashboard flagged them. No clinician was notified until a retrospective quality review — months later — revealed the damage. According to a 2021 study in Nature Medicine, over 60% of clinical AI models experience measurable performance degradation within the first year of deployment, yet fewer than 15% of healthcare organizations have systematic monitoring in place.
This is the gap that a model monitoring dashboard fills. Not the engineering metrics your DevOps team already tracks (latency, throughput, error rates), but the clinical performance metrics your Chief Medical Officer needs to see before signing off on any AI system going live — and every day after.
This guide covers the complete monitoring stack: what metrics to track, how to detect drift before patients are harmed, when to trigger automated model pauses, and how to build two distinct dashboards — one for your ML engineering team, one for clinical leadership. We will build it with Evidently AI and Grafana, backed by production-ready Python code.
The Five Pillars of Clinical Model Monitoring
Engineering monitoring (is the model responding? how fast?) is necessary but insufficient. Clinical model monitoring adds five domain-specific pillars that determine whether a model is safe to keep running.
1. Accuracy Decay Detection
Model accuracy does not degrade gracefully — it decays in patterns. The three most common decay signatures in healthcare AI are:
- Gradual drift: Training data ages. Patient demographics shift. New treatment protocols change outcome distributions. AUC-ROC drops 0.5-1% per month.
- Sudden shift: A new EHR version changes data formats. A lab vendor switches reference ranges. ICD-10 coding guidelines update. Accuracy drops 5-10% overnight.
- Seasonal oscillation: Flu season changes respiratory diagnosis patterns. Summer trauma volumes shift surgical prediction baselines. Models trained on annual data miss quarterly cycles.
The monitoring system must track not just current accuracy, but the rate of change. A model at 0.89 AUC-ROC that has been stable for 6 months is less concerning than a model at 0.91 that dropped from 0.95 in the past 3 weeks.
```python
from scipy import stats
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class AccuracyAlert:
    metric: str
    current_value: float
    baseline_value: float
    drift_rate: float  # per month
    severity: str      # warning, critical, emergency
    window_days: int

class AccuracyMonitor:
    def __init__(self, baseline_auc: float, thresholds: dict):
        self.baseline = baseline_auc
        self.thresholds = thresholds  # {'warning': 0.90, 'critical': 0.85}
        self.history = []

    def record(self, timestamp: datetime, auc: float, n_samples: int):
        self.history.append({
            'timestamp': timestamp,
            'auc': auc,
            'n_samples': n_samples
        })

    def check_drift(self, window_days: int = 30) -> AccuracyAlert | None:
        cutoff = datetime.now() - timedelta(days=window_days)
        recent = [h for h in self.history if h['timestamp'] >= cutoff]
        if len(recent) < 3:
            return None
        timestamps = [(h['timestamp'] - recent[0]['timestamp']).days
                      for h in recent]
        aucs = [h['auc'] for h in recent]
        # Linear regression for drift rate
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            timestamps, aucs
        )
        drift_per_month = slope * 30
        current_auc = aucs[-1]
        if current_auc < self.thresholds['critical']:
            severity = 'critical'
        elif current_auc < self.thresholds['warning']:
            severity = 'warning'
        elif drift_per_month < -0.02:  # dropping >0.02 AUC per month
            severity = 'warning'
        else:
            return None
        return AccuracyAlert(
            metric='AUC-ROC',
            current_value=current_auc,
            baseline_value=self.baseline,
            drift_rate=drift_per_month,
            severity=severity,
            window_days=window_days
        )

# Usage for sepsis prediction model
monitor = AccuracyMonitor(
    baseline_auc=0.94,
    thresholds={'warning': 0.90, 'critical': 0.85}
)
```

2. Fairness Metrics by Demographics
The FDA's 2024 guidance on AI/ML-based Software as a Medical Device explicitly requires ongoing monitoring of algorithmic fairness across demographic subgroups. This is not optional — it is a regulatory expectation.
Key fairness metrics to monitor continuously:
| Metric | Definition | Acceptable Gap | Clinical Impact |
|---|---|---|---|
| Equalized Odds | Equal true positive and false positive rates across groups | <5% | Ensures equal detection rates regardless of race/ethnicity |
| Calibration Parity | Predicted probabilities match actual outcomes equally across groups | <3% | A "70% risk" means 70% for every demographic group |
| Predictive Parity | Equal positive predictive value across groups | <5% | When model says "positive," it is equally reliable for all groups |
| False Negative Rate Parity | Equal miss rates across groups | <3% | Critical — unequal miss rates mean some patients are systematically under-diagnosed |
```python
from typing import Dict, List
import pandas as pd

class FairnessMonitor:
    def __init__(self, protected_attributes: List[str],
                 max_disparity: float = 0.05):
        self.protected_attrs = protected_attributes
        self.max_disparity = max_disparity

    def compute_group_metrics(self, df: pd.DataFrame,
                              y_true_col: str, y_pred_col: str,
                              group_col: str) -> Dict:
        results = {}
        for group_value in df[group_col].unique():
            mask = df[group_col] == group_value
            y_true = df.loc[mask, y_true_col]
            y_pred = df.loc[mask, y_pred_col]
            tp = ((y_pred == 1) & (y_true == 1)).sum()
            fp = ((y_pred == 1) & (y_true == 0)).sum()
            fn = ((y_pred == 0) & (y_true == 1)).sum()
            tn = ((y_pred == 0) & (y_true == 0)).sum()
            results[group_value] = {
                'sensitivity': tp / (tp + fn) if (tp + fn) > 0 else 0,
                'specificity': tn / (tn + fp) if (tn + fp) > 0 else 0,
                'ppv': tp / (tp + fp) if (tp + fp) > 0 else 0,
                'fnr': fn / (fn + tp) if (fn + tp) > 0 else 0,
                'n_samples': len(y_true)
            }
        return results

    def check_disparity(self, group_metrics: Dict) -> List[Dict]:
        alerts = []
        metrics = ['sensitivity', 'specificity', 'ppv', 'fnr']
        for metric in metrics:
            values = {g: m[metric] for g, m in group_metrics.items()}
            max_val = max(values.values())
            min_val = min(values.values())
            gap = max_val - min_val
            if gap > self.max_disparity:
                # For FNR a higher value is worse; for the other
                # metrics a lower value is worse
                if metric == 'fnr':
                    worst_group = max(values, key=values.get)
                    best_group = min(values, key=values.get)
                else:
                    worst_group = min(values, key=values.get)
                    best_group = max(values, key=values.get)
                alerts.append({
                    'metric': metric,
                    'gap': round(gap, 4),
                    'worst_group': worst_group,
                    'best_group': best_group,
                    'worst_value': round(values[worst_group], 4),
                    'best_value': round(values[best_group], 4),
                    'severity': 'critical' if gap > 0.10 else 'warning'
                })
        return alerts
```

3. Calibration Drift
Calibration measures whether a model's predicted probabilities match reality. If a well-calibrated model says "this patient has a 30% risk of readmission," then among all patients given that score, about 30% should actually be readmitted. When calibration drifts, clinicians lose the ability to interpret model outputs meaningfully — a "high risk" score might actually correspond to moderate risk, leading to resource misallocation.
The Brier score is the primary calibration metric: it measures the mean squared difference between predicted probabilities and actual binary outcomes. A perfect predictor has a Brier score of 0; in practice the achievable floor depends on the outcome's base rate, since the score reflects discrimination as well as calibration. Clinical models should maintain a Brier score below 0.15; anything above 0.20 indicates significant miscalibration.
```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss
import numpy as np

class CalibrationMonitor:
    def __init__(self, n_bins: int = 10, brier_threshold: float = 0.15):
        self.n_bins = n_bins
        self.brier_threshold = brier_threshold
        self.baseline_brier = None
        self.baseline_curve = None

    def set_baseline(self, y_true, y_prob):
        self.baseline_brier = brier_score_loss(y_true, y_prob)
        fraction_pos, mean_predicted = calibration_curve(
            y_true, y_prob, n_bins=self.n_bins
        )
        self.baseline_curve = (fraction_pos, mean_predicted)

    def check_calibration(self, y_true, y_prob) -> dict:
        current_brier = brier_score_loss(y_true, y_prob)
        fraction_pos, mean_predicted = calibration_curve(
            y_true, y_prob, n_bins=self.n_bins
        )
        # Expected Calibration Error (ECE). calibration_curve drops
        # empty bins, so keep only the non-empty counts to stay aligned.
        bin_counts = np.histogram(y_prob, bins=self.n_bins, range=(0, 1))[0]
        bin_counts = bin_counts[bin_counts > 0]
        ece = np.sum(
            np.abs(fraction_pos - mean_predicted) * bin_counts / len(y_prob)
        )
        # Maximum Calibration Error (MCE)
        mce = np.max(np.abs(fraction_pos - mean_predicted))
        # Without a baseline, fall back to a neutral ratio of 1.0
        brier_ratio = (current_brier / self.baseline_brier
                       if self.baseline_brier else 1.0)
        severity = 'ok'
        if current_brier > self.brier_threshold:
            severity = 'critical'
        elif brier_ratio > 2.0:
            severity = 'warning'
        elif ece > 0.10:
            severity = 'warning'
        return {
            'brier_score': round(current_brier, 4),
            'baseline_brier': (round(self.baseline_brier, 4)
                               if self.baseline_brier is not None else None),
            'brier_ratio': round(brier_ratio, 2),
            'ece': round(ece, 4),
            'mce': round(mce, 4),
            'severity': severity
        }
```

4. Feature Importance Shift
When the features driving model predictions change rank order compared to training, it signals that the model is relying on different patterns than it learned — often a precursor to accuracy collapse. A sepsis model that relied primarily on white blood cell count during training but shifts to using heart rate as its top feature in production might be compensating for a broken lab data pipeline.
```python
from scipy.stats import spearmanr

class FeatureImportanceMonitor:
    def __init__(self, baseline_importances: dict,
                 rank_shift_threshold: int = 2):
        self.baseline = baseline_importances
        self.baseline_ranks = self._rank(baseline_importances)
        self.rank_shift_threshold = rank_shift_threshold

    def _rank(self, importances: dict) -> dict:
        sorted_features = sorted(importances, key=importances.get, reverse=True)
        return {f: i + 1 for i, f in enumerate(sorted_features)}

    def check_shift(self, current_importances: dict) -> dict:
        current_ranks = self._rank(current_importances)
        shifts = {}
        alerts = []
        for feature in self.baseline_ranks:
            if feature in current_ranks:
                shift = current_ranks[feature] - self.baseline_ranks[feature]
                shifts[feature] = {
                    'baseline_rank': self.baseline_ranks[feature],
                    'current_rank': current_ranks[feature],
                    'shift': shift
                }
                if abs(shift) >= self.rank_shift_threshold:
                    alerts.append({
                        'feature': feature,
                        'shift': shift,
                        'direction': 'up' if shift < 0 else 'down'
                    })
        # Spearman correlation between rank orders
        common = [f for f in self.baseline_ranks if f in current_ranks]
        baseline_r = [self.baseline_ranks[f] for f in common]
        current_r = [current_ranks[f] for f in common]
        correlation, p_value = spearmanr(baseline_r, current_r)
        return {
            'rank_correlation': round(correlation, 4),
            'p_value': round(p_value, 4),
            'shifted_features': alerts,
            'severity': 'critical' if correlation < 0.7 else
                        'warning' if correlation < 0.85 else 'ok'
        }
```

5. Data Quality Scoring
A model can only be as good as the data it receives. Data quality monitoring catches upstream pipeline issues before they corrupt model predictions. According to Gartner research, poor data quality costs organizations an average of $12.9 million per year — and in healthcare, the cost includes patient safety.
| Quality Dimension | What to Monitor | Threshold | Example |
|---|---|---|---|
| Completeness | % of non-null values for required features | >95% | Missing lab results, absent vital signs |
| Freshness | Time since last data update | <1 hour | Stale EHR feed, broken HL7 interface |
| Schema Conformity | % of records matching expected format | >99% | Wrong data types, unexpected categorical values |
| Distribution Stability | Statistical distance from training distribution | PSI < 0.10 | Age distribution shift, diagnosis code frequency change |
| Referential Integrity | Valid foreign keys and reference codes | >99% | Invalid ICD codes, unknown medication NDCs |
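The distribution-stability row above references PSI (Population Stability Index). Evidently computes drift metrics for you, but PSI is simple enough to compute directly when you need it outside the report pipeline. A minimal sketch — the `population_stability_index` helper is our own, not a library function:

```python
import numpy as np

def population_stability_index(expected: np.ndarray,
                               actual: np.ndarray,
                               n_bins: int = 10) -> float:
    """PSI between a training (expected) and production (actual) sample."""
    # Bin edges come from the training distribution; extend the outer
    # edges so production values outside the training range still count
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor proportions to avoid log(0) on empty bins
    eps = 1e-4
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) *
                        np.log(actual_pct / expected_pct)))
```

The usual rule of thumb: PSI below 0.10 is stable, 0.10–0.25 is a moderate shift worth investigating, and above 0.25 is a significant shift.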
Building the Monitoring Architecture: Evidently AI + Grafana
The monitoring stack consists of four components: a prediction logger that captures every inference, Evidently AI for computing drift and quality reports, PostgreSQL for metric storage, and Grafana for visualization. Teams that treat data quality as a first-class concern use the same pattern in production: monitoring is a prerequisite, not an afterthought.
```python
# Uses the Evidently Report API (0.4.x-style imports)
from evidently.report import Report
from evidently.metric_preset import (
    DataDriftPreset,
    DataQualityPreset,
    ClassificationPreset
)
from evidently.metrics import DatasetDriftMetric
import pandas as pd
import psycopg2
from datetime import datetime

class EvidentlyHealthcareMonitor:
    def __init__(self, reference_data: pd.DataFrame, db_config: dict):
        self.reference = reference_data
        self.db_conn = psycopg2.connect(**db_config)
        self._init_db()

    def _init_db(self):
        with self.db_conn.cursor() as cur:
            cur.execute("""
                CREATE TABLE IF NOT EXISTS model_metrics (
                    id SERIAL PRIMARY KEY,
                    timestamp TIMESTAMPTZ DEFAULT NOW(),
                    model_name VARCHAR(100),
                    metric_name VARCHAR(100),
                    metric_value FLOAT,
                    metadata JSONB
                )
            """)
        self.db_conn.commit()

    def run_monitoring_report(self, current_data: pd.DataFrame,
                              model_name: str) -> dict:
        # Data drift and quality report
        drift_report = Report(metrics=[
            DatasetDriftMetric(),
            DataQualityPreset(),
        ])
        drift_report.run(
            reference_data=self.reference,
            current_data=current_data
        )
        # Classification performance (if labels available)
        if 'target' in current_data.columns:
            perf_report = Report(metrics=[ClassificationPreset()])
            perf_report.run(
                reference_data=self.reference,
                current_data=current_data
            )
            # Performance metrics can be pulled from perf_report.as_dict()
            # the same way drift metrics are extracted below
        # Extract metrics and store
        drift_results = drift_report.as_dict()
        metrics = self._extract_metrics(drift_results)
        for metric_name, value in metrics.items():
            self._store_metric(model_name, metric_name, value)
        return metrics

    def _extract_metrics(self, report_dict: dict) -> dict:
        metrics = {}
        for metric in report_dict.get('metrics', []):
            result = metric.get('result', {})
            if 'drift_share' in result:
                metrics['drift_share'] = result['drift_share']
            if 'dataset_drift' in result:
                metrics['dataset_drift'] = 1 if result['dataset_drift'] else 0
            if 'current' in result:
                current = result['current']
                if 'number_of_missing_values' in current:
                    total = current.get('number_of_rows', 1)
                    missing = current['number_of_missing_values']
                    metrics['completeness'] = 1 - (missing / total)
        return metrics

    def _store_metric(self, model_name: str, metric_name: str,
                      value: float):
        with self.db_conn.cursor() as cur:
            cur.execute("""
                INSERT INTO model_metrics (model_name, metric_name, metric_value)
                VALUES (%s, %s, %s)
            """, (model_name, metric_name, value))
        self.db_conn.commit()
```

Alert Escalation Framework
Healthcare model alerts are not like application alerts. A false negative on a sepsis prediction model can cost a life. The escalation framework must match severity to response — and include clinical stakeholders, not just engineers. This connects to the broader operational monitoring patterns described in our guide on alerting for healthcare systems with PagerDuty runbooks.
```python
from enum import Enum
from dataclasses import dataclass
from typing import List, Callable
import requests

class AlertSeverity(Enum):
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"

@dataclass
class AlertRule:
    name: str
    condition: Callable
    severity: AlertSeverity
    actions: List[str]
    auto_pause: bool = False

class ModelAlertManager:
    def __init__(self, model_name: str, slack_webhook: str,
                 pagerduty_key: str):
        self.model_name = model_name
        self.slack_webhook = slack_webhook
        self.pagerduty_key = pagerduty_key
        self.rules = self._default_rules()

    def _default_rules(self) -> List[AlertRule]:
        return [
            AlertRule(
                name="accuracy_warning",
                condition=lambda m: m.get('auc_roc', 1) < 0.90,
                severity=AlertSeverity.WARNING,
                actions=["slack_ml_team", "create_jira_ticket"],
                auto_pause=False
            ),
            AlertRule(
                name="accuracy_critical",
                condition=lambda m: m.get('auc_roc', 1) < 0.85,
                severity=AlertSeverity.CRITICAL,
                actions=["slack_ml_team", "page_oncall", "pause_model"],
                auto_pause=True
            ),
            AlertRule(
                name="fairness_disparity",
                condition=lambda m: m.get('max_fairness_gap', 0) > 0.10,
                severity=AlertSeverity.CRITICAL,
                actions=["slack_ml_team", "notify_cmo", "flag_for_review"],
                auto_pause=True
            ),
            AlertRule(
                name="data_quality_emergency",
                condition=lambda m: m.get('data_quality_score', 1) < 0.50,
                severity=AlertSeverity.EMERGENCY,
                actions=["page_oncall", "notify_cmo", "halt_model",
                         "incident_bridge"],
                auto_pause=True
            ),
            AlertRule(
                name="calibration_drift",
                condition=lambda m: m.get('brier_ratio', 1) > 2.0,
                severity=AlertSeverity.WARNING,
                actions=["slack_ml_team", "schedule_recalibration"],
                auto_pause=False
            ),
        ]

    def evaluate(self, metrics: dict) -> List[dict]:
        triggered = []
        for rule in self.rules:
            if rule.condition(metrics):
                alert = {
                    'rule': rule.name,
                    'severity': rule.severity.value,
                    'model': self.model_name,
                    'actions': rule.actions,
                    'auto_pause': rule.auto_pause,
                    'metrics_snapshot': metrics
                }
                triggered.append(alert)
                self._execute_actions(alert, rule)
        return triggered

    def _execute_actions(self, alert: dict, rule: AlertRule):
        for action in rule.actions:
            if action == "slack_ml_team":
                self._send_slack(alert)
            elif action == "page_oncall":
                self._send_pagerduty(alert)
            elif action == "pause_model":
                self._pause_model()

    def _send_slack(self, alert: dict):
        payload = {
            "text": f"Model Alert: {alert['rule']} | "
                    f"{self.model_name} | "
                    f"Severity: {alert['severity'].upper()}"
        }
        requests.post(self.slack_webhook, json=payload)

    def _send_pagerduty(self, alert: dict):
        # PagerDuty Events API v2. PagerDuty accepts the severities
        # critical/error/warning/info, so map "emergency" to "critical".
        severity = ('critical' if alert['severity'] == 'emergency'
                    else alert['severity'])
        requests.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": self.pagerduty_key,
                "event_action": "trigger",
                "payload": {
                    "summary": f"Model Alert: {alert['rule']} ({self.model_name})",
                    "severity": severity,
                    "source": self.model_name
                }
            }
        )

    def _pause_model(self):
        # Switch to fallback rules-based system
        print(f"PAUSING model {self.model_name} — switching to fallback")
```

Two Dashboards: CMO View vs. Engineering View
One of the most common mistakes in healthcare AI monitoring is building a single dashboard that tries to serve both clinical leadership and ML engineers. These audiences have fundamentally different questions, different risk tolerances, and different action items. As noted in our discussion on SRE for healthcare, separating operational views by stakeholder role is essential for effective incident response.
| Dimension | CMO / Clinical Dashboard | ML Engineering Dashboard |
|---|---|---|
| Primary Question | "Is this model safe for my patients?" | "Is this model healthy and performant?" |
| Key Metrics | Sensitivity, specificity, fairness gaps, false negatives | Latency p95, throughput, GPU utilization, memory |
| Update Frequency | Daily summary, weekly trend report | Real-time (sub-minute) |
| Alert Threshold | Any fairness gap >5%, any accuracy drop >3% | Latency >200ms, error rate >0.1% |
| Action on Alert | Clinical review committee, model pause decision | Debug, hotfix, rollback |
| Visualization Style | Large numbers, trend arrows, traffic light indicators | Time series, histograms, log panels |
Grafana Dashboard Configuration
Here is the Grafana provisioning configuration for the CMO dashboard, pulling from the PostgreSQL metrics store that Evidently AI populates:
```json
{
  "dashboard": {
    "title": "Clinical AI Safety Dashboard — CMO View",
    "panels": [
      {
        "title": "Model Accuracy (AUC-ROC)",
        "type": "stat",
        "targets": [{
          "rawSql": "SELECT metric_value FROM model_metrics WHERE model_name = '$model' AND metric_name = 'auc_roc' ORDER BY timestamp DESC LIMIT 1"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "orange", "value": 0.85},
                {"color": "green", "value": 0.90}
              ]
            }
          }
        }
      },
      {
        "title": "Fairness Gap (Max Disparity)",
        "type": "gauge",
        "targets": [{
          "rawSql": "SELECT metric_value FROM model_metrics WHERE model_name = '$model' AND metric_name = 'max_fairness_gap' ORDER BY timestamp DESC LIMIT 1"
        }],
        "fieldConfig": {
          "defaults": {
            "max": 0.20,
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "orange", "value": 0.05},
                {"color": "red", "value": 0.10}
              ]
            }
          }
        }
      },
      {
        "title": "30-Day Accuracy Trend",
        "type": "timeseries",
        "targets": [{
          "rawSql": "SELECT timestamp, metric_value FROM model_metrics WHERE model_name = '$model' AND metric_name = 'auc_roc' AND timestamp > NOW() - INTERVAL '30 days' ORDER BY timestamp"
        }]
      },
      {
        "title": "Data Quality Score",
        "type": "stat",
        "targets": [{
          "rawSql": "SELECT metric_value * 100 as quality_pct FROM model_metrics WHERE model_name = '$model' AND metric_name = 'completeness' ORDER BY timestamp DESC LIMIT 1"
        }]
      }
    ]
  }
}
```

Automated Retraining Triggers
Not every accuracy drop requires human intervention. An automated retraining pipeline can respond to specific drift conditions, retrain the model on recent data, validate it against held-out test sets, and promote it to production — all without waking up an engineer at 3 AM. This pattern connects to the broader MLOps lifecycle discussed in resources on event-driven architecture patterns where automated triggers replace manual polling.
```python
from datetime import datetime

class RetrainingOrchestrator:
    def __init__(self, model_name: str, config: dict):
        self.model_name = model_name
        self.config = config
        self.last_retrain = None

    def should_retrain(self, metrics: dict) -> tuple:
        # Rule 1: Accuracy below threshold for 3 consecutive days
        if metrics.get('consecutive_days_below_warning', 0) >= 3:
            return True, "accuracy_sustained_drop"
        # Rule 2: Data drift detected in >30% of features
        if metrics.get('drift_share', 0) > 0.30:
            return True, "significant_data_drift"
        # Rule 3: Calibration degraded beyond recovery threshold
        if metrics.get('brier_ratio', 1) > 3.0:
            return True, "severe_calibration_drift"
        # Rule 4: Scheduled periodic retraining (every 90 days)
        if self.last_retrain:
            days_since = (datetime.now() - self.last_retrain).days
            if days_since >= 90:
                return True, "scheduled_periodic"
        return False, ""

    def trigger_retrain(self, reason: str):
        print(f"[{datetime.now()}] Triggering retraining for "
              f"{self.model_name}: {reason}")
        # 1. Snapshot current production data
        # 2. Run training pipeline
        # 3. Validate on held-out test set
        # 4. Compare against current production model
        # 5. If better: promote; if worse: alert and keep current
        pipeline_config = {
            "model_name": self.model_name,
            "reason": reason,
            "timestamp": datetime.now().isoformat(),
            "validation_thresholds": {
                "min_auc": self.config.get('min_auc', 0.88),
                "max_fairness_gap": self.config.get('max_fairness_gap', 0.05),
                "max_brier_score": self.config.get('max_brier', 0.15)
            }
        }
        # Execute training pipeline (via Airflow, Kubeflow, etc.)
        self.last_retrain = datetime.now()
        return pipeline_config
```

Putting It All Together: The Complete Monitoring Pipeline
Here is the orchestration layer that ties all five pillars together into a single monitoring run, executed on a schedule (typically hourly for high-risk models, daily for lower-risk):
```python
class ClinicalModelMonitoringPipeline:
    def __init__(self, model_name: str, config: dict):
        self.model_name = model_name
        self.accuracy_monitor = AccuracyMonitor(
            baseline_auc=config['baseline_auc'],
            thresholds=config['accuracy_thresholds']
        )
        self.fairness_monitor = FairnessMonitor(
            protected_attributes=config['protected_attributes']
        )
        self.calibration_monitor = CalibrationMonitor(
            brier_threshold=config['max_brier']
        )
        self.feature_monitor = FeatureImportanceMonitor(
            baseline_importances=config['baseline_feature_importances']
        )
        self.alert_manager = ModelAlertManager(
            model_name=model_name,
            slack_webhook=config['slack_webhook'],
            pagerduty_key=config['pagerduty_key']
        )
        self.retrainer = RetrainingOrchestrator(model_name, config)

    def run(self, predictions_df, reference_df):
        results = {}
        # 1. Accuracy (assumes accuracy_monitor.record() is called
        #    as labeled outcomes arrive)
        accuracy_alert = self.accuracy_monitor.check_drift()
        results['accuracy'] = accuracy_alert
        # 2. Fairness
        for attr in self.fairness_monitor.protected_attrs:
            group_metrics = self.fairness_monitor.compute_group_metrics(
                predictions_df, 'actual', 'predicted', attr
            )
            disparities = self.fairness_monitor.check_disparity(group_metrics)
            results[f'fairness_{attr}'] = disparities
        # 3. Calibration
        calibration = self.calibration_monitor.check_calibration(
            predictions_df['actual'], predictions_df['probability']
        )
        results['calibration'] = calibration
        # 4. Feature importance shift
        # (requires current SHAP values or permutation importance)
        # 5. Evaluate alerts
        metrics_summary = self._summarize(results)
        alerts = self.alert_manager.evaluate(metrics_summary)
        # 6. Check retraining triggers
        should_retrain, reason = self.retrainer.should_retrain(metrics_summary)
        if should_retrain:
            self.retrainer.trigger_retrain(reason)
        return {
            'timestamp': datetime.now().isoformat(),
            'model': self.model_name,
            'results': results,
            'alerts_triggered': len(alerts),
            'retraining_triggered': should_retrain
        }

    def _summarize(self, results: dict) -> dict:
        acc = results.get('accuracy')
        return {
            # No drift alert means accuracy is within thresholds; report
            # the baseline rather than None so alert-rule comparisons
            # stay numeric
            'auc_roc': (acc.current_value if acc
                        else self.accuracy_monitor.baseline),
            'max_fairness_gap': max(
                (d['gap'] for disparities in results.values()
                 if isinstance(disparities, list)
                 for d in disparities),
                default=0
            ),
            'brier_ratio': results.get('calibration', {}).get('brier_ratio', 1),
            'data_quality_score': results.get('data_quality', {}).get('score', 1)
        }
```

Frequently Asked Questions
How often should healthcare AI models be monitored?
High-risk models (sepsis prediction, drug interaction alerts, diagnostic imaging) should be monitored hourly or with every batch of predictions. Medium-risk models (readmission prediction, scheduling optimization) can be monitored daily. Low-risk models (operational analytics, resource forecasting) can use weekly monitoring. The key principle: monitoring frequency should match the clinical risk level and the speed at which the model's decisions reach patients.
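That risk-to-cadence policy can be encoded as a small lookup table consumed by whatever scheduler runs the monitoring pipeline. A sketch — the tier names and cron strings are illustrative, not a standard:

```python
# Illustrative mapping of model risk tier to monitoring cadence
MONITORING_CADENCE = {
    'high':   '0 * * * *',   # hourly: sepsis, drug interactions, imaging
    'medium': '0 6 * * *',   # daily: readmission, scheduling optimization
    'low':    '0 6 * * 1',   # weekly: operational analytics, forecasting
}

def monitoring_cron(risk_tier: str) -> str:
    """Look up the monitoring schedule for a model's risk tier."""
    return MONITORING_CADENCE[risk_tier]
```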
What is the difference between data drift and concept drift?
Data drift (covariate shift) means the input features have changed — for example, patient demographics shift or lab value distributions change. Concept drift means the relationship between inputs and outcomes has changed — for example, a new treatment protocol changes what "high risk" looks like even though patient features remain similar. Data drift is detectable without labels; concept drift requires outcome data, which in healthcare often has a delay of days to weeks.
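To make the "detectable without labels" point concrete, a per-feature two-sample Kolmogorov–Smirnov test compares production feature distributions against the training reference using inputs alone. The `detect_feature_drift` helper below is a minimal sketch, not part of the Evidently stack above:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def detect_feature_drift(reference: pd.DataFrame,
                         current: pd.DataFrame,
                         alpha: float = 0.05) -> dict:
    """Two-sample KS test per numeric column; no outcome labels needed."""
    report = {}
    for col in reference.select_dtypes(include=[np.number]).columns:
        if col not in current.columns:
            continue
        stat, p = ks_2samp(reference[col].dropna(), current[col].dropna())
        report[col] = {
            'ks_stat': round(float(stat), 4),
            'p_value': round(float(p), 4),
            'drifted': bool(p < alpha)
        }
    return report
```

Note that with large production samples the KS test flags even trivial shifts as statistically significant, so in practice teams pair the p-value with an effect-size cutoff on the KS statistic itself.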
Should model monitoring be a separate system from application monitoring?
Yes. Application monitoring (latency, errors, uptime) answers "is the system running?" Model monitoring answers "is the system making correct and fair decisions?" They have different stakeholders, different alert thresholds, and different response playbooks. Combine them in the same observability platform (e.g., Grafana) but maintain separate dashboards and alert channels. See our guide on OpenTelemetry for healthcare for the application monitoring side.
How do you monitor model fairness when you do not have demographic data?
This is a common challenge due to incomplete race/ethnicity data in EHRs. Strategies include: (1) use proxy variables like zip code and insurance type for approximate demographic analysis, (2) use Bayesian Improved Surname Geocoding (BISG) for race/ethnicity imputation, (3) monitor performance across all available subgroups (age, gender, insurance) even if race data is incomplete, (4) advocate for better demographic data collection as a prerequisite for equitable AI deployment.
When should a model be automatically paused versus manually reviewed?
Auto-pause when: accuracy drops below a critical threshold (defined per model risk level), data quality score drops below 50% (model is operating on garbage data), or a fairness gap exceeds 10% (potential regulatory violation). Manual review when: accuracy is in the warning zone but not critical, a single demographic subgroup shows declining performance, or feature importance shifts suggest investigation is needed but predictions remain accurate.
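Those rules reduce to a small triage function over the metrics snapshot. The thresholds below mirror the ones in this answer but are illustrative defaults that must be tuned per model risk level:

```python
def triage_model_state(metrics: dict) -> str:
    """Map a monitoring snapshot to 'auto_pause', 'manual_review', or 'ok'.

    Thresholds are illustrative defaults, not regulatory constants.
    """
    auc = metrics.get('auc_roc', 1.0)
    quality = metrics.get('data_quality_score', 1.0)
    fairness_gap = metrics.get('max_fairness_gap', 0.0)

    # Auto-pause: critical accuracy, garbage input data, or a fairness
    # gap large enough to be a potential regulatory violation
    if auc < 0.85 or quality < 0.50 or fairness_gap > 0.10:
        return 'auto_pause'
    # Manual review: warning-zone accuracy or a notable fairness gap
    if auc < 0.90 or fairness_gap > 0.05:
        return 'manual_review'
    return 'ok'
```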
What regulatory frameworks require model monitoring?
The FDA's AI/ML Software as a Medical Device (SaMD) framework requires a "predetermined change control plan" that includes monitoring. The EU AI Act (effective 2025) classifies most clinical AI as "high risk" and mandates post-market monitoring. CMS Conditions of Participation require quality assurance for clinical decision support. Joint Commission standards require ongoing validation of clinical algorithms. ONC's HTI-1 rule requires transparency for clinical decision support interventions.




