The Model Registry for Healthcare AI: Version Control, Approval Workflows, and Regulatory Traceability

March 16, 2026

13 min read

AI & MLMLOpsCompliance

Why Healthcare AI Needs More Than Git

In standard software engineering, version control is a solved problem. Git tracks every line of code, every merge, every rollback. But clinical AI models are not code — they are trained artifacts shaped by data, hyperparameters, compute environments, and random seeds. A model's behavior is determined as much by its training data as by its architecture. And when that model influences clinical decisions — flagging sepsis risk, prioritizing radiology reads, recommending drug dosages — the regulatory bar for tracking changes is orders of magnitude higher than anything a typical engineering team has encountered.

The FDA's 2023 guidance on Predetermined Change Control Plans (PCCPs) for AI-enabled devices makes the requirement explicit: manufacturers must document what changes they plan to make to their models, how those changes will be validated, and what evidence demonstrates that the modified device remains safe and effective. This is not optional guidance — it is the regulatory framework that governs every AI/ML-based Software as a Medical Device (SaMD) on the US market.

A model registry is the system that makes this possible. It is not just a storage layer for model files — it is the single source of truth for what model is running in production, how it got there, who approved it, and what evidence supports that approval.

Healthcare AI model registry architecture showing model artifacts, training metadata, validation results, and approval status — A healthcare model registry stores far more than model weights — it tracks the complete regulatory lineage of every clinical AI model.

What a Healthcare Model Registry Must Store

Generic ML platforms like MLflow, Weights and Biases, and Neptune.ai handle experiment tracking and model versioning well. But none of them were built for the specific requirements of healthcare AI governance. A healthcare-grade model registry must track five categories of information that generic platforms miss.

1. Model Artifacts with Cryptographic Integrity

Every model version must be stored as an immutable artifact with a SHA-256 hash. The FDA requires that the exact model deployed in production can be reconstructed and re-evaluated at any point during a regulatory review — which may happen years after deployment. This means storing not just the model weights, but the complete inference package: serialized model, preprocessing code, feature engineering pipeline, and inference configuration.

2. Training Metadata and Data Lineage

For every model version, the registry must record: which training dataset was used (by version hash, not just name), the exact hyperparameters, the training infrastructure (GPU type, framework version, random seed), training duration, and convergence metrics. For healthcare, it must also record the data versioning information — the demographics breakdown of the training set, any exclusion criteria applied, and the data quality checks that passed.

3. Validation Results Segmented by Demographics

Aggregate performance metrics (overall AUC, accuracy) are insufficient for clinical models. The registry must store performance broken down by protected classes: age groups, sex, race/ethnicity, and clinical subpopulations. A model with AUC 0.92 overall but AUC 0.71 for Black patients has a bias problem that aggregate metrics hide. The FDA's Action Plan for AI/ML explicitly calls for evaluating algorithmic bias across demographic groups.

4. Approval Chain with Non-Repudiation

Every model version must have a documented approval chain: who submitted it, who reviewed it, who approved it, and when each action occurred. These approvals must be cryptographically signed and immutable — you cannot alter an approval record after the fact. This is the healthcare equivalent of a chain of custody.

5. Deployment and Rollback History

The registry must track every deployment event: when a model was promoted to production, which environment it was deployed to, the canary/shadow deployment results, and any rollback events with their justifications. If the FDA asks "what model was running on March 15th at 2:00 PM when this adverse event occurred?" — the registry must answer definitively.

FDA SaMD model approval pipeline with six stages from submission through staged rollout — A six-stage approval pipeline ensures every clinical model passes automated validation, clinical review, and security assessment before deployment.

Why Generic Platforms Fall Short

MLflow, Weights and Biases, and Neptune.ai are excellent tools for ML experiment tracking. But they were designed for tech companies shipping recommendation engines and ad targeting models — not for regulated medical devices.

Comparison of MLflow, Weights and Biases, and Neptune.ai showing gaps in healthcare-specific features — Generic ML platforms lack healthcare-specific features: regulatory approval workflows, FDA audit trails, and HIPAA-compliant access controls.

Capability	MLflow	W&B	Neptune.ai	Healthcare Requirement
Model versioning	Yes	Yes	Yes	Met
Experiment tracking	Yes	Yes	Yes	Met
Model staging (dev/staging/prod)	Yes	Partial	No	Partially met
Multi-step approval workflows	No	No	No	Not met
Clinical review committee integration	No	No	No	Not met
Demographic-segmented validation	No	No	No	Not met
FDA 21 CFR Part 11 audit trail	No	No	No	Not met
HIPAA-compliant access controls	Self-hosted only	BAA available	No	Partially met
Cryptographic approval signatures	No	No	No	Not met

The solution is not to replace these platforms — it is to build a healthcare approval layer on top of them. MLflow remains the best open-source option as the model registry backend because of its mature API, model staging support, and self-hosting capability (critical for HIPAA). The approval workflow wraps around MLflow to add the governance layer that healthcare requires.

Building the Healthcare Approval Workflow

Here is the pattern for a healthcare-grade model approval pipeline built on top of MLflow. The workflow has six stages, each with specific gate criteria that must be satisfied before the model advances.

Model versioning strategy showing when FDA submission is required vs internal review — Version classification determines the regulatory path: minor updates need internal review; major changes require FDA submission.

MLflow Registration with Healthcare Metadata

# healthcare_model_registry.py — Model Registration with Regulatory Metadata
import mlflow
from mlflow.tracking import MlflowClient
import hashlib
import json
from datetime import datetime
from typing import Dict, List, Optional


class HealthcareModelRegistry:
    """MLflow-based model registry with FDA-compliant metadata tracking."""

    def __init__(self, tracking_uri: str, registry_uri: str):
        mlflow.set_tracking_uri(tracking_uri)
        self.client = MlflowClient(tracking_uri=tracking_uri,
                                    registry_uri=registry_uri)

    def register_clinical_model(
        self,
        model_name: str,
        model_uri: str,
        clinical_use_case: str,
        training_data_version: str,
        training_data_demographics: Dict,
        validation_results: Dict,
        model_card: Dict,
        submitter_id: str,
    ) -> str:
        """Register a model with full healthcare metadata.

        Returns the model version string.
        """
        # Register model in MLflow
        mv = mlflow.register_model(model_uri, model_name)
        version = mv.version

        # Compute artifact hash for integrity verification
        artifact_hash = self._compute_model_hash(model_uri)

        # Set healthcare-specific tags
        tags = {
            # Regulatory metadata
            "healthcare.clinical_use_case": clinical_use_case,
            "healthcare.artifact_sha256": artifact_hash,
            "healthcare.submission_timestamp": datetime.utcnow().isoformat(),
            "healthcare.submitter_id": submitter_id,

            # Training data lineage
            "healthcare.training_data_version": training_data_version,
            "healthcare.training_data_demographics": json.dumps(
                training_data_demographics
            ),

            # Approval status
            "healthcare.approval_status": "SUBMITTED",
            "healthcare.approval_stage": "AWAITING_AUTOMATED_VALIDATION",

            # Validation results (segmented by demographics)
            "healthcare.validation_results": json.dumps(validation_results),

            # Model card
            "healthcare.model_card": json.dumps(model_card),
        }

        for key, value in tags.items():
            self.client.set_model_version_tag(
                model_name, version, key, value
            )

        print(f"Model {model_name} v{version} registered.")
        print(f"  Artifact SHA-256: {artifact_hash}")
        print(f"  Status: SUBMITTED -> AWAITING_AUTOMATED_VALIDATION")

        return version

    def run_automated_validation(
        self,
        model_name: str,
        version: str,
        min_auc: float = 0.85,
        max_demographic_gap: float = 0.10,
    ) -> bool:
        """Gate 1: Automated validation checks.

        - Overall AUC >= threshold
        - No demographic subgroup more than max_gap below overall AUC
        - No regression vs current production model
        """
        mv = self.client.get_model_version(model_name, version)
        results = json.loads(
            mv.tags.get("healthcare.validation_results", "{}")
        )

        overall_auc = results.get("overall_auc", 0.0)
        demographic_aucs = results.get("demographic_auc", {})

        checks_passed = True
        issues = []

        # Check 1: Overall AUC threshold
        if overall_auc < min_auc:
            checks_passed = False
            issues.append(
                f"Overall AUC {overall_auc:.3f} below threshold {min_auc}"
            )

        # Check 2: Demographic fairness
        for group, auc in demographic_aucs.items():
            gap = overall_auc - auc
            if gap > max_demographic_gap:
                checks_passed = False
                issues.append(
                    f"Demographic gap for {group}: {gap:.3f} "
                    f"(AUC={auc:.3f}, overall={overall_auc:.3f})"
                )

        # Check 3: No regression vs production
        prod_versions = self.client.get_latest_versions(
            model_name, stages=["Production"]
        )
        if prod_versions:
            prod_auc = json.loads(
                prod_versions[0].tags.get(
                    "healthcare.validation_results", "{}"
                )
            ).get("overall_auc", 0.0)
            if overall_auc < prod_auc - 0.02:  # Allow 2% tolerance
                checks_passed = False
                issues.append(
                    f"Regression vs production: {overall_auc:.3f} "
                    f"vs {prod_auc:.3f}"
                )

        # Update status
        new_status = "PASSED_AUTOMATED" if checks_passed else "FAILED_AUTOMATED"
        new_stage = (
            "AWAITING_CLINICAL_REVIEW" if checks_passed
            else "REJECTED_AUTOMATED"
        )

        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.approval_status", new_status
        )
        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.approval_stage", new_stage
        )
        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.automated_validation_issues",
            json.dumps(issues)
        )

        return checks_passed

    def clinical_review_decision(
        self,
        model_name: str,
        version: str,
        reviewer_id: str,
        decision: str,  # "APPROVED" or "REJECTED"
        comments: str,
    ):
        """Gate 2: Clinical review committee decision."""
        timestamp = datetime.utcnow().isoformat()

        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.clinical_reviewer_id", reviewer_id
        )
        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.clinical_review_decision", decision
        )
        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.clinical_review_timestamp", timestamp
        )
        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.clinical_review_comments", comments
        )

        if decision == "APPROVED":
            self.client.set_model_version_tag(
                model_name, version,
                "healthcare.approval_stage", "AWAITING_SECURITY_REVIEW"
            )
        else:
            self.client.set_model_version_tag(
                model_name, version,
                "healthcare.approval_stage", "REJECTED_CLINICAL"
            )

    def promote_to_production(
        self,
        model_name: str,
        version: str,
        deployer_id: str,
        deployment_strategy: str = "canary",
    ):
        """Final gate: promote model to production after all approvals."""
        mv = self.client.get_model_version(model_name, version)
        stage = mv.tags.get("healthcare.approval_stage", "")

        if stage != "APPROVED_FOR_DEPLOYMENT":
            raise ValueError(
                f"Model not approved. Current stage: {stage}"
            )

        # Archive current production model
        prod_versions = self.client.get_latest_versions(
            model_name, stages=["Production"]
        )
        for pv in prod_versions:
            self.client.transition_model_version_stage(
                model_name, pv.version, "Archived"
            )

        # Promote new version
        self.client.transition_model_version_stage(
            model_name, version, "Production"
        )

        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.deployment_timestamp",
            datetime.utcnow().isoformat()
        )
        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.deployer_id", deployer_id
        )
        self.client.set_model_version_tag(
            model_name, version,
            "healthcare.deployment_strategy", deployment_strategy
        )

    @staticmethod
    def _compute_model_hash(model_uri: str) -> str:
        """Compute SHA-256 hash of model artifacts."""
        import os
        sha256 = hashlib.sha256()
        model_path = mlflow.artifacts.download_artifacts(model_uri)
        for root, dirs, files in os.walk(model_path):
            for fname in sorted(files):
                fpath = os.path.join(root, fname)
                with open(fpath, "rb") as f:
                    for chunk in iter(lambda: f.read(8192), b""):
                        sha256.update(chunk)
        return sha256.hexdigest()

Healthcare Model Cards

Google introduced Model Cards in 2019 as a standardized way to document ML model characteristics. For healthcare, the model card needs significant extensions to satisfy regulatory requirements and clinical stakeholder needs.

Healthcare model card template showing required sections for clinical AI documentation — A healthcare model card extends Google's format with clinical-specific sections: intended clinical setting, demographic performance, and regulatory classification.

Here is a model card template adapted for healthcare AI:

# healthcare_model_card.py — Model Card Template for Clinical AI

def generate_model_card(
    model_name: str,
    version: str,
    model_type: str,
    clinical_use_case: str,
    training_data: dict,
    performance: dict,
    limitations: list,
) -> dict:
    """Generate a healthcare-adapted model card."""
    return {
        "model_details": {
            "name": model_name,
            "version": version,
            "type": model_type,
            "framework": "PyTorch 2.1",
            "clinical_use_case": clinical_use_case,
            "fda_classification": "Class II SaMD",
            "predicate_device": "K231234 (if applicable)",
        },
        "intended_use": {
            "primary_use": "Clinical decision support for 30-day "
                          "readmission risk stratification",
            "target_population": "Adult patients (18+) discharged "
                                "from acute care hospitals",
            "clinical_setting": "Inpatient discharge planning",
            "intended_users": ["Hospitalists", "Discharge planners",
                              "Care coordinators"],
            "out_of_scope": ["Pediatric patients", "Psychiatric "
                           "admissions", "Planned readmissions"],
        },
        "training_data": {
            "source": training_data.get("source", "Multi-site EHR"),
            "size": training_data.get("size", "150,000 encounters"),
            "date_range": training_data.get("range", "2020-2024"),
            "demographics": {
                "age_mean": 62.3,
                "age_std": 15.7,
                "female_pct": 52.1,
                "race_distribution": {
                    "White": 58.2,
                    "Black": 22.1,
                    "Hispanic": 12.4,
                    "Asian": 4.8,
                    "Other": 2.5,
                },
            },
            "exclusions": ["Patients < 18 years",
                         "In-hospital mortality",
                         "Transfers to other acute care"],
        },
        "performance_metrics": {
            "overall": {
                "auc_roc": performance.get("overall_auc", 0.89),
                "sensitivity_at_90_specificity": 0.72,
                "ppv_at_10_threshold": 0.34,
                "calibration_slope": 1.02,
            },
            "by_demographic": performance.get("demographic_auc", {}),
            "by_clinical_subgroup": {
                "heart_failure": 0.91,
                "copd": 0.88,
                "diabetes": 0.87,
                "surgical": 0.85,
            },
        },
        "limitations": limitations or [
            "Performance degrades for patients with < 2 prior "
            "encounters in the system",
            "Not validated for psychiatric readmissions",
            "Training data predominantly from urban academic "
            "centers; rural generalizability unconfirmed",
        ],
        "ethical_considerations": {
            "bias_analysis": "Demographic parity gap < 0.05 across "
                           "all racial groups for FPR",
            "fairness_metric": "Equalized odds within 5% across "
                             "age and race groups",
            "human_oversight": "Model output is advisory only; "
                             "all clinical decisions require "
                             "physician review",
        },
    }

The Six-Stage Approval Pipeline

A healthcare model registry must enforce a multi-stage approval pipeline. No model reaches production without passing every gate. This is not bureaucracy — it is the process that protects patients and satisfies regulators.

FDA SaMD regulatory framework showing change types and required FDA actions — The FDA's change classification determines whether a model update requires a new regulatory submission or can proceed with internal documentation.

Stage 1: Model Submission

The data science team submits a trained model to the registry with all required metadata. The system validates completeness — if the model card is missing fields, if validation results are not segmented by demographics, or if the training data version is not recorded, the submission is rejected automatically.

Stage 2: Automated Validation

A test suite runs automatically against the submitted model. The suite includes performance benchmarks (AUC thresholds), fairness checks (demographic parity), regression tests (comparison to current production model), and adversarial robustness tests. For a CI/CD pipeline perspective on automating these checks, see our guide on CI/CD for healthcare ML.

Stage 3: Clinical Review Committee

A committee of clinicians (typically the CMIO, a domain-expert physician, and a clinical informaticist) reviews the model card, validation results, and clinical relevance. They assess whether the model's performance translates to clinical value — a statistically significant AUC improvement may not matter if it does not change clinical workflows.

Stage 4: CISO Security Review

The security team reviews the model for adversarial vulnerabilities, data leakage risks (can the model memorize and regurgitate training data?), and infrastructure security of the deployment environment.

Stage 5: Deployment Approved

With all reviews complete, the model receives deployment approval. The approval record includes all reviewer decisions, timestamps, and the deployment strategy (canary, shadow, or full rollout).

Stage 6: Staged Rollout

The model is deployed using a canary strategy: 5% of traffic for 48 hours, then 25%, then 100%. At each stage, monitoring dashboards track prediction distribution, latency, error rates, and clinical outcome proxies. Any anomaly triggers an automatic rollback to the previous production model.

MLflow integrated with healthcare approval workflow showing staging, approvals, and deployment — MLflow serves as the model storage backend while a custom approval layer adds healthcare-specific governance gates.

Version Classification for Regulatory Compliance

Not every model update requires the same level of scrutiny. The FDA's PCCP framework distinguishes between changes that need a new regulatory submission and changes that can be managed through internal processes.

Change Type	Example	Risk Level	FDA Action	Internal Process
Data update only	Retrained on 6 months of new data, same architecture	Low	None (if within PCCP)	Full automated validation + clinical review
Hyperparameter tuning	Changed learning rate, batch size, regularization	Low	None (if within PCCP)	Automated validation + clinical sign-off
Feature engineering	Added new lab values or clinical features	Medium	Letter to file	Full pipeline + CISO review
Architecture change	Switched from logistic regression to neural network	High	New 510(k) likely	Full pipeline + external validation
New clinical indication	Expanded from readmission to mortality prediction	High	New 510(k) required	Full pipeline + new clinical studies

Audit Trail Requirements

FDA 21 CFR Part 11 governs electronic records and signatures for regulated industries. While originally designed for pharmaceutical manufacturing, its principles apply to healthcare AI model management. The key requirements are:

FDA audit trail checklist for healthcare AI model registry — Every model registry action must be logged with who, what, when, and why — creating an unbroken audit trail from training to deployment.

Immutability: Records cannot be modified or deleted after creation. Use append-only storage (event sourcing pattern).
Attribution: Every action is attributed to a specific user with authenticated identity.
Timestamping: All events have accurate, tamper-proof timestamps (use a trusted time source, not local system clock).
Reason for change: Every status transition must include a documented rationale.
Access controls: Role-based access ensures only authorized personnel can approve models or modify registry records.

Implementation Recommendations

Based on the patterns above, here is the recommended technology stack for a healthcare model registry:

Model storage: MLflow Model Registry (self-hosted on your HIPAA-compliant infrastructure, never SaaS)
Experiment tracking: MLflow Tracking Server with PostgreSQL backend for durability
Approval workflow: Custom service (Python/Go) that wraps MLflow with approval gates, built on an event-sourcing architecture for immutable audit trails
Notifications: Integration with Slack/Teams/email for review requests and approval notifications
Deployment: Integration with your Kubernetes-based ML serving infrastructure for staged rollouts
Monitoring: Post-deployment drift detection feeds back into the registry, triggering retraining workflows when performance degrades

Frequently Asked Questions

Do we really need a custom approval layer, or can MLflow's built-in staging suffice?

MLflow's staging (Staging/Production/Archived) is a model lifecycle management feature, not an approval workflow. It has no concept of multi-step reviews, role-based approvals, or audit trails. For non-regulated ML applications, MLflow staging is sufficient. For healthcare SaMD, you need the custom layer.

How does the PCCP framework affect model registry design?

The FDA's Predetermined Change Control Plan requires that you define in advance what types of changes your model will undergo, how you will validate them, and what evidence you will generate. Your model registry must classify every change according to the PCCP categories and automatically route it through the appropriate approval path.

Should model cards be static documents or dynamic registry entries?

Dynamic. A model card should be auto-generated from the registry metadata whenever requested, not manually maintained as a separate document. This ensures the model card always reflects the current state of the model and eliminates the risk of documentation drift.

How do you handle rollbacks when the previous model version had known issues?

Maintain a "rollback target" tag in the registry that identifies the last known-good version for each model. This may not always be the immediately preceding version — if v3 had bias issues and v4 introduced a regression, the rollback target might be v2. The rollback target is updated as part of the deployment approval process.

What happens when a model is retrained on federated data from multiple institutions?

The registry must track the data lineage for each participating institution separately: dataset version, sample size, and demographic breakdown per site. For more on this topic, see our guide on federated learning for healthcare. The model card should disclose which institutions contributed to training and how their data was weighted.

Conclusion

A model registry for healthcare AI is not a nice-to-have — it is a regulatory requirement for any organization deploying clinical AI models. The gap between what generic ML platforms offer and what healthcare demands is significant: multi-stage approval workflows, demographic-segmented validation, cryptographic audit trails, and FDA-compliant version control.

The pattern outlined here — MLflow as the storage and versioning backend, wrapped with a custom approval layer that enforces healthcare governance — provides a practical path forward. It leverages the best open-source tooling while adding the regulatory compliance layer that the healthcare industry demands. Organizations that build this infrastructure now will be positioned to scale their clinical AI programs with confidence, knowing that every model in production has been validated, reviewed, and approved through a process that satisfies both clinical stakeholders and regulatory bodies.

Was this article helpful?

Your feedback helps us improve our content.

USA Office - Elintex Technologies Inc.

India Office - Elintex Technologies Pvt. Ltd.