Medallion Architecture for Healthcare: Bronze, Silver, Gold with FHIR Data — A Complete Implementation Guide

Q: How long should I retain Bronze data?

For healthcare, retain Bronze data indefinitely (or at minimum 7 years to match HIPAA retention requirements). Bronze data is cheap to store (compressed Parquet on S3) and invaluable for reprocessing. CareEvolution recommends keeping raw data forever and only aging out Silver/Gold tables. Use cost-effective storage tiers (S3 Glacier) for historical Bronze data

Q: What is the right batch frequency for each layer?

Bronze: near-real-time (Auto Loader processes files within minutes of arrival). Silver: hourly or every 6 hours (transformations are compute-intensive). Gold: daily (aggregations can run overnight for next-morning dashboards). For real-time use cases like clinical alerting, add a streaming path that bypasses Gold and serves directly from Silver

March 16, 2026

16 min read

Why Medallion Architecture Is the Natural Fit for Healthcare Data

Healthcare data arrives messy. A single patient encounter generates data across 17 different systems on average — EHR, lab information system, radiology, pharmacy, billing, scheduling, and more. Each system speaks a different dialect: FHIR JSON from modern APIs, HL7v2 pipe-delimited messages from legacy interfaces, CSV flat files from claims clearinghouses, and scanned PDFs from faxed referrals.

The medallion architecture — Bronze, Silver, Gold — provides a systematic framework for transforming this chaos into analytics-ready datasets. Originally popularized by Databricks, this pattern has become the de facto standard for healthcare data lakehouse implementations because it mirrors how clinical data naturally progresses from raw to refined.

This guide provides a complete implementation reference with PySpark code, Delta Lake SQL, and data quality checks for each layer — using real FHIR data structures throughout.

Bronze Layer: Raw Data Ingestion — Accept Everything, Transform Nothing

Design Principles

The Bronze layer is your system of record — an append-only, immutable store of raw data exactly as it arrived from source systems. The governing principle is simple: accept everything, transform nothing. Every HL7v2 message, every FHIR NDJSON line, every claims CSV row is preserved with full source fidelity.

This matters in healthcare for three reasons:

Regulatory reprocessing: When CMS changes quality measure specifications mid-year, you need to recompute from raw data. If Bronze doesn't have the original values, you're re-extracting from production EHRs.
Audit trails: HIPAA audit requirements demand provable data lineage. Bronze provides the "here's exactly what we received, when, from whom" record.
Error recovery: When a Silver layer transformation has a bug (and it will), you can fix the logic and replay from Bronze without data loss.

Bronze Schema Design

Keep it minimal. The Bronze table schema should capture the raw payload plus metadata about its origin:

-- Bronze table: raw FHIR resources
CREATE TABLE IF NOT EXISTS bronze.fhir_resources (
  raw_json         STRING        COMMENT 'Complete FHIR resource JSON as received',
  resource_type    STRING        COMMENT 'FHIR resource type (Patient, Observation, etc.)',
  resource_id      STRING        COMMENT 'FHIR resource ID extracted for partitioning',
  source_system    STRING        COMMENT 'Source EHR system identifier',
  source_file      STRING        COMMENT 'Original filename or API endpoint',
  ingestion_ts     TIMESTAMP     COMMENT 'When this record was ingested',
  ingestion_date   DATE          COMMENT 'Partition key for date-based queries'
)
USING DELTA
PARTITIONED BY (ingestion_date, resource_type)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.logRetentionDuration' = 'interval 90 days'
);

-- Bronze table: raw HL7v2 messages
CREATE TABLE IF NOT EXISTS bronze.hl7v2_messages (
  raw_message      STRING        COMMENT 'Complete HL7v2 message with segment separators',
  message_type     STRING        COMMENT 'MSH-9 message type (ADT^A01, ORM^O01, etc.)',
  sending_facility STRING        COMMENT 'MSH-4 sending facility',
  message_control_id STRING      COMMENT 'MSH-10 unique message ID',
  message_datetime TIMESTAMP     COMMENT 'MSH-7 message timestamp',
  ingestion_ts     TIMESTAMP,
  ingestion_date   DATE
)
USING DELTA
PARTITIONED BY (ingestion_date, message_type);

PySpark: Ingesting FHIR Bulk Export Data

FHIR Bulk Data Access ($export) produces NDJSON files — one JSON object per line, one file per resource type. Here's how to ingest them into Bronze using Databricks Auto Loader for incremental processing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    current_timestamp, current_date, input_file_name,
    get_json_object, lit
)

spark = SparkSession.builder.appName("FHIR-Bronze-Ingestion").getOrCreate()

def ingest_fhir_to_bronze(source_path, checkpoint_path, source_system="epic"):
    raw_stream = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "text")
        .option("cloudFiles.schemaLocation", checkpoint_path + "/schema")
        .option("cloudFiles.inferColumnTypes", "false")
        .load(source_path)
    )

    bronze_stream = (
        raw_stream
        .withColumn("raw_json", raw_stream["value"])
        .withColumn("resource_type",
            get_json_object("value", "$.resourceType"))
        .withColumn("resource_id",
            get_json_object("value", "$.id"))
        .withColumn("source_system", lit(source_system))
        .withColumn("source_file", input_file_name())
        .withColumn("ingestion_ts", current_timestamp())
        .withColumn("ingestion_date", current_date())
        .select(
            "raw_json", "resource_type", "resource_id",
            "source_system", "source_file",
            "ingestion_ts", "ingestion_date"
        )
    )

    query = (
        bronze_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .partitionBy("ingestion_date", "resource_type")
        .trigger(availableNow=True)
        .toTable("bronze.fhir_resources")
    )
    return query

# Usage: ingest from FHIR bulk export
ingest_fhir_to_bronze(
    source_path="s3://healthcare-lake/fhir-exports/",
    checkpoint_path="s3://healthcare-lake/_checkpoints/fhir-bronze",
    source_system="epic-prod"
)

Silver Layer: Clean, Standardize, Deduplicate

Design Principles

The Silver layer is where raw data becomes trustworthy data. Three operations happen here:

Patient deduplication via Master Patient Index (MPI) matching
Code standardization — mapping local/proprietary codes to standard terminologies (SNOMED CT, LOINC, RxNorm)
Data quality validation — enforcing business rules, reference ranges, and referential integrity

The Silver layer also flattens nested FHIR JSON into relational tables that analysts and downstream systems can query efficiently. A single FHIR Patient resource with nested arrays for names, addresses, telecoms, and identifiers becomes a set of normalized tables.

Patient Deduplication: MPI Matching

Duplicate patients are the most expensive data quality problem in healthcare. A 2022 AHIMA study found that 8-12% of patient records in a typical hospital are duplicates, costing an average of $1,950 per duplicate in downstream clinical and billing errors.

from pyspark.sql.functions import (
    col, when, levenshtein, length, greatest,
    concat_ws, countDistinct, last, date_sub
)

def deduplicate_patients(bronze_patients_df):
    # Step 1: Parse FHIR Patient JSON into columns
    patients = bronze_patients_df.select(
        get_json_object("raw_json", "$.id").alias("source_patient_id"),
        get_json_object("raw_json", "$.name[0].family").alias("last_name"),
        get_json_object("raw_json", "$.name[0].given[0]").alias("first_name"),
        get_json_object("raw_json", "$.birthDate").alias("birth_date"),
        get_json_object("raw_json", "$.address[0].postalCode").alias("zip_code"),
        get_json_object("raw_json", "$.address[0].city").alias("city"),
        col("source_system")
    )

    # Step 2: Block on DOB + first letter of last name
    patients = patients.withColumn(
        "block_key",
        concat_ws("-", col("birth_date"), col("last_name").substr(1, 1))
    )

    # Step 3: Self-join within blocks for candidate pairs
    p1 = patients.alias("p1")
    p2 = patients.alias("p2")
    candidates = p1.join(
        p2,
        (col("p1.block_key") == col("p2.block_key")) &
        (col("p1.source_patient_id") < col("p2.source_patient_id"))
    )

    # Step 4: Compute similarity scores
    matched = candidates.withColumn(
        "name_score",
        1.0 - (levenshtein(col("p1.last_name"), col("p2.last_name")).cast("double") /
               greatest(length(col("p1.last_name")), length(col("p2.last_name"))).cast("double"))
    ).withColumn(
        "dob_score",
        when(col("p1.birth_date") == col("p2.birth_date"), 1.0).otherwise(0.0)
    ).withColumn(
        "composite_score",
        col("name_score") * 0.4 + col("dob_score") * 0.35 +
        when(col("p1.zip_code") == col("p2.zip_code"), 0.25).otherwise(0.0)
    )

    # Step 5: Flag matches above threshold
    MATCH_THRESHOLD = 0.85
    return matched.filter(col("composite_score") >= MATCH_THRESHOLD)

Code Standardization: Mapping to SNOMED/LOINC

Healthcare organizations use a mix of standard and proprietary codes. A lab result might come as a local code ("GLU-FAST") from one system and a LOINC code ("2345-7") from another. The Silver layer maps everything to canonical terminologies:

loinc_map = spark.table("reference.local_to_loinc_mapping")

def standardize_observation_codes(observations_df):
    standardized = (
        observations_df
        .join(loinc_map,
            (observations_df["code_system"] == loinc_map["source_system"]) &
            (observations_df["code_value"] == loinc_map["source_code"]),
            "left"
        )
        .withColumn("standard_code",
            coalesce(loinc_map["loinc_code"], col("code_value")))
        .withColumn("standard_system",
            when(loinc_map["loinc_code"].isNotNull(), lit("http://loinc.org"))
            .otherwise(col("code_system")))
        .withColumn("mapping_status",
            when(loinc_map["loinc_code"].isNotNull(), "mapped")
            .otherwise("unmapped"))
    )

    # Log unmapped codes for terminology team review
    unmapped = standardized.filter(col("mapping_status") == "unmapped")
    if unmapped.count() > 0:
        unmapped.select("code_system", "code_value").distinct() \
            .write.mode("append").saveAsTable("quality.unmapped_codes")

    return standardized

FHIR Resource Flattening: JSON to Relational Tables

The most labor-intensive Silver transformation is flattening nested FHIR resources into relational tables. Here's the Patient resource transformation:

def flatten_fhir_patient(bronze_df):
    patients = bronze_df.filter(col("resource_type") == "Patient")
    silver_patients = patients.select(
        get_json_object("raw_json", "$.id").alias("patient_id"),
        get_json_object("raw_json", "$.name[0].family").alias("last_name"),
        get_json_object("raw_json", "$.name[0].given[0]").alias("first_name"),
        get_json_object("raw_json", "$.birthDate").cast("date").alias("birth_date"),
        get_json_object("raw_json", "$.gender").alias("gender"),
        get_json_object("raw_json", "$.address[0].line[0]").alias("address_line_1"),
        get_json_object("raw_json", "$.address[0].city").alias("city"),
        get_json_object("raw_json", "$.address[0].state").alias("state"),
        get_json_object("raw_json", "$.address[0].postalCode").alias("zip_code"),
        get_json_object("raw_json", "$.telecom[0].value").alias("phone"),
        col("source_system"),
        current_timestamp().alias("silver_processed_ts")
    )
    silver_patients.write.format("delta").mode("overwrite") \
        .saveAsTable("silver.patients")
    return silver_patients

Data Quality Checks Between Layers

Data quality gates between layers prevent bad data from propagating. Each transition has specific validation rules:

class DataQualityChecker:
    def __init__(self, spark):
        self.spark = spark
        self.results = []

    def check_bronze_to_silver(self):
        bronze_count = self.spark.table("bronze.fhir_resources") \
            .filter(col("resource_type") == "Patient").count()
        silver_count = self.spark.table("silver.patients").count()

        # Deduplication rate should be < 15%
        dedup_rate = 1 - (silver_count / max(bronze_count, 1))
        self.results.append({
            "check": "dedup_rate", "value": f"{dedup_rate:.1%}",
            "threshold": "15%",
            "status": "PASS" if dedup_rate < 0.15 else "FAIL"
        })

        # Null rate checks for critical fields
        patients = self.spark.table("silver.patients")
        total = patients.count()
        for field, threshold in {"patient_id": 0.0, "last_name": 0.02, "birth_date": 0.01}.items():
            null_rate = patients.filter(col(field).isNull()).count() / total
            self.results.append({
                "check": f"null_rate_{field}", "value": f"{null_rate:.3%}",
                "threshold": f"{threshold:.1%}",
                "status": "PASS" if null_rate <= threshold else "FAIL"
            })

        return self.results

Gold Layer: Analytics-Ready Datasets

Design Principles

Gold tables are purpose-built for specific business questions. Unlike Silver tables (which mirror source data structure), Gold tables are shaped by the consumer: BI dashboards need pre-aggregated metrics, ML models need feature tables, and quality reporting needs measure-specific denominators and numerators.

Gold Table 1: Readmission Risk Feature Table

def build_readmission_features():
    patients = spark.table("silver.patients")
    encounters = spark.table("silver.encounters")
    conditions = spark.table("silver.conditions")

    # Feature: encounter history
    encounter_features = encounters.groupBy("patient_ref").agg(
        count("*").alias("total_encounters"),
        spark_sum(when(col("encounter_class") == "emergency", 1)
            .otherwise(0)).alias("ed_visits_total"),
        spark_sum(when(
            (col("encounter_class") == "inpatient") &
            (col("start_time") >= date_sub(current_date(), 365)), 1
        ).otherwise(0)).alias("inpatient_admits_12m"),
        avg((col("end_time").cast("long") - col("start_time").cast("long")) / 86400
        ).alias("avg_los_days")
    )

    # Feature: comorbidity count
    comorbidity_features = conditions.groupBy("patient_ref").agg(
        countDistinct("condition_code").alias("unique_conditions"),
        spark_sum(when(col("condition_code").isin("E11", "E10"), 1)
            .otherwise(0)).alias("has_diabetes"),
        spark_sum(when(col("condition_code").isin("I50", "I50.1"), 1)
            .otherwise(0)).alias("has_chf")
    )

    # Join all features
    features = patients \
        .join(encounter_features, patients["patient_id"] == encounter_features["patient_ref"], "left") \
        .join(comorbidity_features, patients["patient_id"] == comorbidity_features["patient_ref"], "left") \
        .select("patient_id", "birth_date", "gender", "total_encounters",
                "ed_visits_total", "inpatient_admits_12m", "avg_los_days",
                "unique_conditions", "has_diabetes", "has_chf",
                current_timestamp().alias("feature_computed_ts"))

    features.write.format("delta").mode("overwrite") \
        .saveAsTable("gold.readmission_features")
    return features

Gold Table 2: HEDIS Quality Measures

-- HEDIS Controlling High Blood Pressure (CBP) measure
CREATE OR REPLACE TABLE gold.hedis_cbp_measure AS
WITH denominator AS (
  SELECT DISTINCT p.patient_id, p.birth_date
  FROM silver.patients p
  JOIN silver.conditions c ON p.patient_id = c.patient_ref
  WHERE c.condition_code IN ('I10', 'I11', 'I12', 'I13', 'I15')
    AND DATEDIFF(CURRENT_DATE, p.birth_date) / 365.25 BETWEEN 18 AND 85
),
latest_bp AS (
  SELECT patient_ref,
    MAX(CASE WHEN standard_code = '8480-6' THEN value_quantity END) as systolic,
    MAX(CASE WHEN standard_code = '8462-4' THEN value_quantity END) as diastolic,
    MAX(effective_date) as measurement_date
  FROM silver.observations
  WHERE standard_code IN ('8480-6', '8462-4')
    AND effective_date >= DATE_SUB(CURRENT_DATE, 365)
  GROUP BY patient_ref
)
SELECT d.patient_id, bp.systolic, bp.diastolic,
  CASE
    WHEN bp.systolic < 140 AND bp.diastolic < 90 THEN 'numerator'
    WHEN bp.systolic IS NULL THEN 'no_reading'
    ELSE 'not_controlled'
  END AS measure_status
FROM denominator d
LEFT JOIN latest_bp bp ON d.patient_id = bp.patient_ref;

Gold Table 3: Cost Analysis by DRG

CREATE OR REPLACE TABLE gold.cost_by_drg AS
SELECT
  e.drg_code, e.drg_description,
  COUNT(*) AS encounter_count,
  AVG(c.total_charges) AS avg_charges,
  PERCENTILE(c.total_charges, 0.5) AS median_charges,
  AVG(DATEDIFF(e.end_time, e.start_time)) AS avg_los_days,
  STDDEV(c.total_charges) AS charge_stddev
FROM silver.encounters e
JOIN silver.claims c ON e.encounter_id = c.encounter_ref
WHERE e.encounter_class = 'inpatient'
  AND e.start_time >= DATE_SUB(CURRENT_DATE, 365)
GROUP BY e.drg_code, e.drg_description
ORDER BY encounter_count DESC;

End-to-End Pipeline: From EHR to Dashboard

Here is how all the pieces connect in a production Databricks deployment. This orchestration script runs as a scheduled Databricks Workflow:

def run_medallion_pipeline():
    # Step 1: Bronze ingestion (Auto Loader - incremental)
    print("=== BRONZE: Ingesting new FHIR data ===")
    ingest_fhir_to_bronze(
        source_path="s3://healthcare-lake/fhir-exports/",
        checkpoint_path="s3://healthcare-lake/_checkpoints/fhir-bronze"
    )

    # Step 2: Silver transformations
    print("=== SILVER: Flattening and standardizing ===")
    bronze_df = spark.table("bronze.fhir_resources")
    flatten_fhir_patient(bronze_df)
    standardize_observation_codes(spark.table("silver.observations"))
    deduplicate_patients(bronze_df)

    # Step 3: Data quality gates
    print("=== QUALITY: Running validation checks ===")
    qc = DataQualityChecker(spark)
    results = qc.check_bronze_to_silver()
    failures = [r for r in results if r["status"] == "FAIL"]
    if failures:
        raise Exception(f"Quality gate failed: {failures}")

    # Step 4: Gold aggregations
    print("=== GOLD: Building analytics tables ===")
    build_readmission_features()
    print("=== PIPELINE COMPLETE ===")

run_medallion_pipeline()

FAQ: Medallion Architecture for Healthcare

How long should I retain Bronze data?

For healthcare, retain Bronze data indefinitely (or at minimum 7 years to match HIPAA retention requirements). Bronze data is cheap to store (compressed Parquet on S3) and invaluable for reprocessing. CareEvolution recommends keeping raw data forever and only aging out Silver/Gold tables. Use cost-effective storage tiers (S3 Glacier) for historical Bronze data.

Should Silver tables mirror FHIR resource types?

Yes, for the initial flattening. Create one Silver table per FHIR resource type (Patient, Encounter, Observation, Condition, MedicationRequest). This preserves the FHIR data model semantics while making data SQL-queryable. Gold tables then reshape this data for specific analytical needs.

How do I handle FHIR extensions in the Silver layer?

FHIR extensions should be extracted into dedicated columns if analytically relevant (e.g., Epic's race/ethnicity extensions), or stored in a JSON column called extensions_json for ad-hoc access. Don't lose them — they often contain the most clinically interesting data. Reference the FHIR resource mapping matrix for common extension patterns.

What is the right batch frequency for each layer?

Bronze: near-real-time (Auto Loader processes files within minutes of arrival). Silver: hourly or every 6 hours (transformations are compute-intensive). Gold: daily (aggregations can run overnight for next-morning dashboards). For real-time use cases like clinical alerting, add a streaming path that bypasses Gold and serves directly from Silver.

Can I use this with non-Databricks platforms?

Absolutely. The medallion pattern works with any lakehouse platform. Apache Iceberg on AWS (EMR + Glue) is a strong alternative. The key components — Delta Lake/Iceberg for ACID, PySpark for transforms, and a catalog for governance — are available across clouds.

Conclusion: Start With Bronze, Iterate Upward

The medallion architecture is not built in a day. Start by landing raw FHIR data in Bronze — this takes days, not months. Then iterate on Silver transformations as you identify which data quality issues actually impact downstream consumers. Build Gold tables on demand as specific analytics use cases emerge.

The biggest mistake organizations make is trying to design the perfect Silver schema before writing a single line of code. Bronze is forgiving — start there, learn from the data, and refine upward.

At Nirmitee, we have implemented medallion architectures for healthcare organizations processing millions of FHIR resources daily. If you are planning a healthcare data platform, reach out for an architecture review — we will help you design a medallion pipeline that scales with your data maturity.

Loading blogs...

Medallion Architecture for Healthcare: Bronze, Silver, Gold with FHIR Data — A Complete Implementation Guide

March 16, 2026

16 min read

Why Medallion Architecture Is the Natural Fit for Healthcare Data

This guide provides a complete implementation reference with PySpark code, Delta Lake SQL, and data quality checks for each layer — using real FHIR data structures throughout.

Bronze Layer: Raw Data Ingestion — Accept Everything, Transform Nothing

Design Principles

This matters in healthcare for three reasons:

Regulatory reprocessing: When CMS changes quality measure specifications mid-year, you need to recompute from raw data. If Bronze doesn't have the original values, you're re-extracting from production EHRs.
Audit trails: HIPAA audit requirements demand provable data lineage. Bronze provides the "here's exactly what we received, when, from whom" record.
Error recovery: When a Silver layer transformation has a bug (and it will), you can fix the logic and replay from Bronze without data loss.

Bronze Schema Design

Keep it minimal. The Bronze table schema should capture the raw payload plus metadata about its origin:

-- Bronze table: raw FHIR resources
CREATE TABLE IF NOT EXISTS bronze.fhir_resources (
  raw_json         STRING        COMMENT 'Complete FHIR resource JSON as received',
  resource_type    STRING        COMMENT 'FHIR resource type (Patient, Observation, etc.)',
  resource_id      STRING        COMMENT 'FHIR resource ID extracted for partitioning',
  source_system    STRING        COMMENT 'Source EHR system identifier',
  source_file      STRING        COMMENT 'Original filename or API endpoint',
  ingestion_ts     TIMESTAMP     COMMENT 'When this record was ingested',
  ingestion_date   DATE          COMMENT 'Partition key for date-based queries'
)
USING DELTA
PARTITIONED BY (ingestion_date, resource_type)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.logRetentionDuration' = 'interval 90 days'
);

-- Bronze table: raw HL7v2 messages
CREATE TABLE IF NOT EXISTS bronze.hl7v2_messages (
  raw_message      STRING        COMMENT 'Complete HL7v2 message with segment separators',
  message_type     STRING        COMMENT 'MSH-9 message type (ADT^A01, ORM^O01, etc.)',
  sending_facility STRING        COMMENT 'MSH-4 sending facility',
  message_control_id STRING      COMMENT 'MSH-10 unique message ID',
  message_datetime TIMESTAMP     COMMENT 'MSH-7 message timestamp',
  ingestion_ts     TIMESTAMP,
  ingestion_date   DATE
)
USING DELTA
PARTITIONED BY (ingestion_date, message_type);

PySpark: Ingesting FHIR Bulk Export Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    current_timestamp, current_date, input_file_name,
    get_json_object, lit
)

spark = SparkSession.builder.appName("FHIR-Bronze-Ingestion").getOrCreate()

def ingest_fhir_to_bronze(source_path, checkpoint_path, source_system="epic"):
    raw_stream = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "text")
        .option("cloudFiles.schemaLocation", checkpoint_path + "/schema")
        .option("cloudFiles.inferColumnTypes", "false")
        .load(source_path)
    )

    bronze_stream = (
        raw_stream
        .withColumn("raw_json", raw_stream["value"])
        .withColumn("resource_type",
            get_json_object("value", "$.resourceType"))
        .withColumn("resource_id",
            get_json_object("value", "$.id"))
        .withColumn("source_system", lit(source_system))
        .withColumn("source_file", input_file_name())
        .withColumn("ingestion_ts", current_timestamp())
        .withColumn("ingestion_date", current_date())
        .select(
            "raw_json", "resource_type", "resource_id",
            "source_system", "source_file",
            "ingestion_ts", "ingestion_date"
        )
    )

    query = (
        bronze_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .partitionBy("ingestion_date", "resource_type")
        .trigger(availableNow=True)
        .toTable("bronze.fhir_resources")
    )
    return query

# Usage: ingest from FHIR bulk export
ingest_fhir_to_bronze(
    source_path="s3://healthcare-lake/fhir-exports/",
    checkpoint_path="s3://healthcare-lake/_checkpoints/fhir-bronze",
    source_system="epic-prod"
)

Silver Layer: Clean, Standardize, Deduplicate

Design Principles

The Silver layer is where raw data becomes trustworthy data. Three operations happen here:

Patient deduplication via Master Patient Index (MPI) matching
Code standardization — mapping local/proprietary codes to standard terminologies (SNOMED CT, LOINC, RxNorm)
Data quality validation — enforcing business rules, reference ranges, and referential integrity

Patient Deduplication: MPI Matching

from pyspark.sql.functions import (
    col, when, levenshtein, length, greatest,
    concat_ws, countDistinct, last, date_sub
)

def deduplicate_patients(bronze_patients_df):
    # Step 1: Parse FHIR Patient JSON into columns
    patients = bronze_patients_df.select(
        get_json_object("raw_json", "$.id").alias("source_patient_id"),
        get_json_object("raw_json", "$.name[0].family").alias("last_name"),
        get_json_object("raw_json", "$.name[0].given[0]").alias("first_name"),
        get_json_object("raw_json", "$.birthDate").alias("birth_date"),
        get_json_object("raw_json", "$.address[0].postalCode").alias("zip_code"),
        get_json_object("raw_json", "$.address[0].city").alias("city"),
        col("source_system")
    )

    # Step 2: Block on DOB + first letter of last name
    patients = patients.withColumn(
        "block_key",
        concat_ws("-", col("birth_date"), col("last_name").substr(1, 1))
    )

    # Step 3: Self-join within blocks for candidate pairs
    p1 = patients.alias("p1")
    p2 = patients.alias("p2")
    candidates = p1.join(
        p2,
        (col("p1.block_key") == col("p2.block_key")) &
        (col("p1.source_patient_id") < col("p2.source_patient_id"))
    )

    # Step 4: Compute similarity scores
    matched = candidates.withColumn(
        "name_score",
        1.0 - (levenshtein(col("p1.last_name"), col("p2.last_name")).cast("double") /
               greatest(length(col("p1.last_name")), length(col("p2.last_name"))).cast("double"))
    ).withColumn(
        "dob_score",
        when(col("p1.birth_date") == col("p2.birth_date"), 1.0).otherwise(0.0)
    ).withColumn(
        "composite_score",
        col("name_score") * 0.4 + col("dob_score") * 0.35 +
        when(col("p1.zip_code") == col("p2.zip_code"), 0.25).otherwise(0.0)
    )

    # Step 5: Flag matches above threshold
    MATCH_THRESHOLD = 0.85
    return matched.filter(col("composite_score") >= MATCH_THRESHOLD)

Code Standardization: Mapping to SNOMED/LOINC

loinc_map = spark.table("reference.local_to_loinc_mapping")

def standardize_observation_codes(observations_df):
    standardized = (
        observations_df
        .join(loinc_map,
            (observations_df["code_system"] == loinc_map["source_system"]) &
            (observations_df["code_value"] == loinc_map["source_code"]),
            "left"
        )
        .withColumn("standard_code",
            coalesce(loinc_map["loinc_code"], col("code_value")))
        .withColumn("standard_system",
            when(loinc_map["loinc_code"].isNotNull(), lit("http://loinc.org"))
            .otherwise(col("code_system")))
        .withColumn("mapping_status",
            when(loinc_map["loinc_code"].isNotNull(), "mapped")
            .otherwise("unmapped"))
    )

    # Log unmapped codes for terminology team review
    unmapped = standardized.filter(col("mapping_status") == "unmapped")
    if unmapped.count() > 0:
        unmapped.select("code_system", "code_value").distinct() \
            .write.mode("append").saveAsTable("quality.unmapped_codes")

    return standardized

FHIR Resource Flattening: JSON to Relational Tables

The most labor-intensive Silver transformation is flattening nested FHIR resources into relational tables. Here's the Patient resource transformation:

def flatten_fhir_patient(bronze_df):
    patients = bronze_df.filter(col("resource_type") == "Patient")
    silver_patients = patients.select(
        get_json_object("raw_json", "$.id").alias("patient_id"),
        get_json_object("raw_json", "$.name[0].family").alias("last_name"),
        get_json_object("raw_json", "$.name[0].given[0]").alias("first_name"),
        get_json_object("raw_json", "$.birthDate").cast("date").alias("birth_date"),
        get_json_object("raw_json", "$.gender").alias("gender"),
        get_json_object("raw_json", "$.address[0].line[0]").alias("address_line_1"),
        get_json_object("raw_json", "$.address[0].city").alias("city"),
        get_json_object("raw_json", "$.address[0].state").alias("state"),
        get_json_object("raw_json", "$.address[0].postalCode").alias("zip_code"),
        get_json_object("raw_json", "$.telecom[0].value").alias("phone"),
        col("source_system"),
        current_timestamp().alias("silver_processed_ts")
    )
    silver_patients.write.format("delta").mode("overwrite") \
        .saveAsTable("silver.patients")
    return silver_patients

Data Quality Checks Between Layers

Data quality gates between layers prevent bad data from propagating. Each transition has specific validation rules:

class DataQualityChecker:
    def __init__(self, spark):
        self.spark = spark
        self.results = []

    def check_bronze_to_silver(self):
        bronze_count = self.spark.table("bronze.fhir_resources") \
            .filter(col("resource_type") == "Patient").count()
        silver_count = self.spark.table("silver.patients").count()

        # Deduplication rate should be < 15%
        dedup_rate = 1 - (silver_count / max(bronze_count, 1))
        self.results.append({
            "check": "dedup_rate", "value": f"{dedup_rate:.1%}",
            "threshold": "15%",
            "status": "PASS" if dedup_rate < 0.15 else "FAIL"
        })

        # Null rate checks for critical fields
        patients = self.spark.table("silver.patients")
        total = patients.count()
        for field, threshold in {"patient_id": 0.0, "last_name": 0.02, "birth_date": 0.01}.items():
            null_rate = patients.filter(col(field).isNull()).count() / total
            self.results.append({
                "check": f"null_rate_{field}", "value": f"{null_rate:.3%}",
                "threshold": f"{threshold:.1%}",
                "status": "PASS" if null_rate <= threshold else "FAIL"
            })

        return self.results

Gold Layer: Analytics-Ready Datasets

Design Principles

Gold Table 1: Readmission Risk Feature Table

def build_readmission_features():
    patients = spark.table("silver.patients")
    encounters = spark.table("silver.encounters")
    conditions = spark.table("silver.conditions")

    # Feature: encounter history
    encounter_features = encounters.groupBy("patient_ref").agg(
        count("*").alias("total_encounters"),
        spark_sum(when(col("encounter_class") == "emergency", 1)
            .otherwise(0)).alias("ed_visits_total"),
        spark_sum(when(
            (col("encounter_class") == "inpatient") &
            (col("start_time") >= date_sub(current_date(), 365)), 1
        ).otherwise(0)).alias("inpatient_admits_12m"),
        avg((col("end_time").cast("long") - col("start_time").cast("long")) / 86400
        ).alias("avg_los_days")
    )

    # Feature: comorbidity count
    comorbidity_features = conditions.groupBy("patient_ref").agg(
        countDistinct("condition_code").alias("unique_conditions"),
        spark_sum(when(col("condition_code").isin("E11", "E10"), 1)
            .otherwise(0)).alias("has_diabetes"),
        spark_sum(when(col("condition_code").isin("I50", "I50.1"), 1)
            .otherwise(0)).alias("has_chf")
    )

    # Join all features
    features = patients \
        .join(encounter_features, patients["patient_id"] == encounter_features["patient_ref"], "left") \
        .join(comorbidity_features, patients["patient_id"] == comorbidity_features["patient_ref"], "left") \
        .select("patient_id", "birth_date", "gender", "total_encounters",
                "ed_visits_total", "inpatient_admits_12m", "avg_los_days",
                "unique_conditions", "has_diabetes", "has_chf",
                current_timestamp().alias("feature_computed_ts"))

    features.write.format("delta").mode("overwrite") \
        .saveAsTable("gold.readmission_features")
    return features

Gold Table 2: HEDIS Quality Measures

-- HEDIS Controlling High Blood Pressure (CBP) measure
CREATE OR REPLACE TABLE gold.hedis_cbp_measure AS
WITH denominator AS (
  SELECT DISTINCT p.patient_id, p.birth_date
  FROM silver.patients p
  JOIN silver.conditions c ON p.patient_id = c.patient_ref
  WHERE c.condition_code IN ('I10', 'I11', 'I12', 'I13', 'I15')
    AND DATEDIFF(CURRENT_DATE, p.birth_date) / 365.25 BETWEEN 18 AND 85
),
latest_bp AS (
  SELECT patient_ref,
    MAX(CASE WHEN standard_code = '8480-6' THEN value_quantity END) as systolic,
    MAX(CASE WHEN standard_code = '8462-4' THEN value_quantity END) as diastolic,
    MAX(effective_date) as measurement_date
  FROM silver.observations
  WHERE standard_code IN ('8480-6', '8462-4')
    AND effective_date >= DATE_SUB(CURRENT_DATE, 365)
  GROUP BY patient_ref
)
SELECT d.patient_id, bp.systolic, bp.diastolic,
  CASE
    WHEN bp.systolic < 140 AND bp.diastolic < 90 THEN 'numerator'
    WHEN bp.systolic IS NULL THEN 'no_reading'
    ELSE 'not_controlled'
  END AS measure_status
FROM denominator d
LEFT JOIN latest_bp bp ON d.patient_id = bp.patient_ref;

Gold Table 3: Cost Analysis by DRG

CREATE OR REPLACE TABLE gold.cost_by_drg AS
SELECT
  e.drg_code, e.drg_description,
  COUNT(*) AS encounter_count,
  AVG(c.total_charges) AS avg_charges,
  PERCENTILE(c.total_charges, 0.5) AS median_charges,
  AVG(DATEDIFF(e.end_time, e.start_time)) AS avg_los_days,
  STDDEV(c.total_charges) AS charge_stddev
FROM silver.encounters e
JOIN silver.claims c ON e.encounter_id = c.encounter_ref
WHERE e.encounter_class = 'inpatient'
  AND e.start_time >= DATE_SUB(CURRENT_DATE, 365)
GROUP BY e.drg_code, e.drg_description
ORDER BY encounter_count DESC;

End-to-End Pipeline: From EHR to Dashboard

Here is how all the pieces connect in a production Databricks deployment. This orchestration script runs as a scheduled Databricks Workflow:

def run_medallion_pipeline():
    # Step 1: Bronze ingestion (Auto Loader - incremental)
    print("=== BRONZE: Ingesting new FHIR data ===")
    ingest_fhir_to_bronze(
        source_path="s3://healthcare-lake/fhir-exports/",
        checkpoint_path="s3://healthcare-lake/_checkpoints/fhir-bronze"
    )

    # Step 2: Silver transformations
    print("=== SILVER: Flattening and standardizing ===")
    bronze_df = spark.table("bronze.fhir_resources")
    flatten_fhir_patient(bronze_df)
    standardize_observation_codes(spark.table("silver.observations"))
    deduplicate_patients(bronze_df)

    # Step 3: Data quality gates
    print("=== QUALITY: Running validation checks ===")
    qc = DataQualityChecker(spark)
    results = qc.check_bronze_to_silver()
    failures = [r for r in results if r["status"] == "FAIL"]
    if failures:
        raise Exception(f"Quality gate failed: {failures}")

    # Step 4: Gold aggregations
    print("=== GOLD: Building analytics tables ===")
    build_readmission_features()
    print("=== PIPELINE COMPLETE ===")

run_medallion_pipeline()