Why Medallion Architecture Is the Natural Fit for Healthcare Data
Healthcare data arrives messy. A single patient encounter generates data across 17 different systems on average — EHR, lab information system, radiology, pharmacy, billing, scheduling, and more. Each system speaks a different dialect: FHIR JSON from modern APIs, HL7v2 pipe-delimited messages from legacy interfaces, CSV flat files from claims clearinghouses, and scanned PDFs from faxed referrals.
The medallion architecture — Bronze, Silver, Gold — provides a systematic framework for transforming this chaos into analytics-ready datasets. Originally popularized by Databricks, this pattern has become the de facto standard for healthcare data lakehouse implementations because it mirrors how clinical data naturally progresses from raw to refined.
This guide provides a complete implementation reference with PySpark code, Delta Lake SQL, and data quality checks for each layer — using real FHIR data structures throughout.
Bronze Layer: Raw Data Ingestion — Accept Everything, Transform Nothing
Design Principles
The Bronze layer is your system of record — an append-only, immutable store of raw data exactly as it arrived from source systems. The governing principle is simple: accept everything, transform nothing. Every HL7v2 message, every FHIR NDJSON line, every claims CSV row is preserved with full source fidelity.
This matters in healthcare for three reasons:
- Regulatory reprocessing: When CMS changes quality measure specifications mid-year, you need to recompute from raw data. If Bronze doesn't have the original values, you're re-extracting from production EHRs.
- Audit trails: HIPAA audit requirements demand provable data lineage. Bronze provides the "here's exactly what we received, when, from whom" record.
- Error recovery: When a Silver layer transformation has a bug (and it will), you can fix the logic and replay from Bronze without data loss.
Bronze Schema Design
Keep it minimal. The Bronze table schema should capture the raw payload plus metadata about its origin:
-- Bronze table: raw FHIR resources
CREATE TABLE IF NOT EXISTS bronze.fhir_resources (
raw_json STRING COMMENT 'Complete FHIR resource JSON as received',
resource_type STRING COMMENT 'FHIR resource type (Patient, Observation, etc.)',
resource_id STRING COMMENT 'FHIR resource ID extracted for partitioning',
source_system STRING COMMENT 'Source EHR system identifier',
source_file STRING COMMENT 'Original filename or API endpoint',
ingestion_ts TIMESTAMP COMMENT 'When this record was ingested',
ingestion_date DATE COMMENT 'Partition key for date-based queries'
)
USING DELTA
PARTITIONED BY (ingestion_date, resource_type)
TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.logRetentionDuration' = 'interval 90 days'
);
-- Bronze table: raw HL7v2 messages
CREATE TABLE IF NOT EXISTS bronze.hl7v2_messages (
raw_message STRING COMMENT 'Complete HL7v2 message with segment separators',
message_type STRING COMMENT 'MSH-9 message type (ADT^A01, ORM^O01, etc.)',
sending_facility STRING COMMENT 'MSH-4 sending facility',
message_control_id STRING COMMENT 'MSH-10 unique message ID',
message_datetime TIMESTAMP COMMENT 'MSH-7 message timestamp',
ingestion_ts TIMESTAMP,
ingestion_date DATE
)
USING DELTA
PARTITIONED BY (ingestion_date, message_type); PySpark: Ingesting FHIR Bulk Export Data
FHIR Bulk Data Access ($export) produces NDJSON files — one JSON object per line, one file per resource type. Here's how to ingest them into Bronze using Databricks Auto Loader for incremental processing:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
current_timestamp, current_date, input_file_name,
get_json_object, lit
)
spark = SparkSession.builder.appName("FHIR-Bronze-Ingestion").getOrCreate()
def ingest_fhir_to_bronze(source_path, checkpoint_path, source_system="epic"):
raw_stream = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "text")
.option("cloudFiles.schemaLocation", checkpoint_path + "/schema")
.option("cloudFiles.inferColumnTypes", "false")
.load(source_path)
)
bronze_stream = (
raw_stream
.withColumn("raw_json", raw_stream["value"])
.withColumn("resource_type",
get_json_object("value", "$.resourceType"))
.withColumn("resource_id",
get_json_object("value", "$.id"))
.withColumn("source_system", lit(source_system))
.withColumn("source_file", input_file_name())
.withColumn("ingestion_ts", current_timestamp())
.withColumn("ingestion_date", current_date())
.select(
"raw_json", "resource_type", "resource_id",
"source_system", "source_file",
"ingestion_ts", "ingestion_date"
)
)
query = (
bronze_stream.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", checkpoint_path)
.partitionBy("ingestion_date", "resource_type")
.trigger(availableNow=True)
.toTable("bronze.fhir_resources")
)
return query
# Usage: ingest from FHIR bulk export
ingest_fhir_to_bronze(
source_path="s3://healthcare-lake/fhir-exports/",
checkpoint_path="s3://healthcare-lake/_checkpoints/fhir-bronze",
source_system="epic-prod"
) Silver Layer: Clean, Standardize, Deduplicate
Design Principles
The Silver layer is where raw data becomes trustworthy data. Three operations happen here:
- Patient deduplication via Master Patient Index (MPI) matching
- Code standardization — mapping local/proprietary codes to standard terminologies (SNOMED CT, LOINC, RxNorm)
- Data quality validation — enforcing business rules, reference ranges, and referential integrity
The Silver layer also flattens nested FHIR JSON into relational tables that analysts and downstream systems can query efficiently. A single FHIR Patient resource with nested arrays for names, addresses, telecoms, and identifiers becomes a set of normalized tables.
Patient Deduplication: MPI Matching
Duplicate patients are the most expensive data quality problem in healthcare. A 2022 AHIMA study found that 8-12% of patient records in a typical hospital are duplicates, costing an average of $1,950 per duplicate in downstream clinical and billing errors.
from pyspark.sql.functions import (
col, when, levenshtein, length, greatest,
concat_ws, countDistinct, last, date_sub
)
def deduplicate_patients(bronze_patients_df):
# Step 1: Parse FHIR Patient JSON into columns
patients = bronze_patients_df.select(
get_json_object("raw_json", "$.id").alias("source_patient_id"),
get_json_object("raw_json", "$.name[0].family").alias("last_name"),
get_json_object("raw_json", "$.name[0].given[0]").alias("first_name"),
get_json_object("raw_json", "$.birthDate").alias("birth_date"),
get_json_object("raw_json", "$.address[0].postalCode").alias("zip_code"),
get_json_object("raw_json", "$.address[0].city").alias("city"),
col("source_system")
)
# Step 2: Block on DOB + first letter of last name
patients = patients.withColumn(
"block_key",
concat_ws("-", col("birth_date"), col("last_name").substr(1, 1))
)
# Step 3: Self-join within blocks for candidate pairs
p1 = patients.alias("p1")
p2 = patients.alias("p2")
candidates = p1.join(
p2,
(col("p1.block_key") == col("p2.block_key")) &
(col("p1.source_patient_id") < col("p2.source_patient_id"))
)
# Step 4: Compute similarity scores
matched = candidates.withColumn(
"name_score",
1.0 - (levenshtein(col("p1.last_name"), col("p2.last_name")).cast("double") /
greatest(length(col("p1.last_name")), length(col("p2.last_name"))).cast("double"))
).withColumn(
"dob_score",
when(col("p1.birth_date") == col("p2.birth_date"), 1.0).otherwise(0.0)
).withColumn(
"composite_score",
col("name_score") * 0.4 + col("dob_score") * 0.35 +
when(col("p1.zip_code") == col("p2.zip_code"), 0.25).otherwise(0.0)
)
# Step 5: Flag matches above threshold
MATCH_THRESHOLD = 0.85
return matched.filter(col("composite_score") >= MATCH_THRESHOLD) Code Standardization: Mapping to SNOMED/LOINC
Healthcare organizations use a mix of standard and proprietary codes. A lab result might come as a local code ("GLU-FAST") from one system and a LOINC code ("2345-7") from another. The Silver layer maps everything to canonical terminologies:
loinc_map = spark.table("reference.local_to_loinc_mapping")
def standardize_observation_codes(observations_df):
standardized = (
observations_df
.join(loinc_map,
(observations_df["code_system"] == loinc_map["source_system"]) &
(observations_df["code_value"] == loinc_map["source_code"]),
"left"
)
.withColumn("standard_code",
coalesce(loinc_map["loinc_code"], col("code_value")))
.withColumn("standard_system",
when(loinc_map["loinc_code"].isNotNull(), lit("http://loinc.org"))
.otherwise(col("code_system")))
.withColumn("mapping_status",
when(loinc_map["loinc_code"].isNotNull(), "mapped")
.otherwise("unmapped"))
)
# Log unmapped codes for terminology team review
unmapped = standardized.filter(col("mapping_status") == "unmapped")
if unmapped.count() > 0:
unmapped.select("code_system", "code_value").distinct() \
.write.mode("append").saveAsTable("quality.unmapped_codes")
return standardized FHIR Resource Flattening: JSON to Relational Tables
The most labor-intensive Silver transformation is flattening nested FHIR resources into relational tables. Here's the Patient resource transformation:
def flatten_fhir_patient(bronze_df):
patients = bronze_df.filter(col("resource_type") == "Patient")
silver_patients = patients.select(
get_json_object("raw_json", "$.id").alias("patient_id"),
get_json_object("raw_json", "$.name[0].family").alias("last_name"),
get_json_object("raw_json", "$.name[0].given[0]").alias("first_name"),
get_json_object("raw_json", "$.birthDate").cast("date").alias("birth_date"),
get_json_object("raw_json", "$.gender").alias("gender"),
get_json_object("raw_json", "$.address[0].line[0]").alias("address_line_1"),
get_json_object("raw_json", "$.address[0].city").alias("city"),
get_json_object("raw_json", "$.address[0].state").alias("state"),
get_json_object("raw_json", "$.address[0].postalCode").alias("zip_code"),
get_json_object("raw_json", "$.telecom[0].value").alias("phone"),
col("source_system"),
current_timestamp().alias("silver_processed_ts")
)
silver_patients.write.format("delta").mode("overwrite") \
.saveAsTable("silver.patients")
return silver_patients Data Quality Checks Between Layers
Data quality gates between layers prevent bad data from propagating. Each transition has specific validation rules:
class DataQualityChecker:
def __init__(self, spark):
self.spark = spark
self.results = []
def check_bronze_to_silver(self):
bronze_count = self.spark.table("bronze.fhir_resources") \
.filter(col("resource_type") == "Patient").count()
silver_count = self.spark.table("silver.patients").count()
# Deduplication rate should be < 15%
dedup_rate = 1 - (silver_count / max(bronze_count, 1))
self.results.append({
"check": "dedup_rate", "value": f"{dedup_rate:.1%}",
"threshold": "15%",
"status": "PASS" if dedup_rate < 0.15 else "FAIL"
})
# Null rate checks for critical fields
patients = self.spark.table("silver.patients")
total = patients.count()
for field, threshold in {"patient_id": 0.0, "last_name": 0.02, "birth_date": 0.01}.items():
null_rate = patients.filter(col(field).isNull()).count() / total
self.results.append({
"check": f"null_rate_{field}", "value": f"{null_rate:.3%}",
"threshold": f"{threshold:.1%}",
"status": "PASS" if null_rate <= threshold else "FAIL"
})
return self.results Gold Layer: Analytics-Ready Datasets
Design Principles
Gold tables are purpose-built for specific business questions. Unlike Silver tables (which mirror source data structure), Gold tables are shaped by the consumer: BI dashboards need pre-aggregated metrics, ML models need feature tables, and quality reporting needs measure-specific denominators and numerators.
Gold Table 1: Readmission Risk Feature Table
def build_readmission_features():
patients = spark.table("silver.patients")
encounters = spark.table("silver.encounters")
conditions = spark.table("silver.conditions")
# Feature: encounter history
encounter_features = encounters.groupBy("patient_ref").agg(
count("*").alias("total_encounters"),
spark_sum(when(col("encounter_class") == "emergency", 1)
.otherwise(0)).alias("ed_visits_total"),
spark_sum(when(
(col("encounter_class") == "inpatient") &
(col("start_time") >= date_sub(current_date(), 365)), 1
).otherwise(0)).alias("inpatient_admits_12m"),
avg((col("end_time").cast("long") - col("start_time").cast("long")) / 86400
).alias("avg_los_days")
)
# Feature: comorbidity count
comorbidity_features = conditions.groupBy("patient_ref").agg(
countDistinct("condition_code").alias("unique_conditions"),
spark_sum(when(col("condition_code").isin("E11", "E10"), 1)
.otherwise(0)).alias("has_diabetes"),
spark_sum(when(col("condition_code").isin("I50", "I50.1"), 1)
.otherwise(0)).alias("has_chf")
)
# Join all features
features = patients \
.join(encounter_features, patients["patient_id"] == encounter_features["patient_ref"], "left") \
.join(comorbidity_features, patients["patient_id"] == comorbidity_features["patient_ref"], "left") \
.select("patient_id", "birth_date", "gender", "total_encounters",
"ed_visits_total", "inpatient_admits_12m", "avg_los_days",
"unique_conditions", "has_diabetes", "has_chf",
current_timestamp().alias("feature_computed_ts"))
features.write.format("delta").mode("overwrite") \
.saveAsTable("gold.readmission_features")
return features Gold Table 2: HEDIS Quality Measures
-- HEDIS Controlling High Blood Pressure (CBP) measure
CREATE OR REPLACE TABLE gold.hedis_cbp_measure AS
WITH denominator AS (
SELECT DISTINCT p.patient_id, p.birth_date
FROM silver.patients p
JOIN silver.conditions c ON p.patient_id = c.patient_ref
WHERE c.condition_code IN ('I10', 'I11', 'I12', 'I13', 'I15')
AND DATEDIFF(CURRENT_DATE, p.birth_date) / 365.25 BETWEEN 18 AND 85
),
latest_bp AS (
SELECT patient_ref,
MAX(CASE WHEN standard_code = '8480-6' THEN value_quantity END) as systolic,
MAX(CASE WHEN standard_code = '8462-4' THEN value_quantity END) as diastolic,
MAX(effective_date) as measurement_date
FROM silver.observations
WHERE standard_code IN ('8480-6', '8462-4')
AND effective_date >= DATE_SUB(CURRENT_DATE, 365)
GROUP BY patient_ref
)
SELECT d.patient_id, bp.systolic, bp.diastolic,
CASE
WHEN bp.systolic < 140 AND bp.diastolic < 90 THEN 'numerator'
WHEN bp.systolic IS NULL THEN 'no_reading'
ELSE 'not_controlled'
END AS measure_status
FROM denominator d
LEFT JOIN latest_bp bp ON d.patient_id = bp.patient_ref; Gold Table 3: Cost Analysis by DRG
CREATE OR REPLACE TABLE gold.cost_by_drg AS
SELECT
e.drg_code, e.drg_description,
COUNT(*) AS encounter_count,
AVG(c.total_charges) AS avg_charges,
PERCENTILE(c.total_charges, 0.5) AS median_charges,
AVG(DATEDIFF(e.end_time, e.start_time)) AS avg_los_days,
STDDEV(c.total_charges) AS charge_stddev
FROM silver.encounters e
JOIN silver.claims c ON e.encounter_id = c.encounter_ref
WHERE e.encounter_class = 'inpatient'
AND e.start_time >= DATE_SUB(CURRENT_DATE, 365)
GROUP BY e.drg_code, e.drg_description
ORDER BY encounter_count DESC; End-to-End Pipeline: From EHR to Dashboard
Here is how all the pieces connect in a production Databricks deployment. This orchestration script runs as a scheduled Databricks Workflow:
def run_medallion_pipeline():
# Step 1: Bronze ingestion (Auto Loader - incremental)
print("=== BRONZE: Ingesting new FHIR data ===")
ingest_fhir_to_bronze(
source_path="s3://healthcare-lake/fhir-exports/",
checkpoint_path="s3://healthcare-lake/_checkpoints/fhir-bronze"
)
# Step 2: Silver transformations
print("=== SILVER: Flattening and standardizing ===")
bronze_df = spark.table("bronze.fhir_resources")
flatten_fhir_patient(bronze_df)
standardize_observation_codes(spark.table("silver.observations"))
deduplicate_patients(bronze_df)
# Step 3: Data quality gates
print("=== QUALITY: Running validation checks ===")
qc = DataQualityChecker(spark)
results = qc.check_bronze_to_silver()
failures = [r for r in results if r["status"] == "FAIL"]
if failures:
raise Exception(f"Quality gate failed: {failures}")
# Step 4: Gold aggregations
print("=== GOLD: Building analytics tables ===")
build_readmission_features()
print("=== PIPELINE COMPLETE ===")
run_medallion_pipeline() FAQ: Medallion Architecture for Healthcare
How long should I retain Bronze data?
For healthcare, retain Bronze data indefinitely (or at minimum 7 years to match HIPAA retention requirements). Bronze data is cheap to store (compressed Parquet on S3) and invaluable for reprocessing. CareEvolution recommends keeping raw data forever and only aging out Silver/Gold tables. Use cost-effective storage tiers (S3 Glacier) for historical Bronze data.
Should Silver tables mirror FHIR resource types?
Yes, for the initial flattening. Create one Silver table per FHIR resource type (Patient, Encounter, Observation, Condition, MedicationRequest). This preserves the FHIR data model semantics while making data SQL-queryable. Gold tables then reshape this data for specific analytical needs.
How do I handle FHIR extensions in the Silver layer?
FHIR extensions should be extracted into dedicated columns if analytically relevant (e.g., Epic's race/ethnicity extensions), or stored in a JSON column called extensions_json for ad-hoc access. Don't lose them — they often contain the most clinically interesting data. Reference the FHIR resource mapping matrix for common extension patterns.
What is the right batch frequency for each layer?
Bronze: near-real-time (Auto Loader processes files within minutes of arrival). Silver: hourly or every 6 hours (transformations are compute-intensive). Gold: daily (aggregations can run overnight for next-morning dashboards). For real-time use cases like clinical alerting, add a streaming path that bypasses Gold and serves directly from Silver.
Can I use this with non-Databricks platforms?
Absolutely. The medallion pattern works with any lakehouse platform. Apache Iceberg on AWS (EMR + Glue) is a strong alternative. The key components — Delta Lake/Iceberg for ACID, PySpark for transforms, and a catalog for governance — are available across clouds.
Conclusion: Start With Bronze, Iterate Upward
The medallion architecture is not built in a day. Start by landing raw FHIR data in Bronze — this takes days, not months. Then iterate on Silver transformations as you identify which data quality issues actually impact downstream consumers. Build Gold tables on demand as specific analytics use cases emerge.
The biggest mistake organizations make is trying to design the perfect Silver schema before writing a single line of code. Bronze is forgiving — start there, learn from the data, and refine upward.
At Nirmitee, we have implemented medallion architectures for healthcare organizations processing millions of FHIR resources daily. If you are planning a healthcare data platform, reach out for an architecture review — we will help you design a medallion pipeline that scales with your data maturity.