Every health system in the US is sitting on a goldmine. Decades of clinical notes, lab results, imaging reports, medication histories, and longitudinal patient journeys — the exact data that could train the next generation of clinical AI models. The problem? HIPAA makes sharing it terrifying, and for good reason. A single breach of Protected Health Information (PHI) carries penalties of up to $2.1 million per violation category, per year. The 2024 Change Healthcare breach alone affected 100 million patients and is projected to cost UnitedHealth Group over $2.5 billion.
De-identification is the bridge between clinical data locked in EHRs and AI-ready datasets that can safely leave institutional walls. But the primary framework most organizations rely on — HIPAA's Safe Harbor method — was written in 2000, before social media, wearable devices, genomic sequencing, and the re-identification research that has fundamentally changed our understanding of privacy risk.
This guide is for the data engineers, ML teams, and compliance officers who need to actually build de-identification pipelines. We will cover what works, what doesn't, and how to choose between Safe Harbor, Expert Determination, and synthetic data generation based on your specific use case.
The Safe Harbor Method: All 18 Identifiers, with Practical Notes
HIPAA's Safe Harbor method (45 CFR 164.514(b)(2)) is the most commonly used de-identification approach because it is prescriptive: remove or generalize these 18 specific identifiers, and your data is considered de-identified. No statistician required. No risk assessment needed.

Here are all 18 identifiers with the implementation details that matter for engineering teams:
| # | Identifier | What to Remove/Generalize | Engineering Notes |
|---|---|---|---|
| 1 | Names | Full name, initials, nicknames | Watch for names embedded in clinical notes, not just structured fields |
| 2 | Geographic data | Anything more specific than state | Zip codes: keep first 3 digits only if the 3-digit zip population is >20,000; otherwise, set to 000. HHS guidance, based on Census population data, lists 17 three-digit zips that must be zeroed |
| 3 | Dates | All dates except year | Birth dates, admission dates, discharge dates, death dates — all reduced to year only. Ages over 89 must be grouped into a single category ("90+") |
| 4 | Phone numbers | All phone numbers | Check clinical notes for phone numbers documented by clinicians |
| 5 | Fax numbers | All fax numbers | Still common in healthcare — referral documents, pharmacy communications |
| 6 | Email addresses | All email addresses | Patient portal emails, provider contact fields, and free-text notes |
| 7 | Social Security numbers | Full SSN | May appear in insurance fields, billing records, or legacy system imports |
| 8 | Medical record numbers | MRNs, encounter IDs | Replace with pseudonymized IDs (hash + salt) to maintain linkability within datasets |
| 9 | Health plan beneficiary numbers | Insurance IDs, Medicare/Medicaid numbers | Found in claims data, eligibility responses, EOBs |
| 10 | Account numbers | All account numbers | Billing account numbers, guarantor accounts |
| 11 | Certificate/license numbers | Driver's license, professional licenses | Sometimes captured during patient registration |
| 12 | Vehicle identifiers | VIN, license plate numbers | Rare in clinical data, but present in trauma/accident reports |
| 13 | Device identifiers | UDI, serial numbers | Implant records, FHIR Device resources, IoMT device logs |
| 14 | Web URLs | Patient portal URLs, any URLs | Check for URLs in clinical notes referencing patient-specific pages |
| 15 | IP addresses | All IP addresses | Audit logs, patient portal access records, telehealth session data |
| 16 | Biometric identifiers | Fingerprints, voiceprints, retinal scans | Increasingly common with biometric authentication at registration kiosks |
| 17 | Full-face photographs | Any comparable image | Dermatology images, wound care photos — crop or blur facial features |
| 18 | Any other unique identifying number | Catch-all for unique identifiers | Includes study IDs that could link back to identified datasets, genetic accession numbers, biobank specimen IDs |
The Safe Harbor method has a critical advantage: legal certainty. If you remove all 18 identifiers and have no actual knowledge that the remaining data could identify someone, the data is legally de-identified under HIPAA. No expert review. No statistical proof. This makes it the default choice for most health systems and research institutions.
Why Safe Harbor Is Showing Its Age

The Safe Harbor regulation was finalized in December 2000. That is before Facebook (2004), before the iPhone (2007), before consumer genomics (23andMe was founded in 2006), and before wearable health devices became ubiquitous. The privacy landscape it was designed for no longer exists.
The Re-Identification Research That Should Worry You
In 2000, Latanya Sweeney's landmark research demonstrated that 87% of the US population could be uniquely identified using just three data points: 5-digit zip code, date of birth, and gender. These are all quasi-identifiers that Safe Harbor allows in some form — zip codes (first 3 digits), birth year, and gender are all permitted in Safe Harbor-compliant datasets.
More recent studies have escalated the concern:
- 2019 — Nature Communications: Researchers at Imperial College London and Université catholique de Louvain showed that 99.98% of Americans could be re-identified in any dataset using 15 demographic attributes, even when the data was "anonymized" (Rocher et al., 2019).
- 2015 — Science: MIT researchers re-identified individuals in a credit card metadata dataset with 90% accuracy using just four spatiotemporal data points (de Montjoye et al., 2015).
- Genetic data: A 2018 study in Science showed that 60% of Americans with European ancestry could be identified through genetic genealogy databases, even if they had never submitted their own DNA (Erlich et al., 2018).
What Safe Harbor Does Not Cover
The 18 identifiers were designed for the data types that existed in 2000. They do not explicitly address:
- Social media correlation: A patient's Twitter/X post about their hospital visit, combined with de-identified admission data, can dramatically narrow identification.
- Wearable device data: Continuous heart rate, GPS, sleep patterns — even without names, the temporal patterns are highly unique
- Genomic data: Genetic sequences are inherently identifying and are not covered by Safe Harbor's 18 identifiers
- Geolocation breadcrumbs: Appointment times correlated with cell tower data or location history
- Rare disease data: In small populations, diagnosis codes alone can identify patients; only about 30,000 people in the US have symptomatic Huntington's disease
- Longitudinal data: Even without direct identifiers, a unique combination of diagnoses, procedures, and visit patterns over time can fingerprint an individual
Safe Harbor is not wrong — it is incomplete for modern data environments, especially when building AI training datasets that combine multiple data sources.
Expert Determination: The Flexible (but Expensive) Alternative
HIPAA's second de-identification method — Expert Determination (45 CFR 164.514(b)(1)) — requires a qualified statistician or scientist to apply statistical and scientific principles and determine that the risk of re-identification is "very small." The expert must document their methods and results.
When Expert Determination Makes Sense
- You need date precision: AI models for sepsis prediction, readmission risk, or disease progression need actual dates, not just years. Expert Determination can allow date retention with other controls.
- Geographic analysis: Social determinants of health research require sub-state geography. An expert can determine acceptable geographic granularity based on population density.
- Small datasets: Safe Harbor is riskier with small datasets, where combinations of permitted attributes (age, gender, zip prefix) might still identify individuals. Expert Determination applies quantitative risk assessment.
- Multi-source data: When combining clinical, claims, and FHIR-based data pipelines, the linkage itself creates re-identification risk that Safe Harbor does not address.
What Expert Determination Costs
A typical Expert Determination engagement runs $50,000 to $200,000, depending on dataset complexity, number of data elements, and the expert's assessment of re-identification risk. The process typically takes 8-16 weeks and includes:
- Data profiling and quasi-identifier analysis
- Population uniqueness assessment
- Risk quantification using k-anonymity, l-diversity, or other privacy models
- Transformation recommendations (suppression, generalization, perturbation)
- Formal certification letter
Organizations like Privacy Analytics (an IQVIA company) and academic groups at institutions like Vanderbilt and Harvard specialize in Expert Determination for healthcare datasets.
Practical Tooling for Data Engineers

If you are building de-identification into your data pipeline — rather than hiring an external service — these are the tools that actually work in production:
Open-Source Libraries
| Tool | Maintainer | Best For | Notes |
|---|---|---|---|
| Presidio | Microsoft | NER-based PHI detection in clinical text | Supports custom recognizers, FHIR integration, multiple NLP backends (spaCy, Stanza, Transformers) |
| scrubadub | LeapBeyond | Quick text scrubbing | Good for structured text, less accurate on clinical notes than Presidio |
| Philter | UCSF | Clinical note de-identification | Purpose-built for clinical text, rule-based + ML hybrid, validated on i2b2 datasets |
| Synthea | MITRE | Fully synthetic patient generation | Generates complete FHIR Bundles with realistic clinical progressions |
De-Identifying a FHIR Patient Resource: Python Example
Here is a practical code snippet for de-identifying a FHIR Patient resource programmatically. This covers the most common Safe Harbor transformations:
```python
import hashlib
import json
from datetime import datetime

# 3-digit zip prefixes with populations under 20,000 per HHS guidance;
# these must be zeroed out rather than truncated.
LOW_POP_PREFIXES = {"036", "059", "063", "102", "203",
                    "556", "692", "790", "821", "823",
                    "830", "831", "878", "879", "884",
                    "890", "893"}


def deidentify_fhir_patient(patient_resource: dict, salt: str) -> dict:
    """
    De-identify a FHIR Patient resource per the Safe Harbor method.
    Removes direct identifiers, generalizes quasi-identifiers.
    """
    patient = json.loads(json.dumps(patient_resource))  # deep copy

    # 1. Remove names (Safe Harbor #1)
    patient.pop("name", None)

    # 2. Generalize address: keep only state (Safe Harbor #2)
    for addr in patient.get("address", []):
        addr.pop("line", None)
        addr.pop("city", None)
        addr.pop("district", None)
        # Truncate zip to 3 digits; zero out if population < 20K
        zip_code = addr.get("postalCode", "")
        if zip_code:
            prefix = zip_code[:3]
            addr["postalCode"] = "000" if prefix in LOW_POP_PREFIXES else prefix

    # 3. Generalize birthDate to year only (Safe Harbor #3)
    if "birthDate" in patient:
        try:
            birth_year = int(patient["birthDate"][:4])
            age = datetime.now().year - birth_year
            if age > 89:
                patient["birthDate"] = "1900"  # single "90+" category
            else:
                patient["birthDate"] = str(birth_year)
        except (ValueError, IndexError):
            patient.pop("birthDate", None)

    # 4. Remove telecom: phone, fax, email (Safe Harbor #4, 5, 6)
    patient.pop("telecom", None)

    # 5. Remove SSN from identifiers (Safe Harbor #7)
    # 6. Hash MRN for linkability (Safe Harbor #8)
    if "identifier" in patient:
        cleaned_ids = []
        for ident in patient["identifier"]:
            system = ident.get("system", "")
            if "ssn" in system.lower() or "social" in system.lower():
                continue  # drop SSN identifiers entirely
            if "mrn" in system.lower() or "medical-record" in system.lower():
                # Pseudonymize MRN with a salted hash
                raw = ident.get("value", "")
                ident["value"] = hashlib.sha256(
                    f"{salt}:{raw}".encode()
                ).hexdigest()[:16]
                ident["system"] = "urn:oid:deidentified-mrn"
            cleaned_ids.append(ident)
        patient["identifier"] = cleaned_ids

    # 7. Remove photo (Safe Harbor #17)
    patient.pop("photo", None)

    # 8. Strip extensions that may contain PHI
    patient.pop("extension", None)

    return patient


# Example usage
original_patient = {
    "resourceType": "Patient",
    "id": "example-patient-001",
    "name": [{"family": "Smith", "given": ["John", "Michael"]}],
    "birthDate": "1985-07-23",
    "gender": "male",
    "address": [{
        "line": ["123 Main St"],
        "city": "Boston",
        "state": "MA",
        "postalCode": "02115"
    }],
    "telecom": [
        {"system": "phone", "value": "617-555-0123"},
        {"system": "email", "value": "john.smith@email.com"}
    ],
    "identifier": [
        {"system": "urn:oid:mrn", "value": "MRN-12345678"},
        {"system": "urn:oid:ssn", "value": "123-45-6789"}
    ]
}

deidentified = deidentify_fhir_patient(original_patient, salt="project-x-2026")
print(json.dumps(deidentified, indent=2))
# Output: no name, no telecom, no SSN, birth year only,
# zip truncated to "021", MRN hashed
```

This is a starting point. Production pipelines should also scan clinical notes (using Presidio or Philter) for PHI in free-text fields like Observation.valueString or DiagnosticReport.conclusion.
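Here is what that free-text scanning step can look like with Presidio. This is a minimal sketch: the sample note is fabricated, and a production deployment would add custom recognizers for MRN formats and institution-specific patterns on top of the built-in detectors.

```python
# Minimal Presidio sketch for PHI detection and masking in free text.
# Assumes presidio-analyzer, presidio-anonymizer, and a spaCy English
# model are installed.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

note = "Pt John Smith seen on 07/23/2024. Follow up: call 617-555-0123."

# Detect built-in entity types such as PERSON, DATE_TIME, PHONE_NUMBER
results = analyzer.analyze(text=note, language="en")

# Default behavior replaces each detected span with its entity type
masked = anonymizer.anonymize(text=note, analyzer_results=results)
print(masked.text)
# e.g. "Pt <PERSON> seen on <DATE_TIME>. Follow up: call <PHONE_NUMBER>."
```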
Synthetic Data: When You Don't Need Real Patients at All
For many AI training scenarios — especially development, testing, and algorithm prototyping — synthetic data eliminates the de-identification problem entirely. No real patients means no PHI, no HIPAA applicability, no risk.
Synthetic Data Tools That Work
- Synthea (MITRE): Generates fully synthetic patient records as FHIR Bundles. Includes realistic disease progressions, medication histories, and care encounters. Used by ONC, CMS, and hundreds of health IT companies for testing. Free and open source.
- Gretel.ai: Uses differential privacy and generative models to create synthetic datasets that preserve statistical properties of real data. Particularly useful when you need synthetic data that mirrors your actual patient population's distributions.
- MDClone: Creates "synthetic twins" — synthetic records that preserve the statistical relationships in real clinical data without exposing any individual patient. Used by health systems like Intermountain Health and the Children's Hospital of Philadelphia.
- MOSTLY AI: Enterprise synthetic data platform with healthcare-specific models. Includes privacy guarantees and utility metrics.
The trade-off with synthetic data is always fidelity versus privacy. Synthea generates realistic but generic patient journeys — useful for pipeline testing but not for training models that need to capture the specific patterns in your patient population. Tools like Gretel and MDClone bridge this gap by learning from real data to generate synthetic records, but the question of whether the synthetic data is "too similar" to the source data requires careful evaluation.
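To make the Synthea option concrete, here is a small sketch of consuming its output. It assumes you have generated a population (for example, with `java -jar synthea-with-dependencies.jar -p 100 Massachusetts`) and that the FHIR R4 bundles landed in Synthea's default `output/fhir` directory; the paths are illustrative.

```python
# Iterate Synthea-generated FHIR Bundles and summarize resource types.
# Assumes Synthea's default FHIR output location (output/fhir).
import json
from collections import Counter
from pathlib import Path

for bundle_path in Path("output/fhir").glob("*.json"):
    bundle = json.loads(bundle_path.read_text())
    # Each patient file is a FHIR Bundle of Patient, Encounter,
    # Condition, MedicationRequest, Observation, etc.
    counts = Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
    )
    print(bundle_path.name, dict(counts))
```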
Beyond HIPAA: Privacy Models for AI Training Data

HIPAA defines the legal floor for de-identification, but it does not provide mathematical privacy guarantees. If you are building AI training datasets — especially datasets that will be shared with external partners or used in federated learning — you need to understand the formal privacy models:
k-Anonymity
A dataset satisfies k-anonymity if every combination of quasi-identifiers (age, gender, zip code) appears in at least k records. If k=5, every record "hides" among at least 4 others with the same quasi-identifier values. This is the most intuitive privacy model and is used in many Expert Determination assessments.
Limitation: If all 5 records in a k-anonymous group have the same sensitive value (e.g., all have HIV diagnosis), privacy is still breached. This led to l-diversity.
l-Diversity
Extends k-anonymity by requiring that each equivalence class (group of records with the same quasi-identifiers) has at least l distinct values for the sensitive attribute. This prevents the "homogeneity attack" where all records in a group share the same diagnosis.
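Both properties are cheap to measure on a flat table. The sketch below uses pandas; the quasi-identifier columns and sensitive attribute are illustrative placeholders, so substitute your own schema.

```python
# Measure k-anonymity and l-diversity on a flat dataset with pandas.
# Column names are illustrative placeholders.
import pandas as pd

QUASI_IDS = ["age_band", "gender", "zip3"]
SENSITIVE = "diagnosis"

def k_anonymity(df: pd.DataFrame) -> int:
    # k = size of the smallest equivalence class
    return int(df.groupby(QUASI_IDS).size().min())

def l_diversity(df: pd.DataFrame) -> int:
    # l = fewest distinct sensitive values in any equivalence class
    return int(df.groupby(QUASI_IDS)[SENSITIVE].nunique().min())

df = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "gender":    ["F", "F", "F", "M", "M"],
    "zip3":      ["021", "021", "021", "021", "021"],
    "diagnosis": ["E11", "I10", "E11", "I10", "J45"],
})
print(k_anonymity(df), l_diversity(df))  # -> 2 2
```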
t-Closeness
Goes further by requiring that the distribution of the sensitive attribute in each equivalence class is within distance t of the attribute's overall distribution. This prevents the "skewness attack" where the distribution within a group reveals information even when values are diverse.
Differential Privacy
The gold standard for mathematical privacy guarantees. A mechanism satisfies differential privacy if the output of a query on a dataset is statistically indistinguishable whether or not any single individual's record is included. The privacy budget (epsilon) quantifies the maximum information leakage. Apple, Google, and the US Census Bureau all use differential privacy.
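Formally, a randomized mechanism M satisfies (ε, δ)-differential privacy if, for any two datasets D and D′ differing in a single record and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ. A smaller epsilon means a stronger guarantee; delta is the small probability of exceeding that bound.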
For healthcare AI training, differential privacy is increasingly the recommended approach when training models on sensitive data. Libraries like Opacus (PyTorch) and TensorFlow Privacy add differential privacy guarantees directly to model training with minimal code changes.
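As an illustration of how small the code change is, here is a minimal DP-SGD sketch with Opacus. The model, data, and hyperparameters are toy placeholders; real training requires tuning noise_multiplier against a target epsilon.

```python
# Minimal DP-SGD training sketch with Opacus (toy model and data).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # Gaussian noise scale on clipped gradients
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for features, labels in loader:  # one epoch
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()

# Privacy budget spent so far, for a chosen delta
print(privacy_engine.get_epsilon(delta=1e-5))
```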
Decision Framework: Choosing the Right Approach

Choosing between de-identification methods is not an either/or decision — many organizations layer multiple approaches. Here is a practical decision framework:
| Use Case | Recommended Approach | Why |
|---|---|---|
| Sharing data with external researchers | Safe Harbor + k-anonymity verification | Legal certainty of Safe Harbor with quantitative privacy guarantee |
| Training ML models on real clinical patterns | Expert Determination + differential privacy | Preserves clinical nuance while providing mathematical privacy bounds |
| Development and testing environments | Synthetic data (Synthea) | Zero PHI risk, free, generates valid FHIR resources |
| Federated learning across institutions | Differential privacy (Opacus/TF Privacy) | Data never leaves the institution; only model gradients are shared with privacy guarantees |
| Clinical trial data sharing | Expert Determination | Preserves date precision and geographic detail needed for trial analysis |
| Building FHIR-based AI pipelines | Safe Harbor for structured resources + Presidio for narrative text | Handles both coded data and unstructured clinical notes |
| Population health analytics | Safe Harbor with aggregation | Aggregate statistics (cohort-level) reduce re-identification risk to near zero |
A Layered Strategy for Production
The most robust approach combines multiple methods:
- Structured data: Apply Safe Harbor transformations programmatically (the Python code above covers FHIR resources)
- Clinical notes: Run through Presidio or Philter for NER-based PHI detection, then manual review of flagged content
- Quantitative validation: Measure k-anonymity and l-diversity on the output dataset; ensure k ≥ 5 for most use cases
- Model training: Apply differential privacy during model training as a second layer of protection
- Ongoing monitoring: Re-assess risk when adding new data sources or linking datasets
Compliance Considerations Beyond HIPAA
If your AI training data includes patients from outside the US, or if your models will be deployed internationally, HIPAA is just the starting point:
- GDPR (EU): Does not recognize Safe Harbor-style de-identification. Requires either anonymization (irreversible, assessed against "all means reasonably likely") or pseudonymization (still treated as personal data). Article 89 provides research exemptions but with strict safeguards.
- State laws: California's CCPA/CPRA, Washington's My Health My Data Act, and other state privacy laws may impose additional requirements beyond HIPAA.
- 21st Century Cures Act: While focused on information blocking, the Cures Act's data sharing requirements intersect with de-identification when health systems share data through TEFCA or other networks.
From predictive models to clinical AI, our Healthcare AI Solutions practice helps healthcare organizations deploy AI that delivers real outcomes. We also offer specialized Custom Healthcare Software Development services. Talk to our team to get started.
Frequently Asked Questions
Is de-identified data still protected under HIPAA?
No. Once data is properly de-identified under either Safe Harbor or Expert Determination, it is no longer considered PHI and is not subject to HIPAA restrictions. However, you must maintain the de-identification methodology documentation, and if you retain a re-identification key, the key itself is PHI.
Can I use Safe Harbor for genomic data?
Safe Harbor's 18 identifiers do not explicitly address genomic sequences, which are inherently identifying. For genomic data, Expert Determination is strongly recommended, and many experts consider genomic data impossible to fully de-identify. Synthetic genomic data or federated approaches may be the only viable path.
What is the re-identification risk threshold for Expert Determination?
HIPAA requires the risk to be "very small" but does not define a specific threshold. In practice, most experts target a re-identification risk below 0.04 (1 in 25) to 0.09 (1 in 11), depending on the sensitivity of the data and the anticipated recipient. The HHS guidance on de-identification provides additional context.
How do I de-identify medical images (X-rays, CT scans)?
DICOM images contain PHI in both the pixel data (burned-in text overlays with patient name, MRN) and the metadata headers. Tools like DicomAnonymizer handle metadata scrubbing. For burned-in PHI, OCR-based detection (using Tesseract + Presidio) or deep learning approaches are needed. The RSNA CTP (Clinical Trial Processor) is widely used for radiology image de-identification.
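For the metadata side, a minimal pydicom sketch looks like the following. The file path and tag list are illustrative, and this does not touch burned-in pixel PHI, which needs the OCR-based approaches mentioned above.

```python
# Scrub common patient-identifying DICOM header elements with pydicom.
# Path and tag list are illustrative; pixel-level PHI needs OCR.
import pydicom

ds = pydicom.dcmread("scan.dcm")

for keyword in ["PatientName", "PatientID", "PatientBirthDate",
                "PatientAddress", "AccessionNumber"]:
    if keyword in ds:
        ds.data_element(keyword).value = ""

ds.remove_private_tags()  # vendor-specific tags often carry PHI
ds.save_as("scan_deid.dcm")
```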
Building De-Identification Into Your Architecture
De-identification should not be an afterthought or a one-time export process. For organizations serious about leveraging clinical data for AI, it needs to be a first-class component of your data architecture — a pipeline stage that runs continuously as new data flows in.
At Nirmitee, we build healthcare data infrastructure with privacy as a core architectural concern, not a bolt-on. Our EHR and integration platforms are designed with FHIR-native data pipelines where de-identification can be applied at the resource level before data ever leaves the clinical system boundary. If you are building an AI-ready data infrastructure and need help getting the de-identification layer right, we would welcome the conversation.
The bottom line: Safe Harbor's 18 identifiers are still the right starting point, but they are not sufficient for modern healthcare AI. Layer Expert Determination, differential privacy, or synthetic data generation on top — and build the de-identification step into your pipeline, not your export process.