The Healthcare AI Agent's Unstructured Data Pipeline: Extracting Intelligence from Clinical Notes, PDFs, and Faxes

March 16, 2026

15 min read

AI & MLData EngineeringHealthcare

Up to 80% of healthcare data is unstructured. Clinical notes, scanned PDFs, faxed documents, discharge summaries, pathology reports, referral letters — all containing critical clinical information locked in free text. Your AI agent cannot reason about a patient's history if that history exists only in a scanned PDF from 2019 or a faxed prior authorization form.

This guide is the practical HOW. We cover the complete pipeline: document ingestion (FHIR DocumentReference, HL7 MDM messages, fax OCR, scanned PDF OCR), clinical NLP (medical entity extraction, negation detection, temporal reasoning), structuring extracted data into FHIR resources, vector indexing for retrieval-augmented generation (RAG), and agent retrieval patterns. Every stage includes working Python code.

For the foundational context on why unstructured data is the critical blocker for healthcare AI, see our companion guide on Prerequisites Before Building an AI Agent for Healthcare.

Stage 1: Document Ingestion

Before any NLP runs, you need to get the documents into your pipeline. Healthcare documents arrive through four primary channels:

FHIR DocumentReference

Modern EHRs expose clinical documents through the FHIR DocumentReference resource. The document content may be inline (base64-encoded) or referenced via a URL to a Binary resource:

# fhir_document_ingestion.py
import base64
import httpx
from typing import Optional

class FHIRDocumentIngester:
    """Ingest clinical documents from FHIR DocumentReference resources."""

    def __init__(self, fhir_base: str, token: str):
        self.client = httpx.Client(
            base_url=fhir_base,
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/fhir+json"
            },
            timeout=30.0
        )

    def get_patient_documents(self, patient_id: str,
                               category: str = None) -> list[dict]:
        """Fetch all documents for a patient."""
        params = {"patient": patient_id, "_sort": "-date"}
        if category:
            params["category"] = category
        resp = self.client.get("/DocumentReference", params=params)
        resp.raise_for_status()
        bundle = resp.json()
        return [e["resource"] for e in bundle.get("entry", [])]

    def extract_text(self, doc_ref: dict) -> Optional[str]:
        """Extract text content from a DocumentReference."""
        for content in doc_ref.get("content", []):
            attachment = content.get("attachment", {})
            content_type = attachment.get("contentType", "")

            # Inline base64 content
            if "data" in attachment:
                raw = base64.b64decode(attachment["data"])
                if "text" in content_type:
                    return raw.decode("utf-8")
                elif "pdf" in content_type:
                    return self._extract_pdf_text(raw)

            # Referenced Binary resource
            elif "url" in attachment:
                binary_resp = self.client.get(attachment["url"])
                if binary_resp.status_code == 200:
                    if "text" in content_type:
                        return binary_resp.text
                    elif "pdf" in content_type:
                        return self._extract_pdf_text(binary_resp.content)

        return None

    def _extract_pdf_text(self, pdf_bytes: bytes) -> str:
        """Extract text from PDF using PyMuPDF."""
        import fitz  # PyMuPDF
        doc = fitz.open(stream=pdf_bytes, filetype="pdf")
        text = ""
        for page in doc:
            text += page.get_text()
        return text

HL7 v2 MDM Messages

Many hospitals still send clinical documents through HL7 v2 MDM (Medical Document Management) messages. The document text lives in the OBX-5 segment. If your integration handles HL7 v2, see our guide on Turning HL7 v2 Streams into FHIR APIs Using Mirth Connect for the conversion pipeline.

Fax and Scanned Document OCR

Fax remains a primary communication channel in US healthcare — see our analysis of Why Fax Remains Mandatory in Healthcare. Scanned faxes and PDFs require OCR before NLP processing:

# ocr_pipeline.py
import io
from PIL import Image

def ocr_document(image_bytes: bytes, engine: str = "tesseract") -> str:
    """OCR a scanned document image."""
    if engine == "tesseract":
        import pytesseract
        image = Image.open(io.BytesIO(image_bytes))
        # Pre-process for better OCR accuracy
        image = image.convert("L")  # Grayscale
        text = pytesseract.image_to_string(
            image,
            config="--psm 6 --oem 3"  # Assume uniform block of text, LSTM engine
        )
        return text

    elif engine == "google_vision":
        from google.cloud import vision
        client = vision.ImageAnnotatorClient()
        image = vision.Image(content=image_bytes)
        response = client.document_text_detection(image=image)
        return response.full_text_annotation.text

    elif engine == "aws_textract":
        import boto3
        client = boto3.client("textract")
        response = client.detect_document_text(
            Document={"Bytes": image_bytes}
        )
        blocks = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
        return "\n".join(blocks)

    raise ValueError(f"Unknown OCR engine: {engine}")

For production OCR, Google Cloud Vision and AWS Textract significantly outperform Tesseract on handwritten notes and low-quality fax scans. Budget $1.50 per 1000 pages for cloud OCR.

Stage 2: Clinical NLP

Once you have text, clinical NLP extracts structured medical concepts. This is fundamentally different from general NLP — clinical language is dense, abbreviated, full of negation, and uses domain-specific terminology.

Medical Named Entity Recognition (NER)

Medical NER identifies mentions of clinical concepts in text: conditions, medications, procedures, lab values, anatomical sites, and dosages. Here is a complete pipeline using spaCy and medspaCy:

# clinical_nlp.py
import spacy
import medspacy
from medspacy.ner import TargetRule
from medspacy.context import ConTextRule
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClinicalEntity:
    text: str
    label: str  # CONDITION, MEDICATION, PROCEDURE, LAB_VALUE
    start: int
    end: int
    is_negated: bool = False
    is_historical: bool = False
    is_family_history: bool = False
    temporal_info: Optional[str] = None
    confidence: float = 0.0
    standard_code: Optional[str] = None
    standard_system: Optional[str] = None

class ClinicalNLPPipeline:
    """Full clinical NLP pipeline for medical entity extraction."""

    def __init__(self):
        self.nlp = medspacy.load(enable=["medspacy_pyrush",
                                         "medspacy_context"])
        self._add_target_rules()
        self._add_context_rules()

    def _add_target_rules(self):
        """Add clinical concept patterns for NER."""
        target_rules = [
            # Conditions
            TargetRule("diabetes", "CONDITION", pattern=[{"LOWER": {"IN": ["diabetes", "dm", "dm2", "dm1"]}}]),
            TargetRule("hypertension", "CONDITION", pattern=[{"LOWER": {"IN": ["hypertension", "htn", "high blood pressure"]}}]),
            TargetRule("heart failure", "CONDITION", pattern=[{"LOWER": "heart"}, {"LOWER": "failure"}]),
            TargetRule("COPD", "CONDITION", pattern=[{"LOWER": {"IN": ["copd", "chronic obstructive pulmonary disease"]}}]),

            # Medications
            TargetRule("metformin", "MEDICATION", pattern=[{"LOWER": "metformin"}]),
            TargetRule("lisinopril", "MEDICATION", pattern=[{"LOWER": "lisinopril"}]),
            TargetRule("insulin", "MEDICATION", pattern=[{"LOWER": "insulin"}]),

            # Procedures
            TargetRule("colonoscopy", "PROCEDURE", pattern=[{"LOWER": "colonoscopy"}]),
            TargetRule("MRI", "PROCEDURE", pattern=[{"LOWER": {"IN": ["mri", "magnetic resonance imaging"]}}]),
        ]
        self.nlp.get_pipe("medspacy_target_matcher").add(target_rules)

    def _add_context_rules(self):
        """Add context rules for negation, historical, family history."""
        context_rules = [
            ConTextRule("no evidence of", "NEGATED_EXISTENCE", direction="forward"),
            ConTextRule("denies", "NEGATED_EXISTENCE", direction="forward"),
            ConTextRule("no history of", "NEGATED_EXISTENCE", direction="forward"),
            ConTextRule("ruled out", "NEGATED_EXISTENCE", direction="backward"),
            ConTextRule("history of", "HISTORICAL", direction="forward"),
            ConTextRule("previously", "HISTORICAL", direction="forward"),
            ConTextRule("family history of", "FAMILY", direction="forward"),
            ConTextRule("mother had", "FAMILY", direction="forward"),
            ConTextRule("father had", "FAMILY", direction="forward"),
        ]
        self.nlp.get_pipe("medspacy_context").add(context_rules)

    def process(self, text: str) -> list[ClinicalEntity]:
        """Process clinical text and return extracted entities."""
        doc = self.nlp(text)
        entities = []

        for ent in doc.ents:
            entity = ClinicalEntity(
                text=ent.text,
                label=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
                is_negated=ent._.is_negated,
                is_historical=ent._.is_historical,
                is_family_history=ent._.is_family,
                confidence=0.85  # Base confidence for rule-based NER
            )
            entities.append(entity)

        return entities


# Example usage
if __name__ == "__main__":
    nlp_pipeline = ClinicalNLPPipeline()

    clinical_note = """
    Patient is a 62-year-old male with a history of type 2 diabetes
    and hypertension, currently on metformin 1000mg BID and lisinopril 20mg daily.
    Patient denies chest pain. No history of heart failure.
    Family history of COPD (mother). Colonoscopy scheduled for next month.
    Last HbA1c was 7.2% (2026-01-15).
    """

    entities = nlp_pipeline.process(clinical_note)
    for e in entities:
        neg = " [NEGATED]" if e.is_negated else ""
        hist = " [HISTORICAL]" if e.is_historical else ""
        fam = " [FAMILY]" if e.is_family_history else ""
        print(f"  {e.label}: {e.text}{neg}{hist}{fam}")

Negation Detection: The Make-or-Break Feature

Negation detection is the single most important feature in clinical NLP. Without it, "patient denies chest pain" becomes a positive finding of chest pain. "No history of diabetes" becomes a diabetes diagnosis. Getting negation wrong does not just reduce accuracy — it creates clinically dangerous false positives.

The medspaCy ConText algorithm handles common negation patterns, but production systems need additional patterns for clinical abbreviations: "r/o" (rule out), "neg" (negative), "w/o" (without), and assertion-modifying phrases like "possible," "suspected," and "unlikely."

Stage 3: Structuring Into FHIR Resources

After NLP extracts clinical entities, convert them into proper FHIR resources. This is where the extracted intelligence becomes computable:

# fhir_structuring.py
from datetime import datetime
import uuid

class FHIRStructurer:
    """Convert extracted clinical entities into FHIR resources."""

    def entity_to_fhir(self, entity, patient_id: str,
                       encounter_id: str = None) -> dict | None:
        """Convert a ClinicalEntity to a FHIR resource."""
        if entity.is_negated:
            return None  # Don't create resources for negated findings

        if entity.label == "CONDITION":
            return self._to_condition(entity, patient_id, encounter_id)
        elif entity.label == "MEDICATION":
            return self._to_medication_statement(entity, patient_id)
        elif entity.label == "PROCEDURE":
            return self._to_procedure(entity, patient_id, encounter_id)
        elif entity.label == "LAB_VALUE":
            return self._to_observation(entity, patient_id)
        return None

    def _to_condition(self, entity, patient_id, encounter_id) -> dict:
        resource = {
            "resourceType": "Condition",
            "id": str(uuid.uuid4()),
            "clinicalStatus": {
                "coding": [{
                    "system": "http://terminology.hl7.org/CodeSystem/condition-clinical",
                    "code": "active" if not entity.is_historical else "resolved"
                }]
            },
            "verificationStatus": {
                "coding": [{
                    "system": "http://terminology.hl7.org/CodeSystem/condition-ver-status",
                    "code": "confirmed" if entity.confidence > 0.9 else "provisional"
                }]
            },
            "code": {"text": entity.text},
            "subject": {"reference": f"Patient/{patient_id}"},
            "meta": {
                "tag": [{
                    "system": "https://your-org.com/nlp-extraction",
                    "code": "nlp-derived",
                    "display": f"NLP confidence: {entity.confidence:.2f}"
                }]
            }
        }
        if entity.standard_code:
            resource["code"]["coding"] = [{
                "system": entity.standard_system,
                "code": entity.standard_code
            }]
        if encounter_id:
            resource["encounter"] = {"reference": f"Encounter/{encounter_id}"}
        return resource

    def _to_medication_statement(self, entity, patient_id) -> dict:
        return {
            "resourceType": "MedicationStatement",
            "id": str(uuid.uuid4()),
            "status": "active",
            "medicationCodeableConcept": {"text": entity.text},
            "subject": {"reference": f"Patient/{patient_id}"},
            "meta": {
                "tag": [{
                    "system": "https://your-org.com/nlp-extraction",
                    "code": "nlp-derived",
                    "display": f"NLP confidence: {entity.confidence:.2f}"
                }]
            }
        }

    def _to_procedure(self, entity, patient_id, encounter_id) -> dict:
        resource = {
            "resourceType": "Procedure",
            "id": str(uuid.uuid4()),
            "status": "completed" if entity.is_historical else "preparation",
            "code": {"text": entity.text},
            "subject": {"reference": f"Patient/{patient_id}"},
            "meta": {
                "tag": [{
                    "system": "https://your-org.com/nlp-extraction",
                    "code": "nlp-derived"
                }]
            }
        }
        if encounter_id:
            resource["encounter"] = {"reference": f"Encounter/{encounter_id}"}
        return resource

    def _to_observation(self, entity, patient_id) -> dict:
        return {
            "resourceType": "Observation",
            "id": str(uuid.uuid4()),
            "status": "final",
            "code": {"text": entity.text},
            "subject": {"reference": f"Patient/{patient_id}"},
            "meta": {
                "tag": [{
                    "system": "https://your-org.com/nlp-extraction",
                    "code": "nlp-derived"
                }]
            }
        }

Critical detail: every NLP-derived FHIR resource is tagged with a meta.tag marking it as NLP-derived with a confidence score. This lets downstream consumers distinguish between clinician-entered structured data and machine-extracted data. Never mix NLP-derived resources with clinician-verified data without clear provenance. This ties into the broader data quality considerations covered in Medallion Architecture for Healthcare Data.

Stage 4: Vector Indexing for RAG

For AI agents using retrieval-augmented generation, clinical text needs to be chunked, embedded, and indexed in a vector database:

# vector_indexing.py
from dataclasses import dataclass
from typing import Optional
import hashlib

@dataclass
class ClinicalChunk:
    text: str
    patient_id: str
    document_id: str
    document_type: str  # progress_note, discharge_summary, lab_report
    document_date: str
    chunk_index: int
    embedding: list[float] = None
    metadata: dict = None

class ClinicalChunker:
    """Chunk clinical documents for vector indexing."""

    def __init__(self, chunk_size: int = 512, overlap: int = 64):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_document(self, text: str, patient_id: str,
                       document_id: str, document_type: str,
                       document_date: str) -> list[ClinicalChunk]:
        """Chunk clinical text with section-aware splitting."""
        # Split by clinical sections first
        sections = self._split_by_sections(text)
        chunks = []
        chunk_idx = 0

        for section_name, section_text in sections:
            # Sub-chunk sections that exceed chunk_size
            words = section_text.split()
            for i in range(0, len(words), self.chunk_size - self.overlap):
                chunk_words = words[i:i + self.chunk_size]
                chunk_text = " ".join(chunk_words)
                if len(chunk_text.strip()) < 20:
                    continue

                chunks.append(ClinicalChunk(
                    text=chunk_text,
                    patient_id=patient_id,
                    document_id=document_id,
                    document_type=document_type,
                    document_date=document_date,
                    chunk_index=chunk_idx,
                    metadata={
                        "section": section_name,
                        "chunk_hash": hashlib.md5(
                            chunk_text.encode()
                        ).hexdigest()
                    }
                ))
                chunk_idx += 1

        return chunks

    def _split_by_sections(self, text: str) -> list[tuple[str, str]]:
        """Split clinical text by common section headers."""
        import re
        section_pattern = re.compile(
            r'^(CHIEF COMPLAINT|HISTORY OF PRESENT ILLNESS|'
            r'PAST MEDICAL HISTORY|MEDICATIONS|ALLERGIES|'
            r'REVIEW OF SYSTEMS|PHYSICAL EXAMINATION|'
            r'ASSESSMENT|PLAN|DISCHARGE INSTRUCTIONS|'
            r'HPI|PMH|ROS|PE|A/P):?\s*$',
            re.MULTILINE | re.IGNORECASE
        )
        sections = []
        matches = list(section_pattern.finditer(text))

        if not matches:
            return [("FULL_TEXT", text)]

        for i, match in enumerate(matches):
            section_name = match.group(1).upper()
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            section_text = text[start:end].strip()
            if section_text:
                sections.append((section_name, section_text))

        return sections


class VectorIndexer:
    """Index clinical chunks in a vector database."""

    def __init__(self, collection_name: str = "clinical_documents"):
        # Using chromadb for simplicity; swap for Pinecone/Weaviate in production
        import chromadb
        self.client = chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

    def index_chunks(self, chunks: list[ClinicalChunk]):
        """Index chunks with embeddings and metadata."""
        ids = [f"{c.document_id}_{c.chunk_index}" for c in chunks]
        documents = [c.text for c in chunks]
        metadatas = [{
            "patient_id": c.patient_id,
            "document_type": c.document_type,
            "document_date": c.document_date,
            "section": c.metadata.get("section", "unknown"),
        } for c in chunks]

        self.collection.add(
            ids=ids, documents=documents, metadatas=metadatas
        )

    def query(self, query_text: str, patient_id: str = None,
              document_type: str = None, n_results: int = 5) -> list[dict]:
        """Query the vector index for relevant clinical chunks."""
        where_filter = {}
        if patient_id:
            where_filter["patient_id"] = patient_id
        if document_type:
            where_filter["document_type"] = document_type

        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results,
            where=where_filter if where_filter else None
        )
        return [{
            "text": doc,
            "metadata": meta,
            "distance": dist
        } for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )]

Stage 5: Agent Retrieval

The final stage connects the vector index to your AI agent. The agent queries clinical data by concept, filters by date and document type, and uses the retrieved context to answer clinical questions:

# agent_retrieval.py
class ClinicalRetriever:
    """Retrieval interface for AI agents to query clinical documents."""

    def __init__(self, vector_indexer: VectorIndexer,
                 fhir_client=None):
        self.vector_index = vector_indexer
        self.fhir_client = fhir_client

    def get_clinical_context(self, patient_id: str,
                              query: str,
                              max_chunks: int = 10) -> dict:
        """Get comprehensive clinical context for a patient query."""
        # Retrieve relevant unstructured data
        unstructured = self.vector_index.query(
            query_text=query,
            patient_id=patient_id,
            n_results=max_chunks
        )

        # Retrieve structured FHIR data if client available
        structured = {}
        if self.fhir_client:
            structured = {
                "conditions": self.fhir_client.get_conditions(patient_id),
                "medications": self.fhir_client.get_medications(patient_id),
                "recent_labs": self.fhir_client.get_observations(
                    patient_id, category="laboratory"
                )
            }

        return {
            "patient_id": patient_id,
            "query": query,
            "unstructured_context": unstructured,
            "structured_data": structured,
            "retrieval_metadata": {
                "chunks_retrieved": len(unstructured),
                "has_structured_data": bool(structured)
            }
        }

    def summarize_for_prompt(self, context: dict,
                              max_tokens: int = 2000) -> str:
        """Format retrieved context for an LLM prompt."""
        parts = []
        parts.append(f"=== Clinical Context for Patient {context['patient_id']} ===")

        # Add structured data summary
        if context.get("structured_data"):
            sd = context["structured_data"]
            if sd.get("conditions"):
                conditions = [c.get("code", {}).get("text", "Unknown")
                             for c in sd["conditions"][:10]]
                parts.append(f"Active Conditions: {', '.join(conditions)}")
            if sd.get("medications"):
                meds = [m.get("medicationCodeableConcept", {}).get("text", "Unknown")
                       for m in sd["medications"][:10]]
                parts.append(f"Active Medications: {', '.join(meds)}")

        # Add relevant unstructured chunks
        parts.append("\n=== Relevant Clinical Notes ===")
        for chunk in context.get("unstructured_context", []):
            doc_type = chunk["metadata"].get("document_type", "note")
            doc_date = chunk["metadata"].get("document_date", "unknown")
            parts.append(f"[{doc_type} | {doc_date}]")
            parts.append(chunk["text"])
            parts.append("---")

        return "\n".join(parts)

This retrieval layer gives your AI agent both structured FHIR data and semantically relevant unstructured clinical text. The agent can reason across both: "The patient has active diabetes (structured) and the progress note from January mentions poor medication adherence (unstructured)." For patterns on building AI agents that use this data effectively, see our guide on Observability for Agentic AI in Healthcare.

NLP Tool Comparison

Tool	Type	Best For	Cost	PHI Handling
spaCy + medspaCy	Open Source	Custom clinical NLP pipelines	Free	On-premise
Amazon Comprehend Medical	Cloud API	Quick start, medication extraction	$0.01/unit	BAA available
Google Healthcare NLP	Cloud API	High-volume processing	Pay per use	BAA available
Hugging Face Clinical Models	Open Source	Custom fine-tuning, research	Free	On-premise
John Snow Labs Spark NLP	Commercial	Enterprise clinical NLP	License	On-premise

For HIPAA-compliant production deployments, the choice is between on-premise open-source tools (spaCy/medspaCy, Hugging Face) and cloud APIs with BAAs (AWS, Google). Cloud APIs are faster to deploy but create PHI processing dependencies. On-premise gives you full control but requires ML infrastructure. Many teams use a hybrid: cloud OCR for document ingestion, on-premise NLP for entity extraction.

Frequently Asked Questions

How accurate is clinical NLP compared to manual chart review?

State-of-the-art clinical NER achieves 85-95% F1 scores for common entity types (medications, conditions, procedures). Negation detection typically achieves 90-95% accuracy with ConText-based approaches. However, accuracy varies significantly by document type: structured progress notes perform better than handwritten scanned documents. Always validate NLP output against a gold-standard annotated dataset before production use.

How do I handle PHI in the NLP pipeline?

All text processing happens within your HIPAA-compliant environment. Never send PHI to a cloud NLP service without a BAA in place. For vector databases, ensure they are deployed on HIPAA-eligible infrastructure (not a developer laptop). De-identification should happen early in the pipeline if downstream consumers do not need PHI.

What embedding model should I use for clinical text?

General-purpose embeddings (OpenAI text-embedding-3, Cohere embed) work surprisingly well for clinical text. Clinical-specific models like PubMedBERT and ClinicalBERT provide marginal accuracy improvements for medical concept retrieval. In practice, the chunk size and section-aware splitting have more impact on retrieval quality than the embedding model choice.

How do I handle multi-language clinical documents?

In the US, clinical documents are predominantly English, but patient-facing documents (consent forms, discharge instructions) may be in Spanish, Mandarin, or other languages. Use language detection first (fastText langdetect), then route to language-specific NLP pipelines. AWS Comprehend Medical and Google Healthcare NLP support limited multilingual capabilities.

What is the latency budget for real-time NLP processing?

For real-time use cases (clinical decision support during a visit), target under 2 seconds end-to-end: OCR (500ms), NLP (800ms), FHIR structuring (200ms), vector indexing (500ms). For batch processing (nightly population health analytics), latency is less critical — focus on throughput (documents per minute).

Conclusion

The unstructured data pipeline is what separates a proof-of-concept AI agent from a production-grade one. Without it, your agent is blind to 80% of the clinical picture. With it, your agent can reason across discharge summaries, progress notes, scanned referral letters, and faxed prior authorization forms — the same information a clinician uses to make decisions.

Start with Stage 1 (document ingestion) and Stage 2 (basic NER). Get extraction accuracy above 90% before moving to FHIR structuring and vector indexing. The pipeline is only as good as its NLP layer — invest in annotation, validation, and continuous improvement.

For the architectural foundations that support this pipeline, see our guides on Domain-Driven Design in Healthcare and Building an AI Clinical Scribe.

Was this article helpful?

Your feedback helps us improve our content.

USA Office - Elintex Technologies Inc.

India Office - Elintex Technologies Pvt. Ltd.