A single AI model can summarize a clinical note. It takes a multi-agent system to run a post-discharge workflow — generating the discharge summary, reconciling medications against the formulary, scheduling follow-up appointments within the CMS-mandated 7-day window, and producing patient education materials at the right health literacy level. All within hours, all HIPAA-compliant, all auditable.
We built this system. Not as a research prototype — as a production application processing real patient encounters at a 340-bed community hospital. This case study documents every architectural decision, failure mode, and performance metric from that implementation. If you are building healthcare applications that need more than one AI capability, this is the engineering reference you need.
The Problem: Why a Single Agent Fails Complex Clinical Workflows
The discharge process at this hospital involved 7 staff members across 4 departments, consumed an average of 47 minutes per patient, and had a 23% error rate in medication reconciliation. The errors were not negligence — they were systemic. A single clinician cannot simultaneously hold context about the patient's admission medications, new prescriptions added during the stay, formulary restrictions from the patient's insurance, and scheduling availability for 3 different specialists.
The initial approach — a monolithic AI agent that attempted all tasks — failed within two weeks. The context window could not hold a full patient record plus medication database plus scheduling data. Latency exceeded 45 seconds per interaction. And when the agent made an error in medication reconciliation, it was impossible to determine which part of the reasoning chain had failed.
The solution was decomposition: four specialized agents, each with a focused task, coordinated by an orchestration layer.
Architecture: Four Agents, One Orchestrator
The system comprises four domain-specific agents and a stateful orchestrator that manages workflow execution, error recovery, and human-in-the-loop review gates.
Agent 1: Clinical Summary Agent
Reads the patient's encounter history, lab results, vital signs, and provider notes from the FHIR server. Generates a structured discharge summary compliant with C-CDA 2.1 templates. The output is not free text — it is a structured document with discrete sections for diagnosis, procedures, hospital course, and follow-up instructions.
Model: Claude Sonnet 4 — selected for its ability to follow complex output schemas reliably. GPT-4o produced more natural prose but broke the C-CDA structure 12% of the time in testing.
Context window usage: 18,000-25,000 tokens per patient (encounter notes are token-dense). Uses selective FHIR resource loading — only Condition, Procedure, Observation, and MedicationRequest resources from the current encounter, not the full patient history.
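Selective resource loading can be as simple as building per-resource-type FHIR searches scoped to the current encounter. A minimal sketch, assuming a standard FHIR R4 search API — the base URL and encounter ID here are illustrative placeholders, not the hospital's real endpoints:

```python
# Sketch of selective FHIR resource loading: query only the resource types
# the Clinical Summary Agent needs, scoped to the current encounter.
FHIR_BASE = "https://fhir.example-hospital.org/r4"  # hypothetical endpoint

# Only the four resource types this agent actually consumes.
SUMMARY_RESOURCES = ["Condition", "Procedure", "Observation", "MedicationRequest"]

def summary_queries(encounter_id: str) -> list[str]:
    """Build per-resource search URLs scoped to one encounter,
    rather than pulling the patient's full record."""
    return [
        f"{FHIR_BASE}/{rtype}?encounter=Encounter/{encounter_id}&_count=200"
        for rtype in SUMMARY_RESOURCES
    ]

urls = summary_queries("enc-77832")
```

Scoping every search to `encounter=` is what keeps the context at 18,000-25,000 tokens instead of the 40,000+ a full record would cost.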
Agent 2: Medication Reconciliation Agent
Compares three medication lists: admission medications (what the patient was taking before), in-hospital medications (what was prescribed during the stay), and discharge medications (what the provider intends to send the patient home with). Flags discrepancies, checks for drug-drug interactions via RxNorm and NLM APIs, and verifies formulary coverage with the patient's insurance.
Model: Fine-tuned Llama 3 8B — medication reconciliation is a structured comparison task with well-defined rules. A fine-tuned small model matches GPT-4o accuracy at 1/50th the cost after training on 30,000 labeled reconciliation examples from the hospital's historical data.
Critical safety feature: Every flagged interaction includes the evidence source (DailyMed drug label section, NLM interaction severity rating) and a confidence score. Interactions with severity "high" or confidence below 0.85 trigger mandatory pharmacist review — no exceptions.
Agent 3: Follow-Up Scheduling Agent
Books the PCP follow-up within 7 days of discharge (a CMS quality measure that affects hospital reimbursement), schedules specialist referrals based on discharge diagnoses, and verifies insurance eligibility for each visit. Interfaces with the hospital's scheduling system via HL7 SIU messages and the payer's eligibility API via X12 270/271 transactions.
Model: No LLM — this agent is a deterministic rules engine with API integrations. Scheduling does not benefit from probabilistic reasoning. It needs reliable API calls, retry logic, and constraint satisfaction (find a slot within 7 days that matches provider availability, patient preference, and insurance network).
Agent 4: Patient Education Agent
Generates personalized discharge instructions at the patient's documented health literacy level (pulled from the FHIR Patient resource's communication preferences). Includes medication guides with visual pill identifiers, warning signs that should trigger an ER visit, and dietary restrictions specific to the patient's conditions.
Model: GPT-4o — patient-facing content benefits from natural language quality. The output is reviewed by the discharging nurse before delivery to the patient.
Orchestration: How the Agents Coordinate
The orchestrator is built on LangGraph with a PostgreSQL-backed state store. It manages the workflow as a directed acyclic graph (DAG) where some agents run sequentially (the education agent needs the medication reconciliation output) and others run in parallel (clinical summary and scheduling can execute simultaneously).
The Workflow DAG
Patient Discharge Triggered (ADT^A03 message)
|
v
[Clinical Summary Agent]  ----parallel---->  [Scheduling Agent]
          |                                         |
          v                                         v
[Medication Reconciliation Agent]        [Appointments Confirmed]
|
v
[Human Review Gate: Pharmacist reviews med recon if high-severity flags]
|
v
[Patient Education Agent] (uses med recon output + clinical summary)
|
v
[Final Review Gate: Nurse reviews complete discharge packet]
|
v
[Discharge Packet Delivered to Patient + EHR Updated]

The parallel execution of the Clinical Summary Agent and Scheduling Agent reduces total workflow time from 12 minutes (sequential) to 7 minutes. The human review gates add 3-8 minutes depending on pharmacist and nurse availability, but these are non-negotiable safety checkpoints.
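The fan-out/fan-in wiring above can be sketched in plain asyncio — this is an illustration of the dependency structure, not the actual LangGraph graph, and the agent bodies are stubs:

```python
import asyncio

# Minimal sketch of the workflow DAG using plain asyncio; the production
# orchestrator is LangGraph with PostgreSQL-backed state.
state: dict = {"order": []}

async def run(agent: str) -> None:
    state["order"].append(agent)  # stand-in for real agent work

async def discharge_workflow() -> None:
    # ADT^A03 trigger fans out: summary and scheduling run in parallel.
    await asyncio.gather(run("clinical_summary"), run("scheduling"))
    # Med reconciliation depends on the clinical summary branch.
    await run("med_reconciliation")
    await run("pharmacist_review_gate")
    # Education needs med recon output + clinical summary.
    await run("patient_education")
    await run("nurse_review_gate")

asyncio.run(discharge_workflow())
```

The ordering constraint is the whole point: anything downstream of a review gate cannot start until the gate resolves, while independent branches never wait on each other.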
State Management
Each agent writes its output to a shared state object persisted in PostgreSQL. The orchestrator reads this state to determine which agents can execute next and whether review gates have been satisfied.
{
"workflow_id": "dc-2026-03-17-00842",
"patient_id": "Patient/p-28491",
"encounter_id": "Encounter/enc-77832",
"status": "awaiting_nurse_review",
"agents": {
"clinical_summary": {
"status": "completed",
"output_ref": "DocumentReference/dr-99281",
"duration_ms": 4200,
"tokens_used": 22400,
"model": "claude-sonnet-4"
},
"med_reconciliation": {
"status": "completed",
"flags": ["warfarin_dose_change", "metformin_added"],
"high_severity_interactions": 0,
"pharmacist_review": "approved",
"duration_ms": 1800,
"model": "llama-3-8b-medrec-v2"
},
"scheduling": {
"status": "completed",
"pcp_followup": "2026-03-22T10:30:00",
"specialist_referrals": ["cardiology_2026-03-28"],
"insurance_verified": true
},
"patient_education": {
"status": "completed",
"literacy_level": "6th_grade",
"output_ref": "DocumentReference/dr-99282"
}
}
}
Results: What Changed
After 12 weeks in production (2,847 discharge workflows processed):
| Metric | Before (Manual) | After (Multi-Agent) | Change |
|---|---|---|---|
| Average discharge processing time | 47 minutes | 11 minutes (7 min agents + 4 min human review) | -77% |
| Medication reconciliation error rate | 23% | 3.1% | -87% |
| 7-day PCP follow-up scheduling rate | 61% | 94% | +54% |
| 30-day readmission rate | 14.2% | 11.8% | -17% (early signal) |
| Staff time per discharge | 47 min (7 staff) | 8 min (2 reviewers) | -83% |
| Cost per discharge (labor) | $38.40 | $7.20 + $0.21 AI | -81% |
The 30-day readmission improvement is preliminary — we need 6+ months of data to confirm statistical significance. But the medication reconciliation and scheduling improvements are robust and directly attributable to the system.
Safety Architecture: Five Layers of Protection
Clinical AI systems that touch patient care decisions require defense in depth. A single validation layer is not sufficient — you need multiple independent checks so that any single layer failing does not result in patient harm.
Layer 1: Output Schema Enforcement
Every agent output is validated against a JSON Schema before it enters the shared state. The Clinical Summary Agent must produce valid C-CDA sections. The Medication Reconciliation Agent must return structured medication objects with RxNorm CUIs, not free-text drug names. If schema validation fails, the agent retries with a corrective prompt. After 3 failures, the workflow escalates to manual processing.
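The validate-retry-escalate loop looks roughly like this. A minimal sketch — the real system validates against full JSON Schemas, and the required-key check and stub agent here are illustrative stand-ins:

```python
# Sketch of Layer 1: schema check, corrective retry, manual escalation.
# REQUIRED_KEYS is a simplified stand-in for a full JSON Schema.
REQUIRED_KEYS = {"status", "output_ref", "model"}

def validate(output: dict) -> bool:
    return REQUIRED_KEYS.issubset(output)

def run_with_validation(call_agent, max_attempts: int = 3) -> dict:
    """Retry with a corrective hint up to max_attempts, then escalate."""
    hint = None
    for _ in range(max_attempts):
        output = call_agent(hint)
        if validate(output):
            return output
        hint = f"Output must contain keys: {sorted(REQUIRED_KEYS)}"
    return {"status": "escalated_to_manual"}

# Stub agent that fails schema validation once, then succeeds.
calls = {"n": 0}
def flaky_agent(hint):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"status": "completed"}  # missing keys -> invalid
    return {"status": "completed", "output_ref": "dr-1", "model": "m"}

result = run_with_validation(flaky_agent)
```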
Layer 2: Clinical Rules Engine
A deterministic rules engine runs after every agent output. It checks: drug-allergy conflicts against the patient's FHIR AllergyIntolerance resources, dosage ranges against DailyMed label data, scheduling constraints against CMS quality measures. This layer catches errors that the LLM might miss because it operates on structured data with hard-coded clinical rules, not probabilistic inference.
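The drug-allergy check illustrates why this layer is deterministic. A simplified sketch — the ingredient matching here stands in for the real lookup against FHIR AllergyIntolerance resources and RxNorm ingredient mappings:

```python
# Sketch of one Layer 2 rule: flag discharge meds whose ingredient
# matches a documented allergy. No model, no probability — a set lookup.
def allergy_conflicts(discharge_meds: list[dict], allergies: set[str]) -> list[str]:
    documented = {a.lower() for a in allergies}
    return [
        m["name"] for m in discharge_meds
        if m["ingredient"].lower() in documented
    ]

meds = [
    {"name": "Amoxicillin 500mg", "ingredient": "amoxicillin"},
    {"name": "Metformin 1000mg", "ingredient": "metformin"},
]
flags = allergy_conflicts(meds, {"Amoxicillin"})
```

An LLM might miss this one time in a thousand; a set membership test never does. That asymmetry is why the rules engine runs after every agent output.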
Layer 3: Human-in-the-Loop Review
Not every output requires human review — that would negate the efficiency gains. The system uses a confidence-severity matrix to determine when human review is required:
- High severity + any confidence: Always reviewed (medication interactions, allergy flags)
- Low severity + high confidence: Auto-approved (routine scheduling confirmations)
- Low severity + low confidence: Batched for review (non-critical documentation edits)
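The matrix reduces to a small routing function. A sketch combining the matrix above with the 0.85 pharmacist-review threshold from the medication reconciliation agent — the threshold value is the one stated earlier; everything else is illustrative:

```python
# Sketch of the confidence-severity routing matrix. High severity always
# routes to a human; the 0.85 threshold splits the low-severity cases.
CONFIDENCE_THRESHOLD = 0.85

def review_route(severity: str, confidence: float) -> str:
    if severity == "high":
        return "mandatory_review"   # any confidence — no exceptions
    if confidence >= CONFIDENCE_THRESHOLD:
        return "auto_approve"       # low severity + high confidence
    return "batched_review"         # low severity + low confidence
```

Keeping this as an explicit function (rather than prompt instructions) means the routing policy is auditable and testable independently of any model.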
Layer 4: Audit Trail
Every agent invocation, every tool call, every human review decision is logged to an immutable audit store. The audit trail includes: which model was used, what data was in the context window, what output was generated, and whether a human modified the output. This is not optional — it is a HIPAA requirement for systems that process PHI.
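One way to make the trail tamper-evident is hash chaining: each entry includes a digest of itself plus the previous entry's digest. A minimal in-memory sketch — a production store would additionally need durable, access-controlled storage to satisfy HIPAA, and the record fields here are illustrative:

```python
import hashlib
import json

# Sketch of an append-only, hash-chained audit trail: modifying any
# earlier entry invalidates every digest after it.
class AuditLog:
    def __init__(self):
        self.entries: list[dict] = []
        self._prev = "0" * 64  # genesis hash

    def append(self, record: dict) -> None:
        payload = json.dumps({**record, "prev": self._prev}, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.entries.append({**record, "prev": self._prev, "hash": digest})
        self._prev = digest

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True

log = AuditLog()
log.append({"agent": "med_reconciliation", "model": "llama-3-8b-medrec-v2"})
log.append({"agent": "patient_education", "reviewer_modified": True})
```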
Layer 5: HIPAA Compliance Boundary
Each agent accesses only the minimum necessary PHI for its task. The Scheduling Agent never sees clinical notes. The Patient Education Agent receives the medication list but not the full encounter history. The orchestrator enforces these boundaries by controlling which FHIR resources each agent can query.
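Enforcing minimum-necessary access comes down to an explicit per-agent allow-list that the orchestrator checks before any FHIR query executes. A sketch — the scope sets below are illustrative, not the production policy:

```python
# Sketch of Layer 5: each agent may query only an explicit allow-list of
# FHIR resource types; everything else is denied before the query runs.
AGENT_SCOPES = {
    "clinical_summary": {"Condition", "Procedure", "Observation", "MedicationRequest"},
    "med_reconciliation": {"MedicationRequest", "AllergyIntolerance"},
    "scheduling": {"Appointment", "Coverage"},          # never clinical notes
    "patient_education": {"MedicationRequest", "Patient"},
}

class PHIAccessDenied(Exception):
    pass

def authorize_query(agent: str, resource_type: str) -> None:
    if resource_type not in AGENT_SCOPES.get(agent, set()):
        raise PHIAccessDenied(f"{agent} may not read {resource_type}")

authorize_query("scheduling", "Appointment")  # allowed, returns silently

try:
    authorize_query("scheduling", "DocumentReference")  # clinical notes
    denied = False
except PHIAccessDenied:
    denied = True
```

Because the check lives in the orchestrator, no prompt injection or agent bug can widen an agent's PHI surface.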
Technology Stack
| Layer | Technology | Why This Choice |
|---|---|---|
| Orchestration | LangGraph + PostgreSQL | Stateful workflows with persistence, retry logic, and human-in-the-loop gates built in |
| Clinical Summary | Claude Sonnet 4 via API | Best schema adherence in testing (98.3% valid C-CDA output vs 87.6% for GPT-4o) |
| Med Reconciliation | Fine-tuned Llama 3 8B (self-hosted) | Structured comparison task — fine-tuned model matches GPT-4o at 2% of cost |
| Scheduling | Custom Python (no LLM) | Deterministic task — LLM adds latency and unpredictability with no accuracy benefit |
| Patient Education | GPT-4o via API | Best natural language quality for patient-facing content |
| RAG / Knowledge Base | pgvector + clinical guidelines | Drug formularies, clinical protocols, discharge checklists embedded for retrieval |
| FHIR Server | HAPI FHIR R4 | Open-source, well-tested, and customizable for hospital-specific interoperability needs |
| Message Bus | Apache Kafka | ADT event streaming triggers workflows; decouples EHR from agent system |
| Monitoring | OpenTelemetry + Grafana | Per-agent latency, token usage, error rates, and compliance dashboard |
| Infrastructure | AWS EKS (HIPAA-eligible) | BAA in place, dedicated tenancy for PHI workloads, GPU nodes for self-hosted models |
Cost Analysis
The total AI cost per discharge workflow is $0.21, broken down across four agents. This is after optimization — the initial prototype cost $1.84 per workflow before we fine-tuned the medication reconciliation model and implemented prompt caching for the clinical summary agent.
Cost Reduction Strategies That Worked
- Fine-tuning for high-volume tasks: Moving medication reconciliation from GPT-4o ($0.12/call) to a fine-tuned Llama 3 8B ($0.002/call) saved $0.118 per workflow — 57% of total cost.
- Prompt caching: The clinical summary agent uses the same system prompt and clinical guidelines context for every patient. Anthropic's prompt caching reduces input token costs by 90% for this cached prefix.
- No LLM for deterministic tasks: The scheduling agent uses direct API calls, not an LLM. This is faster, cheaper, and more reliable. Not every agent needs a language model.
- Selective context loading: Instead of loading the entire patient record (40,000+ tokens), each agent loads only the FHIR resources it needs (8,000-25,000 tokens).
At 2,847 discharges over 12 weeks (approximately 34/day), the monthly AI infrastructure cost is $215. The labor savings from reduced staff time: $26,400/month. ROI: 122:1.
Implementation Timeline
Total implementation: 16 weeks from kickoff to production. This is not a weekend hackathon — multi-agent clinical systems require deliberate engineering, clinical validation, and compliance review.
Weeks 1-5: Foundation
- FHIR server deployment and data pipeline from EHR (ADT event streaming via Kafka)
- Authentication and authorization layer (SMART on FHIR)
- Vector database setup for clinical knowledge base (drug formularies, clinical guidelines)
- Audit logging infrastructure (immutable event store)
Weeks 3-12: Agent Development (Overlapping)
- Each agent developed and tested independently against synthetic patient data (Synthea-generated FHIR bundles)
- Medication reconciliation model fine-tuning (3 weeks including data preparation, training, and evaluation)
- Integration with hospital's scheduling API and payer eligibility services
Weeks 10-14: Integration and Clinical Validation
- End-to-end workflow testing with 500 historical discharge cases
- Clinical validation: pharmacist and physician review of agent outputs
- Adversarial testing: deliberately feeding edge cases (polypharmacy patients, pediatric cases, patients with documented allergies to common medications)
Weeks 14-16: Deployment
- Shadow mode (2 weeks): system runs in parallel with manual process, outputs compared but not acted upon
- Production rollout with kill switch: manual fallback available for any workflow within 30 seconds
What We Would Do Differently
Start with the scheduling agent, not the clinical summary agent. The scheduling agent is deterministic, has clear success metrics (appointment booked: yes/no), and delivers immediate value. Starting with the most complex agent (clinical summary) meant 6 weeks before anything worked end-to-end. Starting with scheduling would have delivered value in week 4.
Build the audit trail first, not last. We added comprehensive audit logging in week 10. Every bug we investigated before that required manual log correlation. If you are building clinical AI, the audit infrastructure is your debugging tool, your compliance proof, and your safety net. Build it before your first agent.
Fine-tune earlier. We ran medication reconciliation on GPT-4o for 4 weeks before fine-tuning a small model. Those 4 weeks cost $3,200 in API calls that a $500 fine-tuning job would have eliminated. If your task is structured and high-volume, fine-tune immediately.
Frequently Asked Questions
How many agents should a healthcare application have?
As few as possible. Each agent adds orchestration complexity, failure modes, and monitoring overhead. Start with one agent that solves a real problem. Add a second only when the first agent's task scope becomes too broad for a single context window or requires fundamentally different capabilities (e.g., structured data extraction vs. natural language generation).
Can I use a single LLM for all agents?
You can, but you should not. Different tasks have different accuracy-cost trade-offs. Using GPT-4o for a structured extraction task that a fine-tuned 8B model handles equally well costs roughly 50x more per interaction. Match the model to the task complexity.
What happens when an agent fails?
The orchestrator implements circuit breaker patterns. Agent failures trigger: (1) automatic retry with a modified prompt, (2) fallback to a simpler model, (3) escalation to manual processing. No patient discharge is blocked by an agent failure — the system degrades gracefully to manual workflows within 30 seconds.
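The retry-fallback-escalate ladder can be sketched as follows — the `primary` and `fallback` callables here are stubs for real model invocations, and the retry count is illustrative:

```python
# Sketch of graceful degradation: retry the primary agent, fall back to a
# simpler model, then hand the workflow to the manual process.
def run_with_degradation(primary, fallback) -> dict:
    for attempt in range(2):                  # (1) retry primary once
        try:
            return primary(retry=attempt > 0)
        except Exception:
            continue
    try:
        return fallback()                     # (2) simpler model
    except Exception:
        return {"status": "manual_workflow"}  # (3) degrade to manual

def always_fails(retry=False):
    raise RuntimeError("model timeout")

def simple_model():
    return {"status": "completed", "model": "fallback-small"}

def failing_fallback():
    raise RuntimeError("fallback down")

result = run_with_degradation(always_fails, simple_model)
escalated = run_with_degradation(always_fails, failing_fallback)
```

The key property is that every path terminates in a usable outcome: a completed output or an explicit handoff to the manual workflow, never a stuck discharge.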
How do you handle HIPAA compliance with multiple agents?
Each agent operates within a minimum-necessary PHI boundary enforced by the orchestrator. The scheduling agent cannot access clinical notes. The education agent receives medication names but not full encounter history. All inter-agent communication is logged, encrypted in transit (TLS 1.3), and encrypted at rest (AES-256). The HIPAA compliance surface for multi-agent systems is larger than single-agent systems — plan for it from day one.
What is the minimum team size to build this?
Our team comprised: 2 ML engineers (agent development and fine-tuning), 1 backend engineer (orchestration and FHIR integration), 1 DevOps engineer (infrastructure and monitoring), 1 clinical informaticist (validation and safety review), and a part-time project manager. Minimum viable team: 3 engineers + 1 clinical advisor.
Building multi-agent healthcare systems is an orchestration problem, not an ML problem. The hardest parts are not the models — they are the workflow design, safety layers, and clinical validation. At Nirmitee, we build production healthcare AI systems with the integration, compliance, and clinical safety infrastructure built in from day one. If you are planning a multi-agent implementation, talk to our team — we have already made the mistakes so you do not have to.