Alerting for Healthcare Systems: PagerDuty Runbooks for FHIR Server, Mirth, and EHR Integrations

April 1, 2026


At 2:47 AM on a Tuesday, your PagerDuty fires: "FHIR Server 5xx rate exceeds 5%." The on-call engineer -- three months into the job, never seen this alert before -- opens their laptop, stares at the Grafana dashboard, and starts guessing. They check the application logs. They restart the server. They check the database. Twenty minutes later, they discover the root cause was a connection pool leak triggered by a specific query pattern. But those twenty minutes translated to 400 failed lab result deliveries that now need manual reconciliation.

This scenario plays out at health systems every week. The alert fired correctly. The engineer responded quickly. But without a runbook -- a documented, step-by-step diagnostic and resolution procedure for that specific alert -- the response was improvised instead of systematic. In healthcare, improvised incident response carries clinical risk.

This guide provides production-ready runbook templates for the six most common healthcare system alerts, severity classification frameworks adapted for clinical impact, and PagerDuty/OpsGenie configuration to route alerts correctly. Every runbook has been tested in production environments processing millions of healthcare messages daily.

Severity Classification for Healthcare Systems

Standard IT severity levels (Sev1-Sev4) do not account for the unique dimension of healthcare alerting: clinical impact. A database running at 95% CPU is a Sev2 in most industries. In healthcare, it depends entirely on what that database serves. If it backs a drug interaction checking service, it is a clinical safety issue. If it backs a reporting system, it can wait until morning.

Healthcare-Adapted Severity Levels

Level	Name	Response Time	Definition	Escalation
P1	Clinical Impact	5 minutes	System failure directly affecting patient care delivery, clinical decision-making, or medication safety	Immediate page to on-call + engineering lead + clinical informatics
P2	Operational Impact	15 minutes	System degradation affecting healthcare operations, with no immediate clinical safety risk	Page on-call engineer, Slack alert to team
P3	Degraded Performance	1 hour	Performance degradation, elevated error rates, or capacity concerns that may escalate if unaddressed	Slack notification, ticket created
P4	Informational	Next business day	Planned maintenance notifications, certificate expiry warnings, capacity trend alerts	Email notification, backlog ticket
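
The table above can be read as a small decision procedure. The following is a minimal sketch of that reading, not a definitive implementation: the `Impact` fields and the `classify` function are hypothetical names introduced for illustration, and the response targets are copied from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "Clinical Impact"
    P2 = "Operational Impact"
    P3 = "Degraded Performance"
    P4 = "Informational"


# Response targets in minutes, taken from the table above.
# P4 ("next business day") has no fixed minute target, so it maps to None.
RESPONSE_MINUTES = {Severity.P1: 5, Severity.P2: 15, Severity.P3: 60, Severity.P4: None}


@dataclass(frozen=True)
class Impact:
    """Hypothetical description of what a failing component affects."""
    affects_patient_care: bool = False  # care delivery, decision support, medication safety
    affects_operations: bool = False    # operations disrupted, no immediate safety risk
    degraded_only: bool = False         # slow or error-prone, may escalate if ignored


def classify(impact: Impact) -> Severity:
    """Pick the highest applicable level; clinical impact always wins."""
    if impact.affects_patient_care:
        return Severity.P1
    if impact.affects_operations:
        return Severity.P2
    if impact.degraded_only:
        return Severity.P3
    return Severity.P4


# The same database symptom lands in different tiers depending on what the database serves.
drug_interaction_db = Impact(affects_patient_care=True, degraded_only=True)
reporting_db = Impact(degraded_only=True)
print(classify(drug_interaction_db).name, classify(reporting_db).name)
```

Checking clinical impact first means a degraded component never gets a lower tier just because it is "only slow".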

Symptom-Based vs Cause-Based Alerting

A critical distinction in healthcare alerting: alert on symptoms, not causes. Symptoms are what users experience. Causes are what engineers investigate.

Bad (cause-based): "PostgreSQL CPU at 95%." This fires every time the database is busy, but does not tell you if anything is actually broken.
Good (symptom-based): "FHIR search p95 latency exceeds 2 seconds for 5 minutes." This tells you users are experiencing slow responses, regardless of the underlying cause.

Cause-based alerts generate noise. Symptom-based alerts generate action. The runbooks below are organized around symptoms, with diagnostic steps that identify the underlying cause.
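
To make the symptom-based example concrete, here is a rough sketch of the check behind "p95 latency exceeds 2 seconds for 5 minutes". It assumes latencies are collected into one-minute windows. The function names and sample data are illustrative only, and a real deployment would normally let the monitoring system compute this.

```python
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    return ordered[rank - 1]


def should_alert(windows: Sequence[Sequence[float]], threshold_s: float = 2.0,
                 sustained_windows: int = 5) -> bool:
    """True when each of the most recent `sustained_windows` windows breaches the threshold."""
    if len(windows) < sustained_windows:
        return False
    recent = windows[-sustained_windows:]
    return all(len(w) > 0 and p95(w) > threshold_s for w in recent)


# Hypothetical one-minute windows of FHIR search latencies.
slow_minute = [0.3] * 18 + [2.5, 3.1]  # 10% of requests are slow -> p95 is 2.5s
fast_minute = [0.3] * 19 + [2.5]       # a single outlier -> p95 stays at 0.3s

print(should_alert([slow_minute] * 5))                  # sustained breach
print(should_alert([slow_minute] * 4 + [fast_minute]))  # recovered in the last minute
```

A single slow request never trips the check. Only a sustained shift in what users experience does.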

PagerDuty Service Configuration

# PagerDuty service configuration for healthcare alerting

services:
  - name: "FHIR Server - Clinical"
    description: "FHIR server supporting clinical applications"
    escalation_policy: "clinical-systems-escalation"
    alert_creation: "create_alerts_and_incidents"
    auto_resolve_timeout: 14400    # 4 hours
    acknowledgement_timeout: 1800  # 30 min re-alert if not ack'd
    
    integrations:
      - type: "prometheus"
        name: "Prometheus Alertmanager"
      
    incident_urgency_rule:
      type: "use_support_hours"
      during_support_hours:
        type: "constant"
        urgency: "high"
      outside_support_hours:
        type: "constant"
        urgency: "high"    # Healthcare = always high for clinical

  - name: "Integration Engine - Operational"
    description: "Mirth Connect and message processing"
    escalation_policy: "integration-team-escalation"
    alert_creation: "create_alerts_and_incidents"
    
    incident_urgency_rule:
      type: "use_support_hours"
      during_support_hours:
        type: "constant"
        urgency: "high"
      outside_support_hours:
        type: "constant"
        urgency: "low"     # Non-clinical can wait for business hours

escalation_policies:
  - name: "clinical-systems-escalation"
    rules:
      - escalation_delay_in_minutes: 5
        targets:
          - type: "user"
            id: "on-call-engineer"
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "engineering-lead"
      - escalation_delay_in_minutes: 30
        targets:
          - type: "user"
            id: "vp-engineering"
          - type: "user"
            id: "clinical-informatics-lead"
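
It helps to work out when each tier of the policy above is actually notified. The sketch below assumes that each rule's `escalation_delay_in_minutes` is the time an unacknowledged incident stays on that rule before moving to the next one, so the delays accumulate. The `notification_timeline` helper is a hypothetical name used only for this illustration.

```python
from typing import List, Tuple

# Rules copied from the clinical-systems-escalation policy above:
# (escalation_delay_in_minutes, targets notified by that rule).
RULES: List[Tuple[int, List[str]]] = [
    (5, ["on-call-engineer"]),
    (15, ["engineering-lead"]),
    (30, ["vp-engineering", "clinical-informatics-lead"]),
]


def notification_timeline(rules: List[Tuple[int, List[str]]]) -> List[Tuple[int, List[str]]]:
    """Return (minutes_after_trigger, targets) for each rule, assuming no acknowledgement."""
    timeline = []
    elapsed = 0
    for delay, targets in rules:
        timeline.append((elapsed, targets))  # this rule is notified once the previous delays pass
        elapsed += delay
    return timeline


for minute, targets in notification_timeline(RULES):
    print(f"t+{minute:>2} min: {', '.join(targets)}")
```

Under that assumption, clinical informatics is reached 20 minutes after the alert fires. A team that wants them paged immediately for P1 incidents, as the severity table suggests, could add them as a target on the first rule.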

Runbook 1 — FHIR Server 5xx Error Spike

Alert Definition

- alert: FhirServer5xxSpike
  expr: |
    sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) /
    sum(rate(http_server_requests_seconds_count[5m])) * 100 > 5
  for: 5m
  labels:
    severity: P1
    service: fhir-server
    runbook: "https://wiki.internal/runbooks/fhir-5xx-spike"
  annotations:
    summary: 'FHIR server 5xx rate at {{ $value | printf "%.1f" }}%'
    impact: "Clinical applications unable to read/write patient data"

Diagnostic Steps

Step 1: Identify the error pattern (2 minutes)

# Check which endpoints are failing
curl -s http://localhost:8080/actuator/metrics/http.server.requests \
  | jq '.availableTags[] | select(.tag=="uri") | .values[]'

# Check application logs for stack traces
journalctl -u hapi-fhir --since "5 minutes ago" | grep -E "ERROR|Exception" | head -20

# Check JVM health
curl -s http://localhost:8080/actuator/health | jq '.components'

Step 2: Classify the root cause

Log Pattern	Likely Cause	Immediate Action
`java.lang.OutOfMemoryError`	Heap exhaustion from large queries or memory leak	Restart JVM with increased heap; investigate query patterns
`HikariPool-1 - Connection is not available`	Database connection pool exhausted	See Runbook 4 (DB Connection Pool)
`PSQLException: FATAL: too many connections`	PostgreSQL max_connections reached	Kill idle connections; raise max_connections at the next planned restart (the setting cannot be changed on a running server)
`SocketTimeoutException`	Downstream service (terminology server, external FHIR) timeout	Check downstream service health; increase timeout temporarily
`SearchCoordinatorSvcImpl - Failed to load`	Corrupted search result cache	Clear search cache; restart if persistent
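
The lookup in this table is mechanical enough to script, which saves time when a log excerpt contains several different errors. The following is a minimal sketch, assuming plain substring matching is good enough. The sample log lines are hypothetical, and the actions are abbreviated from the table.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (substring to look for, likely cause, immediate action), from the table above.
KNOWN_PATTERNS = [
    ("java.lang.OutOfMemoryError", "Heap exhaustion", "Restart JVM with increased heap"),
    ("HikariPool-1 - Connection is not available", "Database connection pool exhausted", "See Runbook 4"),
    ("FATAL: too many connections", "PostgreSQL max_connections reached", "Kill idle connections"),
    ("SocketTimeoutException", "Downstream service timeout", "Check downstream service health"),
    ("SearchCoordinatorSvcImpl - Failed to load", "Corrupted search result cache", "Clear search cache"),
]


def classify_log_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (likely cause, immediate action) for the first matching pattern, else None."""
    for needle, cause, action in KNOWN_PATTERNS:
        if needle in line:
            return cause, action
    return None


def summarize(lines: List[str]) -> List[Tuple[str, int]]:
    """Count how often each known cause appears, most frequent first."""
    causes = (match[0] for match in map(classify_log_line, lines) if match is not None)
    return Counter(causes).most_common()


# Hypothetical log excerpt for illustration.
sample_lines = [
    "02:47:01 ERROR HikariPool-1 - Connection is not available, request timed out after 30000ms.",
    "02:47:02 ERROR HikariPool-1 - Connection is not available, request timed out after 30000ms.",
    "02:47:03 ERROR java.net.SocketTimeoutException: Read timed out",
    "02:47:04 INFO  Request completed",
]
print(summarize(sample_lines))
```

The most frequent cause is usually the one to act on first. Rarer entries are often knock-on effects of it.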

Step 3: Apply the fix (varies by cause)

# Emergency JVM restart (if OOM or unresponsive)
sudo systemctl restart hapi-fhir

# Wait for startup health check (typically 30-60 seconds)
until curl -sf http://localhost:8080/fhir/metadata > /dev/null; do
  sleep 5
  echo "Waiting for FHIR server to start..."
done

# Verify error rate has dropped (COUNT is typically cumulative, so it should stop increasing)
for i in $(seq 1 5); do
  sleep 30
  curl -s "http://localhost:8080/actuator/metrics/http.server.requests?tag=status:500" \
    | jq '.measurements[] | select(.statistic=="COUNT") | .value'
done

Step 4: Verify and close

Confirm 5xx rate below 1% for 15 consecutive minutes
Run smoke tests against critical endpoints (Patient search, Observation read, CapabilityStatement)
Check for backlogged messages in integration engine queues
Notify the clinical informatics team of the resolution and any data gap window
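
The first closing criterion, "below 1% for 15 consecutive minutes", can be checked from counter samples rather than by eye. This is a sketch under stated assumptions: samples are cumulative `(total_requests, error_requests)` pairs taken once per minute, and the function names are hypothetical.

```python
from typing import List, Tuple


def per_interval_error_pct(samples: List[Tuple[int, int]]) -> List[float]:
    """Convert cumulative (total, errors) samples into an error percentage per interval."""
    pcts = []
    for (t0, e0), (t1, e1) in zip(samples, samples[1:]):
        total, errors = t1 - t0, e1 - e0
        if total < 0 or errors < 0:
            continue  # counters went backwards (e.g. a restart); skip this interval
        # An interval with no traffic is treated as 0% here; adjust if silence is itself a symptom.
        pcts.append(100.0 * errors / total if total else 0.0)
    return pcts


def recovered(samples: List[Tuple[int, int]], threshold_pct: float = 1.0,
              required_intervals: int = 15) -> bool:
    """True when the last `required_intervals` intervals all stayed below the threshold."""
    pcts = per_interval_error_pct(samples)
    if len(pcts) < required_intervals:
        return False
    return all(p < threshold_pct for p in pcts[-required_intervals:])


# Hypothetical samples: 1000 requests and 5 errors per minute (0.5%) for 15 minutes.
steady = [(1000 * i, 5 * i) for i in range(16)]
print(recovered(steady))
```

Working from deltas matters because a cumulative counter still contains the errors from the incident itself. Only the change between samples shows whether the spike is over.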

Runbook 2 — Mirth Channel Stopped


Alert Definition

- alert: MirthChannelStopped
  expr: mirth_channel_state{state="STOPPED"} == 1
  for: 2m
  labels:
    severity: P2
    service: mirth-connect
    runbook: "https://wiki.internal/runbooks/mirth-channel-stopped"
  annotations:
    summary: "Mirth channel {{ $labels.channel_name }} is STOPPED"
    impact: "Messages for this channel are queuing and not being processed"

Diagnostic Steps

Step 1: Determine if the stop was intentional (1 minute)

# Check Mirth channel status via API
# (admin:admin is a placeholder; use real credentials, never the defaults, in production)
curl -s -u admin:admin https://mirth:8443/api/channels/statuses \
  | python3 -c '
import sys, xml.etree.ElementTree as ET
root = ET.fromstring(sys.stdin.read())
for ch in root.findall(".//dashboardStatus"):
    print(ch.findtext("name"), ch.findtext("state"), sep=": ")
'

# Check deployment history for recent changes
curl -s -u admin:admin https://mirth:8443/api/channels/{channel_id}/history
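
For repeated use, the parsing half of the status command above is easier to maintain as a small script than as a one-liner. The sketch below assumes the response shape implied by that command (`dashboardStatus` elements with `name` and `state` children). The sample payload is hypothetical and heavily trimmed, and real responses may nest connector-level statuses that this would also pick up.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, trimmed response body used only for illustration.
SAMPLE_STATUS_XML = """<list>
  <dashboardStatus><name>ADT_Inbound</name><state>STARTED</state></dashboardStatus>
  <dashboardStatus><name>Lab_Results</name><state>STOPPED</state></dashboardStatus>
</list>"""


def channel_states(xml_text: str) -> Dict[str, str]:
    """Map each status entry's name to its state."""
    root = ET.fromstring(xml_text)
    return {
        status.findtext("name", default="?"): status.findtext("state", default="UNKNOWN")
        for status in root.iter("dashboardStatus")
    }


def stopped_channels(xml_text: str) -> List[str]:
    """Names of entries currently reporting STOPPED, sorted for stable output."""
    return sorted(name for name, state in channel_states(xml_text).items() if state == "STOPPED")


print(stopped_channels(SAMPLE_STATUS_XML))
```

Piping the curl output into a script like this gives the on-call engineer a short list of stopped channels to compare against any planned maintenance.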

Step 2: Check the channel error log

# Get recent errors for the stopped channel
# (requests JSON explicitly; key names below may differ by Mirth version, so adjust to your response)
curl -s -u admin:admin -H "Accept: application/json" \
  "https://mirth:8443/api/channels/{channel_id}/messages?status=ERROR&limit=10" \
  | python3 -c '
import sys, json
msgs = json.load(sys.stdin)
for m in msgs.get("messages", []):
    print("Error:", m.get("errors", {}).get("content", "unknown"))
'

Step 3: Attempt a restart

# Start the channel via Mirth API
curl -s -X POST -u admin:admin   "https://mirth:8443/api/channels/{channel_id}/_start"

# Verify channel state changed to STARTED
sleep 10
curl -s -u admin:admin   "https://mirth:8443/api/channels/{channel_id}/status" | grep "state"

# Monitor for 5 minutes to ensure it stays running
for i in $(seq 1 10); do
  sleep 30
  STATE=$(curl -s -u admin:admin     "https://mirth:8443/api/channels/{channel_id}/status" | grep -o "STARTED\|STOPPED")
  echo "$(date): Channel state = $STATE"
  [ "$STATE" = "STOPPED" ] && echo "Channel stopped again -- investigate root cause" && break
done

Common Root Causes and Fixes

Symptom	Root Cause	Fix
Channel stops immediately after start	Port already in use by another channel or process	Check `netstat -tlnp | grep PORT`; resolve conflict
Channel stops after processing N messages	Memory leak in custom transformer/filter code	Review custom JavaScript; add memory limits
Channel stops with an SSL error	Certificate expired or trust store outdated	Update certificate; see Runbook 5 (Certificate Expiry)
Channel stops with a database error	Mirth internal DB (Derby/PostgreSQL) issue	Check Mirth DB connectivity; restart Mirth service

Runbook 3 — HL7 Parsing Failures

Alert Definition

- alert: HL7ParsingFailureSpike
  expr: |
    sum(rate(mirth_messages_errored_total{error_type="PARSE"}[5m])) by (channel_name) /
    sum(rate(mirth_messages_received_total[5m])) by (channel_name) * 100 > 2
  for: 10m
  labels:
    severity: P2
    service: mirth-connect
    runbook: "https://wiki.internal/runbooks/hl7-parsing-failures"
  annotations:
    summary: 'HL7 parsing failure rate at {{ $value | printf "%.1f" }}% on {{ $labels.channel_name }}'

Diagnostic Steps

Step 1: Get sample failed messages

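# Extract the raw content of recent parsing failures
# (requests JSON explicitly; key names below may differ by Mirth version)
curl -s -u admin:admin -H "Accept: application/json" \
  "https://mirth:8443/api/channels/{channel_id}/messages?status=ERROR&limit=5" \
  | python3 -c "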
import sys, json
msgs = json.loads(sys.stdin.read())
for m in msgs.get('messages', []):
    raw = m.get('rawData', '')
    print('--- Message ---')
    print(raw[:500])
    print()
"

Step 2: Identify the parsing failure pattern

Pattern	Cause	Fix
Missing MSH segment	Upstream sending partial messages (TCP fragmentation)	Increase MLLP receive timeout; enable message reassembly
Wrong field separator	Upstream changed the encoding characters	Update channel source encoding settings
Invalid date format in fields	Upstream sending non-HL7 date formats	Add a pre-processing transformer to normalize dates
Unexpected segment order	Upstream EHR version upgrade changed message structure	Update channel filter/transformer for new structure
Binary content in message	Base64-encoded PDF/image in OBX-5 not properly encoded	Add a binary content handler in the transformer
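
Several of these patterns can be spotted with a quick check of the MSH header before opening the message in a full parser. The following is a minimal sketch that covers only the first two rows of the table plus a missing message type. The `precheck_hl7` name and the sample messages are hypothetical, and it is not a substitute for real HL7 validation.

```python
from typing import List


def precheck_hl7(raw: str) -> List[str]:
    """Return problems found in the MSH header; an empty list means the header looks sane."""
    # Strip MLLP framing bytes if present (0x0B at the start, 0x1C 0x0D at the end).
    message = raw.strip("\x0b\x1c\r\n")
    if not message.startswith("MSH") or len(message) < 4:
        return ["Missing MSH segment"]

    problems = []
    field_sep = message[3]  # the character right after "MSH" defines the field separator
    if field_sep != "|":
        problems.append(f"Unexpected field separator {field_sep!r}")

    # Segments end with a carriage return; only the first segment (MSH) is inspected here.
    fields = message.split("\r", 1)[0].split(field_sep)
    # After splitting, fields[0] is "MSH" and fields[1] is MSH-2, so MSH-n is fields[n - 1].
    if not fields[1].startswith("^~\\&"):
        problems.append(f"Unexpected encoding characters {fields[1]!r}")
    if len(fields) < 9 or not fields[8]:
        problems.append("Missing message type in MSH-9")
    return problems


# Hypothetical messages for illustration.
good = "MSH|^~\\&|LAB|HOSP|EHR|HOSP|20260401024700||ORU^R01|MSG0001|P|2.5.1\rPID|1||12345\r"
partial = "PID|1||12345\r"
print(precheck_hl7(good), precheck_hl7(partial))
```

Running this over the sample failures from Step 1 quickly shows whether they share one cause or several.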

Step 3: Temporary mitigation

// Route failing messages to error queue for manual review
// Add this to the channel source transformer:

if (msg['MSH']['MSH.9']['MSH.9.1'].toString() === '') {
  // Message has no message type -- keep only the error-queue destination
  // ('error-queue' must match the name of a destination on this channel)
  destinationSet.removeAllExcept('error-queue');
  logger.error('Missing message type in MSH.9 -- routed to error queue');
}

// For bulk reprocessing after fix is deployed:
// Mirth API: POST /api/channels/{id}/messages/_reprocess
// with body containing message IDs to reprocess

Runbook 4 — DB Connection Pool Exhaustion

Alert Definition

- alert: DBConnectionPoolExhausted
  expr: |
    hikaricp_connections_active / hikaricp_connections_max * 100 > 90
  for: 5m
  labels:
    severity: P2
    service: database
    runbook: "https://wiki.internal/runbooks/db-pool-exhausted"
  annotations:
    summary: 'Connection pool at {{ $value | printf "%.0f" }}% capacity'

Diagnostic Steps

Step 1: Identify what is holding connections (2 minutes)

# Find long-running queries holding connections
psql -h db-host -U hapi -d hapi -c "
SELECT pid, 
       now() - pg_stat_activity.query_start AS duration,
       query,
       state,
       wait_event_type,
       wait_event
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
  AND state != 'idle'
ORDER BY duration DESC
LIMIT 10;
"

Step 2: Identify blocking locks

# Find blocked and blocking queries
psql -h db-host -U hapi -d hapi -c "
SELECT blocked_locks.pid     AS blocked_pid,
       blocked_activity.usename  AS blocked_user,
       blocking_locks.pid     AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query    AS blocked_statement,
       blocking_activity.query   AS blocking_statement
FROM  pg_catalog.pg_locks         blocked_locks
JOIN  pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
JOIN  pg_catalog.pg_locks         blocking_locks 
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.pid != blocked_locks.pid
JOIN  pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"

Step 3: Emergency relief

# Kill the longest-running non-idle queries (CAUTION: may lose in-progress transactions)
psql -h db-host -U hapi -d hapi -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
  AND state != 'idle'
  AND pid != pg_backend_pid();
"

# Verify connections freed
psql -h db-host -U hapi -d hapi -c "
SELECT count(*) as active_connections,
       (SELECT setting::int FROM pg_settings WHERE name='max_connections') as max_connections
FROM pg_stat_activity
WHERE state != 'idle';
"

Runbook 5 — Certificate Expiry

Alert Definition

- alert: CertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 1h
  labels:
    severity: P3
    service: certificates
    runbook: "https://wiki.internal/runbooks/cert-expiry"
  annotations:
    summary: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

- alert: CertificateExpiringCritical
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
  for: 10m
  labels:
    severity: P1
    service: certificates
    runbook: "https://wiki.internal/runbooks/cert-expiry"
  annotations:
    summary: "CRITICAL: Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

Certificate Inventory

Healthcare systems typically manage certificates for endpoints like the ones below. The script reports the days remaining for each:

#!/bin/bash
# Automated certificate check script
# (uses GNU date; `date -d` is not available on macOS/BSD)
ENDPOINTS=(
  "fhir.hospital.internal:443"
  "mirth.hospital.internal:8443"
  "auth.hospital.internal:443"
  "api.hospital.internal:443"
)

for endpoint in "${ENDPOINTS[@]}"; do
  EXPIRY=$(echo | openssl s_client -servername "${endpoint%%:*}" \
    -connect "$endpoint" 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
  DAYS=$(( ($(date -d "$EXPIRY" +%s) - $(date +%s)) / 86400 ))
  
  if [ "$DAYS" -lt 7 ]; then
    echo "CRITICAL: $endpoint expires in $DAYS days ($EXPIRY)"
  elif [ "$DAYS" -lt 30 ]; then
    echo "WARNING:  $endpoint expires in $DAYS days ($EXPIRY)"
  else
    echo "OK:       $endpoint expires in $DAYS days ($EXPIRY)"
  fi
done
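
Where GNU `date` is not available, the date arithmetic from the script above can be done in Python instead. This sketch assumes the expiry string has the format printed by `openssl x509 -noout -enddate` (after `notAfter=`), and it uses the same 7-day and 30-day thresholds. The function names are hypothetical.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> int:
    """Whole days between `now` (epoch seconds, default: current time) and the expiry."""
    expiry = ssl.cert_time_to_seconds(not_after)  # parses e.g. "Jan  6 00:00:00 2026 GMT"
    current = time.time() if now is None else now
    return int((expiry - current) // 86400)


def expiry_status(days: int) -> str:
    """Same thresholds as the shell script: under 7 days critical, under 30 days warning."""
    if days < 7:
        return "CRITICAL"
    if days < 30:
        return "WARNING"
    return "OK"


# Example with a fixed reference time so the result does not depend on today's date.
reference = ssl.cert_time_to_seconds("Jan  1 00:00:00 2026 GMT")
days = days_until_expiry("Jan  6 00:00:00 2026 GMT", now=reference)
print(days, expiry_status(days))
```

Passing `now` explicitly also makes the thresholds easy to unit test, which is harder to do with the shell version.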

Renewal Process

Generate a CSR with the same Subject Alternative Names (SANs) as the current certificate
Submit to CA (internal PKI or public CA like DigiCert)
Test the new certificate in the staging environment before production deployment
Deploy during maintenance window -- coordinate with clinical operations for any service restart
Verify that all dependent systems can connect with the new certificate
Update trust stores on all clients if the CA chain has changed

Runbook 6 — Message Queue Backup

Alert Definition

- alert: MessageQueueBackup
  expr: |
    rabbitmq_queue_messages > 5000 OR
    mirth_channel_queue_depth > 1000
  for: 15m
  labels:
    severity: P2
    service: message-queue
    runbook: "https://wiki.internal/runbooks/queue-backup"
  annotations:
    summary: "Message queue backup: {{ $value }} messages pending on {{ $labels.queue }}"

Diagnostic Steps

# Step 1: Identify which queue is backed up
rabbitmqctl list_queues name messages consumers | sort -k2 -rn | head -10

# Step 2: Check consumer health
rabbitmqctl list_consumers | grep -v "running"

# Step 3: Check message processing rate
# If consumers are connected but not processing, check for poison messages
rabbitmqctl list_queues name messages_ready messages_unacknowledged

# Step 4: If unacknowledged count is high, consumers are stuck
# Possible causes: downstream timeout, processing loop, resource exhaustion
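
Steps 3 and 4 come down to comparing two columns, which is easy to script once there are many queues. The sketch below assumes tab-separated output from `rabbitmqctl list_queues name messages_ready messages_unacknowledged`, with banner and header lines that should be ignored. The sample output and the `min_unacked` threshold are hypothetical.

```python
from typing import List, NamedTuple


class QueueStat(NamedTuple):
    name: str
    ready: int
    unacked: int


def parse_list_queues(output: str) -> List[QueueStat]:
    """Parse tab-separated rows, skipping banner and header lines."""
    stats = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue  # not a data row
        stats.append(QueueStat(parts[0], int(parts[1]), int(parts[2])))
    return stats


def stuck_queues(stats: List[QueueStat], min_unacked: int = 1000) -> List[str]:
    """Queues whose unacknowledged count is high, suggesting consumers hold messages without finishing."""
    return [s.name for s in stats if s.unacked >= min_unacked]


# Hypothetical command output for illustration.
SAMPLE_OUTPUT = (
    "Timeout: 60.0 seconds ...\n"
    "Listing queues for vhost / ...\n"
    "name\tmessages_ready\tmessages_unacknowledged\n"
    "lab-results\t12\t4800\n"
    "adt-feed\t350\t0\n"
)
print(stuck_queues(parse_list_queues(SAMPLE_OUTPUT)))
```

A queue with many ready messages but few unacknowledged ones points to missing or slow consumers rather than stuck ones, so the two cases call for different recovery steps.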

Recovery Procedure

# If consumers are stuck, restart consumer processes
sudo systemctl restart mirth-connect

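# If queue contains poison messages (messages that crash consumers):
# Move them to the dead letter exchange for analysis with a one-off dynamic shovel
# (assumes the rabbitmq_shovel plugin is enabled; moves the next 100 messages, then removes itself)
rabbitmqctl set_parameter shovel poison-drain \
  '{"src-protocol": "amqp091", "src-uri": "amqp://", "src-queue": "queue-name",
    "dest-protocol": "amqp091", "dest-uri": "amqp://", "dest-exchange": "dead-letter-exchange",
    "src-delete-after": 100}'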

# For large backlogs (10K+ messages), increase consumer concurrency temporarily
# In Mirth: increase channel thread count from default 1 to 5-10
# In RabbitMQ: add more consumer instances

OpsGenie Alternative Configuration

For teams using OpsGenie instead of PagerDuty, here is the equivalent alert routing configuration:

# OpsGenie team and routing configuration

teams:
  - name: "Healthcare Integration"
    members:
      - user: "oncall-engineer@hospital.org"
        role: "admin"
      - user: "integration-lead@hospital.org"
        role: "admin"
    
    routing_rules:
      - name: "Clinical P1 Alerts"
        conditions:
          - field: "tags"
            operation: "contains"
            expectedValue: "P1"
        notify:
          type: "escalation"
          name: "clinical-escalation"
        
      - name: "Operational P2 Alerts"
        conditions:
          - field: "tags"
            operation: "contains"
            expectedValue: "P2"
        time_restrictions:
          type: "weekday-and-time-of-day"
          restrictions:
            - startDay: "monday"
              endDay: "friday"
              startHour: 7
              endHour: 19
        notify:
          type: "schedule"
          name: "integration-oncall"

escalations:
  - name: "clinical-escalation"
    rules:
      - delay: 0
        notify:
          - type: "schedule"
            name: "integration-oncall"
      - delay: 10
        notify:
          - type: "user"
            name: "engineering-lead@hospital.org"
      - delay: 25
        notify:
          - type: "user"
            name: "vp-engineering@hospital.org"
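
The routing rules above are easier to reason about when written as a function. This is an illustrative sketch only, not OpsGenie's own evaluation logic. It assumes rules are tried in order, that `endHour` is exclusive, and that times are already in the team's local time zone.

```python
from datetime import datetime
from typing import Optional, Set


def route(tags: Set[str], when: datetime) -> Optional[str]:
    """Return the notification target for an alert, or None if no rule matches."""
    if "P1" in tags:
        return "escalation:clinical-escalation"  # clinical alerts route at any hour
    if "P2" in tags:
        weekday = when.weekday() < 5             # Monday=0 ... Friday=4
        in_hours = 7 <= when.hour < 19
        if weekday and in_hours:
            return "schedule:integration-oncall"
    return None


# 2026-04-01 is a Wednesday.
print(route({"P1"}, datetime(2026, 4, 1, 2, 47)))
print(route({"P2"}, datetime(2026, 4, 1, 2, 47)))
print(route({"P2"}, datetime(2026, 4, 1, 10, 0)))
```

The `None` result for an overnight P2 is worth noticing. The configuration above defines no rule for that case, so a team should decide explicitly whether such alerts are queued for the morning or dropped.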

Building a Runbook Culture

Writing the initial runbooks is the easy part. Keeping them accurate and useful requires discipline:

Every incident updates a runbook. After resolving any P1 or P2 incident, the on-call engineer updates the relevant runbook with what they learned. New failure modes get new runbooks.
Monthly runbook review. The interface team reviews all runbooks monthly, removing outdated procedures and verifying that alert thresholds still match current system behavior.
Runbook drill. Once per quarter, simulate each P1 alert and have a team member walk through the runbook. Time the response. If it takes longer than the target response time, simplify the runbook.
Link runbooks to alerts. Every PagerDuty/OpsGenie alert annotation must include a direct URL to the relevant runbook. An alert without a runbook link is an alert that will be responded to with guesswork.
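
The last rule is easy to enforce automatically, for example as a CI check on the alert rule files. The following is a minimal sketch that assumes the rules have already been loaded into dictionaries shaped like the alert definitions in this guide. The sample rules and the `alerts_missing_runbook` name are hypothetical.

```python
from typing import Dict, List


def alerts_missing_runbook(rules: List[Dict]) -> List[str]:
    """Names of alert rules whose labels lack an http(s) runbook URL."""
    missing = []
    for rule in rules:
        url = str(rule.get("labels", {}).get("runbook", ""))
        if not url.startswith(("http://", "https://")):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


# Hypothetical rules for illustration.
sample_rules = [
    {"alert": "ExampleWithRunbook",
     "labels": {"severity": "P1", "runbook": "https://wiki.internal/runbooks/example"}},
    {"alert": "ExampleWithoutRunbook", "labels": {"severity": "P2"}},
    {"alert": "ExampleWithPlaceholder", "labels": {"severity": "P3", "runbook": "TODO"}},
]
print(alerts_missing_runbook(sample_rules))
```

Failing the build when this list is non-empty keeps new alerts from shipping without a runbook.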

FAQ

How many runbooks should a healthcare integration team maintain?

Start with 6-8 runbooks covering the most common alert types (the ones in this guide). Expand as you encounter new failure modes. A mature team with 20+ integration channels typically maintains 15-20 runbooks.

Should runbooks be automated or manual?

Start manual, then automate the diagnostic steps. A runbook that automatically collects diagnostic data (logs, metrics, connection states) and presents it alongside the alert saves the most time. Automate remediation only for well-understood, low-risk actions like restarting a stopped Mirth channel. Never automate actions that could cause data loss without human approval.

How do I handle alerts for systems I do not own (e.g., vendor-hosted EHR)?

Create runbooks that focus on what you can control: verify the issue is not on your side (network, firewall, credentials), document the vendor's support contact and escalation path, capture diagnostic evidence (timestamps, error messages, network traces) to share with the vendor, and implement workarounds (queue messages for replay when the vendor system recovers).

What is the right on-call rotation for a healthcare integration team?

Weekly rotations with a primary and secondary on-call. The primary handles P1/P2 alerts. The secondary handles P3 if the primary is engaged with a higher-severity incident. Ensure at least 2 people can respond to any runbook -- single points of knowledge failure are as dangerous as single points of system failure. For teams under 4 engineers, consider a managed NOC for overnight coverage with escalation to the on-call engineer for P1 events.

From alerting to architecture, our Healthcare Software Product Development team builds healthcare platforms with production-grade observability baked in. We also offer specialized Healthcare Interoperability Solutions for FHIR and Mirth Connect environments. Talk to our team to get started.

Related guides

This article is part of our complete guide to Mirth Connect: What Is Mirth Connect? Complete 2026 Guide for Healthcare Leaders.

Need production-grade Mirth Connect help? See our Mirth Connect integration services.

Frequently Asked Questions

What alerts should page a human versus go to a daily digest?

Page on conditions that threaten clinical data flow: an interface down, sustained error-rate spikes, queue depth growing past threshold, and failed failover. Single message failures, brief latency blips, and capacity trends belong in digests. Alert fatigue is the failure mode that makes teams miss real incidents — every page should be actionable.

Why pair every alert with a runbook?

Because the on-call engineer at 3 AM is rarely the person who built the interface. A runbook turns an alert from "something is wrong" into a checklist: what to verify, what to restart, when to escalate, and what to communicate. Alerts without runbooks just relocate panic.

How do you avoid duplicate pages when one failure triggers many alerts?

Group related alerts in your incident tool so one underlying failure — say, a database slowdown affecting twenty channels — creates one incident, not twenty pages. Route by service (FHIR server, integration engine, EHR connectivity) and suppress downstream symptoms when an upstream cause is already paging.
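
The grouping logic can be sketched in a few lines. This is an illustration of the idea rather than how any particular incident tool implements it. The `upstream_of` dependency map is an assumption that a team would maintain by hand, and the alert names are borrowed from the runbooks in this guide.

```python
from collections import defaultdict
from typing import Dict, List


def group_alerts(alerts: List[Dict[str, str]], upstream_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Group firing alerts into incidents keyed by the top-most upstream service that is also alerting."""
    firing_services = {a["service"] for a in alerts}
    incidents = defaultdict(list)
    for alert in alerts:
        root = alert["service"]
        seen = {root}
        parent = upstream_of.get(root)
        # Walk up the dependency chain; stop on missing parents or cycles.
        while parent is not None and parent not in seen:
            if parent in firing_services:
                root = parent
            seen.add(parent)
            parent = upstream_of.get(parent)
        incidents[root].append(alert["name"])
    return dict(incidents)


upstream_of = {"mirth-connect": "database", "fhir-server": "database"}
alerts = [
    {"service": "database", "name": "DBConnectionPoolExhausted"},
    {"service": "mirth-connect", "name": "MirthChannelStopped"},
    {"service": "fhir-server", "name": "FhirServer5xxSpike"},
    {"service": "certificates", "name": "CertificateExpiringSoon"},
]
print(group_alerts(alerts, upstream_of))
```

With the database alert firing, the Mirth and FHIR alerts fold into one database incident. Without it, each service would open its own incident.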

What severity levels make sense for healthcare integration alerts?

Three tiers work in practice for paging decisions: P1 for clinical data not flowing (page immediately, any hour), P2 for degraded but flowing (page during business hours, otherwise next morning), P3 for warnings and trends (ticket, no page). The severity table earlier in this guide adds a fourth, email-only P4 tier for informational notices. Map each interface to a tier based on clinical impact, not technical convenience.

Who should be on call for integration infrastructure?

Engineers with access and context to act: integration engineers for channel-level issues, with a documented escalation path to database and infrastructure owners. An on-call rotation needs at least three people to be sustainable — see our integration team playbook for staffing models.
