On-Call for Healthcare IT: Rotation Design, Escalation Policies, and Burnout Prevention

Q: What's the minimum team size for sustainable healthcare on-call?

Four engineers minimum for a 1-in-4 rotation. With three engineers, you're at 1-in-3, which research shows leads to burnout within 6-12 months. If you can't hire a fourth, combine on-call pools across related teams or engage a managed NOC for first-level triage. Your alert fatigue reduction efforts directly impact how many engineers you need — fewer alerts means less on-call burden per rotatio

Q: Should on-call engineers also do sprint work during their on-call week?

Yes, but reduce their sprint commitment by 50%. On-call engineers should work on low-priority, interruptible tasks: documentation updates, runbook maintenance, tech debt cleanup, alert tuning. Never assign them to critical-path sprint stories that block other team members if interrupted. Some teams use the on-call week for "improvement work" — fixing the issues that cause on-call pages, which creates a virtuous cycl

Q: How do we handle on-call during holidays?

Holiday on-call requires premium compensation (2-3x the normal on-call stipend is standard). Allow engineers to volunteer for holiday coverage first; many are willing if the compensation is right. If no volunteers, rotate holidays fairly across the team year-over-year. Track holiday on-call assignments to ensure no one is consistently stuck with the same holidays. Some teams offer "holiday swap" — work Christmas this year, guaranteed Thanksgiving off next yea

Q: What do we do when the on-call engineer is out sick?

The secondary on-call automatically becomes primary. The engineering manager takes secondary until someone from the pool can volunteer. This is another reason the 1-in-4 minimum matters: with 4 people, one being sick still leaves 3 to cover. Have a documented "emergency coverage" process that doesn't require the sick engineer to find their own replacemen

Q: How should we handle cross-team incidents that span clinical IT and infrastructure?

Establish a "bridge" protocol: both on-call engineers join a shared incident channel. Designate one as incident commander based on where the root cause appears to be. If unclear, the infrastructure on-call takes command (most cross-team incidents have an infrastructure root cause). For building the incident management framework that coordinates these cross-team responses, see our Healthcare Incident Management guide, and for testing your cross-team resilience, our Chaos Engineering for Healthcar

April 23, 2026

13 min read

Healthcare

Healthcare IT doesn't get nights off. When a hospital runs 24/7/365, the technology that supports it must too. But the engineers who maintain that technology are human. They need sleep, they need weekends, and they burn out when on-call practices are designed around the system's needs without considering theirs.

This guide covers on-call rotation design, escalation policies, compensation models, and burnout prevention strategies specifically for healthcare IT teams. The goal: reliable 24/7 coverage that doesn't destroy the humans providing it.

Why Healthcare On-Call Is Different

Most SRE on-call guides assume you can deprioritize after business hours. E-commerce slows at midnight. SaaS products have maintenance windows. Healthcare has neither. Emergency departments are busiest at night. Lab systems process overnight batch runs. Clinical staff access the EHR at all hours.

Healthcare on-call differs in three key ways:

No true off-peak. While patient volume ebbs overnight, critical systems must remain available. A lab result delay at 3 AM affects a patient in the ICU just as much as one at 3 PM.
Clinical knowledge requirements. The on-call engineer needs to understand not just the technology but the clinical workflow impact. Knowing the FHIR server is slow is insufficient; knowing that slow FHIR queries delay medication reconciliation in the ED is essential.
Regulatory consequences. EHR downtime isn't just a service disruption; it triggers HIPAA assessment requirements if PHI access is affected and may require regulatory reporting if the downtime exceeds certain thresholds.

Rotation Design That Works

The Primary + Secondary Model

Every on-call rotation needs at minimum two layers: a primary responder and a secondary (backup) responder. For healthcare, add a third layer: the engineering manager as escalation backstop.

Role	Responsibility	Response SLA	Escalation Trigger
Primary On-Call	First responder for all pages. Triages, investigates, resolves or escalates.	Acknowledge within 5 min	Auto-escalate to secondary if no ACK in 5 min
Secondary On-Call	Backup responder. Available if primary doesn't acknowledge or needs help.	Acknowledge within 10 min	Auto-escalate to manager if no ACK in 15 min total
Manager Escalation	Ensures someone is engaged. Coordinates cross-team if needed. Doesn't debug — manages.	Acknowledge within 15 min	CIO notification if P1 unresolved after 30 min

Weekly Rotations, Not Daily

Daily rotations create context-switching overhead. The engineer who was on-call last night doesn't carry the context to the next on-call engineer about the flaky Mirth channel they nursed through the evening. Weekly rotations provide continuity:

Monday 9 AM to Monday 9 AM — clean handoff during business hours when both engineers are awake and available for context transfer.
30-minute handoff meeting — outgoing on-call briefs incoming on-call: open incidents, systems under watch, known risks for the coming week (scheduled deployments, certificate expirations, vendor maintenance windows).
Handoff document — standardized template covering: current incidents, systems on watch, upcoming changes, vendor contacts needed.

# PagerDuty on-call rotation (Terraform)
resource "pagerduty_schedule" "clinical_platform" {
  name      = "Clinical Platform On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Primary"
    start                        = "2026-01-06T09:00:00-05:00"
    rotation_virtual_start       = "2026-01-06T09:00:00-05:00"
    rotation_turn_length_seconds = 604800  # 7 days
    users = [
      pagerduty_user.engineer_a.id,
      pagerduty_user.engineer_b.id,
      pagerduty_user.engineer_c.id,
      pagerduty_user.engineer_d.id,
    ]
  }

  layer {
    name                         = "Secondary"
    start                        = "2026-01-06T09:00:00-05:00"
    rotation_virtual_start       = "2026-01-13T09:00:00-05:00"  # Offset by 1 week
    rotation_turn_length_seconds = 604800
    users = [
      pagerduty_user.engineer_b.id,
      pagerduty_user.engineer_c.id,
      pagerduty_user.engineer_d.id,
      pagerduty_user.engineer_a.id,
    ]
  }
}

# Escalation policy
resource "pagerduty_escalation_policy" "clinical_platform" {
  name      = "Clinical Platform Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.clinical_platform.id  # Primary layer
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.clinical_platform.id  # Secondary layer
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.eng_manager.id
    }
  }
}

Follow-the-Sun for Multi-Timezone Organizations

For health systems with teams across multiple timezones (or healthcare technology companies with distributed engineering), follow-the-sun eliminates overnight pages entirely:

Timezone	Coverage Hours	Team	Handoff
US East (ET)	7 AM - 3 PM ET	Boston/NYC team	3 PM ET handoff to US West
US West (PT)	12 PM - 8 PM PT (3-11 PM ET)	Seattle/SF team	8 PM PT handoff to APAC
Asia Pacific (IST/SGT)	8 AM - 4 PM IST (10:30 PM - 6:30 AM ET)	India/Singapore team	4 PM IST handoff to US East

The key requirement for follow-the-sun: each team must have the same access, runbooks, and authority. If the APAC team can't restart a Mirth channel or roll back a deployment, follow-the-sun creates a secondary problem: the night team pages the day team anyway.

Escalation Policies by Severity

Different severities need different escalation behaviors. A P4 cosmetic issue should never wake anyone up. A P1 patient-safety incident should reach the incident commander within minutes.

# Escalation policy YAML (for documentation/configuration)
# Adapt to your PagerDuty/OpsGenie configuration

escalation_policies:
  P1_critical:
    description: "Patient safety impact — core clinical systems down"
    steps:
      - action: page_immediately
        target: primary_on_call
        channels: [sms, phone_call, push_notification]
      - wait: 5_minutes
        if_no_ack: page_secondary
        channels: [sms, phone_call, push_notification]
      - wait: 10_minutes
        if_no_ack: page_manager
        channels: [phone_call]
      - wait: 15_minutes
        if_no_ack: page_cio
        channels: [phone_call]
    parallel_actions:
      - create_slack_channel: "inc-{date}-{short_desc}"
      - notify_clinical_it_lead: true
      - update_help_desk_script: true
    auto_bridge: true  # Auto-open conference bridge

  P2_high:
    description: "Clinical workflow disruption — workaround available"
    steps:
      - action: page_primary
        channels: [push_notification, sms]
      - wait: 15_minutes
        if_no_ack: page_secondary
        channels: [sms, phone_call]
      - wait: 30_minutes
        if_no_ack: page_manager
        channels: [phone_call]
    parallel_actions:
      - notify_slack: "#clinical-platform-alerts"

  P3_medium:
    description: "Degraded performance — no direct patient impact"
    steps:
      - action: notify_slack
        channel: "#clinical-platform-alerts"
      - wait: 2_hours
        if_no_ack: page_primary
        channels: [push_notification]
    business_hours_only: false  # Still notify 24/7, but don't page

  P4_low:
    description: "Minor/cosmetic — no workflow impact"
    steps:
      - action: create_ticket
        queue: "clinical-platform-backlog"
      - action: notify_slack
        channel: "#clinical-platform-general"
    business_hours_only: true  # Batch for morning review

Compensation: Paying for the Burden

On-call without compensation breeds resentment. Engineers who feel they're giving free labor to the hospital will leave for companies that don't require on-call. Healthcare IT already faces talent retention challenges; underpaying on-call makes it worse.

Compensation Models

Model	How It Works	Pros	Cons
Flat weekly stipend	$500-1,500/week for carrying the pager	Predictable cost; simple to administer	Doesn't differentiate quiet weeks from nightmare weeks
Per-incident bonus	$100-300 per incident handled outside business hours	Compensates proportionally to disruption	Creates perverse incentive to not fix underlying issues
Comp time	1 day off for each week on-call; additional day if paged > 3 times at night	Promotes recovery; prevents cumulative fatigue	Team must be large enough to absorb the absence
Hybrid	$500/week base + $150/night incident + 1 comp day	Balances predictability with proportionality	More complex to administer

The hybrid model is the recommendation for healthcare IT teams. The base stipend recognizes the burden of carrying the pager (even quiet weeks restrict your freedom). The per-incident bonus compensates for actual sleep disruption. The comp day enforces recovery.

Burnout Prevention: The Numbers That Matter

On-call burnout isn't caused by a single bad night. It's caused by accumulated sleep debt, lack of recovery time, and the feeling that the pager burden is unfair or unmanageable. Track these metrics to detect burnout before it causes attrition:

Key Burnout Indicators

Metric	Healthy	Warning	Critical
Pages per on-call night	0-2	3-5	> 5 (fix underlying issues)
On-call frequency	1 week in 4+	1 week in 3	1 week in 2 (hire or restructure)
Sleep interruptions per week	0-1	2-3	> 3 (alert fatigue or system instability)
Mean time between pages (overnight)	> 4 hours	2-4 hours	< 2 hours (clustering indicates single root cause)
On-call NPS (team survey)	> +20	0 to +20	< 0 (morale problem — act immediately)

The 1-in-4 Rule

No engineer should be on-call more than one week in four. This means your on-call pool needs at minimum 4 qualified engineers. With a team of 6-8, you achieve 1-in-6 or 1-in-8 rotation, which is sustainable long-term.

If your team has only 3 engineers, you cannot run a healthy on-call rotation. Options:

Hire a 4th engineer — the cost of the hire is less than the cost of the attrition that 1-in-3 rotation causes.
Share on-call across teams — combine clinical platform and integration team into a single on-call pool with shared runbooks.
Use a managed service — outsource first-level triage to a NOC that follows your runbooks and escalates to your team only when engineering judgment is needed.

Toil Budgets

Google's SRE model caps toil at 50% of engineering time. For on-call, this translates to: if more than half of on-call pages require repetitive manual actions (restart channel, clear queue, rotate certificate), those actions must be automated.

# Toil tracking template (Python/Pandas)
import pandas as pd
from datetime import datetime

# Monthly toil report
toil_data = [
    {"date": "2026-03-01", "incident": "Mirth ADT channel stopped", "action": "Restart channel", "toil": True, "minutes": 15, "automatable": True},
    {"date": "2026-03-03", "incident": "FHIR server OOM", "action": "Scale pods + investigate", "toil": False, "minutes": 45, "automatable": False},
    {"date": "2026-03-05", "incident": "Certificate expiry warning", "action": "Rotate cert manually", "toil": True, "minutes": 30, "automatable": True},
    {"date": "2026-03-07", "incident": "Mirth ADT channel stopped", "action": "Restart channel", "toil": True, "minutes": 15, "automatable": True},
    {"date": "2026-03-10", "incident": "DB connection pool full", "action": "Kill idle connections", "toil": True, "minutes": 20, "automatable": True},
    {"date": "2026-03-12", "incident": "New integration mapping issue", "action": "Debug + fix mapping", "toil": False, "minutes": 90, "automatable": False},
]

df = pd.DataFrame(toil_data)

toil_pct = df[df['toil']]['minutes'].sum() / df['minutes'].sum() * 100
automatable_minutes = df[df['automatable']]['minutes'].sum()

print(f"Toil percentage: {toil_pct:.0f}%")
print(f"Automatable toil: {automatable_minutes} minutes/month")
print(f"Top toil sources:")
for incident, group in df[df['toil']].groupby('incident'):
    print(f"  {incident}: {len(group)} occurrences, {group['minutes'].sum()} min total")

Clinical IT vs Infrastructure On-Call

Larger health systems split on-call into two tracks. Understanding the boundary prevents misrouted pages:

Domain	Clinical IT On-Call	Infrastructure On-Call
Scope	EHR application issues, clinical workflows, interface mappings, user access	Servers, networks, databases, Kubernetes, cloud platform, security
Common pages	Mirth channel stopped, FHIR query errors, clinical report failures, interface mapping issues	Node unhealthy, disk full, database replica lag, certificate expiry, network partition
Required knowledge	HL7v2/FHIR standards, clinical workflows, EHR configuration, Mirth Connect	Linux, Kubernetes, PostgreSQL, networking, cloud services, security
Vendor escalation	Epic/Oracle Health support, interface vendor support	Cloud provider support, hardware vendor support

The critical handoff point: when a Mirth channel stops because the downstream server is unreachable, clinical IT identifies the symptom (Mirth error) and infrastructure on-call investigates the cause (network issue, server down). Establish clear escalation paths between the two tracks.

EHR Vendor Support Escalation

Your on-call engineer will sometimes need vendor support. Pre-negotiate and document these escalation paths before the 3 AM incident.

Epic Support Tiers

Tier	Access Method	Response SLA	When to Use
Tier 1 — Support Web	Epic UserWeb ticket	Next business day	Configuration questions, non-urgent bugs
Tier 2 — Priority	Phone: 608-271-9000	4 hours	Production issue affecting clinical workflows
Tier 3 — Emergency	Emergency hotline (per contract)	1 hour	System-wide outage, patient safety impact

Oracle Health (Cerner) Support Tiers

Tier	Access Method	Response SLA	When to Use
Standard	uCern portal ticket	Next business day	Configuration, non-urgent issues
High	Phone: 888-274-3672	4 hours	Production workflow disruption
Critical	Critical escalation line (per contract)	1 hour	System-wide outage, data integrity risk

Store these contact details in your runbook repository. During a P1 incident, your on-call engineer should open a vendor case within 5 minutes, not spend 15 minutes searching for a phone number.

On-Call Health Scorecard

Track on-call health monthly and share results with engineering leadership. This scorecard makes on-call burden visible and creates accountability for improvement.

# On-call health scorecard generator
import json
from datetime import datetime

def generate_scorecard(month_data: dict) -> dict:
    """
    Generate monthly on-call health scorecard.

    month_data keys:
      - total_pages: int
      - overnight_pages: int (10 PM - 7 AM)
      - p1_incidents: int
      - mtta_minutes: float (mean time to acknowledge)
      - mttr_minutes: float (mean time to resolve)
      - toil_percentage: float
      - engineer_nps: int (-100 to 100)
      - rotation_frequency: str (e.g., "1 in 4")
      - pages_per_night: float (average)
    """
    scorecard = {
        "month": month_data.get("month", datetime.now().strftime("%Y-%m")),
        "metrics": {
            "pages_per_night": {
                "value": month_data["pages_per_night"],
                "target": "< 2",
                "status": "green" if month_data["pages_per_night"] < 2
                          else "yellow" if month_data["pages_per_night"] < 5
                          else "red"
            },
            "mtta": {
                "value": f"{month_data['mtta_minutes']:.1f} min",
                "target": "< 5 min",
                "status": "green" if month_data["mtta_minutes"] < 5
                          else "yellow" if month_data["mtta_minutes"] < 15
                          else "red"
            },
            "engineer_nps": {
                "value": month_data["engineer_nps"],
                "target": "> +20",
                "status": "green" if month_data["engineer_nps"] > 20
                          else "yellow" if month_data["engineer_nps"] > 0
                          else "red"
            },
            "toil_percentage": {
                "value": f"{month_data['toil_percentage']:.0f}%",
                "target": "< 50%",
                "status": "green" if month_data["toil_percentage"] < 30
                          else "yellow" if month_data["toil_percentage"] < 50
                          else "red"
            }
        },
        "action_items": []
    }

    # Auto-generate action items from red/yellow metrics
    for metric_name, metric in scorecard["metrics"].items():
        if metric["status"] == "red":
            scorecard["action_items"].append(
                f"URGENT: {metric_name} is {metric['value']} (target: {metric['target']})"
            )
        elif metric["status"] == "yellow":
            scorecard["action_items"].append(
                f"WATCH: {metric_name} is {metric['value']} (target: {metric['target']})"
            )

    return scorecard

# Example usage
result = generate_scorecard({
    "month": "2026-03",
    "total_pages": 87,
    "overnight_pages": 23,
    "p1_incidents": 1,
    "mtta_minutes": 3.2,
    "mttr_minutes": 28.0,
    "toil_percentage": 42,
    "engineer_nps": 35,
    "rotation_frequency": "1 in 5",
    "pages_per_night": 1.1
})
print(json.dumps(result, indent=2))

Frequently Asked Questions

What's the minimum team size for sustainable healthcare on-call?

Four engineers minimum for a 1-in-4 rotation. With three engineers, you're at 1-in-3, which research shows leads to burnout within 6-12 months. If you can't hire a fourth, combine on-call pools across related teams or engage a managed NOC for first-level triage. Your alert fatigue reduction efforts directly impact how many engineers you need — fewer alerts means less on-call burden per rotation.

Should on-call engineers also do sprint work during their on-call week?

Yes, but reduce their sprint commitment by 50%. On-call engineers should work on low-priority, interruptible tasks: documentation updates, runbook maintenance, tech debt cleanup, alert tuning. Never assign them to critical-path sprint stories that block other team members if interrupted. Some teams use the on-call week for "improvement work" — fixing the issues that cause on-call pages, which creates a virtuous cycle.

How do we handle on-call during holidays?

Holiday on-call requires premium compensation (2-3x the normal on-call stipend is standard). Allow engineers to volunteer for holiday coverage first; many are willing if the compensation is right. If no volunteers, rotate holidays fairly across the team year-over-year. Track holiday on-call assignments to ensure no one is consistently stuck with the same holidays. Some teams offer "holiday swap" — work Christmas this year, guaranteed Thanksgiving off next year.

What do we do when the on-call engineer is out sick?

The secondary on-call automatically becomes primary. The engineering manager takes secondary until someone from the pool can volunteer. This is another reason the 1-in-4 minimum matters: with 4 people, one being sick still leaves 3 to cover. Have a documented "emergency coverage" process that doesn't require the sick engineer to find their own replacement.

How should we handle cross-team incidents that span clinical IT and infrastructure?

Establish a "bridge" protocol: both on-call engineers join a shared incident channel. Designate one as incident commander based on where the root cause appears to be. If unclear, the infrastructure on-call takes command (most cross-team incidents have an infrastructure root cause). For building the incident management framework that coordinates these cross-team responses, see our Healthcare Incident Management guide, and for testing your cross-team resilience, our Chaos Engineering for Healthcare guide covers game day exercises.

Conclusion

On-call is a necessary part of running healthcare IT systems. But it doesn't have to be a sentence. Well-designed rotations, clear escalation policies, fair compensation, and vigilant burnout monitoring transform on-call from a dreaded burden into a manageable aspect of the job. The engineers who built and maintain your clinical systems are your most valuable asset. Protect them from the pager, and they'll protect the systems that protect patients.

Start by measuring where you are today: pages per night, rotation frequency, engineer NPS. Then work systematically through the improvements in this guide. Within a quarter, your on-call will be demonstrably healthier, and your retention will reflect it. For the post-incident learning process that closes the loop on every on-call incident, see our guide on Post-Incident Reviews for Healthcare Systems.

Loading blogs...

On-Call for Healthcare IT: Rotation Design, Escalation Policies, and Burnout Prevention

April 23, 2026

13 min read

Healthcare

Why Healthcare On-Call Is Different

Healthcare on-call differs in three key ways:

No true off-peak. While patient volume ebbs overnight, critical systems must remain available. A lab result delay at 3 AM affects a patient in the ICU just as much as one at 3 PM.
Clinical knowledge requirements. The on-call engineer needs to understand not just the technology but the clinical workflow impact. Knowing the FHIR server is slow is insufficient; knowing that slow FHIR queries delay medication reconciliation in the ED is essential.
Regulatory consequences. EHR downtime isn't just a service disruption; it triggers HIPAA assessment requirements if PHI access is affected and may require regulatory reporting if the downtime exceeds certain thresholds.

Rotation Design That Works

The Primary + Secondary Model

Every on-call rotation needs at minimum two layers: a primary responder and a secondary (backup) responder. For healthcare, add a third layer: the engineering manager as escalation backstop.

Role	Responsibility	Response SLA	Escalation Trigger
Primary On-Call	First responder for all pages. Triages, investigates, resolves or escalates.	Acknowledge within 5 min	Auto-escalate to secondary if no ACK in 5 min
Secondary On-Call	Backup responder. Available if primary doesn't acknowledge or needs help.	Acknowledge within 10 min	Auto-escalate to manager if no ACK in 15 min total
Manager Escalation	Ensures someone is engaged. Coordinates cross-team if needed. Doesn't debug — manages.	Acknowledge within 15 min	CIO notification if P1 unresolved after 30 min

Weekly Rotations, Not Daily

Monday 9 AM to Monday 9 AM — clean handoff during business hours when both engineers are awake and available for context transfer.
30-minute handoff meeting — outgoing on-call briefs incoming on-call: open incidents, systems under watch, known risks for the coming week (scheduled deployments, certificate expirations, vendor maintenance windows).
Handoff document — standardized template covering: current incidents, systems on watch, upcoming changes, vendor contacts needed.

# PagerDuty on-call rotation (Terraform)
resource "pagerduty_schedule" "clinical_platform" {
  name      = "Clinical Platform On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Primary"
    start                        = "2026-01-06T09:00:00-05:00"
    rotation_virtual_start       = "2026-01-06T09:00:00-05:00"
    rotation_turn_length_seconds = 604800  # 7 days
    users = [
      pagerduty_user.engineer_a.id,
      pagerduty_user.engineer_b.id,
      pagerduty_user.engineer_c.id,
      pagerduty_user.engineer_d.id,
    ]
  }

  layer {
    name                         = "Secondary"
    start                        = "2026-01-06T09:00:00-05:00"
    rotation_virtual_start       = "2026-01-13T09:00:00-05:00"  # Offset by 1 week
    rotation_turn_length_seconds = 604800
    users = [
      pagerduty_user.engineer_b.id,
      pagerduty_user.engineer_c.id,
      pagerduty_user.engineer_d.id,
      pagerduty_user.engineer_a.id,
    ]
  }
}

# Escalation policy
resource "pagerduty_escalation_policy" "clinical_platform" {
  name      = "Clinical Platform Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.clinical_platform.id  # Primary layer
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.clinical_platform.id  # Secondary layer
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.eng_manager.id
    }
  }
}

Follow-the-Sun for Multi-Timezone Organizations

For health systems with teams across multiple timezones (or healthcare technology companies with distributed engineering), follow-the-sun eliminates overnight pages entirely:

Timezone	Coverage Hours	Team	Handoff
US East (ET)	7 AM - 3 PM ET	Boston/NYC team	3 PM ET handoff to US West
US West (PT)	12 PM - 8 PM PT (3-11 PM ET)	Seattle/SF team	8 PM PT handoff to APAC
Asia Pacific (IST/SGT)	8 AM - 4 PM IST (10:30 PM - 6:30 AM ET)	India/Singapore team	4 PM IST handoff to US East

Escalation Policies by Severity

Different severities need different escalation behaviors. A P4 cosmetic issue should never wake anyone up. A P1 patient-safety incident should reach the incident commander within minutes.

# Escalation policy YAML (for documentation/configuration)
# Adapt to your PagerDuty/OpsGenie configuration

escalation_policies:
  P1_critical:
    description: "Patient safety impact — core clinical systems down"
    steps:
      - action: page_immediately
        target: primary_on_call
        channels: [sms, phone_call, push_notification]
      - wait: 5_minutes
        if_no_ack: page_secondary
        channels: [sms, phone_call, push_notification]
      - wait: 10_minutes
        if_no_ack: page_manager
        channels: [phone_call]
      - wait: 15_minutes
        if_no_ack: page_cio
        channels: [phone_call]
    parallel_actions:
      - create_slack_channel: "inc-{date}-{short_desc}"
      - notify_clinical_it_lead: true
      - update_help_desk_script: true
    auto_bridge: true  # Auto-open conference bridge

  P2_high:
    description: "Clinical workflow disruption — workaround available"
    steps:
      - action: page_primary
        channels: [push_notification, sms]
      - wait: 15_minutes
        if_no_ack: page_secondary
        channels: [sms, phone_call]
      - wait: 30_minutes
        if_no_ack: page_manager
        channels: [phone_call]
    parallel_actions:
      - notify_slack: "#clinical-platform-alerts"

  P3_medium:
    description: "Degraded performance — no direct patient impact"
    steps:
      - action: notify_slack
        channel: "#clinical-platform-alerts"
      - wait: 2_hours
        if_no_ack: page_primary
        channels: [push_notification]
    business_hours_only: false  # Still notify 24/7, but don't page

  P4_low:
    description: "Minor/cosmetic — no workflow impact"
    steps:
      - action: create_ticket
        queue: "clinical-platform-backlog"
      - action: notify_slack
        channel: "#clinical-platform-general"
    business_hours_only: true  # Batch for morning review

Compensation: Paying for the Burden

Compensation Models

Model	How It Works	Pros	Cons
Flat weekly stipend	$500-1,500/week for carrying the pager	Predictable cost; simple to administer	Doesn't differentiate quiet weeks from nightmare weeks
Per-incident bonus	$100-300 per incident handled outside business hours	Compensates proportionally to disruption	Creates perverse incentive to not fix underlying issues
Comp time	1 day off for each week on-call; additional day if paged > 3 times at night	Promotes recovery; prevents cumulative fatigue	Team must be large enough to absorb the absence
Hybrid	$500/week base + $150/night incident + 1 comp day	Balances predictability with proportionality	More complex to administer

Burnout Prevention: The Numbers That Matter

Key Burnout Indicators

Metric	Healthy	Warning	Critical
Pages per on-call night	0-2	3-5	> 5 (fix underlying issues)
On-call frequency	1 week in 4+	1 week in 3	1 week in 2 (hire or restructure)
Sleep interruptions per week	0-1	2-3	> 3 (alert fatigue or system instability)
Mean time between pages (overnight)	> 4 hours	2-4 hours	< 2 hours (clustering indicates single root cause)
On-call NPS (team survey)	> +20	0 to +20	< 0 (morale problem — act immediately)

The 1-in-4 Rule

If your team has only 3 engineers, you cannot run a healthy on-call rotation. Options:

Hire a 4th engineer — the cost of the hire is less than the cost of the attrition that 1-in-3 rotation causes.
Share on-call across teams — combine clinical platform and integration team into a single on-call pool with shared runbooks.
Use a managed service — outsource first-level triage to a NOC that follows your runbooks and escalates to your team only when engineering judgment is needed.

Toil Budgets

# Toil tracking template (Python/Pandas)
import pandas as pd
from datetime import datetime

# Monthly toil report
toil_data = [
    {"date": "2026-03-01", "incident": "Mirth ADT channel stopped", "action": "Restart channel", "toil": True, "minutes": 15, "automatable": True},
    {"date": "2026-03-03", "incident": "FHIR server OOM", "action": "Scale pods + investigate", "toil": False, "minutes": 45, "automatable": False},
    {"date": "2026-03-05", "incident": "Certificate expiry warning", "action": "Rotate cert manually", "toil": True, "minutes": 30, "automatable": True},
    {"date": "2026-03-07", "incident": "Mirth ADT channel stopped", "action": "Restart channel", "toil": True, "minutes": 15, "automatable": True},
    {"date": "2026-03-10", "incident": "DB connection pool full", "action": "Kill idle connections", "toil": True, "minutes": 20, "automatable": True},
    {"date": "2026-03-12", "incident": "New integration mapping issue", "action": "Debug + fix mapping", "toil": False, "minutes": 90, "automatable": False},
]

df = pd.DataFrame(toil_data)

toil_pct = df[df['toil']]['minutes'].sum() / df['minutes'].sum() * 100
automatable_minutes = df[df['automatable']]['minutes'].sum()

print(f"Toil percentage: {toil_pct:.0f}%")
print(f"Automatable toil: {automatable_minutes} minutes/month")
print(f"Top toil sources:")
for incident, group in df[df['toil']].groupby('incident'):
    print(f"  {incident}: {len(group)} occurrences, {group['minutes'].sum()} min total")

Clinical IT vs Infrastructure On-Call

Larger health systems split on-call into two tracks. Understanding the boundary prevents misrouted pages:

Domain	Clinical IT On-Call	Infrastructure On-Call
Scope	EHR application issues, clinical workflows, interface mappings, user access	Servers, networks, databases, Kubernetes, cloud platform, security
Common pages	Mirth channel stopped, FHIR query errors, clinical report failures, interface mapping issues	Node unhealthy, disk full, database replica lag, certificate expiry, network partition
Required knowledge	HL7v2/FHIR standards, clinical workflows, EHR configuration, Mirth Connect	Linux, Kubernetes, PostgreSQL, networking, cloud services, security
Vendor escalation	Epic/Oracle Health support, interface vendor support	Cloud provider support, hardware vendor support

EHR Vendor Support Escalation

Your on-call engineer will sometimes need vendor support. Pre-negotiate and document these escalation paths before the 3 AM incident.

Epic Support Tiers

Tier	Access Method	Response SLA	When to Use
Tier 1 — Support Web	Epic UserWeb ticket	Next business day	Configuration questions, non-urgent bugs
Tier 2 — Priority	Phone: 608-271-9000	4 hours	Production issue affecting clinical workflows
Tier 3 — Emergency	Emergency hotline (per contract)	1 hour	System-wide outage, patient safety impact

Oracle Health (Cerner) Support Tiers

Tier	Access Method	Response SLA	When to Use
Standard	uCern portal ticket	Next business day	Configuration, non-urgent issues
High	Phone: 888-274-3672	4 hours	Production workflow disruption
Critical	Critical escalation line (per contract)	1 hour	System-wide outage, data integrity risk

Store these contact details in your runbook repository. During a P1 incident, your on-call engineer should open a vendor case within 5 minutes, not spend 15 minutes searching for a phone number.

On-Call Health Scorecard

Track on-call health monthly and share results with engineering leadership. This scorecard makes on-call burden visible and creates accountability for improvement.

# On-call health scorecard generator
import json
from datetime import datetime

def generate_scorecard(month_data: dict) -> dict:
    """
    Generate monthly on-call health scorecard.

    month_data keys:
      - total_pages: int
      - overnight_pages: int (10 PM - 7 AM)
      - p1_incidents: int
      - mtta_minutes: float (mean time to acknowledge)
      - mttr_minutes: float (mean time to resolve)
      - toil_percentage: float
      - engineer_nps: int (-100 to 100)
      - rotation_frequency: str (e.g., "1 in 4")
      - pages_per_night: float (average)
    """
    scorecard = {
        "month": month_data.get("month", datetime.now().strftime("%Y-%m")),
        "metrics": {
            "pages_per_night": {
                "value": month_data["pages_per_night"],
                "target": "< 2",
                "status": "green" if month_data["pages_per_night"] < 2
                          else "yellow" if month_data["pages_per_night"] < 5
                          else "red"
            },
            "mtta": {
                "value": f"{month_data['mtta_minutes']:.1f} min",
                "target": "< 5 min",
                "status": "green" if month_data["mtta_minutes"] < 5
                          else "yellow" if month_data["mtta_minutes"] < 15
                          else "red"
            },
            "engineer_nps": {
                "value": month_data["engineer_nps"],
                "target": "> +20",
                "status": "green" if month_data["engineer_nps"] > 20
                          else "yellow" if month_data["engineer_nps"] > 0
                          else "red"
            },
            "toil_percentage": {
                "value": f"{month_data['toil_percentage']:.0f}%",
                "target": "< 50%",
                "status": "green" if month_data["toil_percentage"] < 30
                          else "yellow" if month_data["toil_percentage"] < 50
                          else "red"
            }
        },
        "action_items": []
    }

    # Auto-generate action items from red/yellow metrics
    for metric_name, metric in scorecard["metrics"].items():
        if metric["status"] == "red":
            scorecard["action_items"].append(
                f"URGENT: {metric_name} is {metric['value']} (target: {metric['target']})"
            )
        elif metric["status"] == "yellow":
            scorecard["action_items"].append(
                f"WATCH: {metric_name} is {metric['value']} (target: {metric['target']})"
            )

    return scorecard

# Example usage
result = generate_scorecard({
    "month": "2026-03",
    "total_pages": 87,
    "overnight_pages": 23,
    "p1_incidents": 1,
    "mtta_minutes": 3.2,
    "mttr_minutes": 28.0,
    "toil_percentage": 42,
    "engineer_nps": 35,
    "rotation_frequency": "1 in 5",
    "pages_per_night": 1.1
})
print(json.dumps(result, indent=2))

On-Call for Healthcare IT: Rotation Design, Escalation Policies, and Burnout Prevention

Why Healthcare On-Call Is Different

Rotation Design That Works

The Primary + Secondary Model

Weekly Rotations, Not Daily

Follow-the-Sun for Multi-Timezone Organizations

Escalation Policies by Severity

Compensation: Paying for the Burden

Compensation Models

Burnout Prevention: The Numbers That Matter

Key Burnout Indicators

The 1-in-4 Rule

Toil Budgets

Clinical IT vs Infrastructure On-Call

EHR Vendor Support Escalation

Epic Support Tiers

Oracle Health (Cerner) Support Tiers

On-Call Health Scorecard

Frequently Asked Questions

What's the minimum team size for sustainable healthcare on-call?

Should on-call engineers also do sprint work during their on-call week?

How do we handle on-call during holidays?

What do we do when the on-call engineer is out sick?

How should we handle cross-team incidents that span clinical IT and infrastructure?

Conclusion

Related Posts

Mirth Connect Memory Leak & Java Heap Space Errors: The Complete Guide

Mirth Connect REST API Gotchas : 7 Undocumented Issues That Break Deployments

Why Automation Alone Is Not Enough in Healthcare Operations

On-Call for Healthcare IT: Rotation Design, Escalation Policies, and Burnout Prevention

Why Healthcare On-Call Is Different

Rotation Design That Works

The Primary + Secondary Model

Weekly Rotations, Not Daily

Follow-the-Sun for Multi-Timezone Organizations

Escalation Policies by Severity

Compensation: Paying for the Burden

Compensation Models

Burnout Prevention: The Numbers That Matter

Key Burnout Indicators

The 1-in-4 Rule

Toil Budgets

Clinical IT vs Infrastructure On-Call

EHR Vendor Support Escalation

Epic Support Tiers

Oracle Health (Cerner) Support Tiers

On-Call Health Scorecard

Frequently Asked Questions

What's the minimum team size for sustainable healthcare on-call?

Should on-call engineers also do sprint work during their on-call week?

How do we handle on-call during holidays?

What do we do when the on-call engineer is out sick?

How should we handle cross-team incidents that span clinical IT and infrastructure?

Conclusion

Related Posts

Mirth Connect Memory Leak & Java Heap Space Errors: The Complete Guide

Mirth Connect REST API Gotchas : 7 Undocumented Issues That Break Deployments

Why Automation Alone Is Not Enough in Healthcare Operations