High availability (HA) is essential for modern healthcare systems. When Mirth Connect goes down, clinical workflows stop, alerts fail, and patient safety is at risk. Since Mirth Connect often handles HL7, FHIR, DICOM, XML, and other healthcare messages, ensuring a zero-downtime, fault-tolerant setup is critical.
This guide explains how to design and implement a complete HA architecture for Mirth Connect, including database clustering, load balancing, shared storage, session management, failover testing, and long-term maintenance.
Why High Availability Matters in Healthcare
Mirth Connect is the backbone of many interoperability workflows. When it goes offline:
- Clinical workflows are delayed
- Critical alerts fail to trigger
- Duplicate or lost patient data increases risk
- HIPAA compliance issues arise
- Financial losses occur due to downtime
Healthcare systems typically aim for 99.99% uptime, which allows roughly 52 minutes of downtime per year.
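The arithmetic behind an uptime target is worth making explicit. The following minimal Python sketch converts an availability percentage into a yearly downtime budget. The function name is illustrative, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare a few common targets side by side.
for target in (99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {budget:.2f} minutes of downtime per year")
```

Each additional "nine" cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% usually forces the move from a single server to a redundant design.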
Understanding HA Requirements — RTO and RPO
Before designing the architecture, define your business expectations:
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
Typical healthcare requirements:
- RTO: 5-30 minutes
- RPO: 0-5 minutes
These values directly shape your HA design, especially database clustering and failover mechanisms.
High Availability Architecture Overview
A single-server Mirth setup has multiple failure points: application crashes, hardware failures, network issues, or database downtime.
A proper HA design removes these risks with layered redundancy:
- Load Balancer Cluster: HAProxy or NGINX
- Multiple Mirth Instances: active-active or active-passive
- Clustered Database: PostgreSQL + Patroni or MySQL Galera
- Shared Storage: NFS or GlusterFS
- Redundant Networking: VLANs, redundant switches, dual NICs
A front-end load balancer distributes traffic across active Mirth nodes, ensuring that clients connect through a virtual service or VIP rather than directly to individual nodes.
The load balancer can operate at Layer 4 or Layer 7. A Layer 7 reverse proxy is the recommended method for Mirth Connect's HTTP-based traffic because of its performance and implementation flexibility; long-lived TCP traffic such as MLLP has to be balanced at Layer 4 instead.
Note that certain behaviors, such as header insertion or transparency, may not be active by default and require specific configuration. A load balancer gives Mirth Connect high availability by directing traffic only to healthy nodes, and it also helps absorb increased traffic and improve response times for healthcare applications.
Mirth Connect's architecture supports various load-balancing configurations to meet different deployment needs.
Together, these redundant layers ensure that no single failure can stop message processing.
Prerequisites and Planning
Hardware Requirements
- Mirth Servers: 4-8 vCPU, 8-16 GB RAM
- Database Nodes: 8-32 vCPU, 16-32 GB RAM
- Load Balancers: 2 nodes minimum
- Shared Storage: RAID 10 SSD/NVMe
Software Requirements
- Linux (RHEL, CentOS, Ubuntu)
- Java 11+
- Mirth Connect 4.0+
- PostgreSQL 12+ or MySQL 8.0+
- HAProxy 2.0+ or NGINX
Network Requirements
- 1 Gbps internal bandwidth
- < 5 ms latency to the database
- Correct DNS resolution for every node
- Clock synchronization across all nodes using NTP, to prevent database conflicts and inconsistent timestamps during message processing
- Open ports for Mirth, DB, HAProxy
Setting Up the Database Cluster
Implementing Mirth Connect for high availability means running multiple Mirth Connect instances against one shared database. That database must itself be clustered, or it becomes the single point of failure, because Mirth stores:
- Channels
- Metadata
- Message logs
- User sessions
The recommended method for ensuring high availability is to use a clustered database configuration that includes replication and automatic failover. PostgreSQL is recommended as a highly available database for Mirth Connect, though Oracle, SQL Server, and MySQL are also supported.
This approach ensures data integrity and reliable failover, which are critical for maintaining continuous operations in healthcare environments.
Option 1 — PostgreSQL + Patroni
- Automatic failover
- Synchronous replication
- Quorum and split-brain protection
Option 2 — MySQL/MariaDB Galera Cluster
- Synchronous multi-master replication
- High write throughput
- Automatic cluster recovery
Connection pooling must be enabled in Mirth to prevent DB overload.
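As a rough sizing check, the combined pool size of all Mirth nodes has to fit under the database's connection limit with some headroom. The sketch below is illustrative only: the function names and the `reserved` headroom of 10 connections are assumptions, and the limit of 100 in the usage note is PostgreSQL's default `max_connections`. In Mirth, the per-node pool size is typically set through `database.max-connections` in mirth.properties; verify the setting name against your version.

```python
def total_pool_connections(nodes: int, pool_size_per_node: int) -> int:
    """Worst case: every node has its whole pool open at the same time."""
    return nodes * pool_size_per_node


def pool_fits(nodes: int, pool_size_per_node: int,
              db_max_connections: int, reserved: int = 10) -> bool:
    """Check that all pools fit under the database limit.

    `reserved` is an assumed allowance for replication, monitoring,
    and administrative sessions.
    """
    available = db_max_connections - reserved
    return total_pool_connections(nodes, pool_size_per_node) <= available
```

For example, `pool_fits(2, 20, 100)` holds, while `pool_fits(5, 20, 100)` does not: five nodes at 20 connections each would consume the entire limit and leave no headroom.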
Load Balancer Configuration
Use HAProxy or NGINX to distribute traffic and ensure high availability. For HTTP traffic, a Layer 7 reverse proxy is recommended: it handles SSL termination and sets X-Forwarded-For headers so that Mirth Connect still sees the real client IP behind the proxy.
Key responsibilities:
- Routing incoming traffic across nodes
- Health checking Mirth instances
- SSL termination
- Failover handling
- Session stickiness when required
The load balancer should be configured to log client IP addresses and health check results for monitoring and security. Its configuration files should be exported regularly, via API or scripts, for backup, migration, and disaster recovery. The load balancers themselves run as a pair with a heartbeat between them: if the primary fails, the secondary automatically takes over the Virtual IP, and traffic, including queued messages, resumes through it.
Correct health checks ensure nodes are only used when fully healthy.
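To make the idea of a health check concrete, here is a minimal Python sketch of the simplest possible probe: a TCP connect with a timeout. The function name and the default timeout are assumptions for illustration. In practice the load balancer performs this check itself (in HAProxy, for instance, through the `check`, `inter`, `fall`, and `rise` server options), and a deeper check that exercises a real channel is more trustworthy than a bare port test.

```python
import socket


def tcp_port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host, connects, and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

A probe like this only proves that something is listening. A node whose JVM is up but whose channels are stopped would still pass, which is why "fully healthy" should be defined at the channel level wherever possible.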
Mirth Connect HA Node Configuration
Each node must have identical configurations except for node IDs.
Critical settings:
- Cluster mode enabled
- Database sessions activated
- Centralized logs
- Consistent channel deployment
- Cache synchronization scripts
If you are achieving high availability without the Advanced Clustering plugin, more manual configuration is required. For example, you need to disable auto-start for polling channels on each node to prevent duplicate processing.
Tools like Ansible should manage configuration consistency across nodes.
Shared Storage Configuration
Shared storage is required for:
- Logs
- Channel exports
- Certificates
- Custom libraries
Restrict access to shared storage and to the credentials used to mount it, so that only authorized systems and users can read or write sensitive healthcare data.
Common options:
- NFS Server
- GlusterFS Cluster (replicated)
Use SSD or NVMe storage for higher throughput and lower latency.
Session and State Management
For proper failover:
- Enable database-backed sessions
- Synchronize caches periodically
- Use idempotent channel logic to avoid duplicate processing
- Store critical queues in the database
This ensures users remain authenticated and messages are not reprocessed after a failover.
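Idempotent channel logic usually comes down to remembering which messages have already been handled. The following Python sketch illustrates the idea outside of Mirth, under stated assumptions: messages are pipe-delimited HL7 v2, the message control ID (MSH-10) is unique per message, and an in-memory set stands in for what would be a shared database table in a real cluster. The class and function names are hypothetical.

```python
def message_control_id(hl7_message: str) -> str:
    """Return MSH-10 (message control ID) from a pipe-delimited HL7 v2 message."""
    msh_segment = hl7_message.split("\r", 1)[0]
    fields = msh_segment.split("|")
    # MSH-1 is the field separator itself, so MSH-10 sits at index 9.
    if fields[0] != "MSH" or len(fields) < 10:
        raise ValueError("message does not start with a usable MSH segment")
    return fields[9]


class IdempotentProcessor:
    """Run a handler at most once per message control ID."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: a shared DB table in a real cluster

    def process(self, hl7_message: str) -> bool:
        """Return True if the message was handled, False if it was a duplicate."""
        control_id = message_control_id(hl7_message)
        if control_id in self._seen:
            return False
        self._handler(hl7_message)
        self._seen.add(control_id)  # recorded only after the handler succeeds
        return True
```

If several senders feed the same channel, the control ID alone may not be unique, and a key that also includes the sending application and facility (MSH-3 and MSH-4) is a safer choice.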
NextGen Healthcare Integration
NextGen Healthcare Integration, powered by NextGen Connect, offers healthcare organizations an established architecture for high availability, scalability, and interoperability.
The platform processes hundreds of millions of clinical documents annually, supporting mission-critical workflows that demand zero downtime and reliable message processing. Its high availability features are backed by support from both the NextGen Healthcare team and an active open-source community, which provides ongoing improvements, security verification, and a rapid response to evolving industry needs.
While recent licensing changes have shifted the landscape for the free version, the commitment to community-driven development and enterprise-grade support means that NextGen Connect remains a cornerstone for secure, scalable, and future-ready healthcare integration.
This ongoing collaboration empowers organizations to meet the increasing demands for secure data exchange, regulatory compliance, and resilient system performance in a rapidly changing healthcare world.
Configuration Best Practices
- Use connection pooling
- Set correct timeouts (HTTP + DB)
- Use retry & backoff strategies
- Implement circuit breakers
- Isolate instance-specific directories
These prevent cascading failures and ensure stable HA behavior.
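The retry and backoff item above can be sketched in a few lines of Python. This is a minimal illustration, not Mirth's own retry mechanism: the function names, the default delays, and the choice to retry only on `ConnectionError` are all assumptions. It does not cover circuit breakers, which additionally stop calling a dependency for a cooling-off period after repeated failures.

```python
import time


def backoff_delays(attempts: int, base: float = 0.5,
                   factor: float = 2.0, cap: float = 30.0) -> list:
    """Return exponentially growing delays in seconds, limited to `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_retry(operation, attempts: int = 5, sleep=time.sleep,
                    base: float = 0.5, factor: float = 2.0, cap: float = 30.0):
    """Call `operation`, retrying on ConnectionError with exponential backoff.

    `sleep` is injectable so the function can be tested without waiting.
    """
    delays = backoff_delays(attempts - 1, base, factor, cap)
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(delays[attempt])
```

With the defaults, the waits are 0.5, 1, 2, and 4 seconds. Adding random jitter to each delay is a common refinement, because it stops every node from retrying against a recovering database at the same instant.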
Common Pitfalls and How to Avoid Them
- Split brain: Prevent with proper quorum (Patroni/Galera)
- DB connection exhaustion: Enforce pooling & connection limits
- File locking issues: Separate instance-specific directories
- Version mismatch: Use automation (Ansible, CI/CD)
- Configuration drift: Validate config checksums
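Checksum validation for configuration drift can be as simple as hashing each node's configuration and grouping nodes by digest. The Python sketch below assumes the configuration content has already been collected from each node and that node-specific lines, such as node IDs, have been stripped beforehand. The function names are illustrative.

```python
import hashlib


def config_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of a configuration file's content."""
    return hashlib.sha256(content).hexdigest()


def group_nodes_by_digest(configs_by_node: dict) -> dict:
    """Map each distinct digest to the list of nodes that share it."""
    groups = {}
    for node, content in configs_by_node.items():
        groups.setdefault(config_digest(content), []).append(node)
    return groups


def has_drift(configs_by_node: dict) -> bool:
    """More than one distinct digest means at least one node has drifted."""
    return len(group_nodes_by_digest(configs_by_node)) > 1
```

Running a check like this from the same automation that deploys configuration turns drift into an alert rather than a surprise during failover.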
Testing and Validation
Perform structured testing:
- Failover test (shutdown one node)
- Database failover simulation
- Network partition testing
- Load testing (JMeter)
- End-to-end message testing
Automated test suites should run after each deployment.
Monitoring and Maintenance
Monitor:
- Mirth node health
- Database replication lag
- Error rates
- Message throughput
- Load balancer status
Use tools like Prometheus, Alertmanager, and ELK for visibility.
Maintenance tasks:
- Daily health checks
- Weekly DB maintenance
- Monthly failover drills
- Regular certificate rotation
Failover Mechanics: What Actually Happens When the Primary Dies
Understanding the failure sequence is what separates an HA design that works from one that looks good in a diagram. When the primary Mirth node fails mid-traffic, four things happen in order:
- In-flight TCP connections drop. Every MLLP sender connected to the dead node gets a connection reset. What happens next depends entirely on the sender: well-behaved interfaces reconnect and resend unACKed messages; badly behaved ones stall until someone notices. Inventory your senders' reconnect behavior before go-live, not during the first failover.
- The load balancer health check fails and traffic shifts to the standby. Your effective recovery time is the health check interval multiplied by the number of consecutive failures required to mark a node down, plus the time the standby needs to become ready. Tune check frequency accordingly (5–10 second intervals are typical for clinical traffic).
- The standby resumes processing from shared state. This only works if channel configuration and message queues live in the shared database — any state on the dead node's local disk is gone. This is why file-based channels need shared storage.
- Unprocessed messages in the database queue resume. Messages the primary had persisted-but-not-processed are picked up; messages it had received-but-not-persisted were never ACKed, so senders resend them. This is the mechanism behind a zero-RPO claim — and why ACK-after-persist configuration matters.
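The ACK-after-persist behavior in the last step can be illustrated with a toy simulation. This is not Mirth code: the `Node` class, the `fail_at` switch, and the shared list standing in for the database queue are all assumptions made for the sketch. It shows the two crash windows that matter, namely a crash before the message is persisted and a crash after it is persisted but before the ACK is sent.

```python
class Node:
    """Toy receiver that persists a message before acknowledging it."""

    def __init__(self, name, shared_queue, fail_at=None):
        self.name = name
        self.shared_queue = shared_queue  # stands in for the shared database
        self.fail_at = fail_at            # None, "before_persist", "after_persist"

    def receive(self, message):
        if self.fail_at == "before_persist":
            raise ConnectionError(f"{self.name} died before persisting")
        self.shared_queue.append(message)  # persist first
        if self.fail_at == "after_persist":
            raise ConnectionError(f"{self.name} died before sending the ACK")
        return "ACK"                       # ACK only once the message is durable


def deliver(message, nodes):
    """Sender behavior: resend to the next node until one returns an ACK."""
    for node in nodes:
        try:
            node.receive(message)
            return node.name
        except ConnectionError:
            continue  # no ACK received, so the sender resends
    raise RuntimeError("no node acknowledged the message")
```

A crash before persisting leaves exactly one copy in the queue once the standby accepts the resend. A crash after persisting leaves two copies, because the sender never saw the ACK. Neither case loses the message, and the second case is the reason downstream deduplication is required.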
Synchronous vs Asynchronous Replication: The RPO Decision
Your database replication mode is your real recovery point objective. Asynchronous replication is faster but means the replica can lag the primary — messages ACKed in the lag window are lost if the primary's storage dies. Synchronous replication guarantees zero loss but adds write latency to every message. For clinical interfaces the standard answer is synchronous replication within a site (or availability zone) and asynchronous replication to the DR site — zero RPO for node failure, small bounded RPO for full-site disaster. Patroni supports exactly this layout with synchronous standbys. For the full-site scenario — backups, cross-region strategy, and what happens to in-flight messages — see our Mirth Connect disaster recovery guide.
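A back-of-the-envelope calculation makes the trade-off tangible. The Python sketch below estimates how many acknowledged messages sit in the lag window of an asynchronous replica. The function names and the example figures are assumptions for illustration; real lag varies from moment to moment and should be measured rather than assumed.

```python
def messages_at_risk(messages_per_second: float,
                     replication_lag_seconds: float) -> float:
    """Estimate ACKed messages that exist only on the primary."""
    return messages_per_second * replication_lag_seconds


def meets_rpo(replication_lag_seconds: float, rpo_seconds: float) -> bool:
    """The replica can only honor an RPO at least as large as its lag."""
    return replication_lag_seconds <= rpo_seconds
```

At 50 messages per second, a 2-second lag puts about 100 acknowledged messages at risk, while synchronous replication corresponds to a lag of zero and therefore zero exposure. In Patroni, synchronous behavior is governed by the `synchronous_mode` setting; consult the Patroni documentation for the exact semantics in your version.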
The Monthly HA Validation Runbook
An HA setup that hasn't been exercised recently should be assumed broken — certificates expire, health checks drift, someone changes a firewall rule. A monthly 30-minute drill keeps the guarantee real:
- Send a steady stream of test messages through a non-clinical validation channel.
- Stop the primary Mirth service (gracefully one month, kill -9 the next; they exercise different paths).
- Measure: time to standby takeover, messages lost (must be zero), and duplicate messages delivered (senders resend unACKed messages, so verify that downstream dedupe handles them).
- Fail back, and verify the original primary rejoins replication cleanly rather than starting a split-brain.
- Log the results. The trend matters: takeover time creeping from 8 to 25 seconds is a warning you want early.
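The measurements in the drill lend themselves to a small script. The Python sketch below computes the three numbers from data the drill already produces. It assumes each test message carries a unique ID, that the downstream side records every ID it receives, and that timestamps are in seconds; the function and key names are hypothetical.

```python
from collections import Counter


def analyze_drill(sent_ids, delivered_ids, stopped_at, first_standby_delivery_at):
    """Summarize a failover drill.

    sent_ids: IDs of all test messages sent during the drill.
    delivered_ids: IDs observed downstream, including any repeats.
    stopped_at / first_standby_delivery_at: timestamps in seconds.
    """
    delivered = Counter(delivered_ids)
    lost = sorted(set(sent_ids) - set(delivered))
    duplicates = sorted(mid for mid, count in delivered.items() if count > 1)
    return {
        "takeover_seconds": first_standby_delivery_at - stopped_at,
        "lost": lost,              # must be empty for the drill to pass
        "duplicates": duplicates,  # expected; downstream dedupe must absorb them
    }
```

Appending each month's result to a log file or a dashboard is enough to reveal the takeover-time trend mentioned above.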
Pair the drill with monitoring that watches both nodes and the replication lag itself — our guide to reliable Mirth monitoring in production covers the alerting rules, and the complete Mirth Connect guide puts the HA layer in context of the full production architecture.
Load Balancer Configuration: MLLP Is Not HTTP
Most load balancer documentation assumes HTTP, and MLLP breaks those assumptions in ways that matter for failover design. MLLP connections are long-lived TCP sessions — a sender connects once and streams messages for hours or days. That means: health checks must be TCP-level (or better, a synthetic MLLP ping channel), not HTTP probes; session persistence is irrelevant because there are no sessions to balance, only long connections to fail over; and “draining” a node gracefully means waiting for senders to disconnect or forcing a reconnect, not waiting for requests to finish. Configure aggressive TCP keepalive detection on the balancer — the default OS-level timeout can leave a sender pinned to a dead node for many minutes, which is the gap where messages back up and pagers fire.
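A synthetic MLLP ping is straightforward to sketch, because MLLP framing is simple: a start block byte (0x0B), the HL7 payload, then an end block byte (0x1C) followed by a carriage return (0x0D). The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, the target is a dedicated validation channel that replies with one framed ACK, and no TLS is involved.

```python
import socket

START_BLOCK = b"\x0b"       # MLLP start-of-block
END_BLOCK = b"\x1c"         # MLLP end-of-block
CARRIAGE_RETURN = b"\x0d"   # required after the end block


def mllp_wrap(hl7_message: bytes) -> bytes:
    """Frame an HL7 payload for transmission over MLLP."""
    return START_BLOCK + hl7_message + END_BLOCK + CARRIAGE_RETURN


def mllp_unwrap(frame: bytes) -> bytes:
    """Strip MLLP framing, rejecting incomplete frames."""
    if not (frame.startswith(START_BLOCK)
            and frame.endswith(END_BLOCK + CARRIAGE_RETURN)):
        raise ValueError("not a complete MLLP frame")
    return frame[1:-2]


def mllp_ping(host: str, port: int, hl7_message: bytes,
              timeout: float = 5.0) -> bytes:
    """Send one framed message and return the unframed reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(mllp_wrap(hl7_message))
        buffer = b""
        # Read until the reply's end-of-frame marker arrives.
        while not buffer.endswith(END_BLOCK + CARRIAGE_RETURN):
            chunk = conn.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed before a full frame arrived")
            buffer += chunk
    return mllp_unwrap(buffer)
```

Unlike a bare TCP connect, a ping like this fails when the listener is up but the channel behind it is not answering, which is the failure a port check misses.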
Shared Storage: The Component Everyone Forgets
Channels that read or write files — SFTP pickups, batch exports, attachment archives — hold state outside the database, and that state must survive node failure. The options, in order of operational simplicity: a managed NFS service (AWS EFS, Azure Files) mounted on both nodes; a traditional NFS server with its own HA story; or restructuring file channels to stage through the database or object storage instead. Whichever you choose, test the failure mode where the primary dies mid-file: the standby must not double-process a half-consumed batch file. File locking behavior over NFS is subtle enough that this specific test belongs in your validation runbook.
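One common way to keep two nodes from processing the same file is to claim it with a rename before reading it. The Python sketch below illustrates the pattern; the function name and the `.claimed-by-` suffix are assumptions. Rename is atomic on local POSIX filesystems, but its behavior over NFS depends on the server, the client, and the protocol version, so treat this as a starting point to verify in your own environment rather than a guarantee.

```python
import os


def claim_file(path: str, node_id: str):
    """Try to claim a file for this node by renaming it.

    Returns the claimed path on success, or None if another node
    renamed the file first.
    """
    claimed_path = f"{path}.claimed-by-{node_id}"
    try:
        os.rename(path, claimed_path)
    except FileNotFoundError:
        # The source is gone: another node won the race.
        return None
    return claimed_path
```

A claim alone does not solve the mid-file crash. If the primary dies after claiming a batch, the file stays marked with the dead node's ID, so the standby needs an explicit rule for reclaiming such files and reprocessing them idempotently.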
Using HA for Zero-Downtime Maintenance
A correctly built HA pair pays for itself outside of disasters: it turns maintenance windows into non-events. OS patching, JVM upgrades, and Mirth version changes all follow the same pattern — drain and patch the standby, fail over deliberately during a low-volume window, patch the former primary, fail back. Each step uses the exact mechanics you drill monthly, which means maintenance doubles as failover testing. Teams running this pattern patch monthly without ever stopping message flow — and a security posture that doesn't require downtime is one that actually gets maintained, which matters given Mirth's CVE history.
What HA Actually Costs — and When It's Overkill
Honest sizing matters, because HA is not free: a second Mirth node, a database cluster instead of a single instance, a load balancer, shared storage, and — the part that's underestimated — the ongoing operational discipline of drills and patching choreography. Roughly, expect the infrastructure footprint to double and the operational complexity to go up by more than that.
That cost is clearly justified when the engine carries time-sensitive clinical traffic: ADT driving downstream systems, lab results, medication orders. An hour of outage there is measured in delayed care and manual re-entry across departments. It is much harder to justify for an estate of overnight batch interfaces that could simply re-run after an outage — a well-rehearsed single-node recovery procedure with solid backups may serve a small operation better than a poorly maintained cluster. The worst position is the middle one: paying for HA infrastructure but skipping the drills and replication monitoring that make it real. A failover pair that hasn't been exercised in a year delivers single-node reliability at double-node cost.
If you're weighing this build against alternatives — commercial clustering, managed Mirth hosting, or cloud-managed databases that take replication off your plate — the self-host vs managed platform comparison and the cloud deployment guide cover the trade-offs in depth.
Conclusion
Mirth Connect HA requires much more than running multiple servers. Proper planning, database clustering, load balancing, shared storage, synchronized configuration, and continuous testing all contribute to a reliable HA environment.
With the right architecture, healthcare organizations can achieve:
- Zero-downtime failover
- Consistent message processing
- Audit-ready compliance
- Resilience against hardware and network faults
A well-architected Mirth HA setup ensures safe, uninterrupted patient data flow across the entire healthcare ecosystem. Mirth Connect is used by thousands of organizations and is supported by active forums and a dedicated website for resources and support.



