Healthcare systems are among the most complex distributed architectures in any industry. A single patient encounter can trigger data flows across an EHR, a lab information system, a pharmacy system, a radiology PACS, a billing engine, and multiple integration engines — all within minutes. When something goes wrong — a lab result takes 6 hours instead of 2, a FHIR API times out during a clinical workflow, an HL7 message gets stuck in a Mirth Connect queue — the impact is not just a degraded user experience. It can affect patient care.
Yet most healthcare IT organizations still rely on log-based monitoring: grep through application logs, check Mirth channel statistics, and hope that the error messages tell the full story. They rarely do. Distributed tracing — the ability to follow a single request or message across every system it touches — is the missing capability. And OpenTelemetry (OTel) is the vendor-neutral standard that makes it possible.
This article covers how to instrument healthcare systems with OpenTelemetry: adding tracing to FHIR servers, propagating trace context through HL7 pipelines, creating custom spans for clinical workflows, defining healthcare-specific attributes, and configuring the OTel Collector for production healthcare environments. It includes example code for Java (HAPI FHIR), JavaScript (Mirth Connect), and Python instrumentation.
Why Healthcare Needs Distributed Tracing
Traditional monitoring answers "is this system up?" Distributed tracing answers "what happened to this specific patient's data as it moved through our systems?" The difference is critical in healthcare for three reasons:
1. Multi-System Workflows Are the Norm
A physician places a lab order in Epic. That order becomes an ORM^O01 HL7 message, routed through Mirth Connect to the lab information system. The lab processes the specimen, generates results, and sends an ORU^R01 back through Mirth to Epic. If the turnaround time is too long, which system caused the delay? Without tracing, you are guessing.
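To make the question concrete, here is a minimal Python sketch of the arithmetic a trace makes possible. The hop names and timestamps are invented for illustration; in practice they would come from the spans of a single trace.

```python
from datetime import datetime

# Hypothetical timestamps for one lab order, as they might be read off a trace.
# Each entry is (hop name, time the hop finished); the first entry marks the start.
events = [
    ("order_placed_in_ehr",    "2024-03-01T08:00:00"),
    ("orm_routed_by_mirth",    "2024-03-01T08:00:04"),
    ("result_produced_by_lis", "2024-03-01T09:45:00"),
    ("oru_routed_by_mirth",    "2024-03-01T09:45:03"),
    ("result_filed_in_ehr",    "2024-03-01T09:45:10"),
]

def hop_durations(events):
    """Return {hop: seconds elapsed between the previous hop and this one}."""
    times = [(name, datetime.fromisoformat(ts)) for name, ts in events]
    return {
        name: (t - prev_t).total_seconds()
        for (_, prev_t), (name, t) in zip(times, times[1:])
    }

durations = hop_durations(events)
slowest = max(durations, key=durations.get)
print(f"slowest hop: {slowest} ({durations[slowest]:.0f} s)")
```

With these sample values, nearly all of the turnaround sits in the lab system rather than in the integration engine, which is exactly the distinction that log-based monitoring tends to hide.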
2. Clinical SLOs Require End-to-End Visibility
Healthcare organizations are increasingly adopting observability frameworks with clinical SLOs: STAT lab results in under 60 minutes, radiology reports available within 2 hours, medication orders reaching pharmacy within 5 minutes. Meeting these SLOs requires measuring the full end-to-end latency, not just individual system response times.
3. Compliance and Audit Requirements
HIPAA, HITRUST, and ONC certification each place requirements on system reliability, data integrity, and auditability. Distributed traces can support that evidence by recording how data flows through systems: when it arrived, how long each processing step took, and whether it was delivered successfully.
Instrumenting a FHIR Server with OpenTelemetry
FHIR servers are the core of modern healthcare data exchange. Whether you are running HAPI FHIR (Java), Microsoft FHIR Server (.NET), or a custom implementation, adding OTel instrumentation follows the same pattern: create spans for each FHIR interaction, add healthcare-specific attributes, and propagate context through the request lifecycle.
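The pattern can be sketched independently of any server framework. The Python function below is a simplified illustration, not part of any FHIR library: it maps an HTTP method and path to a span name and the two core attributes. The function name and the handful of interactions it covers are assumptions made for the example; a real server knows the interaction type already.

```python
def classify_fhir_request(method: str, path: str) -> dict:
    """Derive a low-cardinality span name and attributes from a FHIR REST request."""
    # Drop the query string, then split the path into non-empty segments.
    segments = [s for s in path.split("?", 1)[0].split("/") if s]
    method = method.upper()

    if not segments:
        # POST to the server root is a batch/transaction Bundle.
        resource_type = "System"
        interaction = "transaction" if method == "POST" else "unknown"
    elif segments[0] == "metadata":
        resource_type, interaction = "System", "capabilities"
    else:
        resource_type = segments[0]
        has_id = len(segments) > 1
        interaction = {
            ("GET", True): "read",
            ("GET", False): "search-type",
            ("POST", False): "create",
            ("PUT", True): "update",
            ("DELETE", True): "delete",
        }.get((method, has_id), "unknown")

    return {
        # The resource ID is deliberately left out of the span name.
        "span_name": f"FHIR {interaction} /{resource_type}",
        "healthcare.resource_type": resource_type,
        "healthcare.interaction": interaction,
    }

print(classify_fhir_request("GET", "/Patient/123"))
```

Keeping the resource ID out of the span name serves two purposes: span names stay low-cardinality for aggregation, and no patient identifier leaks into them.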
HAPI FHIR Server Instrumentation (Java)
HAPI FHIR supports interceptors that hook into the request lifecycle. Here is a complete OTel interceptor:
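import ca.uhn.fhir.interceptor.api.Hook;
import ca.uhn.fhir.interceptor.api.Interceptor;
import ca.uhn.fhir.interceptor.api.Pointcut;
import ca.uhn.fhir.rest.api.RestOperationTypeEnum;
import ca.uhn.fhir.rest.api.server.RequestDetails;
import ca.uhn.fhir.rest.api.server.ResponseDetails;
import ca.uhn.fhir.rest.server.exceptions.BaseServerResponseException;
import ca.uhn.fhir.util.BundleUtil;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.hl7.fhir.instance.model.api.IBaseBundle;
import org.hl7.fhir.instance.model.api.IBaseResource;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

@Interceptor
public class OtelFhirInterceptor {

    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("healthcare.fhir-server", "1.0.0");
    private static final String SPAN_KEY = "otel.fhir.span";
    private static final String SCOPE_KEY = "otel.fhir.scope";

    @Hook(Pointcut.SERVER_INCOMING_REQUEST_PRE_HANDLED)
    public void preHandle(RequestDetails details) {
        // getCode() yields the FHIR interaction name (e.g. "search-type"), not the enum constant
        RestOperationTypeEnum op = details.getRestOperationType();
        String interaction = op != null ? op.getCode() : "unknown";
        String resourceName = details.getResourceName();

        String spanName = String.format("FHIR %s /%s",
            interaction, resourceName != null ? resourceName : "metadata");
        Span span = tracer.spanBuilder(spanName)
            .setSpanKind(SpanKind.SERVER)
            .startSpan();
        Scope scope = span.makeCurrent();

        // Healthcare-specific attributes
        span.setAttribute("healthcare.resource_type",
            resourceName != null ? resourceName : "System");
        span.setAttribute("healthcare.interaction", interaction);
        span.setAttribute("healthcare.fhir_version", "R4");
        span.setAttribute("healthcare.tenant_id",
            details.getTenantId() != null ? details.getTenantId() : "default");

        // Hash patient ID for privacy (never log raw PHI)
        if (details.getId() != null && "Patient".equals(resourceName)) {
            span.setAttribute("healthcare.patient_id_hash",
                sha256(details.getId().getIdPart()));
        }

        // Both objects are released in completed(), which HAPI calls on the same request thread
        details.getUserData().put(SPAN_KEY, span);
        details.getUserData().put(SCOPE_KEY, scope);
    }

    @Hook(Pointcut.SERVER_OUTGOING_RESPONSE)
    public void postHandle(RequestDetails details, ResponseDetails response) {
        Span span = (Span) details.getUserData().get(SPAN_KEY);
        if (span == null) return;
        span.setAttribute("http.status_code", response.getResponseCode());
        span.setAttribute("healthcare.response_resource_count",
            getResourceCount(details, response));
        if (response.getResponseCode() >= 400) {
            // Span status (not an "error" attribute) is what status-based sampling policies read
            span.setStatus(StatusCode.ERROR);
        }
    }

    // Failed requests bypass SERVER_OUTGOING_RESPONSE, so record their status here
    @Hook(Pointcut.SERVER_HANDLE_EXCEPTION)
    public void onException(RequestDetails details, BaseServerResponseException exception) {
        if (details == null) return;
        Span span = (Span) details.getUserData().get(SPAN_KEY);
        if (span == null) return;
        span.setAttribute("http.status_code", exception.getStatusCode());
        span.setStatus(StatusCode.ERROR);
    }

    // Called after every request, successful or not, so the span and scope never leak
    @Hook(Pointcut.SERVER_PROCESSING_COMPLETED)
    public void completed(RequestDetails details) {
        Scope scope = (Scope) details.getUserData().remove(SCOPE_KEY);
        Span span = (Span) details.getUserData().remove(SPAN_KEY);
        if (scope != null) scope.close();
        if (span != null) span.end();
    }

    // Counts Bundle entries for search/transaction responses; 1 for a single resource
    private long getResourceCount(RequestDetails details, ResponseDetails response) {
        IBaseResource resource = response.getResponseResource();
        if (resource == null) return 0;
        if (resource instanceof IBaseBundle) {
            return BundleUtil.toListOfResources(
                details.getFhirContext(), (IBaseBundle) resource).size();
        }
        return 1;
    }

    private String sha256(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) hex.append(String.format("%02x", b));
            return hex.substring(0, 16); // First 16 hex chars
        } catch (Exception e) {
            return "hash_error";
        }
    }
}
Key Design Decisions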
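- Patient ID hashing: Never put raw patient identifiers in trace attributes. A truncated one-way hash (SHA-256 here) lets you correlate traces for the same patient without writing the identifier itself. An unsalted hash of a short identifier can be reversed by enumerating candidate IDs, so prefer a keyed hash (HMAC) with a secret held outside the telemetry pipeline, and keep treating the hashed value as sensitive. HIPAA does not prescribe hashing specifically, but keeping direct identifiers out of telemetry is consistent with its minimum-necessary principle.
- Resource type as attribute: Adding healthcare.resource_type lets you filter traces by clinical data type, which is essential for understanding which resources have the most latency or errors.
- Tenant ID: Multi-tenant FHIR servers (common in SaaS healthtech) need tenant context in every span for isolation and SLO tracking per customer.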
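As a sketch of the keyed-hash variant of patient ID hashing, the Python function below uses HMAC-SHA256 from the standard library. The function name, the 16-character truncation, and the way the key is supplied are assumptions for illustration; in a real deployment the key would come from a secrets manager rather than source code.

```python
import hashlib
import hmac

def pseudonymize_patient_id(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Return a truncated HMAC-SHA256 hex digest of the patient ID.

    Unlike a plain SHA-256 hash, the result cannot be recomputed by someone
    who enumerates candidate IDs but does not hold the secret key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

# Hypothetical key for the example only.
key = b"example-key-loaded-from-a-secrets-manager"
print(pseudonymize_patient_id("12345", key))
```

The same key must be used by every service that emits the attribute, otherwise traces for the same patient no longer correlate.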
Tracing HL7 Messages Across Mirth Connect Channels
HL7 v2 messages do not natively carry distributed trace context. This is the biggest challenge for healthcare observability: your HL7 pipeline is a black box of fire-and-forget messages. Here is how to add tracing.
Approach: MSH-10 and Z-Segments for Trace Context
There are two practical approaches to propagating trace context through HL7 v2 messages:
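- MSH-10 (Message Control ID): Use a structured format that embeds the trace ID: TRACE-{traceId}-{spanId}-{messageControlId}. This works only where the field is long enough: MSH-10 allows up to 199 characters in HL7 v2.6 and later, but only 20 in v2.5.1 and earlier, which cannot hold a 32-character trace ID. Rewriting MSH-10 also affects acknowledgment matching, because receivers echo it back in MSA-2.
- ZTR (custom Z-segment): Add a Z-segment specifically for trace context: ZTR|{traceId}|{spanId}|{traceFlags}|{baggage}. This is cleaner but relies on every downstream system ignoring Z-segments it does not recognize (the conventional HL7 v2 behavior, though worth verifying for each interface).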
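The Z-segment approach amounts to carrying a W3C traceparent header inside the message. The Python sketch below converts between the two representations. It assumes the default `|` field separator and carriage-return segment terminators, and it omits the optional baggage field; the function names are illustrative.

```python
import re

# W3C traceparent, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def traceparent_to_ztr(traceparent: str) -> str:
    """Build a ZTR segment from a W3C traceparent header value."""
    match = TRACEPARENT_RE.match(traceparent)
    if not match:
        raise ValueError(f"not a valid traceparent: {traceparent!r}")
    trace_id, span_id, flags = match.groups()
    return f"ZTR|{trace_id}|{span_id}|{flags}"

def ztr_to_traceparent(hl7_message: str):
    """Return the traceparent carried in the first ZTR segment, or None."""
    for segment in hl7_message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "ZTR" and len(fields) >= 4:
            candidate = f"00-{fields[1]}-{fields[2]}-{fields[3]}"
            return candidate if TRACEPARENT_RE.match(candidate) else None
    return None

header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
message = "MSH|^~\\&|EPIC_PROD|||LAB_LIS\rPID|1||12345\r" + traceparent_to_ztr(header)
print(ztr_to_traceparent(message))
```

A downstream consumer that recovers the traceparent this way can hand it to its own OpenTelemetry propagator and continue the trace as a child span.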
Mirth Connect Channel Instrumentation (JavaScript)
// Source Transformer: create a span and inject trace context.
// Assumes the OpenTelemetry API and a configured SDK are on Mirth's classpath;
// without an SDK, GlobalOpenTelemetry returns a no-op tracer.
var OTel = Packages.io.opentelemetry.api.GlobalOpenTelemetry;
var tracer = OTel.getTracer("healthcare.mirth-connect", "1.0.0");

// Extract message metadata
var msgType = msg['MSH']['MSH.9']['MSH.9.1'].toString();
var triggerEvent = msg['MSH']['MSH.9']['MSH.9.2'].toString();
var sendingFacility = msg['MSH']['MSH.4']['MSH.4.1'].toString();
var messageControlId = msg['MSH']['MSH.10']['MSH.10.1'].toString();

// Create span
var span = tracer.spanBuilder("mirth.channel." + channelName)
    .setSpanKind(Packages.io.opentelemetry.api.trace.SpanKind.CONSUMER)
    .startSpan();

// Add HL7-specific attributes
span.setAttribute("hl7.message_type", msgType);
span.setAttribute("hl7.trigger_event", triggerEvent);
span.setAttribute("hl7.sending_facility", sendingFacility);
span.setAttribute("hl7.message_control_id", messageControlId);
span.setAttribute("hl7.version", msg['MSH']['MSH.12']['MSH.12.1'].toString());

// Hash patient ID (first PID-3 repetition)
var patientId = msg['PID']['PID.3']['PID.3.1'].toString();
span.setAttribute("healthcare.patient_id_hash",
    Packages.org.apache.commons.codec.digest.DigestUtils.sha256Hex(patientId).substring(0, 16));

// Inject trace context into a ZTR segment for downstream systems
var traceId = span.getSpanContext().getTraceId();
var spanId = span.getSpanContext().getSpanId();
createSegment('ZTR', msg, msg.children().length()); // append as the last segment
msg['ZTR']['ZTR.1']['ZTR.1.1'] = traceId;
msg['ZTR']['ZTR.2']['ZTR.2.1'] = spanId;
msg['ZTR']['ZTR.3']['ZTR.3.1'] = "01"; // sampled

// Store span reference for the destination script.
// A live Span only works while the message stays in this JVM; for queued
// destinations, rely on the stored trace ID instead.
channelMap.put("otel_span", span);
channelMap.put("otel_trace_id", traceId);
Destination Postprocessor
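// Complete the span after destination processing.
// responseStatus and responseErrorMessage are the variables Mirth exposes
// in a destination's response transformer.
var span = channelMap.get("otel_span");
if (span != null) {
    span.setAttribute("mirth.destination_status", responseStatus.toString());
    span.setAttribute("mirth.destination_name", connectorMessage.getConnectorName());
    if (responseStatus == Packages.com.mirth.connect.donkey.model.message.Status.ERROR) {
        // Set span status so status-based sampling policies see the failure.
        // Error text can echo message content; scrub it if it may contain PHI.
        span.setStatus(Packages.io.opentelemetry.api.trace.StatusCode.ERROR,
            String(responseErrorMessage));
    }
    span.end();
}
Healthcare-Specific Span Attributes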
The OpenTelemetry specification defines semantic conventions for HTTP, database, and messaging attributes. Healthcare needs its own conventions. Here is the attribute schema we recommend:
FHIR Attributes
| Attribute | Type | Example | Description |
|---|---|---|---|
| healthcare.resource_type | string | Patient | FHIR resource type |
| healthcare.interaction | string | search-type | FHIR interaction (read, search, create, update, delete, transaction) |
| healthcare.patient_id_hash | string | a1b2c3d4e5f60718 | SHA-256 hash of patient ID, truncated to 16 hex characters (never raw PHI) |
| healthcare.fhir_version | string | R4 | FHIR version |
| healthcare.tenant_id | string | hospital-a | Multi-tenant identifier |
| healthcare.bundle_size | int | 5 | Number of entries in a Bundle |
| healthcare.search_params | string | birthdate,name | Search parameter names only (values such as names and birth dates are PHI and must be dropped) |
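Recording search parameters safely means discarding their values before they reach a span. A minimal Python sketch of that scrubbing step follows; the function name is illustrative.

```python
from urllib.parse import parse_qsl

def search_param_names(query_string: str) -> str:
    """Return the sorted, comma-separated parameter names of a FHIR search.

    Values are discarded because they routinely contain PHI
    (names, birth dates, identifiers).
    """
    names = {name for name, _ in parse_qsl(query_string, keep_blank_values=True)}
    return ",".join(sorted(names))

print(search_param_names("name=Smith&birthdate=1980"))
```

The resulting string is still useful for spotting slow query shapes, since latency usually depends on which parameters are combined rather than on their values.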
HL7 v2 Attributes
| Attribute | Type | Example | Description |
|---|---|---|---|
| hl7.message_type | string | ADT | MSH-9.1 message type |
| hl7.trigger_event | string | A01 | MSH-9.2 trigger event |
| hl7.sending_facility | string | EPIC_PROD | MSH-4 sending facility |
| hl7.receiving_facility | string | LAB_LIS | MSH-6 receiving facility |
| hl7.message_control_id | string | MSG00001 | MSH-10 unique ID |
| hl7.processing_id | string | P | MSH-11 (P=production, T=test) |
Clinical Workflow Attributes
| Attribute | Type | Example | Description |
|---|---|---|---|
| workflow.order_id | string | ORD-12345 | Clinical order identifier |
| workflow.step | string | result_delivery | Current workflow step |
| workflow.priority | string | STAT | Order priority (STAT, ROUTINE, ASAP) |
| workflow.turnaround_ms | int | 7200000 | Time since order placement in ms |
| workflow.slo_target_ms | int | 14400000 | SLO target for this workflow type |
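A convention only helps if instrumentation actually follows it. One way to enforce the schema is a small check in unit tests or in a span-processing hook. The Python sketch below covers a subset of the attributes above; the `SCHEMA` dictionary and function name are assumptions made for the example.

```python
# Subset of the attribute schema above: attribute name -> expected Python type.
SCHEMA = {
    "healthcare.resource_type": str,
    "healthcare.interaction": str,
    "healthcare.patient_id_hash": str,
    "healthcare.bundle_size": int,
    "hl7.message_type": str,
    "hl7.trigger_event": str,
    "workflow.order_id": str,
    "workflow.priority": str,
    "workflow.turnaround_ms": int,
    "workflow.slo_target_ms": int,
}
PRIORITIES = {"STAT", "ASAP", "ROUTINE"}

def validate_attributes(attributes: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, value in attributes.items():
        expected = SCHEMA.get(key)
        if expected is None:
            problems.append(f"{key}: not in schema")
        elif type(value) is not expected:
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    priority = attributes.get("workflow.priority")
    if isinstance(priority, str) and priority not in PRIORITIES:
        problems.append(f"workflow.priority: unknown value {priority!r}")
    return problems

print(validate_attributes({"workflow.priority": "STAT", "workflow.turnaround_ms": "7200000"}))
```

Rejecting unknown keys also acts as a guard against ad hoc attributes such as a patient name slipping into spans.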
Custom Spans for Clinical Workflows
Beyond individual API calls and messages, healthcare needs spans that represent clinical workflows — multi-step processes that span hours or days. The canonical example is the order-to-result lifecycle.
Python: Order Lifecycle Tracker
from typing import Optional
import hashlib

from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.trace import SpanKind, StatusCode

tracer = trace.get_tracer("healthcare.clinical-workflow", "1.0.0")


class OrderLifecycleTracker:
    """Track clinical order lifecycle as a distributed trace."""

    def __init__(self, order_id: str, patient_id: str, order_type: str):
        self.order_id = order_id
        self.patient_hash = hashlib.sha256(
            patient_id.encode()
        ).hexdigest()[:16]
        self.order_type = order_type
        self.root_span = None
        self.root_context = None

    def start_order(self, priority: str = "ROUTINE") -> dict:
        """Called when physician places the order."""
        self.root_span = tracer.start_span(
            f"clinical.order_lifecycle.{self.order_type}",
            kind=SpanKind.INTERNAL,
            attributes={
                "workflow.order_id": self.order_id,
                "healthcare.patient_id_hash": self.patient_hash,
                "workflow.order_type": self.order_type,
                "workflow.priority": priority,
                "workflow.step": "order_placed",
            },
        )
        self.root_context = trace.set_span_in_context(self.root_span)
        return self._get_propagation_headers()

    def record_step(self, step_name: str, attributes: Optional[dict] = None):
        """Record a step in the order lifecycle as a child of the root span."""
        with tracer.start_as_current_span(
            f"clinical.{step_name}",
            context=self.root_context,
            kind=SpanKind.INTERNAL,
        ) as span:
            # The Python API uses set_attribute (snake_case), unlike Java's setAttribute
            span.set_attribute("workflow.order_id", self.order_id)
            span.set_attribute("workflow.step", step_name)
            span.set_attribute("healthcare.patient_id_hash", self.patient_hash)
            if attributes:
                for k, v in attributes.items():
                    span.set_attribute(k, v)

    def complete_order(self, success: bool = True):
        """Called when the order lifecycle is complete."""
        if self.root_span:
            self.root_span.set_attribute("workflow.step", "completed")
            self.root_span.set_status(
                StatusCode.OK if success else StatusCode.ERROR
            )
            self.root_span.end()

    def _get_propagation_headers(self) -> dict:
        """Get W3C trace context headers for propagation."""
        headers = {}
        inject(headers, context=self.root_context)
        return headers


# Usage: Lab Order Lifecycle
tracker = OrderLifecycleTracker(
    order_id="LAB-2024-001234",
    patient_id="patient-john-doe-123",
    order_type="laboratory",
)

# Step 1: Order placed in EHR
headers = tracker.start_order(priority="STAT")

# Step 2: Order transmitted (pass headers to integration engine)
tracker.record_step("order_transmitted", {
    "hl7.message_type": "ORM",
    "hl7.trigger_event": "O01",
    "mirth.channel": "lab-orders-outbound",
})

# Step 3: Specimen collected
tracker.record_step("specimen_collected", {
    "lab.specimen_type": "blood",
    "lab.collection_site": "phlebotomy_station_3",
})

# Step 4: Result available
tracker.record_step("result_available", {
    "lab.result_status": "final",
    "lab.abnormal_flag": "H",
    "workflow.turnaround_ms": 7200000,
})

# Step 5: Result delivered to EHR
tracker.record_step("result_delivered", {
    "hl7.message_type": "ORU",
    "hl7.trigger_event": "R01",
})

tracker.complete_order(success=True)
OTel Collector Configuration for Healthcare
The OpenTelemetry Collector is the central hub that receives telemetry data, processes it, and exports it to your observability backend. Healthcare environments need specific configuration for PHI protection, multi-tenancy, and compliance.
Production Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  # Add standard attributes to all telemetry
  attributes/healthcare:
    actions:
      - key: deployment.environment
        value: "production"
        action: upsert
      - key: service.namespace
        value: "healthcare-platform"
        action: upsert

  # CRITICAL: Drop spans whose URL carries a numeric Patient ID
  # but no hashed patient ID attribute
  filter/phi:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.url"] != nil and IsMatch(attributes["http.url"], ".*Patient/[0-9]+.*") and attributes["healthcare.patient_id_hash"] == nil'

  # Batch for performance
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 1024

  # Tail-based sampling: keep all error spans,
  # sample 10% of healthy spans
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 5000
      - name: clinical-workflows
        type: string_attribute
        string_attribute:
          key: workflow.priority
          values: [STAT, ASAP]
      - name: sample-healthy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/jaeger:
    endpoint: "jaeger-collector:4317"
    tls:
      insecure: false
      cert_file: /etc/ssl/certs/otel.crt
      key_file: /etc/ssl/private/otel.key
  prometheusremotewrite:
    # Plain HTTP as in the original example; use an https endpoint in production
    endpoint: "http://prometheus:9090/api/v1/write"
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/healthcare, filter/phi, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [attributes/healthcare, batch]
      exporters: [prometheusremotewrite]
Key Configuration Decisions
- PHI filtering: The filter/phi processor drops spans whose http.url contains a numeric patient ID (Patient/ followed by digits) without a corresponding hash attribute. This is a safety net: your instrumentation should hash at the source, but defense-in-depth matters for HIPAA. Note that the pattern as written does not catch non-numeric IDs such as UUIDs.
- Tail-based sampling: In healthcare, you always want 100% of error traces and slow requests. STAT and ASAP clinical workflows also get 100% sampling. Routine healthy requests are sampled at 10% to manage cost and storage.
- TLS: All telemetry transport must be encrypted. Trace metadata about patient interactions can still be PHI when it remains linkable to an individual, even if patient IDs are hashed.
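The sampling policies combine as a logical OR: a trace is kept if any one policy matches. The Python sketch below is a simplified model of that decision, not the Collector's implementation. The trace dictionary shape and the injectable `rng` parameter are assumptions made so the logic can be tested.

```python
import random

def keep_trace(trace: dict, healthy_sample_rate: float = 0.10, rng=random.random) -> bool:
    """Simplified model of the tail-sampling decision: keep if any policy matches."""
    if trace.get("status") == "ERROR":                       # errors-always
        return True
    if trace.get("duration_ms", 0) > 5000:                   # slow-requests
        return True
    if trace.get("workflow.priority") in {"STAT", "ASAP"}:   # clinical-workflows
        return True
    return rng() < healthy_sample_rate                       # sample-healthy

print(keep_trace({"status": "ERROR", "duration_ms": 120}))
print(keep_trace({"status": "OK", "duration_ms": 120, "workflow.priority": "STAT"}))
```

Because the decision needs the whole trace, the Collector holds spans in memory for the `decision_wait` window before applying it, which is worth accounting for when sizing the Collector.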
Building Healthcare Dashboards
With traces and metrics flowing into your observability stack, here are the dashboards that healthcare IT teams need:
FHIR Server Health
- Request rate by resource type (Patient, Observation, MedicationRequest)
- Latency percentiles (p50, p95, p99) by interaction type (read vs. search vs. create)
- Error rate by resource type and HTTP status code
- Bundle processing time and entry count distribution
- Search query latency breakdown (database vs. terminology vs. authorization)
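Most backends compute these panels for you, but the underlying calculation is simple. As a sketch, the Python function below derives latency percentiles per interaction type from exported span records. The span dictionary shape is an assumption for the example.

```python
from collections import defaultdict
from statistics import quantiles

def latency_percentiles(spans):
    """Group span durations by FHIR interaction and return p50/p95/p99 for each."""
    by_interaction = defaultdict(list)
    for span in spans:
        by_interaction[span["healthcare.interaction"]].append(span["duration_ms"])

    result = {}
    for interaction, durations in by_interaction.items():
        if len(durations) < 2:
            continue  # not enough samples to compute quantiles
        cuts = quantiles(durations, n=100, method="inclusive")  # 99 cut points
        result[interaction] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return result

sample = [{"healthcare.interaction": "read", "duration_ms": d} for d in range(1, 102)]
print(latency_percentiles(sample))
```

If spans are sampled at 10%, percentiles computed from traces are biased toward whatever the sampling policies keep, so latency panels are usually better driven from metrics recorded before sampling.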
HL7 Pipeline Health
- Message throughput by type (ADT, ORM, ORU, MDM) and sending facility
- Channel queue depth and processing latency per Mirth channel
- Error rate by channel and destination, with error message classification
- Message delivery time (source received to destination acknowledged)
- ACK/NAK ratio per destination system
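As an example of one of these panels, the Python sketch below computes the acknowledgment acceptance ratio per destination from MSA-1 codes. The input shape, a list of (destination, code) pairs, is an assumption for illustration.

```python
from collections import Counter

# MSA-1 acknowledgment codes: AA/CA mean accepted; AE/AR/CE/CR mean error or reject.
ACCEPT_CODES = {"AA", "CA"}

def ack_ratio(acks):
    """Return {destination: fraction of acknowledgments that were accepts}."""
    totals, accepted = Counter(), Counter()
    for destination, code in acks:
        totals[destination] += 1
        if code in ACCEPT_CODES:
            accepted[destination] += 1
    return {dest: accepted[dest] / totals[dest] for dest in totals}

observed = [
    ("LAB_LIS", "AA"), ("LAB_LIS", "AE"), ("LAB_LIS", "AA"), ("LAB_LIS", "AA"),
    ("PHARMACY", "AR"),
]
print(ack_ratio(observed))
```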
Clinical Workflow SLOs
- Order-to-result turnaround time by order type and priority (STAT vs. ROUTINE)
- SLO compliance rate: percentage of orders meeting turnaround targets
- Error budget burn rate: are we consuming our error budget faster than expected?
- Bottleneck identification: which workflow step contributes most to total latency?
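SLO compliance and burn rate follow directly from the `workflow.turnaround_ms` values on completed traces. The Python sketch below assumes a 99% objective and a one-hour STAT target; both numbers are placeholders for the example, not recommendations.

```python
def slo_report(turnarounds_ms, slo_target_ms, slo_objective=0.99):
    """Compute SLO compliance and error-budget burn rate for a set of orders.

    A burn rate of 1.0 means breaches occur exactly as often as the objective
    allows; above 1.0 the error budget is being consumed too quickly.
    """
    total = len(turnarounds_ms)
    if total == 0:
        raise ValueError("no completed orders in the window")
    breach_fraction = sum(1 for t in turnarounds_ms if t > slo_target_ms) / total
    error_budget = 1 - slo_objective  # allowed fraction of breaches
    return {
        "compliance": 1 - breach_fraction,
        "burn_rate": breach_fraction / error_budget,
    }

# 100 hypothetical STAT orders: 98 within one hour, 2 that took two hours.
orders = [1_800_000] * 98 + [7_200_000] * 2
print(slo_report(orders, slo_target_ms=3_600_000))
```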
Frequently Asked Questions
Does OpenTelemetry work with HAPI FHIR out of the box?
HAPI FHIR does not include OTel instrumentation natively, but the OpenTelemetry Java Agent provides automatic instrumentation for the HTTP layer (Jetty/Spring). For FHIR-specific spans and healthcare attributes, you need custom interceptors like the one shown in this article. The Java Agent handles HTTP span creation, database tracing, and context propagation automatically.
How do you handle PHI in traces?
Three rules: (1) Never put raw patient identifiers, names, dates of birth, or SSNs in span attributes; use one-way hashes instead. (2) Use the OTel Collector's filter processor as a safety net to drop spans containing PHI patterns. (3) Ensure your observability backend (Jaeger, Grafana Cloud, Datadog) has appropriate access controls and BAAs in place. Trace data that describes patient interactions can still be PHI under HIPAA if it remains linkable to an individual, even after patient IDs are removed, so treat the whole telemetry pipeline as in scope.
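Rule 2 can also be exercised in tests before spans ever leave the application. The Python sketch below scans span attributes for two PHI-shaped patterns. The pattern set is deliberately small and is an assumption for the example; it is no substitute for a reviewed PHI detection policy.

```python
import re

# Illustrative patterns only. The FHIR id character set covers non-numeric ids
# such as UUIDs, which a digits-only pattern would miss.
PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "fhir_patient_url": re.compile(r"Patient/[A-Za-z0-9\-\.]{1,64}"),
}

def find_phi(attributes: dict) -> list:
    """Return (attribute key, pattern label) pairs for string values that look like PHI."""
    findings = []
    for key, value in attributes.items():
        if not isinstance(value, str):
            continue
        for label, pattern in PHI_PATTERNS.items():
            if pattern.search(value):
                findings.append((key, label))
    return findings

span_attributes = {
    "http.url": "https://fhir.example.org/Patient/7f3c2a10-ab12",
    "healthcare.patient_id_hash": "a1b2c3d4e5f60718",
    "healthcare.bundle_size": 5,
}
print(find_phi(span_attributes))
```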
What is the performance overhead of OTel instrumentation?
The application-side instrumentation (span creation and attribute setting) adds microseconds per span, which is small next to the network and database operations that dominate request latency. For the OTel Collector, figures such as 1-3% CPU and 50-100MB of additional memory are a reasonable starting expectation rather than a guarantee: the footprint grows with span volume, and tail-based sampling buffers every trace in memory for the decision_wait window. Measure in your own environment before committing to capacity numbers.
Can I trace messages through legacy HL7 v2 systems that I cannot modify?
You can instrument at the integration engine level (Mirth Connect, Rhapsody) without modifying the source or destination systems. The integration engine creates spans for message receipt and delivery, providing visibility into the pipeline even if the endpoints are black boxes. In practice this covers most of the observability need. For the remainder (what happens inside the endpoint system), you need endpoint-side instrumentation or inference from response timing.
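Inference from response timing can be as simple as pairing the outbound order with the inbound result at the integration engine. The Python sketch below assumes you have recorded, per order ID, when the order was delivered to the endpoint and when its result came back; the dictionary shapes and function name are illustrative.

```python
from datetime import datetime

def inferred_endpoint_seconds(orders_sent: dict, results_received: dict) -> dict:
    """Estimate time spent inside a black-box endpoint, per order ID.

    orders_sent:      {order_id: ISO timestamp the order was delivered to the endpoint}
    results_received: {order_id: ISO timestamp the result arrived back at the engine}
    Orders without a result yet are skipped.
    """
    inferred = {}
    for order_id, sent_at in orders_sent.items():
        received_at = results_received.get(order_id)
        if received_at is None:
            continue  # still pending inside the endpoint
        delta = datetime.fromisoformat(received_at) - datetime.fromisoformat(sent_at)
        inferred[order_id] = delta.total_seconds()
    return inferred

sent = {"ORD-1": "2024-03-01T08:00:05", "ORD-2": "2024-03-01T08:10:00"}
received = {"ORD-1": "2024-03-01T09:45:00"}
print(inferred_endpoint_seconds(sent, received))
```

The estimate includes network transit and any queueing on the endpoint side, so it is an upper bound on the endpoint's own processing time.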
Conclusion
Healthcare IT is entering the observability era. The shift from "is it up?" to "what happened to this patient's data?" requires distributed tracing, and OpenTelemetry is the vendor-neutral standard that makes it possible without lock-in to any specific observability backend.
The key patterns covered in this article — FHIR server interceptors, HL7 trace context propagation, clinical workflow spans, healthcare attribute conventions, and OTel Collector configuration — form a complete observability foundation for healthcare systems. Start with your FHIR server (the easiest win), add HL7 pipeline tracing (the highest-value win), and build toward end-to-end clinical workflow visibility.
For teams building healthcare platforms that need production-grade observability, reach out to our engineering team. We help healthtech companies and health systems build the integration and observability infrastructure that modern clinical operations require.



