# Multi-Region Kafka: Active-Active vs Active-Passive
A comparison of Kafka DR patterns: MirrorMaker 2 setup, offset translation, conflict resolution, and when each architecture makes sense.

Your Kafka cluster will fail. Region outages happen. The question is whether you lose minutes or hours of data, and whether recovery takes seconds or days.
I've seen both extremes. A retail company lost four hours of order data because their DR test was a checkbox exercise. A fintech failed over in under a minute because they practiced quarterly. The difference wasn't technology—it was architecture choice and operational discipline.
Multi-region Kafka falls into two patterns: active-passive (one cluster serves traffic, one stands by) and active-active (both serve traffic). Each has different tradeoffs.
> "We thought active-passive was simpler until our first real failover. Updating configs across 40 services took longer than the actual outage."
>
> SRE Lead at a logistics company
## Active-Passive: The Starting Point
One primary cluster handles production traffic. A secondary cluster receives replicated data and waits.
```
US-EAST (Primary) ──MirrorMaker 2──> US-WEST (Standby)
        ▲                                  ▲
   All traffic                        No traffic
```

When the primary fails, you switch traffic to the secondary. Consumers resume from replicated offsets.
Basic MirrorMaker 2 configuration:

```properties
clusters = us-east, us-west
us-east.bootstrap.servers = kafka-us-east:9092
us-west.bootstrap.servers = kafka-us-west:9092

us-east->us-west.enabled = true
us-east->us-west.topics = orders, payments, events

# Critical for failover
us-east->us-west.sync.group.offsets.enabled = true
```

Topics appear on the secondary with a cluster prefix: `us-east.orders`, `us-east.payments`.
## The Offset Translation Problem

When MirrorMaker replicates messages, the target cluster assigns new offsets. A message at offset 1000 in `orders` might land at offset 998 in `us-east.orders`.

The checkpoint connector maintains a mapping between the two. With `sync.group.offsets.enabled=true`, MirrorMaker writes translated offsets into the target cluster's `__consumer_offsets` topic.

**Warning:** Offset sync lags by `emit.checkpoints.interval.seconds` (default 60s). In a failover, consumers may reprocess messages produced within the last checkpoint interval.
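The checkpoint mapping can be pictured as an ordered lookup table. The Python sketch below is a simulation (the data and function names are invented, not the MirrorMaker API) showing why a failed-over consumer may reprocess up to one checkpoint interval of messages:

```python
from bisect import bisect_right

# Hypothetical checkpoints: (upstream_offset, downstream_offset) pairs,
# emitted periodically by the checkpoint connector, in offset order.
checkpoints = [(0, 0), (500, 498), (1000, 998)]

def translate_offset(upstream_committed: int) -> int:
    """Return the downstream offset a failed-over consumer resumes from:
    the mapping from the latest checkpoint at or before the commit."""
    idx = bisect_right([up for up, _ in checkpoints], upstream_committed) - 1
    if idx < 0:
        return 0  # no checkpoint yet: resume from the beginning
    return checkpoints[idx][1]

# The consumer had committed offset 1400 upstream, but the last checkpoint
# was taken at upstream offset 1000, so it resumes at downstream 998 and
# reprocesses everything produced after that checkpoint.
print(translate_offset(1400))  # 998
print(translate_offset(700))   # 498
```

In real deployments, the `connect-mirror-client` library's `RemoteClusterUtils.translateOffsets` performs the equivalent lookup against the checkpoint topic for you.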
## Active-Passive Failover
When you fail over:
1. Stop producers on the primary (if reachable)
2. Wait for replication to catch up
3. Update client configurations to point to the secondary
4. Restart consumers
Step 3 is where it hurts. Updating dozens of services—environment variables, Kubernetes secrets, config files—takes time. The coordination complexity hits when you can least afford it.
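One mitigation is to put the bootstrap address behind a single level of indirection, so failover is one change instead of forty. A minimal sketch of the idea (the environment variable and cluster names here are invented for illustration):

```python
import os

# Each service resolves the cluster through one shared setting
# (an env var, DNS alias, or config service) instead of hard-coding
# a regional address in 40 places.
CLUSTERS = {
    "us-east": "kafka-us-east:9092",
    "us-west": "kafka-us-west:9092",
}

def bootstrap_servers() -> str:
    """Read the active cluster from a single shared knob; flipping
    ACTIVE_KAFKA_CLUSTER fails over every service at once."""
    active = os.environ.get("ACTIVE_KAFKA_CLUSTER", "us-east")
    return CLUSTERS[active]

print(bootstrap_servers())  # kafka-us-east:9092 by default
os.environ["ACTIVE_KAFKA_CLUSTER"] = "us-west"
print(bootstrap_servers())  # kafka-us-west:9092 after the flip
```

A DNS alias or service-discovery entry achieves the same thing without redeploying anything; the point is that the region decision lives in exactly one place.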
## Active-Active: Both Clusters Serve Traffic
Active-active runs both clusters simultaneously. US-East producers write to the US-East cluster. US-West producers write to US-West. MirrorMaker replicates bidirectionally.
```
US-EAST (Active) ←──MirrorMaker 2──→ US-WEST (Active)
       ▲                                  ▲
 US-East traffic                   US-West traffic
```

Each cluster has local topics (`orders`) and replicated topics from the other region (`us-west.orders`).
```properties
us-east->us-west.enabled = true
us-west->us-east.enabled = true
us-east->us-west.topics = orders, payments
us-west->us-east.topics = orders, payments
```

MirrorMaker prevents infinite replication loops by not re-replicating topics that already carry a cluster prefix.
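The loop-prevention rule is simple to state: a replication flow copies a topic only if its name does not already carry a cluster prefix. A sketch of that rule in miniature (this is an illustration, not MirrorMaker's actual code, which delegates the decision to a pluggable `ReplicationPolicy`):

```python
CLUSTER_ALIASES = {"us-east", "us-west"}

def should_replicate(topic: str) -> bool:
    """Skip topics that already carry a cluster prefix, so a record
    crosses regions at most once and never loops back."""
    prefix = topic.split(".", 1)[0]
    return prefix not in CLUSTER_ALIASES

print(should_replicate("orders"))          # True: local topic, replicate it
print(should_replicate("us-west.orders"))  # False: already a replica
```

With the default policy, remote topics are named `<clusterAlias>.<topic>`, which is exactly what makes this prefix check possible.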
## The Conflict Problem

Active-active introduces write conflicts:

```
T=0: US-East writes order-123 status="SHIPPED"
T=0: US-West writes order-123 status="CANCELLED"
T=1: Both messages replicate to both clusters
```

Both clusters now have both messages. Which status is correct?
MirrorMaker does not resolve conflicts. It replicates both. Your application must handle this.
## Conflict Resolution Strategies

**Regional authority:** Designate one region as authoritative for specific entities. Orders prefixed `US-*` always process in US-East. This avoids conflicts entirely.

**Last-writer-wins:** Include timestamps in events and have consumers keep the latest version. Simple, but it can silently drop legitimate updates if clocks drift between regions.

**Idempotent events:** Instead of `SET status=SHIPPED`, emit `ORDER_SHIPPED` events and let downstream consumers derive the final state from the event sequence.
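Last-writer-wins fits in a few lines; the event shape below is hypothetical, not a library API. Note the built-in deterministic tie-break: without it, two clusters could pick different winners for identical timestamps, and the clock-drift caveat above still applies.

```python
def resolve_lww(events):
    """Keep the event with the greatest timestamp; break exact
    timestamp ties deterministically by region name so both
    clusters converge on the same winner."""
    return max(events, key=lambda e: (e["ts_ms"], e["region"]))

events = [
    {"region": "us-east", "ts_ms": 1700000000000, "status": "SHIPPED"},
    {"region": "us-west", "ts_ms": 1700000000500, "status": "CANCELLED"},
]
winner = resolve_lww(events)
print(winner["status"])  # CANCELLED: the later write wins
```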
## Comparing the Patterns
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| RPO | Replication lag (seconds to minutes) | Near-zero for regional data |
| RTO | Minutes to hours (config changes) | Seconds (traffic already flows) |
| Complexity | Lower normally, higher during failover | Higher always |
| Consistency | Strong (single source of truth) | Eventual (conflicts possible) |
## When to Use Active-Passive
- Application can't handle conflicts
- Single source of truth required
- Failover is rare (quarterly or less)
- Team lacks operational maturity for active-active
## When to Use Active-Active
- Users geographically distributed
- Regional autonomy matters
- Data can partition by region or entity
- Application designed for eventual consistency
## Testing Your DR

Set up monitoring alerts for replication lag and consumer health so you detect issues before they become outages. And don't wait for a real outage to find out whether failover works. Run quarterly drills:
- Simulate primary failure
- Time detection and decision
- Execute failover
- Verify consumers resume without data loss
- Measure actual RTO and RPO
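Recording a few timestamps during the drill turns RTO and RPO into concrete numbers rather than impressions. A minimal sketch of the arithmetic (the field names are invented):

```python
from datetime import datetime, timedelta

def measure_drill(failure_at, last_replicated_event_at, consumers_resumed_at):
    """RPO: data produced after the last replicated event is lost.
    RTO: time from the failure until consumers are processing again."""
    return {
        "rpo": failure_at - last_replicated_event_at,
        "rto": consumers_resumed_at - failure_at,
    }

result = measure_drill(
    failure_at=datetime(2024, 5, 1, 12, 0, 0),
    last_replicated_event_at=datetime(2024, 5, 1, 11, 59, 20),
    consumers_resumed_at=datetime(2024, 5, 1, 12, 6, 0),
)
print(result["rpo"])  # 0:00:40 — data at risk
print(result["rto"])  # 0:06:00 — time to resume
```

Compare these measured numbers against the targets you promised; a drill that doesn't produce them was a checkbox exercise.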
Monitor MirrorMaker's heartbeat topic. If heartbeats stop, replication is broken:

```bash
kafka-console-consumer --bootstrap-server kafka-us-west:9092 \
  --topic heartbeats --timeout-ms 30000
```

The technology works. Whether your DR works depends on whether you've practiced.
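An automated alert on heartbeat staleness is just a timestamp comparison; the 30-second threshold below is an assumption to tune, not a MirrorMaker default:

```python
STALE_THRESHOLD_S = 30  # assumed alert threshold; MM2's default
                        # heartbeat interval is much shorter (~1s)

def replication_healthy(last_heartbeat_ts: float, now: float) -> bool:
    """True while the newest heartbeat record is recent; a stale
    heartbeat means the replication flow itself is down."""
    return (now - last_heartbeat_ts) <= STALE_THRESHOLD_S

print(replication_healthy(1000.0, 1010.0))  # True: heartbeat 10s old
print(replication_healthy(1000.0, 1120.0))  # False: 120s silence, alert
```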
Book a demo to see how Conduktor Console provides visibility into replication lag and consumer health across multi-region deployments.