Kafka Network Partitions and Split-Brain Failures
Understand Kafka network partition failures, split-brain scenarios, and unclean leader election, plus how ISR shrinkage happens and how to prevent data loss.

Network partitions are inevitable. Kafka handles them better than most systems, but only if you configure your cluster correctly.
I've seen teams lose data because they didn't understand ISR shrinkage. The failure mode is subtle, and the defaults don't protect you.
> "We lost 500 messages during a network partition. Our topic had `min.insync.replicas=1`. After setting it to 2, we've had zero data loss incidents."
>
> (Platform Engineer at a fintech company)
How Kafka Handles Partitions
When a broker is isolated:
- After `replica.lag.time.max.ms` (default 30s), lagging followers are removed from the ISR
- If the leader is isolated, a new leader is elected from the remaining ISR
- Producers with `acks=all` block until enough replicas acknowledge
```bash
# Check for degraded replication
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```

Any output means replication is degraded. Alert on this immediately, and set up proactive alerting to catch ISR shrinkage before it causes data loss.
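A minimal sketch of such a proactive check, suitable for cron; the alert hook at the end is a placeholder, not a Kafka feature:

```bash
#!/usr/bin/env bash
# Page if any partition is under-replicated. Any output from
# --under-replicated-partitions means ISR < replica count somewhere.
URP=$(kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions)
if [ -n "$URP" ]; then
  # Replace with your real alerting hook (PagerDuty, Slack webhook, etc.)
  echo "ALERT: under-replicated partitions detected:" >&2
  echo "$URP" >&2
  exit 1
fi
```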
The Split-Brain That Causes Data Loss
Here's the dangerous sequence with `min.insync.replicas=1`:
- Two followers fall behind; leader shrinks ISR to itself
- Network isolates the leader
- Old leader accepts writes before detecting isolation
- Controller elects a new leader; new leader also accepts writes
- Partition heals; old leader's writes are truncated
```
[Partition payments-0] Truncating log to offset 15000 (was 15500)
# 500 messages lost
```

Fix: set `min.insync.replicas=2`:
```bash
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name payments \
  --add-config min.insync.replicas=2
```

Tradeoff: with `min.insync.replicas=2` and 3 replicas, losing 2 brokers makes the partition unavailable for writes. You're trading availability for consistency.
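To confirm the override took effect, the same tool has a describe form:

```bash
# List per-topic config overrides; min.insync.replicas=2 should appear
kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name payments
```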
Unclean Leader Election
When all ISR replicas are unavailable, Kafka must choose between:
- Waiting for an ISR member to return (no data loss, but the partition stays offline)
- Electing an out-of-sync replica (possible data loss, but the partition stays available)
Controlled by `unclean.leader.election.enable` (default: false since Kafka 0.11).
Keep it disabled for financial transactions, audit logs, and anything that can't handle gaps.
Enable it for metrics, telemetry, and topics that can be rebuilt.
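For a topic in the second category, the setting can be applied per topic rather than cluster-wide. A sketch, assuming a rebuildable topic named metrics:

```bash
# Allow an out-of-sync replica to take leadership for this topic only.
# Gaps are possible; use only for data you can afford to rebuild.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name metrics \
  --add-config unclean.leader.election.enable=true
```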
Configuration for Durability
```properties
# Topic
replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Producer
acks=all
enable.idempotence=true
```

Why this works:
- A replication factor of 3 tolerates 1 broker failure
- `min.insync.replicas=2` ensures writes land on at least 2 brokers
- `acks=all` waits for all ISR replicas to acknowledge
- An idempotent producer prevents duplicates on retry
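Putting the topic side together at creation time (the topic name and partition count here are illustrative):

```bash
# Create a durable topic with the settings above baked in
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic payments --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false
```

The producer properties apply to any client; with the console producer they can be passed directly:

```bash
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic payments \
  --producer-property acks=all \
  --producer-property enable.idempotence=true
```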
Metrics to Alert On
| Metric | Warning | Critical |
|---|---|---|
| UnderReplicatedPartitions | > 0 for 1 min | > 0 for 5 min |
| OfflinePartitionsCount | > 0 | > 0 |
| ActiveControllerCount | != 1 | != 1 |
| IsrShrinksPerSec | > 1/min | > 5/min |
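Each of these maps to a broker JMX MBean. A polling sketch using Kafka's bundled JmxTool, assuming JMX is exposed on port 9999:

```bash
# Sample UnderReplicatedPartitions every 60s from a broker's JMX endpoint.
# Other MBeans from the table:
#   kafka.controller:type=KafkaController,name=OfflinePartitionsCount
#   kafka.controller:type=KafkaController,name=ActiveControllerCount
#   kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
kafka-run-class.sh kafka.tools.JmxTool \
  --object-name 'kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions' \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --reporting-interval 60000
```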
NotEnoughReplicasException
Producers hit this when a partition's ISR has fewer members than `min.insync.replicas` requires. This is Kafka doing its job: refusing writes that can't be safely replicated. Check ISR state and broker health.
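One way to reproduce it safely in a test cluster, assuming the 3-replica payments topic above with `min.insync.replicas=2`: stop two of the three brokers, then produce with `acks=all`:

```bash
# With only 1 of 3 brokers up, the ISR drops below min.insync.replicas=2,
# so acks=all writes are rejected rather than accepted unsafely.
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic payments \
  --producer-property acks=all
# Typical producer-side error:
#   org.apache.kafka.common.errors.NotEnoughReplicasException:
#   Messages are rejected since there are fewer in-sync replicas than required.
```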
Network partition handling is where Kafka's distributed nature becomes visible. The defaults favor safety over availability, which is correct for most workloads.
Book a demo to see how Conduktor Console provides real-time ISR health and partition status monitoring.