Kafka Network Partitions and Split-Brain Failures

Understand Kafka network partition failures, split-brain scenarios, and unclean leader election, plus how to detect ISR shrinkage and prevent data loss.

Stéphane Derosiaux · June 13, 2025

Network partitions are inevitable. Kafka handles them better than most systems, but only if you configure your cluster correctly.

I've seen teams lose data because they didn't understand ISR shrinkage. The failure mode is subtle, and the defaults don't protect you.

"We lost 500 messages during a network partition. Our topic had min.insync.replicas=1. After setting it to 2, we've had zero data loss incidents."

Platform Engineer at a fintech company

How Kafka Handles Partitions

When a broker is isolated:

  1. After replica.lag.time.max.ms (default 30s), followers leave ISR
  2. If the leader is isolated, a new leader is elected from remaining ISR
  3. Producers with acks=all block until enough replicas acknowledge

# Check for degraded replication
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

Any output means replication is degraded. Alert on this immediately. Set up proactive alerting to catch ISR shrinkage before it causes data loss.
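
As a minimal sketch of such an alert (assuming kafka-topics.sh is on the PATH, and using a placeholder notify-oncall command to stand in for your real paging integration), a scheduled check could look like this:

# Hypothetical cron-style check: page if any partition is under-replicated
UNDER_REPLICATED=$(kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions)

if [ -n "$UNDER_REPLICATED" ]; then
  # notify-oncall is a placeholder for your alerting hook (PagerDuty, Slack, etc.)
  echo "$UNDER_REPLICATED" | notify-oncall "Kafka replication degraded"
fi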

The Split-Brain That Causes Data Loss

Here's the dangerous sequence with min.insync.replicas=1:

  1. Two followers fall behind; leader shrinks ISR to itself
  2. Network isolates the leader
  3. Old leader accepts writes before detecting isolation
  4. Controller elects a new leader; new leader also accepts writes
  5. Partition heals; old leader's writes are truncated

[Partition payments-0] Truncating log to offset 15000 (was 15500)
# 500 messages lost

Fix: Set min.insync.replicas=2:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name payments \
  --add-config min.insync.replicas=2
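
To confirm the override took effect, describe the topic's configuration:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name payments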

Tradeoff: With min.insync.replicas=2 and 3 replicas, losing 2 brokers makes the partition unavailable for writes. You're trading availability for consistency.

Unclean Leader Election

When all ISR replicas are unavailable, Kafka chooses:

  1. Wait for ISR member (no data loss, partition offline)
  2. Elect out-of-sync replica (data loss, partition available)

Controlled by unclean.leader.election.enable (default: false since Kafka 0.11).

Keep it disabled for financial transactions, audit logs, and anything else that can't tolerate gaps.

Enable it for metrics, telemetry, and other topics that can be rebuilt.
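
Where the tradeoff is acceptable, the setting can be applied per topic rather than cluster-wide; a sketch, assuming a hypothetical metrics topic:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name metrics \
  --add-config unclean.leader.election.enable=true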

Configuration for Durability

# Topic
replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Producer
acks=all
enable.idempotence=true

Why this works:

  • A replication factor of 3 tolerates one broker failure
  • min.insync.replicas=2 ensures writes on 2+ brokers
  • acks=all waits for all ISR replicas
  • Idempotent producer prevents duplicates
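
For new topics, the same durability settings can be applied at creation time. A minimal sketch, assuming an illustrative payments topic with 6 partitions:

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic payments --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false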

Metrics to Alert On

Metric                     | Warning       | Critical
UnderReplicatedPartitions  | > 0 for 1 min | > 0 for 5 min
OfflinePartitionsCount     | > 0           | > 0
ActiveControllerCount      | != 1          | != 1
IsrShrinksPerSec           | > 1/min       | > 5/min
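
These are broker JMX metrics. As a sketch for spot-checking one of them from the command line (assuming the broker exposes JMX on port 9999; the JmxTool class ships with Kafka, though its package name varies between versions):

# Read UnderReplicatedPartitions once over JMX
kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
  --one-time true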

NotEnoughReplicasException

Producers fail with an error like the following when fewer than min.insync.replicas replicas are in sync:

org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.

This is Kafka doing its job—refusing writes that can't be safely replicated. Check ISR state and broker health.
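
To find the affected partitions quickly, recent Kafka versions can list every partition that has dropped below its min.insync.replicas:

# List partitions currently below min.insync.replicas
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions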

Network partition handling is where Kafka's distributed nature becomes visible. The defaults favor safety over availability, which is correct for most workloads.

Book a demo to see how Conduktor Console provides real-time ISR health and partition status monitoring.