Kafka Network Partitions and Split-Brain Failures

Understand Kafka network partition failures, split-brain scenarios, and unclean leader election, plus how to detect ISR shrinkage and prevent data loss.

Stéphane Derosiaux · June 13, 2025

Network partitions are inevitable. Kafka handles them better than most systems, but only if you configure your cluster correctly.

I've seen teams lose data because they didn't understand ISR shrinkage. The failure mode is subtle, and the defaults don't protect you.

"We lost 500 messages during a network partition. Our topic had min.insync.replicas=1. After setting it to 2, we've had zero data loss incidents."

Platform Engineer at a fintech company

How Kafka Handles Partitions

When a broker is isolated:

  1. After replica.lag.time.max.ms (default 30s), followers leave ISR
  2. If the leader is isolated, a new leader is elected from remaining ISR
  3. Producers with acks=all block until enough replicas acknowledge

# Check for degraded replication
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

Any output means replication is degraded. Alert on this immediately. Set up proactive alerting to catch ISR shrinkage before it causes data loss.
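
As a minimal sketch of such an alert (assuming kafka-topics.sh is on the PATH, and using a placeholder notify-oncall command to stand in for your real paging integration), a scheduled check could look like this:

# Hypothetical cron-style check: page if any partition is under-replicated
UNDER_REPLICATED=$(kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions)

if [ -n "$UNDER_REPLICATED" ]; then
  # notify-oncall is a placeholder for your alerting hook (PagerDuty, Slack, etc.)
  echo "$UNDER_REPLICATED" | notify-oncall "Kafka replication degraded"
fi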

The Split-Brain That Causes Data Loss

Here's the dangerous sequence with min.insync.replicas=1:

  1. Two followers fall behind; leader shrinks ISR to itself
  2. Network isolates the leader
  3. Old leader accepts writes before detecting isolation
  4. Controller elects a new leader; new leader also accepts writes
  5. Partition heals; old leader's writes are truncated

[Partition payments-0] Truncating log to offset 15000 (was 15500)
# 500 messages lost

Fix: Set min.insync.replicas=2:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name payments \
  --add-config min.insync.replicas=2
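
To confirm the override took effect, describe the topic's configuration:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name payments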

Tradeoff: With min.insync.replicas=2 and 3 replicas, losing 2 brokers makes the partition unavailable for writes. You're trading availability for consistency.

Unclean Leader Election

When all ISR replicas are unavailable, Kafka chooses:

  1. Wait for ISR member (no data loss, partition offline)
  2. Elect out-of-sync replica (data loss, partition available)

Controlled by unclean.leader.election.enable (default: false since Kafka 0.11).

Keep it disabled for financial transactions, audit logs, and anything else that can't tolerate gaps.

Enable it for metrics, telemetry, and other topics that can be rebuilt.
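
Where the tradeoff is acceptable, the setting can be applied per topic rather than cluster-wide; a sketch, assuming a hypothetical metrics topic:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name metrics \
  --add-config unclean.leader.election.enable=true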

Configuration for Durability

# Topic
replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Producer
acks=all
enable.idempotence=true

Why this works:

  • A replication factor of 3 tolerates one broker failure
  • min.insync.replicas=2 ensures writes on 2+ brokers
  • acks=all waits for all ISR replicas
  • Idempotent producer prevents duplicates
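
For new topics, the same durability settings can be applied at creation time. A minimal sketch, assuming an illustrative payments topic with 6 partitions:

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic payments --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false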

Metrics to Alert On

Metric                     | Warning       | Critical
UnderReplicatedPartitions  | > 0 for 1 min | > 0 for 5 min
OfflinePartitionsCount     | > 0           | > 0
ActiveControllerCount      | != 1          | != 1
IsrShrinksPerSec           | > 1/min       | > 5/min
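
These are broker JMX metrics. As a sketch for spot-checking one of them from the command line (assuming the broker exposes JMX on port 9999; the JmxTool class ships with Kafka, though its package name varies between versions):

# Read UnderReplicatedPartitions once over JMX
kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions \
  --one-time true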

NotEnoughReplicasException

Producers fail with an error like the following when fewer than min.insync.replicas replicas are in sync:

org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.

This is Kafka doing its job—refusing writes that can't be safely replicated. Check ISR state and broker health.
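
To find the affected partitions quickly, recent Kafka versions can list every partition that has dropped below its min.insync.replicas:

# List partitions currently below min.insync.replicas
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions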

Network partition handling is where Kafka's distributed nature becomes visible. The defaults favor safety over availability, which is correct for most workloads.

Book a demo to see how Conduktor Console provides real-time ISR health and partition status monitoring.