Rolling Restarts Without Downtime

Zero-downtime Kafka rolling restarts. Pre-flight checks, ISR verification, controlled shutdown, and KRaft controller order.

Stéphane Derosiaux · December 10, 2025

Rolling restarts are routine: patching, config changes, version upgrades. They're also where clusters break. A misconfigured restart order or skipped health check can cascade into hours of downtime.

I've run hundreds of rolling restarts across production clusters. The difference between smooth operations and incidents is verification at every step.

We automated our rolling restart procedure after a manual restart took down 3 partitions for 2 hours. Now we can't restart without the checklist passing.

SRE at a streaming platform

Pre-Flight Checklist

Don't touch anything until this passes.

No Under-Replicated Partitions

kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Expected: empty output

If URPs exist, investigate first. Restarting a broker while URPs exist risks taking down the only healthy replica. Broker health dashboards make identifying which brokers have URPs straightforward.
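
If you don't have a dashboard handy, the CLI can approximate it. A minimal sketch, assuming the default kafka-topics.sh output format where each partition line contains "Leader: <id>":

kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions \
  | grep -o 'Leader: [0-9]*' \
  | sort | uniq -c | sort -rn
# Output: "<count> Leader: <id>" per broker; investigate the top one first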

All Brokers Responding

kafka-broker-api-versions.sh --bootstrap-server localhost:9092

For KRaft:

kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
# All controllers in CurrentVoters
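
To gate automation on quorum health, here is a minimal check. It assumes describe --replication prints one row per quorum replica with a Status column, so exactly one row should contain "Leader" (the exact column layout varies slightly across Kafka versions):

leaders=$(kafka-metadata-quorum.sh --bootstrap-server localhost:9092 \
  describe --replication | grep -cw 'Leader')
if [ "$leaders" -ne 1 ]; then
  echo "Quorum unhealthy: expected exactly 1 leader row, saw $leaders" >&2
  exit 1
fi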

ISR > min.insync.replicas

RF | min.insync.replicas | Safe to Restart?
3  | 2                   | Yes, if ISR = 3
3  | 2                   | NO, if ISR = 2
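
A scripted version of this gate, as a sketch: it assumes a cluster-wide min.insync.replicas of 2 (MIN_ISR below) and the default --describe layout. Newer releases append Elr fields after the Isr list, so the parsing is approximate; honor per-topic overrides if you use them.

BROKER=${1:?usage: check-isr.sh <broker-id>}
MIN_ISR=2   # assumption: cluster-wide setting
kafka-topics.sh --bootstrap-server localhost:9092 --describe \
  | awk -v b="$BROKER" -v min="$MIN_ISR" '
      /Partition:/ {
        split($0, half, "Isr: ")
        sub(/[ \t].*/, "", half[2])        # keep only the comma-separated ISR ids
        n = split(half[2], isr, ",")
        in_isr = 0
        for (i = 1; i <= n; i++) if (isr[i] + 0 == b) in_isr = 1
        # Losing this broker would leave ISR below min.insync.replicas
        if (in_isr && n - 1 < min) print "UNSAFE:", $0
      }'

Run it with the broker ID you plan to restart; empty output means every partition that broker is in sync for can tolerate losing it.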

Restart Order

For KRaft dedicated mode: restart broker-only nodes first, then follower controllers, then the active controller last. For combined mode: restart non-leader combined nodes first, then the active controller/leader node last.

Identify active controller:

kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
# LeaderId: 1  ← Active controller

Restarting non-leader controllers first lets the active controller continue coordinating.
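
For automation, the order can be derived from the status output. A sketch assuming dedicated mode and the simple "CurrentVoters: [1,2,3]" formatting (newer releases print richer JSON on that line):

STATUS=$(kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status)
LEADER=$(printf '%s\n' "$STATUS" | awk -F': *' '/^LeaderId/ {print $2}')
VOTERS=$(printf '%s\n' "$STATUS" | sed -n 's/^CurrentVoters:[[:space:]]*\[\(.*\)\]/\1/p' | tr ',' ' ')
FOLLOWERS=""
for v in $VOTERS; do [ "$v" != "$LEADER" ] && FOLLOWERS="$FOLLOWERS $v"; done
echo "Restart brokers first, then controllers$FOLLOWERS, then active controller $LEADER"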

Controlled Shutdown

Let leaders migrate before the broker stops:

controlled.shutdown.enable=true  # Default
kill -15 <kafka-pid>  # SIGTERM for graceful shutdown
# Never use kill -9 — bypasses controlled shutdown

Watch for:

[INFO] Controlled shutdown succeeded
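
Scripted, that becomes: send SIGTERM, then wait for the process to actually exit before touching the next node. The pgrep pattern and the 300-second ceiling are assumptions; size the timeout to your partition counts.

KAFKA_PID=$(pgrep -f kafka.Kafka)   # assumption: one broker process per host
kill -15 "$KAFKA_PID"
for _ in $(seq 1 300); do
  if ! kill -0 "$KAFKA_PID" 2>/dev/null; then
    echo "Broker exited cleanly"
    break
  fi
  sleep 1
done
kill -0 "$KAFKA_PID" 2>/dev/null && echo "WARNING: broker still running after 300s" >&2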

Wait for Recovery

Never restart the next broker while URPs exist.

watch -n 5 'kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions'
# Wait until empty

A broker with 1 TB of data on a 100 MB/s network needs ~3 hours to catch up.
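
In an automated run, replace the interactive watch with a blocking poll. The 10-second interval and 4-hour ceiling below are assumptions sized to that kind of catch-up window:

deadline=$(( $(date +%s) + 4 * 3600 ))   # assumption: 4h ceiling for ISR catch-up
while [ -n "$(kafka-topics.sh --bootstrap-server localhost:9092 \
    --describe --under-replicated-partitions)" ]; do
  if [ "$(date +%s)" -gt "$deadline" ]; then
    echo "ISR recovery timed out" >&2
    exit 1
  fi
  sleep 10
done
echo "No URPs remaining; safe to restart the next broker"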

After ISR recovery, optionally run preferred leader election:

kafka-leader-election.sh --bootstrap-server localhost:9092 --election-type preferred --all-topic-partitions

Version Upgrades

Version upgrades require two rolling restarts:

  1. First rolling restart: update binaries, keep the old protocol version
  2. Once all brokers run the new binaries, bump the protocol version
  3. Second rolling restart to apply it

# First restart
inter.broker.protocol.version=3.6

# Second restart (after observation period)
inter.broker.protocol.version=3.7

For KRaft:

kafka-features.sh --bootstrap-server localhost:9092 upgrade --release-version 3.7

Wait at least 3 days before finalizing metadata version upgrade. This observation period lets you detect latent issues while retaining rollback capability.
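
To see where you stand during that window, kafka-features.sh can describe the current feature levels:

kafka-features.sh --bootstrap-server localhost:9092 describe
# Check the metadata.version row: FinalizedVersionLevel shows what is
# currently finalized; it only moves after the upgrade command above.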

Key Metrics During Restart

Metric                       | Normal | Alert If
UnderReplicatedPartitions    | 0      | > 0 for > 5 min
OfflinePartitionsCount       | 0      | > 0
ActiveControllerCount        | 1      | 0 or > 1
UncleanLeaderElectionsPerSec | 0      | > 0 (data loss)
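
These are JMX gauges on each broker. One way to watch the most important one from a shell, assuming JMX is exposed on port 9999 (note the JmxTool class lives at org.apache.kafka.tools.JmxTool in recent releases):

kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name 'kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions' \
  --reporting-interval 5000
# Prints the gauge every 5s; it must return to 0 before the next broker
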
Rolling restarts are routine, but "routine" and "safe" aren't synonyms. The difference is verification at every step.

Book a demo to see how Conduktor Console provides real-time ISR status and broker health dashboards.