# Rolling Restarts Without Downtime
Zero-downtime Kafka rolling restarts. Pre-flight checks, ISR verification, controlled shutdown, and KRaft controller order.

Rolling restarts are routine: patching, config changes, version upgrades. They're also where clusters break. A misconfigured restart order or skipped health check can cascade into hours of downtime.
I've run hundreds of rolling restarts across production clusters. The difference between smooth operations and incidents is verification at every step.
> We automated our rolling restart procedure after a manual restart took down 3 partitions for 2 hours. Now we can't restart without the checklist passing.
>
> *SRE at a streaming platform*
## Pre-Flight Checklist
Don't touch anything until this passes.
### No Under-Replicated Partitions

```bash
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
# Expected: empty output
```

If URPs exist, investigate before restarting anything: taking down a broker while URPs exist risks removing the only healthy replica of a partition. Broker health dashboards make it straightforward to identify which brokers carry URPs.
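The quote above mentions automating the checklist. A minimal sketch of such a pre-flight gate, assuming the Kafka CLI is on the `PATH` (the helper names and the "any non-blank line means a URP" parsing rule are illustrative assumptions, not official tooling):

```python
import subprocess

def urp_output_is_clean(output: str) -> bool:
    """True when --under-replicated-partitions printed nothing,
    i.e. every partition's ISR matches its full replica set."""
    return not any(line.strip() for line in output.splitlines())

def preflight_urp_check(bootstrap: str = "localhost:9092") -> bool:
    """Run the URP check via the Kafka CLI; gate the restart on empty output."""
    result = subprocess.run(
        ["kafka-topics.sh", "--bootstrap-server", bootstrap,
         "--describe", "--under-replicated-partitions"],
        capture_output=True, text=True, check=True)
    return urp_output_is_clean(result.stdout)
```

An automation wrapper would call `preflight_urp_check()` before each broker restart and refuse to proceed on `False`.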
### All Brokers Responding

```bash
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
```

For KRaft:

```bash
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
# All controllers should appear in CurrentVoters
```

### ISR > min.insync.replicas
| RF | min.insync.replicas | Safe to Restart? |
|---|---|---|
| 3 | 2 | Yes, if ISR = 3 |
| 3 | 2 | NO, if ISR = 2 |
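The table's rule reduces to a single predicate: restarting a broker that holds an in-sync replica shrinks the ISR by one, and the remainder must still satisfy `min.insync.replicas`. A sketch (the function name is illustrative):

```python
def safe_to_restart(isr_size: int, min_insync_replicas: int) -> bool:
    """A restart removes one replica from the ISR; producers with acks=all
    need the remaining ISR to still meet min.insync.replicas."""
    return isr_size - 1 >= min_insync_replicas

# Matches the table for RF=3, min.insync.replicas=2:
#   safe_to_restart(3, 2) -> True   (ISR drops to 2, still >= 2)
#   safe_to_restart(2, 2) -> False  (ISR would drop to 1)
```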
## Restart Order
For KRaft:

- Dedicated mode: restart broker-only nodes first, then follower controllers, then the active controller last.
- Combined mode: restart non-leader combined nodes first, then the active controller/leader node last.
Identify the active controller:

```bash
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
# LeaderId: 1  ← Active controller
```

Restarting non-leader controllers first lets the active controller continue coordinating.
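The ordering rule can be sketched as a stable sort over node roles; the role names and the `LeaderId` lookup are assumptions for illustration:

```python
def restart_order(nodes, active_controller_id):
    """Order KRaft nodes for a rolling restart: broker-only nodes first,
    then follower controllers, then the active controller last."""
    def rank(node):
        node_id, roles = node
        if node_id == active_controller_id:
            return 2                                # active controller: always last
        return 1 if "controller" in roles else 0    # followers after brokers
    return sorted(nodes, key=rank)

# Hypothetical cluster: nodes 1, 2, 5 are controllers, LeaderId is 1
cluster = [(1, {"controller"}), (2, {"controller"}), (3, {"broker"}),
           (4, {"broker"}), (5, {"controller"})]
order = restart_order(cluster, active_controller_id=1)
# Brokers 3 and 4 first, followers 2 and 5 next, active controller 1 last
```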
## Controlled Shutdown
Let leaders migrate before the broker stops:
```properties
controlled.shutdown.enable=true  # Default
```

```bash
kill -15 <kafka-pid>  # SIGTERM triggers a graceful shutdown
# Never use kill -9: it bypasses controlled shutdown
```

Watch the broker log for:
```
[INFO] Controlled shutdown succeeded
```

## Wait for Recovery
Never restart the next broker while URPs exist.
```bash
watch -n 5 'kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions'
# Wait until the output is empty
```

A broker that must re-replicate 1 TB of data over a 100 MB/s network needs roughly 3 hours to catch up.
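That rule of thumb is just lag divided by replication throughput; a sketch of the arithmetic (the function name is illustrative):

```python
def catch_up_seconds(lag_bytes: float, net_bytes_per_sec: float) -> float:
    """Worst-case re-replication time if the restarted broker must
    re-fetch its full lag over the replication network."""
    return lag_bytes / net_bytes_per_sec

# 1 TB of lag over a 100 MB/s link:
secs = catch_up_seconds(1e12, 100e6)   # 10,000 s
hours = secs / 3600                    # ~2.8 h, matching the ~3 h rule of thumb
```

In practice, throughput is further limited by `num.replica.fetchers` and competing traffic, so treat the result as a lower bound.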
After ISR recovery, optionally run preferred leader election:
```bash
kafka-leader-election.sh --bootstrap-server localhost:9092 --election-type preferred --all-topic-partitions
```

## Version Upgrades
Version upgrades require two rolling restarts:

- First rolling restart: update binaries, keep the old protocol version
- After all brokers are upgraded, update the protocol version
- Second rolling restart to apply it
```properties
# First restart
inter.broker.protocol.version=3.6

# Second restart (after observation period)
inter.broker.protocol.version=3.7
```

For KRaft:

```bash
kafka-features.sh --bootstrap-server localhost:9092 upgrade --release-version 3.7
```

Wait at least 3 days before finalizing the metadata version upgrade. This observation period lets you detect latent issues while retaining rollback capability.
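The critical invariant is that the protocol version may only be raised after restart one has put new binaries on every broker. A hedged sketch of that guard (the helper name and version-tuple comparison are assumptions):

```python
def can_bump_protocol(broker_binary_versions, target: str) -> bool:
    """True only when every broker already runs a binary at or above the
    target version, i.e. the first rolling restart is complete."""
    def key(v: str):
        return tuple(int(p) for p in v.split("."))
    return all(key(v) >= key(target) for v in broker_binary_versions)

# One broker still on 3.6.2: do not bump inter.broker.protocol.version yet
# can_bump_protocol(["3.7.0", "3.7.0", "3.6.2"], "3.7") -> False
# can_bump_protocol(["3.7.0", "3.7.0", "3.7.0"], "3.7") -> True
```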
## Key Metrics During Restart
| Metric | Normal | Alert If |
|---|---|---|
| UnderReplicatedPartitions | 0 | > 0 for > 5min |
| OfflinePartitionsCount | 0 | > 0 |
| ActiveControllerCount | 1 | 0 or > 1 |
| UncleanLeaderElectionsPerSec | 0 | > 0 (data loss) |
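The alert conditions in the table can be evaluated mechanically; a sketch, assuming the caller supplies current metric values and how long URPs have persisted (the dictionary keys mirror the JMX metric names, the `urp_duration_sec` field is an illustrative assumption):

```python
def fired_alerts(metrics: dict) -> list:
    """Return the names of restart metrics that breach the table's
    alert conditions."""
    fired = []
    # URPs only alert after persisting past 5 minutes
    if (metrics.get("UnderReplicatedPartitions", 0) > 0
            and metrics.get("urp_duration_sec", 0) > 300):
        fired.append("UnderReplicatedPartitions")
    if metrics.get("OfflinePartitionsCount", 0) > 0:
        fired.append("OfflinePartitionsCount")
    # Exactly one active controller is healthy; 0 or >1 is an incident
    if metrics.get("ActiveControllerCount", 1) != 1:
        fired.append("ActiveControllerCount")
    # Any unclean election implies possible data loss
    if metrics.get("UncleanLeaderElectionsPerSec", 0) > 0:
        fired.append("UncleanLeaderElectionsPerSec")
    return fired
```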
Book a demo to see how Conduktor Console provides real-time ISR status and broker health dashboards.