# Disk Full: Emergency Recovery When Kafka Runs Out of Space
An emergency runbook for Kafka disk-full scenarios: immediate triage commands, safe segment deletion, recovery steps, and retention tuning to prevent recurrence.

Your broker just crashed with `java.io.IOException: No space left on device`. The logs show `Exit.halt(1)`. Kafka didn't shut down gracefully; it terminated immediately, skipping shutdown hooks entirely.
I've been paged for this exact scenario more times than I'd like to admit. The panic is real, but the fix is straightforward if you work through it systematically.
> Our disk-full incident turned into a 4-hour outage because we didn't have a runbook. Now we drill this quarterly.
>
> *SRE at a payments company*
## Assess First (2 Minutes)
Before touching anything, understand the scope.
```bash
# From any healthy broker; a timeout means the broker is down
kafka-broker-api-versions.sh --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092
```

SSH to the affected broker:

```bash
df -h
# /dev/sda1  500G  500G  0  100%  /var/kafka
du -sh /var/kafka/* | sort -rh | head -10
# 180G  /var/kafka/data/high-volume-topic-0
```

**Decision point:** If only one broker is down and the replication factor is >= 2, your cluster is still serving traffic. You have time.
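It also helps to know which partitions keep a replica on the down broker. A quick sketch, assuming the affected broker has id 3 and the standard `--describe` output layout:

```bash
# List partitions that have a replica on broker 3 (the down broker in this example)
kafka-topics.sh --bootstrap-server kafka2:9092 --describe \
  | awk '$7 == "Replicas:" && $8 ~ /(^|,)3(,|$)/'
```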
## Free Space Immediately
Pick the fastest option for your situation.
### Option A: Delete Old Segments (Fastest, Most Risk)
⚠️ **CRITICAL:** Stop the broker first. Deleting segment files while Kafka is running causes immediate data corruption and broker crashes. Always run `kafka-server-stop.sh` before proceeding.
```bash
# ONLY after the broker is stopped - verify with: ps aux | grep kafka
find /var/kafka/data -name "*.log" -mtime +7 -type f -delete
find /var/kafka/data -name "*.index" -mtime +7 -type f -delete
find /var/kafka/data -name "*.timeindex" -mtime +7 -type f -delete
```

Never delete the active segment (the newest `.log` file in each partition directory). Deleting it corrupts the partition.
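If you want to size the cleanup before running the destructive pass, a dry-run sketch (assuming GNU coreutils for `du --files0-from`):

```bash
# Dry run: total the space the 7-day-old segments would free, without deleting anything
find /var/kafka/data -name "*.log" -mtime +7 -type f -print0 \
  | du -ch --files0-from=- | tail -1
```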
### Option B: Reduce Retention Dynamically (Safer)
You can also adjust topic retention settings through Conduktor Console's UI.
```bash
kafka-configs.sh --bootstrap-server kafka2:9092 \
  --alter --entity-type topics --entity-name high-volume-topic \
  --add-config retention.ms=3600000,retention.bytes=10737418240
```

This sets 1-hour retention and 10 GB per partition. The retention check runs every 5 minutes by default (`log.retention.check.interval.ms=300000`), so space frees up within minutes rather than instantly.
| Setting | Emergency | Normal |
|---|---|---|
| `retention.ms` | 3600000 (1 h) | 604800000 (7 d) |
| `retention.bytes` | 10737418240 (10 GB) | -1 (unlimited) |
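The emergency override is easy to forget. Verifying that it took effect, and later removing it so the topic falls back to cluster defaults, might look like this:

```bash
# Confirm the emergency overrides are active
kafka-configs.sh --bootstrap-server kafka2:9092 \
  --describe --entity-type topics --entity-name high-volume-topic

# After recovery, drop the overrides so the topic reverts to cluster defaults
kafka-configs.sh --bootstrap-server kafka2:9092 \
  --alter --entity-type topics --entity-name high-volume-topic \
  --delete-config retention.ms,retention.bytes
```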
### Option C: Expand Disk (Cloud)
```bash
# AWS EBS
aws ec2 modify-volume --volume-id vol-xxxx --size 1000
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1
```
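`resize2fs` assumes an ext4 filesystem. If the volume is XFS, grow it through the mount point instead; a sketch assuming `/var/kafka` is the mount:

```bash
# XFS filesystems grow online via the mount point rather than the device
sudo xfs_growfs /var/kafka
```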
## Restart the Broker

Once you have 10-20% free space:
```bash
kafka-server-start.sh -daemon /etc/kafka/server.properties
tail -f /var/log/kafka/server.log
```
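Startup can take a while if indexes need rebuilding. A small poll loop to confirm the broker is serving again (here `kafka3:9092` stands in for the recovered broker):

```bash
# Poll until the restarted broker answers API version requests
until kafka-broker-api-versions.sh --bootstrap-server kafka3:9092 >/dev/null 2>&1; do
  sleep 10
done
echo "Broker is back online"
```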
## Common Startup Failures

**Corrupt index files:**

```
ERROR Found a corrupted index file /var/kafka/data/my-topic-0/00000000000012345.index
```

Delete the corrupt indexes; Kafka rebuilds them on startup:
```bash
rm /var/kafka/data/my-topic-0/00000000000012345.index
rm /var/kafka/data/my-topic-0/00000000000012345.timeindex
```
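If many partitions are affected, deleting indexes one by one gets tedious. A bulk sketch, assuming GNU grep and the log line format shown above:

```bash
# Bulk variant: extract every corrupted index path from the startup log and remove it
grep -oP 'Found a corrupted index file \K\S+' /var/log/kafka/server.log \
  | xargs -r rm -v
```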
**Empty snapshot files:**

```bash
find /var/kafka/data -name "*.snapshot" -size 0 -delete
```

**All log dirs failed (JBOD):** Temporarily exclude the failed disk in `server.properties`:
```properties
# Original:  log.dirs=/data1/kafka,/data2/kafka,/data3/kafka
# Temporary: log.dirs=/data1/kafka,/data2/kafka
```

Partitions on the excluded disk become under-replicated. Reassign them after recovery; a sketch follows.
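A reassignment sketch with `kafka-reassign-partitions.sh`; `topics.json` and `reassignment.json` are hypothetical file names for the plan you generate and review first:

```bash
# topics.json (hypothetical) lists the affected topics:
#   {"version":1,"topics":[{"topic":"high-volume-topic"}]}
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate
# Save the proposed assignment it prints as reassignment.json, review it, then:
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
  --reassignment-json-file reassignment.json --execute
```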
## Verify Recovery
```bash
# Check for under-replicated partitions; output should be empty once replicas catch up
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --under-replicated-partitions
```

Recovery time depends on data volume: 100 GB at 100 MB/s of network throughput works out to roughly 17 minutes per replica.
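To track the catch-up without re-running the command by hand, a simple loop with `watch`:

```bash
# Re-check every 30 seconds; the line count drops to 0 when replication has caught up
watch -n 30 'kafka-topics.sh --bootstrap-server kafka1:9092 --describe --under-replicated-partitions | wc -l'
```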
## Prevent Recurrence
Configure disk usage alerts to catch problems before they become emergencies.
| Metric | Warning | Critical |
|---|---|---|
| Disk usage | 70% | 85% |
| `OfflineLogDirectoryCount` | n/a (any nonzero value is critical) | > 0 |
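Any monitoring stack can alert on these. Even a cron job beats nothing; a minimal sketch assuming GNU `df` and `/var/kafka` as the data mount:

```bash
# Minimal cron-able check: warn when the Kafka data mount crosses 70% used
usage=$(df --output=pcent /var/kafka | tail -1 | tr -dc '0-9')
if [ "$usage" -ge 70 ]; then
  echo "WARNING: Kafka disk at ${usage}% on $(hostname)"  # wire this to your pager
fi
```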
Tightening the broker-side retention settings also caps how far a runaway topic can fill the disk:

```properties
log.retention.bytes=107374182400      # 100 GB per partition
log.retention.check.interval.ms=60000 # check every minute
```

Disk full is recoverable if you have replication. Without replicas, you lose data. The real fix is alerting at 70%, not recovering at 100%.
Book a demo to see how Conduktor Console monitors disk usage across all your Kafka clusters.