JVM Tuning for Kafka Brokers: G1GC vs ZGC in Production
Configure G1GC and ZGC for Kafka brokers. Heap sizing, pause time targets, and when to switch collectors in production.

A 500ms GC pause can trigger consumer rebalances, cause ZooKeeper session timeouts, and create cascading failures across your cluster.
I've debugged GC-related Kafka outages more times than I'd like. The fix is usually straightforward once you understand what's happening.
> "Our production cluster had random 2-second latency spikes. Turned out to be Full GC pauses. Fixed the heap sizing and haven't had an incident in 8 months."
> SRE at a payments company
The Standard G1GC Configuration
export KAFKA_HEAP_OPTS="-Xms6G -Xmx6G"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
-XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:+ExplicitGCInvokesConcurrent \
-XX:G1HeapRegionSize=16M" This is the battle-tested Confluent/LinkedIn configuration. Works for most workloads without tuning.
Heap Sizing: Keep It Small
Kafka brokers don't need large heaps. Data sits in the OS page cache, not the JVM.
| Workload | Heap Size |
|---|---|
| Development | 1-2 GB |
| Standard production | 6 GB |
| High partition count (>10k) | 8 GB |
Tradeoff: Larger heaps mean longer pauses. A 32GB heap with G1GC can have 100-200ms pauses.
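A rough sanity check of that split, again assuming the broker runs under the stock kafka.Kafka main class: on a healthy broker, buff/cache should dwarf the heap, and jstat shows how much of the heap is actually in use.

# Page cache vs. heap: most of the broker's data should live in "buff/cache"
free -h
# Heap occupancy and GC activity for the broker JVM, sampled 3 times at 5s intervals
jstat -gcutil "$(pgrep -f kafka.Kafka)" 5000 3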
When G1GC Causes Problems
Symptom: ZooKeeper session timeouts (ZK mode)
INFO Session expired; client is trying to reconnect to ZooKeeper
This appears when a GC pause outlasts the session timeout. Fix: reduce pause times or increase zookeeper.session.timeout.ms.
For KRaft clusters: Similar issues show up as controller election churn. If pause times approach the election timeout, tune controller.quorum.election.timeout.ms.
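As a sketch, the relevant timeouts live in server.properties. The values below are illustrative rather than recommendations, and defaults vary by Kafka version:

# ZooKeeper mode: give the session more headroom than your worst observed pause
zookeeper.session.timeout.ms=30000
# KRaft mode: election timeout for the controller quorum
controller.quorum.election.timeout.ms=2000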
Symptom: Consumer rebalances during pauses
Marking the coordinator dead for group my-group
Consumers log this when the broker hosting their group coordinator stops responding mid-pause, which triggers a rebalance.
Symptom: Full GC
GC(45) Pause Full (Allocation Failure) 7800M->7500M(8192M) 12500.000ms
A 12.5-second Full GC will definitely cause broker disconnections, and the heap barely shrank (7800M to 7500M), which points to genuine memory pressure. Increase the heap or investigate what is holding memory.
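If GC logging isn't already enabled, unified JVM logging (Java 11+) produces the log shown above. Kafka's startup scripts typically pass a KAFKA_GC_LOG_OPTS variable through to the JVM, but verify against your distribution's kafka-run-class.sh:

# Rolling GC log with timestamps, uptime, and tags (Java 11+ -Xlog syntax)
export KAFKA_GC_LOG_OPTS="-Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,uptime,level,tags:filecount=10,filesize=100M"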
When to Switch to ZGC
ZGC promises sub-millisecond pauses regardless of heap size. Netflix switched to Generational ZGC in 2024.
# Java 21+
export KAFKA_HEAP_OPTS="-Xms12G -Xmx12G"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseZGC -XX:+ZGenerational" | Factor | G1GC | ZGC |
|---|---|---|
| Heap size | < 16GB | > 16GB |
| Pause target | < 50ms | < 10ms |
| CPU overhead | Lower | Higher (~5-10%) |
| Memory overhead | Lower | ~20% more needed |
| Java version | 8+ | 15+ (21+ for Generational ZGC) |
Choose ZGC when your heap is over 16GB, you need sub-10ms pauses, and you run Java 17+ (21+ for Generational ZGC).
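Before rolling ZGC out, a cheap check that the target JDK accepts the flags. Note that on Java 23+ ZGC is generational by default and -XX:+ZGenerational is being phased out:

# Fails fast with "Unrecognized VM option" on JDKs that don't support these flags
java -XX:+UseZGC -XX:+ZGenerational -version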
Monitoring GC Health
# Parse GC log for pause times
grep -oP 'Pause Young.*?\K[\d.]+(?=ms)' /var/log/kafka/kafkaServer-gc.log | \
awk '{sum+=$1; count++; if($1>max)max=$1} END {print "avg:",sum/count,"ms, max:",max,"ms"}'

| Metric | Warning | Critical |
|---|---|---|
| GC pause P99 | > 50ms | > 200ms |
| GC frequency | > 10/min | > 30/min |
| Heap after GC | > 70% | > 85% |
- No "Pause Full" entries (bad)
- No "Humongous Allocation" warnings
- Pause times under your target
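A minimal triage against that checklist, assuming the GC log path used earlier in this post:

# Count Full GCs and humongous-allocation-triggered collections; both should be zero
grep -c 'Pause Full' /var/log/kafka/kafkaServer-gc.log
grep -c 'Humongous Allocation' /var/log/kafka/kafkaServer-gc.log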
Quick Fixes
Humongous objects: G1 allocates anything larger than half a region as a humongous object, so large Kafka messages (over ~8MB with 16M regions) take an inefficient allocation path. Increase G1HeapRegionSize to 32M.
Concurrent mode failure: Concurrent marking didn't finish before the heap filled, forcing a Full GC. Lower InitiatingHeapOccupancyPercent to 25 so marking starts earlier.
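Applied to the baseline flags from the start of this post, those two adjustments look like this; only the region size and occupancy threshold change:

# Baseline G1 flags with larger regions for big messages and earlier concurrent marking
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
-XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=25 \
-XX:+ExplicitGCInvokesConcurrent \
-XX:G1HeapRegionSize=32M"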
GC tuning is iterative. Start with recommended settings, monitor under load, adjust based on behavior. Premature optimization often makes things worse.
Book a demo to see how Conduktor Console surfaces GC health alongside Kafka metrics.