Kafka Performance: Measure Before You Tune
Kafka performance is 80% understanding your workload, 20% tuning configs. Measure bottlenecks before changing partition counts or batch sizes.

Tuning configs without measuring bottlenecks is guessing.
When Kafka performance degrades, the instinct is to tweak settings: increase batch.size, raise linger.ms, adjust compression.type. Sometimes this helps. Often it doesn't, because the real problem isn't configuration—it's architecture: the wrong partition count, a saturated network, or consumers that can't keep up with the incoming rate. Throwing more resources at Kafka bottlenecks won't fix long-term scalability problems; the durable gains come from more nuanced, efficient strategies grounded in an understanding of the workload.
Kafka performance tuning is 20% changing configs and 80% understanding your workload. Start with measurement: identify the bottleneck (producer, broker, or consumer), understand the constraint (CPU, network, disk I/O), then optimize. Tuning the wrong layer wastes time and makes problems worse.
Measuring Kafka Performance
Performance has three dimensions: throughput (messages per second), latency (time from produce to consume), and resource utilization (CPU, network, disk). Optimizing one often degrades another. The question isn't "how do I make Kafka faster?" It's "which metric matters most for my workload?"
Throughput measures messages processed per second or bytes processed per second. High-throughput workloads (event logging, clickstream data) prioritize volume over latency. Tune for batching, compression, and partition parallelism.
Latency measures time from producer send to consumer receive. Low-latency workloads (fraud detection, real-time recommendations) prioritize speed over volume. Tune for immediate send (linger.ms=0), minimal batching, and fast acknowledgment (acks=1).
Resource utilization measures CPU, network, and disk usage on brokers. Common bottlenecks include broker network throughput, storage throughput, and, for network-attached storage, the network path between brokers and the storage backend.
Measure all three before tuning. If broker CPU is 95% but network is 30%, adding more partitions won't help—CPU is the bottleneck. If network is saturated but CPU is idle, compression helps. If disk I/O is maxed, faster storage helps. Tuning the wrong resource moves the bottleneck without improving performance.
Bottleneck Identification
Kafka performance problems manifest at three layers: producer, broker, or consumer. Identify which layer is slow before optimizing. Understanding latency distributions is the first step.
Producer bottleneck symptoms: high produce latency, low broker CPU, low broker network utilization. The producer is slow to send data. Brokers are idle waiting for messages. Causes: network latency between producer and broker, inefficient batching (sending single messages instead of batches), or application logic slowdown (database lookups before each send).
Solution: increase producer batching (batch.size, linger.ms), use compression to reduce network transfer, or parallelize producer instances to increase aggregate throughput.
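One quick way to tell application slowness apart from broker limits is to baseline the produce path with the bundled perf tool, using the same cluster and similar settings (the topic name and values below are placeholders). If the tool sustains far higher throughput than your application, the bottleneck is application logic or its network path, not Kafka:

# Baseline producer throughput independent of application logic (hypothetical topic/settings)
kafka-producer-perf-test \
--topic my-app-topic \
--num-records 500000 \
--record-size 512 \
--throughput -1 \
--producer-props bootstrap.servers=localhost:9092 \
batch.size=65536 \
linger.ms=20 \
compression.type=lz4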
Broker bottleneck symptoms: high broker CPU, high disk I/O, under-replicated partitions. The broker can't keep up with incoming data. Causes: insufficient broker capacity, too many partitions per broker (replication overhead), or disk I/O saturation (slow disks, heavy compaction).
Solution: scale horizontally (add brokers, rebalance partitions), use faster storage (SSDs instead of HDDs), or reduce partition count per broker (consolidate topics, increase partition size).
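To confirm replication pressure, list under-replicated partitions directly from the CLI (the broker address is a placeholder); a non-empty result during steady-state traffic means brokers can't keep up with replication:

# List partitions whose followers have fallen out of sync
kafka-topics \
--bootstrap-server localhost:9092 \
--describe \
--under-replicated-partitions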
Consumer bottleneck symptoms: growing consumer lag, low broker CPU, low broker network utilization. The consumer can't process messages fast enough. Causes: slow downstream processing (database writes, API calls), insufficient consumer instances (low parallelism), or inefficient deserialization (complex schemas).
Solution: scale consumer instances (one consumer per partition for max parallelism), optimize downstream processing (batch database writes, use async APIs), or simplify deserialization (switch to faster serialization formats).
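Consumer lag is the clearest signal. A quick check with the bundled CLI (the group name is a placeholder) shows per-partition lag for a group; lag that keeps growing while broker utilization stays low points at the consumers or their downstream systems:

# Inspect per-partition lag for a consumer group
kafka-consumer-groups \
--bootstrap-server localhost:9092 \
--describe \
--group my-consumer-group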
Whichever layer you suspect, confirm it with broker metrics: network throughput, disk I/O rates, request latency, CPU utilization, memory usage, and the under-replicated partition count.
Partition Count and Parallelism
Partition count determines maximum consumer parallelism. Too few partitions limit throughput; too many increase broker overhead.
Under-partitioned topics limit parallelism. A topic with 3 partitions supports at most 3 consumer instances in a group. If you need 10 consumer instances to handle throughput, some consumers will be idle. The topic is under-partitioned.
Rule of thumb: partitions >= expected max consumer instances. If you anticipate scaling to 20 consumers during peak load, the topic needs at least 20 partitions.
Over-partitioned topics increase broker overhead. Each partition consumes memory (page cache for recent messages), file handles (open log segments), and replication bandwidth (inter-broker traffic for ISR updates). Partitions beyond what the workload needs add this overhead without adding throughput.
A broker with 10,000 partitions uses significantly more memory and CPU than one with 1,000 partitions, even at the same throughput. Partition leadership elections take longer. Rebalancing takes longer. Operational complexity increases.
Rule of thumb: don't create partitions speculatively. Start with expected parallelism (e.g., 10 partitions for 10 consumers), measure performance, scale up if needed. Adding partitions is easy. Reducing them requires topic recreation.
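For example (topic name, counts, and replication factor are illustrative), create for the parallelism you expect now and expand later if measurements demand it:

# Create a topic sized for expected consumer parallelism
kafka-topics \
--bootstrap-server localhost:9092 \
--create \
--topic clickstream \
--partitions 10 \
--replication-factor 3

# Later, if consumers can't keep up, raise the partition count
# (note: this changes key-to-partition mapping for keyed topics)
kafka-topics \
--bootstrap-server localhost:9092 \
--alter \
--topic clickstream \
--partitions 20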
Partition sizing affects performance. Very small partitions (few MB per partition) waste overhead: broker tracks partition metadata, consumers open network connections. Very large partitions (100+ GB) slow rebalancing: moving data between brokers takes hours, not minutes.
Target partition size: 10-50 GB per partition. This balances overhead (not too many small partitions) with agility (rebalancing completes in reasonable time).
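As a quick sizing sketch with made-up numbers: a topic retaining 7 days of data at 100 GB/day holds about 700 GB; at roughly 30 GB per partition that suggests around 24 partitions, which you would then round up to cover expected consumer parallelism.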
Compression Tradeoffs
Compression reduces network and storage usage but increases CPU usage. The right tradeoff depends on where the bottleneck is.
Network-bound workloads benefit from compression. If broker network is saturated but CPU is idle, enabling compression (compression.type=snappy or lz4) reduces bytes transferred. Throughput increases because less network bandwidth is needed per message.
CPU-bound workloads degrade with compression. If broker CPU is maxed, adding compression overhead makes it worse. Throughput decreases because brokers spend cycles compressing instead of processing requests.
Compression algorithm choice:
- Snappy: Fast compression, moderate compression ratio. Good default for most workloads.
- LZ4: Faster than Snappy with a comparable compression ratio. Use when CPU is scarce.
- Gzip: Slower compression, better compression ratio. Use when network bandwidth is scarce and CPU is abundant.
- Zstd: Good balance of speed and compression ratio, often delivering the best ratio for the CPU spent. Available in Kafka 2.1+.
Achieving optimal performance means balancing latency against throughput, and compression is only one lever alongside partition counts and hardware. Measure before choosing: run benchmarks with each compression type, compare throughput, latency, and CPU usage, and pick the algorithm that relieves your actual bottleneck.
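A minimal benchmark loop (assuming a throwaway test topic already exists) makes the comparison concrete; record broker CPU alongside the tool's throughput and latency output so you can see what each codec's ratio costs:

# Compare codecs one at a time against the same topic and record size
for codec in none snappy lz4 zstd gzip; do
  echo "compression.type=$codec"
  kafka-producer-perf-test \
  --topic compression-test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 compression.type=$codec
done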
Batching and Buffering Strategies
Batching trades latency for throughput. Larger batches mean fewer network round-trips (higher throughput) but messages wait longer before sending (higher latency).
Producer batching controls how many messages accumulate before sending. batch.size sets the maximum batch size in bytes. linger.ms sets how long to wait for a batch to fill before sending anyway.
For high-throughput, latency-tolerant workloads: set batch.size=1048576 (1 MB) and linger.ms=100. Messages accumulate for up to 100 ms or until the batch reaches 1 MB, then send. This maximizes batching efficiency and throughput.
For low-latency workloads: set batch.size=16384 (16 KB, the default) and linger.ms=0. Messages send immediately without waiting. Latency is minimized but throughput is lower because batching is minimal.
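As a sketch, the two profiles as producer configuration files (the values are the illustrative ones above, not universal recommendations):

# Throughput-oriented producer settings (latency-tolerant pipelines)
cat > producer-throughput.properties <<'EOF'
bootstrap.servers=localhost:9092
# accumulate up to 1 MB or 100 ms per batch
batch.size=1048576
linger.ms=100
compression.type=lz4
EOF

# Latency-oriented producer settings (send immediately)
cat > producer-latency.properties <<'EOF'
bootstrap.servers=localhost:9092
# 16 KB batches, no artificial wait
batch.size=16384
linger.ms=0
EOF

Either file can then be passed to kafka-producer-perf-test via --producer.config to compare the two profiles on representative traffic.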
Consumer fetching controls how much data consumers retrieve per request. fetch.min.bytes sets minimum bytes to return (broker waits until this much data is available). fetch.max.wait.ms sets how long to wait before returning anyway.
For throughput: set fetch.min.bytes=1048576 (1 MB) and fetch.max.wait.ms=500. Consumers fetch large batches, reducing request overhead. Throughput increases but latency increases (consumers wait up to 500 ms for data).
For latency: set fetch.min.bytes=1 (the default) and fetch.max.wait.ms=100. Consumers fetch data as soon as any is available instead of waiting for large batches. Latency decreases but throughput is lower.
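The consumer-side equivalents, again as illustrative configuration files:

# Throughput-oriented consumer fetch settings
cat > consumer-throughput.properties <<'EOF'
bootstrap.servers=localhost:9092
group.id=fetch-throughput-test
# wait for up to 1 MB or 500 ms per fetch
fetch.min.bytes=1048576
fetch.max.wait.ms=500
EOF

# Latency-oriented consumer fetch settings
cat > consumer-latency.properties <<'EOF'
bootstrap.servers=localhost:9092
group.id=fetch-latency-test
# return fetches as soon as any data is available
fetch.min.bytes=1
fetch.max.wait.ms=100
EOF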
When to Scale vs. When to Optimize
Scaling adds resources (brokers, partitions, consumers). Optimization improves efficiency of existing resources. The choice depends on where the bottleneck is.
Scale when resource-constrained: If broker CPU is 95%, network is saturated, or disk I/O is maxed, optimization can only squeeze marginal gains. Add brokers, spread load across more machines.
Optimize when inefficient: If broker CPU is 40% but throughput is poor, adding brokers doesn't help. The problem is inefficiency: wrong partition count, poor batching, or unbalanced load across partitions.
Scaling is expensive (more infrastructure costs) but simple (add machines, rebalance). Optimization is cheap (config changes) but requires diagnosis (find the inefficiency).
Start with optimization. Measure utilization, identify inefficiencies, fix them. Scale only when resources are genuinely exhausted.
Hardware Considerations
Storage performance matters more than CPU for most Kafka workloads. Kafka relies heavily on disk I/O and the operating system page cache: every message is written to disk, reads are served from page cache while consumers keep up, and consumers that fall behind read from disk. Fast disks (NVMe SSDs) dramatically outperform slow disks (spinning HDDs) for both write throughput and read latency, so high-performance storage is essential in production.
Rule of thumb: Use SSDs for production Kafka clusters. HDDs are too slow for anything except archival workloads.
Network bandwidth limits throughput, especially for high-traffic topics. 10 Gbps network interfaces support ~1 GB/sec aggregate throughput. 1 Gbps network interfaces support ~100 MB/sec. If your workload exceeds network capacity, no amount of config tuning helps.
Solution: Upgrade to 10 Gbps+ network interfaces or reduce network traffic through compression or data filtering.
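A rough back-of-envelope with hypothetical numbers: with replication factor 3, every produced byte is sent over the network twice more to follower replicas, so 300 MB/sec of aggregate produce traffic becomes roughly 900 MB/sec of write traffic across the cluster before any consumer fetches, which is already close to the practical ceiling of a single 10 Gbps interface if the load isn't spread evenly across brokers.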
Memory for page cache improves read performance. Kafka uses operating system page cache to keep recently written data in memory. Consumers reading recent data (typical case) hit page cache instead of disk, dramatically reducing latency.
Rule of thumb: Provision memory equal to 1-2 days of message volume per broker. If each broker takes in 100 GB of new data per day, provision 100-200 GB of memory per broker to keep roughly a day's worth of data cached.
Performance Testing and Benchmarking
Don't guess at configuration impact. Measure it.
Use kafka-producer-perf-test and kafka-consumer-perf-test to benchmark throughput and latency under different configurations:
# Benchmark producer throughput
kafka-producer-perf-test \
--topic test-topic \
--num-records 1000000 \
--record-size 1024 \
--throughput -1 \
--producer-props bootstrap.servers=localhost:9092 \
compression.type=snappy \
batch.size=32768 \
linger.ms=10

Change one variable at a time (compression type, batch size, partition count) and measure impact. This identifies which changes actually improve performance and which don't.
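kafka-consumer-perf-test measures the consume side the same way; a minimal run against the same topic (message count mirrors the producer run above) might look like this:

# Benchmark consumer throughput on the same topic
kafka-consumer-perf-test \
--bootstrap-server localhost:9092 \
--topic test-topic \
--messages 1000000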
Run benchmarks in environment matching production: same broker hardware, same network topology, same partition count. Benchmarks on a laptop with a single broker don't predict production performance with 20 brokers.
Measuring Success
Track throughput, p99 latency, and resource utilization over time. Performance improvements should show in metrics:
Throughput should increase (messages/sec) or stay flat while resource usage decreases. If throughput is the same but CPU usage drops from 80% to 50%, optimization succeeded.
p99 latency should decrease. Don't optimize for average latency—optimize for tail latency. If p50 latency is 5ms but p99 latency is 500ms, 1% of requests are experiencing severe delays.
Resource utilization should stay below 70% during normal operation. This leaves headroom for traffic spikes. If broker CPU runs at 90% during normal traffic, spikes will saturate it.
The Path Forward
Kafka performance optimization starts with measurement: identify bottlenecks, understand constraints, then tune. Most performance problems are architectural (wrong partition count, consumers that can't keep up), not configuration problems (batch size too small).
Conduktor provides performance monitoring and baselines, configuration recommendations based on actual usage through Console, and capacity planning to predict when clusters need scaling. Teams optimize Kafka performance through visibility into what's slow and why, not guesswork.
If your performance tuning consists of changing configs and hoping, the problem isn't Kafka—it's lack of measurement.
Related: Kafka Configuration Mistakes → · Kafka Latency → · JVM Tuning for Kafka →