Kafka Monitoring: 10 Metrics That Matter

Stop drowning in 200+ Kafka broker metrics. Focus on the 10 signals that actually predict outages, consumer lag, and cluster instability.

Stéphane Derosiaux · September 12, 2025

You can't troubleshoot what you can't see. But seeing everything doesn't mean understanding anything.

Kafka exposes 200+ metrics per broker: bytes in/out, request latency, partition counts, replication lag, disk usage, network threads, and dozens more. Teams dutifully collect them all, build Grafana dashboards, and still can't answer "is my Kafka cluster healthy?" during incidents.

The problem isn't lack of data. It's lack of context—the difference between monitoring and observability. A dashboard showing "consumer lag: 1,000,000 messages" is useless without knowing: is this normal for this consumer? Has it been growing or shrinking? Is the consumer designed for real-time processing or batch?

Kafka monitoring in 2025 emphasizes tracking what matters—in-sync replicas, under-replicated partitions, and consumer lag trends—over collecting everything. The shift is from comprehensive metric collection to meaningful insight extraction.

The Metrics That Actually Matter

Start with ten metrics. Everything else is diagnostic detail you reference during investigation, not baseline monitoring.

Under-replicated partitions (broker-level) measures partitions where not all replicas are in sync. If this metric stays above zero for an extended period, some partitions are running without their full replica set, so Kafka's high-availability guarantees no longer hold, warranting immediate investigation.

Normal value: 0. If a broker becomes unavailable, this value increases sharply. Brief spikes during rolling restarts are expected. Sustained non-zero values indicate broker issues, network problems, or insufficient resources.
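The authoritative source for this number is the broker JMX metric, but where JMX isn't wired up yet you can approximate the check from partition metadata. A minimal sketch using the confluent_kafka AdminClient (the bootstrap address is a placeholder); it counts partitions whose ISR set is smaller than the full replica assignment:

```python
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap address -- point this at your own cluster.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Fetch cluster metadata: topics -> partitions -> replicas / in-sync replicas.
metadata = admin.list_topics(timeout=10)

under_replicated = 0
for topic in metadata.topics.values():
    for partition in topic.partitions.values():
        # A partition is under-replicated when its ISR set is smaller
        # than its full replica assignment.
        if len(partition.isrs) < len(partition.replicas):
            under_replicated += 1
            print(f"URP: {topic.topic}[{partition.id}] "
                  f"replicas={partition.replicas} isr={partition.isrs}")

print(f"Under-replicated partitions: {under_replicated}")  # healthy cluster: 0
```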

Consumer lag (consumer-level) measures how far behind consumers are from the latest messages. The most critical metric is records-lag-max, which shows maximum lag in number of records for any partition assigned to a consumer.

Normal value: depends on consumer design. Real-time processors should have lag under 1,000 messages. Batch consumers might accumulate millions of messages between runs. High consumer lag indicates consumers are struggling to keep up with incoming data, leading to delays in real-time processing.
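records-lag-max is reported by the consumer itself over JMX. For an external view, lag can be derived by comparing each partition's committed offset to its high watermark. A minimal sketch with confluent_kafka; the group id, topic name, and bootstrap address are placeholders:

```python
from confluent_kafka import Consumer, TopicPartition

# Placeholder connection details -- inspect only, never commit from here.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-processor",      # the group whose lag we inspect
    "enable.auto.commit": False,
})

topic = "orders"
partitions = [
    TopicPartition(topic, p)
    for p in consumer.list_topics(topic, timeout=10).topics[topic].partitions
]

# Committed offsets for the group, then high watermarks per partition.
committed = consumer.committed(partitions, timeout=10)
total_lag = 0
for tp in committed:
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # No committed offset yet -> treat the whole partition as lag.
    current = tp.offset if tp.offset >= 0 else low
    lag = high - current
    total_lag += lag
    print(f"{topic}[{tp.partition}] lag={lag}")

print(f"aggregate lag: {total_lag}")
consumer.close()
```

Per-partition output matters as much as the aggregate: a single hot partition can hide behind a healthy-looking total, as discussed below.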

Fetch latency (consumer-level) measures time consumers wait for broker responses. Increasing fetch latency indicates broker saturation or network degradation. Target: sub-100ms for most workloads.

Produce latency (producer-level) measures time producers wait for acknowledgments. This correlates with acks configuration: acks=1 has lower latency than acks=all because fewer replicas must acknowledge. Target: sub-50ms for acks=1, sub-200ms for acks=all.
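To see how acks shifts the latency distribution, you can time produce-to-acknowledgement round trips in the delivery callback. A minimal sketch with confluent_kafka; the topic name and bootstrap address are placeholders, and you would flip acks between "1" and "all" to compare:

```python
import time
from confluent_kafka import Producer

# Placeholder config -- switch "acks" between "1" and "all" to compare.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",        # every in-sync replica must acknowledge
    "linger.ms": 5,
})

latencies = []

def make_callback(start):
    def on_delivery(err, msg):
        if err is None:
            latencies.append((time.monotonic() - start) * 1000)  # ms
        else:
            print(f"delivery failed: {err}")
    return on_delivery

for i in range(1000):
    producer.produce("orders", value=f"msg-{i}".encode(),
                     on_delivery=make_callback(time.monotonic()))
    producer.poll(0)  # serve delivery callbacks as they arrive

producer.flush()
latencies.sort()
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99) - 1]
print(f"produce latency p50={p50:.1f}ms p99={p99:.1f}ms")
```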

Request queue size (broker-level) measures how many requests are waiting for broker threads. Growing queue size indicates broker saturation. Normal value: under 100. Values exceeding 500 indicate brokers can't process requests fast enough.

Disk usage (broker-level) tracks storage consumption (see also cost control). Kafka brokers fail catastrophically when disks fill. Monitor both absolute usage (GB consumed) and growth rate (GB/day). Set alerts at 70% capacity to allow time for remediation.

Network throughput (broker-level) measures bytes in/out per broker. This correlates with message volume and size. Sudden drops in bytes in typically indicate producer failures; sudden spikes in bytes out often point to consumer failures (retry storms).

Partition count (broker-level) affects broker performance. Thousands of partitions per broker increase memory usage and replication overhead. Modern Kafka handles 10,000+ partitions per broker, but performance degrades as counts grow. Monitor partition count to plan scaling before hitting limits.

Active controller count (cluster-level) should always be 1. If it's 0, the cluster has no controller and can't handle metadata changes. If it's greater than 1, there's a split-brain condition. Both are critical failures requiring immediate intervention.

ISR shrink/expand rate (broker-level) tracks how often in-sync replica sets change. Frequent changes indicate brokers or replicas falling out of sync repeatedly, suggesting resource constraints or network issues.

Understanding Consumer Lag

Consumer lag is the most misunderstood Kafka metric. "Consumer lag: 1,000,000 messages" might be a crisis or completely normal depending on context.

Lag for real-time processors should stay near zero. If an orders-processor designed to handle orders within seconds has 100,000 message lag, something is broken. Either the consumer can't keep up (insufficient capacity), rebalancing is thrashing (too many consumer instances joining/leaving), or downstream dependencies are slow (database writes blocking message processing).

Lag for batch processors accumulates by design. A reporting service that runs every 6 hours might accumulate 10 million messages between runs, process them in 30 minutes, then return to zero lag. Alerting on absolute lag would fire constantly. The meaningful metric is: does lag return to zero after each batch? If not, batch duration is increasing, indicating scalability problems.

Lag trends matter more than absolute values. A consumer with 500,000 message lag that's been stable for days is probably fine. A consumer with 5,000 message lag that's growing 1,000 messages/minute will hit millions within hours—that's the crisis.

Modern monitoring shifts from message counts to time-based metrics. Instead of "1 million messages behind," track "30 minutes behind." Time lag is more intuitive: if a consumer is 2 hours behind and your SLA is 5 minutes, you know it's failing without understanding message rates.
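Time lag can be estimated by reading the timestamp of the next unconsumed record and comparing it to the clock. A minimal sketch with confluent_kafka, assuming the committed offset for one partition is already known (the topic, group, and offset value are hypothetical placeholders):

```python
import time
from confluent_kafka import Consumer, TopicPartition, TIMESTAMP_NOT_AVAILABLE

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "lag-inspector",             # throwaway group used only for inspection
    "enable.auto.commit": False,
})

def time_lag_seconds(topic, partition, committed_offset):
    """Seconds between now and the timestamp of the next unconsumed record."""
    tp = TopicPartition(topic, partition, committed_offset)
    consumer.assign([tp])
    msg = consumer.poll(timeout=10.0)
    if msg is None or msg.error():
        return 0.0  # nothing left to consume (or transient error): caught up
    ts_type, ts_ms = msg.timestamp()
    if ts_type == TIMESTAMP_NOT_AVAILABLE:
        return 0.0
    return max(0.0, time.time() - ts_ms / 1000.0)

lag_s = time_lag_seconds("orders", 0, committed_offset=123_456)  # hypothetical offset
print(f"consumer is ~{lag_s / 60:.1f} minutes behind")
consumer.close()
```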

Building Baselines: What's Normal for Your Workload

Monitoring without baselines generates false alarms. If normal consumer lag for your batch processor is 5 million messages, alerting at 1 million wastes time investigating non-issues.

Build baselines by observing metrics under normal load over 30 days:

Consumer lag baseline: What's typical lag for this consumer during normal operation? What's typical lag during peak traffic? When does lag return to zero? A consumer that accumulates lag overnight and processes it during off-hours has a different baseline than one that maintains real-time processing 24/7.

Produce/fetch latency baseline: What's p50, p95, p99 latency during normal traffic? What's latency during peak? Latency distributions reveal more than averages. If p99 latency is 10x higher than p50, some requests are experiencing severe delays even when average latency looks healthy.

Disk usage baseline: How much storage grows per day under normal load? This predicts when disks fill. If growth is 100GB/day and 1TB is available, you have 10 days before hitting capacity limits.

Partition rebalancing baseline: How often do consumer groups rebalance under normal conditions? Frequent rebalancing indicates consumer instances joining/leaving constantly, suggesting infrastructure instability or misconfigured health checks.

Baselines turn absolute metrics into meaningful signals. Alert when metrics deviate from baseline, not when they exceed arbitrary thresholds.
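One way to put this into practice is a rolling baseline with a deviation check instead of a fixed threshold. A minimal, library-agnostic sketch; the window size and deviation factor are illustrative assumptions, not recommendations:

```python
from collections import deque
from statistics import mean, stdev

class BaselineAlert:
    """Alert when a metric deviates from its rolling baseline, not a fixed value."""

    def __init__(self, window=1440, deviation_factor=3.0):
        self.samples = deque(maxlen=window)   # e.g. 1440 one-minute samples = 24h
        self.deviation_factor = deviation_factor

    def observe(self, value):
        alert = False
        if len(self.samples) >= 30:           # need enough history for a baseline
            baseline = mean(self.samples)
            spread = stdev(self.samples) or 1.0
            # Fire only when the new sample sits far outside normal variation.
            alert = value > baseline + self.deviation_factor * spread
        self.samples.append(value)
        return alert

lag_monitor = BaselineAlert()
for sample in [5_000_000, 5_100_000, 4_900_000, 5_050_000] * 10 + [9_000_000]:
    if lag_monitor.observe(sample):
        print(f"lag {sample} deviates from baseline -- investigate")
```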

Monitoring Consumer Groups

Consumer groups are where most Kafka operational issues manifest. A healthy cluster with broken consumers is still broken from the user's perspective.

Consumer group lag measures aggregate lag across all partitions. If 10 partitions each have 1,000 message lag, aggregate lag is 10,000 messages. This shows overall consumer health but hides per-partition issues.

Per-partition lag reveals hot spots. If 9 partitions have zero lag but 1 partition has 100,000 message lag, aggregate lag might look acceptable (10,000 messages), but one partition is falling behind. This indicates data skew (one partition receives disproportionate traffic) or consumer issues (the consumer assigned to that partition is slower than others).

Consumer group rebalancing happens when members join or leave. Rebalancing pauses message processing while partitions reassign. Frequent rebalancing (multiple times per hour) indicates unstable consumers: crashing repeatedly, deployment churn, or misconfigured session timeouts. Track rebalancing rate and duration. Rebalancing every 10 minutes means consumers spend significant time paused instead of processing.
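Rebalance frequency and pause duration can be tracked directly from the consumer's rebalance callbacks. A minimal sketch with confluent_kafka; the topic, group, and timeout values are placeholders:

```python
import time
from confluent_kafka import Consumer

rebalance_count = 0
revoked_at = None

def on_revoke(consumer, partitions):
    # Processing pauses here until partitions are reassigned.
    global revoked_at
    revoked_at = time.monotonic()
    print(f"rebalance: {len(partitions)} partitions revoked")

def on_assign(consumer, partitions):
    global rebalance_count
    rebalance_count += 1
    paused = time.monotonic() - revoked_at if revoked_at else 0.0
    print(f"rebalance #{rebalance_count}: {len(partitions)} partitions assigned, "
          f"paused ~{paused:.1f}s")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "orders-processor",
    "session.timeout.ms": 45000,             # too low causes spurious rebalances
    "max.poll.interval.ms": 300000,          # must exceed worst-case processing time
})
consumer.subscribe(["orders"], on_assign=on_assign, on_revoke=on_revoke)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # ... process msg ...
```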

Offset commit rate shows how often consumers commit progress. Infrequent commits risk reprocessing messages after crashes. Too-frequent commits waste broker resources. Typical commit interval: 5-30 seconds.

Cluster Health Indicators

Cluster-level metrics reveal infrastructure problems that affect all topics and consumers.

Under-replicated partitions is the canary metric. If URPs exceed zero for extended periods, high availability is compromised. Brief spikes during broker restarts are normal. Sustained URPs indicate broker failures, network partitions, or insufficient resources (disk I/O saturation, CPU exhaustion).

Offline partitions are partitions with no leader; they are unavailable for both reads and writes. This count should always be zero. Non-zero values indicate critical failures requiring immediate intervention.
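The same cluster metadata used for the URP check earlier can flag leaderless partitions: a partition whose leader id is -1 has no leader. A minimal sketch with confluent_kafka (the bootstrap address is a placeholder):

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder
metadata = admin.list_topics(timeout=10)

offline = [
    (t.topic, p.id)
    for t in metadata.topics.values()
    for p in t.partitions.values()
    if p.leader == -1          # no leader elected: partition is unreadable and unwritable
]
print(f"offline partitions: {len(offline)}")  # must be 0
for topic, partition in offline:
    print(f"  {topic}[{partition}] has no leader")
```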

Unclean leader elections happen when a partition's leader fails and Kafka elects a non-ISR replica as the new leader, potentially losing messages. This should never happen with unclean.leader.election.enable=false (the recommended setting). If it does happen, it indicates all ISR replicas failed simultaneously, which suggests correlated failures (network partition, datacenter issue).
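You can verify the broker-level setting with describe_configs. A minimal sketch with confluent_kafka's AdminClient; the bootstrap address and broker id "1" are placeholders, and in practice you would repeat the check for every broker:

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder

# Broker id is a placeholder -- repeat for every broker in the cluster.
resource = ConfigResource(ConfigResource.Type.BROKER, "1")
configs = admin.describe_configs([resource])[resource].result(timeout=10)

entry = configs["unclean.leader.election.enable"]
print(f"unclean.leader.election.enable = {entry.value}")  # should be "false"
```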

Broker resource utilization includes CPU, memory, disk I/O, and network. Brokers running at 80%+ CPU or 90%+ disk I/O are saturated. Monitoring resource utilization predicts capacity limits before they cause outages.

When to Alert vs. When to Ignore

Good alerting means 95% of alerts result in action, not acknowledge-and-ignore.

Alert on symptoms, not causes. Don't alert on "high CPU" unless it causes user-facing impact. Alert on "under-replicated partitions" or "consumer lag exceeding SLA" because these directly affect availability and latency.

Use trend-based alerts, not absolute thresholds. Don't alert on "consumer lag > 100,000 messages." Alert on "consumer lag increasing >10,000 msg/min for 10+ minutes" because that indicates a growing problem, not a temporary spike.
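The rule "lag increasing >10,000 msg/min for 10+ minutes" maps to a small rate-over-window check. A minimal sketch; the thresholds come from the example above and the one-minute sampling interval is an assumption:

```python
from collections import deque

class TrendAlert:
    """Fire when lag grows faster than rate_per_min for at least sustain_min minutes."""

    def __init__(self, rate_per_min=10_000, sustain_min=10):
        self.rate_per_min = rate_per_min
        self.sustain_min = sustain_min
        self.samples = deque(maxlen=sustain_min + 1)  # one lag sample per minute

    def observe(self, lag):
        self.samples.append(lag)
        if len(self.samples) <= self.sustain_min:
            return False
        # Growth between consecutive one-minute samples across the whole window.
        deltas = [b - a for a, b in zip(self.samples, list(self.samples)[1:])]
        return all(d > self.rate_per_min for d in deltas)

alert = TrendAlert()
for minute, lag in enumerate(range(0, 15 * 12_000, 12_000)):  # lag grows +12k/min
    if alert.observe(lag):
        print(f"minute {minute}: lag growing >10k msg/min for 10+ minutes")
```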

Route alerts by ownership (define owners in the application catalog). Under-replicated partitions go to platform teams. Consumer lag goes to the team that owns the consumer. Alerting everyone for everything creates noise and diffuses responsibility.

Include runbooks with alerts. When consumer lag exceeds threshold, the alert should link to documentation explaining: how to check consumer capacity, how to scale consumers, how to identify slow downstream dependencies. Alerts without context generate ticket escalations that waste time.

Measuring Success

Monitor monitoring effectiveness through three metrics: MTTR (mean time to resolution), false positive rate, and alert fatigue score.

MTTR measures how quickly issues resolve after detection. If an under-replicated-partitions alert fires and resolution takes 4 hours, that's the baseline. After improving monitoring (adding context, better runbooks), MTTR should drop to under 1 hour.

False positive rate measures alerts that fire but don't require action. If 50% of consumer lag alerts are acknowledged and ignored, thresholds are too sensitive or baselines are wrong. Target: under 10% false positive rate.

Alert fatigue score is subjective but critical. Survey on-call engineers: do they trust alerts enough to wake up and investigate, or do they assume alerts are noise? High trust indicates good alerting. Low trust indicates alert fatigue from too many false positives.

The Path Forward

Kafka monitoring shifts from collecting all metrics to understanding the ten that matter. Under-replicated partitions, consumer lag trends, cluster resource utilization—these reveal health or problems. Everything else is diagnostic detail for investigations, not baseline monitoring.

Conduktor provides unified monitoring across all Kafka clusters with consumer lag tracking, topic health scores, and alerting on trends instead of absolute values. Teams gain visibility into what's broken and why, not just dashboards full of numbers.

If you can't answer "is my Kafka cluster healthy?" in under 30 seconds, the problem isn't your cluster—it's your monitoring.


Related: Kafka Platform Health → · Kafka Alerting → · Consumer Lag Thresholds →