Kafka Alerting: Fix Alert Fatigue
Kafka alert fatigue is a design problem, not a sensitivity problem. Build tiered alerts where 95% of pages result in action, not silence.

Alert fatigue doesn't come from too many alerts. It comes from too many useless alerts.
When 80% of alerts are false positives or non-actionable, on-call engineers learn to ignore them. They acknowledge without investigating, silence without fixing, and eventually distrust the alerting system entirely. Then a real incident happens—consumer lag spiraling, under-replicated partitions, broker crash—and the alert gets ignored along with the noise.
Left unchecked, teams ignore or disable alerts entirely, including critical ones. The solution is intelligent filtering, prioritization, and governance: well-designed alerting means 95% of alerts result in action. If you're acknowledging and dismissing most alerts, your alerting is broken.
The fix isn't raising thresholds to reduce noise (you'll miss real issues). It's redesigning alerts to fire only when human intervention is needed, not when metrics cross arbitrary thresholds.
The Alert Fatigue Problem
Alert fatigue follows a predictable pattern. A team launches comprehensive monitoring, sets alerts for every metric "just in case," and initially responds to everything. Within weeks, alert volume becomes unsustainable. Engineers start acknowledging without investigating. Eventually, alerts are silenced or disabled entirely.
The root cause: alerting on metrics instead of symptoms. "Broker CPU > 80%" isn't inherently a problem. If CPU is high but the broker is handling load without latency degradation, no action is needed. Alerting on CPU creates noise. Alerting on latency (the symptom users experience) creates signal.
Symptom-based alerts correlate with user impact: consumer lag exceeding SLA, under-replicated partitions (data loss risk), request latency p99 spiking. These indicate problems requiring intervention.
Metric-based alerts fire on resource utilization: high CPU, high memory, high disk I/O. These might indicate problems or might be normal under load. Without context, they're noise.
The shift is from "CPU is high, something might be wrong" to "latency is high, something IS wrong." The first requires investigation to determine if action is needed. The second requires action.
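To make the contrast concrete, here is a minimal sketch in Python with purely illustrative values: a metric-based rule that fires on CPU alone versus a symptom-based rule that fires only when latency deviates from its baseline.

```python
# Hypothetical snapshot of broker metrics; values are illustrative only.
metrics = {
    "cpu_percent": 92,               # resource metric
    "produce_p99_ms": 18,            # symptom: latency users actually experience
    "produce_p99_baseline_ms": 15,
}

# Metric-based rule: fires even though the broker is handling load fine.
metric_alert = metrics["cpu_percent"] > 80

# Symptom-based rule: fires only when latency deviates from its baseline.
symptom_alert = metrics["produce_p99_ms"] > 3 * metrics["produce_p99_baseline_ms"]

print(f"metric-based fires: {metric_alert}, symptom-based fires: {symptom_alert}")
```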
Alerting Principles
Actionable: Every alert should prompt specific action. If acknowledging an alert without taking action is common, the alert shouldn't exist.
Contextual: Alerts should include enough context to triage without additional investigation. "Consumer lag high" is useless. "orders-processor consumer lag 1.2M messages (p95: 50K messages), growing 10K msg/min for 15 minutes" is actionable.
Routable: Alerts should go to the team responsible for fixing them. Under-replicated partitions go to platform teams. Consumer lag goes to application teams that own the consumer. Routing everyone for everything diffuses responsibility.
What to Alert On
Alert on symptoms, not metrics. Alert on trends, not absolute values. Alert when human intervention is needed, not when metrics cross thresholds.
Under-replicated partitions warrant a critical alert. If this metric exceeds zero for more than 5 minutes, data availability is at risk. Action: investigate which broker is lagging, check resource saturation, and potentially scale or rebalance.
Context to include: which topics have under-replicated partitions, which broker is the lagging replica, current ISR size vs. expected.
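As a sketch of what that might look like in an alert payload, the dataclass below pairs the triage context with a simple persistence rule; the field names are illustrative, not from any specific tool.

```python
from dataclasses import dataclass

@dataclass
class UnderReplicatedAlert:
    # Context needed to triage without further digging.
    topics: list[str]          # topics with under-replicated partitions
    lagging_broker: int        # broker id of the replica that is falling behind
    isr_size: int              # current in-sync replica count
    expected_isr_size: int     # replication factor the topic was created with
    duration_s: int            # how long the metric has been non-zero

def should_page(urp_count: int, duration_s: int) -> bool:
    # Page only if under-replication persists past 5 minutes, so rolling
    # restarts and partition reassignments don't wake anyone.
    return urp_count > 0 and duration_s >= 300
```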
Consumer lag exceeding SLA indicates processing can't keep up. To make this data actionable and avoid alert fatigue, teams must aggregate and evaluate lag at the consumer group level rather than treating each partition in isolation.
Don't alert on absolute lag ("lag > 100K messages") because normal lag varies by consumer. Alert on lag relative to baseline: "lag exceeds p95 baseline by 3x and growing" or "lag exceeding SLA (messages older than 5 minutes)".
Context to include: current lag, baseline lag (p95 over last 7 days), growth rate (messages/minute), SLA breach duration.
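A minimal sketch of that aggregation and the relative-to-baseline rule, using illustrative numbers for the orders-processor group mentioned earlier (no Kafka client involved, just the evaluation logic):

```python
def group_lag(partition_lags: dict[int, int]) -> int:
    """Aggregate lag across all partitions of a consumer group, rather than
    alerting on each partition in isolation."""
    return sum(partition_lags.values())

def lag_alert(current_lag: int, baseline_p95: int, growth_per_min: float) -> bool:
    # Relative rule: lag is 3x the p95 baseline AND still growing.
    return current_lag > 3 * baseline_p95 and growth_per_min > 0

# Illustrative values: partition -> messages behind, summing to 1.2M.
lags = {0: 400_000, 1: 500_000, 2: 300_000}
print(lag_alert(group_lag(lags), baseline_p95=50_000, growth_per_min=10_000))  # True
```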
Broker failures (offline broker, controller election, partition leader changes) indicate infrastructure instability. These warrant immediate investigation because they affect availability.
Context to include: which broker failed, which topics/partitions are affected, whether automatic recovery is happening or manual intervention is needed.
Request latency p99 spikes indicate degraded user experience. If p99 produce or fetch latency exceeds baseline by 3x for 10+ minutes, something is degrading.
Context to include: which operation (produce/fetch), current p99 vs. baseline p99, correlated events (GC pauses, rebalancing, disk saturation).
Disk usage exceeding 70% gives time to remediate before disks fill completely. Alert early enough that teams can add capacity, adjust retention, or clean up data before hitting 100%.
Context to include: current usage percentage, growth rate (GB/day), estimated time until full, which topics consume most storage (use cost control for deeper analysis).
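A back-of-the-envelope sketch of that time-until-full estimate, with illustrative capacity and growth numbers:

```python
def hours_until_full(used_gb: float, capacity_gb: float, growth_gb_per_day: float) -> float:
    """Estimate time until the disk fills at the current growth rate."""
    if growth_gb_per_day <= 0:
        return float("inf")
    return (capacity_gb - used_gb) / growth_gb_per_day * 24

# Illustrative: 75% of a 2 TB log directory used, growing 40 GB/day.
used, capacity, growth = 1_500, 2_000, 40
if used / capacity > 0.70:
    print(f"disk {used / capacity:.0%} full, "
          f"~{hours_until_full(used, capacity, growth):.0f}h until full")
```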
What Not to Alert On
Don't alert when metrics cross thresholds without causing user impact.
CPU, memory, or disk I/O utilization alone isn't actionable. High utilization might be normal under load. Alert on symptoms (latency, under-replication) caused by resource exhaustion, not resource metrics themselves.
Exception: If resource utilization exceeds 90% AND correlates with latency degradation, alert on latency (the symptom) with resource usage as context.
Absolute consumer lag without trend analysis generates false positives. A batch consumer might accumulate 10 million messages overnight and process them in 1 hour every morning. Alerting on "lag > 1 million" fires daily without indicating a problem.
Alert on lag trend: if lag doesn't return to baseline after expected processing window, that's actionable. "Lag should be <1K messages by 10 AM, currently 5M messages at 11 AM" indicates the batch job failed or is slower than expected.
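A minimal sketch of that expected-window check, assuming a hypothetical 10 AM cutoff and a 1K-message post-window baseline:

```python
from datetime import datetime, time

def batch_lag_alert(current_lag: int, now: datetime,
                    window_end: time = time(10, 0),
                    expected_lag_after_window: int = 1_000) -> bool:
    """Alert only if lag has NOT drained back to baseline after the batch
    job's expected processing window (10 AM in this illustration)."""
    if now.time() < window_end:
        return False                          # overnight backlog is expected
    return current_lag > expected_lag_after_window

# 5M messages still pending at 11 AM: the batch job failed or is slow.
print(batch_lag_alert(5_000_000, datetime(2025, 1, 6, 11, 0)))  # True
```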
Transient spikes that self-correct within minutes don't warrant alerts. If partition reassignment causes 30 seconds of elevated latency but recovers automatically, alerting wakes engineers for no reason.
Use delay windows: alert only if condition persists for 5-10 minutes. This filters transient issues from sustained problems.
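One way to implement such a delay window is to require the breach to hold across every sample in a rolling window; the sketch below assumes samples arrive once a minute, so a window of 10 equals a 10-minute delay.

```python
from collections import deque

class PersistenceFilter:
    """Fire only if the condition holds for every sample in the window."""
    def __init__(self, window: int = 10):
        self.samples = deque(maxlen=window)

    def observe(self, condition_true: bool) -> bool:
        self.samples.append(condition_true)
        return len(self.samples) == self.samples.maxlen and all(self.samples)

# A 30-second spike followed by recovery never fills the window with True,
# so it never pages.
f = PersistenceFilter(window=10)
print(any(f.observe(breached) for breached in [True, False] + [False] * 8))  # False
```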
Rate-of-Change Detection
Static thresholds fail because "normal" varies by time, load, and consumer. Rate-of-change detection alerts on deviations from baseline, not absolute values.
Baseline calculation: Measure p50, p95, p99 for each metric over the last 7-30 days. This establishes normal ranges accounting for daily/weekly patterns.
Deviation detection: Alert when current value exceeds baseline by a multiplier (3x p95, 5x p99) sustained for a duration (10 minutes).
Example: Consumer lag baseline (p95) is 50K messages. Current lag is 200K messages (4x baseline) and growing. This deviates significantly from normal, warranting investigation.
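A minimal sketch of baseline-plus-deviation detection using only the Python standard library (the lag history is fabricated for illustration); in practice you would combine it with a persistence window like the one above so the deviation must be sustained.

```python
import statistics

def baseline_p95(history: list[int]) -> float:
    """p95 of the metric over the lookback window (7-30 days of samples)."""
    return statistics.quantiles(history, n=20)[18]   # 95th percentile cut point

def deviates(current: float, p95: float, multiplier: float = 3.0) -> bool:
    return current > multiplier * p95

# Illustrative lag history: normally 20K-60K messages.
history = [20_000, 35_000, 42_000, 50_000, 55_000, 60_000, 30_000, 45_000] * 30
print(deviates(200_000, baseline_p95(history)))   # well above 3x the p95 baseline -> True
```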
After switching to rate-of-change detection, teams report a dramatic reduction in alert noise: "We had 3 alerts the next month—all real incidents." This approach cuts false positives significantly compared to static thresholds.
Trending-based alerts catch growing problems before they become crises. "Consumer lag growing 10K msg/min for 20 minutes" predicts that lag will reach millions within hours. Alerting on the trend allows intervention before SLA breach.
Static threshold alerting waits until lag exceeds threshold (potentially hours after problem started). Trend alerting catches problems early.
Threshold Tuning and Feedback Loops
Thresholds should evolve based on alert outcomes. Regularly review and tune thresholds to prevent alert fatigue and ensure incidents are meaningful.
Alert post-mortems: For every alert, record: was action taken? If yes, what was fixed? If no, why was it a false positive?
Track false positive rate: alerts that fired but required no action. Target: under 10% false positive rate. If 30% of alerts are false positives, thresholds need tuning or alerts should be removed.
Threshold adjustment: If an alert fires frequently but never requires action, raise the threshold or add conditions. If an alert rarely fires but misses real incidents, lower the threshold.
Example: "Consumer lag > 100K" fires daily during peak traffic, but lag always returns to baseline within 30 minutes. This is noise. Adjust to: "Consumer lag > 100K for 30+ minutes" or "Consumer lag exceeding 3x p95 baseline."
Feedback from on-call: On-call engineers are the primary consumers of alerts. Survey them monthly: which alerts are useful? Which are noise? What incidents occurred without alerts?
Useful alerts get preserved. Noise alerts get tuned or removed. Missed incidents reveal gaps where new alerts are needed.
Alert Routing and Escalation
Good routing gets alerts to the team that can fix them. Bad routing pages everyone, diffusing responsibility.
Ownership-based routing: Consumer lag for orders-processor routes to the team that owns orders-processor (defined in the application catalog). Under-replicated partitions route to platform team that manages cluster health.
This requires ownership metadata: which team owns which consumer? Which team is on-call for cluster infrastructure?
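A sketch of ownership-based routing against a hypothetical application-catalog export; the team names and alert types are placeholders.

```python
# Hypothetical ownership metadata, e.g. exported from an application catalog.
CONSUMER_OWNERS = {"orders-processor": "team-orders", "billing-sync": "team-billing"}
PLATFORM_TEAM = "team-kafka-platform"

def route(alert_type: str, consumer_group: str | None = None) -> str:
    """Route symptom alerts to the owning application team and infrastructure
    alerts to the platform team. Unknown owners fall back to the platform team
    (and should be flagged as a catalog gap)."""
    if alert_type in {"under_replicated_partitions", "broker_offline"}:
        return PLATFORM_TEAM
    return CONSUMER_OWNERS.get(consumer_group, PLATFORM_TEAM)

print(route("consumer_lag", "orders-processor"))     # team-orders
print(route("under_replicated_partitions"))          # team-kafka-platform
```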
Escalation paths: If an alert isn't acknowledged within 15 minutes, escalate to team lead or secondary on-call. If critical alert (data loss risk) isn't acknowledged within 5 minutes, escalate immediately.
Escalation prevents alerts from being ignored during incidents where primary on-call is overwhelmed or unavailable.
Severity levels: Not all alerts warrant waking someone at 3 AM. Setting the right consumer lag thresholds is the foundation for deciding which tier an alert falls into.
- Critical (page immediately): Under-replicated partitions, offline brokers, consumer lag exceeding SLA by 10x
- High (alert during business hours, page after hours): Consumer lag exceeding SLA by 3x, p99 latency 3x baseline
- Medium (alert, don't page): Disk usage >70%, consumer lag 2x baseline
- Low (log, don't alert): Metrics within normal range, informational events
Integrate alerts with on-call systems like PagerDuty or Slack, ensuring critical information—such as broker ID, topic name, and exact metric value—is included.
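A sketch of how those tiers might be encoded, using the consumer-lag cutoffs listed above as illustrative thresholds:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "page_immediately"
    HIGH = "page_after_hours_only"   # alert channel during business hours
    MEDIUM = "alert_channel_only"
    LOW = "log_only"

def classify_lag(lag: int, sla_lag: int, baseline_p95: int) -> Severity:
    # Thresholds mirror the tiers listed above; tune them per consumer group.
    if lag > 10 * sla_lag:
        return Severity.CRITICAL
    if lag > 3 * sla_lag:
        return Severity.HIGH
    if lag > 2 * baseline_p95:
        return Severity.MEDIUM
    return Severity.LOW

print(classify_lag(lag=600_000, sla_lag=100_000, baseline_p95=50_000))  # Severity.HIGH
```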
Runbooks and Alert Context
Alerts without guidance waste time. An on-call engineer gets paged with "consumer lag high" and has to figure out: which consumer? What's normal lag? What should I check?
Runbooks linked from alerts provide step-by-step investigation and remediation. When consumer lag alert fires, runbook says:
- Check consumer pod logs for errors
- Check downstream dependencies (database, APIs) for slowness
- Check partition assignment (are all partitions assigned?)
- Scale consumer instances if processing can't keep up
- If still unclear, escalate to team X
Good runbooks turn 1-hour investigations into 10-minute resolutions.
Alert context includes data needed for triage: current value, baseline value, time since threshold exceeded, affected resources (topics, brokers, consumer groups), recent changes (deployments, config changes, schema updates).
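As an illustration of what a complete alert payload could carry, the dataclass below bundles those fields; the names and the runbook URL are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AlertContext:
    """Everything the on-call engineer needs to triage without digging."""
    metric: str                       # e.g. "consumer_lag"
    current_value: float
    baseline_p95: float
    breached_since: datetime          # how long the threshold has been exceeded
    affected: dict[str, str]          # topic, broker, consumer group, ...
    recent_changes: list[str] = field(default_factory=list)  # deploys, configs, schemas
    runbook_url: str = ""             # link straight to the investigation steps

ctx = AlertContext(
    metric="consumer_lag",
    current_value=1_200_000,
    baseline_p95=50_000,
    breached_since=datetime(2025, 1, 6, 9, 45),
    affected={"consumer_group": "orders-processor", "topic": "orders"},
    recent_changes=["orders-processor v2.4.1 deployed 09:30"],
    runbook_url="https://runbooks.example.com/kafka/consumer-lag",  # placeholder URL
)
```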
As a starting point, aim for 10-15 core metrics. This focused baseline captures the majority of operational issues while keeping monitoring overhead and on-call fatigue under control.
Measuring Alerting Effectiveness
Track alert quality through three metrics: false positive rate, MTTR (mean time to resolution), and on-call satisfaction.
False positive rate measures alerts that fire but don't require action. Target: under 10%. If 30% of alerts are false positives, alerting quality is poor.
MTTR measures time from alert to resolution. Good alerting with runbooks reduces MTTR because engineers know what to check and how to fix it.
On-call satisfaction is subjective but critical. Survey on-call engineers: do they trust alerts? Are runbooks helpful? What would improve on-call experience?
Low satisfaction indicates alert fatigue. High satisfaction indicates effective alerting.
The Path Forward
Kafka alerting isn't about monitoring everything—it's about alerting only when intervention is needed. Symptom-based alerts, rate-of-change detection, ownership-based routing, and comprehensive runbooks reduce alert fatigue while catching real incidents.
Conduktor provides contextual alerting based on trends (not absolute thresholds), ownership-based routing (alerts go to teams that can fix them), and integration with PagerDuty, Slack, and OpsGenie. Teams get actionable alerts with context, not noise.
If your on-call engineers ignore most alerts, the problem isn't sensitivity—it's alerting design.
Related: Consumer Lag Thresholds · Multi-Team Kafka Alerting · Kafka Monitoring