Consumer Lag Alerts: Setting Thresholds That Don't Cry Wolf

Stop alert fatigue from consumer lag metrics. Offset vs time-based lag, per-workload thresholds, and rate-of-change detection.

Stéphane Derosiaux · December 8, 2025

Every Kafka operator has the same experience: lag alerts fire at 3 AM, you scramble to investigate, and it's nothing. A batch job ran. A consumer scaled down briefly. Traffic spiked.

Meanwhile, actual issues get lost in the noise.

I've tuned lag alerting for dozens of teams. The problem isn't monitoring—it's that most lag alerting strategies are fundamentally flawed.

We had 47 lag alerts in one month. Two were real. After switching to rate-of-change detection, we had 3 alerts the next month—all real incidents.

SRE at an e-commerce platform

Why Offset-Based Lag Fails

The default approach: alert when lag exceeds 10,000 messages.

Group               Throughput        Offset Lag   Time Lag
payment-processor   100 msg/sec       500          5 seconds
analytics-etl       10,000 msg/sec    500          50 ms

Same offset lag. Completely different severity. The payment processor is 5 seconds behind. The analytics pipeline caught up before you finished reading the alert.

Root cause: Offset lag is production-rate-dependent. The same threshold can't work across workloads.

Time-Based Lag: The Better Metric

Time lag answers the question that matters: "How far behind real-time is this consumer?"

Time lag requires additional tooling—Burrow, custom instrumentation, or managed services (Confluent Cloud and MSK expose EstimatedTimeLag).

Tradeoff: Time-based lag requires setup. Offset lag is available out-of-the-box but less meaningful.
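
If your stack only exposes offset lag, you can approximate time lag as offset lag divided by the consumer's recent consumption rate. A minimal PromQL sketch, where both metric names are placeholders that depend on your exporter:

# Rough time lag in seconds: outstanding messages / recent consumption rate.
# Both metric names are placeholders; substitute what your exporter provides.
# Dividing by a zero rate yields +Inf, which itself signals a stalled consumer.
sum by(group) (kafka_consumer_group_lag)
/
sum by(group) (rate(kafka_consumer_records_consumed_total[5m]))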

Rate-of-Change: The Missing Signal

Static thresholds fail because lag naturally fluctuates. What matters is the trend.

Healthy: Lag spikes during batch jobs, then recovers.

Unhealthy: Lag increases steadily over hours.

# Alert when lag is high AND still growing
kafka_consumer_group_lag > 10000
and
deriv(kafka_consumer_group_lag[15m]) > 0

Alert on high lag AND positive growth rate. The deriv() function catches sustained increases while ignoring temporary spikes during deployments. Conduktor provides built-in alerting that handles these patterns out of the box.
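
If you export a time-lag gauge (see the previous section), the same rate-of-change pattern applies to it directly. A sketch assuming a hypothetical kafka_consumer_group_time_lag_seconds metric:

# Behind by more than a minute AND still falling further behind.
# kafka_consumer_group_time_lag_seconds is a hypothetical metric name.
kafka_consumer_group_time_lag_seconds > 60
and
deriv(kafka_consumer_group_time_lag_seconds[15m]) > 0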

Per-Workload Thresholds

Different workloads need different thresholds:

Consumer Group      Time Lag Warning   Time Lag Critical
payment-processor   30s                2m
fraud-detection     10s                30s
analytics-etl       10m                30m

Calculate offset thresholds from time targets:
offset_threshold = target_time × throughput × safety_margin

Payment processor at 100 msg/sec with 2-minute SLO:

offset_critical = 120 s × 100 msg/s = 12,000 messages (before applying the safety margin)
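
In plain PromQL, one way to encode per-workload targets is one selector per consumer group. A sketch using the same hypothetical time-lag gauge and the critical values from the table above:

# Critical thresholds per group (hypothetical metric name, values from the table)
kafka_consumer_group_time_lag_seconds{group="payment-processor"} > 120
or
kafka_consumer_group_time_lag_seconds{group="fraud-detection"} > 30
or
kafka_consumer_group_time_lag_seconds{group="analytics-etl"} > 1800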

Partition-Level Alerting

Aggregated alerts hide problems. A consumer group with 10 partitions can show average lag of 1,000 while one partition has lag of 10,000.

# Alert when ANY partition exceeds threshold
max by(group, topic) (kafka_consumer_group_partition_lag) > 10000
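
A complementary check is to flag a single partition that is badly out of line with its siblings even when absolute lag looks modest. A sketch with an illustrative 5x ratio and a small floor to avoid firing on trivial lag:

# One partition lagging roughly 5x worse than the group average (illustrative ratio);
# the +100 floor keeps this quiet when overall lag is negligible.
max by(group, topic) (kafka_consumer_group_partition_lag)
>
5 * (avg by(group, topic) (kafka_consumer_group_partition_lag) + 100)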

Composite Alerts

Lag is a symptom. Correlate with other signals:

# Alert: Lag high AND consumer not fetching
kafka_consumer_group_lag > 10000
and
rate(kafka_consumer_fetch_manager_records_consumed_total[5m]) == 0

This catches stuck consumers that static thresholds miss.
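
Another useful composite pairs lag with group membership, separating "consumers are slow" from "consumers are gone". A sketch assuming a member-count metric; kafka_consumer_group_members is a placeholder name:

# Alert: lag high AND the group has no active members (placeholder metric name)
kafka_consumer_group_lag > 10000
and on(group)
kafka_consumer_group_members == 0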

Common Root Causes by Pattern

Pattern                            Likely Cause
Sudden spike, all partitions       Producer burst
Gradual increase, all partitions   Consumer slowdown
One partition stuck                Consumer crash
Periodic spikes                    Batch jobs, GC pauses
Spike after deploy                 Rebalance

The goal isn't to monitor lag. It's to know when customers are affected before they notice.

Book a demo to see how Conduktor Console provides opinionated lag alerting with team ownership and threshold tuning built in.