
Whitepaper

The Kafka Debugging Playbook: Six Phases from Alert to Root Cause

A Kafka troubleshooting and debugging framework for developers, SREs, and DevOps teams. Three production incidents walked through from alert to root cause, covering consumer lag, Kafka Connect, Schema Registry, offset resets, and reducing MTTR.


This whitepaper is the culmination of our three-part blog series on Kafka debugging. If you want the practitioner-level walkthroughs first, start here:

  1. Why Every Kafka Incident Ends with "Restart It"
  2. Houston, We Have 7 CLI Tools and Zero Answers
  3. Ground Control to On-Call: Kafka Debugging Has Landed

The alert fires. The Kafka monitoring dashboard confirms the symptom. Then the investigation stalls, because Kafka observability and Kafka troubleshooting are different capabilities, and most teams running Kafka in production only have the first one.

Monitoring tells you that consumer lag is growing. Debugging tells you why: which partition is stuck, what message is at the stuck offset, whether the schema changed, whether a connector task died. These are the questions that actually resolve a Kafka incident and reduce MTTR. The native Kafka CLI tools and dashboards make every one of them painful.

Here's what a typical investigation looks like with the tools Kafka ships:

| Question | Tool | The catch |
| --- | --- | --- |
| Which consumer group is lagging? | kafka-consumer-groups.sh --describe | Raw offset numbers. No timestamps, no growth rate, no history. |
| Which instance owns the stuck partition? | kafka-consumer-groups.sh --members --verbose | Same tool, different flags, completely different output format. You cross-reference by hand. |
| What's in the stuck messages? | kafka-console-consumer.sh | Binary garbage if the topic uses Avro. Need a separate tool from a separate distribution. |
| Did the schema change? | curl (Schema Registry REST API) | No CLI. Two API calls, pipe to files, run diff. Compatibility check needs escaped JSON nobody types correctly. |
| Is the connector healthy? | curl (Connect REST API) | No CLI. Different API, different port. A "RUNNING" connector can have failed tasks buried in the response. |
| What happened an hour ago? | Nothing | The CLI shows the present. If you didn't already have Prometheus and Grafana set up, there's no history. |
Six questions, five different tools, no shared context between any of them. You carry the thread in your head.
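Strung together, that investigation is a series of disconnected commands. The sketch below uses hypothetical broker, group, topic, subject, and connector names (payments-consumer, payments, pg-source, and the host/port values) purely for illustration; only the tools and endpoints themselves come from the table above.

```shell
BOOTSTRAP=broker:9092   # hypothetical broker address

# 1. Which consumer group is lagging? Raw offsets, no history.
kafka-consumer-groups.sh --bootstrap-server "$BOOTSTRAP" \
  --describe --group payments-consumer

# 2. Which instance owns the stuck partition? Same tool, different
#    flags, different output format; cross-reference by hand.
kafka-consumer-groups.sh --bootstrap-server "$BOOTSTRAP" \
  --describe --group payments-consumer --members --verbose

# 3. What's at the stuck offset? Binary garbage if the topic is Avro.
kafka-console-consumer.sh --bootstrap-server "$BOOTSTRAP" \
  --topic payments --partition 3 --offset 1294 --max-messages 1

# 4. Did the schema change? Two REST calls, then a manual diff.
curl -s http://schema-registry:8081/subjects/payments-value/versions/2 > v2.json
curl -s http://schema-registry:8081/subjects/payments-value/versions/3 > v3.json
diff v2.json v3.json

#    Compatibility check: the candidate schema must be escaped into a
#    JSON string inside the request body.
curl -s -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema":"{\"type\":\"string\"}"}' \
  http://schema-registry:8081/compatibility/subjects/payments-value/versions/latest

# 5. Is the connector healthy? Different API, different port, and task
#    state is nested below the top-level connector state.
curl -s http://connect:8083/connectors/pg-source/status
```

Nothing carries context between steps: the partition number found in step 1 is retyped into step 3, and the subject name in step 4 is guessed from the topic name.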

These tools were built in 2011 when Kafka was an internal LinkedIn project. The assumptions were reasonable then. They're not anymore.

| 2011 assumption | 2026 reality |
| --- | --- |
| One team runs Kafka | Dozens of teams produce and consume |
| Kafka admins debug Kafka | DevOps, SREs, and application developers debug Kafka |
| Debugging = cluster operations | Debugging = understanding application data |
| Plain text messages | Avro, Protobuf, JSON Schema with evolution |
| Few topics, few consumer groups | Hundreds of topics, complex dependency graphs |
The result is predictable. When the cost of a proper investigation exceeds the cost of restarting the consumer, the consumer gets restarted. The symptom clears. The root cause goes unidentified. And the same class of incident repeats.

Regardless of whether the root cause is a poison pill, a dead connector task, a schema mismatch, or a slow consumer, the shape of the investigation is the same. The details change. The sequence of questions doesn't.

  1. Detect: What's wrong? Where do I start?
  2. Scope: How bad is it? What's affected?
  3. Inspect: What does the data look like? What's at the stuck offset?
  4. Trace: Where did it come from? What changed upstream?
  5. Resolve: How do I fix it safely? What's the blast radius?
  6. Prevent: Why did this happen? How do I stop it repeating?

Why six phases and not three or ten? Because each one produces a specific output that the next phase needs as input:

| Phase | Output | Feeds into |
| --- | --- | --- |
| Detect | The consumer group and topic that are affected | Scope (where to look) |
| Scope | Which partitions, the group state, acute vs chronic | Inspect (what to look at) |
| Inspect | The actual data at the stuck offset | Trace (which direction to investigate) |
| Trace | The upstream cause (schema, connector, producer) | Resolve (what to fix and how) |
| Resolve | The immediate fix, applied safely | Prevent (what to change so it doesn't repeat) |
| Prevent | Alert tuning, config changes, code fixes | Next incident starts from a better baseline |
Skip Scope and you don't know which partition matters, so Inspect searches blind. Skip Inspect and you don't know whether the problem is in the data or the infrastructure, so Trace chases the wrong branch. Each phase narrows the search space for the next. The sequence isn't arbitrary.
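Scope's "acute vs chronic" call, for instance, is just lag measured twice. A minimal sketch, using invented numbers in the column layout of kafka-consumer-groups.sh --describe output (TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG):

```shell
# Two illustrative rows for the same partition, captured 60s apart.
snapshot_t0='payments 3 1294 2100 806'
snapshot_t1='payments 3 1294 2940 1646'

# Lag delta over the interval. Lag growing while CURRENT-OFFSET stays
# flat means the consumer is stuck, not merely slow -- which is what
# tells Inspect to look at the message sitting at that offset.
growth=$(printf '%s\n%s\n' "$snapshot_t0" "$snapshot_t1" |
  awk '{lag[NR]=$5} END {print lag[2]-lag[1]}')
echo "lag grew by $growth in 60s"
```

The raw tool gives you neither the second snapshot nor the subtraction; you do both by hand, which is exactly the friction the playbook is about.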

Here's what that looks like in practice. The whitepaper walks through three real incidents, each branching at a different phase...

Keep reading the full playbook

Three worked incidents with Kafka mechanics, terminal output, and CLI vs Console comparisons at every phase.
