Kafka Governance: Guardrails Over Guidelines
Kafka governance fails when policies live in wikis instead of code. Automated guardrails stop bad configurations before they reach production.

Governance isn't what's written in Confluence. It's what gets enforced in production.
In August 2025, PagerDuty experienced a nine-hour Kafka outage that silenced alerts for thousands of companies. The root cause wasn't a Kafka bug or infrastructure failure—it was a governance failure. Code review processes missed a critical error where a new Kafka producer was instantiated for every API request instead of reusing a single producer. The result: cascading failures that rejected 95% of events over a 38-minute period.
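This class of bug is mundane and easy to miss in review. The sketch below is illustrative, not PagerDuty's actual code: the broken version constructs a KafkaProducer per request, paying for new connections, buffers, and metadata fetches on every call, while the fix creates one producer and reuses it.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    // Broken: a new producer per request means new TCP connections,
    // new buffers, and new metadata fetches on every call.
    public void publishBroken(Properties props, String event) {
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", event));
        }
    }

    // Fixed: one producer, created once and reused for the
    // lifetime of the application.
    private final KafkaProducer<String, String> producer;

    public EventPublisher(Properties props) {
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(String event) {
        producer.send(new ProducerRecord<>("events", event));
    }
}
```

KafkaProducer is thread-safe, and sharing one instance across request handlers is the documented usage.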
This wasn't a technical problem. It was a process problem. The kind that documentation doesn't fix.
Most Kafka governance exists in three places: internal wikis describing best practices, Slack channels where people ask "what's the naming convention again?", and postmortem documents explaining what went wrong. None of these prevent the next incident. Real governance happens before configs reach production, not in the retrospective afterward.
The Governance Theater Problem
Governance theater looks productive but changes nothing. It's policies documented in Confluence that nobody reads, naming conventions that nobody follows, and review processes that rubber-stamp everything to avoid becoming a bottleneck.
The symptoms are consistent across organizations. Topic names that don't match the documented standard because "we were in a hurry and would fix it later." Retention policies set to 30 days for topics that get consumed within 30 seconds because "better safe than sorry." Replication factor 1 in production because "we'll increase it after the demo."
Later never comes. The quick hack becomes the permanent state. Technical debt compounds.
The problem isn't lack of documentation. Most teams have detailed governance policies. The problem is that enforcement happens manually, which means it doesn't happen consistently. As we explore in policy enforcement, policies need to live in code, not wikis. When creating a topic requires remembering seven different policies and manually validating them, mistakes slip through, not because engineers are careless but because humans are bad at consistency.
Research on common Kafka governance challenges shows organizations face "blurry borders" around ownership of data quality, schema governance, and data retention. These aren't technical gaps—they're organizational ones. When everyone is responsible, nobody is accountable.
Why Post-Incident Reviews Don't Prevent the Next Incident
Every production incident generates a postmortem. The postmortem identifies root causes, proposes action items, and updates documentation. Then the next incident happens anyway.
The pattern is predictable. Incident: topic created with insufficient partitions, causing consumer lag. Root cause: engineer didn't check partition count recommendations. Action item: add partition count guidelines to the wiki. Three months later: different team, same mistake, new incident.
Postmortems focus on what happened, not why it keeps happening. The answer is usually the same: governance that depends on humans remembering things fails when humans forget things.
Kafka production incidents stem from incorrect assumptions about delivery guarantees, offset management, and consumer behavior under failure. These are governance issues disguised as technical problems. The fix isn't better documentation—it's systems that prevent invalid assumptions from reaching production.
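Pinning those assumptions down explicitly is itself a policy decision. A sketch of producer settings made explicit rather than assumed; the values shown are one reasonable combination, not universal advice:

```java
import java.util.Properties;

public class SafeProducerConfig {
    public static Properties props() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");  // assumed address
        p.put("acks", "all");                 // wait for all in-sync replicas
        p.put("enable.idempotence", "true");  // no duplicates on retry
        p.put("retries", Integer.MAX_VALUE);  // retry transient broker errors
        return p;
    }
}
```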
The Shift: Guardrails Over Guidelines
Effective governance makes the right thing the only thing you can do.
Instead of documenting "topics should use naming pattern team.domain.entity", enforce it through policy: topic creation requests that don't match the pattern are rejected with a clear error message. Instead of recommending "replication factor 3 in production", make it impossible to create topics with RF less than 3 in production environments.
This isn't about preventing engineers from making decisions. It's about encoding organizational knowledge into systems so every engineer benefits from it, even if they've never read the wiki.
Policy-based governance has three components: validation, enforcement, and feedback.
Validation checks whether a requested change complies with policy. Policies are written as code, using a language like CEL (Common Expression Language), and evaluated against every resource creation or modification. Does this topic name match our convention? Does the retention period fall within acceptable bounds? Does the partition count make sense for expected throughput?
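In CEL, the naming check from that list might look like the line below; the resource field names are assumptions that depend on your control plane's resource model:

```
resource.metadata.name.matches('^[a-z]+\\.[a-z]+\\.[a-z]+$')
```

CEL expressions return a boolean, so a request can be treated as compliant exactly when every attached expression evaluates to true.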
Enforcement rejects non-compliant changes before they reach Kafka. The enforcement point is the API layer—whether that's a web console, CLI tool, or GitOps pipeline. A developer requests a topic. The request hits policy validation first. If it passes, the topic is created. If it fails, the request is rejected immediately with an explanation of what's wrong.
Feedback explains why a request was rejected and how to fix it. Generic errors like "policy violation" are useless. Specific errors like "topic name 'user-data' doesn't match required pattern 'team.domain.entity'. Example: 'platform.users.profile'" teach developers the policy while they work.
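Putting the three components together: below is a minimal sketch of a provisioning gate in Java, with a hand-rolled regex check standing in for a real CEL engine and the standard Kafka Admin API doing the actual creation. The policy thresholds and error text are assumptions.

```java
import java.util.List;
import java.util.regex.Pattern;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicGate {
    private static final Pattern NAME = Pattern.compile("^[a-z]+\\.[a-z]+\\.[a-z]+$");

    // Every creation request passes through this method; nothing
    // reaches the cluster unless the policy checks pass.
    public static void createTopic(Admin admin, String name, int partitions,
                                   short replicationFactor) throws Exception {
        if (!NAME.matcher(name).matches()) {
            throw new IllegalArgumentException(
                "topic name '" + name + "' doesn't match required pattern "
                + "'team.domain.entity'. Example: 'platform.users.profile'");
        }
        if (replicationFactor < 3) {
            throw new IllegalArgumentException(
                "Replication factor must be at least 3 in production. "
                + "Current value: " + replicationFactor);
        }
        admin.createTopics(List.of(new NewTopic(name, partitions, replicationFactor)))
             .all().get(); // block until the broker confirms creation
    }
}
```

In production the gate sits in the provisioning path itself, whether console, CLI, or pipeline, but the shape is identical: validate first, create second, fail fast with an actionable message.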
This creates a feedback loop where policies improve based on real usage. If 50% of topic requests fail because partition limits are too restrictive, the policy is wrong—not the requests. Adjust the policy, measure again, iterate.
Implementing Automated Policy Enforcement
Start with the policies that cause the most incidents. Topic naming conventions, replication factors, retention policies, and partition counts are good candidates. These are well-understood, easy to express as rules, and high-impact when enforced.
Define the policy: topics in production must have replication factor 3 or higher. Express it as code:

```
resource.spec.replicationFactor >= 3
```

Attach a custom error message:

```
Replication factor must be at least 3 in production environments.
Current value: ${resource.spec.replicationFactor}
```

Deploy the policy to your control plane. Now every topic creation request in production hits this validation. If RF is less than 3, the request fails immediately with the error message. No postmortem needed: the incident never happens.
Next, tackle schema governance. Schemas should evolve with compatibility guarantees: backward, forward, or full compatibility depending on use case. Instead of documenting this in a wiki, enforce it through Schema Registry compatibility modes and validation policies.
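With Confluent Schema Registry, for example, the compatibility mode can be pinned per subject through its REST config endpoint, after which the registry itself rejects incompatible schemas at registration time. A sketch, where the registry address and subject name are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaCompat {
    // Pin a subject to BACKWARD compatibility so any newly registered
    // schema must be readable by consumers on the previous version.
    public static void main(String[] args) throws Exception {
        String registry = "http://localhost:8081";        // assumed address
        String subject = "platform.users.profile-value";  // assumed subject
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create(registry + "/config/" + subject))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}
```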
Then layer on ownership policies. Application-based access management replaces ACL sprawl with structured permissions. Every topic must map to an owning team. Service accounts must map to applications, not individuals. Access control lists must be auditable—no wildcard grants, no unexplained permissions.
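An ownership rule can be expressed the same way as the earlier naming rule. In CEL, assuming topics carry labels in their metadata (again, field names depend on your resource model):

```
has(resource.metadata.labels.owner) && resource.metadata.labels.owner != ''
```

The same pattern extends to ACLs: reject any binding whose principal isn't a registered application, or whose resource pattern is a bare wildcard.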
Each policy reduces operational overhead. Platform teams stop answering "what's the naming convention?" because developers get instant feedback when they violate it. Security teams stop manual ACL audits because policies prevent unauthorized access at creation time.
Governance at Scale
Small teams can govern through conversation. Ten engineers can coordinate on Slack. Everyone knows who owns what. Standards emerge organically and violations are corrected through code review.
This model breaks at scale. Fifty teams across five clusters in three regions can't coordinate through Slack. Code review catches some issues but not all. Standards documented in wikis drift from reality as the organization evolves.
Governance at scale requires automation. Teams report 75% fewer governance violations after implementing automated policy enforcement—not because developers became more careful, but because policies prevent violations before they happen.
The governance maturity model has five stages:
Reactive: Fix problems after they cause incidents. Documentation is the output of postmortems.
Documented: Policies exist in wikis. Enforcement depends on people reading and remembering them.
Reviewed: Changes go through manual review. Platform teams are the bottleneck.
Automated: Policies enforced at the API layer. Invalid requests rejected instantly.
Continuous: Policies evolve based on usage patterns. Governance adapts as the organization scales.
Most organizations operate between documented and reviewed. The jump to automated requires tooling investment, but the payoff is immediate: fewer incidents, faster provisioning, reduced platform team load.
Measuring Governance Success
Good governance reduces two metrics: time to provision and incidents caused by policy violations.
Time to provision should decrease because developers don't wait for manual review. If policy allows it, the resource is created instantly. Manual review becomes the exception, not the default.
Policy violation incidents should trend toward zero. Topics created with wrong replication factors, schemas registered without compatibility checks, ACLs granted without ownership validation—these should disappear from postmortems.
Watch for leading indicators. Are developers getting instant feedback on policy violations instead of finding out in code review? Are policy errors descriptive enough that developers fix them without asking for help? Are policies being updated based on real usage patterns?
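These numbers fall out of the enforcement layer's audit trail. A sketch of the rejection trend, assuming each audit record carries a timestamp and an accepted-or-rejected outcome; the record shape is hypothetical:

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GovernanceMetrics {
    record AuditEvent(Instant at, boolean rejected) {}

    // Count policy rejections at the gate per month; a violation
    // rejected here never becomes a production incident.
    static Map<YearMonth, Long> rejectionsByMonth(List<AuditEvent> events) {
        return events.stream()
            .filter(AuditEvent::rejected)
            .collect(Collectors.groupingBy(
                e -> YearMonth.from(e.at().atZone(ZoneOffset.UTC)),
                TreeMap::new,
                Collectors.counting()));
    }
}
```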
The ultimate measure is cultural: governance shifts from "things we're supposed to do" to "things the platform won't let us do wrong." When that happens, governance stops being a tax on velocity and becomes infrastructure that enables it.
The Path Forward
Kafka governance doesn't fail because teams lack policies. It fails because policies aren't enforced where decisions happen.
Automated guardrails close this gap. Policies expressed as code, enforced at the API layer, with feedback that teaches while it protects. The result is fewer incidents, faster operations, and platform teams focused on building capabilities instead of preventing disasters.
Conduktor enables policy-based Kafka governance through CEL-based validation, custom error messages, and enforcement across all provisioning paths—Console, CLI, and GitOps—with full audit trails. Platform teams define the guardrails, developers operate within them, and violations are caught before they reach production.
If your postmortems keep identifying the same root causes, the problem isn't your engineers. It's your governance model.
Learn more: How Conduktor automates Kafka governance at scale →