Kafka Costs: A Better Conversation
Where Kafka costs actually hide and how to recover them. The opening post in our Kafka cost optimization series.

Kafka cost information is hard to come by. What is out there tends to split into two buckets: the first is a catalogue of generic optimization tips that may or may not apply to a given estate, and the second is guidance tied to a specific managed provider or a single product decision. Unfortunately, neither offers much help to a team trying to understand the shape of their own costs and what can be done about them.
As a result, a lot of teams we talk to have arrived at one of two conclusions: either rising Kafka spend is just the cost of running Kafka at scale, or the only real lever is switching to a different provider. Neither is quite right. The Kafka cost drivers in an estate are unusually context-dependent, and the right path forward depends on where the spend is coming from, what the estate looks like today, and what the team has capacity to take on. This post, and the series it opens, is built from what our field team has learned through real cost conversations with platform engineering leaders, inside Kafka environments that have been running for years.
It is worth being clear about what this is not. A cost conversation is not about penalizing Kafka usage or pushing back on new use cases. The goal is to keep usage growing in a healthy way, where the spend tracks the value the platform delivers to the business.
Before going further, it helps to be explicit about what makes up the total cost of ownership of a Kafka estate, since the conversation often collapses into "the bill" when the real picture is broader. In practice, four layers contribute:
- Infrastructure. Compute for brokers and controllers, storage multiplied by replication factor, and networking. On hosted platforms, capacity is usually sold in units sized by throughput, partition count, and storage limits, with cross-AZ or cross-region data transfer billed separately. On self-managed estates, the same drivers surface as servers, storage arrays, datacenter capacity, and network gear.
- Ecosystem tooling. Schema Registry, Kafka Connect and its connectors, stream processing engines (Kafka Streams, ksqlDB, Flink), cross-cluster replication, monitoring, and management interfaces. On hosted platforms many of these are paid add-ons; on self-managed estates they consume internal compute and operational time.
- Vendor and licensing. Platform licenses (Confluent Platform, Cloudera) for self-managed estates, tier surcharges and feature add-ons (RBAC, audit logs, private networking) for hosted ones, plus support contracts and professional services on either side.
- Operational. Engineering time spent running clusters, responding to incidents, and supporting internal consumers. Self-managed estates carry the full broker operations burden, including patching, capacity planning, and hardware lifecycle; hosted estates trade that for more time on cost analysis, configuration management, and vendor coordination.
Each of these is a real line item in any Kafka environment, and any serious cost conversation has to acknowledge that Kafka is rarely just "the brokers." In the estates we look at closely, though, the largest savings opportunity sits on the infrastructure side, and that is where this series will focus. We will touch on the others where the picture connects, but infrastructure is the primary lens.
With that frame in place, the rest of the post tackles three questions in order: where the savings hide, how to capture them, and how to keep them from eroding over time.
Where the savings actually hide
Most platform leaders have a rough sense that there is room to optimize their Kafka estate. What we have found when we map where the spend is actually coming from is that the size of the opportunity tends to be larger than the initial guess. In the estates we analyze closely, it is typical to find that 25 to 40 percent of the infrastructure bill is recoverable, much of that on non-prod clusters where there is low risk to the business.
A note on the 25 to 40 percent figure. This is not a claim that all unused or oversized capacity in your Kafka bill is pure waste. Workloads that are large but genuinely well sized do exist, and a healthy setup will always have some level of overprovisioning to allow for flex or growth. The 25 to 40 percent is the share that can be safely recovered through config changes, retirement, consolidation, and replatforming where warranted.
Within infrastructure, the patterns where the savings tend to concentrate are:
- Partition overprovisioning. Large partition counts that are not matched by proportional throughput. Partitions may have been scaled for a prod load profile on a non-prod environment, or inherited from an earlier architecture and never revisited (the sketch after this list shows one way to surface these).
- Retention settings not matched to consumer needs. Long retention on topics that no one reads past the first few hours, or uniform retention policies applied across topics with very different actual consumption patterns.
- Cluster sprawl. Numerous but underutilized clusters. These are often stood up for a specific team or project as a way of ensuring isolation and control, but become costly as the number of Kafka users grows.
- Topic proliferation and duplication. Topics created for experiments, migrations, or testing that were never decommissioned, alongside duplicate topics created in parallel when teams cannot easily find an existing topic that already serves the same need. Each carries its own replication and retention cost.
- Inefficient client patterns. Egress traffic that runs many times higher than ingress because of redundant consumer groups, unnecessary fan-out, or misconfigured clients. These patterns often go unnoticed without dedicated monitoring of producer-to-consumer flows.
- Static capacity per resource. Each topic, cluster, and connection holds dedicated broker capacity at all times, whether it is at peak load or sitting idle. Across an estate with hundreds of resources on different load profiles, the sum of those dedicated allocations runs well ahead of true aggregate demand, with no built-in way to pool underlying capacity across them.
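To make the first two patterns concrete, here is a minimal audit sketch using the confluent-kafka Python client's AdminClient. The bootstrap address and both thresholds are placeholders rather than recommendations; what counts as too many partitions or too much retention depends entirely on your own workloads, and the output is a list of topics to investigate, not a list of things to change.

```python
# Minimal audit sketch: flag topics whose partition count or retention looks
# out of proportion. Thresholds and the bootstrap address are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

BOOTSTRAP = "localhost:9092"            # placeholder: your cluster here
MAX_PARTITIONS = 12                     # flag topics above this partition count
MAX_RETENTION_MS = 72 * 60 * 60 * 1000  # flag retention beyond 72 hours

admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
metadata = admin.list_topics(timeout=10)

# Skip internal topics such as __consumer_offsets
topics = [t for t in metadata.topics if not t.startswith("__")]
resources = [ConfigResource(ConfigResource.Type.TOPIC, t) for t in topics]

for resource, future in admin.describe_configs(resources).items():
    configs = future.result()
    topic = resource.name
    partitions = len(metadata.topics[topic].partitions)
    retention_ms = int(configs["retention.ms"].value)

    if partitions > MAX_PARTITIONS:
        print(f"{topic}: {partitions} partitions -- is the throughput really there?")
    if retention_ms == -1 or retention_ms > MAX_RETENTION_MS:
        print(f"{topic}: retention.ms={retention_ms} -- does anyone read that far back?")
```

The same loop extends naturally to cleanup policies, segment settings, or whatever else your estate tends to drift on.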
In the conversations we've had, the platform team generally walks into a cost review assuming things are operating relatively efficiently and that there is not much opportunity for improvement. Once we run the analysis, however, at least one or two of these patterns show up, and show up in a big way.
It may be that the same large partition count is used both for critical prod workflows and for dev topics that carry virtually no load. It may be that the one-cluster-per-team pattern, which made sense for isolation in an on-prem environment, was never revisited after the move to a hosted one. It could be that years of development have left behind a large number of dead or duplicate topics that were never cleaned up for lack of visibility and ownership. Or it may be that egress on a topic is running 30 times higher than ingress because of a misconfigured client that nobody has caught.
In all of these cases, the critical factor is that Kafka grows by accretion. Each topic, cluster, and job was added for a good reason at the time, but there is no built-in mechanism for teams to zoom out, look at the cumulative effect, and evaluate whether previous choices still make sense. That is not a failure of the team, but a reflection of how platforms of this shape tend to evolve.
Three approaches to Kafka cost optimization
Once the shape of the cost is clear, the right response depends on what is driving it. Some opportunities are genuinely quick. Others are not, and treating them as if they should be is a reliable way to burn effort. What stretches the timeline on the larger ones is usually not the technical work itself but the coordination across teams and the need to keep producers and consumers running through the change. In broad terms, we think about the response in three categories.
Updating defaults for new loads. A lot of waste accumulates because new topics, clusters, and clients are created without policies to keep them from inheriting it. Setting better defaults stops the bleeding and puts a foundation under everything else. The work typically includes:
- Sensible partition defaults so new topics do not inherit a peak-load count
- Default retention policies aligned with how consumers actually read
- Enforcement of client-side compression
- Default cluster tier choices for new use cases
- Basic ownership and cataloguing requirements at topic creation
Most of this is low-coordination because it does not touch live workloads, and a couple of weeks is usually enough to put the policies in place. On its own this work does not produce immediate savings, but it slows the trajectory of future growth and keeps the gains from the other two categories from eroding over time.
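As one illustration of what those defaults can look like in practice, here is a sketch of a thin topic-creation helper built on the confluent-kafka AdminClient. The partition count, retention value, topic name, and owner label are all hypothetical; the real values should come from your own policy, and the ownership record would normally land in a catalogue rather than a print statement.

```python
# Sketch of a topic-creation helper that applies reviewed defaults.
# All specific values below are illustrative, not recommendations.
from typing import Optional

from confluent_kafka.admin import AdminClient, NewTopic

DEFAULT_CONFIG = {
    "retention.ms": str(24 * 60 * 60 * 1000),  # 24h unless consumers need more
    # "producer" keeps whatever compression the client used; making clients
    # actually compress is a producer-side config and policy question.
    "compression.type": "producer",
}

def create_topic(admin: AdminClient, name: str, owner: str,
                 partitions: int = 3,
                 extra_config: Optional[dict] = None) -> None:
    """Create a topic with explicit defaults and a recorded owner."""
    config = {**DEFAULT_CONFIG, **(extra_config or {})}
    topic = NewTopic(name, num_partitions=partitions,
                     replication_factor=3, config=config)
    # Kafka has no native owner field, so ownership is recorded out of band;
    # in a real setup this would go to a catalogue, not stdout.
    print(f"creating {name}: owner={owner}, partitions={partitions}, config={config}")
    admin.create_topics([topic])[name].result()  # raises if creation fails

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder
create_topic(admin, "orders.events.v1", owner="payments-team")
```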
Optimizing existing loads. This is the hygiene and rightsizing work on what is already running:
- Tuning retention where it has drifted away from what consumers actually need
- Retiring topics that no longer have active producers or consumers
- Right-sizing partition counts on existing topics
- Consolidating clusters that have ended up underutilized
The technical work itself is rarely the hard part. What stretches the timeline is the coordination required to touch live workloads without disruption, particularly for partition right-sizing, which means recreating topics and coordinating the switch with every producer and consumer. Realistic timelines are a few weeks to a few months depending on scope.
In the estates we look at, this work typically moves the infrastructure bill by 10 to 20 percent, with reductions of 50 percent or more possible in estates with particularly poor hygiene and significant numbers of old or unused topics.
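For the retention piece specifically, the change itself is a single config call; a sketch with the confluent-kafka AdminClient is below. The topic name and the six-hour target are placeholders, and the time-consuming part is not this call but confirming with every consumer of the topic that nothing reads further back before you make it.

```python
# Sketch of tuning retention on an existing topic. The topic name and target
# retention are placeholders; confirm consumer read patterns before changing
# anything on a live workload.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder

topic = "orders.events.v1"                  # hypothetical topic
new_retention_ms = str(6 * 60 * 60 * 1000)  # 6 hours

resource = ConfigResource(ConfigResource.Type.TOPIC, topic,
                          set_config={"retention.ms": new_retention_ms})
# Note: the non-incremental alter_configs call replaces the topic's full set of
# non-default overrides, so include any other overrides you want to keep.
# Newer client and broker versions offer incremental_alter_configs instead.
futures = admin.alter_configs([resource])
futures[resource].result()  # raises if the change was rejected
print(f"{topic}: retention.ms set to {new_retention_ms}")
```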
Rethinking workloads. This is the response when the current shape of the estate fundamentally does not fit, either because data flows have outgrown the architecture or because the platform itself no longer matches the workload. It covers things like:
- Reshaping fan-out patterns that have become inefficient as consumer counts grew
- Moving from dedicated to pooled capacity at the topic and cluster layer
- Rethinking how teams share infrastructure
- Moving to a different platform altogether, where warranted
This category is the most variable of the three in both timeline and outcome. The savings can be the largest in absolute terms, particularly in large estates where pooled capacity and shared infrastructure offset the inefficiencies of dedicated allocation. It also tends to help most with production workloads, where teams reasonably optimize for risk mitigation over efficiency and need an architectural change rather than a configuration tweak to make progress.
Timelines depend heavily on how much migration tooling is in place, how much cross-team coordination is required, and whether the goal is a wholesale change or a focused effort on a few key areas.
The three response categories at a glance: updating defaults preserves gains, optimizing existing loads delivers the most reliable near-term reductions, and rethinking workloads offers the largest absolute savings in big estates.
The value of spending a little time up front on analysis is that it lets a team see where each approach will have the most impact for their estate and sequence the work accordingly. Trying everything at once tends to produce motion without outcome.
Staying efficient is something you can design for
A fair question we hear is whether cost work sticks once the initial cleanup is done. It can, but it does not automatically.
Defaults and policies are necessary but not sufficient. Even with sensible defaults in place, new edge cases find their way around them. A topic gets a one-time retention bump that becomes permanent. A cluster gets spun up for a project and is forgotten when the project ends. Without an ongoing discipline of review, the same forces that drove the original accretion start operating again, and over 12 to 18 months a meaningful share of the savings tends to erode.
Teams that hold the gains pair their cleanup with lightweight, ongoing governance. The pattern usually involves three things, the most foundational being visibility: an accounting of what the estate contains, what it costs, and which applications and teams are driving that cost. Without spend that maps to how the business is structured rather than how the infrastructure is shaped, the question "where is the money going" has no honest answer. The other two pieces are clear ownership of topics, clusters, and policies, and a regular cadence for reviewing what is actually being used. None of this is heavyweight, and we will come back to the visibility piece specifically later in the series.
Efficiency is a capability you can build into an estate, not only a moment in time.
What comes next
This is the start of a series of posts on Kafka cost savings.
Over the coming weeks we will go deeper on each of the three ideas above, starting on the opportunity side with the Kafka cost drivers we see most often. These are worth checking in your own environment and most teams find that one or two are where the real money is hiding. Later in the series we will return to the structural angle introduced here with a post on topic concentration and virtualization, which is where the largest estates tend to find their next round of savings.
If Kafka costs are a conversation in your organization right now, follow along to learn more, or reach out to us for help evaluating the potential cost savings in your own Kafka estate.
Want to see how much you could save on your Kafka bill?
Get a free Kafka cost analysis with our field engineering team. We will walk through your estate together, identify the waste patterns that apply, and give you a concrete estimate of where the savings are.