Kafka Costs: A Better Conversation
Where Kafka costs actually hide and how to recover them. The opening post in our Kafka cost optimization series.

Kafka cost information is hard to come by. What is out there tends to split into two buckets: the first is a catalogue of generic optimization tips that may or may not apply to a given estate, and the second is guidance tied to a specific managed provider or a single product decision. Unfortunately, neither offers much help to a team trying to understand the shape of their own costs and what can be done about them.
As a result, a lot of teams we talk to have arrived at one of two conclusions: either rising Kafka spend is just the cost of running Kafka at scale, or the only real lever is switching to a different provider. Neither is quite right. The Kafka cost drivers in an estate are unusually context-dependent, and the right path forward depends on where the spend is coming from, what the estate looks like today, and what the team has capacity to take on. This post, and the series it opens, is built from what our field team has learned through real cost conversations with platform engineering leaders, inside Kafka environments that have been running for years.
It is worth being clear about what this is not. A cost conversation is not about penalizing Kafka usage or pushing back on new use cases. The goal is to keep usage growing in a healthy way, where the spend tracks the value the platform delivers to the business.
Before going further, it helps to be explicit about what makes up the total cost of ownership of a Kafka estate, since the conversation often collapses into "the bill" when the real picture is broader. In practice, four layers contribute:
- Infrastructure. Compute for brokers and controllers, storage multiplied by replication factor, and networking. On hosted platforms, capacity is usually sold in units sized by throughput, partition count, and storage limits, with cross-AZ or cross-region data transfer billed separately. On self-managed estates, the same drivers surface as servers, storage arrays, datacenter capacity, and network gear.
- Ecosystem tooling. Schema Registry, Kafka Connect and its connectors, stream processing engines (Kafka Streams, ksqlDB, Flink), cross-cluster replication, monitoring, and management interfaces. On hosted platforms many of these are paid add-ons; on self-managed estates they consume internal compute and operational time.
- Vendor and licensing. Platform licenses (Confluent Platform, Cloudera) for self-managed estates, tier surcharges and feature add-ons (RBAC, audit logs, private networking) for hosted ones, plus support contracts and professional services on either side.
- Operational. Engineering time spent running clusters, responding to incidents, and supporting internal consumers. Self-managed estates carry the full broker operations burden, including patching, capacity planning, and hardware lifecycle; hosted estates trade that for more time on cost analysis, configuration management, and vendor coordination.
Each of these is a real line item in any Kafka environment, and any serious cost conversation has to acknowledge that Kafka is rarely just "the brokers." In the estates we look at closely, though, the largest savings opportunity sits on the infrastructure side, and that is where this series will focus. We will touch on the others where the picture connects, but infrastructure is the primary lens.
With that frame in place, the rest of the post tackles three questions in order: where the savings hide, how to capture them, and how to keep them from eroding over time.
Where the savings actually hide
Most platform leaders have a rough sense that there is room to optimize their Kafka estate. What we have found when we map where the spend is actually coming from is that the size of the opportunity tends to be larger than the initial guess. In the estates we analyze closely, it is typical to find that 25 to 40 percent of the infrastructure bill is recoverable, much of that on non-prod clusters where there is low risk to the business.
A note on the 25 to 40 percent figure. This is not a claim that all unused or oversized capacity in your Kafka bill is pure waste. Workloads that are large but genuinely well sized do exist, and a healthy setup will always have some level of overprovisioning to allow for flex or growth. The 25 to 40 percent is the share that can be safely recovered through config changes, retirement, consolidation, and replatforming where warranted.
Within infrastructure, the patterns where the savings tend to concentrate are:
- Partition overprovisioning. Large partition counts that are not matched by proportional throughput. Partitions may have been scaled for a prod load profile on a non-prod environment, or inherited from an earlier architecture and never revisited (the sketch after this list shows one way to surface these).
- Retention settings not matched to consumer needs. Long retention on topics that no one reads past the first few hours, or uniform retention policies applied across topics with very different actual consumption patterns.
- Cluster sprawl. Numerous but underutilized clusters. These are often stood up for a specific team or project as a way of ensuring isolation and control, but become costly as the number of Kafka users grows.
- Topic proliferation and duplication. Topics created for experiments, migrations, or testing that were never decommissioned, alongside duplicate topics created in parallel when teams cannot easily find an existing topic that already serves the same need. Each carries its own replication and retention cost.
- Inefficient client patterns. Egress traffic that runs many times higher than ingress because of redundant consumer groups, unnecessary fan-out, or misconfigured clients. These patterns often go unnoticed without dedicated monitoring of producer-to-consumer flows.
- Static capacity per resource. Each topic, cluster, and connection holds dedicated broker capacity at all times, whether it is at peak load or sitting idle. Across an estate with hundreds of resources on different load profiles, the sum of those dedicated allocations runs well ahead of true aggregate demand, with no built-in way to pool underlying capacity across them.
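To make the first two patterns concrete, here is a minimal audit sketch using the confluent-kafka Python client's AdminClient. The bootstrap address and both thresholds are placeholders rather than recommendations; what counts as too many partitions or too much retention depends entirely on your own workloads, and the output is a list of topics to investigate, not a list of things to change.

```python
# Minimal audit sketch: flag topics whose partition count or retention looks
# out of proportion. Thresholds and the bootstrap address are placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

BOOTSTRAP = "localhost:9092"            # placeholder: your cluster here
MAX_PARTITIONS = 12                     # flag topics above this partition count
MAX_RETENTION_MS = 72 * 60 * 60 * 1000  # flag retention beyond 72 hours

admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
metadata = admin.list_topics(timeout=10)

# Skip internal topics such as __consumer_offsets
topics = [t for t in metadata.topics if not t.startswith("__")]
resources = [ConfigResource(ConfigResource.Type.TOPIC, t) for t in topics]

for resource, future in admin.describe_configs(resources).items():
    configs = future.result()
    topic = resource.name
    partitions = len(metadata.topics[topic].partitions)
    retention_ms = int(configs["retention.ms"].value)

    if partitions > MAX_PARTITIONS:
        print(f"{topic}: {partitions} partitions -- is the throughput really there?")
    if retention_ms == -1 or retention_ms > MAX_RETENTION_MS:
        print(f"{topic}: retention.ms={retention_ms} -- does anyone read that far back?")
```

The same loop extends naturally to cleanup policies, segment settings, or whatever else your estate tends to drift on.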
In the conversations we've had, the platform team generally walks into a cost review assuming things are operating relatively efficiently and that there is not much opportunity for improvement. Once we run the analysis, however, at least one or two of these patterns show up, and show up in a big way.
It may be that the same large partition count is used both for critical prod workflows and for dev topics that carry virtually no load. It may be that the one-cluster-per-team pattern, which made sense for isolation in an on-prem environment, was never revisited after the move to a hosted one. It could be that years of development have left behind a large number of dead or duplicate topics that were never cleaned up for lack of visibility and ownership. Or it may be that egress on a topic is running 30 times higher than ingress because of a misconfigured client that nobody has caught.
In all of these cases, the critical factor is that Kafka grows by accretion. Each topic, cluster, and job was added for a good reason at the time, but there is no built-in mechanism for teams to zoom out, look at the cumulative effect, and evaluate whether previous choices still make sense. That is not a failure of the team, but a reflection of how platforms of this shape tend to evolve.
Three approaches to Kafka cost optimization
Once the shape of the cost is clear, the right response depends on what is driving it. Some opportunities are genuinely quick. Others are not, and treating them as if they should be is a reliable way to burn effort. What stretches the timeline on the larger ones is usually not the technical work itself but the coordination across teams and the need to keep producers and consumers running through the change. In broad terms, we think about the response in three categories.
Updating defaults for new loads. A lot of waste accumulates because new topics, clusters, and clients are created without policies to keep them from inheriting it. Setting better defaults stops the bleeding and puts a foundation under everything else. The work typically includes:
- Sensible partition defaults so new topics do not inherit a peak-load count
- Default retention policies aligned with how consumers actually read
- Enforcement of client-side compression
- Default cluster tier choices for new use cases
- Basic ownership and cataloguing requirements at topic creation
Most of this is low-coordination because it does not touch live workloads, and a couple of weeks is usually enough to put the policies in place. On its own this work does not produce immediate savings, but it slows the trajectory of future growth and keeps the gains from the other two categories from eroding over time.
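As one illustration of what those defaults can look like in practice, here is a sketch of a thin topic-creation helper built on the confluent-kafka AdminClient. The partition count, retention value, topic name, and owner label are all hypothetical; the real values should come from your own policy, and the ownership record would normally land in a catalogue rather than a print statement.

```python
# Sketch of a topic-creation helper that applies reviewed defaults.
# All specific values below are illustrative, not recommendations.
from typing import Optional

from confluent_kafka.admin import AdminClient, NewTopic

DEFAULT_CONFIG = {
    "retention.ms": str(24 * 60 * 60 * 1000),  # 24h unless consumers need more
    # "producer" keeps whatever compression the client used; making clients
    # actually compress is a producer-side config and policy question.
    "compression.type": "producer",
}

def create_topic(admin: AdminClient, name: str, owner: str,
                 partitions: int = 3,
                 extra_config: Optional[dict] = None) -> None:
    """Create a topic with explicit defaults and a recorded owner."""
    config = {**DEFAULT_CONFIG, **(extra_config or {})}
    topic = NewTopic(name, num_partitions=partitions,
                     replication_factor=3, config=config)
    # Kafka has no native owner field, so ownership is recorded out of band;
    # in a real setup this would go to a catalogue, not stdout.
    print(f"creating {name}: owner={owner}, partitions={partitions}, config={config}")
    admin.create_topics([topic])[name].result()  # raises if creation fails

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder
create_topic(admin, "orders.events.v1", owner="payments-team")
```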
Optimizing existing loads. This is the hygiene and rightsizing work on what is already running:
- Tuning retention where it has drifted away from what consumers actually need
- Retiring topics that no longer have active producers or consumers
- Right-sizing partition counts on existing topics
- Consolidating clusters that have ended up underutilized
The technical work itself is rarely the hard part. What stretches the timeline is the coordination required to touch live workloads without disruption, particularly for partition right-sizing, which means recreating topics and coordinating the switch with every producer and consumer. Realistic timelines are a few weeks to a few months depending on scope.
In the estates we look at, this work typically moves the infrastructure bill by 10 to 20 percent, with reductions of 50 percent or more possible in estates with particularly poor hygiene and significant numbers of old or unused topics.
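For the retention piece specifically, the change itself is a single config call; a sketch with the confluent-kafka AdminClient is below. The topic name and the six-hour target are placeholders, and the time-consuming part is not this call but confirming with every consumer of the topic that nothing reads further back before you make it.

```python
# Sketch of tuning retention on an existing topic. The topic name and target
# retention are placeholders; confirm consumer read patterns before changing
# anything on a live workload.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder

topic = "orders.events.v1"                  # hypothetical topic
new_retention_ms = str(6 * 60 * 60 * 1000)  # 6 hours

resource = ConfigResource(ConfigResource.Type.TOPIC, topic,
                          set_config={"retention.ms": new_retention_ms})
# Note: the non-incremental alter_configs call replaces the topic's full set of
# non-default overrides, so include any other overrides you want to keep.
# Newer client and broker versions offer incremental_alter_configs instead.
futures = admin.alter_configs([resource])
futures[resource].result()  # raises if the change was rejected
print(f"{topic}: retention.ms set to {new_retention_ms}")
```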
Rethinking workloads. This is the response when the current shape of the estate fundamentally does not fit, either because data flows have outgrown the architecture or because the platform itself no longer matches the workload. It covers things like:
- Reshaping fan-out patterns that have become inefficient as consumer counts grew
- Moving from dedicated to pooled capacity at the topic and cluster layer
- Rethinking how teams share infrastructure
- Moving to a different platform altogether, where warranted
This category is the most variable of the three in both timeline and outcome. The savings can be the largest in absolute terms, particularly in large estates where pooled capacity and shared infrastructure offset the inefficiencies of dedicated allocation. It also tends to help most with production workloads, where teams reasonably optimize for risk mitigation over efficiency and need an architectural change rather than a configuration tweak to make progress.
Timelines depend heavily on how much migration tooling is in place, how much cross-team coordination is required, and whether the goal is a wholesale change or a focused effort on a few key areas.
The three response categories at a glance: updating defaults preserves gains, optimizing existing loads delivers the most reliable near-term reductions, and rethinking workloads offers the largest absolute savings in big estates.
The value of spending a little time up front on analysis is that it lets a team see where each approach will have the most impact for their estate and sequence the work accordingly. Trying everything at once tends to produce motion without outcome.
Staying efficient is something you can design for
A fair question we hear is whether cost work sticks once the initial cleanup is done. It can, but it does not automatically.
Defaults and policies are necessary but not sufficient. Even with sensible defaults in place, new edge cases find their way around them. A topic gets a one-time retention bump that becomes permanent. A cluster gets spun up for a project and is forgotten when the project ends. Without an ongoing discipline of review, the same forces that drove the original accretion start operating again, and over 12 to 18 months a meaningful share of the savings tends to erode.
Teams that hold the gains pair their cleanup with lightweight, ongoing governance. The pattern usually involves three things, the most foundational being visibility: an accounting of what the estate contains, what it costs, and which applications and teams are driving that cost. Without spend that maps to how the business is structured rather than how the infrastructure is shaped, the question "where is the money going" has no honest answer. The other two pieces are clear ownership of topics, clusters, and policies, and a regular cadence for reviewing what is actually being used. None of this is heavyweight, and we will come back to the visibility piece specifically later in the series.
Efficiency is a capability you can build into an estate, not only a moment in time.
What comes next
This is the start of a series of posts on Kafka cost savings.
Over the coming weeks we will go deeper on each of the three ideas above, starting on the opportunity side with the Kafka cost drivers we see most often. These are worth checking in your own environment and most teams find that one or two are where the real money is hiding. Later in the series we will return to the structural angle introduced here with a post on topic concentration and virtualization, which is where the largest estates tend to find their next round of savings.
If Kafka costs are a conversation in your organization right now, follow along to learn more, or reach out to us for help evaluating the potential cost savings in your own Kafka estate.
Want to see how much you could save on your Kafka bill?
Get a free Kafka cost analysis with our field engineering team. We will walk through your estate together, identify the waste patterns that apply, and give you a concrete estimate of where the savings are.