The New Era of Streaming

Kafka's challenge isn't speed—it's chaos. Most clusters handle modest loads but lack governance. Success requires control planes that manage how teams use Kafka safely at scale.

For a decade, we’ve obsessed over benchmarks, p99s, and GB/s. But look inside the clusters that run real businesses, and you’ll see a different story. Kafka’s biggest challenge isn’t technical: it’s human.

Throughput has become a poor measure of success. As an engineer, I love elegant systems. But I’m no longer impressed by endless benchmark wars about throughput, latency, or synthetic datasets.

From my time working with large enterprises, I’ve learned that those numbers barely matter. I understand the need to build technical credibility, but there’s another world out there: the enterprise reality.

Most Kafka clusters aren’t massive data engines. They’re small, strategic systems connecting hundreds of interdependent applications. Yet Kafka is often treated as pure infrastructure: something you deploy once, give access to all your developers, and expect to quietly deliver value from day one. That couldn’t be further from reality.

Kafka grows with the organization. It’s a living system that spreads across teams and environments, changing as the company evolves.

I’ve seen large migration programs to Kafka run for six months… only to stop and roll back. Was Kafka performance to blame? Absolutely not.

The best explanation I’ve heard comes from Filip Yonov at Aiven, who’s running an environment of 4000+ Kafka clusters. In Kafka’s 80% Problem, he explains that most Kafka deployments push less than 1MB/s—and that’s fine. Kafka is not about raw speed. It’s about coordination at scale.

Throughput ≠ Success

Kafka built its reputation on speed. Every benchmark, every slide, every blog post highlights throughput and latency; if you don’t believe me, just do a Google search on “Kafka benchmark.” Yet, when you actually look at the production clusters of ordinary businesses, most aren’t pushing GB/s.

Instead, you’ll hear statements like:

  • “Our p99 is lower” 

  • “Our producer latency is faster” 

  • “We handle millions of messages per second”

To be clear, I completely understand. After all, it’s enjoyable to run kafka-producer-perf-test.sh on your machine to achieve an extraordinary throughput, and to get results like:

3556911 records sent, 711382.2 records/sec (694.71 MB/sec), 0.4 ms avg latency, 6.0 ms max latency.
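
For reference, an output like that typically comes from an invocation along these lines (topic name and parameters are illustrative; --throughput -1 simply means “go as fast as the producer can”):

kafka-producer-perf-test.sh --topic perf-test --num-records 5000000 --record-size 1024 --throughput -1 --producer-props bootstrap.servers=localhost:9092

Synthetic 1 KB records pushed against a local broker will always look spectacular; production traffic rarely does.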

But in truth, only a handful of companies truly operate at hyperscale — names like Netflix, LinkedIn, Uber, New Relic, Slack, Pinterest, Criteo, OpenAI, and Datadog. They either have enormous user bases or process oceans of logs and telemetry. Are you, like LinkedIn, processing 7 trillion messages per day? Probably not. 

For everyone else, the story is different.

The reality is that 99% of businesses don’t run at hyperscale. They operate in financial services, retail, healthcare, logistics — domains where systems move at a human rhythm. In those worlds, 5,000 records/s or 20 MB/s is already a meaningful load. Aiven’s fleet (4,000+ Kafka services) shows a median ingest rate of around 9.8 MB/s.

And even that number often lies. Most of Kafka’s so-called “throughput” isn’t application data at all; it’s infrastructure traffic that includes:

  • Replication between brokers

  • Cross-cluster sync for DR or multi-region setups (MirrorMaker, Replicator, Cluster Linking)

  • CDC streams from databases (Kafka Connect, Debezium)

  • Derived data from stream processors (Kafka Streams, state stores)

  • Internal logs and metrics

The true business data, such as orders, payments, messages, and transactions, is just a tiny fraction of the total. It’s like judging a city’s traffic by counting delivery trucks instead of people (just look at New York).

For most companies, Kafka isn’t about speed. It’s the backbone for moving operational data safely between teams and systems.

As an aside, the ultra-low-latency world uses other tools entirely:

  • LMAX Exchange in High-Frequency Trading (HFT). We’re talking less than one microsecond here

  • ARINC 429 in aviation

  • AUTOSAR in automotive, etc.

Kafka was designed for reliability and coordination, not microsecond latency.

Platforming Kafka

Kafka provides a robust foundation for event-driven architectures. However, most organizations overestimate the value of deployment alone.

Many assume that simply deploying it will deliver value, and when it doesn’t, someone will inevitably blame Kafka for “not working.” Next, they’ll suggest migrating to Pulsar or NATS or some other technology—but of course, the same problem will follow. Because the real challenge isn’t Kafka; it’s the lack of a control system around the data environment.

That’s where platform teams come in, building what Gartner calls a “Digital Platform,” or a control layer that turns raw infrastructure into usable capability.

Some relevant stats:

  • 43% of organizations have been building platform teams for the past three years.

  • Yet, over 55% of those teams are less than two years old, evidence of how early they sit on the maturity curve.

These teams work hard to provide clusters, access controls, connectors, and monitoring, but they still struggle with one thing: developer adoption.

  • Your #kafka-platform-dsp Slack channel is flooded with questions

  • You have a ton of ServiceNow tickets asking for help troubleshooting Kafka applications.

  • You are relying on your favorite Kafka provider (or ticketing system) to support you.

  • Some teams copy preset configurations, while others look for ‘Kafka experts’ to optimize their batch size or reduce their app latency.

Ultimately, the result is the same: everybody blames Kafka for being complex.

Kafka doesn’t fail because it’s slow. It fails because it’s unmanaged. Control, not throughput, defines maturity.

If I hand you a Ferrari 488 Pista or a McLaren 720S, could you drive it full speed without crashing at the first turn? Probably not. Not because the car lacks power, but because there’s no control system keeping you on the track. The same applies to streaming: without governance, every turn is a drift.

A control plane for streaming environments doesn’t move data. It creates clarity, defines ownership, enforces rules, and provides feedback loops for producers and consumers. It exposes a simple, human, application-level layer for collaboration and safety.

That’s what companies like OpenAI understood early. They wrapped Kafka behind proxy-based control planes, making usage safe, auditable, and scalable, and are applying the same principle to Flink.

Illustration inspired by Byte Byte Go.

A V12 Streaming Engine Without Structure Around It

Scaling Kafka isn’t about partitions or brokers; it’s about people and teams. Things work fine when ten applications talk to each other, but they start breaking when there are a hundred.

Teams begin inventing their own conventions: JSON vs. Avro vs. Protobuf, or topics with dashes or dots (sometimes versioned, sometimes not). PII flows freely. Your governance team sets things up in Collibra months later, without the developers in the loop. Nobody knows what’s authoritative.

Amidst all this chaos, most platform teams take a stance of non-interference. “We just provide Kafka as a service,” they say. “It’s up to teams to use it how they want. We don’t want to dictate data ownership, we just provide the pipes.”

It sounds noble, but it’s also a guaranteed path to chaos; I know because I’ve been there.

Instead, a proper control plane synchronizes how humans use Kafka, not just how data flows through it. It keeps teams aligned by:

  • Making configuration and ownership visible

  • Applying security and data policies automatically

  • Giving platform teams usage metrics and reverse telemetry

  • Reducing friction without removing guardrails

The real maturity metric isn’t records per second; it’s how easily new applications can join the ecosystem without breaking anything.
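
To make the “applying policies automatically” point concrete, here is a minimal sketch in Python using the confluent-kafka AdminClient. The prefixes, thresholds, and function names are illustrative assumptions, not any vendor’s API; the idea is simply that every topic request passes through a policy gate before it ever reaches the cluster:

from confluent_kafka.admin import AdminClient, NewTopic

# Illustrative platform policy: ownership encoded in the topic prefix,
# plus basic replication and partition guardrails.
POLICY = {
    "name_prefixes": ("payments.", "orders.", "logistics."),
    "min_replication_factor": 3,
    "max_partitions": 50,
}

def validate(topic: NewTopic) -> list[str]:
    """Return the list of policy violations for a requested topic."""
    errors = []
    if not topic.topic.startswith(POLICY["name_prefixes"]):
        errors.append(f"'{topic.topic}' does not start with an owned domain prefix")
    if topic.replication_factor < POLICY["min_replication_factor"]:
        errors.append("replication factor below the platform minimum")
    if topic.num_partitions > POLICY["max_partitions"]:
        errors.append("partition count above the platform maximum")
    return errors

def create_topic_safely(admin: AdminClient, topic: NewTopic) -> None:
    violations = validate(topic)
    if violations:
        # Reject with actionable feedback instead of letting chaos reach the cluster.
        raise ValueError("; ".join(violations))
    admin.create_topics([topic])  # returns one future per topic; not awaited in this sketch

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
create_topic_safely(admin, NewTopic("payments.invoices.v1", num_partitions=6, replication_factor=3))

The same gate can live in a CI pipeline, a self-service portal, or a proxy in front of the cluster; what matters is that the rule is applied consistently and the feedback is immediate.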

When you zoom out from a single cluster to an entire ecosystem, you realize Kafka isn’t just a tool: it’s a medium for human coordination.

From Data Plane to Control Plane

We’ve entered a new phase of streaming systems. The question is no longer how quickly we can move more data; it’s how safely (and consistently) we can move data across hundreds of apps and teams.

At enterprise scale, culture and context must be encoded into the system. That’s Control Plane Thinking: shifting the focus from “make Kafka faster” to “make Kafka usable,” and recognizing that reliability isn’t built in code; it’s built in how humans negotiate change.

The Conduktor Kafka control plane

Streaming used to be a niche infrastructure topic, but today it’s the nervous system of the enterprise. AI pipelines, microservices, analytics, and monitoring all depend on a reliable streaming backbone.

In a recent study from Confluent, 86% of IT leaders see data streaming as a strategic priority (thanks to its role in AI, analytics, and microservices), but they also emphasize the need to “shift left”: embedding quality, security, and compliance early in the data streams.

Managed cloud services (e.g., Confluent Cloud, AWS MSK, Aiven, Redpanda) have made Kafka easier to deploy, but not necessarily easier to work with. We still need a layer that bridges human and technical realities:

  • Central visibility across clusters

  • Policy propagation

  • Schema / data-contract validation

  • Integration with identity systems
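
Taking the schema / data-contract validation point as an example, here is a minimal sketch using Python and confluent-kafka’s Schema Registry client. The subject name, registry URL, and schema are illustrative assumptions; the point is that a proposed change is checked against the registered contract before anything ships:

from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

registry = SchemaRegistryClient({"url": "http://localhost:8081"})

# Proposed evolution of the contract: a new field with a default stays backward compatible.
proposed = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "Order",
      "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "EUR"}
      ]
    }
    """,
    schema_type="AVRO",
)

subject = "orders.created-value"
if registry.test_compatibility(subject, proposed):
    registry.register_schema(subject, proposed)  # safe to evolve the contract
else:
    raise RuntimeError(f"Schema change for {subject} would break existing consumers")

Run in CI, a check like this turns “please don’t break the downstream consumers” from a Slack plea into an enforced contract.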

We’re entering the age of streaming governance, not as a checkbox, but as part of the infrastructure itself. Why do you think this specific market is expected to grow from $4.68B to $22.87B over the next eight years?

Closing

We don’t need faster Kafka clusters. We need smarter ones.

Throughput is a vanity metric. Adoption is the real one.

As CTOs and platform leaders, our job is no longer to tune brokers or chase p99s. It’s to design the control systems that make streaming usable, safe, and scalable for everyone in the company.

The next decade of data engineering won’t be defined by throughput; it will be defined by trust and control.

Because in the end, speed means nothing without control.