Five Hidden Kafka Challenges for Enterprises

Deploying Apache Kafka brings unexpected complications, especially at enterprise scale. Learn what obstacles large organizations face and how they overcome them.

April 15, 2025

As organizations become more digital and data-driven, Apache Kafka has become a core component of modern data architectures. But while Kafka remains the best way to ingest real-time data at scale and speed, integrating it into large, diverse environments creates complexity and unexpected challenges.

Recently, I met with architects, product managers, and tech leads from five Conduktor customers across a wide range of industries, including finance, retail, and logistics. Despite their differences, the same patterns emerged: unexpected costs, data quality gaps, governance concerns, and the delicate interplay between people and technology. 

Challenge 1: Data quality and governance

Many teams equate data quality with schema enforcement and registries, but issues often live within the data itself and, left unchecked, travel downstream to cause complications. At one European postal service, inconsistent formats and data types caused friction between producers and consumers. While schema enforcement ensures type correctness, it doesn’t guarantee semantic correctness.
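To illustrate the gap, here is a minimal Java sketch (the parcel fields, topic names, and business rules are hypothetical, not any customer's actual pipeline): a record can satisfy its schema's types yet still carry values that break downstream logic, so a lightweight business-rule check runs before the record is produced.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.List;
import java.util.Properties;

public class SemanticCheckProducer {

    // Hypothetical parcel event: a schema guarantees the types, not the meaning.
    record ParcelEvent(String parcelId, String countryCode, double weightKg) {}

    // Business rules a schema registry cannot express.
    static List<String> validate(ParcelEvent e) {
        var errors = new java.util.ArrayList<String>();
        if (e.weightKg() <= 0) errors.add("weightKg must be positive");
        if (!e.countryCode().matches("[A-Z]{2}")) errors.add("countryCode must be ISO 3166-1 alpha-2");
        return errors;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        ParcelEvent event = new ParcelEvent("P-123", "FR", 1.2);
        List<String> errors = validate(event);

        try (var producer = new KafkaProducer<String, String>(props)) {
            if (errors.isEmpty()) {
                // Serialized as a simple JSON-style string to keep the sketch self-contained.
                String value = String.format("{\"parcelId\":\"%s\",\"countryCode\":\"%s\",\"weightKg\":%s}",
                        event.parcelId(), event.countryCode(), event.weightKg());
                producer.send(new ProducerRecord<>("parcel-events", event.parcelId(), value));
            } else {
                // Route semantically invalid records to a dead-letter topic for inspection.
                producer.send(new ProducerRecord<>("parcel-events.dlq", event.parcelId(), String.join("; ", errors)));
            }
        }
    }
}
```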

Another team clarified that they typically notice data quality issues only when such issues directly impact the business. Due to limited observability into the data itself, problems are rarely detected through technical dashboards, which tend to focus on system metrics rather than anomalies in the data, such as distribution shifts or outliers.

These issues aren’t just technical; they’re organizational. As one attendee explained, there were “blurry borders” around ownership of data quality, schema governance, and data retention. So, while data owners (software developers) are often considered ‘responsible’ for quality, those expectations are rarely lived up to and are applied inconsistently across the organization.

Another major multinational retailer uses Kafka as a key component of their order system. Their data platform lead found that despite having implemented schema registries back in 2019, there was no validation of data within the messages themselves. This was largely because the teams lacked the ability to monitor the live streaming data in Kafka itself—a notoriously difficult problem when it comes to real-time data observability.

Instead, this team only discovered issues when they surfaced in downstream applications and business-centric KPIs. And because they lacked automated alerting, they could not easily find and fix problems, leading to higher Mean Times to Detection/Resolution (MTTD/R) as well as a reactive approach.
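The kind of data-level alerting that shortens MTTD can start small. Below is a rough sketch, not the retailer's actual setup: a standalone consumer samples live messages and raises an alert when a simple signal (here, the share of records missing a hypothetical orderTotal field) crosses a threshold. Topic, group, and field names are illustrative.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class NullRateMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "data-quality-monitor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (var consumer = new KafkaConsumer<String, String>(props)) {
            consumer.subscribe(List.of("orders"));          // illustrative topic name
            long total = 0, missing = 0;

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (var record : records) {
                    total++;
                    // Crude content check: the field downstream apps depend on is absent.
                    if (record.value() == null || !record.value().contains("\"orderTotal\"")) missing++;
                }
                if (total >= 1000) {                         // evaluate per 1,000-record window
                    double rate = (double) missing / total;
                    if (rate > 0.05) {
                        // In practice this would emit a metric or page someone, not just log.
                        System.err.printf("ALERT: %.1f%% of orders missing orderTotal%n", rate * 100);
                    }
                    total = 0; missing = 0;
                }
            }
        }
    }
}
```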

Challenge 2: Self-service and discoverability

Even when data is clean and reliable, teams struggle to find, understand, or use it—especially when operational and analytical layers are disconnected.

At one logistics service, the process of requesting that operational data be persisted for consumption by data scientists and analysts takes several weeks. Analysts file tickets with the platform team, who then set up connectors and S3 buckets, but the need to validate internal consumption permissions keeps the process from being fully automated.
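The connector setup itself is usually the easiest step to script. As a rough sketch (the connector class and property names follow the Confluent S3 sink connector's documented configuration, while the bucket, topic, and Connect host names are placeholders), registering a sink is a single call to the Kafka Connect REST API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterS3Sink {
    public static void main(String[] args) throws Exception {
        // Placeholder topic, bucket, and Connect host; property names follow the
        // Confluent S3 sink connector's documented configuration.
        String config = """
            {
              "connector.class": "io.confluent.connect.s3.S3SinkConnector",
              "tasks.max": "1",
              "topics": "shipments",
              "s3.bucket.name": "analytics-landing",
              "s3.region": "eu-west-1",
              "storage.class": "io.confluent.connect.s3.storage.S3Storage",
              "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
              "flush.size": "1000"
            }
            """;

        // PUT to /connectors/{name}/config creates the connector or updates it in place.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors/shipments-s3-sink/config"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```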

In contrast, the platform team at another retailer built a one-click system to move Kafka data into the analytical estate. Data owners have the power to land data in BigQuery through a fully automated process. Users can browse catalogs and request data to be landed, with each request subject to owner approval.

In this fashion, they were able to deploy nearly 700 connectors and sync nearly 2,000 topics across them. By making the data landing process opt-in (and requiring owners to only push data if needed for broader use), platform teams were able to standardize the process while also accelerating implementation through automation. With clear lines of ownership, the persisted data is easily made available for data consumers, removing a slow process as a dependency or obstacle.

Challenge 3: Operationalizing Flink

Flink is powerful, but attendees weren’t always clear on how to fully adopt it within a Platform-as-a-Service (PaaS) operating model.

One platform team at a large retailer considered running Flink as a centralized service. Under this model, application teams would utilize the Flink service for stream processing needs, but not be responsible for the infrastructure itself. They ultimately rejected this operating model due to ambiguity around areas of ownership, unclear responsibilities for incident response, and difficulties in aligning platform capabilities with the specific needs of application teams. As a result, they intend to stick with Kafka Streams where logic stays inside the application domain. 
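To make the “logic stays inside the application domain” model concrete, here is a minimal Kafka Streams sketch (topic names and the filter rule are illustrative). The processing runs inside the owning team's own service, so there is no shared stream-processing cluster to operate or to argue ownership over.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class OrderFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter");      // also used as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");           // illustrative input topic
        orders.filter((key, value) -> value != null && value.contains("\"status\":\"CONFIRMED\""))
              .to("orders-confirmed");                                       // illustrative output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```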

However, another attendee felt that Flink could work for specific use cases if it was wholly owned by her team. Because her team would write and deploy the stream processing code, as well as own the ops considerations, they could mitigate concerns such as poorly implemented logic or processing failures. That said, her team’s focus was on rapidly persisting data for analysts rather than on stream processing.

Challenge 4: Battling costs and zombie infrastructure

As Kafka usage grows, so do inefficiencies such as misused partitions, excessive retention, and unused (zombie) topics and schemas—which generate unnecessary expenses.  

Kafka topics are divided into partitions to support scaling and parallel processing, but too many partitions, especially for low-volume, low-traffic topics, can drive up storage and compute costs. As a solution, one company set a limit of 10 partitions per self-service user, with requests for additional partitions requiring manual override.
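One way to enforce such a cap at the broker rather than in review meetings is Kafka's pluggable CreateTopicPolicy interface, registered on brokers via create.topic.policy.class.name. The sketch below applies the 10-partition limit from the example above; how the manual override is requested and approved is left to the platform team.

```java
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;
import java.util.Map;

// Register on the brokers with create.topic.policy.class.name=MaxPartitionsPolicy
public class MaxPartitionsPolicy implements CreateTopicPolicy {
    private static final int MAX_SELF_SERVICE_PARTITIONS = 10;

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void validate(RequestMetadata request) throws PolicyViolationException {
        Integer partitions = request.numPartitions();   // may be null if explicit replica assignments are used
        if (partitions != null && partitions > MAX_SELF_SERVICE_PARTITIONS) {
            throw new PolicyViolationException(
                "Topic " + request.topic() + " requests " + partitions +
                " partitions; self-service topics are capped at " + MAX_SELF_SERVICE_PARTITIONS +
                ". Ask the platform team for a manual override.");
        }
    }

    @Override
    public void close() {}
}
```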

Another common issue was excessively long retention policies. In one example, data was retained for almost a decade simply because no one had reviewed the original, legacy configuration, which significantly inflated Kafka storage costs.
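Tightening a forgotten retention setting doesn't require touching producers or consumers; it is a single incremental config change via the AdminClient, sketched here with a placeholder topic name and a seven-day retention.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class FixRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "legacy-orders");
            // retention.ms in milliseconds: 7 days instead of the inherited multi-year setting.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
            System.out.println("Retention updated for legacy-orders");
        }
    }
}
```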

Then there are zombie topics and schemas, which see no traffic or usage but continue to exist—most often in non-production or legacy environments. One team applied a seven-day cleanup policy in their dev environments, automatically flagging unused assets for deletion. Alongside better tools for forecasting and cost visualization, this enabled them to encourage better practices without blocking teams. 
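Finding candidates for that kind of cleanup can start as simply as asking each partition for the timestamp of its newest record and flagging topics that have been silent for longer than the window. A sketch of that report follows, using the seven-day threshold from the example; it assumes a recent Kafka client and broker that support OffsetSpec.maxTimestamp, and the deletion step itself is deliberately left out.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ZombieTopicReport {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        long cutoff = Instant.now().minus(Duration.ofDays(7)).toEpochMilli();

        try (AdminClient admin = AdminClient.create(props)) {
            // Ask every partition of every topic for the timestamp of its newest record.
            Map<TopicPartition, OffsetSpec> query = new HashMap<>();
            for (var entry : admin.describeTopics(admin.listTopics().names().get()).allTopicNames().get().entrySet()) {
                for (var partitionInfo : entry.getValue().partitions()) {
                    query.put(new TopicPartition(entry.getKey(), partitionInfo.partition()), OffsetSpec.maxTimestamp());
                }
            }

            Map<String, Long> newestByTopic = new HashMap<>();
            for (var entry : admin.listOffsets(query).all().get().entrySet()) {
                newestByTopic.merge(entry.getKey().topic(), entry.getValue().timestamp(), Math::max);
            }

            // Empty partitions report -1, so never-used topics are flagged as well.
            newestByTopic.forEach((topic, ts) -> {
                if (ts < cutoff) System.out.println("Candidate for cleanup (no traffic in 7 days): " + topic);
            });
        }
    }
}
```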

Challenge 5: Bridging legacy and cloud technologies via Kafka

Today, Kafka is more than just a streaming data layer—it’s fast becoming the backbone of data across legacy systems, modern microservices, and external partners. But this evolution brings new obstacles.

For instance, many organizations still include legacy systems, like MQ, MFT (Managed File Transfer), and SOAP/XML, within their environments. Given the possibility of hidden dependencies, removing these older services entirely may lead to outages or other issues.

Instead, teams are using Kafka as their integration solution. One bank, constrained by regulations, cannot fully migrate to the cloud, so they created a single team to handle both MQ-based infrastructure and on-premise Kafka. Another organization replaced MQ with Kafka, though they still rely on file- and REST-based data sharing with smaller partners that are not ready to share data via Kafka.

Sharing Kafka data with external partners also brings complications, as organizations seek ways to expose Kafka data externally without compromising security or adding operational bloat. One company currently supports 30+ external integrations via REST Proxy but acknowledges that this approach won’t hold up long term due to scalability, security, and governance concerns. They are actively looking at how to modernize this process to account for growing consumer demand for fresh, real-time data, which is critical for powering personalized experiences, AI-driven insights, and next-generation digital products.

Finally, many organizations are increasingly adopting OpenAPI and AsyncAPI to standardize specifications across teams and reduce vendor lock-in. These specifications provide a clear, machine-readable contract for how services expose and consume data, making it easier to discover, integrate, and govern data within a more transparent and interoperable ecosystem.

In retrospect, none of these challenges are unique—in fact, they’re common to any organization trying to adopt and leverage Kafka at scale. Yet these obstacles demonstrate that not every problem is technical—some may stem from governance, culture, and processes.

In the end, the enterprises that succeed will do so by centralizing guardrails while enabling developer autonomy. To learn how Conduktor helps teams get there, sign up for a demo today.
