Chaos test your Kafka stack with Conduktor to find weak points, reduce downtime risk, and ensure production-readiness.
10.07.2025
It’s not always easy to anticipate how infrastructure will perform in production, where users, traffic, and data flow freely. Unlike the safe confines of sandboxes, production environments can be unpredictable: activity can be bursty, infrastructure might have difficulty scaling, and latency and downtime are only a failed dependency away.
And downtime can be costly. In a survey of 2,000 executives from the Global 2000 Index (which includes the 2,000 largest companies worldwide), Splunk found that service disruptions cost companies up to $400 billion annually. Given these stakes, enterprises need to prepare their infrastructure accordingly.
Chaos testing ensures that applications behave correctly under real-world conditions, and won’t generate unexpected errors. However, because Kafka is by nature a resilient messaging broker, it can be difficult to simulate these circumstances and test infrastructure for resilience. In addition, many organizations may focus on a system’s functionality and product features, without emphasizing its reliability and performance under pressure.
This article looks at how Conduktor solves this problem, enabling the developer or tester to test that applications handle broker errors as required.
What is chaos testing?
Chaos testing introduces controlled failure to verify system resilience, helping you understand how an application reacts when things go wrong. This also enables you to identify (and strengthen) possible points of failure, or to remove potential vulnerabilities.
In the context of Kafka, chaos testing means simulating errors that occur when an application attempts to read or write messages to the broker. Instead of disrupting the Kafka cluster directly, Conduktor Gateway steps in: it proxies read and write requests from the application and can respond with specific Kafka error codes, mimicking real failure scenarios.
This allows you to observe whether your application can bounce back from temporary failures or handle permanent ones as intended. The Kafka cluster remains stable, and you get a clear view into how your producers and consumers behave under pressure.
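On the client side, recovering from these injected errors usually comes down to retry logic with backoff. Here is a minimal sketch in plain Python (no real Kafka connection; the flaky send function stands in for a producer hitting a Gateway-injected retriable error):

```python
import time

class RetriableError(Exception):
    """Stands in for a retriable Kafka error code returned by the broker."""

def send_with_retries(send, record, max_retries=5, base_delay=0.01):
    """Retry a send on retriable errors, backing off exponentially."""
    for attempt in range(max_retries + 1):
        try:
            return send(record)
        except RetriableError:
            if attempt == max_retries:
                raise  # permanent failure: surface it to the caller
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# Simulate a broker that fails twice and then succeeds -- the kind of
# transient failure a chaos interceptor injects.
failures = {"left": 2}

def flaky_send(record):
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RetriableError("simulated broker error")
    return f"acked:{record}"

print(send_with_retries(flaky_send, "order-42"))  # acked:order-42
```

A real Kafka producer gets similar behavior from its own configuration (for example, the `retries` and `delivery.timeout.ms` settings); the point of chaos testing is to verify that these settings, and any application-level handling around them, actually behave as intended.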

Why chaos test infrastructure?
As data environments become more capable, they also become more complex, consisting of multiple technologies. For instance, building a real-time data stack could include Kafka for ingesting streaming data; Apache Flink for processing this data and for preparing it for analysis; a database such as Apache Pinot for analytical queries; a transactional database such as PostgreSQL; and assorted connectors.
As these different technologies interact with each other, they also create unexpected dependencies—and potential points of failure. Needless to say, the worst time to discover these relationships is during a crisis, when time is short, stakes are high, and stress levels are through the roof. Therefore, platform and SRE teams take a proactive approach with chaos engineering, mapping out and testing these interdependencies beforehand.
Chaos testing also builds more resilient systems, adding a layer of protection during crisis situations. By testing components to failure, teams can uncover (and strengthen) vulnerabilities, or streamline overly complicated dependencies.
Another benefit of chaos testing lies in operational readiness. Any team that has to respond to real-world events such as cyberattacks, system outages, or latency spikes needs to rapidly restore service and secure their data infrastructure. By practicing their incident response workflows in a safe, non-production setting, these teams can improve their confidence and hone their skills—so that they’re ready for the real thing.
What are some best practices around chaos engineering?
Consistency and planning
Many organizations don’t codify chaos testing, instead taking an ad hoc approach and running sporadic tests. This divorces chaos testing from overall development, and the tests may not be thorough enough to unearth hidden issues.
Instead, chaos testing should be incorporated into CI/CD pipelines and other development workflows. Chaos engineering can (and should) be an ongoing process, rather than a one-and-done exercise.
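In a pipeline, a chaos scenario can run as an ordinary automated test: inject a failure, then assert the system recovers within an agreed budget. A minimal sketch of that assertion logic in plain Python (the health check here is a fake that recovers after a few polls, standing in for an application recovering from an injected broker error):

```python
import time

def assert_recovers(check, timeout=1.0, interval=0.05):
    """Poll `check` until it returns True or the recovery budget expires.

    This is the core of a CI chaos test: after injecting a failure,
    the pipeline fails unless the system returns to health in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    raise AssertionError(f"did not recover within {timeout}s")

# Fake service that is unhealthy for its first three health checks.
state = {"checks": 0}

def health_check():
    state["checks"] += 1
    return state["checks"] > 3

assert_recovers(health_check)
print("recovered after", state["checks"], "health checks")
```

In a real pipeline, the failure-injection step before this assertion would enable a Gateway chaos interceptor for the test run and disable it afterward; the recovery budget (`timeout`) then becomes an explicit, versioned resilience requirement.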
Game days
As team- or organization-wide drills simulating failures and their responses, game days are excellent opportunities to perfect collaboration and troubleshooting under pressure. In general, game days will focus on a single scenario and be as realistic as possible. To maximize their effectiveness, game days should also have observers to document the events and actions, and a debrief to better assess effectiveness and areas for improvement.
Identify and prioritize bottlenecks
Whether it’s a dependency, a business-critical service (like the checkout function), or systems with elevated security privileges, environments may have single points of failure. It’s important to test these key components under pressure to uncover risks. Teams can simulate outages, latency, or bursty traffic to validate recovery processes and ensure that backups will perform as planned.
What can you simulate with Conduktor Chaos Testing?
Conduktor Gateway deploys interceptors: pluggable components that can encrypt data, enforce data quality, and, for chaos testing, feign a range of Kafka issues such as failed or slowed components, leader election errors, and malformed messages.
Here is a complete list of the possible interceptors that you can use for chaos testing, and what they test for:
Broken brokers. This interceptor injects periodic errors into the broker-client connection to test possible responses to broker-side issues.
Duplicate messages. What happens when Kafka clients produce or consume duplicate records? For instance, if a transaction is duplicated, will a payment platform charge the buyer twice?
Invalid schema ID. If records present an invalid schema ID, how will downstream consumers react? Will they crash, log an error, or trigger an alert?
Latency on all interactions. By emulating slowed interactions, this interceptor tests that Kafka environments will respond correctly to network or broker delays.
Leader election errors. Within a Kafka partition, a leader node handles all the read and write requests. When it fails, elections are run to determine new leaders—so this interceptor tests what happens when these elections fail or encounter errors.
Message corruption. This interceptor tests your Kafka environment’s response to malformed or corrupted messages.
Slow brokers. What happens if broker traffic slows down? How will this impact the rest of your Kafka environment?
Slow producers and consumers. As with the previous test, this interceptor simulates slow produce and fetch requests in addition to slowing broker traffic.
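Take the duplicate-messages scenario above: the standard defense on the consumer side is idempotent processing, where each record is applied at most once. A minimal sketch in plain Python (the record fields are illustrative; a production system would track seen keys in a persistent store, not an in-memory set):

```python
class IdempotentProcessor:
    """Apply each record at most once, keyed on a unique message ID."""

    def __init__(self):
        self.seen = set()   # in production: a durable store, not a set
        self.applied = []

    def process(self, record):
        key = record["id"]
        if key in self.seen:
            return False     # duplicate delivery from the broker: skip it
        self.seen.add(key)
        self.applied.append(record)
        return True

# A duplicate-messages interceptor delivers the payment twice;
# the processor charges the buyer only once.
processor = IdempotentProcessor()
payment = {"id": "txn-1001", "amount": 49.99}
processor.process(payment)
processor.process(payment)        # duplicate, ignored
print(len(processor.applied))     # 1
```

Running this logic against the duplicate-messages interceptor is what turns "we believe our consumer is idempotent" into a verified property.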
One Conduktor customer, a leading sports and outdoors retailer based in the United States, tests their Kafka components under real-world conditions to prepare them for the demands of peak shopping seasons. The bulk of their revenue comes from the weeklong break from Thanksgiving through Cyber Monday; in 2024, they earned nearly $10 billion, or close to 70 percent of their revenue, during this time.
Clearly, any outage or latency would significantly impact sales. To prevent this worst-case scenario, engineers simulate failure scenarios at the Kafka layer, such as latency and broker instability. For this retailer, chaos engineering plays a key role in protecting performance and profit during the most critical time of year.
As data infrastructure becomes more complex and mission-critical, validating how it responds under stress is essential. By proactively simulating broker errors, latency spikes, and misbehaving clients, organizations use chaos testing to harden their systems, uncover hidden risks, and train their teams to respond swiftly and effectively.
Conduktor makes it easier to test Kafka applications in realistic failure scenarios—without destabilizing your production environment. With the right chaos testing strategy, resilience is built into your data stack from the very beginning.