Chaos Testing for Kafka Compliance: DR Evidence That Works
Map chaos experiments to DORA, SOC 2, PCI-DSS, and GDPR requirements. Turn your Kafka resilience testing into the compliance evidence regulators expect.

Chaos testing for Kafka may seem optional, but it's what separates a DR plan from audit-ready evidence. Auditors don't accept "we have a disaster recovery plan." They ask three questions:
- When did you last test it?
- What were the results?
- What did you change afterward?
If you can't answer all three with timestamped evidence, you have a finding, regardless of how robust your architecture is.
The complete strategy post mentioned SOC 2, PCI-DSS, HIPAA, and GDPR continuity requirements in passing. The Bitvavo customer story showed how a crypto exchange achieved DORA compliance using Conduktor. This post bridges the two by mapping chaos testing methodology to the specific evidence that regulatory frameworks demand.
Most teams treat compliance testing and resilience testing as separate workstreams: separate schedules, separate documentation, separate stakeholders. They're the same workstream. A well-documented chaos experiment produces engineering insights and audit-ready evidence in a single activity.
Series context: This builds on Chaos Engineering for Kafka (methodology) and What Chaos Tests Teach You (insights). You don't need those posts to follow this one, but they provide the technical foundation.
What regulators actually require for DR testing
This isn't a comprehensive compliance guide; Conduktor has covered Kafka security and compliance and audit logging in depth elsewhere. The focus here is narrow: what does each framework require for disaster recovery testing and validation?
DORA (Digital Operational Resilience Act)
DORA applies to banks, insurance companies, investment firms, crypto-asset service providers, and their critical ICT third-party providers operating in the EU.
If your organization falls into any of these categories, Article 11 requires you to maintain and periodically test ICT business continuity and disaster recovery plans, at least yearly and after substantive changes to ICT systems supporting critical or important functions. "We have a secondary Kafka cluster" doesn't satisfy this requirement, but evidence of tested failover with measured recovery times does. DORA is particularly relevant to this series given the Bitvavo story, where Conduktor's governance and security capabilities supported DORA and MiCA compliance for a platform serving 1.5 million users.
SOC 2 (Type II)
SOC 2 isn't legally mandated, but it's effectively required for any SaaS company or service provider selling to US enterprise customers. Prospects will ask for your SOC 2 report.
The Availability criterion (A1.2) requires that recovery infrastructure is authorized, designed, implemented, operated, maintained, and monitored. The key difference is that Type II assesses effectiveness over time, not just design at a point in time. Auditors want periodic testing records showing a pattern, not a single test from last year. Conduktor's own SOC 2 journey distilled this to a useful principle that applies directly to DR testing: write what you do, do what you write.
PCI-DSS (v4.0)
PCI-DSS applies to any organization that stores, processes, or transmits cardholder data, anywhere in the world. If payment transactions flow through your Kafka pipelines, this is you.
Requirement 12.10.2 states that "the incident response plan is reviewed and tested, including all elements listed in Requirement 12.10.1, at least once every 12 months." That applies to the Kafka DR plan. If payment data flows through Kafka and you haven't tested recovery of that pipeline, you have a gap.
GDPR (Article 32)
GDPR applies to any organization processing personal data of EU residents, regardless of where the organization is based. A US company with European users is in scope.
Article 32 requires "a process for regularly testing, assessing and evaluating the effectiveness of technical and organisational measures for ensuring the security of the processing." The operative word is "regularly," and a DR test from 18 months ago doesn't satisfy this requirement.
The common thread: Every framework requires tested and documented recovery capability, not just planned recovery capability. The distinction matters at audit time.
How do chaos experiments map to compliance evidence?
A single chaos experiment (say, simulating broken brokers at a 25% error rate on a production-adjacent environment) produces multiple compliance artifacts without extra work.
- Timestamped test execution record: when the test ran, what failure was injected, which systems were in scope.
- Measured recovery metrics: consumer lag recovery time, producer retry success rate, end-to-end latency during and after injection.
- Gap analysis: expected behavior versus observed behavior, clearly documented.
- Remediation record: what configuration or code changes were made as a result.
- Re-test confirmation: evidence that the fix actually worked.
Five artifacts from one experiment. Here's how they map.
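The measured recovery metrics are the artifact auditors scrutinize most closely, and they can be derived mechanically from the experiment's timestamps. A minimal sketch (the timestamps and function name are illustrative assumptions, not a Conduktor API):

```python
from datetime import datetime

def recovery_seconds(injection_start: str, lag_recovered: str) -> float:
    """Time from fault injection until consumer lag returned to baseline."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(injection_start, fmt)
    end = datetime.strptime(lag_recovered, fmt)
    return (end - start).total_seconds()

# Example: fault injected at 14:00:00, lag back to baseline at 14:07:30
rto = recovery_seconds("2024-06-03T14:00:00Z", "2024-06-03T14:07:30Z")
print(f"Measured recovery: {rto / 60:.1f} minutes")  # Measured recovery: 7.5 minutes
```

Recording the measurement this way, rather than eyeballing a dashboard, is what turns an engineering observation into a timestamped evidence record.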
| Framework | Evidence Required | Recommended Cadence |
|---|---|---|
| DORA | Quarterly chaos experiments on Wave 1 applications (the critical services that must recover first, as defined in the complete strategy post), plus at least one annual game day covering the full decision chain: detection, authorization, execution. | Quarterly + annual game day |
| SOC 2 Type II | Chaos test execution log showing regular cadence across quarters, with measured RTO that ideally improves over time. | Quarterly |
| PCI-DSS | Chaos experiment simulating payment pipeline failure, with documented recovery sequence, timing, and data integrity verification. | At least annually |
| GDPR | Automated chaos experiments triggered by infrastructure changes (Kafka upgrades, config changes, new application deployments), plus quarterly manual experiments. | On change + quarterly |
Conduktor Gateway's chaos interceptors were built for this. Your YAML interceptor configuration doubles as the test definition. Monitoring output becomes the test result. Before-and-after configs document the remediation. Version all of it in git alongside your infrastructure code, and the audit trail builds itself.
Audit-ready documentation: The six-field template from the previous post (hypothesis, interceptor config, observed behavior, delta from expected, remediation, re-test results) maps directly to what auditors expect. Keep a running log and your next audit prep is already done.
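The six-field template can be kept as a structured record so each experiment appends one machine-readable entry to the evidence log. A sketch under assumptions: the field values below are illustrative, and the remediation settings named are hypothetical examples, not prescribed fixes.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ChaosTestRecord:
    """One evidence-log entry; fields mirror the six-field template."""
    hypothesis: str
    interceptor_config: str   # path to the versioned YAML, not the config itself
    observed_behavior: str
    delta_from_expected: str
    remediation: str
    retest_results: str

record = ChaosTestRecord(
    hypothesis="Consumers recover within 15 min at 25% broker error rate",
    interceptor_config="chaos/broken-brokers-25pct.yaml",  # hypothetical path
    observed_behavior="Consumer lag recovered in 47 min; producer retries succeeded",
    delta_from_expected="Recovery 32 min slower than the documented RTO",
    remediation="Tuned consumer session.timeout.ms and rebalance settings",
    retest_results="Re-test: recovery in 12 min, within documented RTO",
)

# Append one JSON line per experiment; commit alongside infrastructure code.
print(json.dumps(asdict(record), indent=2))
```

Because the record lives in git next to the interceptor config it references, the audit trail and the engineering history stay in sync by construction.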
Building a compliance-ready testing cadence
The goal is a single testing cadence that satisfies both engineering and regulatory needs, not two parallel programs.
Start with quarterly chaos experiments on your Wave 1 applications. This exceeds DORA's annual minimum with comfortable margin, meets SOC 2's ongoing effectiveness standard, and satisfies PCI-DSS's required annual test with three additional data points to spare. Four data points per year is enough signal to spot trends in your measured RTO.
Layer in automated experiments on infrastructure changes. When Kafka version upgrades, configuration changes, or new applications deploy, trigger chaos experiments as part of change validation. This satisfies GDPR's "regularly testing" language without a separate schedule and produces evidence that resilience holds through change, not just at quarterly checkpoints.
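A change-triggered cadence can be as simple as a CI gate that inspects which files a change touches. A sketch, assuming illustrative repository path conventions (these are not a Conduktor feature):

```python
# CI gate sketch: trigger chaos experiments when a change touches Kafka
# configuration, Gateway interceptor definitions, or deployment manifests.
# The path prefixes below are illustrative assumptions.

TRIGGER_PREFIXES = ("kafka/", "gateway/interceptors/", "deploy/")

def should_run_chaos(changed_files: list[str]) -> bool:
    """True if any changed file falls under an infrastructure path."""
    return any(f.startswith(TRIGGER_PREFIXES) for f in changed_files)

print(should_run_chaos(["kafka/server.properties", "docs/runbook.md"]))  # True
print(should_run_chaos(["docs/runbook.md"]))                             # False
```

Wiring this into the change pipeline means every qualifying infrastructure change produces a fresh evidence entry automatically, which is exactly the "regularly testing" pattern GDPR's Article 32 language asks for.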
Once a year, run a full game day with the decision chain included: the on-call team, the approval chain, the communication process. Measure detection-to-decision time. Automated experiments can't test whether your organization can execute the recovery under pressure. A game day can.
Each experiment should produce entries in two places: the engineering failure knowledge base and the compliance evidence log. Same data, different framing. Keep both updated from the same source.
Plot your actual recovery times across quarters. An improving trend line is powerful audit evidence. A flat or worsening line is an early warning you'd rather catch yourself than have an auditor surface.
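The trend itself can be computed rather than eyeballed: a least-squares slope over the quarterly RTO measurements gives a single number to track. A sketch with illustrative sample data:

```python
# Sketch: quantify whether quarterly measured RTOs are trending the right way.
# The sample measurements below are illustrative, not real data.

def rto_trend(rtos_minutes: list[float]) -> float:
    """Least-squares slope of RTO over quarters (negative = improving)."""
    n = len(rtos_minutes)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(rtos_minutes) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rtos_minutes))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

quarterly_rto = [47.0, 31.0, 18.0, 12.0]  # minutes, Q1..Q4
slope = rto_trend(quarterly_rto)
print(f"Trend: {slope:.1f} min/quarter")  # negative slope means improving
```

A negative slope is the improving trend line auditors like to see; a positive one is the early warning worth catching before the audit.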
Combine, don't duplicate. If your chaos testing cadence is well-documented, it is your compliance testing cadence. Don't maintain two separate programs for the same underlying capability.
What Kafka compliance gaps does chaos testing expose?
A few specific gaps show up repeatedly when teams start running chaos experiments with compliance in mind.
Security posture doesn't survive failover (DORA, SOC 2). ACLs, encryption policies, audit logging: these frequently don't carry over to the secondary cluster. The complete strategy post flagged this in Area 1 (Security and Identity Parity) and Area 3 (Data Protection and Compliance Continuity). If your security posture degrades during failover, that's a compliance event on top of an operational one.
Payment pipelines have never been tested in isolation (PCI-DSS). A broader DR test that exercises the whole cluster doesn't isolate the payment processing path. PCI-DSS expects evidence that this specific pipeline can recover, and most teams have never run that test.
RTO claims don't match measured reality (all frameworks). The DR plan says 15 minutes, but the chaos experiment measures 47 minutes. Without testing, the 15-minute claim goes into audit documentation unchallenged until a real incident proves otherwise.
Monitoring goes dark during failover (SOC 2, DORA). Audit trail continuity breaks during cluster switching. That window between primary failure and secondary stabilization often has no logging coverage. The previous post's section on chaos-testing your monitoring covers how to catch this before an auditor does.
Compliance is resilience, documented
This series keeps coming back to the same point: replication is solved, coordination is the bottleneck, and none of it counts unless you can prove you tested it. Chaos testing addresses all of these with the same experiments and the same cadence. The only difference is who reads the results.
If you're already running chaos experiments on your Kafka infrastructure, you're doing the hard part. The compliance evidence is a documentation exercise on top of work you're already doing. If you're not running them yet, start with the methodology from the first post in this sub-series and document as you go. You'll be glad you did at your next audit.
Download the Disaster Recovery Readiness Checklist | Read the Bitvavo DORA compliance story | Explore Gateway's chaos interceptors | Book a Kafka DR Workshop
This is part of a series on Kafka Disaster Recovery.
Previously: What Kafka Chaos Tests Actually Teach You (That Monitoring Doesn't)
Start from the beginning: Kafka DR: Why Replication Isn't the Hard Part