Logs, schemas, and DLQs aren’t enough to enforce data quality on Kafka: implement guardrails to prevent poor-quality data from entering your system, without interfering with producers or consumers.
Stuart Mould
5 Jul 2024
The quality of real-time data in Apache Kafka has a significant effect on how your overall data infrastructure performs, as well as the effectiveness of its outcomes for the business. So how can those responsible for Kafka ensure high quality data at scale, whilst avoiding central-team bottlenecks and maintaining delegation of responsibilities to application teams?
This article will help you answer this by explaining common data quality issues and their remediation techniques. It also provides an example of how you can implement automation to enhance the efficiency of such practices, by continuously monitoring the quality of data in Kafka without disrupting its operation.
Data quality issues affect business outcomes
Data quality issues are like bumps in a road. If there are no bumps, your data systems and the processes that rely on them can run at full speed. If there are a few bumps, you’ll have to reduce your speed and smooth them out as you go. And if the number of obstructions increases faster than your team can fix them, everything grinds to a halt: traffic (and your business) stops moving.
In the context of real-time applications and Kafka, these "bumps" can start as invalid data and lead to applications crashing. These effects can then be amplified as they cascade into other systems.
High-quality data is essential both for the smooth operation of your systems and for the outcomes derived from them. Kensu research states that 85% of data management leaders have made improving data quality and reliability their highest priority.
The reason for this is evident in the results of the dbt State of Analytics Engineering report: poor quality of data at the source (57% of respondents) and unclear data ownership (50%) are listed as the biggest problems that data teams are facing. In practice, this means that more than half of the surveyed data teams are held back by poor data quality before their own work even begins.
How data quality affects Kafka
Because Kafka sits at the center of your streaming and event-driven architecture, it can both let upstream data quality issues through and contribute to problems in applications further downstream.
Kafka allows data producers to put anything into a topic. When the messages being produced are formatted as expected, this seems fine and everything “just works.” But when a bad producer inevitably starts outputting unexpected or invalid data, things fall apart quickly.
A producer can “go bad” because of bad inputs of its own, a Kafka misconfiguration, a bug, or the interference of an inexperienced user, along with a multitude of other possible causes. Tech teams then need to scramble to resolve the issue so that those who rely on the data are not affected. Meanwhile, identifying the root cause can take additional time and resources.
Kafka’s forgiving nature isn't a bug, however; it's a feature that grants it a lot of flexibility (it just requires careful management). This flexibility is also what makes it possible for Kafka to act as the backbone of your data operations, and thus an ideal place to centrally monitor and catch data quality issues.
The most common types of Kafka data quality failures
Here are the most common causes of poor quality data in Kafka deployments:
Invalid message: An invalid message contains data that is different from what downstream applications are expecting. For example, a downstream application expects a message containing an IBAN to identify a bank transfer, but instead gets a license plate number.
Missing or unknown schema: A schema reference is missing from the message, or it points to a schema that does not exist, no longer exists, or is the wrong one for the data.
Wrongly formatted data: Data has the intended content but the wrong format: for example, Avro was expected but JSON was received.
Invalid data values: The values satisfy the schema but are invalid to the logic of the downstream applications: for example, an incomplete or invalid IBAN whose check digits don't match.
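To make the last case concrete, here is a minimal sketch of a value-level check that a schema alone would not catch. It applies the standard mod-97 check used for IBANs; the function name is illustrative, and the sketch deliberately skips country-specific length rules, so treat it as a starting point rather than a complete validator.

```python
def iban_check_digits_ok(iban: str) -> bool:
    """Return True if the string has an IBAN-like shape and passes the mod-97 check."""
    compact = iban.replace(" ", "").upper()
    # Basic shape: ASCII alphanumeric, 15 to 34 characters long.
    if not (compact.isascii() and compact.isalnum() and 15 <= len(compact) <= 34):
        return False
    # Two-letter country code followed by two check digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1


if __name__ == "__main__":
    print(iban_check_digits_ok("GB82 WEST 1234 5698 7654 32"))  # well-known sample IBAN
    print(iban_check_digits_ok("GB82 WEST 1234 5698 7654 33"))  # last digit altered
    print(iban_check_digits_ok("AB12 CDE"))                     # a license plate, not an IBAN
```

Both the altered IBAN and the license plate would fit a schema that only says "this field is a string", which is why value checks need to sit alongside schema checks.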
General guardrails for these problems
While you should implement protections for each of these issues individually, you also need an overarching strategy to prevent any kind of data quality failure from causing a cascading effect on the overall system. Such solutions include:
Log and skip: Also known as “log and forget,” this strategy involves skipping problematic messages and logging them with the intention of following up later (though in practice the follow-up often never happens). This approach only works if every consumer application has error handling robust enough to skip bad messages safely.
Manual intervention: In this strategy, you “stop and heal” by crashing on a failure, restarting/rebalancing, and then trying again. This strategy is time-consuming and error-prone, especially when it involves manipulating offsets. Incorrectly handling offsets when skipping bad data can result in either reprocessing the same bad data (if offsets are not committed) or missing data (if offsets are incorrectly advanced). As with log and skip, you are leaving yourself vulnerable to data consistency issues.
Dead letter queue (DLQ): Failed messages are put in their own Kafka topic for later reprocessing and inspection to see if they share a common problem so that the source of the bad data can be identified. While implementing a DLQ is best practice, there is debate on whether it’s best to implement it on the producer or consumer side: if only the producers implement the DLQ, the consumer team may think it's unnecessary to include robust error handling in their logic, causing problems if bad data doesn’t get caught.
Schemas: Schemas deal with invalid message structures by supplying a model for what data should look like. Data that doesn't conform to a schema is rejected, preventing it from moving further downstream. The validity of a schema can also be checked. Schemas are not a silver bullet, however: they do not catch invalid data values that otherwise follow the schema, and a client can simply ignore them.
Unfortunately, these “fix it in post” solutions do not protect against bad data being introduced. Because there's no way to trust all clients to do the right thing, a single misbehaving actor can cause everything to fail.
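As an illustration of the consumer-side strategies above, the following sketch combines log and skip with a dead letter queue. It uses in-memory lists instead of a real Kafka client so that it runs on its own; `Record`, `Sinks`, `handle`, and `consume` are hypothetical names, and the "amount" rule is an invented example of business logic.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Record:
    topic: str
    offset: int
    value: bytes


@dataclass
class Sinks:
    processed: list = field(default_factory=list)     # stands in for downstream processing
    dead_letters: list = field(default_factory=list)  # stands in for a DLQ topic


def handle(value: bytes) -> dict:
    """Hypothetical business logic: parse JSON and require an 'amount' field."""
    event = json.loads(value)
    if "amount" not in event:
        raise ValueError("missing field: amount")
    return event


def consume(records, sinks: Sinks) -> int:
    """Process records in order and return the next offset to commit."""
    next_offset = 0
    for record in records:
        try:
            sinks.processed.append(handle(record.value))
        except (ValueError, TypeError) as exc:
            # Skip the record, but keep the original bytes and context for later inspection.
            sinks.dead_letters.append({
                "topic": record.topic,
                "offset": record.offset,
                "error": str(exc),
                "value": record.value,
            })
        # Advance past the record in both cases, so a bad message is not reprocessed forever.
        next_offset = record.offset + 1
    return next_offset
```

Note that this logic lives inside one consumer. Every other consumer of the same topic needs equivalent handling, which is the weakness these strategies share.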
Fixing Kafka data quality at scale without interfering with producers/consumers
These general guardrails still depend on every actor always behaving 100% in line with requirements. Rather than relying on them entirely, you can use a middleware approach to solve Kafka data quality issues at scale, without interfering with every component within your existing connected data infrastructure.
The Conduktor Enterprise platform offers a Kafka proxy that your producers and consumers can connect to in the same way as a Kafka cluster whilst getting all the benefits of this control plane. It performs all the validations, schema checks, and any other data quality checks that you need to ensure that only valid records pass through to your actual Kafka infrastructure.
This middleware approach enables addressing issues for all users (producers and consumers) at once, without requiring changes to any application code (other than pointing at the proxy).
By checking that payloads adhere to detailed schemas (or by defining an SQL statement to assert data quality), you can validate data, including the values themselves, before it is produced to Kafka.
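The general idea of validating before produce can be sketched as a gate that sits between the producer and the broker. This is a conceptual illustration only, not Conduktor's API or configuration format: `make_gate`, `RejectedRecord`, and the rules in `RULES` are all hypothetical, and the "broker" is a plain Python list.

```python
from typing import Callable

Rule = Callable[[dict], bool]


class RejectedRecord(Exception):
    """Raised when a payload fails one or more data quality rules."""


def make_gate(rules: dict[str, Rule], forward: Callable[[str, dict], None]):
    """Wrap a forward function so that only payloads passing every rule reach it."""
    def produce(topic: str, payload: dict) -> None:
        failed = [name for name, rule in rules.items() if not rule(payload)]
        if failed:
            # Reject before the record ever reaches the broker.
            raise RejectedRecord(f"{topic}: failed rules {failed}")
        forward(topic, payload)
    return produce


# Hypothetical rules for a payments topic.
RULES = {
    "amount_is_positive": lambda p: isinstance(p.get("amount"), (int, float)) and p["amount"] > 0,
    "currency_is_known": lambda p: p.get("currency") in {"EUR", "USD", "GBP"},
}

if __name__ == "__main__":
    broker = []  # stands in for the real Kafka cluster
    produce = make_gate(RULES, lambda topic, payload: broker.append((topic, payload)))
    produce("payments", {"amount": 10, "currency": "EUR"})
    try:
        produce("payments", {"amount": -1, "currency": "EUR"})
    except RejectedRecord as exc:
        print("rejected:", exc)
    print(len(broker), "record(s) reached the broker")
```

Because the rules live in one place in front of the cluster, the rejection applies to every producer at once, and the producer that sent the bad record learns about it immediately.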
Additionally, Schema ID validation ensures not only that a message conforms to a schema, but also that the schema is set and is present in the registry. Rather than having to implement separate data contracts, Conduktor makes the schema the contract, letting you add business logic within the schema.
While this approach gives you another piece of infrastructure to manage, this additional work is dwarfed by the escalating data management tasks it replaces as Kafka environments scale.
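To show what a schema ID check involves, here is a minimal sketch that assumes records use the Confluent Schema Registry wire format, in which a serialized value starts with a zero "magic byte" followed by a 4-byte big-endian schema ID. The function names are illustrative, and the `known_ids` set is a hypothetical stand-in for a real registry lookup; this is not how any particular product implements the check.

```python
import struct
from typing import Optional

MAGIC_BYTE = 0  # first byte of a value in the Confluent wire format


def schema_id_of(value: bytes) -> Optional[int]:
    """Return the schema ID from the 5-byte header, or None if the header is absent."""
    if len(value) < 5 or value[0] != MAGIC_BYTE:
        return None
    (schema_id,) = struct.unpack(">I", value[1:5])
    return schema_id


def check_schema_id(value: bytes, known_ids: set) -> str:
    """Classify a record as 'ok', 'missing' (no header) or 'unknown' (ID not registered)."""
    schema_id = schema_id_of(value)
    if schema_id is None:
        return "missing"
    if schema_id not in known_ids:
        return "unknown"
    return "ok"


if __name__ == "__main__":
    known_ids = {42}  # hypothetical: IDs currently present in the registry
    print(check_schema_id(b"\x00\x00\x00\x00\x2a" + b"avro-bytes", known_ids))
    print(check_schema_id(b'{"iban": "GB82..."}', known_ids))
    print(check_schema_id(b"\x00\x00\x00\x00\x07" + b"avro-bytes", known_ids))
```

A plain JSON payload has no header at all, so the same check also flags the case where a schema-encoded format was expected but raw JSON was sent.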
Next steps: improving data quality with a data mesh
The tools you choose to enhance the quality of your Kafka data should empower your teams and reduce their operational overheads so that they can focus on continually improving your data tools and processes.
Solving Kafka data quality issues should not push responsibility downstream. An ideal solution should automate processes to streamline or remove them entirely, and spread responsibility across the teams. Topics should be assigned to the teams with the most understanding and context for the data: where it's from, what it should look like, and what it will be used for.
That's what data mesh is all about: when ownership of data is made clear and management is democratized, the value of data can be fully realized, and its quality can be enforced by those who are best positioned to identify when a data value is invalid or incorrect, along with its most likely cause. You can start your data mesh journey with our webinar exploring how FlixBus built their self-service data streaming platform, or book a demo to see comprehensive data quality enforcement in action.