Bounce Bad Kafka Data with JSON Schema Rules

Avoid bad data in Kafka. Learn how Conduktor Trust’s JSON Schema Rule validates JSON messages before ingestion—no Schema Registry required.

Apache Kafka has revolutionized how enterprises handle real-time, fast-moving data, powering everything from fraud detection to recommendation engines. Kafka succeeds because it accepts data and passes it on without fuss, regardless of type, structure, or intended use.

But its greatest strength is also its biggest weakness. Think of Kafka as a warehouse that accepts packages from various sources and sends them off to their final destinations. But it doesn’t have a quality inspector, so all packages are sent off, even if they’re broken, mislabeled, or dangerous.

As a result, a single malformed, misrouted, or inconsistent message can wreak havoc: dashboards break, downstream jobs crash, and machine learning models train on corrupted data.

Schema Registry: useful, but lacking

Just as a warehouse needs a quality control inspector to ensure that packages are not misdirected or damaged, Kafka needs a tool to define the structure (or schema) of messages: which shapes of data are acceptable, and how they are allowed to change over time.

In essence, Schema Registry manages the data contracts in Kafka pipelines, guaranteeing that producers and consumers can continue to communicate. It also carries out other tasks, such as tracking schema evolution and providing an API for handling different schemas.

However, Schema Registry is not a perfect solution, especially in the following situations: 

  • Partial migrations and unclear ownership. In many organizations, only some services have adopted Schema Registry, leaving governance across the Kafka estate in a fragmented and inconsistent state. 

  • Breaking changes. Moving a topic to Schema Registry is a breaking change unless teams coordinate with all downstream consumers to take preventive actions, such as adopting the Schema Registry deserializer. That sort of tight coordination across teams isn’t always possible.

  • JSON data. Unlike Avro or Protobuf, raw JSON has a well-defined syntax but no schema enforcement across messages, which makes it simultaneously easy to use and difficult to troubleshoot.

JSON: Flexibility and danger

JavaScript Object Notation (JSON) is a common data format. Two qualities in particular make it intuitive and popular:

  • Flexible. Without the need for a schema, JSON data can be easily ingested and used by many technologies, from databases to analytics to AI.

  • Human-readable. People—engineers and non-technical roles alike—can skim JSON fields and understand the data contained within.

However, JSON’s strengths are also its weaknesses, including:

  • Loose schema. As a general rule for data formats, the tighter the contract, the lower the flexibility, and vice versa. JSON sits at the flexible end of that spectrum: its contracts are very loose, which creates difficulties in communication between producers and consumers.

  • Assumed, rather than enforced, structures. A JSON message’s structure is usually implied rather than formally declared. Standards like JSON Schema let you describe that structure, but without a mechanism to enforce it, JSON can easily suffer from missing fields, unenforced types, or invisible business rules.

Here’s an example from JSON Schema, for an online retailer’s backend systems:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/product.schema.json",
  "title": "Product",
  "description": "A product from Acme's catalog",
  "type": "object",
  "$comment": "setting additionalProperties to false to ensure no extra fields are allowed",
  "additionalProperties": false,
  "properties": {
    "productId": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "productName": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "description": "The price of the product",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "tags": {
      "description": "Tags for the product",
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 1,
      "uniqueItems": true
    }
  },
  "required": [ "productId", "productName", "price" ]
}

You’re expecting incoming JSON data to follow this format, but it hasn’t been codified in a schema registry. So your order system simply assumes that "productId" is an integer, "productName" is a string, and "price" is a positive number.

Valid JSON data could resemble the following:

{ "productId": 1234, "productName": "duck", "price": 22.22 }

{"productId": 1234, "productName": "duck", "price": 22.22, “tags”: [“very_cool”]} 

While invalid data could look like this:

{ "productName": "duck", "price": 22.22 }  // missing required field

{ "productId": 1234, "productName": "duck", "price": -5 }  // invalid price

{"productId": 1234, "productName": "duck", "price": 22.22, “favorite_company”: “Conduktor”}  // extra field

These poison pills, or pieces of data that violate the expected formats of your downstream systems, can crash applications, cause machine learning drift, and introduce inaccuracies into analytics. 
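Once a schema like the Product example above exists, producers can screen for these poison pills before a message is ever sent. Here’s a minimal sketch using the Python jsonschema package (the package choice and the schema file name are illustrative assumptions; any JSON Schema validator plays the same role):

import json

from jsonschema import validate, ValidationError

# Load the Product schema shown above (file name is illustrative).
with open("product.schema.json") as f:
    schema = json.load(f)

# A poison pill: the required "productId" field is missing.
message = {"productName": "duck", "price": 22.22}

try:
    validate(instance=message, schema=schema)
except ValidationError as err:
    # Prints: 'productId' is a required property
    print(f"Rejected before produce: {err.message}")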

Kafka’s other blind spot: Sending data to the wrong topic

Due to factors like misconfiguration or human error, data can easily end up in the wrong Kafka topic. A developer might mistype “user-signups” as “user-signps” (which Kafka will happily accept), or simply assume, incorrectly, that data belongs on a particular topic.
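To see how little friction Kafka puts in the way, consider this minimal sketch using the confluent-kafka Python client, assuming a broker running with the default auto.create.topics.enable=true (the bootstrap address and payload are placeholders):

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Typo: "user-signps" instead of "user-signups". Kafka silently
# auto-creates the misspelled topic and accepts the message;
# no error is raised, and the real consumers never see the data.
producer.produce("user-signps", value=b'{"userId": 42}')
producer.flush()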

Just as there are no quality controls in Kafka, there are few ways to block data from entering the wrong topic. Kafka does include authorization controls (ACLs), which allow or deny application access to topics. However, these can be improperly configured, and Kafka has no way to know.

This leads to:

  • Missing data. The correct consumers don’t have what they need.

  • Irrelevant data. Topics fill up with data they don’t need, and once a record is written, it cannot be selectively removed; it stays until it ages out of the retention period.

  • Crashes and other issues. If applications receive data in a completely different format than what they are configured for, there could be deserialization or application errors.

Introducing the JSON Schema Rule for Conduktor Trust

JSON Schema Rule acts as a gatekeeper: it ensures every JSON message matches a predefined schema before it ever reaches your Kafka topics. 

First, Conduktor Trust’s Kafka proxy intercepts each message and validates it against a predefined schema, checking field names, types, and structure. The JSON Schema Rule can also enforce field-level quality rules (for instance, "productId" is a SKU code, so it must be an integer).

JSON Schema Rule can: 

  • Log violations for observability, without blocking the message.

  • Mark the violation with a header for downstream actions.

  • Block the message from entering the topic entirely.

JSON Schema Rule specifically (and Trust more generally) does not replace Schema Registry because it doesn’t handle schema management, track schema evolution, ensure compatibility, or provide an API with which to handle schemas.

Trust works just as well with Schema Registry as without; either way, you can use CEL (Common Expression Language) to enforce semantic data quality rules. You can also use a built-in rule to validate that records carry a valid schema ID from Schema Registry.
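For example, a semantic rule for the Product messages above might be expressed in CEL as value.price > 0.0 && size(value.tags) >= 1. (This sketch assumes the rule engine binds the record payload to value; treat the exact binding as illustrative and consult Trust’s documentation.)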

In short, JSON Schema Rule gives teams a way to enforce structure and protect data pipelines without requiring a full Schema Registry rollout. It’s the missing middle ground between ad hoc JSON and full-blown schema governance.

Your JSON data—and your Kafka pipelines—need Trust

With Conduktor Trust and JSON Schema Rules, your team can finally ensure that JSON is no longer a source of anxiety, confusion, and roadblocks. Instead, teams can use JSON data with confidence, knowing that there will be no more hidden poison pills, no data trapped in the wrong topics, and, above all, no silent downstream failures.

Ready to stop bad data at the door? Start your free demo of JSON Schema Rules in Conduktor Trust today.
