Stéphane Derosiaux
Aug 14, 2024
How to Query Topics and Messages in Apache Kafka: SQL and Beyond
Data users have long sought the ability to query data in topics directly in Apache Kafka. SQL may be the unrecognized solution to this challenge.
Large organizations frequently use Apache Kafka to exchange large volumes of data, but there is a big problem: you can’t easily query these messages. Someone always needs to duplicate the data from Kafka topics to a regular database to effectively query it. Being able to directly query Kafka using a familiar language like SQL would solve this issue.
Enabling every team member across an organization to access and use real-time data with a familiar query syntax is a transformative strategy that drives innovation and operational efficiency. This capability empowers developers, business analysts, data scientists, and decision-makers to build a data-driven culture.
Read on to see how to use SQL to query Kafka topics and boost your data accessibility.
TL;DR
Apache Kafka is a sensible and proven choice for exchanging data, but it doesn’t easily support data queries.
Querying Kafka topics usually requires additional configuration that can turn your data into a mess.
By using Apache Kafka, SQL, and additional tools like Conduktor, you can boost your data accessibility and visibility. As a result, you can strengthen collaboration between your data engineering, development, analytics, and product teams.
Is Kafka just for streaming ETL?
Kafka was open-sourced in 2011 when massive databases and big data were king. Since then, we’ve learned a lot about how to work with streaming data.
Today, Kafka is mostly used to move data reliably to a destination everyone can work with. This might be a database, data warehouse, or data lake that users can query (such as PostgreSQL, ClickHouse, Elasticsearch, or Snowflake), that the analytics team can work with, and that you can use to build dashboards and machine learning models.
Kafka is often used only for its real-time capabilities and does not hold historical data — its default retention is seven days, after which older data is automatically deleted.
You can combine Kafka with stream processing technologies like Kafka Streams, Apache Spark, or Apache Flink: these transform and filter the data and enrich it with additional data points from different sources.
Kafka is great for building streaming ETL (extract, transform, and load) pipelines, which capture data, transform it, and load it to another system in real time — in contrast to traditional batch processing, which runs on a schedule (every X minutes or hours).
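As a rough illustration, here is a minimal streaming-ETL sketch using PySpark Structured Streaming. The topic name, schema, and sink are placeholders, and running it requires Spark's Kafka integration package; it is a sketch of the pattern, not a production pipeline.

```python
# A minimal streaming-ETL sketch with PySpark Structured Streaming.
# Topic name ("orders"), schema, and broker address are placeholders.
# Requires the spark-sql-kafka integration package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

schema = (StructType()
          .add("order_id", StringType())
          .add("amount", DoubleType())
          .add("country", StringType()))

# Extract: read the raw topic as an unbounded stream.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "orders")
       .load())

# Transform: parse JSON values and keep only large orders.
orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("o"))
          .select("o.*")
          .where(col("amount") > 100))

# Load: continuously write the result to a sink (console here for simplicity).
query = (orders.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```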
While this looks good, Kafka has one big drawback: data accessibility.
Kafka fails at querying data
Apache Kafka often sits at the center of an organization's data flows: applications produce data to Kafka and communicate with each other through it before that data moves on to other systems. However, as explained below, this data is barely accessible to non-developers, including data scientists, analysts, and product owners.
Exploring and working with data in Kafka is not straightforward, even for developers, because there is no widely adopted query language like SQL for interacting with Kafka data. You often need advanced command-line tooling in your terminal to look at and analyze the data — but that can only go so far.
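For illustration, this is roughly what inspecting a topic looks like today with a plain consumer client — a sketch assuming a hypothetical "payments" topic on a local broker:

```python
# A sketch of how developers typically peek at a topic today: consume raw bytes
# and decode them by hand. Topic name and broker address are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "payments",                          # hypothetical topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",        # start from the oldest retained message
    consumer_timeout_ms=5000,            # stop when no new messages arrive
)

# There is no WHERE clause: filtering happens client-side, message by message.
for msg in consumer:
    event = json.loads(msg.value)
    if event.get("status") == "FAILED":
        print(msg.partition, msg.offset, event)
```

Compared with a one-line SQL query, this requires client libraries, deserialization logic, and client-side filtering over every message.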
Not everyone in an organization is tech-savvy, and organizations want to provide a consistent experience for everyone to access the data equally. For example, they want the entire team, no matter how comfortable they are with tech, to be able to work on a new project without having to learn complex new tools.
That’s where platforms like Conduktor can speed things up and increase visibility into your Kafka ecosystem. You can check out Conduktor 2.0 to see how it’s setting the new standard for teams working with data at scale.
Why Kafka queries can be risky
In the Kafka space, organizations depend upon data engineering teams to build the necessary pipelines and ETL to make data accessible. These teams typically rely on Kafka Connect: change data capture (CDC) tools like Debezium stream database changes into Kafka, while sink connectors move the data out of Kafka into queryable stores.
While this approach can make data in Kafka more accessible, it's not problem-free, especially when the data is spread across multiple tables. It also means installing, configuring, and operating a Kafka Connect cluster.
Not only does it put your data consistency at risk, it also dilutes its ownership, security, and responsibilities.
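For context, wiring up this kind of pipeline usually means registering connectors through the Kafka Connect REST API. The sketch below uses placeholder hosts and credentials, and the exact Debezium configuration keys vary by version:

```python
# A sketch of registering a Debezium source connector through the Kafka Connect
# REST API. Host names, credentials, and some config keys (which vary by
# Debezium version) are illustrative placeholders.
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "orders",
        "topic.prefix": "orders-cdc",   # key name differs in older Debezium releases
    },
}

# Kafka Connect exposes connector management as a REST API (port 8083 by default).
resp = requests.post("http://connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```

Every connector like this is one more moving part to deploy, secure, and monitor — which is exactly where consistency and ownership start to drift.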
But Kafka is not an SQL database, right?
Back in 2018, Martin Kleppmann explained that you can achieve all the atomicity, consistency, isolation, and durability (ACID) requirements of a database in Kafka by building stream processors.
Kafka also fully supports exactly-once transactions, and the KIP-939 proposal aims to add two-phase commit (2PC) support so Kafka transactions can be coordinated with other databases.
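As a rough illustration of the transactional API, here is a sketch using the confluent-kafka Python client; the broker address, topic names, and transactional.id are placeholders:

```python
# A minimal sketch of Kafka's exactly-once (transactional) producer API, using
# the confluent-kafka client. Broker address and topic names are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "payments-writer-1",  # identifies this transactional producer
})

producer.init_transactions()
producer.begin_transaction()
try:
    # All writes between begin and commit become visible atomically to
    # consumers configured with isolation.level=read_committed.
    producer.produce("payments", key="order-42", value=b'{"amount": 99.9}')
    producer.produce("audit-log", key="order-42", value=b'{"event": "charged"}')
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
```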
However, Kleppmann concluded that for ad-hoc queries, you must move your data to a real database.
Six years later, this caveat is still present, slowing down everyone who wants to work with Kafka.
Solving the data mess
Organizations have tons of data of varying quality in Kafka and in databases. Because the rules and best practices for different data solutions are not the same, it can quickly become difficult to know what data is where or where the source of truth is. It’s what we call a data mess.
The duplication of data from Kafka to databases adds a thick layer of complexity. Because of fundamentally different security models, the ownership and security of the data become brittle and possibly inconsistent. Kafka has one way of protecting data, and databases have another way. This security mismatch is hard to fix, and if you add requirements such as data masking or field-level encryption, it’s almost impossible.
This is how data leaks happen. For example, in March 2024, a breach of a French government agency exposed the data of up to 43 million people. These incidents underscore a clear deficiency in skills, consistency, and maturity in the ecosystem.
The rapid multiplication of data products aggravates the fragmentation of the data landscape within organizations. This proliferation creates data silos (that may contain stale or mismatched data), each operating in isolation, diluting the potential for a unified data strategy. The ad-hoc development of data pipelines, built outside a cohesive governance framework, leaves organizations vulnerable to inaccuracies and inconsistencies.
Apache Kafka and SQL - the end game for efficient topic querying?
SQL is a very well-known and popular programming language, ranking 6th on the TIOBE index and used by 40% of developers globally.
When talking about SQL, we have to consider PostgreSQL. Its wire protocol has emerged as the leading standard, and every vendor in the data space wants to be compatible with it.
Tools like Grafana, Metabase, Tableau, DBeaver, and Apache Superset can all connect to services that offer a PostgreSQL-compatible endpoint. A Kafka platform that provides such an endpoint for any topic lets you use these tools for data visualization and straightforward introspection.
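To make this concrete, here is a sketch of what querying a topic through a hypothetical PostgreSQL-compatible endpoint could look like with a standard client such as psycopg2. The host, credentials, and the "orders" relation are all placeholders — the point is that any PostgreSQL client works unchanged:

```python
# A sketch of querying a topic exposed through a PostgreSQL-compatible endpoint.
# Host, port, credentials, and the "orders" relation are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="sql-endpoint.example.com",  # hypothetical PostgreSQL-compatible gateway
    port=5432,
    dbname="kafka",
    user="analyst",
    password="secret",
)

with conn.cursor() as cur:
    # The topic is exposed as a relation, so plain SQL applies.
    cur.execute("""
        SELECT country, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM orders
        WHERE amount > 100
        GROUP BY country
        ORDER BY revenue DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)

conn.close()
```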
SQL offers a solid foundation for building a unified data ecosystem, with Kafka as the single source of truth. PostgreSQL stands out for its broad compatibility and ease of getting started thanks to many accessible vendors. Its open-source nature, seamless integration with development environments, and straightforward setup and management make it a preferred database choice for scalability, versatility, flexibility, and robustness.
By using Conduktor to extend Kafka with SQL capabilities, you can significantly reduce the need for data pipelines and replication. This also improves overall performance and cost efficiency, simplifies governance, and reduces security failures.
This would also integrate data engineers directly into product teams rather than keeping them isolated in their own silos with their own data roadmaps, strengthening collaboration between developers and analytics teams.
Apache Kafka, SQL, and beyond - the case of AI and ML
SQL is perfect for ad-hoc analysis, dashboarding, or building data pipelines. But it’s not the best for handling the massive volumes of data required for data science and AI/ML. This is where technologies like Apache Parquet and Apache Iceberg shine.
They provide columnar storage and filter-pushdown optimizations that make querying large quantities of data efficient. Many data scientists love them because tools like Apache Spark, Pandas, Dask, and Trino can query them directly. This improves data accessibility and simplifies building AI/ML applications.
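As a small illustration, a columnar read with filter pushdown on a Parquet dataset looks roughly like this with pandas and the pyarrow engine; the path, column names, and filter are placeholders:

```python
# A sketch of columnar reads with filter pushdown on Parquet data, using pandas
# with the pyarrow engine. The dataset path, columns, and filter are placeholders.
import pandas as pd

df = pd.read_parquet(
    "s3://lake/orders/",              # hypothetical Parquet dataset
    engine="pyarrow",
    columns=["country", "amount"],    # read only the columns you need
    filters=[("order_date", ">=", "2024-01-01")],  # pushed down to the file scan
)

print(df.groupby("country")["amount"].sum().nlargest(10))
```

Only the requested columns and row groups are read, which is why these formats scale so well for analytics and ML workloads.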
As we shared in our review of Kafka Summit London 2024, Kafka’s ability to serve as the single source of truth — as organizations seek to expose data in Kafka across multiple formats — is becoming a reality. Confluent has announced Tableflow, which seamlessly materializes Kafka topics as Apache Iceberg tables without having to build and maintain data pipelines.
Paving the way to efficiently query Kafka topics and messages
Using Conduktor’s SQL syntax, you can query Kafka topics more efficiently and benefit from Kafka’s real-time capabilities to meet diverse business requirements. Kafka and Conduktor also fulfill AI/ML demands by supporting file formats such as Parquet and Iceberg. In these ways, they are paving the way for a future where data is truly accessible and optimized for various consumption preferences.
Building data-driven products and eliminating the need for technical duplication across different data stores will lead to a more efficient and secure data ecosystem.
If you want to make your data more accessible while eliminating friction from governance and security, book a Conduktor demo.