Our review of Kafka Summit London 2024: Analytics, Self-Serve, SQL
Stéphane Derosiaux
29 March 2024
Last week, we attended Kafka Summit London, a favorite event of ours. It's the best way to connect with our community and discuss a wide partition of topics (pun intended). It's also the time of year when Confluent announces its vision with something big enough to steer the market in new directions, and this Kafka Summit was no exception.
TLDR: You were as excited as us! 😍
Data Streaming is all the rage
Data Streaming has become a pivotal technology in modern data architectures, capturing the interest of various industries thanks to its ability to deliver real-time insights and enable real-time decision-making.
Gone are the days when discussions about Kafka were solely focused on low-level technicalities such as infrastructure details, deployment and scaling methods, settings to tweak, ksqlDB, event sourcing, and so on. We have gradually shifted towards higher-level concerns: maximizing the value of our data at the organizational scale in the simplest way possible.
Major players like Snowflake, Databricks, and MongoDB have recognized its importance and are continuously working on integrating streaming capabilities directly into their platforms. They want to offer a seamless experience for processing data streams without the need for external services, thereby enhancing their ecosystems' value and efficiency.
Similar to AWS Redshift and its streaming ingestion directly from MSK, Snowflake, known for its cloud data platform, allows for the ingestion of streaming data via Snowpipe, enabling real-time analytics and data sharing across its secure and scalable environment. Databricks, leveraging its Unified Analytics Platform, combines big data and machine learning to process streaming data, thereby accelerating innovation. MongoDB, with its document-based database system, has recently introduced capabilities that allow for the processing of real-time data streams.
Confluent, on the other hand, aims to bring typical data lake use cases into its own realm, thereby maximizing the value of its own data (stored in its infrastructure). Conduktor is also going in this direction by providing easy access to Kafka data via alternative interfaces such as SQL, without the overhead of building pipelines or using any stream processing frameworks.
There is an intention to streamline data usage and data storage while avoiding the construction of costly pipelines that ultimately just duplicate data and create confusion regarding data ownership within a company.
The proliferation of data streaming technologies signifies a shift towards more agile and real-time data management practices. As businesses continue to seek faster and more efficient ways to process and analyze data, the role of data streaming—and the innovations brought forth by these leading actors—will undoubtedly expand, shaping the future of data infrastructure and analytics.
AWS S3 + Apache Kafka Protocol = Win?
Kafka Summit, organized by Confluent, naturally steers the discussions. Last year the focus was on Flink; this year, the spotlight was on analytics. Honestly, this shift is more interesting and groundbreaking, as analytics represents a more significant and costlier challenge within organizations, and it has been a while since any disruptive change occurred in this area.
“Data streams are the abstraction that unify the operational estate and Kafka is the open standard for data streaming” — Jay Kreps, CEO of Confluent
The data infrastructure landscape is currently experiencing several paradigm shifts, each contributing to a broader evolution in how we manage and process data. The concept of separating compute from storage has emerged as a cornerstone, with Amazon S3 as the de facto standard for big data storage (Apache Kafka introduced tiered storage to use it, and WarpStream pushed this even further). This separation allows for more flexible and efficient data management strategies, enabling organizations to store vast amounts of data inexpensively while choosing the most suitable processing power as needed.
In this evolving landscape, analytics has been a forerunner, advocating for a model where data lakes store everything, leaving the door open to query and analyze data in versatile ways (SQL, .parquet file format, Apache Iceberg table format).
Confluent introduced TableFlow, a seamless materialization of Apache Kafka topics as Apache Iceberg tables (Apache Iceberg appears to be the market leader here; alternatives are Delta Lake from Databricks and Apache Hudi, which seems more appropriate for streaming workloads). This is a step towards making event streaming data more immediately useful for analytics, moving beyond the traditional view of event streaming platforms as mere conduits for transient data. In other words, rather than keeping a retention policy of 7 or 14 days on Kafka topics, we might see a trend towards infinite retention and the death of data pipelines that just move data without adding business value.
Infinite retention?
Apache Kafka has begun to redefine its role in the data ecosystem. Traditionally seen as a temporary buffer, Kafka is increasingly being recognized for its potential as a new form of data lake. This shift hints at a future where the distinctions between data-in-motion and data-at-rest blur, enabling more dynamic and cost-effective data infrastructure solutions that leverage the strengths of both event streaming platforms and data lakes.
Our booth: Stay awhile and listen
This was our favorite Kafka Summit to date! The attendees were exceptional; we had never had so many deep conversations. Some enthusiasts stayed with us to chat for 2 or 3 hours! It seems we did a good job marketing-wise, as most were already aware of Conduktor and on the verge of giving it a try in their organizations! This was just gold to us.
Connecting people
We had many advanced discussions with tech leaders, architects, platform teams, and security professionals. As Conduktor is a Collaborative Kafka Platform that connects all of them, this was spot on: we can provide solutions to exactly these problems. People were consistently excited by what we do: a powerful Console combined with a powerful Kafka proxy to resolve the challenges around Kafka governance and security for users and applications. So many things they never thought were possible.
We had a lot of positive feedback covering critical aspects in organizations such as our end-to-end encryption, multi-tenancy, self-service approach (centered around GitOps), granular RBAC (for users and applications), and data quality controls plugged into the Schema Registry. Also, a feature we are currently working on, simply named "SQL", was definitely getting some traction (more on this later).
We received a lot of praise for our work on our Console (which has an average NPS score of 80, which is great). It took us a while to get it where we wanted it, but users definitely noticed the effort and the result. We are proud of it, and we are glad to see that the community appreciates it.
We have a lot left to do, and our roadmap is long. But we need to pick our battles and move step by step. We still have many critical features to implement, along with many nice-to-haves. To decide is to renounce!
Praise is good, but please also tell us where it's bad, where we should improve, and what you'd like to see: we're listening.
The challenges in the Kafka community?
Here's a summary of the challenges discussed by attendees. Does this sound familiar? Get in touch if it does!
Conduktor simplifies things; and we need this simplicity in our landscape. It helps speed up our daily operations, and helps us with credit card data (PCI DSS) by encrypting the topics. Conduktor, in one sentence, for me, is Kafka made simpler.
— Marcos Rodriguez, Domain Architect at Lufthansa
Governance and Security: There's a significant emphasis on managing data access, securing sensitive information, and maintaining data quality and integrity through governance practices. The necessity to protect sensitive data and ensure high data quality at scale is crucial for creating dependable "golden data" that businesses can rely on for decision-making.
Self-Service and Monitoring: A few years back, the market was not there yet. Now, there is a noticeable demand for self-service governance and monitoring solutions to simplify resource lifecycle management and empower product teams with autonomy. Efficiency with control. More on this below.
Data Encryption: With regulations becoming commonplace, encryption is frequently mentioned as a challenge. The community is looking for solutions that simplify encryption processes while offering flexibility. This is particularly important for transactional data and data shared across networks.
Kafka and SQL: Although Kafka clients exist for all major programming languages and Kafka integrates well with many technologies and SaaS products, a significant limitation is the inability to query Kafka directly; we can only consume data sequentially. This is perfect for streaming applications but useless for analytical usage. This is why, today, data teams rely on complex stream processing frameworks such as ksqlDB, Spark, or Flink to build aggregated views. However, these can be brittle, complex, or costly. On our R&D front, we've begun exploring how to provide SQL access over Kafka without the need for any framework or added complexity (see the sketch after this list). This definitely intrigued our visitors!
Kafka Adoption and Scaling Issues: The Kafka community is growing every day. Many attendees are either new to Kafka or in the early stages of adoption, looking to scale their usage significantly, planning rapid expansion, or facing scaling challenges with their current messaging systems (e.g., moving from RabbitMQ to Kafka). The objective is to leverage technology for business goals, not just for the sake of technology.
Need for Better Consoles/UIs: This is where we shine, and visitors were in awe. There is a strong need for consoles/UIs that provide developers with better insights, awareness, and controls. They should also ensure a high level of security and visibility for teams managing the Kafka ecosystem, with role-based access control (RBAC), audit capabilities, and the ability to restrict or enhance user experiences as a conscious choice.
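To make the Kafka-and-SQL point above concrete, here is a minimal sketch in plain Java (not Conduktor's upcoming SQL feature; the topic name, broker address, and bounded polling loop are illustrative) of what "consume sequentially and aggregate by hand" looks like: counting records per key over a topic, i.e. hand-rolling what a single GROUP BY query would express.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CountByKey {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // illustrative
        props.put("group.id", "count-by-key-demo");         // illustrative
        props.put("auto.offset.reset", "earliest");         // read the topic from the beginning
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Kafka has no "query" API: the only option is to read every record
        // sequentially and fold it into local state, i.e. hand-rolling what
        // `SELECT key, COUNT(*) FROM orders GROUP BY key` would express directly.
        Map<String, Long> countsByKey = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            for (int polls = 0; polls < 10; polls++) {       // bounded for the sketch
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    String key = record.key() == null ? "<no key>" : record.key();
                    countsByKey.merge(key, 1L, Long::sum);
                }
            }
        }
        countsByKey.forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

Multiply this boilerplate by every ad-hoc question an analyst wants to ask, and it becomes clear why teams reach for ksqlDB, Spark, or Flink today, and why direct SQL access over Kafka is so appealing.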
Making Everyone’s Life Easier: Self-serve
We offer multiple interfaces to interact with Conduktor and Kafka:
a graphical user interface (UI)
an application programming interface (API)
very soon: a command-line interface (CLI)
on the horizon: Terraform
Our approach is driven by GitOps principles, common in enterprises for automated, auditable, and repeatable operations. Our objective is to have Central/Platform teams and Product teams managing their resources (such as clusters, groups, permissions, policies, alerts, etc.) and Kafka resources (including topics, subjects, connectors, etc.) through a unified definition mechanism.
For example, the platform team can authorize an application to access a specific Kafka cluster along with its corresponding permissions in the following manner:
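Something along these lines, as a hypothetical sketch: the resource kind and field names below are illustrative, not a copy of Conduktor's published schema. The point is that the grant lives in a versioned manifest that the platform team reviews and applies like any other GitOps change.

```yaml
# Hypothetical self-serve manifest (illustrative field names, not the exact schema):
# the platform team grants the "clickstream" application an instance on a given
# cluster, scoped to the resources it owns via a naming-convention prefix.
apiVersion: v1
kind: ApplicationInstance
metadata:
  name: clickstream-dev
  application: clickstream
spec:
  cluster: dev-kafka
  serviceAccount: sa-clickstream
  resources:
    - type: TOPIC
      patternType: PREFIXED
      name: "clickstream-"       # the team owns every topic under this prefix
    - type: CONSUMER_GROUP
      patternType: PREFIXED
      name: "clickstream-"
```

Because the definition is just a file in Git, the usual pull-request workflow gives you review, audit, and repeatability out of the box.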
Our talk — Kafka Pitfalls and Best Practices
Our Customer Success team had the opportunity to take the mic and present the typical issues they encounter with our users and customers. Stay tuned for the recording!
Some items discussed were:
Consumers stuck because of a poison pill (a record that makes its consumers fail), forcing developers to shift the offset via the CLI or Conduktor (see the sketch after this list)
retention.ms set too low, causing too many open files and memory issues on the broker side
How to properly configure producers to avoid data duplication and data loss (acks, idempotence, delivery.timeout.ms); see the configuration sketch below
Best practices for upgrading Kafka clients and brokers, dealing with ACLs, and naming conventions
... and watch the talk when you can for more classic pitfalls you might be falling into and best practices you should follow!
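On the poison-pill item: the remedy presented is to shift the offset via the CLI or Conduktor, and the same idea can be expressed in code. Below is a hedged sketch assuming a recent Kafka Java client (broker, group, and topic names are illustrative): when a record cannot even be deserialized, poll() throws a RecordDeserializationException carrying the partition and offset, so the consumer can log it and seek one offset past it instead of crash-looping.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.RecordDeserializationException;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SkipPoisonPill {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // illustrative
        props.put("group.id", "payments-consumer");         // illustrative
        // In practice the failure comes from a schema-aware deserializer (Avro, JSON Schema...);
        // String deserializers are used here only to keep the sketch self-contained.
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(record.value()); // your business logic goes here
                    }
                    consumer.commitSync();
                } catch (RecordDeserializationException poisonPill) {
                    // Log the offending offset and move the position one past it,
                    // so the consumer stops crash-looping on the same record.
                    System.err.printf("Skipping poison pill at %s offset %d%n",
                            poisonPill.topicPartition(), poisonPill.offset());
                    consumer.seek(poisonPill.topicPartition(), poisonPill.offset() + 1);
                }
            }
        }
    }
}
```

Skipping means losing that record, so production setups usually also route it to a dead-letter topic for later inspection.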
Of these, the most impactful is the naming convention, as it puts structure on your resources and helps you move in a self-serve direction where owners are defined based on prefixes.
The most common pitfall concerns producer configuration: producers are responsible for publishing the data, so they have to be properly configured to ensure data quality and continuity.
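Since producer configuration is the most common pitfall, here is a minimal sketch of a producer set up along the lines discussed in the talk (broker address, topic, and values are illustrative, not a universal recommendation): idempotence so retries cannot create duplicates, acks=all so a write is only acknowledged once all in-sync replicas have it, and delivery.timeout.ms as the total time budget before giving up and surfacing the failure.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // illustrative
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        props.put("enable.idempotence", "true");    // retries cannot create duplicates
        props.put("acks", "all");                   // wait for all in-sync replicas
        props.put("delivery.timeout.ms", "120000"); // total budget for send + retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-42", "created"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Delivery failed within delivery.timeout.ms: surface it
                            // instead of silently losing the record.
                            exception.printStackTrace();
                        }
                    });
            producer.flush();
        }
    }
}
```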
See you at Current 2024 @ Austin
Thanks to all our users, partners, and customers who were present at this Kafka Summit. It's always a joy to connect IRL and not only via Google Meet.
Next time, see you in Austin for Current 2024! 🤠