Real-time Data with Kafka: Build Tooling That Scales

Avoid these common real-time data implementation mistakes and build a tooling strategy to scale your Kafka deployment efficiently and securely.

James White

Jul 18, 2024

Real-time data is fundamental to modern digital products, and it comes from an ever-growing number of sources producing an ever-increasing volume of information.

This data is vital to the operation of businesses, as they need to act on information as it happens, but also gather insights for long-term decision making.

Against this backdrop, 36% of IT decision makers are concerned that their infrastructure won't be able to meet future data requirements. In the rush to keep pace with the demand for big data and real-time data infrastructure, engineers may sometimes overlook security and governance concerns.

These oversights can prove costly, especially as you scale. Read on to discover critical points your team must consider when building your real-time data tooling strategy and how to avoid unpleasant mistakes at later stages.

Real-time data in a nutshell: go fast or go obsolete!

With vast quantities of data in motion, teams must handle streaming from thousands or millions of endpoints to applications and environments. Companies need to process their data as it emerges to get the required business value.

This imperative is evident in many industries. In programmatic advertising, real-time bidding ensures the right ads are served to the right audience. In transport logistics, real-time data, combined with AI-driven insights, can help manage and optimize fleet tracking, route efficiency, and connections for hundreds of thousands of passengers daily.

Organizations are increasingly tapping into real-time insights but often struggle with building the tooling to use these insights to their full potential at scale.

Consider a fictional retailer whose online store frustrates customers by shipping products that are similar to, but don't match, what they actually ordered. This happens because the stock inventory doesn't update the website in real time.

While data sits in the pipeline, it doesn't provide value, as nobody can use it. Introducing real-time data streaming can drastically change the situation.

Yet what's easy on paper is usually more complex in production, especially when you need to scale. With proper preparation and tooling, though, you can avoid costly mistakes.

The challenges of building real-time data infrastructure with Kafka

Kafka is one of the most popular real-time data platforms because it is flexible. Yet, while it may be tempting to jump in and create ad-hoc real-time data tooling, you should consider these common issues developers face when planning, building, and using their data infrastructure.

Not understanding data streaming concepts may impair your productivity

Kafka has a steep learning curve, especially in the early stages of adoption. Your team must quickly become familiar with the concepts of topics, partitions, offsets, schemas, producers, consumers, streams, and more.

Not understanding the concepts behind streaming and real-time data will hurt your data reliability and visibility. Without a clear overview of your real-time data streaming processes, fixing issues and overcoming poison pills in Kafka (records a consumer repeatedly fails to process, blocking its partition) can be tricky, to say the least.
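To make that concrete, here is a minimal consumer sketch using the confluent-kafka Python client. The broker address and the topic names (orders, orders.dlq) are assumptions for illustration; the point is that a record which fails deserialization gets parked on a dead-letter topic instead of blocking the partition forever.

```python
import json

from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"  # assumed local broker; adjust for your cluster

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "orders-service",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
dlq = Producer({"bootstrap.servers": BOOTSTRAP})
consumer.subscribe(["orders"])  # hypothetical topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        try:
            order = json.loads(msg.value())
            # ...normal processing of the well-formed record goes here...
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Poison pill: the record cannot be deserialized. Rather than
            # crash and re-read the same offset forever, park it on a
            # dead-letter topic for later inspection and keep consuming.
            dlq.produce("orders.dlq", msg.value(), headers=msg.headers())
            dlq.flush()
        consumer.commit(msg)  # commit only once the record has been handled
finally:
    consumer.close()
```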

Implementing real-time data tooling is rarely linear

While implementing real-time data tooling usually starts with a specific use case, its development rarely follows a clear plan from day one.

Initially, teams focus on proving the value rather than laying out long-term strategies for future challenges like new microservices, domains, applications, and resources.

What starts as a specific business case often grows uncontrollably once it proves its value. New domains get added, more data intersections demand more processing, and this growth usually happens haphazardly, without a roadmap.

Data streaming involves multiple actors at different stages

Real-time data streaming involves different roles in your organization. The range of people dealing with Kafka varies greatly, from software developers, data engineers, and DevOps to business analysts, architects, and support.

All these groups have differing goals and needs, from building applications and pipelines to understanding the data and standardizing processes.

Adding new domains and services increases complexity

As your real-time data streaming ecosystem grows alongside a business that keeps iterating to stay ahead of competitors, it can quickly become messy.

Each new domain and data intersection adds more processing, which requires knowing more advanced real-time data streaming concepts and applying them to different use cases. This can quickly become too complicated for a central team to manage.

Ensuring ongoing compliance for evolving infrastructure can be tricky

With massive data leaks worldwide, businesses can no longer treat Kafka security and compliance as an afterthought.

While great at securing network communications with TLS or mTLS, Apache Kafka falls short when it comes to encrypting data at the field level. This can expose personally identifiable information (PII) or other data covered by regulations like GDPR or CCPA, especially as new producers and consumers are introduced and data-sharing requirements grow.
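Kafka won't do this for you, so the protection has to live in your own producers or in a proxy layer in front of the cluster. The sketch below shows one way to approach it, assuming the confluent-kafka client, the cryptography package, and a hypothetical orders event: selected PII fields are encrypted before the record ever reaches the broker, so consumers without the key only see ciphertext.

```python
import json

from confluent_kafka import Producer
from cryptography.fernet import Fernet

# Assumption: in production the key comes from a secrets manager;
# generating one inline is for illustration only.
fernet = Fernet(Fernet.generate_key())

PII_FIELDS = {"email", "phone"}  # fields covered by GDPR/CCPA in this example

def encrypt_pii(event: dict) -> dict:
    """Encrypt selected fields so the rest of the payload stays readable
    while the sensitive values are ciphertext on the broker."""
    protected = dict(event)
    for field in PII_FIELDS & protected.keys():
        protected[field] = fernet.encrypt(str(protected[field]).encode()).decode()
    return protected

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker
event = {"order_id": 42, "email": "jane@example.com", "phone": "+44 20 7946 0000"}
producer.produce("orders", json.dumps(encrypt_pii(event)).encode())
producer.flush()
```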

Not every company considers this when designing real-time data pipelines, which leads to technical debt and last-minute patches when things go wrong. At that point, an oversight can mean much higher costs and potential regulatory exposure.

Human error led to the exposure of sensitive data at a marketing analytics firm earlier this year, when a vulnerable Kafka broker inadvertently leaked the private information of over a million users.

Factors to consider when building scalable data tooling

There are several ways to introduce new real-time data tooling in your organization. The amount of planning you do, and which of the following factors you prioritize, will directly affect how easy it will be to scale later.

When assessing your options, you need to consider these seven areas:

  1. Observability: the ability to monitor and understand the internal states of a system by analyzing its outputs to ensure optimal functioning and fast issue resolution.

  2. Organizational knowledge: how well your organization knows the selected tool and can optimize its use for your specific needs.

  3. Velocity: the speed at which a team can deliver new features or updates, often measured by the amount of work completed in a given timeframe.

  4. Cost: the total cost of acquiring, implementing, and operating a tool, including both direct expenses like product licensing and indirect costs like time and resources.

  5. Maintenance: the ongoing overhead of updating, fixing, and optimizing tools to ensure their reliability, security, and efficiency.

  6. Security: measures and protocols for protecting tools and the data they use from unauthorized access and cyber threats.

  7. Governance: the policies, procedures, and controls that ensure tools are properly managed and remain compliant with your organization's data policies.

Some of these priorities pull in opposite directions, most often cost against security. That's why it's important to weigh them as you consider the specific approaches you'll use to build your data infrastructure.

Tooling options: four approaches to solving real-time data challenges

  1. Kafka's command-line interface (CLI)

Navigating Kafka's command-line interface can be daunting. That's why there are so many Kafka CLI cheat sheets and guides.

But even if you master Kafka's commands, a CLI isn't suitable for large projects. While it's enough in the early days of your tooling strategy, you'll need something else if you want to go further.

  2. Custom tools for managing real-time data

Build-your-own tool projects usually start out nicely as teams create a custom solution from scratch and mold it around their needs.

However, building and maintaining custom software has many downsides, especially when your core business lies outside IT.

The only scenario in which building a custom tool makes sense is when no existing solution serves your specific needs. This situation is improbable in 2024, but should that happen, ensure you know the potential problems and dangers, especially when it comes to maintenance, cost, and building integrations.

Custom tools should be the glue between your internal repositories, services, and existing tooling, and not much more than that.

  3. GitOps

By using Git repositories as a single source of truth to deliver infrastructure as code (IaC), GitOps can solve many Kafka governance and scaling issues.

GitOps helps standardize and systematize requests, making them testable and auditable, and lets you validate them against business policies. It also moves ownership from the central platform team to the requestor.
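As a concrete illustration, the snippet below sketches a CI step (the file name topics.yaml, the topic spec format, and the broker address are assumptions) that applies topic definitions committed to the Git repository through Kafka's admin API, so the only way to change a topic is a reviewed pull request.

```python
# apply_topics.py - run by CI after a pull request touching topics.yaml is merged.
#
# topics.yaml (committed to the repo, the single source of truth) might look like:
#   topics:
#     - name: orders
#       partitions: 12
#       replication: 3
#       config:
#         retention.ms: "604800000"
import yaml  # PyYAML

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker

with open("topics.yaml") as f:
    spec = yaml.safe_load(f)

requests = [
    NewTopic(
        t["name"],
        num_partitions=t["partitions"],
        replication_factor=t["replication"],
        config=t.get("config", {}),
    )
    for t in spec["topics"]
]

# create_topics() returns one future per topic; resolving each future surfaces
# failures (such as a topic that already exists) without aborting the whole run.
for topic, future in admin.create_topics(requests).items():
    try:
        future.result()
        print(f"created {topic}")
    except Exception as exc:
        print(f"skipped {topic}: {exc}")
```

A real setup would also diff the spec against existing topics and validate it against business policies before applying anything.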

However, GitOps alone doesn't solve all real-time data tooling challenges, especially as the scope of operations grows.

  4. Software development and CI/CD tools

Third-party dev tools offer solutions to many of the problems teams face when building real-time data pipelines.

Be it Docker for containerization, Jenkins for CI/CD, or Lens for K8s, dev tools can support all actors involved in building your data pipelines. In the context of Kafka, third-party tools help automate and streamline deployment and testing while increasing visibility and reliability in your real-time data infrastructure.
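For example, a pipeline stage can run a quick automated round-trip test against a disposable broker. The localhost address below is an assumption: in CI the broker would typically be started by Docker Compose or a similar tool, with automatic topic creation enabled. Tests like this catch configuration and serialization regressions before they reach production.

```python
# test_order_pipeline.py - a smoke test CI can run against a throwaway broker.
import json
import uuid

from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"  # assumption: test broker provisioned by the pipeline

def test_order_event_round_trip():
    topic = f"test-orders-{uuid.uuid4()}"  # unique topic per test run
    payload = {"order_id": 1, "sku": "ABC-123"}

    producer = Producer({"bootstrap.servers": BOOTSTRAP})
    producer.produce(topic, json.dumps(payload).encode())
    producer.flush()

    consumer = Consumer({
        "bootstrap.servers": BOOTSTRAP,
        "group.id": f"test-{uuid.uuid4()}",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([topic])
    msg = consumer.poll(10.0)  # allow time for the consumer group to join
    consumer.close()

    # The event we produced should come back intact.
    assert msg is not None and not msg.error()
    assert json.loads(msg.value()) == payload
```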

As a result, your team can focus on your core business domain while the tools take care of maintenance, updates, documentation, and key integrations. Introducing tools early on, especially those that automate governance and security-related tasks, helps you avoid costly issues in the future.

The winning formula for secure, scalable real-time data: GitOps plus dev tools

Once you're past a proof of concept and enter early production, a combination of GitOps and selected third-party tools can empower your team and streamline their work. By providing a single source of truth, GitOps helps standardize processes across the entire organization while tools solve the diverse issues different actors face.

Although this approach requires upfront investment in tools, it will save you a lot of trouble in the long term, especially once you pick up the pace. When you scale, you can be sure that security issues will haunt you unless you build in relevant mechanisms from day one.

Solutions like Conduktor let you add the required logic on top of your existing Kafka deployment. Moreover, our proxy provides capabilities that are unavailable natively, including end-to-end encryption, extensive RBAC (role-based access control), and data masking controls.

Check the demos to discover how Conduktor Gateway can help your team.

You're now a step closer to real-time data tooling success

While implementing real-time data tooling such as Kafka seems easy in theory, in practice it comes with a steep learning curve and multiple challenges.

Out of the many approaches to building real-time data infrastructure, the winning formula of GitOps plus third-party dev tools provides the ideal mix of visibility, flexibility, reliability, and security.

Such a combination will provide you with the required standardization, automation, governance mechanisms, maintenance, and support so that scaling is frictionless.

Starting small with a proof of concept and scaling up with careful planning can help you avoid costly mistakes and ensure long-term success.

Try Conduktor and see how it can further optimize your Kafka security and governance at scale.
