ZooKeeper to KRaft Migration

Apache Kafka's shift from ZooKeeper to KRaft (Kafka Raft) represents one of the most significant architectural changes in the platform's history. This migration simplifies Kafka's operational model, reduces infrastructure complexity, and improves cluster performance. Understanding how to migrate from ZooKeeper to KRaft is essential for teams managing production Kafka environments.

Understanding Apache Kafka's Metadata Management

Historically, Apache Kafka relied on Apache ZooKeeper as an external system for storing critical cluster metadata. ZooKeeper handled controller election, topic configurations, access control lists (ACLs), partition leadership information, and broker registration.

This dependency meant that every Kafka cluster required a separate ZooKeeper ensemble, typically consisting of 3-5 nodes for high availability. Operators needed expertise in both systems, and any ZooKeeper issues could directly impact Kafka availability.

The metadata stored in ZooKeeper included:

  • Broker registrations and configurations

  • Topic and partition metadata

  • Replica assignments and ISR (In-Sync Replicas) lists

  • ACLs and quota configurations

  • Controller epoch and leadership information

While ZooKeeper served Kafka well for over a decade, this dual-system architecture created operational overhead and introduced latency in metadata propagation across large clusters.

What is KRaft (Kafka Raft)?

KRaft is Kafka's native consensus protocol based on the Raft algorithm. Introduced through KIP-500 and declared production-ready in Kafka 3.3.1, KRaft eliminates the need for ZooKeeper by managing metadata directly within Kafka itself.

In KRaft mode, dedicated controller nodes form a Raft quorum that stores metadata in an internal Kafka topic called __cluster_metadata. This topic is replicated across controller nodes using the Raft consensus algorithm, ensuring consistency and fault tolerance.

Key architectural changes include:

Unified Architecture: Metadata management happens within Kafka brokers or dedicated controller nodes, removing the external dependency.

Metadata as Events: Cluster metadata is stored as a log of events in the __cluster_metadata topic, making it queryable and recoverable like any other Kafka topic.

Faster Propagation: Brokers consume metadata changes as incremental events from the controller quorum's log, instead of the controller pushing full-state updates to each broker over RPCs as in the ZooKeeper-based design, significantly reducing metadata propagation time in large clusters.

Simplified Operations: One system to deploy, monitor, and maintain instead of two separate distributed systems.
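
As a quick illustration of this design, the controller quorum and metadata log replication can be inspected with the kafka-metadata-quorum.sh tool bundled with Kafka. A minimal sketch, assuming a broker reachable at localhost:9092:

# Show the active controller, quorum voters, and metadata log end offsets
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# Show per-voter and per-observer replication lag for the metadata log
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication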

Why Migrate from ZooKeeper to KRaft?

The migration to KRaft offers substantial benefits that impact both operations and performance.

Operational Simplification: Eliminating ZooKeeper can reduce the infrastructure footprint by roughly 30-40%. Teams no longer need to maintain separate ZooKeeper clusters, monitor two different systems, or manage cross-system version compatibility.

Improved Scalability: ZooKeeper's watch mechanism created scalability bottlenecks in clusters with hundreds of thousands of partitions. KRaft scales more efficiently, with production deployments successfully running millions of partitions.

Faster Metadata Operations: Metadata changes propagate in milliseconds rather than seconds. Controller failover typically completes in under a second with KRaft, compared to several seconds with ZooKeeper.

Enhanced Recovery: Since metadata is stored in a Kafka topic, standard Kafka replication and recovery mechanisms apply. Metadata snapshots and log compaction make recovery faster and more predictable.

Future-Proofing: ZooKeeper support was deprecated in Kafka 3.5 and will be removed in Kafka 4.0. Migrating to KRaft ensures continued access to new features and security updates.

A real-world example: A financial services company managing a 50-node Kafka cluster reduced their infrastructure by 15 nodes after migrating to KRaft, as they no longer needed the separate ZooKeeper ensemble. They also observed controller failover times drop from 5-7 seconds to under 1 second.

Migration Approaches and Strategies

There are two primary approaches to migrating from ZooKeeper to KRaft:

1. Direct Migration (Online Migration)

This approach involves migrating an existing ZooKeeper-based cluster to KRaft mode with minimal downtime. It requires Kafka 3.4 or later and involves a phased process where ZooKeeper and KRaft controllers coexist temporarily.

Advantages:

  • Preserves existing cluster data and configurations

  • No need to recreate topics or migrate consumer offsets

  • Suitable for clusters where rebuilding is impractical

Considerations:

  • Requires careful planning and coordination

  • More complex than clean installation

  • Still maturing (consider testing thoroughly in non-production first)

2. New Cluster Setup (Offline Migration)

This involves creating a new KRaft-based cluster and migrating data from the old ZooKeeper-based cluster. Tools like MirrorMaker 2 facilitate data replication between clusters (a sample MirrorMaker 2 configuration appears at the end of this section).

Advantages:

  • Clean slate with KRaft from the start

  • Lower risk as the original cluster remains unchanged during migration

  • Easier rollback if issues occur

Considerations:

  • Requires sufficient infrastructure to run both clusters temporarily

  • Producers and consumers must be redirected to the new cluster

  • Consumer offset migration needed

For most production environments, the new cluster approach offers lower risk and clearer rollback options, especially for mission-critical deployments.
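
For the new cluster approach, the data replication mentioned above is typically driven by MirrorMaker 2. A minimal sketch of an mm2.properties file; the cluster aliases (old, new) and bootstrap addresses are placeholders:

# mm2.properties - replicate topics and consumer offsets into the new KRaft cluster
clusters = old, new
old.bootstrap.servers = zk-broker1:9092,zk-broker2:9092
new.bootstrap.servers = kraft-broker1:9092,kraft-broker2:9092

# Replicate all topics and consumer groups from the old cluster to the new one
old->new.enabled = true
old->new.topics = .*
old->new.groups = .*

# Translate and periodically sync consumer group offsets for a clean client cutover
old->new.emit.checkpoints.enabled = true
old->new.sync.group.offsets.enabled = true

replication.factor = 3

The flow is started with bin/connect-mirror-maker.sh mm2.properties. Note that MirrorMaker 2 prefixes replicated topic names with the source alias by default; if the new cluster should keep the original names, replication.policy.class can be pointed at the identity replication policy available in recent Kafka releases.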

Step-by-Step Migration Process

Here's a technical walkthrough of the direct migration process:

Phase 1: Preparation

  1. Upgrade to Kafka 3.4+ (3.6+ recommended for stability)

  2. Verify all brokers are on the same version

  3. Back up ZooKeeper data (for example, by copying the ZooKeeper snapshot and transaction log directories, or exporting the relevant znodes with zkCli.sh)

  4. Document current configurations and ACLs

  5. Test the migration process in a non-production environment
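
Most of these preparation steps can be scripted with standard tooling. A rough sketch, assuming a broker reachable at localhost:9092 and keeping the output for post-migration comparison:

# Capture current topic, ACL, dynamic configuration, and consumer group state
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > pre-migration-topics.txt
bin/kafka-acls.sh --bootstrap-server localhost:9092 --list > pre-migration-acls.txt
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --describe > pre-migration-topic-configs.txt
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --all-groups --describe > pre-migration-offsets.txt

# Back up the ZooKeeper data directory on each ensemble node (dataDir from zoo.cfg;
# the path below is a placeholder)
tar czf zookeeper-backup-$(hostname).tgz /var/lib/zookeeper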

Phase 2: Enable KRaft Controllers

  1. Deploy dedicated controller nodes or configure combined broker/controller nodes

  2. Retrieve the existing cluster ID (for an in-place migration, the controllers must be formatted with the cluster ID already recorded in ZooKeeper and in each broker's meta.properties; kafka-storage.sh random-uuid is only used when bootstrapping a brand-new KRaft cluster)

  3. Configure controllers with the new process.roles=controller setting

  4. Format controller log directories: kafka-storage.sh format -t <uuid> -c controller.properties

  5. Start controllers and verify Raft quorum formation
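
A sketch of the Phase 2 bootstrap commands, assuming ZooKeeper is reachable at zk1:2181 and a controller.properties file like the example later in this article; the --bootstrap-controller option of kafka-metadata-quorum.sh needs Kafka 3.7 or later, so on 3.6 the quorum state is easier to confirm from the controller logs:

# Retrieve the existing cluster ID (it is also recorded in each broker's meta.properties)
bin/zookeeper-shell.sh zk1:2181 get /cluster/id

# Format the metadata log directory on each controller node with that ID, then start it
bin/kafka-storage.sh format -t <existing-cluster-id> -c config/controller.properties
bin/kafka-server-start.sh -daemon config/controller.properties

# Verify that the quorum has elected a leader (Kafka 3.7+)
bin/kafka-metadata-quorum.sh --bootstrap-controller controller1:9093 describe --status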

Phase 3: Migration Execution

  1. Configure brokers for migration mode: zookeeper.metadata.migration.enable=true

  2. Point brokers to both ZooKeeper and KRaft controllers

  3. Restart brokers one at a time, verifying metadata synchronization

  4. Monitor the migration progress through controller logs and metrics
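
On each broker, the restart-and-verify loop in steps 2-3 looks roughly like the following sketch; host names, file paths, and the exact log messages are placeholders that vary with the deployment and Kafka version:

# On one broker at a time: apply the migration settings, restart, and confirm the
# broker re-registers before moving on to the next one
bin/kafka-server-stop.sh
bin/kafka-server-start.sh -daemon config/server.properties

# On the active controller host, follow the log for migration progress
# (file name and message wording depend on the log4j setup and Kafka version)
tail -f logs/controller.log | grep -i migration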

Phase 4: Finalization

  1. Verify all metadata has migrated successfully

  2. Switch active controller from ZooKeeper to KRaft

  3. Remove ZooKeeper configuration from broker properties

  4. Perform a rolling restart of all brokers in KRaft-only mode (process.roles=broker, with all ZooKeeper settings removed)

  5. Decommission ZooKeeper ensemble

A representative configuration for migration mode (note that brokers remain ZooKeeper-mode brokers until the migration is finalized, so they keep broker.id and zookeeper.connect rather than process.roles):

# KRaft controller configuration (migration mode)
process.roles=controller
node.id=1
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://controller1:9093
metadata.log.dir=/var/lib/kafka/metadata
# Controllers in migration mode also need the migration flag and a ZooKeeper
# connection so they can copy the existing metadata out of ZooKeeper
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181

# Broker configuration during migration (process.roles is NOT set yet;
# broker IDs must not collide with controller node IDs)
broker.id=101
inter.broker.protocol.version=3.6
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
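
Once the migration is finalized (Phase 4), each broker is reconfigured as a pure KRaft broker and restarted. A sketch of the resulting broker properties, reusing the node IDs and listener names from the example above (the former broker.id becomes the node.id):

# Broker configuration after finalization (KRaft-only mode)
process.roles=broker
node.id=101
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://broker1:9092
advertised.listeners=PLAINTEXT://broker1:9092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
# zookeeper.connect and zookeeper.metadata.migration.enable are removed entirely

The controllers receive a similar final rolling restart with the migration flag and zookeeper.connect removed, after which the ZooKeeper ensemble can be decommissioned.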

KRaft in Data Streaming Ecosystems

The migration to KRaft has significant implications for data streaming platforms and real-time processing architectures.

Faster Stream Processing Startup: Stream processing applications using Kafka Streams or Flink depend on topic metadata to assign partitions and start processing. KRaft's faster metadata propagation reduces application startup time, especially in auto-scaling scenarios where new instances spin up frequently.

Improved Multi-Cluster Management: Organizations running multiple Kafka clusters for different environments or regions benefit from simplified operations. Fewer components mean easier automation, faster provisioning, and lower maintenance overhead.

Enhanced Observability: With metadata stored as a Kafka topic, monitoring tools can subscribe to metadata changes just like any other stream. This enables real-time tracking of configuration changes, topic creation, and partition reassignments.

Platform Integration: Data governance platforms can leverage KRaft's improved metadata APIs to provide better visibility into cluster topology, track migration progress, and validate metadata consistency. This is particularly valuable during migrations when verifying that all configurations, ACLs, and quotas have transferred correctly.

Cloud-Native Deployments: KRaft's simpler architecture aligns better with containerized and Kubernetes-based deployments. Fewer stateful components make it easier to implement infrastructure-as-code patterns and automated cluster provisioning.
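
As a concrete example of the observability point above, metadata records can also be decoded offline with Kafka's log dump tool. A sketch, assuming the metadata.log.dir used earlier in this article:

# Decode cluster metadata events (topic creation, configuration changes, ISR updates)
# from a metadata log segment on a controller node
bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/metadata/__cluster_metadata-0/00000000000000000000.log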

Post-Migration Monitoring and Validation

After completing the migration, thorough validation ensures cluster health and correct operation.

Metadata Verification:

  • Compare topic configurations, partition counts, and replication factors

  • Verify ACLs and quota configurations match pre-migration state

  • Check consumer group offsets are preserved

  • Validate broker configurations and dynamic settings
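
If the pre-migration state was captured as sketched in Phase 1, most of these checks reduce to re-running the same commands against the migrated cluster and diffing the output (file names follow the earlier example):

# Re-capture state from the migrated cluster and compare with the pre-migration snapshots
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > post-migration-topics.txt
bin/kafka-acls.sh --bootstrap-server localhost:9092 --list > post-migration-acls.txt
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --all-groups --describe > post-migration-offsets.txt

diff pre-migration-topics.txt post-migration-topics.txt
diff pre-migration-acls.txt post-migration-acls.txt
# Offsets advance while clients keep running, so compare group membership and lag
# rather than expecting byte-identical output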

Performance Monitoring:

  • Observe controller failover behavior under simulated failures

  • Measure metadata operation latency (topic creation, partition reassignment)

  • Monitor broker and controller resource utilization

  • Track client request latency for any regressions

Key Metrics to Watch:

  • kafka.controller:type=KafkaController,name=ActiveControllerCount (should be 1)

  • kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs

  • kafka.server:type=KRaftMetadataCache,name=MetadataLoadLatency

  • Broker log for any metadata-related errors or warnings
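
These MBeans can be sampled with the JmxTool bundled with Kafka, assuming JMX is exposed on the controller (for example via JMX_PORT=9999); the tool's class name and option handling vary slightly across releases (kafka.tools.JmxTool in older versions, org.apache.kafka.tools.JmxTool in newer ones):

# Confirm exactly one active controller and sample leader election timing
bin/kafka-run-class.sh org.apache.kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://controller1:9999/jmxrmi \
  --object-name kafka.controller:type=KafkaController,name=ActiveControllerCount \
  --one-time true

bin/kafka-run-class.sh org.apache.kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://controller1:9999/jmxrmi \
  --object-name kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs \
  --one-time true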

Operational Checklist:

  1. Perform a controlled controller failover and verify new leader election

  2. Create test topics and verify metadata propagation speed

  3. Execute partition reassignments to test metadata update paths

  4. Update monitoring dashboards to track KRaft-specific metrics

  5. Document the new operational procedures for the KRaft cluster
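
Items 2 and 3 of this checklist can be exercised directly from the command line. A sketch; the topic name, partition counts, and reassignment.json file are illustrative:

# Item 2: create a throwaway topic and check how quickly it becomes describable
time bin/kafka-topics.sh --bootstrap-server broker1:9092 --create \
  --topic kraft-migration-smoke-test --partitions 12 --replication-factor 3
bin/kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic kraft-migration-smoke-test

# Item 3: exercise the metadata update path with a small partition reassignment
bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
  --reassignment-json-file reassignment.json --execute
bin/kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
  --reassignment-json-file reassignment.json --verify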

Governance platforms can streamline this validation process by providing visual confirmation of cluster state, metadata consistency across brokers, and historical tracking of configuration changes before and after migration.

Summary

Migrating from ZooKeeper to KRaft represents a significant architectural evolution for Apache Kafka, delivering operational simplification, improved performance, and better scalability. While the migration requires careful planning and execution, the long-term benefits of reduced infrastructure complexity, faster metadata operations, and improved reliability make it essential for teams managing Kafka at scale.

The migration process, whether through direct migration or new cluster setup, demands thorough testing and validation. Understanding both approaches allows teams to choose the strategy that best fits their operational constraints and risk tolerance.

As ZooKeeper support approaches end-of-life in Kafka 4.0, migrating to KRaft is not just an optimization—it's a necessary step to ensure continued access to new Kafka features, security updates, and community support. Organizations that plan and execute this migration thoughtfully will benefit from a more streamlined, performant, and maintainable data streaming platform.

Sources and References

  1. Apache Kafka Improvement Proposal KIP-500: "Replace ZooKeeper with a Self-Managed Metadata Quorum" - The original proposal outlining KRaft's architecture and implementation plan. https://cwiki.apache.org/confluence/display/KAFKA/KIP-500

  2. Apache Kafka Documentation: "KRaft Mode" - Official documentation covering KRaft configuration, migration procedures, and operational guidelines. https://kafka.apache.org/documentation/#kraft

  3. Confluent Documentation: "Migrate to KRaft" - Comprehensive migration guide with best practices and troubleshooting tips. https://docs.confluent.io/platform/current/installation/migrate-zk-kraft.html

  4. Apache Kafka 3.3.1 Release Notes: Documentation of KRaft's production-ready declaration and feature completeness. https://archive.apache.org/dist/kafka/3.3.1/RELEASE_NOTES.html

  5. Colin McCabe (Apache Kafka Committer): "The Apache Kafka Control Plane" - Technical deep-dive into KRaft's architecture and performance characteristics presented at Kafka Summit conferences and available through Confluent's technical blog series.