ZooKeeper to KRaft Migration
Learn how to migrate Apache Kafka from ZooKeeper to KRaft mode, understand the architectural benefits, and follow best practices for a successful transition to Kafka's new consensus protocol.
Apache Kafka's shift from ZooKeeper to KRaft (Kafka Raft) represents one of the most significant architectural changes in the platform's history. This migration simplifies Kafka's operational model, reduces infrastructure complexity, and improves cluster performance. Understanding how to migrate from ZooKeeper to KRaft is essential for teams managing production Kafka environments.
Understanding Apache Kafka's Metadata Management
Historically, Apache Kafka relied on Apache ZooKeeper as an external system for storing critical cluster metadata. ZooKeeper handled controller election, topic configurations, access control lists (ACLs), partition leadership information, and broker registration.
This dependency meant that every Kafka cluster required a separate ZooKeeper ensemble, typically consisting of 3-5 nodes for high availability. Operators needed expertise in both systems, and any ZooKeeper issues could directly impact Kafka availability.
The metadata stored in ZooKeeper included:
Broker registrations and configurations
Topic and partition metadata
Replica assignments and ISR (In-Sync Replicas) lists
ACLs and quota configurations
Controller epoch and leadership information
While ZooKeeper served Kafka well for over a decade, this dual-system architecture created operational overhead and introduced latency in metadata propagation across large clusters.
What is KRaft (Kafka Raft)?
KRaft is Kafka's native consensus protocol based on the Raft algorithm. Introduced through KIP-500 and declared production-ready in Kafka 3.3.1, KRaft eliminates the need for ZooKeeper by managing metadata directly within Kafka itself.
In KRaft mode, dedicated controller nodes form a Raft quorum that stores metadata in an internal Kafka topic called __cluster_metadata. This topic is replicated across controller nodes using the Raft consensus algorithm, ensuring consistency and fault tolerance.
Key architectural changes include:
Unified Architecture: Metadata management happens within Kafka brokers or dedicated controller nodes, removing the external dependency.
Metadata as Events: Cluster metadata is stored as a log of events in the __cluster_metadata topic, making it queryable and recoverable like any other Kafka topic.
Faster Propagation: Controllers push metadata changes to brokers, rather than brokers polling ZooKeeper, significantly reducing metadata propagation time in large clusters.
Simplified Operations: One system to deploy, monitor, and maintain instead of two separate distributed systems.
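Once a KRaft quorum is running, its state can be inspected with the kafka-metadata-quorum.sh tool that ships with Kafka 3.3 and later. A sketch of its use follows; the bootstrap address is a placeholder for a real broker or controller endpoint:

```shell
# Show quorum status: leader ID, voter set, and high watermark
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# Show per-replica detail: log end offsets and last fetch times for each voter
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication
```

A healthy quorum shows one leader and all voters fetching with low lag.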
Why Migrate from ZooKeeper to KRaft?
The migration to KRaft offers substantial benefits that impact both operations and performance.
Operational Simplification: Eliminating ZooKeeper reduces the infrastructure footprint by 30-40%. Teams no longer need to maintain separate ZooKeeper clusters, monitor two different systems, or manage cross-system version compatibility.
Improved Scalability: ZooKeeper's watch mechanism created scalability bottlenecks in clusters with hundreds of thousands of partitions. KRaft scales more efficiently, with production deployments successfully running millions of partitions.
Faster Metadata Operations: Metadata changes propagate in milliseconds rather than seconds. Controller failover typically completes in under a second with KRaft, compared to several seconds with ZooKeeper.
Enhanced Recovery: Since metadata is stored in a Kafka topic, standard Kafka replication and recovery mechanisms apply. Metadata snapshots and log compaction make recovery faster and more predictable.
Future-Proofing: ZooKeeper support was deprecated in Kafka 3.5 and will be removed in Kafka 4.0. Migrating to KRaft ensures continued access to new features and security updates.
A real-world example: A financial services company managing a 50-node Kafka cluster reduced their infrastructure by 15 nodes after migrating to KRaft, as they no longer needed the separate ZooKeeper ensemble. They also observed controller failover times drop from 5-7 seconds to under 1 second.
Migration Approaches and Strategies
There are two primary approaches to migrating from ZooKeeper to KRaft:
1. Direct Migration (Online Migration)
This approach involves migrating an existing ZooKeeper-based cluster to KRaft mode with minimal downtime. It requires Kafka 3.4 or later and involves a phased process where ZooKeeper and KRaft controllers coexist temporarily.
Advantages:
Preserves existing cluster data and configurations
No need to recreate topics or migrate consumer offsets
Suitable for clusters where rebuilding is impractical
Considerations:
Requires careful planning and coordination
More complex than a clean installation
Still maturing (consider testing thoroughly in non-production first)
2. New Cluster Setup (Offline Migration)
This involves creating a new KRaft-based cluster and migrating data from the old ZooKeeper-based cluster. Tools like MirrorMaker 2 facilitate data replication between clusters.
Advantages:
Clean slate with KRaft from the start
Lower risk as the original cluster remains unchanged during migration
Easier rollback if issues occur
Considerations:
Requires sufficient infrastructure to run both clusters temporarily
Producers and consumers must be redirected to the new cluster
Consumer offset migration needed
For most production environments, the new cluster approach offers lower risk and clearer rollback options, especially for mission-critical deployments.
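For the new-cluster approach, MirrorMaker 2 is typically driven by a properties file. A minimal sketch is shown below; the cluster aliases and bootstrap addresses are placeholders, and offset translation relies on MirrorMaker 2's checkpoint mechanism:

```properties
# mm2.properties - one-way replication from the old ZooKeeper-based
# cluster ("zk-cluster") to the new KRaft cluster ("kraft-cluster")
clusters = zk-cluster, kraft-cluster
zk-cluster.bootstrap.servers = old-broker1:9092,old-broker2:9092
kraft-cluster.bootstrap.servers = new-broker1:9092,new-broker2:9092

# Replicate all topics and consumer groups in one direction only
zk-cluster->kraft-cluster.enabled = true
zk-cluster->kraft-cluster.topics = .*
zk-cluster->kraft-cluster.groups = .*

# Translate committed consumer offsets so consumers can resume
# their positions on the new cluster
emit.checkpoints.enabled = true
sync.group.offsets.enabled = true
```

This configuration would be run with bin/connect-mirror-maker.sh mm2.properties; in practice the topic and group filters are usually narrowed rather than left at .*.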
Step-by-Step Migration Process
Here's a technical walkthrough of the direct migration process:
Phase 1: Preparation
Upgrade to Kafka 3.4+ (3.6+ recommended for stability)
Verify all brokers are on the same version
Back up ZooKeeper data using zkCli.sh
Document current configurations and ACLs
Test the migration process in a non-production environment
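The preparation steps above can be sketched as a short script. Hostnames and paths are placeholders, and in many deployments the ZooKeeper backup is taken by snapshotting its data directory in addition to any zkCli.sh export:

```shell
# Capture the pre-migration state for later comparison
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > pre-migration-topics.txt
bin/kafka-acls.sh --bootstrap-server localhost:9092 --list > pre-migration-acls.txt
bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092 > pre-migration-versions.txt

# Back up the ZooKeeper data directory (snapshots plus transaction logs)
tar czf zookeeper-backup.tar.gz /var/lib/zookeeper/data
```

These exports become the baseline for the post-migration metadata verification described later.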
Phase 2: Enable KRaft Controllers
Deploy dedicated controller nodes or configure combined broker/controller nodes
Generate a cluster UUID: kafka-storage.sh random-uuid
Configure controllers with the new process.roles=controller setting
Format controller log directories: kafka-storage.sh format -t <uuid> -c controller.properties
Start controllers and verify Raft quorum formation
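A minimal controller.properties sketch for a dedicated controller node might look like the following; node IDs, hostnames, ports, and the log directory are placeholders:

```properties
# This node acts only as a KRaft controller, not a broker
process.roles=controller
node.id=3000

# Raft quorum voters, as id@host:port for every controller
controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093

# Listener used for controller-to-controller and broker-to-controller traffic
listeners=CONTROLLER://controller1:9093
controller.listener.names=CONTROLLER

# Where the __cluster_metadata log is stored
log.dirs=/var/lib/kafka/kraft-controller-logs
```

Each controller node gets its own node.id and listeners value, while controller.quorum.voters must be identical across the quorum.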
Phase 3: Migration Execution
Configure brokers for migration mode: zookeeper.metadata.migration.enable=true
Point brokers to both ZooKeeper and KRaft controllers
Restart brokers one at a time, verifying metadata synchronization
Monitor the migration progress through controller logs and metrics
Phase 4: Finalization
Verify all metadata has migrated successfully
Switch active controller from ZooKeeper to KRaft
Remove ZooKeeper configuration from broker properties
Perform a rolling restart of all brokers in KRaft-only mode
Decommission ZooKeeper ensemble
A critical configuration example for migration mode:
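The broker properties sketch below is based on the migration settings introduced by KIP-866; the ZooKeeper connection string, quorum addresses, and listener names are placeholders for the cluster's actual values:

```properties
# Existing broker identity and ZooKeeper connection are kept during migration
broker.id=0
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka

# Enable the ZooKeeper-to-KRaft metadata migration
zookeeper.metadata.migration.enable=true

# Point the broker at the KRaft controller quorum as well
controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
```

During this phase the broker talks to both systems: ZooKeeper remains the source of truth until the migration is finalized, while the KRaft controllers replicate the metadata.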
KRaft in Data Streaming Ecosystems
The migration to KRaft has significant implications for data streaming platforms and real-time processing architectures.
Faster Stream Processing Startup: Stream processing applications using Kafka Streams or Flink depend on topic metadata to assign partitions and start processing. KRaft's faster metadata propagation reduces application startup time, especially in auto-scaling scenarios where new instances spin up frequently.
Improved Multi-Cluster Management: Organizations running multiple Kafka clusters for different environments or regions benefit from simplified operations. Fewer components mean easier automation, faster provisioning, and lower maintenance overhead.
Enhanced Observability: With metadata stored as a Kafka topic, monitoring tools can subscribe to metadata changes just like any other stream. This enables real-time tracking of configuration changes, topic creation, and partition reassignments.
Platform Integration: Data governance platforms can leverage KRaft's improved metadata APIs to provide better visibility into cluster topology, track migration progress, and validate metadata consistency. This is particularly valuable during migrations when verifying that all configurations, ACLs, and quotas have transferred correctly.
Cloud-Native Deployments: KRaft's simpler architecture aligns better with containerized and Kubernetes-based deployments. Fewer stateful components make it easier to implement infrastructure-as-code patterns and automated cluster provisioning.
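Because cluster metadata lives in an ordinary Kafka log, its records can also be decoded offline with kafka-dump-log.sh. The segment path below is a placeholder for an actual __cluster_metadata log file on a controller node:

```shell
# Decode metadata records (topic creation, partition changes, config updates)
# from a __cluster_metadata log segment
bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/kraft-controller-logs/__cluster_metadata-0/00000000000000000000.log
```

This is useful both for auditing configuration history and for debugging migration issues.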
Post-Migration Monitoring and Validation
After completing the migration, thorough validation ensures cluster health and correct operation.
Metadata Verification:
Compare topic configurations, partition counts, and replication factors
Verify ACLs and quota configurations match pre-migration state
Check consumer group offsets are preserved
Validate broker configurations and dynamic settings
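As an illustration of the comparison step, here is a small Python sketch that diffs exported pre- and post-migration topic descriptions. The dictionary layout is hypothetical; real exports would come from kafka-topics.sh output or the Admin API:

```python
def diff_topic_metadata(pre: dict, post: dict) -> list[str]:
    """Compare per-topic metadata dicts and return human-readable mismatches."""
    problems = []
    for topic, before in pre.items():
        after = post.get(topic)
        if after is None:
            problems.append(f"{topic}: missing after migration")
            continue
        for key, expected in before.items():
            actual = after.get(key)
            if actual != expected:
                problems.append(f"{topic}.{key}: {expected!r} -> {actual!r}")
    return problems

# Hypothetical exports captured before and after the migration
pre = {"orders": {"partitions": 12, "replication_factor": 3}}
post = {"orders": {"partitions": 12, "replication_factor": 2}}
print(diff_topic_metadata(pre, post))
# prints ['orders.replication_factor: 3 -> 2']
```

An empty result means every checked attribute survived the migration; anything else points at a specific topic and setting to investigate.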
Performance Monitoring:
Observe controller failover behavior under simulated failures
Measure metadata operation latency (topic creation, partition reassignment)
Monitor broker and controller resource utilization
Track client request latency for any regressions
Key Metrics to Watch:
kafka.controller:type=KafkaController,name=ActiveControllerCount (should be 1)
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
kafka.server:type=KRaftMetadataCache,name=MetadataLoadLatency
Broker logs for any metadata-related errors or warnings
Operational Checklist:
Perform a controlled controller failover and verify new leader election
Create test topics and verify metadata propagation speed
Execute partition reassignments to test metadata update paths
Update monitoring dashboards to track KRaft-specific metrics
Document the new operational procedures for the KRaft cluster
Governance platforms can streamline this validation process by providing visual confirmation of cluster state, metadata consistency across brokers, and historical tracking of configuration changes before and after migration.
Summary
Migrating from ZooKeeper to KRaft represents a significant architectural evolution for Apache Kafka, delivering operational simplification, improved performance, and better scalability. While the migration requires careful planning and execution, the long-term benefits of reduced infrastructure complexity, faster metadata operations, and improved reliability make it essential for teams managing Kafka at scale.
The migration process, whether through direct migration or new cluster setup, demands thorough testing and validation. Understanding both approaches allows teams to choose the strategy that best fits their operational constraints and risk tolerance.
As ZooKeeper support approaches end-of-life in Kafka 4.0, migrating to KRaft is not just an optimization—it's a necessary step to ensure continued access to new Kafka features, security updates, and community support. Organizations that plan and execute this migration thoughtfully will benefit from a more streamlined, performant, and maintainable data streaming platform.
Sources and References
Apache Kafka Improvement Proposal KIP-500: "Replace ZooKeeper with a Self-Managed Metadata Quorum" - The original proposal outlining KRaft's architecture and implementation plan. https://cwiki.apache.org/confluence/display/KAFKA/KIP-500
Apache Kafka Documentation: "KRaft Mode" - Official documentation covering KRaft configuration, migration procedures, and operational guidelines. https://kafka.apache.org/documentation/#kraft
Confluent Documentation: "Migrate to KRaft" - Comprehensive migration guide with best practices and troubleshooting tips. https://docs.confluent.io/platform/current/installation/migrate-zk-kraft.html
Apache Kafka 3.3.1 Release Notes: Documentation of KRaft's production-ready declaration and feature completeness. https://archive.apache.org/dist/kafka/3.3.1/RELEASE_NOTES.html
Colin McCabe (Apache Kafka Committer): "The Apache Kafka Control Plane" - Technical deep-dive into KRaft's architecture and performance characteristics presented at Kafka Summit conferences and available through Confluent's technical blog series.