Migrating to Apache Iceberg from Hive or Parquet
Learn how to successfully migrate your existing Hive tables and Parquet datasets to Apache Iceberg. This guide covers migration strategies, data conversion techniques, and best practices for transitioning to a modern lakehouse architecture with minimal downtime.
Apache Iceberg has emerged as the leading table format for modern data lakehouses, offering features like ACID transactions, time travel, schema evolution, and partition evolution that traditional Hive tables and raw Parquet files cannot provide. Migrating to Iceberg unlocks these capabilities while maintaining compatibility with your existing query engines and data infrastructure.
This guide explores proven migration strategies, practical conversion techniques, and critical considerations for data engineers and architects planning an Iceberg migration.
Table of Contents
Understanding the Migration Landscape
Migration Strategies
Migrating Hive Tables to Iceberg
Converting Parquet Datasets to Iceberg
Streaming Integration During Migration
Validation and Testing
Best Practices and Considerations
Understanding the Migration Landscape
Before initiating a migration, assess your current state and requirements:
From Hive Tables:
Existing Hive metastore integration
Partition structures and naming conventions
Table statistics and metadata
Query patterns and access frequencies
Downstream dependencies on Hive-specific features
From Raw Parquet:
File organization and directory structure
Partition schemes (if any)
Schema consistency across files
Metadata availability
Current read/write patterns
Iceberg's design accommodates both scenarios with different migration approaches: in-place migration for Hive tables and metadata-based adoption for Parquet datasets.
Migration Strategies
Strategy 1: In-Place Migration (Hive to Iceberg)
In-place migration converts existing Hive tables to Iceberg tables without moving or rewriting data files. This approach offers:
Minimal downtime: Metadata conversion happens quickly
No data movement: Original Parquet/ORC files remain in place
Rollback capability: Can revert to Hive if needed
Resource efficiency: No data copying or rewriting required
When to use: Production Hive tables with stable partition schemes, large datasets where data copying is prohibitive, or scenarios requiring minimal disruption.
Strategy 2: Snapshot and Migrate
Create an Iceberg table and copy data from the source, allowing for optimization during migration:
Data optimization: Rewrite files to optimal sizes
Partition evolution: Restructure partitioning scheme
Schema refinement: Clean up schema inconsistencies
Incremental migration: Migrate in batches over time
When to use: When data reorganization is beneficial, source tables have performance issues, or you want to optimize file layouts during migration.
Strategy 3: Dual-Write Transition
Temporarily write to both old and new formats during transition:
Zero downtime: Seamless cutover for readers
Extended validation: Verify Iceberg behavior with production workloads
Gradual migration: Migrate read traffic incrementally
When to use: Mission-critical tables where zero downtime is mandatory, or when extensive validation is required before full cutover.
Migrating Hive Tables to Iceberg
Using Spark SQL for In-Place Migration
Spark 3.x, together with the Iceberg Spark runtime and SQL extensions, supports in-place Hive table migration through a built-in stored procedure:
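A minimal sketch of the call, assuming an Iceberg-enabled Spark session whose session catalog is registered as spark_catalog and a source Hive table db.sample; all names below are placeholders:

```python
# Sketch: in-place migration of a Hive table to Iceberg via the system.migrate
# procedure. Requires the iceberg-spark-runtime package on the Spark classpath.
# Catalog, database, and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-iceberg-migrate")
    # Wrap the built-in session catalog so Hive and Iceberg tables coexist
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    # Enable Iceberg SQL extensions (CALL procedures, MERGE INTO, etc.)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .enableHiveSupport()
    .getOrCreate()
)

# Convert the Hive table in place; existing Parquet/ORC data files stay where they are
spark.sql("CALL spark_catalog.system.migrate('db.sample')")
```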
This command:
Reads existing Hive table metadata
Creates Iceberg metadata files
Updates the Hive metastore to point to Iceberg
Preserves all existing data files
Programmatic Migration with Spark
For more control over the migration process:
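One way to get that control is to drive the Iceberg procedures from code: create a temporary Iceberg snapshot of the Hive table first, validate it, then run the in-place migration with explicit table properties. A sketch reusing the spark session configured above; table names and property values are placeholders:

```python
# 1. Create a temporary Iceberg table that references the Hive table's files,
#    so queries can be validated before the source table is touched
spark.sql("""
    CALL spark_catalog.system.snapshot(
        source_table => 'db.sample',
        table        => 'db.sample_iceberg_validation'
    )
""")

# 2. ... run validation queries against db.sample_iceberg_validation here ...

# 3. Migrate in place, setting table properties as part of the conversion
spark.sql("""
    CALL spark_catalog.system.migrate(
        table      => 'db.sample',
        properties => map(
            'write.format.default',         'parquet',
            'write.target-file-size-bytes', '536870912'
        )
    )
""")

# 4. Drop the temporary validation table (the original data files are not deleted)
spark.sql("DROP TABLE spark_catalog.db.sample_iceberg_validation")
```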
Snapshot Migration Approach
For scenarios requiring data rewrite:
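A sketch using the DataFrame API to rewrite a Hive table into a new Iceberg table, repartitioning and sorting the data on the way in; the source table db.events_hive, target db.events, and column names are placeholders:

```python
from pyspark.sql.functions import col

src = spark.table("db.events_hive")

(
    src
    # Rewrite many small source files into larger, better-clustered ones
    .repartition("event_date")
    .sortWithinPartitions("event_ts")
    .writeTo("spark_catalog.db.events")
    .using("iceberg")
    .partitionedBy(col("event_date"))     # partitioning can be restructured here
    .tableProperty("write.format.default", "parquet")
    .createOrReplace()
)
```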
Converting Parquet Datasets to Iceberg
Adding Metadata to Existing Parquet Files
Iceberg can adopt existing Parquet files without rewriting them:
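A sketch using the add_files procedure, which writes Iceberg metadata pointing at the existing Parquet files instead of copying them; the schema, S3 path, and table names are placeholders:

```python
# 1. Create the Iceberg table with a schema and partitioning that match the files
spark.sql("""
    CREATE TABLE IF NOT EXISTS spark_catalog.db.clicks (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP,
        dt      DATE
    )
    USING iceberg
    PARTITIONED BY (dt)
""")

# 2. Adopt the existing Parquet files in place; only metadata is written
spark.sql("""
    CALL spark_catalog.system.add_files(
        table        => 'db.clicks',
        source_table => '`parquet`.`s3://bucket/warehouse/clicks/`'
    )
""")
```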
Incremental Data Migration
For large datasets, migrate in stages:
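One simple pattern is a batch loop that appends one time slice at a time, so each slice commits as its own Iceberg snapshot and a failure only means re-running the last slice. A sketch, assuming the target table from the previous step already exists; paths, date ranges, and table names are placeholders:

```python
month_ranges = [("2024-01-01", "2024-02-01"),
                ("2024-02-01", "2024-03-01"),
                ("2024-03-01", "2024-04-01")]

for start, end in month_ranges:
    # Partition pruning on dt limits the read to one month of source data
    batch = (
        spark.read.parquet("s3://bucket/legacy/clicks/")
        .where(f"dt >= DATE '{start}' AND dt < DATE '{end}'")
    )

    # Each append is an atomic commit that produces a new Iceberg snapshot
    batch.writeTo("spark_catalog.db.clicks").append()

    print(f"migrated {start} .. {end}")
```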
Streaming Integration During Migration
Kafka to Iceberg with Spark Structured Streaming
During migration, establish streaming pipelines to keep Iceberg tables updated:
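A sketch of a Spark Structured Streaming job that reads JSON events from Kafka and appends them to the Iceberg table; the brokers, topic, event schema, checkpoint location, and table names are placeholders:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

# Placeholder event schema; replace with your actual message structure
event_schema = StructType([
    StructField("user_id", LongType()),
    StructField("url", StringType()),
    StructField("ts", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "click-events")
    .option("startingOffsets", "latest")
    .load()
    # Kafka delivers bytes; decode and parse the JSON payload
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "s3://bucket/checkpoints/clicks")
    .toTable("spark_catalog.db.clicks")
)
```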
Monitoring Streaming Migrations
Enterprise Kafka management platforms provide capabilities that are invaluable during Iceberg migrations involving streaming data:
Data Quality Monitoring: Track message schemas and validate data integrity during migration
Consumer Lag Tracking: Monitor streaming job performance to ensure migration keeps pace with incoming data
Topic Management: Coordinate multiple Kafka topics feeding into Iceberg tables during phased migrations
Schema Registry Integration: Manage schema evolution across both legacy and Iceberg tables
Robust observability features help data engineers identify bottlenecks, validate data consistency, and guard against data loss during the transition to Iceberg-based architectures.
Handling Late-Arriving Data
Iceberg's ACID guarantees make it ideal for handling late data during migration:
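A sketch using MERGE INTO, which the Iceberg SQL extensions enable, to upsert a batch of late-arriving records atomically; the join keys, path, and table names are placeholders:

```python
# Load a batch of late-arriving records and expose it to Spark SQL
late_records = spark.read.parquet("s3://bucket/late/clicks/2024-06-30/")
late_records.createOrReplaceTempView("late_clicks")

# Upsert: update rows that already exist, insert the rest, in one atomic commit
spark.sql("""
    MERGE INTO spark_catalog.db.clicks AS t
    USING late_clicks AS s
    ON  t.user_id = s.user_id AND t.ts = s.ts
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")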
Validation and Testing
Data Integrity Validation
Post-migration, verify data completeness and correctness:
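A simple sketch comparing row counts and a hash-based checksum between the source and the migrated table; table and column names are placeholders, and the checksum expression should be adapted to your schema:

```python
# Compute identical aggregates over both tables and assert they match
hive_stats = spark.sql("""
    SELECT COUNT(*) AS row_count, SUM(hash(user_id, url, ts)) AS checksum
    FROM db.clicks_hive
""").collect()[0]

iceberg_stats = spark.sql("""
    SELECT COUNT(*) AS row_count, SUM(hash(user_id, url, ts)) AS checksum
    FROM spark_catalog.db.clicks
""").collect()[0]

assert hive_stats.row_count == iceberg_stats.row_count, "row counts differ"
assert hive_stats.checksum == iceberg_stats.checksum, "checksums differ"
```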
Performance Testing
Compare query performance before and after migration:
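A rough sketch that times one representative query against the old and new tables; the query and table names are placeholders, and a real benchmark should use several runs, warmed caches, and your actual workload:

```python
import time

def time_query(sql: str) -> float:
    # Force full execution by collecting the results
    start = time.time()
    spark.sql(sql).collect()
    return time.time() - start

query_template = "SELECT dt, COUNT(*) FROM {table} WHERE dt >= '2024-01-01' GROUP BY dt"

hive_seconds    = time_query(query_template.format(table="db.clicks_hive"))
iceberg_seconds = time_query(query_template.format(table="spark_catalog.db.clicks"))

print(f"Hive: {hive_seconds:.1f}s, Iceberg: {iceberg_seconds:.1f}s")
```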
Best Practices and Considerations
Metadata Management
Catalog Selection: Choose between Hive Metastore, AWS Glue, or dedicated Iceberg catalogs based on infrastructure
Metadata Location: Store metadata in highly available storage (S3, HDFS with replication)
Snapshot Retention: Configure appropriate retention policies to balance time travel with storage costs
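For the snapshot-retention point above, a sketch using the expire_snapshots procedure to drop snapshots older than seven days while always keeping the most recent ten; the cutoff, table name, and retention values are placeholders:

```python
from datetime import datetime, timedelta

# Expire snapshots older than 7 days, but always retain the last 10
cutoff = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")

spark.sql(f"""
    CALL spark_catalog.system.expire_snapshots(
        table       => 'db.clicks',
        older_than  => TIMESTAMP '{cutoff}',
        retain_last => 10
    )
""")
```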
Performance Optimization
File Sizing: Target 512MB-1GB files for optimal query performance
Compaction: Schedule regular compaction for tables with many small files (a sketch follows this list)
Partition Evolution: Leverage hidden partitioning to avoid partition explosion
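A compaction sketch using the rewrite_data_files procedure with a target file size of roughly 512 MB; the table name and option values are placeholders:

```python
# Rewrite small files into larger ones; runs as a normal Iceberg commit
spark.sql("""
    CALL spark_catalog.system.rewrite_data_files(
        table   => 'db.clicks',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```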
Rollback Planning
Maintain rollback capabilities during migration:
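One rollback lever is reverting an Iceberg table to a known-good snapshot with the rollback_to_snapshot procedure if a migration step writes bad data; for in-place migrations, also keep the pre-migration Hive table definition backed up so the table can be re-registered with Hive if needed. A sketch, with placeholder table names and snapshot ID:

```python
# Inspect the table's snapshot history to find a known-good state
spark.sql(
    "SELECT snapshot_id, committed_at FROM spark_catalog.db.clicks.snapshots"
).show(truncate=False)

known_good_snapshot_id = 1234567890123456789  # chosen from the output above

# Revert the table's current state; no data files are deleted
spark.sql(f"""
    CALL spark_catalog.system.rollback_to_snapshot(
        table       => 'db.clicks',
        snapshot_id => {known_good_snapshot_id}
    )
""")
```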
Incremental Adoption
Don't migrate everything at once:
Start with non-critical tables: Gain experience with low-risk tables
Validate thoroughly: Run parallel workloads to compare results
Monitor performance: Track query latency, throughput, and resource usage
Gather feedback: Involve data consumers in validation
Scale gradually: Expand to critical tables after proven success
Conclusion
Migrating to Apache Iceberg from Hive or Parquet represents a strategic investment in modern data infrastructure. Whether using in-place migration for minimal disruption, snapshot migration for optimization opportunities, or dual-write for zero-downtime transitions, careful planning and validation ensure successful outcomes.
The migration journey requires coordinated effort across data engineering, data architecture, and analytics teams. By following these strategies and best practices, organizations can unlock Iceberg's powerful capabilities—ACID transactions, time travel, schema evolution, and partition evolution—while minimizing risk and maintaining business continuity.
Start with pilot migrations, validate rigorously, and scale systematically. The result is a robust, flexible data lakehouse foundation that supports evolving analytics and data science requirements for years to come.