Apache Iceberg, Data Migration, Hive, Parquet, Data Lakehouse
Migrating to Apache Iceberg from Hive or Parquet
Learn how to successfully migrate your existing Hive tables and Parquet datasets to Apache Iceberg. This guide covers migration strategies, data conversion techniques, and best practices for transitioning to a modern lakehouse architecture with minimal downtime.
Apache Iceberg has emerged as the leading table format for modern data lakehouses, offering features like ACID transactions, time travel, schema evolution, and partition evolution that traditional Hive tables and raw Parquet files cannot provide. As of 2025, Apache Iceberg 1.7+ provides production-grade migration tools, REST catalog support, and advanced features like branches and tags that make migration safer and more flexible than ever.
Migrating to Iceberg unlocks these capabilities while maintaining compatibility with your existing query engines and data infrastructure. This guide explores proven migration strategies, practical conversion techniques, and critical considerations for data engineers and architects planning an Iceberg migration.

Table of Contents
Understanding the Migration Landscape
Migration Strategies
Migrating Hive Tables to Iceberg
Converting Parquet Datasets to Iceberg
Streaming Integration During Migration
Validation and Testing
Best Practices and Considerations
Understanding the Migration Landscape
Before initiating a migration, assess your current state and requirements:
From Hive Tables:
Existing Hive metastore integration
Partition structures and naming conventions
Table statistics and metadata
Query patterns and access frequencies
Downstream dependencies on Hive-specific features
From Raw Parquet:
File organization and directory structure
Partition schemes (if any)
Schema consistency across files
Metadata availability
Current read/write patterns
Iceberg's design accommodates both scenarios with different migration approaches: in-place migration for Hive tables (converting metadata without moving data) and metadata-based adoption for Parquet datasets (creating Iceberg metadata to track existing files). For comprehensive coverage of Iceberg's metadata architecture, see Iceberg Table Architecture: Metadata and Snapshots.
Migration Strategies
Strategy 1: In-Place Migration (Hive to Iceberg)
In-place migration converts existing Hive tables to Iceberg tables without moving or rewriting data files. This approach offers:
Minimal downtime: Metadata conversion happens quickly
No data movement: Original Parquet/ORC files remain in place
Rollback capability: Can revert to Hive if needed
Resource efficiency: No data copying or rewriting required
When to use: Production Hive tables with stable partition schemes, large datasets where data copying is prohibitive, or scenarios requiring minimal disruption.
Strategy 2: Snapshot and Migrate
Create an Iceberg table and copy data from the source, allowing for optimization during migration:
Data optimization: Rewrite files to optimal sizes
Partition evolution: Restructure partitioning scheme
Schema refinement: Clean up schema inconsistencies
Incremental migration: Migrate in batches over time
When to use: When data reorganization is beneficial, source tables have performance issues, or you want to optimize file layouts during migration.
Strategy 3: Dual-Write Transition
Temporarily write to both old and new formats during transition:
Zero downtime: Seamless cutover for readers
Extended validation: Verify Iceberg behavior with production workloads
Gradual migration: Migrate read traffic incrementally
When to use: Mission-critical tables where zero downtime is mandatory, or when extensive validation is required before full cutover.
Migrating Hive Tables to Iceberg
Using Spark SQL for In-Place Migration
Iceberg's Spark SQL extensions (Spark 3.5+ with Iceberg 1.7+) provide a migrate procedure for in-place Hive table migration:
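A minimal sketch of this step, shown through PySpark's spark.sql so the session setup is self-contained; the table name db.events is hypothetical:

```python
# Minimal sketch: in-place migration of a Hive table with Iceberg's migrate procedure.
# Assumes the Iceberg Spark runtime is on the classpath and that the session catalog
# is backed by the Hive metastore; db.events is a hypothetical table name.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-iceberg-migration")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Convert the Hive table to Iceberg in place; existing data files stay where they are.
spark.sql("CALL spark_catalog.system.migrate('db.events')")
```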
This command performs the following operations:
Reads existing Hive table metadata and partition information
Creates Iceberg metadata files (metadata.json, manifest lists, manifest files)
Updates the Hive metastore to point to Iceberg table format
Preserves all existing data files in their current locations (no data movement)
Retains the original Hive table definition as a renamed backup so the migration can be rolled back if needed
Programmatic Migration with Spark
For more control over the migration process (Spark 3.5+ with Iceberg 1.7+):
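Iceberg also exposes a Scala/Java Actions API (SparkActions) for this step; the sketch below stays in PySpark and gets comparable control by passing explicit arguments to the migrate procedure and verifying the result through Iceberg's metadata tables. The property values are illustrative and db.events is the same hypothetical table as above:

```python
# Sketch: the same in-place migration, but with explicit table properties and a
# post-migration check through Iceberg metadata tables. Assumes the SparkSession
# configured in the previous sketch; property values are illustrative.
spark.sql("""
    CALL spark_catalog.system.migrate(
        table      => 'db.events',
        properties => map(
            'write.target-file-size-bytes', '536870912',
            'write.metadata.delete-after-commit.enabled', 'true'
        )
    )
""")

# Inspect the migrated table through its metadata tables.
spark.sql("SELECT snapshot_id, operation FROM spark_catalog.db.events.snapshots").show()
spark.sql("SELECT file_path, record_count FROM spark_catalog.db.events.files LIMIT 10") \
    .show(truncate=False)
```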
Snapshot Migration Approach
For scenarios requiring data rewrite and optimization:
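A sketch of this approach using CREATE TABLE ... AS SELECT, which copies data into a new Iceberg table and lets you choose the partition spec and target file size along the way. The iceberg_catalog target catalog, table names, and the event_ts column are assumptions:

```python
# Sketch: snapshot-style migration that rewrites data into a new Iceberg table via
# CREATE TABLE ... AS SELECT. Assumes the SparkSession from the first sketch plus a
# separately configured target catalog named iceberg_catalog (hypothetical).
spark.sql("""
    CREATE TABLE iceberg_catalog.analytics.events
    USING iceberg
    PARTITIONED BY (days(event_ts))
    TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
    AS SELECT * FROM spark_catalog.db.events
""")
```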
Using Branches for Safe Migration Testing (Iceberg 1.5+)
Iceberg's branch feature allows you to test migrations in isolation before committing to production:
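A sketch of the pattern, assuming db.events is already an Iceberg table; the branch name migration_test and the staging.events_backfill source are hypothetical:

```python
# Sketch: testing migration changes on an isolated branch before publishing them.
# Assumes db.events is already an Iceberg table (e.g., after the migrate step above).

# Create a branch pointing at the table's current state.
spark.sql("ALTER TABLE spark_catalog.db.events CREATE BRANCH migration_test")

# Write test data (for example, a backfill) to the branch only; 'main' is untouched.
spark.sql("""
    INSERT INTO spark_catalog.db.events.branch_migration_test
    SELECT * FROM staging.events_backfill
""")

# Validate the branch in isolation.
spark.read.option("branch", "migration_test").table("spark_catalog.db.events").count()

# Publish by fast-forwarding main to the branch head, or drop the branch to discard.
spark.sql("CALL spark_catalog.system.fast_forward('db.events', 'main', 'migration_test')")
```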
This branching approach provides a safety net for migration testing, allowing you to validate transformations before affecting production queries.
Converting Parquet Datasets to Iceberg
Adding Metadata to Existing Parquet Files
Iceberg can adopt existing Parquet files without rewriting them, using the add_files bulk-import procedure (Iceberg 1.6+):
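A sketch of the flow, assuming the existing files live under s3://data-lake/raw/events/ in Hive-style event_date=YYYY-MM-DD directories; the schema and table names are hypothetical:

```python
# Sketch: registering existing Parquet files with an Iceberg table via the add_files
# procedure, without rewriting them. Path, schema, and names are hypothetical.

# 1. Create an empty Iceberg table whose schema and identity partition column
#    mirror the existing directory layout.
spark.sql("""
    CREATE TABLE spark_catalog.db.events_iceberg (
        event_id   STRING,
        event_ts   TIMESTAMP,
        payload    STRING,
        event_date DATE
    )
    USING iceberg
    PARTITIONED BY (event_date)
""")

# 2. Import the existing Parquet files by reference; only Iceberg metadata is written.
spark.sql("""
    CALL spark_catalog.system.add_files(
        table        => 'db.events_iceberg',
        source_table => '`parquet`.`s3://data-lake/raw/events/`'
    )
""")

# 3. Once historical files are registered, optionally evolve the spec so future
#    writes use hidden partitioning on the timestamp; already-imported files keep
#    their original layout and remain readable.
spark.sql("""
    ALTER TABLE spark_catalog.db.events_iceberg
    REPLACE PARTITION FIELD event_date WITH days(event_ts)
""")
```

The final ALTER statement is metadata-only: it changes the spec for future writes without touching the files imported in step 2.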
Note: The days() partition function automatically handles date transformations, extracting the date component from the timestamp field for efficient partition pruning.
Incremental Data Migration
For large datasets, migrate in stages to minimize resource consumption and enable parallel processing:
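One way to stage the work, sketched below, is to import one source partition per batch using add_files with its partition_filter argument; this is a staged alternative to the single add_files call in the previous sketch (run it before any partition-spec change), and the date range is illustrative:

```python
# Sketch: staged adoption, importing one event_date partition per batch so large
# datasets can be migrated in controlled, parallelizable steps. Names and the date
# range are hypothetical and build on the table created above.
from datetime import date, timedelta

start, end = date(2024, 1, 1), date(2024, 1, 31)
current = start
while current <= end:
    spark.sql(f"""
        CALL spark_catalog.system.add_files(
            table            => 'db.events_iceberg',
            source_table     => '`parquet`.`s3://data-lake/raw/events/`',
            partition_filter => map('event_date', '{current.isoformat()}')
        )
    """)
    current += timedelta(days=1)
```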
Streaming Integration During Migration
Kafka to Iceberg with Spark Structured Streaming
During migration, establish streaming pipelines to keep Iceberg tables updated (Spark 3.5+ with Iceberg 1.7+):
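A sketch of such a pipeline; the broker address, topic name, JSON payload schema, and checkpoint path are assumptions, and the target is the hypothetical db.events_iceberg table from the earlier examples:

```python
# Sketch: Kafka-to-Iceberg ingestion with Spark Structured Streaming. Assumes the
# Kafka and Iceberg Spark packages are on the classpath; all names are hypothetical.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", F.to_date("event_ts"))
)

query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://data-lake/checkpoints/events_iceberg/")
    .option("fanout-enabled", "true")   # input is not clustered by partition
    .trigger(processingTime="1 minute")
    .toTable("spark_catalog.db.events_iceberg")
)
```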
Monitoring Streaming Migrations with Conduktor
Managing Kafka-to-Iceberg streaming pipelines during migration requires comprehensive observability and governance. Conduktor provides essential capabilities for production-grade migrations:
Data Quality Monitoring: Validate message schemas and enforce data contracts before writes reach Iceberg tables, preventing corrupted migrations
Consumer Lag Tracking: Monitor streaming job performance in real-time using topic monitoring to ensure migration keeps pace with incoming data, preventing backlog accumulation
Topic Management: Visualize and coordinate multiple Kafka topics feeding into Iceberg tables during phased migrations
Schema Registry Integration: Manage schema evolution across both legacy and Iceberg tables with Schema Registry, ensuring compatibility during the transition
Pipeline Testing with Conduktor Gateway: Inject chaos scenarios (network delays, broker failures, partition rebalances) to validate exactly-once semantics and checkpoint recovery before production deployment
Data Lineage Tracking: Trace data flow from Kafka topics through transformations to Iceberg snapshots, enabling end-to-end visibility. Manage connectors with Kafka Connect
Conduktor's governance features help data engineers identify bottlenecks, validate data consistency, and ensure zero data loss during the transition to Iceberg-based architectures. For mission-critical migrations, testing pipeline resilience with Conduktor Gateway before cutover reduces risk and ensures production stability.
Handling Late-Arriving Data
Iceberg's ACID guarantees make it ideal for handling late data during migration without data loss or duplication:
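For example, late records can be reconciled with MERGE INTO so that replays during the migration window do not create duplicates; the late-events path and the event_id merge key below are assumptions:

```python
# Sketch: reconciling late-arriving records with MERGE INTO. Assumes the late batch
# has the same logical schema as the target table; names and paths are hypothetical.
from pyspark.sql import functions as F

late_df = (
    spark.read.parquet("s3://data-lake/raw/late_events/")
    .withColumn("event_date", F.to_date("event_ts"))
)
late_df.createOrReplaceTempView("late_events_batch")

spark.sql("""
    MERGE INTO spark_catalog.db.events_iceberg AS t
    USING late_events_batch AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```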
Validation and Testing
Data Integrity Validation
Post-migration, verify data completeness and correctness:
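A minimal sketch of such checks, comparing total and per-day row counts between a hypothetical legacy.events source and the migrated table:

```python
# Sketch: basic post-migration validation. legacy.events is a hypothetical source name.
source_count = spark.table("legacy.events").count()
target_count = spark.table("spark_catalog.db.events_iceberg").count()
assert source_count == target_count, f"Row count mismatch: {source_count} vs {target_count}"

# A coarse per-day comparison catches partially copied or duplicated partitions.
diff = spark.sql("""
    SELECT COALESCE(s.d, t.d) AS day, s.cnt AS source_cnt, t.cnt AS target_cnt
    FROM (SELECT CAST(event_ts AS DATE) AS d, COUNT(*) AS cnt
          FROM legacy.events GROUP BY 1) s
    FULL OUTER JOIN
         (SELECT CAST(event_ts AS DATE) AS d, COUNT(*) AS cnt
          FROM spark_catalog.db.events_iceberg GROUP BY 1) t
    ON s.d = t.d
    WHERE NOT (s.cnt <=> t.cnt)
""")
assert diff.count() == 0, "Per-day counts diverge between source and target"
```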
Performance Testing
Compare query performance before and after migration to validate improvements:
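A simple timing harness along these lines is enough for a first comparison; the queries below are placeholders for your real workload, and the noop sink forces full execution without writing results:

```python
# Sketch: before/after timing of representative queries against the legacy and
# Iceberg tables. Query text and table names are placeholders.
import time

queries = {
    "daily_counts_hive":    "SELECT CAST(event_ts AS DATE), COUNT(*) FROM legacy.events GROUP BY 1",
    "daily_counts_iceberg": "SELECT CAST(event_ts AS DATE), COUNT(*) FROM spark_catalog.db.events_iceberg GROUP BY 1",
}

for name, sql in queries.items():
    start = time.perf_counter()
    spark.sql(sql).write.format("noop").mode("overwrite").save()  # run to completion
    print(f"{name}: {time.perf_counter() - start:.1f}s")
```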
Best Practices and Considerations
Metadata Management
Choosing the right catalog implementation is critical for migration success. As of 2025, catalog options include:
REST Catalog (Recommended for 2025): Modern cloud-native catalogs like Apache Polaris (Snowflake's open-source catalog) or Project Nessie offer vendor neutrality, multi-tenancy, and Git-like versioning. Ideal for new implementations and multi-cloud environments.
AWS Glue: Native AWS integration with IAM-based access control. Best for AWS-centric architectures but creates cloud vendor lock-in.
Hive Metastore: Legacy option for backward compatibility with existing Hadoop ecosystems. Not recommended for new implementations; plan a move to a REST catalog.
Metadata Storage Best Practices:
Store metadata in highly available storage (S3 with versioning, HDFS with replication)
Configure snapshot retention policies to balance time travel capabilities with storage costs (see the sketch after this list)
Enable metadata compression for large-scale tables (Iceberg 1.7+)
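As a sketch, retention and metadata compression can be set as table properties, with snapshots expired on a schedule; the values below are illustrative rather than recommendations, and the table name is the hypothetical one used throughout:

```python
# Sketch: retention and metadata-compression table properties plus scheduled snapshot
# expiration. Values are illustrative; db.events_iceberg is hypothetical.
spark.sql("""
    ALTER TABLE spark_catalog.db.events_iceberg SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms'   = '432000000',  -- ~5 days of time travel
        'history.expire.min-snapshots-to-keep' = '20',
        'write.metadata.compression-codec'     = 'gzip'
    )
""")

# Expire old snapshots (and the files only they reference) on a schedule.
spark.sql("""
    CALL spark_catalog.system.expire_snapshots(
        table       => 'db.events_iceberg',
        older_than  => TIMESTAMP '2025-01-01 00:00:00',
        retain_last => 20
    )
""")
```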
Performance Optimization
File Sizing: Target 512MB-1GB files for optimal query performance
Compaction: Schedule regular compaction for tables with many small files (see the sketch after this list)
Partition Evolution: Leverage hidden partitioning to avoid partition explosion
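A sketch of scheduled compaction with the rewrite_data_files procedure, using the same hypothetical table; the binpack strategy and target file size are illustrative:

```python
# Sketch: compacting small files with rewrite_data_files; run it periodically for
# tables with frequent small writes. Values are illustrative.
spark.sql("""
    CALL spark_catalog.system.rewrite_data_files(
        table    => 'db.events_iceberg',
        strategy => 'binpack',
        options  => map('target-file-size-bytes', '536870912')
    )
""")
```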
Rollback Planning
Maintain rollback capabilities during migration:
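For tables already on Iceberg, one option is snapshot-level rollback: record a known-good snapshot ID before each migration step and revert to it if validation fails. A sketch, with a placeholder snapshot ID:

```python
# Sketch: snapshot-level rollback. Look up a known-good snapshot first; the ID below
# is a placeholder, and db.events_iceberg is the hypothetical table used throughout.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM spark_catalog.db.events_iceberg.snapshots
""").show(truncate=False)

# Revert the table's current state to the chosen snapshot.
spark.sql(
    "CALL spark_catalog.system.rollback_to_snapshot('db.events_iceberg', 1234567890123456789)"
)
```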
Incremental Adoption
Don't migrate everything at once:
Start with non-critical tables: Gain experience with low-risk tables
Validate thoroughly: Run parallel workloads to compare results
Monitor performance: Track query latency, throughput, and resource usage
Gather feedback: Involve data consumers in validation
Scale gradually: Expand to critical tables after proven success
Conclusion
Migrating to Apache Iceberg from Hive or Parquet represents a strategic investment in modern data infrastructure. As of 2025, Iceberg 1.7+ provides production-grade features such as REST catalog support, branches and tags, Puffin statistics, and improved streaming integrations that make migration safer and more flexible than ever.
Whether using in-place migration for minimal disruption, snapshot migration for optimization opportunities, or dual-write for zero-downtime transitions, careful planning and validation ensure successful outcomes. Leveraging modern tools like Apache Polaris for catalog management and Conduktor for streaming governance significantly reduces migration risk.
The migration journey requires coordinated effort across data engineering, data architecture, and analytics teams. By following these strategies and best practices, organizations can unlock Iceberg's powerful capabilities (ACID transactions, time travel, schema evolution, and partition evolution) while minimizing risk and maintaining business continuity.
Start with pilot migrations, validate rigorously using branches for isolated testing, and scale systematically. The result is a robust, flexible data lakehouse foundation that supports evolving analytics and data science requirements for years to come.
For deeper understanding of Iceberg's capabilities, explore Apache Iceberg for comprehensive feature coverage, Iceberg Table Architecture: Metadata and Snapshots for internal architecture details, and Introduction to Lakehouse Architecture for broader architectural context.