What is Change Data Capture Anyways?

Learn what Change Data Capture (CDC) is, its use cases, and mechanisms.

Stéphane Derosiaux

March 9, 2023

Top companies rarely stay at the top forever, and a common reason is that they fail to keep up with new trends. Missing a trend usually comes down to not spotting changes in the underlying data early enough. Data-driven decisions are ineffective when the data behind them is stale or of low quality. Industries can change in seconds, and you have to know which changes matter.

Real-time analytics and up-to-date data are the oxygen of effective decision making. Stale data is dangerous and misleading when it is used to plan the next move. Netflix stock was down by 65% in July 2022 and took a turn at the end of the year when it gained 7 million subscribers. It is important to know when data changes in order to invest well.

Data has to be refreshed even if it changed only a few seconds ago. A mechanism called Change Data Capture (CDC) spots changes in the data source and triggers an update to the target. In this article, you will learn what Change Data Capture (CDC) is, its use cases, and mechanisms.

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a technology used to track and capture changes made to a database. It is used to replicate changes made to a source database to one or more target databases in real time. This technology is commonly used in data warehousing, business intelligence, and data integration scenarios.

CDC works by monitoring the transaction logs of a source database. These logs contain a record of all changes made to the database, including insertions, updates, and deletions. CDC software reads these logs and captures the changes, which are then applied to the target database. This process is often referred to as "log-based CDC."
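To make the log-based idea concrete, here is a minimal sketch in Python. It is not how any particular CDC product works: the event format (the "op", "key", and "row" fields) is an assumption made up for this example, and the target is just an in-memory dictionary standing in for a table.

```python
from typing import Any

# Hypothetical change-event format: the field names "op", "key" and "row"
# are assumptions for this sketch, not a real transaction log format.
ChangeEvent = dict[str, Any]


def apply_change(target: dict[int, dict[str, Any]], event: ChangeEvent) -> None:
    """Apply one captured change to an in-memory stand-in for the target table."""
    op = event["op"]
    key = event["key"]
    if op == "insert":
        target[key] = dict(event["row"])
    elif op == "update":
        # Merge only the changed columns into the existing row.
        target.setdefault(key, {}).update(event["row"])
    elif op == "delete":
        target.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replicate(log: list[ChangeEvent]) -> dict[int, dict[str, Any]]:
    """Replay the log in order to rebuild the target's state."""
    target: dict[int, dict[str, Any]] = {}
    for event in log:
        apply_change(target, event)
    return target


transaction_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Linus", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"plan": "pro"}},
    {"op": "delete", "key": 2},
]

replica = replicate(transaction_log)
print(replica)  # {1: {'name': 'Ada', 'plan': 'pro'}}
```

The target never reads the source tables themselves. It only needs the ordered list of changes, which is what makes log-based CDC light on the source's query workload.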

CDC can deliver changes in two ways: either the source pushes each data change to the target (for example, a data warehouse), or the target pulls data changes from the source.

Pulling data changes is easier than pushing them, because the target only has to read the logs of the data source to identify changes and act on them. The trade-off is latency: the target sees a change only the next time it reads, so it has to poll the data source logs repeatedly.

Pushing data occurs when the data source sends data changes to the target to take any necessary actions. There has to be a mechanism set up to receive and interpret the data changes. This approach comes with low latency.

The main downside of this approach appears when the target is not listening while the data source sends changes: the target misses them. To solve this problem, the changes have to go through a queuing system that records them until the target is ready to process them. The push approach is crucial for real-time data systems that need low-latency actions.
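The difference between the two approaches can be sketched in a few lines of Python. Everything here is hypothetical: the Source, PullTarget, and PushTarget classes are invented for illustration, and a deque plays the role of the queuing system.

```python
from collections import deque


class Source:
    """Stand-in for a data source that logs changes and can push them."""

    def __init__(self) -> None:
        self.log: list[str] = []
        self.subscribers: list[deque[str]] = []

    def write(self, change: str) -> None:
        self.log.append(change)
        # Push: hand the change to every subscriber's queue immediately.
        for queue in self.subscribers:
            queue.append(change)


class PullTarget:
    """Polls the source log and remembers how far it has read."""

    def __init__(self) -> None:
        self.offset = 0
        self.applied: list[str] = []

    def poll(self, source: Source) -> None:
        new_changes = source.log[self.offset:]
        self.applied.extend(new_changes)
        self.offset += len(new_changes)


class PushTarget:
    """Receives changes through a queue, so nothing is lost while it is busy."""

    def __init__(self, source: Source) -> None:
        self.inbox: deque[str] = deque()
        self.applied: list[str] = []
        source.subscribers.append(self.inbox)

    def drain(self) -> None:
        while self.inbox:
            self.applied.append(self.inbox.popleft())


source = Source()
puller = PullTarget()
pusher = PushTarget(source)

source.write("insert order 1")
source.write("update order 1")

# The pull target sees nothing until its next poll.
print(puller.applied)  # []
puller.poll(source)
print(puller.applied)  # ['insert order 1', 'update order 1']

# The push target was not listening, but the queue kept the changes for it.
pusher.drain()
print(pusher.applied)  # ['insert order 1', 'update order 1']
```

In the pull case, latency depends on how often poll runs. In the push case, the change is available as soon as it is written, and the queue covers the moments when the target is not listening.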

What Are the Benefits of Using CDC?

There are several benefits of using CDC. One of the main advantages is that it allows for real-time data replication, which means that data changes made to the source database are quickly reflected in the target database. This is important for scenarios where the target database is used for reporting or analytics purposes, as it ensures that the data in the target database is always up-to-date.

CDC also allows for greater flexibility in terms of data replication. With traditional data replication methods, the entire source database is typically replicated to the target database. With CDC, however, it is possible to replicate only specific tables or columns, which can be useful in scenarios where only a subset of the data is needed in the target database.
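As a rough illustration of replicating only a subset, the sketch below filters change events before they reach the target. The rule table, the table names, and the event fields are all assumptions made for this example. Real CDC tools expose this kind of filtering through their own configuration.

```python
from typing import Any, Optional

# Hypothetical replication rules: table name -> columns to keep.
# Tables that are not listed are not replicated at all.
REPLICATED_COLUMNS: dict[str, set[str]] = {
    "customers": {"id", "country"},
    "orders": {"id", "customer_id", "total"},
}


def filter_change(event: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a trimmed copy of the event, or None if its table is skipped."""
    columns = REPLICATED_COLUMNS.get(event["table"])
    if columns is None:
        return None
    row = {name: value for name, value in event["row"].items() if name in columns}
    return {"table": event["table"], "op": event["op"], "row": row}


changes = [
    {"table": "customers", "op": "insert",
     "row": {"id": 7, "email": "ada@example.com", "country": "FR"}},
    {"table": "sessions", "op": "insert", "row": {"id": 99, "token": "abc"}},
]

replicated = [kept for kept in map(filter_change, changes) if kept is not None]
print(replicated)
# [{'table': 'customers', 'op': 'insert', 'row': {'id': 7, 'country': 'FR'}}]
```

Here the email column and the whole sessions table never leave the source, which is the kind of selectivity a full-database copy cannot offer.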

CDC is also useful for data integration scenarios. It can be used to replicate data from multiple source databases to a single target database, which can then be used for reporting and analytics purposes. This can be a more efficient and cost-effective solution than using traditional data integration techniques, such as Extract, Transform, Load (ETL) processes.

Another benefit of CDC is that it can be used for data archiving. By capturing changes made to the source database, it is possible to create a historical record of the data. This can be useful for compliance purposes, as well as for troubleshooting and auditing.

Use Cases of CDC

Data replication is one of the most common use cases for CDC. It is frequently used for backup and disaster recovery, duplicating data from one database to another, for instance from a primary database to a secondary one. Replication can run immediately or on a set schedule. Because CDC collects and copies only the changes made to the data, rather than the complete dataset each time, replication is efficient: it runs faster and can significantly reduce the amount of data that needs to be sent.

Keeping the data warehouse up to date is another typical use case. Here CDC is used alongside data integration processes to gather, manage, and store massive amounts of data in a data warehouse for reporting and analysis. With CDC, data can be gathered from several sources and compiled into a single data warehouse in real time, in near real time, or at a predetermined time. As a result, organizations can access all of their data consistently and make decisions based on it.

CDC can be used for auditing as well. It keeps a record of data changes, which is useful for compliance and regulatory purposes. It can also track modifications made by different users and spot unwanted ones, which helps with security and fraud detection.

CDC can be used in a variety of other situations, including IoT, ETL, and event sourcing. In the Internet of Things, CDC tracks and records changes to sensor data in real time, which makes real-time monitoring and analysis possible. In ETL (Extract, Transform, Load) processes, CDC can track and record data changes, enabling reliable data replication and data warehousing. In event sourcing, CDC can track and record changes made to an application's state, which makes it possible to follow that state accurately and to replay earlier states.
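Replaying earlier states is easier to picture with a small example. The following Python sketch assumes a made-up stream of account events ("deposited" and "withdrawn" are hypothetical names) that is already ordered by sequence number, and rebuilds the balance as of any point in that stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sequence: int  # position in the change stream
    kind: str      # "deposited" or "withdrawn" (hypothetical names)
    amount: int


def balance_at(events: list[Event], sequence: int) -> int:
    """Rebuild the balance as it was right after the given sequence number.

    Assumes the events are sorted by sequence number.
    """
    balance = 0
    for event in events:
        if event.sequence > sequence:
            break
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind!r}")
    return balance


history = [
    Event(1, "deposited", 100),
    Event(2, "withdrawn", 30),
    Event(3, "deposited", 5),
]

print(balance_at(history, 2))  # 70
print(balance_at(history, 3))  # 75
```

Because every change is kept instead of being overwritten, the same history serves both as the current state and as an audit trail.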

CDC Methods and Mechanisms

CDC can be implemented in various ways. One common approach is to use separate CDC software that reads the transaction logs of the source database and applies the changes to the target database. Another approach is to use tooling from the database vendor, such as Oracle GoldenGate or SQL Server's built-in Change Data Capture feature.

CDC can also be implemented using open-source solutions, such as Debezium, which is a CDC platform built on top of Apache Kafka. Debezium allows for real-time data replication between a variety of different databases, including MySQL, PostgreSQL, and MongoDB.

Here are the CDC mechanisms you should know:

  • The row versioning mechanism works by incrementing a version number whenever a change occurs. The version number can start at 0 or 1. If the version number is 35, you know that the data has been changed 35 times, or 36 times if the version number started at 0. The mechanism is simple, but it gets complicated for the target system when one record has multiple versions, because finding the latest one becomes difficult. This can be solved by storing the last known version number in a reference table, keyed by record ID. The highest version number then gives you the latest version of the record.

  • The update timestamp mechanism is only concerned with giving you the latest record, not with how many times the record has changed. It uses a timestamp, made up of a date and a time, to determine the latest version: the most recent timestamp marks the latest version. This mechanism is not complex and does not need reference tables to handle records with multiple versions, because the timestamp is overwritten every time the data changes.

  • The publish and subscribe queues mechanism uses the push approach: the source sends data changes to a queue, and the target retrieves them from there. This scales well, because if many changes are made suddenly they are stored in the queue and the target works through them at its own pace. Changing data sources is also easy with this system, because the data source and the target are decoupled by the queue in the middle.

  • The database log scanners mechanism was built for two systems that have to stay in sync, especially when a backup is being synchronized. This mechanism installs scanners on the database that scan its log for changes made to the data. Each change then triggers an INSERT, UPDATE, or DELETE on the backup, depending on the modification made to the database.
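The row versioning mechanism from the list above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the products table, its version column, and the last_seen dictionary (standing in for the target's reference table) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, price REAL, version INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO products (id, price, version) VALUES (?, ?, ?)",
    [(1, 9.99, 1), (2, 24.50, 1)],
)


def update_price(product_id: int, price: float) -> None:
    """Every write bumps the row's version number."""
    conn.execute(
        "UPDATE products SET price = ?, version = version + 1 WHERE id = ?",
        (price, product_id),
    )


def changed_rows(last_seen: dict[int, int]) -> list[tuple[int, float, int]]:
    """Return rows whose version is higher than the one the target last saw."""
    rows = conn.execute(
        "SELECT id, price, version FROM products ORDER BY id"
    ).fetchall()
    return [row for row in rows if row[2] > last_seen.get(row[0], 0)]


# Target-side reference table: record id -> last known version.
last_seen: dict[int, int] = {}

first_sync = changed_rows(last_seen)
last_seen.update({row[0]: row[2] for row in first_sync})

update_price(2, 19.99)
update_price(2, 17.99)

second_sync = changed_rows(last_seen)
print(second_sync)  # [(2, 17.99, 3)]
```

Note what the second sync does not show: the intermediate price of 19.99 is gone, and a deleted row would simply stop appearing. Version and timestamp columns tell you the latest state of a row, while log-based and queue-based mechanisms can carry every individual change.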

Limitations of Implementing CDC

CDC has some limitations. One of its key drawbacks is that it can be resource-intensive: reading transaction logs and updating the target database can place a heavy load on both the source and target databases. This becomes especially hard to manage when a lot of data is being copied in real time.

Another drawback of CDC is that it can be difficult to set up and run. Configuring the CDC software, setting up the replication process, and monitoring replication take time and a certain amount of technical knowledge.

Conclusion

Change Data Capture (CDC) is a technology that allows for real-time data replication between a source and one or more target databases. It is commonly used in data warehousing, business intelligence, and data integration scenarios, but it can also be used for data archiving. CDC has many benefits, including real-time data replication, flexibility in terms of data replication, and efficient data integration. However, it also has some limitations, including resource-intensive processing and complex setup and maintenance.

Change Data Capture (CDC) is a potent tool that can be applied in a wide range of situations, such as auditing, data warehousing, and replication. It also supports IoT, ETL, event sourcing, compliance and regulatory requirements, security, and fraud detection. For enterprises trying to manage and interpret their data in real time, it is an essential tool.

One of the best tools you can use to implement CDC is Kafka, which helps you make use of the changes captured from the transaction log. Kafka is a powerful and sophisticated tool, however, and it needs to be managed. That does not have to be a worry: Conduktor gives you an intuitive UI that lets you manage your critical Kafka tasks, and it gives you monitoring and auditing features that help you observe your Kafka ecosystem.

Sign up for Conduktor today for free to streamline and manage your Kafka CDC operations swiftly with our powerful, user-friendly platform!