CDC pipeline from MySQL to Snowflake

To Nha Notes | Feb. 10, 2022, 3:39 p.m.

Implementing CDC ourselves from scratch turned out to be a good decision. The data pipeline we developed has been up for more than a year now and is serving an increasing number of consumers.

In this post, I will share the architecture, the challenges we encountered and how we overcame them, as well as some useful tips to start building your own CDC. We’ll dive deep into some core aspects, but not every part of the solution.

References: 

https://community.snowflake.com/s/article/DIY-CDC-Pipeline-from-MySQL-to-Snowflake

https://go.streamsets.com/rs/535-TEA-657/images/MySQL-Change-Data-Capture-to-Snowflakes-Guide_V3.pdf?_ga=2.101915906.2055578397.1640706091-959555593.1618844669

Because your data warehouse is likely not a MySQL database, it’s not possible to simply use the built-in MySQL replication features. To make use of the binlog for data ingestion to a non-MySQL source, there are a number of steps to take:

  1. Enable and configure the binlog on the MySQL server.

  2. Run an initial full table extraction and load.

  3. Extract from the binlog on a continuous basis.

  4. Translate and load binlog extracts into the data warehouse.

First, ensure that binary logging is enabled. Typically it is enabled by default, but you can check by running the following SQL query on the database (exact syntax may vary by MySQL distribution):

SELECT variable_value as bin_log_status
FROM performance_schema.global_variables
WHERE variable_name='log_bin';

 

Next, ensure that the binary logging format is set appropriately. There are three formats supported in the recent version of MySQL:

  • STATEMENT

  • ROW

  • MIXED

With the ROW format, every change to a row in a table is represented on a line of the binlog not as a SQL statement but rather the data in the row itself. This is the preferred format to use.

To change binlog_format, on the Master run these:

FLUSH TABLES WITH READ LOCK;
FLUSH LOGS;
SET GLOBAL binlog_format = 'MIXED';
FLUSH LOGS;
UNLOCK TABLES;

You can verify the current binlog format by running the following SQL query:

SELECT variable_value as bin_log_format
FROM performance_schema.global_variables
WHERE variable_name='binlog_format';

Please ensure your binary logs are retained on the server for a sufficient amount of time, (24 hours is usually enough.) To set your binary log retention time on RDS instances you can use the following command:

call mysql.rds_set_configuration("binlog retention hours", 24);

 

Next, it’s time to get the events from the binlog. Thankfully, there are some open source Python libraries available to get you started. One of the most popular is the python-mysql-replication project, which can be found on https://github.com/noplay/python-mysql-replication. To get started, install it using pip:

(env) $ pip install mysql-replication

 

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication import row_event
import configparser
import pymysqlreplication

# get the MySQL connection info
parser = configparser.ConfigParser()
parser.read("pipeline.conf")
hostname = parser.get("mysql_config", "hostname")
port = parser.get("mysql_config", "port")
username = parser.get("mysql_config", "username")
password = parser.get("mysql_config", "password")

mysql_settings = {
    "host": hostname,
    "port": int(port),
    "user": username,
    "passwd": password
}

b_stream = BinLogStreamReader(
            connection_settings = mysql_settings,
            server_id=100,
            only_events=[row_event.DeleteRowsEvent,
                        row_event.WriteRowsEvent,
                        row_event.UpdateRowsEvent]
            )

for event in b_stream:
    event.dump()

b_stream.close()

 

References:

https://learning.oreilly.com/library/view/data-pipelines-pocket/9781492087823/ch04.html

https://aws.amazon.com/blogs/database/streaming-changes-in-a-database-with-amazon-kinesis/

https://www.alooma.com/integrations/mysql/redshift

https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html