Binary Log Replication of MySQL Data for CDC

To Nha Notes | Jan. 2, 2022, 11:48 a.m.

Because your data warehouse is likely not a MySQL database, it’s not possible to simply use the built-in MySQL replication features. To make use of the binlog for data ingestion to a non-MySQL source, there are a number of steps to take:

  1. Enable and configure the binlog on the MySQL server.
  2. Run an initial full table extraction and load.
  3. Extract from the binlog on a continuous basis.
  4. Translate and load binlog extracts into the data warehouse.

Step 3 is not discussed in detail, but to use the binlog for ingestion, you must first populate the tables in the data warehouse with the current state of the MySQL database and then use the binlog to ingest subsequent changes. Doing so often involves putting a LOCK on the tables you want to extract, running a mysqldump of those tables, and then loading the result of the mysqldump into the warehouse before turning on the binlog ingestion.

Though it’s best to refer to the latest MySQL binlog documentation for instructions in enabling and configuring binary logging.

There are two key settings to ensure on the MySQL database in regard to binlog configuration.

First, ensure that binary logging is enabled. Typically it is enabled by default, but you can check by running the following SQL query on the database (exact syntax may vary by MySQL distribution):

SELECT variable_value as bin_log_status
FROM performance_schema.global_variables
WHERE variable_name='log_bin';

Next, ensure that the binary logging format is set appropriately. There are three formats supported in the recent version of MySQL:

  • STATEMENT
  • ROW
  • MIXED

With the ROW format, every change to a row in a table is represented on a line of the binlog not as a SQL statement but rather the data in the row itself. This is the preferred format to use. It’s not necessary to enable MIXED, given the additional disk space that it takes up.

You can verify the current binlog format by running the following SQL query:

SELECT variable_value as bin_log_format
FROM performance_schema.global_variables
WHERE variable_name='binlog_format';

 

To change binlog_format, on the Master run these:

FLUSH TABLES WITH READ LOCK;
FLUSH LOGS;
SET GLOBAL binlog_format = 'MIXED';
FLUSH LOGS;
UNLOCK TABLES;

An individual client can control the logging format for its own statements by setting the session value of binlog_format:

mysql> SET SESSION binlog_format = 'STATEMENT';
mysql> SET SESSION binlog_format = 'ROW';
mysql> SET SESSION binlog_format = 'MIXED';

Changing the global binlog_format value requires privileges sufficient to set global system variables. Changing the session binlog_format value requires privileges sufficient to set restricted session system variables. See Section 5.1.9.1, “System Variable Privileges”.

There are several reasons why a client might want to set binary logging on a per-session basis:

  • A session that makes many small changes to the database might want to use row-based logging.

  • A session that performs updates that match many rows in the WHERE clause might want to use statement-based logging because it is more efficient to log a few statements than many rows.

  • Some statements require a lot of execution time on the source, but result in just a few rows being modified. It might therefore be beneficial to replicate them using row-based logging.

List up binlog files

mysql> SHOW BINARY LOGS;
mysql> SHOW MASTER LOGS;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000103 |       154 |
| mysql-bin.000104 |       154 |
| mysql-bin.000105 |     10562 |
| mysql-bin.000106 |       154 |
| mysql-bin.000107 |       154 |
| mysql-bin.000108 |       201 |
| mysql-bin.000109 |       201 |
| mysql-bin.000110 |      1287 |
| mysql-bin.000111 |       201 |
| mysql-bin.000112 |      2326 |
+------------------+-----------+

Show binlog files localtion

mysql> show global variables like '%log_bin%';
+---------------------------------+--------------------------------+
| Variable_name                   | Value                          |
+---------------------------------+--------------------------------+
| log_bin                         | ON                             |
| log_bin_basename                | /var/log/mysql/mysql-bin       |
| log_bin_index                   | /var/log/mysql/mysql-bin.index |
| log_bin_trust_function_creators | OFF                            |
| log_bin_use_v1_row_events       | OFF                            |
+---------------------------------+--------------------------------+

The number of days for automatic binary log file removal. The default is 0, which means “no automatic removal.

To check the current value of the expire_logs_days variable, use the following command.

mysql> show global variables like '%expire_logs_days%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| expire_logs_days | 0     |
+------------------+-------+

To display local MySQL DB binlog on local:

mysqlbinlog -v -u <db_user> -p<db_pass> -h <db_host> -d <db_name> --base64-output=DECODE-ROWS /var/log/mysql/mysql-bin.xxx

To display RDS MySQL binlog on EC2:

mysqlbinlog -v -u admin -p<db_pass> -h <db_instance>.ap-northeast-1.rds.amazonaws.com -d <db_name> \
 --server-id=<server_id> \
 --read-from-remote-server \
 --base64-output=DECODE-ROWS \
 mysql-bin-changelog.index

Notes:

  • For Amazon RDS as a source, we recommend ensuring that backups are enabled to set up CDC. We also recommend ensuring that the source database is configured to retain change logs for a sufficient time—24 hours is usually enough. To set your binary log retention time on RDS instances you can use the following command:
call mysql.rds_set_configuration("binlog retention hours", 24);

For MySQL DB instances, the maximum binlog retention hours value is 168 (7 days).

After you set the retention period, monitor storage usage for the DB instance to make sure that the retained binary logs don't take up too much storage.

  • If you activate automated backups on your Amazon RDS instance, the binary logs are also automatically activated on your DB instance. These binary logs are stored on the disk and consume storage space, but are purged at every binary log retention configuration. 
  • To avoid low storage space issues, set the appropriate binary log retention period in Amazon RDS for MySQL. You can review the number of hours that a binary log is retained with the mysql.rds_show_configuration command syntax:

    CALL mysql.rds_show_configuration;
  • You can also reduce this value to retain logs for a shorter period to reduce the amount of space the logs use. A value of NULL means that logs are purged as soon as possible.

Next, it’s time to get the events from the binlog. Thankfully, there are some open source Python libraries available to get you started. One of the most popular is the python-mysql-replication project, which can be found on GitHub. To get started, install it using pip:

(env) $ pip install mysql-replication

The following example reads from the MySQL server’s default binlog file.

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication import row_event
import configparser
import pymysqlreplication
import csv
import boto3

# get the MySQL connection info
parser = configparser.ConfigParser()
parser.read("pipeline.conf")
hostname = parser.get("mysql_config", "hostname")
port = parser.get("mysql_config", "port")
username = parser.get("mysql_config", "username")
password = parser.get("mysql_config", "password")

mysql_settings = {
    "host": hostname,
    "port": int(port),
    "user": username,
    "passwd": password
}

b_stream = BinLogStreamReader(
            connection_settings = mysql_settings,
            server_id=100,
            only_events=[row_event.DeleteRowsEvent,
                        row_event.WriteRowsEvent,
                        row_event.UpdateRowsEvent]
            )

order_events = []

for binlogevent in b_stream:
  for row in binlogevent.rows:
    if binlogevent.table == 'orders':
      event = {}
      if isinstance(
            binlogevent,row_event.DeleteRowsEvent
        ):
        event["action"] = "delete"
        event.update(row["values"].items())
      elif isinstance(
            binlogevent,row_event.UpdateRowsEvent
        ):
        event["action"] = "update"
        event.update(row["after_values"].items())
      elif isinstance(
            binlogevent,row_event.WriteRowsEvent
        ):
        event["action"] = "insert"
        event.update(row["values"].items())

      order_events.append(event)

b_stream.close()

keys = order_events[0].keys()
local_filename = 'orders_extract.csv'
with open(
        local_filename,
        'w',
        newline='') as output_file:
    dict_writer = csv.DictWriter(
                output_file, keys,delimiter='|')
    dict_writer.writerows(order_events)

# load the aws_boto_credentials values
parser = configparser.ConfigParser()
parser.read("pipeline.conf")
access_key = parser.get(
                "aws_boto_credentials",
                "access_key")
secret_key = parser.get(
                "aws_boto_credentials",
                "secret_key")
bucket_name = parser.get(
                "aws_boto_credentials",
                "bucket_name")

s3 = boto3.client(
    's3',
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key)

s3_file = local_filename

s3.upload_file(
    local_filename,
    bucket_name,
    s3_file)

 

Reference

https://learning.oreilly.com/library/view/data-pipelines-pocket/9781492087823/ch04.html#idm45343333942744

https://dev.mysql.com/doc/refman/5.7/en/mysqlbinlog.html

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.MySQL.Binarylog.html

https://www.programcreek.com/python/example/122809/pymysqlreplication.BinLogStreamReader

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.Concepts.MySQL.html

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.MySQL.BinaryFormat.html

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.MySQL.Binarylog.html

https://software-testing.com/topic/169430/understanding-binlogdiskusage-metric-behavior

https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html