To Nha Notes | Technology blog posts

Speeding Up MySQL Replica Catch-Up: A Practical Technique

To Nha Notes | May 28, 2026, 10:57 p.m.

The Problem

You've just created a MySQL replica — whether for a Blue/Green deployment, a read replica, or a disaster recovery standby. The source database has years of accumulated data, and your new replica is showing:

Seconds_Behind_Source: 9814

Nearly 3 hours of lag. And it's barely moving.

You need it at zero before you can proceed with your maintenance window. The clock is ticking.

Why Replicas Fall Behind

When MySQL replication starts, the replica's SQL thread must replay every binlog event from the source in sequence. On a large production database this means:

Millions of transactions to apply
Each write flushed to disk before the next one starts
Single I/O operations becoming the bottleneck — not CPU, not network

By default, MySQL is configured for maximum durability:

innodb_flush_log_at_trx_commit = 1 — flushes the InnoDB redo log to disk on every single commit
sync_binlog = 1 — syncs the binary log to disk after every transaction

These settings are correct for a production primary — they guarantee zero data loss on crash. But on a replica that's just catching up, they're causing unnecessary I/O overhead on every single replayed transaction.

The Technique

Temporarily relax the durability settings on the replica only while it catches up:

Parameter	Default	Catch-up value	Effect
innodb_flush_log_at_trx_commit	1	2	Writes to OS buffer per commit; flushes to disk every second instead
sync_binlog	1	0	Lets the OS decide when to flush the binlog

This dramatically reduces the number of disk flush operations per replayed transaction — the SQL thread can apply events much faster.

Is It Safe?

On a replica catching up — yes, with caveats:

Risk: If the replica OS crashes while catching up, you could lose up to 1 second of applied transactions from its local binlog and redo log. However:

The replica can simply be rebuilt from the source
No data is lost on the source (primary) — that's unaffected
For Blue/Green deployments specifically, you can delete the green instance and recreate it

Non-negotiable rule: Revert both settings to 1 before the replica becomes a primary. If you switchover or promote the replica with these settings, you risk data loss on a production crash.

Step-by-Step Guide (AWS RDS)

1. Apply the catch-up settings

In RDS, modify the replica's parameter group (dynamic parameters — no reboot needed):

innodb_flush_log_at_trx_commit = 2
sync_binlog = 0

Verify immediately on the replica:

SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SHOW VARIABLES LIKE 'sync_binlog';
-- Must return 2 and 0

2. Monitor lag

SHOW REPLICA STATUS\G
-- Watch: Seconds_Behind_Source decreasing
-- Watch: Relay_Source_Log_File advancing

Check every 15–30 minutes. You should see the lag dropping significantly faster.

3. When lag hits zero — revert immediately

innodb_flush_log_at_trx_commit = 1
sync_binlog = 1

Verify:

SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SHOW VARIABLES LIKE 'sync_binlog';
-- Must both return 1 before any promotion or switchover

Real-World Results

During a MySQL 8.0 → 8.4 Blue/Green upgrade on a ~2TB RDS production instance:

Phase	Seconds_Behind_Source	Drop per 30 min
Initial catch-up (default settings)	9,814 → 9,700	~114s
After storage I/O warmed up	9,700 → 7,307	~2,393s
Steady state	~5,000–7,000	~1,500–2,000s

Applying innodb_flush_log_at_trx_commit=2 + sync_binlog=0 accelerated the remaining catch-up, reducing estimated time to zero by 1–2 hours on a 2TB dataset.

What About replica_parallel_workers?

Before reaching for durability settings, always check parallelism first:

SHOW VARIABLES LIKE 'replica_parallel_workers';
SHOW VARIABLES LIKE 'replica_parallel_type';

If replica_parallel_workers = 0 or 1, increasing it (e.g. to 8 or 16) with replica_parallel_type = LOGICAL_CLOCK is the first optimization to try — it parallelizes independent transactions and has no durability tradeoff.

The durability relaxation technique is the second lever — for when parallelism is already maxed out and I/O is the remaining bottleneck.

Quick Reference

-- Apply catch-up settings (replica only)
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;

-- Monitor
SHOW REPLICA STATUS\G

-- Revert when Seconds_Behind_Source = 0
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;

⚠️ Never leave innodb_flush_log_at_trx_commit=2 or sync_binlog=0 on a production primary. Always revert before promotion or switchover.

Summary

	Default	Catch-up mode
innodb_flush_log_at_trx_commit	1 (flush every commit)	2 (flush every second)
sync_binlog	1 (sync every transaction)	0 (OS-managed)
Safe on primary?	✅ Yes	❌ No
Safe on catching-up replica?	✅ Yes	✅ Yes
Revert before promotion?	N/A	Required

A simple, low-risk technique that can shave hours off your replica catch-up time — as long as you remember to revert before the replica goes live.

Posts

Technical AI Startup Work Learning Tools Books Kinh doanh Đọc sách Làm bánh Cuộc sống Tài chính Bất động sản Đầu tư Thuế Nice words