TL;DR — Quick Summary
Troubleshoot MySQL replication errors including slave lag, duplicate key conflicts, and broken relay logs with step-by-step diagnostic commands.
Understanding MySQL Replication
MySQL replication allows data from one database server (the primary, historically called the “master”) to be automatically copied to one or more additional servers (replicas, historically called “slaves”). This mechanism provides critical benefits: high availability, read scaling, geographic distribution, and backup offloading.
Despite its maturity, MySQL replication is far from “set it and forget it.” Configuration drifts, schema changes, hardware discrepancies, and network instabilities conspire to produce replication failures that can cascade into data inconsistencies, application errors, and even downtime.
This guide covers the most common MySQL replication problems, their root causes, and step-by-step solutions for both traditional binary-log and GTID-based replication topologies.
Prerequisites
- A working MySQL primary/replica setup (MySQL 5.7+ or MySQL 8.0+).
- Shell access to both the primary and replica servers.
- Sufficient privileges (`SUPER` or `REPLICATION CLIENT`) to run `SHOW SLAVE STATUS`.
- Familiarity with basic SQL and Linux command-line tools.
- Optional: Percona Toolkit installed for advanced diagnostics.
Common Replication Problems
The following five failure modes account for the vast majority of MySQL replication issues:
- Replica Lag (Seconds_Behind_Master > 0): The replica cannot apply binary log events as fast as the primary produces them.
- Duplicate Key Errors (Error 1062): A row that already exists on the replica is being inserted again by a replicated transaction.
- Missing Row Errors (Error 1032): A replicated `UPDATE` or `DELETE` references a row that does not exist on the replica.
- Relay Log Corruption: The relay log file on the replica becomes unreadable due to disk errors, crashes, or improper shutdowns.
- GTID Divergence: In GTID-based replication, the replica has executed transactions (errant GTIDs) that are not present on the primary, so the two servers no longer agree on a shared transaction history.
Step-by-Step Solution
1. Assess Current Replication Status
Always start by inspecting the replica’s health. On the replica server, run:
```sql
SHOW REPLICA STATUS\G
```
For MySQL versions prior to 8.0.22, use `SHOW SLAVE STATUS\G` instead.
Key fields to inspect:
| Field | Healthy Value | Meaning |
|---|---|---|
| Replica_IO_Running | Yes | The I/O thread is connected to the primary and reading binary logs |
| Replica_SQL_Running | Yes | The SQL thread is applying events from the relay log |
| Seconds_Behind_Source | 0 | No replication lag |
| Last_Error | (empty) | No errors encountered |
| Retrieved_Gtid_Set | matches primary | All GTIDs have been fetched |
| Executed_Gtid_Set | matches Retrieved | All fetched GTIDs have been applied |
If either thread shows No, focus on the corresponding error fields: Last_IO_Error for I/O issues, Last_SQL_Error for apply-side failures.
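When you manage more than one replica, checking these fields by eye gets tedious. A minimal sketch of an automated check (the `check_replica` helper name is hypothetical; it assumes the `\G`-formatted status output is piped in):

```shell
# Hypothetical helper: reads `SHOW REPLICA STATUS\G` output on stdin and
# prints OK when both threads run and no error is recorded, FAIL otherwise.
check_replica() {
  awk -F': ' '
    /Replica_IO_Running:/  { io  = $2 }
    /Replica_SQL_Running:/ { sql = $2 }
    /Last_Error:/          { err = $2 }
    END {
      if (io == "Yes" && sql == "Yes" && err == "")
        print "OK"
      else
        printf "FAIL io=%s sql=%s err=%s\n", io, sql, err
    }'
}

# Typical usage against a live replica:
#   mysql -e 'SHOW REPLICA STATUS\G' | check_replica
```

Wiring this into cron or a monitoring agent gives an early warning before users notice stale reads.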
2. Fix Replica Lag
Replica lag means the SQL thread cannot keep up with incoming binary log events. Diagnose with:
```shell
# Watch lag in real-time
watch -n 1 "mysql -e 'SHOW REPLICA STATUS\G' | grep -E 'Seconds_Behind|Exec_Source_Log_Pos|Read_Source_Log_Pos'"
```
Common causes and solutions:
A) Slow queries on the replica:
```sql
-- Enable the slow query log on the replica
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/replica-slow.log';
```
Identify the slow queries and optimize them with proper indexing.
B) Single-threaded SQL apply (MySQL 5.6 and earlier):
Upgrade to MySQL 5.7+ or 8.0+ and enable multi-threaded replication:
```sql
-- MySQL 8.0.26+ (older versions use the slave_parallel_* variable names)
STOP REPLICA;
SET GLOBAL replica_parallel_workers = 4;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
START REPLICA;
```
C) Hardware bottleneck: The replica disk I/O is saturated. Monitor with iostat -x 1 and consider upgrading to SSDs or distributing read load across multiple replicas.
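For continuous alerting rather than interactive watching, the same status output can feed a threshold check. A sketch (the `lag_check` helper and the 30-second default are arbitrary choices, not MySQL conventions):

```shell
# Hypothetical alert helper: reads `SHOW REPLICA STATUS\G` output on stdin
# and reports whether Seconds_Behind_Source exceeds a threshold (seconds).
lag_check() {
  threshold="${1:-30}"
  lag=$(awk -F': ' '/Seconds_Behind_Source:/ { print $2 }')
  lag="${lag:-NULL}"   # field missing: treat the same as NULL
  if [ "$lag" = "NULL" ]; then
    echo "ALERT lag=NULL (replication stopped?)"
  elif [ "$lag" -gt "$threshold" ]; then
    echo "ALERT lag=$lag"
  else
    echo "OK lag=$lag"
  fi
}

# Typical cron usage:
#   mysql -e 'SHOW REPLICA STATUS\G' | lag_check 30
```

Note that `NULL` is treated as an alert: it usually means the SQL thread is not running at all, which is worse than lag.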
3. Resolve Duplicate Key Errors (Error 1062)
This error occurs when the replica already has a row that a replicated INSERT is trying to create. Typical causes: manual writes on the replica, or a restored backup that included transactions that will also arrive via replication.
For binary-log replication:
```sql
-- Skip the single problematic transaction
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;
```
For GTID-based replication:
```sql
-- Identify the failing GTID from SHOW REPLICA STATUS -> Last_SQL_Error
-- Example: Server UUID = 3E11FA47-71CA-11E1-9E33-C80AA9429562, GTID = 3E11FA47-71CA-11E1-9E33-C80AA9429562:42
STOP REPLICA;
SET GTID_NEXT = '3E11FA47-71CA-11E1-9E33-C80AA9429562:42';
BEGIN; COMMIT; -- inject an empty transaction
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;
```
Warning: Skipping transactions can introduce data drift. Always validate with `pt-table-checksum` afterward.
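When several transactions fail in sequence, typing each GTID by hand gets tedious. A sketch that scripts the empty-transaction injection (both helper names are hypothetical; it assumes the failing GTID appears somewhere in the error text you feed it):

```shell
# Hypothetical helpers for scripting the empty-transaction skip.

# Pull the first uuid:txn GTID out of an error message on stdin.
extract_gtid() {
  grep -oE '[0-9a-fA-F-]{36}:[0-9]+' | head -n 1
}

# Print the SQL that injects an empty transaction for a given GTID.
skip_gtid_sql() {
  printf "STOP REPLICA;\nSET GTID_NEXT = '%s';\nBEGIN; COMMIT;\nSET GTID_NEXT = 'AUTOMATIC';\nSTART REPLICA;\n" "$1"
}

# Typical usage (review the generated SQL before piping it to mysql):
#   gtid=$(echo "$last_sql_error" | extract_gtid)
#   skip_gtid_sql "$gtid"
```

Scripting the skip does not make it safer, only faster; the data-drift warning above still applies to every transaction skipped this way.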
4. Resolve Missing Row Errors (Error 1032)
When the replica cannot find the row referenced by a replicated UPDATE or DELETE:
```sql
-- Check what the primary has
-- On PRIMARY:
SELECT * FROM mydb.mytable WHERE id = 12345;
-- On REPLICA: compare
SELECT * FROM mydb.mytable WHERE id = 12345;
```
If the row genuinely should exist on the replica, manually insert it, then restart replication. If the row was intentionally deleted on the replica (a sign of data drift), skip the transaction as shown in section 3.
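One way to perform the manual insert is to copy the values from the primary and apply them locally without writing the fix into the replica's own binary log. A sketch (table and column names are hypothetical; if `super_read_only` is enabled you must lift it for the session first):

```sql
-- On the REPLICA (hypothetical table/columns; values copied from the primary)
SET SESSION sql_log_bin = 0;  -- keep the repair out of this server's binlog
INSERT INTO mydb.mytable (id, name) VALUES (12345, 'value-from-primary');
SET SESSION sql_log_bin = 1;
START REPLICA;
```

Disabling `sql_log_bin` matters on intermediate replicas: without it, the repair statement would replicate further downstream and could itself cause duplicate key errors.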
5. Recover from Relay Log Corruption
If the relay log is corrupt, the safest approach is to re-fetch it from the primary:
```sql
STOP REPLICA;
RESET REPLICA;
-- For GTID-based replication, the replica will automatically resume from the correct position:
START REPLICA;
```
For binary-log position-based replication, you need to specify the correct position:
```sql
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE = 'mysql-bin.000042',
  MASTER_LOG_POS = 154;
START SLAVE;
```
Find the correct file and position from SHOW MASTER STATUS on the primary or from your monitoring system.
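Grabbing those coordinates can be scripted. A sketch (the `primary_coords` helper is hypothetical; it expects the tab-separated output of `mysql -e 'SHOW MASTER STATUS'` on stdin):

```shell
# Hypothetical helper: turn `SHOW MASTER STATUS` output into a
# CHANGE MASTER TO clause fragment.
primary_coords() {
  # Row 1 is the header; row 2 holds File and Position in columns 1 and 2.
  awk -F'\t' 'NR == 2 { printf "MASTER_LOG_FILE = '\''%s'\'', MASTER_LOG_POS = %s\n", $1, $2 }'
}

# Typical usage:
#   mysql -h primary.example.com -e 'SHOW MASTER STATUS' | primary_coords
```

Be aware that the coordinates are only valid if captured while the relevant position is still consistent with what the replica has applied; this is why GTID-based setups, which need no manual positioning, are easier to recover.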
6. Fix GTID Divergence
GTID divergence happens when the replica has transactions not present on the primary (often from direct writes on the replica). Diagnose with:
```sql
-- On the REPLICA:
SELECT @@gtid_executed;
-- Compare with the primary's gtid_executed
```
The solution depends on severity:
- Minor divergence (a few transactions): Inject empty transactions on the primary for the extra GTIDs, or purge them from the replica using `RESET MASTER` (destroys all local binary logs; use only if the replica is not a source for other replicas).
- Major divergence: Rebuild the replica from a fresh backup of the primary using `mysqldump --single-transaction --source-data=2` or `xtrabackup`.
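To see exactly which GTIDs are errant, MySQL's built-in `GTID_SUBTRACT()` function can diff the two sets. A sketch (the primary's set shown here is a hypothetical example; substitute the real `@@gtid_executed` value taken from the primary):

```sql
-- On the REPLICA: GTIDs executed locally but absent from the primary's set
SELECT GTID_SUBTRACT(
  @@global.gtid_executed,
  '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100'  -- primary's @@gtid_executed
) AS errant_gtids;
-- An empty result means no errant transactions.
```

Running this before any failover is a cheap way to confirm whether the replica is safe to promote.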
Prevention and Best Practices
- Make replicas read-only: Set `read_only = 1` and `super_read_only = 1` on all replicas to prevent accidental writes.
- Use GTID replication: Greatly simplifies failover, re-pointing replicas, and skipping transactions compared to traditional binary-log positioning.
- Monitor replication lag continuously: Use tools like Percona Monitoring and Management (PMM), Prometheus with `mysqld_exporter`, or Datadog’s MySQL integration.
- Run `pt-table-checksum` weekly: Proactively detect data drift before it causes application-level issues.
- Automate failover carefully: Tools like Orchestrator, MHA, or MySQL InnoDB Cluster / Group Replication provide automated failover, but test them thoroughly in staging before relying on them in production.
- Use crash-safe replication settings:

```ini
# my.cnf on replicas
relay_log_recovery = ON
relay_log_info_repository = TABLE
master_info_repository = TABLE
```
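After restarting the replica, you can confirm the crash-safe settings took effect by querying the corresponding system variables:

```sql
SELECT @@relay_log_recovery,
       @@relay_log_info_repository,
       @@master_info_repository;
```

The repository variables should both report `TABLE`, meaning replication metadata survives a crash inside InnoDB rather than in loose files.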
Gotchas and Edge Cases
- `binlog_format` mismatch: If the primary uses `ROW` format but the replica expects `STATEMENT`, you will see cryptic errors. Ensure all servers use the same `binlog_format` (preferably `ROW`).
- Schema drift: DDL changes applied only on the primary or only on the replica will cause replication to break on the next statement that touches the mismatched schema. Always apply DDL via replication, never directly on replicas.
- Filtered replication pitfalls: `replicate-do-db` and `replicate-ignore-db` can behave unexpectedly with cross-database queries. Use `replicate-wild-do-table` for more predictable filtering.
- Clock skew: `Seconds_Behind_Master` relies on timestamps. If the primary and replica clocks are out of sync, the lag metric will be misleading. Use NTP on all servers.
Summary
- MySQL replication failures typically boil down to lag, duplicate/missing row errors, relay log corruption, or GTID divergence.
- Always begin diagnosis with `SHOW REPLICA STATUS` and inspect the `Replica_IO_Running`, `Replica_SQL_Running`, and `Last_Error` fields.
- Use `sql_slave_skip_counter` or empty GTID transactions to move past individual errors, but always validate data consistency afterward.
- Prevent issues by enforcing `read_only` on replicas, using GTID replication, enabling crash-safe settings, and monitoring continuously.
- When in doubt, rebuild the replica from a fresh backup rather than risking data drift from multiple skipped transactions.