TL;DR — Quick Summary
Troubleshoot MySQL replication errors including slave lag, duplicate key conflicts, and broken relay logs with step-by-step diagnostic commands.
Understanding MySQL Replication
MySQL replication allows data from one database server (the primary, historically called the “master”) to be automatically copied to one or more additional servers (replicas, historically called “slaves”). This mechanism provides critical benefits: high availability, read scaling, geographic distribution, and backup offloading.
Despite its maturity, MySQL replication is far from “set it and forget it.” Configuration drifts, schema changes, hardware discrepancies, and network instabilities conspire to produce replication failures that can cascade into data inconsistencies, application errors, and even downtime.
This guide covers the most common MySQL replication problems, their root causes, and step-by-step solutions for both traditional binary-log and GTID-based replication topologies.
Prerequisites
- A working MySQL primary/replica setup (MySQL 5.7+ or MySQL 8.0+).
- Shell access to both the primary and replica servers.
- Sufficient privileges (`SUPER` or `REPLICATION CLIENT`) to run `SHOW SLAVE STATUS`.
- Familiarity with basic SQL and Linux command-line tools.
- Optional: Percona Toolkit installed for advanced diagnostics.
Common Replication Problems
The following five failure modes account for the vast majority of MySQL replication issues:
- Replica Lag (Seconds_Behind_Master > 0): The replica cannot apply binary log events as fast as the primary produces them.
- Duplicate Key Errors (Error 1062): A row that already exists on the replica is being inserted again by a replicated transaction.
- Missing Row Errors (Error 1032): A replicated `UPDATE` or `DELETE` references a row that does not exist on the replica.
- Relay Log Corruption: The relay log file on the replica becomes unreadable due to disk errors, crashes, or improper shutdowns.
- GTID Divergence: In GTID-based replication, the replica has executed transactions (errant GTIDs) that are not present on the primary, so the two servers no longer agree on a shared transaction history.
Step-by-Step Solution
1. Assess Current Replication Status
Always start by inspecting the replica’s health. On the replica server, run:
```sql
SHOW REPLICA STATUS\G
```
For MySQL versions prior to 8.0.22, use `SHOW SLAVE STATUS\G` instead.
Key fields to inspect:
| Field | Healthy Value | Meaning |
|---|---|---|
| Replica_IO_Running | Yes | The I/O thread is connected to the primary and reading binary logs |
| Replica_SQL_Running | Yes | The SQL thread is applying events from the relay log |
| Seconds_Behind_Source | 0 | No replication lag |
| Last_Error | (empty) | No errors encountered |
| Retrieved_Gtid_Set | matches primary | All GTIDs have been fetched |
| Executed_Gtid_Set | matches Retrieved | All fetched GTIDs have been applied |
If either thread shows No, focus on the corresponding error fields: Last_IO_Error for I/O issues, Last_SQL_Error for apply-side failures.
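When you manage more than one replica, checking these fields by eye gets tedious. A minimal sketch of an automated check (the `check_replica` helper name is hypothetical; it assumes the `\G`-formatted status output is piped in):

```shell
# Hypothetical helper: reads `SHOW REPLICA STATUS\G` output on stdin and
# prints OK when both threads run and no error is recorded, FAIL otherwise.
check_replica() {
  awk -F': ' '
    /Replica_IO_Running:/  { io  = $2 }
    /Replica_SQL_Running:/ { sql = $2 }
    /Last_Error:/          { err = $2 }
    END {
      if (io == "Yes" && sql == "Yes" && err == "")
        print "OK"
      else
        printf "FAIL io=%s sql=%s err=%s\n", io, sql, err
    }'
}

# Typical usage against a live replica:
#   mysql -e 'SHOW REPLICA STATUS\G' | check_replica
```

Wiring this into cron or a monitoring agent gives an early warning before users notice stale reads.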
2. Fix Replica Lag
Replica lag means the SQL thread cannot keep up with incoming binary log events. Diagnose with:
```shell
# Watch lag in real-time
watch -n 1 "mysql -e 'SHOW REPLICA STATUS\G' | grep -E 'Seconds_Behind|Exec_Source_Log_Pos|Read_Source_Log_Pos'"
```
Common causes and solutions:
A) Slow queries on the replica:
```sql
-- Enable the slow query log on the replica
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/replica-slow.log';
```
Identify the slow queries and optimize them with proper indexing.
B) Single-threaded SQL apply (MySQL 5.6 and earlier):
Upgrade to MySQL 5.7+ or 8.0+ and enable multi-threaded replication:
```sql
-- MySQL 8.0.26+ (older versions use the slave_parallel_* variable names)
STOP REPLICA;
SET GLOBAL replica_parallel_workers = 4;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
START REPLICA;
```
C) Hardware bottleneck: The replica disk I/O is saturated. Monitor with iostat -x 1 and consider upgrading to SSDs or distributing read load across multiple replicas.
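For continuous alerting rather than interactive watching, the same status output can feed a threshold check. A sketch (the `lag_check` helper and the 30-second default are arbitrary choices, not MySQL conventions):

```shell
# Hypothetical alert helper: reads `SHOW REPLICA STATUS\G` output on stdin
# and reports whether Seconds_Behind_Source exceeds a threshold (seconds).
lag_check() {
  threshold="${1:-30}"
  lag=$(awk -F': ' '/Seconds_Behind_Source:/ { print $2 }')
  lag="${lag:-NULL}"   # field missing: treat the same as NULL
  if [ "$lag" = "NULL" ]; then
    echo "ALERT lag=NULL (replication stopped?)"
  elif [ "$lag" -gt "$threshold" ]; then
    echo "ALERT lag=$lag"
  else
    echo "OK lag=$lag"
  fi
}

# Typical cron usage:
#   mysql -e 'SHOW REPLICA STATUS\G' | lag_check 30
```

Note that `NULL` is treated as an alert: it usually means the SQL thread is not running at all, which is worse than lag.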
3. Resolve Duplicate Key Errors (Error 1062)
This error occurs when the replica already has a row that a replicated INSERT is trying to create. Typical causes: manual writes on the replica, or a restored backup that included transactions that will also arrive via replication.
For binary-log replication:
```sql
-- Skip the single problematic transaction
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;
```
For GTID-based replication:
```sql
-- Identify the failing GTID from SHOW REPLICA STATUS -> Last_SQL_Error
-- Example: Server UUID = 3E11FA47-71CA-11E1-9E33-C80AA9429562, GTID = 3E11FA47-71CA-11E1-9E33-C80AA9429562:42
STOP REPLICA;
SET GTID_NEXT = '3E11FA47-71CA-11E1-9E33-C80AA9429562:42';
BEGIN; COMMIT; -- inject an empty transaction
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;
```
Warning: Skipping transactions can introduce data drift. Always validate with `pt-table-checksum` afterward.
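When several transactions fail in sequence, typing each GTID by hand gets tedious. A sketch that scripts the empty-transaction injection (both helper names are hypothetical; it assumes the failing GTID appears somewhere in the error text you feed it):

```shell
# Hypothetical helpers for scripting the empty-transaction skip.

# Pull the first uuid:txn GTID out of an error message on stdin.
extract_gtid() {
  grep -oE '[0-9a-fA-F-]{36}:[0-9]+' | head -n 1
}

# Print the SQL that injects an empty transaction for a given GTID.
skip_gtid_sql() {
  printf "STOP REPLICA;\nSET GTID_NEXT = '%s';\nBEGIN; COMMIT;\nSET GTID_NEXT = 'AUTOMATIC';\nSTART REPLICA;\n" "$1"
}

# Typical usage (review the generated SQL before piping it to mysql):
#   gtid=$(echo "$last_sql_error" | extract_gtid)
#   skip_gtid_sql "$gtid"
```

Scripting the skip does not make it safer, only faster; the data-drift warning above still applies to every transaction skipped this way.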
4. Resolve Missing Row Errors (Error 1032)
When the replica cannot find the row referenced by a replicated UPDATE or DELETE:
```sql
-- Check what the primary has
-- On PRIMARY:
SELECT * FROM mydb.mytable WHERE id = 12345;
-- On REPLICA: compare
SELECT * FROM mydb.mytable WHERE id = 12345;
```
If the row genuinely should exist on the replica, manually insert it, then restart replication. If the row was intentionally deleted on the replica (a sign of data drift), skip the transaction as shown in section 3.
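One way to perform the manual insert is to copy the values from the primary and apply them locally without writing the fix into the replica's own binary log. A sketch (table and column names are hypothetical; if `super_read_only` is enabled you must lift it for the session first):

```sql
-- On the REPLICA (hypothetical table/columns; values copied from the primary)
SET SESSION sql_log_bin = 0;  -- keep the repair out of this server's binlog
INSERT INTO mydb.mytable (id, name) VALUES (12345, 'value-from-primary');
SET SESSION sql_log_bin = 1;
START REPLICA;
```

Disabling `sql_log_bin` matters on intermediate replicas: without it, the repair statement would replicate further downstream and could itself cause duplicate key errors.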
5. Recover from Relay Log Corruption
If the relay log is corrupt, the safest approach is to re-fetch it from the primary:
```sql
STOP REPLICA;
RESET REPLICA;
-- For GTID-based replication, the replica will automatically resume from the correct position:
START REPLICA;
```
For binary-log position-based replication, you need to specify the correct position:
```sql
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE = 'mysql-bin.000042',
  MASTER_LOG_POS = 154;
START SLAVE;
```
Find the correct file and position from SHOW MASTER STATUS on the primary or from your monitoring system.
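Grabbing those coordinates can be scripted. A sketch (the `primary_coords` helper is hypothetical; it expects the tab-separated output of `mysql -e 'SHOW MASTER STATUS'` on stdin):

```shell
# Hypothetical helper: turn `SHOW MASTER STATUS` output into a
# CHANGE MASTER TO clause fragment.
primary_coords() {
  # Row 1 is the header; row 2 holds File and Position in columns 1 and 2.
  awk -F'\t' 'NR == 2 { printf "MASTER_LOG_FILE = '\''%s'\'', MASTER_LOG_POS = %s\n", $1, $2 }'
}

# Typical usage:
#   mysql -h primary.example.com -e 'SHOW MASTER STATUS' | primary_coords
```

Be aware that the coordinates are only valid if captured while the relevant position is still consistent with what the replica has applied; this is why GTID-based setups, which need no manual positioning, are easier to recover.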
6. Fix GTID Divergence
GTID divergence happens when the replica has transactions not present on the primary (often from direct writes on the replica). Diagnose with:
```sql
-- On the REPLICA:
SELECT @@gtid_executed;
-- Compare with the primary's gtid_executed
```
The solution depends on severity:
- Minor divergence (a few transactions): Inject empty transactions on the primary for the extra GTIDs, or purge them from the replica using `RESET MASTER` (destroys all local binary logs; use only if the replica is not a source for other replicas).
- Major divergence: Rebuild the replica from a fresh backup of the primary using `mysqldump --single-transaction --source-data=2` or `xtrabackup`.
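To see exactly which GTIDs are errant, MySQL's built-in `GTID_SUBTRACT()` function can diff the two sets. A sketch (the primary's set shown here is a hypothetical example; substitute the real `@@gtid_executed` value taken from the primary):

```sql
-- On the REPLICA: GTIDs executed locally but absent from the primary's set
SELECT GTID_SUBTRACT(
  @@global.gtid_executed,
  '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100'  -- primary's @@gtid_executed
) AS errant_gtids;
-- An empty result means no errant transactions.
```

Running this before any failover is a cheap way to confirm whether the replica is safe to promote.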
Prevention and Best Practices
- Make replicas read-only: Set `read_only = 1` and `super_read_only = 1` on all replicas to prevent accidental writes.
- Use GTID replication: Greatly simplifies failover, re-pointing replicas, and skipping transactions compared to traditional binary-log positioning.
- Monitor replication lag continuously: Use tools like Percona Monitoring and Management (PMM), Prometheus with `mysqld_exporter`, or Datadog’s MySQL integration.
- Run `pt-table-checksum` weekly: Proactively detect data drift before it causes application-level issues.
- Automate failover carefully: Tools like Orchestrator, MHA, or MySQL InnoDB Cluster / Group Replication provide automated failover, but test them thoroughly in staging before relying on them in production.
- Use crash-safe replication settings:

```ini
# my.cnf on replicas
relay_log_recovery = ON
relay_log_info_repository = TABLE
master_info_repository = TABLE
```
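After restarting the replica, you can confirm the crash-safe settings took effect by querying the corresponding system variables:

```sql
SELECT @@relay_log_recovery,
       @@relay_log_info_repository,
       @@master_info_repository;
```

The repository variables should both report `TABLE`, meaning replication metadata survives a crash inside InnoDB rather than in loose files.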
Gotchas and Edge Cases
- `binlog_format` mismatch: If the primary uses `ROW` format but the replica expects `STATEMENT`, you will see cryptic errors. Ensure all servers use the same `binlog_format` (preferably `ROW`).
- Schema drift: DDL changes applied only on the primary or only on the replica will cause replication to break on the next statement that touches the mismatched schema. Always apply DDL via replication, never directly on replicas.
- Filtered replication pitfalls: `replicate-do-db` and `replicate-ignore-db` can behave unexpectedly with cross-database queries. Use `replicate-wild-do-table` for more predictable filtering.
- Clock skew: `Seconds_Behind_Master` relies on timestamps. If the primary and replica clocks are out of sync, the lag metric will be misleading. Use NTP on all servers.
Summary
- MySQL replication failures typically boil down to lag, duplicate/missing row errors, relay log corruption, or GTID divergence.
- Always begin diagnosis with `SHOW REPLICA STATUS` and inspect the `Replica_IO_Running`, `Replica_SQL_Running`, and `Last_Error` fields.
- Use `sql_slave_skip_counter` or empty GTID transactions to move past individual errors, but always validate data consistency afterward.
- Prevent issues by enforcing `read_only` on replicas, using GTID replication, enabling crash-safe settings, and monitoring continuously.
- When in doubt, rebuild the replica from a fresh backup rather than risking data drift from multiple skipped transactions.