MHA: The Impact of One Slave Going Down - 白红宇


Published: 2019-05-02


Environment

IP	Role	Notes	mha4mysql-node	mha4mysql-manager
192.168.98.11	master	read/write	√
192.168.98.10	slave	read-only	√
192.168.98.12	slave	read-only	√
192.168.98.13	manager node	N/A	√	√

A node is down before the manager starts

Manually stop mysqld on one slave, 192.168.98.10, then try to start masterha_manager:

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf

Startup fails. The log contains the following (note that the reported error is "None of slaves can be master": the only live slave, 192.168.98.12, has no_master set):

Fri Feb 28 14:47:58 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 14:47:59 2020 - [info] GTID failover mode = 1
Fri Feb 28 14:47:59 2020 - [info] Dead Servers:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 14:47:59 2020 - [info] Alive Servers:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 14:47:59 2020 - [info] Alive Slaves:
Fri Feb 28 14:47:59 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 14:47:59 2020 - [info]     GTID ON
Fri Feb 28 14:47:59 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 14:47:59 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 14:47:59 2020 - [info] Checking slave configurations..
Fri Feb 28 14:47:59 2020 - [info] Checking replication filtering settings..
Fri Feb 28 14:47:59 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 14:47:59 2020 - [info]  Replication filtering check ok.
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_manager line 50.
Fri Feb 28 14:47:59 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 14:47:59 2020 - [info] Got exit code 1 (Not master dead).

Replication status should be checked with masterha_check_repl first:

# masterha_check_repl --conf=/etc/masterha/conf/cls_all.cnf --global_conf=/etc/masterha/conf/masterha_default.cnf
Fri Feb 28 15:27:24 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:27:24 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:27:24 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:27:24 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:27:25 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:27:25 2020 - [info] Dead Servers:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:27:25 2020 - [info] Alive Servers:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:27:25 2020 - [info] Alive Slaves:
Fri Feb 28 15:27:25 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:27:25 2020 - [info]     GTID ON
Fri Feb 28 15:27:25 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:27:25 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:27:25 2020 - [info] Checking slave configurations..
Fri Feb 28 15:27:25 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:27:25 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:27:25 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_check_repl line 48.
Fri Feb 28 15:27:25 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:27:25 2020 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!

The masterha_manager documentation at https://github.com/yoshinorim/mha4mysql-manager/wiki/masterha_manager says:

--ignore_fail_on_start

By default, master monitoring (not failover) process stops if one or more slaves are down, regardless of “ignore_fail” parameter setting. By setting --ignore_fail_on_start, master monitoring does not stop if ignore_fail marked slaves are down.

In other words, if ignore_fail=1 is set for 192.168.98.10 in the configuration file, adding --ignore_fail_on_start lets masterha_manager start. Without ignore_fail=1 in the configuration file, masterha_manager cannot start even when --ignore_fail_on_start is given.

With ignore_fail=1

# cat /etc/masterha/conf/cls_all.cnf
[server default]
#workdir on the management server
manager_workdir=/masterha/cls_all/
manager_log=/masterha/cls_all/manager.log
#workdir on the node for mysql server
master_binlog_dir=/data/mysql_3306/data/
#script called to move the VIP on automatic failover
master_ip_failover_script=/etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100
#script called on manual switchover
master_ip_online_change_script=/etc/masterha/scripts/master_ip_online_change_vip --vip=192.168.98.100
#check master availability
secondary_check_script=masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12
[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1
[server2]
hostname=192.168.98.11
candidate_master=1
[server3]
hostname=192.168.98.12
# no_master=1
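To double-check which servers carry candidate_master, ignore_fail or no_master before starting the manager, the configuration can be read programmatically. The following is a minimal Python sketch and not part of MHA: it assumes the file is plain INI that Python's configparser can read, and the names SAMPLE_CNF and server_flags are illustrative only.

```python
import configparser

# Trimmed copy of the cls_all.cnf shown above (sample data only).
SAMPLE_CNF = """\
[server default]
manager_workdir=/masterha/cls_all/
manager_log=/masterha/cls_all/manager.log

[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1

[server2]
hostname=192.168.98.11
candidate_master=1

[server3]
hostname=192.168.98.12
# no_master=1
"""

FLAGS = ("candidate_master", "ignore_fail", "no_master")


def server_flags(cnf_text):
    """Map each server's hostname to the flags that are set to 1."""
    # interpolation=None so that a literal '%' in a value is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cnf_text)
    result = {}
    for section in parser.sections():
        if section == "server default":
            continue  # global defaults, not a MySQL server
        host = parser[section]["hostname"]
        result[host] = {
            flag: parser[section].get(flag, "0") == "1" for flag in FLAGS
        }
    return result


if __name__ == "__main__":
    for host, flags in server_flags(SAMPLE_CNF).items():
        print(host, [name for name, on in flags.items() if on])
```

A commented-out line such as "# no_master=1" is treated as a comment, so the flag is reported as unset.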

Startup succeeds:

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start
Fri Feb 28 15:59:37 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:59:38 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:59:38 2020 - [info] Dead Servers:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:59:38 2020 - [info] Alive Servers:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:59:38 2020 - [info] Alive Slaves:
Fri Feb 28 15:59:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:59:38 2020 - [info]     GTID ON
Fri Feb 28 15:59:38 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:59:38 2020 - [info] Checking slave configurations..
Fri Feb 28 15:59:38 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:59:38 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:59:38 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:59:38 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Feb 28 15:59:38 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Feb 28 15:59:39 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Fri Feb 28 15:59:39 2020 - [info] 
192.168.98.11(192.168.98.11:3306) (current master)
 +--192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:59:39 2020 - [info] Checking master_ip_failover_script status:
Fri Feb 28 15:59:39 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --command=status --ssh_user=root --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 
Fri Feb 28 15:59:39 2020 - [info]  OK.
Fri Feb 28 15:59:39 2020 - [warning] shutdown_script is not defined.
Fri Feb 28 15:59:39 2020 - [info] Set master ping interval 3 seconds.
Fri Feb 28 15:59:39 2020 - [info] Set secondary check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12
Fri Feb 28 15:59:39 2020 - [info] Starting ping health check on 192.168.98.11(192.168.98.11:3306)..
Fri Feb 28 15:59:39 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Without ignore_fail=1

# cat /etc/masterha/conf/cls_all.cnf
...
[server1]
hostname=192.168.98.10
candidate_master=1
# ignore_fail=1
[server2]
hostname=192.168.98.11
candidate_master=1
[server3]
hostname=192.168.98.12
# no_master=1

Startup fails:

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start
Fri Feb 28 15:58:57 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:58:58 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:58:58 2020 - [info] Dead Servers:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:58:58 2020 - [info] Alive Servers:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:58:58 2020 - [info] Alive Slaves:
Fri Feb 28 15:58:58 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:58:58 2020 - [info]     GTID ON
Fri Feb 28 15:58:58 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:58:58 2020 - [info] Checking slave configurations..
Fri Feb 28 15:58:58 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:58:58 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:58:58 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:58:58 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492]  Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 402.
Fri Feb 28 15:58:58 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:58:58 2020 - [info] Got exit code 1 (Not master dead).

Also, even with ignore_fail=1 set, startup fails if the only remaining slave, 192.168.98.12, has no_master=1:

# cat /etc/masterha/conf/cls_all.cnf
...
[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1
[server2]
hostname=192.168.98.11
candidate_master=1
[server3]
hostname=192.168.98.12
no_master=1

The error is "None of slaves can be master":

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --ignore_fail_on_start
Fri Feb 28 15:55:14 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Feb 28 15:55:16 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:55:16 2020 - [info] Dead Servers:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:55:16 2020 - [info] Alive Servers:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:55:16 2020 - [info] Alive Slaves:
Fri Feb 28 15:55:16 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:55:16 2020 - [info]     GTID ON
Fri Feb 28 15:55:16 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:55:16 2020 - [info] Current Alive Master: 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:55:16 2020 - [info] Checking slave configurations..
Fri Feb 28 15:55:16 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:55:16 2020 - [info]  binlog_do_db= , binlog_ignore_db= 
Fri Feb 28 15:55:16 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln364] None of slaves can be master. Check failover configuration file or log-bin settings in my.cnf
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_manager line 50.
Fri Feb 28 15:55:16 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Feb 28 15:55:16 2020 - [info] Got exit code 1 (Not master dead).
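The startup attempts above follow a pattern that can be written down as a small decision function. This is only a Python model of the behaviour observed in these logs (MHA 0.58), not MHA's actual code. The function name, the dictionary layout and the check order are assumptions read off the log output, and the case of ignore_fail=1 without --ignore_fail_on_start is taken from the quoted documentation rather than from a test run.

```python
def can_start_monitoring(servers, master, dead_slaves, ignore_fail_on_start):
    """Model whether masterha_manager would start monitoring.

    servers:      hostname -> dict with optional "ignore_fail" / "no_master"
    master:       hostname of the current master
    dead_slaves:  set of slave hostnames that are unreachable
    Returns (ok, reason).
    """
    alive_slaves = [
        host for host in servers if host != master and host not in dead_slaves
    ]
    # Seen first in the logs: some alive slave must be allowed to become master.
    if not any(not servers[host].get("no_master") for host in alive_slaves):
        return False, "None of slaves can be master"
    # Seen second: a dead slave is tolerated only when it is marked
    # ignore_fail AND the manager was started with --ignore_fail_on_start.
    for host in sorted(dead_slaves):
        if not (ignore_fail_on_start and servers[host].get("ignore_fail")):
            return False, f"Server {host} is dead, but must be alive"
    return True, "monitoring starts"


def cluster(ignore_fail, no_master):
    """Build the three-node layout used in this article."""
    return {
        "192.168.98.10": {"ignore_fail": ignore_fail},
        "192.168.98.11": {},
        "192.168.98.12": {"no_master": no_master},
    }


MASTER = "192.168.98.11"
DEAD = {"192.168.98.10"}

if __name__ == "__main__":
    for ignore_fail, no_master in [(True, False), (False, False), (True, True)]:
        verdict = can_start_monitoring(
            cluster(ignore_fail, no_master), MASTER, DEAD, True
        )
        print(f"ignore_fail={ignore_fail} no_master={no_master}: {verdict}")
```

Only the first combination (ignore_fail=1 on the dead slave, no no_master on the surviving one) is reported as startable, which matches the three runs with --ignore_fail_on_start shown above.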

A node goes down while the manager is running

If a slave goes down while masterha_manager is running, masterha_manager does not appear to notice: the process does not exit and the log shows no error.

masterha_check_status still reports a healthy state:

# masterha_check_status --conf=/etc/masterha/conf/cls_all.cnf --global_conf=/etc/masterha/conf/masterha_default.cnf
cls_all (pid:88464) is running(0:PING_OK), master:192.168.98.11
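When the status line is consumed by a script, it helps to pull out its fields explicitly, keeping in mind that PING_OK here only describes the master ping and, as observed above, says nothing about the slaves. The following Python sketch parses the one output form shown in this article; the regular expression and the name parse_check_status are assumptions for illustration, and any other form of output simply yields None.

```python
import re

# Matches the "is running" form shown above, e.g.
# cls_all (pid:88464) is running(0:PING_OK), master:192.168.98.11
STATUS_RE = re.compile(
    r"^(?P<app>\S+) \(pid:(?P<pid>\d+)\) is running"
    r"\((?P<code>\d+):(?P<state>\w+)\), master:(?P<master>\S+)$"
)


def parse_check_status(line):
    """Return the fields of a masterha_check_status line, or None."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    info["code"] = int(info["code"])
    return info


if __name__ == "__main__":
    sample = "cls_all (pid:88464) is running(0:PING_OK), master:192.168.98.11"
    print(parse_check_status(sample))
```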

A manual switchover, however, fails:

# /usr/local/bin/masterha_master_switch --global_conf=/etc/masterha/conf/masterha_default.cnf --conf=/etc/masterha/conf/cls_all.cnf --master_state=alive --new_master_host=192.168.98.12 --new_master_port=3306 --orig_master_is_new_slave --interactive=0
Fri Feb 28 15:33:34 2020 - [info] MHA::MasterRotate version 0.58.
Fri Feb 28 15:33:34 2020 - [info] Starting online master switch..
Fri Feb 28 15:33:34 2020 - [info] 
Fri Feb 28 15:33:34 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 15:33:34 2020 - [info] 
Fri Feb 28 15:33:34 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:33:34 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:33:34 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:33:35 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln94] Switching master should not be started if one or more servers is down.
Fri Feb 28 15:33:35 2020 - [info] Dead Servers:
Fri Feb 28 15:33:35 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53.

The "Dead Servers:" section lists the servers that are down.
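Because the "Dead Servers:" section has a regular shape, it can be extracted from a log with a few lines of Python. This is a sketch under assumptions: it expects one log entry per line in the "timestamp - [level] message" layout seen in this article, and the names LINE_RE, HOST_RE and dead_servers are illustrative, not anything shipped with MHA.

```python
import re

# "<timestamp> - [level]" optionally followed by "[file, lnNN]", then the message
LINE_RE = re.compile(r" - \[(\w+)\](?:\[[^\]]*\])? ?(.*)$")
# Indented "host(ip:port)" entry inside a server list
HOST_RE = re.compile(r"^\s+([\w.-]+)\(([\w.-]+):(\d+)\)")

SAMPLE_LOG = """\
Fri Feb 28 15:33:35 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln94] Switching master should not be started if one or more servers is down.
Fri Feb 28 15:33:35 2020 - [info] Dead Servers:
Fri Feb 28 15:33:35 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:33:35 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53.
"""


def dead_servers(log_text):
    """Collect the hostnames listed under any 'Dead Servers:' heading."""
    found = []
    in_section = False
    for line in log_text.splitlines():
        entry = LINE_RE.search(line)
        if entry is None:
            in_section = False
            continue
        message = entry.group(2)
        if message.strip() == "Dead Servers:":
            in_section = True
            continue
        if in_section:
            host = HOST_RE.match(message)
            if host is None:
                in_section = False  # first non-host line ends the section
            elif host.group(1) not in found:
                found.append(host.group(1))
    return found


if __name__ == "__main__":
    print(dead_servers(SAMPLE_LOG))
```

Hosts are de-duplicated because the same section is printed more than once during a failover.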

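If master 192.168.98.11 fails before 192.168.98.10 has been repaired, and 192.168.98.12 has no_master set, automatic failover fails because no server is available to become the new master.

# cat /etc/masterha/conf/cls_all.cnf
...
[server1]
hostname=192.168.98.10
candidate_master=1
ignore_fail=1
[server2]
hostname=192.168.98.11
candidate_master=1
[server3]
hostname=192.168.98.12
no_master=1

Shut down 192.168.98.11: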

Fri Feb 28 15:35:38 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=192.168.98.11;port=3306;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '192.168.98.11' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:38 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12  --user=root  --master_host=192.168.98.11  --master_ip=192.168.98.11  --master_port=3306 --master_user=mha --master_password=mha --ping_type=CONNECT
Fri Feb 28 15:35:38 2020 - [info] Executing SSH check script: exit 0
Fri Feb 28 15:35:39 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Monitoring server 192.168.98.11 is reachable, Master is not reachable from 192.168.98.11. OK.
Monitoring server 192.168.98.12 is reachable, Master is not reachable from 192.168.98.12. OK.
Fri Feb 28 15:35:40 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Feb 28 15:35:41 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:41 2020 - [warning] Connection failed 2 time(s)..
Fri Feb 28 15:35:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:44 2020 - [warning] Connection failed 3 time(s)..
Fri Feb 28 15:35:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 15:35:47 2020 - [warning] Connection failed 4 time(s)..
Fri Feb 28 15:35:47 2020 - [warning] Master is not reachable from health checker!
Fri Feb 28 15:35:47 2020 - [warning] Master 192.168.98.11(192.168.98.11:3306) is not reachable!
Fri Feb 28 15:35:47 2020 - [warning] SSH is reachable.
Fri Feb 28 15:35:47 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_all.cnf again, and trying to connect to all servers to check server status..
Fri Feb 28 15:35:47 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 15:35:47 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:35:47 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 15:35:48 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:35:48 2020 - [info] Dead Servers:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:48 2020 - [info] Alive Servers:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:35:48 2020 - [info] Alive Slaves:
Fri Feb 28 15:35:48 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:35:48 2020 - [info]     GTID ON
Fri Feb 28 15:35:48 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:48 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:35:48 2020 - [info] Checking slave configurations..
Fri Feb 28 15:35:48 2020 - [info] Checking replication filtering settings..
Fri Feb 28 15:35:48 2020 - [info]  Replication filtering check ok.
Fri Feb 28 15:35:48 2020 - [info] Master is down!
Fri Feb 28 15:35:48 2020 - [info] Terminating monitoring script.
Fri Feb 28 15:35:48 2020 - [info] Got exit code 20 (Master dead).
Fri Feb 28 15:35:48 2020 - [info] MHA::MasterFailover version 0.58.
Fri Feb 28 15:35:48 2020 - [info] Starting master failover.
Fri Feb 28 15:35:48 2020 - [info] 
Fri Feb 28 15:35:48 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 15:35:48 2020 - [info] 
Fri Feb 28 15:35:49 2020 - [info] GTID failover mode = 1
Fri Feb 28 15:35:49 2020 - [info] Dead Servers:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:49 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Feb 28 15:35:49 2020 - [info]  ok.
Fri Feb 28 15:35:49 2020 - [info] Alive Servers:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 15:35:49 2020 - [info] Alive Slaves:
Fri Feb 28 15:35:49 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 15:35:49 2020 - [info]     GTID ON
Fri Feb 28 15:35:49 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 15:35:49 2020 - [info]     Not candidate for the new Master (no_master is set)
Fri Feb 28 15:35:49 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492]  Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings.
Fri Feb 28 15:35:49 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/share/perl5/MHA/MasterFailover.pm line 269.

The key lines are:

Not candidate for the new Master (no_master is set)
Server 192.168.98.10(192.168.98.10:3306) is dead, but must be alive! Check server settings

The VIP is still on the original master, 192.168.98.11:

root@localhost 14:40:38 [(none)]> \! ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:98:28:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.11/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::cd5b:e71c:7a67:b391/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 15:35:04 [(none)]> shutdown;
Query OK, 0 rows affected (0.00 sec)
root@localhost 15:35:37 [(none)]> 2020-02-28T07:35:50.083534Z mysqld_safe mysqld from pid file /data/mysql_3306/run/mysql.pid ended
root@localhost 15:36:40 [(none)]> \! ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:98:28:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.11/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::cd5b:e71c:7a67:b391/64 scope link 
       valid_lft forever preferred_lft forever

192.168.98.12 is still a slave and does not hold the VIP:

root@localhost 15:35:32 [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: 192.168.98.11
                  Master_User: repler
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 2496
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 1354
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2496
              Relay_Log_Space: 1561
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'repler@192.168.98.11:3306' - retry-time: 60  retries: 1
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 98113306
                  Master_UUID: 68703597-592c-11ea-88b3-000c2998280b
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 200228 15:35:45
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
            Executed_Gtid_Set: 3a60f8c7-592c-11ea-8cb1-000c2973aaf0:1-6,68703597-592c-11ea-88b3-000c2998280b:1-4
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)
root@localhost 15:36:32 [(none)]> \! ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:96:c2:3a brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.12/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::ef03:3251:b4ed:204c/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 15:36:37 [(none)]>

If a candidate master is available, that is, 192.168.98.12 does not have no_master=1, automatic failover works:

Fri Feb 28 16:16:27 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=192.168.98.11;port=3306;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '192.168.98.11' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:27 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 192.168.98.11 -s 192.168.98.12  --user=root  --master_host=192.168.98.11  --master_ip=192.168.98.11  --master_port=3306 --master_user=mha --master_password=mha --ping_type=CONNECT
Fri Feb 28 16:16:27 2020 - [info] Executing SSH check script: exit 0
Fri Feb 28 16:16:28 2020 - [info] HealthCheck: SSH to 192.168.98.11 is reachable.
Monitoring server 192.168.98.11 is reachable, Master is not reachable from 192.168.98.11. OK.
Monitoring server 192.168.98.12 is reachable, Master is not reachable from 192.168.98.12. OK.
Fri Feb 28 16:16:28 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Feb 28 16:16:30 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:30 2020 - [warning] Connection failed 2 time(s)..
Fri Feb 28 16:16:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:33 2020 - [warning] Connection failed 3 time(s)..
Fri Feb 28 16:16:36 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.98.11' (111))
Fri Feb 28 16:16:36 2020 - [warning] Connection failed 4 time(s)..
Fri Feb 28 16:16:36 2020 - [warning] Master is not reachable from health checker!
Fri Feb 28 16:16:36 2020 - [warning] Master 192.168.98.11(192.168.98.11:3306) is not reachable!
Fri Feb 28 16:16:36 2020 - [warning] SSH is reachable.
Fri Feb 28 16:16:36 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_all.cnf again, and trying to connect to all servers to check server status..
Fri Feb 28 16:16:36 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Feb 28 16:16:36 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 16:16:36 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_all.cnf..
Fri Feb 28 16:16:37 2020 - [info] GTID failover mode = 1
Fri Feb 28 16:16:37 2020 - [info] Dead Servers:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:37 2020 - [info] Alive Servers:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:37 2020 - [info] Alive Slaves:
Fri Feb 28 16:16:37 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:37 2020 - [info]     GTID ON
Fri Feb 28 16:16:37 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:37 2020 - [info] Checking slave configurations..
Fri Feb 28 16:16:37 2020 - [info] Checking replication filtering settings..
Fri Feb 28 16:16:37 2020 - [info]  Replication filtering check ok.
Fri Feb 28 16:16:37 2020 - [info] Master is down!
Fri Feb 28 16:16:37 2020 - [info] Terminating monitoring script.
Fri Feb 28 16:16:37 2020 - [info] Got exit code 20 (Master dead).
Fri Feb 28 16:16:37 2020 - [info] MHA::MasterFailover version 0.58.
Fri Feb 28 16:16:37 2020 - [info] Starting master failover.
Fri Feb 28 16:16:37 2020 - [info] 
Fri Feb 28 16:16:37 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Feb 28 16:16:37 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] GTID failover mode = 1
Fri Feb 28 16:16:38 2020 - [info] Dead Servers:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.10(192.168.98.10:3306)
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:38 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Feb 28 16:16:38 2020 - [info]  ok.
Fri Feb 28 16:16:38 2020 - [info] Alive Servers:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:38 2020 - [info] Alive Slaves:
Fri Feb 28 16:16:38 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:38 2020 - [info]     GTID ON
Fri Feb 28 16:16:38 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:38 2020 - [info] Starting GTID based failover.
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Feb 28 16:16:38 2020 - [info] 
Fri Feb 28 16:16:38 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Feb 28 16:16:38 2020 - [info] Executing master IP deactivation script:
Fri Feb 28 16:16:38 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 --command=stopssh --ssh_user=root  
Disabling the VIP on old master: 192.168.98.11 
Fri Feb 28 16:16:39 2020 - [info]  done.
Fri Feb 28 16:16:39 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Feb 28 16:16:39 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000002:234
Fri Feb 28 16:16:39 2020 - [info] Retrieved Gtid Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Feb 28 16:16:39 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:39 2020 - [info]     GTID ON
Fri Feb 28 16:16:39 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:39 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000002:234
Fri Feb 28 16:16:39 2020 - [info] Retrieved Gtid Set: 68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Oldest slaves:
Fri Feb 28 16:16:39 2020 - [info]   192.168.98.12(192.168.98.12:3306)  Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Fri Feb 28 16:16:39 2020 - [info]     GTID ON
Fri Feb 28 16:16:39 2020 - [info]     Replicating from 192.168.98.11(192.168.98.11:3306)
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] Searching new master from slaves..
Fri Feb 28 16:16:39 2020 - [info]  Candidate masters from the configuration file:
Fri Feb 28 16:16:39 2020 - [info]  Non-candidate masters:
Fri Feb 28 16:16:39 2020 - [info] New master is 192.168.98.12(192.168.98.12:3306)
Fri Feb 28 16:16:39 2020 - [info] Starting master failover..
Fri Feb 28 16:16:39 2020 - [info] 
From:
192.168.98.11(192.168.98.11:3306) (current master)
 +--192.168.98.12(192.168.98.12:3306)
To:
192.168.98.12(192.168.98.12:3306) (new master)
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info]  Waiting all logs to be applied.. 
Fri Feb 28 16:16:39 2020 - [info]   done.
Fri Feb 28 16:16:39 2020 - [info] Getting new master's binlog name and position..
Fri Feb 28 16:16:39 2020 - [info]  mysql-bin.000001:2496
Fri Feb 28 16:16:39 2020 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.98.12', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Feb 28 16:16:39 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000001, 2496, 3a60f8c7-592c-11ea-8cb1-000c2973aaf0:1-6,68703597-592c-11ea-88b3-000c2998280b:1-4
Fri Feb 28 16:16:39 2020 - [info] Executing master IP activate script:
Fri Feb 28 16:16:39 2020 - [info]   /etc/masterha/scripts/master_ip_failover_vip --vip=192.168.98.100 --command=start --ssh_user=root --orig_master_host=192.168.98.11 --orig_master_ip=192.168.98.11 --orig_master_port=3306 --new_master_host=192.168.98.12 --new_master_ip=192.168.98.12 --new_master_port=3306 --new_master_user='mha'   --new_master_password=xxx
Enabling the VIP - 192.168.98.100 on the new master - 192.168.98.12 
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Feb 28 16:16:39 2020 - [info]  OK.
Fri Feb 28 16:16:39 2020 - [info] ** Finished master recovery successfully.
Fri Feb 28 16:16:39 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] All new slave servers recovered successfully.
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] * Phase 5: New master cleanup phase..
Fri Feb 28 16:16:39 2020 - [info] 
Fri Feb 28 16:16:39 2020 - [info] Resetting slave info on the new master..
Fri Feb 28 16:16:39 2020 - [info]  192.168.98.12: Resetting slave info succeeded.
Fri Feb 28 16:16:39 2020 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln2045] Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
Fri Feb 28 16:16:39 2020 - [info] 
----- Failover Report -----
cls_all: MySQL Master failover 192.168.98.11(192.168.98.11:3306) to 192.168.98.12(192.168.98.12:3306)
Master 192.168.98.11(192.168.98.11:3306) is down!
Check MHA Manager logs at localhost.localdomain:/masterha/cls_all/manager.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.98.11(192.168.98.11:3306)
Selected 192.168.98.12(192.168.98.12:3306) as a new master.
192.168.98.12(192.168.98.12:3306): OK: Applying all logs succeeded.
192.168.98.12(192.168.98.12:3306): OK: Activated master IP address.
192.168.98.12(192.168.98.12:3306): Resetting slave info succeeded.
192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
Fri Feb 28 16:16:39 2020 - [info] Sending mail..
sh: /etc/masterha/scripts/send_report: No such file or directory
Fri Feb 28 16:16:39 2020 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln2089] Failed to send mail with return code 127:0

However, because 192.168.98.10 was unreachable, MHA reports that recovery on the slaves partially failed:

192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
Master failover to 192.168.98.12(192.168.98.12:3306) done, but recovery on slave partially failed.
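
As a rough sketch of how one might follow up on this report, the shell helpers below pull the unreachable slaves out of a manager log and print a statement for re-pointing such a slave at the new master once it is back. The function names `unrecovered_slaves` and `rejoin_sql` are hypothetical and not part of MHA; the statement simply mirrors the CHANGE MASTER line MHA printed in the log above, with the password left as a placeholder because the log masks it.

```shell
# Hypothetical helpers, not part of MHA.

# Read a manager log on stdin and print host:port for every slave that
# the failover report marks as unreachable, i.e. report lines of the form:
#   192.168.98.10(192.168.98.10:3306): ERROR: Could not be reachable so couldn't recover.
unrecovered_slaves() {
    sed -n 's/^\([^(]*\)(\([^)]*\)): ERROR: Could not be reachable.*/\2/p'
}

# Print the SQL to run on a slave that missed the failover.
# $1 = new master host, $2 = new master port, $3 = replication user.
# The password is a placeholder and must be filled in by hand.
rejoin_sql() {
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_PORT=%s, MASTER_AUTO_POSITION=1, MASTER_USER='%s', MASTER_PASSWORD='<password>'; START SLAVE;\n" "$1" "$2" "$3"
}
```

For example, `unrecovered_slaves < /masterha/cls_all/manager.log` would list the slaves to fix by hand, and `rejoin_sql 192.168.98.12 3306 repler` prints the statement to run on 192.168.98.10 after its mysqld has been started again. This assumes GTID-based replication, as in this environment.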

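The failover itself still succeeded: the VIP 192.168.98.100 is now on 192.168.98.12, which no longer has any slave configuration and has read_only set to OFF: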

root@localhost 16:16:16 [(none)]> \! ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:96:c2:3a brd ff:ff:ff:ff:ff:ff
    inet 192.168.98.12/24 brd 192.168.98.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.98.100/24 scope global secondary ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::ef03:3251:b4ed:204c/64 scope link 
       valid_lft forever preferred_lft forever
root@localhost 16:27:37 [(none)]> show slave status\G
Empty set (0.00 sec)
root@localhost 16:27:43 [(none)]> show global variables like '%read_only%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_read_only      | OFF   |
| read_only             | OFF   |
| super_read_only       | OFF   |
| transaction_read_only | OFF   |
| tx_read_only          | OFF   |
+-----------------------+-------+
5 rows in set (0.00 sec)
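
The manual check above could be scripted roughly as follows. This is only a sketch: `has_vip` is a hypothetical helper name, and it assumes the VIP shows up as an `inet <address>/<prefix>` line in `ip a` output, as it does in the output above.

```shell
# Hypothetical helper, not part of MHA.
# Reads `ip a` output on stdin and succeeds only if the given IPv4
# address ($1) is configured on some interface.
has_vip() {
    # Escape the dots so they match literally; the trailing "/" keeps
    # 192.168.98.10 from matching 192.168.98.100.
    vip_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    grep -qE "inet[[:space:]]+${vip_re}/"
}
```

On the new master, `ip a | has_vip 192.168.98.100 && echo "VIP present"` would confirm the address move, and `mysql -e "SELECT @@global.read_only"` can be used alongside it to confirm that the server accepts writes.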

Reposted from: http://ckvub.baihongyu.com/
