We are planning and adjusting the company's database architecture, so picking an architecture that fits each business environment matters; at the same time, as the operations team we have to keep maintenance manageable (no architectures that demand heavy ongoing care). The shortlist came down to MHA + VIP, master/slave + VIP, a MyCAT cluster, and DNS + master/slave, or combinations of them. This series records how each option is deployed, compares their strengths and weaknesses, and tries to match each to the scenarios it suits.
This article covers automatic master/standby failover, i.e. VIP + master/slave replication, with the VIP controlled by keepalived.
I. Environment
Server IP | OS | Database | Hardware | Notes |
192.168.1.23 | RHEL 6.6 x86_64 | MySQL 5.7.21 | RAM: 16 GB, CPU: 4 cores, Disk: 100 GB | VIP: 192.168.1.27, Keepalived: 1.4.4 |
192.168.1.24 | RHEL 6.6 x86_64 | MySQL 5.7.21 | RAM: 16 GB, CPU: 4 cores, Disk: 100 GB | |
II. Business scenarios
1. Advantages
1) Simple to deploy and easy to manage.
2) The VIP fails over between master and standby automatically, transparently to the application.
2. Disadvantages
1) Risk of split-brain.
2) Failover does not check replication state, so a standby that is not in sync can still take over.
3) Does not work across VLANs.
4) Deployment only looks simple; there are many details to get right, and unless they are all covered this setup should be used with caution.
3. Scenarios
All in all, this architecture is not recommended; use it carefully.
1) A two-node database pair whose only goal is transparent master/standby VIP failover.
2) Several slave nodes load-balanced through LVS (this way of doing load balancing is no longer recommended).
3) Bidirectional replication between the two nodes reduces some of the risk (VIP flapping back and forth, split-brain, etc.) but cannot eliminate it.
III. Database installation and replication setup
Omitted.
IV. Installing and deploying keepalived
1. Download keepalived
Download URL: http://www.keepalived.org/software/keepalived-1.4.4.tar.gz
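If the server has outbound network access, the tarball can be fetched straight onto the build host with wget (a small convenience step; the download directory is whatever you normally use for source packages, such as the tools directory seen in the prompts below):
# fetch the keepalived 1.4.4 source package
wget http://www.keepalived.org/software/keepalived-1.4.4.tar.gz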
2. Install keepalived
Master:
[root@node1 tools]# tar -xzvf keepalived-1.4.4.tar.gz
[root@node1 tools]# cd keepalived-1.4.4
[root@node1 keepalived-1.4.4]# ./configure --prefix=/usr/local/keepalived
[root@node1 keepalived-1.4.4]# make -j 4
[root@node1 keepalived-1.4.4]# make install
[root@node1 keepalived-1.4.4]# cp -rp ./keepalived/etc/init.d/keepalived /etc/init.d/
[root@node1 keepalived-1.4.4]# chkconfig --add keepalived
[root@node1 keepalived-1.4.4]# mkdir /etc/keepalived
[root@node1 keepalived-1.4.4]# cp -rp ./bin/keepalived /usr/bin/
[root@node1 keepalived-1.4.4]# cp ./keepalived/etc/keepalived/keepalived.conf /etc/keepalived
[root@node1 keepalived-1.4.4]# cp -rp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
[root@node1 keepalived-1.4.4]# service keepalived status
keepalived is stopped
Standby:
Install keepalived the same way as on the master.
### Note
*** WARNING - this build will not support IPVS with IPv6. Please install libnl/libnl-3 dev libraries to support IPv6 with IPVS.
###
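The warning above only matters if IPVS over IPv6 is needed. In that case the libnl development packages can be installed before re-running ./configure (a sketch for RHEL/CentOS 6; actual package availability depends on the configured yum repositories):
# install libnl headers so configure can enable IPv6 support for IPVS
yum install -y libnl-devel libnl3-devel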
3. Configure keepalived
Master configuration:
[root@node1 keepalived-1.4.4]# cp -rp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf_bak
[root@node1 keepalived-1.4.4]# >/etc/keepalived/keepalived.conf
[root@node1 keepalived-1.4.4]# vi /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
notification_email {
rangle@163.com
}
notification_email_from rangle@163.com
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id Node_Master
}
vrrp_instance VI_1 {
state BACKUP ## MASTER or BACKUP; to avoid split-brain, set BACKUP on both nodes, since MASTER mode preempts the VIP
interface eth1
virtual_router_id 43 ## VRRP group id; must be identical on both nodes so they belong to the same VRRP group
priority 100
nopreempt ## together with BACKUP, prevents the VIP from floating back once the old master's service recovers
advert_int 1 ### multicast advertisement interval; must be the same on both nodes
authentication { ### authentication settings; must match on both nodes
auth_type PASS
auth_pass 1111
}
virtual_ipaddress { ### the virtual IP; must be the same on both nodes
192.168.1.27 label eth1:1
}
}
virtual_server 192.168.1.27 3306 {
delay_loop 6
lb_algo wrr # lvs scheduling algorithm: rr|wrr|lc|wlc|lblc|sh|dh
lb_kind DR # load-balancing forwarding mode: NAT|DR|TUN
persistence_timeout 50
protocol TCP
real_server 192.168.1.23 3306 {
weight 1
notify_down /etc/keepalived/mysql.sh
TCP_CHECK {
connect_timeout 10
nb_get_retry 3
delay_before_retry 3
connect_port 3306
}
}
}
Standby configuration:
[root@node2 keepalived-1.4.4]# cp -rp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf_bak
[root@node2 keepalived-1.4.4]# >/etc/keepalived/keepalived.conf
[root@node2 keepalived-1.4.4]# vi /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
notification_email {
rangle@163.com
}
notification_email_from rangle@163.com
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id Node_backup
}
vrrp_instance VI_1 {
state BACKUP
interface eth1
virtual_router_id 43
priority 90 ## the master's priority must be higher than the standby's: 100 on the master, e.g. 90 on the standby (range 1-254)
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.1.27 label eth1:1
}
}
virtual_server 192.168.1.27 3306 {
delay_loop 2 ## check the real server every 2 seconds
lb_algo wrr ## LVS algorithm; can be omitted if only high availability is needed
lb_kind DR ## LVS mode; can be omitted if only high availability is needed
persistence_timeout 60
protocol TCP
real_server 192.168.1.24 3306 {
weight 3
notify_down /etc/keepalived/mysql.sh ## script executed when the mysql service goes down
TCP_CHECK {
connect_timeout 10
nb_get_retry 3
delay_before_retry 3
connect_port 3306
}
}
}
mysql.sh contents:
#!/bin/bash
# Called by keepalived's notify_down when the local MySQL port check fails:
# stop keepalived so the VIP can move to the peer node.
run_status=`service keepalived status|grep running|wc -l`
pro_status=`ps -ef |grep keepalived|grep -v grep |wc -l`
service keepalived stop
# if keepalived was running before the stop, make sure no process is left behind
if [ ${run_status} != 0 ] || [ ${pro_status} != 0 ]
then
pkill keepalived
fi
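Since keepalived calls this script via notify_down, it has to be executable on both nodes; once it is in place keepalived can be started and the VIP binding verified (using the addresses configured above):
chmod +x /etc/keepalived/mysql.sh
service keepalived start
# the master should now hold the VIP 192.168.1.27 on eth1:1
ip addr show eth1 | grep 192.168.1.27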
V. keepalived tuning
1. Risks of keepalived + MySQL
1) Split-brain
Symptom:
If VRRP multicast between the two nodes is disrupted (network jitter, firewall rules, etc.), the VIP fails over even though both database services are actually healthy. The VIP then exists on both nodes at the same time, both sides accept writes, and the data diverges.
Mitigation:
**** Add a split-brain check script on the master node: probe the gateway from the local host as an extra arbitration point to decide whether this node's outbound network is broken. Combined with the VRRP multicast check, if the network really is down, stop keepalived and MySQL directly and send alert mail / SMS (a sketch of the mail step follows after point 2).
2) Failover without a replication-state check
Symptom:
The VIP failover does not check the slave's state: whether or not the slave has caught up, the VIP moves anyway, so the risk of data inconsistency is high.
Mitigation:
**** Add a replication check script on the slave node: verify that the replication threads are healthy and that there is no lag. If there is lag or a replication error, block the failover by stopping keepalived on the slave.
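The alert step mentioned under 1) is not implemented in the scripts below; a minimal sketch, assuming the mailx package is installed and reusing the recipient address from the keepalived configuration:
# hypothetical alert sent right after stopping keepalived/mysql on this node
echo "keepalived and mysql stopped on $(hostname): gateway unreachable" | mail -s "MySQL VIP split-brain guard triggered" rangle@163.com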
2. Configuration after tuning
The tuned configuration files and scripts are as follows:
Master configuration:
keepalived.conf
! Configuration File for keepalived
global_defs {
notification_email {
abc@163.com
}
notification_email_from root@localhost
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id node1
}
vrrp_script check_local {
script "/etc/keepalived/check_gateway.sh"
interval 5
fall 4
rise 4
weight 3
}
vrrp_instance VI_1 {
state BACKUP
interface eth1
virtual_router_id 43
priority 100
nopreempt
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.1.27 label eth1:1
}
track_script {
check_local
}
}
virtual_server 192.168.1.27 3306 {
delay_loop 6
lb_algo wrr # lvs scheduling algorithm: rr|wrr|lc|wlc|lblc|sh|dh
lb_kind DR # load-balancing forwarding mode: NAT|DR|TUN
persistence_timeout 50
protocol TCP
real_server 192.168.1.23 3306 {
weight 3
notify_down /etc/keepalived/mysql.sh
TCP_CHECK {
connect_timeout 10
nb_get_retry 3
delay_before_retry 3
connect_port 3306
}
}
}
Standby configuration:
keepalived.conf
! Configuration File for keepalived
global_defs {
notification_email {
abc@163.com
}
notification_email_from root@localhost
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id node2
}
vrrp_script check_local {
script "/etc/keepalived/check_gateway.sh"
interval 5
fall 4
rise 4
weight 3
}
vrrp_instance VI_1 {
state BACKUP
interface eth1
virtual_router_id 43
priority 90
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.1.27 label eth1:1
}
track_script {
check_local
}
}
virtual_server 192.168.1.27 3306 {
delay_loop 2
lb_algo wrr
lb_kind DR
persistence_timeout 60
protocol TCP
real_server 192.168.1.24 3306 {
weight 3
notify_down /etc/keepalived/mysql.sh ## script executed when the mysql service goes down
TCP_CHECK {
connect_timeout 10
nb_get_retry 3
delay_before_retry 3
connect_port 3306
}
}
}
Split-brain check script:
/etc/keepalived/check_gateway.sh
#!/bin/sh
# Probe the gateway by ARP from the VIP address; if the local network is broken,
# stop keepalived so this node gives up the VIP instead of causing split-brain.
VIP=192.168.1.27
GATEWAY=192.168.1.254
/sbin/arping -I eth1 -c 5 -s $VIP $GATEWAY &>/dev/null
gateway_status=$?
keepalived_status=`ps -ef |grep keepalived|grep -v grep |wc -l`
if [ ${gateway_status} = 0 ] && [ ${keepalived_status} = 0 ]
then
# gateway reachable but keepalived not running: bring it back up
service keepalived start
elif [ ${gateway_status} != 0 ] && [ ${keepalived_status} != 0 ]
then
# gateway unreachable while keepalived is running: stop it to release the VIP
service keepalived stop
fi
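Like mysql.sh, this script needs to be executable, and it is worth running it once by hand to confirm the arping probe behaves as expected on this network (assuming the gateway address above is the right one for the environment):
chmod +x /etc/keepalived/check_gateway.sh
# run once manually and check what it decided to do
sh /etc/keepalived/check_gateway.sh
service keepalived status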
Failover script run when the database service goes down
/etc/keepalived/mysql.sh
#!/bin/bash
# Called by keepalived's notify_down when the local MySQL port check fails:
# stop keepalived so the VIP can move to the peer node.
run_status=`service keepalived status|grep running|wc -l`
pro_status=`ps -ef |grep keepalived|grep -v grep |wc -l`
service keepalived stop
# if keepalived was running before the stop, make sure no process is left behind
if [ ${run_status} != 0 ] || [ ${pro_status} != 0 ]
then
pkill keepalived
fi
Slave replication check script (run from cron on the standby: */1 * * * * sh /etc/keepalived/check_slave.sh >/dev/null 2>&1)
#!/bin/bash
# Runs from cron on the standby: only allow this node to keep keepalived (and
# therefore the VIP) when replication is healthy and not lagging.
VIP="192.168.1.27"
vip_status=`ip add |egrep "${VIP}"|wc -l`
keepalived_status=`ps -ef |grep keepalived|grep -v grep|wc -l`
slave_status=`mysql -uroot -e "show slave status \G"|egrep "Slave.*Running|Seconds_Behind_Master|Last_Errno"|grep -v "Slave_SQL_Running_State"|awk -F ':' '{printf("%s",$NF)}'`
io_status=`echo "${slave_status}"|awk '{print $1}'|grep "Yes"|wc -l`    # Slave_IO_Running
sql_status=`echo "${slave_status}"|awk '{print $2}'|grep "Yes"|wc -l`   # Slave_SQL_Running
sync_status=`echo "${slave_status}"|awk '{print $3}'`                   # Last_Errno
sync_time=`echo "${slave_status}"|awk '{print $4}'`                     # Seconds_Behind_Master
if [ ${io_status} -ne 1 ] || [ ${sql_status} -ne 1 ] || [ ${sync_status} -ne 0 ]
then
# replication broken: make sure this node cannot take over the VIP
if [ ${vip_status} -eq 0 ] && [ ${keepalived_status} -ne 0 ];then
service keepalived stop
fi
elif [ ${io_status} -eq 1 ] && [ ${sql_status} -eq 1 ] && [ ${sync_status} -eq 0 ]
then
if [ ${sync_time} -ge 120 ];then
# lag of 120 seconds or more: block failover to this node
if [ ${vip_status} -eq 0 ] && [ ${keepalived_status} -ne 0 ];then
service keepalived stop
fi
elif [ ${sync_time} -eq 0 ];then
# healthy and fully caught up: keepalived may run again
if [ ${vip_status} -eq 0 ] && [ ${keepalived_status} -eq 0 ];then
service keepalived start
fi
fi
fi
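The cron entry mentioned above can be appended to root's crontab non-interactively (a sketch that preserves any existing crontab entries):
# install the per-minute replication check on the standby
(crontab -l 2>/dev/null; echo '*/1 * * * * sh /etc/keepalived/check_slave.sh >/dev/null 2>&1') | crontab -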
3. Opening the firewall
VRRP traffic between the HA pair uses IP multicast (default multicast address 224.0.0.18). If the firewall is enabled, remember to allow the VRRP protocol:
iptables -A INPUT -p vrrp -d 224.0.0.18 -j ACCEPT
iptables -A INPUT -p vrrp -j ACCEPT
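To confirm that the rule works, watch the VRRP advertisements (IP protocol 112) on the interface, and save the rules so they survive a restart (RHEL 6 style commands):
# watch a few VRRP advertisements on eth1
tcpdump -i eth1 -c 5 ip proto 112
# persist the iptables rules
service iptables save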
VI. keepalived failover tests
1. Master MySQL instance down
Current state of the master host:
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) running (22983) [OK]
[root@node1 keepalived]# service keepalived status
keepalived (pid 14092) is running…
[root@node1 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
inet 192.168.1.27/32 scope global eth1:1
inet6 fe80::250:56ff:fe8d:496/64 scope link
Current state of the standby host:
[root@node2 keepalived]# service mysqld status
MySQL (Percona Server) running (14146) [OK]
[root@node2 keepalived]# service keepalived status
keepalived (pid 12207) is running…
[root@node2 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Stop the MySQL instance on the master; once the database is down, the VIP has already moved away:
[root@node1 keepalived]# service mysqld stop
Shutting down MySQL (Percona Server)…………… [OK]
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) is not running [FAILED]
[root@node1 keepalived]# service keepalived status
keepalived is stopped
[root@node1 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
inet6 fe80::250:56ff:fe8d:496/64 scope link
Now check the standby:
[root@node2 keepalived]# service mysqld status
MySQL (Percona Server) running (14146) [OK]
[root@node2 keepalived]# service keepalived status
keepalived (pid 12207) is running…
[root@node2 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
inet 192.168.1.27/32 scope global eth1:1
inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
[root@node2 keepalived]# mysql -uroot -e "show slave status \G"|egrep "Slave.*Running|Seconds_Behind_Master|Last_Errno"|grep -v "Slave_SQL_Running_State"
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Last_Errno: 0
Seconds_Behind_Master: NULL
2. Master database server down
Same as above: after checking the state of both nodes, run shutdown -h now on the master and then check the standby.
3. keepalived stopped on the master
Same as above.
4. NIC failure on the master
With keepalived and mysqld both healthy but VRRP communication impossible, the master stops its own keepalived service and the failover completes.
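One way to simulate this on a test machine is simply to take the interface down, assuming console access is available (doing this over an SSH session that rides on eth1 will cut the session off):
# on the master: drop eth1 so VRRP advertisements stop while mysqld keeps running
ifdown eth1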
5. Simulated VRRP communication failure (split-brain)
Enable the firewall on the standby without allowing VRRP; the VIP now shows up on both servers.
[root@node2 keepalived]# service iptables start
iptables: Applying firewall rules: [OK]
Master server state check:
[root@node1 keepalived]# ip add|grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
inet 192.168.1.27/32 scope global eth1:1
inet6 fe80::250:56ff:fe8d:496/64 scope link
Standby server state check:
[root@node2 keepalived]# ip add|grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
inet 192.168.1.27/32 scope global eth1:1
inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
If the firewall on the standby is turned off again, the VIP settles on the server with the higher priority.
How can this be avoided? A multicast-state check could be added to check_gateway.sh; I have not done that here and only guard against split-brain caused by a temporarily unreachable gateway (i.e. network problems).
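A rough sketch of such a check, just to illustrate the idea: count the VRRP advertisements arriving on eth1 for a few seconds and treat silence as suspicious (assumes tcpdump and the timeout utility are available; the threshold and log path are illustrative only):
# hypothetical extra check for check_gateway.sh: are VRRP advertisements still visible?
# note: advertisements sent by this node are captured too, so on the node holding
# the VIP you may want to filter on the peer's source address as well
adverts=`timeout 5 tcpdump -i eth1 -c 3 ip proto 112 2>/dev/null | wc -l`
if [ ${adverts} -eq 0 ]
then
echo "`date` no VRRP advertisements seen within 5s" >> /etc/keepalived/vrrp_check.log
fi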
6. After failover, does the VIP fail back once the old master recovers? (preemption test)
(This test follows on from test 1: first restore the MySQL instance and the keepalived process on the master.)
Commands run on the master:
[root@node1 keepalived]# service mysqld start
Starting MySQL (Percona Server)…… [OK]
[root@node1 keepalived]# service keepalived start
Starting keepalived: [OK]
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) running (1141) [OK]
[root@node1 keepalived]# service keepalived status
keepalived (pid 1230) is running…
[root@node1 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
inet6 fe80::250:56ff:fe8d:496/64 scope link
Commands run on the standby:
[root@node2 keepalived]# service keepalived status
keepalived (pid 12207) is running…
[root@node2 keepalived]# service mysqld status
MySQL (Percona Server) running (14146) [OK]
[root@node2 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
inet 192.168.1.27/32 scope global eth1:1
inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Even after the master's services come back, it does not preempt the VIP from the standby, because both nodes run in BACKUP mode. However, if the standby later has to give up the VIP (service or instance failure) while the master is again eligible, the switch back still happens. Keep in mind that after every master/standby switch you must manually rebuild the new replication topology and verify data consistency before a second switch.
7. Simulated standby replication lag above 120 s
First stop the io_thread on the slave:
(root:localhost:Sun May 20 10:24:45 2018)[(none)]>stop slave io_thread;
Query OK, 0 rows affected (0.00 sec)
Generate a large volume of data changes on the master.
Then just watch the keepalived state on the slave.
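A simple way to generate enough changes is to insert rows in a loop on the master (a sketch that creates a throwaway ha_test schema purely for this exercise):
# hypothetical load generation on the master
mysql -uroot -e "create database if not exists ha_test; create table if not exists ha_test.t1 (id int auto_increment primary key, c varchar(64));"
for i in `seq 1 1000`; do
mysql -uroot -e "insert into ha_test.t1 (c) select repeat('a',64) from information_schema.columns limit 500;"
done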
8. Simulated failover on a replication error
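The Last_Errno 1050 ("table already exists") seen below can be provoked, in a test environment only, by creating a table on the slave first and then creating the same table on the master so that the replicated statement fails (reusing the throwaway ha_test schema from the previous test):
# on the slave:
mysql -uroot -e "create table ha_test.t_conflict (id int);"
# on the master (the statement replicates and collides with the existing table):
mysql -uroot -e "create table ha_test.t_conflict (id int);"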
Slave state:
[root@node2 keepalived]# mysql -uroot -e "show slave status \G"|egrep "Slave.*Running|Seconds_Behind_Master|Last_Errno"|grep -v "Slave_SQL_Running_State"
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1050
Seconds_Behind_Master: NULL
[root@node2 keepalived]# service keepalived status
keepalived is stopped
[root@node2 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Master state:
[root@node1 keepalived]# service mysqld status
MySQL (Percona Server) running (1141) [OK]
[root@node1 keepalived]# service keepalived status
keepalived (pid 1230) is running…
[root@node1 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
inet 192.168.1.27/32 scope global eth1:1
inet6 fe80::250:56ff:fe8d:496/64 scope link
Now, if the master's services go down, the VIP cannot fail over; test this by stopping mysqld.
The master has released the VIP:
[root@node1 keepalived]# service mysqld stop
Shutting down MySQL (Percona Server)………….. [OK]
[root@node1 keepalived]# ip addr |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.23/24 brd 192.168.1.255 scope global eth1
inet6 fe80::250:56ff:fe8d:496/64 scope link
The standby has not picked up the VIP:
[root@node2 keepalived]# ip add |grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 192.168.1.24/24 brd 192.168.1.255 scope global eth1
inet6 fe80::250:56ff:fe8d:3d1e/64 scope link
Appendix:
Causes of split-brain
Failure of the heartbeat link between the HA nodes, so they can no longer communicate.
A broken heartbeat cable (severed or aged).
A faulty NIC or driver, or IP misconfiguration/conflict (direct NIC-to-NIC link).
Failure of devices along the heartbeat path (NICs, switches).
Failure of the arbitration host (when an arbitration scheme is used).
An iptables firewall on the HA nodes blocking heartbeat traffic.
Incorrect heartbeat NIC or address configuration on the HA nodes, so heartbeats fail to be sent.
Other misconfiguration, such as mismatched heartbeat methods, heartbeat broadcast conflicts, or software bugs.
In production, split-brain can be guarded against in the following ways:
(1) Use both a serial cable and an Ethernet cable, i.e. two heartbeat links, so that if one fails the other still carries the heartbeat.
(2) When split-brain is detected, forcibly shut down one node (this needs special hardware such as fencing/STONITH devices): when the standby stops receiving heartbeats, it sends a power-off command to the master over a separate channel.
(3) Monitor and alert on split-brain (mail, SMS, etc.) so that a person can step in and arbitrate as soon as the problem occurs, limiting the damage. For example, Baidu's alert SMS supports both downlink and uplink: the alert reaches the administrator's phone, and replying with a predefined string sends an instruction back so the server can carry out the corresponding action automatically.