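Redis Sentinel is the high-availability solution for Redis. It was formally introduced in Redis 2.8.
With plain master-slave replication, if the master fails, you have to manually promote one slave to master, repoint the other slaves at the new master, and update the master address used by the application. The whole process requires human intervention. Sentinel automates these steps.
Below, we walk through Sentinel's failover process using the logs.
Sentinel failover process
The cluster topology is as follows.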
Role IP Port runID
Master 127.0.0.1 6379
Slave-1 127.0.0.1 6380
Slave-2 127.0.0.1 6381
Sentinel-1 127.0.0.1 26379 d4424b8684977767be4f5abd1e364153fbb0adbd
Sentinel-2 127.0.0.1 26380 18311edfbfb7bf89fe4b67d08ef432053db62fff
Sentinel-3 127.0.0.1 26381 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8
Kill the master process with kill -9.
1. The slaves react first.
They immediately print the following messages.
28244:S 08 Oct 16:03:34.184 # Connection with master lost.
28244:S 08 Oct 16:03:34.184 * Caching the disconnected master state.
28244:S 08 Oct 16:03:34.548 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:34.548 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:03:34.548 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:03:35.556 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:35.556 * MASTER <-> SLAVE sync started
…
2. The Sentinel logs show output only about 30s later, which is due to the setting “sentinel down-after-milliseconds mymaster 30000”.
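The 30s gap can be checked from the timestamps themselves. The following is a minimal sketch that compares the slave's "Connection with master lost" line with Sentinel-1's +sdown line from this walkthrough; the helper name parse_ts is hypothetical, and the year 2018 is taken from the "Next failover delay" line in the Sentinel logs, since ordinary Redis log lines omit the year.

```python
from datetime import datetime, timedelta


def parse_ts(line: str, year: int = 2018) -> datetime:
    """Parse the 'DD Mon HH:MM:SS.mmm' timestamp of a Redis log line.

    The year is not part of the log line, so it is passed in separately.
    """
    # Fields: "<pid>:<role>", day, month, time, then the level marker and message.
    _, day, month, clock = line.split()[:4]
    return datetime.strptime(f"{year} {day} {month} {clock}", "%Y %d %b %H:%M:%S.%f")


lost = parse_ts("28244:S 08 Oct 16:03:34.184 # Connection with master lost.")
sdown = parse_ts("28087:X 08 Oct 16:04:04.277 # +sdown master mymaster 127.0.0.1 6379")

# Whole milliseconds between the slave noticing the loss and Sentinel-1's +sdown.
delay_ms = (sdown - lost) // timedelta(milliseconds=1)
print(delay_ms)
```

The result is slightly above 30000 ms, which is consistent with the verdict being issued only after down-after-milliseconds has elapsed without a valid reply.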
The log output of each Sentinel node and of the slaves is listed below in turn.
Sentinel-1
28087:X 08 Oct 16:04:04.277 # +sdown master mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:04.379 # +new-epoch 1
28087:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28087:X 08 Oct 16:04:05.388 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28087:X 08 Oct 16:04:05.388 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:35.656 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
Sentinel-2
28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.554 # -odown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
Sentinel-3
28234:X 08 Oct 16:04:04.288 # +sdown master mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:04.378 # +new-epoch 1
28234:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28234:X 08 Oct 16:04:04.385 # +odown master mymaster 127.0.0.1 6379 #quorum 2/2
28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018
28234:X 08 Oct 16:04:05.630 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:05.630 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:35.709 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
slave2 (127.0.0.1:6380)
28244:S 08 Oct 16:04:04.762 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:04.762 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from ‘id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof’)
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue…
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
slave3 (127.0.0.1:6381)
28253:S 08 Oct 16:04:03.655 * MASTER <-> SLAVE sync started
28253:S 08 Oct 16:04:03.655 # Error condition on socket for SYNC: Connection refused
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from ‘id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec’)
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
28253:M 08 Oct 16:04:05.770 * Slave 127.0.0.1:6380 asks for synchronization
28253:M 08 Oct 16:04:05.770 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 156 bytes of backlog starting from offset 24302.
Putting the logs above together, we can see the following.
Every Sentinel node marks 127.0.0.1 6379 as subjectively down (Subjectively Down, abbreviated sdown).
28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
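Once the quorum is reached, Sentinel-2 marks the master as objectively down (Objectively Down, abbreviated odown). Comparing with the logs of the other two Sentinel nodes, Sentinel-2 is the first to reach the objectively-down verdict. Next comes the Sentinel leader election. Generally speaking, whichever Sentinel completes the objectively-down verdict first becomes the leader, and only the Sentinel leader can perform the failover.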
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
Look for a suitable slave to promote to master.
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
+failover-state-select-slave <instance details> — New failover state is select-slave: we are trying to find a suitable slave for promotion.
127.0.0.1 6381 is selected as the new master.
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
+selected-slave <instance details> — We found the specified good slave to promote.
Instruct the 6381 node to run slaveof no one so that it becomes the master.
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
+failover-state-send-slaveof-noone <instance details> — We are trying to reconfigure the promoted slave as master, waiting for it to switch.
Wait for the 6381 node to be promoted to master.
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
Confirm that the 6381 node has been promoted to master.
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
Now look at slave3's log output between 16:04:04.528 and 16:04:05.543. It enabled MASTER mode and rewrote its configuration file.
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from ‘id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec’)
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
The failover enters the slave-reconfiguration phase.
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
Instruct the 6380 node to replicate from the new master.
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-sent <instance details> — The leader sentinel sent the SLAVEOF command to this instance in order to reconfigure it for the new slave.
slave2's log output at this point in time matches closely. It performed a partial (incremental) resynchronization.
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from ‘id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof’)
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue…
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
At the same moment, the Sentinels also log output; take sentinel1 as an example. The log shows that it updates its configuration at this point.
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
switch-master <master name> <oldip> <oldport> <newip> <newport> — The master new IP and address is the specified one after a configuration change. This is the message most external users are interested in.
The synchronization is not yet complete.
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-inprog <instance details> — The slave being reconfigured showed to be a slave of the new master ip:port pair, but the synchronization process is not yet complete.
Master-slave synchronization is complete.
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-done <instance details> — The slave is now synchronized with the new master.
The failover is complete.
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
After a successful failover, the master switch message is published.
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
The slaves are attached to the new master. Note that the old master is registered as a slave of the new master.
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
+slave <instance details> — A new slave was detected and attached.
30s later, the old master is marked subjectively down.
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
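The Sentinel event lines quoted in this walkthrough share one shape: pid and role, timestamp, a level marker (# or *), an event name starting with + or -, and the instance details. The following is a minimal sketch of a parser for that shape, so the events can be laid out as a timeline. The names SentinelEvent and parse_event are hypothetical, and the regular expression is derived only from the lines shown here, so other Redis versions may need adjustments.

```python
import re
from typing import NamedTuple, Optional


class SentinelEvent(NamedTuple):
    pid: int
    timestamp: str
    event: str
    details: str


# "<pid>:X <DD Mon HH:MM:SS.mmm> <#|*> <+event|-event> <details>"
_EVENT_LINE = re.compile(r"^(\d+):X (\d{2} \w{3} [\d:.]+) [#*] ([+-][\w-]+) ?(.*)$")


def parse_event(line: str) -> Optional[SentinelEvent]:
    """Return the parsed event, or None for lines that are not +/- events."""
    match = _EVENT_LINE.match(line.strip())
    if match is None:
        return None
    pid, timestamp, event, details = match.groups()
    return SentinelEvent(int(pid), timestamp, event, details)


lines = [
    "28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379",
    "28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1",
    "28163:X 08 Oct 16:04:06.554 # -odown master mymaster 127.0.0.1 6379",
    "28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381",
]
timeline = [e for e in map(parse_event, lines) if e is not None]
for e in timeline:
    print(e.timestamp, e.event)
```

Lines such as "voted for" or "Next failover delay" carry no +/- event name, so the sketch skips them rather than guessing a structure for them.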
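To summarize, Sentinel performs a failover as follows.
1. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat to check whether they are reachable. If a node gives no valid reply for longer than down-after-milliseconds, the Sentinel marks it as subjectively down.
2. If the subjectively-down node is the master, the Sentinel asks the other Sentinel nodes for their view of the master via the sentinel is-master-down-by-addr command. Once at least <quorum> Sentinels agree, the Sentinel marks the master as objectively down. If a slave or a Sentinel node is marked subjectively down, no failover follows.
3. The Sentinels elect a leader, which carries out the subsequent failover. The election algorithm is based on Raft.
4. The Sentinel leader starts the failover.
5. It selects a suitable slave as the new master.
6. The Sentinel leader runs slaveof no one on the slave chosen in the previous step to make it the master.
7. It sends commands to the remaining slaves to make them slaves of the new master; how they are reconfigured depends on the parallel-syncs parameter.
8. It records the old master as a slave and keeps it under Sentinel's management, so that once it recovers it replicates from the new master.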
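Steps 1 and 2 above boil down to two small checks. The following is a simplified sketch, not Sentinel's actual implementation; the function names are hypothetical, and it assumes the Sentinel counts its own verdict toward the quorum, which matches the "#quorum 3/2" figure in the logs (three verdicts collected, two required).

```python
def is_sdown(ms_since_last_valid_reply: int, down_after_ms: int) -> bool:
    """Subjectively down: no valid reply for longer than down-after-milliseconds."""
    return ms_since_last_valid_reply > down_after_ms


def is_odown(local_sdown: bool, other_sentinels_agreeing: int, quorum: int) -> bool:
    """Objectively down: this Sentinel plus the agreeing ones reach the quorum."""
    if not local_sdown:
        return False
    return 1 + other_sentinels_agreeing >= quorum


# Values from this walkthrough: down-after-milliseconds 30000, quorum 2,
# and both other Sentinels agreeing ("#quorum 3/2").
print(is_sdown(30093, 30000))  # True
print(is_odown(True, 2, 2))    # True
print(is_odown(True, 0, 2))    # False: only the local verdict, quorum not reached
```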
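Sentinel leader election process
The Sentinel leader election is based on the Raft protocol.
1. Every online Sentinel node is eligible to become the leader. Once it has judged the master to be down and starts a failover (in the logs above, +odown precedes +try-failover), it sends the sentinel is-master-down-by-addr command to the other Sentinel nodes, asking them to set it as leader.
2. A Sentinel node that receives the command grants the request if it has not already granted another Sentinel node's sentinel is-master-down-by-addr request; otherwise it refuses.
3. If the Sentinel node finds that its vote count is greater than or equal to max(quorum, num(sentinels)/2+1), it becomes the leader.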
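The threshold in step 3 can be written down directly. The following is a minimal sketch with hypothetical function names; the vote map is taken from the +vote-for-leader and "voted for" lines in the logs above, where all three Sentinels voted for 18311edf... (Sentinel-2).

```python
from collections import Counter
from typing import Dict, Optional


def votes_needed(quorum: int, num_sentinels: int) -> int:
    """max(quorum, num(sentinels)/2 + 1), using integer division."""
    return max(quorum, num_sentinels // 2 + 1)


def elected_leader(votes: Dict[str, str], quorum: int, num_sentinels: int) -> Optional[str]:
    """votes maps voter run ID -> run ID it voted for; returns the winner or None."""
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= votes_needed(quorum, num_sentinels) else None


SENTINEL_2 = "18311edfbfb7bf89fe4b67d08ef432053db62fff"
votes = {
    "d4424b8684977767be4f5abd1e364153fbb0adbd": SENTINEL_2,  # Sentinel-1
    SENTINEL_2: SENTINEL_2,                                  # Sentinel-2 votes for itself
    "3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8": SENTINEL_2,  # Sentinel-3
}
print(votes_needed(quorum=2, num_sentinels=3))  # 2
print(elected_leader(votes, quorum=2, num_sentinels=3) == SENTINEL_2)  # True
```

With three Sentinels and quorum 2, two votes are enough; raising the quorum to 3 would require all three votes.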
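New master selection process
1. Discard all slaves that are already down or disconnected.
2. Discard slaves that have not replied to the leader Sentinel's INFO command within the last 5 seconds.
3. Discard all slaves whose link to the failed master has been down for more than down-after-milliseconds*10 milliseconds.
4. Pick the slave with the highest priority (that is, the lowest slave-priority value).
5. If priorities tie, pick the slave with the largest replication offset.
6. If still tied, pick the slave with the smallest runid.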
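The filter-then-rank procedure above can be sketched as follows. This is an illustration under stated assumptions rather than Sentinel's code: the Replica class and its field names are hypothetical, "highest priority" is read as the lowest slave-priority value, and slaves with slave-priority 0 are excluded (in Redis, priority 0 marks a slave as never promotable, which the list above does not mention).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    priority: int              # slave-priority; lower is preferred, 0 = never promote
    repl_offset: int           # slave-repl-offset
    runid: str
    sdown: bool = False
    disconnected: bool = False
    ms_since_info_reply: int = 0
    master_link_down_ms: int = 0


def select_new_master(replicas: List[Replica], down_after_ms: int) -> Optional[Replica]:
    candidates = [
        r for r in replicas
        if not r.sdown and not r.disconnected           # step 1
        and r.ms_since_info_reply <= 5000                # step 2
        and r.master_link_down_ms <= down_after_ms * 10  # step 3
        and r.priority != 0                              # assumption: 0 = not promotable
    ]
    if not candidates:
        return None
    # Steps 4-6: lowest priority value, then largest offset, then smallest runid.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.runid))


replicas = [
    Replica("127.0.0.1:6380", 100, 70375, "983b87fd070c7f052b26f5135bbb30fdeb170a54"),
    Replica("127.0.0.1:6381", 100, 71040, "b88059cce9104dd4e0366afd6ad07a163dae8b15"),
]
print(select_new_master(replicas, down_after_ms=30000).name)  # 127.0.0.1:6381
```

The sample values are the slave-priority, slave-repl-offset and runid fields from the sentinel slaves output later in this article; with equal priorities, the larger replication offset decides.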
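Three periodic monitoring tasks
1. Every 10 seconds, each Sentinel node sends the info command to the master and the slaves to get the latest topology. This serves the following purposes:
1> Running info against the master returns information about its slaves, which is why Sentinel nodes do not need the slaves to be configured explicitly.
2> A newly added slave is detected promptly.
3> After a node becomes unreachable or a failover occurs, the topology can be refreshed through the info command.
2. Every 2 seconds, each Sentinel node publishes its view of the master, together with its own information, to the __sentinel__:hello channel of the Redis data nodes. Each Sentinel node also subscribes to that channel to learn about the other Sentinel nodes and their view of the master. This serves the following purposes:
1> Discovering new Sentinel nodes: by subscribing to the master's __sentinel__:hello channel, a Sentinel learns about the other Sentinel nodes. If a Sentinel node is new, its information is saved and a connection to it is created.
2> Sentinel nodes exchange the master's state, which later serves as the basis for the objectively-down verdict and the leader election.
3. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat to check whether they are reachable. This task is the main basis for failure detection.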
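To make the second task more concrete, the following sketch parses a hello message. It assumes the payload is eight comma-separated fields (Sentinel ip, port, run ID, current epoch, then master name, ip, port, config epoch), which is how the Redis Sentinel source composes it as far as I can tell; treat the layout as an assumption and verify it against your Redis version. The Hello type and parse_hello are hypothetical names, and the sample payload is constructed from the addresses and run IDs in this walkthrough, not captured from a live channel.

```python
from typing import NamedTuple


class Hello(NamedTuple):
    sentinel_ip: str
    sentinel_port: int
    sentinel_runid: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> Hello:
    """Split an (assumed) 8-field __sentinel__:hello payload into typed fields."""
    fields = payload.split(",")
    if len(fields) != 8:
        raise ValueError(f"expected 8 comma-separated fields, got {len(fields)}")
    s_ip, s_port, s_runid, epoch, m_name, m_ip, m_port, m_epoch = fields
    return Hello(s_ip, int(s_port), s_runid, int(epoch),
                 m_name, m_ip, int(m_port), int(m_epoch))


# Constructed sample: Sentinel-2 announcing the new master after the failover.
sample = "127.0.0.1,26380,18311edfbfb7bf89fe4b67d08ef432053db62fff,1,mymaster,127.0.0.1,6381,1"
hello = parse_hello(sample)
print(hello.sentinel_port, hello.master_name, hello.master_port)
```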
Sentinel parameters
# bind 127.0.0.1 192.168.1.1
# protected-mode no
port 26379
# sentinel announce-ip <ip>
# sentinel announce-port <port>
dir /tmp
sentinel monitor mymaster 127.0.0.1 6379 2
# sentinel auth-pass <master-name> <password>
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
# sentinel notification-script mymaster /var/redis/notify.sh
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh
sentinel deny-scripts-reconfig yes
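Where:
dir: sets Sentinel's working directory.
sentinel monitor mymaster 127.0.0.1 6379 2: the trailing 2 is the quorum, meaning at least two Sentinel nodes must consider the master subjectively down before it can be marked objectively down. The usual recommendation is half the number of Sentinel nodes plus 1. The quorum also matters for the Sentinel leader election: to elect a Sentinel leader, at least max(quorum, num(sentinels) / 2 + 1) Sentinel nodes must take part in the election.
sentinel down-after-milliseconds mymaster 30000: each Sentinel node periodically sends ping commands to determine whether the Redis nodes and the other Sentinel nodes are reachable.
If no valid reply is received from the master within the specified time, it is marked subjectively down. Note that this parameter is used not only to judge the master's state but also the state of the slaves under that master and of the other Sentinels. The default is 30s.
sentinel parallel-syncs mymaster 1: how many slaves may be pointed at the new master at the same time during a failover. With a large value, although replication does not block the master, many slaves resynchronizing at once increase the master's network and disk I/O load.
sentinel failover-timeout mymaster 180000: defines the failover timeout. The default is 180000 milliseconds, i.e. 3 minutes. Note that this is not the total failover time; it applies to several stages of the failover.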
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# – The time needed to re-start a failover after a previous failover was
# already tried against the same master by a given Sentinel, is two
# times the failover timeout.
#
# – The time needed for a slave replicating to a wrong master according
# to a Sentinel current configuration, to be forced to replicate
# with the right master, is exactly the failover timeout (counting since
# the moment a Sentinel detected the misconfiguration).
#
# – The time needed to cancel a failover that is already in progress but
# did not produced any configuration change (SLAVEOF NO ONE yet not
# acknowledged by the promoted slave).
#
# – The maximum time a failover in progress waits for all the slaves to be
# reconfigured as slaves of the new master. However even after this time
# the slaves will be reconfigured by the Sentinels anyway, but not with
# the exact parallel-syncs progression as specified.
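First scenario: if Redis Sentinel fails to fail over a master, the earliest it will attempt another failover of that master is 2 × failover-timeout later. This shows up in the Sentinel log, where the Sentinels that voted for another leader print the delay (28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018).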
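The time in that log line can be reproduced with simple arithmetic. The following is a small sketch; the function name is hypothetical, and the start time is the second in which Sentinel-3 logged +new-epoch and +vote-for-leader (16:04:04), with the year taken from the log line itself.

```python
from datetime import datetime, timedelta


def next_failover_not_before(start: datetime, failover_timeout_ms: int) -> datetime:
    """Earliest time of the next failover attempt: start + 2 x failover-timeout."""
    return start + timedelta(milliseconds=2 * failover_timeout_ms)


vote_time = datetime(2018, 10, 8, 16, 4, 4)
earliest = next_failover_not_before(vote_time, failover_timeout_ms=180000)
print(earliest.strftime("%H:%M:%S"))  # 16:10:04
```

Two times 180000 ms is 6 minutes, which lands exactly on the 16:10:04 reported in the log.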
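sentinel notification-script: defines a notification script. Sentinel calls it whenever a WARNING-level event occurs, passing two arguments: the event type and the event description.
sentinel client-reconfig-script: the script defined here is called when the master changes, with the following arguments: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
Scripts must follow certain rules.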
# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with “1” the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with “2” (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.
sentinel deny-scripts-reconfig: when set to yes, notification-script and client-reconfig-script cannot be changed at runtime with SENTINEL SET.
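As an illustration of the script rules quoted above, here is a minimal sketch of a notification script body written in Python. It assumes the two arguments described earlier (event type and event description); the log path and the function name handle_event are hypothetical. The return values follow the quoted exit-code rules: 0 for success, 1 to ask Sentinel to retry later, 2 when a retry would not help.

```python
import sys
from typing import List

LOG_PATH = "/tmp/sentinel-events.log"  # hypothetical location, adjust as needed


def handle_event(argv: List[str], log_path: str = LOG_PATH) -> int:
    """Append the event to a log file and return the exit code for Sentinel."""
    if len(argv) != 2:
        # Wrong invocation: retrying will not fix it, so do not ask for a retry.
        return 2
    event_type, description = argv
    try:
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(f"{event_type} {description}\n")
    except OSError:
        # Possibly transient (e.g. disk or permission issue): let Sentinel retry.
        return 1
    return 0


# In a real script file, the last line would hand the result back to Sentinel:
#     sys.exit(handle_event(sys.argv[1:]))
```

Keeping the work short matters here, since the quoted rules state that a script running longer than 60 seconds is killed and retried.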
Common Sentinel operations
• PING: This command simply returns PONG.
• SENTINEL masters: Show a list of monitored masters and their state.
• SENTINEL master <master name>: Show the state and info of the specified master.
• SENTINEL slaves <master name>: Show a list of slaves for this master, and their state.
• SENTINEL sentinels <master name>: Show a list of sentinel instances for this master, and their state.
• SENTINEL get-master-addr-by-name <master name>: Return the ip and port number of the master with that name. If a failover is in progress or terminated successfully for this master it returns the address and port of the promoted slave.
• SENTINEL reset <pattern>: This command will reset all the masters with matching name. The pattern argument is a glob-style pattern. The reset process clears any previous state in a master (including a failover in progress), and removes every slave and sentinel already discovered and associated with the master.
• SENTINEL failover <master name>: Force a failover as if the master was not reachable, and without asking for agreement from other Sentinels (however a new version of the configuration will be published so that the other Sentinels will update their configurations).
• SENTINEL ckquorum <master name>: Check if the current Sentinel configuration is able to reach the quorum needed to failover a master, and the majority needed to authorize the failover. This command should be used in monitoring systems to check if a Sentinel deployment is ok.
• SENTINEL flushconfig: Force Sentinel to rewrite its configuration on disk, including the current Sentinel state. Normally Sentinel rewrites the configuration every time something changes in its state (in the context of the subset of the state which is persisted on disk across restart). However sometimes it is possible that the configuration file is lost because of operation errors, disk failures, package upgrade scripts or configuration managers. In those cases a way to force Sentinel to rewrite the configuration file is handy. This command works even if the previous configuration file is completely missing.
• SENTINEL MONITOR <name> <ip> <port> <quorum>: This command tells the Sentinel to start monitoring a new master with the specified name, ip, port, and quorum. It is identical to the sentinel monitor configuration directive in the sentinel.conf configuration file.
• SENTINEL REMOVE <name>: Used in order to remove the specified master: the master will no longer be monitored, and will totally be removed from the internal state of the Sentinel, so it will no longer be listed by SENTINEL masters and so forth.
• SENTINEL SET <name> <option> <value>: The SET command is very similar to the CONFIG SET command of Redis, and is used in order to change configuration parameters of a specific master. Multiple option / value pairs can be specified (or none at all). All the configuration parameters that can be configured via sentinel.conf are also configurable using the SET command.
sentinel masters
Shows the state of the monitored masters.
127.0.0.1:26379> sentinel masters
1) 1) “name”
2) “mymaster”
3) “ip”
4) “127.0.0.1”
5) “port”
6) “6379”
7) “runid”
8) “6ab2be5db3a37c10f2473c8fb9daed147a32df3e”
9) “flags”
10) “master”
11) “link-pending-commands”
12) “0”
13) “link-refcount”
14) “1”
15) “last-ping-sent”
16) “0”
17) “last-ok-ping-reply”
18) “639”
19) “last-ping-reply”
20) “639”
21) “down-after-milliseconds”
22) “30000”
23) “info-refresh”
24) “2075”
25) “role-reported”
26) “master”
27) “role-reported-time”
28) “759682”
29) “config-epoch”
30) “0”
31) “num-slaves”
32) “2”
33) “num-other-sentinels”
34) “2”
35) “quorum”
36) “2”
37) “failover-timeout”
38) “180000”
39) “parallel-syncs”
40) “1”
You can also view the state of a single master.
sentinel master mymaster
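The reply shown above is a flat list that alternates field names and values. When such a reply is handled in a script, it is usually more convenient as a dictionary. The following is a minimal sketch; pairs_to_dict is a hypothetical helper, and the sample list is a subset of the fields from the sentinel masters output above.

```python
from typing import Dict, List


def pairs_to_dict(flat: List[str]) -> Dict[str, str]:
    """Turn [name1, value1, name2, value2, ...] into {name1: value1, ...}."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of elements (name/value pairs)")
    return dict(zip(flat[0::2], flat[1::2]))


reply = [
    "name", "mymaster",
    "ip", "127.0.0.1",
    "port", "6379",
    "flags", "master",
    "num-slaves", "2",
    "num-other-sentinels", "2",
    "quorum", "2",
]
master = pairs_to_dict(reply)
print(master["name"], master["flags"], int(master["quorum"]))
```

All values arrive as strings, so numeric fields such as quorum or num-slaves need an explicit int() conversion.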
sentinel slaves mymaster
Shows the state of the slaves of a given master.
127.0.0.1:26379> sentinel slaves mymaster
1) 1) “name”
2) “127.0.0.1:6380”
3) “ip”
4) “127.0.0.1”
5) “port”
6) “6380”
7) “runid”
8) “983b87fd070c7f052b26f5135bbb30fdeb170a54”
9) “flags”
10) “slave”
11) “link-pending-commands”
12) “0”
13) “link-refcount”
14) “1”
15) “last-ping-sent”
16) “0”
17) “last-ok-ping-reply”
18) “178”
19) “last-ping-reply”
20) “178”
21) “down-after-milliseconds”
22) “30000”
23) “info-refresh”
24) “6160”
25) “role-reported”
26) “slave”
27) “role-reported-time”
28) “489019”
29) “master-link-down-time”
30) “0”
31) “master-link-status”
32) “ok”
33) “master-host”
34) “127.0.0.1”
35) “master-port”
36) “6379”
37) “slave-priority”
38) “100”
39) “slave-repl-offset”
40) “70375”
2) 1) “name”
2) “127.0.0.1:6381”
3) “ip”
4) “127.0.0.1”
5) “port”
6) “6381”
7) “runid”
8) “b88059cce9104dd4e0366afd6ad07a163dae8b15”
9) “flags”
10) “slave”
11) “link-pending-commands”
12) “0”
13) “link-refcount”
14) “1”
15) “last-ping-sent”
16) “0”
17) “last-ok-ping-reply”
18) “178”
19) “last-ping-reply”
20) “178”
21) “down-after-milliseconds”
22) “30000”
23) “info-refresh”
24) “2918”
25) “role-reported”
26) “slave”
27) “role-reported-time”
28) “489019”
29) “master-link-down-time”
30) “0”
31) “master-link-status”
32) “ok”
33) “master-host”
34) “127.0.0.1”
35) “master-port”
36) “6379”
37) “slave-priority”
38) “100”
39) “slave-repl-offset”
40) “71040”
sentinel sentinels mymaster
Shows the state of the other Sentinels.
127.0.0.1:26379> sentinel sentinels mymaster
1) 1) “name”
2) “738ccbddaa0d4379d89a147613d9aecfec765bcb”
3) “ip”
4) “127.0.0.1”
5) “port”
6) “26381”
7) “runid”
8) “738ccbddaa0d4379d89a147613d9aecfec765bcb”
9) “flags”
10) “sentinel”
11) “link-pending-commands”
12) “0”
13) “link-refcount”
14) “1”
15) “last-ping-sent”
16) “0”
17) “last-ok-ping-reply”
18) “475”
19) “last-ping-reply”
20) “475”
21) “down-after-milliseconds”
22) “30000”
23) “last-hello-message”
24) “79”
25) “voted-leader”
26) “?”
27) “voted-leader-epoch”
28) “0”
2) 1) “name”
2) “7251bb129ca373ad0d8c7baf3b6577ae2593079f”
3) “ip”
4) “127.0.0.1”
5) “port”
6) “26380”
7) “runid”
8) “7251bb129ca373ad0d8c7baf3b6577ae2593079f”
9) “flags”
10) “sentinel”
11) “link-pending-commands”
12) “0”
13) “link-refcount”
14) “1”
15) “last-ping-sent”
16) “0”
17) “last-ok-ping-reply”
18) “475”
19) “last-ping-reply”
20) “475”
21) “down-after-milliseconds”
22) “30000”
23) “last-hello-message”
24) “985”
25) “voted-leader”
26) “?”
27) “voted-leader-epoch”
28) “0”
sentinel get-master-addr-by-name <master name>
Returns the IP address and port of the master named <master name>. If a failover is in progress, it shows the new master's information.
127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) “127.0.0.1”
2) “6379”
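Applications typically use this command to discover the current master before connecting. The following is a minimal sketch using only the Python standard library and the RESP2 wire protocol. The function names are hypothetical, and the single recv call is a simplification that assumes the short reply arrives in one read; a production client would use a maintained Redis client library with Sentinel support instead.

```python
import socket
from typing import List, Optional, Tuple


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def parse_bulk_array(reply: bytes) -> Optional[List[str]]:
    """Parse a RESP2 array of bulk strings; a null array (*-1) yields None."""
    lines = reply.split(b"\r\n")
    if not lines[0].startswith(b"*"):
        raise ValueError(f"unexpected reply: {reply!r}")
    count = int(lines[0][1:])
    if count < 0:
        return None
    # Each element is a "$<len>" line followed by its payload line.
    return [lines[2 + 2 * i].decode() for i in range(count)]


def get_master_addr(host: str, port: int, master_name: str,
                    timeout: float = 1.0) -> Optional[Tuple[str, int]]:
    """Ask one Sentinel for the current master address of master_name."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_command("SENTINEL", "get-master-addr-by-name", master_name))
        reply = conn.recv(4096)  # simplification: assumes the whole reply fits
    addr = parse_bulk_array(reply)
    return None if addr is None else (addr[0], int(addr[1]))
```

With the setup from this article, get_master_addr("127.0.0.1", 26379, "mymaster") would be the call to make; asking more than one Sentinel in turn guards against a single Sentinel being unreachable.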
sentinel reset <pattern>
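Resets the configuration of the masters whose names match <pattern> (glob-style).
If a slave goes down, it remains under Sentinel's management, so after it recovers it rejoins the previous replication topology even if its configuration file contains no slaveof option. Likewise, if the master goes down, it is attached to the previous replication topology as a slave by default after it restarts.
Often, though, we really do want to remove the old master or a slave, and this is where sentinel reset comes in. It resets the configuration based on the current master's state (they’ll refresh the list of slaves within the next 10 seconds, only adding the ones listed as correctly replicating from the current master INFO output). The key point is that slaves in an abnormal state are dropped from the current configuration. As a result, a dropped node does not automatically rejoin the previous replication topology after it recovers (note that the slaveof setting must be removed from its configuration file at that point).
Note that this command only affects the current Sentinel node. To remove a node, run reset on every Sentinel node.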
sentinel failover <master name>
Forces a failover of the master named <master name>. Unlike a regular failover, no Sentinel leader election is needed; the current Sentinel node carries out the failover directly.
sentinel ckquorum <master name>
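Checks whether the number of currently reachable Sentinel nodes reaches <quorum>, and whether the majority needed to authorize a failover is available.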
127.0.0.1:26379> sentinel ckquorum mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached
sentinel flushconfig
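Forces the Sentinel node's configuration to be written to disk. The Sentinel node itself uses this often; developers and operators only need it when the configuration file is damaged or lost for external reasons (for example, a disk failure).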
sentinel remove <master name>
Stops the current Sentinel node from monitoring the master named <master name>.
[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf
port 26379
dir “/tmp”
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 127.0.0.1 6381 2
sentinel config-epoch mymaster 12
sentinel leader-epoch mymaster 0
sentinel known-slave mymaster 127.0.0.1 6380
sentinel known-slave mymaster 127.0.0.1 6379
sentinel known-sentinel mymaster 127.0.0.1 26381 738ccbddaa0d4379d89a147613d9aecfec765bcb
sentinel known-sentinel mymaster 127.0.0.1 26380 7251bb129ca373ad0d8c7baf3b6577ae2593079f
sentinel current-epoch 12
[root@slowtech redis-4.0.11]# redis-cli -p 26379
127.0.0.1:26379> sentinel remove mymaster
OK
127.0.0.1:26379> quit
[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf
port 26379
dir “/tmp”
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel current-epoch 12
sentinel set <name> <option> <value>
Parameter Usage
quorum sentinel set mymaster quorum 3
down-after-milliseconds sentinel set mymaster down-after-milliseconds 30000
failover-timeout sentinel set mymaster failover-timeout 18000
parallel-syncs sentinel set mymaster parallel-syncs 3
notification-script sentinel set mymaster notification-script /tmp/a.sh
client-reconfig-script sentinel set mymaster client-reconfig-script /tmp/b.sh
auth-pass sentinel set mymaster auth-pass masterpassword
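Notes:
1. The sentinel set command only affects the current Sentinel node.
2. If sentinel set succeeds, the configuration file is rewritten immediately. This differs from regular Redis data nodes, where config rewrite must be run after a change to persist it to the configuration file.
3. Keep the configuration of all Sentinel nodes as consistent as possible.
4. Sentinel does not support the config command. To check parameter settings, use the SENTINEL MASTER command.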