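Cluster Self-Healing
Updated: 2024-07-01 14:39:43
Overview
Once self-healing is enabled, users do not need to watch node status while the cluster is running. When a node becomes abnormal (for example, it goes down), the self-healing module automatically attempts to repair it.
Starting self-healing
After the cluster is initialized, the self-healing module is off by default. Start the doctor manually in adbmgr: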
antdb=# start doctor;
mgr_doctor_start
------------------
t
(1 row)
Use the list doctor command to check the result:
antdb=# list doctor;
type | subtype | key | value | comment
-----------+-----------------+----------------+-------+--------------------------------------------------------------------------------------------------------------
PARAMETER | -- | enable | 1 | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
PARAMETER | -- | forceswitch | 0 | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss.
PARAMETER | -- | switchinterval | 30 | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching.
PARAMETER | -- | nodedeadline | 30 | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
PARAMETER | -- | agentdeadline | 5 | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
NODE | gtmcoord master | gcn1 | t | enable doctor
NODE | gtmcoord slave | gcn2 | t | enable doctor
NODE | coordinator | cn1 | t | enable doctor
NODE | coordinator | cn2 | t | enable doctor
NODE | coordinator | cn3 | t | enable doctor
NODE | datanode master | dn1_1 | t | enable doctor
NODE | datanode master | dn2_1 | t | enable doctor
NODE | datanode master | dn3_1 | t | enable doctor
NODE | datanode slave | dn1_2 | t | enable doctor
NODE | datanode slave | dn1_3 | t | enable doctor
NODE | datanode slave | dn2_2 | t | enable doctor
NODE | datanode master | dn4_1 | t | enable doctor
NODE | coordinator | cn4 | t | enable doctor
NODE | datanode slave | dn3_2 | t | enable doctor
NODE | datanode slave | dn4_2 | t | enable doctor
HOST | -- | adb01 | t | enable doctor
HOST | -- | adb02 | t | enable doctor
(22 rows)
The output shows that self-healing monitors two types of component:
- node: each node in the cluster.
- host: the agent process on each host in the cluster.
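When the doctor state needs to be checked from a script, the list doctor output can be split into rows. The following is a minimal sketch, assuming the output is captured as text in the aligned, pipe-separated layout shown above. The function name `parse_list_doctor` and the shortened sample are illustrative and not part of AntDB.

```python
from typing import Dict, List


def parse_list_doctor(output: str) -> List[Dict[str, str]]:
    """Parse pipe-separated `list doctor` output into a list of row dicts."""
    rows = []
    header = None
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the dashed separator and the "(N rows)" footer.
        if not line or line.startswith("-") or line.startswith("("):
            continue
        # Split into at most 5 cells so the comment column stays intact.
        cells = [cell.strip() for cell in line.split("|", 4)]
        if len(cells) != 5:
            continue  # e.g. the "antdb=# list doctor;" prompt line
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Shortened sample in the same layout as the output above.
SAMPLE = """
antdb=# list doctor;
type | subtype | key | value | comment
-----------+-----------------+----------------+-------+---------
PARAMETER | -- | enable | 1 | 0:false, 1:true.
PARAMETER | -- | nodedeadline | 30 | In seconds.
NODE | datanode master | dn1_1 | t | enable doctor
HOST | -- | adb01 | t | enable doctor
(4 rows)
"""

rows = parse_list_doctor(SAMPLE)
params = {r["key"]: r["value"] for r in rows if r["type"] == "PARAMETER"}
monitored = [r["key"] for r in rows
             if r["type"] in ("NODE", "HOST") and r["value"] == "t"]
print(params)
print(monitored)
```

The `params` dictionary gives the global switches, and `monitored` lists the nodes and hosts whose value column is `t`.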
adbmgr starts several doctor processes: one launcher, one node monitor per node, and one host monitor:
[antdb@intel175 ~]$ ps xuf|grep doctor
antdb 193328 0.0 0.0 112716 984 pts/46 S+ 16:02 0:00 | \_ grep --color=auto doctor
antdb 134782 0.0 0.0 359748 7808 ? Ss 14:48 0:02 \_ adbmgr: antdb doctor launcher
antdb 137154 0.0 0.0 358944 6836 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor gcn1
antdb 137155 0.0 0.0 358948 6828 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor gcn2
antdb 137157 0.0 0.0 358948 6852 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn1
antdb 137159 0.0 0.0 358948 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn2
antdb 137163 0.0 0.0 358948 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn3
antdb 137165 0.0 0.0 358948 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn1_1
antdb 137167 0.0 0.0 358948 6872 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn2_1
antdb 137169 0.0 0.0 358952 6856 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn3_1
antdb 137172 0.0 0.0 358952 6860 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn1_2
antdb 137175 0.0 0.0 358952 6860 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn1_3
antdb 137177 0.0 0.0 358952 6860 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn2_2
antdb 137180 0.0 0.0 358952 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn4_1
antdb 137183 0.0 0.0 358952 6856 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn4
antdb 137186 0.0 0.0 358956 6852 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn3_2
antdb 137189 0.0 0.0 358956 6852 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn4_2
antdb 137191 0.0 0.0 358948 5888 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor host monitor
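To confirm that every node has its own monitor process, the ps listing can be checked against a list of expected node names. This is a rough sketch, assuming the process titles keep the form shown above (`adbmgr: antdb doctor ...`). The helper name `doctor_processes` and the `expected` set are made up for illustration.

```python
import re

# Matches the doctor process titles as they appear in the ps listing above.
DOCTOR_RE = re.compile(
    r"adbmgr: antdb doctor (launcher|host monitor|node monitor (\S+))\s*$"
)


def doctor_processes(ps_output):
    """Return (launcher_seen, host_monitor_seen, set_of_monitored_node_names)."""
    launcher = False
    host_monitor = False
    nodes = set()
    for line in ps_output.splitlines():
        match = DOCTOR_RE.search(line)
        if match is None:
            continue  # e.g. the grep process itself
        kind = match.group(1)
        if kind == "launcher":
            launcher = True
        elif kind == "host monitor":
            host_monitor = True
        else:
            nodes.add(match.group(2))
    return launcher, host_monitor, nodes


# A few lines copied from the listing above (raw string because of "\_").
PS_SAMPLE = r"""
antdb 193328 0.0 0.0 112716 984 pts/46 S+ 16:02 0:00 | \_ grep --color=auto doctor
antdb 134782 0.0 0.0 359748 7808 ? Ss 14:48 0:02 \_ adbmgr: antdb doctor launcher
antdb 137154 0.0 0.0 358944 6836 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor gcn1
antdb 137167 0.0 0.0 358948 6872 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn2_1
antdb 137191 0.0 0.0 358948 5888 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor host monitor
"""

launcher, host_monitor, nodes = doctor_processes(PS_SAMPLE)
expected = {"gcn1", "cn1", "dn2_1"}  # hypothetical list of nodes to verify
missing = expected - nodes
print(launcher, host_monitor, sorted(nodes), sorted(missing))
```

In this shortened sample `cn1` is reported as missing only because its line was left out of `PS_SAMPLE`. In the full listing above every node has a monitor.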
Stopping self-healing
Run stop doctor; in adbmgr to turn self-healing off:
antdb=# stop doctor;
NOTICE: Update pgxc_node successfully in 'gcn1'.
NOTICE: Update pgxc_node successfully in 'cn1'.
NOTICE: Update pgxc_node successfully in 'cn2'.
NOTICE: Update pgxc_node successfully in 'cn3'.
NOTICE: Update pgxc_node successfully in 'cn4'.
NOTICE: Updating pgxc_node successfully at all datanode master.
mgr_doctor_stop
-----------------
t
(1 row)
antdb=# list doctor;
type | subtype | key | value | comment
-----------+-----------------+----------------+-------+--------------------------------------------------------------------------------------------------------------
PARAMETER | -- | enable | 0 | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
PARAMETER | -- | forceswitch | 1 | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss.
PARAMETER | -- | switchinterval | 10 | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching.
PARAMETER | -- | nodedeadline | 10 | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
PARAMETER | -- | agentdeadline | 5 | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
NODE | gtmcoord master | gcn1 | t | enable doctor
NODE | gtmcoord slave | gcn2 | t | enable doctor
NODE | coordinator | cn1 | t | enable doctor
NODE | coordinator | cn2 | t | enable doctor
NODE | coordinator | cn3 | t | enable doctor
NODE | datanode master | dn1_1 | t | enable doctor
NODE | datanode master | dn2_1 | t | enable doctor
NODE | datanode master | dn3_1 | t | enable doctor
NODE | datanode slave | dn1_2 | t | enable doctor
NODE | datanode slave | dn1_3 | t | enable doctor
NODE | datanode slave | dn2_2 | t | enable doctor
NODE | datanode master | dn4_1 | t | enable doctor
NODE | coordinator | cn4 | t | enable doctor
NODE | datanode slave | dn3_2 | t | enable doctor
NODE | datanode slave | dn4_2 | t | enable doctor
HOST | -- | adb01 | t | enable doctor
HOST | -- | adb02 | t | enable doctor
(22 rows)
[antdb@intel175 ~]$ ps xuf|grep doctor
antdb 2435 0.0 0.0 112716 984 pts/46 S+ 16:09 0:00 | \_ grep --color=auto doctor
After stop completes, the doctor parameter enable is 0 and no doctor processes remain.
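Modifying parameters
Use the set doctor command to change the doctor's global parameters. The parameters have the following meanings:
- enable: the master switch. 1 is on, 0 is off. Default: 0.
- forceswitch: whether to force a switchover of an abnormal node. 1 is yes, 0 is no. Default: 0. A forced switch may cause data loss.
- switchinterval: how long to wait, at most, before retrying after the previous self-healing attempt failed. Default: 30 s.
- nodedeadline: how long a node may stay faulty, at most, before self-healing starts. Default: 30 s.
- agentdeadline: how long an agent process may stay faulty, at most, before self-healing starts. Default: 5 s.
Resource parameters:
- switch_on_cpu_usage_pct: CPU usage threshold that triggers a switchover. Range: 80% to 100%. 0 disables the check.
- switch_on_mem_usage_pct: memory usage threshold that triggers a switchover. Range: 80% to 100%. 0 disables the check.
- switch_on_disk_usage_pct: disk usage threshold that triggers a switchover. Range: 80% to 100%. 0 disables the check.
- switch_on_io_usage_pct: I/O usage threshold that triggers a switchover. Range: 80% to 100%. 0 disables the check.
- switch_on_net_usage_pct: network bandwidth usage threshold that triggers a switchover. Range: 80% to 100%. 0 disables the check.
- switch_on_disk_corruption_enable: whether to switch master/slave when the disk is corrupted. 1 is yes, 0 is no. Default: 0.
Use the following commands to modify the parameters: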
set doctor (switchinterval=10);
set doctor (nodedeadline=10);
set doctor (forceswitch=1);
set doctor (switch_on_cpu_usage_pct=80);
set doctor (switch_on_mem_usage_pct=80);
set doctor (switch_on_disk_usage_pct=80);
set doctor (switch_on_io_usage_pct=80);
set doctor (switch_on_net_usage_pct=80);
set doctor (switch_on_disk_corruption_enable=1);
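If these statements are generated by a script, the documented value ranges can be checked before anything is sent to adbmgr. The following is a minimal sketch under stated assumptions: the helper `build_set_doctor` is hypothetical, the requirement that the timing parameters be positive integers is an assumption (the document gives only their defaults), and adbmgr itself may enforce different limits. The `enable` switch is left out because this document toggles it with start doctor and stop doctor.

```python
# Usage thresholds: 0 (disabled) or 80..100, as described above.
PCT_PARAMS = {
    "switch_on_cpu_usage_pct",
    "switch_on_mem_usage_pct",
    "switch_on_disk_usage_pct",
    "switch_on_io_usage_pct",
    "switch_on_net_usage_pct",
}
# 0/1 switches.
BOOL_PARAMS = {"forceswitch", "switch_on_disk_corruption_enable"}
# Timing parameters in seconds (positivity is an assumption, see lead-in).
SECONDS_PARAMS = {"switchinterval", "nodedeadline", "agentdeadline"}


def build_set_doctor(name, value):
    """Validate one parameter and return the matching `set doctor` statement."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("value must be an integer")
    if name in PCT_PARAMS:
        if value != 0 and not 80 <= value <= 100:
            raise ValueError(f"{name} must be 0 or between 80 and 100")
    elif name in BOOL_PARAMS:
        if value not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1")
    elif name in SECONDS_PARAMS:
        if value <= 0:
            raise ValueError(f"{name} must be a positive number of seconds")
    else:
        raise ValueError(f"unknown doctor parameter: {name}")
    return f"set doctor ({name}={value});"


statement = build_set_doctor("switch_on_cpu_usage_pct", 80)
print(statement)

try:
    build_set_doctor("switch_on_cpu_usage_pct", 50)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

The generated string is the same statement shown in the list above. Executing it against adbmgr is outside the scope of this sketch.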
After the change, check that it has taken effect:
antdb=# list doctor;
type | subtype | key | value | comment
-----------+-----------------+----------------------------------+-------+--------------------------------------------------------------------------------------------------------------
PARAMETER | -- | enable | 0 | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
PARAMETER | -- | forceswitch | 1 | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause DATA LOSS.
PARAMETER | -- | rewindoldmaster | 0 | 0:false, 1:true. Whether rewind old master after switched, note that rewind old master may cause DATA LOSS.
PARAMETER | -- | switchinterval | 10 | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching.
PARAMETER | -- | nodedeadline | 10 | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
PARAMETER | -- | agentdeadline | 5 | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
PARAMETER | -- | switch_on_cpu_usage_pct | 80 | The CPU usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
PARAMETER | -- | switch_on_mem_usage_pct | 80 | The memory usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
PARAMETER | -- | switch_on_disk_usage_pct | 80 | The disk usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
PARAMETER | -- | switch_on_io_usage_pct | 80 | The io usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
PARAMETER | -- | switch_on_net_usage_pct | 80 | The net usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
PARAMETER | -- | switch_on_disk_corruption_enable | 1 | 0:false, 1:true. Whether switch the master/slave when disk corrupt.
NODE | gtmcoord master | gtmcoord | t | enable doctor
NODE | gtmcoord slave | gcs1 | t | enable doctor
NODE | coordinator | cn3 | t | enable doctor
NODE | datanode master | dn1 | t | enable doctor
NODE | coordinator | cn2 | t | enable doctor
NODE | coordinator | cn1 | t | enable doctor
NODE | datanode master | dn2 | t | enable doctor
NODE | datanode master | dn3 | t | enable doctor
HOST | -- | host227 | t | enable doctor
HOST | -- | host228 | t | enable doctor
HOST | -- | host214 | t | enable doctor
(23 rows)
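Self-healing in action
With self-healing enabled, if a node fails, the doctor attempts to restart it, switch over, or take similar action to keep the business running.
Below we kill a datanode master node and watch whether it recovers automatically.
- Kill the node
Select dn2_1 and kill it: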
antdb=# monitor datanode master dn2_1;
nodename | nodetype | status | description | host | port | recovery | boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-------------------------------
dn2_1 | datanode master | t | running | 10.21.20.175 | 52541 | false | 2019-10-16 16:20:06.225503+08
(1 row)
[antdb@intel175 ~]$ ps xuf|grep dn2_1
antdb 35846 0.0 0.0 112712 980 pts/56 S+ 16:54 0:00 \_ grep --color=auto dn2_1
antdb 11456 0.0 0.0 442624 92208 ? S 16:20 0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i
antdb 12788 0.0 0.0 358948 6908 ? Ss 16:22 0:00 \_ adbmgr: antdb doctor node monitor dn2_1
[antdb@intel175 ~]$ kill -9 11456
antdb=# monitor datanode master dn2_1;
WARNING: datanode master dn2_1 recovery status is unknown
nodename | nodetype | status | description | host | port | recovery | boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-----------
dn2_1 | datanode master | f | not running | 10.21.20.175 | 52541 | unknown | unknow
(1 row)
- Observe the node status
Wait a few seconds, then check the node status again:
antdb=# monitor datanode master dn2_1;
nodename | nodetype | status | description | host | port | recovery | boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-------------------------------
dn2_1 | datanode master | t | running | 10.21.20.175 | 52541 | false | 2019-10-16 16:55:10.935821+08
(1 row)
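Rather than re-running monitor by hand, the wait can be scripted as a polling loop. The sketch below shows only the loop. The `check` callable is a placeholder: in practice it would run monitor datanode master dn2_1; against adbmgr and return True when the status column is `t`, and how that connection is made is deliberately left out. The names `wait_until` and `fake_check` are illustrative.

```python
import time


def wait_until(check, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or `timeout` seconds have passed.

    Returns the number of seconds waited, or None if the timeout was reached.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


# Stand-in for the real status query: not running twice, then running.
statuses = iter(["f", "f", "t"])
attempts = []


def fake_check():
    status = next(statuses)
    attempts.append(status)
    return status == "t"


# sleep is replaced by a no-op so the example finishes immediately.
waited = wait_until(fake_check, timeout=30, interval=1,
                    sleep=lambda seconds: None)
print(len(attempts), waited is not None)
```

The `sleep` and `clock` parameters are injectable so the loop can be exercised without real delays. With the defaults it polls once per second.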
The node has recovered, and its process is visible again:
[antdb@intel175 ~]$ ps xuf|grep dn2_1
antdb 36484 0.0 0.0 112712 980 pts/56 S+ 16:55 0:00 \_ grep --color=auto dn2_1
antdb 36441 1.8 0.0 442624 92212 ? S 16:55 0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i
antdb 12788 0.0 0.0 359084 7664 ? Ss 16:22 0:00 \_ adbmgr: antdb doctor node monitor dn2_1
The corresponding adbmgr log entries:
2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
",,,,,,,,,""
2019-10-16 16:55:05.818 CST,,,12788,,5da6d32e.31f4,7,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
Is the server running on host ""10.21.20.175"" and accepting
TCP/IP connections on port 52541?
",,,,,,,,,""
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,8,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
Is the server running on host ""10.21.20.175"" and accepting
TCP/IP connections on port 52541?
",,,,,,,,,""
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,9,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node crashed",,,,,,,,,""
2019-10-16 16:55:10.826 CST,,,12788,,5da6d32e.31f4,10,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"antdb doctor node monitor dn2_1, try to startup node",,,,,,,,,""
2019-10-16 16:55:11.044 CST,,,12788,,5da6d32e.31f4,11,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"start dn2_1 /data/antdb/data/adb50/d1/dn2_1 successfully",,,,,,,,,""
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,12,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, startup node successfully",,,,,,,,,""
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,13,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, reset node monitor",,,,,,,,,""
2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,""
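The log shows that dn2_1 returned to normal about 7.8 seconds after the first failed connection check (16:55:03.315 to 16:55:11.092), with no manual intervention.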
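The recovery time can be read from the log timestamps. The sketch below uses a subset of the entries above, reduced to timestamp and message, and computes how long detection and restart took. The variable names are illustrative.

```python
from datetime import datetime

# (timestamp, message) pairs taken from the adbmgr log above.
EVENTS = [
    ("2019-10-16 16:55:03.315", "CONNECT_FAIL"),
    ("2019-10-16 16:55:05.818", "CONNECT_FAIL"),
    ("2019-10-16 16:55:10.824", "CONNECT_FAIL"),
    ("2019-10-16 16:55:10.824", "node crashed"),
    ("2019-10-16 16:55:10.826", "try to startup node"),
    ("2019-10-16 16:55:11.086", "startup node successfully"),
    ("2019-10-16 16:55:11.092", "node running normally"),
]


def first_time(events, message):
    """Return the timestamp of the first event with the given message."""
    for stamp, text in events:
        if text == message:
            return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    raise ValueError(f"no event with message {message!r}")


first_failure = first_time(EVENTS, "CONNECT_FAIL")
crashed = first_time(EVENTS, "node crashed")
startup = first_time(EVENTS, "try to startup node")
normal = first_time(EVENTS, "node running normally")

detection = (crashed - first_failure).total_seconds()
restart = (normal - startup).total_seconds()
total = (normal - first_failure).total_seconds()
print(f"detection: {detection:.3f}s, restart: {restart:.3f}s, total: {total:.3f}s")
# prints "detection: 7.509s, restart: 0.266s, total: 7.777s"
```

Most of the elapsed time is spent confirming that the node is really down. The restart itself takes well under a second in this log.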