
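Cluster Self-Healing

Last updated: 2024-07-01 14:39:43

Overview

Once self-healing is enabled, users do not need to track node status while the cluster is running. When a node becomes abnormal (for example, it goes down), the self-healing module automatically attempts to repair it.

Starting Self-Healing

After the cluster is initialized, the self-healing module is off by default. Start doctor manually in adbmgr: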

antdb=# start doctor; 
 mgr_doctor_start  
------------------ 
 t 
(1 row) 

Use the list doctor command to view the current state:

antdb=# list doctor; 
   type    |     subtype     |      key       | value |                                                   comment                                                     
-----------+-----------------+----------------+-------+-------------------------------------------------------------------------------------------------------------- 
 PARAMETER | --              | enable         | 1     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit. 
 PARAMETER | --              | forceswitch    | 0     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss. 
 PARAMETER | --              | switchinterval | 30    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching. 
 PARAMETER | --              | nodedeadline   | 30    | In seconds. The maximum time for doctor tolerate a NODE running abnormally. 
 PARAMETER | --              | agentdeadline  | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally. 
 NODE      | gtmcoord master | gcn1           | t     | enable doctor 
 NODE      | gtmcoord slave  | gcn2           | t     | enable doctor 
 NODE      | coordinator     | cn1            | t     | enable doctor 
 NODE      | coordinator     | cn2            | t     | enable doctor 
 NODE      | coordinator     | cn3            | t     | enable doctor 
 NODE      | datanode master | dn1_1          | t     | enable doctor 
 NODE      | datanode master | dn2_1          | t     | enable doctor 
 NODE      | datanode master | dn3_1          | t     | enable doctor 
 NODE      | datanode slave  | dn1_2          | t     | enable doctor 
 NODE      | datanode slave  | dn1_3          | t     | enable doctor 
 NODE      | datanode slave  | dn2_2          | t     | enable doctor 
 NODE      | datanode master | dn4_1          | t     | enable doctor 
 NODE      | coordinator     | cn4            | t     | enable doctor 
 NODE      | datanode slave  | dn3_2          | t     | enable doctor 
 NODE      | datanode slave  | dn4_2          | t     | enable doctor 
 HOST      | --              | adb01          | t     | enable doctor 
 HOST      | --              | adb02          | t     | enable doctor 
(22 rows) 
 

The output shows that self-healing monitors two types of components:

  • node: each node in the cluster.
  • host: the agent process on each host in the cluster.
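For scripted checks, the table printed by list doctor can be split into its PARAMETER, NODE and HOST rows. The following is a minimal sketch, not an official tool: it assumes the output has already been captured as text, and the function name parse_list_doctor and the shortened sample table are illustrative only.

```python
from typing import Dict, List


def parse_list_doctor(output: str) -> List[Dict[str, str]]:
    """Parse psql-style `list doctor` output into one dict per data row."""
    columns = ("type", "subtype", "key", "value", "comment")
    rows = []
    for line in output.splitlines():
        # Split into at most five cells so a "|" inside the comment is preserved.
        cells = [cell.strip() for cell in line.split("|", 4)]
        # Skip the header, the separator line and the "(N rows)" footer.
        if len(cells) == 5 and cells[0] in ("PARAMETER", "NODE", "HOST"):
            rows.append(dict(zip(columns, cells)))
    return rows


# Shortened sample in the same layout as the output above (comments abbreviated).
sample = """\
   type    |     subtype     |      key       | value | comment
-----------+-----------------+----------------+-------+---------
 PARAMETER | --              | enable         | 1     | 0:false, 1:true.
 PARAMETER | --              | nodedeadline   | 30    | In seconds.
 NODE      | datanode master | dn1_1          | t     | enable doctor
 HOST      | --              | adb01          | t     | enable doctor
(4 rows)
"""

rows = parse_list_doctor(sample)
params = {r["key"]: r["value"] for r in rows if r["type"] == "PARAMETER"}
monitored_nodes = [r["key"] for r in rows if r["type"] == "NODE"]
print(params)
print(monitored_nodes)
```

Values are kept as strings so that the node and host flags (t) and the numeric parameters can be handled uniformly by the caller.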

adbmgr then starts several doctor processes:

[antdb@intel175 ~]$ ps xuf|grep doctor 
antdb   193328  0.0  0.0 112716   984 pts/46   S+   16:02   0:00  |   \_ grep --color=auto doctor 
antdb   134782  0.0  0.0 359748  7808 ?        Ss   14:48   0:02  \_ adbmgr: antdb doctor launcher    
antdb   137154  0.0  0.0 358944  6836 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor gcn1    
antdb   137155  0.0  0.0 358948  6828 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor gcn2    
antdb   137157  0.0  0.0 358948  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn1    
antdb   137159  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn2    
antdb   137163  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn3    
antdb   137165  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_1    
antdb   137167  0.0  0.0 358948  6872 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn2_1    
antdb   137169  0.0  0.0 358952  6856 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn3_1    
antdb   137172  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_2    
antdb   137175  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_3    
antdb   137177  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn2_2    
antdb   137180  0.0  0.0 358952  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn4_1    
antdb   137183  0.0  0.0 358952  6856 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn4    
antdb   137186  0.0  0.0 358956  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn3_2    
antdb   137189  0.0  0.0 358956  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn4_2    
antdb   137191  0.0  0.0 358948  5888 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor host monitor    
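In the listing above there is one launcher, one node monitor for each node, and one host monitor. A small sketch like the following could summarize captured ps output in the same way; the process-title pattern is taken from the listing, while the function name and the idea of checking the counts are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Tuple

# Process titles as they appear in the listing above.
DOCTOR_RE = re.compile(
    r"adbmgr: antdb doctor (launcher|node monitor|host monitor)\s*(\S*)"
)


def summarize_doctor_processes(ps_output: str) -> Tuple[Counter, List[str]]:
    """Count doctor processes by role and collect the monitored node names."""
    roles = Counter()
    nodes = []
    for line in ps_output.splitlines():
        match = DOCTOR_RE.search(line)
        if match is None:
            continue  # for example, the grep process itself
        role, name = match.group(1), match.group(2)
        roles[role] += 1
        if role == "node monitor":
            nodes.append(name)
    return roles, nodes


# A few lines copied from the listing above.
sample = """\
antdb   193328  0.0  0.0 112716   984 pts/46   S+   16:02   0:00  |   \\_ grep --color=auto doctor
antdb   134782  0.0  0.0 359748  7808 ?        Ss   14:48   0:02  \\_ adbmgr: antdb doctor launcher
antdb   137154  0.0  0.0 358944  6836 ?        Ss   14:49   0:00  \\_ adbmgr: antdb doctor node monitor gcn1
antdb   137157  0.0  0.0 358948  6852 ?        Ss   14:49   0:00  \\_ adbmgr: antdb doctor node monitor cn1
antdb   137191  0.0  0.0 358948  5888 ?        Ss   14:49   0:00  \\_ adbmgr: antdb doctor host monitor
"""

roles, nodes = summarize_doctor_processes(sample)
print(dict(roles))
print(nodes)
```

Comparing the collected node names with the NODE rows of list doctor is one way to spot a monitor that failed to start.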
  

Stopping Self-Healing

Run stop doctor; in adbmgr to stop self-healing:

antdb=# stop doctor; 
NOTICE:  Update pgxc_node successfully in 'gcn1'. 
NOTICE:  Update pgxc_node successfully in 'cn1'. 
NOTICE:  Update pgxc_node successfully in 'cn2'. 
NOTICE:  Update pgxc_node successfully in 'cn3'. 
NOTICE:  Update pgxc_node successfully in 'cn4'. 
NOTICE:  Updating pgxc_node successfully at all datanode master. 
 mgr_doctor_stop  
----------------- 
 t 
(1 row) 
 
antdb=# list doctor; 
   type    |     subtype     |      key       | value |                                                   comment                                                     
-----------+-----------------+----------------+-------+-------------------------------------------------------------------------------------------------------------- 
 PARAMETER | --              | enable         | 0     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit. 
 PARAMETER | --              | forceswitch    | 1     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss. 
 PARAMETER | --              | switchinterval | 10    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching. 
 PARAMETER | --              | nodedeadline   | 10    | In seconds. The maximum time for doctor tolerate a NODE running abnormally. 
 PARAMETER | --              | agentdeadline  | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally. 
 NODE      | gtmcoord master | gcn1           | t     | enable doctor 
 NODE      | gtmcoord slave  | gcn2           | t     | enable doctor 
 NODE      | coordinator     | cn1            | t     | enable doctor 
 NODE      | coordinator     | cn2            | t     | enable doctor 
 NODE      | coordinator     | cn3            | t     | enable doctor 
 NODE      | datanode master | dn1_1          | t     | enable doctor 
 NODE      | datanode master | dn2_1          | t     | enable doctor 
 NODE      | datanode master | dn3_1          | t     | enable doctor 
 NODE      | datanode slave  | dn1_2          | t     | enable doctor 
 NODE      | datanode slave  | dn1_3          | t     | enable doctor 
 NODE      | datanode slave  | dn2_2          | t     | enable doctor 
 NODE      | datanode master | dn4_1          | t     | enable doctor 
 NODE      | coordinator     | cn4            | t     | enable doctor 
 NODE      | datanode slave  | dn3_2          | t     | enable doctor 
 NODE      | datanode slave  | dn4_2          | t     | enable doctor 
 HOST      | --              | adb01          | t     | enable doctor 
 HOST      | --              | adb02          | t     | enable doctor 
(22 rows) 
 
[antdb@intel175 ~]$ ps xuf|grep doctor 
antdb     2435  0.0  0.0 112716   984 pts/46   S+   16:09   0:00  |   \_ grep --color=auto doctor 

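After stop doctor completes, the doctor parameter enable is 0 and no doctor processes remain.

Modifying Parameters

Use the set doctor command to modify doctor's global parameters. The parameters are:

  • enable: the master switch; 1 turns self-healing on, 0 turns it off. Default: 0.
  • forceswitch: whether to force a master/slave switch for an abnormal node; 1 means yes, 0 means no. Default: 0. Note that a forced switch may cause data loss.
  • switchinterval: how long doctor waits before retrying after the previous self-healing attempt failed. Default: 30 s.
  • nodedeadline: the maximum time doctor tolerates a node fault before it starts self-healing. Default: 30 s.
  • agentdeadline: the maximum time doctor tolerates an agent process fault before it starts self-healing. Default: 5 s.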

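Resource parameters:

  • switch_on_cpu_usage_pct: the CPU usage threshold that triggers a switch; valid range 80% to 100%, 0 means disabled.
  • switch_on_mem_usage_pct: the memory usage threshold that triggers a switch; valid range 80% to 100%, 0 means disabled.
  • switch_on_disk_usage_pct: the disk usage threshold that triggers a switch; valid range 80% to 100%, 0 means disabled.
  • switch_on_io_usage_pct: the I/O usage threshold that triggers a switch; valid range 80% to 100%, 0 means disabled.
  • switch_on_net_usage_pct: the network bandwidth usage threshold that triggers a switch; valid range 80% to 100%, 0 means disabled.
  • switch_on_disk_corruption_enable: whether to switch master/slave when the disk is corrupted; 1 means yes, 0 means no. Default: 0.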
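The value rules listed above can be checked before a change is sent to adbmgr. This is a sketch under stated assumptions: the function name check_doctor_param is hypothetical, the rule for the time parameters (any positive number of seconds) is an assumption because no valid range is documented here, and rewindoldmaster is treated as a 0/1 switch only because list doctor describes it that way.

```python
from typing import Optional

PERCENT_PARAMS = {
    "switch_on_cpu_usage_pct",
    "switch_on_mem_usage_pct",
    "switch_on_disk_usage_pct",
    "switch_on_io_usage_pct",
    "switch_on_net_usage_pct",
}
SWITCH_PARAMS = {
    "enable",
    "forceswitch",
    "rewindoldmaster",
    "switch_on_disk_corruption_enable",
}
SECONDS_PARAMS = {"switchinterval", "nodedeadline", "agentdeadline"}


def check_doctor_param(key: str, value: int) -> Optional[str]:
    """Return a problem description, or None if the value looks acceptable."""
    if key in PERCENT_PARAMS:
        # Documented rule: 0 disables the check, otherwise 80 to 100 percent.
        if value == 0 or 80 <= value <= 100:
            return None
        return f"{key} must be 0 (disabled) or between 80 and 100, got {value}"
    if key in SWITCH_PARAMS:
        return None if value in (0, 1) else f"{key} must be 0 or 1, got {value}"
    if key in SECONDS_PARAMS:
        # Assumption: no range is documented, so only non-positive values are rejected.
        if value > 0:
            return None
        return f"{key} must be a positive number of seconds, got {value}"
    return f"unknown doctor parameter: {key}"


for key, value in [
    ("switch_on_cpu_usage_pct", 80),
    ("switch_on_mem_usage_pct", 50),
    ("forceswitch", 2),
]:
    print(key, value, "->", check_doctor_param(key, value) or "ok")
```

adbmgr itself remains the authority on what it accepts; a check like this only catches obvious mistakes earlier.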

The commands for modifying parameters are as follows:

set doctor (switchinterval=10); 
set doctor (nodedeadline=10); 
set doctor (forceswitch=1); 
set doctor (switch_on_cpu_usage_pct=80);
set doctor (switch_on_mem_usage_pct=80);
set doctor (switch_on_disk_usage_pct=80);
set doctor (switch_on_io_usage_pct=80); 
set doctor (switch_on_net_usage_pct=80);
set doctor (switch_on_disk_corruption_enable=1);
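When several parameters are changed together, the statements can be generated from a single mapping so that the intended values live in one place. The sketch below only builds text in the one-parameter-per-statement form used above; the function name build_set_doctor is hypothetical, and how the statements are then sent to adbmgr is left to the caller.

```python
import re
from typing import Dict, List


def build_set_doctor(changes: Dict[str, int]) -> List[str]:
    """Build one `set doctor (key=value);` statement per parameter."""
    statements = []
    for key, value in changes.items():
        # Accept only plain lowercase identifiers so arbitrary text cannot be injected.
        if not re.fullmatch(r"[a-z_]+", key):
            raise ValueError(f"unexpected parameter name: {key!r}")
        statements.append(f"set doctor ({key}={int(value)});")
    return statements


changes = {"switchinterval": 10, "nodedeadline": 10, "forceswitch": 1}
for statement in build_set_doctor(changes):
    print(statement)
```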

After the changes, check that they have taken effect:

antdb=# list doctor;
   type    |     subtype     |               key                | value |                                                   comment
-----------+-----------------+----------------------------------+-------+--------------------------------------------------------------------------------------------------------------
 PARAMETER | --              | enable                           | 0     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
 PARAMETER | --              | forceswitch                      | 1     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause DATA LOSS.
 PARAMETER | --              | rewindoldmaster                  | 0     | 0:false, 1:true. Whether rewind old master after switched, note that rewind old master may cause DATA LOSS.
 PARAMETER | --              | switchinterval                   | 10    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching.
 PARAMETER | --              | nodedeadline                     | 10    | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
 PARAMETER | --              | agentdeadline                    | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
 PARAMETER | --              | switch_on_cpu_usage_pct          | 80    | The CPU usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_mem_usage_pct          | 80    | The memory usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_disk_usage_pct         | 80    | The disk usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_io_usage_pct           | 80    | The io usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_net_usage_pct          | 80    | The net usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_disk_corruption_enable | 1     | 0:false, 1:true. Whether switch the master/slave when disk corrupt.
 NODE      | gtmcoord master | gtmcoord                         | t     | enable doctor
 NODE      | gtmcoord slave  | gcs1                             | t     | enable doctor
 NODE      | coordinator     | cn3                              | t     | enable doctor
 NODE      | datanode master | dn1                              | t     | enable doctor
 NODE      | coordinator     | cn2                              | t     | enable doctor
 NODE      | coordinator     | cn1                              | t     | enable doctor
 NODE      | datanode master | dn2                              | t     | enable doctor
 NODE      | datanode master | dn3                              | t     | enable doctor
 HOST      | --              | host227                          | t     | enable doctor
 HOST      | --              | host228                          | t     | enable doctor
 HOST      | --              | host214                          | t     | enable doctor
(23 rows)
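The comparison of intended and reported values can also be done mechanically. A minimal sketch, assuming the reported values have already been collected into a dictionary keyed by parameter name (the dictionaries below repeat a few values from the output above; the function name unapplied_changes is hypothetical):

```python
from typing import Dict, Optional, Tuple


def unapplied_changes(
    expected: Dict[str, int], reported: Dict[str, str]
) -> Dict[str, Tuple[int, Optional[str]]]:
    """Return {key: (expected, reported)} for every value that does not match."""
    mismatches = {}
    for key, want in expected.items():
        got = reported.get(key)  # None if the parameter is not listed at all
        if got is None or got.strip() != str(want):
            mismatches[key] = (want, got)
    return mismatches


expected = {"switchinterval": 10, "nodedeadline": 10, "switch_on_cpu_usage_pct": 80}
reported = {
    "enable": "0",
    "switchinterval": "10",
    "nodedeadline": "10",
    "switch_on_cpu_usage_pct": "80",
}
print(unapplied_changes(expected, reported))
```

An empty result means every requested value is reflected in the list doctor output.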

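Self-Healing Example

With self-healing enabled, when a node fails, doctor attempts actions such as restarting the node or switching over to keep the service available.

Below, we kill a datanode master node and observe whether it recovers automatically.

  • Kill the node

Choose dn2_1 as the node to kill: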

antdb=# monitor datanode master dn2_1; 
 nodename |    nodetype     | status | description |     host     | port  | recovery |           boot time            
----------+-----------------+--------+-------------+--------------+-------+----------+------------------------------- 
 dn2_1    | datanode master | t      | running     | 10.21.20.175 | 52541 | false    | 2019-10-16 16:20:06.225503+08 
(1 row) 
 
[antdb@intel175 ~]$ ps xuf|grep dn2_1 
antdb    35846  0.0  0.0 112712   980 pts/56   S+   16:54   0:00      \_ grep --color=auto dn2_1 
antdb    11456  0.0  0.0 442624 92208 ?        S    16:20   0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i 
antdb    12788  0.0  0.0 358948  6908 ?        Ss   16:22   0:00  \_ adbmgr: antdb doctor node monitor dn2_1     
[antdb@intel175 ~]$ kill -9 11456 
 
antdb=# monitor datanode master dn2_1; 
WARNING:  datanode master dn2_1 recovery status is unknown 
 nodename |    nodetype     | status | description |     host     | port  | recovery | boot time  
----------+-----------------+--------+-------------+--------------+-------+----------+----------- 
 dn2_1    | datanode master | f      | not running | 10.21.20.175 | 52541 | unknown  | unknow 
(1 row) 
  • Observe the node status

Wait a few seconds, then check the node status again:

antdb=# monitor datanode master dn2_1; 
 nodename |    nodetype     | status | description |     host     | port  | recovery |           boot time            
----------+-----------------+--------+-------------+--------------+-------+----------+------------------------------- 
 dn2_1    | datanode master | t      | running     | 10.21.20.175 | 52541 | false    | 2019-10-16 16:55:10.935821+08 
(1 row) 
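Instead of checking by hand, a script can poll until the node reports running or a timeout expires. The sketch below shows only the polling loop. The check callable is a placeholder assumption: how the monitor command is actually run and its status column read is not covered here, so a fake sequence of results stands in for it.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 1.0) -> bool:
    """Call check() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real status check: the node is down twice, then running.
fake_statuses = iter([False, False, True])
recovered = wait_until(lambda: next(fake_statuses), timeout=5.0, interval=0.01)
print("recovered" if recovered else "still down")
```

A timeout somewhat longer than nodedeadline is a natural choice, since doctor does not act before that deadline passes.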

The node has recovered, and its process is visible again:

[antdb@intel175 ~]$ ps xuf|grep dn2_1 
antdb    36484  0.0  0.0 112712   980 pts/56   S+   16:55   0:00      \_ grep --color=auto dn2_1 
antdb    36441  1.8  0.0 442624 92212 ?        S    16:55   0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i 
antdb    12788  0.0  0.0 359084  7664 ?        Ss   16:22   0:00  \_ adbmgr: antdb doctor node monitor dn2_1    

The corresponding adbmgr log entries:

2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:server closed the connection unexpectedly 
        This probably means the server terminated abnormally 
        before or while processing the request. 
",,,,,,,,,"" 
2019-10-16 16:55:05.818 CST,,,12788,,5da6d32e.31f4,7,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused 
        Is the server running on host ""10.21.20.175"" and accepting 
        TCP/IP connections on port 52541? 
",,,,,,,,,"" 
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,8,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused 
        Is the server running on host ""10.21.20.175"" and accepting 
        TCP/IP connections on port 52541? 
",,,,,,,,,"" 
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,9,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node crashed",,,,,,,,,"" 
2019-10-16 16:55:10.826 CST,,,12788,,5da6d32e.31f4,10,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"antdb doctor node monitor dn2_1, try to startup node",,,,,,,,,"" 
2019-10-16 16:55:11.044 CST,,,12788,,5da6d32e.31f4,11,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"start dn2_1 /data/antdb/data/adb50/d1/dn2_1 successfully",,,,,,,,,"" 
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,12,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, startup node successfully",,,,,,,,,"" 
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,13,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, reset node monitor",,,,,,,,,"" 
2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,"" 

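The log shows that dn2_1 was back to normal within about 8 seconds of the first failed connection (16:55:03.315 to 16:55:11.092), and the recovery required no manual intervention.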
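The recovery time can be read directly from the log timestamps. As a small sketch, the snippet below takes the first CONNECT_FAIL entry and the final "node running normally" entry from the log above and computes the difference; the helper name parse_log_time is illustrative, and the time-zone name is simply dropped because both entries use the same zone.

```python
from datetime import datetime


def parse_log_time(line: str) -> datetime:
    """Read the leading timestamp of a CSV log line, e.g. '2019-10-16 16:55:03.315 CST'."""
    first_field = line.split(",", 1)[0]
    stamp = first_field.rsplit(" ", 1)[0]  # drop the trailing zone name
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


# First physical line of the first CONNECT_FAIL entry, and the last entry, from the log above.
first_failure = (
    '2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,'
    '12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, '
    'PQerrorMessage:server closed the connection unexpectedly'
)
running_again = (
    '2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,'
    '12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,""'
)

elapsed = (parse_log_time(running_again) - parse_log_time(first_failure)).total_seconds()
print(f"{elapsed:.3f} s")  # prints "7.777 s"
```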
