Streaming Database: AntDB Database psql Login

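Overview

Once self-healing is enabled, users no longer need to track node status while the cluster is running. When a node becomes abnormal (for example, it goes down), the self-healing module automatically attempts to repair it.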

Starting Self-Healing

After the cluster is initialized, the self-healing module is off by default. Start doctor manually in adbmgr:

antdb=# start doctor; 
 mgr_doctor_start  
------------------ 
 t 
(1 row)

Use the list doctor command to view the current settings:

antdb=# list doctor; 
   type    |     subtype     |      key       | value |                                                   comment                                                     
-----------+-----------------+----------------+-------+-------------------------------------------------------------------------------------------------------------- 
 PARAMETER | --              | enable         | 1     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit. 
 PARAMETER | --              | forceswitch    | 0     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss. 
 PARAMETER | --              | switchinterval | 30    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching. 
 PARAMETER | --              | nodedeadline   | 30    | In seconds. The maximum time for doctor tolerate a NODE running abnormally. 
 PARAMETER | --              | agentdeadline  | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally. 
 NODE      | gtmcoord master | gcn1           | t     | enable doctor 
 NODE      | gtmcoord slave  | gcn2           | t     | enable doctor 
 NODE      | coordinator     | cn1            | t     | enable doctor 
 NODE      | coordinator     | cn2            | t     | enable doctor 
 NODE      | coordinator     | cn3            | t     | enable doctor 
 NODE      | datanode master | dn1_1          | t     | enable doctor 
 NODE      | datanode master | dn2_1          | t     | enable doctor 
 NODE      | datanode master | dn3_1          | t     | enable doctor 
 NODE      | datanode slave  | dn1_2          | t     | enable doctor 
 NODE      | datanode slave  | dn1_3          | t     | enable doctor 
 NODE      | datanode slave  | dn2_2          | t     | enable doctor 
 NODE      | datanode master | dn4_1          | t     | enable doctor 
 NODE      | coordinator     | cn4            | t     | enable doctor 
 NODE      | datanode slave  | dn3_2          | t     | enable doctor 
 NODE      | datanode slave  | dn4_2          | t     | enable doctor 
 HOST      | --              | adb01          | t     | enable doctor 
 HOST      | --              | adb02          | t     | enable doctor 
(22 rows)

The output shows that self-healing monitors two types of components:

node: each node in the cluster.
host: the agent process on each host in the cluster.
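
The list doctor table can be split by component type in a script. The following is a minimal Python sketch, not part of AntDB: it assumes the output has been captured as text in the psql-style layout shown above, and the function name parse_list_doctor is made up for illustration.

```python
def parse_list_doctor(text):
    """Parse psql-style `list doctor` output into a list of row dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        # The dashed separator uses '+', so it, the prompt line,
        # the "(N rows)" footer, and blank lines are all skipped here.
        if "|" not in line:
            continue
        # Limit the split to 5 columns so a '|' inside a comment is preserved.
        cells = [c.strip() for c in line.split("|", 4)]
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Shortened sample in the same layout as the output above.
sample = """\
   type    |     subtype     |      key       | value |   comment
-----------+-----------------+----------------+-------+-----------
 PARAMETER | --              | enable         | 1     | 0:false, 1:true.
 NODE      | datanode master | dn2_1          | t     | enable doctor
 HOST      | --              | adb01          | t     | enable doctor
(3 rows)
"""

rows = parse_list_doctor(sample)
params = {r["key"]: r["value"] for r in rows if r["type"] == "PARAMETER"}
nodes = [r["key"] for r in rows if r["type"] == "NODE"]
hosts = [r["key"] for r in rows if r["type"] == "HOST"]
print(params, nodes, hosts)
```

All values come back as strings, so a caller that needs numbers (for example nodedeadline) would convert them explicitly.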

adbmgr starts several doctor processes: one launcher, one node monitor per node, and one host monitor:

[antdb@intel175 ~]$ ps xuf|grep doctor 
antdb   193328  0.0  0.0 112716   984 pts/46   S+   16:02   0:00  |   \_ grep --color=auto doctor 
antdb   134782  0.0  0.0 359748  7808 ?        Ss   14:48   0:02  \_ adbmgr: antdb doctor launcher    
antdb   137154  0.0  0.0 358944  6836 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor gcn1    
antdb   137155  0.0  0.0 358948  6828 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor gcn2    
antdb   137157  0.0  0.0 358948  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn1    
antdb   137159  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn2    
antdb   137163  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn3    
antdb   137165  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_1    
antdb   137167  0.0  0.0 358948  6872 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn2_1    
antdb   137169  0.0  0.0 358952  6856 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn3_1    
antdb   137172  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_2    
antdb   137175  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_3    
antdb   137177  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn2_2    
antdb   137180  0.0  0.0 358952  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn4_1    
antdb   137183  0.0  0.0 358952  6856 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn4    
antdb   137186  0.0  0.0 358956  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn3_2    
antdb   137189  0.0  0.0 358956  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn4_2    
antdb   137191  0.0  0.0 358948  5888 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor host monitor
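
A quick way to confirm that every node has a monitor is to count the doctor processes by role. The sketch below is illustrative only: it assumes the process titles follow the pattern seen in the ps output above ("adbmgr: <user> doctor launcher | node monitor <name> | host monitor"), and summarize_doctor_processes is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the process titles in the ps output above (an assumption).
DOCTOR_RE = re.compile(
    r"adbmgr: \S+ doctor (launcher|node monitor|host monitor)(?: (\S+))?"
)


def summarize_doctor_processes(ps_text):
    """Return (count per role, list of node names that have a node monitor)."""
    counts = Counter()
    monitored = []
    for line in ps_text.splitlines():
        match = DOCTOR_RE.search(line)
        if not match:
            continue  # e.g. the grep process itself
        role, name = match.group(1), match.group(2)
        counts[role] += 1
        if role == "node monitor" and name:
            monitored.append(name)
    return counts, monitored


# Shortened sample taken from the output above.
ps_sample = """\
antdb   193328  0.0  0.0 112716   984 pts/46   S+   16:02   0:00  |   \\_ grep --color=auto doctor
antdb   134782  0.0  0.0 359748  7808 ?        Ss   14:48   0:02  \\_ adbmgr: antdb doctor launcher
antdb   137154  0.0  0.0 358944  6836 ?        Ss   14:49   0:00  \\_ adbmgr: antdb doctor node monitor gcn1
antdb   137157  0.0  0.0 358948  6852 ?        Ss   14:49   0:00  \\_ adbmgr: antdb doctor node monitor cn1
antdb   137191  0.0  0.0 358948  5888 ?        Ss   14:49   0:00  \\_ adbmgr: antdb doctor host monitor
"""

counts, monitored = summarize_doctor_processes(ps_sample)
print(dict(counts), monitored)
```

Comparing the monitored names against the NODE rows of list doctor would reveal a node without a monitor process.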

Stopping Self-Healing

Run stop doctor; in adbmgr to turn self-healing off:

antdb=# stop doctor; 
NOTICE:  Update pgxc_node successfully in 'gcn1'. 
NOTICE:  Update pgxc_node successfully in 'cn1'. 
NOTICE:  Update pgxc_node successfully in 'cn2'. 
NOTICE:  Update pgxc_node successfully in 'cn3'. 
NOTICE:  Update pgxc_node successfully in 'cn4'. 
NOTICE:  Updating pgxc_node successfully at all datanode master. 
 mgr_doctor_stop  
----------------- 
 t 
(1 row) 
 
antdb=# list doctor; 
   type    |     subtype     |      key       | value |                                                   comment                                                     
-----------+-----------------+----------------+-------+-------------------------------------------------------------------------------------------------------------- 
 PARAMETER | --              | enable         | 0     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit. 
 PARAMETER | --              | forceswitch    | 1     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss. 
 PARAMETER | --              | switchinterval | 10    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching. 
 PARAMETER | --              | nodedeadline   | 10    | In seconds. The maximum time for doctor tolerate a NODE running abnormally. 
 PARAMETER | --              | agentdeadline  | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally. 
 NODE      | gtmcoord master | gcn1           | t     | enable doctor 
 NODE      | gtmcoord slave  | gcn2           | t     | enable doctor 
 NODE      | coordinator     | cn1            | t     | enable doctor 
 NODE      | coordinator     | cn2            | t     | enable doctor 
 NODE      | coordinator     | cn3            | t     | enable doctor 
 NODE      | datanode master | dn1_1          | t     | enable doctor 
 NODE      | datanode master | dn2_1          | t     | enable doctor 
 NODE      | datanode master | dn3_1          | t     | enable doctor 
 NODE      | datanode slave  | dn1_2          | t     | enable doctor 
 NODE      | datanode slave  | dn1_3          | t     | enable doctor 
 NODE      | datanode slave  | dn2_2          | t     | enable doctor 
 NODE      | datanode master | dn4_1          | t     | enable doctor 
 NODE      | coordinator     | cn4            | t     | enable doctor 
 NODE      | datanode slave  | dn3_2          | t     | enable doctor 
 NODE      | datanode slave  | dn4_2          | t     | enable doctor 
 HOST      | --              | adb01          | t     | enable doctor 
 HOST      | --              | adb02          | t     | enable doctor 
(22 rows) 
 
[antdb@intel175 ~]$ ps xuf|grep doctor 
antdb     2435  0.0  0.0 112716   984 pts/46   S+   16:09   0:00  |   \_ grep --color=auto doctor

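After stop completes, the doctor parameter enable is 0 and no doctor processes remain.

Modifying Parameters

Use the set doctor command to change doctor's global parameters. The parameters are:

enable: the master switch; 1 is on, 0 is off. Default: 0.
forceswitch: whether to force a master/slave switch for an abnormal node; 1 is yes, 0 is no. Default: 0. A forced switch may cause data loss.
switchinterval: how long doctor waits before trying again after the previous self-healing attempt failed. Default: 30 s.
nodedeadline: the maximum time doctor tolerates a node running abnormally before it starts self-healing. Default: 30 s.
agentdeadline: the maximum time doctor tolerates an agent process running abnormally before it starts self-healing. Default: 5 s.

Resource parameters:

switch_on_cpu_usage_pct: CPU usage threshold that triggers a switch; range 80% to 100%, 0 means disabled.
switch_on_mem_usage_pct: memory usage threshold that triggers a switch; range 80% to 100%, 0 means disabled.
switch_on_disk_usage_pct: disk usage threshold that triggers a switch; range 80% to 100%, 0 means disabled.
switch_on_io_usage_pct: I/O usage threshold that triggers a switch; range 80% to 100%, 0 means disabled.
switch_on_net_usage_pct: network bandwidth usage threshold that triggers a switch; range 80% to 100%, 0 means disabled.
switch_on_disk_corruption_enable: whether to switch master/slave when the disk is corrupted; 1 is yes, 0 is no. Default: 0.

The commands for changing parameters look like this: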

set doctor (switchinterval=10); 
set doctor (nodedeadline=10); 
set doctor (forceswitch=1); 
set doctor (switch_on_cpu_usage_pct=80);
set doctor (switch_on_mem_usage_pct=80);
set doctor (switch_on_disk_usage_pct=80);
set doctor (switch_on_io_usage_pct=80); 
set doctor (switch_on_net_usage_pct=80);
set doctor (switch_on_disk_corruption_enable=1);
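
The documented value ranges can be checked before a statement is sent to adbmgr. The sketch below is a hypothetical helper, not an AntDB tool: it only builds the set doctor (key=value); text shown above, and the rule that the time parameters must be positive integers is an assumption made for illustration.

```python
# Parameter groups taken from the parameter descriptions above.
PCT_KEYS = {
    "switch_on_cpu_usage_pct",
    "switch_on_mem_usage_pct",
    "switch_on_disk_usage_pct",
    "switch_on_io_usage_pct",
    "switch_on_net_usage_pct",
}
BOOL_KEYS = {"enable", "forceswitch", "switch_on_disk_corruption_enable"}
SECONDS_KEYS = {"switchinterval", "nodedeadline", "agentdeadline"}


def build_set_doctor(key, value):
    """Validate a doctor parameter and return the matching statement text."""
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError(f"{key}: value must be an integer")
    if key in PCT_KEYS:
        # Documented range: 80 to 100 percent, or 0 to disable.
        if value != 0 and not 80 <= value <= 100:
            raise ValueError(f"{key}: expected 0 or 80..100, got {value}")
    elif key in BOOL_KEYS:
        if value not in (0, 1):
            raise ValueError(f"{key}: expected 0 or 1, got {value}")
    elif key in SECONDS_KEYS:
        # Assumption: a time limit in seconds should be positive.
        if value <= 0:
            raise ValueError(f"{key}: expected a positive number of seconds")
    else:
        raise ValueError(f"unknown doctor parameter: {key}")
    return f"set doctor ({key}={value});"


print(build_set_doctor("switchinterval", 10))
print(build_set_doctor("switch_on_cpu_usage_pct", 80))
```

Rejecting out-of-range values in the script keeps mistakes such as a 50% threshold from ever reaching the cluster manager.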

After making the changes, check that they took effect:

antdb=# list doctor;
   type    |     subtype     |               key                | value |                                                   comment
-----------+-----------------+----------------------------------+-------+--------------------------------------------------------------------------------------------------------------
 PARAMETER | --              | enable                           | 0     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
 PARAMETER | --              | forceswitch                      | 1     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause DATA LOSS.
 PARAMETER | --              | rewindoldmaster                  | 0     | 0:false, 1:true. Whether rewind old master after switched, note that rewind old master may cause DATA LOSS.
 PARAMETER | --              | switchinterval                   | 10    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching.
 PARAMETER | --              | nodedeadline                     | 10    | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
 PARAMETER | --              | agentdeadline                    | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
 PARAMETER | --              | switch_on_cpu_usage_pct          | 80    | The CPU usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_mem_usage_pct          | 80    | The memory usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_disk_usage_pct         | 80    | The disk usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_io_usage_pct           | 80    | The io usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_net_usage_pct          | 80    | The net usage threshold to trigger switch, range from 80 to 100 percent, 0 means disabled.
 PARAMETER | --              | switch_on_disk_corruption_enable | 1     | 0:false, 1:true. Whether switch the master/slave when disk corrupt.
 NODE      | gtmcoord master | gtmcoord                         | t     | enable doctor
 NODE      | gtmcoord slave  | gcs1                             | t     | enable doctor
 NODE      | coordinator     | cn3                              | t     | enable doctor
 NODE      | datanode master | dn1                              | t     | enable doctor
 NODE      | coordinator     | cn2                              | t     | enable doctor
 NODE      | coordinator     | cn1                              | t     | enable doctor
 NODE      | datanode master | dn2                              | t     | enable doctor
 NODE      | datanode master | dn3                              | t     | enable doctor
 HOST      | --              | host227                          | t     | enable doctor
 HOST      | --              | host228                          | t     | enable doctor
 HOST      | --              | host214                          | t     | enable doctor
(23 rows)

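Self-Healing in Action

With self-healing enabled, when a node fails, doctor attempts operations such as restarting the node or switching master and slave to keep the service available.

Below, we kill a datanode master node and observe whether it recovers automatically.

Killing the Node

Pick dn2_1 as the node to kill: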

antdb=# monitor datanode master dn2_1; 
 nodename |    nodetype     | status | description |     host     | port  | recovery |           boot time            
----------+-----------------+--------+-------------+--------------+-------+----------+------------------------------- 
 dn2_1    | datanode master | t      | running     | 10.21.20.175 | 52541 | false    | 2019-10-16 16:20:06.225503+08 
(1 row) 
 
[antdb@intel175 ~]$ ps xuf|grep dn2_1 
antdb    35846  0.0  0.0 112712   980 pts/56   S+   16:54   0:00      \_ grep --color=auto dn2_1 
antdb    11456  0.0  0.0 442624 92208 ?        S    16:20   0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i 
antdb    12788  0.0  0.0 358948  6908 ?        Ss   16:22   0:00  \_ adbmgr: antdb doctor node monitor dn2_1     
[antdb@intel175 ~]$ kill -9 11456 
 
antdb=# monitor datanode master dn2_1; 
WARNING:  datanode master dn2_1 recovery status is unknown 
 nodename |    nodetype     | status | description |     host     | port  | recovery | boot time  
----------+-----------------+--------+-------------+--------------+-------+----------+----------- 
 dn2_1    | datanode master | f      | not running | 10.21.20.175 | 52541 | unknown  | unknow 
(1 row)

Observing Node Status

Wait a few seconds, then check the node status again:

antdb=# monitor datanode master dn2_1; 
 nodename |    nodetype     | status | description |     host     | port  | recovery |           boot time            
----------+-----------------+--------+-------------+--------------+-------+----------+------------------------------- 
 dn2_1    | datanode master | t      | running     | 10.21.20.175 | 52541 | false    | 2019-10-16 16:55:10.935821+08 
(1 row)
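
The "wait a few seconds and check again" step can be automated with a small polling loop. The sketch below is generic Python and makes no AntDB calls: node_is_running is a hypothetical stand-in that a real script would replace with an actual status query (for example, running monitor datanode master dn2_1 and reading the status column).

```python
import time


def wait_until(check, timeout=30.0, interval=1.0):
    """Call check() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical stand-in for a real status query:
# reports "not running" twice, then "running".
_states = iter([False, False, True])


def node_is_running():
    return next(_states)


recovered = wait_until(node_is_running, timeout=5.0, interval=0.01)
print(recovered)
```

A timeout somewhat larger than nodedeadline is a reasonable starting point, since doctor waits that long before acting on an abnormal node.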

The node has recovered, and its postgres process is visible again:

[antdb@intel175 ~]$ ps xuf|grep dn2_1 
antdb    36484  0.0  0.0 112712   980 pts/56   S+   16:55   0:00      \_ grep --color=auto dn2_1 
antdb    36441  1.8  0.0 442624 92212 ?        S    16:55   0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i 
antdb    12788  0.0  0.0 359084  7664 ?        Ss   16:22   0:00  \_ adbmgr: antdb doctor node monitor dn2_1

The corresponding adbmgr log entries:

2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:server closed the connection unexpectedly
        This probably means the server terminated abnormally 
        before or while processing the request. 
",,,,,,,,,"" 
2019-10-16 16:55:05.818 CST,,,12788,,5da6d32e.31f4,7,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
        Is the server running on host ""10.21.20.175"" and accepting 
        TCP/IP connections on port 52541? 
",,,,,,,,,"" 
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,8,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
        Is the server running on host ""10.21.20.175"" and accepting 
        TCP/IP connections on port 52541? 
",,,,,,,,,"" 
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,9,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node crashed",,,,,,,,,"" 
2019-10-16 16:55:10.826 CST,,,12788,,5da6d32e.31f4,10,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"antdb doctor node monitor dn2_1, try to startup node",,,,,,,,,"" 
2019-10-16 16:55:11.044 CST,,,12788,,5da6d32e.31f4,11,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"start dn2_1 /data/antdb/data/adb50/d1/dn2_1 successfully",,,,,,,,,"" 
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,12,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, startup node successfully",,,,,,,,,"" 
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,13,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, reset node monitor",,,,,,,,,"" 
2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,""

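The log shows that dn2_1 was back to normal about 8 seconds after the first failed connection check (16:55:03 to 16:55:11), with no manual intervention.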
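
The recovery time can be read straight from the log timestamps. The following Python sketch assumes each log record starts with a timestamp in the form shown above (YYYY-MM-DD HH:MM:SS.mmm) and measures from the first CONNECT_FAIL record to the "node running normally" record. The sample lines are abridged from the log above, and recovery_seconds is a hypothetical helper name.

```python
from datetime import datetime


def log_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log record."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def recovery_seconds(log_lines):
    """Seconds from the first CONNECT_FAIL to 'node running normally', or None."""
    start = end = None
    for line in log_lines:
        if start is None and "CONNECT_FAIL" in line:
            start = log_time(line)
        elif start is not None and "node running normally" in line:
            end = log_time(line)
            break
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Abridged records; timestamps and messages are taken from the log above.
log_sample = [
    '2019-10-16 16:55:03.315 CST,"antdb doctor node monitor dn2_1, CONNECT_FAIL"',
    '2019-10-16 16:55:10.824 CST,"antdb doctor node monitor dn2_1, node crashed"',
    '2019-10-16 16:55:10.826 CST,"antdb doctor node monitor dn2_1, try to startup node"',
    '2019-10-16 16:55:11.092 CST,"antdb doctor node monitor dn2_1, node running normally"',
]

elapsed = recovery_seconds(log_sample)
print(f"{elapsed:.3f} s")
```

Continuation lines of multi-line messages contain neither marker, so they are never passed to log_time and do not need special handling.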
