
Checking MogDB Health Status

Check Method

Use the gs_check tool provided by MogDB to check the MogDB health status.

Precautions

  • Only user root is authorized to check new nodes added during cluster scale-out. In all other cases, the check can be performed only by user omm.
  • Either -i or -e must be specified. -i specifies individual check items; -e specifies an inspection scenario in which multiple items are checked.
  • If the items specified by -i, or the item list of the scenario specified by -e, contain no items that require root privileges, you do not need to enter the username or password of user root.
  • You can specify --skip-root-items to skip items that require root privileges.
  • To check consistency between a new node and existing nodes, run the gs_check command on an existing node and specify the --hosts parameter. The IP address of the new node must be written into the hosts file.

Procedure

Method 1:

  1. Log in as the OS user omm to the primary node of the database.
  2. Run the following command to check the MogDB database status:

    gs_check -i CheckClusterState

    In the command, -i specifies the check item and is case-sensitive. The format is -i CheckClusterState, -i CheckCPU, or -i CheckClusterState,CheckCPU (multiple items are separated by commas).

    Checkable items are listed in "Server Tools > gs_check > MogDB status checks" in the MogDB Tool Reference. You can create a check item as needed.
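The comma-separated -i syntax can be assembled programmatically. The helper below is purely illustrative (it is not part of gs_check) and only demonstrates the command format described above:

```python
def build_check_command(items):
    """Assemble a gs_check single-item command line.

    Check item names are case-sensitive; multiple items are
    joined with commas and passed to a single -i flag.
    """
    if not items:
        raise ValueError("at least one check item is required")
    return "gs_check -i " + ",".join(items)

print(build_check_command(["CheckClusterState", "CheckCPU"]))
# gs_check -i CheckClusterState,CheckCPU
```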

Method 2:

  1. Log in as the OS user omm to the primary node of the database.
  2. Run the following command to check the MogDB database health status:

    gs_check -e inspect

    In the command, -e indicates the inspection scenario and is case-sensitive. The format is -e inspect or -e upgrade.

    The inspection scenarios include inspect (routine inspection), upgrade (inspection before upgrade), install (installation inspection), binary_upgrade (inspection before in-place upgrade), slow_node (node inspection), longtime (time-consuming inspection), and health (health inspection). You can create an inspection scenario as needed.

The MogDB inspection is performed to check MogDB status during MogDB running or to check the environment and conditions before critical operations, such as upgrade or scale-out. For details about the inspection items and scenarios, see "Server Tools > gs_check > MogDB status checks" in the MogDB Tool Reference.

Examples

Check result of a single item:

perfadm@lfgp000700749:/opt/huawei/perfadm/tool/script> gs_check -i CheckCPU
Parsing the check items config file successfully
Distribute the context file to remote hosts successfully
Start to health check for the cluster. Total Items:1 Nodes:3

Checking...               [=========================] 1/1
Start to analysis the check result
CheckCPU....................................OK
The item run on 3 nodes.  success: 3

Analysis the check result successfully
Success. All check items run completed. Total:1  Success:1  Failed:0
For more information please refer to /opt/mogdb/tools/script/gspylib/inspection/output/CheckReport_201902193704661604.tar.gz
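The final summary line of a single-item run has a fixed shape that is easy to consume in scripts. A hedged sketch (this parser is an assumption about the output format shown above, not a gs_check API):

```python
import re

def parse_summary(line):
    """Extract counters from a gs_check single-item summary line, e.g.
    'Success. All check items run completed. Total:1  Success:1  Failed:0'."""
    m = re.search(r"Total:(\d+)\s+Success:(\d+)\s+Failed:(\d+)", line)
    if not m:
        raise ValueError("not a gs_check summary line")
    total, success, failed = map(int, m.groups())
    return {"total": total, "success": success, "failed": failed}

summary = parse_summary(
    "Success. All check items run completed. Total:1  Success:1  Failed:0")
print(summary["failed"])  # 0
```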

Local execution result:

perfadm@lfgp000700749:/opt/huawei/perfadm/tool/script> gs_check -i CheckCPU -L

2017-12-29 17:09:29 [NAM] CheckCPU
2017-12-29 17:09:29 [STD] Check the CPU usage of the host. If the value of idle is greater than 30% and the value of iowait is less than 30%, this item passes the check. Otherwise, this item fails the check.
2017-12-29 17:09:29 [RST] OK

2017-12-29 17:09:29 [RAW]
Linux 4.4.21-69-default (lfgp000700749)  12/29/17  _x86_64_

17:09:24        CPU     %user     %nice   %system   %iowait    %steal     %idle
17:09:25        all      0.25      0.00      0.25      0.00      0.00     99.50
17:09:26        all      0.25      0.00      0.13      0.00      0.00     99.62
17:09:27        all      0.25      0.00      0.25      0.13      0.00     99.37
17:09:28        all      0.38      0.00      0.25      0.00      0.13     99.25
17:09:29        all      1.00      0.00      0.88      0.00      0.00     98.12
Average:        all      0.43      0.00      0.35      0.03      0.03     99.17
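The [STD] line above states the pass criterion for CheckCPU. A minimal sketch of that rule, applied to the Average row of the sar output (this is a restatement of the documented criterion, not the tool's actual implementation):

```python
def cpu_check_passes(idle_pct, iowait_pct):
    """CheckCPU passes when %idle is greater than 30 and %iowait is
    less than 30, per the [STD] description in the execution log."""
    return idle_pct > 30.0 and iowait_pct < 30.0

# Average row from the sar output above: %iowait 0.03, %idle 99.17
print("OK" if cpu_check_passes(99.17, 0.03) else "NG")  # OK
```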

Check result of a scenario:

[perfadm@SIA1000131072 Check]$ gs_check -e inspect
Parsing the check items config file successfully
The below items require root privileges to execute:[CheckBlockdev CheckIOrequestqueue CheckIOConfigure CheckCheckMultiQueue CheckFirewall CheckSshdService CheckSshdConfig CheckCrondService CheckNoCheckSum CheckSctpSeProcMemory CheckBootItems CheckFilehandle CheckNICModel CheckDropCache]
Please enter root privileges user[root]:root
Please enter password for user[root]:
Please enter password for user[root] on the node[10.244.57.240]:
Check root password connection successfully
Distribute the context file to remote hosts successfully
Start to health check for the cluster. Total Items:59 Nodes:2

Checking...               [                         ] 21/59
Checking...               [=========================] 59/59
Start to analysis the check result
CheckClusterState...........................OK
The item run on 2 nodes.  success: 2

CheckDBParams...............................OK
The item run on 1 nodes.  success: 1

CheckDebugSwitch............................OK
The item run on 2 nodes.  success: 2

CheckDirPermissions.........................OK
The item run on 2 nodes.  success: 2

CheckReadonlyMode...........................OK
The item run on 1 nodes.  success: 1

CheckEnvProfile.............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
GAUSSHOME        /usr1/mogdb/app
LD_LIBRARY_PATH  /usr1/mogdb/app/lib
PATH             /usr1/mogdb/app/bin


CheckBlockdev...............................OK
The item run on 2 nodes.  success: 2

CheckCurConnCount...........................OK
The item run on 1 nodes.  success: 1

CheckCursorNum..............................OK
The item run on 1 nodes.  success: 1

CheckPgxcgroup..............................OK
The item run on 1 nodes.  success: 1

CheckDiskFormat.............................OK
The item run on 2 nodes.  success: 2

CheckSpaceUsage.............................OK
The item run on 2 nodes.  success: 2

CheckInodeUsage.............................OK
The item run on 2 nodes.  success: 2

CheckSwapMemory.............................OK
The item run on 2 nodes.  success: 2

CheckLogicalBlock...........................OK
The item run on 2 nodes.  success: 2

CheckIOrequestqueue.....................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:
On device (vdb) 'IO Request' RealValue '256' ExpectedValue '32768'
On device (vda) 'IO Request' RealValue '256' ExpectedValue '32768'

CheckMaxAsyIOrequests.......................OK
The item run on 2 nodes.  success: 2

CheckIOConfigure............................OK
The item run on 2 nodes.  success: 2

CheckMTU....................................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
1500

CheckPing...................................OK
The item run on 2 nodes.  success: 2

CheckRXTX...................................NG
The item run on 2 nodes.  ng: 2
The ng[host240,host157] value:
NetWork[eth0]
RX: 256
TX: 256


CheckNetWorkDrop............................OK
The item run on 2 nodes.  success: 2

CheckMultiQueue.............................OK
The item run on 2 nodes.  success: 2

CheckEncoding...............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
LANG=en_US.UTF-8

CheckFirewall...............................OK
The item run on 2 nodes.  success: 2

CheckKernelVer..............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
3.10.0-957.el7.x86_64

CheckMaxHandle..............................OK
The item run on 2 nodes.  success: 2

CheckNTPD...................................OK
host240: NTPD service is running, 2020-06-02 17:00:28
host157: NTPD service is running, 2020-06-02 17:00:06


CheckOSVer..................................OK
host240: The current OS is centos 7.6 64bit.
host157: The current OS is centos 7.6 64bit.


CheckSysParams..........................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:
Warning reason: variable 'net.ipv4.tcp_retries1' RealValue '3' ExpectedValue '5'.
Warning reason: variable 'net.ipv4.tcp_syn_retries' RealValue '6' ExpectedValue '5'.
Warning reason: variable 'net.sctp.path_max_retrans' RealValue '5' ExpectedValue '10'.
Warning reason: variable 'net.sctp.max_init_retransmits' RealValue '8' ExpectedValue '10'.


CheckTHP....................................OK
The item run on 2 nodes.  success: 2

CheckTimeZone...............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
+0800

CheckCPU....................................OK
The item run on 2 nodes.  success: 2

CheckSshdService............................OK
The item run on 2 nodes.  success: 2

CheckSshdConfig.........................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:

Warning reason: UseDNS parameter is not set; expected: no

CheckCrondService...........................OK
The item run on 2 nodes.  success: 2

CheckStack..................................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
8192

CheckNoCheckSum.............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
Nochecksum value is N,Check items pass.

CheckSysPortRange...........................OK
The item run on 2 nodes.  success: 2

CheckMemInfo................................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
totalMem: 31.260929107666016G

CheckHyperThread............................OK
The item run on 2 nodes.  success: 2

CheckTableSpace.............................OK
The item run on 1 nodes.  success: 1

CheckSctpService............................OK
The item run on 2 nodes.  success: 2

CheckSysadminUser...........................OK
The item run on 1 nodes.  success: 1

CheckGUCConsistent..........................OK
All DN instance guc value is consistent.

CheckMaxProcMemory..........................OK
The item run on 1 nodes.  success: 1

CheckBootItems..............................OK
The item run on 2 nodes.  success: 2

CheckHashIndex..............................OK
The item run on 1 nodes.  success: 1

CheckPgxcRedistb............................OK
The item run on 1 nodes.  success: 1

CheckNodeGroupName..........................OK
The item run on 1 nodes.  success: 1

CheckTDDate.................................OK
The item run on 1 nodes.  success: 1

CheckDilateSysTab...........................OK
The item run on 1 nodes.  success: 1

CheckKeyProAdj..............................OK
The item run on 2 nodes.  success: 2

CheckProStartTime.......................WARNING
host157:
STARTED COMMAND
Tue Jun  2 16:57:18 2020 /usr1/dmuser/dmserver/metricdb1/server/bin/mogdb --single_node -D /usr1/dmuser/dmb1/data -p 22204
Mon Jun  1 16:15:15 2020 /usr1/mogdb/app/bin/mogdb -D /usr1/mogdb/data/dn1 -M standby


CheckFilehandle.............................OK
The item run on 2 nodes.  success: 2

CheckRouting................................OK
The item run on 2 nodes.  success: 2

CheckNICModel...............................OK
The item run on 2 nodes.  success: 2  (consistent)
The success on all nodes value:
version: 1.0.0
model: Red Hat, Inc. Virtio network device


CheckDropCache..........................WARNING
The item run on 2 nodes.  warning: 2
The warning[host240,host157] value:
No DropCache process is running

CheckMpprcFile..............................NG
The item run on 2 nodes.  ng: 2
The ng[host240,host157] value:
There is no mpprc file

Analysis the check result successfully
Failed. All check items run completed. Total:59   Success:52   Warning:5   NG:2
For more information please refer to /usr1/mogdb/tool/script/gspylib/inspection/output/CheckReport_inspect611.tar.gz

Exception Handling

Troubleshoot exceptions detected in the inspection by following instructions in this section.

Table 1 Check of MogDB running status

Check Item Abnormal Status Solution
CheckClusterState (Checks the MogDB status.) MogDB or MogDB instances are not started. Run the following command to start MogDB and instances:

gs_om -t start
The status of MogDB or MogDB instances is abnormal. Check the status of hosts and instances. Troubleshoot this issue based on the status information.
gs_check -i CheckClusterState
CheckDBParams (Checks database parameters.) Database parameters have incorrect values. Use the gs_guc tool to set the parameters to specified values.
CheckDebugSwitch (Checks debug logs.) The log level is incorrect. Use the gs_guc tool to set log_min_messages to specified content.
CheckDirPermissions (Checks directory permissions.) The permission for a directory is incorrect. Change the directory permission to a specified value (750 or 700).
chmod 750 DIR
CheckReadonlyMode (Checks the read-only mode.) The read-only mode is enabled. Verify that the usage of the disk where database nodes are located does not exceed the threshold (60% by default) and no other O&M operations are performed.
gs_check -i CheckDataDiskUsage
ps ux
Use the gs_guc tool to disable the read-only mode of MogDB.
gs_guc reload -N all -I all -c 'default_transaction_read_only = off'
CheckEnvProfile (Checks environment variables.) Environment variables are inconsistent. Update the environment variable information.
CheckBlockdev (Checks pre-read blocks.) The size of a pre-read block is not 16384 KB. Use the gs_checkos tool to set the size of the pre-read block to 16384 KB and write the setting into the auto-startup file.
gs_checkos -i B3
CheckCursorNum (Checks the number of cursors.) The number of cursors fails to be checked. Check whether the database is properly connected and whether the MogDB status is normal.
CheckPgxcgroup (Checks the data redistribution status.) There are pgxc_group tables that have not been redistributed. Proceed with the redistribution.
gs_expand, gs_shrink
CheckDiskFormat (Checks disk configurations.) Disk configurations are inconsistent between nodes. Configure disk specifications to be consistent between nodes.
CheckSpaceUsage (Checks the disk space usage.) Disk space is insufficient. Clear or expand the disk for the directory.
CheckInodeUsage (Checks the disk index usage.) Disk indexes are insufficient. Clear or expand the disk for the directory.
CheckSwapMemory (Checks the swap memory.) The swap memory is greater than the physical memory. Reduce or disable the swap memory.
CheckLogicalBlock (Checks logical blocks.) The size of a logical block is not 512 KB. Use the gs_checkos tool to set the size of the logical block to 512 KB and write the setting into the auto-startup file.
gs_checkos -i B4
CheckIOrequestqueue (Checks I/O requests.) The requested I/O is not 32768. Use the gs_checkos tool to set the requested I/O to 32768 and write the setting into the auto-startup file.
gs_checkos -i B4
CheckCurConnCount (Checks the number of current connections.) The number of current connections exceeds 90% of the allowed maximum number of connections. Break idle primary database node connections.
CheckMaxAsyIOrequests (Checks the maximum number of asynchronous requests.) The maximum number of asynchronous requests is less than 104857600 or (Number of database instances on the current node x 1048576). Use the gs_checkos tool to set the maximum number of asynchronous requests to the larger one between 104857600 and (Number of database instances on the current node x 1048576).
gs_checkos -i B4
CheckMTU (Checks MTU values.) MTU values are inconsistent between nodes. Set the MTU value on each node to 1500 or 8192.
ifconfig eth* MTU 1500
CheckIOConfigure (Checks I/O configurations.) The I/O mode is not deadline. Use the gs_checkos tool to set the I/O mode to deadline and write the setting into the auto-startup file.
gs_checkos -i B4
CheckRXTX (Checks the RX/TX value.) The NIC RX/TX value is not 4096. Use the gs_checkos tool to set the NIC RX/TX value to 4096 for MogDB.
gs_checkos -i B5
CheckPing (Checks whether the network connection is normal.) There are MogDB IP addresses that cannot be pinged. Check the network settings, network status, and firewall status between the abnormal IP addresses.
CheckNetWorkDrop (Checks the network packet loss rate.) The network packet loss rate is greater than 1%. Check the network load and status between the corresponding IP addresses.
CheckMultiQueue (Checks the NIC multi-queue function.) Multiqueue is not enabled for the NIC, and NIC interruptions are not bound to different CPU cores. Enable multiqueue for the NIC, and bind NIC interruptions to different CPU cores.
CheckEncoding (Checks the encoding format.) Encoding formats are inconsistent between nodes. Write the same encoding format into /etc/profile for each node.
echo "export LANG=XXX" >> /etc/profile
CheckActQryCount (Checks the archiving mode.) The archiving mode is enabled, and the archiving directory is not under the primary database node directory. Disable archiving mode or set the archiving directory to be under the primary database node directory.
CheckFirewall (Checks the firewall.) The firewall is enabled. Disable the firewall.
systemctl disable firewalld.service
CheckKernelVer (Checks kernel versions.) Kernel versions are inconsistent between nodes.
CheckMaxHandle (Checks the maximum number of file handles.) The maximum number of handles is less than 1000000. Set the soft and hard limits in the 91-nofile.conf or 90-nofile.conf file to 1000000.
gs_checkos -i B2
CheckNTPD (Checks the time synchronization service.) The NTPD service is disabled or the time difference is greater than 1 minute. Enable the NTPD service and set the time to be consistent.
CheckSysParams (Checks OS parameters.) OS parameter settings do not meet requirements. Use the gs_checkos tool or manually set parameters to values meeting requirements.
gs_checkos -i B1
vim /etc/sysctl.conf
CheckTHP (Checks the THP service.) The THP service is disabled. Use the gs_checkos to enable the THP service.
gs_checkos -i B6
CheckTimeZone (Checks time zones.) Time zones are inconsistent between nodes. Set time zones to be consistent between nodes.
cp /usr/share/zoneinfo/$primary_time_zone/$secondary_time_zone /etc/localtime
CheckCPU (Checks the CPU.) CPU usage is high or I/O waiting time is too long. Upgrade CPUs or improve disk performance.
CheckSshdService (Checks the SSHD service.) The SSHD service is disabled. Enable the SSHD service and write the setting into the auto-startup file.
service sshd start
echo "service sshd start" >> initFile
CheckSshdConfig (Checks SSHD configurations.) The SSHD service is incorrectly configured. Reconfigure the SSHD service.
PasswordAuthentication=no; MaxStartups=1000; UseDNS=no; ClientAliveInterval=10800/ClientAliveInterval=0
Restart the service.
service sshd restart
CheckCrondService (Checks the Crond service.) The Crond service is disabled. Install and enable the Crond service.
CheckStack (Checks the stack size.) The stack size is less than 3072. Use the gs_checkos tool to set the stack size to 3072 and restart the processes with a smaller stack size.
gs_checkos -i B2
CheckNoCheckSum (Checks the NoCheckSum parameter.) NoCheckSum is incorrectly set or its value is inconsistent between nodes. Set NoCheckSum to a consistent value on each node. If Red Hat 6.4 or Red Hat 6.5 is used with the NIC bonding mode bond0, set NoCheckSum to Y. In other cases, set it to N.
echo Y > /sys/module/sctp/parameters/no_checksums
CheckSysPortRange (Checks OS port configurations.) OS IP ports are not within the required port range or MogDB ports are within the OS IP port range. Set the OS IP ports within 26000 to 65535 and set the MogDB ports beyond the OS IP port range.
vim /etc/sysctl.conf
CheckMemInfo (Checks the memory information.) Memory sizes are inconsistent between nodes. Use physical memory of the same specifications between nodes.
CheckHyperThread (Checks the hyper-threading.) The CPU hyper-threading is disabled. Enable the CPU hyper-threading.
CheckTableSpace (Checks tablespaces.) The tablespace path is nested with the MogDB path or nested with the path of another tablespace. Migrate tablespace data to the tablespace with a valid path.
CheckSctpService (Checks the SCTP service.) The SCTP service is disabled. Install and enable the SCTP service.
modprobe sctp
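The mapping in Table 1 from a failed check item to its remediation command lends itself to a small lookup table. The dictionary below is a hypothetical helper built from a few rows of the table; gs_check itself does not ship such a utility:

```python
# A few remediation commands transcribed from Table 1 (illustrative only).
REMEDIATION = {
    "CheckClusterState": "gs_om -t start",
    "CheckFirewall": "systemctl disable firewalld.service",
    "CheckSctpService": "modprobe sctp",
    "CheckIOrequestqueue": "gs_checkos -i B4",
}

def suggest_fix(check_item):
    """Return the documented fix for a failed check item, or a pointer
    back to Table 1 when the item is not in this partial mapping."""
    return REMEDIATION.get(check_item, "see Table 1 in Exception Handling")

print(suggest_fix("CheckFirewall"))  # systemctl disable firewalld.service
```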

Querying Status

Background

MogDB allows you to view the status of the entire database. The query result shows whether the database or a single host is running properly.

Prerequisites

The database has started.

Procedure

  1. Log in as the OS user omm to the primary node of the database.
  2. Run the following command to query the database status:

    gs_om -t status --detail

    Table 1 describes parameters in the query result.

    To query the instance status on a host, add -h to the command. For example:

    gs_om -t status -h plat2 

    plat2 indicates the name of the host to be queried.
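The [ Cluster State ] section of the gs_om output uses simple `key : value` lines, which can be read into a dictionary for monitoring scripts. A hedged sketch, assuming the output layout shown in the Examples section below:

```python
def parse_cluster_state(output):
    """Parse 'key : value' lines from the [ Cluster State ] section
    of gs_om -t status --detail output."""
    state = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            state[key.strip()] = value.strip()
    return state

sample = """[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL"""
print(parse_cluster_state(sample)["cluster_state"])  # Normal
```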

Parameter Description

Table 1 Node role description

Field Description Value
cluster_state The database status, which indicates whether the entire database is running properly. Normal: The database is available and the data has redundancy backup. All the processes are running and the primary/standby relationship is normal. Unavailable: The database is unavailable. Degraded: The database is available, but some database nodes or primary database nodes are faulty.
node Host name. Specifies the name of the host where the instance is located. If multiple AZs exist, the AZ IDs will be displayed.
node_ip Host IP Address. Specifies the IP address of the host where the instance is located.
instance Instance ID. Specifies the instance ID.
state Instance role. Normal: a single host instance. Primary: The instance is a primary instance. Standby: The instance is a standby instance. Cascade Standby: The instance is a cascaded standby instance. Secondary: The instance is a secondary instance. Pending: The instance is in the quorum phase. Unknown: The instance status is unknown. Down: The instance is down. Abnormal: The node is abnormal. Manually stopped: The node has been manually stopped.

Each role has different states, such as startup and connection. The states are described as follows:

Table 2 Node state description

State Description
Normal The node starts up normally.
Need repair The node needs to be restored.
Starting The node is starting up.
Wait promoting The node is waiting to be promoted. For example, after a standby node sends a promotion request to the primary node, it waits for the primary node's response.
Promoting The standby node is being promoted to primary.
Demoting The node is being demoted, for example, the primary node is being demoted to standby.
Building The standby node fails to be started and needs to be rebuilt.
Catchup The standby node is catching up with the primary node.
Coredump The node program breaks down.
Unknown The node status is unknown.
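As an illustration of Table 2, the states can be grouped into ordinary transitions and fault conditions that need operator attention. The grouping below is an assumption drawn from the state descriptions, not something gs_om reports itself:

```python
# Grouping of Table 2 states (an editorial assumption, not gs_om output).
TRANSIENT_STATES = {"Starting", "Wait promoting", "Promoting", "Demoting", "Catchup"}
FAULT_STATES = {"Need repair", "Building", "Coredump", "Unknown"}

def needs_attention(state):
    """True for states from Table 2 that indicate a fault rather than
    normal operation or an ordinary transition."""
    return state in FAULT_STATES

print(needs_attention("Catchup"))      # False
print(needs_attention("Need repair"))  # True
```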

If a node is in Need repair state, you need to rebuild the node to restore it. Generally, the reasons for rebuilding a node are as follows:

Table 3 Node rebuilding causes

State Description
Normal The node starts up normally.
WAL segment removed The WALs of the primary node do not exist, and the standby node's logs lag behind the primary node's.
Disconnect The standby node cannot connect to the primary node.
Version not matched The binary versions of the primary and standby nodes are inconsistent.
Mode not matched Nodes do not match the primary and standby roles. For example, two standby nodes are connected.
System id not matched The database system IDs of the primary and standby nodes are inconsistent. The system IDs of the primary and standby nodes must be the same.
Timeline not matched The log timelines are inconsistent.
Unknown Unknown cause.

Examples

View the database status details, including instance status.

gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

node               node_ip         instance                                 state            | node               node_ip         instance                                 state
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1  pekpopgsci00235 10.244.62.204   6001 /opt/mogdb/cluster/data/dn1 P Primary Normal | 2  pekpopgsci00238 10.244.61.81    6002 /opt/mogdb/cluster/data/dn1 S Standby Normal
gs_om -t status --detail
[  CMServer State   ]

node      node_ip         instance                                 state
--------------------------------------------------------------------------

1  host40 10.243.40.20    1    /usr1/cm_gauss/cluster/cm/cm_server Primary
2  host39 10.243.39.8     2    /usr1/cm_gauss/cluster/cm/cm_server Standby
3  host15 10.243.15.65    3    /usr1/cm_gauss/cluster/cm/cm_server Standby

[    ETCD State     ]

node      node_ip         instance                         state
------------------------------------------------------------------------

1  host40 10.243.40.20    7001 /usr1/cm_gauss/cluster/etcd StateFollower
2  host39 10.243.39.8     7002 /usr1/cm_gauss/cluster/etcd StateFollower
3  host15 10.243.15.65    7003 /usr1/cm_gauss/cluster/etcd StateLeader

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[  Datanode State   ]

node      node_ip         instance                        state            | node      node_ip         instance                        state            | node      node_ip         instance                        state
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1  host40 10.243.40.20    6001 /usr1/cm_gauss/cluster/dn1 P Primary Normal | 2  host39 10.243.39.8     6002 /usr1/cm_gauss/cluster/dn1 S Standby Normal | 3  host15 10.243.15.65    6003 /usr1/cm_gauss/cluster/dn1 S Standby Normal