Cluster Management

Uqbar provides cluster management capabilities, including primary/standby high availability (HA). If a node becomes faulty, the system automatically switches to a healthy node so that system availability can meet an SLA of 99.95%.

Operation Scenarios

While Uqbar is running, the database administrator may need to switch the primary and standby roles of database nodes manually. For example, a failover is needed when the primary node is faulty, and a manual switchover is needed when hardware is faulty. A cascaded standby node cannot be converted directly to a primary node. It must first become a standby node through switchover or failover, and only then can it be switched over to a primary node.

Note:

Primary/standby switchover is a maintenance operation. Make sure that Uqbar is running normally, and perform the switchover only after all services have completed.

Cascaded standby nodes are not supported when extreme RTO is enabled, because in that mode they cannot be connected for data synchronization.

Procedure

Log in to any database node as the operating system user omm, and run the following command to check the primary/standby status.
```
gs_om -t status --detail
```
Log in to the standby node that is to be switched over to a primary node as the operating system user omm, and run the following command.
```
gs_ctl switchover -D /home/omm/cluster/dn1/
```
/home/omm/cluster/dn1/ is the data directory of the standby node.

Notice: A new switchover cannot be started on a database until the previous primary/standby switchover has completed. If a switchover is initiated while services are running, a thread on the primary node may not stop in time, and the switchover may report a timeout. In that case the switchover is still running in the background and completes after the thread stops. For example, while the primary node is deleting a large partitioned table, it may not respond to the switchover signal.

After the switchover succeeds, run the following command to record the primary/standby node information.
```
gs_om -t refreshconf
```
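The three steps above can be chained in a small wrapper. The following is a minimal sketch, not an official tool: the function name `switchover_and_refresh` is hypothetical, and the sketch assumes that `gs_om` and `gs_ctl` return a non-zero exit status on failure, which this section does not state. It uses only the commands already shown.

```shell
#!/usr/bin/env bash
# Sketch: run on the standby node that is to become primary, as user omm.
# switchover_and_refresh is a hypothetical helper name, not part of Uqbar.
# Assumption: gs_om and gs_ctl return a non-zero exit status on failure.

switchover_and_refresh() {
    local data_dir="$1"   # data directory of the standby node

    if [ -z "$data_dir" ]; then
        echo "usage: switchover_and_refresh <data_dir>" >&2
        return 2
    fi

    # Step 1: show the current primary/standby status.
    gs_om -t status --detail || return 1

    # Step 2: switch this standby node over to primary.
    if ! gs_ctl switchover -D "$data_dir"; then
        echo "switchover did not complete; configuration not refreshed" >&2
        return 1
    fi

    # Step 3: record the new primary/standby node information.
    gs_om -t refreshconf
}

# Example call (commented out so that sourcing this file has no side effects):
# switchover_and_refresh /home/omm/cluster/dn1/
```

Stopping before `gs_om -t refreshconf` when the switchover command fails keeps the recorded node information unchanged until the role change has actually succeeded.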

Examples

Switch the standby instance of a database node to a primary instance.

Check the database status.

```
gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node             node_ip         port      instance                            state
--------------------------------------------------------------------------------------------------
1  pekpopgsci00235  10.244.62.204    5432      6001 /home/omm/cluster/dn1/   P Primary Normal
2  pekpopgsci00238  10.244.61.81     5432      6002 /home/omm/cluster/dn1/   S Standby Normal
```
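When the status is checked from a script, the instance rows can be reduced to node name, role, and state. The sketch below assumes that each row in the Datanode State table has the same nine whitespace-separated columns as the sample output above. `list_instance_roles` is a hypothetical helper name, and a state made up of several words would be cut to its first word.

```shell
# list_instance_roles: read `gs_om -t status --detail` output on stdin and
# print "<node> <role> <state>" for each instance row.
# Assumption: rows look like the sample above, i.e.
#   <index> <node> <ip> <port> <instance_id> <data_dir> <P|S> <role> <state>
list_instance_roles() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 9 { print $2, $8, $9 }'
}

# Example: gs_om -t status --detail | list_instance_roles
```

For the sample output above, this prints `pekpopgsci00235 Primary Normal` and `pekpopgsci00238 Standby Normal`, one per line.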

Log in to the standby node and perform the switchover. The same command can also be run on a cascaded standby node, in which case the cascaded standby node becomes a standby node and the original standby node becomes a cascaded standby node.

```
gs_ctl switchover -D /home/omm/cluster/dn1/
[2020-06-17 14:28:01.730][24438][][gs_ctl]: gs_ctl switchover ,datadir is -D "/home/omm/cluster/dn1"
[2020-06-17 14:28:01.730][24438][][gs_ctl]: switchover term (1)
[2020-06-17 14:28:01.768][24438][][gs_ctl]: waiting for server to switchover............
[2020-06-17 14:28:11.175][24438][][gs_ctl]: done
[2020-06-17 14:28:11.175][24438][][gs_ctl]: switchover completed (/home/omm/cluster/dn1)
```

Save the primary/standby database node information.

```
gs_om -t refreshconf
Generating dynamic configuration file for all nodes.
Successfully generated dynamic configuration file.
```

Exception Handling

Use the following criteria to decide whether an exception needs to be handled:

Under heavy service load, primary/standby switchover takes longer than usual. No action is required.
If a standby node is being built, the primary node can be switched over to a standby node only after it has sent its logs to the standby node, so the switchover may take a long time. No action is required, but it is recommended that primary/standby switchover not be performed while a standby node is being built.
During switchover, if a dual-primary condition occurs because the primary and standby instances are disconnected from each other due to a network fault or insufficient disk space, perform the following operations to rectify the fault.

Warning: As soon as a dual-primary state occurs, perform the following operations to restore the instance status. Otherwise, data loss may occur.

Run the following command to check the current instance status of a database.
```
gs_om -t status --detail
```
If the result shows that both instances are primary, the state is abnormal.
Run the following command to stop the service on the node that is to be switched over to a standby node.
```
gs_ctl stop -D /home/omm/cluster/dn1/
```
Run the following command to start that node in standby mode.
```
gs_ctl start -D /home/omm/cluster/dn1/ -M standby
```
Save the primary/standby database node information.
```
gs_om -t refreshconf
```
Check the database status and make sure that the instance status is normal.
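The status checks before and after the recovery can be scripted by counting primary instances. The following is a minimal sketch, assuming that instance rows follow the layout of the sample status output in the Examples section. `count_primaries` and `check_single_primary` are hypothetical helper names. A count of zero is also reported as abnormal.

```shell
# count_primaries: read `gs_om -t status --detail` output on stdin and print
# the number of instance rows whose role column is "Primary".
# Assumption: instance rows start with a numeric index and carry the role
# word in column 8, as in the sample status output in the Examples section.
count_primaries() {
    awk '$1 ~ /^[0-9]+$/ && $8 == "Primary" { n++ } END { print n + 0 }'
}

# check_single_primary: return 0 if exactly one primary is listed, 1 otherwise.
check_single_primary() {
    local n
    n=$(count_primaries)
    if [ "$n" -eq 1 ]; then
        echo "OK: one primary instance"
    else
        echo "ABNORMAL: $n primary instances" >&2
        return 1
    fi
}

# Example: gs_om -t status --detail | check_single_primary
```

A result of two or more primaries corresponds to the dual-primary condition described above. The script only reports the condition, and the stop, start, and refreshconf steps are still performed manually.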