Manual Failover

Symptom

Before the network partition, the two-node cluster is healthy: the CM cluster and the database cluster each have one primary instance and one standby instance running normally.

After the network partition, because the CM cluster's automatic failover parameter cms_enable_failover_on2nodes is disabled (in a two-node cluster, the number of votes is 2), the standby instance does not take over the services of the primary instance and remains a standby instance. The isolated primary instance is demoted to a standby instance. As a result, the CM cluster has no primary instance and the database cluster is unavailable, as shown in the following figure.

Status of the CM cluster on node 1
Status of the CM cluster on node 2

Procedure

Identify the node that hosted the primary instance of the database cluster before the fault, then forcibly switch the standby CM instance on that node to the primary instance.

Confirm the Primary Instance of the Database Cluster

On each node, run the gs_ctl query -D \<datapath> command to obtain the role of the database instance. As shown in the following figure, a role of Primary indicates that the primary instance is located on that node.
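The role check can be scripted. The following is a minimal sketch, not part of the original procedure: it assumes that the command output contains a line of the form `local_role : Primary`, which should be verified against the actual output in your environment. The function names `parse_local_role` and `query_role` are hypothetical.

```python
import re
import subprocess


def parse_local_role(query_output):
    """Return the role named on the first 'local_role : <Role>' line, or None.

    Assumption: the output of 'gs_ctl query' contains such a line.
    """
    match = re.search(r"local_role\s*:\s*(\w+)", query_output)
    return match.group(1) if match else None


def query_role(datapath):
    """Run 'gs_ctl query -D <datapath>' (gs_ctl must be on PATH) and parse the role."""
    result = subprocess.run(
        ["gs_ctl", "query", "-D", datapath],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_local_role(result.stdout)
```

Calling `query_role` with the data directory on each node and comparing the results shows which node, if any, reports Primary.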

If the database cluster has no primary instance, use the gsql command to query each instance, choose the instance with the highest term/lsn as the primary instance, and perform failover on that node.

Note: gsql cannot connect to a database instance in the pending state. In that case, use the logs to determine the original primary instance of the database cluster. For details, see Judge the Primary/Standby Status of the Database Cluster Based on Logs in the Appendix.
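The selection rule can be illustrated with a short sketch. It assumes that the term and LSN of each instance have already been collected by hand, that the LSN is written in the usual `high/low` hexadecimal form, and that term is compared first with LSN as the tie-breaker. The text only says "term/lsn", so that ordering is an assumption, and the `Candidate` type and the sample values are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    node: str
    term: int
    lsn: str  # hexadecimal "high/low" form, e.g. "0/5000A28"


def lsn_to_int(lsn):
    """Convert an LSN such as '0/5000A28' into a single comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_primary(candidates):
    """Pick the candidate with the highest term, using the LSN to break ties."""
    return max(candidates, key=lambda c: (c.term, lsn_to_int(c.lsn)))


# Hypothetical values, for illustration only.
nodes = [
    Candidate("node1", 1704, "0/5000A28"),
    Candidate("node2", 1704, "0/5000B00"),
]
print(pick_primary(nodes).node)  # node2: same term, higher LSN
```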

Forcibly Switch the CM instance to the Primary Instance

Once the primary instance of the database cluster is confirmed, forcibly switch the CM instance on the same node to the primary instance, as shown in the following figure.

Cancel Forcible Switchover

After the previous step is complete, the CM cluster is guaranteed to have a primary instance, and brain split may even occur, as shown in the following figure.

Note: The forcible switchover must be canceled whether or not brain split occurs in the CM cluster, as shown in the following figure. The brain split fault is then handled: the CM instance with the higher term is chosen as the sole primary instance. For details about how to rectify a brain split fault caused by this mechanism, see Manual Rectification of the Brain Split Fault.

Appendix

Judge the Primary/Standby Status of the Database Cluster Based on Logs

(1) Confirm the original primary instance of a CM cluster.

On every instance in the CM cluster, search the latest log file cm_server_timestamp_process name-current.log for the following content. The instance whose log contains this content is the original primary instance.

2023-01-06 11:40:34.443 tid=7050 HA LOG: node(1) cms role is Primary, cms change to standby by ddb, and g_ddbRole is 2.
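Searching the log for this entry can be automated. The sketch below assumes that the entry always has the same wording as the sample line above. The function name `find_demotions` is hypothetical, and the log lines must be supplied by the caller because the exact log file name varies.

```python
import re

# Matches entries such as:
# "2023-01-06 11:40:34.443 tid=7050 HA LOG: node(1) cms role is Primary,
#  cms change to standby by ddb, and g_ddbRole is 2."
DEMOTE_RE = re.compile(
    r"^(\S+ \S+) .*node\((\d+)\) cms role is Primary, cms change to standby by ddb"
)


def find_demotions(lines):
    """Return (timestamp, node_id) for every primary-to-standby entry found."""
    hits = []
    for line in lines:
        match = DEMOTE_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits
```

An instance whose log yields a non-empty list was the primary CM instance before it was demoted. The last tuple in the list is the most recent demotion.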

(2) Confirm the original primary instance of a database cluster.

In the latest log file cm_server_timestamp_process name-current.log of the original primary instance in the CM cluster, find the primary/standby status recorded the last time the database cluster was available before the network partition.

As shown in the following example, the database cluster has a primary instance. The index of the node where it is located is 0, and its instance ID is 6001.

2023-01-06 11:30:47.088 tid=7046 CM_AGENT DEBUG1: [GetCandicate], instanceId(6001), this group has dynamic primary(0), validPrimIdx is 0, not need to choose candicate.

There is a primary instance in the database cluster: dyPrimary: [0: 6001: 1704]

2023-01-06 11:30:47.088 tid=7046 CM_AGENT LOG: [DnArbitrateNormal]: instd(6001) staPrimary: [0: 6001: 1704], dyPrimary: [0: 6001: 1704], dyNorPrim: [0: 6001: 1704], notPendCmd: [insInfo is empty], cascade: [sta: (insInfo is empty);  dy: (insInfo is empty)].

The log entries above show that, before the network partition, the primary instance of the database cluster was located on the node with index 0 and had the instance ID 6001. The procedure in this section therefore forcibly switches the CM instance on this node to the primary instance.
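The dyPrimary lookup can also be scripted. This is a minimal sketch that assumes the [DnArbitrateNormal] entries keep the layout of the sample line above. The function name `last_dy_primary` is hypothetical. The third number in the bracket is returned unchanged because the text does not state what it means.

```python
import re

# Matches "dyPrimary: [0: 6001: 1704]" inside a [DnArbitrateNormal] entry.
DY_PRIMARY_RE = re.compile(r"dyPrimary: \[(\d+): (\d+): (\d+)\]")


def last_dy_primary(lines):
    """Return (node_index, instance_id, third_field) from the last
    [DnArbitrateNormal] entry that reports a dyPrimary, or None if none does.
    """
    found = None
    for line in lines:
        if "[DnArbitrateNormal]" not in line:
            continue
        match = DY_PRIMARY_RE.search(line)
        if match:
            found = tuple(int(group) for group in match.groups())
    return found
```

Pass only the entries logged before the network partition, so that the last match reflects the last time the database cluster was available.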

Issue