
MOT Administration

The following describes various MOT administration topics.

MOT Durability

Durability refers to long-term data protection (also known as disk persistence). Durability means that stored data does not suffer from any kind of degradation or corruption, so that data is never lost or compromised. Durability ensures that data and the MOT engine are restored to a consistent state after a planned shutdown (for example, for maintenance) or an unplanned crash (for example, a power failure).

Memory storage is volatile, meaning that it requires power to maintain the stored information. Disk storage, on the other hand, is non-volatile, meaning that it does not require power to maintain stored information and can therefore survive a power shutdown. MOT uses both types of storage - it keeps all data in memory, while persisting transactional changes to disk (see MOT Logging - WAL Redo Log below) and maintaining frequent periodic MOT Checkpoints in order to ensure data recovery in case of shutdown.

The user must ensure sufficient disk space for the logging and Checkpointing operations. A separate drive can be used for the Checkpoint in order to improve performance by reducing disk I/O load.

You may refer to the MOT Key Technologies section for an overview of how durability is implemented in the MOT engine.

To configure durability -

To ensure strict consistency, set the synchronous_commit parameter to on in the postgresql.conf configuration file.
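A minimal sketch of the corresponding postgresql.conf entry (only synchronous_commit is taken from this document; any other tuning is deployment-specific):

```ini
# postgresql.conf - durability sketch
# synchronous_commit = on makes each commit wait until its WAL
# records have been flushed to disk (strict consistency).
synchronous_commit = on
```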

MOT's WAL Redo Log and Checkpoints enable durability, as described below -

MOT Logging - WAL Redo Log

To ensure durability, MOT is fully integrated with MogDB's Write-Ahead Logging (WAL) mechanism, so that MOT persists data in WAL records using MogDB's XLOG interface. This means that every addition, update and deletion to an MOT table's records is recorded as an entry in the WAL. This ensures that the most current data state can be regenerated and recovered from this non-volatile log. For example, if three new rows were added to a table, two were deleted and one was updated, then six entries would be recorded in the log.

MOT log records are written to the same WAL as the other records of MogDB disk-based tables.

MOT only logs an operation at the transaction commit phase.

MOT only logs the updated delta record in order to minimize the amount of data written to disk.

During recovery, data is loaded from the last known or a specific Checkpoint, and then the WAL Redo Log is used to complete the data changes that occurred from that point onward.

The WAL (Redo Log) retains all the table row modifications until a Checkpoint is performed (as described above). The log can then be truncated in order to reduce recovery time and to save disk space.

Note - In order to ensure that the log I/O device does not become a bottleneck, the log file must be placed on a drive that has low latency.

MOT Logging Types

Two synchronous transaction logging options and one asynchronous transaction logging option are supported (these are also supported by the standard MogDB disk engine). MOT also supports synchronous Group Commit logging with NUMA-awareness optimization, as described below.

According to your configuration, one of the following types of logging is implemented -

  • Synchronous Redo Logging

    The Synchronous Redo Logging option is the simplest and most strict redo logger. When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL (Redo Log), as follows -

    1. While a transaction is in progress, it is stored in the MOT's memory.
    2. After a transaction finishes and the client application sends a Commit command, the transaction is locked and then written to the WAL Redo Log on the disk. This means that while the transaction log entries are being written to the log, the client application is still waiting for a response.
    3. As soon as the transaction's entire buffer is written to the log, the changes to the data in memory take place and then the transaction is committed. After the transaction has been committed, the client application is notified that the transaction is complete.

    Summary

    The Synchronous Redo Logging option is the safest and most strict because it ensures total synchronization of the client application and the WAL Redo log entries for each transaction as it is committed; thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful, when it has not yet been persisted to disk.

    The downside of the Synchronous Redo Logging option is that it is the slowest logging mechanism of the three options. This is because a client application must wait until all data is written to disk and because of the frequent disk writes (which typically slow down the database).

  • Group Synchronous Redo Logging

    The Group Synchronous Redo Logging option is very similar to the Synchronous Redo Logging option, because it also ensures total durability with absolutely no data loss and total synchronization of the client application and the WAL (Redo Log) entries. The difference is that the Group Synchronous Redo Logging option writes groups of transaction redo entries to the WAL Redo Log on the disk at the same time, instead of writing each and every transaction as it is committed. Using Group Synchronous Redo Logging reduces the amount of disk I/Os and thus improves performance, especially when running a heavy workload.

    The MOT engine performs synchronous Group Commit logging with Non-Uniform Memory Access (NUMA)-awareness optimization by automatically grouping transactions according to the NUMA socket of the core on which the transaction is running.

    You may refer to the NUMA Awareness Allocation and Affinity section for more information about NUMA-aware memory access.

    When a transaction commits, a group of entries are recorded in the WAL Redo Log, as follows -

    1. While a transaction is in progress, it is stored in the memory. The MOT engine groups transactions in buckets according to the NUMA socket of the core on which the transaction is running. This means that all the transactions running on the same socket are grouped together, and that multiple groups fill up in parallel according to the cores on which the transactions are running.

      Writing transactions to the WAL is more efficient in this manner because all the buffers from the same socket are written to disk together.

      Note - Each thread runs on a single core/CPU which belongs to a single socket and each thread only writes to the socket of the core on which it is running.

    2. After a transaction finishes and the client application sends a Commit command, the transaction redo log entries are serialized together with other transactions that belong to the same group.

    3. After the configured criteria are fulfilled for a specific group of transactions (quantity of committed transactions or timeout period, as described in the REDO LOG (MOT) section), the transactions in this group are written to the WAL on the disk. This means that while these log entries are being written to the log, the client applications that issued the commit are waiting for a response.

    4. As soon as all the transaction buffers in the NUMA-aware group have been written to the log, all the transactions in the group apply the necessary changes to the memory store and the clients are notified that these transactions are complete.

    Summary

    The Group Synchronous Redo Logging option is an extremely safe and strict logging option, because it ensures total synchronization of the client application and the WAL Redo log entries; thus ensuring total durability and consistency with absolutely no data loss. This logging option prevents the situation where a client application might mark a transaction as successful when it has not yet been persisted to disk.

    On the one hand, this option performs fewer disk writes than the Synchronous Redo Logging option, which may make it faster. The downside is that transactions are locked for longer, meaning that they are locked until all the transactions in the same NUMA-aware group have been written to the WAL Redo Log on the disk.

    The benefits of using this option depend on the type of transactional workload. For example, this option benefits systems that have many transactions (and less so for systems that have few transactions, because there are few disk writes anyway).

  • Asynchronous Redo Logging

    The Asynchronous Redo Logging option is the fastest logging method. However, it does not guarantee against data loss, meaning that some data that is still in the buffer and has not yet been written to disk may be lost upon a power failure or database crash. When a transaction is committed by a client application, the transaction redo entries are recorded in internal buffers and written to disk at preconfigured intervals. The client application does not wait for the data to be written to disk; it continues to the next transaction. This is what makes asynchronous redo logging the fastest logging method.

    When a transaction is committed by a client application, the transaction redo entries are recorded in the WAL Redo Log, as follows -

    1. While a transaction is in progress, it is stored in the MOT's memory.
    2. After a transaction finishes and the client application sends a Commit command, the transaction redo entries are written to internal buffers, but are not yet written to disk. Then changes to the MOT data memory take place and the client application is notified that the transaction is committed.
    3. At a preconfigured interval, a redo log thread running in the background collects all the buffered redo log entries and writes them to disk.

    Summary

    The Asynchronous Redo Logging option is the fastest logging option because it does not require the client application to wait for data to be written to disk. In addition, it groups many transactions' redo entries and writes them together, thus reducing the amount of disk I/Os that slow down the MOT engine.

    The downside of the Asynchronous Redo Logging option is that it does not ensure that data will not be lost upon a crash or failure. Data that was committed, but not yet written to disk, is not durable on commit and thus cannot be recovered in case of a failure. The Asynchronous Redo Logging option is most relevant for applications that are willing to sacrifice data recovery (consistency) in favor of performance.
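Based on the Configuring Logging section below, where synchronous_commit = On selects the synchronous modes, a sketch of enabling asynchronous logging might look as follows; verify the exact value against your MogDB version:

```ini
# postgresql.conf - Asynchronous Redo Logging sketch
# Assumption: since On selects the synchronous modes (see Configuring
# Logging), off selects asynchronous commit. Verify for your version.
synchronous_commit = off
```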

Configuring Logging

Two synchronous transaction logging options and one asynchronous transaction logging option are supported by the standard MogDB disk engine.

To configure logging -

  1. The determination of whether synchronous or asynchronous transaction logging is performed is configured by the synchronous_commit parameter (On = Synchronous) in the postgresql.conf configuration file.

  2. If a synchronous mode of transaction logging has been selected (synchronous_commit = On, as described above), then the enable_group_commit parameter in the mot.conf configuration file determines whether the Group Synchronous Redo Logging option or the Synchronous Redo Logging option is used. For Group Synchronous Redo Logging, you must also define in the mot.conf file which of the following thresholds determines when a group of transactions is recorded in the WAL -

  • group_commit_size - The quantity of committed transactions in a group. For example, 16 means that when 16 transactions in the same group have been committed by a client application, then an entry is written to disk in the WAL Redo Log for all 16 transactions.

  • group_commit_timeout - A timeout period in ms. For example, 10 means that after 10 ms, an entry is written to disk in the WAL Redo Log for each of the transactions in the same group that have been committed by their client application in the last 10 ms.

    NOTE: You may refer to the REDO LOG (MOT) section for more information about configuration settings.
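Putting these settings together, a sketch of the relevant mot.conf entries might look as follows; the values are illustrative assumptions, and the exact names and units should be checked against the REDO LOG (MOT) section:

```ini
# mot.conf - Group Synchronous Redo Logging sketch (illustrative values)
enable_group_commit = true    # group commit instead of per-transaction commit
group_commit_size = 16        # flush a group after 16 committed transactions...
group_commit_timeout = 10 ms  # ...or after 10 ms, whichever comes first
```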

MOT Checkpoints

A Checkpoint is the point in time at which all the data of a table's rows is saved in files on persistent storage in order to create a full durable database image. It is a snapshot of the data at a specific point in time.

A Checkpoint is required in order to reduce a database's recovery time by shortening the quantity of WAL (Redo Log) entries that must be replayed in order to ensure durability. Checkpoints also reduce the storage space required to keep all the log entries.

If there were no Checkpoints, then in order to recover a database, all the WAL redo entries would have to be replayed from the beginning of time, which could take days/weeks depending on the quantity of records in the database. Checkpoints record the current state of the database and enable old redo entries to be discarded.

Checkpoints are essential during recovery scenarios (especially for a cold start). First, the data is loaded from the last known or a specific Checkpoint; and then the WAL is used to complete the data changes that occurred since then.

For example - If the same table row is modified 100 times, then 100 entries are recorded in the log. When Checkpoints are used, then even if a specific table row was modified 100 times, it is recorded in the Checkpoint a single time. After the recording of a Checkpoint, recovery can be performed on the basis of that Checkpoint and only the WAL Redo Log entries that occurred since the Checkpoint need be played.

MOT Recovery

The main objective of MOT Recovery is to restore the data and the MOT engine to a consistent state after a planned shutdown (for example, for maintenance) or an unplanned crash (for example, after a power failure).

MOT recovery is performed automatically with the recovery of the rest of the MogDB database and is fully integrated into the MogDB recovery process (also called a Cold Start).

MOT recovery consists of two stages -

Checkpoint Recovery - First, data must be recovered from the latest Checkpoint file on disk by loading it into memory rows and creating indexes.

WAL Redo Log Recovery - Afterwards, the recent data (which was not captured in the Checkpoint) must be recovered from the WAL Redo Log by replaying records that were added to the log since the Checkpoint that was used in the Checkpoint Recovery (described above).

The WAL Redo Log recovery is managed and triggered by MogDB.

To configure recovery -

While WAL recovery is performed in a serial manner, Checkpoint recovery can be configured to run in a multi-threaded manner (meaning in parallel by multiple workers). To do so, configure the checkpoint_recovery_workers parameter in the mot.conf file, which is described in the RECOVERY (MOT) section.
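As a sketch, the corresponding mot.conf entry might look as follows; the worker count shown is an illustrative assumption to be tuned per machine:

```ini
# mot.conf - parallel Checkpoint recovery sketch (illustrative value)
checkpoint_recovery_workers = 4   # number of parallel Checkpoint recovery workers
```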

MOT Replication and High Availability

Since MOT is integrated into MogDB and uses/supports its replication and high availability, both synchronous and asynchronous replication are supported out of the box.

The MogDB gs_ctl tool is used for availability control and to operate the cluster. This includes gs_ctl switchover, gs_ctl failover, gs_ctl build and so on.

You may refer to the MogDB Tools Reference document for more information.

To configure replication and high availability, refer to the relevant MogDB documentation.

MOT Memory Management

For planning and finetuning, see the MOT Memory and Storage Planning and MOT Configuration Settings sections.

MOT Vacuum

Use VACUUM for garbage collection and, optionally, to analyze a database, as follows -

  • [PG]

    In PostgreSQL (PG), VACUUM reclaims storage occupied by dead tuples. In normal PG operation, tuples that are deleted or made obsolete by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore, it is necessary to perform a VACUUM periodically, especially on frequently updated tables.

  • [MOT Extension]

    MOT tables do not need a periodic VACUUM operation, since dead/empty tuples are re-used by new ones. MOT tables require VACUUM operations only when their size is significantly reduced and they do not expect to grow to their original size in the near future.

    For example, consider an application that periodically (for example, once a week) performs a large deletion of a table's data, while the insertion of new data takes days and does not necessarily reach the same quantity of rows. In such cases, it makes sense to activate VACUUM.

    The VACUUM operation on MOT tables is always transformed into a VACUUM FULL with an exclusive table lock.

  • Supported Syntax and Limitations

    Activation of the VACUUM operation is performed in a standard manner.

    VACUUM [FULL | ANALYZE] [ table ];

    Only the FULL and ANALYZE VACUUM options are supported. The VACUUM operation can only be performed on an entire MOT table.
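    For example, the supported forms can be invoked as follows; customer is an illustrative table name reused from the monitoring examples in this document:

    ```sql
    -- On an MOT table this always runs as VACUUM FULL with an exclusive table lock.
    VACUUM FULL customer;

    -- Garbage collection plus planner statistics for the whole table.
    VACUUM ANALYZE customer;
    ```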

    The following PG vacuum options are not supported:

    • FREEZE

    • VERBOSE

    • Column specification

    • LAZY mode (partial table scan)

      Additionally, the following functionality is not supported:

    • AUTOVACUUM

MOT Statistics

Statistics are intended for performance analysis or debugging. It is uncommon to turn them ON in a production environment (by default, they are OFF). Statistics are primarily used by database developers and to a lesser degree by database users.

There is some impact on performance, particularly on the server. Impact on the user is negligible.

The statistics are saved in the database server log. The log is located in the data folder and named postgresql-DATE-TIME.log.

Refer to STATISTICS (MOT) for detailed configuration options.

MOT Monitoring

All syntax for the monitoring of PG-based FDW tables is supported. This includes table and index sizes (as described below). In addition, special functions exist for monitoring MOT memory consumption, including MOT Global Memory, MOT Local Memory and a single client session.

Table and Index Sizes

The size of tables and indexes can be monitored by querying pg_relation_size.

For example

Data Size

select pg_relation_size('customer');

Index

select pg_relation_size('customer_pkey');
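For readability, the raw byte counts can be wrapped in pg_size_pretty; this relies on the standard PG function being available, which is an assumption to verify in your MogDB version:

```sql
-- Human-readable table and index sizes (assumes pg_size_pretty is available).
select pg_size_pretty(pg_relation_size('customer'));
select pg_size_pretty(pg_relation_size('customer_pkey'));
```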

MOT GLOBAL Memory Details

Check the size of MOT global memory, which primarily includes the data and indexes.

select * from mot_global_memory_detail();

Result -

 numa_node | reserved_size | used_size
-----------+---------------+-------------
        -1 |  194716368896 | 25908215808
         0 |     446693376 |   446693376
         1 |     452984832 |   452984832
         2 |     452984832 |   452984832
         3 |     452984832 |   452984832
         4 |     452984832 |   452984832
         5 |     364904448 |   364904448
         6 |     301989888 |   301989888
         7 |     301989888 |   301989888

Where -

  • -1 is the total memory.
  • 0..7 are NUMA memory nodes.

MOT LOCAL Memory Details

Check the size of MOT local memory, which includes session memory.

select * from mot_local_memory_detail();

Result -

 numa_node | reserved_size | used_size
-----------+---------------+-----------
        -1 |     144703488 | 144703488
         0 |      25165824 |  25165824
         1 |      25165824 |  25165824
         2 |      18874368 |  18874368
         3 |      18874368 |  18874368
         4 |      18874368 |  18874368
         5 |      12582912 |  12582912
         6 |      12582912 |  12582912
         7 |      12582912 |  12582912

Where -

  • -1 is the total memory.
  • 0..7 are NUMA memory nodes.

Session Memory

Memory for session management is taken from the MOT local memory.

Viewing the memory usage of all active sessions (connections) is possible using the following query -

select * from mot_session_memory_detail();

Result -

 sessid                     | total_size | free_size | used_size
----------------------------+------------+-----------+-----------
 1591175063.139755603855104 |    6291456 |   1800704 |   4490752

Legend -

  • total_size - memory allocated for the session
  • free_size - allocated but not in use
  • used_size - in actual use
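As a further illustrative sketch, the per-session rows can be aggregated with standard SQL to get the total local memory in actual use by all sessions:

```sql
-- Total used_size across all active sessions (illustrative aggregate).
select sum(used_size) as total_session_used_size
  from mot_session_memory_detail();
```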

The following query enables a DBA to determine the state of local memory used by the current session -

select * from mot_session_memory_detail()
 where sessid = pg_current_sessionid();

Result -

(The result has the same columns as the query above, limited to the current session's row.)

MOT Error Messages

Errors may be caused by a variety of scenarios. All errors are logged in the database server log file. In addition, user-related errors are returned to the user as part of the response to the query, transaction or stored procedure execution, or to a database administration action.

  • Errors reported in the Server log include - Function, Entity, Context, Error message, Error description and Severity.
  • Errors reported to users are translated into standard PostgreSQL error codes and may consist of an MOT-specific message and description.

The following lists the error messages, error descriptions and error codes. The error code is an internal code that is not logged or returned to users.

Errors Written to the Log File

All errors are logged in the database server log file. The following lists the errors that are written to the database server log file and are not returned to the user. The log is located in the data folder and named postgresql-DATE-TIME.log.

Table 1 Errors Written Only to the Log File

Message in the Log | Error | Internal Code
-------------------+-------+--------------
Error code denoting success | MOT_NO_ERROR | 0
Out of memory | MOT_ERROR_OOM | 1
Invalid configuration | MOT_ERROR_INVALID_CFG | 2
Invalid argument passed to function | MOT_ERROR_INVALID_ARG | 3
System call failed | MOT_ERROR_SYSTEM_FAILURE | 4
Resource limit reached | MOT_ERROR_RESOURCE_LIMIT | 5
Internal logic error | MOT_ERROR_INTERNAL | 6
Resource unavailable | MOT_ERROR_RESOURCE_UNAVAILABLE | 7
Unique violation | MOT_ERROR_UNIQUE_VIOLATION | 8
Invalid memory allocation size | MOT_ERROR_INVALID_MEMORY_SIZE | 9
Index out of range | MOT_ERROR_INDEX_OUT_OF_RANGE | 10
Error code unknown | MOT_ERROR_INVALID_STATE | 11

Errors Returned to the User

The following lists the errors that are written to the database server log file and are returned to the user.

MOT returns PG standard error codes to the envelope using a Return Code (RC). Some RCs cause the generation of an error message to the user who is interacting with the database.

The PG code (described below) is returned internally by MOT to the database envelope, which reacts to it according to standard PG behavior.

NOTE: %s, %u and %lu in the messages are replaced by relevant error information, such as a query, table name or other information, where -

  • %s - String
  • %u - Number
  • %lu - Number

Table 2 Errors Returned to the User and Logged to the Log File

Short and Long Description Returned to the User | PG Code | Internal Error Code
------------------------------------------------+---------+--------------------
Success. Denotes success. | ERRCODE_SUCCESSFUL_COMPLETION | RC_OK = 0
Failure. Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_ERROR = 1
Unknown error has occurred. Denotes aborted operation. | ERRCODE_FDW_ERROR | RC_ABORT
Column definition of %s is not supported. Column type %s is not supported yet. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_UNSUPPORTED_COL_TYPE
Column definition of %s is not supported. Column type Array of %s is not supported yet. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_UNSUPPORTED_COL_TYPE_ARR
Column size %d exceeds max tuple size %u. Column definition of %s is not supported. | ERRCODE_FEATURE_NOT_SUPPORTED | RC_EXCEEDS_MAX_ROW_SIZE
Column name %s exceeds max name size %u. Column definition of %s is not supported. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_COL_NAME_EXCEEDS_MAX_SIZE
Column size %d exceeds max size %u. Column definition of %s is not supported. | ERRCODE_INVALID_COLUMN_DEFINITION | RC_COL_SIZE_INVALID
Cannot create table. Cannot add column %s, as the number of declared columns exceeds the maximum declared columns. | ERRCODE_FEATURE_NOT_SUPPORTED | RC_TABLE_EXCEEDS_MAX_DECLARED_COLS
Cannot create index. Total column size is greater than maximum index size %u. | ERRCODE_FDW_KEY_SIZE_EXCEEDS_MAX_ALLOWED | RC_INDEX_EXCEEDS_MAX_SIZE
Cannot create index. Total number of indexes for table %s is greater than the maximum number of indexes allowed %u. | ERRCODE_FDW_TOO_MANY_INDEXES | RC_TABLE_EXCEEDS_MAX_INDEXES
Cannot execute statement. Maximum number of DDLs per transaction reached the maximum %u. | ERRCODE_FDW_TOO_MANY_DDL_CHANGES_IN_TRANSACTION_NOT_ALLOWED | RC_TXN_EXCEEDS_MAX_DDLS
Unique constraint violation. Duplicate key value violates unique constraint "%s". Key %s already exists. | ERRCODE_UNIQUE_VIOLATION | RC_UNIQUE_VIOLATION
Table "%s" does not exist. | ERRCODE_UNDEFINED_TABLE | RC_TABLE_NOT_FOUND
Index "%s" does not exist. | ERRCODE_UNDEFINED_TABLE | RC_INDEX_NOT_FOUND
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_FOUND
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_NOT_FOUND
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_DELETED
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INSERT_ON_EXIST
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INDEX_RETRY_INSERT
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_INDEX_DELETE
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_LOCAL_ROW_NOT_VISIBLE
Memory is temporarily unavailable. | ERRCODE_OUT_OF_LOGICAL_MEMORY | RC_MEMORY_ALLOCATION_ERROR
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_ILLEGAL_ROW_STATE
Null constraint violated. NULL value cannot be inserted into non-null column %s at table %s. | ERRCODE_FDW_ERROR | RC_NULL_VIOLATION
Critical error. Critical error: %s. | ERRCODE_FDW_ERROR | RC_PANIC
A checkpoint is in progress - cannot truncate table. | ERRCODE_FDW_OPERATION_NOT_SUPPORTED | RC_NA
Unknown error has occurred. | ERRCODE_FDW_ERROR | RC_MAX_VALUE
<recovery message> | ERRCODE_CONFIG_FILE_ERROR | -
<recovery message> | ERRCODE_INVALID_TABLE_DEFINITION | -
Memory engine - Failed to perform commit prepared. | ERRCODE_INVALID_TRANSACTION_STATE | -
Invalid option <option name>. | ERRCODE_FDW_INVALID_OPTION_NAME | -
Invalid memory allocation request size. | ERRCODE_INVALID_PARAMETER_VALUE | -
Memory is temporarily unavailable. | ERRCODE_OUT_OF_LOGICAL_MEMORY | -
Could not serialize access due to concurrent update. | ERRCODE_T_R_SERIALIZATION_FAILURE | -
Alter table operation is not supported for memory table. Cannot create MOT tables while incremental checkpoint is enabled. Re-index is not supported for memory tables. | ERRCODE_FDW_OPERATION_NOT_SUPPORTED | -
Allocation of table metadata failed. | ERRCODE_OUT_OF_MEMORY | -
Database with OID %u does not exist. | ERRCODE_UNDEFINED_DATABASE | -
Value exceeds maximum precision: %d. | ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE | -
You have reached a maximum logical capacity %lu of allowed %lu. | ERRCODE_OUT_OF_LOGICAL_MEMORY | -