Enhanced Efficiency of Logical Backup and Restore

Availability

This feature is available since MogDB 5.0.6.

Introduction

This feature improves the efficiency of logical backup and restore. It supports parallel execution of logical backups when the export file format is directory (-F d / --format=d), as well as parallel import of directory-format export files.

Benefits

This feature meets the demand for efficient backups in scenarios with large data volumes, saving database users both time and storage costs. Parallel import and export deliver substantial performance gains, up to 4 to 10 times in optimal scenarios.

Description

The gs_dump tool adds a new parameter, -j/--jobs=NUM, which enables parallel data export across tables when the export file format is directory. It specifies the number of workers for the backup task, improving the efficiency of backup data export.

The gs_restore tool supports parallel import of files in directory format and custom archive format (.dmp), improving the efficiency of backup data import.

In addition, this feature can shard the data of a single table and import/export each shard in parallel. Starting from MogDB 5.0.8, it can also group the partitions of a partitioned table and import/export the partitions within each group in parallel, further improving backup efficiency.
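As an illustration of partitioned-table parallelism, the following sketch exports one large partitioned table in parallel and imports it back; the database name (postgres) and the table name (orders_part) are illustrative placeholders, and the flags mirror the examples later in this page:

# Parallel export of a single large partitioned table (orders_part is a placeholder)
gs_dump -f backupdir/dir_part postgres -F d -j 8 -t orders_part
# Parallel import of the same table
gs_restore backupdir/dir_part -d postgres -j 8 -t orders_part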

Note:

  • Setting the -j/--jobs parameter to 1 is equivalent to turning off the parallel import/export feature.
  • A worker is a process that executes the backup export or import.
  • Parallel import/export increases MogDB's CPU usage in proportion to the degree of parallelism, raising overall machine load.

Constraints

  • Sharded parallel export of a single table and grouped parallel export of a partitioned table apply only to large tables over 1 GB.

  • Only a single table that was exported in parallel can be imported in parallel (the -j parameters of gs_dump and gs_restore must be used together, and each value must be greater than 1). For example:

    gs_dump -f backupdir/dir_bdat postgres -F d -j 4 -t <table_name>
    gs_restore backupdir/dir_bdat -d postgres -j 4 -t <table_name>
  • If you specify the --inserts/--column-inserts parameter with gs_dump, single-table parallel export cannot be performed.

Examples

-- Specify the number of parallel workers for export as 4
-- Method one:
gs_dump -f backupdir/dir_bdat postgres -F d -j 4
-- Method two:
gs_dump -f backupdir/dir_bdat postgres -F d --jobs=4

-- Specify the number of parallel workers for import as 4
-- Method one:
gs_restore backupdir/dir_bdat -d postgres -j 4
-- Method two:
gs_restore backupdir/dir_bdat -d postgres --jobs=4
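
The two steps can also be combined into a small script. The following is a minimal sketch, assuming a database named postgres and a writable backupdir directory; it times each phase:

#!/bin/bash
# Minimal sketch: parallel directory-format backup followed by a parallel
# restore, timing each phase. Paths and the database name are assumptions.
set -e
BACKUP_DIR=backupdir/dir_bdat
rm -rf "$BACKUP_DIR"    # directory-format dumps need a fresh target directory
time gs_dump -f "$BACKUP_DIR" postgres -F d --jobs=4
time gs_restore "$BACKUP_DIR" -d postgres --jobs=4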

Performance Testing

There are 7 groups of performance tests:

  1. Parallel export and import of the standard TPCC dataset
  2. Parallel export and import of the standard TPCH dataset
  3. Parallel export and import of 1000 small tables
  4. Parallel export and import of a large single table
  5. Parallel export and import of a 17GB partitioned large table
  6. Parallel export and import of a 51GB partitioned large table
  7. Parallel export and import of a 103GB partitioned large table

1. Parallel export and import of the standard TPCC dataset

Export:

[Figure 1: export performance]

Import:

[Figure 2: import performance]

2. Parallel export and import of the standard TPCH dataset

Export:

[Figure 3: export performance]

Import:

[Figure 4: import performance]

3. Parallel export and import of 1000 small tables

Export:

[Figure 5: export performance]

Import:

[Figure 6: import performance]

4. Parallel export and import of a large single table

Export:

[Figure 7: export performance]

Import:

[Figure 8: import performance]

Results Analysis for Groups 1-4

gs_dump

  • gs_dump shows superior performance in scenarios with a large number of tables and single large tables.

  • The export efficiency of the TPCC dataset can be improved by up to 12.5 times, and TPCH by 7.1 times. With 1000 small tables, parallelism can enhance efficiency by up to 7.9 times, and ordinary large tables can be improved by 6.3 to 7.9 times.

  • The optimal performance is observed with a parallelism degree of 8 to 20. Increasing the parallelism degree further does not increase export efficiency, and the CPU usage of MogDB during export is directly proportional to the number of concurrent tasks.

gs_restore

  • gs_restore shows superior performance with a large number of tables, or with a single table in directory format. The .dmp format cannot exploit data parallelism, because gs_dump does not split data in that format, so its gains are limited; it still performs well in scenarios with many small tables.

  • The import performance of the TPCC dataset in directory format can be improved by up to 3.1 times, and TPCH by 2 times. With 1000 small tables, parallelism can enhance efficiency by up to 3.8 times, and ordinary large tables can be improved by up to 5.5 times.

  • The import performance of the TPCC dataset in dmp format can be improved by up to 1.5 times, and TPCH by 1.2 times. With 1000 small tables, parallelism can enhance efficiency by up to 3.8 times, while ordinary large tables show no improvement due to the inability of gs_dump to split data.

  • The optimal performance is observed with a parallelism degree of 10 to 20. Increasing the parallelism degree further does not increase import efficiency, and the CPU usage of MogDB during import is directly proportional to the number of concurrent tasks.
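
Given that gains plateau at a parallelism of roughly 8 to 20 and that CPU usage grows with the worker count, a practical rule of thumb is to bound the worker count by the machine's core count. A minimal sketch, with an illustrative path and database name:

# Cap the worker count at the core count, and at 20, the upper end of the
# range where gains were observed above.
JOBS=$(nproc)
if [ "$JOBS" -gt 20 ]; then JOBS=20; fi
gs_dump -f backupdir/dir_bdat postgres -F d -j "$JOBS"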

5. Parallel export and import of a 17GB partitioned large table

Export:

[Figure 9: export performance]

Import:

[Figure 10: import performance]

6. Parallel export and import of a 51GB partitioned large table

Export:

[Figure 11: export performance]

Import:

[Figure 12: import performance]

7. Parallel export and import of a 103GB partitioned large table

Export:

[Figure 13: export performance]

Import:

[Figure 14: import performance]

Results Analysis for Groups 5-7

For the 103GB partitioned large table, compared with serial export and import, setting the parallel degree to 2, 4, and 8 improves performance (measured by import/export time) by 1x, 3x, and 7x, respectively.

As the degree of parallelism increases, the performance of parallel export and import for partitioned large tables improves as expected, with the 17GB, 51GB, and 103GB partitioned tables all showing consistent, near-linear scalability.
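
To reproduce this kind of scaling measurement on your own data, one approach is to run the same export at several parallel degrees and compare wall-clock times. A minimal sketch, with illustrative paths and database name:

# Export the same database at parallel degrees 1, 2, 4, and 8 and record
# the wall-clock time of each run.
for j in 1 2 4 8; do
    rm -rf "backupdir/dir_j$j"   # directory-format dumps need a fresh target
    echo "== jobs=$j =="
    time gs_dump -f "backupdir/dir_j$j" postgres -F d -j "$j"
done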

Related Pages: gs_dump, gs_restore
