
Documentation:v3.4


MVD Compare Mode

In MVD, multiple comparison modes are supported for table data, including:

  1. Partition Split Compare: slice the data along partition boundaries and compare the slices concurrently; corresponds to the mode "Partition"
  2. Dynamic Split Compare: randomly select slices of data from the table and compare them concurrently; corresponds to the mode "Sample"
  3. Full Split Compare: a special case of dynamic split where the sampling ratio is 1 and all slices are compared concurrently; also corresponds to the mode "Sample"
  4. Manually Split Compare: divide the data into slices manually and compare them concurrently; corresponds to the mode "Split"
  5. Partial Data Compare: specify a data filter manually and compare within a single process; corresponds to the mode "Partial"
  6. Normal Data Compare: if none of the above modes applies, the default mode "Normal" is used
  7. Summary Signature Compare: unlike the row-by-row modes above, this mode computes an overall hash value for the whole table and identifies whether the table data differs with a single comparison

Partition Split Compare

Background

The main idea of partition split compare is that, in most databases, queries on partitioned tables benefit from partition pruning, and partitioned tables are generally large, so a single full-table query in normal compare mode would be time-consuming.

By querying and merging the partition boundary values of the source and target tables, we obtain an overall division of the data range into slices. Each slice falls entirely within one partition on both the source and target sides, so the partition pruning strategy can take effect.

The resulting slices can then be handed over to multiple sub-processes for concurrent data comparison, which speeds up data comparison for large partitioned tables.

Limitation

  1. Only RANGE and LIST partitions are currently supported, and the partition column must be a single column.
  2. Both sides must be partitioned by the same kind of partition on the same column in order to use partition split compare.

Workflow

The workflow of partition split compare is as follows:

  1. Get the partition type, partition column, and partition boundary list of the source and target tables
  2. Check whether the partition split compare requirements are met; if not, fall back to normal compare
  3. Merge the partition boundaries of both sides, e.g. source [<1, <10, <100] and target [<1, <50, <100] merge into [<1, <10, <50, <100]
  4. Convert the merged boundaries into filter conditions for the source and target
  5. Dispatch the different filter conditions to different sub-processes, which compare the partitioned data concurrently
  6. After all partitions are compared, parse and merge the comparison results of each partition
  7. Return the merged comparison result
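Steps 3 and 4 above can be sketched as follows. This is a minimal illustration of merging RANGE boundaries and deriving per-slice filter conditions; the function names and string format are hypothetical, not MVD's actual internals.

```python
def merge_boundaries(source, target):
    """Merge two sorted boundary lists, e.g. [1, 10, 100] and [1, 50, 100]."""
    return sorted(set(source) | set(target))

def boundaries_to_filters(column, boundaries):
    """Convert merged boundaries into per-slice WHERE conditions."""
    filters = [f"{column} < {boundaries[0]}"]            # open slice below the first boundary
    for lo, hi in zip(boundaries, boundaries[1:]):       # one slice per adjacent boundary pair
        filters.append(f"{column} >= {lo} AND {column} < {hi}")
    filters.append(f"{column} >= {boundaries[-1]}")      # open slice above the last boundary
    return filters

merged = merge_boundaries([1, 10, 100], [1, 50, 100])
print(merged)                                            # [1, 10, 50, 100]
print(boundaries_to_filters("id", merged))
```

Each resulting filter string defines one slice that a sub-process can compare independently on both sides.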

Dynamic Split Compare

Background

The main idea of dynamic split compare is that a database may contain huge single tables that were not created as partitioned tables, so a single full-table query in normal compare mode would be time-consuming.

By analyzing the primary/unique key data of the source and target tables, the data can be divided into a number of slices, from which a proportional number of slices is randomly selected. Comparing these slices reveals whether the table data differs, which makes this a partial data comparison mode.

The selected slices can then be handed over to multiple sub-processes for concurrent data comparison, which speeds up data comparison for large tables.

Parameters

Dynamic split compare involves two parameters:

  1. --sample-size: the dynamic sampling threshold; dynamic split compare is enabled only if the size of the source or target table exceeds this threshold
  2. --sample-pct: the sampling ratio (range: (0, 1]) for dynamic split compare, i.e. the percentage of the data that will be compared

Limitation

  1. Currently, only tables with a single-column primary/unique key, or with a highest-cardinality leading column in an index, are supported, and the column type must be a common type (excluding BLOB, CLOB, and other large object types)
  2. Both sides must use the same primary/unique key or highest-cardinality column to enable dynamic split compare
  3. Dynamic split compare is evaluated only for tables whose size exceeds the specified threshold

Workflow

The workflow of the dynamic split compare is as follows:

  1. Get the structural information of the source and target tables: columns, constraints, etc.
  2. Check whether the dynamic split compare requirements are met; if not, fall back to normal data compare
  3. Obtain the real data interval on each side: divide the primary/unique key values into 100 roughly uniform parts using percent_rank() or a similar function
  4. Randomly select a number of the 100 slices according to the sampling ratio
  5. Force slices 0 and 99 to be included, since data differences are more likely at the boundary values
  6. Merge all selected slices into the final slice list and convert each slice into the corresponding data filter condition
  7. Dispatch the different filter conditions to different sub-processes for concurrent slice comparison
  8. After all slices are compared, parse and merge the comparison results of each slice
  9. Return the merged comparison result
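Steps 4-6 above can be sketched as follows. This is an illustrative model of picking a proportional sample of the 100 slices, forcing slices 0 and 99, and merging adjacent picks into ranges; the function names are hypothetical, not MVD's actual internals.

```python
import random

def select_slices(sample_pct, n_slices=100, seed=None):
    """Step 4: randomly pick sample_pct of the slices; step 5: force 0 and 99."""
    rng = random.Random(seed)
    count = max(1, int(n_slices * sample_pct))
    picked = set(rng.sample(range(n_slices), count))
    picked.update({0, n_slices - 1})          # boundary slices are always compared
    return sorted(picked)

def merge_adjacent(slices):
    """Step 6: merge consecutive slice ids into (start, end) ranges."""
    ranges = []
    for s in slices:
        if ranges and s == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], s)   # extend the previous range
        else:
            ranges.append((s, s))             # start a new range
    return ranges

picked = select_slices(0.1, seed=42)
print(picked)                                 # ~10 slice ids, always including 0 and 99
print(merge_adjacent(picked))
```

Each merged range would then be converted into one data filter condition and handed to a sub-process.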

Example

Hypothetical scenario: the ID column of a table holds 1000 records with values 1-1000, plus two extreme values, -1000000 and 1000000, so the data is heavily skewed.

Original data: -1000000, 1, 2, 3 ... 999, 1000, 1000000

If we divide evenly between the minimum and maximum values, the total range is 2,000,000; divided into 100 slices, each slice spans 20,000:

Min-max slices: <-980000, (-980000, -960000], (-960000, -940000] ... (0, 20000] ... (960000, 980000], (980000, 1000000]

In this case, the 1000 rows with values 1-1000 all fall into a single slice, and random sampling of the slices will most likely query 0 rows of data.

In dynamic split compare, we instead use percent_rank() to obtain the true rank position of each row and divide the data into 100 slices by that position, i.e. each 0.01 interval of the percent_rank() value becomes one slice.

True value slices: <10, (10, 20], (20, 30] ... (980, 990], (990, 1000000].

In this case, the two extreme values fall into the first and last slices and do not affect the data distribution of the other slices. This kind of slicing works well in scenarios with various forms of data skew.
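The effect described above can be demonstrated with a small pure-Python sketch (not MVD code): on the skewed data set, equal-width (min/max) slicing piles nearly all rows into one slice, while rank-based (percent_rank-style) slicing keeps every slice balanced.

```python
data = [-1000000] + list(range(1, 1001)) + [1000000]   # 1002 heavily skewed rows

def equal_width_counts(values, n=100):
    """Rows per slice when the min-max range is split into n equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    counts = [0] * n
    for v in values:
        counts[min(int((v - lo) / width), n - 1)] += 1
    return counts

def percentile_counts(values, n=100):
    """Rows per slice when slicing by rank, like percent_rank() buckets."""
    ordered = sorted(values)
    size = len(ordered)
    counts = [0] * n
    for rank in range(size):
        counts[min(rank * n // size, n - 1)] += 1
    return counts

print(max(equal_width_counts(data)))   # 1000: one slice holds almost everything
print(max(percentile_counts(data)))    # ~11: every slice holds about 10 rows
```

With equal-width slicing, a random sample of slices would mostly hit empty slices; with rank-based slicing, every sampled slice carries real data.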

Full Split Compare

In dynamic split compare mode, if the sampling ratio is set to 1, the full data set is compared; in this case the slice column is restricted to a single-column primary key.

Manually Split Compare

With the DATA_FILTER option in the parameter file, you can manually split large tables into pieces; the resulting pieces are then handed over to different child processes for concurrent comparison.

For configuration on specific options, please refer to: MVD Configurations

Partial Data Compare

You can use the DATA_FILTER option in the parameter file, or the --data-filter option on the command line, to compare only the part of the table data that you care about, speeding up comparison by narrowing the data comparison range.

For configuration on specific options, please refer to: MVD Configurations

Normal Data Compare

Normal data compare queries the source and target tables separately in a single process and completes the final data comparison.

Summary Signature Compare

Unlike the row-by-row comparison modes above, summary signature compare computes an overall hash value for the whole table and identifies whether the table data differs with a single comparison.

Summary signature compare is much more efficient than row-by-row comparison: through continuous hash calculation and a one-time comparison, the time spent on data comparison can be reduced by about 50% to 70%, so it can be used to quickly locate anomalous tables.

Combined with the -m/--mode option, summary signature compare can first identify which tables differ, after which row-by-row comparison can identify the specific differing rows, achieving fast and accurate data difference comparison.

Attention: if most of the compared tables differ, the '-m a' comparison may take longer than row-by-row comparison.
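The idea of a continuous (streaming) hash over table data can be sketched as follows. This is only an illustration of the technique, not MVD's actual algorithm; note that rows must be fetched in the same deterministic order (e.g. ORDER BY primary key) on both sides for the digests to be comparable.

```python
import hashlib

def table_signature(rows):
    """Fold every row into one SHA-256 digest; compare one value per table."""
    digest = hashlib.sha256()
    for row in rows:                       # continuous hash over the row stream
        digest.update(repr(row).encode())  # illustrative canonical row encoding
    return digest.hexdigest()

source = [(1, "alice"), (2, "bob")]
target = [(1, "alice"), (2, "bobby")]

print(table_signature(source) == table_signature(source))  # True
print(table_signature(source) == table_signature(target))  # False: tables differ
```

A single signature mismatch flags the table as anomalous; a row-by-row mode can then be run on just that table to locate the differing rows.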
