v5.0

Adaptive Two-phase Hash Aggregation

Availability

This feature is available since MogDB 5.0.0.

Introduction

Adaptive two-phase hash aggregation is an optimization technique used to solve the problem of selecting the wrong execution and thus reducing the execution efficiency due to inaccurate cost estimation by the optimizer when performing aggregation operations.

This optimization technique is dynamically adjusted by collecting statistical information at runtime. Specifically, the system will select the less expensive way of execution based on the cost estimation of one-phase aggregation and two-phase aggregation when executing the aggregation operation. The system then monitors the resource utilization and performance metrics during the actual execution and records them.

Benefits

In scenarios where adaptive two-phase hash aggregation is appropriate, query performance can be significantly improved when this feature is turned on. According to actual test data and user feedback, the specific performance improvement is about 6% to 17%. This performance improvement may fluctuate due to specific queries and data size and other factors, but overall, adaptive two-phase hash aggregation can significantly improve the database query performance.

Therefore, for query scenarios suitable for adaptive two-phase hash aggregation, turn on the feature can get a significant performance improvement, but also improve the user's query experience and the overall efficiency of the system.

Description

With adaptive two-phase hash aggregation turned on, the optimizer selects the two-phase hash aggregation for the aggregation query by default.

In the specific implementation, the push-down aggregation operator counts the number of rows in each group of aggregation. When the size of the hash table reaches a certain number, if it is found that most of the groups of the current data have fewer aggregated rows, the system will remove this batch of data from the hash table and upload it to the upper level operator. After a number of consecutive occurrences, the subsequent data will only enter the hash table with a smaller probability to do the aggregation operation, while other data that have not entered the hash table will be directly passed to the upper-layer operator, which is equivalent to performing only one stage of aggregation.

Through such dynamic adjustment, adaptive two-phase hash aggregation can be based on the actual distribution of data and aggregation needs, autonomous decision-making whether to perform two-phase aggregation, thus avoiding unnecessary resource overhead and improving query performance.

Constraints

The parallelism switch needs to be turned on, i.e., set the query_dop parameter value greater than 1.
The adaptive two-phase hash aggregation switch needs to be turned on, i.e., set the enable_adaptive_hashagg parameter value to on.
This feature does not support vectorized execution engines at this time.
This feature does not support scenarios such as grouping sets, distinct aggregation, and ordered-set aggregation functions.

Example

Create a table and insert data.

CREATE TABLE MogDB_adaptive_hashagg(a int , b int );
INSERT INTO MogDB_adaptive_hashagg(a, b) select i / 2, i / 2 from generate_series( 1, 1000000) as t(i);

Collecting statistical information.
```
ANALYZE MogDB_adaptive_hashagg;
```
Turn on the parallel query switch and set the query_dop parameter value greater than 1. Also turn on the adaptive two-phase aggregation operator switch.
```
SET query_dop = 4;
SET enable_adaptive_agg to on;
```

View aggregated query details via explain.

EXPLAIN (COSTS off , ANALYZE ) SELECT a, count ( *), sum (b), avg (b) FROM MogDB_adaptive_hashagg group by a;

Setting adaptive_hashagg_reduce_ratio_threshold (number of rows after aggregation/number of rows before aggregation) to 0.4 when the actual value is 0.5, so there are Passthrogh rows: 650034 rows that are passed directly to the final aggregation stage, i.e., there are 650034 rows for which only one-phase of aggregation is performed.

SET adaptive_hashagg_reduce_ratio_threshold = 0.4;
EXPLAIN (COSTS off , ANALYZE ) SELECT a, count ( *), sum (b), avg (b) FROM MogDB_adaptive_hashagg group by a;

QUERY PLAN                                                                                                           
---------------------------------------------------------------------------------------------------------------------
Streaming(type: LOCAL GATHER dop: 1 / 4) (actual time =[ 390.677, 672.311]..[ 390.677, 672.311], rows = 500001)               
  - >  HashAggregate (actual time =[ 368.699, 589.262]..[ 373.300, 627.672], rows = 500001)                                  
        Group By Key: a                                                                                              
        Max File Num: 48  Min File Num: 48                                                                           
        - >  Streaming(type: LOCAL REDISTRIBUTE dop: 4 / 4) (actual time =[ 0.036, 205.972]..[ 4.657, 243.092], rows = 899897) 
              - >  Adaptive HashAggregate (actual time =[ 7.003, 117.946]..[ 8.701, 150.899], rows = 899897)                 
                    Group By Key: a                                                                                  
                    Passthrough rows: 650034                                                                         
                    - >  Seq Scan on mogdb_adaptive_hashagg (actual time =[ 0.007, 19.261]..[ 0.009, 26.093], rows = 1000000)

Set adaptive_hashagg_reduce_ratio_threshold (number of rows after aggregation/number of rows before aggregation) to 0.8. Based on the display message Passthrough rows is 0, it can be seen that there is no fallback to one-phase aggregation at this point.

SET adaptive_hashagg_reduce_ratio_threshold = 0.8;
EXPLAIN (COSTS off , ANALYZE ) SELECT a, count ( *), sum (b), avg (b) FROM MogDB_adaptive_hashagg group by a;

QUERY PLAN                                                                                                            
----------------------------------------------------------------------------------------------------------------------
Streaming(type: LOCAL GATHER dop: 1 / 4) (actual time =[ 385.932, 678.295]..[ 385.932, 678.295], rows = 500001)                
  - >  HashAggregate (actual time =[ 362.077, 562.587]..[ 369.987, 630.232], rows = 500001)                                   
        Group By Key: a                                                                                               
        Max File Num: 48  Min File Num: 48                                                                            
        - >  Streaming(type: LOCAL REDISTRIBUTE dop: 4 / 4) (actual time =[ 15.825, 217.394]..[ 23.797, 280.069], rows = 500045)
              - >  Adaptive HashAggregate (actual time =[ 30.079, 203.190]..[ 34.597, 224.362], rows = 500045)                
                    Group By Key: a                                                                                   
                    Passthrough rows: 0                                                                               
                    - >  Seq Scan on mogdb_adaptive_hashagg (actual time =[ 0.007, 18.349]..[ 0.010, 20.279], rows = 1000000)

Setting to allow disk drops when there is insufficient memory in the pre-aggregation phase, i.e., setting the value of the adaptive_hashagg_allow_spill parameter to true, will allow you to see from the Max File Num that a disk drop occurs at this point.

SET adaptive_hashagg_allow_spill = true;
EXPLAIN (COSTS off , ANALYZE ) SELECT a, count ( *), sum (b), avg (b) FROM MogDB_adaptive_hashagg group by a;

QUERY PLAN                                                                                                            
----------------------------------------------------------------------------------------------------------------------
Streaming(type: LOCAL GATHER dop: 1 / 4) (actual time =[ 498.529, 773.567]..[ 498.529, 773.567], rows = 500001)                
  - >  HashAggregate (actual time =[ 476.078, 710.897]..[ 486.680, 734.490], rows = 500001)                                   
        Group By Key: a                                                                                               
        Max File Num: 48  Min File Num: 48                                                                            
        - >  Streaming(type: LOCAL REDISTRIBUTE dop: 4 / 4) (actual time =[ 67.091, 300.522]..[ 78.236, 439.433], rows = 500045)
              - >  Adaptive HashAggregate (actual time =[ 82.496, 238.676]..[ 89.664, 309.068], rows = 500045)                
                    Group By Key: a                                                                                   
                    Passthrough rows: 0                                                                               
                    Max File Num: 48  Min File Num: 48                                                                
                    - >  Seq Scan on mogdb_adaptive_hashagg (actual time =[ 0.008, 19.271]..[ 0.017, 19.918], rows = 1000000)

Setting the adaptive_hashagg_min_rows parameter for determining if the minimum number of rows suitable for two-phase is 10,000,000, which cannot be degraded to a one-phase aggregation because this value exceeds the actual number of rows in the table by 1,000,000.

SET adaptive_hashagg_min_rows = 10000000;

RESET adaptive_hashagg_reduce_ratio_threshold;
EXPLAIN (COSTS off , ANALYZE ) SELECT a, count ( *), sum (b), avg (b) FROM MogDB_adaptive_hashagg group by a;
QUERY PLAN                                                                                                            
----------------------------------------------------------------------------------------------------------------------
Streaming(type: LOCAL GATHER dop: 1 / 4) (actual time =[ 556.257, 843.716]..[ 556.257, 843.716], rows = 500001)                
  - >  HashAggregate (actual time =[ 533.540, 723.769]..[ 540.334, 796.572], rows = 500001)                                   
        Group By Key: a                                                                                               
        Max File Num: 48  Min File Num: 48                                                                            
        - >  Streaming(type: LOCAL REDISTRIBUTE dop: 4 / 4) (actual time =[ 69.097, 336.018]..[ 75.762, 457.721], rows = 500045)
              - >  Adaptive HashAggregate (actual time =[ 85.874, 243.790]..[ 88.188, 319.401], rows = 500045)                
                    Group By Key: a                                                                                   
                    Passthrough rows: 0                                                                               
                    Max File Num: 48  Min File Num: 48                                                                
                    - >  Seq Scan on mogdb_adaptive_hashagg (actual time =[ 0.008, 19.203]..[ 0.011, 19.495], rows = 1000000)

query_dop, enable_adaptive_hashagg, adaptive_hashagg_reduce_ratio_threashold, adaptive_hashagg_min_rows, adaptive_hashagg_allow_spill

Issue