HomeMogDBMogDB StackUqbar
v5.0

Documentation:v5.0

Supported Versions:

Other Versions:

Aggregate Functions Distinct Performance Optimization

Availability

This feature is available since MogDB 5.0.0.

Introduction

In practice, we often use the distinct statement in the aggregation function to remove duplicates of the data first and then aggregation operations, the characteristics of the distinct statement to optimize the performance of the implementation .

Benefits

Improve the performance of distinct query statements in aggregation functions, with a performance improvement of up to 3 times or more in the test scenarios.

Description

For the implementation of distinct, before the aggregation calculation, the scanned data will be cached first, sorted by distinct columns, and then only take the first line of data in each grouping for the aggregation operation, so as to realize the process of data filtering and then aggregation.

The above implementation of the data de-duplication operation using the sorting method, and the sorting operation itself has a great impact on the performance, if the output is not sorted requirements, purely for the de-duplication of the data, HashAgg can quickly realize the data de-duplication, and the implementation performance is far better than the sorting operation, based on this, the characteristics of this feature through the two-layer agg arithmetic to achieve the distinct operation, the bottom layer through the HashAgg first de-emphasizes the data, and the top layer agg operator then aggregates the de-emphasized data, thus improving the execution performance.

Constraints

  • Distinct performance has been optimized for concurrent scenarios, and this performance optimization is only for non-concurrent scenarios.
  • Scenarios where distinct is multiple columns are not supported.

Example

MogDB=# CREATE TABLE tb_distinct(a INT, b INT, c INT, d INT, e INT);
CREATE TABLE
MogDB=# EXPLAIN VERBOSE SELECT a, COUNT(DISTINCT b) FROM tb_distinct GROUP BY a;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 HashAggregate  (cost=41.59..43.59 rows=200 width=24)
   Output: a, count(b)
   Group By Key: tb_distinct.a
   ->  HashAggregate  (cost=38.59..40.59 rows=200 width=16)
         Output: a, b
         Group By Key: tb_distinct.a, tb_distinct.b
         ->  Seq Scan on public.tb_distinct  (cost=0.00..26.34 rows=1634 width=8)
               Output: a, b
(8 rows)
Copyright © 2011-2024 www.enmotech.com All rights reserved.