
Workload-level Index Recommendation

For workload-level indexes, you can run scripts outside the database to use this function. It takes a workload consisting of multiple DML statements as input and generates a batch of indexes that optimizes the overall execution performance of that workload. It also provides a function for extracting service SQL statements from logs.

Prerequisites

  • The database is normal, and the client can be connected properly.
  • The gsql tool has been installed by the current user, and the tool path has been added to the PATH environment variable.

Service Data Extraction

SQL in Logs

  1. Set the following GUC parameters:

     log_min_duration_statement = 0
     log_statement = 'all'
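A sketch of setting both parameters with gs_guc, assuming DATADIR points at the instance data directory (a configuration fragment; exact flags depend on your installation):

```shell
# Illustrative only: reload applies the change without a restart.
# DATADIR is an assumed variable holding the data directory path.
gs_guc reload -D "$DATADIR" -c "log_min_duration_statement = 0"
gs_guc reload -D "$DATADIR" -c "log_statement = 'all'"
```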

  2. Run the following command to extract SQL statements based on logs:

gs_dbmind component extract_log LOG_DIRECTORY OUTPUT_FILE LOG_LINE_PREFIX [-d DATABASE] [-U USERNAME] [--start_time] [--sql_amount] [--statement] [--json] [--max_reserved_period] [--max_template_num]

    The input parameters are as follows:

    • LOG_DIRECTORY: directory where pg_log is stored.
    • OUTPUT_FILE: path of the output file for the extracted SQL statements, that is, the path for storing the extracted service data.
    • LOG_LINE_PREFIX: specifies the prefix format of each log.
    • DATABASE (optional): database name. If this parameter is not specified, all databases are selected by default.
    • USERNAME (optional): username. If this parameter is not specified, all users are selected by default.
    • start_time (optional): start time for log collection. If this parameter is not specified, all files are collected by default.
    • sql_amount (optional): maximum number of SQL statements to be collected. If this parameter is not specified, all SQL statements are collected by default.
    • statement (optional): collects the SQL statements that start with the keyword statement in the pg_log logs. If this parameter is not specified, such statements are not collected by default.
    • json (optional): specifies that the collected log files are stored in JSON format after SQL normalization. If no format is specified, each SQL statement occupies a line.
    • max_reserved_period (optional): specifies the maximum number of days a template is retained during incremental log collection in JSON mode. If this parameter is not specified, templates are retained permanently by default.
    • max_template_num (optional): specifies the maximum number of templates that can be retained in JSON mode. If this parameter is not specified, all templates are retained by default.

    An example is provided as follows.

    gs_dbmind component extract_log $GAUSSLOG/pg_log/dn_6001 sql_log.txt '%m %c %d %p %a %x %n %e' -d postgres -U omm --start_time '2021-07-06 00:00:00' --statement

    NOTE: If the -d/-U parameter is specified, the prefix of each log record must contain %d and %u. If transactions need to be extracted, %p must be specified. For details, see the log_line_prefix parameter. It is recommended that max_template_num be less than or equal to 5000 to avoid a long execution time of the workload index recommendation.
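In JSON mode, statements that differ only in constant values are grouped under one template, which is what max_template_num bounds. A minimal sketch of such normalization in Python (illustrative; the tool's actual normalization rules may differ):

```python
import re

def normalize_sql(sql: str) -> str:
    """Collapse literals so statements differing only in constants
    map to the same template."""
    s = sql.strip().lower()
    s = re.sub(r"'(?:[^']|'')*'", "?", s)     # string literals -> ?
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)  # numeric literals -> ?
    s = re.sub(r"\s+", " ", s)                # normalize whitespace
    return s

queries = [
    "SELECT * FROM bmsql_item WHERE i_id = 101",
    "SELECT * FROM bmsql_item WHERE i_id = 202",
]
# Both statements collapse into a single template.
templates = {normalize_sql(q) for q in queries}
```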

  3. Restore the GUC parameters set in step 1 to their original values.

    NOTE:

    After service data extraction is complete, restore the GUC parameters to their original configurations. Otherwise, the log files may bloat.

SQL in System Tables

To enable this function, run the following command:

echo PASSWORD | gs_dbmind component fetch_statement DB_PORT DATABASE OUTPUT [--db-host DB_HOST] [-U DB_USER] [--schema SCHEMA]
[--statement-type {asp,slow,history,activity}] [--start-time START_TIME] [--end-time END_TIME] [--verify] [--driver]

The input parameters are as follows:

  • DB_PORT: port number of the connected database.
  • DATABASE: name of the connected database.
  • OUTPUT: output file containing the extracted SQL statements.
  • DB_HOST (optional): host address of the database to connect to.
  • DB_USER (optional): username for connecting to the database. The user needs to have the sysadmin or monitor admin permission.
  • SCHEMA: schema name. It is used only when statement-type is set to history. The default value is public.
  • statement-type: SQL statement type, including asp, slow, history, and activity.
    • asp: SQL statements are extracted from gs_asp. This requires the enable_asp parameter to be enabled.
    • slow: current active slow SQLs are extracted.
    • history: historical slow SQLs are extracted.
    • activity: current active SQLs are extracted.
  • START_TIME: start time for log collection. It is used only when statement-type is set to asp, in which case it is mandatory.
  • END_TIME: end time for log collection. It is used only when statement-type is set to asp, in which case it is mandatory.
  • verify: whether to verify the SQL validity.
  • driver: whether to use the Python driver to connect to the database. By default, gsql is used for the connection.
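An example invocation may help here; the port number, database, user, and file name below are placeholder assumptions, not fixed values:

```shell
# Hypothetical values: port 6789, database postgres, user omm.
# Extracts historical slow SQL statements and verifies their validity.
echo PASSWORD | gs_dbmind component fetch_statement 6789 postgres statements.txt \
    -U omm --statement-type history --verify
```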

Procedure for Using the Index Recommendation Script

  1. Prepare a file that contains multiple DML statements as the input workload. Each statement in the file occupies a line. You can obtain historical service statements from the offline logs of the database.
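A minimal workload file might look as follows; the table and column names are only placeholders, and each line must hold one complete DML statement:

```shell
# Build a three-statement workload file, one statement per line.
cat > workload.sql <<'EOF'
SELECT * FROM bmsql_item WHERE i_id = 101;
UPDATE bmsql_stock SET s_quantity = 10 WHERE s_i_id = 5 AND s_w_id = 1;
DELETE FROM bmsql_new_order WHERE no_w_id = 2;
EOF
# Print the number of statements in the workload.
wc -l < workload.sql
```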

  2. To enable this function, run the following command:

    echo PASSWORD | gs_dbmind component index_advisor DB_PORT DATABASE FILE [--db-host DB_HOST] [-U DB_USER] [--schema SCHEMA]
    [--max_index_num MAX_INDEX_NUM] [--max_index_storage MAX_INDEX_STORAGE] [--multi_iter_mode] [--max-n-distinct MAX_N_DISTINCT]
    [--min-improved-rate MIN_IMPROVED_RATE] [--max-candidate-columns MAX_CANDIDATE_COLUMNS] [--max-index-columns MAX_INDEX_COLUMNS]
    [--min-reltuples MIN_RELTUPLES] [--multi_node] [--json] [--driver] [--show_detail] [--show-benefits]

    The input parameters are as follows:

    • DB_PORT: port number of the connected database.
    • DATABASE: name of the connected database.
    • FILE: file path that contains the workload statement.
    • DB_HOST (optional): host address of the database to connect to.
    • DB_USER (optional): username for connecting to the database.
    • SCHEMA: schema name.
    • MAX_INDEX_NUM (optional): maximum number of recommended indexes.
    • MAX_INDEX_STORAGE (optional): maximum size of the index set space.
    • MAX_N_DISTINCT: reciprocal of the number of distinct values of a candidate index column. The default value is 0.01.
    • MIN_IMPROVED_RATE: minimum improved rate. The default value is 0.1.
    • MAX_CANDIDATE_COLUMNS (optional): maximum number of candidate index columns.
    • MAX_INDEX_COLUMNS: maximum number of index columns. The default value is 4.
    • MIN_RELTUPLES: minimum number of records. The default value is 10000.
    • multi_node (optional): specifies whether the current instance is a distributed database instance.
    • multi_iter_mode (optional): algorithm mode. You can switch the algorithm mode by setting this parameter.
    • json (optional): specifies the file path format of the workload statement as JSON after SQL normalization. By default, each SQL statement occupies one line.
    • driver (optional): specifies whether to use the Python driver to connect to the database. By default, gsql is used for the connection.
    • show_detail (optional): specifies whether to display the detailed optimization information about the current recommended index set.
    • show-benefits (optional): whether to show index benefits.
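An example invocation may look as follows; the port number, user, and workload file name are placeholder assumptions:

```shell
# Hypothetical values: port 6789, database postgres, workload file workload.sql.
echo PASSWORD | gs_dbmind component index_advisor 6789 postgres workload.sql \
    -U omm --schema public --max_index_num 10 --multi_iter_mode
```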

    The recommendation result is a batch of indexes, displayed on the screen as a series of CREATE INDEX statements. The following is an example of the result.

    create index ind0 on public.bmsql_stock(s_i_id,s_w_id);
    create index ind1 on public.bmsql_customer(c_w_id,c_id,c_d_id);
    create index ind2 on public.bmsql_order_line(ol_w_id,ol_o_id,ol_d_id);
    create index ind3 on public.bmsql_item(i_id);
    create index ind4 on public.bmsql_oorder(o_w_id,o_id,o_d_id);
    create index ind5 on public.bmsql_new_order(no_w_id,no_d_id,no_o_id);
    create index ind6 on public.bmsql_customer(c_w_id,c_d_id,c_last,c_first);
    create index ind7 on public.bmsql_new_order(no_w_id);
    create index ind8 on public.bmsql_oorder(o_w_id,o_c_id,o_d_id);
    create index ind9 on public.bmsql_district(d_w_id);
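Before applying the recommendations, you may want to review or filter them programmatically. A small Python sketch that parses each recommended statement into its index name, table, and column list (illustrative; the parse_recommendation helper is not part of the tool):

```python
import re

# Matches statements of the form produced above:
#   create index NAME on TABLE(col1,col2,...);
PATTERN = re.compile(r"create index (\w+) on ([\w.]+)\(([^)]*)\);",
                     re.IGNORECASE)

def parse_recommendation(line: str):
    """Return (index_name, table, columns) or None if the line
    is not a recommended CREATE INDEX statement."""
    m = PATTERN.match(line.strip())
    if not m:
        return None
    name, table, cols = m.groups()
    return name, table, [c.strip() for c in cols.split(",")]

rec = parse_recommendation(
    "create index ind0 on public.bmsql_stock(s_i_id,s_w_id);")
# rec == ("ind0", "public.bmsql_stock", ["s_i_id", "s_w_id"])
```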