Overview

The DeepSQL feature is compatible with the MADlib framework and implements AI algorithms inside the database. It provides a complete set of SQL-based machine learning, data mining, and statistics algorithms, so users can perform machine learning directly with SQL statements. DeepSQL abstracts the end-to-end development process from data to models. With the underlying engine and automatic optimization, technical personnel with basic SQL knowledge can complete most machine learning model training and prediction tasks. All analysis and processing run in the database engine, so users can analyze and process data in place, which avoids unnecessary data movement between the database and other platforms.

DeepSQL is an enhancement to MogDB DB4AI that allows data analysts or developers who are familiar with MADlib to easily migrate data to MogDB. DeepSQL encapsulates common machine learning algorithms into SQL statements and supports more than 60 general algorithms, including regression algorithms (such as linear regression, logistic regression, and random forest), classification algorithms (such as KNN), and clustering algorithms (such as K-means). In addition to basic machine learning algorithms, it includes graph algorithms, such as shortest path and graph diameter. It also supports data processing (such as PCA), sparse vectors, common statistical algorithms (such as covariance and Pearson coefficient calculation), training set and test set splitting, and cross validation.

Table 1 Supported machine learning algorithms: regression algorithms

Algorithm Name	Abbreviation	Application Scenario
Logistic regression	-	For example, find the risk factors of a disease, or evaluate enterprises for financial and commercial institutions. Prediction: Use a model to predict the probability that a disease or situation occurs under different independent variables. Judgment: Use a model to determine the probability that a person has a certain disease or is in a certain situation.
Cox proportional hazards regression	-	The model takes the survival outcome and the survival time as dependent variables. It can analyze the influence of many factors on survival time simultaneously and can handle data with censored survival times, without needing to estimate the distribution type of the data. Because of these properties, the model has been widely used in medical research since its inception and is the most widely used multi-factor analysis method in survival analysis.
Elastic net regularization	-	Elastic net regression is a hybrid of ridge regression and lasso regression that uses L2 and L1 regularization at the same time. When there are multiple correlated features, lasso regression is likely to select one of them at random, while elastic net regression is likely to select all of them.
Generalized linear model	-	In some practical problems, the relationship between the response and the predictors is not linear. A generalized linear model handles this by relating the mean of the response to a linear predictor through a link function.
Marginal effect	-	Calculation of marginal effects.
Multinomial regression	-	If there are more than two target categories, multinomial regression is required. For example, evaluate the curative effect with "ineffective", "effective", and "cured".
Ordinal regression	-	In statistics, ordinal regression is a regression analysis used to predict ordinal variables, that is, variables whose values lie on an ordered scale where only the relative order is meaningful and the distances between values are not necessarily equal. It can be considered a problem between regression and classification. Examples include the severity of illness (levels 1, 2, 3, and 4), the pain scale (no pain, mild, moderate, and severe), and the drug dose-response effects (ineffective, less effective, effective, and very effective). For example, the difference between no pain and mild is not necessarily equal to the difference between moderate and severe.
Clustered variance	-	The clustered variance module adjusts standard errors for clustering. For example, copying a dataset 100 times should not increase the precision of the parameter estimates, but running the estimation under the independent and identically distributed (IID) assumption would report increased precision.
Robust variance	-	The functions in the robust variance module compute the robust variance (Huber-White estimator) for linear regression, logistic regression, multinomial logistic regression, and Cox proportional hazards regression. They are useful for computing variances in datasets with potentially anomalous noise.
Support vector machine	SVM	SVM is used for text and hypertext classification and for image classification, where it can achieve higher search accuracy than traditional query refinement schemes. This also applies to image segmentation systems.
Linear regression	-	This is widely used in economics and finance.
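
The elastic net row in Table 1 says that L1 and L2 regularization are used at the same time. The following is a minimal Python sketch of one common way to combine the two penalties. It is not DeepSQL or MADlib code: the function name `elastic_net_penalty` and the parameters `lam` (overall strength) and `alpha` (mixing ratio) are assumptions made for illustration.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Return lam * (alpha * L1 + (1 - alpha) / 2 * L2^2) for a weight vector.

    Assumed parameterization (hypothetical names):
    alpha = 1.0 gives a pure lasso (L1) penalty,
    alpha = 0.0 gives a pure ridge (squared L2) penalty.
    """
    l1 = sum(abs(w) for w in weights)        # lasso term: sum of |w|
    l2_squared = sum(w * w for w in weights)  # ridge term: sum of w^2
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


if __name__ == "__main__":
    w = [1.0, -2.0]
    print(elastic_net_penalty(w, lam=1.0, alpha=1.0))  # lasso only
    print(elastic_net_penalty(w, lam=1.0, alpha=0.0))  # ridge only
    print(elastic_net_penalty(w, lam=2.0, alpha=0.5))  # equal mix
```

With `alpha` between 0 and 1, the L1 term still drives some weights to zero while the L2 term spreads weight across correlated features, which matches the behavior described in the table.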

Table 2 Supported machine learning algorithms: other supervised learning

Algorithm Name	Abbreviation	Application Scenario
Decision tree	-	It is one of the most widely used inductive inference algorithms. It handles classification and prediction problems with categorical or continuous variables. The model can be represented by graphs and if-then rules, which makes it highly readable.
Random forest	RF	Random forest is an ensemble method designed specifically for decision tree classifiers. It combines multiple decision trees to make predictions.
Conditional random field	CRF	CRF is a discriminative undirected probabilistic graphical model. A linear chain CRF is a special type of CRF that assumes that the current state depends only on the previous state. Good results have been obtained in sequence labeling tasks such as word segmentation, part-of-speech tagging, and named entity recognition.
Naive Bayes	-	Naive Bayes classifies by computing probabilities and can be used for multi-class problems, such as spam filtering.
Neural network	-	It has a wide range of application scenarios, such as speech recognition, image recognition, and machine translation. It is a standard supervised learning algorithm in the domain of pattern recognition, and continues to be a research subject in the domain of computational neuroscience. The multilayer perceptron (MLP) has been proved to be a universal function approximator, which can be used to fit complex functions or solve classification problems.
k-nearest neighbors	-	In k-nearest neighbor classification, the distance between each training sample and the sample to be classified is computed, and the k training samples closest to that sample are selected. The sample is assigned to the category that holds the majority among those k samples. It can be used for text recognition, facial recognition, gene pattern recognition, customer churn prediction, and fraud detection.
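
The k-nearest neighbors procedure described in Table 2 (compute distances, keep the k closest training samples, take the majority category) can be sketched in a few lines of Python. This is an illustration only, not the DeepSQL or MADlib implementation. The function name `knn_classify` and the choice of Euclidean distance are assumptions.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training samples.

    train: list of (feature_tuple, label) pairs (hypothetical layout).
    Distance is Euclidean, which is an assumption of this sketch.
    """
    # Sort training samples by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels among the k neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        ((0.0, 0.0), "a"), ((0.0, 1.0), "a"),
        ((5.0, 5.0), "b"), ((6.0, 5.0), "b"), ((5.0, 6.0), "b"),
    ]
    print(knn_classify(samples, (0.2, 0.1), k=3))
    print(knn_classify(samples, (5.0, 5.0), k=3))
```

Using an odd k with two categories avoids tied votes. How ties are broken otherwise is a design choice this sketch does not address.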

Table 3 Supported machine learning algorithms: data processing algorithms

Algorithm Name	Abbreviation	Application Scenario
Array operation	-	Array and vector operations, including basic addition, subtraction, multiplication, and division, as well as exponentiation, root extraction, cos, sin, absolute value, and variance.
Principal component analysis for dimensionality reduction	PCA	This is used to reduce dimensions and compute the principal components.
Encoding categorical variable	-	Currently, one-hot and dummy encoding are supported. Dummy encoding is typically used when a specific group of predictor variables needs to be compared with another group, which is called the reference group. One-hot encoding is similar to dummy encoding. The difference is that one-hot encoding creates a numeric 0/1 indicator column for every category value, so that in each row of data (corresponding to one data point) exactly one of the indicator columns is 1.
Matrix operation	-	Matrix decomposition factors a large matrix into a product of simpler matrices, which can greatly reduce the difficulty and volume of computation. Supported operations include matrix addition, subtraction, multiplication, and division, extremum, mean, and rank calculation, inversion, matrix decomposition (QR, LU, Cholesky), and eigenvalue extraction.
Norms and distance functions	-	This is used to compute the norm, cosine similarity, and distance between vectors.
Sparse vector	-	This is used to implement the sparse vector type. If there are a large number of repeated values in the vector, the vector can be compressed to save space.
Pivot	-	Pivot tables are used to meet common row and column transposition requirements in OLAP or report systems. The pivot function can perform basic row-to-column conversion on data stored in a table and output the aggregation result to another table. It makes row and column conversion easier and more flexible.
Path	-	It performs regular pattern matching on a series of rows and extracts useful information about pattern matching. The useful information can be a simple match count or something more involved, such as an aggregate or window function.
Sessionize	-	The sessionize function performs time-oriented session rebuilding on a dataset that includes an event sequence. The defined inactive period indicates the end of a session and the start of the next session. It can be used for network analysis, network security, manufacturing, finance, and operation analysis.
Conjugate gradient	-	A method for numerically solving systems of linear equations whose coefficient matrices are symmetric positive definite.
Stemming	-	Stemming reduces a word to its stem. It can be used, for example, to build a topic-focused search engine. The optimization effect is obvious on English websites and can serve as a reference for websites in other languages.
Train-Test Split	-	It is used to split a dataset into a training set and a test set. The training set is used for training, and the test set is used for evaluation.
Cross validation	-	It is used to perform cross validation.
Prediction metric	-	It is used to evaluate the quality of model prediction, including the mean squared error, AUC value, confusion matrix, and adjusted R-squared.
Mini-batch preprocessor	-	It is used to pack the data into small batches for training. The advantage is that the performance is better than that of stochastic gradient descent (the default MADlib optimizer), and the convergence is faster and smoother.
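
The difference between the one-hot and dummy encodings listed in Table 3 is easiest to see on a small column. The sketch below is plain Python written for illustration, not the DeepSQL or MADlib encoder. The function names `one_hot_encode` and `dummy_encode` and the sorted column order are assumptions.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def dummy_encode(values, reference):
    """Like one-hot encoding, but the reference category gets no column.

    Rows that belong to the reference group are encoded as all zeros.
    """
    categories = [c for c in sorted(set(values)) if c != reference]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


if __name__ == "__main__":
    colors = ["red", "green", "blue", "green"]
    print(one_hot_encode(colors))
    print(dummy_encode(colors, reference="blue"))
```

In the one-hot output every row contains exactly one 1. In the dummy output the row for the reference category is all zeros, so the remaining columns are read as contrasts against that reference group.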

Table 4 Supported machine learning algorithms: graph

Algorithm Name	Abbreviation	Application Scenario
All pairs shortest path	APSP	APSP finds the length (summed weight) of the shortest path between every pair of vertices, that is, the path that minimizes the sum of the edge weights.
Breadth-first search	-	This algorithm traverses the graph level by level, starting from a source vertex.
Hyperlink-induced topic search	HITS	HITS outputs the authority score and hub score of each vertex, where authority estimates the value of the content of the page and hub estimates the value of its links to other pages.
Average path length	-	This function computes the average value of the shortest paths between each pair of vertices. The average path length is based on the "reachable target vertices", so it ignores infinite-length paths between unconnected vertices.
Closeness centrality	-	The closeness measures are the inverse of the sum, the inverse of the average, and the sum of inverses of the shortest distances to all reachable target vertices (excluding the source vertex).
Graph diameter	-	The diameter is defined as the longest of all the shortest paths in the graph.
In-Out degree	-	This algorithm computes the in-degree and out-degree of each node. The in-degree of a node is the number of edges pointing into the node, and the out-degree is the number of edges pointing out of the node.
PageRank	-	Given a graph, the PageRank algorithm outputs a probability distribution representing the likelihood that a person randomly traversing the graph will arrive at any particular vertex.
Single source shortest path	SSSP	Given a graph and a source vertex, the SSSP algorithm finds a path from the source vertex to every other vertex in the graph such that the sum of the weights of the path edges is minimized.
Weakly connected component	WCC	Given a directed graph, a weakly connected component (WCC) is a subgraph of the original graph in which all vertices are connected to each other by a path, ignoring the direction of the edges. In the case of an undirected graph, a WCC is also a strongly connected component. This module also includes many auxiliary functions that run on the WCC output.
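
Several rows in Table 4 (breadth-first search, average path length, graph diameter) build on shortest-path distances. The following Python sketch shows how they relate on a small directed graph. It is an illustration under stated assumptions, not DeepSQL or MADlib code: every edge has weight 1, the graph is an adjacency dictionary, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return {vertex: hop count} for every vertex reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:  # first visit gives the shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_average_path_length(graph):
    """Return (diameter, average shortest path length) over reachable pairs.

    Unreachable pairs are skipped, so infinite-length paths are ignored.
    """
    lengths = []
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                lengths.append(d)
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)


if __name__ == "__main__":
    g = {"a": ["b"], "b": ["c"], "c": [], "d": []}
    print(diameter_and_average_path_length(g))
```

In this example the reachable pairs are a to b, a to c, and b to c. The isolated vertex d contributes nothing, which mirrors the "reachable target vertices" rule in the average path length row.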

Table 5 Supported machine learning algorithms: time series

Algorithm Name	Abbreviation	Application Scenario
Autoregressive integrated moving average model	ARIMA	Time series forecasting, which is used to understand and forecast future values in the series. For example, international air traveler data can be used to forecast the number of passengers.

Table 6 Supported machine learning algorithms: sampling

Algorithm Name	Abbreviation	Application Scenario
Sample	-	Sampling.
Stratified sampling	-	Stratified random sampling, also known as type random sampling, divides the population units into types (or strata) according to a given criterion. The number of units to draw from each type is then determined by the ratio of that type's size to the total number of units. Finally, samples are drawn at random from each type.
Balanced sampling	-	Some classification algorithms only perform optimally when the number of samples in each class is roughly the same. Highly skewed datasets are common in many domains (such as fraud detection), so resampling to offset this imbalance can produce better decision boundaries.
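
The three steps of stratified sampling in Table 6 (group by type, allocate in proportion to group size, draw at random within each group) can be sketched as follows. This is plain Python for illustration, not the DeepSQL or MADlib sampling function. The name `stratified_sample`, the `fraction` parameter, and the fixed `seed` are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows from every stratum.

    rows: list of records; key: function mapping a record to its stratum.
    The per-stratum sample size is rounded to the nearest integer.
    """
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)  # step 1: divide the units into strata

    sample = []
    for label in sorted(strata):
        members = strata[label]
        n = round(len(members) * fraction)     # step 2: proportional allocation
        sample.extend(rng.sample(members, n))  # step 3: random draw per stratum
    return sample


if __name__ == "__main__":
    data = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
    picked = stratified_sample(data, key=lambda r: r[0], fraction=0.1)
    print(len(picked))
```

Because every stratum is sampled at the same rate, the sample keeps the class proportions of the original data. Balanced sampling, by contrast, deliberately changes those proportions.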

Table 7 Supported machine learning algorithms: statistics

Algorithm Name	Abbreviation	Application Scenario
Summary	-	This algorithm generates summary statistics for any data table.
Correlation and covariance	-	Descriptive statistics. One function computes the Pearson correlation coefficient and the other outputs the covariance. These statistics characterize the data so that the data to be mined can be better understood.
CountMin (Cormode-Muthukrishnan)	-	This algorithm estimates the occurrence frequency of an element in a real-time data stream and can respond with that frequency at any time. Exact counting is not required.
Flajolet-Martin	FM	This algorithm estimates the number of distinct values in a specified column, that is, the number of unique values in the set.
Most frequent values	MFV	This algorithm is used to compute frequent values.
Hypothesis test	-	This module includes hypothesis tests such as the F-test and the chi-square test.
Probability functions	-	The probability functions module provides cumulative distribution, density/mass, and quantile functions for various probability distributions.
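
The CountMin row in Table 7 describes approximate frequency counting on a stream. The sketch below shows the general idea in plain Python. It is not the DeepSQL or MADlib implementation: the class name, the default `width` and `depth`, and the use of SHA-256 as the hash family are assumptions chosen to keep the example deterministic.

```python
import hashlib


class CountMinSketch:
    """A small Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row id.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for element in ["a"] * 5 + ["b"] * 3 + ["c"]:
        sketch.add(element)
    print(sketch.estimate("a"), sketch.estimate("b"), sketch.estimate("c"))
```

The estimate never falls below the true count and can exceed it only when hash collisions occur, which is why the table describes the result as not requiring exact counting. Memory use depends on `width` and `depth`, not on the number of distinct elements.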

Table 8 Supported machine learning algorithms: other algorithms

Algorithm Name	Abbreviation	Application Scenario
k-means clustering	-	This algorithm is used in the clustering scenario.
Latent Dirichlet allocation	LDA	LDA is a widely used topic model and is often used for text classification.
Apriori algorithm	-	The Apriori algorithm is used to discover associations between itemsets, such as the typical "beer and diaper" association.
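
The "beer and diaper" association in Table 8 comes from frequent itemset mining. The following Python sketch shows the level-wise Apriori search on a toy basket dataset. It is an illustration only, not the DeepSQL or MADlib association rules function. The function name `apriori`, the `min_support` parameter, and the sample transactions are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    total = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= basket for basket in baskets) / total

    items = {item for basket in baskets for item in basket}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


if __name__ == "__main__":
    baskets = [
        {"beer", "diapers", "milk"},
        {"beer", "diapers"},
        {"diapers", "milk"},
        {"beer", "diapers", "bread"},
    ]
    for itemset, sup in sorted(apriori(baskets, 0.5).items(), key=lambda kv: -kv[1]):
        print(sorted(itemset), sup)
```

The prune step is what makes the search tractable: an itemset can be frequent only if all of its subsets are frequent, so most candidates are discarded without scanning the data.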