Authors:
(1) Limeng Zhang, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia;
(2) M. Ali Babar, Centre for Research on Engineering Software Technologies (CREST), The University of Adelaide, Australia.
4 WORKLOAD CHARACTERIZATION
The initial step in the tuning system involves modeling the characteristics of the target workload so that the tuning algorithm can acquire knowledge from historical data, thereby accelerating the tuning process or broadening the search toward optimal configurations. In this section, we present two main approaches for modeling DBMS workloads: query-level characterization, based on the queries that compose a workload, and runtime-based characterization, based on DBMS runtime statistics collected while the instance executes.
4.1 Query-level Characterization
Query-level characterization aims to model a workload from its constituent queries [3]. A DBMS allows users and other applications to interact with the database by formulating and executing queries (e.g., SELECT, INSERT, UPDATE, DELETE) tailored to manipulate and retrieve information and accomplish various tasks. Hence, researchers have endeavored to model a workload by analyzing its query composition and using query characteristics to represent it [3]. Existing work on query-level characterization can be summarized as follows:
• Features from the query text: In a DBMS workload, each query performs a certain type of operation within the DBMS. Hence, some researchers characterize a workload by analyzing its static query text and then merging the per-query vectors to represent the workload. ResTune [6] analyzes the cost of a MySQL workload by examining the distribution of queries within each workload based on their reserved SQL keywords (e.g., SELECT, UPDATE, DISTINCT). It first identifies the reserved operation keywords in each query and then computes their TF-IDF (term frequency-inverse document frequency) values. These TF-IDF feature vectors serve as input to a random forest classifier, which outputs a predicted probability distribution over predefined resource-cost levels for each query. Finally, the probability distributions of all queries in the workload are averaged to obtain the workload's cost metafeature (a minimal sketch of this pipeline appears after this list). LOCAT [7] characterizes a Spark SQL workload directly by its query composition; it filters out unimportant queries and uses the remaining important queries to represent the workload when evaluating the performance of a recommended configuration. ONLINETUNE [3] characterizes a workload by its query arrival rate and its query composition: it encodes each query with a standard LSTM encoder-decoder network and then averages the query encodings to obtain the workload's composition feature.
• Features from the query plan: In addition to the static query text, some researchers model a query by examining its execution process. When a query is submitted to the DBMS, the system first parses it to check its syntax and semantics; the query is then compiled into an execution plan, which outlines the steps needed to execute it efficiently. Some researchers represent a query by parsing its query plan to aggregate key features, including the costs or categories of operators, scanned tables, and predicates, and then featurize the workload with the resulting query vectors. In QTune [9], each SQL query is represented with features including its query type (e.g., insert, delete, select, update), the tables involved, and the estimated processing cost derived from its inherent operations (e.g., scan, hash join, aggregate). QTune then merges all query vectors into a unified workload vector according to predefined merging strategies for each part (see the plan-based sketch after this list).
• Features from other query-related factors: In addition to the aforementioned query features, some studies treat the underlying data distribution in the database as an informative factor, based on the observation that only data changes affecting the workload's queries are relevant to the tuning policy. Building on this premise, ONLINETUNE [3] extracts the database's data distribution from the DBMS optimizer's insights, which include the estimated number of rows examined by queries, the proportion of rows filtered by table conditions, and the utilization of indexes. It computes the average of these three query features to derive the data distribution associated with a workload and, finally, concatenates this data feature with the feature extracted from the query text to obtain the final contextual feature (see the last sketch after this list).
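To make the keyword-based approach concrete, the following is a minimal, self-contained sketch of the TF-IDF-plus-random-forest pipeline described for ResTune [6]. The keyword vocabulary, toy queries, and cost labels are hypothetical stand-ins, not ResTune's actual data or code; only the overall pipeline follows the description above.

```python
# Hedged sketch of keyword-TF-IDF workload featurization (ResTune-style).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed keyword vocabulary; the real system's reserved-keyword set may differ.
SQL_KEYWORDS = ["select", "insert", "update", "delete",
                "distinct", "join", "where", "limit"]

def keyword_doc(query: str) -> str:
    """Reduce a query to the reserved keywords it contains.
    Naive substring matching; a real parser would tokenize properly."""
    text = query.lower()
    return " ".join(kw for kw in SQL_KEYWORDS if kw in text)

# Hypothetical labeled history: queries with known resource-cost levels
# (0 = low, 1 = medium, 2 = high).
history = [
    ("SELECT DISTINCT a FROM t1 JOIN t2 ON t1.id = t2.id", 2),
    ("SELECT a FROM t1 WHERE id = 5 LIMIT 1",              0),
    ("UPDATE t1 SET a = 1 WHERE id = 7",                   1),
    ("INSERT INTO t1 VALUES (1, 2)",                       0),
    ("DELETE FROM t1 WHERE a > 100",                       1),
]
queries, cost_levels = zip(*history)

# TF-IDF over the reserved-keyword "documents", one document per query.
vectorizer = TfidfVectorizer(vocabulary=SQL_KEYWORDS)
X = vectorizer.fit_transform([keyword_doc(q) for q in queries])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, cost_levels)

def workload_metafeature(workload: list) -> np.ndarray:
    """Average the per-query predicted cost distributions over a workload."""
    Xw = vectorizer.transform([keyword_doc(q) for q in workload])
    return clf.predict_proba(Xw).mean(axis=0)

print(workload_metafeature(["SELECT a FROM t1 WHERE id = 3",
                            "UPDATE t1 SET a = 2 WHERE id = 4"]))
```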
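The next sketch illustrates plan-based featurization in the spirit of QTune [9]. A real system would parse the optimizer's plan tree (e.g., EXPLAIN output); here, the query type, table set, and per-operator costs are passed in as hypothetical inputs, and the merging strategy (union for one-hot parts, sum for costs) is one plausible choice rather than QTune's exact rule.

```python
# Hedged sketch of plan-derived query vectors and workload merging.
import numpy as np

QUERY_TYPES = ["select", "insert", "update", "delete"]
TABLES = ["customer", "orders", "lineitem"]   # assumed schema tables
OPERATORS = ["seq_scan", "index_scan", "hash_join", "aggregate", "sort"]

def query_vector(qtype: str, tables: set, op_costs: dict) -> np.ndarray:
    """Concatenate one-hot query type, one-hot tables, per-operator costs."""
    v_type = [1.0 if qtype == t else 0.0 for t in QUERY_TYPES]
    v_tables = [1.0 if t in tables else 0.0 for t in TABLES]
    v_cost = [op_costs.get(op, 0.0) for op in OPERATORS]
    return np.array(v_type + v_tables + v_cost)

def merge_workload(query_vecs: list) -> np.ndarray:
    """One possible per-part merging strategy: max (union) over the
    one-hot parts, sum over the cost parts."""
    stacked = np.stack(query_vecs)
    k = len(QUERY_TYPES) + len(TABLES)
    return np.concatenate([stacked[:, :k].max(axis=0),
                           stacked[:, k:].sum(axis=0)])

q1 = query_vector("select", {"orders", "lineitem"},
                  {"seq_scan": 120.0, "hash_join": 300.0, "aggregate": 40.0})
q2 = query_vector("update", {"customer"}, {"index_scan": 8.0})
print(merge_workload([q1, q2]))
```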
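Finally, a brief sketch of the data-distribution context described for ONLINETUNE [3]. In practice the per-query statistics would come from the DBMS optimizer (e.g., EXPLAIN estimates); the field names and numbers below are hypothetical, as is the two-dimensional workload feature that stands in for the averaged query encodings.

```python
# Hedged sketch of optimizer-derived data-distribution features.
import numpy as np

def data_distribution_feature(optimizer_stats: list) -> np.ndarray:
    """Average three optimizer-derived signals over the workload's queries:
    estimated rows examined, fraction of rows filtered, and index usage."""
    rows = np.array([[s["rows_examined"], s["filtered_ratio"], s["uses_index"]]
                     for s in optimizer_stats], dtype=float)
    return rows.mean(axis=0)

# Hypothetical optimizer estimates for a two-query workload.
stats = [
    {"rows_examined": 1.2e4, "filtered_ratio": 0.10, "uses_index": 1},
    {"rows_examined": 3.0e2, "filtered_ratio": 0.85, "uses_index": 0},
]
workload_feature = np.array([0.3, 0.7])   # stand-in for averaged encodings
# Concatenate query-text and data-distribution features into the final context.
context = np.concatenate([workload_feature, data_distribution_feature(stats)])
print(context)
```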
This paper is available on arxiv under CC BY 4.0 DEED.