
Learning Analysis Behavior in SQL Workloads

Clement Moreau

Veronika Peralta

University of Tours, Blois, France

This paper presents a set of analyses aimed at better understanding the SQLShare workload [13] and learning users' analysis behavior. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [13], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. In this paper we analyze this workload by comparing users' explorations (sequences of queries) and looking for common SQL operations performed by users during data analysis. We use a clustering algorithm to retrieve groups of similar explorations, and we analyze the obtained clusters through many statistical and visual indicators that explain the analysis patterns inside clusters. To our knowledge, this is the first attempt to characterize human analysis behavior in SQL workloads.

1 INTRODUCTION

The analysis of database workloads to support Interactive Database Exploration (IDE) [ 12 ] is receiving increasing interest because of its many practical applications, from the monitoring of database physical access structures [ 5 ] to the generation of user-tailored collaborative query recommendations for interactive exploration [ 8, 21 ].

Characterising user behavior while analysing data, i.e. learning the way users analyse data (the type and order of operations, the level of detail, the degree of focus), is a step forward in the understanding of analysis activities and offers new applications, for instance to understand users’ information needs, to identify "struggling" during the exploration, or to provide better query recommendations. Notably, IDE systems usually do not offer such facilities. The prediction of the next analysis steps is particularly interesting, as it enables the execution of probable queries beforehand and the caching of their results, as well as advanced optimization strategies. Finally, we mention the detection of clandestine intentions [ 2 ] as another potential benefit. Indeed, as reported by [ 2 ], query sequences may reflect such intentions, where users prefer to obtain information by means of sequences of smaller, less conspicuous queries, avoiding direct queries which may disclose their true interests. The identification of typical analysis patterns may help distinguish normal from clandestine intentions.

In this paper we deal with the identification of analysis patterns in a log of SQL explorations devised by real users. We consider that an exploration is a coherent sequence of queries over a database schema, done by a user with the goal of fulfilling an information need. We experiment on the SQLShare workload of hand-written¹ queries over user-uploaded datasets [ 13 ]. In particular, we use the segmentation of the SQLShare workload into coherent explorations proposed in [ 29 ].

¹ Consistently with the authors of [ 13 ], we use the term hand-written to mean, in this context, that the query is introduced manually by a human user, which reflects genuine interactive human activity over a dataset, with consideration between two consecutive queries.

Some previous works consider analysis patterns within OLAP explorations. In [ 30 ], Rizzi and Gallinucci described 4 recurrent types of user analyses and proposed a tool for generating realistic explorations based on these usage types. In [ 24 ], we clustered together explorations showing similar analysis patterns, learning 11 analysis patterns from OLAP workloads devised by students and expert analysts.

The idea behind analysis patterns is to look for sequences of common operations performed together when analysing data, as some kind of movements in a data space. From this point of view, OLAP operations (e.g. drilling down, adding a filter, changing a measure) are first-class citizens, while the actual analyzed data is less important. For example, we can retain that a user performed a sequence of drill-downs, disregarding the dimension that was drilled down or the semantics of the underlying data. Explorations are compared in such terms, i.e. to what extent they share the same sequences of operations and evolve at the same level of aggregation and filtering.

Transposing such an approach to regular, non-multidimensional SQL workloads raises many challenges. Even if a sequence of SQL queries is issued to explore the database content, non-multidimensional relational schemata do not have the regularities one expects from the multidimensional model, explorations may not be expressed through roll-up or drill-down operations, SQL queries may deviate from the traditional star-join pattern commonly used for analytical purposes, etc.

In this paper we present an extension of our previous work [ 24 ] for learning analysis patterns in SQL workloads. In particular, we reuse and extend similarity measures for comparing queries and explorations, and pair them with clustering algorithms. Contrary to [ 24 ], which uses hierarchical clustering, we combine the UMAP and DBSCAN methods, which proved to be well adapted to complex sequences [ 22 ]. The obtained clusters are then analyzed using several statistical and visual indicators, which allow us to characterize the analysis behavior in each cluster.

Our contributions include: (i) a representation of queries and explorations in the space of SQL operations, including similarity functions tailored for SQL queries and explorations (described in Section 3), (ii) a statistical analysis of the SQLShare workload in terms of queries and operations (Section 4), (iii) a proposal for clustering SQL explorations (Section 5), and (iv) a large analysis of the obtained clusters, via many complementary statistical and visual indicators, revealing several (common and specific) patterns of users’ analysis behavior (Section 5).

2 RELATED WORK

In this section we briefly describe the SQLShare workload and we present related work concerning workload analysis and indicators tailored for analyzing sequences. Finally, we discuss clustering algorithms adapted to sequences.

2.1 The SQLShare workload

The SQLShare workload is the result of a Multi-Year SQL-as-a-Service Experiment [ 13 ], allowing any user with minimal database experience to upload their datasets on-line and manipulate them via SQL queries. With this experiment, the authors wanted to show that SQL is beneficial for data scientists. They observed that most of the time people use scripts to modify or visualize their datasets instead of using the SQL paradigm. Indeed, most user needs may be satisfied by first-order queries, which are much simpler than a script but have the initial cost of creating a schema, importing the data and so on. SQL-as-a-Service frees the user of all this prior work with a relaxed SQL version.

The SQLShare workload is composed of 11,137 SQL statements, 57 users and 3,336 users’ datasets. To the best of our knowledge, as reported by the authors of [ 13 ], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. As indicated in the introduction, hand-written means that the query is introduced manually by a human user, which reflects genuine interactive human activity over a dataset, with consideration between two consecutive queries.

The SQLShare workload is analyzed in [ 13 ], particularly to verify the following assumption: "We hypothesized that SQLShare users would write queries that are more complex individually and more diverse as a set, making the corpus more useful for designing new systems." The authors showed empirically that the queries in the SQLShare workload are complex and diverse. They also analyzed the churn rate of SQLShare users and concluded that most users exhibit a behavior that suggests an exploratory workload.

Other SQL workloads, such as SDSS [ 32 ] or REACT-IDA [ 21 ], include SQL queries generated with specific GUIs or applications. Although generated SQL queries are less rich than hand-written ones [ 13 ], the approach presented in this paper can be applied to these workloads, and to smaller ones, such as those presented in [ 17 ].

2.2 Workload analysis

Other scientific domains close to databases, like Information Retrieval or Web Search, have a long tradition of log analysis aiming at facilitating the searcher’s task [ 37 ]. Many works extract features from queries or search sessions and use them to disambiguate the session’s goal, to generate recommendations, to detect struggling in sessions, etc. Since databases tend to be increasingly used in an exploratory or analysis fashion, as evidenced by the SQLShare workload, it is not a surprise that many recent works pay attention to the analysis of database workloads, in addition to those works analyzing workloads for optimization or self-tuning purposes. We present some recent advances in this area, differentiating by the type of logs (OLAP logs and SQL logs).

Analyzing OLAP explorations. Logs of OLAP analyses are simpler than SQL ones in the sense that they feature multidimensional queries that can easily be interpreted in terms of OLAP primitives (roll-up, drill-down, slice-and-dice, etc.). In one of our previous works [ 31 ], we proposed an approach for detecting OLAP analyses phrased in SQL, by converting SQL queries into OLAP queries and then checking if two consecutive queries are sufficiently close in terms of OLAP operations. Later, we used supervised learning to identify a set of query features that characterize focus zones in OLAP explorations [ 7 ], or that identify the queries that better contribute to an exploration [ 6 ].

In a more recent work [ 24 ], we analyzed OLAP workloads devised by students and expert analysts, looking for sequences of common operations performed together when analysing data. We identified 11 analysis patterns corresponding to different analysis behavior. Examples include focused explorations (which regularly increase the level of detail and filtering by adding drill-downs and filters), oscillating explorations (alternating drill-downs and roll-ups with few filters), short explorations with few operations, and even explorations with repeated queries. We used a hierarchical clustering algorithm, paired with a Contextual Edit Distance [ 23 ], to cluster explorations representing the same behavior.

The present work is a continuation of our previous work, in particular [ 24 ]. The main differences are that we make no assumption about the type of queries in the workload (particularly, they may not be multidimensional queries), and we have no ground truth (i.e., no human manual inspection of each query) on the workload.

Analyzing SQL logs. SQL workload analysis has recently attracted attention beyond query optimization, for instance for query recommendation [ 8 ], query autocompletion [ 16 ], or user interest discovery [ 26 ]. All these works use the SDSS workload for their tests. In [ 21 ], Milo and Somech identify and generalize relevant previous sessions, in the REACT-IDA workload, to generate personalized next-action suggestions to the user. In [ 33 ], they study interestingness measures for mining workloads. Embedded SQL code is analyzed in [ 34 ] to measure its quality, mainly for maintainability purposes. The authors quantify quality based on the number of operators (joins, unions), operands (tables, subqueries) and variables in the SQL code, experimenting with SQL code embedded in PL/SQL, COBOL and Visual Basic.

Jain et al. ran a number of tests on the SQLShare workload [ 13 ], some of them being reported above, showing the diversity and complexity of the workload. In [ 35 ], Vashistha and Jain analyze the complexity of queries in the SQLShare workload, in terms of the following query features: number of tables, number of columns, query length in characters, number of operators (Scan, Join, Filter), number of comparison operators (LE, LIKE, GT, OR, AND, Count), and the query run-time. They define two complexity metrics from these features: the Halstead measure (traditionally used to measure program complexity) and a linear combination whose weights are learned using regression.

Finally, a recent work investigated various similarity metrics over SQL queries, aiming at clustering queries [ 17 ] for better workload understanding. Queries are issued separately, not within explorations, and are compared in terms of query structure, not in terms of SQL operations w.r.t. previous queries. Thus, they capture users' interests (e.g. which attributes are projected), not the way users navigate among data. The authors run their tests on smaller SQL workloads.

To our knowledge, this is the first attempt to learn human analysis behavior in SQL workloads.

2.3 Indicators for sequence analysis

Other research communities, in particular mobility science, study human behavior represented as sequences of actions. Data exploration can be viewed through the prism of mobility science [ 11 ]. Indeed, an exploration is a sequence of user’s queries, where the movement is no longer conducted in space but in the data space.

Thus, many indicators proposed for the analysis of mobility sequences can be reused or adapted for the study of sequences of queries. Mobility researchers explored sequences of activities and tested the existence of simple universal rules underlying human movement, like travel distance, top-ranked visited locations, predictability of human activity and origin-destination flows, mainly studying recurring patterns/regularity in the sequence or clustering mobility behavior ([ 4 ] presents an important survey). In substance, results show that mobility is strongly characterized by exponential distribution (e.g. heavy-tailed, Zipf) and that people constantly exploit a small set of repeatedly visited locations.

[Table 1 (fragment). Techniques: Length distribution; State distribution]

Inspired by these considerations, we propose to adapt a set of indicators from mobility mining to analyse data explorations. These complementary techniques, summarized in Table 1, highlight different aspects of explorations.

This capacity to explain models, both for practical and ethical issues, is a crucial point for the understanding of machine learning models. With this aim in mind, Guidotti et al. [ 10 ] suggested some techniques, partially borrowed from those above, like statistical methods and prototype selection elements, to explain black box systems in order to make their results more interpretable and understandable. In line with the vision of these techniques, we believe that the elaboration of indicators is essential to understand and explain discovered behavior in complex clusters.

2.4 Clustering methods

The extraction of behavior from a dataset is a process usually performed thanks to unsupervised machine learning. Indeed, clustering methods are widely used for the discovery of human behavior in datasets representing sequences of elements, in particular in sequences of mobility [ 14, 23, 27 ].

Clustering methods are based on similarity measures. A pairwise comparison of sequences results in a distance matrix that is the input of the clustering process. Many methods have been proposed for computing the similarity of categorical sequences. Most of the approaches are based on Optimal Matching (OM) methods [ 1 ]; typical measures include those of the Edit Distance family (see [ 22 ] for a review of methods and similarity measures). In particular, the Contextual Edit Distance (CED) [ 23 ] is a generalization of Edit Distance, conceived for the comparison of semantic sequences (an overview of the CED measure is given in Subsection 3.3).

However, the topology created by similarity measures for sequences is hard to apprehend. In particular, for OM methods, spaces are often neither Euclidean nor metric. To the best of our knowledge, the clustering algorithms able to deal with arbitrary distances (not necessarily metrics) are PAM [ 28 ] (or K-medoid), hierarchical clustering [ 15 ], density clustering (DBSCAN [ 9 ], OPTICS [ 3 ]) and spectral clustering [ 25 ], each one making different hypotheses about cluster topology.

Depending on the similarity measure and the representation of the sequences, dimensionality reduction methods can be used in order to extract primary dimensions [ 14 ]. However, commonly used methods like PCA can only be used for Euclidean spaces in practice. Alternatively, methods like UMAP [ 20 ] allow the reduction of a complex topology defined by an arbitrary metric into a low-dimensional Euclidean space, which facilitates the visualisation of clustering results and enables the usage of other clustering methods, in particular those requiring a Euclidean space, like K-means [ 19 ]. In addition, UMAP offers a better preservation of the global structure of the data, fewer hyperparameters to tune and better speed than previous techniques like t-SNE [ 18 ].

In [ 22 ], we empirically compared several clustering methods and similarity measures, in order to find the ones best adapted to sequences of semantic elements. The combination of the CED measure and UMAP reduction, paired with K-means, Spectral or DBSCAN algorithms, outperformed all other combinations of methods.

3 EXPLORATION MODEL

This section introduces the description of queries and explorations used throughout the paper, as well as their representation in a space of SQL operations.

The SQLShare workload contains 11,137 SQL statements, among which 10,668 correspond to SELECT statements. The remaining statements (mainly updates, inserts and deletes) were filtered out. This workload was segmented into 2,809 explorations containing between 1 and 98 queries [ 29 ].

3.1 Query and exploration abstractions

In what follows, we use the term query to denote the text of an SQL SELECT statement. In [ 29 ], queries are represented as a collection of fragments extracted from the query text, namely, projections, selections, aggregations and tables. We extend such representation adding group by and order by sets. These fragments abstract the most descriptive parts of a SQL query, and are the most used in the literature (see e.g., [ 8, 16, 21 ]). But note that we do not restrict to SPJG (selection-projection-join-group) queries. Indeed, we consider all queries in the SQLShare workload, some of them containing arbitrarily complex chains of subqueries.

Definition 3.1 (Query). A query over database schema $D$ is a 6-tuple $q = \langle P, S, A, T, G, O \rangle$ where:
(1) $P$ is a set of expressions (attributes or calculated expressions) appearing in the main SELECT clause (i.e. the outermost projection). We deal with the * wildcard by replacing it with the list of attributes it references.
(2) $S$ is a set of atomic Boolean predicates, whose combination (conjunction, disjunction, etc.) defines the WHERE and HAVING clauses appearing in the query. We consider indistinctly all predicates appearing in the outermost statement as well as in inner sub-queries.
(3) $A$ is a set of aggregation expressions appearing in the main SELECT clause (i.e. the outermost projection).
(4) $T$ is a set of tables appearing in FROM clauses (outermost statement and inner sub-queries). Views, sub-queries and other expressions appearing in FROM clauses are parsed in order to obtain the referenced tables.
(5) $G$ is a set of expressions appearing in GROUP BY clauses (outermost statement and inner sub-queries).
(6) $O$ is a set of expressions appearing in ORDER BY clauses (outermost statement and inner sub-queries).

Note that although we consider tables, selections, group by sets and order by sets occurring in inner sub-queries, we limit ourselves to the outermost queries for projections and aggregations, as they correspond to attributes actually visualized by the user. We intentionally remain independent of presentation and optimization aspects, especially the order in which attributes are projected (and visualized by the user), the order in which tables are joined, etc. All the queries we consider are assumed to be well formed, so we do not deal with query errors.

Finally, an exploration is a sequence of queries of a user.

Definition 3.2 (Exploration). Let $D$ be a database schema. An exploration $e = \langle q_1, \ldots, q_n \rangle$ over $D$ is a sequence of queries over $D$. We write $q \in e$ if a query $q$ appears in the exploration $e$, and $e(q)$ to refer to the exploration where $q$ appears.

3.2 Query features

For each query, we extract a set of simple features computed from the query text and its relationship with the previous query in an exploration. The set of features is inspired by our previous work [ 6, 7, 24 ], which models OLAP queries as a set of features capturing typical OLAP navigation. It intends to capture the set of SQL operations that express one query w.r.t. the previous one (e.g. adding a projection, which means that a query projects an additional attribute w.r.t. the previous one).

Table 2 presents the considered features, where added (resp., deleted) indicates the modification made compared to the previous query. In their definitions, let $q_k = \langle P_k, S_k, A_k, T_k, G_k, O_k \rangle$ be the query occurring at position $k$ in the exploration $e$ over an instance of schema $D$. Features are computed by comparing the query $q_k$ to the previous query in the exploration $e$, $q_{k-1} = \langle P_{k-1}, S_{k-1}, A_{k-1}, T_{k-1}, G_{k-1}, O_{k-1} \rangle$. For the first query of $e$, i.e. $q_1$, we consider as predecessor the "empty" query $q_0 = \langle \emptyset, \emptyset, \emptyset, \emptyset, \emptyset, \emptyset \rangle$. All the features are defined for $k \geq 1$.

In what follows, we represent a SQL query in the space of query features, i.e. as a 12-dimensional vector, each position corresponding to one of the features $f_1, \ldots, f_{12}$ described in Table 2. This representation is at the core of our proposal for computing the similarity between queries. It focuses on operations between queries and is independent of the underlying database, i.e. a given sequence of operations, even on different databases, will result in the same sequence of query vectors.

Definition 3.3 (Query vector). Let $q$ be a query and $q'$ its predecessor in an exploration. A query vector is a 12-dimensional vector $v = \langle v_1, \ldots, v_{12} \rangle$ where $v_i = f_i(q, q')$.

Example 1. Consider an exploration $e_1$ composed of 4 queries:
$q_1$: SELECT species FROM All3col;
$q_2$: SELECT species FROM All3col WHERE longitude < 0;
$q_3$: SELECT species, longitude, latitude FROM All3col;
$q_4$: SELECT species, longitude FROM All3col ORDER BY species;

The vector for $q_1$, ⟨1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0⟩, indicates an added projection (species) and an added table (All3col) w.r.t. the empty query. The vectors for $q_2$, $q_3$ and $q_4$ indicate the differences w.r.t. their previous queries: an added selection (longitude < 0), 2 added projections (longitude, latitude) with a deleted selection, and 1 deleted projection with an added order by attribute (species): ⟨0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0⟩, ⟨2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0⟩, ⟨0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0⟩, resp. □
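The computation behind Example 1 can be sketched in Python. This is only an illustration under stated assumptions: each feature is taken to be the size of a set difference between the fragment sets of a query and those of its predecessor (which reproduces the vectors of Example 1), the names `Query`, `FRAGMENTS`, `query_vector` and `exploration_vectors` are hypothetical, and the extraction of fragments from SQL text is not shown.

```python
from dataclasses import dataclass

# Fragment types in the order used by the 12-dimensional vector:
# projections, selections, aggregations, tables, group by, order by.
FRAGMENTS = ("P", "S", "A", "T", "G", "O")


@dataclass(frozen=True)
class Query:
    """Hypothetical container for the six fragment sets of a query."""
    P: frozenset = frozenset()
    S: frozenset = frozenset()
    A: frozenset = frozenset()
    T: frozenset = frozenset()
    G: frozenset = frozenset()
    O: frozenset = frozenset()


EMPTY = Query()  # predecessor of the first query of an exploration


def query_vector(q, prev=EMPTY):
    """Return [+P, -P, +S, -S, +A, -A, +T, -T, +G, -G, +O, -O] for q w.r.t. prev.

    Assumption: each feature is the size of a set difference.
    """
    vec = []
    for name in FRAGMENTS:
        cur, old = getattr(q, name), getattr(prev, name)
        vec.append(len(cur - old))  # fragments added
        vec.append(len(old - cur))  # fragments deleted
    return vec


def exploration_vectors(queries):
    """Vectors of all queries of an exploration, each compared to its predecessor."""
    predecessors = [EMPTY] + list(queries[:-1])
    return [query_vector(q, p) for q, p in zip(queries, predecessors)]


# The four queries of Example 1, with fragments written by hand.
q1 = Query(P=frozenset({"species"}), T=frozenset({"All3col"}))
q2 = Query(P=frozenset({"species"}), S=frozenset({"longitude < 0"}),
           T=frozenset({"All3col"}))
q3 = Query(P=frozenset({"species", "longitude", "latitude"}),
           T=frozenset({"All3col"}))
q4 = Query(P=frozenset({"species", "longitude"}), T=frozenset({"All3col"}),
           O=frozenset({"species"}))

vectors = exploration_vectors([q1, q2, q3, q4])
for v in vectors:
    print(v)
```

Using sets keeps the sketch independent of the order in which attributes are projected or tables are joined, in line with the abstraction described in Subsection 3.1.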

As vectors are long and may have many 0-valued coordinates, we concisely represent them by listing the occurring operations (the ones not 0-valued) in the form "±nX", where X ∈ {P, S, A, T, G, O} and n ≥ 1 (omitted if 1). Letters and signs refer to features (as abbreviated in Table 2) and n represents the feature magnitude. For instance, the queries of Example 1 can be noted +P+T, +S, +2P-S, -P+O, resp.
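As an illustration, the concise notation can be produced from a 12-dimensional vector with a few lines of Python. The feature order (+P, -P, +S, -S, +A, -A, +T, -T, +G, -G, +O, -O) is assumed from Example 1, and the names `LABELS` and `concise` are hypothetical.

```python
# Feature order assumed from Example 1.
LABELS = ["+P", "-P", "+S", "-S", "+A", "-A", "+T", "-T", "+G", "-G", "+O", "-O"]


def concise(vector):
    """Render a 12-dimensional query vector in the concise notation.

    Zero-valued features are skipped and a magnitude of 1 is omitted.
    Rendering the null vector as the empty-set symbol is a choice of this sketch.
    """
    parts = []
    for label, value in zip(LABELS, vector):
        if value == 0:
            continue
        sign, letter = label[0], label[1]
        magnitude = str(value) if value > 1 else ""
        parts.append(f"{sign}{magnitude}{letter}")
    return "".join(parts) or "∅"


print(concise([2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # +2P-S
```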

Note that this representation captures both the richness of the exploration in terms of query fragments (projections, selections, etc.), captured by the vector of the first query in the exploration (which is compared to an empty query), and the differences among consecutive queries, captured by the vectors of the following queries. As a consequence, in some explorations, the norm of the first vector may be greater than those of the following ones. For instance, the vectors of exploration 9 of the SQLShare workload are: +10P+T, +T-T and +S+T-T.

Finally, in some analyses in Section 5, we focus on the type of operation, disregarding the magnitude (e.g. how many projections are concerned) and sign (addition or deletion). To this end, we define aggregated 6-dimensional vectors (one dimension per type of operation). Analogously, they can be concisely represented with letters P,S,A,T,G,O.

Definition 3.4 (Aggregated vector). Let $v = \langle v_1, \ldots, v_{12} \rangle$ be a query vector. An aggregated vector is a 6-dimensional vector $a = \langle a_1, \ldots, a_6 \rangle$ where $a_i = 1$ if $(v_{2i-1} > 0)$ or $(v_{2i} > 0)$, and $a_i = 0$ otherwise.
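A minimal sketch of Definition 3.4 in Python follows. The function names `aggregated_vector` and `aggregated_label` are hypothetical, and the indices are shifted because Python lists are 0-based.

```python
TYPES = "PSATGO"  # one letter per type of operation


def aggregated_vector(vector):
    """Collapse a 12-dimensional query vector into 6 binary flags (P, S, A, T, G, O)."""
    # Features come in (added, deleted) pairs: 0-based positions 2i and 2i+1
    # correspond to the 1-based positions 2i-1 and 2i of Definition 3.4.
    return [1 if vector[2 * i] > 0 or vector[2 * i + 1] > 0 else 0 for i in range(6)]


def aggregated_label(vector):
    """Concise letter form of the aggregated vector (empty string for a null vector)."""
    flags = aggregated_vector(vector)
    return "".join(letter for letter, flag in zip(TYPES, flags) if flag)


print(aggregated_vector([2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # [1, 1, 0, 0, 0, 0]
print(aggregated_label([2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))   # PS
```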

Example 2. Aggregated vectors for queries in Example 1 are: ⟨1, 0, 0, 1, 0, 0⟩, ⟨0, 1, 0, 0, 0, 0⟩, ⟨1, 1, 0, 0, 0, 0⟩, ⟨1, 0, 0, 0, 0, 1⟩, resp. In concise notation, they are written PT, S, PS and PO. □

3.3 Query and exploration similarity

We use cosine similarity for computing similarity between query vectors. This measure is well suited to compute the similarity between two vectors and is normalized in [ 0, 1 ]. In this way, it favors the nature of SQL operations more than their number. To deal with null vectors, which are frequent in the SQLShare dataset (see Section 4), we set border cases as follows: (i) two null vectors are considered identical (similarity is 1), and (ii) one null vector is considered as completely different from a non-null vector (similarity is 0). Formally, given two query vectors $v$ and $v'$, cosine similarity is calculated as follows:

$$\cos(v, v') = \begin{cases} 1 & \text{if } \|v\| = 0 \text{ and } \|v'\| = 0 \\ 0 & \text{if } \|v\| = 0 \text{ or } \|v'\| = 0 \\ \dfrac{v \cdot v'}{\|v\| \cdot \|v'\|} & \text{else} \end{cases} \qquad (1)$$
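The similarity with its two border cases can be sketched as follows. The function name `cosine_similarity` is hypothetical, and the cases are tested in the order in which they are listed above, so that the second case only applies when exactly one vector is null.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two query vectors, with null-vector border cases."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 and norm_v == 0:
        return 1.0  # two null vectors are considered identical
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a null vector is completely different from a non-null one
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (norm_u * norm_v)


null = [0] * 12
add_projection = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
add_three_projections = [3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
add_selection = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(cosine_similarity(null, null))                             # 1.0
print(cosine_similarity(null, add_projection))                   # 0.0
print(cosine_similarity(add_projection, add_three_projections))  # 1.0
print(cosine_similarity(add_projection, add_selection))          # 0.0
```

The third call shows why the measure favors the nature of operations over their number: adding one projection and adding three projections are considered identical.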

In order to compare explorations, we pair CED measure with the cosine similarity measure among query vectors.

CED is a generalization of the Edit Distance, adapting cost computation to typical characteristics of semantic sequences. In particular, CED answers the following requirements: (i) the edit cost depends on the similarity of nearby elements (the more similar and closer the elements, the lower the cost of operations), (ii) editing repeated close elements has a low cost, and (iii) similar and close elements can be exchanged with a low cost.

We describe CED computation as defined in [ 23 ] and tuned in [ 24 ]. Firstly, CED modifies the cost function of Edit Distance to take into account the local context of each element in the sequence. Consider contextual edit operations of the form $\omega = (o, e, k, q)$, denoting the operation $o \in \{\text{add}, \text{mod}, \text{del}\}$ on exploration $e = \langle q_1, \ldots, q_n \rangle$ at index $k$ by query $q$. Let $\mathcal{O}$ be the set of all possible contextual edit operations; the cost function $\gamma : \mathcal{O} \to [0, 1]$ is defined as:

$$\gamma(\omega) = 1 - \max_{i \in \llbracket 1, n \rrbracket} \{ sim(q, q_i) \times c_i(\omega) \} \qquad (2)$$

where $sim$ is a similarity measure between two queries, computed as the cosine similarity of their query vectors, and $c(\omega) \in [0, 1]^n$ is a contextual vector which quantifies the notion of proximity between queries. Usually, the bigger $|i - k|$ is, the smaller $c_i(\omega)$ is. As in [ 24 ], we use for $c_i(\omega)$ a Gaussian-shaped function of $i - k$, scaled by the exploration length $|e|$.

CED is computed like the Edit Distance, using dynamic programming and the Wagner-Fischer algorithm [ 36 ].
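To illustrate the dynamic programming scheme, the following Python sketch implements the Wagner-Fischer recurrence for a plain, non-contextual edit distance between two sequences. It is a simplified stand-in and not CED itself: insertions and deletions have unit cost and a substitution costs 1 minus the similarity of the two elements, whereas CED replaces these costs with the contextual cost function described above. The names `edit_distance` and `exact_match` are hypothetical.

```python
def edit_distance(seq_a, seq_b, sim):
    """Plain edit distance between two sequences (Wagner-Fischer recurrence).

    sim(x, y) must return a similarity in [0, 1]; a substitution costs 1 - sim(x, y).
    Insertions and deletions cost 1 (CED would use a contextual cost instead).
    """
    n, m = len(seq_a), len(seq_b)
    # dist[i][j] = distance between the first i elements of seq_a
    # and the first j elements of seq_b.
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 1.0 - sim(seq_a[i - 1], seq_b[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,               # delete seq_a[i-1]
                dist[i][j - 1] + 1.0,               # insert seq_b[j-1]
                dist[i - 1][j - 1] + substitution,  # modify (free if identical)
            )
    return dist[n][m]


def exact_match(x, y):
    """Toy similarity: 1 for identical elements, 0 otherwise."""
    return 1.0 if x == y else 0.0


# Two toy explorations written with aggregated labels.
print(edit_distance(["PT", "S", "PS"], ["PT", "PS"], exact_match))  # 1.0
```

In practice the `sim` argument would be the cosine similarity between query vectors rather than an exact match.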

In the next sections we describe how this representation of queries and operations is used for profiling the SQLShare workload and clustering explorations.

4 DATASET PROFILING

The SQLShare workload contains 2,809 explorations, totaling 10,668 queries. The length of explorations follows Zipf’s law. Indeed, 1,379 explorations are one-shot (i.e. they contain only one query), the median exploration contains 2 queries and the longest one contains 98 queries. Figure 1a shows the boxplot of the distribution.

The number of operations in a query (w.r.t. the previous query) also follows Zipf’s law. Noticeably, 1,289 out of 10,668 queries have no operations w.r.t. the previous query. This happens when a query is identical to the previous one (947 queries), but also when there are only visualisation or optimisation changes (e.g. changing the order of projected attributes or joined tables) and when changes concern advanced options not captured by our features (e.g. replacing a regular join with an outer join, or changing the ascending/descending direction of ordering). On average, a query has 4 operations, and the longest one has 510 operations. The latter corresponds to a query of the form "SELECT * FROM T" where T contains a large number of columns. Figure 1b shows the boxplot of the distribution. We notice that many outliers correspond to large first queries (i.e. containing many fragments, especially projections), which lead to long query vectors when compared to the empty query $q_0$.
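The profiling figures above are simple aggregates over the query vectors. A minimal sketch, assuming that the number of operations of a query is the sum of its 12 features (its ℓ1 norm) and using a hypothetical `profile` function on toy data:

```python
import statistics


def profile(explorations):
    """Summarize exploration lengths and per-query operation counts.

    explorations: list of explorations, each a list of 12-dimensional query vectors.
    """
    lengths = [len(e) for e in explorations]
    # Assumption: number of operations of a query = sum of its features (l1 norm).
    operations = [sum(v) for e in explorations for v in e]
    return {
        "explorations": len(explorations),
        "queries": len(operations),
        "one_shot": sum(1 for n in lengths if n == 1),
        "median_length": statistics.median(lengths),
        "max_length": max(lengths),
        "null_vectors": sum(1 for n in operations if n == 0),
        "mean_operations": statistics.mean(operations),
        "max_operations": max(operations),
    }


# Toy data: one exploration of two queries (the second repeats the first)
# and one one-shot exploration.
toy = [
    [[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0] * 12],
    [[3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]],
]
summary = profile(toy)
print(summary)
```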

Most frequent operations are adding and deleting projections (+P and -P), adding tables (+T) and adding selections (+S). Less frequent operations concern group by (+G and -G) and order by (+O and -O). Figure 1c shows the complete distribution.

Figures 1d and 1e complement the distribution of operations by highlighting the combinations of operations that are frequently performed together. Precisely, Figure 1d shows the top 10 most frequent query vectors, evidencing that null vectors (∅) are the most frequent, followed by a change in selections (+S-S) and a change in projections (+P-P). Some frequent vectors are surprising, such as changing one table without changing anything else (+T-T). This behavior corresponds to users updating and uploading a dataset and then evaluating the same query on the new dataset. In addition, Figure 1e shows the top 10 most frequent aggregated query vectors, illustrating the most frequent types of operations disregarding vector magnitude and sign. The most frequent aggregated vectors concern changes in projections, selections and tables (PST) and subsets of these operations (PT, P and S). Interestingly, null vectors (∅) come in fifth position. These top 10 aggregated vectors cover 9,205 queries (82.3% of the total number of queries).

Figure 1f goes a step further, showing the main flows² between aggregated query vectors by means of a chord diagram. A flow ⟨A, B⟩ indicates queries with vector A followed by queries with vector B. It is represented by an arrow (the origin being closer to the external circle), whose magnitude indicates its frequency. For example, the flow from to (in purple) represents vectors followed by vectors. We can observe many autoflows (e.g. for , , ). Interestingly, many null vectors (∅) are followed by other null vectors or by simple changes ( and ).

² Main flows are the ones such that the number of transitions is greater than 5% of the biggest flow.
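The flows of the chord diagram amount to counting transitions between consecutive aggregated vectors and keeping those above 5% of the biggest flow. A minimal sketch, where the function name `main_flows`, the string labels and the toy data are hypothetical:

```python
from collections import Counter


def main_flows(explorations, threshold=0.05):
    """Count transitions between consecutive aggregated vectors and keep the main ones.

    explorations: list of sequences of hashable labels (e.g. "PT", "S", "" for null).
    A flow is kept if its count is greater than threshold times the biggest flow.
    """
    flows = Counter()
    for labels in explorations:
        # Pair each label with its successor inside the same exploration.
        flows.update(zip(labels, labels[1:]))
    if not flows:
        return {}
    biggest = max(flows.values())
    return {pair: count for pair, count in flows.items() if count > threshold * biggest}


toy = [["PT", "S", "S"], ["PT", "S"], ["P", "P", "P"]]
print(main_flows(toy))                 # all three flows are kept at 5%
print(main_flows(toy, threshold=0.5))  # only the flows seen more than once
```

One-shot explorations contribute no flow, since they contain no pair of consecutive queries.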

5.1.1 Implementation and settings. We used the CED implementation and setting described in [ 24 ], which is recalled in Subsection 3.3. Concerning UMAP, we use the umap-learn Python library 0.4.3, where min_dist is set to 0.01, n_neighbors to 200 (i.e. around 10% of the dataset) and the pseudo-random number generator seed is 42. Finally, we use the DBSCAN clustering algorithm from the sklearn Python library 0.22.2, applied on the previous UMAP embedding, with eps = 1 and min_samples = 10.
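A possible shape for this pipeline is sketched below. It is not the authors' notebook: the function names `embed_with_umap` and `cluster_embedding` are hypothetical, the mapping of the settings to the `min_dist`, `n_neighbors`, `eps` and `min_samples` parameters is our reading of the text, and the 2-dimensional target space and the use of a precomputed CED distance matrix as UMAP input are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def embed_with_umap(distance_matrix, n_neighbors=200, min_dist=0.01, seed=42):
    """Project a precomputed pairwise distance matrix to a 2D Euclidean space."""
    import umap  # provided by the umap-learn package; imported lazily

    reducer = umap.UMAP(
        n_components=2,        # assumption: 2D embedding
        metric="precomputed",  # the input is a distance matrix, not feature vectors
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=seed,
    )
    return reducer.fit_transform(distance_matrix)


def cluster_embedding(embedding, eps=1.0, min_samples=10):
    """Run DBSCAN on the Euclidean embedding; the label -1 marks noise points."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedding)


# Toy check of the clustering step on two well-separated synthetic blobs.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(10.0, 0.1, (30, 2))])
labels = cluster_embedding(points)
print(sorted(set(labels.tolist())))
```

With a real distance matrix `dist` between explorations, the two steps would be chained as `cluster_embedding(embed_with_umap(dist))`.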

All experiments are available and can be reproduced by running our Python notebook³ in Google Colab or Jupyter environments. In particular, all code generating the graphs, the dataset

³ https://colab.research.google.com/drive/1Yt7Q7AFghkcxdea2UicccMCmkaX7dRMD?usp=sharing

5.1.2 Protocol description. To further justify our choice of methods, we performed some preliminary tests on workloads of explorations having a ground truth. Specifically, we used the workloads of OLAP explorations described in [ 24 ], and obtained comparable results for the artificial dataset and improved results for the explorations of real users (Ipums dataset). These results are available in the notebook; we omit them here for lack of space. Note that as the features are weakly correlated (the correlation matrix is shown in Figure 2), we decided to keep all of them. A PCA analysis confirmed this choice.

Finally, given the large number of one-shot explorations (as shown in Figure 1a), we decided to test two clustering configurations: (i) whole clustering (on the whole dataset), and (ii) restricted clustering (excluding one-shot explorations). Indeed, the vector of the unique query of such an exploration, when compared to the empty query ($q_0$), is 0-valued for features 2, 4, 6, 8, 10 and 12 (which count the deleted query fragments), introducing a bias. The restricted clustering aims to further analyse longer explorations, revealing richer patterns.

The remainder of this section describes the results of both clustering configurations.

5.2 Results of the whole clustering

The clustering of the whole dataset resulted in 7 clusters. Figure 3a plots the UMAP reduction of the dataset to a 2D Euclidean space. Cluster sizes vary: there are 3 large clusters (1 to 3), whose explorations exhibit frequent behavior, and 4 small clusters (4 to 7) concerning less frequent behavior.

As expected, the length of explorations had a big impact on the clustering results. As shown in Figure 3b, clusters 1 and 2 contain only explorations having at least 2 queries, while the remaining clusters contain a majority of one-shot explorations: clusters 4 to 6 contain some explorations of length 2, cluster 3 contains some longer ones, and cluster 7 contains only one-shot explorations. On average, cluster 1 contains longer explorations than cluster 2, including the longest ones.

The overall distribution of operations is very similar for clusters 1 and 2 (see Figure 3d). However, the two clusters differ in the number of operations between consecutive queries (captured by the ℓ1 norm of query vectors, shown in Figure 3c) and in their flows. Although explorations in cluster 1 are longer than those in cluster 2, their queries involve fewer operations (median=2). The most frequent aggregated query vectors, listed in Table 3, confirm this observation. In particular, aggregated vectors in cluster 1 include most of the null vectors, but also many vectors representing only one type of operation (esp. and ). In contrast, many frequent vectors in cluster 2 involve several operations. We further analyse these two clusters in the next subsection.
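The two indicators used here can be sketched as follows. The ℓ1 norm is simply the sum of the vector components. The labelling function is only an assumption about how aggregated vectors are derived: it records which operation types occur and ignores their counts. The feature order and the toy cluster are hypothetical.

```python
from statistics import median

# Assumed feature order: added/deleted fragments for each of the six clauses.
FEATURES = ["+P", "-P", "+S", "-S", "+A", "-A", "+T", "-T", "+G", "-G", "+O", "-O"]


def l1_norm(vector):
    """Number of operations separating a query from the previous one."""
    return sum(abs(value) for value in vector)


def aggregated_label(vector):
    """Name the operation types present in a query vector, ignoring their counts."""
    label = "".join(name for name, value in zip(FEATURES, vector) if value > 0)
    return label or "∅"  # the null vector denotes a repeated query


# Hypothetical cluster made of three query vectors.
cluster = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0] * 12,
]
median_norm = median(l1_norm(v) for v in cluster)
labels = [aggregated_label(v) for v in cluster]
```

Comparing `median_norm` across clusters gives the kind of per-cluster summary reported above (e.g. median=2).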

Cluster 3 is the largest one and contains a majority of one-shot explorations. Queries in its explorations contain many operations (median=6), mostly projections. There are two frequent aggregated vectors: and .

Clusters 4 to 7 are smaller and contain mostly one-shot explorations, but exhibit very different behavior. Queries in cluster 4 involve many operations (median=6), concerning many selections and tables. The most frequent aggregated vector is . Queries in cluster 5 also involve many operations (median=6), concerning many projections and aggregations, some grouping but few selections. The most frequent aggregated vectors are and . There are two types of queries in cluster 6, as evidenced in Figure 3c. Most queries involve only 2 operations (+P+T), while the others involve multiple operations (multiple tables and many projections). The most frequent aggregated vector is . Cluster 7 is the smallest one (only 15 explorations). Its queries have fewer operations, concerning many projections and ordering. The most frequent aggregated vector is .

5.3 Results of the restricted clustering

The restricted clustering resulted in 6 clusters, plotted in Figure 4a. There are two well-differentiated clusters (1 and 5) and a dense zone including a large cluster (2) and 3 smaller ones (3, 4 and 6). It is in this dense zone that DBSCAN best exploits the topology of the space. Cluster analysis is shown in Figure 4 and the main flows (for the bigger clusters) are shown in Figure 5.

Explorations coming from cluster 1 of the whole clustering are distributed between clusters 1 and 5 of the restricted clustering (except for 1 exploration that goes to cluster 2). Cluster 1 contains the longest explorations and the highest median number of explorations. Its queries concern operations of varied types, most of them limited to 1 or 2 operations. Frequent aggregated vectors are ∅, , and . Flows illustrate many repetitions of the same operations but also the alternation of operations. Cluster 5 contains shorter explorations. 62% of its queries are identical to the previous one (empty vector). In the remaining ones, projections are predominant, with some selections and tables. Frequent aggregated vectors are ∅, and . Flows show that the first queries of explorations have typical vectors (e.g. , ) and that the following queries are identical (∅).

Explorations coming from cluster 2 of the whole clustering are distributed among clusters 2, 3, 4 and 6, in the dense zone. In addition, the 40 explorations coming from the other clusters (3 to 6) mainly go to cluster 2. These four clusters contain, on average, shorter explorations than cluster 1, with more operations per query. Queries in cluster 2 concern operations of varied types, most of them being projections. Frequent aggregated vectors are , , and . Flows show that explorations consist of chains of the same operations (visible in the autoflows). There are more selections in queries of cluster 3, frequent aggregated vectors being , and . Flows show that first queries (mainly with vectors , and ) are followed by chains of selections, and some marginal projections. Queries in cluster 4 involve many changes in tables, and more aggregations than the previous clusters. Frequent aggregated vectors are , and . Finally, queries in cluster 6 contain most of the ORDER BY operations, but all types of operations are present. The most frequent aggregated vector is .
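The way explorations move from the clusters of one configuration to those of the other can be tabulated as in the sketch below. The function name, the exploration identifiers and the toy assignments are hypothetical; cluster labels are assumed to be available as dictionaries keyed by exploration.

```python
from collections import Counter, defaultdict


def membership_flows(whole_labels, restricted_labels):
    """Count, for each whole-clustering cluster, where its explorations end up."""
    table = defaultdict(Counter)
    for exploration, source in whole_labels.items():
        # One-shot explorations are absent from the restricted clustering.
        if exploration in restricted_labels:
            table[source][restricted_labels[exploration]] += 1
    return {source: dict(targets) for source, targets in table.items()}


# Hypothetical assignments; "e4" is a one-shot exploration.
whole = {"e1": 1, "e2": 1, "e3": 2, "e4": 3, "e5": 2}
restricted = {"e1": 1, "e2": 5, "e3": 2, "e5": 4}
flows = membership_flows(whole, restricted)
```

Each row of the resulting table reads like the statements above, e.g. explorations of whole-clustering cluster 1 being split between restricted clusters 1 and 5.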

5.4 Learned behavior

The analysis of both clustering configurations allowed the discovery of several patterns, representing common or less-frequent behavior. The most prominent aspects of each pattern are summarized in Table 4. This section briefly highlights our findings.

First, 49% of explorations are one-shot. They differ in the predominant operations of their unique query. We discovered 5 patterns. The most common one (3) consists in evaluating a simple query projecting many attributes, possibly to verify that the dataset was correctly uploaded or just to look at the data.

Less frequent patterns, also concerning the evaluation of a simple query, differ in the SQL operations used, namely many selections (4), aggregation and grouping (5), joins of multiple tables (6), and ordering (7). The latter is an outlier behavior, concerning only 15 explorations. These patterns suggest a more specific analysis of data (w.r.t. the common behavior in 3), taking advantage of more SQL operations. This may reflect users’ preferences for some SQL clauses, but may also reflect users’ expertise.

The remaining 51% of explorations contain between 2 and 98 queries, with a median of 4 queries. We discovered 6 patterns:

A common pattern (1) reveals long explorations with few operations per query, sometimes repeating queries, which reflects a focused data analysis. Many types of operations are used, but mostly once per query, suggesting a conscious use of SQL.

Another common pattern (2) reveals short explorations, with more operations per query. Projections are omnipresent, but frequently combined with other operations. What is interesting here is the chaining of the same types of operations along the exploration, which can be exploited to provide personalized suggestions to users.

Two interesting but less frequent patterns (3 and 5) concern a classical first query, followed by chains of selections (3) or repeated queries (5). In both cases, explorations are shorter than in 1 but reveal some kind of analysis. While 3 suggests a meticulous study of the dataset, 5 includes many novice users trying to understand how SQL works. A similar but more complex pattern (6) involves richer first queries, followed by changes in the ordering of projected expressions. In addition to a good use of SQL, this behavior may correspond to users looking for the best way of reporting data.

The last pattern (4), also less frequent, exhibits a particular behavior. It involves many changes in the datasets (frequently, the only operation in the query is a change in the FROM clause). This corresponds to the upload of a new dataset and the execution of the same query on the new dataset, and suggests data analysts dealing with quality issues in their datasets.

6 CONCLUSION AND FUTURE WORKS

This paper presented an original solution to learn analysis behavior in SQL workloads. The understanding of users’ analysis patterns has great implications for query recommendation, monitoring, optimization and, more generally, providing better IDE support. The proposal includes an abstraction of queries and explorations in the space of SQL operations, a set of similarity functions tailored for SQL queries and explorations, and an innovative clustering process taking advantage of UMAP reduction for analysing a complex space.

The approach was tested on a real workload, SQLShare, allowing the extraction of 11 analysis patterns. These include 3 typical behaviors (one-shot simple explorations, short exploratory explorations, and longer, more focused ones) as well as less frequent behavior evidencing the occasional use or the chaining of specific SQL operations. We believe that the identification of such behavior should be at the core of more intelligent IDE tools.

In this paper we used a large palette of indicators for profiling the workload and analyzing the obtained clusters (some additional ones are described in our notebook). In the near future, we would like to test additional indicators, specifically concerning how focused the explorations are (i.e. distinguishing flows at the beginning and end of explorations), and how complex the queries are, both in terms of expressiveness and usage of advanced clauses and functions (here, we also need to extract additional features). In addition, we would like to classify users according to their analysis behaviors. The SQLShare workload, with its 57 users and their 3,336 datasets, is a rich source for further experiments. Of course, there are many one-shot users (already reported in [13]), but our preliminary analyses reveal very interesting behavior.

Finally, we would like to test our proposal on further workloads, especially those including queries generated by bots, such as SDSS. The authors of [32] acknowledge the difficulty of extracting human sessions from all those collected: "We failed to find clear ways to segment user populations. [...] Interactive human users were 51% of the sessions, 41% of the Web traffic and 10% of the SQL traffic. We cannot be sure of those numbers because we did not find a very reliable way of classifying bots vs mortals." Developing tools that help in the recognition and analysis of hand-written queries is a nice challenge.

REFERENCES

[1] Abbott and Tsay. Sequence analysis and optimal matching methods in sociology: Review and prospect. SMR, 29(1):3-33, 2000.

[2] A. C. Acar and Motro. Why is this user asking so many questions? explaining sequences of queries. In DBSec, 2004.

[3] Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: ordering points to identify the clustering structure. ACM Sigmod, 28(2):49-60, 1999.

[4] Barbosa, Barthelemy, G. Ghoshal, C. R. James, Lenormand, Louail, Menezes, J. J. Ramasco, Simini, and Tomasini. Human mobility: Models and applications. Physics Reports, 734:1-74, 2018.

[5] Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress. In VLDB, 2007.

[6] Djedaini, Drushku, Labroche, Marcel, Peralta, and Verdeaux. Automatic assessment of interactive OLAP explorations. Inf. Syst., 82:148-163, 2019.

[7] Djedaini, Labroche, Marcel, and Peralta. Detecting user focus in OLAP analyses. In ADBIS'2017, Nicosia, Cyprus, 2017.

[8] Eirinaki, Abraham, Polyzotis, and Shaikh. Querie: Collaborative database exploration. TKDE, 26(7):1778-1790, 2014.

[9] Ester, H.-P. Kriegel, Sander, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd, 96(34):226-231, 1996.

[10] Guidotti, Monreale, Ruggieri, Turini, Giannotti, and Pedreschi. A survey of methods for explaining black box models. ACM CSUR, 51(5), 2018.

[11] Hägerstraand. What about people in regional science? Papers in regional science, 24(1):7-24, 1970.

[12] Idreos, Papaemmanouil, and Chaudhuri. Overview of data exploration techniques. In SIGMOD, 2015.

[13] Jain, Moritz, Halperin, Howe, and Lazowska. Sqlshare: Results from a multi-year sql-as-a-service experiment. In SIGMOD, 2016.

[14] Jiang, Ferreira, and M. C. González. Clustering daily patterns of human activities in the city. DMKD, 25(3):478-510, 2012.

[15] Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 2009.

[16] Khoussainova, Kwon, Balazinska, and Suciu. Snipsuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22-33, 2010.

[17] Kul, D. T. A. Luong, Xie, Chandola, Kennedy, and S. J. Upadhyaya. Similarity metrics for SQL query clustering. IEEE Trans. Knowl. Data Eng., 30(12):2408-2420, 2018.

[18] L. v. d. Maaten and Hinton. Visualizing data using t-sne. Journal of machine learning research, 9:2579-2605, 2008.

[19] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281-297, 1967.

[20] McInnes, Healy, and Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

[21] Milo and Somech. Next-step suggestions for modern interactive data analysis platforms. In KDD, 2018.

[22] Moreau, Chanson, Peralta, Devogele, and C. de Runz. Clustering sequences of multi-dimensional sets of semantic elements. In SAC, 2021.

[23] Moreau, Devogele, Peralta, and Étienne. A contextual edit distance for semantic trajectories. In SAC, 2020.

[24] Moreau, Peralta, Marcel, Chanson, and Devogele. Learning analysis patterns using a contextual edit distance. In DOLAP, 2020.

[25] A. Y. Ng, M. I. Jordan, and Weiss. On spectral clustering: Analysis and an algorithm. In Advances in NIPS, pages 849-856, 2002.

[26] H. V. Nguyen, Böhm, Becker, Goldman, G. Hinkel, and Müller. Identifying user interests within the data space - a case study with skyserver. In EDBT, 2015.

[27] Pappalardo, Simini, Rinzivillo, Pedreschi, Giannotti, and A.-L. Barabási. Returners and explorers dichotomy in human mobility. Nature communications, 6(1):1-8, 2015.

[28] H.-S. Park and C.-H. Jun. A simple and fast algorithm for k-medoids clustering. Expert systems with applications, 36(2):3336-3341, 2009.

[29] Peralta, Marcel, Verdeaux, and A. S. Diakhaby. Detecting coherent explorations in SQL workloads. Inf. Syst., 92, 2020.

[30] Rizzi and Gallinucci. Cubeload: A parametric generator of realistic OLAP workloads. In CAiSE, 2014.

[31] Romero, Marcel, Abelló, Peralta, and Bellatreche. Describing analytical sessions using a multidimensional algebra. In DaWaK'2011, Toulouse, France, 2011.

[32] Singh, Gray, Thakar, A. S. Szalay, Raddick, Boroski, Lebedeva, and Yanny. Skyserver traffic report - the first five years. Technical report, December 2006.

[33] Somech, Milo, and Ozeri. Predicting "what is interesting" by mining interactive-data-analysis session logs. In EDBT, 2019.

[34] H. van den Brink, R. van der Leek, and Visser. Quality assessment for embedded SQL. In SCAM, pages 163-170. IEEE Computer Society, 2007.

[35] Vashistha and Jain. Measuring query complexity in sqlshare workload. https://uwescience.github.io/sqlshare/pdfs/Jain-Vashistha.pdf.

[36] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168-173, 1974.

[37] R. W. White. Interactions with Search Systems. Cambridge Univ. Press, 2016.