Qualitative Analysis of the SQLShare Workload for Session Segmentation

Veronika Peralta, Willeme Verdeaux, Yann Raimont, Patrick Marcel
University of Tours, Blois, France
veronika.peralta|patrick.marcel@univ-tours.fr, willeme.verdeaux|yann.raimont@etu.univ-tours.fr

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisboa, Portugal).

ABSTRACT
This paper presents an ongoing work aiming at better understanding the workload of SQLShare [9]. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [9], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries, and we show how to use these features to separate sequences of SQL queries into meaningful sessions. We ran a few tests over various query workloads to validate our approach empirically.

1 INTRODUCTION
Analyzing a database workload offers many practical interests, from the monitoring of database physical access structures [2] to the generation of user-tailored collaborative query recommendations for interactive exploration [6]. There has been much attention lately devoted to the analysis of users' past activities to support Interactive Database Exploration [8]. OLAP analysis of data cubes is a particular case of IDE that takes advantage of simple primitives like drill-down or slice-and-dice for the navigation of multidimensional data. These particularities enable the design of approaches for characterizing user explorations in how focused they are [4], in how contributive a query is to the exploration [3], or even in how to ensure that a sequence of analytical queries forms a coherent exploration [13].
Transposing these works to regular, non-multidimensional SQL workloads raises many challenges. Even if a sequence of SQL queries is issued to explore the database content, non-multidimensional relational schemas do not have the regularities one expects from the multidimensional model, explorations may not be expressed through roll-up or drill-down operations, SQL queries may deviate from the traditional star-join pattern commonly used for analytical purposes, etc.

In this paper, we present a preliminary approach for analyzing SQL workloads, concentrating on the SQLShare workload of hand-written¹ queries over user-uploaded datasets. This workload includes raw sequences of queries made by some users, without further information on their intention. One of our objectives is to investigate whether this workload contains actual exploration activities, and more particularly how to extract such explorations. In what follows, we consider that an exploration is a coherent sequence of hand-written queries that all share the same goal of fulfilling a user's information need that may not be well defined initially. Identifying such activities has several applications in supporting interactive database exploration (IDE), like understanding users' information needs, identifying struggling during the exploration, providing better query recommendations, etc. This is important since, usually, systems used for IDE do not offer such facilities.

To identify explorations from a SQL workload, we use a technique first proposed in [3] to score the quality of OLAP explorations. This technique consists of characterizing a query by a set of simple features that are intrinsic to a query or that relate the query to its neighbor in the sequence. While in [3] this technique of feature extraction was used with supervised machine learning to score the quality of OLAP explorations, in the present work we use these features to partition an SQL workload into coherent explorations.

The paper is organized as follows. The next section discusses related work. Section 3 presents our model of queries, tailored for SQL queries. Section 4 details the features considered and how they are extracted. Section 5 introduces our segmentation strategy and Section 6 reports the results of the tests we conducted. Section 7 concludes and draws perspectives.

¹ Consistently with the authors of [9], we use the term hand-written to mean, in this context, that the query is introduced manually by a human user, which reflects genuine interactive human activity over a dataset, with consideration between two consecutive queries.

2 RELATED WORK
In this section we present related work concerning real SQL workloads and workload analysis.

2.1 Real SQL workloads
SQLShare. The SQLShare workload is the result of a multi-year SQL-as-a-Service experiment [9], allowing any user with minimal database experience to upload their datasets on-line and manipulate them via SQL queries. What the authors wanted to prove with this experiment is that SQL is beneficial for data scientists. They observed that, most of the time, people use scripts to modify or visualize their datasets instead of using the SQL paradigm. Indeed, most user needs may be satisfied by first-order queries, which are much simpler than a script but have the initial cost of creating a schema, importing the data and so on. SQL-as-a-Service frees the user of all this prior work with a relaxed SQL version.

The SQLShare workload is composed of 11,137 SQL statements, 57 users and 3,336 user datasets. To the best of our knowledge, as reported by the authors of [9], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. As indicated in the introduction, hand-written means that the query is introduced manually by a human user, which reflects genuine interactive human activity over a dataset, with consideration between two consecutive queries.

The SQLShare workload is analyzed in [9], particularly to verify the following assumption:

"We hypothesized that SQLShare users would write queries that are more complex individually and more diverse as a set, making the corpus more useful for designing new systems."

The authors indeed showed empirically that the queries in the SQLShare workload are complex and diverse. They also analyzed the churn rate of SQLShare users and concluded that most users exhibit a behavior that suggests an exploratory workload.
To our knowledge, and again as reported by the authors of [9], this workload is one of the two workloads publicly available to the research community, the other being the SDSS workload.

SDSS workload. SkyServer is an Internet portal to the Sloan Digital Sky Survey Catalog Archive Server; its Web and SQL logs are public [14]. The SQL log was produced by a live SQL database supporting both ad-hoc hand-authored queries as well as queries generated from a point-and-click GUI. Many queries in the SDSS are actually not hand-written; they were generated by applications such as the Google Earth plugin or the query composer from the SkyServer website. Their cleaning and normalization took several months of effort.

Sessions in this log were detected using heuristics:

"We arbitrarily start a new session when the previous page view from that IP address is more than 30 minutes old, i.e., a think-time larger than 30 minutes starts a new session. [...] Wong and Singh [1] chose the same 30 minute cutoff and we are told that MSN and Google use a similar heuristic."

The authors of [14] also acknowledge the difficulty of extracting human sessions from all those collected:

"We failed to find clear ways to segment user populations. We were able to ignore the traffic that was administrative or was eye-candy, leaving us with a set of 65M page views and 16M SQL queries. We organized these requests into about 3M sessions, about half of which were from spiders. The residue of 1.5M sessions had 51M page views and 16M SQL queries – still a very substantial corpus. [...] Interactive human users were 51% of the sessions, 41% of the Web traffic and 10% of the SQL traffic. We cannot be sure of those numbers because we did not find a very reliable way of classifying bots vs mortals."

Bots are programs that automatically crawl the SDSS and launch SQL queries. Such traffic cannot be classified as proper interactive data exploration with human consideration.

In [9], the authors compared the SQLShare workload with that of the SDSS, and conclude:

"SQLShare queries on average tend to be more complex and more diverse than those of a conventional database workload generated from a comparable science domain: the Sloan Digital Sky Survey (SDSS)."

Smaller SQL datasets. We are aware of other available SQL workloads. For instance, Kul et al. [11] analyze three specific query sets. The first one, student assignments gathered by IIT Bombay, is made of a few hundred queries answering homework assignments. The second dataset, publicly available, consists of around 200 queries gathered over 2 years from student exams at the University of Buffalo. The third dataset consists of SQL logs that capture all database activities of 11 Android phones for a period of one month. The log consists of 1,352,202 SELECT statements that, being generated by an application, correspond to only 135 distinct query strings.

2.2 Workload analysis
Other scientific domains close to databases, like Information Retrieval or Web Search, have a long tradition of log analysis aiming at facilitating the searcher's task [17]. Many works extract features from queries or search sessions and use them to disambiguate the session's goal, to generate recommendations, to detect struggling in sessions, etc. Since databases tend to be more used in an exploratory or analysis fashion, as evidenced by the SQLShare workload, it is not a surprise that many recent works pay attention to the analysis of database workloads, in addition to those works analyzing workloads for optimization or self-tuning purposes. We present some recent advances in this area, differentiating by the type of logs (OLAP logs and SQL logs).

Analyzing and detecting OLAP explorations. Logs of OLAP analyses are simpler than SQL ones in the sense that they feature multidimensional queries that can easily be interpreted in terms of OLAP primitives (roll-up, drill-down, slice-and-dice, etc.). In one of our previous works [13], we proposed an approach for detecting OLAP analyses phrased in SQL, by converting SQL queries into OLAP queries and then checking if two consecutive queries are sufficiently close in terms of OLAP operations. In our more recent work, we used supervised learning to identify a set of query features allowing to characterize focus zones in OLAP explorations [4], or to identify queries that better contribute to an exploration [3]. The present work can be seen as a continuation of those previous works, since we have the same objective as [13] and use the same technique as [3]. The main differences with these previous works are that we make no assumption about the type of queries in the workload (particularly, they may not be multidimensional queries), and that we have no ground truth (i.e., no human manual inspection of each query) on the workload.

Analyzing SQL logs. SQL workload analysis has recently attracted attention beyond query optimization, for instance for query recommendation [6], query autocompletion [10], or user interest discovery [12]. All these works use the SDSS workload for their tests. Embedded SQL code is analyzed in [15] to measure its quality, mainly for maintainability purposes. The authors quantify quality based on the number of operators (joins, unions), operands (tables, subqueries) and variables in the SQL code, experimenting with SQL code embedded in PL/SQL, COBOL and Visual Basic. Jain et al. ran a number of tests on the SQLShare workload [9], some of them being reported above, showing the diversity and complexity of the workload. In [16], Vashistha and Jain analyze the complexity of queries in the SQLShare workload, in terms of the following query features: number of tables, number of columns, query length in characters, numbers of operators (Scan, Join, Filter), number of comparison operators (LE, LIKE, GT, OR, AND, Count), and the query run-time. They define two complexity metrics from these features: the Halstead measure (traditionally used to measure program complexity) and a linear combination whose weights are learned using regression. Finally, a recent work investigated various similarity metrics over SQL queries, aiming at clustering queries [11] for better workload understanding. The authors run their tests on the smaller SQL sets indicated above.

To our knowledge, our work is the first to propose an approach for segmenting hand-written SQL queries into meaningful sessions.
3 PRELIMINARIES
This section introduces the SQLShare workload and describes our hypotheses and preprocessing.

3.1 SQLShare workload preprocessing
From the 11,137 SQL statements we kept 10,668 corresponding to SELECT statements. The remaining statements (mainly updates, inserts and deletes) were filtered out.

We implemented a preliminary session segmentation following a simple heuristic: keeping together the sequences of consecutive queries of a same user. As a result of this initial segmentation we obtained 451 sessions, counting between 1 and 937 queries (an average of 23.65 queries per session, with a standard deviation of 75.05 queries). Furthermore, we made the initial hypothesis that queries appear in chronological order in the SQLShare workload. We noted that the queries of the workload do not come with timestamps, and we contacted the authors of the original SQLShare paper [9], who confirmed that the query order in the workload may not reflect the order in which queries were launched. Therefore, the disparate distribution of queries along sessions, in addition to some extremely long, possibly disordered sessions, calls for a smarter way of segmenting sessions.
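For illustration, a minimal sketch of this preliminary heuristic is given below. It assumes the released workload is available as a list of (user, query_text) records kept in file order; the function name and record layout are ours and not part of the SQLShare release.

```python
from itertools import groupby

def initial_sessions(workload):
    """Group consecutive queries of the same user into one session.

    `workload` is a list of (user, query_text) pairs, kept in the order in
    which they appear in the released dump (which, as noted above, may not
    be chronological).
    """
    return [[query for _, query in run]
            for _, run in groupby(workload, key=lambda record: record[0])]
```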
3.2 Query and session abstractions
In what follows, we use the term query to denote the text of an SQL SELECT statement. We represent a query as a collection of fragments extracted from the query text, namely projections, selections, aggregations and tables. These fragments abstract the most descriptive parts of a SQL query, and are the most used in the literature (see e.g., [6, 10]). Note that we do not restrict ourselves to SPJ (selection-projection-join) queries. Indeed, we consider all queries in the SQLShare workload, some of them containing arbitrarily complex chains of sub-queries.

Definition 3.1 (Query). A query over database schema DB is a quadruple q = ⟨P, S, A, T⟩ where:
(1) P is a set of expressions (attributes or calculated expressions) appearing in the main SELECT clause (i.e. the outermost projection). We deal with the * wild card by replacing it by the list of attributes it references.
(2) S is a set of atomic Boolean predicates whose combination (conjunction, disjunction, etc.) defines the WHERE and HAVING clauses appearing in the query. We considered indistinctly all predicates appearing in the outermost statement as well as in inner sub-queries.
(3) A is a set of aggregation expressions appearing in the main SELECT clause (i.e. the outermost projection). Including aggregation expressions appearing only in the HAVING clause is part of our future work.
(4) T is a set of tables appearing in FROM clauses (outermost statement and inner sub-queries). Views, sub-queries and other expressions appearing in FROM clauses are parsed in order to obtain the referenced tables.

Note that although we consider tables and selections occurring in inner sub-queries, we limit ourselves to the outermost query for projections and aggregations, as they correspond to the attributes actually visualized by the user. We intentionally remain independent of presentation and optimization aspects, especially the order in which attributes are projected (and visualized by the user), the order in which tables are joined, etc.

Finally, a session is a sequence of queries of a user over a given database.

Definition 3.2 (Session). Let DB be a database schema. A session s = ⟨q1, ..., qp⟩ over DB is a sequence of queries over DB. We note q ∈ s if a query q appears in the session s, and session(q) to refer to the session where q appears.
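As an illustration of Definitions 3.1 and 3.2, the sketch below shows one possible in-memory representation of the quadruple ⟨P, S, A, T⟩; the table and attribute names are invented for the example and do not come from the workload.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    """Abstraction of a SELECT statement as the quadruple <P, S, A, T>."""
    projections: frozenset   # P: expressions of the outermost SELECT clause
    selections: frozenset    # S: atomic Boolean predicates (outer and inner queries)
    aggregations: frozenset  # A: aggregation expressions of the outermost SELECT clause
    tables: frozenset        # T: tables referenced in any FROM clause

# A session is simply an ordered list of queries issued by one user.
q1 = Query(frozenset({"name", "salary"}), frozenset({"salary > 1000"}),
           frozenset(), frozenset({"employee"}))
q2 = Query(frozenset({"name", "AVG(salary)"}), frozenset({"salary > 1000"}),
           frozenset({"AVG(salary)"}), frozenset({"employee"}))
session = [q1, q2]
```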
4 FEATURE EXTRACTION
In this section, we define a set of features to quantitatively describe different aspects of a SQL query and its context. We then describe the extraction procedure and the obtained scores.

4.1 Feature description
For each query, we extract a set of simple features computed from the query text and its relationship with other queries in a session. We intend to cover various aspects of a query in order to support different types of analysis and modeling based on query features. The set of features is inspired from our previous work [3, 4], which models OLAP queries as a set of features capturing typical OLAP navigation.

From now on, to remove any ambiguity, we use the term metric to denote the functions that score query quality. The term feature is reserved to denote the score output by the function.

For the sake of presentation, we categorize metrics as follows: i) intrinsic metrics, i.e., only related to the query itself, ii) relative metrics, i.e., also related to the query's predecessor in the session, and iii) contextual metrics, i.e., related to the whole session, providing more context to the metrics. Table 1 presents an overview of the metrics.

Table 1: Metrics for SQL queries

Intrinsic metrics
  NoP  Number of projections (attributes and expressions)
  NoS  Number of selections (filtering predicates)
  NoA  Number of aggregations, when there is a group-by clause
  NoT  Number of tables
Relative metrics
  NCP  Number of common projections, with previous query
  NCS  Number of common selections, with previous query
  NCA  Number of common aggregations, with previous query
  NCT  Number of common tables, with previous query
  RED  Relative edit distance (effort to express a query starting from the previous one)
  JI   Jaccard index of common query parts, with previous query
Contextual metrics
  NoQ  Number of queries (absolute position in the session)

For all definitions given in this section, let qk = ⟨Pk, Sk, Ak, Tk⟩ be the query occurring at position k in the session s over the instance I of schema DB. All the queries we considered are supposed to be well formed, and so we do not deal with query errors.

For the moment, we only considered metrics based on the query text. Further metrics may be defined if the database instance is taken into account (for example, number of tuples in the query result, precision and recall of the query result w.r.t. the previous query, execution time, etc.). But the computation of such metrics implies the execution of every query in the SQLShare dataset, whose data is not always available for confidentiality reasons (i.e. users did not agree to share them), which would considerably reduce the set of queries. We leave such studies to future work.

4.1.1 Intrinsic metrics. Intrinsic metrics are those that can be computed only considering the query qk, independently of the session s and the other queries in s. In other words, these metrics will give the same score to qk, independently of s.

Number of Projections. NoP(qk) represents the number of projections (attributes and expressions) that are projected by the user. Expressions projected in inner sub-queries, but not projected by the outer query, are not considered as they do not appear in the query result.

NoP(qk) = card(Pk)   (1)

Number of Selections. NoS(qk) represents the number of selections (elementary Boolean predicates) that appear in the query text, both in outer and inner sub-queries.

NoS(qk) = card(Sk)   (2)

Number of Aggregations. NoA(qk) represents the number of aggregation expressions that are projected by the user. As for projections, expressions appearing in inner sub-queries are not considered.

NoA(qk) = card(Ak)   (3)

Number of Tables. NoT(qk) represents the number of tables appearing in the query text, considering both outer and inner sub-queries.

NoT(qk) = card(Tk)   (4)

4.1.2 Relative metrics. Relative metrics are those that are computed by comparing the query qk to the previous query in the session s. Let qk−1 = ⟨Pk−1, Sk−1, Ak−1, Tk−1⟩ be the previous query. Each metric provides a default score for the limit case of the first query of s (i.e., q1).

Number of Common Projections. NCP(qk, qk−1) counts the number of common projections of qk relative to qk−1. The default value is 0 for q1.

NCP(qk, qk−1) = card(Pk ∩ Pk−1) if k > 1, and 0 otherwise.   (5)

Number of Common Selections. NCS(qk, qk−1) counts the number of common selections of qk relative to qk−1. The default value is 0 for q1.

NCS(qk, qk−1) = card(Sk ∩ Sk−1) if k > 1, and 0 otherwise.   (6)

Number of Common Aggregations. NCA(qk, qk−1) counts the number of common aggregations of qk relative to qk−1. The default value is 0 for q1.

NCA(qk, qk−1) = card(Ak ∩ Ak−1) if k > 1, and 0 otherwise.   (7)

Number of Common Tables. NCT(qk, qk−1) counts the number of common tables of qk relative to qk−1. The default value is 0 for q1.

NCT(qk, qk−1) = card(Tk ∩ Tk−1) if k > 1, and 0 otherwise.   (8)

Relative Edit Distance. RED(qk, qk−1) represents the edit effort, for a user, to express the current query starting from the previous one. It is strongly related to query parts, and is computed as the minimum number of atomic operations between queries, considering the operations of adding/removing a projection, selection, aggregation or table. The considered cost for each observed difference (adding/removing) is the same.

RED(qk, qk−1) = card(Pk − Pk−1) + card(Pk−1 − Pk) + card(Sk − Sk−1) + card(Sk−1 − Sk) + card(Ak − Ak−1) + card(Ak−1 − Ak) + card(Tk − Tk−1) + card(Tk−1 − Tk)   (9)

For the limit case, we consider an empty query q0 = ⟨∅, ∅, ∅, ∅⟩. Then, RED(q1, q0) = card(P1) + card(S1) + card(A1) + card(T1).
Jaccard Index. JI(qk, qk−1) represents the ratio between the common query parts (projections, selections, aggregations and tables) and the union of query parts,

JI(qk, qk−1) = card(queryParts(qk) ∩ queryParts(qk−1)) / card(queryParts(qk) ∪ queryParts(qk−1))   (10)

where queryParts(qk) = Pk ∪ Sk ∪ Ak ∪ Tk. For the limit case, we consider an empty query q0 = ⟨∅, ∅, ∅, ∅⟩. Then, JI(q1, q0) = 0.

4.1.3 Contextual metrics. Contextual metrics are session dependent and make sense only in the context of a session. The same query qk occurring in different sessions may be given different scores for metrics in this category. For now we consider only one such metric.

Number of Queries. NoQ(qk, s) counts the absolute position of query qk in s.

NoQ(qk, s) = k   (11)
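To fix ideas, the sketch below computes the relative metrics of Equations 5–10 on the Query abstraction sketched in Section 3.2. It is an illustration under our own naming, not the extraction code actually used for the experiments (a C# script, see Section 4.2).

```python
def ncp(q, prev): return len(q.projections & prev.projections)    # Equation 5
def ncs(q, prev): return len(q.selections & prev.selections)      # Equation 6
def nca(q, prev): return len(q.aggregations & prev.aggregations)  # Equation 7
def nct(q, prev): return len(q.tables & prev.tables)              # Equation 8

def red(q, prev):
    """Relative edit distance (Equation 9): size of the symmetric difference of each fragment set."""
    return sum(len(a ^ b) for a, b in ((q.projections, prev.projections),
                                       (q.selections, prev.selections),
                                       (q.aggregations, prev.aggregations),
                                       (q.tables, prev.tables)))

def jaccard(q, prev):
    """Jaccard index (Equation 10); returns 0 when both queries are empty (limit case)."""
    parts_q = q.projections | q.selections | q.aggregations | q.tables
    parts_prev = prev.projections | prev.selections | prev.aggregations | prev.tables
    union = parts_q | parts_prev
    return len(parts_q & parts_prev) / len(union) if union else 0.0
```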
4.2 Extraction protocol
This section briefly describes the procedure for extracting query features. We proceed in 3 steps:
(1) Filter the SQL workload in order to keep only SELECT statements, i.e. discard updates, inserts, etc.
(2) Extract the query fragments (sets of projections, selections, aggregations and tables) of each query.
(3) Compute the query features from the query fragments.

For the first and second steps we developed a C# script using the MSDN TSQL Parser. Removing all the non-SELECT statements was straightforward. However, extracting query fragments required dealing with several particular cases. The major challenge was extracting projections. First, to get the projections visualized by the user, we need to detect the outermost SELECT clause of a query and then extract its SELECT elements. When there is a star element (i.e. SELECT * FROM T) in the query, our script reaches the outermost FROM clause (looking for T). When T is a table, the script accesses the schema metadata and obtains all the table attributes. If there are one or more sub-queries in the FROM clause, it repeats the previously described pattern until it finds either a table or a set of SELECT elements. Views and WITH clauses were treated in a similar way. For now, we do not take into account queries having more than one '*' in their SELECT elements (i.e. SELECT t1.*, t2.*, a, b FROM t1, t2).

Note that this procedure for resolving star elements relies on the existence of schema metadata, i.e., on having access to the corresponding datasets in order to obtain the list of attributes. However, some datasets are referenced in queries but are not present in the SQLShare released data because the user decided not to share them. For such queries (near 18% of the query workload), we could not resolve the list of projections and we needed to estimate the number of projections during the third step (computing features).
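The star-resolution logic can be summarized as a recursive walk over the FROM clause. The sketch below is illustrative only and operates on a toy, already-parsed representation rather than on the MSDN TSQL Parser's syntax tree; `schema` is a hypothetical dictionary mapping table names to their attribute lists.

```python
def resolve_star(from_item, schema):
    """Return the attributes exposed by a FROM item when the outer query selects '*'.

    `from_item` is either a table name (str) or a nested dict
    {"select": [...], "from": <from_item>} standing for a sub-query.
    """
    if isinstance(from_item, str):                   # a base table: read the schema metadata
        return list(schema.get(from_item, []))       # unknown table -> unresolved (empty list)
    if from_item["select"] == ["*"]:                 # the sub-query also selects '*': recurse
        return resolve_star(from_item["from"], schema)
    return from_item["select"]                       # otherwise its SELECT elements are the projections

schema = {"T": ["a", "b", "c"]}
print(resolve_star("T", schema))                              # SELECT * FROM T -> ['a', 'b', 'c']
print(resolve_star({"select": ["*"], "from": "T"}, schema))   # SELECT * FROM (SELECT * FROM T) x
```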
The aggregations were obtained with a parser function that detects all the function calls within a query. The selections are all the atomic Boolean expressions contained in the queries and their sub-queries. At this stage of our work we do not deal with predicate containment.

The third step computes the query features as described in Equations 1 to 11. For the unresolved star elements, we estimated both the number of projections and the number of common projections, by taking advantage of the other queries in the exploration that list attributes of the "starred" tables. We used linear regression to estimate the remaining cases, with the AUTO-SKLEARN Python module [7], which aims at automatically choosing and parametrizing a machine learning algorithm for a given dataset, at a given cost (i.e., the time it takes to test different algorithms).

More precisely, we computed two distinct regressions, one for NoP and another one for NCP. The methodology is the same for both and consists of the following:
(1) Each query is represented by 23 features: NoS, NoA, NoT, NCS, NCA, NCT, NoQ (neither NoP nor NCP, since they are the targets of the regression), plus the min, max, average and standard deviation of NoP, NoS, NoA and NoT, grouped by the exploration the query belongs to.
(2) The queries where NoP=0 (and consequently NCP=0) are removed from the data set.
(3) The data set is then split 80/20 for cross-validation, and the AUTO-SKLEARN regression mode is used to fit the best regression. For the first regression, NoP is the target; for the second, NCP is the target. The maximum time to find a model is set to 180 seconds and the score used to measure the accuracy of the regression is the R² (coefficient of determination) regression score function.
(4) The fitted regression (assessed by its R² score) is used to predict NoP (respectively NCP) for the queries removed at step 2.
(5) Degenerate cases (e.g., a predicted number of common projections that is greater than the predicted number of projections) are handled manually.
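A minimal sketch of this estimation step is given below, assuming the 23-feature matrix has already been built as NumPy arrays; the function name and the data layout are ours, and only the 180-second budget and the 80/20 split come from the protocol above.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from autosklearn.regression import AutoSklearnRegressor

def estimate_nop(features, nop, unresolved):
    """Fit a regression on queries with a resolved NoP, then predict NoP for unresolved stars.

    `features`: (n_queries, 23) array built as in step (1);
    `nop`: known number of projections (meaningless where unresolved);
    `unresolved`: boolean mask flagging the queries with unresolved star elements.
    """
    X_known, y_known = features[~unresolved], nop[~unresolved]
    X_train, X_test, y_train, y_test = train_test_split(X_known, y_known, test_size=0.2)

    model = AutoSklearnRegressor(time_left_for_this_task=180)   # 180 s budget, as in step (3)
    model.fit(X_train, y_train)
    print("R2 on the held-out 20%:", r2_score(y_test, model.predict(X_test)))

    return model.predict(features[unresolved])   # estimated NoP, checked manually afterwards (step 5)
```

The same procedure, with NCP as the target, gives the estimate of the number of common projections.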
4.3 Analysis of query features
Table 2 summarizes the results of the feature extraction. Value distributions are shown in Figure 1.

Table 2: Range, average, standard deviation, median and two percentiles for query features on the SQLShare dataset

Feature  Min  Max   Avg    StdDev  Median  75pc  90pc
Intrinsic metrics
  NoP    1    509   9.36   22.51   5       10    18
  NoS    0    83    1.19   3.10    1       1     3
  NoA    0    49    0.41   2.02    0       0     1
  NoT    0    84    1.50   3.29    1       1     2
Relative metrics
  NCP    0    509   4.90   17.58   1       5     12
  NCS    0    82    0.59   1.96    0       1     2
  NCA    0    48    0.21   1.11    0       0     1
  NCT    0    83    0.85   2.05    1       1     2
  RED    0    1020  11.32  27.21   4       12    25
  JI     0    1     0.45   0.39    0.43    0.83  1
Contextual metrics
  NoQ    1    937   23.65  75.05   4       13.5  50

Figure 1: Value distribution of query features in the SQLShare dataset.

A first remark is that many queries have a high number of projections. Indeed, 38 queries (out of 10,668) project more than 100 expressions, while more than 21% project more than 10 expressions. The number of common projections is also high. The number of the other query fragments is less impressive: less than 1% of queries exceed 10 selections, 10 aggregations or 10 tables. Average and standard deviation confirm this disproportion, while median values show that most queries have few projections, selections, aggregations and tables.

Focusing on longer queries (with more query fragments), at the 90th percentile queries have 18 projections, 3 selections, 1 aggregation and 2 tables, while at the 75th percentile those values are 10, 1, 0 and 1 respectively. Indeed, as expected, there is a large number of short queries (having fewer fragments): 82% of queries have no aggregations and 44% have no selections, while 20% have a unique projection and 78% have a unique table. Interestingly, 6% of queries have no table in the FROM clause. An example of such queries is "SELECT 1+2".

Concerning common fragments between contiguous queries, almost half of the queries have 1 common projection and 1 common table but no common selections nor aggregations, while there is more sharing at the 75th percentile. The remaining two metrics, Relative Edit Distance and Jaccard Index, inform about the combination of such common fragments. Specifically, half of the queries differ in 4 or fewer fragments (RED=4) and have at least 43% of their fragments in common (JI=0.43) w.r.t. the previous query. Furthermore, only 29% of queries have nothing in common with the previous query in their sessions (JI=0).

The next section discusses how to take these features into account for fragmenting sessions.

5 SESSION SEGMENTATION
The previous section presented various statistics about queries in the SQLShare workload. A preliminary session segmentation (contiguous queries of a same user) resulted in some extremely long sessions (maximum of 937 queries) with 29% of queries having nothing in common with their immediate predecessor. In this section, we explore how to segment sessions in a smarter way.

Session segmentation has been previously studied for the SDSS workload [14]. In their study, the authors consider that a new session starts after 30 minutes of think-time (time spent between two queries). A similar problem was largely studied for the segmentation of web traces (see for example [18]), proposing the same 30-minute cutoff. Search engine providers, like MSN and Google, use similar heuristics. Contrarily to those works, the published SQLShare workload does not include query timestamps. We need to explore other heuristics for session segmentation.

Intuitively, our idea is to compare contiguous queries in a session and to segment when queries are dissimilar enough. Based on the query features described in the previous section, we investigate 5 similarity indexes.

Edit Index. It is based on the Relative Edit Distance (RED) query feature, normalizing it with respect to an arbitrary threshold, set to 10 operations.

EditIndex(qk, qk−1) = max{0, 1 − RED(qk, qk−1)/10}   (12)

Jaccard Index. It is the Jaccard Index defined in Equation 10, which is normalized by definition.

Cosine Index. It is calculated as the cosine of the vectors consisting of the first 8 query features, namely NoP, NoS, NoA, NoT, NCP, NCS, NCA and NCT. Let x = ⟨x1, ..., x8⟩ and y = ⟨y1, ..., y8⟩ be the vectors for queries qk and qk−1 respectively.

CosIndex(qk, qk−1) = Σi xi·yi / (√(Σi xi²) · √(Σi yi²))   (13)

Common Fragments Index. It is calculated as the number of common fragments, normalizing it with respect to an arbitrary threshold, set to 10 fragments.

CFIndex(qk, qk−1) = min{1, NCF/10}   (14)

where NCF = NCP(qk, qk−1) + NCS(qk, qk−1) + NCA(qk, qk−1) + NCT(qk, qk−1).

Common Tables Index. It is calculated as the number of common tables, normalized by the highest number of tables in the session.

CTIndex(qk, qk−1) = NCT(qk, qk−1) / max{NoT(q) | q ∈ session(qk)}   (15)

Note that these indexes capture complementary aspects of query similarity. Edit Index and Common Fragments Index count differences (resp. common fragments) as absolute values (normalized with a given threshold). Jaccard Index is a compromise between the previous ones, computing the ratio of common fragments. Cosine Index is computed using the features themselves (the values of the metrics) instead of comparing sets of fragments; it captures the variability in query complexity. Finally, Common Tables Index responds to the intuition that common tables have more impact than the other common fragments, and it is normalized with respect to the number of tables used in the user session.
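The sketch below shows one way the remaining indexes could be computed, reusing the helpers of Section 4 (the Jaccard Index of Equation 10 is the `jaccard` function already sketched there). It is illustrative only; in particular, the cosine index is written over pre-computed 8-feature vectors to avoid recomputing the relative features of qk−1.

```python
import math

def edit_index(q, prev, threshold=10):
    return max(0.0, 1.0 - red(q, prev) / threshold)          # Equation 12

def cosine_index(x, y):
    """Cosine of the 8-feature vectors (NoP, NoS, NoA, NoT, NCP, NCS, NCA, NCT), Equation 13."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def cf_index(q, prev, threshold=10):
    ncf = ncp(q, prev) + ncs(q, prev) + nca(q, prev) + nct(q, prev)
    return min(1.0, ncf / threshold)                          # Equation 14

def ct_index(q, prev, session):
    largest = max(len(query.tables) for query in session)     # Equation 15
    return nct(q, prev) / largest if largest else 0.0
```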
Figure 2 depicts the similarity metrics for 3 sessions of different sizes. Looking at Session 28, the shortest one, it seems quite clear that the session may be split in two parts, by cutting between queries 4 and 5. All similarity metrics agree. Things are less evident for Session 0. One split seems evident (at query 31), but some others may be discussed (e.g. at queries 28, 29 and 12). The decision depends on thresholds. Finally, Session 18 presents a first part, with a focused analysis via similar queries, and a second part, more exploratory, with varied queries. Even if the indexes do not always agree, their majority seems to indicate a tendency.

Figure 2: Comparison of similarity metrics for 3 sessions.

6 EXPERIMENTS
In this section we report the results of the experiments conducted to validate our segmentation approach. We first experiment with the SQLShare dataset. We then experiment with other datasets for which a ground truth is available, in order to better underline the effectiveness of our approach.

6.1 Experiments over SQLShare
We implemented a first heuristic based on these 5 similarity indexes, tuned with simple thresholds whose setting is discussed in the next subsection, taking the decision of segmenting or not using a majority vote. This simple heuristic allowed to split the initial 451 sessions into 2,960 segments (called explorations). In the absence of a ground truth, we present in Table 3 a comparison of session features before and after session segmentation. A first remark concerns session length: extremely large sessions (maximum of 936) were split (the new maximum is 78 queries). Indeed, more than half of the sessions were not fragmented, and at the 3rd quartile a session was split into 4 explorations. Some long and anarchic sessions (such as the one counting 936 queries) were split into a multitude of explorations. We can also highlight an increase in the average number of common query fragments (Avg NCF) per session, as well as in the average number of common projections, selections, aggregations and tables. This increase is quite regular and visible for all quartiles. Relative edit distance (RED) and Jaccard Index (JI) also improved, as expected.

Table 3: Comparison of average features per session before and after segmentation

                Before segmentation                       After segmentation
            Min   1st qu.  2nd qu.  3rd qu.  Max       Min   1st qu.  2nd qu.  3rd qu.  Max
Nb queries  1.00  2.00     6.00     19.75    936.00    1.00  1.00     3.00     6.00     78.00
Avg NCP     0.00  1.00     2.40     5.20     305.60    0.00  1.00     3.00     7.00     509.00
Avg NCS     0.00  0.00     0.17     0.72     7.31      0.00  0.00     0.00     1.00     82.00
Avg NCA     0.00  0.00     0.00     0.18     4.00      0.00  0.00     0.00     0.00     48.00
Avg NCT     0.00  0.66     0.97     1.00     4.00      0.00  0.80     1.00     1.00     83.00
Avg NCF     0.00  2.00     4.20     7.33     306.33    1.00  2.69     5.00     9.50     510.00
Avg RED     0.00  2.30     4.64     8.29     204.73    0.00  1.67     3.00     7.00     267.00
Avg JI      0.00  0.38     0.55     0.71     1.00      0.01  0.43     0.65     0.84     1.00
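A minimal sketch of this voting heuristic is given below, assuming the five index values of each query w.r.t. its predecessor are already computed; the names and the data layout are ours. The per-index thresholds are taken here as percentiles of the index distributions, anticipating the threshold setting discussed in Section 6.2.

```python
import numpy as np

def percentile_thresholds(index_distributions, k):
    """One threshold per similarity index: the value at the k-th percentile of its distribution."""
    return {name: float(np.percentile(values, k))
            for name, values in index_distributions.items()}

def segment(session, index_values, thresholds):
    """Split one session into explorations by majority vote over the five similarity indexes.

    `index_values[k]` maps each index name to its value for query k w.r.t. query k-1;
    an index votes for a break when its value falls below its threshold.
    """
    explorations, current = [], [session[0]]
    for k in range(1, len(session)):
        votes = sum(1 for name, value in index_values[k].items() if value < thresholds[name])
        if votes > len(thresholds) / 2:   # a majority of indexes finds the queries dissimilar
            explorations.append(current)
            current = []
        current.append(session[k])
    explorations.append(current)
    return explorations
```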
6.2 Experiments with ground truth
For this series of tests, we used workload datasets with a ground truth. We expect our approach to find a segmentation that is close to the ground truth. As one of the datasets also contains timestamps, we compare to the heuristic used in the literature, i.e., the 30-minute cutoff.

Datasets. We used three real workloads. Two of them are logs of multidimensional queries that we have used in our previous works [3]; the third one is a log of SQL queries used for clustering queries in [11]. We briefly present the datasets and refer the readers to the corresponding articles for more details.

The first dataset, named Open in what follows, consists of navigation traces collected in the context of a French project on energy vulnerability. These traces were produced by 8 volunteer students of a Master's degree in Business Intelligence, answering fuzzy information needs defined by their lecturer, to develop explorative OLAP navigations using Saiku² over three cube instances. The main cube is organized as a star schema with 19 dimensions, 68 (non-top) levels, 24 measures, and contains 37,149 facts recorded in the fact table. The other cubes are organized in a similar way. From this experiment, we reuse 16 sessions, representing 28 explorations and 941 queries. The ground truth is a manual segmentation made by the lecturer, based on some guessed notion of user goal and supported by timestamps. Notably, automatic segmentation was not the purpose of the work at the time the manual segmentation was done.

The second dataset, named Enterprise, consists of navigation traces of 14 volunteers among SAP employees, in the context of a previous study on discovering user intents [5]. We set 10 business needs, and volunteers were asked to analyze some of the 7 available data sources to answer each of the 10 business needs, using a SAP prototype that supports keyword-based BI queries³. In total, this dataset contains 24 sessions, corresponding to 104 user explorations and accounting for 525 queries. Volunteers were explicitly requested to express which information needs they were answering, which constitutes our ground truth for this dataset.

For these two datasets, the feature extraction was made as in [3], enriched with the detection of the features pertaining to tables (NoT and NCT).

The third dataset, named Exam, consists of queries gathered over 2 years from student exams at the University of Buffalo. Queries correspond to students' answers to 2 exam questions (one per year). A parser that allows extracting several query parts (including the ones we need) is also available for download. The parser also filters the queries that obtained low grades, proposing a total of 102 queries.

Notably, in the Open and Enterprise datasets, users did not have to write any SQL code, contrarily to SQLShare or Exam. Indeed, Saiku and the SAP prototype generated queries from the users' high-level operations. However, in both cases, users devised real explorations. Conversely, queries in the Exam dataset are hand-written, but each student devised only one query. In order to simulate explorations, we made the hypothesis that the answers to a same exam question should mimic a student's attempts to find the correct answer. To this end, we arranged together the queries that answer a same exam question, obtaining 2 explorations (the ground truth for this dataset).

² http://meteorite.bi/products/saiku
³ Patent Reference: 14/856,984: BI Query and Answering using full text search and keyword semantics

Dataset analysis and feature extraction. Table 4 compares the three datasets in terms of number of sessions, number of explorations (the ground truth) and number of queries, and summarizes the feature extraction. A first remark concerns the length of sessions. The Open dataset contains long sessions concerning few explorations, while the Enterprise dataset contains shorter sessions concerning more explorations. Session length is actually dependent on the GUI; while third-party OLAP tools, like Saiku, log a new query for each user action (including intermediate drag-and-drops), the SAP prototype only logs final queries. In the Exam dataset, sessions and explorations were simulated. Regarding features, queries in the three datasets concern a quite small number of projections, selections, aggregations and tables. Relative edit distance (RED) and Jaccard index (JI) illustrate that queries are more similar in the Enterprise dataset and highly dissimilar in the Exam dataset. The latter observation actually contradicts our hypothesis on the Exam dataset (stating that students' answers to a same exam question should be similar).

Table 4: Characteristics of the 3 datasets

                           Open          Enterprise    Exam
Nb of sessions             16            24            1
Nb of explorations         28            104           2
Nb of queries              941           525           102
Avg queries per session    58            21            102
Avg queries per explor.    34            5             51
Avg explor. per session    2             4             2
Avg and range of NoP       3.62 [1,7]    2.18 [0,6]    1 [0,3]
Avg and range of NoS       1.33 [0,21]   0.76 [0,3]    1.57 [0,4]
Avg and range of NoA       1.34 [1,4]    1.14 [0,5]    0.77 [0,2]
Avg and range of NoT       3.28 [1,7]    2.03 [1,4]    3.02 [1,6]
Avg and range of NCP       3.16 [0,7]    1.34 [0,4]    0.22 [0,1]
Avg and range of NCS       1.13 [0,21]   0.46 [0,3]    0.07 [0,1]
Avg and range of NCA       1.17 [0,4]    0.77 [0,3]    0.09 [0,1]
Avg and range of NCT       2.97 [0,7]    1.46 [0,4]    2.57 [0,6]
Avg and range of RED       3.85 [0,19]   2.09 [0,25]   6.78 [1,11]
Avg and range of JI        0.57 [0,1]    0.79 [0,1]    0.31 [0,0.86]

Threshold setting. As for the SQLShare dataset, we computed the 5 similarity metrics and used them (with the voting strategy) for segmenting sessions. We tested different thresholds for voting. We proceeded as follows: we calculated the distribution of values for each metric and we used as threshold the value at the k-th percentile, with k varying in {0, 5, 10, 15, 20, 25, 30}.

The thresholds that provided the best results were those at the 0th, 15th and 5th percentile for the Open, Enterprise and Exam datasets respectively. These thresholds reflect the relationship between the number of explorations to find and the number of queries, as well as the similarity among consecutive queries. Indeed, the Open dataset contains many queries and few explorations (i.e., few segments to find); small thresholds are best adapted. Conversely, the Enterprise dataset needs to be more segmented, as the average number of queries per exploration is low; higher thresholds do better. With the same reasoning, the 0th percentile should provide convenient thresholds for the Exam dataset. However, as queries inside an exploration are dissimilar, the only segmentation to find is not detected first (many intra-exploration breaks are proposed before); the 5th percentile provides better results.

We remark that more precise thresholds could be learned with supervised machine learning techniques (e.g. classification). We intentionally avoid this computation because in real applications (as SQLShare) we do not have a ground truth for learning and, when one exists, we risk over-fitting. An expert providing the ratio of queries per exploration (either intuitively or via preliminary tests) is more realistic. For the SQLShare dataset, we relied on statistics, like the number of queries per session (see Table 2) and the dataset description in [9], to set the thresholds as the values at the 30th percentile.

In the remaining tests, we use the values at the 0th, 15th and 5th percentile as thresholds for the Open, Enterprise and Exam datasets, respectively.

Segmentation quality. For each dataset, we compared the obtained segmentation to the ground truth, measuring segmentation quality in two complementary ways. The first one, more demanding, compares the positions of session breaks, indicating whether the segmentation and the ground truth coincide at the very same positions. To this end, we build binary vectors (a position corresponding to a query in the session), where a 1 means that a new segment starts at that position. We do the same for the ground truth and we compare both vectors in terms of accuracy, precision, recall and F-measure. The second way is the Adjusted Rand Index (ARI), a popular measure of clustering quality. It is based on the comparison of pairs of queries in the session, observing whether they belong to the same segment in our segmentation and to the same exploration in the ground truth.
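Both quality measures are standard and could be computed, for instance, with scikit-learn; the sketch below is an illustration under our own naming, not the evaluation code used to produce the reported figures.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, adjusted_rand_score)

def segmentation_quality(true_breaks, predicted_breaks, true_labels, predicted_labels):
    """Compare a segmentation to the ground truth.

    `true_breaks` / `predicted_breaks`: binary vectors with a 1 where a new exploration starts;
    `true_labels` / `predicted_labels`: the exploration identifier of each query, for the ARI.
    """
    return {
        "accuracy":  accuracy_score(true_breaks, predicted_breaks),
        "precision": precision_score(true_breaks, predicted_breaks),
        "recall":    recall_score(true_breaks, predicted_breaks),
        "f-measure": f1_score(true_breaks, predicted_breaks),
        "ARI":       adjusted_rand_score(true_labels, predicted_labels),
    }
```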
It is more similar in the Enterprise dataset and highly dissimilar in based on the comparison of pairs of queries in the session, ob- the Exam dataset. The latter observation actually contradicts our serving whether they belong to the same fragment and to the hypothesis on the Exam dataset (stating that students’ answers same exploration in the ground truth. to a same exam question should be similar). We report the results in Table 5. As expected, results are very good in terms of accuracy, mainly explained because classes are Threshold setting. As for the SQLShare dataset, we computed unbalanced (the number of no-break is higher than the number the 5 similarity metrics and used them (with voting strategy) for of break. Relative low values for F-measure while high values of segmenting sessions. We tested different thresholds for voting. ARI indicate that the exact break is not always found, while the We proceeded as follows: we calculated the distribution of values overall segmentation reminds good. The bad results for the Exams for each metric and we used as threshold the value at k-percentile, dataset are mainly explained by the dissimilarities among queries with k varying in {0, 5, 10, 15, 20, 25, 30}. of a same exploration (i.e. answers to a same exam question). The The thresholds that provided better results were those at 0-, conclusions about query similarity presented in [11] confirm this 15- and 5-percentile for the Open, Enterprise and Exam datasets fact. respectively. These thresholds reflect the relationship between Comparison to timestamp-based approach. In order to compare number of explorations to find and number of queries, as well our approach to the one used in the literature, we implemented as the similarity among consecutive queries. Indeed, the Open a second heuristic that segments sessions when there is a 30- dataset contains many queries and few explorations (i.e., a few minutes delay between queries. The Open dataset, the only one segments to find); small thresholds are best adapted. Conversely, containing timestamps, was used for this test. Results are re- the Enterprise dataset needs to be more segmented as the average ported in Table 5, the right-most column corresponding to the number of queries per exploration is low; higher thresholds do timestamp-based approach. They are comparable in terms of ac- better. With the same reasoning, 0-percentile should provide curacy and ARI but much lower in terms of f-measure. Note that convenient thresholds for the Exam dataset. However, as queries 1 for precision means that all breaks found are also breaks in inside an exploration are dissimilar, the only segmentation to find the ground truth. In other words, there are no big delays inside is not detected first (many intra-exploration breaks are proposed explorations, which makes sense. However, the timestamp-based before); 5-percentile provides better results. approach fails to detect 75% of the breaks (when the user changes We remark that more precise thresholds could be learned with its topic of study in a briefer delay). supervised machine learning techniques (e.g. classification). We intentionally avoid this computation because in real applications Analysis of similarity metrics. Finally, we investigated the qual- (as SQLShare) we do not have a ground truth for learning and, ity of the 5 proposed similarity metrics, by studying the corre- when one exists, we risk over-fitting. 
Table 5: Segmentation results for our approach on the 3 datasets and the timestamp-based approach (rightmost column)

            Open  Enterprise  Exams  Open (timestamp)
Accuracy    0.98  0.88        0.94   0.97
Precision   1     0.78        0.17   1
Recall      0.42  0.63        1      0.25
F-measure   0.42  0.48        0.17   0.25
ARI         0.75  0.77        0.54   0.75

Comparison to the timestamp-based approach. In order to compare our approach to the one used in the literature, we implemented a second heuristic that segments sessions when there is a 30-minute delay between queries. The Open dataset, the only one containing timestamps, was used for this test. Results are reported in Table 5, the rightmost column corresponding to the timestamp-based approach. They are comparable in terms of accuracy and ARI, but much lower in terms of F-measure. Note that a precision of 1 means that all breaks found are also breaks in the ground truth. In other words, there are no big delays inside explorations, which makes sense. However, the timestamp-based approach fails to detect 75% of the breaks (when the user changes their topic of study within a shorter delay).

Analysis of similarity metrics. Finally, we investigated the quality of the 5 proposed similarity metrics, by studying the correlation of their vote (when the metric value is lower than the corresponding threshold) with the breaks in the ground truth. Results are presented in Table 6.

Table 6: Correlation between similarity metrics and ground truth for the three datasets

               Open  Enterprise  Exams
Edit index     0.34  0.62        0.05
Jaccard index  0.86  0.73        0.04
Cos index      0.75  0.32        0.13
CF index       0.86  0.69        0.10
CT index       0.90  0.50        0.01

Jaccard and CF indexes are the most correlated in the Enterprise dataset. Both of them are highly correlated in the Open dataset, the CT index being even better. Edit and Cosine indexes are less correlated in one of the datasets. As expected, the correlation of all measures is low in the Exam dataset, where the segmentation is of bad quality. Interestingly, the most influencing metrics, the ones most correlated with the final vote (as shown in Table 7), are also the Jaccard, CF and CT indexes.

Table 7: Correlation between similarity metrics and final vote for the three datasets

               Open  Enterprise  Exams
Edit index     0.23  0.69        0.16
Jaccard index  1.00  0.99        0.66
Cos index      0.87  0.49        0.36
CF index       1.00  0.89        0.91
CT index       0.95  0.65        0.74
7 CONCLUSION
This paper discussed the problem of fragmenting sequences of SQL queries into meaningful explorations when only the query text is available and it is not possible to rely on query timestamps.

We characterized queries as a set of simple features and defined five similarity indexes with respect to the previous queries in the session. A simple heuristic, based on these similarity indexes, allowed to split long and heterogeneous sessions into smaller explorations where queries have more connections.

Even if our preliminary results are promising, further aspects should be investigated:
• Study further query features. We would like to test the common fragments of a query w.r.t. close queries in the session (not only the previous one). Comparing query results seems important, even if it is not possible for many queries (because the database instance is not part of the SQLShare workload).
• Study other similarity indexes and tune thresholds.
• Discard the preliminary hypothesis about the chronological ordering of queries and deal with query similarity beyond it.
• Test other ways of fragmenting, in particular via clustering methods.

Our long-term goal is to measure the quality of SQL explorations, allowing the detection of focused/exploratory zones, the discovery of latent user intents, and the recommendation of next queries, among other applications.

REFERENCES
[1] B. D. Bhattarai, M. Wong, and R. Singh. Discovering user information goals with semantic website media modeling. In MMM (1), volume 4351 of Lecture Notes in Computer Science, pages 364–375. Springer, 2007.
[2] S. Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 3–14, 2007.
[3] M. Djedaini, K. Drushku, N. Labroche, P. Marcel, V. Peralta, and W. Verdeaux. Automatic assessment of interactive OLAP explorations. To appear in Information Systems, 2019. https://doi.org/10.1016/j.is.2018.06.008.
[4] M. Djedaini, N. Labroche, P. Marcel, and V. Peralta. Detecting user focus in OLAP analyses. In Advances in Databases and Information Systems - 21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings, pages 105–119, 2017.
[5] K. Drushku, N. Labroche, P. Marcel, and V. Peralta. Interest-based recommendations for business intelligence users. To appear in Information Systems, 2019. https://doi.org/10.1016/j.is.2018.08.004.
[6] M. Eirinaki, S. Abraham, N. Polyzotis, and N. Shaikh. QueRIE: Collaborative database exploration. IEEE Trans. Knowl. Data Eng., 26(7):1778–1790, 2014.
[7] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., 2015.
[8] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 277–281, 2015.
[9] S. Jain, D. Moritz, D. Halperin, B. Howe, and E. Lazowska. SQLShare: Results from a multi-year SQL-as-a-service experiment. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 281–293, 2016.
[10] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22–33, 2010.
[11] G. Kul, D. T. A. Luong, T. Xie, V. Chandola, O. Kennedy, and S. J. Upadhyaya. Similarity metrics for SQL query clustering. IEEE Trans. Knowl. Data Eng., 30(12):2408–2420, 2018.
[12] H. V. Nguyen, K. Böhm, F. Becker, B. Goldman, G. Hinkel, and E. Müller. Identifying user interests within the data space - a case study with SkyServer. In EDBT, pages 641–652. OpenProceedings.org, 2015.
[13] O. Romero, P. Marcel, A. Abelló, V. Peralta, and L. Bellatreche. Describing analytical sessions using a multidimensional algebra. In Data Warehousing and Knowledge Discovery - 13th International Conference, DaWaK 2011, Toulouse, France, August 29 - September 2, 2011, Proceedings, pages 224–239, 2011.
[14] V. Singh, J. Gray, A. Thakar, A. S. Szalay, J. Raddick, B. Boroski, S. Lebedeva, and B. Yanny. SkyServer traffic report - the first five years. Technical report, December 2006.
[15] H. van den Brink, R. van der Leek, and J. Visser. Quality assessment for embedded SQL. In SCAM, pages 163–170. IEEE Computer Society, 2007.
[16] A. Vashistha and S. Jain. Measuring query complexity in SQLShare workload. https://uwescience.github.io/sqlshare/pdfs/Jain-Vashistha.pdf.
[17] R. W. White. Interactions with Search Systems. Cambridge University Press, 2016.
[18] M. Wong, B. Bhattarai, and R. Singh. Characterization and analysis of usage patterns in large multimedia websites. Technical report, 2006.