Qualitative Analysis of the SQLShare Workload for Session Segmentation

Veronika Peralta, Willeme Verdeaux, Yann Raimont, Patrick Marcel
University of Tours, Blois, France
veronika.peralta|patrick.marcel@univ-tours.fr, willeme.verdeaux|yann.raimont@etu.univ-tours.fr

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisboa, Portugal).

ABSTRACT
This paper presents an ongoing work aiming at better understanding the workload of SQLShare [9]. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [9], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries, and we show how to use these features to separate sequences of SQL queries into meaningful sessions. We ran a few tests over various query workloads to validate our approach empirically.

1 INTRODUCTION
Analyzing a database workload offers many practical interests, from the monitoring of database physical access structures [2] to the generation of user-tailored collaborative query recommendations for interactive exploration [6]. There has been much attention lately devoted to the analysis of users' past activities to support Interactive Database Exploration [8]. OLAP analysis of data cubes is a particular case of IDE that takes advantage of simple primitives like drill-down or slice-and-dice for the navigation of multidimensional data. These particularities enable the design of approaches for characterizing user explorations in how focused they are [4], in how contributive a query is to the exploration [3], or even in how to ensure that a sequence of analytical queries forms a coherent exploration [13].
Transposing these works to regular, non-multidimensional SQL workloads raises many challenges. Even if a sequence of SQL queries is issued to explore the database content, non-multidimensional relational schemas do not have the regularities one expects from the multidimensional model, explorations may not be expressed through roll-up or drill-down operations, SQL queries may deviate from the traditional star-join pattern commonly used for analytical purposes, etc.

In this paper, we present a preliminary approach for analyzing SQL workloads, concentrating on the SQLShare workload of hand-written¹ queries over user-uploaded datasets. This workload includes raw sequences of queries made by some users, without further information on their intention. One of our objectives is to investigate whether this workload contains actual exploration activities, and more particularly how to extract such explorations. In what follows, we consider that an exploration is a coherent sequence of hand-written queries that all share the same goal of fulfilling a user's information need that may not be well defined initially. Identifying such activities has several applications in supporting interactive database exploration (IDE), like understanding users' information needs, identifying struggling during the exploration, providing better query recommendations, etc. This is important since, usually, systems used for IDE do not offer such facilities.

To identify explorations from a SQL workload, we use a technique first proposed in [3] to score the quality of OLAP explorations. This technique consists of characterizing a query by a set of simple features that are intrinsic to a query or that relate the query to its neighbor in the sequence. While in [3] this technique of feature extraction was used with supervised machine learning to score the quality of OLAP explorations, in the present work we use these features to partition an SQL workload into coherent explorations.

The paper is organized as follows. The next section discusses related work. Section 3 presents our model of queries, tailored for SQL queries. Section 4 details the features considered and how they are extracted. Section 5 introduces our segmentation strategy and Section 6 reports the results of the tests we conducted. Section 7 concludes and draws perspectives.

¹ Consistently with the authors of [9], we use the term hand-written to mean, in this context, that the query is introduced manually by a human user, which reflects genuine interactive human activity over a dataset, with consideration between two consecutive queries.

2 RELATED WORK
In this section we present related work concerning real SQL workloads and workload analysis.

2.1 Real SQL workloads
SQLShare. The SQLShare workload is the result of a multi-year SQL-as-a-Service experiment [9], allowing any user with minimal database experience to upload their datasets on-line and manipulate them via SQL queries. What the authors wanted to prove with this experiment is that SQL is beneficial for data scientists. They observed that, most of the time, people use scripts to modify or visualize their datasets instead of using the SQL paradigm. Indeed, most user needs may be satisfied by first-order queries, which are much simpler than a script but have the initial cost of creating a schema, importing the data and so on. SQL-as-a-Service frees the user of all this prior work with a relaxed SQL version.

The SQLShare workload is composed of 11,137 SQL statements, 57 users and 3,336 user datasets. To the best of our knowledge, as reported by the authors of [9], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. As indicated in the introduction, hand-written means that the query is introduced manually by a human user, which reflects genuine interactive human activity over a dataset, with consideration between two consecutive queries.

The SQLShare workload is analyzed in [9], particularly to verify the following assumption:

"We hypothesized that SQLShare users would write queries that are more complex individually and more diverse as a set, making the corpus more useful for designing new systems."

The authors indeed showed empirically that the queries in the SQLShare workload are complex and diverse. They also analyzed the churn rate of SQLShare users and concluded that most users exhibit a behavior that suggests an exploratory workload.
To our knowledge, and again as reported by the authors of [9], this workload is one of the two workloads publicly available to the research community, the other being the SDSS workload.

SDSS workload. SkyServer is an Internet portal to the Sloan Digital Sky Survey Catalog Archive Server; its Web and SQL logs are public [14]. The SQL log was produced by a live SQL database supporting both ad-hoc hand-authored queries as well as queries generated from a point-and-click GUI. Many queries in the SDSS are actually not hand-written; they were generated by applications such as the Google Earth plugin or the query composer from the SkyServer website. Their cleaning and normalization took several months of effort.

Sessions in this log were detected using heuristics:

"We arbitrarily start a new session when the previous page view from that IP address is more than 30 minutes old, i.e., a think-time larger than 30 minutes starts a new session. [...] Wong and Singh [1] chose the same 30 minute cutoff and we are told that MSN and Google use a similar heuristic."

The authors of [14] also acknowledge the difficulty of extracting human sessions from all those collected:

"We failed to find clear ways to segment user populations. We were able to ignore the traffic that was administrative or was eye-candy, leaving us with a set of 65M page views and 16M SQL queries. We organized these requests into about 3M sessions, about half of which were from spiders. The residue of 1.5M sessions had 51M page views and 16M SQL queries – still a very substantial corpus. [...] Interactive human users were 51% of the sessions, 41% of the Web traffic and 10% of the SQL traffic. We cannot be sure of those numbers because we did not find a very reliable way of classifying bots vs mortals."

Bots are programs that automatically crawl the SDSS and launch SQL queries. Such traffic cannot be classified as proper interactive data exploration with human consideration.

In [9], the authors compared the SQLShare workload with that of the SDSS, and conclude:

"SQLShare queries on average tend to be more complex and more diverse than those of a conventional database workload generated from a comparable science domain: the Sloan Digital Sky Survey (SDSS)."

Smaller SQL datasets. We are aware of other available SQL workloads. For instance, Kul et al. [11] analyze three specific query sets. The first one, student assignments gathered by IIT Bombay, is made of a few hundred queries answering homework assignments. The second dataset, publicly available, consists of around 200 queries gathered over 2 years from student exams at the University of Buffalo. The third dataset consists of SQL logs that capture all database activities of 11 Android phones for a period of one month. The log consists of 1,352,202 SELECT statements that, being generated by an application, correspond to only 135 distinct query strings.

2.2 Workload analysis
Other scientific domains close to databases, like Information Retrieval or Web Search, have a long tradition of log analysis aiming at facilitating the searcher's task [17]. Many works extract features from queries or search sessions and use them to disambiguate the session's goal, to generate recommendations, to detect struggling in sessions, etc. Since databases tend to be more used in an exploratory or analysis fashion, as evidenced by the SQLShare workload, it is not a surprise that many recent works pay attention to the analysis of database workloads, in addition to those works analyzing workloads for optimization or self-tuning purposes. We present some recent advances in this area, differentiating by the type of logs (OLAP logs and SQL logs).

Analyzing and detecting OLAP explorations. Logs of OLAP analyses are simpler than SQL ones in the sense that they feature multidimensional queries that can easily be interpreted in terms of OLAP primitives (roll-up, drill-down, slice-and-dice, etc.). In one of our previous works [13], we proposed an approach for detecting OLAP analyses phrased in SQL, by converting SQL queries into OLAP queries and then checking if two consecutive queries are sufficiently close in terms of OLAP operations. In our more recent work, we used supervised learning to identify a set of query features allowing to characterize focus zones in OLAP explorations [4], or to identify queries that better contribute to an exploration [3]. The present work can be seen as a continuation of those previous works, since we have the same objective as [13] and use the same technique as [3]. The main differences with these previous works are that we make no assumption about the type of queries in the workload (particularly, they may not be multidimensional queries), and that we have no ground truth (i.e., no human manual inspection of each query) on the workload.

Analyzing SQL logs. SQL workload analysis has recently attracted attention beyond query optimization, for instance for query recommendation [6], query autocompletion [10], or user interest discovery [12]. All these works use the SDSS workload for their tests. Embedded SQL code is analyzed in [15] to measure its quality, mainly for maintainability purposes. The authors quantify quality based on the number of operators (joins, unions), operands (tables, subqueries) and variables in the SQL code, experimenting with SQL code embedded in PL/SQL, COBOL and Visual Basic. Jain et al. ran a number of tests on the SQLShare workload [9], some of them being reported above, showing the diversity and complexity of the workload. In [16], Vashistha and Jain analyze the complexity of queries in the SQLShare workload, in terms of the following query features: number of tables, number of columns, query length in characters, numbers of operators (Scan, Join, Filter), number of comparison operators (LE, LIKE, GT, OR, AND, Count), and the query run-time. They define two complexity metrics from these features: the Halstead measure (traditionally used to measure program complexity) and a linear combination whose weights are learned using regression. Finally, a recent work investigated various similarity metrics over SQL queries, aiming at clustering queries [11] for better workload understanding. The authors run their tests on the smaller SQL sets indicated above.

To our knowledge, our work is the first to propose an approach for segmenting hand-written SQL queries into meaningful sessions.
3 PRELIMINARIES
This section introduces the SQLShare workload and describes our hypotheses and preprocessing.

3.1 SQLShare workload preprocessing
From the 11,137 SQL statements we kept 10,668 corresponding to SELECT statements. The remaining statements (mainly updates, inserts and deletes) were filtered out.

We implemented a preliminary session segmentation following a simple heuristic: keeping together the sequences of consecutive queries of a same user. As a result of this initial segmentation we obtained 451 sessions, counting between 1 and 937 queries (an average of 23.65 queries per session, with a standard deviation of 75.05 queries). Furthermore, we made the initial hypothesis that queries appear in chronological order in the SQLShare workload. We noted that the queries of the workload do not come with timestamps, and we contacted the authors of the original SQLShare paper [9], who confirmed that the query order in the workload may not reflect the order in which queries were launched. Therefore, the disparate distribution of queries along sessions, in addition to some extremely long, possibly disordered sessions, calls for a smarter way of segmenting sessions.
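For illustration, a minimal sketch of this preliminary heuristic is given below. It assumes the released workload is available as a list of (user, query_text) records kept in file order; the function name and record layout are ours and not part of the SQLShare release.

```python
from itertools import groupby

def initial_sessions(workload):
    """Group consecutive queries of the same user into one session.

    `workload` is a list of (user, query_text) pairs, kept in the order in
    which they appear in the released dump (which, as noted above, may not
    be chronological).
    """
    return [[query for _, query in run]
            for _, run in groupby(workload, key=lambda record: record[0])]
```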
3.2 Query and session abstractions
In what follows, we use the term query to denote the text of an SQL SELECT statement. We represent a query as a collection of fragments extracted from the query text, namely projections, selections, aggregations and tables. These fragments abstract the most descriptive parts of a SQL query, and are the most used in the literature (see e.g., [6, 10]). Note that we do not restrict ourselves to SPJ (selection-projection-join) queries. Indeed, we consider all queries in the SQLShare workload, some of them containing arbitrarily complex chains of sub-queries.

Definition 3.1 (Query). A query over database schema DB is a quadruple q = ⟨P, S, A, T⟩ where:
(1) P is a set of expressions (attributes or calculated expressions) appearing in the main SELECT clause (i.e. the outermost projection). We deal with the * wild card by replacing it by the list of attributes it references.
(2) S is a set of atomic Boolean predicates whose combination (conjunction, disjunction, etc.) defines the WHERE and HAVING clauses appearing in the query. We considered indistinctly all predicates appearing in the outermost statement as well as in inner sub-queries.
(3) A is a set of aggregation expressions appearing in the main SELECT clause (i.e. the outermost projection). Including aggregation expressions appearing only in the HAVING clause is part of our future work.
(4) T is a set of tables appearing in FROM clauses (outermost statement and inner sub-queries). Views, sub-queries and other expressions appearing in FROM clauses are parsed in order to obtain the referenced tables.

Note that although we consider tables and selections occurring in inner sub-queries, we limit ourselves to the outermost query for projections and aggregations, as they correspond to the attributes actually visualized by the user. We intentionally remain independent of presentation and optimization aspects, especially the order in which attributes are projected (and visualized by the user), the order in which tables are joined, etc.

Finally, a session is a sequence of queries of a user over a given database.

Definition 3.2 (Session). Let DB be a database schema. A session s = ⟨q1, ..., qp⟩ over DB is a sequence of queries over DB. We note q ∈ s if a query q appears in the session s, and session(q) to refer to the session where q appears.
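As an illustration of Definitions 3.1 and 3.2, the sketch below shows one possible in-memory representation of the quadruple ⟨P, S, A, T⟩; the table and attribute names are invented for the example and do not come from the workload.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    """Abstraction of a SELECT statement as the quadruple <P, S, A, T>."""
    projections: frozenset   # P: expressions of the outermost SELECT clause
    selections: frozenset    # S: atomic Boolean predicates (outer and inner queries)
    aggregations: frozenset  # A: aggregation expressions of the outermost SELECT clause
    tables: frozenset        # T: tables referenced in any FROM clause

# A session is simply an ordered list of queries issued by one user.
q1 = Query(frozenset({"name", "salary"}), frozenset({"salary > 1000"}),
           frozenset(), frozenset({"employee"}))
q2 = Query(frozenset({"name", "AVG(salary)"}), frozenset({"salary > 1000"}),
           frozenset({"AVG(salary)"}), frozenset({"employee"}))
session = [q1, q2]
```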
4 FEATURE EXTRACTION
In this section, we define a set of features to quantitatively describe different aspects of a SQL query and its context. We then describe the extraction procedure and the obtained scores.

4.1 Feature description
For each query, we extract a set of simple features computed from the query text and its relationship with other queries in a session. We intend to cover various aspects of a query in order to support different types of analysis and modeling based on query features. The set of features is inspired from our previous work [3, 4], which models OLAP queries as a set of features capturing typical OLAP navigation.

From now on, to remove any ambiguity, we use the term metric to denote the functions that score query quality. The term feature is reserved to denote the score output by the function.

For the sake of presentation, we categorize metrics as follows: i) intrinsic metrics, i.e., only related to the query itself, ii) relative metrics, i.e., also related to the query's predecessor in the session, and iii) contextual metrics, i.e., related to the whole session, providing more context to the metrics. Table 1 presents an overview of the metrics.

Table 1: Metrics for SQL queries

Intrinsic metrics
  NoP  Number of projections (attributes and expressions)
  NoS  Number of selections (filtering predicates)
  NoA  Number of aggregations, when there is a group-by clause
  NoT  Number of tables
Relative metrics
  NCP  Number of common projections, with previous query
  NCS  Number of common selections, with previous query
  NCA  Number of common aggregations, with previous query
  NCT  Number of common tables, with previous query
  RED  Relative edit distance (effort to express a query starting from the previous one)
  JI   Jaccard index of common query parts, with previous query
Contextual metrics
  NoQ  Number of queries (absolute position in the session)

For all definitions given in this section, let qk = ⟨Pk, Sk, Ak, Tk⟩ be the query occurring at position k in the session s over the instance I of schema DB. All the queries we considered are supposed to be well formed, and so we do not deal with query errors.

For the moment, we only considered metrics based on the query text. Further metrics may be defined if the database instance is taken into account (for example, number of tuples in the query result, precision and recall of the query result w.r.t. the previous query, execution time, etc.). But the computation of such metrics implies the execution of every query in the SQLShare dataset, whose data is not always available for confidentiality reasons (i.e. users did not agree to share them), which would considerably reduce the set of queries. We leave such studies to future work.

4.1.1 Intrinsic metrics. Intrinsic metrics are those that can be computed only considering the query qk, independently of the session s and the other queries in s. In other words, these metrics will give the same score to qk, independently of s.

Number of Projections. NoP(qk) represents the number of projections (attributes and expressions) that are projected by the user. Expressions projected in inner sub-queries, but not projected by the outer query, are not considered as they do not appear in the query result.

NoP(qk) = card(Pk)   (1)

Number of Selections. NoS(qk) represents the number of selections (elementary Boolean predicates) that appear in the query text, both in outer and inner sub-queries.

NoS(qk) = card(Sk)   (2)

Number of Aggregations. NoA(qk) represents the number of aggregation expressions that are projected by the user. As for projections, expressions appearing in inner sub-queries are not considered.

NoA(qk) = card(Ak)   (3)

Number of Tables. NoT(qk) represents the number of tables appearing in the query text, considering both outer and inner sub-queries.

NoT(qk) = card(Tk)   (4)

4.1.2 Relative metrics. Relative metrics are those that are computed by comparing the query qk to the previous query in the session s. Let qk−1 = ⟨Pk−1, Sk−1, Ak−1, Tk−1⟩ be the previous query. Each metric provides a default score for the limit case of the first query of s (i.e., q1).

Number of Common Projections. NCP(qk, qk−1) counts the number of common projections of qk relative to qk−1. The default value is 0 for q1.

NCP(qk, qk−1) = card(Pk ∩ Pk−1) if k > 1, and 0 otherwise.   (5)

Number of Common Selections. NCS(qk, qk−1) counts the number of common selections of qk relative to qk−1. The default value is 0 for q1.

NCS(qk, qk−1) = card(Sk ∩ Sk−1) if k > 1, and 0 otherwise.   (6)

Number of Common Aggregations. NCA(qk, qk−1) counts the number of common aggregations of qk relative to qk−1. The default value is 0 for q1.

NCA(qk, qk−1) = card(Ak ∩ Ak−1) if k > 1, and 0 otherwise.   (7)

Number of Common Tables. NCT(qk, qk−1) counts the number of common tables of qk relative to qk−1. The default value is 0 for q1.

NCT(qk, qk−1) = card(Tk ∩ Tk−1) if k > 1, and 0 otherwise.   (8)

Relative Edit Distance. RED(qk, qk−1) represents the edit effort, for a user, to express the current query starting from the previous one. It is strongly related to query parts, and is computed as the minimum number of atomic operations between queries, considering the operations of adding/removing a projection, selection, aggregation or table. The considered cost for each observed difference (adding/removing) is the same.

RED(qk, qk−1) = card(Pk − Pk−1) + card(Pk−1 − Pk) + card(Sk − Sk−1) + card(Sk−1 − Sk) + card(Ak − Ak−1) + card(Ak−1 − Ak) + card(Tk − Tk−1) + card(Tk−1 − Tk)   (9)

For the limit case, we consider an empty query q0 = ⟨∅, ∅, ∅, ∅⟩. Then, RED(q1, q0) = card(P1) + card(S1) + card(A1) + card(T1).
Jaccard Index. JI(qk, qk−1) represents the ratio between the common query parts (projections, selections, aggregations and tables) and the union of query parts,

JI(qk, qk−1) = card(queryParts(qk) ∩ queryParts(qk−1)) / card(queryParts(qk) ∪ queryParts(qk−1))   (10)

where queryParts(qk) = Pk ∪ Sk ∪ Ak ∪ Tk. For the limit case, we consider an empty query q0 = ⟨∅, ∅, ∅, ∅⟩. Then, JI(q1, q0) = 0.

4.1.3 Contextual metrics. Contextual metrics are session dependent and make sense only in the context of a session. The same query qk occurring in different sessions may be given different scores for metrics in this category. For now we consider only one such metric.

Number of Queries. NoQ(qk, s) counts the absolute position of query qk in s.

NoQ(qk, s) = k   (11)
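To fix ideas, the sketch below computes the relative metrics of Equations 5–10 on the Query abstraction sketched in Section 3.2. It is an illustration under our own naming, not the extraction code actually used for the experiments (a C# script, see Section 4.2).

```python
def ncp(q, prev): return len(q.projections & prev.projections)    # Equation 5
def ncs(q, prev): return len(q.selections & prev.selections)      # Equation 6
def nca(q, prev): return len(q.aggregations & prev.aggregations)  # Equation 7
def nct(q, prev): return len(q.tables & prev.tables)              # Equation 8

def red(q, prev):
    """Relative edit distance (Equation 9): size of the symmetric difference of each fragment set."""
    return sum(len(a ^ b) for a, b in ((q.projections, prev.projections),
                                       (q.selections, prev.selections),
                                       (q.aggregations, prev.aggregations),
                                       (q.tables, prev.tables)))

def jaccard(q, prev):
    """Jaccard index (Equation 10); returns 0 when both queries are empty (limit case)."""
    parts_q = q.projections | q.selections | q.aggregations | q.tables
    parts_prev = prev.projections | prev.selections | prev.aggregations | prev.tables
    union = parts_q | parts_prev
    return len(parts_q & parts_prev) / len(union) if union else 0.0
```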
4.2 Extraction protocol
This section briefly describes the procedure for extracting query features. We proceed in 3 steps:
(1) Filter the SQL workload in order to keep only SELECT statements, i.e. discard updates, inserts, etc.
(2) Extract the query fragments (sets of projections, selections, aggregations and tables) of each query.
(3) Compute the query features from the query fragments.

For the first and second steps we developed a C# script using the MSDN TSQL Parser. Removing all the non-SELECT statements was straightforward. However, extracting query fragments required dealing with several particular cases. The major challenge was extracting projections. First, to get the projections visualized by the user, we need to detect the outermost SELECT clause of a query and then extract its SELECT elements. When there is a star element (i.e. SELECT * FROM T) in the query, our script reaches the outermost FROM clause (looking for T). When T is a table, the script accesses the schema metadata and obtains all the table attributes. If there are one or more sub-queries in the FROM clause, it repeats the previously described pattern until it finds either a table or a set of SELECT elements. Views and WITH clauses were treated in a similar way. For now, we do not take into account queries having more than one '*' in their SELECT elements (i.e. SELECT t1.*, t2.*, a, b FROM t1, t2).

Note that this procedure for resolving star elements relies on the existence of schema metadata, i.e., on having access to the corresponding datasets in order to obtain the list of attributes. However, some datasets are referenced in queries but are not present in the SQLShare released data because the user decided not to share them. For such queries (near 18% of the query workload), we could not resolve the list of projections and we needed to estimate the number of projections during the third step (computing features).
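The star-resolution logic can be summarized as a recursive walk over the FROM clause. The sketch below is illustrative only and operates on a toy, already-parsed representation rather than on the MSDN TSQL Parser's syntax tree; `schema` is a hypothetical dictionary mapping table names to their attribute lists.

```python
def resolve_star(from_item, schema):
    """Return the attributes exposed by a FROM item when the outer query selects '*'.

    `from_item` is either a table name (str) or a nested dict
    {"select": [...], "from": <from_item>} standing for a sub-query.
    """
    if isinstance(from_item, str):                   # a base table: read the schema metadata
        return list(schema.get(from_item, []))       # unknown table -> unresolved (empty list)
    if from_item["select"] == ["*"]:                 # the sub-query also selects '*': recurse
        return resolve_star(from_item["from"], schema)
    return from_item["select"]                       # otherwise its SELECT elements are the projections

schema = {"T": ["a", "b", "c"]}
print(resolve_star("T", schema))                              # SELECT * FROM T -> ['a', 'b', 'c']
print(resolve_star({"select": ["*"], "from": "T"}, schema))   # SELECT * FROM (SELECT * FROM T) x
```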
The aggregations were obtained with a parser function that detects all the function calls within a query. The selections are all the atomic Boolean expressions contained in the queries and their sub-queries. At this stage of our work we do not deal with predicate containment.

The third step computes the query features as described in Equations 1 to 11. For the unresolved star elements, we estimated both the number of projections and the number of common projections, by taking advantage of the other queries in the exploration that list attributes of the "starred" tables. We used linear regression to estimate the remaining cases, with the AUTO-SKLEARN Python module [7], which aims at automatically choosing and parametrizing a machine learning algorithm for a given dataset, at a given cost (i.e., the time it takes to test different algorithms).

More precisely, we computed two distinct regressions, one for NoP and another one for NCP. The methodology is the same for both and consists of the following:
(1) Each query is represented by 23 features: NoS, NoA, NoT, NCS, NCA, NCT, NoQ (neither NoP nor NCP, since they are the targets of the regression), plus the min, max, average and standard deviation of NoP, NoS, NoA and NoT, grouped by the exploration the query belongs to.
(2) The queries where NoP=0 (and consequently NCP=0) are removed from the data set.
(3) The data set is then split 80/20 for cross-validation, and the AUTO-SKLEARN regression mode is used to fit the best regression. For the first regression, NoP is the target; for the second, NCP is the target. The maximum time to find a model is set to 180 seconds and the score used to measure the accuracy of the regression is the R² (coefficient of determination) regression score function.
(4) The fitted regression (assessed by its R² score) is used to predict NoP (respectively NCP) for the queries removed at step 2.
(5) Degenerate cases (e.g., a predicted number of common projections that is greater than the predicted number of projections) are handled manually.
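A minimal sketch of this estimation step is given below, assuming the 23-feature matrix has already been built as NumPy arrays; the function name and the data layout are ours, and only the 180-second budget and the 80/20 split come from the protocol above.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from autosklearn.regression import AutoSklearnRegressor

def estimate_nop(features, nop, unresolved):
    """Fit a regression on queries with a resolved NoP, then predict NoP for unresolved stars.

    `features`: (n_queries, 23) array built as in step (1);
    `nop`: known number of projections (meaningless where unresolved);
    `unresolved`: boolean mask flagging the queries with unresolved star elements.
    """
    X_known, y_known = features[~unresolved], nop[~unresolved]
    X_train, X_test, y_train, y_test = train_test_split(X_known, y_known, test_size=0.2)

    model = AutoSklearnRegressor(time_left_for_this_task=180)   # 180 s budget, as in step (3)
    model.fit(X_train, y_train)
    print("R2 on the held-out 20%:", r2_score(y_test, model.predict(X_test)))

    return model.predict(features[unresolved])   # estimated NoP, checked manually afterwards (step 5)
```

The same procedure, with NCP as the target, gives the estimate of the number of common projections.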
4.3 Analysis of query features
Table 2 summarizes the results of the feature extraction. Value distributions are shown in Figure 1.

Table 2: Range, average, standard deviation, median and two percentiles for query features on the SQLShare dataset

Feature  Min  Max   Avg    StdDev  Median  75pc  90pc
Intrinsic metrics
  NoP    1    509   9.36   22.51   5       10    18
  NoS    0    83    1.19   3.10    1       1     3
  NoA    0    49    0.41   2.02    0       0     1
  NoT    0    84    1.50   3.29    1       1     2
Relative metrics
  NCP    0    509   4.90   17.58   1       5     12
  NCS    0    82    0.59   1.96    0       1     2
  NCA    0    48    0.21   1.11    0       0     1
  NCT    0    83    0.85   2.05    1       1     2
  RED    0    1020  11.32  27.21   4       12    25
  JI     0    1     0.45   0.39    0.43    0.83  1
Contextual metrics
  NoQ    1    937   23.65  75.05   4       13.5  50

Figure 1: Value distribution of query features in the SQLShare dataset.

A first remark is that many queries have a high number of projections. Indeed, 38 queries (out of 10,668) project more than 100 expressions, while more than 21% project more than 10 expressions. The number of common projections is also high. The number of the other query fragments is less impressive: less than 1% of queries exceed 10 selections, 10 aggregations or 10 tables. Average and standard deviation confirm this disproportion, while median values show that most queries have few projections, selections, aggregations and tables.

Focusing on longer queries (with more query fragments), at the 90th percentile queries have 18 projections, 3 selections, 1 aggregation and 2 tables, while at the 75th percentile those values are 10, 1, 0 and 1 respectively. Indeed, as expected, there is a large number of short queries (having fewer fragments): 82% of queries have no aggregations and 44% have no selections, while 20% have a unique projection and 78% have a unique table. Interestingly, 6% of queries have no table in the FROM clause. An example of such queries is "SELECT 1+2".

Concerning common fragments between contiguous queries, almost half of the queries have 1 common projection and 1 common table but no common selections nor aggregations, while there is more sharing at the 75th percentile. The remaining two metrics, Relative Edit Distance and Jaccard Index, inform about the combination of such common fragments. Specifically, half of the queries differ in 4 or fewer fragments (RED=4) and have at least 43% of their fragments in common (JI=0.43) w.r.t. the previous query. Furthermore, only 29% of queries have nothing in common with the previous query in their sessions (JI=0).

The next section discusses how to take these features into account for fragmenting sessions.

5 SESSION SEGMENTATION
The previous section presented various statistics about queries in the SQLShare workload. A preliminary session segmentation (contiguous queries of a same user) resulted in some extremely long sessions (maximum of 937 queries) with 29% of queries having nothing in common with their immediate predecessor. In this section, we explore how to segment sessions in a smarter way.

Session segmentation has been previously studied for the SDSS workload [14]. In their study, the authors consider that a new session starts after 30 minutes of think-time (time spent between two queries). A similar problem was largely studied for the segmentation of web traces (see for example [18]), proposing the same 30-minute cutoff. Search engine providers, like MSN and Google, use similar heuristics. Contrarily to those works, the published SQLShare workload does not include query timestamps. We need to explore other heuristics for session segmentation.

Intuitively, our idea is to compare contiguous queries in a session and to segment when queries are dissimilar enough. Based on the query features described in the previous section, we investigate 5 similarity indexes.

Edit Index. It is based on the Relative Edit Distance (RED) query feature, normalizing it with respect to an arbitrary threshold, set to 10 operations.

EditIndex(qk, qk−1) = max{0, 1 − RED(qk, qk−1)/10}   (12)

Jaccard Index. It is the Jaccard Index defined in Equation 10, which is normalized by definition.

Cosine Index. It is calculated as the cosine of the vectors consisting of the first 8 query features, namely NoP, NoS, NoA, NoT, NCP, NCS, NCA and NCT. Let x = ⟨x1, ..., x8⟩ and y = ⟨y1, ..., y8⟩ be the vectors for queries qk and qk−1 respectively.

CosIndex(qk, qk−1) = Σi xi·yi / (√(Σi xi²) · √(Σi yi²))   (13)

Common Fragments Index. It is calculated as the number of common fragments, normalizing it with respect to an arbitrary threshold, set to 10 fragments.

CFIndex(qk, qk−1) = min{1, NCF/10}   (14)

where NCF = NCP(qk, qk−1) + NCS(qk, qk−1) + NCA(qk, qk−1) + NCT(qk, qk−1).

Common Tables Index. It is calculated as the number of common tables, normalized by the highest number of tables in the session.

CTIndex(qk, qk−1) = NCT(qk, qk−1) / max{NoT(q) | q ∈ session(qk)}   (15)

Note that these indexes capture complementary aspects of query similarity. Edit Index and Common Fragments Index count differences (resp. common fragments) as absolute values (normalized with a given threshold). Jaccard Index is a compromise between the previous ones, computing the ratio of common fragments. Cosine Index is computed using the features themselves (the values of the metrics) instead of comparing sets of fragments; it captures the variability in query complexity. Finally, Common Tables Index responds to the intuition that common tables have more impact than the other common fragments, and it is normalized with respect to the number of tables used in the user session.
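The sketch below shows one way the remaining indexes could be computed, reusing the helpers of Section 4 (the Jaccard Index of Equation 10 is the `jaccard` function already sketched there). It is illustrative only; in particular, the cosine index is written over pre-computed 8-feature vectors to avoid recomputing the relative features of qk−1.

```python
import math

def edit_index(q, prev, threshold=10):
    return max(0.0, 1.0 - red(q, prev) / threshold)          # Equation 12

def cosine_index(x, y):
    """Cosine of the 8-feature vectors (NoP, NoS, NoA, NoT, NCP, NCS, NCA, NCT), Equation 13."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def cf_index(q, prev, threshold=10):
    ncf = ncp(q, prev) + ncs(q, prev) + nca(q, prev) + nct(q, prev)
    return min(1.0, ncf / threshold)                          # Equation 14

def ct_index(q, prev, session):
    largest = max(len(query.tables) for query in session)     # Equation 15
    return nct(q, prev) / largest if largest else 0.0
```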
Figure 2 depicts the similarity metrics for 3 sessions of different sizes. Looking at Session 28, the shortest one, it seems quite clear that the session may be split in two parts, by cutting between queries 4 and 5. All similarity metrics agree. Things are less evident for Session 0. One split seems evident (at query 31), but some others may be discussed (e.g. at queries 28, 29 and 12). The decision depends on thresholds. Finally, Session 18 presents a first part, with a focused analysis via similar queries, and a second part, more exploratory, with varied queries. Even if the indexes do not always agree, their majority seems to indicate a tendency.

Figure 2: Comparison of similarity metrics for 3 sessions.

6 EXPERIMENTS
In this section we report the results of the experiments conducted to validate our segmentation approach. We first experiment with the SQLShare dataset. We then experiment with other datasets for which a ground truth is available, in order to better underline the effectiveness of our approach.

6.1 Experiments over SQLShare
We implemented a first heuristic based on these 5 similarity indexes, tuned with simple thresholds whose setting is discussed in the next subsection, taking the decision of segmenting or not using a majority vote. This simple heuristic allowed to split the initial 451 sessions into 2,960 segments (called explorations). In the absence of a ground truth, we present in Table 3 a comparison of session features before and after session segmentation. A first remark concerns session length: extremely large sessions (maximum of 936) were split (the new maximum is 78 queries). Indeed, more than half of the sessions were not fragmented, and at the 3rd quartile a session was split into 4 explorations. Some long and anarchic sessions (such as the one counting 936 queries) were split into a multitude of explorations. We can also highlight an increase in the average number of common query fragments (Avg NCF) per session, as well as in the average number of common projections, selections, aggregations and tables. This increase is quite regular and visible for all quartiles. Relative edit distance (RED) and Jaccard Index (JI) also improved, as expected.

Table 3: Comparison of average features per session before and after segmentation

                Before segmentation                       After segmentation
            Min   1st qu.  2nd qu.  3rd qu.  Max       Min   1st qu.  2nd qu.  3rd qu.  Max
Nb queries  1.00  2.00     6.00     19.75    936.00    1.00  1.00     3.00     6.00     78.00
Avg NCP     0.00  1.00     2.40     5.20     305.60    0.00  1.00     3.00     7.00     509.00
Avg NCS     0.00  0.00     0.17     0.72     7.31      0.00  0.00     0.00     1.00     82.00
Avg NCA     0.00  0.00     0.00     0.18     4.00      0.00  0.00     0.00     0.00     48.00
Avg NCT     0.00  0.66     0.97     1.00     4.00      0.00  0.80     1.00     1.00     83.00
Avg NCF     0.00  2.00     4.20     7.33     306.33    1.00  2.69     5.00     9.50     510.00
Avg RED     0.00  2.30     4.64     8.29     204.73    0.00  1.67     3.00     7.00     267.00
Avg JI      0.00  0.38     0.55     0.71     1.00      0.01  0.43     0.65     0.84     1.00
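A minimal sketch of this voting heuristic is given below, assuming the five index values of each query w.r.t. its predecessor are already computed; the names and the data layout are ours. The per-index thresholds are taken here as percentiles of the index distributions, anticipating the threshold setting discussed in Section 6.2.

```python
import numpy as np

def percentile_thresholds(index_distributions, k):
    """One threshold per similarity index: the value at the k-th percentile of its distribution."""
    return {name: float(np.percentile(values, k))
            for name, values in index_distributions.items()}

def segment(session, index_values, thresholds):
    """Split one session into explorations by majority vote over the five similarity indexes.

    `index_values[k]` maps each index name to its value for query k w.r.t. query k-1;
    an index votes for a break when its value falls below its threshold.
    """
    explorations, current = [], [session[0]]
    for k in range(1, len(session)):
        votes = sum(1 for name, value in index_values[k].items() if value < thresholds[name])
        if votes > len(thresholds) / 2:   # a majority of indexes finds the queries dissimilar
            explorations.append(current)
            current = []
        current.append(session[k])
    explorations.append(current)
    return explorations
```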
6.2 Experiments with ground truth
For this series of tests, we used workload datasets with a ground truth. We expect our approach to find a segmentation that is close to the ground truth. As one of the datasets also contains timestamps, we compare to the heuristic used in the literature, i.e., the 30-minute cutoff.

Datasets. We used three real workloads. Two of them are logs of multidimensional queries that we have used in our previous works [3]; the third one is a log of SQL queries used for clustering queries in [11]. We briefly present the datasets and refer the readers to the corresponding articles for more details.

The first dataset, named Open in what follows, consists of navigation traces collected in the context of a French project on energy vulnerability. These traces were produced by 8 volunteer students of a Master's degree in Business Intelligence, answering fuzzy information needs defined by their lecturer, to develop explorative OLAP navigations using Saiku² over three cube instances. The main cube is organized as a star schema with 19 dimensions, 68 (non-top) levels, 24 measures, and contains 37,149 facts recorded in the fact table. The other cubes are organized in a similar way. From this experiment, we reuse 16 sessions, representing 28 explorations and 941 queries. The ground truth is a manual segmentation made by the lecturer, based on some guessed notion of user goal and supported by timestamps. Notably, automatic segmentation was not the purpose of the work at the time the manual segmentation was done.

The second dataset, named Enterprise, consists of navigation traces of 14 volunteers among SAP employees, in the context of a previous study on discovering user intents [5]. We set 10 business needs, and volunteers were asked to analyze some of the 7 available data sources to answer each of the 10 business needs, using a SAP prototype that supports keyword-based BI queries³. In total, this dataset contains 24 sessions, corresponding to 104 user explorations and accounting for 525 queries. Volunteers were explicitly requested to express which information needs they were answering, which constitutes our ground truth for this dataset.

For these two datasets, the feature extraction was made as in [3], enriched with the detection of the features pertaining to tables (NoT and NCT).

The third dataset, named Exam, consists of queries gathered over 2 years from student exams at the University of Buffalo. Queries correspond to students' answers to 2 exam questions (one per year). A parser that allows extracting several query parts (including the ones we need) is also available for download. The parser also filters the queries that obtained low grades, proposing a total of 102 queries.

Notably, in the Open and Enterprise datasets, users did not have to write any SQL code, contrarily to SQLShare or Exam. Indeed, Saiku and the SAP prototype generated queries from the users' high-level operations. However, in both cases, users devised real explorations. Conversely, queries in the Exam dataset are hand-written, but each student devised only one query. In order to simulate explorations, we made the hypothesis that the answers to a same exam question should mimic a student's attempts to find the correct answer. To this end, we arranged together the queries that answer a same exam question, obtaining 2 explorations (the ground truth for this dataset).

² http://meteorite.bi/products/saiku
³ Patent Reference: 14/856,984: BI Query and Answering using full text search and keyword semantics

Dataset analysis and feature extraction. Table 4 compares the three datasets in terms of number of sessions, number of explorations (the ground truth) and number of queries, and summarizes the feature extraction. A first remark concerns the length of sessions. The Open dataset contains long sessions concerning few explorations, while the Enterprise dataset contains shorter sessions concerning more explorations. Session length is actually dependent on the GUI; while third-party OLAP tools, like Saiku, log a new query for each user action (including intermediate drag-and-drops), the SAP prototype only logs final queries. In the Exam dataset, sessions and explorations were simulated. Regarding features, queries in the three datasets concern a quite small number of projections, selections, aggregations and tables. Relative edit distance (RED) and Jaccard index (JI) illustrate that queries are more similar in the Enterprise dataset and highly dissimilar in the Exam dataset. The latter observation actually contradicts our hypothesis on the Exam dataset (stating that students' answers to a same exam question should be similar).

Table 4: Characteristics of the 3 datasets

                           Open          Enterprise    Exam
Nb of sessions             16            24            1
Nb of explorations         28            104           2
Nb of queries              941           525           102
Avg queries per session    58            21            102
Avg queries per explor.    34            5             51
Avg explor. per session    2             4             2
Avg and range of NoP       3.62 [1,7]    2.18 [0,6]    1 [0,3]
Avg and range of NoS       1.33 [0,21]   0.76 [0,3]    1.57 [0,4]
Avg and range of NoA       1.34 [1,4]    1.14 [0,5]    0.77 [0,2]
Avg and range of NoT       3.28 [1,7]    2.03 [1,4]    3.02 [1,6]
Avg and range of NCP       3.16 [0,7]    1.34 [0,4]    0.22 [0,1]
Avg and range of NCS       1.13 [0,21]   0.46 [0,3]    0.07 [0,1]
Avg and range of NCA       1.17 [0,4]    0.77 [0,3]    0.09 [0,1]
Avg and range of NCT       2.97 [0,7]    1.46 [0,4]    2.57 [0,6]
Avg and range of RED       3.85 [0,19]   2.09 [0,25]   6.78 [1,11]
Avg and range of JI        0.57 [0,1]    0.79 [0,1]    0.31 [0,0.86]

Threshold setting. As for the SQLShare dataset, we computed the 5 similarity metrics and used them (with the voting strategy) for segmenting sessions. We tested different thresholds for voting. We proceeded as follows: we calculated the distribution of values for each metric and we used as threshold the value at the k-th percentile, with k varying in {0, 5, 10, 15, 20, 25, 30}.

The thresholds that provided the best results were those at the 0th, 15th and 5th percentile for the Open, Enterprise and Exam datasets respectively. These thresholds reflect the relationship between the number of explorations to find and the number of queries, as well as the similarity among consecutive queries. Indeed, the Open dataset contains many queries and few explorations (i.e., few segments to find); small thresholds are best adapted. Conversely, the Enterprise dataset needs to be more segmented, as the average number of queries per exploration is low; higher thresholds do better. With the same reasoning, the 0th percentile should provide convenient thresholds for the Exam dataset. However, as queries inside an exploration are dissimilar, the only segmentation to find is not detected first (many intra-exploration breaks are proposed before); the 5th percentile provides better results.

We remark that more precise thresholds could be learned with supervised machine learning techniques (e.g. classification). We intentionally avoid this computation because in real applications (as SQLShare) we do not have a ground truth for learning and, when one exists, we risk over-fitting. An expert providing the ratio of queries per exploration (either intuitively or via preliminary tests) is more realistic. For the SQLShare dataset, we relied on statistics, like the number of queries per session (see Table 2) and the dataset description in [9], to set the thresholds as the values at the 30th percentile.

In the remaining tests, we use the values at the 0th, 15th and 5th percentile as thresholds for the Open, Enterprise and Exam datasets, respectively.

Segmentation quality. For each dataset, we compared the obtained segmentation to the ground truth, measuring segmentation quality in two complementary ways. The first one, more demanding, compares the positions of session breaks, indicating whether the segmentation and the ground truth coincide at the very same positions. To this end, we build binary vectors (a position corresponding to a query in the session), where a 1 means that a new segment starts at that position. We do the same for the ground truth and we compare both vectors in terms of accuracy, precision, recall and F-measure. The second way is the Adjusted Rand Index (ARI), a popular measure of clustering quality. It is based on the comparison of pairs of queries in the session, observing whether they belong to the same segment in our segmentation and to the same exploration in the ground truth.
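Both quality measures are standard and could be computed, for instance, with scikit-learn; the sketch below is an illustration under our own naming, not the evaluation code used to produce the reported figures.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, adjusted_rand_score)

def segmentation_quality(true_breaks, predicted_breaks, true_labels, predicted_labels):
    """Compare a segmentation to the ground truth.

    `true_breaks` / `predicted_breaks`: binary vectors with a 1 where a new exploration starts;
    `true_labels` / `predicted_labels`: the exploration identifier of each query, for the ARI.
    """
    return {
        "accuracy":  accuracy_score(true_breaks, predicted_breaks),
        "precision": precision_score(true_breaks, predicted_breaks),
        "recall":    recall_score(true_breaks, predicted_breaks),
        "f-measure": f1_score(true_breaks, predicted_breaks),
        "ARI":       adjusted_rand_score(true_labels, predicted_labels),
    }
```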
It is more similar in the Enterprise dataset and highly dissimilar in based on the comparison of pairs of queries in the session, ob- the Exam dataset. The latter observation actually contradicts our serving whether they belong to the same fragment and to the hypothesis on the Exam dataset (stating that students’ answers same exploration in the ground truth. to a same exam question should be similar). We report the results in Table 5. As expected, results are very good in terms of accuracy, mainly explained because classes are Threshold setting. As for the SQLShare dataset, we computed unbalanced (the number of no-break is higher than the number the 5 similarity metrics and used them (with voting strategy) for of break. Relative low values for F-measure while high values of segmenting sessions. We tested different thresholds for voting. ARI indicate that the exact break is not always found, while the We proceeded as follows: we calculated the distribution of values overall segmentation reminds good. The bad results for the Exams for each metric and we used as threshold the value at k-percentile, dataset are mainly explained by the dissimilarities among queries with k varying in {0, 5, 10, 15, 20, 25, 30}. of a same exploration (i.e. answers to a same exam question). The The thresholds that provided better results were those at 0-, conclusions about query similarity presented in [11] confirm this 15- and 5-percentile for the Open, Enterprise and Exam datasets fact. respectively. These thresholds reflect the relationship between Comparison to timestamp-based approach. In order to compare number of explorations to find and number of queries, as well our approach to the one used in the literature, we implemented as the similarity among consecutive queries. Indeed, the Open a second heuristic that segments sessions when there is a 30- dataset contains many queries and few explorations (i.e., a few minutes delay between queries. The Open dataset, the only one segments to find); small thresholds are best adapted. Conversely, containing timestamps, was used for this test. Results are re- the Enterprise dataset needs to be more segmented as the average ported in Table 5, the right-most column corresponding to the number of queries per exploration is low; higher thresholds do timestamp-based approach. They are comparable in terms of ac- better. With the same reasoning, 0-percentile should provide curacy and ARI but much lower in terms of f-measure. Note that convenient thresholds for the Exam dataset. However, as queries 1 for precision means that all breaks found are also breaks in inside an exploration are dissimilar, the only segmentation to find the ground truth. In other words, there are no big delays inside is not detected first (many intra-exploration breaks are proposed explorations, which makes sense. However, the timestamp-based before); 5-percentile provides better results. approach fails to detect 75% of the breaks (when the user changes We remark that more precise thresholds could be learned with its topic of study in a briefer delay). supervised machine learning techniques (e.g. classification). We intentionally avoid this computation because in real applications Analysis of similarity metrics. Finally, we investigated the qual- (as SQLShare) we do not have a ground truth for learning and, ity of the 5 proposed similarity metrics, by studying the corre- when one exists, we risk over-fitting. 
Table 5: Segmentation results for our approach on the 3 datasets and the timestamp-based approach (rightmost column)

            Open  Enterprise  Exams  Open (timestamp)
Accuracy    0.98  0.88        0.94   0.97
Precision   1     0.78        0.17   1
Recall      0.42  0.63        1      0.25
F-measure   0.42  0.48        0.17   0.25
ARI         0.75  0.77        0.54   0.75

Comparison to the timestamp-based approach. In order to compare our approach to the one used in the literature, we implemented a second heuristic that segments sessions when there is a 30-minute delay between queries. The Open dataset, the only one containing timestamps, was used for this test. Results are reported in Table 5, the rightmost column corresponding to the timestamp-based approach. They are comparable in terms of accuracy and ARI, but much lower in terms of F-measure. Note that a precision of 1 means that all breaks found are also breaks in the ground truth. In other words, there are no big delays inside explorations, which makes sense. However, the timestamp-based approach fails to detect 75% of the breaks (when the user changes their topic of study within a shorter delay).

Analysis of similarity metrics. Finally, we investigated the quality of the 5 proposed similarity metrics, by studying the correlation of their vote (when the metric value is lower than the corresponding threshold) with the breaks in the ground truth. Results are presented in Table 6.

Table 6: Correlation between similarity metrics and ground truth for the three datasets

               Open  Enterprise  Exams
Edit index     0.34  0.62        0.05
Jaccard index  0.86  0.73        0.04
Cos index      0.75  0.32        0.13
CF index       0.86  0.69        0.10
CT index       0.90  0.50        0.01

Jaccard and CF indexes are the most correlated in the Enterprise dataset. Both of them are highly correlated in the Open dataset, the CT index being even better. Edit and Cosine indexes are less correlated in one of the datasets. As expected, the correlation of all measures is low in the Exam dataset, where the segmentation is of bad quality. Interestingly, the most influencing metrics, the ones most correlated with the final vote (as shown in Table 7), are also the Jaccard, CF and CT indexes.

Table 7: Correlation between similarity metrics and final vote for the three datasets

               Open  Enterprise  Exams
Edit index     0.23  0.69        0.16
Jaccard index  1.00  0.99        0.66
Cos index      0.87  0.49        0.36
CF index       1.00  0.89        0.91
CT index       0.95  0.65        0.74
7 CONCLUSION
This paper discussed the problem of fragmenting sequences of SQL queries into meaningful explorations when only the query text is available and it is not possible to rely on query timestamps.

We characterized queries as a set of simple features and defined five similarity indexes with respect to the previous queries in the session. A simple heuristic, based on these similarity indexes, allowed to split long and heterogeneous sessions into smaller explorations where queries have more connections.

Even if our preliminary results are promising, further aspects should be investigated:
• Study further query features. We would like to test the common fragments of a query w.r.t. close queries in the session (not only the previous one). Comparing query results seems important, even if it is not possible for many queries (because the database instance is not part of the SQLShare workload).
• Study other similarity indexes and tune thresholds.
• Discard the preliminary hypothesis about the chronological ordering of queries and deal with query similarity beyond it.
• Test other ways of fragmenting, in particular via clustering methods.

Our long-term goal is to measure the quality of SQL explorations, allowing the detection of focused/exploratory zones, the discovery of latent user intents, and the recommendation of next queries, among other applications.

REFERENCES
[1] B. D. Bhattarai, M. Wong, and R. Singh. Discovering user information goals with semantic website media modeling. In MMM (1), volume 4351 of Lecture Notes in Computer Science, pages 364–375. Springer, 2007.
[2] S. Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 3–14, 2007.
[3] M. Djedaini, K. Drushku, N. Labroche, P. Marcel, V. Peralta, and W. Verdeaux. Automatic assessment of interactive OLAP explorations. To appear in Information Systems, 2019. https://doi.org/10.1016/j.is.2018.06.008.
[4] M. Djedaini, N. Labroche, P. Marcel, and V. Peralta. Detecting user focus in OLAP analyses. In Advances in Databases and Information Systems - 21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings, pages 105–119, 2017.
[5] K. Drushku, N. Labroche, P. Marcel, and V. Peralta. Interest-based recommendations for business intelligence users. To appear in Information Systems, 2019. https://doi.org/10.1016/j.is.2018.08.004.
[6] M. Eirinaki, S. Abraham, N. Polyzotis, and N. Shaikh. QueRIE: Collaborative database exploration. IEEE Trans. Knowl. Data Eng., 26(7):1778–1790, 2014.
[7] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., 2015.
[8] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 277–281, 2015.
[9] S. Jain, D. Moritz, D. Halperin, B. Howe, and E. Lazowska. SQLShare: Results from a multi-year SQL-as-a-service experiment. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 281–293, 2016.
[10] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22–33, 2010.
[11] G. Kul, D. T. A. Luong, T. Xie, V. Chandola, O. Kennedy, and S. J. Upadhyaya. Similarity metrics for SQL query clustering. IEEE Trans. Knowl. Data Eng., 30(12):2408–2420, 2018.
[12] H. V. Nguyen, K. Böhm, F. Becker, B. Goldman, G. Hinkel, and E. Müller. Identifying user interests within the data space - a case study with SkyServer. In EDBT, pages 641–652. OpenProceedings.org, 2015.
[13] O. Romero, P. Marcel, A. Abelló, V. Peralta, and L. Bellatreche. Describing analytical sessions using a multidimensional algebra. In Data Warehousing and Knowledge Discovery - 13th International Conference, DaWaK 2011, Toulouse, France, August 29 - September 2, 2011, Proceedings, pages 224–239, 2011.
[14] V. Singh, J. Gray, A. Thakar, A. S. Szalay, J. Raddick, B. Boroski, S. Lebedeva, and B. Yanny. SkyServer traffic report - the first five years. Technical report, December 2006.
[15] H. van den Brink, R. van der Leek, and J. Visser. Quality assessment for embedded SQL. In SCAM, pages 163–170. IEEE Computer Society, 2007.
[16] A. Vashistha and S. Jain. Measuring query complexity in SQLShare workload. https://uwescience.github.io/sqlshare/pdfs/Jain-Vashistha.pdf.
[17] R. W. White. Interactions with Search Systems. Cambridge University Press, 2016.
[18] M. Wong, B. Bhattarai, and R. Singh. Characterization and analysis of usage patterns in large multimedia websites. Technical report, 2006.