<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Joint Conference (March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Qualitative Analysis of the SQLShare Workload for Session Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Veronika Peralta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Willeme Verdeaux</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yann Raimont</string-name>
          <email>yann.raimont@etu.univ-tours.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Marcel</string-name>
          <email>patrick.marcel@univ-tours.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tours Blois</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>26</volume>
      <issue>2019</issue>
      <abstract>
        <p>This paper presents ongoing work aiming at better understanding the workload of SQLShare [9]. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [9], this workload is the only one containing primarily ad-hoc hand-written queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries, and we show how to use these features to separate sequences of SQL queries into meaningful sessions. We ran a few tests over various query workloads to validate our approach empirically.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Analyzing a database workload offers many practical benefits,
from the monitoring of database physical access structures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
to the generation of user-tailored collaborative query
recommendations for interactive exploration [6]. There has been much
attention lately devoted to the analysis of user past activities to
support Interactive Database Exploration [8]. OLAP analysis of
data cubes is a particular case of IDE, that takes advantage of
simple primitives like drill-down or slice-and-dice for the navigation
of multidimensional data. These particularities enable the design
of approaches for characterizing user explorations by how focused
they are [4], or by how contributive a query is to the exploration [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
or even in how to ensure that a sequence of analytical queries
forms a coherent exploration [13].
      </p>
      <p>Transposing these works to regular, non-multidimensional
SQL workloads raises many challenges. Even if a sequence of
SQL queries is issued to explore the database content,
non-multidimensional relational schemas do not have the regularities one
expects from the multidimensional model, explorations may not be
expressed through roll-up or drill-down operations, SQL queries
may deviate from the traditional star-join pattern commonly
used for analytical purposes, etc.</p>
      <p>In this paper, we present a preliminary approach for
analyzing SQL workloads, concentrating on the SQLShare workload of
hand-written queries over user-uploaded datasets. (Consistently
with the authors of [9], we use the term hand-written to mean, in
this context, that the query is introduced manually by a human user,
which reflects genuine interactive human activity over a dataset,
with consideration between two consecutive queries.) This
workload includes raw sequences of queries made by some users,
without further information on their intention. One of our
objectives is to investigate whether this workload contains actual
exploration activities, and more particularly how to extract such
explorations. In what follows, we consider that an exploration is
a coherent sequence of hand-written queries that all share the
same goal of fulfilling a user’s information need, a need that may
not be well defined initially. Identifying such activities has several
applications in supporting interactive database exploration (IDE), like
understanding users’ information needs, identifying struggling
during the exploration, and providing better query recommendations.
This is important since systems used for IDE usually do not
offer such facilities.</p>
      <p>
        To identify explorations from a SQL workload, we use a
technique first proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to score the quality of OLAP
explorations. This technique consists of characterizing a query by a set
of simple features that are intrinsic to a query or that relate the
query to its neighbor in the sequence. While in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] this technique
of feature extraction was used with supervised machine learning
to score the quality of OLAP explorations, in the present work
we use these features to partition an SQL workload into coherent
explorations.
      </p>
      <p>The paper is organized as follows. The next section discusses
related work. Section 3 presents our model of queries, tailored for
SQL queries. Section 4 details the features considered and how
they are extracted. Section 5 introduces our segmentation
strategy and Section 6 reports the results of the tests we conducted.
Section 7 concludes and draws perspectives.
</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>In this section we present related work concerning real SQL
workloads and workload analysis.
</p>
    </sec>
    <sec id="sec-3">
      <title>Real SQL workloads</title>
      <p>SQLShare. The SQLShare workload is the result of a
Multi-Year SQL-as-a-Service Experiment [9], allowing any user with
minimal database experience to upload their datasets on-line
and manipulate them via SQL queries. What the authors wanted
to prove with this experiment is that SQL is beneficial for data
scientists. They observed that most of the time people use scripts
to modify or visualize their datasets instead of using the SQL
paradigm. Indeed, most user needs may be satisfied by first-order
queries, which are much simpler than a script but carry the initial
cost of creating a schema, importing the data, and so on.
SQL-as-a-Service frees the user from all this prior work with a relaxed SQL
version.</p>
      <p>The SQLShare workload is composed of 11,137 SQL
statements from 57 users over 3,336 user datasets. To the best of our
knowledge, as reported by the authors of [9], this workload is
the only one containing primarily ad-hoc hand-written queries
over user-uploaded datasets. As indicated in the introduction,
hand-written means that the query is introduced manually by
a human user, which reflects genuine interactive human
activity over a dataset, with consideration between two consecutive
queries.</p>
      <p>The SQLShare workload is analyzed in [9], particularly to
verify the following assumption:
"We hypothesized that SQLShare users would write
queries that are more complex individually and
more diverse as a set, making the corpus more
useful for designing new systems."</p>
      <p>The authors indeed showed empirically that the queries in the
SQLShare workload are complex and diverse. They also analyzed
the churn rate of SQLShare users and concluded that most users
exhibit a behavior that suggests an exploratory workload.</p>
      <p>To our knowledge, and again as reported by the authors of [9],
this workload is one of the two workloads publicly available to
the research community, the other being the SDSS workload.</p>
      <p>SDSS workload. SkyServer is an Internet portal to the Sloan
Digital Sky Survey Catalog Archive Server; its Web and SQL logs
are public [14]. The SQL log was produced by a live SQL database
supporting both ad hoc hand-authored queries as well as queries
generated from a point-and-click GUI. Many queries in the SDSS
are actually not hand-written; they were generated by
applications such as the Google Earth plugin or the query composer
from the SkyServer website. Their cleaning and normalization
took several months of effort.</p>
      <p>
        Sessions in this log were detected using heuristics:
"We arbitrarily start a new session when the
previous page view from that IP address is more than 30
minutes old, i.e., a think-time larger than 30
minutes starts a new session. [...] Wong and Singh [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
chose the same 30 minute cutoff and we are told
that MSN and Google use a similar heuristic."
      </p>
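      <p>For illustration, this think-time heuristic can be sketched in a few lines (a sketch only; it assumes timestamped query events, which the SDSS logs provide but the released SQLShare workload does not):</p>
      <preformat>
```python
def segment_by_think_time(events, cutoff_minutes=30):
    """Start a new session whenever the think-time since the previous
    request exceeds the cutoff (the SDSS heuristic quoted above)."""
    sessions = []
    previous_time = None
    for minutes, query in events:
        if previous_time is None or minutes - previous_time > cutoff_minutes:
            sessions.append([])  # think-time too large: open a new session
        sessions[-1].append(query)
        previous_time = minutes
    return sessions

# Toy timestamped log (minutes, query): a 45-minute gap splits the trace.
events = [(0, "q1"), (5, "q2"), (50, "q3")]
```
      </preformat>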
      <p>The authors of [14] also acknowledge the difficulty of
extracting human sessions from all those collected:
"We failed to find clear ways to segment user
populations. We were able to ignore the traffic that was
administrative or was eye-candy, leaving us with
a set of 65M page views and 16M SQL queries. We
organized these requests into about 3M sessions,
about half of which were from spiders. The residue
of 1.5M sessions had 51M page views and 16M SQL
queries – still a very substantial corpus. [...]
Interactive human users were 51% of the sessions, 41%
of the Web traffic and 10% of the SQL traffic. We
cannot be sure of those numbers because we did
not find a very reliable way of classifying bots vs
mortals."</p>
      <p>Bots are programs that automatically crawl the SDSS and
launch SQL queries. Such traffic cannot be classified as proper
interactive data exploration with human consideration.</p>
      <p>In [9], the authors compared the SQLShare workload and that
of the SDSS, and concluded:
"SQLShare queries on average tend to be more
complex and more diverse than those of a conventional
database workload generated from a comparable
science domain: the Sloan Digital Sky Survey (SDSS)."
Smaller SQL datasets. We are aware of other available SQL
workloads. For instance, Kul et al. [11] analyze three specific
query sets. The first one, student assignments gathered by IIT
Bombay, is made of a few hundred queries answering homework
assignments. The second dataset, publicly available, consists of
around 200 queries gathered over 2 years from student exams at
the University at Buffalo. The third dataset consists of SQL logs that
capture all database activities of 11 Android phones over a period
of one month. The log consists of 1,352,202 SELECT statements
that, being generated by an application, correspond to only 135
distinct query strings.</p>
    </sec>
    <sec id="sec-4">
      <title>Workload analysis</title>
      <p>Other scientific domains close to Database, like Information
Retrieval or Web Search, have a long tradition of log analysis
aiming at facilitating the searcher’s task [17]. Many works extract
features from queries or search sessions and use them to
disambiguate the session’s goal, to generate recommendations, to
detect struggling in sessions, etc. Since databases tend to be
increasingly used in an exploratory or analytical fashion, as evidenced
by the SQLShare workload, it is not a surprise that many recent works
pay attention to the analysis of database workloads, in addition to
those works analyzing workloads for optimization or self-tuning
purposes. We present some recent advances in this area,
differentiating by the type of logs (OLAP logs and SQL logs).</p>
      <p>
        Analyzing and detecting OLAP explorations. Logs of OLAP
analyses are simpler than SQL ones in the sense that they feature
multidimensional queries that can easily be interpreted in terms
of OLAP primitives (roll-up, drill-down, slice-and-dice, etc.). In
one of our previous works [13], we proposed an approach for
detecting OLAP analyses phrased in SQL, by converting SQL
queries into OLAP queries and then checking if two consecutive
queries are sufficiently close in terms of OLAP operations. In our
more recent work, we used supervised learning to identify a set
of query features that allow characterizing focus zones in OLAP
explorations [4], or to identify queries that better contribute to an
exploration [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The present work can be seen as a continuation
of those previous works, since we have the same objective as [13]
and use the same technique as [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The main differences with
these previous works are that we make no assumption about the
type of queries in the workload (particularly, they may not be
multidimensional queries), and we have no ground truth (i.e., no
human manual inspection of each query) on the workload.
      </p>
      <p>Analyzing SQL logs. SQL workload analysis has recently
attracted attention beyond query optimization, for instance for
query recommendation [6], query autocompletion [10], or user
interest discovery [12]. All these works use the SDSS workload
for their tests. Embedded SQL code is analyzed in [15] to
measure its quality, mainly for maintainability purposes. The authors
quantify quality based on the number of operators (joins, unions),
operands (tables, subqueries) and variables in the SQL code,
experimenting with SQL code embedded in PL/SQL, COBOL and
Visual Basic. Jain et al. ran a number of tests on the SQLShare
workload [9], some of them reported above, showing the
diversity and complexity of the workload. In [16], Vashistha and
Jain analyze the complexity of queries in the SQLShare workload
in terms of the following query features: number of tables,
number of columns, query length in characters, number of operators
(Scan, Join, Filter), number of comparison operators (LE, LIKE,
GT, OR, AND, Count), and query run-time. They define two
complexity metrics from these features: the Halstead measure
(traditionally used to measure program complexity) and a linear
combination whose weights are learned using regression. Finally,
a recent work investigated various similarity metrics over SQL
queries, aiming at clustering queries [11] for better workload
understanding. The authors ran their tests on smaller SQL sets,
as indicated above.</p>
      <p>To our knowledge, our work is the first to propose an approach
for segmenting hand-written SQL queries into meaningful
sessions.
</p>
    </sec>
    <sec id="sec-5">
      <title>PRELIMINARIES</title>
      <p>This section introduces the SQLShare workload and describes
our hypothesis and preprocessing.
</p>
    </sec>
    <sec id="sec-6">
      <title>SQLShare workload preprocessing</title>
      <p>From the 11,137 SQL statements we kept 10,668 corresponding to
SELECT statements. The remaining statements (mainly updates,
inserts and deletes) were filtered out.</p>
      <p>We implemented a preliminary session segmentation
following a simple heuristic: keeping together the sequences of
consecutive queries of the same user. As a result of this initial
segmentation we obtained 451 sessions, counting between 1 and 937
queries (average of 23.65 queries per session, standard deviation
of 75.05 queries). Furthermore, we made the initial hypothesis
that queries appear in chronological order in the SQLShare
workload. We noted that the queries of the workload do not come with
timestamps, and we contacted the authors of the original
SQLShare paper [9] who confirmed that the query order in the
workload may not reflect the order in which queries were launched.
Therefore, the disparate distribution of queries along sessions, in
addition to some extremely long sessions, possibly disordered,
calls for a smarter way of segmenting sessions.
</p>
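      <p>This preliminary heuristic can be sketched as a single grouping pass (a minimal sketch; the (user, statement) pair format is our own simplification of the released log):</p>
      <preformat>
```python
def segment_by_user(log):
    """Group maximal runs of consecutive queries issued by the same user
    (the preliminary heuristic used before feature-based segmentation)."""
    sessions = []
    current_user = None
    for user, query in log:
        if user != current_user:
            sessions.append([])  # a different user starts a new session
            current_user = user
        sessions[-1].append(query)
    return sessions

log = [("u1", "SELECT a FROM t"), ("u1", "SELECT a, b FROM t"),
       ("u2", "SELECT 1+2"), ("u1", "SELECT b FROM t")]
# Three sessions result: two for u1 (interrupted by u2) and one for u2.
```
      </preformat>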
    </sec>
    <sec id="sec-7">
      <title>Query and session abstractions</title>
      <p>In what follows, we use the term query to denote the text of an
SQL SELECT statement. We represent a query as a collection
of fragments extracted from the query text, namely, projections,
selections, aggregations and tables. These fragments abstract the
most descriptive parts of a SQL query, and are the most used in
the literature (see e.g., [6, 10]). Note that we do not restrict
ourselves to SPJ (selection-projection-join) queries. Indeed, we consider
all queries in the SQLShare workload, some of them containing
arbitrarily complex chains of sub-queries.</p>
      <p>Definition 3.1 (Query). A query over database schema DB is a
quadruple q = ⟨P , S, A, T ⟩ where:
(1) P is a set of expressions (attributes or calculated
expressions) appearing in the main SELECT clause (i.e. the
outermost projection). We deal with the * wildcard by replacing
it with the list of attributes it references.
(2) S is a set of atomic Boolean predicates, whose
combination (conjunction, disjunction, etc.) defines the WHERE
and HAVING clauses appearing in the query. We
consider indistinctly all predicates appearing in the outermost
statement as well as in inner sub-queries.
(3) A is a set of aggregation expressions appearing in the main
SELECT clause (i.e. the outermost projection). Including
aggregation expressions appearing only in the HAVING
clause is part of our future work.
(4) T is a set of tables appearing in FROM clauses (outermost
statement and inner sub-queries). Views, sub-queries and
other expressions appearing in FROM clauses are parsed
in order to obtain the referenced tables.</p>
      <p>Note that although we consider tables and selections
occurring in inner sub-queries, we limit ourselves to the outermost
query for projections and aggregations, as they correspond to attributes
actually visualized by the user. We intentionally remain
independent of presentation and optimization aspects, especially the order
in which attributes are projected (and visualized by the user), the
order in which tables are joined, etc.</p>
      <p>Finally, a session is a sequence of queries of a user over a given
database.</p>
      <p>Definition 3.2 (Session). Let DB be a database schema. A session
s = ⟨q1, . . . , qp ⟩ over DB is a sequence of queries over DB. We
note q ∈ s if a query q appears in the session s, and session(q) to
refer to the session where q appears.
</p>
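      <p>To make the abstractions concrete, Definitions 3.1 and 3.2 can be sketched in a few lines (a minimal sketch, not the paper's implementation; the class, field names and toy values are our own illustrations):</p>
      <preformat>
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    """Quadruple abstraction of an SQL SELECT statement (Definition 3.1)."""
    P: frozenset = frozenset()  # projections of the outermost SELECT clause
    S: frozenset = frozenset()  # atomic Boolean predicates (WHERE/HAVING, all levels)
    A: frozenset = frozenset()  # aggregations of the outermost SELECT clause
    T: frozenset = frozenset()  # tables referenced in FROM clauses (all levels)

# A session (Definition 3.2) is simply an ordered sequence of queries:
session = [
    Query(P=frozenset({"name"}), T=frozenset({"genes"})),
    Query(P=frozenset({"name", "score"}),
          S=frozenset({"score > 0.5"}), T=frozenset({"genes"})),
]
```
      </preformat>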
    </sec>
    <sec id="sec-8">
      <title>FEATURE EXTRACTION</title>
      <p>In this section, we define a set of features to quantitatively
describe different aspects of a SQL query and its context. We then
describe the extraction procedure and the obtained scores.</p>
    </sec>
    <sec id="sec-9">
      <title>Feature description</title>
      <p>For each query, we extract a set of simple features computed
from the query text and its relationship with other queries in a
session. We intend to cover various aspects of a query in order to
support different types of analysis and modeling based on query
features.</p>
      <p>
        The set of features is inspired from our previous work [
        <xref ref-type="bibr" rid="ref3">3, 4</xref>
        ],
which models OLAP queries as a set of features capturing typical
OLAP navigation.
      </p>
      <p>From now on, to remove any ambiguity, we use the term metric
to denote the functions that score query quality. The term feature
is reserved to denote the score output by the function.</p>
      <p>For the sake of presentation, we categorize metrics as follows:
i) intrinsic metrics, i.e., only related to the query itself, ii) relative
metrics, i.e., also related to the query’s predecessor in the session,
and iii) contextual metrics, i.e., related to the whole session,
providing more context to the metrics. Table 1 presents an overview
of metrics.</p>
      <p>Table 1 presents the metrics. Intrinsic metrics: NoP, number of projections (attributes and expressions); NoS, number of selections (filtering predicates); NoA, number of aggregations, when there is a group-by clause; NoT, number of tables. Relative metrics: NCP, NCS, NCA and NCT, number of projections, selections, aggregations and tables in common with the previous query; RED, relative edit distance (effort to express a query starting from the previous one); JI, Jaccard index of common query parts with the previous query. Contextual metrics: NoQ, number of queries (absolute position in the session).</p>
      <sec id="sec-9-2">
        <title>Metric definitions</title>
        <p>For all definitions given in this section, let qk = ⟨Pk , Sk , Ak , Tk ⟩
be the query occurring at position k in the session s over the
instance I of schema DB. All the queries we consider are
assumed to be well formed, so we do not deal with query errors.</p>
        <p>For the moment, we only considered metrics based on query
text. Further metrics may be defined if the database instance
is taken into account (for example, number of tuples in query
result, precision and recall of query result w.r.t. previous query,
execution time, etc.). But computing such metrics requires
executing every query over its dataset, which is
not always available for confidentiality reasons (i.e., users did not
agree to share their data), and would thus considerably reduce the set
of queries. We leave such studies to future work.</p>
        <p>4.1.1 Intrinsic metrics. Intrinsic metrics are those that can be
computed considering only the query qk, independently of the
session s and the other queries in s. In other words, these metrics
give the same score to qk independently of s.</p>
        <p>Number of Projections. NoP(qk) represents the number of
projections (attributes and expressions) that are projected by the user.
Expressions projected in inner sub-queries, but not projected by
the outer query, are not considered as they do not appear in the
query result.</p>
        <p>NoP(qk) = card(Pk) (1)</p>
        <p>Number of Selections. NoS(qk) represents the number of
selections (elementary Boolean predicates) that appear in the query
text, both in outer and inner sub-queries.</p>
        <p>NoS(qk) = card(Sk) (2)</p>
        <p>Number of Aggregations. NoA(qk) represents the number of
aggregation expressions that are projected by the user. As for
projections, expressions appearing only in inner sub-queries are not
considered.</p>
        <p>NoA(qk) = card(Ak) (3)</p>
        <p>Number of Tables. NoT(qk) represents the number of tables
appearing in the query text, considering both outer and inner
sub-queries.</p>
        <p>NoT(qk) = card(Tk) (4)</p>
        <p>4.1.2 Relative metrics. Relative metrics are those that are
computed by comparing the query qk to the previous query in the
session s. Let qk−1 = ⟨Pk−1, Sk−1, Ak−1, Tk−1⟩ be the previous query,
which is undefined for the first query of s (i.e., q1). Each metric
provides a default score for this limit case.</p>
        <p>Number of Common Projections. NCP(qk, qk−1) counts the
number of common projections of qk relative to qk−1. The
default value is 0 for q1.</p>
        <p>NCP(qk, qk−1) = card(Pk ∩ Pk−1) if k &gt; 1, 0 otherwise (5)</p>
        <p>Number of Common Selections. NCS(qk, qk−1) counts the
number of common selections of qk relative to qk−1. The default
value is 0 for q1.</p>
        <p>NCS(qk, qk−1) = card(Sk ∩ Sk−1) if k &gt; 1, 0 otherwise (6)</p>
        <p>Number of Common Aggregations. NCA(qk, qk−1) counts the
number of common aggregations of qk relative to qk−1. The
default value is 0 for q1.</p>
        <p>NCA(qk, qk−1) = card(Ak ∩ Ak−1) if k &gt; 1, 0 otherwise (7)</p>
        <p>Number of Common Tables. NCT(qk, qk−1) counts the number
of common tables of qk relative to qk−1. The default value is 0
for q1.</p>
        <p>NCT(qk, qk−1) = card(Tk ∩ Tk−1) if k &gt; 1, 0 otherwise (8)</p>
        <p>Relative Edit Distance. RED(qk, qk−1) represents the editing
effort, for a user, to express the current query starting from the
previous one. It is strongly related to query parts, and is computed
as the minimum number of atomic operations between queries,
considering the operations of adding/removing a projection,
selection, aggregation or table. The cost of each
observed difference (adding/removing) is the same.</p>
        <p>RED(qk, qk−1) = card(Pk − Pk−1) + card(Pk−1 − Pk)
+ card(Sk − Sk−1) + card(Sk−1 − Sk)
+ card(Ak − Ak−1) + card(Ak−1 − Ak)
+ card(Tk − Tk−1) + card(Tk−1 − Tk) (9)</p>
        <p>For the limit case, we consider an empty query, q0 = ⟨∅, ∅, ∅, ∅⟩.
Then, RED(q1, q0) = card(P1) + card(S1) + card(A1) + card(T1).</p>
        <p>Jaccard Index. JI(qk, qk−1) represents the ratio between the
common query parts (projections, selections, aggregations and
tables) and the union of query parts, where queryParts(qk) = Pk ∪ Sk ∪ Ak ∪ Tk.</p>
        <p>JI(qk, qk−1) = card(queryParts(qk) ∩ queryParts(qk−1)) / card(queryParts(qk) ∪ queryParts(qk−1)) (10)</p>
        <p>For the limit case, we consider an empty query, q0 = ⟨∅, ∅, ∅, ∅⟩.
Then, JI(q1, q0) = 0.</p>
        <p>4.1.3 Contextual metrics. Contextual metrics are session
dependent and make sense only in the context of a session. The
same query qk occurring in different sessions may be given
different scores for metrics in this category. For now we consider
only one such metric.</p>
        <p>Number of Queries. NoQ(qk, s) counts the absolute position
of query qk in s.</p>
        <p>NoQ(qk, s) = k (11)</p>
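        <p>As an illustration, RED and JI reduce to plain set operations over the query quadruples (a hedged sketch; queries are modeled here as 4-tuples of Python sets, and the toy queries are our own):</p>
        <preformat>
```python
def red(q, q_prev):
    """Relative Edit Distance: symmetric difference summed over the four query parts."""
    return sum(len(a.symmetric_difference(b)) for a, b in zip(q, q_prev))

def jaccard(q, q_prev):
    """Jaccard Index over the union of all query parts; 0 when both queries are empty."""
    parts = set().union(*q)
    parts_prev = set().union(*q_prev)
    union = parts.union(parts_prev)
    if not union:
        return 0.0
    return len(parts.intersection(parts_prev)) / len(union)

# Each query is a 4-tuple (P, S, A, T) of sets; q0 is the empty query.
q0 = (set(), set(), set(), set())
q1 = ({"name", "score"}, {"score > 0.5"}, set(), {"genes"})
q2 = ({"name"}, {"score > 0.5"}, set(), {"genes"})
# RED(q1, q0) = card(P1) + card(S1) + card(A1) + card(T1), as in the limit case above.
```
        </preformat>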
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4.2 Extraction protocol</title>
      <p>This section briefly describes the procedure for extracting query
features. We proceed in 3 steps:
(1) Filter the SQL workload in order to keep only SELECT
statements, i.e. discard updates, inserts, etc.
(2) Extract query fragments (sets of projections, selections,
aggregations and tables) for each query.</p>
      <p>(3) Compute query features from query fragments.</p>
      <p>For the first and second steps we developed a C# script using the
MSDN TSQL Parser. Removing all the non-SELECT statements
was straightforward. However, extracting query fragments
required dealing with several particular cases. The major challenge
was extracting projections. First, to get the projections
visualized by the user, we need to detect the outermost SELECT
clause of a query and then extract the SELECT elements. When
there is a star element (e.g., SELECT * FROM T) in the query, our
script reaches the outermost FROM clause (looking for T). When
T is a table, the script accesses schema metadata and obtains all
the table attributes. If there are one or more sub-queries in the
FROM clause, it repeats the previously described pattern until
it finds either a table or a set of SELECT elements. Views and
WITH clauses were treated in a similar way. For now, we do not
take into account queries having more than one '*' in their
SELECT elements (e.g., SELECT t1.*, t2.*, a, b FROM t1, t2).</p>
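      <p>This recursive star resolution can be sketched as follows (a simplified sketch over a toy catalog; the real implementation uses the MSDN TSQL Parser in C#, and the dictionary-based catalog and tuple encoding of sub-queries here are hypothetical):</p>
      <preformat>
```python
# Toy schema metadata: table name to list of attributes (hypothetical catalog).
CATALOG = {"genes": ["id", "name", "score"]}

def resolve_star(from_item):
    """Resolve SELECT * by descending into the FROM clause until a base
    table or an explicit SELECT list is found (simplified from the C# script)."""
    if isinstance(from_item, str):
        # Base table: fetch its attributes from the schema metadata.
        return CATALOG[from_item]
    # Otherwise from_item encodes a sub-query: (select_elements, inner_from).
    select_elements, inner_from = from_item
    if select_elements == ["*"]:
        return resolve_star(inner_from)  # repeat the pattern on the inner FROM
    return select_elements

# SELECT * FROM (SELECT * FROM genes) resolves to the attributes of genes.
```
      </preformat>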
      <p>Note that this procedure for resolving star elements relies
on the existence of schema metadata, i.e., on having access to the
corresponding datasets in order to obtain the list of attributes.
However, some datasets are referenced in queries but are not
present in the SQLShare released data because the user decided
not to share them. For such queries (nearly 18% of the query
workload), we could not resolve the list of projections and needed
to estimate the number of projections during the third step
(computing features).</p>
      <p>The aggregations were obtained with a parser function that
detects all the function calls within a query. The selections are
all the atomic Boolean expressions contained in the queries and
their sub-queries. At this stage of our work we do not deal with
predicate containment.</p>
      <p>The third step computes query features as described in
Equations 1 to 11. For the unresolved star elements, we estimated both
the number of projections and the number of common
projections, by taking advantage of the other queries in the exploration
that list attributes of the "starred" tables. We used linear
regression to estimate the remaining cases, with the AUTO-SKLEARN
Python module [7], which automatically chooses and parametrizes
a machine learning algorithm for a given dataset, at a given cost
(i.e., the time it takes to test different algorithms).</p>
      <p>More precisely, we computed two distinct regressions, one for
NoP and another one for NCP. The methodology is the same
for both and consists of the following:
(1) Each query is represented by 23 features: NoS, NoA, NoT,
NCS, NCA, NCT, NoQ (neither NoP nor NCP, since they
are the targets of the regression), plus the min, max, average
and standard deviation of NoP, NoS, NoA, and NoT,
grouped by the exploration the query belongs to.
(2) The queries where NoP=0 (and consequently NCP=0) are
removed from the data set.
(3) The data set is then split 80/20 for cross-validation, and
the AUTO-SKLEARN regression mode is used to fit the
best regression. For the first regression, NoP is the target;
for the second, NCP is the target. The maximum time to
find a model is set to 180 seconds, and the accuracy of the
regression is measured with the R2 (coefficient of
determination) regression score function.
(4) Each fitted regression is then used to predict NoP
(respectively NCP) for the queries removed at step 2.
(5) Degenerate cases (e.g., a predicted number of common
projections that is greater than the predicted number of
projections) are handled manually.</p>
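      <p>As a simplified stand-in for the AUTO-SKLEARN step, the regression can be sketched with an ordinary least-squares fit (a sketch only: we use numpy's lstsq in place of AUTO-SKLEARN's model search, and the tiny feature matrix is fabricated for illustration):</p>
      <preformat>
```python
import numpy as np

# Toy training rows: queries with resolved stars, using a few of the 23
# features (here NoS, NoA, NoT); y holds the known NoP targets.
X = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 1.0],
              [0.0, 0.0, 2.0],
              [3.0, 1.0, 2.0]])
y = np.array([3.0, 6.0, 2.0, 9.0])

# Fit NoP = X @ w by least squares (AUTO-SKLEARN would instead search over
# algorithms and hyper-parameters within the 180-second budget).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict NoP for a query whose star element could not be resolved.
predicted_nop = float(np.array([2.0, 0.0, 1.0]) @ w)
```
      </preformat>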
    </sec>
    <sec id="sec-11">
      <title>Analysis of query features</title>
      <p>90pc
NoP
NoS
NoA
NoT
NCP
NCS
NCA
NCT
RED
JI
NoQ
Contextual metrics
23.65 75.05
while median values show that most queries have few projections,
selections, aggregations and tables.</p>
      <p>Focusing on longer queries (with more query fragments): at
the 90th percentile, queries have 18 projections, 3 selections, 1
aggregation and 2 tables, while at the 75th percentile those values
are 10, 1, 0 and 1 respectively. Indeed, as expected, there is a large
number of short queries (having fewer fragments): 82% of queries have
no aggregations and 44% have no selections, while 20% have a
single projection and 78% have a single table. Interestingly, 6%
of queries have no table in the FROM clause. An example of such
queries is "SELECT 1+2".</p>
      <p>Concerning common fragments between contiguous queries,
almost half of the queries have 1 common projection and 1
common table, but no common selections or aggregations, while
there is more sharing at the 75th percentile. The remaining two
metrics, Relative Edit Distance and Jaccard Index, inform about the
combination of such common fragments. Specifically, half of the
queries differ in 4 or fewer fragments (RED=4) and have at least
43% of their fragments in common (JI=0.43) w.r.t. the previous query.
Furthermore, only 29% of queries have nothing in common with
the previous query in their sessions (JI=0).</p>
      <p>The next section discusses how to take these features into
account for segmenting sessions.</p>
    </sec>
    <sec id="sec-12">
      <title>SESSION SEGMENTATION</title>
      <p>The previous section presented various statistics about queries
in the SQLShare workload. A preliminary session segmentation
(contiguous queries of the same user) resulted in some extremely
long sessions (a maximum of 937 queries), with 29% of queries
having nothing in common with their immediate predecessor. In
this section, we explore how to segment sessions in a smarter
way.</p>
      <p>Session segmentation has been previously studied for the SDSS
workload [14]. In that study, the authors consider that a new
session starts after 30 minutes of think-time (the time spent between
two queries). A similar problem was widely studied for the
segmentation of web traces (see for example [18]), proposing the
same 30-minute cutoff. Search engine providers, like MSN and
Google, use similar heuristics. Contrary to those works, the
published SQLShare workload does not include query timestamps,
so we need to explore other heuristics for session segmentation.</p>
      <p>Intuitively, our idea is to compare contiguous queries in a
session and to segment when queries are dissimilar enough. Based on
the query features described in the previous section, we investigate
5 similarity indexes:</p>
      <p>Edit Index. It is based on the Relative Edit Distance (RED)
query feature, normalizing it with respect to an arbitrary
threshold, set to 10 operations.</p>
      <p>EditIndex(qk) = max { 0, 1 − RED(qk, qk−1) / 10 }    (12)</p>
      <p>Jaccard Index. It is the Jaccard Index defined in Equation 10,
which is normalized by definition.</p>
      <p>Cosine Index. It is calculated as the cosine of vectors consisting
of the first 8 query features, namely NoP, NoS, NoA, NoT, NCP, NCS,
NCA and NCT. Let x = ⟨x1, . . . , x8⟩ and y = ⟨y1, . . . , y8⟩ be the
vectors for queries qk and qk−1, respectively.</p>
      <p>CosIndex(qk, qk−1) = (Σi xi · yi) / ( √(Σi xi²) · √(Σi yi²) )    (13)</p>
      <p>Common Fragments Index. It is calculated as the number of
common fragments, normalizing it with respect to an arbitrary
threshold, set to 10 fragments.</p>
      <p>CFIndex(qk, qk−1) = min { 1, NCF / 10 }    (14)
where NCF = NCP(qk, qk−1) + NCS(qk, qk−1) + NCA(qk, qk−1) + NCT(qk, qk−1)</p>
      <p>Common Tables Index. It is calculated as the number of
common tables, normalized by the highest number of tables in the
session.</p>
      <p>CTIndex(qk, qk−1) = NCT(qk, qk−1) / max { NoT(q) | q ∈ session(qk) }    (15)</p>
      <p>Note that these indexes capture complementary aspects of
query similarity. Edit Index and Common Fragments Index count
differences (resp., common fragments) as absolute values
(normalized with a given threshold). Jaccard Index is a compromise
between the previous ones, computing the ratio of common fragments.
Cosine Index is computed using the features (the values of the metrics)
instead of comparing sets of fragments; it captures the variability
in query complexity. Finally, Common Tables Index responds
to the intuition that common tables have more impact than the
other common fragments, and it is normalized with respect to
the number of tables used in the user session.</p>
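      <p>To make the definitions concrete, Equations 12–15 can be sketched in a few lines of Python. This is an illustrative transcription under the stated 10-operation and 10-fragment thresholds, not the authors' implementation.</p>
      <p>
```python
import math

def edit_index(red, threshold=10):
    """Equation 12: RED normalized by a 10-operation threshold, clipped to [0, 1]."""
    return max(0.0, 1.0 - red / threshold)

def cos_index(x, y):
    """Equation 13: cosine of the two 8-dimensional feature vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

def cf_index(ncp, ncs, nca, nct, threshold=10):
    """Equation 14: number of common fragments, capped at the 10-fragment threshold."""
    return min(1.0, (ncp + ncs + nca + nct) / threshold)

def ct_index(nct, max_not_in_session):
    """Equation 15: common tables over the largest NoT in the session."""
    return nct / max_not_in_session if max_not_in_session else 0.0
```
      </p>
      <p>For instance, two queries at relative edit distance 4 get EditIndex 0.6, and one common projection plus one common table give CFIndex 0.2.</p>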
      <p>Figure 2 depicts the similarity metrics for 3 sessions of different
sizes. Looking at Session 28, the shortest one, it seems quite clear
that the session may be split in two parts, by cutting between
queries 4 and 5. All similarity metrics agree. Things are less
evident for Session 0. One split seems evident (at query 31), but
some others may be debated (e.g., at queries 28, 29 and 12).
The decision depends on thresholds. Finally, Session 18 presents a
first part, with a focused analysis, via similar queries, and a second
part, more exploratory, with varied queries. Even if the indexes do
not always agree, their majority seems to indicate a tendency.</p>
    </sec>
    <sec id="sec-13">
      <title>EXPERIMENTS</title>
      <p>In this section we report the results of the experiments conducted
to validate our segmentation approach. We first experiment with
the SQLShare dataset. We then experiment with other datasets
for which a ground truth is available, in order to better underline
the effectiveness of our approach.</p>
      <sec id="sec-13-1">
        <title>Experiments with SQLShare</title>
        <p>[Table 3: session features (Nb queries, Avg NCP, Avg NCS, Avg NCA, Avg NCT, Avg NCF, Avg RED, Avg JI) before and after segmentation.]</p>
        <p>We implemented a first heuristic based on these 5 similarity
indexes, tuned with simple thresholds whose setting is discussed in
the next subsection, taking the decision of segmenting or not by
majority vote. This simple heuristic allowed us to split the initial 451
sessions into 2,960 segments (called explorations). In the absence
of a ground truth, we present in Table 3 a comparison of session
features before and after session segmentation. A first remark
concerns session length: extremely long sessions (a maximum of
936 queries) were split (the new maximum is 78 queries). Indeed, more than
half of the sessions were not fragmented, and at the 3rd quartile a
session was split into 4 explorations. Some long and anarchic sessions
(such as the one counting 936 queries) were split into a multitude of
explorations. We can also highlight an increase in the average
number of common query fragments (Avg NCF) per session, as
well as in the average number of common projections, selections,
aggregations and tables. This increase is quite regular and
visible for all quartiles. Relative edit distance (RED) and Jaccard
Index (JI) also improved, as expected.</p>
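        <p>The voting rule is not spelled out in full above; a minimal sketch, assuming one vote per index (an index votes for a break when its value falls below its threshold) and a strict majority to cut, could look like the following. The index names used as keys are hypothetical placeholders.</p>
        <p>
```python
def segment_session(index_values, thresholds):
    """Split a session into explorations by majority vote of the similarity indexes.

    index_values: for each query, a dict of its index values w.r.t. the
    previous query (the entry for the first query is ignored).
    thresholds: per-index cut-off values.
    Returns the positions where a new exploration starts.
    """
    breaks = [0]  # the first query always opens an exploration
    for pos in range(1, len(index_values)):
        votes = sum(index_values[pos][name] < thresholds[name] for name in thresholds)
        if votes > len(thresholds) / 2:  # strict majority says "dissimilar enough"
            breaks.append(pos)
    return breaks
```
      </p>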
      </sec>
    </sec>
    <sec id="sec-14">
      <title>Experiments with ground truth</title>
      <p>For this series of tests, we used workload datasets with a ground
truth. We expect our approach to find a segmentation that is
close to the ground truth. As one of the datasets also contains
timestamps, we compare against the heuristic used in the literature,
i.e., the 30-minute cutoff.</p>
      <p>
        Datasets. We used three real workloads. Two of them are logs
of multidimensional queries that we have used in our previous
works [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; the third one is a log of SQL queries used for clustering
queries in [11]. We briefly present the datasets and refer the
readers to the corresponding articles for more details.
      </p>
      <p>The first dataset, named Open in what follows, consists of
navigation traces collected in the context of a French project on
energy vulnerability. These traces were produced by 8 volunteer
students of a Master's degree in Business Intelligence who, answering
fuzzy information needs defined by their lecturer, developed
exploratory OLAP navigations using Saiku (http://meteorite.bi/products/saiku)
over three cube instances. The main cube is organized as a star schema with 19
dimensions, 68 (non-top) levels, 24 measures, and contains 37,149
facts recorded in the fact table. The other cubes are organized
in a similar way. From this experiment, we reuse 16 sessions,
representing 28 explorations and 941 queries. The ground truth
is a manual segmentation made by the lecturer, based on an assumed
notion of user goal and supported by timestamps.
Notably, automatic segmentation was not the purpose of the work
at the time the manual segmentation was done.</p>
      <sec id="sec-14-1">
        <title>Datasets and results</title>
        <p>The second dataset, named Enterprise, consists of navigation
traces of 14 volunteers among SAP employees, collected in the context
of a previous study on discovering user intents [5]. We defined 10
business needs, and volunteers were asked to analyze some of the
7 available data sources to answer each of them,
using a SAP prototype that supports keyword-based BI queries3.
In total, this dataset contains 24 sessions, corresponding to 104
user explorations and accounting for 525 queries. Volunteers
were explicitly requested to state which information needs
they were answering, which constitutes our ground truth for this
dataset.</p>
        <p>
          For these two datasets, feature extraction was done as in
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], enriched with the detection of the features pertaining to tables
(NoT and NCT).
        </p>
        <p>The third dataset, named Exam, consists of queries gathered
over 2 years from student exams at the University at Buffalo. Queries
correspond to students' answers to 2 exam questions (one per
year). A parser that allows extracting several query parts (including
the ones we need) is also available for download. The parser also
filters out the queries that obtained low grades, yielding a total of
102 queries.</p>
        <p>Notably, in the Open and Enterprise datasets, users did not have
to write any SQL code, contrary to SQLShare or Exam. Indeed,
Saiku and the SAP prototype generated queries from users'
high-level operations. However, in both cases, users devised real
explorations. Conversely, queries in the Exam dataset are hand-written,
but each student devised only one query. In order to simulate
explorations, we made the hypothesis that the answers to a same
exam question should mimic a student's attempts to find the correct
answer. To this end, we grouped together the queries that
answer a same exam question, obtaining 2 explorations (the ground
truth for this dataset).</p>
        <p>Dataset analysis and feature extraction. Table 4 compares the
three datasets in terms of number of sessions, number of
explorations (the ground truth) and number of queries, and summarizes
feature extraction. A first remark concerns the length of
sessions. The Open dataset contains long sessions covering few
explorations, while the Enterprise dataset contains shorter
sessions covering more explorations. Session length is actually
dependent on the GUI: while third-party OLAP tools, like Saiku, log a
new query for each user action (including intermediate drag-and-drops),
the SAP prototype only logs final queries. In the Exam
dataset, sessions and explorations were simulated. Regarding
features, queries in the three datasets involve a rather small number
of projections, selections, aggregations and tables.
3Patent Reference: 14/856,984: BI Query and Answering using full text search and
keyword semantics</p>
        <p>Table 4: Characteristics of the 3 datasets (averages, with [min,max] ranges where applicable)

                               Open          Enterprise
Nb of sessions                 16            24
Nb of explorations             28            104
Nb of queries                  941           525
Avg queries per session        58            21
Avg queries per exploration    34            5
Avg explorations per session   2             4
Avg NoP                        3.62 [1,7]    2.18 [0,6]
Avg NoS                        1.33 [0,21]   0.76 [0,3]
Avg NoA                        1.34 [1,4]    1.14 [0,5]
Avg NoT                        3.28 [1,7]    2.03 [1,4]
Avg NCP                        3.16 [0,7]    1.34 [0,4]
Avg NCS                        1.13 [0,21]   0.46 [0,3]
Avg NCA                        1.17 [0,4]    0.77 [0,3]
Avg NCT                        2.97 [0,7]    1.46 [0,4]
Avg RED                        3.85 [0,19]   2.09 [0,25]
Avg JI                         0.57 [0,1]    0.79 [0,1]</p>
        <p>Relative edit distance (RED) and Jaccard index (JI) illustrate that queries are
more similar in the Enterprise dataset and highly dissimilar in
the Exam dataset. The latter observation actually contradicts our
hypothesis on the Exam dataset (stating that students' answers
to a same exam question should be similar).</p>
        <p>Threshold setting. As for the SQLShare dataset, we computed
the 5 similarity metrics and used them (with the voting strategy) for
segmenting sessions. We tested different thresholds for voting.
We proceeded as follows: we calculated the distribution of values
for each metric and used as threshold the value at the k-percentile,
with k varying in {0, 5, 10, 15, 20, 25, 30}.</p>
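        <p>The percentile computation can be sketched as follows; a nearest-rank convention is assumed, since the exact interpolation scheme is not stated.</p>
        <p>
```python
def percentile_threshold(values, k):
    """Threshold for one metric: the value at the k-percentile (nearest rank)
    of its observed distribution."""
    ordered = sorted(values)
    rank = round(k / 100 * (len(ordered) - 1))
    return ordered[rank]
```
      </p>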
        <p>The thresholds that provided the best results were those at the 0-,
15- and 5-percentile for the Open, Enterprise and Exam datasets,
respectively. These thresholds reflect the relationship between
the number of explorations to find and the number of queries, as well
as the similarity among consecutive queries. Indeed, the Open
dataset contains many queries and few explorations (i.e., few
segments to find); small thresholds are best adapted. Conversely,
the Enterprise dataset needs to be segmented more, as the average
number of queries per exploration is low; higher thresholds do
better. By the same reasoning, the 0-percentile should provide
convenient thresholds for the Exam dataset. However, as queries
inside an exploration are dissimilar, the only segmentation to find
is not detected first (many intra-exploration breaks are proposed
before it); the 5-percentile provides better results.</p>
        <p>We remark that more precise thresholds could be learned with
supervised machine learning techniques (e.g., classification). We
intentionally avoided this computation because in real applications
(such as SQLShare) we do not have a ground truth for learning and,
when one exists, we risk over-fitting. An expert providing the
ratio of queries per exploration (either intuitively or via
preliminary tests) is more realistic. For the SQLShare dataset, we relied
on statistics, like the number of queries per session (see Table 2), and
on the dataset description in [9], setting the thresholds to the values
at the 30-percentile.</p>
        <p>In the remaining tests, we use values at 0-, 15- and 5-percentile
as thresholds for the Open, Enterprise and Exam datasets,
respectively.</p>
        <p>Segmentation quality. For each dataset, we compared the
obtained segmentation to the ground truth, measuring
segmentation quality in two complementary ways. The first one, more
demanding, compares the positions of session breaks, indicating whether
the segmentation and the ground truth coincide at the very
same positions. To this end, we build binary vectors (a position
corresponding to a query in the session), where a 1 means that
a new segment starts at that position. We do the same for the
ground truth and compare both vectors in terms of accuracy,
precision, recall and F-measure. The second way is the Adjusted
Rand Index (ARI), a popular measure of clustering quality. It is
based on the comparison of pairs of queries in the session,
observing whether they belong to the same fragment and to the
same exploration in the ground truth.</p>
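        <p>The break-vector comparison can be sketched as below (an illustrative reconstruction, not the authors' evaluation code). The ARI itself is available off the shelf, e.g. scikit-learn's adjusted_rand_score applied to per-query exploration labels.</p>
        <p>
```python
def break_vector(exploration_lengths):
    """Binary vector over the session: 1 where a new exploration starts."""
    vec = []
    for length in exploration_lengths:
        vec.extend([1] + [0] * (length - 1))
    return vec

def precision_recall_f(predicted, truth):
    """Compare predicted break positions against the ground-truth break vector."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    precision = tp / sum(predicted) if sum(predicted) else 0.0
    recall = tp / sum(truth) if sum(truth) else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```
      </p>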
        <p>We report the results in Table 5. As expected, results are very
good in terms of accuracy, mainly because the classes are
unbalanced (the number of no-breaks is much higher than the number
of breaks). Relatively low values of F-measure combined with high values of
ARI indicate that the exact break is not always found, while the
overall segmentation remains good. The bad results for the Exam
dataset are mainly explained by the dissimilarities among queries
of a same exploration (i.e., answers to a same exam question). The
conclusions about query similarity presented in [11] confirm this
fact.</p>
        <p>Comparison to the timestamp-based approach. In order to compare
our approach to the one used in the literature, we implemented
a second heuristic that segments sessions when there is a
30-minute delay between queries. The Open dataset, the only one
containing timestamps, was used for this test. Results are
reported in Table 5, the right-most column corresponding to the
timestamp-based approach. They are comparable in terms of
accuracy and ARI but much lower in terms of F-measure. Note that
a precision of 1 means that all breaks found are also breaks in
the ground truth. In other words, there are no big delays inside
explorations, which makes sense. However, the timestamp-based
approach fails to detect 75% of the breaks (those where the user changes
the topic of study after a shorter delay).</p>
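        <p>The baseline is straightforward to reproduce; a minimal sketch of the 30-minute cutoff over a list of query timestamps:</p>
        <p>
```python
from datetime import datetime, timedelta

def timestamp_segmentation(timestamps, gap_minutes=30):
    """Literature baseline: start a new session after 30 minutes of think-time."""
    breaks = [0]
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] > timedelta(minutes=gap_minutes):
            breaks.append(i)
    return breaks
```
      </p>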
        <p>Analysis of similarity metrics. Finally, we investigated the
quality of the 5 proposed similarity metrics, by studying the
correlation of their votes (when the metric value is lower than the
corresponding threshold) with the breaks in the ground
truth. Results are presented in Table 6.</p>
        <p>Jaccard and CF indexes are the most correlated in the
Enterprise dataset. Both of them are highly correlated in the Open
dataset, the CT index being even better. Edit and Cosine indexes are
less correlated in one of the datasets. As expected, the correlation
of all measures is low in the Exam dataset, where the segmentation
is of bad quality.</p>
      </sec>
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION</title>
      <p>This paper discussed the problem of fragmenting sequences of
SQL queries into meaningful explorations when only the query
text is available, and it is not possible to rely on query timestamps.</p>
      <p>We characterized queries as a set of simple features and
defined five similarity indexes with respect to previous queries in
the session. A simple heuristic, based on the similarity indexes,
allowed us to split long and heterogeneous sessions into smaller
explorations where queries have more connections.</p>
      <p>Even if our preliminary results are promising, further aspects
should be investigated:
• Study further query features. We would like to test the
common fragments of a query w.r.t. close queries in the
session (not only the previous one). Comparing query
results seems important, even if it is not possible for many
queries (because the database instance is not part of the
SQLShare workload).
• Study other similarity indexes and tune thresholds.
• Discard the preliminary hypothesis about the chronological
ordering of queries and deal with query similarity beyond
it.
• Test other ways of fragmenting, in particular via clustering
methods.</p>
      <p>Our long-term goal is to measure the quality of SQL
explorations, allowing the detection of focused/exploratory zones, the
discovery of latent user intents, and the recommendation of next
queries, among other applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Bhattarai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <article-title>Discovering user information goals with semantic website media modeling</article-title>
          .
          <source>In MMM (1)</source>
          , volume
          <volume>4351</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>364</fpage>
          -
          <lpage>375</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          and
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Narasayya</surname>
          </string-name>
          .
          <article-title>Self-tuning database systems: A decade of progress</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Very Large Data Bases</source>
          , University of Vienna, Austria,
          <source>September 23-27</source>
          ,
          <year>2007</year>
          , pages
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Djedaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Drushku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Labroche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Verdeau</surname>
          </string-name>
          .
          <article-title>Automatic assessment of interactive OLAP explorations</article-title>
          . To appear
          <source>in Information Systems</source>
          ,
          <year>2019</year>
          . https://doi.org/10.1016/j.is.
          <year>2018</year>
          .
          <volume>06</volume>
          .008.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>