<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Analysis Behavior in SQL Workloads</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Clement Moreau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronika Peralta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tours Blois</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a set of analyses aiming at better understanding the SQLShare workload [13] and learning users' analysis behavior. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [13], this workload is the only one containing primarily ad-hoc hand-written queries over useruploaded datasets. In this paper we analyze this workload, by comparing users' explorations (sequences of queries), looking for common SQL operations performed by the users during data analysis. We use a clustering algorithm to retrieve groups of similar explorations and we analyze the obtained clusters through many statistical and visual indicators for explaining analysis patterns inside clusters. To our knowledge, this is the first attempt to characterize human analysis behavior in SQL workloads.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The analysis of a database workload to support Interactive
Database Exploration (IDE) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] receives increasing interest as it
offers many practical interests, from the monitoring of database
physical access structures [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to the generation of user-tailored
collaborative query recommendations for interactive exploration
[
        <xref ref-type="bibr" rid="ref21 ref8">8, 21</xref>
        ].
      </p>
      <p>
        Characterising user behavior while analysing data, i.e.
learning the way users analyse data (the type and order of operations,
the level of detail, the degree of focus) is a step forward in the
understanding of analysis activities and ofers new applications,
for instance to understand users’ information needs, to identify
"struggling" during the exploration, or to provide better query
recommendations. Notably, IDE systems usually do not ofer such
facilities. The prediction of next analysis steps is particularly
interesting, enabling beforehand execution of probable queries and
caching of results, as well as advanced optimization strategies.
Finally, we mention the detection of clandestine intentions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
as another potential benefit. Indeed, as reported by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], query
sequences may reflect such intentions, where users prefer to obtain
information by means of sequences of smaller, less conspicuous
queries to avoid direct queries which may disclose their true
interests. The identification of typical analysis patterns may help
distinguishing normal from clandestine intentions.
      </p>
      <p>
        In this paper we deal with the identification of analysis
patterns in a log of SQL explorations devised by real users. We
consider that an exploration is a coherent sequence of queries
over a database schema, done by a user with the goal of fulfilling
an information need. We experiment on the SQLShare workload
of hand-written1 queries over user-uploaded datasets [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In
1Consistently with the authors of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we use the term hand-written to mean, in
this context, that the query is introduced manually by a human user, which reflects
genuine interactive human activity over a dataset, with consideration between two
consecutive queries.
particular, we use the segmentation of the SQLShare workload
in coherent explorations proposed in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>
        Some previous works consider analysis patters within OLAP
explorations. In [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], Rizzi and Gallinucci described 4 recurrent
types of user analyses and propose a tool for generating realistic
explorations based on these usage types. In [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], we cluster
together explorations showing similar analysis patterns, learning
11 analysis patterns from OLAP workloads devised by students
and expert analysts.
      </p>
      <p>The idea behind analysis patterns is to look for sequences of
common operations performed together when analysing data,
as some kind of movements in a data space. From this point of
view, OLAP operations (e.g. drilling down, adding a filter,
changing a measure) are first class citizens, while the actual analyzed
data is less important. For example, we can retain that a user
performed a sequence of drills down, disregarding the dimension
that was drilled down or the semantics of the underlying data.
Explorations are compared in such terms, i.e. to what extent they
share the same sequences of operations and evolve at the same
level of aggregation and filtering.</p>
      <p>Transposing such approach to regular, non multidimensional
SQL workloads raises many challenges. Even if a sequence of SQL
queries is issued to explore the database content, non
multidimensional relational schemata do not have regularities one expects
from the multidimensional model, explorations may not be
expressed through roll-up or drill-down operations, SQL queries
may deviate from the traditional star-join pattern commonly
used for analytical purpose, etc.</p>
      <p>
        In this paper we present an extension of our previous work
[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] for learning analysis patterns in SQL workloads. In
particular, we reuse and extend similarity measures for comparing
queries and explorations, and pair them with clustering
algorithms. Contrarily to [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], which uses hierarchical clustering,
we combine UMAP and DBSCAN methods, which proved to be
well adapted to complex sequences [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The obtained clusters
are then analyzed using several statistic and visual indicators,
allowing to characterize analysis behavior in each cluster.
      </p>
      <p>Our contributions include: (i) a representation of queries and
explorations in the space of SQL operations, including similarity
functions tailored for SQL queries and explorations (described in
Section 3), (ii) a statistical analysis of the SQLShare workload in
terms of queries and operations (Section 4), (iii) a proposal for
clustering SQL explorations, (Section 5), and (iv) a large analysis
of the obtained clusters, via many complementary statistical
and visual indicators, revealing several (common and specific)
patterns of users’ analysis behavior (Section 5).
2</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        In this section we briefly describe the SQLShare workload and
we present related work concerning workload analysis and
indicators tailored for analyzing sequences. Finally, we discuss
clustering algorithms adapted to sequences.
2.1
The SQLShare workload is the result of a Multi-Year
SQL-as-aService Experiment [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], allowing any user with minimal
database experience to upload their datasets on-line and manipulate
them via SQL queries. What the authors wanted to prove with
this experiment is that SQL is beneficial for data scientists. They
observed that most of the time people use scripts to modify or
visualize their datasets instead of using the SQL paradigm. Indeed,
most user needs may be satisfied by first-order queries, that are
much simpler than a script, but have the initial cost of creating
a schema, importing the data and so on. SQL-as-a-Service frees
the user of all this prior work with a relaxed SQL version.
      </p>
      <p>
        The SQLShare workload is composed of 11,137 SQL statements,
57 users and 3,336 user’s datasets. To the best of our knowledge,
as reported by the authors of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], this workload is the only
one containing primarily ad-hoc hand-written queries over
useruploaded datasets. As indicated in the introduction, hand-written
means that the query is introduced manually by a human user,
which reflects genuine interactive human activity over a dataset,
with consideration between two consecutive queries.
      </p>
      <p>
        The SQLShare workload is analyzed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], particularly to
verify the following assumption: "We hypothesized that SQLShare
users would write queries that are more complex individually and
more diverse as a set, making the corpus more useful for designing
new systems." . The authors showed empirically that the queries
in the SQLShare workload are complex and diverse. They also
analyzed the churn rate of SQLShare users and conclude that most
users exhibit a behavior that suggest an exploratory workload.
      </p>
      <p>
        Other SQL workloads, as SDSS [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] or REACT-IDA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
include SQL queries generated with specific GUI or applications.
Although generated SQL queries are less richer than hand-written
ones [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the approach presented in this paper can be applied to
these workloads, and to smaller ones, as those presented in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
2.2
      </p>
    </sec>
    <sec id="sec-3">
      <title>Workload analysis</title>
      <p>
        Other scientific domains close to Database, like Information
Retrieval or Web Search, have a long tradition of log analysis
aiming at facilitating the searcher’s task [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. Many works extract
features from queries or search sessions and use them to
disambiguate the session’s goal, to generate recommendations, to
detect struggling in sessions, etc. Since databases tend to be more
used in an exploratory or analysis fashion, as evidenced by the
SQLShare workload, it is not a surprise that many recent works
pay attention to the analysis of database workloads, in addition to
those works analyzing workload for optimization or self-tuning
purposes. We present some recent advances in this area,
diferentiating by the type of logs (OLAP logs and SQL logs).
      </p>
      <p>
        Analyzing OLAP explorations. Logs of OLAP analyses are
simpler than SQL ones in the sense that they feature
multidimensional queries that can easily be interpreted in terms of OLAP
primitives (roll-up, drill-down, slice-and-dice, etc.). In one of our
previous works [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], we proposed an approach for detecting
OLAP analyses phrased in SQL, by converting SQL queries into
OLAP queries and then checking if two consecutive queries are
suficiently close in terms of OLAP operations. Later, we used
supervised learning to identify a set of query features allowing to
characterize focus zones in OLAP explorations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], or to identify
queries that better contribute to an exploration [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In a more recent work [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], we analyzed OLAP workloads
devised by students and expert analysts, looking for sequences of
common operations performed together when analysing data. We
identified 11 analysis patterns corresponding to diferent analysis
behavior. For example, focused explorations (which regularly
increase the level of detail and filtering by adding drill-downs and
iflters), oscillating explorations (alternating drill-downs and
rollups with few filters), short explorations with few operations and
even explorations with repeated queries. We used a hierarchical
clustering algorithm, paired with a Contextual Edit Distance [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]
to cluster explorations representing the same behavior.
      </p>
      <p>
        The present work is a continuation of our previous work, in
particular [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The main diferences are that we make no
assumption about the type of queries in the workload (particularly, they
may not be multidimensional queries), and we have no ground
truth (i.e., no human manual inspection of each query) on the
workload.
      </p>
      <p>
        Analyzing SQL logs. SQL workload analysis has recently
attracted attention beyond query optimization, for instance for
query recommendation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], query autocompletion [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], or user
interest discovery [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. All these works use the SDSS workload
for their tests. In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Milo and Somech identify and generalize
relevant previous sessions, in the REACT-IDA workload, to
generate personalized next-action suggestions to the user. In [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], they
study interestingness measures for mining workloads. Embedded
SQL code is analyzed in [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] to measure its quality, mainly for
maintainability purpose. The authors quantify quality based on
the number of operators (joins, unions), operands (tables,
subqueries) and variables in the SQL code, experimenting with SQL
codes embedded in PL/SQL, COBOL and Visual Basic.
      </p>
      <p>
        Jain et al. ran a number of tests on the SQLShare workload
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], some of them being reported above, showing the diversity
and complexity of the workload. In [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], Vashistha and Jain
analyze the complexity of queries in the SQLShare workload, in
terms of the following query features: number of tables, number
of columns, query length in characters, numbers of operators
(Scan, Join, Filter), number of comparison operators (LE, LIKE,
GT, OR, AND, Count), and the query run-time. They define two
complexity metrics from these features: the Halstead measure
(traditionally used to measure programs complexity) and a linear
combination whose weights are learned using regression.
      </p>
      <p>
        Finally, a recent work investigated various similarity
metrics over SQL queries, aiming at clustering queries [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for
better workload understanding. Queries are issued separately, not
within explorations, and are compared in terms of query
structure, not in terms of SQL operations w.r.t. previous queries. Thus,
they capture users interests (e.g. which attributes are projected),
not the way user navigates among data. The authors run their
tests on smaller SQL workloads.
      </p>
      <p>To our knowledge, this is the first attempt to learn human
analysis behavior in SQL workloads.
2.3</p>
    </sec>
    <sec id="sec-4">
      <title>Indicators for sequence analysis</title>
      <p>
        Other research communities, in particular mobility science, study
human behavior represented as sequences of actions. Data
exploration can be viewed through the prism of mobility science [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Indeed, an exploration is a sequence of user’s queries, where the
movement is no longer conducted in space but in the data space.
      </p>
      <p>
        Thus, many indicators proposed for the analysis of mobility
sequences can be reused or adapted for the study of sequences
of queries. Mobility researchers explored sequences of activities
and tested the existence of simple universal rules underlying
human movement like travel distance, top ranked visited locations,
predictability of human activity and origin-destination flows,
Techniques
Length distribution
State distribution
mainly studying recurring patterns/regularity in the sequence or
clustering mobility behavior ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] presents an important survey).
In substance, results show that mobility is strongly characterized
by exponential distribution (e.g. heavy-tailed, Zipf) and people
constantly exploit a small set of repeatedly visited locations.
      </p>
      <p>Inspired by these considerations, we propose to adapt a set
of indicators from mobility mining to analyse data explorations.
These complementary techniques, summarized in Table 1,
highlight diferent aspects of explorations.</p>
      <p>
        This capacity to explain models, both for practical and ethical
issues, is a crucial point for the understanding of machine
learning models. With this aim in mind, Guidotti and al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] suggested
some techniques, partially borrowed from these above, like
statistical methods and prototype selection elements, to explain black
box systems in order to make their results more interpretable and
understandable. In line with the vision of these techniques, we
believe that the elaboration of indicators is essential to understand
and explain discovered behavior in complex clusters.
2.4
      </p>
    </sec>
    <sec id="sec-5">
      <title>Clustering methods</title>
      <p>
        The extraction of behavior from a dataset is a process usually
performed thanks to unsupervised machine learning. Indeed,
clustering methods are widely used for the discovery of human
behavior in datasets representing sequences of elements, in
particular in sequences of mobility [
        <xref ref-type="bibr" rid="ref14 ref23 ref27">14, 23, 27</xref>
        ].
      </p>
      <p>
        Clustering methods are based on similarity measures. A
pairwise comparison of sequences results in a distance matrix that
is the input of the clustering process. Many methods have been
proposed for computing the similarity of categorical sequences.
Most of the approaches are based on Optimal Maching (OM)
methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], typical measures include those of the Edit Distance
family (see [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] for a review of methods and similarity measures).
In particular, the Contextual Edit Distance (CED) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] is a
generalization of Edit Distance, conceived for the comparison of
semantic sequences (an overview of CED measure is given in
Subsection 3.3).
      </p>
      <p>
        However, the topology created by similarity measures for
sequences is hard to apprehend. In particular, for OM methods,
spaces are often not euclidean nor metric. To the best of our
knowledge, the clustering algorithms able to deal with arbitrary
distances (not necessarily metrics) are PAM [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] (or K-medoid),
hierarchical clustering [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], density clustering (DBSCAN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
OPTICS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) and spectral clustering [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], each one making diferent
hypothesis about cluster topology.
      </p>
      <p>
        According to the similarity measure and the representation of
the sequences, dimensionality reduction methods can be used in
order to extract primary dimensions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, commonly
used methods like PCA can only be used for Euclidean spaces
in practice. Alternatively, methods like UMAP [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], allow the
reduction of a complex topology defined by an arbitrary metric
into a low Euclidean space, which facilitates the visualisation
of clustering results and enable the usage of other clustering
methods, in particular, those requiring an Euclidean space like
K-means [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. In addition, UMAP ofers a better preservation
of the data global structure, fewer hyperparameters to tune and
better speed than previous techniques like t-SNE [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], we empirically compared several clustering methods
and similarity measures, in order to find the most adapted to
sequences of semantic elements. The combination of CED measure
and UMAP reduction, paired with K-means, Spectral or DBSCAN
algorithms, outperformed all other combinations of methods.
3
      </p>
    </sec>
    <sec id="sec-6">
      <title>EXPLORATION MODEL</title>
      <p>This section introduces the description of queries and
explorations used all along the paper as well as their representation in
a space of SQL operations.</p>
      <p>
        The SQLShare workload contains 11,137 SQL statements, among
which 10,668 correspond to SELECT statements. The remaining
statements (mainly updates, inserts and deletes) were filtered.
This workload was fragmented in 2,809 explorations containing
among 1 and 98 queries [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
3.1
      </p>
    </sec>
    <sec id="sec-7">
      <title>Query and exploration abstractions</title>
      <p>
        In what follows, we use the term query to denote the text of
an SQL SELECT statement. In [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], queries are represented as a
collection of fragments extracted from the query text, namely,
projections, selections, aggregations and tables. We extend such
representation adding group by and order by sets. These
fragments abstract the most descriptive parts of a SQL query, and
are the most used in the literature (see e.g., [
        <xref ref-type="bibr" rid="ref16 ref21 ref8">8, 16, 21</xref>
        ]). But note
that we do not restrict to SPJG (selection-projection-join-group)
queries. Indeed, we consider all queries in the SQLShare
workload, some of them containing arbitrarily complex chains of
subqueries.
      </p>
      <p>Definition 3.1 (Query). A query over database schema  is a
6-uple  = ⟨, , ,  , , ⟩ where:
(1)  is a set of expressions (attributes or calculated
expressions) appearing in the main SELECT clause (i.e. the
outermost projection). We deal with * wild card by replacing
it by the list of attributes it references.
(2)  is a set of atomic Boolean predicates, whose
combination (conjunction, disjunction, etc.) defines the WHERE
and HAVING clauses appearing in the query. We
considered indistinctly all predicates appearing in the outermost
statements as well as in inner sub-queries.
(3)  is a set of aggregation expressions appearing in the main</p>
      <p>SELECT clause (i.e. the outermost projection).
(4)  is a set of tables appearing in FROM clauses (outermost
statement and inner sub-queries). Views, sub-queries and
other expressions appearing in FROM clauses are parsed
in order to obtain the referenced tables.
(5)  is a set of expressions appearing in GROUP BY clauses
(outermost statement and inner sub-queries).
(6)  is a set of expressions appearing in ORDER BY clauses
(outermost statement and inner sub-queries).</p>
      <p>Note that although we consider tables, selections, group by
sets and order by sets occurring in inner sub-queries, we limit to
the outermost queries for projections and aggregations, as they
correspond to attributes actually visualized by the user. We
intentionally remain independent of presentation and optimization
aspects, specially the order in which attributes are projected (and
visualized by the user), the order in which tables are joined, etc.
All the queries we considered are supposed to be well formed,
and so we do not deal with query errors.</p>
      <p>Finally, an exploration is a sequence of queries of a user.</p>
      <p>Definition 3.2 (Exploration). Let  be a database schema. An
exploration  = ⟨1, . . . ,  ⟩ over  is a sequence of queries
over . We note  ∈  if a query  appears in the exploration ,
and  () to refer to the exploration where  appears.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Query features</title>
      <p>
        For each query, we extract a set of simple features computed
from the query text and its relationship with previous query in
an exploration. The set of features is inspired from our previous
work [
        <xref ref-type="bibr" rid="ref24 ref6 ref7">6, 7, 24</xref>
        ], which models OLAP queries as a set of features
capturing typical OLAP navigation. It intends to capture the set
of SQL operations that express one query w.r.t. the previous one
(e.g. adding a projection, which means that a query projects an
additional attribute w.r.t. the previous one).
      </p>
      <p>Table 2 presents the considered features, where added (resp.,
deleted) indicates the modification made compared to the
previous query. In their definitions, let  = ⟨ ,  ,  ,  ,  ,  ⟩
be the query occurring at position  in the exploration  over
the instance  of schema . Features are computed
comparing the query  to the previous query in the exploration ,
−1 = ⟨−1, −1, −1, −1, −1, −1⟩. For the first query
of , i.e. 1, we consider as predecessor the "empty" query 0 =
⟨∅, ∅, ∅, ∅, ∅, ∅⟩. All the features are defined for  ≥ 1.</p>
      <p>In what follows, we represent a SQL query in the space of
query features, i.e. as a 12-dimensional vector, each position
corresponding to one of the features 1...12 described in Table 2.
This representation is at the core of our proposal for computing
the similarity between queries. It focuses in operations between
queries and is independent of the underlying database, i.e. a given
sequence of operations, even on diferent databases, will result
in the same sequence of query vectors.</p>
      <p>Definition 3.3 (Query vector). Let  be a query and ′ its
predecessor in an exploration. A query vector is a 12-dimensional
vector  = ⟨1, ...12⟩ where  =  (, ′).</p>
      <p>Example 1. Consider an exploration 1 composed of 4 queries:
1: SELECT species FROM All3col;
2: SELECT species FROM All3col WHERE longitude &lt; 0;
3: SELECT species, longitude, latitude FROM All3col;
4: SELECT species, longitude FROM All3col ORDER BY species;</p>
      <p>Vector for 1, ⟨1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0⟩, indicates an added
projection (species) and an added table (All3col) w.r.t. the empty
query. Vectors for 2, 3 and 4 indicate the diferences w.r.t.
previous queries, an added selection (longitude &lt;0 ), 2 added projections
(longitude, latitude) with a deleted selection, and 1 deleted projection
with an added order by attribute (species): ⟨0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0⟩,
⟨2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0⟩, ⟨0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0⟩, resp. □</p>
      <p>As vectors are long and may have many 0-valued
coordinates, we concisely represent them by listing the occurring
operations (the ones not 0-valued) in the form "±", where  ∈
{, , ,  , ,  } and  ≥ 1 (omitted if 1). Letters and signs refer
to features (as abbreviated in Table 2) and  represents feature
magnitude. For instance, the queries of Example 1 can be noted
+P+T, +S, +2P-S, -P+O, resp.</p>
      <p>Remark that this representation captures both, the richness
of the exploration in terms of query fragments (projections,
selections, etc.), captured by the vector of the first query in the
exploration (which is compared to an empty query), and the
diferences among consecutive queries, captured by the vectors of the
following queries. As a consequence, in some explorations, the
norm of the first vector may be greater that those of the
following ones. For instance, vectors of exploration 9 of the SQLShare
workload are: +10P+T, +T-T and +S+T-T.</p>
      <p>Finally, in some analyses in Section 5, we focus on the type
of operation, disregarding the magnitude (e.g. how many
projections are concerned) and sign (addition or deletion). To this
end, we define aggregated 6-dimensional vectors (one
dimension per type of operation). Analogously, they can be concisely
represented with letters P,S,A,T,G,O.</p>
      <p>Definition 3.4 (Aggregated vector). Let  = ⟨1, ...12⟩ be a
query vector. An aggregated vector is a 6-dimensional vector
 = ⟨ 1, ... 6⟩ where  = 1 if (2−1 &gt; 0) or (2 &gt; 0), 0 else.</p>
      <p>Example 2. Aggregated vectors for queries in Example 1 are:
⟨1, 0, 0, 1, 0, 0⟩, ⟨0, 1, 0, 0, 0, 0⟩, ⟨1, 1, 0, 0, 0, 0⟩, ⟨1, 0, 0, 0, 0, 1⟩, resp.
In concise notation, they are sketched: PT, S, PS and PO. □
3.3</p>
    </sec>
    <sec id="sec-9">
      <title>Query and exploration similarity</title>
      <p>
        We use cosine similarity for computing similarity between query
vectors. This measure is well suited to compute the similarity
between two vectors and is normalized in [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ]. In this way, it
favors more the nature of SQL operations than their number.
To deal with null vectors, which are frequent in the SQLShare
dataset (see Section 4), we set border cases as follows: (i) two
null vectors are considered identical (similarity is 1), and (ii) one
null vector is considered as completely diferent from a non-null
vector (similarity is 0). Formally, given two query vectors  and
 ′, cosine similarity is calculated as follows:
1 if ∥ ∥ = 0 and ∥ ′ ∥ = 0

cos(,  ′) = 0 if ∥ ∥ = 0 or ∥ ′ ∥ = 0 (1)
 ∥∥·∥′′ ∥ else
      </p>
      <p>In order to compare explorations, we pair CED measure with
the cosine similarity measure among query vectors.</p>
      <p>CED is a generalization of the Edit Distance, adapting cost
computation to typical characteristics of semantic sequences. In
particular, CED answers the following requirements: (i) edition
cost depends on the similarity of nearby elements (the more
similar and closer the elements, the lower the cost of operations),
(ii) edition of repeated close elements has low cost, and (iii) similar
and close elements can be exchanged with a low cost.</p>
      <p>
        We describe CED computation as defined in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and tuned
in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Firstly, CED modifies the cost function  of Edit
Distance to take into account the local context of each element in
the sequence. Consider contextual edit operations of the form
 = (, , , ), denoting the operation  ∈ {add, mod, del} on
exploration  = ⟨1, ... ⟩ at index  by query . Let O be the
set of all possible contextual edit operations, the cost function
 : O → [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ] is defined as:
 () = 1 −
max { (, ) ×  ()}
 ∈⟦1,⟧
(2)
where:  is a similarity measure between two queries,
computed as the cosine similarity of query vectors, and  () ∈ [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ]
is a contextual vector which quantifies the notion of proximity
between queries. Usually, bigger | −  | is, lesser  (). As in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
1 2√+1(−) 2!
we use:  () = exp − 2 | |
      </p>
      <p>
        CED is computed as Edit Distance, using dynamic
programming and Wagner-Fisher algorithm [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ].
      </p>
      <p>In next sections we describe how this representation of queries
and operations is used for profiling the SQLShare workload and
clustering explorations.
4</p>
    </sec>
    <sec id="sec-10">
      <title>DATASET PROFILING</title>
      <p>The SQLShare workload contains 2,809 explorations, totalizing
10,668 queries. Length of explorations follows the Zipf’s law.
Indeed, 1,379 explorations are one-shot (i.e. they contain only one
query), median exploration contains 2 queries and the longest
one contains 98 queries. Figure 1a shows the boxplot of the
distribution.</p>
      <p>The number of operations in a query (w.r.t. previous query)
also follows the Zipf’s law. Noticeably, 1,289 out of 10,668 queries
have no operations w.r.t. previous query. This happens when a
query is identical to previous one (947 queries) but also when
there are only visualisation or optimisation changes (e.g.
changing the order of projected attributes or joined tables) and when
changes concern advanced options not captured by our features
(e.g. changing a regular join by an outer join, or changing the
ascending/descending sense of ordering). Mean query has 4
operations and the longest one has 510 operations. The latter
correspond to a query of the form "SELECT * FROM T" where T
contains a large number of columns. Figure 1b shows the
boxplot of the distribution. We notice that many outliers correspond
to large first queries (i.e. containing many fragments, specially
projections) which lead to long query vectors when compared to
the empty query 0.</p>
      <p>Most frequent operations are adding and deleting projections
(+P and -P), adding tables (+T) and adding selections (+S). Less
frequent operations concern group by (+G and -G) and order by
(+O and -O). Figure 1c shows the complete distribution.</p>
      <p>Figures 1d and 1e complement the distribution of operations by
highlighting the combinations of operations that are frequently
performed together. Precisely, Figure 1d shows the top 10 most
frequent query vectors, evidencing that null vectors (∅) are the
most frequent, followed by a change in selections (+S-S) and a
change in projections (+P-P). Some frequent vectors are
surprising, as changing one table without changing anything else (+T-T).
This behavior corresponds to users updating and uploading a
dataset and then evaluating the same query in the new dataset.
In addition, Figure 1e shows the top 10 most frequent aggregated
query vectors, and consequently illustrating the most frequent
types of operations disregarding vector magnitude and sign. Most
frequent aggregated vectors concern changes in projections,
selections and tables (PST) and subsets of these operations (PT, P
and S). Interestingly, null vectors (∅) come in fifth position. These
top 10 aggregated vectors cover 9,205 queries (82.3% of the total
number of queries).</p>
      <p>Figure 1f goes a step forward showing the main flows 2 between
aggregated query vectors, by means of a Chord diagram. A flow
⟨, ⟩ indicates queries with vector A followed by queries with
vector B. It is represented by an arrow (the origin being closer
to the external circle), whose magnitude indicates its frequency.
For example, the flow from  to  (in purple) represent 
vectors followed by  vectors. We can observe that many
autolfows (e.g. for ,  ,  ). Interestingly, many null vectors (∅) are
followed by other null vectors or by simple changes ( and ).
2Main flows are the ones such that the number of transitions is greater than 5% of
the biggest flow.
frequency
frequency</p>
      <p>
        5.1.1 Implementation and setings. We used the CED
implementation and setting described in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], which is recalled in
Subsection 3.3. Concerning UMAP, we use the umap-learn python
library 0.4.3, where _ is set to 0.01, _ℎ to 200 (i.e.
around 10% of the dataset) and pseudo random number generator
seed is 42. Finally, according to the UMAP projection, we use the
DBSCAN clustering algorithm from the sklearn python library
0.22.2, applied on the previous UMAP embedding, with  = 1
and _ = 10.
      </p>
      <p>All experiments are available and can be reproduced by
running our Python notebook3 in Google Colab or Jupyter
environments. In particular, all code generating the graphs, the dataset
3https://colab.research.google.com/drive/1Yt7Q7AFghkcxdea2UicccMCmkaX7dRMD?
usp=sharing
+P -P +S -S +A -A +T -T +G -G +O -O</p>
      <p>
        5.1.2 Protocol description. To further justify our choice of
methods, we performed some preliminary tests on workloads
of explorations having a ground truth. Specifically, we used the
workloads of OLAP explorations described in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], and obtained
comparable results for the artificial dataset and improved results
for the explorations of real users (Ipums dataset). These results
are available in the notebook; we omit them here for lack of space.
Remark that as features are lowly correlated (correlation matrix
is shown in Figure 2), we decided to keep all of them. A PCA
analysis confirmed this choice.
-20
0
20
40
      </p>
      <p>C1</p>
      <p>C2</p>
      <p>C3</p>
      <p>C4</p>
      <p>C5</p>
      <p>C6</p>
      <p>C7
C1</p>
      <p>C2</p>
      <p>C3</p>
      <p>C4</p>
      <p>C5</p>
      <p>C6</p>
      <p>C7</p>
      <p>C1</p>
      <p>C2</p>
      <p>C3</p>
      <p>C4</p>
      <p>C5</p>
      <p>C6</p>
      <p>C7</p>
      <p>Finally, given the large number of one-shot explorations, as
shown in Figure 1a), we decided to test two clustering
configurations: (i) whole clustering (on the whole dataset), and (ii) restricted
clustering (excluding one-shot explorations). Indeed, the unique
query of such explorations, when compared to the empty query
(0), are 0-valued for features 2, 4, 6, 8, 10 and 12 (which
count the deleted query fragments), introducing a bias. The
restricted clustering aims to further analyse longer explorations,
revealing richer patterns.</p>
      <p>Later in the section, we describe the results of both clustering
configurations.
5.2</p>
    </sec>
    <sec id="sec-11">
      <title>Results of the whole clustering</title>
      <p>The clustering of the whole dataset resulted in 7 clusters. Figure
3a plots the UMAP reduction of the dataset to a 2D Euclidean
space. Clusters had varying sizes. Indeed, there are 3 large clusters
(1 to 3), which explorations exhibit frequent behavior, and 4
small clusters (4 to 7) concerning less frequent behavior.</p>
      <p>As expected, the length of explorations had a big impact in
clustering results. As shown in Figure 3b, clusters 1 and 2
contain only explorations having at least 2 queries while the
remaining clusters contain a majority of one-shot explorations.
Indeed, clusters 4 to 6 contain some explorations of length
2, and cluster 3 contains some longer ones. Cluster 7 only
contains one-shot explorations. On average, cluster 1 contains
longer explorations than 2, including the longest ones.</p>
      <p>The overall distribution of operations for clusters 1 and 2
is very similar (see Figure 3d), however, clusters diferentiates in
the number of operations among consecutive queries (captured
by the ℓ1 norm of query vectors, shown in Figure 3c) and in their
lfows. Although explorations in cluster 1 are longer than those
in cluster 2, they have less operations (median=2). Furthermore,
the most frequent aggregated query vectors, listed in Table 3,
confirm this observation. In particular, aggregated vectors in
1 include most of the null vectors, but also many vectors
representing only one type of operation (esp.  and  ). However,
many frequent vectors in cluster 2 concern many operations.
We further analyse these two clusters in next subsection.</p>
      <p>Cluster 3 is the largest one and contains a majority of
oneshot explorations. Queries in its explorations contain many
operations (median=6), in majority projections. There are two frequent
aggregated vectors:  and  .</p>
      <p>Clusters 4 to 7 are smaller, contain mostly one-shot
explorations, but evidencing very diferent behavior. Queries in cluster
4 involve many operations (median=6), concerning many
selections and tables. The most frequent aggregated vector is  .
Queries in cluster 5 also involve many operations (median=6),
concerning many projections and aggregations, some grouping
)
e
l
a
sc102
g
o
l(
m
r
o
n1101
ℓ
y
r
e
u
Q
100</p>
      <p>D1</p>
      <p>D2</p>
      <p>D3</p>
      <p>D4</p>
      <p>D5</p>
      <p>D6</p>
      <p>D1</p>
      <p>D2</p>
      <p>D3</p>
      <p>D4</p>
      <p>D5</p>
      <p>D6
but few selections. The most frequent aggregated vectors are
 and   . There are two types of queries in cluster 6, as
evidenced in Figure 3c. Most queries involve only 2 operations
(+P+T), the others concern multiple operations (multiple tables
and many projections). The most frequent aggregated vector
is  . Cluster 7 is the smaller one (only 15 explorations). Its
queries have fewer operations, concerning many projections and
ordering. The most frequent aggregated vector is  .</p>
    </sec>
    <sec id="sec-12">
      <title>5.3 Results of the restricted clustering</title>
      <p>The restricted clustering resulted in 6 clusters, plot in Figure 4a.
There are two well diferentiated clusters ( 1 and 5) and a dense
zone including a large cluster (2) and 3 smaller ones (3, 4
and 6). It is here, that DBSCAN best exploits the space topology.
Cluster analysis is shown in Figure 4 and the main flows (for the
bigger clusters) are shown in Figure 5.</p>
      <p>Explorations coming from cluster 1 are distributed between
cluster 1 and 5 (excepting 1 exploration that goes to cluster</p>
      <p>PS
70 140 210 280 0 70 140 210PST
280
350
2). Cluster 1 contains the longest explorations and the highest
median number of explorations. Its queries concern operations
of varied types, most of them limiting to 1 or 2 operations.
Frequent aggregated vectors are ∅, ,  and  . Flows illustrates
many repetitions of the same operations but also the alternation
of operations. Cluster 5 contains shorter explorations. 62 % of
queries are identical to previous ones (empty vector). In the
remaining ones, projections are predominant, with some selections
and tables. Frequent aggregated vectors are ∅,  and  . Flows
evidence that first queries in the explorations have typical vectors
(e.g.  ,  ) and next queries are identical (∅).</p>
      <p>Explorations coming from cluster 2 are distributed among
clusters 2, 3, 4 and 6, in the dense zone. In addition, the 40
explorations coming from the other clusters (3 to 6) mainly
goes to 2. The four clusters contain, in average, shorter
explorations than cluster 1, with more operations per query. Queries
in cluster 2 concern operations of varied types, most of them
being projections. Frequent aggregated vectors are  ,  , 
and . Flows evidence that explorations concern chains of the
same operations (visible in the autoflows). There are more
selections in queries of cluster 3, frequent aggregated vectors
being ,  and  . Flows show that first queries (mainly with
vectors  ,  and  are followed by chains of selections, and
some marginal projections. Queries in cluster 4 concern many
changes in tables, and more aggregations that previous clusters.
Frequent aggregated vectors are  ,  and  . Finally, queries
in cluster 6 concern most of the order by operations, but all
types of operations are present. The most frequent aggregated
vector is .</p>
    </sec>
    <sec id="sec-13">
      <title>5.4 Learned behavior</title>
      <p>The analysis of both clustering configurations allowed the
discovery of several patterns, representing common or less-frequent
behavior. The most prominent aspects of each pattern are
summarized in Table 4. This section briefly highlights our findings.</p>
      <p>Firstly, 49% of explorations are one-shot. They diferentiate in
the predominant operations in the unique query. We discovered
5 patterns. The most common one (3) consist in evaluating a
simple query, projecting many attributes, possibly to verify that
the dataset was correctly uploaded or just looking at the data.</p>
      <p>Less frequent patterns, also concerning the evaluation of a
simple query, diferentiate in the used SQL operations, namely, many
selections (4), aggregation and grouping (5), join of multiple
tables (6), and ordering (7). The latter is an outlier behavior,
only concerning 15 explorations. These patterns suggest a more
specific analysis of data (w.r.t. the common behavior in 3),
taking advantage of more SQL operations. This may reflect users’
preferences on some SQL clauses, but may also reflect users’
expertise.</p>
      <p>The remaining 51% of explorations contain between 2 and 98
queries, median being 4 queries. We discovered 6 patterns:</p>
      <p>A common pattern (1) reveals long explorations, with few
operations per query, sometimes repeating queries, which
translate a focused data analysis. Many types of operations are used,
but mostly once per query, suggesting a conscious use of SQL.</p>
      <p>Another common pattern (2) reveals short explorations, with
more operations per query. Projections are omnipresent, but
frequently combined with other operations. What is interesting here,
is the chaining of the same types of operations along the
exploration. It can be exploited for providing personalized suggestions
to users.</p>
      <p>Two interesting but less frequent patterns (3 and 5) concern
a classical first query, followed by chains of selections ( 3) or
repeated queries (5). In both cases, explorations are shorter
than in 1 but reveal some kind of analysis. While 3 suggest
a meticulous study of the dataset, 5 includes many novices
users trying to understand how SQL works. A similarly but more
complex pattern (6) involve more richer first queries, followed
by changes in the ordering of projected expressions. In addition
to a good use of SQL, this behavior may correspond to users
looking for the best way of reporting data.</p>
      <p>The last pattern (4), also less frequent, exhibits a particular
behavior. It concerns many changes in the datasets (frequently,
the unique operation in the query is a change in the FROM clause).
This corresponds to the upload of a new dataset and the execution
of the same query on the new dataset, and suggests data analysts
dealing with quality issues in their datasets.
6</p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION AND FUTURE WORKS</title>
      <p>This paper presented an original solution to learn analysis
behavior in SQL workloads. The understanding of users’ analysis
patterns has great implications for query recommendation,
monitoring, optimization and, more generally, providing better IDE
support. The proposal includes an abstraction of queries and
explorations in the space of SQL operations, a set of similarity
functions tailored for SQL queries and explorations, and an
innovative clustering process taking advantage of UMAP reduction
for analysing a complex space.</p>
      <p>The approach was tested on a real workload, SQLShare,
allowing the extraction of 11 analysis patterns including 3 typical
behaviors: one-shot simple explorations, short exploratory
explorations, and longer more focused ones, but also less-frequent
behavior evidencing the punctual use or the chaining of
specific SQL operations. We believe that the identification of such
behavior should be at the kernel of more intelligent IDE tools.</p>
      <p>
        In this paper we used a large palette of indicators for profiling
the workload and analyzing the obtained clusters (some
additional ones are described in our notebook). In next future, we
would like to test additional indicators, specifically concerning
how focused are the explorations (i.e. distinguishing flows at the
beginning and end of explorations), and how complex are queries,
both in terms of expressiveness and usage of advanced clauses
and functions (here, we also need to extract additional features).
In addition, we would like to classify users according to their
analysis behaviors. SQLShare workload, with its 57 users and
their 3,336 datasets, is a rich source for further experiments. Of
course, there are many one-shot users (already reported in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]),
but our preliminary analyses reveal very interesting behavior.
      </p>
      <p>
        Finally, we would like to test our proposal in further workloads,
specially those including queries generated by bots, as SDSS.
Authors of [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] acknowledge the dificulty of extracting human
sessions from all those collected: "We failed to find clear ways to
segment user populations. [...] Interactive human users were 51%
of the sessions, 41% of the Web trafic and 10% of the SQL trafic.
We cannot be sure of those numbers because we did not find a very
reliable way of classifying bots vs mortals." Developping tools
helping in the recognition and analysis of hand-written queries
is a nice challenge.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbott</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsay</surname>
          </string-name>
          .
          <article-title>Sequence analysis and optimal matching methods in sociology: Review and prospect</article-title>
          .
          <source>SMR</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Acar</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Motro</surname>
          </string-name>
          .
          <article-title>Why is this user asking so many questions? explaining sequences of queries</article-title>
          . In DBSec,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ankerst</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Breunig</surname>
            ,
            <given-names>H.-P.</given-names>
          </string-name>
          <string-name>
            <surname>Kriegel</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Sander</surname>
          </string-name>
          .
          <article-title>Optics: ordering points to identify the clustering structure</article-title>
          .
          <source>ACM Sigmod</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barthelemy</surname>
          </string-name>
          , G. Ghoshal,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenormand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Louail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Menezes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Ramasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Simini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tomasini</surname>
          </string-name>
          .
          <article-title>Human mobility: Models and applications</article-title>
          .
          <source>Physics Reports</source>
          ,
          <volume>734</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          and
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Narasayya</surname>
          </string-name>
          .
          <article-title>Self-tuning database systems: A decade of progress</article-title>
          .
          <source>In VLDB</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Djedaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Drushku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Labroche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Verdeaux</surname>
          </string-name>
          .
          <article-title>Automatic assessment of interactive OLAP explorations</article-title>
          . Inf. Syst.,
          <volume>82</volume>
          :
          <fpage>148</fpage>
          -
          <lpage>163</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Djedaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Labroche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          .
          <article-title>Detecting user focus in OLAP analyses</article-title>
          .
          <source>In ADBIS'</source>
          <year>2017</year>
          , Nicosia, Cyprus,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eirinaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abraham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Shaikh</surname>
          </string-name>
          . Querie:
          <article-title>Collaborative database exploration</article-title>
          .
          <source>TKDE</source>
          ,
          <volume>26</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1778</fpage>
          -
          <lpage>1790</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
          </string-name>
          , et al.
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>Kdd</source>
          ,
          <volume>96</volume>
          (
          <issue>34</issue>
          ):
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          .
          <article-title>A survey of methods for explaining black box models</article-title>
          .
          <source>ACM CSUR</source>
          ,
          <volume>51</volume>
          (
          <issue>5</issue>
          ),
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hägerstraand</surname>
          </string-name>
          .
          <article-title>What about people in regional science? Papers in regional science</article-title>
          ,
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>7</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Papaemmanouil</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>Overview of data exploration techniques</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Halperin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Lazowska</surname>
          </string-name>
          .
          <article-title>Sqlshare: Results from a multi-year sql-as-a-service experiment</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. C.</given-names>
            <surname>González</surname>
          </string-name>
          .
          <article-title>Clustering daily patterns of human activities in the city</article-title>
          .
          <source>DMKD</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ):
          <fpage>478</fpage>
          -
          <lpage>510</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaufman</surname>
          </string-name>
          and P. Rousseeuw.
          <article-title>Finding Groups in Data: An Introduction to Cluster Analysis</article-title>
          . John Wiley &amp; Sons,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Khoussainova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          . Snipsuggest:
          <article-title>Contextaware autocompletion for SQL</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>22</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T. A.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Upadhyaya</surname>
          </string-name>
          .
          <article-title>Similarity metrics for SQL query clustering</article-title>
          .
          <source>IEEE Trans. Knowl</source>
          . Data Eng.,
          <volume>30</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2408</fpage>
          -
          <lpage>2420</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>L.</surname>
          </string-name>
          <year>v</year>
          . d. Maaten and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Visualizing data using t-sne</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>9</volume>
          :
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>J. MacQueen.</surname>
          </string-name>
          <article-title>Some methods for classification and analysis of multivariate observations</article-title>
          .
          <source>In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>281</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          . Umap:
          <article-title>Uniform manifold approximation and projection for dimension reduction</article-title>
          .
          <source>arXiv preprint arXiv:1802.03426</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Somech</surname>
          </string-name>
          .
          <article-title>Next-step suggestions for modern interactive data analysis platforms</article-title>
          .
          <source>In KDD</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Devogele</surname>
          </string-name>
          , and C. de Runz.
          <article-title>Clustering sequences of multi-dimensional sets of semantic elements</article-title>
          .
          <source>In SAC</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Devogele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Étienne</surname>
          </string-name>
          .
          <article-title>A contextual edit distance for semantic trajectories</article-title>
          .
          <source>In SAC</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chanson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Devogele</surname>
          </string-name>
          .
          <article-title>Learning analysis patterns using a contextual edit distance</article-title>
          .
          <source>In DOLAP</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>On spectral clustering: Analysis and an algorithm</article-title>
          .
          <source>In Advances in NIPS</source>
          , pages
          <fpage>849</fpage>
          -
          <lpage>856</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Böhm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Goldman</surname>
          </string-name>
          , G. Hinkel, and
          <string-name>
            <given-names>E.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Identifying user interests within the data space - a case study with skyserver</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Simini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rinzivillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Barabási</surname>
          </string-name>
          .
          <article-title>Returners and explorers dichotomy in human mobility</article-title>
          .
          <source>Nature communications</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Jun</surname>
          </string-name>
          .
          <article-title>A simple and fast algorithm for k-medoids clustering</article-title>
          .
          <source>Expert systems with applications</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>3336</fpage>
          -
          <lpage>3341</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Verdeaux</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Diakhaby</surname>
          </string-name>
          .
          <article-title>Detecting coherent explorations in SQL workloads</article-title>
          .
          <source>Inf. Syst., 92</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          .
          <article-title>Cubeload: A parametric generator of realistic OLAP workloads</article-title>
          . In CAiSE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>O.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abelló</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellatreche</surname>
          </string-name>
          .
          <article-title>Describing analytical sessions using a multidimensional algebra</article-title>
          .
          <source>In DaWaK'2011</source>
          , Toulouse, France,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>V.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Szalay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Raddick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boroski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lebedeva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yanny</surname>
          </string-name>
          .
          <article-title>Skyserver trafic report - the first five years</article-title>
          .
          <source>Technical report</source>
          ,
          <year>December 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Somech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ozeri</surname>
          </string-name>
          .
          <article-title>Predicting "what is interesting" by mining interactive-data-analysis session logs</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>H. van den Brink</surname>
          </string-name>
          , R. van der Leek, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Visser</surname>
          </string-name>
          .
          <article-title>Quality assessment for embedded SQL</article-title>
          .
          <source>In SCAM</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          . IEEE Computer Society,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vashistha</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Measuring query complexity in sqlshare workload</article-title>
          . https://uwescience.github.io/sqlshare/pdfs/Jain-Vashistha.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Wagner</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Fischer</surname>
          </string-name>
          .
          <article-title>The string-to-string correction problem</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ):
          <fpage>168</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Interactions with Search Systems</article-title>
          . Cambridge Univ. Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>