Learning Analysis Patterns using a Contextual Edit Distance Clement Moreau, Veronika Peralta, Patrick Marcel, Alexandre Chanson and Thomas Devogele University of Tours Blois, France firstname.lastname@univ-tours.fr ABSTRACT and are not appropriate to evaluate modern IDE systems [9]. This paper presents a proposal for learning users’ behavior pat- Identifying analysis patterns would allow to better model user’s terns when they interactively analyse data. Users’ explorations explorations and mimic such activities in benchmark workloads. (sequences of queries) are compared looking for subsequences Finally, we mention the detection of clandestine intentions [1] of common actions or operations performed by the users dur- as another potential benefit. Indeed, as reported by [1], query se- ing data analysis. We use a hierarchical clustering algorithm to quences may reflect such intentions, where users prefer to obtain retrieve groups of similar explorations. The main difficulty is information by means of sequences of smaller, less conspicuous to devise a similarity measure suitable to measure similarities queries to avoid direct queries which may disclose their true between sequences of human actions. We propose to use a Con- interests. The identification of typical analysis patterns may help textual Edit Distance (CED), a generalization of Edit Distance that distinguishing normal from clandestine intentions. manages context-dependent edition costs. CED compares two In this paper we deal with the identification of analysis pat- users’ explorations, making special emphasis in the similarity of terns in a log of explorations devised by real users. We consider queries with nearby queries in the exploration, which determines that an exploration is a coherent sequence of queries over a a local context. We test our approach on three workloads of real database schema, done by a user with the goal of fulfilling an users’ explorations, extracting common analysis patterns, both information need. In [19], Rizzi and Gallinucci described 4 re- in explorations devised by students and expert analysts. We also current types of user analyses and propose a tool for generating experiment on an artificial workload, generated with CubeLoad realistic explorations based on these usage types. Our goal is to [19], showing that our approach is able to identify the patterns go a step forward and learn more analysis patterns from the ex- imposed by the generator. To the best of our knowledge, this plorations of real users. Concretely, we aim to cluster together is the first attempt to characterize human analysis behavior in explorations showing similar analysis patterns. The idea workloads of data explorations. behind analysis patterns is to look for sequences of common actions or operations performed together when analysing data, as some kind of movements in a data space. From this point of 1 INTRODUCTION view, OLAP operations (e.g. drilling down, adding a filter, chang- Analyzing a database workload offers many practical interests, ing a measure) are first class citizens, while the actual analyzed from the monitoring of database physical access structures [5] data is less important. For example, we can retain that a user to the generation of user-tailored collaborative query recommen- performed a sequence of drills down, disregarding the dimension dations for interactive exploration [10]. There has been much that was drilled down or the semantics of the underlying data. attention lately devoted to the analysis of user past activities Explorations can be compared in such terms, i.e. to what extent to support Interactive Database Exploration (IDE) [12]. OLAP they share the same sequences of operations and evolve at the analysis is a particular case of IDE, that takes advantage of simple same level of aggregation and filtering. primitives like drill-down or slice-and-dice for the navigation of Many distances proposed to compare sequences, for example multidimensional data. These particularities enable the design of the Damerau-Levenshtein distance [6] or the Smith-Watermann approaches for characterizing user explorations in how focused algorithm [21], part of Edit Distance family, count the mini- they are [8], in how contributive a query is to the exploration [7], mum number of operations (modification, addition, deletion) or even in how to ensure that a sequence of analytical queries required to transform one sequence into the other. They are forms a coherent exploration [20]. Characterising user behavior particularly adapted for sequences of independent symbols, as while analysing data, i.e. learning the way users analyse data (the DNA sequences or strings. However, when symbols represent type and order of operations, the level of detail, the degree of human behavior, including homogeneous, interconnected and focus) is a step forward in the understanding of analysis activities. repetitive actions, an appropriate distance should satisfy other Identifying analysis behavior has several applications. The requirements. In particular, we want the following requirements: more natural one is a better support of IDE, for instance to under- stand users’ information needs, to identify "struggling" during (R 1 ) edition cost depends on the similarity of nearby symbols. the exploration, or to provide better query recommendations. (R 2 ) edition of repeated close symbols has little cost. Notably, IDE systems usually do not offer such facilities. The pre- (R 3 ) permutation of close symbols has little cost. diction of next analysis steps is particularly interesting, enabling Indeed, while edition cost is constant in classical Edit Distance, beforehand execution of probable queries and caching of results, for comparing interconnected actions, it should take context (i.e. as well as advanced optimization strategies. Another benefit is the nearby actions) into account. For example, removing a mea- the design of more realistic workloads for database benchmark- sure should be costly within a focused sequence of drills down ing. Classical benchmarks like TPC-H or TPC-DS poorly include and filters, while it should be cheaper inside a sequence with interactive exploration activities in their synthetic workloads, other changes in measures. In addition, sequences of filters should © Copyright 2020 for this paper held by its author(s). Published in the proceedings of be similar, no matter how many filters there are. Furthermore, DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT permuting operations should have little impact, e.g. filtering and 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). then drilling vs. drilling then filtering. Previous attempts made for measuring the similarity of se- • Slice All. Users are sometimes interested in navigating a quences of OLAP queries (like e.g. [3], that extends the Smith- cube by slices, i.e., repeatedly running the same query but Watermann algorithmm) were not designed to satisfy the stated with different selection predicates. Then, this templates requirements. We propose a Contextual Edit Distance (CED) generates sequences of unfilter/filter operations. specially designed to satisfy them. • Exploratory. The motivation for this template is the as- Our contributions, sketched in Figure 1 include: (i) a represen- sumption that several users, while exploring the cube in tation of queries and explorations in the space of OLAP opera- search of significant correlations, will be “attracted” by one tions, including a similarity function among OLAP operations surprising query and then evolve casually. So, explorations in such space (described in Section 3), (ii) a CED for comparing based on this template contain varied random operations. explorations considering context (Section 4), (iii) a proposal for • Goal Oriented. Explorations of this type are run by users clustering explorations, based on CED (Section 5), and (iv) a set who have a specific analysis goal, but whose OLAP skills of experiments showing that CED outperforms state of the art are limited so they may follow a complex path to reach distances allowing the learning of analysis behavior in varied their destination. Explorations of this template contain logs of explorations (Section 6). varied operations but converging to some specific point. Other works analyse real workloads and observe tendences 2 RELATED WORK or patterns. Aligon et al. [2] analyse a workload of explorations The recurrent types of user analyses described by Rizzi and devised by master students and observe some general tendence. Gallinucci [19] are the first attempt to define analysis patterns in They report the number of OLAP operations used between two OLAP workloads. Authors claim that obtaining real OLAP work- consecutive queries, the level of detail and the number of filters loads by monitoring the queries actually issued in companies of queries in the explorations, as well as the queries indicated and organizations is hard, and propose a parametric generator as relevant by the students. As general behavior, they highlight of OLAP workloads, CubeLoad, based on a four templates that that explorations are more focused at the end (i.e. the number of model recurrent types of user analyses: filters and level of detail increase along the exploration, while the • Slice And Drill. Following the default behavior of several number of OLAP operations between consecutive queries varies OLAP front-ends, hierarchies are progressively navigated before a final drop at the end of the exploration) and contain by choosing a member of a current group-by level, cre- more relevant queries at the end. The term focus is used as in [8]: ating a selection predicate on such member and drilling “When focused, an analyst would expect more precise queries, down on it. Therefore, explorations of this template con- related to what she is currently analyzing. On the contrary, when Base de trajectoires Base tain sequences of de and filter trajectoires drill-down operations. exploring the data, the analyst would prefer more diverse queries, spatio-temporelles riche sémantiquement for a better data space coverage." Djedaini et al. [7] analyse an- A other workload of explorations of master students and propose a T e1 = hq1 , q2 , ..., qm i e2 = hq1 , q2 , ..., qn i TS model for learning the quality of an exploration (and the skill of the user) based on a set of query features. .. . d : Q ⇤ ⇥ Q ⇤ ! R+ To our knowledge, our work is the first devoted to discover Métrique de comparaison sur = hq1 , q2 , ..., ektrajectoires les qp i sémantiques S ⇥ onT S ! R+ d : TMeasure exploration patterns in OLAP workloads. (b) Many recent works aimed at comparing queries and sessions. exploration sessions Exploration sessions We mention as good recent surveys [3] for OLAP queries and {e1 , e2 , ..., ek } ✓ Q⇤ explorations, and [4] for SQL queries. (a) Aligon et al. [3] also proposes two similarity measures: one Partitionnement - Regroupement des trajectoiresHierarchical en motifs tailored for OLAP queries and another tailored for OLAP sessions. de comportements clustering similaires Their measures were designed for satisfying other requirements, (c) in particular capturing the portions of the cube that are more explored for improving query recommendation. Consequently, similarity measures are based on common query parts (e.g. filters 2 1 and measures) more than common operations, and are strongly selection predicate … dependent on cube schema. To our knowledge, no similarity 6 5 Synthèse des motifs issus already proposed in the literature for comparing explorations 4 1 includes the requirements presented in the previous section. sclice and drill des partitions sclice all group by 2 … ... 3 3 EXPLORATION MODEL User behavior for data exploration This section introduces the description of queries and explo- rations used all along the paper as well as their representation in (d) a space of OLAP operations. exploratory goal oriented 3.1 Queries and explorations Figure 1: Overview of the approach: (a) explorations and In order to keep the formalism simple, we only take into account queries are represented in a space of OLAP operations, (b) cubes under a ROLAP perspective, described by a star schema computation of CED, (c) clustering of explorations based [14]. For convenience, we consider that a dimension consists of on CED, (d) extraction of patterns of user behavior (here, a unique hierarchy without branches, i.e., consisting of chains represented as the patterns of Cubeload). of levels. In this paper, we focus on multidimensional queries modeled as a collection of fragments extracted from the query Feature Description expression, as in [3]. NAL Number of added levels NDL Number of deleted levels Definition 3.1 (OLAP query). An OLAP query over cube schema NAF Number of added filters S is a triple q = ⟨G, P, M⟩ where: NDF Number of deleted filters (1) G = {д1, . . . , дj } is the query group-by set, each дi being NAM Number of added measures a level of a hierarchy of the cube; NDM Number of deleted measures (2) P = {p1, . . . , pk } is a set of Boolean predicates, of the form Adepth Aggregation depth l = v, with l a level and v a value. Compound predicates Fdepth Filter depth are build as the disjunction of predicates on a same level Table 1: Query features (l), at most one for each hierarchy, whose conjunction defines the selection predicate for q; (3) M = {m 1, . . . , ml } is the measure set whose values are returned by q. Number of Deleted Filters. N DF (qk , qk−1 ) counts the number of filters of qk−1 that are not longer used in qk . We intentionally remain independent of presentation and op- N DF (qk , qk−1 ) = |Pk −1 − Pk | (4) timization aspects, specially the order in which attributes are projected (and visualized), the order of joins, etc. Number of Added Measures. N AM(qk , qk −1 ) counts the num- Finally, an exploration is a coherent sequence of queries over ber of measures of qk that were not measures of qk −1 . a cube schema, devised by a user with the goal of fulfilling an N AM(qk , qk −1 ) = |Mk − Mk −1 | (5) information need that may not be well defined initially. Number of Deleted Measures. N DM(qk , qk −1 ) counts the num- Definition 3.2 (Exploration). Let S be a cube schema. An ex- ber of measures of qk −1 that are not longer used in qk . ploration e = ⟨q 1, . . . , qp ⟩ is a sequence of queries over S. We N DM(qk , qk −1 ) = |Mk −1 − Mk | note q ∈ e if q appears in the exploration e, and exploration(q) (6) to refer to the explorations where q appears. Aggregation Depth. Adepth(qk ) measures the granularity of qk in terms of the depth of each level in its hierarchy. It can 3.2 Query features be seen as the number of drills down necessaries for obtaining For each query, we extract a set of simple features computed from G k from the most aggregated group-by set. Let depth(дi ) be the the query text and its relationship with previous query in an ex- depth of level дi in the hierarchy hi to which it belongs (ranging ploration. The set of features is inspired from our previous works from 0 if дi is the top level of hi to |hi | − 1 if дi is the bottom [7, 8]. It intends to capture various aspects of OLAP navigation, level of hi ): Õ in particular the set of OLAP operations that express one query Adepth(qk ) = depth(дi ) (7) w.r.t. the previous one (e.g. a query being a drill down of the дi ∈G k previous one), the level of aggregation (i.e. how deep the query Filter Depth. Fdepth(qk ) measures the number of filters ap- is in the aggregation lattice) and the level of filtering (i.e. how pearing in qk . filtered is the data space). Table 1 presents an overview of the Fdepth(qk ) = |Pk | (8) features, where added (resp., deleted) indicates the modification In what follows, we represent an OLAP query in the space made compared to the previous query. of query features, i.e. as a 8-dimensional vector, each position For the following definitions let qk = ⟨G k , Pk , Mk ⟩ be the corresponding to one of the features described above. This rep- query occurring at position k in the exploration e over cube resentation is at the core of our proposal for computing the schema S. All the queries we considered are supposed to be similarity between queries and then between explorations. It well formed, and so we do not deal with query errors. Features focuses in operations between queries and is independent of the are computed comparing the query qk to the previous query underlying data cube, i.e. a given sequence of operations, even in the exploration e, qk−1 = ⟨G k −1, Pk −1, Mk −1 ⟩. For the first on different data cubes, will result in the same sequence of query query of e, i.e. q 1 , we consider as predecessor the "empty" query vectors. q 0 = ⟨∅, ∅, ∅⟩. All the following features are defined for k ≥ 1. Example 1. Consider an exploration e 1 composed of 4 queries: Number of Added Levels. N AL(qk , qk −1 ) counts the number of q 1 = ⟨{year }, ∅, {qty}⟩ – sales quantity per year; levels in the group by set of qk that were not part of the group q 2 = ⟨{year }, {year = “2019”}, {qty}⟩ – adds a filter; by set of qk−1 . q 3 = ⟨{year, country}, ∅, {qty}⟩ – unfilter, drill-down; N AL(qk , qk −1 ) = |G k − G k −1 | (1) q 4 = ⟨{year, city}, ∅, {qty, amount }⟩ – drill-down, adds measure; Vector for q 1 , ⟨1, 0, 0, 0, 1, 0, 1, 0⟩, indicates an added level (year) Number of Deleted Levels. N DL(qk , qk−1 ) counts the number and an added measure (qty) w.r.t. the empty query; last positions of levels in the group by set of qk −1 that are not longer used in correspond to aggregation and filter depths. Vectors for q 2, q 3 and q 4 qk . indicate the differences w.r.t. previous queries and the changes in ag- N DL(qk , qk −1 ) = |G k −1 − G k | (2) gregation and filter depths: ⟨0, 0, 1, 0, 0, 0, 1, 1⟩, ⟨1, 0, 0, 1, 0, 0, 2, 0⟩, ⟨1, 0, 0, 0, 1, 0, 3, 0⟩, resp. Number of Added Filters. N AF (qk , qk −1 ) counts the number Finally, we use cosine similarity for computing similarity be- of filters of qk that were not filters of qk −1 . tween queries. This measure is adapted to compute similarity of N AF (qk , qk−1 ) = |Pk − Pk −1 | (3) two vectors and is normalized in [0, 1]. In this way, it privileges the nature of OLAP operations and not their number. 4 CONTEXTUAL EDIT DISTANCE • add : (e, q x , k) 7→ q 1, ..., qk−1, q x , qk , ..., qn This section describes our proposal of Contextual Edit Distance: Insert query q x at index k. The queries at and after index definition and implementation issues. k are shifted forward. CED is a generalization of the Edit Distance that incorporates • del : (e, ε, k) 7→ q 1, ..., qk −1, qk +1, ..., qn the following requirements: Delete the query at position k. The queries after position k are shifted backward. (1) Context-dependent cost: Edition cost depends on the simi- larity of nearby queries. The more similar and closer the Given an operation o(e, q x , k), a context vector is a numeric queries, the lower the cost of operation. vector that indicates the level of influence of nearby queries for (2) Repetition: Edition of repeated close queries has low cost. this operation, being stronger near index k and softening farther. (3) Permutation: Similar and close queries can be exchanged We use the context vector for weighting the similarity between with a low cost. queries. Intuitively, the context vector quantifies the relationship between a query q x and another query qi . Thus, the greater vi , Example 2. Consider an exploration reflecting an exploratory the greater the impact of query qi on q x . A seed function is used behavior at the beginning (many changes in measures and group by to generate context vectors. set) and more focus at the end (drilling and filtering). We can sketch it as follows (where L, F and M means level, filter and measure, Definition 4.2 (Context function and context vector). Consider + means addition and - means deletion; we skip aggregation and a contextual edit operation o(e, q x , k) and a seed function fk : filter depth for simplicity): N → [0, 1] which holds the following properties: ⟨ +D+M, +M, +M, +D, +M-M, -M+D, -D+D, +F+D, +F+D, +F, +F ⟩. (1) fk (k) = 1. Consider the insertion of a query adding an additional measure (2) fk is a monotonically increasing function on ] − ∞, k]. (+M). The edition cost should be low if the query is inserted at the (3) fk is symmetrically centered on k. beginning (as it is similar to near queries), even lower at positions The third property guarantees to take with the same importance 2 to 4 (because repeating the same operations), but high at the end. previous and future queries located at equal distance from qk . This requirements ensure that explorations reflecting a given A context function φo : N∗ → [0, 1] is a transformation of fk pattern (e.g. sequences of drill-downs) are judged to be very stretching the function according to the type of operation o. We similar no matter the exploration length (i.e. how many drill- distinguish three cases for o in {mod, add, del}: downs) nor the underlying data (which data was drilled-down). • φ mod (x) = ( fk (x) We remark that although this paper deals with explorations, f (x + 1) if x ≤ k − 1 CED definition and properties are independent on the type of • φ add (x) = k fk (x) if x ≥ k explorations, the type of queries and even the nature of data. In- deed, CED can be adapted to any type of sequence on an alphabet  fk (x + 1)   if x ≤ k − 1 • φ del (x) = 0   of symbols provided that exists a similarity metric among them. if x = k  fk (x − 1) if x ≥ k + 1   4.1 Definition of Contextual Edit Distance  Finally, a context vector v : O → [0, 1]n is defined as The main contribution of CED is the modification of the cost function γ which generalizes the classical definition of Edit Dis- v(o) = ⟨v 1, . . . , vn ⟩ tance and takes into account the local context of each query in where vi = φo (i). the exploration. Intuitively, the cost of an edit operation will be lower if the edited query (e.g. a query to be added to an ex- About add and del contextual vector functions: ploration) is similar to nearby queries (e.g. the queries that are • For an insertion add : (e, q x , k), query qk −1 and qk are near the position where the query is added). This notion of local fully taken into account for the addition of q x in e because context is modeled as a context vector, which controls the zone of we insert q x between index k − 1 and k i.e. φ add (k − 1) = 1 influence of the context. This subsection presents these concepts and φ add (k) = 1. and a formal definition of CED. • For a deletion del : (e, ε, k), the absence of the query Let Q be the set of all possible queries over a cube schema S qk in e is quantified in relation to the remaining of e i.e. and Q ∗ the set of all possible explorations on Q. For the following φ del (k) = 0. definitions, let e = ⟨q 1, ..., qn ⟩ be an exploration, qk be a query in e, 1 ≤ k ≤ n, and q x be a new query to be edited in e. These 2 1 x −4 ( 2( 2 )) f4 = exp − notations are partially adapted from the formal language theory community and particularly from [22]. φ𝚊𝚍𝚍 CED extends the set of edit operations of Edit Distance (usually modification, addition, deletion) to take context into account. o = 𝚊𝚍𝚍 : (e, q x,4) Definition 4.1 (Contextual edit operation). A contextual edit operation o is a function o : (Q ∗ × Q ∪ {ε} × N) → Q ∗ whose arguments are an exploration, a new query to be included in the q1 q2 q3 q4 exploration (or none) and the index (position) in the exploration | {z } where the edition takes place. We consider the following set e O = {mod, add, del} of edit operations: Figure 2: Add a new query q x in position 4 in e. • mod : (e, q x , k) 7→ q 1, ..., qk −1, q x , qk +1, ..., qn Replace the query at index k by the query q x . Example 3. Consider the operation o = add(e, q x , 4), adding Definition 4.5 (One-sided Contextual Edit Distance). Let d˜C ED : a query at index 4 of an exploration. Figure 2 illustrates the com- Q ∗ × Q ∗ → R+ be the Contextual Edit Distance from e 1 to e 2 putation of φ add (plotted in blue) from a given seed function (in such that: red). The corresponding context vector is v(o) = ⟨0.61, 0.88, 1, 1⟩. It  Õ|P |   d˜C ED (e 1, e 2 ) = min γ (oi )     indicates that the similarity score of queries at indexes 3 and 4 is (10) fully considered (weight of 1) while a lower score is considered for P ∈ P(e 1 ,e 2 )   i=1     query at index 1. In this form, CED would be very similar to Hausdorff distance The cost function γ of CED generalizes the classical definition [11], but would remain asymmetric. This is why we use the same of Edit Distance and takes into account the local context of each trick as Hausdorff distance and we apply the max operator on query in the exploration. each one-sided contextual edit distance to recover the symmetry. Definition 4.3 (Cost function γ ). Given an operation o(e, q x , k), Definition 4.6 (Contextual Edit Distance). Let dC ED : Q∗ × a cost function γ : O → [0, 1] for the contextual edit operations Q ∗ → R+ be the Contextual Edit Distance such that: n o is defined as: dC ED (e 1, e 2 ) = max d˜C ED (e 1, e 2 ), d˜C ED (e 2, e 1 ) (11) γ (o) = α × δ (o)+   4.2 Implementation of CED (1 − α) 1 − max {sim(qi , q x ) × vi (o)} (9) We can compute dC ED using a Dynamic Programming approach i ∈[[1,n]] like the classical Edit Distance [22]. This solution has a polyno- where: mial complexity in O(n × p) and can be adapted easily for the • α ∈ [0, 1] is a contextual parameter. computation of CED. If α → 0 the contextual part is maximal and therefore the Algorithm 1 computes the context vector and the cost function distance between two queries will be strongly evaluated (as in Equation 9). Note that φo : N∗ → [0, 1] and α ∈ [0, 1] are according to the content of the exploration being edited ; fixed parameters and operator · represents vector concatenation. if α → (1 then cost of edition is fixed. Algorithm 2 computes the one-sided Contextual Edit Distance 1 − sim(qk , q x ) i f o = mod (Equation 10), and Algorithm 3 recovers the symmetry. • δ (o) = 1 else is the local cost of the Edit Distance. Algorithm 1: Cost function γ • sim : Q × Q → [0, 1] is a similarity measure between two Data: Contextual edit operator o : (e, q x , k). queries, computed as the cosine of query vectors. Result: Cost γ (o) of the operation o. v(o) ← ⟨⟩ Example 4. Consider the exploration e 1 of Example 1 and con- for i ∈ ⟦1, |e |⟧ do sider the insertion of query q x , with vector ⟨0, 0, 1, 0, 0, 0, 2, 1⟩ (ad- v(o) ← v(o) · ⟨φo (i)⟩ dition of a filter), at position 4, as shown in Figure 2, with α = 0.1. vsim ← ⟨⟩ Then, the cost γ (o) is such that: for i ∈ ⟦1, |e |⟧ do  cos(q 1, q x ) × φ add (1)    vsim ← vsim · ⟨sim(qi , q x ) × vi (o)⟩  cos(q 2, q x ) × φ add (2)  ©     ª® γ (o) = 0.1 + 0.9 ­1 − max if e = mod then ­    cos(q 3, q x ) × φ add (3)  ® ­  ® δ (o) ← 1 − sim(qk , q x )  cos(q 4, q x ) × φ add (4)    «  ¬ else    0.47 × 0.61   ª δ (o) ← 1  0.94 × 0.88  return α × δ (o) + (1 − α) × (1 − max(vsim )) ©  = 0.1 + 0.9 ­1 − max ­  ®  × ® ­   0.66 1   ® ×   «  0.74 1  ¬ = 0.1 + 0.9 × (1 − 0.83) = 0.25 Next, we prove that the computation of CED is polynomial in  The insertion of q x at position 2 (i.e. closer to other filter opera- time: tion) has cost 0.15. Theorem 4.7. CED is in O(n × p × max(n, p)) where |e 1 | = n and |e 2 | = p. Definition 4.4 (Edit path). Given two explorations e, e ′ ∈ Q ∗ , an edit path P = ⟨o 1, o 2, ..., om ⟩ from e to e ′ is a sequence of Proof : Let’s note T (Ai ) the big O time complexity of the operations that transform e in e ′ . We note P(e, e ′ ) the set of all algorithm Ai , with i ∈ {1, 2, 3}. edit paths to transform e in e ′ . So we have : • T (A1 (o)) = |e | where e is the edited exploration. An important point to be mentioned here is that, for efficiency • T (A2 (e 1, e 2 )) = n × p × T (A1 (o : e 1, q x , k)) reasons, the context is static, i.e. it only considers the original = n2 × p queries in exploration e. This means that the contextual edit • T (A2 (e 2, e 1 )) = n × p × T (A1 (o : e 2, q x , k)) operation oi has no impact on the cost of operation oi+1 . If it was = n × p2 otherwise, i.e., dealing with dynamic context, the edition problem • T (A3 (e 1, e 2 )) = T (A2 (e 1, e 2 )) + T (A2 (e 2, e 1 )) would have been NP-hard by reduction to Job shop scheduling = n 2p + p 2 n problem [13]. Also note that as the add operator is not the reverse ∈ O(n × p × max(n, p)) of del operator, the edition is asymmetric. Thus, by re-using the definition of the Edit Distance in [22], we can define the one-side As a result, the Algorithm 3 has a time complexity in O(n × p × distance from an exploration e 1 to an exploration e 2 . max(n, p)). □ Algorithm 2: One-sided Contextual Edit Distance d˜C ED all the participants were volunteers. They developed explorations for answering four analytical questions on the IPUMS cube. The Data: Exploration couple (e 1, e 2 ). IPUMS cube integrates data from the IPUMS (Integrated Public Result: One-sided Context Edit Distance d˜C ED (e 1, e 2 ) Use Microdata Series) website 1 . It is organized as a star schema D ← [0...|e 1 |][0...|e 2 |] with 5 dimensions, 12 (non-top) levels, 25 measures, and contains for i ∈ ⟦0, |e 1 |⟧ do 500,000 facts recorded in the fact table. From this experiment, for j ∈ ⟦0, |e 2 |⟧ do we reuse 27 explorations and 306 queries, with an average of 11 if i = 0 ∨ j = 0 then queries per exploration. D[i, j] ← i + j During a preliminary analysis of the workload, Aligon et al. else (2) labelled explorations distinguishing five analysis styles: o mod ← mod : (e 1, q j−1, i − 1) (1) • FOCUS. The exploration is more focused as time passes, o del ← del : (e 1, qi−1, i − 1) • OSCILLATE-FOCUS. The exploration is more exploratory (2) o add ← add : (e 1, q j−1, i − 1) (the levels of detail and filtering oscillate) at the beginning D[i, j] ← min{ but is more focused at the end, D[i − 1, j − 1] + γ (o mod ), • OSCILLATE. The exploration is always exploratory, D[i − 1, j] + γ (o del ), • FIX. The exploration keeps constant levels of detail and D[i, j − 1] + γ (o add ) filtering, } • ATYPICAL. The exploration has atypical or erratic behav- ior. Real explorations on open data. The second workload, hence- forth dubed Open, consists of navigation traces collected in the Algorithm 3: Contextual Edit Distance dC ED context of a French project on energy vulnerability. These traces Data: Exploration couple (e 1, e 2 ). were produced by 8 volunteer students of a Master’s degree in n Edit Distance dC ED (e 1o, e 2 ). Result: Context Business Intelligence, answering fuzzy information needs defined return max d˜C ED (e 1, e 2 ), d˜C ED (e 2, e 1 ) by their lecturer, to develop explorative OLAP navigations using Saiku2 over three cubes instances [7]. The main cube is organized as a star schema with 19 dimensions, 68 (non-top) levels, 24 mea- 5 CLUSTERING OF EXPLORATIONS sures, and contains 37,149 facts recorded in the fact table. The other cubes are organized in a similar way. From this experiment, Our objective is to cluster together explorations showing the we reuse 28 explorations and 941 queries, with an average of 34 same user behavior. To this end, we pair CED to an off-the-shell queries per exploration. A particularity of some third party OLAP clustering algorithm, and we test it against several workloads tools, like Saiku, is that their user interface submits a new query concerning users with varied analytical skills and different UI, for each user action (including intermediate drag-and-drops), aiming to discover different types of patterns. In addition, as some resulting in very long explorations in the log. Nevertheless, there datasets come with a ground truth, they allow for the quantifica- were some extremely short explorations (6 explorations counting tion of clustering quality and the comparison to state of the art less than 10 queries), which mainly correspond to incomplete distances. The following subsections describe the workload used studies. in experiments, the experimental protocol and implementation details. Real explorations on cyber security data. The third workload, henceforth dubed Security, consists of analysis sessions made by 5.1 Workloads real analysts in the context of the "Honeynet Project" [15]. 56 an- In our experiments, we reuse several workloads described in alysts specialized in the domain of cyber-security were recruited the literature [2, 7, 15] consisting of navigation traces of real (via dedicated forums, network security firms, and volunteer se- users on real data. We chose to test our proposal in several work- nior students from the Israeli National Cyber-Security Program) loads to avoid learning specific behavior of a set of users. The 3 and asked them to analyze 4 different datasets using a prototype workloads concern users with different analysis skills (students, web-based analysis platform. Each dataset contains between 350 experts), using different analysis tools (an OLAP tool, a research to 13K rows of raw network logs that may reveal a distinct se- prototype and an advanced IDE interface) and accessing data curity event, e.g. malware communication hidden in network cubes of different sizes and complexities. We are not aware of traffic, hacking activity inside a local network, an IP range/port other public analytical workloads, specially from senior analysts, scan, etc. (there is no connection between the tuples of different whose analysis activity is jealously guarded by companies [19]. datasets). The analysts were asked to perform as many analysis We also generated artificial explorations on artificial data using actions as required to reveal the details of the underlying security a state-of-the-art workload generator [19]. event of each dataset. They used a web-based analysis platform developed for the project [15]. Real explorations on ipums data. The first workload, hence- Even if there is no ground truth for this workload, it is inter- forth dubed Ipums, consists of navigation traces of OLAP users esting because queries were devised by expert analysts. collected during the testing phase of the development of Falseto [2], a tool meant to assist query and exploration composition, by Artificial explorations. The last workload, with artificial data, letting the user summarize, browse, query, and reuse former ana- comes from the Star Schema Benchmark [17], and was used with lytical explorations. The 17 OLAP users engaged in the test were 1 Minnesota Population Center. Integrated Public Use Microdata Series. students of two Master’s programs specialized in Business Intelli- http://www.ipums.org, 2008. gence. The test was not part of the programs, was not graded and 2 http://meteorite.bi/products/saiku artificial explorations. The Star Schema Benchmark (SSB) is a f1 f3 f5 variation of TPC-H, a popular benchmark from the Transaction Processing Performance Council (TPC). SSB cube consists of a relational database under the form of a star schema, with one fact table and 4 dimension tables. q 2q 1 Instead of using the rather limited SSB workload, we generated q1 q3 q4 q5 artificial explorations using CubeLoad [19]. 5.2 Protocol Figure 3: Example of context function with k = 1, 3, 5 and |e | = 5 In order to cluster explorations, we execute an off-the-shelf clus- tering algorithm using CED as distance function. For comparison, we execute the same clustering algorithm with two alternative distances: (i) the classical Edit Distance (henceforth dubed ED) as based on a Gaussian√ function with a standard deviation coeffi- a baseline, and (ii) Aligon et al.’s distance [3] (henceforth dubed cient equal to 2 |ek +1 | . This coefficient is interesting because, as AD), a state of the art metric for session similarity. We analyze it depends on k, it allows to vary context size along the explo- the obtained clusters under several angles: ration. In particular, when k is small (at the beginning of the Firstly, when we have some knowledge qualifying explorations, exploration, when user intents are less defined and behavior is even if it is not exactly a ground truth, we compare our results more exploratory), the standard deviation is high (i.e. the curve to such knowledge. For the artificial explorations we compare of fk is flattened) which allows to include in local context, some the obtained clusters with the Cubeload templates used for the queries that are far from the index k. On the other hand, when generation of the workload. This experiment aims to show that k → |e | (at the end of the exploration, when behavior is more our approach is able to cluster together all the explorations cor- focused), the curve of fk narrows around k reducing context size. responding to a given template. For the ipums workload we com- Figure 3 illustrates the context function for several values of k. pare to the preliminary labels assigned by Aligon et al. Actually, As we do not know, a priori, the form of clusters, nor their such labels are not a ground truth, as there were not produced density, we use a hierarchical clustering algorithm, which pro- with the goal of clustering explorations, but they may provide a vides more flexibility than hyper-spherical and density-based nice idea of the quality of the exploration. We report Adjusted algorithms. In addition, it outputs a dendrogram that allows to Rand Index (ARI) and V-measure (harmonic mean of clusters parameter the setting of number of clusters and eases the visual homogeneity and completeness) scores3 , and we compare our analysis of clusters. In order to find a good balance, we experi- clustering scores to those obtained with ED and AD distances. mentally combine some criteria to cut the dendrogram: relative Second, for the four workloads, we report further scores con- loss of inertia, cluster diameter and minimum number of clusters. cerning intrinsic cluster quality. Indeed, too few clusters will mix We use the Ward method as agglomeration criteria. different behaviors, too many clusters will overfit user behav- Finally, a correlation study among query features revealed that ior. We aim to balance: number of clusters, cluster diameter (the Fdepth was highly correlated with Adepth in all workloads. In distance between the farthest objects in the cluster) and mean consequence, we excluded Fdepth from query vector. Silhouette Coefficient (a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separa- 6 EXPERIMENTS tion)). Silhouette scores3 are merely informative in out tests, as In this section we report the results of the experiments conducted the metric is more adapted to hyper-spherical clusters. to validate our proposal. Finally, we study the medoids of each cluster (the exploration that is the most similar to all other explorations in the cluster) and we manually observe the OLAP operations of the medoid 6.1 Comparison against ground truth for providing an explanation of the concerned behavior pattern. In this experiment on Artificial and Ipums workloads, we com- pared the clusters obtained with our method to the available 5.3 Implementation and setting ground truth (i.e. the template used to generate each exploration Our methods are implemented in Python, using Scipy, Sklearn and the preliminary classification of Ipums). [18] and Matplotlib libraries. Code and data are available from Figure 4 (a and d) shows the obtained dendrograms. Explo- Github 4 . CED parameters were tuned and set as follows: rations (identified by numbers) are arranged in the horizontal axis, plotting similar explorations close (according to CED). Links • Cosine similarity is used to compare two queries in an indicate which explorations are clustered together, shorter links exploration. meaning more similar explorations (vertical axis reports dis- • α parameter is set to 0 to give fully priority to context. tances). Links of the same color represent a cluster, while dotted • We use the following seed function. links just indicate inter-cluster distances. For easing the interpre- √ !2 © 1 2 k + 1(x − k) ª tation we also color explorations ids, according to ground truth fk (x) = exp ­− labels. We deliberately chose the same set of colors as clusters to |e | ® 2 « ¬ visually highlight the good matches. It is a Gaussian function centered at k, therefore satisfying the properties announced in Definition 4.2. Furthermore, it is Artificial workload. The dendrogram exhibits a perfect match among CubeLoad templates (Slice and Drill, Slide All, Exploratory 3 Metrics for clustering performance evaluation are well described in and Goal Oriented, described in Section 2) constituting well sep- https://scikit-learn.org/stable/modules/clustering.html sect. 2.3.10. [18] arated clusters. We expected a good result with this workload, as 4 https://github.com/ClementMoreau-UnivTours/CED_Dolap CubeLoad templates are well differentiated. Colored Dendrogram ( 4 groups) Colored Dendrogram ( 4 groups) 8 8 (a): CED on Artificial (d): CED on Ipums 6 6 Atypical Oscillate Oscillate Focus 4 and Fix 4 Goal Oriented Exploratory Slice All Slice and Drill focus 2 2 OSCF_G OSC_AA OSCF_U OSCF_B OSCF_Y OSCF_Z OSCF_T OSCF_J ATYP_P ATYP_S FOC_W 0 OSC_Q OSC_O FOC_M OSC_V FOC_R FOC_H FOC_C OSC_F FOC_X FOC_A FOC_E FOC_K FOC_L FOC_I FIX_D FIX_N Focus W Atypic P Atypic S Fix N Osc_F U Fix D Osc_F G Focus M Osc_F B Osc AA Osc_F Y Osc_F Z Osc_F T Focus H Osc Q Osc O Focus R Focus A Focus K Focus X Focus C Osc_F J Osc V Focus E Focus L Osc F Focus I Labels 0 GO_44 GO_30 GO_39 GO_33 GO_49 GO_46 GO_40 GO_38 GO_10 GO_19 GO_18 GO_27 GO_25 GO_35 SD_45 SD_34 SD_29 SD_48 SD_50 SD_37 SD_36 SD_28 SD_21 SA_42 SA_31 SA_16 SA_13 SA_15 SA_32 SA_26 SA_23 SA_20 SA_12 GO_5 GO_1 GO_9 GGGGGG GGGGGGGGGGG EE EEE EE EE AAAAA AAAAAAA DDDDDDDDDDDD SD_4 SD_2 SD_3 SA_8 SA_7 E_14 E_24 E_22 E_11 E_17 E_47 E_41 E_43 Colored Dendrogram ( 4 groups) E_6 Labels Colored Dendrogram ( 3 groups) 60 40 40 50 (b): ED on Artificial (e): ED on Ipums 30 40 20 30 20 10 10 OSCF_G OSCF_U OSC_AA OSCF_B OSCF_Y OSCF_Z OSCF_T OSCF_J ATYP_S ATYP_P FOC_W 0 OSC_Q FOC_M OSC_O OSC_V FOC_H FOC_R FOC_C FOC_A OSC_F FOC_E FOC_K FOC_X FOC_L FOC_I FIX_D FIX_N Focus W P Atypic S Osc_F G Osc_F U Focus M Osc Q Osc O Osc_F B Osc AA Osc_F Y Osc_F Z Osc V Osc_F T Focus H Focus R Focus A Focus K Focus X Osc F Focus C Osc_F J Focus E Focus L Fix N Focus I Fix D Atypic Labels 0 GO_40 GO_46 GO_18 GO_30 GO_39 GO_44 GO_27 GO_33 GO_25 GO_35 GO_38 GO_49 GO_19 GO_10 SD_34 SD_45 SD_36 SD_28 SD_21 SD_50 SD_37 SD_29 SD_48 SA_31 SA_42 SA_13 SA_32 SA_20 SA_23 SA_16 SA_12 SA_15 SA_26 GO_9 GO_5 GO_1 G EE EGGE GGGAAA AAGGGGE AAA EGGEA GGEGDDDG DDDDDDDDDGAEAA SD_4 SD_3 SD_2 SA_8 SA_7 E_17 E_41 E_14 E_43 E_11 E_22 E_47 E_24 Colored Dendrogram ( 6 groups) E_6 Labels Colored Dendrogram ( 4 groups) 2.5 2.5 1.5 (c): AD on Artificial (f): AD on Ipums 2.0 1.5 1.0 1.0 0.5 0.5 OSCF_G OSCF_U OSC_AA 0.0 OSCF_B OSCF_Y OSCF_Z OSCF_T OSCF_J ATYP_P ATYP_S FOC_W OSC_Q OSC_O FOC_M OSC_V FOC_C FOC_H FOC_R OSC_F FOC_A FOC_E FOC_X FOC_K FOC_L FOC_I FIX_D FIX_N Focus W Atypic P Atypic S Osc_F G Osc_F U M Osc Q Osc O Osc_F B Osc_F Y Osc_F Z Osc V Osc_F T Focus H L Focus R Focus A Focus K X Osc F Focus C Osc_F J Fix N Fix D Focus E 0.0 Focus I Osc AA 44/GOA 5/GOAL 35/GOA 27/GOA 10/GOA 9/GOAL 25/GOA 40/GOA 19/GOA 38/GOA 18/GOA 30/GOA 46/GOA 1/GOAL 39/GOA 33/GOA 49/GOA Focus Focus Focus Labels 43/EXP 6/EXPL 17/EXP 14/EXP 41/EXP 11/EXP 22/EXP 24/EXP 47/EXP 8/SLIC 7/SLIC 4/SLIC 3/SLIC 2/SLIC E EE EEGE EE AAA DAAAGAGGG DDDADAGGGGGGGGG DDDGDDGDDDGDGG 13/SLI 31/SLI 42/SLI 16/SLI 15/SLI 32/SLI 12/SLI 45/SLI 34/SLI 23/SLI 20/SLI 26/SLI 28/SLI 36/SLI 50/SLI 48/SLI 29/SLI 37/SLI 21/SLI Labels Figure 4: Dendrogram results on Artificial (on left) and Ipums (on right) workloads. In addition, many explorations of the Slice All template (and Nb Dataset ARI V-measure Distance some of the Slice and Drill template) are highly similar (distance clusters near 0) as they contain sequences of the very same operations, CED 4 1 1 Artificial even if exploration size is variable. This is one of the characteris- ED 4 0.26 0.36 tics that makes CED a well-adapted distance for this problem. AD 6 0.76 0.88 CED 4 0.29 0.42 Ipums workload. CED correctly classes most FOCUS and ATYP- Ipums ED 3 0.09 0.23 ICAL explorations. However, it fails to distinguish between OSCI- AD 4 0 0.19 LLATE-FOCUS and OSCILLATE explorations, the frontier being Table 2: Comparison of clustering results for CED, ED and quite fuzzy, and FIX explorations are not distinguished from FO- AD distances CUS ones. We remind that these labels are not a real ground truth, but a preliminary classification for other purpose. Table 2 indicates ARI and V-measure scores. more on the actual query parts to establish similarity, and tends 6.2 Comparison with other distances to cluster together explorations navigating in the same portion of In this experiment we compare the clusters obtained using 3 a cube. Consequently, very different behaviors (e.g. those of Slice distances: CED, ED and AD. Results are reported in Table 2 and and Drill and Goal Oriented templates) are clustered together Figure 4. In Table 2 we can see that CED outperforms AD and ED (see Figure 4 (b)). On the other hand CED is solely based on the in both Artificial and Ipums workloads on both quality metrics. structural properties of the explorations. With ED, clusters reflect exploration sizes instead of query operations. For example, in Figure 4(e) the first cluster includes 6.3 Other quality considerations short explorations, the second cluster contains the longest ones, In addition to ARI and V-measure scores (calculated w.r.t. a and the last cluster contains medium ones. Conversely, AD relies ground truth), we computed cluster diameters and Silhouette Nb Max The interpretation of clusters is harder for this workload, as Dataset Silhouette ARI V-measure clusters diameter Saiku tool produce very longs explorations (the longest in this Artificial 4 2.22 0.49 1 1 workload counts 127 queries). So manual inspection of explo- Ipums 4 3.31 0.28 0.29 0.42 rations is tedious and may lead to judgement errors. With this Open 6 4.53 0.37 disclaimer, we summarize the behavior represented by clusters Security 5 3.81 0.16 as follows: Table 3: Nb of clusters, diameter, Silhouette, ARI and V- • The outlier in first cluster is the shortest exploration (4 measure scores for each workload using CED queries), whose queries are fully aggregated (only ALL levels), and operations only change measures. • The second cluster contains 10 explorations, many ones being very short, including the medoid. A general behavior scores to complete our quality analysis. Results are reported in is a constant or lightly increasing level of detail, sometimes Table 3 for the 4 datasets. being high, sometimes medium. Most explorations (except Globally, we observe that most diameters are low, indicating one) have little filters, but exhibit some changes in the that clusters are compact. Therefore, medoids are good represen- measure set. tatives of each cluster. Most Silhouette scores are also positive, • The third cluster contains 7 explorations, all of them con- which is a good result given that our clusters are not hyper- tinuously increasing the level of detail and filtering. There spherical. In particular, we note that even if CED was able to are multiple drill-downs and multiple filters all along the generate a pure partition for the Artificial workload, we observe explorations. a low Silhouette score. • Cluster 4 contains another outlier clustered alone. It ex- hibits several abrupt changes in the level of detail with several peaks, continuous changes of measures and some 6.4 Interpretation of students’ behavior filters in the middle. The next experiment clusters students explorations of the Ipums • Another exploration also clustered alone. It also shows and Open workloads. We manually examine some explorations of many abrupt changes in the level of detail, with some each cluster, including the centroid, with the goal of abstracting changes in filters and measures at the beginning and an students explorations patterns. increase in the filter level at the end. Ipums workload. The 27 explorations are arranged in 4 clusters • The last cluster contains 8 explorations. Its medoid has no of different sizes. filters nor changes in the measure set. OLAP operations Clusters of explorations represent the following behavior: include only drill-downs and roll-ups, with an increase of the level of detail in the middle, decreasing at the end of • The 2 explorations in the first cluster start by many changes the exploration. Other explorations in the cluster include in the measure set (as trying to choose the good measures), other OLAP operations. The common behavior resides in followed by a long period of filter/unfilter operations, com- the increasing-decreasing pattern in the level of detail. bined with some drill-downs and roll-ups, but with little In conclusion, this clustering confirmed some of the already changes in the level of detail. identified patterns and enabled to discover a new one. Specifically, • Globally, explorations in the second cluster alternates drill- cluster 3 corresponds quite well to the Focus cluster of Ipums downs and roll-ups. Some of them include a few filters, workload and the Slice and Drill template of CubeLoad; while but most of them (this is the case of the medoid) do not cluster 6 corresponds to the Oscillate cluster of Ipums (having no filter any data. equivalent template in CubeLoad). The new pattern, reflected by • The 3 explorations of the third cluster start alternating cluster 2, corresponds to less skilled students, making timid usage drills down and rolls up (as in cluster 2), then alternate of OLAP operations (some drill-downs and roll-ups, few filters, filters/unfilters, some of them focusing at the end (as in some changes of measure). The recognition of outlier behavior cluster 4). This cluster has common points with clusters 2 is another strong point of our method. and 4, some kind of intermediate behavior. • The last cluster includes focused explorations, which reg- ularly increase the level of detail and filtering by adding 6.5 Interpretation of experts’ behavior drill-down and filter operations. Some of them mix other This experiment shows the application of our approach to a operations, mostly at the beginning, but drill-downs and larger workload, whose explorations were devised by expert filters are the predominant operations. analysts using an advanced IDE interface. Surprisingly, many explorations of the Security workload contain only one query From a more general perspective, our study shows that half of (260 out of 723); we excluded them from the study. Another the explorations follow a focused pattern (cluster 4) translating interesting point is that there are long sequences of repeated that those students have developed a particular type of analy- queries (i.e. the exploration contains many times the same query). sis skills, while other students are more exploratory (cluster 2), This may reflect movements in the way query results are arranged perhaps translating a lack of maturity in their analysis skills, and visualized, while referring to the same underlying query. perhaps just showing their style. Clusters 1 and 3 depicts outlier There is no ground truth and a manual observation of the 723 behaviors. explorations is not doable, however, we provide some general Open workload. Our method organizes the 28 explorations in 6 comments and a detailed analysis of the medoids of the 5 retrieved clusters, 3 of them containing an outlier exploration, the first one clusters (and some randomly picked explorations): being very different from all the others (inter-cluster distance • Explorations in cluster 1 do many movements in the group around 50). by set, oscillating the level of detail, with some peaks in Pattern Artificial Ipums Open Security goal is to automatically classify explorations and qualify users Slice and Drill 4 4 3 skills, allowing the recommendation of pertinent next queries, Slice All 3 among other applications. Exploratory 2 Finally, we remark that CED is not endemic to analysis behav- Goal Oriented 1 ior and it can be adapted to study other types of human behavior Oscilating 2 6 1 (provided that such behavior could be represented as a sequence Oscilate+Focus 3 of symbols). In particular, we have used CED for comparing Constant Agg. level 2 5 human daily moves [16] and we are currently discovering and Add-Delete Fragment 2 analysing peculiar patterns of children mobility. Few operations 3 Repeted queries 4 ACKNOWLEDGMENTS Outliers 1 1,4,5 The authors would like to thank Nicolas Labroche for their useful Table 4: Summary of discovered patterns and ids of the insights on clustering quality. This work is partially supported concerned clusters. by French ANR Agency, in the context of Mobi’kids project. REFERENCES [1] Aybar C. Acar and Amihai Motro. 2004. Why Is this User Asking so Many Questions? Explaining Sequences of Queries. In Eighteenth Annual Conference the middle. There are other operations; the medoid does on Data and Applications Security. 159–176. many changes in measure set. [2] Julien Aligon, Kamal Boulil, Patrick Marcel, and Verónika Peralta. 2014. A Holistic Approach to OLAP Sessions Composition: The Falseto Experience. In • Cluster 2 contains many long explorations, which char- DOLAP 2014. 37–46. acteristic is the alternation of adding and deleting one [3] Julien Aligon, Matteo Golfarelli, Patrick Marcel, Stefano Rizzi, and Elisa Tur- fragment (a level in the group by set, a filter or a measure). ricchia. 2014. Similarity measures for OLAP sessions. KAIS 39, 2 (2014), 463–489. • Explorations in the third cluster are highly similar (many [4] N. Arzamasova, K. Böhm, B. Goldman, C. Saaler, and M. Schaeler. 2019. On the distances are around 0). It includes many short explo- Usefulness of SQL-Query-Similarity Measures to Find User Interests. TKDE rations (as the medoid), with few operations, mainly drill- (2019), 1–1. [5] Surajit Chaudhuri and Vivek R. Narasayya. 2007. Self-Tuning Database Sys- downs. tems: A Decade of Progress. In VLDB. 3–14. • Explorations in cluster 4 have very few operations, and [6] F. J. Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Commun. ACM 7, 3 (1964), 171–176. globally exhibit long subsequences of identical queries. [7] Mahfoud Djedaini, Krista Drushku, Nicolas Labroche, Patrick Marcel, Verónika • In the last cluster, queries have constant level of detail Peralta, and Willeme Verdeaux. 2019. Automatic assessment of interactive (generally low), with some movements in the group by set OLAP explorations. Inf. Syst. 82 (2019), 148–163. [8] Mahfoud Djedaini, Nicolas Labroche, Patrick Marcel, and Verónika Peralta. and few filters. 2017. Detecting User Focus in OLAP Analyses. In ADBIS. 105–119. As expected, analysts’ behavior is different from students’. [9] Philipp Eichmann, Emanuel Zgraggen, Zheguang Zhao, Carsten Binnig, and Tim Kraska. 2016. Towards a Benchmark for Interactive Data Exploration. Globally, their explorations exhibit less operations, with more IEEE Data Eng. Bull. 39, 4 (2016), 50–61. emphasis in the grouping of data, probably also in their arrange- [10] Magdalini Eirinaki, Suju Abraham, Neoklis Polyzotis, and Naushin Shaikh. 2014. QueRIE: Collaborative Database Exploration. TKDE 26, 7 (2014), 1778– ment and visualization (which is not captured in our method) 1790. of the data; while students are more click-oriented and produce [11] D. P. Huttenlocher, W. J. Rucklidge, and G. A. Klanderman. 1992. Comparing longer explorations with much more operations. images using the Hausdorff distance under translation. In CVPR. 654–656. [12] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. 2015. Overview Clusters 3 and 4 evidence this behavior and Cluster 2 is some of Data Exploration Techniques. In SIGMOD. 277–281. kind of generalization of the Slice All pattern of CubeLoad. Clus- [13] Richard M. Karp. 1972. Reducibility among Combinatorial Problems. 85–103. ters 1 and 5 coincide with those of the other workloads. Table 4 [14] Ralph Kimball. 1996. The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley. summarizes the discovered patterns. [15] Tova Milo and Amit Somech. 2018. Next-Step Suggestions for Modern Inter- active Data Analysis Platforms. In KDD. 576–585. [16] C. Moreau, T. Devogele, V. Peralta, and L. Etienne. 2020. A Contextual Edit 7 CONCLUSION Distance for Semantic Trajectories. Proc. of 35th ACM/SIGAPP (2020). This paper addressed the problem of learning analysis patterns [17] Patrick E. O’Neil, Elizabeth J. O’Neil, Xuedong Chen, and Stephen Revilak. 2009. The Star Schema Benchmark and Augmented Fact Table Indexing.. In in OLAP explorations, which is a hot topic for the understanding TPCTC (2009-10-28). 237–252. of human behavior and providing IDE support. Concretely, we [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. propose to cluster similar explorations using a Contextual Edit Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Distance, and then extract analysis behavior from clusters. CED Machine Learning in Python. Journal of Machine Learning Research 12 (2011), is a new distance that is well-suited for comparing explorations 2825–2830. [19] Stefano Rizzi and Enrico Gallinucci. 2014. CubeLoad: A Parametric Generator taking into account their local context, and then lowering the of Realistic OLAP Workloads. In CAiSE 2014. 610–624. edition cost of similar queries, repetitions and permutations. Our [20] Oscar Romero, Patrick Marcel, Alberto Abelló, Verónika Peralta, and Ladjel experiments with four workloads allowed to detect CubeLoad Bellatreche. 2011. Describing Analytical Sessions Using a Multidimensional Algebra. In DaWaK. 224–239. templates and to learn some new analysis patterns from students [21] TF. Smith and MS. Waterman. 1981. Identification of Common Molecular and expert analysts explorations. Even if these results are promis- Subsequences. Journal of Molecular Biology 147 (1981), 195–197. ing, our method needs to be tested in larger workloads, with [22] R. Wagner and M. Fisher. 1974. The String-to-String Correction Problem. J. ACM 21 (1974), 168–173. varied skilled and non-skilled users and different types of user interfaces, before abstracting more general patterns. As future work, we plan to tune our method and study its sensibility and robustness with respect to CED parameters, query features and query similarity. We would also like to compare our results to other clustering methods and distances. Our long term