=Paper= {{Paper |id=Vol-2572/short15 |storemode=property |title=The Traveling Analyst Problem: Definition and Preliminary Study |pdfUrl=https://ceur-ws.org/Vol-2572/short15.pdf |volume=Vol-2572 |authors=Alexandre Chanson,Ben Crulis,Nicolas Labroche,Patrick Marcel,Verónika Peralta,Stefano Rizzi,Panos Vassiliadis |dblpUrl=https://dblp.org/rec/conf/dolap/ChansonCLMPRV20 }} ==The Traveling Analyst Problem: Definition and Preliminary Study== https://ceur-ws.org/Vol-2572/short15.pdf
The Traveling Analyst Problem: Definition and Preliminary Study

Alexandre Chanson, Ben Crulis, Nicolas Labroche, Patrick Marcel, Verónika Peralta — University of Tours, France — firstname.lastname@univ-tours.fr
Stefano Rizzi — University of Bologna, Italy — stefano.rizzi@unibo.it
Panos Vassiliadis — University of Ioannina, Greece — pvassil@cs.uoi.gr
ABSTRACT
This paper introduces the Traveling Analyst Problem (TAP), an original strongly NP-hard problem where an automated algorithm assists an analyst in exploring a dataset by suggesting the most interesting and coherent set of queries that are estimated to be completed under a time constraint. We motivate the problem, study its complexity, propose a simple heuristic under simplifying assumptions for approximating it, and run preliminary tests to observe the behavior of this heuristic.

1 INTRODUCTION
Interactive data exploration (IDE) [10, 19] is an iterative process consisting in executing an action (e.g., a query or a pattern extraction algorithm) over the data, receiving the result, and deciding what query comes next. It is a challenging task that a number of previous works have aimed at facilitating (see, e.g., [5, 19]). Automating such a process raises a number of challenges [16, 25]: how to determine the direction to follow in often very large and disorienting datasets, how to decide what is the best query to apply, how to determine if a result is interesting, how to tell a story with the data resulting from the analysis [8, 9], etc.

If we define a data story as a coherent sequence of queries that answer a user goal, we can express this problem as the computation of the most interesting and coherent data story that can be obtained within a reasonable time. Even with simplifying assumptions, like restricting to exploratory OLAP queries over a multidimensional schema (e.g., a star schema, which allows navigating hierarchically-structured data with a low formulation effort) and giving a particular starting point, this problem remains inherently highly combinatorial.

This paper presents a preliminary study of this problem, which we name the Traveling Analyst Problem (TAP). Similarly to automated machine learning, which aims at finding the best model on a dataset given a time budget (see, e.g., [7]), TAP aims at (i) finding, from a very large set of candidate queries, a subset of queries that maximizes their interest within a limited time budget, and (ii) ordering them so that they narrate a coherent data story. More formally, each query is associated with an interest score as well as with an execution cost. A distance between queries is used to order the queries so that the transition cost between two consecutive queries is minimized. Interestingly, a study of the state of the art reveals that TAP has not been studied in the Operations Research community, while being close to two classical optimization problems (the Traveling Salesman Problem and the Knapsack Problem) [11].

Practically, we envision TAP as being at the core of an IDE system used by data analysts, like the one described in [16]. Analysts would not directly enter TAP parameters, but would use a storytelling mechanism instead (e.g., [8]) or patterns like "I want a quick short report around this query result" or "I want an in-depth analysis in this particular zone of the cube".

The contributions of this preliminary effort include the formalization of TAP as well as the proof of its strong NP-hardness (Section 2), an approximation algorithm (Section 3), and some preliminary tests assessing the parameter sensitivity, execution time, and closeness to the optimum of our heuristics (Section 4). Section 5 concludes the paper by discussing future research.

2 PROBLEM FORMULATION AND COMPLEXITY
In this section, we formalize TAP and show its NP-hardness.

2.1 Problem formulation
TAP is formulated as:
  Input: a set of n queries, a function interest estimating an interestingness score for each query, a function cost estimating the execution cost of each query, and a function dist estimating a cognitive distance between queries;
  Do: find a sequence of m ≤ n queries (without repetition)
  S.T.: the sequence has the following properties:
    (1) it maximizes the overall interestingness score,
    (2) the sum of the costs does not exceed a user-specified time budget t,
    (3) it minimizes the overall cognitive distance between the queries.

We assume that a storytelling mechanism (e.g., [8]) generates a set of candidate queries towards producing a story. Thus, our deliberations start with a set of n candidate queries whose execution has to be optimized. Formally, let Q be a set of n queries, each associated with a positive time cost cost(q_i) and a positive interestingness score interest(q_i). Each pair of queries is associated with a metric dist(q_i, q_j) for their cognitive distance. Given a time budget t, the optimization problem consists in finding a sequence ⟨q_1, ..., q_m⟩ of queries, q_i ∈ Q, without repetition, with m ≤ n, such that:
  (1) max Σ_{i=1}^{m} interest(q_i),
  (2) Σ_{i=1}^{m} cost(q_i) ≤ t,
  (3) min Σ_{i=1}^{m−1} dist(q_i, q_{i+1}).

The decision problem associated with this optimization problem is to decide if such a sequence exists.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
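To make the formulation concrete, the sketch below evaluates a candidate sequence against constraints (1)-(3). This is an illustrative sketch only (the paper's implementation is in Java); the `Query` record, the `evaluate` function, and the toy numbers are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass(frozen=True)
class Query:
    name: str
    cost: float      # positive estimated execution cost (seconds)
    interest: float  # positive estimated interestingness score

def evaluate(seq: Sequence[Query],
             dist: Callable[[Query, Query], float],
             t: float) -> Tuple[bool, float, float]:
    """Check budget constraint (2) and compute objectives (1) and (3)
    for a candidate sequence of distinct queries."""
    assert len({q.name for q in seq}) == len(seq)  # no repetition
    total_cost = sum(q.cost for q in seq)
    total_interest = sum(q.interest for q in seq)
    # sum of cognitive distances between consecutive queries
    total_dist = sum(dist(a, b) for a, b in zip(seq, seq[1:]))
    return total_cost <= t, total_interest, total_dist

# Toy instance: three queries and a trivial 0/1 distance
q1, q2, q3 = Query("q1", 1.0, 5.0), Query("q2", 2.0, 3.0), Query("q3", 4.0, 4.0)
d = lambda a, b: 0.0 if a.name == b.name else 1.0
feasible, interest, cog = evaluate([q1, q2], d, t=4.0)
```

A TAP solution is then a feasible sequence trading off a high `interest` total against a low `cog` total; the sequence ⟨q1, q2⟩ above fits the budget with cost 3 ≤ 4.
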
2.2 Complexity
TAP can be related to two families of classical NP-hard optimization problems: (i) the Knapsack problem, which consists in picking weighted items from a set S such that the sum of their values is maximum, without exceeding a given size [13], corresponding to constraints (1) and (2); and (ii) the Traveling Salesman Problem (TSP), which aims at finding the shortest route that visits all cities and returns to the initial one [1], and is close to constraint (3).

In our context, the former problem would find the most interesting queries given a time budget but, in its classical formulation, it would miss the ordering of queries. A variant [4] includes the position of each object in the Knapsack via the definition of a function (which is constant in the case of the classical Knapsack). While this problem is closer to TAP, it involves a function that relies only on the position (or order) of an object in the Knapsack, and not on the positions of the objects previously picked.

TAP could also be modeled as a TSP with particular constraints: (i) the TSP cities are the queries; (ii) inter-city distances correspond to the cognitive distance between queries, whose total must be minimized. However, differently from the classical TSP, TAP operates under strong constraints: (iii) it is possible to visit only a subset of cities, and each city has a weight corresponding to the action's interest, whose total is to be maximized; and (iv) each city has a duration of visit corresponding to the cost of the query, whose sum must not go beyond a given time budget.

Interestingly, this optimization problem has not been studied in the literature yet. The variant of the TSP called TSP with profit (TSPwP), described by Feillet et al. in [6], is closer to our problem, but still differs in two aspects: (i) it looks for circuits and does not reject any vertex in the solution, and (ii) it puts a limit on the travel distance (the inter-query distance in our case), while our limit is on the cost of queries (the duration of visit).

An in-depth study of the TAP complexity is beyond the scope of this paper. However, we can easily show that our problem is strongly NP-hard since the TSP is a particular case of it. Indeed, if the time budget t is high enough, i.e., all queries can be selected, then TAP is a TSP. This result means that, unless P=NP, the TAP problem can only be solved to optimality by algorithms with a worst-case time complexity in O*(c^n), with c a positive root and n the size of Q.

2.3 Size of the problem and our naive approach
We now discuss the size n of TAP (the number of queries in Q), since the size of an optimization problem usually impacts the choice of resolution approaches. Theoretically, given a cube schema, all non-empty queries over this schema could be considered. Practically, it is reasonable to consider that this set is generated from a given initial query q0.

In what follows, we restrict to star join queries over a star schema of the form q = (g, s, m), where g is the query group-by set, s is a set of selections, and m is a set of measures, all three sets being pairwise disjoint. Transforming a query into another with a simple OLAP operation means either changing the group-by set g, changing a selection in s, or changing a measure in m.

Even restricting Q to the set of queries that can be generated by transforming an initial query q0 = (g0, s0, m0) with a sequence of simple OLAP operations, the size of Q is potentially very large, not allowing to look for an exact solution. A rough estimate of the number of queries that can be generated from q0 by applying k OLAP operations, i.e., the size of Q, can be obtained by assuming for simplicity that dimensions only have linear hierarchies (no branches):

  |Q| = ((Π_i h_i − 1) + (2^|D| − 1) + (2^|M| − 1)) · k

where h_i is the number of levels in dimension i, D is the union of the active domains of all levels, and M is the set of all measures in the cube. Changing the query group-by set means picking one group-by set among all the possible ones, excluding the current one g0. Changing the selected values means picking a set of values in the active domain, excluding the current one s0. Changing the measure set means picking a set of measures among all possible sets of measures, excluding the current one m0.

In order to approach solutions of TAP for arbitrarily large sets of queries, we adopt the following strategy. We first use a heuristic to solve the Knapsack problem and obtain a subset of queries, using estimated costs and interests, so that the estimated costs satisfy the time budget. Then, we order the queries by increasing estimated cost and evaluate them. We periodically check the difference between the time budget constraint and the elapsed time: if it is negative (too much time is taken), we redo a Knapsack to reduce the set of chosen queries; otherwise (we can benefit from additional time), we redo a Knapsack adding previously untaken queries. Finally, we determine an order on the chosen set of queries so that cognitive distance is minimized, using a heuristic to solve the TSP.

3 APPROXIMATION ALGORITHM
Before presenting our approach we discuss its parameters, i.e., the three functions for cost, interest, and distance, and the set Q of queries. Choosing the best three functions or defining the best set of queries Q is outside the scope of this paper. Note that a framework for learning cell interestingness in a cube is the topic of a recent paper [17]. We give examples in this section, and we indicate precisely in the tests of Section 4 the parameters used.

3.1 Cost
The cost of a query is related to its execution time. Classically, this cost can be estimated by a query optimizer before the execution of the query. We therefore consider that we can measure a query cost in two ways, to obtain an a priori cost (e.g., using the RDBMS optimizer) and an a posteriori cost (the actual query execution time). The a priori cost is used to decide if a query can be included or not in the solution, while the a posteriori cost is used to compute the elapsed time.

3.2 Interestingness measure
A crucial part of TAP lies in the definition of an interestingness measure to determine the optimal subset of queries. To quickly decide if a query is interesting, it is preferable that this measure can be computed before the actual evaluation of the query by the DBMS, and therefore that it relies on the text of the query. In this sense, we propose to follow the idea of the subjective interestingness measure of a pattern developed by De Bie in the context of Exploratory Data Mining (EDM) [2] and to extend it to measure the subjective interestingness of a query as expressed by a coherent set of query parts.

In the information-theoretic formalism proposed by De Bie, interestingness is conditioned by the prior knowledge belief(p) on a pattern p of the data space, which is expressed as a probability
distribution over the set of patterns P. The interestingness measure IM is derived by normalizing the belief by the complexity of the pattern as follows:

  IM(p) = − log(belief(p)) / complexity(p)    (1)

In the context of BI, [3] introduces an innovative approach to learn the belief distribution associated to query parts. The approach considers a connected graph of query parts based on the schema of the cube and the past usage of query parts, and uses a random walk on this graph to produce the expected long-term distribution over query parts. Interestingly, by construction, the probabilities obtained for each query part are independent, which allows us to propose a simple formulation for the interestingness measure of a query q = (g, s, m) based on its query parts p ∈ (g ∪ s ∪ m):

  IM(q) = − Σ_{p ∈ (g ∪ s ∪ m)} log(belief(p)) / (|g| + |s| + |m|)    (2)

3.3 Cognitive distance
To order a set of queries we draw inspiration from Hullman et al. [9], who estimate the cognitive cost of transiting from the result of one query to the result of another. We use this cost as a proxy for cognitive distance. Interestingly, this cost can be estimated without having evaluated the query. Using controlled studies of people's preferences for different types of single visualization-to-visualization transitions, Hullman et al. [9] proposed a transition cost model that approximates the cognitive cost of moving from one visualization to the next in a sequence of static views. The transition cost is defined as the number of transformations required to convert the data shown in the first view to the second. A single transformation is defined as a change to one of the data fields shown from the first view to the second. We use this cost model to define a simple distance between queries as the Jaccard distance between sets of query parts. Formally, let q1 = (g1, s1, m1) and q2 = (g2, s2, m2) be two queries; we define:

  dist(q1, q2) = 1 − |(g1 ∪ s1 ∪ m1) ∩ (g2 ∪ s2 ∪ m2)| / |g1 ∪ s1 ∪ m1 ∪ g2 ∪ s2 ∪ m2|    (3)

3.4 Set of queries
TAP is defined for a given set Q of queries. Practically, we consider that this set is generated from a given initial query, q0. This initial query can be interpreted as a particular moment in an interactive analysis where the user considers that what is retrieved is important and worth exploring "around" it, but has too many directions to explore. In what follows, we consider this initial query as a parameter of our approach.

Defining the set of queries worth exploring after q0 should be rooted in the earlier OLAP literature, especially automatic reporting [8], automatic cube exploration [12], discovery-driven analysis [22, 23], and more generally realistic OLAP workloads [21]. In the Cinecubes approach [8], the authors consider two types of queries: (i) queries for values similar to those defining the selection filters of the initial query (i.e., siblings of ancestors), and (ii) direct drill-downs into the dimensions of the initial result, one dimension at a time. In the DICE approach [12], the authors consider direct roll-up queries, direct drill-down queries, sibling queries (a change of a dimension value, i.e., coordinate, in a dimension), and pivot queries (a change in the inspected dimension). In the DIFF and RELAX operators [22, 23], the authors consider direct or distant drill-down (resp. roll-up) queries that detail (resp. aggregate) two particular cells of a query result. Finally, the CubeLoad OLAP workload generator [21] is based on patterns modeling realistic OLAP sessions, which could be used to generate the queries of Q.

Ideally, the set Q should include all these potentially relevant follow-up queries to q0. For the present work we will consider different sets varying the generation of roll-up, drill-down, and sibling queries (see Section 4).

3.5 A simple heuristic for approximating TAP
Algorithm 1 presents our heuristic, named Reopt, since it is based on a naive re-optimization principle. Note that the generation of the set Q from an initial query q0 is considered as a pre-processing step. Algorithm 1 takes advantage of the fact that the functions interest, cost, and dist can be computed solely on the basis of the query expression (and not on the result of the queries).

First, Algorithm 1 splits the time budget t into tk (time for the global query execution) and to (time for other tasks, like solving the Knapsack, solving the TSP, etc.). Then the Knapsack is solved (line 2), and the queries selected are ordered by their estimated evaluation cost (line 3). Queries are then executed (line 6) and, after each execution, the elapsed time is checked. If it is estimated that the time budget tk will not be respected (line 9), then another Knapsack is triggered with the remaining time. If it is estimated that the time budget will not be completely spent (line 13), then another Knapsack is triggered with all the remaining queries. Finally, once all the queries of the chosen set have been executed, the TSP is solved. It is easy to verify that Algorithm 1 converges: the set K is at worst Q and, at each iteration of the for loop (lines 5-16), the set E is augmented with one more query of K while that query is removed from K.

Note that, when actually executing the queries, Reopt attaches particular importance to the estimated cost (see for instance line 3) compared to interest or distance. The main reason behind this decision is the time budget that has to be respected: had we given priority to distance from q0 or to interest, we might have executed costly queries first, or we might have had many ties (since many queries would be at the same distance from q0). Although alternative formulations are also possible, due to the sensitivity of time efficiency in user interactions (keep in mind that we operate in the context of exploratory data analysis), it is imperative that the other factors do not guide query execution to take up too much time.

Implementation-wise, the Knapsack problem is solved using a Fully Polynomial Time Approximation Scheme that gives a bound on the divergence from the optimum [14], and the TSP using the heuristic of [15].

4 PRELIMINARY EXPERIMENTS
Our preliminary experiments aim at understanding the behavior of our naive algorithm Reopt. To this end, we have compared it to two other algorithms: a brute-force one that looks for the optimal solutions (named optimal), which obviously works only on small instances of TAP, and a simplistic one that consists of solving the Knapsack, then solving the TSP, and then executing the queries (named K+TSP). All the algorithms are implemented in Java 12 and run on a Core i7 with 16GB RAM under Linux Fedora. The code is available via Github (https://github.com/OLAP3/CubeExplorer). Note that, in all tests, costs are estimated
using not the estimation coming from the query optimizer of the DBMS, but a linear regression model based on the query length in ASCII, the number of tables, the number of projections, the number of selections, and the number of aggregations of the SQL star join query. This is because our empirical tests have shown that this model was more accurate than the raw estimates given by the query optimizer.

Algorithm 1: Reopt: simple re-optimization heuristic for TAP approximation
   Data: An instance (Q, interest(), dist(), cost(), t) of TAP
   Result: A sequence of queries
    1  Split t = tk + to
    2  K = knapsack(Q, tk)
    3  sort K by increasing cost
    4  E = ∅
    5  for each query q ∈ K do
    6      execute q
    7      K = K \ {q}
    8      E = E ∪ {q}
    9      if elapsedTime + Σ_{q∈K} cost(q) > tk then
   10          // we may have less time than expected
   11          K = knapsack(K, tk − elapsedTime)
   12          sort K by increasing cost
   13      else if elapsedTime + Σ_{q∈K} cost(q) < tk then
   14          // we may have more time than expected
   15          K = knapsack(Q \ E, tk − elapsedTime)
   16          sort K by increasing cost
   17  S = TSP(E)  // construct sequence
   18  return S

4.1 Optimization time
In this first test, we use a simple synthetic cube under the schema of the SSB benchmark [20], with an instance of 1GB under MariaDB, produced with the TPC-H data generator. The cube has only one fact table and 5 dimension tables, which enables keeping the number of queries in Q under control for comparing Reopt and K+TSP with optimal. Over this schema, we have used the CubeLoad generator [21] to generate 40 q0 queries. These queries are used to generate Q with direct roll-up or drill-down queries, i.e., |Q| can vary between 5 and 10.

Figure 1: Optimization time

Figure 1 shows, on a logarithmic scale, the average time taken by the 3 algorithms to optimize (not counting the query execution time) for different sizes of Q, with the time budget varying from 1 second to 10 seconds for Reopt and K+TSP. Results are as expected, with K+TSP outperforming its two competitors, since it runs the Knapsack solver only once. Notably, the time taken by Reopt remains under control.

To assess the benefit of our re-optimization step (lines 9-16 of Algorithm 1), we counted the number of times each branch of the if statement in line 9 of Reopt is taken, i.e., the number of times for negative (lines 10-12) and for positive (lines 14-16) re-optimizations. We have observed that both branches are used, with positive re-optimizations being rarely done compared to negative ones; precisely, over 399 total runs, positive re-optimizations are done 116 times while negative re-optimizations are done 851 times. This demonstrates that Reopt can adaptively respond to inaccurate estimates.

4.2 Distance and interestingness vs time budget
For the next series of tests, we use a cube issued from a French project on energy vulnerability. It is organized as a star schema with 19 dimensions, 68 (non-top) levels, and 24 measures, and it contains 37,149 facts recorded in the fact table. The cube is stored in SQL Server. We use 19 real user explorations over this cube (navigation traces generated by master students when analysing data during a course) and pick as q0 the first query of each exploration. For each of these queries, we generate Q by rolling up q0 in a dimension, drilling down q0 in a dimension, or computing a sibling query in a dimension. We use this cube and set Q to obtain more realistic instances of TAP. Precisely, |Q| ranges from 29 to 149 queries for these tests. Over these sets Q we run Reopt and K+TSP and observe how total interestingness and average distance between queries change when increasing the time budget t from 500 milliseconds to 5 seconds.

Figure 2: Total interestingness

Figures 2 and 3 show the results. It can be seen that Reopt outperforms the simplistic K+TSP in terms of interestingness, which illustrates the benefit of our re-optimization heuristic, while both algorithms achieve comparable average distance.

4.3 Unused time
In this last test, using the same cube and protocol as in the previous subsection, we want to understand if the time budget is used properly.
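The re-optimization loop exercised throughout these experiments (Algorithm 1) can be sketched compactly as follows. This is an illustrative sketch only: the `knapsack` helper is a naive greedy stand-in (not the FPTAS of [14]), the final TSP ordering step is omitted, query execution is simulated via an assumed a-posteriori `actual_cost` field, and all names are hypothetical.

```python
def knapsack(queries, budget):
    # Greedy stand-in for the Knapsack solver: take queries by decreasing
    # interest/cost ratio while their estimated costs fit the budget.
    chosen, used = [], 0.0
    for q in sorted(queries, key=lambda r: r["interest"] / r["cost"], reverse=True):
        if used + q["cost"] <= budget:
            chosen.append(q)
            used += q["cost"]
    return chosen

def reopt(Q, tk):
    """Simplified Reopt loop: execute chosen queries by increasing estimated
    cost, re-running the Knapsack when elapsed time diverges from the plan."""
    K = sorted(knapsack(Q, tk), key=lambda r: r["cost"])
    E, elapsed = [], 0.0
    while K:
        q = K.pop(0)                 # execute cheapest remaining query
        elapsed += q["actual_cost"]  # a posteriori (measured) cost
        E.append(q)
        remaining_est = sum(r["cost"] for r in K)
        if elapsed + remaining_est > tk:    # negative re-optimization
            K = sorted(knapsack(K, tk - elapsed), key=lambda r: r["cost"])
        elif elapsed + remaining_est < tk:  # positive re-optimization
            K = sorted(knapsack([r for r in Q if r not in E], tk - elapsed),
                       key=lambda r: r["cost"])
    return E  # still to be ordered by a TSP heuristic on dist()

# Toy run: q2's cost is underestimated, triggering a negative re-optimization
Q = [
    {"name": "q1", "cost": 1.0, "actual_cost": 1.0, "interest": 5.0},
    {"name": "q2", "cost": 1.0, "actual_cost": 3.0, "interest": 4.0},
    {"name": "q3", "cost": 2.0, "actual_cost": 2.0, "interest": 3.0},
]
executed = reopt(Q, tk=4.0)
```

In the toy run, after q2 overruns its estimate, the re-run Knapsack finds no remaining query that fits the residual budget, so q3 is dropped, mirroring the negative re-optimizations counted in Section 4.1.
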
in the context of data exploration, which means that optimization algorithms should take advantage of classical data management optimizations, like re-optimization (e.g., [18]), and that TAP should be declaratively formulated, for instance by having starting points expressed in an intentional fashion (e.g., [24]). User tests should be conducted to further evaluate the approach. Finally, we also note that the definition of TAP is general, leaving room for variants, e.g., changing the definition of queries (e.g., from non-OLAP SQL queries to more complex actions involving queries and pattern mining or statistical tests), as well as changing the cost (e.g., using a self-adjusting cost model), interest (e.g., using statistics or data sampling), and distance functions.

Figure 3: Average distance

Acknowledgment. The authors would like to thank Vincent T'Kindt for his insights on TAP complexity.

REFERENCES
[1] David L. Applegate, Robert E. Bixby, Vašek Chvátal, and William J. Cook. The Traveling Salesman Problem: A Computational Study. Princeton Series in Applied Mathematics. Princeton UP, 2006.
[2] Tijl De Bie. Subjective interestingness in exploratory data mining. In Proc. of
                                                                          IDA, pages 19–31, 2013.
                                                                      [3] Alexandre Chanson, Ben Crulis, Krista Drushku, Nicolas Labroche, and Patrick
                                                                          Marcel. Profiling user belief in BI exploration for measuring subjective inter-
                                                                          estingness. In Proc. of DOLAP, 2019.
                                                                      [4] Fabián Díaz-Núñez, Franco Quezada, and Óscar C. Vásquez. The knapsack
                                                                          problem with scheduled items. Electronic Notes in Discrete Mathematics, 69:293–
                                                                          300, 2018.
                                                                      [5] Magdalini Eirinaki, Suju Abraham, Neoklis Polyzotis, and Naushin Shaikh.
                                                                          QueRIE: Collaborative database exploration. TKDE, 26(7):1778–1790, 2014.
                                                                      [6] Dominique Feillet, Pierre Dejax, and Michel Gendreau. Traveling salesman
                                                                          problems with profits. Transportation Science, 39(2):188–205, 2005.
                                                                      [7] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springen-
                                                                          berg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine
                                                                          learning. In Proc. of NIPS, pages 2962–2970, 2015.
                                                                      [8] Dimitrios Gkesoulis, Panos Vassiliadis, and Petros Manousis. Cinecubes:
                                                                          Aiding data workers gain insights from OLAP queries. IS, 53:60–86, 2015.
                                                                      [9] Jessica Hullman, Steven M. Drucker, Nathalie Henry Riche, Bongshin Lee,
     Figure 4: Unused time and candidate queries left                     Danyel Fisher, and Eytan Adar. A deeper understanding of sequence in
                                                                          narrative visualization. TVCG, 19(12):2406–2415, 2013.
                                                                     [10] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. Overview of
                                                                          data exploration techniques. In Proc. of SIGMOD, pages 277–281, 2015.
                                                                     [11] D.S. Johnson and M. Garey. Computers and Intractability. W.H.Freeman, 1979.
   Figure 4 plots against different time budgets the proportion of   [12] Niranjan Kamat, Prasanth Jayachandran, Karthik Tunga, and Arnab Nandi.
unused time versus the number of candidate queries left in Q to           Distributed and interactive cube exploration. In Proc. of ICDE, pages 472–483,
                                                                          2014.
execute. As we can see, regardless of the budget, our algorithm      [13] Hans Kellerer, Ulrich Pferschy, and David Pisinger. Knapsack problems.
manages to take advantage of all the time available unless all            Springer, 2004.
                                                                     [14] Katherine Lai and M Goemans. The knapsack problem and fully polynomial
queries of the Q have been explored, in which case the ratio of           time approximation schemes (fptas). Technical report, Massachusetts Institute
unused time increases. On the contrary, the K + T SP approach,            of Technology, 2006.
which is unaware of the possible gain in executing the other         [15] S. Lin and B. W. Kernighan. An effective heuristic algorithm for the traveling-
                                                                          salesman problem. Oper. Res., 21(2):498–516, 1973.
queries, has a larger ratio of unused time and does not manage       [16] Patrick Marcel, Nicolas Labroche, and Panos Vassiliadis. Towards a benefit-
to explore completely Q. The absence of idle time clearly proves          based optimizer for interactive data analysis. In Proc. of DOLAP, 2019.
the advantages of our adaptive re-optimization heuristic over the    [17] Patrick Marcel, Verónika Peralta, and Panos Vassiliadis. A framework for
                                                                          learning cell interestingness from cube explorations. In Proc. of ADBIS, pages
static K + T SP method, which cannot compensate for the errors            425–440, 2019.
in the prediction of the cost, nor ensure that the execution time    [18] Volker Markl, Vijayshankar Raman, David E. Simmen, Guy M. Lohman, and
                                                                          Hamid Pirahesh. Robust query processing through progressive optimization.
is near to the time budget.                                               In Proc. of SIGMOD, pages 659–670, 2004.
                                                                     [19] Tova Milo and Amit Somech. Next-step suggestions for modern interactive
                                                                          data analysis platforms. In Proc. of KDD, pages 576–585, 2018.
5   CONCLUSION                                                       [20] Patrick E. O’Neil, Elizabeth J. O’Neil, Xuedong Chen, and Stephen Revilak.
This paper introduced the Traveling Analyst Problem (TAP), the            The star schema benchmark and augmented fact table indexing. In Proc. of
                                                                          TPCTC, pages 237–252, Lyon, France, 2009.
problem of computing the most interesting and coherent data          [21] Stefano Rizzi and Enrico Gallinucci. CubeLoad: A parametric generator of
story that can be obtained within a reasonable time. We formalize         realistic OLAP workloads. In Proc. of CAISE, pages 610–624, 2014.
                                                                     [22] Sunita Sarawagi. Explaining differences in multidimensional aggregates. In
the problem, show its strong NP-hardness and propose a heuristic          Proc. of VLDB, pages 42–53, 1999.
for finding approximate solutions to it. Our preliminary experi-     [23] Gayatri Sathe and Sunita Sarawagi. Intelligent rollups in multidimensional
ments show that a heuristic based on simple re-optimization is a          OLAP data. In Proc. of VLDB, pages 531–540, 2001.
                                                                     [24] Panos Vassiliadis, Patrick Marcel, and Stefano Rizzi. Beyond roll-up’s and
promising direction to obtain acceptable solutions.                       drill-down’s: An intentional analytics model to reinvent OLAP. IS, 85:68–91,
   We believe that TAP opens many interesting research direc-             2019.
tions. Obviously, the first step is an in-depth theoretical study    [25] Abdul Wasay, Manos Athanassoulis, and Stratos Idreos. Queriosity: Automated
                                                                          data exploration. In Proceedings of IEEE International Congress on Big Data,
of TAP, to understand which types of optimization algorithms              pages 716–719, New York City, NY, 2015.
are more appropriate. Importantly, TAP should be investigated