<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Traveling Analyst Problem: Definition and preliminary study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandre Chanson, Ben Crulis, Nicolas Labroche, Patrick Marcel, Verónika Peralta</string-name>
          <email>firstname.lastname@univ-tours.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Rizzi</string-name>
          <email>stefano.rizzi@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panos Vassiliadis</string-name>
          <email>pvassil@cs.uoi.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tours</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Ioannina</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces the Traveling Analyst Problem (TAP), an original strongly NP-hard problem where an automated algorithm assists an analyst in exploring a dataset by suggesting the most interesting and coherent set of queries that are estimated to complete under a time constraint. We motivate the problem, study its complexity, propose a simple heuristic under simplifying assumptions for approximating it, and run preliminary tests to observe the behavior of this heuristic.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Interactive data exploration (IDE) [
        <xref ref-type="bibr" rid="ref10 ref19">10, 19</xref>
        ] is an iterative process
consisting of executing an action (e.g., a query or a pattern extraction
algorithm) over the data, receiving the result, and deciding which
query comes next. It is a challenging task that a number of
previous works aimed at facilitating (see e.g., [
        <xref ref-type="bibr" rid="ref19 ref5">5, 19</xref>
        ]). Automating
such a process raises a number of challenges [
        <xref ref-type="bibr" rid="ref16 ref25">16, 25</xref>
        ]: how to
determine the direction to follow in often very large and
disorienting datasets, how to decide what is the best query to apply,
how to determine if a result is interesting, how to tell a story
with the data resulting from the analysis [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], etc.
      </p>
      <p>If we define a data story as a coherent sequence of queries
that answer a user goal, we can express this problem as the
computation of the most interesting and coherent data story that
can be obtained within a reasonable time. Even with simplifying
assumptions, like restricting to exploratory OLAP queries over
a multidimensional schema (e.g., a star-schema, which allows
navigating hierarchically-structured data with a low formulation
effort) and giving a particular starting point, this problem remains
inherently highly combinatorial.</p>
      <p>
        This paper presents a preliminary study of this problem, which
we name the Traveling Analyst Problem (TAP). Similarly to
automated machine learning, which aims at finding the best model
on a dataset given a time budget (see e.g., [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), TAP aims at (i)
finding, from a very large set of candidate queries, a subset of
queries that maximizes their interest within a limited time
budget, and (ii) ordering them so that they narrate a coherent data
story. More formally, each query is associated with an interest score
as well as with an execution cost. A distance between queries is
used to order the queries so that the transition cost between two
consecutive queries is minimized. Interestingly, a study of the
state-of-the-art reveals that TAP has not been studied in the
Operations Research community, while being close to two classical
optimization problems (the Traveling Salesman Problem and the
Knapsack Problem) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Practically, we envision TAP as being at the core of an IDE
system being used by data analysts, like the one described in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Analysts would not directly enter TAP parameters, but would
use a storytelling mechanism instead (e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) or patterns like “I
want a quick short report around this query result” or “I want an
in-depth analysis in this particular zone of the cube”.
      </p>
      <p>The contributions of this preliminary effort include the
formalization of TAP as well as the proof of its strong NP-hardness
(Section 2), an approximation algorithm (Section 3), and some
preliminary tests assessing the parameter sensitivity, execution
time, and closeness to the optimum of our heuristic (Section 4).
Section 5 concludes the paper by discussing future research.</p>
    </sec>
    <sec id="sec-2">
      <title>PROBLEM FORMULATION AND COMPLEXITY</title>
      <p>In this section, we formalize TAP and show its NP-hardness.</p>
    </sec>
    <sec id="sec-4">
      <title>Problem formulation</title>
      <p>TAP is formulated as:</p>
      <p>Input: Given a set of n queries, a function interest
estimating an interestingness score for each query, a function cost
estimating the execution cost of each query, and a function
dist estimating a cognitive distance between queries,
Do: find a sequence of m ≤ n queries (without repetition)
S.T.: the sequence has the following properties:
(1) it maximizes the overall interestingness score,
(2) the sum of the costs does not exceed a user-specified
time budget t ,
(3) it minimizes the overall cognitive distance between the
queries.</p>
      <p>
        We assume that a storytelling mechanism (e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) generates
a set of candidate queries towards producing a story. Thus, our
deliberations start with a set of n candidate queries whose
execution has to be optimized. Formally, let Q be a set of n queries,
each associated with a positive time cost cost(qi) and a positive
interestingness score interest(qi). Each pair of queries is
associated with a metric dist(qi, qj) for their cognitive distance. Given
a time budget t, the optimization problem consists in finding
a sequence ⟨q1, . . . , qm⟩ of queries, qi ∈ Q, without repetition,
with m ≤ n, such that:
(1) max Σ_{i=1}^{m} interest(qi),
(2) Σ_{i=1}^{m} cost(qi) ≤ t,
(3) min Σ_{i=1}^{m−1} dist(qi, qi+1).</p>
      <p>The decision problem associated with this optimization
problem is to decide if such a sequence exists.</p>
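      <p>To make the formulation concrete, the sketch below solves tiny instances exactly by exhaustive enumeration, in the spirit of the brute-force optimal baseline used later in Section 4. It is a sketch only (feasible for very small n); all names are illustrative, and interest, cost, and dist are assumed to be callables:</p>

```python
from itertools import permutations

def optimal_tap(queries, interest, cost, dist, t):
    """Exhaustively enumerate ordered subsets of queries (tiny n only):
    keep sequences whose total cost fits the budget t, maximize total
    interest, and break ties by minimal total transition distance."""
    best, best_key = None, None
    n = len(queries)
    for m in range(1, n + 1):
        for seq in permutations(queries, m):
            if sum(cost(q) for q in seq) > t:
                continue  # violates the time budget (constraint 2)
            gain = sum(interest(q) for q in seq)                    # constraint 1
            travel = sum(dist(a, b) for a, b in zip(seq, seq[1:]))  # constraint 3
            key = (gain, -travel)  # lexicographic: interest first, then distance
            if best_key is None or key > best_key:
                best, best_key = seq, key
    return best
```

      <p>The lexicographic tie-break is one simple way to combine objectives (1) and (3); the paper leaves their exact combination open.</p>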
    </sec>
    <sec id="sec-5">
      <title>Complexity</title>
      <p>
        TAP can be related to two families of classical NP-hard
optimization problems: (i) the Knapsack problem, which consists
in picking weighted items from a set S such that the sum of
their values is maximum, without exceeding a given size [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
corresponding to constraints (1) and (2); and (ii) the traveling
salesman problem (TSP), which aims at finding the shortest route
that visits all cities and returns to the initial one [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and is close
to constraint (3).
      </p>
      <p>
        In our context, the former problem would find the most
interesting queries given a time budget but, in its classical formulation,
it would miss the ordering of queries. A variant [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] includes the
position of each object in the Knapsack via the definition of a
function (which is constant in the case of classical Knapsack).
While this problem is closer to TAP, it involves a function that
only relies on the position (or order) of an object in the Knapsack,
and not on the positions of objects previously picked.
      </p>
      <p>TAP could also be modeled as a TSP with
particular constraints: (i) the TSP cities are the queries; (ii) inter-city
distances correspond to the cognitive distances between queries,
whose total must be minimized. However, differently from
classical TSP, TAP operates under strong constraints: (iii) it is possible
to visit only a subset of cities, and each city has a weight
corresponding to the action’s interest, whose total is to be maximized; and
(iv) each city has a duration of visit corresponding to the cost of
the query, whose sum must not go beyond a given time budget.</p>
      <p>
        Interestingly, this optimization problem has not been studied
in the literature yet. The variant of the TSP called TSP with profit
(TSPwP), described by Feillet et al. in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], is closer to our problem,
but still differs in two aspects: (i) it looks for circuits and does
not reject any vertex in the solution, and (ii) it imposes a limit in
terms of travel distance (inter-query distance in our case),
while our limit is on the cost of queries (the duration of visit).
      </p>
      <p>An in-depth study of the TAP complexity is beyond the scope
of this paper. However, we can easily show that our problem is
strongly NP-hard since the TSP is a particular case of it. Indeed, if
the time budget t is high enough, i.e., all queries can be selected,
then TAP is a TSP. This result means that, unless P=NP, the TAP
problem can only be solved to optimality by algorithms with a
worst-case time complexity in O*(c^n), with c a positive constant and
n the size of Q.</p>
    </sec>
    <sec id="sec-6">
      <title>Size of the problem and our naive approach</title>
      <p>We now discuss the size n of TAP (the number of queries in
Q) since the size of an optimization problem usually impacts
the choice of resolution approaches. Theoretically, given a cube
schema, all non-empty queries over this schema could be
considered. Practically, it is reasonable to assume that this set is
generated from a given initial query q0.</p>
      <p>In what follows, we restrict to star join queries over a star
schema of the form q = (g, s, m) where g is the query group-by
set, s is a set of selections, and m is a set of measures, all three sets
being pairwise disjoint. Transforming a query into another with
a simple OLAP operation means either changing the group-by
set g, changing a selection in s, or changing a measure in m.</p>
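      <p>As an illustration of these simple OLAP operations, the sketch below enumerates the queries one operation away from a query q. The concrete representation (group-by as a dimension-to-level map, selections and measures as sets) is our simplifying assumption, not the paper's implementation:</p>

```python
def neighbors(q, hierarchies, domain, measures):
    """Generate queries one simple OLAP operation away from q = (g, s, m),
    assuming one linear hierarchy per dimension: change the group-by
    level in one dimension, swap one selection value, or swap one measure."""
    g, s, m = q
    out = []
    # change the group-by set: move one dimension to another of its levels
    for d, level in g.items():
        for other in hierarchies[d]:
            if other != level:
                g2 = dict(g)
                g2[d] = other
                out.append((g2, s, m))
    # change a selection: replace one selected value by another domain value
    for v in s:
        for w in domain:
            if w not in s:
                out.append((g, (s - {v}) | {w}, m))
    # change a measure: replace one measure by another available one
    for x in m:
        for y in measures:
            if y not in m:
                out.append((g, s, (m - {x}) | {y}))
    return out
```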
      <p>Even restricting Q to the set of queries that can be generated
by transforming an initial query q0 = (д0, s0, m0) with a sequence
of simple OLAP operations, the size of Q is potentially very large,
ruling out the search for an exact solution. A rough estimate of
the number of queries that can be generated from q0 by applying
k OLAP operations, i.e., the size of Q, can be done by assuming
for simplicity that dimensions only have linear hierarchies (no
branches):</p>
      <p>|Q| = ((Π_i h_i − 1) + (2^|D| − 1) + (2^|M| − 1)) · k
where h_i is the number of levels in dimension i, D is the union of
the active domains of all levels, and M is the set of all measures
in the cube. Changing the query group-by set means picking one
group-by set among all the possible ones, excluding the current
one g0. Changing the selected values means picking a set of values
in the active domain, excluding the current one s0. Changing the
measure set means picking a set of measures among all possible
sets of measures, excluding the current one m0.</p>
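      <p>As a hedged numeric illustration of this estimate (the schema figures below are invented, not taken from the paper):</p>

```python
def estimate_Q_size(level_counts, n_domain_values, n_measures, k):
    """Rough estimate of |Q| after k simple OLAP operations, assuming
    linear hierarchies: group-by sets (product of per-dimension level
    counts, minus the current one), non-empty selection sets over the
    active domain, and non-empty measure sets, each excluding the
    current choice."""
    groupbys = 1
    for h in level_counts:
        groupbys *= h
    selections = 2 ** n_domain_values - 1
    measures = 2 ** n_measures - 1
    return ((groupbys - 1) + (selections - 1) + (measures - 1)) * k

# e.g., 3 dimensions with 3 levels each, 10 active domain values,
# 4 measures, and k = 2 operations give 2124 candidate queries
print(estimate_Q_size([3, 3, 3], 10, 4, 2))
```

      <p>Even with these tiny invented figures the selection term dominates, which illustrates why an exact search over Q is out of reach.</p>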
      <p>In order to approach solutions of TAP for arbitrarily large sets of
queries, we adopt the following strategy. We first use a heuristic
to solve the Knapsack problem and obtain a subset of queries,
using estimated costs and interests, so that the estimated costs
satisfy the time budget. Then, we order the queries by increasing
estimated cost and evaluate them. We periodically check the
difference between the time budget constraint and the elapsed
time: if it is negative (too much time has been taken), we redo a Knapsack
to reduce the set of chosen queries; otherwise (we can benefit
from additional time), we redo a Knapsack adding previously not
taken queries. Finally, we determine an order on the chosen set of
queries so that cognitive distance is minimized, using a heuristic
to solve the TSP.</p>
    </sec>
    <sec id="sec-7">
      <title>APPROXIMATION ALGORITHM</title>
      <p>
        Before presenting our approach, we discuss its parameters, i.e.,
the three functions for cost, interest, and distance, and the set Q
of queries. Choosing the best three functions or defining the best
set of queries Q is outside the scope of this paper. Note that a
framework for learning cell interestingness in a cube is the topic
of a recent paper [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. We give examples in this section, and we
indicate precisely in the tests of Section 4 the parameters used.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Cost</title>
      <p>The cost of a query is related to its execution time. Classically,
this cost can be estimated by a query optimizer before the
execution of the query. We therefore consider that we can measure
a query cost in two ways, to obtain an a priori cost (e.g., using
the RDBMS optimizer) and an a posteriori cost (the actual query
execution time). The a priori cost is used to decide if a query can
be included or not in the solution, while the a posteriori cost is
used to compute the elapsed time.
</p>
    </sec>
    <sec id="sec-9">
      <title>Interestingness measure</title>
      <p>
        A crucial part of TAP lies in the definition of an interestingness
measure to determine the optimal subset of queries. To quickly
decide if a query is interesting, it is preferable that this measure
can be computed before the actual evaluation of the query by
the DBMS, and therefore that it relies on the text of the query. In this
sense, we propose to follow the idea of subjective interestingness
measure of a pattern as developed by De Bie in the context of
Exploratory Data Mining (EDM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and to extend it to measure the
subjective interestingness of a query as expressed by a coherent
set of query parts.
      </p>
      <p>In the information-theoretic formalism proposed by De Bie,
interestingness is conditioned by the prior knowledge belief(p) on
a pattern p of the data space, which is expressed as a probability
distribution over the set of patterns P. The interestingness
measure IM is derived by normalizing the belief by the complexity
of the pattern as follows:
IM(p) = − log(belief(p)) / complexity(p)   (1)</p>
      <p>
        In the context of BI, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduces an innovative approach
to learn the belief distribution associated to query parts. The
approach considers a connected graph of query parts based on
the schema of the cube and the past usage of query parts, and uses
a random walk on this graph to produce the expected long-term
distribution over query parts. Interestingly, by construction, the
probabilities obtained for each query part are independent, which
allows us to propose a simple formulation for the interestingness
measure of a query q = (g, s, m) based on its query parts p ∈
(g ∪ s ∪ m):
      </p>
      <p>IM(q) = − Σ_{p ∈ (g ∪ s ∪ m)} log(belief(p)) / (|g| + |s| + |m|)   (2)</p>
    </sec>
    <sec id="sec-10">
      <title>Cognitive distance</title>
      <p>
        To order a set of queries we draw inspiration from Hullman et al.
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] who estimate the cognitive cost of transiting from the result
of one query to the result of another. We use this cost as a proxy
for cognitive distance. Interestingly, this cost can be estimated
without having evaluated the query. Using controlled studies of
people’s preferences for different types of single
visualization-to-visualization transitions, Hullman et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed a
transition cost model that approximates the cognitive cost of moving
from one visualization to the next in a sequence of static views.
The transition cost is defined as the number of transformations
required to convert the data shown in the first view to the
second. A single transformation is defined as a change to one of
the data fields shown from the first view to the second. We use
this cost model to define a simple distance between queries as
the Jaccard distance between sets of query parts. Formally, let
q1 = (g1, s1, m1) and q2 = (g2, s2, m2) be two queries; we define:
dist(q1, q2) = 1 − |(g1 ∪ s1 ∪ m1) ∩ (g2 ∪ s2 ∪ m2)| / |g1 ∪ s1 ∪ m1 ∪ g2 ∪ s2 ∪ m2|   (3)
      </p>
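      <p>A minimal sketch of this Jaccard distance over query parts (the tuple-of-sets query representation is an illustrative assumption):</p>

```python
def dist(q1, q2):
    """Jaccard distance between the query-part sets of two queries
    q = (g, s, m): 0 means identical parts, 1 means fully disjoint."""
    g1, s1, m1 = q1
    g2, s2, m2 = q2
    p1 = set(g1) | set(s1) | set(m1)
    p2 = set(g2) | set(s2) | set(m2)
    union = p1 | p2
    if not union:
        return 0.0  # two empty queries are at distance 0 by convention
    return 1.0 - len(p1.intersection(p2)) / len(union)
```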
    </sec>
    <sec id="sec-11">
      <title>Set of queries</title>
      <p>TAP is defined for a given set Q of queries. Practically, we consider
that this set is generated from a given initial query, q0. This initial
query can be interpreted as a particular moment in an interactive
analysis where the user considers that what is retrieved is important
and worth exploring “around” it, but has too many directions
to explore. In what follows, we consider this initial query as a
parameter of our approach.</p>
      <p>
        Defining the set of queries worth exploring after q0 should
be rooted in the earlier OLAP literature, especially automatic
reporting [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], automatic cube exploration [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], discovery-driven
analysis [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ], and more generally realistic OLAP workloads
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. In the Cinecubes approach [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the authors consider two
types of queries: (i) queries for values similar to those defining
the selection filters of the initial query (i.e., siblings of ancestors),
and (ii) direct drill-downs into the dimensions of the initial
result, one dimension at a time. In the DICE approach [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the
authors consider direct roll-up queries, direct drill-down queries,
sibling queries (a change of a dimension value, i.e., coordinate,
in a dimension), and pivot queries (a change in the inspected
dimension). In the DIFF and RELAX operators [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ], the
authors consider direct or distant drill-down (resp. roll-up) queries
that detail (resp. aggregate) two particular cells of a query result.
Finally, the CubeLoad OLAP workload generator [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is based
on patterns modeling realistic OLAP sessions, which could be used
to generate the queries of Q.
      </p>
      <p>Ideally, the set Q should include all these potentially relevant
follow-up queries to q0. For the present work we will consider
different sets varying the generation of roll-up, drill-down, and
sibling queries (see Section 4).</p>
    </sec>
    <sec id="sec-12">
      <title>A simple heuristic for approximating TAP</title>
      <p>Algorithm 1 presents our heuristic, named Reopt, since it is based
on a naive re-optimization principle. Note that the generation of
the set Q from an initial query q0 is considered as a pre-processing
step. Algorithm 1 takes advantage of the fact that functions
interest, cost, and distance can be computed solely on the basis of
the query expression (and not on the result of the queries).</p>
      <p>First, Algorithm 1 splits the time budget t into tk (time for the
global query execution) and to (time for other tasks, like solving
the Knapsack, solving the TSP, etc.). Then the Knapsack is solved
(line 2), and the selected queries are ordered by their estimated
evaluation cost (line 3). Queries are then executed (line 6) and,
after each execution, the elapsed time is checked. If it is
estimated that the time budget tk will not be respected (line 9), then
another Knapsack is triggered with the remaining time. If it is
estimated that the time budget will not be completely spent (line
13), then another Knapsack is triggered with all the remaining
queries. Finally, once all the queries of the chosen set
are executed, the TSP is solved. It is easy to verify that Algorithm
1 converges: the set K is at worst Q and, at each iteration of the
for loop (lines 5-16), the set E is augmented with one more query
of K while that query is removed from K.</p>
      <p>Note that, when actually executing the queries, Reopt attaches
particular importance to the estimated cost (see for instance
line 3) compared to interest or distance. The main reason for
this decision is the time budget that has to be respected:
had we given priority to distance from q0 or to interest, we might
have executed costly queries first or we might have had many ties
(since many queries would be at the same distance from q0).
Although alternative formulations are also possible, due to the
sensitivity of time efficiency in user interactions (keep in mind
that we operate in the context of exploratory data analysis), it is
imperative that the other factors do not guide query execution
to take up too much time.</p>
      <p>
        Implementation-wise, the Knapsack problem is solved using
a Fully Polynomial-Time Approximation Scheme that gives a bound on
the divergence from the optimum [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and for the TSP using the
heuristic of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-13">
      <title>PRELIMINARY EXPERIMENTS</title>
      <p>Our preliminary experiments aim at understanding the behavior
of our naive algorithm Reopt . To this end, we have compared
it to two other algorithms: a brute-force one that looks for the
optimal solution (named optimal), which obviously works only
on small instances of TAP, and a simplistic one that consists of
solving the Knapsack, then solving the TSP, and then executing
the queries (named K+TSP). All the algorithms are implemented
in Java 12 and run on a Core i7 with 16GB RAM under Linux
Fedora. The code is available on GitHub (https://github.com/
OLAP3/CubeExplorer). Note that, in all tests, costs are estimated
not with the estimates coming from the query optimizer of the
DBMS but with a linear regression model based on the query length
in ASCII, the number of tables, the number of projections, the
number of selections, and the number of aggregations of the SQL
star join query. This is because our empirical tests have shown
that this model was more accurate than the raw estimates given
by the query optimizer.</p>
      <p>Algorithm 1: Reopt
Data: An instance (Q, interest(), dist(), cost(), t) of TAP
Result: A sequence of queries
1  Split t = tk + to
2  K = knapsack(Q, tk)
3  sort K by increasing cost
4  E = ∅
5  for each query q ∈ K do
6      execute q
7      K = K \ {q}
8      E = E ∪ {q}
9      if elapsedTime + Σ_{q ∈ K} cost(q) &gt; tk then
10         // we may have less time than expected
11         K = knapsack(K, tk − elapsedTime)
12         sort K by increasing cost
13     else if elapsedTime + Σ_{q ∈ K} cost(q) &lt; tk then
14         // we may have more time than expected
15         K = knapsack(Q \ E, tk − elapsedTime)
16         sort K by increasing cost
17 S = TSP(E); // construct sequence
18 return S</p>
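      <p>For illustration, a runnable sketch of Algorithm 1 follows. The paper solves the Knapsack with an FPTAS [14] and the TSP with the heuristic of [15]; here a greedy knapsack and a nearest-neighbor ordering stand in for them, so this is a simplified approximation of Reopt, not the authors' implementation:</p>

```python
import time

def greedy_knapsack(queries, budget, interest, cost):
    """Stand-in for the knapsack heuristic: pick queries by interest
    per unit of estimated cost until the budget is exhausted
    (costs are assumed positive, as in the formulation)."""
    chosen, used = [], 0.0
    for q in sorted(queries, key=lambda q: interest(q) / cost(q), reverse=True):
        if used + cost(q) > budget:
            continue
        chosen.append(q)
        used += cost(q)
    return chosen

def nearest_neighbor_order(queries, dist):
    """Stand-in for the TSP heuristic: nearest-neighbor ordering."""
    if not queries:
        return []
    seq, rest = [queries[0]], list(queries[1:])
    while rest:
        nxt = min(rest, key=lambda q: dist(seq[-1], q))
        rest.remove(nxt)
        seq.append(nxt)
    return seq

def reopt(Q, interest, dist, cost, execute, t, to=0.1):
    """Re-optimization heuristic: solve a knapsack on estimated costs,
    execute queries by increasing cost, re-solve whenever the elapsed
    time diverges from the estimates, and finally order the executed
    set by a TSP heuristic."""
    tk = t - to                                   # line 1: split t = tk + to
    K = greedy_knapsack(Q, tk, interest, cost)    # line 2
    K.sort(key=cost)                              # line 3
    E, start = [], time.monotonic()               # line 4
    while K:                                      # lines 5-16
        q = K.pop(0)
        execute(q)                                # line 6
        E.append(q)
        elapsed = time.monotonic() - start
        remaining_est = sum(cost(r) for r in K)
        if elapsed + remaining_est > tk:          # line 9: less time than expected
            K = greedy_knapsack(K, tk - elapsed, interest, cost)
            K.sort(key=cost)
        elif tk > elapsed + remaining_est:        # line 13: more time than expected
            pool = [r for r in Q if r not in E]
            K = greedy_knapsack(pool, tk - elapsed, interest, cost)
            K.sort(key=cost)
    return nearest_neighbor_order(E, dist)        # lines 17-18
```

      <p>Each loop iteration executes one new query, so the loop terminates after at most |Q| iterations, mirroring the convergence argument given for Algorithm 1.</p>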
    </sec>
    <sec id="sec-14">
      <title>Optimization time</title>
      <p>
        In this first test, we use a simple synthetic cube under the schema
of the SSB benchmark [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], with an instance of 1GB under
MariaDB, produced with the TPC-H data generator. The cube has
only one fact table and 5 dimension tables, which enables us to keep
the number of queries in Q under control for comparing Reopt
and K+TSP with optimal. Over this schema, we have used the
CubeLoad generator [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] to generate 40 q0 queries. These queries
are used to generate Q with direct roll-up or drill-down queries,
i.e., |Q| can vary between 5 and 10.
      </p>
      <p>Figure 1 shows, on a logarithmic scale, the average time taken
by the 3 algorithms to optimize (not counting the query
execution time) for different sizes of Q, with time budget varying from
1 second to 10 seconds for Reopt and K+TSP. Results are as
expected, with K+TSP outperforming its two competitors, since
it runs the Knapsack solver only once. Notably, the time taken by
Reopt remains under control.</p>
      <p>To assess the benefit of our re-optimization step (line 9-16
of Algorithm 1), we counted the number of times each branch
of the if statement in line 9 of Reopt is taken, i.e., the number
of times for negative (lines 10-12) and for positive (lines 14-16)
re-optimizations. We have observed that both branches are used,
with positive re-optimizations being rarely done compared to
negative ones; precisely, over 399 total runs, positive re-optimizations
are done 116 times while negative re-optimizations are done 851
times. This demonstrates that Reopt can adaptively respond to
inaccurate estimates.</p>
    </sec>
    <sec id="sec-15">
      <title>Distance and interestingness vs time budget</title>
      <p>For the next series of tests, we use a cube produced by a French
project on energy vulnerability. It is organized as a star schema
with 19 dimensions, 68 (non-top) levels, and 24 measures, and it
contains 37,149 facts recorded in the fact table. The cube is stored
in SQL Server. We use 19 real user explorations over this cube
(navigation traces generated by master students when analysing
data during a course) and pick as q0 the first query of the
explorations. For each of these queries, we generate Q by rolling-up q0
in a dimension, drilling down q0 in a dimension, or computing
a sibling query in a dimension. We use this cube and set Q to
obtain more realistic instances of TAP. Precisely, |Q | ranges from
29 to 149 queries for these tests. Over these sets Q we run Reopt
and K+TSP and observe how total interestingness and average
distance between queries change when increasing the time budget t
from 500 milliseconds to 5 seconds.</p>
      <p>Figures 2 and 3 show the results. It can be seen that Reopt
outperforms the simplistic K+TSP in terms of interestingness,
which illustrates the benefit of our re-optimization heuristic,
while both algorithms achieve comparable average distance.</p>
    </sec>
    <sec id="sec-16">
      <title>Unused time</title>
      <p>In this last test, using the same cube and protocol as in the
previous subsection, we want to understand if the time budget is used
properly.</p>
    </sec>
    <sec id="sec-17">
      <title>CONCLUSION</title>
      <p>This paper introduced the Traveling Analyst Problem (TAP), the
problem of computing the most interesting and coherent data
story that can be obtained within a reasonable time. We formalized
the problem, showed its strong NP-hardness, and proposed a heuristic
for finding approximate solutions to it. Our preliminary
experiments show that a heuristic based on simple re-optimization is a
promising direction to obtain acceptable solutions.</p>
      <p>
        We believe that TAP opens many interesting research
directions. Obviously, the first step is an in-depth theoretical study
of TAP, to understand which types of optimization algorithms
are more appropriate. Importantly, TAP should be investigated
in the context of data exploration, which means that
optimization algorithms should take advantage of classical data
management optimizations, like re-optimization (e.g., [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]), and that TAP
should be declaratively formulated, for instance by having
starting points expressed in an intentional fashion (e.g., [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]). User
tests should be conducted for further evaluating the approach.
Finally, we also note that the definition of TAP is general, leaving
room for variants, e.g., changing the definition of queries (e.g.,
from non-OLAP SQL queries to more complex actions involving
queries and pattern mining or statistical tests), as well as
changing cost (e.g., using a self-adjusting cost model), interest (e.g., using
statistics or data sampling), and distance functions.
      </p>
      <p>Acknowledgment. The authors would like to thank Vincent
T’Kindt for his insights on TAP complexity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>David L.</given-names>
            <surname>Applegate</surname>
          </string-name>
          , Robert E. Bixby, Vašek Chvátal, and
          <string-name>
            <given-names>William J.</given-names>
            <surname>Cook</surname>
          </string-name>
          .
          <source>The Traveling Salesman Problem: A Computational Study</source>
          . Princeton Series in Applied Mathematics. Princeton UP,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Tijl</given-names>
            <surname>De Bie</surname>
          </string-name>
          .
          <article-title>Subjective interestingness in exploratory data mining</article-title>
          .
          <source>In Proc. of IDA</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Chanson</surname>
          </string-name>
          , Ben Crulis, Krista Drushku, Nicolas Labroche, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Marcel</surname>
          </string-name>
          .
          <article-title>Profiling user belief in BI exploration for measuring subjective interestingness</article-title>
          .
          <source>In Proc. of DOLAP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Fabián</given-names>
            <surname>Díaz-Núñez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Franco</given-names>
            <surname>Quezada</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Óscar C.</given-names>
            <surname>Vásquez</surname>
          </string-name>
          .
          <article-title>The knapsack problem with scheduled items</article-title>
          .
          <source>Electronic Notes in Discrete Mathematics</source>
          ,
          <volume>69</volume>
          :
          <fpage>293</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Magdalini</given-names>
            <surname>Eirinaki</surname>
          </string-name>
          , Suju Abraham, Neoklis Polyzotis, and Naushin Shaikh.
          <article-title>QueRIE: Collaborative database exploration</article-title>
          .
          <source>TKDE</source>
          ,
          <volume>26</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1778</fpage>
          -
          <lpage>1790</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Feillet</surname>
          </string-name>
          , Pierre Dejax, and
          <string-name>
            <given-names>Michel</given-names>
            <surname>Gendreau</surname>
          </string-name>
          .
          <article-title>Traveling salesman problems with profits</article-title>
          .
          <source>Transportation Science</source>
          ,
          <volume>39</volume>
          (
          <issue>2</issue>
          ):
          <fpage>188</fpage>
          -
          <lpage>205</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Feurer</surname>
          </string-name>
          , Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Efficient and robust automated machine learning</article-title>
          .
          <source>In Proc. of NIPS</source>
          , pages
          <fpage>2962</fpage>
          -
          <lpage>2970</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Dimitrios</given-names>
            <surname>Gkesoulis</surname>
          </string-name>
          , Panos Vassiliadis, and
          <string-name>
            <given-names>Petros</given-names>
            <surname>Manousis</surname>
          </string-name>
          .
          <article-title>CineCubes: Aiding data workers gain insights from OLAP queries</article-title>
          .
          <source>IS</source>
          ,
          <volume>53</volume>
          :
          <fpage>60</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jessica</given-names>
            <surname>Hullman</surname>
          </string-name>
          ,
          Steven M. Drucker, Nathalie Henry Riche, Bongshin Lee, Danyel Fisher, and
          <string-name>
            <given-names>Eytan</given-names>
            <surname>Adar</surname>
          </string-name>
          .
          <article-title>A deeper understanding of sequence in narrative visualization</article-title>
          .
          <source>TVCG</source>
          ,
          <volume>19</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2406</fpage>
          -
          <lpage>2415</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Stratos</given-names>
            <surname>Idreos</surname>
          </string-name>
          , Olga Papaemmanouil, and
          <string-name>
            <given-names>Surajit</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>Overview of data exploration techniques</article-title>
          .
          <source>In Proc. of SIGMOD</source>
          , pages
          <fpage>277</fpage>
          -
          <lpage>281</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Garey</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <source>Computers and Intractability: A Guide to the Theory of NP-Completeness</source>
          . W.H. Freeman,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Niranjan</given-names>
            <surname>Kamat</surname>
          </string-name>
          , Prasanth Jayachandran, Karthik Tunga, and
          <string-name>
            <given-names>Arnab</given-names>
            <surname>Nandi</surname>
          </string-name>
          .
          <article-title>Distributed and interactive cube exploration</article-title>
          .
          <source>In Proc. of ICDE</source>
          , pages
          <fpage>472</fpage>
          -
          <lpage>483</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Hans</given-names>
            <surname>Kellerer</surname>
          </string-name>
          , Ulrich Pferschy, and David Pisinger. Knapsack problems. Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Katherine</given-names>
            <surname>Lai</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Goemans</surname>
          </string-name>
          .
          <article-title>The knapsack problem and fully polynomial time approximation schemes (fptas)</article-title>
          .
          <source>Technical report</source>
          , Massachusetts Institute of Technology,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Kernighan</surname>
          </string-name>
          .
          <article-title>An effective heuristic algorithm for the traveling-salesman problem</article-title>
          .
          <source>Oper. Res.</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>498</fpage>
          -
          <lpage>516</lpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Marcel</surname>
          </string-name>
          , Nicolas Labroche, and
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>Towards a benefit-based optimizer for interactive data analysis</article-title>
          .
          <source>In Proc. of DOLAP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Marcel</surname>
          </string-name>
          , Verónika Peralta, and
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>A framework for learning cell interestingness from cube explorations</article-title>
          .
          <source>In Proc. of ADBIS</source>
          , pages
          <fpage>425</fpage>
          -
          <lpage>440</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Volker</given-names>
            <surname>Markl</surname>
          </string-name>
          , Vijayshankar Raman,
          <string-name>
            <given-names>David E.</given-names>
            <surname>Simmen</surname>
          </string-name>
          ,
          Guy M. Lohman, and
          <string-name>
            <given-names>Hamid</given-names>
            <surname>Pirahesh</surname>
          </string-name>
          .
          <article-title>Robust query processing through progressive optimization</article-title>
          .
          <source>In Proc. of SIGMOD</source>
          , pages
          <fpage>659</fpage>
          -
          <lpage>670</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Tova</given-names>
            <surname>Milo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amit</given-names>
            <surname>Somech</surname>
          </string-name>
          .
          <article-title>Next-step suggestions for modern interactive data analysis platforms</article-title>
          .
          <source>In Proc. of KDD</source>
          , pages
          <fpage>576</fpage>
          -
          <lpage>585</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Patrick E.</given-names>
            <surname>O'Neil</surname>
          </string-name>
          , Elizabeth J. O'Neil, Xuedong Chen, and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Revilak</surname>
          </string-name>
          .
          <article-title>The star schema benchmark and augmented fact table indexing</article-title>
          .
          <source>In Proc. of TPCTC</source>
          , pages
          <fpage>237</fpage>
          -
          <lpage>252</lpage>
          , Lyon, France,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Enrico</given-names>
            <surname>Gallinucci</surname>
          </string-name>
          .
          <article-title>CubeLoad: A parametric generator of realistic OLAP workloads</article-title>
          .
          <source>In Proc. of CAiSE</source>
          , pages
          <fpage>610</fpage>
          -
          <lpage>624</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Sunita</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          .
          <article-title>Explaining differences in multidimensional aggregates</article-title>
          .
          <source>In Proc. of VLDB</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Gayatri</given-names>
            <surname>Sathe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sunita</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          .
          <article-title>Intelligent rollups in multidimensional OLAP data</article-title>
          .
          <source>In Proc. of VLDB</source>
          , pages
          <fpage>531</fpage>
          -
          <lpage>540</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          , Patrick Marcel, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>Beyond roll-up's and drill-down's: An intentional analytics model to reinvent OLAP</article-title>
          .
          <source>IS</source>
          ,
          <volume>85</volume>
          :
          <fpage>68</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Abdul</given-names>
            <surname>Wasay</surname>
          </string-name>
          , Manos Athanassoulis, and
          <string-name>
            <given-names>Stratos</given-names>
            <surname>Idreos</surname>
          </string-name>
          .
          <article-title>Queriosity: Automated data exploration</article-title>
          .
          <source>In Proceedings of IEEE International Congress on Big Data</source>
          , pages
          <fpage>716</fpage>
          -
          <lpage>719</lpage>
          , New York City, NY,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>