<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Joint Conference (March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards a benefit-based optimizer for Interactive Data Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Marcel</string-name>
          <email>firstname.lastname@univ-tours.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Labroche</string-name>
          <email>firstname.lastname@univ-tours.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panos Vassiliadis</string-name>
          <email>pvassil@cs.uoi.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Ioannina</institution>
          ,
          <addr-line>Ioannina</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Tours</institution>
          ,
          <addr-line>Tours</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>26</volume>
      <issue>2019</issue>
      <abstract>
        <p>This vision paper introduces several ideas around the optimization of Interactive Data Analysis (IDA) tasks. With an eye on traditional query optimization (QO) in Relational DataBase Management Systems (RDBMS), we suggest that IDA tasks should be specified through high-level statements and optimized globally, particularly by maximizing the number and significance of insights that can be automatically collected for the task. We envision an architecture for IDA and propose in the context of IDA what corresponds to statistics, cost model and execution plan pruning strategy in relational systems. We give elements pertaining to the feasibility of the vision and draw perspectives.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Interactive Data Analysis (IDA) corresponds to the process of
exploring a dataset by means of a sequence of actions aiming
at answering an often imprecise user information need [
        <xref ref-type="bibr" rid="ref13 ref18">13, 18</xref>
        ].
Let us quote [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]: Data analysis is fundamentally an
interactive, iterative process in which a user issues an analysis action (i.e.
query), receives a results set, and decides if and which action to issue
next. Until recent years, analysis tasks required thorough expertise
in SQL and programming, as well as mathematics and statistics.
However, since the advent of the Big Data era, the infrastructures
and support for Interactive Data Analysis (IDA) [...] are gradually
replacing traditional tools, allowing easy to-use data exploration,
visualization, and mining, even for users lacking knowledge of SQL
and programming languages. Yet, IDA is still a difficult process,
especially for inexperienced users, as it requires a deep understanding
of the investigated domain and the particular context. Users may
therefore skip significant analysis actions and overlook important
aspects of the data.
      </p>
      <p>
        Interestingly, a fully automated IDA
process, called continuous exploration, has already been envisioned [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. It is therefore
important to be able to optimize this process and its steps.
      </p>
      <p>Topic of this paper. This paper’s topic is the discussion of an
envisioned query optimization process that goes beyond traditional
query processing and optimization and addresses the issue of
efficiently computing results to interactive data exploration steps,
with these results going far beyond the simple delivery of tuples
(as nowadays happens), and including tuples, data mining results
on these tuples, as well as text and visualization generation.</p>
      <p>Traditional optimization. If a RDBMS is used in IDA, i.e., the
action corresponds to the execution of an SQL statement, the full
power of RDBMS query evaluation is unleashed to optimize one
action of the IDA task, which can be seen as a local optimization.
In these systems, query optimization is typically done at runtime,
and corresponds to the selection of a physical query execution plan.</p>
      <p>Vision. In what follows, we consider explorations where a
user interactively queries some data sources and applies Machine
Learning (ML) algorithms (we use ML as a general term for
supervised/unsupervised, descriptive/predictive methods) to
extract models or patterns out of the query answers, to find
valuable insights in the data. We foresee that IDA users will
express such explorations with a high-level declarative language.
But differently from an RDBMS,
and given the explorative nature of IDA, we assume that such
high-level statements will not be prescriptive, in the sense that the
atomic actions (queries over DB, application of ML algorithms)
and their combination will be left to the statement processor. We
believe, though, that a declarative language remains needed, and,
for instance, simple statements like keyword search may not be
suitable for expressing complex IDA tasks. While traditional query
optimization tackles the problem of optimizing one particular
step of IDA (i.e., relational query processing), there is a need to
optimize the whole sequence of steps, including querying data,
extracting models, etc.</p>
      <p>We are quite emphatic in our vision that the current notion of
query result needs to be fundamentally questioned. In our vision,
the traditional treatment of queries, holding sets-of-tuples as results,
needs to be replaced by data story answering. A "data story" is the
answer to an intentional, high-level query via the composition of
individual results that (a) address the core of the original query,
(b) contextualize it, by comparing the state of the situation that
the query result describes with similar, relevant contexts, (c)
analyze and explain it, by highlighting critical sub-parts of the data
space that are responsible for the observed state, and (d) summarize
key highlights of this mini-exploration into a concise summary. To
achieve this, apart from asking database queries, it is imperative
that the engine(s) involved complement the retrieved data with
the mining of ML models from them and automatically generate
textual descriptions and visualizations in a coherent sequence of
"data episodes" (dashboards with data, ML results, text and visuals)
that compose the data story.</p>
      <p>
        Example. We showcase our vision with an example inspired
from [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Assume a user is analyzing sales results in a dashboard
narrating the current state of sales through a collection of
crosstabs and graphics. The user then would like to: "verify whether
this distribution of sales for mfgr#5 in Argentina from 2011 to
2016 still holds in general, and build a clustering model for it, then
backtrack to compare with sibling countries, and finally explain
the highest country-wise difference." This request would
correspond to the following statement in the intentional language:
explain[highlight: MaxDifference](
  compare[siblingCountries](
    backtrack(
      cluster(
        verify[mfgr#5, Argentina, 2011-2016](current dashboard)))))
      </p>
      <p>
        This statement is non-prescriptive in that it is left to the optimizer
to decide the best roll-up for the verification, the best algorithm
and number of clusters for the clustering, the best way to explain
the difference, etc. Each of these degrees of freedom will give
rise to a new plan, yielding an answer different from those of
the other plans, automatically generated with the relevant data
sources identified from the current dashboard. The optimizer
estimates the benefit of each plan in terms of the number of insights,
their significance, the time to extract them, etc., and eventually
translates the best plan into a sequence of operations, for
instance: roll-up, cluster, CineCubes' put-in-context [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and Diff
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. These operations are evaluated and the insights (e.g., major
deviations in the distribution of sales, prototypes of the clusters,
major contributions to the highest country-wise difference) in
each answer are highlighted. A data story is then produced by
sequencing the answers into a new dashboard and choosing for
them the best graphical displays.
      </p>
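      <p>
        To make the shape of such a statement concrete, the nested operators of the example can be sketched in Python; every function name and the list-based encoding of a data story are illustrative assumptions, not an existing language or API.
      </p>
```python
# Hypothetical sketch of the example intentional statement as nested operators.
# Each operator appends a symbolic step to a list standing in for a data story.

def verify(dashboard, mfgr, country, years):
    return dashboard + [("verify", mfgr, country, years)]

def cluster(story):
    return story + [("cluster",)]          # algorithm and k left to the optimizer

def backtrack(story):
    return story + [("backtrack",)]

def compare(story, siblings):
    return story + [("compare", siblings)]

def explain(story, highlight):
    return story + [("explain", highlight)]

current_dashboard = [("source", "sales")]
statement = explain(
    compare(
        backtrack(
            cluster(
                verify(current_dashboard, "mfgr#5", "Argentina", (2011, 2016)))),
        "siblingCountries"),
    "MaxDifference")
```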
      <p>Contributions. The contribution is a vision for a global (as
opposed to local) optimization scheme for IDA, where:
• the user issues a declarative intentional statement,
expressing her high-level data analysis goal in a non-prescriptive
way, i.e., without completely specifying the data sources
and operations to apply over them,
• this statement sketches a complete exploration, or data
story specification, i.e., a sequence of complex actions
including data retrieval, model extraction with ML,
automatic selection of key insights, and visualization. These
atomic actions derived from the statement (appearing in
the expression tree used for the execution plan) are DB
queries, ML algorithms, etc., expressed in terms of atomic
operators of the underlying execution engine,
• an optimizer, relying on a specific cost model that is
essentially related to the number, extraction cost, properties
and significance of insights gained by the user along the
exploration, decides how to combine the operators into
alternative plans and picks the best one on the basis of a
strategy defined for pruning the search space of execution
plans.
• the involved execution engine(s) produce intermediate
results that are combined, ranked, pruned with respect
to their significance and relevance to the user’s goal and
eventually compiled into a data story that is shown to the
user in response to her query statement.</p>
      <p>This vision paper is structured as follows. Section 2 discusses
cost-based optimization and insights. Section 3 exposes the
envisioned architecture. Section 4 presents the optimizer while
Section 5 focuses on its cost model. Finally, Section 6 discusses a
pruning strategy and Section 7 draws perspectives.</p>
    </sec>
    <sec id="sec-2">
      <title>COST-BASED OPTIMIZATION AND INSIGHTS</title>
      <p>We now motivate the need for a cost-based optimizer that relies
on a cost model where insights are first-class citizens.</p>
      <p>
        Why a cost-based optimizer? Merging DB with ML is not new.
A recent SIGMOD blog post reports interviews of DB and ML
academics and practitioners discussing this matter [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular,
model selection is mentioned in the discussion as an important
challenge. But mixing ML actions with DB queries (for instance,
computing a set of decision trees over the data of a query, or,
even more far-reaching, selecting the most informative out of
the ones that are generated to this end) is a complex problem.
We can distinguish two basic options to do so: (i) use a
rule-based approach: propose a set of fixed rules to derive queries
and applications of ML algorithms, based on the semantics of
the high-level statement. The rules are applied in a specific order
depending on the statement. Given a statement, the derivation
corresponds to the execution plan. (ii) use a cost-based approach:
one intentional statement generates several execution plans with
different operator instantiations. The cost of each plan is
estimated and the best one is retained. The cost model is defined
by means of an objective function that encodes the intuition of
what the optimal solution should be.
      </p>
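      <p>
        Option (ii) can be sketched as follows; the hard-coded candidate plans, the operator names and the toy benefit numbers are illustrative assumptions, not part of any existing system.
      </p>
```python
# Minimal sketch of cost-based selection: enumerate candidate plans with
# different operator instantiations, score each one, keep the best.

def enumerate_plans(statement):
    # One intentional statement yields several concrete plans (here hard-coded).
    return [
        {"ops": ["rollup", "kmeans(k=3)"], "est_insights": 4, "est_time": 2.0},
        {"ops": ["rollup", "dbscan"], "est_insights": 6, "est_time": 5.0},
        {"ops": ["drilldown", "kmeans(k=5)"], "est_insights": 3, "est_time": 1.0},
    ]

def benefit(plan, alpha=1.0, beta=0.5):
    # Toy objective: estimated insights minus a time penalty.
    return alpha * plan["est_insights"] - beta * plan["est_time"]

def pick_best(statement):
    return max(enumerate_plans(statement), key=benefit)

best = pick_best("explain(compare(backtrack(cluster(verify(...)))))")
```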
      <p>At some point of the discussion of the blog’s post, one of the
interviewed insists on the importance of cost-based schemes
compared to the more ad-hoc rule-based ones. Interestingly this
kind of transition from rule-based optimization to cost-based
optimization is what had made DB successful in terms of query
optimization.</p>
      <p>
        Why insights in the cost model? Insight is a fundamental driver
of the IDA process. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a benchmark for IDA is outlined. The
authors insist on the importance of insights. Quoting this paper:
Ultimately, the goal of interactive data exploration is to extract
insights from data. Thus, a system that allows to extract more
insights than another system within a given time frame is preferable.
However, creating a measure that captures this notion in a
comparable and reproducible way is hard. What is an insight? Do different
users have different notions of insights? How do we measure the
complexity and value of an insight? Are they domain-dependent?
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors argue that directly measuring insights per
minute cannot be done efficiently and they propose a proxy
metric that reflects how often and by how much a system violates
a latency requirement (for instance the interactive response time).
      </p>
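      <p>
        Such a proxy metric, as we read it, can be sketched as follows: it counts how often and by how much responses exceed an interactivity budget. The aggregation into a count and a total excess is our illustrative assumption, not the exact definition from the cited work.
      </p>
```python
# Sketch of a latency-violation proxy metric: how often and by how much
# response times exceed an interactivity budget (in seconds).

def latency_violations(response_times, budget):
    excesses = [t - budget for t in response_times if t > budget]
    return {"count": len(excesses), "total_excess": sum(excesses)}

metric = latency_violations([0.2, 1.5, 0.4, 2.0], budget=0.5)
```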
      <p>We agree that insights, their number, the time it takes to
discover them, should be of primary importance when processing
IDA tasks. At the same time, we also believe that other properties,
like for instance diversity of insights, significance of insights, etc.,
should be taken into account.</p>
      <p>
        Yet, what does insight mean? Insights are human-digestible
pieces of interesting information about the data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Let us quote
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: In a recent approach, Dove and Jones [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] combine the
definitions from the communities of Information Visualization and
Cognitive Psychology: whereas the InfoVis community defines
insight as "something that is gained" (after the observation of data by
a participant), psychologists define it as an "Aha!" moment which
is experienced. Interestingly, the two definitions can be combined
in a common view, where once the user works with information,
starting with an original state of mind on the current state of affairs,
there is an "Aha!" moment, where the user suddenly realizes a new
way of looking at the data, resulting in a new mental model for the
state of affairs, or else, new understanding [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        To facilitate the "Aha!" moment, several works already
proposed to characterize unexpected values in cubes ([
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], among
others), interesting [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or unexpected [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] patterns in data,
statistically significant relationships between data sets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and hidden
causes [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We note that control mechanisms can be included
[
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] to enforce insight validity.
      </p>
    </sec>
    <sec id="sec-4">
      <title>ENVISIONED ARCHITECTURE FOR IDA</title>
      <p>We assume the following multi-tier architecture for intentional
IDA:
• a user layer: in this layer, the user expresses intentional
statements asking for data stories;
• a story skeleton layer: in this layer, the translation of the
formal intentional statement(s) to a data story skeleton
(i.e., structure of requested operations, without the results
that will have to be plugged in) takes place;
• an optimizer layer: this layer is responsible for the
translation of the intentional statement into atomic actions
picked from a catalog, that are automatically tuned and
parametrized;
• an execution layer: that executes queries, applies models,
extracts highlights, etc.;
• a data layer: this layer holds databases, cubes, flat files,
user profiles, user traces, etc.;
• finally, a storytelling layer: this layer is responsible for
compiling all the intermediate results into meaningful data
stories.</p>
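      <p>
        The flow through these layers can be sketched as a simple pipeline; the layer bodies below are placeholder assumptions, and only the composition order follows the list above.
      </p>
```python
# Illustrative sketch of the multi-tier flow: each layer is a function and the
# layers are composed in order, from intentional statement to data story.

def user_layer(text):
    return {"intent": text}

def skeleton_layer(stmt):
    return {**stmt, "skeleton": ["retrieve", "mine", "highlight"]}

def optimizer_layer(sk):
    return {**sk, "plan": [op + ":tuned" for op in sk["skeleton"]]}

def execution_layer(pl):
    return {**pl, "results": ["result of " + op for op in pl["plan"]]}

def storytelling_layer(res):
    return {**res, "story": " -> ".join(res["results"])}

def process(statement):
    out = user_layer(statement)
    for layer in (skeleton_layer, optimizer_layer,
                  execution_layer, storytelling_layer):
        out = layer(out)
    return out

story = process("compare sales of mfgr#5 across sibling countries")
```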
      <p>
        More details about the layers. At the user layer, statements
can be intentions in the spirit of those defined in [
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ], or
translated into such language using NLP facilities. This layer is
also responsible for long-term and short-term modeling of the
user [
        <xref ref-type="bibr" rid="ref11 ref29">11, 29</xref>
        ], by processing implicit user feedback (e.g., intention
logs). While in traditional RDBMS a query computes an instance
from another instance, an intentional statement expresses a data
story starting from another data story. A data story is a sequence
of visualizations packed in a dashboard, produced by the
storytelling layer from the output of the optimization and execution
layers. Predefined user templates will be used to control the story
complexity, and to generate the initial data story, to start the
analysis. The storytelling layer identifies the most appropriate
graphical representation of query answers, and automatically
crafts narratives, commenting on the highlights presented, etc.
[
        <xref ref-type="bibr" rid="ref10 ref12">10, 12</xref>
        ]. As in web search SERP (Search Engine Result Page),
a set of stories resulting from the processing of an intentional
statement can be presented in a way that allows the user to
navigate the most relevant data stories. We note that explanations of
the models extracted from the data [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] as well as user-tailored
recommendations [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] may also be part of the story. The
execution layer is responsible for choosing the most efficient way to
execute the atomic operation (i.e., it applies local optimizations).
We now detail the optimization layer.
      </p>
    </sec>
    <sec id="sec-5">
      <title>OPTIMIZER</title>
      <p>
        The optimizer layer processes an intentional statement, produces
several query plans, evaluates their cost and picks the best one. A
query plan is a tree whose nodes are atomic operations, i.e., either
regular queries over data sources or calls to ML algorithms, in the
spirit of the trees used in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. This presupposes a catalog
of such operators. An objective function is used to estimate, for
each node in the tree, its cost or benefit (see Section 5).
      </p>
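      <p>
        Such a plan tree of atomic operations, with a benefit estimated per node and aggregated over the whole tree, might be sketched as follows; the node fields and the per-node scores are illustrative assumptions.
      </p>
```python
# Sketch of a query plan as a tree of atomic operations (queries or ML calls),
# with a benefit estimated per node and aggregated bottom-up over the tree.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    op: str                       # e.g. a query text or an ML algorithm name
    kind: str                     # "query" or "ml"
    children: List["PlanNode"] = field(default_factory=list)

def node_benefit(node):
    # Toy scores: ML steps assumed to yield more insights than plain queries.
    return 2.0 if node.kind == "ml" else 1.0

def plan_benefit(node):
    return node_benefit(node) + sum(plan_benefit(c) for c in node.children)

plan = PlanNode("explain", "ml", [
    PlanNode("roll-up query", "query", [PlanNode("base query", "query")]),
    PlanNode("kmeans", "ml"),
])
```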
      <p>One major diference compared to the classical query
optimization scheme is that, given an intentional statement, two
diferent query plans for this statement generally result in two
diferent answers, whereas in traditional QO, diferent execution
plans correspond to a query that is logically equivalent to the
query being optimized. Another fundamental diference is that in
classical QO, the user sees only a set of tuples as the final result.
Here, insights encountered along the plan are collected and
postprocessed for telling the data story, along the lines delineated in
the vision presented in the introduction of the paper.</p>
      <p>
        How to automatically generate a set of relevant queries over a
database? This should obviously be partly specified by the intention.
Many works exist to generate consistent queries from incomplete
specifications [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        How to automatically choose and tune ML algorithms?
Choosing a learning algorithm based on data characteristics is the topic
of a field of research called meta-learning [
        <xref ref-type="bibr" rid="ref16 ref25">16, 25</xref>
        ]. Quoting [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]:
A considerable amount of metalearning research has been devoted
to the area of algorithm recommendation. In this special case of
metalearning, the aspect of interest is the relationship between data
characteristics and algorithm performance, with the final goal of
predicting an algorithm or a set of algorithms suitable for a specific
problem under study. As a motivation, the fact that it is infeasible
to examine all possible alternatives of algorithms in a trial and
error procedure is often given along with the experts necessary if
pre-selection of algorithms is to take place. This application of
metalearning can thus be both useful for providing a recommendation
to an end-user or automatically selecting or weighting algorithms
that are most promising.
      </p>
      <p>
        The basic principle of metalearning is as follows: given a set
of datasets, extract characteristics of the datasets, estimate
performance of the algorithms on each dataset (obviously with an
indicator that allows comparing different algorithms), build a
meta-dataset describing for each dataset its characteristics and
the estimated performances, and use it to rank the algorithms. In
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], model selection is viewed as the
manipulation of triples. Each triple has three components: an FE (feature
extraction) “option” (loosely defined, a sequence of computation
operations) that fixes the feature set that represents the data, an
AS (algorithm selection) option that fixes the ML algorithm, and
a PT (parameter tuning) option that fixes the parameter choices
conditioned on the AS option.
      </p>
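      <p>
        The (FE, AS, PT) triple space described above can be sketched as a search problem; the option lists and the scoring function below are illustrative assumptions standing in for performance estimates that would come from a meta-dataset.
      </p>
```python
# Sketch of model selection as a search over (FE, AS, PT) triples: a feature
# extraction option, an algorithm selection option, and parameter-tuning
# options conditioned on the chosen algorithm.

from itertools import product

FE_OPTIONS = ["raw", "normalized"]
AS_OPTIONS = ["kmeans", "dbscan"]
PT_OPTIONS = {"kmeans": [{"k": 3}, {"k": 5}], "dbscan": [{"eps": 0.5}]}

def candidate_triples():
    for fe, algo in product(FE_OPTIONS, AS_OPTIONS):
        for pt in PT_OPTIONS[algo]:       # PT options depend on the AS option
            yield (fe, algo, pt)

def score(triple):
    fe, algo, pt = triple                 # toy stand-in for a learned ranking
    return (fe == "normalized") + (algo == "kmeans") + pt.get("k", 0) * 0.1

best = max(candidate_triples(), key=score)
```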
      <p>
        Automated machine learning (auto-ML) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is very similar to
meta-learning in its purpose. It focuses on how to choose and
parametrize an ML algorithm for a given dataset, at a given cost
(i.e., the time it takes to test different algorithms).
      </p>
    </sec>
    <sec id="sec-6">
      <title>COST MODEL</title>
      <p>
        Traditional query optimizers are usually concerned with
minimizing resource consumption, especially time or disk accesses,
and they use a cost model accordingly. In the context of IDA,
we envision the definition of an objective function to express
the gain the user would get from the exploration. As explained
above, and consistently with the literature on IDA, this benefit
function should attach a particular importance to the insights
and their significance, without forgetting the traditional concern
of resource consumption (in particular, local optimizers will be
exploited as black boxes attached to the data sources used during
IDA). Here is a non-exhaustive list of criteria:
• the number of insights (to maximize; albeit, not to infinity,
but to a concise set of top-k results),
• the time it takes to obtain them (i.e., processing the
intention) (to minimize),
• some properties of insights or sets of insights:
– their statistical significance (to maximize, i.e., minimize
the p-value; see e.g., [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]),
– their relevance for the user (e.g., with respect to some
user preferences) (to maximize),
– their interestingness [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (to maximize):
understandability (e.g., the size/length of the insight or the complexity
of a model or pattern), diversity, etc.
• the appropriateness of the insight to the current intention
(e.g., in the sense of a short-term interest) (to maximize).
      </p>
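      <p>
        A naive scalarization of these criteria into a single benefit function might look as follows; the weights and the encoding of each criterion are illustrative assumptions (Section 5 below suggests such weights could eventually be learned from user feedback).
      </p>
```python
# Toy scalarized benefit function over the criteria above: number of insights,
# time to obtain them, statistical significance, and user relevance.

def benefit(insights, weights=None):
    w = weights or {"count": 1.0, "time": 0.3, "significance": 2.0, "relevance": 1.0}
    n = len(insights)                                         # number of insights
    total_time = sum(i["time"] for i in insights)             # time to obtain them
    significance = sum(1.0 - i["p_value"] for i in insights)  # low p-value is good
    relevance = sum(i["relevance"] for i in insights)         # user-specific value
    return (w["count"] * n - w["time"] * total_time
            + w["significance"] * significance + w["relevance"] * relevance)

insights = [
    {"time": 1.0, "p_value": 0.01, "relevance": 0.9},
    {"time": 4.0, "p_value": 0.20, "relevance": 0.4},
]
score = benefit(insights)
```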
      <p>Some more criteria related to the way data and insights are
displayed in the data story may also be included.</p>
      <p>
        As in classical QO, we envision a mechanism of statistics
collection to estimate the benefit of the exploration with the objective
function. For instance, the average processing time of ML
algorithms can be recorded, to prevent a costly algorithm from being
considered in plan generation whenever this algorithm is to be
applied over a dataset having similar characteristics to those
used in the past. Since global query plans are expected to be
complex, we anticipate that global optimization will be a good
candidate for query re-optimization [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] or plan reuse [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], in
order to reduce the overhead induced by the optimization phase.
      </p>
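      <p>
        The envisioned statistics collection could be sketched as follows, assuming a coarse dataset profile (order of magnitude of rows, column count) as the similarity key; all names and the veto threshold are illustrative.
      </p>
```python
# Sketch of statistics collection: record observed runtimes of ML algorithms
# keyed by a coarse dataset profile, and use the average to veto costly
# algorithms during plan generation.

from collections import defaultdict

class RuntimeStats:
    def __init__(self):
        self._obs = defaultdict(list)

    @staticmethod
    def profile(n_rows, n_cols):
        # Coarse bucket (order of magnitude of rows, column count) so that
        # "similar" datasets share the same statistics.
        return (len(str(n_rows)), n_cols)

    def record(self, algo, n_rows, n_cols, seconds):
        self._obs[(algo, self.profile(n_rows, n_cols))].append(seconds)

    def expected_time(self, algo, n_rows, n_cols, default=float("inf")):
        times = self._obs.get((algo, self.profile(n_rows, n_cols)))
        return sum(times) / len(times) if times else default

stats = RuntimeStats()
stats.record("kmeans", 10000, 8, 2.0)
stats.record("kmeans", 20000, 8, 4.0)
too_costly = stats.expected_time("kmeans", 50000, 8) > 10.0   # veto threshold
```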
      <p>
        The objective function is used to seek a compromise between
the above-mentioned criteria, since some are antagonistic. We
note that it can be automatically learned and maintained using
supervised ML. For instance, in [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], the authors propose to use
active learning to predict a set of Pareto-optimal solutions with
some theoretical guarantees in a multi-objectives optimization
context. More simply, it is possible to use a scalarization approach
that reduces the multi-objective optimization problem to a
singleobjective problem. A naive solution could be a linear combination
of the criteria to be learned using SVM, provided user feedback
and enough instances of intention processing are available, using
an approach similar to the one presented in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>PLAN GENERATION AND PRUNING STRATEGY</title>
      <p>
        The search space for plan generation is huge, much larger than the
one explored by traditional QO. A drastic pruning strategy is needed
to reduce the number of plans generated. We here present a naive
idea to show the feasibility of plan generation and selection, that
borrows from composite item construction strategies (see e.g., [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
      <p>Given an initial data story, we form combinations of datasets
and algorithms, and order them to tell the story. The algorithm
below is a first attempt at formalizing the process of picking an
execution plan as the answer to an intentional operator.
Algorithm 1: main</p>
      <p>Data: data story D, intention I, benefit function fo, integer n
Result: set of execution plans P
1 N = generateNodes(D, I) ; // generate a set of plan nodes N
from D and I
2 C = buildCandidateClusters(n, N) ; // cluster N based on
criteria
3 C' = prune(C, fo) ; // prune C
4 for all clusters c in C' do
5   pc = queryPlan(c) ; // compute the query plan of c
6   estimateBenefit(pc, fo) ; // estimate its benefit
7   P = P ∪ {pc}
8 return P ;</p>
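      <p>
        A runnable toy version of Algorithm 1 follows, under strong simplifying assumptions: plan nodes carry numeric values, "clustering" is a round-robin split into n groups, and the benefit function fo simply sums node values. All of these stand-ins are illustrative.
      </p>
```python
# Toy instantiation of Algorithm 1: generate nodes, cluster them, prune with a
# hard constraint, then build and score one query plan per surviving cluster.

def generate_nodes(datastory, intention):
    # Stand-in for deriving atomic operations (queries, ML calls) from D and I.
    return [{"op": "op%d" % i, "value": v}
            for i, v in enumerate(datastory + intention)]

def build_candidate_clusters(n, nodes):
    clusters = [[] for _ in range(n)]
    for i, node in enumerate(nodes):       # round-robin stand-in for clustering
        clusters[i % n].append(node)
    return clusters

def prune(clusters, fo, threshold=3.0):
    return [c for c in clusters if fo(c) >= threshold]   # hard constraint

def query_plan(cluster):
    return sorted(cluster, key=lambda nd: nd["op"])      # order nodes into a plan

def fo(plan):
    return sum(nd["value"] for nd in plan)               # toy benefit function

def main(datastory, intention, fo, n):
    nodes = generate_nodes(datastory, intention)         # step 1
    clusters = build_candidate_clusters(n, nodes)        # step 2
    kept = prune(clusters, fo)                           # step 3
    plans = set()
    for c in kept:                                       # steps 4-7
        pc = query_plan(c)
        plans.add((fo(pc), tuple(nd["op"] for nd in pc)))
    return plans                                         # step 8

plans = main([1.0, 2.0], [3.0, 4.0], fo, n=2)
```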
      <p>The algorithm works as follows. In step 1, the intentional
operator triggers the generation of a large set of execution plan
nodes, i.e., atomic operations. Clustering is used in step 2 to build
n clusters of nodes, n being a parameter of the approach. Step 3
discards some clusters using hard constraints. Each remaining cluster
is processed to obtain a query plan, and all query plans are scored
using the benefit function (steps 4-6). Finally, the set of scored
query plans is returned.</p>
      <p>
        We now give some details on the different steps. (1)
generateNodes: the generated atomic operations are queries over
data sources or ML algorithms. Queries over sources are generated
using the intention and user preferences, while ML algorithms
are taken from a catalog. Given the size i of intention (in terms of
number of sources), the size p of user preferences (for instance,
selection conditions to expand the query with [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]) and the
number m of ML algorithms, the number of generated operations is
around i × 2<sup>p</sup> × 2<sup>m</sup>. (2) buildCandidateClusters: all atomic
operations are described in a uniform way. They are projected into
a predefined vector space (including dataset characteristics, ML
algorithm characteristics, etc.), so that clustering can be used to
form groups of operations that are "close" to each other. Like in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
fuzzy clustering can be used, as it allows an atomic operation
to appear, with a significant membership, in several execution plans, with extra
terms to incorporate the criteria of the objective function. In this
step, it is crucial to ensure that the clustering time will remain
under control. For instance, unnecessary comparisons could be
avoided using statistical notions like concentration inequalities [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. (3) prune: this step uses a subset of the criteria of the
benefit function as hard constraints for pruning clusters that
are considered too bad (for instance, the worst clusters in terms
of overall execution time of the operators they contain can be
automatically rejected). (4) queryPlan: this step orders the nodes
of the cluster, and adapts ML algorithms to query answers thanks
to auto-ML [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. (5) estimateBenefit: consists of applying the
benefit function to each node of the plan.
      </p>
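      <p>
        Reading the size estimate of step (1) as i × 2^p × 2^m (our assumption: every subset of the p preference conditions can expand each of the i source queries, and every subset of the m ML algorithms may be applied), even a tiny instance shows why a drastic pruning strategy is needed.
      </p>
```python
# Worked instance of the assumed size estimate i * 2^p * 2^m for the number of
# generated atomic operations in step (1) of the pruning strategy.

def n_generated_operations(i, p, m):
    return i * (2 ** p) * (2 ** m)

small = n_generated_operations(i=3, p=4, m=5)   # 3 * 16 * 32
```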
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>
        This paper introduces a vision for processing high level
statements describing Interactive Data Analysis tasks, inspired by
traditional query optimization in relational databases. To the best
of our knowledge, this is the first attempt at an insight-driven
architecture for analytical intentions. It is consistent with previous
works on IDA: Eichmann et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that discusses the challenges
for creating an insight-centered benchmark for IDA, where
different functional aspects of IDA are separated ; Zhao et al. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]
focusing on the validity of a certain type of insights encountered
along the analysis; or Milo and Somech [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], that proposes a
collaborative, but insight-agnostic, recommender system for IDA
where analyses over diverse datasets are phrased with high-level
actions of multiple types.
      </p>
      <p>We believe that our vision opens many research directions
that can be structured around the envisioned architecture and
its layers. Specifically, each aspect of the optimizer deserves to
be refined, i.e., (i) the categorization of insights, and a precise
definition for them, (ii) the objective function, its criteria and
the way it is learned, (iii) the mechanism for statistics collection
and user feedback, (iv) the vector space for representing atomic
operations in a uniform way, and (v) the different steps of the
pruning strategy.</p>
      <p>
        Given the maturity of the different fields involved in this
vision (databases, machine learning, user-centered search activities,
etc.), we are confident that a proof of concept can be implemented
once these different aspects are made precise. We note that the dataset
provided by [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and made available through Github2 is a
promising option for carrying out a set of tests with real IDA tasks.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abouzied</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          .
          <article-title>Courting ML: Witnessing the marriage of relational &amp; web data systems to machine learning</article-title>
          . http://wp.sigmod.org/?p=2243
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alsayasneh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          , É. Gaussier,
          <string-name>
            <given-names>V.</given-names>
            <surname>Leroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pilourdault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Borromeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Toyama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Renders</surname>
          </string-name>
          .
          <article-title>Personalized and diverse task composition in crowdsourcing</article-title>
          .
          <source>IEEE Trans. Knowl</source>
          . Data Eng.,
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>128</fpage>
          -
          <lpage>141</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Bie</surname>
          </string-name>
          .
          <article-title>Subjective interestingness in exploratory data mining</article-title>
          .
          <source>In IDA</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chirigati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Doraiswamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Damoulas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          .
          <article-title>Data polygamy: The many-many relationships among urban spatio-temporal data sets</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>1011</fpage>
          -
          <lpage>1025</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dove</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Narrative visualization: Sharing insights into complex data</article-title>
          .
          <source>In IHCI</source>
          ,
          <year>2012</year>
          . Available at http://openaccess.city.ac.uk/1134/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Eichmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zgraggen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          .
          <article-title>Towards a benchmark for interactive data exploration</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <fpage>50</fpage>
          -
          <lpage>61</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Feurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Eggensperger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Springenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Efficient and robust automated machine learning</article-title>
          .
          <source>In NIPS</source>
          , pages
          <fpage>2962</fpage>
          -
          <lpage>2970</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>Database systems - the complete book (2. ed.)</article-title>
          .
          <source>Pearson Education</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Geng</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          .
          <article-title>Interestingness measures for data mining: A survey</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>38</volume>
          (
          <issue>3</issue>
          ):
          <fpage>9</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gkesoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Manousis</surname>
          </string-name>
          .
          <article-title>Cinecubes: Aiding data workers gain insights from OLAP queries</article-title>
          .
          <source>Inf. Syst.</source>
          ,
          <volume>53</volume>
          :
          <fpage>60</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghunathan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>User modeling for a personal assistant</article-title>
          .
          <source>In WSDM</source>
          , pages
          <fpage>275</fpage>
          -
          <lpage>284</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hullman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Drucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Riche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          , D. Fisher, and
          <string-name>
            <given-names>E.</given-names>
            <surname>Adar</surname>
          </string-name>
          .
          <article-title>A deeper understanding of sequence in narrative visualization</article-title>
          .
          <source>TVCG</source>
          ,
          <volume>19</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2406</fpage>
          -
          <lpage>2415</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Papaemmanouil</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>Overview of data exploration techniques</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>277</fpage>
          -
          <lpage>281</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Patel</surname>
          </string-name>
          .
          <article-title>Model selection management systems: The next frontier of advanced analytics</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>44</volume>
          (
          <issue>4</issue>
          ):
          <fpage>17</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Labroche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Detyniecki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Bärecke</surname>
          </string-name>
          .
          <article-title>Accelerating one-pass clustering by cluster selection racing</article-title>
          .
          <source>In ICTAI</source>
          , pages
          <fpage>491</fpage>
          -
          <lpage>498</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lemke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Budka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gabrys</surname>
          </string-name>
          .
          <article-title>Metalearning: a survey of trends and technologies</article-title>
          .
          <source>Artif. Intell. Rev.</source>
          ,
          <volume>44</volume>
          (
          <issue>1</issue>
          ):
          <fpage>117</fpage>
          -
          <lpage>130</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Markl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Simmen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Lohman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirahesh</surname>
          </string-name>
          .
          <article-title>Robust query processing through progressive optimization</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>659</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Somech</surname>
          </string-name>
          .
          <article-title>Next-step suggestions for modern interactive data analysis platforms</article-title>
          .
          <source>In KDD</source>
          , pages
          <fpage>576</fpage>
          -
          <lpage>585</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <article-title>"why should I trust you?": Explaining the predictions of any classifier</article-title>
          .
          <source>In SIGKDD</source>
          , pages
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          .
          <article-title>Explaining differences in multidimensional aggregates</article-title>
          .
          <source>In Proceedings of VLDB</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          .
          <article-title>User-adaptive exploration of multidimensional data</article-title>
          .
          <source>In Proceedings of VLDB</source>
          , pages
          <fpage>307</fpage>
          -
          <lpage>316</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sengar</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Haritsa</surname>
          </string-name>
          .
          <article-title>PLASTIC: reducing query optimization overheads through plan recycling</article-title>
          .
          <source>In SIGMOD, page 676</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Simitsis</surname>
          </string-name>
          , G. Koutrika, and
          <string-name>
            <given-names>Y. E.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          .
          <article-title>Précis: from unstructured keywords as queries to structured databases as answers</article-title>
          .
          <source>VLDB J.</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>117</fpage>
          -
          <lpage>149</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          , G. Koutrika, and
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          .
          <article-title>A survey on representation, composition and application of preferences in database systems</article-title>
          .
          <source>TODS</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <volume>19</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          :
          <fpage>45</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          .
          <article-title>Pairwise meta-rules for better meta-learning-based algorithm ranking</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>93</volume>
          (
          <issue>1</issue>
          ):
          <fpage>141</fpage>
          -
          <lpage>161</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          .
          <article-title>The road to highlights is paved with good intentions: Envisioning a paradigm shift in OLAP modeling</article-title>
          .
          <source>In DOLAP</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          .
          <article-title>Beyond roll-up's and drill-down's: An intentional analytics model to reinvent OLAP</article-title>
          . Submitted to Inf. Syst.,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wasay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Athanassoulis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          .
          <article-title>Queriosity: Automated data exploration</article-title>
          .
          <source>In IEEE International Congress on Big Data</source>
          , pages
          <fpage>716</fpage>
          -
          <lpage>719</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Interactions with Search Systems</article-title>
          . Cambridge University Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Stefani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zgraggen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          , E. Upfal, and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kraska</surname>
          </string-name>
          .
          <article-title>Controlling false discoveries during interactive data exploration</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>527</fpage>
          -
          <lpage>540</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zuluaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Püschel</surname>
          </string-name>
          .
          <article-title>e-PAL: An active learning approach to the multi-objective optimization problem</article-title>
          .
          <source>JMLR</source>
          ,
          <volume>17</volume>
          :104:
          <fpage>1</fpage>
          -
          <lpage>104</lpage>
          :
          <fpage>32</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>