<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Finding Comparison Insights in Multidimensional Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claire Antoine</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandre Chanson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Labroche</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Marcel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIFAT, université de Tours</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIFO</institution>
          ,
          <addr-line>Université d'Orléans</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>1</lpage>
      <abstract>
        <p>What are the best comparisons that can be found in a multidimensional dataset? In this paper, we propose to support exploratory data analysis by answering this question, resorting to both input space sampling and output space sampling. We sample both the dataset and the set of aggregate queries that a user would have to ask to find significant comparisons that are frequent in the data cube of the dataset. Our tests show that our approach is efective in retrieving significant comparisons, and that comparison insights can be computed in seconds even for fairly large datasets, if the number of values compared is small.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Exploratory data analysis</kwd>
        <kwd>Comparison queries</kwd>
        <kwd>Data cube</kwd>
        <kwd>Statistical tests</kwd>
        <kwd>Sampling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Exploratory Data Analysis (EDA) is the notoriously tedious
activity of Data Science consisting of interactively analyzing
a dataset to gain insights. According to De Bie et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
EDA poses the greatest challenges for automation, since
background knowledge and human judgment are the keys
to success. Recently, many approaches were proposed to
support EDA, including approaches to automatically
generate EDA sessions, often defined as maximization problems
(see, e.g., [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]).
      </p>
      <p>
        In this work, we target a specific type of insights,
comparisons, that are very popular among data workers (see
e.g., [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]). For a given multidimensional dataset  with
a categorical attribute  and a numerical attribute  , a
comparison insight is noted  &lt; , where  and  are values
in the active domain of , and is such that there are enough
evidences that the value of  for  =  is less than the
value of  for  = .
      </p>
      <p>Example 1.1. Consider the excerpt of the flight dataset
displayed in Table 1, loaded in a relational database. Suppose the
analyst is interested in comparing airlines (the categorical
attribute ) to each other regarding the average departure delay
(the numerical attribute  ). To do so, the analyst could for
instance run a SQL query selecting two airlines, say WN and
OO (the values  and ), grouping by airline and some other
attributes (e.g., departure airport), and aggregating measure
departure delay. This implies using a full table scan, which
may be prohibitively costly for large tables, and results in
a very summarized view of the data, which may hide some
"local" efects at more granular level of details. To obtain
interesting comparisons, the analyst would have to run enough
aggregate queries to check that the same insights for WN and
OO are found at various levels of detail.
departure
hour
945
1951
2233
547
2143
1143
927
1346
1654
155
airline
WN
OO
AA
WN
OO
OO
AA
OO
OO
AA
departure
delay
-5
-4
93
-3
63
206
-3
-9
54
85</p>
      <p>Our goal is to eficiently extract such insights in a given
multidimensional dataset. To do so, we resort to sampling
the dataset and use existing optimization structures.
Resorting to sampling the dataset allows to obtain a candidate
comparison insight quickly, using statistical tests to check
its significance. Besides, the candidate insights should be
validated against the possible group-bys that can be
computed over the multidimensional dataset. To validate the
insight, we evaluate aggregate queries against the real data,
using a sample of queries over existing materialized views.</p>
      <p>Our approach to extract comparison insights from a
dataset contributes with:
• a robust way of obtaining candidate comparison
insights by devising a statistics to choose between a
parametric and a non parametric test of significance,
based on the distribution of values in the sample of
the dataset,
• an algorithm for generating queries over existing
materialized views to validate candidate comparison
insights,
• a series of experiments on artificial and real datasets
to asses the efectiveness and eficiency of the
approach.</p>
      <p>The outline of the paper is as follows. Next section
presents the formal background. Section 3 introduces our
approach to extract comparisons. Section 4 details how
candidate comparisons are obtained while Section 5 presents
the validation of candidates. Experiments are reported in
Section 6. Section 7 discusses related work and Section 8
concludes by outlining perspectives.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Formal background</title>
      <p>
        We give the definition of the comparison queries we consider.
This definition extends that of [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ] by considering more
than one attribute in the group-by set. We consider an
instance of relation  of schema [1, . . . , , 1, . . . ,
]. The ’s are categorical attributes and the  ’s are
numerical attributes, called measures in what follows. The
active domain of attribute  is noted (). As usual for
multidimensional data, we require  to be in Boyce-Codd
Normal Form.
      </p>
      <p>Definition 2.1 (Comparison queries). Comparison
queries are extended relational queries of the form:
 ,()→( =()) ◁▷  ,()→′ ( =())
where  is a categorical attribute in {1, . . . , },  is
a group-by set in 2{1,...,},  is a measure attribute
in {1, . . . , },  is an aggregate function, and
,  ∈ ().  denotes the grouping/aggregation
operator.</p>
      <p>
        For a group-by set  of relation , we call a cuboid of
 for function  and measure  the result of query:
 ;()(). The cuboids of  form the data cube
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] of . The number of cuboids for  and ( ) is
2|{1,...,}|, and the number of cuboids for  and ( )
whose group-by includes attribute  is 2|{1,...,}|− 1.
      </p>
      <p>For two values ,  of an attribute  for measure  , a
comparison insight  &lt;  indicates that, on average, values
of  for  are statistically lower than values of  for .
The result of a comparison query can present evidences (if
the value of  for  is indeed lower than the value of 
for ), or violations (otherwise) of the insight. An actual
insight must have enough evidences that support it. If this
is not known, we call it a candidate comparison insight.
Definition 2.2 (Candidate comparison insight). A
candidate comparison insight  &lt;  where ,  are in
() for attribute , measure  of , aggregation
function , and instance  of a group-by set  over 
including  is a pair of tuples (, , ), (, , ) of the
result of a comparison query over  such that  &lt; .</p>
      <p>A candidate comparison insight can be violated
depending on the group-by set instance, where a violation of
candidate  &lt;  is when  ≥ .</p>
      <p>departure airport
DCA
LAX
SEA</p>
      <p>AA
11.92
10.15
6.22
airline
OO
-1.67
3.18
31.0</p>
      <p>WN
8.13
19.09
4.5
ison query:  ,()→( =( ℎ)) ◁▷
 ,()→( =( ℎ)) where  denotes
all the attributes of table flight (see the previous example).
Definition 2.4 (Validated comparison insight). Let
 &lt;  be a candidate insight, and  be a cuboid over 
including attribute . We say that  supports  &lt;  if the
ratio of violations of  &lt;  is below a user defined threshold
 . Finally, we say that the insight is validated if the ratio
of cuboids supporting the insight is above a user defined
threshold  .</p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach overview</title>
      <p>The computation of comparison insights requires to
explore the data cube of  which may be too computationally
costly if  is large. We developed an approach based on
sampling to tackle those cases of discovering comparison
insights from large relations. The principle is to compute
candidate comparison insights on a sample of  and validate
them over a sample of the cuboids of .</p>
      <p>Our approach is illustrated in Figure 1. We first
sample , so that this sample fits in main memory. Let 
be the sample of . Over , we compute a candidate
comparison insight  &lt; , for the comparison query
 ,()→( =()) ◁▷  ,()→′ ( =())
where  includes all categorical attributes of . To check
if the comparison is significant, we run a statistical test
assessing whether the mean for  is significantly greater than
that for  over the sample.</p>
      <p>To validate the candidate comparisons found with the
statistical tests over , we sample the set  of cuboids over 
as follows. From this set , we assume that some cuboids are
materialized on disk under the form of materialized views
( , dark green cuboids in Figure 1). From this set   of
materialized views (excluding R), we consider the set 
of all comparison queries that can be answered using  
(purple bordered in Figure 1). Generally, || &gt; | |.
We sample this set . This sample is called . Over ,
we approximate the ratio of queries of  satisfying the
user threshold  . We postulate that this approximation is
close enough to the ratio of queries of  satisfying the user
threshold.</p>
      <p>We now detail these two steps.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Statistical test to obtain candidate comparisons</title>
      <p>
        As we do not have all of  but a sample , the pairwise
comparison of ,  is made using a statistical test over . The
null hypothesis of the test should be that the mean for  is
greater than that for . The choice of accepting or rejecting
the hypothesis is made based on the p-value of the chosen
test. As we will be making many comparisons, the p-values
should be corrected, using e.g., the Benjamini-Hochberg
FDR correction [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        We could use a non-parametric test, as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] since such
tests make few or no assumptions about the data
distribution. However, the statistical power of non parametric
tests is generally lower and some tests, e.g., permutation
tests, require more computation time, as they estimate the
distribution by performing numerous re-samplings.
      </p>
      <p>
        We choose to use the parametric Welch’s one-sided t-test
that has the advantage of not making assumptions about the
variance or sample size, which is well-suited to a context
where there is no prior knowledge of the data. However
Welch’s test still assumes normality of the data and is mainly
impacted by skewness and sample size. Indeed, type I error
tends to increase with the rise in skewness of the
distributions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>The question becomes: how to build a decision model
based on a dedicated statistic to verify whether the
samples used in the comparison are suficiently normal and
decide whether to use Welch’s test or a non parametric
test. We therefore present our methodology relying on the
parametric Welch-t test to devise potentially
discriminating candidate statistics (Section 4.1) and then train a simple
decision model based on these candidates (Section 4.2).</p>
      <sec id="sec-4-1">
        <title>4.1. Devising candidate statistics</title>
        <p>To devise a statistics for deciding which test to use,
we considered 8 candidate statistics, devised based
on the factors cited in the literature as
influencing Welch’s test robustness, namely sample size
() and skewness () for each sample,  and :
( × ), ( + ), ( × ), ( +
), | − | , ⃒⃒⃒  −  ⃒⃒⃒ , ( 2 +
 ), (  +  ).</p>
        <p>2</p>
        <p>
          We generated artificial datasets for  and  with various
distributions (standard normal, normal with  = 20,
uniform, chi-squared, and exponential) and diferent sample
sizes  ∈ {10, 50, 100, 200, 500, 1000}. The goal was to
obtain a varied panel of pairs (distribution for , ) ×
(distribution for , ): distributions with high asymmetry and
completely symmetric ones, equal or unequal sample sizes,
with large and small size diferences. Each sample of the
same pair was generated with the same expectation  , to be
under the hypothesis 0 :   =  (=  ) of Welch’s test.
If the test rejects 0, we know it makes a type I error. For
each pair considered, we draw 10,000 replicas of diferent
sample pairs, perform Welch’s test at the nominal threshold
 = 5%, and calculate the rejection rate among the replicas,
giving us an estimator of the type I error rate for this given
combination. The number of replicas was chosen to be large
enough to ensure that the rejection rate is a good estimator
of the type I error rate [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. We also considered several
expectations  ∈ {2, 5, 10, 100} in case the parameter also
impacts the type I error rate.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Decision model</title>
        <p>We want a model that uses only one statistic to classify
pairs, to have a quick and simple decision rule: a decision
threshold on the statistic in question. We trained two models
(decision tree, logistic regression) to classify pairs whose
type I error is out of bounds (1) and those for which it is
controlled (0) based on each of the candidate statistics. The
training sample transmitted to our models was balanced by
undersampling, to avoid favoring the majority class.</p>
        <p>For each model type, we selected the most interesting
statistic (the one creating the greatest heterogeneity for the
decision tree, and the one selected by backward selection for
logistic regression). We took the highest probability
threshold for logistic regression that ensures 0% false negatives on
the training sample. In our case, false negatives (not
detecting a pair where type I error is out of bounds) are considered
more serious errors than false positives (thinking that a pair
with controlled type I error is out of bounds) because in the
ifrst case, we present a false test result to the user, while in
the second case, we postpone the decision to more tests but
do not give false conclusions. We compared the two models
based on the 2-score on the rest of the data (excluding the
training sample) which provides a more severe estimation
of recall, i.e. it maximizes the portion of out-of-bounds pairs
detected, which avoids false conclusions.</p>
        <p>Both models agree on the following: the best statistics
is ⃒⃒⃒  −  ⃒⃒⃒ and a threshold of 0.049 achieves the
best 2-score of 0.99 (with no false negatives) to separate
valid/invalid pairs on the training sample. Below this
threshold, pairs are valid for using Welch’s test. On the test sample,
the 2-score is 0.73 (with a recall of 1). We therefore chose
to use this statistics and threshold for deciding which test
to run.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Validating candidate comparisons</title>
      <p>The validation of a candidate comparison insight is
described in Algorithm 1. Recall from Section 3 that the
algorithm validates the candidate insight  &lt;  on a sample
of queries . Given the sparsity of the data, the sample 
may return no data for  or . If so the candidate is not
considered, and the next candidate comparison is checked
(line 4). The candidate comes with a p-values indicating if
the diference  &lt;  is significant in . If it is not, the next
candidate comparison is checked (line 4). If the candidate is
considered and significant, the queries to check violations
over the sample  are generated (line 5), and ran over the
materialized cuboids (lines 8). Again, the query result may
return no data for  or . In that case, the cuboid is not
counted in the denominator of the prediction (lines 10). If
the prediction is such that it is over the threshold (line 15)
or such that this threshold can not be reached given the
remaining cuboids to check (line 18), then no more
validation queries are run and another candidate is checked.
Otherwise the algorithm runs until no more queries are left.</p>
      <p>Phrased in SQL, the queries generated to check violations
of  &lt;  are of the following form:
WITH
Q1 AS (SELECT G</p>
      <p>FROM (SELECT G,A FROM view_G</p>
      <p>WHERE A in (a,b) group-by G,A )
group-by G HAVING count(*) &gt;1),
Q2 AS (SELECT G,A, agg(M) rank () over
(partition by G ORDER BY agg(M) desc)
FROM view_G
WHERE A in (a,b)</p>
      <p>where view_G is the materialized view over which the
query is evaluated, Q2 is used to order  and  in the result of
the comparison query over view_G and Q1 removes
groupby instances where  or  do not appear.</p>
      <p>Note that Algorithm 1 generates one query per pair (, )
per element of the sample , allowing to leverage existing
indexes over attribute . Another strategy consists of
leaving the DBMS to actually compute the ratio of violations of
all pairs in a cuboid, by generating one query per element
of the sample . Phrased in SQL, these queries are of the
following form:
WITH
Q1 AS (SELECT G,c1.A A1,c2.A A2,</p>
      <p>sign(c1.M-c2.M) sign
FROM
(SELECT G,A, agg(M) M
FROM view_G group-by G,A) c1,
(SELECT G,A, agg(M) M</p>
      <p>FROM view_G group-by G,A) c2</p>
      <p>WHERE c1.A&lt;c2.A and c1.G=c2.G
),
Q2 AS (SELECT A1, A2, count(*) cnt,
count(*) filter (WHERE sign=1) pos,
count(*) filter (WHERE sign=-1) neg
FROM (Q1)
group-by A1,A2
)
SELECT A1, A2, (pos-neg)/cnt score
FROM (Q2);</p>
      <p>We postulate that such queries may not be able to exploit
any index and that the self-join between aggregates over
the materialized view is likely to be very costly. Therefore,
only small datasets may benefit from this strategy.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>This section reports the tests we did to assess the
efectiveness and eficiency of our approach.</p>
      <sec id="sec-6-1">
        <title>6.1. Settings</title>
        <p>We developed a prototype in Python 3.10 over the
Postgres 16 RDBMS1. We use the SAMPLE method of Postgres
to sample the initial dataset, with the ’SYSTEM_ROWS’
parameter2. As Postgres does not support query rewriting
using materialized view, we implemented a simple rewriting
method based only on the list of attributes in the group-by
set of the query. Tests are run on a Macbook Pro with Apple
M2 Pro chip and 32GB of RAM. Postgres was run with the
default configuration, i.e., with 128MB of shared memory
bufers and 4MB of working memory.</p>
        <p>Dataset</p>
        <p>#tuples
F100K
F600K
SSB
Health</p>
        <p>
          Datasets used are in Table 3. F100K and F600K are
excerpts from the flight delays dataset 3. SSB is the artificial
dataset of the SSB benchmark [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Health is the Rate table
from the health insurance dataset4.
        </p>
        <p>We ran two sets of tests: the first tests compare the
result of our approach to ground truth on the small F100K
dataset, for the airline attribute. The second set of tests on
all datasets reports the time to compute 90 pairs for one
attribute: the airline attribute having 14 values (91 pairs
in total) for F100k and F600K, the statecode attribute for
Health and the lo_suppkey attribute for SSB.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Efectiveness</title>
        <p>Since our approach uses sampling, we ran efectiveness
tests to compare its outcome (the comparison insights) to
the ground truth, i.e., when comparisons are computed on
the non-sampled relation  and all its cuboids. Unless
otherwise stated, we arbitrarily fixed  = 0.4 for the ratio of
violations in a cuboid and  = 0.4 for the ratio of cuboids
showing less than 0.4 violations. The percentage of the
lattice materialized is fixed to 40%. Note that the materialized
cuboids are randomly chosen. Since random sampling is
1https://github.com/patrickmarcel/rankInsightGeneration
2https://www.postgresql.org/docs/current/tsm-system-rows.html
3https://www.kaggle.com/datasets/usdot/flight-delays
4https://www.kaggle.com/datasets/hhs/health-insurance-marketplace
used, we run each efectiveness tests 5 times and the results
are averaged on the 5 runs.
dataset.</p>
        <p>We start by showing the average prediction, i.e., the
estimation of the ratio of cuboids over the user
threshold  = 0.4 in Figure 2. Both the size of , the sample
of the dataset (each curve), and the size of , the
sample of the set of aggregate queries over the materialized
views ( axis), vary, by changing the sampling ratio in
{0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1}. It can be seen that,
except for the very small size of  (0.1% of ), as the query
sample grows, all sizes of  converge toward the exact ratio
of cuboids.</p>
        <p>Figure 3 shows the average error (prediction over the
sample  compared to the prediction over all ) made
by estimating the ratio of cuboids over the user threshold
 , varying the size of the sample of the dataset (each curve)
and the size of  the sample of the set of aggregate queries
over the materialized views ( axis) varying the sizes as in
the previous test. It can be seen that, except for a very small
size, the size of  has little influence on the error, compared
to the size of the query sample. This result is good since it
shows that the error quickly decreases as the query sample
size grows, with an average error below 0.1 with a sample
size of 40% and above, even when the size of  is 1% of .
If an error of 0.15 is considered acceptable, then a query
sample size of 25% can be used even with 0.1% of the initial</p>
        <p>More interestingly, Figure 4 also shows the average error,
but this time as the prediction over the sample  compared
to the prediction over all the lattice . It can be seen that
the error made on  is very close to the error made on ,
meaning that our approach is a good predictor of the actual
number of cuboids of  featuring the comparison insight.</p>
        <p>Figure 5 reports the F-measure achieved by our approach
compared to the ground truth, i.e., all significant
comparisons found (irrespective of the prediction score) by our
approach against all the significant comparisons found on
. It can be seen that a sample of  of only 25% is suficient
to achieve a F-measure above 75.</p>
        <p>Figure 6 details this result by measuring the Recall @10,
i.e., how many of the comparisons having the top-10
prediction scores are found by our approach compared to that
found in . It can be seen that users can expect to find
between 30 and 40% of the best comparisons among the first
10 retrieved by our approach, with a sample of half of  and
sampling 50% of the queries over the materialized cuboids.</p>
        <p>Our last efectiveness test investigates the sensitivity to
user parameters  and  . Based on the previous tests, we
ifxed the initial sample size at 25% of  and the size of
query sample at 40% of aggregate queries, varying  and
 . Figure 7 shows that  does not impact the error, while 
slightly impacts it, since the more violations are tolerated,</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Eficiency</title>
        <p>We first check the gain of using the parametric Welch’s
tests instead of non parametric permutation tests. Figure
8 reports the diference in time (in seconds) when using
only Welch’s (respectively only permutation) tests on the 90
comparisons ( axis) on F100K, showing that the parametric
test is faster by almost an order of magnitude. Figure 9
shows the number of Welch’s tests by sample size for the
90 comparisons on F100K. We observe that, expectedly, the
larger the sample size, the closer the skewness and size of
samples for  and  are, and therefore the more Welch’s
tests are used. We can conclude that choosing a sample size
for  favoring the use of the parametric test allows to save
time. Interestingly, for F100K, a sample of only 10% allows
to use Welch’s tests around 90% of the time.</p>
        <p>We next measure the time (in seconds, averaged on 5 runs)
it takes to check 90 comparisons (all pairs of the airline
attributes in F100K or F600K, 90 pairs of SSB’s lo_suppkey, 90
pairs of Health’s statecode), using Algorithm 1 in diferent
settings: without index ("False"), with a mono-attribute hash
index on the airline (resp., lo_suppkey) attribute ("True"),
with a multi-attribute index on all attributes ("mc" for
multi-column), with a clustered index on the airline (resp.,
lo_suppkey) attribute ("cl") and with a multi-attribute index
on all attributes when the view is clustered on the airline
(resp., lo_suppkey) attribute ("mc-cl"). Note that for SSB
and Health, only multi-column indexes were tested. In each
case, 40% of the lattice is materialized.</p>
        <p>The results are reported in Figures 10, 11 and 12. The time
taken is linear in the number of comparisons, and increases
with the dataset size and the number of cuboids. This is
because sampling  and hypothesis generation are very fast,
the cost comes from the evaluation of validation queries and
the size of cuboids over which they are evaluated. It takes
less than 15 seconds to compute 90 comparison insights over
F100K (the smallest dataset), around 20 seconds over SSB
(much larger but with only 16 cuboids in its lattice), around
75 seconds on Health (the largest, densest dataset, with
number of cuboids between SSB and F600K) and around 3
minutes for F600K (smaller than SSB and Health but with 4
times more cuboids). Expectedly, it can be seen that indexes
are useful, with a slight advantage for clustered indexes.</p>
        <p>Our last tests aim to assess the alternative strategy that
sends one query per element of . We measure the time
(in seconds) it takes to check all 90 pairs, changing the size
of . We tested diferent index configurations: without
index ("False"), with a multi-attribute index on all attributes
when the view is clustered on the airline (resp., lo_suppkey)
attribute ("mc-cl") and with a multi-attribute index on all
attributes but the airline (resp., lo_suppkey) attribute ("group").
Results are reported in Figures 13, 14 and 15. As expected,
this strategy is beneficial for smaller datasets, with a
substantial speedup for both F100K and F600K. For instance,
with a sample size of 50%, the time to compute 90
comparisons is less than 4 seconds on F100K while it was more than
10 seconds with a sample size of 40% using the first strategy
(Algorithm 1). Indexes and clustering are not useful. On
F100K, we also tested diferent values of parameters  and
 , with a positive impact for  (since the more cuboids we
want, the faster it is to prune cuboids where  is not
satisifed) and no impact for  , on the computation time. Also as
expected, this strategy is very bad for SSB (Figure 15) since
the size of its cuboids prevents an eficient self-join.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Lessons learned</title>
        <p>The tests run allow to draw a few suggestions for setting up
the parameters of the approach, to achieve good
efectiveness and eficiency. Eficiency-wise, it is suggested to opt
for Algorithm 1 when the dataset is small, and opt for the
second strategy for larger datasets (over 1 GB on a laptop).
As to efectiveness, if the goal is to predict the number of
cuboids with few violations,  = 10% of ,  &lt; 20% and
 = 40% of all aggregate queries should enable to obtain
good predictions. If the goal is to check the presence of a
given comparison insight,  should be above 25% of .</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Related work</title>
      <p>
        Exploratory Data Analysis, insights and Automatic
generation of data explorations Exploratory Data
Analysis (EDA), the notoriously tedious task of interactively
analyzing datasets to gain insights, has attracted a lot of
attention both recently [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ] and since the early ages of
discovery-driven exploration of multidimensional data [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
Sunita Sarawagi’s pioneering work [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 19, 20</xref>
        ] proposed
techniques for interactively browsing interesting cells in a
data cube. Our approach can actually be seen as
complementary to Sunita Sarawagi’s DIFF [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], aiming at explaining
diferences in a multidimensional dataset. In the case of
DIFF, the user is supposed to point the comparison to be
explained while our approach suggests such comparisons.
      </p>
      <p>
        EDA has developed beyond analyzing multidimensional
datasets with classical OLAP operations like drill-down.
EDA operations include data retrieval, data representation,
and data mining tasks [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        One key aspect of supporting EDA is quantifying the
importance of insights [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ]. It is commonly admitted that
interestingness in EDA is manifold [
        <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
        ]. For instance,
statistical significance and coverage (importance of the data
scope against the entire dataset) are used in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Approaches for automatically generating EDA sessions
can be divided into two categories: generate and select, or
guided EDA. Generate and select methods (see e.g., [
        <xref ref-type="bibr" rid="ref2 ref25">25, 2</xref>
        ])
generate many, if not all, possible insights and then select
the best ones. Guided EDA methods (see e.g., [
        <xref ref-type="bibr" rid="ref26 ref3">3, 26</xref>
        ])
generate the session as the algorithm explores the search space,
mimicking a human analyst.
      </p>
      <p>
        Comparisons Several studies highlighted the importance
of comparisons when analyzing data. Blount et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
examined 67 stories produced using EDA, including
awardwinning data stories, from both professional journalists and
data science-aware students, and found comparisons to be
the most popular pattern among novices and professionals
alike. Zgraggen et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] define comparison insights as
observations, hypotheses, and generalizations directly
extracted from data that do not require prior knowledge or
domain expertise. The authors designed an experiment where
participants explored a synthetic dataset and instructed
them to report their reliable insights. 60% of user-reported
insights were spurious, which underlines the need for
systems to be able to automatically characterize comparison
insights.
      </p>
      <p>
        Francia et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and, independently, Siddiqui et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
respectively defined the Assess and Compare operators to
give a clear semantics and logical foundations of
comparisons (and labeling the result of the comparison in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]) of
two series of data.
      </p>
      <p>
        Since comparisons happen frequently in practice, with
a high risk of comparison-based insights being spurious,
there is a need to automate the production of non-spurious
comparison insights. In that sense our present work can be
seen as a follow-up to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where we develop a robust input
space sampling method (Section 4) and consider output
space sampling to validate candidate insights (Section 5).
      </p>
      <p>
        Note that none of the works mentioned above, in
particular [
        <xref ref-type="bibr" rid="ref2 ref25 ref6 ref7 ref8">25, 2, 7, 6, 8</xref>
        ], can be used as baselines, since they do
not propose a method to automatically extract comparison
insights like the one proposed here.
      </p>
      <p>
        Sampling, approximate query answering
Approximate query answering techniques [
        <xref ref-type="bibr" rid="ref28 ref29 ref30">28, 29, 30</xref>
        ] are aimed at
answering typical aggregation queries much faster than the
exact algorithms implemented by DBMS and do so with a
bounded error. AQP methods may also rely on common
probabilistic bounds such as Hoefding and Chebyshev [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
to qualify this error. To answer queries faster while
remaining accurate AQP methods need to build samples that are
accurate stand-in for the complete database. This is
especially necessary for group-by query where uniform samples
might entirely miss small groups [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. To avoid this issue
AQP method have been designed with many specific biased
sampling techniques. One such method is Congressional
Sampling [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] that simply ensures small groups are
represented by at least  element in the final sample. BlinkDB’s
[
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] set of multi-dimensional, multi-resolution samples are
used with a selection strategy which determines
automatically the size of the sample based on accuracy or response
time constraints. However, AQP advanced sampling
strategies while useful for their specific use case of providing
accurate results over many diferent queries, may be an
unnecessary cost for our approach. Indeed if samples are
too small the statistical testing will simply fail to reject H0
which would already be more likely for smaller groups even
in the complete database.
      </p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>We propose an approach to find comparison insights in a
multidimensional dataset, by sampling both the dataset and
the set of aggregate queries that can be computed over it.
Our approach makes use of both input space sampling, since
we sample the initial dataset for finding candidate
comparison insights, and output space sampling, as we also sample
the set of aggregate queries to validate the candidates.</p>
      <p>
        While our results are promising, the approach still needs
to be improved both in terms of efectiveness and eficiency.
We outline two promising directions. One could use
frequent pattern mining to extract frequent comparisons.
However, this would require to first compute the datacube of the
dataset and then run a frequent pattern algorithm. Sampling
patterns could be used [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], and it is part of our future work
to compare our approach with frequent pattern sampling
approaches. Our future work will also investigate the use of
BlinkDB [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], that aims at answering SQL-based
aggregation queries over stored data, by building and maintaining
a set of multi-dimensional, multi-resolution samples from
original data, over time. Its dynamic sample selection
strategy determines automatically the size of the sample based
on accuracy or response time constraints.
      </p>
      <p>On the longer term, we plan to adapt the approach to
other types of insights and other data models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Bie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Raedt</surname>
          </string-name>
          , et al.,
          <source>Automating data science, Commun. ACM</source>
          <volume>65</volume>
          (
          <year>2022</year>
          )
          <fpage>76</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          , S. Han,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Yiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Extracting top-k insights from multi-dimensional data</article-title>
          ,
          <source>in: Proceedings of SIGMOD</source>
          , Chicago, IL, USA,
          <year>2017</year>
          , pp.
          <fpage>1509</fpage>
          -
          <lpage>1524</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O. B.</given-names>
            <surname>El</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Somech</surname>
          </string-name>
          ,
          <article-title>Automatically generating data exploration sessions using deep reinforcement learning</article-title>
          ,
          <source>in: Proceedings of SIGMOD</source>
          , Portland,
          <string-name>
            <surname>OR</surname>
          </string-name>
          , USA,
          <year>2020</year>
          , pp.
          <fpage>1527</fpage>
          -
          <lpage>1537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Labroche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <article-title>T'kindt, Automatic generation of comparison notebooks for interactive data exploration</article-title>
          , in: J.
          <string-name>
            <surname>Stoyanovich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Teubner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Guagliardo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nikolic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Pieris</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mühlig</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Özcan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Schelter</surname>
            ,
            <given-names>H. V.</given-names>
          </string-name>
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          Zhang (Eds.),
          <source>Proceedings of the 25th International Conference on Extending Database Technology, EDBT</source>
          <year>2022</year>
          ,
          <article-title>Edinburgh</article-title>
          , UK, March 29 - April 1,
          <year>2022</year>
          , OpenProceedings.org,
          <year>2022</year>
          , pp.
          <volume>2</volume>
          :
          <fpage>274</fpage>
          -
          <lpage>2</lpage>
          :
          <fpage>284</fpage>
          . URL: https://doi.org/10. 48786/edbt.
          <year>2022</year>
          .
          <volume>15</volume>
          . doi:
          <volume>10</volume>
          .48786/EDBT.
          <year>2022</year>
          .
          <volume>15</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Youngmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          , et al.,
          <article-title>Guided exploration of data summaries</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>1798</fpage>
          -
          <lpage>1807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zgraggen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Zeleznik</surname>
          </string-name>
          , T. Kraska,
          <article-title>Investigating the efect of the multiple comparisons problem in visual analysis</article-title>
          , in: R. L.
          <string-name>
            <surname>Mandryk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hancock</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Perry</surname>
            ,
            <given-names>A. L.</given-names>
          </string-name>
          <string-name>
            <surname>Cox</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI</source>
          <year>2018</year>
          , Montreal, QC, Canada,
          <source>April 21-26</source>
          ,
          <year>2018</year>
          , ACM,
          <year>2018</year>
          , p.
          <fpage>479</fpage>
          . URL: https://doi.org/10.1145/ 3173574.3174053. doi:
          <volume>10</volume>
          .1145/3173574.3174053.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Blount</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Koesten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          <article-title>Simperl, Understanding the use of narrative patterns by novice data storytellers</article-title>
          ,
          <source>in: Proceedings of CHIRA</source>
          , Budapest, Hungary,
          <year>2020</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Narasayya</surname>
          </string-name>
          ,
          <article-title>COMPARE: accelerating groupwise comparison in relational databases for data analytics</article-title>
          ,
          <source>Proceedings of VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>2419</fpage>
          -
          <lpage>2431</lpage>
          . URL: http://www. vldb.org/pvldb/vol14/p2419-siddiqui.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Labroche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <article-title>T'Kindt, Comparison queries generation using mathematical programming for exploratory data analysis</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Layman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirahesh</surname>
          </string-name>
          ,
          <article-title>Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total</article-title>
          , in: S. Y. W. Su (Ed.),
          <source>Proceedings of the Twelfth International Conference on Data Engineering, February 26 - March 1</source>
          ,
          <year>1996</year>
          , New Orleans, Louisiana, USA, IEEE Computer Society,
          <year>1996</year>
          , pp.
          <fpage>152</fpage>
          -
          <lpage>159</lpage>
          . URL: https://doi.org/10.1109/ICDE.
          <year>1996</year>
          .
          <volume>492099</volume>
          . doi:
          <volume>10</volume>
          .1109/ICDE.
          <year>1996</year>
          .
          <volume>492099</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benjamini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hochberg</surname>
          </string-name>
          ,
          <article-title>Controlling the false discovery rate - a practical and powerful approach to multiple testing</article-title>
          ,
          <source>J. Royal Statist. Soc.</source>
          ,
          <string-name>
            <surname>Series</surname>
            <given-names>B</given-names>
          </string-name>
          57 (
          <year>1995</year>
          )
          <fpage>289</fpage>
          -
          <lpage>300</lpage>
          . doi:
          <volume>10</volume>
          .2307/2346101.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Algina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oshima</surname>
          </string-name>
          , W.-Y. Lin,
          <article-title>Type i error rates for welch's test and james's second-order test under nonnormality and inequality of variance when there are two groups</article-title>
          ,
          <source>Journal of Educational Statistics</source>
          <volume>19</volume>
          (
          <year>1994</year>
          )
          <fpage>275</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Bishara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Hittner</surname>
          </string-name>
          ,
          <article-title>Testing the significance of a correlation with nonnormal data: comparison of pearson, spearman, transformation, and resampling approaches</article-title>
          .,
          <source>Psychological methods 17</source>
          (
          <year>2012</year>
          )
          <fpage>399</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>P. E. O'Neil</surname>
            ,
            <given-names>E. J. O</given-names>
          </string-name>
          <string-name>
            <surname>'Neil</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>S. Revilak,</given-names>
          </string-name>
          <article-title>The star schema benchmark and augmented fact table indexing</article-title>
          , in: R. O.
          <string-name>
            <surname>Nambiar</surname>
          </string-name>
          , M. Poess (Eds.),
          <source>Performance Evaluation and Benchmarking</source>
          ,
          <source>First TPC Technology Conference, TPCTC</source>
          <year>2009</year>
          , Lyon, France,
          <source>August 24-28</source>
          ,
          <year>2009</year>
          , Revised Selected Papers, volume
          <volume>5895</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2009</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>252</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>642</fpage>
          -10424-4_
          <fpage>17</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>642</fpage>
          -10424-4\_
          <fpage>17</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Idreos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Papaemmanouil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <article-title>Overview of data exploration techniques</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>281</lpage>
          . URL: https://doi.org/10.1145/2723372. 2731084. doi:
          <volume>10</volume>
          .1145/2723372.2731084.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Somech</surname>
          </string-name>
          ,
          <article-title>Automating exploratory data analysis via machine learning: An overview</article-title>
          , in: SIGMOD,
          <year>2020</year>
          , p.
          <fpage>2617</fpage>
          -
          <lpage>2622</lpage>
          . URL: https://doi.org/10.1145/ 3318464.3383126. doi:
          <volume>10</volume>
          .1145/3318464.3383126.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Megiddo</surname>
          </string-name>
          ,
          <article-title>Discoverydriven exploration of OLAP data cubes</article-title>
          ,
          <source>in: EDBT</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>168</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <article-title>Explaining diferences in multidimensional aggregates</article-title>
          ,
          <source>in: VLDB</source>
          ,
          <year>1999</year>
          , pp.
          <fpage>42</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <article-title>User-adaptive exploration of multidimensional data</article-title>
          ,
          <source>in: VLDB</source>
          ,
          <year>2000</year>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sathe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <article-title>Intelligent rollups in multidimensional OLAP data</article-title>
          ,
          <source>in: VLDB</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>531</fpage>
          -
          <lpage>540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <article-title>idif: Informative summarization of diferences in multidimensional aggregates, Data Min</article-title>
          .
          <source>Knowl. Discov</source>
          .
          <volume>5</volume>
          (
          <year>2001</year>
          )
          <fpage>255</fpage>
          -
          <lpage>276</lpage>
          . URL: https: //doi.org/10.1023/A:1011494927464. doi:
          <volume>10</volume>
          .1023/A:
          <fpage>1011494927464</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Somech</surname>
          </string-name>
          ,
          <article-title>Next-step suggestions for modern interactive data analysis platforms</article-title>
          , in: SIGKDD, ACM,
          <year>2018</year>
          , pp.
          <fpage>576</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <article-title>Interestingness measures for data mining: A survey</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>38</volume>
          (
          <year>2006</year>
          )
          <article-title>9</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peralta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <article-title>A framework for learning cell interestingness from cube explorations</article-title>
          ,
          <source>in: ADBIS</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>425</fpage>
          -
          <lpage>440</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ding</surname>
          </string-name>
          , S. Han,
          <string-name>
            <surname>Y</surname>
          </string-name>
          . Xu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Zhang,</surname>
          </string-name>
          <article-title>QuickInsights: Quick and automatic discovery of insights from multi-dimensional data</article-title>
          ,
          <source>in: Proceedings of SIGMOD</source>
          , Amsterdam, The Netherlands,
          <year>2019</year>
          , pp.
          <fpage>317</fpage>
          -
          <lpage>332</lpage>
          . URL: https://doi.org/10.1145/3299869.3314037. doi:
          <volume>10</volume>
          . 1145/3299869.3314037.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Personnaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Youngmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <article-title>Eda4sum: Guided exploration of data summaries</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>15</volume>
          (
          <year>2022</year>
          )
          <fpage>3590</fpage>
          -
          <lpage>3593</lpage>
          . URL: https://doi.org/ 10.14778/3554821.3554851. doi:
          <volume>10</volume>
          .14778/3554821. 3554851.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Francia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Golfarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <article-title>Assess queries for interactive analysis of data cubes</article-title>
          ,
          <source>in: EDBT</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Gibbons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Poosala</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Ramaswamy,</surname>
          </string-name>
          <article-title>The aqua approximate query answering system</article-title>
          , in: A.
          <string-name>
            <surname>Delis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Faloutsos</surname>
          </string-name>
          , S. Ghandeharizadeh (Eds.),
          <source>SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3</source>
          ,
          <year>1999</year>
          , Philadephia, Pennsylvania, USA, ACM Press,
          <year>1999</year>
          , pp.
          <fpage>574</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mozafari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Milner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Stoica</surname>
          </string-name>
          ,
          <article-title>Blinkdb: queries with bounded errors and bounded response times on very large data</article-title>
          , in: Eighth Eurosys Conference 2013, EuroSys '
          <fpage>13</fpage>
          , Prague, Czech Republic,
          <source>April 14-17</source>
          ,
          <year>2013</year>
          ,
          <year>2013</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>42</lpage>
          . URL: https://doi.org/10.1145/2465351.2465355. doi:
          <volume>10</volume>
          .1145/2465351.2465355.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Galakatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Crotty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zgraggen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          , T. Kraska,
          <article-title>Revisiting reuse for approximate query processing</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>1142</fpage>
          -
          <lpage>1153</lpage>
          . URL: http://www.vldb.org/pvldb/ vol10/p1142-galakatos.pdf.
          <source>doi:10.14778/3115404</source>
          . 3115418.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Giacometti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soulet</surname>
          </string-name>
          ,
          <article-title>Anytime algorithm for frequent pattern outlier detection</article-title>
          ,
          <source>Int. J. Data Sci. Anal</source>
          .
          <volume>2</volume>
          (
          <year>2016</year>
          )
          <fpage>119</fpage>
          -
          <lpage>130</lpage>
          . URL: https: //doi.org/10.1007/s41060-016-0019-9. doi:
          <volume>10</volume>
          .1007/ S41060-016-0019-9.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>