                         Finding Comparison Insights in Multidimensional Datasets
                         Claire Antoine1,† , Alexandre Chanson1 , Nicolas Labroche1 and Patrick Marcel2,*,‡
                         1
                             LIFAT, université de Tours, France
                         2
                             LIFO, Université d’Orléans, France


                                            Abstract
                                            What are the best comparisons that can be found in a multidimensional dataset? In this paper, we propose to support exploratory data
                                            analysis by answering this question, resorting to both input space sampling and output space sampling. We sample both the dataset
                                            and the set of aggregate queries that a user would have to ask to find significant comparisons that are frequent in the data cube of
                                            the dataset. Our tests show that our approach is effective in retrieving significant comparisons, and that comparison insights can be
                                            computed in seconds even for fairly large datasets, if the number of values compared is small.

                                            Keywords
                                            Exploratory data analysis, Comparison queries, Data cube, Statistical tests, Sampling



1. Introduction

Exploratory Data Analysis (EDA) is the notoriously tedious activity of Data Science consisting of interactively analyzing a dataset to gain insights. According to De Bie et al. [1], EDA poses the greatest challenges for automation, since background knowledge and human judgment are the keys to success. Recently, many approaches were proposed to support EDA, including approaches to automatically generate EDA sessions, often defined as maximization problems (see, e.g., [2, 3, 4, 5]).
   In this work, we target a specific type of insight, comparisons, which are very popular among data workers (see, e.g., [6, 7, 8]). For a given multidimensional dataset 𝑅 with a categorical attribute 𝐴 and a numerical attribute 𝑀, a comparison insight is noted 𝑎 < 𝑏, where 𝑎 and 𝑏 are values in the active domain of 𝐴, and is such that there is enough evidence that the value of 𝑀 for 𝐴 = 𝑎 is less than the value of 𝑀 for 𝐴 = 𝑏.

   departure   date       departure   airline   departure
   airport                hour                  delay
   DCA         1-1-2015    945        WN         -5
   LAX         1-1-2015   1951        OO         -4
   DCA         1-1-2015   2233        AA         93
   SEA         2-1-2015    547        WN         -3
   LAX         2-1-2015   2143        OO         63
   LAX         3-1-2015   1143        OO        206
   LAX         3-1-2015    927        AA         -3
   LAX         3-1-2015   1346        OO         -9
   SEA         3-1-2015   1654        OO         54
   SEA         4-1-2015    155        AA         85

Table 1
Excerpt of table flight

Example 1.1. Consider the excerpt of the flight dataset displayed in Table 1, loaded in a relational database. Suppose the analyst is interested in comparing airlines (the categorical attribute 𝐴) to each other regarding the average departure delay (the numerical attribute 𝑀). To do so, the analyst could for instance run a SQL query selecting two airlines, say WN and OO (the values 𝑎 and 𝑏), grouping by airline and some other attributes (e.g., departure airport), and aggregating measure departure delay. This implies a full table scan, which may be prohibitively costly for large tables, and results in a very summarized view of the data, which may hide some "local" effects at more granular levels of detail. To obtain interesting comparisons, the analyst would have to run enough aggregate queries to check that the same insights for WN and OO are found at various levels of detail.

   Our goal is to efficiently extract such insights from a given multidimensional dataset. To do so, we resort to sampling the dataset and use existing optimization structures. Sampling the dataset allows us to obtain a candidate comparison insight quickly, using statistical tests to check its significance. Besides, the candidate insights should be validated against the possible group-bys that can be computed over the multidimensional dataset. To validate an insight, we evaluate aggregate queries against the real data, using a sample of queries over existing materialized views.
   Our approach to extracting comparison insights from a dataset contributes:

    • a robust way of obtaining candidate comparison insights, by devising a statistic to choose between a parametric and a non-parametric test of significance, based on the distribution of values in the sample of the dataset,
    • an algorithm for generating queries over existing materialized views to validate candidate comparison insights,
    • a series of experiments on artificial and real datasets to assess the effectiveness and efficiency of the approach.

   The outline of the paper is as follows. The next section presents the formal background. Section 3 introduces our approach to extracting comparisons. Section 4 details how candidate comparisons are obtained, while Section 5 presents the validation of candidates. Experiments are reported in Section 6. Section 7 discusses related work and Section 8 concludes by outlining perspectives.

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
* Corresponding author and main contributor.
† The work was done when the author was an intern at LIFAT.
‡ This work was partially supported by JUNON Program (DATA/LIFO) with financial support of Région Centre-Val de Loire, France.
claire.antoine56@gmail.com (C. Antoine); Alexandre.Chanson@univ-tours.fr (A. Chanson); Nicolas.Labroche@univ-tours.fr (N. Labroche); Patrick.Marcel@univ-orleans.fr (P. Marcel)
0000-0003-3171-1174 (P. Marcel)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
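To make Example 1.1 concrete, the following minimal sketch runs such an aggregate comparison query over the excerpt of Table 1. It uses an in-memory SQLite database rather than the PostgreSQL setting of the paper's prototype, and the table and column names are our own:

```python
import sqlite3

# The ten rows of the flight excerpt shown in Table 1.
rows = [
    ("DCA", "1-1-2015",  945, "WN",  -5),
    ("LAX", "1-1-2015", 1951, "OO",  -4),
    ("DCA", "1-1-2015", 2233, "AA",  93),
    ("SEA", "2-1-2015",  547, "WN",  -3),
    ("LAX", "2-1-2015", 2143, "OO",  63),
    ("LAX", "3-1-2015", 1143, "OO", 206),
    ("LAX", "3-1-2015",  927, "AA",  -3),
    ("LAX", "3-1-2015", 1346, "OO",  -9),
    ("SEA", "3-1-2015", 1654, "OO",  54),
    ("SEA", "4-1-2015",  155, "AA",  85),
]
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE flight
    (departure_airport TEXT, date TEXT, departure_hour INT,
     airline TEXT, departure_delay REAL)""")
con.executemany("INSERT INTO flight VALUES (?,?,?,?,?)", rows)

# Average departure delay of the two compared airlines (a = WN, b = OO),
# grouped by airline and one extra attribute (departure airport).
result = con.execute("""
    SELECT departure_airport, airline, avg(departure_delay)
    FROM flight
    WHERE airline IN ('WN', 'OO')
    GROUP BY departure_airport, airline
    ORDER BY departure_airport, airline""").fetchall()
for airport, airline, avg_delay in result:
    print(airport, airline, avg_delay)
```

On this tiny excerpt the query is cheap, but on a large table it requires a full scan, which is exactly the cost the paper's sampling approach avoids.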



2. Formal background

We give the definition of the comparison queries we consider. This definition extends that of [4, 9] by considering more than one attribute in the group-by set. We consider an instance of relation 𝑅 of schema 𝑅[𝐴1, . . . , 𝐴𝑛, 𝑀1, . . . , 𝑀𝑚]. The 𝐴𝑖's are categorical attributes and the 𝑀𝑗's are numerical attributes, called measures in what follows. The active domain of attribute 𝐴 is noted 𝑎𝑑𝑜𝑚(𝐴). As usual for multidimensional data, we require 𝑅 to be in Boyce-Codd Normal Form.

Definition 2.1 (Comparison queries). Comparison queries are extended relational queries of the form:
𝛾𝐺,𝑎𝑔𝑔(𝑀)→𝑣𝑎𝑙 (𝜎𝐴=𝑎 (𝑅)) ◁▷ 𝛾𝐺,𝑎𝑔𝑔(𝑀)→𝑣𝑎𝑙′ (𝜎𝐴=𝑏 (𝑅))
where 𝐴 is a categorical attribute in {𝐴1, . . . , 𝐴𝑛}, 𝐺 is a group-by set in 2^{𝐴1,...,𝐴𝑛}, 𝑀 is a measure attribute in {𝑀1, . . . , 𝑀𝑚}, 𝑎𝑔𝑔 is an aggregate function, and 𝑎, 𝑏 ∈ 𝑎𝑑𝑜𝑚(𝐴). 𝛾 denotes the grouping/aggregation operator.

   For a group-by set 𝐺 of relation 𝑅, we call a cuboid of 𝑅 for function 𝑎𝑔𝑔 and measure 𝑀 the result of the query 𝛾𝐺;𝑎𝑔𝑔(𝑀) (𝑅). The cuboids of 𝑅 form the data cube [10] of 𝑅. The number of cuboids for 𝑅 and 𝑎𝑔𝑔(𝑀) is 2^|{𝐴1,...,𝐴𝑛}|, and the number of cuboids for 𝑅 and 𝑎𝑔𝑔(𝑀) whose group-by set includes attribute 𝐴 is 2^(|{𝐴1,...,𝐴𝑛}|−1).
   For two values 𝑎, 𝑏 of an attribute 𝐴 and a measure 𝑀, a comparison insight 𝑎 < 𝑏 indicates that, on average, values of 𝑀 for 𝑎 are statistically lower than values of 𝑀 for 𝑏. The result of a comparison query can present evidences (if the value of 𝑀 for 𝑎 is indeed lower than the value of 𝑀 for 𝑏) or violations (otherwise) of the insight. An actual insight must have enough evidences supporting it. If this is not known, we call it a candidate comparison insight.

Definition 2.2 (Candidate comparison insight). A candidate comparison insight 𝑎 < 𝑏, where 𝑎, 𝑏 are in 𝑎𝑑𝑜𝑚(𝐴) for attribute 𝐴, measure 𝑀 of 𝑅, aggregation function 𝑎𝑔𝑔, and instance 𝑔 of a group-by set 𝐺 over 𝑅 including 𝐴, is a pair of tuples (𝑔, 𝑎, 𝑚𝑎), (𝑔, 𝑏, 𝑚𝑏) of the result of a comparison query over 𝑅 such that 𝑚𝑎 < 𝑚𝑏.

   A candidate comparison insight can be violated depending on the group-by set instance, where a violation of candidate 𝑎 < 𝑏 is when 𝑚𝑎 ≥ 𝑚𝑏.

                               airline
   departure airport    AA       OO       WN
   DCA                  11.92    -1.67    8.13
   LAX                  10.15    3.18     19.09
   SEA                  6.22     31.0     4.5

Table 2
Cuboid by airline and departure airport (crosstab view) over the flight table

Example 2.3. Assume Table 2 shows the result of the aggregate query 𝑞 = 𝛾𝑎𝑖𝑟𝑙𝑖𝑛𝑒,𝑑𝑒𝑝𝑎𝑟𝑡𝑢𝑟𝑒_𝑎𝑖𝑟𝑝𝑜𝑟𝑡;𝑎𝑣𝑔(𝑑𝑒𝑝𝑎𝑟𝑡𝑢𝑟𝑒_𝑑𝑒𝑙𝑎𝑦) (𝑓𝑙𝑖𝑔ℎ𝑡) over the flight table of the previous example. The candidate comparison insight AA > OO holds for the DCA and LAX airports, while the SEA airport shows a violation of AA > OO. This candidate insight can be obtained from the comparison query 𝛾𝐺,𝑎𝑣𝑔(𝑑𝑒𝑙𝑎𝑦)→𝐴𝐴 (𝜎𝑎𝑖𝑟𝑙𝑖𝑛𝑒=𝐴𝐴 (𝑓𝑙𝑖𝑔ℎ𝑡)) ◁▷ 𝛾𝐺,𝑎𝑣𝑔(𝑑𝑒𝑙𝑎𝑦)→𝑂𝑂 (𝜎𝑎𝑖𝑟𝑙𝑖𝑛𝑒=𝑂𝑂 (𝑓𝑙𝑖𝑔ℎ𝑡)), where 𝐺 denotes all the attributes of table flight (see the previous example).

Definition 2.4 (Validated comparison insight). Let 𝑎 < 𝑏 be a candidate insight, and 𝐶 be a cuboid over 𝑅 including attribute 𝐴. We say that 𝐶 supports 𝑎 < 𝑏 if the ratio of violations of 𝑎 < 𝑏 is below a user-defined threshold 𝜏. Finally, we say that the insight is validated if the ratio of cuboids supporting the insight is above a user-defined threshold 𝜌.

3. Approach overview

Figure 1: Approach overview

   The computation of comparison insights requires exploring the data cube of 𝑅, which may be too computationally costly if 𝑅 is large. We developed an approach based on sampling to tackle those cases of discovering comparison insights from large relations. The principle is to compute candidate comparison insights on a sample of 𝑅 and validate them over a sample of the cuboids of 𝑅.
   Our approach is illustrated in Figure 1. We first sample 𝑅, so that this sample fits in main memory. Let 𝑆 be the sample of 𝑅. Over 𝑆, we compute a candidate comparison insight 𝑎 < 𝑏, for the comparison query 𝛾𝐺,𝑎𝑔𝑔(𝑀)→𝑣𝑎𝑙 (𝜎𝐴=𝑎 (𝑆)) ◁▷ 𝛾𝐺,𝑎𝑔𝑔(𝑀)→𝑣𝑎𝑙′ (𝜎𝐴=𝑏 (𝑆)), where 𝐺 includes all categorical attributes of 𝑅. To check if the comparison is significant, we run a statistical test assessing whether the mean for 𝑎 is significantly greater than that for 𝑏 over the sample.
   To validate the candidate comparisons found with the statistical tests over 𝑆, we sample the set 𝐿 of cuboids over 𝑅 as follows. From this set 𝐿, we assume that some cuboids are materialized on disk under the form of materialized views (𝑀𝐶, dark green cuboids in Figure 1). From this set 𝑀𝐶 of materialized views (excluding 𝑅), we consider the set 𝐶𝑄 of all comparison queries that can be answered using 𝑀𝐶 (purple-bordered in Figure 1). Generally, |𝐶𝑄| > |𝑀𝐶|. We sample this set 𝐶𝑄; this sample is called 𝑄. Over 𝑄, we approximate the ratio of queries of 𝐶𝑄 satisfying the user threshold 𝜏. We postulate that this approximation is close enough to the ratio of queries of 𝐿 satisfying the user threshold.
   We now detail these two steps.

4. Statistical test to obtain candidate comparisons

As we do not have all of 𝑅 but a sample 𝑆, the pairwise comparison of 𝑎, 𝑏 is made using a statistical test over 𝑆.
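As a minimal illustration of this step (our own variable names; the one-sided Welch t-test used here is the parametric choice discussed in this section, and the choice of test is the subject of the decision model below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical measure values drawn from the sample S for A = a and A = b;
# here the values for a are clearly lower on average than those for b.
m_a = rng.normal(loc=2.0, scale=5.0, size=300)
m_b = rng.normal(loc=8.0, scale=5.0, size=300)

# One-sided Welch t-test: H0 is mean(a) >= mean(b); rejecting H0
# (small p-value) supports the candidate insight a < b.
res = stats.ttest_ind(m_a, m_b, equal_var=False, alternative="less")
print(res.pvalue < 0.05)  # True: the candidate a < b is significant here
```

In the actual pipeline, one such p-value is produced per candidate pair, and the p-values are then corrected for multiple testing.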
The null hypothesis of the test should be that the mean for 𝑎 is greater than that for 𝑏. The choice of accepting or rejecting the hypothesis is made based on the p-value of the chosen test. As we will be making many comparisons, the p-values should be corrected, using e.g., the Benjamini-Hochberg FDR correction [11].
   We could use a non-parametric test, as in [4], since such tests make few or no assumptions about the data distribution. However, the statistical power of non-parametric tests is generally lower, and some tests, e.g., permutation tests, require more computation time, as they estimate the distribution by performing numerous re-samplings.
   We choose to use the parametric Welch's one-sided t-test, which has the advantage of not making assumptions about the variance or sample size, and is thus well-suited to a context where there is no prior knowledge of the data. However, Welch's test still assumes normality of the data and is mainly impacted by skewness and sample size. Indeed, type I error tends to increase with the rise in skewness of the distributions [12].
   The question becomes: how to build a decision model, based on a dedicated statistic, to verify whether the samples used in the comparison are sufficiently normal and decide whether to use Welch's test or a non-parametric test. We therefore present our methodology, relying on the parametric Welch t-test, to devise potentially discriminating candidate statistics (Section 4.1) and then train a simple decision model based on these candidates (Section 4.2).

4.1. Devising candidate statistics

To devise a statistic for deciding which test to use, we considered 8 candidate statistics, devised based on the factors cited in the literature as influencing Welch's test robustness, namely sample size (𝑛) and skewness (𝑠𝑘𝑒𝑤) for each sample, 𝑎 and 𝑏: (𝑛𝑎 × 𝑛𝑏), (𝑛𝑎 + 𝑛𝑏), (𝑠𝑘𝑒𝑤𝑎 × 𝑠𝑘𝑒𝑤𝑏), (𝑠𝑘𝑒𝑤𝑎 + 𝑠𝑘𝑒𝑤𝑏), |𝑠𝑘𝑒𝑤𝑎 − 𝑠𝑘𝑒𝑤𝑏|, |𝑠𝑘𝑒𝑤𝑎/𝑛𝑎 − 𝑠𝑘𝑒𝑤𝑏/𝑛𝑏|, (𝑠𝑘𝑒𝑤𝑎/𝑛𝑎² + 𝑠𝑘𝑒𝑤𝑏/𝑛𝑏²), (𝑠𝑘𝑒𝑤𝑎/𝑛𝑎 + 𝑠𝑘𝑒𝑤𝑏/𝑛𝑏).
   We generated artificial datasets for 𝑎 and 𝑏 with various distributions (standard normal, normal with 𝜎 = 20, uniform, chi-squared, and exponential) and different sample sizes 𝑛 ∈ {10, 50, 100, 200, 500, 1000}. The goal was to obtain a varied panel of pairs (distribution for 𝑎, 𝑛𝑎) × (distribution for 𝑏, 𝑛𝑏): distributions with high asymmetry and completely symmetric ones, equal or unequal sample sizes, with large and small size differences. Each sample of the same pair was generated with the same expectation 𝜇, to be under the hypothesis 𝐻0 : 𝜇𝑎 = 𝜇𝑏 (= 𝜇) of Welch's test. If the test rejects 𝐻0, we know it makes a type I error. For each pair considered, we draw 10,000 replicas of different sample pairs, perform Welch's test at the nominal threshold 𝛼 = 5%, and calculate the rejection rate among the replicas, giving us an estimator of the type I error rate for this given combination. The number of replicas was chosen to be large enough to ensure that the rejection rate is a good estimator of the type I error rate [12, 13]. We also considered several expectations 𝜇 ∈ {2, 5, 10, 100} in case this parameter also impacts the type I error rate.

4.2. Decision model

We want a model that uses only one statistic to classify pairs, to have a quick and simple decision rule: a decision threshold on the statistic in question. We trained two models (decision tree, logistic regression) to classify pairs whose type I error is out of bounds (1) and those for which it is controlled (0), based on each of the candidate statistics. The training sample transmitted to our models was balanced by undersampling, to avoid favoring the majority class.
   For each model type, we selected the most interesting statistic (the one creating the greatest heterogeneity for the decision tree, and the one selected by backward selection for logistic regression). We took the highest probability threshold for logistic regression that ensures 0% false negatives on the training sample. In our case, false negatives (not detecting a pair whose type I error is out of bounds) are considered more serious errors than false positives (believing that a pair with controlled type I error is out of bounds) because in the first case, we present a false test result to the user, while in the second case, we postpone the decision to more tests but do not give false conclusions. We compared the two models based on the 𝐹2-score on the rest of the data (excluding the training sample), which provides a more severe estimation of recall, i.e., it maximizes the portion of out-of-bounds pairs detected, which avoids false conclusions.
   Both models agree on the following: the best statistic is |𝑠𝑘𝑒𝑤𝑎/𝑛𝑎 − 𝑠𝑘𝑒𝑤𝑏/𝑛𝑏|, and a threshold of 0.049 achieves the best 𝐹2-score of 0.99 (with no false negatives) to separate valid/invalid pairs on the training sample. Below this threshold, pairs are valid for using Welch's test. On the test sample, the 𝐹2-score is 0.73 (with a recall of 1). We therefore chose to use this statistic and threshold for deciding which test to run.

5. Validating candidate comparisons

The validation of a candidate comparison insight is described in Algorithm 1. Recall from Section 3 that the algorithm validates the candidate insight 𝑎 < 𝑏 on a sample of queries 𝑄. Given the sparsity of the data, the sample 𝑆 may return no data for 𝑎 or 𝑏. If so, the candidate is not considered, and the next candidate comparison is checked (line 4). The candidate comes with a p-value indicating if the difference 𝑎 < 𝑏 is significant in 𝑆. If it is not, the next candidate comparison is checked (line 4). If the candidate is considered and significant, the queries to check violations over the sample 𝑄 are generated (line 5) and run over the materialized cuboids (line 8). Again, a query result may return no data for 𝑎 or 𝑏. In that case, the cuboid is not counted in the denominator of the prediction (line 10). If the prediction is over the threshold (line 15), or if this threshold cannot be reached given the remaining cuboids to check (line 18), then no more validation queries are run and another candidate is checked. Otherwise the algorithm runs until no more queries are left.
   Phrased in SQL, the queries generated to check violations of 𝑎 < 𝑏 are of the following form:

WITH
Q1 AS (SELECT G
       FROM (SELECT G, A FROM view_G
             WHERE A IN (a, b) GROUP BY G, A)
       GROUP BY G HAVING count(*) > 1),
Q2 AS (SELECT G, A, agg(M), rank() OVER
         (PARTITION BY G ORDER BY agg(M) DESC)
       FROM view_G
       WHERE A IN (a, b)
       GROUP BY G, A)
SELECT G, string_agg(A::text, ',')
FROM Q2
WHERE G IN (SELECT G FROM Q1)
GROUP BY G;

   where view_G is the materialized view over which the query is evaluated, Q2 is used to order 𝑎 and 𝑏 in the result of the comparison query over view_G, and Q1 removes group-by instances where 𝑎 or 𝑏 do not appear.

Algorithm 1 Candidate comparison validation
Require: a sample of queries 𝑄, two user thresholds 𝜏 and 𝜌, a candidate comparison insight 𝑎 < 𝑏
Ensure: Valid if 𝑎 < 𝑏 is valid, Invalid if it is not, Inconclusive if 𝑎 < 𝑏 is not considered or not significant
 1: 𝑛𝑏𝑄𝑢𝑒𝑟𝑖𝑒𝑠 = |𝑄|
 2: 𝑛𝑏𝑂𝑘 = 0
 3: 𝑛𝑏𝑇𝑒𝑠𝑡 = 0
 4: if 𝑎 < 𝑏 is considered and significant then
 5:    𝑉 = generate validation queries over 𝑄
 6:    for 𝑞 ∈ 𝑉 do
 7:       𝑛𝑏𝑇𝑒𝑠𝑡 = 𝑛𝑏𝑇𝑒𝑠𝑡 + 1
 8:       𝑟 = evaluate 𝑞
 9:       if 𝑎 and 𝑏 are not in 𝑟 then
10:          𝑛𝑏𝑄𝑢𝑒𝑟𝑖𝑒𝑠 = 𝑛𝑏𝑄𝑢𝑒𝑟𝑖𝑒𝑠 − 1
11:       end if
12:       if ratio of violations in 𝑟 < 𝜏 then
13:          𝑛𝑏𝑂𝑘 = 𝑛𝑏𝑂𝑘 + 1
14:       end if
15:       if 𝑛𝑏𝑂𝑘/𝑛𝑏𝑄𝑢𝑒𝑟𝑖𝑒𝑠 > 𝜌 then
16:          return Valid
17:       end if
18:       if (𝑛𝑏𝑂𝑘 + |𝑄| − 𝑛𝑏𝑇𝑒𝑠𝑡)/𝑛𝑏𝑄𝑢𝑒𝑟𝑖𝑒𝑠 < 𝜌 then
19:          return Invalid
20:       end if
21:    end for
22:    if 𝑛𝑏𝑂𝑘/𝑛𝑏𝑄𝑢𝑒𝑟𝑖𝑒𝑠 > 𝜌 then
23:       return Valid
24:    else
25:       return Invalid
26:    end if
27: end if
28: return Inconclusive

   Note that Algorithm 1 generates one query per pair (𝑎, 𝑏) per element of the sample 𝑄, allowing to leverage existing indexes over attribute 𝐴. Another strategy consists of leaving the DBMS to actually compute the ratio of violations of all pairs in a cuboid, by generating one query per element of the sample 𝑄. Phrased in SQL, these queries are of the following form:

WITH
Q1 AS (SELECT G, c1.A A1, c2.A A2,
              sign(c1.M - c2.M) sign
       FROM
         (SELECT G, A, agg(M) M
          FROM view_G GROUP BY G, A) c1,
         (SELECT G, A, agg(M) M
          FROM view_G GROUP BY G, A) c2
       WHERE c1.A
       FROM (Q1)
       GROUP BY A1, A2
)
SELECT A1, A2, (pos - neg)/cnt score
FROM (Q2);

   We postulate that such queries may not be able to exploit any index and that the self-join between aggregates over the materialized view is likely to be very costly. Therefore, only small datasets may benefit from this strategy.

6. Experiments

This section reports the tests we did to assess the effectiveness and efficiency of our approach.

6.1. Settings

We developed a prototype in Python 3.10 over the Postgres 16 RDBMS1. We use the SAMPLE method of Postgres to sample the initial dataset, with the 'SYSTEM_ROWS' parameter2. As Postgres does not support query rewriting using materialized views, we implemented a simple rewriting method based only on the list of attributes in the group-by set of the query. Tests are run on a MacBook Pro with an Apple M2 Pro chip and 32GB of RAM. Postgres was run with the default configuration, i.e., with 128MB of shared memory buffers and 4MB of working memory.

   Dataset      #tuples   Disk sp. (MB)   #cat. attr.   #cub.   Density
   F100K        110,358        10              6          32    4.4·10⁻¹⁰
   F600K        633,938       129              7          64    5.7·10⁻¹²
   SSB        6,001,171      1406              5          16    2.7·10⁻⁸
   Health    12,694,445      2309              6          32    1.3·10⁻²

Table 3
Dataset statistics

   Datasets used are in Table 3. F100K and F600K are excerpts from the flight delays dataset3. SSB is the artificial dataset of the SSB benchmark [14]. Health is the Rate table from the health insurance dataset4.
   We ran two sets of tests: the first tests compare the result of our approach to ground truth on the small F100K dataset, for the airline attribute. The second set of tests, on all datasets, reports the time to compute 90 pairs for one attribute: the airline attribute, having 14 values (91 pairs in total), for F100K and F600K, the statecode attribute for Health, and the lo_suppkey attribute for SSB.

6.2. Effectiveness

Since our approach uses sampling, we ran effectiveness tests to compare its outcome (the comparison insights) to the ground truth, i.e., when comparisons are computed on the non-sampled relation 𝑅 and all its cuboids. Unless otherwise stated, we arbitrarily fixed 𝜏 = 0.4 for the ratio of violations in a cuboid and 𝜌 = 0.4 for the ratio of cuboids showing less than 0.4 violations. The percentage of the lattice materialized is fixed to 40%. Note that the materialized