Exploring Data Using Patterns: A Survey and Open Problems

Lukasz Golab, University of Waterloo, Canada (lgolab@uwaterloo.ca)
Divesh Srivastava, AT&T Chief Data Office, USA (divesh@att.com)

ABSTRACT
We present a survey of data exploration methods that extract multi-dimensional patterns from datasets consisting of dimension and measure attributes. These patterns are designed to summarize common properties of tuples associated with particular values of the measure attributes. We provide a categorization of the characteristics of patterns produced by various solutions to this problem, we point out the pros, cons and performance optimizations of existing methods, and we suggest directions for future research.

KEYWORDS
Data exploration, Data summarization, Data explanation, Data cube, Pattern mining

1 INTRODUCTION
Data volumes have been growing rapidly in recent years. As a result, data-intensive methods are now common in many contexts, including business, science, and public governance. This motivates the need for tools that allow users who are not necessarily data management experts to explore large datasets. Such tools range from visualization and aggregation to flexible search interfaces such as keyword search in structured databases.

In this paper, we focus on the exploration of datasets containing dimension attributes and binary or numeric measure attributes. In traditional business datasets, dimension attributes often describe products or employees, and measure attributes indicate sales totals or salaries. In Internet-of-Things (IoT) and infrastructure monitoring, dimension attributes may describe device properties and measure attributes correspond to performance statistics. In Web datasets, dimension attributes may describe products, with user ratings as measure attributes. Additionally, in any of these applications, derived measure attributes may exist, e.g., a binary attribute denoting whether a given record was determined to be an outlier or to contain an error.

The data cube has traditionally been used to explore these kinds of datasets, by allowing users to aggregate, roll-up and drill-down using various subsets of group-by attributes. However, in large-scale databases, the data cube may be very large and may not immediately reveal interesting patterns and trends. This motivates the need for more sophisticated exploration tools [17].

We observe that a large body of recent work on exploring multi-dimensional and OLAP datasets proposes methods to identify interesting fragments of the data, summarized using patterns over the values of the dimension attributes. We make the following contributions towards an understanding of these data exploration methods.

(1) We survey recent work on data exploration using multi-dimensional patterns and propose a categorization based on the properties of patterns suggested for exploration: coverage, contrast, and information.
(2) We analyze the pros, cons and performance optimizations of existing work, and we suggest open problems for future research.

Β© Copyright 2021 for this paper held by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 BACKGROUND
We are given a dataset S with a set D of dimension attributes and a set M of measure attributes (also referred to as outcomes in some prior work [6]). Let D_1, D_2, ..., D_d be the d dimension attributes and let M_1, M_2, ..., M_m be the m measure attributes. For now, we assume, as in the majority of previous work, that the dimension attributes are categorical, and we will comment on ordinal and numeric dimension attributes in Section 4. Measure attributes may be binary or numeric.

Table 1 shows a flight dataset that will serve as a running example. For each flight, the dataset includes a record id, followed by three dimension attributes, Day of the week, flight Origin and flight Destination, as well as two measure attributes, a numeric attribute denoting how late the flight was and a binary attribute denoting whether the flight was full.

Table 1: A flight dataset

id  Day  Origin     Dest.      Late  Full
1   Fri  SF         London     20    1
2   Fri  London     LA         16    1
3   Sun  Tokyo      Frankfurt  10    1
4   Sun  Chicago    London     15    1
5   Sat  Beijing    Frankfurt  13    1
6   Sat  Frankfurt  London     19    1
7   Tue  Chicago    LA          5    0
8   Wed  London     Chicago     6    0
9   Thu  SF         Frankfurt  15    1
10  Mon  Beijing    SF          4    0
11  Mon  SF         London      7    0
12  Mon  SF         Frankfurt   5    0
13  Mon  Tokyo      Beijing     6    0
14  Mon  Frankfurt  Tokyo       4    0

Table 2: Methods surveyed

Method                  Approach     Applications
CAPE [14]               Contrast     Explaining queries
Data Auditor [9]        Coverage     Data quality analysis
Data X-ray [19]         Contrast     Data quality analysis
DIFF [2]                Contrast     Outlier analysis
Explanation tables [7]  Information  Feature selection
Macrobase [1]           Contrast     Outlier analysis
MRI [5]                 Coverage     Explaining queries
RSExplain [15]          Contrast     Explaining queries
Scorpion [20]           Contrast     Outlier analysis
Shrink [10]             Information  Explaining queries
Smart Drilldown [11]    Coverage     Explaining queries
SURPRISE [16]           Information  Explaining queries

Let dom(D_i) be the active domain of the i-th dimension attribute. A pattern p (referred to as a rule in some prior work [8]) is a tuple from (dom(D_1) βˆͺ {*}) Γ— Β· Β· Β· Γ— (dom(D_d) βˆͺ {*}), i.e., from the data cube over the dimension attributes, with '*' denoting all possible values of that attribute. A tuple t ∈ S matches p, denoted by t ≍ p, if p[D_j] = '*' or t[D_j] = p[D_j] for each dimension attribute D_j. For example, tuple 4 from Table 1 matches the patterns (*, *, *) and (*, *, London), but does not match the pattern (Fri, *, *). Some approaches [10] support richer patterns, with disjunctions and dimension hierarchies.

Let sup(p) be the support of p in S, i.e., the number of tuples matching p, and let sup_r(p) be the number of tuples matching p and satisfying the predicate r over the measure attributes. For example, sup(*, *, London) = 4 and sup_{Full=0}(*, *, London) = 1. Furthermore, let ΞΈ_r(p) = sup_r(p) / sup(p), which is the fraction of tuples matching p that also satisfy the predicate r. For example, ΞΈ_{Full=1}(*, *, London) = 3/4.

Let sum_{M_i}(p) be the sum of the values of a measure attribute M_i over all the tuples matching p, and let sum_{M_i, r}(p) be the sum of the values of M_i over all the tuples matching p and satisfying the predicate r over the measure attributes. For example, sum_{Late}(*, *, London) = 20 + 15 + 19 + 7 = 61 and sum_{Late, Full=0}(*, *, London) = 7.

We survey solutions to the following data exploration problem: given a dataset S, produce a set or a list of patterns P over the dimension attributes of S, as defined above, that summarize common properties of tuples sharing the same values of the measure attribute(s). Users may then inspect the patterns and explore tuples covered by the patterns. They may then extract patterns corresponding to a smaller subset of the data that was found to be interesting during the earlier exploration step, and so on.

The number of patterns in P should be limited to direct the user's attention to the most important or interesting regions of the data. This limit may be set explicitly by the user (in terms of the maximum number of patterns in P) or implicitly by retrieving the fewest possible patterns that jointly satisfy some property, such as covering some fraction of the data.
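The definitions above are straightforward to operationalize. The following minimal sketch (the helper names are ours, not from any surveyed system; the data is transcribed from Table 1) computes sup, sup_r, ΞΈ_r and sum for the running example:

```python
# Table 1 as (Day, Origin, Dest., Late, Full) tuples.
FLIGHTS = [
    ("Fri", "SF", "London", 20, 1), ("Fri", "London", "LA", 16, 1),
    ("Sun", "Tokyo", "Frankfurt", 10, 1), ("Sun", "Chicago", "London", 15, 1),
    ("Sat", "Beijing", "Frankfurt", 13, 1), ("Sat", "Frankfurt", "London", 19, 1),
    ("Tue", "Chicago", "LA", 5, 0), ("Wed", "London", "Chicago", 6, 0),
    ("Thu", "SF", "Frankfurt", 15, 1), ("Mon", "Beijing", "SF", 4, 0),
    ("Mon", "SF", "London", 7, 0), ("Mon", "SF", "Frankfurt", 5, 0),
    ("Mon", "Tokyo", "Beijing", 6, 0), ("Mon", "Frankfurt", "Tokyo", 4, 0),
]

def matches(t, p):
    """t matches p iff every non-star component of p equals t's dimension value."""
    return all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p))

def sup(p, r=lambda t: True):
    """sup_r(p): number of tuples matching p that satisfy predicate r."""
    return sum(1 for t in FLIGHTS if matches(t, p) and r(t))

def theta(p, r):
    """theta_r(p): fraction of tuples matching p that satisfy r."""
    return sup(p, r) / sup(p)

def sum_measure(p, m, r=lambda t: True):
    """sum_{M,r}(p): sum of measure column index m over tuples matching p satisfying r."""
    return sum(t[m] for t in FLIGHTS if matches(t, p) and r(t))

london = ("*", "*", "London")
full = lambda t: t[4] == 1
print(sup(london), theta(london, full), sum_measure(london, 3))  # 4 0.75 61
```

Running this reproduces the examples from the text: sup(*, *, London) = 4, ΞΈ_{Full=1}(*, *, London) = 0.75, and sum_{Late}(*, *, London) = 61.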
This data exploration problem has the following applications.

β€’ Explaining the results of aggregate queries. Suppose a data analyst issues the following query over Table 1: SELECT SUM(Late) FROM S. Suppose the analyst wishes to understand why the result, of 145, is so high. Here, interesting patterns are those which cover tuples that make a significant contribution to the result, i.e., those with a high sum_{Late}(), such as (*, *, London). The analyst may then zoom into flights landing in London and investigate potential reasons for the lengthy delays.

β€’ Analyzing outliers and data quality issues. Suppose we have a binary measure attribute denoting whether a given tuple contains an error or is an outlier. This attribute could be created manually by domain experts or automatically by identifying tuples that violate data quality rules or deviate from the expected distribution. We may wish to produce patterns that summarize the properties of erroneous tuples to help determine the root cause of data quality problems.

β€’ Feature selection and explainable AI. Before building prediction models, a data scientist may explore interesting patterns to understand which dimension attributes are related to the measure attribute that is to be predicted. Furthermore, suppose a data analyst wants to understand how a black-box model makes classification decisions. Here, the dimension attributes are the features given to the model as input, and, as the measure attribute, the analyst records the predictions made by the model. The analyst may then want to find interesting patterns that explain the prediction decisions. For example, in Table 1, the pattern (Mon, *, *) is associated with tuples having Full = 0, suggesting that flights scheduled on Mondays are usually not full.^1

^1 Model explanations may be global (to summarize how classification decisions are made) or local (to explain why a specific record was classified in a particular way). The explanations discussed in this paper are examples of global explanations.

3 SURVEY
In this section, we provide a categorization of previous work on data exploration using multi-dimensional patterns based on the pattern properties and ranking strategies used for pattern selection. We categorize these properties into three types: those focusing on coverage, contrast and information. Table 2 categorizes the surveyed methods and lists their motivating applications, as mentioned in the corresponding papers.

3.1 Methods Based on Coverage
The goal of these methods is to identify patterns that cover tuples with certain values of the measure attribute. We discuss three coverage-based methods: Data Auditor [9], MRI [5], and Smart Drilldown [11].

Suppose we are interested in covering tuples having Full = 1 in Table 1, as a way to summarize the characteristics of full flights. A simple coverage-oriented approach is to sort the patterns according to sup_{Full=1}() and output the top-ranking patterns. Ignoring (*, *, *), which always covers everything but is not informative, the top candidates are (*, *, London) and (*, *, Frankfurt), which cover three full flights each, followed by the following patterns that cover two such tuples each: (Fri, *, *), (Sun, *, *), (Sat, *, *) and (*, SF, *).

There are two problems with this simple approach: there may be many patterns with a nonzero sup_{Full=1}(p), and some of these patterns may also cover tuples with other values of the measure attribute (here, Full = 0). To reduce the size of the output and to ensure that each pattern co-occurs with the specified value of the measure attribute, Data Auditor solves the following set cover problem. Continuing with our example, Data Auditor requires a minimum threshold for ΞΈ_{Full=1}(p), i.e., the fraction of tuples covered by p that correspond to full flights; suppose we require ΞΈ_{Full=1}(p) β‰₯ 0.75. This threshold defines the candidate sets for the set cover problem. The set cover objective is then to select the fewest such patterns that together cover a specified fraction of tuples in S having Full = 1. Suppose this coverage fraction, which would again be set by the user, is 0.5. The goal is then to find the fewest patterns, as defined above, that cover half the full flights.

The set cover problem is NP-hard, and Data Auditor uses the standard greedy heuristic that achieves a logarithmic approximation ratio in the size of the solution: it iteratively chooses the pattern that covers the most uncovered tuples (having the desired value of the measure attribute), until the desired fraction of such tuples has been covered. In our example, the first pattern added to P is either (*, *, London) or (*, *, Frankfurt): they each cover three full flights, and their ΞΈ_{Full=1}(p) values are 0.75 each. Suppose the set cover algorithm selects (*, *, Frankfurt). Since there are seven full flights in the dataset and our coverage threshold is 0.5, we need to cover one more full flight. In the next iteration, the pattern that (has ΞΈ_{Full=1}(p) β‰₯ 0.75 and) covers the most remaining full flights is (*, *, London), and the algorithm terminates, with P consisting of these two patterns.
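The greedy loop above can be sketched as follows. This is a toy reimplementation for the running example, not Data Auditor's actual code; enumerating candidate patterns by star-masking each tuple's dimension values, and the thresholds 0.75 and 0.5, are taken from the example above:

```python
from itertools import product

# Table 1 as (Day, Origin, Dest., Full); Late is not needed here.
FLIGHTS = [
    ("Fri", "SF", "London", 1), ("Fri", "London", "LA", 1),
    ("Sun", "Tokyo", "Frankfurt", 1), ("Sun", "Chicago", "London", 1),
    ("Sat", "Beijing", "Frankfurt", 1), ("Sat", "Frankfurt", "London", 1),
    ("Tue", "Chicago", "LA", 0), ("Wed", "London", "Chicago", 0),
    ("Thu", "SF", "Frankfurt", 1), ("Mon", "Beijing", "SF", 0),
    ("Mon", "SF", "London", 0), ("Mon", "SF", "Frankfurt", 0),
    ("Mon", "Tokyo", "Beijing", 0), ("Mon", "Frankfurt", "Tokyo", 0),
]

def matching(p):
    """All tuples matched by pattern p."""
    return [t for t in FLIGHTS
            if all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p))]

# Candidate patterns: every star-mask of every tuple's dimension values.
patterns = sorted({tuple(v if keep else "*" for v, keep in zip(t[:3], mask))
                   for t in FLIGHTS for mask in product([False, True], repeat=3)})

def greedy_cover(theta_min=0.75, coverage=0.5):
    full = {t for t in FLIGHTS if t[3] == 1}
    # Candidate sets: patterns whose matching tuples are mostly full flights.
    cands = [p for p in patterns
             if sum(t[3] for t in matching(p)) / len(matching(p)) >= theta_min]
    chosen, uncovered = [], set(full)
    while len(full) - len(uncovered) < coverage * len(full):
        # Pick the pattern covering the most still-uncovered full flights.
        best = max(cands, key=lambda p: len(uncovered & set(matching(p))))
        chosen.append(best)
        uncovered -= set(matching(best))
        cands.remove(best)
    return chosen

print(greedy_cover())  # [('*', '*', 'Frankfurt'), ('*', '*', 'London')]
```

Since Python's max() returns the first maximal candidate and the candidates are sorted, (*, *, Frankfurt) breaks the tie with (*, *, London) in the first iteration, mirroring the walkthrough above; the second iteration then adds (*, *, London) and covers six of the seven full flights.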
The next method, MRI, finds k patterns that cover a user-specified fraction of the data and satisfy additional properties related to the variance of the measure attribute within each pattern. The motivating example for MRI was to explain queries over product reviews, with the dimension attributes corresponding to information about the reviewers (such as their gender and age) and the numeric measure attribute corresponding to the average rating. Minimizing variance amounts to returning patterns (having high coverage and) describing reviewers with similar opinions. For example, when applying MRI to Table 1 with Late as the measure attribute, the pattern (*, *, Frankfurt) is preferred over (*, *, London). Both patterns cover four tuples, but the former has a lower variance of the Late attribute within the covered tuples.

Smart Drilldown also combines coverage with additional pattern properties. Smart Drilldown produces k patterns. In every iteration of the algorithm, the chosen pattern maximizes the following objective: the number of tuples not yet covered multiplied by the weight of the pattern, corresponding to some measure of interestingness. One simple weighting function proposed in [11] is the number of non-star values in the pattern (i.e., more specific patterns are considered to be more desirable). In Table 1, for example, this weighting function prefers (Tue, Chicago, LA) over (Tue, *, *): both of these patterns have a support of one, but the former has more non-star values.

3.1.1 Performance Optimizations. A performance bottleneck in general set cover problems results from having to keep track of the number of uncovered elements that can be covered by the candidate sets; this changes in every iteration, whenever a new set is added to the solution. Data Auditor relies on the hierarchical nature of patterns and does not generate all the patterns that serve as input to greedy set cover a priori. Instead, patterns are generated on demand, only after all of their supersets have already been considered. For example, a pattern such as (Fri, *, London) would only be generated after all of its supersets, including (Fri, *, *) and (*, *, London), have already been considered. Until then, (Fri, *, London) can be safely ignored: it cannot cover more elements than its supersets and therefore is guaranteed to not be selected by the greedy set cover heuristic at this time. This optimization can greatly reduce the number of generated patterns whose coverage (of uncovered elements) must be re-computed while constructing the solution.

3.1.2 Pros and Cons. One advantage of coverage-based methods is conciseness: by design, they identify the fewest possible patterns that cover the desired fraction of tuples of interest. On the other hand, in Data Auditor, some trial-and-error may be required on the user's part to select good values for the two required thresholds. For example, a high value of ΞΈ will ensure that each pattern co-occurs mainly with the specified value of the measure attribute, but will disqualify more patterns from consideration. Similarly, MRI and Smart Drilldown require users to set parameters for coverage and other pattern properties.

3.2 Methods Based on Contrast
This is the largest group of methods, consisting of CAPE [14], Data X-ray [19], DIFF [2], Macrobase [1], RSExplain [15] and Scorpion [20]. (DIFF is a recent solution that generalizes many of these related methods.) Contrast-based methods usually assume a binary measure attribute, and select patterns co-occurring with one value of the measure attribute but not the other. These patterns reflect the contrast between tuples having different values of the measure attribute.

One could argue that interpretable classifiers such as decision trees and rule-based methods (see, e.g., [12]) can also be used for contrast-based data exploration. These methods identify patterns of values of the feature attributes that have high discriminative power in terms of the class variable (in our case, the binary measure attribute). These patterns are therefore likely to provide contrast as well. However, classification algorithms focus on out-of-sample predictive power and include optimizations such as rule pruning to avoid overfitting. On the other hand, the methods covered in this survey focus explicitly on identifying a concise set of interesting fragments of the data for user exploration.

Contrast-based methods can also explain the results of aggregate queries. Consider the following query over Table 1: SELECT SUM(Late) FROM S WHERE Full=1. Here, the numeric measure attribute corresponds to the quantity being aggregated. We then set the binary measure attribute to one for all tuples that participate in the query (i.e., tuples that match the WHERE predicate), and we select patterns of tuples that contribute to the result of the query (but would not contribute had the query been issued against the other tuples in the dataset).

Below, we explain the pattern ranking strategies used by contrast-based methods, with Table 1 as a running example.

Risk ratio is used by Macrobase; a related metric called Diagnosis Cost is used by Data X-ray. It is the ratio of the following two probabilities: 1) the probability that a tuple covered by the given pattern has a particular value of the measure attribute, and 2) the probability that a tuple outside this pattern has this particular value. In our example,

risk_{Full=1}(p) = ΞΈ_{Full=1}(p) / [ (sup_{Full=1}(*,*,*) βˆ’ sup_{Full=1}(p)) / ((sup_{Full=1}(*,*,*) βˆ’ sup_{Full=1}(p)) + (sup_{Full=0}(*,*,*) βˆ’ sup_{Full=0}(p))) ].

For instance, risk_{Full=1}(*, *, London) = 0.75 / 0.4 = 1.875, whereas risk_{Full=1}(*, SF, *) = 0.5 / 0.5 = 1. This indicates that (*, *, London) represents full flights better than (*, SF, *).

Mean shift (supported by DIFF) computes the ratio of the means of the numeric measure attribute values co-occurring with the two values of the binary measure attribute. In our example,

meanshift_{Late}(p) = (sum_{Late, Full=1}(p) / sup_{Full=1}(p)) / (sum_{Late, Full=0}(p) / sup_{Full=0}(p)).

For instance, meanshift_{Late}(*, *, London) = (54/3) / (7/1) = 2.57, indicating that full flights to London have delays that are 2.57 times longer than non-full flights to London.

Intervention is used by RSExplain; a related metric called Influence is used by Scorpion. It measures the ratio of contributions towards the numeric measure attribute for tuples occurring with the different values of the binary measure attribute, after the tuples matching the pattern are removed. In our example,

intervention_{Late}(p) = (sum_{Late, Full=1}(*,*,*) βˆ’ sum_{Late, Full=1}(p)) / (sum_{Late, Full=0}(*,*,*) βˆ’ sum_{Late, Full=0}(p)).

For instance, intervention_{Late}(*, *, London) = (108 βˆ’ 54) / (37 βˆ’ 7) = 1.8. In other words, if flights to London were removed from the dataset, then full flights would have delays on average 1.8 times longer than non-full flights. On the other hand, intervention_{Late}(*, SF, *) = (108 βˆ’ 35) / (37 βˆ’ 12) = 2.92, meaning that removing flights departing from SF from the dataset would create a greater contrast between the delays of full and non-full flights.

We also point out CAPE, whose goal is to find patterns whose measure attribute values counterbalance those of the pattern given as input. For example, the average flight delay in Table 1 is 10.4, but it is only 5.2 for tuples covered by (Mon, *, *). To counterbalance this low value, CAPE suggests related patterns such as (Fri, *, *) and (Sat, *, *), whose average delays are higher.
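Using the sup and sum notation from Section 2, the three contrast metrics above can be sketched as follows. This is a toy illustration over Table 1, not code from any of the surveyed systems:

```python
# Table 1 as (Day, Origin, Dest., Late, Full).
FLIGHTS = [
    ("Fri", "SF", "London", 20, 1), ("Fri", "London", "LA", 16, 1),
    ("Sun", "Tokyo", "Frankfurt", 10, 1), ("Sun", "Chicago", "London", 15, 1),
    ("Sat", "Beijing", "Frankfurt", 13, 1), ("Sat", "Frankfurt", "London", 19, 1),
    ("Tue", "Chicago", "LA", 5, 0), ("Wed", "London", "Chicago", 6, 0),
    ("Thu", "SF", "Frankfurt", 15, 1), ("Mon", "Beijing", "SF", 4, 0),
    ("Mon", "SF", "London", 7, 0), ("Mon", "SF", "Frankfurt", 5, 0),
    ("Mon", "Tokyo", "Beijing", 6, 0), ("Mon", "Frankfurt", "Tokyo", 4, 0),
]
ALL = ("*", "*", "*")

def sup(p, full):
    """Number of tuples matching p with the given Full value."""
    return sum(1 for t in FLIGHTS if t[4] == full
               and all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p)))

def sum_late(p, full):
    """Total delay of tuples matching p with the given Full value."""
    return sum(t[3] for t in FLIGHTS if t[4] == full
               and all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p)))

def risk_ratio(p):
    """theta_{Full=1}(p) vs. the fraction of tuples outside p that are full."""
    theta = sup(p, 1) / (sup(p, 1) + sup(p, 0))
    out_full = sup(ALL, 1) - sup(p, 1)
    out_all = out_full + (sup(ALL, 0) - sup(p, 0))
    return theta / (out_full / out_all)

def mean_shift(p):
    """Mean delay of full vs. non-full flights matching p."""
    return (sum_late(p, 1) / sup(p, 1)) / (sum_late(p, 0) / sup(p, 0))

def intervention(p):
    """Delay contribution of full vs. non-full flights after removing p's tuples."""
    return (sum_late(ALL, 1) - sum_late(p, 1)) / (sum_late(ALL, 0) - sum_late(p, 0))

london = ("*", "*", "London")
print(risk_ratio(london), mean_shift(london), intervention(london))
# risk ratio ~1.875, mean shift ~2.57, intervention ~1.8
```

The outputs reproduce the worked examples in the text for (*, *, London).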
3.2.1 Performance Optimizations. Contrast-based methods output patterns whose scores (e.g., risk ratios) are above a user-supplied threshold. A general optimization used by DIFF is to also require a minimum support threshold for the extracted patterns. This enables pruning optimizations similar to those used by the Apriori algorithm for frequent itemset mining [3]. For example, if (Wed, *, *) fails to satisfy the minimum support threshold, then all its subsets, such as (Wed, London, *), can be ignored.

3.2.2 Pros and Cons. By design, contrast-based methods are useful when exploring differences between data subsets – something that coverage-based methods do not implicitly optimize for. On the other hand, these methods may not guarantee concise results. One exception is Data X-ray, which performs a set-cover-like operation on the extracted patterns as a post-processing step to eliminate redundant patterns.

3.3 Methods Based on Information
Finally, we discuss three methods that select patterns based on the information they provide about the distribution of the measure attribute: Explanation Tables [7], SURPRISE [16] and Shrink [10].

Table 3: An explanation table of size four for the binary measure attribute Full

Day  Origin  Dest.      Full
*    *       *          0.5
Mon  *       *          0
*    *       London     0.75
*    *       Frankfurt  0.75

Table 4: An explanation table of size four for the numeric measure attribute Late

Day  Origin  Dest.   Late
*    *       *       10.4
*    *       London  15.3
Fri  *       *       18
Sat  *       *       16

Table 5: A summary of size two considered by Shrink [10]

Day              Origin  Dest.  Late
Fri,Sun,Sat,Thu  *       *      15.4
Tue,Wed,Mon      *       *      5.3

Table 3 shows an explanation table of size four (i.e., containing four patterns) for the binary measure attribute Full based on Table 1. In addition to values of the dimension attributes, each explanation table pattern also includes the fraction of matching tuples that have Full = 1. The first pattern in an explanation table is always the all-stars pattern, and, in this example, it also indicates that half the flights in the entire dataset are full. The next pattern suggests that no flights on Mondays are full, and the last two patterns indicate that three-quarters of flights to London and Frankfurt are full.

In Table 4, we show an explanation table of size four for the measure attribute Late based on Table 1. Here, each pattern includes the average value of Late across its matching tuples. Again, we start with the all-stars pattern, which states that flights are 10.4 minutes late on average. The next pattern indicates that flights to London are 15.3 minutes late on average, and so on.

The greedy heuristic for constructing explanation tables used in [6–8] iteratively selects the pattern that contains the most information about the distribution of the measure attribute. To do so, the algorithm maintains a maximum-entropy estimate of the distribution based on the patterns that have been added to the explanation table so far. It also keeps track of the Kullback-Leibler (KL) divergence between this estimated distribution and the true distribution. To quantify the information contained in a candidate pattern p, the algorithm computes the reduction in KL-divergence if p were to be added to the explanation table.

Returning to Table 3, the greedy algorithm starts by inserting the pattern (*, *, * || 0.5). At this point, knowing only this one piece of information, the maximum-entropy estimate of the distribution of Full is to assign Full = 0.5 to every tuple in Table 1.^2 Next, it turns out that (Mon, *, * || 0) gives the greatest reduction in KL-divergence. Based on this new pattern, the maximum-entropy estimate for tuples 10 through 14 in Table 1 changes to Full = 0. This revision causes the estimates of the first nine tuples to change (from 0.5 to 7/9) in order to maintain consistency with the first pattern, which asserts that Full = 0.5 on average over the entire table. Given the updated maximum-entropy estimate, the next pattern with the greatest reduction in KL-divergence is (*, *, London || 0.75), and so on.

^2 Full is a binary attribute that can only be zero or one. However, for the purpose of measuring the divergence between the true and the estimated distributions, the maximum-entropy estimates are allowed to be real numbers between zero and one.

Similar reasoning can explain how Table 4 was created. The first pattern asserts that flights are late by 10.4 minutes on average. Given this estimate, (*, *, London || 15.3) provides the most information about the distribution of Late. The maximum-entropy estimate of Late is now updated accordingly. That is, tuples 1, 4, 6 and 11, corresponding to flights to London, receive an estimate of 15.3, and the remaining tuples receive an estimate of 8.4 to maintain consistency with the first pattern. The next most-informative pattern is then selected, and so on.

SURPRISE is a similar method, whose goal is to identify surprising fragments of a dataset where the distribution of the measure attribute is different than what the user has seen so far. Suppose the user queries Table 1 and finds out that flights are 10.4 minutes late on average. SURPRISE finds the most informative non-overlapping and contained patterns, i.e., those which lead to the greatest reduction in KL-divergence between the true distribution of Late and the maximum-entropy estimated distribution. Restricting the output to such patterns makes it easier to update the estimated distribution. In our example, the most informative pattern is (*, *, London) and its most informative subset is (*, SF, London).

Finally, we discuss Shrink, which produces k non-overlapping patterns that summarize the distribution of the measure attribute with (approximately) minimal sum of squared errors. Shrink allows patterns with disjunctions of values and dimension hierarchies, and uses a greedy approach in which cells in the data cube are merged to produce patterns with the smallest squared errors. Table 5 shows an example of a summary with two patterns that may be considered by Shrink based on Table 1. Given the average values of the Late measure attribute reported by the summary, the sum of squared errors is 77.1.

3.3.1 Performance Optimizations. In contrast-based methods, the "goodness" of a pattern usually remains static throughout the execution of the algorithm. On the other hand, the "goodness", i.e., information, of a candidate explanation table pattern is relative to the current maximum-entropy estimate of the measure attribute. This means that a previously un-informative pattern may become informative as more patterns are added to the explanation table. As a result, candidate patterns cannot be easily pruned. Instead, the explanation table construction algorithm in [6–8] uses sampling to reduce the space of candidate patterns. In every iteration of the greedy algorithm, a random sample is drawn, and the set of candidate patterns corresponds to only those patterns that have a non-zero support in the sample. The intuition is that patterns with frequently occurring combinations of dimension attribute values are likely to be sampled and also likely to contain information about the distribution of the measure attribute.

3.3.2 Pros and Cons. By design, information-based methods produce informative patterns that highlight fragments of the data with surprising distributions of the measure attribute. However, these methods tend to be expensive, especially as the number of dimension attributes grows.
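As a sanity check on the Shrink example above, the sum of squared errors of the Table 5 summary can be recomputed directly. This is a small illustrative script, not part of Shrink itself:

```python
# Late values from Table 1, keyed by Day.
late_by_day = {
    "Fri": [20, 16], "Sun": [10, 15], "Sat": [13, 19], "Thu": [15],
    "Tue": [5], "Wed": [6], "Mon": [4, 7, 5, 6, 4],
}
# The two disjunctive Day patterns from Table 5.
summary = [{"Fri", "Sun", "Sat", "Thu"}, {"Tue", "Wed", "Mon"}]

sse = 0.0
for days in summary:
    values = [v for d in days for v in late_by_day[d]]
    mean = sum(values) / len(values)  # ~15.4 and ~5.3, as reported in Table 5
    sse += sum((v - mean) ** 2 for v in values)

print(round(sse, 1))  # 77.1
```

The two group means come out to 108/7 β‰ˆ 15.4 and 37/7 β‰ˆ 5.3, and the total squared error of 77.1 matches the figure quoted in the text.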
4 CONCLUSIONS AND OPEN PROBLEMS
In this paper, we surveyed recent data exploration methods that extract interesting or informative fragments of the data, represented as patterns over the dimension attributes. We categorized these methods according to the properties of patterns they select. Below, we offer suggestions for future work in this area.

Benchmarks: A performance comparison of contrast-based methods implemented within the DIFF framework appears in [2]. In terms of effectiveness, prior work reports that methods based on information provide more information about the distribution of the measure attribute than coverage-based methods [6, 7]; similarly, methods based on contrast provide more precise outlier explanations than methods based on coverage [19]. Some approaches were also evaluated through user studies against simple baselines [14]. An interesting direction for future work is to develop a comprehensive benchmark to highlight the effectiveness of various types of methods in various applications.

New applications: Popular motivating applications that guided the development of prior work were outlier and data error analysis, as well as query result explanation. Recent interest in explainable AI motivates further studies on exploring the behaviour of black-box machine learning models such as neural networks using multi-dimensional patterns, as was suggested in [7]. Since deep learning methods have been successful in the context of unstructured data such as text, images and graphs, future research should investigate new ways of formulating interpretable patterns over these high-dimensional unstructured datasets.

Correlated measure attributes: Another characteristic of prior work is that it usually formulates exploration problems involving a single measure attribute. Pattern-based exploration of multiple measure attributes is an interesting area for future work.

Feature reduction: In terms of performance and scalability, the large number of possible patterns remains a challenge for many methods, especially those based on information, which cannot leverage Apriori-like pruning strategies. This is an important challenge for interactive methods that allow users to continuously issue new exploration tasks. As a result, some techniques such as DIFF limit the number of dimension attributes for use in patterns and discard redundant dimension attributes such as those functionally determined by other attributes. Distributed versions of some methods, including DIFF [2] and Explanation Tables [8], have also been proposed to parallelize the search for interesting patterns. In machine learning, there exists a variety of dimension reduction methods such as Principal Component Analysis (PCA) and word embeddings. However, these methods are not known for being interpretable, and thus their suitability for pattern-based exploration requires further study.

Bringing order to dimension attributes: Much of the previous work considers categorical dimension attributes. However, there exist methods for covering a multi-dimensional dataset using hyper-rectangles corresponding to intervals over numeric dimension attributes [13], there exists a method to cover data anomalies using intervals over numeric features [21], and explanation tables have recently been extended to support ordinal and numeric dimension attributes [18]. These extensions further increase the space of candidate patterns and require additional performance optimizations. For example, returning to Table 1, the Day attribute may lead to additional patterns with ranges or intervals such as ([Mon-Fri], *, *) or ([Sat-Sun], *, *). Techniques used for constructing optimal histograms and optimal decision trees may help to discover these types of patterns.

Exploring data evolution: Finally, recent work motivates the need for tools to explore how data (and metadata) change over time [4]. Here, patterns may summarize fragments of the data that have changed recently or are updated often.

REFERENCES
[1] F. Abuzaid, P. Bailis, J. Ding, E. Gan, S. Madden, D. Narayanan, K. Rong, S. Suri: MacroBase: Prioritizing Attention in Fast Data. ACM Trans. Database Syst. 43(4): 15:1-15:45 (2018)
[2] F. Abuzaid, P. Kraft, S. Suri, E. Gan, E. Xu, A. Shenoy, A. Ananthanarayan, J. Sheu, E. Meijer, X. Wu, J. F. Naughton, P. Bailis, M. Zaharia: DIFF: A Relational Interface for Large-Scale Data Explanation. PVLDB 12(4): 419-432 (2018)
[3] R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
[4] T. Bleifuss, L. Bornemann, T. Johnson, D. V. Kalashnikov, F. Naumann, D. Srivastava: Exploring Change - A New Dimension of Data Analytics. PVLDB 12(2): 85-98 (2018)
[5] M. Das, S. Amer-Yahia, G. Das, C. Yu: MRI: Meaningful Interpretations of Collaborative Ratings. PVLDB 4(11): 1063-1074 (2011)
[6] K. El Gebaly, P. Agrawal, L. Golab, F. Korn, D. Srivastava: Interpretable and Informative Explanations of Outcomes. PVLDB 8(1): 61-72 (2014)
[7] K. El Gebaly, G. Feng, L. Golab, F. Korn, D. Srivastava: Explanation Tables. IEEE Data Eng. Bull. 41(3): 43-51 (2018)
[8] G. Feng, L. Golab, D. Srivastava: Scalable Informative Rule Mining. ICDE 2017: 437-448
[9] L. Golab, H. J. Karloff, F. Korn, D. Srivastava: Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux. PVLDB 3(2): 1641-1644 (2010)
[10] M. Golfarelli, S. Graziani, S. Rizzi: Shrink: An OLAP operation for balancing precision and size of pivot tables. Data Knowl. Eng. 93: 19-41 (2014)
[11] M. Joglekar, H. Garcia-Molina, A. G. Parameswaran: Interactive data exploration with smart drill-down. ICDE 2016: 906-917
[12] H. Lakkaraju, S. H. Bach, J. Leskovec: Interpretable Decision Sets: A Joint Framework for Description and Prediction. KDD 2016: 1675-1684
[13] L. V. S. Lakshmanan, R. T. Ng, C. X. Wang, X. Zhou, T. Johnson: The Generalized MDL Approach for Summarization. VLDB 2002: 766-777
[14] Z. Miao, Q. Zeng, C. Li, B. Glavic, O. Kennedy, S. Roy: CAPE: Explaining Outliers by Counterbalancing. PVLDB 12(12): 1806-1809 (2019)
[15] S. Roy, D. Suciu: A formal approach to finding explanations for database queries. SIGMOD Conference 2014: 1579-1590
[16] S. Sarawagi: User-cognizant multidimensional analysis. VLDB J. 10(2-3): 224-239 (2001)
[17] P. Vassiliadis, P. Marcel: The Road to Highlights is Paved with Good Intentions: Envisioning a Paradigm Shift in OLAP Modeling. DOLAP 2018
[18] M. Vollmer, L. Golab, K. BΓΆhm, D. Srivastava: Informative Summarization of Numeric Data. SSDBM 2019: 97-108
[19] X. Wang, X. L. Dong, A. Meliou: Data X-Ray: A Diagnostic Tool for Data Errors. SIGMOD Conference 2015: 1231-1245
[20] E. Wu, S. Madden: Scorpion: Explaining Away Outliers in Aggregate Queries. PVLDB 6(8): 553-564 (2013)
[21] H. Zhang, Y. Diao, A. Meliou: EXstream: Explaining Anomalies in Event Stream Monitoring. EDBT 2017: 156-167