=Paper=
{{Paper
|id=Vol-2840/short7
|storemode=property
|title=Exploring Data Using Patterns: A Survey and Open Problems
|pdfUrl=https://ceur-ws.org/Vol-2840/short7.pdf
|volume=Vol-2840
|authors=Lukasz Golab,Divesh Srivastava
|dblpUrl=https://dblp.org/rec/conf/dolap/GolabS21
}}
==Exploring Data Using Patterns: A Survey and Open Problems==
Lukasz Golab, University of Waterloo, Canada (lgolab@uwaterloo.ca)
Divesh Srivastava, AT&T Chief Data Office, USA (divesh@att.com)
ABSTRACT

We present a survey of data exploration methods that extract multi-dimensional patterns from datasets consisting of dimension and measure attributes. These patterns are designed to summarize common properties of tuples associated with particular values of the measure attributes. We provide a categorization of the characteristics of patterns produced by various solutions to this problem, we point out the pros, cons and performance optimizations of existing methods, and we suggest directions for future research.

KEYWORDS

Data exploration, Data summarization, Data explanation, Data cube, Pattern mining

1 INTRODUCTION

Data volumes have been growing rapidly in recent years. As a result, data-intensive methods are now common in many contexts, including business, science, and public governance. This motivates the need for tools that allow users who are not necessarily data management experts to explore large datasets. Such tools range from visualization and aggregation to flexible search interfaces such as keyword search in structured databases.

In this paper, we focus on the exploration of datasets containing dimension attributes and binary or numeric measure attributes. In traditional business datasets, dimension attributes often describe products or employees, and measure attributes indicate sales totals or salaries. In Internet-of-Things (IoT) and infrastructure monitoring, dimension attributes may describe device properties and measure attributes correspond to performance statistics. In Web datasets, dimension attributes may describe products, with user ratings as measure attributes. Additionally, in any of these applications, derived measure attributes may exist, e.g., a binary attribute denoting whether a given record was determined to be an outlier or to contain an error.

The data cube has traditionally been used to explore these kinds of datasets, by allowing users to aggregate, roll-up and drill-down using various subsets of group-by attributes. However, in large-scale databases, the data cube may be very large and may not immediately reveal interesting patterns and trends. This motivates the need for more sophisticated exploration tools [17].

We observe that a large body of recent work on exploring multi-dimensional and OLAP datasets proposes methods to identify interesting fragments of the data, summarized using patterns over the values of the dimension attributes. We make the following contributions towards an understanding of these data exploration methods.

(1) We survey recent work on data exploration using multi-dimensional patterns and propose a categorization based on the properties of patterns suggested for exploration: coverage, contrast, and information.

(2) We analyze the pros, cons and performance optimizations of existing work, and we suggest open problems for future research.

© Copyright 2021 for this paper held by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 BACKGROUND

We are given a dataset S with a set D of dimension attributes and a set M of measure attributes (also referred to as outcomes in some prior work [6]). Let D1, D2, ..., Dd be the d dimension attributes and let M1, M2, ..., Mm be the m measure attributes. For now, we assume, as in the majority of previous work, that the dimension attributes are categorical, and we will comment on ordinal and numeric dimension attributes in Section 4. Measure attributes may be binary or numeric.

Table 1 shows a flight dataset that will serve as a running example. For each flight, the dataset includes a record id, followed by three dimension attributes, Day of the week, flight Origin and flight Destination, as well as two measure attributes, a numeric attribute denoting how late the flight was and a binary attribute denoting whether the flight was full.

Let dom(Di) be the active domain of the i-th dimension attribute. A pattern p (referred to as a rule in some prior work [8]) is a tuple from (dom(D1) ∪ {*}) × ... × (dom(Dd) ∪ {*}), i.e., from the data cube over the dimension attributes, with '*' denoting all possible values of that attribute. A tuple t ∈ S matches p, denoted by t ⊨ p, if p[Dj] = '*' or t[Dj] = p[Dj] for each dimension attribute Dj. For example, tuple 4 from Table 1 matches the patterns (*, *, *) and (*, *, London), but does not match the pattern (Fri, *, *). Some approaches [10] support richer patterns, with disjunctions and dimension hierarchies.

Let sup(p) be the support of p in S, i.e., the number of tuples matching p, and let sup_c(p) be the number of tuples matching p and satisfying the predicate c over the measure attributes. For example, sup(*, *, London) = 4 and sup_{Full=0}(*, *, London) = 1. Furthermore, let f_c(p) = sup_c(p) / sup(p), which is the fraction of tuples matching p that also satisfy the predicate c. For example, f_{Full=1}(*, *, London) = 3/4.

Let sum^{Mj}(p) be the sum of the values of a measure attribute Mj over all the tuples matching p. Let sum^{Mj}_c(p) be the sum of the values of a measure attribute Mj over all the tuples matching p and satisfying the predicate c over the measure attributes. For example, sum^{Late}(*, *, London) = 20 + 15 + 19 + 7 = 61 and sum^{Late}_{Full=0}(*, *, London) = 7.

We survey solutions to the following data exploration problem: given a dataset S, produce a set or a list of patterns P over the dimension attributes of S, as defined above, that summarize common properties of tuples sharing the same values of the measure attribute(s). Users may then inspect the patterns and explore tuples covered by the patterns. They may then extract patterns corresponding to a smaller subset of the data that was found to be interesting during the earlier exploration step, and so on.

The number of patterns in P should be limited to direct the user's attention to the most important or interesting regions of the data. This limit may be set explicitly by the user (in terms of the maximum number of patterns in P) or implicitly by retrieving the fewest possible patterns that jointly satisfy some property, such as covering some fraction of the data.
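To make these definitions concrete, here is a small Python sketch (our own illustration, not code from any of the surveyed systems; the function names are ours) that evaluates sup, sup_c, f_c and sum^{Mj}_c over the flight data of Table 1:

```python
# Table 1 as (Day, Origin, Dest., Late, Full) tuples.
FLIGHTS = [
    ("Fri", "SF", "London", 20, 1), ("Fri", "London", "LA", 16, 1),
    ("Sun", "Tokyo", "Frankfurt", 10, 1), ("Sun", "Chicago", "London", 15, 1),
    ("Sat", "Beijing", "Frankfurt", 13, 1), ("Sat", "Frankfurt", "London", 19, 1),
    ("Tue", "Chicago", "LA", 5, 0), ("Wed", "London", "Chicago", 6, 0),
    ("Thu", "SF", "Frankfurt", 15, 1), ("Mon", "Beijing", "SF", 4, 0),
    ("Mon", "SF", "London", 7, 0), ("Mon", "SF", "Frankfurt", 5, 0),
    ("Mon", "Tokyo", "Beijing", 6, 0), ("Mon", "Frankfurt", "Tokyo", 4, 0),
]

def matches(t, p):
    """t matches p iff every non-star value in p equals the value in t."""
    return all(pv == "*" or pv == tv for pv, tv in zip(p, t))

def sup(p, c=lambda t: True):
    """sup_c(p): number of tuples matching p that satisfy predicate c."""
    return sum(1 for t in FLIGHTS if matches(t[:3], p) and c(t))

def f(p, c):
    """f_c(p) = sup_c(p) / sup(p)."""
    return sup(p, c) / sup(p)

def total_late(p, c=lambda t: True):
    """sum^{Late}_c(p): sum of Late over tuples matching p that satisfy c."""
    return sum(t[3] for t in FLIGHTS if matches(t[:3], p) and c(t))

london = ("*", "*", "London")
print(sup(london))                               # 4
print(sup(london, lambda t: t[4] == 0))          # 1
print(f(london, lambda t: t[4] == 1))            # 0.75
print(total_late(london))                        # 61
print(total_late(london, lambda t: t[4] == 0))   # 7
```

The printed values match the running example: sup(*, *, London) = 4, sup_{Full=0}(*, *, London) = 1, f_{Full=1}(*, *, London) = 3/4, sum^{Late}(*, *, London) = 61 and sum^{Late}_{Full=0}(*, *, London) = 7.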
Table 1: A flight dataset

id   Day   Origin      Dest.       Late   Full
1    Fri   SF          London      20     1
2    Fri   London      LA          16     1
3    Sun   Tokyo       Frankfurt   10     1
4    Sun   Chicago     London      15     1
5    Sat   Beijing     Frankfurt   13     1
6    Sat   Frankfurt   London      19     1
7    Tue   Chicago     LA          5      0
8    Wed   London      Chicago     6      0
9    Thu   SF          Frankfurt   15     1
10   Mon   Beijing     SF          4      0
11   Mon   SF          London      7      0
12   Mon   SF          Frankfurt   5      0
13   Mon   Tokyo       Beijing     6      0
14   Mon   Frankfurt   Tokyo       4      0

Table 2: Methods surveyed

Method                   Approach      Applications
CAPE [14]                Contrast      Explaining queries
Data Auditor [9]         Coverage      Data quality analysis
Data X-ray [19]          Contrast      Data quality analysis
DIFF [2]                 Contrast      Outlier analysis
Explanation tables [7]   Information   Feature selection
Macrobase [1]            Contrast      Outlier analysis
MRI [5]                  Coverage      Explaining queries
RSExplain [15]           Contrast      Explaining queries
Scorpion [20]            Contrast      Outlier analysis
Shrink [10]              Information   Explaining queries
Smart Drilldown [11]     Coverage      Explaining queries
SURPRISE [16]            Information   Explaining queries

This data exploration problem has the following applications.

• Explaining the results of aggregate queries. Suppose a data analyst issues the following query over Table 1: SELECT SUM(Late) FROM S. Suppose the analyst wishes to understand why the result of 145 is so high. Here, interesting patterns are those which cover tuples that make a significant contribution to the result, i.e., those with a high sum^{Late}(), such as (*, *, London). The analyst may then zoom into flights landing in London and investigate potential reasons for the lengthy delays.

• Analyzing outliers and data quality issues. Suppose we have a binary measure attribute denoting whether a given tuple contains an error or is an outlier. This attribute could be created manually by domain experts or automatically by identifying tuples that violate data quality rules or deviate from the expected distribution. We may wish to produce patterns that summarize the properties of erroneous tuples to help determine the root cause of data quality problems.

• Feature selection and explainable AI. Before building prediction models, a data scientist may explore interesting patterns to understand which dimension attributes are related to the measure attribute that is to be predicted. Furthermore, suppose a data analyst wants to understand how a black-box model makes classification decisions. Here, the dimension attributes are the features given to the model as input, and, as the measure attribute, the analyst records the predictions made by the model. The analyst may then want to find interesting patterns that explain the prediction decisions. For example, in Table 1, the pattern (Mon, *, *) is associated with tuples having Full = 0, suggesting that flights scheduled on Mondays are usually not full.¹

¹ Model explanations may be global (to summarize how classification decisions are made) or local (to explain why a specific record was classified in a particular way). The explanations discussed in this paper are examples of global explanations.

3 SURVEY

In this section, we provide a categorization of previous work on data exploration using multi-dimensional patterns based on the pattern properties and ranking strategies used for pattern selection. We categorize these properties into three types: those focusing on coverage, contrast and information. Table 2 categorizes the surveyed methods and lists their motivating applications, as mentioned in the corresponding papers.

3.1 Methods Based on Coverage

The goal of these methods is to identify patterns that cover tuples with certain values of the measure attribute. We discuss three coverage-based methods: Data Auditor [9], MRI [5], and Smart Drilldown [11].

Suppose we are interested in covering tuples having Full = 1 in Table 1, as a way to summarize the characteristics of full flights. A simple coverage-oriented approach is to sort the patterns according to sup_{Full=1}() and output the top-ranking patterns. Ignoring (*, *, *), which always covers everything but is not informative, the top candidates are (*, *, London) and (*, *, Frankfurt), which cover three full flights each, followed by the following patterns that cover two such tuples each: (Fri, *, *), (Sun, *, *), (Sat, *, *) and (*, SF, *).

There are two problems with this simple approach: there may be many patterns with a nonzero sup_{Full=1}(p), and some of these patterns may also cover tuples with other values of the measure attribute (here, Full = 0). To reduce the size of the output and to ensure that each pattern co-occurs with the specified value of the measure attribute, Data Auditor solves the following set cover problem. Continuing with our example, Data Auditor requires a minimum threshold for f_{Full=1}(p), i.e., the fraction of tuples covered by p that correspond to full flights. Suppose we require f_c(p) ≥ 0.75. This threshold defines the candidate sets for the set cover problem. The set cover objective is to select the fewest such patterns that together cover a specified fraction of tuples in S having Full = 1. Suppose this coverage fraction, which would again be set by the user, is 0.5. The goal is then to find the fewest patterns, as defined above, to cover half the full flights.

The set cover problem is NP-hard, and Data Auditor uses the standard greedy heuristic that achieves a logarithmic approximation ratio in the size of the solution: it iteratively chooses the pattern that covers the most uncovered tuples (having the desired value of the measure attribute), until the desired fraction of such tuples has been covered. In our example, the first pattern added to P is either (*, *, London) or (*, *, Frankfurt): they each cover three full flights and their f_{Full=1}(p) values are 0.75 each. Suppose the set cover algorithm selects (*, *, Frankfurt). Since there are seven full flights in the dataset and our coverage threshold is 0.5, we need to cover one more full flight. In the next iteration, the pattern that (has f_{Full=1}(p) ≥ 0.75 and) covers the most remaining full flights is (*, *, London), and the algorithm terminates, with P consisting of these two patterns.
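The greedy set cover procedure just described can be sketched in a few lines of Python. This is our own illustrative reimplementation of the running example, not Data Auditor's actual code; it enumerates candidate patterns with f_{Full=1}(p) ≥ 0.75 and greedily covers half of the full flights:

```python
from itertools import product

# Table 1 as (Day, Origin, Dest., Late, Full) tuples.
FLIGHTS = [
    ("Fri", "SF", "London", 20, 1), ("Fri", "London", "LA", 16, 1),
    ("Sun", "Tokyo", "Frankfurt", 10, 1), ("Sun", "Chicago", "London", 15, 1),
    ("Sat", "Beijing", "Frankfurt", 13, 1), ("Sat", "Frankfurt", "London", 19, 1),
    ("Tue", "Chicago", "LA", 5, 0), ("Wed", "London", "Chicago", 6, 0),
    ("Thu", "SF", "Frankfurt", 15, 1), ("Mon", "Beijing", "SF", 4, 0),
    ("Mon", "SF", "London", 7, 0), ("Mon", "SF", "Frankfurt", 5, 0),
    ("Mon", "Tokyo", "Beijing", 6, 0), ("Mon", "Frankfurt", "Tokyo", 4, 0),
]

def matches(t, p):
    return all(pv == "*" or pv == tv for pv, tv in zip(p, t))

# Candidate sets: patterns from the data cube with f_{Full=1}(p) >= 0.75
# that cover at least one full flight.
domains = [sorted({t[i] for t in FLIGHTS}) + ["*"] for i in range(3)]
candidates = []
for p in product(*domains):
    covered = [t for t in FLIGHTS if matches(t[:3], p)]
    full = sum(t[4] for t in covered)
    if full > 0 and full / len(covered) >= 0.75:
        candidates.append(p)

# Greedy set cover: repeatedly pick the candidate covering the most
# still-uncovered full flights, until half of the 7 full flights are covered.
target = 0.5 * sum(t[4] for t in FLIGHTS)
solution, covered_ids = [], set()
while len(covered_ids) < target:
    best = max(candidates, key=lambda p: sum(
        1 for i, t in enumerate(FLIGHTS)
        if t[4] == 1 and i not in covered_ids and matches(t[:3], p)))
    solution.append(best)
    covered_ids |= {i for i, t in enumerate(FLIGHTS)
                    if t[4] == 1 and matches(t[:3], best)}

print(sorted(solution))  # [('*', '*', 'Frankfurt'), ('*', '*', 'London')]
```

As in the worked example, only (*, *, London) and (*, *, Frankfurt) cover three full flights each while passing the 0.75 threshold, so the greedy loop stops after selecting exactly those two patterns.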
The next method, MRI, finds k patterns that cover a user-specified fraction of the data and satisfy additional properties related to the variance of the measure attribute within each pattern. The motivating example for MRI was to explain queries over product reviews, with the dimension attributes corresponding to information about the reviewers (such as their gender and age) and the numeric measure attribute corresponding to the average rating. Minimizing variance amounts to returning patterns (having high coverage and) describing reviewers with similar opinions. For example, when applying MRI to Table 1 with Late as the measure attribute, the pattern (*, *, Frankfurt) is preferred over (*, *, London). Both patterns cover four tuples, but the former has a lower variance of the Late attribute within the covered tuples.

Smart Drilldown also combines coverage with additional pattern properties. Smart Drilldown produces k patterns. In every iteration of the algorithm, the chosen pattern maximizes the following objective: the number of tuples not yet covered multiplied by the weight of the pattern, corresponding to some measure of interestingness. One simple weighting function proposed in [11] is the number of non-star values in the pattern (i.e., more specific patterns are considered to be more desirable). In Table 1, for example, this weighting function prefers (Tue, Chicago, LA) over (Tue, *, *): both of these patterns have a support of one, but the former has more non-star values.

3.1.1 Performance Optimizations. A performance bottleneck in general set cover problems results from having to keep track of the number of uncovered elements that can be covered by the candidate sets; this changes in every iteration, whenever a new set is added to the solution. Data Auditor relies on the hierarchical nature of patterns and does not generate all the patterns that serve as input to greedy set cover a priori. Instead, patterns are generated on demand, only after all of their supersets have already been considered. For example, a pattern such as (Fri, *, London) would only be generated after all of its supersets, including (Fri, *, *) and (*, *, London), have already been considered. Until then, (Fri, *, London) can be safely ignored: it cannot cover more elements than its supersets and therefore is guaranteed not to be selected by the greedy set cover heuristic at this time. This optimization can greatly reduce the number of generated patterns whose coverage (of uncovered elements) must be re-computed while constructing the solution.

3.1.2 Pros and Cons. One advantage of coverage-based methods is conciseness: by design, they identify the fewest possible patterns that cover the desired fraction of tuples of interest. On the other hand, in Data Auditor, some trial-and-error may be required on the user's part to select good values for the two required thresholds. For example, a high value of f_c will ensure that each pattern co-occurs mainly with the specified value of the measure attribute, but will disqualify more patterns from consideration. Similarly, MRI and Smart Drilldown require users to set parameters for coverage and other pattern properties.

3.2 Methods Based on Contrast

This is the largest group of methods, consisting of CAPE [14], Data X-ray [19], DIFF [2], Macrobase [1], RSExplain [15] and Scorpion [20]. (DIFF is a recent solution that generalizes many of these related methods.) Contrast-based methods usually assume a binary measure attribute, and select patterns co-occurring with one value of the measure attribute but not the other. These patterns reflect the contrast between tuples having different values of the measure attribute.

One could argue that interpretable classifiers such as decision trees and rule-based methods (see, e.g., [12]) can also be used for contrast-based data exploration. These methods identify patterns of values of the feature attributes that have high discriminative power in terms of the class variable (in our case, the binary measure attribute). These patterns are therefore likely to provide contrast as well. However, classification algorithms focus on out-of-sample predictive power and include optimizations such as rule pruning to avoid overfitting. On the other hand, the methods covered in this survey focus explicitly on identifying a concise set of interesting fragments of the data for user exploration.

Contrast-based methods can also explain the results of aggregate queries. Consider the following query over Table 1: SELECT SUM(Late) FROM S WHERE Full=1. Here, the measure attribute corresponds to the quantity being aggregated. We then set the binary measure attribute to one for all tuples that participate in the query (i.e., tuples that match the WHERE predicate), and we select patterns of tuples that contribute to the result of the query (but would not contribute had the query been issued against the other tuples in the dataset).

Below, we explain the pattern ranking strategies used by the contrast-based methods, with Table 1 as a running example.

Risk ratio is used by Macrobase; a related metric called Diagnosis Cost is used by Data X-ray. It is the ratio of the following two probabilities: 1) the probability that a tuple with a particular value is covered by the given pattern, and 2) the probability that a tuple with this particular value occurs outside this pattern. In our example,

risk_{Full=1}(p) = f_{Full=1}(p) / [ (sup_{Full=1}(*, *, *) - sup_{Full=1}(p)) / ( (sup_{Full=1}(*, *, *) - sup_{Full=1}(p)) + (sup_{Full=0}(*, *, *) - sup_{Full=0}(p)) ) ].

For instance, risk_{Full=1}(*, *, London) = 0.75 / 0.4 = 1.875, whereas risk_{Full=1}(*, SF, *) = 0.5 / 0.5 = 1. This indicates that (*, *, London) represents full flights better than (*, SF, *).

Mean shift (supported by DIFF) computes the ratio of the mean of the measure attribute values co-occurring with the two values of the binary measure attribute. In our example, mean^{Late}_{Full=1}(p) = (sum^{Late}_{Full=1}(p) / sup_{Full=1}(p)) / (sum^{Late}_{Full=0}(p) / sup_{Full=0}(p)). For instance, mean^{Late}_{Full=1}(*, *, London) = (54/3) / (7/1) = 2.57, indicating that full flights to London have delays that are 2.57 times longer than non-full flights to London.

Intervention is used by RSExplain; a related metric called Influence is used by Scorpion. It measures the ratio of contribution towards the numeric measure attribute for tuples occurring with the different values of the binary measure attribute. In our example, intervention^{Late}_{Full=1}(p) = (sum^{Late}_{Full=1}(*, *, *) - sum^{Late}_{Full=1}(p)) / (sum^{Late}_{Full=0}(*, *, *) - sum^{Late}_{Full=0}(p)). For instance, intervention^{Late}_{Full=1}(*, *, London) = (108 - 54) / (37 - 7) = 1.8. In other words, if flights to London were removed from the dataset, then full flights would have delays on average 1.8 times longer than non-full flights. On the other hand, intervention^{Late}_{Full=1}(*, SF, *) = (108 - 35) / (37 - 12) = 2.92, meaning that removing flights departing from SF from the dataset would create a greater contrast between the delays of full and non-full flights.

We also point out CAPE, whose goal is to find patterns whose measure attribute values counterbalance those of the pattern given as input. For example, the average flight delay in Table 1 is 10.4, but it is only 5.2 for tuples covered by (Mon, *, *). To counterbalance this low value, CAPE suggests related patterns such as (Fri, *, *) and (Sat, *, *), whose average delays are higher.
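The three ranking metrics can be reproduced on the running example with the following sketch (our own illustrative code, not taken from any of the surveyed systems; it assumes the pattern covers tuples with both values of Full, as in the examples above):

```python
# Table 1 as (Day, Origin, Dest., Late, Full) tuples.
FLIGHTS = [
    ("Fri", "SF", "London", 20, 1), ("Fri", "London", "LA", 16, 1),
    ("Sun", "Tokyo", "Frankfurt", 10, 1), ("Sun", "Chicago", "London", 15, 1),
    ("Sat", "Beijing", "Frankfurt", 13, 1), ("Sat", "Frankfurt", "London", 19, 1),
    ("Tue", "Chicago", "LA", 5, 0), ("Wed", "London", "Chicago", 6, 0),
    ("Thu", "SF", "Frankfurt", 15, 1), ("Mon", "Beijing", "SF", 4, 0),
    ("Mon", "SF", "London", 7, 0), ("Mon", "SF", "Frankfurt", 5, 0),
    ("Mon", "Tokyo", "Beijing", 6, 0), ("Mon", "Frankfurt", "Tokyo", 4, 0),
]

def matches(t, p):
    return all(pv == "*" or pv == tv for pv, tv in zip(p, t))

def stats(p):
    """Counts and Late sums over tuples matching p, split by Full."""
    cov = [t for t in FLIGHTS if matches(t[:3], p)]
    n1 = sum(1 for t in cov if t[4] == 1)
    n0 = len(cov) - n1
    s1 = sum(t[3] for t in cov if t[4] == 1)
    s0 = sum(t[3] for t in cov if t[4] == 0)
    return n1, n0, s1, s0

N1, N0, S1, S0 = stats(("*", "*", "*"))  # dataset-wide totals

def risk_ratio(p):
    n1, n0, _, _ = stats(p)
    inside = n1 / (n1 + n0)                        # fraction full inside p
    outside = (N1 - n1) / ((N1 - n1) + (N0 - n0))  # fraction full outside p
    return inside / outside

def mean_shift(p):
    n1, n0, s1, s0 = stats(p)
    return (s1 / n1) / (s0 / n0)

def intervention(p):
    _, _, s1, s0 = stats(p)
    return (S1 - s1) / (S0 - s0)

london, sf = ("*", "*", "London"), ("*", "SF", "*")
print(risk_ratio(london), risk_ratio(sf))      # 1.875 1.0
print(round(mean_shift(london), 2))            # 2.57
print(intervention(london), intervention(sf))  # 1.8 2.92
```

The outputs match the worked examples above: risk ratio 1.875 vs. 1, mean shift 2.57, and intervention 1.8 vs. 2.92.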
3.2.1 Performance Optimizations. Contrast-based methods output patterns whose scores (e.g., risk ratios) are above a user-supplied threshold. A general optimization used by DIFF is to also require a minimum support threshold for the extracted patterns. This enables pruning optimizations similar to those used by the Apriori algorithm for frequent itemset mining [3]. For example, if (Sun, *, *) fails to satisfy the minimum support threshold, then all its subsets, such as (Sun, London, *), can be ignored.

3.2.2 Pros and Cons. By design, contrast-based methods are useful when exploring differences between data subsets, something that coverage-based methods do not implicitly optimize for. On the other hand, these methods may not guarantee concise results. One exception is Data X-ray, which performs a set-cover-like operation on the extracted patterns as a post-processing step to eliminate redundant patterns.

3.3 Methods Based on Information

Finally, we discuss three methods that select patterns based on the information they provide about the distribution of the measure attribute: Explanation Tables [7], SURPRISE [16] and Shrink [10].

Table 3: An explanation table of size four for the binary measure attribute Full

Day   Origin   Dest.       Full
*     *        *           0.5
Mon   *        *           0
*     *        London      0.75
*     *        Frankfurt   0.75

Table 4: An explanation table of size four for the numeric measure attribute Late

Day   Origin   Dest.    Late
*     *        *        10.4
*     *        London   15.3
Fri   *        *        18
Sat   *        *        16

Table 5: A summary of size two considered by Shrink [10]

Day               Origin   Dest.   Late
Fri,Sun,Sat,Thu   *        *       15.4
Tue,Wed,Mon       *        *       5.3

Table 3 shows an explanation table of size four (i.e., containing four patterns) for the binary measure attribute Full based on Table 1. In addition to values of the dimension attributes, each explanation table pattern also includes the fraction of matching tuples that have Full = 1. The first pattern in an explanation table is always the all-stars pattern, and, in this example, it also indicates that half the flights in the entire dataset are full. The next pattern suggests that no flights on Mondays are full, and the last two patterns indicate that three-quarters of flights to London and Frankfurt are full.

In Table 4, we show an explanation table of size four for the measure attribute Late based on Table 1. Here, each pattern includes the average value of Late across its matching tuples. Again, we start with the all-stars pattern, which states that flights are 10.4 minutes late on average. The next pattern indicates that flights to London are 15.3 minutes late on average, and so on.

The greedy heuristic for constructing explanation tables used in [6-8] iteratively selects the pattern that contains the most information about the distribution of the measure attribute. To do so, the algorithm maintains a maximum-entropy estimate of the distribution based on the patterns that have been added to the explanation table so far. It also keeps track of the Kullback-Leibler (KL) divergence between this estimated distribution and the true distribution. To quantify the information contained in a candidate pattern p, the algorithm computes the reduction in KL-divergence if p were to be added to the explanation table.

Returning to Table 3, the greedy algorithm starts by inserting the pattern (*, *, * || 0.5). At this point, knowing only this one piece of information, the maximum-entropy estimate of the distribution of Full is to assign Full = 0.5 to every tuple in Table 1.² Next, it turns out that (Mon, *, * || 0) gives the greatest reduction in KL-divergence. Based on this new pattern, the maximum-entropy estimate for tuples 10 through 14 in Table 1 changes to Full = 0. This revision causes the estimates of the first nine tuples to change (from 0.5 to 7/9) in order to maintain consistency with the first pattern, which asserts that Full = 0.5 on average over the entire table. Given the updated maximum-entropy estimate, the next pattern with the greatest reduction in KL-divergence is (*, *, London || 0.75), and so on.

Similar reasoning can explain how Table 4 was created. The first pattern asserts that flights are late by 10.4 minutes on average. Given this estimate, (*, *, London || 15.3) provides the most information about the distribution of Late. The maximum-entropy estimate of Late is now updated accordingly. That is, tuples 1, 4, 6 and 11, corresponding to flights to London, receive an estimate of 15.3, and the remaining tuples receive an estimate of 8.4 to maintain consistency with the first pattern. The next most-informative pattern is then selected, and so on.

SURPRISE is a similar method, whose goal is to identify surprising fragments of a dataset where the distribution of the measure attribute is different than what the user has seen so far. Suppose the user queries Table 1 and finds out that flights are 10.4 minutes late on average. SURPRISE finds the most informative non-overlapping and contained patterns, i.e., those which lead to the greatest reduction in KL-divergence between the true distribution of Late and the maximum-entropy estimated distribution. Restricting the output to such patterns makes it easier to update the estimated distribution. In our example, the most informative pattern is (*, *, London) and its most informative subset is (*, SF, London).

Finally, we discuss Shrink, which produces k non-overlapping patterns that summarize the distribution of the measure attribute with (approximately) minimal sum of squared errors. Shrink allows patterns with disjunctions of values and dimension hierarchies, and uses a greedy approach in which cells in the data cube are merged to produce patterns with the smallest squared errors. Table 5 shows an example of a summary with two patterns that may be considered by Shrink based on Table 1. Given the average values of the Late measure attribute reported by the summary, the sum of squared errors is 77.1.

² Full is a binary attribute that can only be zero or one. However, for the purpose of measuring the divergence between the true and the estimated distributions, the maximum-entropy estimates are allowed to be real numbers between zero and one.
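The estimate update in the Table 3 example can be checked with a short script. This is our own sketch: it hard-codes the maximum-entropy estimates described in the text rather than implementing the full algorithm of [6-8], and it measures the KL-divergence per tuple, treating each tuple's Full value as a Bernoulli variable:

```python
from math import log

# True Full values for tuples 1..14 of Table 1.
FULL = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
MON = set(range(9, 14))  # 0-based indices of the Monday tuples (ids 10..14)

def kl(estimates):
    """KL-divergence between the true 0/1 values and the per-tuple
    estimates, summed over tuples (with 0 * log(0) taken as 0)."""
    total = 0.0
    for y, e in zip(FULL, estimates):
        total += -log(e) if y == 1 else -log(1 - e)
    return total

# After inserting (*, *, * || 0.5): every tuple is estimated as 0.5.
est1 = [0.5] * 14
# After inserting (Mon, *, * || 0): the five Monday tuples drop to 0 and
# the other nine rise to 7/9, keeping the overall average at 0.5 (= 7/14).
est2 = [0.0 if i in MON else 7 / 9 for i in range(14)]

print(round(kl(est1), 3))  # 9.704
print(round(kl(est2), 3))  # 4.767
```

The drop from 9.704 to 4.767 is the reduction in KL-divergence that makes (Mon, *, * || 0) the most informative second pattern in this worked example.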
3.3.1 Performance Optimizations. In contrast-based methods, the "goodness" of a pattern usually remains static throughout the execution of the algorithm. On the other hand, the "goodness", i.e., information, of a candidate explanation table pattern is relative to the current maximum-entropy estimate of the measure attribute. This means that a previously un-informative pattern may become informative as more patterns are added to the explanation table. As a result, candidate patterns cannot be easily pruned. Instead, the explanation table construction algorithm in [6-8] uses sampling to reduce the space of candidate patterns. In every iteration of the greedy algorithm, a random sample is drawn, and the set of candidate patterns corresponds to only those patterns that have a non-zero support in the sample. The intuition is that patterns with frequently occurring combinations of dimension attribute values are likely to be sampled and also likely to contain information about the distribution of the measure attribute.

3.3.2 Pros and Cons. By design, information-based methods produce informative patterns that highlight fragments of the data with surprising distributions of the measure attribute. However, these methods tend to be expensive, especially as the number of dimension attributes grows.

4 CONCLUSIONS AND OPEN PROBLEMS

In this paper, we surveyed recent data exploration methods that extract interesting or informative fragments of the data, represented as patterns over the dimension attributes. We categorized these methods according to the properties of patterns they select. Below, we offer suggestions for future work in this area.

Benchmarks: A performance comparison of contrast-based methods implemented within the DIFF framework appears in [2]. In terms of effectiveness, prior work reports that methods based on information provide more information about the distribution of the measure attribute than coverage-based methods [6, 7]; similarly, methods based on contrast provide more precise outlier explanations than methods based on coverage [19]. Some approaches were also evaluated through user studies against simple baselines [14]. An interesting direction for future work is to develop a comprehensive benchmark to highlight the effectiveness of various types of methods in various applications.

New applications: Popular motivating applications that guided the development of prior work were outlier and data error analysis, as well as query result explanation. Recent interest in explainable AI motivates further studies on exploring the behaviour of black-box machine learning models such as neural networks using multi-dimensional patterns, as was suggested in [7]. Since deep learning methods have been successful in the context of unstructured data such as text, images and graphs, future research should investigate new ways of formulating interpretable patterns over these high-dimensional unstructured datasets.

Correlated measure attributes: Another characteristic of prior work is that it usually formulates exploration problems involving a single measure attribute. Pattern-based exploration of multiple measure attributes is an interesting area for future work.

Feature reduction: In terms of performance and scalability, the large number of possible patterns remains a challenge for many methods, especially those based on information, which cannot leverage Apriori-like pruning strategies. This is an important challenge for interactive methods that allow users to continuously issue new exploration tasks. As a result, some techniques such as DIFF limit the number of dimension attributes for use in patterns and discard redundant dimension attributes such as those functionally determined by other attributes. Distributed versions of some methods, including DIFF [2] and Explanation Tables [8], have also been proposed to parallelize the search for interesting patterns. In machine learning, there exists a variety of dimension reduction methods such as Principal Component Analysis (PCA) and word embeddings. However, these methods are not known for being interpretable, and thus their suitability for pattern-based exploration requires further study.

Bringing order to dimension attributes: Much of the previous work considers categorical dimension attributes. However, there exist methods for covering a multi-dimensional dataset using hyper-rectangles corresponding to intervals over numeric dimension attributes [13], there exists a method to cover data anomalies using intervals over numeric features [21], and explanation tables have recently been extended to support ordinal and numeric dimension attributes [18]. These extensions further increase the space of candidate patterns and require additional performance optimizations. For example, returning to Table 1, the Day attribute may lead to additional patterns with ranges or intervals such as ([Mon-Fri], *, *) or ([Sat-Sun], *, *). Techniques used for constructing optimal histograms and optimal decision trees may help to discover these types of patterns.

Exploring data evolution: Finally, recent work motivates the need for tools to explore how data (and metadata) change over time [4]. Here, patterns may summarize fragments of the data that have changed recently or are updated often.

REFERENCES

[1] F. Abuzaid, P. Bailis, J. Ding, E. Gan, S. Madden, D. Narayanan, K. Rong, S. Suri: MacroBase: Prioritizing Attention in Fast Data. ACM Trans. Database Syst. 43(4): 15:1-15:45 (2018)
[2] F. Abuzaid, P. Kraft, S. Suri, E. Gan, E. Xu, A. Shenoy, A. Ananthanarayan, J. Sheu, E. Meijer, X. Wu, J. F. Naughton, P. Bailis, M. Zaharia: DIFF: A Relational Interface for Large-Scale Data Explanation. PVLDB 12(4): 419-432 (2018)
[3] R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
[4] T. Bleifuss, L. Bornemann, T. Johnson, D. V. Kalashnikov, F. Naumann, D. Srivastava: Exploring Change - A New Dimension of Data Analytics. PVLDB 12(2): 85-98 (2018)
[5] M. Das, S. Amer-Yahia, G. Das, C. Yu: MRI: Meaningful Interpretations of Collaborative Ratings. PVLDB 4(11): 1063-1074 (2011)
[6] K. El Gebaly, P. Agrawal, L. Golab, F. Korn, D. Srivastava: Interpretable and Informative Explanations of Outcomes. PVLDB 8(1): 61-72 (2014)
[7] K. El Gebaly, G. Feng, L. Golab, F. Korn, D. Srivastava: Explanation Tables. IEEE Data Eng. Bull. 41(3): 43-51 (2018)
[8] G. Feng, L. Golab, D. Srivastava: Scalable Informative Rule Mining. ICDE 2017: 437-448
[9] L. Golab, H. J. Karloff, F. Korn, D. Srivastava: Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux. PVLDB 3(2): 1641-1644 (2010)
[10] M. Golfarelli, S. Graziani, S. Rizzi: Shrink: An OLAP operation for balancing precision and size of pivot tables. Data Knowl. Eng. 93: 19-41 (2014)
[11] M. Joglekar, H. Garcia-Molina, A. G. Parameswaran: Interactive data exploration with smart drill-down. ICDE 2016: 906-917
[12] H. Lakkaraju, S. H. Bach, J. Leskovec: Interpretable Decision Sets: A Joint Framework for Description and Prediction. KDD 2016: 1675-1684
[13] L. V. S. Lakshmanan, R. T. Ng, C. X. Wang, X. Zhou, T. Johnson: The Generalized MDL Approach for Summarization. VLDB 2002: 766-777
[14] Z. Miao, Q. Zeng, C. Li, B. Glavic, O. Kennedy, S. Roy: CAPE: Explaining Outliers by Counterbalancing. PVLDB 12(12): 1806-1809 (2019)
[15] S. Roy, D. Suciu: A formal approach to finding explanations for database queries. SIGMOD 2014: 1579-1590
[16] S. Sarawagi: User-cognizant multidimensional analysis. VLDB J. 10(2-3): 224-239 (2001)
[17] P. Vassiliadis, P. Marcel: The Road to Highlights is Paved with Good Intentions: Envisioning a Paradigm Shift in OLAP Modeling. DOLAP 2018
[18] M. Vollmer, L. Golab, K. Böhm, D. Srivastava: Informative Summarization of Numeric Data. SSDBM 2019: 97-108
[19] X. Wang, X. L. Dong, A. Meliou: Data X-Ray: A Diagnostic Tool for Data Errors. SIGMOD 2015: 1231-1245
[20] E. Wu, S. Madden: Scorpion: Explaining Away Outliers in Aggregate Queries. PVLDB 6(8): 553-564 (2013)
[21] H. Zhang, Y. Diao, A. Meliou: EXstream: Explaining Anomalies in Event Stream Monitoring. EDBT 2017: 156-167