Exploring Data Using Patterns: A Survey and Open Problems

Lukasz Golab, University of Waterloo, Canada (lgolab@uwaterloo.ca)
Divesh Srivastava, AT&T Chief Data Office, USA (divesh@att.com)

ABSTRACT
We present a survey of data exploration methods that extract multi-dimensional patterns from datasets consisting of dimension and measure attributes. These patterns are designed to summarize common properties of tuples associated with particular values of the measure attributes. We provide a categorization of the characteristics of patterns produced by various solutions to this problem, we point out the pros, cons and performance optimizations of existing methods, and we suggest directions for future research.

KEYWORDS
Data exploration, Data summarization, Data explanation, Data cube, Pattern mining

1 INTRODUCTION
Data volumes have been growing rapidly in recent years. As a result, data-intensive methods are now common in many contexts, including business, science, and public governance. This motivates the need for tools that allow users who are not necessarily data management experts to explore large datasets. Such tools range from visualization and aggregation to flexible search interfaces such as keyword search in structured databases.

In this paper, we focus on the exploration of datasets containing dimension attributes and binary or numeric measure attributes. In traditional business datasets, dimension attributes often describe products or employees, and measure attributes indicate sales totals or salaries. In Internet-of-Things (IoT) and infrastructure monitoring, dimension attributes may describe device properties and measure attributes correspond to performance statistics. In Web datasets, dimension attributes may describe products, with user ratings as measure attributes. Additionally, in any of these applications, derived measure attributes may exist, e.g., a binary attribute denoting whether a given record was determined to be an outlier or to contain an error.

The data cube has traditionally been used to explore these kinds of datasets, by allowing users to aggregate, roll-up and drill-down using various subsets of group-by attributes. However, in large-scale databases, the data cube may be very large and may not immediately reveal interesting patterns and trends. This motivates the need for more sophisticated exploration tools [17].

We observe that a large body of recent work on exploring multi-dimensional and OLAP datasets proposes methods to identify interesting fragments of the data, summarized using patterns over the values of the dimension attributes. We make the following contributions towards an understanding of these data exploration methods.

(1) We survey recent work on data exploration using multi-dimensional patterns and propose a categorization based on the properties of patterns suggested for exploration: coverage, contrast, and information.
(2) We analyze the pros, cons and performance optimizations of existing work, and we suggest open problems for future research.

Β© Copyright 2021 for this paper held by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 BACKGROUND
We are given a dataset S with a set D of dimension attributes and a set M of measure attributes (also referred to as outcomes in some prior work [6]). Let D_1, D_2, ..., D_d be the d dimension attributes and let M_1, M_2, ..., M_m be the m measure attributes. For now, we assume, as in the majority of previous work, that the dimension attributes are categorical, and we will comment on ordinal and numeric dimension attributes in Section 4. Measure attributes may be binary or numeric.

Table 1 shows a flight dataset that will serve as a running example. For each flight, the dataset includes a record id, followed by three dimension attributes, Day of the week, flight Origin and flight Destination, as well as two measure attributes, a numeric attribute denoting how late the flight was and a binary attribute denoting whether the flight was full.

Table 1: A flight dataset

id  Day  Origin     Dest.      Late  Full
1   Fri  SF         London     20    1
2   Fri  London     LA         16    1
3   Sun  Tokyo      Frankfurt  10    1
4   Sun  Chicago    London     15    1
5   Sat  Beijing    Frankfurt  13    1
6   Sat  Frankfurt  London     19    1
7   Tue  Chicago    LA          5    0
8   Wed  London     Chicago     6    0
9   Thu  SF         Frankfurt  15    1
10  Mon  Beijing    SF          4    0
11  Mon  SF         London      7    0
12  Mon  SF         Frankfurt   5    0
13  Mon  Tokyo      Beijing     6    0
14  Mon  Frankfurt  Tokyo       4    0

Table 2: Methods surveyed

Method                  Approach     Applications
CAPE [14]               Contrast     Explaining queries
Data Auditor [9]        Coverage     Data quality analysis
Data X-ray [19]         Contrast     Data quality analysis
DIFF [2]                Contrast     Outlier analysis
Explanation tables [7]  Information  Feature selection
Macrobase [1]           Contrast     Outlier analysis
MRI [5]                 Coverage     Explaining queries
RSExplain [15]          Contrast     Explaining queries
Scorpion [20]           Contrast     Outlier analysis
Shrink [10]             Information  Explaining queries
Smart Drilldown [11]    Coverage     Explaining queries
SURPRISE [16]           Information  Explaining queries

Let dom(D_i) be the active domain of the i-th dimension attribute. A pattern p (referred to as a rule in some prior work [8]) is a tuple from (dom(D_1) βˆͺ {*}) Γ— Β· Β· Β· Γ— (dom(D_d) βˆͺ {*}), i.e., from the data cube over the dimension attributes, with '*' denoting all possible values of that attribute. A tuple t ∈ S matches p, denoted by t ≍ p, if p[D_j] = '*' or t[D_j] = p[D_j] for each dimension attribute D_j. For example, tuple 4 from Table 1 matches the patterns (*, *, *) and (*, *, London), but does not match the pattern (Fri, *, *). Some approaches [10] support richer patterns, with disjunctions and dimension hierarchies.

Let sup(p) be the support of p in S, i.e., the number of tuples matching p, and let sup_r(p) be the number of tuples matching p and satisfying the predicate r over the measure attributes. For example, sup(*, *, London) = 4 and sup_{Full=0}(*, *, London) = 1. Furthermore, let ΞΈ_r(p) = sup_r(p) / sup(p), which is the fraction of tuples matching p that also satisfy the predicate r. For example, ΞΈ_{Full=1}(*, *, London) = 3/4.

Let sum_{M_i}(p) be the sum of the values of a measure attribute M_i over all the tuples matching p, and let sum_{M_i, r}(p) be the sum of the values of M_i over all the tuples matching p and satisfying the predicate r over the measure attributes. For example, sum_{Late}(*, *, London) = 20 + 15 + 19 + 7 = 61 and sum_{Late, Full=0}(*, *, London) = 7.

We survey solutions to the following data exploration problem: given a dataset S, produce a set or a list of patterns P over the dimension attributes of S, as defined above, that summarize common properties of tuples sharing the same values of the measure attribute(s). Users may then inspect the patterns and explore tuples covered by the patterns. They may then extract patterns corresponding to a smaller subset of the data that was found to be interesting during the earlier exploration step, and so on.

The number of patterns in P should be limited to direct the user's attention to the most important or interesting regions of the data. This limit may be set explicitly by the user (in terms of the maximum number of patterns in P) or implicitly by retrieving the fewest possible patterns that jointly satisfy some property, such as covering some fraction of the data.
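The definitions above are straightforward to operationalize. The following minimal sketch (the helper names are ours, not from any surveyed system; the data is transcribed from Table 1) computes sup, sup_r, ΞΈ_r and sum for the running example:

```python
# Table 1 as (Day, Origin, Dest., Late, Full) tuples.
FLIGHTS = [
    ("Fri", "SF", "London", 20, 1), ("Fri", "London", "LA", 16, 1),
    ("Sun", "Tokyo", "Frankfurt", 10, 1), ("Sun", "Chicago", "London", 15, 1),
    ("Sat", "Beijing", "Frankfurt", 13, 1), ("Sat", "Frankfurt", "London", 19, 1),
    ("Tue", "Chicago", "LA", 5, 0), ("Wed", "London", "Chicago", 6, 0),
    ("Thu", "SF", "Frankfurt", 15, 1), ("Mon", "Beijing", "SF", 4, 0),
    ("Mon", "SF", "London", 7, 0), ("Mon", "SF", "Frankfurt", 5, 0),
    ("Mon", "Tokyo", "Beijing", 6, 0), ("Mon", "Frankfurt", "Tokyo", 4, 0),
]

def matches(t, p):
    """t matches p iff every non-star component of p equals t's dimension value."""
    return all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p))

def sup(p, r=lambda t: True):
    """sup_r(p): number of tuples matching p that satisfy predicate r."""
    return sum(1 for t in FLIGHTS if matches(t, p) and r(t))

def theta(p, r):
    """theta_r(p): fraction of tuples matching p that satisfy r."""
    return sup(p, r) / sup(p)

def sum_measure(p, m, r=lambda t: True):
    """sum_{M,r}(p): sum of measure column index m over tuples matching p satisfying r."""
    return sum(t[m] for t in FLIGHTS if matches(t, p) and r(t))

london = ("*", "*", "London")
full = lambda t: t[4] == 1
print(sup(london), theta(london, full), sum_measure(london, 3))  # 4 0.75 61
```

Running this reproduces the examples from the text: sup(*, *, London) = 4, ΞΈ_{Full=1}(*, *, London) = 0.75, and sum_{Late}(*, *, London) = 61.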
This data exploration problem has the following applications.

β€’ Explaining the results of aggregate queries. Suppose a data analyst issues the following query over Table 1: SELECT SUM(Late) FROM S. Suppose the analyst wishes to understand why the result, of 145, is so high. Here, interesting patterns are those which cover tuples that make a significant contribution to the result, i.e., those with a high sum_{Late}(), such as (*, *, London). The analyst may then zoom into flights landing in London and investigate potential reasons for the lengthy delays.

β€’ Analyzing outliers and data quality issues. Suppose we have a binary measure attribute denoting whether a given tuple contains an error or is an outlier. This attribute could be created manually by domain experts or automatically by identifying tuples that violate data quality rules or deviate from the expected distribution. We may wish to produce patterns that summarize the properties of erroneous tuples to help determine the root cause of data quality problems.

β€’ Feature selection and explainable AI. Before building prediction models, a data scientist may explore interesting patterns to understand which dimension attributes are related to the measure attribute that is to be predicted. Furthermore, suppose a data analyst wants to understand how a black-box model makes classification decisions. Here, the dimension attributes are the features given to the model as input, and, as the measure attribute, the analyst records the predictions made by the model. The analyst may then want to find interesting patterns that explain the prediction decisions. For example, in Table 1, the pattern (Mon, *, *) is associated with tuples having Full = 0, suggesting that flights scheduled on Mondays are usually not full.^1

^1 Model explanations may be global (to summarize how classification decisions are made) or local (to explain why a specific record was classified in a particular way). The explanations discussed in this paper are examples of global explanations.

3 SURVEY
In this section, we provide a categorization of previous work on data exploration using multi-dimensional patterns based on the pattern properties and ranking strategies used for pattern selection. We categorize these properties into three types: those focusing on coverage, contrast and information. Table 2 categorizes the surveyed methods and lists their motivating applications, as mentioned in the corresponding papers.

3.1 Methods Based on Coverage
The goal of these methods is to identify patterns that cover tuples with certain values of the measure attribute. We discuss three coverage-based methods: Data Auditor [9], MRI [5], and Smart Drilldown [11].

Suppose we are interested in covering tuples having Full = 1 in Table 1, as a way to summarize the characteristics of full flights. A simple coverage-oriented approach is to sort the patterns according to sup_{Full=1}() and output the top-ranking patterns. Ignoring (*, *, *), which always covers everything but is not informative, the top candidates are (*, *, London) and (*, *, Frankfurt), which cover three full flights each, followed by the following patterns that cover two such tuples each: (Fri, *, *), (Sun, *, *), (Sat, *, *) and (*, SF, *).

There are two problems with this simple approach: there may be many patterns with a nonzero sup_{Full=1}(p), and some of these patterns may also cover tuples with other values of the measure attribute (here, Full = 0). To reduce the size of the output and to ensure that each pattern co-occurs with the specified value of the measure attribute, Data Auditor solves the following set cover problem. Continuing with our example, Data Auditor requires a minimum threshold for ΞΈ_{Full=1}(p), i.e., the fraction of tuples covered by p that correspond to full flights; suppose we require ΞΈ_{Full=1}(p) β‰₯ 0.75. This threshold defines the candidate sets for the set cover problem. The set cover objective is then to select the fewest such patterns that together cover a specified fraction of tuples in S having Full = 1. Suppose this coverage fraction, which would again be set by the user, is 0.5. The goal is then to find the fewest patterns, as defined above, that cover half the full flights.

The set cover problem is NP-hard, and Data Auditor uses the standard greedy heuristic that achieves a logarithmic approximation ratio in the size of the solution: it iteratively chooses the pattern that covers the most uncovered tuples (having the desired value of the measure attribute), until the desired fraction of such tuples has been covered. In our example, the first pattern added to P is either (*, *, London) or (*, *, Frankfurt): they each cover three full flights, and their ΞΈ_{Full=1}(p) values are 0.75 each. Suppose the set cover algorithm selects (*, *, Frankfurt). Since there are seven full flights in the dataset and our coverage threshold is 0.5, we need to cover one more full flight. In the next iteration, the pattern that (has ΞΈ_{Full=1}(p) β‰₯ 0.75 and) covers the most remaining full flights is (*, *, London), and the algorithm terminates, with P consisting of these two patterns.
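The greedy loop above can be sketched as follows. This is a toy reimplementation for the running example, not Data Auditor's actual code; enumerating candidate patterns by star-masking each tuple's dimension values, and the thresholds 0.75 and 0.5, are taken from the example above:

```python
from itertools import product

# Table 1 as (Day, Origin, Dest., Full); Late is not needed here.
FLIGHTS = [
    ("Fri", "SF", "London", 1), ("Fri", "London", "LA", 1),
    ("Sun", "Tokyo", "Frankfurt", 1), ("Sun", "Chicago", "London", 1),
    ("Sat", "Beijing", "Frankfurt", 1), ("Sat", "Frankfurt", "London", 1),
    ("Tue", "Chicago", "LA", 0), ("Wed", "London", "Chicago", 0),
    ("Thu", "SF", "Frankfurt", 1), ("Mon", "Beijing", "SF", 0),
    ("Mon", "SF", "London", 0), ("Mon", "SF", "Frankfurt", 0),
    ("Mon", "Tokyo", "Beijing", 0), ("Mon", "Frankfurt", "Tokyo", 0),
]

def matching(p):
    """All tuples matched by pattern p."""
    return [t for t in FLIGHTS
            if all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p))]

# Candidate patterns: every star-mask of every tuple's dimension values.
patterns = sorted({tuple(v if keep else "*" for v, keep in zip(t[:3], mask))
                   for t in FLIGHTS for mask in product([False, True], repeat=3)})

def greedy_cover(theta_min=0.75, coverage=0.5):
    full = {t for t in FLIGHTS if t[3] == 1}
    # Candidate sets: patterns whose matching tuples are mostly full flights.
    cands = [p for p in patterns
             if sum(t[3] for t in matching(p)) / len(matching(p)) >= theta_min]
    chosen, uncovered = [], set(full)
    while len(full) - len(uncovered) < coverage * len(full):
        # Pick the pattern covering the most still-uncovered full flights.
        best = max(cands, key=lambda p: len(uncovered & set(matching(p))))
        chosen.append(best)
        uncovered -= set(matching(best))
        cands.remove(best)
    return chosen

print(greedy_cover())  # [('*', '*', 'Frankfurt'), ('*', '*', 'London')]
```

Since Python's max() returns the first maximal candidate and the candidates are sorted, (*, *, Frankfurt) breaks the tie with (*, *, London) in the first iteration, mirroring the walkthrough above; the second iteration then adds (*, *, London) and covers six of the seven full flights.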
The next method, MRI, finds k patterns that cover a user-specified fraction of the data and satisfy additional properties related to the variance of the measure attribute within each pattern. The motivating example for MRI was to explain queries over product reviews, with the dimension attributes corresponding to information about the reviewers (such as their gender and age) and the numeric measure attribute corresponding to the average rating. Minimizing variance amounts to returning patterns (having high coverage and) describing reviewers with similar opinions. For example, when applying MRI to Table 1 with Late as the measure attribute, the pattern (*, *, Frankfurt) is preferred over (*, *, London). Both patterns cover four tuples, but the former has a lower variance of the Late attribute within the covered tuples.

Smart Drilldown also combines coverage with additional pattern properties. Smart Drilldown produces k patterns. In every iteration of the algorithm, the chosen pattern maximizes the following objective: the number of tuples not yet covered multiplied by the weight of the pattern, corresponding to some measure of interestingness. One simple weighting function proposed in [11] is the number of non-star values in the pattern (i.e., more specific patterns are considered to be more desirable). In Table 1, for example, this weighting function prefers (Tue, Chicago, LA) over (Tue, *, *): both of these patterns have a support of one, but the former has more non-star values.

3.1.1 Performance Optimizations. A performance bottleneck in general set cover problems results from having to keep track of the number of uncovered elements that can be covered by the candidate sets; this changes in every iteration, whenever a new set is added to the solution. Data Auditor relies on the hierarchical nature of patterns and does not generate all the patterns that serve as input to greedy set cover a priori. Instead, patterns are generated on demand, only after all of their supersets have already been considered. For example, a pattern such as (Fri, *, London) would only be generated after all of its supersets, including (Fri, *, *) and (*, *, London), have already been considered. Until then, (Fri, *, London) can be safely ignored: it cannot cover more elements than its supersets and therefore is guaranteed to not be selected by the greedy set cover heuristic at this time. This optimization can greatly reduce the number of generated patterns whose coverage (of uncovered elements) must be re-computed while constructing the solution.

3.1.2 Pros and Cons. One advantage of coverage-based methods is conciseness: by design, they identify the fewest possible patterns that cover the desired fraction of tuples of interest. On the other hand, in Data Auditor, some trial-and-error may be required on the user's part to select good values for the two required thresholds. For example, a high value of ΞΈ will ensure that each pattern co-occurs mainly with the specified value of the measure attribute, but will disqualify more patterns from consideration. Similarly, MRI and Smart Drilldown require users to set parameters for coverage and other pattern properties.

3.2 Methods Based on Contrast
This is the largest group of methods, consisting of CAPE [14], Data X-ray [19], DIFF [2], Macrobase [1], RSExplain [15] and Scorpion [20]. (DIFF is a recent solution that generalizes many of these related methods.) Contrast-based methods usually assume a binary measure attribute, and select patterns co-occurring with one value of the measure attribute but not the other. These patterns reflect the contrast between tuples having different values of the measure attribute.

One could argue that interpretable classifiers such as decision trees and rule-based methods (see, e.g., [12]) can also be used for contrast-based data exploration. These methods identify patterns of values of the feature attributes that have high discriminative power in terms of the class variable (in our case, the binary measure attribute). These patterns are therefore likely to provide contrast as well. However, classification algorithms focus on out-of-sample predictive power and include optimizations such as rule pruning to avoid overfitting. On the other hand, the methods covered in this survey focus explicitly on identifying a concise set of interesting fragments of the data for user exploration.

Contrast-based methods can also explain the results of aggregate queries. Consider the following query over Table 1: SELECT SUM(Late) FROM S WHERE Full=1. Here, the numeric measure attribute corresponds to the quantity being aggregated. We then set the binary measure attribute to one for all tuples that participate in the query (i.e., tuples that match the WHERE predicate), and we select patterns of tuples that contribute to the result of the query (but would not contribute had the query been issued against the other tuples in the dataset).

Below, we explain the pattern ranking strategies used by contrast-based methods, with Table 1 as a running example.

Risk ratio is used by Macrobase; a related metric called Diagnosis Cost is used by Data X-ray. It is the ratio of the following two probabilities: 1) the probability that a tuple covered by the given pattern has a particular value of the measure attribute, and 2) the probability that a tuple outside this pattern has this particular value. In our example,

risk_{Full=1}(p) = ΞΈ_{Full=1}(p) / [ (sup_{Full=1}(*,*,*) βˆ’ sup_{Full=1}(p)) / ((sup_{Full=1}(*,*,*) βˆ’ sup_{Full=1}(p)) + (sup_{Full=0}(*,*,*) βˆ’ sup_{Full=0}(p))) ].

For instance, risk_{Full=1}(*, *, London) = 0.75 / 0.4 = 1.875, whereas risk_{Full=1}(*, SF, *) = 0.5 / 0.5 = 1. This indicates that (*, *, London) represents full flights better than (*, SF, *).

Mean shift (supported by DIFF) computes the ratio of the means of the numeric measure attribute values co-occurring with the two values of the binary measure attribute. In our example,

meanshift_{Late}(p) = (sum_{Late, Full=1}(p) / sup_{Full=1}(p)) / (sum_{Late, Full=0}(p) / sup_{Full=0}(p)).

For instance, meanshift_{Late}(*, *, London) = (54/3) / (7/1) = 2.57, indicating that full flights to London have delays that are 2.57 times longer than non-full flights to London.

Intervention is used by RSExplain; a related metric called Influence is used by Scorpion. It measures the ratio of contributions towards the numeric measure attribute for tuples occurring with the different values of the binary measure attribute, after the tuples matching the pattern are removed. In our example,

intervention_{Late}(p) = (sum_{Late, Full=1}(*,*,*) βˆ’ sum_{Late, Full=1}(p)) / (sum_{Late, Full=0}(*,*,*) βˆ’ sum_{Late, Full=0}(p)).

For instance, intervention_{Late}(*, *, London) = (108 βˆ’ 54) / (37 βˆ’ 7) = 1.8. In other words, if flights to London were removed from the dataset, then full flights would have delays on average 1.8 times longer than non-full flights. On the other hand, intervention_{Late}(*, SF, *) = (108 βˆ’ 35) / (37 βˆ’ 12) = 2.92, meaning that removing flights departing from SF from the dataset would create a greater contrast between the delays of full and non-full flights.

We also point out CAPE, whose goal is to find patterns whose measure attribute values counterbalance those of the pattern given as input. For example, the average flight delay in Table 1 is 10.4, but it is only 5.2 for tuples covered by (Mon, *, *). To counterbalance this low value, CAPE suggests related patterns such as (Fri, *, *) and (Sat, *, *), whose average delays are higher.
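Using the sup and sum notation from Section 2, the three contrast metrics above can be sketched as follows. This is a toy illustration over Table 1, not code from any of the surveyed systems:

```python
# Table 1 as (Day, Origin, Dest., Late, Full).
FLIGHTS = [
    ("Fri", "SF", "London", 20, 1), ("Fri", "London", "LA", 16, 1),
    ("Sun", "Tokyo", "Frankfurt", 10, 1), ("Sun", "Chicago", "London", 15, 1),
    ("Sat", "Beijing", "Frankfurt", 13, 1), ("Sat", "Frankfurt", "London", 19, 1),
    ("Tue", "Chicago", "LA", 5, 0), ("Wed", "London", "Chicago", 6, 0),
    ("Thu", "SF", "Frankfurt", 15, 1), ("Mon", "Beijing", "SF", 4, 0),
    ("Mon", "SF", "London", 7, 0), ("Mon", "SF", "Frankfurt", 5, 0),
    ("Mon", "Tokyo", "Beijing", 6, 0), ("Mon", "Frankfurt", "Tokyo", 4, 0),
]
ALL = ("*", "*", "*")

def sup(p, full):
    """Number of tuples matching p with the given Full value."""
    return sum(1 for t in FLIGHTS if t[4] == full
               and all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p)))

def sum_late(p, full):
    """Total delay of tuples matching p with the given Full value."""
    return sum(t[3] for t in FLIGHTS if t[4] == full
               and all(pv == "*" or tv == pv for tv, pv in zip(t[:3], p)))

def risk_ratio(p):
    """theta_{Full=1}(p) vs. the fraction of tuples outside p that are full."""
    theta = sup(p, 1) / (sup(p, 1) + sup(p, 0))
    out_full = sup(ALL, 1) - sup(p, 1)
    out_all = out_full + (sup(ALL, 0) - sup(p, 0))
    return theta / (out_full / out_all)

def mean_shift(p):
    """Mean delay of full vs. non-full flights matching p."""
    return (sum_late(p, 1) / sup(p, 1)) / (sum_late(p, 0) / sup(p, 0))

def intervention(p):
    """Delay contribution of full vs. non-full flights after removing p's tuples."""
    return (sum_late(ALL, 1) - sum_late(p, 1)) / (sum_late(ALL, 0) - sum_late(p, 0))

london = ("*", "*", "London")
print(risk_ratio(london), mean_shift(london), intervention(london))
# risk ratio ~1.875, mean shift ~2.57, intervention ~1.8
```

The outputs reproduce the worked examples in the text for (*, *, London).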
3.2.1 Performance Optimizations. Contrast-based methods output patterns whose scores (e.g., risk ratios) are above a user-supplied threshold. A general optimization used by DIFF is to also require a minimum support threshold for the extracted patterns. This enables pruning optimizations similar to those used by the Apriori algorithm for frequent itemset mining [3]. For example, if (Wed, *, *) fails to satisfy the minimum support threshold, then all its subsets, such as (Wed, London, *), can be ignored.

3.2.2 Pros and Cons. By design, contrast-based methods are useful when exploring differences between data subsets – something that coverage-based methods do not implicitly optimize for. On the other hand, these methods may not guarantee concise results. One exception is Data X-ray, which performs a set-cover-like operation on the extracted patterns as a post-processing step to eliminate redundant patterns.

3.3 Methods Based on Information
Finally, we discuss three methods that select patterns based on the information they provide about the distribution of the measure attribute: Explanation Tables [7], SURPRISE [16] and Shrink [10].

Table 3: An explanation table of size four for the binary measure attribute Full

Day  Origin  Dest.      Full
*    *       *          0.5
Mon  *       *          0
*    *       London     0.75
*    *       Frankfurt  0.75

Table 4: An explanation table of size four for the numeric measure attribute Late

Day  Origin  Dest.   Late
*    *       *       10.4
*    *       London  15.3
Fri  *       *       18
Sat  *       *       16

Table 5: A summary of size two considered by Shrink [10]

Day              Origin  Dest.  Late
Fri,Sun,Sat,Thu  *       *      15.4
Tue,Wed,Mon      *       *      5.3

Table 3 shows an explanation table of size four (i.e., containing four patterns) for the binary measure attribute Full based on Table 1. In addition to values of the dimension attributes, each explanation table pattern also includes the fraction of matching tuples that have Full = 1. The first pattern in an explanation table is always the all-stars pattern, and, in this example, it also indicates that half the flights in the entire dataset are full. The next pattern suggests that no flights on Mondays are full, and the last two patterns indicate that three-quarters of flights to London and Frankfurt are full.

In Table 4, we show an explanation table of size four for the measure attribute Late based on Table 1. Here, each pattern includes the average value of Late across its matching tuples. Again, we start with the all-stars pattern, which states that flights are 10.4 minutes late on average. The next pattern indicates that flights to London are 15.3 minutes late on average, and so on.

The greedy heuristic for constructing explanation tables used in [6–8] iteratively selects the pattern that contains the most information about the distribution of the measure attribute. To do so, the algorithm maintains a maximum-entropy estimate of the distribution based on the patterns that have been added to the explanation table so far. It also keeps track of the Kullback-Leibler (KL) divergence between this estimated distribution and the true distribution. To quantify the information contained in a candidate pattern p, the algorithm computes the reduction in KL-divergence if p were to be added to the explanation table.

Returning to Table 3, the greedy algorithm starts by inserting the pattern (*, *, * || 0.5). At this point, knowing only this one piece of information, the maximum-entropy estimate of the distribution of Full is to assign Full = 0.5 to every tuple in Table 1.^2 Next, it turns out that (Mon, *, * || 0) gives the greatest reduction in KL-divergence. Based on this new pattern, the maximum-entropy estimate for tuples 10 through 14 in Table 1 changes to Full = 0. This revision causes the estimates of the first nine tuples to change (from 0.5 to 7/9) in order to maintain consistency with the first pattern, which asserts that Full = 0.5 on average over the entire table. Given the updated maximum-entropy estimate, the next pattern with the greatest reduction in KL-divergence is (*, *, London || 0.75), and so on.

^2 Full is a binary attribute that can only be zero or one. However, for the purpose of measuring the divergence between the true and the estimated distributions, the maximum-entropy estimates are allowed to be real numbers between zero and one.

Similar reasoning can explain how Table 4 was created. The first pattern asserts that flights are late by 10.4 minutes on average. Given this estimate, (*, *, London || 15.3) provides the most information about the distribution of Late. The maximum-entropy estimate of Late is now updated accordingly. That is, tuples 1, 4, 6 and 11, corresponding to flights to London, receive an estimate of 15.3, and the remaining tuples receive an estimate of 8.4 to maintain consistency with the first pattern. The next most-informative pattern is then selected, and so on.

SURPRISE is a similar method, whose goal is to identify surprising fragments of a dataset where the distribution of the measure attribute is different than what the user has seen so far. Suppose the user queries Table 1 and finds out that flights are 10.4 minutes late on average. SURPRISE finds the most informative non-overlapping and contained patterns, i.e., those which lead to the greatest reduction in KL-divergence between the true distribution of Late and the maximum-entropy estimated distribution. Restricting the output to such patterns makes it easier to update the estimated distribution. In our example, the most informative pattern is (*, *, London) and its most informative subset is (*, SF, London).

Finally, we discuss Shrink, which produces k non-overlapping patterns that summarize the distribution of the measure attribute with (approximately) minimal sum of squared errors. Shrink allows patterns with disjunctions of values and dimension hierarchies, and uses a greedy approach in which cells in the data cube are merged to produce patterns with the smallest squared errors. Table 5 shows an example of a summary with two patterns that may be considered by Shrink based on Table 1. Given the average values of the Late measure attribute reported by the summary, the sum of squared errors is 77.1.

3.3.1 Performance Optimizations. In contrast-based methods, the "goodness" of a pattern usually remains static throughout the execution of the algorithm. On the other hand, the "goodness", i.e., information, of a candidate explanation table pattern is relative to the current maximum-entropy estimate of the measure attribute. This means that a previously un-informative pattern may become informative as more patterns are added to the explanation table. As a result, candidate patterns cannot be easily pruned. Instead, the explanation table construction algorithm in [6–8] uses sampling to reduce the space of candidate patterns. In every iteration of the greedy algorithm, a random sample is drawn, and the set of candidate patterns corresponds to only those patterns that have a non-zero support in the sample. The intuition is that patterns with frequently occurring combinations of dimension attribute values are likely to be sampled and also likely to contain information about the distribution of the measure attribute.

3.3.2 Pros and Cons. By design, information-based methods produce informative patterns that highlight fragments of the data with surprising distributions of the measure attribute. However, these methods tend to be expensive, especially as the number of dimension attributes grows.
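As a sanity check on the Shrink example above, the sum of squared errors of the Table 5 summary can be recomputed directly. This is a small illustrative script, not part of Shrink itself:

```python
# Late values from Table 1, keyed by Day.
late_by_day = {
    "Fri": [20, 16], "Sun": [10, 15], "Sat": [13, 19], "Thu": [15],
    "Tue": [5], "Wed": [6], "Mon": [4, 7, 5, 6, 4],
}
# The two disjunctive Day patterns from Table 5.
summary = [{"Fri", "Sun", "Sat", "Thu"}, {"Tue", "Wed", "Mon"}]

sse = 0.0
for days in summary:
    values = [v for d in days for v in late_by_day[d]]
    mean = sum(values) / len(values)  # ~15.4 and ~5.3, as reported in Table 5
    sse += sum((v - mean) ** 2 for v in values)

print(round(sse, 1))  # 77.1
```

The two group means come out to 108/7 β‰ˆ 15.4 and 37/7 β‰ˆ 5.3, and the total squared error of 77.1 matches the figure quoted in the text.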
4 CONCLUSIONS AND OPEN PROBLEMS
In this paper, we surveyed recent data exploration methods that extract interesting or informative fragments of the data, represented as patterns over the dimension attributes. We categorized these methods according to the properties of patterns they select. Below, we offer suggestions for future work in this area.

Benchmarks: A performance comparison of contrast-based methods implemented within the DIFF framework appears in [2]. In terms of effectiveness, prior work reports that methods based on information provide more information about the distribution of the measure attribute than coverage-based methods [6, 7]; similarly, methods based on contrast provide more precise outlier explanations than methods based on coverage [19]. Some approaches were also evaluated through user studies against simple baselines [14]. An interesting direction for future work is to develop a comprehensive benchmark to highlight the effectiveness of various types of methods in various applications.

New applications: Popular motivating applications that guided the development of prior work were outlier and data error analysis, as well as query result explanation. Recent interest in explainable AI motivates further studies on exploring the behaviour of black-box machine learning models such as neural networks using multi-dimensional patterns, as was suggested in [7]. Since deep learning methods have been successful in the context of unstructured data such as text, images and graphs, future research should investigate new ways of formulating interpretable patterns over these high-dimensional unstructured datasets.

Correlated measure attributes: Another characteristic of prior work is that it usually formulates exploration problems involving a single measure attribute. Pattern-based exploration of multiple measure attributes is an interesting area for future work.

Feature reduction: In terms of performance and scalability, the large number of possible patterns remains a challenge for many methods, especially those based on information, which cannot leverage Apriori-like pruning strategies. This is an important challenge for interactive methods that allow users to continuously issue new exploration tasks. As a result, some techniques such as DIFF limit the number of dimension attributes for use in patterns and discard redundant dimension attributes such as those functionally determined by other attributes. Distributed versions of some methods, including DIFF [2] and Explanation Tables [8], have also been proposed to parallelize the search for interesting patterns. In machine learning, there exists a variety of dimension reduction methods such as Principal Component Analysis (PCA) and word embeddings. However, these methods are not known for being interpretable, and thus their suitability for pattern-based exploration requires further study.

Bringing order to dimension attributes: Much of the previous work considers categorical dimension attributes. However, there exist methods for covering a multi-dimensional dataset using hyper-rectangles corresponding to intervals over numeric dimension attributes [13], there exists a method to cover data anomalies using intervals over numeric features [21], and explanation tables have recently been extended to support ordinal and numeric dimension attributes [18]. These extensions further increase the space of candidate patterns and require additional performance optimizations. For example, returning to Table 1, the Day attribute may lead to additional patterns with ranges or intervals such as ([Mon-Fri], *, *) or ([Sat-Sun], *, *). Techniques used for constructing optimal histograms and optimal decision trees may help to discover these types of patterns.

Exploring data evolution: Finally, recent work motivates the need for tools to explore how data (and metadata) change over time [4]. Here, patterns may summarize fragments of the data that have changed recently or are updated often.

REFERENCES
[1] F. Abuzaid, P. Bailis, J. Ding, E. Gan, S. Madden, D. Narayanan, K. Rong, S. Suri: MacroBase: Prioritizing Attention in Fast Data. ACM Trans. Database Syst. 43(4): 15:1-15:45 (2018)
[2] F. Abuzaid, P. Kraft, S. Suri, E. Gan, E. Xu, A. Shenoy, A. Ananthanarayan, J. Sheu, E. Meijer, X. Wu, J. F. Naughton, P. Bailis, M. Zaharia: DIFF: A Relational Interface for Large-Scale Data Explanation. PVLDB 12(4): 419-432 (2018)
[3] R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
[4] T. Bleifuss, L. Bornemann, T. Johnson, D. V. Kalashnikov, F. Naumann, D. Srivastava: Exploring Change - A New Dimension of Data Analytics. PVLDB 12(2): 85-98 (2018)
[5] M. Das, S. Amer-Yahia, G. Das, C. Yu: MRI: Meaningful Interpretations of Collaborative Ratings. PVLDB 4(11): 1063-1074 (2011)
[6] K. El Gebaly, P. Agrawal, L. Golab, F. Korn, D. Srivastava: Interpretable and Informative Explanations of Outcomes. PVLDB 8(1): 61-72 (2014)
[7] K. El Gebaly, G. Feng, L. Golab, F. Korn, D. Srivastava: Explanation Tables. IEEE Data Eng. Bull. 41(3): 43-51 (2018)
[8] G. Feng, L. Golab, D. Srivastava: Scalable Informative Rule Mining. ICDE 2017: 437-448
[9] L. Golab, H. J. Karloff, F. Korn, D. Srivastava: Data Auditor: Exploring Data Quality and Semantics using Pattern Tableaux. PVLDB 3(2): 1641-1644 (2010)
[10] M. Golfarelli, S. Graziani, S. Rizzi: Shrink: An OLAP operation for balancing precision and size of pivot tables. Data Knowl. Eng. 93: 19-41 (2014)
[11] M. Joglekar, H. Garcia-Molina, A. G. Parameswaran: Interactive data exploration with smart drill-down. ICDE 2016: 906-917
[12] H. Lakkaraju, S. H. Bach, J. Leskovec: Interpretable Decision Sets: A Joint Framework for Description and Prediction. KDD 2016: 1675-1684
[13] L. V. S. Lakshmanan, R. T. Ng, C. X. Wang, X. Zhou, T. Johnson: The Generalized MDL Approach for Summarization. VLDB 2002: 766-777
[14] Z. Miao, Q. Zeng, C. Li, B. Glavic, O. Kennedy, S. Roy: CAPE: Explaining Outliers by Counterbalancing. PVLDB 12(12): 1806-1809 (2019)
[15] S. Roy, D. Suciu: A formal approach to finding explanations for database queries. SIGMOD Conference 2014: 1579-1590
[16] S. Sarawagi: User-cognizant multidimensional analysis. VLDB J. 10(2-3): 224-239 (2001)
[17] P. Vassiliadis, P. Marcel: The Road to Highlights is Paved with Good Intentions: Envisioning a Paradigm Shift in OLAP Modeling. DOLAP 2018
[18] M. Vollmer, L. Golab, K. BΓΆhm, D. Srivastava: Informative Summarization of Numeric Data. SSDBM 2019: 97-108
[19] X. Wang, X. L. Dong, A. Meliou: Data X-Ray: A Diagnostic Tool for Data Errors. SIGMOD Conference 2015: 1231-1245
[20] E. Wu, S. Madden: Scorpion: Explaining Away Outliers in Aggregate Queries. PVLDB 6(8): 553-564 (2013)
[21] H. Zhang, Y. Diao, A. Meliou: EXstream: Explaining Anomalies in Event Stream Monitoring. EDBT 2017: 156-167