<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient Data Slice Search for Exceptional View Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yohei Mizuno</string-name>
          <email>mizuno.yohei@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuya Sasaki</string-name>
          <email>sasaki@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Makoto Onizuka</string-name>
          <email>onizuka@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Information Science and Technology, Osaka University</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Business analysts repeatedly execute OLAP queries consisting of aggregate/group-by operations until they find trends, insights, and/or outliers. One interesting analysis is to identify particular data slices (subsets of the original data) that generate exceptional views apart from the average view generated from the whole original data. However, since the search space for identifying such data slices is quite large, efficient techniques are indispensable for the data slice search. In this paper, we propose 1) a framework that automatically identifies data slices that generate exceptional views for OLAP queries, and 2) an efficient algorithm that optimizes the data slice search. The algorithm reduces the search space by employing confidence intervals and removes redundant query processing by applying multi-query optimization for evaluating queries over multiple data slices. The experiments validate that our algorithm runs eight times faster than without the optimizations.</p>
      </abstract>
      <kwd-group>
        <kwd>Exploratory analysis</kwd>
        <kwd>Probability theory</kwd>
        <kwd>OLAP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The need for effective analysis of business data is widely
recognized, and many data mining techniques have been
developed, such as online analytical processing (OLAP),
association rule mining, clustering, classification, graph mining,
and stream data mining [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In particular, OLAP is one of the most
frequently used techniques and largely contributes to
business data analysis; it is effectively used to find trends,
insights, and/or outliers in data. In typical analytical
processing, data analysts reveal hidden knowledge from data
by setting up hypotheses and testing them through
investigation of the data. However, a major issue of analytical
processing is that it is not easy for data analysts to set up
effective hypotheses, because it requires them to deeply
understand the data itself so that they can properly choose
interesting analytical dimensions (expressed by OLAP queries)
and/or data slices (subsets of data) for detailed analysis. So
a continuous process of trial and error is very common in
analytical processing.
      </p>
      <p>
        2017, Copyright is with the authors. Published in the Workshop
Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice,
Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is
permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
      </p>
      <p>
        The exploratory analysis is a promising research area for
solving the issue above [
        <xref ref-type="bibr" rid="ref12 ref13 ref2 ref6">2, 6, 12, 13</xref>
        ]. Morton et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
investigate systems that support data analysts. One challenging
goal is to develop data recommenders that would
automatically identify datasets outside of the system that could
improve the quality of analysis results. Buoncristiano et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
study a database exploration model assuming conversations
between exploratory users and the system. The
conversation goes on so that the system identifies queries that reveal
significant deviation from the uniform distribution. As a more
concrete example, SEEDB [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] automatically explores
various types of OLAP queries in a multidimensional space
and identifies the query q that maximizes the deviation between
q(D) and q(S), where D is the whole data and S is a given
data slice. However, SEEDB is not applicable to the
opposite case: searching for data slices that generate views that
deviate from the view generated from the whole data with
respect to a given OLAP query. An example of data slice
search is to identify a particular branch office (say, the Kyoto
office) or a certain year (say, 2001) in a sales database such that
its generated analysis result deviates from the one
generated from the whole data. Moreover, since the search space of
data slice search is huge, efficient techniques are
indispensable for the data slice search.
      </p>
      <p>
        To achieve efficient data slice search, we propose a
framework that automatically identifies data slices that generate
exceptional views for OLAP queries, and an efficient
algorithm that optimizes the data slice search. The algorithm
reduces the search space by employing confidence
intervals [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and removes redundant query processing by
applying multi-query optimization techniques when searching for
multiple data slices [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1 Running example</title>
      <p>We give a running example of business analysis. In sales
data analysis, it is important to identify particular products
whose sales trends deviate from the average sales of
all products. Figure 1 shows the normalized monthly sales of
all products, of products purchased by men, and of clothes. We
observe that the trend of the products purchased by men is
quite close to that of all products, which is not interesting.
In contrast, the trend of the clothes largely differs from that
of all products. We can see that the sales of the clothes
are affected by seasons, particularly in March. This suggests
that we should start a new discount service for clothes in
April so as to keep the clothes sales high.</p>
      <p>[Figure 1: Normalized monthly sales; the x-axis ranges over the months Feb 2014 through Dec 2014.]</p>
      <p>This paper is organized as follows. We describe an overview
of data slice search for exceptional view detection in Section
2 and confidence intervals derived from a probability
inequality in Section 3. We give the details of the efficient algorithm for
data slice search in Section 4. The performance evaluation
is given in Section 5. We describe related work in Section 6
and conclude this paper in Section 7.</p>
    </sec>
    <sec id="sec-3">
      <title>2. OVERVIEW OF DATA SLICE SEARCH</title>
      <p>In this section, we describe the problem of identifying data
slices that generate exceptional views for a given query. We
give the problem definition of data slice search in Section
2.1, and then show its procedures in Section 2.2.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1 Problem definition</title>
      <p>We focus on a relational database D. D is a collection of
records X_1, X_2, ..., X_|D|, where |D| is the number of
records of D. Each X_i (1 ≤ i ≤ |D|) is composed of measure
attributes (say, prices or amounts) and dimension attributes
B (say, product categories or store names). We denote the value
of measure attribute m of X_i as X_i^m. We define data slice
S as a subset of the D cuboid, sliced by choosing a single value
Y for b ∈ B as follows:</p>
      <p>S := σ_{b=Y}(D)   (1)</p>
      <p>where σ is the select operation in relational algebra. We call
this operation a slicing query. An OLAP query is given by
data analysts in advance. We define the set of data slices S
and OLAP query q as follows:</p>
      <p>S := ∪_{i=1}^{|B|} { σ_{b_i=Y}(D) | Y ∈ values(b_i) }   (2)</p>
      <p>q(S) := γ_{g, a=f(m)}(S)   (3)</p>
      <p>where |B| is the number of dimension attributes, values(b_i)
is the set of unique values of b_i, g is a dimension attribute for
group-by, m is a measure attribute for aggregation, a
is an aggregated value of m, and f is an aggregate function.
f calculates either the number of the values of m (COUNT),
the mean of the values of m (AVG), or the sum of the values
of m (SUM). γ groups the records using g and aggregates
the grouped values of m using f. The result is a
sequence of objects consisting of the unique values of g and
the aggregated values of m. The problem here is to identify
the top-k data slices in S that generate the most extreme
exceptional views (expressed as q(S)) for a given query q,
defined as follows.</p>
      <p>Definition 1: argmax^k_{S ∈ S} deviation(q(D), q(S))</p>
      <p>where the maximization is over S ∈ S and deviation is a function that measures the deviation
between the view for query q of D and that of data slice S.
Without loss of generality, we use Euclidean distance
for the deviation function (we can easily extend our technique
to apply other deviation functions such as Manhattan distance).
The procedure of the data slice search is as follows.</p>
      <p>1. To evaluate query q over the whole data D.</p>
      <p>2. To obtain the set of data slices S from D by choosing each
single value Y ∈ values(b_i) for each dimension b_i ∈ B.
Each S ∈ S is generated by executing slicing queries, such
as "gender = 'men'" and "category = 'clothes'".</p>
      <p>3. To evaluate query q over all data slices S ∈ S.</p>
      <p>4. To compute the deviation between q(D) and q(S).
For example, the Euclidean distance between the monthly
sales of all products and those of products purchased by men is computed.</p>
      <p>5. To display the top-k deviated views to the users.</p>
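      <p>The procedure above can be sketched in Python. This is a minimal illustration under assumptions, not the paper's implementation: the toy table, attribute names, and the choice of SUM as the aggregate function f are invented for the example, and the AVG/SUM normalization described later in Section 4.1 is omitted.</p>

```python
import math

# Toy table; records, attribute names, and values are illustrative only.
D = [
    {"gender": "men",   "category": "clothes", "month": "Jan", "amount": 10},
    {"gender": "women", "category": "clothes", "month": "Jan", "amount": 30},
    {"gender": "men",   "category": "food",    "month": "Feb", "amount": 20},
    {"gender": "women", "category": "food",    "month": "Feb", "amount": 40},
    {"gender": "men",   "category": "clothes", "month": "Feb", "amount": 50},
]

def q(records, g="month", m="amount"):
    # OLAP query of Eq. 3 with f = SUM: group by g and aggregate m.
    view = {}
    for r in records:
        view[r[g]] = view.get(r[g], 0) + r[m]
    return view

def deviation(view_d, view_s):
    # Euclidean distance between two views over the group-by values.
    keys = set(view_d) | set(view_s)
    return math.sqrt(sum((view_d.get(z, 0) - view_s.get(z, 0)) ** 2 for z in keys))

def topk_slices(D, dims=("gender", "category"), k=2):
    view_d = q(D)                                   # step 1: q over the whole data
    scored = []
    for b in dims:                                  # step 2: every slice sigma_{b=Y}(D)
        for Y in sorted({r[b] for r in D}):
            S = [r for r in D if r[b] == Y]
            scored.append(((b, Y), deviation(view_d, q(S))))  # steps 3-4
    scored.sort(key=lambda p: -p[1])
    return scored[:k]                               # step 5: top-k deviated slices

print(topk_slices(D))
```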
    </sec>
    <sec id="sec-5">
      <title>3. CONFIDENCE INTERVAL</title>
      <p>
        We apply the confidence interval to efficiently identify
top-k data slices when OLAP queries use aggregation functions.
In this section, we first describe how we apply the confidence
interval for the mean value derived from the Hoeffding-Serfling
inequality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The Hoeffding-Serfling inequality is useful for deriving an upper bound
on a probability without assuming the probability
distribution of the population, when the expected value of a
stochastic variable and its finite domain are given. From
the Hoeffding-Serfling inequality, the confidence interval for
the mean value of aggregate attribute m of D (expressed
as μ(D)) can be derived. Let {x_1, ..., x_|x|} and {X_1, ..., X_|D|}
be the samples and the population, and |x| and |D| be the sample size
and the population size, respectively. The Hoeffding-Serfling
inequality is modified to have the following form.</p>
      <p>Pr′(t) = Pr′[ |x̄ − μ(D)| ≥ t ]   (4)</p>
      <p>
        where x̄ = (1/|x|) ∑_{i=1}^{|x|} x_i^m and μ(D) = (1/|D|) ∑_{i=1}^{|D|} X_i^m. Since Eq. 4 is
derived from the Hoeffding-Serfling inequality by replacing
the x̄ − μ(D) part with its absolute value, the probability Pr′(t)
in Eq. 4 is twice the probability Pr(t) in the Hoeffding-Serfling
inequality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where Pr(t) is the probability that x̄
is at least t bigger than μ(D). Therefore, supposing that t &gt; 0, the upper
bound can be inferred by using the following equation.
      </p>
      <p>Pr′(t) ≤ 2 exp[ −2|x|t² / ((1 − f_x)(max − min)²) ] = δ   (5)</p>
      <p>where f_x = (|x| − 1)/|D|, min = min{X_i^m; i = 1, 2, ..., |D|}, and max =
max{X_i^m; i = 1, 2, ..., |D|}. From Eq. 4 and Eq. 5, the (1 − δ)
confidence interval for μ(D) can be derived as Eq. 6
(we set the significance level δ at 0.05 in the experiments):</p>
      <p>x̄ − t &lt; μ(D) &lt; x̄ + t   (6)</p>
      <p>where t = √( (1 − f_x)(max − min)²(log 2 − log δ) / (2|x|) ).</p>
      <p>Furthermore, we express the range of the confidence
interval in the following format.</p>
      <p>range(μ(D)) = [x̄ − t, x̄ + t].   (7)</p>
      <p>Next, we estimate the confidence interval for
the sum of the values of m (expressed as τ(D)). We can
estimate this interval by using the following form, obtained
by multiplying the terms in Eq. 6 by |D|.</p>
      <sec id="sec-5-1">
        <title>Lemma 1</title>
        <p>|D|(x̄ − t) &lt; τ(D) &lt; |D|(x̄ + t).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.1 Confidence interval for size of data slices</title>
      <p>We calculate aggregated values of m of data slices for the
problem of identifying data slices that generate exceptional
views for a given query. We need to estimate the aggregated
values from samples so as to reduce the data slice search
space. We derive expressions estimating confidence intervals
for the aggregated values of m of the data slice σ(D) ⊆ D.
Firstly, we derive the confidence interval for the size of σ(D)
(expressed as |σ(D)|). To estimate |σ(D)| using Lemma 1,
we define a function that judges whether an element x_i of the
samples belongs to σ(D) as follows:</p>
      <p>exist(x_i) = 1 if x_i ∈ σ(D), and 0 otherwise.   (8)</p>
      <p>Furthermore, we define the ratio (expressed as s) of the
elements belonging to σ(x) to the samples using exist as follows:</p>
      <p>s = |σ(x)| / |x| = (1/|x|) ∑_{i=1}^{|x|} exist(x_i).   (9)</p>
      <p>By the definition of exist, min in t equals zero and max in t
equals one in Eq. 6. Therefore, we can estimate the ratio of
σ(D) to D as follows:</p>
      <p>s − t′ &lt; |σ(D)|/|D| &lt; s + t′   (10)</p>
      <p>where t′ = √( (1 − f_x)(log 2 − log δ) / (2|x|) ). We can estimate the
confidence interval for |σ(D)| by using the following form, obtained
by multiplying Eq. 10 by |D|.</p>
      <sec id="sec-6-1">
        <title>Lemma 2</title>
        <p>|D|(s − t′) &lt; |σ(D)| &lt; |D|(s + t′).</p>
      </sec>
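      <p>A minimal sketch of Lemma 2 in Python, assuming hypothetical sample and population sizes:</p>

```python
import math

def slice_size_interval(ns, n, N, delta=0.05):
    """Lemma 2: bounds on |sigma(D)| from the fraction s of sampled
    records that fall in the slice (min = 0, max = 1 in Eq. 6).
    ns = |sigma(x)|, n = |x|, N = |D|."""
    s = ns / n                             # Eq. 9
    fx = (n - 1) / N
    t1 = math.sqrt((1 - fx) * (math.log(2) - math.log(delta)) / (2 * n))
    return N * (s - t1), N * (s + t1)      # |D|(s - t'), |D|(s + t')

lo, hi = slice_size_interval(300, 1000, 100000)
print(lo, hi)
```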
    </sec>
    <sec id="sec-7">
      <title>3.2 Confidence interval for mean of data slices</title>
      <p>We derive the confidence interval for the mean of m of
σ(D) (expressed as μ(σ(D))) by applying Lemma 2 to Eq. 6.
We can estimate the confidence interval for μ(σ(D)) as
follows, by setting the size of the population to |σ(D)| and the size
of the samples to |σ(x)|:</p>
      <p>x̄(σ(x)) − t″ &lt; μ(σ(D)) &lt; x̄(σ(x)) + t″   (11)</p>
      <p>where x̄(σ(x)) is the mean of σ(x) and t″ is the value obtained
by substituting |σ(x)| and the range of f_x below for
|x| and f_x in t = √( (1 − f_x)(max − min)²(log 2 − log δ) / (2|x|) ):</p>
      <p>(|σ(x)| − 1)/(|D|(s + t′)) &lt; f_x &lt; (|σ(x)| − 1)/(|D|(s − t′))</p>
      <p>where the range of f_x is obtained by substituting |σ(x)| and
the range of |σ(D)| in Lemma 2 for |x| and |D| in f_x,
respectively. Furthermore, t″ is removed from Eq. 11 as follows. t″
has range [t_min, t_max] because t is a function of f_x and f_x
has a range. When the size of the population equals |D|(s − t′)
and f_x = (|σ(x)| − 1)/(|D|(s − t′)), t″ takes the lower bound t_min. When the size
of the population equals |D|(s + t′) and f_x = (|σ(x)| − 1)/(|D|(s + t′)), t″ takes the
upper bound t_max. Therefore, Eq. 11 is rewritten as follows.</p>
      <p>Lemma 3: x̄(σ(x)) − t_max &lt; μ(σ(D)) &lt; x̄(σ(x)) + t_max</p>
      <p>where t_max = √( (1 − (|σ(x)| − 1)/(|D|(s + t′)))(max − min)²(log 2 − log δ) / (2|σ(x)|) ).</p>
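      <p>The computation of t_max in Lemma 3 can be sketched as follows. The inputs are hypothetical, and t′ is recomputed from Lemma 2 inside the function.</p>

```python
import math

def tmax(ns, n, N, vmin, vmax, delta=0.05):
    """Lemma 3: half-width t_max of the confidence interval for the mean
    of a slice, using the upper bound |D|(s + t') on the slice size.
    ns = |sigma(x)| (sampled records in the slice), n = |x|, N = |D|."""
    s = ns / n
    log_term = math.log(2) - math.log(delta)
    fx_all = (n - 1) / N
    t1 = math.sqrt((1 - fx_all) * log_term / (2 * n))   # t' from Lemma 2
    fx_hi = (ns - 1) / (N * (s + t1))                   # smallest f_x in its range
    return math.sqrt((1 - fx_hi) * (vmax - vmin) ** 2 * log_term / (2 * ns))

print(tmax(300, 1000, 100000, 0, 100))
```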
    </sec>
    <sec id="sec-8">
      <title>3.3 Confidence interval for sum of data slices</title>
      <p>We derive the confidence interval for the sum of m of σ(D)
(expressed as τ(σ(D))) by multiplying μ(σ(D)) by |σ(D)|.
When |σ(D)| takes its maximum (i.e., |D|(s + t′)),
t should equal t_max, and μ(σ(D)) takes its maximum
(i.e., x̄(σ(x)) + t_max) at the same time. Therefore, the
maximum of τ(σ(D)) is the value obtained by multiplying
the maximum of μ(σ(D)) by the maximum of |σ(D)|. On
the other hand, when |σ(D)| takes its minimum
(i.e., |D|(s − t′)), t should equal t_min, but μ(σ(D)) does not
take its minimum (i.e., x̄(σ(x)) − t_max). We define
x̄(σ(x)) − t_opt and |D|(s − t′_opt) as the values of μ(σ(D))
and |σ(D)|, respectively, when τ(σ(D)) takes its minimum.
We can estimate the confidence interval for τ(σ(D)) by using
the following form.</p>
      <p>Lemma 4: (x̄(σ(x)) − t_opt)(|D|(s − t′_opt)) &lt; τ(σ(D)) &lt; (x̄(σ(x)) + t_max)(|D|(s + t′))</p>
      <p>where t_opt = √( (1 − (|σ(x)| − 1)/(|D|(s − t′_opt)))(max − min)²(log 2 − log δ) / (2|σ(x)|) ), and t′_opt is the
value within the range −t′ &lt; t′_opt &lt; t′ (so that |D|(s − t′_opt) lies within the range of
Lemma 2) for which (x̄(σ(x)) − t_opt)(|D|(s − t′_opt)) takes its minimum.</p>
    </sec>
    <sec id="sec-10">
      <title>4. EFFICIENT ALGORITHM FOR DATA SLICE SEARCH</title>
      <p>The algorithm prunes the search space by employing
confidence intervals and removes redundant query processing
by applying multi-query optimization for evaluating queries
over multiple data slices.</p>
      <p>Pruning</p>
      <p>One characteristic of identifying data slices that
generate exceptional views is that most views generated from
data slices have a high similarity to the view
generated from the whole data. Since those data slices cannot
be within the top-k of largest deviation, it is preferable to prune
the query evaluation for those data slices. To achieve this
pruning, we use only samples and compute confidence
intervals from the samples for the data slice search. Specifically,
we describe how we calculate deviations (Section 4.1) and how we
estimate deviations from samples by computing the upper
bound and lower bound of the confidence intervals for the
aggregation of each data slice (Section 4.2). If the upper
bound of the query result for a data slice S is lower than the
lower bound of the kth highest data slice, we can safely prune
the query evaluation for the data slice S (Section 4.3).</p>
      <p>Multi-query Optimization (Section 4.4)</p>
      <p>In identifying data slices that generate exceptional views,
we have to evaluate a given query over multiple data slices.
This query processing is optimized by grouping the slicing
queries for the data slices with respect to the same attributes
and then evaluating the grouped queries during a single
scan.</p>
    </sec>
    <sec id="sec-11">
      <title>4.1 Calculating deviations</title>
      <p>We describe how to calculate the deviation between q(D)
and q(S) (i.e., deviation(q(D), q(S))), which is computed by
Euclidean distance. We define γ_{g, a=f(m)}(S) in Eq. 3 in the
following form:</p>
      <p>γ_{g, a=f(m)}(S) := { f(π_m(σ_{g='Z'}(S))) | Z ∈ values(g) }   (12)</p>
      <p>where values(g) is the set of the unique values of the
group-by attribute g. In other words, γ_{g, a=f(m)}(S) selects subsets
of S for each value Z of g, projects the measure attribute
m of the data slices, and applies the aggregate function f.
We can expand deviation(q(D), q(S)) in Definition 1 into the
following form by using Eq. 12:</p>
      <p>deviation({ f(π_m(σ_{g='Z'}(D))) | Z ∈ values(g) },
{ f(π_m(σ_{g='Z'}(S))) | Z ∈ values(g) }).   (13)</p>
      <p>Furthermore, Eq. 13 can be rewritten into the following form
because deviation calculates Euclidean distance:</p>
      <p>√( ∑_{Z ∈ values(g)} ( f(π_m(σ_{g='Z'}(D))) − f(π_m(σ_{g='Z'}(S))) )² ).   (14)</p>
      <p>That is, deviation(q(D), q(S)) selects subsets of D and S
for each value Z ∈ values(g), calculates the squares of the
differences between the aggregation results of σ_{g='Z'}(D) and
σ_{g='Z'}(S), and calculates the square root of the sum of the
squares. When f is AVG or SUM, we normalize q(D) and
q(S) so that the sum of q(D) and the sum of q(S) each equal one.</p>
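      <p>The normalized Euclidean deviation of Eq. 14 can be sketched as follows; the two toy SUM views and their values are invented for illustration.</p>

```python
import math

def normalize(view):
    # For AVG/SUM, scale the view so its values sum to one (Section 4.1).
    total = sum(view.values())
    return {z: v / total for z, v in view.items()}

def euclidean_deviation(qd, qs):
    # Eq. 14: square root of the sum of squared per-group differences.
    qd, qs = normalize(qd), normalize(qs)
    keys = set(qd) | set(qs)
    return math.sqrt(sum((qd.get(z, 0.0) - qs.get(z, 0.0)) ** 2 for z in keys))

# Monthly SUM views (hypothetical numbers): whole data vs. a "clothes" slice.
q_D = {"Jan": 100, "Feb": 100, "Mar": 200}
q_S = {"Jan": 10, "Feb": 10, "Mar": 80}
print(euclidean_deviation(q_D, q_S))
```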
    </sec>
    <sec id="sec-12">
      <title>4.2 Estimating deviations</title>
      <p>We calculate f(π_m(σ_{g='Z'}(D))) in Eq. 14 beforehand and
then estimate f(π_m(σ_{g='Z'}(S))) by using the samples of S.
We apply Lemmas 2, 3, and 4 for the aggregate functions
COUNT, AVG, and SUM, respectively. For example, we can
express the confidence interval for the size of S (expressed
as range(|S|)) by using Lemma 2 when f is COUNT:</p>
      <p>range(|S|) = [|D|(s − t′), |D|(s + t′)].   (15)</p>
      <p>We can calculate the upper bound of the deviation between D
and S by using the following form:</p>
      <p>√( ∑_{Z ∈ values(g)} ( max(f(π_m(σ_{g='Z'}(D))), range(f(π_m(σ_{g='Z'}(S))))) )² )   (16)</p>
      <p>where max(x, [a, b]) is a function that takes a value and a
range as arguments, and returns the distance from x to the value in
the range [a, b] farthest from x. On the other hand,
we can calculate the lower bound of the deviation by replacing
max in Eq. 16 with a function that returns the distance from x to the
value in the range [a, b] closest to x.</p>
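      <p>A minimal sketch of the bounds around Eq. 16, assuming exact per-group aggregates for D and interval estimates for S (all numbers hypothetical):</p>

```python
import math

def far(x, lo, hi):
    # max(x, [a, b]) in Eq. 16: the largest distance from x to the range.
    return max(abs(x - lo), abs(x - hi))

def near(x, lo, hi):
    # Its counterpart for the lower bound: the smallest distance from x.
    if lo <= x <= hi:
        return 0.0
    return min(abs(x - lo), abs(x - hi))

def deviation_bounds(exact_D, ranges_S):
    """Lower/upper bounds on the deviation, given exact per-group values
    for D and confidence-interval ranges for S (same group keys)."""
    ub = math.sqrt(sum(far(exact_D[z], *ranges_S[z]) ** 2 for z in exact_D))
    lb = math.sqrt(sum(near(exact_D[z], *ranges_S[z]) ** 2 for z in exact_D))
    return lb, ub

exact = {"store_a": 10.0, "store_b": 20.0}
ranges = {"store_a": (8.0, 14.0), "store_b": (19.0, 21.0)}
print(deviation_bounds(exact, ranges))
```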
    </sec>
    <sec id="sec-13">
      <title>4.3 Pruning data slices</title>
      <p>We prune the query evaluation for data slices that cannot
be within the top-k of largest deviation. The pruning works as
follows. We divide the dataset D into the samples and the
rest. We evaluate the given query q over all data slices S ∈ S for
the samples. We compute the upper bound and lower bound
of the deviations for the result of q. If the upper bound of
a data slice S is lower than the lower bound of the kth
highest data slice, we do not evaluate q over the data slice S
for the rest of the dataset. As a result, we can prune the query
evaluation for the data slices that do not become the top-k
result, with a high probability.</p>
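      <p>The pruning rule can be sketched as follows; the candidate slices and their deviation bounds are hypothetical.</p>

```python
def prune_candidates(bounds, k):
    """Section 4.3: keep a data slice only if its deviation upper bound
    reaches the lower bound of the k-th highest candidate.
    bounds: {slice_id: (lower, upper)} estimated from the samples."""
    if len(bounds) <= k:
        return set(bounds)
    kth_lower = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    return {s for s, (lo, hi) in bounds.items() if hi >= kth_lower}

bounds = {
    "gender=men":       (0.1, 0.3),   # surely similar to the whole data
    "category=clothes": (2.0, 2.5),
    "store=kyoto":      (1.5, 2.2),
    "year=2001":        (0.2, 1.6),
}
print(prune_candidates(bounds, k=2))
```

      <p>Only the surviving slices are evaluated over the rest of the dataset; here the "gender=men" slice is pruned because its upper bound cannot reach the top-2.</p>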
    </sec>
    <sec id="sec-14">
      <title>4.4 Multi-query Optimization</title>
      <p>By grouping the slicing queries for the data slices with
respect to the same attributes, the grouped queries can be
combined into a single query, which results in the reduction
of wasteful computation and a faster process. For example,
when calculating the monthly sales of products bought by
either males or females, the two queries can be combined into a
single query. By grouping the slicing queries, the
number of slicing queries becomes equal to the number
of dimension attributes |B|, and these grouped queries are
executed with a single scan.</p>
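      <p>The grouping idea can be sketched as follows; the toy records and attribute names are invented, and the single dictionary pass stands in for a grouped GROUP BY query executed in one scan.</p>

```python
from collections import defaultdict

# A separate slicing query per value of "gender" would scan the table once
# per value; grouping by (gender, month) computes every gender slice's view
# in a single scan.
records = [
    {"gender": "men",   "month": "Jan", "amount": 10},
    {"gender": "women", "month": "Jan", "amount": 30},
    {"gender": "men",   "month": "Feb", "amount": 50},
    {"gender": "women", "month": "Feb", "amount": 40},
]

def views_for_attribute(records, b, g="month", m="amount"):
    """One pass computes q(S) for every slice sigma_{b=Y}(D) at once,
    analogous to: SELECT gender, month, SUM(amount) ... GROUP BY gender, month."""
    views = defaultdict(lambda: defaultdict(float))
    for r in records:                       # single pass over the data
        views[r[b]][r[g]] += r[m]
    return {Y: dict(v) for Y, v in views.items()}

print(views_for_attribute(records, "gender"))
```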
    </sec>
    <sec id="sec-15">
      <title>5. PERFORMANCE EVALUATION</title>
      <p>The goal of our experiments is to evaluate the efficiency and
accuracy of our proposed algorithm in the data slice search.
We compared the performance of the naive approach (NO_OPT),
of combining all slicing queries for data slices into a
single query regardless of different attributes (ONE_QUERY),
of multi-query optimization (MULTI), and of the
combination of multi-query optimization and data slice pruning
by confidence intervals (COMB).</p>
    </sec>
    <sec id="sec-16">
      <title>5.1 Benchmark</title>
      <p>We measured the performance by using various OLAP
queries on a real dataset.</p>
      <p>Dataset: This dataset is sales data of supermarkets
provided by the Joint Association Study Group of Management
Science. The dataset is composed of measure attributes
(price and number) and dimension attributes (store (9), time
(19), sex (3), category 1 (8), category 2 (25), and category 3
(164)). The number in parentheses indicates the number
of unique values of the attribute. The number of records is
103,382,016.</p>
      <p>OLAP queries: We use price and number for aggregation
and store and time for group-by.</p>
      <p>Attributes for selecting data slices: We do not use the
same attribute both for the group-by operation and the slicing
condition at the same time, because the result is not useful
in such a case; a single value is sliced out, the group-by operation
is applied, and then only a single value is aggregated.</p>
      <p>All experiments were run on a single machine with 16 GB
RAM and a 4-core Intel(R) Core(TM) i7-4702MQ processor.
We evaluate our techniques on Microsoft SQL Server 2014.
We used nonclustered columnstore indexes for efficiency. We
removed extremely large values of measure attributes in
advance for the purpose of data cleaning. We chose samples
from the first part of the database. All experiments were
repeated three times and the measurements were averaged.</p>
    </sec>
    <sec id="sec-17">
      <title>5.2 Results</title>
      <p>Figure 2 shows the latencies of the top-10 data slice search
for evaluating the OLAP queries over the dataset. The
x-axis shows the OLAP queries, and the label indicates
aggregation attribute/group-by attribute. We observe that the
multi-query optimization and data slice pruning are
effective in improving the latencies for the dataset. In detail,
the data slice pruning optimization effectively improves the
latencies when the number of distinct values of the group-by
attribute is small, such as store (9). That is, the Euclidean
distance between long sequences tends to be more similar
than between short sequences, so we have more opportunities for
pruning the search space when the number of distinct values of the
group-by attribute is small.</p>
      <p>COMB takes a longer time than
MULTI when the aggregate attribute is price and the group-by
attribute is time. This is because the top-10 values of deviation
were not so large, and COMB prunes a small number of data
slices. Although ONE_QUERY executes the smallest
number of queries among the four approaches, it takes a longer
time than MULTI for all the OLAP queries. This result may
be explained by the fact that, since we used nonclustered
columnstore indexes, grouping slicing queries with respect to
the same single attribute provides the best performance.</p>
      <p>We ran scalability experiments by varying the size of the
dataset. As we can see in Figure 3, the
latency with the pruning optimization scales well
and is nearly constant even as the dataset size increases. This is
because the latency with the pruning optimization mainly
depends on the k of top-k; even if the data size increases, the
required search space depends on the top-k size.</p>
      <p>We ran experiments to verify the influence of k by
varying the value of k. As we can see in Figure 4, we observe
that the smaller the value of k, the better the efficiency of the
pruning optimization. This is because the threshold used for
pruning becomes large when the value of k becomes small,
so more data slices are pruned.</p>
      <p>Note that, to evaluate the accuracy of the data slice pruning
optimization, we validated that all the algorithms obtain the
same top-k data slices in all the experiments.</p>
    </sec>
    <sec id="sec-18">
      <title>6. RELATED WORK</title>
      <p>
        Visual analytics tools such as Spotfire [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Polaris [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
have been introduced. These tools support users' analyses
by selecting the best visualization for a data set.
Users must choose the data slices that they want to analyze.
      </p>
      <p>
        Visual analytics tools such as Profiler [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Vizdeck [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
SEEDB [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] have functions to select visualizations
automatically. Profiler automatically detects exceptions in a
data set. VizDeck has a function to depict visualizations of
a data set on a dashboard. SEEDB recommends a
visualization result of a data slice based on a distance function for
the given data slice. Sarawagi et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] study the exploration
of data cubes by using data mining techniques. They
explore exceptional and interesting cells (say, sales of clothes
in March 2014) in a cube. However, those techniques do
not explore exceptional q(D) (say, monthly sales of clothes
in 2014). Ordonez et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] incorporate parametric
statistical tests into the analysis for interactive visualization of the
OLAP cube. The visualization identifies significant
differences between two similar cuboids.
      </p>
    </sec>
    <sec id="sec-19">
      <title>7. CONCLUSION</title>
      <p>We propose a framework that automatically identifies data
slices that generate exceptional views for OLAP queries, and
an efficient algorithm that optimizes the data slice search.
The algorithm improves the performance by confidence
intervals and multi-query optimization. We used a real dataset
and validated that our algorithm runs
eight times faster than without the optimizations.</p>
      <p>There are various types of future work for the data slice
search. First, we will extend the system to compute not only the
deviation from the average of the whole data, but also the
deviation from the average of data clusters. Second, we will
introduce hierarchies between attributes, such as categories,
so that we can drill down and roll up the analysis results.
Finally, we will extend the variations of data slices by
permitting combinations of multiple attributes (conjunctive
conditions) as slicing conditions.</p>
    </sec>
    <sec id="sec-20">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by JSPS KAKENHI
Grant-in-Aid for Scientific Research (C) 16K00154.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahlberg</surname>
          </string-name>
          . Spotfire:
          <article-title>An information exploration environment</article-title>
          .
          <source>SIGMOD Rec</source>
          ,
          <volume>25</volume>
          (
          <issue>4</issue>
          ),
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buoncristiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          , E. Quintarelli,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          .
          <article-title>Database challenges for exploratory computing</article-title>
          .
          <source>SIGMOD Rec</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paepcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          .
          <article-title>Profiler: Integrated statistical analysis and visualization for data quality assessment</article-title>
          .
          <source>In Proc. AVI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Key</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Perry</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Aragon</surname>
          </string-name>
          .
          <article-title>VizDeck: Self-organizing dashboards for visual analytics</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madraky</surname>
          </string-name>
          .
          <article-title>Data mining textbook</article-title>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Morton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grossman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mackinlay</surname>
          </string-name>
          .
          <article-title>Support the data enthusiast: Challenges for next-generation data-analysis systems</article-title>
          .
          <source>Proc. VLDB Endow</source>
          ,
          <volume>7</volume>
          (
          <issue>6</issue>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>García-García</surname>
          </string-name>
          .
          <article-title>Interactive exploration and visualization of OLAP cubes</article-title>
          .
          <source>In Proc. ACM DOLAP Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Megiddo</surname>
          </string-name>
          .
          <article-title>Discovery-driven exploration of OLAP data cubes</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Multiple-query optimization</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ),
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Serfling</surname>
          </string-name>
          .
          <article-title>Probability inequalities for the sum in sampling without replacement</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stolte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Hanrahan</surname>
          </string-name>
          .
          <article-title>Polaris: A system for query, analysis, and visualization of multidimensional databases</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>51</volume>
          (
          <issue>11</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vartak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          .
          <article-title>SeeDB: Automatically generating query visualizations</article-title>
          .
          <source>Proc. VLDB Endow</source>
          ,
          <volume>7</volume>
          (
          <issue>13</issue>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vartak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          .
          <article-title>SeeDB: Efficient data-driven visualization recommendations to support visual analytics</article-title>
          .
          <source>Proc. VLDB Endow</source>
          ,
          <volume>8</volume>
          (
          <issue>13</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>