<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient Data Slice Search for Exceptional View Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yohei Mizuno</string-name>
          <email>mizuno.yohei@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuya Sasaki</string-name>
          <email>sasaki@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Makoto Onizuka</string-name>
          <email>onizuka@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Information Science and Technology, Osaka University</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Business analysts repeatedly execute OLAP queries consisting of aggregate/group-by operations until they find trends, insights, and/or outliers. One interesting analysis is to identify particular data slices (subsets of the original data) that generate exceptional views apart from the average view generated from the whole original data. However, since the search space for identifying such data slices is quite large, efficient techniques are indispensable for the data slice search. In this paper, we propose 1) a framework that automatically identifies data slices that generate exceptional views for OLAP queries, and 2) an efficient algorithm that optimizes the data slice search. The algorithm reduces the search space by employing confidence intervals and removes redundant query processing by applying multi-query optimization for evaluating queries over multiple data slices. The experiments validate that our algorithm runs eight times faster than without the optimizations.</p>
      </abstract>
      <kwd-group>
        <kwd>Exploratory analysis</kwd>
        <kwd>Probability theory</kwd>
        <kwd>OLAP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The need for effective analysis of business data is widely
recognized, and many data mining techniques have been
developed, such as online analytical processing (OLAP),
association rule mining, clustering, classification, graph mining,
and stream data mining [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In particular, OLAP is one of the most
frequently used techniques and largely contributes to
business data analysis; it is effectively used to find trends,
insights, and/or outliers in data. In typical analytical
processing, data analysts reveal hidden knowledge from data
by setting up hypotheses and testing them through
investigation of the data. However, a major issue of analytical
processing is that it is not easy for data analysts to set up
effective hypotheses, because it requires them to deeply
understand the data itself so that they can properly choose
interesting analytical dimensions (expressed by OLAP queries)
and/or data slices (subsets of data) for detailed analysis. So
a continuous process of trial and error is very common in
analytical processing.
      </p>
      <p>
        2017, Copyright is with the authors. Published in the Workshop
Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice,
Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is
permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
      </p>
      <p>
        The exploratory analysis is a promising research area for
solving the issue above [
        <xref ref-type="bibr" rid="ref12 ref13 ref2 ref6">2, 6, 12, 13</xref>
        ]. Morton et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
investigate systems that support data analysts. One challenging
goal is to develop data recommenders that would
automatically identify datasets outside of the system that could
improve the quality of analysis results. Buoncristiano et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
study a database exploration model assuming conversations
between exploratory users and the system. The
conversation goes on so that the system identifies queries that reveal
significant deviation from the uniform distribution. As a more
concrete example, SEEDB [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] automatically explores
various types of OLAP queries in a multidimensional space
and identifies the query q that maximizes the deviation between
q(D) and q(S), where D is the whole data and S is a given
data slice. However, SEEDB is not applicable to the
opposite case: searching for data slices that generate views that
deviate from the view generated from the whole data with
respect to a given OLAP query. An example of data slice
search is to identify a particular branch office (say, the Kyoto
office) or a certain year (say, 2001) in a sales database such that
its generated analysis result deviates from the one
generated from the whole data. Moreover, since the search space of
data slice search is huge, efficient techniques are
indispensable for the data slice search.
      </p>
      <p>
        To achieve efficient data slice search, we propose a
framework that automatically identifies data slices that generate
exceptional views for OLAP queries, and an efficient
algorithm that optimizes the data slice search. The algorithm
reduces the search space by employing confidence
intervals [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and removes redundant query processing by
applying multi-query optimization techniques when searching for
multiple data slices [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1 Running example</title>
      <p>We give a running example of business analysis. In sales
data analysis, it is important to identify particular products
whose sales trends deviate from the average sales of
all products. Figure 1 shows the normalized monthly sales of
all products, of products purchased by men, and of clothes. We
observe that the trend of the products purchased by men is
quite close to that of all products, which is not interesting.
In contrast, the trend of the clothes largely differs from that
of all products. We can see that the sales of the clothes
are affected by seasons, particularly in March. This suggests
that we should start a new discount service for clothes in
April so as to keep the clothes sales high.</p>
      <p>[Figure 1: Normalized monthly sales; the x-axis ranges over the months Feb 2014 through Dec 2014.]</p>
      <p>This paper is organized as follows. We describe an overview
of data slice search for exceptional view detection in Section
2 and confidence intervals derived from a probability
inequality in Section 3. We give the details of the efficient algorithm for
data slice search in Section 4. The performance evaluation
is given in Section 5. We describe related work in Section 6
and conclude this paper in Section 7.</p>
    </sec>
    <sec id="sec-3">
      <title>2. OVERVIEW OF DATA SLICE SEARCH</title>
      <p>In this section, we describe the problem of identifying data
slices that generate exceptional views for a given query. We
give the problem definition of data slice search in Section
2.1, and then show its procedures in Section 2.2.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1 Problem definition</title>
      <p>We focus on a relational database D. D is a collection of
records X_1, X_2, ..., X_|D|, where |D| is the number of
records of D. Each X_i (1 ≤ i ≤ |D|) is composed of measure
attributes (say, prices or amounts) and dimension attributes
B (say, product categories or store names). We denote the value
of measure attribute m of X_i as X_i^m. We define data slice
S as a subset of the D cuboid, sliced by choosing a single value
Y for b ∈ B as follows:</p>
      <p>S := σ_{b=Y}(D)   (1)</p>
      <p>where σ is the select operation in relational algebra. We call
this operation a slicing query. An OLAP query is given by
data analysts in advance. We define the set of data slices S
and OLAP query q as follows:</p>
      <p>S := ∪_{i=1}^{|B|} { σ_{b_i=Y}(D) | Y ∈ values(b_i) }   (2)</p>
      <p>q(S) := γ_{g, a=f(m)}(S)   (3)</p>
      <p>where |B| is the number of dimension attributes, values(b_i)
is the set of unique values of b_i, g is a dimension attribute for
group-by, m is a measure attribute for aggregation, a
is an aggregated value of m, and f is an aggregate function.
f calculates either the number of the values of m (COUNT),
the mean of the values of m (AVG), or the sum of the values
of m (SUM). γ groups the records using g and aggregates
the grouped values of m using f. The result is a
sequence of objects consisting of the unique values of g and
the aggregated values of m. The problem here is to identify
the top-k data slices in S that generate the most extreme
exceptional views (expressed as q(S)) for a given query q,
defined as follows.</p>
      <p>Definition 1: argmax^k_{S ∈ S} deviation(q(D), q(S))</p>
      <p>where the maximization is over S ∈ S and deviation is a function that measures the deviation
between the view for query q of D and that of data slice S.
Without loss of generality, we use Euclidean distance
for the deviation function (we can easily extend our technique
to apply other deviation functions such as Manhattan distance).
The procedure of the data slice search is as follows.</p>
      <p>1. To evaluate query q over the whole data D.</p>
      <p>2. To obtain the set of data slices S from D by choosing each
single value Y ∈ values(b_i) for each dimension b_i ∈ B.
Each S ∈ S is generated by executing slicing queries, such
as "gender = 'men'" and "category = 'clothes'".</p>
      <p>3. To evaluate query q over all data slices S ∈ S.</p>
      <p>4. To compute the deviation between q(D) and q(S).
For example, the Euclidean distance between the monthly
sales of all products and those of products purchased by men is computed.</p>
      <p>5. To display the top-k deviated views to the users.</p>
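      <p>The procedure above can be sketched in Python. This is a minimal illustration under assumptions, not the paper's implementation: the toy table, attribute names, and the choice of SUM as the aggregate function f are invented for the example, and the AVG/SUM normalization described later in Section 4.1 is omitted.</p>

```python
import math

# Toy table; records, attribute names, and values are illustrative only.
D = [
    {"gender": "men",   "category": "clothes", "month": "Jan", "amount": 10},
    {"gender": "women", "category": "clothes", "month": "Jan", "amount": 30},
    {"gender": "men",   "category": "food",    "month": "Feb", "amount": 20},
    {"gender": "women", "category": "food",    "month": "Feb", "amount": 40},
    {"gender": "men",   "category": "clothes", "month": "Feb", "amount": 50},
]

def q(records, g="month", m="amount"):
    # OLAP query of Eq. 3 with f = SUM: group by g and aggregate m.
    view = {}
    for r in records:
        view[r[g]] = view.get(r[g], 0) + r[m]
    return view

def deviation(view_d, view_s):
    # Euclidean distance between two views over the group-by values.
    keys = set(view_d) | set(view_s)
    return math.sqrt(sum((view_d.get(z, 0) - view_s.get(z, 0)) ** 2 for z in keys))

def topk_slices(D, dims=("gender", "category"), k=2):
    view_d = q(D)                                   # step 1: q over the whole data
    scored = []
    for b in dims:                                  # step 2: every slice sigma_{b=Y}(D)
        for Y in sorted({r[b] for r in D}):
            S = [r for r in D if r[b] == Y]
            scored.append(((b, Y), deviation(view_d, q(S))))  # steps 3-4
    scored.sort(key=lambda p: -p[1])
    return scored[:k]                               # step 5: top-k deviated slices

print(topk_slices(D))
```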
    </sec>
    <sec id="sec-5">
      <title>3. CONFIDENCE INTERVAL</title>
      <p>
        We apply the confidence interval to efficiently identify
top-k data slices when OLAP queries use aggregation functions.
In this section, we first describe how we apply the confidence
interval for the mean value derived from the Hoeffding-Serfling
inequality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The Hoeffding-Serfling inequality is useful for deriving an upper bound
on a probability without assuming the probability
distribution of the population, when the expected value of a
stochastic variable and its finite domain are given. From
the Hoeffding-Serfling inequality, the confidence interval for
the mean value of aggregate attribute m of D (expressed
as μ(D)) can be derived. Let {x_1, ..., x_|x|} and {X_1, ..., X_|D|}
be the samples and the population, and |x| and |D| be the sample size
and the population size, respectively. The Hoeffding-Serfling
inequality is modified to have the following form.</p>
      <p>Pr′(t) = Pr′[ |x̄ − μ(D)| ≥ t ]   (4)</p>
      <p>
        where x̄ = (1/|x|) ∑_{i=1}^{|x|} x_i^m and μ(D) = (1/|D|) ∑_{i=1}^{|D|} X_i^m. Since Eq. 4 is
derived from the Hoeffding-Serfling inequality by replacing
the x̄ − μ(D) part with its absolute value, the probability Pr′(t)
in Eq. 4 is twice the probability Pr(t) in the Hoeffding-Serfling
inequality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where Pr(t) is the probability that x̄
is at least t bigger than μ(D). Therefore, supposing that t &gt; 0, the upper
bound can be inferred by using the following equation.
      </p>
      <p>Pr′(t) ≤ 2 exp[ −2|x|t² / ((1 − f_x)(max − min)²) ] = δ   (5)</p>
      <p>where f_x = (|x| − 1)/|D|, min = min{X_i^m; i = 1, 2, ..., |D|}, and max =
max{X_i^m; i = 1, 2, ..., |D|}. From Eq. 4 and Eq. 5, the (1 − δ)
confidence interval for μ(D) can be derived as Eq. 6
(we set the significance level δ at 0.05 in the experiments):</p>
      <p>x̄ − t &lt; μ(D) &lt; x̄ + t   (6)</p>
      <p>where t = √( (1 − f_x)(max − min)²(log 2 − log δ) / (2|x|) ).</p>
      <p>Furthermore, we express the range of the confidence
interval in the following format.</p>
      <p>range(μ(D)) = [x̄ − t, x̄ + t].   (7)</p>
      <p>Next, we estimate the confidence interval for
the sum of the values of m (expressed as τ(D)). We can
estimate this interval by using the following form, obtained
by multiplying the terms in Eq. 6 by |D|.</p>
      <sec id="sec-5-1">
        <title>Lemma 1</title>
        <p>|D|(x̄ − t) &lt; τ(D) &lt; |D|(x̄ + t).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.1 Confidence interval for size of data slices</title>
      <p>We calculate aggregated values of m of data slices for the
problem of identifying data slices that generate exceptional
views for a given query. We need to estimate the aggregated
values from samples so as to reduce the data slice search
space. We derive expressions estimating confidence intervals
for the aggregated values of m of the data slice σ(D) ⊆ D.
Firstly, we derive the confidence interval for the size of σ(D)
(expressed as |σ(D)|). To estimate |σ(D)| using Lemma 1,
we define a function that judges whether an element x_i of the
samples belongs to σ(D) as follows:</p>
      <p>exist(x_i) = 1 if x_i ∈ σ(D), and 0 otherwise.   (8)</p>
      <p>Furthermore, we define the ratio (expressed as s) of the
elements belonging to σ(x) to the samples using exist as follows:</p>
      <p>s = |σ(x)| / |x| = (1/|x|) ∑_{i=1}^{|x|} exist(x_i).   (9)</p>
      <p>By the definition of exist, min in t equals zero and max in t
equals one in Eq. 6. Therefore, we can estimate the ratio of
σ(D) to D as follows:</p>
      <p>s − t′ &lt; |σ(D)|/|D| &lt; s + t′   (10)</p>
      <p>where t′ = √( (1 − f_x)(log 2 − log δ) / (2|x|) ). We can estimate the
confidence interval for |σ(D)| by using the following form, obtained
by multiplying Eq. 10 by |D|.</p>
      <sec id="sec-6-1">
        <title>Lemma 2</title>
        <p>|D|(s − t′) &lt; |σ(D)| &lt; |D|(s + t′).</p>
      </sec>
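      <p>A minimal sketch of Lemma 2 in Python, assuming hypothetical sample and population sizes:</p>

```python
import math

def slice_size_interval(ns, n, N, delta=0.05):
    """Lemma 2: bounds on |sigma(D)| from the fraction s of sampled
    records that fall in the slice (min = 0, max = 1 in Eq. 6).
    ns = |sigma(x)|, n = |x|, N = |D|."""
    s = ns / n                             # Eq. 9
    fx = (n - 1) / N
    t1 = math.sqrt((1 - fx) * (math.log(2) - math.log(delta)) / (2 * n))
    return N * (s - t1), N * (s + t1)      # |D|(s - t'), |D|(s + t')

lo, hi = slice_size_interval(300, 1000, 100000)
print(lo, hi)
```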
    </sec>
    <sec id="sec-7">
      <title>3.2 Confidence interval for mean of data slices</title>
      <p>We derive the confidence interval for the mean of m of
σ(D) (expressed as μ(σ(D))) by applying Lemma 2 to Eq. 6.
We can estimate the confidence interval for μ(σ(D)) as
follows, by setting the size of the population to |σ(D)| and the size
of the samples to |σ(x)|:</p>
      <p>x̄(σ(x)) − t″ &lt; μ(σ(D)) &lt; x̄(σ(x)) + t″   (11)</p>
      <p>where x̄(σ(x)) is the mean of σ(x) and t″ is the value obtained
by substituting |σ(x)| and the range of f_x below for
|x| and f_x in t = √( (1 − f_x)(max − min)²(log 2 − log δ) / (2|x|) ):</p>
      <p>(|σ(x)| − 1)/(|D|(s + t′)) &lt; f_x &lt; (|σ(x)| − 1)/(|D|(s − t′))</p>
      <p>where the range of f_x is obtained by substituting |σ(x)| and
the range of |σ(D)| in Lemma 2 for |x| and |D| in f_x,
respectively. Furthermore, t″ is removed from Eq. 11 as follows. t″
has range [t_min, t_max] because t is a function of f_x and f_x
has a range. When the size of the population equals |D|(s − t′)
and f_x = (|σ(x)| − 1)/(|D|(s − t′)), t″ takes the lower bound t_min. When the size
of the population equals |D|(s + t′) and f_x = (|σ(x)| − 1)/(|D|(s + t′)), t″ takes the
upper bound t_max. Therefore, Eq. 11 is rewritten as follows.</p>
      <p>Lemma 3: x̄(σ(x)) − t_max &lt; μ(σ(D)) &lt; x̄(σ(x)) + t_max</p>
      <p>where t_max = √( (1 − (|σ(x)| − 1)/(|D|(s + t′)))(max − min)²(log 2 − log δ) / (2|σ(x)|) ).</p>
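      <p>The computation of t_max in Lemma 3 can be sketched as follows. The inputs are hypothetical, and t′ is recomputed from Lemma 2 inside the function.</p>

```python
import math

def tmax(ns, n, N, vmin, vmax, delta=0.05):
    """Lemma 3: half-width t_max of the confidence interval for the mean
    of a slice, using the upper bound |D|(s + t') on the slice size.
    ns = |sigma(x)| (sampled records in the slice), n = |x|, N = |D|."""
    s = ns / n
    log_term = math.log(2) - math.log(delta)
    fx_all = (n - 1) / N
    t1 = math.sqrt((1 - fx_all) * log_term / (2 * n))   # t' from Lemma 2
    fx_hi = (ns - 1) / (N * (s + t1))                   # smallest f_x in its range
    return math.sqrt((1 - fx_hi) * (vmax - vmin) ** 2 * log_term / (2 * ns))

print(tmax(300, 1000, 100000, 0, 100))
```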
    </sec>
    <sec id="sec-8">
      <title>3.3 Confidence interval for sum of data slices</title>
      <p>We derive the confidence interval for the sum of m of σ(D)
(expressed as τ(σ(D))) by multiplying μ(σ(D)) by |σ(D)|.
When |σ(D)| takes its maximum (i.e., |D|(s + t′)),
t should equal t_max, and μ(σ(D)) takes its maximum
(i.e., x̄(σ(x)) + t_max) at the same time. Therefore, the
maximum of τ(σ(D)) is the value obtained by multiplying
the maximum of μ(σ(D)) by the maximum of |σ(D)|. On
the other hand, when |σ(D)| takes its minimum
(i.e., |D|(s − t′)), t should equal t_min, but μ(σ(D)) does not
take its minimum (i.e., x̄(σ(x)) − t_max). We define
x̄(σ(x)) − t_opt and |D|(s − t′_opt) as the values of μ(σ(D))
and |σ(D)|, respectively, when τ(σ(D)) takes its minimum.
We can estimate the confidence interval for τ(σ(D)) by using
the following form.</p>
      <p>Lemma 4: (x̄(σ(x)) − t_opt)(|D|(s − t′_opt)) &lt; τ(σ(D)) &lt; (x̄(σ(x)) + t_max)(|D|(s + t′))</p>
      <p>where t_opt = √( (1 − (|σ(x)| − 1)/(|D|(s − t′_opt)))(max − min)²(log 2 − log δ) / (2|σ(x)|) ), and t′_opt is the
value within the range −t′ &lt; t′_opt &lt; t′ (so that |D|(s − t′_opt) lies within the range of
Lemma 2) for which (x̄(σ(x)) − t_opt)(|D|(s − t′_opt)) takes its minimum.</p>
    </sec>
    <sec id="sec-10">
      <title>4. EFFICIENT ALGORITHM FOR DATA SLICE SEARCH</title>
      <p>The algorithm prunes the search space by employing
confidence intervals and removes redundant query processing
by applying multi-query optimization for evaluating queries
over multiple data slices.</p>
      <p>Pruning</p>
      <p>One characteristic of identifying data slices that
generate exceptional views is that most views generated from
data slices have a high similarity to the view
generated from the whole data. Since those data slices cannot
be within the top-k of largest deviation, it is preferable to prune
the query evaluation for those data slices. To achieve this
pruning, we use only samples and compute confidence
intervals from the samples for the data slice search. Specifically,
we describe how we calculate deviations (Section 4.1) and how we
estimate deviations from samples by computing the upper
bound and lower bound of the confidence intervals for the
aggregation of each data slice (Section 4.2). If the upper
bound of the query result for a data slice S is lower than the
lower bound of the kth highest data slice, we can safely prune
the query evaluation for the data slice S (Section 4.3).</p>
      <p>Multi-query Optimization (Section 4.4)</p>
      <p>In identifying data slices that generate exceptional views,
we have to evaluate a given query over multiple data slices.
This query processing is optimized by grouping the slicing
queries for the data slices with respect to the same attributes
and then evaluating the grouped queries during a single
scan.</p>
    </sec>
    <sec id="sec-11">
      <title>4.1 Calculating deviations</title>
      <p>We describe how to calculate the deviation between q(D)
and q(S) (i.e., deviation(q(D), q(S))), which is computed by
Euclidean distance. We define γ_{g, a=f(m)}(S) in Eq. 3 in the
following form:</p>
      <p>γ_{g, a=f(m)}(S) := { f(π_m(σ_{g='Z'}(S))) | Z ∈ values(g) }   (12)</p>
      <p>where values(g) is the set of the unique values of the
group-by attribute g. In other words, γ_{g, a=f(m)}(S) selects subsets
of S for each value Z of g, projects the measure attribute
m of the data slices, and applies the aggregate function f.
We can expand deviation(q(D), q(S)) in Definition 1 into the
following form by using Eq. 12:</p>
      <p>deviation({ f(π_m(σ_{g='Z'}(D))) | Z ∈ values(g) },
{ f(π_m(σ_{g='Z'}(S))) | Z ∈ values(g) }).   (13)</p>
      <p>Furthermore, Eq. 13 can be rewritten into the following form
because deviation calculates Euclidean distance:</p>
      <p>√( ∑_{Z ∈ values(g)} ( f(π_m(σ_{g='Z'}(D))) − f(π_m(σ_{g='Z'}(S))) )² ).   (14)</p>
      <p>That is, deviation(q(D), q(S)) selects subsets of D and S
for each value Z ∈ values(g), calculates the squares of the
differences between the aggregation results of σ_{g='Z'}(D) and
σ_{g='Z'}(S), and calculates the square root of the sum of the
squares. When f is AVG or SUM, we normalize q(D) and
q(S) so that the sum of q(D) and the sum of q(S) each equal one.</p>
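      <p>The normalized Euclidean deviation of Eq. 14 can be sketched as follows; the two toy SUM views and their values are invented for illustration.</p>

```python
import math

def normalize(view):
    # For AVG/SUM, scale the view so its values sum to one (Section 4.1).
    total = sum(view.values())
    return {z: v / total for z, v in view.items()}

def euclidean_deviation(qd, qs):
    # Eq. 14: square root of the sum of squared per-group differences.
    qd, qs = normalize(qd), normalize(qs)
    keys = set(qd) | set(qs)
    return math.sqrt(sum((qd.get(z, 0.0) - qs.get(z, 0.0)) ** 2 for z in keys))

# Monthly SUM views (hypothetical numbers): whole data vs. a "clothes" slice.
q_D = {"Jan": 100, "Feb": 100, "Mar": 200}
q_S = {"Jan": 10, "Feb": 10, "Mar": 80}
print(euclidean_deviation(q_D, q_S))
```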
    </sec>
    <sec id="sec-12">
      <title>4.2 Estimating deviations</title>
      <p>We calculate f(π_m(σ_{g='Z'}(D))) in Eq. 14 beforehand and
then estimate f(π_m(σ_{g='Z'}(S))) by using the samples of S.
We apply Lemmas 2, 3, and 4 for the aggregate functions
COUNT, AVG, and SUM, respectively. For example, we can
express the confidence interval for the size of S (expressed
as range(|S|)) by using Lemma 2 when f is COUNT:</p>
      <p>range(|S|) = [|D|(s − t′), |D|(s + t′)].   (15)</p>
      <p>We can calculate the upper bound of the deviation between D
and S by using the following form:</p>
      <p>√( ∑_{Z ∈ values(g)} ( max(f(π_m(σ_{g='Z'}(D))), range(f(π_m(σ_{g='Z'}(S))))) )² )   (16)</p>
      <p>where max(x, [a, b]) is a function that takes a value and a
range as arguments, and returns the distance from x to the value in
the range [a, b] farthest from x. On the other hand,
we can calculate the lower bound of the deviation by replacing
max in Eq. 16 with a function that returns the distance from x to the
value in the range [a, b] closest to x.</p>
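      <p>A minimal sketch of the bounds around Eq. 16, assuming exact per-group aggregates for D and interval estimates for S (all numbers hypothetical):</p>

```python
import math

def far(x, lo, hi):
    # max(x, [a, b]) in Eq. 16: the largest distance from x to the range.
    return max(abs(x - lo), abs(x - hi))

def near(x, lo, hi):
    # Its counterpart for the lower bound: the smallest distance from x.
    if lo <= x <= hi:
        return 0.0
    return min(abs(x - lo), abs(x - hi))

def deviation_bounds(exact_D, ranges_S):
    """Lower/upper bounds on the deviation, given exact per-group values
    for D and confidence-interval ranges for S (same group keys)."""
    ub = math.sqrt(sum(far(exact_D[z], *ranges_S[z]) ** 2 for z in exact_D))
    lb = math.sqrt(sum(near(exact_D[z], *ranges_S[z]) ** 2 for z in exact_D))
    return lb, ub

exact = {"store_a": 10.0, "store_b": 20.0}
ranges = {"store_a": (8.0, 14.0), "store_b": (19.0, 21.0)}
print(deviation_bounds(exact, ranges))
```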
    </sec>
    <sec id="sec-13">
      <title>4.3 Pruning data slices</title>
      <p>We prune the query evaluation for data slices that cannot
be within the top-k of largest deviation. The pruning works as
follows. We divide the dataset D into the samples and the
rest. We evaluate the given query q over all data slices S ∈ S for
the samples. We compute the upper bound and lower bound
of the deviations for the result of q. If the upper bound of
a data slice S is lower than the lower bound of the kth
highest data slice, we do not evaluate q over the data slice S
for the rest of the dataset. As a result, we can prune the query
evaluation for the data slices that do not become the top-k
result, with a high probability.</p>
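      <p>The pruning rule can be sketched as follows; the candidate slices and their deviation bounds are hypothetical.</p>

```python
def prune_candidates(bounds, k):
    """Section 4.3: keep a data slice only if its deviation upper bound
    reaches the lower bound of the k-th highest candidate.
    bounds: {slice_id: (lower, upper)} estimated from the samples."""
    if len(bounds) <= k:
        return set(bounds)
    kth_lower = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    return {s for s, (lo, hi) in bounds.items() if hi >= kth_lower}

bounds = {
    "gender=men":       (0.1, 0.3),   # surely similar to the whole data
    "category=clothes": (2.0, 2.5),
    "store=kyoto":      (1.5, 2.2),
    "year=2001":        (0.2, 1.6),
}
print(prune_candidates(bounds, k=2))
```

      <p>Only the surviving slices are evaluated over the rest of the dataset; here the "gender=men" slice is pruned because its upper bound cannot reach the top-2.</p>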
    </sec>
    <sec id="sec-14">
      <title>4.4 Multi-query Optimization</title>
      <p>By grouping the slicing queries for the data slices with
respect to the same attributes, the grouped queries can be
combined into a single query, which results in the reduction
of wasteful computation and a faster process. For example,
when calculating the monthly sales of products bought by
either males or females, the two queries can be combined into a
single query. By grouping the slicing queries, the
number of slicing queries becomes equal to the number
of dimension attributes |B|, and these grouped queries are
executed with a single scan.</p>
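      <p>The grouping idea can be sketched as follows; the toy records and attribute names are invented, and the single dictionary pass stands in for a grouped GROUP BY query executed in one scan.</p>

```python
from collections import defaultdict

# A separate slicing query per value of "gender" would scan the table once
# per value; grouping by (gender, month) computes every gender slice's view
# in a single scan.
records = [
    {"gender": "men",   "month": "Jan", "amount": 10},
    {"gender": "women", "month": "Jan", "amount": 30},
    {"gender": "men",   "month": "Feb", "amount": 50},
    {"gender": "women", "month": "Feb", "amount": 40},
]

def views_for_attribute(records, b, g="month", m="amount"):
    """One pass computes q(S) for every slice sigma_{b=Y}(D) at once,
    analogous to: SELECT gender, month, SUM(amount) ... GROUP BY gender, month."""
    views = defaultdict(lambda: defaultdict(float))
    for r in records:                       # single pass over the data
        views[r[b]][r[g]] += r[m]
    return {Y: dict(v) for Y, v in views.items()}

print(views_for_attribute(records, "gender"))
```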
    </sec>
    <sec id="sec-15">
      <title>5. PERFORMANCE EVALUATION</title>
      <p>The goal of our experiments is to evaluate the efficiency and
accuracy of our proposed algorithm in the data slice search.
We compared the performance of the naive approach (NO_OPT),
of combining all slicing queries for data slices into a
single query regardless of different attributes (ONE_QUERY),
of multi-query optimization (MULTI), and of the
combination of multi-query optimization and data slice pruning
by confidence intervals (COMB).</p>
    </sec>
    <sec id="sec-16">
      <title>5.1 Benchmark</title>
      <p>We measured the performance by using various OLAP
queries on a real dataset.</p>
      <p>Dataset: This dataset is sales data of supermarkets
provided by the Joint Association Study Group of Management
Science. The dataset is composed of measure attributes
(price and number) and dimension attributes (store (9), time
(19), sex (3), category 1 (8), category 2 (25), and category 3
(164)). The number in parentheses indicates the number
of unique values of the attribute. The number of records is
103,382,016.</p>
      <p>OLAP queries: We use price and number for aggregation
and store and time for group-by.</p>
      <p>Attributes for selecting data slices: We do not use the
same attribute both for the group-by operation and the slicing
condition at the same time, because the result is not useful
in such a case; a single value is sliced out, the group-by operation
is applied, and then only a single value is aggregated.</p>
      <p>All experiments were run on a single machine with 16 GB
RAM and a 4-core Intel(R) Core(TM) i7-4702MQ processor.
We evaluate our techniques on Microsoft SQL Server 2014.
We used nonclustered columnstore indexes for efficiency. We
removed extremely large values of measure attributes in
advance for the purpose of data cleaning. We chose samples
from the first part of the database. All experiments were
repeated three times and the measurements were averaged.</p>
    </sec>
    <sec id="sec-17">
      <title>5.2 Results</title>
      <p>Figure 2 shows the latencies of the top-10 data slice search
for evaluating the OLAP queries over the dataset. The
x-axis shows the OLAP queries, and the label indicates
aggregation attribute/group-by attribute. We observe that the
multi-query optimization and data slice pruning are
effective in improving the latencies for the dataset. In detail,
the data slice pruning optimization effectively improves the
latencies when the number of distinct values of the group-by
attribute is small, such as store (9). That is, the Euclidean
distance between long sequences tends to be more similar
than between short sequences, so we have more opportunities for
pruning the search space when the number of distinct values of the
group-by attribute is small.</p>
      <p>COMB takes a longer time than
MULTI when the aggregate attribute is price and the group-by
attribute is time. This is because the top-10 values of deviation
were not so large, and COMB prunes a small number of data
slices. Although ONE_QUERY executes the smallest
number of queries among the four approaches, it takes a longer
time than MULTI for all the OLAP queries. This result may
be explained by the fact that, since we used nonclustered
columnstore indexes, grouping slicing queries with respect to
the same single attribute provides the best performance.</p>
      <p>We ran scalability experiments by varying the size of the
dataset. As we can see in Figure 3, the
latency with the pruning optimization scales well
and is nearly constant even as the dataset size increases. This is
because the latency with the pruning optimization mainly
depends on the k of top-k; even if the data size increases, the
required search space depends on the top-k size.</p>
      <p>We ran experiments to verify the influence of k by
varying the value of k. As we can see in Figure 4, we observe
that the smaller the value of k, the better the efficiency of the
pruning optimization. This is because the threshold used for
pruning becomes large when the value of k becomes small,
so more data slices are pruned.</p>
      <p>Note that, to evaluate the accuracy of the data slice pruning
optimization, we validated that all the algorithms obtain the
same top-k data slices in all the experiments.</p>
    </sec>
    <sec id="sec-18">
      <title>6. RELATED WORK</title>
      <p>
        Visual analytics tools such as Spotfire [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Polaris [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
have been introduced. These tools support users' analyses
by selecting the best visualization for a data set.
Users must choose the data slices that they want to analyze.
      </p>
      <p>
        Visual analytics tools such as Profiler [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Vizdeck [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
SEEDB [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] have functions to select visualizations
automatically. Profiler automatically detects exceptions in a
data set. VizDeck has a function to depict visualizations of
a data set on a dashboard. SEEDB recommends a
visualization result of a data slice based on a distance function for
the given data slice. Sarawagi et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] study the exploration
of data cubes by using data mining techniques. They
explore exceptional and interesting cells (say, sales of clothes
in March 2014) in a cube. However, those techniques do
not explore exceptional q(D) (say, monthly sales of clothes
in 2014). Ordonez et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] incorporate parametric
statistical tests into the analysis for interactive visualization of the
OLAP cube. The visualization identifies significant
differences between two similar cuboids.
      </p>
    </sec>
    <sec id="sec-19">
      <title>7. CONCLUSION</title>
      <p>We propose a framework that automatically identifies data
slices that generate exceptional views for OLAP queries, and
an efficient algorithm that optimizes the data slice search.
The algorithm improves the performance by confidence
intervals and multi-query optimization. We used a real dataset
and validated that our algorithm runs
eight times faster than without the optimizations.</p>
      <p>There are various types of future work for the data slice
search. First, we will extend the system to compute not only the
deviation from the average of the whole data, but also the
deviation from the average of data clusters. Second, we will
introduce hierarchies between attributes, such as categories,
so that we can drill down and roll up the analysis results.
Finally, we will extend the variations of data slices by
permitting combinations of multiple attributes (conjunctive
conditions) as slicing conditions.</p>
    </sec>
    <sec id="sec-20">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by JSPS KAKENHI
Grant-in-Aid for Scientific Research (C) 16K00154.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahlberg</surname>
          </string-name>
          . Spotfire:
          <article-title>An information exploration environment</article-title>
          .
          <source>SIGMOD Rec</source>
          ,
          <volume>25</volume>
          (
          <issue>4</issue>
          ),
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buoncristiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          , E. Quintarelli,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          .
          <article-title>Database challenges for exploratory computing</article-title>
          .
          <source>SIGMOD Rec</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paepcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          .
          <article-title>Profiler: Integrated statistical analysis and visualization for data quality assessment</article-title>
          .
          <source>In Proc. AVI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Key</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Perry</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Aragon</surname>
          </string-name>
          .
          <article-title>VizDeck: Self-organizing dashboards for visual analytics</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madraky</surname>
          </string-name>
          .
          <article-title>Data mining textbook</article-title>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Morton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grossman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mackinlay</surname>
          </string-name>
          .
          <article-title>Support the data enthusiast: Challenges for next-generation data-analysis systems</article-title>
          .
          <source>Proc. VLDB Endow</source>
          ,
          <volume>7</volume>
          (
          <issue>6</issue>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>García-García</surname>
          </string-name>
          .
          <article-title>Interactive exploration and visualization of OLAP cubes</article-title>
          .
          <source>In Proc. ACM DOLAP Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarawagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Megiddo</surname>
          </string-name>
          .
          <article-title>Discovery-driven exploration of OLAP data cubes</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Multiple-query optimization</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ),
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Serfling</surname>
          </string-name>
          .
          <article-title>Probability inequalities for the sum in sampling without replacement</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stolte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Hanrahan</surname>
          </string-name>
          .
          <article-title>Polaris: A system for query, analysis, and visualization of multidimensional databases</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>51</volume>
          (
          <issue>11</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vartak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          .
          <article-title>SeeDB: Automatically generating query visualizations</article-title>
          .
          <source>Proc. VLDB Endow</source>
          ,
          <volume>7</volume>
          (
          <issue>13</issue>
          ),
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vartak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Polyzotis</surname>
          </string-name>
          .
          <article-title>SeeDB: Efficient data-driven visualization recommendations to support visual analytics</article-title>
          .
          <source>Proc. VLDB Endow</source>
          ,
          <volume>8</volume>
          (
          <issue>13</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>