=Paper=
{{Paper
|id=Vol-2322/DARLIAP_12
|storemode=property
|title=Data Slice Search for Local Outlier View Detection: A Case Study in Fashion EC
|pdfUrl=https://ceur-ws.org/Vol-2322/DARLIAP_12.pdf
|volume=Vol-2322
|authors=Takumi Matsumoto,Yuya Sasaki,Makoto Onizuka
|dblpUrl=https://dblp.org/rec/conf/edbt/Matsumoto0O19
}}
==Data Slice Search for Local Outlier View Detection: A Case Study in Fashion EC==
<pdf width="1500px">https://ceur-ws.org/Vol-2322/DARLIAP_12.pdf</pdf>
<pre>
     Data Slice Search for Local Outlier View Detection: A Case
                        Study in Fashion EC
                                        Takumi Matsumoto, Yuya Sasaki, Makoto Onizuka
                                     {matsumoto.takumi,sasaki,onizuka}@ist.osaka-u.ac.jp
                     Graduate School of Information Science and Technology, Osaka University, Osaka, Japan

ABSTRACT
    Exploratory data analysis is one of the current research
trends in business data analysis; it identifies interesting
results by executing a large number of OLAP queries.
The queries are generated by changing the analytical viewpoints
(aggregation attribute or group-by attribute) and/or target data
slices (data subsets extracted by the select operation). Existing
research effectively detects globally unexpected trends (global
outliers); however, it cannot detect locally unexpected trends
(local outliers), which are known to be useful in many applications.
In this paper, we describe an analysis framework named D4C.                     Figure 1: Monthly sales of items: X axis and Y axis indicate
D4C detects top-n data slices that generate local outlier results of            months and normalized sales sum, respectively. There are
automatically generated OLAP queries. We also introduce how                     two clusters (yellow lines and blue lines). Item A (green
to use the analysis results in a practical use case of fashion EC.              line) is a global outlier and Item F (red line) is a local
Since there are various types of users and items at a fashion EC site,          outlier.
it is useful to investigate the bias of the sales trend and identify
meaningful results. We also show how the analysis results help
in making decisions on sales strategies.                                        given a data slice S of D, we identify query q among various OLAP
                                                                                queries that maximizes the deviation between q(D) and q(S). In
KEYWORDS                                                                        contrast, in the data slice search case, given OLAP query q, we
    Exploratory data analysis, OLAP, Outlier detection                          identify data slice S among various data slices that maximizes the
                                                                                deviation between q(D) and q(S). Both are therefore effective at detecting
1    INTRODUCTION                                                               globally unexpected trends (global outliers) by computing the
                                                                                distance between multiple query results; however, they cannot
   Many companies have collected or accumulated enormous                        detect locally unexpected trends (local outliers). The local outlier
and diversified data. Data analysts start their analysis work by                factor [2] is a well-known concept in many application areas,
extracting useful information (insight and exceptional data) from               such as in fraud detection by detecting unusual usage of credit
collected and accumulated data for decision making to make                      cards, in customized marketing for identifying the unexpected
social or economic impact. Generally, in a business data anal-                  behavior of customers, or in medical analysis for finding unusual
ysis work, OLAP (online analytical processing) technology is                    responses to various medical treatments [6].
frequently used with visualization tools, such as Tableau [1, 19].                 In this paper, we describe D4C (Dimensionally Deviated Divi-
The analysis workflow consists of two steps, (1) issuing a query                sional Data Captor), a framework for automatically identifying
that specifies an analysis pattern and a target data slice (data                top-n local outliers among OLAP query results. D4C automati-
subset extracted by the select operation), and (2) investigating                cally generates OLAP queries from a query template specified
the query result (view). An analysis pattern is expressed with a                by users, executes those queries, and then identifies unexpected
combination of group-by attribute, measure attribute, and aggre-                trends by computing LOF value for each query result.
gate function. A target data slice is specified by WHERE clause.                   As an application example of D4C, consider a case that a sales
However, the analysis work is a heavy burden for the analyst be-                strategy is decided based on the sales trend over the last six
cause the number of OLAP queries increases as the cardinality                   months. Fig. 1 shows the sales trend of ten items (data slices), A
of data and the number of columns increase, and the number                      to J. Each line represents the monthly sales of one item, and the
of times the analyst repeatedly performs (1) and (2) accordingly                black dashed line represents the average monthly sales over
increases.                                                                      all items. This example has two clusters: (1) items that sell well
   To solve the above problem, exploratory data analysis is a                   during winter (A, B, C, D, and E) and (2) items that sell well during
promising research area [7, 8, 13, 17, 18, 20, 22]. In these studies,           spring (F, G, H, I, and J). Item A is a global outlier because it
analysis axes and data slices that generate exceptional query                   deviates largely from the average. Notice that item F (red line)
results are automatically identified so that analysts can easily                has the smallest global outlier factor among the items; however,
find interesting views. They are categorized into two types of                  it has the largest local outlier factor among the items that are
dual data analysis over OLAP queries, data slice search and query               mainly sold in the spring season. Therefore, it may be possible to
search. Let D be a database for analysis. In the query search case,             increase the sales of item F by investigating reasons why item F
© 2019 Copyright held by the author(s). Published in the Workshop Proceedings   is a local outlier and by changing the sales strategy according to
of the EDBT/ICDT 2019 Joint Conference(March 26, 2019, Lisbon, Portugal) on     the investigation (e.g., a discount sale should be made for item F
CEUR-WS.org.
                                                                                in April, since its sales is deviated down from the yellow cluster
                                                                                in April).
                                                                        3     OUTLIER DETECTION FOR DATA SLICES
                                                                           We extend the technique of the exceptional view detection [13]
                                                                        with the notion of the local outlier factor. We describe a problem
                                                                        of identifying data slices that generate views with the largest
                                                                        LOF values for a given query template. We give our problem
                                                                        definition in Section 3.1. Then, in Section 3.2, we present our
                                                                        framework to solve the problem.

                                                                        3.1      Problem definition
                                                                           Let D be a set of records and C be a set of dimension attributes
                  Figure 2: D4C Architecture
                                                                        in a database. We define data slice S as a subset of D cuboid sliced
                                                                        by choosing a single value Y for dimension attribute c ∈ C as
                                                                        follows:
   We apply D4C to the sales data provided by one of the
largest fashion EC sites in Japan. The data contains transactions                     S := \sigma_{c=Y}(D)                     (6)
from April 2015 to March 2016 at the site. The analysis results
give us interesting observations and help in making decisions on        A query template is given by data analysts in advance. We define
sales strategies.                                                       a set of data slices \mathcal{S} and query template q as follows:
Organization This paper is organized as follows. We explain
local outlier factor as preliminaries in Section 2. Then, we describe   \mathcal{S} := \bigcup_{i=1}^{|C|} \{\sigma_{c_i=Y}(D) \mid Y \in values(c_i)\}   (7)
outlier detection for data slices in Section 3. We give analysis
results for the sales data of a fashion EC site by applying D4C in
Section 4. We describe related work in Section 5 and conclude                         q(S) := {}_{g} G_{f(m)}(S)               (8)
this paper in Section 6.
                                                                        where values(c_i) is the set of unique values of dimension attribute
                                                                        c_i ∈ C, g is a dimension attribute for the group-by operation, m
2   LOCAL OUTLIER FACTOR                                                is a measure attribute for the aggregate function, and f is an ag-
    As a preliminary, we introduce an outlier detection technique,      gregate function. G groups the records using g and aggregates
local outlier factor (LOF) [2]. LOF is based on the idea of local       the grouped values of m using f. Since query result q(S) is a
density in multi-dimensional space. For each data point, we can         sequence of pairs <unique value of g, aggregated value of m>, we
compute an LOF value that indicates its outlierness among its near-     introduce function t: sequence<K,V> → point(V[N]) so that we
est neighbors. Intuitively, the LOF value of data point A is high       map a query result to a data point in the N-dimensional space
if its local density is low and those of A’s nearest neighbors are      for computing the LOF value.
high. Consider data point A, which is represented by a tuple of N
positive real numbers a_i (1 ≤ i ≤ N):                                      Definition 1. The problem here is to identify the top-n data
                                                                        slices in \mathcal{S} that generate views with the largest LOF values for a given
                    A := [a_1, a_2, \cdots, a_N]                  (1)   query template q, defined as follows:
We denote by N_k(A) the set of k or more nearest neighbors of A,            \operatorname{argmax}^{n}_{S \in \mathcal{S}} LOF(t(q(S)))
which is defined as follows:

    N_k(A) := \{B \in P - \{A\} \mid d(A, B) \le k\text{-distance}(A)\}   (2)       Example 3.1. Recall the analysis example in Fig. 1. In this
                                                                        case, ItemName is used as the dimension attribute, so \mathcal{S} = { “Item A”,
where P is a set of data points, d(A, B) is the Euclidean distance      “Item B”, ..., “Item J” }. Since the X axis and Y axis indicate months
between two data points A and B, and k-distance(A) is the Euclidean     and normalized sales sum, respectively, q(S) is expressed as:
distance between A and the kth closest data point to A. The LOF
value of A is defined as follows:                                                   q(S) = {}_{month} G_{sum(sales)}(S)
    LOF(A) := \frac{\sum_{B \in N_k(A)} lrd_k(B) / |N_k(A)|}{lrd_k(A)}   (3)
                                                                        3.2      D4C Framework
                                                                           We develop D4C, a framework for automatically identifying top-n
That is, LOF(A) is the ratio of the average density of A’s nearest      local outliers among OLAP query results, on top of a relational
neighbors (B ∈ N_k(A)) to A’s density. The density of A, lrd_k(A),      database. Fig. 2 depicts the D4C architecture. D4C automati-
is the inverse of the average reachable distance from A’s k nearest     cally generates and executes OLAP queries from a query template
neighbors to A, which is defined as follows:                            specified by users, and then identifies top-n local outliers among
    lrd_k(A) := \frac{|N_k(A)|}{\sum_{B \in N_k(A)} \text{reach-dist}_k(A, B)}   (4)   the OLAP query results by computing the LOF value of each query
                                                                        result. The data analysis by D4C consists of three components:
                                                                              Query Generator Given that users specify query template
where reach-dist_k(A, B) is the reachable distance from B to A de-              q, the query generator enumerates the data slices \mathcal{S}, instan-
fined next:                                                                     tiates OLAP queries ({q(S) | S ∈ \mathcal{S}}) by combining each
                                                                                data slice S with the query template q, and then executes
    \text{reach-dist}_k(A, B) := \max\{d(A, B), k\text{-distance}(B)\}   (5)
                                                                                those queries.
In Equation (5), k-distance(B) is a term for reducing statistical             Outlier Detector The outlier detector computes the LOF
fluctuation [2]. Additionally, the value of k should be 10 or more              value for each query result and identifies top-n outliers
to remove unwanted statistical fluctuations.                                    based on the LOF values.
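
The definitions in Equations (2) to (5) can be sketched in a few lines of Python. This is a minimal illustration and not the D4C implementation: the function name lof_scores, the toy points, and k = 3 are assumptions made for this sketch (the text above recommends k of 10 or more), and the code assumes that no two points coincide, because duplicated points would make a sum of reachable distances zero.

```python
import math

def lof_scores(points, k):
    """Return the LOF value of every point, following Eqs. (2)-(5)."""
    n = len(points)
    # d(A, B): pairwise Euclidean distances.
    dist = [[math.dist(a, b) for b in points] for a in points]
    # k-distance(A): distance from A to its k-th closest other point.
    k_dist = [sorted(dist[i][j] for j in range(n) if j != i)[k - 1]
              for i in range(n)]
    # N_k(A), Eq. (2): every other point within k-distance(A); ties may
    # yield more than k neighbors.
    neigh = [[j for j in range(n) if j != i and dist[i][j] <= k_dist[i]]
             for i in range(n)]
    # lrd_k(A), Eq. (4), using reach-dist_k(A, B) from Eq. (5).
    lrd = [len(neigh[i]) / sum(max(dist[i][j], k_dist[j]) for j in neigh[i])
           for i in range(n)]
    # LOF(A), Eq. (3): average lrd of the neighbors divided by lrd of A.
    return [sum(lrd[j] for j in neigh[i]) / len(neigh[i]) / lrd[i]
            for i in range(n)]

# Hypothetical toy data: two tight clusters and one point between them.
points = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1),            # cluster 1
          (20, 20), (22, 20), (20, 22), (22, 22), (21, 21),  # cluster 2
          (14, 14)]                                          # candidate local outlier
scores = lof_scores(points, k=3)

# Distance of each point from the global average, for comparison.
center = [sum(c) / len(points) for c in zip(*points)]
global_dev = [math.dist(p, center) for p in points]

# Indices of the top-n (here n = 2) LOF values.
top_n = sorted(range(len(points)), key=lambda i: -scores[i])[:2]
```

In this toy data, the last point lies closest to the global average yet receives the largest LOF value, which mirrors the situation of item F in Fig. 1.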
                                                                        values by the aggregate function for the aggregate attribute. Each
                                                                        aggregated value is normalized as the ratio, so that the trend of
                                                                        each data slice can be equally compared on the same scale.

                                                                        4     DATA ANALYSIS
                                                                           In this section, we report the analysis results obtained by ap-
                                                                        plying D4C to the sales data provided by one of the largest
                                                                        fashion EC sites in Japan. We observe that D4C automatically
                                                                        finds interesting results that are not easily obtained with traditional
                                                                        analysis tools, which require manual labor. We also discuss how
                                                                        the results can help in making decisions on sales strategies.


                                                                        4.1     Dataset and query templates
Figure 3: A visualized result in a 3-dimensional Euclidean                 Fashion EC dataset: We used a dataset of transactions made
space. Top 20 data slices with large LOF values are colored             at a fashion EC, which is provided through Joint Association
in red.                                                                 Study Group of Management Science. The data of the fashion EC
                                                                        contains 1,111,365 records obtained from April 2015 to March
                                                                        2016 and its size is 5,384 MB. Each record contains the attributes
     View Generator The view generator generates views of the           of the purchased item, the purchased user type, purchased date,
       query results identified as top-n outliers and then displays     discount rate, and the questionnaire result. The item type in-
       them on the system.                                              cludes its sales price, category, color, brand, provider (shop), and
Query Generator                                                         size. The user type includes his/her sex, age, and living place
    Users can generate various types of OLAP queries by changing        (prefecture and region).
query templates. They specify a query template, q = {}_{g} G_{f(m)} (di-    Query templates: Table 1 shows five query templates used in
mension attribute g, measure attribute m, and aggregate function        the experiments. The dimensions [#] column indicates the car-
f), and a set of dimension attributes, C, for extracting data slices.   dinality of the Group-by attribute column, which is the number of
                                                                        unique values of Group-by attribute. The data slices [#] col-
   Example 3.2. Remember again the example in Fig. 1. OLAP              umn indicates the cardinality of the attributes used for extracting
queries are {q(S) | S ∈ S}. One of the OLAP queries is expressed        Data slice. We set the number of nearest neighbors k of LOF at
by an SQL statement as follows:                                         10 by following the tips described in [2].
        SELECT Month, SUM(Sales)
        FROM D                                                          4.2     Result of Analysis
        WHERE ItemName = 'Item A'
                                                                           We show the analysis results and describe interesting observa-
        GROUP BY Month;
                                                                        tions that may help making decisions on sales strategies. Figs. 4,
                                                                        5, 6, 7, and 8 depict the identified local outliers with their nearest
Outlier Detector                                                        neighbors. They are generated from the five query templates in
   For each generated OLAP query for data slice S, D4C retrieves        Table 1. In each figure, the identified local outlier is colored by
the query result from the database. This component identifies           red, ten data slices in the neighbor of the identified local outlier
top-n local outliers by computing LOF value for each query result,      are colored by orange, and the average of all data slices is colored
q(S) where S ∈ S. The local outlier detection is performed for          by blue.
data points in a N dimensional Euclidean space (N represents               Q1: Which Category is the most Local Outlier in the av-
the number of the values of the group-by attribute). That is,           erage sales grouped-by Prefecture? The result is depicted in
this component transforms each query result with N aggregated           Fig. 4. We observe that “trash box” category is identified as the
values to a data point in N dimensional Euclidean space, and            most local outlier among other categories. The figure shows that
then computes the LOF value for each data point. For instance,          there are exceptional values in several prefectures. For example,
the number of dimensions of Euclidean space N is 12 in the              “trash box” sales both in Osaka and Fukushima (16th and 4th
case where the group-by attribute is “sales month” (“January”,          values from the right on X axis in Fig. 4, respectively) are higher
“February”, · · · , “December”). Fig. 3 depicts top-20 local outliers   than their nearest neighbors, while in Kumamoto it is lower than
in the three dimensional Euclidean space for data slices whose          its nearest neighbors. This result is explained by the fact of the
number of values of the group-by attribute is three. From this          annual emission of garbage per person reported by the Ministry
figure, we observe that the data points with high LOF values are        of the Environment (footnote 1). Osaka was ranked at the top in the emission
not far from the average position of all data points, but              of garbage in 2014. Fukushima was always ranked in the top 5 from
deviate from their nearest neighbors.                                   2014 to 2017. In contrast, Kumamoto was ranked at a very low
View Generator                                                          level. Thus, the outlier detection reveals some hidden knowledge
   This component generates views that visualize the query re-          from the given dataset. Another observation we found is that
sults with top-n highest LOF values in line/column charts. For          the sales of “trash box” in Gunma and Niigata are relatively low,
reference, we also visualize the nearest neighbors of the outliers      although these prefectures were ranked in the top 10 for the emission
to show how the outliers deviate from their nearest neigh-              of garbage per person. This investigation implies that there is a
bors. In these views, the horizontal axis denotes the values of the
                                                                        1 https://www.env.go.jp/recycle/waste_tech/ippan
group-by attribute and the vertical axis denotes the aggregated
                                                          Table 1: Query templates

           Query Pattern       Group-by attribute        Aggregate function / attribute         Dimensions [#]      data slices [#]
           Q1                  Prefecture                AVG / Sales Price                                  47                  226
           Q2                  Purchased date            SUM / Sales Price                                  12                  226
           Q3                  Prefecture                COUNT / Order                                      47                  226
           Q4                  Age                       SUM / Sales Price                                   6                  226
           Q5                  Region                    COUNT / Order                                       8                  226
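
The query generation step of Section 3.2 and the mapping t of Section 3.1 can be illustrated as follows. This is a hedged sketch rather than the D4C implementation: it uses an in-memory SQLite database, the helper names build_query and slice_views are hypothetical, table D and its rows are made up, groups missing from a query result are filled with 0, and "normalized as the ratio" is assumed to mean dividing each aggregated value by the total of its data slice. Attribute names are interpolated into the SQL text, so they are assumed to come from a trusted query template.

```python
import sqlite3

def build_query(table, group_by, agg, measure, slice_attr):
    """Instantiate q(S) for S = sigma_{slice_attr = ?}(table) as parameterized SQL."""
    return (f"SELECT {group_by}, {agg}({measure}) FROM {table} "
            f"WHERE {slice_attr} = ? GROUP BY {group_by} ORDER BY {group_by}")

def slice_views(conn, table, group_by, agg, measure, slice_attrs):
    """Run q(S) for every data slice and map each result to a normalized point."""
    # The N unique values of the group-by attribute fix the N dimensions.
    groups = [r[0] for r in conn.execute(
        f"SELECT DISTINCT {group_by} FROM {table} ORDER BY {group_by}")]
    views = {}
    for attr in slice_attrs:
        values = [r[0] for r in conn.execute(
            f"SELECT DISTINCT {attr} FROM {table}")]
        sql = build_query(table, group_by, agg, measure, attr)
        for y in values:
            result = dict(conn.execute(sql, (y,)))
            # Function t: sequence of <group value, aggregate> -> point in N dims.
            vec = [float(result.get(g, 0.0)) for g in groups]
            total = sum(vec)
            # Assumed normalization: ratio of each value to the slice total.
            views[(attr, y)] = [v / total if total else 0.0 for v in vec]
    return groups, views

# Hypothetical toy table in the spirit of Example 3.2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE D (ItemName TEXT, Month INTEGER, Sales REAL)")
conn.executemany("INSERT INTO D VALUES (?, ?, ?)", [
    ("Item A", 1, 30.0), ("Item A", 2, 10.0),
    ("Item B", 1, 5.0), ("Item B", 2, 15.0),
])
groups, views = slice_views(conn, "D", "Month", "SUM", "Sales", ["ItemName"])
```

Each value in views is one data point in the N-dimensional space, so the resulting points can be passed directly to any LOF routine to rank the data slices.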


                        Figure 4: local outlier “trash box” in the average sales grouped by 47 prefectures


                                                                         is as follows. In Japan, it is customary to send gifts related
                                                                         to suits (bow ties, necktie pins, suspenders, cuff links) to
                                                                         new business persons in March/April and to give accessories (key
                                                                         cases/accessories) to partners in the Christmas season. Therefore, the
                                                                         profit of the sales can be increased by recommending users “cuff
                                                                         links” with long-sleeved shirts in winter and with thin long-
                                                                         sleeved shirts in summer.
                                                                            Q3: Which Category is the most Local Outlier in the Or-
                                                                         der Count grouped-by Prefecture? The result is depicted in
                                                                         Fig. 6. We observe that “sun visors” is identified as the most local
                                                                         outlier, since it has an exceptional trend in several prefectures
                                                                         compared with its nearest neighbors. For example, the sales of
Figure 5: local outlier “cuff links” in the sum of the sales             “sun visors” are extremely high in Ibaraki and Gunma, but it is
grouped by months                                                        low in Hokkaido. This result is explained by the fact of the annual
                                                                         sunshine hours reported by the Ministry of Land, Infrastructure,
                                                                         Transport and Tourism. Ibaraki and Gunma were ranked within
potential demand of “trash box” for customers in those prefec-           top-5 in 47 prefectures in 2014. In Hokkaido, the annual sunshine
tures: we can leverage it to improve the profit of the sales. Also,      hours is lower than average and the average air temperature is
it would increase the profit if we sell a pair of “trash box” and        relatively low, so the customers in Hokkaido are not expected
its nearest neighbors, such as bath towels and slippers, because         to use “sun visors”. Therefore, it may be possible to increase the
their sales trends are close to each other.                              profit of the sales by recommending sunshade hats and sunlight
    Q2: Which Category is the most Local Outlier in the                  control items to the customers, who live in the prefectures with
Sum of the Sales grouped-by Month? The result is depicted                long annual sunshine hours or high average air temperature.
in Fig. 5. This figure shows the yearly sales trend of men’s ac-         From another perspective, the result may follow the Golf popula-
cessories (yellow lines), such as suspenders and necktie pins.           tion2 . According to this external dataset, the Golf population in
“cuff links” (red line) is identified as the most local outlier, which   Ibaraki and Gunma rank 1st and 7th, respectively.
shows a different trend from other categories. In particular, the           Q4: Which Category is the most Local Outlier in the
sales from June to August are remarkably low and they are high           Sum of the Sales of each Age range of women? The result
in December, March, and April. The reason is that, since it is           is depicted in Fig. 7. We observe that “necktie pins” is identified
summer from June to August in Japan and the temperature is               as the most local outlier. It is interesting to find that all its nearest
high, we rarely wear long-sleeved shirts and thus we do not need
                                                                         2 https://todo-ran.com/t/kiji/19677
cuff links. In other seasons, we wear them. Another observation
                           Figure 6: local outlier “sun visors” in the order count grouped by 47 prefectures


Figure 7: local outlier “necktie pins” in the sum of the sales
grouped by age ranges of women
                                                                        Figure 8: local outlier “ashtray/ignitor” in the order count
                                                                        grouped by eight regions
neighbors are the types of women’s items, however, only “necktie
pins” is the type of men’s items. The sales of “necktie pins” is
exceptional, in particular, it is high for the age range of the early   Table algebra. Tableau is a commercial data visualization tool
20’s women and low in the late 30’s or later. We conjecture that        for data analysts, which is developed based on Polaris. These
the 20’s women present “necktie pins” to their boyfriends in the        visualization tools automatically select an optimal visualization
same age range, because there are many men who start jobs in            settings for a dataset, however they require manual selection of
their 20’s. Therefore, there is a possibility that the sales can be     all attributes for analysis. Google Fusion Tables and DEVise [12]
increased by recommending young women “necktie pins” just               are tools that automate processes of collecting, integrating, and
before major anniversaries, Christmas day, or birthday.                 visualizing data from multiple data sources. Google Fusion Table
   Q5: Which Category is the most Local Outlier in the                  gathers various data from the Web, and then creates a table by
Count of the Orders of each Region? The result is depicted in           integrating the data, and then visualizes analysis results. DEVise
Fig. 8. We observe that “ashtrays/ignitor” is identified as the most    is a data search system that enables users to view and share
local outlier. This figure also shows that the Tohoku region is the     visualized analysis results of huge datasets composed of multiple
only source of the outlierness. We conjecture that the reason is        data sources.
that smoking rate is higher in northern part of Japan, including            There are many exploratory data analysis techniques, such
Tohoku region, as reported in “Overview of National Life Basic          as Sarawagi [16], Tang et al. [21], MuVE [5], SEEDB [15, 23, 24],
Survey” 3 by the Ministry of Health, Labor and Welfare.                 Mizuno et al. [13], and Zenvisage [17, 18]. Sarawagi [16] pro-
                                                                        posed a method that searches for a specific single cell in a multi-
5    RELATED WORK                                                       dimensional data cube. Tang et al. [21] proposed a systematic
                                                                        framework that searches for top-n analysis results based on both
   In this section, we review the related work to our work. We de-
                                                                        multiple utility functions and select operation in order to au-
scribe visualization and analysis tools, exploratory data analysis
                                                                        tomatically extract multiple insights without any user inputs.
techniques, and LOF techniques.
                                                                        MuVE [5] quantifies each view of data slice by using multiple
   As for the data visualization, Polaris [19] is a system that
                                                                        utility functions in order to enable group-by in numerical dimen-
integrates basic database queries with visualization by using
                                                                        sions, which are not supported by SEEDB, and then identifies
3 https://www.mhlw.go.jp/toukei/saikin/hw/c-hoken/03/hyo2.html          OLAP queries with high usefulness. Although SEEDB [15, 23, 24],
Mizuno et al. [13], and Zenvisage [17, 18] differ from each other
in terms of analysis workflow, all of them evaluate the outlierness
of data slices from the viewpoint of global outliers. Our framework,
D4C, automatically searches for exceptional data slices in a similar
way to the systems [13, 17, 18], but it employs LOF for detecting
unexpected trends.
   LOF is a major anomaly detection technique, and thus there are
many derivative techniques [3]. LOCI [14] speeds up the computation
of LOF values by introducing a new outlier measure that uses the
standard deviation of the local density of the k nearest neighbors.
LoOP [10] introduces a probabilistic concept into LOF so that local
outliers can be interpreted quantitatively in a consistent way, even
in a distance space with a multimodal distribution. Knorr’s method [9]
divides the distance space into hypergrids to reduce the computation
cost to linear in the number of data points.

6     CONCLUSION

   We described D4C, which automatically identifies the top-n data
slices that generate local outlier results of automatically generated
OLAP queries. D4C is built on top of an RDBMS. Through our Fashion EC
data analysis, we identified local outliers by using D4C and explained
how to use the analysis results in practice. We also showed how the
analysis results help in making decisions on sales strategies.
   There are two directions for future work. First, we will extend our
system to omit the user input, a query template. By computing p-values
[4, 11] from LOF values, we can compare local outliers obtained by
multiple query templates. Second, we will introduce a semantic
hierarchy between attributes, so that we can drill down and roll up
the analysis results.


ACKNOWLEDGEMENTS
   This work was supported by JSPS KAKENHI Grant Number
JP16K00154.


REFERENCES
 [1] Christopher Ahlberg. 1996. Spotfire: An Information Exploration Environment.
     ACM SIGMOD Record 25, 4 (December 1996), 25–29.
 [2] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander.
     2000. LOF: Identifying Density-based Local Outliers. ACM SIGMOD Record
     29, 2 (May 2000), 93–104.
 [3] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly
     Detection: A Survey. ACM Computing Surveys (CSUR) 41, 3 (July 2009),
     15:1–15:58.
 [4] Debabrata Dash, Jun Rao, Nimrod Megiddo, Anastasia Ailamaki, and Guy
     Lohman. 2008. Dynamic Faceted Search for Discovery-driven Analysis. (2008),
     3–12. https://doi.org/10.1145/1458082.1458087
 [5] Humaira Ehsan, Mohamed A. Sharaf, and Panos K. Chrysanthis. 2016. MuVE:
     Efficient Multi-Objective View Recommendation for Visual Data Exploration.
     Proceedings of the ICDE (2016). https://doi.org/10.1109/ICDE.2016.7498285
 [6] Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and
     Techniques (3rd ed.). Morgan Kaufmann Publishers Inc.
 [7] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. 2015. Overview
     of Data Exploration Techniques. Proceedings of the ACM SIGMOD (2015).
     https://doi.org/10.1145/2723372.2731084
 [8] Niranjan Kamat, Prasanth Jayachandran, Karthik Tunga, and Arnab Nandi.
     2014. Distributed and interactive cube exploration. Proceedings of the
     International Conference on Data Engineering (2014), 472–483.
     https://doi.org/10.1109/ICDE.2014.6816674
 [9] Edwin M. Knorr and Raymond T. Ng. 1998. Algorithms for Mining Distance-
     Based Outliers in Large Datasets. In Proceedings of the International Conference
     on Very Large Data Bases. 392–403.
[10] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. 2009.
     LoOP: Local Outlier Probabilities. In Proceedings of the ACM Conference on
     Information and Knowledge Management. 1649–1652.
[11] Martin Krzywinski and Naomi Altman. 2013. Significance, P values and t-tests.
     Nature Methods 10 (Oct. 2013), 1041. https://doi.org/10.1038/nmeth.2698
[12] Miron Livny, Raghu Ramakrishnan, Kevin Beyer, Guangshun Chen, Donko
     Donjerkovic, Shilpa Lawande, Jussi Myllymaki, and Kent Wenger. 1997.
     DEVise: integrated querying and visual exploration of large datasets. ACM
     SIGMOD Record 26, 2 (June 1997), 301–312.
[13] Yohei Mizuno, Yuya Sasaki, and Makoto Onizuka. 2017. Efficient Data Slice
     Search for Exceptional View Detection. In International Workshop On Design,
     Optimization, Languages and Analytical Processing of Big Data.
[14] Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, and Christos
     Faloutsos. 2003. LOCI: Fast Outlier Detection Using the Local Correlation
     Integral. In Proceedings of the International Conference on Data Engineering.
     315–326.
[15] Aditya Parameswaran, Neoklis Polyzotis, and Hector Garcia-Molina. 2013.
     SeeDB: Visualizing Database Queries Efficiently. Proceedings of the VLDB
     Endowment 7, 4 (December 2013), 325–328.
[16] Sunita Sarawagi. 2000. User-Adaptive Exploration of Multidimensional Data.
     In Proceedings of the International Conference on Very Large Data Bases.
     307–316.
[17] Tarique Siddiqui, Albert Kim, John Lee, Karrie Karahalios, and Aditya
     Parameswaran. 2016. Effortless Data Exploration with Zenvisage: An
     Expressive and Interactive Visual Analytics System. Proceedings of the VLDB
     Endowment 10, 4 (November 2016), 457–468.
[18] Tarique Siddiqui, John Lee, Albert Kim, Edward Xue, Chaoran Wang, Yuxuan
     Zou, Lijin Guo, Changfeng Liu, Xiaofo Yu, Karrie Karahalios, et al. 2017.
     Fast-Forwarding to Desired Visualizations with zenvisage. In Biennial
     Conference on Innovative Data Systems Research.
[19] Chris Stolte, Diane Tang, and Pat Hanrahan. 2002. Polaris: A System for Query,
     Analysis, and Visualization of Multidimensional Relational Databases. IEEE
     Transactions on Visualization and Computer Graphics 8, 1 (Jan. 2002), 52–65.
[20] Bo Tang, Shi Han, Man Lung, Yiu Rui, and Ding Dongmei. 2017. Extracting
     Top-K Insights from Multi-dimensional Data. Proceedings of the ACM SIGMOD
     (2017). https://doi.org/10.1145/3035918.3035922
[21] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. 2017.
     Extracting Top-K Insights from Multi-dimensional Data. In Proceedings of the
     ACM International Conference on Management of Data. 1509–1524.
[22] Manasi Vartak and Samuel Madden. 2014. SEEDB : Automatically Generating
     Query Visualizations. Proceedings of the VLDB 7, 13 (2014), 1581–1584.
     https://doi.org/10.14778/2733004.2733035
[23] Manasi Vartak, Samuel Madden, Aditya Parameswaran, and Neoklis Polyzotis.
     2014. SEEDB: Automatically Generating Query Visualizations. Proceedings of
     the VLDB Endowment 7, 13 (August 2014), 1581–1584.
[24] Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, and
     Neoklis Polyzotis. 2015. SEEDB: efficient data-driven visualization
     recommendations to support visual analytics. Proceedings of the VLDB
     Endowment 8, 13 (September 2015), 2182–2193.

</pre>