J. Yaghob (Ed.): ITAT 2015 pp. 143–149
Charles University in Prague, Prague, 2015


      Evaluation of Association Rules Extracted during Anomaly Explanation

                                               Martin Kopp1,2,3 Martin Holeňa1,3
                              1 Faculty of Information Technology, Czech Technical University in Prague
                                                      Thákurova 9, 160 00 Prague
                                          2 Cisco Systems, Cognitive Research Team in Prague
                             3 Institute of Computer Science, Academy of Sciences of the Czech Republic

                                                Pod Vodárenskou věží 2, 182 07 Prague

Abstract: Discovering anomalies within data is nowadays                The more formal definition would necessarily reduce
very important, because it helps to uncover interesting             the amount of plausible anomaly detectors and/or applica-
events. Consequently, a considerable amount of anomaly              tion domains. This is in conflict with our goal to provide
detection algorithms was proposed in the last few years.            a solution as general as possible.
Only a few papers about anomaly detection at least men-                Even though anomaly detection techniques are aimed at
tioned why some samples were labelled as anomalous.                 only a minority of samples, the importance and demand
Therefore, we proposed a method allowing to extract rules           for them grows rapidly. The real world applications range
explaining the anomaly from an ensemble of specifically             from the network security [10], bioinformatics [24] or fi-
trained decision trees, called sapling random forest.               nancial fraud detection [22] to the astronomy and space
   Our method is able to interpret the output of an arbi-           exploration [9].
trary anomaly detector. The explanation is given as con-               The identification of anomalies is only a half of the
junctions of atomic conditions, which can be viewed as              whole task. The second and equally important half is the
antecedents of association rules. In this work we focus on          interpretation. In high dimensional domains, like the net-
selection, post processing and evaluation of those rules.           work security or bioinformatics, where hundreds or even
The main goal is to present a small number of the most              thousands of features are common, the proper interpreta-
important rules. To achieve this, we use quality measures           tion is crucial.Therefore, anomalies have to be interpreted
such as lift and confidence boost. The resulting sets of            clearly, as a feature subset that explains its deviation from
rules are experimentally and empirically evaluated on two           ordinary data, or even better as a set of association rules.
artificial datasets and one real-world dataset.                        In [21] we proposed method of anomaly explanation
Keywords: Anomaly detection, anomaly interpretation, as-            based upon specifically trained ensembles of decision trees
sociation rules, confidence boost, random forest                    called sapling random forest (SRF). The main idea be-
                                                                    hind it is to view the explanation as a feature selection
                                                                    and classification problem. Specifically, the goal is to find
1 Introduction                                                      features in which the margin betweenı anomalous sample
                                                                    and the normal samples is maximised. Therefore, SRF
According to an IBM research [13] there were 2,7 zetta-             returns subset of features, respectively rules on these fea-
bytes of data in the digital universe at April 2012 and this        tures describing why this sample has been identified as an
amount is doubling approximately every 40 months.                   anomaly.
   Not only it is almost impossible to process such huge               The main drawback of the direct rule extraction from
amounts of data, we are actually not interested in the              our sapling random forests is the big number of rules with
raw data, but rather in the salient knowledge and inter-            some of them introduced by unfortunate training set selec-
esting patterns contained in them. This is the reason why           tion. Partially, these issues can be solved by confidence
anomaly detection, especially unsupervised anomaly de-              and / or support thresholds. But for our ultimate goal to
tection, becomes more and more important [1, 23]. De-               present the minimal number of rules containing the maxi-
spite it can be formalised as a binary classification, it en-       mal amount of useful information, such a simple approach
tails different issues and challenges than those in super-          is insufficient. Therefore, in this paper we focus on proper
vised classification. For example, anomalous events of-             selection, post processing and evaluation of rulesets ex-
ten adapt to appear normally and even normal behaviour              tracted from sapling random forests during anomaly expla-
evolve over time. Furthermore, defining a normal regions            nation. We tested association rules quality measures such
is very difficult, especially when the boundary between             as lift and confidence boost. This paper is work in progress
normal and anomalous is not always precise.                         and we would like to extend the number of tested quality
   For the purposes of this paper consider anomalies equal          measures by some subjective measures like novelty.
to outliers as defined by Hawkins [11]: “ An outlier is an             The rest of this paper is organised as follows. The next
observation which deviates so much from the other obser-            section briefly reviews related work. Section 3 describes
vations as to arouse suspicions that it was generated by            the SRF principles and its training followed by the rule
a different mechanism.”                                             extraction process in Section 4. The selected quality mea-
144                                                                                                          M. Kopp, M. Holeňa


sures of association rules are presented in Section 5. Ex-     3     Sapling Random Forest
perimental evaluation is described in Section 6 and Sec-
tion 7 concludes the paper.                                    This section outlines principles of sapling random forests.
                                                               SRF is a method able to explain an output of an arbi-
                                                               trary anomaly detector, proposed by us in [21]. It is
2 Related Work                                                 a random forest of specifically trained decision trees. Be-
                                                               cause produced trees are small they are called saplings
For more information about the anomaly detection we re-        rather than trees. Produced explanations show features in
fer to the recent book of Aggarwall [1]. This book pro-        which inspected samples differ the most from the rest of
vides an exhaustive listing of anomaly detection algo-         data. These features are used to produce association rules,
rithms and their applications in different domains. An-        which are more informative than only a set of features. An
other source may be [5], which is briefer but very well        outline of the whole method is at Algorithm 1.
written. To our best knowledge, there have been only few
works addressing not only identification of anomalies, but     Algorithm 1 Algorithm summary
also their explanation.                                            y ← anomalyDetector(data)
   Knorr et al. [15] focused on what kind of knowledge             for all data(y ==anomaly) do
should be extracted and provided to the user. Strong and             T ← createTrainingSet(size, method)
weak outliers were defined and searched within data by               t ← trainTree(T )
distance-based algorithms described in detail in [14].               SRF ← t
   Dang et al. [7] presented an algorithm identifying and          end for
explaining anomalies. The algorithm starts by selecting            extractRules(SRF)
a set of neighbouring samples based on quadratic entropy
that are presented to a Fisher linear discriminant classi-
fier to seek for an optimal half-space, in which a detected
anomaly is well separated. The process of interpretation       3.1   Training Set Selection
is entangled with the presented method of identification
of anomalies. The difference to our work is that SRF can       Dataset X = {x1 , x2 , . . . , xl }, where x ∈ Rd , can be split
be used after an arbitrary anomaly detection algorithm to      into two disjoint sets X a , containing anomalous samples,
interpret its results.                                         and X n ,containing normal samples. Then, a training set T
   The most similar to our approach and most recent            contains the anomaly xa as one class and a subset of X n
is [20]. Their approach, as well as ours, can interpret out-   as the other. The first strategy of creating training sets is
put of an arbitrary anomaly detector as a subset of fea-       to select k nearest neighbours of xa from X n . This strat-
tures. They use classification accuracy for outlier ranking.   egy is sensible for algorithms detecting local anomalies,
The main drawback of this approach is that it needs bal-       as according to [8] they are more general than algorithms
anced training sets which are created by sampling artificial   detecting global anomalies. The drawback of this strategy
samples around the anomalous point. With respect to this       is a computational complexity.
work, our approach can handle unbalanced training sets            The second strategy is to select k samples randomly
easily and returns not only feature subsets but feature sub-   from X n with uniform probability. The advantage of this
sets with rules on them, providing even more information       approach is a possibility to generate more than one train-
about the anomaly. Furthermore, we simplify the analysis       ing set per anomaly by repeating the sampling process.
by clustering, which enables to interpret similar anomalies    More training sets lead to more saplings per anomaly and
at once [16].                                                  to more robust explanation, but at the expense of the more
   On the other hand, there are many papers about associ-      complicated aggregation of rules extracted from them (see
ation rules. This paper was inspired mainly by [3], which      Section 4). A comparison of both approaches can be found
is about measuring redundancy and information quality          in [21].
of sets of association rules. The author presents a mea-
sure called confidence boost and an algorithm to produce       3.2   Training a Sapling
a small set of association rules using this measure. A re-
ally extensive list of interestingness measures can be found   For simplicity consider sapling a binary decision tree with
in [12]. There is a lot of inspiration for our future work.    typical height 1-3. In the SRF method, there are always
   An alternative approach, well described in [6], may be      two leaves at the maximal depth, one of which contains
so called subjective measures. A typical example is the        only an anomaly xa and the other containing only normal
novelty, sometimes called unexpectedness, of a rule with       samples. The saplings small height has two reasons. First,
respect to user provided domain knowledge or against the       training sets are relatively small. Second, according to the
another rule set. Because these terms are ambiguous there      anomaly isolation approach [19], if the analysed sample
are multiple approaches of measuring them. An approach         is an anomaly, it should be separated easily from the rest
in [18] inspired us for our future work.                       of data, resulting again into small trees. Therefore, if the
Evaluation of Association Rules Extracted during Anomaly Explanation                                                                         145


height of a sapling is higher than expected it should be                   to be 0.90 or 0.95. Using the adopted notation, k is deter-
taken into consideration that the explained sample may not                 mined as
be an anomaly.
                                                                                                                      k
   The standard procedure, to find the splitting function h                                                  1
for a new internal node, is maximising an information gain
                                                                                       k = arg min
                                                                                                  k
                                                                                                                     ∑
                                                                                                       ∑ f ∈F |R f | i=1
                                                                                                                         |R fi | > τ,        (4)
over the space of all possible splitting functions H as
                                                                           where it is assumed, that R f are sorted according to their
                                |S b (h)|                                  size to simplify the notation. We have also investigated
               arg max − ∑                H(S b ),                  (1)
                   h∈H            |S|                                      the complementary approach, where groups are selected,
                        b∈{L,R}
                                                                           if they were used with a frequency higher than a specified
where S is the subset of the training set T reaching the                   threshold. But the presented strategy based on the cumu-
leaf being split, S L (h) = {x ∈ S|h(x) = +1} and S R (h) =                lative frequency showed more consistent results in our ex-
{x ∈ S|h(x) = −1} and H(S) is an entropy of S.                             periments.
   The second commonly used approach involves minimis-                        Once the set of groups with decision rules is selected,
ing the Gini impurity.                                                     we create one rule r¯f for every rule set R f .Thresholds for
                                       2                 2             each item f j are calculated as an average of all thresholds
                      |xa |                      |xn |                     within the rule set R f .
    arg min ∑ 1 −                            −                      (2)
        h∈H
            b∈{L,R}
                    (S b (h))                  (S b (h))
                                                                                                                 |R |
                                                                                                                  fi
                                                                                                          1
 For experiments presented in this paper we used infor-                                           θ̄ j =        ∑
                                                                                                         |R f | i=1
                                                                                                                     θi, j                   (5)
mation gain.
                                                                             By this approach we obtain one representative rule for
                                                                           each feature set f as:
4 Extraction and Evaluation of Rules
                                                                                     c¯f = x j1 > θ¯1 ∧ x j2 > θ¯2 ∧ . . . ∧ x jt > θ̄t .    (6)
Once a sapling is grown, it is used to explain the
anomaly xa . Let h j1 ,θ1 , . . . , h jd ,θd be the set of splitting
functions, with features j1 , . . . , jd and threshold θ1 , . . . , θd ,   5    Measuring Quality of Rules
used in the inner nodes on the path from the root to the leaf
with the anomaly xa . Then xa is explained as a conjunction
                                                                           This section reviews selected quality measures of asso-
of atomic conditions :
                                                                           ciation rules. Typical association rules are in the form
       c = (x j1 > θ1 ) ∧ (x j2 > θ2 ) ∧ . . . ∧ (x jd > θd ),      (3)    A → Y, where A, Y are item sets. In rules extracted from
                                                                           SRF, items are atomic conditions and Y always means: “is
which is the output of the algorithm. This conjunction can                 anomalous”. Therefore, our rules are in the form:
be read as “the sample is anomalous because it is greater
than threshold θ1 in feature j1 and greater than θ2 in fea-                                            r f = c f → y,                        (7)
ture j2 and . . . than majority (or nearest neighbour) sam-
                                                                           where c is a conjunction of atomic conditions like (3), y =
ples.” Because resulting trees are very small, the explana-
                                                                           x ∈ X a and f ⊆ F . The r f in its full form then look as:
tion is compact.
   The situation is more difficult, when more saplings per
                                                                           r f (x) = x f1 > θ f1 ∧ x f2 > θ f2 ∧ . . . ∧ x fn > θ fn → x ∈ X a ,
anomaly have been grown, as each sapling provides one
                                                                                                                                            (8)
conjunction of type (3). Using more than one sapling per
                                                                           where n is a maximal index in the itemset f .
anomaly improves robustness for training sets created by
                                                                               For this kind of rules support [2] is calculated as:
uniform sampling. The problem is that returning set of
all conjunctions C is undesirable, as the primary objective                                                |{c f (x)|x ∈ X }|
— explanation of the anomaly to a human — would not be                                     supp(c f ) =                       ,              (9)
                                                                                                                   |X |
met. Hence, the algorithm needs to aggregate conjunctions
in C.
                                                                                                                  |X a |
   For simplicity of the following notation consider 2d                                               supp(y) =          .                  (10)
                                                                                                                   |X |
items, in such a way that 2 items are assigned to each fea-
ture, one for "<" rules, the other for ">" rules. Denote this              and gives the proportion of data points which satisfy the
2d set of items F . Then we can group rules into the rule                  antecedent c, respectively the consequent y. It is used to
sets R f according to the item set f ⊆ F they share.                       measure the importance of a rule or as a frequency con-
   Based on |R f | the algorithm discards groups of low im-                strain. The disadvantage of support is that infrequent rules
portance by sorting them in descend order, and then us-                    are often discarded. This is much bigger problem than it
ing only the first k groups such, that their cumulative fre-               could seem because we are generating rules for anomalies,
quency is greater than a threshold τ, which we recommend                   which are rare by definition.
146                                                                                                                              M. Kopp, M. Holeňa


   Another frequently used measure is confidence [2]:                           rule                                      supp       lift           β
                                                                                r1 = x1 > −0.33 ∧ x2 > −0.39              0.38      1.64           0.27
                                       supp(c f → y)                            r2 = x1 > −0.33 ∧ x2 < 0.3                0.37      1.60           0.27
                  conf(c f → y) =                    .              (11)
                                         supp(c f )                             r3 = x1 < 0.4 ∧ x2 < 0.34                 0.37      1.49           0.25
                                                                                r4 = x1 < 0.37 ∧ x2 > −0.37               0.37      1.57           0.26
It estimates the conditional probability of the consequent                      r5 = x2 > 2.2                             0.02      6.00           1.00
being true on condition that the antecendent is true. The                       r6 = x1 > 2.3                             0.02      6.00           1.00
trouble with confidence is caused by its sensitivity to the                     r7 = x2 < −2.4                            0.01      6.00           1.00
frequency of y. Because all rules extracted from SRF                            r8 = x1 < −2.4                            0.01      6.00           1.00
have the same consequent the rule ranking produced by
lift a confidence would be the same.
   The third measure we used is lift [4]:                                     Table 1: Aggregated rules with their quality measures for
                                                                              the three layer donut dataset, sorted by their respective
                                      conf(c f → y)                           supports.
                   lift(c f → y) =                  ,               (12)
                                        supp(y)

which measures how many times more often the an-                                    4
                                                                                         r4                                                    r1

tecedent c and consequent y occur together than expected
                                                                                                                                 normal
                                                                                                                                 anomal
                                                                                                                                  r6
if they were statistically independent. Lift does not suffer
                                                                                    3

                                                                                                                                              r5
from the rare items problem. Because in our experiments                             2


the consequent will always the same frequency there is no
                                                                                    1
need to measure both
   Finally, confidence boost introduced by Balcázar [3] is                          0


calculated as:                                                                      -1


                                conf(r f )
β (r f ) =                                                                ,         -2

             max{conf(r f )|supp(r0 f 0 ) > σ , r f 6≡ r0 f 0 , f 0 ⊆ f }
                       0 0
                                                                                                                                              r7


                                                                    (13)            -3
                                                                                                   r8
  where σ is support threshold and r f 6≡ r0 f 0 denotes the                        -4
                                                                                         r3                                                    r2
                                                                                      -4      -3        -2   -1   0   1      2            3         4
inequivalence of rules r f and r0f 0 , which for our simple case
where all consequents are the same, means that f 6= f 0 .
From (13) is evident that f 0 ⊂ f .                                           Figure 1: Three layer donut dataset with plotted rules
  If the set of confidences in denominator is empty, the                      r1 − r8 . Rules r1 − r4 are plotted as filled squares and rules
confidence boost is by convention set to infinity.                            r5 − r8 as half-planes delimited by solid lines.


6 Experiments                                                                 number of data points explained, lets recall that anomalies
                                                                              are only one sixth of all data points in this dataset. Both,
For the experimental evaluation we used the synthetical                       lift and confidence boost, mostly reflects our subjective ex-
three layer donut, the well known Fisher’s iris and the Let-                  pectations.
ter recognition data set from the UCI repository [17].                           All rules are depicted at Figure 1. Its evident that pre-
                                                                              sented rules cannot separate anomalies from normal sam-
                                                                              ples perfectly. Especially difficult are anomalies inside
6.1 Three Layer Donut
                                                                              the donut. To separate those inner anomalies perfectly it
The three layer donut dataset contains 1000 normal sam-                       would be necessary to combine more rules together, for
ples forming the two dimensional toroid (donut). There                        example r1 and r3 or r2 and r4 .
are 200 anomalies, one half inside the toroid and the sec-
ond half out of it. For this dataset we created 10 rules
per anomaly, using SRF, resulting in 2000 rules. After the                    6.2   Iris
simple aggregation, described in Section 4, only 8 rules
left. All of them printed in Table 1, sorted by their respec-                 The virginica species were selected as anomalous class for
tive support. All rules before aggregation were used to                       the iris data set. Five rules per anomaly were produced us-
calculate confidence boost. Otherwise, many rules would                       ing SRF, resulting in 250 rules. After aggregation we have
have confidence boost equal to infinity because there are                     got 6 rules. They are written in Table 2 with their respec-
no rules defined on any subset of their item sets f .                         tive quality measures. Confidence boost was calculated
   The fact is that rules r5 − r8 have quite small support                    using all 250 rules.
but, according to the other measures and our intuition, they                     The main problem with all those rules is that almost ev-
are very important. The small supports is due to the small                    ery one of them can sufficiently separate anomalies from
Evaluation of Association Rules Extracted during Anomaly Explanation                                                                            147


       rule                           supp      lift    β                        25
                                                                                                                                 lift
       r1 = x1 > 6 ∧ x4 > 1.7         0.25     3.00    1.00                                                                      confidence boost

       r2 = x4 > 2                    0.19     3.00    1.00                      20
       r3 = x3 > 5.5                  0.17     3.00    1.00
       r4 = x2 < 2.8 ∧ x4 > 1.6       0.06     3.00    1.00
       r5 = x1 > 7.3                  0.05     3.00    1.00                      15

       r6 = x2 < 2.2                  0.03     0.75    0.25


                                                                       ranking
                                                                                 10

Table 2: Aggregated rules with their quality measures for
the iris dataset , sorted by their respective supports.
                                                                                  5


                                                                                  0
                                                                                      0         5         10           15        20                 25
                                                                                                               rules


                                                                       Figure 3: Ranking (higher number means better ranking)
                                                                       of rules from Table 3 by its lift and confidence boost.

                                                                                      rule                                  supp lift β
                                                                                      x2 > 14                               0.29 1.44 0.29
                                                                                      x11 < 0.62                            0.24 0.27 1.82
                                                                                      x9 > 8.8 ∧ x1 0 < 5.2                 0.21 1.67 0.96
                                                                                      x3 > 10                               0.18 0.89 0.35
                                                                                      x1 < 0.38                             0.17 0.19 0.14
                                                                                      x13 > 6.2 ∧ x15 > 8.6                 0.16 0.21 0.17
                                                                                      x6 > 9.8 ∧ x8 < 1.2                   0.15 2.39 1.91
                                                                                      x1 > 9.5 ∧ x2 > 14                    0.15 1.10 0.22
Figure 2: The iris dataset with r1 plotted as a filled rect-
                                                                                      x2 > 13 ∧ x10 > 12                    0.15 0.44 0.04
angle, and rules r2 and r5 as half-planes delimited by solid
                                                                                      x4 > 7.8 ∧ x8 < 1.2 ∧ x9 > 7.2        0.15 5.75 0.71
lines.
                                                                                      x8 < 1.2 ∧ x16 < 5.2                  0.14 1.62 0.39
                                                                                      x2 > 11 ∧ x4 > 7.8 ∧ x9 > 6.4         0.13 5.32 0.88
normal samples. No one of presented measures could help                               x8 < 1.2 ∧ x9 > 8.8                   0.13 8.06 1.80
in selecting the most informative, yet small as possible, set                         x9 > 8 ∧ x10 < 5.2 ∧ x15 > 6.8        0.13 2.20 1.31
of rules. Because presented rules are seen informationally                            x1 > 9.5 ∧ x3 > 9                     0.12 1.04 0.06
equivalent by all quality measures. This doesn’t say much                             x9 > 8.8 ∧ x12 < 5.5                  0.12 0.78 0.22
about difference between the quality measures but it justi-                           x3 > 6.1 ∧ x14 < 6.2 ∧ x15 > 7.2      0.12 2.91 1.09
fies the rule extraction process, because all generated rules                         x1 > 4.2 ∧ x8 < 1.2 ∧ x12 < 6.8       0.12 0.55 0.05
have high score.                                                                      x1 > 8.5 ∧ x10 > 12                   0.12 0.84 0.10
   Figure 2 shows the iris dataset with rules r1 , r2 and r5 .                        x1 > 5.5 ∧ x7 > 7.8 ∧ x8 < 1.2        0.11 3.39 0.22
                                                                                      x2 < 1.5 ∧ x9 > 7.5 ∧ x15 > 5.5       0.11 3.43 0.84
                                                                                      x1 > 7.8 ∧ x6 > 8.8 ∧ x7 > 6          0.11 0.29 0.02
6.3    Letter Recognition                                                             x6 > 8.2 ∧ x15 > 7.9 ∧ x16 < 6.2      0.11 0.29 0.24
                                                                                      x2 > 11 ∧ x3 > 7.6 ∧ x10 < 5.2        0.11 1.21 0.05
This dataset was created as a classification problem with                             x3 > 10 ∧ x5 > 6.2                    0.11 0.61 0.24
26 classes, one class for each letter in the English alphabet.
The charactersÒ were obtained from 20 different fonts and
randomly distorted to produce 20,000 unique samples pre-               Table 3: Rules extracted from Letter recognition by SRF
sented as 16 dimensional numerical vectors. Letter X was               with support higher than 0.10 with their quality measures
selected as the anomaly class. SRF produced more than                  sorted by their respective supports.
15,000 rules, which were reduced by aggregation to 1955.
Aggregated rules with support higher than 0.10 are pre-
sented in Table 3. The ranking of those rules is plotted               (446 rules). The confidence boost selected the smaller rule
at Figure 3. Its evident that the ranking given by lift and            set where almost all rules looked plausible. On the other
confidence boost differs substantially.                                hand, they missed some really interesting ones most highly
   It is nearly impossible to evaluate all rules. Therefore,           rated by lift. The confidence boost tend to choose shorter
we have selected only those with confidence boost higher               more similar rules, whereas lift prefer richer and more het-
than one (202 rules) and those with lift higher than one               erogenous rules. Therefore, from our point of view the
148                                                                                                              M. Kopp, M. Holeňa


  top 10 lift rules                   top 10 β rules              References
  x8 < 1.8 ∧ x15 > 7.8                x4 < 1.2 ∧ x9 > 8
  x2 < 1.8 ∧ x3 > 4.5 ∧ x9 > 8.8      x2 < 1.5 ∧ x9 > 8.8
  x5 > 6.8 ∧ x8 < 1.8 ∧ x15 > 7       x8 < 1.2 ∧ x9 > 8.8          [1] Aggarwal, C. C.: Outlier analysis. Springer, 2013
  x5 > 5 ∧ x9 > 8.2 ∧ x10 < 5.2       x9 > 8 ∧ x14 < 5.2           [2] Agrawal, R., Imieliński, T., Swami, A.: Mining association
  x4 < 2.2 ∧ x6 > 8.2 ∧ x9 > 7.8      x11 > 12 ∧ x16 < 4.5             rules between sets of items in large databases. In: ACM
  x2 < 5.6 ∧ x3 > 7.2 ∧ x8 < 1.2      x6 > 9.8 ∧ x8 < 1.2              SIGMOD Record, volume 22, 207–216, ACM, 1993
  x1 > 4.5 ∧ x8 < 1.8 ∧ x15 > 7.8     x4 > 8.8 ∧ x14 < 5.2         [3] Balcázar, J. L.: Formal and computational properties of the
  x7 < 5.8 ∧ x8 < 2.5 ∧ x15 > 7.8     x5 > 8.5 ∧ x14 < 5.2             confidence boost of association rules. ACM Transactions
  x2 < 4.8 ∧ x9 > 8 ∧ x11 > 9.8       x8 < 1.2 ∧ x16 > 9.9             on Knowledge Discovery from Data (TKDD) 7(4) (2013),
  x11 > 9 ∧ x14 > 13                  x12 > 10 ∧ x16 < 5.2             19
                                                                   [4] Brin, S., Motwani, R., Ullman, J. D., Tsur, S.: Dynamic
                                                                       itemset counting and implication rules for market basket
Table 4: Comparison of top 10 rules extracted from the                 data. In: SIGMOD 1997, Proceedings ACM SIGMOD In-
Letter recognition dataset by SRF selected by lift and con-            ternational Conference on Management of Data, 255–264,
fidence boost.                                                         Tucson, Arizona, USA, May 1997
                                                                   [5] Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection:
                                                                       a survey. ACM Computing Surveys (CSUR) 41(3) (2009),
best selection strategy is choosing top k rules according to           15
the lift ranking. The top 10 rules chosen from the whole
                                                                   [6] Christen, P., Goiser, K.: Quality and complexity measures
set by lift and confidence boost, regardless their support,            for data linkage and deduplication. In: Quality Measures in
are in Table 4.                                                        Data Mining, 127–151, Springer, 2007
   Still there are too much rules to make some conclusions,        [7] Dang, X. -H., Micenková, B., Assent, I., Ng, R. T.: Local
in our future work we are going to investigate more mea-               outlier detection with interpretation. In: European Confer-
sures of interestingness and novelty, which will hopefully             ence on Machine Learning and Principles and Practice of
help us to reduce the amount of extracted rules even more.             Knowledge Discovery in Databases (ECML/PKDD 2013),
                                                                       2013
7 Conclusion                                                       [8] de Vries, T., Chawla, S., Houle, M. E.:           Finding lo-
                                                                       cal anomalies in very high dimensional space. In: IEEE
In this paper, we presented a novel approach for the ex-               10th International Conference on Data Mining (ICDM
planation of an output of an arbitrary anomaly detector us-            2010), 2010
ing sapling random forests. The explanation is given as            [9] Fujimaki, R., Yairi, T., Machida, K.: An approach to space-
conjunctions of atomic conditions, which can be viewed                 craft anomaly detection problem using kernel feature space.
                                                                       In: Proceedings of the Eleventh ACM SIGKDD Interna-
as antecedents of association rules. Due to an extraction
                                                                       tional Conference on Knowledge Discovery in Data Min-
method, the individual rules are short and comprehensi-
                                                                       ing, 2005
ble. The main drawback was that the rule sets for the big-
                                                                  [10] Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G.,
ger dataset were large and redundant. Therefore, we ap-
                                                                       Vázquez, E.: Anomaly-based network intrusion detection:
plied multiple quality measures to evaluate them and se-               techniques, systems and challenges. Computers & Security,
lect those rules with desired properties. Performed exper-             2009
iments showed that no one of presented measures reflect           [11] Hawkins, D. M.: Identification of outliers, volume 11.
our expectation. From the considered measures the lift                 Springer, 1980
looks the most promising. But this paper is just a work           [12] Jalali-Heravi, M., Zaïane, O. R.: A study on interesting-
in progress and we don’t view this observation as a final              ness measures for associative classifiers. In: Proceedings of
conclusion.                                                            the 2010 ACM Symposium on Applied Computing, 1039–
   For our future work we would like to have a measure                 1046, ACM, 2010
that will rate the novelty of a rule with respect to the set of   [13] Karr, D.:         Big data brings marketing big num-
previously selected rules. The first idea is to chose those            bers.         [https://www.marketingtechblog.com/
rules that describe anomalies not covered by the already               ibm-big-data-marketing/], 2012 [Online; accessed
selected rules. The second idea is to select rules which               19-June-2015].
may describe already covered anomalies but using com-             [14] Knorr, E. M., Ng, R. T.: Algorithms for mining distance-
pletely different set of features. The last thing we would             based outliers in large datasets. In: Proceedings of the In-
like to work on is finding a way of concatenating mined                ternational Conference on Very Large Data Bases, 1998
rules to make smaller yet precise rule sets.                      [15] Knorr, E. M., Ng, R. T.: Finding intensional knowledge of
                                                                       distance-based outliers. In: VLDB, 1999
Acknowledgement                                                   [16] Kopp, M., Pevný, T., Holeňa, M.: Interpreting and clus-
                                                                       tering outliers with sapling random forests. In: Informa-
The research reported in this paper has been supported by              tion Technologies – Applications and Theory Workshops,
the Czech Science Foundation (GA ČR) grant 13-17187S.                 Posters, and Tutorials (ITAT 2014), 2014
Evaluation of Association Rules Extracted during Anomaly Explanation   149


[17] Lichman, M.: UCI machine learning repository. [http:
     //archive.ics.uci.edu/ml/]. University of Califor-
     nia, Irvine, School of Information and Computer Sciences,
     2013
[18] Liu, B., Hsu, W., Chen, S.: Using general impressions to
     analyze discovered classification rules. In: KDD, 31–36,
     1997
[19] Liu, F. T., Ting, K. M., Zhou, Z. -H.: Isolation forest.
     In: Eighth IEEE International Conference on Data Mining
     (ICDM 2008), 2008
[20] Micenková, B., Ng, R. T., Dang, X. -H., Assent, I.:
     Explaining outliers by subspace separability.               In:
     IEEE 13th International Conference on Data Mining
     (ICDM 2013), 2013.
[21] Pevný, T., Kopp, M.: Explaining anomalies with sapling
     random forests. In: Information Technologies – Ap-
     plications and Theory Workshops, Posters, and Tutorials
     (ITAT 2014), 2014
[22] Pradnya, K., Khanuja, H. K.: Article: A survey on outlier
     detection in financial transactions. International Journal of
     Computer Applications 108(17) (December 2014), 23–25
[23] Rousseeuw, P. J., Leroy, A. M.: Robust regression and out-
     lier detection. John Wiley & Sons, 2005
[24] Tibshirani, R., Hastie, T.: Outlier sums for differential gene
     expression analysis. Biostatistics, 2007