Detecting and Explaining Exceptional Values in Categorical Data

DISCUSSION PAPER

Fabrizio Angiulli, Fabio Fassetti, Luigi Palopoli, and Cristina Serrao

DIMES, University of Calabria, 87036 Rende (CS), Italy
{f.angiulli,f.fassetti,palopoli,c.serrao}@dimes.unical.it

Abstract. In this work we deal with the problem of detecting and explaining exceptionally behaving values in categorical datasets, by perceiving an attribute value as anomalous if its frequency occurrence is exceptionally typical or untypical within the distribution of the frequency occurrences of the other attribute values. The notion of frequency occurrence is obtained by specializing the Kernel Density Estimation method to the domain of frequency values, and an outlierness measure is defined by leveraging the cdf of such a density. This measure is able to simultaneously identify two kinds of anomalies, called lower outliers and upper outliers, namely exceptionally low- or high-frequency values. Moreover, data values labeled as outliers come with an interpretable explanation for their abnormality, which is a desirable feature of any knowledge discovery technique.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.

1 Introduction

An outlying observation is one that appears to deviate markedly from the other members of the sample in which it occurs. Such rare events can be even more interesting than the more regularly occurring ones, as they are suspected of not being generated by the same mechanisms as the rest of the data. We deal with categorical data and, specifically, we perceive an attribute value as anomalous if its frequency occurrence is exceptionally typical or untypical within the distribution of the frequency occurrences of the other attribute values. To quantify such a frequency occurrence, we specialize the classical Kernel Density Estimation technique to the domain of frequency values and arrive at the concept of soft frequency occurrence. The cumulated frequency distribution of the above density estimate is used to decide whether the frequency of a certain value is anomalous when compared to the other values' frequencies. In particular, we are able to identify two kinds of anomalies, namely lower outliers and upper outliers. A lower outlier is a value whose frequency is low while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are high. An upper outlier is a value whose frequency is high while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are low. Note that the measure detects both scenarios and automatically characterizes the target value as a lower or an upper outlier.

A value can show exceptional behaviour only when we restrict our attention to a subset of the whole population. Thus, we design our technique to output so-called explanation-property pairs (E, p), where E denotes a condition used to determine the target subpopulation and p represents an attribute p_a and a value p_v such that p_v is exceptionally frequent or infrequent within the subpopulation selected by the explanation E.

The rest of the work is organised as follows. Section 2 discusses related work. Section 3 introduces the frequency occurrence function.
Section 4 describes the outlierness function for ranking categorical values. Section 5 describes experimental results.

2 Related work

Categorical data has received relatively little attention compared to quantitative data, since detecting anomalies in the categorical domain is a challenging problem [11]. We start by noting that there is little literature about outlier explanation [3], i.e., the problem of detecting anomalous properties and/or related outlier objects equipped with features justifying their outlierness. Moreover, to the best of our knowledge, no technique is able to natively detect upper outliers.

Among the traditional outlier detection methods explored in the context of numerical data [8], two main families can be considered: distance-based [6] and density-based [7] methods. These ideas have been adapted to the categorical domain.

An example of a distance-based method is discussed in [6, 1]. Outliers are defined as the N observations whose average distance to the k nearest neighbors is the greatest; to this end, an appropriate distance has to be chosen. Detecting local anomalies, i.e. observations having outlying behavior in local areas, is another interesting discovery problem. Local anomaly detection methods for categorical data include k-LOF [12] and the WATCH method [9]. The first is an extension of the Local Outlier Factor (LOF) method [7] to the categorical domain, while the WATCH method [9] has been designed to detect outliers in high-dimensional categorical datasets by means of feature grouping. Both distance and density are taken into account by the ROAD algorithm [10]: it relies on the Hamming distance to compute cluster-based outliers, while the density, evaluated as the mean frequency of the values, is used to identify frequency-based outliers.

With regard to outlier explanation, [3] and [2] propose techniques for the categorical and the numerical domain, respectively, that, given in input a single object known to be an outlier, provide features justifying its anomaly and subpopulations where its exceptionality is evident. A generalization is proposed in [4].

3 Frequency Occurrence

In this section we give some preliminary definitions and introduce the notation employed throughout the paper. A dataset D on a set of categorical attributes A is a set of objects o assuming values on the attributes in A. By o[a] we denote the value of o on the attribute a ∈ A, and D[a] denotes the multiset {o[a] | o ∈ D}. Given a multiset V, the frequency f_v^V of the value v ∈ V is the number of occurrences of v in V. A condition C is a set of pairs (a, v), where each a is an attribute and each v ∈ D[a]. By D_C we denote the dataset {o ∈ D | o[a] = v, ∀(a, v) ∈ C}.

Definition 1 (Frequency distribution). A frequency distribution H is a multiset of the form H = {f_1^{(1)}, ..., f_1^{(w_1)}, ..., f_n^{(1)}, ..., f_n^{(w_n)}}, where each f_i ∈ N is a distinct frequency, f_i^{(j)} = f_i^{(k)} = f_i for each 1 ≤ j, k ≤ w_i, and w_i denotes the number of occurrences of the frequency f_i. By N(H) (or simply N whenever H is clear from the context) we denote w_1 · f_1 + ... + w_n · f_n.

For the sake of simplicity, we will refer to a frequency distribution as a set H = {f_1, f_2, ..., f_n} and to the number of occurrences w_i of f_i as w(f_i). Now we define the notion of frequency occurrence as a tool for quantifying how frequent a certain frequency is.

Definition 2 (Hard frequency occurrence). Given a frequency distribution H, the frequency occurrence F_H(f_i) of f_i, also denoted by F(f_i) whenever H is clear from the context, is the product w_i · f_i.
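To make Definitions 1 and 2 concrete, the following minimal sketch (our own illustrative code, not part of the original formulation; the function names are ours) builds the frequency distribution of a categorical column and computes the hard frequency occurrence of each distinct frequency.

```python
from collections import Counter

def frequency_distribution(column):
    """Build the frequency distribution H of a categorical column.

    Returns two parallel lists: the distinct frequencies f_i and their
    multiplicities w_i (how many distinct values occur exactly f_i times).
    """
    value_freq = Counter(column)              # frequency of each categorical value
    freq_mult = Counter(value_freq.values())  # w_i for each distinct frequency f_i
    freqs = sorted(freq_mult)
    mults = [freq_mult[f] for f in freqs]
    return freqs, mults

def hard_frequency_occurrence(freqs, mults):
    """Hard frequency occurrence of Definition 2: F(f_i) = w_i * f_i."""
    return {f: w * f for f, w in zip(freqs, mults)}

# toy example: value 'a' occurs 3 times, 'b' and 'c' twice, 'd' once
column = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd']
freqs, mults = frequency_distribution(column)
print(freqs, mults)                              # [1, 2, 3] [1, 2, 1]
print(hard_frequency_occurrence(freqs, mults))   # {1: 1, 2: 4, 3: 3}
```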
Definition 2 allows us to associate with each distinct value in D[a] a score that is related not only to its frequency in the dataset, but also to how many other values have its same frequency. However, close frequency values do not interact with each other: e.g., having f_i = 49 with w_i = 1 and f_{i+1} = 51 with w_{i+1} = 1 is completely different from having f_{i'} = 50 with w_{i'} = 2, since in the former case F(f_i) = 49 and F(f_{i+1}) = 51, while in the latter F(f_{i'}) = 100, which is about twice as large. Intuitively, we do not want such a small variation in the frequency distribution to impact the frequency occurrence values so heavily. To force close frequency values to influence each other and jointly contribute to the frequency occurrence, we design an ad-hoc density estimation method inspired by Kernel Density Estimation (KDE).

A (discrete) kernel function K_{f_i} with parameter f_i is a probability mass function having the property that sup_{f≥0} K_{f_i}(f) = K_{f_i}(f_i). Given an interval I = [f_l, f_u] of frequencies, a frequency f_i, and a kernel function K, the volume of K_{f_i} in I, denoted as V_I(K_{f_i}), is given by \sum_{f=f_l}^{f_u} K_{f_i}(f). The following expression

    F(f) = \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \sum_{\varphi \in I(f)} K_{f_i}(\varphi) \right],    (1)

where I(f) represents an interval of frequencies centred in f, provides the density estimate of the frequency occurrence of the frequency f.

Since K_{f_i}(·) is a probability mass function, the frequency f_i provides a contribution to the frequency occurrence of f corresponding to the portion of the volume of K_{f_i} which is contained in I(f), that is V_{I(f)}(K_{f_i}). It is possible to eliminate the dependence on I(f) by properly weighting the contribution of K_{f_i}(·) based on its distance from the target frequency f. Such a weight can be directly obtained from the associated kernel as the ratio between the probability of observing the frequency f and the probability of observing f_i, when such frequencies are realizations of a random variable distributed according to K_{f_i}(·). This allows us to rewrite Equation (1) as follows:

    F(f) = \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \left( \sum_{\varphi \geq 0} K_{f_i}(\varphi) \right) \cdot \frac{K_{f_i}(f)}{K_{f_i}(f_i)} \right].    (2)

Note that the summation of K_{f_i}(·) over its whole domain is equal to 1, since it is a probability mass function. Moreover, as F represents a notion of density function associated with frequency occurrences, it is preferable that its volume evaluated over the frequencies H = {f_1, ..., f_n} amounts to N(H). This leads to the following final form of the frequency occurrence function.

Definition 3 (Soft occurrence function). Given a frequency distribution H, the frequency occurrence F_H(f_i) of f_i, also denoted by F(f_i) whenever H is clear from the context, is given by the following expression

    F(f) = \frac{N(H)}{N_F(H)} \cdot \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \widehat{K}_{f_i}(f) \right],    (3)

where

    \widehat{K}_{f_i}(f) = \frac{K_{f_i}(f)}{K_{f_i}(f_i)}   and   N_F(H) = \sum_{j=1}^{n} \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \widehat{K}_{f_i}(f_j) \right].

As for the kernel selection, we exploit the binomial distribution binopdf(f; n, p), with the parameter n, denoting the number of independent trials, set to N(H), and the parameter p, denoting the success probability, set to p = f_i/N(H).
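As an illustration of Definition 3, the following sketch (our own code, written under the stated binomial-kernel choice; names and structure are ours, not the authors' implementation) computes the soft frequency occurrence of every distinct frequency using scipy's binomial pmf.

```python
import numpy as np
from scipy.stats import binom

def soft_frequency_occurrence(freqs, mults):
    """Sketch of the soft occurrence function of Eq. (3).

    freqs: distinct frequencies f_i; mults: their multiplicities w_i.
    The kernel K_{f_i} is a binomial pmf with N(H) trials and p = f_i / N(H).
    """
    freqs = np.asarray(freqs, dtype=float)
    mults = np.asarray(mults, dtype=float)
    N = float((mults * freqs).sum())                     # N(H)

    def pmf(f, fi):                                      # K_{f_i}(f)
        return binom.pmf(f, int(N), fi / N)

    # K_hat[i, j] = K_{f_i}(f_j) / K_{f_i}(f_i)
    K_hat = np.array([[pmf(fj, fi) / pmf(fi, fi) for fj in freqs] for fi in freqs])

    unnorm = (mults * freqs) @ K_hat                     # sum_i w_i f_i K_hat_{f_i}(f_j)
    NF = unnorm.sum()                                    # normalisation term N_F(H)
    F = N / NF * unnorm                                  # values of F(f_j), summing to N(H)
    return dict(zip(freqs, F))

# toy frequency distribution: f = [1, 2, 3] with w = [1, 2, 1], so N(H) = 8
print(soft_frequency_occurrence([1, 2, 3], [1, 2, 1]))
```

Unlike the hard occurrence, here a frequency also receives mass from nearby frequencies, so the two close-frequency scenarios discussed above yield comparable scores.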
4 Categorical Outlierness

The idea behind the measure we discuss in the following is that an object in a categorical dataset can be considered an outlier with respect to an attribute if the frequency of the value it assumes on that attribute is rare when compared to the frequencies associated with the other values assumed on the same attribute by the other objects of the dataset. We are interested in two relevant kinds of anomalies, referring to two different scenarios.

Lower Outlier. An object o is anomalous since, for a given attribute a, the value that o assumes on a is rare (its frequency is low) while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are high.

Upper Outlier. An object o is anomalous since, for a given attribute a, the value that o assumes on a is usual (its frequency is high) while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are low.

In order to discover outliers, we exploit the cumulated frequency distribution associated with the estimated density.

Definition 4 (Cumulated frequency distribution). Given a frequency distribution H = {f_1, ..., f_n}, the associated cumulated frequency distribution H is

    H(f) = \sum_{f_j \leq f} F_H(f_j).

In the following, we also refer to the value H(f_i) as H_i.

To quantify the degree of anomaly associated with a certain frequency, we use the areas above and below the curve of the cumulated frequency distribution. Intuitively, the larger the area A↑(f_i) above the portion of the curve ranging from a certain frequency f_i to the maximum frequency f_max, the more f_i differs from the frequencies greater than it. Thus, this area is exploited to associate a lower outlier score out↓(f_i) with the target frequency f_i. At the same time, the larger the area A↓(f_i) below the portion of the curve ranging from the minimum frequency f_min to a certain frequency f_i, the more f_i differs from the frequencies smaller than it. This area can be used to associate an upper outlier score out↑(f_i) with f_i. The outlierness associated with the frequency f_i is a combined measure of the above two normalised areas, and the exceptional values for an attribute a are those associated with large values of outlierness. More details are available in [5].

Fig. 1: Outlierness computation example (cumulated distribution H(f) for the frequency distribution f = 1, ..., 6 with w = 3, 2, 1, 2, 1, 2). The areas involved in the computation of the score are highlighted; they are normalized properly to get the upper and lower outlierness scores for the target frequency.
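The exact normalisation used to combine the two areas into the final outlierness score is given in [5]. As a rough sketch of the geometric quantities involved, under our own step-function reading of H(f) between consecutive frequencies (an assumption for illustration, not necessarily the authors' exact construction), one could proceed as follows.

```python
import numpy as np

def cumulated_distribution(freqs, F):
    """H(f_i) = sum of F(f_j) over f_j <= f_i (Definition 4); freqs sorted ascending."""
    return np.cumsum([F[f] for f in freqs])

def outlierness_areas(freqs, F):
    """Areas A_up(f_i) (above the curve, from f_i to f_max, evidence for a lower
    outlier) and A_down(f_i) (below the curve, from f_min to f_i, evidence for an
    upper outlier), reading H as a step function between consecutive frequencies."""
    freqs = np.asarray(sorted(freqs), dtype=float)
    H = cumulated_distribution(freqs, F)
    N = H[-1]                                  # total volume N(H)
    widths = np.diff(freqs)                    # lengths of the segments [f_j, f_{j+1}]
    above = widths * (N - H[:-1])              # area above each step, bounded by N
    below = widths * H[:-1]                    # area below each step
    A_up = [float(above[i:].sum()) for i in range(len(freqs))]
    A_down = [float(below[:i].sum()) for i in range(len(freqs))]
    return {f: (up, down) for f, up, down in zip(freqs, A_up, A_down)}

# example with the frequency distribution of Fig. 1 (the soft occurrences from the
# previous sketch could be plugged in for F; hard occurrences are used here for brevity)
freqs, mults = [1, 2, 3, 4, 5, 6], [3, 2, 1, 2, 1, 2]
F = {f: w * f for f, w in zip(freqs, mults)}
print(outlierness_areas(freqs, F))
```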
It must be pointed out that very often a value emerges as exceptional for a certain attribute only when we restrict our attention to a subset of the whole population. This intuition leads to the definition of the notion of explanation-property pair.

Definition 5. An explanation-property pair (E, p) consists of a condition E, also called explanation, and of an atomic condition p = {(p_a, p_v)}, also called property. By p_a (p_v, resp.) we denote the attribute (value, resp.) involved in the atomic condition p. The outlierness of an explanation-property pair (E, p) is the outlierness score associated with the value p_v with respect to the attribute p_a in the dataset D_E.

We implemented an algorithm that receives in input a dataset D and a depth parameter δ ≥ 1, and returns all the pairs (E, p) among those composed of at most δ atomic conditions. The algorithm analyzes explanations of length less than or equal to δ according to a depth-first strategy that allows an efficient selection of sub-populations, exploiting an approach similar to the one described in [3].

5 Experimental results

First of all, to study the applicability of our method to real datasets, we have tested its scalability. Then, to clarify how the nature of our anomalies differs from that of the anomalies returned by other outlier detection methods, we compared our method with traditional distance-based and density-based outlier detection approaches and with a method tailored for categorical data. Here, the results obtained on the Mushrooms dataset (n = 8,124 objects and m = 22 attributes) from the UCI ML Repository are reported. More experiments are discussed in [5].

Fig. 2: Scalability analysis (execution time in seconds vs. dataset size; panel (a): 7, 14 and 22 attributes at depth δ = 3; panel (b): depth δ from 1 to 4 on the full feature space).

Scalability. In the experiment reported in Figure 2a, we varied the number of objects n in {500, 1000, 2000, 5000, 8000} and the number of attributes m in {7, 14, 22}, while the depth parameter was held fixed at δ = 3. The dashed lines represent the trend of the linear growth estimated by regression. This estimation confirms that the algorithm scales linearly with respect to the dataset size. As for the number of attributes, as expected, for a given number of objects the execution time increases due to the growth of the associated search space. On the full dataset the execution time is very contained, amounting to about 2 minutes. In the experiment reported in Figure 2b, we varied both the number of objects n and the depth parameter δ in {1, 2, 3, 4}, while considering the full feature space. Also in this case the linear growth is represented by the dashed lines, and similar considerations can be drawn.

Comparison with outlier detection methods. We compare our method with two of the main categories of outlier definitions: (i) distance-based approaches, which are used to discover global outliers; (ii) density-based approaches, which are able to single out local outliers. As the distance-based definition, we use the average KNN score, representing the average distance of an object from its k nearest neighbours. As the density-based one, we use the Local Outlier Factor (LOF) [7]. Both methods employ the Hamming distance. Moreover, we compare our method with the ROAD algorithm [10], which exploits both densities and distances. We ranked the dataset objects o by assigning to each of them the largest outlierness of a pair π such that o ∈ D_π, and we determined our top-10 outliers as the objects associated with the largest outlierness values; then we computed their outlier scores according to the KNN, LOF and ROAD definitions. Note that all the chosen competitors require an input parameter k whose selection may be challenging, so we computed their scores for all the possible values of k and report the ranking positions associated with our top-10 outliers as box-plots generated as k varies.
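For reference, below is a minimal sketch of how the average k-NN score under the Hamming distance could be computed for the competitors' side of this comparison (our own illustrative code, not the authors' or the competitors' implementations; the data array X and the index set top10_idx of our top-10 outliers are assumed to be available).

```python
import numpy as np

def avg_knn_scores(X, k):
    """Average Hamming distance from each object to its k nearest neighbours.
    X: (n_objects, n_attributes) array of categorical codes. The distance used
    here is the fraction of differing attributes, which preserves the ranking."""
    n = len(X)
    D = np.array([[np.mean(X[i] != X[j]) for j in range(n)] for i in range(n)])
    np.fill_diagonal(D, np.inf)              # exclude the object itself
    knn = np.sort(D, axis=1)[:, :k]          # distances to the k nearest neighbours
    return knn.mean(axis=1)

def ranking_positions(scores, target_idx):
    """Rank of each target object (1 = highest outlier score) for a given k."""
    order = np.argsort(-scores)              # descending: largest score first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks[target_idx]

# positions of our top-10 outliers across a range of k values (box-plot material);
# X and top10_idx are assumed to be available
# positions = np.array([ranking_positions(avg_knn_scores(X, k), top10_idx)
#                       for k in range(1, len(X))])
```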
Fig. 3: Comparison with KNN, LOF and ROAD on Mushrooms (box-plots of the KNN, LOF and ROAD ranking positions of our top-10 outliers as k varies; panels (a)-(c) refer to lower outliers, panels (d)-(f) to upper outliers).

The plots in Fig. 3 highlight that the median ranking associated with our outliers can be far away from the top, and also that, within the whole ranking distribution, the same outlier can be ranked in very different positions. In general, it seems that lower outliers are more likely to be ranked well by our competitors than upper outliers, which witnesses the peculiar nature of upper outliers. On the Mushrooms dataset some of our lower outliers are, on average, ranked very high also by the other algorithms. Some of them are almost always top outliers for all methods (see the 1st, 2nd, 5th, and 7th top outliers), thus witnessing that these outliers have both a global and a local nature. However, most of our outliers are not detected by these techniques. Note that the best rankings associated with the selected objects are obtained for very different values of the parameter k. Since the output of the KNN, LOF and ROAD methods is determined for a single selected value of k, it is very unlikely that, even in the presence of some agreement between our top outliers and local and global outliers, they are simultaneously ranked in high positions for the same value of k.

6 Conclusions

In this work we have provided a contribution to singling out and explaining anomalous values in categorical domains. We perceive the frequencies of attribute values as samples of a distribution whose density has to be estimated. This leads to the notion of frequency occurrence that we exploit to build our definition of outlier. As a second contribution, our technique is able to provide interpretable explanations for the abnormal values discovered. Thus, the outliers we provide can be seen as a product of the knowledge mined, making the approach knowledge-centric rather than object-centric.

References

1. Angiulli, F., Basta, S., Pizzuti, C.: Distance-based detection and prediction of outliers. IEEE Transactions on Knowledge and Data Engineering 18(2) (2006)
2. Angiulli, F., Fassetti, F., Manco, G., Palopoli, L.: Outlying property detection with numerical attributes. Data Min. Knowl. Discov. 31(1) (2017)
3. Angiulli, F., Fassetti, F., Palopoli, L.: Detecting outlying properties of exceptional objects. ACM Transactions on Database Systems (TODS) 34(1) (2009)
4. Angiulli, F., Fassetti, F., Palopoli, L.: Discovering characterizations of the behavior of anomalous subpopulations. IEEE TKDE 25(6) (2013)
5. Angiulli, F., Fassetti, F., Palopoli, L., Serrao, C.: A density estimation approach for detecting and explaining exceptional values in categorical data. In: Discovery Science - 22nd International Conference, Proceedings. Springer (2019)
6. Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets.
IEEE Transactions on Knowledge and Data Engineering 17(2) (2005)
7. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: ACM SIGMOD Record, vol. 29. ACM (2000)
8. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41(3) (2009)
9. Li, J., Zhang, J., Pang, N., Qin, X.: Weighted outlier detection of high-dimensional categorical data using feature grouping. IEEE SMC (2018)
10. Suri, N.R., Murty, M.N., Athithan, G.: An algorithm for mining outliers in categorical data through ranking. In: IEEE HIS (2012)
11. Taha, A., Hadi, A.S.: Anomaly detection methods for categorical data: A review. ACM Computing Surveys (CSUR) 52(2) (2019)
12. Yu, J.X., Qian, W., Lu, H., Zhou, A.: Finding centric local outliers in categorical/numerical spaces. Knowledge and Information Systems 9(3), 309–338 (2006)