Detecting and Explaining Exceptional Values in Categorical Data

DISCUSSION PAPER

Fabrizio Angiulli, Fabio Fassetti, Luigi Palopoli, and Cristina Serrao

DIMES, University of Calabria, 87036 Rende (CS), Italy
{f.angiulli,f.fassetti,palopoli,c.serrao}@dimes.unical.it

Abstract. In this work we deal with the problem of detecting and explaining exceptionally behaving values in categorical datasets, by perceiving an attribute value as anomalous if its frequency occurrence is exceptionally typical or untypical within the distribution of the frequency occurrences of the other attribute values. The notion of frequency occurrence is obtained by specializing the Kernel Density Estimation method to the domain of frequency values, and an outlierness measure is defined by leveraging the cdf of such a density. This measure is able to simultaneously identify two kinds of anomalies, called lower outliers and upper outliers, namely exceptionally low- or high-frequency values. Moreover, data values labeled as outliers come with an interpretable explanation for their abnormality, which is a desirable feature of any knowledge discovery technique.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.

1 Introduction

An outlying observation is one that appears to deviate markedly from the other members of the sample in which it occurs. Such rare events can be even more interesting than the more regularly occurring ones, as they are suspected of not being generated by the same mechanisms as the rest of the data. We deal with categorical data and, specifically, we perceive an attribute value as anomalous if its frequency occurrence is exceptionally typical or untypical within the distribution of the frequency occurrences of the other attribute values. To quantify such a frequency occurrence, we specialize the classical Kernel Density Estimation technique to the domain of frequency values and arrive at the concept of soft frequency occurrence. The cumulated frequency distribution of the above density estimate is used to decide whether the frequency of a certain value is anomalous when compared to the other values' frequencies. In particular, we are able to identify two kinds of anomalies, namely lower outliers and upper outliers. A lower outlier is a value whose frequency is low while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are high. An upper outlier is a value whose frequency is high while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are low. Note that the measure detects both scenarios and automatically characterizes the target value as a lower or an upper outlier.

A value can show exceptional behaviour only when we restrict our attention to a subset of the whole population. Thus, we design our technique to output so-called explanation-property pairs (E, p), where E denotes a condition used to determine the target subpopulation and p represents an attribute p_a and a value p_v such that p_v is exceptionally frequent or infrequent within the subpopulation selected by the explanation E.

The rest of the work is organised as follows. Section 2 discusses related work. Section 3 introduces the frequency occurrence function.
Section 4 describes the outlierness function for ranking categorical values. Section 5 describes experimental results.

2 Related work

Categorical data has received relatively little attention compared to quantitative data, since detecting anomalies in the categorical domain is a challenging problem [11]. We start by noting that there is little literature about outlier explanation [3], i.e., the problem of detecting anomalous properties and/or related outlier objects equipped with features justifying their outlierness. Moreover, to the best of our knowledge, no technique is able to natively detect upper outliers.

Among the traditional outlier detection methods explored in the context of numerical data [8], two main families can be considered: distance-based [6] and density-based [7] methods. These ideas have been adapted to the categorical domain.

An example of a distance-based method is discussed in [6, 1]. Outliers are defined as the N observations whose average distance to the k nearest neighbors is the greatest; to this end, an appropriate distance has to be chosen. Detecting local anomalies, i.e. observations having outlying behavior in local areas, is another interesting discovery problem. Local anomaly detection methods for categorical data include k-LOF [12] and the WATCH method [9]. The first is an extension of the Local Outlier Factor (LOF) method [7] to the categorical domain, while the WATCH method [9] has been designed to detect outliers in high-dimensional categorical datasets by means of feature grouping. Both distance and density are taken into account by the ROAD algorithm [10]: it relies on the Hamming distance to compute cluster-based outliers, while the density, evaluated as the mean frequency of the values, is used to identify frequency-based outliers.

With regard to outlier explanation, [3] and [2] propose techniques for the categorical and the numerical domain, respectively, that, given in input a single object known to be an outlier, provide features justifying its anomaly and subpopulations where its exceptionality is evident. A generalization is proposed in [4].

3 Frequency Occurrence

In this section we give some preliminary definitions and introduce the notation employed throughout the paper. A dataset D on a set of categorical attributes A is a set of objects o assuming values on the attributes in A. By o[a] we denote the value of o on the attribute a ∈ A, and D[a] denotes the multiset {o[a] | o ∈ D}. Given a multiset V, the frequency f_v^V of the value v ∈ V is the number of occurrences of v in V. A condition C is a set of pairs (a, v), where each a is an attribute and each v ∈ D[a]. By D_C we denote the dataset {o ∈ D | o[a] = v, ∀(a, v) ∈ C}.

Definition 1 (Frequency distribution). A frequency distribution H is a multiset of the form H = {f_1^{(1)}, ..., f_1^{(w_1)}, ..., f_n^{(1)}, ..., f_n^{(w_n)}}, where each f_i ∈ N is a distinct frequency, f_i^{(j)} = f_i^{(k)} = f_i for each 1 ≤ j, k ≤ w_i, and w_i denotes the number of occurrences of the frequency f_i. By N(H) (or simply N whenever H is clear from the context) we denote w_1 · f_1 + ... + w_n · f_n.

For the sake of simplicity, we will refer to a frequency distribution as a set H = {f_1, f_2, ..., f_n} and to the number of occurrences w_i of f_i as w(f_i). Now we define the notion of frequency occurrence as a tool for quantifying how frequent a certain frequency is.

Definition 2 (Hard frequency occurrence). Given a frequency distribution H, the frequency occurrence F_H(f_i) of f_i, also denoted by F(f_i) whenever H is clear from the context, is the product w_i · f_i.
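To make Definitions 1 and 2 concrete, the following minimal sketch (our own illustrative code, not part of the original formulation; the function names are ours) builds the frequency distribution of a categorical column and computes the hard frequency occurrence of each distinct frequency.

```python
from collections import Counter

def frequency_distribution(column):
    """Build the frequency distribution H of a categorical column.

    Returns two parallel lists: the distinct frequencies f_i and their
    multiplicities w_i (how many distinct values occur exactly f_i times).
    """
    value_freq = Counter(column)              # frequency of each categorical value
    freq_mult = Counter(value_freq.values())  # w_i for each distinct frequency f_i
    freqs = sorted(freq_mult)
    mults = [freq_mult[f] for f in freqs]
    return freqs, mults

def hard_frequency_occurrence(freqs, mults):
    """Hard frequency occurrence of Definition 2: F(f_i) = w_i * f_i."""
    return {f: w * f for f, w in zip(freqs, mults)}

# toy example: value 'a' occurs 3 times, 'b' and 'c' twice, 'd' once
column = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd']
freqs, mults = frequency_distribution(column)
print(freqs, mults)                              # [1, 2, 3] [1, 2, 1]
print(hard_frequency_occurrence(freqs, mults))   # {1: 1, 2: 4, 3: 3}
```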
Definition 2 allows us to associate with each distinct value in D[a] a score that is related not only to its frequency in the dataset, but also to how many other values have its same frequency. However, close frequency values do not interact with each other: e.g., having f_i = 49 with w_i = 1 and f_{i+1} = 51 with w_{i+1} = 1 is completely different from having f_{i'} = 50 with w_{i'} = 2, since in the former case F(f_i) = 49 and F(f_{i+1}) = 51, while in the latter F(f_{i'}) = 100, which is about twice as large. Intuitively, we do not want such a small variation in the frequency distribution to impact the frequency occurrence values so heavily. To force close frequency values to influence each other and jointly contribute to the frequency occurrence, we design an ad-hoc density estimation method inspired by Kernel Density Estimation (KDE).

A (discrete) kernel function K_{f_i} with parameter f_i is a probability mass function having the property that sup_{f≥0} K_{f_i}(f) = K_{f_i}(f_i). Given an interval I = [f_l, f_u] of frequencies, a frequency f_i, and a kernel function K, the volume of K_{f_i} in I, denoted as V_I(K_{f_i}), is given by \sum_{f=f_l}^{f_u} K_{f_i}(f). The following expression

    F(f) = \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \sum_{\varphi \in I(f)} K_{f_i}(\varphi) \right],    (1)

where I(f) represents an interval of frequencies centred in f, provides the density estimate of the frequency occurrence of the frequency f.

Since K_{f_i}(·) is a probability mass function, the frequency f_i provides a contribution to the frequency occurrence of f corresponding to the portion of the volume of K_{f_i} which is contained in I(f), that is V_{I(f)}(K_{f_i}). It is possible to eliminate the dependence on I(f) by properly weighting the contribution of K_{f_i}(·) based on its distance from the target frequency f. Such a weight can be directly obtained from the associated kernel as the ratio between the probability of observing the frequency f and the probability of observing f_i, when such frequencies are realizations of a random variable distributed according to K_{f_i}(·). This allows us to rewrite Equation (1) as follows:

    F(f) = \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \left( \sum_{\varphi \geq 0} K_{f_i}(\varphi) \right) \cdot \frac{K_{f_i}(f)}{K_{f_i}(f_i)} \right].    (2)

Note that the summation of K_{f_i}(·) over its whole domain is equal to 1, since it is a probability mass function. Moreover, as F represents a notion of density function associated with frequency occurrences, it is preferable that its volume evaluated over the frequencies H = {f_1, ..., f_n} amounts to N(H). This leads to the following final form of the frequency occurrence function.

Definition 3 (Soft occurrence function). Given a frequency distribution H, the frequency occurrence F_H(f_i) of f_i, also denoted by F(f_i) whenever H is clear from the context, is given by the following expression

    F(f) = \frac{N(H)}{N_F(H)} \cdot \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \widehat{K}_{f_i}(f) \right],    (3)

where

    \widehat{K}_{f_i}(f) = \frac{K_{f_i}(f)}{K_{f_i}(f_i)}   and   N_F(H) = \sum_{j=1}^{n} \sum_{i=1}^{n} \left[ w_i \cdot f_i \cdot \widehat{K}_{f_i}(f_j) \right].

As for the kernel selection, we exploit the binomial distribution binopdf(f; n, p), with the parameter n, denoting the number of independent trials, set to N(H), and the parameter p, denoting the success probability, set to p = f_i/N(H).
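As an illustration of Definition 3, the following sketch (our own code, written under the stated binomial-kernel choice; names and structure are ours, not the authors' implementation) computes the soft frequency occurrence of every distinct frequency using scipy's binomial pmf.

```python
import numpy as np
from scipy.stats import binom

def soft_frequency_occurrence(freqs, mults):
    """Sketch of the soft occurrence function of Eq. (3).

    freqs: distinct frequencies f_i; mults: their multiplicities w_i.
    The kernel K_{f_i} is a binomial pmf with N(H) trials and p = f_i / N(H).
    """
    freqs = np.asarray(freqs, dtype=float)
    mults = np.asarray(mults, dtype=float)
    N = float((mults * freqs).sum())                     # N(H)

    def pmf(f, fi):                                      # K_{f_i}(f)
        return binom.pmf(f, int(N), fi / N)

    # K_hat[i, j] = K_{f_i}(f_j) / K_{f_i}(f_i)
    K_hat = np.array([[pmf(fj, fi) / pmf(fi, fi) for fj in freqs] for fi in freqs])

    unnorm = (mults * freqs) @ K_hat                     # sum_i w_i f_i K_hat_{f_i}(f_j)
    NF = unnorm.sum()                                    # normalisation term N_F(H)
    F = N / NF * unnorm                                  # values of F(f_j), summing to N(H)
    return dict(zip(freqs, F))

# toy frequency distribution: f = [1, 2, 3] with w = [1, 2, 1], so N(H) = 8
print(soft_frequency_occurrence([1, 2, 3], [1, 2, 1]))
```

Unlike the hard occurrence, here a frequency also receives mass from nearby frequencies, so the two close-frequency scenarios discussed above yield comparable scores.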
4 Categorical Outlierness

The idea behind the measure we discuss in the following is that an object in a categorical dataset can be considered an outlier with respect to an attribute if the frequency of the value it assumes on that attribute is rare when compared to the frequencies associated with the other values assumed on the same attribute by the other objects of the dataset. We are interested in two relevant kinds of anomalies, referring to two different scenarios.

Lower Outlier. An object o is anomalous since, for a given attribute a, the value that o assumes on a is rare (its frequency is low) while, typically, the dataset objects assume a few similar values, namely the frequencies of the other values are high.

Upper Outlier. An object o is anomalous since, for a given attribute a, the value that o assumes on a is usual (its frequency is high) while, typically, the dataset objects assume almost distinct values, namely the frequencies of the other values are low.

In order to discover outliers, we exploit the cumulated frequency distribution associated with the estimated density.

Definition 4 (Cumulated frequency distribution). Given a frequency distribution H = {f_1, ..., f_n}, the associated cumulated frequency distribution H is

    H(f) = \sum_{f_j \leq f} F_H(f_j).

In the following, we also refer to the value H(f_i) as H_i.

To quantify the degree of anomaly associated with a certain frequency, we use the areas above and below the curve of the cumulated frequency distribution. Intuitively, the larger the area A↑(f_i) above the portion of the curve ranging from a certain frequency f_i to the maximum frequency f_max, the more f_i differs from the frequencies greater than it. Thus, this area is exploited to associate a lower outlier score out↓(f_i) with the target frequency f_i. At the same time, the larger the area A↓(f_i) below the portion of the curve ranging from the minimum frequency f_min to a certain frequency f_i, the more f_i differs from the frequencies smaller than it. This area can be used to associate an upper outlier score out↑(f_i) with f_i. The outlierness associated with the frequency f_i is a combined measure of the above two normalised areas, and the exceptional values for an attribute a are those associated with large values of outlierness. More details are available in [5].

Fig. 1: Outlierness computation example (cumulated distribution H(f) for the frequency distribution f = 1, ..., 6 with w = 3, 2, 1, 2, 1, 2). The areas involved in the computation of the score are highlighted; they are normalized properly to get the upper and lower outlierness scores for the target frequency.
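The exact normalisation used to combine the two areas into the final outlierness score is given in [5]. As a rough sketch of the geometric quantities involved, under our own step-function reading of H(f) between consecutive frequencies (an assumption for illustration, not necessarily the authors' exact construction), one could proceed as follows.

```python
import numpy as np

def cumulated_distribution(freqs, F):
    """H(f_i) = sum of F(f_j) over f_j <= f_i (Definition 4); freqs sorted ascending."""
    return np.cumsum([F[f] for f in freqs])

def outlierness_areas(freqs, F):
    """Areas A_up(f_i) (above the curve, from f_i to f_max, evidence for a lower
    outlier) and A_down(f_i) (below the curve, from f_min to f_i, evidence for an
    upper outlier), reading H as a step function between consecutive frequencies."""
    freqs = np.asarray(sorted(freqs), dtype=float)
    H = cumulated_distribution(freqs, F)
    N = H[-1]                                  # total volume N(H)
    widths = np.diff(freqs)                    # lengths of the segments [f_j, f_{j+1}]
    above = widths * (N - H[:-1])              # area above each step, bounded by N
    below = widths * H[:-1]                    # area below each step
    A_up = [float(above[i:].sum()) for i in range(len(freqs))]
    A_down = [float(below[:i].sum()) for i in range(len(freqs))]
    return {f: (up, down) for f, up, down in zip(freqs, A_up, A_down)}

# example with the frequency distribution of Fig. 1 (the soft occurrences from the
# previous sketch could be plugged in for F; hard occurrences are used here for brevity)
freqs, mults = [1, 2, 3, 4, 5, 6], [3, 2, 1, 2, 1, 2]
F = {f: w * f for f, w in zip(freqs, mults)}
print(outlierness_areas(freqs, F))
```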
It must be pointed out that very often a value emerges as exceptional for a certain attribute only when we restrict our attention to a subset of the whole population. This intuition leads to the definition of the notion of explanation-property pair.

Definition 5. An explanation-property pair (E, p) consists of a condition E, also called explanation, and of an atomic condition p = {(p_a, p_v)}, also called property. By p_a (p_v, resp.) we denote the attribute (value, resp.) involved in the atomic condition p. The outlierness of an explanation-property pair (E, p) is the outlierness score associated with the value p_v with respect to the attribute p_a in the dataset D_E.

We implemented an algorithm that receives in input a dataset D and a depth parameter δ ≥ 1, and returns all the pairs (E, p) among those composed of at most δ atomic conditions. The algorithm analyzes explanations of length less than or equal to δ according to a depth-first strategy that allows an efficient selection of sub-populations, exploiting an approach similar to the one described in [3].

5 Experimental results

First of all, to study the applicability of our method to real datasets, we have tested its scalability. Then, to clarify how the nature of our anomalies differs from that of the anomalies returned by other outlier detection methods, we compared our method with traditional distance-based and density-based outlier detection approaches and with a method tailored for categorical data. Here, the results obtained on the Mushrooms dataset (n = 8,124 objects and m = 22 attributes) from the UCI ML Repository are reported. More experiments are discussed in [5].

Fig. 2: Scalability analysis (execution time in seconds vs. dataset size; panel (a): 7, 14 and 22 attributes at depth δ = 3; panel (b): depth δ from 1 to 4 on the full feature space).

Scalability. In the experiment reported in Figure 2a, we varied the number of objects n in {500, 1000, 2000, 5000, 8000} and the number of attributes m in {7, 14, 22}, while the depth parameter was held fixed at δ = 3. The dashed lines represent the trend of the linear growth estimated by regression. This estimation confirms that the algorithm scales linearly with respect to the dataset size. As for the number of attributes, as expected, for a given number of objects the execution time increases due to the growth of the associated search space. On the full dataset the execution time is very contained, amounting to about 2 minutes. In the experiment reported in Figure 2b, we varied both the number of objects n and the depth parameter δ in {1, 2, 3, 4}, while considering the full feature space. Also in this case the linear growth is represented by the dashed lines, and similar considerations can be drawn.

Comparison with outlier detection methods. We compare our method with two of the main categories of outlier definitions: (i) distance-based approaches, which are used to discover global outliers; (ii) density-based approaches, which are able to single out local outliers. As the distance-based definition, we use the average KNN score, representing the average distance of an object from its k nearest neighbours. As the density-based one, we use the Local Outlier Factor (LOF) [7]. Both methods employ the Hamming distance. Moreover, we compare our method with the ROAD algorithm [10], which exploits both densities and distances. We ranked the dataset objects o by assigning to each of them the largest outlierness of a pair π such that o ∈ D_π, and we determined our top-10 outliers as the objects associated with the largest outlierness values; then we computed their outlier scores according to the KNN, LOF and ROAD definitions. Note that all the chosen competitors require an input parameter k whose selection may be challenging, so we computed their scores for all the possible values of k and report the ranking positions associated with our top-10 outliers as box-plots generated as k varies.
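For reference, below is a minimal sketch of how the average k-NN score under the Hamming distance could be computed for the competitors' side of this comparison (our own illustrative code, not the authors' or the competitors' implementations; the data array X and the index set top10_idx of our top-10 outliers are assumed to be available).

```python
import numpy as np

def avg_knn_scores(X, k):
    """Average Hamming distance from each object to its k nearest neighbours.
    X: (n_objects, n_attributes) array of categorical codes. The distance used
    here is the fraction of differing attributes, which preserves the ranking."""
    n = len(X)
    D = np.array([[np.mean(X[i] != X[j]) for j in range(n)] for i in range(n)])
    np.fill_diagonal(D, np.inf)              # exclude the object itself
    knn = np.sort(D, axis=1)[:, :k]          # distances to the k nearest neighbours
    return knn.mean(axis=1)

def ranking_positions(scores, target_idx):
    """Rank of each target object (1 = highest outlier score) for a given k."""
    order = np.argsort(-scores)              # descending: largest score first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks[target_idx]

# positions of our top-10 outliers across a range of k values (box-plot material);
# X and top10_idx are assumed to be available
# positions = np.array([ranking_positions(avg_knn_scores(X, k), top10_idx)
#                       for k in range(1, len(X))])
```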
Fig. 3: Comparison with KNN, LOF and ROAD on Mushrooms (box-plots of the KNN, LOF and ROAD ranking positions of our top-10 outliers as k varies; panels (a)-(c) refer to lower outliers, panels (d)-(f) to upper outliers).

The plots in Fig. 3 highlight that the median ranking associated with our outliers can be far away from the top, and also that, within the whole ranking distribution, the same outlier can be ranked in very different positions. In general, it seems that lower outliers are more likely to be ranked well by our competitors than upper outliers, which witnesses the peculiar nature of upper outliers. On the Mushrooms dataset some of our lower outliers are, on average, ranked very high also by the other algorithms. Some of them are almost always top outliers for all methods (see the 1st, 2nd, 5th, and 7th top outliers), thus witnessing that these outliers have both a global and a local nature. However, most of our outliers are not detected by these techniques. Note that the best rankings associated with the selected objects are obtained for very different values of the parameter k. Since the output of the KNN, LOF and ROAD methods is determined for a single selected value of k, it is very unlikely that, even in the presence of some agreement between our top outliers and local and global outliers, they are simultaneously ranked in high positions for the same value of k.

6 Conclusions

In this work we have provided a contribution to singling out and explaining anomalous values in categorical domains. We perceive the frequencies of attribute values as samples of a distribution whose density has to be estimated. This leads to the notion of frequency occurrence that we exploit to build our definition of outlier. As a second contribution, our technique is able to provide interpretable explanations for the abnormal values discovered. Thus, the outliers we provide can be seen as a product of the knowledge mined, making the approach knowledge-centric rather than object-centric.

References

1. Angiulli, F., Basta, S., Pizzuti, C.: Distance-based detection and prediction of outliers. IEEE Transactions on Knowledge and Data Engineering 18(2) (2006)
2. Angiulli, F., Fassetti, F., Manco, G., Palopoli, L.: Outlying property detection with numerical attributes. Data Min. Knowl. Discov. 31(1) (2017)
3. Angiulli, F., Fassetti, F., Palopoli, L.: Detecting outlying properties of exceptional objects. ACM Transactions on Database Systems (TODS) 34(1) (2009)
4. Angiulli, F., Fassetti, F., Palopoli, L.: Discovering characterizations of the behavior of anomalous subpopulations. IEEE TKDE 25(6) (2013)
5. Angiulli, F., Fassetti, F., Palopoli, L., Serrao, C.: A density estimation approach for detecting and explaining exceptional values in categorical data. In: Discovery Science - 22nd International Conference, Proceedings. Springer (2019)
6. Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets.
IEEE Transactions on Knowledge and Data Engineering 17(2) (2005)
7. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: ACM SIGMOD Record, vol. 29. ACM (2000)
8. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41(3) (2009)
9. Li, J., Zhang, J., Pang, N., Qin, X.: Weighted outlier detection of high-dimensional categorical data using feature grouping. IEEE SMC (2018)
10. Suri, N.R., Murty, M.N., Athithan, G.: An algorithm for mining outliers in categorical data through ranking. In: IEEE HIS (2012)
11. Taha, A., Hadi, A.S.: Anomaly detection methods for categorical data: A review. ACM Computing Surveys (CSUR) 52(2) (2019)
12. Yu, J.X., Qian, W., Lu, H., Zhou, A.: Finding centric local outliers in categorical/numerical spaces. Knowledge and Information Systems 9(3), 309–338 (2006)