=Paper=
{{Paper
|id=Vol-2175/paper05
|storemode=property
|title=Unsupervised Ensembles for Outlier Detection
|pdfUrl=https://ceur-ws.org/Vol-2175/paper05.pdf
|volume=Vol-2175
|authors=Guilherme Campos
|dblpUrl=https://dblp.org/rec/conf/vldb/Campos18
}}
==Unsupervised Ensembles for Outlier Detection==
Guilherme O. Campos, supervised by Prof. Dr. Wagner Meira Jr.

Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil (gocampos@dcc.ufmg.br)

''Proceedings of the VLDB 2018 PhD Workshop, August 27, 2018, Rio de Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying permitted for private and academic purposes.''

===Abstract===

Ensemble techniques have been applied to the unsupervised outlier detection problem in some scenarios. The challenges are the generation of diverse ensemble members and the combination of individual results into an ensemble. For the latter challenge, some methods have tried to design smaller ensembles out of a wealth of possible ensemble members in order to improve the diversity and accuracy of the ensemble (relating to the ensemble selection problem in classification). In this paper, we propose a boosting strategy to solve the ensemble selection problem, called BoostSelect. We evaluate BoostSelect over a large benchmark of datasets for outlier detection, showing improvements over baseline approaches.

===1. Introduction===

The identification of outliers (i.e., data objects that do not fit well to the general data distribution) is very important in many practical applications. Application examples are the detection of credit card fraud in financial transaction data, the identification of measurement errors in scientific data, or the analysis of sports statistics data.

Recent research on the unsupervised problem of outlier detection advanced the area by applying ensemble techniques [31]. Ensemble methods, i.e., combining the findings or results of individual learners into an integrated, typically more reliable and better result, are well established in the supervised context of classification or regression [20]. In unsupervised learning, the theoretical underpinnings are less clear, but they can be drawn in analogy to the supervised context, as has been done for clustering ensembles [30].

The focus of my PhD thesis is on ensemble selection, which has been well studied in supervised scenarios [4] (also called selective ensembles [29] or ensemble pruning [15, 27, 30]). Ensemble selection is also related to boosting [22], which is often used to change the training conditions for additionally sought, yet to be trained ensemble members, or to select the most suitable additional ensemble members from a pool of solutions.

Given many outlier detection results, how do we select which members to include in the ensemble? How do we combine the selected members into a single, more robust result? These challenges remain pertinent in the unsupervised setting, and more research is needed to advance our knowledge of this topic. In this paper we propose an initial approach towards a more robust ensemble method by transferring the supervised boosting technique to the unsupervised scenario of outlier detection ensembles. The proposed outlier ensemble selection technique is called BoostSelect.

===2. Related Work===

The ensemble approach to learning has been studied in outlier detection several times. In analogy to supervised learning, an ensemble can be expected to improve over its components if these components deliver results with reasonable accuracy while being diverse [31]. The two main challenges for creating good ensembles are, therefore, (i) the generation of diverse (potential) ensemble members, and (ii) the combination (or selection) of members into an ensemble.

Several strategies to achieve diversity among ensemble members have been explored, such as feature bagging (i.e., combining outlier scores learned on different subsets of attributes) [13], different parameter choices for some base method [5], the combination of actually different base methods [16, 10, 23], the introduction of a random component into a given learner [14], the use of different subsamples of the data objects (parallel [33] or sequential [21, 19]), adding some random noise component to the data ("perturbation") [32], or using approximate neighborhoods for density estimates [8]. Likewise, different combination procedures have been proposed, based on outlier scores or on outlier ranks [13, 5, 10, 31].

Some methods have also been proposed to select the more diverse or the more accurate ensemble members [23, 18]. These unsupervised methods construct a target result vector from the unfiltered results and then sequentially select individual results that fit the target vector while being different from the already selected solutions. The "Greedy ensemble" approach [23] fails to generate a good target vector: it selects a percentage of top instances from each ensemble candidate to compose the vector, a procedure that normally pulls many inliers into the target vector due to the imbalance between few outliers and many inliers. To address this problem, the "SelectV" approach [18] generates the target vector by taking the average over all outlier detection results (the possible ensemble members). The main problem with "SelectV" is that the algorithm does not consider diversity as an important factor in selecting ensemble members.

===3. Boosting for Ensemble Selection===

Starting from the ideas discussed for the "Greedy ensemble" [23] and for "SelectV" [18], we propose here an improved outlier ensemble selection method that is amenable to the application of boosting techniques. Boosting is well studied in supervised contexts [22]. We design and apply an equivalent technique in the unsupervised setting to select good components for an ensemble of outlier detectors, resulting in our method BoostSelect.

====3.1 Construction of the Target Vector====

As a prerequisite for the combination of different outlier score lists (i.e., individual results, potential ensemble members), we normalize the scores following established procedures [10]. The target vector is constructed by combining the scores of all available results. Different combination methods could be used here; without further assumptions, taking the average score is the most natural approach [31], i.e., the target vector lists the average score of all individual results for each data object. From this target vector, we preliminarily assume the top ⌊n·t⌋ objects (ranked by their combined score) to be outliers, where n is the dataset size and 0 < t ≤ 1 is a parameter capturing the expected percentage of outliers in the dataset (i.e., K = ⌊n·t⌋ outliers are assumed to be present). The target vector thus becomes a binary vector, listing 1 for an (alleged) outlier and 0 for an (alleged) inlier, and it serves as pseudo ground truth for the boosting approach to ensemble selection.
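As an illustration of this step, the following is a minimal Python sketch of the target vector construction, assuming the individual results are already normalized to [0, 1] and stacked row-wise into a NumPy array; the function name build_target_vector is ours, not from the paper.

<pre>
import numpy as np

def build_target_vector(score_lists, t):
    """Binary pseudo ground truth from a set of normalized outlier score lists.

    score_lists : array of shape (m, n); one row per detector result,
                  with scores already normalized to [0, 1].
    t           : expected fraction of outliers, 0 < t <= 1.
    """
    avg = score_lists.mean(axis=0)      # average-score combination [31]
    n = avg.shape[0]
    k = int(np.floor(n * t))            # K = floor(n * t) alleged outliers
    target = np.zeros(n, dtype=int)
    target[np.argsort(-avg)[:k]] = 1    # top-K combined scores become 1
    return target
</pre>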
====3.2 Weights and Ensemble Diversity====

Weighted Pearson correlation has been proposed as a similarity measure for outlier rankings [23]. We follow the procedure of Schubert et al. [23], setting weights for the Pearson correlation of outliers and inliers. Different from previous approaches, though, these values are only the initial weights; the weights will subsequently be updated by the boosting procedure.

The potential ensemble members are sorted according to their weighted Pearson correlation to the target vector. The candidate that is most similar to the target vector is chosen as the first ensemble member. The remaining potential ensemble members are iteratively re-sorted in ascending order according to their similarity to the current prediction of the ensemble, resulting in a preference for the most different (i.e., most complementary) additional ensemble members. Potential members are included if their inclusion would increase the similarity of the ensemble prediction to the target vector; otherwise they are discarded. If the correlation improves, the ensemble is updated and the remaining lists are re-sorted by their weighted Pearson correlation to the updated prediction.
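To make the similarity measure concrete, here is a small sketch of a standard weighted Pearson correlation together with the initial outlier/inlier weighting described above; the helper names are illustrative, and the weighting mirrors the scheme of Schubert et al. [23] as summarized in this section.

<pre>
import numpy as np

def initial_weights(target):
    """Initial weights in the style of Schubert et al. [23]:
    the K alleged outliers share half of the total weight,
    the n - K alleged inliers share the other half."""
    n, k = target.size, int(target.sum())
    return np.where(target == 1, 1.0 / (2 * k), 1.0 / (2 * (n - k)))

def weighted_pearson(x, y, w):
    """Weighted Pearson correlation (wPC) between two score vectors."""
    w = w / w.sum()                           # normalize the weights
    mx, my = np.sum(w * x), np.sum(w * y)     # weighted means
    cov = np.sum(w * (x - mx) * (y - my))     # weighted covariance
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)
</pre>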
====3.3 Boosting Procedure====

The boosting is performed upon the inclusion of a new member into the ensemble. The idea is to reduce the weights of those outliers that have already been identified by some ensemble member. The weights are reduced by a specified parameter 0 < d < 1 (the drop rate). The boosting effect is that the selection will prefer additional ensemble members that detect outliers not yet detected by any ensemble member, while very easy outliers that have already been detected by many ensemble members get assigned smaller and smaller weights.

Algorithm 1 lists the steps of the overall framework BoostSelect in pseudo code; the boosting procedure is detailed in Algorithm 2.

<pre>
Algorithm 1: BoostSelect
Input:  P := set of normalized outlier score lists,
        d := drop rate (percentage),
        t := threshold (percentage),
        combination := combination technique
Output: E := selected ensemble members

W := weight vector of length n;  E := ∅
target := combination(P)                         // generate the target vector
target := convertBinary(target, t)               // top K = ⌊n·t⌋ scores become 1, others 0
W := (out = 1/(2K), in = 1/(2(n−K)))             // K = number of assumed outliers, n = dataset size
sort P by weighted Pearson correlation (wPC) to target, descending
f := getFirst(P)                                 // remove f from P
E := E ∪ {f}
while P ≠ ∅ do
    curr := combination(E)                       // current ensemble prediction
    sort P by wPC to curr, ascending
    f := getFirst(P)                             // remove f from P
    if wPC(combination(E ∪ {f}), target) > wPC(curr, target) then
        E := E ∪ {f}                             // include into the ensemble
        Boosting(W, target, f, t, d)             // adapt the weights
    end if
end while
</pre>

<pre>
Algorithm 2: Boosting
Input:  W := weight vector, target := target vector,
        f := new ensemble member,
        t := threshold (percentage), d := drop rate (percentage)
Output: W := updated weights

outliers := convertBinary(f, t)
for i in 1 .. size(target) do
    if target(i) == 1 and outliers(i) == 1 then
        W(i) := W(i) * d                         // drop the weight of an already detected outlier
    end if
end for
</pre>
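For readers who prefer executable code, the following is a hedged Python sketch of Algorithms 1 and 2. It reuses weighted_pearson and initial_weights from the sketch in Section 3.2, assumes the average score as the combination technique, and is meant to convey the control flow rather than to serve as a reference implementation.

<pre>
import numpy as np

def convert_binary(scores, t):
    """Top floor(n*t) scores become 1, the rest 0 (convertBinary in Algorithm 1)."""
    n = scores.shape[0]
    k = int(np.floor(n * t))
    binary = np.zeros(n, dtype=int)
    binary[np.argsort(-scores)[:k]] = 1
    return binary

def boosting(w, target, f, t, d):
    """Algorithm 2: multiply by the drop rate d the weights of alleged
    outliers that the new ensemble member f also ranks among its top scores."""
    detected = (target == 1) & (convert_binary(f, t) == 1)
    w[detected] *= d
    return w

def boost_select(P, t, d):
    """Algorithm 1 (sketch): boosting-driven greedy ensemble selection.
    P is a list of normalized score vectors; returns the selected members E."""
    target = convert_binary(np.mean(P, axis=0), t)   # binary pseudo ground truth
    w = initial_weights(target)                      # from the sketch in Section 3.2
    # First member: the candidate most similar to the target vector.
    P = sorted(P, key=lambda p: weighted_pearson(p, target, w), reverse=True)
    E = [P.pop(0)]
    while P:
        curr = np.mean(E, axis=0)                    # current ensemble prediction
        # Most complementary candidate first: ascending similarity to curr.
        P.sort(key=lambda p: weighted_pearson(p, curr, w))
        f = P.pop(0)
        cand = np.mean(E + [f], axis=0)
        if weighted_pearson(cand, target, w) > weighted_pearson(curr, target, w):
            E.append(f)                              # include into the ensemble
            w = boosting(w, target, f, t, d)         # adapt the weights
    return E
</pre>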
===4. Experiments===

====4.1 Datasets====

For evaluation, we use a benchmark data repository for outlier detection [3]. The repository is based on 23 base datasets, processed in different ways, mainly to provide variants with different percentages of outliers and with different handling of dataset characteristics such as duplicates, attribute normalization, and categorical values. As suggested for analysis [3], we focus on the normalized datasets without duplicates, which leaves us with 422 dataset variants.

====4.2 Ensemble Members====

As basic outlier detection results (i.e., potential ensemble members) we use the results provided along with the datasets [3], covering 12 neighborhood-based outlier detection algorithms with the neighborhood size k varying from 1 to 100. The algorithms are: KNN [17], KNNW [1], LOF [2], SimplifiedLOF [25], LoOP [9], LDOF [28], ODIN [6], FastABOD [11], KDEOS [24], LDF [12], INFLO [7], and COF [26]. For LDOF and KDEOS, k must be larger than 1, and for FastABOD, k must be larger than 2, resulting in 1196 results per dataset (fewer on some small datasets where k cannot reach 100). These results compose the set of potential ensemble members.

The outlier scores of these results are processed (following Kriegel et al. [10]) by applying an inverse logarithmic scaling to FastABOD results and an inverse linear scaling to ODIN results, since FastABOD and ODIN produce inverted scores (i.e., the lower the score, the higher the chance of an observation being an outlier). Then a simple linear scaling to the range from 0 to 1 is applied to bring all scores into the same range.
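The exact regularization functions are those of Kriegel et al. [10]; as a rough sketch, and under the assumption that the raw scores are strictly positive, the two inversions and the final rescaling could look as follows (function names are ours):

<pre>
import numpy as np

def inverse_log_scale(s):
    """Logarithmic inversion for detectors whose low scores indicate outliers
    (applied here to FastABOD); assumes strictly positive raw scores."""
    return -np.log(s / s.max())

def inverse_linear_scale(s):
    """Linear inversion (applied here to ODIN): low raw scores become high."""
    return s.max() - s

def linear_unit_scale(s):
    """Map a score vector linearly onto the range [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())
</pre>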
====4.3 Competitors and Settings====

We compare BoostSelect against the Greedy [23] and SelectV [18] ensembles. We also generate a "Naïve" ensemble and Random ensembles as baselines. The "Naïve" ensemble is a combination of all individual outlier detection results (i.e., a full ensemble without any selection procedure).

For each instance of an ensemble selection strategy (Greedy, SelectV, and BoostSelect, respectively, on each dataset), we generate 1000 "Random" ensembles consisting of the same number of members as the corresponding selective ensemble, where the ensemble members are selected at random.

We set the Greedy ensemble rate parameter to 0.01, as suggested by the authors of the Greedy ensemble [23]. We test a range of parameters for BoostSelect: d ∈ {0.25, 0.5, 0.75} and t ∈ {0.05, 0.1, 0.15}. As the combination technique for ensembles we use the average score.
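As an illustration of the Random baselines, the following sketch draws a number of random ensembles of a given size from the pool of results and reports their average ROC AUC; it uses scikit-learn's roc_auc_score, and the ground-truth labels are of course used only for evaluation, never for selection.

<pre>
import numpy as np
from sklearn.metrics import roc_auc_score

def random_ensemble_auc(P, size, labels, runs=1000, seed=0):
    """Average ROC AUC over `runs` random ensembles of `size` members,
    drawn without replacement from the pool P of normalized score vectors;
    `labels` is the ground truth, used only for evaluation."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(runs):
        idx = rng.choice(len(P), size=size, replace=False)
        prediction = np.mean([P[i] for i in idx], axis=0)  # average-score combination
        aucs.append(roc_auc_score(labels, prediction))
    return float(np.mean(aucs))
</pre>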
====4.4 Results====

Figure 1 shows pairwise comparisons between all ensembles over all datasets, considering the ROC AUC evaluation measure (the area under the curve of the receiver operating characteristic). We compare the ensemble selection techniques "Naïve", "Greedy", "SelectV", and "BS" (BoostSelect). We include random ensembles for each ensemble selection strategy and for each parametrization of BoostSelect: RG (Random Greedy), RS (Random SelectV), and RBS (Random BoostSelect). The numbers represent on how many datasets the ensemble listed in the row performed better than the ensemble listed in the column. Numbers representing the majority (more than 50%) of the datasets are shown in white, smaller numbers in black; the larger the number, the darker its background. For the random ensembles, we take the average performance over the 1000 instances.

[Figure 1: Summarization of pairwise comparisons over all 422 datasets. Each number counts the wins in terms of ROC AUC (average ROC AUC in the case of random ensembles) of the ensemble listed in the row against the ensemble listed in the column.]

The best overall method is BoostSelect with d = 0.75 and t = 0.05, which has more losses than wins only when competing against BoostSelect with d = 0.25 and t = 0.1. The Greedy ensemble does not perform well in general, having more losses than wins against every other competitor. SelectV is better than all random variants and than Greedy, but worse than Naïve and worse than all BoostSelect configurations. The Naïve ensemble behaves very consistently, beating all random ensemble approaches by a large margin, but it still has more losses than wins when compared to BoostSelect. Even though neither the threshold t nor the drop rate d has a strong impact on the number of wins, setting a relatively large drop rate and a relatively small threshold seems overall to be a good choice of parameters for BoostSelect, although the optimal parameter choice differs from dataset to dataset.

Looking at the top left quadrant of the heat map (Figure 1), where the random ensembles compete against each other, we also see a broad dominance of the random ensembles based on BoostSelect. This suggests that the number of ensemble members selected by BoostSelect is a better choice than the numbers selected by the other strategies.

===5. Conclusion and Future Directions===

We proposed a new ensemble selection strategy for unsupervised outlier detection ensembles, using an unsupervised equivalent of a boosting strategy for ensemble selection. Experiments show the favorable behavior of the new ensemble selection strategy compared to the existing methods (Greedy and SelectV) on a large set of benchmark datasets. The main differences between our method BoostSelect, the Greedy ensemble, and SelectV can be attributed to different ways of balancing diversity and accuracy of ensemble members: Greedy goes all out for diversity and mostly disregards accuracy, while SelectV ignores diversity and maximizes the accuracy of the ensemble members. Our new method BoostSelect considers both diversity and accuracy in a balanced manner and performs competitively on average over a large selection of benchmark datasets, with strong improvements on many of them.

The behavior of BoostSelect is robust to the parameters on many datasets but depends strongly on the choice of parameters on some. As future work, we are especially interested in this behavior and its potential relation to properties of the datasets. We are also interested in improving the target vector generation step and in a complete study of the combination step of outlier detection ensembles.

===Acknowledgments===

This work was partially supported by CAPES - Brazil, Fapemig, and CNPq, and by the projects InWeb, MASWeb, EUBra-BIGSEA, INCT-Cyber, and Atmosphere.

===6. References===

[1] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In Proc. PKDD, pages 15–26, 2002.

[2] M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. SIGMOD, pages 93–104, 2000.

[3] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Min. Knowl. Disc., 30:891–927, 2016.

[4] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. In Proc. ICML, 2004.

[5] J. Gao and P.-N. Tan. Converting output scores from outlier detection algorithms into probability estimates. In Proc. ICDM, pages 212–221, 2006.

[6] V. Hautamäki, I. Kärkkäinen, and P. Fränti. Outlier detection using k-nearest neighbor graph. In Proc. ICPR, pages 430–433, 2004.

[7] W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. In Proc. PAKDD, pages 577–593, 2006.

[8] E. Kirner, E. Schubert, and A. Zimek. Good and bad neighborhood approximations for outlier detection ensembles. In Proc. SISAP, pages 173–187, 2017.

[9] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. LoOP: Local outlier probabilities. In Proc. CIKM, pages 1649–1652, 2009.

[10] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In Proc. SDM, pages 13–24, 2011.

[11] H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proc. KDD, pages 444–452, 2008.

[12] L. J. Latecki, A. Lazarevic, and D. Pokrajac. Outlier detection with kernel density functions. In Proc. MLDM, pages 61–75, 2007.
[13] A. Lazarevic and V. Kumar. Feature bagging for outlier detection. In Proc. KDD, pages 157–166, 2005.

[14] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation-based anomaly detection. ACM TKDD, 6(1):3:1–39, 2012.

[15] D. D. Margineantu and T. G. Dietterich. Pruning adaptive boosting. In Proc. ICML, pages 211–218, 1997.

[16] H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Proc. DASFAA, pages 368–383, 2010.

[17] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. SIGMOD, pages 427–438, 2000.

[18] S. Rayana and L. Akoglu. Less is more: Building selective anomaly ensembles. ACM TKDD, 10(4):42:1–42:33, 2016.

[19] S. Rayana, W. Zhong, and L. Akoglu. Sequential ensemble learning for outlier detection: A bias-variance perspective. In Proc. ICDM, pages 1167–1172, 2016.

[20] L. Rokach. Ensemble-based classifiers. Artif. Intell. Rev., 33:1–39, 2010.

[21] M. Salehi, X. Zhang, J. C. Bezdek, and C. Leckie. Smart sampling: A novel unsupervised boosting approach for outlier detection. In Proc. Australasian Joint Conference on Artificial Intelligence (AI 2016), pages 469–481, 2016.

[22] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA, 2012.

[23] E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. On evaluation of outlier rankings and outlier scores. In Proc. SDM, pages 1047–1058, 2012.

[24] E. Schubert, A. Zimek, and H.-P. Kriegel. Generalized outlier detection with flexible kernel density estimates. In Proc. SDM, pages 542–550, 2014.

[25] E. Schubert, A. Zimek, and H.-P. Kriegel. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Disc., 28(1):190–237, 2014.

[26] J. Tang, Z. Chen, A. W.-C. Fu, and D. W. Cheung. Enhancing effectiveness of outlier detections for low density patterns. In Proc. PAKDD, pages 535–548, 2002.

[27] G. Tsoumakas, I. Partalas, and I. Vlahavas. An ensemble pruning primer. In O. Okun and G. Valentini, editors, Applications of Supervised and Unsupervised Ensemble Methods, pages 1–13. Springer, 2009.

[28] K. Zhang, M. Hutter, and H. Jin. A new local distance-based outlier detection approach for scattered real-world data. In Proc. PAKDD, pages 813–822, 2009.
[29] Z.-H. Zhou, J. Wu, and W. Tang. Ensembling neural networks: Many could be better than all. Artif. Intell., 137(1-2):239–263, 2002.

[30] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.

[31] A. Zimek, R. J. G. B. Campello, and J. Sander. Ensembles for unsupervised outlier detection: Challenges and research questions. SIGKDD Explor., 15(1):11–22, 2013.

[32] A. Zimek, R. J. G. B. Campello, and J. Sander. Data perturbation for outlier detection ensembles. In Proc. SSDBM, pages 13:1–12, 2014.

[33] A. Zimek, M. Gaudet, R. J. G. B. Campello, and J. Sander. Subsampling for efficient and effective unsupervised outlier detection ensembles. In Proc. KDD, pages 428–436, 2013.