<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Instance space analysis for unsupervised outlier detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sevvandi Kandanaarachchi</string-name>
          <email>sevvandi.kandanaarachchi@monash.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario A. Muñoz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kate Smith-Miles</string-name>
          <email>smith-miles@unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Business and Economics, Monash University</institution>
          ,
          <addr-line>Clayton VIC 3800</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The success of various methods for unsupervised outlier detection depends on how well their definition of an outlier matches the dataset properties. Given that each definition, and hence each method, has strengths and weaknesses, measuring those properties could help us to predict the most suitable method for the dataset at hand. In this paper, we construct and validate a set of meta-features that measure such properties. We then conduct the first instance space analysis for unsupervised outlier detection methods based on the meta-features. The analysis provides insights into the methods' strengths and weaknesses, and facilitates the recommendation of an appropriate method with good accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2 Experimental setting</title>
      <p>To demonstrate that an outlier detection method has regions
of “good” performance within the instance space, we define
an experimental setting including: (a) a portfolio of methods,
(b) a collection of instances, and (c) a definition of good
performance based on an evaluation metric. In the next
sections, we describe the details of each component of our
setting. We note that the methodology would not change
even if the experimental setting does.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>The applications of outlier detection include taks as diverse
as identifying fraudulent credit card transactions, and
recognising terrorist plots in social media. The significance of
these applications has spurred recent research efforts,
leading to a steady growth in the number of detection methods
[1], with a variety of definitions of an outlier. As a result, it
is not uncommon for the methods to disagree on the number
and location of the outliers. It therefore becomes challenging
to identify the most suitable outlier detection method for the
problem at hand. Indeed, it is unlikely that a single method
outperforms all others in all applications [2, 3]. Therefore,
on a new problem, our focus should be on understanding the
suitability of various methods, by exploiting the experience
gained through solving similar problems.</p>
      <p>This paper extends the computational study of Campos
et al. [4] by making the first exploration of algorithm
selection for outlier detection methods. Using a
comprehensive set of meta-features, i.e., measures of dataset
properties, we are able to predict the suitability of a method for a
given problem. Moreover, we perform the first instance
space analysis [5, 6, 7], which helps us gain further
understanding of the comparative strengths and weaknesses of an
outlier detection method, more nuanced than a summary
table of statistical results can offer. As a resource for the
research community, we make available our repository of more
than 12000 datasets [8]. Moreover, we provide the R
package outselect [9], containing the code to calculate all the
meta-features, as well as the knowledge base used to train the
instance space model. Hence, outselect facilitates the
end-to-end evaluation for a new dataset.</p>
      <p>k = min(⌊N/20⌋, 100)   (2.1)
where N is the number of observations in the dataset. The
maximum of k = 100 limits the number of computations that
can result from a large dataset. The motivation for choosing
⌊N/20⌋ is as follows: choosing k = N would result in
a very large neighbourhood making some methods such as
KNN and LOF not discriminate enough between outliers and
non-outliers. On the other hand, a very small value of k such
as k = 1 may miss small clusters of outliers, as within that
small cluster the density is high and distances are low. As
such, we choose a small value of k that is not too small.</p>
      <p>2.2 Datasets Our benchmark set contains approximately
12000 datasets, most of which were generated by following
the approach of [4, 23]. First, we take one of 170 datasets
[24] which originally represented a classification problem
sourced from the UCI Machine Learning Repository. Then,
we generate a new dataset by down-sampling one of the
classes. We repeat this procedure for all of the classes,
labelling the down-sampled class as the outliers. In addition,
we also use datasets from [4, 23] in which the outlier class is
down-sampled at rates of 5% and 2%. Moreover, to guarantee
that the new datasets are fit for outlier detection
benchmarking, we carry out the following tasks:</p>
      <sec id="sec-2-1">
        <title>Down-sampling</title>
        <p>For a dataset with K classes, each class in turn is
designated the outlier class and observations belonging to
that class are down-sampled such that the outlier percentage
p ∈ {2, 5}. The down-sampling is carried out randomly
10 times to produce 10 variants for each value of p.</p>
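        <p>The down-sampling step can be sketched as follows. This is an illustrative Python helper of our own (the paper's pipeline lives in the R package outselect): it keeps all observations outside the designated outlier class, then samples just enough members of that class to reach a target fraction p of the final dataset.</p>

```python
import random

def downsample_class(X, y, outlier_class, p=0.05, seed=0):
    # Keep every observation outside `outlier_class`, then sample just
    # enough members of `outlier_class` that they form a fraction `p`
    # of the resulting dataset. Returned labels: 1 = outlier, 0 = normal.
    rng = random.Random(seed)
    normal = [x for x, lab in zip(X, y) if lab != outlier_class]
    candidates = [x for x, lab in zip(X, y) if lab == outlier_class]
    # Solve n_out / (len(normal) + n_out) = p for n_out.
    n_out = int(round(p * len(normal) / (1 - p)))
    outliers = rng.sample(candidates, min(n_out, len(candidates)))
    return normal + outliers, [0] * len(normal) + [1] * len(outliers)
```

        <p>Repeating this with 10 random seeds per class and per value of p yields the 10 variants described above.</p>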
      </sec>
      <sec id="sec-2-2">
        <title>Categorical variables</title>
        <p>From each down-sampled dataset, we create one dataset
with categorical variables removed, and one with categorical
variables converted to numerical values using the method of
inverse data frequency [4].</p>
      </sec>
      <sec id="sec-2-3">
        <title>Duplicate observations</title>
        <p>Since the distance between duplicates is zero, which in
turn leads to infinite densities, duplicate observations are
removed to avoid the numerical instability caused by
division-by-zero errors.</p>
        <p>Missing values For each attribute in each dataset, the
number of missing values is computed. Similar to [4], if an
attribute has fewer than 10% missing values, the observations
containing the missing values are removed; otherwise, the
attribute is removed.</p>
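        <p>The duplicate-removal and missing-value rules above amount to a short procedure. A sketch in Python, with our own helper name, where a missing value is represented by None:</p>

```python
def clean_dataset(rows):
    # Drop duplicate observations first: duplicates give zero distances
    # and hence infinite density estimates.
    seen, unique = set(), []
    for r in rows:
        key = tuple(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Missing-value rule: drop an attribute if >= 10% of its values are
    # missing; otherwise drop the observations with missing values.
    n = len(unique)
    n_attrs = len(unique[0])
    keep = [sum(r[j] is None for r in unique) < 0.10 * n for j in range(n_attrs)]
    cleaned = []
    for r in unique:
        vals = [v for v, k in zip(r, keep) if k]
        if all(v is not None for v in vals):
            cleaned.append(vals)
    return cleaned
```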
      </sec>
      <sec id="sec-2-4">
        <title>2.3 Evaluation metric</title>
        <p>While outlier detection is considered an unsupervised learning task, assessing algorithm
performance requires a ground truth given by the outlier labels
[4]. This makes it possible to define an evaluation metric.
Although no single universally accepted metric exists, the
most popular ones in outlier detection are: (a) the Area
Under the Receiver Operating Characteristic curve (AUC), and
(b) the Precision-Recall curve. Other metrics include
precision at n [25], average precision [26], and a combination of
positive and negative predictive values. In our study we use
AUC as the evaluation metric, and define good performance
of a method on a dataset as AUC &gt; 0.8. Alternative
definitions can be considered within the methodology, however
a detailed study on the impact of this choice is beyond the
scope of this proof-of-concept methodological paper.</p>
        <p>2.4 Meta-features Given that our outlier detection
datasets have been generated by down-sampling
classification datasets, we are able to borrow many of the features
that have been used to summarize interesting properties of
classification datasets, and then add some additional features
that are unique to the outlier detection challenge. We have
categorized the features in two main classes, with the first
being generic features which measure various aspects of
a dataset but are not particularly tailored towards outlier
detection. While they are broadly relevant, they offer little
insight into the outlier structure of a dataset.</p>
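        <p>The performance criterion of Section 2.3 is straightforward to compute from outlier scores and ground-truth labels. A minimal sketch with our own helper names, using the rank-sum identity that AUC equals the probability a randomly chosen outlier is scored above a randomly chosen inlier:</p>

```python
def auc(scores, labels):
    # labels: 1 = outlier, 0 = inlier; higher score = more outlying.
    # Ties between an outlier and an inlier count as one half.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def good_performance(scores, labels):
    # "Good performance" as defined in Section 2.3: AUC > 0.8.
    return auc(scores, labels) > 0.8
```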
        <p>The second class of features measure properties of the
outlier structure of a dataset using density, residual and
graph-based characteristics. These use the outlier labels in
their feature computation. The reason for using class labels
in feature computation is two-fold. Firstly, different outlier
methods may disagree on outliers and their rankings. As
such, we need to know how each method declares its outliers.
Are they density related, distance related or graph related
outliers? Secondly, the use of ground truth in labelling
outliers gives the opportunity to learn when each method’s
definition is suitable. Learning these relationships from
training data enables identification of the most suitable outlier
method for a given problem. This is common in industry
applications where training data contains labelled outliers,
and the task is to find effective methods which extract these
outliers for future unlabelled data [27, 28].</p>
        <p>We compute a broad set of candidate features, and later
select a subset for the instance space analysis. Due to space
constraints, we provide a brief overview of these features
here. More details of these features are given in our github
repository [9]. Table 1 summarizes the features by category.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Simple features</title>
        <p>These are related to the basic structure of a dataset,
e.g. the number of observations.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Statistical features</title>
        <p>These relate to statistical properties of a dataset,
such as skewness and kurtosis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Information theoretic features</title>
      <p>These measure the amount of information present in a
dataset, e.g. entropy of a dataset.</p>
      <p>Outlier features In order to make our feature set richer we
include density-based, residual-based and graph-based
features. We compute these features for different
subsets of the dataset, namely outliers, non-outliers and
proxy-outliers, with the last subset being data points
that are either far away from “normal” data, reside in
low density regions, or have different graph properties
compared to the rest. We define proxy-outliers using
distance, density, residual and graph based perspectives.</p>
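      <p>One way to quantify how well a proxy-outlier definition matches the labelled outliers is a simple set-overlap measure. This Jaccard-style index is our own illustration, not one of the paper's features:</p>

```python
def proxy_outlier_overlap(proxy_idx, outlier_idx):
    # Fraction of agreement between proxy-outliers (defined without
    # labels) and the labelled outliers. High overlap suggests methods
    # built on that notion of outlyingness should do well on the dataset.
    p, o = set(proxy_idx), set(outlier_idx)
    return len(p & o) / len(p | o) if (p | o) else 0.0
```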
      <p>If there is a significant overlap between proxy-outliers
and actual outliers, then we expect certain outlier detection
methods to perform well on such datasets. Proxy-outlier
based features fall into the category of landmarking, which
has been popular in meta-learning studies [29, 30, 31].</p>
      <p>1. Density based features are computed either using the
density based clustering algorithm DBSCAN [32] or kernel
density estimates.</p>
      <p>2. Residual based features are computed by fitting a
series of linear models and obtaining residuals.</p>
      <p>3. Graph based features are based on graph theoretic
measures such as vertex degree, shortest path and connected
components. The dataset is converted to a directed graph
using the software [33] to facilitate this feature computation.</p>
      <p>3 Predicting outlier detection method performance from
meta-features Using a set of 178 candidate features, we
predict suitable outlier detection methods for given datasets.
For this purpose, we train 12 Random Forest classifiers, one
for each outlier method, with the 178 features as inputs and a
binary indicator of good performance as the output. Table 2
provides the average cross-validation accuracy over 10 folds
for each outlier detection method. As most outlier methods
are only good for a small number of datasets, a naive
prediction that performance is not good for all datasets achieves
the default accuracy. Since our Random Forest accuracy is
greater than the default accuracy for all methods, the
meta-features are clearly able to help distinguish which methods
are well suited for some datasets, even when these are a
minority.</p>
      <p>4 Instance space construction and analysis Having
validated that the features contain sufficient information to be
predictive of outlier detection method performance, we now
use the features to construct an instance space to visualise
the relationships between dataset (instance) features and
strengths and weaknesses of methods. Our approach to
constructing the instance space is based on the most recent
implementation of the methodology [24].</p>
      <p>Critical to both algorithm performance prediction and
instance space construction is the selection of features that
are both distinctive (i.e. uncorrelated to other features) and
predictive (i.e. correlated with algorithm performance).
Therefore, we follow a procedure to select a small subset of
features that best meet these requirements, using a subset of
2053 instances for which there are three or fewer well
performing algorithms. This choice is somewhat arbitrary,
motivated by the fact that we are less interested in
datasets where many algorithms perform well or poorly, given
our aim is to understand unique strengths and weaknesses
of outlier detection methods.</p>
      <p>We pre-process the data such that it becomes amenable
to machine learning and dimensionality projection methods.
Given that some features are ratios, one feature can often
produce excessively large or infinite values. Hence, we
bound all the features between their median plus or minus
five times their interquartile range. Any not-a-number value
is converted to zero, while all infinite values are set equal to
the bounds. Next, we apply a Box-Cox transformation to
each feature to stabilise the variance and normalise the data.
Finally, we apply a z-transform to standardise each feature.</p>
      <p>Starting with a set of 178 features, we determine whether
a feature has unique values for most instances. A feature
provides little information if it produces the same value for
most datasets. Hence, we discard those that have less than
30% unique values, leaving 135 features. Then, we check
the correlation between the features and AUC. Sorting the
absolute value of the correlation from highest to lowest, we
select the top three features per algorithm. This results in
26 features, as some of them are correlated to more than
one algorithm and we do not need replicates. Next, we
identify groups of similar features, using a k-means clustering
algorithm and a dissimilarity measure of 1 − |ρ|, where ρ
is the correlation between two features. As a result, we obtain
seven clusters. Retaining one feature from each cluster, there
are 2352 possible feature combinations. To determine the
best one for the dimensionality reduction, we project each
candidate feature subset into a 2-d space using PCA with
two components. Then, we fit a random forest model per
algorithm to classify the instances in the trial instance space
into groups of good and bad performance, as defined in
Section 2.3. That is, we fit 12 models for each feature
combination, and define the best combination as the one
producing the lowest average out-of-bag error across all models.</p>
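      <p>The bounding and standardisation steps described above can be sketched per feature as follows. This is an illustration with our own helper name: the Box-Cox step is omitted for brevity, and quartiles are approximated by sorted-order positions.</p>

```python
import math
import statistics

def preprocess_feature(values):
    # Bound each feature to median +/- 5 * IQR; NaNs become zero,
    # infinities collapse to the bounds, then z-standardise.
    finite = sorted(v for v in values if not (math.isnan(v) or math.isinf(v)))
    med = statistics.median(finite)
    q1, q3 = finite[len(finite) // 4], finite[(3 * len(finite)) // 4]
    lo, hi = med - 5 * (q3 - q1), med + 5 * (q3 - q1)
    clipped = [min(max(0.0 if math.isnan(v) else v, lo), hi) for v in values]
    mu, sd = statistics.mean(clipped), statistics.pstdev(clipped) or 1.0
    return [(v - mu) / sd for v in clipped]
```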
      <p>This process results in 7 features finally being selected.</p>
      <p>We then use the Prediction Based Linear Dimensionality
Reduction (PBLDR) method [24] to find a projection from
7-d to 2-d that creates the most linear trends of algorithm
performance and feature values across the instance space,
to assist visualisation of directions of hardness and feature
correlations. Given PBLDR’s stochasticity, we calculate
30 different projections and select the one with the highest
topological preservation, defined as the correlation between
high- and low-dimensional distances. The final projection
matrix is defined by Equation 4.2 to represent each dataset
as a 2-d vector Z depending on its 7-d feature vector.
(4.2)   Z = Aᵀ f, where

A = [ 0.0862  0.2078
      0.1737  0.1845
      0.0460  0.2847
      0.0938  0.2025
      0.1202  0.0378
      0.1854  0.0822
      0.3543  0.1325 ]

and f = [ Mean Entropy Attr, IQR TO SD 95, OPO Res ResOut Median,
OPO Res Out Mean, OPO Den Out 95P, OPO GDeg PO Mean,
OPO GDeg Out Mean ]ᵀ.
Mean Entropy Attr is the average entropy of the attributes
of the dataset, and corresponds to a measurement of
the information contained by the attribute. For
example, if an attribute is constant for all observations, then
the entropy is zero. Thus, high values of entropy
indicate highly unpredictable fluctuations. If the outliers
are causing high values of entropy, then it is likely that
these points are different from the rest, resulting in
multiple outlier methods performing well on these datasets,
as seen in Figure 2.</p>
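      <p>Applying the projection of Equation (4.2) is a single matrix-vector product. The sketch below is our own code, using the coefficients as printed above and assuming the feature vector is already in the transformed, standardised scale of Section 4:</p>

```python
FEATURES = ["Mean Entropy Attr", "IQR TO SD 95", "OPO Res ResOut Median",
            "OPO Res Out Mean", "OPO Den Out 95P", "OPO GDeg PO Mean",
            "OPO GDeg Out Mean"]

# The two columns of the 7 x 2 matrix in Equation (4.2), transposed to 2 x 7.
W = [
    [0.0862, 0.1737, 0.0460, 0.0938, 0.1202, 0.1854, 0.3543],
    [0.2078, 0.1845, 0.2847, 0.2025, 0.0378, 0.0822, 0.1325],
]

def project(feature_vector):
    # Map a 7-d feature vector to 2-d instance-space coordinates Z.
    assert len(feature_vector) == len(FEATURES)
    return [sum(w * f for w, f in zip(row, feature_vector)) for row in W]
```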
      <p>IQR TO SD 95 is the 95th percentile of the IQR to standard
deviation ratio of each attribute of a dataset. It captures
the presence of outliers by means of comparing two</p>
      <p>measures of spread, of which one is affected by outliers
and another is robust. As this feature is first computed
per attribute before taking the 95th percentile, data
points that stand out in a single dimension give a small
IQR to SD ratio for that dimension. Thus, if the 95th
percentile of the IQR to SD ratio is small, then that
implies there are data points that stand out in different
dimensions. If these data points are indeed outliers, it
should be easy for distance based methods to find them.</p>
      <p>We expect both KNN and KNNW to perform well on
datasets with small values of this feature.</p>
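      <p>A sketch of this feature with our own helper name; quartiles and the 95th percentile are approximated by sorted-order positions:</p>

```python
import statistics

def iqr_to_sd_95(columns):
    # Per attribute: interquartile range divided by standard deviation.
    # The feature is the 95th percentile of these per-attribute ratios;
    # attributes containing extreme points yield small ratios.
    ratios = []
    for col in columns:
        s = sorted(col)
        q1, q3 = s[len(s) // 4], s[(3 * len(s)) // 4]
        sd = statistics.pstdev(col)
        if sd > 0:
            ratios.append((q3 - q1) / sd)
    ratios.sort()
    return ratios[min(len(ratios) - 1, int(0.95 * len(ratios)))]
```

      <p>For an attribute with one extreme point, the standard deviation inflates while the IQR barely moves, so the ratio, and hence the feature, is small; this is the regime where KNN and KNNW are expected to do well.</p>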
      <p>OPO Res ResOut Median is the ratio between the median
of residuals of proxy-outliers to the median of residuals
of non-proxy-outliers. It is calculated by randomly
choosing a subset of non-binary attributes from the
dataset, s. Then, from each attribute a in s, we
fit a linear model with a as the dependent variable
and others as the independent variables. Next, for
each model, we compute the residuals r. Define the
residual based proxy-outliers ω as those with the top
3% absolute residual values. Finally, we compute the ratio
median(r[ω]) / median(r[ω̄]), where ω̄ denotes the
non-proxy-outliers. The average of this ratio across all models
is the feature value.</p>
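      <p>For a single fitted model, the ratio above can be sketched as follows (our own helper; the paper averages this quantity over several models fitted on random attribute subsets):</p>

```python
import statistics

def residual_proxy_ratio(residuals, top_frac=0.03):
    # Proxy-outliers are the observations with the top 3% absolute
    # residuals; the ratio compares their median residual magnitude
    # with that of the remaining observations.
    abs_r = sorted((abs(r) for r in residuals), reverse=True)
    k = max(1, int(round(top_frac * len(abs_r))))
    proxy, rest = abs_r[:k], abs_r[k:]
    return statistics.median(proxy) / statistics.median(rest)
```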
      <sec id="sec-3-1">
        <title>OPO Res Out Mean</title>
        <p>OPO Res Out Mean is the ratio between the mean of
residuals of outliers to the mean of residuals of non-outliers.
It is calculated following a similar approach to that of
OPO Res ResOut Median, with the exception that ω denotes
actual outliers instead of proxy-outliers, and the mean
instead of the median is used.</p>
        <p>We expect KNN and KNNW to perform well on
datasets with high values of OPO Res ResOut Median
and OPO Res Out Mean.</p>
        <p>OPO Den Out 95P is the ratio of the 95th percentile of
density estimates of non-outliers to the 95th percentile of
density estimates of outliers. It is calculated by performing
PCA on the dataset, omitting class information. Then,
considering two consecutive components at a time and up to the
10th component, we build nine 2-d spaces. For each one of
these spaces, we compute kernel density estimates x.</p>
        <p>Then, we compute y1 = x[ω̄] and y2 = x[ω], where ω
is the set of outliers and ω̄ denotes non-outliers. Let q1 be
the 95th percentile of y1 and q2 be the 95th percentile of y2.
Let q = q1/q2, whose average across all nine 2-d spaces is
the feature value. If outliers reside in less dense regions,
this gives rise to high values of this feature, enabling
certain density based methods to perform well on these
datasets. As such, we expect LDF to perform well on
datasets with high values of this feature.</p>
        <p>OPO GDeg PO Mean is a ratio of the mean degree of
inner points to the mean degree of non-inner points. It is
calculated by generating a graph from the dataset using
distances between points [33]. This graph would have v1
and its nearest neighbour v2 as vertices connected with
an edge. Then, we compute the degree d of each vertex.
Let ω be the set of inner points, which are defined as the
vertices with the highest 3% of degree values. Then the
feature value is equal to mean(d[ω])/mean(d[ω̄]), where
ω̄ denotes the non-inner points.</p>
        <p>OPO GDeg Out Mean is similar to OPO GDeg PO
Mean with the exception of using outliers instead of inner
points. These two features compute a ratio of vertex degree
between different groups: outliers and non-outliers, and
inner points and non-inner points. As outliers are generally
isolated, they have low vertex-degree values. Thus, these
outliers make smaller angles with other data points, making
it easier for angle based methods such as FAST ABOD to
find them. Therefore we expect FAST ABOD to perform
well on datasets with low values of OPO GDeg Out Mean,
even though we do not perform any angle based computation.
Isolated outliers also have low graph degree, as their
connections to other points are few in number, making
OPO GDeg Out Mean correlate with KNN and KNNW
performance.</p>
        <p>4.1 Footprint analysis of algorithm strengths and
weaknesses We define a footprint as an area of the instance space
where an algorithm is expected to perform well. To construct
a footprint, we follow the approach first introduced in [34]:
(a) we measure the length of the edges between all instances,
and remove those edges with a length lower than a threshold;
(b) we calculate a Delaunay triangulation within the convex
hull formed by the remaining edges; (c) we create a concave
hull, by removing any triangle with edges larger than another
threshold; (d) we calculate the density and purity of each
triangle in the concave hull; and (e) we discard any triangle
that does not fulfil the density and purity thresholds. The
lower and upper distance thresholds are set to 1% and 25%
of the maximum distance respectively. The density threshold
is set to 10, and the purity threshold is set to 75%. We then
remove any contradictions that could appear when two
conclusions could be drawn from the same section of the
instance space due to overlapping footprints, e.g., when
comparing two algorithms. This is achieved by comparing the
area lost by the overlapping footprints when the contradicting
sections are removed. The algorithm that would lose the
largest area in a contradiction gets to keep it, as long as it
maintains the density and purity thresholds.</p>
        <p>Table 3 presents the results from the analysis. The best
algorithm is the one such that AUC is the highest for the
given instance. The results are a percentage of the area
(6.6703) and density (305.6825) of the convex hull that
encloses all instances. Notably, the results signify a unique
strength of KDEOS, as it performs well on instances where
other methods do not.</p>
        <p>4.2 Partitioning the instance space We obtain a
partitioning of the instance space to expose regions of strength
for each method. We use support vector machines (SVM)
for this partitioning. Of the 12 outlier methods we consider
only FAST ABOD, KDEOS, KNN and KNNW, as these
methods have larger footprints that span a contiguous region
of the instance space. For these outlier methods, we use our
definition of good performance as the output and the instance
space coordinates as the input for the SVM. We train 4
SVMs, one for each of these selected outlier methods. The
prediction accuracies using 5-fold cross validation, along
with the percentage of instances in the majority class, are
given in Table 4, showing better than default accuracy for
all methods except KDEOS.</p>
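        <p>The per-triangle test used when constructing footprints in Section 4.1 reduces to a simple rule; a sketch with our own helper name and the parameter values quoted there:</p>

```python
def keep_triangle(n_instances, n_good, area,
                  density_threshold=10, purity_threshold=0.75):
    # A triangle of the concave hull survives only if its instance
    # density (instances per unit area) and purity (fraction of "good"
    # instances) both meet the thresholds (10 and 75% in the paper).
    if area <= 0 or n_instances == 0:
        return False
    density = n_instances / area
    purity = n_good / n_instances
    return density >= density_threshold and purity >= purity_threshold
```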
        <p>Figure 3 shows the dominant regions of algorithms
FAST ABOD, KNN, KNNW and KDEOS. We can see that
FAST ABOD is stronger on the left side of the instance
space, with a few sparse instances lighting up on the extreme
right, whereas KDEOS is stronger on the top region of
the instance space. KNN and KNNW have overlapping
regions of strength, with KNNW sharing some instances
with KDEOS at the top. However, KDEOS is the stronger
algorithm in the top region of the instance space.</p>
        <p>We combine these SVM predictions of regions of
strength to obtain a final partitioning of the instance space.
Regions of overlap mean that some instances have multiple
preferred outlier methods. For such instances, we break</p>
        <p>5 Conclusion We have investigated the algorithm
selection problem for unsupervised outlier detection. Given
measurable features of a dataset, we find the most suitable
outlier method with reasonable accuracy. This is important
because each method has its strengths and weaknesses and
no single method outperforms all others for all instances.
We have explored the strengths and weaknesses of outlier
methods by analysing their footprints in the constructed
instance space. Moreover, we have identified different regions
of the instance space that reveal the relative strengths of
different outlier detection methods. Our work clearly
demonstrates, for example, that KDEOS, which gives poor
performance on average, has a region of strength in the
instance space where no other algorithm excels.</p>
        <p>In addition to these contributions, we hope to have laid
some important foundations for future research into new and
improved outlier detection methods, in the following ways:
(a) rigorous evaluation of new methods using the
comprehensive corpus of over 12000 datasets with diverse
characteristics we have made available; (b) using the instance space,
the strengths and weaknesses of new outlier methods can be
identified, and their uniqueness compared to existing
methods; (c) using the R package outselect, the outlier methods
suitable for a new dataset can be identified, and the new
instance can be plotted on the instance space. Equally
valuable, the instance space analysis can also reveal if a new
outlier method is similar to existing outlier methods, or offers
a unique contribution to the available repertoire of techniques.</p>
        <p>As a word of caution, we note that our current instance
space is computed using our collection of datasets, outlier
methods and features. Thus, we do not claim to have
constructed the definitive instance space for all unsupervised
outlier detection methods. Hence, the selected features for
the instance space may change with the expansion of the
corpus of datasets and outlier methods. Future research paths
include the expansion of the instance space by generating
new and diverse instances to fill the space, and considering
other classes of outlier detection methods, such as subspace,
clustering and PCA approaches. In addition, we have
transformed non-numeric data to numeric in this study; further
avenues of research include incorporating datasets with
non-numeric attributes in the instance space. To aid this
expansion and future research, we make all of our data and
implementation scripts available on our website.</p>
        <p>Broadening the scope of this work, we have been
adapting the instance space methodology to other problems
besides outlier detection, for example, machine learning [24]
and time series forecasting [35]. Part of this larger project
is to build freely accessible web-tools that carry out the
instance space analysis automatically, including testing of
algorithms on new instances. Such tools will be available at
matilda.unimelb.edu.au in the near future.</p>
        <p>Acknowledgements Funding was provided by the
Australian Research Council through grants FL140100012 and
LP160101885. This research was also supported by the
Monash eResearch Centre and eSolutions-Research Support
Services through the MonARCH HPC Cluster.</p>
        <p>References
[1] A. Zimek, E. Schubert, and H.-P. Kriegel, “A survey on
unsupervised outlier detection in high-dimensional numerical
data,” Statistical Analysis and Data Mining: The ASA Data
Science Journal, vol. 5, no. 5, pp. 363–387, 2012.
[2] D. H. Wolpert, W. G. Macready et al., “No free lunch
theorems for search,” Technical Report SFI-TR-95-02-010,
Santa Fe Institute, 1995.
[3] D. H. Wolpert and W. G. Macready, “No free lunch theorems
for optimization,” IEEE Transactions on Evolutionary
Computation, vol. 1, no. 1, pp. 67–82, 1997.
[4] G. O. Campos, A. Zimek, J. Sander, R. J. Campello,
B. Micenková, E. Schubert, I. Assent, and M. E. Houle, “On
the evaluation of unsupervised outlier detection: measures,
datasets, and an empirical study,” Data Mining and Knowledge
Discovery, vol. 30, no. 4, pp. 891–927, 2016.
[5] K. Smith-Miles, D. Baatar, B. Wreford, and R. Lewis,
“Towards objective measures of algorithm performance across
instance space,” Computers &amp; Operations Research, vol. 45,
pp. 12–24, 2014.
[6] K. Smith-Miles and S. Bowly, “Generating new test instances
by evolving in instance space,” Computers &amp; Operations
Research, vol. 63, pp. 102–113, 2015.
[7] K. Smith-Miles and T. T. Tan, “Measuring algorithm footprints
in instance space,” in Evolutionary Computation (CEC), 2012
IEEE Congress on. IEEE, 2012, pp. 1–8.
[8] S. Kandanaarachchi, M. Munoz Acosta, K. Smith-Miles, and
R. Hyndman, “Datasets for outlier detection,” Feb 2019.
[Online]. Available: https://monash.figshare.com/articles/
Datasets 12338 zip/7705127/4
[9] S. Kandanaarachchi, outselect: Algorithm selection for
unsupervised outlier detection, 2018, R package version
0.0.0.9000. [Online]. Available: https://github.com/sevvandi/
outselect
[10] E. Achtert, H.-P. Kriegel, and A. Zimek, “ELKI: a software
system for evaluation of subspace clustering algorithms,” in
International Conference on Scientific and Statistical Database
Management. Springer, 2008, pp. 580–585.
[11] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient
algorithms for mining outliers from large data sets,” in ACM
Sigmod Record, vol. 29, no. 2. ACM, 2000, pp. 427–438.
[12] F. Angiulli and C. Pizzuti, “Fast outlier detection in high
dimensional spaces,” in European Conference on Principles
of Data Mining and Knowledge Discovery. Springer, 2002,
pp. 15–27.
[13] V. Hautamaki, I. Karkkainen, and P. Franti, “Outlier
detection using k-nearest neighbour graph,” in Pattern Recognition,
[20] L. J. Latecki, A. Lazarevic, and D. Pokrajac, “Outlier
detection with kernel density functions,” in International Workshop
on Machine Learning and Data Mining in Pattern Recognition.
Springer, 2007, pp. 61–75.
[21] E. Schubert, A. Zimek, and H.-P. Kriegel, “Generalized
outlier detection with flexible kernel density estimates,” in
Proceedings of the 2014 SIAM International Conference on
Data Mining. SIAM, 2014, pp. 542–550.
[22] H.-P. Kriegel, A. Zimek et al., “Angle-based outlier detection
in high-dimensional data,” in Proceedings of the 14th ACM
SIGKDD international conference on Knowledge discovery
and data mining. ACM, 2008, pp. 444–452.
[23] M. Goldstein and S. Uchida, “A comparative evaluation of
unsupervised anomaly detection algorithms for multivariate
data,” PloS one, vol. 11, no. 4, p. e0152173, 2016.
[24] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles,
“Instance spaces for machine learning classification,” Machine
Learning, vol. 107, no. 1, pp. 109–147, 2018.
[25] N. Craswell, “Precision at n,” in Encyclopedia of Database
Systems. Springer, 2009, pp. 2127–2128.
[26] E. Zhang and Y. Zhang, “Average precision,” in Encyclopedia
of Database Systems. Springer, 2009, pp. 192–193.
[27] P. Talagala, R. Hyndman, K. Smith-Miles,
S. Kandanaarachchi, M. Muñoz et al., “Anomaly detection in
streaming nonstationary temporal data,” Journal of Computational
and Graphical Statistics, 2019, accepted.
[28] C. Leigh, O. Alsibai, R. J. Hyndman, S. Kandanaarachchi,
O. C. King, J. M. McGree, C. Neelamraju, J. Strauss, P. D.
Talagala, R. D. Turner et al., “A framework for automated
anomaly detection in high frequency water-quality data from
in situ sensors,” Science of The Total Environment, 2019.
[29] B. Pfahringer, H. Bensusan, and C. G. Giraud-Carrier,
“Meta2004. ICPR 2004. Proceedings of the 17th International Con- learning by landmarking various learning algorithms.” in
ference on, vol. 3. IEEE, 2004, pp. 430–433. ICML, 2000, pp. 743–750.
[14] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, [30] Y. Peng, P. A. Flach, C. Soares, and P. Brazdil, “Improved
“Lof: identifying density-based local outliers,” in ACM sigmod dataset characterisation for meta-learning,” in International
record, vol. 29, no. 2. ACM, 2000, pp. 93–104. Conference on Discovery Science. Springer, 2002, pp. 141–
[15] E. Schubert, A. Zimek, and H.-P. Kriegel, “Local outlier 152.</p>
        <p>detection reconsidered: a generalized view on locality with [31] K. A. Smith-Miles, “Cross-disciplinary perspectives on
metaapplications to spatial, video, and network outlier detection,” learning for algorithm selection,” ACM Computing Surveys
Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. (CSUR), vol. 41, no. 1, p. 6, 2009.</p>
        <p>190–237, 2014. [32] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A
density[16] J. Tang, Z. Chen, A. W.-C. Fu, and D. W. Cheung, “Enhancing based algorithm for discovering clusters in large spatial
effectiveness of outlier detections for low density patterns,” in databases with noise.” in Kdd, vol. 96, no. 34, 1996, pp. 226–
Pacific-Asia Conference on Knowledge Discovery and Data 231.</p>
        <p>Mining. Springer, 2002, pp. 535–548. [33] G. Csardi and T. Nepusz, “The igraph software package for
[17] W. Jin, A. K. Tung, J. Han, and W. Wang, “Ranking out- complex network research,” InterJournal, Complex Systems,
liers using symmetric neighborhood relationship,” in Pacific- vol. 1695, no. 5, pp. 1–9, 2006.</p>
        <p>Asia Conference on Knowledge Discovery and Data Mining. [34] M. Mun˜oz and K. Smith-Miles, “Performance analysis of
Springer, 2006, pp. 577–593. continuous black-box optimization algorithms via footprints
[18] H.-P. Kriegel, P. Kro¨ger, E. Schubert, and A. Zimek, in instance space,” Evol. Comput., vol. 25, no. 4, pp. 529–554,
“Loop: local outlier probabilities,” in Proceedings of the 18th 2017.</p>
        <p>ACM conference on Information and knowledge management. [35] Y. Kang, R. Hyndman, and K. Smith-Miles, “Visualising
ACM, 2009, pp. 1649–1652. forecasting algorithm performance using time series instance
[19] K. Zhang, M. Hutter, and H. Jin, “A new local distance-based spaces,” Int. J. Forecast, vol. 33, no. 2, pp. 345–358, 2017.
outlier detection approach for scattered real-world data,” in
Pacific-Asia Conference on Knowledge Discovery and Data</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>