Similarity measures of algorithm performance for cost-sensitive scenarios

Carlos Eduardo Castor de Melo1 and Ricardo Bastos Cavalcante Prudêncio2

1 Centro de Informática, Universidade Federal de Pernambuco, Brasil, email: cecm2@cin.ufpe.br
2 Centro de Informática, Universidade Federal de Pernambuco, Brasil, email: rbcp@cin.ufpe.br

Abstract. Knowledge about algorithm similarity is an important feature of meta-learning, where information gathered from previous learning tasks can be used to guide the selection or combination of algorithms for a new dataset. Usually this task is done by comparing global performance measures across different datasets or, alternatively, by comparing the performance of algorithms at the instance level. These previous similarity measures do not consider misclassification costs, and hence they neglect important information that can be exploited in different learning contexts. In this paper we present algorithm similarity measures that deal with cost proportions and different threshold choice methods for building crisp classifiers from learned models. Experiments were performed in a meta-learning study with 50 different learning tasks. The similarity measures were adopted to cluster algorithms according to their aggregated performance on the learning tasks. The clustering process revealed similarity between algorithms under different perspectives.

1 Introduction

Meta-learning is a framework developed in supervised machine learning for acquiring knowledge from empirical case studies and relating features of learning problems to algorithm performance [2]. The knowledge acquired in meta-learning has been used to support algorithm selection and combination in different contexts [13][11]. In meta-learning, it is assumed that similar problems present similar algorithm performance. Hence, it is important for meta-learning to know how similar the performance obtained by different algorithms is. This information can be used to discover similarities between algorithms and between datasets [9] and also to support the prediction of suitable algorithms for a given dataset based on the algorithms' performance on past datasets [11].

Deploying global metrics, such as accuracy and AUC, to compare the performance of algorithms on a dataset has been the most common approach to deal with the above issue. Despite its popularity, this approach may lose important knowledge about algorithm similarity, since it is based on average algorithm performance without considering differences of algorithm behaviour at the instance level. In order to overcome this limitation, alternative measures of algorithm similarity (e.g., based on error correlation [9] or classifier output difference [10]) have been adopted. Here, algorithms are considered similar if they produce the same classifications, or alternatively the same mistakes, on the same instances.

The algorithm similarity measures adopted in these previous works are limited since they are not flexible enough to consider different misclassification costs and different strategies to build classifiers. Algorithms usually produce models (scoring functions) that return scores or class probabilities for the input examples. A classifier is then built from a model by adopting a decision threshold. The prediction for a given example depends on both the score returned by the learned model for that example and the decision threshold. The most common approach to produce classifiers is to adopt a fixed decision threshold (e.g., 0.5). Alternatively, decision thresholds can be chosen according to the operating conditions or contexts (e.g., class frequencies and misclassification costs) observed when the learned model is deployed. In [7], the authors showed that different threshold choice methods require the use of different performance measures to evaluate a learned model. Similarly, concerning algorithm similarity, specific measures have to be defined when misclassification costs and adaptive threshold choice methods are taken into account. Previous work on algorithm similarity does not consider such aspects: the existing similarity measures implicitly assume fixed decision thresholds and thus are only suitable for comparing the performance of crisp classifiers.
Based on the above motivation, in our work we developed similarity measures for algorithm performance that take into account cost proportions and different threshold choice methods. As in [13], we adopted an instance-level strategy. For that, we initially proposed instance hardness measures to indicate how difficult an instance is to be correctly classified by a model. Specific instance hardness measures were proposed for three threshold choice methods: the score-fixed [7], the score-driven [6] and the rate-driven [8] methods. By assuming a threshold choice method and a learned model, we built for each instance a corresponding cost curve, which indicates the model loss for that instance along different misclassification costs. The instance hardness is defined as the area under the cost curve. The dissimilarity3 between two algorithms for a given dataset is defined as the average absolute difference between the hardness values over all instances of the dataset.

3 In this paper we treat dissimilarity as the complement of similarity; henceforth we will use only the latter term.

In our work, we applied the proposed measures in a meta-learning study in which 50 datasets and 8 learned models were adopted. Clusters of models were produced by adopting the score-fixed, score-driven and rate-driven measures. Each cluster reveals which algorithms produced similar learned models on the 50 datasets under each cost-sensitive perspective. We observed that the algorithm behaviour among the datasets was quite distinct for each measure. By adopting the score-fixed method, two algorithms were similar if they returned the same classification for a high number of instances using a given threshold. By adopting the score-driven method, algorithms were considered similar if they produced similar scores for the instances. By adopting the rate-driven method, in turn, algorithms were considered similar if they produced similar rankings of instances, regardless of how well calibrated their scores are.

This paper is organized as follows. Applications of algorithm similarity measures are presented in Section 2, with a review of related work. Section 3 introduces the notation and basic definitions used throughout this paper. The concepts of instance hardness and threshold choice methods are explained in Section 4, where three threshold choice methods are presented: score-fixed, score-driven and rate-driven. The proposed method for measuring the similarity of algorithm performance is presented in Section 5. Section 6 presents the experimental methodology. An analysis of the similarity measures over a single dataset is provided in Section 7, followed by Section 8, which presents the aggregated similarity across a group of datasets. Finally, Section 9 concludes the paper with a discussion of the results and insights for future work.
2 Measuring algorithm similarity

Meta-learning deals with methods to exploit knowledge gathered from learning on different tasks [2]. Differently from base-level learning, which focuses on acquiring experience on a single task, meta-learning acquires knowledge from a meta-data set produced when a pool of algorithms is applied on distinct learning tasks. Given a learning task, a meta-example usually stores characteristics of the task and the performance obtained by a pool of candidate algorithms. Meta-learning can be applied: (1) in a descriptive sense, aiming to discover similarities between datasets and algorithms, and (2) in a predictive sense, to select candidate algorithms for new datasets based on the knowledge acquired in the meta-learning process.

Different meta-learning approaches rely on the use of similarity measures of algorithm performance. For instance, in [9] the authors clustered a pool of learning algorithms based on the patterns of the errors observed in different datasets. Clustering of algorithms based on their performance was also performed more recently in [10]. Clustering algorithms may provide useful information to guide the processes of algorithm selection and combination [9]. Another line of research in meta-learning is the landmarking approach [11], which uses the performance of simpler algorithms (called landmarkers) to guide the selection of more complex ones. In this approach, datasets are characterized based on the performance obtained by the landmarkers (which are usually simple and diverse algorithms, such as decision stumps and linear models, or simplified versions of the candidate algorithms). One method to handle this approach is measuring the similarity of a new dataset to the meta-examples retrieved from the meta-data, based on the similarity of the performance obtained by the landmarkers [5]. The best candidate algorithms adopted in the retrieved problems are then suggested for the new dataset.

The most common approach for measuring algorithm similarity is the use of global performance measures, like accuracy or the AUC measure, estimated from an empirical evaluation process (such as cross-validation). Similarity between algorithms can be obtained by computing the differences or ratios between the performance measures, as performed in [5]. Although this approach is widely applied, it is strongly criticized since it may lose important information about algorithm behaviour and may not characterize similarity properly. For example, if a dataset has 100 instances, one algorithm misclassifies 20 instances and another one misclassifies 20 instances completely different from the first ones, the accuracy of both algorithms will be the same but their behaviour is quite different.

A more fine-grained approach to algorithm similarity is to consider the performance at each instance of the dataset. This approach has the advantage of showing local discrepancies of performance among the space of instances. In [9], error correlation is adopted as the similarity measure between two algorithms, in such a way that two algorithms are similar if they produce the same errors on the same instances. In [10], the authors present the Classifier Output Difference as an alternative measure for this approach. This metric is defined as the probability of two distinct classifiers making different predictions [15].

Although this approach represents a more refined view of algorithm performance, the measures proposed in the literature are computed from predictions generated by crisp classifiers (i.e., with fixed decision thresholds chosen a priori). However, as stated in Section 1, in the context of misclassification costs, decision thresholds can be adaptively defined to minimize the loss of a learned model. Hence, two instances considered equally easy by a classifier with a fixed threshold can be rather difficult under different decision thresholds and costs. In this work, we derived similarity measures for algorithm performance in cost-sensitive scenarios by considering different methods to choose the decision thresholds of classifiers based on the knowledge about misclassification costs.

3 Notation and Basic Definitions

The basic definitions adopted in our work are mostly based on [7]. Instances can be classified into one of the classes Y = {0, 1}, in which 0 is the positive class and 1 is the negative class. A learned model m is a scoring function that receives an instance x as input and returns a score s = m(x) that indicates the chance of a negative class prediction. A model can be transformed into a crisp classifier by assuming a decision threshold t, in such a way that if s ≤ t then x is classified as positive, and it is classified as negative if s > t.

The errors of a classifier are associated to costs related to the classes. The cost of a false negative is represented as c_0 and the cost of a false positive, in turn, is represented as c_1. As in [7], we normalize the costs by setting c_0 + c_1 = b and adopt the cost proportion c = c_0/b to represent the operating condition faced by a model at deployment. For simplicity, we adopted b = 2 and hence c ∈ [0, 1], c_0 = 2c and c_1 = 2(1 − c).

The loss function produced assuming a decision threshold t and a cost proportion c is defined as:

Q(t, c) = c_0 \pi_0 FN(t) + c_1 \pi_1 FP(t) = 2\{c \, \pi_0 FN(t) + (1 - c) \, \pi_1 FP(t)\}    (1)

In the above equation, FN(t) and FP(t) are respectively the False Negative and False Positive rates produced by a model when a threshold t is adopted. The variables π_0 and π_1 represent the proportions of positive and negative examples.

The Positive Rate R(t) is the proportion of instances predicted as positive at the decision threshold t and can be expressed as π_0(1 − FN(t)) + π_1 FP(t).
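To make Equation 1 concrete, the sketch below (not the authors' code; function and variable names are ours) estimates FN(t), FP(t), R(t) and the loss Q(t, c) from a vector of scores and true labels, following the convention above that class 0 is positive and an instance is predicted positive when its score s ≤ t.

```python
import numpy as np

def rates(scores, labels, t):
    """FN(t), FP(t) and the positive rate R(t) at threshold t.

    Convention from the paper: class 0 is positive, class 1 is negative,
    and an instance is predicted positive when its score s <= t.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = labels == 0, labels == 1
    pred_pos = scores <= t
    fn = np.mean(~pred_pos[pos])       # positives predicted as negative
    fp = np.mean(pred_pos[neg])        # negatives predicted as positive
    pi0, pi1 = pos.mean(), neg.mean()  # class proportions
    r = pi0 * (1 - fn) + pi1 * fp      # positive prediction rate R(t)
    return fn, fp, r

def loss(scores, labels, t, c):
    """Q(t, c) of Equation 1 with b = 2, i.e. c0 = 2c and c1 = 2(1 - c)."""
    fn, fp, _ = rates(scores, labels, t)
    pi0 = np.mean(np.asarray(labels) == 0)
    pi1 = 1 - pi0
    return 2 * (c * pi0 * fn + (1 - c) * pi1 * fp)

# toy usage: three positives and three negatives with hand-made scores
scores = [0.1, 0.4, 0.8, 0.3, 0.6, 0.9]
labels = [0, 0, 0, 1, 1, 1]
print(loss(scores, labels, t=0.5, c=0.7))
```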
4 Instance Hardness and Cost Curves

In Equation 1, the loss produced by a classifier is an aggregation of the errors observed for the positive and the negative instances. A positive instance will be associated to a cost 2c when it is incorrectly classified. An error for a negative instance, in turn, will be associated to a cost 2(1 − c).

In our work, we decompose Equation 1 to derive loss functions for individual instances. The individual loss for a positive instance x is defined as:

QI(x, t, c) = 2c \, FN(x, t)    (2)

In the above equation, FN(x, t) = 1 if x is a false negative when threshold t is adopted and FN(x, t) = 0 otherwise. A similar definition of loss function can be done for a negative instance x:

QI(x, t, c) = 2(1 - c) \, FP(x, t)    (3)

In the above equation, FP(x, t) = 1 if x is a false positive at threshold t and FP(x, t) = 0 otherwise.

Given a threshold choice method, QI(x, t, c) produces a specific cost curve for the example x along the range of operating conditions (c ∈ [0, 1]). The instance hardness measures in our work are defined as the area under the individual cost curves and compute the total expected loss over the range of operating conditions. In the general case, given a threshold choice method t = T(c), the hardness of an instance is:

IH^{T}(x) = \int_0^1 QI(x, T(c), c) \, w(c) \, dc    (4)

In the above equation, w(c) represents the distribution of c. In our work, we derived the instance hardness measures for different threshold choice methods assuming a uniform distribution of cost proportions.
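The general hardness of Equation 4 can also be approximated numerically for any threshold choice method T(c) by averaging the per-instance loss QI over a grid of cost proportions (uniform w(c)). A minimal sketch under these assumptions, with helper names of our own choosing:

```python
import numpy as np

def qi(score, label, t, c):
    """Per-instance loss QI(x, t, c) of Equations 2 and 3.

    label 0 = positive (cost 2c if predicted negative, i.e. score > t);
    label 1 = negative (cost 2(1 - c) if predicted positive, i.e. score <= t).
    """
    if label == 0:
        return 2 * c if score > t else 0.0
    return 2 * (1 - c) if score <= t else 0.0

def instance_hardness(score, label, threshold_choice, grid=1001):
    """Approximate IH^T(x) = integral over c in [0, 1] of QI(x, T(c), c),
    assuming a uniform distribution of cost proportions."""
    cs = np.linspace(0.0, 1.0, grid)
    losses = [qi(score, label, threshold_choice(c), c) for c in cs]
    return float(np.mean(losses))  # mean over [0, 1] approximates the integral

# score-fixed example (T(c) = 0.5): a misclassified positive has hardness ~1
print(instance_hardness(0.8, 0, lambda c: 0.5))  # ~1.0, matching Equation 8
print(instance_hardness(0.3, 0, lambda c: 0.5))  # 0.0, correctly classified
```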
4.1 Score-Fixed Instance Hardness

In this method, the decision threshold assumes a fixed value regardless of the operating condition:

T(c) = t    (5)

Usually, t is set to 0.5. If x is an incorrectly classified positive instance, then FN(x, t) = 1. By replacing FN(x, t) in Equation 2, the instance cost curve is defined as:

QI(x, t, c) = 2c    (6)

If x is an incorrectly classified negative instance, then FP(x, t) = 1 and its corresponding cost curve is:

QI(x, t, c) = 2(1 - c)    (7)

For correctly classified instances (either positive or negative), the instance cost curve is a constant line QI(x, t, c) = 0. Figure 1 illustrates the score-fixed instance cost curves for a positive and a negative instance.

Figure 1. Example of instance cost curves for a positive and a negative instance assuming the score-fixed method.

By integrating QI, we derive the instance hardness value for an incorrectly classified positive instance as follows:

IH^{sf}(x) = \int_0^1 2c \, dc = \left[ c^2 \right]_0^1 = 1    (8)

Similarly, for an incorrectly classified negative instance the instance hardness is defined as:

IH^{sf}(x) = \int_0^1 2(1 - c) \, dc = \left[ 2c - c^2 \right]_0^1 = 1    (9)

For correctly classified instances (either positive or negative), IH^{sf}(x) = 0.

4.2 Score-Driven Instance Hardness

In the score-driven method [6], the decision threshold t is defined by simply setting it equal to the operating condition c:

T(c) = c    (10)

For instance, if c = 0.7, the cost of false negatives is higher than the cost of false positives. In this case, by setting t = c = 0.7 the produced classifier tends to predict more instances as positive, thus minimizing the relative number of false negatives. According to [6], the score-driven method is a natural choice when the model scores are assumed to be class probability estimators and the scores are well calibrated.

Under the score-driven method, we can derive the loss function for positive instances as follows:

FN(x, c) = \begin{cases} 1, & \text{if } s > c \\ 0, & \text{otherwise} \end{cases}    (11)

Replacing FN(x, c) in Equation 2, we have the following score-driven cost curve for a positive instance:

QI(x, t, c) = \begin{cases} 2c, & \text{if } s > c \\ 0, & \text{otherwise} \end{cases}    (12)

Fig. 2(a) illustrates the score-driven cost curve for a positive instance with score s = 0.32. For s ≤ c no cost is assumed; otherwise, the cost varies linearly. The area under the cost curve is defined as an instance hardness measure:

IH^{sd}(x) = \int_0^s 2c \, dc = \left[ c^2 \right]_0^s = s^2    (13)

Since y = 0 for positive instances, the above measure can be replaced by (y - s)^2, which corresponds to the squared error obtained by the model.

Equations 14 and 15 define the score-driven cost curve for negative instances. Figure 2(c) illustrates this curve when the negative instance has a score s = 0.41.

FP(x, c) = \begin{cases} 1, & \text{if } s \leq c \\ 0, & \text{otherwise} \end{cases}    (14)

QI(x, t, c) = \begin{cases} 2(1 - c), & \text{if } s \leq c \\ 0, & \text{otherwise} \end{cases}    (15)

Similar to the positive case, the instance hardness for a negative instance is defined as the area under the cost curve:

IH^{sd}(x) = \int_s^1 2(1 - c) \, dc = \left[ 2c - c^2 \right]_s^1 = (1 - s)^2    (16)

For negative instances we have y = 1, and then the above measure also corresponds to (y - s)^2. Hence, for both positive and negative instances, hardness is defined as the squared error obtained by the model.
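The closed forms above translate directly into code. The sketch below (our own naming, assuming labels follow the paper's 0 = positive / 1 = negative convention) returns the score-fixed hardness of Equations 8-9 and the score-driven hardness of Equations 13 and 16, the latter reducing to the squared error (y - s)^2.

```python
def hardness_score_fixed(score, label, t=0.5):
    """IH^sf(x): 1 if x is misclassified at the fixed threshold t, else 0 (Eqs. 8-9).

    label 0 = positive (predicted positive when score <= t), label 1 = negative.
    """
    predicted_negative = score > t
    misclassified = predicted_negative if label == 0 else not predicted_negative
    return 1.0 if misclassified else 0.0

def hardness_score_driven(score, label):
    """IH^sd(x) = s^2 for positives, (1 - s)^2 for negatives, i.e. (y - s)^2 (Eqs. 13, 16)."""
    return (label - score) ** 2

print(hardness_score_fixed(0.8, 0), hardness_score_driven(0.32, 0))  # 1.0, 0.1024
```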
4.3 Rate-Driven Instance Hardness

According to [8], the score-driven method is sensitive to the estimation of the scores. For instance, if the scores are highly concentrated, a small change in the operating condition (and consequently in the decision threshold) may drastically affect the classifier performance. As an alternative, in [8] the authors proposed to use the proportion of instances predicted as positive (i.e., R(t)) to define the decision thresholds.

In the rate-driven method, the decision threshold is set to achieve a desired proportion of positive predictions. The threshold choice method is defined as:

T(c) = R^{-1}(c)    (17)

For instance, if c = 0.7 the threshold t is set in such a way that 70% of the instances are classified as positive. The operating condition c is then expressed as the desired positive rate: c = R(t). The probability of a false negative under the rate-driven choice method can be defined as:

FN(x, R^{-1}(c)) = \begin{cases} 1, & \text{if } s > R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (18)

Replacing FN(x, R^{-1}(c)) in Equation 2, we have the following rate-driven cost curve for a positive instance:

QI(x, t, c) = \begin{cases} 2c, & \text{if } s > R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (19)

Fig. 2(b) illustrates the rate-driven cost curve for a positive instance with a score s = 0.6. For s ≤ R^{-1}(c), or alternatively for R(s) ≤ c, no cost is assumed. When R(s) > c, the cost of the instance varies linearly. The area under the rate-driven cost curve can be adopted as an instance hardness measure:

IH^{rd}(x) = \int_0^{R(s)} 2c \, dc = \left[ c^2 \right]_0^{R(s)} = R(s)^2    (20)

The above measure states that the instance hardness is related to the position of the instance in the ranking produced by the learned model. The worse the ranking of the positive instance, the higher the instance hardness.

Equations 21 and 22 define the rate-driven cost curve for a negative instance with score s, illustrated in Figure 2(d).

FP(x, R^{-1}(c)) = \begin{cases} 1, & \text{if } s \leq R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (21)

QI(x, t, c) = \begin{cases} 2(1 - c), & \text{if } s \leq R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (22)

Similar to the positive case, the instance hardness for a negative instance is defined as the area under the cost curve:

IH^{rd}(x) = \int_{R(s)}^1 2(1 - c) \, dc = \left[ 2c - c^2 \right]_{R(s)}^1 = (1 - R(s))^2    (23)

Figure 2. Instance cost curves for an instance assuming the score-driven and rate-driven methods.

Notice that (1 − R(s)) corresponds to the negative rate of a classifier at the point s. The instance hardness for negative instances is then related to the ranking of the most likely negative instances produced by the learned model.

Different from the score-driven method, which measures the magnitude of the errors obtained by a model, the rate-driven method is more related to ranking performance. By adopting the score-driven method, an instance is considered hard if its score is not well calibrated. On the other hand, the same instance may be considered easy under the rate-driven method if it is well ranked relative to the other instances. Instance hardness under the score-driven method depends only on the score of the instance. Instance hardness under the rate-driven method, in turn, depends not only on the instance score but also on how the other instances are ordered.
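The rate-driven hardness depends on R(s), the proportion of instances that would be predicted positive if the threshold were placed at the instance's own score. A sketch under that reading, estimating R empirically from the test-set scores (ties are handled naively here; names are our own):

```python
import numpy as np

def positive_rate(all_scores, s):
    """Empirical R(s): fraction of instances predicted positive (score <= s)
    if the decision threshold were placed at s."""
    all_scores = np.asarray(all_scores, dtype=float)
    return np.mean(all_scores <= s)

def hardness_rate_driven(score, label, all_scores):
    """IH^rd(x) = R(s)^2 for a positive instance, (1 - R(s))^2 for a negative one
    (Equations 20 and 23); label 0 = positive, 1 = negative."""
    r = positive_rate(all_scores, score)
    return r ** 2 if label == 0 else (1 - r) ** 2

scores = [0.1, 0.2, 0.6, 0.7, 0.8, 0.9]
print(hardness_rate_driven(0.6, 0, scores))  # R(0.6) = 0.5, hardness 0.25
```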
5 Cost Sensitive Algorithm Similarity

The application of global metrics fails to properly measure algorithm similarity since it is based on average performance, losing important information during the evaluation process. More fine-grained measures provide a solution for this limitation by verifying the algorithm performance at the instance level. However, the measures proposed so far are not well defined when misclassification costs are involved.

In this work, we derived different measures for algorithm similarity based on the concept of instance hardness. Given a pair of learned models, they will be similar if the hardness values computed on a test set by using the two models are similar. More specifically, in order to relate two models, we initially collect the scores produced by them on a test set of instances. Following, we compute the hardness value of each test instance considering each model. Finally, the similarity between the models is defined by aggregating the instance hardness values as follows:

D(m_a, m_b, D) = \frac{1}{|D|} \sum_{x \in D} \left| IH_{m_a}^{T}(x) - IH_{m_b}^{T}(x) \right|    (24)

The above equation measures the pairwise distance between models m_a and m_b over the instances x belonging to the test set D. In our work, in order to deal with costs, we derived in the previous section three distinct instance hardness measures by adopting different threshold choice methods T. By adopting the score-fixed method, two models will be similar if they return the same classification result for a high number of instances. By adopting the score-driven method, two models will be similar if their scores are similar (the squared errors produced by the models are similar at the instance level). By adopting the rate-driven method, in turn, two models will be similar if the test instances are similarly ranked. The three methods correspond to three different evaluation strategies. Other instance hardness measures and their corresponding algorithm similarity measures can be developed in the future by considering other threshold choice methods, such as the probabilistic ones [7].

Once we have the algorithm similarity measure on a single dataset, we can compute the overall similarity over different datasets in order to achieve a wider view of the relative performance of the algorithms. There are many strategies to do this aggregation, such as the use of the median or the average similarity, as well as other consensus similarity methods [3] that can be taken over all datasets involved. In this work, we adopt a simple aggregation strategy by computing the average of the similarities measured over all datasets observed:

A(m_a, m_b) = \frac{1}{N} \sum_{j=1}^{N} D(m_a, m_b, D_j)    (25)

In the above equation, N stands for the number of datasets available for measuring algorithm similarity. As will be seen in the next sections, this measure is adopted in a case study to cluster algorithms based on average similarity over a set of learning tasks.
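Equations 24 and 25 amount to a mean absolute difference of hardness values, averaged again over datasets. A compact sketch under the assumption that each model is represented by the vector of hardness values it induces on each dataset (function names are ours):

```python
import numpy as np

def model_distance(hardness_a, hardness_b):
    """D(m_a, m_b, D) of Equation 24: mean absolute difference between the
    instance hardness values produced by two models on the same test set D."""
    hardness_a = np.asarray(hardness_a, dtype=float)
    hardness_b = np.asarray(hardness_b, dtype=float)
    return float(np.mean(np.abs(hardness_a - hardness_b)))

def aggregated_distance(per_dataset_hardness_a, per_dataset_hardness_b):
    """A(m_a, m_b) of Equation 25: average of D over the N available datasets.

    Each argument is a list containing one hardness vector per dataset."""
    return float(np.mean([model_distance(ha, hb)
                          for ha, hb in zip(per_dataset_hardness_a,
                                            per_dataset_hardness_b)]))
```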
6 Experimental Setup

In order to achieve a good diversity of problems, we computed the instance hardness measures on a group of 50 datasets4 representing binary classification problems. Most problems were collected from the UCI repository [1].

4 The list of datasets used is available at http://www.cin.ufpe.br/∼cecm2

We compare the performance of 6 algorithms available in the Scikit-Learn project [14]. The scores for each instance were calculated using a 10-fold cross-validation. Usually, classification algorithms return only the label predicted for an input example; to produce real-valued scores (varying between 0 and 1) for the input instances, we adopted specific procedures for each algorithm. For the Nearest Neighbours (KNN), the score returned is the number of neighbours belonging to the negative class divided by K. The procedure used for the Random Forest (RF) is similar: we count the number of votes for the negative class and divide it by the number of trees in the forest. For the Decision Tree (DT), Naive Bayes (NB) and Logistic Regression (LR), the score represents the probability of the instance being negative. For the Support Vector Machine (SVM), we take the decision function output and normalize it between 0 and 1.

In order to have a reference point, we compare variations of the same algorithm in the clustering process. For the KNN, we applied the algorithm with 3 and 5 nearest neighbours (3NN and 5NN), and for the SVM we ran the experiments with the linear and RBF kernel functions (SVM LIN and SVM RBF, respectively). We expect that models learned with variations of the same algorithm will be clustered together. In order to cluster the results obtained with the similarity measures, we applied the agglomerative hierarchical clustering method with the average linkage function [4]. For the experiments with the score-fixed method, we set t = 0.5.
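The paper does not list its code, but the scoring procedures described above map naturally onto scikit-learn. The sketch below is one possible reading: it obtains 10-fold cross-validated scores of "being negative" for each model, using predict_proba for KNN, RF, DT, NB and LR (for KNN this equals the fraction of neighbours in class 1; for RF, roughly the vote fraction) and a rescaled decision_function for the SVMs. The dataset variables X and y are placeholders, with labels assumed to be 0 (positive) and 1 (negative).

```python
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def negative_scores(model, X, y, cv=10):
    """Cross-validated score of each instance being negative (class 1), in [0, 1]."""
    if isinstance(model, SVC):
        # SVM: rescale the decision-function output to the [0, 1] interval
        dec = cross_val_predict(model, X, y, cv=cv, method="decision_function")
        return (dec - dec.min()) / (dec.max() - dec.min())
    # KNN/RF/DT/NB/LR: predicted probability of the negative class (column for label 1)
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
    return proba[:, 1]

models = {
    "3NN": KNeighborsClassifier(3), "5NN": KNeighborsClassifier(5),
    "DT": DecisionTreeClassifier(), "NB": GaussianNB(),
    "LR": LogisticRegression(), "RF": RandomForestClassifier(),
    "SVM LIN": SVC(kernel="linear"), "SVM RBF": SVC(kernel="rbf"),
}
# scores = {name: negative_scores(m, X, y) for name, m in models.items()}  # X, y: one dataset
```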
7 Dataset Algorithm Similarity

After computing the scores for all dataset instances as described in the previous section, we used the results to calculate the instance hardness measures for the three threshold choice methods presented in Section 4. Then we measured the pairwise distance among the models for each dataset using Equation 24. In order to illustrate the use of the proposed measures, we initially present the results obtained on a single dataset (the Ecoli dataset). The next section will present the clustering results obtained by considering all the collected datasets.

Table 1. Mean instance hardness measures for the Ecoli dataset

Model      IH^{rd}   IH^{sd}   IH^{sf}
3NN        0.269     0.0694    0.0833
5NN        0.2583    0.0583    0.0744
DT         0.2858    0.0774    0.0774
LR         0.252     0.0671    0.1047
NB         0.2585    0.1949    0.2024
RF         0.256     0.051     0.0714
SVM LIN    0.2554    0.1828    0.244
SVM RBF    0.2553    0.1811    0.2381

Table 1 presents the average instance hardness measures for the Ecoli dataset. The average hardness values observed suggest that the models are better at calibrating scores than at ranking instances for this dataset, as can be seen from the values presented by both score-based methods. In fact, the score-fixed and score-driven instance hardness measures are in general smaller than the rate-driven ones. Also, large differences in the quality of the scores produced by some algorithms do not translate into large differences in ranking quality. For instance, although NB produced bad results for the score-fixed and score-driven methods, its results for the rate-driven method are similar to those of the other algorithms.

Figures 3, 4 and 5 present the dendrograms derived from the clustering process by adopting the proposed measures for the Ecoli dataset (respectively for the score-fixed, score-driven and rate-driven measures). The dendrograms for the score-fixed and score-driven methods show that the learned models produced the same clusters, differing on the cut-points. The clustering presented by the rate-driven method is also related to the methods based on scores. The cluster formed by the SVM models is observed in all cases. The remaining clusters show slight differences. In the score-fixed and score-driven methods, the NB model is a cluster by itself, but in the rate-driven case NB was considered similar to LR (i.e., they produced similar rankings of instances although their scores are different). In all cases, the KNN and the RF models were considered similar, which is supported by [12].

Figure 3. Clustered score-fixed performance for the models learned on the Ecoli dataset.
Figure 4. Clustered score-driven performance for the models learned on the Ecoli dataset.
Figure 5. Clustered rate-driven performance for the models learned on the Ecoli dataset.

8 Aggregated Algorithm Similarity

In order to have a better view of the algorithm similarities, we applied Equation 25 to compute the average distance between algorithms across the 50 datasets adopted in our experiments. Figures 6 and 7 present the dendrograms of algorithms considering the score-fixed and score-driven measures. These dendrograms show that the models learned by KNN were the most similar, as expected. The second most similar pair of models was produced by the RF and the LR algorithms (with a low dissimilarity measure around 0.1 in both cases). Depending on the cut-point (e.g., 0.3) adopted to produce clusters from the dendrogram, the DT algorithm is clustered together with 3NN, 5NN, RF and LR. The models obtained by NB are the ones that present the most distinct behaviour. Finally, the SVM models are similar to each other (relative to the other algorithms), but their similarity level is not so high: in many datasets the change in the kernel function produced very distinct models. Some of the results derived from the dendrograms are expected (such as the similarity between 3NN and 5NN). Other results, in turn, such as the similarity between RF and LR, were not expected and hence have to be better investigated in the future.

Figure 6. Clustered average score-fixed performance.
Figure 7. Clustered average score-driven performance.

Figure 8 displays the dendrogram of algorithms produced by adopting the rate-driven measure. The algorithms are more similar to each other when similarity is measured in terms of ranking quality. As in the score-fixed and score-driven dendrograms, 3NN, 5NN, DT and RF are clustered together. The algorithms LR, SVM LIN and NB formed another cluster. Different from the dendrograms for the score-fixed and score-driven methods, we see that the SVM models do not belong to the same group. Again, a deeper investigation has to be done to provide further explanations on why a given pair of algorithms was similar. In our work, we provide a general framework that can be extended with new algorithms and datasets and can be used to identify which algorithms are similar under different cost-sensitive perspectives.

Figure 8. Clustered average rate-driven performance.
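The dendrograms reported in this section can be reproduced in spirit from the aggregated distance matrix of Equation 25 with average-linkage agglomerative clustering; SciPy's hierarchy module is one way to do it (a sketch, not the authors' code; the plot itself assumes matplotlib is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def cluster_models(names, distance_matrix):
    """Average-linkage clustering of models from a symmetric matrix of
    aggregated distances A(m_a, m_b); returns the SciPy linkage matrix."""
    condensed = squareform(np.asarray(distance_matrix), checks=False)
    Z = linkage(condensed, method="average")
    dendrogram(Z, labels=names)  # draws the dendrogram via matplotlib
    return Z
```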
9 Conclusion

In this work, we proposed similarity measures between algorithms by considering three threshold choice methods: score-fixed, score-driven and rate-driven. For each method, we proposed a corresponding instance hardness measure that was then deployed to define the algorithm similarity measure. The similarity between algorithms can be quite different depending on the dataset and the cost-sensitive scenario being tackled. For the score-fixed method, two algorithms are similar if they produce the same classification for a high number of instances at a chosen threshold. For the score-driven method, two algorithms are similar if they produce similar scores. For the rate-driven method, in turn, two algorithms are similar if they produce similar rankings of instances.

In order to infer overall similarity on different datasets, we computed the average similarity between algorithms over 50 datasets and performed a clustering of the algorithms. The results revealed some unexpected similarities that can serve as starting points for further investigations to discover and explain hidden relationships between algorithms and learning strategies.

As future work, the ideas presented here can be applied in a predictive meta-learning strategy to select algorithms for new datasets depending on the threshold choice method and the input costs. In our work, we aggregated the similarity across datasets using the mean, which is simple but possibly naive; other consensus similarity methods can be investigated. Finally, we highlight that our work is still limited concerning the number of algorithms and datasets adopted. Hence, more extensive meta-learning studies can be done in the future, leading to stronger conclusions and insights.
ACKNOWLEDGEMENTS

This research is supported by CNPq, CAPES and FACEPE (Brazilian agencies).

REFERENCES

[1] K. Bache and M. Lichman, UCI machine learning repository, (2013).
[2] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta, Metalearning: Applications to Data Mining, Springer Publishing Company, Incorporated, 1st edn., 2008.
[3] Francisco de A.T. de Carvalho, Yves Lechevallier, and Filipe M. de Melo, 'Relational partitioning fuzzy clustering algorithms based on multiple dissimilarity matrices', Fuzzy Sets and Systems, 215, 1-28, (2013).
[4] Chris Ding, Xiaofeng He, and Lawrence Berkeley, 'Cluster merging and splitting in hierarchical clustering algorithms', 139-146, (2002).
[5] Johannes Fürnkranz and Johann Petrak, 'An evaluation of landmarking variants', in Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, pp. 57-68, (2001).
[6] José Hernández-Orallo, Peter Flach, and Cèsar Ferri, 'Brier curves: A new cost-based visualisation of classifier performance', in Proceedings of the 28th International Conference on Machine Learning, (2011).
[7] José Hernández-Orallo, Peter Flach, and Cèsar Ferri, 'A unified view of performance metrics: Translating threshold choice into expected classification loss', Journal of Machine Learning Research, 13(1), 2813-2869, (October 2012).
[8] José Hernández-Orallo, Peter Flach, and César Ferri, 'ROC curves in cost space', Machine Learning, 93(1), 71-91, (February 2013).
[9] A. Kalousis, J. Gama, and M. Hilario, 'On data and algorithms: Understanding inductive performance', Machine Learning, 275-312, (2004).
[10] J.W. Lee and C. Giraud-Carrier, 'A metric for unsupervised metalearning', Intelligent Data Analysis, 15, 827-841, (2011).
[11] Rui Leite, Pavel Brazdil, and Joaquin Vanschoren, 'Selecting classification algorithms with active testing', in Machine Learning and Data Mining in Pattern Recognition, 117-131, Springer, (2012).
[12] Yi Lin and Yongho Jeon, 'Random forests and adaptive nearest neighbors', Journal of the American Statistical Association, 101(474), 578-590, (2006).
[13] Michael R. Smith, Tony Martinez, and Christophe Giraud-Carrier, 'An instance level analysis of data complexity', Machine Learning, (2013).
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825-2830, (2011).
[15] A.H. Peterson and T.R. Martinez, 'Estimating the potential for combining learning models', in Proceedings of the ICML Workshop on Meta-learning, (2005).