Similarity measures of algorithm performance for cost-sensitive scenarios

Carlos Eduardo Castor de Melo1 and Ricardo Bastos Cavalcante Prudêncio2

1 Centro de Informática, Universidade Federal de Pernambuco, Brasil, email: cecm2@cin.ufpe.br
2 Centro de Informática, Universidade Federal de Pernambuco, Brasil, email: rbcp@cin.ufpe.br

Abstract. Knowledge about algorithm similarity is an important feature of meta-learning, where information gathered from previous learning tasks can be used to guide the selection or combination of algorithms for a new dataset. Usually this task is done by comparing global performance measures across different datasets or, alternatively, by comparing the performance of algorithms at the instance level. These previous similarity measures do not consider misclassification costs, and hence they neglect important information that can be exploited in different learning contexts. In this paper we present algorithm similarity measures that deal with cost proportions and different threshold choice methods for building crisp classifiers from learned models. Experiments were performed in a meta-learning study with 50 different learning tasks. The similarity measures were adopted to cluster algorithms according to their aggregated performance on the learning tasks. The clustering process revealed similarity between algorithms under different perspectives.

1 Introduction

Meta-learning is a framework developed in supervised machine learning for acquiring knowledge from empirical case studies and relating features of learning problems to algorithm performance [2]. The knowledge acquired in meta-learning has been used to support algorithm selection and combination in different contexts [13][11]. In meta-learning, it is assumed that similar problems present similar algorithm performance. Hence, it is important for meta-learning to know how similar the performance obtained by different algorithms is. This information can be used to discover similarities between algorithms and between datasets [9] and also to support the prediction of suitable algorithms for a given dataset based on the algorithms' performance on past datasets [11].

Deploying global metrics, such as accuracy and AUC, to compare the performance of algorithms on a dataset has been the most common approach to deal with the above issue. Despite its popularity, this approach may lose important knowledge about algorithm similarity, since it is based on average algorithm performance without considering differences of algorithm behaviour at the instance level. In order to overcome this limitation, alternative measures of algorithm similarity (e.g., based on error correlation [9] or classifier output difference [10]) have been adopted. Here, algorithms are considered similar if they produce the same classifications, or alternatively the same mistakes, on the same instances.

The algorithm similarity measures adopted in these previous works are limited since they are not flexible enough to consider different misclassification costs and different strategies to build classifiers. Algorithms usually produce models (scoring functions) that return scores or class probabilities for the input examples. A classifier is then built from a model by adopting a decision threshold. The prediction for a given example depends on both the score returned by the learned model for that example and the decision threshold. The most common approach to produce classifiers is to adopt a fixed decision threshold (e.g., 0.5). Alternatively, decision thresholds can be chosen according to the operating conditions or contexts (e.g., class frequencies and misclassification costs) observed when the learned model is deployed. In [7], the authors showed that different threshold choice methods require the use of different performance measures to evaluate a learned model. Similarly, concerning algorithm similarity, specific measures have to be defined when misclassification costs and adaptive threshold choice methods are taken into account. Previous work on algorithm similarity does not consider such aspects: the existing similarity measures implicitly assume fixed decision thresholds and thus are only suitable for comparing the performance of crisp classifiers.
Based on the above motivation, in our work we developed similarity measures for algorithm performance that take into account cost proportions and different threshold choice methods. As in [13], we adopted an instance-level strategy. For that, we initially proposed instance hardness measures to indicate how difficult an instance is to be correctly classified by a model. Specific instance hardness measures were proposed for three threshold choice methods: the score-fixed [7], the score-driven [6] and the rate-driven [8] methods. By assuming a threshold choice method and a learned model, we built for each instance a corresponding cost curve, which indicates the model loss for that instance along different misclassification costs. The instance hardness is defined as the area under the cost curve. The dissimilarity3 between two algorithms for a given dataset is defined as the average absolute difference between the hardness values over all instances of the dataset.

3 In this paper we treat dissimilarity as the complement of similarity; henceforth we will use only the latter term.

In our work, we applied the proposed measures in a meta-learning study in which 50 datasets and 8 learned models were adopted. Clusters of models were produced by adopting the score-fixed, score-driven and rate-driven measures. Each cluster reveals which algorithms produced similar learned models on the 50 datasets under each cost-sensitive perspective. We observed that the algorithm behaviour among the datasets was quite distinct for each measure. By adopting the score-fixed method, two algorithms were similar if they returned the same classification for a high number of instances using a given threshold. By adopting the score-driven method, algorithms were considered similar if they produced similar scores for the instances. By adopting the rate-driven method, in turn, algorithms were considered similar if they produced similar rankings of instances, regardless of how well calibrated their scores are.

This paper is organized as follows. Applications of algorithm similarity measures are presented in Section 2, with a review of related work. Section 3 introduces the notation and basic definitions used throughout this paper. The concepts of instance hardness and threshold choice methods are explained in Section 4, where three threshold choice methods are presented: score-fixed, score-driven and rate-driven. The proposed method for measuring the similarity of algorithm performance is presented in Section 5. Section 6 presents the experimental methodology. An analysis of the similarity measures over a single dataset is provided in Section 7, followed by Section 8, which presents the aggregated similarity across a group of datasets. Finally, Section 9 concludes the paper with a discussion of the results and insights for future work.
2 Measuring algorithm similarity

Meta-learning deals with methods to exploit knowledge gathered from learning on different tasks [2]. Differently from base-level learning, which focuses on acquiring experience on a single task, meta-learning acquires knowledge from a meta-data set produced when a pool of algorithms is applied on distinct learning tasks. Given a learning task, a meta-example usually stores characteristics of the task and the performance obtained by a pool of candidate algorithms. Meta-learning can be applied: (1) in a descriptive sense, aiming to discover similarities between datasets and algorithms, and (2) in a predictive sense, to select candidate algorithms for new datasets based on the knowledge acquired in the meta-learning process.

Different meta-learning approaches rely on the use of similarity measures of algorithm performance. For instance, in [9] the authors clustered a pool of learning algorithms based on the patterns of the errors observed in different datasets. Clustering of algorithms based on their performance was also performed more recently in [10]. Clustering algorithms may provide useful information to guide the processes of algorithm selection and combination [9]. Another line of research in meta-learning is the landmarking approach [11], which uses the performance of simpler algorithms (called landmarkers) to guide the selection of more complex ones. In this approach, datasets are characterized based on the performance obtained by the landmarkers (which are usually simple and diverse algorithms, such as decision stumps and linear models, or simplified versions of the candidate algorithms). One method to handle this approach is measuring the similarity of a new dataset to the meta-examples retrieved from the meta-data, based on the similarity of the performance obtained by the landmarkers [5]. The best candidate algorithms adopted in the retrieved problems are then suggested for the new dataset.

The most common approach for measuring algorithm similarity is the use of global performance measures, like accuracy or the AUC measure, estimated from an empirical evaluation process (such as cross-validation). Similarity between algorithms can be obtained by computing the differences or ratios between the performance measures, as performed in [5]. Although this approach is widely applied, it is strongly criticized since it may lose important information about algorithm behaviour and may not characterize similarity properly. For example, if a dataset has 100 instances, one algorithm misclassifies 20 instances and another one misclassifies 20 instances completely different from the first ones, the accuracy of both algorithms will be the same but their behaviour is quite different.

A more fine-grained approach to algorithm similarity is to consider the performance at each instance of the dataset. This approach has the advantage of showing local discrepancies of performance among the space of instances. In [9], error correlation is adopted as the similarity measure between two algorithms, in such a way that two algorithms are similar if they produce the same errors on the same instances. In [10], the authors present the Classifier Output Difference as an alternative measure for this approach. This metric is defined as the probability of two distinct classifiers making different predictions [15].

Although this approach represents a more refined view of algorithm performance, the measures proposed in the literature are computed from predictions generated by crisp classifiers (i.e., with fixed decision thresholds chosen a priori). However, as stated in Section 1, in the context of misclassification costs, decision thresholds can be adaptively defined to minimize the loss of a learned model. Hence, two instances considered equally easy by a classifier with a fixed threshold can be rather difficult under different decision thresholds and costs. In this work, we derived similarity measures for algorithm performance in cost-sensitive scenarios by considering different methods to choose the decision thresholds of classifiers based on the knowledge about misclassification costs.

3 Notation and Basic Definitions

The basic definitions adopted in our work are mostly based on [7]. Instances can be classified into one of the classes Y = {0, 1}, in which 0 is the positive class and 1 is the negative class. A learned model m is a scoring function that receives an instance x as input and returns a score s = m(x) that indicates the chance of a negative class prediction. A model can be transformed into a crisp classifier by assuming a decision threshold t, in such a way that if s ≤ t then x is classified as positive, and it is classified as negative if s > t.

The errors of a classifier are associated to costs related to the classes. The cost of a false negative is represented as c_0 and the cost of a false positive, in turn, is represented as c_1. As in [7], we normalize the costs by setting c_0 + c_1 = b and adopt the cost proportion c = c_0/b to represent the operating condition faced by a model at deployment. For simplicity, we adopted b = 2 and hence c ∈ [0, 1], c_0 = 2c and c_1 = 2(1 − c).

The loss function produced assuming a decision threshold t and a cost proportion c is defined as:

Q(t, c) = c_0 \pi_0 FN(t) + c_1 \pi_1 FP(t) = 2\{c \, \pi_0 FN(t) + (1 - c) \, \pi_1 FP(t)\}    (1)

In the above equation, FN(t) and FP(t) are respectively the False Negative and False Positive rates produced by a model when a threshold t is adopted. The variables π_0 and π_1 represent the proportions of positive and negative examples.

The Positive Rate R(t) is the proportion of instances predicted as positive at the decision threshold t and can be expressed as π_0(1 − FN(t)) + π_1 FP(t).
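To make Equation 1 concrete, the sketch below (not the authors' code; function and variable names are ours) estimates FN(t), FP(t), R(t) and the loss Q(t, c) from a vector of scores and true labels, following the convention above that class 0 is positive and an instance is predicted positive when its score s ≤ t.

```python
import numpy as np

def rates(scores, labels, t):
    """FN(t), FP(t) and the positive rate R(t) at threshold t.

    Convention from the paper: class 0 is positive, class 1 is negative,
    and an instance is predicted positive when its score s <= t.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = labels == 0, labels == 1
    pred_pos = scores <= t
    fn = np.mean(~pred_pos[pos])       # positives predicted as negative
    fp = np.mean(pred_pos[neg])        # negatives predicted as positive
    pi0, pi1 = pos.mean(), neg.mean()  # class proportions
    r = pi0 * (1 - fn) + pi1 * fp      # positive prediction rate R(t)
    return fn, fp, r

def loss(scores, labels, t, c):
    """Q(t, c) of Equation 1 with b = 2, i.e. c0 = 2c and c1 = 2(1 - c)."""
    fn, fp, _ = rates(scores, labels, t)
    pi0 = np.mean(np.asarray(labels) == 0)
    pi1 = 1 - pi0
    return 2 * (c * pi0 * fn + (1 - c) * pi1 * fp)

# toy usage: three positives and three negatives with hand-made scores
scores = [0.1, 0.4, 0.8, 0.3, 0.6, 0.9]
labels = [0, 0, 0, 1, 1, 1]
print(loss(scores, labels, t=0.5, c=0.7))
```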
4 Instance Hardness and Cost Curves

In Equation 1, the loss produced by a classifier is an aggregation of the errors observed for the positive and the negative instances. A positive instance will be associated to a cost 2c when it is incorrectly classified. An error for a negative instance, in turn, will be associated to a cost 2(1 − c).

In our work, we decompose Equation 1 to derive loss functions for individual instances. The individual loss for a positive instance x is defined as:

QI(x, t, c) = 2c \, FN(x, t)    (2)

In the above equation, FN(x, t) = 1 if x is a false negative when threshold t is adopted and FN(x, t) = 0 otherwise. A similar definition of loss function can be done for a negative instance x:

QI(x, t, c) = 2(1 - c) \, FP(x, t)    (3)

In the above equation, FP(x, t) = 1 if x is a false positive at threshold t and FP(x, t) = 0 otherwise.

Given a threshold choice method, QI(x, t, c) produces a specific cost curve for the example x along the range of operating conditions (c ∈ [0, 1]). The instance hardness measures in our work are defined as the area under the individual cost curves and compute the total expected loss over the range of operating conditions. In the general case, given a threshold choice method t = T(c), the hardness of an instance is:

IH^{T}(x) = \int_0^1 QI(x, T(c), c) \, w(c) \, dc    (4)

In the above equation, w(c) represents the distribution of c. In our work, we derived the instance hardness measures for different threshold choice methods assuming a uniform distribution of cost proportions.
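The general hardness of Equation 4 can also be approximated numerically for any threshold choice method T(c) by averaging the per-instance loss QI over a grid of cost proportions (uniform w(c)). A minimal sketch under these assumptions, with helper names of our own choosing:

```python
import numpy as np

def qi(score, label, t, c):
    """Per-instance loss QI(x, t, c) of Equations 2 and 3.

    label 0 = positive (cost 2c if predicted negative, i.e. score > t);
    label 1 = negative (cost 2(1 - c) if predicted positive, i.e. score <= t).
    """
    if label == 0:
        return 2 * c if score > t else 0.0
    return 2 * (1 - c) if score <= t else 0.0

def instance_hardness(score, label, threshold_choice, grid=1001):
    """Approximate IH^T(x) = integral over c in [0, 1] of QI(x, T(c), c),
    assuming a uniform distribution of cost proportions."""
    cs = np.linspace(0.0, 1.0, grid)
    losses = [qi(score, label, threshold_choice(c), c) for c in cs]
    return float(np.mean(losses))  # mean over [0, 1] approximates the integral

# score-fixed example (T(c) = 0.5): a misclassified positive has hardness ~1
print(instance_hardness(0.8, 0, lambda c: 0.5))  # ~1.0, matching Equation 8
print(instance_hardness(0.3, 0, lambda c: 0.5))  # 0.0, correctly classified
```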
4.1 Score-Fixed Instance Hardness

In this method, the decision threshold assumes a fixed value regardless of the operating condition:

T(c) = t    (5)

Usually, t is set to 0.5. If x is an incorrectly classified positive instance, then FN(x, t) = 1. By replacing FN(x, t) in Equation 2, the instance cost curve is defined as:

QI(x, t, c) = 2c    (6)

If x is an incorrectly classified negative instance, then FP(x, t) = 1 and its corresponding cost curve is:

QI(x, t, c) = 2(1 - c)    (7)

For correctly classified instances (either positive or negative), the instance cost curve is a constant line QI(x, t, c) = 0. Figure 1 illustrates the score-fixed instance cost curves for a positive and a negative instance.

Figure 1. Example of instance cost curves for a positive and a negative instance assuming the score-fixed method.

By integrating QI, we derive the instance hardness value for an incorrectly classified positive instance as follows:

IH^{sf}(x) = \int_0^1 2c \, dc = \left[ c^2 \right]_0^1 = 1    (8)

Similarly, for an incorrectly classified negative instance the instance hardness is defined as:

IH^{sf}(x) = \int_0^1 2(1 - c) \, dc = \left[ 2c - c^2 \right]_0^1 = 1    (9)

For correctly classified instances (either positive or negative), IH^{sf}(x) = 0.

4.2 Score-Driven Instance Hardness

In the score-driven method [6], the decision threshold t is defined by simply setting it equal to the operating condition c:

T(c) = c    (10)

For instance, if c = 0.7, the cost of false negatives is higher than the cost of false positives. In this case, by setting t = c = 0.7 the produced classifier tends to predict more instances as positive, thus minimizing the relative number of false negatives. According to [6], the score-driven method is a natural choice when the model scores are assumed to be class probability estimators and the scores are well calibrated.

Under the score-driven method, we can derive the loss function for positive instances as follows:

FN(x, c) = \begin{cases} 1, & \text{if } s > c \\ 0, & \text{otherwise} \end{cases}    (11)

Replacing FN(x, c) in Equation 2, we have the following score-driven cost curve for a positive instance:

QI(x, t, c) = \begin{cases} 2c, & \text{if } s > c \\ 0, & \text{otherwise} \end{cases}    (12)

Fig. 2(a) illustrates the score-driven cost curve for a positive instance with score s = 0.32. For s ≤ c no cost is assumed; otherwise, the cost varies linearly. The area under the cost curve is defined as an instance hardness measure:

IH^{sd}(x) = \int_0^s 2c \, dc = \left[ c^2 \right]_0^s = s^2    (13)

Since y = 0 for positive instances, the above measure can be replaced by (y - s)^2, which corresponds to the squared error obtained by the model.

Equations 14 and 15 define the score-driven cost curve for negative instances. Figure 2(c) illustrates this curve when the negative instance has a score s = 0.41.

FP(x, c) = \begin{cases} 1, & \text{if } s \leq c \\ 0, & \text{otherwise} \end{cases}    (14)

QI(x, t, c) = \begin{cases} 2(1 - c), & \text{if } s \leq c \\ 0, & \text{otherwise} \end{cases}    (15)

Similar to the positive case, the instance hardness for a negative instance is defined as the area under the cost curve:

IH^{sd}(x) = \int_s^1 2(1 - c) \, dc = \left[ 2c - c^2 \right]_s^1 = (1 - s)^2    (16)

For negative instances we have y = 1, and then the above measure also corresponds to (y - s)^2. Hence, for both positive and negative instances, hardness is defined as the squared error obtained by the model.
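The closed forms above translate directly into code. The sketch below (our own naming, assuming labels follow the paper's 0 = positive / 1 = negative convention) returns the score-fixed hardness of Equations 8-9 and the score-driven hardness of Equations 13 and 16, the latter reducing to the squared error (y - s)^2.

```python
def hardness_score_fixed(score, label, t=0.5):
    """IH^sf(x): 1 if x is misclassified at the fixed threshold t, else 0 (Eqs. 8-9).

    label 0 = positive (predicted positive when score <= t), label 1 = negative.
    """
    predicted_negative = score > t
    misclassified = predicted_negative if label == 0 else not predicted_negative
    return 1.0 if misclassified else 0.0

def hardness_score_driven(score, label):
    """IH^sd(x) = s^2 for positives, (1 - s)^2 for negatives, i.e. (y - s)^2 (Eqs. 13, 16)."""
    return (label - score) ** 2

print(hardness_score_fixed(0.8, 0), hardness_score_driven(0.32, 0))  # 1.0, 0.1024
```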
4.3 Rate-Driven Instance Hardness

According to [8], the score-driven method is sensitive to the estimation of the scores. For instance, if the scores are highly concentrated, a small change in the operating condition (and consequently in the decision threshold) may drastically affect the classifier performance. As an alternative, in [8] the authors proposed to use the proportion of instances predicted as positive (i.e., R(t)) to define the decision thresholds.

In the rate-driven method, the decision threshold is set to achieve a desired proportion of positive predictions. The threshold choice method is defined as:

T(c) = R^{-1}(c)    (17)

For instance, if c = 0.7 the threshold t is set in such a way that 70% of the instances are classified as positive. The operating condition c is then expressed as the desired positive rate: c = R(t). The probability of a false negative under the rate-driven choice method can be defined as:

FN(x, R^{-1}(c)) = \begin{cases} 1, & \text{if } s > R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (18)

Replacing FN(x, R^{-1}(c)) in Equation 2, we have the following rate-driven cost curve for a positive instance:

QI(x, t, c) = \begin{cases} 2c, & \text{if } s > R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (19)

Fig. 2(b) illustrates the rate-driven cost curve for a positive instance with a score s = 0.6. For s ≤ R^{-1}(c), or alternatively for R(s) ≤ c, no cost is assumed. When R(s) > c, the cost of the instance varies linearly. The area under the rate-driven cost curve can be adopted as an instance hardness measure:

IH^{rd}(x) = \int_0^{R(s)} 2c \, dc = \left[ c^2 \right]_0^{R(s)} = R(s)^2    (20)

The above measure states that the instance hardness is related to the position of the instance in the ranking produced by the learned model. The worse the ranking of the positive instance, the higher the instance hardness.

Equations 21 and 22 define the rate-driven cost curve for a negative instance with score s, illustrated in Figure 2(d).

FP(x, R^{-1}(c)) = \begin{cases} 1, & \text{if } s \leq R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (21)

QI(x, t, c) = \begin{cases} 2(1 - c), & \text{if } s \leq R^{-1}(c) \\ 0, & \text{otherwise} \end{cases}    (22)

Similar to the positive case, the instance hardness for a negative instance is defined as the area under the cost curve:

IH^{rd}(x) = \int_{R(s)}^1 2(1 - c) \, dc = \left[ 2c - c^2 \right]_{R(s)}^1 = (1 - R(s))^2    (23)

Figure 2. Instance cost curves for an instance assuming the score-driven and rate-driven methods.

Notice that (1 − R(s)) corresponds to the negative rate of a classifier at the point s. The instance hardness for negative instances is then related to the ranking of the most likely negative instances produced by the learned model.

Different from the score-driven method, which measures the magnitude of the errors obtained by a model, the rate-driven method is more related to ranking performance. By adopting the score-driven method, an instance is considered hard if its score is not well calibrated. On the other hand, the same instance may be considered easy under the rate-driven method if it is well ranked relative to the other instances. Instance hardness under the score-driven method depends only on the score of the instance. Instance hardness under the rate-driven method, in turn, depends not only on the instance score but also on how the other instances are ordered.
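The rate-driven hardness depends on R(s), the proportion of instances that would be predicted positive if the threshold were placed at the instance's own score. A sketch under that reading, estimating R empirically from the test-set scores (ties are handled naively here; names are our own):

```python
import numpy as np

def positive_rate(all_scores, s):
    """Empirical R(s): fraction of instances predicted positive (score <= s)
    if the decision threshold were placed at s."""
    all_scores = np.asarray(all_scores, dtype=float)
    return np.mean(all_scores <= s)

def hardness_rate_driven(score, label, all_scores):
    """IH^rd(x) = R(s)^2 for a positive instance, (1 - R(s))^2 for a negative one
    (Equations 20 and 23); label 0 = positive, 1 = negative."""
    r = positive_rate(all_scores, score)
    return r ** 2 if label == 0 else (1 - r) ** 2

scores = [0.1, 0.2, 0.6, 0.7, 0.8, 0.9]
print(hardness_rate_driven(0.6, 0, scores))  # R(0.6) = 0.5, hardness 0.25
```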
5 Cost Sensitive Algorithm Similarity

The application of global metrics fails to properly measure algorithm similarity since it is based on average performance, losing important information during the evaluation process. More fine-grained measures provide a solution for this limitation by verifying the algorithm performance at the instance level. However, the measures proposed so far are not well defined when misclassification costs are involved.

In this work, we derived different measures for algorithm similarity based on the concept of instance hardness. Given a pair of learned models, they will be similar if the hardness values computed on a test set by using the two models are similar. More specifically, in order to relate two models, we initially collect the scores produced by them on a test set of instances. Following, we compute the hardness value of each test instance considering each model. Finally, the similarity between the models is defined by aggregating the instance hardness values as follows:

D(m_a, m_b, D) = \frac{1}{|D|} \sum_{x \in D} \left| IH_{m_a}^{T}(x) - IH_{m_b}^{T}(x) \right|    (24)

The above equation measures the pairwise distance between models m_a and m_b over the instances x belonging to the test set D. In our work, in order to deal with costs, we derived in the previous section three distinct instance hardness measures by adopting different threshold choice methods T. By adopting the score-fixed method, two models will be similar if they return the same classification result for a high number of instances. By adopting the score-driven method, two models will be similar if their scores are similar (the squared errors produced by the models are similar at the instance level). By adopting the rate-driven method, in turn, two models will be similar if the test instances are similarly ranked. The three methods correspond to three different evaluation strategies. Other instance hardness measures and their corresponding algorithm similarity measures can be developed in the future by considering other threshold choice methods, such as the probabilistic ones [7].

Once we have the algorithm similarity measure on a single dataset, we can compute the overall similarity over different datasets in order to achieve a wider view of the relative performance of the algorithms. There are many strategies to do this aggregation, such as the use of the median or the average similarity, as well as other consensus similarity methods [3] that can be taken over all datasets involved. In this work, we adopt a simple aggregation strategy by computing the average of the similarities measured over all datasets observed:

A(m_a, m_b) = \frac{1}{N} \sum_{j=1}^{N} D(m_a, m_b, D_j)    (25)

In the above equation, N stands for the number of datasets available for measuring algorithm similarity. As will be seen in the next sections, this measure is adopted in a case study to cluster algorithms based on average similarity over a set of learning tasks.
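Equations 24 and 25 amount to a mean absolute difference of hardness values, averaged again over datasets. A compact sketch under the assumption that each model is represented by the vector of hardness values it induces on each dataset (function names are ours):

```python
import numpy as np

def model_distance(hardness_a, hardness_b):
    """D(m_a, m_b, D) of Equation 24: mean absolute difference between the
    instance hardness values produced by two models on the same test set D."""
    hardness_a = np.asarray(hardness_a, dtype=float)
    hardness_b = np.asarray(hardness_b, dtype=float)
    return float(np.mean(np.abs(hardness_a - hardness_b)))

def aggregated_distance(per_dataset_hardness_a, per_dataset_hardness_b):
    """A(m_a, m_b) of Equation 25: average of D over the N available datasets.

    Each argument is a list containing one hardness vector per dataset."""
    return float(np.mean([model_distance(ha, hb)
                          for ha, hb in zip(per_dataset_hardness_a,
                                            per_dataset_hardness_b)]))
```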
6 Experimental Setup

In order to achieve a good diversity of problems, we computed the instance hardness measures on a group of 50 datasets4 representing binary classification problems. Most problems were collected from the UCI repository [1].

4 The list of datasets used is available at http://www.cin.ufpe.br/∼cecm2

We compare the performance of 6 algorithms available in the Scikit-Learn project [14]. The scores for each instance were calculated using a 10-fold cross-validation. Usually, classification algorithms return only the label predicted for an input example; to produce real-valued scores (varying between 0 and 1) for the input instances, we adopted specific procedures for each algorithm. For the Nearest Neighbours (KNN), the score returned is the number of neighbours belonging to the negative class divided by K. The procedure used for the Random Forest (RF) is similar: we count the number of votes for the negative class and divide it by the number of trees in the forest. For the Decision Tree (DT), Naive Bayes (NB) and Logistic Regression (LR), the score represents the probability of the instance being negative. For the Support Vector Machine (SVM), we take the decision function output and normalize it between 0 and 1.

In order to have a reference point, we compare variations of the same algorithm in the clustering process. For the KNN, we applied the algorithm with 3 and 5 nearest neighbours (3NN and 5NN), and for the SVM we ran the experiments with the linear and RBF kernel functions (SVM LIN and SVM RBF, respectively). We expect that models learned with variations of the same algorithm will be clustered together. In order to cluster the results obtained with the similarity measures, we applied the agglomerative hierarchical clustering method with the average linkage function [4]. For the experiments with the score-fixed method, we set t = 0.5.
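The paper does not list its code, but the scoring procedures described above map naturally onto scikit-learn. The sketch below is one possible reading: it obtains 10-fold cross-validated scores of "being negative" for each model, using predict_proba for KNN, RF, DT, NB and LR (for KNN this equals the fraction of neighbours in class 1; for RF, roughly the vote fraction) and a rescaled decision_function for the SVMs. The dataset variables X and y are placeholders, with labels assumed to be 0 (positive) and 1 (negative).

```python
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def negative_scores(model, X, y, cv=10):
    """Cross-validated score of each instance being negative (class 1), in [0, 1]."""
    if isinstance(model, SVC):
        # SVM: rescale the decision-function output to the [0, 1] interval
        dec = cross_val_predict(model, X, y, cv=cv, method="decision_function")
        return (dec - dec.min()) / (dec.max() - dec.min())
    # KNN/RF/DT/NB/LR: predicted probability of the negative class (column for label 1)
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
    return proba[:, 1]

models = {
    "3NN": KNeighborsClassifier(3), "5NN": KNeighborsClassifier(5),
    "DT": DecisionTreeClassifier(), "NB": GaussianNB(),
    "LR": LogisticRegression(), "RF": RandomForestClassifier(),
    "SVM LIN": SVC(kernel="linear"), "SVM RBF": SVC(kernel="rbf"),
}
# scores = {name: negative_scores(m, X, y) for name, m in models.items()}  # X, y: one dataset
```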
7 Dataset Algorithm Similarity

After computing the scores for all dataset instances as described in the previous section, we used the results to calculate the instance hardness measures for the three threshold choice methods presented in Section 4. Then we measured the pairwise distance among the models for each dataset using Equation 24. In order to illustrate the use of the proposed measures, we initially present the results obtained on a single dataset (the Ecoli dataset). The next section will present the clustering results obtained by considering all the collected datasets.

Table 1. Mean instance hardness measures for the Ecoli dataset

Model      IH^{rd}   IH^{sd}   IH^{sf}
3NN        0.269     0.0694    0.0833
5NN        0.2583    0.0583    0.0744
DT         0.2858    0.0774    0.0774
LR         0.252     0.0671    0.1047
NB         0.2585    0.1949    0.2024
RF         0.256     0.051     0.0714
SVM LIN    0.2554    0.1828    0.244
SVM RBF    0.2553    0.1811    0.2381

Table 1 presents the average instance hardness measures for the Ecoli dataset. The average hardness values observed suggest that the models are better at calibrating scores than at ranking instances for this dataset, as can be seen from the values presented by both score-based methods. In fact, the score-fixed and score-driven instance hardness measures are in general smaller than the rate-driven ones. Also, large differences in the quality of the scores produced by some algorithms do not translate into large differences in ranking quality. For instance, although NB produced bad results for the score-fixed and score-driven methods, its results for the rate-driven method are similar to those of the other algorithms.

Figures 3, 4 and 5 present the dendrograms derived from the clustering process by adopting the proposed measures for the Ecoli dataset (respectively for the score-fixed, score-driven and rate-driven measures). The dendrograms for the score-fixed and score-driven methods show that the learned models produced the same clusters, differing on the cut-points. The clustering presented by the rate-driven method is also related to the methods based on scores. The cluster formed by the SVM models is observed in all cases. The remaining clusters show slight differences. In the score-fixed and score-driven methods, the NB model is a cluster by itself, but in the rate-driven case NB was considered similar to LR (i.e., they produced similar rankings of instances although their scores are different). In all cases, the KNN and the RF models were considered similar, which is supported by [12].

Figure 3. Clustered score-fixed performance for the models learned on the Ecoli dataset.
Figure 4. Clustered score-driven performance for the models learned on the Ecoli dataset.
Figure 5. Clustered rate-driven performance for the models learned on the Ecoli dataset.

8 Aggregated Algorithm Similarity

In order to have a better view of the algorithm similarities, we applied Equation 25 to compute the average distance between algorithms across the 50 datasets adopted in our experiments. Figures 6 and 7 present the dendrograms of algorithms considering the score-fixed and score-driven measures. These dendrograms show that the models learned by KNN were the most similar, as expected. The second most similar pair of models was produced by the RF and the LR algorithms (with a low dissimilarity measure around 0.1 in both cases). Depending on the cut-point (e.g., 0.3) adopted to produce clusters from the dendrogram, the DT algorithm is clustered together with 3NN, 5NN, RF and LR. The models obtained by NB are the ones that present the most distinct behaviour. Finally, the SVM models are similar to each other (relative to the other algorithms), but their similarity level is not so high: in many datasets the change in the kernel function produced very distinct models. Some of the results derived from the dendrograms are expected (such as the similarity between 3NN and 5NN). Other results, in turn, such as the similarity between RF and LR, were not expected and hence have to be better investigated in the future.

Figure 6. Clustered average score-fixed performance.
Figure 7. Clustered average score-driven performance.

Figure 8 displays the dendrogram of algorithms produced by adopting the rate-driven measure. The algorithms are more similar to each other when similarity is measured in terms of ranking quality. As in the score-fixed and score-driven dendrograms, 3NN, 5NN, DT and RF are clustered together. The algorithms LR, SVM LIN and NB formed another cluster. Different from the dendrograms for the score-fixed and score-driven methods, we see that the SVM models do not belong to the same group. Again, a deeper investigation has to be done to provide further explanations on why a given pair of algorithms was similar. In our work, we provide a general framework that can be extended with new algorithms and datasets and can be used to identify which algorithms are similar under different cost-sensitive perspectives.

Figure 8. Clustered average rate-driven performance.
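The dendrograms reported in this section can be reproduced in spirit from the aggregated distance matrix of Equation 25 with average-linkage agglomerative clustering; SciPy's hierarchy module is one way to do it (a sketch, not the authors' code; the plot itself assumes matplotlib is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def cluster_models(names, distance_matrix):
    """Average-linkage clustering of models from a symmetric matrix of
    aggregated distances A(m_a, m_b); returns the SciPy linkage matrix."""
    condensed = squareform(np.asarray(distance_matrix), checks=False)
    Z = linkage(condensed, method="average")
    dendrogram(Z, labels=names)  # draws the dendrogram via matplotlib
    return Z
```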
9 Conclusion

In this work, we proposed similarity measures between algorithms by considering three threshold choice methods: score-fixed, score-driven and rate-driven. For each method, we proposed a corresponding instance hardness measure that was then deployed to define the algorithm similarity measure. The similarity between algorithms can be quite different depending on the dataset and the cost-sensitive scenario being tackled. For the score-fixed method, two algorithms are similar if they produce the same classification for a high number of instances at a chosen threshold. For the score-driven method, two algorithms are similar if they produce similar scores. For the rate-driven method, in turn, two algorithms are similar if they produce similar rankings of instances.

In order to infer overall similarity on different datasets, we computed the average similarity between algorithms over 50 datasets and performed a clustering of the algorithms. The results revealed some unexpected similarities that can serve as starting points for further investigations to discover and explain hidden relationships between algorithms and learning strategies.

As future work, the ideas presented here can be applied in a predictive meta-learning strategy to select algorithms for new datasets depending on the threshold choice method and the input costs. In our work, we aggregated the similarity across datasets using the mean, which is simple but possibly naive; other consensus similarity methods can be investigated. Finally, we highlight that our work is still limited concerning the number of algorithms and datasets adopted. Hence, more extensive meta-learning studies can be done in the future, leading to stronger conclusions and insights.
ACKNOWLEDGEMENTS

This research is supported by CNPq, CAPES and FACEPE (Brazilian agencies).

REFERENCES

[1] K. Bache and M. Lichman, UCI machine learning repository, (2013).
[2] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta, Metalearning: Applications to Data Mining, Springer Publishing Company, Incorporated, 1st edn., 2008.
[3] Francisco de A.T. de Carvalho, Yves Lechevallier, and Filipe M. de Melo, 'Relational partitioning fuzzy clustering algorithms based on multiple dissimilarity matrices', Fuzzy Sets and Systems, 215, 1-28, (2013).
[4] Chris Ding, Xiaofeng He, and Lawrence Berkeley, 'Cluster merging and splitting in hierarchical clustering algorithms', 139-146, (2002).
[5] Johannes Fürnkranz and Johann Petrak, 'An evaluation of landmarking variants', in Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, pp. 57-68, (2001).
[6] José Hernández-Orallo, Peter Flach, and Cèsar Ferri, 'Brier curves: A new cost-based visualisation of classifier performance', in Proceedings of the 28th International Conference on Machine Learning, (2011).
[7] José Hernández-Orallo, Peter Flach, and Cèsar Ferri, 'A unified view of performance metrics: Translating threshold choice into expected classification loss', Journal of Machine Learning Research, 13(1), 2813-2869, (October 2012).
[8] José Hernández-Orallo, Peter Flach, and César Ferri, 'ROC curves in cost space', Machine Learning, 93(1), 71-91, (February 2013).
[9] A. Kalousis, J. Gama, and M. Hilario, 'On data and algorithms: Understanding inductive performance', Machine Learning, 275-312, (2004).
[10] J.W. Lee and C. Giraud-Carrier, 'A metric for unsupervised metalearning', Intelligent Data Analysis, 15, 827-841, (2011).
[11] Rui Leite, Pavel Brazdil, and Joaquin Vanschoren, 'Selecting classification algorithms with active testing', in Machine Learning and Data Mining in Pattern Recognition, 117-131, Springer, (2012).
[12] Yi Lin and Yongho Jeon, 'Random forests and adaptive nearest neighbors', Journal of the American Statistical Association, 101(474), 578-590, (2006).
[13] Michael R. Smith, Tony Martinez, and Christophe Giraud-Carrier, 'An instance level analysis of data complexity', Machine Learning, (2013).
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825-2830, (2011).
[15] A.H. Peterson and T.R. Martinez, 'Estimating the potential for combining learning models', in Proceedings of the ICML Workshop on Meta-learning, (2005).