<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Similarity measures of algorithm performance for cost-sensitive scenarios</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ricardo Bastos Cavalcante Prudeˆncio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carlos Eduardo Castor de Melo</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge about algorithm similarity is an important feature of meta-learning, where information gathered from previous learning tasks can be used to guide the selection or combination of algorithms for a new dataset. Usually this task is done by comparing global performance measures across different datasets or, alternatively, comparing the performance of algorithms at the instance-level. These previous similarity measures do not consider misclassification costs, and hence they neglect an important information that can be exploited in different learning contexts. In this paper we present algorithm similarity measures that deals with cost proportions and different threshold choice methods for building crisp classifiers from learned models. Experiments were performed in a meta-learning study with 50 different learning tasks. The similarity measures were adopted to cluster algorithms according to their aggregated performance on the learning tasks. The clustering process revealed similarity between algorithms under different perspectives.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Meta-learning is a framework developed in supervised machine
learning for acquiring knowledge from empirical case studies and
relating features of learning problems to algorithm performance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The knowledge acquired in meta-learning has been used to support
algorithm selection and combination in different contexts [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
In meta-learning, it is assumed that similar problems present
similar algorithm performance. Hence, to know how similar is the
performance obtained by different algorithms is important for
metalearning. This information can be used to discover similarities
between algorithms and between datasets [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and also to support the
prediction of suitable algorithms for a given dataset based on the
algorithms’ performance on past datasets [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Deploying global metrics, such as accuracy and AUC, to compare
the performance of algorithms on a dataset has been the most
common approach to deal with the above issue. Despite the popularity
of this approach, it may lose important knowledge about algorithm
similarity since it is based on average algorithm performance without
considering differences of algorithm behaviour in an instance-level.
In order to overcome this limitation, alternative measures of
algorithm similarity (e.g., based on error correlation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or classifier
output difference [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) have been adopted. Here, algorithms are
considered similar if they produce the same classifications, or alternatively
the same mistakes, on the same instances.
      </p>
      <p>
        Algorithm similarity measures adopted in these previous works
are limited since they are not flexible enough to consider different
misclassification costs and different strategies to build classifiers.
Algorithms usually produce models (scoring functions) that return
scores or class probabilities for the input examples. A classifier is
then built from a model by adopting a decision threshold. The
prediction for a given example depends on both the score returned by
the learned model for that example and the decision threshold. The
most common approach to produce classifiers is to adopt a fixed
decision threshold (e.g., 0.5). Alternatively, decision thresholds can be
chosen according to the operating conditions or contexts (e.g., class
frequencies and misclassification costs) observed when the learned
model is deployed. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors showed that different threshold
choice methods require the use of different performance measures
to evaluate a learned model. Similarly, concerning algorithm
similarity, specific measures have to be defined when misclassification
costs and adaptive threshold choice methods are taken into account.
Previous work on algorithm similarity measures does not consider
such aspects. The previous similarity measures are defined
implicitly assuming fixed decision thresholds and thus are only suitable for
comparing the performance of crisp classifiers.
      </p>
      <p>
        Based on the above motivation, in our work we developed
similarity measures for algorithm performance taking into account cost
proportions and different threshold choice methods. As in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we
adopted an instance-level strategy. For that, we initially proposed
instance hardness measures to indicate how difficult is an instance to be
correctly classified by a model. Specific instance hardness measures
were proposed for three threshold choice methods: the score-fixed
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the score-driven [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and rate-driven methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. By assuming
a threshold choice method and a learned model, we built for each
instance a corresponding cost curve, which indicates the model loss
for that instance along different misclassification costs. The instance
hardness is defined as the area under the cost curve. The
dissimilarity3 between two algorithms for a given dataset is defined as the
average absolute difference between the hardness values over all
instances of the dataset.
      </p>
      <p>In our work, we applied the proposed measures in a meta-learning
study in which 50 datasets and 8 learned models were adopted.
Clusters of models were produced by adopting the score-fixed,
scoredriven and rate-driven measures. Each cluster reveals which
algorithms produced similar learned models on the 50 datasets under each
cost-sensitive perspective. We observed that the algorithm behaviour
among the datasets was quite distinct for each measure. By
adopting the score-fixed method, two algorithms were similar if they
return the same classification for a high number of instances by using
3 In this paper we treat dissimilarity as the complement of similarity,
henceforth we will use only the latter term
a given threshold. By adopting the score-driven method, algorithms
were considered similar if they produced similar scores for the
instances. By adopting the rate-driven method in turn, algorithms were
considered similar if they produced similar rankings of instances
despite how calibrated are their scores.</p>
      <p>This paper is organized as follows. Applications of algorithms
similarity measures are presented in Section 2, with a review of
related works. Section 3 introduces the notation and basic
definitions used through this paper. The concepts of instance hardness and
threshold choice methods are explained in Section 4 and three
threshold choice methods are presented: score-fixed, score-driven and
ratedriven. The proposed method for measuring the similarity of
algorithms performance is presented on Section 5. Section 6 presents the
experimental methodology. An analysis of the similarity measures
over a single dataset is provided in Section 7, followed by the
Section 8 which demonstrates the aggregate similarity across a group of
datasets. Finally, Section 9 concludes the paper with a discussion of
the results and insights for future works.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Measuring algorithm similarity</title>
      <p>
        Meta-learning deals with methods to exploit knowledge gathered
from learning on different tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Differently from base-level
learning, which focuses on acquiring experience on a single task,
meta-learning acquires knowledge from a meta-data set produced
when a pool of algorithms is applied on distinct learning tasks. Given
a learning task, a meta-example usually stores characteristics of the
task and the performance obtained by a pool of candidate algorithms.
Meta-learning can be applied: (1) in a descriptive sense aiming to
discover similarities between datasets and algorithms and (2) in a
predictive sense to select candidate algorithms for new datasets based
on the knowledge acquired in the meta-learning process.
      </p>
      <p>
        Different meta-learning approaches rely on the use of similarity
measures of algorithm performance. For instance, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the authors
clustered a pool of learning algorithms based on the patterns of the
errors observed in different datasets. Clustering of algorithms based
on their performance was also performed more recently in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Clustering algorithms may provide useful information to guide the
processes of algorithm selection and combination [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As another line of
research in meta-learning is the landmarking approach [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which
uses the performance of simpler algorithms (called landmarkers) to
guide the selection of more complex ones. In this approach, datasets
are characterized based on the performance obtained by the
landmarkers (which are usually simple and diverse algorithms, such as
decision stumps and linear models, or simplified versions of the
candidate algorithms). One method to handle this approach is
measuring the similiarity of a new dataset with the meta-examples retrived
from the meta-data based on the similarity of performance obtained
by the landmarkers[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The best candidate algorithms adopted in the
retrieved problems are then suggested for the new dataset.
      </p>
      <p>
        The most common approach for measuring algorithm similarity
is the use of global performance measures, like accuracy or AUC
measure, estimated from an empirical evaluation process (such as
cross-validation). Similarity between algorithms can be obtained by
computing the differences or ratios between the performance
measures, as performed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Although this approach is widely applied,
it is strongly criticized since it may lose important information about
algorithms behaviour and may not characterize similarity properly.
For example, if a dataset has 100 instances and a given algorithm
misclassifies 20 instances and another one misclassifies 20 instances
completely different from the first ones, the accuracy of both
algorithms will be the same but their behaviour is quite different.
      </p>
      <p>
        A more fine-grained approach to algorithm similarity is to
consider the performance at each instance of the dataset. This approach
has the advantage of showing local discrepancies of performance
among the space of instances. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], error correlation is adopted as
similarity measure between two algorithms, in such a way that two
algorithms are similar if they produce the same errors in the same
instances. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the authors present the Classifier Output Difference
as an alternative measure for this approach. This metric is defined as
the probability of two distinct classifiers make different predictions
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Although this approach represents a more refined view about
algorithm performance, the measures proposed in the literature are
computed from predictions generated by crisp classifiers (i.e., with fixed
decision thresholds chosen a priori). However as stated in section 1,
in the context of misclassification costs, decision thresholds can be
adaptively defined to minimize the loss of a learned model. Hence,
two instances considered equally easy by a classifier with a fixed
threshold can be rather difficult under different decision thresholds
and costs. In this work, we derived similarity measures for
algorithm performance in cost sensitive scenarios, by considering
different methods to choose decision thresholds of classifiers based on the
knowledge about misclassification costs.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Notation and Basic Definitions</title>
      <p>
        The basic definitions adopted in our work are mostly based on [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Instances can be classified into one of the classes Y = f0; 1g, in
which 0 is the positive class and 1 is the negative class. A learned
model m is a scoring function that receives an instance x as input
and returns a score s = m(x) that indicates the chance of a negative
class prediction. A model can be transformed in a crisp classifier
assuming a decision threshold t in such a way that if s t then x is
classified as positive, and it is classified as negative if s &gt; t.
      </p>
      <p>
        The errors of a classifier are associated to costs related to the
classes. The cost of a false negative is represented as c0 and the cost
of a false positive in turn is represented as c1. As in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we
normalize the costs by setting c0 + c1 = b and adopt the cost proportion
c = c0=b to represent the operating condition faced by a model this
deployment. For simplicity, we adopted b = 2 and hence c 2 [0; 1],
c0 = 2c and c1 = 2(1 c).
      </p>
      <p>The loss function produced assuming a decision threshold t and a
cost proportion c is defined as:</p>
      <p>Q(t; c)
= c0 0F N (t) + c1 1F P (t)
= 2fc 0F N (t) + (1
c) 1F P (t)g
(1)</p>
      <p>In the above equation F N (t) and F P (t) are respectively the False
Negative and False Positive rates produced by a model when a
threshold t is adopted. The variables 0 and 1 represent the proportion of
positive and negative examples.</p>
      <p>The Positive Rate R(t) is the proportion of instances predicted as
positive at the decision threshold t and can be expressed as 0(1
F N (t)) + 1F P (t).
4</p>
    </sec>
    <sec id="sec-4">
      <title>Instance Hardness and Cost Curves</title>
      <p>In Equation 1, the loss produced by a classifier is an aggregation
of the errors observed for the positive and the negative instances. A
positive instance will be associated to a cost 2c when it is incorrectly
IHT (x) =</p>
      <p>QI(x; T (c); c)w(c)dc
(4)</p>
      <p>In the above equation, w(c) represents the distribution of c. In
our work, we derived the instance hardness measures for different
threshold choice methods assuming uniform distribution of cost
proportions.
4.1</p>
    </sec>
    <sec id="sec-5">
      <title>Score-Fixed Instance Hardness</title>
      <p>In this method, the decision threshold assumes a fixed value
regardless the operation condition:
classified. An error for a negative instance in turn will be associated
to a cost 2(1 c).</p>
      <p>In our work, we decompose the Equation 1 to derive the loss
functions for individual instances. The individual loss for a positive
instance x is defined as:</p>
      <p>QI(x; t; c) = 2cF N (x; t)</p>
      <p>In the above equation, F N (x; t) = 1 if x is a false negative when
threshold t is adopted and F N (x; t) = 0 otherwise. A similar
definition of loss function can be done for a negative instance x:
QI(x; t; c) = 2(1
c)F P (x; t)</p>
      <p>In the above equation, F P (x; t) = 1 if x is a false positive at
threshold t and F P (x; t) = 0 otherwise.</p>
      <p>Given a threshold choice method, QI(x; t; c) produce a specific
curve for the example x along the range of operating conditions (c 2
[0; 1]). The instance hardness measures in our work are defined as the
area under the individual cost curves and compute the total expected
loss for the range of operation conditions. In the general case, given
a threshold choice method t = T (c), the hardness of an instance is:
IHsf (x) =
2(1
c) dc =
h
c
2i1</p>
      <p>Usually, t is set to 0:5. If x is an incorrectly classified positive
instance, then F N (x; t) = 1. By replacing F N (x; t) in Equation 2,
the instance cost curve is defined as:</p>
      <p>T (c) = t</p>
      <p>QI(x; t; c) = 2c</p>
      <p>If x is an incorrectly classified negative instance, then F P (x; t) =
1 and its corresponding cost curve is:</p>
      <p>QI(x; t; c) = 2(1
c)</p>
      <p>For correctly classified instances (either positive or negative), the
instance cost curve is a constant line QI(x; t; c) = 0. Figure 1
illustrates the score-fixed instance cost curve considering positive and
negative instances.</p>
      <p>By integration QI, we derive the instance hardness value for
incorrect positive instance as follows:</p>
      <p>IHsf (x) =</p>
      <p>h 2i1
2c dc = c</p>
      <p>= 1
0
Similarly, for incorrect instances the instance hardness is defined
0
as:
(2)
(3)
(5)
(6)
(7)
(8)</p>
      <p>Replacing F N (x; c) in Equation 2 we have the following
scoredriven cost curve for a positive instance:</p>
      <p>Fig. 2(a) illustrates the score-driven cost curve for a positive
instance with score s = 0:32. For s c no cost is assumed; on the
other hand, the cost varies linearly. The area under the cost curve is
defined as an instance hardness measure:</p>
      <p>For correctly classified instances (either positive or negative),
IHsf (x) = 0.
4.2</p>
    </sec>
    <sec id="sec-6">
      <title>Score-Driven Instance Hardness</title>
      <p>
        In the score-driven method [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the decision threshold t is defined by
simply setting it equal to the operating condition c:
      </p>
      <p>T (c) = c</p>
      <p>
        For instance, if c = 0:7, the cost of false negatives are higher than
the costs of false positives. In this case by setting t = c = 0:7 the
produced classifier tends to predict more instances as positive, thus
minimizing the relative number of false negatives. According to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
the score-driven method is a natural choice when the model scores
are assumed to be class probability estimators and the scores are well
calibrated.
      </p>
      <p>Under the score-driven method, we can derive the loss function for
positive instances as follows:</p>
      <p>F N (x; c) =
2(1</p>
      <p>h
c) dc = 2c
c
2i1
s
= (1
s)2
(16)</p>
      <p>
        For negative instances we have y = 1 and then the above
measure corresponds to (y s)2. Hence, for both positive and negative
instances, hardness is defined as the squared-error obtained by the
model.
(13)
(14)
(15)
a small change in the operating condition (and consequently in the
decision threshold) may drastically affect the classifier performance.
As an alternative, in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the authors proposed to use the proportion
of instances predicted as positive (i.e., R(t)) to define the decision
thresholds.
      </p>
      <p>In the rate-driven method, the decision threshold is set to achieve
a desired proportion of positive predictions. The threshold choice
method is defined as:</p>
      <p>T (c) = R 1(c)</p>
      <p>For instance, if c = 0:7 the threshold t is set in such a way that
70% of the instances are classified as positive. The operating
condition c is then expressed as the desired positive rate: c = R(t). The
probability of a false negative under the rate-driven choice method
can be defined as:</p>
      <p>F N (x; R 1(c)) =
1; if s &gt; R 1(c)
0; otherwise</p>
      <p>
        Replacing F N (x; R 1(c)) in Equation 2 we have the following
rate-driven cost curve for a positive instance:
According to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the score-driven method is sensitive to the
estimation of the scores. For instance, if the scores are highly concentrated,
      </p>
      <p>Similar to the positive instance, instance hardness for a negative
instance is defined as the area under the cost curve:
IHrd(x) =</p>
      <p>R(s)
2(1</p>
      <p>h
c) dc = 2c
c2i
= (1</p>
      <p>R(s))2
R(s)1
(23)</p>
      <p>Notice that (1 R(s)) corresponds to the negative rate of a
classifier at the point s. The instance hardness for negative instances is then
related to the ranking of the most likely negative instances produced
by the learned model.</p>
      <p>Different from the score-driven method, which measures the
magnitude of the errors obtained by a model, the rate-driven method is
more related to ranking performance. By adopting the score-driven
method, an instance is considered hard if its score is not well
calibrated. On other hand, the same instance may be easy by assuming
the rate-driven method if it is well ranked relative to the other
instances. Instance hardness by adopting the score-driven method only
depends on the score of the instance. Instance hardness by adopting
the rate-driven method in turn depends not only on the instance score
but also on how the other instances are ordered.
5</p>
    </sec>
    <sec id="sec-7">
      <title>Cost Sensitive Algorithm Similarity</title>
      <p>The application of global metrics fails to properly measure algorithm
similarity since it is based on average performance, losing important
information during the evaluation process. More fine-grained
measures provide a solution for this limitation by verifying the algorithm
performance at the instance level. However, the proposed measures
are not well defined when misclassification costs are involved.</p>
      <p>In this work, we derived different measures for algorithm
similarity based on the concept of instance hardness. Given a pair of learned
models, they will be similar if the hardness values computed on a test
set by using the two models are similar. More specifically, in order to
relate two models, we initially collect the scores produced by them
on a test set of instances. Following, we compute the hardness value
for each test instance considering each model. Finally, the similarity
between the models is defined by aggregating the instance hardness
values as follows:</p>
      <p>D(ma; mb; D) =
jDj</p>
      <p>x2D
1 X jIHmTa (x)</p>
      <p>IHmTb (x)j
(24)</p>
      <p>
        The above equation measures the pairwise distance between
models ma and mb for each instance x belonging to the test set D. In
our work, in order to deal with costs, we derived in the previous
section three distinct instance hardness by adopting different threshold
choice methods T . By adopting the score-fixed method, two models
will be similar if they return the same classification result for a high
number of instances. By adopting the score-driven method, two
models will be similar with their scores are similar (the squared-errors
produced by the models are similar at instance level). By adopting
the rate-driven method in turn two models will be similar if the test
instances are similarly ranked. The three methods correspond to three
different evaluation strategies. Other instance hardness measures and
their corresponding algorithm similarity measures can be developed
in the future by considering other threshold choice methods, such as
the probabilistic ones [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Once we have the algorithm similarity measure on a single dataset,
we can compute the overall similarity over different datasets in order
to achieve a wider view about the relative performance of algorithms.
There are many strategies to do this aggregation, such as the use of
the median or the average similarity as well as other consensus
similarity methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that can be taken over all datasets involved. In
this work, we adopt a simple aggregation strategy by computing the
average similarities measured over all datasets observed:
A(ma; mb) =
j=1
      </p>
      <p>N
N1 X D(ma; mb; Dj )
(25)</p>
      <p>In the above equation, N stands for the number of datasets
available for measuring algorithm similarity. As it will be seen next
section, this measure will be adopted in a case study to cluster
algorithms based on average similarity over a set of learning tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>Experimental Setup</title>
      <p>
        In order to achieve a good diversity of problems, we computed the
instance hardness on a group of 50 datasets4 representing problems
of binary classification. Most problems were collected from the UCI
repository[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        We compare the performance of 6 algorithms available on the
Scikit-Learn Project[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The scores for each instance were
calculated using a 10-fold cross-validation. Usually, the classification
algorithms return the label predicted for an informed example but to
produce real-values scores (varying between 0 and 1)for the input
instances, we adopted specific procedures for each algorithm. For
the Nearest Neighbours (KNN), the score returned is the number of
neighbours belonging to the negative class divided by K. The
procedure used for the Random Forest (RF) is similar: we count the
number of votes for the negative class and divide this by the number
of trees in the forest. For the Decision Tree (DT), Naive Bayes (NB)
and Logistic Regression (LR) the score represents the probability of
the instance being negative. For the Support Vector Machine (SVM),
we get the decision function output and normalize it between 0 and
1.
      </p>
      <p>In order to have a reference point, we compare variations of
algorithm at the clustering process. For the KNN, we applied the
algorithm with 3 and 5 nearest neighbours (3NN and 5NN) and for the
SVM, we adopted the experiments with the linear and RBF kernel
functions (SVM LIN and SVM RBF, respectively).</p>
      <p>
        We expect that the models learned with variations of a same
algorithm will be clustered together. In order to cluster the results
obtained by the similarity measures, we applied the agglomerative
hierarchical clustering method with the average linkage function [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For
the experiments with the score-fixed method, we set t = 0:5.
7
      </p>
    </sec>
    <sec id="sec-9">
      <title>Dataset Algorithm Similarity</title>
      <p>After computing the scores for all datasets instances as described in
the previous section, we use the results to calculate the instance
hardness measures for the three threshold choice methods presented in
Section 4. Then we measure the pairwise distance among the models
for each dataset using the Equation 24. In order to illustrate the use
of the proposed measures, we initially present the results obtained
by a single dataset (the Ecoli dataset). Next section will present the
clustering results obtained by considering all the collected datasets.</p>
      <p>Table 1 presents the average instance hardness measures for the
Ecoli dataset. The average hardness values observed suggest that the
models are better at calibrating scores than at ranking instances for
this dataset, as can be seen from the values presented by both
methods based on score. In fact, the score-fixed and score-driven instance
4 The list of datasets used is available at http://www.cin.ufpe.br/ cecm2
hardness measures are in general smaller than the rate-driven ones.
Also, large differences in the quality of scores produced by some
algorithms do not reflect in large differences in the ranking quality. For
instance, although the NB produced bad results for the score-fixed
and score-driven methods, its results for the rate-driven method are
similar to the other algorithms.</p>
      <p>
        Figures 3, 4 and 5 present the dendrograms derived from the
clustering process by adopting the proposed measures for the Ecoli
dataset (respectively for the score-fixed, score-driven and rate-driven
measures). The dendrograms for the score-fixed and score-driven
methods show that the learned models produced the same clusters,
differing on the cut-points. The clustering presented by the
ratedriven method is also related to the methods based on score. The
cluster formed by the SVM models is observed in all cases. The
remaining clusters have a slight difference. In the score-fixed and
score-driven methods, the NB model is a cluster by itself, but in the
rate-driven case NB was considered similar to LR (i.e., they produced
similar rankings of instances although their scores are different). In
all cases, the KNN and the RF models were considered similar, which
can be supported in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>Aggregated Algorithm Similarity</title>
      <p>In order to have a better view about the algorithms similarities, we
applied the equation 25 to compute the average distance between
algorithms across the 50 datasets adopted in our experiments. Figures
6 and 7 present the dendrograms of algorithms considering the
scorefixed and score-driven measures. These dendrogram shows that the
models learned by KNN were the most similar, as expected. The
second most similar pair of models was produced by the RF and the
LR algorithms (with a low dissimilarity measure around 0.1 in both
cases). Depending on the cut-point (e.g., 0.3) adopted to produce
clusters from the dendrogram, the DT algorithm is clustered together
with 3NN, 5NN, RF and LR. The models obtained by NB are the
ones that present the most distinct behaviour. Finally, the SVM
models are similar between them (relative to the other algorithms), but
their similarity level is not so high. The change in the kernel function
produced in many datasets very distinct models. Some of the results
derived from the dendrogram are expected (such as the similarity
between 3NN and 5NN). Other results in turn, such as the similarity
between RF and LR, were not expected and hence have to be better
investigated in the future.</p>
      <p>Figure 8 displays the dendrogram of algorithms produced by
adopting the rate-driven measure. The algorithms are more similar to
each other when similarity is measured in terms of ranking quality.
As in the score-fixed and score-driven dendrograms, the 3NN, 5NN,
DT and RF are clustered together. The algorithms LR, SVM LIN and
NB formed another cluster. Different from the dendrograms for the
score-fixed and score-driven methods, we see that the SVM models
do not belong to the same group. Again, a deeper investigation has to
be done to provide further explanations on why a given pair of
algorithms was similar. In our work, we provide a general framework that
can be extended with new algorithms and datasets and can be used to
identify which algorithms are similar under different cost-sensitive
perspectives.
9</p>
    </sec>
    <sec id="sec-11">
      <title>Conclusion</title>
      <p>In this work, we proposed similarity measures between algorithms
by considering three threshold choice methods: score-fixed,
scoredriven and rate-driven. For each method, we proposed a
corresponding instance hardness measure that was then deployed to define the
algorithm similarity measure. The similarity between algorithms can
be quite different depending on the dataset and the cost-sensitive
scenario being tackled. For the score-fixed, two algorithms are similar if
they produce the same classification for a high number of instances
for a chosen threshold. For the score-driven method, two algorithms
are similar if they produce similar scores. For the rate-driven method,
in turn, two algorithms are similar if they produce similar rankings
of instances.</p>
      <p>In order to infer overall similarity on different datasets, we
computed the average similarity between algorithms over 50 datasets and
performed clustering of algorithms. The results revealed some
unexpected similarities that can serve as starting points for further
investigations to discover and explain hidden relationships between
algorithms and learning strategies.</p>
      <p>As future work, the ideas presented here can be applied on a
predictive meta-learning strategy to select algorithms for new datasets
depending on the threshold choice method and the input costs. In
our work, we aggregate similarity across datasets using the mean
method, which is simple but possibly naive. Other consensus
similarity methods can be investigated. Finally, we highlight that our work
is very limited concerning the number of algorithms and datasets
adopted. Hence, more extensive meta-learning studies can be done
in the future, with stronger conclusions and insights.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This research is supported by CNPq, CAPES and FACEPE (Brazilian
agencies).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bache</surname>
          </string-name>
          and
          <string-name>
            <surname>M. Lichman.</surname>
          </string-name>
          <article-title>UCI machine learning repository</article-title>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Brazdil</surname>
          </string-name>
          , Christophe Giraud-Carrier,
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Soares</surname>
          </string-name>
          , and Ricardo Vilalta, Metalearning: Applications to Data Mining, Springer Publishing Company, Incorporated,
          <volume>1</volume>
          <fpage>edn</fpage>
          .,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Francisco de A.T. de Carvalho, Yves Lechevallier, and
          <string-name>
            <surname>Filipe M. de Melo</surname>
          </string-name>
          , '
          <article-title>Relational partitioning fuzzy clustering algorithms based on multiple dissimilarity matrices'</article-title>
          ,
          <source>Fuzzy Sets and Systems</source>
          ,
          <volume>215</volume>
          (
          <issue>0</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          , (
          <year>2013</year>
          ). Theme : Clustering.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Ding</surname>
          </string-name>
          , Xiaofeng He, and Lawrence Berkeley, '
          <article-title>Cluster merging and splitting in hierarchical clustering algorithms</article-title>
          ',
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          , (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Fu</surname>
          </string-name>
          <article-title>¨rnkranz and Johann Petrak, 'An evaluation of landmarking variants'</article-title>
          ,
          <source>in Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and MetaLearning</source>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Jose´ Herna´
          <article-title>ndez-</article-title>
          <string-name>
            <surname>Orallo</surname>
            ,
            <given-names>Peter</given-names>
          </string-name>
          <string-name>
            <surname>Flach</surname>
          </string-name>
          , and Ce`sar Ferri, '
          <article-title>Brier curves: A new cost-based visualisation of classifier performance'</article-title>
          ,
          <source>in Proceedings of the 28th International Conference on Machine Learning</source>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Jose´ Herna´
          <article-title>ndez-</article-title>
          <string-name>
            <surname>Orallo</surname>
            ,
            <given-names>Peter</given-names>
          </string-name>
          <string-name>
            <surname>Flach</surname>
          </string-name>
          , and Ce`sar Ferri, '
          <article-title>A unified view of performance metrics: Translating threshold choice into expected classification loss'</article-title>
          ,
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ),
          <fpage>2813</fpage>
          -
          <lpage>2869</lpage>
          , (
          <year>October 2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Jose´ Herna´
          <article-title>ndez-</article-title>
          <string-name>
            <surname>Orallo</surname>
            ,
            <given-names>Peter</given-names>
          </string-name>
          <string-name>
            <surname>Flach</surname>
          </string-name>
          , and
          <article-title>Ce´sar Ferri, 'ROC curves in cost space'</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>93</volume>
          (
          <issue>1</issue>
          ),
          <fpage>71</fpage>
          -
          <lpage>91</lpage>
          , (
          <year>February 2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A</given-names>
            <surname>Kalousis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Gama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M</given-names>
            <surname>Hilario</surname>
          </string-name>
          , '
          <article-title>On data and algorithms: Understanding inductive performance'</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <fpage>275</fpage>
          -
          <lpage>312</lpage>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>JW</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>C</given-names>
            <surname>Giraud-Carrier</surname>
          </string-name>
          ,
          <article-title>'A metric for unsupervised metalearning'</article-title>
          ,
          <source>Intelligent Data Analysis</source>
          ,
          <volume>15</volume>
          ,
          <fpage>827</fpage>
          -
          <lpage>841</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Rui</surname>
            <given-names>Leite</given-names>
          </string-name>
          , Pavel Brazdil, and Joaquin Vanschoren, '
          <article-title>Selecting classification algorithms with active testing'</article-title>
          ,
          <source>in Machine Learning and Data Mining in Pattern Recognition</source>
          ,
          <fpage>117</fpage>
          -
          <lpage>131</lpage>
          , Springer, (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yongho</given-names>
            <surname>Jeon</surname>
          </string-name>
          , '
          <article-title>Random forests and adaptive nearest neighbors'</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>101</volume>
          (
          <issue>474</issue>
          ),
          <fpage>578</fpage>
          -
          <lpage>590</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Tony</given-names>
            <surname>Martinez Michael R. Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christophe</given-names>
            <surname>Giraud-Carrier</surname>
          </string-name>
          , '
          <article-title>An instance level analysis of data complexity'</article-title>
          ,
          <source>Machine Learning</source>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and E. Duchesnay, '
          <article-title>Scikit-learn: Machine learning in Python'</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>AH Peterson and TR Martinez, 'Estimating the potential for combining learning models'</article-title>
          ,
          <source>Proceedings of the ICML workshop on metalearning,</source>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>