=Paper=
{{Paper
|id=None
|storemode=property
|title=Selecting Classification Algorithms with Active Testing on Similar Datasets
|pdfUrl=https://ceur-ws.org/Vol-950/planlearn2012_submission_5.pdf
|volume=Vol-950
}}
==Selecting Classification Algorithms with Active Testing on Similar Datasets==
Rui Leite (LIAAD-INESC Porto L.A. / Faculty of Economics, University of Porto, Portugal, rleite@fep.up.pt), Pavel Brazdil (LIAAD-INESC Porto L.A. / Faculty of Economics, University of Porto, Portugal, pbrazdil@inescporto.pt) and Joaquin Vanschoren (LIACS - Leiden Institute of Advanced Computer Science, University of Leiden, Netherlands, joaquin@liacs.nl)

'''Abstract.''' Given the large amount of data mining algorithms, their combinations (e.g. ensembles) and possible parameter settings, finding the most adequate method to analyze a new dataset becomes an ever more challenging task. This is because in many cases testing all possibly useful alternatives quickly becomes prohibitively expensive. In this paper we propose a novel technique, called active testing, that intelligently selects the most useful cross-validation tests. It proceeds in a tournament-style fashion, in each round selecting and testing the algorithm that is most likely to outperform the best algorithm of the previous round on the new dataset. This 'most promising' competitor is chosen based on a history of prior duels between both algorithms on similar datasets. Each new cross-validation test will contribute information to a better estimate of dataset similarity, and thus better predict which algorithms are most promising on the new dataset. We also follow a different path to estimate dataset similarity based on data characteristics. We have evaluated this approach using a set of 292 algorithm-parameter combinations on 76 UCI datasets for classification. The results show that active testing will quickly yield an algorithm whose performance is very close to the optimum, after relatively few tests. It also provides a better solution than previously proposed methods. The variants of our method that rely on cross-validation tests to estimate dataset similarity provide better solutions than those that rely on data characteristics.

===1 Background and Motivation===

In many data mining applications, an important problem is selecting the best algorithm for a specific problem. Especially in classification, there are hundreds of algorithms to choose from. Moreover, these algorithms can be combined into composite learning systems (e.g. ensembles) and often have many parameters that greatly influence their performance. This yields a whole spectrum of methods and their variations, so that testing all possible candidates on the given problem, e.g., using cross-validation, quickly becomes prohibitively expensive.

The issue of selecting the right algorithm has been the subject of many studies over the past 20 years [17, 3, 23, 20, 19]. Most approaches rely on the concept of metalearning. This approach exploits characterizations of datasets and past performance results of algorithms to recommend the best algorithm on the current dataset. The term metalearning stems from the fact that we try to learn the function that maps dataset characterizations (meta-data) to algorithm performance estimates (the target variable).

The earliest techniques considered only the dataset itself and calculated an array of various simple, statistical or information-theoretic properties of the data (e.g., dataset size, class skewness and signal-noise ratio) [17, 3]. Another approach, called landmarking [2, 12], ran simple and fast versions of algorithms (e.g. decision stumps instead of decision trees) on the new dataset and used their performance results to characterize the new dataset. Alternatively, in sampling landmarks [21, 8, 14], the complete (non-simplified) algorithms are run on small samples of the data. A series of sampling landmarks on increasingly large samples represents a partial learning curve which characterizes datasets, and which can be used to predict the performance of algorithms significantly more accurately than with classical dataset characteristics [13, 14]. Finally, an 'active testing strategy' for sampling landmarks [14] was proposed that actively selects the most informative sample sizes while building these partial learning curves, thus reducing the time needed to compute them.

'''Motivation.''' All these approaches have focused on dozens of algorithms at most and usually considered only default parameter settings. Dealing with hundreds, perhaps thousands, of algorithm-parameter combinations (in the remainder of this text, when we speak of algorithms, we mean fully-defined algorithm instances with fixed components, e.g. base-learners or kernel functions, and parameter settings) provides a new challenge that requires a new approach. First, distinguishing between hundreds of subtly different algorithms is significantly harder than distinguishing between a handful of very different ones. We would need many more data characterizations that relate the effects of certain parameters on performance.
On the other hand, the latter method [14] has a scalability issue: it requires that pairwise comparisons be conducted between algorithms. This would be rather impractical when faced with hundreds of algorithm-parameter combinations.

To address these issues, we propose a quite different way to characterize datasets, namely through the effect that the dataset has on the relative performance of algorithms run on it. As in landmarking, we use the fact that each algorithm has its own learning bias, making certain assumptions about the data distribution. If the learning bias 'matches' the underlying data distribution of a particular dataset, it is likely to perform well (e.g., achieve high predictive accuracy). If it does not, it will likely under- or overfit the data, resulting in a lower performance.

As such, we characterize a dataset based on the pairwise performance differences between algorithms run on it: if the same algorithms win, tie or lose against each other on two datasets, then the data distributions of these datasets are likely to be similar as well, at least in terms of their effect on learning performance. It is clear that the more algorithms are used, the more accurate the characterization will be. While we cannot run all algorithms on each new dataset because of the computational cost, we can run a fair amount of CV tests to get a reasonably good idea of which prior datasets are most similar to the new one.
Moreover, we can use these same performance results to establish which (yet untested) algorithms are likely to perform well on the new dataset, i.e., those algorithms that outperformed or rivaled the currently best algorithm on similar datasets in the past. As such, we can intelligently select the most promising algorithms for the new dataset, run them, and then use their performance results to gain increasingly better estimates of the most similar datasets and the most promising algorithms.

'''Key concepts.''' There are two key concepts used in this work. The first one is that of the current best candidate algorithm, which may be challenged by other algorithms in the process of finding an even better candidate. The second is the pairwise performance difference between two algorithms running on the same dataset, which we call a relative landmark. A collection of such relative landmarks represents a history of previous 'duels' between two algorithms on prior datasets. The term itself originates from the study of landmarking algorithms: since absolute values for the performance of landmarkers vary a lot depending on the dataset, several types of relative landmarks have been proposed, which basically capture the relative performance difference between two algorithms [12]. In this paper, we extend the notion of relative landmarks to all (including non-simplified) classification algorithms.

The history of previous algorithm duels is used to select the most promising challenger for the current best candidate algorithm, namely the method that most convincingly outperformed or rivaled the current champion on prior datasets similar to the new dataset.

'''Approach.''' Given the current best algorithm and a history of relative landmarks (duels), we can start a tournament game in which, in each round, the current best algorithm is compared to the next, most promising contender.
We select the most promising challenger as discussed above, and run a CV test with this algorithm. The winner becomes the new current best candidate, the loser is removed from consideration. We will discuss the exact procedure in Section 3.

We call this approach active testing (AT), as it actively selects the most interesting CV tests instead of passively performing them one by one: in each iteration the best competitor is identified, which determines a new CV test to be carried out. Moreover, the same result will be used to further characterize the new dataset and more accurately estimate the similarity between the new dataset and all prior datasets. (Note that while the term 'active testing' is also used in the context of actively selected sampling landmarks [14], there is little or no relationship to the approach described here.)

'''Evaluation.''' By intelligently selecting the most promising algorithms to test on the new dataset, we can more quickly discover an algorithm that performs very well. Note that running a selection of algorithms is typically done anyway to find a suitable algorithm. Here, we optimize and automate this process using historical performance results of the candidate algorithms on prior datasets. While we cannot possibly guarantee to return the absolute best algorithm without performing all possible CV tests, we can return an algorithm whose performance is either identical or very close to the truly best one. The difference between the two can be expressed in terms of a loss. Our aim is thus to minimize this loss using a minimal number of tests, and we will evaluate our technique as such.

In all, the research hypothesis that we intend to prove in this paper is: relative landmarks provide useful information on the similarity of datasets and can be used to efficiently predict the most promising algorithms to test on new datasets. We will test this hypothesis by running our active testing approach in a leave-one-out fashion on a large set of CV evaluations testing 292 algorithms on 76 datasets. The results show that our AT approach is indeed effective in finding very accurate algorithms with a limited number of tests. We also present an adaptation of method AT that uses data characteristics to define the similarity of the datasets. Our purpose was to compare the relative landmark and data measure approaches to selecting classification algorithms.

'''Roadmap.''' The remainder of this paper is organized as follows. First, we formulate the concept of relative landmarks in Section 2 and active testing in Section 3. Next, Section 4 presents the empirical evaluation and Section 5 presents an overview of some work in other related areas. The final section presents conclusions and future work.

===2 Relative Landmarks===

In this section we formalize our definition of relative landmarks, and explain how they are used to identify the most promising competitor for the currently best algorithm.

Given a set of classification algorithms and some new classification dataset d_new, the aim is to identify the potentially best algorithm for this task with respect to some given performance measure M (e.g., accuracy, AUC or rank). Let us represent the performance of algorithm a_i on dataset d_new as M(a_i, d_new). As such, we need to identify an algorithm a*, for which the performance measure is maximal, i.e. M(a*, d_new) ≥ M(a_i, d_new) for all a_i. The decision concerning ≥ (i.e. whether a* is at least as good as a_i) may be established using either a statistical significance test or a simple comparison.

However, instead of searching exhaustively for a*, we aim to find a near-optimal algorithm â* which has a high probability P(M(â*, d_new) ≥ M(a_i, d_new)) of being optimal, ideally close to 1. As in other work that exploits metalearning, we assume that â* is likely better than a_i on dataset d_new if it was found to be better on a similar dataset d_j (for which we have performance estimates):

<math>P(M(\hat{a}^*, d_{new}) \geq M(a_i, d_{new})) \sim P(M(\hat{a}^*, d_j) \geq M(a_i, d_j)) \qquad (1)</math>

The latter estimate can be maximized by going through all algorithms and identifying the algorithm â* that satisfies the ≥ constraint in a maximum number of cases. However, this requires that we know which datasets d_j are most similar to d_new. Since our definition of similarity requires CV tests to be run on d_new, but we cannot run all possible CV tests, we use an iterative approach in which we repeat this scan for â* in every round, using only the datasets d_j that seem most similar at that point, as dataset similarities are recalculated after every CV test.

Initially, having no information, we deem all datasets to be similar to d_new, so that â* will be the globally best algorithm over all prior datasets. We then call this algorithm the current best algorithm a_best and run a CV test to calculate its performance on d_new. Based on this, the dataset similarities are recalculated (see Section 3), yielding a possibly different set of datasets d_j. The best algorithm on this new set becomes the best competitor a_k (different from a_best), calculated by counting the number of times that M(a_k, d_j) > M(a_best, d_j) over all datasets d_j.
ranking of all algorithms using the performance results of all algo- We can further refine this method by taking into account how large rithms on all previous datasets, and choose the top-ranked algorithm the performance differences are: the larger a difference was in the as our initial candidate abest . To illustrate this, we use a toy example past, the higher chances are to obtain a large gain on the new dataset. involving 6 classification algorithms, with default parameter settings, This leads to the notion of relative landmarks RL, defined as: from Weka [10] evaluated on 40 UCI datasets [1], a portion of which is shown in Table 1. As said before, our approach is entirely independent from the ex- (2) act evaluation measure used: the most appropriate measure can be RL(ak , abest , dj ) = i(M (ak , dj ) > M (abest , dj )) ∗ chosen by the user in function of the specific data domain. In this ∗(M (ak , dj ) − M (abest , dj )) example, we use success rate (accuracy), but any other suitable mea- sure of classifier performance, e.g. AUC (area under the ROC curve), The function i(test) returns 1 if the test is true and 0 otherwise. As precision, recall or F1 can be used as well. stated before, this can be a simple comparison or a statistical signifi- Each accuracy figure shown in Table 1 represents the mean of 10 cance test that only returns 1 if ak performs significantly better than values obtained in 10-fold cross-validation. The ranks of the algo- abest on dj . The term RL thus expresses how much better ak is, rela- rithms on each dataset are shown in parentheses next to the accuracy tive to abest , on a dataset dj . Experimental tests have shown that this value. For instance, if we consider dataset abalone, algorithm M LP approach yields much better results than simply counting the number is attributed rank 1 as its accuracy is highest on this problem. The of wins. second rank is occupied by LogD, etc. Up to now, we are assuming a dataset dj to be either similar to The last row in the table shows the mean rank of each algo- dnew or not. A second refinement is to use a gradual (non-binary) rithm, obtained by averaging over the ranks of each dataset: Rai = 1 P n measure of similarity Sim(dnew , dj ) between datasets dnew and dj . n dj =1 Rai ,dj , where Rai ,dj represents the rank of algorithm ai As such, we can weigh the performance difference between ak and on dataset dj and n the number of datasets. This is a quite common abest on dj by how similar dj is to dnew . Indeed, the more similar procedure, often used in machine learning to assess how a particular the datasets, the more informative the performance difference is. As algorithm compares to others [5]. such, we aim to optimize the following criterion: The mean ranks permit us to obtain a global rank- ing of candidate algorithms, CA. In our case, X CA = hM LP, J48, JRip, LogD, IB1, N Bi. It must be noted ak = arg max RL(ai , abest , dj ) ∗ Sim(dnew , dj )) (3) ai that such a ranking is not very informative in itself. For instance, dj∈D statistical significance tests are needed to obtain a truthful ranking. in which D is the set of all prior datasets dj . Here, we only use this global ranking CA are a starting point for the To calculate the similarity Sim(), we use the outcome of each CV iterative procedure, as explained next. test on dnew and compare it to the outcomes on dj . Step 2 - Identify the Current Best Algorithm. 
===3 Active Testing===

In this section we describe the active testing (AT) approach, which proceeds according to the following steps (a schematic sketch of this loop is given after the list):

# Construct a global ranking of a given set of algorithms using performance information from past experiments (metadata).
# Initiate the iterative process by assigning the top-ranked algorithm as a_best and obtain the performance of this algorithm on d_new using a CV test.
# Find the most promising competitor a_k for a_best using relative landmarks and all previous CV tests on d_new.
# Obtain the performance of a_k on d_new using a CV test and compare it with a_best. Use the winner as the current best algorithm, and eliminate the losing algorithm.
# Repeat the whole process starting with step 3 until a stopping criterion has been reached. Finally, output the current a_best as the overall winner.
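The loop can be summarized in the following schematic sketch. The callables passed in (run_cv_test, update_similarity, select_competitor) are placeholders for the components described in steps 1 to 4 and in Section 2; this is an outline under those assumptions, not the authors' code.

<syntaxhighlight lang="python">
def active_testing(ranking, prior_datasets, run_cv_test,
                   update_similarity, select_competitor, max_tests=20):
    """Schematic AT loop. `ranking` is the global ranking from step 1;
    the three callables stand in for the components described in the text."""
    a_best = ranking[0]                                  # step 2: top-ranked candidate
    results = {a_best: run_cv_test(a_best)}              # CV test on d_new
    candidates = list(ranking[1:])
    similarity = {d_j: 1.0 for d_j in prior_datasets}    # initially all datasets similar

    while candidates and len(results) < max_tests:       # step 5: stopping criterion
        a_k = select_competitor(a_best, candidates, similarity)  # step 3, equation (3)
        results[a_k] = run_cv_test(a_k)                  # step 4: run the duel
        if results[a_k] > results[a_best]:
            a_best = a_k                                 # challenger wins, old champion is out
        candidates.remove(a_k)                           # the loser is never reconsidered
        similarity = update_similarity(results)          # re-estimate Sim(d_new, d_j)
    return a_best
</syntaxhighlight>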
'''Step 1 - Establish a Global Ranking of Algorithms.''' Before having run any CV tests, we have no information on the new dataset d_new to define which prior datasets are similar to it. As such, we naively assume that all prior datasets are similar, generate a global ranking of all algorithms using the performance results of all algorithms on all previous datasets, and choose the top-ranked algorithm as our initial candidate a_best. To illustrate this, we use a toy example involving 6 classification algorithms, with default parameter settings, from Weka [10], evaluated on 40 UCI datasets [1], a portion of which is shown in Table 1.

As said before, our approach is entirely independent of the exact evaluation measure used: the most appropriate measure can be chosen by the user in function of the specific data domain. In this example, we use success rate (accuracy), but any other suitable measure of classifier performance, e.g. AUC (area under the ROC curve), precision, recall or F1, can be used as well.

Each accuracy figure shown in Table 1 represents the mean of 10 values obtained in 10-fold cross-validation. The ranks of the algorithms on each dataset are shown in parentheses next to the accuracy value. For instance, if we consider dataset abalone, algorithm MLP is attributed rank 1 as its accuracy is highest on this problem. The second rank is occupied by LogD, etc.

The last row in the table shows the mean rank of each algorithm, obtained by averaging over its ranks on all datasets: <math>R_{a_i} = \frac{1}{n} \sum_{d_j=1}^{n} R_{a_i,d_j}</math>, where R_{a_i,d_j} represents the rank of algorithm a_i on dataset d_j and n is the number of datasets. This is a quite common procedure, often used in machine learning to assess how a particular algorithm compares to others [5].

The mean ranks permit us to obtain a global ranking of candidate algorithms, CA. In our case, CA = (MLP, J48, JRip, LogD, IB1, NB). It must be noted that such a ranking is not very informative in itself; for instance, statistical significance tests would be needed to obtain a truthful ranking. Here, we only use this global ranking CA as a starting point for the iterative procedure, as explained next.

'''Table 1.''' Accuracies and ranks (in parentheses) of the algorithms 1-nearest neighbor (IB1), C4.5 (J48), RIPPER (JRip), LogisticDiscriminant (LogD), MultiLayerPerceptron (MLP) and naive Bayes (NB) on different datasets, and their mean rank.

{| class="wikitable"
! Datasets !! IB1 !! J48 !! JRip !! LogD !! MLP !! NB
|-
| abalone || .197 (5) || .218 (4) || .185 (6) || .259 (2) || .266 (1) || .237 (3)
|-
| acetylation || .844 (1) || .831 (2) || .829 (3) || .745 (5) || .609 (6) || .822 (4)
|-
| adult || .794 (6) || .861 (1) || .843 (3) || .850 (2) || .830 (5) || .834 (4)
|-
| ... || ... || ... || ... || ... || ... || ...
|-
| Mean rank || 4.05 || 2.73 || 3.17 || 3.74 || 2.54 || 4.78
|}
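As an illustration of step 1, the following sketch computes per-dataset ranks, mean ranks and the resulting global ranking CA from an accuracy table shaped like Table 1. Only the three rows shown above are included, so the printed ranking differs from the one obtained on all 40 datasets; the code is an assumed implementation of the described procedure, not the authors' own.

<syntaxhighlight lang="python">
# Minimal sketch of step 1: mean ranks and the global ranking CA.
from statistics import mean

accuracies = {
    "abalone":     {"IB1": .197, "J48": .218, "JRip": .185, "LogD": .259, "MLP": .266, "NB": .237},
    "acetylation": {"IB1": .844, "J48": .831, "JRip": .829, "LogD": .745, "MLP": .609, "NB": .822},
    "adult":       {"IB1": .794, "J48": .861, "JRip": .843, "LogD": .850, "MLP": .830, "NB": .834},
}

def ranks_on_dataset(results):
    """Rank 1 = highest accuracy on this dataset."""
    ordered = sorted(results, key=results.get, reverse=True)
    return {alg: pos + 1 for pos, alg in enumerate(ordered)}

algorithms = list(next(iter(accuracies.values())))
mean_rank = {a: mean(ranks_on_dataset(res)[a] for res in accuracies.values())
             for a in algorithms}
CA = sorted(algorithms, key=mean_rank.get)   # best (lowest mean rank) first
print(CA)   # on the full 40-dataset table this yields (MLP, J48, JRip, LogD, IB1, NB)
</syntaxhighlight>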
'''Step 2 - Identify the Current Best Algorithm.''' The global ranking CA permits us to identify the top-ranked algorithm as our initial best candidate algorithm a_best. In Table 1, a_best = MLP. This algorithm is then evaluated using a CV test to establish its performance on the new dataset d_new.

'''Step 3 - Identify the Most Promising Competitor.''' In the next step we identify a_k, the best competitor of a_best. To do this, all algorithms are considered one by one, except for a_best and the eliminated algorithms (see step 4). For each algorithm we analyze the information of past experiments (meta-data) to calculate the relative landmarks, as outlined in the previous section. As equation 3 shows, for each a_k we sum up all relative landmarks involving a_best, weighted by a measure of similarity between dataset d_j and the new dataset d_new. The algorithm a_k that achieves the highest value is the most promising competitor in this iteration. In case of a tie, the competitor that appears first in ranking CA is chosen.

To calculate Sim(d_new, d_j), the similarity between d_j and d_new, we have explored six different variants, AT0, AT1, ATWs, ATx, ATdc and ATWs_k, described below.

'''AT0''' is a baseline method which ignores dataset similarity. It always returns a similarity value of 1, so all datasets are considered similar. This means that the best competitor is determined by summing up the values of the relative landmarks.

'''AT1''' works like AT0 at the beginning, when no tests have been carried out on d_new. In all subsequent iterations, this method estimates dataset similarity using only the most recent CV test. Consider the algorithms listed in Table 1 and the ranking CA. Suppose we started with algorithm MLP as the current best candidate. Suppose also that in the next iteration LogD was identified as the best competitor, and won against MLP in the CV test: M(LogD, d_new) > M(MLP, d_new). Then, in the subsequent iteration, all prior datasets d_j satisfying the condition M(LogD, d_j) > M(MLP, d_j) are considered similar to d_new. In general terms, suppose that the last test revealed that M(a_k, d_new) > M(a_best, d_new); then Sim(d_new, d_j) is 1 if M(a_k, d_j) > M(a_best, d_j), and 0 otherwise. The similarity measure determines which RLs are taken into account when summing up their contributions to identify the next best competitor.

Another variant of AT1 could use the difference between RL(a_k, a_best, d_new) and RL(a_k, a_best, d_j), normalized between 0 and 1, to obtain a real-valued (non-binary) similarity estimate Sim(d_new, d_j). In other words, d_j is more similar to d_new if the relative performance difference between a_k and a_best is about as large on both d_j and d_new. We plan to investigate this in our future work.

'''ATWs''' is similar to AT1, but instead of only using the last test, it uses all CV tests carried out on the new dataset, and calculates the Laplace-corrected ratio of corresponding results. For instance, suppose we have conducted 3 tests on d_new, thus yielding 3 pairwise algorithm comparisons on d_new. Suppose that 2 of these tests had the same result on dataset d_j (i.e. M(a_x, d_new) > M(a_y, d_new) and M(a_x, d_j) > M(a_y, d_j)); then the frequency of occurrence is 2/3, which is adjusted by Laplace correction to obtain an estimate of probability, (2+1)/(3+2). As such, Sim(d_new, d_j) = 3/5.

'''ATx''' is similar to ATWs, but requires that all pairwise comparisons yield the same outcome. In the example used above, it returns Sim(d_new, d_j) = 1 only if all three comparisons lead to the same result on both datasets, and 0 otherwise.
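A minimal sketch of these outcome-based similarity variants is given below, assuming the pairwise duels observed on d_new and the stored accuracies on a prior dataset are available as simple dictionaries. The data structures and function names are illustrative assumptions, not the authors' implementation.

<syntaxhighlight lang="python">
# Outcome-based similarity variants AT1, ATWs and ATx (sketch).
# `duels_new` holds the pairwise duels already run on d_new as (a_x, a_y) tuples,
# `results_new` the accuracies measured on d_new, `perf_dj` those stored for d_j.

def duel_outcome(results, a_x, a_y):
    """True if a_x beat a_y on the dataset described by `results`."""
    return results[a_x] > results[a_y]

def sim_at1(last_duel, results_new, perf_dj):
    """AT1: 1 if the most recent duel on d_new has the same outcome on d_j."""
    a_x, a_y = last_duel
    same = duel_outcome(results_new, a_x, a_y) == duel_outcome(perf_dj, a_x, a_y)
    return 1.0 if same else 0.0

def sim_atws(duels_new, results_new, perf_dj):
    """ATWs: Laplace-corrected fraction of duels with the same outcome on both datasets."""
    matches = sum(duel_outcome(results_new, a_x, a_y) == duel_outcome(perf_dj, a_x, a_y)
                  for a_x, a_y in duels_new)
    return (matches + 1) / (len(duels_new) + 2)

def sim_atx(duels_new, results_new, perf_dj):
    """ATx: 1 only if every duel so far has the same outcome on both datasets."""
    return 1.0 if all(duel_outcome(results_new, a_x, a_y) == duel_outcome(perf_dj, a_x, a_y)
                      for a_x, a_y in duels_new) else 0.0

# Example matching the text: 2 of 3 duels agree, so ATWs gives (2+1)/(3+2) = 0.6.
</syntaxhighlight>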
cate that the information associated with relative landmarks obtained The main aim of the test was to prove the research hypothesis for- on the new dataset is indeed valuable. The median loss curves decline mulated earlier: relative landmarks provide useful information for quite rapidly and are always below the AT0 curve. We also see that predicting the most promising algorithms on new datasets. There- after only 10 CV tests (representing about 3% of all possible tests), fore, we also include two baseline methods: the median loss is less than 0.5%. If we continue to 60 tests (about 20% of all possible tests) the median loss is near 0. Also note that ATWs, which uses all relative landmarks involv- ing abest and dnew , does not perform much better than AT1, which TopN has been described before (e.g. [3]). It also builds a ranking of only uses the most recent CV test. However in Figure 2 we show the candidate algorithms as described in step 1 (although other mea- performance of variant ATWs k is much better than ATWs. sures different from mean rank could be used), and only runs CV Method ATx, the most restrictive approach, only considers prior tests on the first N algorithms. The overall winner is returned. datasets on which all relative landmarks including abest obtained Rand simply selects N algorithms at random from the given set, similar results. As shown in Figure 1, this approach manages to re- evaluates them using CV and returns the one with the best perfor- duce the loss quite rapidly, and competes well with the other variants mance. It is repeated 10 times with different random seeds and the in the initial region. However, after achieving a minimum loss in the results are averaged. order of 0.5%, there are no more datasets that fulfill this restriction, and consequently no new competitor can be chosen, causing it to stop. The other three methods, ATWs, ATdc and AT1, do not suf- Since our AT methods are iterative, we will restart TopN and Rand fer from this shortcoming. We can see that the variant that uses data N times, with N equal to the number of iterations (or CV tests). characteristics (ATdc) is generally worse than the other variants. Be- To evaluate the performance of all approaches, we calculate the fore the 23rd CV test all variants AT1, ATx and ATWs need about loss of the currently best algorithm, defined as M (abest , dnew ) − half the CV tests needed by ATdc. M (a∗, dnew ), where abest represents the currently best algorithm, AT0 was also our best baseline method. To avoid overloading Fig- a∗ the best possible algorithm and M (.) represents the performance ure 1, we show this separately in Figure 2. Indeed, AT0 is clearly measure (success rate). better than the random choice method Rand. Comparing AT0 to 1.5 Rand TopN AT0 Median Accuracy Loss (%) ATWs ATWs_k 1.0 0.5 0.0 1 2 4 8 16 32 64 128 292 Number of CV's Figure 2. Median loss of AT0, ATWs, ATWs k and the two baseline methods. TopN, we cannot say that one is clearly better than the other over- identified an algorithm with a very small loss of 0.2%: all, as the curves cross. However, it is clear that TopN looses out if Bagging.I.25..50.M ultilayerP erceptron, i.e. Bagging with rel- we allow more CV tests, and that it is not competitive with the more atively few iterations but with a MultiLayerPerceptron base-learner. advanced methods such as AT1, ATWs and ATWs k. Is clear in Fig- Incidentally, this dataset appears to represent a quite atypical prob- ure 2 that the best method variant is ATWs k (using k = 20). 
'''Step 4 - Determine which of the Two Algorithms is Better.''' Having found a_k, we can now run a CV test and compare it with a_best. The winner (which may be either the current best algorithm or the competitor) is used as the new current best algorithm in the next round. The losing algorithm is eliminated from further consideration.

'''Step 5 - Repeat the Process and Check the Stopping Criteria.''' The whole process of identifying the best competitor of a_best (step 3) and determining which one of the two is better (step 4) is repeated until a stopping criterion has been reached. For instance, the process could be constrained to a fixed number of CV tests: considering the results presented further on in Section 4, it would be sufficient to run at most 20% of all possible CV tests. Alternatively, one could impose a fixed CPU time, thus returning the best algorithm in h hours, as in budgeted learning. In any case, until aborted, the method will keep choosing a new competitor in each round: there will always be a next best competitor. In this respect our system differs from, for instance, hill climbing approaches which can get stuck in a local minimum.

'''Discussion - Comparison with Active Learning.''' The term active testing was chosen because the approach shares some similarities with active learning [7]. The concern of both is to speed up the process of improving a given performance measure. In active learning, the goal is to select the most informative data point to be labeled next, so as to improve the predictive performance of a supervised learning algorithm with a minimum of (expensive) labelings. In active testing, the goal is to select the most informative CV test, so as to improve the prediction of the best algorithm on the new dataset with a minimum of (expensive) CV tests.

===4 Empirical Evaluation===

====4.1 Evaluation Methodology and Experimental Set-up====

The proposed method AT was evaluated using a leave-one-out method [18]. The experiments reported here involve D datasets, so the whole procedure was repeated D times. In each cycle, all performance results on one dataset were left out for testing and the results on the remaining D-1 datasets were used as metadata to determine the best candidate algorithm.

This study involved 292 algorithms (algorithm-parameter combinations), which were extracted from the experiment database for machine learning (ExpDB) [11, 22]. This set includes many different algorithms from the Weka platform [10], which were varied by assigning different values to their most important parameters. It includes SMO (a support vector machine, SVM), MLP (MultilayerPerceptron), J48 (C4.5), and different types of ensembles, including RandomForest, Bagging and Boosting. Moreover, different SVM kernels were used with their own parameter ranges, and all non-ensemble learners were used as base-learners for the ensemble learners mentioned above. The 76 datasets used in this study were all from UCI [1]. A complete overview of the data used in this study, including links to all algorithms and datasets, can be found at http://expdb.cs.kuleuven.be/ref/blv11.

The data measures used to characterize the datasets for variant ATdc were the number of examples, the proportion of nominal attributes, the proportion of missing values, the class entropy and the mean mutual information.

Regarding ATWs_k, we set k = 20 (this value was chosen by exploring only the alternatives 30, 20 and 10).

The main aim of the test was to prove the research hypothesis formulated earlier: relative landmarks provide useful information for predicting the most promising algorithms on new datasets. Therefore, we also include two baseline methods:

'''TopN''' has been described before (e.g. [3]). It also builds a ranking of the candidate algorithms as described in step 1 (although measures other than the mean rank could be used), and only runs CV tests on the first N algorithms. The overall winner is returned.

'''Rand''' simply selects N algorithms at random from the given set, evaluates them using CV and returns the one with the best performance. It is repeated 10 times with different random seeds and the results are averaged.

Since our AT methods are iterative, we restart TopN and Rand N times, with N equal to the number of iterations (or CV tests).

To evaluate the performance of all approaches, we calculate the loss of the currently best algorithm, defined as M(a_best, d_new) - M(a*, d_new), where a_best represents the currently best algorithm, a* the best possible algorithm, and M(.) the performance measure (success rate).
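The leave-one-out evaluation and the loss curves of Figure 1 could be computed along the following lines. Here run_active_testing stands in for the AT loop of Section 3 and is assumed to return the sequence of current-best algorithms chosen on d_new after each CV test; the whole sketch is an assumption about the protocol as described, not the authors' evaluation code.

<syntaxhighlight lang="python">
# Sketch of the evaluation protocol: loss per CV test, median over datasets.
from statistics import median

def loss_curve(perf, d_new, run_active_testing):
    """Loss M(a_best, d_new) - M(a*, d_new) after each CV test on d_new."""
    best_possible = max(perf[d_new].values())
    chosen = run_active_testing(d_new)          # e.g. [a_best after test 1, 2, ...]
    return [best_possible - perf[d_new][a] for a in chosen]

def median_loss_per_budget(perf, run_active_testing):
    """Median loss over all datasets, per number of CV tests (cf. Figure 1)."""
    curves = [loss_curve(perf, d, run_active_testing) for d in perf]
    n_tests = min(len(c) for c in curves)
    return [median(c[t] for c in curves) for t in range(n_tests)]
</syntaxhighlight>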
====4.2 Results====

By aggregating the results over the D datasets, we can track the median loss of the recommended algorithm as a function of the number of CV tests carried out. The results are shown in Figure 1. Note that the number of CV tests is plotted on a logarithmic scale.

''Figure 1. Median loss as a function of the number of CV tests (curves for AT0, AT1, ATx, ATWs and ATdc).''

First, we see that ATWs and AT1 perform much better than AT0, which indicates that it is indeed useful to include dataset similarity. If we consider a particular level of loss (say 0.5%), we note that these variants require far fewer CV tests than AT0. The results also indicate that the information associated with relative landmarks obtained on the new dataset is indeed valuable. The median loss curves decline quite rapidly and are always below the AT0 curve. We also see that after only 10 CV tests (representing about 3% of all possible tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20% of all possible tests), the median loss is near 0.

Also note that ATWs, which uses all relative landmarks involving a_best and d_new, does not perform much better than AT1, which only uses the most recent CV test. However, as Figure 2 shows, the performance of variant ATWs_k is much better than that of ATWs.

Method ATx, the most restrictive approach, only considers prior datasets on which all relative landmarks involving a_best obtained similar results. As shown in Figure 1, this approach manages to reduce the loss quite rapidly and competes well with the other variants in the initial region. However, after achieving a minimum loss in the order of 0.5%, there are no more datasets that fulfill this restriction, and consequently no new competitor can be chosen, causing it to stop. The other three methods, ATWs, ATdc and AT1, do not suffer from this shortcoming. We can also see that the variant that uses data characteristics (ATdc) is generally worse than the other variants: before the 23rd CV test, the variants AT1, ATx and ATWs need about half the CV tests needed by ATdc.

AT0 was also our best baseline method. To avoid overloading Figure 1, we show this separately in Figure 2. Indeed, AT0 is clearly better than the random choice method Rand. Comparing AT0 to TopN, we cannot say that one is clearly better than the other overall, as the curves cross. However, it is clear that TopN loses out if we allow more CV tests, and that it is not competitive with the more advanced methods such as AT1, ATWs and ATWs_k. It is also clear in Figure 2 that the best variant is ATWs_k (using k = 20). We can see in the curves that ATWs_k typically needs about half the CV tests needed by ATWs. The comparison with all the other variants is even more impressive.

''Figure 2. Median loss of AT0, ATWs, ATWs_k and the two baseline methods (Rand, TopN).''

The curves for mean loss (instead of median loss) follow similar trends, but the values are 1-2% worse due to outliers (see Figure 3 for method ATWs_k). This figure also shows the curves associated with the 25% and 75% quartiles for ATWs_k. As the number of CV tests increases, the distance between the two curves decreases and approaches the median curve. Similar behavior has been observed for ATWs and AT1, but we omit those curves here.

''Figure 3. Loss of ATWs_k (k=20) as a function of the number of CV tests (mean, median, 25% and 75% quartiles).''

'''Algorithm trace.''' It is interesting to trace the iterations carried out for one particular dataset. Table 2 shows the details for method AT1, where abalone represents the new dataset. Column 1 shows the number of the iteration (thus the number of CV tests). Column 2 shows the most promising competitor a_k chosen in each step. Column 3 shows the index of a_k in our initial ranking CA, and column 4 the index of a_best, the new best algorithm after the CV test has been performed. As such, if the values in columns 3 and 4 are the same, then the most promising competitor has won the duel. For instance, in step 2, SMO.C.1.0.Polynomial.E.3, i.e. SVM with complexity constant set to 1.0 and a 3rd degree polynomial kernel (index 96), has been identified as the best competitor to be used (column 2), and after the CV test it has won against Bagging.I.75..100.PART, i.e. Bagging with a high number of iterations (between 75 and 100) and PART as a base-learner. As such, it wins this round and becomes the new a_best. Columns 5 and 6 show the actual rank of the competitor and the winner on the abalone dataset. Column 7 shows the loss compared to the optimal algorithm, and the final column shows the number of datasets whose similarity measure is 1.

We observe that after only 12 CV tests, the method has identified an algorithm with a very small loss of 0.2%: Bagging.I.25..50.MultilayerPerceptron, i.e. Bagging with relatively few iterations but with a MultiLayerPerceptron base-learner. Incidentally, this dataset appears to represent a quite atypical problem: the truly best algorithm, SMO.C.1.0.RBF.G.20, i.e. SVM with an RBF kernel with kernel width (gamma) set to 20, is ranked globally as algorithm 246 (of all 292). AT1 identifies it after 177 CV tests.

'''Table 2.''' Trace of the steps taken by AT1 in the search for the supposedly best algorithm for the abalone dataset.

{| class="wikitable"
! CV test !! Algorithm used (current best competitor, a_k) !! CA a_k !! CA new a_best !! abalone a_k !! abalone new a_best !! Loss (%) !! D size
|-
| 1 || Bagging.I.75..100.PART || 1 || 1 || 75 || 75 || 1.9 || 75
|-
| 2 || SMO.C.1.0.Polynomial.E.3 || 96 || 96 || 56 || 56 || 1.6 || 29
|-
| 3 || AdaBoostM1.I.10.MultilayerPerceptron || 92 || 92 || 47 || 47 || 1.5 || 34
|-
| 4 || Bagging.I.50..75.RandomForest || 15 || 92 || 66 || 47 || 1.5 || 27
|-
| ... || ... || ... || ... || ... || ... || ... || ...
|-
| 10 || LMT || 6 || 6 || 32 || 32 || 1.1 || 45
|-
| 11 || LogitBoost.I.10.DecisionStump || 81 || 6 || 70 || 32 || 1.1 || 51
|-
| 12 || Bagging.I.25..50.MultilayerPerceptron || 12 || 12 || 2 || 2 || 0.2 || 37
|-
| 13 || LogitBoost.I.160.DecisionStump || 54 || 12 || 91 || 2 || 0.2 || 42
|-
| ... || ... || ... || ... || ... || ... || ... || ...
|-
| 177 || SMO.C.1.0.RBF.G.20 || 246 || 246 || 1 || 1 || 0 || 9
|}

===5 Related Work in Other Scientific Areas===

In this section we briefly cover some work in other scientific areas which is related to the problem tackled here and could provide further insight into how to improve the method.

One particular area is experiment design [6] and in particular active learning. As discussed before, the method described here follows the main trends that have been outlined in this literature. However, there is relatively little work on active learning for ranking tasks. One notable exception is [15], who use the notion of Expected Loss Optimization (ELO). Another work in this area is [4], whose aim was to identify the most interesting substances for drug screening using a minimum number of tests. In the experiments described, the authors focused on the top-10 substances. Several different strategies were considered and evaluated. Our problem here is not ranking, but rather simply finding the best item (algorithm), so this work is only partially relevant.

Another relevant area is the so-called multi-armed bandit problem (MAB) studied in statistics and machine learning [9, 16]. This problem is often formulated in a setting that involves a set of traditional slot machines. When a particular lever is pulled, a reward is provided from a distribution associated with that specific lever. The bandit problem is formally equivalent to a one-state Markov decision process. The aim is to minimize regret after T rounds, defined as the difference between the reward sum associated with an optimal strategy and the sum of the collected rewards. Pulling a lever can be compared to carrying out a CV test on a given algorithm. However, there is one fundamental difference between MAB and our setting: whereas in MAB the aim is to maximize the sum of collected rewards, our aim is to maximize one reward, i.e. the reward associated with identifying the best algorithm. So again, this work is only partially relevant.

To the best of our knowledge, no other work in this area has addressed the issue of how to select a suitable algorithm from a large set of candidates.
===6 Significance and Impact===

In this paper we have addressed the problem of selecting the best classification algorithm for a specific task. We have introduced a new method, called active testing, that exploits information concerning past evaluation results (metadata) to recommend the best algorithm using a limited number of tests on the new dataset.

Starting from an initial ranking of algorithms on previous datasets, the method runs additional CV evaluations to test several competing algorithms on the new dataset. However, the aim is to reduce the number of tests to a minimum. This is done by carefully selecting which tests should be carried out, using the information of both past and present algorithm evaluations represented in the form of relative landmarks.

In our view this method incorporates several innovative features. First, it is an iterative process that uses the information in each CV test to find the most promising next test based on a history of prior 'algorithm duels'. In a tournament-style fashion, it starts with a current best (parameterized) algorithm, selects the most promising rival algorithm in each round, evaluates it on the given problem, and eliminates the algorithm that performs worse. Second, it continually focuses on the most similar prior datasets: those where the algorithm duels had a similar outcome to those on the new dataset.

Several variants of this basic approach, differing in their definition of dataset similarity, were investigated in a very extensive experimental setup involving 292 algorithm-parameter combinations on 76 datasets. Our experimental results show that particularly versions ATWs and AT1 provide good recommendations using a small number of CV tests. When plotting the median loss as a function of the number of CV tests (Fig. 1), both outperform all other variants and baseline methods. They also outperform AT0, indicating that dataset similarity is an important aspect. We also see that after only 10 CV tests (representing about 3% of all possible tests), the median loss is less than 0.5%. If we continue to 60 tests (about 20% of all possible tests), the median loss is near 0. Similar trends can be observed when considering mean loss.

The results support the hypothesis formulated at the outset of our work: relative landmarks are indeed informative and can be used to suggest the best contender. If this procedure is applied iteratively, it can accurately recommend a classification algorithm after a very limited number of CV tests.

Still, we believe that the results could be improved further. Classical information-theoretic measures and/or sampling landmarks could be incorporated into the process of identifying the most similar datasets. This could lead to further improvements and forms part of our future plans.

===ACKNOWLEDGEMENTS===

This work is funded (or part-funded) by the ERDF European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project FCOMP-01-0124-FEDER-022701.
===REFERENCES===

* [1] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.
* [2] B. Pfahringer, H. Bensussan, and C. Giraud-Carrier, 'Meta-learning by landmarking various learning algorithms', in Proceedings of the 17th Int. Conf. on Machine Learning (ICML-2000), Stanford, CA, (2000).
* [3] P. Brazdil, C. Soares, and J. Costa, 'Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results', Machine Learning, 50, 251–277, (2003).
* [4] K. De Grave, J. Ramon, and L. De Raedt, 'Active learning for primary drug screening', in Proceedings of Discovery Science. Springer, (2008).
* [5] J. Demsar, 'Statistical comparisons of classifiers over multiple data sets', The Journal of Machine Learning Research, 7, 1–30, (2006).
* [6] V. Fedorov, Theory of Optimal Experiments, Academic Press, New York, 1972.
* [7] Y. Freund, H. Seung, E. Shamir, and N. Tishby, 'Selective sampling using the query by committee algorithm', Machine Learning, 28, 133–168, (1997).
* [8] Johannes Fürnkranz and Johann Petrak, 'An evaluation of landmarking variants', in Proceedings of the ECML/PKDD Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning (IDDM-2001), pp. 57–68. Springer, (2001).
* [9] J. Gittins, 'Multi-armed bandit allocation indices', in Wiley Interscience Series in Systems and Optimization, John Wiley & Sons, Ltd., (1989).
* [10] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten, 'The WEKA data mining software: an update', SIGKDD Explor. Newsl., 11(1), 10–18, (2009).
* [11] H. Blockeel, 'Experiment databases: A novel methodology for experimental research', in Lecture Notes in Computer Science 3933. Springer, (2006).
* [12] J. Fürnkranz and J. Petrak, 'An evaluation of landmarking variants', in Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, eds., C. Carrier, N. Lavrac, and S. Moyle, (2001).
* [13] R. Leite and P. Brazdil, 'Predicting relative performance of classifiers from samples', in ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pp. 497–503, New York, NY, USA, (2005). ACM Press.
* [14] Rui Leite and Pavel Brazdil, 'Active testing strategy to predict the best classification algorithm via sampling and metalearning', in Proceedings of the 19th European Conference on Artificial Intelligence - ECAI 2010, (2010).
* [15] B. Long, O. Chapelle, Y. Zhang, Y. Chang, Z. Zheng, and B. Tseng, 'Active learning for ranking through expected loss optimization', in Proceedings of SIGIR'10. ACM, (2010).
* [16] A. Mahajan and D. Teneketzis, 'Multi-armed bandit problems', in Foundations and Applications of Sensor Management, eds., D. A. Castanon, D. Cochran, and K. Kastella, Springer-Verlag, (2007).
* [17] D. Michie, D.J. Spiegelhalter, and C.C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994.
* [18] Tom M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
* [19] John R. Rice, 'The algorithm selection problem', volume 15 of Advances in Computers, 65–118, Elsevier, (1976).
* [20] Kate A. Smith-Miles, 'Cross-disciplinary perspectives on meta-learning for algorithm selection', ACM Comput. Surv., 41(1), 1–25, (2008).
* [21] Carlos Soares, Johann Petrak, and Pavel Brazdil, 'Sampling-based relative landmarks: Systematically test-driving algorithms before choosing', in Proceedings of the 10th Portuguese Conference on Artificial Intelligence (EPIA 2001), pp. 88–94. Springer, (2001).
* [22] J. Vanschoren and H. Blockeel, 'A community-based platform for machine learning experimentation', in Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, volume LNCS 5782, pp. 750–754. Springer, (2009).
* [23] Ricardo Vilalta and Youssef Drissi, 'A perspective view and survey of meta-learning', Artif. Intell. Rev., 18(2), 77–95, (2002).