<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Algorithm Selection via Meta-learning and Sample-based Active Testing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salisu Mamman Abdulrahman</string-name>
          <email>salisu.abdul@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Brazdil</string-name>
          <email>pbrazdil@inesctec.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan N. van Rijn</string-name>
          <email>j.n.van.rijn@liacs.leidenuniv.nl</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joaquin Vanschoren</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eindhoven University of Technology</institution>
          ,
          <addr-line>Eindhoven</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>FEP, University of Porto</institution>
          ,
          <addr-line>Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LIAAD-INESC Tec</institution>
          ,
          <addr-line>Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Leiden University</institution>
          ,
          <addr-line>Leiden</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Identifying the best machine learning algorithm for a given problem continues to be an active area of research. In this paper we present a new method which exploits both meta-level information acquired in past experiments and active testing, an algorithm selection strategy. Active testing attempts to iteratively identify an algorithm whose performance will most likely exceed the performance of previously tried algorithms. The novel method described in this paper uses tests on small data samples to rank the most promising candidates, thus optimizing the schedule of experiments to be carried out. The experimental results show that this approach leads to considerably faster algorithm selection.</p>
      </abstract>
      <kwd-group>
        <kwd>Algorithm selection</kwd>
        <kwd>Meta-learning</kwd>
        <kwd>Active testing</kwd>
        <kwd>Algorithm Ranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>A large number of data mining algorithms exist, rooted in the fields of machine
learning, statistics, pattern recognition, artificial intelligence, and database
systems, which are used to perform different data analysis tasks on large volumes
of data. The task of recommending the most suitable algorithm has thus become
rather challenging. Moreover, the problem is exacerbated by the fact that it is
necessary to consider different combinations of parameter settings, or the
constituents of composite methods such as ensembles.</p>
      <p>
        The algorithm selection problem, originally described by Rice [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], has
attracted a great deal of attention, as researchers endeavour to select and apply the best or
near-best algorithm(s) for a given task [
        <xref ref-type="bibr" rid="ref19 ref4">4, 19</xref>
        ]. The algorithm selection problem
can be cast as a learning problem: the aim is to learn a model that captures the
relationship between the properties of the datasets, or meta-data, and the
algorithms, in particular their performance. This model can then be used to predict
the most suitable algorithm for a given new dataset.
      </p>
      <p>
        This paper presents a new method, which builds on ranking approaches for
algorithm selection [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] in that it exploits meta-level information acquired in
past experiments. Moreover, the proposed method combines this information
with an algorithm selection strategy known as active testing [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The aim of
active testing is to iteratively select and evaluate a candidate algorithm whose
performance will most likely exceed the performance of previously tested
algorithms.
      </p>
      <p>
The method described in this paper differs from earlier approaches in
various aspects. First, two methods presented earlier [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] were combined, and
adapted to optimize a multi-objective measure, called A3R, that combines
predictive accuracy and time. The first method is known as an average ranking
method, as it calculates an average ranking for all algorithms over all datasets.
The upgrade here consists of using A3R as the measure to optimize, rather than
simply accuracy. The second method uses fast sample-based tests instead of full
cross-validation (CV) tests to identify the most promising candidates by
evaluating them on small data samples. Again, we will use A3R, instead of accuracy,
in the process of identifying the best competitor of the current alternative.
      </p>
      <p>Fast sample-based tests allow selecting a good algorithm in less time, but are
less reliable. This needs to be taken into account in the design of the method.
One further contribution of this work is to show how the sample-based tests
should be integrated into the construction of the ranking.</p>
      <p>
        Finally, the experimental results are presented in the form of loss-time curves,
where time is represented on a log scale. This representation is very useful in
the evaluation of rankings representing schedules, as was shown in recent
findings [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>The remainder of this paper is organized as follows. In the next section we
present an overview of work in related areas. Section 3 is dedicated to the
description of our new method of active testing. We explain how it relates to earlier
proposals. Section 4 presents the empirical evaluation of the newly proposed
method. The final section presents conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In this paper we are addressing a particular case of the algorithm selection
problem [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], oriented towards the selection of classification algorithms. Various
researchers have addressed this problem over the course of the last 25 years. One very
common approach, which could now be considered "the classical approach", uses
a set of measures to characterize datasets and establish their relationship to
algorithm performance. This information is often referred to as meta-data.
      </p>
      <p>
        The meta-data typically includes a set of simple measures, statistical
measures, information-theoretic measures and/or the performance of simple
algorithms referred to as landmarkers [
        <xref ref-type="bibr" rid="ref14 ref19 ref4">4, 14, 19</xref>
        ]. The aim is to obtain a model that
characterizes the relationship between the given meta-data and the performance
of algorithms evaluated on these datasets. This model can then be used to
predict the most suitable algorithm for a given new dataset, or alternatively, provide
a ranking of algorithms, ordered by their suitability for the task at hand. Many
studies conclude that ranking is in fact better, as it enables the user to iteratively
test the top candidates and identify the algorithms most suitable in practice. This
strategy is sometimes referred to as the Top-N strategy [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The Top-N strategy has the disadvantage that it is unable to leverage what
is learned from previous evaluations. For instance, if the top algorithm performs
worse than expected, this may tell us something about the given dataset that
can be used to update the ranking. Indeed, very similar algorithms are now also
likely to perform worse than expected. This led researchers to investigate an
alternative testing strategy, known as active testing [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This strategy intelligently
selects the most useful cross-validation tests using the concept of relative
landmarkers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These landmarkers estimate the relative probability that a particular
algorithm will outperform the current best candidate.
      </p>
      <p>
        The method presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] exploits partial learning curves created on
small samples of the data, as suggested by the authors of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. It makes pairwise
algorithm comparisons and represents the results in the form of a partially
ordered ranking. The method can be evaluated by comparing the predicted partial
order of algorithms to the actual partial order, representing the gold
standard obtained by exhaustive testing. An extension to this method was presented
in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It relies on earlier performed cross-validation tests to calculate relative
landmarkers. The authors showed that this led to better results than traditional
top-N ranking strategies.
      </p>
<p>The novel aspect of the method described in this paper is the use of relatively
fast sample-based tests to reorder the algorithms in the top-N ranking. Using
tests on small data samples represents a trade-off, in that they lead to less
accurate performance estimates. Indeed, tests carried out on small data samples
are less reliable, and thus the best algorithms may sometimes not be identified.
Hence, it is difficult to say a priori which variant would be better. This is one of
the motivations to investigate this issue.</p>
      <p>
        Active testing is somewhat related to experiment design [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and also to active
learning. However, there has been relatively little work on active learning for
algorithm selection. One notable exception is [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], who use the notion of Expected
Loss Optimization (ELO). Another application in this area is [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], whose aim was
to identify the most interesting substances for drug screening using a minimum
number of tests. In these experiments, the authors have focused on the top-10
substances. Several different strategies were considered and evaluated.
      </p>
      <p>
        Another notable active learning approach to meta-learning was presented
in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], where the authors use active learning to support the selection of
informative examples. A prototype was implemented using the k-NN algorithm as
a meta-learner, and a certainty-based method was used for active learning. The
prototype was evaluated in two different case studies, and the results obtained
by the active learning method were generally better than a random method for
selecting meta-examples.
      </p>
<p>In this paper, we attribute particular importance to the tests on the new
dataset. Our aim is to propose a method that minimizes the time before the best
(or near best) algorithm is identified.</p>
    </sec>
    <sec id="sec-3">
      <title>Active Sample-based Testing (ASbT) Method</title>
      <p>We propose a new method, called Active Sample-based Testing (ASbT), which
is detailed in Algorithm 1. It exploits meta-level information acquired from past
experiments. This information includes the performance evaluations of various
machine learning algorithms, on prior datasets, over multiple performance
measures (e.g., accuracy, AUC, time). Moreover, we also use data samples of various
sizes to evaluate algorithms. Further details concerning the meta-level
information are provided below.</p>
      <p>
        This information enables us to construct an average ranking of algorithms
(line 4). The term average ranking is used here to indicate that it is averaged over
all previously seen datasets. The average ranking can be followed by the user
to identify the best performing algorithm, i.e., by performing a cross-validation
test on all algorithms in the order of the ranking. This strategy was referred to
as the Top-N strategy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and constitutes a baseline for other more competitive
approaches.
      </p>
      <p>
        Our method also exploits the active testing strategy [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, in the
ASbT approach, we use faster sample-based tests to identify competitive
algorithms (line 5). In our experiments, these algorithms are also evaluated on the
full dataset in order to construct a loss curve. To evaluate this method,
experiments are carried out in a leave-one-out fashion. One dataset at a time is left out
to evaluate our approach. The complete method, including evaluation, is
summarized in Algorithm 1. The following sections discuss key parts of the method
in more detail.
      </p>
<p>Algorithm 1 Active sample-based testing (ASbT)
1: Identify datasets Ds and algorithms
2: for all Di in Ds do
3:   {Leave-one-out cycle; Di represents Dnew}
4:   Construct the average ranking
5:   Carry out sample-based active testing and evaluate the recommended algorithms
6:   Return a loss curve
7: end for
8: Aggregate the loss curves for all Di in Ds
Return: Mean loss curve</p>
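      <p>For illustration, the leave-one-out loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not our implementation; the helper functions build_average_ranking and run_active_testing are placeholders for the steps of lines 4 and 5, which are detailed in the following subsections.</p>
      <preformat>
from typing import Callable, List, Sequence, Tuple

LossCurve = List[Tuple[float, float]]  # (elapsed time in seconds, loss) points

def asbt_evaluate(
    datasets: Sequence[str],
    build_average_ranking: Callable[[Sequence[str]], List[str]],
    run_active_testing: Callable[[List[str], str], LossCurve],
) -> List[LossCurve]:
    """Leave-one-out loop of Algorithm 1 (sketch).

    For each dataset D_i, an average ranking is built on the remaining
    datasets (line 4, Section 3.1), sample-based active testing is run on
    D_i (line 5, Section 3.2), and the resulting loss curve is collected
    so that the mean loss curve can be computed afterwards (line 8).
    """
    curves = []
    for d_new in datasets:
        past = [d for d in datasets if d != d_new]          # leave D_new out
        ranking = build_average_ranking(past)               # line 4
        curves.append(run_active_testing(ranking, d_new))   # lines 5-6
    return curves
      </preformat>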
      <sec id="sec-3-1">
        <title>Constructing the average ranking</title>
        <p>
          The challenge here is to order the algorithms in the form of a top-N ranking.
The underlying assumption is that the rankings obtained on past problems will
transfer to new problems. Among the many popular ranking criteria we find, for
instance, average ranks, success rates and significant wins [
          <xref ref-type="bibr" rid="ref10 ref3 ref5">3, 5, 10</xref>
          ].
        </p>
        <p>
          In this work we use average ranks, inspired by Friedman's M statistic [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
For each dataset, the algorithms are ordered according to the chosen performance
measure (e.g., predictive accuracy) and assigned ranks accordingly: the best
algorithm is assigned rank 1, the runner-up rank 2, and so on. Let r_i^j
be the rank of algorithm i on dataset j. The average rank for each algorithm is
obtained using
\bar{r}_i = \frac{1}{D} \sum_{j=1}^{D} r_i^j \qquad (1)
where D is the number of datasets. The final ranking is obtained by ordering
the average ranks and assigning ranks to the individual algorithms accordingly.
        </p>
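        <p>A minimal Python sketch of this computation is shown below, assuming accuracy as the performance measure and breaking ties arbitrarily (a full implementation would assign tied algorithms their mean rank, as in Friedman's test):</p>
        <preformat>
from collections import defaultdict

def average_ranking(performance):
    """Average ranking of Equation (1).

    `performance` maps dataset -> {algorithm: accuracy}.  On each dataset
    the best algorithm receives rank 1, the runner-up rank 2, and so on;
    the average rank of each algorithm is then computed over all datasets.
    """
    rank_sums = defaultdict(float)
    for d, scores in performance.items():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, alg in enumerate(ordered, start=1):
            rank_sums[alg] += rank
    avg = {alg: s / len(performance) for alg, s in rank_sums.items()}
    return sorted(avg, key=avg.get)  # lowest average rank first

# Toy example: three algorithms on two datasets.
perf = {"d1": {"a": 0.91, "b": 0.88, "c": 0.80},
        "d2": {"a": 0.70, "b": 0.75, "c": 0.72}}
print(average_ranking(perf))  # ['b', 'a', 'c']
        </preformat>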
        <p>The average ranking is used here both as a baseline against which we can
compare other methods, and as an input to the proposed algorithm (line 5), as
discussed below.
</p>
      </sec>
      <sec id="sec-3-2">
        <title>Active Testing on Data Samples</title>
        <p>Do we really need a full cross-validation test to establish a good sequence of
algorithms to test? In this section we discuss a novel method that also follows an
active testing strategy, but uses sample-based tests instead of full cross-validation
tests. Hence, instead of deciding whether a candidate algorithm ac is better than
the currently best algorithm abest using a full cross-validation test, we perform
a test on a smaller sample of the new dataset. This is motivated by the fact that
a sample-based test is much faster than a full cross-validation test.</p>
        <p>
          However, as the sample-based test only yields a proxy for the actual
performance, it may identify an apparent winner that differs from the one selected by
the full cross-validation test. Hence, if a candidate algorithm beats the currently
best algorithm on the sample-based test, we additionally carry out an evaluation
using a full cross-validation test. This approach differs from [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] in three key
aspects: the use of small data samples as proxies, the use of the A3R
criterion that combines accuracy and run time (see below), and the fact that
we only occasionally run a full cross-validation test. Algorithm 2 shows this method
in detail.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluating the returned ranking</title>
<p>To evaluate the quality of the returned ranking, the whole process is repeated
in a leave-one-out fashion for all datasets Di belonging to the set Ds (line 2 of
Algorithm 1). For each generated algorithm ranking we generate a loss curve that
plots the loss in accuracy (see below) versus the time spent on the evaluation of
the algorithms using full cross-validation tests. Finally, all individual loss curves
are aggregated into a mean loss curve.</p>
        <p>Algorithm 2 Active testing with sample-based tests
Require: Dataset Di (representing Dnew), Ds; average ranking for Di
1: Use the average ranking of Di to identify the topmost algorithm and initialize abest; obtain the performance of abest on dataset Di using a full CV test
2: Find the most promising competitor ac of abest using relative landmarkers (previous test results on past datasets)
3: Obtain the performance of ac on the new dataset using a sample-based test; record the accuracy and time of the test to compute A3R
4: Compare the performance of ac with abest and use the winner as the current best algorithm abest; if the current best algorithm has changed in this step, carry out an evaluation of the new abest using a full CV test
5: Repeat the whole process starting with step 2 until reaching a stopping criterion
Return: Loss curve</p>
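        <p>The loop of Algorithm 2 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: find_competitor, sample_test and cv_test are assumed helpers standing in for the relative landmarkers, the sample-based test and the full CV test; performances are compared on the sample-based estimates; and the stopping criterion is reduced to exhausting the candidate list.</p>
        <preformat>
def active_testing_with_samples(ranking, find_competitor, sample_test, cv_test):
    """Sketch of Algorithm 2 (not the authors' implementation).

    ranking         -- average ranking of the algorithms (step 1 uses its top)
    find_competitor -- picks the most promising untested competitor of a_best
                       using relative landmarkers (step 2)
    sample_test     -- performance of an algorithm on a small data sample (step 3)
    cv_test         -- performance of an algorithm with a full CV test
    Returns the sequence of (algorithm, cv_performance) pairs that defines
    the loss curve.
    """
    a_best = ranking[0]
    curve = [(a_best, cv_test(a_best))]        # step 1: full CV on the top algorithm
    best_sample, tested = sample_test(a_best), {a_best}
    for _ in range(len(ranking) - 1):          # step 5 simplified: consider every candidate
        a_c = find_competitor(a_best, tested)  # step 2: relative landmarkers
        tested.add(a_c)
        perf_c = sample_test(a_c)              # step 3: fast sample-based test
        if perf_c > best_sample:               # step 4: the competitor wins on the sample
            a_best, best_sample = a_c, perf_c
            curve.append((a_best, cv_test(a_best)))   # confirm with a full CV test
    return curve
        </preformat>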
        <p>
          Loss-time curves In a typical loss curve, the x-axis represents the number
of cross-validation tests and the y-axis shows the loss in accuracy with respect
to the ideal ranking. Loss is defined as the difference in accuracy between
the best classifier evaluated so far and the actual best classifier [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. As tests
proceed following the Top-N strategy, the loss either maintains its value, or
decreases when the newly selected algorithm improves upon the previously selected
algorithms. Each test represents the result of a full cross-validation evaluation
on one dataset. However, some algorithms are much slower learners than others
(sometimes by orders of magnitude), and these simple loss curves do not capture
this.
        </p>
        <p>This is why, in this article, we take into account the actual time required to
evaluate each algorithm and update the loss curve accordingly. We will refer to
this type of curve as a loss versus time curve, or loss-time curve for short.</p>
        <p>As train/test times include both very small and very large numbers, it is
natural to use the logarithm of the time, instead of the actual time. This has
the effect that equal time intervals appear shorter as we shift further
along the time axis. This graphical representation is advantageous if we wish
to give more importance to the initial items in the ranking.</p>
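        <p>As an illustration, a loss-time curve can be assembled from a sequence of full CV tests as in the following sketch (assuming loss is measured in accuracy and run times are in seconds):</p>
        <preformat>
import math

def loss_time_curve(cv_results, best_accuracy):
    """Build a loss-time curve from (runtime_seconds, accuracy) CV results.

    Each point is (log10 of the total elapsed time, loss), where the loss is
    the difference between the best accuracy achievable on the dataset and
    the best accuracy found so far.
    """
    elapsed, best_so_far, curve = 0.0, 0.0, []
    for runtime, accuracy in cv_results:
        elapsed += runtime
        best_so_far = max(best_so_far, accuracy)
        curve.append((math.log10(elapsed), best_accuracy - best_so_far))
    return curve

# Example: three CV tests; the best accuracy reachable on this dataset is 0.95.
print(loss_time_curve([(10, 0.90), (300, 0.93), (5000, 0.95)], 0.95))
        </preformat>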
<p>To evaluate our algorithm (ASbT), we need to carry out tests on different
datasets in a leave-one-out fashion and construct the mean loss-time curve by
aggregating the individual loss-time curves.</p>
      </sec>
      <sec id="sec-3-4">
        <title>A3R: combining Accuracy and Run time</title>
        <p>
          In many situations, we have a preference for algorithms that are fast and also
achieve high accuracy. However, the question is whether such a preference would
lead to better loss-time curves. To investigate this, we have adopted a
multi-objective evaluation measure, A3R, described in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], that combines both accuracy
and time. The measure A3R is defined as:
        </p>
        <p>A3R^{d_i}_{a_{ref},a_j} = \frac{SR^{d_i}_{a_j} / SR^{d_i}_{a_{ref}}}{\sqrt[N]{T^{d_i}_{a_j} / T^{d_i}_{a_{ref}}}} \qquad (2)</p>
        <p>Here, SR^{d_i}_{a_j} and SR^{d_i}_{a_{ref}} represent the success rates (accuracies) of algorithms
a_j and a_{ref} on dataset d_i, where a_{ref} represents a given reference algorithm.
Similarly, T^{d_i}_{a_j} and T^{d_i}_{a_{ref}} represent the run times of the algorithms, in seconds. To
trade off the importance of time, A3R includes the Nth root parameter. This is
motivated by the observation that run times vary much more than accuracies.
It is not uncommon that one particular algorithm is three orders of magnitude
slower (or faster) than another one. Obviously, we do not want the time ratios
to dominate the equation. If we take the Nth root of the run-time ratio, we get a
number that goes to 1 in the limit.</p>
        <p>For instance, if we used N = 256, an algorithm that is 1,000 times slower
would yield a denominator of 1.027. It would thus be equivalent to the faster
reference algorithm only if its accuracy were 2.7% higher than that of the reference
algorithm. Table 1 shows how a ratio of 1,000 (one algorithm being 1,000 times slower
than the reference algorithm) is reduced for increasing values of N. As N gets
higher, time is given less and less importance.</p>
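        <p>A direct transcription of Equation (2) into Python, together with the effect of the Nth root on a run-time ratio of 1,000 (the values of N below are illustrative, not necessarily those of Table 1):</p>
        <preformat>
def a3r(sr_j, sr_ref, t_j, t_ref, n):
    """A3R of Equation (2): the success-rate (accuracy) ratio divided by
    the Nth root of the run-time ratio."""
    return (sr_j / sr_ref) / (t_j / t_ref) ** (1.0 / n)

# How a run-time ratio of 1,000 is shrunk by the Nth root as N grows:
for n in (4, 16, 64, 256):
    print(n, round(1000 ** (1.0 / n), 3))
# 4 -> 5.623, 16 -> 1.54, 64 -> 1.114, 256 -> 1.027
        </preformat>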
        <p>The performance measure A3R can be used to rank a given set of algorithms
on a particular dataset in much the same way as accuracy alone. Hence,
the average rank method described earlier can easily be upgraded to generate
a time-aware average ranking: the A3R-based average ranking. The method of
active testing with sample-based tests can also be easily updated. We note that
the algorithm shown in Algorithm 2 mentions performance in steps 3 and 4. If
we use A3R instead of accuracy, we obtain a multi-objective variant of the active
testing method.</p>
        <p>Obviously, we can expect somewhat different results for each particular choice
of N. Which value of N will lead to the best results in loss-time space? Another
important question is whether the use of A3R is beneficial when compared to
the approach that only uses accuracy. These questions are answered empirically
in the next section.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        This section describes the empirical evaluation of the proposed method. We
have constructed a dataset from evaluation results retrieved from OpenML [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
a collaborative science platform for machine learning. This dataset contains the
results of 53 parameterized learning algorithms from the Weka workbench [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
on 39 datasets1. Section 4.1 evaluates our proposed method independently of
the A3R criterion, thus solely using accuracy. Section 4.2 explores the further
impact of the A3R criterion, combining accuracy and run time.
      </p>
      <sec id="sec-4-1">
        <title>Active Sample-based Testing using Accuracy</title>
        <p>Figure 1 presents our results in the form of loss-time curves, with time
represented on a log scale. First, it shows the loss curve of the average ranking method
(AvgRank-ACC), which uses the Top-N strategy when carrying out the tests.</p>
        <p>
          The figure also includes the loss-time curve of one predecessor method that
uses active testing with full cross-validation tests (ATCV-ACC). This is ASbT
in the extreme, using the full dataset instead of a smaller sample. In the earlier
paper [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] this method was referred to as AT0.
        </p>
<p>Finally, the figure shows the loss-time curves of three variants of the
active sample-based testing method presented here. These variants use different
sample sizes for the sample-based testing. Here we use the notation
ASbT-4-ACC to refer to a variant of active sample-based testing that uses accuracy on
sample number 4 when conducting tests. Similarly, ASbT-5-ACC uses sample
number 5 in testing. The sample sizes grow geometrically following the equation
2^{(5.5 + 0.5n)}, where n represents the sample number. So, the sizes of the first few
samples, after rounding, are 64, 91, 128, 181, 256 and 362 examples. Hence, sample
number 4 includes 181 data points.</p>
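        <p>For instance, the sample sizes quoted above can be reproduced with the following one-liner (the sample numbering is assumed to start at 1):</p>
        <preformat>
# Sample sizes grow geometrically: size(n) = 2 ** (5.5 + 0.5 * n), rounded.
print([round(2 ** (5.5 + 0.5 * n)) for n in range(1, 7)])
# [64, 91, 128, 181, 256, 362]
        </preformat>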
        <p>In Figure 1, log10 time is used on the x-axis. The values on the x-axis represent
the time that has elapsed as we repeatedly conduct cross-validation tests.</p>
<p>What can we conclude from this figure? The first observation is that all
three variants of the sample-based active testing method compete
quite well with the (more complex) predecessor method (ATCV-ACC) that uses
full CV tests to identify the best competitor. Both approaches beat the
simple average ranking method (AvgRank-ACC). In Section 4.2, we investigate
the effect of adopting the A3R measure instead of accuracy in the algorithm
selection problem.
1 Full details: http://www.openml.org/project/tag/ActiveTestingSamples/u/1</p>
        <p>[Figure 1: Mean loss-time curves; x-axis: log10 of time, y-axis: accuracy loss (%); curves: AvgRank-ACC, ATCV-ACC, ASbT-4-ACC, ASbT-5-ACC and ASbT-6-ACC.]</p>
        <p>Early stopping criteria The main active testing algorithm (ATCV) described
in Algorithm 2 includes a stopping criterion: when the probability that a new
candidate algorithm ac will win over the currently best algorithm abest becomes
too small, it will not be considered. Since ASbT derives from ATCV, we can use
the same criterion, and we have empirically evaluated its effect using ASbT-4.
To do this, we need to define a minimal improvement in the sum of the relative
landmarkers, which is a parameter of the method. This was set to 0.02 (2%), and
the effect was that the algorithm stopped after about half of the algorithms were
tested. The results of this study show that this variant is only slightly better, and
overall not much different. It does, however, manage to skip less promising
algorithms early on.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Active Sample-based Testing using Accuracy and Time</title>
<p>Our next aim in this paper is to analyze the effects of adopting A3R as a
performance measure within the methods presented earlier. The first objective is
to analyze the effects on the average ranking method. The second objective is
to examine the effects of adopting A3R within the active testing strategy that
uses sample-based tests. In each case we seek the best solution by tuning the
value of the parameter N of A3R. The third objective is to compare the proposed
method with one predecessor method that first generates a revised ranking before
conducting the evaluation.</p>
        <p>Figure 2 shows the first set of results. Note that the upgraded average ranking
method (AvgRank-A3R) has an excellent performance when compared to the
average ranking method that uses only accuracy as a measure (AvgRank-ACC).</p>
        <p>[Figure 2: Mean loss-time curves; x-axis: log10 of time (from about 1e+1 to 1e+7 seconds), y-axis: accuracy loss (%); curves: AvgRank-ACC, AvgRank-A3R, ASbT-4-ACC-R, ASbT-4-ACC and ASbT-4-A3R.]</p>
        <p>Hence, the new average ranking method represents a useful new algorithm that
can be exploited in practice for the selection of algorithms.</p>
        <p>Next, let us analyze the effects of adopting A3R within the active testing
strategy that uses sample-based tests. As the experiments described in the
previous section have shown little difference between using sample
number 4, 5 or 6, we have simply opted for the variant that uses the smallest
sample, number 4 (ASbT-4-A3R).</p>
        <p>We compare our new sample-based active testing method ASbT-4-A3R that
uses A3R in the selection process with ASbT-4-ACC, which represents the method
that uses accuracy instead. The results show that our method ASbT-4-A3R does
not beat the previous method (ASbT-4-ACC).</p>
        <p>For completeness, we also include another variant, ASbT-4-ACC-R. This
variant works by first re-ranking all algorithms by evaluating them on a small
sample of the new dataset. Hence, it starts later than the other algorithms, on
average 500 seconds later, because it needs to run this process first. Then, in a
second phase, it conducts the final evaluation using full CV tests. Merging the
two phases, as is done in ASbT-4-ACC, results in lower losses sooner.</p>
        <p>The average ranking method (AvgRank-A3R), which represents a much
simpler method than the sample-based variant, achieves surprisingly good results,
which warrants further investigation.</p>
        <p>One possible explanation is that our meta-dataset is perhaps too simple,
including only 53 algorithms with default parameters. In future work we will investigate
the effect of adopting the A3R measure in the ASbT method while using more
algorithms (on the order of hundreds).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have described two novel algorithm selection methods. The first method uses
fast sample-based tests to identify the most promising candidates, as its aim is
to intelligently select the next algorithm to test on the new dataset. The second
method is a rather simple one. It calculates an average ranking for all algorithms,
but uses A3R as the measure to rank on. The novelty here lies in the use of A3R,
instead of just accuracy.</p>
      <p>The experimental results are presented in the form of loss-time curves. Since
exhaustive testing is not always practically feasible, we focus on the behavior
of the algorithm within a smaller time interval. This is achieved by using loss
curves with log10 of time on the x-axis. This representation stresses the losses
at the beginning of the curve, corresponding to the initial tests.</p>
      <p>The results show that both methods lead to considerable time savings, when
compared to the previous approaches that exploited just accuracy. The
experimental results suggest that the new version of the average ranking method
represents a very good alternative that could be used in practice.</p>
      <p>The next challenge is to explore how to improve the ASbT method that uses
A3R for selecting the best competitor in active testing, by using more
algorithms (on the order of hundreds).</p>
      <p>Acknowledgments This work has been funded by Federal Government of
Nigeria Tertiary Education Trust Fund under the TETFund 2012 AST&amp;D
Intervention for Kano University of Science and Technology, Wudil, Kano State, Nigeria
for PhD Overseas Training. This work was also partially funded by FCT/MEC
through PIDDAC and ERDF/ON2 within project
NORTE-07-0124-FEDER000059 and through the COMPETE Programme (operational programme for
competitiveness) and by National Funds through the FCT Portuguese
Foundation for Science and Technology within project
FCOMP-01-0124-FEDER037281. We wish to thank Carlos Soares for his useful comments on the method
presented in this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdulrahman</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Measures for Combining Accuracy and Time for Meta-learning</article-title>
          .
          <source>In: Meta-Learning and Algorithm Selection Workshop at ECAI 2014</source>
          . pp.
          <volume>49</volume>
          {
          <issue>50</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A Comparison of Ranking Methods for Classification Algorithm Selection</article-title>
          .
          <source>In: Machine Learning: ECML</source>
          <year>2000</year>
          , pp.
          <volume>63</volume>
          {
          <fpage>75</fpage>
          . Springer (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Da</surname>
            <given-names>Costa</given-names>
          </string-name>
          ,
          <string-name>
            <surname>J.P.</surname>
          </string-name>
          :
          <article-title>Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results</article-title>
          .
          <source>Machine Learning</source>
          <volume>50</volume>
          (
          <issue>3</issue>
          ),
          <volume>251</volume>
          {
          <fpage>277</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giraud-Carrier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilalta</surname>
          </string-name>
          , R.: Metalearning:
          <article-title>Applications to data mining</article-title>
          . Springer Science &amp; Business
          <string-name>
            <surname>Media</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Demšar</surname>
          </string-name>
          , J.:
          <article-title>Statistical Comparisons of Classifiers over Multiple Data Sets</article-title>
          .
          <source>The Journal of Machine Learning Research 7</source>
          ,
          <issue>1</issue>
          {
          <fpage>30</fpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fedorov</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          :
          <article-title>Theory of Optimal Experiments</article-title>
          . Academic Press (
          <year>1972</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Furnkranz, J.,
          <string-name>
            <surname>Petrak</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>An Evaluation of Landmarking Variants</article-title>
          .
          <source>In: Working Notes of the ECML/PKDD 2000 Workshop on Integrating Aspects of Data Mining</source>
          ,
          <article-title>Decision Support and Meta-Learning</article-title>
          . pp.
          <volume>57</volume>
          {
          <issue>68</issue>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. de Grave,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Ramon</surname>
          </string-name>
          , J., de Raedt, L.:
          <article-title>Active Learning for Primary Drug Screening</article-title>
          .
          <source>In: Benelearn</source>
          <volume>08</volume>
          ,
          <string-name>
            <surname>The Annual</surname>
          </string-name>
          Belgian-Dutch
          <source>Machine Learning Conference</source>
          . vol.
          <year>2008</year>
          , pp.
          <volume>55</volume>
          {
          <issue>56</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <volume>10</volume>
          {
          <fpage>18</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Leite</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Active Testing Strategy to Predict the Best Classification Algorithm via Sampling and Metalearning</article-title>
          .
          <source>In: ECAI</source>
          . pp.
          <volume>309</volume>
          {
          <issue>314</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Leite</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanschoren</surname>
          </string-name>
          , J.:
          <article-title>Selecting Classification Algorithms with Active Testing</article-title>
          .
          <source>In: Machine Learning and Data Mining in Pattern Recognition</source>
          , pp.
          <volume>117</volume>
          {
          <fpage>131</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapelle</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tseng</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Active Learning for Ranking through Expected Loss Optimization</article-title>
          .
          <source>In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <volume>267</volume>
          {
          <fpage>274</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Neave</surname>
            ,
            <given-names>H.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worthington</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          :
          <article-title>Distribution-free Tests</article-title>
          . Unwin Hyman London (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bensusan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giraud-Carrier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Tell me who can learn you and I can tell you who you are: Landmarking various learning algorithms</article-title>
          .
          <source>In: Proceedings of the 17th International Conference on Machine Learning</source>
          . pp.
          <volume>743</volume>
          {
          <issue>750</issue>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Provost</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oates</surname>
          </string-name>
          , T.:
          <article-title>Efficient Progressive Sampling</article-title>
          .
          <source>In: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <volume>23</volume>
          {
          <fpage>32</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Prudencio</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ludermir</surname>
          </string-name>
          , T.B.:
          <article-title>Active Selection of Training Examples for Meta-Learning</article-title>
          .
          <source>In: Hybrid Intelligent Systems</source>
          ,
          <year>2007</year>
          .
          <source>HIS</source>
          <year>2007</year>
          . 7th International Conference on. pp.
          <volume>126</volume>
          {
          <fpage>131</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rice</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <source>The Algorithm Selection Problem. Advances in Computers 15</source>
          ,
          <issue>65</issue>
          {
          <fpage>118</fpage>
          (
          <year>1976</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. van Rijn,
          <string-name>
            <given-names>J.N.</given-names>
            ,
            <surname>Abdulrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.M.</given-names>
            ,
            <surname>Brazdil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Vanschoren</surname>
          </string-name>
          , J.:
          <article-title>Fast Algorithm Selection using Learning Curves</article-title>
          .
          <source>In: Advances in Intelligent Data Analysis XIV</source>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Smith-Miles</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          :
          <article-title>Cross-disciplinary Perspectives on Meta-Learning for Algorithm Selection</article-title>
          .
          <source>ACM Computing Surveys (CSUR) 41(1)</source>
          , 6:
          <issue>1</issue>
          {6:
          <issue>25</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Vanschoren</surname>
            , J., van Rijn,
            <given-names>J.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bischl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torgo</surname>
          </string-name>
          , L.:
          <article-title>OpenML: networked science in machine learning</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <volume>49</volume>
          {
          <fpage>60</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>