<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Selecting Classification Algorithms with Active Testing on Similar Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rui Leite</string-name>
          <email>rleite@fep.up.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Brazdil</string-name>
          <email>pbrazdil@inescporto.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joaquin Vanschoren</string-name>
          <email>joaquin@liacs.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIACS - Leiden Institute of Advanced Computer Science, University of Leiden</institution>
          ,
<addr-line>The Netherlands</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Given the large number of data mining algorithms, their combinations (e.g. ensembles) and possible parameter settings, finding the most adequate method to analyze a new dataset has become an ever more challenging task. This is because in many cases testing all possibly useful alternatives quickly becomes prohibitively expensive. In this paper we propose a novel technique, called active testing, that intelligently selects the most useful cross-validation tests. It proceeds in a tournament-style fashion: in each round it selects and tests the algorithm that is most likely to outperform the best algorithm of the previous round on the new dataset. This ‘most promising’ competitor is chosen based on a history of prior duels between both algorithms on similar datasets. Each new cross-validation test contributes information towards a better estimate of dataset similarity, and thus towards better predictions of which algorithms are most promising on the new dataset. We also follow a different path to estimate dataset similarity, based on data characteristics. We have evaluated this approach using a set of 292 algorithm-parameter combinations on 76 UCI datasets for classification. The results show that active testing quickly yields an algorithm whose performance is very close to the optimum, after relatively few tests. It also provides a better solution than previously proposed methods. The variants of our method that rely on cross-validation tests to estimate dataset similarity provide better solutions than those that rely on data characteristics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>In many data mining applications, an important problem is
selecting the best algorithm for the task at hand. Especially in
classification, there are hundreds of algorithms to choose from. Moreover,
these algorithms can be combined into composite learning systems
(e.g. ensembles) and often have many parameters that greatly
influence their performance. This yields a whole spectrum of methods and
their variations, so that testing all possible candidates on the given
problem, e.g., using cross-validation, quickly becomes prohibitively
expensive.</p>
      <p>
        The issue of selecting the right algorithm has been the subject of
many studies over the past 20 years [
        <xref ref-type="bibr" rid="ref17 ref19 ref20 ref23 ref3">17, 3, 23, 20, 19</xref>
        ]. Most
approaches rely on the concept of metalearning. This approach exploits
characterizations of datasets and past performance results of
algorithms to recommend the best algorithm on the current dataset. The
term metalearning stems from the fact that we try to learn the
function that maps dataset characterizations (meta-data) to algorithm
performance estimates (the target variable).
      </p>
      <p>
        The earliest techniques considered only the dataset itself and
calculated an array of various simple, statistical or information-theoretic
properties of the data (e.g., dataset size, class skewness and
signal-noise ratio) [
        <xref ref-type="bibr" rid="ref17 ref3">17, 3</xref>
        ]. Another approach, called landmarking [
        <xref ref-type="bibr" rid="ref12 ref2">2, 12</xref>
        ],
ran simple and fast versions of algorithms (e.g. decision stumps
instead of decision trees) on the new dataset and used their
performance results to characterize the new dataset. Alternatively, in
sampling landmarks [
        <xref ref-type="bibr" rid="ref14 ref21 ref8">21, 8, 14</xref>
        ], the complete (non-simplified)
algorithms are run on small samples of the data. A series of sampling
landmarks on increasingly large samples represents a partial
learning curve which characterizes datasets and which can be used to
predict the performance of algorithms significantly more accurately than
with classical dataset characteristics [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. Finally, an ‘active
testing strategy’ for sampling landmarks [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] was proposed that actively
selects the most informative sample sizes while building these partial
learning curves, thus reducing the time needed to compute them.</p>
      <p>Motivation. All these approaches have focused on dozens of
algorithms at most and usually considered only default parameter
settings. Dealing with hundreds, perhaps thousands of
algorithm-parameter combinations provides a new challenge that requires a
new approach. First, distinguishing between hundreds of subtly
different algorithms is significantly harder than distinguishing between
a handful of very different ones. We would need many more data
characterizations that relate the effects of certain parameters to
performance. On the other hand, the latter method [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] has a scalability
issue: it requires that pairwise comparisons be conducted between
algorithms. This would be rather impractical when faced with
hundreds of algorithm-parameter combinations.
      </p>
      <p>To address these issues, we propose a quite different way to
characterize datasets, namely through the effect that the dataset has on the
relative performance of algorithms run on them. As in landmarking,
we use the fact that each algorithm has its own learning bias, making
certain assumptions about the data distribution. If the learning bias
‘matches’ the underlying data distribution of a particular dataset, it
is likely to perform well (e.g., achieve high predictive accuracy). If it
does not, it will likely under- or overfit the data, resulting in a lower
performance.</p>
      <p>As such, we characterize a dataset based on the pairwise
performance differences between the algorithms run on it: if the same
algorithms win, tie or lose against each other on two datasets, then the
data distributions of these datasets are likely to be similar as well, at
least in terms of their effect on learning performance. (In the remainder
of this text, when we speak of algorithms, we mean fully-defined algorithm
instances with fixed components, e.g., base-learners and kernel functions,
and fixed parameter settings.) It is clear that
the more algorithms are used, the more accurate the characterization
will be. While we cannot run all algorithms on each new dataset
because of the computational cost, we can run a fair amount of CV tests
to get a reasonably good idea of which prior datasets are most similar
to the new one.</p>
      <p>Moreover, we can use these same performance results to
establish which (yet untested) algorithms are likely to perform well on the
new dataset, i.e., those algorithms that outperformed or rivaled the
currently best algorithm on similar datasets in the past. As such, we
can intelligently select the most promising algorithms for the new
dataset, run them, and then use their performance results to gain
increasingly better estimates of the most similar datasets and the most
promising algorithms.</p>
      <p>Key concepts. There are two key concepts used in this work. The
first one is that of the current best candidate algorithm which may
be challenged by other algorithms in the process of finding an even
better candidate.</p>
      <p>
        The second is the pairwise performance difference between two
algorithms running on the same dataset, which we call relative
landmark. A collection of such relative landmarks represents a history of
previous ‘duels’ between two algorithms on prior datasets. The term
itself originates from the study of landmarking algorithms: since
absolute values for the performance of landmarkers vary a lot
depending on the dataset, several types of relative landmarks have been
proposed, which basically capture the relative performance difference
between two algorithms [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In this paper, we extend the notion of
relative landmarks to all (including non-simplified) classification
algorithms.
      </p>
      <p>The history of previous algorithm duels is used to select the
most promising challenger for the current best candidate algorithm,
namely the method that most convincingly outperformed or rivaled
the current champion on prior datasets similar to the new dataset.
Approach. Given the current best algorithm and a history of
relative landmarks (duels), we can start a tournament game in which,
in each round, the current best algorithm is compared to the next,
most promising contender. We select the most promising challenger
as discussed above, and run a CV test with this algorithm. The winner
becomes the new current best candidate, the loser is removed from
consideration. We will discuss the exact procedure in Section 3.</p>
      <p>We call this approach active testing (AT), as it actively selects the
most interesting CV tests instead of passively performing them one
by one: in each iteration the best competitor is identified, which
determines a new CV test to be carried out. Moreover, the same result
will be used to further characterize the new dataset and more
accurately estimate the similarity between the new dataset and all prior
datasets.</p>
      <p>Evaluation. By intelligently selecting the most promising algorithms
to test on the new dataset, we can more quickly discover an algorithm
that performs very well. Note that running a selection of algorithms is
typically done anyway to find a suitable algorithm. Here, we optimize
and automate this process using historical performance results of the
candidate algorithms on prior datasets.</p>
      <p>
        While we cannot possibly guarantee to return the absolute best
algorithm without performing all possible CV tests, we can return an
algorithm whose performance is either identical or very close to that of the
truly best one. The difference between the two can be expressed in
terms of a loss. Our aim is thus to minimize this loss using a minimal
number of tests, and we will evaluate our technique as such. Note that while
the term ‘active testing’ is also used in the context of
actively selected sampling landmarks [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], there is little or no relationship to
the approach described here.
      </p>
      <p>In all, the research hypothesis that we intend to prove in this paper
is: Relative landmarks provide useful information on the similarity
of datasets and can be used to efficiently predict the most promising
algorithms to test on new datasets.</p>
      <p>We will test this hypothesis by running our active testing approach
in a leave-one-out fashion on a large set of CV evaluations testing 292
algorithms on 76 datasets. The results show that our AT approach is
indeed effective in finding very accurate algorithms with a limited
number of tests.</p>
      <p>We also present an adaptation of method AT that uses data
characteristics to define the similarity of the datasets. Our purpose was to
compare the relative landmark versus data measures approaches to
select classification algorithms.</p>
      <p>Roadmap. The remainder of this paper is organized as follows. First,
we formulate the concepts of relative landmarks in Section 2 and
active testing in Section 3. Next, Section 4 presents the empirical
evaluation and Section 5 presents an overview of some work in other
related areas. The final section presents conclusions and future work.
</p>
    </sec>
    <sec id="sec-2">
      <title>Relative Landmarks</title>
      <p>In this section we formalize our definition of relative landmarks, and
explain how they are used to identify the most promising competitor for the
currently best algorithm.</p>
      <p>Given a set of classification algorithms and some new
classification dataset dnew, the aim is to identify the potentially best algorithm
for this task with respect to some given performance measure M
(e.g., accuracy, AUC or rank). Let us represent the performance of
algorithm ai on dataset dnew as M(ai, dnew). As such, we need to
identify an algorithm a*, for which the performance measure is
maximal, i.e., ∀ai: M(a*, dnew) ≥ M(ai, dnew). The decision concerning ≥
(i.e. whether a* is at least as good as ai) may be established using
either a statistical significance test or a simple comparison.</p>
      <p>However, instead of searching exhaustively for a*, we aim to
find a near-optimal algorithm, â*, which has a high probability
P(M(â*, dnew) ≥ M(ai, dnew)) of being optimal, ideally close to
1.</p>
      <p>As in other work that exploits metalearning, we assume that â* is
likely better than ai on dataset dnew if it was found to be better on a
similar dataset dj (for which we have performance estimates):
P(M(â*, dnew) ≥ M(ai, dnew)) ≈ P(M(â*, dj) ≥ M(ai, dj))
(1)
The latter estimate can be maximized by going through all algorithms
and identifying the algorithm â* that satisfies the constraint in
a maximum number of cases. However, this requires that we know
which datasets dj are most similar to dnew. Since our definition of
similarity requires CV tests to be run on dnew, but we cannot run all
possible CV tests, we use an iterative approach in which we repeat
this scan for â* in every round, using only the datasets dj that seem
most similar at that point, as dataset similarities are recalculated after
every CV test.</p>
      <p>Initially, having no information, we deem all datasets to be similar
to dnew, so that a^ will be the globally best algorithm over all prior
datasets. We then call this algorithm the current best algorithm abest
and run a CV test to calculate its performance on dnew. Based on
this, the dataset similarities are recalculated (see Section 3), yielding
a possibly different set of datasets dj . The best algorithm on this new
set becomes the best competitor ak (different from abest), calculated
by counting the number of times that M (ak; dj ) &gt; M (abest; dj ),
over all datasets dj .</p>
      <p>We can further refine this method by taking into account how large
the performance differences are: the larger the difference was in the
past, the higher the chances of obtaining a large gain on the new dataset.
This leads to the notion of relative landmarks RL, defined as:
RL(ak, abest, dj) = i(M(ak, dj) &gt; M(abest, dj)) · (M(ak, dj) − M(abest, dj))
(2)
The function i(test) returns 1 if the test is true and 0 otherwise. As
stated before, this can be a simple comparison or a statistical
significance test that only returns 1 if ak performs significantly better than
abest on dj. The term RL thus expresses how much better ak is,
relative to abest, on a dataset dj. Experimental tests have shown that this
approach yields much better results than simply counting the number
of wins.</p>
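      <p>To make equation (2) concrete, here is a minimal Python sketch of a relative landmark; the function name and data layout are our own illustrative choices, not part of the paper.</p>
      <p>
```python
def relative_landmark(perf_k, perf_best):
    # RL(ak, abest, dj): zero unless ak beats abest on dj, otherwise
    # the size of the performance gap M(ak, dj) - M(abest, dj).
    indicator = 1 if perf_k > perf_best else 0
    return indicator * (perf_k - perf_best)
```
      </p>
      <p>As the text notes, the simple comparison in the indicator could be replaced by a statistical significance test.</p>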
      <p>Up to now, we have assumed a dataset dj to be either similar to
dnew or not. A second refinement is to use a gradual (non-binary)
measure of similarity Sim(dnew, dj) between datasets dnew and dj.
As such, we can weigh the performance difference between ak and
abest on dj by how similar dj is to dnew. Indeed, the more similar
the datasets, the more informative the performance difference is. We
thus aim to optimize the following criterion:
ak = arg max_ai Σ_{dj ∈ D} RL(ai, abest, dj) · Sim(dnew, dj)
(3)
in which D is the set of all prior datasets dj.</p>
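      <p>A sketch of how equation (3) might be applied to pick the best competitor; the function names and the dictionary-based storage of past CV results are our own assumptions for illustration.</p>
      <p>
```python
def best_competitor(candidates, abest, M, sim):
    # Equation (3): score each candidate ai by summing, over prior datasets
    # dj, RL(ai, abest, dj) weighted by Sim(dnew, dj); return the arg max.
    # M maps (algorithm, dataset) pairs to performance; sim maps datasets
    # to their current similarity to dnew.
    def score(ai):
        total = 0.0
        for dj, s in sim.items():
            gap = M[(ai, dj)] - M[(abest, dj)]
            if gap > 0:          # indicator i(.) of the relative landmark
                total += gap * s
        return total
    return max(candidates, key=score)
```
      </p>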
      <p>To calculate the similarity Sim(), we use the outcome of each CV
test on dnew and compare it to the outcomes on dj .</p>
      <p>In each iteration, with each CV test, we obtain a new
evaluation M (ai; dnew), which is used to recalculate all similarities
Sim(dnew; dj ). In fact, we will compare four variants of Sim(),
which are discussed in the next section. With this, we can recalculate
equation 3 to find the next best competitor ak.</p>
    </sec>
    <sec id="sec-3">
      <title>Active Testing</title>
      <p>In this section we describe the active testing (AT) approach, which
proceeds according to the following steps:
1. Construct a global ranking of a given set of algorithms using
performance information from past experiments (metadata).
2. Initiate the iterative process by assigning the top-ranked algorithm
as abest and obtain the performance of this algorithm on dnew
using a CV test.
3. Find the most promising competitor ak for abest using relative
landmarks and all previous CV tests on dnew.
4. Obtain the performance of ak on dnew using a CV test and
compare with abest. Use the winner as the current best algorithm, and
eliminate the losing algorithm.
5. Repeat the whole process starting with step 3 until a stopping
criterion has been reached. Finally, output the current abest as the
overall winner.</p>
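      <p>The five steps above can be sketched as a simple tournament loop. This is a minimal sketch under our own assumptions: cv_test stands in for a real cross-validation run, and select_competitor for the relative-landmark selection of Section 2.</p>
      <p>
```python
def active_testing(ranked, cv_test, select_competitor, max_tests):
    # Step 2: start from the top-ranked algorithm of the global ranking.
    abest = ranked[0]
    perf = {abest: cv_test(abest)}
    pool = list(ranked[1:])                  # algorithms still in contention
    while pool and max_tests > len(perf):    # Step 5: stopping criterion
        ak = select_competitor(pool, abest, perf)   # Step 3
        perf[ak] = cv_test(ak)                      # Step 4: run the CV duel
        pool.remove(ak)                      # the loser leaves consideration
        if perf[ak] > perf[abest]:
            abest = ak                       # winner becomes the new candidate
    return abest
```
      </p>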
      <p>
        Step 1 - Establish a Global Ranking of Algorithms. Before having
run any CV tests, we have no information on the new dataset dnew to
define which prior datasets are similar to it. As such, we naively
assume that all prior datasets are similar, and generate a global
ranking of all algorithms using the performance results of all
algorithms on all previous datasets, and choose the top-ranked algorithm
as our initial candidate abest. To illustrate this, we use a toy example
involving 6 classification algorithms, with default parameter settings,
from Weka [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] evaluated on 40 UCI datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a portion of which
is shown in Table 1.
      </p>
      <p>As said before, our approach is entirely independent from the
exact evaluation measure used: the most appropriate measure can be
chosen by the user in function of the specific data domain. In this
example, we use success rate (accuracy), but any other suitable
measure of classifier performance, e.g. AUC (area under the ROC curve),
precision, recall or F1 can be used as well.</p>
      <p>Each accuracy figure shown in Table 1 represents the mean of 10
values obtained in 10-fold cross-validation. The ranks of the
algorithms on each dataset are shown in parentheses next to the accuracy
value. For instance, if we consider dataset abalone, algorithm MLP
is attributed rank 1 as its accuracy is highest on this problem. The
second rank is occupied by LogD, etc.</p>
      <p>
        The last row in the table shows the mean rank of each
algorithm, obtained by averaging its ranks over all datasets: Rai =
(1/n) Σ_{dj} Rai,dj, where Rai,dj represents the rank of algorithm ai
on dataset dj and n is the number of datasets. This is a quite common
procedure, often used in machine learning to assess how a particular
algorithm compares to others [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
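      <p>The mean-rank computation can be written as follows; the dictionary layout of the rank table is our own toy representation.</p>
      <p>
```python
def global_ranking(ranks):
    # ranks[dj][ai] = rank of algorithm ai on dataset dj (1 = best).
    algos = list(next(iter(ranks.values())))
    n = len(ranks)
    mean_rank = {a: sum(r[a] for r in ranks.values()) / n for a in algos}
    # Sort ascending: the lowest mean rank heads the global ranking CA.
    return sorted(algos, key=mean_rank.get)
```
      </p>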
      <p>The mean ranks permit us to obtain a global
ranking of candidate algorithms, CA. In our case,
CA = ⟨MLP, J48, JRip, LogD, IB1, NB⟩. It must be noted
that such a ranking is not very informative in itself. For instance,
statistical significance tests would be needed to obtain a truthful ranking.
Here, we only use this global ranking CA as a starting point for the
iterative procedure, as explained next.</p>
      <sec id="sec-3-1">
        <title>Step 2 - Identify the Current Best Algorithm</title>
        <p>The global ranking CA permits us to identify the top-ranked algorithm as our initial
best candidate algorithm abest. In Table 1, abest = MLP. This
algorithm is then evaluated using a CV test to establish its performance
on the new dataset dnew.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Step 3 - Identify the Most Promising Competitor</title>
        <p>In the next step we identify ak, the best competitor of abest. To do this, all algorithms
are considered one by one, except for abest and the already eliminated
algorithms (see step 4).</p>
        <p>For each algorithm we analyze the information of past experiments
(meta-data) to calculate the relative landmarks, as outlined in the
previous section. As equation 3 shows, for each ak, we sum up all
relative landmarks involving abest, weighted by a measure of similarity
between dataset dj and the new dataset dnew. The algorithm ak that
achieves the highest value is the most promising competitor in this
iteration. In case of a tie, the competitor that appears first in ranking
CA is chosen.</p>
        <p>To calculate Sim(dnew, dj), the similarity between dj and dnew,
we have explored six different variants, AT0, AT1, ATWs, ATx,
ATdc and ATWs_k, described below.</p>
        <p>AT0 is a base-line method which ignores dataset similarity. It always
returns a similarity value of 1 and so all datasets are considered
similar. This means that the best competitor is determined by summing
up the values of the relative landmarks.</p>
        <p>The AT1 method works like AT0 at the beginning, when no tests
have been carried out on dnew. In all subsequent iterations, this
method estimates dataset similarity using only the most recent
CV test. Consider the algorithms listed in Table 1 and the
ranking CA. Suppose we started with algorithm MLP as the
current best candidate. Suppose also that in the next iteration LogD
was identified as the best competitor, and won against MLP in
the CV test, i.e. M(LogD, dnew) &gt; M(MLP, dnew). Then, in the
subsequent iteration, all prior datasets dj satisfying the
condition M(LogD, dj) &gt; M(MLP, dj) are considered similar to
dnew. In general terms, suppose that the last test revealed that
M(ak, dnew) &gt; M(abest, dnew); then Sim(dnew, dj) is 1 if also
M(ak, dj) &gt; M(abest, dj), and 0 otherwise. The similarity
measure determines which RLs are taken into account when summing
up their contributions to identify the next best competitor.</p>
        <p>Another variant of AT1 could use the difference between
RL(ak, abest, dnew) and RL(ak, abest, dj), normalized between
0 and 1, to obtain a real-valued (non-binary) similarity estimate
Sim(dnew, dj). In other words, dj is more similar to dnew if the
relative performance difference between ak and abest is about as large
on both dj and dnew. We plan to investigate this in future work.</p>
        <p>ATWs is similar to AT1, but instead of only using the last test,
it uses all CV tests carried out on the new dataset, and calculates
the Laplace-corrected ratio of corresponding results. For instance,
suppose we have conducted 3 tests on dnew, thus yielding 3
pairwise algorithm comparisons on dnew. Suppose that 2 tests had the
same result on dataset dj (i.e. M(ax, dnew) &gt; M(ay, dnew) and
M(ax, dj) &gt; M(ay, dj)); then the frequency of occurrence is 2/3,
which is adjusted by the Laplace correction to obtain an estimate of
probability: (2 + 1)/(3 + 2). As such, Sim(dnew, dj) = 3/5.</p>
        <p>ATx is similar to ATWs, but requires that all pairwise comparisons
yield the same outcome. In the example used above, it returns
Sim(dnew, dj) = 1 only if all three comparisons lead to the same result
on both datasets, and 0 otherwise.</p>
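        <p>A minimal sketch of the ATWs and ATx similarities, assuming each pairwise CV outcome is stored as a boolean (our own representation):</p>
        <p>
```python
def sim_atws(outcomes_new, outcomes_j):
    # outcomes_new[t] is True if ax beat ay in the t-th comparison on dnew;
    # outcomes_j[t] is the result of the same duel on the prior dataset dj.
    agree = sum(1 for a, b in zip(outcomes_new, outcomes_j) if a == b)
    n = len(outcomes_new)
    return (agree + 1) / (n + 2)   # Laplace correction: 2 of 3 agree -> 3/5

def sim_atx(outcomes_new, outcomes_j):
    # ATx: similarity 1 only if every pairwise comparison agrees, else 0.
    return 1.0 if outcomes_new == outcomes_j else 0.0
```
        </p>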
        <p>ATdc is similar to ATWs, but uses a different similarity function.
The idea of this variant is to test whether data characteristics (e.g.
the number of examples) can provide better information to identify the
datasets most similar to dnew. We define the similarity between
two datasets dnew and dj using the following expression:
Sim(dnew, dj) = 1 − (1/|Z|) Σ_{z ∈ Z} |z(dnew) − z(dj)| / (max(z) − min(z))
(4)</p>
        <p>The symbol z represents a generic data measure used to
characterize the datasets (z(d) is the value of characteristic z for dataset d). Z
is the set of measures used.</p>
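        <p>Equation (4) can be sketched as follows; here z_min and z_max hold each measure's minimum and maximum over all prior datasets, which is our assumption about how the normalization is computed.</p>
        <p>
```python
def sim_atdc(z_new, z_j, z_min, z_max):
    # 1 minus the mean normalized absolute difference over the measures
    # in Z (equation 4); identical characteristics give similarity 1.
    diffs = [abs(z_new[z] - z_j[z]) / (z_max[z] - z_min[z]) for z in z_new]
    return 1 - sum(diffs) / len(diffs)
```
        </p>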
        <p>ATWs_k is similar to ATWs, but only considers the k most similar
datasets (those with the highest Sim values); the similarity of all
other datasets is set to 0. The idea is to avoid situations where
the similarity values show very small variations, rendering the
information in the relative landmarks less useful.</p>
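        <p>The top-k filtering of ATWs_k amounts to the following sketch (function name ours):</p>
        <p>
```python
def top_k(sim, k):
    # Keep the k datasets with the highest similarity; zero out the rest
    # so their relative landmarks no longer contribute to equation (3).
    keep = set(sorted(sim, key=sim.get, reverse=True)[:k])
    return {d: (s if d in keep else 0.0) for d, s in sim.items()}
```
        </p>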
      </sec>
      <sec id="sec-3-3">
        <title>Step 4 - Determine which of the Two Algorithms is Better</title>
        <p>
          Having found ak, we can now run a CV test and compare it with abest. The
winner (which may be either the current best algorithm or the
competitor) is used as the new current best algorithm in the new round.
The losing algorithm is eliminated from further consideration.
Step 5 - Repeat the Process and Check the Stopping Criteria. The
whole process of identifying the best competitor (step 3) of abest and
determining which one of the two is better (step 4) is repeated until a
stopping criterion has been reached. For instance, the process could
be constrained to a fixed number of CV tests: considering the results
presented further on in Section 4, it would be sufficient to run at
most 20% of all possible CV tests. Alternatively, one could impose
a fixed CPU time, thus returning the best algorithm in h hours, as in
budgeted learning. In any case, until aborted, the method will keep
choosing a new competitor in each round: there will always be a next
best competitor. In this respect our system differs from, for instance,
hill climbing approaches which can get stuck in a local minimum.
Discussion - Comparison with Active Learning: The term active
testing was chosen because the approach shares some similarities
with active learning [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The concern of both is to speed up the
process of improving a given performance measure. In active learning,
the goal is to select the most informative data point to be labeled next,
so as to improve the predictive performance of a supervised learning
algorithm with a minimum of (expensive) labelings. In active testing,
the goal is to select the most informative CV test, so as to improve the
prediction of the best algorithm on the new dataset with a minimum
of (expensive) CV tests.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Empirical Evaluation</title>
    </sec>
    <sec id="sec-5">
      <title>Evaluation Methodology and Experimental Set-up</title>
      <p>
        The proposed method AT was evaluated using a leave-one-out
method [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The experiments reported here involve D datasets and
so the whole procedure was repeated D times. In each cycle, all
performance results on one dataset were left out for testing and the
results on the remaining D − 1 datasets were used as metadata to
determine the best candidate algorithm.
      </p>
      <p>
        This study involved 292 algorithms (algorithm-parameter
combinations), which were extracted from the experiment database for
machine learning (ExpDB) [
        <xref ref-type="bibr" rid="ref11 ref22">11, 22</xref>
        ]. This set includes many
different algorithms from the Weka platform [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which were varied
by assigning different values to their most important parameters.
It includes SMO (a support vector machine, SVM), MLP
(Multilayer Perceptron), J48 (C4.5), and different types of ensembles,
including RandomForest, Bagging and Boosting. Moreover, different
SVM kernels were used with their own parameter ranges and all
non-ensemble learners were used as base-learners for the
ensemble learners mentioned above. The 76 datasets used in this study
were all from UCI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A complete overview of the data used in this
study, including links to all algorithms and datasets can be found on
http://expdb.cs.kuleuven.be/ref/blv11.
      </p>
      <p>The data measures used to characterize the datasets for variant
ATdc were number of examples, proportion of nominal attributes,
proportion of missing values, class entropy and mean mutual
information.</p>
      <p>Regarding ATWs_k, we set k = 20 (this value was chosen by exploring
only the alternatives 30, 20 and 10).</p>
      <p>
        The main aim of the test was to prove the research hypothesis
formulated earlier: relative landmarks provide useful information for
predicting the most promising algorithms on new datasets.
Therefore, we also include two baseline methods:
TopN has been described before (e.g. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). It also builds a ranking of
candidate algorithms as described in step 1 (although measures
other than mean rank could be used), and only runs CV
tests on the first N algorithms. The overall winner is returned.
Rand simply selects N algorithms at random from the given set,
evaluates them using CV and returns the one with the best
performance. It is repeated 10 times with different random seeds and the
results are averaged.
      </p>
      <p>Since our AT methods are iterative, we will restart TopN and Rand
N times, with N equal to the number of iterations (or CV tests).</p>
      <p>To evaluate the performance of all approaches, we calculate the
loss of the currently best algorithm, defined as M(a*, dnew) −
M(abest, dnew), where abest represents the currently best algorithm,
a* the best possible algorithm and M(·) the performance
measure (success rate).</p>
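      <p>The loss measure can be computed directly; here success rates are on a 0-1 scale, which is our own convention for the sketch.</p>
      <p>
```python
def loss(perf_best_found, perf_true_best):
    # Loss of the recommended algorithm: M(a*, dnew) - M(abest, dnew).
    # Zero means active testing found the truly best algorithm.
    return perf_true_best - perf_best_found
```
      </p>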
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>By aggregating the results over D datasets, we can track the median
loss of the recommended algorithm as a function of the number of
CV tests carried out. The results are shown in Figure 1. Note that the
number of CV tests is plotted on a logarithmic scale.</p>
      <p>First, we see that ATWs and AT1 perform much better than AT0,
which indicates that it is indeed useful to include dataset similarity.
If we consider a particular level of loss (say 0.5%), we note that these
variants require far fewer CV tests than AT0. The results also
indicate that the information associated with relative landmarks obtained
on the new dataset is indeed valuable. The median loss curves decline
quite rapidly and are always below the AT0 curve. We also see that
after only 10 CV tests (representing about 3% of all possible tests),
the median loss is less than 0.5%. If we continue to 60 tests (about
20% of all possible tests) the median loss is near 0.</p>
      <p>Also note that ATWs, which uses all relative landmarks
involving abest and dnew, does not perform much better than AT1, which
only uses the most recent CV test. However, as shown in Figure 2, the
performance of variant ATWs_k is much better than that of ATWs.</p>
      <p>Method ATx, the most restrictive approach, only considers prior
datasets on which all relative landmarks including abest obtained
similar results. As shown in Figure 1, this approach manages to
reduce the loss quite rapidly, and competes well with the other variants
in the initial region. However, after achieving a minimum loss in the
order of 0.5%, there are no more datasets that fulfill this restriction,
and consequently no new competitor can be chosen, causing it to
stop. The other three methods, ATWs, ATdc and AT1, do not
suffer from this shortcoming. We can see that the variant that uses data
characteristics (ATdc) is generally worse than the other variants.
Up to the 23rd CV test, the variants AT1, ATx and ATWs need only
about half as many CV tests as ATdc to reach the same loss.</p>
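      <p>The way ATx restricts the set of similar datasets can be sketched as follows. This is an illustration, not the authors' implementation; a relative landmark is encoded here simply as the sign of the performance difference observed in a past duel:

```python
def similar_datasets(duels, new_outcomes, a_best):
    """Keep only prior datasets whose recorded duels involving a_best
    agree with every duel already observed on the new dataset.

    duels[d][(a, b)] and new_outcomes[(a, b)] hold +1 or -1: the sign
    of the performance difference between algorithms a and b."""
    relevant = {p: s for p, s in new_outcomes.items() if a_best in p}
    similar = []
    for d, outcomes in duels.items():
        if all(outcomes.get(p) == s for p, s in relevant.items()):
            similar.append(d)
    return similar
```

      As more duels are observed on the new dataset, the set of prior datasets satisfying all of them can shrink to empty, which is exactly why ATx can stall after reaching a low loss.</p>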
      <p>AT0 was also our best baseline method. To avoid overloading
Figure 1, we show this separately in Figure 2. Indeed, AT0 is clearly
better than the random choice method Rand. Comparing AT0 to
TopN, we cannot say that one is clearly better than the other
overall, as the curves cross. However, it is clear that TopN loses out if
we allow more CV tests, and that it is not competitive with the more
advanced methods such as AT1, ATWs and ATWs_k. Figure 2 also makes
clear that the best variant is ATWs_k (using k = 20): it typically
needs about half the CV tests required by ATWs, and the comparison
with all the other variants is even more impressive.</p>
      <p>The curves for mean loss (instead of median loss) follow similar
trends, but the values are 1-2% worse due to outliers (see Fig. 3
for method ATWs_k). This figure also shows the 25% and 75% quartile
curves for ATWs_k. As the number of CV tests increases, the two
quartile curves draw closer to the median curve. Similar behavior has
been observed for ATWs and AT1, but we omit those curves here.
Algorithm trace. It is interesting to trace the iterations carried out
for one particular dataset. Table 2 shows the details for method AT1,
where abalone represents the new dataset. Column 1 shows the
number of the iteration (thus the number of CV tests). Column 2 shows
the most promising competitor ak chosen in each step. Column 3
shows the index of ak in our initial ranking CA, and column 4 the
index of abest, the new best algorithm after the CV test has been
performed. As such, if the values in column 3 and 4 are the same,
then the most promising competitor has won the duel. For instance,
in step 2, SMO:C:1.0:Polynomial:E:3, i.e. SVM with complexity
constant set to 1.0 and a 3rd-degree polynomial kernel (index 96),
has been identified as the most promising competitor (column 2), and
after the CV test, it has won against Bagging:I:75::100:PART, i.e.
Bagging with a high number of iterations (between 75 and 100) and
PART as a base-learner. As such, it wins this round and becomes the
new abest. Columns 5 and 6 show the actual rank of the
competitor and the winner on the abalone dataset. Column 7 shows the loss
compared to the optimal algorithm and the final column shows the
number of datasets whose similarity measure is 1.</p>
      <p>We observe that after only 12 CV tests, the method has
identified an algorithm with a very small loss of 0.2%:
Bagging:I:25::50:MultilayerPerceptron, i.e. Bagging with
relatively few iterations but with a MultilayerPerceptron base-learner.</p>
      <p>Incidentally, this dataset appears to represent a quite atypical
problem: the truly best algorithm, SMO:C:1.0:RBF:G:20, i.e. SVM
with an RBF kernel with kernel width (gamma) set to 20, is ranked
globally as algorithm 246 (of all 292). AT1 identifies it after 177 CV
tests.</p>
    </sec>
    <sec id="sec-7">
      <title>Related Work in Other Scientific Areas</title>
      <p>In this section we briefly cover some work in other scientific areas
which is related to the problem tackled here and could provide further
insight into how to improve the method.</p>
      <p>
        One particular area is experiment design [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and in particular
active learning. As discussed before, the method described here follows
the main trends that have been outlined in this literature. However,
there is relatively little work on active learning for ranking tasks.
One notable exception is [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], who use the notion of Expected Loss
Optimization (ELO). Another work in this area is [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], whose aim was
to identify the most interesting substances for drug screening using a
minimum number of tests. In the experiments described, the authors
have focused on the top-10 substances. Several different strategies
were considered and evaluated. Our problem here is not ranking, but
rather simply finding the best item (algorithm), so this work is only
partially relevant.
      </p>
      <p>
        Another relevant area is the so called multi-armed bandit
problem (MAB) studied in statistics and machine learning [
        <xref ref-type="bibr" rid="ref16 ref9">9, 16</xref>
        ]. This
problem is often formulated in a setting that involves a set of
traditional slot machines. When a particular lever is pulled, a reward is
provided from a distribution associated with that specific lever. The
bandit problem is formally equivalent to a one-state Markov
decision process. The aim is to minimize regret after T rounds, which is
defined as a difference between the reward sum associated with an
optimal strategy and the sum of collected rewards.</p>
      <p>Indeed, pulling
a lever can be compared to carrying out a CV test on a given
algorithm. However, there is one fundamental difference between MAB
and our setting: whereas in MAB the aim is to maximize the sum of
collected rewards, our aim is to maximize one reward, i.e. the reward
associated with identifying the best algorithm. So again, this work is
only partially relevant.</p>
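      <p>The distinction can be made concrete: MAB minimizes cumulative regret over all pulls, whereas algorithm selection only cares about the quality of the single best arm found in the end (sometimes called simple regret). A toy illustration with hypothetical reward values:

```python
def cumulative_regret(rewards, pulls):
    """MAB objective: sum of gaps between the optimal reward and each
    reward actually collected."""
    best = max(rewards.values())
    return sum(best - rewards[a] for a in pulls)

def simple_regret(rewards, pulls):
    """Algorithm-selection objective: gap for the single best arm
    found, regardless of how the other pulls performed."""
    best = max(rewards.values())
    return best - max(rewards[a] for a in pulls)
```

      Pulling a poor arm is costly under cumulative regret but free under simple regret, which is why MAB strategies do not transfer directly to our setting.</p>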
      <p>To the best of our knowledge, no other work in this area has
addressed the issue of how to select a suitable algorithm from a large
set of candidates.</p>
    </sec>
    <sec id="sec-8">
      <title>Significance and Impact</title>
      <p>In this paper we have addressed the problem of selecting the best
classification algorithm for a specific task. We have introduced a new
method, called active testing, that exploits information concerning
past evaluation results (metadata), to recommend the best algorithm
using a limited number of tests on the new dataset.</p>
      <p>Starting from an initial ranking of algorithms on previous datasets,
the method runs additional CV evaluations to test several competing
algorithms on the new dataset. However, the aim is to reduce the
number of tests to a minimum. This is done by carefully selecting
which tests should be carried out, using the information of both past
and present algorithm evaluations represented in the form of relative
landmarks.</p>
      <p>In our view this method incorporates several innovative features.
First, it is an iterative process that uses the information in each CV
test to find the most promising next test based on a history of prior
‘algorithm duels’. In a tournament-style fashion, it starts with a
current best (parameterized) algorithm, selects the most promising
rival algorithm in each round, evaluates it on the given problem, and
eliminates the algorithm that performs worse. Second, it continually
focuses on the most similar prior datasets: those where the algorithm
duels had a similar outcome to those on the new dataset.</p>
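      <p>The tournament described above can be summarized in a short sketch. This is an outline under stated assumptions, not the authors' implementation: pick_competitor and cv_score are hypothetical stand-ins for, respectively, the relative-landmark-based choice of the most promising rival and a cross-validation test:

```python
def active_testing(candidates, d_new, initial_best, pick_competitor,
                   cv_score, max_tests):
    """Tournament-style selection: repeatedly test the most promising
    rival against the current best and keep the winner of each duel."""
    a_best = initial_best
    best_score = cv_score(a_best, d_new)
    remaining = [a for a in candidates if a != a_best]
    for _ in range(max_tests):
        if not remaining:
            break
        a_k = pick_competitor(remaining, a_best)  # most promising rival
        remaining.remove(a_k)
        score = cv_score(a_k, d_new)
        if score > best_score:                    # rival wins the duel
            a_best, best_score = a_k, score
    return a_best
```

      Each observed duel also refines the estimate of which prior datasets are similar, so pick_competitor becomes better informed as the tournament progresses.</p>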
      <p>Four variants of this basic approach, differing in their definition
of dataset similarity, were investigated in a very extensive
experiment setup involving 292 algorithm-parameter combinations on 76
datasets. Our experimental results show that particularly versions
ATWs and AT1 provide good recommendations using a small
number of CV tests. When plotting the median loss as a function of the
number of CV tests (Fig. 1), it shows that both outperform all other
variants and baseline methods. They also outperform AT0, indicating
that dataset similarity is an important aspect.</p>
      <p>We also see that after only 10 CV tests (representing about 3% of
all possible tests), the median loss is less than 0.5%. If we continue
to 60 tests (about 20% of all possible tests) the median loss is near 0.
Similar trends can be observed when considering mean loss.</p>
      <p>The results support the hypothesis that we have formulated at the
outset of our work, that relative landmarks are indeed informative and
can be used to suggest the best contender. If this procedure is applied
iteratively, it can accurately recommend a classification
algorithm after a very limited number of CV tests.</p>
      <p>Still, we believe that the results could be improved further.
Classical information-theoretic measures and/or sampling landmarks could
be incorporated into the process of identifying the most similar
datasets. This could lead to further improvements and forms part of
our future plans.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work is funded (or part-funded) by the ERDF European
Regional Development Fund through the COMPETE Programme
(operational programme for competitiveness) and by National Funds
through the FCT, Fundação para a Ciência e a Tecnologia
(Portuguese Foundation for Science and Technology), within project
FCOMP-01-0124-FEDER-022701.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.L.</given-names>
            <surname>Blake</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.J.</given-names>
            <surname>Merz</surname>
          </string-name>
          .
          <source>UCI repository of machine learning databases</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bensussan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Giraud-Carrier</surname>
          </string-name>
          , '
          <article-title>Meta-learning by landmarking various learning algorithms'</article-title>
          ,
          <source>in Proceedings of the 17th Int. Conf. on Machine Learning (ICML-2000)</source>
          , Stanford, CA, (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brazdil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Costa</surname>
          </string-name>
          , '
          <article-title>Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results'</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>50</volume>
          ,
          <fpage>251</fpage>
          -
          <lpage>277</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>De Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          , '
          <article-title>Active learning for primary drug screening'</article-title>
          ,
          <source>in Proceedings of Discovery Science</source>
          . Springer, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demsar</surname>
          </string-name>
          , '
          <article-title>Statistical comparisons of classifiers over multiple data sets'</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Fedorov</surname>
          </string-name>
          ,
          <source>Theory of Optimal Experiments</source>
          , Academic Press, New York,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Freund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Seung</surname>
          </string-name>
          , E. Shamir, and
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          , '
          <article-title>Selective sampling using the query by committee algorithm'</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>28</volume>
          ,
          <fpage>133</fpage>
          -
          <lpage>168</lpage>
          , (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Johann</given-names>
            <surname>Petrak</surname>
          </string-name>
          , '
          <article-title>An evaluation of landmarking variants'</article-title>
          ,
          <source>in Proceedings of the ECML/PKDD Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning (IDDM2001)</source>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
          . Springer, (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gittins</surname>
          </string-name>
          ,
          <article-title>'Multi-armed bandit allocation indices'</article-title>
          ,
          <source>in Wiley Interscience Series in Systems and Optimization</source>
          , John Wiley &amp; Sons, Ltd., (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Hall</surname>
          </string-name>
          , Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <surname>Ian H. Witten</surname>
          </string-name>
          , '
          <article-title>The WEKA data mining software: an update'</article-title>
          ,
          <source>SIGKDD Explor. Newsl.</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Blockeel</surname>
          </string-name>
          , '
          <article-title>Experiment databases: A novel methodology for experimental research'</article-title>
          ,
          <source>in Lecture Notes in Computer Science 3933</source>
          . Springer, (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Petrak</surname>
          </string-name>
          , '
          <article-title>An evaluation of landmarking variants'</article-title>
          ,
          <source>in Working Notes of ECML/PKDD 2000 Workshop on Integration Aspects of Data Mining, Decision Support and Meta-Learning</source>
          , eds.,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrac</surname>
          </string-name>
          , and
          <string-name>
            <surname>S.Moyle</surname>
          </string-name>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Leite</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Brazdil</surname>
          </string-name>
          , '
          <article-title>Predicting relative performance of classifiers from samples'</article-title>
          ,
          <source>in ICML '05: Proceedings of the 22nd international conference on Machine learning</source>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>503</lpage>
          , New York, NY, USA, (
          <year>2005</year>
          ). ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Rui</given-names>
            <surname>Leite</surname>
          </string-name>
          and Pavel Brazdil, '
          <article-title>Active testing strategy to predict the best classification algorithm via sampling and metalearning'</article-title>
          ,
          <source>in Proceedings of the 19th European Conference on Artificial Intelligence - ECAI</source>
          <year>2010</year>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Tseng</surname>
          </string-name>
          , '
          <article-title>Active learning for rankings through expected loss optimization'</article-title>
          ,
          <source>in Proceedings of the SIGIR'10. ACM</source>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Teneketzis</surname>
          </string-name>
          ,
          <article-title>'Multi-armed bandit problems'</article-title>
          ,
          <source>in Foundations and Applications of Sensor Management</source>
          , eds.,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Castanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cochran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kastella</surname>
          </string-name>
          , Springer-Verlag, (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Michie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Spiegelhalter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.C.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <source>Machine Learning, Neural and Statistical Classification</source>
          , Ellis Horwood,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Tom M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <source>Machine Learning</source>
          , McGraw-Hill
          , New York,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>John R.</given-names>
            <surname>Rice</surname>
          </string-name>
          , '
          <article-title>The algorithm selection problem'</article-title>
          , volume
          <volume>15</volume>
          of Advances in Computers, pp.
          <fpage>65</fpage>
          -
          <lpage>118</lpage>
          , Elsevier, (
          <year>1976</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Kate A.</given-names>
            <surname>Smith-Miles</surname>
          </string-name>
          , '
          <article-title>Cross-disciplinary perspectives on metalearning for algorithm selection'</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          ,
          <volume>41</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          , (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Soares</surname>
          </string-name>
          , Johann Petrak, and Pavel Brazdil, '
          <article-title>Sampling-based relative landmarks: Systematically test-driving algorithms before choosing'</article-title>
          ,
          <source>in Proceedings of the 10th Portuguese Conference on Artificial Intelligence (EPIA</source>
          <year>2001</year>
          ), pp.
          <fpage>88</fpage>
          -
          <lpage>94</lpage>
          . Springer, (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Blockeel</surname>
          </string-name>
          , '
          <article-title>A community-based platform for machine learning experimentation'</article-title>
          ,
          <source>in Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD</source>
          <year>2009</year>
          , volume LNCS
          <volume>5782</volume>
          , pp.
          <fpage>750</fpage>
          -
          <lpage>754</lpage>
          . Springer, (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Vilalta</surname>
          </string-name>
          and Youssef Drissi, '
          <article-title>A perspective view and survey of meta-learning'</article-title>
          ,
          <source>Artif. Intell. Rev.</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ),
          <fpage>77</fpage>
          -
          <lpage>95</lpage>
          , (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>