=Paper=
{{Paper
|id=None
|storemode=property
|title=Selecting Classification Algorithms with Active Testing on Similar Datasets
|pdfUrl=https://ceur-ws.org/Vol-950/planlearn2012_submission_5.pdf
|volume=Vol-950
}}
==Selecting Classification Algorithms with Active Testing on Similar Datasets==
Selecting Classification Algorithms with Active Testing on Similar Datasets Rui Leite1 and Pavel Brazdil2 and Joaquin Vanschoren3 Abstract. Given the large amount of data mining algorithms, their performance estimates (the target variable). combinations (e.g. ensembles) and possible parameter settings, find- The earliest techniques considered only the dataset itself and cal- ing the most adequate method to analyze a new dataset becomes an culated an array of various simple, statistical or information-theoretic ever more challenging task. This is because in many cases testing all properties of the data (e.g., dataset size, class skewness and signal- possibly useful alternatives quickly becomes prohibitively expensive. noise ratio) [17, 3]. Another approach, called landmarking [2, 12], In this paper we propose a novel technique, called active testing, that ran simple and fast versions of algorithms (e.g. decision stumps intelligently selects the most useful cross-validation tests. It proceeds instead of decision trees) on the new dataset and used their per- in a tournament-style fashion, in each round selecting and testing the formance results to characterize the new dataset. Alternatively, in algorithm that is most likely to outperform the best algorithm of the sampling landmarks [21, 8, 14], the complete (non-simplified) al- previous round on the new dataset. This ‘most promising’ competitor gorithms are run on small samples of the data. A series of sampling is chosen based on a history of prior duels between both algorithms landmarks on increasingly large samples represents a partial learn- on similar datasets. Each new cross-validation test will contribute ing curve which characterizes datasets and which can be used to pre- information to a better estimate of dataset similarity, and thus better dict the performance of algorithms significantly more accurately than predict which algorithms are most promising on the new dataset. We with classical dataset characteristics [13, 14]. Finally, an ‘active test- also follow a different path to estimate dataset similarity based on ing strategy’ for sampling landmarks [14] was proposed that actively data characteristics. We have evaluated this approach using a set of selects the most informative sample sizes while building these partial 292 algorithm-parameter combinations on 76 UCI datasets for clas- learning curves, thus reducing the time needed to compute them. sification. The results show that active testing will quickly yield an Motivation. All these approaches have focused on dozens of al- algorithm whose performance is very close to the optimum, after rel- gorithms at most and usually considered only default parameter atively few tests. It also provides a better solution than previously settings. Dealing with hundreds, perhaps thousands of algorithm- proposed methods. The variants of our method that rely on cross- parameter combinations4 , provides a new challenge that requires a validation tests to estimate dataset similarity provides better solutions new approach. First, distinguishing between hundreds of subtly dif- than those that rely on data characteristics. ferent algorithms is significantly harder than distinguishing between a handful of very different ones. We would need many more data 1 Background and Motivation characterizations that relate the effects of certain parameters on per- formance. On the other hand, the latter method [14] has a scalability In many data mining applications, an important problem is select- issue: it requires that pairwise comparisons be conducted between ing the best algorithm for a specific problem. Especially in classifi- algorithms. This would be rather impractical when faced with hun- cation, there are hundreds of algorithms to choose from. Moreover, dreds of algorithm-parameter combinations. these algorithms can be combined into composite learning systems To address these issues, we propose a quite different way to char- (e.g. ensembles) and often have many parameters that greatly influ- acterize datasets, namely through the effect that the dataset has on the ence their performance. This yields a whole spectrum of methods and relative performance of algorithms run on them. As in landmarking, their variations, so that testing all possible candidates on the given we use the fact that each algorithm has its own learning bias, making problem, e.g., using cross-validation, quickly becomes prohibitively certain assumptions about the data distribution. If the learning bias expensive. ‘matches’ the underlying data distribution of a particular dataset, it The issue of selecting the right algorithm has been the subject of is likely to perform well (e.g., achieve high predictive accuracy). If it many studies over the past 20 years [17, 3, 23, 20, 19]. Most ap- does not, it will likely under- or overfit the data, resulting in a lower proaches rely on the concept of metalearning. This approach exploits performance. characterizations of datasets and past performance results of algo- As such, we characterize a dataset based on the pairwise perfor- rithms to recommend the best algorithm on the current dataset. The mance differences between algorithms run on them: if the same al- term metalearning stems from the fact that we try to learn the func- gorithms win, tie or lose against each other on two datasets, then the tion that maps dataset characterizations (meta-data) to algorithm data distributions of these datasets are likely to be similar as well, at 1 LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Por- least in terms of their effect on learning performance. It is clear that tugal, rleite@fep.up.pt the more algorithms are used, the more accurate the characterization 2 LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Por- tugal, pbrazdil@inescporto.pt 4 In the remainder of this text, when we speak of algorithms, we mean fully- 3 LIACS - Leiden Institute of Advanced Computer Science, University of defined algorithm instances with fixed components (e.g., base-learners, ker- Leiden, Nederlands, joaquin@liacs.nl nel functions) and parameter settings. will be. While we cannot run all algorithms on each new dataset be- number of tests, and we will evaluate our technique as such. cause of the computational cost, we can run a fair amount of CV tests In all, the research hypothesis that we intend to prove in this paper to get a reasonably good idea of which prior datasets are most similar is: Relative landmarks provide useful information on the similarity to the new one. of datasets and can be used to efficiently predict the most promising Moreover, we can use these same performance results to estab- algorithms to test on new datasets. lish which (yet untested) algorithms are likely to perform well on the We will test this hypothesis by running our active testing approach new dataset, i.e., those algorithms that outperformed or rivaled the in a leave-one-out fashion on a large set of CV evaluations testing 292 currently best algorithm on similar datasets in the past. As such, we algorithms on 76 datasets. The results show that our AT approach is can intelligently select the most promising algorithms for the new indeed effective in finding very accurate algorithms with a limited dataset, run them, and then use their performance results to gain in- number of tests. creasingly better estimates of the most similar datasets and the most We also present an adaptation of method AT that uses data char- promising algorithms. acteristics to define the similarity of the datasets. Our purpose was to compare the relative landmark versus data measures approaches to Key concepts. There are two key concepts used in this work. The select classification algorithms. first one is that of the current best candidate algorithm which may be challenged by other algorithms in the process of finding an even Roadmap. The remainder of this paper is organized as follows. First, better candidate. we formulate the concepts of relative landmarks in Section 2 and The second is the pairwise performance difference between two active testing in Section 3. Next, Section 4 presents the empirical algorithms running on the same dataset, which we call relative land- evaluation and Section 5 presents an overview of some work in other mark. A collection of such relative landmarks represents a history of related areas. The final section presents conclusions and future work. previous ‘duels’ between two algorithms on prior datasets. The term itself originates from the study of landmarking algorithms: since ab- solute values for the performance of landmarkers vary a lot depend- 2 Relative Landmarks ing on the dataset, several types of relative landmarks have been pro- posed, which basically capture the relative performance difference In this section we formalize our definition of relative landmarks, and between two algorithms [12]. In this paper, we extend the notion of explain how are used to identify the most promising competitor for a relative landmarks to all (including non-simplified) classification al- currently best algorithm. gorithms. Given a set of classification algorithms and some new classifica- The history of previous algorithm duels is used to select the tion dataset dnew , the aim is to identify the potentially best algorithm most promising challenger for the current best candidate algorithm, for this task with respect to some given performance measure M namely the method that most convincingly outperformed or rivaled (e.g., accuracy, AUC or rank). Let us represent the performance of the current champion on prior datasets similar to the new dataset. algorithm ai on dataset dnew as M (ai , dnew ). As such, we need to identify an algorithm a∗, for which the performance measure is max- Approach. Given the current best algorithm and a history of rela- imal, or ∀ai M (a∗, dnew ) ≥ M (ai , dnew ). The decision concerning tive landmarks (duels), we can start a tournament game in which, ≥ (i.e. whether a∗ is at least as good as ai ) may be established using in each round, the current best algorithm is compared to the next, either a statistical significance test or a simple comparison. most promising contender. We select the most promising challenger However, instead of searching exhaustively for a∗, we aim to as discussed above, and run a CV test with this algorithm. The winner find a near-optimal algorithm, a∗, ˆ which has a high probability becomes the new current best candidate, the loser is removed from P (M (a∗,ˆ dnew ) ≥ M (ai , dnew )) to be optimal, ideally close to consideration. We will discuss the exact procedure in Section 3. 1. We call this approach active testing (AT)5 , as it actively selects the As in other work that exploits metalearning, we assume that a∗ ˆ is most interesting CV tests instead of passively performing them one likely better than ai on dataset dnew if it was found to be better on a by one: in each iteration the best competitor is identified, which de- similar dataset dj (for which we have performance estimates): termines a new CV test to be carried out. Moreover, the same result will be used to further characterize the new dataset and more accu- rately estimate the similarity between the new dataset and all prior ˆ dnew ) ≥ M (ai , dnew )) ∼ P (M (a∗, P (M (a∗, ˆ dj ) ≥ M (ai , dj )) datasets. (1) Evaluation. By intelligently selecting the most promising algorithms The latter estimate can be maximized by going through all algorithms to test on the new dataset, we can more quickly discover an algorithm and identifying the algorithm a∗ ˆ that satisfies the ≥ constraint in that performs very well. Note that running a selection of algorithms is a maximum number of cases. However, this requires that we know typically done anyway to find a suitable algorithm. Here, we optimize which datasets dj are most similar to dnew . Since our definition of and automate this process using historical performance results of the similarity requires CV tests to be run on dnew , but we cannot run all candidate algorithms on prior datasets. possible CV tests, we use an iterative approach in which we repeat While we cannot possibly guarantee to return the absolute best this scan for a∗ˆ in every round, using only the datasets dj that seem algorithm without performing all possible CV tests, we can return an most similar at that point, as dataset similarities are recalculated after algorithm whose performance is either identical or very close to the every CV test. truly best one. The difference between the two can be expressed in Initially, having no information, we deem all datasets to be similar terms of a loss. Our aim is thus to minimize this loss using a minimal ˆ will be the globally best algorithm over all prior to dnew , so that a∗ datasets. We then call this algorithm the current best algorithm abest 5 Note that while the term ‘active testing’ is also used in the context of ac- and run a CV test to calculate its performance on dnew . Based on tively selected sampling landmarks [14], there is little or no relationship to this, the dataset similarities are recalculated (see Section 3), yielding the approach described here. a possibly different set of datasets dj . The best algorithm on this new set becomes the best competitor ak (different from abest ), calculated define which prior datasets are similar to it. As such, we naively as- by counting the number of times that M (ak , dj ) > M (abest , dj ), sume that all prior datasets are similar. As such, we generate a global over all datasets dj . ranking of all algorithms using the performance results of all algo- We can further refine this method by taking into account how large rithms on all previous datasets, and choose the top-ranked algorithm the performance differences are: the larger a difference was in the as our initial candidate abest . To illustrate this, we use a toy example past, the higher chances are to obtain a large gain on the new dataset. involving 6 classification algorithms, with default parameter settings, This leads to the notion of relative landmarks RL, defined as: from Weka [10] evaluated on 40 UCI datasets [1], a portion of which is shown in Table 1. As said before, our approach is entirely independent from the ex- (2) act evaluation measure used: the most appropriate measure can be RL(ak , abest , dj ) = i(M (ak , dj ) > M (abest , dj )) ∗ chosen by the user in function of the specific data domain. In this ∗(M (ak , dj ) − M (abest , dj )) example, we use success rate (accuracy), but any other suitable mea- sure of classifier performance, e.g. AUC (area under the ROC curve), The function i(test) returns 1 if the test is true and 0 otherwise. As precision, recall or F1 can be used as well. stated before, this can be a simple comparison or a statistical signifi- Each accuracy figure shown in Table 1 represents the mean of 10 cance test that only returns 1 if ak performs significantly better than values obtained in 10-fold cross-validation. The ranks of the algo- abest on dj . The term RL thus expresses how much better ak is, rela- rithms on each dataset are shown in parentheses next to the accuracy tive to abest , on a dataset dj . Experimental tests have shown that this value. For instance, if we consider dataset abalone, algorithm M LP approach yields much better results than simply counting the number is attributed rank 1 as its accuracy is highest on this problem. The of wins. second rank is occupied by LogD, etc. Up to now, we are assuming a dataset dj to be either similar to The last row in the table shows the mean rank of each algo- dnew or not. A second refinement is to use a gradual (non-binary) rithm, obtained by averaging over the ranks of each dataset: Rai = 1 P n measure of similarity Sim(dnew , dj ) between datasets dnew and dj . n dj =1 Rai ,dj , where Rai ,dj represents the rank of algorithm ai As such, we can weigh the performance difference between ak and on dataset dj and n the number of datasets. This is a quite common abest on dj by how similar dj is to dnew . Indeed, the more similar procedure, often used in machine learning to assess how a particular the datasets, the more informative the performance difference is. As algorithm compares to others [5]. such, we aim to optimize the following criterion: The mean ranks permit us to obtain a global rank- ing of candidate algorithms, CA. In our case, X CA = hM LP, J48, JRip, LogD, IB1, N Bi. It must be noted ak = arg max RL(ai , abest , dj ) ∗ Sim(dnew , dj )) (3) ai that such a ranking is not very informative in itself. For instance, dj∈D statistical significance tests are needed to obtain a truthful ranking. in which D is the set of all prior datasets dj . Here, we only use this global ranking CA are a starting point for the To calculate the similarity Sim(), we use the outcome of each CV iterative procedure, as explained next. test on dnew and compare it to the outcomes on dj . Step 2 - Identify the Current Best Algorithm. Indeed, global rank- In each iteration, with each CV test, we obtain a new eval- ing CA permits us to identify the top-ranked algorithm as our initial uation M (ai , dnew ), which is used to recalculate all similarities best candidate algorithm abest . In Table 1, abest = M LP . This al- Sim(dnew , dj ). In fact, we will compare four variants of Sim(), gorithm is then evaluated using a CV test to establish its performance which are discussed in the next section. With this, we can recalculate on the new dataset dnew . equation 3 to find the next best competitor ak . Step 3 - Identify the Most Promising Competitor. In the next step we identify ak , the best competitor of abest . To do this, all algorithms 3 Active Testing are considered one by one, except for abest and the eliminated algo- In this section we describe the active testing (AT) approach, which rithms (see step 4). proceeds according to the following steps: For each algorithm we analyze the information of past experiments (meta-data) to calculate the relative landmarks, as outlined in the pre- 1. Construct a global ranking of a given set of algorithms using per- vious section. As equation 3 shows, for each ak , we sum up all rela- formance information from past experiments (metadata). tive landmarks involving abest , weighted by a measure of similarity 2. Initiate the iterative process by assigning the top-ranked algorithm between dataset dj and the new dataset dnew . The algorithm ak that as abest and obtain the performance of this algorithm on dnew achieves the highest value is the most promising competitor in this using a CV test. iteration. In case of a tie, the competitor that appears first in ranking 3. Find the most promising competitor ak for abest using relative CA is chosen. landmarks and all previous CV tests on dnew . To calculate Sim(dnew , dj ), the similarity between dj and dnew , 4. Obtain the performance of ak on dnew using a CV test and com- we have explored six different variants, AT0, AT1, ATWs, ATx, pare with abest . Use the winner as the current best algorithm, and ATdc, ATWs k described below. eliminate the losing algorithm. 5. Repeat the whole process starting with step 3 until a stopping cri- AT0 is a base-line method which ignores dataset similarity. It always terium has been reached. Finally, output the current abest as the returns a similarity value of 1 and so all datasets are considered sim- overall winner. ilar. This means that the best competitor is determined by summing up the values of the relative landmarks. Step 1 - Establish a Global Ranking of Algorithms. Before having AT1 method works as AT0 at the beginning, when no tests run any CV tests, we have no information on the new dataset dnew to have been carried out on dnew . In all subsequent iterations, this Table 1. Accuracies and ranks (in parentheses) of the algorithms 1-nearest neighbor (IB1), C4.5 (J48), RIPPER (JRip), LogisticDiscriminant (LogD), MultiLayerPerceptron (MLP) and naive Bayes (NB) on different datasets and their mean rank. Datasets IB1 J48 JRip LogD MLP NB abalone .197 (5) .218 (4) .185 (6) .259 (2) .266 (1) .237 (3) acetylation .844 (1) .831 (2) .829 (3) .745 (5) .609 (6) .822 (4) adult .794 (6) .861 (1) .843 (3) .850 (2) .830 (5) .834 (4) ... ... ... ... ... ... ... Mean rank 4.05 2.73 3.17 3.74 2.54 4.78 method estimates dataset similarity using only the most recent the information in the relative landmarks. CV test. Consider the algorithms listed in Table 1 and the rank- Step 4 - Determine which of the Two Algorithms is Better. Having ing CA. Suppose we started with algorithm M LP as the cur- found ak , we can now run a CV test and compare it with abest . The rent best candidate. Suppose also that in the next iteration LogD winner (which may be either the current best algorithm or the com- was identified as the best competitor, and won from M LP in petitor) is used as the new current best algorithm in the new round. the CV test: (M (LogD, dnew ) > M (M LP, dnew )). Then, in the The losing algorithm is eliminated from further consideration. subsequent iteration, all prior datasets dj satisfying the con- dition M (LogD, dj ) > M (M LP, dj ) are considered similar to Step 5 - Repeat the Process and Check the Stopping Criteria. The dnew . In general terms, suppose that the last test revealed that whole process of identifying the best competitor (step 3) of abest and M (ak , dnew ) > M (abest , dnew ), then Sim(dnew , dj ) is 1 if also determining which one of the two is better (step 4) is repeated until a M (ak , dj ) > M (abest , dj ), and 0 otherwise. The similarity mea- stopping criterium has been reached. For instance, the process could sure determines which RL’s are taken into account when summing be constrained to a fixed number of CV tests: considering the results up their contributions to identify the next best competitor. presented further on in Section 4, it would be sufficient to run at Another variant of AT1 could use the difference between most 20% of all possible CV tests. Alternatively, one could impose RL(ak , abest , dnew ) and RL(ak , abest , dj ), normalized between a fixed CPU time, thus returning the best algorithm in h hours, as in 0 and 1, to obtain a real-valued (non-binary) similarity estimate budgeted learning. In any case, until aborted, the method will keep Sim(dnew , dj ). In other words, dj is more similar to dnew if the rel- choosing a new competitor in each round: there will always be a next ative performance difference between ak and abest is about as large best competitor. In this respect our system differs from, for instance, on both dj and dnew . We plan to investigate this in our future work. hill climbing approaches which can get stuck in a local minimum. ATWs is similar to AT1, but instead of only using the last test, Discussion - Comparison with Active Learning: The term active it uses all CV tests carried out on the new dataset, and calculates testing was chosen because the approach shares some similarities the Laplace-corrected ratio of corresponding results. For instance, with active learning [7]. The concern of both is to speed up the pro- suppose we have conducted 3 tests on dnew , thus yielding 3 pair- cess of improving a given performance measure. In active learning, wise algorithm comparisons on dnew . Suppose that 2 tests had the the goal is to select the most informative data point to be labeled next, same result on dataset dj (i.e. M (ax , dnew ) > M (ay , dnew ) and so as to improve the predictive performance of a supervised learning M (ax , dj ) > M (ay , dj )), then the frequency of occurrence is 2/3, algorithm with a minimum of (expensive) labelings. In active testing, which is adjusted by Laplace correction to obtain an estimate of prob- the goal is to select the most informative CV test, so as to improve the ability (2 + 1)/(3 + 2). As such, Sim(dnew , dj ) = 53 . prediction of the best algorithm on the new dataset with a minimum ATx is similar to ATWs, but requires that all pairwise comparisons of (expensive) CV tests. yield the same outcome. In the example used above, it will return Sim(dnew , dj ) = 1 only if all three comparisons lead to same result on both datasets and 0 otherwise. 4 Empirical Evaluation ATdc is similar to ATWs, but uses a different similarity function. 4.1 Evaluation Methodology and Experimental The idea of using this variant was to test if data characteristics (e.g. Set-up number of examples) could provide better information to identify the The proposed method AT was evaluated using a leave-one-out most similar datasets (w.r.t. dnew ). We define the similarity between method [18]. The experiments reported here involve D datasets and two datasets dnew and dj using the following expression: so the whole procedure was repeated D times. In each cycle, all per- formance results on one dataset were left out for testing and the re- 1 X |z(dnew ) − z(dj )| Sim(dnew , dj ) = 1 − (4) sults on the remaining D − 1 datasets were used as metadata to de- |Z| max(z) − min(z) z∈Z termine the best candidate algorithm. This study involved 292 algorithms (algorithm-parameter com- The symbol z represents a generic data measure used to character- binations), which were extracted from the experiment database for ize the datasets (z(d) is the value of characteristic z for dataset d). Z machine learning (ExpDB) [11, 22]. This set includes many dif- is the set of measures used. ferent algorithms from the Weka platform [10], which were varied ATWs k is similar to ATWS, but only consider the k most similar by assigning different values to their most important parameters. datasets (those with highest Sim values). The similarity for all the It includes SMO (a support vector machine, SVM), MLP (Multi- other datasets are set to 0. The idea is to avoid the situation where layer Perceptron), J48 (C4.5), and different types of ensembles, in- the similarity values show very small variations, making unusefull cluding RandomForest, Bagging and Boosting. Moreover, different 1.5 AT0 AT1 ATx Median Accuracy Loss (%) ATWs ATdc 1.0 0.5 0.0 1 2 4 8 16 32 64 128 292 Number of CV's Figure 1. Median loss as a function of the number of CV tests. SVM kernels were used with their own parameter ranges and all 4.2 Results non-ensemble learners were used as base-learners for the ensem- ble learners mentioned above. The 76 datasets used in this study were all from UCI [1]. A complete overview of the data used in this By aggregating the results over D datasets, we can track the median study, including links to all algorithms and datasets can be found on loss of the recommended algorithm as a function of the number of http://expdb.cs.kuleuven.be/ref/blv11. CV tests carried out. The results are shown in Figure 1. Note that the The data measures used to characterize the datasets for variant number of CV tests is plotted on a logarithmic scale. ATdc were number of examples, proportion of nominal attributes, First, we see that ATWs and AT1 perform much better than AT0, proportion of missing values, class entropy and mean mutual infor- which indicates that it is indeed useful to include dataset similarity. mation. If we consider a particular level of loss (say 0.5%) we note that these Regarding ATWs k we set k = 20 (this value was set by exploring variants require much fewer CV tests than AT0. The results also indi- only the alternatives 30, 20 and 10). cate that the information associated with relative landmarks obtained The main aim of the test was to prove the research hypothesis for- on the new dataset is indeed valuable. The median loss curves decline mulated earlier: relative landmarks provide useful information for quite rapidly and are always below the AT0 curve. We also see that predicting the most promising algorithms on new datasets. There- after only 10 CV tests (representing about 3% of all possible tests), fore, we also include two baseline methods: the median loss is less than 0.5%. If we continue to 60 tests (about 20% of all possible tests) the median loss is near 0. Also note that ATWs, which uses all relative landmarks involv- ing abest and dnew , does not perform much better than AT1, which TopN has been described before (e.g. [3]). It also builds a ranking of only uses the most recent CV test. However in Figure 2 we show the candidate algorithms as described in step 1 (although other mea- performance of variant ATWs k is much better than ATWs. sures different from mean rank could be used), and only runs CV Method ATx, the most restrictive approach, only considers prior tests on the first N algorithms. The overall winner is returned. datasets on which all relative landmarks including abest obtained Rand simply selects N algorithms at random from the given set, similar results. As shown in Figure 1, this approach manages to re- evaluates them using CV and returns the one with the best perfor- duce the loss quite rapidly, and competes well with the other variants mance. It is repeated 10 times with different random seeds and the in the initial region. However, after achieving a minimum loss in the results are averaged. order of 0.5%, there are no more datasets that fulfill this restriction, and consequently no new competitor can be chosen, causing it to stop. The other three methods, ATWs, ATdc and AT1, do not suf- Since our AT methods are iterative, we will restart TopN and Rand fer from this shortcoming. We can see that the variant that uses data N times, with N equal to the number of iterations (or CV tests). characteristics (ATdc) is generally worse than the other variants. Be- To evaluate the performance of all approaches, we calculate the fore the 23rd CV test all variants AT1, ATx and ATWs need about loss of the currently best algorithm, defined as M (abest , dnew ) − half the CV tests needed by ATdc. M (a∗, dnew ), where abest represents the currently best algorithm, AT0 was also our best baseline method. To avoid overloading Fig- a∗ the best possible algorithm and M (.) represents the performance ure 1, we show this separately in Figure 2. Indeed, AT0 is clearly measure (success rate). better than the random choice method Rand. Comparing AT0 to 1.5 Rand TopN AT0 Median Accuracy Loss (%) ATWs ATWs_k 1.0 0.5 0.0 1 2 4 8 16 32 64 128 292 Number of CV's Figure 2. Median loss of AT0, ATWs, ATWs k and the two baseline methods. TopN, we cannot say that one is clearly better than the other over- identified an algorithm with a very small loss of 0.2%: all, as the curves cross. However, it is clear that TopN looses out if Bagging.I.25..50.M ultilayerP erceptron, i.e. Bagging with rel- we allow more CV tests, and that it is not competitive with the more atively few iterations but with a MultiLayerPerceptron base-learner. advanced methods such as AT1, ATWs and ATWs k. Is clear in Fig- Incidentally, this dataset appears to represent a quite atypical prob- ure 2 that the best method variant is ATWs k (using k = 20). We lem: the truly best algorithm, SM O.C.1.0.RBF.G.20, i.e. SVM can see clearly in the curves that ATWs k typically needs about half with an RBF kernel with kernel width (gamma) set to 20, is ranked the CV tests needed by ATWs. The comparison with all the other globally as algorithm 246 (of all 292). AT1 identifies it after 177 CV variants is even more impressive. tests. The curves for mean loss (instead of median loss) follow similar trends, but the values are 1-2% worse due to outliers (see Fig. 3 rel- ative to method ATWs k). Besides, this figure shows also the curves 5 Related Work in Other Scientific Areas associated with quartiles of 25% and 75% for ATWs k. As the num- ber of CV tests increases, the distance between the two curves de- In this section we briefly cover some work in other scientific areas creases and approaches the median curve. Similar behavior has been which is related to the problem tackled here and could provide further observed for ATWs and AT1 but we skip the curves in this text. insight into how to improve the method. One particular area is experiment design [6] and in particular ac- Algorithm trace. It is interesting to trace the iterations carried out tive learning. As discussed before, the method described here follows for one particular dataset. Table 2 shows the details for method AT1, the main trends that have been outlined in this literature. However, where abalone represents the new dataset. Column 1 shows the num- there is relatively little work on active learning for ranking tasks. ber of the iteration (thus the number of CV tests). Column 2 shows One notable exception is [15], who use the notion of Expected Loss the most promising competitor ak chosen in each step. Column 3 Optimization (ELO). Another work in this area is [4], whose aim was shows the index of ak in our initial ranking CA, and column 4 the to identify the most interesting substances for drug screening using a index of abest , the new best algorithm after the CV test has been minimum number of tests. In the experiments described, the authors performed. As such, if the values in column 3 and 4 are the same, have focused on the top-10 substances. Several different strategies then the most promising competitor has won the duel. For instance, were considered and evaluated. Our problem here is not ranking, but in step 2, SM O.C.1.0.P olynomial.E.3, i.e. SVM with complexity rather simply finding the best item (algorithm), so this work is only constant set to 1.0 and a 3rd degree polynomial kernel, (index 96) has partially relevant. been identified as the best competitor to be used (column 2), and af- Another relevant area is the so called multi-armed bandit prob- ter the CV test, it has won against Bagging.I.75..100.P ART , i.e. lem (MAB) studied in statistics and machine learning [9, 16]. This Bagging with a high number of iterations (between 75 and 100) and problem is often formulated in a setting that involves a set of tradi- PART as a base-learner. As such, it wins this round and becomes the tional slot machines. When a particular lever is pulled, a reward is new abest . Columns 5 and 6 show the actual rank of the competi- provided from a distribution associated with that specific lever. The tor and the winner on the abalone dataset. Column 7 shows the loss bandit problem is formally equivalent to a one-state Markov deci- compared to the optimal algorithm and the final column shows the sion process. The aim is to minimize regret after T rounds, which is number of datasets whose similarity measure is 1. defined as a difference between the reward sum associated with an We observe that after only 12 CV tests, the method has optimal strategy and the sum of collected rewards. Indeed, pulling 1.5 mean quantile.25 median quantile.75 1.0 Accuracy Loss (%) 0.5 0.0 1 2 4 8 16 32 64 128 292 Number of CV's Figure 3. Loss of ATWs k (k=20) as a function of the number of CV tests. Table 2. Trace of the steps taken by AT1 in the search for the supposedly best algorithm for the abalone dataset CV Algorithm used CA CA abalone abalone Loss D test (current best competitor, ak ) ak new abest ak new abest (%) size 1 Bagging.I.75..100.PART 1 1 75 75 1.9 75 2 SMO.C.1.0.Polynomial.E.3 96 96 56 56 1.6 29 3 AdaBoostM1.I.10.MultilayerPerceptron 92 92 47 47 1.5 34 4 Bagging.I.50..75.RandomForest 15 92 66 47 1.5 27 ··· ··· ··· ··· ··· ··· ··· ··· 10 LMT 6 6 32 32 1.1 45 11 LogitBoost.I.10.DecisionStump 81 6 70 32 1.1 51 12 Bagging.I.25..50.MultilayerPerceptron 12 12 2 2 0.2 37 13 LogitBoost.I.160.DecisionStump 54 12 91 2 0.2 42 ··· ··· ··· ··· ··· ··· ··· ··· 177 SMO.C.1.0.RBF.G.20 246 246 1 1 0 9 a lever can be compared to carrying out a CV test on a given algo- algorithms on the new dataset. However, the aim is to reduce the rithm. However, there is one fundamental difference between MAB number of tests to a minimum. This is done by carefully selecting and our setting: whereas in MAB the aim is to maximize the sum of which tests should be carried out, using the information of both past collected rewards, our aim it to maximize one reward, i.e. the reward and present algorithm evaluations represented in the form of relative associated with identifying the best algorithm. So again, this work is landmarks. only partially relevant. In our view this method incorporates several innovative features. To the best of our knowledge, no other work in this area has ad- First, it is an iterative process that uses the information in each CV dressed the issue of how to select a suitable algorithm from a large test to find the most promising next test based on a history of prior set of candidates. ‘algorithm duels’. In a tournament-style fashion, it starts with a cur- rent best (parameterized) algorithm, selects the most promising ri- val algorithm in each round, evaluates it on the given problem, and 6 Significance and Impact eliminates the algorithm that performs worse. Second, it continually focuses on the most similar prior datasets: those where the algorithm In this paper we have addressed the problem of selecting the best duels had a similar outcome to those on the new dataset. classification algorithm for a specific task. We have introduced a new Four variants of this basic approach, differing in their definition method, called active testing, that exploits information concerning of dataset similarity, were investigated in a very extensive experi- past evaluation results (metadata), to recommend the best algorithm ment setup involving 292 algorithm-parameter combinations on 76 using a limited number of tests on the new dataset. datasets. Our experimental results show that particularly versions Starting from an initial ranking of algorithms on previous datasets, ATWs and AT1 provide good recommendations using a small num- the method runs additional CV evaluations to test several competing ber of CV tests. When plotting the median loss as a function of the [14] Rui Leite and Pavel Brazdil, ‘Active testing strategy to predict the best number of CV tests (Fig. 1), it shows that both outperform all other classification algorithm via sampling and metalearning’, in Proceedings of the 19th European Conference on Artificial Intelligence - ECAI 2010, variants and baseline methods. They also outperform AT0, indicating (2010). that dataset similarity is an important aspect. [15] B. Long, O. Chapelle, Y. Zhang, Y. Chang, Z. Zheng, and B. Tseng, We also see that after only 10 CV tests (representing about 3% of ‘Active learning for rankings through expected loss optimization’, in all possible tests), the median loss is less than 0.5%. If we continue Proceedings of the SIGIR’10. ACM, (2010). to 60 tests (about 20% of all possible tests) the median loss is near 0. [16] A. Mahajan and D. Teneketzis, ‘Multi-armed bandit problems’, in Foundations and Applications of Sensor Management, eds., D. A. Cas- Similar trends can be observed when considering mean loss. tanon, D. Cochran, and K. Kastella, Springer-Verlag, (2007). The results support the hypothesis that we have formulated at the [17] D. Michie, D.J.Spiegelhalter, and C.C.Taylor, Machine Learning, Neu- outset of our work, that relative landmarks are indeed informative and ral and Statistical Classification, Ellis Horwood, 1994. can be used to suggest the best contender. If this is procedure is used [18] Tom M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997. [19] John R. Rice, ‘The algorithm selection problem’, volume 15 of Ad- iteratively, it can be used to accurately recommend a classification vances in Computers, 65 – 118, Elsevier, (1976). algorithm after a very limited number of CV tests. [20] Kate A. Smith-Miles, ‘Cross-disciplinary perspectives on meta- Still, we believe that the results could be improved further. Classi- learning for algorithm selection’, ACM Comput. Surv., 41(1), 1–25, cal information-theoretic measures and/or sampling landmarks could (2008). be incorporated into the process of identifying the most similar [21] Carlos Soares, Johann Petrak, and Pavel Brazdil, ‘Sampling-based rel- ative landmarks: Systematically test-driving algorithms before choos- datasets. This could lead to further improvements and forms part of ing’, in Proceedings of the 10th Portuguese Conference on Artificial our future plans. Intelligence (EPIA 2001), pp. 88–94. Springer, (2001). [22] J. Vanschoren and H. Blockeel, ‘A community-based platform for ma- chine learning experimentation’, in Machine Learning and Knowledge ACKNOWLEDGEMENTS Discovery in Databases, European Conference, ECML PKDD 2009, volume LNCS 5782, pp. 750–754. Springer, (2009). This work is funded (or part-funded) by the ERDF European Re- [23] Ricardo Vilalta and Youssef Drissi, ‘A perspective view and survey of meta-learning’, Artif. Intell. Rev., 18(2), 77–95, (2002). gional Development Fund through the COMPETE Programme (op- erational programme for competitiveness) and by National Funds through the FCT Fundação para a Ciência e a Tecnologia (Por- tuguese Foundation for Science and Technology) within project FCOMP - 01-0124-FEDER-022701. REFERENCES [1] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. [2] B.Pfahringer, H.Bensussan, and C. Giraud-Carrier, ‘Meta-learning by landmarking various learning algorithms’, in Proceedings of the 17th Int. Conf. on Machine Learning (ICML-2000, Stanford,CA, (2000). [3] P. Brazdil, C. Soares, and J. Costa, ‘Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results’, Machine Learn- ing, 50, 251–277, (2003). [4] K. De Grave, J. Ramon, and L. De Raedt, ‘Active learning for primary drug screening’, in Proceedings of Discovery Science. Springer, (2008). [5] J. Demsar, ‘Statistical comparisons of classifiers over multiple data sets’, The Journal of Machine Learning Research, 7, 1–30, (2006). [6] V. Fedorov, Theory of Optimal Experiments, Academic Press, New York, 1972. [7] Y. Freund, H. Seung, E. Shamir, and N. Tishby, ‘Selective sampling using the query by committee algorithm’, Machine Learning, 28, 133– 168, (1997). [8] Johannes Fürnkranz and Johann Petrak, ‘An evaluation of landmarking variants’, in Proceedings of the ECML/PKDD Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning (IDDM- 2001), pp. 57–68. Springer, (2001). [9] J. Gittins, ‘Multi-armed bandit allocation indices’, in Wiley Interscience Series in Systems and Optimization, John Wiley & Sons, Ltd., (1989). [10] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten, ‘The WEKA data mining software: an update’, SIGKDD Explor. Newsl., 11(1), 10–18, (2009). [11] H.Blockeel, ‘Experiment databases: A novel methodology for ex- perimental research’, in Leacture Notes on Computer Science 3933. Springer, (2006). [12] J.Fürnkranz and J.Petrak, ‘An evaluation of landmarking variants’, in Working Notes of ECML/PKDD 2000 Workshop on Integration Aspects of Data Mining, Decision Support and Meta-Learning, eds., C.Carrier, N.Lavrac, and S.Moyle, (2001). [13] R. Leite and P. Brazdil, ‘Predicting relative performance of classifiers from samples’, in ICML ’05: Proceedings of the 22nd international conference on Machine learning, pp. 497–503, New York, NY, USA, (2005). ACM Press.