=Paper=
{{Paper
|id=Vol-1201/paper-05
|storemode=property
|title=Using Metalearning to Predict When Parameter Optimization Is Likely to Improve Classification Accuracy
|pdfUrl=https://ceur-ws.org/Vol-1201/paper-05.pdf
|volume=Vol-1201
|dblpUrl=https://dblp.org/rec/conf/ecai/RiddG14
}}
==Using Metalearning to Predict When Parameter Optimization Is Likely to Improve Classification Accuracy==
Parker Ridd and Christophe Giraud-Carrier (Brigham Young University, USA; parker.ridd@byu.net, cgc@cs.byu.edu)

'''Abstract.''' Work on metalearning for algorithm selection has often been criticized because it mostly considers only the default parameter settings of the candidate base learning algorithms. Many have indeed argued that the choice of parameter values can have a significant impact on accuracy. Yet little empirical evidence exists to provide definitive support for that argument. Recent experiments do suggest that parameter optimization may indeed have an impact. However, the distribution of performance differences has a long tail, suggesting that in most cases parameter optimization has little effect on accuracy. In this paper, we revisit some of these results and use metalearning to characterize the situations when parameter optimization is likely to cause a significant increase in accuracy. In so doing, we show that 1) a relatively simple and efficient landmarker carries significant predictive power, and 2) metalearning for algorithm selection should be effected in two phases, the first in which one determines whether parameter optimization is likely to increase accuracy, and the second in which algorithm selection actually takes place.

===1 Introduction===

The availability of a large number of classification learning algorithms, together with the No Free Lunch theorem for classification, presents business users with a significant challenge, namely that of deciding which algorithm is likely to induce the most accurate model for their particular classification task. This selection process is further compounded by the fact that many classification learning algorithms include parameters, and the various possible settings of these parameters may give rise to models whose accuracy on the target classification task varies significantly. As a result, the algorithm selection problem in machine learning consists not only in choosing an algorithm, but rather in choosing an algorithm and an associated parameter setting. Formally, let:

* L = {L1, L2, ..., Ln} be a finite set of n classification learning algorithms (e.g., C4.5, Naïve Bayes (NB), Backpropagation (BP), Support Vector Machine (SVM)).
* P_Li = P^1_Li × P^2_Li × ... × P^ki_Li be the set of parameter settings associated with Li, where each P^k_Li represents one of the parameters of Li (e.g., P^1_C4.5 = pruning indicator, P^2_C4.5 = splitting criterion, ..., P^1_BP = number of hidden layers, P^2_BP = learning rate, P^3_BP = momentum term, ...).
* T = X × Y be a training set for a classification task, where X is a set of features and Y is a finite set of labels.
* I(L, P, T) be the model induced by algorithm L with parameter setting P on some classification learning task T. Hence, I maps objects in X to labels in Y.
* A(I) be the predictive accuracy of model I (typically measured by cross-validation).

The classification learning algorithm selection problem can be formulated as follows.

'''Classification Learning Algorithm Selection''' Given a training set T for some classification learning task, find the pair (L*, P*), where L* ∈ L and P* ∈ P_L*, such that

∀(L, P) ∈ L × P_L : A(I(L, P, T)) ≤ A(I(L*, P*, T)).

The above formulation is generic in that it says nothing about the process used to search the combined spaces of classification learning algorithms and parameter settings to find the optimal algorithm.
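One naive such search process simply enumerates every (algorithm, parameter setting) pair and scores it by cross-validation, exactly as the selection criterion above prescribes. The sketch below is only an illustration of that criterion, not the tooling used in this paper (which relies on Weka and RapidMiner); it assumes scikit-learn classifiers as stand-ins for L and small hand-picked grids as stand-ins for P_L.

<pre>
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Candidate algorithms L and finite parameter grids P_L (illustrative choices only).
CANDIDATES = {
    DecisionTreeClassifier: {"max_depth": [None, 5], "criterion": ["gini", "entropy"]},
    GaussianNB: {},
    SVC: {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
}

def select_algorithm(X, y, cv=10):
    """Return the (L*, P*) pair maximizing the cross-validated accuracy A(I(L, P, T))."""
    best_name, best_params, best_acc = None, None, -1.0
    for algo, grid in CANDIDATES.items():
        keys = list(grid)
        for combo in (product(*(grid[k] for k in keys)) if keys else [()]):
            params = dict(zip(keys, combo))
            acc = cross_val_score(algo(**params), X, y, cv=cv).mean()  # A(I(L, P, T))
            if acc > best_acc:
                best_name, best_params, best_acc = algo.__name__, params, acc
    return best_name, best_params, best_acc

X, y = load_iris(return_X_y=True)  # stand-in task T, for illustration only
print(select_algorithm(X, y))
</pre>

Metalearning, described next, replaces this exhaustive search with a model learned from past learning episodes.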
Metalearning for classification learning algorithm selection is the specific instance of that general problem wherein the search is effected by a learning algorithm [6]. In other words, the past performance of classification learning algorithms on a variety of tasks is used to induce a predictive model that takes as input a classification learning task and produces as output a classification learning algorithm and its associated parameter setting.

In practice, one has access to a set T = {T1, T2, ..., Tm} of m training sets (corresponding to m classification learning tasks). For each Tj, the learning algorithms in L, together with their parameter settings, are used one at a time to induce a model on Tj. The pair of classification learning algorithm and parameter setting that maximizes A(I(L, P, Tj)) is recorded. Each Tj with its corresponding winning pair becomes a training example for a (meta)learning algorithm. Since storing complete learning tasks is unfeasible and likely undesirable, one uses instead some characterization of learning tasks by meta-features. Meta-features may be drawn from basic statistical and information-theoretic measures (e.g., ratio of nominal features, skewness, class entropy) [14, 7, 22], landmarking measures (i.e., performances of simple learners that serve as signposts for more complex ones) [2, 17, 8], and model-based measures (e.g., properties of induced decision trees) [1, 3, 16]. Given a training set T for some classification learning task, let C(T) be the characterization of T by some set of meta-features.

One can now take a number of classification tasks, characterize them via the function C, and record the corresponding meta-features together with the accuracy of the best learning algorithm and associated parameter setting. The classification learning algorithm selection problem, as solved by metalearning, can then be reformulated as follows. (We recognize that it is possible to use metalearning to predict rankings of learning algorithms (e.g., see [5, 23]), or even the actual performance of learning algorithms via regression (e.g., see [4, 10, 20]). We restrict our attention here to the prediction of a single best algorithm, although much of the discussion extends naturally to these settings.)

'''Metalearning for Classification Learning Algorithm Selection''' Given a set T^m = {(C(T), argmax_{L ∈ L, P ∈ P_L} A(I(L, P, T)))} of past classification learning episodes, together with a classification learning algorithm L^m, which may belong to L, and an associated parameter setting P_{L^m}:

# Construct M = I(L^m, P_{L^m}, T^m).
# For any training set T′, (L*, P*) = M(T′).

By convention, let P^0_Li denote the default parameter setting of Li, as determined by the implementation of Li under consideration (e.g., Weka, IBM SPSS Modeler).
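The two steps above can be sketched as follows. This is only a schematic rendering of the formulation, assuming that the characterization function C and the per-task winning pairs have already been computed elsewhere; `characterize` and `best_pair` are hypothetical helpers, and a decision tree stands in for the metalearner L^m.

<pre>
from sklearn.tree import DecisionTreeClassifier

def build_metadataset(tasks, characterize, best_pair):
    """T^m = {(C(T), argmax_{L,P} A(I(L, P, T)))}: one meta-example per task.

    `characterize(T)` is assumed to return the meta-feature vector C(T), and
    `best_pair(T)` a label encoding the winning (algorithm, parameter setting).
    """
    X_meta = [characterize(T) for T in tasks]
    y_meta = [best_pair(T) for T in tasks]
    return X_meta, y_meta

def train_metamodel(X_meta, y_meta):
    # Step 1: construct M = I(L^m, P_{L^m}, T^m); here L^m is a decision tree learner.
    return DecisionTreeClassifier().fit(X_meta, y_meta)

# Step 2: for a new training set T', predict (L*, P*) = M(T') from its meta-features:
#   metamodel.predict([characterize(T_prime)])[0]
</pre>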
Most of the work in metalearning so far has addressed the above problem with the further assumption that for all learning algorithms the parameter setting is fixed to its default. This, of course, creates a much restricted, yet also greatly simplified, version of the selection problem, since the large, possibly infinite, space of parameter settings need not be considered at all. However, that restriction has also been the source of much criticism, and sometimes dismissal, by a part of the machine learning community, who has maintained that:

'''Parameter Optimization Claim''' Parameter setting has a significant impact (for the better) on the predictive accuracy of classification learning algorithms.

It would seem that most metalearning researchers, and indeed most machine learning researchers, have taken this claim to be well founded, and considered ways to address it. There have been two main approaches.

* '''Two-stage Metalearning.''' Some metalearning researchers have adopted a two-stage approach to metalearning, wherein they continue to use the restricted form of the Metalearning for Classification Learning Algorithm Selection problem to choose a learning algorithm, but then follow up with an optimization phase to find the best set of parameter values for the selected algorithm. (Some have also simply picked a classification learning algorithm manually, and used metalearning to choose the best parameter setting; e.g., see [9, 19].) The problem in this case, however, is that one may reach a suboptimal solution. Indeed, let L1 and L2 be two classification learning algorithms such that, for some classification task T′, M(T′) = (L1, P^0_L1). Then, L1 would be selected and its parameter setting optimized to P*_L1. Yet, despite the fact that A(I(L1, P^0_L1, T′)) > A(I(L2, P^0_L2, T′)) (assuming the metalearner is accurate), it is entirely possible that there exists some parameter setting P^k_L2 of algorithm L2 such that A(I(L1, P*_L1, T′)) < A(I(L2, P^k_L2, T′)). In other words, the early greedy commitment to L1 makes it impossible to explore other parts of the combined spaces of learning algorithms and parameter settings.
* '''Unrestricted Metalearning.''' Other metalearning researchers have lifted the traditional restriction, and recently begun to design solutions for the unrestricted Metalearning for Classification Learning Algorithm Selection problem. Their systems seek to select both a learning algorithm and an associated parameter setting (e.g., see [13, 24, 21]).

Interestingly, and somewhat surprisingly, very few researchers have bothered to check the validity of the Parameter Optimization Claim. Yet, to the best of our knowledge, and as presented by most, it is just that: a claim. We have been hard-pressed to find any systematic study in the literature that addresses the impact of parameter settings over a wide variety of classification learning algorithms. What if the Parameter Optimization Claim does not hold in general? What if it only holds in some specific cases? Would it not be useful to know what these cases are? And what about using metalearning to characterize when the claim holds? We address these questions here.
===2 Impact of Parameter Optimization on Classification Learning Algorithm Performance===

There is one exception to our statement that the literature contains no systematic study of the impact of parameter optimization on the performance of classification learning algorithms; it is found in [23]. In that paper, the authors considered 466 datasets and, for each, computed the difference in accuracy between the default parameter setting and the best possible parameter setting after optimization. The optimization procedure is based on particle swarm optimization (PSO), wherein the authors specify which parameters should be optimized (e.g., kernel function and complexity constant for SVM) and the range of values that PSO should consider. We obtained the data from the authors and reproduced their results, with two small exceptions: 1) we found that one of the datasets in the list was redundant, so we removed it; and 2) our dataset characterization tool (see below) did not terminate on two of the datasets after several days, so we stopped it and omitted the corresponding datasets from our study. Hence, the results here are for 463 datasets. For each dataset, Figure 1 shows the percentage of improvement of the best AUC score among 20 classification learning algorithms after parameter optimization over the best AUC score among the same classification learning algorithms with their default parameter setting. The data is ordered by increasing value of improvement.

[Figure 1: Impact of Parameter Optimization on 463 Datasets. The y-axis shows the % improvement in best AUC for each dataset, ordered by increasing improvement.]

Figure 1 makes it clear that the impact of parameter optimization is highly variable across datasets, and seems to be rather small for a large number of them. We are a little surprised that, with such a skewed distribution, the authors of [23] concluded that "the result demonstrates the benefit of using the performances of optimised algorithms for generating algorithm rankings", and thus carried on with a blanket application of their combined classification learning algorithm / parameter setting selection approach.

To make the relationship even clearer, consider Figure 2, which shows the cumulative distribution of the same datasets, where each successive bar represents the proportion of datasets for which the improvement in AUC due to parameter optimization is less than or equal to the value indicated on the x-axis, in increments of 1%.

[Figure 2: Cumulative Distribution of the Impact of Parameter Optimization on 463 Datasets. Bars give the percentage of datasets whose improvement is at most the x-axis value, from 0% to 15% in 1% increments.]
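The cumulative view in Figure 2 is easy to reproduce from per-dataset improvement values. The sketch below shows the computation on made-up numbers, since the 463 improvement percentages themselves are not listed in the paper.

<pre>
import numpy as np

def cumulative_improvement_distribution(improvements, max_pct=15):
    """Fraction of datasets whose AUC improvement is <= x%, for x = 0, 1, ..., max_pct."""
    imp = np.asarray(improvements, dtype=float)
    return {x: float(np.mean(imp <= x)) for x in range(max_pct + 1)}

# Illustrative values only (not the paper's data).
example_improvements = [0.0, 0.0, 0.3, 0.8, 1.2, 2.5, 3.1, 4.7, 6.0, 12.4]
for x, frac in cumulative_improvement_distribution(example_improvements).items():
    print(f"improvement <= {x:2d}%: {100 * frac:5.1f}% of datasets")
</pre>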
According to Figure 2, for 19% of the datasets considered, parameter optimization offers no gain in performance. The ascent is actually very steep, as shown by the shaded portion of the distribution, reaching 80% of the datasets for an improvement in performance of no more than 5%. From 0% to 5%, the relationship is virtually linear (r=0.999) with a slope of 12. These results seem robust, as an independent analysis of 129 datasets and 9 classification learning algorithms reveals that there is no gain in performance with optimization for about 15% of the datasets, and 96% of the datasets exhibit less than 5% improvement. (There may be some overlap in the datasets used in this study and those used in [23], although the datasets were not transformed into binary classification tasks in the former as they were in the latter.)

We note that the above analysis was not performed on a per-algorithm basis. As stated, the differences in performance are computed from among the best of 20 algorithms, which means that the best optimized version could be obtained with algorithm A, while the best default version for the same dataset would be obtained with algorithm B. It is possible, however, that some classification learning algorithms are more sensitive to parameter settings and may thus be more likely to exhibit significant differences with parameter optimization. It may be worthwhile in a future study to consider such algorithm-level variation. In spite of this limitation, several conclusions seem inescapable from the foregoing analysis:

# Parameter optimization does not improve performance uniformly across datasets.
# For many datasets, parameter optimization yields very little improvement and may be viewed as computational overkill.
# For a few datasets, parameter optimization makes a significant difference and should thus be performed.

From a practitioner's standpoint, the last two conclusions are particularly relevant. If no improvement is to be expected from parameter optimization, then one would gladly save the extra computational time required to effect it. Conversely, if a significant improvement can be expected and one is focused on maximizing predictive accuracy, then one would be willing to bear the extra cost. The question then, for such practitioners, is: Will the predictive accuracy on the classification learning task I am considering be improved by parameter optimization?

It should be obvious to machine learning researchers that if one labels the datasets falling under conclusion 2 above as no advantage and the datasets falling under conclusion 3 as advantage, one obtains a training dataset for a classification learning task. We propose to do exactly that and essentially take our own medicine, by applying machine learning to the problem of distinguishing between datasets that may benefit from parameter optimization and datasets that would not. Because our data consists of information about the performance of learning algorithms, this is an instance of metalearning.

===3 Metalearning the Impact of Parameter Optimization===

As per standard metalearning practice, we build our training metadata by characterizing each of our 463 datasets by a set of meta-features. However, because smaller datasets may produce unreliable meta-features, we remove from the analysis all of the datasets containing less than 100 instances. This leaves us with 326 datasets. We use an existing R script that, for each dataset, creates a meta-feature vector by extracting over 68 meta-features, including statistical and information-theoretic meta-features, landmarkers, model-based meta-features, and timing information [18]. Prior to our experiments, we manually remove a number of meta-features that carry little information in this context (e.g., timing data, model-based meta-features, redundant meta-features). In each experiment, we also use correlation-based feature subset selection (CFS) [11], as implemented by the function CfsSubsetEval in Weka [12], to further reduce the number of meta-features.

The metalearning task consists in distinguishing datasets where parameter optimization is deemed to offer no advantage (class 0) from datasets where parameter optimization is likely to yield a performance advantage (class 1). Each meta-feature vector is labeled appropriately based on the observed difference in performance over its corresponding dataset. We use various threshold values to separate class 1 from class 0, as shown below.
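Concretely, the labeling step amounts to thresholding the observed improvement for each dataset; a minimal sketch (not the authors' code), assuming the per-dataset improvement percentages are available as a list:

<pre>
def label_by_threshold(improvements, threshold):
    """Class 1 if parameter optimization improved the dataset's performance by
    more than `threshold` percent, class 0 otherwise."""
    return [1 if imp > threshold else 0 for imp in improvements]

# For the first experiment below, the threshold is 1.5%:
#   y_meta = label_by_threshold(observed_improvements, threshold=1.5)
</pre>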
There are two types of error our (meta)models can make. A false positive (FP) error is made when the model considers a class 0 dataset but labels it as class 1. When this happens, there is wasted effort in performing parameter optimization when no significant gain will be achieved by such optimization. Conversely, a false negative (FN) error is made when the model considers a class 1 dataset but labels it as class 0. When this happens, valuable parameter optimization is omitted and there is a resulting loss in predictive accuracy in the classification task under consideration. Assuming that a practitioner's ultimate goal is to get the best accuracy possible on their specific classification task, one can argue that FN errors are more costly than FP errors, and thus should be minimized. This is, of course, equivalent to maximizing recall, R = TP / (TP + FN), where TP is the number of true positive classifications. Yet, one must be careful, as it is trivial to maximize R by simply assigning all classification tasks to class 1. Clearly, this would defeat the purpose of the model and is unacceptable due to the unnecessary computational burden it places on the system. Hence, we focus on obtaining models with high recall, but also good precision, P = TP / (TP + FP).

We are faced at the metalevel with the same challenge of selecting a (meta)learning algorithm adequate for the task at hand. We are guided here by a desire for comprehensibility of the induced model and the results of preliminary experiments. Hence, we use for our metalearner a decision tree learning algorithm, specifically Weka's J48, and implement the training and testing processes in RapidMiner [15]. Interestingly, J48 also resulted in higher accuracy than other algorithms such as SVM and Random Forest.

====3.1 Threshold = 1.5====

We begin by setting the threshold relatively low, assuming that there is an advantage to parameter optimization if the performance improvement exceeds 1.5%.

The default accuracy in this case is 61.96%, obtained by predicting class 1 uniformly. This, of course, produces maximum recall, R=100%, but very poor precision, P=61.96%.

Applying CFS selects the following mixed set of meta-features: attributes, kurtosis, joint entropy, NB, LDA and NN1. Further experimentation shows that performance can be improved further by using only joint entropy, NB and NN1. The confusion matrix is as follows.

<pre>
             Predicted
               0     1
Actual   0    84    40
         1    28   174
</pre>

This yields an accuracy of 79.17%, significantly above default, with both high recall, R=86.14%, and high precision, P=81.31%. The decision tree is shown below.

<pre>
nn_1 <= 0.902439
|   joint_entropy <= 0.606044: 0 (14.0/2.0)
|   joint_entropy > 0.606044
|   |   naive_bayes <= 0.952: 1 (208.0/30.0)
|   |   naive_bayes > 0.952
|   |   |   nn_1 <= 0.71134
|   |   |   |   joint_entropy <= 2.08461: 0 (2.0)
|   |   |   |   joint_entropy > 2.08461: 1 (5.0)
|   |   |   nn_1 > 0.71134: 0 (12.0/2.0)
nn_1 > 0.902439: 0 (85.0/15.0)
</pre>

To further test the metamodel, we collected 42 independent datasets with more than 100 instances each. It is possible that some of these correspond to some of the tasks used in the training data. However, all 463 training tasks have been binarized [23], while these were left untouched. Hence, we are training on 2-class datasets and testing on n-class datasets, which may be somewhat unfair, but still interesting.

We extracted the meta-features of the 42 datasets, and ran the corresponding meta-feature vectors against the metamodel. (As the results for the test datasets were obtained with a different set of classification learning algorithms and a different parameter optimization procedure, we are not entirely sure the comparison is fair. Yet, we feel there is value in including these results.) The accuracy on the test datasets is 54.76%, which is rather poor given a default accuracy of 73.81%. However, the default is obtained by predicting class 0 uniformly. While the test accuracy is not very good, recall is very high at R=90.91% (compared to R=0 for the default), with 10 of the 11 datasets of class 1 being predicted in class 1. This suggests that the model has picked up useful information to identify learning tasks where parameter optimization would be advantageous (i.e., expected improvement greater than 1.5%).

In addition to its own performance at the metalevel, we consider how the metamodel affects performance at the base level. To do so, we compare maxImp, the total amount of possible improvement due to parameter optimization, to predImp, the amount of improvement one would obtain by using the metamodel. For each dataset d, let l_d be d's label, p_d be d's predicted class, and I_d be the improvement one would experience if parameter optimization were used with d. Then,

maxImp = Σ_d I_d

i.e., maxImp is the sum of the individual improvement values across all of our 326 datasets. Here, maxImp = 949.26. Similarly,

predImp = Σ_{d : p_d = l_d = 1} I_d

i.e., predImp is the sum of the improvement values for all datasets where the metamodel makes the correct class 1 prediction. We do not include correct class 0 predictions as these would artificially inflate the results. Here, predImp = 789.92. Hence, the metamodel would allow us to claim 83.21% of the total performance improvement available at the base level.
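The maxImp/predImp comparison reduces to a few lines of code; the sketch below assumes parallel lists holding, for each dataset d, the improvement I_d, the true label l_d, and the metamodel's prediction p_d.

<pre>
def improvement_captured(improvements, labels, predictions):
    """Return (maxImp, predImp, fraction of base-level improvement captured)."""
    max_imp = sum(improvements)                      # maxImp = sum of all I_d
    pred_imp = sum(i for i, l, p in zip(improvements, labels, predictions)
                   if p == l == 1)                   # only correct class 1 predictions
    return max_imp, pred_imp, pred_imp / max_imp

# With the figures reported above, 789.92 / 949.26 ≈ 0.8321, i.e., 83.21%.
</pre>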
====3.2 Threshold = 2.5====

We next raise the threshold, assuming that there is an advantage to parameter optimization if the performance improvement exceeds 2.5%.

The default accuracy in this case is 50.31%, obtained by predicting class 1 uniformly. This, again, produces maximum recall, R=100%, but very poor precision, P=50.31%.

Applying CFS selects the following mixed set of meta-features: kurtosis prep, normalized attribute entropy, joint entropy, NB, LDA, stump min gain and NN1. Further experimentation shows that performance can be improved further by using only kurtosis prep, normalized attribute entropy, joint entropy, NB and NN1. The confusion matrix is as follows.

<pre>
             Predicted
               0     1
Actual   0   105    57
         1    26   138
</pre>

This yields an accuracy of 74.52%, significantly above default, with both high recall, R=84.15%, and high precision, P=70.77%. The decision tree is shown below.

<pre>
nn_1 <= 0.857143
|   joint_entropy <= 0.606044
|   |   joint_entropy <= 0.508598
|   |   |   kurtosis_prep <= 26.479217: 1 (2.0)
|   |   |   kurtosis_prep > 26.479217: 0 (2.0)
|   |   joint_entropy > 0.508598: 0 (8.0)
|   joint_entropy > 0.606044
|   |   naive_bayes <= 0.969466: 1 (194.0/49.0)
|   |   naive_bayes > 0.969466
|   |   |   normalized_attribute_entropy <= 0.84991
|   |   |   |   kurtosis_prep <= 45.518635: 0 (2.0)
|   |   |   |   kurtosis_prep > 45.518635: 1 (2.0)
|   |   |   normalized_attribute_entropy > 0.84991: 0 (7.0)
nn_1 > 0.857143: 0 (109.0/15.0)
</pre>

As with the threshold value of 1.5, using the metamodel against the test datasets produces poor accuracy (35.71% for a default of 88.10%), but recall is significantly better, with R=60% (3 of the 5 datasets in class 1 are predicted correctly) against a default of R=0. Considering performance at the base level, we have here predImp = 698.61, so that the metamodel would still allow us to claim 73.60% of the total performance improvement available.
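As a quick sanity check, the recall and precision reported above for the 2.5% threshold can be recomputed directly from the confusion matrix (class 1 taken as positive).

<pre>
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Confusion matrix above: TP = 138, FP = 57, FN = 26 (TN = 105).
p, r = precision_recall(tp=138, fp=57, fn=26)
print(f"P = {100 * p:.2f}%, R = {100 * r:.2f}%")  # approximately 70.77% and 84.15%
</pre>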
====3.3 Larger Threshold Values====

Based on the distribution of performance improvements, raising the threshold will cause the majority class to shift to 0 and result in far fewer class 1 datasets. On the other hand, although such datasets are fewer, correctly identifying them offers the highest benefit to the practitioner, as the expected improvement in accuracy with parameter optimization is rather significant. However, the metalearning task is also more difficult, as the class distribution of the training data is rather skewed.

Setting the threshold to 5.0% results in a default accuracy of 83.44% with a model that predicts class 0 uniformly. While it was not possible in this case to improve on the default accuracy with the available set of meta-features, recall rose to R=25.94% (10 of the 54 datasets in class 1 are predicted correctly). Interestingly, CFS selected LDA and NN1, and while the tree induced without feature selection is a little larger than the previous ones, it also splits on NN1 at the root node. Using the metamodel against the test datasets produces reasonable accuracy (76.19%), but only because the test data is skewed in favor of class 0. Only two datasets belong to class 1 and neither is predicted correctly.

Setting the threshold to 10.0% results in a default accuracy of 96.93%, again with a model that predicts class 0 uniformly. Only 10 datasets belong to class 1, making for a very imbalanced class distribution. Using all the meta-features, the induced decision tree has a slightly improved accuracy (97.85%), with recall R=30%. The root of the tree is mutual information. There are no datasets in our test set for which the improvement with parameter optimization exceeds 10%.

===4 Discussion===

As mentioned previously, one of the biggest issues with parameter optimization is the amount of time it takes to optimize the parameters. In [24], it took 6,000 single-core CPU hours to optimize the parameters of all their datasets, which means that it took on average about 13 hours per dataset. For obvious reasons, a practitioner would probably not wish to wait for that amount of time when it may result in little or no improvement over the model's baseline performance.

What this short study demonstrates is that it is possible to predict, with good recall, whether parameter optimization will significantly increase a classification learning model's accuracy by using the meta-features of any particular dataset, which are less computationally intensive than the parameter optimization procedure. Of course, as the threshold increases, or as the expected performance improvement gets larger, the prediction becomes less accurate. Nevertheless, it is better than default.

One of the unexpected findings of our study, valid across almost all metamodels, is that the root node of the decision tree is NN1. In other words, a dataset's performance on NN1 is indicative of whether parameter optimization will significantly improve the performance of learning algorithms trained against it. There are at least two interesting things about this finding. First, it confirms prior work on metalearning that suggests that landmarking meta-features are generally better than other meta-features, especially NN1 and NB [20]. Second, and significantly more valuable, is that NN1 is a very efficient algorithm, suggesting that it would be very fast to determine whether parameter optimization is likely to improve accuracy for any dataset.

To further test the predictive power of NN1, we ran Weka's linear regression on our dataset, without CFS, and with the target set to the actual percentage of accuracy improvement (Imp) observed for each dataset. The resulting model is shown below.

<pre>
Imp =  11.4405 * class_prob_min +
        0.1626 * skewness_prep +
       -0.0051 * kurtosis_prep +
       -2.003  * cancor_1 +
       -6.3391 * class_entropy +
       -8.2398 * mutual_information +
       -0.0008 * noise_signal_ratio +
       -0.5321 * attribute_mean +
       -5.7852 * stump_min +
       -4.8382 * stump_max +
       13.3081 * stump_mean +
        8.4397 * stump_sd +
        3.3769 * stump_min_gain +
       -7.5918 * nn_1 +
        6.4233
</pre>

The correlation coefficient of the model is 0.40, and the relatively large multiplicative factor associated with NN1 in the model suggests its influence on the value of Imp. Furthermore, if we regress Imp on NN1 alone, we obtain the following very simple model.

<pre>
Imp = -8.3476 * nn_1 + 9.3173
</pre>

The correlation coefficient of this simpler model is 0.44. There is still error in this model, and it will be interesting to see whether using only the NN1 meta-feature could in general reasonably predict a dataset's improvement if classification learning algorithm parameters are optimized.
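Since NN1 is a landmarker, i.e., the performance of a 1-nearest-neighbor learner on the dataset (Section 1), it is cheap to compute. The sketch below, which uses scikit-learn as a stand-in for the Weka/R tooling of the paper and assumes NN1 is estimated by cross-validated accuracy, pairs the landmarker with the single-feature model reported above.

<pre>
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def nn1_landmarker(X, y, cv=10):
    """NN1 meta-feature: cross-validated accuracy of a 1-nearest-neighbor classifier."""
    return cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=cv).mean()

def predicted_improvement(nn_1):
    """Single-feature linear model from the paper: Imp = -8.3476 * nn_1 + 9.3173."""
    return -8.3476 * nn_1 + 9.3173

X, y = load_iris(return_X_y=True)  # stand-in dataset, for illustration only
nn_1 = nn1_landmarker(X, y)
print(f"NN1 = {nn_1:.3f}, predicted improvement ≈ {predicted_improvement(nn_1):.2f}%")
</pre>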
===5 Conclusion===

We have shown that the observed improvement in performance due to parameter optimization in classification learning has a rather skewed distribution, with most datasets seeing none or very little improvement. However, there remains a portion of datasets for which parameter optimization has a significant impact. We have used metalearning to build a model capable of predicting whether parameter optimization can be expected to improve classification accuracy. As a result, it would seem that both current approaches to metalearning, the one that ignores parameter optimization and considers only default settings, and the other that applies parameter optimization in all cases, are misinformed. Instead, a two-phase approach, where one first determines whether parameter optimization is likely to produce an improvement, and then if so applies it, is more appropriate.

Furthermore, one unexpected and very interesting side effect of our metalearning study is that our results show that good recall can be achieved at the metalevel, and that with a very small set of efficient meta-features. In particular, it would appear that NN1 may serve as a great landmarker in this context.

For future work, it may be valuable to revisit error costs. In particular, we have argued here that FN errors were more costly than FP errors on the basis that practitioners are likely to wish to obtain the highest possible accuracy on their classification tasks. This is certainly also true when the computational cost of parameter optimization is prohibitive. Were computational cost not to matter, the best decision would be to always perform parameter optimization, and there would consequently be no need for a metamodel. In practice, however, it is well known that parameter optimization is computationally intensive. As a result, FP errors may actually be rather costly, and possibly even more so than FN errors. In any case, there would seem to exist a range of operating conditions between these extremes, or different cost-benefit trade-offs between FN and FP errors associated with parameter optimization. It may be interesting to perform a cost-sensitive analysis at the metalevel, for example, relating expected gain in performance (or threshold) and incurred cost. We cannot do this with the data available, since we only have the final performance improvement (after some fixed search time). We would need to re-run experiments to gather information about improvement over the time allowed for optimization.
===References===

* [1] Bensusan, H. (1998). Odd Bites into Bananas Don't Make You Blind: Learning about Simplicity and Attribute Addition. In Proceedings of the ECML'98 Workshop on Upgrading Learning to the Meta-level: Model Selection and Data Transformation, 30-42.
* [2] Bensusan, H. and Giraud-Carrier, C. (2000). Discovering Task Neighbourhoods through Landmark Learning Performances. In Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases, LNAI 1910, 325-330.
* [3] Bensusan, H., Giraud-Carrier, C. and Kennedy, C. (2000). A Higher-order Approach to Meta-learning. In Proceedings of the ECML-2000 Workshop on Meta-learning: Building Automatic Advice Strategies for Model Selection and Method Combination, 109-118.
* [4] Bensusan, H. and Kalousis, A. (2001). Estimating the Predictive Accuracy of a Classifier. In Proceedings of the Twelfth European Conference on Machine Learning (LNCS 2167), 25-36.
* [5] Brazdil, P., Soares, C. and Pinto da Costa, J. (2003). Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results. Machine Learning, 50(3):251-277.
* [6] Brazdil, P., Giraud-Carrier, C., Soares, C. and Vilalta, R. (2009). Metalearning: Applications to Data Mining, Springer.
* [7] Engels, R. and Theusinger, C. (1998). Using a Data Metric for Offering Preprocessing Advice in Data-mining Applications. In Proceedings of the Thirteenth European Conference on Artificial Intelligence, 430-434.
* [8] Fuernkranz, J. and Petrak, J. (2001). An Evaluation of Landmarking Variants. In Proceedings of the ECML/PKDD-01 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-learning, 57-68.
* [9] Gomes, T.A.F., Prudêncio, R.B.C., Soares, C., Rossi, A.L.D. and Carvalho, A. (2012). Combining Meta-learning and Search Techniques to Select Parameters for Support Vector Machines. Neurocomputing, 75:3-13.
* [10] Guerra, S.B., Prudêncio, R.B.C. and Ludermir, T.B. (2008). Predicting the Performance of Learning Algorithms Using Support Vector Machines as Meta-regressors. In Proceedings of the Eighteenth International Conference on Artificial Neural Networks (LNCS 5163), 523-532.
* [11] Hall, M.A. (1999). Correlation-based Feature Subset Selection for Machine Learning. PhD Thesis, Department of Computer Science, The University of Waikato, Hamilton, New Zealand.
* [12] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10-18.
* [13] Leite, R., Brazdil, P. and Vanschoren, J. (2012). Selecting Classification Algorithms with Active Testing. In Proceedings of the Eighth International Conference on Machine Learning and Data Mining in Pattern Recognition (LNCS 7376), 117-131.
* [14] Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994). Machine Learning, Neural and Statistical Classification, Ellis Horwood.
* [15] North, M. (2012). Data Mining for the Masses, Global Textbook Project.
* [16] Peng, Y., Flach, P.A., Brazdil, P. and Soares, C. (2002). Improved Data Set Characterisation for Meta-learning. In Proceedings of the Fifth International Conference on Discovery Science, 141-152.
* [17] Pfahringer, B., Bensusan, H. and Giraud-Carrier, C. (2000). Meta-learning by Landmarking Various Learning Algorithms. In Proceedings of the Seventeenth International Conference on Machine Learning, 743-750.
* [18] Reif, M. (2012). A Comprehensive Dataset for Evaluating Approaches of various Meta-Learning Tasks. In Proceedings of the First International Conference on Pattern Recognition Applications and Methods, 273-276.
* [19] Reif, M., Shafait, F. and Dengel, A. (2012). Meta-learning for Evolutionary Parameter Optimization of Classifiers. Machine Learning, 87(3):357-380.
* [20] Reif, M., Shafait, F., Goldstein, M., Breuel, T. and Dengel, A. (2014). Automatic Classifier Selection for Non-experts. Pattern Analysis & Applications, 17(1):83-96.
* [21] Smith, M.R., Mitchell, L., Giraud-Carrier, C. and Martinez, T. (2014). Recommending Learning Algorithms and Their Associated Hyperparameters. Submitted.
* [22] Sohn, S.Y. (1999). Meta Analysis of Classification Algorithms for Pattern Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(11):1137-1144.
* [23] Sun, Q. and Pfahringer, B. (2013). Pairwise Meta-rules for Better Meta-learning-based Algorithm Ranking. Machine Learning, 93(1):141-161.
* [24] Thornton, C., Hutter, F., Hoos, H.H. and Leyton-Brown, K. (2013). Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In Proceedings of the Nineteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847-855.