Using Metalearning to Predict When Parameter Optimization Is Likely to Improve Classification Accuracy

Parker Ridd¹ and Christophe Giraud-Carrier²

¹ Brigham Young University, USA, email: parker.ridd@byu.net
² Brigham Young University, USA, email: cgc@cs.byu.edu

Abstract. Work on metalearning for algorithm selection has often been criticized because it mostly considers only the default parameter settings of the candidate base learning algorithms. Many have indeed argued that the choice of parameter values can have a significant impact on accuracy. Yet little empirical evidence exists to provide definitive support for that argument. Recent experiments do suggest that parameter optimization may indeed have an impact. However, the distribution of performance differences has a long tail, suggesting that in most cases parameter optimization has little effect on accuracy. In this paper, we revisit some of these results and use metalearning to characterize the situations when parameter optimization is likely to cause a significant increase in accuracy. In so doing, we show that 1) a relatively simple and efficient landmarker carries significant predictive power, and 2) metalearning for algorithm selection should be effected in two phases, the first in which one determines whether parameter optimization is likely to increase accuracy, and the second in which algorithm selection actually takes place.

1   Introduction

The availability of a large number of classification learning algorithms, together with the No Free Lunch theorem for classification, presents business users with a significant challenge, namely that of deciding which algorithm is likely to induce the most accurate model for their particular classification task. This selection process is further compounded by the fact that many classification learning algorithms include parameters, and the various possible settings of these parameters may give rise to models whose accuracy on the target classification task varies significantly. As a result, the algorithm selection problem in machine learning consists not only in choosing an algorithm, but rather in choosing an algorithm and an associated parameter setting. Formally, let:

• L = {L_1, L_2, ..., L_n} be a finite set of n classification learning algorithms (e.g., C4.5, Naïve Bayes (NB), Backpropagation (BP), Support Vector Machine (SVM)).
• P_{L_i} = P^1_{L_i} × P^2_{L_i} × ... × P^{k_i}_{L_i} be the set of parameter settings associated with L_i, where each P^k_{L_i} represents one of the parameters of L_i (e.g., P^1_{C4.5} = pruning indicator, P^2_{C4.5} = splitting criterion, ..., P^1_{BP} = number of hidden layers, P^2_{BP} = learning rate, P^3_{BP} = momentum term, ...).
• T = X × Y be a training set for a classification task, where X is a set of features and Y is a finite set of labels.
• I(L, P, T) be the model induced by algorithm L with parameter setting P on some classification learning task T. Hence, I maps objects in X to labels in Y.
• A(I) be the predictive accuracy of model I (typically measured by cross-validation).

The classification learning algorithm selection problem can be formulated as follows.

Classification Learning Algorithm Selection. Given a training set T for some classification learning task, find the pair (L*, P*), where L* ∈ L and P* ∈ P_{L*}, such that

    ∀(L, P) ∈ L × P_L,  A(I(L, P, T)) ≤ A(I(L*, P*, T)).

The above formulation is generic in that it says nothing about the process used to search the combined spaces of classification learning algorithms and parameter settings to find the optimal algorithm. Metalearning for classification learning algorithm selection is the specific instance of that general problem wherein the search is effected by a learning algorithm [6]. In other words, the past performance of classification learning algorithms on a variety of tasks is used to induce a predictive model that takes as input a classification learning task and produces as output a classification learning algorithm and its associated parameter setting.

In practice, one has access to a set T = {T_1, T_2, ..., T_m} of m training sets (corresponding to m classification learning tasks). For each T_j, the learning algorithms in L, together with their parameter settings, are used one at a time to induce a model on T_j. The pair of classification learning algorithm and parameter setting that maximizes A(I(L, P, T_j)) is recorded. Each T_j, with its corresponding winning pair, becomes a training example for a (meta)learning algorithm. Since storing complete learning tasks is unfeasible and likely undesirable, one uses instead some characterization of learning tasks by meta-features. Meta-features may be drawn from basic statistical and information-theoretic measures (e.g., ratio of nominal features, skewness, class entropy) [14, 7, 22], landmarking measures (i.e., performances of simple learners that serve as signposts for more complex ones) [2, 17, 8], and model-based measures (e.g., properties of induced decision trees) [1, 3, 16]. Given a training set T for some classification learning task, let C(T) be the characterization of T by some set of meta-features.

One can now take a number of classification tasks, characterize them via the function C, and record the corresponding meta-features together with the accuracy of the best learning algorithm and associated parameter setting, as sketched in the code below.
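To make this construction concrete, the following Python sketch records, for each available task, its meta-features together with the best-performing (algorithm, parameter setting) pair. It is illustrative only: the scikit-learn estimators stand in for the algorithms in L, the toy compute_meta_features function stands in for the characterization function C, and the parameter grids are hypothetical.

# Illustrative sketch of meta-data construction for algorithm selection.
from itertools import product

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Candidate algorithms L with small, purely illustrative parameter grids P_L.
CANDIDATES = {
    "tree": (DecisionTreeClassifier, {"max_depth": [None, 5, 10]}),
    "nb": (GaussianNB, {}),
    "svm": (SVC, {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}),
}

def compute_meta_features(X, y):
    """Toy stand-in for C(T): a few simple statistical meta-features."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return {
        "n_instances": X.shape[0],
        "n_attributes": X.shape[1],
        "class_entropy": float(-(p * np.log2(p)).sum()),
    }

def best_pair(X, y):
    """Return the (algorithm name, parameter setting) maximizing A(I(L, P, T))."""
    best_name, best_params, best_acc = None, None, -1.0
    for name, (algo, grid) in CANDIDATES.items():
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            acc = cross_val_score(algo(**params), X, y, cv=5).mean()
            if acc > best_acc:
                best_name, best_params, best_acc = name, params, acc
    return best_name, best_params

# Usage, assuming `tasks` is a list of (X, y) pairs, one per learning task:
# meta_examples = [(compute_meta_features(X, y), best_pair(X, y)) for X, y in tasks]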
The classification learning algorithm selection problem, as solved by metalearning, can then be reformulated as follows.³

³ We recognize that it is possible to use metalearning to predict rankings of learning algorithms (e.g., see [5, 23]), or even the actual performance of learning algorithms via regression (e.g., see [4, 10, 20]). We restrict our attention here to the prediction of a single best algorithm, although much of the discussion extends naturally to these settings.

Metalearning for Classification Learning Algorithm Selection. Given a set T^m = {(C(T), argmax_{L ∈ L, P ∈ P_L} A(I(L, P, T)))} of past classification learning episodes, together with a classification learning algorithm L^m, which may belong to L, and an associated parameter setting P_{L^m}:

1. Construct M = I(L^m, P_{L^m}, T^m).
2. For any training set T′, (L*, P*) = M(T′).

By convention, let P^0_{L_i} denote the default parameter setting of L_i, as determined by the implementation of L_i under consideration (e.g., Weka, IBM SPSS Modeler). Most of the work in metalearning so far has addressed the above problem with the further assumption that for all learning algorithms the parameter setting is fixed to its default. This, of course, creates a much restricted, yet also greatly simplified, version of the selection problem, since the large, possibly infinite, space of parameter settings need not be considered at all. However, that restriction has also been the source of much criticism, and sometimes dismissal, by a part of the machine learning community, which has maintained that:

Parameter Optimization Claim. Parameter setting has a significant impact (for the better) on the predictive accuracy of classification learning algorithms.

It would seem that most metalearning researchers, and indeed most machine learning researchers, have taken this claim to be well founded, and considered ways to address it. There have been two main approaches.

• Two-stage Metalearning. Some metalearning researchers have adopted a two-stage approach to metalearning, wherein they continue to use the restricted form of the Metalearning for Classification Learning Algorithm Selection problem to choose a learning algorithm, but then follow up with an optimization phase to find the best set of parameter values for the selected algorithm.⁴ The problem in this case, however, is that one may reach a suboptimal solution. Indeed, let L_1 and L_2 be two classification learning algorithms such that, for some classification task T′, M(T′) = (L_1, P^0_{L_1}). Then, L_1 would be selected and its parameter setting optimized to P*_{L_1}. Yet, despite the fact that A(I(L_1, P^0_{L_1}, T′)) > A(I(L_2, P^0_{L_2}, T′)) (assuming the metalearner is accurate), it is entirely possible that there exists some parameter setting P^k_{L_2} of algorithm L_2 such that A(I(L_1, P*_{L_1}, T′)) < A(I(L_2, P^k_{L_2}, T′)). In other words, the early greedy commitment to L_1 makes it impossible to explore other parts of the combined spaces of learning algorithms and parameter settings.

• Unrestricted Metalearning. Other metalearning researchers have lifted the traditional restriction, and recently begun to design solutions to the unrestricted Metalearning for Classification Learning Algorithm Selection problem. Their systems seek to select both a learning algorithm and an associated parameter setting (e.g., see [13, 24, 21]).

⁴ Some have also simply picked a classification learning algorithm manually, and used metalearning to choose the best parameter setting (e.g., see [9, 19]).

Interestingly, and somewhat surprisingly, very few researchers have bothered to check the validity of the Parameter Optimization Claim. Yet, to the best of our knowledge, and as presented by most, it is just that: a claim. We have been hard-pressed to find any systematic study in the literature that addresses the impact of parameter settings over a wide variety of classification learning algorithms. What if the Parameter Optimization Claim does not hold in general? What if it only holds in some specific cases? Would it not be useful to know what these cases are? And what about using metalearning to characterize when the claim holds? We address these questions here.

2   Impact of Parameter Optimization on Classification Learning Algorithm Performance

There is one exception to our statement that the literature contains no systematic study of the impact of parameter optimization on the performance of classification learning algorithms, found in [23]. In that paper, the authors considered 466 datasets and, for each, computed the difference in accuracy between the default parameter settings and the best possible parameter settings after optimization. The optimization procedure is based on particle swarm optimization (PSO), wherein the authors specify which parameters should be optimized (e.g., kernel function and complexity constant for SVM) and the range of values that PSO should consider. We obtained the data from the authors and reproduced their results, with two small exceptions: 1) we found that one of the datasets in the list was redundant, so we removed it; and 2) our dataset characterization tool (see below) did not terminate on two of the datasets after several days, so we stopped it and omitted the corresponding datasets from our study. Hence, the results here are for 463 datasets. For each dataset, Figure 1 shows the percentage of improvement of the best AUC score among 20 classification learning algorithms after parameter optimization over the best AUC score among the same classification learning algorithms with their default parameter settings. The data is ordered by increasing value of improvement.

[Figure 1: % Improvement (y-axis, 0 to 20) for each of the 463 datasets (x-axis), ordered by increasing improvement.]

Figure 1. Impact of Parameter Optimization on 463 Datasets
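The per-dataset quantity plotted in Figure 1 is easy to recompute from raw scores. The sketch below assumes two arrays of AUC values, one row per dataset and one column per algorithm, under default and optimized settings respectively; measuring improvement relative to the best default score is our reading of the figure, not a detail stated in the paper.

import numpy as np

def improvement_per_dataset(auc_default, auc_optimized):
    """Relative improvement (in %) of the best optimized AUC over the
    best default AUC, one value per dataset.

    auc_default, auc_optimized: arrays of shape (n_datasets, n_algorithms).
    """
    best_default = np.asarray(auc_default).max(axis=1)
    best_optimized = np.asarray(auc_optimized).max(axis=1)
    return 100.0 * (best_optimized - best_default) / best_default

# Figure 1 orders these values by increasing improvement:
# curve = np.sort(improvement_per_dataset(auc_default, auc_optimized))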
Figure 1 makes it clear that the impact of parameter optimization is highly variable across datasets, and seems to be rather small for a large number of them. We are a little surprised that, with such a skewed distribution, the authors of [23] concluded that "the result demonstrates the benefit of using the performances of optimised algorithms for generating algorithm rankings", and thus carried on with a blanket application of their combined classification learning algorithm / parameter setting selection approach.

To make the relationship even clearer, consider Figure 2, which shows the cumulative distribution over the same datasets, where each successive bar represents the proportion of datasets for which the improvement in AUC due to parameter optimization is less than or equal to the value indicated on the x-axis, in increments of 1%.

[Figure 2: cumulative proportion of datasets (y-axis, 0 to 100%) whose AUC improvement does not exceed the value on the x-axis (0% to 15%).]

Figure 2. Cumulative Distribution of the Impact of Parameter Optimization on 463 Datasets

According to Figure 2, for 19% of the datasets considered, parameter optimization offers no gain in performance. The ascent is actually very steep, as shown by the shaded portion of the distribution, reaching 80% of the datasets for an improvement in performance of no more than 5%. From 0% to 5%, the relationship is virtually linear (r=0.999) with a slope of 12. These results seem robust, as an independent analysis of 129 datasets and 9 classification learning algorithms reveals that there is no gain in performance with optimization for about 15% of the datasets, and that 96% of the datasets exhibit less than 5% improvement.⁵

⁵ There may be some overlap in the datasets used in this study and those used in [23], although the datasets were not transformed into binary classification tasks in the former as they were in the latter.
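The cumulative proportions shown in Figure 2 are easy to recompute from the per-dataset improvement values; a minimal sketch (assuming `improvements` holds one value per dataset, as computed earlier) follows.

import numpy as np

def cumulative_proportions(improvements, max_threshold=15):
    """Proportion of datasets whose improvement is <= t, for t = 0%, 1%, ..., max_threshold%."""
    improvements = np.asarray(improvements)
    return {t: float((improvements <= t).mean()) for t in range(max_threshold + 1)}

# For the 463 datasets studied here, cumulative_proportions(improvements)[0]
# should be about 0.19 and cumulative_proportions(improvements)[5] about 0.80.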
We note that the above analysis was not performed on a per-algorithm basis. As stated, the differences in performance are computed from among the best of 20 algorithms, which means that the best optimized version could be obtained with algorithm A, while the best default version for the same dataset would be obtained with algorithm B. It is possible, however, that some classification learning algorithms are more sensitive to parameter settings and may thus be more likely to exhibit significant differences with parameter optimization. It may be worthwhile in a future study to consider such algorithm-level variation. In spite of this limitation, several conclusions seem inescapable from the foregoing analysis:

1. Parameter optimization does not improve performance uniformly across datasets.
2. For many datasets, parameter optimization yields very little improvement and may be viewed as computational overkill.
3. For a few datasets, parameter optimization makes a significant difference and should thus be performed.

From a practitioner's standpoint, the last two conclusions are particularly relevant. If no improvement is to be expected from parameter optimization, then one would gladly save the extra computational time required to effect it. Conversely, if a significant improvement can be expected and one is focused on maximizing predictive accuracy, then one would be willing to bear the extra cost. The question then, for such practitioners, is: Will the predictive accuracy on the classification learning task I am considering be improved by parameter optimization?

It should be obvious to machine learning researchers that if one labels the datasets falling under conclusion 2 above as no advantage and the datasets falling under conclusion 3 as advantage, one obtains a training dataset for a classification learning task. We propose to do exactly that and essentially take our own medicine, by applying machine learning to the problem of distinguishing between datasets that may benefit from parameter optimization and datasets that would not. Because our data consists of information about the performance of learning algorithms, this is an instance of metalearning.

3   Metalearning the Impact of Parameter Optimization

As per standard metalearning practice, we build our training metadata by characterizing each of our 463 datasets by a set of meta-features. However, because smaller datasets may produce unreliable meta-features, we remove from the analysis all of the datasets containing fewer than 100 instances. This leaves us with 326 datasets.

We use an existing R script that, for each dataset, creates a meta-feature vector by extracting over 68 meta-features, including statistical and information-theoretic meta-features, landmarkers, model-based meta-features, and timing information [18]. Prior to our experiments, we manually remove a number of meta-features that carry little information in this context (e.g., timing data, model-based meta-features, redundant meta-features). In each experiment, we also use correlation-based feature subset selection (CFS) [11], as implemented by the function CfsSubsetEval in Weka [12], to further reduce the number of meta-features.

The metalearning task consists in distinguishing datasets where parameter optimization is deemed to offer no advantage (class 0) from datasets where parameter optimization is likely to yield a performance advantage (class 1). Each meta-feature vector is labeled appropriately based on the observed difference in performance over its corresponding dataset. We use various threshold values to separate class 1 from class 0, as shown below.

There are two types of error our (meta)models can make. A false positive (FP) error is made when the model considers a class 0 dataset but labels it as class 1. When this happens, there is wasted effort in performing parameter optimization when no significant gain will be achieved by such optimization. Conversely, a false negative (FN) error is made when the model considers a class 1 dataset but labels it as class 0. When this happens, valuable parameter optimization is omitted and there is a resulting loss of predictive accuracy in the classification task under consideration. Assuming that a practitioner's ultimate goal is to get the best accuracy possible on their specific classification task, one can argue that FN errors are more costly than FP errors, and thus should be minimized. This is, of course, equivalent to maximizing recall, R = TP / (TP + FN), where TP is the number of true positive classifications.
Yet, one must be careful, as it is trivial to maximize R by simply assigning all classification tasks to class 1. Clearly, this would defeat the purpose of the model and is unacceptable due to the unnecessary computational burden it places on the system. Hence, we focus on obtaining models with high recall, but also good precision, P = TP / (TP + FP).

We are faced at the metalevel with the same challenge of selecting a (meta)learning algorithm adequate for the task at hand. We are guided here by a desire for comprehensibility of the induced model and the results of preliminary experiments. Hence, we use for our metalearner a decision tree learning algorithm, specifically Weka's J48, and implement the training and testing processes in RapidMiner [15]. Interestingly, J48 also resulted in higher accuracy than other algorithms such as SVM and Random Forest.
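For readers without a Weka/RapidMiner setup, a roughly equivalent metalevel experiment can be sketched with scikit-learn, using DecisionTreeClassifier as a stand-in for J48. The variable names (meta_X, meta_y) and the cross-validation setup are our assumptions, not details taken from the paper.

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# meta_X: one meta-feature vector per dataset (e.g., joint_entropy, NB, NN1).
# meta_y: 1 if parameter optimization improves performance by more than the
#         chosen threshold (e.g., 1.5%), 0 otherwise.
def evaluate_metamodel(meta_X, meta_y):
    meta_X = np.asarray(meta_X)
    meta_y = np.asarray(meta_y)
    tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=0)  # J48 stand-in
    preds = cross_val_predict(tree, meta_X, meta_y, cv=10)
    tp = np.sum((preds == 1) & (meta_y == 1))
    fp = np.sum((preds == 1) & (meta_y == 0))
    fn = np.sum((preds == 0) & (meta_y == 1))
    accuracy = float(np.mean(preds == meta_y))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision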
3.1   Threshold = 1.5

We begin by setting the threshold relatively low, assuming that there is an advantage to parameter optimization if the performance improvement exceeds 1.5%.

The default accuracy in this case is 61.96%, obtained by predicting class 1 uniformly. This, of course, produces maximum recall, R=100%, but very poor precision, P=61.96%.

Applying CFS selects the following mixed set of meta-features: attributes, kurtosis, joint_entropy, NB, LDA and NN1. Further experimentation shows that performance can be improved further by using only joint_entropy, NB and NN1. The confusion matrix is as follows.

                 Predicted
                  0     1
   Actual   0    84    40
            1    28   174

This yields an accuracy of 79.17%, significantly above default, with both high recall, R=86.14%, and high precision, P=81.31%. The decision tree is shown below.

nn_1 <= 0.902439
|   joint_entropy <= 0.606044: 0 (14.0/2.0)
|   joint_entropy > 0.606044
|   |   naive_bayes <= 0.952: 1 (208.0/30.0)
|   |   naive_bayes > 0.952
|   |   |   nn_1 <= 0.71134
|   |   |   |   joint_entropy <= 2.08461: 0 (2.0)
|   |   |   |   joint_entropy > 2.08461: 1 (5.0)
|   |   |   nn_1 > 0.71134: 0 (12.0/2.0)
nn_1 > 0.902439: 0 (85.0/15.0)
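Read as a rule, the tree above amounts to the following predicate (a direct transcription of the printed thresholds; nn_1 and naive_bayes denote the landmarker performances of 1-NN and Naive Bayes, and joint_entropy the corresponding information-theoretic meta-feature).

def needs_parameter_optimization(nn_1, joint_entropy, naive_bayes):
    """Class 1 means parameter optimization is expected to help (threshold 1.5%)."""
    if nn_1 <= 0.902439:
        if joint_entropy <= 0.606044:
            return 0
        if naive_bayes <= 0.952:
            return 1
        if nn_1 <= 0.71134:
            return 0 if joint_entropy <= 2.08461 else 1
        return 0
    return 0

In particular, datasets on which the 1-NN landmarker already performs very well (nn_1 > 0.902439) are predicted to gain little from parameter optimization.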
To further test the metamodel, we collected 42 independent datasets with more than 100 instances each. It is possible that some of these correspond to some of the tasks used in the training data. However, all 463 training tasks have been binarized [23], while these were left untouched. Hence, we are training on 2-class datasets and testing on n-class datasets, which may be somewhat unfair, but is still interesting.

We extracted the meta-features of the 42 datasets, and ran the corresponding meta-feature vectors against the metamodel.⁶ The accuracy on the test datasets is 54.76%, which is rather poor given a default accuracy of 73.81%. However, the default is obtained by predicting class 0 uniformly. While the test accuracy is not very good, recall is very high at R=90.91% (compared to R=0 for the default), with 10 of the 11 datasets of class 1 being predicted in class 1. This suggests that the model has picked up useful information to identify learning tasks where parameter optimization would be advantageous (i.e., expected improvement greater than 1.5%).

⁶ As the results for the test datasets were obtained with a different set of classification learning algorithms and a different parameter optimization procedure, we are not entirely sure the comparison is fair. Yet, we feel there is value in including these results.

In addition to its own performance at the metalevel, we consider how the metamodel affects performance at the base level. To do so, we compare maxImp, the total amount of possible improvement due to parameter optimization, to predImp, the amount of improvement one would obtain by using the metamodel. For each dataset d, let l_d be d's label, p_d be d's predicted class, and I_d be the improvement one would experience if parameter optimization were used with d. Then,

    maxImp = Σ_d I_d,

i.e., maxImp is the sum of the individual improvement values across all of our 326 datasets. Here, maxImp = 949.26. Similarly,

    predImp = Σ_d { I_d  if (p_d = l_d) ∧ (l_d = 1)
                    0    otherwise },

i.e., predImp is the sum of the improvement values for all datasets where the metamodel makes the correct class 1 prediction. We do not include correct class 0 predictions, as these would artificially inflate the results. Here, predImp = 789.92. Hence, the metamodel would allow us to claim 83.21% of the total performance improvement available at the base level.
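The two quantities just defined can be computed directly from the labels, predictions, and per-dataset improvement values; a minimal sketch with hypothetical array names follows.

import numpy as np

def improvement_claimed(labels, predictions, improvements):
    """Return maxImp, predImp, and the fraction of available improvement claimed."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    improvements = np.asarray(improvements)
    max_imp = improvements.sum()
    pred_imp = improvements[(predictions == labels) & (labels == 1)].sum()
    return max_imp, pred_imp, pred_imp / max_imp

# With the values reported above, the claimed fraction is 789.92 / 949.26 ≈ 83.21%.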
3.2   Threshold = 2.5

We next raise the threshold, assuming that there is an advantage to parameter optimization if the performance improvement exceeds 2.5%.

The default accuracy in this case is 50.31%, obtained by predicting class 1 uniformly. This, again, produces maximum recall, R=100%, but very poor precision, P=50.31%.

Applying CFS selects the following mixed set of meta-features: kurtosis_prep, normalized_attribute_entropy, joint_entropy, NB, LDA, stump_min_gain and NN1. Further experimentation shows that performance can be improved further by using only kurtosis_prep, normalized_attribute_entropy, joint_entropy, NB and NN1. The confusion matrix is as follows.

                 Predicted
                  0     1
   Actual   0   105    57
            1    26   138

This yields an accuracy of 74.52%, significantly above default, with both high recall, R=84.15%, and high precision, P=70.77%. The decision tree is shown below.

nn_1 <= 0.857143
|   joint_entropy <= 0.606044
|   |   joint_entropy <= 0.508598
|   |   |   kurtosis_prep <= 26.479217: 1 (2.0)
|   |   |   kurtosis_prep > 26.479217: 0 (2.0)
|   |   joint_entropy > 0.508598: 0 (8.0)
|   joint_entropy > 0.606044
|   |   naive_bayes <= 0.969466: 1 (194.0/49.0)
|   |   naive_bayes > 0.969466
|   |   |   normalized_attribute_entropy <= 0.84991
|   |   |   |   kurtosis_prep <= 45.518635: 0 (2.0)
|   |   |   |   kurtosis_prep > 45.518635: 1 (2.0)
|   |   |   normalized_attribute_entropy > 0.84991: 0 (7.0)
nn_1 > 0.857143: 0 (109.0/15.0)
As with the threshold value of 1.5, using the metamodel against the test datasets produces poor accuracy (35.71% for a default of 88.10%), but recall is significantly better, with R=60% (3 of the 5 datasets in class 1 are predicted correctly) against a default of R=0.

Considering performance at the base level, we have here predImp = 698.61, so that the metamodel would still allow us to claim 73.60% of the total performance improvement available.

3.3   Larger Threshold Values

Based on the distribution of performance improvements, raising the threshold will cause the majority class to shift to 0 and result in far fewer class 1 datasets. On the other hand, while they are fewer, correctly identifying these datasets offers the highest benefit to the practitioner, as the expected improvement in accuracy with parameter optimization is rather significant. However, the metalearning task is also more difficult, as the class distribution of the training data is rather skewed.

Setting the threshold to 5.0% results in a default accuracy of 83.44% with a model that predicts class 0 uniformly. While it was not possible in this case to improve on the default accuracy with the available set of meta-features, recall rose to R=25.94% (10 of the 54 datasets in class 1 are predicted correctly). Interestingly, CFS selected LDA and NN1, and while the tree induced without feature selection is a little larger than the previous ones, it also splits on NN1 at the root node. Using the metamodel against the test datasets produces reasonable accuracy (76.19%), but only because the test data is skewed in favor of class 0. Only two datasets belong to class 1, and neither is predicted correctly.

Setting the threshold to 10.0% results in a default accuracy of 96.93%, again with a model that predicts class 0 uniformly. Only 10 datasets belong to class 1, making for a very imbalanced class distribution. Using all the meta-features, the induced decision tree has a slightly improved accuracy (97.85%), with recall R=30%. The root of the tree is mutual_information. There are no datasets in our test set for which the improvement with parameter optimization exceeds 10%.

4   Discussion

As mentioned previously, one of the biggest issues with parameter optimization is the amount of time it takes to optimize the parameters. In [24], it took 6,000 single-core CPU hours to optimize the parameters for all their datasets, which means that it took on average about 13 hours per dataset. For obvious reasons, a practitioner would probably not wish to wait that amount of time when it may result in little or no improvement over the model's baseline performance.

What this short study demonstrates is that it is possible to predict, with good recall, whether parameter optimization will significantly increase a classification learning model's accuracy by using the meta-features of any particular dataset, which are less computationally intensive than the parameter optimization procedure. Of course, as the threshold increases, or as the expected performance improvement gets larger, the prediction becomes less accurate. Nevertheless, it is better than default.

One of the unexpected findings of our study, valid across almost all metamodels, is that the root node of the decision tree is NN1. In other words, a dataset's performance on NN1 is indicative of whether parameter optimization will significantly improve the performance of learning algorithms trained against it. There are at least two interesting things about this finding. First, it confirms prior work on metalearning that suggests that landmarking meta-features are generally better than other meta-features, especially NN1 and NB [20]. Second, and significantly more valuable, is that NN1 is a very efficient algorithm, suggesting that it would be very fast to determine whether parameter optimization is likely to improve accuracy for any dataset.

To further test the predictive power of NN1, we ran Weka's linear regression on our dataset, without CFS, and with the target set to the actual percentage of accuracy improvement (Imp) observed for each dataset. The resulting model is shown below.

Imp =
     11.4405 * class_prob_min +
      0.1626 * skewness_prep +
     -0.0051 * kurtosis_prep +
     -2.003  * cancor_1 +
     -6.3391 * class_entropy +
     -8.2398 * mutual_information +
     -0.0008 * noise_signal_ratio +
     -0.5321 * attribute_mean +
     -5.7852 * stump_min +
     -4.8382 * stump_max +
     13.3081 * stump_mean +
      8.4397 * stump_sd +
      3.3769 * stump_min_gain +
     -7.5918 * nn_1 +
      6.4233

The correlation coefficient of the model is 0.40, and the relatively large multiplicative factor associated with NN1 in the model suggests its influence on the value of Imp. Furthermore, if we regress Imp on NN1 alone, we obtain the following very simple model.

Imp =
     -8.3476 * nn_1 +
      9.3173

The correlation coefficient of this simpler model is 0.44. There is still error in this model, and it will be interesting to see whether using only the NN1 meta-feature could, in general, reasonably predict a dataset's improvement if classification learning algorithm parameters are optimized.
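Taken at face value, the simpler model suggests a very cheap pre-check: compute the NN1 landmarker for a new dataset, plug it into the fitted equation, and compare the predicted improvement to one's threshold. The sketch below does this with scikit-learn, reusing the reported coefficients as-is; approximating the landmarker by the 10-fold cross-validated accuracy of 1-NN, and treating the coefficients as transferable to new data, are our assumptions rather than claims made in the paper.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def predicted_improvement(X, y):
    """Rough estimate of the expected % improvement from parameter optimization,
    using the NN1-only model reported above: Imp = -8.3476 * nn_1 + 9.3173."""
    nn_1 = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=10).mean()
    return -8.3476 * nn_1 + 9.3173

# One might then decide to run parameter optimization only when, e.g.,
# predicted_improvement(X, y) > 1.5.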
5   Conclusion

We have shown that the observed improvement in performance due to parameter optimization in classification learning has a rather skewed distribution, with most datasets seeing no or very little improvement. However, there remains a portion of datasets for which parameter optimization has a significant impact. We have used metalearning to build a model capable of predicting whether parameter optimization can be expected to improve classification accuracy. As a result, it would seem that both current approaches to metalearning, the one that ignores parameter optimization and considers only default settings, and the other that applies parameter optimization in all cases, are misinformed. Instead, a two-phase approach, where one first determines whether parameter optimization is likely to produce an improvement, and then, if so, applies it, is more appropriate.

Furthermore, one unexpected and very interesting side effect of our metalearning study is that our results show that good recall can be achieved at the metalevel, and that with a very small set of efficient meta-features. In particular, it would appear that NN1 may serve as a great landmarker in this context.
For future work, it may be valuable to revisit error costs. In particular, we have argued here that FN errors were more costly than FP errors on the basis that practitioners are likely to wish to obtain the highest possible accuracy on their classification tasks. This is certainly also true when the computational cost of parameter optimization is prohibitive. Were computational cost not to matter, the best decision would be to always perform parameter optimization, and there would consequently be no need for a metamodel. In practice, however, it is well known that parameter optimization is computationally intensive. As a result, FP errors may actually be rather costly, and possibly even more so than FN errors. In any case, there would seem to exist a range of operating conditions between these extremes, or different cost-benefit trade-offs between FN and FP errors associated with parameter optimization. It may be interesting to perform a cost-sensitive analysis at the metalevel, for example, relating expected gain in performance (or threshold) and incurred cost. We cannot do this with the data available, since we only have the final performance improvement (after some fixed search time). We would need to re-run experiments to gather information about improvement over the time allowed for optimization.

REFERENCES

[1] Bensusan, H. (1998). Odd Bites into Bananas Don't Make You Blind: Learning about Simplicity and Attribute Addition. In Proceedings of the ECML'98 Workshop on Upgrading Learning to the Meta-level: Model Selection and Data Transformation, 30-42.
[2] Bensusan, H. and Giraud-Carrier, C. (2000). Discovering Task Neighbourhoods through Landmark Learning Performances. In Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases, LNAI 1910, 325-330.
[3] Bensusan, H., Giraud-Carrier, C. and Kennedy, C. (2000). A Higher-order Approach to Meta-learning. In Proceedings of the ECML-2000 Workshop on Meta-learning: Building Automatic Advice Strategies for Model Selection and Method Combination, 109-118.
[4] Bensusan, H. and Kalousis, A. (2001). Estimating the Predictive Accuracy of a Classifier. In Proceedings of the Twelfth European Conference on Machine Learning (LNCS 2167), 25-36.
[5] Brazdil, P., Soares, C. and Pinto da Costa, J. (2003). Ranking Learning Algorithms: Using IBL and Meta-Learning on Accuracy and Time Results. Machine Learning, 50(3):251-277.
[6] Brazdil, P., Giraud-Carrier, C., Soares, C. and Vilalta, R. (2009). Metalearning: Applications to Data Mining, Springer.
[7] Engels, R. and Theusinger, C. (1998). Using a Data Metric for Offering Preprocessing Advice in Data-mining Applications. In Proceedings of the Thirteenth European Conference on Artificial Intelligence, 430-434.
[8] Fuernkranz, J. and Petrak, J. (2001). An Evaluation of Landmarking Variants. In Proceedings of the ECML/PKDD-01 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-learning, 57-68.
[9] Gomes, T.A.F., Prudêncio, R.B.C., Soares, C., Rossi, A.L.D. and Carvalho, A. (2012). Combining Meta-learning and Search Techniques to Select Parameters for Support Vector Machines. Neurocomputing, 75:3-13.
[10] Guerra, S.B., Prudêncio, R.B.C. and Ludermir, T.B. (2008). Predicting the Performance of Learning Algorithms Using Support Vector Machines as Meta-regressors. In Proceedings of the Eighteenth International Conference on Artificial Neural Networks (LNCS 5163), 523-532.
[11] Hall, M.A. (1999). Correlation-based Feature Subset Selection for Machine Learning. PhD Thesis, Department of Computer Science, The University of Waikato, Hamilton, New Zealand.
[12] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10-18.
[13] Leite, R., Brazdil, P. and Vanschoren, J. (2012). Selecting Classification Algorithms with Active Testing. In Proceedings of the Eighth International Conference on Machine Learning and Data Mining in Pattern Recognition (LNCS 7376), 117-131.
[14] Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994). Machine Learning, Neural and Statistical Classification, Ellis Horwood.
[15] North, M. (2012). Data Mining for the Masses, Global Textbook Project.
[16] Peng, Y., Flach, P.A., Brazdil, P. and Soares, C. (2002). Improved Data Set Characterisation for Meta-learning. In Proceedings of the Fifth International Conference on Discovery Science, 141-152.
[17] Pfahringer, B., Bensusan, H. and Giraud-Carrier, C. (2000). Meta-learning by Landmarking Various Learning Algorithms. In Proceedings of the Seventeenth International Conference on Machine Learning, 743-750.
[18] Reif, M. (2012). A Comprehensive Dataset for Evaluating Approaches of various Meta-Learning Tasks. In Proceedings of the First International Conference on Pattern Recognition Applications and Methods, 273-276.
[19] Reif, M., Shafait, F. and Dengel, A. (2012). Meta-learning for Evolutionary Parameter Optimization of Classifiers. Machine Learning, 87(3):357-380.
[20] Reif, M., Shafait, F., Goldstein, M., Breuel, T. and Dengel, A. (2014). Automatic Classifier Selection for Non-experts. Pattern Analysis & Applications, 17(1):83-96.
[21] Smith, M.R., Mitchell, L., Giraud-Carrier, C. and Martinez, T. (2014). Recommending Learning Algorithms and Their Associated Hyperparameters. Submitted.
[22] Sohn, S.Y. (1999). Meta Analysis of Classification Algorithms for Pattern Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(11):1137-1144.
[23] Sun, Q. and Pfahringer, B. (2013). Pairwise Meta-rules for Better Meta-learning-based Algorithm Ranking. Machine Learning, 93(1):141-161.
[24] Thornton, C., Hutter, F., Hoos, H.H. and Leyton-Brown, K. (2013). Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In Proceedings of the Nineteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847-855.