The Potential Benefits of Data Set Filtering and Learning Algorithm Hyperparameter Optimization

Michael R. Smith, Tony Martinez, and Christophe Giraud-Carrier
Department of Computer Science, Brigham Young University, Provo, UT 84602 USA
msmith@axon.cs.byu.edu, martinez@cs.byu.edu, cgc@cs.byu.edu
http://axon.cs.byu.edu

Abstract. The quality of a model induced by a learning algorithm is dependent upon the training data and the hyperparameters supplied to the learning algorithm. Prior work has shown that a model's quality can be significantly improved by filtering out low-quality instances or by tuning the learning algorithm hyperparameters. The potential impact of filtering and hyperparameter optimization (HPO) is largely unknown. In this paper, we estimate the potential benefits of instance filtering and HPO. While both HPO and filtering significantly improve the quality of the induced model, we find that filtering has a greater potential effect on the quality of the induced model than HPO, motivating future work in filtering.

1 Introduction

Given a set of training instances composed of input feature vectors and corresponding labels, the goal of supervised machine learning is to induce an accurate generalizing function (hypothesis) that maps feature vectors to labels. The quality of the induced function depends on the learning algorithm's hyperparameters and on the quality of the training data. It is known that no learning algorithm or hyperparameter setting is best for all data sets (the no free lunch theorem [26]) and that the performance of many learning algorithms is sensitive to their hyperparameter settings. It is also well known that real-world data sets are typically noisy. Prior work has shown that the generalization performance of an induced model can be significantly improved through hyperparameter optimization (HPO) [1] or by increasing the quality of the training data using techniques such as noise correction [11], instance weighting [17], or instance filtering [20]. Searching the hyperparameter space and improving the quality of the training data have generally been examined in isolation, and their potential impact has not been examined. In this paper, we compare the effects of HPO with the effects of improving the quality of the training data through filtering. The results of our experiments provide insight into the potential effectiveness of both HPO and filtering.

We evaluate 6 commonly used learning algorithms on 46 data sets. We examine the effects of HPO and filtering by: 1) using a standard approach that selects the hyperparameters of an algorithm by maximizing the accuracy on a validation set, and 2) using an optimistic approach that sets the hyperparameters for an algorithm using the 10-fold cross-validation accuracy. The standard and optimistic approaches are explained in more detail in Section 4. Essentially, the optimistic approach indicates how well a technique could perform if the training set were representative of the test set and provides insight into the potential benefit of a given technique. The standard approach provides a representative view of HPO and filtering in their present state and allows an evaluation of how well current HPO and filtering techniques fulfill their potential.
Using the standard approach, we find that in most cases both HPO and filtering significantly increase classification accuracy over using a learning algorithm with its default parameters trained on unfiltered data. For the optimistic estimates of HPO and filtering, we find that filtering significantly improves the classification accuracy over HPO for all of the investigated learning algorithms, increasing the accuracy more than HPO for almost all of the considered data sets. HPO achieves an average accuracy of 84.8% while filtering achieves an average accuracy of 89.1%. The standard approaches for HPO and filtering achieve average accuracies of 82.6% and 82.0%, respectively. These results provide motivation for further research into developing algorithms that improve the quality of the training data.

2 Related Work

Smith et al. [21] found that a significant number of instances are difficult to classify correctly, that the hardness of each instance depends on its relationship with the other instances in the training set, and that some instances can be detrimental. Thus, there is a need for improving how detrimental instances are handled during training, as they affect the classification of other instances. Improving the quality of the training data has typically followed one of three approaches: filtering, cleaning, and instance weighting [7]. The techniques within each approach differ in how detrimental instances are identified. A common filtering technique removes instances from a data set that are misclassified by a learning algorithm or an ensemble of learning algorithms [3]. Removing training instances that are suspected to be noise and/or outliers prior to training has the advantage that they do not influence the induced model, and doing so generally increases classification accuracy. A negative side effect of filtering is that beneficial instances can also be discarded, producing a worse model than if all of the training data had been used [18]. Rather than discarding instances from a training set, noisy or possibly corrupted instances can be cleaned or corrected [11]; however, this could artificially corrupt valid instances. Alternatively, instance weighting down-weights suspected detrimental instances rather than discarding them, allowing an instance to be considered on a continuum of detrimentality rather than making a binary decision [17].

Other methods exist for improving the quality of the training data, such as feature selection/extraction [8]. While feature selection and extraction can improve the quality of the training data, we focus on improving quality via filtering, which facilitates a comparison between filtering and HPO on the same feature set.

Much of the previous work in improving the quality of the training data artificially corrupts training instances to determine how well an approach would work in the presence of noisy or mislabeled instances. In some cases, a given approach only has a significant impact when there are large degrees of artificial noise. In contrast, we do not artificially corrupt a data set to create detrimental instances. Rather, we seek to identify the detrimental instances that are already contained in a data set and show that correctly labeled, non-noisy instances can also be detrimental for inducing a model of the data. Properly handling detrimental instances can result in significant gains in accuracy.
The grid search and the manual search are the most common HPO techniques in machine learning, and a combination of the two is commonly used [12]. Bergstra and Bengio [1] proposed using a random search of the hyperparameter space. The premise of random HPO is that most machine learning algorithms have very few hyperparameters that considerably affect the final model, while the other hyperparameters have little to no effect. Random search provides a greater variety of values for the hyperparameters that considerably affect the model. Given the same time constraints, random HPO has been shown to outperform a grid search. Random search, while providing improvements over a grid search, is unreliable for tuning the hyperparameters of some learning algorithms such as deep belief networks [2]. Bayesian optimization has also been used to search the hyperparameter space [23]. Bayesian optimization techniques model the dependence of an error function E on the hyperparameters λ as p(E|λ) using, for example, random forests [10] or Gaussian processes [23].

3 Preliminaries

Let T represent a training set composed of a set of input vectors X = \{x_1, x_2, \ldots, x_n\} and corresponding label vectors Y = \{y_1, y_2, \ldots, y_n\}, i.e., T = \{\langle x_i, y_i \rangle : x_i \in X \wedge y_i \in Y\}. Given that, at least initially, all that is known about a task is contained in the set of training instances T, the training instances are generally treated equally. Most machine learning algorithms seek to induce a hypothesis h : X \rightarrow Y that minimizes a specified loss function L(\cdot). As most real-world data sets contain some level of noise, a model-dependent regularization term R(\cdot) is generally added to L(\cdot) that penalizes more complex models and aids in overfit avoidance. The noise in T may arise from errors in the data collection process, such as typos or faulty data collection equipment. In addition to noise from errors, there may be non-noisy outlier instances due to the stochastic nature of the task. A hypothesis h is induced by a learning algorithm g trained on T with hyperparameters λ (h = g(T, \lambda)), such that:

h^* = \argmin_{h \in H} \frac{1}{|T|} \sum_{\langle x_i, y_i \rangle \in T} L(h(x_i), y_i) + \alpha R(h)    (1)

where α is a regularization parameter greater than or equal to 0 that determines how much weight to apply to the regularization term, and h(\cdot) returns the predicted class for a given input. The quality of the induced hypothesis h is characterized by its empirical error for a specified error function E on a test set V:

E(h, V) = \frac{1}{|V|} \sum_{\langle x_i, y_i \rangle \in V} E(h(x_i), y_i)

where V can be T or a disjoint set of instances. In k-fold cross-validation, the empirical error is the average empirical error over the k folds, i.e., \frac{1}{k} \sum_{i=1}^{k} E(h_i, V_i).
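As a concrete illustration of the empirical error and its k-fold estimate defined above, the following sketch (a minimal Python example using scikit-learn; the paper's experiments use Weka, and the data set and decision tree here are illustrative choices only) computes E(h, V) under the 0-1 loss and averages it over k folds.

```python
# Minimal sketch of the empirical error E(h, V) and its k-fold estimate.
# scikit-learn stands in for the learning algorithm g with hyperparameters lambda.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def empirical_error(h, V_X, V_y):
    """E(h, V) with the 0-1 loss: fraction of instances in V that h misclassifies."""
    return np.mean(h.predict(V_X) != V_y)


def kfold_error(g_factory, X, y, k=10, seed=0):
    """Average empirical error over k folds, i.e., (1/k) * sum_i E(h_i, V_i)."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        h_i = g_factory().fit(X[train_idx], y[train_idx])   # h_i = g(T_i, lambda)
        errors.append(empirical_error(h_i, X[val_idx], y[val_idx]))
    return np.mean(errors)


X, y = load_iris(return_X_y=True)
print(kfold_error(lambda: DecisionTreeClassifier(random_state=0), X, y, k=10))
```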
Characterizing the success of a learning algorithm at the data set level (e.g., by accuracy or precision) optimizes over the entire training set and marginalizes the impact of a single training instance on the induced model. Some sets of instances can be more beneficial than others for inducing a model of the data, and some can even be detrimental. By detrimental instances, we mean instances that have a negative impact on the induced model. For example, outliers or mislabeled instances are not as beneficial as border instances and are detrimental in many cases. In addition, other instances can be detrimental for inducing a model of the data even if they are labeled correctly. Formally, a set D of detrimental instances is a subset of the training data that, when used in training, increases the empirical error, i.e., E(g(T, \lambda), V) > E(g(T - D, \lambda), V).

The effect of training with detrimental instances is demonstrated in the hypothetical two-dimensional data set shown in Figure 1. Instances A and B represent detrimental instances. The solid line represents the "actual" classification boundary and the dashed line represents a potential induced classification boundary. Instances A and B adversely affect the induced classification boundary because they "pull" the classification boundary, causing several other instances to be misclassified that otherwise would have been classified correctly.

Fig. 1. Hypothetical two-dimensional data set that shows the potential effects of detrimental instances in the training data on a learning algorithm.

Despite most learning algorithms having a mechanism to avoid overfitting, the presence of detrimental instances may still affect the induced model for many learning algorithms. Mathematically, the effect of each instance on the induced hypothesis is shown in Equation 1: the loss from each instance in T, including detrimental instances, is equally weighted. Detrimental instances have the most significant impact during the early stages of training, when it is difficult to identify them [6]. The presence of D may also affect the value of R(h). For example, removing D from T could produce a "simpler" h that reduces R(h).

3.1 Hyperparameter Optimization

The quality of a model induced by a learning algorithm depends in part on the learning algorithm's hyperparameters. With hyperparameter optimization (HPO), the hyperparameter space \Lambda is searched to minimize the empirical error on V:

\argmin_{\lambda \in \Lambda} E(g(T, \lambda), V).    (2)

The hyperparameters can have a significant effect on the quality of the induced model and can also suppress the effects of detrimental instances. For example, in a support vector machine, Collobert et al. [4] handle detrimental instances by using the ramp loss function, which limits the penalty on instances that are too far from the decision boundary, rather than the more typical 0-1 loss function. Suppressing the effects of detrimental instances with HPO improves the induced model, but does not change the fact that detrimental instances still affect the model: each instance is still considered during the learning process, though its influence may be lessened. We describe the method we use for HPO in Section 4.1.

3.2 Filtering

The quality of an induced model also depends on the quality of the training data, where, for example, the quality of the training data can be measured by the number of detrimental instances present. Low-quality training data results in lower-quality induced models. Improving the quality of the training data involves searching the training set space to find an optimal subset that minimizes the empirical error:

\argmin_{t \in \mathcal{P}(T)} E(g(t, \lambda), V)

where t is a subset of T and \mathcal{P}(T) is the power set of T. The removed instances obviously have no effect on the induced model. In Section 4.2, we describe how we identify detrimental instances and search for an optimal subset of the training data that minimizes the empirical error.
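To make the subset-search objective above concrete, the sketch below (an illustrative Python example with scikit-learn; the toy data, the 3-NN stand-in for g, and the injected mislabeled instance are assumptions, not the paper's setup) enumerates subsets of a tiny training set and keeps the one with the lowest validation error. Its exponential cost is exactly why the heuristics of Section 4.2 are needed.

```python
# Brute-force illustration of argmin_{t in P(T)} E(g(t, lambda), V).
# Only feasible for toy |T|, since there are 2^|T| candidate subsets.
from itertools import combinations

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
T_X = rng.normal(size=(12, 2))                  # tiny training set T
T_y = (T_X[:, 0] + T_X[:, 1] > 0).astype(int)
T_y[0] ^= 1                                     # inject one detrimental (mislabeled) instance
V_X = rng.normal(size=(200, 2))                 # validation set V
V_y = (V_X[:, 0] + V_X[:, 1] > 0).astype(int)

best_err, best_subset = 1.0, None
for size in range(3, len(T_y) + 1):             # need at least 3 instances for 3-NN
    for subset in combinations(range(len(T_y)), size):  # every t in P(T) of this size
        idx = list(subset)
        h = KNeighborsClassifier(n_neighbors=3).fit(T_X[idx], T_y[idx])
        err = np.mean(h.predict(V_X) != V_y)    # E(g(t, lambda), V)
        if err < best_err:
            best_err, best_subset = err, subset

print("best subset:", best_subset, "error:", best_err)
```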
4 Implementation Details

4.1 Bayesian Hyperparameter Optimization

In this paper, we use Bayesian optimization for HPO. Specifically, we use sequential model-based optimization (SMBO) [10], as it has been shown to yield better performance than grid and random search [23,24]. SMBO is a stochastic optimization framework that builds a probabilistic model M that captures the dependence of E on λ. SMBO first initializes M. It then searches the hyperparameter space by 1) querying M for a promising λ to evaluate, 2) evaluating the loss E of using configuration λ, and 3) updating M with λ and E. Once the budgeted time is exhausted, the hyperparameter configuration with the minimal loss is returned.

To select a candidate hyperparameter configuration, SMBO relies on an acquisition function a_M : \Lambda \rightarrow \mathbb{R}, which uses the predictive distribution of M to quantify how useful knowledge about λ would be. SMBO maximizes a_M over \Lambda to select the most useful hyperparameter configuration λ to evaluate next. One of the most prominent acquisition functions is the positive expected improvement (EI) over an existing error rate E_{min} [19]. If E(\lambda) represents the error rate of hyperparameter configuration λ, then the EI function over E_{min} is EI_{E_{min}}(\lambda) = \max\{E_{min} - E(\lambda), 0\}. As E(\lambda) is unknown, the expectation of E(\lambda) with respect to the current model M can be computed as:

E_M[EI_{E_{min}}(\lambda)] = \int_{-\infty}^{E_{min}} \max\{E_{min} - E, 0\} \cdot p(E|\lambda) \, dE.

SMBO is dependent on the model class used for M. Following [24], we use sequential model-based algorithm configuration (SMAC) [10] for M with EI as a_M, although others, such as the tree-structured Parzen estimator, could be used. To model p(E|λ), we use random forests, as they tend to perform well with discrete and continuous input data. Using random forests, SMAC obtains a predictive mean \mu_\lambda and variance \sigma_\lambda^2 of p(E|λ), calculated from the predictions of the individual trees in the forest for λ. p(E|λ) is then modeled as a Gaussian distribution \mathcal{N}(\mu_\lambda, \sigma_\lambda^2). To create diversity in the evaluated configurations, every second configuration is selected at random, as suggested in [24].

For k-fold cross-validation, the standard approach finds the hyperparameters that minimize the error for each of the k validation sets, as shown in Equation 2. The optimistic approach finds the hyperparameters that minimize the k-fold cross-validation error: \argmin_{\lambda \in \Lambda} \frac{1}{k} \sum_{i=1}^{k} E(g(T_i, \lambda), V_i), where T_i and V_i are the training and validation sets for the ith fold. The hyperparameter space \Lambda is searched using Bayesian hyperparameter optimization for both approaches.
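The SMBO loop described above can be summarized in a few dozen lines. The following sketch is a hypothetical Python illustration with a synthetic one-dimensional hyperparameter and a toy error function (the paper itself uses SMAC via auto-WEKA); it shows the query/evaluate/update cycle with a random-forest surrogate, a Gaussian EI acquisition built from the per-tree predictions, and the every-second-configuration-random rule.

```python
# Simplified SMBO loop with an EI acquisition and a random-forest surrogate,
# in the spirit of SMAC. The objective and candidate sampling are assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)


def error(lam):
    """Stand-in for E(g(T, lambda), V): the validation error of configuration lambda."""
    return 0.2 + 0.1 * np.sin(3 * lam) + 0.05 * (lam - 1.5) ** 2


def expected_improvement(model, candidates, e_min):
    """E_M[EI_{E_min}(lambda)] under a Gaussian N(mu, sigma^2) built from the forest's trees."""
    per_tree = np.stack([t.predict(candidates) for t in model.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0) + 1e-9
    z = (e_min - mu) / sigma
    return (e_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)


# Initialize M with a few random configurations.
lams = list(rng.uniform(0, 3, size=3))
errs = [error(l) for l in lams]

for step in range(20):
    if step % 2 == 1:                       # every second configuration is random (diversity)
        nxt = rng.uniform(0, 3)
    else:                                   # otherwise maximize EI over sampled candidates
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(np.array(lams).reshape(-1, 1), errs)
        cands = rng.uniform(0, 3, size=500).reshape(-1, 1)
        nxt = cands[int(np.argmax(expected_improvement(model, cands, min(errs)))), 0]
    lams.append(nxt)
    errs.append(error(nxt))                 # evaluate the loss and update M's data

print("best lambda:", lams[int(np.argmin(errs))], "error:", min(errs))
```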
4.2 Filtering

Identifying detrimental instances is a non-trivial task. Fully searching the space of subsets of training instances generates 2^N subsets, where N is the number of training instances. Even for small data sets, it is computationally infeasible to induce 2^N models to determine which instances are detrimental. There is no known way to determine how a set of instances will affect the classification function induced by a learning algorithm without inducing a classification function with the investigated set of instances removed from the training set.

The Standard Filtering Approach (G-Filter). Previous work in noise handling has shown that class noise (e.g., mislabeled instances) is more detrimental than attribute noise [15]. Thus, searching for detrimental instances that are likely to be misclassified is a natural place to start. In other words, we search for instances where the probability of the class label is low given the feature values (i.e., low p(y_i|x_i)). In general, p(y_i|x_i) does not make sense outside the context of an induced hypothesis. Thus, using a hypothesis h induced by a learning algorithm trained on T, the quantity p(y_i|x_i) can be approximated as p(y_i|x_i, h). After training a learning algorithm on T, the class distribution for an instance x_i can be estimated from the output of the learning algorithm. Prior work has examined removing instances that are misclassified by a learning algorithm or an ensemble of learning algorithms [3]. We filter instances using an ensemble filter that removes instances that are misclassified by more than x% of the algorithms in the ensemble. The dependence of p(y_i|x_i, h) on a particular h can be lessened by summing over the space of all possible hypotheses:

p(y_i|x_i) = \sum_{h \in H} p(y_i|x_i, h) \, p(h|T).    (3)

However, this formulation is infeasible to compute in most practical applications, as p(h|T) is generally unknown and H is large and possibly infinite. To sum over H, one would have to sum over the complete set of hypotheses or, since h = g(T, λ), over the complete set of learning algorithms and their associated hyperparameters. The quantity p(y_i|x_i) can instead be estimated by restricting attention to a diverse set of representative algorithms (and hyperparameters). The diversity of the learning algorithms refers to the likelihood that the learning algorithms classify instances differently. A natural way to approximate the unknown distribution p(h|T) is to weight a set G of representative learning algorithms, and their associated hyperparameters, a priori with an equal, non-zero probability while treating all other learning algorithms as having zero probability. We select a diverse set of learning algorithms using unsupervised metalearning (UML) [13] to get a good representation of H, and hence a reasonable estimate of p(y_i|x_i). UML uses Classifier Output Difference (COD) [16], which measures the diversity between learning algorithms as the probability that they make different predictions. UML clusters the learning algorithms based on their COD scores with hierarchical agglomerative clustering. Here, we consider 20 commonly used learning algorithms with their default hyperparameters as set in Weka [9]. A cut-point of 0.18 was chosen to create nine clusters, and a representative algorithm from each cluster was used to create G, as shown in Table 1.

Table 1. Set of learning algorithms G used to estimate p(y_i|x_i).
* Multilayer Perceptron trained with Back Propagation (MLP)
* Decision Tree (C4.5)
* Locally Weighted Learning (LWL)
* 5-Nearest Neighbors (5-NN)
* Nearest Neighbor with Generalization (NNge)
* Naïve Bayes (NB)
* Ripple Down Rule learner (Ridor)
* Random Forest (RandForest)
* Repeated Incremental Pruning to Produce Error Reduction (RIPPER)

Given a set G of learning algorithms, we approximate Equation 3 as:

p(y_i|x_i) \approx \frac{1}{|G|} \sum_{j=1}^{|G|} p(y_i|x_i, g_j(T, \lambda))    (4)

where p(h|T) is approximated as \frac{1}{|G|} and g_j is the jth learning algorithm from G. As not all learning algorithms produce probabilistic outputs, the distribution p(y_i|x_i, g_j(T, \lambda)) is estimated using the Kronecker delta function in this paper.
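A minimal sketch of the Equation 4 estimate follows, assuming a small scikit-learn stand-in for G (the paper's G is the Weka set in Table 1) and the Kronecker delta, which reduces p(y_i|x_i, g_j) to 1 if g_j predicts y_i and 0 otherwise; the threshold φ mirrors one of the settings used later in the paper.

```python
# Sketch of the Equation 4 estimate of p(y_i | x_i) with a Kronecker delta,
# and of a G-filter-style removal of instances with p(y_i | x_i) below phi.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

G = [
    DecisionTreeClassifier(random_state=0),   # stand-in for C4.5
    KNeighborsClassifier(n_neighbors=5),      # 5-NN
    GaussianNB(),                             # naive Bayes
]

X, y = load_breast_cancer(return_X_y=True)

# p(y_i | x_i) ~ (1/|G|) * sum_j delta(g_j(x_i), y_i), with each g_j trained on T.
votes = np.zeros(len(y))
for g in G:
    votes += (g.fit(X, y).predict(X) == y)    # Kronecker delta per instance
p_y_given_x = votes / len(G)

phi = 0.5
keep = p_y_given_x >= phi                     # instances below phi are flagged as detrimental
X_filtered, y_filtered = X[keep], y[keep]
print(f"removed {np.sum(~keep)} of {len(y)} instances")
```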
The Optimistic Filtering Approach (A-Filter). To measure the potential impact of filtering, we need to know how removing an instance or a set of instances affects the generalization capabilities of the model. We measure this by dynamically creating an ensemble filter from G using a greedy algorithm for a given data set and learning algorithm. This allows us to find the specific ensemble filter that is best for a given data set and learning algorithm combination. The adaptive ensemble filter is constructed by iteratively adding the learning algorithm g from G that produces the highest cross-validation classification accuracy when g is added to the ensemble filter. Because we use the probability that an instance will be misclassified rather than a binary yes/no decision (Equation 4), we also use a threshold φ to determine which instances are detrimental: instances with a p(y_i|x_i) less than φ are discarded from the training set. A constant threshold value for φ is used to filter the instances for all iterations. The baseline accuracy for the adaptive approach is the accuracy of the learning algorithm without filtering. The search stops once adding any of the remaining learning algorithms to the ensemble filter does not increase accuracy, or once all of the learning algorithms in G have been used.

Even though all of the detrimental instances are included for evaluation, the adaptive filter (A-Filter) overfits the data, since the cross-validation accuracy is used to determine which set of learning algorithms to use in the ensemble filter. This allows us to find the detrimental instances and to examine the effects they can have on an induced model. This is not feasible in practical settings, but it provides insight into the potential improvement gained from filtering.
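A condensed sketch of the A-Filter's greedy construction is given below, assuming a small scikit-learn stand-in for G, a fixed threshold φ, and an arbitrary base learner and data set (all illustrative choices, not the paper's setup). The filter is applied only to each fold's training portion so that every instance, detrimental or not, still appears in the evaluation, as described above.

```python
# Sketch of the adaptive ensemble filter (A-Filter): greedily grow the filter
# ensemble from G while the 10-fold cross-validation accuracy of the base learner improves.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def filtered_cv_accuracy(base, ensemble, X, y, phi, k=10):
    """k-fold CV accuracy; the filter is applied to each fold's training set only,
    so all instances (including detrimental ones) remain in the evaluation folds."""
    accs = []
    for tr, va in StratifiedKFold(n_splits=k, shuffle=True, random_state=0).split(X, y):
        keep = np.ones(len(tr), dtype=bool)
        if ensemble:                          # estimate p(y_i|x_i) on the fold's training set
            votes = np.mean([clone(g).fit(X[tr], y[tr]).predict(X[tr]) == y[tr]
                             for g in ensemble], axis=0)
            keep = votes >= phi
        h = clone(base).fit(X[tr][keep], y[tr][keep])
        accs.append(np.mean(h.predict(X[va]) == y[va]))
    return float(np.mean(accs))


X, y = load_breast_cancer(return_X_y=True)
base = KNeighborsClassifier(n_neighbors=5)    # learning algorithm being evaluated
G = [DecisionTreeClassifier(random_state=0), GaussianNB(), KNeighborsClassifier(n_neighbors=3)]
phi = 0.5

ensemble, best = [], filtered_cv_accuracy(base, [], X, y, phi)   # baseline: no filtering
candidates = list(G)
while candidates:
    scores = [filtered_cv_accuracy(base, ensemble + [g], X, y, phi) for g in candidates]
    if max(scores) <= best:                   # stop once no addition improves accuracy
        break
    best = max(scores)
    ensemble.append(candidates.pop(int(np.argmax(scores))))

print("filter ensemble:", [type(g).__name__ for g in ensemble], "CV accuracy:", round(best, 4))
```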
5 Filtering and HPO

In this section, we compare the effects of filtering with those of HPO using the optimistic and standard approaches presented in Section 4. The optimistic approach provides an approximation of the potential of HPO and filtering. In addition to reporting the average classification accuracy, we also report the average rank of each approach. The average accuracy and rank for each algorithm are determined using 5 by 10-fold cross-validation. Statistical significance between pairs of algorithms is determined using the Wilcoxon signed-ranks test (as suggested by [5]) with an alpha value of 0.05.

5.1 Experimental Methodology

For HPO, we use the version of SMAC implemented in auto-WEKA [24], as described in Section 4.1. Auto-WEKA searches the hyperparameter spaces of the learning algorithms in the Weka machine learning toolkit [9] for a specified amount of time. To estimate the amount of time required for a learning algorithm to induce a model of the data, we ran our selected learning algorithms with ten random hyperparameter settings and calculated the average and maximum running times. On average, a model was induced in less than 3 minutes; the longest time required to induce a model was 845 minutes. Based on this analysis, we run auto-WEKA for one hour for most of the data sets. An hour-long search explores more than 512 hyperparameter configurations for most of the learning algorithm/data set combinations. The time limit is adjusted accordingly for the larger data sets. Following [24], we run four runs with different random seeds provided to SMAC.

For filtering using the ensemble filter (G-Filter), we use thresholds φ of 0.5, 0.7, and 0.9. Instances that are misclassified by more than a fraction φ of the learning algorithms are removed from the training set. The G-Filter uses all of the learning algorithms in the set G (Table 1). We report the accuracy on the test set from the value of φ that produces the highest accuracy on the training set.

To show the effect of filtering detrimental instances and of HPO on an induced model, we examine filtering and HPO with six commonly used learning algorithms (MLP trained with backpropagation, C4.5, kNN, naïve Bayes, random forest, and RIPPER) on a set of 46 UCI data sets [14]. The LWL, NNge, and Ridor learning algorithms are not used for analysis because they do not scale well to the larger data sets, failing to finish due to memory overflow or excessive running time. (For the data sets on which these learning algorithms did finish, the effects of HPO and filtering on LWL, NNge, and Ridor are consistent with the other learning algorithms.)
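As a concrete illustration of the pairwise comparison protocol described at the start of this section, the following sketch (hypothetical Python/SciPy code with made-up accuracy values, not the paper's results) applies the Wilcoxon signed-ranks test to per-data-set accuracies of two approaches at alpha = 0.05.

```python
# Sketch of the pairwise significance test of Section 5: a Wilcoxon signed-ranks
# test over per-data-set accuracies. The accuracy arrays are placeholders.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
acc_hpo = rng.uniform(0.70, 0.95, size=46)                               # HPO accuracy per data set
acc_filter = np.clip(acc_hpo + rng.normal(0.03, 0.02, size=46), 0, 1)    # filtering accuracy per data set

stat, p_value = wilcoxon(acc_filter, acc_hpo)   # paired test over the 46 data sets
print(f"W = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("difference is statistically significant at alpha = 0.05")
```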
5.2 Optimistic Approach

The optimistic approach indicates how well a model could generalize to novel data. Maximizing the cross-validation accuracy is a type of overfitting; however, using the 10-fold cross-validation accuracy for HPO and filtering essentially measures the generalization capability of a learning algorithm for a given data set.

The results comparing the potential benefits of HPO and filtering are shown in Table 2. Each section gives the average accuracy and average rank for each learning algorithm, as well as the number of times the algorithm is greater than, equal to, or less than a compared algorithm. HPO and the adaptive filter significantly increase the classification accuracy for all of the investigated learning algorithms. For all of the investigated learning algorithms, the A-Filter also significantly increases the accuracy over HPO. The closest the two techniques come to each other is for NB, where the A-Filter achieves an accuracy of 82.74% and an average rank of 1.30 while HPO achieves an accuracy of 80.89% and an average rank of 1.96. For all learning algorithms other than NB, the average accuracy is about 89% for filtering and 84% for HPO. Thus, filtering has the greater potential for increasing generalization accuracy. The difficulty lies in how to find the optimal set of training instances.

Table 2. Results for maximizing the 10-fold cross-validation accuracy for HPO and filtering. Each cell gives the average accuracy with the average rank in parentheses; "vs" rows count the data sets on which the approach is greater than, equal to, or less than the compared approach.

              MLP            C4.5           kNN            NB             RF             RIP
Orig          82.28 (2.98)   81.30 (2.91)   80.56 (2.74)   77.66 (2.70)   82.98 (2.89)   79.86 (2.96)
HPO           86.37 (1.87)   84.25 (1.96)   83.89 (2.22)   80.89 (1.96)   86.81 (1.85)   82.08 (1.80)
  vs Orig     45,0,1         42,1,3         34,1,11        34,0,12        44,0,2         46,0,0
A-Filter      89.96 (1.13)   88.74 (1.09)   91.14 (1.02)   82.74 (1.30)   91.02 (1.20)   88.16 (1.24)
  vs Orig     46,0,0         46,0,0         46,0,0         44,2,0         43,3,0         44,0,2
  vs HPO      39,1,6         41,1,4         45,0,1         32,0,14        37,0,9         37,0,9

As might be expected, there is no single set of learning algorithms that is the optimal ensemble filter for all algorithms and/or data sets. Table 3 shows the frequency with which a learning algorithm with default hyperparameters was selected for filtering by the A-Filter. The column "ALL" gives the average over all of the learning algorithms used as the base learner. No instances are filtered in 5.36% of the cases; thus, given the right filter, filtering increases the classification accuracy to some extent in about 95% of the cases. Furthermore, random forest, NNge, MLP, and C4.5 are the most commonly chosen algorithms for inclusion in the ensemble filter. However, no single learning algorithm is selected in more than 27% of the cases. The most appropriate filtering algorithm is dependent on the data set and the learning algorithm. This coincides with the findings from [18] that the efficacy of noise filtering in the nearest-neighbor classifier is dependent on the characteristics of the data set. Understanding the efficacy of filtering and determining which filtering approach to use for a given algorithm/data set combination is a direction of future work.

Table 3. The frequency of selecting a learning algorithm when adaptively constructing an ensemble filter. Each row gives the percentage of cases in which the algorithm was included in the ensemble filter for the learning algorithm in the column.

          ALL     MLP     C4.5    kNN     NB      RF      RIP
None      5.36    2.69    2.95    3.08    5.64    5.77    1.60
MLP       18.33   16.67   15.77   20.00   25.26   23.72   16.36
C4.5      17.17   17.82   15.26   22.82   14.49   13.33   20.74
5-NN      12.59   11.92   14.23   1.28    10.00   17.18   16.89
LWL       6.12    3.59    3.85    4.36    23.72   3.33    3.59
NB        7.84    5.77    6.54    8.08    5.13    10.26   4.92
NNge      19.49   26.67   21.15   21.03   11.15   24.74   23.40
RF        21.14   22.95   26.54   23.33   15.77   15.13   24.20
Ridor     14.69   14.87   16.79   18.33   11.92   16.54   12.77
RIPPER    8.89    7.82    7.69    8.85    13.08   7.44    4.39

Analysis. In some cases HPO achieves a lower accuracy than Orig (the learning algorithm with default hyperparameters), showing the complexity of HPO. The A-Filter, on the other hand, never fails to improve the accuracy. Thus, higher-quality data can compensate for hyperparameter settings, which suggests that the instance space may be less complex and/or richer than the hyperparameter space. Of course, filtering does not outperform HPO in all cases, but it does so in the majority of cases.

5.3 Standard Approach

The previous results show the potential impact of filtering and HPO. We now examine HPO and filtering using the standard approach to highlight the need for improvement in filtering. The results comparing the G-Filter, HPO, and using the default hyperparameters trained on the original data set are shown in Table 4. HPO significantly increases the classification accuracy over not using HPO for all of the learning algorithms. Filtering significantly increases the accuracy for all of the investigated algorithms except for random forests. Comparing HPO and the G-Filter, only HPO for naïve Bayes and random forests significantly outperforms the G-Filter.

Table 4. Performance comparison of using the default hyperparameters, HPO, and the G-filter. Each cell gives the average accuracy with the average rank in parentheses; "vs" rows count the data sets on which the approach is greater than, equal to, or less than the compared approach.

              MLP            C4.5           kNN            NB             RF             RIP
Orig          82.28 (2.54)   81.3 (2.13)    80.56 (2.26)   77.66 (2.28)   82.98 (2.28)   79.86 (2.46)
HPO           83.08 (1.78)   82.42 (1.76)   82.85 (1.59)   81.11 (1.63)   84.72 (1.54)   81.15 (1.74)
  vs Orig     32,0,14        29,1,16        31,5,10        31,2,13        34,0,12        30,2,14
G-Filter      84.17 (1.61)   81.9 (1.96)    81.61 (2.00)   79.49 (2.02)   83.49 (2.17)   81.25 (1.72)
  vs Orig     39,0,7         23,5,18        27,1,18        28,1,17        25,0,21        37,0,9
  vs HPO      22,3,21        19,1,26        17,1,28        16,0,30        13,0,33        20,2,24

Analysis. In their current state, HPO and filtering generally improve the quality of the induced model, and the results justify the computational overhead required to run HPO. Despite these results, using the default hyperparameters results in higher classification accuracy for 11 of the 46 data sets for C4.5, 12 for kNN, and 12 for NB, highlighting the complexity of searching the hyperparameter space \Lambda. The effectiveness of HPO is dependent on the data set as well as the learning algorithm. Typically, as was done here, a single filtering technique is used for a set of data sets with no model of the dependence of a learning algorithm on the training instances. The accuracies for filtering and HPO are significantly lower than the optimistic estimates given in Section 5.2, motivating future work in HPO and especially in filtering.
6 Conclusion

In this paper, we compared the potential benefits of filtering with those of HPO. HPO may reduce the effects of detrimental instances on an induced model, but the detrimental instances are still considered in the learning process. Filtering, on the other hand, removes the detrimental instances, completely eliminating their effects on the induced model. We used an optimistic approach to estimate the potential accuracy of each method. Using the optimistic approach, both filtering and HPO significantly increase the classification accuracy for all of the considered learning algorithms. However, filtering has a greater potential effect on average, increasing the classification accuracy from 80.8% to 89.1% on the observed data sets, while HPO increases the average classification accuracy to 84.8%. Future work includes developing models to understand the dependence of a learning algorithm's performance on the instances used for training. To better understand how instances affect each other, we are examining the results of machine learning experiments stored in repositories that record which instances were used for training and their predicted class [25,22]. We hope that the presented results provide motivation for improving the quality of the training data.

References

1. J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
2. J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2546–2554. Curran Associates, Inc., 2011.
3. C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.
4. R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning, pages 201–208, 2006.
5. J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
6. J. L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48:71–99, 1993.
7. B. Frénay and M. Verleysen. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.
8. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
9. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
10. F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the International Learning and Intelligent Optimization Conference, pages 507–523, 2011.
11. J. Kubica and A. Moore. Probabilistic noise identification and data cleaning. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 131–138, 2003.
12. H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, pages 473–480, 2007.
13. J. Lee and C. Giraud-Carrier. A metric for unsupervised metalearning. Intelligent Data Analysis, 15(6):827–841, 2011.
14. M. Lichman. UCI machine learning repository, 2013.
15. D. F. Nettleton, A. Orriols-Puig, and A. Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 33(4):275–306, 2010.
16. A. H. Peterson and T. R. Martinez. Estimating the potential for combining learning models. In Proceedings of the ICML Workshop on Meta-Learning, pages 68–75, 2005.
17. U. Rebbapragada and C. E. Brodley. Class noise mitigation through instance weighting. In Proceedings of the 18th European Conference on Machine Learning, pages 708–715, 2007.
18. J. A. Sáez, J. Luengo, and F. Herrera. Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification. Pattern Recognition, 46(1):355–364, 2013.
19. M. Schonlau, W. J. Welch, and D. R. Jones. Global versus local search in constrained optimization of computer models, volume 34 of Lecture Notes–Monograph Series, pages 11–25. Institute of Mathematical Statistics, Hayward, CA, 1998.
20. M. R. Smith and T. Martinez. Improving classification accuracy by identifying and removing instances that should be misclassified. In Proceedings of the IEEE International Joint Conference on Neural Networks, pages 2690–2697, 2011.
21. M. R. Smith, T. Martinez, and C. Giraud-Carrier. An instance level analysis of data complexity. Machine Learning, 95(2):225–256, 2014.
22. M. R. Smith, A. White, C. Giraud-Carrier, and T. Martinez. An easy to use repository for comparing and improving machine learning algorithm usage. In Proceedings of the 2014 International Workshop on Meta-learning and Algorithm Selection (MetaSel), pages 41–48, 2014.
23. J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.
24. C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th International Conference on Knowledge Discovery and Data Mining, pages 847–855, 2013.
25. J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.
26. D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.