<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Potential Benefits of Data Set Filtering and Learning Algorithm Hyperparameter Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael R. Smith</string-name>
          <email>msmith@axon.cs.byu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tony Martinez</string-name>
          <email>martinez@cs.byu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christophe Giraud-Carrier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Brigham Young University</institution>
          ,
          <addr-line>Provo, UT 84602</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The quality of a model induced by a learning algorithm is dependent upon the training data and the hyperparameters supplied to the learning algorithm. Prior work has shown that a model's quality can be significantly improved by filtering out low quality instances or by tuning the learning algorithm hyperparameters. The potential impact of filtering and hyperparameter optimization (HPO) is largely unknown. In this paper, we estimate the potential benefits of instance filtering and HPO. While both HPO and filtering significantly improve the quality of the induced model, we find that filtering has a greater potential effect on the quality of the induced model than HPO, motivating future work in filtering.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Given a set of training instances composed of input feature vectors and corresponding
labels, the goal of supervised machine learning is to induce an accurate generalizing
function (hypothesis) that maps feature vectors to labels. The quality of the induced
function is dependent on the learning algorithm’s hyperparameters and the quality of
the training data. It is known that no learning algorithm or hyperparameter setting is
best for all data sets (no free lunch theorem [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]) and that the performance of many
learning algorithms is sensitive to their hyperparameter settings. It is also well-known
that real-world data sets are typically noisy.
      </p>
      <p>
        Prior work has shown that the generalization performance of an induced model
can be significantly improved through hyperparameter optimization (HPO) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or by
increasing the quality of the training data using techniques such as noise correction [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
instance weighting [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], or instance filtering [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Searching the hyperparameter space
and improving the quality of the training data have generally been examined in isolation
and the potential impact of their usage has not been examined. In this paper, we compare
the effects of HPO with the effects of improving the quality of the training data through
filtering. The results of our experiments provide insight into the potential effectiveness
of both HPO and filtering.
      </p>
      <p>We evaluate 6 commonly used learning algorithms and 46 data sets. We examine
the effects of HPO and filtering by: 1) using a standard approach that selects the
hyperparameters of an algorithm by maximizing the accuracy on a validation set and 2) using
an optimistic approach that sets the hyperparameters for an algorithm using the
10fold cross-validation accuracy. The standard and optimistic approaches are explained in
more detail in Section 4. Essentially, the optimistic approach indicates how well a
technique could perform if the training set were representative of the test set and provides
insight into the potential benefit of a given technique. The standard approach provides a
representative view of HPO and filtering in their present state and allows an evaluation
of how well current HPO and filtering techniques fulfill their potential.</p>
      <p>Using the standard approach, we find that in most cases both HPO and filtering
significantly increase classification accuracy over using a learning algorithm with its
default parameters trained on unfiltered data. For the optimistic estimates of HPO and
filtering, we find that filtering significantly improves the classification accuracy over
HPO for all of the investigated learning algorithms–increasing the accuracy more than
HPO for almost all of the considered data sets. HPO achieves an average accuracy of
84.8% while filtering achieves an average accuracy of 89.1%. The standard approach
for HPO and filtering achieves an average accuracy of 82.6% and 82.0% respectively.
These results provide motivation for further research into developing algorithms that
improve the quality of the training data.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Smith et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] found that a significant number of instances are difficult to classify
correctly, that the hardness of each instance is dependent on its relationship with the
other instances in the training set and that some instances can be detrimental. Thus, there
is a need for improving how detrimental instances are handled during training as they
affect the classification of other instances. Improving the quality of the training data has
typically fallen into three approaches: filtering, cleaning, and instance weighting [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Each technique within an approach differs in how detrimental instances are
identified. A common technique for filtering removes instances from a data set that are
misclassified by a learning algorithm or an ensemble of learning algorithms [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Removing the training instances that are suspected to be noise and/or outliers prior to
training has the advantage that they do not influence the induced model and generally
increase classification accuracy. A negative side-effect of filtering is that beneficial
instances can also be discarded and produce a worse model than if all of the training data
had been used [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Rather than discarding the instances from a training set, noisy or
possibly corrupted instances can be cleaned or corrected [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, this could
artificially corrupt valid instances. Alternatively, weighting weights suspected detrimental
instances rather than discards them and allows for an instance to be considered on a
continuum of detrimentality rather than making a binary decision [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Other methods exist for improving the quality of the training data, such as feature
selection/extraction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. While feature selection and extraction can improve the quality
of the training data, we focus on improving quality via filtering – facilitating a
comparison between filtering and HPO on the same feature set.
      </p>
      <p>Much of the previous work in improving the quality of the training data artificially
corrupts training instances to determine how well an approach would work in the
presence of noisy or mislabeled instances. In some cases, a given approach only has a
significant impact when there are large degrees of artificial noise. In contrast, we do not
artificially corrupt a data set to create detrimental instances. Rather, we seek to identify
the detrimental instances that are already contained in a data set and show that correctly
labeled, non-noisy instances can also be detrimental for inducing a model of the data.
Properly handling detrimental instances can result in significant gains in accuracy.</p>
      <p>
        The grid search and manual search are the most common types of HPO techniques
in machine learning and a combination of the two approaches are commonly used [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Bergstra and Bengio [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed to use a random search of the hyperparameter space.
The premise of random HPO is that most machine learning algorithms have very few
hyperparameters that considerably affect the final model while the other
hyperparameters have little to no effect. Random search provides a greater variety of the
hyperparameters that considerably affect the model. Given the same amount of time constraints,
random HPO has been shown to outperform a grid search. Random search, while
providing improvements over a grid-search, is unreliable for tuning the hyperparameters
for some learning algorithms such as deep belief networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Bayesian optimization
has also been used to search the hyperparameter space [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Bayesian optimization
techniques model the dependence of an error function E on the hyperparameters λ as p(E |λ)
using, for example, random forests [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or Gaussian processes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries</title>
      <p>Let T represent a training set composed of a set of input vectors X = {x1, x2, . . . , xn}
and corresponding label vectors Y = {y1, y2, . . . , yn}, i.e., T = {hxi, yii : xi ∈ X ∧
yi ∈ Y }. Given that in most cases, all that is known about a task is contained in the set
of training instances T , at least initially, the training instances are generally considered
equally. Most machine learning algorithms seek to induce a hypothesis h : X → Y that
minimizes a specified loss function L(·). As most real-world data sets contain some
level of noise, there is generally a model-dependent regularization term R(·) added to
L(·) that penalizes more complex models and aids in overfit avoidance. The noise in
T may arise from errors in the data collection process such as typos or errors in data
collection equipment. In addition to noise from errors, there may be non-noisy outlier
instances due to the stochastic nature of the task. A hypothesis h is induced by a learning
algorithm g trained on T with hyperparameters λ (h = g(T , λ)), such that:
1
h∗ = argmin
h∈H |T |</p>
      <p>X
hxi,yii∈T</p>
      <p>L(h(xi), yi) + αR(h)
(1)
where α is a regularization parameter greater than or equal to 0 that determines how
much weight to apply to the regularization term and h(·) returns the predicted class for a
given input. The quality of the induced hypothesis h is characterized by its empirical
er1
ror for a specified error function E on a test set V : E(h, V ) = |V | Phxi,yii∈V E (h(xi), yi)
where V can be T or a disjoint set of instances. In k-fold cross-validation, the empirical
error is the average empirical error from the k folds (i.e., 1/k E(hi, Vi)).</p>
      <p>Characterizing the success of a learning algorithm at the data set level (e.g.,
accuracy or precision) optimizes over the entire training set and marginalizes the impact of
a single training instance on an induced model. Some sets of instances can be more
beneficial than others for inducing a model of the data and some can even be
detrimental. By detrimental instances, we mean instances that have a negative impact on
the induced model. For example, outliers or mislabeled instances are not as beneficial
as border instances and are detrimental in many cases. In addition, other instances can
be detrimental for inducing a model of the data even if they are labeled correctly.
Formally, a set D of detrimental instances is a subset of the training data that, when used
in training, increases the empirical error, i.e., E(g(T , λ), V ) &gt; E(g(T − D, λ), V ).</p>
      <p>The effect of training with detrimental instances is demonstrated in the hypothetical
two-dimensional data set shown in Figure 1. Instances A and B represent detrimental
instances. The solid line represents the “actual” classification boundary and the dashed
line represents a potential induced classification boundary. Instances A and B adversely
affect the induced classification boundary because they “pull” the classification
boundary and cause several other instances to be misclassified that otherwise would have been
classified correctly.</p>
      <p>A</p>
      <p>B</p>
      <p>
        Despite most learning algorithms having a mechanism to avoid overfitting, the
presence of detrimental instances may still affect the induced model for many learning
algorithms. Mathematically, the effect of each instance on the induced hypothesis is shown
in Equation 1. The loss from each instance in T , including detrimental instances, is
equally weighted. Detrimental instances have the most significant impact during the
early stages of training where it is difficult to identify them [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The presence of D
may also affect the value of R(h). For example, removing D from T could produce a
“simpler” h that reduces R(h).
The quality of an induced model by a learning algorithm depends in part on the learning
algorithm’s hyperparameters. With hyperparameter optimization (HPO), the
hyperparameter space Λ is searched to minimize the empirical error on V :
argmin E(g(T , λ), V ).
      </p>
      <p>
        λ∈Λ
(2)
The hyperparameters can have a significant effect on the quality of the induced model
as well as suppressing the effects of detrimental instances. For example, in a support
vector machine, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] use the ramp-loss function which limits the penalty on instances that
are too far from the decision boundary rather than the more typical 0-1 loss function to
handle detrimental instances. Suppressing the effects of detrimental instances with HPO
improves the induced model, but does not change the fact that detrimental instances still
affect the model. Each instance is still considered during the learning process though its
influence may be lessened. We describe the method we use for HPO in Section 4.1.
3.2
      </p>
      <sec id="sec-3-1">
        <title>Filtering</title>
        <p>The quality of an induced model also depends on the quality of the training data where,
for example, the quality of the training data can be measured by the amount of
detrimental instances present. Low quality training data results in lower quality induced models.
Improving the quality of the training data involves searching the training set space to
find an optimal subset that minimizes the empirical error:
argmin E(g(t, λ), V )
t∈P(T )
where t is a subset of T and P (T ) is the power set of T . The removed instances
obviously have no effect on the induced model. In Section 4.2, we describe how we identify
detrimental instances and search for an optimal subset of the training data that
minimizes empirical error.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Implementation Details</title>
      <p>4.1</p>
      <sec id="sec-4-1">
        <title>Bayesian Hyperparameter Optimization</title>
        <p>
          In this paper, we use Bayesian optimization for HPO. Specifically, we use sequential
model-based optimization (SMBO) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as it has been shown to yield better
performance than grid and random search [
          <xref ref-type="bibr" rid="ref23 ref24">23,24</xref>
          ]. SMBO is a stochastic optimization
framework that builds a probabilistic model M that captures the dependence of E on λ.
SMBO first initializes M. After initializing M, SMBO searches the search space by 1)
querying M for a promising λ to evaluate, 2) evaluating the loss E of using
configuration λ, and then 3) updating M with λ and E . Once the budgeted time is exhausted, the
hyperparameter configuration with the minimal loss is returned.
        </p>
        <p>
          To select a candidate hyperparameter configuration, SMBO relies on an acquisition
function aM : Λ → R which uses the predictive distribution of M to quantify how
useful knowledge about λ would be. SMBO maximizes aM over Λ to select the most
useful hyperparameter configuration λ to evaluate next. One of the most prominent
acquisition functions is the positive expected improvement (EI) over an existing error
rate Emin [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. If E (λ) represents the error rate of hyperparameter configuration λ,
then the EI function over Emin is: EIEmin (λ) = max{Emin − E (λ), 0}. As E (λ) is
unknown, the expectation of E (λ) with respect to the current model M can be computed
as: EM[EIEmin (λ)] = R−Em∞in max{Emin − E , 0} · p(E |λ)dE .
        </p>
        <p>
          SMBO is dependent on the model class used for M. Following [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], we use
sequential model-based algorithm configuration (SMAC) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for M with EI as aM, although
others could be used such as the tree-structured Parzen estimator. To model p(E |λ), we
use random forests as they tend to perform well with discrete and continuous input data.
Using random forests, SMAC obtains a predictive mean μλ and variance σλ2 of p(E |λ)
calculated using the predictions from the individual trees in the forest for λ. p(E |λ) is
then modeled as a Gaussian distribution N (μλ, σλ2). To create diversity in the evaluated
configurations, every second configuration is selected at random as suggested [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. For
k-fold cross-validation, the standard approach finds the hyperparameters that
minimize the error for each of the k validation sets as shown in Equation 2. The optimistic
approach finds the hyperparameters that minimize the k-fold cross-validation error:
argminλ∈Λ k1 E(g(Ti, λ), Vi) where Ti and Vi are the training and validation sets for
the ith fold. The hyperparameter space Λ is searched using Bayesian hyperparameter
optimization for both approaches.
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Filtering</title>
        <p>Identifying detrimental instances is a non-trivial task. Fully searching the space of
subsets of training instances generates 2N subsets of training instances where N is the
number of training instances. Even for small data sets, it is computationally infeasible
to induce 2N models to determine which instances are detrimental. There is no known
way to determine how a set of instances will affect the induced classification function
from a learning algorithm without inducing a classification function with the
investigated set of instances removed from the training set.</p>
        <p>
          The Standard Filtering Approach (G-Filter) Previous work in noise handling has
shown that class noise (e.g. mislabeled instances) is more detrimental than attribute
noise [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Thus, searching for detrimental instances that are likely to be misclassified
is a natural place to start. In other words, we search for instances where the
probability of the class label is low given the feature values (i.e., low p(yi|xi)). In general,
p(yi|xi) does not make sense outside the context of an induced hypothesis. Thus, using
an induced hypothesis h from a learning algorithm trained on T , the quantity p(yi|xi)
can be approximated as p(yi|xi, h). After training a learning algorithm on T , the class
distribution for an instance xi can be estimated based on the output from the
learning algorithm. Prior work has examined removing instances that are misclassified by a
learning algorithm or an ensemble of learning algorithms [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. We filter instances using
an ensemble filter that removes instances that are misclassified by more than x% of the
algorithms in the ensemble.
        </p>
        <p>The dependence of p(yi|xi, h) on a particular h can be lessened by summing over
the space of all possible hypotheses:
p(yi|xi) =</p>
        <p>X p(yi|xi, h)p(h|T ).
h∈H
(3)
However, this formulation is infeasible to compute in most practical applications as
p(h|T ) is generally unknown and H is large and possibly infinite. To sum over H, one
would have to sum over the complete set of hypotheses, or, since h = g(T , λ), over the
complete set of learning algorithms and their associated hyperparameters.</p>
        <p>
          The quantity p(yi|xi) can be estimated by restricting attention to a diverse set of
representative algorithms (and hyperparameters). The diversity of the learning
algorithms refers to the likelihood that the learning algorithms classify instances differently.
A natural way to approximate the unknown distribution p(h|T ) is to weight a set of
representative learning algorithms, and their associated hyperparameters, G, a priori
with an equal, non-zero probability while treating all other learning algorithms as
having zero probability. We select a diverse set of learning algorithms using unsupervised
metalearning (UML) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to get a good representation of H, and hence a reasonable
estimate of p(yi|xi). UML uses Classifier Output Difference (COD) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] measures the
diversity between learning algorithms as the probability that the learning algorithms
make different predictions. UML clusters the learning algorithms based on their COD
scores with hierarchical agglomerative clustering. Here, we consider 20 commonly used
learning algorithms with their default hyperparameters as set in Weka [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. A cut-point
of 0.18 was chosen to create nine clusters and a representative algorithm from each
cluster was used to create G as shown in Table 1.
        </p>
        <p>Given a set G of learning algorithms, we approximate Equation 3 to the following:
1 |G|
p(yi|xi) ≈ X p(yi|xi, gj (T , λ))
|G| j=1
(4)
where p(h|T ) is approximated as |G1| and gj is the jth learning algorithm from G. As not
all learning algorithms produce probabilistic outputs, the distribution p(yi|xi, gj(T , λ))
is estimated using the Kronecker delta function in this paper.</p>
        <p>The Optimistic Filtering Approach (A-Filter) To measure the potential impact of
filtering, we need to know how removing an instance or set of instances affects the
generalization capabilities of the model. We measure this by dynamically creating an
ensemble filter from G using a greedy algorithm for a given data set and learning
algorithm. This allows us to find a specific ensemble filter that is best for filtering a given
data set and learning algorithm combination. The adaptive ensemble filter is constructed
by iteratively adding the learning algorithm g from G that produces the highest
crossvalidation classification accuracy when g is added to the ensemble filter. Because we are
using the probability that an instance will be misclassified rather than a binary yes/no
decision (Equation 4), we also use a threshold φ to determine which instances are
detrimental. Instances with a p(yi|xi) less than φ are discarded from the training set. A
constant threshold value for φ is set to filter the instances for all iterations. The baseline
accuracy for the adaptive approach is the accuracy of the learning algorithm without
filtering. The search stops once adding one of the remaining learning algorithms to the
ensemble filter does not increase accuracy, or all of the learning algorithms in G have
been used.</p>
        <p>Even though all of the detrimental instances are included for evaluation, the
adaptive filter (A-Filter) overfits the data since the cross-validation accuracy is used to
determine which set of learning algorithms to use in the ensemble filter. This allows us to
find the detrimental instances to examine the effects that they can have on an induced
model. This is not feasible in practical settings, but provides insight into the potential
improvement gained from filtering.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Filtering and HPO</title>
      <p>
        In this section, we compare the effects of filtering with those of HPO using the
optimistic and standard approaches presented in Section 4. The optimistic approach
provides an approximation of the potential of HPO and filtering. In addition to reporting the
average classification accuracy, we also report the average rank of each approach. The
average accuracy and rank for each algorithm is determined using 5 by 10-fold
crossvalidation. Statistical significance between pairs of algorithms is determined using the
Wilcoxon signed-ranks test (as suggested by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) with an alpha value of 0.05.
5.1
      </p>
      <sec id="sec-5-1">
        <title>Experimental Methodology</title>
        <p>
          For HPO, we use the version of SMAC implemented in auto-WEKA [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] as described
in Section 4.1 Auto-WEKA searches the hyperparameter spaces for the learning
algorithms in the Weka machine learning toolkit [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for a specified amount of time. To
estimate the amount of time required for a learning algorithm to induce a model of the
data, we ran our selected learning algorithms with ten random hyperparameter settings
and calculated the average and max running times. On average, a model was induced
in less than 3 minutes. The longest time required to induce a model was 845 minutes.
Based on this analysis, we run auto-WEKA for one hour for most of the data sets. An
hour long search explores more than 512 hyperparameter configurations for most of the
learning algorithm/data set combinations. The time limit is adjusted accordingly for the
larger data sets. Following [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], we run four runs with different random seeds provided
to SMAC.
        </p>
        <p>For filtering using the ensemble filter (G-filter), we use thresholds φ of 0.5, 0.7, and
0.9. Instances that are misclassified by more than φ% of the learning algorithms are
removed from the training set. The G-filter uses all of the learning algorithms in the set
G (Table 1). The accuracy on the test set from the value of φ that produces the highest
accuracy on the training set is reported.</p>
        <p>
          To show the effect of filtering detrimental instances and HPO on an induced model,
we examine filtering and HPO in six commonly used learning algorithms (MLP trained
with backpropagation, C4.5, kNN, Na¨ıve Bayes, Random Forest, and RIPPER) on a set
of 46 UCI data sets [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The LWL, NNge, and Ridor learning algorithms are not used
for analysis because they do not scale well with the larger data sets–not finishing due to
memory overflow or large amounts of running time.1
The optimistic approach indicates how well a model could generalize on novel data.
Maximizing the cross-validation accuracy is a type of overfitting. However, using
10fold cross-validation accuracy for HPO and filtering, essentially measures the
generalization capability of a learning algorithm for a given data set.
        </p>
        <p>The results comparing the potential benefits of HPO and filtering are shown in
Table 2. Each section gives the average accuracy and average rank for each learning
algorithm as well as the number of times the algorithm is greater than, equal to, or less
than a compared algorithm. HPO and the adaptive filter significantly increase the
classification accuracy for all of the investigated learning algorithms. The values in bold
represent if HPO or the adaptive filter is significantly greater than the other. For all of
the investigated learning algorithms, the A-filter significantly increases the accuracy
over HPO. The closest the two techniques come to each other is for NB, where the
Afilter achieves an accuracy of 82.74% and an average rank of 1.30 while HPO achieves
an accuracy of 80.89% and an average rank of 1.96. For all learning algorithms other
than NB, the average accuracy is about 89% for filtering and 84% for HPO. Thus,
filtering has a greater potential for increase in generalization accuracy. The difficulty lies
in how to find the optimal set of training instances.</p>
        <p>
          As might be expected, there is no set of learning algorithms that is the optimal
ensemble filter for all algorithms and/or data sets. Table 3 shows the frequency for which
a learning algorithm with default hyperparameters was selected for filtering by the
Afilter. The greatest percentage of cases an algorithm is selected for filtering for each
learning algorithm is in bold. The column “ALL” refers to the average from all of the
learning algorithms as the base learner. No instances are filtered in 5.36% of the cases.
Thus, given the right filter, filtering to some extent increases the classification accuracy
in about 95% of the cases. Furthermore, random forest, NNge, MLP, and C4.5 are the
most commonly chosen algorithms for inclusion in the ensemble filter. However, no
one learning algorithm is selected in more than 27% of the cases. The filtering
algorithm that is most appropriate is dependent on the data set and the learning algorithm.
1 For the data sets on which the learning algorithms did finish, the effects of HPO and filtering
on LWL, NNge, and Ridor are consistent with the other learning algorithms.
This coincides with the findings from [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] that the efficacy of noise filtering in the
nearest-neighbor classifier is dependent on the characteristics of the data set.
Understanding the efficacy of filtering and determining which filtering approach to use for a
given algorithm/data set is a direction of future work.
        </p>
        <p>Analysis. In some cases HPO achieves a lower accuracy than orig, showing the
complexity of HPO. The A-Filter, on the other hand, never fails to improve the accuracy.
Thus, higher quality data can compensate for hyperparameter settings and suggests that
the instance space may be less complex and/or richer than the hyperparameter space.
Of course, filtering does not outperform HPO in all cases, but it does so in the majority
of cases.
5.3</p>
      </sec>
      <sec id="sec-5-2">
        <title>Standard Approach</title>
        <p>The previous results show the potential impact of filtering and HPO. We now examine
HPO and filtering using the standard approach to highlight the need for improvement
in filtering. The results comparing the G-filter, HPO, and using the default
hyperparameters trained on the original data set are shown in Table 4. HPO significantly increases
the classification accuracy over not using HPO for all of the learning algorithms.
Filtering significantly increases the accuracy for all of the investigated algorithms except
for random forests. Comparing HPO and the G-Filter, only HPO for na¨ıve Bayes and
random forests significantly outperforms the G-filter.</p>
        <p>Analysis. In their current state, HPO and filtering generally improve the quality of
the induced model. The results justify the computational overhead required to run HPO.
Despite these results, using the default hyperparameters result in higher classification
accuracy for 11 of the 46 data sets for C4.5, 12 for kNN, and 12 for NB, highlighting
the complexity of searching over the hyperparameter space Λ. The effectiveness of
HPO is dependent on the data set as well as the learning algorithm. Typically, as was
done here, a single filtering technique is used for a set of data sets with no model of
the dependence of a learning algorithm on the training instances. The accuracies for
In this paper, we compared the potential benefits of filtering with HPO. HPO may
reduce the effects of detrimental instances on an induced model but the detrimental
instances are still considered in the learning process. Filtering, on the other hand, removes
the detrimental instances–completely eliminating their effects on the induced model.</p>
        <p>
          We used an optimistic approach to estimate the potential accuracy of each method.
Using the optimistic approach, both filtering and HPO significantly increase the
classification accuracy for all of the considered learning algorithms. However, filtering has a
greater potential effect on average, increasing the classification accuracy from 80.8%
to 89.1% on the observed data sets. HPO increases the average classification accuracy
to 84.8%. Future work includes developing models to understand the dependence of
the performance of learning algorithms given the instances used for training. To better
understand how instances affect each other, we are examining the results from machine
learning experiments stored in repositories that include which instances were used for
training and their predicted class [
          <xref ref-type="bibr" rid="ref22 ref25">25,22</xref>
          ]. We hope that the presented results provide
motivation for improving the quality of the training data.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>13</volume>
          :
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Bergstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bardenet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ke</surname>
          </string-name>
          <article-title>´gl. Algorithms for hyper-parameter optimization</article-title>
          . In J.
          <string-name>
            <surname>Shawe-Taylor</surname>
            , R. Zemel,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          K. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          , pages
          <fpage>2546</fpage>
          -
          <lpage>2554</lpage>
          . Curran Associates, Inc.,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Brodley</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Friedl</surname>
          </string-name>
          .
          <article-title>Identifying mislabeled training data</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>11</volume>
          :
          <fpage>131</fpage>
          -
          <lpage>167</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          .
          <article-title>Trading convexity for scalability</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Machine learning</source>
          , pages
          <fpage>201</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>J. Demsˇar.</surname>
          </string-name>
          <article-title>Statistical comparisons of classifiers over multiple data sets</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>7</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Elman</surname>
          </string-name>
          .
          <article-title>Learning and development in neural networks: The importance of starting small</article-title>
          .
          <source>Cognition</source>
          ,
          <volume>48</volume>
          :
          <fpage>71</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Fre</surname>
          </string-name>
          <article-title>´nay and M. Verleysen. Classification in the presence of label noise: a survey</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          ,
          <volume>25</volume>
          (
          <issue>5</issue>
          ):
          <fpage>845</fpage>
          -
          <lpage>869</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Elisseeff</surname>
          </string-name>
          .
          <article-title>An introduction to variable and feature selection</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>1157</fpage>
          -
          <lpage>1182</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          , E. Frank,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The weka data mining software: an update</article-title>
          .
          <source>SIGKDD Explorations Newsletter</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          .
          <article-title>Sequential model-based optimization for general algorithm configuration</article-title>
          .
          <source>In Proceedings of the International Learning and Intelligent Optimization Conference</source>
          , pages
          <fpage>507</fpage>
          -
          <lpage>523</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>J.</given-names>
            <surname>Kubica</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <article-title>Probabilistic noise identification and data cleaning</article-title>
          .
          <source>In Proceedings of the 3rd IEEE International Conference on Data Mining</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>138</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Bergstra</surname>
            , and
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>An empirical evaluation of deep architectures on problems with many factors of variation</article-title>
          .
          <source>In Proceedings of the 24th International Conference on Machine Learning</source>
          , pages
          <fpage>473</fpage>
          -
          <lpage>480</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Giraud-Carrier</surname>
          </string-name>
          .
          <article-title>A metric for unsupervised metalearning</article-title>
          .
          <source>Intelligent Data Analysis</source>
          ,
          <volume>15</volume>
          (
          <issue>6</issue>
          ):
          <fpage>827</fpage>
          -
          <lpage>841</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>M. Lichman.</surname>
          </string-name>
          <article-title>UCI machine learning repository</article-title>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>D. F. Nettleton</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Orriols-Puig</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornells</surname>
          </string-name>
          .
          <article-title>A study of the effect of different types of noise on the precision of supervised learning techniques</article-title>
          .
          <source>Artificial Intelligence Review</source>
          ,
          <volume>33</volume>
          (
          <issue>4</issue>
          ):
          <fpage>275</fpage>
          -
          <lpage>306</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Peterson</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Martinez</surname>
          </string-name>
          .
          <article-title>Estimating the potential for combining learning models</article-title>
          .
          <source>In Proceedings of the ICML Workshop on Meta-Learning</source>
          , pages
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. U. Rebbapragada and
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Brodley</surname>
          </string-name>
          .
          <article-title>Class noise mitigation through instance weighting</article-title>
          .
          <source>In Proceedings of the 18th European Conference on Machine Learning</source>
          , pages
          <fpage>708</fpage>
          -
          <lpage>715</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Sa</surname>
          </string-name>
          <article-title>´ez</article-title>
          , J. Luengo, and
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          .
          <article-title>Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <volume>46</volume>
          (
          <issue>1</issue>
          ):
          <fpage>355</fpage>
          -
          <lpage>364</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>M. Schonlau</surname>
            ,
            <given-names>W. J.</given-names>
          </string-name>
          <string-name>
            <surname>Welch</surname>
            , and
            <given-names>D. R.</given-names>
          </string-name>
          <string-name>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Global versus local search in constrained optimization of computer models</article-title>
          , volume Volume
          <volume>34</volume>
          of Lecture Notes-Monograph Series, pages
          <fpage>11</fpage>
          -
          <lpage>25</lpage>
          . Institute of Mathematical Statistics, Hayward, CA,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Martinez</surname>
          </string-name>
          .
          <article-title>Improving classification accuracy by identifying and removing instances that should be misclassified</article-title>
          .
          <source>In Proceedings of the IEEE International Joint Conference on Neural Networks</source>
          , pages
          <fpage>2690</fpage>
          -
          <lpage>2697</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>M. R. Smith</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Martinez</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Giraud-Carrier</surname>
          </string-name>
          .
          <article-title>An instance level analysis of data complexity</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>95</volume>
          (
          <issue>2</issue>
          ):
          <fpage>225</fpage>
          -
          <lpage>256</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>M. R. Smith</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>White</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Giraud-Carrier</surname>
            , and
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Martinez</surname>
          </string-name>
          .
          <article-title>An easy to use repository for comparing and improving machine learning algorithm usage</article-title>
          .
          <source>In Proceedings of the 2014 International Workshop on Meta-learning and Algorithm Selection (MetaSel)</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>J. Snoek</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Larochelle</surname>
            , and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Adams</surname>
          </string-name>
          .
          <article-title>Practical bayesian optimization of machine learning algorithms</article-title>
          . In F. Pereira,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          K. Weinberger, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          , pages
          <fpage>2951</fpage>
          -
          <lpage>2959</lpage>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>C. Thornton</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>H. H.</given-names>
          </string-name>
          <string-name>
            <surname>Hoos</surname>
            , and
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Leyton-Brown</surname>
          </string-name>
          .
          <article-title>Auto-weka: combined selection and hyperparameter optimization of classification algorithms</article-title>
          .
          <source>In proceedings of the 19th International Conference on Knowledge Discovery and Data Mining</source>
          , pages
          <fpage>847</fpage>
          -
          <lpage>855</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>J. Vanschoren</surname>
            ,
            <given-names>J. N. van Rijn</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bischl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Torgo</surname>
          </string-name>
          . Openml:
          <article-title>Networked science in machine learning</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>15</volume>
          (
          <issue>2</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Wolpert</surname>
          </string-name>
          .
          <article-title>The lack of a priori distinctions between learning algorithms</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>8</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1341</fpage>
          -
          <lpage>1390</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>