<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recommending Learning Algorithms and Their Associated Hyperparameters</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael R. Smith</string-name>
          <email>msmith@axon.cs.byu.edu</email>
        </contrib>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>4</lpage>
      <abstract>
        <p>The success of machine learning on a given task depends on, among other things, which learning algorithm is selected and its associated hyperparameters. Selecting an appropriate learning algorithm and setting its hyperparameters for a given data set can be a challenging task, especially for users who are not experts in machine learning. Previous work has examined using meta-features to predict which learning algorithm and hyperparameters should be used. However, choosing a set of meta-features that are predictive of algorithm performance is difficult. Here, we propose to apply collaborative filtering techniques to learning algorithm and hyperparameter selection, and find that doing so avoids determining which meta-features to use and outperforms traditional meta-learning approaches in many cases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Most previous meta-learning work has focused on selecting a
learning algorithm or a set of hyperparameters based on meta-features
used to characterize datasets [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ]. As such, it can be viewed as a
form of content-based filtering, a technique commonly used in
recommender systems that captures a set of measured characteristics
of an item and/or user to recommend items with similar
characteristics. On the other hand, collaborative filtering (CF), also used by
some recommender systems, predicts the rating or preference that a
user would give to an item, based on the past behavior of a set of
users, characterized by ratings assigned by users to a set of items [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ].
The underlying assumption of CF is that if users A and B agree on
some issues, then user A is more likely to share user B’s opinion
on a new issue X than that of a randomly chosen user. A
key advantage of CF is that it does not rely on directly measurable
characteristics of the items. Thus, it is capable of modeling complex
items without actually understanding the items themselves.
      </p>
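      <p>
        To make the analogy concrete, the following sketch (our own
        illustration, with hypothetical accuracy values) casts meta-learning as
        user-based CF: rows of the matrix are data sets (the “users”), columns
        are algorithm/hyperparameter configurations (the “items”), and a
        missing entry is predicted from data sets whose observed accuracies
        correlate with those of the target data set.
      </p>
      <preformat><![CDATA[
import numpy as np

# Hypothetical accuracy matrix: rows are data sets ("users"), columns are
# algorithm/hyperparameter configurations ("items"); np.nan marks
# experiments that have not been run.
R = np.array([
    [0.81, 0.79, np.nan, 0.90],
    [0.80, np.nan, 0.85, 0.91],
    [0.62, 0.60, 0.66, np.nan],
])

def predict(R, d, c):
    """Predict the accuracy of configuration c on data set d as a
    similarity-weighted average over data sets that have already run c."""
    target = R[d]
    scores, weights = [], []
    for other in range(R.shape[0]):
        if other == d or np.isnan(R[other, c]):
            continue
        # Correlate over the configurations both data sets have observed.
        shared = ~np.isnan(target) & ~np.isnan(R[other])
        if shared.sum() < 2:
            continue
        sim = np.corrcoef(target[shared], R[other][shared])[0, 1]
        if np.isnan(sim):
            continue
        scores.append(sim * R[other, c])
        weights.append(abs(sim))
    if not weights:
        return np.nanmean(R[:, c])  # fall back to the column mean
    return sum(scores) / sum(weights)

print(predict(R, d=0, c=2))  # estimate for configuration 2 on data set 0
]]></preformat>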
      <p>
        Here, we propose meta-CF (MCF), a novel approach to
meta-learning that applies CF in the context of algorithm and/or
hyperparameter selection. MCF differs from most previous meta-learning
techniques in that it does not rely on meta-features. Instead, MCF
considers the similarity of the performance of the learning algorithms
with their associated hyperparameter settings from previous
experiments. In this sense, the approach is more similar to landmarking [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ]
and active testing [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ] since both also use the performance results
from previous experiments to determine similarity among data sets.
      </p>
      <p>
        While algorithm selection and hyperparameter optimization have
been mostly studied in isolation (e.g., see [
        <xref ref-type="bibr" rid="ref13 ref16 ref2 ref3 ref4 ref5">12, 4, 1, 2, 3, 15</xref>
        ]), recent
work has begun to consider them in tandem. For example,
Auto-WEKA simultaneously chooses a learning algorithm and sets its
hyperparameters using Bayesian optimization over a tree-structured
representation of the combined space of learning algorithms and their
hyperparameters [
        <xref ref-type="bibr" rid="ref17">16</xref>
        ]. All of these approaches face the difficult
challenge of determining a set of meta-features that capture relevant and
predictive characteristics of datasets. By contrast, MCF not only
considers both algorithm selection and hyperparameter setting at once,
but also alleviates the problem of meta-feature selection by leveraging
information from previous experiments through collaborative filtering.
      </p>
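      <p>
        As a rough illustration of searching the combined space, the sketch
        below draws a learning algorithm together with its hyperparameters in
        a single step and keeps the best draw by cross-validated accuracy. It
        uses plain random search and scikit-learn classifiers as stand-ins;
        Auto-WEKA itself searches the space of Weka algorithms with Bayesian
        optimization.
      </p>
      <preformat><![CDATA[
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

def sample_config(rng):
    """Draw one point from the combined space: a learning algorithm
    together with its hyperparameters."""
    algo = rng.choice(["rf", "knn", "tree"])
    if algo == "rf":
        return RandomForestClassifier(n_estimators=int(rng.integers(10, 200)))
    if algo == "knn":
        return KNeighborsClassifier(n_neighbors=int(rng.integers(1, 15)))
    return DecisionTreeClassifier(max_depth=int(rng.integers(1, 20)))

# Evaluate 16 random draws and keep the best configuration.
best = max((sample_config(rng) for _ in range(16)),
           key=lambda m: cross_val_score(m, X, y, cv=5).mean())
print(type(best).__name__, best.get_params())
]]></preformat>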
      <p>Our results suggest that using MCF for learning
algorithm/hyperparameter setting recommendation is a promising
direction. Using MCF for algorithm recommendation has some
differences from the traditional CF used for human ratings. For example,
CF for humans may have to deal with concept drift, where a user’s
taste may change over time, whereas the performance of learning algorithms
and hyperparameter settings is deterministic.</p>
    </sec>
    <sec id="sec-2">
      <title>Empirical Evaluation</title>
      <p>
        For MCF, we examine several CF techniques implemented in the
Waffles toolkit [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ]: baseline (predict the mean of the previously seen
results), Fuzzy K-Means (FKM) [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ], Matrix Factorization (MF) [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ],
Nonlinear PCA (NLPCA) [
        <xref ref-type="bibr" rid="ref14">13</xref>
        ], and Unsupervised Backpropagation
(UBP) [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ].
      </p>
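      <p>
        Matrix factorization is representative of how these techniques impute
        missing results: the sparse matrix of observed accuracies is
        approximated by a low-rank product, whose entries then serve as
        predictions for the unrun experiments. The following is a minimal SGD
        sketch of that idea, not the Waffles implementation.
      </p>
      <preformat><![CDATA[
import numpy as np

def mf_impute(R, rank=2, steps=20000, lr=0.01, reg=0.02, seed=0):
    """Factor R ~= U @ V.T from the observed entries only (np.nan =
    missing) and return R with its missing entries filled in."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    observed = [(i, j) for i in range(n) for j in range(m)
                if not np.isnan(R[i, j])]
    for _ in range(steps):
        i, j = observed[rng.integers(len(observed))]
        err = R[i, j] - U[i] @ V[j]
        # Simultaneous SGD update with L2 regularization.
        U[i], V[j] = (U[i] + lr * (err * V[j] - reg * U[i]),
                      V[j] + lr * (err * U[i] - reg * V[j]))
    return np.where(np.isnan(R), U @ V.T, R)
]]></preformat>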
      <p>
        To establish a baseline, we first calculate the accuracy of 9 diverse
learning algorithms on a set of 125 data sets (see [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ] for a
discussion on diversity) with default parameters as set in Weka [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ].
The set of learning algorithms is composed of backpropagation (BP),
C4.5, kNN, locally weighted learning (LWL), naïve Bayes (NB),
nearest neighbor with generalization (NNge), random forest (RF), ridor
(Rid), and RIPPER (RIP). We select the accuracy from the learning
algorithm that produces the highest classification accuracy. This
represents algorithm selection with perfect recall. We also estimate the
hyperparameter optimized accuracies for each learning algorithm
using random hyperparameter optimization [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ]. The results are shown
in Table 1, where the accuracy from each learning algorithm is the
average hyperparameter optimized accuracy for each data set,
“Default” refers to the best accuracy from the learning algorithm with its
default parameters, “ALL” refers to the accuracy from the best
learning algorithm and hyperparameter setting, and “AW” refers to the
results from running Auto-WEKA. For Auto-WEKA, each dataset
was allowed to run as long as the longest algorithm took to run on
the dataset when doing the random hyperparameter optimization. As
Auto-WEKA is a randomized algorithm, we performed 4 runs, each with a
different seed, and chose the run with the highest accuracy. This can be
seen as equivalent to allowing a user to run on average 16 learning
algorithm and hyperparameter combinations on a data set.
      </p>
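      <p>
        For reference, random hyperparameter optimization in the sense of [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ] amounts to sampling hyperparameter settings at random and keeping
        the setting with the best cross-validated accuracy. A minimal example
        using scikit-learn's RandomizedSearchCV, with a random forest standing
        in for the Weka learners:
      </p>
      <preformat><![CDATA[
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Draw 20 random hyperparameter settings; keep the best by 5-fold CV.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 200),
        "max_depth": randint(1, 20),
        "max_features": ["sqrt", "log2", None],
    },
    n_iter=20, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_score_, search.best_params_)
]]></preformat>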
      <p>For MCF, we compiled the results from hyperparameter
optimization. We randomly removed 10% to 90% of the results by increments
of 10% and then used MCF to fill in the missing values. The top
4 learning algorithm/hyperparameter configurations are returned by
the CF technique, and the accuracy of the best-performing of these
configurations is used. This process was repeated
10 times. A summary of the average results for MCF is provided in
Table 2. The columns “Best”, “Median”, and “Average” refer to the
accuracies averaged across all of the sparsity levels for the
hyperparameter setting for the CF technique that provided the results. The
columns 0.1 to 0.9 refer to the fraction of the results used for CF,
averaged over the hyperparameter settings. The row “Content” refers
to meta-learning where a learning algorithm recommends a learning
algorithm based on a set of meta-features.
      </p>
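      <p>
        The evaluation protocol is straightforward to state in code. The
        sketch below (a simplified rendering with hypothetical inputs) hides a
        fraction of a complete results matrix, imputes the hidden entries with
        a CF technique such as the factorization sketch above, and scores each
        data set by the true accuracy of the best of the top-4 recommended
        configurations.
      </p>
      <preformat><![CDATA[
import numpy as np

def evaluate_mcf(R_true, impute, frac_missing=0.5, top_k=4, seed=0):
    """Hide frac_missing of the results, fill them back in with a CF
    imputer, and for each data set report the true accuracy of the best
    of the top_k configurations the filter recommends."""
    rng = np.random.default_rng(seed)
    R = R_true.copy()
    R[rng.random(R.shape) < frac_missing] = np.nan  # unrun experiments
    R_hat = impute(R)                               # e.g. mf_impute above
    picks = np.argsort(-R_hat, axis=1)[:, :top_k]   # top-k per data set
    return np.array([R_true[d, picks[d]].max() for d in range(R.shape[0])])

# e.g. evaluate_mcf(R_true, mf_impute, frac_missing=0.3).mean()
]]></preformat>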
      <p>
        Overall, MF achieves the highest accuracy values. The
effectiveness of MCF increases as the percentage of observed results increases.
MCF significantly increases classification accuracy compared with
hyperparameter optimization for a single learning algorithm, with model
selection among the learning algorithms using their default parameters,
and with using meta-features to predict which learning algorithm and
hyperparameters to use. On average, MCF and Auto-WEKA achieve
similar accuracy, which highlights the importance of considering both
algorithm selection and hyperparameter optimization. However,
provided one has access to a database of experiments, such as the
ExperimentDB [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ], MCF only requires the time to run a number of
algorithms (often run in parallel) and to retrain the collaborative
filter. In the current implementation, retraining takes less than 10
seconds. Thus, MCF presents an efficient method for recommending a
learning algorithm and its associated hyperparameters.
      </p>
      <p>While our results show that MCF is a viable technique for
recommending learning algorithms and hyperparameters, some work
remains to be done. Future work for MCF includes addressing the
cold-start problem, which occurs when a new data set is presented and no
learning algorithm has been run on it. MCF is adept at exploiting the
space that has already been explored, but (like active testing) it does
not explore unknown spaces at all. One way to overcome this
limitation would be to use a hybrid recommendation system that combines
content-based filtering and MCF.</p>
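      <p>
        One simple form such a hybrid could take, sketched below under our own
        assumptions rather than drawn from the literature, is to blend the two
        predictors, shifting weight from the content-based scores toward the
        CF scores as experiments accumulate on the new data set.
      </p>
      <preformat><![CDATA[
import numpy as np

def hybrid_scores(cf_scores, content_scores, n_observed, saturation=20):
    """Blend CF and content-based predictions for one data set. With no
    completed experiments (cold start) the content-based scores dominate;
    as results accumulate, weight shifts to the CF scores. The saturation
    constant is an arbitrary choice for illustration."""
    alpha = min(1.0, n_observed / saturation)
    return (alpha * np.asarray(cf_scores)
            + (1 - alpha) * np.asarray(content_scores))
]]></preformat>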
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Content 81.35 80.47 78.91 0</source>
          .
          <fpage>1</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.A.</given-names>
            <surname>Smith</surname>
          </string-name>
          , '
          <article-title>On Learning Algorithm Selection for Classification'</article-title>
          ,
          <source>Applied Soft Computing</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>119</fpage>
          -
          <lpage>138</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.A.</given-names>
            <surname>Smith-Miles</surname>
          </string-name>
          ,
          <article-title>'A Meta-learning Approach to Automatic Kernel Selection for Support Vector Machines'</article-title>
          ,
          <source>Neurocomputing</source>
          ,
          <volume>70</volume>
          ,
          <fpage>173</fpage>
          -
          <lpage>186</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , '
          <article-title>Random search for hyper-parameter optimization'</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>13</volume>
          ,
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Brazdil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pinto Da Costa</surname>
          </string-name>
          , '
          <article-title>Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results'</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>50</volume>
          (
          <issue>3</issue>
          ),
          <fpage>251</fpage>
          -
          <lpage>277</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brazdil</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Giraud-Carrier</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Soares</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vilalta</surname>
          </string-name>
          ,
          <source>'Metalearning: Applications to Data Mining'</source>
          , Springer, (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Gashler</surname>
          </string-name>
          , '
          <article-title>Waffles: A machine learning toolkit'</article-title>
          ,
          <source>Journal of Machine Learning Research, MLOSS</source>
          <volume>12</volume>
          ,
          <fpage>2383</fpage>
          -
          <lpage>2387</lpage>
          , (
          <year>July 2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Gashler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Martinez</surname>
          </string-name>
          , '
          <article-title>Missing value imputation with unsupervised backpropagation'</article-title>
          ,
          <source>Computational Intelligence</source>
          , Accepted, (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , '
          <article-title>The weka data mining software: an update'</article-title>
          ,
          <source>SIGKDD Explorations Newsletter</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          , '
          <article-title>Matrix factorization techniques for recommender systems'</article-title>
          ,
          <source>Computer</source>
          ,
          <volume>42</volume>
          (
          <issue>8</issue>
          ),
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brazdil</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          , '
          <article-title>Selecting classification algorithms with active testing'</article-title>
          , in
          <source>Machine Learning and Data Mining in Pattern Recognition</source>
          , ed., Petra Perner, volume
          <volume>7376</volume>
          of Lecture Notes in Computer Science,
          <fpage>117</fpage>
          -
          <lpage>131</lpage>
          , Springer Berlin / Heidelberg, (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deogun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Spaulding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Shuart</surname>
          </string-name>
          , '
          <article-title>Towards missing data imputation: A study of fuzzy k-means clustering method'</article-title>
          ,
          <source>in Rough Sets and Current Trends in Computing</source>
          , volume
          <volume>3066</volume>
          of Lecture Notes in Computer Science,
          <fpage>573</fpage>
          -
          <lpage>579</lpage>
          , Springer Berlin / Heidelberg, (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bensusan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Giraud-Carrier</surname>
          </string-name>
          , '
          <article-title>Meta-learning by landmarking various learning algorithms'</article-title>
          ,
          <source>in Proceedings of the 17th International Conference on Machine Learning</source>
          , pp.
          <fpage>743</fpage>
          -
          <lpage>750</lpage>
          , San Francisco, CA, USA, (
          <year>2000</year>
          ). Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Scholz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Guy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kopka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Selbig</surname>
          </string-name>
          , '
          <article-title>Non-linear pca: a missing data approach'</article-title>
          ,
          <source>Bioinformatics</source>
          ,
          <volume>21</volume>
          (
          <issue>20</issue>
          ),
          <fpage>3887</fpage>
          -
          <lpage>3895</lpage>
          , (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Martinez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Giraud-Carrier</surname>
          </string-name>
          , '
          <article-title>An instance level analysis of data complexity'</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>95</volume>
          (
          <issue>2</issue>
          ),
          <fpage>225</fpage>
          -
          <lpage>256</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Adams</surname>
          </string-name>
          , '
          <article-title>Practical bayesian optimization of machine learning algorithms'</article-title>
          ,
          <source>in Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          , eds.,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.J.C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <fpage>2951</fpage>
          -
          <lpage>2959</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Thornton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          , '
          <article-title>Auto-weka: combined selection and hyperparameter optimization of classification algorithms'</article-title>
          ,
          <source>in proceedings of the 19th International Conference on Knowledge Discovery and Data Mining</source>
          , pp.
          <fpage>847</fpage>
          -
          <lpage>855</lpage>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Blockeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          , '
          <article-title>Experiment databases - a new way to share, organize and learn from experiments'</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>87</volume>
          (
          <issue>2</issue>
          ),
          <fpage>127</fpage>
          -
          <lpage>158</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>