An Easy to Use Repository for Comparing and Improving Machine Learning Algorithm Usage

Michael R. Smith, Andrew White, Christophe Giraud-Carrier, and Tony Martinez
Brigham Young University, USA
emails: msmith@axon.cs.byu.edu, andrewkvavlewhite@gmail.com, cgc@cs.byu.edu, martinez@cs.byu.edu

Abstract. The results from most machine learning experiments are used for a specific purpose and then discarded. This causes significant loss of information and requires rerunning experiments to compare learning algorithms. Often, this also requires a researcher or practitioner to implement another algorithm for comparison, which may not always be correctly implemented. By storing the results from previous experiments, machine learning algorithms can be compared easily and the knowledge gained from them can be used to improve the performance of future machine learning experiments. The purpose of this work is to provide easy access to previous experimental results for learning and comparison. These stored results are comprehensive, storing the prediction for each test instance as well as the learning algorithm, hyperparameters, and training set that were used in the experiment. Previous experimental results are particularly important for meta-learning, which, in a broad sense, is the process of learning from previous machine learning results such that the learning process is improved. While other experiment databases do exist, one of our focuses is on easy access to the data, eliminating any learning curve required to acquire the desired information. We provide meta-learning data sets that are ready to be downloaded for meta-learning experiments. Easy access to previous experimental results aids other researchers looking to do meta-learning and helps in comparing meta-learning algorithms. In addition, simple queries to the underlying database can be made if specific information is desired. We also differ from previous experiment databases in that our database is designed at the instance level, where an instance is an example in a data set. We store the predictions of a learning algorithm trained on a specific training set for each instance in the test set. Data set level information can then be obtained by aggregating the results from the instances. The instance level information can be used for many tasks, such as determining the diversity of a classifier or algorithmically determining the optimal subset of training instances for a learning algorithm.
1 Introduction

The quality of an induced model is dependent on, among other aspects, the learning algorithm that is chosen, the hyperparameter settings for the chosen learning algorithm, and the quality of the training set. Choosing a learning algorithm for a given task, setting its hyperparameters, and selecting which instances to train on, however, is non-trivial. Meta-learning deals with the problem of how to select a learning algorithm and set its hyperparameters based on previous experience (results from previous machine learning experiments). Although some research from the machine learning community has focused on meta-learning (e.g., see [17, 5, 2, 3, 8]), much of the focus of machine learning research has been on developing more learning algorithms and/or applying machine learning in specific domains.

Part of the difficulty of meta-learning is due to the lack of accessible results. As meta-learning requires running several learning algorithms and hyperparameter settings over many data sets, gathering results requires large amounts of computational resources. In addition to the computational requirements, results from the learning algorithms may differ due to slight differences in their implementations. Thus, comparing results among meta-learning studies becomes difficult.

To aid in further research in meta-learning, we have developed the machine learning results repository (MLRR), which provides data sets ready for download for meta-learning problems, akin to the UCI data repository for machine learning problems. We refer to the data sets for meta-learning as meta-data sets to distinguish them from the data sets that are used in the machine learning experiments. The meta-data sets provide a snapshot of an underlying database that stores the results of machine learning experiments. Users can update the database with new results from machine learning experiments and then update the meta-data sets for meta-learning. A revision history is kept so that comparisons among meta-learning algorithms are facilitated. As a starting point, meta-data sets are provided by MLRR for typical meta-learning tasks, such as, given a set of meta-features, predicting which learning algorithm and/or hyperparameter setting to use.

The MLRR stores instance level meta-features and the predictions made on each instance by the learning algorithms. Providing information at the instance level allows studies to be performed on the instances themselves. Studying the effects of machine learning on a single instance and/or the effects of a single instance on the performance of an algorithm has generally been overlooked. Instance-level information is important in several areas of machine learning, however. In ensembles, computing the classifier diversity of the ensembled classifiers using the predictions for each instance is important in determining the effectiveness of the ensembling technique [12, 6, 1]. In curriculum learning, the training set is incrementally augmented such that "easier" instances are presented to the learning algorithm first, thus creating a need to understand and identify the easier instances [4]. Smith et al. used instance-level predictions to identify and characterize instances that are likely to be misclassified [23] and used this information to create a curriculum [22]. Other work has also used the instance-level predictions for meta-learning. The classifier output difference (COD) measures the distance between two learning algorithms as the probability that the learning algorithms make different predictions on test instances [16]. Unsupervised meta-learning (UML) clusters learning algorithms based on their COD scores (rather than accuracy) to examine the behavior of the learning algorithms [13]. Meta-learning for algorithm selection can then be done over the clusters rather than a larger set of learning algorithms to recommend a cluster of learning algorithms that all behave similarly [14]. Additionally, several techniques treat instances individually during the training process, such as filtering instances from the training set based on their instance-level meta-features [21] or weighting the instances [18].
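Because the MLRR stores the prediction that each algorithm makes on each test instance, pairwise measures such as COD can be computed directly from the stored prediction columns. The following is a minimal sketch of that computation, assuming the predictions have already been read out of an instance-level meta-data set (Section 2.2); the function and variable names are ours and are not part of MLRR.

```python
from typing import Sequence

def classifier_output_difference(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Estimate COD [16] as the fraction of test instances on which two
    learning algorithms make different predictions."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction columns must cover the same test instances")
    disagreements = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return disagreements / len(preds_a)

# Two hypothetical prediction columns (e.g., the "BP 1/1" and "C4.5 1/1" columns).
print(classifier_output_difference([3, 2, 2, 1], [3, 1, 2, 2]))  # 0.5
```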
Other attempts have been made at creating a repository for machine learning experiments from which learning can be conducted [20, 24]. However, we feel that they lack simplicity and/or extensibility. In addition to providing instance-level information, we hope to bridge this gap with the MLRR. Probably the most prominent and well-developed data repository is ExpDB, an experiment database that provides a framework for reporting experimental results and their associated workflow [24]. The purpose of ExpDB is to comprehensively store the workflow process of all experiments for reproducibility. One of the results of storing the experiments is that the results can be used for meta-learning. Unfortunately, there is a relatively steep learning curve to access the data due to the inherent complexity involved in storing all of the details about exact reproducibility. Because of this complexity and formality, it is difficult to directly access the information that would be most beneficial for meta-learning, which may deter some potential users. Additionally, ExpDB does not currently support storage and manipulation of any instance level features.

We acknowledge that maintaining a database of previous experiments is not a trivial problem. We do, however, add our voice to support the importance of maintaining a repository of machine learning results and offer an effective solution for storing results from previous experiments. Our primary goal is to maintain simplicity and provide easily accessible data for meta-learning to 1) help promote more research in meta-learning, 2) provide a standard set of data sets for meta-learning algorithm comparison, and 3) continue to stimulate research at the instance level.

We next describe our approach for providing a repository for machine learning meta-data that emphasizes ease of access to the meta-data. MLRR currently has the results from 72 data sets, 9 learning algorithms, and 10 hyperparameter settings for each learning algorithm. The database description is provided in Section 3. How to add new experimental results to the database is detailed in Section 4. We then give a more detailed description of the data set level and instance level meta-features that are used in the MLRR in Section 5. Conclusions and directions for future work are provided in Section 6.

2 Meta-data Set Descriptions

The purpose of the machine learning results repository (MLRR) is to provide easy access to the results of previous machine learning experiments for meta-learning at the data set and instance levels. This, in turn, would allow other researchers interested in meta-learning and in better understanding machine learning algorithms direct access to prior results without having to re-run all of the algorithms or learn how to navigate a more complex experiment database. The quality of an induced model for a task is dependent on at least three things:

1. the learning algorithm chosen to induce the model,
2. the hyperparameter settings for the chosen learning algorithm, and
3. the instances used for training.

When we refer to an experiment, we mean the results from training a learning algorithm l with hyperparameter settings λ on a training set t. We first describe how we manage experiment information, and then describe the provided meta-data sets.
2.1 Experiment Information

The information about each experiment is provided in three tables in MLRR. Which learning algorithm and hyperparameters were used is provided in a file structured as shown in Table 1. It provides the toolkit, including the version number that was run, the learning algorithm, and the hyperparameters that were used. This allows for multiple learning algorithms, hyperparameters, and toolkits to be compared. In the examples in Table 1, the class names from the Weka machine learning toolkit [9] and the Waffles machine learning toolkit [7] are shown. LA seed corresponds to the learning algorithm that was used (LA) and to a seed that represents which hyperparameter setting was used (seed). The LA seed will be used in other tables as a foreign key to map back to this table. A seed of -1 represents the default hyperparameter settings, as many studies examine the default behavior as given in a toolkit and the default parameters are commonly used in practice.

Table 1. The structure of the meta-data set that describes the hyperparameter settings for the learning algorithms stored in the database.

LA    S  Toolkit  Version   Hyperparameters
BP    1  weka     3.6.11    weka.classifiers.functions.MultilayerPerceptron -- -L 0.261703 -M 0.161703 -H 12 -D
BP    2  weka     3.6.11    weka.classifiers.functions.MultilayerPerceptron -- -L 0.25807 -M 0.15807 -H 4
BP    3  waffles  13-12-09  neuralnet -addlayer 8 -learningrate 0.1 -momentum 0 -windowsepochs 50
...
C4.5  1  weka     3.6.11    weka.classifiers.trees.J48 -- -C 0.443973 -M 1
...

As the parameter values differ between toolkits, there is a mapping provided to distinguish hyperparameter settings. For example, Weka uses the "-L" parameter to set the learning rate in backpropagation while the Waffles toolkit uses "-learningrate". Also, some toolkits have hyperparameters that other implementations of the same learning algorithm do not include. In such cases, an unknown value ("?") will be provided in the meta-data set. This mapping is shown in Table 2 for the backpropagation learning algorithm. The first row contains the values used by MLRR. The following rows contain the command-line parameter supplied to a specific toolkit to set that hyperparameter.

Table 2. The structure of the table for mapping learning algorithm hyperparameters between different toolkits for the backpropagation learning algorithm.

toolkit  LR             Mo          HN         DC  WE
weka     -L             -M          -H         -D  ?
waffles  -learningrate  -momentum   -addlayer  ?   -windowsepochs
...

A mapping of which instances are used for training is also provided in a separate file. The structure of this table is shown in Table 3. Each row represents an experiment as toolkit seed numFolds fold. The toolkit represents which toolkit was used, the seed represents the random seed that was provided to the toolkit, numFolds represents how many folds were run, and fold represents in which fold an instance was included for testing. The values in the following columns represent whether an instance was used for training or testing; there is one column for each instance in the data set. They are stored as real values, which allows for the situations when training instances have associated weights. In the file, an unknown value of "?" represents a testing instance; otherwise, a real value represents a training instance. A value of 0 represents a filtered instance, a value of 1 represents an unweighted training instance, and any value between 0 and 1 represents the weight for that training instance. In the cases where there are specific training and testing sets, the row is labeled as toolkit 0 0 1 and the information for the training set can be entered as before. A random test/training split of the data is represented as toolkit seed percentSplit 1, where "percentSplit" represents the percentage of the data set that was used for testing as generated by the toolkit.

Table 3. The structure of the meta-data set that indicates which instances were used for training given a random seed.

toolkit  seed  # folds  fold    1     2  3  ...
weka     1     10       1       1     1  1  ...
weka     1     10       2       1     0  1  ...
...
weka     1     10       10      0.74  1  ?  ...
weka     2     10       1       ?     1  1  ...
...
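The encoding in Table 3 packs test membership, filtering, and instance weights into a single value per instance. Below is a small sketch of how a consumer of the file might interpret one row; the helper name and example row are ours, not code shipped with MLRR.

```python
def interpret_membership(value: str) -> tuple[str, float]:
    """Interpret one instance column of Table 3: '?' marks a test instance,
    0 a filtered instance, 1 an unweighted training instance, and a value
    strictly between 0 and 1 a weighted training instance."""
    if value == "?":
        return ("test", 0.0)
    weight = float(value)
    if weight == 0.0:
        return ("filtered", 0.0)
    return ("train", weight)

# A hypothetical row for: toolkit=weka, seed=1, numFolds=10, fold=2.
row = ["1", "0", "?", "0.35", "1"]
print([interpret_membership(v) for v in row])
# [('train', 1.0), ('filtered', 0.0), ('test', 0.0), ('train', 0.35), ('train', 1.0)]
```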
2.2 Meta-data sets

One of the features of MLRR is its focus on storing and presenting instance level information, namely, instance level characteristics and associated predictions from previous experiments. Indeed, the MLRR is designed intentionally from the instance level perspective, from which data set level information can be computed (e.g., accuracy or precision).

As one of the purposes of the MLRR is ease of access, the MLRR stores several data sets in attribute-relation file format (ARFF), which is supported by many machine learning toolkits. In essence, ARFF is a comma or space separated file with attribute information and possible comments. The precomputed meta-data sets include instance level meta-data sets and data set level meta-data sets.

At the instance level, MLRR provides for each data set a meta-data set that stores the instance level meta-features and the prediction from each experiment. This allows for analyses to be done exploring the effects of hyperparameters and learning algorithms at the instance level, which is currently mostly overlooked. For each data set, a meta-data set is provided that gives the values for the instance level meta-features, the actual class value (stored as a numeric value), and the predicted class value for each experiment. The training set and learning algorithm/hyperparameter information is stored in the column heading as "LA seed/hyperparameter", where LA is a learning algorithm and hyperparameter is the hyperparameter setting for the learning algorithm. Together, they map to the entries in Table 1. The seed represents the seed that was used to partition the data (see Table 3). The structure of the instance level meta-data set is shown in Table 4. In the given example, instance 77 is shown. The "inst meta" section provides the instance level meta-features for that instance. The actual class label is 2. The predictions from the experiments on this data set are provided in the following columns (i.e., experiment BP 1/1 predicted class 3, BP N/1 predicted class 2, etc.).

Table 4. The structure of the meta-data set at the instance level.

      inst meta                     predictions
#    kAN   MV  ...    act    BP 1/1  ...  BP N/1  ...  BP N/M  C4.5 1/1  ...
77   0.92  0   ...    2      3       ...  2       ...  2       3         ...
...
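Because each prediction column of Table 4 covers every instance in the data set, data set level statistics can be recovered by simple aggregation, as noted above. A minimal sketch of this aggregation (the names and values are illustrative, not the MLRR schema):

```python
def aggregate_accuracy(actual: list[int], predictions: dict[str, list[int]]) -> dict[str, float]:
    """Aggregate instance-level predictions (the experiment columns of
    Table 4) into data set level accuracy for each experiment."""
    return {
        experiment: sum(1 for a, p in zip(actual, preds) if a == p) / len(actual)
        for experiment, preds in predictions.items()
    }

# A hypothetical four-instance slice of an instance-level meta-data set.
actual = [2, 1, 3, 2]
predictions = {"BP 1/1": [3, 1, 3, 2], "C4.5 1/1": [2, 1, 3, 1]}
print(aggregate_accuracy(actual, predictions))  # {'BP 1/1': 0.75, 'C4.5 1/1': 0.75}
```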
At the data set level, several meta-data sets are provided:

• a general meta-data set that stores the data set meta-features and the average N by 10-fold cross-validation accuracy for all of the data sets from a learning algorithm with a given hyperparameter setting, and
• for each learning algorithm, a meta-data set that stores the data set meta-features, the learning algorithm hyperparameter settings, and the average N by 10-fold cross-validation accuracy for all of the data sets for the given hyperparameter setting.

The structure for the general meta-data set is provided in Table 5. The structure and information of this meta-data set is typical of that used in previous meta-learning studies, providing a mapping from data set meta-features to accuracies obtained by a set of learning algorithms. Most previous studies have been limited to only using the default hyperparameters, however. The MLRR includes the accuracies from multiple hyperparameter settings. The hyperparameter settings from each learning algorithm are denoted by "LA #", where LA refers to a learning algorithm and # refers to which hyperparameter setting was used for that learning algorithm.

Table 5. The structure of the meta-data set at the data set level.

           data set meta-features        LA accuracies
data set   numInst  numAttr  ...    BP 1   BP 2   ...  BP N   C4.5 1  ...
iris       150      4        ...    96.80  95.07  ...  93.47  95.60   ...
abalone    4177     8        ...    20.27  29.84  ...  21.91  23.24   ...
...

The meta-data sets for each learning algorithm are designed to aid in algorithmic hyperparameter estimation, i.e., given a data set, can we predict which hyperparameter setting will give the highest classification accuracy. For each learning algorithm, a meta-data set is provided that contains the data set meta-features, the toolkit that was used, the hyperparameter setting, and the average accuracy for each unique toolkit/hyperparameter combination. The structure of the meta-data set for each learning algorithm is provided in Table 6. The accuracy ("acc") represents the average accuracy for all k-fold validation runs (i.e., multiple runs of the same learning algorithm with different random seeds to partition the folds). The toolkit is also provided to allow a user to compare toolkits or only do hyperparameter estimation for a single toolkit.

Table 6. The structure of the meta-data set for each learning algorithm (shown here for backpropagation).

           DS meta-features            toolkit   hyperparameters
data set   numInst  numAttr  ...                 LR     Mo    ...   acc
iris       150      4        ...       weka      0.71   0.61  ...   96.80
iris       150      4        ...       weka      0.11   0.25  ...   97.04
...

MLRR provides easy access for researchers and practitioners to a large and varying set of meta-data information as shown in the tables above. The provided meta-data sets are a snapshot of an underlying database that stores all of the previous experimental results and that can be updated as more results are obtained. A revision history of the data sets is provided so that results can be compared even if the meta-data set has been updated.
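As a concrete, if simplified, illustration of the algorithm-selection task that the general meta-data set in Table 5 supports, a meta-learner can be trained to map data set meta-features to the algorithm setting with the highest stored accuracy. The sketch below assumes pandas and scikit-learn are available and that the meta-data set has been exported to a flat file; the file name and column lists are placeholders rather than the MLRR schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

meta = pd.read_csv("dataset_level_metadata.csv")   # one row per data set (cf. Table 5)
feature_cols = ["numInst", "numAttr"]              # data set meta-features
accuracy_cols = ["BP 1", "BP 2", "C4.5 1"]         # accuracy columns, one per LA setting

X = meta[feature_cols]
y = meta[accuracy_cols].idxmax(axis=1)             # best-performing setting per data set

meta_learner = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(meta_learner.predict(X.head(1)))             # recommendation for the first data set
```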
3 Database Description

MLRR uses MongoDB as the database to store the results from machine learning experiments. MongoDB is a NoSQL database that allows for adding new features (such as new learning algorithms and/or hyperparameters), thus escaping the rigidity of the more traditional SQL databases. This allows for easily expanding the database with new learning algorithms and/or hyperparameters. Of course, this is theoretically also possible in a relational database, provided the database has been designed adequately. For example, one could certainly have, and that would indeed be following good design principles, one table for the algorithms and one table for the hyperparameters with appropriate foreign keys. However, such a design requires some amount of foresight. In traditional relational databases, the information that needs to be stored (and how) has to be planned for in advance. Otherwise, when new features are desired, a new schema needs to be created and then the database has to be migrated over to the new schema. With a NoSQL database, new learning algorithms/hyperparameters and other pieces of information can easily be added into the MLRR.

The data is stored as a document database, as collections of key-value pairs. Each collection represents the experimental results on a particular data set. In each collection, the keys are LA hyperparameterSetting. The value then is a JSON text document that stores the results of an experiment (e.g., the results of 10-fold cross-validation on the iris data set using C4.5). These documents also contain pointers to other documents that hold information about training/testing sets for each experiment. The data set/instance level meta-features are stored in separate documents in their respective data set collection. A separate collection stores information about the learning algorithms and their hyperparameters.

The best way to visualize the database is as a hierarchy of key-value pairs, as shown in Figure 1. At the top level, there are collections; these are the individual data sets in the database. Each of them holds a collection of documents that represent an output file, or experiment, named by its learning algorithm with two numbers that correspond to the random seed used to partition the data and the hyperparameter setting. In these documents, the predictions for each instance are stored. Collections recording which instances were used for training and the hyperparameter settings are also included.
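A rough sketch of this document layout, written against pymongo, is given below. The collection names, field names, and example values are illustrative assumptions based on the description above and on Figure 1; the actual MLRR schema may differ.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mlrr"]

# One experiment document in the "iris" collection, keyed by learning
# algorithm and hyperparameter setting, holding per-fold predictions.
db["iris"].insert_one({
    "_id": "BP 1",                                       # LA + hyperparameter setting
    "seed": 1,
    "numFolds": 10,
    "predictions": {"1": {"1": 1, "57": 2, "147": 3}},   # fold -> {instance #: prediction}
})

# A separate collection describes the learning algorithms and their hyperparameters.
db["hyperparameters"].insert_one({
    "LA": "BP", "setting": 1, "toolkit": "weka", "version": "3.6.11",
    "params": {"LR": 0.261703, "Mo": 0.161703, "HN": 12},
})

print(db["iris"].find_one({"_id": "BP 1"}))
```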
4 Extending the Database

The data provided by MLRR only contains a snapshot of current machine learning results. To allow more machine learning results to be added and to allow the MLRR to evolve as the state of machine learning evolves, MLRR provides a method to upload new machine learning results. The MLRR also stores the original data sets to allow a user to add results from additional experiments on the current set of data sets. The results from experimentation on a new data set require that the new data set be uploaded as well as the experimental results. Scripts are provided to calculate the meta-features for the new data set. In the case where a data set is proprietary or has other privacy/licensing issues that prevent it from being posted, the meta-features can be calculated on the data set without storing the actual data set.

Currently, scripts are provided to upload the output from running Weka. This provides a simple way to upload experimental results from a commonly used toolkit. The file is slightly modified such that the first line provides which learning algorithm and hyperparameters were used. The database will have the ability to upload files generated by other toolkits in the future.

Of course, there are issues of data reliability. Currently, all of the results stored in the MLRR are from our experiments. To help with data reliability, we require that the script(s) and executable(s) required to reproduce the results are uploaded along with the results. This allows the results to be verified if their validity is questioned. If the results from an experiment are thought to be invalid, they can be flagged and inspected for possible removal from the MLRR.

5 Included Meta-features

In this section, we detail the meta-features that are included in the machine learning results repository (MLRR). We store a set of data set meta-features that have been commonly used in previous meta-learning studies. Specifically, we used the meta-features from Brazdil et al. [5], Ho and Basu [10], Pfahringer et al. [17], and Smith et al. [23]. As the underlying database is a NoSQL database, additional meta-features can be easily added in the future. We now describe the meta-features from each study.

The study by Brazdil et al. [5] examined ranking learning algorithms using instance-based learning. The meta-features are designed to be quickly calculated and to represent properties that affect algorithm performance (a sketch of how several of them can be computed follows the list).

• Number of examples. This feature helps identify how scalable an algorithm is based on the size of its input.
• Proportion of symbolic attributes. This feature can be used to consider how well an algorithm deals with symbolic or numeric attributes.
• Proportion of missing values. This feature can be used to consider how robust an algorithm is to incomplete data.
• Proportion of attributes with outliers. An attribute is considered to have an outlier if the ratio of variances of the mean value and the α-trimmed mean is smaller than 0.7, where α = 0.05. This feature can be used to consider how robust an algorithm is to outlying numeric values.
• Entropy of classes. This feature measures one aspect of problem difficulty in the form of whether one class outnumbers another.
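The following sketch computes a few of these simple meta-features with pandas; the class-column name and the handling of attribute types are our assumptions, not code from the MLRR scripts.

```python
import numpy as np
import pandas as pd

def simple_meta_features(df: pd.DataFrame, class_col: str = "class") -> dict:
    """Compute a few Brazdil-style data set meta-features."""
    X = df.drop(columns=[class_col])
    symbolic = X.select_dtypes(exclude=[np.number]).shape[1]
    class_probs = df[class_col].value_counts(normalize=True)
    return {
        "num_examples": len(df),
        "prop_symbolic_attributes": symbolic / X.shape[1],
        "prop_missing_values": float(X.isna().to_numpy().mean()),
        "class_entropy": float(-(class_probs * np.log2(class_probs)).sum()),
    }
```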
Ho and Basu [10] sought to measure the complexity of a data set to identify areas of the data set that contribute to its complexity, focusing on the geometrical complexity of the class boundary.

• Measures of overlap of individual feature values:

– The maximum Fisher's discriminant ratio. This is the Fisher's discriminant ratio for an attribute:

f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2},

where \mu_i and \sigma_i^2 represent the mean and variance for a class. The maximum Fisher's discriminant value over the attributes is used for this measure. For multiple classes, this measure is expanded to:

f = \frac{\sum_{i=1}^{C} \sum_{j=i+1}^{C} p_i p_j (\mu_i - \mu_j)^2}{\sum_{i=1}^{C} p_i \sigma_i^2},

where C is the number of classes and p_i is the proportion of instances that belong to the ith class (see the sketch after this list).

– The overlap of the per-class bounding boxes. This feature measures the overlap of the tails of the two class-conditional distributions. For data sets with more than 2 classes, the overlap of the per-class bounding boxes is computed for each pair of classes and the sum over all pairs of classes is returned.

– The maximum (individual) feature efficiency. This feature measures how discriminative a single feature is. For each attribute, the ratio of instances with differing classes that are not in the overlapping region is returned. The attribute that produces the largest ratio of instances is returned.

– The collective feature efficiency. This measure builds off of the previous one. The maximum ratio is first calculated as before. Then, the instances that can be discriminated are removed and the maximum (individual) feature efficiency is recalculated with the remaining instances. This process is repeated until no more instances can be removed. The ratio of instances that can be discriminated is returned.

• Measures of class separability:

– The minimized sum of the error distance of a linear classifier. This feature measures to what extent training data is linearly separable and returns the difference between a linear classifier and the actual class value.

– The training error of a linear classifier. This feature also measures to what extent the training data is linearly separable.

– The fraction of points on the class boundary. This feature estimates the length of the class boundary by constructing a minimum spanning tree over the entire data set and returning the ratio of the number of nodes in the spanning tree that are connected and belong to different classes to the number of instances in the data set.

– The ratio of average intra/inter class nearest neighbor distance. This measure compares the within class spread with the distances to the nearest neighbors of the other classes. For each instance, the distance to its nearest neighbor with the same class (intraDist(x)) and to its nearest neighbor with a different class (interDist(x)) is calculated. Then the measure returns:

\frac{\sum_{i=1}^{N} intraDist(x_i)}{\sum_{i=1}^{N} interDist(x_i)},

where N is the number of instances in the data set.

– The leave-one-out error rate of the one-nearest neighbor classifier. This feature measures how close the examples of different classes are.

• Measures of geometry, topology, and density of manifolds:

– The nonlinearity of a linear classifier. Following Hoekstra and Duin [11], given a training set, a test set is created by linear interpolation with random coefficients between pairs of randomly selected instances of the same class. The error rate of a linear classifier trained with the original training set on the generated test set is returned.

– The nonlinearity of the one-nearest neighbor classifier. A test set is created as with the previous feature, but the error rate of a 1-nearest neighbor classifier is returned.

– The fraction of maximum covering spheres. A covering sphere is created by centering on an instance and growing as much as possible before touching an instance from another class. Only the largest spheres are considered. The measure returns the number of spheres divided by the number of instances in the data set and provides an indication of how much the instances are clustered in hyperspheres or distributed in thinner structures.

– The average number of points per dimension. This measure is the ratio of instances to attributes and roughly indicates how sparse a data set is.

Multi-class modifications are made according to the implementation of the data complexity library (DCoL) [15].
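The maximum multi-class Fisher's discriminant ratio defined above reduces to a per-attribute computation maximised over the attributes. A short sketch using NumPy (our own helper, not DCoL code):

```python
import numpy as np

def max_fisher_discriminant_ratio(X: np.ndarray, y: np.ndarray) -> float:
    """Maximum multi-class Fisher's discriminant ratio over the attributes."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    best = 0.0
    for a in range(X.shape[1]):
        col = X[:, a]
        mu = np.array([col[y == c].mean() for c in classes])
        var = np.array([col[y == c].var() for c in classes])
        numerator = sum(p[i] * p[j] * (mu[i] - mu[j]) ** 2
                        for i in range(len(classes)) for j in range(i + 1, len(classes)))
        denominator = float((p * var).sum())
        if denominator > 0:
            best = max(best, numerator / denominator)
    return best
```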
Pfahringer et al. [17] introduced the notion of using performance values (i.e., accuracy) of simple and fast classification algorithms as meta-features. The landmarkers that are included in the MLRR are listed below.

• Linear discriminant learner. Creates a linear classifier that finds a linear combination of the features to separate the classes.
• One nearest neighbor learner. Redundant with the leave-one-out error rate of the one-nearest neighbor classifier from Ho and Basu [10].
• Decision node learner. A decision stump that splits on the attribute that has the highest information gain. A decision stump is a decision tree with only one node.
• Randomly chosen node learner. A decision stump that splits on a randomly chosen attribute.
• Worst node learner. A decision stump that splits on the attribute that has the lowest information gain.
• Average node learner. A decision stump is created for each attribute and the average accuracy is returned.

The use of landmarkers has been shown to be competitive with the best performing meta-features with a significant decrease in computational effort [19].

Smith et al. [23] sought to identify and characterize instances that are difficult to classify correctly. The difficulty of an instance was determined based on how frequently it was misclassified. To characterize why some instances are more difficult than others to classify correctly, the authors used different hardness measures. They include (a sketch of two of these measures appears at the end of this section):

• k-Disagreeing Neighbors. The percentage of the k nearest neighbors that do not share the target class of an instance. This measures the local overlap of an instance in the original space of the task.
• Disjunct size. This feature indicates how tightly a learning algorithm has to divide the task space to correctly classify an instance. It is measured as the size of a disjunct that covers an instance divided by the largest disjunct produced, where the disjuncts are formed using the C4.5 learning algorithm.
• Disjunct class percentage. This feature measures the overlap of an instance on a subset of the features. Using a pruned C4.5 tree, the disjunct class percentage is the number of instances in a disjunct that belong to the same class divided by the total number of instances in the disjunct.
• Tree depth (pruned and unpruned). Tree depth provides a way to estimate the description length, or Kolmogorov complexity, of an instance. It is the depth of the leaf node that classifies an instance in an induced tree.
• Class likelihood. This feature provides a global measure of overlap and the likelihood of an instance belonging to the target class. It is calculated as:

\prod_{i=1}^{|x|} p(x_i \mid t(x)),

where |x| represents the number of attributes for the instance x and t(x) is the target class of x.
• Minority value. This feature measures the skewness of the class that an instance belongs to. It is measured as the ratio of instances sharing the target class of an instance to the number of instances in the majority class.
• Class balance. This feature also measures the class skew. First, the ratio of the number of instances belonging to the target class to the total number of instances is calculated. The difference of this ratio with the ratio of one over the number of possible classes is returned. If the classes were completely balanced (i.e., all classes had the same number of instances), a value of 0 would be returned for each instance.

The hardness measures are designed to capture the characteristics of why instances are hard to classify correctly. Data set measures can be generated by averaging the hardness measures over the instances in a data set.
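As referenced in the list above, here is a minimal sketch of two of these measures, k-Disagreeing Neighbors and class balance, written with NumPy and scikit-learn. The choice of k and of the distance metric is left to the caller, and the code is illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_disagreeing_neighbors(X: np.ndarray, y: np.ndarray, k: int = 5) -> np.ndarray:
    """Fraction of each instance's k nearest neighbors with a different class."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]                   # drop the instance itself
    return (neighbor_labels != y[:, None]).mean(axis=1)

def class_balance(y: np.ndarray) -> np.ndarray:
    """Per-instance class balance: p(target class) minus 1 / number of classes."""
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes, counts / len(y)))
    return np.array([freq[label] - 1.0 / len(classes) for label in y])
```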
6 Conclusions and Future Work

In this paper, we presented the machine learning results repository (MLRR), an easily accessible and extensible database for meta-learning. MLRR was designed with the main goals of providing an easily accessible data repository to facilitate meta-learning and providing benchmark meta-data sets to compare meta-learning experiments. To this end, the MLRR provides ready-to-download meta-data sets of previous experimental results. One of the important features of MLRR is that it provides meta-data at the instance level. Of course, the results could also be used as a means of comparing one's work with prior work as they are stored in the MLRR. The MLRR can be accessed at http://axon.cs.byu.edu/mlrr.

The MLRR allows for reproducible results as the data sets are stored on the server and as the class names and toolkits are provided. ExpDB tends to be a lot more rigid in its design as it is based on relational databases and PMML (predictive model markup language), thus exhibiting a relatively steep learning curve to import and extract data. The MLRR is less rigid in its design, allowing for easier access to the data and more extensibility, with the trade-off of less formality.

One direction for future work is to integrate the API provided at OpenML (www.openml.org), an implementation of an experiment database, to incorporate their results with those that are in the MLRR. This will help provide easy access to the results that are already stored in OpenML without having to incur the learning cost associated with understanding the database schema.

Another open problem is how to store information about how a data set is preprocessed. Currently, the MLRR can store the instance level information resulting from preprocessing, but it lacks a mechanism to store the preprocessing process. Integrating this information in an efficient way is a direction of current research.

REFERENCES

[1] M. Aksela and J. Laaksonen, 'Using diversity of errors for selecting members of a committee classifier', Pattern Recognition, 39(4), 608-623, (2006).
[2] S. Ali and K.A. Smith, 'On learning algorithm selection for classification', Applied Soft Computing, 6(2), 119-138, (2006).
[3] S. Ali and K.A. Smith-Miles, 'A meta-learning approach to automatic kernel selection for support vector machines', Neurocomputing, 70, 173-186, (2006).
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, 'Curriculum learning', in Proceedings of the 26th International Conference on Machine Learning, pp. 41-48, ACM, (2009).
[5] P. B. Brazdil, C. Soares, and J. Pinto da Costa, 'Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results', Machine Learning, 50(3), 251-277, (2003).
[6] G. Brown, J. L. Wyatt, and P. Tino, 'Managing diversity in regression ensembles', Journal of Machine Learning Research, 6, 1621-1650, (2005).
[7] M. S. Gashler, 'Waffles: A machine learning toolkit', Journal of Machine Learning Research, 12, 2383-2387, (July 2011).
[8] T.A.F. Gomes, R.B.C. Prudêncio, C. Soares, A.L.D. Rossi, and A. Carvalho, 'Combining meta-learning and search techniques to select parameters for support vector machines', Neurocomputing, 75, 3-13, (2012).
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, 'The WEKA data mining software: an update', SIGKDD Explorations Newsletter, 11(1), 10-18, (2009).
[10] T. K. Ho and M. Basu, 'Complexity measures of supervised classification problems', IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 289-300, (March 2002).
[11] A. Hoekstra and R. P. W. Duin, 'On the nonlinearity of pattern classifiers', in Proceedings of the 13th International Conference on Pattern Recognition, pp. 271-275, (1996).
[12] L. I. Kuncheva and C. J. Whitaker, 'Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy', Machine Learning, 51(2), 181-207, (2003).
[13] J. Lee and C. Giraud-Carrier, 'A metric for unsupervised metalearning', Intelligent Data Analysis, 15(6), 827-841, (2011).
[14] J. Lee and C. Giraud-Carrier, 'Automatic selection of classification learning algorithms for data mining practitioners', Intelligent Data Analysis, 17(4), 665-678, (2013).
[15] A. Orriols-Puig, N. Macià, E. Bernadó-Mansilla, and T. K. Ho, 'Documentation for the data complexity library in C++', Technical Report 2009001, La Salle - Universitat Ramon Llull, (April 2009).
[16] A. H. Peterson and T. R. Martinez, 'Estimating the potential for combining learning models', in Proceedings of the ICML Workshop on Meta-Learning, pp. 68-75, (2005).
[17] B. Pfahringer, H. Bensusan, and C. G. Giraud-Carrier, 'Meta-learning by landmarking various learning algorithms', in Proceedings of the 17th International Conference on Machine Learning, pp. 743-750, Morgan Kaufmann, San Francisco, CA, USA, (2000).
[18] U. Rebbapragada and C. E. Brodley, 'Class noise mitigation through instance weighting', in Proceedings of the 18th European Conference on Machine Learning, pp. 708-715, (2007).
[19] M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel, 'Automatic classifier selection for non-experts', Pattern Analysis & Applications, 17(1), 83-96, (2014).
[20] M. Reif, 'A comprehensive dataset for evaluating approaches of various meta-learning tasks', in Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pp. 273-276, (2012).
[21] M. R. Smith and T. Martinez, 'Improving classification accuracy by identifying and removing instances that should be misclassified', in Proceedings of the IEEE International Joint Conference on Neural Networks, pp. 2690-2697, (2011).
[22] M. R. Smith and T. Martinez, 'A comparative evaluation of curriculum learning with filtering and boosting in supervised classification problems', Computational Intelligence, accepted, (2014).
[23] M. R. Smith, T. Martinez, and C. Giraud-Carrier, 'An instance level analysis of data complexity', Machine Learning, 95(2), 225-256, (2014).
[24] J. Vanschoren, H. Blockeel, B. Pfahringer, and G. Holmes, 'Experiment databases: a new way to share, organize and learn from experiments', Machine Learning, 87(2), 127-158, (2012).

[Figure 1 (diagram omitted). Hierarchical representation of how the results from machine learning experiments are stored in the NoSQL database for the MLRR. Each data set has a collection containing the predictions for each instance from a learning algorithm as well as its meta-features. A separate collection stores all of the information for the learning algorithms and which hyperparameters were used. Another collection stores the information for which instances were used for training.]