An Easy to Use Repository for Comparing and Improving Machine Learning Algorithm Usage

Michael R. Smith, Andrew White, Christophe Giraud-Carrier, and Tony Martinez
Brigham Young University, USA
emails: msmith@axon.cs.byu.edu, andrewkvavlewhite@gmail.com, cgc@cs.byu.edu, martinez@cs.byu.edu

Abstract. The results from most machine learning experiments are used for a specific purpose and then discarded. This causes significant loss of information and requires rerunning experiments to compare learning algorithms. Often, this also requires a researcher or practitioner to implement another algorithm for comparison, which may not always be correctly implemented. By storing the results from previous experiments, machine learning algorithms can be compared easily and the knowledge gained from them can be used to improve the performance of future machine learning experiments. The purpose of this work is to provide easy access to previous experimental results for learning and comparison. These stored results are comprehensive, storing the prediction for each test instance as well as the learning algorithm, hyperparameters, and training set that were used in the experiment. Previous experimental results are particularly important for meta-learning, which, in a broad sense, is the process of learning from previous machine learning results such that the learning process is improved. While other experiment databases do exist, one of our focuses is on easy access to the data, eliminating any learning curve required to acquire the desired information. We provide meta-learning data sets that are ready to be downloaded for meta-learning experiments. Easy access to previous experimental results aids other researchers looking to do meta-learning and helps in comparing meta-learning algorithms. In addition, simple queries to the underlying database can be made if specific information is desired. We also differ from previous experiment databases in that our database is designed at the instance level, where an instance is an example in a data set. We store the predictions of a learning algorithm trained on a specific training set for each instance in the test set. Data set level information can then be obtained by aggregating the results from the instances. The instance level information can be used for many tasks, such as determining the diversity of a classifier or algorithmically determining the optimal subset of training instances for a learning algorithm.
1 Introduction

The quality of an induced model is dependent on, among other aspects, the learning algorithm that is chosen, the hyperparameter settings for the chosen learning algorithm, and the quality of the training set. Choosing a learning algorithm for a given task, setting its hyperparameters, and selecting which instances to train on, however, is non-trivial. Meta-learning deals with the problem of how to select a learning algorithm and set its hyperparameters based on previous experience (results from previous machine learning experiments). Although some research from the machine learning community has focused on meta-learning (e.g., see [17, 5, 2, 3, 8]), much of the focus of machine learning research has been on developing more learning algorithms and/or applying machine learning in specific domains.

Part of the difficulty of meta-learning is due to the lack of accessible results. As meta-learning requires running several learning algorithms and hyperparameter settings over many data sets, gathering results requires large amounts of computational resources. In addition to the computational requirements, results from the learning algorithms may differ due to slight differences in their implementations. Thus, comparing results among meta-learning studies becomes difficult.

To aid in further research in meta-learning, we have developed the machine learning results repository (MLRR), which provides data sets ready for download for meta-learning problems, akin to the UCI data repository for machine learning problems. We refer to the data sets for meta-learning as meta-data sets to distinguish them from the data sets that are used in the machine learning experiments. The meta-data sets provide a snapshot of an underlying database that stores the results of machine learning experiments. Users can update the database with new results from machine learning experiments and then update the meta-data sets for meta-learning. A revision history is kept so that comparisons among meta-learning algorithms are facilitated. As a starting point, meta-data sets are provided by MLRR for typical meta-learning tasks, such as, given a set of meta-features, predicting which learning algorithm and/or hyperparameter setting to use.

The MLRR stores instance level meta-features and the predictions made on each instance by the learning algorithms. Providing information at the instance level allows studies to be performed on the instances themselves. Studying the effects of machine learning on a single instance and/or the effects of a single instance on the performance of an algorithm has generally been overlooked. Instance-level information is important in several areas of machine learning, however. In ensembles, computing the classifier diversity of the ensembled classifiers using the predictions for each instance is important in determining the effectiveness of the ensembling technique [12, 6, 1]. In curriculum learning, the training set is incrementally augmented such that "easier" instances are presented to the learning algorithm first, thus creating a need to understand and identify the easier instances [4]. Smith et al. used instance-level predictions to identify and characterize instances that are likely to be misclassified [23] and used this information to create a curriculum [22]. Other work has also used the instance-level predictions for meta-learning. The classifier output difference (COD) measures the distance between two learning algorithms as the probability that the learning algorithms make different predictions on test instances [16]. Unsupervised meta-learning (UML) clusters learning algorithms based on their COD scores (rather than accuracy) to examine the behavior of the learning algorithms [13]. Meta-learning for algorithm selection can then be done over the clusters rather than a larger set of learning algorithms to recommend a cluster of learning algorithms that all behave similarly [14]. Additionally, several techniques treat instances individually during the training process, such as filtering instances from the training set based on their instance-level meta-features [21] or weighting the instances [18].
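Because the MLRR stores the prediction that each algorithm makes on each test instance, pairwise measures such as COD can be computed directly from the stored prediction columns. The following is a minimal sketch of that computation, assuming the predictions have already been read out of an instance-level meta-data set (Section 2.2); the function and variable names are ours and are not part of MLRR.

```python
from typing import Sequence

def classifier_output_difference(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Estimate COD [16] as the fraction of test instances on which two
    learning algorithms make different predictions."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction columns must cover the same test instances")
    disagreements = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return disagreements / len(preds_a)

# Two hypothetical prediction columns (e.g., the "BP 1/1" and "C4.5 1/1" columns).
print(classifier_output_difference([3, 2, 2, 1], [3, 1, 2, 2]))  # 0.5
```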
Other attempts have been made at creating a repository for machine learning experiments from which learning can be conducted [20, 24]. However, we feel that they lack simplicity and/or extensibility. In addition to providing instance-level information, we hope to bridge this gap with the MLRR. Probably the most prominent and well-developed data repository is ExpDB, an experiment database that provides a framework for reporting experimental results and their associated workflow [24]. The purpose of ExpDB is to comprehensively store the workflow process of all experiments for reproducibility. One of the results of storing the experiments is that the results can be used for meta-learning. Unfortunately, there is a relatively steep learning curve to access the data due to the inherent complexity involved in storing all of the details about exact reproducibility. Because of this complexity and formality, it is difficult to directly access the information that would be most beneficial for meta-learning, which may deter some potential users. Additionally, ExpDB does not currently support storage and manipulation of any instance level features.

We acknowledge that maintaining a database of previous experiments is not a trivial problem. We do, however, add our voice to support the importance of maintaining a repository of machine learning results and offer an effective solution for storing results from previous experiments. Our primary goal is to maintain simplicity and provide easily accessible data for meta-learning to 1) help promote more research in meta-learning, 2) provide a standard set of data sets for meta-learning algorithm comparison, and 3) continue to stimulate research at the instance level.

We next describe our approach for providing a repository for machine learning meta-data that emphasizes ease of access to the meta-data. MLRR currently has the results from 72 data sets, 9 learning algorithms, and 10 hyperparameter settings for each learning algorithm. The database description is provided in Section 3. How to add new experimental results to the database is detailed in Section 4. We then give a more detailed description of the data set level and instance level meta-features that are used in the MLRR in Section 5. Conclusions and directions for future work are provided in Section 6.

2 Meta-data Set Descriptions

The purpose of the machine learning results repository (MLRR) is to provide easy access to the results of previous machine learning experiments for meta-learning at the data set and instance levels. This, in turn, would allow other researchers interested in meta-learning and in better understanding machine learning algorithms direct access to prior results without having to re-run all of the algorithms or learn how to navigate a more complex experiment database. The quality of an induced model for a task is dependent on at least three things:

1. the learning algorithm chosen to induce the model,
2. the hyperparameter settings for the chosen learning algorithm, and
3. the instances used for training.

When we refer to an experiment, we mean the results from training a learning algorithm l with hyperparameter settings λ on a training set t. We first describe how we manage experiment information, and then describe the provided meta-data sets.
2.1 Experiment Information

The information about each experiment is provided in three tables in MLRR. Which learning algorithm and hyperparameters were used is provided in a file structured as shown in Table 1. It provides the toolkit, including the version number that was run, the learning algorithm, and the hyperparameters that were used. This allows for multiple learning algorithms, hyperparameters, and toolkits to be compared. In the examples in Table 1, the class names from the Weka machine learning toolkit [9] and the Waffles machine learning toolkit [7] are shown. LA seed corresponds to the learning algorithm that was used (LA) and to a seed that represents which hyperparameter setting was used (seed). The LA seed will be used in other tables as a foreign key to map back to this table. A seed of -1 represents the default hyperparameter settings, as many studies examine the default behavior as given in a toolkit and the default parameters are commonly used in practice.

Table 1. The structure of the meta-data set that describes the hyperparameter settings for the learning algorithms stored in the database.

LA    S  Toolkit  Version   Hyperparameters
BP    1  weka     3.6.11    weka.classifiers.functions.MultilayerPerceptron -- -L 0.261703 -M 0.161703 -H 12 -D
BP    2  weka     3.6.11    weka.classifiers.functions.MultilayerPerceptron -- -L 0.25807 -M 0.15807 -H 4
BP    3  waffles  13-12-09  neuralnet -addlayer 8 -learningrate 0.1 -momentum 0 -windowsepochs 50
...
C4.5  1  weka     3.6.11    weka.classifiers.trees.J48 -- -C 0.443973 -M 1
...

As the parameter values differ between toolkits, there is a mapping provided to distinguish hyperparameter settings. For example, Weka uses the "-L" parameter to set the learning rate in backpropagation while the Waffles toolkit uses "-learningrate". Also, some toolkits have hyperparameters that other implementations of the same learning algorithm do not include. In such cases, an unknown value ("?") will be provided in the meta-data set. This mapping is shown in Table 2 for the backpropagation learning algorithm. The first row contains the values used by MLRR. The following rows contain the command-line parameter supplied to a specific toolkit to set that hyperparameter.

Table 2. The structure of the table for mapping learning algorithm hyperparameters between different toolkits for the backpropagation learning algorithm.

toolkit  LR             Mo          HN         DC  WE
weka     -L             -M          -H         -D  ?
waffles  -learningrate  -momentum   -addlayer  ?   -windowsepochs
...

A mapping of which instances are used for training is also provided in a separate file. The structure of this table is shown in Table 3. Each row represents an experiment as toolkit seed numFolds fold. The toolkit represents which toolkit was used, the seed represents the random seed that was provided to the toolkit, numFolds represents how many folds were run, and fold represents in which fold an instance was included for testing. The values in the following columns represent whether an instance was used for training or testing; there is one column for each instance in the data set. They are stored as real values, which allows for the situations when training instances have associated weights. In the file, an unknown value of "?" represents a testing instance; otherwise, a real value represents a training instance. A value of 0 represents a filtered instance, a value of 1 represents an unweighted training instance, and any value between 0 and 1 represents the weight for that training instance. In the cases where there are specific training and testing sets, the row is labeled as toolkit 0 0 1 and the information for the training set can be entered as before. A random test/training split of the data is represented as toolkit seed percentSplit 1, where "percentSplit" represents the percentage of the data set that was used for testing as generated by the toolkit.

Table 3. The structure of the meta-data set that indicates which instances were used for training given a random seed.

toolkit  seed  # folds  fold    1     2  3  ...
weka     1     10       1       1     1  1  ...
weka     1     10       2       1     0  1  ...
...
weka     1     10       10      0.74  1  ?  ...
weka     2     10       1       ?     1  1  ...
...
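The encoding in Table 3 packs test membership, filtering, and instance weights into a single value per instance. Below is a small sketch of how a consumer of the file might interpret one row; the helper name and example row are ours, not code shipped with MLRR.

```python
def interpret_membership(value: str) -> tuple[str, float]:
    """Interpret one instance column of Table 3: '?' marks a test instance,
    0 a filtered instance, 1 an unweighted training instance, and a value
    strictly between 0 and 1 a weighted training instance."""
    if value == "?":
        return ("test", 0.0)
    weight = float(value)
    if weight == 0.0:
        return ("filtered", 0.0)
    return ("train", weight)

# A hypothetical row for: toolkit=weka, seed=1, numFolds=10, fold=2.
row = ["1", "0", "?", "0.35", "1"]
print([interpret_membership(v) for v in row])
# [('train', 1.0), ('filtered', 0.0), ('test', 0.0), ('train', 0.35), ('train', 1.0)]
```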
2.2 Meta-data sets

One of the features of MLRR is its focus on storing and presenting instance level information, namely, instance level characteristics and associated predictions from previous experiments. Indeed, the MLRR is designed intentionally from the instance level perspective, from which data set level information can be computed (e.g., accuracy or precision).

As one of the purposes of the MLRR is ease of access, the MLRR stores several data sets in attribute-relation file format (ARFF), which is supported by many machine learning toolkits. In essence, ARFF is a comma or space separated file with attribute information and possible comments. The precomputed meta-data sets include instance level meta-data sets and data set level meta-data sets.

At the instance level, MLRR provides for each data set a meta-data set that stores the instance level meta-features and the prediction from each experiment. This allows for analyses to be done exploring the effects of hyperparameters and learning algorithms at the instance level, which is currently mostly overlooked. For each data set, a meta-data set is provided that gives the values for the instance level meta-features, the actual class value (stored as a numeric value), and the predicted class value for each experiment. The training set and learning algorithm/hyperparameter information is stored in the column heading as "LA seed/hyperparameter", where LA is a learning algorithm and hyperparameter is the hyperparameter setting for the learning algorithm. Together, they map to the entries in Table 1. The seed represents the seed that was used to partition the data (see Table 3). The structure of the instance level meta-data set is shown in Table 4. In the given example, instance 77 is shown. The "inst meta" section provides the instance level meta-features for that instance. The actual class label is 2. The predictions from the experiments on this data set are provided in the following columns (i.e., experiment BP 1/1 predicted class 3, BP N/1 predicted class 2, etc.).

Table 4. The structure of the meta-data set at the instance level.

      inst meta                     predictions
#    kAN   MV  ...    act    BP 1/1  ...  BP N/1  ...  BP N/M  C4.5 1/1  ...
77   0.92  0   ...    2      3       ...  2       ...  2       3         ...
...
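Because each prediction column of Table 4 covers every instance in the data set, data set level statistics can be recovered by simple aggregation, as noted above. A minimal sketch of this aggregation (the names and values are illustrative, not the MLRR schema):

```python
def aggregate_accuracy(actual: list[int], predictions: dict[str, list[int]]) -> dict[str, float]:
    """Aggregate instance-level predictions (the experiment columns of
    Table 4) into data set level accuracy for each experiment."""
    return {
        experiment: sum(1 for a, p in zip(actual, preds) if a == p) / len(actual)
        for experiment, preds in predictions.items()
    }

# A hypothetical four-instance slice of an instance-level meta-data set.
actual = [2, 1, 3, 2]
predictions = {"BP 1/1": [3, 1, 3, 2], "C4.5 1/1": [2, 1, 3, 1]}
print(aggregate_accuracy(actual, predictions))  # {'BP 1/1': 0.75, 'C4.5 1/1': 0.75}
```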
At the data set level, several meta-data sets are provided:

• a general meta-data set that stores the data set meta-features and the average N by 10-fold cross-validation accuracy for all of the data sets from a learning algorithm with a given hyperparameter setting, and
• for each learning algorithm, a meta-data set that stores the data set meta-features, the learning algorithm hyperparameter settings, and the average N by 10-fold cross-validation accuracy for all of the data sets for the given hyperparameter setting.

The structure for the general meta-data set is provided in Table 5. The structure and information of this meta-data set is typical of that used in previous meta-learning studies, providing a mapping from data set meta-features to accuracies obtained by a set of learning algorithms. Most previous studies have been limited to only using the default hyperparameters, however. The MLRR includes the accuracies from multiple hyperparameter settings. The hyperparameter settings from each learning algorithm are denoted by "LA #", where LA refers to a learning algorithm and # refers to which hyperparameter setting was used for that learning algorithm.

Table 5. The structure of the meta-data set at the data set level.

           data set meta-features        LA accuracies
data set   numInst  numAttr  ...    BP 1   BP 2   ...  BP N   C4.5 1  ...
iris       150      4        ...    96.80  95.07  ...  93.47  95.60   ...
abalone    4177     8        ...    20.27  29.84  ...  21.91  23.24   ...
...

The meta-data sets for each learning algorithm are designed to aid in algorithmic hyperparameter estimation, i.e., given a data set, can we predict which hyperparameter setting will give the highest classification accuracy. For each learning algorithm, a meta-data set is provided that contains the data set meta-features, the toolkit that was used, the hyperparameter setting, and the average accuracy for each unique toolkit/hyperparameter combination. The structure of the meta-data set for each learning algorithm is provided in Table 6. The accuracy ("acc") represents the average accuracy for all k-fold validation runs (i.e., multiple runs of the same learning algorithm with different random seeds to partition the folds). The toolkit is also provided to allow a user to compare toolkits or only do hyperparameter estimation for a single toolkit.

Table 6. The structure of the meta-data set for each learning algorithm (shown here for backpropagation).

           DS meta-features            toolkit   hyperparameters
data set   numInst  numAttr  ...                 LR     Mo    ...   acc
iris       150      4        ...       weka      0.71   0.61  ...   96.80
iris       150      4        ...       weka      0.11   0.25  ...   97.04
...

MLRR provides easy access for researchers and practitioners to a large and varying set of meta-data information as shown in the tables above. The provided meta-data sets are a snapshot of an underlying database that stores all of the previous experimental results and that can be updated as more results are obtained. A revision history of the data sets is provided so that results can be compared even if the meta-data set has been updated.
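As a concrete, if simplified, illustration of the algorithm-selection task that the general meta-data set in Table 5 supports, a meta-learner can be trained to map data set meta-features to the algorithm setting with the highest stored accuracy. The sketch below assumes pandas and scikit-learn are available and that the meta-data set has been exported to a flat file; the file name and column lists are placeholders rather than the MLRR schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

meta = pd.read_csv("dataset_level_metadata.csv")   # one row per data set (cf. Table 5)
feature_cols = ["numInst", "numAttr"]              # data set meta-features
accuracy_cols = ["BP 1", "BP 2", "C4.5 1"]         # accuracy columns, one per LA setting

X = meta[feature_cols]
y = meta[accuracy_cols].idxmax(axis=1)             # best-performing setting per data set

meta_learner = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(meta_learner.predict(X.head(1)))             # recommendation for the first data set
```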
3 Database Description

MLRR uses MongoDB as the database to store the results from machine learning experiments. MongoDB is a NoSQL database that allows for adding new features (such as new learning algorithms and/or hyperparameters), thus escaping the rigidity of the more traditional SQL databases. This allows for easily expanding the database with new learning algorithms and/or hyperparameters. Of course, this is theoretically also possible in a relational database, provided the database has been designed adequately. For example, one could certainly have, and that would indeed be following good design principles, one table for the algorithms and one table for the hyperparameters with appropriate foreign keys. However, such a design requires some amount of foresight. In traditional relational databases, the information that needs to be stored (and how) has to be planned for in advance. Otherwise, when new features are desired, a new schema needs to be created and then the database has to be migrated over to the new schema. With a NoSQL database, new learning algorithms/hyperparameters and other pieces of information can easily be added into the MLRR.

The data is stored as a document database, as collections of key-value pairs. Each collection represents the experimental results on a particular data set. In each collection, the keys are LA hyperparameterSetting. The value then is a JSON text document that stores the results of an experiment (e.g., the results of 10-fold cross-validation on the iris data set using C4.5). These documents also contain pointers to other documents that hold information about training/testing sets for each experiment. The data set/instance level meta-features are stored in separate documents in their respective data set collection. A separate collection stores information about the learning algorithms and their hyperparameters.

The best way to visualize the database is as a hierarchy of key-value pairs, as shown in Figure 1. At the top level, there are collections; these are the individual data sets in the database. Each of them holds a collection of documents that represent an output file, or experiment, named by its learning algorithm with two numbers that correspond to the random seed used to partition the data and the hyperparameter setting. In these documents, the predictions for each instance are stored. Collections recording which instances were used for training and the hyperparameter settings are also included.
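A rough sketch of this document layout, written against pymongo, is given below. The collection names, field names, and example values are illustrative assumptions based on the description above and on Figure 1; the actual MLRR schema may differ.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mlrr"]

# One experiment document in the "iris" collection, keyed by learning
# algorithm and hyperparameter setting, holding per-fold predictions.
db["iris"].insert_one({
    "_id": "BP 1",                                       # LA + hyperparameter setting
    "seed": 1,
    "numFolds": 10,
    "predictions": {"1": {"1": 1, "57": 2, "147": 3}},   # fold -> {instance #: prediction}
})

# A separate collection describes the learning algorithms and their hyperparameters.
db["hyperparameters"].insert_one({
    "LA": "BP", "setting": 1, "toolkit": "weka", "version": "3.6.11",
    "params": {"LR": 0.261703, "Mo": 0.161703, "HN": 12},
})

print(db["iris"].find_one({"_id": "BP 1"}))
```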
4 Extending the Database

The data provided by MLRR only contains a snapshot of current machine learning results. To allow more machine learning results to be added and to allow the MLRR to evolve as the state of machine learning evolves, MLRR provides a method to upload new machine learning results. The MLRR also stores the original data sets to allow a user to add results from additional experiments on the current set of data sets. The results from experimentation on a new data set require that the new data set be uploaded as well as the experimental results. Scripts are provided to calculate the meta-features for the new data set. In the case where a data set is proprietary or has other privacy/licensing issues that prevent it from being posted, the meta-features can be calculated on the data set without storing the actual data set.

Currently, scripts are provided to upload the output from running Weka. This provides a simple way to upload experimental results from a commonly used toolkit. The file is slightly modified such that the first line provides which learning algorithm and hyperparameters were used. The database will have the ability to upload files generated by other toolkits in the future.

Of course, there are issues of data reliability. Currently, all of the results stored in the MLRR are from our experiments. To help with data reliability, we require that the script(s) and executable(s) required to reproduce the results are uploaded along with the results. This allows the results to be verified if their validity is questioned. If the results from an experiment are thought to be invalid, they can be flagged and inspected for possible removal from the MLRR.

5 Included Meta-features

In this section, we detail the meta-features that are included in the machine learning results repository (MLRR). We store a set of data set meta-features that have been commonly used in previous meta-learning studies. Specifically, we used the meta-features from Brazdil et al. [5], Ho and Basu [10], Pfahringer et al. [17], and Smith et al. [23]. As the underlying database is a NoSQL database, additional meta-features can be easily added in the future. We now describe the meta-features from each study.

The study by Brazdil et al. [5] examined ranking learning algorithms using instance-based learning. The meta-features are designed to be quickly calculated and to represent properties that affect algorithm performance (a sketch of how several of them can be computed follows the list).

• Number of examples. This feature helps identify how scalable an algorithm is based on the size of its input.
• Proportion of symbolic attributes. This feature can be used to consider how well an algorithm deals with symbolic or numeric attributes.
• Proportion of missing values. This feature can be used to consider how robust an algorithm is to incomplete data.
• Proportion of attributes with outliers. An attribute is considered to have an outlier if the ratio of variances of the mean value and the α-trimmed mean is smaller than 0.7, where α = 0.05. This feature can be used to consider how robust an algorithm is to outlying numeric values.
• Entropy of classes. This feature measures one aspect of problem difficulty in the form of whether one class outnumbers another.
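The following sketch computes a few of these simple meta-features with pandas; the class-column name and the handling of attribute types are our assumptions, not code from the MLRR scripts.

```python
import numpy as np
import pandas as pd

def simple_meta_features(df: pd.DataFrame, class_col: str = "class") -> dict:
    """Compute a few Brazdil-style data set meta-features."""
    X = df.drop(columns=[class_col])
    symbolic = X.select_dtypes(exclude=[np.number]).shape[1]
    class_probs = df[class_col].value_counts(normalize=True)
    return {
        "num_examples": len(df),
        "prop_symbolic_attributes": symbolic / X.shape[1],
        "prop_missing_values": float(X.isna().to_numpy().mean()),
        "class_entropy": float(-(class_probs * np.log2(class_probs)).sum()),
    }
```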
Ho and Basu [10] sought to measure the complexity of a data set to identify areas of the data set that contribute to its complexity, focusing on the geometrical complexity of the class boundary.

• Measures of overlap of individual feature values:

– The maximum Fisher's discriminant ratio. This is the Fisher's discriminant ratio for an attribute:

f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2},

where \mu_i and \sigma_i^2 represent the mean and variance for a class. The maximum Fisher's discriminant value over the attributes is used for this measure. For multiple classes, this measure is expanded to:

f = \frac{\sum_{i=1}^{C} \sum_{j=i+1}^{C} p_i p_j (\mu_i - \mu_j)^2}{\sum_{i=1}^{C} p_i \sigma_i^2},

where C is the number of classes and p_i is the proportion of instances that belong to the ith class (see the sketch after this list).

– The overlap of the per-class bounding boxes. This feature measures the overlap of the tails of the two class-conditional distributions. For data sets with more than 2 classes, the overlap of the per-class bounding boxes is computed for each pair of classes and the sum over all pairs of classes is returned.

– The maximum (individual) feature efficiency. This feature measures how discriminative a single feature is. For each attribute, the ratio of instances with differing classes that are not in the overlapping region is returned. The attribute that produces the largest ratio of instances is returned.

– The collective feature efficiency. This measure builds off of the previous one. The maximum ratio is first calculated as before. Then, the instances that can be discriminated are removed and the maximum (individual) feature efficiency is recalculated with the remaining instances. This process is repeated until no more instances can be removed. The ratio of instances that can be discriminated is returned.

• Measures of class separability:

– The minimized sum of the error distance of a linear classifier. This feature measures to what extent training data is linearly separable and returns the difference between a linear classifier and the actual class value.

– The training error of a linear classifier. This feature also measures to what extent the training data is linearly separable.

– The fraction of points on the class boundary. This feature estimates the length of the class boundary by constructing a minimum spanning tree over the entire data set and returning the ratio of the number of nodes in the spanning tree that are connected and belong to different classes to the number of instances in the data set.

– The ratio of average intra/inter class nearest neighbor distance. This measure compares the within class spread with the distances to the nearest neighbors of the other classes. For each instance, the distance to its nearest neighbor with the same class (intraDist(x)) and to its nearest neighbor with a different class (interDist(x)) is calculated. Then the measure returns:

\frac{\sum_{i=1}^{N} intraDist(x_i)}{\sum_{i=1}^{N} interDist(x_i)},

where N is the number of instances in the data set.

– The leave-one-out error rate of the one-nearest neighbor classifier. This feature measures how close the examples of different classes are.

• Measures of geometry, topology, and density of manifolds:

– The nonlinearity of a linear classifier. Following Hoekstra and Duin [11], given a training set, a test set is created by linear interpolation with random coefficients between pairs of randomly selected instances of the same class. The error rate of a linear classifier trained with the original training set on the generated test set is returned.

– The nonlinearity of the one-nearest neighbor classifier. A test set is created as with the previous feature, but the error rate of a 1-nearest neighbor classifier is returned.

– The fraction of maximum covering spheres. A covering sphere is created by centering on an instance and growing as much as possible before touching an instance from another class. Only the largest spheres are considered. The measure returns the number of spheres divided by the number of instances in the data set and provides an indication of how much the instances are clustered in hyperspheres or distributed in thinner structures.

– The average number of points per dimension. This measure is the ratio of instances to attributes and roughly indicates how sparse a data set is.

Multi-class modifications are made according to the implementation of the data complexity library (DCoL) [15].
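The maximum multi-class Fisher's discriminant ratio defined above reduces to a per-attribute computation maximised over the attributes. A short sketch using NumPy (our own helper, not DCoL code):

```python
import numpy as np

def max_fisher_discriminant_ratio(X: np.ndarray, y: np.ndarray) -> float:
    """Maximum multi-class Fisher's discriminant ratio over the attributes."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    best = 0.0
    for a in range(X.shape[1]):
        col = X[:, a]
        mu = np.array([col[y == c].mean() for c in classes])
        var = np.array([col[y == c].var() for c in classes])
        numerator = sum(p[i] * p[j] * (mu[i] - mu[j]) ** 2
                        for i in range(len(classes)) for j in range(i + 1, len(classes)))
        denominator = float((p * var).sum())
        if denominator > 0:
            best = max(best, numerator / denominator)
    return best
```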
Pfahringer et al. [17] introduced the notion of using performance values (i.e., accuracy) of simple and fast classification algorithms as meta-features. The landmarkers that are included in the MLRR are listed below.

• Linear discriminant learner. Creates a linear classifier that finds a linear combination of the features to separate the classes.
• One nearest neighbor learner. Redundant with the leave-one-out error rate of the one-nearest neighbor classifier from Ho and Basu [10].
• Decision node learner. A decision stump that splits on the attribute that has the highest information gain. A decision stump is a decision tree with only one node.
• Randomly chosen node learner. A decision stump that splits on a randomly chosen attribute.
• Worst node learner. A decision stump that splits on the attribute that has the lowest information gain.
• Average node learner. A decision stump is created for each attribute and the average accuracy is returned.

The use of landmarkers has been shown to be competitive with the best performing meta-features with a significant decrease in computational effort [19].

Smith et al. [23] sought to identify and characterize instances that are difficult to classify correctly. The difficulty of an instance was determined based on how frequently it was misclassified. To characterize why some instances are more difficult than others to classify correctly, the authors used different hardness measures. They include (a sketch of two of these measures appears at the end of this section):

• k-Disagreeing Neighbors. The percentage of the k nearest neighbors that do not share the target class of an instance. This measures the local overlap of an instance in the original space of the task.
• Disjunct size. This feature indicates how tightly a learning algorithm has to divide the task space to correctly classify an instance. It is measured as the size of a disjunct that covers an instance divided by the largest disjunct produced, where the disjuncts are formed using the C4.5 learning algorithm.
• Disjunct class percentage. This feature measures the overlap of an instance on a subset of the features. Using a pruned C4.5 tree, the disjunct class percentage is the number of instances in a disjunct that belong to the same class divided by the total number of instances in the disjunct.
• Tree depth (pruned and unpruned). Tree depth provides a way to estimate the description length, or Kolmogorov complexity, of an instance. It is the depth of the leaf node that classifies an instance in an induced tree.
• Class likelihood. This feature provides a global measure of overlap and the likelihood of an instance belonging to the target class. It is calculated as:

\prod_{i=1}^{|x|} p(x_i \mid t(x)),

where |x| represents the number of attributes for the instance x and t(x) is the target class of x.
• Minority value. This feature measures the skewness of the class that an instance belongs to. It is measured as the ratio of instances sharing the target class of an instance to the number of instances in the majority class.
• Class balance. This feature also measures the class skew. First, the ratio of the number of instances belonging to the target class to the total number of instances is calculated. The difference of this ratio with the ratio of one over the number of possible classes is returned. If the classes were completely balanced (i.e., all classes had the same number of instances), a value of 0 would be returned for each instance.

The hardness measures are designed to capture the characteristics of why instances are hard to classify correctly. Data set measures can be generated by averaging the hardness measures over the instances in a data set.
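As referenced in the list above, here is a minimal sketch of two of these measures, k-Disagreeing Neighbors and class balance, written with NumPy and scikit-learn. The choice of k and of the distance metric is left to the caller, and the code is illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_disagreeing_neighbors(X: np.ndarray, y: np.ndarray, k: int = 5) -> np.ndarray:
    """Fraction of each instance's k nearest neighbors with a different class."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]                   # drop the instance itself
    return (neighbor_labels != y[:, None]).mean(axis=1)

def class_balance(y: np.ndarray) -> np.ndarray:
    """Per-instance class balance: p(target class) minus 1 / number of classes."""
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes, counts / len(y)))
    return np.array([freq[label] - 1.0 / len(classes) for label in y])
```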
6 Conclusions and Future Work

In this paper, we presented the machine learning results repository (MLRR), an easily accessible and extensible database for meta-learning. MLRR was designed with the main goals of providing an easily accessible data repository to facilitate meta-learning and providing benchmark meta-data sets to compare meta-learning experiments. To this end, the MLRR provides ready-to-download meta-data sets of previous experimental results. One of the important features of MLRR is that it provides meta-data at the instance level. Of course, the results could also be used as a means of comparing one's work with prior work as they are stored in the MLRR. The MLRR can be accessed at http://axon.cs.byu.edu/mlrr.

The MLRR allows for reproducible results as the data sets are stored on the server and as the class names and toolkits are provided. ExpDB tends to be a lot more rigid in its design as it is based on relational databases and PMML (predictive model markup language), thus exhibiting a relatively steep learning curve to import and extract data. The MLRR is less rigid in its design, allowing for easier access to the data and more extensibility, with the trade-off of less formality.

One direction for future work is to integrate the API provided at OpenML (www.openml.org), an implementation of an experiment database, to incorporate their results with those that are in the MLRR. This will help provide easy access to the results that are already stored in OpenML without having to incur the learning cost associated with understanding the database schema.

Another open problem is how to store information about how a data set is preprocessed. Currently, the MLRR can store the instance level information resulting from preprocessing, but it lacks a mechanism to store the preprocessing process. Integrating this information in an efficient way is a direction of current research.

REFERENCES

[1] M. Aksela and J. Laaksonen, 'Using diversity of errors for selecting members of a committee classifier', Pattern Recognition, 39(4), 608-623, (2006).
[2] S. Ali and K.A. Smith, 'On learning algorithm selection for classification', Applied Soft Computing, 6(2), 119-138, (2006).
[3] S. Ali and K.A. Smith-Miles, 'A meta-learning approach to automatic kernel selection for support vector machines', Neurocomputing, 70, 173-186, (2006).
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, 'Curriculum learning', in Proceedings of the 26th International Conference on Machine Learning, pp. 41-48, ACM, (2009).
[5] P. B. Brazdil, C. Soares, and J. Pinto da Costa, 'Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results', Machine Learning, 50(3), 251-277, (2003).
[6] G. Brown, J. L. Wyatt, and P. Tino, 'Managing diversity in regression ensembles', Journal of Machine Learning Research, 6, 1621-1650, (2005).
[7] M. S. Gashler, 'Waffles: A machine learning toolkit', Journal of Machine Learning Research, 12, 2383-2387, (July 2011).
[8] T.A.F. Gomes, R.B.C. Prudêncio, C. Soares, A.L.D. Rossi, and A. Carvalho, 'Combining meta-learning and search techniques to select parameters for support vector machines', Neurocomputing, 75, 3-13, (2012).
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, 'The WEKA data mining software: an update', SIGKDD Explorations Newsletter, 11(1), 10-18, (2009).
[10] T. K. Ho and M. Basu, 'Complexity measures of supervised classification problems', IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 289-300, (March 2002).
[11] A. Hoekstra and R. P. W. Duin, 'On the nonlinearity of pattern classifiers', in Proceedings of the 13th International Conference on Pattern Recognition, pp. 271-275, (1996).
[12] L. I. Kuncheva and C. J. Whitaker, 'Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy', Machine Learning, 51(2), 181-207, (2003).
[13] J. Lee and C. Giraud-Carrier, 'A metric for unsupervised metalearning', Intelligent Data Analysis, 15(6), 827-841, (2011).
[14] J. Lee and C. Giraud-Carrier, 'Automatic selection of classification learning algorithms for data mining practitioners', Intelligent Data Analysis, 17(4), 665-678, (2013).
[15] A. Orriols-Puig, N. Macià, E. Bernadó-Mansilla, and T. K. Ho, 'Documentation for the data complexity library in C++', Technical Report 2009001, La Salle - Universitat Ramon Llull, (April 2009).
[16] A. H. Peterson and T. R. Martinez, 'Estimating the potential for combining learning models', in Proceedings of the ICML Workshop on Meta-Learning, pp. 68-75, (2005).
[17] B. Pfahringer, H. Bensusan, and C. G. Giraud-Carrier, 'Meta-learning by landmarking various learning algorithms', in Proceedings of the 17th International Conference on Machine Learning, pp. 743-750, Morgan Kaufmann, San Francisco, CA, USA, (2000).
[18] U. Rebbapragada and C. E. Brodley, 'Class noise mitigation through instance weighting', in Proceedings of the 18th European Conference on Machine Learning, pp. 708-715, (2007).
[19] M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel, 'Automatic classifier selection for non-experts', Pattern Analysis & Applications, 17(1), 83-96, (2014).
[20] M. Reif, 'A comprehensive dataset for evaluating approaches of various meta-learning tasks', in Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pp. 273-276, (2012).
[21] M. R. Smith and T. Martinez, 'Improving classification accuracy by identifying and removing instances that should be misclassified', in Proceedings of the IEEE International Joint Conference on Neural Networks, pp. 2690-2697, (2011).
[22] M. R. Smith and T. Martinez, 'A comparative evaluation of curriculum learning with filtering and boosting in supervised classification problems', Computational Intelligence, accepted, (2014).
[23] M. R. Smith, T. Martinez, and C. Giraud-Carrier, 'An instance level analysis of data complexity', Machine Learning, 95(2), 225-256, (2014).
[24] J. Vanschoren, H. Blockeel, B. Pfahringer, and G. Holmes, 'Experiment databases: a new way to share, organize and learn from experiments', Machine Learning, 87(2), 127-158, (2012).

[Figure 1 (diagram omitted). Hierarchical representation of how the results from machine learning experiments are stored in the NoSQL database for the MLRR. Each data set has a collection containing the predictions for each instance from a learning algorithm as well as its meta-features. A separate collection stores all of the information for the learning algorithms and which hyperparameters were used. Another collection stores the information for which instances were used for training.]