The experiment database for machine learning (Demo)

Joaquin Vanschoren (LIACS, Leiden University, The Netherlands; email: joaquin@liacs.nl)

Abstract. We demonstrate the use of the experiment database for machine learning, a community-based platform for the sharing, reuse, and in-depth investigation of the thousands of machine learning experiments executed every day. It is aimed at researchers and practitioners of data mining techniques, and is publicly available at http://expdb.cs.kuleuven.be. This demo gives a hands-on overview of how to share novel experimental results, how to integrate the database in existing data mining toolboxes, and how to query the database through an intuitive graphical query interface.

1 Introduction

Experimentation is the lifeblood of machine learning (ML) research. A considerable amount of effort and resources are invested in assessing the usefulness of new algorithms, finding the optimal approach for new applications or just to gain some insight into, for instance, the effect of a parameter. Yet in spite of all these efforts, experimental results are often discarded or forgotten shortly after they are obtained, or at best averaged out to be published, which again limits their future use. If we could collect all these ML experiments in a central resource and make them publicly available in an organized (searchable) fashion, the combined results would provide a highly detailed picture of the performance of algorithms on a wide range of data configurations, speeding up ML research.

In this paper, we demonstrate a community-based platform designed to do just this: the experiment database for machine learning. First, experiments are automatically transcribed in a common language that captures the exact experiment setup and all details needed to reproduce them. Then, they are uploaded to pre-designed databases where they are stored in an organized fashion: the results of every experiment are linked to the exact underlying components (such as the algorithm, parameter settings and dataset used) and thus also integrated with all prior results. Finally, to answer any question about algorithm behavior, we only have to write a query to the database to sift through millions of experiments and retrieve all results of interest. As we shall demonstrate, many kinds of questions can be answered in one or perhaps a few queries, thus enabling fast and thorough analysis of large numbers of collected results. The results can also be interpreted unambiguously, as all conditions under which they are valid are explicitly stored.

1.1 Meta-learning

Instead of being purely empirical, these experiment databases also store known or measurable properties of datasets and algorithms. For datasets, this can include the number of features, statistical and information-theoretic properties [7] and landmarkers [10], while algorithms can be tagged by model properties, the average ratio of bias or variance error, or their sensitivity to noise [3].

As such, all empirical results, past and present, are immediately linked to all known theoretical properties of algorithms and datasets, providing new grounds for deeper analysis. For instance, algorithm designers can include these properties in queries to gain precise insights on how their algorithms are affected by certain kinds of data or how they relate to other algorithms.

1.2 Overview of benefits

We can summarize the benefits of this platform as follows:

Reproducibility: The database stores all details of the experimental setup, resulting in truly reproducible research.

Reference: All experiments, including algorithms and datasets, are automatically organized in one resource, creating an overview of the state-of-the-art, and a useful 'map' of all known approaches, their properties, and their performance. This also includes negative results, which usually do not get published.

Querying: When faced with a question on the performance of learning algorithms, e.g., 'What is the effect of the training set size on runtime?', we can answer it in seconds by writing a query, instead of spending days (or weeks) setting up new experiments. Moreover, we can draw upon many more experiments, on many more algorithms and datasets, than we can afford to run ourselves.

Reuse: It saves time and energy, as previous experiments can be readily reused. For instance, when benchmarking a new algorithm, there is no need to benchmark the older algorithms over and over again as well: their evaluations are likely stored online, and can simply be downloaded.

Larger studies: Studies covering many algorithms, parameter settings and datasets are very expensive to run, but could become much more feasible if a large portion of the necessary experiments are available online. Even when all the experiments have yet to be run, the automatic storage and organization of experimental results markedly simplifies conducting such large-scale experimentation and thorough analysis thereof.

Visibility: By using the database, users may learn about (new) algorithms they were not previously aware of.

Standardization: The formal description of experiments may catalyze the standardization of experiment design, execution and exchange across labs and data mining tools.

The remainder of this paper is organized as follows. Sect. 2 outlines how we constructed our pilot experiment database and the underlying models and languages that enable the free exchange of experiments. In Sect. 3, we demonstrate how it can be used to quickly discover new insights into a wide range of research questions and to verify prior studies. Sect. 4 concludes.
2 Framework description

In this section, we outline the design of this collaborative framework, shown in Fig. 1. We first establish a controlled vocabulary for data mining experimentation in the form of an open ontology (Exposé), before mapping it to an experiment description language (called ExpML) and an experiment database (ExpDB). These three elements (boxed in Fig. 1) will be discussed in the next three subsections. Full versions of the ontologies, languages and database models discussed below will be available on http://expdb.cs.kuleuven.be.

Figure 1. Components of the experiment database framework.

Experiments are shared (see Fig. 1) by entering all experiment setup details and results through the framework's interface (API), which exports them as ExpML files or directly streams them to an ExpDB. Any data mining platform or custom algorithm can thus use this API to add a 'sharing' feature that publishes new experiments. The ExpDB can be set up locally, e.g., for a single person or a single lab, or globally, a central database open to submissions from all over the world. Finally, the bottom of the figure shows different ways to tap into the stored information:

Querying. Querying interfaces allow researchers to formulate questions about the stored experiments, and immediately get all results of interest. We currently offer various such interfaces, including graphical ones (see Sect. 2.3.2).

Mining. A second use is to automatically look for patterns in algorithm performance by mining the stored evaluation results and theoretical meta-data. These meta-models can then be used, for instance, in algorithm recommendation [1].
2.1 The Exposé Ontology

The Exposé ontology describes the concepts and the structure of data mining experiments. It establishes an unambiguous and machine-interpretable (semantic) vocabulary, through which experiments can be automatically shared, organized and queried. We will also use it to define a common experiment description language and database models, as we shall illustrate below. Ontologies can be easily extended and refined, which is a key concern since data mining and machine learning are ever-expanding fields.

2.1.1 Collaborative Ontology Design

Several other useful ontologies are being developed in parallel: OntoDM [8] is a top-level ontology for data mining concepts, EXPO [11] models scientific experiments, DMOP [4] describes learning algorithms (including their internal mechanisms and models) and workflows, and the KD ontology [13] and eProPlan ontology [5] describe large arrays of DM operators, including information about their use to support automatic workflow planning.

To streamline ontology development, a 'core' ontology was defined, and an open ontology development forum was created: the Data Mining Ontology (DMO) Foundry (http://dmo-foundry.org). The goal is to make the ontologies interoperable and orthogonal, each focusing on a particular aspect of the data mining field. Moreover, following best practices in ontology engineering, we reuse concepts and relationships from established top-level scientific ontologies: BFO, the Basic Formal Ontology (http://www.ifomis.org/bfo); OBI, the Ontology for Biomedical Investigations (http://obi-ontology.org); IAO, the Information Artifact Ontology (http://bioportal.bioontology.org/ontologies/40642); and RO, the Relation Ontology (http://www.obofoundry.org/ro). We often use subproperties, e.g. implements for concretizes, and runs for realizes, to reflect common usage in the field. Exposé is designed to integrate or be similar to the above-mentioned ontologies, but focusses on aspects related to experimental evaluation.
2.1.2 Top-level View

Fig. 2 shows Exposé's high-level concepts and relationships. The full arrows symbolize is-a relationships, meaning that the first concept is a subclass of the second, and the dashed arrows symbolize other common relationships. The most top-level concepts are reused from the aforementioned top-level scientific ontologies, and help to describe the exact semantics of many data mining concepts. For instance, when speaking of a 'data mining algorithm', we can semantically distinguish an abstract algorithm (e.g., C4.5 in pseudo-code), a concrete algorithm implementation (e.g., WEKA's J48 implementation of C4.5), and a specific algorithm setup, including parameter settings and subcomponent setups. The latter may include other algorithm setups, e.g. for base-learners in ensemble algorithms, as well as mathematical functions such as kernels, distance functions and evaluation measures. A function setup details the implementation and parameter settings used to evaluate the function.

Figure 2. An overview of the top-level concepts in the Exposé ontology.

An algorithm setup thus defines a deterministic function which can be directly linked to a specific result: it can be run on a machine given specific input data (e.g., a dataset), and produce specific output data (e.g., new datasets, models or evaluations). As such, we can trace any output result back to the inputs and processes that generated it (data provenance). For instance, we can query for evaluation results, and link them to the specific algorithm, implementation or individual parameter settings used, as well as the exact input data.

Algorithm setups can be combined in workflows, which additionally describe how data is passed between multiple algorithms. Workflows are hierarchical: they can contain sub-workflows, and algorithm setups themselves can contain internal workflows (e.g., a cross-validation setup may define a workflow to train and evaluate learning algorithms). The level of detail is chosen by the author of an experiment: a simple experiment may require a single algorithm setup, while others involve complex scientific workflows.

Tasks cover different data mining (sub)tasks, e.g., supervised classification. Qualities are known or measurable properties of algorithms and datasets (see Sect. 1.1), which are useful to interpret results afterwards. Finally, algorithms, functions or parameters can play certain roles in a complex setup: an algorithm can sometimes act as a base-learner in an ensemble algorithm, and a dataset can act as a training set in one experiment and as a test set in the next.

2.1.3 Experiments

An experiment tries to answer a question (in exploratory settings) or test a hypothesis by assigning certain values to its input variables. It has experimental variables: independent variables with a range of possible values, controlled variables with a single value, or dependent variables, i.e., a monitored output. The experiment design (e.g., full factorial) defines which combinations of input values are used. One experiment run may generate several workflow runs (with different input values), and a workflow run may consist of smaller algorithm runs.
Runs are triples consisting of input data, a setup and output data. Any sub-runs, such as the 10 algorithm runs within a 10-fold CV run, could also be stored with the exact input data (folds) and output data (predictions). Again, the level of detail is chosen by the experimenter. Especially for complex workflows, it might be interesting to afterwards query the results of certain sub-runs.

2.2 ExpML: A Common Language

Returning to our framework in Fig. 1, we now use this ontology to define a common language to describe experiments. The most straightforward way to do this would be to describe experiments in Exposé, export them in RDF (the Resource Description Framework, http://www.w3.org/RDF) and store everything in RDF databases (triple-stores). However, such databases are still under active development, and many researchers are more familiar with XML and relational databases, which are also widely supported by many current data mining tools. Therefore, we will also map the ontology to a simple XML-based language, ExpML, and a relational database schema. Technical details of this mapping are outside the scope of this paper. Below, we show a small example of ExpML output to illustrate our modeling of data mining workflows.

2.2.1 Workflow Runs

Fig. 3 shows a workflow run in ExpML, executed in WEKA [2] and exported through the aforementioned API, and a schematic representation is shown in Fig. 4. The workflow has two inputs: a dataset URL and parameter settings. It also contains two algorithm setups: the first loads a dataset from the given URL, and then passes it to a cross-validation setup (10 folds, random seed 1). The latter evaluates a Support Vector Machine (SVM) implementation, using the given parameter settings, and outputs evaluations and predictions.

Figure 3. A workflow run in ExpML.

Figure 4. A schematic representation of the run.

Note that the workflow is completely concretized: all parameter settings and implementations are fixed. The bottom of Fig. 3 shows the workflow run and its two algorithm sub-runs, each pointing to the setup used. Here, we chose not to output the 10 per-fold SVM runs. The final output consists of Evaluations and Predictions. As shown in the ExpML code, these have a predefined structure so that they can be automatically interpreted and organized. Evaluations contain, for each evaluation function (as defined in Exposé), the evaluation value and standard deviation. They can also be labeled, as for the per-class precision results. Predictions can be probabilistic, with a probability for each class, and a final prediction for each instance. For storing models, we can use existing formats such as PMML.
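Had the per-fold sub-runs been stored, they could afterwards be retrieved directly from the database. The following is a minimal sketch of such a query, written against the simplified relational model introduced later in Sect. 2.3.1 (Fig. 5); the table and column names, as well as the run id, are illustrative and may differ from those in the live database:

    -- Evaluations produced by the direct sub-runs (e.g., per-fold runs)
    -- of one parent run; names follow the simplified schema of Fig. 5.
    SELECT child.rid   AS subrun,
           e.function  AS evaluation_function,
           e.label     AS label,
           e.value     AS value
    FROM   Run parent
    JOIN   Run child      ON child.parent = parent.rid
    JOIN   OutputData od  ON od.run       = child.rid
    JOIN   Evaluation e   ON e.did        = od.data
    WHERE  parent.rid = 3;  -- hypothetical id of the cross-validation run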

2.3 Organizing Machine Learning Information

The final step in our framework (see Fig. 1) is organizing all this information in searchable databases such that it can be retrieved, rearranged, and reused in further studies. This is done by collecting ExpML descriptions and storing all details in a predefined database. To design such a database, we mapped Exposé to a relational database model. In this section, we offer a brief overview of the model to help interpret the queries in the remainder of this paper.

2.3.1 Anatomy of an Experiment Database

Fig. 5 shows the most important tables, columns and links of the database model. Runs are linked to their input and output data through the join tables InputData and OutputData, and data always has a source run, i.e., the run that generated it. Runs can have parent runs, and a specific Setup: either a Workflow or AlgorithmSetup, which can also be hierarchical. AlgorithmSetups and FunctionSetups can have ParameterSettings, a specific Implementation and a general Algorithm or Function. Implementations and Datasets can also have Qualities, stored in AlgorithmQuality and DataQuality, respectively. Data, runs and setups have unique ids, while algorithms, functions, parameters and qualities have unique names defined in Exposé.

Figure 5. The general structure of the experiment database. Underlined columns indicate primary keys, the arrows denote foreign keys. Tables in italics are abstract: their fields only exist in child tables.
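As a concrete illustration of how these tables connect, the query below lists every evaluation obtained on a given dataset together with the algorithm, implementation and parameter settings that produced it. It is a sketch only: it uses the simplified model of Fig. 5, ignores the nesting of learner setups inside cross-validation setups (Sect. 2.2.1), and the evaluation-function name is illustrative:

    -- All evaluation results on dataset 'letter', traced back to the
    -- algorithm, implementation and parameter settings that produced them.
    SELECT a.name      AS algorithm,
           i.fullName  AS implementation,
           ps.parameter AS parameter,
           ps.value    AS setting,
           e.function  AS evaluation_function,
           e.value     AS evaluation_value
    FROM   Dataset d
    JOIN   InputData ind      ON ind.data   = d.did
    JOIN   Run r              ON r.rid      = ind.run
    JOIN   OutputData od      ON od.run     = r.rid
    JOIN   Evaluation e       ON e.did      = od.data
    JOIN   AlgorithmSetup s   ON s.sid      = r.setup
    JOIN   Implementation i   ON i.fullName = s.implementation
    JOIN   Algorithm a        ON a.name     = i.algorithm
    LEFT JOIN ParameterSetting ps ON ps.setup = s.sid
    WHERE  d.name = 'letter';

Adding further constraints, or extra joins on DataQuality and AlgorithmQuality, then yields the more specific queries discussed in Sect. 3.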
2.3.2 Accessing the Experiment Database

The experiment database is available at http://expdb.cs.kuleuven.be. A graphical query interface is provided (see the examples below) that hides the complexity of the database, but still supports most types of queries. In addition, it is possible to run standard SQL queries (a library of example queries is available). Several video tutorials help the user to get started quickly. We are currently updating the database, query interface and submission system, and a public submission interface for new experiments (described in ExpML) will be available shortly.

3 Example Queries

In this section, we illustrate the use of the experiment database; see [12] for a much more extensive list of possible queries. In doing this, we aim to take advantage of the theoretical information stored with the experiments to gain deeper insights.

3.1 Comparing Algorithms

To compare the performance of all algorithms on one specific dataset, we can plot the outcomes of cross-validation (CV) runs against the algorithm names. In the graphical query interface, see Fig. 6, this can be done by starting with the CrossValidation node, which will be connected to the input Dataset, the outputted Evaluations and the underlying Learner (algorithm setup). Green nodes represent data, blue nodes are setups and white nodes are qualities (runs are hidden). By clicking a node it can be expanded to include other parts of the workflow setup (see below). For instance, 'Learner' expands into the underlying implementation, parameter settings, base-learners and sub-functions (e.g. kernels). By clicking a node one can also add a selection (in green, e.g. the used learning algorithm) or a constraint (in red, e.g. a preferred evaluation function). The user is always given a list of all available options, in this case a list of all evaluation functions present in the database. Here, we choose a specific input dataset and a specific evaluation function, and we aim to plot the evaluation value against the used algorithm.

Figure 6. The graphical query interface.

Running the query returns all known experiment results, which are scatterplotted in Fig. 7, ordered by performance. This immediately provides a complete overview of how each algorithm performed. Because the results are as general as allowed by the constraints written in the query, the results on sub-optimal parameter settings are shown as well (at least for those algorithms whose parameters were varied), clearly indicating the performance variance they create. As expected, ensemble and kernel methods are dependent on the selection of the correct kernel, base-learner, and other parameter settings. Each of them can be explored by adding further constraints.

Figure 7. Performance of all algorithms on dataset 'letter'.

3.2 Investigating Parameter Effects

For instance, we can examine the effect of the used kernel, or even the parameters of a given kernel. Building on our first query, we zoom in on these results by adding two constraints: the algorithm should be an SVM (alternatively, we could ask for a specific implementation, i.e., 'implementation=weka.SMO') and contain an RBF kernel. Next, we select the value of the 'gamma' parameter (kernel width) of that kernel. We also relax the constraint on the dataset by including three more datasets, and ask for the number of features in each dataset.

Figure 8. Querying the performance of SVMs with different kernel widths on datasets of different dimensionalities.

The result is shown in Fig. 10. First, note that much of the variation seen for SVMs on the 'letter' dataset (see Fig. 7) is indeed explained by the effect of this parameter. We also see that its effect on other datasets is markedly different: on some datasets, performance increases until reaching an optimum and then slowly declines, while on other datasets, performance decreases slowly up to a point, after which it quickly drops to default accuracy, i.e., the SVM is simply predicting the majority class. This behavior seems to correlate with the number of features in each dataset (shown in brackets). Further study shows that some SVM implementations indeed tend to overfit on datasets with many attributes [12].

Figure 10. The effect of parameter gamma of the RBF kernel in SVMs on a number of different datasets (number of attributes shown in brackets).
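In SQL, this refinement amounts to a few extra joins on the kernel's function setup and its parameter setting. The sketch below is illustrative only: it again uses the simplified schema of Fig. 5, ignores the nesting of setups inside the cross-validation setup, and the algorithm, kernel, parameter and extra dataset names are placeholders for the names actually defined in Exposé:

    -- Accuracy of RBF-kernel SVMs as a function of the kernel width (gamma).
    SELECT d.name   AS dataset,
           ps.value AS gamma,
           e.value  AS predictive_accuracy
    FROM   Dataset d
    JOIN   InputData ind       ON ind.data  = d.did
    JOIN   Run r               ON r.rid     = ind.run
    JOIN   OutputData od       ON od.run    = r.rid
    JOIN   Evaluation e        ON e.did     = od.data
    JOIN   AlgorithmSetup ls   ON ls.sid    = r.setup    -- the SVM learner setup
    JOIN   FunctionSetup ks    ON ks.parent = ls.sid     -- its kernel setup
    JOIN   ParameterSetting ps ON ps.setup  = ks.sid
    WHERE  ls.algorithm = 'SupportVectorMachine'         -- placeholder name
      AND  ks.function  = 'RBFKernel'                    -- placeholder name
      AND  ps.parameter = 'gamma'
      AND  e.function   = 'predictive_accuracy'
      AND  d.name IN ('letter' /* , ...the three additional datasets */)
    ORDER  BY d.name, ps.value;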
3.3 Preprocessing Effects

The database also stores workflows with preprocessing methods, and thus we can investigate their effect on the performance of learning algorithms. For instance, when querying for workflows that include a downsampling method, we can draw learning curves by plotting learning performance against sample size. Fig. 9 shows the query: a preprocessing step is added and we query for the resulting number of instances, and the performance of a range of learning algorithms (with default parameter settings).

Figure 9. Building a learning curve.

The result is shown in Fig. 11. From these results, it is clear that the ranking of algorithm performances depends on the size of the sample: the curves cross. While logistic regression is initially stronger than C4.5, the latter keeps improving when given more data, confirming earlier analysis [9]. Note that RandomForest performs consistently better for all sample sizes, that RacedIncrementalLogitBoost crosses two other curves, and that HyperPipes actually performs worse when given more data, which suggests that its initially higher score was largely due to chance.

Figure 11. Learning curves on the Letter-dataset.

3.4 Bias-Variance Profiles

The database also stores a series of algorithm properties, many of them calculated based on large numbers of experiments. One interesting algorithm property is its bias-variance profile. Because the database contains a large number of bias-variance decomposition experiments, we can give a realistic numerical assessment of how capable each algorithm is of reducing bias and variance error. Fig. 13 shows, for each algorithm, the proportion of the total error that can be attributed to bias error, calculated according to [6], using default parameter settings and averaged over all datasets. The simple query is shown in Fig. 12.

Figure 12. Query for the bias-variance profile of algorithms.

The algorithms are ordered from large bias (low variance) to low bias (high variance). NaiveBayes is, as expected, one of the algorithms whose error consists primarily of bias error, whereas RandomTree has relatively good bias management, but generates more variance error than NaiveBayes. When looking at the ensemble methods, Fig. 13 shows that bagging is a variance-reduction method, as it causes REPTree to shift significantly to the left. Conversely, boosting reduces bias, shifting DecisionStump to the right in AdaBoost and LogitBoost (additive logistic regression).

Figure 13. The average percentage of bias-related error for each algorithm, averaged over all datasets.
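As a reminder of what is being measured (our own summary of the zero-one-loss decomposition of Kohavi and Wolpert [6], not taken from the database; see [6] for the exact estimation procedure), the error contributed by a test point x is split into a noise, a bias and a variance term:

\[
\mathrm{bias}^2_x = \tfrac{1}{2}\sum_{y}\bigl[P(Y_F{=}y\mid x) - P(Y_H{=}y\mid x)\bigr]^2,
\qquad
\mathrm{variance}_x = \tfrac{1}{2}\Bigl(1 - \sum_{y}P(Y_H{=}y\mid x)^2\Bigr),
\]
\[
E[\mathrm{error}] = \sum_{x}P(x)\,\bigl(\sigma^2_x + \mathrm{bias}^2_x + \mathrm{variance}_x\bigr),
\qquad
\sigma^2_x = \tfrac{1}{2}\Bigl(1 - \sum_{y}P(Y_F{=}y\mid x)^2\Bigr),
\]

where Y_F is the true class at x, Y_H is the class predicted by a model trained on a randomly drawn training set, and sigma^2_x is the irreducible noise. Fig. 13 then plots, per algorithm, the share of this estimated total error taken up by the bias term.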
3.5 Further queries

These are just a few examples of the queries that can be answered using the database. Other queries allow algorithm comparisons using multiple evaluation measures, algorithm rankings, statistical significance tests, analysis of ensemble learners, and especially the inclusion of many more dataset properties and algorithm properties to study how algorithms are affected by certain types of data. Please see [12] and the database website for more examples.

4 Conclusions

Experiment databases are databases specifically designed to collect all the details on large numbers of experiments, performed and shared by many different researchers, and make them immediately available to everyone. They ensure that experiments are repeatable and automatically organize them such that they can be easily reused in future studies.

This demo paper gives an overview of the design of the framework, the underlying ontologies, and the resulting data exchange formats and database structures. It discusses how these can be used to share novel experimental results, to integrate the database in existing data mining toolboxes, and how to query the database through an intuitive graphical query interface. By design, the database also calculates and stores a wide range of known or measurable properties of datasets and algorithms. As such, all empirical results, past and present, are immediately linked to all known theoretical properties of algorithms and datasets, providing new grounds for deeper analysis. This results in a great resource for meta-learning and its applications.

Acknowledgements

We acknowledge the support of BigGrid, the Dutch e-Science Grid, supported by the Netherlands Organisation for Scientific Research, NWO. We would like to thank Larisa Soldatova and Pance Panov for many fruitful discussions on ontology design.

REFERENCES

[1] P Brazdil, C Giraud-Carrier, C Soares, and R Vilalta, 'Metalearning: Applications to data mining', Springer, (2009).
[2] MA Hall, E Frank, G Holmes, B Pfahringer, P Reutemann, and IH Witten, 'The WEKA data mining software: An update', SIGKDD Explorations, 11(1), 10–18, (2009).
[3] M Hilario and A Kalousis, 'Building algorithm profiles for prior model selection in knowledge discovery systems', Engineering Intelligent Systems, 8(2), 956–961, (2000).
[4] M Hilario, A Kalousis, P Nguyen, and A Woznica, 'A data mining ontology for algorithm selection and meta-mining', Proceedings of the ECML-PKDD'09 Workshop on Service-oriented Knowledge Discovery, 76–87, (2009).
[5] J Kietz, F Serban, A Bernstein, and S Fischer, 'Towards cooperative planning of data mining workflows', Proceedings of the ECML-PKDD'09 Workshop on Service-oriented Knowledge Discovery, 1–12, (2009).
[6] R Kohavi and D Wolpert, 'Bias plus variance decomposition for zero-one loss functions', Proceedings of the International Conference on Machine Learning (ICML), 275–283, (1996).
[7] D Michie, D Spiegelhalter, and C Taylor, 'Machine learning, neural and statistical classification', Ellis Horwood, (1994).
[8] P Panov, LN Soldatova, and S Džeroski, 'Towards an ontology of data mining investigations', Lecture Notes in Artificial Intelligence, 5808, 257–271, (2009).
[9] C Perlich, F Provost, and J Simonoff, 'Tree induction vs. logistic regression: A learning-curve analysis', Journal of Machine Learning Research, 4, 211–255, (2003).
[10] B Pfahringer, H Bensusan, and C Giraud-Carrier, 'Meta-learning by landmarking various learning algorithms', Proceedings of the International Conference on Machine Learning (ICML), 743–750, (2000).
[11] LN Soldatova and RD King, 'An ontology of scientific experiments', Journal of the Royal Society Interface, 3(11), 795–803, (2006).
[12] J Vanschoren, H Blockeel, B Pfahringer, and G Holmes, 'Experiment databases: A new way to share, organize and learn from experiments', Machine Learning, 87(2), (2012).
[13] M Zakova, P Kremen, F Zelezný, and N Lavrač, 'Planning to learn with a knowledge discovery ontology', Proceedings of the ICML/UAI/COLT'08 Workshop on Planning to Learn, 29–34, (2008).