<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interoperable Machine Learning Metadata using MEX</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Esteves</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Moussallem</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ciro Baron Neto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Claudia Cavalcanti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Cesar Duarte</string-name>
          <email>duarteg@ime.eb.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Military Institute of Engineering, Department of Computer Engineering</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Leipzig</institution>
          ,
          <addr-line>AKSW</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>One key step in machine learning scenarios is ensuring the reproducibility of an experiment as well as the interchange of machine learning metadata. A well-known problem across different machine learning architectures is the lack of interchangeability of both the measures generated by executions of an algorithm and the general provenance information for the experiment configuration. This demand tends to bring forth the cumbersome task of redefining schemas in order to facilitate the exchange of information between different system implementations. This situation is due to the absence of a standard specification. In this paper, we address this gap by presenting an API built upon a flexible and lightweight vocabulary dubbed MEX. We benefit from linked data technologies to provide a public format in order to achieve a higher level of interoperability over different architectures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A wide variety of publications on Machine Learning (ML) topics has appeared so far,
many of them contributing to the state of the art in their respective fields. However,
recent years have revealed a gap in the standardization of experiment results, i.e., in
how produced performance measures are mapped and stored. This technological gap
can be summed up by the following question: “How can interoperability among
machine learning experiments be achieved over different system architectures?” In other words,
experimental results are not delivered in a common machine-readable way, which makes
information extraction and processing tricky and burdensome [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Generally, the
lack of consensus on a lightweight and flexible format for achieving
interoperability of machine learning experiments over any system implementation leads to the
development of ad hoc schemas based on established machine-readable
formats (e.g., Extensible Markup Language (XML), comma-separated values (CSV)),
which do not allow high levels of interoperability. In this paper, we introduce
an application programming interface (API) based on the MEX Vocabulary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to tackle
this gap, allowing the generation of common outputs that can be reused or processed
by other systems regardless of software implementation and platform (Figure 1). To the
best of our knowledge, this is the first report in the literature of an API for exporting
metadata of machine learning iterations based on an interchange format.
      </p>
    </sec>
    <sec id="sec-2">
      <title>MEX Format: a lightweight interchange format</title>
      <p>
        The MEX vocabulary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been designed to tackle the problem of sharing
provenance information, particularly on machine learning iterations (for classification,
regression and clustering problems), in a lightweight and flexible format built upon
the W3C PROV Ontology (PROV-O) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In other words, the format aims to allow the interchange of
the variables of each run of a machine learning algorithm among different system
implementations. The MEX vocabulary is composed of three layers: mexcore
(formalizes the key entities for representing the iterations of machine learning
executions, where each iteration has parameters as input and measures as output); mexalgo
(represents the context of machine learning algorithms); and mexperf (provides the
basic entities for representing the associated measures). Variables concerning the ML
pipeline, which often involves a sequence of data pre-processing, model fitting, feature
extraction analysis, and validation stages, are out of scope for this work. They can be
managed properly by an existing scientific workflow system ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]),
although such systems present a low level of interoperability across different system architectures.
MEX focuses on a lightweight format for the basic elements of each iteration of
a machine learning algorithm in order to achieve a higher level of interoperability: the
performed execution itself and its parameters, as well as the produced measures.
      </p>
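      <p>As an illustration only, the core idea of an iteration with parameters as input and measures as output can be sketched in Python as follows. The sketch is not the MEX API and does not use the actual MEX terms: the ex: namespace, the property names (ex:algorithm, ex:parameter, ex:measure, ex:name, ex:value) and the modelling of an execution as a prov:Activity are assumptions made for this example. Only the PROV-O namespace is taken from the W3C specification.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field

# Assumption: placeholder namespace for this sketch, not a real MEX namespace.
EX = "http://example.org/mex-sketch#"
# W3C PROV-O namespace.
PROV = "http://www.w3.org/ns/prov#"


@dataclass
class Execution:
    """One run of an ML algorithm: parameters as input, measures as output."""
    identifier: str
    algorithm: str
    parameters: dict = field(default_factory=dict)
    measures: dict = field(default_factory=dict)


def to_turtle(execution):
    """Serialize one execution as Turtle text.

    Assumes identifiers, names and values contain no characters that
    need escaping in Turtle, and that measure values are finite numbers.
    """
    # Hypothetical property names; an execution is modelled as a prov:Activity.
    predicates = ["a prov:Activity", f'ex:algorithm "{execution.algorithm}"']
    for name, value in execution.parameters.items():
        predicates.append(f'ex:parameter [ ex:name "{name}" ; ex:value "{value}" ]')
    for name, value in execution.measures.items():
        predicates.append(f'ex:measure [ ex:name "{name}" ; ex:value {float(value)} ]')
    body = " ;\n    ".join(predicates)
    return (
        f"@prefix ex: <{EX}> .\n"
        f"@prefix prov: <{PROV}> .\n\n"
        f"ex:{execution.identifier}\n    {body} .\n"
    )


if __name__ == "__main__":
    run = Execution("exec1", "SVM", {"C": "1.0"}, {"accuracy": 0.96})
    print(to_turtle(run))
```
]]></preformat>
      <p>Because the output of such a sketch is plain RDF in Turtle syntax, an RDF-aware consumer could read it regardless of the system that produced it, which is the kind of interoperability the format targets.</p>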
    </sec>
    <sec id="sec-3">
      <title>Demonstration</title>
      <p>
        In this demo paper, we show the usage of MEX in two different programming environments:
Java<sup>3</sup> and NodeJS<sup>4</sup>. We argue that a higher level of interoperability can be achieved
by exporting the variables using MEX as a format. Figure 2 depicts an overview of
the system architecture, where the three layers provide the full MEX schema, whereas
the Jena API<sup>5</sup> (the RDF library in the Java scenario) handles the RDF
serialization. We present the development of the Java and NodeJS APIs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and similar
use cases as examples, showing the advantage of defining MEX as a format for
machine learning iterations.
      </p>
      <p>
        The MEX files for these examples are available in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], together with the use case
implementation for Java and Weka. Moreover, to assist with the tedious task of generating
LaTeX tables from machine learning performance outputs (a manual task
commonly carried out by the user), we implemented functions that automate this task based
on MEX files [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Finally, we also provide a GUI for non-expert users to generate a basic MEX file (Figure
3) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
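      <p>The idea of generating LaTeX tables from exported measures can be sketched as follows. This is a minimal Python illustration and not the function offered on the MEX website: the function name latex_table, the input layout (pairs of run identifier and measure dictionary) and the two-decimal formatting are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def latex_table(runs, measure_names):
    """Render one row per run and one column per measure as a LaTeX tabular.

    runs: list of (run_id, {measure_name: numeric_value}) pairs (assumed layout).
    Assumes run identifiers and measure names need no LaTeX escaping.
    """
    names = list(measure_names)
    header = " & ".join(["Run"] + names) + r" \\"
    rows = []
    for run_id, measures in runs:
        # Missing measures are shown as "--" instead of raising an error.
        cells = [run_id] + [
            f"{measures[m]:.2f}" if m in measures else "--" for m in names
        ]
        rows.append(" & ".join(cells) + r" \\")
    column_spec = "l" + "r" * len(names)
    return "\n".join([
        r"\begin{tabular}{" + column_spec + "}",
        r"\hline",
        header,
        r"\hline",
        *rows,
        r"\hline",
        r"\end{tabular}",
    ])


if __name__ == "__main__":
    example_runs = [
        ("exec1", {"accuracy": 0.961, "recall": 0.9}),
        ("exec2", {"accuracy": 0.874}),
    ]
    print(latex_table(example_runs, ["accuracy", "recall"]))
```
]]></preformat>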
      <sec id="sec-3-1">
        <p>3 https://java.com/pt_BR/download/</p>
        <p>4 https://nodejs.org/</p>
        <p>5 https://jena.apache.org/</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>
        We defined a novel interface for representing the variables associated with
machine learning model executions and developed Java and NodeJS APIs based on
it, allowing the export of a flexible and lightweight format for data interchange.
As future work, we plan the integration with more established platforms (e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and
support for further tools and programming languages, such as Weka<sup>6</sup> and C++<sup>7</sup>. A repository
for MEX files linked with nanopublications<sup>8</sup> and the examination of more machine
learning representations are also planned. Finally, we argue that experiment databases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
can benefit from the defined MEX interchange format and its APIs.
      </p>
      <p><bold>Acknowledgments.</bold> This research has been partially supported by grants from the
CAPES foundation, Ministry of Education of Brazil, Brasilia - DF 70040-020, Brazil
(Bolsista da CAPES - Proc. n: BEX 10179/13-5) and the H2020 ALIGNED Project
(GA No. 644055).</p>
      <sec id="sec-4-1">
        <p>6 http://www.cs.waikato.ac.nz/ml/weka/</p>
        <p>7 http://en.cppreference.com/w/</p>
        <p>8 http://nanopub.org/wordpress/</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          et al.
          <article-title>Experiment databases</article-title>
          .
          <source>Machine Learning</source>
          , pages
          <fpage>127</fpage>
          -
          <lpage>158</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Diego Esteves et al.
          <article-title>MEX Vocabulary: A lightweight interchange format for machine learning experiments</article-title>
          .
          <source>In SEMANTiCS 2015</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>W3C PROV-O ontology</article-title>
          . http://www.w3.org/TR/prov-o/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Yolanda</given-names>
            <surname>Gil</surname>
          </string-name>
          et al.
          <article-title>Wings: Intelligent workflow-based design of computational experiments</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          , pages
          <fpage>62</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Joaquin</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          et al.
          <article-title>Openml: Networked science in machine learning</article-title>
          .
          <source>SIGKDD Explor. Newsl.</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. MEX website. http://mex.aksw.org.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>