<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Use of Open Data to Improve the Repeatability of Adaptivity and Personalisation Experiment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Harshvardhan Pandit</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roghaiyeh Gachpaz Hamed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shay Lawless</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Lewis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre, Trinity College Dublin</institution>
          ,
          <addr-line>Ireland seamus.lawless, david.lewis</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Reproducibility of results is a key element for the verification of scientific experiments and an important indicator of the quality of a published experiment. It is vital therefore to precisely and transparently share both the method and the data associated with an experiment. Data associated with an experiment is often linked within peer-reviewed scientific publications, and is difficult to assess in a consistent manner. In this paper we explore how emerging linked data standards can be applied to the description and data of published adaptivity and personalisation experiments in a manner that can be linked from publications and easily located, accessed and reused to repeat an experiment. The approach also provides possibilities for published experiments to be extended or modified to provide a firmer grounding for publishing new results and conclusions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Compared to other sciences, a distinguishing feature of
experiments in adaptive software is the technological aspect of
implementation, i.e. the software implementing different parts of
the experiment. The publication of source code, especially in open
version control systems such as GitHub, makes a specific version of
a software component as easy to reference from a publication as the
data sets involved. Repeatability of experiments therefore relies on
known rights to reuse both the code and the data. While usage
licenses for source code are well established, the usage rights for
experimental data can still present obstacles, especially when, as
is common in personalisation experiments, the data contains specific
differentiating data related to individual experimental subjects.</p>
      <p>As the Adaptivity and Personalisation scientific
community considers a more structured approach to comparative
experimentation, we examine how this can leverage the state
of the art in both Linked Open Data and open science best
practices to establish a cutting-edge infrastructure for
repeatability in experimentation. One influence is the
Natural Language Processing (NLP) community, which has a
well-established practice of shared tasks and competitions.
This is especially relevant as an exemplar of scientific
community best practice because NLP and Machine Learning
(ML) components operating over structured and
unstructured data are becoming increasingly important as parts of
research in the User Modeling, Adaptation and Personalization
(UMAP) community. Also important is the
work in establishing open metadata for scientific data.</p>
    </sec>
    <sec id="sec-2">
      <title>LINKED OPEN DATA</title>
      <p>Linked Data is structured data that is interlinked using
web technologies so that it can be explored through semantic
queries. This is distinct from the use of the term in some
disciplines, where linked data describes commonality of data
between sources.</p>
      <p>
        Linked Open Data is based upon the principle [<xref ref-type="bibr" rid="ref2">2</xref>] of
interlinking resources and data with RDF and URIs that can
be read and queried by machines through powerful querying
mechanisms like SPARQL. Adaptive and personalisation
experiments can benefit from the declaration of information such
as usage rights, provenance, and authorship to create a more
declarative and open approach for sharing research
knowledge and experimental data. Ontologies such as DCAT<sup>2</sup>
help in expressing authorship of workflows and data sets,
whereas ODRL<sup>3</sup> expresses usage rights and licensing. The
provenance of an experiment can be captured and modelled
using the PROV<sup>4</sup> family of ontologies. The use of distinct
ontologies for experiment steps and data allows the
publication and discovery of experimental data with specific
metadata such as usage rights that can be collected or aggregated
to form linked repositories such as OpenAire<sup>5</sup> and LingHub<sup>6</sup>.
Existing research on these standard vocabularies provides
examples of best practice [<xref ref-type="bibr" rid="ref1">1</xref>] for publishing data set metadata as
linked open data. The machine-readable nature of the metadata
makes it easy for an automated system to verify the
correctness of the data, or to perform other operations that
help in validation, such as checking data formats, and in
alteration through the replacement of key steps, which provides
opportunities for new outcomes and results.
      </p>
      <p><sup>2</sup> http://www.w3.org/TR/vocab-dcat/</p>
      <p><sup>3</sup> http://www.w3.org/TR/odrl/</p>
      <p><sup>4</sup> http://www.w3.org/TR/prov-o/</p>
      <p><sup>5</sup> https://www.openaire.eu</p>
      <p><sup>6</sup> http://linghub.lider-project.eu</p>
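      <p>As a minimal sketch of the idea, and not the implementation used in this work, the following Python fragment holds a few RDF-style statements as plain triples and checks whether a data set declares the metadata discussed above. The property names come from the DCAT, Dublin Core Terms, ODRL and PROV vocabularies; every resource with the ex: prefix, the helper functions and the list of required properties are hypothetical. The matching function only imitates a single SPARQL triple pattern and is not a SPARQL engine.</p>
      <preformat>
```python
# Prefixed names stand for full IRIs, e.g. dcat: = http://www.w3.org/ns/dcat#.
# All ex: resources are hypothetical and exist only for this illustration.
TRIPLES = [
    ("ex:results", "rdf:type", "dcat:Dataset"),
    ("ex:results", "dct:creator", "ex:researcher-a"),
    ("ex:results", "dct:license", "ex:usage-policy"),
    ("ex:usage-policy", "rdf:type", "odrl:Policy"),
    ("ex:results", "prov:wasGeneratedBy", "ex:run-1"),
    ("ex:run-1", "rdf:type", "prov:Activity"),
]

# Assumed checklist of properties a reusable data set should declare.
REQUIRED = ["dct:creator", "dct:license", "prov:wasGeneratedBy"]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def missing_metadata(triples, dataset, required):
    """List the required properties that the given data set does not declare."""
    return [p for p in required if not match(triples, s=dataset, p=p)]


if __name__ == "__main__":
    # Which resources are described as data sets?
    print(match(TRIPLES, p="rdf:type", o="dcat:Dataset"))
    # Is any required metadata absent for ex:results?
    print(missing_metadata(TRIPLES, "ex:results", REQUIRED))
```
      </preformat>
      <p>In practice an RDF library and a SPARQL endpoint would replace the list and the matching function; the sketch only shows why declared, machine-readable metadata makes such checks straightforward to automate.</p>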
      <p>Repetition and variation form the bulk of research in
Natural Language Processing (NLP) and Machine Learning (ML)
experiments, with several community-led projects that aim
to share experiments as a set of metadata using linked open
data vocabularies. A broadly accepted workflow modelling
process provides the opportunity to combine various data
sets into a collective corpus that increases the range of
available annotated data. This can be leveraged for the
discovery of experiments and data sets based on their metadata
through aggregated repositories such as LingHub. By
linking experiments as resources with the data used or produced,
variations in experiments can be evaluated using existing
data sets for a better comparative evaluation.</p>
      <p>
        The structured sharing of an experiment through
metadata allows others to build upon the experiment by
modifying its steps or reusing its components. Providing
an experiment as a linked resource reduces duplication
in research and promotes reuse of knowledge and
repeatability. By moving towards a definitive workflow framework,
as the ML and NLP communities have done, a system of expressive
models could provide useful insights and design
considerations for personalisation and user modelling systems. Such
an approach would leverage previous research in workflow
modelling such as MEX [<xref ref-type="bibr" rid="ref3">3</xref>] and NIF. The NLP community
has developed a schema, termed META-SHARE [<xref ref-type="bibr" rid="ref5">5</xref>], for
language resource metadata that shares many characteristics
with the OpenAire metadata scheme. The META-SHARE
schema has also been mapped into RDF, with relevant
attributes mapped to specific properties from the above
standard vocabularies, and is used by LingHub as an aggregation
source.
      </p>
    </sec>
    <sec id="sec-4">
      <title>EXPLORING NEW TECHNOLOGIES IN DATA FORMATS AND ONTOLOGIES</title>
      <p>An experiment method often comprises several steps
that are connected by the data exchanged between them. By
providing information about the steps along with the data
being used, it is possible to repeat or verify certain steps
without repeating the entire experiment. The separation of
steps also makes it possible to replace a step with other
comparable approaches and to evaluate the comparative results
between them. Such a workflow lends itself to easy variation
when repeating experiments and makes it possible to compare results
across a range of similar experiments. The abstraction of
experiment workflows from individual steps also allows each
step to be implemented using different technologies. The
information or metadata about the experiment and each step
can be expressed efficiently using the P-Plan<sup>7</sup> ontology, which
extends PROV to represent execution workflows.</p>
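      <p>The separation of steps described above can be sketched in a few lines of Python. The fragment is illustrative only: the three steps and their variables are hypothetical, and the dictionary keys merely mirror the P-Plan properties p-plan:isPrecededBy, p-plan:hasInputVar and p-plan:hasOutputVar rather than using an RDF library. It derives an execution order from the precedence relations and, under the assumption that the outputs of earlier steps have been stored, lists which steps need to be repeated when one step is replaced.</p>
      <preformat>
```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan; keys mirror P-Plan property names.
PLAN = {
    "ex:collect": {
        "isPrecededBy": [],
        "hasInputVar": [],
        "hasOutputVar": ["ex:raw_logs"],
    },
    "ex:model": {
        "isPrecededBy": ["ex:collect"],
        "hasInputVar": ["ex:raw_logs"],
        "hasOutputVar": ["ex:user_model"],
    },
    "ex:evaluate": {
        "isPrecededBy": ["ex:model"],
        "hasInputVar": ["ex:user_model"],
        "hasOutputVar": ["ex:scores"],
    },
}


def execution_order(plan):
    """Order the steps so that every step comes after its predecessors."""
    graph = {step: meta["isPrecededBy"] for step, meta in plan.items()}
    return list(TopologicalSorter(graph).static_order())


def steps_to_rerun(plan, changed_step):
    """Return the changed step and every step that transitively follows it."""
    order = execution_order(plan)
    affected = {changed_step}
    for step in order:
        if affected.intersection(plan[step]["isPrecededBy"]):
            affected.add(step)
    return [step for step in order if step in affected]


if __name__ == "__main__":
    print(execution_order(PLAN))
    # Replacing the modelling step leaves the collection step untouched.
    print(steps_to_rerun(PLAN, "ex:model"))
```
      </preformat>
      <p>Because the precedence relations are explicit, swapping one step for a comparable approach only requires repeating that step and those that depend on it.</p>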
      <p>When dealing with adaptive and personalisation systems,
data forms an important resource for comparative
evaluations. The practical implementation of such systems is
sometimes shaped by performance requirements or by other
existing practical considerations. CSV on the Web
(CSVW)<sup>8</sup> is an adaptation of the widely used CSV format for
linked data sets that allows representing structured data
along with a metadata vocabulary for describing the
contents of the data. This makes it possible for a system to be
performant while using the CSVW format for all forms of
data exchange, including the metadata and the actual data
set related to an experiment. Similarly, JSON-LD<sup>9</sup> is based
on the JSON format, which is a popular format for
exchanging data in a structured manner between REST
APIs on the web. JSON-LD is a lightweight data format
that is easy for humans to read and write, and is currently
a W3C Recommendation. CSVW and JSON-LD offer a fast
and performant way to exchange RDF/OWL based data in
linked open data systems.</p>
      <p><sup>7</sup> http://www.opmw.org/model/p-plan/</p>
      <p><sup>8</sup> http://www.w3.org/TR/csvw-ucr/</p>
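      <p>To illustrate how a CSV file and its CSVW description travel together, the following Python sketch reads a small table and converts each cell according to the data type declared in the metadata. It is a simplified illustration under stated assumptions: the keys url, tableSchema, columns, name and datatype follow the CSVW metadata vocabulary, whereas the file name, the column names, the sample rows and the helper function are made up, and only three data types are handled.</p>
      <preformat>
```python
import csv
import io

# Hypothetical CSVW-style description of a small results table.
METADATA = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "results.csv",
    "tableSchema": {
        "columns": [
            {"name": "subject", "datatype": "string"},
            {"name": "language", "datatype": "string"},
            {"name": "score", "datatype": "number"},
        ]
    },
}

# Made-up sample content standing in for the file named in "url".
CSV_TEXT = "subject,language,score\ns01,en,0.5\ns02,de,0.75\n"

# Simplified mapping from CSVW data type names to Python converters.
CASTS = {"string": str, "integer": int, "number": float}


def read_table(csv_text, metadata):
    """Parse the CSV text and convert each cell as the schema declares."""
    columns = metadata["tableSchema"]["columns"]
    expected = [column["name"] for column in columns]
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != expected:
        raise ValueError(f"header {reader.fieldnames} does not match {expected}")
    return [
        {c["name"]: CASTS[c["datatype"]](row[c["name"]]) for c in columns}
        for row in reader
    ]


if __name__ == "__main__":
    for row in read_table(CSV_TEXT, METADATA):
        print(row)
```
      </preformat>
      <p>The data itself stays in plain CSV, so existing tools keep working, while the accompanying metadata lets a consumer validate the header and interpret the values without guessing.</p>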
    </sec>
    <sec id="sec-5">
      <title>EXAMPLE USE CASE</title>
      <p>
        We applied the combination of DCAT, PROV, P-Plan
and CSVW metadata to tabular data sets arising from an
experiment using the Personalized Multilingual Information
Retrieval (PMIR) platform described in [<xref ref-type="bibr" rid="ref4">4</xref>]. This platform
supported the modular design, implementation and
evaluation of a set of algorithms for multilingual user modelling,
multilingual query adaptation, and multilingual result-list
adaptation. A simplified summary of the user modelling
process is expressed in Figure 1. It shows the structure of the
workflow, including the steps and variables involved and their
dependencies on state and execution order, captured using
the P-Plan vocabulary. Steps are defined with additional
metadata describing dependency relations and precedence
order in relation to other steps, along with the data
consumed or generated, which is expressed as variables. The
use of an expressive ontology provides an interoperable and
unambiguous record of the workflow interlinked with the
resulting data set.
      </p>
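      <p>The interlinking of a workflow record with its resulting data set can be sketched as a JSON-LD document built in Python. This is a hypothetical example and not the metadata published for the PMIR experiment: the vocabulary terms come from DCAT, Dublin Core Terms, PROV and P-Plan, while every IRI under http://example.org/, the title and the two helper functions are assumptions made for illustration. The expansion helper handles only simple prefixed names and is not a full JSON-LD processor.</p>
      <preformat>
```python
import json

# Namespace prefixes; the ex: namespace is a placeholder.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "prov": "http://www.w3.org/ns/prov#",
    "p-plan": "http://purl.org/net/p-plan#",
    "ex": "http://example.org/experiment/",
}

# Hypothetical record: a data set, its CSV distribution, and the
# activity that generated it, tied to a step of the planned workflow.
RECORD = {
    "@context": CONTEXT,
    "@id": "ex:dataset/user-models",
    "@type": "dcat:Dataset",
    "dct:title": "Hypothetical user model table",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": {"@id": "ex:files/user-models.csv"},
        "dcat:mediaType": "text/csv",
    },
    "prov:wasGeneratedBy": {
        "@id": "ex:run/1",
        "@type": "p-plan:Activity",
        "p-plan:correspondsToStep": {"@id": "ex:step/build-user-model"},
    },
}


def expand(term, context):
    """Expand a prefixed name such as 'dcat:Dataset' to a full IRI."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term


def linked_step(record):
    """Return the full IRI of the workflow step that produced the data set."""
    activity = record["prov:wasGeneratedBy"]
    return expand(activity["p-plan:correspondsToStep"]["@id"], record["@context"])


if __name__ == "__main__":
    print(json.dumps(RECORD, indent=2))
    print(linked_step(RECORD))
```
      </preformat>
      <p>Following the link from the data set to the step is what allows a reader of a publication to move from a reported result to the part of the workflow that produced it.</p>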
    </sec>
    <sec id="sec-6">
      <title>FUTURE WORK &amp; MOTIVATION</title>
      <p>The creation of an ontology specific to the description of
the steps and data associated with a published adaptivity
and personalisation experiment offers a cohesive model that
can be linked together to form a corpus of knowledge of
related experiments. The effort required to create such an
ontology consists largely of adopting existing ontologies
and standards, and benefits from established best practices for
models that make it easy to locate, access, reuse and repeat an
experiment. Sharing experiment workflows and experimental
data allows broader sharing of knowledge linked across
domains. This brings new ideas into the UMAP community
while exposing the ongoing work in new research avenues to
other fields of related disciplines. Ultimately, scientific
research can only progress through sharing of knowledge in a
usable and repeatable manner, and hence efforts towards this
goal must be sustained.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work has been supported partially by the European
Commission as part of the FALCON project (contract
number 610879) and the ADAPT Centre for Digital Content
Technology which is funded under the SFI Research Centres
Programme (Grant 13/RC/2106) and is co-funded under the
European Regional Development Fund.</p>
      <p><sup>9</sup> http://json-ld.org</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M.</given-names> <surname>Brümmer</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Baron</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Ermilov</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Freudenberg</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Kontokostas</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Hellmann</surname></string-name>.
          <article-title>DataID: Towards semantically rich metadata for complex datasets</article-title>.
          In <source>Proceedings of the 10th International Conference on Semantic Systems, SEM '14</source>,
          pages <fpage>84</fpage>–<lpage>91</lpage>, New York, NY, USA,
          <year>2014</year>. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>C.</given-names> <surname>Bizer</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Heath</surname></string-name>, and
          <string-name><given-names>T.</given-names> <surname>Berners-Lee</surname></string-name>.
          <article-title>Linked data - the story so far</article-title>.
          <source>International Journal on Semantic Web and Information Systems</source>, Special Issue on Linked Data,
          <volume>5</volume>(<issue>3</issue>):<fpage>1</fpage>–<lpage>22</lpage>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>D.</given-names> <surname>Esteves</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Moussallem</surname></string-name>,
          <string-name><given-names>C. B.</given-names> <surname>Neto</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Soru</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Usbeck</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ackermann</surname></string-name>, and
          <string-name><given-names>J.</given-names> <surname>Lehmann</surname></string-name>.
          <article-title>MEX vocabulary: a lightweight interchange format for machine learning experiments</article-title>.
          Pages <fpage>169</fpage>–<lpage>176</lpage>. ACM Press,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>M. R.</given-names> <surname>Ghorab</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Lawless</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>O'Connor</surname></string-name>, and
          <string-name><given-names>V.</given-names> <surname>Wade</surname></string-name>.
          <article-title>Does personalization benefit everyone in the same way? Multilingual search personalization for English vs. non-English users</article-title>.
          In <source>Posters, Demos, Late-breaking Results and Workshop Proceedings of the 22nd Conference on User Modeling, Adaptation, and Personalization co-located with the 22nd Conference on User Modeling, Adaptation, and Personalization (UMAP2014)</source>,
          Aalborg, Denmark, July 7-11,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>S.</given-names> <surname>Piperidis</surname></string-name>.
          <article-title>The META-SHARE language resources sharing infrastructure: Principles, challenges, solutions</article-title>.
          In
          <string-name><given-names>N.</given-names> <surname>Calzolari</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Choukri</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Declerck</surname></string-name>,
          <string-name><given-names>M. U.</given-names> <surname>Dogan</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Maegaard</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Mariani</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Odijk</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Piperidis</surname></string-name>, editors,
          <source>LREC</source>,
          pages <fpage>36</fpage>–<lpage>42</lpage>.
          European Language Resources Association (ELRA),
          <year>2012</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>