<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ShExML: An heterogeneous data mapping language based on ShEx</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Herminio Garcia-Gonzalez</string-name>
          <email>herminio.garcia-gonzalez@inria.fr</email>
          <email>herminiogg@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Fernandez-Alvarez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Emilio Labra-Gayo</string-name>
          <email>labra@uniovi.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Oviedo</institution>
          ,
          <addr-line>Oviedo, Asturias</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria Lille Nord Europe</institution>
          ,
          <addr-line>Villeneuve-d'Ascq</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data interoperability is a problem that we are facing more intensely with the emergence of fields such as Big Data and IoT. Much data is persisted in information silos with neither interconnection nor format homogenisation. Our proposal to alleviate this problem is ShExML, a language based on ShEx that can map and merge heterogeneous data formats into a single RDF representation. We advocate the creation of tools of this type, which can facilitate the migration of non-semantic data to the Semantic Web.</p>
      </abstract>
      <kwd-group>
        <kwd>data</kwd>
        <kwd>interoperability</kwd>
        <kwd>RDF</kwd>
        <kwd>ShEx</kwd>
        <kwd>ShExML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Mapping and merging heterogeneous data sources is a task that has gained
importance in recent years. With improved hardware support, the development of
new technological areas such as Big Data and the Internet of Things (IoT), and
the deeper interconnection between heterogeneous devices, a huge amount of data
is generated every second. However, this data is created in various formats and
persisted using different technologies. Therefore, understanding and exploiting
this data is hard work because of the information silos model.</p>
      <p>One of the goals of the Semantic Web was to interconnect data sources
and avoid the aforementioned information silos, and many technologies were
proposed in pursuit of that objective. However, migrating non-semantic data to
the new semantic technologies is a hard task that many individuals and
companies cannot face because of the time and resources it consumes. Migrating
all the databases in a company to their Semantic Web counterparts entails
migrating not only the platforms but also the data, with ad-hoc solutions
developed for every dataset. Therefore, solutions that ease this translation can
contribute to the adoption of semantic technologies or, at least, facilitate it.</p>
      <p>We propose a language to map and merge heterogeneous data into its
Resource Description Framework (RDF) counterpart, while also taking usability
and ease of use into account.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Many mapping languages and tools have been proposed to map a
non-semantic format to its RDF counterpart. This is the case of XSPARQL
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which converts from XML to RDF based on XQuery and SPARQL queries;
R2RML [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which allows mappings from relational databases to RDF graphs to be
defined; and CSV2RDF [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which converts from CSV to RDF.
      </p>
      <p>
        However, none of these works tackles both the mapping and the merging of
heterogeneous datasets in the same solution. This is addressed by RML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which
extends the R2RML language to support formats such as JSON, CSV and XML in
addition to relational databases. Another alternative is YARRRML [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a text-based
language intended to be easily readable by humans. YARRRML is based
on YAML and can be used to represent RML and R2RML rules.
      </p>
      <p>ShExML shares the same goal as RML and YARRRML. However, because it is
based on ShEx, the generated data can be validated faster, i.e., the gap
between ShExML and ShEx is small. Moreover, it is designed to keep the same
simplicity and ease of use that ShEx has.</p>
    </sec>
    <sec id="sec-3">
      <title>ShExML at a glance</title>
      <p>
        ShExML is based on ShEx [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which means that the language constructs of
ShExML are similar to those of ShEx. Therefore, it uses the shape as the main
foundation for every transformation.
      </p>
      <p>Listing 1.1. ShExML example for films</p>
      <preformat>PREFIX : &lt;http://example.com/&gt;
PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX dbr: &lt;http://dbpedia.org/resource/&gt;
SOURCE films_xml &lt;https://example.com/films.xml&gt;
SOURCE films_json &lt;https://example.com/films.json&gt;
QUERY film_ids_xml &lt;//film/@id&gt;
QUERY film_names_xml &lt;//film/name&gt;
QUERY film_years_xml &lt;//film/year&gt;
QUERY film_directors_xml &lt;//film/director&gt;
QUERY film_ids_json &lt;$.films[*].id&gt;
QUERY film_names_json &lt;$.films[*].name&gt;
QUERY film_years_json &lt;$.films[*].year&gt;
QUERY film_directors_json &lt;$.films[*].director&gt;
EXPRESSION film_ids &lt;$films_xml.film_ids_xml UNION $films_json.film_ids_json&gt;
EXPRESSION film_names &lt;$films_xml.film_names_xml UNION $films_json.film_names_json&gt;
EXPRESSION film_years &lt;$films_xml.film_years_xml UNION $films_json.film_years_json&gt;
EXPRESSION film_directors &lt;$films_xml.film_directors_xml UNION $films_json.film_directors_json&gt;
:Films :[film_ids] {
    foaf:name [film_names] ;
    dbo:year dbr:[film_years] ;
    dbo:director [film_directors] ;
}</preformat>
      <p>We can see ShExML as a combination of declarations followed by a set of
shapes, where the declarations are a collection of variable definitions and the
shapes are the core procedure to define and execute the mappings. ShExML is
available on GitHub: https://github.com/herminiogg/ShExML</p>
      <p>The set of declarations comprises prefixes, sources, queries and
expressions. Prefixes work like Turtle prefixes; sources define the URL at which
a file is hosted; queries define reusable queries over the previously defined
sources (normally written in a query language, e.g., JSONPath or XPath); and
expressions are used to run the queries over a source, make unions among
queries and transform them.</p>
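      <p>As a rough illustration of how an expression could combine queries over two sources, the following Python sketch evaluates counterparts of the queries //film/@id and $.films[*].id and merges their results. It is not the ShExML engine: the sample data, the helper names (film_ids_xml, film_ids_json, union) and the choice to model UNION as order-preserving concatenation are all assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the two sources. Element and key names follow the
# queries in Listing 1.1; the values are illustrative only.
FILMS_XML = """
<films>
  <film id="1"><name>Film A</name></film>
  <film id="2"><name>Film B</name></film>
</films>
"""
FILMS_JSON = '{"films": [{"id": 3, "name": "Film C"}, {"id": 4, "name": "Film D"}]}'


def film_ids_xml(text):
    """Rough equivalent of the XPath query //film/@id."""
    root = ET.fromstring(text)
    return [film.get("id") for film in root.iter("film")]


def film_ids_json(text):
    """Rough equivalent of the JSONPath query $.films[*].id."""
    data = json.loads(text)
    return [str(film["id"]) for film in data["films"]]


def union(*result_lists):
    """UNION modelled (as an assumption) as order-preserving concatenation."""
    merged = []
    for results in result_lists:
        merged.extend(results)
    return merged


# Counterpart of: EXPRESSION film_ids <$films_xml.film_ids_xml UNION $films_json.film_ids_json>
film_ids = union(film_ids_xml(FILMS_XML), film_ids_json(FILMS_JSON))
print(film_ids)
```
]]></preformat>
      <p>Keeping the query functions separate from union mirrors the split between QUERY and EXPRESSION declarations: a query is defined once and can then be reused in several expressions.</p>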
      <p>Listing 1.2. JSON films file</p>
      <preformat>{
  "films": [
    {
      "id": 3,
      "name": "Inception",
      "year": "2010",
      "director": "Christopher Nolan"
    }, {
      "id": 4,
      "name": "The Prestige",
      "year": "2006",
      "director": "Christopher Nolan"
    }
  ]
}</preformat>
      <p>Listing 1.3. XML films file</p>
      <p>Thus, imagine that we want to transform two lists of films:
one in JSON and the other in XML (see Listings 1.2 and 1.3). We define a
ShExML document that converts both files to RDF and merges them into a single
RDF file (see Listing 1.1). This conversion has a single shape called :Films,
which holds the main conversion for the films. To construct each triple,
a name is defined through the :[film_ids] directive, which becomes the
subject of every triple generated by this shape. Then, predicates and objects
are generated, based on the previous ids, using the expressions enclosed between
braces. For example, foaf:name [film_names] will generate a triple of the form
subject foaf:name object. Notice that every expression enclosed between
square brackets allows a prefix definition, which tells the compiler whether this
expression will be a node or a literal. Moreover, if a query produces a list of
results instead of a single one, the ShExML engine performs the mapping taking
into account the relation of each result to its entity, which makes it possible
to merge files containing several entities. Finally, the result of this example
is shown in Listing 1.4.
</p>
      <p>Listing 1.4. Result of the transformation</p>
      <preformat>:4
    dbo:director "Christopher Nolan" ;
    dbo:year dbr:2006 ;
    foaf:name "The Prestige" .

:3
    dbo:director "Christopher Nolan" ;
    dbo:year dbr:2010 ;
    foaf:name "Inception" .

:2
    dbo:director "Christopher Nolan" ;
    dbo:year dbr:2014 ;
    foaf:name "Interstellar" .

    dbo:director "Christopher Nolan" ;
    dbo:year dbr:2017 ;
    foaf:name "Dunkirk" .</preformat>
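      <p>The way a shape turns expression results into triples can be sketched in Python as follows. This is only an approximation under stated assumptions: results are aligned with their entity by position (the engine's actual strategy is not detailed here), the function name films_shape is hypothetical, and the input lists hold the values of the two JSON films from Listing 1.2.</p>
      <preformat><![CDATA[
```python
# Results of the film_* expressions for the two JSON films in Listing 1.2.
film_ids = ["3", "4"]
film_names = ["Inception", "The Prestige"]
film_years = ["2010", "2006"]
film_directors = ["Christopher Nolan", "Christopher Nolan"]


def films_shape(ids, names, years, directors):
    """Emit (subject, predicate, object) triples for a :Films-like shape.

    Assumption: the i-th result of every expression belongs to the i-th entity.
    """
    triples = []
    for film_id, name, year, director in zip(ids, names, years, directors):
        subject = ":" + film_id  # :[film_ids], the ':' prefix makes it a node
        # No prefix inside the brackets, so the value becomes a literal.
        triples.append((subject, "foaf:name", '"' + name + '"'))
        # dbr:[film_years], the 'dbr:' prefix makes the value a node.
        triples.append((subject, "dbo:year", "dbr:" + year))
        triples.append((subject, "dbo:director", '"' + director + '"'))
    return triples


triples = films_shape(film_ids, film_names, film_years, film_directors)
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```
]]></preformat>
      <p>For the two JSON films this yields three triples per subject, with literal names and directors and a dbr: node for the year.</p>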
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this work, we have presented ShExML, a language that makes it possible
to map and merge heterogeneous data into its RDF counterpart. This tool helps the
migration of semi-structured data to a semantic data format, improving its
interoperability and searchability. With the development of this solution,
integrating data into the Semantic Web becomes an easier task that can be
adapted to different scenarios. We are planning to include some extra features
in future versions, such as the unification of URIs between different
representations, the matching between generated URIs and existing ones in the
Linked Open Data cloud, and the conversion of streaming sources.</p>
      <p>Acknowledgments. This work has been partially funded by the Vicerectorate
for Research of the University of Oviedo under the call of "Plan de Apoyo y
Promoción de la Investigación" and by the Ministerio de Economía, Industria y
Competitividad under the call of "Programa Estatal de I+D+i Orientada a los
Retos de la Sociedad" (project TIN2017-88877-R).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bischof</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krennwallner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Mapping between RDF and XML with XSPARQL</article-title>
          .
          <source>Journal on Data Semantics</source>
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>147</fpage>
          –
          <lpage>185</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundara</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>R2RML: RDB to RDF Mapping Language</article-title>
          . https://www.w3.org/TR/r2rml/ (
          <year>2012</year>
          ),
          <source>W3C Recommendation 27 September 2012</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          . In: LDOW. Seoul, Korea (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ermilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stadler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>CSV2RDF: User-driven CSV to RDF mass conversion framework</article-title>
          .
          In:
          <source>Proceedings of the ISEM</source>
          . vol.
          <volume>13</volume>
          , pp.
          <fpage>04</fpage>
          –
          <lpage>06</lpage>
          . Graz, Austria
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Heyvaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Declarative Rules for Linked Data Generation at your Fingertips!</article-title>
          In:
          <source>Proceedings of the 15th ESWC: Posters and Demos</source>
          . Heraklion, Greece
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Prud'hommeaux</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labra Gayo</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Shape Expressions: An RDF Validation and Transformation Language</article-title>
          . In:
          <source>Proceedings of the 10th International Conference on Semantic Systems</source>
          . pp.
          <fpage>32</fpage>
          –
          <lpage>40</lpage>
          . SEM '14, ACM, New York, NY, USA (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>