<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Semi-Automatic Tool for Linked Data Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Moreau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Terpolilli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patricia Serrano-Alvarado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nantes University</institution>
          ,
          <addr-line>LS2N, CNRS, UMR6004, 44000 Nantes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>OpenDataSoft</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Linked Data (LD) is a set of best practices for publishing data in RDF format. Structured datasets can be transformed into RDF datasets by means of RDF mappings. Defining such mappings requires both familiarity with LD practices and thorough knowledge of the datasets concerned. An obstacle to the democratisation of LD is that few people satisfy these two conditions. We believe that tools that ease the process of LD integration will foster the growth of LD. In this demonstration, we present a chatbot-like tool that can semi-automatically generate RDF mappings for existing structured datasets. The challenge is to automate the part of the integration process that requires familiarity with RDF.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Linked Data (LD) is a set of best practices to publish data in RDF<sup>3</sup> format.
Data published as LD are described according to ontologies that represent
relationships (i.e., properties) between concepts (i.e., classes) of a domain. RDFS<sup>4</sup>
and OWL<sup>5</sup> are semantic web languages to describe ontologies. Using existing
and widely used ontologies increases interoperability among LD datasets.</p>
      <p>An RDF mapping defines the transformation of a structured dataset
(column-based, JSON, etc.) into an RDF dataset. It maps columns of a dataset to terms
of an RDF graph.</p>
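      <p>As a rough illustration of what such a mapping expresses, the sketch below applies a simplified, hypothetical mapping structure to one record. The dictionary layout, the function name row_to_triples and the example.org subject template are assumptions made for this sketch; they are not the syntax of any actual mapping language. The column names are borrowed from the Roman Emperors example used in this paper.</p>
      <preformat>
```python
# Hypothetical, simplified stand-in for an RDF mapping: each listed column
# supplies the object of one predicate, and one column identifies the subject.
MAPPING = {
    "subject_column": "Name",
    "subject_template": "http://example.org/emperor/{}",  # assumed URI scheme
    "predicates": {
        "Name": "http://dbpedia.org/ontology/name",
        "Birth City": "http://dbpedia.org/ontology/birthPlace",
    },
}


def row_to_triples(row, mapping):
    """Turn one record (a dict of column -> value) into (s, p, o) triples."""
    subject = mapping["subject_template"].format(row[mapping["subject_column"]])
    triples = []
    for column, predicate in mapping["predicates"].items():
        value = row.get(column)
        if value:  # empty cells produce no triple
            triples.append((subject, predicate, value))
    return triples


row = {"Name": "Augustus", "Birth City": "Rome", "Birth Province": ""}
triples = row_to_triples(row, MAPPING)
for triple in triples:
    print(triple)
```
      </preformat>
      <p>Columns that are absent from the mapping, such as Birth Province here, simply produce no triples.</p>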
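      <p>Writing mappings is not easy. Consider Figure 1, which shows an excerpt of a
structured dataset describing Roman Emperors, and Figure 2, which represents an
RDF mapping that transforms this dataset into RDF. Writing this mapping
requires answering several questions, for instance: (i) Which concepts do the
Name and Birth City columns contain? In this case, Name contains entities that are
Persons (emperors) and Birth City contains entities that are Places (cities). (ii)
What are the relationships between these two concepts? Here, Places are birth
places of Persons. (iii) Which existing ontologies are relevant to describe these
concepts? In this example, DBpedia, GeoNames, etc.</p>
      <table-wrap id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>Excerpt of a structured dataset describing Roman Emperors.</p>
        </caption>
        <table>
          <thead>
            <tr><th>string</th><th>date</th><th>string</th><th>string</th><th>float</th><th>float</th></tr>
            <tr><th>Name</th><th>Birth</th><th>Birth City</th><th>Birth Province</th><th>Lat</th><th>Long</th></tr>
          </thead>
          <tbody>
            <tr><td>Augustus</td><td>0062-09-23</td><td>Rome</td><td></td><td></td><td></td></tr>
            <tr><td>Caligula</td><td>0012-08-31</td><td>Antitum</td><td></td><td></td><td></td></tr>
            <tr><td>Claudius</td><td>0009-08-01</td><td>Lugdunum</td><td>Gallia Lugdunensis</td><td>47.932559</td><td>0.191854</td></tr>
            <tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>3</sup> https://www.w3.org/RDF/</p>
      <p><sup>4</sup> https://www.w3.org/TR/rdf-schema/</p>
      <p><sup>5</sup> https://www.w3.org/TR/owl2-overview/</p>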
      <p>Answering these questions requires both knowing the dataset perfectly and
being familiar with RDF concepts such as RDFS, OWL and RDF mapping
languages. Unfortunately, many data producers are not familiar with RDF and
are not yet ready to invest time in integrating their data. In this work, we focus on
how to simplify as much as possible the integration of existing structured datasets
as Linked Data. The challenge we face is to automate the part of the integration
process that requires familiarity with RDF.</p>
      <p>
        RML [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and SPARQL-Generate [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are two RDF mapping languages. Even
though simplified, human-readable syntaxes for mapping languages exist, such as
YARRRML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], writing a mapping still requires familiarity with RDF. Recently,
tools such as KARMA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], RMLEditor [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Juma [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have been proposed
to assist users during the creation of an RDF mapping. However, these tools are
not easy to use for users who are not familiar with RDF concepts.
      </p>
      <p>We propose a chatbot-like tool that can generate an RDF mapping from a
structured dataset by asking users only simple questions about their dataset.
Our tool makes it simple and quick to integrate datasets as Linked Data and
encourages new users to take their first steps into Linked Data.</p>
    </sec>
    <sec id="sec-2">
      <title>A Chatbot-Like Tool for Linked Data Integration</title>
      <p>To generate an RDF mapping from a structured dataset, our tool uses two
knowledge graphs, DBpedia and YAGO, the ontologies of LOV<sup>6</sup>, and the semantic web
languages OWL and RDFS.</p>
      <p>Roughly speaking, from a set of instances of each column, our tool searches
for corresponding entities in DBpedia and YAGO. The goal is to find a class
corresponding to each column. Then, similarly, LOV is used to find the most relevant
properties that may correspond to column names, such that instances of a
column correspond to the object of a property. To confirm and complete these
correspondences, the tool asks the user simple questions. User-confirmed
correspondences are used to generate a first RDF mapping. Finally, this mapping is
saturated with entailment rules of OWL and RDFS.<sup>7</sup></p>
      <p><sup>6</sup> https://lov.linkeddata.es/dataset/lov/</p>
      <p>Using the Roman emperor dataset of Figure 1, in the class correspondence
step, the Augustus value of the Name column corresponds to the entity
http://dbpedia.org/resource/Augustus of the class dbo:Person in DBpedia. Thus,
Augustus is identified as an entity of the class dbo:Person. In this example,
we obtain two class correspondences suggesting that the columns Name and Birth
City contain entities of the classes dbo:Person and dbo:Place, respectively.
These correspondences are suggested to the user with simple yes-or-no questions, such as
“Does the column Name in your dataset contain Persons?”. To hide URIs,
questions are built using the rdfs:label property of classes.</p>
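      <p>The class correspondence step can be sketched as follows. This is a minimal illustration, not the implementation of the tool: the LOOKUP table stands in for the entity search in DBpedia and YAGO, the LABELS table stands in for rdfs:label values, and the names suggest_class and class_question are hypothetical.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical lookup results: for each sampled value, the classes of the
# entity found in the knowledge graph (missing key when nothing matched).
LOOKUP = {
    "Augustus": ["dbo:Person"],
    "Caligula": ["dbo:Person"],
    "Claudius": ["dbo:Person", "dbo:Place"],
}
# Hypothetical rdfs:label values, used so that the user never sees a URI.
LABELS = {"dbo:Person": "Persons", "dbo:Place": "Places"}


def suggest_class(values, lookup):
    """Return the class matched by the most values of a column, or None."""
    counts = Counter()
    for value in values:
        # Count each class at most once per value.
        counts.update(set(lookup.get(value, [])))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def class_question(column, cls, labels):
    """Build the yes-or-no question shown to the user."""
    return f"Does the column {column} in your dataset contain {labels[cls]}?"


best = suggest_class(["Augustus", "Caligula", "Claudius"], LOOKUP)
print(class_question("Name", best, LABELS))
```
      </preformat>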
      <p>In the property correspondence step, our tool obtains five property
correspondences suggesting that the columns Name, Birth, Birth City, Lat and Long are
the objects of the properties dbo:name, locah:dateBirth, dbo:birthPlace,
geo:lat and geo:long, respectively. Again, these correspondences are suggested to the user
with simple yes-or-no questions, such as “It seems that the column Lat is the latitude
of a Spatial thing. Is it true?”. The question is built using the rdfs:label and the
rdfs:domain of the property.</p>
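      <p>A property question of this kind could be assembled as in the sketch below. The PropertyInfo record and the property_question function are hypothetical names introduced for illustration; the labels are given by hand here rather than read from an ontology.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass
class PropertyInfo:
    """Hypothetical record for a candidate property and its labels."""
    uri: str
    label: str          # rdfs:label of the property
    domain_label: str   # rdfs:label of the class given by rdfs:domain


def property_question(column, prop):
    """Build the yes-or-no question for a property correspondence."""
    return (
        f"It seems that the column {column} is the {prop.label} "
        f"of a {prop.domain_label}. Is it true?"
    )


lat = PropertyInfo(uri="geo:lat", label="latitude", domain_label="Spatial thing")
question = property_question("Lat", lat)
print(question)
```
      </preformat>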
      <p>To complete confirmed correspondences, the tool asks the user to select the
column of the dataset that corresponds to the subject of the property. If
the user confirms the geo:lat correspondence with the column Lat, the tool asks
“latitude is a characteristic of a Spatial Thing. Select the column that contains
Spatial Thing.” In our example, if the user answers correctly, the column Birth
Province is used as the subject of the geo:lat property.</p>
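      <p>The completion of a confirmed correspondence can be pictured as recording which column plays the subject role. The sketch below is only an illustration under assumed names (subject_prompt, complete_correspondence) and an assumed dictionary representation of the completed correspondence.</p>
      <preformat>
```python
COLUMNS = ["Name", "Birth", "Birth City", "Birth Province", "Lat", "Long"]


def subject_prompt(prop_label, domain_label):
    """Build the prompt asking the user for the subject column."""
    return (
        f"{prop_label} is a characteristic of a {domain_label}. "
        f"Select the column that contains {domain_label}."
    )


def complete_correspondence(object_column, prop, columns, chosen):
    """Attach the user-selected subject column to a confirmed property."""
    if chosen not in columns:
        raise ValueError(f"unknown column: {chosen}")
    return {"subject": chosen, "property": prop, "object": object_column}


print(subject_prompt("latitude", "Spatial Thing"))
link = complete_correspondence("Lat", "geo:lat", COLUMNS, "Birth Province")
print(link)
```
      </preformat>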
      <p>Our tool uses heuristics to reduce the number of questions. It suggests at most
one class and one property for each column. Only the class that corresponds to
the most instances of a column is suggested. The property suggested for
a column is the one with the best popularity score in the LOV answer.
Our tool does not suggest a property if its LOV score is below a fixed lower
bound. Moreover, to improve the relevance of suggested properties, the type of
a column can also be added to the text search. In our example, it searches for
the property Birth date instead of Birth.</p>
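      <p>The property heuristics can be sketched as below. The value of LOWER_BOUND, the candidate scores and the ex:birth property are invented for illustration, since the actual bound and the LOV scores are not given here; search_text and pick_property are hypothetical names.</p>
      <preformat>
```python
LOWER_BOUND = 0.5  # assumed value; the actual bound is not stated in the text


def search_text(column, column_type=None):
    """Optionally append the column type to the text searched in LOV."""
    if column_type:
        return f"{column} {column_type}"
    return column


def pick_property(candidates, lower_bound=LOWER_BOUND):
    """Keep at most one property: the best-scored one, if it reaches the bound.

    candidates is a list of (property, score) pairs.
    """
    if not candidates:
        return None
    prop, score = max(candidates, key=lambda item: item[1])
    return prop if score >= lower_bound else None


query = search_text("Birth", "date")
# Hypothetical scored answers for that query.
chosen = pick_property([("locah:dateBirth", 0.8), ("ex:birth", 0.6)])
rejected = pick_property([("ex:birth", 0.2)])
print(query, chosen, rejected)
```
      </preformat>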
      <p>From user-confirmed correspondences, our tool generates a first RDF
mapping. In a final step, our tool saturates this mapping by applying RDFS and OWL
entailment rules. Using the range and domain of all properties (rdfs:range and
rdfs:domain), new classes are inferred. This is possible because the domain of a
property represents the type (i.e., a class or the literal datatype) of the subject
and the range represents the type of the object. In our example, for instance,
the user defined Birth Province as the subject of the latitude property. In the
GeoNames ontology, the rdfs:domain of this property is geo:SpatialThing.
Thus, our tool infers that the column Birth Province contains entities of type
geo:SpatialThing. Our tool also takes into account the owl:equivalentClass,
owl:equivalentProperty, rdfs:subClassOf and rdfs:subPropertyOf
properties of the ontologies concerned. The RDF mapping in YARRRML of our example
is available at https://git.io/fjKY6 and the result of the transformation in
Turtle is available at https://git.io/fjKYo.</p>
      <p><sup>7</sup> We only consider rules 2, 3, 5, 7, 9 and 11 from RDFS
(https://www.w3.org/TR/rdf11-mt/#rdfs-entailment) and rules based on owl:equivalentClass and
owl:equivalentProperty from OWL (https://www.w3.org/TR/owl-ref).</p>
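      <p>The domain and range inference described above can be sketched at the level of columns. The code below is a simplified illustration covering only rdfs:domain, rdfs:range and rdfs:subClassOf; the ONTOLOGY fragment is hypothetical (the ex:Thing superclass in particular is invented), and the saturate function is not the implementation of the tool.</p>
      <preformat>
```python
# Hypothetical ontology fragment as (subject, predicate, object) triples.
ONTOLOGY = {
    ("geo:lat", "rdfs:domain", "geo:SpatialThing"),
    ("dbo:birthPlace", "rdfs:range", "dbo:Place"),
    ("geo:SpatialThing", "rdfs:subClassOf", "ex:Thing"),  # invented superclass
}


def saturate(links, ontology):
    """Infer classes for columns from confirmed correspondences.

    links is a list of (subject_column, property, object_column).
    Returns a dict mapping each column to its set of inferred classes.
    """
    classes = {}
    # Domain types the subject column, range types the object column.
    for subject_col, prop, object_col in links:
        for s, p, o in ontology:
            if s != prop:
                continue
            if p == "rdfs:domain":
                classes.setdefault(subject_col, set()).add(o)
            elif p == "rdfs:range":
                classes.setdefault(object_col, set()).add(o)
    # Propagate along rdfs:subClassOf until nothing new is inferred.
    changed = True
    while changed:
        changed = False
        for inferred in classes.values():
            for s, p, o in ontology:
                if p == "rdfs:subClassOf" and s in inferred and o not in inferred:
                    inferred.add(o)
                    changed = True
    return classes


links = [
    ("Birth Province", "geo:lat", "Lat"),
    ("Name", "dbo:birthPlace", "Birth City"),
]
result = saturate(links, ONTOLOGY)
for column in sorted(result):
    print(column, sorted(result[column]))
```
      </preformat>
      <p>With these inputs, Birth Province receives geo:SpatialThing from the domain of geo:lat, as in the example above, plus the invented superclass.</p>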
    </sec>
    <sec id="sec-3">
      <title>Demonstration</title>
      <p>We implemented a chatbot-like tool that is able to generate YARRRML
mappings for datasets of OpenDataSoft’s data network<sup>8</sup>. We chose YARRRML
because, to our knowledge, it is the most human-readable and understandable
RDF mapping syntax for users who are not familiar with LD. The source code of
our tool is available on GitHub<sup>9</sup> under the MIT license. Our tool is available as a
web service at https://chatbot.opendatasoft.com/. During the
demonstration, attendees will be able to use the tool to generate RDF mappings that integrate any
structured dataset of the network as LD (e.g., Roman Emperors<sup>10</sup>).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          .
          <source>In: Workshop on Linked Data on the Web (LDOW) collocated with WWW</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taheriyan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muslea</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Karma: A System for Mapping Structured Sources Into the Semantic Web</article-title>
          .
          <source>In: Extended Semantic Web Conference (ESWC)</source>
          ,
          <source>Poster&amp;Demo</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Heyvaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Declarative Rules for Linked Data Generation at Your Fingertips!</article-title>
          <source>In: Extended Semantic Web Conference (ESWC)</source>
          ,
          <source>Poster&amp;Demo</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Heyvaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herregodts</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuurman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
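          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :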
          <article-title>RMLEditor: a Graph-Based Mapping Editor for Linked Data Mappings</article-title>
          .
          <source>In: Extended Semantic Web Conference (ESWC)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Junior</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debruyne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Sullivan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>An Editor that Uses a Block Metaphor for Representing Semantic Mappings in Linked Data</article-title>
          .
          <source>In: Extended Semantic Web Conference (ESWC)</source>
          ,
          <source>Poster&amp;Demo</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lefrançois</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bakerally</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>A SPARQL Extension For Generating RDF From Heterogeneous Formats</article-title>
          .
          <source>In: Extended Semantic Web Conference (ESWC)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>8 https://data.opendatasoft.com</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>9 https://github.com/opendatasoft/ontology-mapping-chatbot</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10 Roman emperors dataset is available at https://data.opendatasoft.com/explore/dataset/roman-emperors%40public/table/ and mapping of the dataset can be generated at https://chatbot.opendatasoft.com/chatbot/roman-emperors@public
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>