<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Demonstration of Linked Data Source Discovery and Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jason Slepicka</string-name>
          <email>slepicka@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengye Yin</string-name>
          <email>chengyey@usc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Szekely</string-name>
          <email>pszekely@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig A. Knoblock</string-name>
          <email>knoblock@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Southern California Information Sciences Institute and Department of Computer Science</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Linked Data cloud is an enormous repository of data, but it is difficult for users to find relevant data and integrate it into their datasets. Users can navigate datasets in the Linked Data cloud with ontologies, but they lack detailed characterization of datasets' contents. We present an approach that leverages R2RML mappings to characterize datasets. Our demonstration shows how users can easily create R2RML mappings for their datasets and then use these mappings to find data from the Linked Data cloud and integrate it into their datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Linked Data cloud contains an enormous amount of data about many topics.
Consider museums, which often have detailed data about their artworks but may
only have sparse data about the artists who created them. Museums typically
have tombstone data about artists (name, birth/death years, and places) but
may lack biographies, influences, etc. Museums could use additional information
about their artists in the Linked Data cloud and integrate it with their own to
produce a richer, more complete dataset.</p>
      <p>
        Our approach to this problem, built into our Karma data integration system [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
uses R2RML mappings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to describe both users' datasets and datasets in the Linked
Data cloud. Today, datasets include, at best, a VoID description [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with basic
metadata, such as the access method and the vocabularies used. Because R2RML-style
mappings are schema-like, they could complement VoID by capturing the semantic
structure of a dataset and by characterizing its subjects and properties
with statistics or set approximations such as Bloom filters. With this information,
users can reason better about how a dataset might integrate with their own data.
      </p>
      <p>
        R2RML was defined to specify mappings from relational databases to RDF, but
recent work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has proposed extensions to handle data types like CSV, JSON,
XML and Web APIs. Consequently, it is reasonable to expect that more datasets
in the Linked Data cloud could be published with R2RML-style descriptions.
      </p>
      <p>In this demonstration we show how museum users can use Karma to quickly
define an R2RML mapping of a dataset (our previous work), use R2RML mappings
from other datasets to find more information about artists in their dataset, and
then augment their dataset with that information.</p>
      <p>A video demonstration is available at http://youtu.be/sr-XDBKeNCY</p>
    </sec>
    <sec id="sec-2">
      <title>Datasets</title>
      <p>
        For our demonstration we will integrate a CSV file containing 197 artists with
Linked Data published by the Smithsonian American Art Museum (SAAM). In
previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we mapped the SAAM dataset, including over 40,000 artworks
and 8,000 artists, to the CIDOC CRM ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] using R2RML and made it
accessible through a SPARQL endpoint, along with a repository for the R2RML
mappings. The SAAM LOD serves here as a proxy for the Linked Data cloud to illustrate
the vision of a Linked Data cloud populated with R2RML models.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Demonstration</title>
      <p>We will show how a user can interactively model an artist dataset, discover the
Smithsonian's data for those artists, and then integrate the Smithsonian's data.</p>
      <p>Step 1: Modeling a New Source. The user begins by using Karma's
existing capability to model the artists in the CSV file as crm:E21_Person in an
R2RML mapping, shown in Figure 1. Karma can use this mapping to generate
RDF, and can also compare it with other mappings in its repository to discover
related sources that can be integrated with the artist dataset.</p>
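      <p>As a rough illustration of what generating RDF from such a mapping involves, the following sketch applies a simplified, dictionary-based stand-in for a mapping to CSV rows. It is not Karma's implementation or the R2RML syntax: the URI template, the column names, and the properties other than the crm:E21_Person class are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import csv
import io

# Hypothetical stand-in for a mapping: a subject URI template, a class,
# and column-to-property pairs. Only the class name comes from the text.
MAPPING = {
    "template": "http://example.org/artist/{id}",
    "class": "crm:E21_Person",
    "columns": {"name": "rdfs:label", "birth_year": "ex:birthYear"},
}


def rows_to_triples(csv_text, mapping):
    """Apply the mapping to each CSV row and return (s, p, o) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["template"].format(**row)
        triples.append((subject, "rdf:type", mapping["class"]))
        for column, predicate in mapping["columns"].items():
            value = row.get(column, "")
            if value:  # skip empty cells rather than emit empty literals
                triples.append((subject, predicate, value))
    return triples


# Made-up sample rows; the second artist has no birth year.
SAMPLE = "id,name,birth_year\n1,Example Artist,1900\n2,Another Artist,\n"
TRIPLES = rows_to_triples(SAMPLE, MAPPING)
```
]]></preformat>
      <p>Each row yields one typing triple plus one triple per non-empty mapped column, so sparse records simply produce fewer triples.</p>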
      <p>Step 2: Discovering Data Sources. The user then clicks on E21_Person1
in the R2RML mapping and selects Augment Data to discover new data to
integrate into the artist records. Karma retrieves the R2RML mappings in its repository
that describe crm:E21_Person, uses these mappings to generate a candidate
set of linked data sources to integrate, identifies meaningful object and data
properties, and presents them to the user as illustrated in Figure 2. To help
users select properties to integrate, Karma uses Bloom filters to estimate the
number of artists that have each of the properties listed in Figure 2.</p>
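      <p>One way such an estimate could work is sketched below: each property carries a Bloom filter of the subjects that have it, and the user's artist URIs are tested against each filter. How Karma builds, sizes, or stores its filters is not described here, so the class, the filter size, the hashing scheme, and the URIs are all assumptions. Because Bloom filters admit false positives but no false negatives, the counts can overestimate but never underestimate.</p>
      <preformat><![CDATA[
```python
import hashlib


class BloomFilter:
    """A small Bloom filter over strings; the sizes are illustrative."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def estimate_coverage(artist_uris, property_filters):
    """Count, per property, how many artist URIs its filter reports as present."""
    return {
        prop: sum(1 for uri in artist_uris if uri in bloom)
        for prop, bloom in property_filters.items()
    }


# Hypothetical data: three artists, two of whom have a biography.
ARTISTS = ["ex:artist/1", "ex:artist/2", "ex:artist/3"]
biography = BloomFilter()
for uri in ARTISTS[:2]:
    biography.add(uri)
birth = BloomFilter()
for uri in ARTISTS:
    birth.add(uri)

COVERAGE = estimate_coverage(ARTISTS, {"biography": biography, "birth": birth})
```
]]></preformat>
      <p>A filter of this kind lets a source advertise which subjects carry a property without publishing the data itself, which is what makes it suitable as a compact annotation on a mapping.</p>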
      <p>
        Step 3: Integrating Data Sources. The user selects the artist's biography
(for completeness) and birth (for validation). Karma automatically constructs
SPARQL queries to retrieve the data, integrates it into the worksheet, and
augments the R2RML mapping accordingly (Figure 3). To support the integrated
SPARQL queries, we generated owl:sameAs links between the artists in the CSV
file and the Smithsonian dataset using LIMES [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (we plan to integrate LIMES
with Karma so that users can perform all integration steps within Karma).
      </p>
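      <p>The shape of such a query can be sketched as follows. This is a minimal illustration, not the query Karma generates: the function name, the example URIs, and the biography property IRI are hypothetical, and the sketch assumes that the owl:sameAs links and the remote data can be queried from the same endpoint.</p>
      <preformat><![CDATA[
```python
def build_augment_query(local_uris, property_path, var="value"):
    """Build a SPARQL SELECT that follows owl:sameAs links to a remote property.

    local_uris: URIs of the user's artists, assumed to be linked already.
    property_path: a SPARQL property path written with full IRIs.
    """
    values = " ".join(f"<{uri}>" for uri in local_uris)
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        f"SELECT ?artist ?{var} WHERE {{\n"
        f"  VALUES ?artist {{ {values} }}\n"
        "  ?artist owl:sameAs ?remote .\n"
        f"  ?remote {property_path} ?{var} .\n"
        "}"
    )


# Hypothetical artist URI and property IRI, for illustration only.
QUERY = build_augment_query(
    ["http://example.org/artist/1"],
    "<http://example.org/biography>",
    var="biography",
)
```
]]></preformat>
      <p>Restricting the query with a VALUES clause keeps the result aligned with the rows of the worksheet, so each returned value can be attached to the artist it belongs to.</p>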
    </sec>
    <sec id="sec-4">
      <title>Related Work and Conclusions</title>
      <p>
        We see similarities between our approach and those used in relational database
integration and semantic service composition. ORCHESTRA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] starts, like R2RML,
by aligning database tables to a schema graph. For integration, its Q query
system uses heuristics to translate keyword searches over the graph into join
paths. However, these joins are not guaranteed to be semantically
meaningful, unlike the integration paths Karma finds using R2RML.
      </p>
      <p>
        Platforms such as iServe [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] capture Linked Services and make them
discoverable and queryable by annotating them with their Minimal Service Model.
However, past work on service discovery and composition uses only a semantic
model of the inputs and outputs of the services. In contrast, Karma service
descriptions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] also capture the relationships between the attributes, which allows
us to automatically discover semantically meaningful joins.
      </p>
      <p>By building on Karma's ability to quickly model many source types, we
demonstrate how a user can discover other linked data sources, select the desired
attributes from those sources, and then integrate the data from those sources
into their own dataset. Through this source discovery and integration, a user
can transparently compose and join other sources and services in a semantically
meaningful, interactive way that was not previously possible.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alexander</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Describing linked datasets with the VoID vocabulary</article-title>
          .
          <source>W3C note</source>
          , W3C, Mar.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sande</surname>
            ,
            <given-names>M. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and de Walle, R. V.
          <article-title>Extending R2RML to a source-independent mapping language for RDF</article-title>
          .
          <source>In International Semantic Web Conference (Posters and Demos)</source>
          (
          <year>2013</year>
          ), vol.
          <volume>1035</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org, pp.
          <volume>237</volume>
          –
          <fpage>240</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Doerr</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>The CIDOC conceptual reference module: An ontological approach to semantic interoperability of metadata</article-title>
          .
          <source>AI Mag</source>
          .
          <volume>24</volume>
          ,
          <issue>3</issue>
          (Sept.
          <year>2003</year>
          ),
          <volume>75</volume>
          –
          <fpage>92</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>T. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karvounarakis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , N. E.,
          <string-name>
            <surname>Tannen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talukdar</surname>
            ,
            <given-names>P. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>The ORCHESTRA collaborative data sharing system</article-title>
          .
          <source>ACM SIGMOD Record 37</source>
          ,
          <issue>3</issue>
          (
          <year>2008</year>
          ),
          <volume>26</volume>
          –
          <fpage>32</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.-C. N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>LIMES: a time-efficient approach for large-scale link discovery on the web of data</article-title>
          .
          <source>In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence</source>
          (
          <year>2011</year>
          ), AAAI Press, pp.
          <volume>2312</volume>
          –
          <fpage>2317</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Pedrinaci</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maleshkova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lambert</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopecky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Domingue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>iServe: a linked services publishing platform</article-title>
          .
          <source>In CEUR workshop proceedings (2010)</source>
          , vol.
          <volume>596</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sundara</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>R2RML: RDB to RDF mapping language</article-title>
          .
          <source>W3C recommendation, W3C</source>
          , Sept.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fink</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Goodlander</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Connecting the Smithsonian American Art Museum to the Linked Data Cloud</article-title>
          .
          <source>In Proceedings of the 10th ESWC</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Taheriyan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ambite</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          <article-title>Semiautomatically modeling web APIs to create linked APIs</article-title>
          .
          <source>In Proceedings of the ESWC 2012 Workshop on Linked APIs</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>