A Demonstration of Linked Data Source Discovery and Integration

Jason Slepicka

slepicka@isi.edu

Chengye Yin

chengyey@usc.edu

Pedro Szekely

pszekely@isi.edu

Craig A. Knoblock

knoblock@isi.edu

University of Southern California, Information Sciences Institute and Department of Computer Science, USA

The Linked Data cloud is an enormous repository of data, but it is difficult for users to find relevant data and integrate it into their datasets. Users can navigate datasets in the Linked Data cloud with ontologies, but they lack detailed characterization of datasets' contents. We present an approach that leverages R2RML mappings to characterize datasets. Our demonstration shows how users can easily create R2RML mappings for their datasets and then use these mappings to find data from the Linked Data cloud and integrate it into their datasets.

Introduction

The Linked Data cloud contains an enormous amount of data about many topics. Consider museums, which often have detailed data about their artworks but may only have sparse data about the artists who created them. Museums typically have tombstone data about artists (name, birth/death years, and places) but may lack biographies, influences, etc. Museums could use additional information about their artists in the Linked Data cloud and integrate it with their own to produce a richer, more complete dataset.

Our approach, built into our Karma data integration system [8], uses R2RML mappings [7] to describe users' datasets and datasets in the Linked Data cloud. Today, datasets include, at best, a VoID description [1] with basic metadata, such as access method and vocabularies used. With their schema-like nature, R2RML-style mappings could complement VoID by capturing the semantic structure of a dataset and characterizing its subjects and properties with statistics or set approximations such as Bloom filters. With this information, users can reason better about how a dataset might integrate with their own data.

R2RML was defined to specify mappings from relational DBs to RDF, but recent work [2] has proposed extensions to handle data types like CSV, JSON, XML and Web APIs. Consequently, it is reasonable to expect that more datasets in the Linked Data cloud could be published with R2RML-style descriptions.

In this demonstration we show how museum users can use Karma to quickly define an R2RML mapping of a dataset (our previous work), use R2RML mappings from other datasets to find more information about artists in their dataset, and then augment their dataset with that information.

A video demonstration is available at http://youtu.be/sr-XDBKeNCY

Datasets

For our demonstration we will integrate a CSV file containing 197 artists with Linked Data published by the Smithsonian American Art Museum (SAAM). In previous work [8], we mapped the SAAM dataset, including over 40,000 artworks and 8,000 artists, to the CIDOC CRM ontology [3] using R2RML and made it accessible by a SPARQL endpoint, along with a repository for the R2RML mappings. The SAAM LOD here is a proxy for the Linked Data cloud to illustrate the vision of a Linked Data cloud populated with R2RML models.

Demonstration

We will show how a user can interactively model an artist dataset, discover the Smithsonian's data for those artists, and then integrate the Smithsonian's data.

Step 1: Modeling a New Source. The user begins by using Karma's existing capability to model the artists in the CSV file as crm:E21_Person in the R2RML mapping shown in Figure 1. Karma can use this mapping to generate RDF, and can also compare it with other mappings to discover related sources that can be integrated with the artist dataset.
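To make the role of the mapping in Step 1 concrete, the following is a minimal sketch of how a mapping can drive RDF generation from CSV rows. It is not Karma's implementation and not real R2RML syntax: the `MAPPING` dictionary is a hypothetical, simplified stand-in for a triples map, and the column names (`id`, `name`), the `http://example.org/artist/` base URI, the choice of `rdfs:label` for the name, and the CIDOC CRM namespace URI are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote

# Namespace and property URIs (the CRM namespace used here is an assumption).
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical, simplified stand-in for an R2RML triples map.
MAPPING = {
    "subject_column": "id",
    "subject_template": "http://example.org/artist/{}",
    "class": CRM + "E21_Person",
    "predicate_columns": {RDFS_LABEL: "name"},
}


def rows_to_triples(csv_text, mapping):
    """Return (subject, predicate, object) tuples for every CSV row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Build the subject URI from the key column.
        key = quote(row[mapping["subject_column"]])
        subject = mapping["subject_template"].format(key)
        # Every row is typed with the mapped class.
        triples.append((subject, RDF_TYPE, mapping["class"]))
        # Emit one triple per mapped column, skipping empty cells.
        for predicate, column in mapping["predicate_columns"].items():
            value = row.get(column) or ""
            if value:
                triples.append((subject, predicate, value))
    return triples


SAMPLE_CSV = "id,name\n1,Mary Cassatt\n2,Winslow Homer\n"
triples = rows_to_triples(SAMPLE_CSV, MAPPING)
for triple in triples:
    print(triple)
```

Because the class and properties are explicit in the mapping, the same structure that generates RDF can also be compared against other sources' mappings, which is what the discovery step relies on.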

Step 2: Discovering Data Sources. The user then clicks on E21_Person1 in the R2RML mapping and selects Augment Data to discover new data to integrate into artist records. Karma retrieves the R2RML mappings in its repository that describe crm:E21_Person, uses these mappings to generate a candidate set of linked data sources to integrate, identifies meaningful object and data properties, and presents them to the user as illustrated in Figure 2. To help users select properties to integrate, Karma uses Bloom filters to estimate the number of artists that have each of the properties listed in Figure 2.
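The Bloom filter estimate in Step 2 can be sketched as follows. This is an illustration under stated assumptions rather than Karma's code: the `BloomFilter` class, the `estimate_coverage` helper, the filter size, the SHA-256-based hashing, the property keys (`biography`, `birth`), and the example URIs are all hypothetical. The sketch also assumes the user's artist URIs have already been aligned with the remote URIs. Since Bloom filters admit false positives but never false negatives, each count is an estimate that can only overshoot the true number.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def estimate_coverage(artist_uris, property_filters):
    """Estimate, per property, how many of the given artists have it."""
    return {
        prop: sum(1 for uri in artist_uris if uri in bloom)
        for prop, bloom in property_filters.items()
    }


# Publisher side: one filter per property, filled with the subjects that have it.
base = "http://example.org/saam/artist/"
subjects_by_property = {
    "biography": [base + "1", base + "2"],
    "birth": [base + "1"],
}
property_filters = {}
for prop, subjects in subjects_by_property.items():
    bloom = BloomFilter()
    for subject in subjects:
        bloom.add(subject)
    property_filters[prop] = bloom

# User side: test each artist against every filter without fetching the data.
my_artists = [base + "1", base + "2", base + "3"]
counts = estimate_coverage(my_artists, property_filters)
print(counts)
```

The appeal of this design is that a source only needs to publish a compact filter per property, so coverage can be estimated before any query is sent to the remote endpoint.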

Step 3: Integrating Data Sources. The user selects the artist's biography (for completeness) and birth (for validation). Karma automatically constructs SPARQL queries to retrieve the data, integrates it into the worksheet, and augments the R2RML mapping accordingly (Figure 3). To support the integrated SPARQL queries, we generated owl:sameAs links between the artists in the CSV file and the Smithsonian dataset using LIMES [5] (we plan to integrate LIMES with Karma to enable users to perform all integration steps within Karma).
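The queries Karma generates in Step 3 are not shown here, so the following is only a plausible sketch of how such a query could be assembled from owl:sameAs links. The `build_augment_query` function, the property URI `http://example.org/vocab/biography`, and the artist URIs are hypothetical; the sketch further assumes that the sameAs links and the remote data are queryable from the same endpoint.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def build_augment_query(local_uris, property_uri):
    """Build a SPARQL query that follows owl:sameAs links from local artists
    to remote resources and fetches one property value for each."""
    values = " ".join(f"<{uri}>" for uri in local_uris)
    return (
        "SELECT ?local ?value WHERE {\n"
        f"  VALUES ?local {{ {values} }}\n"
        f"  ?local <{OWL_SAME_AS}> ?remote .\n"
        f"  ?remote <{property_uri}> ?value .\n"
        "}"
    )


query = build_augment_query(
    ["http://example.org/artist/1", "http://example.org/artist/2"],
    "http://example.org/vocab/biography",  # hypothetical property URI
)
print(query)
```

Each returned `?local`/`?value` pair could then be written into a new worksheet column keyed by the local artist, which is the kind of augmentation the step describes.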

Related Work and Conclusions

We see similarities in our approach with those used in relational database integration and semantic service composition. ORCHESTRA [4] starts, like R2RML, by aligning database tables to a schema graph. For integration, heuristics are used to translate keyword searches over the graph into join paths using its Q query system. However, these joins are not guaranteed to be semantically meaningful, unlike the integration paths Karma finds using R2RML.

Platforms such as iServe [6] capture Linked Services and make them discoverable and queryable by annotating them with their Minimal Service Model. However, the past work on service discovery and composition only uses a semantic model of the inputs and outputs of the services. In contrast, Karma service descriptions [9] also capture the relationship between the attributes, which allows us to automatically discover semantically meaningful joins.

By building on Karma's ability to quickly model many source types, we demonstrate how a user can discover other linked data sources, select the desired attributes from those sources, and then integrate the data from those sources into their own dataset. Through this source discovery and integration, a user can transparently compose and join other sources and services in a semantically meaningful, interactive way that was not previously possible.

1. Alexander, K., Cyganiak, R., Hausenblas, M., and Zhao, J. Describing linked datasets with the VoID vocabulary. W3C note, W3C, Mar. 2011.

2. Dimou, A., Sande, M. V., Colpaert, P., Mannens, E., and de Walle, R. V. Extending R2RML to a source-independent mapping language for RDF. In International Semantic Web Conference (Posters and Demos) (2013), vol. 1035 of CEUR Workshop Proceedings, CEUR-WS.org, pp. 237–240.

3. Doerr, M. The CIDOC conceptual reference module: An ontological approach to semantic interoperability of metadata. AI Mag. 24, 3 (Sept. 2003), 75–92.

4. Ives, Z. G., Green, T. J., Karvounarakis, G., Taylor, N. E., Tannen, V., Talukdar, P. P., Jacob, M., and Pereira, F. The ORCHESTRA collaborative data sharing system. ACM SIGMOD Record 37, 3 (2008), 26–32.

5. Ngomo, A.-C. N., and Auer, S. LIMES: a time-efficient approach for large-scale link discovery on the web of data. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence (2011), AAAI Press, pp. 2312–2317.

6. Pedrinaci, C., Liu, D., Maleshkova, M., Lambert, D., Kopecky, J., and Domingue, J. iServe: a linked services publishing platform. In CEUR workshop proceedings (2010), vol. 596.

7. Sundara, S., Cyganiak, R., and Das, S. R2RML: RDB to RDF mapping language. W3C recommendation, W3C, Sept. 2012.

8. Szekely, P., Knoblock, C. A., Yang, F., Zhu, X., Fink, E., Allen, R., and Goodlander, G. Connecting the Smithsonian American Art Museum to the Linked Data Cloud. In Proceedings of the 10th ESWC (2013).

9. Taheriyan, M., Knoblock, C. A., Szekely, P., and Ambite, J. L. Semiautomatically modeling web APIs to create linked APIs. In Proceedings of the ESWC 2012 Workshop on Linked APIs (2012).