Data Integration System for Linked Open Data Space

                                            © Kuznetcov Konstantin
                                    Lomonosov Moscow State University
                                        K.Kuznetcov@gmail.com


Abstract

    This paper describes research-in-progress work on a data integration
    system in the Linked Open Data space. The proposed system uses the concept
    of RDF identity links to interlink heterogeneous local data sources and
    integrate them into the global data space.
    Academic supervisor: Vladimir Serebriakov, serebr@ccas.ru.

1 Introduction

For the last few decades data integration has been one of the most pressing
problems of computer science. With the development of the IT industry,
countless data sources have emerged on the Internet. These data sources are
heterogeneous in all possible ways. Effective usage of such data sources is
impossible without automatic tools for data search, retrieval, publishing and
transformation.
    The original hypertext web was not well suited for automatic processing of
data from heterogeneous sources spread across the web. This led to the
emergence of various microformats and web APIs, and finally to the concept of
the Semantic Web. The Semantic Web implies the usage of a standard stack of
data formats and technologies intended to support data accumulation,
structuring and exchange across the web. The most important of these
technologies are RDF, RDFS, OWL and SPARQL.
    From the practical point of view, one of the most interesting Semantic Web
initiatives is the Linking Open Data project [8]. This project aims to populate
the web with a large volume of data structured according to Semantic Web
standards and to interlink semantic data sources. As a result, a global Linked
Open Data space should be established, similar to the hypertext web of linked
documents. Publishing data in the Linked Open Data space encourages reuse of
data, decreases data redundancy, maximizes its (real and potential)
inter-connectedness and enables network effects to add value to data. Data
providers can benefit from publishing their data in the LOD space.
Unfortunately, the process of publishing is not that simple and consists of
several steps. Small organizations often cannot afford to transform their data
into a LOD-acceptable form and then maintain the published dataset. At the
moment, there is no special software to support the full cycle of publishing
and managing linked open datasets. A system to support integration of data
from independent data sources into the web of linked data is required.

2 Related work

At the moment, there are quite a few solutions that support various steps
required to include one’s data into the Linked Open Data space, even though
they are based on existing hypertext web technologies. However, there is no
system that includes all the functionality recommended by the LOD project. The
most complex solution available now is the Virtuoso Universal Server [11]
platform. It provides tools for representing data from different sources
(relational databases, RDF storages, Web APIs, etc.) as a single virtual
database and supports RDF data publishing. Virtuoso offers SPARQL access to its
data and such features as an RDF crawler and a simple reasoner. Virtuoso can be
extended in multiple ways, e.g. published RDF data can be accompanied with voiD
descriptors. Unfortunately, all the extensions that are useful for linked data
publishing are made on the instrumental level and not on the data model level.
Virtuoso is commercial software with a limited open-source community edition.
Among other open-source solutions it is worth mentioning D2R Server [3], which
supports RDF data publishing from relational databases and SPARQL querying.
MASTRO [4] and the data integration system developed in Dorodnicyn Computing
Centre of RAS [2] provide richer semantic formalisms compared to D2R and
Virtuoso. However, the first system is bound to a single federated database and
the second one implies that its sources share some common URIs. Neither of them
provides any means for publishing or interlinking with other RDF datasets.
    Moreover, neither Virtuoso nor D2R Server goes beyond simple RDF
publishing. Their RDF resource interlinking capabilities are limited to URI
generation from templates. In many cases such templated URIs cannot expose
identity relations between RDF resources from different datasets and truly
interlink these datasets. There are several applications for RDF data
interlinking and link maintenance, such as SILK [12], LIMES, SemMF and
DSNotify. But at the moment none of these applications provides means for
integration of one’s RDF data into the whole Linked Data space. That is, there
are no tools that can automatically discover new related datasets on the web
and then set up and maintain links to the resources in these datasets. Some
proposals for such systems are made in [10].
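The limitation of templated URIs mentioned above can be illustrated with a minimal Python sketch. It is only an illustration, not part of any of the cited systems: the two base namespaces, the table names and the keys are hypothetical, and the helper functions `templated_uri` and `to_ntriples` are introduced here for the example. Two publishers mint URIs for the same real-world object from their own templates; the resulting strings are unrelated, so the identity has to be stated explicitly as an RDF triple with the owl:sameAs predicate.

```python
# Assumption: both base namespaces, table names and keys below are made up
# for illustration; only the owl:sameAs predicate URI is a real identifier.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def templated_uri(base, table, key):
    """Mint a resource URI from the template <base>/<table>/<key>."""
    return f"{base}/{table}/{key}"


def to_ntriples(triple):
    """Serialize a (subject, predicate, object) triple of URIs as N-Triples."""
    s, p, o = triple
    return f"<{s}> <{p}> <{o}> ."


# The same real-world object, stored under unrelated local keys
# by two independent publishers.
uri_a = templated_uri("http://example.org/parks", "park", 17)
uri_b = templated_uri("http://example.com/reserves", "site", "RU-0042")

# The templates alone give no hint that both URIs denote one object.
templates_agree = (uri_a == uri_b)

# An explicit identity link states the relation as an RDF triple.
identity_link = (uri_a, OWL_SAME_AS, uri_b)

print(templates_agree)
print(to_ntriples(identity_link))
```

Discovering which pairs of URIs deserve such a link is exactly the task that template-based publishing leaves open.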
    The possibilities of non-trivial usage of generated linksets are yet to be
explored. Very few applications take advantage of this feature of the Linked
Open Data space. It is worth mentioning the SPLENDID system [6] here, which
uses linkset statistics to optimize federated SPARQL queries. Some semantic
search engines also utilize voiD descriptions of linksets.

3 Problem statement

This article proposes a concept of an automated data integration system in the
Linked Open Data space. The proposed system should
    • Form a single dataset from multiple heterogeneous sources of structured
      or unstructured information in a similar knowledge domain and
      maintain/update the formed dataset;
    • Discover and store links between resources from the system dataset and
      resources from different Linked Open Data sets available on the Internet
      in RDF format, as well as implicit links between resources within the
      system dataset;
    • Publish the system dataset on the Internet in RDF format and provide
      access to it via a user interface and an API;
    • Provide users and external applications with a unified query interface to
      all of the system’s data sources;
    • Support different data source types (including relational databases and
      SPARQL endpoints) and support on-the-fly connection of new data sources;
    • Include a flexible ontology of the knowledge domain that follows the
      Linking Open Data project recommendations and can be extended to support
      new data sources.

4 System architecture overview

The proposed system will follow a modular architecture and will consist of the
following components:
    • Ontology of informational objects and links of interest to system data
      consumers and providers;
    • Linking subsystem, which should discover and store links between
      resources from the system’s data sources and/or resources from external
      Linked Data sources;
    • Publishing subsystem, which should provide users and applications with
      access to resources from the system’s dataset according to the LOD
      project recommendations;
    • Data integration subsystem, which will contain a mechanism to uniquely
      identify the system’s resources both within the system and in the Linked
      Open Data space and provide uniform access to all of the system’s
      resources. This subsystem will include a set of adapters that provide
      unified SPARQL access to the system’s data sources of different types
      (relational databases, Web APIs, etc.);
    • Harvesting and extraction subsystem with a set of harvester components,
      which will gather data from the system’s sources of unstructured data
      (text files, scanned documents, etc.), transform it into structured form
      and store it. This subsystem is a subject of future work.

4.1 Ontology

The system uses an OWL ontology to semantically organize objects and links that
match the concepts of interest from the knowledge domain of the system’s data
sources. The ontology consists of core terms and imported modules, which can be
added when some resources in a newly added data source require a more precise
definition. In the Linked Open Data space the ontology serves as the system’s
data vocabulary: it is used to establish terminological outgoing links to
external datasets and allows external applications to discover metadata to
establish incoming links. Following the principles of Linked Open Data, the
ontology is annotated in human language with such terms as rdfs:label or
rdfs:comment. The ontology’s terms should be defined in a URI namespace
controlled by the system. The ontology adopts common Linked Data vocabularies
such as Dublin Core, FOAF, vCard, PRISM, SIOC, Creative Commons, BibTex and
Schema.org. The core of the ontology is based on the ENIP RAS ontology.

4.2 Publishing subsystem

The publishing subsystem will serve as an entry point to the system for human
users and Linked Data applications. It should dereference URIs of the system’s
resources, i.e. return descriptions of the object or concept identified by
these URIs. This can be achieved by using a mechanism called content
negotiation. Depending on the HTTP GET request header, the publishing subsystem
will return either an HTML representation or an RDF/XML representation (as
required by Linked Data applications) of the resource.
    The publishing subsystem will receive data from the data integration
subsystem. To dereference a resource URI, the following information should be
requested from the data integration subsystem:
    • All the literal values of the resource and all incoming and outgoing RDF
      links. This information can be retrieved with simple SPARQL queries with
      patterns {<URI> ?x ?y} and {?x ?y <URI>};
    • Most likely the results of these simple requests will contain URIs of
      other resources of the system. Linked Data applications often traverse
      URIs they find in RDF documents. Therefore, to reduce the number of HTTP
      requests, the publishing subsystem should extend the aforementioned
      queries to some depth or according to some explicitly stated rules;
    • Information on the ontology class to which the requested resource belongs
      and all its ancestors;
    • Information on the dataset to which this resource belongs.
    All the information retrieved from the data integration subsystem will be
represented as a set of RDF triples. If an RDF document is requested, these
triples will be merged into the resulting RDF/XML document and returned to the
client. Otherwise the triples will be published as an HTML+RDFa document
generated from a template. These templates can
be specified in general form and then redefined for specific classes.

4.3 Data integration subsystem

The data integration subsystem will provide other subsystems or external agents
with a uniform access interface to all of the system’s data sources. Requested
information should be specified with a SPARQL query. This subsystem will be
responsible for presenting the system’s data as a single dataset in the Linked
Open Data space. There are several approaches to data integration systems: data
warehousing, data mediation and peer-to-peer systems. The proposed system is
supposed to work with multiple strongly autonomous data sources; therefore it
adopts the data mediation architecture. The drawback of such systems (e.g.
Virtuoso) is the huge number of network interactions required to produce a
query answer. The proposed system uses Linked Open Data principles to reduce
this drawback.
    Data sources will be connected to the system via adapters, which are SPARQL
endpoints capable of querying data sources in terms of the system ontology.
These adapters should be generic, configurable components (e.g. a JDBC adapter,
a REST API adapter). As opposed to existing data integration systems with
semantic capabilities (e.g. Virtuoso with its Sponger cartridges), resources
from different data sources will not be merged into a single dataset by
assigning the same URI to identical resources. Instead, in the spirit of Linked
Open Data, every data source should be considered to contain unique resources
and get its own sub-namespace (like http://<system_URL>/datasets/<source_id>).
The adapter should map every resource from its data source to an HTTP URI from
this namespace. Therefore we will be able to track the origin of a resource by
its URI. When a new data source is added to the system, its adapter will be
configured by specifying generic adapter settings (e.g. a JDBC connection
string), a general Dublin Core description of the source, topic of interest
categorization, licensing information, etc. The adapter configuration also
includes the set of ontology classes and properties to which data in this
source belongs. This information can be entered manually or obtained with
SPARQL ASK requests. Next, the adapter configuration will be published as a
voiD [1] descriptor of the dataset. All such datasets are subsets (in terms of
voiD) of the system’s whole dataset. However, all of them will be accessed via
a single SPARQL endpoint. Such a structure preserves autonomy and independence
of data sources while integrating them all together in the Linked Open Data
space.
    Execution of queries in the data integration subsystem will be carried out
as follows. The first step is SPARQL query rewriting according to the axioms of
the ontology, as described in [2]. Then algebraic query optimization techniques
are applied. The result of this phase, in terms of description logic, is a
union of conjunctive queries with simple constraints. In the second step the
set of relevant data sources for each atom of each conjunctive query is
determined according to the configuration of adapters. If there are no data
sources relevant to an atom, the entire conjunctive query is dropped. As a
result, a union of conjunctive queries with atoms of different datasets will be
obtained.
    Traditionally, the next step in the data mediation process is construction
of the physical query plan and its execution. During execution of a query with
atoms related to different data sources, the results of subqueries to these
data sources are joined. However, in the proposed system subquery results can
be joined directly on literal field values but not on URIs, because data
sources are presented in the form of independent Linked Open Data sets and do
not share common URIs. If subqueries to different data sources are to be joined
on URIs, we will have to use the sets of links generated by the linking
subsystem between these data sources. Each conjunctive query is a graph pattern
whose vertices are literal values, URIs or variables, and whose edges are
labeled with predicates in terms of different data sources. If two adjacent
edges are labeled with predicates from different sources, it is necessary to
refer to the linkset for this pair of sources and select resource pairs that
satisfy the given part of the graph pattern. By performing this operation on
all the links in the conjunctive query, we will obtain the set of resource URIs
that satisfies the part of the pattern that defines relationships between
different data sources. Then the subquery parts related to specific data
sources will be executed by adapters, with the corresponding join variables
being replaced by URIs from linksets. At this step traditional query
optimization techniques can be applied again.

4.4 Linking subsystem

RDF documents published in the Linked Open Data space are required to contain
outgoing links. These outgoing links are RDF triples in which the subject is
the URI of a resource from the local namespace and the URI of the object and/or
predicate belongs to the namespace of another dataset. The most important type
of outgoing links is identity links, which point at URI aliases used by other
data sources to identify the same real-world object or abstract concept.
Identity links can use such predicates as owl:sameAs, rdfs:seeAlso or special
SKOS terms. Although the uses of the predicate owl:sameAs in the LOD space are
often contrary to the semantics of OWL [7], its use is recommended by the W3C
Technical Architecture Group. The linking subsystem will be responsible for the
discovery, storage and maintenance of identity links. Properties of a link
include the pair of URIs, link generation time and method, date of the last
link check and similarity factor. When the link is published, either the
owl:sameAs or the rdfs:seeAlso predicate is used in the triple, depending on
the similarity factor value.
    The linking subsystem will work as follows. In the first step, the two data
sources to be interlinked are found. For this pair an initially empty voiD
linkset is created and published. When a new data source is added to the
system, the linksets between this new data source and all existing datasets
from other sources are automatically created. A linkset between internal dataset and external
Linked Open Data set will be created in one of the following cases:
    • The user can manually select a pair of datasets for linking;
    • Relevant datasets can be discovered using the HTTP referrer technique
      described in [9];
    • Relevant datasets can be discovered by the linking subsystem itself by
      traversing links in an external dataset that is already linked to one of
      the internal datasets.
    In the second step, when the two target datasets for interlinking have been
selected, the subsystem will cluster the datasets by classes and determine
pairs of clusters to be interlinked. This should be done to reduce the number
of pairwise comparisons of dataset elements. In the case of two internal
datasets, both of them are described by the same ontology, so that pairs of
clusters contain instances of the same ontology classes. In the case of linking
to an external dataset, the subsystem might select pairs of classes with the
help of different ontology mapping techniques [5], as well as using discovered
or manually specified ontology mapping rules.
    The third and final step of interlinking involves pairwise comparison of
cluster elements to detect pairs of elements related by identity. These
relations will be detected using rules in the SILK LSL language. In the case of
internal data sources, rules will be declared together with the ontology and
determine which instances of the same class are identical. In the case of an
external dataset, rules will be either specified manually or derived from the
existing rules and ontology mapping rules.
    Complete linking would be achieved by pairwise comparison of all elements
of all datasets (both internal and external), but in practice such comparison
is infeasible. Link generation optimization requires additional study.

5 Conclusion

This paper proposes a concept of a data integration system oriented towards the
Linked Open Data space. The novelty of this concept lies in its hybrid
approach: the proposed system combines data mediation and data warehousing
approaches by using locally stored linksets as indexes for a search engine. To
the author’s knowledge, such a method has not been implemented yet. Besides,
while there are works dedicated to bringing single data sources into the LOD
space or dealing with multiple sources already present in the LOD space, the
idea of bringing multiple data sources into the LOD space via a single data
integration system has received very little attention.
    Currently, a proof-of-concept system is being developed in CC RAS as a part
of a practical project dedicated to integration of data on protected sites and
animal species. While participating in a group on this project, the author is
working on query answering algorithms in the presence of linksets. As a result
of this project, a large set of data on national parks should emerge in the LOD
space, and if incoming links from external datasets appear, the project would
be considered successful.
    Future work on this project might include the study of link network
generation and maintenance algorithms. The system can also be extended with
modules to access external Semantic Web resource aggregators (sig.ma) and
semantic search engines (sindice.com). Also, additional studies in the
management of licensing and data access in the context of Linked Open Data are
required.

References

[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked
     datasets. In Proceedings of the WWW2009 Workshop on Linked Data on the
     Web, 2009.
[2] A. A. Bezdushny. Formal Model of Ontology-Based Data Integration Systems.
     Novosibirsk, 2008.
[3] C. Bizer, R. Cyganiak. D2RQ - Lessons Learned. Position paper for the W3C
     Workshop on RDF Access to Relational Databases, 2007.
     http://www.w3.org/2007/03/RdfRDB/papers/d2rq-positionpaper/
[4] D. Calvanese, G. De Giacomo, D. Lembo et al. The MASTRO system for
     ontology-based data access. Semantic Web Journal, volume 2, number 1,
     pages 43-53, 2011.
[5] J. Euzenat, A. Ferrara, et al. First results of the ontology alignment
     evaluation initiative 2011. In Proc. of 6th Ontology Matching Workshop
     (OM’11), at International Semantic Web Conference (ISWC’11), Bonn,
     Germany, 2011.
[6] O. Gorlitz, S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID
     Descriptions. In Proceedings of the 2nd International Workshop on
     Consuming Linked Data, Bonn, Germany, 2011.
[7] H. Halpin, P. Hayes, J. McCusker, D. Mcguinness, and H. Thompson. When
     owl:sameas isn't the same: An analysis of identity in linked data. In
     Proceedings of the 9th International Semantic Web Conference, 2010.
[8] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data
     Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and
     Technology, 1:1, 1-136. Morgan & Claypool, 2011.
     http://linkeddatabook.com/editions/1.0/
[9] H. Muhleisen and A. Jentzsch. Augmenting the Web of Data using Referers.
     In Linked Data on the Web (LDOW2011), Mar. 2011.
[10] A. Nikolov and M. d'Aquin. Identifying Relevant Sources for Data Linking
     using a Semantic Web Index. In 4th Workshop on Linked Data on the Web
     (LDOW 2011) at 20th International World Wide Web Conference (WWW 2011),
     Hyderabad, India, 2011.
[11] Virtuoso Universal Server, 2011. http://virtuoso.openlinksw.com/
[12] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and
     maintaining links on the web of data. In Proceedings of the International
     Semantic Web Conference, pages 650–665, 2009.