=Paper=
{{Paper
|id=None
|storemode=property
|title=Data integration system for Linked Open Data space
|pdfUrl=https://ceur-ws.org/Vol-899/paper3.pdf
|volume=Vol-899
|dblpUrl=https://dblp.org/rec/conf/syrcodis/Konstantin12
}}
==Data integration system for Linked Open Data space==
Data Integration System for Linked Open Data Space
© Kuznetcov Konstantin
Lomonosov Moscow State University
K.Kuznetcov@gmail.com
Abstract

This paper describes research in progress on a data integration system
for the Linked Open Data space. The proposed system uses RDF identity
links to interlink heterogeneous local data sources and integrate them
into the global data space.

Academic supervisor: Vladimir Serebriakov, serebr@ccas.ru.

1 Introduction

For the last few decades, data integration has been one of the most
pressing problems in computer science. With the development of the IT
industry, countless data sources have emerged on the Internet. These data
sources are heterogeneous in every possible way. Effective use of such
data sources is impossible without automatic tools for data search,
retrieval, publishing and transformation.

The original hypertext web was poorly suited to automatic processing of
data from heterogeneous sources spread across the web. This led to the
emergence of various microformats and web APIs, and finally to the
concept of the Semantic Web. The Semantic Web implies the use of a
standard stack of data formats and technologies intended to support data
accumulation, structuring and exchange across the web. The most important
of these technologies are RDF, RDFS, OWL and SPARQL.

From a practical point of view, one of the most interesting Semantic Web
initiatives is the Linking Open Data project [8]. This project aims to
populate the web with large volumes of data structured according to
Semantic Web standards and to interlink semantic data sources. As a
result, a global Linked Open Data space should be established, similar to
the hypertext web of linked documents. Publishing data in the Linked Open
Data space encourages reuse of data, decreases data redundancy, maximizes
its (real and potential) inter-connectedness and enables network effects
to add value to data. Data providers can benefit from publishing their
data in the LOD space. Unfortunately, the publishing process is not
simple and consists of several steps. Small organizations often cannot
afford to transform their data into an LOD-acceptable form and then
maintain the published dataset. At present, there is no dedicated
software that supports the full cycle of publishing and managing linked
open datasets. A system that supports integration of data from
independent data sources into the web of linked data is required.

2 Related work

At present, there are rather few solutions that support the various steps
required to include one’s data in the Linked Open Data space, even though
they are based on existing hypertext web technologies, and no system
includes all the functionality recommended by the LOD project. The most
comprehensive solution currently available is the Virtuoso Universal
Server [11] platform. It provides tools for representing data from
different sources (relational databases, RDF stores, Web APIs, etc.) as a
single virtual database and supports RDF data publishing. Virtuoso offers
SPARQL access to its data and features such as an RDF crawler and a
simple reasoner. Virtuoso can be extended in multiple ways, e.g.
published RDF data can be accompanied by voiD descriptors. Unfortunately,
all the extensions that are useful for linked data publishing are
implemented at the tooling level rather than at the data model level.
Virtuoso is commercial software with a limited open-source community
edition. Among other open-source solutions it is worth mentioning D2R
Server [3], which supports RDF data publishing from relational databases
and SPARQL querying. MASTRO [4] and the data integration system developed
in Dorodnicyn Computing Centre of RAS [2] provide richer semantic
formalisms than D2R and Virtuoso. However, the first system is bound to a
single federated database, and the second one assumes that its sources
share some common URIs. Neither of them provides any means for publishing
RDF datasets or interlinking with other RDF datasets.

Moreover, neither Virtuoso nor D2R Server goes beyond simple RDF
publishing. Their RDF resource interlinking capabilities are limited to
URI generation from templates. In many cases such templated URIs cannot
expose identity relations between RDF resources from different datasets
and therefore cannot truly interlink these datasets. There are several
applications for RDF data interlinking and link maintenance, such as SILK
[12], LIMES, SemMF and DSNotify. At the moment, however, none of these
applications provides means for integrating one’s RDF data into the
Linked Data space as a whole; that is, there are no tools that can
automatically discover new related datasets on the web and then set up
and maintain links to the resources in these datasets. Some proposals for
such systems are made in [10].
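The limitation of templated URIs noted in the related work can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two URI templates, the example hosts and the record keys are hypothetical and do not come from Virtuoso, D2R Server or the proposed system. It only shows that two sources minting URIs from their own templates produce unrelated identifiers for the same object, so an explicit identity link has to be stated separately.

```python
# Hypothetical URI templates for two independent sources (illustrative only).
TEMPLATE_A = "http://example.org/parks/{id}"
TEMPLATE_B = "http://example.com/resource/park/{slug}"

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def mint_uri(template: str, **fields: str) -> str:
    """Build a resource URI by filling a source-specific template."""
    return template.format(**fields)


def identity_link(uri_a: str, uri_b: str) -> str:
    """Serialize an owl:sameAs identity link as one N-Triples line."""
    return f"<{uri_a}> <{OWL_SAME_AS}> <{uri_b}> ."


# The same real-world park, keyed differently in each source.
uri_a = mint_uri(TEMPLATE_A, id="42")
uri_b = mint_uri(TEMPLATE_B, slug="losiny-ostrov")

# The templated URIs differ, so nothing connects the two descriptions
# until an identity link between them is stated explicitly.
link = identity_link(uri_a, uri_b)
print(link)
```

Discovering which pairs of URIs deserve such a link is the part that templates alone cannot provide.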
The possibilities for non-trivial use of generated linksets are yet to be
explored. Very few applications take advantage of this feature of the
Linked Open Data space. It is worth mentioning the SPLENDID system [6]
here, which uses linkset statistics to optimize federated SPARQL queries.
Some semantic search engines also utilize voiD descriptions of linksets.

3 Problem statement

This article proposes a concept of an automated data integration system
for the Linked Open Data space. The proposed system should
• Form a single dataset from multiple heterogeneous sources of
structured or unstructured information in similar knowledge domains, and
maintain and update the formed dataset;
• Discover and store links between resources from the system dataset
and resources from different Linked Open Data sets available on the
Internet in RDF format, as well as implicit links between resources
within the system dataset;
• Publish the system dataset on the Internet in RDF format and provide
access to it via a user interface and an API;
• Provide users and external applications with a unified query
interface to all of the system’s data sources;
• Support different data source types (including relational databases
and SPARQL endpoints) and support on-the-fly connection of new data
sources;
• Include a flexible ontology of the knowledge domain that follows the
Linking Open Data project recommendations and can be extended to support
new data sources.

4 System architecture overview

The proposed system will follow a modular architecture and will consist
of the following components:
• Ontology of informational objects and links of interest to system
data consumers and providers;
• Linking subsystem, which should discover and store links between
resources from the system’s data sources and/or resources from external
Linked Data sources;
• Publishing subsystem, which should provide users and applications
with access to resources from the system’s dataset according to the LOD
project recommendations;
• Data integration subsystem, which will contain a mechanism to
uniquely identify the system’s resources both within the system and in
the Linked Open Data space and provide uniform access to all of the
system’s resources. This subsystem will include a set of adapters that
provide unified SPARQL access to the system’s data sources of different
types (relational databases, Web APIs, etc.);
• Harvesting and extraction subsystem with a set of harvester
components, which will gather data from the system’s sources of
unstructured data (text files, scanned documents, etc.), transform it
into structured form and store it. This subsystem is a subject of future
work.

4.1 Ontology

The system uses an OWL ontology to semantically organize objects and
links that match the concepts of interest from the knowledge domain of
the system’s data sources. The ontology consists of core terms and
imported modules, which can be added when some resources in a newly added
data source require a more precise definition. In the Linked Open Data
space the ontology serves as the system’s data vocabulary: it is used to
establish terminological outgoing links to external datasets, and it
allows external applications to discover metadata in order to establish
incoming links. Following the principles of Linked Open Data, the
ontology is annotated in human language with terms such as rdfs:label or
rdfs:comment. The ontology’s terms should be defined in a URI namespace
controlled by the system. The ontology adopts common Linked Data
vocabularies such as Dublin Core, FOAF, vCard, PRISM, SIOC, Creative
Commons, BibTex and Schema.org. The core of the ontology is based on the
ENIP RAS ontology.

4.2 Publishing subsystem

The publishing subsystem will serve as an entry point to the system for
human users and Linked Data applications. It should dereference URIs of
the system’s resources, i.e. return descriptions of the object or concept
identified by these URIs. This can be achieved by using a mechanism
called content negotiation. Depending on the HTTP GET request header, the
publishing subsystem will return either an HTML representation or an
RDF/XML representation (as required by Linked Data applications) of the
resource.

The publishing subsystem will receive data from the data integration
subsystem. To dereference a resource URI, the following information
should be requested from the data integration subsystem:
• All the literal values of the resource and all incoming and outgoing
RDF links. This information can be retrieved with simple SPARQL queries
with patterns { <URI> ?x ?y } and { ?x ?y <URI> };
• Most likely the results of these simple requests will contain URIs of
other resources of the system. Linked Data applications often traverse
URIs they find in RDF documents. Therefore, to reduce the number of HTTP
requests, the publishing subsystem should extend the aforementioned
requests to some depth, or by applying some explicitly stated rules;
• Information on the ontology class to which the requested resource
belongs and all its ancestors;
• Information on the dataset to which this resource belongs.

All the information retrieved from the data integration subsystem will be
represented as a set of RDF triples. If an RDF document is requested,
these triples will be merged into the resulting RDF/XML document and
returned to the client. In the other case the triples will be published
as an HTML+RDFa document generated from a template. These templates can
be specified in general form and then redefined for specific classes.

4.3 Data integration subsystem

The data integration subsystem will provide other subsystems and external
agents with a uniform access interface to all of the system’s data
sources. Requested information should be specified with a SPARQL query.
This subsystem will be responsible for presenting the system’s data as a
single dataset in the Linked Open Data space. There are several
approaches to data integration systems: data warehousing, data mediation
and peer-to-peer systems. The proposed system is supposed to work with
multiple strongly autonomous data sources; therefore it adopts the data
mediation architecture. The drawback of such systems (e.g. Virtuoso) is
the large number of network interactions required to produce a query
answer. The proposed system uses Linked Open Data principles to reduce
this drawback.

Data sources will be connected to the system via adapters, which are
SPARQL endpoints capable of querying data sources in terms of the system
ontology. These adapters should be generic, configurable components (e.g.
a JDBC adapter, a REST API adapter). As opposed to existing data
integration systems with semantic capabilities (e.g. Virtuoso with its
Sponger cartridges), resources from different data sources will not be
merged into a single dataset by assigning the same URI to identical
resources. Instead, in the spirit of Linked Open Data, every data source
should be considered to contain unique resources and should get its own
sub-namespace (like http:///datasets/). The adapter should map every
resource from its data source to an HTTP URI from this namespace.
Therefore we will be able to track the origin of a resource by its URI.
When a new data source is added to the system, its adapter will be
configured by specifying generic adapter settings (e.g. a JDBC connection
string), a general Dublin Core description of the source, topic of
interest categorization, licensing information, etc. The adapter
configuration also includes the set of ontology classes and properties to
which the data in this source belongs. This information can be entered
manually or obtained with SPARQL ASK requests. Next, the adapter
configuration will be published as a voiD [1] descriptor of the dataset.
All such datasets are subsets (in terms of voiD) of the system’s whole
dataset. However, all of them will be accessed via a single SPARQL
endpoint. Such a structure preserves the autonomy and independence of the
data sources while integrating them all together in the Linked Open Data
space.

Execution of queries in the data integration subsystem will be carried
out as follows. The first step is SPARQL query rewriting according to the
axioms of the ontology, as described in [2]. Then algebraic query
optimization techniques are applied. The result of this phase, in terms
of description logic, is a union of conjunctive queries with simple
constraints. In the second step the set of relevant data sources for each
atom of each conjunctive query is determined according to the
configuration of the adapters. If there are no data sources relevant to
an atom, the entire conjunctive query is dropped. As a result, a union of
conjunctive queries with atoms of different data sets will be obtained.

Traditionally, the next step in the data mediation process is the
construction of the physical query plan and its execution. During
execution of a query with atoms related to different data sources, the
results of subqueries to these data sources are joined. However, in the
proposed system subquery results can be joined directly only on literal
field values and not on URIs, because the data sources are presented as
independent Linked Open Data sets and do not share common URIs. If
subqueries to different data sources are to be joined on URIs, we will
have to use the sets of links generated by the linking subsystem between
these data sources. Each conjunctive query is a graph pattern whose
vertices are literal values, URIs or variables, and whose edges are
labeled with predicates in terms of different data sources. If two
adjacent edges are labeled with predicates from different sources, it is
necessary to refer to the linkset for this pair of sources and select
resource pairs that satisfy the given part of the graph pattern. By
performing this operation on all the links in the conjunctive query, we
will obtain the set of resource URIs that satisfies the part of the
pattern that defines relationships between different data sources. Then
the subquery parts related to specific data sources will be executed by
adapters, with the corresponding join variables replaced by URIs from the
linksets. At this step traditional query optimization techniques can be
applied again.

4.4 Linking subsystem

RDF documents published in the Linked Open Data space are required to
contain outgoing links. These outgoing links are RDF triples whose
subject is the URI of a resource from the local namespace and whose
object and/or predicate URI belongs to the namespace of another dataset.
The most important type of outgoing links is identity links, which point
at URI aliases used by other data sources to identify the same real-world
object or abstract concept. Identity links can use such predicates as
owl:sameAs, rdfs:seeAlso or special SKOS terms. Although the uses of the
predicate owl:sameAs in the LOD space are often contrary to the semantics
of OWL [7], its use is recommended by the W3C Technical Architecture
Group. The linking subsystem will be responsible for the discovery,
storage and maintenance of identity links. The properties of a link
include the pair of URIs, the link generation time and method, the date
of the last link check and a similarity factor. When the link is
published, either the owl:sameAs or the rdfs:seeAlso predicate is used in
the triple, depending on the similarity factor value.

The linking subsystem will work as follows. In the first step, the two
data sources to be interlinked are found. For this pair an initially
empty voiD linkset is created and published. When a new data source is
added to the system, the linksets between this new data source and all
existing datasets from other sources are created automatically. A linkset
between an internal dataset and an external
Linked Open Data set will be created in one of the following cases:
• The user can manually select a pair of datasets for linking;
• Relevant datasets can be discovered using the HTTP referrer technique
described in [9];
• Relevant datasets can be discovered by the linking subsystem itself
by traversing links in an external dataset that is already linked to one
of the internal datasets.

When two target datasets for interlinking have been selected, the
subsystem will cluster the datasets by classes and determine the pairs of
clusters to be interlinked. This should be done to reduce the number of
pairwise comparisons of dataset elements. In the case of two internal
datasets, both of them are described by the same ontology, so the pairs
of clusters contain instances of the same ontology classes. In the case
of linking to an external dataset, the subsystem might select pairs of
classes with the help of different ontology mapping techniques [5], as
well as by using discovered or manually specified ontology mapping rules.

The third and final step of interlinking involves pairwise comparison of
cluster elements to detect pairs of resources connected by identity
relations. These relations will be detected using SILK LSL language
rules. In the case of internal data sources, the rules will be declared
together with the ontology and determine which instances of the same
class are identical. In the case of an external dataset, the rules will
be either specified manually or derived from the existing rules and the
ontology mapping rules.

Complete interlinking would require pairwise comparison of all elements
of all datasets (both internal and external), which is infeasible in
practice. Link generation optimization requires additional study.

5 Conclusion

This paper proposes a concept of a data integration system oriented
towards the Linked Open Data space. The novelty of this concept lies in
its hybrid approach: the proposed system combines the data mediation and
data warehousing approaches by using locally stored linksets as indexes
for a search engine. To the author’s knowledge, such a method has not
been implemented yet. Besides, while there are works dedicated to
bringing single data sources into the LOD space or dealing with multiple
sources already present in the LOD space, the idea of bringing multiple
data sources into the LOD space via a single data integration system has
received very little attention.

Currently, a proof-of-concept system is being developed in CC RAS as part
of a practical project dedicated to the integration of data on protected
sites and animal species. While participating in a group on this project,
the author is working on query answering algorithms in the presence of
linksets. As a result of this project, a large set of data on national
parks should emerge in the LOD space, and if incoming links from external
datasets appear, the project will be considered successful.

Future work on this project might include the study of algorithms for
link network generation and maintenance. The system can also be extended
with modules to access external Semantic Web resource aggregators
(sig.ma) and semantic search engines (sindice.com). Additional studies of
licensing and data access management in the context of Linked Open Data
are also required.

References

[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing
linked datasets. In Proceedings of the WWW2009 Workshop on Linked Data on
the Web, 2009.
[2] A. A. Bezdushny. Formal Model of Ontology-Based Data Integration
Systems. Novosibirsk, 2008.
[3] C. Bizer, R. Cyganiak. D2RQ - Lessons Learned. Position paper for the
W3C Workshop on RDF Access to Relational Databases, 2007.
http://www.w3.org/2007/03/RdfRDB/papers/d2rq-positionpaper/
[4] D. Calvanese, G. De Giacomo, D. Lembo et al. The MASTRO system for
ontology-based data access. Semantic Web Journal, volume 2, number 1,
pages 43-53, 2011.
[5] J. Euzenat, A. Ferrara, et al. First results of the ontology
alignment evaluation initiative 2011. In Proc. of 6th Ontology Matching
Workshop (OM‘11), at International Semantic Web Conference (ISWC‘11),
Bonn, Germany, 2011.
[6] O. Gorlitz, S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting
VOID Descriptions. Proceedings of the 2nd International Workshop on
Consuming Linked Data, Bonn, Germany, 2011.
[7] H. Halpin, P. Hayes, J. McCusker, D. McGuinness, and H. Thompson.
When owl:sameAs isn't the same: An analysis of identity in linked data.
In Proceedings of the 9th International Semantic Web Conference, 2010.
[8] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global
Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory
and Technology, 1:1, 1-136. Morgan & Claypool, 2011.
http://linkeddatabook.com/editions/1.0/
[9] H. Muhleisen and A. Jentzsch. Augmenting the Web of Data using
Referers. Linked Data on the Web (LDOW2011), Mar. 2011.
[10] A. Nikolov and M. d'Aquin. Identifying Relevant Sources for Data
Linking using a Semantic Web Index. 4th Workshop on Linked Data on the
Web (LDOW 2011) at 20th International World Wide Web Conference (WWW
2011), Hyderabad, India, 2011.
[11] Virtuoso Universal Server, 2011. http://virtuoso.openlinksw.com/
[12] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and
maintaining links on the web of data. In Proceedings of the International
Semantic Web Conference, pages 650–665, 2009.