Data Integration System for Linked Open Data Space

© Kuznetcov Konstantin
Lomonosov Moscow State University
K.Kuznetcov@gmail.com

Abstract

This paper describes research in progress on a data integration system for the Linked Open Data space. The proposed system uses RDF identity links to interlink heterogeneous local data sources and integrate them into the global data space.

Academic supervisor: Vladimir Serebriakov, serebr@ccas.ru.

1 Introduction

For the last few decades data integration has been one of the most pressing problems of computer science. With the development of the IT industry, countless data sources have emerged on the Internet. These data sources are heterogeneous in every possible way. Effective use of such data sources is impossible without automatic tools for data search, retrieval, publishing and transformation.

The original hypertext web was not well suited to automatic processing of data from heterogeneous sources spread across the web. This led to the emergence of various microformats and web APIs, and finally to the concept of the Semantic Web. The Semantic Web implies the use of a standard stack of data formats and technologies intended to support data accumulation, structuring and exchange across the web. The most important of these technologies are RDF, RDFS, OWL and SPARQL.

From a practical point of view, one of the most interesting Semantic Web initiatives is the Linking Open Data project [8]. This project aims to populate the web with large volumes of data structured according to Semantic Web standards and to interlink semantic data sources. As a result, a global Linked Open Data space should be established, similar to the hypertext web of linked documents. Publishing data in the Linked Open Data space encourages reuse of data, decreases data redundancy, maximizes its (real and potential) inter-connectedness and enables network effects to add value to data.

Data providers can benefit from publishing their data in the LOD space. Unfortunately, the process of publishing is not simple and consists of several steps. Small organizations often cannot afford to transform their data into a LOD-acceptable form and then maintain the published dataset. At the moment, there is no dedicated software that supports the full cycle of publishing and managing linked open datasets. A system that supports integration of data from independent data sources into the web of linked data is required.

2 Related work

At the moment, there are quite few solutions that support the various steps required to include one's data in the Linked Open Data space, even though these steps are based on existing hypertext web technologies, and there is no system that includes all the functionality recommended by the LOD project. The most comprehensive solution available now is the Virtuoso Universal Server [11] platform. It provides tools for representing data from different sources (relational databases, RDF stores, Web APIs, etc.) as a single virtual database and supports RDF data publishing. Virtuoso offers SPARQL access to its data and such features as an RDF crawler and a simple reasoner. Virtuoso can be extended in multiple ways, e.g. published RDF data can be accompanied by voiD descriptors. Unfortunately, all the extensions that are useful for linked data publishing are made at the tooling level and not at the data model level.

Virtuoso is commercial software with a limited open-source community edition. Among other open-source solutions it is worth mentioning D2R Server [3], which supports RDF data publishing from relational databases and SPARQL querying. MASTRO [4] and the data integration system developed at the Dorodnicyn Computing Centre of RAS [2] provide richer semantic formalisms than D2R and Virtuoso. However, the first system is bound to a single federated database, and the second one assumes that its sources share some common URIs. Neither of them provides any means for publishing or for interlinking with other RDF datasets.

Virtuoso and D2R Server, in turn, do not go beyond simple RDF publishing. Their RDF resource interlinking capabilities are limited to URI generation from templates. In many cases such templated URIs cannot expose identity relations between RDF resources from different datasets and truly interlink these datasets. There are several applications for RDF data interlinking and link maintenance, such as SILK [12], LIMES, SemMF and DSNotify. But at the moment none of these applications provides means for integrating one's RDF data into the Linked Data space as a whole. That is, there are no tools that can automatically discover new related datasets on the web and then set up and maintain links to the resources in these datasets. Some proposals for such systems are made in [10].

The possibilities of non-trivial use of generated linksets are yet to be explored. Very few applications take advantage of this feature of the Linked Open Data space. It is worth mentioning the SPLENDID system [6] here, which uses linkset statistics to optimize federated SPARQL queries. Some semantic search engines also utilize voiD descriptions of linksets.

3 Problem statement

This article proposes a concept of an automated data integration system in the Linked Open Data space. The proposed system should:
• Form a single dataset from multiple heterogeneous sources of structured or unstructured information in a similar knowledge domain and maintain and update the formed dataset;
• Discover and store links between resources from the system dataset and resources from different Linked Open Data sets available on the Internet in RDF format, as well as implicit links between resources within the system dataset;
• Publish the system dataset on the Internet in RDF format and provide access to it via a user interface and an API;
• Provide users and external applications with a unified query interface to all of the system's data sources;
• Support different data source types (including relational databases and SPARQL endpoints) and support on-the-fly connection of new data sources;
• Include a flexible ontology of the knowledge domain that follows the Linking Open Data project recommendations and can be extended to support new data sources.

4 System architecture overview

The proposed system will follow a modular architecture and will consist of the following components:
• Ontology of informational objects and links of interest to the system's data consumers and providers;
• Linking subsystem, which should discover and store links between resources from the system's data sources and/or resources from external Linked Data sources;
• Publishing subsystem, which should provide users and applications with access to resources from the system's dataset according to the LOD project recommendations;
• Data integration subsystem, which will contain a mechanism to uniquely identify the system's resources both within the system and in the Linked Open Data space and will provide uniform access to all of the system's resources. This subsystem will include a set of adapters that provide unified SPARQL access to the system's data sources of different types (relational databases, Web APIs, etc.);
• Harvesting and extraction subsystem with a set of harvester components, which will gather data from the system's sources of unstructured data (text files, scanned documents, etc.), transform it into structured form and store it. This subsystem is a subject of future work.

4.1 Ontology

The system uses an OWL ontology to semantically organize objects and links that match the concepts of interest from the knowledge domain of the system's data sources. The ontology consists of core terms and imported modules, which can be added when some resources in a newly added data source require a more precise definition. In the Linked Open Data space the ontology serves as the system's data vocabulary: it is used to establish terminological outgoing links to external datasets and allows external applications to discover the metadata needed to establish incoming links. Following the principles of Linked Open Data, the ontology is annotated in human language with such terms as rdfs:label or rdfs:comment. The ontology's terms should be defined in a URI namespace controlled by the system. The ontology adapts common Linked Data vocabularies such as Dublin Core, FOAF, vCard, PRISM, SIOC, Creative Commons, BibTex and Schema.org. The core of the ontology is based on the ENIP RAS ontology.

4.2 Publishing subsystem

The publishing subsystem will serve as an entry point to the system for human users and Linked Data applications. It should dereference URIs of the system's resources, i.e. return descriptions of the object or concept identified by these URIs. This can be achieved by using a mechanism called content negotiation. Depending on the HTTP GET request header, the publishing subsystem will return either an HTML representation or an RDF/XML representation (as required by Linked Data applications) of the resource.

The publishing subsystem will receive data from the data integration subsystem. To dereference a resource URI, the following information should be requested from the data integration subsystem:
• All the literal values of the resource and all incoming and outgoing RDF links. This information can be retrieved with simple SPARQL queries with patterns { ?x ?y} and {?x ?y };
• Most likely the results of these simple requests will contain URIs of other resources of the system. Linked Data applications often traverse URIs they find in RDF documents. Therefore, to reduce the number of HTTP requests, the publishing subsystem should extend the aforementioned requests to some depth, or by applying some explicitly stated rules;
• Information on the ontology class to which the requested resource belongs and on all its ancestors;
• Information on the dataset to which this resource belongs.

All the information retrieved from the data integration subsystem will be represented as a set of RDF triples. If an RDF document is requested, these triples will be merged into the resulting RDF/XML document and returned to the client. Otherwise the triples will be published as an HTML+RDFa document generated from a template. These templates can be specified in general form and then redefined for specific classes.

4.3 Data integration subsystem

The data integration subsystem will provide other subsystems and external agents with a uniform access interface to all of the system's data sources. Requested information should be specified with a SPARQL query. This subsystem will be responsible for presenting the system's data as a single dataset in the Linked Open Data space. There are several approaches to data integration systems: data warehousing, data mediation and peer-to-peer systems. The proposed system is supposed to work with multiple strongly autonomous data sources; therefore it adopts the data mediation architecture. The drawback of such systems (e.g. Virtuoso) is the large number of network interactions required to produce a query answer. The proposed system uses Linked Open Data principles to reduce this drawback.

Data sources will be connected to the system via adapters, which are SPARQL endpoints capable of querying data sources in terms of the system ontology. These adapters should be generic, configurable components (e.g. a JDBC adapter, a REST API adapter). As opposed to existing data integration systems with semantic capabilities (e.g. Virtuoso with its Sponger cartridges), resources from different data sources will not be merged into a single dataset by assigning the same URI to identical resources. Instead, in the spirit of Linked Open Data, every data source should be considered to contain unique resources and get its own sub-namespace (like http:///datasets/). The adapter should assign every resource from its data source an HTTP URI from this namespace. Therefore we will be able to track the origin of a resource by its URI.

When a new data source is added to the system, its adapter will be configured by specifying generic adapter settings (e.g. a JDBC connection string), a general Dublin Core description of the source, topic-of-interest categorization, licensing information, etc. The adapter configuration also includes the set of ontology classes and properties to which the data in the source belongs. This information can be entered manually or obtained with SPARQL ASK requests. Next, the adapter configuration will be published as a voiD [1] descriptor of the dataset. All such datasets are subsets (in terms of voiD) of the system's whole dataset. However, all of them will be accessed via a single SPARQL endpoint. Such a structure preserves the autonomy and independence of data sources while integrating them all together in the Linked Open Data space.

Execution of queries in the data integration subsystem will be carried out as follows. The first step is SPARQL query rewriting according to the axioms of the ontology, as described in [2]. Then algebraic query optimization techniques are applied. The result of this phase, in terms of description logic, is a union of conjunctive queries with simple constraints. In the second step the set of relevant data sources for each atom of each conjunctive query is determined according to the configuration of adapters. If there are no data sources relevant to an atom, the entire conjunctive query is dropped. As a result, a union of conjunctive queries with atoms of different datasets will be obtained.

Traditionally, the next step in the data mediation process is the construction of the physical query plan and its execution. During execution of a query with atoms related to different data sources, the results of subqueries to these data sources are joined. However, in the proposed system subquery results can be joined directly only on literal values and not on URIs, because data sources are presented in the form of independent Linked Open Data sets and do not share common URIs. If subqueries to different data sources are to be joined on URIs, we will have to use the sets of links generated by the linking subsystem between these data sources. Each conjunctive query is a graph pattern whose vertices are literal values, URIs or variables, and whose edges are labeled with predicates in terms of different data sources. If two adjacent edges are labeled with predicates from different sources, it is necessary to refer to the linkset for this pair of sources and select resource pairs that satisfy the given part of the graph pattern. By performing this operation on all the links in the conjunctive query, we will obtain the set of resource URIs that satisfies the part of the pattern that defines relationships between different data sources. Then the subquery parts related to specific data sources will be executed by adapters, with the corresponding join variables replaced by URIs from linksets. At this step traditional query optimization techniques can be applied again.

4.4 Linking subsystem

RDF documents published in the Linked Open Data space are required to contain outgoing links. These outgoing links are RDF triples in which the subject is the URI of a resource from the local namespace and the URI of the object and/or predicate belongs to the namespace of another dataset. The most important type of outgoing links is identity links, which point at URI aliases used by other data sources to identify the same real-world object or abstract concept. Identity links can use such predicates as owl:sameAs, rdfs:seeAlso or special SKOS terms. Although the uses of the owl:sameAs predicate in the LOD space are often contrary to the semantics of OWL [7], its use is recommended by the W3C Technical Architecture Group. The linking subsystem will be responsible for the discovery, storage and maintenance of identity links. Properties of a link include the pair of URIs, link generation time and method, date of the last link check and a similarity factor. When the link is published, either the owl:sameAs or the rdfs:seeAlso predicate is used in the triple, depending on the similarity factor value.

The linking subsystem will work as follows. In the first step, the two data sources to be interlinked are found. For this pair an initially empty voiD linkset is created and published. When a new data source is added to the system, the linksets between this new data source and all existing datasets from other sources are created automatically. A linkset between an internal dataset and an external Linked Open Data set will be created in one of the following cases:
• The user can manually select a pair of datasets for linking;
• Relevant datasets can be discovered using the HTTP referrer technique described in [9];
• Relevant datasets can be discovered by the linking subsystem itself by traversing links in an external dataset that is already linked to one of the internal datasets.

In the second step, once the two target datasets for interlinking have been selected, the subsystem will cluster the datasets by classes and determine pairs of clusters to be interlinked. This is done to reduce the number of pairwise comparisons of dataset elements. In the case of two internal datasets, both of them are described by the same ontology, so pairs of clusters contain instances of the same ontology classes. In the case of linking to an external dataset, the subsystem might select pairs of classes with the help of different ontology mapping techniques [5], as well as using discovered or manually specified ontology mapping rules.

The third and final step of interlinking involves pairwise comparison of cluster elements to detect identity relations between them. These relations will be detected using rules in the SILK LSL language. In the case of internal data sources, rules will be declared together with the ontology and will determine which instances of the same class are identical. In the case of an external dataset, rules will be either specified manually or derived from the existing rules and ontology mapping rules.

A complete set of links could only be obtained by pairwise comparison of all elements of all datasets (both internal and external), but in practice such a comparison is infeasible. Link generation optimization requires additional study.

5 Conclusion

This paper proposes a concept of a data integration system oriented towards the Linked Open Data space. The novelty of this concept lies in its hybrid approach: the proposed system combines data mediation and data warehousing approaches by using locally stored linksets as indexes for a search engine. To the author's knowledge, such a method has not been implemented yet. Besides, while there are works dedicated to bringing single data sources into the LOD space or to dealing with multiple sources already present in the LOD space, the idea of bringing multiple data sources into the LOD space via a single data integration system has received very little attention.

Currently, a proof-of-concept system is being developed at CC RAS as part of a practical project dedicated to the integration of data on protected sites and animal species. While participating in the group working on this project, the author is working on query answering algorithms in the presence of linksets. As a result of this project, a large set of data on national parks should emerge in the LOD space, and if incoming links from external datasets appear, the project will be considered successful.

Future work on this project might include the study of algorithms for link network generation and maintenance. The system can also be extended with modules to access external Semantic Web resource aggregators (sig.ma) and semantic search engines (sindice.com). Also, additional studies of licensing management and data access in the context of Linked Open Data are required.

References

[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets. In Proceedings of the WWW2009 Workshop on Linked Data on the Web, 2009.
[2] A. A. Bezdushny. Formal Model of Ontology-Based Data Integration Systems. Novosibirsk, 2008.
[3] C. Bizer, R. Cyganiak. D2RQ - Lessons Learned. Position paper for the W3C Workshop on RDF Access to Relational Databases, 2007. http://www.w3.org/2007/03/RdfRDB/papers/d2rq-positionpaper/
[4] D. Calvanese, G. De Giacomo, D. Lembo et al. The MASTRO system for ontology-based data access. Semantic Web Journal, volume 2, number 1, pages 43-53, 2011.
[5] J. Euzenat, A. Ferrara, et al. First results of the ontology alignment evaluation initiative 2011. In Proc. of 6th Ontology Matching Workshop (OM'11), at International Semantic Web Conference (ISWC'11), Bonn, Germany, 2011.
[6] O. Gorlitz, S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. Proceedings of the 2nd International Workshop on Consuming Linked Data, Bonn, Germany, 2011.
[7] H. Halpin, P. Hayes, J. McCusker, D. Mcguinness, and H. Thompson. When owl:sameAs isn't the same: An analysis of identity in linked data. In Proceedings of the 9th International Semantic Web Conference, 2010.
[8] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool, 2011. http://linkeddatabook.com/editions/1.0/
[9] H. Muhleisen and A. Jentzsch. Augmenting the Web of Data using Referers. Linked Data on the Web (LDOW2011), Mar. 2011.
[10] A. Nikolov and M. d'Aquin. Identifying Relevant Sources for Data Linking using a Semantic Web Index. 4th Workshop on Linked Data on the Web (LDOW 2011) at 20th International World Wide Web Conference (WWW 2011), Hyderabad, India, 2011.
[11] Virtuoso Universal Server, 2011. http://virtuoso.openlinksw.com/
[12] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In Proceedings of the International Semantic Web Conference, pages 650-665, 2009.
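
As an illustration of the linkset-based join described for the data integration subsystem, the following is a minimal Python sketch and not the system's implementation. It is simplified in one important respect: the paper proposes substituting URIs from linksets into the adapter subqueries before they run, whereas here both subquery results are assumed to be already materialized in memory and are joined afterwards. The function name `join_via_linkset`, the row representation (one dictionary of variable bindings per result row) and all URIs and values are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical linkset between two datasets: pairs of URIs that the linking
# subsystem has judged to identify the same real-world object.
Linkset = Set[Tuple[str, str]]
Row = Dict[str, str]


def join_via_linkset(
    left_rows: List[Row],
    right_rows: List[Row],
    linkset: Linkset,
    join_var: str,
) -> List[Row]:
    """Join two subquery results on join_var, treating linked URIs as equal."""
    # Index the linkset by the URI used in the left-hand source.
    aliases: Dict[str, Set[str]] = {}
    for left_uri, right_uri in linkset:
        aliases.setdefault(left_uri, set()).add(right_uri)

    # Index the right-hand rows by the value bound to the join variable.
    right_index: Dict[str, List[Row]] = {}
    for row in right_rows:
        right_index.setdefault(row[join_var], []).append(row)

    joined: List[Row] = []
    for left in left_rows:
        # Only URIs connected through the linkset can match; identical
        # strings are not expected, since the sources share no URIs.
        for alias in sorted(aliases.get(left[join_var], ())):
            for right in right_index.get(alias, []):
                merged = dict(right)
                merged.update(left)  # keep the left-hand URI for join_var
                joined.append(merged)
    return joined


# Illustrative data only; every URI and value below is made up.
parks = [{"park": "http://example.org/datasets/a/park/1", "name": "Losiny Ostrov"}]
species = [{"park": "http://example.org/datasets/b/site/17", "species": "Alces alces"}]
links = {("http://example.org/datasets/a/park/1", "http://example.org/datasets/b/site/17")}

result = join_via_linkset(parks, species, links, "park")
```

With an empty linkset the same call returns no rows, which reflects the point made in the text that independently published datasets cannot be joined on URIs without the links produced by the linking subsystem.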