<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <collab>Ghent University - iMinds - Multimedia Lab</collab>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent</institution>, <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>8</volume>
      <issue>2014</issue>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>Despite the significant number of existing tools, incorporating data from multiple sources and different formats into the Linked Open Data cloud remains complicated. No mapping formalisation exists to define how to map such heterogeneous sources into rdf in an integrated and interoperable fashion. This paper introduces the rml mapping language, a generic language based on an extension of r2rml, the w3c standard for mapping relational databases into rdf. Broadening r2rml's scope, the language becomes source-agnostic and extensible, while facilitating the definition of mappings of multiple heterogeneous sources. This leads to higher integrity within datasets and richer interlinking among resources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Deploying the five stars of the Linked Open Data schema1
is the de-facto way of mapping data. In real-world situations,
multiple sources of different formats are part of multiple
domains, which in their turn are formed by multiple sources
and the relations between them. Approaching the stars as a
set of consecutive steps and applying them to a single source
every time, as most solutions tend to do, is not always an
optimal solution. When mapping heterogeneous data into
rdf, such approaches often fail to reach the final goal of
publishing interlinked data. The semantic representation of
each mapped resource is defined independently, disregarding
its possible prior definitions and its links to other resources.
Manual alignment to their prior appearances is performed
by redefining their semantic representations, while links to
other resources are defined after the data are mapped and
published. Nonetheless, as datasets are often shaped
gradually, a demand emerges for a well-considered policy regarding
the mapping and primary interlinking of data in the context of
a certain knowledge domain.</p>
      <p>For instance, governments publish their data as Open Data
and turn them into Linked Open Data afterwards. Much of
this data, as expected when dealing with many sources,
complements each other in the description of different knowledge
domains. Therefore, the same concepts appear in multiple
data sets and, problematically, often with different identifiers
or even in different formats. Furthermore, data is mapped
progressively, thus it is important that data publishers
incorporate their data in what is already published. Reusing the
same unique identifiers for concepts is necessary to achieve
this, but it is only possible if prior existing definitions in the
same dataset are discovered and if they can be replicated.
Otherwise, duplicates will inevitably appear, even within
a publisher's own datasets. Identifying, replicating, and
keeping those definitions aligned is complicated, and the situation
aggravates the more data is mapped and published.</p>
      <p>Solving this problem requires a uniform, modular,
interoperable and extensible technology that supports this need for
gradually incrementing datasets. Such a solution can deal
with the mapping and primary interlinking of the data, which
should take place in a tightly coordinated way instead of
as two separate, consecutive actions. This ensures semantic
representations of higher quality and datasets with better
integrity. To this end, we propose rml, a generic mapping
language defined as an extension of r2rml2, the w3c
recommendation for mapping data in relational databases into rdf.</p>
      <p>The remainder of the paper is organized as follows:
Section 2 discusses related solutions existing today. Section 3
analyzes the requirements of a mapping language, and
Section 4 introduces the proposed approach. Next, Section 5
addresses the challenges of implementing an rml processor.
Finally, Section 6 outlines our conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Several solutions exist to execute mappings from different
file structures and serialisations to rdf. For relational
databases, different mapping languages beyond r2rml are
defined [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and several implementations already exist3.
Similarly, mapping languages were defined to support conversion
from data in csv and spreadsheets to the rdf data model.
They include XLWrap's mapping language [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which
converts data in various spreadsheets to rdf; the declarative
owl-centric mapping language Mapping Master's M2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which converts data from spreadsheets into the Web
Ontology Language (owl); Tarql4, which follows a querying
approach; and Vertere5. The main drawback of most
csv/spreadsheet-to-rdf mapping solutions is the assumption
that each row describes an entity (entity-per-row assumption)
and that each column represents a property.
2http://www.w3.org/TR/r2rml
3http://www.w3.org/2001/sw/rdb2rdf/wiki/Implementations
4https://github.com/cygri/tarql
5https://github.com/knudmoeller/Vertere-RDF
      </p>
      <p>
        A larger variety of solutions exist to map from xml to rdf,
but to the best of our knowledge, no specific languages were
defined for this, apart from grddl6, which essentially provides
the links to the algorithms (typically represented in xslt)
that map the data to rdf. Instead, tools mostly rely on
existing xml solutions, such as xslt (e.g., Krextor [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
AstroGrid-D7), xpath (e.g., Tripliser8), and xquery (e.g.,
XSPARQL [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). In general, most existing tools deploy mappings
from a certain source format to rdf (per-source approaches).
Few tools provide mappings from different source formats to
rdf, and those tools actually employ separate source-centric
approaches for each of the formats they support. Datalift [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
The DataTank9, OpenRefine10, RDFizers11 and Virtuoso
Sponger12 are the most well-known.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. MAPPINGS METHODOLOGY</title>
      <p>After outlining the limitations of existing solutions, we
present the factors that can improve the mappings to produce
better integrated datasets and early interlinked resources.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Limitations of current mapping methods</title>
      <p>We identified the following limitations that prevent current
practices from achieving well-integrated datasets.
Mapping of data on a per-source basis. Most of the
current solutions work on a per-source basis: only one source
is mapped at once, as opposed to mapping different related
sources together, despite their covering the same domains or
sharing the same formats. As a result, data publishers can only
generate resources and links between data appearing within
a single source. Their mapping definitions need to be aligned
manually when the same resources already appear in the
target dataset. Thus, data publishers need to redefine
and replicate the patterns for the resources' uri definitions
every time they appear in a new mapping rule. Furthermore,
this is not always possible, as the data included in one
source may not be sufficient to replicate the same uris. This
results in distinct uris for identical resources, which leads to
duplicates within a publisher's own dataset. In addition, the
interlinking of the resources generated from different sources
has to be performed afterwards.</p>
      <p>Mapping data on a per-format basis. Besides the
per-source approach, most of the current solutions provide a
per-format approach: only mappings from a certain source
format (e.g., xml) are supported. In practice, data publishers
need to map various source formats to rdf. Therefore, they
need to install, learn, use and maintain different tools for
each case separately, which hampers their effort to ensure
the integrity of their datasets even more. Alternatively, some
end up implementing their own case-specific solutions.
6http://www.w3.org/TR/grddl/
7http://www.gac-grid.de/project-products/Software/XML2RDF.html
8http://daverog.github.io/tripliser/
9http://thedatatank.com
10http://openrefine.org/
11http://simile.mit.edu/wiki/RDFizers
12http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSponger
Mapping definitions’ reusability. The mapping definitions
of current solutions are not reusable, as there is no standard
formalisation for any source format apart from relational
databases, i.e., r2rml. In most cases, the mapping rules
are not interoperable as they are tied to the implementation,
which prevents their extraction and reuse across different
implementations. Moreover, this prohibits reuse of the same
mapping rules to map data that describe the same model
but are serialized in different initial formats.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Requirements for generic mappings</title>
      <p>To achieve datasets with better integrated and richer
interlinked resources, the aforementioned issues should be
addressed during the mapping phase, rather than later. A set
of factors that contribute to this are outlined below.
Uniform and interoperable mapping definitions. Since
we require a uniform way of dealing with different source
serializations, the mapping definitions should be defined
independently of the references to the input data. The
same mappings may then be reused across different sources,
as long as they capture the same context (i.e., the same
rdf representations), only by changing the reference to
the input source that holds the information. For example,
a performance described in a json file and an exhibition
described in an xml file may take place at the same location,
indicated by an identical longitude/latitude pair. We only
need a single mapping definition to describe their location,
adjusted to point to respectively the json objects and the
xml elements that hold the corresponding values. Therefore,
we require a modular language in which the references to the
data extracts and the mapping definitions are distinct and
not interdependent. Thereby, the mapping definitions can be
reused across different implementations for different source
formats, reducing the implementation and learning costs.
Robust cross-references and interlinking. Redefining and
replicating patterns every time a new input source is
integrated should be avoided. Publishers should be able to
uniquely define the pattern that generates a resource and
refer to its definition every other time this resource is mapped
(and in this way enriched), which has the following three
advantages: First, possible modifications to the patterns, or to data
values appearing in the patterns that generate the uris, are
propagated to every other reference of the resource,
making the interlinking more robust. Second, taking advantage
of this integrated solution, cross-references among sources
become possible; links between resources in different input
sources are defined already on the mapping level. Third, and
most significant, when data publishers want to map a new
source, their new mappings are defined taking advantage of,
and automatically aligning to, the existing ones.</p>
      <p>Extending the aforementioned example, the venue where
the performance and the exhibition take place is the same. When
the input source for the performances was mapped, the
mappings for the possible venues were defined considering
certain identifiers to define their uris. Once the exhibitions
are about to be mapped, the data publisher might not be
able to reuse the existing mapping definition for the venues,
as the identifiers needed to replicate the same patterns are
not included in the dataset. However, the venue name might be
considered to determine the binding. Then, the existing
mapping definition can be referred to, to generate the same uris
and thus enrich the existing resource with new attributes and
interlink data from the newly mapped dataset to the existing
one. As the original input source is an Open Data set that
can be referenced, it is always available to support
the mapping of the new data. Summarizing, the definition
of the links between resources in different sources (even if
they are in different formats) happens on the mapping level
instead of during a subsequent interlinking step.
Scalable mapping language. As the references to the data
extracts and the mapping definitions are distinct and not
interdependent, the pointer to the input source's data can
be adjusted to each case. Such a modular solution leads to
correspondingly modular implementations that perform the
mappings in a uniform way, independent of the input source.
They only adjust the respective extraction mechanism
depending on the input source. Case-specific solutions exist
because complete generic solutions fail, as it is impossible to
predict every potential input. A scalable solution addresses
what can be defined in a generic way for all possible
different input sources and scales over what cannot. In order
to support emerging needs, it should allow extensions with
source-specific references, addressed on a case-specific level.</p>
    </sec>
    <sec id="sec-6">
      <title>4. RML MAPPING LANGUAGE</title>
      <p>The RDF Mapping Language (rml) is a generic
mapping language defined to express customized mapping rules
from heterogeneous data structures and serializations to the
rdf data model. rml is defined as a superset of the
w3c-standardized mapping language r2rml, aiming to extend its
applicability and broaden its scope.</p>
      <p>4.1 R2RML</p>
      <p>r2rml is defined to express customized mappings only
from data in relational databases to datasets represented
using the rdf data model. In r2rml, the mapping to the rdf
data model is based on one or more Triples Maps and occurs
over a Logical Table, iterating on a per-row basis. A Triples Map
consists of three main parts: the Logical Table (rr:LogicalTable),
the Subject Map and zero or more Predicate-Object Maps. The
Subject Map (rr:SubjectMap) defines the rule that generates
unique identifiers (uris) for the resources which are mapped,
and is used as the subject of all the rdf triples that are
generated from this Triples Map. A Predicate-Object Map consists
of Predicate Maps, which define the rule that generates the
triple's predicate, and Object Maps or Referencing Object Maps,
which define the rule that generates the triple's object. The
Subject Map, the Predicate Map and the Object Map are Term
Maps, namely rules that generate an rdf term (an iri, a blank
node or a literal). A Term Map can be a constant-valued
term map (rr:constant) that always generates the same rdf
term, a column-valued term map (rr:column) whose value is the
data value of a referenced column in a given Logical Table's
row, or a template-valued term map (rr:template) that is a
valid string template that can contain referenced columns.</p>
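      <p>As an illustrative sketch (the table and column names below are hypothetical, not taken from the paper's listings), such an r2rml Triples Map could be written as:</p>

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .      # hypothetical vocabulary

# Hypothetical mapping of a relational table VENUE(ID, NAME)
<#VenueMapping>
    rr:logicalTable [ rr:tableName "VENUE" ] ;       # the Logical Table
    rr:subjectMap [                                  # Subject Map
        rr:template "http://example.com/venue/{ID}"  # template-valued term map
    ] ;
    rr:predicateObjectMap [                          # Predicate-Object Map
        rr:predicateMap [ rr:constant ex:name ] ;    # constant-valued term map
        rr:objectMap [ rr:column "NAME" ]            # column-valued term map
    ] .
```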
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>rml's extensions over r2rml</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>R2RML</th><th>RML</th></tr>
          </thead>
          <tbody>
            <tr><td>Input Reference</td><td>Table Name</td><td>Source</td></tr>
            <tr><td>Value Reference</td><td>Column</td><td>Reference</td></tr>
            <tr><td>Iteration model</td><td>per row (implicit)</td><td>defined</td></tr>
            <tr><td>Source Expression</td><td>SQL (implicit)</td><td>Reference Formulation</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Relations between resources can also be expressed across
Triples Maps, when the subject of a Triples Map is the same as
the object generated by a Predicate-Object Map. A Referencing
Object Map (rr:RefObjectMap) is then used to point to the Triples
Map that generates, in its Subject Map, the corresponding
resource, the so-called Referencing Object Map's Parent Triples
Map. If the Triples Maps refer to different Logical Tables, a
join between the Logical Tables is required. The join condition
(rr:joinCondition) performs the join exactly as a join is
executed in sql. The join condition consists of a reference to a
column name that exists in the Logical Table of the Triples
Map that contains the Referencing Object Map (rr:child) and a
reference to a column name that exists in the Logical Table of
the Referencing Object Map's Parent Triples Map (rr:parent).</p>
      <p>4.2 RML</p>
      <p>
        rml keeps the mapping definitions as in r2rml but
excludes its database-specific references from the core model.
The potentially broad concepts of r2rml, which were explained
previously [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], are formally designated in the frame of the
rml mapping language and are elaborated upon here. The
primary difference is the potential input, which is limited to
a certain database in the case of r2rml, while it can be a
broad set of (one or more) input sources in the case of rml.
Table 1 summarizes rml's extensions over r2rml,
entailed by the broader set of possible input sources.
      </p>
      <p>rml provides a generic way of defining the mappings that is
easily transferable to cover references to other data structures,
combined with case-specific extensions, but always remains
backward compatible with r2rml, as relational databases
form such a specific case. rml considers that the mappings
to rdf of sets of sources that all together describe a certain
domain can be defined in a combined and uniform way,
while the mapping definitions may be re-used across different
sources that describe the same domain to incrementally form
well-integrated datasets, as displayed in Figure 1.</p>
      <p>An rml mapping definition follows the same syntax as
r2rml. The rml vocabulary namespace is
http://semweb.mmlab.be/ns/rml# and the preferred prefix is rml. More details
about the rml mapping language can be found at http://rml.io.
Defining and executing a mapping with rml requires the
user to provide a valid and well-formatted input dataset to
be mapped, and the mapping definition (mapping document)
according to which the mapping will be executed to generate
the data's representation using the RDF data model (output
dataset). Data cleansing is out of the scope of the language's
definition and, if necessary, should be performed in advance.
An extract of two heterogeneous input sources is displayed in
Listing 1, an example of a corresponding mapping definition
is displayed in Listing 3, and the produced output in Listing 2.
Logical Source. A Logical Source (rml:LogicalSource) extends
r2rml's Logical Table and is used to determine the input
source with the data to be mapped. The r2rml Logical Table
definition determines a database's table, using the Table Name
(rr:tableName). In the case of rml, a broader reference to any
input source is required. Thus, the Logical Source and its source
declaration are introduced to point to the input to be mapped.
Reference Formulation. rml needs to deal with different
data serialisations which use different ways to refer to their
elements/objects. But, as rml aims to be generic, no
uniform way of referring to the data's elements/objects is
defined. r2rml uses columns' names for this purpose. In the
same context, rml considers that any reference to the Logical
Source should be defined in a form relevant to the input data,
e.g., XPath for xml files or jsonpath for json files. To this
end, the Reference Formulation (rml:referenceFormulation)
declaration is introduced, indicating the formulation (for instance,
a standard or a query language) used to refer to its data. In
the current version of rml, the ql:CSV, ql:XPath and ql:JSONPath
Reference Formulations are predefined.</p>
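      <p>A minimal sketch of these declarations, assuming a hypothetical json input file (the ql namespace is assumed here, not given in this extract):</p>

```turtle
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .   # assumed namespace for the Reference Formulations

# Hypothetical Logical Source pointing to a JSON input
<#PerformanceMapping>
    rml:logicalSource [
        rml:source "performances.json" ;        # broader replacement for rr:tableName
        rml:referenceFormulation ql:JSONPath    # references into this source are JSONPath
    ] .
```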
      <p>Iterator. While in r2rml it is already known that a
per-row iteration occurs, as rml remains generic, the iteration
pattern, if any, cannot always be implicitly assumed; it
needs to be determined. Therefore, the iterator (rml:iterator)
is introduced. The iterator determines the iteration pattern
over the input source and specifies the extract of the data
mapped during each iteration. For example, the "$.[*]" expression
determines the iteration over a json file that occurs over the
object's outer level. The iterator is not required in the case
of tabular sources, as the default per-row iteration is implied,
or if there is no need to iterate over the input data.
Logical Reference. A column-valued term map, according
to r2rml, is defined using the property rr:column, which
determines a column's name. In the case of rml, a more generic
property, rml:reference, is introduced. Its value must be a valid
reference to the data of the input dataset. Therefore, the
reference's value should be a valid expression according to the
Reference Formulation defined at the Logical Source, as should
the string template used in the definition of a template-valued
term map and the iterator's value. For instance, for a json input, the
iterator, the subject's template-valued term map and the object's
reference-valued term map are all valid jsonpath expressions.
Referencing Object Map. The last aspect of r2rml that
is extended in rml is the Referencing Object Map. The join
condition's child reference (rr:child) indicates the reference to
the data value (an rml:reference) of the Logical Source
that contains the Referencing Object Map; this reference is
specified using the Reference Formulation defined at the current
Logical Source. The join condition's parent reference (rr:parent)
indicates the reference to the data extract (rml:reference) of
the Referencing Object Map's Parent Triples Map; this
reference is specified using the Reference Formulation defined
at the Parent Triples Map's Logical Source definition. Therefore,
the child reference and the parent reference of a join condition
may be defined using different Reference Formulations, if the
Triples Maps refer to sources of different formats.
rml is highly extensible towards new source formats,
allowing different levels of support. On the processing level this
adds some complexity, as it demands that the processor be
scalable to support different input sources in a uniform way.
To deal with these caveats, rml relies on expressions in a
target expression language relevant to the source format to
refer to the values of the sources, while using the rml syntax
for the rest of the mapping definition. This target expression
language needs to be tied to its format and should act as a
point of reference to the values in a source.</p>
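      <p>Bringing these extensions together, a hedged sketch of a cross-format mapping (all file names, paths and identifiers below are hypothetical, not taken from the paper's listings) could join performances in a json file to venues in an xml file:</p>

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.com/ns#> .        # hypothetical vocabulary

<#PerformanceMapping>
    rml:logicalSource [
        rml:source "performances.json" ;
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.[*]"                   # one performance per outer JSON object
    ] ;
    rr:subjectMap [ rr:template "http://example.com/performance/{$.id}" ] ;
    rr:predicateObjectMap [
        rr:predicate ex:venue ;
        rr:objectMap [                         # Referencing Object Map
            rr:parentTriplesMap <#VenueMapping> ;
            rr:joinCondition [
                rr:child  "$.venue" ;          # JSONPath: the child's Reference Formulation
                rr:parent "name"               # XPath: the parent's Reference Formulation
            ]
        ]
    ] .

<#VenueMapping>
    rml:logicalSource [
        rml:source "venues.xml" ;
        rml:referenceFormulation ql:XPath ;
        rml:iterator "/venues/venue"           # one venue per XML element
    ] ;
    rr:subjectMap [ rr:template "http://example.com/venue/{name}" ] .
```

      <p>Here the child and parent references use different Reference Formulations, since the two Triples Maps draw on sources of different formats.</p>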
      <p>Expressions can be located wherever values need to be
extracted from the source (Term Maps and rml:iterator) and
have to be valid according to the formulation specified in
the Triples Map (rml:referenceFormulation). In order to deal with
these embedded expressions, an rml processor is required
to have a modular architecture where the extraction and
mapping modules are executed independently of each other.
When the rml mappings are processed, the mapping module
deals with the mappings' execution as defined in the mapping
document in rml syntax, while the extraction module deals
with the target language's expressions.</p>
    </sec>
    <sec id="sec-7">
      <title>Mapping Models</title>
      <p>An RML processor can be implemented using two alternative
models, mapping-driven and data-driven, or in a hybrid fashion,
following any combination of the two that makes
the processor perform better.</p>
      <p>Mapping-driven. In this model, the processing is driven
by the mapping module. The processor processes each Triples
Map in consecutive order. Based on the defined
expression language, each Triples Map is delegated to a
language-specific sub-extractor. For each Triples Map, its delegated
sub-extractor iterates over the source data as the Triples
Map's Iterator specifies. For each iteration, the mapping
module requests an extract of data from the extraction module.
The defined Subject Map and Predicate-Object Maps are applied
and the corresponding triples are generated. The execution
of dependent Triples Maps, because of joins, is triggered by
the Parent Triples Map, and a nested mapping process occurs.
Data-driven. In this model, the processing is driven by the
extraction module, namely the data sources. The
processor extracts beforehand the iteration patterns, if any, from
the Triples Maps. Each defined dataset is handled by its
language-specific sub-extractor. Based on the defined
expression language and the iterator, each Triples Map is delegated
to a specific sub-mapper. For each iteration, a data extract is
passed to the processor, which in turn delegates the extract
of data to the corresponding sub-mapper. The defined Subject
Map and Predicate-Object Maps are applied and the
corresponding triples are generated. The execution of dependent Triples
Maps, because of joins, is triggered by the Parent Triples Map,
and a nested mapping-driven process occurs.</p>
      <p>The efficiency of the processor can be increased by
scheduling the execution of the present expressions in an intelligent
way. The mapping-driven model allows the most
straightforward implementation, since Triples Maps are processed
independently from each other. However, because of this,
avoiding multiple passes over the same dataset is difficult.
With execution planning, the number of file passes can be
reduced to the bare minimum, but cannot be one for all cases.
The data-driven model does not have this problem, since one
element of a single dataset can activate all related mappings.
The execution planning does become more complex, since all
dependencies have to be resolved beforehand. Note that we
deliberately ignore storing files in memory, which would
solve the multiple passes for the mapping-driven approach.
We only consider a streaming solution, since rml can be used
to process datasets too big for the processor's memory. We
accept a longer mapping time in exchange for lower memory
usage. A side-effect of a streaming approach is the inability to
support some features of expression languages. For instance,
XPath has look-ahead functionality that requires access to
data which is not yet known. Thus, we can only support
a subset. Nevertheless, in practice, most of the expressions
only require functionality within this subset.</p>
      <p>We created a prototype rml processor implementation in
Java, based on the mapping-driven model, which is available
at https://github.com/mmlab/RMLProcessor.</p>
    </sec>
    <sec id="sec-8">
      <title>6. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, we presented a novel approach for mapping
heterogeneous sources into rdf using rml, an easily
extendable mapping language that significantly reduces the
effort needed for integrated mapping of heterogeneous resources. Our
proposed solution efficiently overcomes the limitations outlined
in Section 3.1 by addressing the factors presented in Section 3.2
that improve the datasets' integrity and their resources'
interlinking, and it incorporates the data publisher's uri policy in a
well-considered mapping policy. The per-format and per-file
mapping models followed so far are surpassed, enabling
data integration and interlinking already at a primary
stage. The language's extensibility is self-evident, as the
whole solution relies on the extension of the r2rml mapping
language and arose in a progressive way: it was initially
designed to accommodate mappings from the xml format
to the rdf data model, and was later re-used as such for
mappings of data appearing in json.</p>
      <p>In the future, a thorough evaluation of rml's efficiency and
effectiveness will be performed. Furthermore, rml can be
extended to support views on sources, built by queries. This
captures, to an extent, the issue of data cleaning and
transformation, enhancing its applicability. Next, the efficiency of
rml processing can be improved. A possible optimization is
the use of execution plans that efficiently arrange the
execution order depending on the mappings' dependencies. Finally, rml
could be used to specify the triples' provenance, by taking
advantage of the rdf-nature of the mapping documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bischof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Krennwallner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          .
          <article-title>Mapping between RDF and XML with XSPARQL</article-title>
          .
          <source>Journal on Data Semantics</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>147</fpage>-<lpage>185</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Vander</given-names>
            <surname>Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          , E. Mannens, and R. Van de Walle.
          <article-title>Extending r2rml to a Source-independent Mapping Language for rdf</article-title>
          .
          <source>In International Semantic Web Conference (Posters and Demos)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hert</surname>
          </string-name>
          , G. Reif, and
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Gall</surname>
          </string-name>
          .
          <article-title>A comparison of RDB-to-RDF mapping languages</article-title>
          .
          <source>In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics '11</source>
          , pages
          <fpage>25</fpage>-<lpage>32</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          .
          <article-title>Krextor - an extensible framework for contributing content math to the Web of Data</article-title>
          .
          <source>In Proceedings of the 18th Calculemus and 10th international conference on Intelligent computer mathematics, MKM'11</source>
          , pages
          <fpage>304</fpage>
          –
          <lpage>306</lpage>
          . Springer-Verlag,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Langegger</surname>
          </string-name>
          and W. Wo . XLWrap {
          <article-title>Querying and Integrating Arbitrary Spreadsheets with SPARQL</article-title>
          .
          <source>In Proceedings of the 8th International Semantic Web Conference, ISWC '09</source>
          , pages
          <fpage>359</fpage>
          –
          <lpage>374</lpage>
          . Springer-Verlag,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Halaschek-Wiener</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          .
          <article-title>Mapping Master: a flexible approach for mapping spreadsheets to OWL</article-title>
          .
          <source>In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part II, ISWC'10</source>
          , pages
          <fpage>194</fpage>
          –
          <lpage>208</lpage>
          . Springer-Verlag,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Scharffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Atemezing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Villata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bihanic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kepeklian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cotton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          , P.-Y. Vandenbussche, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Vatant</surname>
          </string-name>
          .
          <article-title>Enabling Linked Data publication with the Datalift platform</article-title>
          .
          <source>In Proc. AAAI workshop on semantic cities</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>