D2RML: Integrating Heterogeneous Data and Web Services into Custom RDF Graphs

Alexandros Chortaras
National Technical University of Athens
Athens, Greece
achort@cs.ntua.gr

Giorgos Stamou
National Technical University of Athens
Athens, Greece
gstam@cs.ntua.gr

ABSTRACT
In this paper, we present the D2RML Data-to-RDF Mapping Language, an extension of the R2RML mapping language that significantly enhances its ability to collect data from diverse data sources and transform them into custom RDF graphs. The definition of D2RML is based on a simple formal abstract data model, which is needed to clearly define its semantics, given the diverse types of data representation standards used in practice. D2RML allows web service-based data transformations, simple data manipulation and filtering, and conditional maps, so as to improve the selectivity of RDF mapping rules and facilitate the generation of higher quality RDF data stores, through a lightweight, easy to write and modify specification.

CCS CONCEPTS
• Information systems → Information integration; Web data description languages; Query languages; Web services;

KEYWORDS
RDF mapping language, Data integration, Web service integration

ACM Reference Format:
Alexandros Chortaras and Giorgos Stamou. 2018. D2RML: Integrating Heterogeneous Data and Web Services into Custom RDF Graphs. In Proceedings of Linked Data on the Web 2018 (LDOW2018). ACM, New York, NY, USA, 10 pages.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
LDOW2018, April 2018, Lyon, France
© 2018 Copyright held by the owner/author(s).

1 INTRODUCTION
In the past years, a considerable amount of work has been done on developing methodologies for mapping relational databases to RDF graphs. Several approaches, mapping languages and systems have been proposed, including two W3C recommendations [1, 8]. This work has mainly been motivated by the need to integrate the huge amount of information contained in existing relational databases with the emerging Semantic Web, and to make it part of the Linked Data cloud.

Following the growth of Linked Data, several research institutions and companies, such as DBpedia¹, WordNet² and OpenStreetMap³, now offer access to their huge datastores through SPARQL endpoints or RESTful web services. More recently, the expansion of cloud computing, the exciting developments in the field of machine learning and the subsequent revival of interest in artificial intelligence applications have resulted in the emergence of cloud platforms and marketplaces that offer intelligent data analysis web services, often representing their output using Linked Open Data vocabularies and resources, such as DBpedia Spotlight⁴, Google's Cloud Natural Language⁵ and Microsoft's Computer Vision API⁶. These services typically deliver data using some structured data exchange format (usually JSON or XML documents).

Thus, if until recently the question was how to integrate existing data with the Semantic Web, now part of the question is also how to use all these available data and diverse services in a coordinated and integrated manner, so as to selectively pick and aggregate data into custom data stores that power new intelligent applications. In this respect, aggregating data into custom RDF data stores is of particular interest, not only because they allow direct integration with the Linked Data cloud, but also because intelligence can be added on top of the data by including, e.g., axiomatic knowledge in the form of OWL 2 [20] axioms. As a matter of fact, recent work on efficient algorithms and methods for reasoning with tractable fragments of ontologies (e.g. [3], [21]) has allowed the development of practical systems that provide inferencing over semantic data.

In this environment, we propose D2RML, a generic Data-to-RDF Mapping Language, whose aim is to facilitate the generation of custom RDF data stores by selectively collecting and integrating data from diverse data sources and web services into RDF data stores of as high quality as possible. Our purpose is to provide a formal basis for defining transformation-oriented general Data-to-RDF mappings and, while staying within the mapping language approach, to shift as much as possible of the burden of generating such data stores in practice from writing code or using heavyweight data workflow solutions to writing easily understandable and modifiable specifications.

The rest of the paper is organized as follows: In Section 2 we briefly discuss related work, with emphasis on R2RML and RML, which are the starting points for our work. In Section 3 we define the simple theoretical data model that underlies D2RML. In Section 4 we describe how several widely used information sources can be cast onto our model, and in Section 5 we present the formal specification of D2RML. Section 6 presents an extensive realistic use case that showcases the expressivity and practical usefulness of the proposed language, and Section 7 concludes the paper.

1 http://dbpedia.org/sparql/
2 http://wordnet-rdf.princeton.edu/
3 http://api.openstreetmap.org/
4 http://www.dbpedia-spotlight.org/api/
5 https://cloud.google.com/natural-language/
6 https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
2 RELATED WORK
Several languages and systems have been proposed to map relational databases to RDF (RDB-to-RDF mapping languages). A comparative analysis is presented in [14], which determines fifteen desirable features (e.g. support for transformation functions, named graphs, integrity constraints) that such languages should have, and discusses to what extent they are supported by the several languages. Existing RDB-to-RDF mapping languages vary considerably in the flexibility they allow in defining mappings, from the rigid Direct Mapping [1] approach, which automatically translates the data of a relational database into an RDF graph representation following the database schema, to the R2RML language [8], which allows the user to define custom views and mapping rules (expressed as RDF graphs) and satisfies most of the fifteen desirable features.

The development of mapping languages and practical systems for translating data sources other than relational databases to RDF graphs has also been attempted. Closer to the relational model are CSV/TSV documents and spreadsheets, which retain the tabular format. Tools for converting from these data sources include XLWrap [18], TaRQL⁷, Vertere⁸, and M2 [22]. In all such tools, for each table row one or more RDF resources are generated, and for each column one or more RDF triples about the respective resources are generated. Other formats, such as XML, diverge considerably from tabular data owing to their hierarchical structure, and the systems that have been proposed to translate XML to RDF graphs rely on XSLT transformations (e.g. XML2RDF⁹), XPath (e.g. Tripliser¹⁰), XQuery (e.g. XSPARQL [2]), or on embedding within the XML documents links to transformation algorithms, typically XSLT transformations (GRDDL [6]). All such tools rely on syntactical transformations of parts of the XML structure to RDF triples. Another framework that assists the transformation of XML and JSON data sources is xCurator [13], which focuses on delivering high-quality linked data. Apart from the above, there exist also tools, in the form of web services (e.g. The Datatank¹¹) or parts of other infrastructures (e.g. Virtuoso Sponger¹²), that provide custom solutions to work with data from different formats and possibly construct RDF graphs out of them. These tools, however, are general data processing and transformation tools and are not designed to directly support semantic mappings of general data to RDF triples.

To resolve the polymorphy of tools and focus on the semantic aspects of the Data-to-RDF mapping process, several works extend the W3C recommended R2RML language to support other data formats. These include KR2RML [23], xR2RML [19] and RML [9]. These proposals are a considerable advance with respect to custom system solutions, because they are based on an existing, clean, mapping-oriented standard, and allow backward compatibility and, in most cases, extensibility. It should be noted, however, that simply extending the R2RML standard to support other data source types does not necessarily carry all its features over to the other data types. E.g. select conditions and transformation functions are supported implicitly by R2RML by relying on the expressivity of the SQL query language, but this is not fully portable in a straightforward extension to the case of XML or JSON documents.

7 https://github.com/tarql/tarql/
8 https://github.com/knudmoeller/Vertere-RDF/
9 http://www.gac-grid.de/project-products/Software/XML2RDF.html
10 http://daverog.github.io/tripliser/
11 http://thedatatank.com/
12 http://vos.openlinksw.com/owiki/wiki/VOS/VirtSponger

2.1 R2RML and RML
R2RML works with logical tables (rr:LogicalTable), which may be either base tables or views (rr:BaseTableOrView), defined by specifying an appropriate table name (rr:tableName), or result sets (rr:R2RMLView), obtained by executing a query (rr:sqlQuery). Each logical table is mapped to RDF triples using one or more triples maps (rr:TriplesMap). A triples map is a complex rule that maps each row of the underlying logical table to several RDF triples. The rule has two parts: a subject map (rr:SubjectMap), which generates the subject of all RDF triples that will be generated from each row of the logical table, and several predicate-object maps (rr:PredicateObjectMap), which in turn consist of predicate maps (rr:PredicateMap) and object maps (rr:ObjectMap) or referencing object maps (rr:RefObjectMap). A predicate map determines predicates for the to-be-generated RDF triples for the given subject, and the object maps their objects. A subject map may include several IRIs (rr:class) that will be used as objects to generate triples with the predicate rdf:type for the particular subjects. A subject map or predicate-object map may also have one or more graph maps (rr:GraphMap) associated with it, which specify the target graph of the resulting RDF triples. Referencing object maps allow joining two different triples maps. A referencing object map specifies a parent triples map (rr:parentTriplesMap), the subjects of which will act as objects for the current triples map, and may contain (rr:joinCondition) a join condition (rr:Join), specified by a reference to a column name of the current and the parent triples map (rr:child and rr:parent, respectively). The IRIs and literals that will be used as RDF triple subjects, predicates, objects, or RDF graph names may be either declared constants (rr:constant), or obtained from the underlying table, view or result set by specifying the desired column name (rr:column) that will act as value source, or generated through a string template (rr:template) that concatenates column values and custom strings. String templates offer only very rudimentary options to manipulate actual database values and generate custom IRIs and literals.
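For illustration, the following is a minimal R2RML triples map over a hypothetical EMP table (the table, column and vocabulary names are invented):

    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.com/ns#> .

    <#EmployeeMap> a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "EMP" ] ;
        rr:subjectMap [
            rr:template "http://example.com/employee/{EMPNO}" ;   # IRI built from the EMPNO column
            rr:class ex:Employee
        ] ;
        rr:predicateObjectMap [
            rr:predicate ex:name ;
            rr:objectMap [ rr:column "ENAME" ]                     # literal taken directly from the ENAME column
        ] .

Each row of EMP thus yields one ex:Employee resource with an ex:name property.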
RML extends R2RML by allowing other sources (e.g. JSON or XML files) apart from logical tables (rml:LogicalSource), which may be used in an interlinked manner, by defining data iterators (rml:iterator) that split the data obtained from such sources into base elements on which each mapping rule will be applied, and by allowing particular references (rml:reference), in the form of subelement selectors within the base element, to define the value sources to be used for the generation of IRIs and literals. Both the iterators and the references depend on the underlying data source, and may be XPath queries, JSONPath queries, CSV column names or SPARQL return variable names. Their type is declared using the rml:referenceFormulation predicate.

With respect to the specification of the actual access to the data sources, R2RML leaves the issue to the implementation. The assumption is that each R2RML document applies to data from a unique database. In contrast, RML, which allows multiple sources and cross-references between the retrieved data, must include the data source descriptions within the RML document. To describe them, it suggests the use of some recommended or widely-used vocabularies, such as DCAT¹³, D2RQ¹⁴, CSVW¹⁵, Hydra¹⁶ and SPARQL-SD¹⁷, to access files, relational databases, CSV/TSV files, web APIs and SPARQL endpoints, respectively. However, these vocabularies have been developed mainly for APIs and data sources to inform clients about their exact properties and the services they offer, and not as a means of formulating requests to them. E.g. to retrieve data from a web API that paginates the results using next page access keys, knowledge on how to formulate each subsequent HTTP request is needed; this is not covered, for example, by Hydra. Similarly, a SPARQL-SD specification provides information about the supported SPARQL version, the default entailment regime, the default named graph, etc., which are not useful to a client at the time of formulating a request.

13 https://www.w3.org/TR/vocab-dcat/
14 http://d2rq.org/d2rq-language
15 https://www.w3.org/TR/tabular-metadata/
16 https://www.hydra-cg.com/spec/latest/core/
17 https://www.w3.org/TR/sparql11-service-description/
3 DATA MODEL
In this section, we extend the table-based model underlying the R2RML language to support complex, non-tabular data that can be obtained from various information sources (such as sources returning JSON or XML documents). To do this, we consider that RDF triples are generated from set tables instead of logical tables. In the following we represent an RDF triple as a tuple ⟨s, p, o⟩, where s is the subject, p the property or predicate, and o the object.

Definition 3.1. A set row of arity k is a tuple ⟨D_1, ..., D_k⟩, where D_1, ..., D_k are sets of values over some domains. A name row of arity k is a tuple ⟨n_1, ..., n_k⟩, where n_1, ..., n_k are names. A set table of arity k with m rows is a tuple S = ⟨N, T⟩, where N is a name row and T = [D_1, ..., D_m] a list of set rows, all of arity k, such that the i-th elements of D_1, ..., D_m, for 1 ≤ i ≤ k, all share the same domain.

The names allow us to refer to particular elements of set rows and tables. We denote the set of values that corresponds to name n_i (1 ≤ i ≤ k) in a set row D by D[n_i]. We also denote the list [D_1[n_i], ..., D_m[n_i]] of value sets that are obtained from the several set rows of S by S[n_i], which we call a column of S. Let also dom(n) denote the domain of column n. It should be underlined that, for a particular set row D and the different possible names n_i, the several sets D[n_i] may have different numbers of values, there is no alignment between the individual values among the several sets, and all individual values are equivalent with respect to their relation to the values of the other sets in the same set row.

Definition 3.2. A filter F over a set table S of arity k is a tuple ⟨n, f⟩, where n is a column name and f : dom(n) → dom(n) a function, such that f(D[n]) ⊆ D[n] for all set rows D of S.

We denote the set value f(D[n]), obtained by applying F on a set row D, by F(D). Clearly, f may be the identity function.

Definition 3.3. A triples rule R over a set table S = ⟨N, T⟩ is a triple of filters ⟨F_s, F_p, F_o⟩ over S, called the subject, predicate and object filter, respectively. The implementation of R is the set of RDF triples

    {⟨s, p, o⟩ | s ∈ F_s(D), p ∈ F_p(D), o ∈ F_o(D), D ∈ T}.

A set of triples rules over one or more set tables defines a Data-to-RDF mapping. Using the above simple model we can define Data-to-RDF mappings for any information sources that can give rise to one or more set tables. The triple store represented by a Data-to-RDF mapping is then the implementation of all its triples rules.
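As a small illustration (with invented values), consider a set table S with name row N = ⟨person, prop, mbox⟩ and a single set row D = ⟨{ex:alice}, {foaf:mbox}, {"a@ex.org", "alice@ex.org"}⟩. Taking F_s, F_p and F_o to be the identity filters on the person, prop and mbox columns, respectively, the implementation of the triples rule ⟨F_s, F_p, F_o⟩ is

    {⟨ex:alice, foaf:mbox, "a@ex.org"⟩, ⟨ex:alice, foaf:mbox, "alice@ex.org"⟩},

i.e. the single subject is combined with every value of the object column, reflecting the absence of alignment between the value sets within a set row.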

We consider an information source to be any online software system that can deliver structured data upon request. The information source may be a data repository (e.g. a relational database, an RDF store, an XML file stored in some directory) or an implementation of a service or an algorithm (e.g. a RESTful web service) that may process some input data and deliver some structured output. The request, in the form of a query (e.g. an SQL or SPARQL SELECT query) or message (e.g. an HTTP GET or POST request) in a format supported by the information source, includes all input data and parameters required by the information source to generate and deliver the output. The reply, or effective data source, is the output produced by the information source upon processing the request. The reply may be delivered to the client in a native format (e.g. as an SQL result set), or in a generic document format (e.g. as a JSON or XML document).

To accommodate the several possible information sources in our model, we consider, as in RML, that the effective data source groups some set of autonomous elements (e.g. rows of an SQL result set, elements of a JSON array). The division of the reply into these autonomous elements is achieved through an iterator. Hence, an effective data source together with an iterator specifies a logical array, through whose items the iterator eventually iterates. Each item of a logical array may itself be a complex data structure (a new effective data source), so in order to extract from it lists of values to construct set rows and use them as subjects, predicates and objects of RDF triples, we need some selectors. Thus, the role of the selectors is to transform a logical array into a set table.

Definition 3.4. The triple A = ⟨I, t, L⟩, where I is an information source and request specification, t an iterator specification, and L a set of selectors, is a data acquisition pipeline.

It follows that each data acquisition pipeline A gives rise to a unique set table S_A. A data acquisition pipeline may be parametric, in the sense that the information source or request specification may contain parameters. Given a non-parametric data acquisition pipeline A, a parametric data acquisition pipeline A′ that depends on A is a data acquisition pipeline whose parameters take values from one or more columns of S_A. We call such a parametric data acquisition pipeline a transformation of A.

Definition 3.5. A series of data acquisition pipelines A_0, A_1, ..., A_l, where each A_i, for i ≥ 1, is a transformation that depends on one or more A_j with j < i, is a set table specification. A_0 is the primary data acquisition pipeline.

A set table specification gives rise to a unique set table, which is S_{A_0} extended by columns contributed by the transformations A_1, ..., A_l. A trivial set table specification consists only of the primary data acquisition pipeline A_0. Each transformation in a set table specification is realized as a series of requests to the respective information source, after binding the parameters to all possible combinations of values obtained from the referred-to columns of the set table constructed from the preceding data acquisition pipelines. In particular, to evaluate a set table specification, we must evaluate the data acquisition pipelines serially, extending at each step the previously obtained set table: The primary data acquisition pipeline A_0 gives rise to set table S_{A_0}. Then, for each set row D of S_{A_0}, evaluating A_1 gives rise to a set table S_{A_1}(D). By flattening all rows of S_{A_1}(D) into a single row (by merging the respective column values of each row) we obtain a new set row that is appended to D. Doing this for all set rows D results in S_{A_0 A_1}. By applying this process iteratively, S_{A_0} is eventually extended with additional columns to the set table S_{A_0 A_1 ... A_l}.

More formally, let n_1, ..., n_k be the names and [D_1, ..., D_m] the rows of Ŝ = S_{A_0 ... A_i}. Evaluating A_{i+1} on each row of Ŝ produces the set tables S_{A_{i+1}}(D_1), ..., S_{A_{i+1}}(D_m). Since all these set tables are produced by the same data acquisition pipeline A_{i+1}, they share the same arity, say k′, and column names, say n̂_1, ..., n̂_{k′}. Thus S_{A_0 ... A_{i+1}} = ⟨N, T⟩, where N = ⟨n_1, ..., n_k, n̂_1, ..., n̂_{k′}⟩, T = [D′_1, ..., D′_m], D′_j = ⟨D_j[n_1], ..., D_j[n_k], D̂_{j1}, ..., D̂_{jk′}⟩ for 1 ≤ j ≤ m, and D̂_{jl} = ⋃ S_{A_{i+1}}(D_j)[n̂_l] (the union of all value sets in the n̂_l column of S_{A_{i+1}}(D_j)) for 1 ≤ l ≤ k′.

The row flattening step is intentional: S_{A_0} provides the original data that we want to extend through transformations, i.e. by appending new columns containing new properties of that data. Since, as mentioned above, all values contained in a particular row and column of a set table are equivalent with respect to the values in the sets of the other columns of the current row, the flattening behaviour maintains this relationship between values, without introducing non-desired hierarchical dependencies. Finally, the primary data acquisition pipeline may itself be parametric. In this case, the evaluation is done exactly as described above, but the set rows generated by A_0 are not appended to the set table on which it depends, but initiate a new set table.
4 INFORMATION SOURCES AND REPLIES
We now study how several information and effective data sources used in real applications can be accommodated by our model. We discuss relational databases, RESTful web services, JSON, XML and CSV/TSV documents, and SPARQL endpoints. Tables 1 and 2 summarize the corresponding requests, effective data sources, iterators and selectors.

Table 1: Information sources, requests and replies

  Information Source         Request                             Effective Data Source
  RDBMS                      SQL SELECT Query                    SQL Result Set
  SPARQL Endpoint            SPARQL SELECT Query and RDF graph   SPARQL Result Set
                             IRIs, via HTTP Message              via HTTP Message
  RESTful Web Service        HTTP GET/POST Request               JSON/XML/CSV/TSV Document
  JSON/XML/CSV/TSV Document  HTTP GET Request                    JSON/XML/CSV/TSV Document

Table 2: Effective data sources, iterators and selectors

  Effective Data Source   Iterator         Selector
  SQL Result Set          Row Iterator     Column name
  SPARQL Result Set       Row Iterator     Variable name
  JSON Document           JSONPath query   Flat JSONPath query
  XML Document            XPath query      Flat XPath query
  CSV/TSV Document        Row Iterator     Column name

4.1 Relational Databases
In relational databases, data is organized into one or more tables (or relations) of columns (or attributes) and rows (or tuples). Each table column has a name. Data are retrieved by issuing an SQL SELECT query, and the results are packed as a result set, which is essentially a row-by-row iterable table along with its metadata. Because relational database management systems (RDBMSs) use native formats to implement the data stores and the result formats, communication with RDBMSs is done using special protocols (such as ODBC, JDBC) that implement clients for particular RDBMSs. Practical access requires several parameters to be specified (e.g. server location, database name, user name, password, access driver), which are usually grouped in the so-called connection string and are programming language and implementation dependent. There is no standard for representing connection strings in RDF form. The D2RQ Mapping Language [7] allows a JDBC-dependent RDF definition of connection strings and is used by RML to specify RDBMS connectivity.

An implementation provided with an RDBMS connection specification can connect to the particular RDBMS, pose an SQL SELECT query q that specifies attributes n_1, ..., n_k in the SELECT statement for the returned columns, and obtain as result a list of rows [⟨v_11, ..., v_1k⟩, ..., ⟨v_n1, ..., v_nk⟩]. Using a trivial row iterator and the column names n_1, ..., n_k as selectors, the results of q can be converted to the following set table:

    ⟨⟨n_1, ..., n_k⟩, [⟨{v_11}, ..., {v_1k}⟩, ..., ⟨{v_n1}, ..., {v_nk}⟩]⟩.
4.2 RESTful Web Services
RESTful web services are services based on the REST principles [11], and are usually implemented using the HTTP protocol. Typically, a data retrieving RESTful service accepts an HTTP request and delivers the result in a self-descriptive text message (e.g. HTML, XML, JSON, or plain text). Here we are interested in structured reply services, i.e. services whose reply is in one of the XML, JSON or CSV/TSV formats. To access a RESTful web service, the elements of the appropriate HTTP request have to be specified. These include the method (GET or POST), the URI (including the query string in the case of a GET message), any headers, and the body (for passing parameters in the case of a POST message). All these can be specified in RDF using the W3C Working Group Notes 'HTTP Vocabulary in RDF 1.0' [16] and 'Representing Content in RDF 1.0' [17]. Thus, we can assume that an HTTP client that can consume an HTTP Vocabulary and Representing Content in RDF 1.0 description to create an HTTP request can use a RESTful web service and obtain as result a structured document. Although not strictly qualifying as RESTful web services, we also include in this category URIs that simply deliver structured documents (e.g. URIs of static JSON/XML files), since the communication is performed in exactly the same way, through HTTP messages.
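A rough, indicative sketch of such a request description follows (the service URI and parameter are invented, and the exact terms of the two vocabularies expected by an implementation may differ):

    @prefix http: <http://www.w3.org/2011/http#> .
    @prefix cnt:  <http://www.w3.org/2011/content#> .

    [] a http:Request ;
        http:methodName "POST" ;
        http:requestURI "https://api.example.org/annotate" ;               # invented service URI
        http:headers ( [ a http:MessageHeader ;
                         http:fieldName "Content-Type" ;
                         http:fieldValue "application/x-www-form-urlencoded" ] ) ;
        http:body [ a cnt:ContentAsText ;
                    cnt:chars "text=Athens" ] .                             # request parameters carried in the body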
A practical consideration usually related to some RESTful web services is that the APIs that implement the services, in order to avoid extremely long replies, paginate the results and do not return the full set of results as one document, but as a series of smaller documents: in most cases, each returned document contains some keys that can be used by the client in the subsequent request to instruct the server to return the next set of results. The pagination schema may get non-trivial, as in the case of MediaWiki¹⁸.

18 https://www.mediawiki.org/wiki/API:Query

4.3 SPARQL Endpoints
SPARQL endpoints are URIs at which a SPARQL Protocol service listens [10]. The SPARQL Protocol is built on top of HTTP, and as such it can be treated as a RESTful web service. However, since special SPARQL Protocol clients exist in the form of APIs (e.g. Apache Jena¹⁹) that hide from the user the cumbersome details of building and decoding the necessary HTTP request and reply messages, it is useful to provide support also for this type of interaction. The situation is similar to the RDBMS case: the request is a SPARQL SELECT query (possibly along with some default and named RDF graph IRIs) instead of an SQL SELECT query, and the effective data source is a result set whose column names are the return variable names specified in the SPARQL query. Thus, the translation of the reply to a set table is done in exactly the same way. The only essential thing that changes is the specification of the access to the SPARQL endpoint, for which a single URI is enough.

19 https://jena.apache.org/
4.4 JSON Documents
A JSON document [15] may be modeled as a JSON tree [4]. A JSON tree is an edge-labeled tree whose root represents the entire document. A node may have either string- or integer-labeled children, but not both. A node with string-labeled outgoing edges represents a set of JSON key-value pairs: the edge label is the key and the edge destination the corresponding value. A node with integer-labeled outgoing edges represents an array: the edge label is the array index and the edge destination the corresponding value. Value nodes are either leaf nodes having a string or integer label, or JSON trees.

In the absence of an official standard, to select values from a JSON document that meet specific conditions, in practice the JSONPath [12] specification is used, which is inspired by XPath. JSONPath queries select nodes of a JSON tree that meet a certain path condition and group them into a JSON array, which is the result of the query. Since a JSON array is a JSON document, the result of a JSONPath query is always a JSON document. We will say that a JSONPath query is flat if the result JSON tree has depth 1, i.e. it is an array of simple values.

Hence, an iterator for a JSON tree T is any relevant JSONPath query q, which splits T into a logical array of smaller JSON trees T_1, ..., T_n, and the selectors are flat JSONPath queries q_1, ..., q_k that are executed over each T_1, ..., T_n to deliver a set table from the underlying logical array. Thus T, after applying iterator q and selectors q_1, ..., q_k, yields the set table ⟨⟨q_1, ..., q_k⟩, [⟨C_11, ..., C_1k⟩, ..., ⟨C_n1, ..., C_nk⟩]⟩, where C_ij is the set of values contained in the array that results from applying q_j on T_i.
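For illustration (the document and queries are invented), consider the JSON document

    { "people": [ { "name": "Alice", "emails": ["a@ex.org", "alice@ex.org"] },
                  { "name": "Bob",   "emails": ["b@ex.org"] } ] }

With the iterator q = $.people[*] and the selectors q_1 = $.name and q_2 = $.emails[*] (applied to each item delivered by the iterator), we obtain the set table ⟨⟨q_1, q_2⟩, [⟨{"Alice"}, {"a@ex.org", "alice@ex.org"}⟩, ⟨{"Bob"}, {"b@ex.org"}⟩]⟩.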
4.5 XML Documents
An XML document may also be modeled using a tree [5]; however, its structure differs from a JSON tree. The core part of an XML document is represented in the tree by element, attribute and text nodes. Each element node corresponds to an element of the XML document and has a name (the element name) and children that are all the enclosed elements. It may also have as child a text node that holds in its string value the characters in the CDATA section of the element. Each element node may also have associated with it a set of attribute nodes that represent the attributes of the element, which, however, are not considered to be children of the element node. Each attribute node has a name (the attribute name) and a string value that holds the respective attribute value. Relying on this model, the XPath language allows the selection of particular nodes from the tree that meet certain conditions. Unlike in the case of JSON, the result is not itself an XML document, but a set of the nodes that match the query criteria. We will say that an XPath query is flat if the result contains only text or attribute nodes.

Hence, we can consider as iterator for an XML document tree T any relevant non-flat XPath query q that splits T into a logical array of nodes N_1, ..., N_n. Since the query is non-flat, these nodes are element nodes, and can be treated as smaller XML document trees T_1, ..., T_n. The selectors are then flat XPath queries q_1, ..., q_k that are executed over each one of these smaller XML documents. Thus, T, after applying iterator q and selectors q_1, ..., q_k, yields the set table ⟨⟨q_1, ..., q_k⟩, [⟨C_11, ..., C_1k⟩, ..., ⟨C_n1, ..., C_nk⟩]⟩, where C_ij are the string values of the text or attribute nodes in the node set obtained by applying q_j on T_i.

4.6 CSV/TSV Documents
CSV/TSV documents are textual representations of tabular data. Each line represents a data row, except possibly for the first row, which may contain the names of the columns. Hence, the situation is similar to the RDBMS case, with no need for a query to be specified. The name row consists of the names of the columns in the file (or of their numbering), and the set rows of the actual rows of the table. The only things that need to be specified are the formatting details (e.g. delimiter, escape and quote characters).
5 D2RML SPECIFICATION
D2RML draws significantly from R2RML and RML, and follows the same simple syntactical strategy for defining mappings: triples maps, which consist of a subject map and several predicate-object maps. From RML it adopts and appropriately extends the way to define the interaction with information sources through requests, iterators and selectors. Moreover, it significantly extends the expressive capabilities of R2RML and RML by allowing transformations, conditional statements, and custom IRI generation functions.

For its semantics, D2RML relies on the data model described in Section 3. Each triples map is essentially a set table specification of Def. 3.5 and a specification of a set of triples rules of Def. 3.3 with the same subject filter over the common underlying set table. The information source, request and iterator of the primary data acquisition pipeline are directly provided in the triples map definition. Any transformations to be added to the set table specification are declared in the order of their application. The selectors are implicitly declared in the subject, predicate, object and graph maps. Several triples maps are allowed to coexist in a D2RML document, in which case several distinct set tables are generated.

We define D2RML using a BNF-like notation. Terminal symbols are written in monospace, and non-terminals in italics. Non-terminals within angle brackets represent RDF nodes. Parentheses specify the scope of alternatives (separated by |) and of the standard quantifiers ?, *, and +. Terminal symbols not explicitly defined in the specification are written in small caps. The namespaces are defined in Table 3. D2RML is compatible with R2RML, but not fully compatible with RML, so it does not directly extend its namespace.

Table 3: Namespaces used in D2RML documents

  Prefix   IRI
  rr       http://www.w3.org/ns/r2rml#
  dr       http://islab.ntua.gr/ns/d2rml#
  op       http://islab.ntua.gr/ns/d2rml-op#
  is       http://islab.ntua.gr/ns/d2rml-is#
  http     http://www.w3.org/2011/http#
  cnt      http://www.w3.org/2011/content#
5.1 Triples Maps
A triples map is defined as in R2RML and RML, but information sources that provide tabular data are clearly distinguished from non-tabular ones: rr:logicalTable is used for the former and dr:logicalSource for the rest.

 TriplesMap ← a rr:TriplesMap
              rr:logicalTable ⟨LogicalTable⟩ |
                         dr:logicalSource ⟨LogicalSource⟩
              (dr:transformations ( ⟨Transformation⟩+ ))?
              rr:subjectMap ⟨SubjectMap⟩ | rr:subject iri
              (rr:predicateObjectMap ⟨PredObjMap⟩)*

 PredObjMap ← a rr:PredicateObjectMap
              (rr:predicateMap ⟨PredicateMap⟩ | rr:predicate iri)+
              (rr:objectMap (⟨ObjectMap⟩ | ⟨RefObjectMap⟩) |
                            rr:object (iri | literal))+
              (rr:graphMap ⟨GraphMap⟩ | rr:graph iri)*

5.2 Logical Tables and Logical Sources
The LogicalTable and LogicalSource nodes provide details about the primary information source used to generate the set table. In the case of query supporting information sources (such as RDBMSs and SPARQL endpoints), for backward compatibility with R2RML, they also contain the query-relevant details of the request that should be sent to the information source. The is:parameters predicate may be used to declare parameter names in queries that participate in parametric data acquisition pipelines. For other information sources (such as RESTful web services), the request, and any parameters, are included in the InformationSource specification itself. For information sources providing non-tabular data, LogicalSource also contains the definition of the iterator (dr:iterator and dr:referenceFormulation) that will be used to split the effective data source into a logical array. Since the effective data source format is fixed, the object of dr:referenceFormulation determines also the form of all selectors that will be applied on the particular effective data source.

 LogicalTable ← a rr:LogicalTable
                dr:source ⟨InformationSource⟩
                SQLTable | SPARQLTable | CSVTable
                (is:parameters ( ⟨DataVariable⟩+ ))?

 LogicalSource ← a dr:LogicalSource
                 dr:source ⟨InformationSource⟩
                 dr:iterator literal
                 dr:referenceFormulation iri

 SQLTable ← (a rr:BaseTableOrView
             rr:tableName literal) |
            (a rr:R2RMLView
             rr:sqlQuery literal
             (rr:sqlVersion iri)?)

 SPARQLTable ← a dr:SPARQLTable
               dr:sparqlQuery literal
               (dr:sparqlVersion iri)?
               (dr:defaultGraph iri)*
               (dr:namedGraph iri)*

 CSVTable ← a dr:TextTable
            dr:delimiter literal
            dr:headerline boolean
            (dr:quoteCharacter literal)?
            (dr:commentCharacter literal)?
            (dr:escapeCharacter literal)?
            (dr:recordSeparator literal)?
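For illustration, a logical table over a relational source, using an is:RDBMSSource information source as defined in the next subsection, might be declared as follows (prefixes as in Table 3; the connection details and query are invented, and the IRI identifying the RDBMS type is indicative only):

    <#personsTable> a rr:LogicalTable , rr:R2RMLView ;
        dr:source <#myDB> ;
        rr:sqlQuery "SELECT id, name FROM persons" .

    <#myDB> a is:RDBMSSource ;
        is:rdbms <http://islab.ntua.gr/ns/d2rml-is#MySQL> ;   # indicative IRI for the RDBMS type
        is:location "localhost:3306" ;                        # invented connection details
        is:database "hr" ;
        is:username "demo" ;
        is:password "demo" .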
5.3 Information Sources
The version of D2RML presented here provides definitions for implementing data acquisition pipelines involving RDBMSs, RESTful web services and SPARQL endpoints. Extensions for additional sources are expected in subsequent versions.

 InformationSource ← RDBMSSource | SPARQLService | HTTPSource

 RDBMSSource ← a is:RDBMSSource
               is:rdbms iri
               is:location literal
               (is:username literal)?
               (is:password literal)?
               (is:database literal)?

 SPARQLService ← a is:SPARQLService
                 is:uri uri

 HTTPSource ← a is:HTTPSource
              is:request ⟨HTTPRequest⟩ | is:uri uri
              (is:parameters ( ⟨Parameter⟩+ ))?

 Parameter ← DataVariable | SimpleKeyRequestIterator

 DataVariable ← a is:DataVariable
                is:name literal

 SimpleKeyRequestIterator ← a is:SimpleKeyRequestIterator
                            is:name literal
                            dr:reference literal
                            dr:referenceFormulation literal
                            is:initialValue literal

In an RDBMSSource, is:rdbms determines the specific RDBMS (e.g. MySQL, PostgreSQL). An HTTPSource is specified in terms of an HTTPRequest, which should be an http:Request and specify the details of the HTTP message to be sent. An HTTPSource may contain parameters in case the web service is part of a parametric data acquisition pipeline, or in case it paginates the results. Data parameters are identified by a name (is:name). For paginated results, the above specification allows, as an example, iterated requests through a request iterator that should be part, e.g., of the web service URI, and whose values, apart from the initial value (is:initialValue), are extracted each time from the previous reply using a selector. Extensions are possible to support additional pagination policies.
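For illustration, an HTTP information source for a paginating JSON web service, together with a logical source over it, might be declared roughly as follows (prefixes as in Table 3; the service URI, the JSONPath expressions, the reference formulation values and the way the page key enters the request are all invented or indicative only):

    <#peopleAPI> a is:HTTPSource ;
        is:uri <https://api.example.org/people> ;               # invented paginating JSON service
        is:parameters ( [ a is:SimpleKeyRequestIterator ;
                          is:name "page" ;                      # request key carrying the page token
                          is:initialValue "1" ;
                          dr:reference "$.nextPage" ;           # selector extracting the next page token from each reply
                          dr:referenceFormulation "JSONPath" ] ) .

    <#peopleSource> a dr:LogicalSource ;
        dr:source <#peopleAPI> ;
        dr:iterator "$.results[*]" ;                            # splits each reply into person objects
        dr:referenceFormulation <http://islab.ntua.gr/ns/d2rml-is#JSONPath> .   # indicative IRI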
5.4 Transformations
A triples map definition may include a list of transformations that should be applied, in the declared order, to the set table derived from the primary information source. Since a transformation is itself a parametric data acquisition pipeline, its definition includes the specification of an InformationSource, through an rr:logicalTable or dr:logicalSource, and one or more ParameterBindings. A ParameterBinding consists of a reference to a value (ValueRef) or a constant value, and the name (dr:parameter) of the parameter of the corresponding information source to which the value will be bound.

 Transformation ← a dr:Transformation
                  rr:logicalTable ⟨LogicalTable⟩ |
                             dr:logicalSource ⟨LogicalSource⟩
                  (dr:parameterBinding ⟨ParameterBinding⟩)+

 ParameterBinding ← a dr:ParameterBinding
                    dr:parameter literal
                    rr:constant literal | ValueRef
 ParameterBinding ← a dr:ParameterBinding                               (eg. number or string comparison) depends on the XSD type of the
                    dr:parameter literal                                literal provided as operand. If a nested condition does not specify
                    rr:constant literal | ValueRef                      a value reference, it inherits it from the enclosing condition.
                                                                           The case statement offers alternatives for realizing a term map:
                                                                        It contains a list of alternative term maps, each along with a con-
5.5    Term Maps and Conditions                                         dition. If the condition evaluates to true the term map is realized,
The definitions of term maps (i.e. of subject maps, graph maps,         otherwise control flows to the next case.
predicate maps and object map) follow the R2RML specification              Finally, a referring object map (RefObjectMap) may be defined by
with the addition of filters.                                           a ParameterBinding, instead of by a R2RML JoinCondition. This
                                                                        is how set table specifications with parametric primary data acqui-
 SubjectMap ← a rr:SubjectMap
                                                                        sition pipelines are defined: the parametric set table specification
              IRIRef | BlankNodeRef
              (SubjectBody CaseSubjectBody*) | CaseSubjectBody+
                                                                        corresponds to the parent triples map of RefObjectMap, and the
                                                                        ParameterBinding provides the parameters values.
 PredicateMap ← a rr:PredicateMap
                (PredicateBody CasePredBody*) | CasePredBody+
 ObjectMap ← a rr:ObjectMap                                             5.6    IRIs, Literals and Blank Nodes
             (ObjectBody CaseObjectBody*) | CaseObjectBody+             In R2RML, RDF terms are generated using the rr:constant, the
 GraphMap ← a rr:GraphMap                                               rr:column and rr:template predicates; to these, RML adds the
            (GraphBody CaseGraphBody*) | CaseGraphBody+                 rml:reference option. D2RML follows the same strategy, but to
 SubjectBody ← (rr:class IRI)*                                          account for values coming from transformations, RDF terms are
               (rr:graphMap ⟨GraphMap⟩ | rr:graph IRI)*                 generated through value references (ValueRefs), specified by two
               (dr:condition ⟨Condition⟩)?                              distinct components: a compulsory rr:column, rr:template or
 PredicateBody ← IRIRef                                                 dr:reference, and an optional dr:transformationReference to
                 (dr:condition ⟨Condition⟩)?                            specify the transformation that provides the logical array for the
 ObjectBody ← IRIRef | BlankNodeRef | LiteralRef                        respective rr:column, rr:template or dr:reference. If missing,
              (dr:condition ⟨Condition⟩)?                               the primary logical array is assumed.
 GraphBody ← IRIRef                                                        Although rr:template allows some minimal flexibility in defin-
             (dr:condition ⟨Condition⟩)?                                ing custom IRIs or literals, the overall mechanism is quite restric-
 CaseSubjectBody ← dr:cases ( ⟨SubectBody ⟩+ )
                                                                        tive, since no simple transformations (e.g. replace particular char-
                                                                        acters etc.) can be applied on the values obtained from the underly-
 CasePredBody ← dr:cases ( ⟨PredicateBody ⟩+ )
                                                                        ing set tables. D2RML addresses this issue by allowing simple func-
 CaseObjectBody ← dr:cases ( ⟨ObjectBody ⟩+ )                           tions to be applied on the raw values obtained from effective data
 CaseGraphBody ← dr:cases ( ⟨GraphBody ⟩+ )                             sources. Thus, a ValueRef may include definitions of one or more
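To make the filter and case mechanisms concrete, the following sketch (our own, not taken from the use case below) attaches a two-branch case statement and a condition to an object map; the predicate IRI, the JSONPath fields $.category and $.score, and the constant values are hypothetical:

   rr:predicateObjectMap [
       rr:predicate <http://example.org/category> ;
       rr:objectMap [
           dr:cases ( [
              dr:reference "$.category" ;
              rr:termType rr:Literal ;
              dr:condition [ dr:reference "$.score" ;
                             op:geq "0.5"^^xsd:decimal ; ] ;
           ] [
              rr:constant "uncategorized" ;
           ] ) ;
       ] ;
   ] .

For each row of the set table, the first alternative is realized only if the row's score passes the threshold; otherwise control flows to the second alternative, which emits a constant literal.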
5.6 IRIs, Literals and Blank Nodes
In R2RML, RDF terms are generated using the rr:constant, the rr:column and rr:template predicates; to these, RML adds the rml:reference option. D2RML follows the same strategy, but to account for values coming from transformations, RDF terms are generated through value references (ValueRefs), specified by two distinct components: a compulsory rr:column, rr:template or dr:reference, and an optional dr:transformationReference to specify the transformation that provides the logical array for the respective rr:column, rr:template or dr:reference. If missing, the primary logical array is assumed.
   Although rr:template allows some minimal flexibility in defining custom IRIs or literals, the overall mechanism is quite restrictive, since no simple transformations (e.g. replacing particular characters) can be applied on the values obtained from the underlying set tables. D2RML addresses this issue by allowing simple functions to be applied on the raw values obtained from effective data sources. Thus, a ValueRef may include definitions of one or more defined columns (dr:definedColumns) that are constructed by applying a series of functional transformations on particular set table column values and may be used in a rr:column or rr:template. A defined column should declare the new column name (dr:name) by which it will be referred to, the function (dr:function) that will generate the custom values (e.g. op:regex, op:replace), and a list of arguments, in the form of one or more dr:parameterBindings. The parameter names should be provided by the function definition.

   IRIRef ← rr:constant iri | ValueRef
            (rr:termType rr:IRI)?

   LiteralRef ← rr:constant literal | ValueRef
                (rr:termType rr:Literal)?
                (rr:language literal | rr:datatype iri)?

   BlankNodeRef ← ValueRef
                  (rr:termType rr:BlankNode)?

   ValueRef ← rr:column literal | rr:template literal | dr:reference literal
              (dr:transformationReference ⟨Transformation⟩)?
              (dr:definedColumns ( ⟨DefinedColumn⟩+ ))?

   DefinedColumn ← a dr:DefinedColumn
                   dr:name literal
                   dr:function iri
                   (dr:parameterBinding ⟨ParameterBinding⟩)+

6 USE CASE
In this section, we present a realistic use case for D2RML, involving real data and readily available web services and data repositories. The aim is to extract an extensive set of textual or URI features for a set of cultural items, in order to subsequently use them to perform several tasks such as clustering and similarity ranking. We assume that we want to extract features in several ways (e.g. directly from the metadata, by applying named entity extraction, image analysis, etc.), and that we want to keep information about the source of each feature, so that we can use features selectively to test how they affect the clustering or similarity algorithm performance.
   As primary information source of cultural items we use Europeana Collections20, in particular the collection provided by TopFoto21, which consists of 60,882 black and white images of the 1930s, along with their metadata. This collection can be obtained through the Europeana API. The D2RML specification for getting the effective data source for this collection is the following:

<#EuropeanaAPI>
   a is:HTTPSource ;
   is:request [
       http:absoluteURI "http://www.europeana.eu/api/v2/search.json?
             wskey=A*******W&rows=20&cursor={@@cursor@@}&profile=rich&
             query=europeana_collectionName%3A%222024904_Ag_EU_
             EuropeanaPhotography_TopFoto_1013%22" ;
       http:methodName "GET" ;
   ] ;
   is:parameters ( [ a is:SimpleKeyRequestIterator ;
                     is:name "cursor" ;
                     is:initialValue "*" ;
                     dr:reference "$.nextCursor" ;
                     dr:referenceFormulation is:JSONPath ; ] ) .

The specification includes an is:SimpleKeyRequestIterator as a parameter, because the API returns the results in pages, and each page contains a key for accessing the next page (nextCursor). An extract from the response obtained from executing the above is the following JSON document, which contains a list of items modeled using the Europeana Data Model (EDM):

{
   "nextCursor": "AoE/GC8yMDI0OTA0L3Bob3Rv****=",
   "items": [
     {
       "id": "/2024904/photography_ProvidedCHO_TopFoto_co_uk_EU061905",
       "dcDescription": [
            "Former chief inspector Berrett decorated by the king.\n
              Former chief detective inspector James Berrett of
              Scotland Yard was decorated by the King at the royal
              invesititure at Buckingham Palace. "
       ],
       "edmIsShownBy": [
         "http://www.topfoto.co.uk/imageflows/imagepreview/f=EU061905"
       ],
       "edmConcept": [
          "http://bib.arts.kuleuven.be/photoVocabulary/12003",
          "http://data.europeana.eu/concept/base/1711"
       ],
       "type": "IMAGE"
     }, ...
   ]
}

Most fields are self-explanatory. edmConcept contains a list of Linked Open Data resources that have been associated with each item by the provider to characterize the respective item content. To generate RDF triples for this information, as well as for the type of each item, we define the following D2RML document:

<#EuropeanaMapping>
   dr:logicalSource [ dr:source <#EuropeanaAPI> ;
                      dr:iterator "$.items" ;
                      dr:referenceFormulation is:JSONPath ; ] ;
   rr:subjectMap [
       dr:definedColumns ( [
          dr:name "SID" ;
          dr:function op:extractMatch ;
          dr:parameterBinding [ dr:parameter "input" ;
                                dr:reference "$.id" ; ] ;
          dr:parameterBinding [ dr:parameter "regex" ;
                                rr:constant "^.*_(.*)$" ; ] ;
       ] ) ;
       rr:template "http://islab.ntua.gr/resources/tp/{SID}" ;
       dr:cases ( [
         rr:class <...> ;
         dr:condition [ dr:reference "$.type" ;
                        op:eq "IMAGE"^^xsd:string ; ] ;
       ] [
         rr:class <...> ;
       ] ) ;
   ] ;
   rr:predicateObjectMap [
       rr:predicate <...> ;
       rr:objectMap [ dr:reference "$.edmConcept" ;
                      rr:termType rr:IRI ; ] ;
   ] .

Note the use of a defined column to construct custom RDF subject IRIs. The particular defined column applies the regular expression ^.*_(.*)$ on the id field of each item and uses the value of the first capturing group, named SID. The above specification generates the following RDF triples for the first item:

<http://islab.ntua.gr/resources/tp/EU061905>
        a <...> .
<http://islab.ntua.gr/resources/tp/EU061905>
        <...> <http://bib.arts.kuleuven.be/photoVocabulary/12003> .
<http://islab.ntua.gr/resources/tp/EU061905>
        <...> <http://data.europeana.eu/concept/base/1711> .
20   https://www.europeana.eu/portal/en   21   http://www.topfoto.co.uk/            
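For concreteness, this is how the request iterator realizes pagination: the first request is issued with cursor=*, and each subsequent request is produced by substituting the nextCursor value selected from the previous reply for the {@@cursor@@} placeholder. Assuming the processor URL-encodes the substituted value, the second request would roughly correspond to:

   [ a http:Request ;
     http:methodName "GET" ;
     http:absoluteURI "http://www.europeana.eu/api/v2/search.json?
           wskey=A*******W&rows=20&cursor=AoE%2FGC8yMDI0OTA0L3Bob3Rv****%3D&
           profile=rich&query=europeana_collectionName%3A%222024904_Ag_EU_
           EuropeanaPhotography_TopFoto_1013%22" ] .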
Since we want to extract several features, we can invoke services to analyze the metadata. An option is to use DBpedia Spotlight to extract named entities from the textual descriptions. To do this, we need a transformation that takes the description of each item (dcDescription) and invokes DBpedia Spotlight on it. We first define the relevant information source:

<#DBpediaSpotlightAPI>
   a is:HTTPSource ;
   is:request [
       http:absoluteURI "http://model.dbpedia-spotlight.org/en/
             annotate?text={@@text@@}&confidence=0.5&support=0&
             spotter=Default&disambiguator=Default&policy=whitelist&
             types=&sparql=" ;
       http:methodName "GET" ;
       http:headers ( [ http:fieldName "Accept" ;
                        http:fieldValue "application/xml" ; ] ) ;
   ] ;
   is:parameters ( [ a is:DataVariable ;
                     is:name "text" ; ] ) .

The respective effective data source has the following XML format

<Annotation text="Former chief inspector Berrett ... Buckingham
      Palace." confidence="0.5" support="0"
      types="" sparql="" policy="whitelist">
   <Resources>
      <Resource URI="http://dbpedia.org/resource/..." ... />
      ...
   </Resources>
</Annotation>

which includes all detected named entities (Resource) as DBpedia resources (URI). We next define the transformation

<#SpotlightTransformation>
   dr:logicalSource [ dr:source <#DBpediaSpotlightAPI> ;
                      dr:iterator "/Annotation/Resources/Resource" ;
                      dr:referenceFormulation is:XPath ; ] ;
   dr:parameterBinding [ dr:parameter "text" ;
                         dr:reference "$.dcDescription" ; ] .

and add the transformation and a new predicate object map to the <#EuropeanaMapping> triples map:

<#EuropeanaMapping>
   ...
   dr:transformations ( <#SpotlightTransformation> ) ;
   rr:predicateObjectMap [
       rr:predicate <...> ;
       rr:objectMap [
           dr:reference "/Resource/@URI" ;
           dr:transformationReference <#SpotlightTransformation> ;
           rr:termType rr:IRI ;
       ] ;
   ] .

When executed, it generates the following additional triples:

<http://islab.ntua.gr/resources/tp/EU061905>
        <...> <http://dbpedia.org/resource/...> .
<http://islab.ntua.gr/resources/tp/EU061905>
        <...> <http://dbpedia.org/resource/...> .

We further extend the set of features by using the DBpedia ontology to get the types of the retrieved DBpedia resources. For this we need a second transformation, dependent on the first one, that consults a DBpedia endpoint. The information source definition is

<#DBpediaSPARQLService>
   a is:SPARQLService ;
   is:uri "http://dbpedia.org/sparql" .

and the transformation

<#DBpediaTransformation>
   dr:logicalSource [
       dr:source <#DBpediaSPARQLService> ;
       dr:query "SELECT ?dbpediatype WHERE
                   { <{@@resource@@}> a ?dbpediatype }" ;
       is:parameters ( [ a is:DataVariable ;
                         is:name "resource" ; ] ) ;
   ] ;
   dr:parameterBinding [
       dr:parameter "resource" ;
       dr:reference "/Resource/@URI" ;
       dr:transformationReference <#SpotlightTransformation> ;
   ] .

Finally, we modify <#EuropeanaMapping> to add the new transformation and a new predicate object map:

<#EuropeanaMapping>
   ...
   dr:transformations ( <#SpotlightTransformation>
                        <#DBpediaTransformation> ) ;
   rr:predicateObjectMap [
       rr:predicate <...> ;
       rr:objectMap [
           dr:reference "?dbpediatype" ;
           dr:transformationReference <#DBpediaTransformation> ;
           rr:termType rr:IRI ;
           dr:condition [
              op:matches "http://dbpedia\\.org/ontology/.*" ;
           ] ;
       ] ;
   ] .

Note that the mapping includes a conditional statement. It has been included because the query returns not only DBpedia ontology concepts as types, but also FOAF, YAGO, Schema, Wikidata, and other resources, which we do not want to include in our results. Eventually, this map generates the following RDF triples:

<http://islab.ntua.gr/resources/tp/EU061905>
        <...> <http://dbpedia.org/ontology/...> .
<http://islab.ntua.gr/resources/tp/EU061905>
        <...> <http://dbpedia.org/ontology/...> .
<http://islab.ntua.gr/resources/tp/EU061905>
        <...> <http://dbpedia.org/ontology/...> .

Finally, we can use computer vision technologies to analyze the image of each item (the URI is provided by the edmIsShownBy field in the document returned by the Europeana API) to detect objects that appear in it. To this end we use Microsoft's Computer Vision API, which is offered as a RESTful web service. Thus, we add a new information source including the required request parameters

<#ComputerVisionAPI>
   a is:HTTPSource ;
   is:request [
       http:absoluteURI "https://westcentralus.api.cognitive.microsoft.
             com/vision/v1.0/analyze?visualFeatures=Categories&
             language=en" ;
       http:methodName "POST" ;
       http:headers ( [ http:fieldName "Content-Type" ;
                        http:fieldValue "application/json" ; ]
                      [ http:fieldName "Ocp-Apim-Subscription-Key" ;
                        http:fieldValue "3*************************b" ; ] ) ;
       http:body [ a cnt:ContentAsText ;
                   cnt:chars "{\"url\" : \"{@@imageURL@@}\" }" ; ] ;
   ] ;
   is:parameters ( [ a is:DataVariable ;
                     is:name "imageURL" ; ] ) .

which produces the following JSON-formatted effective data source:
{
   "categories": [ {
      "name": "people_group",
      "score": 0.578125
   } ],
   "requestId": "3b28df72-abf5-488c-86f4-b2c6a7eb9703"
}

Based on this, we define the transformation

<#ImageTransformation>
   dr:logicalSource [ dr:source <#ComputerVisionAPI> ;
                      dr:iterator "$.categories" ;
                      dr:referenceFormulation is:JSONPath ; ] ;
   dr:parameterBinding [ dr:parameter "imageURL" ;
                         dr:reference "$.edmIsShownBy" ; ] .

to generate a logical array from categories that contains the names of the detected objects, and modify <#EuropeanaMapping> by adding the new transformation and a new predicate object map:

<#EuropeanaMapping>
   dr:transformations ( <#SpotlightTransformation>
                        <#DBpediaTransformation> <#ImageTransformation> ) ;
   rr:predicateObjectMap [
       rr:predicate <...> ;
       rr:objectMap [
           dr:reference "$.name" ;
           dr:transformationReference <#ImageTransformation> ;
           rr:termType rr:Literal ;
           dr:condition [
              dr:reference "$.score" ;
              dr:transformationReference <#ImageTransformation> ;
              op:geq "0.4"^^xsd:decimal ;
           ] ;
       ] ;
   ] .

The above object map applies a filter in order to keep only objects that have been detected with relatively high confidence (score). Eventually, the above map adds the following RDF triple:

<http://islab.ntua.gr/resources/tp/EU061905>
        <...> "people_group" .

   The RDF triples generated by all the above predicate-object maps make up the desired RDF graph. In terms of performance, for executing the above D2RML document, our implementation of the D2RML processor22 took about 7 minutes per 100 Europeana items.

7 CONCLUSIONS
We presented D2RML, a Data-to-RDF mapping language which, based on an abstract data model, allows the orchestrated retrieval of data from several information sources, their transformation and extension using relevant web services, their filtering and manipulation using simple operations, and finally their mapping to RDF graphs. It combines the mapping approach of R2RML and RML with workflow approaches, by allowing the definition of easy to write and understand, homogeneous views of the underlying data and services in a lightweight document. We developed D2RML on top of a formal abstract data model, so as to formally define its semantics and allow future extensions. We also presented a realistic use case, which demonstrates the capabilities of the proposed language in real settings, by delivering unified and coordinated access to Linked Data stores and other services in a clean specification, without the need for code writing or heavyweight solutions.

ACKNOWLEDGEMENTS
We acknowledge support of this work by the project ‘APOLLONIS’ (MIS 5002738) which is implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Programme ‘Competitiveness, Entrepreneurship and Innovation’ (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).

REFERENCES
 [1] Marcelo Arenas, Alexandre Bertails, Eric Prud’hommeaux, and Juan Sequeda. 2012. A Direct Mapping of Relational Data to RDF. (2012). https://www.w3.org/TR/rdb-direct-mapping/
 [2] Stefan Bischof, Stefan Decker, Thomas Krennwallner, Nuno Lopes, and Axel Polleres. 2012. Mapping between RDF and XML with XSPARQL. J. Data Semantics 1, 3 (2012), 147–185.
 [3] Barry Bishop, Atanas Kiryakov, Damyan Ognyanoff, Ivan Peikov, Zdravko Tashev, and Ruslan Velkov. 2011. OWLIM: A family of scalable semantic repositories. Semantic Web 2, 1 (2011), 33–42.
 [4] Pierre Bourhis, Juan L. Reutter, Fernando Suárez, and Domagoj Vrgoc. 2017. JSON: Data model, Query languages and Schema specification. In PODS. ACM, 123–135.
 [5] James Clark and Steve DeRose. 2016. XML Path Language (XPath) Version 1.0. (2016). https://www.w3.org/TR/xpath/
 [6] Dan Connolly. 2007. Gleaning Resource Descriptions from Dialects of Languages (GRDDL). (2007). https://www.w3.org/TR/grddl/
 [7] Richard Cyganiak, Chris Bizer, Jörg Garbers, Oliver Maresch, and Christian Becker. 2012. The D2RQ Mapping Language. (2012). http://d2rq.org/d2rq-language
 [8] Souripriya Das, Seema Sundara, and Richard Cyganiak. 2012. R2RML: RDB to RDF Mapping Language. (2012). https://www.w3.org/TR/r2rml/
 [9] Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle. 2014. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In LDOW (CEUR Workshop Proceedings), Vol. 1184. CEUR-WS.org.
[10] Lee Feigenbaum, Gregory Todd Williams, Kendall Grant Clark, and Elias Torres. 2013. SPARQL 1.1 Protocol. (2013). https://www.w3.org/TR/sparql11-protocol/
[11] Roy T. Fielding and Richard N. Taylor. 2000. Principled design of the modern Web architecture. In ICSE. ACM, 407–416.
[12] Stefan Gössner and Stephen Frank. 2007. JSONPath. (2007). http://goessner.net/articles/JsonPath/
[13] Oktie Hassanzadeh, Soheil Hassas Yeganeh, and Renée J. Miller. 2011. Linking Semistructured Data on the Web. In WebDB.
[14] Matthias Hert, Gerald Reif, and Harald C. Gall. 2011. A comparison of RDB-to-RDF mapping languages. In I-SEMANTICS (ACM International Conference Proceeding Series). ACM, 25–32.
[15] Internet Engineering Task Force (IETF). 2014. The JavaScript Object Notation (JSON) Data Interchange Format. (2014). https://tools.ietf.org/html/rfc7159
[16] Johannes Koch, Carlos A Velasco, and Philip Ackermann. 2017. HTTP Vocabulary in RDF 1.0. (2017). https://www.w3.org/TR/HTTP-in-RDF10/
[17] Johannes Koch, Carlos A Velasco, and Philip Ackermann. 2017. Representing Content in RDF 1.0. (2017). https://www.w3.org/TR/Content-in-RDF10/
[18] Andreas Langegger and Wolfram Wöß. 2009. XLWrap - Querying and Integrating Arbitrary Spreadsheets with SPARQL. In International Semantic Web Conference (Lecture Notes in Computer Science), Vol. 5823. Springer, 359–374.
[19] Franck Michel, Loïc Djimenou, Catherine Faron Zucker, and Johan Montagnat. 2014. xR2RML: Non-Relational Databases to RDF Mapping Language. (2014). https://hal.inria.fr/hal-01066663v1/document
[20] Boris Motik, Peter F. Patel-Schneider, and Bijan Parsia. 2012. OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax (Second Edition). (2012). https://www.w3.org/TR/owl2-syntax/
[21] Yavor Nenov, Robert Piro, Boris Motik, Ian Horrocks, Zhe Wu, and Jay Banerjee. 2015. RDFox: A Highly-Scalable RDF Store. In International Semantic Web Conference (2) (Lecture Notes in Computer Science), Vol. 9367. Springer, 3–20.
[22] Martin J. O’Connor, Christian Halaschek-Wiener, and Mark A. Musen. 2010. M2: A Language for Mapping Spreadsheets to OWL. In OWLED (CEUR Workshop Proceedings), Vol. 614. CEUR-WS.org.
[23] Jason Slepicka, Chengye Yin, Pedro A. Szekely, and Craig A. Knoblock. 2015. KR2RML: An Alternative Interpretation of R2RML for Heterogenous Sources. In COLD (CEUR Workshop Proceedings), Vol. 1426. CEUR-WS.org.
22   Available as a web service at http://apps.islab.ntua.gr/d2rml/