Automated Metadata Generation for Linked Data Generation and Publishing Workflows

Anastasia Dimou, Tom De Nies, Ruben Verborgh, Erik Mannens, Rik Van de Walle
Ghent University – iMinds – Data Science Lab
anastasia.dimou@ugent.be, tom.denies@ugent.be, ruben.verborgh@ugent.be, erik.mannens@ugent.be, rik.vandewalle@ugent.be

ABSTRACT
Provenance and other metadata are essential for determining ownership and trust. Nevertheless, no systematic approaches were introduced so far in the Linked Data publishing workflow to capture them. Defining such metadata has remained independent of the rdf data generation and publishing. In most cases, metadata is manually defined by the data publishers (person-agents), rather than produced by the involved applications (software-agents). Moreover, the generated rdf data and the published rdf data are considered to be one and the same, which is not always the case, leading to incomplete and often misleading metadata. This paper introduces an approach that relies on declarative descriptions of (i) mapping rules, specifying how the rdf data is generated, and of (ii) raw data access interfaces, to automatically and incrementally generate provenance and metadata information. This way, it is assured that the metadata information is accurate, consistent and complete.

1. INTRODUCTION
Nowadays, data owners publish their data at an increasing rate. More and more of them also publish its corresponding rdf representation and interlink it with other data. However, even though provenance and other metadata become increasingly important, most rdf datasets published in the Linked Data cloud provide no metadata, or only narrow metadata. To be more precise, only 37% of the published rdf datasets provide provenance information or any other metadata [22]. In the rare cases that such metadata is available, it is only manually defined by the data publishers (person-agents), rather than produced by the applications (software-agents) involved in the Linked Data publishing cycle. Most of the current solutions which generate and/or publish rdf data do not consider also automatically generating the corresponding metadata information, despite the well-defined and w3c-recommended vocabularies, e.g., prov-o [16] or void [1], that clearly specify the expected metadata output.

As a consequence, the lack of available metadata information neither allows being aware of the origin of the rdf data, nor reproducing the rdf data generation outside the context of the application that originally generated it. This occurs because most of the tools that generate rdf data derived from heterogeneous data put the focus on independently providing the corresponding rdf representation, dissociating the resulting rdf data from its original source. In the same context, provenance and metadata information regarding the actual mapping rules, which specify how the rdf data is generated from raw data, is not captured at all. Nevertheless, such information might equally influence the assessment of the generated rdf data's trustworthiness.

Similarly, data publishing infrastructures, such as triple stores, do not automatically publish any provenance or other metadata regarding the rdf data they host. Instead, they would be expected to enrich the metadata produced while the rdf data was generated with metadata associated with the publishing activity. Moreover, the rdf data generation and its publication are considered as interrelated activities that occur together, although this is not always the case. Therefore, the generated rdf data and the subsequently published data are not always one and the same. For instance, rdf data might be generated in subsets and published all together, or generated as a single dataset but published in different rdf graphs. Consequently, their provenance and remaining metadata information is not identical.

In a nutshell, capturing provenance and metadata information at every step of the Linked Data publishing workflow has not been addressed in a systematic and incremental way so far. In this paper, we introduce an approach that considers declarative and machine-interpretable data descriptions and mapping rules to automatically assert provenance as well as other metadata information. Our proposed solution is indicatively applied on mappings described using the rml language [8] and is implemented in the rml tool chain.

The remainder of the paper is structured as follows: In Section 2, we outline the current state of the art. In Section 3, we discuss the essential steps of the Linked Data publishing cycle where provenance and metadata can be generated, and in Section 4, we discuss the different levels of metadata detail identified. In Section 5, we describe how machine-interpretable mapping rules are considered to automate the metadata generation, and in Section 6 we showcase how we implemented it in the rml tool chain.

Copyright is held by the author/owner(s).
WWW2016 Workshop: Linked Data on the Web (LDOW2016)

2. STATE OF THE ART
In this section, we investigate existing systems involved in the Linked Data publishing workflow. Tools generating mappings and rdf data, or publishing rdf data, are approached with respect to their support for automated metadata generation (Section 2.1). In addition, we outline the w3c-recommended vocabularies for metadata description (Section 2.2), as well as the most well-known and broadly used approaches for representing provenance and other metadata (Section 2.3).

2.1 Linked Data publishing cycle
In the Linked Data publishing workflow, different activities take place. Among them, the definition of the rules to generate rdf data from raw data, its actual generation, its publishing and its interlinking are a few of the most essential steps. However, the majority of the tools developed to address these tasks do not automatically generate any provenance or metadata information as the corresponding tasks are accomplished, let alone enrich metadata defined in prior steps of the Linked Data publishing workflow.

Hartig and Zhao [10] argued regarding the need of integrating provenance information publication in the Linked Data publishing workflow. However, they focused only on its last step, namely the rdf data publication, outlining metadata publication approaches and showcasing them on well-known rdf data publishing tools, such as Pubby (http://wifo5-03.informatik.uni-mannheim.de/pubby/) and Triplify (http://triplify.org/).

None of the well-known systems that generate rdf representations from any type of (semi-)structured data provide any provenance or metadata information in conjunction with the generated rdf data, to the best of our knowledge. This holds, for instance, for DB2triples (https://github.com/antidot/db2triples), Karma (http://usc-isi-i2.github.io/karma/), and xsparql (http://xsparql.deri.org/), to indicatively mention a few of the prevalent tools. The main obstacle, at least with respect to provenance, is that it is hard to specify where the data originally resides. That occurs because most of these tools consider a file as data input. However, where the data of this file is derived from is not known and, therefore, the corresponding provenance annotations cannot be accurately defined in an automated fashion.

The d2r server (http://d2rq.org/) and csv2rdf4lod (https://github.com/timrdf/csv2rdf4lod-automation/wiki) are the only tools that generate provenance and metadata information in conjunction with the rdf data. However, the d2r server refers only to data in relational databases, it supports a custom provenance vocabulary, not the w3c-recommended prov-o [16], and is limited to high-level dataset metadata information. csv2rdf4lod refers only to csv files, and it achieves capturing provenance using custom bash scripts that aim to keep track of the commands used. The situation aggravates in the case of custom solutions for generating rdf data which neglect to include in their development cycle mechanisms to generate provenance and metadata information.

With the advent of mapping languages, such as d2rq (http://d2rq.org/d2rq-language), sml (http://sml.aksw.org/), or the w3c-recommended r2rml [4], the mapping rules that specify how triples are generated from raw data were decoupled from the source code of the corresponding tools that execute them. However, mapping languages are explicitly focused on specifying the mapping rules, neglecting to provide the means to specify the data source too. Whereas, for instance, the d2rq language allows specifying the relational database where the data is derived from, other languages, including r2rml, do not, considering it out of the language's scope. rml [8] is the only language that allows referring to data descriptions based on well-known vocabularies to determine the data source [9] (see Section 5).

The situation remains the same in the case of interlinking tools, such as the prevalent Silk [28] and Limes [19]. Interlinking tools generate rdf data consisting of links between rdf datasets, the so-called linksets. None of the most well-known tools generate any provenance or metadata annotations regarding the links that were identified and represented as the output dataset of the interlinking task.

In the same context, tools were developed to support data owners in semantically annotating their data. However, those tools still generate both the mapping rules and the corresponding rdf data after the rules' execution, without providing any provenance or metadata information. To be more precise, none of the tools that automatically generate mappings of relational databases to their rdf representation, such as BootOx [14], IncMap [21], or Mirror [5], or that support users in defining mapping rules, e.g., the FluidOps editor [23], supports automated provenance and metadata information generation, neither for the mapping rules, nor for the generated rdf data. Specifying metadata for the mapping rules, or considering the mapping rules to determine the provenance and metadata, becomes even more cumbersome in the case of mapping languages whose representation is not in rdf, e.g., sml, sparql or xquery.

Similarly, among the rdf data publishing infrastructures, only Triple Pattern Fragments (tpf, http://linkeddatafragments.org/) [26, 27] provide some metadata information, mainly regarding dataset-level statistics and access. Virtuoso (http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/), 4store (http://4store.org/) and other pioneering publishing infrastructures do not provide out-of-the-box metadata information, e.g., provenance, dataset-level statistics etc., for the rdf data they publish. lodlaundromat (http://lodlaundromat.org/) is the only Linked Data publishing infrastructure that provides automatically generated metadata information. However, it uses its own custom ontology (http://lodlaundromat.org/ontology/) which only partially relies on the prov-o ontology to provide metadata information.

2.2 Provenance and Metadata Vocabularies
w3c-recommended vocabularies were already defined to specify rdf data provenance and metadata information:

2.2.1 PROV Ontology
The prov ontology (prov-o) [16] is recommended by w3c to express the prov Data Model [18] using the owl2 Web Ontology Language (owl2) [13]. prov-o can be used to represent provenance information generated in different systems and under different contexts.
According to the prov ontology, a prov:Entity is a physical, digital, conceptual, or other kind of thing. A prov:Activity occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities. A prov:Agent bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.

2.2.2 VoID Vocabulary
The Vocabulary of Interlinked Datasets (void) [1] is a vocabulary for expressing metadata about rdf datasets, with applications ranging from data discovery to cataloging and archiving of datasets. void expresses (i) general, (ii) access and (iii) structural metadata, as well as links between datasets. General metadata is based on Dublin Core. Access metadata describes how the rdf data can be accessed using different protocols. Structural metadata describes the structure and schema of the rdf data.

According to the void vocabulary, a void:Dataset is a set of rdf triples maintained or aggregated by a single provider. A void:Dataset is a meaningful collection of triples that deal with a certain topic, originate from a certain source or process, and contain a sufficient number of triples that there is benefit in providing a concise summary. The concrete triples contained in a void:Dataset are established through access information, such as the address of a sparql endpoint. Last, a void:Linkset is a collection of rdf links whose subject and object are described in different datasets.

2.2.3 DCAT Vocabulary
The Data Catalog Vocabulary (dcat) [17] is designed to facilitate interoperability between data catalogs published on the Web. It aims to (i) increase data discoverability, (ii) enable applications to easily consume metadata from multiple catalogs, (iii) enable decentralized catalog publishing, and (iv) facilitate federated dataset search.

According to the dcat vocabulary, a dcat:Catalog represents a dataset catalog, a dcat:Dataset represents a dataset in the catalog, whereas a dcat:Distribution represents an accessible form of a dataset, e.g., a downloadable file, an rss feed or a Web service that provides the data. dcat considers as a dataset a collection of data, published or curated by a single agent, and available for access or download in one or more formats. This data is considered for generating an rdf dataset. Thus, the generated rdf dataset forms a dcat:Distribution of a certain dcat:Dataset.

2.3 Approaches for tracing PROV & metadata
We outline methods for capturing provenance and other metadata information. We identify two approaches that capture provenance and other metadata information inline with the rest of the rdf data, namely Explicit Graphs (Section 2.3.3) and Singleton Properties (Section 2.3.2), and two that trace them independently of the rdf data, namely rdf Reification (Section 2.3.1) and Implicit Graphs (Section 2.3.4). In the following subsections, we discuss in more detail the alternative approaches for defining the provenance of the following rdf triple:

ex:item10245 ex:weight "2.4"^^xsd:decimal .

2.3.1 RDF Reification
The rdf framework considers a vocabulary for describing rdf statements and providing additional information. rdf reification is intended for expressing properties such as dates of composition and source information, applied to specific instances of triples. The conventional use involves describing an rdf triple using four statements. A description of a statement is called a reification of the statement. The rdf reification vocabulary consists of the type rdf:Statement, and the properties rdf:subject, rdf:predicate and rdf:object. rdf reification is the w3c-recommended approach for representing provenance and metadata information.

_:ex12345 rdf:type rdf:Statement .
_:ex12345 rdf:subject ex:item10245 .
_:ex12345 rdf:predicate ex:weight .
_:ex12345 rdf:object "2.4"^^xsd:decimal .
_:ex12345 prov:wasDerivedFrom _:src123 .

The major disadvantage of rdf reification is the number of triples required to represent a reified statement. For each generated triple, at least four additional statements are required. So, for an rdf dataset of N triples, the metadata graph will be equal to four times the number of the rdf dataset's triples even in the best case, where only the rdf reification statements are generated and no additional ones.

2.3.2 Singleton Properties
Singleton properties [20] are an alternative approach for representing statements about statements using rdf. This approach relies on the intuition that the nature of every relationship is universally unique and can be a key for any statement using a singleton property. A singleton property represents one specific relationship between two entities under a certain context. It is assigned a uri, as any other property, and can be considered as a subproperty or an instance of a generic property. Singleton properties and their generic property are associated with each other using the singletonPropertyOf property, a subproperty of rdf:type.

ex:item10245 ex:weight#1 "2.4"^^xsd:decimal .
ex:weight#1 sp:singletonPropertyOf ex:weight .
ex:weight#1 prov:wasDerivedFrom _:src123 .

2.3.3 Explicit Graphs
The Explicit Graphs approach relies on named graphs. A Named Graph is a set of rdf triples named by a uri; it can be represented using TriG [3], N-Quads [2] or JSON-LD [24], but it is not compatible with all rdf serialisations. This approach is similar to Singleton Properties: instead of annotating the common predicate of the triples, the context of the triple is annotated. This way, introducing one triple per predicate is avoided. However, the Explicit Graphs approach has two drawbacks: (i) it is not supported by all rdf serializations; and (ii) it might be in conflict with a named graph defined as part of the rdf dataset whose intent is different than tracing provenance information.

ex:item10245 ex:weight "2.4"^^xsd:decimal ex:graph .
ex:graph prov:wasDerivedFrom _:src123 .

2.3.4 Implicit Graphs
Implicit graphs are uris assigned implicitly to a dataset, graph, triple or term. An Implicit Graph is aware of what it represents, but the represented entity is not directly linked to its implicit graph. Implicit graphs might be used to identify a dataset or a graph, but also triples. In the latter case, as Triple Pattern Fragments (tpf) introduced [26, 27], each triple can be found by using the elements of itself; thus, each triple has a uri and, thereby, its implicit graph. For example, the triple x y z for a certain dataset could be identified by the tpf uri http://example.org/dataset?subject=x&predicate=y&object=z, and its provenance asserted against that uri:

<http://example.org/dataset?subject=x&predicate=y&object=z> prov:wasDerivedFrom _:src123 .
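The triple-count overhead of these representation approaches can be compared with a back-of-the-envelope calculation, following the listings above: reification adds the four reification statements plus the annotation per data triple, singleton properties add two triples per data triple, and an explicit named graph needs only the annotations attached to the graph uri. This is an illustrative sketch assuming one provenance statement per annotated unit.

```python
# Rough triple-count overhead of the representation approaches of Section 2.3,
# for a dataset of n triples annotated with one provenance statement each.

def reification_overhead(n):
    # rdf:Statement typing + rdf:subject/predicate/object + the prov triple
    return 5 * n

def singleton_overhead(n):
    # sp:singletonPropertyOf + the prov triple, per data triple
    return 2 * n

def explicit_graph_overhead(n_graphs):
    # one prov triple per (named) graph, regardless of the graph's size
    return n_graphs

print(reification_overhead(1000))    # 5000
print(singleton_overhead(1000))      # 2000
print(explicit_graph_overhead(1))    # 1
```

These numbers motivate the trade-off discussed in Section 4: the finer the level of detail, the more metadata triples are generated relative to the data itself.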
Keep- Data Generator ) or published the data (Fig. 1, Data Pub- ing track of metadata derived from the different steps of lisher ), or even the owner of the data (Fig. 1, Data Owner ). the rdf data generation and publishing workflow, results in Being aware of who defined the mapping rules is of crucial more complete information regarding how an rdf dataset importance to assess the trustworthiness of the final rdf was generated and formed in the end. Moreover, provenance data, even though it is neglected so far. For instance, rdf and metadata information generated at different steps of the data generated using mapping rules from an automated gen- publishing workflow offer complementary information. erator might be considered less trustworthy compared to rdf We identify the following primary steps: mapping defi- data whose mapping rules were defined by a data specialist. nitions generation (Section 3.1), data source retrieval (Sec- tion 3.2), rdf data generation (Section 3.3), rdf data pub- 3.2 Data Sources Retrieval lication (Section 3.4). We consider each workflow step as an An rdf dataset might be derived from one or more het- activity (prov:Activity) whose properties is needed to be erogeneous data sources (Fig. 1, Data Source Acquisition). traced. In Table 1, we summarize those activities and the Each data source, in its own turn, might be derived from information that needs to be defined each time. The prove- an input. For instance, a table might be derived from a nance and how the different steps are associated with each database or some json data might be derived from a Web other are shown at Figure 1. api. Such a data source might be turned into an rdf graph partially or in its entirety. This might mean that not the Same Different entire stored data is retrieved but a selection is only used Dataset Dataset to generate the rdf data. For instance, only the data that Map. Gen. Pub. Gen. Pub. Link. 
fulfils an sql query could be retrieved to generate the rdf prov:Entity dataset, instead of the entire table or database. prov:wasGeneratedBy prov:wasDerivedFrom For this activity, it is important to keep track of metadata prov:wasAttributedTo regarding the data sources and their retrieval, as this indi- prov:Agent # G # G cates the original data sources of the generated rdf data. prov:actedOnBehalfOf However, the originally stored data might have changed over void:Dataset – General time. For instance, in the case of an api, some data is re- dcterms:creator trieved at a certain time, but different data might be re- dcterms:contributor trieved at a subsequent time. Therefore, it is crucial to dcterms:publisher know when the data is accessed to assess its timeliness with dcterms:source the original data. For instance, comparing the last modified dcterms:created date of the original data and the generation date of the rdf dcterms:modified data, indicates whether the available rdf representation is dcterms:issued G # G # dcterms:license # G # G aligned with the current version of the original data or not. void:feature # G # G void:Dataset – Access 3.3 RDF Data Generation void:Dataset – Structural G # G # As soon as the mapping rules and the data source are void:Dataset – Statistics # G # G available, the rdf data is generated (Fig. 1, Generate RDF void:Linkset # G # G Data). For this activity, it is important to keep track of (i) how the rdf data generation was triggered, i.e. data- A filled circle ( ) indicates that the property should be (re-)assessed in each of the marked steps. A half-filled circle (G #) indicates that driven or mapping-driven, from raw data (rdf generation) that property can be assessed in any of the marked steps. or from rdf data (rdf interlinking); (ii) when the rdf dataset was generated, and (iii) how, i.e. in a single dataset Table 1: Table of properties required for each entity or in subgraphs, subsets etc. 
Besides the aforementioned, this activity is crucial for capturing the origin of the rdf data, as only at this step that information is known (in com- 3.1 Mapping Rules Definition bination with the data description and acquisition). Provenance and metadata information is required to be captured when the mapping rules are defined (Fig. 1, Edit 3.4 RDF Data Publication Map Doc). In this case, it is important to track when the The published rdf data is not always identical to the gen- mapping rules were edited or modified and by whom. An erated one (Fig. 1, genRDF Vs. pubRDF ). For instance, it rdf dataset might have been generated using multiple map- might be the result of merging multiple rdf datasets which ping rules whose definition occurred at different moments are generated from different data sources at the same or dif- and by different agents. Consequently, the generation of ferent moments. Moreover, the published rdf dataset might certain mapping rules (Fig. 1, Generate Map Doc) is an ac- be published in a different way compared to how the rdf tivity (prov:Activity) which is informed by all prior editing data was generated. For instance, it could be split in differ- activities (Fig. 1, Edit Map Doc). For instance, a mapping ent graphs to facilitate its consumption. 
This might lead to different metadata for the generation and publication activities, and these metadata sets might have different purposes. For instance, void access metadata is more meaningful, and more feasible to generate, during the rdf data publication, whereas provenance information with respect to the original data can only be defined during the rdf data generation activity. To the contrary, void structural or statistical metadata might be generated both during rdf data generation and publication. However, the generated rdf data is not always identical to the one published. If the generated rdf data differs from the one published, then such metadata should be defined for both cases (see Table 1).

Figure 1: A coordinated view of Linked Data publishing workflow activities. The figure relates the agents (Data Owner, Data Publisher, Data Generator, Mapping Editor) and the activities (Edit Map Doc, Generate Map Doc, Data Source Acquisition, Generate RDF Data, Publish RDF Data) through prov relations such as used, wasGeneratedBy, wasDerivedFrom, hadPrimarySource, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf, wasStartedBy and wasInformedBy, annotating each activity with startedAtTime and endedAtTime, and each result with generatedAtTime.

4. METADATA DETAILS LEVELS
There are different levels of detail for capturing provenance and metadata information. However, in most cases so far, the provenance and metadata information is delivered on dataset level. This mainly occurs because the metadata information is only defined after the rdf data is generated and/or published. However, different applications and data consumption cases require different levels of provenance and metadata information. Overall, the goal is to achieve the best trade-off between level of detail and number of additional triples generated, balancing information overhead against acceptable information loss in an automated metadata generation setting. For instance, considering rdf reification for capturing all provenance and metadata information for each triple means that metadata referring to the entire rdf dataset is captured repeatedly for each individual triple. To the contrary, considering an implicit graph on dataset level results in information loss with respect to the origin of each triple if multiple data sources are used to generate the rdf dataset, because it is not explicitly defined where each triple is derived from.

Automating the provenance and metadata information generation allows exploiting hybrid approaches which can contribute to optimizing this balance. In this section, we outline the different levels of detail for capturing metadata that we identified: dataset level (Section 4.1), named graph level (Section 4.2), partition level (Section 4.3), triple level (Section 4.4) and term level (Section 4.5). For each level, we describe what type of metadata is captured, and we discuss the advantages and disadvantages when used in combination with the different representation approaches.

4.1 Dataset Level
Dataset-level provenance and metadata provide high-level information for the complete rdf dataset. This level of detail is meaningful for all metadata information that refers to the whole dataset, i.e. a void:Dataset, and is the same for each triple. Therefore, among the alternative representation approaches, considering an explicit or implicit graph for the dataset to represent provenance and metadata annotations is sufficient on dataset level, and it requires the least number of additional triples. The other approaches in principle assign the same metadata information to each triple; thus, the exact same information is replicated for each triple, causing unnecessary overhead.

Provenance information on dataset level is sufficient if all triples are derived from the same original data source and are generated at the same time, as a result of a single activity. The same holds if the overall origin source is sufficient to assess the rdf dataset's trustworthiness. On the contrary, if being aware of the exact data source is required, for instance to align the semantically annotated representation with the original data values, more detailed provenance information is desired, because the high-level provenance information is not complete and accurate enough to accomplish the desired task.

4.2 Named Graph Level
An rdf dataset might consist of one or more named graphs. Named-graph-based subsets of an rdf dataset provide conceptual partitions of rdf triples, semantically distinguished in graphs. Named graph level provenance and metadata information refers to all rdf annotations related to a certain named graph, and contains information for each one of the named graphs. Each named graph is a void:Dataset and constitutes a subset of the whole rdf dataset.

In the case of named graphs, it is not possible to represent metadata and provenance information using explicit graphs, because the rdf statements are already quads and the named graph has different semantics than providing metadata information. As in the case of dataset level, implicit graphs for each named graph and for the complete dataset generate the minimum number of additional rdf triples. Moreover, the named graph level metadata information is sufficient if all triples of a certain named graph are derived from the same data source. Otherwise, there is information loss, which can be addressed at a narrower level of detail.
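The dataset-versus-graph trade-off can be sketched as follows: triples are grouped per data source, each group becomes one graph, and a single prov:wasDerivedFrom triple is asserted per graph instead of per triple. This is a minimal illustrative sketch; the `ex:*` names and the `ex:graph-{i}` uri scheme are hypothetical.

```python
# Sketch of graph/partition-level provenance (Sections 4.1-4.3): group triples
# by data source, assert one provenance triple per resulting graph rather than
# one per data triple. All ex:* identifiers are hypothetical.
from collections import defaultdict

def partition_by_source(triples_with_source):
    """triples_with_source: iterable of ((s, p, o), source_uri) pairs."""
    graphs = defaultdict(list)
    for triple, source in triples_with_source:
        graphs[source].append(triple)
    metadata = [(f"ex:graph-{i}", "prov:wasDerivedFrom", source)
                for i, source in enumerate(sorted(graphs))]
    return graphs, metadata

data = [(("ex:item1", "ex:weight", '"2.4"'), "ex:srcA"),
        (("ex:item2", "ex:weight", '"1.1"'), "ex:srcA"),
        (("ex:item3", "ex:height", '"0.5"'), "ex:srcB")]
graphs, metadata = partition_by_source(data)
print(len(metadata))  # 2: one provenance triple per source, not per data triple
```

With many sources the metadata grows with the number of partitions, not with the number of triples, which is exactly the loss/overhead balance this section discusses.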
Otherwise, there is information loss, which can be addressed at a narrower detail level.

4.3 Partitioned Dataset Metadata Level

A dataset might be partitioned based on different aspects. The most frequent partitions are related to (i) the underlying data source, or the triple's (ii) subject, (iii) predicate, or (iv) object. Besides the aforementioned partitions, any other custom partition can equally be considered. A source-based partitioned rdf dataset is an rdf dataset whose subsets are formed with respect to their derivation source; to be more precise, all rdf terms and triples of a subset are derived from the same original data source. Source-based partitions of rdf datasets derived from a single data source are not considered, because they coincide with the actual rdf dataset. A subject-based partition of an rdf dataset is the part of an rdf graph whose triples share the same subject. Consequently, subject-level metadata provides information for all triples which share the same subject. The same applies in the case of predicate-based or object-based partitions.

Partitioned datasets might be treated in the same way as named graphs, but it is also possible to use explicit graphs to define the subsets' metadata. An implicit graph for each subset of the rdf dataset that resembles a partition generates the minimum number of additional triples for the metadata information. In the particular case of a predicate-based partition, representing the provenance and metadata information using singleton properties would generate almost the same number of additional triples as defining an explicit or implicit graph per partition.

4.4 Triple Level

If metadata is captured on triple level, it becomes possible to keep track of the data source each triple was derived from. However, this causes the generation of rdf annotations for metadata whose number of triples is larger than the actual dataset. In the simplest case, the number of additional triples for the metadata information depends on the number of data sources: the more data sources, the more metadata information to be defined. Triple level metadata also becomes meaningful in the case of big or streamed data, where the time one triple was generated might significantly differ compared to the rest of the triples of the rdf dataset. For triple level metadata, singleton properties become meaningful when statements about all triples sharing the same property share the same metadata information, for instance when all triples whose rdf terms are associated using a certain predicate share the same metadata, e.g., they are all derived from the same data source.

4.5 RDF Term Level

Even rdf terms that are part of the same rdf triple can derive from different data sources. For instance, an rdf term is generated considering some data value derived from a source A. This rdf term might constitute the subject of an rdf triple whose object, though, is an rdf term derived from a source B. In this case, even more detailed metadata information is required to keep track of the provenance information. Among the alternative approaches for representing metadata, rdf reification becomes meaningful at this level of detail. To be more precise, rdf reification is meaningful in the cases that the rdf terms which constitute an rdf triple and form a statement derive from different data sources and/or are generated at a different time.
5. METADATA GENERATION WITH RML

We introduce an approach that takes into consideration machine interpretable descriptions of the data sources and mapping rules which are used to generate rdf datasets, to also automatically generate the corresponding provenance and metadata information. Our approach relies on asserting statements from declarative descriptions of data sources and mapping rules. This allows our proposed approach to be applied to alternative mapping languages and to be replicated in different implementations.

In our exemplary case, machine interpretable mapping rules are defined using the rdf Mapping Language (rml) [8]. rml is considered because it is the only language that allows uniformly defining the mapping rules over heterogeneous data sources. Moreover, rml is aligned with machine interpretable data source descriptions defined using different vocabularies, e.g., dcat [17], csvw [25], Hydra [15], etc. [9].

5.1 RML Mapping Definitions

Mapping rules are defined using the rdf Mapping Language (rml). rml [8] extends the w3c recommended r2rml mapping language [4], which was defined for specifying mappings of data in relational databases to the rdf data model. rml also covers mappings from data sources in different (semi-)structured formats, such as csv, xml, and json. rml documents contain rules defining how the input data can be represented in rdf. An rml document (see Listing 1) contains one or more Triples Maps (lines 5 and 13). A Triples Map defines how triples are generated and consists of three main parts: the Logical Source, the Subject Map, and zero or more Predicate-Object Maps. The Subject Map (lines 6 and 14) defines how unique identifiers (uris) are generated for the resources and is used as the subject of all rdf triples generated from this Triples Map. A Predicate-Object Map (lines 7 and 15) consists of Predicate Maps, which define the rule that generates the triple's predicate (lines 9, 17 and 19), and Object Maps (lines 18 and 20) or Referencing Object Maps (line 10), which define how the triple's object is generated.

     1  @prefix rr: <http://www.w3.org/ns/r2rml#>.
     2  @prefix rml: <http://semweb.mmlab.be/ns/rml#>.
     3  @prefix foaf: <http://xmlns.com/foaf/0.1/>.
     4
     5  <#PersonMap> rml:logicalSource <#DCAT_LogicalSource> ;
     6    rr:subjectMap <#PersonSubjectMap>;
     7    rr:predicateObjectMap <#AccountPreObjMap>.
     8  <#PersonSubjectMap> rr:template "http://ex.com/{ID}".
     9  <#AccountPreObjMap> rr:predicate foaf:account;
    10    rr:objectMap <#TwitterRefObjMap>.
    11  <#TwitterRefObjMap> rr:parentTriplesMap <#TwitterAccountMap>.
    12
    13  <#TwitterAccountMap> rml:logicalSource <#DB_LogicalSource>;
    14    rr:subjectMap <#TwitterSubMap>;
    15    rr:predicateObjectMap <#AccountPreObjMap>, <#HomepagePreObjMap>.
    16  <#TwitterSubMap> rr:template "http://ex.com/{account_ID}".
    17  <#AccountPreObjMap> rr:predicate foaf:accountName;
    18    rr:objectMap [ rml:reference "name"].
    19  <#HomepagePreObjMap> rr:predicate foaf:accountServiceHomepage;
    20    rr:objectMap <#HomepageObjMap>.
    21  <#HomepageObjMap> rml:reference "resource".

Listing 1: RML Mapping Rules

5.2 Mapping Document Metadata

A mapping document summarizes mapping rules defined using the rml language. rml is serialized in rdf, thus a mapping document (<#MapDoc>) can be considered an rdf dataset itself (void:Dataset). Therefore, it has its own metadata, as any other rdf data can have. To be more precise, a mapping document is a prov:Entity that can be associated with a prov:Agent, either a human agent or a software agent. The mapping document is the result of a prov:Activity, which is informed, in its turn, by different editing activities.

    @prefix dcterms: <http://purl.org/dc/terms/>.
    @prefix prov: <http://www.w3.org/ns/prov#>.
    @prefix void: <http://rdfs.org/ns/void#>.

    <#MapDoc> a prov:Entity, void:Dataset;
      prov:generatedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime;
      prov:wasGeneratedBy <#MapDoc_Generation>;
      prov:wasAssociatedWith <#RMLEditor>;
      prov:wasAttributedTo ;
      dcterms:creator ;
      dcterms:created "2016-01-05T17:10:00Z"^^xsd:dateTime;
      dcterms:modified "2016-01-05T17:15:00Z"^^xsd:dateTime;
      dcterms:issued "2016-01-07T10:10:00Z"^^xsd:dateTime.

    <#MapDoc_Editing> a prov:Activity;
      prov:startedAtTime "2016-01-05T17:00:00Z"^^xsd:dateTime;
      prov:endedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime .

    <#MapDoc_Generation> a prov:Activity;
      prov:generated <#MapDoc>;
      prov:startedAtTime "2016-01-05T17:09:00Z"^^xsd:dateTime;
      prov:endedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime;
      prov:wasInformedBy <#MapDoc_Editing>.

    <#RMLEditor> a prov:Agent;
      prov:type prov:SoftwareAgent.

     a prov:Agent;
      prov:type prov:Person;
      prov:actedOnBehalfOf <#DataOwner>.

Listing 2: Mapping Metadata Description
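Statements such as those of Listing 2 lend themselves to automatic assertion: a software agent only needs the recorded timestamps. A minimal Python sketch of such serialization (the helper name and the way the resources are passed in are illustrative, not the rmleditor's actual interface):

```python
# Sketch: a software agent asserting PROV-O statements for a mapping
# document from its recorded editing/generation timestamps.
# Helper and resource names are illustrative, not the rmleditor's API.
from datetime import datetime, timezone

def xsd_datetime(dt):
    """Format an aware datetime as an xsd:dateTime Turtle literal."""
    return '"' + dt.strftime("%Y-%m-%dT%H:%M:%SZ") + '"^^xsd:dateTime'

def describe_generation(entity, activity, agent, started, ended):
    """Serialize one generation activity as Turtle-like statements."""
    return [
        f"{entity} a prov:Entity ;",
        f"  prov:generatedAtTime {xsd_datetime(ended)} ;",
        f"  prov:wasGeneratedBy {activity} ;",
        f"  prov:wasAssociatedWith {agent} .",
        f"{activity} a prov:Activity ;",
        f"  prov:generated {entity} ;",
        f"  prov:startedAtTime {xsd_datetime(started)} ;",
        f"  prov:endedAtTime {xsd_datetime(ended)} .",
    ]

statements = describe_generation(
    "<#MapDoc>", "<#MapDoc_Generation>", "<#RMLEditor>",
    datetime(2016, 1, 5, 17, 9, tzinfo=timezone.utc),
    datetime(2016, 1, 5, 17, 10, tzinfo=timezone.utc))
print("\n".join(statements))
```

Because the timestamps are captured by the agent rather than typed in by a person, the resulting metadata stays consistent with the actual editing and generation activities.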
Besides the metadata regarding the entire mapping document (<#MapDoc>), similar metadata might be defined on Triples Map level, or regarding any of the Term Maps, especially in case different parts of the mapping document (subsets of <#MapDoc>) were defined by different agents or at different times. For instance, the metadata information of different Triples Maps might be as follows:

    @prefix dcterms: <http://purl.org/dc/terms/>.
    @prefix prov: <http://www.w3.org/ns/prov#>.
    @prefix void: <http://rdfs.org/ns/void#>.

    <#MapDoc> void:subset <#PersonMap>, <#TwitterAccountMap>.

    <#PersonMap> a prov:Entity, void:Dataset;
      prov:wasGeneratedBy <#PersonMap_Generation>;
      prov:wasAssociatedWith <#RMLEditor>;
      prov:wasAttributedTo .

    <#TwitterAccountMap> a prov:Entity, void:Dataset;
      prov:wasGeneratedBy <#TwitterAccountMap_Generation>;
      prov:wasAssociatedWith <#RMLEditor>;
      prov:wasAttributedTo .

    <#PersonMap_Generation> a prov:Activity;
      prov:generated <#PersonMap>;
      prov:wasInformedBy <#PersonMap_Editing>.

    <#TwitterAccountMap_Generation> a prov:Activity;
      prov:generated <#TwitterAccountMap>;
      prov:wasInformedBy <#TwitterAccountMap_Editing>.

    <#PersonMap_Editing> a prov:Activity.
    <#TwitterAccountMap_Editing> a prov:Activity.

Listing 3: Triples Map Metadata Description

5.3 Data Sources Retrieval Metadata

One or more data sources might be considered for generating an rdf dataset. In our exemplary case, one data source is described by the <#DB_LogicalSource>, and the underlying database that contains the data is described by the <#DB_Source> using the d2rq vocabulary. Its description:

    @prefix rml: <http://semweb.mmlab.be/ns/rml#>.
    @prefix dcat: <http://www.w3.org/ns/dcat#>.
    @prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>.

    <#DB_LogicalSource> rml:logicalSource [
      rml:query """SELECT * FROM DEPT WHERE ...""" ;
      rml:source <#DB_Source> ].

    <#DB_Source> a d2rq:Database;
      d2rq:jdbcDSN "jdbc:mysql://localhost/example";
      d2rq:jdbcDriver "com.mysql.jdbc.Driver";
      d2rq:username "user";
      d2rq:password "password".

Listing 4: Database Source description

Similarly, a data source might be a dcat:Dataset, one of whose distributions is considered for generating the rdf dataset. Directly downloadable distributions contain a dcat:downloadURL reference. For instance:

    @prefix rml: <http://semweb.mmlab.be/ns/rml#>.
    @prefix dcat: <http://www.w3.org/ns/dcat#>.

    <#DCAT_LogicalSource> rml:source <#DCAT_Source>;
      rml:referenceFormulation ql:XPath;
      rml:iterator "...".

    <#DCAT_Source> a dcat:Dataset;
      dcat:distribution <#XML_Distribution> .

    <#XML_Distribution> a dcat:Distribution;
      dcat:downloadURL .

Listing 5: DCAT source description

The data source retrieval can be considered a prov:Activity attributed to a prov:Agent. Such a prov:Agent can be the data owner or an agent acting on their behalf, i.e., the data generator. The data source constitutes a prov:Entity which was derived from the data acquisition activity. The original data source description already provides some information regarding the data source; additional provenance information is added for further clarity. The metadata for a data source, e.g., the <#DB_LogicalSource> and the <#DCAT_LogicalSource>, are described as follows:

    @prefix prov: <http://www.w3.org/ns/prov#>.

    <#DB_LogicalSource> a prov:Entity;
      prov:wasDerivedFrom <#DB_Source> ;
      prov:generatedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime .

    <#DB_Retrieval> a prov:Activity;
      prov:generated <#DB_LogicalSource> ;
      prov:used <#DB_Source>;
      prov:startedAtTime "2016-01-05T17:00:00Z"^^xsd:dateTime ;
      prov:endedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime .

    <#DCAT_LogicalSource> a prov:Entity;
      prov:generatedAtTime "2016-01-05T17:05:00Z"^^xsd:dateTime .

    <#DCATsource_Retrieval> a prov:Activity;
      prov:generated <#DCAT_LogicalSource>;
      prov:used <#DCAT_Source>.

Listing 6: Data Sources Metadata Description
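Retrieval metadata of this kind can be captured automatically by wrapping the data source access. The sketch below (the fetch function is a stub and all identifiers are illustrative) records a prov:Activity around a retrieval, in the spirit of Listing 6:

```python
# Sketch: recording retrieval provenance around a data source access.
# The fetch function is a stub; identifiers are illustrative.
from datetime import datetime, timezone

def retrieve_with_provenance(source, logical_source, activity, fetch):
    """Run fetch(source) and return (data, provenance triples)."""
    started = datetime.now(timezone.utc)
    data = fetch(source)  # the actual retrieval happens here
    ended = datetime.now(timezone.utc)
    fmt = lambda dt: '"' + dt.strftime("%Y-%m-%dT%H:%M:%SZ") + '"^^xsd:dateTime'
    provenance = [
        (logical_source, "a", "prov:Entity"),
        (logical_source, "prov:wasDerivedFrom", source),
        (activity, "a", "prov:Activity"),
        (activity, "prov:generated", logical_source),
        (activity, "prov:used", source),
        (activity, "prov:startedAtTime", fmt(started)),
        (activity, "prov:endedAtTime", fmt(ended)),
    ]
    return data, provenance

data, prov = retrieve_with_provenance(
    "<#DB_Source>", "<#DB_LogicalSource>", "<#DB_Retrieval>",
    fetch=lambda src: ["row1", "row2"])  # stub instead of a real query
print(len(data), len(prov))
```

Since the start and end times are taken around the actual access, the asserted activity description cannot drift from what really happened, which is the point of generating such metadata incrementally rather than writing it by hand afterwards.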
5.4 RDF Dataset Generation Metadata

Considering the aforementioned mapping document and the data source descriptions, rdf triples are generated.

Approaches for Tracing Metadata and RML

Among the different approaches for capturing provenance and metadata information, rml can best be aligned with implicit graphs and rdf reification. Those two approaches generate metadata information independently of the generated rdf data. In particular, explicit graphs are not considered, as they might coincide with the named graphs, if any are explicitly defined for the actual rdf dataset.

Metadata Details Levels and RML

Dataset level metadata information is associated with all triples generated considering all mapping rules in a mapping document. Named graph level metadata information is associated with triples which are generated considering Term Maps related to the corresponding Graph Map. As far as partitioned rdf datasets are concerned, each partition is associated with different parts of one or more Triples Maps. To be more precise, source-level metadata information is generated for all triples which are derived from Triples Maps that share the same Logical Source. In the same context, subject-level metadata information is generated for each unique instantiation of one (or more) of the Subject Maps that appear in the mapping document. Similarly, predicate-level metadata information is generated for each unique predicate which appears in one or more Triples Maps. Last, object-level metadata annotations are generated for each unique object which is generated due to an Object Map. Whereas the aforementioned levels consider implicit graphs to represent provenance and metadata information, triple and rdf term level metadata information can only be captured considering reification statements.

Dataset Level Metadata

Dataset level provenance and metadata information is as follows for the aforementioned running example:

    @prefix dcterms: <http://purl.org/dc/terms/>.
    @prefix prov: <http://www.w3.org/ns/prov#>.

    <#RDF_Dataset> a prov:Entity, void:Dataset;
      prov:generatedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime;
      prov:wasGeneratedBy <#RDFdataset_Generation>;
      prov:wasDerivedFrom <#DB_LogicalSource>,<#DCAT_LogicalSource>;
      prov:wasAssociatedWith <#RMLProcessor>;
      prov:wasAttributedTo ;
      dcterms:creator ;
      dcterms:created "2016-01-05T17:10:00Z"^^xsd:dateTime;
      dcterms:modified "2016-01-05T17:12:00Z"^^xsd:dateTime;
      dcterms:issued "2016-01-07T10:10:00Z"^^xsd:dateTime.

    <#RDFdataset_Generation> a prov:Activity;
      prov:generated <#RDF_Dataset>;
      prov:startedAtTime "2016-01-05T17:00:00Z"^^xsd:dateTime;
      prov:endedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime;
      prov:wasInformedBy <#MapDoc_Generation>;
      prov:used <#MapDoc>,<#DB_LogicalSource>,<#DCAT_LogicalSource>.

    <#RMLProcessor> a prov:Agent;
      prov:type prov:SoftwareAgent.

Listing 7: RDF Dataset Level Metadata

The rdf data generation might be triggered either by a mapping document (mapping-driven approach) or by a data source (data-driven approach) [11]. Depending on which approach occurs in a certain case, the rdf data generation activity (<#RDFdataset_Generation>) is informed by the mapping document generation activity (<#MapDoc_Generation>) or by the data source retrieval activity (e.g., the <#DB_Retrieval> or the <#DCATsource_Retrieval>). Specifying the data source from which the rdf dataset was derived becomes easy and can be automatically asserted thanks to the aligned mapping and data source descriptions. The rml mapping rules declaratively define the data sources used, in contrast to other mapping languages, which do not explicitly define the data sources considered for fulfilling the mapping activity. However, as one can observe, it is defined that the rdf dataset was derived from an extract of data from a database and an xml file published on the Web, but it is not explicitly defined which triples are derived from each data source.

Triple Level Metadata

In order to address the aforementioned ambiguity regarding the rdf triples' origin, rml metadata generation might be defined based on the Predicate-Object Maps, for instance the <#AccountPreObjMap> and the <#HomepagePreObjMap>. The metadata information of the generated rdf triples follows:

    @prefix prov: <http://www.w3.org/ns/prov#>.
    @prefix void: <http://rdfs.org/ns/void#>.

    <#RDF_Dataset> void:subset <#AccountRDF>, <#HomepageRDF>.

    <#AccountRDF> a prov:Entity, void:Dataset;
      prov:wasGeneratedBy <#RDFdataset_Generation>;
      prov:wasDerivedFrom <#DB_LogicalSource>,<#DCAT_LogicalSource>;
      prov:wasAssociatedWith <#RMLProcessor>;
      prov:wasAttributedTo .

    <#HomepageRDF> a prov:Entity, void:Dataset;
      prov:wasGeneratedBy <#RDFdataset_Generation>;
      prov:wasDerivedFrom <#DB_LogicalSource>;
      prov:wasAssociatedWith <#RMLProcessor>;
      prov:wasAttributedTo .

Listing 8: RDF Triple Level Metadata

Even though it is easy to observe that this resolves the ambiguity issue regarding the provenance in the case of triples generated considering the <#HomepagePreObjMap>, it is not the same in the case of triples generated considering the <#AccountPreObjMap>. In the latter case, the rdf triples are formed generating the subject and the object from different data sources. To be more precise, the triple's subject is generated considering a value derived from the <#DCAT_LogicalSource>, whereas the triple's object is derived from the <#DB_LogicalSource>. If even more detailed provenance is required, the rdf term level should be preferred.

RDF Term Level Metadata

The rdf term level is the narrowest details level for metadata information. It is applicable in the cases of Referencing Object Maps, namely when the subject and the object of an rdf triple are derived from different data sources. rdf reification is the only approach for representing this metadata information. Considering the aforementioned running example, the metadata information of the rdf triples generated from the <#AccountPreObjMap> Predicate-Object Map is defined as follows:

    _:ex12345 rdf:type rdf:Statement .
    _:ex12345 rdf:subject ex:item10245 .
    _:ex12345 rdf:predicate foaf:account .
    _:ex12345 rdf:object .

    ex:item10245 prov:wasDerivedFrom <#DCAT_LogicalSource>.
     prov:wasDerivedFrom .

Listing 9: RDF Term Level Metadata
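A generator can emit such term-level annotations mechanically. The following Python sketch (all identifiers are illustrative) produces the reification statement plus a per-term derivation triple for one generated triple:

```python
# Sketch: term-level provenance via RDF reification, when a triple's
# subject and object derive from different logical sources.
# All identifiers are illustrative.

def annotate_term_level(stmt_id, s, p, o, subject_source, object_source):
    """Reify the triple and attach a derivation source per RDF term."""
    return [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
        (s, "prov:wasDerivedFrom", subject_source),
        (o, "prov:wasDerivedFrom", object_source),
    ]

annotations = annotate_term_level(
    "_:ex12345", "ex:item10245", "foaf:account", "ex:account10245",
    "<#DCAT_LogicalSource>", "<#DB_LogicalSource>")
for triple in annotations:
    print(triple)
print(len(annotations))  # 6
```

Six annotation triples per data triple is the price of this narrowest level, which is why it is only worth applying where subject and object genuinely come from different sources.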
DCAT Catalogue Enrichment

In case the original data is published on the Web as part of a catalogue described with the dcat vocabulary, complementary metadata information can be generated to enrich it. If one of the dcat:Dataset distributions (dcat:Distribution) is considered to generate the corresponding rdf representation, complementary dcat metadata might be generated as well, to specify that the generated rdf dataset is another distribution of a certain dcat:Dataset published on the dcat:Catalog. For instance, in the case of the running example, the <#DCAT_RDF> and the <#DB_RDF> are source-level partitions of the rdf dataset. The rdf triples in the <#DCAT_RDF> partition form an rdf distribution of the <#XML_Distribution> of the <#DCAT_Source>. The <#DCAT_RDF> metadata information and the enriched <#DCAT_Source> are as follows:

    @prefix dcat: <http://www.w3.org/ns/dcat#>.
    @prefix prov: <http://www.w3.org/ns/prov#>.
    @prefix void: <http://rdfs.org/ns/void#>.

    <#RDF_Dataset> void:subset <#DCAT_RDF>, <#DB_RDF>.

    <#DCAT_RDF> a prov:Entity, void:Dataset;
      prov:wasGeneratedBy <#RDFdataset_Generation>;
      prov:wasDerivedFrom <#DCAT_LogicalSource>;
      prov:wasAssociatedWith <#RMLProcessor>;
      prov:wasAttributedTo .

    <#DCAT_Source> dcat:distribution <#DCAT_RDF>.

Listing 10: DCAT Metadata Enrichment

6. METADATA & THE RML TOOL CHAIN

We implemented the aforementioned approach in the rml tool chain, namely the rmleditor (http://rml.io/RMLeditor.html) and the rmlprocessor (http://github.com/RMLio/RML-Mapper). The rml tool chain was configured to support implicit graphs and rdf reification. The supported metadata can be further extended to take into consideration other metadata vocabularies too and to generate corresponding metadata information. In more detail:

RML Editor

The rmleditor [12] was extended to generate metadata regarding the editing and generation of the mapping rules, as they are declaratively represented using the rml language. The rmleditor keeps track of the mapping document editing and generation activities, when they occurred and by whom. The rmleditor was extended to support implicit graphs for defining the metadata information which is related to the rml mapping document or its subsets.

RML Processor

The rmlprocessor was extended with a Metadata Module that automatically generates metadata for the generated rdf dataset. The desired metadata to be generated by the rmlprocessor can be configured by an agent. The agent who triggers the mapping activity can define the vocabulary to be used, as well as the desired details level. By default, the w3c recommended prov [16], void [1] and dcat [17] vocabularies are supported. However, the rmlprocessor can be further extended to support other vocabularies and to generate more metadata information.

The rmlprocessor has also been extended to automatically generate corresponding metadata regarding the rdf dataset generation, considering implicit graphs or rdf reification. To be more precise, the rmlprocessor was configured to generate metadata information using implicit graphs for the metadata related to the whole dataset, as well as for named graphs, and to generate rdf reification triples if the metadata level is set to triple or rdf term level. Explicit graphs and singleton properties were not considered, because they need to be defined inline with the actual rdf data, together with the mapping rules.

7. CONCLUSIONS AND FUTURE WORK

The proposed approach aims to show how metadata of the fundamental activities for the generation and publication of rdf triples can be automatically generated. Our solution covers the rdf dataset generation, including metadata for the mapping rules definition and the data descriptions. Based on the provided metadata information, it is expected that publishing infrastructures will enrich this information with complementary details regarding the rdf dataset publication activity. Moreover, it is expected that rdf publication infrastructures will re-determine certain properties of the metadata information, if the rdf dataset is re-formed before it gets published. Furthermore, the metadata can be enriched with additional information derived from other activities involved in the Linked Data publishing workflow.

Provenance and metadata information can be multidimensional, and its consumption diverges across different systems. Different applications require different levels of metadata information to fulfil their tasks, whereas diverse metadata information might be desired. The presented workflow was focused on the essential parts of the Linked Data publishing workflow. However, any metadata information might be considered as it accrues from other activities involved in the Linked Data publishing workflow. In the future, we consider including metadata information regarding the results of the rdf validation [7, 6], applied both on the mapping document as well as on the generated rdf data.

Different aspects of the rdf data generation and publishing might influence its quality and trustworthiness assessment.
In most of the cases so far, the provenance and metadata information are manually delivered on dataset level. Automating the provenance and metadata generation, relying on machine interpretable descriptions of the different workflow steps, allows generating metadata in a systematic way. The generated provenance and metadata information becomes more accurate, consistent and complete. The metadata generation for certain rdf data is an incremental procedure that relies on the contribution of different activities in the Linked Data publishing workflow to enrich the information we have for the generated rdf dataset.

8. ACKNOWLEDGMENTS

The described research activities were funded by Ghent University, iMinds, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research Flanders (FWO Flanders), and the European Union.

9. REFERENCES

[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing Linked Datasets with the VoID Vocabulary. W3C Interest Group Note, Mar. 2011. http://www.w3.org/TR/void/.
[2] G. Carothers. RDF 1.1 N-Quads. Working Group Recommendation, W3C, Feb. 2014. https://www.w3.org/TR/n-quads/.
[3] G. Carothers and A. Seaborne. RDF 1.1 TriG. Working Group Recommendation, W3C, Feb. 2014. https://www.w3.org/TR/trig/.
[4] S. Das, S. Sundara, and R. Cyganiak. R2RML: RDB to RDF Mapping Language. Working Group Recommendation, W3C, Sept. 2012. http://www.w3.org/TR/r2rml/.
[5] L. de Medeiros, F. Priyatna, and O. Corcho. MIRROR: Automatic R2RML Mapping Generation from Relational Databases. In Engineering the Web in the Big Data Era. 2015.
[6] T. De Nies, A. Dimou, R. Verborgh, E. Mannens, and R. Van de Walle. Enabling dataset trustworthiness by exposing the provenance of mapping quality assessment and refinement. In Proceedings of the 4th International Workshop on Methods for Establishing Trust of (Open) Data, 2015.
[7] A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of the 14th ISWC, 2015.
[8] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Workshop on Linked Data on the Web, 2014.
[9] A. Dimou, R. Verborgh, M. Vander Sande, E. Mannens, and R. Van de Walle. Machine-Interpretable Dataset and Service Descriptions for Heterogeneous Data Access and Retrieval. In SEMANTiCS 2015, 2015.
[10] O. Hartig and J. Zhao. Provenance and Annotation of Data and Processes: Third International Provenance and Annotation Workshop, IPAW 2010, chapter Publishing and Consuming Provenance Metadata on the Web of Linked Data. 2010.
[11] P. Heyvaert, A. Dimou, R. Verborgh, E. Mannens, and R. Van de Walle. Approaches for Generating Mappings to RDF. In Proceedings of the 14th ISWC: Posters and Demos, 2015.
[12] P. Heyvaert, A. Dimou, R. Verborgh, E. Mannens, and R. Van de Walle. Towards a Uniform User Interface for Editing Mapping Definitions. In Workshop on Intelligent Exploration of Semantic Data, 2015.
[13] P. Hitzler, M. Krötzsch, B. Parsia, P. F. Patel-Schneider, and S. Rudolph. OWL 2 Web Ontology Language. W3C Recom., Dec. 2012. http://www.w3.org/TR/owl2-primer/.
[14] E. Jiménez-Ruiz, E. Kharlamov, D. Zheleznyakov, I. Horrocks, C. Pinkel, M. Skjæveland, E. Thorstensen, and J. Mora. BootOX: Practical Mapping of RDBs to OWL 2. In The Semantic Web - ISWC 2015. 2015.
[15] M. Lanthaler. Hydra Core Vocabulary. Unofficial Draft, June 2014. http://www.hydra-cg.com/spec/latest/core/.
[16] T. Lebo, S. Sahoo, and D. McGuinness. PROV-O: The PROV Ontology. Working Group Recommendation, W3C, Apr. 2013. http://www.w3.org/TR/prov-o/.
[17] F. Maali and J. Erickson. Data Catalog Vocabulary (DCAT). W3C Recommendation, Jan. 2014. http://www.w3.org/TR/vocab-dcat/.
[18] L. Moreau and P. Missier. PROV-DM: The PROV Data Model. Working Group Recommendation, W3C, Apr. 2013. http://www.w3.org/TR/prov-dm/.
[19] A.-C. Ngonga Ngomo and S. Auer. Limes: A Time-efficient Approach for Large-scale Link Discovery on the Web of Data. 2011.
[20] V. Nguyen, O. Bodenreider, and A. Sheth. Don't Like RDF Reification?: Making Statements About Statements Using Singleton Property. In Proceedings of the 23rd International Conference on World Wide Web, 2014.
[21] C. Pinkel, C. Binnig, E. Kharlamov, and P. Haase. IncMap: Pay As You Go Matching of Relational Schemata to OWL Ontologies. In Proceedings of the 8th International Conference on Ontology Matching, pages 37–48, 2013.
[22] M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the Linked Data Best Practices in Different Topical Domains. In ISWC 2014. 2014.
[23] K. Sengupta, P. Haase, M. Schmidt, and P. Hitzler. Editing R2RML mappings made easy. 2013.
[24] M. Sporny, D. Longley, G. Kellogg, M. Lanthaler, and N. Lindström. JSON-LD. Working Group Recommendation, W3C, Jan. 2014. https://www.w3.org/TR/json-ld/.
[25] J. Tennison, G. Kellogg, and I. Herman. Model for Tabular Data and Metadata on the Web. W3C Working Draft, Apr. 2015. http://www.w3.org/TR/2015/WD-tabular-data-model-20150416/.
[26] R. Verborgh, O. Hartig, B. De Meester, G. Haesendonck, L. De Vocht, M. Vander Sande, R. Cyganiak, P. Colpaert, E. Mannens, and R. Van de Walle. Querying datasets on the Web with high availability. In Proceedings of the 13th ISWC, 2014.
[27] R. Verborgh, M. Vander Sande, P. Colpaert, S. Coppens, E. Mannens, and R. Van de Walle. Web-scale querying through Linked Data Fragments. In Proceedings of the 7th Workshop on Linked Data on the Web, 2014.
[28] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk – A Link Discovery Framework for the Web of Data. In Workshop on Linked Data on the Web, 2009.