Automated Metadata Generation for
             Linked Data Generation and Publishing Workflows

               Anastasia Dimou                            Tom De Nies                          Ruben Verborgh
           anastasia.dimou@ugent.be                     tom.denies@ugent.be                 ruben.verborgh@ugent.be

                                   Erik Mannens                    Rik Van de Walle
                              erik.mannens@ugent.be               rik.vandewalle@ugent.be

                                            Ghent University – iMinds – Data Science Lab

ABSTRACT                                                            mation neither allow being aware of the origin of the rdf
Provenance and other metadata are essential for determin-           data, nor reproducing the rdf data generation outside the
ing ownership and trust. Nevertheless, no systematic ap-            context of the application that originally generated it. This
proaches were introduced so far in the Linked Data pub-             occurs because most of the tools that generate rdf data
lishing workflow to capture them. Defining such metadata            derived from heterogeneous data, put the focus on indepen-
remained independent of the rdf data generation and pub-            dently providing the corresponding rdf representation, dis-
lishing. In most cases, metadata is manually defined by the         sociating the resulting rdf data from its original source.
data publishers (person-agents), rather than produced by            In the same context, provenance and metadata information
the involved applications (software-agents). Moreover, the          regarding the actual mapping rules which specify how the
generated rdf data and the published one are considered to          rdf data is generated from raw data, are not captured at
be one and the same, which is not always the case, leading to       all. Nevertheless, such information might equally influence
pure, condense and often seductive information. This paper          the assessment of the generated rdf data trustworthiness.
introduces an approach that relies on declarative descrip-             Similarly, data publishing infrastructures, such as triple
tions of (i) mapping rules, specifying how the rdf data is          stores, do not automatically publish any provenance or other
generated, and of (ii) raw data access interfaces to automat-       metadata regarding the rdf data they host. Instead they
ically and incrementally generate provenance and metadata           would have been expected to enrich the metadata produced
information. This way, it is assured that the metadata in-          while the rdf data was generated with metadata associated
formation is accurate, consistent and complete.                     with the publishing activity. Moreover, the rdf data gener-
                                                                    ation and its publication are considered as interrelated ac-
                                                                    tivities that occur together. Although, this is not always the
1.     INTRODUCTION                                                 case. Therefore, the generated rdf data and the one sub-
   Nowadays, data owners publish their data at an increasing        sequently published are not always one and the same. For
rate. More and more of them publish also its correspond-            instance, rdf data might be generated in subsets and pub-
ing rdf representation and interlink it with other data.            lished all together, or generated as a single dataset but pub-
However, even though provenance and other metadata be-              lished in different rdf graphs. Consequently, their prove-
come increasingly important, most rdf datasets published            nance and rest metadata information is not identical.
in the Linked Data cloud provide no or seldom narrow meta-             In a nutshell, capturing provenance and metadata infor-
data. To be more precise, only 37% of the published rdf             mation on every step of the Linked Data publishing work-
dataset provide provenance information or any other meta-           flow is not addressed in a systematic and incremental way so
data [22]. In these rare cases that such metadata is available,     far. In this paper, we introduce an approach that considers
it is only manually defined by the data publishers (person-         declarative and machine-interpretable data descriptions and
agents), rather than produced by the applications (software-        mapping rules to automatically assert provenance as well
agents) involved in the Linked Data publishing cycle. Most          as other metadata information. Our proposed solution is
of the current solutions which generate and/or publish rdf          indicatively applied on mappings described using the rml
data, do not consider also automatically generating the cor-        language [8] and is implemented in the rml tool chain.
responding metadata information, despite the well-defined              The remainder of the paper is structured as follows: In Sec-
and w3c recommended vocabularies, e.g., prov-o [16] or              tion 2, we outline the current state of the art. In Section 3,
void [1], that clearly specify the expected metadata output.        we discuss the essential steps of the Linked Data publish-
   As a consequence, the lack of available metadata infor-          ing cycle where provenance and metadata can be generated
                                                                    and in Section 4, we discuss the different levels of metadata
                                                                    details identified. In Section 5, we describe how machine-
                                                                    interpretable mapping rules are considered to automate the
Copyright is held by the author/owner(s).
                                                                    metadata generation and in Section 6 we showcase how we
WWW2016 Workshop: Linked Data on the Web (LDOW2016)
                                                                    implemented it in the rml tool chain.
2.    STATE OF THE ART                                            were decoupled from the source code of the corresponding
  In this section, we investigate existing systems, involved      tools that execute them. However, mapping languages are
in the Linked Data publishing workflow. Tools generating          explicitly focused on specifying the mapping rules, neglect-
mappings and rdf data or publish rdf data are approached          ing to provide the means to specify the data source too.
with respect to their support for automated metadata gener-       Whereas, for instance the d2rq language allows to specify
ation (Section 2.1). In addition, we outline the w3c recom-       the relational database where the data is derived from, other
mended vocabularies for metadata description (Section 2.2),       languages, including r2rml, do not, considering it out of the
as well as the most well-known and broadly used approach          language’s scope. rml [8] is the only language that allows
for representing provenance and other metadata (Section 2.3).     referring to data descriptions based on well-known vocabu-
                                                                  laries to determine the data source [9] (see Section 5).
2.1    Linked Data publishing cycle                                  The situation remains the same also in the case of inter-
   In the Linked Data publishing workflow there are different     linking tools, such as the prevalent Silk [28] and Limes [19].
activities taking place. Among them, the definition of the        Interlinking tools generate rdf data consisting of links be-
rules to generate rdf data from raw data, its actual gener-       tween rdf datasets, the so-called linksets. None of the most
ation, its publishing and its interlinking are few of the most    well-known tools generate any provenance or metadata an-
essential steps. However, the majority of the tools devel-        notations regarding the links that were identified and repre-
oped to address these tasks do not generate automatically         sented as the output dataset of the interlinking task.
any provenance or metadata information as the correspond-            In the same context, tools were developed to support data
ing tasks are accomplished, let alone enriching metadata de-      owners to semantically annotate their data. However, those
fined in prior steps of the Linked Data publishing workflow.      tools still generate both the mapping rules and the corre-
   Hartig and Zhao [10] argued regarding the need of in-          sponding rdf data after the rules execution, without pro-
tegrating provenance information publication in the Linked        viding any provenance or metadata information. To be more
Data publishing workflow. However, they focused only on its       precise, none of the tools that automatically generate map-
last step, namely the rdf data publication, outlining meta-       pings of relational databases to its rdf representation, such
data publication approaches and showcasing on well-know           as BootOx [14], IncMap [21], or Mirror [5], or support users
rdf data publishing tools, such as Pubby1 and Triplify2 .         in defining mapping rules, e.g., FluidOps editor [23], sup-
   None of the well-know systems that generate rdf repre-         ports automated provenance and metadata information gen-
sentations from any type of (semi-)structured data provide        eration, neither for the mapping rules, nor for the generated
any provenance or metadata information in conjunction with        rdf data. Specifying metadata for the mapping rules or
the generated rdf data, to the best of our knowledge. For         considering the mapping rules to determine the provenance
instance, none of the prevalent tools for generating rdf data,    and metadata becomes even more cumbersome, in particu-
such as DB2triples3 , Karma4 , or xsparql5 , to indicatively      lar in the case of mapping language whose representation is
mention a few of the prevalent tools. The main obstacle, at       not in rdf, e.g., sml, sparql or xquery.
least with respect to provenance, is that it is hard to specify      Similarly, among the rdf data publishing infrastructures,
where the data originally resides. That occurs because most       only Triple Pattern Fragments10 (tpf) [26, 27] provide some
of these tools, consider a file as data input. However, where     metadata information, mainly regarding dataset level statis-
the data of this file is derived from is not known and, there-    tics and access. Virtuso 11 , 4store 12 and other pioneer pub-
fore, the corresponding provenance annotations can not be         lishing infrastructures do not provide out-of-the-box meta-
accurately defined in an automated fashion.                       data information, e.g., provenance, dataset-level statics etc.
   The d2r server6 and the csv2rdf4lod7 are the only tool         of the rdf data published. lodlaundromat13 is the only
that generates provenance and metadata information in con-        Linked Data publishing infrastructure that provides auto-
junction with the rdf data. However, the d2r server refers        matically generated metadata information. However, it uses
only to data in relational databases, it supports a custom        its own custom ontology14 which partially relies on the prov-
provenance vocabulary, not the w3c-recommended prov-              o ontology to provide metadata information.
o [16], and is limited to dataset high level metadata in-
formation. The csv2rdf4lod refers only to csv files and
                                                                  2.2      Provenance and Metadata Vocabularies
it achieves capturing provenance using custom bash scripts          w3c recommended vocabularies were already defined to
that aim to keep track of the commands used. The situation        specify rdf data provenance and metadata information:
aggravates in the case of custom solutions for generating rdf
data which neglect to include in its development cycle mech-
                                                                   2.2.1    PROV Ontology
anisms to generate provenance and metadata information.             The prov ontology (prov-o) [16] is recommended by w3c
   With the advent of mapping languages, such as the d2rq8 ,      to express the prov Data Model [18] using the owl2 Web
sml9 , or the w3c recommended r2rml [4], the mapping              Ontology Language (owl2) [13]. prov-o can be used to
rules that specify how triples are generated from raw data,       represent provenance information generated in different sys-
1
                                                                  tems and under different contexts.
  http://wifo5-03.informatik.uni-mannheim.de/pubby/                 According to the prov ontology, a prov:Entity is a physi-
2
  http://triplify.org/                                            cal, digital, conceptual, or other kind of thing. A prov:Activity
3
  https://github.com/antidot/db2triples
4                                                                 occurs over a period of time and acts upon or with entities;
  http://usc-isi-i2.github.io/karma/
5                                                                 10
  http://xsparql.deri.org/                                           http://linkeddatafragments.org/
6                                                                 11
  http://d2rq.org/                                                   http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/
7                                                                 12
  https://github.com/timrdf/csv2rdf4lod-automation/wiki              http://4store.org/
8                                                                 13
  http://d2rq.org/d2rq-language                                      http://lodlaundromat.org/
9                                                                 14
  http://sml.aksw.org/                                               http://lodlaundromat.org/ontology/
it may include consuming, processing, transforming, modi-          an rdf triple using four statements. A description of a state-
fying, relocating, using, or generating entities. A prov:Agent     ment is called a reification of the statement. The rdf reifica-
bears some form of responsibility for an activity taking place,    tion vocabulary consists of the type rdf:Statement, and the
for the existence of an entity, or for another agent’s activity.   properties rdf:subject, rdf:predicate and rdf:object.
                                                                   rdf reification is the w3c recommended approach for rep-
2.2.2    VoID Vocabulary                                           resenting provenance and metadata information.
   The Vocabulary of Interlinked Datasets (void) [1] is a vo-      1   _:ex12345   rdf:type        rdf:Statement .
cabulary for expressing metadata about rdf datasets with           2   _:ex12345   rdf:subject     ex:item10245 .
applications ranging from data discovery to cataloging and         3   _:ex12345   rdf:predicate   ex:weight .
                                                                   4   _:ex12345   rdf:object      "2.4"^^xsd:decimal .
archiving of datasets. void expresses (i) general, (ii) ac-        5   _:ex12345   prov:wasDerivedFrom _:src123 .
cess and (iii) structural metadata, as well as links between
datasets. General metadata is based on Dublin Core. Ac-               The major disadvantage of rdf reification is the number
cess metadata describes how the rdf data can be accessed           of triples required to represent a reified statement. For each
using different protocols. Structural metadata describes the       generated triple, at least four additional statements is re-
structure and schema of the rdf data.                              quired to be generated. So, for an rdf dataset of N triples,
   According to the void vocabulary, a void:Dataset is a set       the metadata graph will be equal to four times the number
of rdf triples maintained or aggregated by a single provider.      of the rdf dataset triples in the best case where only the
A void:Dataset is a meaningful collection of triples, that deal    rdf reification statements are generated and no additional.
with a certain topic, originate from a certain source or pro-
                                                                   2.3.2    Singleton Properties
cess, and contains sufficient number of triples that there is
benefit in providing a concise summary. The concrete triples          Singleton properties [20] is an alternative approach for
contained in a void:Dataset is established through access in-      representing statements about statements using rdf. This
formation, such as the address of a sparql endpoint. Last,         approach relies on the intuition that the nature of every re-
a void:Linkset is a collection of rdf links whose subject and      lationship is universally unique and can be a key for any
object are described in different datasets.                        statement using a singleton property. A singleton property
                                                                   represents one specific relationship between two entities un-
2.2.3    DCAT Vocabulary                                           der a certain context. It is assigned a uri, as any other
   The Data Catalog Vocabulary (dcat) [17] is designed to          property, and can be considered as a subproperty or an in-
facilitate interoperability between data catalogs published        stance of a generic property. Singleton properties and their
on the Web. It aims to (i) increase data discoverability,          generic property are associated with each other using the
(ii) enable applications to easily consume metadata from           singletonPropertyOf property, subproperty of rdf:type.
multiple catalogs, (iii) enable decentralized catalogs pub-        1   ex:item10245   ex:weight#1              "2.4"^^xsd:decimal .
                                                                   2   ex:weight#1    sp:singletonPropertyOf   ex:weigh .
lishing, and (iv) facilitate federated dataset search.             3   ex:weight#1    prov:wasDerivedFrom      _:src123 .
   According to the dcat vocabulary, a dcat:Catalog repre-
sents a dataset catalog, a dcat:Dataset represents a dataset
in the catalog, whereas a dcat:Distribution represents an ac-      2.3.3    Explicit Graphs
cessible form of a dataset, e.g., a downloadable file, an rss        The Explicit Graphs approach relies on named graphs.
feed or a Web service that provides the data. dcat consid-         Named Graphs is a set of rdf triples named by a uri and
ers as a dataset a collection of data, published or curated        can be represented using TriG [3], N-Quads [2] or JSON-
by a single agent, and available for access or download in         LD [24], but it is not compatible with all rdf serialisations.
one or more formats. This data is considered for generating        This approach is similar to Singleton Properties. Instead of
an rdf dataset. Thus, the generated rdf dataset forms a            annotating the common predicate of the triples, the context
dcat:Distribution of a certain dcat:Dataset.                       of the triple is annotated. This way, introducing one triple
                                                                   per predicate is avoided. However, the Explicit Graphs ap-
2.3     Approaches for tracing PROV & metadata                     proach has two drawbacks: (i) they are not supported by all
   We outline methods for capturing provenance and other           rdf serializations; and (ii) they might be in conflict with the
metadata information. We identify two approaches that cap-         named graph defined as part of the rdf dataset and whose
ture provenance and other metadata information inline with         intent is different than tracing provenance information.
the rest rdf data –Explicit Graphs (Section 2.3.3) and Sin-        1   ex:item10245   ex:weight "2.4"^^xsd:decimal    ex:graph .
gleton Properties (Section 2.3.2)– and two that trace them         2   ex:graph       prov:wasDerivedFrom   _:src123 .
independently of the rdf data –rdf Reification (Section 2.3.1)
and Implicit Graphs(Section 2.3.4). In the following subsec-       2.3.4    Implicit Graphs
tions, we discuss in more details alternative approaches for          Implicit graphs are uris assigned implicitly to a dataset,
defining the provenance of the following rdf triple:               graph, triple or term. An Implicit Graph is aware of what it
1   ex:item10245 ex:weight "2.4"^^xsd:decimal .                    represents but the represented entity is not directly linked
                                                                   to its implicit graph. Implicit graphs might be used to iden-
                                                                   tify a dataset or a graph, but also triples. In the later case,
2.3.1    RDF Reification                                           as Triple Pattern Fragments (tpf) introduced [26, 27], each
   The rdf framework considers a vocabulary for describing         triple can be found by using the elements of itself, thus,
rdf statements and providing additional information. rdf           each triple has a uri and, thereby, its implicit graph. For
reification is intended for expressing properties such as dates    example, the triple x y z for a certain dataset could be iden-
of composition and source information, applied to specific in-     tified by the tpf uri http://example.org/dataset?subject=
stances of triples. The conventional use involves describing       x&predicate=y&object=z.
1    <http://example.org/dataset?                                         rule might have been generated by a mapping generator or
2    subject=ex:item10245&predicate=ex:weight&object="2.4">               edited by a human-agent using a mapping editor. However,
3        prov:wasDerivedFrom _:src123 .
                                                                          such a maping rule might have been modified or used in con-
                                                                          junction with other mapping rules which were generated, in
                                                                          their own turn, by another human-agent at a different time.
3.    WORKFLOW METADATA STEPS                                                The agent who defined the mapping rules (Fig. 1, Map-
   Provenance and other metadata information can be cap-                  ping Editor ) might differ from the one who generated (Fig. 1,
tured at different steps of the publishing workflow. Keep-                Data Generator ) or published the data (Fig. 1, Data Pub-
ing track of metadata derived from the different steps of                 lisher ), or even the owner of the data (Fig. 1, Data Owner ).
the rdf data generation and publishing workflow, results in               Being aware of who defined the mapping rules is of crucial
more complete information regarding how an rdf dataset                    importance to assess the trustworthiness of the final rdf
was generated and formed in the end. Moreover, provenance                 data, even though it is neglected so far. For instance, rdf
and metadata information generated at different steps of the              data generated using mapping rules from an automated gen-
publishing workflow offer complementary information.                      erator might be considered less trustworthy compared to rdf
   We identify the following primary steps: mapping defi-                 data whose mapping rules were defined by a data specialist.
nitions generation (Section 3.1), data source retrieval (Sec-
tion 3.2), rdf data generation (Section 3.3), rdf data pub-               3.2    Data Sources Retrieval
lication (Section 3.4). We consider each workflow step as an                 An rdf dataset might be derived from one or more het-
activity (prov:Activity) whose properties is needed to be                 erogeneous data sources (Fig. 1, Data Source Acquisition).
traced. In Table 1, we summarize those activities and the                 Each data source, in its own turn, might be derived from
information that needs to be defined each time. The prove-                an input. For instance, a table might be derived from a
nance and how the different steps are associated with each                database or some json data might be derived from a Web
other are shown at Figure 1.                                              api. Such a data source might be turned into an rdf graph
                                                                          partially or in its entirety. This might mean that not the
                                       Same     Different                 entire stored data is retrieved but a selection is only used
                                      Dataset   Dataset
                                                                          to generate the rdf data. For instance, only the data that
                                Map. Gen. Pub. Gen. Pub. Link.
                                                                          fulfils an sql query could be retrieved to generate the rdf
                prov:Entity
                                                                          dataset, instead of the entire table or database.
       prov:wasGeneratedBy
       prov:wasDerivedFrom
                                                                             For this activity, it is important to keep track of metadata
      prov:wasAttributedTo                                                regarding the data sources and their retrieval, as this indi-
                prov:Agent              #
                                        G      #
                                               G                          cates the original data sources of the generated rdf data.
      prov:actedOnBehalfOf                                                However, the originally stored data might have changed over
  void:Dataset – General                                                  time. For instance, in the case of an api, some data is re-
             dcterms:creator                                              trieved at a certain time, but different data might be re-
         dcterms:contributor                                              trieved at a subsequent time. Therefore, it is crucial to
           dcterms:publisher                                              know when the data is accessed to assess its timeliness with
              dcterms:source                                              the original data. For instance, comparing the last modified
             dcterms:created
                                                                          date of the original data and the generation date of the rdf
           dcterms:modified
                                                                          data, indicates whether the available rdf representation is
              dcterms:issued            G
                                        #      G
                                               #
              dcterms:license           #
                                        G      #
                                               G
                                                                          aligned with the current version of the original data or not.
                 void:feature           #
                                        G      #
                                               G
    void:Dataset – Access                                                 3.3    RDF Data Generation
void:Dataset – Structural               G
                                        #      G
                                               #                             As soon as the mapping rules and the data source are
 void:Dataset – Statistics              #
                                        G      #
                                               G                          available, the rdf data is generated (Fig. 1, Generate RDF
               void:Linkset             #
                                        G      #
                                               G                          Data). For this activity, it is important to keep track of
                                                                          (i) how the rdf data generation was triggered, i.e. data-
A filled circle ( ) indicates that the property should be (re-)assessed
in each of the marked steps. A half-filled circle (G #) indicates that    driven or mapping-driven, from raw data (rdf generation)
that property can be assessed in any of the marked steps.                 or from rdf data (rdf interlinking); (ii) when the rdf
                                                                          dataset was generated, and (iii) how, i.e. in a single dataset
Table 1: Table of properties required for each entity                     or in subgraphs, subsets etc. Besides the aforementioned,
                                                                          this activity is crucial for capturing the origin of the rdf
                                                                          data, as only at this step that information is known (in com-
3.1     Mapping Rules Definition                                          bination with the data description and acquisition).
   Provenance and metadata information is required to be
captured when the mapping rules are defined (Fig. 1, Edit                 3.4    RDF Data Publication
Map Doc). In this case, it is important to track when the                    The published rdf data is not always identical to the gen-
mapping rules were edited or modified and by whom. An                     erated one (Fig. 1, genRDF Vs. pubRDF ). For instance, it
rdf dataset might have been generated using multiple map-                 might be the result of merging multiple rdf datasets which
ping rules whose definition occurred at different moments                 are generated from different data sources at the same or dif-
and by different agents. Consequently, the generation of                  ferent moments. Moreover, the published rdf dataset might
certain mapping rules (Fig. 1, Generate Map Doc) is an ac-                be published in a different way compared to how the rdf
tivity (prov:Activity) which is informed by all prior editing             data was generated. For instance, it could be split in differ-
activities (Fig. 1, Edit Map Doc). For instance, a mapping                ent graphs to facilitate its consumption. This might lead to
                                                                                wasAttributedTo

                                                                                                                generatedAtTime      wasDerivedFrom
                        actedOnBehalfOf                                                                                                                      pub          generatedAtTime
                                                  Data Publisher         Start Time           End Time
                                                                                                                                                           RDF Data
                                                                startedAtTime                         endedAtTime
                                                                                                                                                                         endedAtTime
                      actedOnBehalfOf                      wasAssociatedWith       Generate               generated           gen            used       Publish                             End Time
      Data Owner                        Data Generator                             RDF data                                 RDF Data                    RDF Data                            Start Time
                                                                                                                                                                        startedAtTime
                                                                                      used   used                 wasDerivedFrom   hadPrimarySource

                                                                                                                                                      generatedAtTime
             actedOnBehalfOf
                                                                           wasStartedBy wasStartedBy                                        generated
                                                                                                                            non-RDF
                                                     Map Doc
     Mapping Editor                                                                                                           Data                                      endedAtTime
                                          wasGeneratedBy     generatedAtTime                                                                            Data Source                         End Time
             wasAssociatedWith                                                                                    wasDerivedFrom                        Acquisition                         Start Time
                                                                   endedAtTime                                                                                          startedAtTime
     Edit                                         Generate                                   End Time
     Map Doc                 wasInformedBy        Map Doc                                    Start Time                    Stored Data
                                                                   startedAtTime
                                                                                                                                           used


                                 Figure 1: A coordinated view of Linked Data publishing workflow activities.


different metadata for the generation and publication activi-                                                  4.1       Dataset Level
ties, and these metadata sets might have different purposes.                                                      Dataset level provenance and metadata provide high-level
   For instance, void access information metadata is more                                                      information for the complete rdf dataset. This level of de-
meaningful and possible to be generated during the rdf data                                                    tail is meaningful for all metadata information that refer to
publication, whereas provenance information in respect to                                                      the whole dataset, i.e. a void:Dataset and are the same
the original data can only be defined during the rdf data                                                      for each triple. Therefore, among the alternative represen-
generation activity. To the contrary, void structural or sta-                                                  tation approaches, considering an explicit or implicit graph
tistical metadata might be generated both during rdf data                                                      for the dataset to represent provenance and metadata anno-
generation and publication. However, the generated rdf                                                         tations is sufficient on dataset level and it requires the least
data is not always identical to the one published. If the                                                      number of additional triples. The alternative approaches
generated rdf data differ from the one published, then such                                                    in principle assign the same metadata information to each
metadata should be defined for both cases (see Table 1).                                                       triple. Thus, the exact same information is replicated for
                                                                                                               each triple, causing unnecessary overhead.
                                                                                                                  Provenance information on dataset level is sufficient if all
4.   METADATA DETAILS LEVELS                                                                                   triples are derived from the same original data source and are
   There are different details levels for capturing provenance                                                 generated at the same time, as a result of a single activity.
and metadata information. However, in most cases so far,                                                       The same occurs if the overall origin source is sufficient to
the provenance and metadata information is delivered on                                                        assess the rdf dataset trustworthiness. On the contrary, if
dataset level. This mainly occurs because the metadata in-                                                     being aware of the exact data source is required, for instance
formation are only defined after the rdf data is generated                                                     to align the semantically annotated representation with the
and/or published. However, different applications and data                                                     original data values, more detailed provenance information
consumption cases require different levels of provenance and                                                   is desired, because the high level provenance information is
metadata information. Overall, the goal is to achieve the                                                      not as complete and accurate to accomplish the desired task.
best trade-off between details level and number of additional
triples generated for balancing information overhead and ac-
ceptable information loss in an automated metadata gen-                                                        4.2       Named Graph Level
eration occasion. For instance, considering rdf reification                                                       An rdf dataset might consist of one or more named graphs.
for capturing all provenance and metadata information for                                                      Named graph based subsets of an rdf dataset provide con-
each triple, means that metadata referring to the entire rdf                                                   ceptual partitions of rdf triples semanticfully distinguished
dataset is captured repeatedly for each individual triple. To                                                  in graphs. Named graph level provenance and metadata in-
the contrary, considering an implicit graph on dataset level                                                   formation refer to all rdf annotations which are related to
results in information loss in respect to the origin of each                                                   a certain named graph and contain information for each one
triple, if multiple data sources are used to generate the rdf                                                  of the named graphs. Each named graph is a void:Dataset
dataset, because it is not explicitly defined where each triple                                                and consists a subset of the whole rdf dataset.
is derived from.                                                                                                  In the case of named graphs, it is not possible to rep-
   Automating the provenance and metadata information                                                          resent metadata and provenance information using explicit
generation, allows exploiting hybridic approaches which can                                                    graphs, because the rdf statements are already quads and
contribute in optimizing the metadata information balance.                                                     the named graph has different semantics than providing meta-
In this section, we outline the different details levels for                                                   data information. As in the case of dataset level, implicit
capturing metadata that we identified: Dataset level (Sec-                                                     graphs for each named graph and for the complete dataset
tion 4.1), named graph level (Section 4.2), partition level (Sec-                                              generate the minimum number of additional rdf triples.
tion 4.3), triple level (Section 4.4) and term level (Section 4.5).                                            Moreover, the named graph level metadata information are
For each level, we describe what type of metadata is cap-                                                      sufficient if all triples of a certain named graph are derived
tured and we discuss the advantages and disadvantages when                                                     from the same data source. Otherwise, there is information
used in combination with different representation approach.                                                    loss which can be addressed at a narrower detail level.
4.3    Partitioned Dataset Metadata Level                         5.    METADATA GENERATION WITH RML
   A dataset might be partitioned based on different aspects.        We introduce an approach that takes into consideration
The most frequent partitions are related to (i) the underlying    machine interpretable descriptions of data sources and map-
data source or the triple’s (ii) subject, (iii) predicate, or     ping rules, which are used to generate rdf datasets, to
(iv) object. Besides the aforementioned partitions, any other     also automatically generate its corresponding provenance
custom partition can be equally considered. A source-based        and metadata information. Our approach relies on asserting
partitioned rdf dataset is an rdf dataset whose subsets are       statements from declarative descriptions of data sources and
formed with respect to their derivation source. To be more        mapping rules. This allows our proposed approach to be ap-
precise, all rdf terms and triples are derived from the same      plied on alternative mapping languages and be replicated in
original data source. Source-based partitioned rdf datasets       different implementations.
derived from a single data source are not considered because         In our exemplary case, machine interpretable mapping
they coincide with the actual rdf dataset. A subject-based        rules are defined using the rdf Mapping Language (rml) [8].
partitioned rdf dataset is the part of an rdf graph whose         rml is considered because it is the only language that al-
triples share the same subject. Consequently, subject-level       lows uniformly defining the mapping rules over heteroge-
metadata provides information for all triples which share the     neous data sources. Moreover, rml is aligned with machine
same subject. It similarly applies in the case of predicate-      interpretable data source descriptions defined using different
based or object-based partitions.                                 vocabularies, e.g., dcat [17], csvw [25], Hydra [15] etc [9].
   Partitioned datasets might be treated in the same way as
named graphs, but it is also possible to use explicit graphs to   5.1     RML Mapping Definitions
define the subsets metadata. An implicit graph for each sub-         Mapping rules are defined using the rdf Mapping Lan-
set of the rdf dataset which resembles a partition achieves       guage (rml). rml [8] extends the w3c recommended r2rml
generating the minimum number of additional triples for the       mapping language [4] defined for specifying mappings of
metadata information. In the particular case of predicate-        data in relational databases to the rdf data model. rml
based partition, representing the provenance and metadata         covers also mappings from data sources in different (semi-
information using singleton properties would cause generat-       )structured formats, such as csv, xml, and json.
ing almost the same number of additional triples as in the           rml documents contain rules defining how the input data
case of defining an explicit or implicit graph per partition.     can be represented in rdf. An rml document (see Listing 1)
                                                                  contains one or more Triples Maps (line 5 and 13). A Triples
4.4    Triple Level                                               Map defines how triples are generated and consists of three
   If metadata is captured on triple level, it becomes possi-     main parts: the Logical Source, the Subject Map and zero or
ble to keep track of the data source each triple was derived      more Predicate-Object Maps. The Subject Map (line 6 and 14)
from. However, that causes the generation of rdf annota-          defines how unique identifiers (uris) are generated for the
tions for metadata whose number of triples is larger than         resources and is used as the subject of all rdf triples gen-
the actual dataset. In the simplest case, the number of ad-       erated from this Triples Map. A Predicate-Object Map (line 7
ditional triples for the metadata information depends on the      and 15) consists of Predicate Maps, which define the rule that
number of data sources. The more data sources, the more           generates the triple’s predicate (line 9, 17 and 19) and Ob-
metadata information to be defined. Triple level metadata         ject Maps (line 18 and 20) or Referencing Object Maps (line 10),
become meaningful also in the case of big data or streamed        which define how the triple’s object is generated.
data where the time one triple was generated might signifi-
                                                                   1   @prefix rr:     <http://www.w3.org/ns/r2rml#>.
cantly differ compared to the rest triples of the rdf dataset.     2   @prefix rml:    <http://semweb.mmlab.be/ns/rml#>.
   In the case of triple level metadata, singleton properties      3   @prefix foaf:   <http://xmlns.com/foaf/0.1/>.
become meaningful when statements about all triples shar-          4
                                                                   5   <#PersonMap> rml:logicalSource <#DCAT_LogicalSource> ;
ing the same property share the same metadata information.         6                rr:subjectMap <#PersonSubjectMap>;
For instance, if all triples whose rdf terms are associated us-    7                rr:predicateObjectMap <#AccountPreObjMap>.
ing a certain predicate, share the same metadata, e.g., they       8   <#PersonSubjectMap> rr:template "http://ex.com/{ID}".
                                                                   9   <#AccountPreObjMap> rr:predicate foaf:account;
are all derived from the same data source.                        10                       rr:objectMap <#TwitterRefObjMap>.
                                                                  11   <#TwitterRefObjMap> rr:parentTriplesMap <#TwitterAcount>.
4.5    RDF Term Level                                             12
                                                                  13   <#TwitterAcountMap> rml:logicalSource <#DB_LogicalSource>;
   Even rdf terms that are part of a certain rdf triple can       14                       rr:subjectMap <#TwitterSubMap>;
derive from different data sources. For instance, an rdf          15   rr:predicateObjectMap <#AccountPreObjMap>, <#HomepagePreObjMap>.
term is generated considering some data value derived from        16   <#TwitterSubMap>     rr:template "http://ex.com/{account_ID}".
                                                                  17   <#AccountPreObjMap> rr:predicate foaf:accountName;
a source A. This rdf term might constitute the subject of         18                        rr:objectMap [ rml:reference "name"].
an rdf triple whose object though is an rdf term derived          19   <#HomepagePreObjMap> rr:predicate foaf:accountServiceHomepage;
from a source B. In this case, even more detailed metadata        20                        rr:objectMap <#HomepageObjMap>.
information is required to keep track of the provenance infor-    21   <#HomepageObjMap>    rml:reference "resource".
mation. Among the alternative approaches for representing
metadata, the rdf reification becomes meaningful at this                       Listing 1: RML Mapping Rules
level of detail. To be more precise, the rdf reification is
meaningful in the cases that the rdf terms that consist an        5.2     Mapping Document Metadata
rdf triple and form a statement derive from different data          A mapping document summarizes mapping rules defined
sources and/or are generated at a different time.                 using the rml language. rml is serialized in rdf, thus a
                                                                  mapping document (<#MapDoc>) can be considered as an
                                                                  rdf dataset itself (void:Dataset). Therefore, it has its own
metadata as any other rdf data can have. To be more                   5.3     Data Sources Retrieval Metadata
precise, a mapping document is a prov:Entity that can be as-            One or more data sources might be considered for gen-
sociated with a prov:Agent, either a human agent or software.         erating an rdf dataset. In our exemplary case, one data
The Mapping Document is the result of a prov:Activity, which          source is described by the <#DB LogicalSource> and the un-
is informed, on its own turn, from different editing activities.      derlying database that contains the data is described by the
 1   @prefix dcterms: <http://purl.org/dc/terms/>.                    <#DB Source> using the d2r vocabulary. Its description:
 2   @prefix prov:    <http://www.w3.org/ns/prov#>.
 3   @prefix void:    <http://rdfs.org/ns/void#>.                      1   @prefix rml: <http://semweb.mmlab.be/ns/rml#>.
 4                                                                     2   @prefix dcat: <http://www.w3.org/ns/dcat#>.
 5   <#MapDoc> a prov:Entity, void:Dataset;                            3   @prefix d2rq:
 6    prov:generatedAtTime   "2016-01-05T17:10:00Z"^^xsd:dateTime;     4    <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>.
 7    prov:wasGeneratedBy    <#MapDoc_Generation>;                     5
 8    prov:wasAssociatedWith <#RMLEditor>;                             6   <#DB_LogicalSource> rml:logicalSource [
 9    prov:wasAttributedTo <http://rml.io/people/AnastasiaDimou>;      7     rml:query """SELECT * FROM DEPT WHERE ... """ ;
10    dcterms:creator      <http://rml.io/people/AnastasiaDimou>;      8     rml:source <#DB_Source> ].
11    dcterms:created      "2016-01-05T17:10:00Z"^^xsd:dateTime;       9
12    dcterms:modified     "2016-01-05T17:15:00Z"^^xsd:dateTime;      10   <#DB_Source>      a d2rq:Database;
13    dcterms:issued       "2016-01-07T10:10:00Z"^^xsd:dateTime.      11     d2rq:jdbcDSN    "jdbc:mysql://localhost/example";
14                                                                    12     d2rq:jdbcDriver "com.mysql.jdbc.Driver";
15   <#MapDoc_Editing>      a prov:Activity;                          13     d2rq:username   "user";
16    prov:startedAtTime    "2016-01-05T17:00:00Z"^^xsd:dateTime;     14     d2rq:password   "password".
17    prov:endedAtTime      "2016-01-05T17:10:00Z"^^xsd:dateTime .
18                                                                            Listing 4: Database Source description
19   <#MapDoc_Generation>   a prov:Activity;
20    prov:generated        <#MapDoc>;                                  Similarly, a data source might be a dcat:Dataset and one
21    prov:startedAtTime    "2016-01-05T17:09:00Z"^^xsd:dateTime;
22    prov:endedAtTime      "2016-01-05T17:10:00Z"^^xsd:dateTime;     of its distributions might be considered for generating the
23    prov:wasInformedBy    <#MapDoc_Editing>.                        rdf dataset. Directly downloadable distributions contain a
24                                                                    dcat:downloadURL reference. For instance:
25   <#RMLEditor> a prov:Agent;
26    prov:type prov:SoftwareAgent.                                    1   @prefix rml: <http://semweb.mmlab.be/ns/rml#>.
27                                                                     2   @prefix dcat: <http://www.w3.org/ns/dcat#>.
28   <http://rml.io/people/AnastasiaDimou> a prov:Agent;               3
29    prov:type prov:Person;                                           4   <#DCAT_LogicalSource> rml:source <#DCAT_Source>;
30    prov:actedOnBehalfOf <#DataOwner>.                               5                         rml:referenceFormulation ql:XPath;
                                                                       6                         rml:iterator "...".
                                                                       7
       Listing 2: Mapping Metadata Description                         8   <#DCAT_Source> a dcat:Dataset;
                                                                       9                  dcat:distribution <#XML_Distribution> .
   Besides the metadata regarding the entire mapping doc-             10
ument (<#MapDoc>), similarly metadata might be defined                11   <#XML_Distribution> a dcat:Distribution;
on Triples Map level or regarding any of the Term Maps, espe-         12                       dcat:downloadURL <http://ex.org/file.xml>.
cially in case that different parts of the mapping document
(subsets of <#MapDoc>) were defined by different agents or                      Listing 5: DCAT source description
at different times. For instance, the metadata information
                                                                         The data source retrieval can be considered as a prov:Activity
of different Triples Map might be as follows:
                                                                      attributed to a prov:Agent. Such a prov:Agent can be the data
 1   @prefix dcterms: <http://purl.org/dc/terms/>.                    owner or an agent acting on his behalf, i.e. Data Genera-
 2   @prefix prov:    <http://www.w3.org/ns/prov#>.
 3   @prefix void:    <http://rdfs.org/ns/void#>.
                                                                      tor. The data source consists of a prov:Entity which was de-
 4                                                                    rived from the data acquisition activity. The original data
 5   <#MapDoc> void:subset <#PersonMap>, <#TwitterAccountMap>.        source description provides some information regarding the
 6                                                                    data source. Additional, provenance information is added
 7   <#PersonMap>            a prov:Entity, void:Dataset;
 8    prov:wasGeneratedBy    <#PersonMap_Generation>;                 for further clarity. The metadata for a data source, e.g., the
 9    prov:wasAssociatedWith <#RMLEditor>;                            <#DB LogicalSource> and the <#DCAT LogicalSource>, are
10    prov:wasAttributedTo   <http://rml.io/people/AnastasiaDimou>.   described as follows:
11
12   <#TwitterAccountMap>    a prov:Entity, void:Dataset;              1   @prefix prov: <http://www.w3.org/ns/prov#>.
13    prov:wasGeneratedBy    <#TwitterAccountMap_Generation>;          2
14    prov:wasAssociatedWith <#RMLEditor>;                             3   <#DB_LogicalSource>   a prov:Entity;
15    prov:wasAttributedTo   <http://rml.io/people/AnastasiaDimou>.    4    prov:wasDerivedFrom <#DB_Source> ;
16                                                                     5    prov:generatedAtTime "2016-01-05T17:10:00Z"^^xsd:dateTime .
17   <#PersonMap_Generation> a prov:Activity;                          6
18    prov:generated         <#PersonMap>;                             7   <#DB_Retrieval>     a prov:Activity;
19    prov:wasInformedBy     <#PersonMap_Editing>.                     8    prov:generated     <#DB_LogicalSource> ;
20                                                                     9    prov:used          <#DB_Source>;
21   <#TwitterAccountMap_Generation> a prov:Activity;                 10    prov:startedAtTime "2016-01-05T17:00:00Z"^^xsd:dateTime ;
22    prov:generated         <#TwitterAccountMap>;                    11    prov:endedAtTime   "2016-01-05T17:10:00Z"^^xsd:dateTime .
23    prov:wasInformedBy     <#TwitterAccountMap_Editing>.            12
24                                                                    13   <#DCAT_LogicalSource> a prov:Entity;
25   <#PersonMap_Editing>         a prov:Activity.                    14    prov:generatedAtTime "2016-01-05T17:05:00Z"^^xsd:dateTime .
26   <#TwitterAccountMap_Editing> a prov:Activity.                    15
                                                                      16   <#DCATsource_Retrieval> a prov:Activity;
     Listing 3: Triples Map Metadata Description                      17    prov:generated         <#DCAT_LogicalSource>;
                                                                      18    prov:used              <#DCAT_Source>.

                                                                           Listing 6: Data Sources Metadata Description
5.4     RDF Dataset Generation Metadata                                proach occurs at a certain case, the rdf data generation
  Considering the aforementioned mapping document and                  activity (<#RDFdataset Generation>) is informed by the map-
the data source descriptions, rdf triples are generated.               ping document generation activity (<#MapDoc Generation>)
                                                                       or by the data source generation activity (e.g., the <#DB-
Approaches for Tracing Metadata and RML                                source Retrieval> or the <#DCATsource Retrieval>).
Among the different approaches for capturing provenance                   Specifying the data source where the rdf dataset was de-
and metadata information, rml can best be aligned with                 rived from becomes easy and can be automatically asserted
implicit graphs and rdf reification. Those two approaches              thanks to the aligned mapping and data source descriptions.
generate metadata information independently of the gener-              The rml mapping rules declaratively define the data sources
ated rdf data. In particular, explicit graphs are not consid-          used, in contrast to other mapping languages which do not
ered as they might coincide with the named graphs if any               explicitly define the data sources considered for fulfilling the
explicitly defined for the actual rdf dataset.                         mapping activity.
                                                                          However, as one can observe, it is defined that the rdf
Metadata Details Levels and RML                                        dataset was derived from an extract of data from a database
                                                                       and an xml file published on the Web, but it is not explicitly
Dataset level metadata information is associated with all              defined which triples are derived from each data source.
triples generated considering all mapping rules in a map-
ping document. Named graph level metadata information is               Triple Level Metadata
associated with triples which are generated considering Term
                                                                       In order to address the aforementioned ambiguity regarding
Maps related to the corresponding Graph Map.
                                                                       the rdf triples origin, rml metadata generation might be
   As far as partitioned rdf datasets is concerned, each par-
                                                                       defined based on the Predicate Object Maps, for instance the
tition is associated with different parts of one or more Triples
                                                                       <#AccountPreObjMap> and the <#HomepagePreObjMap>. The
Maps. To be more precise, source-level metadata information
                                                                       metadata information of the generated rdf triples follows:
is generated for all triples which are derived from Triples Maps
which share the same Logical Source. In the same context,               1   @prefix prov: <http://www.w3.org/ns/prov#>.
                                                                        2   @prefix void: <http://rdfs.org/ns/void#>.
subject-level metadata information is generated for each unique         3
instantiation of one (or more) of the Subject Maps that ap-             4   <#RDF_Dataset> void:subset <#AccountRDF>, <#HomepageRDF>.
pear in the mapping document. Similarly, predicate-level                5
metadata information is generated for each unique predi-                6   <#AccountRDF> a prov:Entity, void:Dataset;
                                                                        7    prov:wasGeneratedBy    <#RDFdataset_Generation>;
cate which appears in one or more Triples Maps. Last, object-           8    prov:wasDerivedFrom <#DB_LogicalSource>,<#DCAT_LogicalSource>;
level metadata anotations are generated for each unique ob-             9    prov:wasAssociatedWith <#RMLProcessor>;
ject which is generated due to an Object Map. Whereas the              10    prov:wasAttributedTo   <http://rml.io/people/AnastasiaDimou>.
                                                                       11
aforementioned levels consider implicit graphs to represent            12   <#HomepageRDF> a prov:Entity, void:Dataset;
provenance and metadata information, triple and RDF term               13    prov:wasGeneratedBy    <#RDFdataset_Generation>;
level metadata information can only be captured considering            14    prov:wasDerivedFrom    <#DB_LogicalSource>;
                                                                       15    prov:wasAssociatedWith <#RMLProcessor>;
reification statements.                                                16    prov:wasAttributedTo   <http://rml.io/people/AnastasiaDimou>.

Dataset Level Metadata                                                          Listing 8: RDF Triple Level Metadata
Dataset level provenance and metadata information has as
follows for the aforementioned running example:                           Even though it is easy one to observe that this resolves
                                                                       the ambiguity issue regarding the provenance in the case of
 1   @prefix dcterms: <http://purl.org/dc/terms/>.                     triples generated considering the <#HomepagePreObjMap>,
 2   @prefix prov:    <http://www.w3.org/ns/prov#>.
 3                                                                     it is not the same in the case of triples generated consider-
 4   <#RDF_Dataset> a prov:Entity, void:Dataset;                       ing the <#AccountPreObjMap>. In the later case, the rdf
 5    prov:generatedAtTime   "2016-01-05T17:10:00Z"^^xsd:dateTime;     triples are formed generating the subject and the object
 6    prov:wasGeneratedBy    <#RDFdataset_Generation>;
 7    prov:wasDerivedFrom <#DB_LogicalSource>,<#DCAT_LogicalSource>;   from different data sources. To be more precise, the triple’s
 8    prov:wasAssociatedWith <#RMLProcessor>;                          subject is generated considering a value derived from the
 9    prov:wasAttributedTo <http://rml.io/people/AnastasiaDimou>;      <#DCAT LogicalSource>, whereas the triple’s object is de-
10    dcterms:creator      <http://rml.io/people/AnastasiaDimou>;
11    dcterms:created      "2016-01-05T17:10:00Z"^^xsd:dateTime;
                                                                       rived from the <#DB LogicalSource>. If, even more detailed
12    dcterms:modified     "2016-01-05T17:12:00Z"^^xsd:dateTime;       provenance is required, rdf term level should be preferred.
13    dcterms:issued       "2016-01-07T10:10:00Z"^^xsd:dateTime.
14                                                                     RDF Term Level Metadata
15   <#RDFdataset_Generation> a prov:Activity;
16    prov:generated          <#RDF_Dataset>;                          The rdf term level is the narrowest details level for meta-
17    prov:startedAtTime      "2016-01-05T17:00:00Z"^^xsd:dateTime;    data information. It is applicable in the cases of Referencing
18    prov:endedAtTime        "2016-01-05T17:10:00Z"^^xsd:dateTime;
19    prov:wasInformedBy      <#MapDoc_Generation>;                    Object Maps, namely when the subject and the object of an
20    prov:used <#MapDoc>,<#DB_LogicalSource>,<#DCAT_LogicalSource>.   rdf triple is derived from different data sources. rdf reifi-
21                                                                     cation is the only approach for representing this metadata
22    <#RMLProcessor> a prov:Agent;
23     prov:type prov:SoftwareAgent.
                                                                       information. Considering the aforementioned running exam-
                                                                       ple, the metadata information of the rdf triples generated
        Listing 7: RDF Dataset Level Metadata                          from the <#AccountPreObjMap> Predicate Object Map are de-
                                                                       fined as it follows:
  The rdf data generation might be triggered either by a
mapping document (mapping-driven approach) or by a data
source (data-driven approach) [11]. Depending on which ap-
 1     _:ex12345   rdf:type        rdf:Statement .                    RML Processor
 2     _:ex12345   rdf:subject     ex:item10245 .
 3     _:ex12345   rdf:predicate   foaf:account .                     The rmlprocessor was extended with a Metadata Module
 4     _:ex12345   rdf:object      <https://twitter.com/natadimou>.   that automatically generates metadata for the generated
 5
 6     ex:item10245 prov:wasDerivedFrom <#DCAT_LogicalSource>.
                                                                      rdf dataset. The desired metadata to be generated by the
 7     <http://twitter.com/natadimou>                                 rmlprocessor can be configured by an agent1. The agent
 8      prov:wasDerivedFrom <DB_LogicalSource>.                       who triggers the mapping activity can define the vocabulary
                                                                      to be used, as well as the desired details level. By default,
           Listing 9: RDF Term Level Metadata                         the w3c recommended prov [16], void [1] and dcat [17] vo-
                                                                      cabularies are supported. Although, the rmlprocessor can
                                                                      be further extended to support other vocabularies and gen-
DCAT Catalogue Enrichment                                             erate more metadata information.
In the case that the original data is published on the Web               The rmlprocessor has also been extended to automati-
in the frame of a catalogue described with the dcat vo-               cally generate corresponding metadata regarding rdf dataset
cabulary, complementary metadata information can be gen-              generation. It was extended to generate metadata informa-
erated to enrich it. If one of the dcat:Dataset distributions         tion considering implicit graphs or rdf reification. To be
(dcat:Distribution) is considered to generate the corresponding       more precise, the rmlprocessor was configured to generate
rdf representation, complementary dcat metadata might                 metadata information using implicit graphs for the metadata
be generated as well, to specify that the generated rdf               which are related with the whole dataset as well as for named
dataset is another distribution of a certain dcat:Dataset pub-        graphs. The rmlprocessor was configured to generate rdf
lished on the dcat:Catalog. For instance, in the case of the          reification triples if the metadata level is set on triple or rdf
running example, the <#DCAT RDF> and the <#DB RDF>                    term level. The explicit graphs and singleton properties were
are source-level partitions of the rdf dataset. The rdf               not considered because they need to be defined inline with
triples in the <#DCAT RDF> partition is an rdf distribu-              the actual rdf data, together with the mapping rules.
tion of the <#XML Distribution> for the <#DCAT Source>.
The <#DCAT RDF> metadata information and the enriched
<#DCAT Source> have as follows:                                       7.   CONCLUSIONS AND FUTURE WORK
                                                                         The proposed approach aims to show how metadata of the
 1     @prefix dcat: <http://www.w3.org/ns/dcat#>.
 2     @prefix prov: <http://www.w3.org/ns/prov#>.
                                                                      fundamental activities for the generation and publication
 3     @prefix void: <http://rdfs.org/ns/void#>.                      of rdf triples can be automatically generated. Our solu-
 4                                                                    tion covers the rdf dataset generation, including metadata
 5     <#RDF_Dataset> void:subset <#DCAT_RDF>, <#DCAT_RDF>.           for the mapping rules definition and the data descriptions.
 6
 7     <#DCAT_RDF> a prov:Entity, void:Dataset;                       Based on the provided metadata information, it is expected
 8      prov:wasGeneratedBy    <#RDFdataset_Generation>;              that publishing infrastructures will enrich this information
 9      prov:wasDerivedFrom    <#DCAT_LogicalSource>;                 with complementary details regarding the rdf dataset pub-
10      prov:wasAssociatedWith <#RMLProcessor>;
11      prov:wasAttributedTo <http://rml.io/people/AnastasiaDimou>.   lication activity. Moreover, it is expected that rdf pub-
12                                                                    lication infrastructures will re-determine certain properties
13     <#DCAT_Source> dcat:distribution <#DCAT_RDF>.                  regarding the metadata information, if the rdf dataset is re-
                                                                      formed before it gets published. Moreover, the metadata can
         Listing 10: DCAT Metadata Enrichment                         be enriched with additional information derived from other
                                                                      activities involved in the Linked Data publishing workflow.
                                                                         Provenanve and metadata information can be multidimen-
6.      METADATA & THE RML TOOL CHAIN                                 sional and its consumption diverges across different systems.
                                                                      Different applications require different levels of metadata in-
   We implemented the aforementioned approach at the rml              formation to fulfil their tasks, whereas diverse metadata in-
tool chain, namely the rmleditor15 and the rmlprocessor16 .           formation might be desired. The presented workflow was
The rml tool chain was configured to support implicit graphs          focused on the essential parts of the Linked Data publish-
and rdf reification. The supported metadata can be further            ing workflow. However, any metadata information might be
extended to take into consideration other metadata vocabu-            considered as they accrue from other activities involved in
laries too and generate corresponding metadata information.           the Linked Data publishing workflow. In the future, we con-
In more details:                                                      sider including metadata information regarding the results
                                                                      of the rdf validation [7, 6], applied both on the mapping
RML Editor                                                            document, as well as on the generated rdf data.
The rmleditor [12] was extended to generate metadata re-                 Different aspects of the rdf data generation and publish-
garding the editing and generation of the mapping rules, as           ing might influence its quality and trustworthiness assess-
they are declaratively represented using the rml language.            ment. In most of the cases so far, the provenance and meta-
The rmleditor keeps track of the mapping document edition             data information are manually delivered on dataset level.
and generation activities, when they occurred and by whom.            Automating the provenance and metadata generation rely-
The rmleditor was extended to support implicit graphs for             ing on machine interpretable descriptions of the different
defining the metadata information which is related to the             workflow steps, allows to generate metadata in a system-
rml mapping document or its subsets.                                  atic way. The generated provenance and metadata informa-
                                                                      tion becomes more accurate, consistent and complete. The
15
     http://rml.io/RMLeditor.html                                     metadata generation for certain rdf data is an incremental
16
     http://github.com/RMLio/RML-Mapper                               procedure that relies on the contribution of different activ-
ities in the Linked Data publishing workflow to enrich the          for Editing Mapping Definitions. In Workshop on
information we have for the generated rdf dataset.                  Intelligent Exploration of Semantic Data, 2015.
                                                               [13] P. Hitzler, M. Krötzsch, B. Parsia, P. F.
8.   ACKNOWLEDGMENTS                                                Patel-Schneider, and S. Rudolph. OWL 2 Web
                                                                    Ontology Language. W3C Recom., Dec. 2012.
  The described research activities were funded by Ghent
                                                                    http://www.w3.org/TR/owl2-primer/ .
University, iMinds, the Institute for the Promotion of Inno-
                                                               [14] E. Jiménez-Ruiz, E. Kharlamov, D. Zheleznyakov,
vation by Science and Technology in Flanders (IWT), the
                                                                    I. Horrocks, C. Pinkel, M. Skjæveland, E. Thorstensen,
Fund for Scientific Research Flanders (FWO Flanders), and
                                                                    and J. Mora. BootOX: Practical Mapping of RDBs to
the European Union.
                                                                    OWL 2. In The Semantic Web - ISWC 2015. 2015.
                                                               [15] M. Lanthaler. Hydra Core Vocabulary. Unofficial
9.   REFERENCES                                                     Draft, June 2014.
 [1] K. Alexander, R. Cyganiak, M. Hausenblas, and                  http://www.hydra-cg.com/spec/latest/core/.
     J. Zhao. Describing Linked Datasets with the VoID         [16] T. Lebo, S. Sahoo, and D. McGuinness. PROV-O: The
     Vocabulary. W3C Interest Group Note, Mar. 2011.                PROV Ontology. Working Group Recommendation,
     http://www.w3.org/TR/void/ .                                   W3C, Apr. 2013. http://www.w3.org/TR/prov-o/ .
 [2] G. Carothers. RDF 1.1 N-Quads. Working Group              [17] F. Maali and J. Erickson. Data Catalog Vocabulary
     Recommendation, W3C, Feb. 2014.                                (DCAT). W3C Recommendation, Jan. 2014.
     https://www.w3.org/TR/n-quads/ .                               http://www.w3.org/TR/vocab-dcat/ .
 [3] G. Carothers and A. Seaborne. RDF 1.1 TriG.               [18] L. Moreau and P. Missier. PROV-DM: The PROV
     Working Group Recommendation, W3C, Feb. 2014.                  Data Model. Working Group Recommendation, W3C,
     https://www.w3.org/TR/trig/ .                                  Apr. 2013. http://www.w3.org/TR/prov-dm/ .
 [4] S. Das, S. Sundara, and R. Cyganiak. R2RML: RDB           [19] A.-C. Ngonga Ngomo and S. Auer. Limes: A
     to RDF Mapping Language. Working Group                         Time-efficient Approach for Large-scale Link
     Recommendation, W3C, Sept. 2012.                               Discovery on the Web of Data. 2011.
     http://www.w3.org/TR/r2rml/ .                             [20] V. Nguyen, O. Bodenreider, and A. Sheth. Don’t Like
 [5] L. de Medeiros, F. Priyatna, and O. Corcho.                    RDF Reification?: Making Statements About
     MIRROR: Automatic R2RML Mapping Generation                     Statements Using Singleton Property. In Proceedings
     from Relational Databases. In Engineering the Web in           of the 23rd International Conference on World Wide
     the Big Data Era. 2015.                                        Web, 2014.
 [6] T. De Nies, A. Dimou, R. Verborgh, E. Mannens, and        [21] C. Pinkel, C. Binnig, E. Kharlamov, and P. Haase.
     R. Van de Walle. Enabling dataset trustworthiness by           IncMap: Pay As You Go Matching of Relational
     exposing the provenance of mapping quality                     Schemata to OWL Ontologies. In Proceedings of the
     assessment and refinement. In Proceedings of the 4th           8th International Conference on Ontology Matching,
     International Workshop on Methods for Establishing             pages 37–48, 2013.
     Trust of (Open) Data, 2015.                               [22] M. Schmachtenberg, C. Bizer, and H. Paulheim.
 [7] A. Dimou, D. Kontokostas, M. Freudenberg,                      Adoption of the Linked Data Best Practices in
     R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann,              Different Topical Domains. In ISWC 2014. 2014.
     and R. Van de Walle. Assessing and Refining               [23] K. Sengupta, P. Haase, M. Schmidt, and P. Hitzler.
     Mappings to RDF to Improve Dataset Quality. In                 Editing R2RML mappings made easy. 2013.
     Proceedings of the 14th ISWC, 2015.                       [24] M. Sporny, D. Longley, G. Kellogg, M. Lanthaler, and
 [8] A. Dimou, M. Vander Sande, P. Colpaert,                        N. Lindström. JSON-LD. Working Group
     R. Verborgh, E. Mannens, and R. Van de Walle.                  Recommendation, W3C, Jan. 2014.
     RML: A Generic Language for Integrated RDF                     https://www.w3.org/TR/json-ld/ .
     Mappings of Heterogeneous Data. In Workshop on            [25] J. Tennison, G. Kellogg, and I. Herman. Model for
     Linked Data on the Web, 2014.                                  Tabular Data and Metadata on the Web. W3C
 [9] A. Dimou, R. Verborgh, M. Vander Sande,                        Working Draft, Apr. 2015. http:
     E. Mannens, and R. Van de Walle.                               //www.w3.org/TR/2015/WD-tabular-data-model-20150416/.
     Machine-Interpretable Dataset and Service                 [26] R. Verborgh, O. Hartig, B. De Meester,
     Descriptions for Heterogeneous Data Access and                 G. Haesendonck, L. De Vocht, M. Vander Sande,
     Retrieval. In SEMANTiCS 2015, 2015.                            R. Cyganiak, P. Colpaert, E. Mannens, and R. Van de
[10] O. Hartig and J. Zhao. Provenance and Annotation of            Walle. Querying datasets on the Web with high
     Data and Processes: Third International Provenance             availability. In Proceedings of the 13th ISWC, 2014.
     and Annotation Workshop, IPAW 2010, chapter               [27] R. Verborgh, M. Vander Sande, P. Colpaert,
     Publishing and Consuming Provenance Metadata on                S. Coppens, E. Mannens, and R. Van de Walle.
     the Web of Linked Data. 2010.                                  Web-scale querying through Linked Data Fragments.
[11] P. Heyvaert, A. Dimou, R. Verborgh, E. Mannens, and            In Proceedings of the 7th Workshop on Linked Data on
     R. Van de Walle. Approaches for Generating                     the Web, 2014.
     Mappings to RDF. In Proceedings of the 14th ISWC:         [28] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk –
     Posters and Demos, 2015.                                       A Link Discovery Framework for the Web of Data. In
[12] P. Heyvaert, A. Dimou, R. Verborgh, E. Mannens, and            Workshop on Linked Data on the Web, 2009.
     R. Van de Walle. Towards a Uniform User Interface