<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detailed Provenance Capture of Data Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Dimou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Mannens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IDLab, Department of Electronics and Information Systems</institution>
          ,
          <addr-line>Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
<p>A large part of Linked Data generation entails processing the raw data. However, this process is only documented in human-readable form or as a software repository. This inhibits reproducibility and comparability, as current documentation solutions do not provide detailed metadata and rely on the availability of specific software environments. This paper proposes an automatic capturing mechanism for interchangeable and implementation-independent metadata and provenance that includes data processing. Using declarative mapping documents to describe the computational experiment allows automatic capturing of term-level provenance for both schema and data transformations, and for both the used software tools and the input-output pairs of the data processing executions. This approach is applied to mapping documents described using rml and fno, and implemented in the rmlmapper. The captured metadata can be used to more easily share, reproduce, and compare the dataset generation process across software environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Computational Experiment</kwd>
        <kwd>Data Processing</kwd>
        <kwd>FnO</kwd>
        <kwd>Provenance</kwd>
        <kwd>RML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Reproducibility is improved by explicit description of data processing and analysis [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A large part of Linked Data generation tasks entails processing data to generate new data; detailed metadata of these generation tasks is thus of great importance. The ten newly introduced datasets of iswc 2016's Resource Track (http://iswc2016.semanticweb.org/pages/program/accepted-papers.html) were generated using some sort of data processing, e.g., parsing raw data, interlinking existing datasets, or performing Natural Language Processing (nlp). One of the most widely known examples is dbpedia, where Wikitext is processed to generate the dbpedia dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        However, the description of how datasets are generated is mostly available as a scientific paper, e.g., [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or as a software repository, e.g., https://github.com/dbpedia/extraction-framework. This inhibits reproducibility and comparability, as these documentation solutions do not provide detailed machine-interpretable metadata describing the dataset generation process. Reproducing or comparing a generated dataset therefore demands manual intervention and specific software and hardware environments, if it is possible at all. (Virtualization tools such as Docker, https://www.docker.com/, do abstract certain software environment requirements; however, they still rely on the public availability of all needed software tools.) Explicit description and provenance of the generation task provides important insights [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], even when these software or hardware dependencies are no longer available.
      </p>
      <p>
        In this paper, we propose an automatic capturing mechanism for detailed metadata and provenance information. This enables reproducibility and comparability of data processing, without relying on the availability of specific software and hardware environments, and without restricting the complexity of the data processing. After providing a background on provenance in Section 2, we present the relevant provenance types and the underlying model using the prov Ontology (prov-o) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in Section 3. In Section 4, we show how using declarative statements to describe the computational experiment allows us to automatically capture term-level provenance. We apply our model to the Function Ontology (fno) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which, in its turn, is aligned with the rdf Mapping Language (rml) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our proposed approach is implemented in the rml and fno tool chain, namely, the rmlmapper and FunctionProcessor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and used to generate metadata based on the generation of a sample dbpedia dataset. We conclude in Section 5.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Automatic Metadata for Linked Data Generation and Publishing</title>
      <p>In this paper, we propose automatically capturing machine-interpretable and detailed metadata concerning the data processing of a dataset generation, to improve reproducibility. In Section 2.1, we introduce metadata formats that enable reproducibility, and in Section 2.2, we discuss existing work that automatically captures machine-interpretable metadata, without considering data processing.</p>
      <sec id="sec-3-1">
        <title>Provenance</title>
        <p>
          Provenance can be considered information describing the materials and transformations applied to derive the data, and the processes that enabled their creation [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. It has several applications [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], namely to assess data quality, trace the audit trail, aid in describing replication recipes, establish attribution, and be informational, i.e., provide context. As such, providing the provenance of a data processing alongside the generated dataset can improve general reproducibility.
        </p>
        <p>
          Providing this provenance as Linked Data has advantages, as its distributed nature allows us (i) to publish provenance separately from the actually published dataset, and (ii) to easily interlink different provenance dimensions without tight coupling. To improve interoperability, we apply commonly used Linked Data vocabularies to describe the provenance. Provenance vocabularies already exist, namely, the prov Ontology (prov-o) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], a w3c recommendation to represent and interchange provenance generated in different systems and under different contexts. Describing the generation process using provenance modeled in prov-o thus allows us to generate machine-interpretable and interoperable metadata.
        </p>
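        <p>To make these prov-o terms concrete, a dataset generation boils down to a few typed resources and the relations between them. The following minimal sketch, with hypothetical resource names (ex:dataset, ex:generation, ex:tool), models such a description as plain (subject, predicate, object) tuples rather than with an RDF library:</p>

```python
# Minimal sketch of a PROV-O description of a dataset generation.
# All resource names are hypothetical; a real system would use an RDF library.
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"

provenance = {
    (EX + "dataset",    "rdf:type",                 PROV + "Entity"),
    (EX + "generation", "rdf:type",                 PROV + "Activity"),
    (EX + "tool",       "rdf:type",                 PROV + "Agent"),
    (EX + "dataset",    PROV + "wasGeneratedBy",    EX + "generation"),
    (EX + "generation", PROV + "wasAssociatedWith", EX + "tool"),
}

# Any PROV-aware consumer can answer "which activity generated this
# dataset?" without knowing anything about the generating system:
generated_by = [o for (s, p, o) in provenance
                if s == EX + "dataset" and p == PROV + "wasGeneratedBy"]
print(generated_by)
```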
      </sec>
      <sec id="sec-3-2">
        <title>Automatic Capture of Provenance</title>
        <p>
          Provenance capture mechanisms fall into three main classes: workflow-, process-, and operating-system-based (os) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Workflow-based mechanisms are attached to a workflow system, process-based mechanisms require each involved service or process to document itself, and os-based mechanisms rely on the availability of specific functionalities at the os level, without modifications to existing scripts or programs [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Considering a data generation process as a single step within a workflow, and aiming to provide an implementation-independent solution, thus an os-independent solution, the provenance of a dataset generation process is best captured using a process-based mechanism. As such, it is complementary to workflow capturing mechanisms such as implemented in the Pegasus Workflow Management System [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] or ontologies that describe workflows such as p-plan [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and os capturing mechanisms such as implemented in panda [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        </p>
        <p>
          Related work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] automatically captures metadata and provenance information decoupled from the implementation by relying on declarative descriptions, both for the mapping rules that specify how to generate the Linked Data in rdf, and for the raw data access interfaces. When generating Linked Data, a separate provenance dataset is generated that includes the contributing schema transformations and data sources. Different detail levels have been identified to capture metadata and provenance [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: on the dataset, named graph, partitioned dataset, triple, and term level. Furthermore, multiple ways of adding provenance information to the declarative descriptions using prov-o have been identified [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: using explicit graphs, implicit graphs, singleton properties [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], or reification. Implicit graphs and reification have the advantage that they do not influence the generated rdf data, whilst explicit graphs are not supported by all rdf serializations, and singleton properties require changing the schema-level transformations [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. However, the aforementioned work does not include data processing, i.e., it only takes raw extracted data values into account. Meanwhile, most generated datasets involve specific data processing, and its provenance is an essential part of the generated dataset.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Metadata and Provenance Capture for Data Transformations</title>
      <p>We first state the different dimensions to take into account when capturing the metadata and provenance for data processing in Section 3.1, after which we propose our model in Section 3.2.</p>
      <sec id="sec-5-1">
        <title>Metadata and Provenance Dimensions</title>
        <p>
          Schema vs Data Transformations Dataset generation depends on both schema and data transformations [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Schema transformations involve (re-)modeling the data, describing how objects are related, and deciding which vocabularies and ontologies to use [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Data transformations are needed to support any changes in the structure, representation, or content of data [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. However, instead of coupling these transformations, aligning them allows them to be executed separately as well as combined [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Namely, aligning instead of coupling the provenance of these transformations allows us to reproduce data transformations without needing to reproduce the schema transformations, and vice versa. Existing work has mostly focused on capturing schema transformations' metadata.
        </p>
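        <p>The benefit of aligning rather than coupling the two transformation types can be sketched in a few lines; the function and record structure below are illustrative, not the paper's implementation:</p>

```python
# A data transformation as a self-contained function (illustrative):
def to_title_case(value: str) -> str:
    return value.title()

# Reproduced separately, without any schema transformation:
assert to_title_case("ben de meester") == "Ben De Meester"

# Or combined: a schema transformation that maps a raw record to a triple
# and calls the data transformation on the object value.
def map_person(record: dict) -> list:
    subject = "http://example.org/" + record["id"]
    return [(subject, "foaf:name", to_title_case(record["name"]))]

print(map_person({"id": "Ben", "name": "ben de meester"}))
```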
        <p>
          Interaction vs Actor The execution of a generation process involves different actors, namely, the processor executing the generation process and the different processes that perform the data transformations. The relation between these actors is a client-service relation, i.e., the generation process (client) calls the different data transformations (service). Two kinds of provenance are generated by these actors [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]: (i) interaction provenance, which describes the input and output parameters of each execution, generated and confirmed by both client and service actor, and (ii) actor provenance, which is metadata about the actor's own state during an execution (e.g., implementation details or hardware configuration) and is not verifiable by the other actors. Both kinds of provenance are complementary, i.e., interaction provenance can be used to compare results without relying on the implementation, whilst actor provenance can be used to reproduce performance measurements.
        </p>
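        <p>Sketched as one record per execution (the field names are illustrative, not a fixed schema), the two kinds of provenance, and the fact that only interaction provenance is needed for cross-implementation comparison, look as follows:</p>

```python
# One execution record carrying both kinds of provenance (illustrative).
execution = {
    # Interaction provenance: confirmed by both client and service,
    # and comparable across implementations.
    "interaction": {"function": "grel:toTitleCase",
                    "input": "ben de meester",
                    "output": "Ben De Meester"},
    # Actor provenance: the service's own state; not verifiable by the
    # client, but needed to reproduce e.g. performance measurements.
    "actor": {"tool": ":grelJavaImpl", "release": "0.1.0"},  # hypothetical release
}

def same_behaviour(exec_a: dict, exec_b: dict) -> bool:
    """Compare two executions using interaction provenance only."""
    return exec_a["interaction"] == exec_b["interaction"]

# A different tool with the same input-output behaviour still matches:
other = dict(execution, actor={"tool": ":otherTool", "release": "2.3"})
print(same_behaviour(execution, other))
```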
      </sec>
      <sec id="sec-5-2">
        <title>Metadata and Provenance Model</title>
        <p>
          Provenance can be captured on different levels in the generation process [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. It is most relevant on term level, as only then does it unambiguously define which data transformation contributed to which value. For instance, dbpedia data involves parsing, e.g., date values from infoboxes in Wikipedia. In particular, the provenance of how such a date value was parsed is important for the dbpedia generation provenance, and it is currently neither measured nor published [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. When capturing term-level provenance, data transformations are decoupled from the schema transformations, i.e., the captured metadata and provenance is defined on value level and does not rely on the relationships between resources or the used vocabularies and ontologies. In addition to the existing domains to capture metadata and provenance (i.e., mapping rules definition and data sources retrieval), we introduce another domain: the processing domain. This covers the data transformations, complementary to the schema transformations covered by the mapping rules.
        </p>
        <p>[Figure 1: the provenance model. The Data Transformation (an Activity) wasInformedBy the schema transformation and used an Input Entity and a Function Entity; an Output Entity wasGeneratedBy the Data Transformation and wasAttributedTo the Tool Agent, with which the Data Transformation wasAssociatedWith.]</p>
        <p>To capture both interaction and actor provenance, it is necessary to capture and align implementation-specific data (e.g., the software release, as actor provenance) with implementation-independent data (e.g., the input and output values, as interaction provenance). However, on top of the input and output values, we argue that additional implementation-independent data is needed, namely, which type of data processing was executed. This allows comparability between data generation processes across tools: when two different tools implement the same type of data processing function, the input and output values of the data generation process with the first tool can be compared with those of the second tool. For the remainder of the paper, we make the distinction between function (i.e., the implementation-independent description, cf. interaction provenance) and tool (i.e., the implementation-specific description, cf. actor provenance).</p>
        <p>
          Our model (Figure 1) is mapped to prov-o [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We distinguish schema and data transformations, provide actor and interaction provenance, and include both function and tool. Schema and data transformations are a prov:Activity, where the latter is informed by (prov:wasInformedBy) the schema transformation. Input, function, and output are a prov:Entity, and the tool is a prov:Agent. The data transformation uses (prov:used) the input (the model puts no restriction on the number of inputs, but only one is visualized for clarity) and the function, and the output is generated by (prov:wasGeneratedBy) the data transformation. The data transformation is associated with (prov:wasAssociatedWith) the tool, thus we can derive that the output is attributed to (prov:wasAttributedTo) the tool. The relation between the function and the tool is an association (prov:qualifiedAssociation).
        </p>
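        <p>The derivation of prov:wasAttributedTo can be sketched as a rule over plain triples; the resource names below are illustrative:</p>

```python
# The model of Section 3.2 as plain triples, and the derivation rule:
# X wasGeneratedBy A  and  A wasAssociatedWith T  =>  X wasAttributedTo T.
triples = {
    ("ex:dataTransformation", "prov:wasInformedBy", "ex:schemaTransformation"),
    ("ex:dataTransformation", "prov:used", "ex:input"),
    ("ex:dataTransformation", "prov:used", "ex:function"),
    ("ex:output", "prov:wasGeneratedBy", "ex:dataTransformation"),
    ("ex:dataTransformation", "prov:wasAssociatedWith", "ex:tool"),
}

derived = {
    (entity, "prov:wasAttributedTo", agent)
    for (entity, p1, activity) in triples if p1 == "prov:wasGeneratedBy"
    for (activity2, p2, agent) in triples
    if p2 == "prov:wasAssociatedWith" and activity2 == activity
}
print(derived)
```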
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Application</title>
      <p>Capturing metadata and provenance within the dataset generation process, specifically when including data transformations, requires term-level capturing mechanisms. Instead of providing a tool-specific solution, i.e., changing a specific system, our approach captures metadata and provenance based on machine-interpretable descriptions of the dataset generation process. This way, the approach is independent of the actual implementation. Moreover, these mapping descriptions can be automatically analyzed or even generated.</p>
      <p>
        As an exemplary case, we consider the rdf Mapping Language (rml) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to provide the machine-interpretable mapping rules for the schema transformations, and the Function Ontology (fno) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to describe the data transformations. An alignment between rml and fno is presented in previous work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. rml is considered because it is the only language that allows uniformly defining the mapping rules over heterogeneous data sources, and thus covers the most dataset generation use cases. fno is considered as it allows implementation-independent description of functions, their input parameters, and their returned results.
1  grel:toTitleCase a fno:Function, prov:Entity, prov:Plan ;
2    fno:name "title case" ;
3    fno:expects ( [ fno:predicate grel:stringInput ] ) ;
4    fno:output ( [ fno:predicate grel:stringOutput ] ) .
5
6  :exe a fno:Execution, prov:Activity ;         # Data Transformation
7    prov:wasInformedBy :RDFdataset_Generation ; # Schema Transformation
8    :implementation :grelJavaImpl ;             # Tool (Agent)
9    fno:executes grel:toTitleCase ;             # Function (Entity)
10   grel:stringInput :input ;                   # 'prov:used'
11   grel:stringOutput :output .                 # 'prov:wasGeneratedBy'
12
13 :input a prov:Entity ; rdf:value "ben de meester" .  # Input string
14 :output a prov:Entity ; rdf:value "Ben De Meester" . # Transformed to title case
Listing 1: Function descriptions and Executions using fno, mapped to prov-o.
      </p>
      <p>
        Existing work that allows capturing metadata and provenance for dataset generation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is extended to include data transformations. On top of relying on the declarative descriptions of (i) the mapping rules and (ii) the raw data access interfaces, we include the declarative descriptions of (iii) the data transformations. For all detail levels except the term level, it suffices to include which data transformation functions have been used, using their fno description.
      </p>
      <p>
        For the term detail level, we extend existing work to include the model presented in Section 3.2, relying on the mapping rules described in rml to trigger the execution of a function described in fno. As an example, we present a simple mapping process that maps person names and requires a title-case data transformation. (More advanced mapping processes, with more complicated data processing, e.g., nlp, or complex algorithms to generate data values or resources, are similar, as the provenance model makes no assumptions on the execution complexity.) In detail: a data source (:Source a prov:Entity) is mapped using a mapping process (:RDFdataset_Generation a prov:Activity). This mapping process executes the schema transformations. When generating the triple :Ben foaf:name "Ben De Meester", a data transformation needs to be executed on the object; thus, a mapping rule triggers the execution of a function. The description of the grel:toTitleCase function is given in Listing 1, lines 1–4. When executing this function, the actual implementation (:grelJavaImpl a prov:Agent) needs to be retrieved. The execution of that function with specific input data, together with the prov types for clarification, is given in Listing 1, lines 6–14, using the fno statements [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For capturing the provenance of literal values, we need an intermediate resource and an rdf:value relation to attach additional metadata. This is not needed in the generated dataset.
      </p>
      <p>Based on this fno execution description, additional prov-o statements, such as :exe prov:wasInformedBy :RDFdataset_Generation as well as :output prov:wasGeneratedBy :exe, can be derived. On https://fno.io/prov/, an extended description and example is given which, due to page constraints, could not be incorporated in the paper. As such, we can capture all needed metadata and provenance using rml and fno, at an additional cost of about ten triples for every literal or resource generated. This increases the total amount of generated triples by an order of magnitude. However, the captured metadata and provenance can be published separately and does not affect the generated dataset.</p>
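      <p>How such prov-o statements follow from an fno execution description can be sketched as a small mapping. The dict below mirrors lines 6–11 of Listing 1, and the rules follow the model of Section 3.2; the field names are illustrative, not an fno API:</p>

```python
# Deriving PROV-O statements from an FnO execution description (sketch).
execution = {
    "id": ":exe",
    "fno:executes": "grel:toTitleCase",  # Function  -> prov:used
    ":implementation": ":grelJavaImpl",  # Tool      -> prov:wasAssociatedWith
    "grel:stringInput": ":input",        #           -> prov:used
    "grel:stringOutput": ":output",      #           -> prov:wasGeneratedBy
}

def derive_prov(exe: dict) -> set:
    return {
        (exe["id"], "prov:used", exe["fno:executes"]),
        (exe["id"], "prov:used", exe["grel:stringInput"]),
        (exe["grel:stringOutput"], "prov:wasGeneratedBy", exe["id"]),
        (exe["id"], "prov:wasAssociatedWith", exe[":implementation"]),
    }

print((":output", "prov:wasGeneratedBy", ":exe") in derive_prov(execution))
```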
      <p>
        Our approach was implemented in the rmlmapper, the reference implementation for both rml and its alignment with fno [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We then captured all metadata and provenance for a sample of dbpedia, with an additional description available at https://fno.io/prov/dbpedia/. Both fno and prov-o statements are available, and exemplary queries show how this approach can ease the reproducibility and comparability of a dataset generation process. As dbpedia's new generation process includes rml and fno [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we can provide detailed provenance alongside the dbpedia data: e.g., all releases of the tools used to generate the sample dbpedia dataset can be requested (i.e., all data transformation tools, including the release of the rmlprocessor). Moreover, all executions' input and output concerning the dbpedia parsing functions used to parse the Wikipedia input data into normalized rdf literals can be requested. This allows the functional evaluation of a new parsing function by comparing its output values with those of the existing parsing functions. The inclusion of prov:startedAtTime and prov:endedAtTime statements allows performance evaluation, given that the hardware context is also included and can be compared. Moreover, decoupling schema and data transformations allows us to reproduce and compare data transformations without needing to execute the schema transformations, and vice versa.
      </p>
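      <p>The functional evaluation described above can be sketched as a check over captured interaction provenance. The execution records and the function name below are illustrative; in the published dataset they are fno and prov-o triples, queried with e.g. sparql:</p>

```python
# Comparing two tools through their captured input-output pairs (sketch).
captured = [
    {"function": "dbo:parseDate", "tool": ":toolA",
     "input": "4 July 1776", "output": "1776-07-04"},
    {"function": "dbo:parseDate", "tool": ":toolB",
     "input": "4 July 1776", "output": "1776-07-04"},
]

def disagreements(executions: list, function: str) -> set:
    """Inputs on which tools implementing `function` produced different outputs."""
    outputs_per_input = {}
    for e in executions:
        if e["function"] == function:
            outputs_per_input.setdefault(e["input"], set()).add(e["output"])
    return {i for i, outs in outputs_per_input.items() if len(outs) > 1}

# An empty result means the two tools behave identically on the captured inputs:
print(disagreements(captured, "dbo:parseDate"))
```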
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>Linked Data generation constitutes a large part of scientific output, and includes complex and specific data processing. The publication of this generation process, however, is not reproducible, as current documentation options do not provide machine-interpretable detailed metadata that does not rely on specific software environments. In this work, we present automatic capturing of metadata and provenance of data processing on term level, whilst separating schema and data transformations and capturing both actor and interaction provenance (i.e., applying the distinction between function and tool).</p>
      <p>As we tested our approach on the dbpedia generation, details about the used tools for every input-output pair are available, and data transformations can be analyzed decoupled from the schema transformations. Comparability is also improved, as the description of the used function, together with the input-output pair, can be used to compare different tools, even when the original tools are no longer available or accessible.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>An Ontology to Semantically Declare and Describe Functions</article-title>
          .
          <source>In: Proceedings of the 13th ESWC Satellite Events. LNCS</source>
          , vol.
          <volume>9989</volume>
          , pp.
          <volume>46</volume>
          {
          <fpage>49</fpage>
          .
          <string-name>
            <surname>Heraklion</surname>
          </string-name>
          ,
          <string-name>
            <surname>Greece</surname>
          </string-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maroy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
          </string-name>
          , E.:
          <article-title>Declarative data transformations for Linked Data generation: the case of DBpedia</article-title>
          .
          <source>In: Proceedings of the 14th International Conference, ESWC. LNCS</source>
          , vol.
          <volume>10250</volume>
          , pp.
          <volume>33</volume>
          {
          <fpage>48</fpage>
          .
          <string-name>
            <surname>Portoros</surname>
          </string-name>
          ,
          <string-name>
            <surname>Slovenia</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Deelman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vahi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juve</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rynge</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callaghan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maechling</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mayani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , Ferreira da Silva,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Livny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Wenger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Pegasus, a work ow management system for science automation</article-title>
          .
          <source>Future Generation Computer Systems 46(C)</source>
          ,
          <volume>17</volume>
          {
          <fpage>35</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Nies</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>Automated metadata generation for Linked Data generation and publishing work ows</article-title>
          .
          <source>In: Proceedings of the 9th Workshop on Linked Data on the Web</source>
          . vol.
          <volume>1593</volume>
          . CEUR, Montreal, Canada (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander</surname>
            <given-names>Sande</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          .
          <source>In: Proceedings of the 7th Workshop on Linked Data on the Web</source>
          . vol.
          <volume>1184</volume>
          . CEUR, Seoul, Korea (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dolan-Gavitt</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hodosh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hulin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leek</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whelan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Repeatable reverse engineering for the greater good with PANDA</article-title>
          .
          <source>Tech. rep.</source>
          , Columbia University (
          <year>2014</year>
          ), https://doi.org/10.7916/D8WM1C1P
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Freire</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koop</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          :
          <article-title>Provenance for computational tasks: A survey</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          ),
          <fpage>20</fpage>
          –
          <lpage>30</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gil</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>The p-plan ontology</article-title>
          .
          <source>Tech. rep.</source>
          , Ontology Engineering Group (
          <year>2014</year>
          ), http://purl.org/net/p-plan#
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Position statement: Musings on provenance, workflow and (semantic web) annotations for bioinformatics</article-title>
          .
          <source>In: Workshop on Data Derivation and Provenance</source>
          . vol.
          <volume>3</volume>
          . Chicago, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hyland</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atemezing</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villazon-Terrazas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Best Practices for Publishing Linked Data</article-title>
          . Working Group Note,
          <source>World Wide Web Consortium (W3C)</source>
          (
          <year>2014</year>
          ), https://www.w3.org/TR/ld-bp/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ioannidis</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allison</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coulibaly</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Culhane</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furlanello</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Game</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Repeatability of published microarray gene expression analyses</article-title>
          .
          <source>Nature Genetics</source>
          <volume>41</volume>
          (
          <issue>2</issue>
          ),
          <fpage>149</fpage>
          –
          <lpage>155</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lebo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahoo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belhajjame</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheney</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corsar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soiland-Reyes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zednik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>PROV-O: The PROV Ontology</article-title>
          . Recommendation,
          <source>World Wide Web Consortium (W3C)</source>
          (
          <year>2013</year>
          ), https://www.w3.org/TR/prov-o/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Kleef</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Maroy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Sustainable linked data generation: The case of DBpedia</article-title>
          .
          <source>In: Proceedings of the 16th International Semantic Web Conference</source>
          . Vienna, Austria (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodenreider</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Don't like RDF reification?: Making statements about statements using singleton property</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on World Wide Web</source>
          . pp.
          <fpage>759</fpage>
          –
          <lpage>770</lpage>
          . New York, USA (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Do</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          :
          <article-title>Data cleaning: Problems and current approaches</article-title>
          .
          <source>IEEE Data Engineering Bulletin</source>
          <volume>23</volume>
          (
          <issue>4</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>13</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Simmhan</surname>
            ,
            <given-names>Y.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plale</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gannon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A survey of data provenance in e-science</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>34</volume>
          (
          <issue>3</issue>
          ),
          <fpage>31</fpage>
          –
          <lpage>36</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>