=Paper=
{{Paper
|id=Vol-1931/paper-05
|storemode=property
|title=Detailed Provenance Capture of Data Processing
|pdfUrl=https://ceur-ws.org/Vol-1931/paper-05.pdf
|volume=Vol-1931
|authors=Ben De Meester,Anastasia Dimou,Ruben Verborgh,Erik Mannens
|dblpUrl=https://dblp.org/rec/conf/semweb/MeesterDVM17
}}
==Detailed Provenance Capture of Data Processing==
Ben De Meester, Anastasia Dimou, Ruben Verborgh, and Erik Mannens

Ghent University – imec – IDLab, Department of Electronics and Information Systems, Ghent, Belgium
{firstname.lastname}@ugent.be

'''Abstract.''' A large part of Linked Data generation entails processing the raw data. However, this process is only documented in human-readable form or as a software repository. This inhibits reproducibility and comparability, as current documentation solutions do not provide detailed metadata and rely on the availability of specific software environments. This paper proposes an automatic capturing mechanism for interchangeable and implementation-independent metadata and provenance that includes data processing. Using declarative mapping documents to describe the computational experiment allows automatic capturing of term-level provenance for both schema and data transformations, and for both the used software tools and the input-output pairs of the data processing executions. This approach is applied to mapping documents described using RML and FnO, and implemented in the RMLMapper. The captured metadata can be used to more easily share, reproduce, and compare the dataset generation process across software environments.

'''Keywords:''' Computational Experiment, Data Processing, FnO, Provenance, RML

===1 Introduction===
Reproducibility is improved by an explicit description of data processing and analysis [11]. A large part of Linked Data generation tasks entails processing data to generate new data. Thus, detailed metadata of these generation tasks is of great importance. The ten newly introduced datasets of ISWC 2016's Resource Track (http://iswc2016.semanticweb.org/pages/program/accepted-papers.html) were generated using some sort of data processing, e.g., parsing raw data, interlinking existing datasets, or performing Natural Language Processing (NLP). One of the most widely known examples is DBpedia, where Wikitext is processed to generate the DBpedia dataset [13].

However, the description of how datasets are generated is mostly available as a scientific paper, e.g., [13], or a software repository, e.g., https://github.com/dbpedia/extraction-framework. This inhibits reproducibility and comparability, as these current documentation solutions do not provide detailed machine-interpretable metadata describing the dataset generation process. This demands manual intervention and specific software and hardware environments to reproduce or compare a generated dataset, if possible at all (virtualization tools such as Docker, https://www.docker.com/, do abstract certain software environment requirements, but they still rely on the (public) availability of all needed software tools). Explicit description and provenance of the generation task provide important insights [11], even when these software or hardware dependencies are no longer available.

In this paper, we propose an automatic capturing mechanism for detailed metadata and provenance information. This enables reproducibility and comparability of data processing, without relying on the availability of specific software and hardware environments, or imposing restrictions on the complexity of the data processing. After providing a background on provenance in Section 2, we show relevant provenance types and the underlying model using the PROV Ontology (PROV-O) [12] in Section 3. In Section 4, we show how using declarative statements to describe the computational experiment allows us to automatically capture term-level provenance. We apply our model to the Function Ontology (FnO) [1] which, in turn, is aligned with the RDF Mapping Language (RML) [5].
Our proposed approach is implemented in the RML and FnO tool chain, namely the RMLMapper and FunctionProcessor [2], and is used to generate metadata for the generation of a sample DBpedia dataset. We conclude in Section 5.

===2 Automatic Metadata for Linked Data Generation and Publishing===
In this paper, we propose automatically capturing machine-interpretable and detailed metadata concerning the data processing of a dataset generation to improve reproducibility. In Section 2.1, we introduce metadata formats that enable reproducibility, and in Section 2.2, we discuss existing work that automatically captures machine-interpretable metadata, without considering data processing.

====2.1 Provenance====
Provenance can be considered information describing the materials and transformations applied to derive the data and the processes that enabled their creation [17]. It has several applications [9], namely to assess data quality, trace the audit trail, aid in describing replication recipes, establish attribution, and be informational, i.e., provide context. As such, providing provenance of a data processing alongside the generated dataset can improve general reproducibility.

Providing this provenance as Linked Data has advantages, as its distributed nature allows us (i) to publish provenance separately from the actually published dataset, and (ii) to easily interlink different provenance dimensions without tight coupling. To improve interoperability, we apply commonly used Linked Data vocabularies to describe the provenance. Provenance vocabularies already exist, namely the PROV Ontology (PROV-O) [12], a W3C Recommendation to represent and interchange provenance generated in different systems and under different contexts. Describing the generation process using provenance modeled in PROV-O thus allows us to generate machine-interpretable and interoperable metadata.

====2.2 Automatic Capture of Provenance====
Provenance capture mechanisms fall into three main classes: workflow-, process-, and operating-system-based (OS) [7]. Workflow-based mechanisms are attached to a workflow system, process-based mechanisms require each involved service or process to document itself, and OS-based mechanisms rely on the availability of specific functionalities at the OS level, without modifications to existing scripts or programs [7]. Considering a data generation process as a single step within a workflow, and aiming to provide an implementation-independent solution – thus an OS-independent solution – the provenance of a dataset generation process is best captured using a process-based mechanism. As such, it is complementary to workflow capturing mechanisms such as implemented in the Pegasus Workflow Management System [3] or ontologies that describe workflows such as P-Plan [8], and OS capturing mechanisms such as implemented in PANDA [6].

Related work [4] automatically captures metadata and provenance information decoupled from the implementation by relying on declarative descriptions, both for the mapping rules that specify how to generate the Linked Data in RDF, and for the raw data access interfaces. When generating Linked Data, a separate provenance dataset is generated that includes the contributing schema transformations and data sources.
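For illustration, the following is a minimal Turtle sketch of such dataset-level provenance using PROV-O; the resource names are hypothetical and do not follow the exact vocabulary used in [4].

 @prefix prov: <http://www.w3.org/ns/prov#> .
 @prefix ex:   <http://example.org/> .
 
 # Hypothetical dataset-level provenance: which activity generated the
 # dataset, and which declarative descriptions and raw source contributed.
 ex:GeneratedDataset a prov:Entity ;
     prov:wasGeneratedBy ex:MappingExecution ;
     prov:wasDerivedFrom ex:RawDataSource .
 
 ex:MappingExecution a prov:Activity ;
     prov:used ex:RawDataSource, ex:MappingDocument .
 
 ex:RawDataSource   a prov:Entity .              # raw data access interface
 ex:MappingDocument a prov:Entity, prov:Plan .   # declarative mapping rules

At this granularity, however, it is not recorded which transformation produced which individual value; that gap is addressed in the following sections.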
Different detail levels have been identified for capturing metadata and provenance [4]: dataset, named graph, partitioned dataset, triple, and term level. Furthermore, multiple ways of adding provenance information to the declarative descriptions using PROV-O have been identified [4]: using explicit graphs, implicit graphs, singleton properties [15], or reification. Implicit graphs and reification have the advantage that they do not influence the generated RDF data, whilst explicit graphs are not supported by all RDF serializations, and singleton properties require changing the schema-level transformations [4]. However, the aforementioned work does not include data processing, i.e., it only takes raw extracted data values into account. Meanwhile, most generated datasets depend on specific data processing, and its provenance is an essential part of the generated dataset.

===3 Metadata and Provenance Capture for Data Transformations===
We first state the different dimensions to take into account when capturing the metadata and provenance of data processing in Section 3.1, after which we propose our model in Section 3.2.

====3.1 Metadata and Provenance Dimensions====
'''Schema vs Data Transformations.''' Dataset generation depends on both schema and data transformations [16]. Schema transformations involve (re-)modeling the data, describing how objects are related, and deciding which vocabularies and ontologies to use [10]. Data transformations are needed to support any changes in the structure, representation, or content of the data [16]. However, instead of coupling these transformations, aligning them allows them to be executed separately as well as combined [1]. Namely, aligning instead of coupling the provenance of these transformations allows us to reproduce data transformations without needing to reproduce the schema transformations, and vice versa. Existing work has mostly focused on capturing the metadata of schema transformations.

'''Interaction vs Actor.''' The execution of a generation process involves different actors, namely the processor executing the generation process, and the different processes that perform the data transformations. The relation between these actors is a client-service relation, i.e., the generation process (client) calls the different data transformations (services). Two kinds of provenance are generated by these actors [17]: (i) interaction provenance, which describes the input and output parameters of each execution, generated and confirmed by both the client and the service actor, and (ii) actor provenance, which is metadata about the actor's own state during an execution (e.g., implementation details or hardware configuration) and is not verifiable by the other actors. Both kinds of provenance are complementary, i.e., interaction provenance can be used to compare results without relying on the implementation, whilst actor provenance can be used to reproduce performance measurements.

====3.2 Metadata and Provenance Model====
Provenance can be captured at different levels in the generation process [4]. It is most relevant at term level, as only then does it unambiguously define which data transformation contributed to which value. For instance, DBpedia data involves parsing, e.g., date values from infoboxes in Wikipedia. In particular, the provenance of how such a date value was parsed is important for the DBpedia generation provenance, and is currently not measured nor published [2].
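To make this concrete, the following is a minimal Turtle sketch of term-level provenance for a single parsed date value; the resource names and values are hypothetical and do not reflect the actual DBpedia provenance.

 @prefix prov: <http://www.w3.org/ns/prov#> .
 @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
 @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
 @prefix ex:   <http://example.org/> .
 
 # Hypothetical term-level provenance: which parsing execution turned
 # which raw infobox string into which generated date literal.
 ex:rawInfoboxValue a prov:Entity ;
     rdf:value "4 July 1776" .
 
 ex:parsedDateValue a prov:Entity ;
     rdf:value "1776-07-04"^^xsd:date ;
     prov:wasGeneratedBy ex:dateParsingExecution .
 
 ex:dateParsingExecution a prov:Activity ;
     prov:used ex:rawInfoboxValue .

Such statements make it possible to trace every generated literal back to the exact parsing execution and raw value it originated from.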
When capturing term-level provenance, data transformations are decoupled from the schema transformations, i.e., the captured metadata and provenance is defined at value level, and does not rely on the relationships between resources or the used vocabularies or ontologies. In addition to the existing domains for capturing metadata and provenance (i.e., mapping rules definition and data sources retrieval), we introduce another domain: the processing domain. This covers the data transformations, complementary to the schema transformations covered by the mapping rules.

To capture both interaction and actor provenance, it is necessary to capture and align implementation-specific data (e.g., the software release as actor provenance) with implementation-independent data (e.g., the input and output values as interaction provenance). However, on top of input and output values, we argue that additional implementation-independent data is needed, namely which type of data processing was executed. This allows comparability between data generation processes across tools: when two different tools implement the same type of data processing function, the input and output values of the data generation process with the first tool can be compared with those of the second tool. For the remainder of the paper, we make the distinction between function (i.e., the implementation-independent description, cf. interaction provenance) and tool (i.e., the implementation-specific description, cf. actor provenance).

Our model (Figure 1) is mapped to PROV-O [12]. We distinguish schema and data transformations, provide actor and interaction provenance, and include both function and tool. Schema and data transformations are each a prov:Activity, where the latter is informed by (prov:wasInformedBy) the schema transformation. Input, function, and output are each a prov:Entity, and the tool is a prov:Agent. The data transformation uses (prov:used) the input and the function (the model puts no restriction on the number of inputs, but only one is shown for clarity), and the output is generated by (prov:wasGeneratedBy) the data transformation. The data transformation is associated with (prov:wasAssociatedWith) the tool, thus we can derive that the output is attributed to (prov:wasAttributedTo) the tool. The relation between the function and the tool is an association (prov:qualifiedAssociation).

Fig. 1: Using PROV-O for data transformations. The solid gray relation can be derived. The dotted relation denotes an association, details omitted for clarity.

===4 Application===
Capturing metadata and provenance within the dataset generation process – specifically when including data transformations – requires term-level capturing mechanisms. Instead of providing a tool-specific solution, i.e., changing a specific system, our approach captures metadata and provenance based on machine-interpretable descriptions of the dataset generation process. This way, the approach is independent of the actual implementation. Moreover, these mapping descriptions can be automatically analyzed or even generated.

As exemplary case, we consider the RDF Mapping Language (RML) [5] to provide the machine-interpretable mapping rules for the schema transformations, and the Function Ontology (FnO) [1] to describe the data transformations. An alignment between RML and FnO is presented in previous work [2].
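As a preview of that alignment, the following is a minimal Turtle sketch of an RML object map that delegates its value to a function execution; it assumes the fnml alignment vocabulary of [2], and the term map, function, and parameter names are hypothetical.

 @prefix rr:   <http://www.w3.org/ns/r2rml#> .
 @prefix rml:  <http://semweb.mmlab.be/ns/rml#> .
 @prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .
 @prefix fno:  <https://w3id.org/function/ontology#> .
 @prefix ex:   <http://example.org/> .
 
 # Hypothetical object map: instead of copying the raw "name" field,
 # the object value is produced by executing a declared function on it.
 ex:NameObjectMap a fnml:FunctionTermMap ;
     fnml:functionValue [
         rr:predicateObjectMap [
             rr:predicate fno:executes ;
             rr:object    ex:toTitleCase             # function described with FnO
         ] , [
             rr:predicate ex:stringInput ;            # function parameter
             rr:objectMap [ rml:reference "name" ]    # value taken from the source
         ]
     ] .

When the mapping is executed, such a rule triggers the function execution whose description and provenance are captured as shown in Listing 1 below.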
RML is considered because it is the only language that allows uniformly defining mapping rules over heterogeneous data sources, and thus covers most dataset generation use cases. FnO is considered as it allows implementation-independent description of functions, their input parameters, and their returned results.

Existing work that allows capturing metadata and provenance for dataset generation [4] is extended to include data transformations. On top of relying on the declarative descriptions of (i) the mapping rules and (ii) the raw data access interfaces, we include the declarative descriptions of (iii) the data transformations. For all detail levels except the term level, it suffices to include which data transformation functions have been used, using their FnO descriptions.

For the term detail level, we extend existing work to include the model as presented in Section 3.2, relying on the mapping rules described in RML to trigger the execution of a function described in FnO. As an example, we present a simple mapping process that maps person names and requires a title case data transformation (more advanced mapping processes, with more complicated data processing, e.g., NLP, or complex algorithms to generate data values or resources, are handled similarly, as the provenance model makes no assumptions on the execution complexity). In detail: a data source (:Source a prov:Entity) is mapped using a mapping process (:RDFdataset_Generation a prov:Activity). This mapping process executes the schema transformations. When generating the triple :Ben foaf:name "Ben De Meester", a data transformation needs to be executed on the object, thus a mapping rule triggers the execution of a function. The description of the grel:toTitleCase function is given in Listing 1, lines 1–4. When executing this function, the actual implementation (:grelJavaImpl a prov:Agent) needs to be retrieved. The execution of that function with specific input data, together with the prov types for clarification, is given in Listing 1, lines 6–14 using the FnO statements [1]. For capturing the provenance of literal values, we need an intermediate resource and an rdf:value relation to attach additional metadata; this intermediate resource is not needed in the generated dataset.

  1 grel:toTitleCase a fno:Function, prov:Entity, prov:Plan ;
  2     fno:name "title case" ;
  3     fno:expects ( [ fno:predicate grel:stringInput ] ) ;
  4     fno:output ( [ fno:predicate grel:stringOutput ] ) .
  5
  6 :exe a fno:Execution, prov:Activity ;            # Data Transformation
  7     prov:wasInformedBy :RDFdataset_Generation ;  # Schema Transformation
  8     :implementation :grelJavaImpl ;              # Tool (Agent)
  9     fno:executes grel:toTitleCase ;              # Function (Entity)
 10     grel:stringInput :input ;                    # 'prov:used'
 11     grel:stringOutput :output .                  # 'prov:wasGeneratedBy'
 12
 13 :input a prov:Entity ; rdf:value "ben de meester" .   # Input string
 14 :output a prov:Entity ; rdf:value "Ben De Meester" .  # Transformed to title case

Listing 1: Function descriptions and executions using FnO, mapped to PROV-O.

Based on this FnO execution description, additional PROV-O statements such as :exe prov:wasInformedBy :RDFdataset_Generation as well as :output prov:wasGeneratedBy :exe can be derived. On https://fno.io/prov/, an extended description and example are given which, due to page constraints, could not be incorporated in the paper.
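To make the context around Listing 1 explicit, the following Turtle sketch (prefix declarations omitted, as in Listing 1; resource names taken from the example above, while the exact triples published at https://fno.io/prov/ may differ) shows how the schema transformation, the derived statements, and the generated dataset relate.

 # Provenance dataset, published separately from the generated data:
 :Source a prov:Entity .                           # raw data source
 :RDFdataset_Generation a prov:Activity ;          # schema transformation
     prov:used :Source .
 :exe prov:wasInformedBy :RDFdataset_Generation .  # derived
 :output prov:wasGeneratedBy :exe .                # derived
 
 # Generated dataset: only the plain literal appears, without the
 # intermediate :output resource that carries it via rdf:value.
 :Ben foaf:name "Ben De Meester" .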
As such, we can capture all needed metadata and provenance using RML and FnO, at an additional cost of about ten triples for every generated literal or resource. This increases the total amount of generated triples by an order of magnitude. However, the captured metadata and provenance can be published separately and do not affect the generated dataset.

Our approach was implemented in the RMLMapper, the reference implementation for both RML and its alignment with FnO [2]. We then captured all metadata and provenance for a sample of DBpedia, together with an additional description available at https://fno.io/prov/dbpedia/. Both FnO and PROV-O statements are available, and exemplary queries show how this approach can ease reproducibility and comparability of a dataset generation process. As DBpedia's new generation process includes RML and FnO [14], we can provide detailed provenance alongside the DBpedia data: e.g., all releases of the tools used to generate the sample DBpedia dataset can be requested (i.e., all data transformation tools, including the release of the RMLProcessor). Moreover, the input and output of all executions of the DBpedia parsing functions used to parse the Wikipedia input data into normalized RDF literals can be requested. This allows the functional evaluation of a new parsing function by comparing its output values with those of the existing parsing functions. The inclusion of prov:startedAtTime and prov:endedAtTime statements allows performance evaluation, given that the hardware context is also included and can be compared. Moreover, decoupling schema and data transformations allows us to reproduce and compare data transformations without needing to execute the schema transformations, and vice versa.

===5 Conclusions===
Linked Data generation consumes a large part of scientific output, including complex and specific data processing. The publication of this generation process, however, is not reproducible, as current documentation options do not provide machine-interpretable, detailed metadata that does not rely on specific software environments. In this work, we present automatic capturing of metadata and provenance of data processing at term level, whilst separating schema and data transformations and capturing both actor and interaction provenance (i.e., applying the distinction between function and tool). As we tested our approach on the DBpedia generation, details about the used tools for every input-output pair are available, and data transformations can be analyzed decoupled from the schema transformations. Comparability is also improved, as the description of the used function, together with the input-output pair, can be used to compare different tools, even when the original tools are no longer available or accessible.

===References===
1. De Meester, B., Dimou, A., Verborgh, R., Mannens, E., Van de Walle, R.: An Ontology to Semantically Declare and Describe Functions. In: Proceedings of the 13th ESWC Satellite Events. LNCS, vol. 9989, pp. 46–49. Heraklion, Greece (2016)
2. De Meester, B., Maroy, W., Dimou, A., Verborgh, R., Mannens, E.: Declarative data transformations for Linked Data generation: the case of DBpedia. In: Proceedings of the 14th International Conference, ESWC. LNCS, vol. 10250, pp. 33–48. Portorož, Slovenia (2017)
3. Deelman, E., Vahi, K., Juve, G., Rynge, M., Callaghan, S., Maechling, P.J., Mayani, R., Chen, W., Ferreira da Silva, R., Livny, M., Wenger, K.: Pegasus, a workflow management system for science automation. Future Generation Computer Systems 46(C), 17–35 (2015)
4. Dimou, A., De Nies, T., Verborgh, R., Mannens, E., Van de Walle, R.: Automated metadata generation for Linked Data generation and publishing workflows. In: Proceedings of the 9th Workshop on Linked Data on the Web. vol. 1593. CEUR, Montreal, Canada (2016)
5. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In: Proceedings of the 7th Workshop on Linked Data on the Web. vol. 1184. CEUR, Seoul, Korea (2014)
6. Dolan-Gavitt, B.F., Hodosh, J., Hulin, P., Leek, T., Whelan, R.: Repeatable reverse engineering for the greater good with PANDA. Tech. rep., Columbia University (2014), https://doi.org/10.7916/D8WM1C1P
7. Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for computational tasks: A survey. Computing in Science & Engineering 10(3), 20–30 (2008)
8. Garijo, D., Gil, Y.: The P-Plan Ontology. Tech. rep., Ontology Engineering Group (2014), http://purl.org/net/p-plan#
9. Goble, C.: Position statement: Musings on provenance, workflow and (semantic web) annotations for bioinformatics. In: Workshop on Data Derivation and Provenance. vol. 3. Chicago, USA (2002)
10. Hyland, B., Atemezing, G., Villazón-Terrazas, B.: Best Practices for Publishing Linked Data. Working Group Note, World Wide Web Consortium (W3C) (2014), https://www.w3.org/TR/ld-bp/
11. Ioannidis, J.P., Allison, D.B., Ball, C.A., Coulibaly, I., Cui, X., Culhane, A.C., Falchi, M., Furlanello, C., Game, L., Jurman, G., et al.: Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149–155 (2009)
12. Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao, J.: PROV-O: The PROV Ontology. Recommendation, World Wide Web Consortium (W3C) (2013), https://www.w3.org/TR/prov-o/
13. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web (2015)
14. Maroy, W., Dimou, A., Kontokostas, D., De Meester, B., Verborgh, R., Lehmann, J., Mannens, E., Hellmann, S.: Sustainable Linked Data generation: The case of DBpedia. In: Proceedings of the 16th International Semantic Web Conference. Vienna, Austria (2017)
15. Nguyen, V., Bodenreider, O., Sheth, A.: Don't like RDF reification?: Making statements about statements using singleton property. In: Proceedings of the 23rd International Conference on World Wide Web. pp. 759–770. New York, USA (2014)
16. Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin 23(4), 3–13 (2000)
17. Simmhan, Y.L., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Rec. 34(3), 31–36 (2005)