<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detailed Provenance Capture of Data Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Dimou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Mannens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IDLab, Department of Electronics and Information Systems</institution>
          ,
          <addr-line>Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
<p>A large part of Linked Data generation entails processing the raw data. However, this process is only documented in human-readable form or as a software repository. This inhibits reproducibility and comparability, as current documentation solutions do not provide detailed metadata and rely on the availability of specific software environments. This paper proposes an automatic capturing mechanism for interchangeable and implementation-independent metadata and provenance that includes data processing. Using declarative mapping documents to describe the computational experiment allows automatic capturing of term-level provenance for both schema and data transformations, and for both the used software tools and the input-output pairs of the data processing executions. This approach is applied to mapping documents described using rml and fno, and implemented in the rmlmapper. The captured metadata can be used to more easily share, reproduce, and compare the dataset generation process across software environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Computational Experiment</kwd>
        <kwd>Data Processing</kwd>
        <kwd>FnO</kwd>
        <kwd>Provenance</kwd>
        <kwd>RML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Reproducibility is improved by explicit description of data processing and analysis [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A large part of Linked Data generation tasks entails processing data to generate new data; detailed metadata of these generation tasks is thus of great importance. The ten newly introduced datasets of iswc 2016's Resource Track (http://iswc2016.semanticweb.org/pages/program/accepted-papers.html) were generated using some sort of data processing, e.g., parsing raw data, interlinking existing datasets, or performing Natural Language Processing (nlp). One of the most widely known examples is dbpedia, where Wikitext is processed to generate the dbpedia dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        However, the description of how datasets are generated is mostly available as a scientific paper, e.g., [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or as a software repository, e.g., https://github.com/dbpedia/extraction-framework. This inhibits reproducibility and comparability, as these documentation solutions do not provide detailed machine-interpretable metadata describing the dataset generation process. Reproducing or comparing a generated dataset therefore demands manual intervention and specific software and hardware environments, if it is possible at all. (Virtualization tools such as Docker, https://www.docker.com/, do abstract certain software environment requirements; however, they still rely on the public availability of all needed software tools.) Explicit description and provenance of the generation task provides important insights [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], even when these software or hardware dependencies are no longer available.
      </p>
      <p>
        In this paper, we propose an automatic capturing mechanism for detailed metadata and provenance information. This enables reproducibility and comparability of data processing, without relying on the availability of specific software and hardware environments, and without restricting the complexity of the data processing. After providing a background on provenance in Section 2, we present the relevant provenance types and the underlying model using the prov Ontology (prov-o) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in Section 3. In Section 4, we show how using declarative statements to describe the computational experiment allows us to automatically capture term-level provenance. We apply our model to the Function Ontology (fno) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which, in its turn, is aligned with the rdf Mapping Language (rml) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our proposed approach is implemented in the rml and fno tool chain, namely, the rmlmapper and FunctionProcessor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and used to generate metadata based on the generation of a sample dbpedia dataset. We conclude in Section 5.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Automatic Metadata for Linked Data Generation and Publishing</title>
      <p>In this paper, we propose automatically capturing machine-interpretable and detailed metadata concerning the data processing of a dataset generation, to improve reproducibility. In Section 2.1, we introduce metadata formats that enable reproducibility, and in Section 2.2, we discuss existing work that automatically captures machine-interpretable metadata, without considering data processing.</p>
      <sec id="sec-3-1">
        <title>Provenance</title>
        <p>
          Provenance can be considered information describing the materials and transformations applied to derive the data, and the processes that enabled their creation [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. It has several applications [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], namely to assess data quality, trace the audit trail, aid in describing replication recipes, establish attribution, and be informational, i.e., provide context. As such, providing the provenance of a data processing alongside the generated dataset can improve general reproducibility.
        </p>
        <p>
          Providing this provenance as Linked Data has advantages, as its distributed nature allows us (i) to publish provenance separately from the actually published dataset, and (ii) to easily interlink different provenance dimensions without tight coupling. To improve interoperability, we apply commonly used Linked Data vocabularies to describe the provenance. Provenance vocabularies already exist, namely, the prov Ontology (prov-o) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], a w3c recommendation to represent and interchange provenance generated in different systems and under different contexts. Describing the generation process using provenance modeled in prov-o thus allows us to generate machine-interpretable and interoperable metadata.
        </p>
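        <p>To make these prov-o terms concrete, a dataset generation boils down to a few typed resources and the relations between them. The following minimal sketch, with hypothetical resource names (ex:dataset, ex:generation, ex:tool), models such a description as plain (subject, predicate, object) tuples rather than with an RDF library:</p>

```python
# Minimal sketch of a PROV-O description of a dataset generation.
# All resource names are hypothetical; a real system would use an RDF library.
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"

provenance = {
    (EX + "dataset",    "rdf:type",                 PROV + "Entity"),
    (EX + "generation", "rdf:type",                 PROV + "Activity"),
    (EX + "tool",       "rdf:type",                 PROV + "Agent"),
    (EX + "dataset",    PROV + "wasGeneratedBy",    EX + "generation"),
    (EX + "generation", PROV + "wasAssociatedWith", EX + "tool"),
}

# Any PROV-aware consumer can answer "which activity generated this
# dataset?" without knowing anything about the generating system:
generated_by = [o for (s, p, o) in provenance
                if s == EX + "dataset" and p == PROV + "wasGeneratedBy"]
print(generated_by)
```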
      </sec>
      <sec id="sec-3-2">
        <title>Automatic Capture of Provenance</title>
        <p>
          Provenance capture mechanisms fall into three main classes: workflow-, process-, and operating-system-based (os) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Workflow-based mechanisms are attached to a workflow system, process-based mechanisms require each involved service or process to document itself, and os-based mechanisms rely on the availability of specific functionalities at the os level, without modifications to existing scripts or programs [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Considering a data generation process as a single step within a workflow, and aiming to provide an implementation-independent solution, thus an os-independent solution, the provenance of a dataset generation process is best captured using a process-based mechanism. As such, it is complementary to workflow capturing mechanisms such as implemented in the Pegasus Workflow Management System [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] or ontologies that describe workflows such as p-plan [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and os capturing mechanisms such as implemented in panda [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        </p>
        <p>
          Related work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] automatically captures metadata and provenance information decoupled from the implementation by relying on declarative descriptions, both for the mapping rules that specify how to generate the Linked Data in rdf, and for the raw data access interfaces. When generating Linked Data, a separate provenance dataset is generated that includes the contributing schema transformations and data sources. Different detail levels have been identified to capture metadata and provenance [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: on the dataset, named graph, partitioned dataset, triple, and term level. Furthermore, multiple ways of adding provenance information to the declarative descriptions using prov-o have been identified [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: using explicit graphs, implicit graphs, singleton properties [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], or reification. Implicit graphs and reification have the advantage that they do not influence the generated rdf data, whilst explicit graphs are not supported by all rdf serializations, and singleton properties require changing the schema-level transformations [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. However, the aforementioned work does not include data processing, i.e., it only takes raw extracted data values into account. Meanwhile, most generated datasets involve specific data processing, and its provenance is an essential part of the generated dataset.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Metadata and Provenance Capture for Data Transformations</title>
      <p>We first state the different dimensions to take into account when capturing the metadata and provenance for data processing in Section 3.1, after which we propose our model in Section 3.2.</p>
      <sec id="sec-5-1">
        <title>Metadata and Provenance Dimensions</title>
        <p>
          Schema vs Data Transformations Dataset generation depends on both schema and data transformations [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Schema transformations involve (re-)modeling the data, describing how objects are related, and deciding which vocabularies and ontologies to use [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Data transformations are needed to support any changes in the structure, representation, or content of data [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. However, instead of coupling these transformations, aligning them allows them to be executed separately as well as combined [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Namely, aligning instead of coupling the provenance of these transformations allows us to reproduce data transformations without needing to reproduce the schema transformations, and vice versa. Existing work has mostly focused on capturing schema transformations' metadata.
        </p>
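        <p>The benefit of aligning rather than coupling the two transformation types can be sketched in a few lines; the function and record structure below are illustrative, not the paper's implementation:</p>

```python
# A data transformation as a self-contained function (illustrative):
def to_title_case(value: str) -> str:
    return value.title()

# Reproduced separately, without any schema transformation:
assert to_title_case("ben de meester") == "Ben De Meester"

# Or combined: a schema transformation that maps a raw record to a triple
# and calls the data transformation on the object value.
def map_person(record: dict) -> list:
    subject = "http://example.org/" + record["id"]
    return [(subject, "foaf:name", to_title_case(record["name"]))]

print(map_person({"id": "Ben", "name": "ben de meester"}))
```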
        <p>
          Interaction vs Actor The execution of a generation process involves different actors, namely, the processor executing the generation process and the different processes that perform the data transformations. The relation between these actors is a client-service relation, i.e., the generation process (client) calls the different data transformations (service). Two kinds of provenance are generated by these actors [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]: (i) interaction provenance, which describes the input and output parameters of each execution, generated and confirmed by both client and service actor, and (ii) actor provenance, which is metadata about the actor's own state during an execution (e.g., implementation details or hardware configuration) and is not verifiable by the other actors. Both kinds of provenance are complementary, i.e., interaction provenance can be used to compare results without relying on the implementation, whilst actor provenance can be used to reproduce performance measurements.
        </p>
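        <p>Sketched as one record per execution (the field names are illustrative, not a fixed schema), the two kinds of provenance, and the fact that only interaction provenance is needed for cross-implementation comparison, look as follows:</p>

```python
# One execution record carrying both kinds of provenance (illustrative).
execution = {
    # Interaction provenance: confirmed by both client and service,
    # and comparable across implementations.
    "interaction": {"function": "grel:toTitleCase",
                    "input": "ben de meester",
                    "output": "Ben De Meester"},
    # Actor provenance: the service's own state; not verifiable by the
    # client, but needed to reproduce e.g. performance measurements.
    "actor": {"tool": ":grelJavaImpl", "release": "0.1.0"},  # hypothetical release
}

def same_behaviour(exec_a: dict, exec_b: dict) -> bool:
    """Compare two executions using interaction provenance only."""
    return exec_a["interaction"] == exec_b["interaction"]

# A different tool with the same input-output behaviour still matches:
other = dict(execution, actor={"tool": ":otherTool", "release": "2.3"})
print(same_behaviour(execution, other))
```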
      </sec>
      <sec id="sec-5-2">
        <title>Metadata and Provenance Model</title>
        <p>
          Provenance can be captured on different levels in the generation process [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. It is most relevant on term level, as only then does it unambiguously define which data transformation contributed to which value. For instance, dbpedia data involves parsing, e.g., date values from infoboxes in Wikipedia. In particular, the provenance of how such a date value was parsed is important for the dbpedia generation provenance, and it is currently neither measured nor published [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. When capturing term-level provenance, data transformations are decoupled from the schema transformations, i.e., the captured metadata and provenance is defined on value level and does not rely on the relationships between resources or the used vocabularies and ontologies. In addition to the existing domains to capture metadata and provenance (i.e., mapping rules definition and data sources retrieval), we introduce another domain: the processing domain. This covers the data transformations, complementary to the schema transformations covered by the mapping rules.
        </p>
        <p>[Figure 1: the provenance model. The Data Transformation (an Activity) wasInformedBy the schema transformation and used an Input Entity and a Function Entity; an Output Entity wasGeneratedBy the Data Transformation and wasAttributedTo the Tool Agent, with which the Data Transformation wasAssociatedWith.]</p>
        <p>To capture both interaction and actor provenance, it is necessary to capture and align implementation-specific data (e.g., the software release, as actor provenance) with implementation-independent data (e.g., the input and output values, as interaction provenance). However, on top of the input and output values, we argue that additional implementation-independent data is needed, namely, which type of data processing was executed. This allows comparability between data generation processes across tools: when two different tools implement the same type of data processing function, the input and output values of the data generation process with the first tool can be compared with those of the second tool. For the remainder of the paper, we make the distinction between function (i.e., the implementation-independent description, cf. interaction provenance) and tool (i.e., the implementation-specific description, cf. actor provenance).</p>
        <p>
          Our model (Figure 1) is mapped to prov-o [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We distinguish schema and data transformations, provide actor and interaction provenance, and include both function and tool. Schema and data transformations are a prov:Activity, where the latter is informed by (prov:wasInformedBy) the schema transformation. Input, function, and output are a prov:Entity, and the tool is a prov:Agent. The data transformation uses (prov:used) the input (the model puts no restriction on the number of inputs, but only one is visualized for clarity) and the function, and the output is generated by (prov:wasGeneratedBy) the data transformation. The data transformation is associated with (prov:wasAssociatedWith) the tool, thus we can derive that the output is attributed to (prov:wasAttributedTo) the tool. The relation between the function and the tool is an association (prov:qualifiedAssociation).
        </p>
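        <p>The derivation of prov:wasAttributedTo can be sketched as a rule over plain triples; the resource names below are illustrative:</p>

```python
# The model of Section 3.2 as plain triples, and the derivation rule:
# X wasGeneratedBy A  and  A wasAssociatedWith T  =>  X wasAttributedTo T.
triples = {
    ("ex:dataTransformation", "prov:wasInformedBy", "ex:schemaTransformation"),
    ("ex:dataTransformation", "prov:used", "ex:input"),
    ("ex:dataTransformation", "prov:used", "ex:function"),
    ("ex:output", "prov:wasGeneratedBy", "ex:dataTransformation"),
    ("ex:dataTransformation", "prov:wasAssociatedWith", "ex:tool"),
}

derived = {
    (entity, "prov:wasAttributedTo", agent)
    for (entity, p1, activity) in triples if p1 == "prov:wasGeneratedBy"
    for (activity2, p2, agent) in triples
    if p2 == "prov:wasAssociatedWith" and activity2 == activity
}
print(derived)
```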
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Application</title>
      <p>Capturing metadata and provenance within the dataset generation process, specifically when including data transformations, requires term-level capturing mechanisms. Instead of providing a tool-specific solution, i.e., changing a specific system, our approach captures metadata and provenance based on machine-interpretable descriptions of the dataset generation process. This way, the approach is independent of the actual implementation. Moreover, these mapping descriptions can be automatically analyzed or even generated.</p>
      <p>
        As an exemplary case, we consider the rdf Mapping Language (rml) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to provide the machine-interpretable mapping rules for the schema transformations, and the Function Ontology (fno) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to describe the data transformations. An alignment between rml and fno is presented in previous work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. rml is considered because it is the only language that allows uniformly defining the mapping rules over heterogeneous data sources, and thus covers the most dataset generation use cases. fno is considered as it allows implementation-independent description of functions, their input parameters, and their returned results.
1  grel:toTitleCase a fno:Function, prov:Entity, prov:Plan ;
2    fno:name "title case" ;
3    fno:expects ( [ fno:predicate grel:stringInput ] ) ;
4    fno:output ( [ fno:predicate grel:stringOutput ] ) .
5
6  :exe a fno:Execution, prov:Activity ;         # Data Transformation
7    prov:wasInformedBy :RDFdataset_Generation ; # Schema Transformation
8    :implementation :grelJavaImpl ;             # Tool (Agent)
9    fno:executes grel:toTitleCase ;             # Function (Entity)
10   grel:stringInput :input ;                   # 'prov:used'
11   grel:stringOutput :output .                 # 'prov:wasGeneratedBy'
12
13 :input a prov:Entity ; rdf:value "ben de meester" .  # Input string
14 :output a prov:Entity ; rdf:value "Ben De Meester" . # Transformed to title case
Listing 1: Function descriptions and Executions using fno, mapped to prov-o.
      </p>
      <p>
        Existing work that allows capturing metadata and provenance for dataset generation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is extended to include data transformations. On top of relying on the declarative descriptions of (i) the mapping rules and (ii) the raw data access interfaces, we include the declarative descriptions of (iii) the data transformations. For all detail levels except the term level, it suffices to include which data transformation functions have been used, using their fno description.
      </p>
      <p>
        For the term detail level, we extend existing work to include the model presented in Section 3.2, relying on the mapping rules described in rml to trigger the execution of a function described in fno. As an example, we present a simple mapping process that maps person names and requires a title-case data transformation. (More advanced mapping processes, with more complicated data processing, e.g., nlp, or complex algorithms to generate data values or resources, are similar, as the provenance model makes no assumptions on the execution complexity.) In detail: a data source (:Source a prov:Entity) is mapped using a mapping process (:RDFdataset_Generation a prov:Activity). This mapping process executes the schema transformations. When generating the triple :Ben foaf:name "Ben De Meester", a data transformation needs to be executed on the object; thus, a mapping rule triggers the execution of a function. The description of the grel:toTitleCase function is given in Listing 1, lines 1–4. When executing this function, the actual implementation (:grelJavaImpl a prov:Agent) needs to be retrieved. The execution of that function with specific input data, together with the prov types for clarification, is given in Listing 1, lines 6–14, using the fno statements [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For capturing the provenance of literal values, we need an intermediate resource and an rdf:value relation to attach additional metadata. This is not needed in the generated dataset.
      </p>
      <p>Based on this fno execution description, additional prov-o statements, such as :exe prov:wasInformedBy :RDFdataset_Generation as well as :output prov:wasGeneratedBy :exe, can be derived. On https://fno.io/prov/, an extended description and example is given which, due to page constraints, could not be incorporated in the paper. As such, we can capture all needed metadata and provenance using rml and fno, at an additional cost of about ten triples for every literal or resource generated. This increases the total amount of generated triples by an order of magnitude. However, the captured metadata and provenance can be published separately and does not affect the generated dataset.</p>
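      <p>How such prov-o statements follow from an fno execution description can be sketched as a small mapping. The dict below mirrors lines 6–11 of Listing 1, and the rules follow the model of Section 3.2; the field names are illustrative, not an fno API:</p>

```python
# Deriving PROV-O statements from an FnO execution description (sketch).
execution = {
    "id": ":exe",
    "fno:executes": "grel:toTitleCase",  # Function  -> prov:used
    ":implementation": ":grelJavaImpl",  # Tool      -> prov:wasAssociatedWith
    "grel:stringInput": ":input",        #           -> prov:used
    "grel:stringOutput": ":output",      #           -> prov:wasGeneratedBy
}

def derive_prov(exe: dict) -> set:
    return {
        (exe["id"], "prov:used", exe["fno:executes"]),
        (exe["id"], "prov:used", exe["grel:stringInput"]),
        (exe["grel:stringOutput"], "prov:wasGeneratedBy", exe["id"]),
        (exe["id"], "prov:wasAssociatedWith", exe[":implementation"]),
    }

print((":output", "prov:wasGeneratedBy", ":exe") in derive_prov(execution))
```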
      <p>
        Our approach was implemented in the rmlmapper, the reference implementation for both rml and its alignment with fno [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We then captured all metadata and provenance for a sample of dbpedia, with an additional description available at https://fno.io/prov/dbpedia/. Both fno and prov-o statements are available, and exemplary queries show how this approach can ease the reproducibility and comparability of a dataset generation process. As dbpedia's new generation process includes rml and fno [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we can provide detailed provenance alongside the dbpedia data: e.g., all releases of the tools used to generate the sample dbpedia dataset can be requested (i.e., all data transformation tools, including the release of the rmlprocessor). Moreover, all executions' input and output concerning the dbpedia parsing functions used to parse the Wikipedia input data into normalized rdf literals can be requested. This allows the functional evaluation of a new parsing function by comparing its output values with those of the existing parsing functions. The inclusion of prov:startedAtTime and prov:endedAtTime statements allows performance evaluation, given that the hardware context is also included and can be compared. Moreover, decoupling schema and data transformations allows us to reproduce and compare data transformations without needing to execute the schema transformations, and vice versa.
      </p>
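      <p>The functional evaluation described above can be sketched as a check over captured interaction provenance. The execution records and the function name below are illustrative; in the published dataset they are fno and prov-o triples, queried with e.g. sparql:</p>

```python
# Comparing two tools through their captured input-output pairs (sketch).
captured = [
    {"function": "dbo:parseDate", "tool": ":toolA",
     "input": "4 July 1776", "output": "1776-07-04"},
    {"function": "dbo:parseDate", "tool": ":toolB",
     "input": "4 July 1776", "output": "1776-07-04"},
]

def disagreements(executions: list, function: str) -> set:
    """Inputs on which tools implementing `function` produced different outputs."""
    outputs_per_input = {}
    for e in executions:
        if e["function"] == function:
            outputs_per_input.setdefault(e["input"], set()).add(e["output"])
    return {i for i, outs in outputs_per_input.items() if len(outs) > 1}

# An empty result means the two tools behave identically on the captured inputs:
print(disagreements(captured, "dbo:parseDate"))
```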
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>Linked Data generation constitutes a large part of scientific output, and includes complex and specific data processing. The publication of this generation process, however, is not reproducible, as current documentation options do not provide machine-interpretable detailed metadata that does not rely on specific software environments. In this work, we present automatic capturing of metadata and provenance of data processing on term level, whilst separating schema and data transformations and capturing both actor and interaction provenance (i.e., applying the distinction between function and tool).</p>
      <p>As we tested our approach on the dbpedia generation, details about the used tools for every input-output pair are available, and data transformations can be analyzed decoupled from the schema transformations. Comparability is also improved, as the description of the used function, together with the input-output pair, can be used to compare different tools, even when the original tools are no longer available or accessible.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>An Ontology to Semantically Declare and Describe Functions</article-title>
          .
          <source>In: Proceedings of the 13th ESWC Satellite Events. LNCS</source>
          , vol.
          <volume>9989</volume>
          , pp.
          <volume>46</volume>
          {
          <fpage>49</fpage>
          .
          <string-name>
            <surname>Heraklion</surname>
          </string-name>
          ,
          <string-name>
            <surname>Greece</surname>
          </string-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maroy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
          </string-name>
          , E.:
          <article-title>Declarative data transformations for Linked Data generation: the case of DBpedia</article-title>
          .
          <source>In: Proceedings of the 14th International Conference, ESWC. LNCS</source>
          , vol.
          <volume>10250</volume>
          , pp.
          <volume>33</volume>
          {
          <fpage>48</fpage>
          .
          <string-name>
            <surname>Portoros</surname>
          </string-name>
          ,
          <string-name>
            <surname>Slovenia</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Deelman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vahi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juve</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rynge</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callaghan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maechling</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mayani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , Ferreira da Silva,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Livny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Wenger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Pegasus, a work ow management system for science automation</article-title>
          .
          <source>Future Generation Computer Systems 46(C)</source>
          ,
          <volume>17</volume>
          {
          <fpage>35</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Nies</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>Automated metadata generation for Linked Data generation and publishing work ows</article-title>
          .
          <source>In: Proceedings of the 9th Workshop on Linked Data on the Web</source>
          . vol.
          <volume>1593</volume>
          . CEUR, Montreal, Canada (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander</surname>
            <given-names>Sande</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          .
          <source>In: Proceedings of the 7th Workshop on Linked Data on the Web</source>
          . vol.
          <volume>1184</volume>
          . CEUR, Seoul, Korea (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dolan-Gavitt</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hodosh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hulin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leek</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whelan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Repeatable reverse engineering for the greater good with PANDA</article-title>
          .
          <source>Tech. rep.</source>
          , Columbia University (
          <year>2014</year>
          ), https://doi.org/10.7916/D8WM1C1P
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Freire</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koop</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          :
          <article-title>Provenance for computational tasks: A survey</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          ),
          <fpage>20</fpage>
          –
          <lpage>30</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gil</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>The p-plan ontology</article-title>
          .
          <source>Tech. rep.</source>
          , Ontology Engineering Group (
          <year>2014</year>
          ), http://purl.org/net/p-plan#
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Position statement: Musings on provenance, workflow and (semantic web) annotations for bioinformatics</article-title>
          .
          <source>In: Workshop on Data Derivation and Provenance</source>
          . vol.
          <volume>3</volume>
          . Chicago, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hyland</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atemezing</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villazon-Terrazas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Best Practices for Publishing Linked Data</article-title>
          . Working Group Note,
          <source>World Wide Web Consortium (W3C)</source>
          (
          <year>2014</year>
          ), https://www.w3.org/TR/ld-bp/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ioannidis</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allison</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coulibaly</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Culhane</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furlanello</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Game</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Repeatability of published microarray gene expression analyses</article-title>
          .
          <source>Nature Genetics</source>
          <volume>41</volume>
          (
          <issue>2</issue>
          ),
          <fpage>149</fpage>
          –
          <lpage>155</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lebo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahoo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belhajjame</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheney</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corsar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soiland-Reyes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zednik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>PROV-O: The PROV Ontology</article-title>
          . Recommendation,
          <source>World Wide Web Consortium (W3C)</source>
          (
          <year>2013</year>
          ), https://www.w3.org/TR/prov-o/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Kleef</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Maroy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Sustainable linked data generation: The case of DBpedia</article-title>
          .
          <source>In: Proceedings of the 16th International Semantic Web Conference</source>
          . Vienna, Austria (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodenreider</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Don't like RDF reification?: Making statements about statements using singleton property</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on World Wide Web</source>
          . pp.
          <fpage>759</fpage>
          –
          <lpage>770</lpage>
          . New York, USA (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Do</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          :
          <article-title>Data cleaning: Problems and current approaches</article-title>
          .
          <source>IEEE Data Engineering Bulletin</source>
          <volume>23</volume>
          (
          <issue>4</issue>
          ),
          <fpage>3</fpage>
          –
          <lpage>13</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Simmhan</surname>
            ,
            <given-names>Y.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plale</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gannon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A survey of data provenance in e-science</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>34</volume>
          (
          <issue>3</issue>
          ),
          <fpage>31</fpage>
          –
          <lpage>36</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>