Towards Interoperable Provenance Publication on the Linked Data Web

Jun Zhao
Department of Zoology, University of Oxford
South Parks Road, Oxford OX1 3PS, United Kingdom
jun.zhao@zoo.ox.ac.uk

Olaf Hartig
Institut für Informatik, Humboldt-Universität zu Berlin
Unter den Linden 6, 10099 Berlin, Germany
hartig@informatik.hu-berlin.de

ABSTRACT
Provenance provides vital information for evaluating quality and trustworthiness of information on the Web. To achieve this we must have access to semantically interchangeable provenance information and an agreement on where and how this information is to be located. The ongoing W3C Provenance Working Group provides a promise towards addressing these problems. In this position paper, we provide an overview of how the upcoming standards and the existing vocabularies and publication approaches could fit together so that we achieve an optimal interoperability now and in the near future. Because the standardization is an ongoing effort, any analysis results presented in this paper are positional and are aimed at communicating the latest development of the working group to the community.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: General

General Terms
Linked Data, Interoperability

Keywords
Provenance, Linked Data, Semantic Web, RDF

Copyright is held by the author/owner(s). LDOW2012, April 16, 2012, Lyon, France.

1. INTRODUCTION
Provenance information about a resource provides information about its origin, such as who created it, when it was modified, or how it was created. It has been widely accepted that this kind of information is vital for evaluating quality and trustworthiness of information on the Web [5, 6]. Interoperability of provenance information is essential for creating a trustworthy Web of Data. Given the nature of distributed data publication and access on the Linked Data Web, provenance information about data can be published by any party, according to any provenance vocabulary or publication approach. To evaluate the quality of data on the Web, applications must be able to access information through different channels and make sense of the diverse information described using languages of varied semantics. The ongoing standardization effort from the W3C Provenance Working Group provides a family of standards to address this problem. However, before these standards are eventually published and universally adopted, we must understand them in the context of existing provenance vocabularies and publication approaches in order to achieve the optimal interoperability now and in the near future.

There has been a great deal of interest in providing provenance-related vocabularies, a summary of which can be found in the group report of the former W3C Provenance Incubator Group [10]. For this position paper we chose two of these vocabularies and compare their semantic interoperability with the PROV-O ontology [2], which is being standardized by the working group. The two chosen vocabularies are the OPMV (Open Provenance Model Vocabulary) [12], a lightweight implementation of the community Open Provenance Model [8], and the Provenance Vocabulary [6], another lightweight vocabulary targeted at Linked Data use cases. These two vocabularies were chosen because: 1) both of them were created with the needs of Semantic Web users in mind; 2) they were designed to cover a similar scope of motivating use cases as PROV-O; and 3) they share a largely similar modeling pattern with PROV-O.

Interoperability of provenance data requires not only an agreement on how provenance is represented but also a shared understanding about “what” is described. Researchers from the provenance community emphasize that provenance should provide a precise history of what happened that has led to the particular state of an object [8]. The state of an object can be characterised by a set of its attribute values. Resources on the Web are dynamic in nature and their attribute values can change at a volatile rate. The definition of the state of an entity should be driven by actual context, and it is hard to reach a universal agreement. For example, an Ajax web page reporting the weather forecast for London can be updated regularly with its latest forecast data. Over this time the state of this web page can be regarded as fixed because its key features are not changed: it is at the same URL and is always about London weather. It is sufficient to track who created this document without referring to the document at any specific time instant. However, in another context, changes to the forecast value could be regarded as a change to the state of the web page. Its provenance must then include information about when the Ajax page was updated, how it was updated, and so on.

If the definition of the state of an entity does not match the needs at hand, then we will not have access to sufficient provenance to recreate its historical record. For example, if a new state is not defined when the forecast data was updated then we cannot know how the document was updated with this data. Provenance is less “precise” in this context, even though its precision is sufficient for other contexts, e.g. knowing the creator of the document. Without an awareness of the co-existence of this “precise” vs. “imprecise” provenance information on the Web, provenance data consumers could misinterpret the semantics of this information and make incorrect judgements. Hence, our analysis also highlights how the three vocabularies allow users to express provenance in a “state-ful” and “state-less” manner.

Another question that must be addressed towards achieving interoperable provenance on the Web is how to make this information accessible on the Web. Hartig and Zhao [6] have analyzed different possible ways of publishing provenance information onto the Web. But how can this information be discovered in the first place? The Provenance Access and Query (PAQ) working draft [9] from the W3C Provenance Working Group proposes a set of best practices for making provenance information discoverable. The second part of the position paper presents some recommended ways of publishing provenance information according to this specification in order to achieve interoperable provenance access on the Web.

Because these working drafts from the provenance working group are still work in progress, this position paper only provides an analysis as per the state of the art. This is not an advocacy of the working group deliverables, but rather a communication of the latest developments of the working group by positioning them in the context of existing work.

2. TERMINOLOGIES
Provenance-related terminologies are very diverse; for example, each of the three selected provenance vocabularies uses different terminology for modeling and describing provenance. To remove ambiguities this paper uses the set of terms introduced in the latest PROV Model Primer [3] and the PAQ working draft [9] released by the W3C provenance working group. The definitions and semantics of these terms are still subject to change, and we use them as they were available at the time of writing.

• Entities are the things “that one may ask the provenance of” [3].

• Activities are “how entities come into existence and how their attributes change” [3] in a way that leads to the existence of a new entity.

• Agents are entities that take “an active role in an activity” by taking “some degree of responsibility” in that activity [3].

• Resources refer to “whatever might be identified by a URI” as described by the Architecture of the World Wide Web [11].

3. THE PROVENANCE VOCABULARIES
Provenance vocabularies/ontologies provide the building blocks for describing provenance information on the Semantic Web. To achieve interoperable provenance descriptions we must understand the semantic interoperability of these building blocks. Previously the W3C Provenance Incubator Group conducted a thorough survey of the state-of-the-art provenance vocabularies and a mapping between them [10]. To align the chosen vocabularies, this survey used a list of terms from the Open Provenance Model (OPM) [8], a community provenance model. The analysis showed that there is a considerable correspondence among the vocabularies along the core concepts of agents, entities, and activities. It also identified some gaps in OPM for representing things like versions, containment between entities, etc.

For this position paper we picked two of these vocabularies, OPMV (Open Provenance Model Vocabulary) [12] and the Provenance Vocabulary [6], to compare their similarity with the PROV-O ontology that is being proposed and standardized by the W3C Provenance Working Group. Our analysis shows that the three vocabularies employ a common pattern for describing provenance, but have different perceptions with respect to the entities whose provenance is being described.

3.1 Describing Provenance
The W3C PROV Model Primer [3] points out that provenance could be viewed from three different perspectives:

• Agent-oriented provenance focuses on information describing the entities “involved in generating or manipulating the information in question”.

• Object-oriented provenance focuses on tracing the entities contributing to the existence of another entity.

• Process-oriented provenance focuses on tracking the “actions and steps taken to generate” an entity whose provenance information is being described.

Together, through these three perspectives, we capture the ‘who’, ‘what’, ‘when’ and ‘how’ information, as shown in Figure 1. This pattern of using the three core concepts of agent, entity and activity is repeatedly applied in the three selected provenance vocabularies, i.e. PROV-O, OPMV, and the Provenance Vocabulary. This forms a so-called process-centric modeling pattern, i.e. an activity class is always introduced to describe the creation or modification of an entity. A relationship between an entity and an agent must be stated by explicitly describing the activity in which the agent is involved and which leads to a modification of the entity. There is an exception for relationships between entities, which can be stated directly without having to introduce an activity. This is sometimes regarded as a shortcut or as a data-centric view on top of the process-centric logs. In other provenance-related vocabularies, such as Dublin Core, such a process-centric pattern is not employed. Any statement can be directly associated with an object (be it an entity or an agent) without having to make explicit the activities involved in its creation.

Figure 1: Describing provenance information from three perspectives.

Table 1: Definitions of agents and activities/processes in PROV-O and OPMV.

         | PROV-O | OPMV
Agent    | a type of entity that “takes an active role in an activity” by taking “some degree of responsibility” in that activity | a contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, or affecting its execution
Activity | “how entities come into existence and how their attributes change” in a way that leads to the existence of a new entity | an action or series of actions performed on or caused by artifacts, and resulting in new artifacts

Table 2: Properties for describing the provenance of an entity.
Descriptions of key properties | PROV-O | OPMV | The Provenance Vocabulary
represents the active involvement of an agent in modifying the characteristics of the instance of an activity | wasAssociatedWith | wasControlledBy | performedBy/accessedService
expresses that an entity was used or consumed during an activity | used | used | usedData/usedGuideline
expresses that an entity was generated or created by an activity | wasGeneratedBy | wasGeneratedBy | retrievedBy/createdBy
expresses that the existence of one entity is (at least partly) due to another entity | wasDerivedFrom | wasDerivedFrom | -

A further analysis shows that the three vocabularies also share very similar semantics for their definitions of agents, activities, and related properties. With its latest revision the Provenance Vocabulary even positions itself as a specialization of PROV-O¹. Tables 1 and 2 summarize correspondences of related concepts and properties from the three vocabularies. Apart from these commonalities, the vocabularies show a key difference in their notion of provenance entities, which directly impacts the expression of “precise” and “imprecise” provenance using these vocabularies.

¹ http://purl.org/net/provenance/ns-20120314

3.2 State-ful vs. State-less Provenance
Provenance metadata is expected to provide a faithful historical record of what happened. The metadata itself should be immutable and the entities whose provenance is being described should be persistent to a particular state. The state of an object can be characterised by a set of its attribute values. If the attributes characterising a “state-ful” entity change, the result should be regarded as a new entity.

However, the attributes that characterise a resource are subject to the context under which provenance is generated, and the application for which provenance is collected. For example, Listing 1 uses the URI <http://example.org/forecast/london> to identify the daily weather forecast for London. For applications that are interested in understanding who provides this forecast, even though the forecast data is updated day by day, this URI is regarded as identifying the same entity. It is a “state-less” entity whose state, i.e. being accessible via a specific URI, remains unchanged over time. However, for applications that need to understand how the forecast data was generated every day, the forecast data of each day needs to be treated as a different entity. From the example in Listing 1, applications are unable to access historical information that records exactly what happened every day. To fix this, we need to refer to a “state-ful” entity that represents the forecast of each particular day.

    @prefix prov: <http://www.w3.org/ns/prov-o/> .
    @prefix ex2:  <http://example.org/2> .
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    # provenance of London forecast on two different days

    <http://example.org/forecast/london>
        ex2:degree "-6"^^xsd:integer ;
        prov:wasAttributedTo <http://bbc.co.uk> ;
        prov:wasGeneratedBy [
            rdf:type prov:Activity ;
            prov:used <http://satellite_a> ;
            prov:startedAtTime "2012-02-06T00:00:00"^^xsd:dateTime ] .

    <http://example.org/forecast/london>
        ex2:degree "0"^^xsd:integer ;
        prov:wasAttributedTo <http://bbc.co.uk> ;
        prov:wasGeneratedBy [
            rdf:type prov:Activity ;
            prov:used <http://satellite_b> ;
            prov:startedAtTime "2012-02-07T00:00:00"^^xsd:dateTime ] .

Listing 1: Express provenance of the state-less London forecast entity using PROV-O.

Defining clear-cut states for resources on the Web is a challenging task, due to the varied interpretations and contexts under which the data were published. As a standard for the Semantic Web community, PROV-O therefore allows the expression of provenance in both a state-ful and a state-less manner, in order to provide a practical solution for a wider range of users in the community. OPMV and the Provenance Vocabulary, however, emphasize more explicitly the immutable nature of entities or artifacts. In OPMV, an Artifact is a general concept that represents an immutable piece of state, and it is impossible to express the provenance metadata in Listing 1 using this concept. The Provenance Vocabulary extends PROV-O by introducing a concept prv:Immutable, which allows users to explicitly mark the immutable nature of an entity at a particular state. Using this concept, Listing 2 rewrites the provenance of the London forecast data by regarding each daily forecast as a state-ful entity. Two separate URIs are created to identify the London forecast from two separate days in order to provide a static record for each entity.

    @prefix prov: <http://www.w3.org/ns/prov-o/> .
    @prefix prv:  <http://purl.org/net/provenance/ns#> .
    @prefix ex2:  <http://example.org/2> .
    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    # provenance of London forecast on Feb. 6, 2012

    <http://example.org/forecast_0602>
        ex2:degree "-6"^^xsd:integer ;
        prov:wasAttributedTo <http://bbc.co.uk> ;
        rdf:type prv:Immutable , prv:DataItem ;
        prv:retrievedBy [
            rdf:type prv:DataAccess ;
            prv:accessedResource <http://example.org/id/forecast_0602> ;
            prv:completedAt "2012-02-06T00:00:00"^^xsd:dateTime ] ;
        prv:createdBy [
            rdf:type prv:DataCreation ;
            prv:usedData <http://satellite_a> ;
            prv:completedAt "2012-02-06T00:00:00"^^xsd:dateTime ] .

    # provenance of London forecast on Feb. 7, 2012

    <http://example.org/forecast_0702>
        ex2:degree "0"^^xsd:integer ;
        prov:wasAttributedTo <http://bbc.co.uk> ;
        rdf:type prv:Immutable , prv:DataItem ;
        prv:retrievedBy [
            rdf:type prv:DataAccess ;
            prv:accessedResource <http://example.org/id/forecast_0702> ;
            prv:completedAt "2012-02-07T00:00:00"^^xsd:dateTime ] ;
        prv:createdBy [
            rdf:type prv:DataCreation ;
            prv:usedData <http://satellite_b> ;
            prv:completedAt "2012-02-07T00:00:00"^^xsd:dateTime ] .

Listing 2: Express provenance of state-ful London forecast using the Provenance Vocabulary.

This subtlety must be considered when publishing provenance information for resources on the Web. Provenance for “state-ful” and provenance for “state-less” entities are not two distinct types of provenance. They are simply historical statements collected in different contexts, under different conditions. Whether a resource is state-ful or state-less is entirely relative. What is needed is an interoperable way to refer to these static, state-ful entities, such as the forecast of each individual day, and their dynamic counterpart (i.e. the daily forecast data as a general concept), in order to retrieve their provenance information.

4. PROVENANCE PUBLICATION FOR LINKED DATA RESOURCES
To make provenance information accessible on the Linked Data Web in an interoperable way we must have an agreement on how provenance is made available (e.g. embedded in an RDF graph or retrievable via links), and where to look for this provenance information.

Hartig and Zhao [6] propose several choices on where to make provenance available for Linked Data, such as including provenance information in the voiD (Vocabulary of Interlinked Datasets) [1] description about a linked dataset, or in the RDF graph that is served in response to an HTTP GET operation. All these proposed ways are embedding approaches. Although locating provenance information is easy in these cases, embedding can introduce a performance problem if the number of provenance triples is large or even outnumbers the actual triples that describe the resource itself. We should have an alternative choice that allows us to link resources to provenance descriptions through a URI identifying these descriptions. Such a URI is called a provenance URI in the PAQ document [9].

The PAQ working draft [9] from the provenance working group aims to specify best practices for enabling provenance information to be located in an agreed way. It recommends at least two ways to link provenance descriptions with entities: one is to use an HTTP header to indicate the provenance URI, and the other is to use pre-defined properties to express links to provenance URIs in RDF.

The following snippet shows how to indicate provenance information of a specific entity using the HTTP Link header field. The Link header field can be included in the HTTP response to a GET or HEAD operation [9]. This approach is very convenient in the Linked Data context where the “follow-your-nose” approach is widely appreciated and adopted. In an HTTP response, several provenance link header fields could be included, so that a data publisher may indicate provenance information for each separate entity URI.

    Link: provenance-URI; rel="provenance"; anchor="entity-URI"

Some existing work, like Memento [4] and duri [7], has proposed solutions for navigating between a dynamic web resource and different versions of this resource. The PAQ document proposes the use of a property like ex1:hasAnchor² to link a web resource URI with the entity URIs that represent a particular state of that dynamic web resource. As illustrated in Listing 3, we use ex1:hasAnchor to link the dynamic resource (<http://example.org/forecast/london>) to two URIs, each of which represents the London forecast taken on a specific day. These entity URIs can then be used to provide provenance information for a particular version of a state-less resource, as previously shown in our example in Listing 2.

² Note that the namespace of these properties was not yet defined at the time of writing. This is scheduled to be finalized in accordance with the PROV-O ontology.

    @prefix ex1: <http://example.org/t.b.d.> .

    <http://example.org/forecast/london>
        ex1:hasAnchor <http://example.org/forecast_0602> ,
                      <http://example.org/forecast_0702> ;
        ex1:hasProvenance <http://example.org/forecast_0602/prvnc> ,
                          <http://example.org/forecast_0702/prvnc> .

    ## Retrieve provenance of each state-ful entity

    C: GET /forecast_0602/prvnc HTTP/1.1
    C: Host: example.org
    C: Accept: text/turtle

    S: HTTP/1.1 200 OK
    S:
    S: <http://example.org/forecast_0602>
    S:     prv:createdBy [
    S:         rdf:type prv:DataCreation ;
    S:         prv:completedAt "2012-02-06T00:00:00"^^xsd:dateTime ] .

    C: GET /forecast_0702/prvnc HTTP/1.1
    C: Host: example.org
    C: Accept: text/turtle

    S: HTTP/1.1 200 OK
    S:
    S: <http://example.org/forecast_0702>
    S:     prv:createdBy [
    S:         rdf:type prv:DataCreation ;
    S:         prv:completedAt "2012-02-07T00:00:00"^^xsd:dateTime ] .

Listing 3: Linking a state-less resource to state-ful entities and their provenance.

All the approaches presented so far are targeted at data owners who will publish provenance along with their data. Provenance information about data can also be published by third parties. The PAQ document also includes some more complex mechanisms to achieve this, which are not covered here but can be found in the PAQ document.

5. CONCLUSIONS AND DISCUSSION
Making interoperable provenance information accessible on the Web is crucial towards achieving a trustworthy web of data/documents. To achieve this we require a language that allows us to interchange provenance information represented using different languages and a mechanism to discover and access this metadata unambiguously. The family of standards from the W3C Provenance Working Group is currently geared towards these goals, and our analysis of the interoperability between two widely accepted provenance vocabularies and PROV-O has produced a promising result.

What is not described here is that PROV-O also provides constructs for expressing some more complicated provenance patterns, such as describing additional attributes of relationships between entities and activities. For example, it can explicitly express recipes used by an activity to generate an entity in a reification kind of pattern.

Deciding “what” is described in provenance information, be it state-ful or state-less, is another longstanding issue in achieving an interoperable understanding of this information. Provenance vocabularies largely enforce a strong state-ful mindset; if the attributes of an entity change, it becomes a new, different entity. However, in an open world such as the Web, provenance information is generated and published for applications of varied purposes, from varied perspectives. The representation of a web resource may change over time, for example, the daily forecast of London weather, and it might continually be regarded as the same entity, regardless of its change of “state”. If the states of an entity are defined in a very fine-grained manner, e.g. an hourly state for the forecast page, we will have more detailed, or “precise”, provenance information. However, too fine-grained a distinction between the states of an entity might be impractical and lead to overwhelming provenance data. The trade-off should be considered based on actual context and needs.

OPMV only allows more state-ful provenance statements and the Provenance Vocabulary explicitly defines immutable entities, to encourage the publication of more precise provenance. PROV-O provides a relaxed definition of an entity, permitting expression of provenance in both a state-ful and a state-less manner, which can hopefully address these subtle differences as a bridging vocabulary.

6. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets. In Proceedings of the Linked Data on the Web Workshop (LDOW) at WWW, 2009.
[2] K. Belhajjame, J. Cheney, D. Garijo, S. Soiland-Reyes, S. Zednik, and J. Zhao. The PROV Ontology: Model and Formal Semantics. Technical report, 2011. http://www.w3.org/TR/2011/WD-prov-o-20111213/, Accessed on February 14, 2012.
[3] K. Belhajjame, H. Deus, D. Garijo, G. Klyne, P. Missier, S. Soiland-Reyes, and S. Zednik. PROV Model Primer. Technical report, 2012. http://www.w3.org/TR/2012/WD-prov-primer-20120110/, Accessed on February 14, 2012.
[4] H. V. de Sompel, R. Sanderson, M. L. Nelson, L. Balakireva, H. Shankar, and S. Ainsworth. An http-based versioning mechanism for linked data. In Proceedings of LDOW2010, 2010.
[5] J. Golbeck. Weaving a web of trust. Science, 321(5896):1640–1641, 2008.
[6] O. Hartig and J. Zhao. Publishing and consuming provenance metadata on the web of linked data. In Proceedings of IPAW 2010, 2010.
[7] L. Masinter. The 'tdb' and 'duri' URI schemes, based on dated URIs draft-masinter-dated-uri-10. Technical report, 2012. http://tools.ietf.org/html/draft-masinter-dated-uri-10, Accessed on February 16, 2012.
[8] L. Moreau, B. Clifford, J. Freire, Y. Gil, P. Groth, J. Futrelle, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, Y. Simmhan, E. Stephan, and J. Van den Bussche. The Open Provenance Model – Core Specification (v1.1), Dec. 2009.
[9] L. Moreau, O. Hartig, Y. Simmhan, J. Myers, T. Lebo, K. Belhajjame, and S. Miles. PROV-AQ: Provenance Access and Query. Technical report, 2012. http://www.w3.org/TR/2012/WD-prov-aq-20120110/, Accessed on February 14, 2012.
[10] W3C Provenance Incubator Group. Provenance Vocabulary Mappings. Technical report, 2010. http://www.w3.org/2005/Incubator/prov/wiki/Provenance_Vocabulary_Mappings, Released on August 06, 2010.
[11] N. Walsh and I. Jacobs. Architecture of the World Wide Web, Volume One. Technical report, 2004. http://www.w3.org/TR/2004/REC-webarch-20041215/, W3C Recommendation.
[12] J. Zhao. The Open Provenance Model Vocabulary. Technical report, 2010. http://purl.org/net/opmv/ns, Accessed on March 16, 2012.
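As a supplementary illustration of the HTTP Link header mechanism discussed in Section 4, the following is a minimal sketch of how a consumer might collect provenance URIs per entity from the Link header fields of a response. It is not taken from the PAQ draft: the function names parse_provenance_link and provenance_by_entity are hypothetical, the relation name "provenance" follows the snippet quoted in Section 4 and may differ in later revisions of the draft, and the fallback to the requested URI when no anchor is given is an assumption of this sketch. The parser is deliberately simple and assumes one link per header field and no ";" inside URIs.

```python
def parse_provenance_link(value):
    """Parse one Link header field value.

    Returns (provenance_uri, anchor_or_None) if the link has
    rel="provenance", otherwise None. Hypothetical helper.
    """
    parts = [p.strip() for p in value.split(";")]
    # The link target may or may not be wrapped in angle brackets.
    target = parts[0].strip("<>")
    params = {}
    for part in parts[1:]:
        if "=" not in part:
            continue
        key, _, val = part.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    if params.get("rel") != "provenance":
        return None
    return target, params.get("anchor")


def provenance_by_entity(link_values, request_uri):
    """Map each entity URI to the provenance URIs advertised for it."""
    result = {}
    for value in link_values:
        parsed = parse_provenance_link(value)
        if parsed is None:
            continue
        prov_uri, anchor = parsed
        # Assumption of this sketch: a link without an anchor is taken
        # to describe the requested resource itself.
        entity = anchor or request_uri
        result.setdefault(entity, []).append(prov_uri)
    return result


# Example header values, reusing the URIs from Listing 3.
headers = [
    'http://example.org/forecast_0602/prvnc; rel="provenance"; '
    'anchor="http://example.org/forecast_0602"',
    '<http://example.org/forecast_0702/prvnc>; rel="provenance"; '
    'anchor="http://example.org/forecast_0702"',
    '<http://example.org/style.css>; rel="stylesheet"',
]
links = provenance_by_entity(headers, "http://example.org/forecast/london")
for entity, prov_uris in links.items():
    print(entity, "->", prov_uris)
```

Keeping the result keyed by entity URI mirrors the point made in Section 4 that a single response may carry several provenance links, one for each state-ful entity behind a state-less resource.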