Provenance Information in the Web of Data

Olaf Hartig
Humboldt-Universität zu Berlin
Department of Computer Science
Berlin, Germany
hartig@informatik.hu-berlin.de

Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

ABSTRACT

The openness of the Web and the ease of combining linked data from different sources create new challenges. Systems that consume linked data must evaluate the quality and trustworthiness of the data. A common approach for data quality assessment is the analysis of provenance information. For this reason, this paper discusses provenance of data on the Web and proposes a suitable provenance model. While traditional provenance research usually addresses the creation of data, our provenance model also represents data access, a dimension of provenance that is particularly relevant in the context of Web data. Based on our model we identify options to obtain provenance information and we raise open questions concerning the publication of provenance-related metadata for linked data on the Web.

Categories and Subject Descriptors

I.2.4 [Computing Methodologies]: Knowledge Representation Formalisms and Methods; H.3.3 [Information Systems]: Information Search and Retrieval

Keywords

Provenance, Lineage, Web Data, Web of Data, Linked Data

1. INTRODUCTION

Today, a large amount of RDF data is published on the Web; large datasets are interlinked; new applications emerge that utilize this data in novel and innovative ways. An upcoming challenge that has to be addressed in these applications is the evaluation of qualities of the data retrieved from the Web, qualities such as accuracy, timeliness, reliability, and trustworthiness.

A recent study shows that one of the main factors that influence the trust of users in Web content is provenance [18]. Thus, a common approach for data quality assessment is the analysis of provenance information. “Information about provenance constitutes the proof of correctness [...] and [...] determines the quality and amount of trust [...]” [30].

Provenance information about a data item is information about the history of the item, starting from its creation, including information about its origins. Tan [30] distinguishes two granularities of provenance: workflow (or coarse-grained) provenance and data (or fine-grained) provenance. Workflow provenance represents “the entire history of the derivation of the final output of” [30] a workflow. Davidson et al. [14] provide an overview of provenance in workflow systems. Data provenance, in contrast, provides a more detailed view on the derivation of single pieces of data. A particular area of research on data provenance is provenance in databases, which considers the provenance of query results. In this context, Buneman et al. [8] distinguish why- and where-provenance: why-provenance represents the origins that were involved in calculating a single entry of a query result; where-provenance refers to the exact locations an element of a query result entry has been extracted from. Green et al. [20] additionally introduce how-provenance that, in contrast to why-provenance, describes how the origins were involved in the calculation.

While a great many approaches exist that represent provenance [4, 29, 30], none of these explicitly addresses the characteristics of provenance of data from the Web. Web data provenance includes the access of data items on the Web, information that is not required in the context of self-contained systems such as a DBMS or a workflow management system.

In this paper we discuss provenance of Web data. We aim to provide a base for research on the application of provenance information to assess qualities of linked data from the Web. Our main contributions are the following:

• We propose a provenance model that captures both information about Web-based data access and information about the creation of data.

• We describe options to obtain provenance information by accessing metadata on the Web.

• We analyze vocabularies for RDF data that can be used to describe provenance information.

• We identify open questions concerning the publication of provenance-related metadata for linked data on the Web.

This paper is structured as follows. First, Section 2 reviews related work. In Section 3 we introduce our provenance model for Web data. A discussion of options to obtain provenance information is given in Section 4. Finally, Section 5 raises open questions and Section 6 concludes this paper.

2. RELATED WORK

Representing and analyzing provenance has been a topic of research for many years [4]. Simmhan et al. [29] provide a taxonomy of provenance characteristics. The authors differentiate between data-oriented approaches and process-oriented approaches. While data-oriented approaches focus on data items, process-oriented approaches emphasize information about the processes that consume and generate the data. Due to its level of abstraction, our provenance model can be used as a basis for both types of approaches as well as for hybrid approaches.

An approach to model provenance on a more detailed level is the Open Provenance Model introduced by Moreau et al. [25]. Similar to our model, the authors distinguish three types of pieces of provenance information: artifacts, processes, and agents. The Open Provenance Model represents provenance by graphs. The nodes in these graphs represent the artifacts, processes, and agents. The edges are directed and they have predefined semantics depending on the type of the adjacent nodes. For instance, an edge that connects a process with an agent means the process was controlled by the agent. Some edges can be annotated with a use case-specific role. Due to its more detailed representation, the Open Provenance Model can be used to realize the description of parts of a provenance graph that complies with our more abstract model.

Buneman et al. [7] raise several open questions for data provenance in the age of the Web. The authors identify three main issues: i) obtaining provenance information, ii) citing components of a digital library such as (components of) a document in another context, and iii) ensuring integrity of citations under the assumption that cited databases evolve. We address the first issue by discussing options to obtain provenance information in Section 4.1.

Harth et al. [21] argue for a provenance model for the Web that includes a “social dimension to associate provenance with the originator (typically a person) of a given piece of information.” Given such a model it is possible to embed provenance-based quality assessments in the social context of users. We agree with the authors’ request. With our provenance model we encourage the representation of human actors and their relation to data items.

A more technical notion of provenance is represented by Ding et al. [15], who understand the provenance of RDF data as the RDF graphs from which parts of an analyzed RDF graph have been derived. The authors argue that tracking complete RDF graphs is too coarse-grained and that a representation on the level of single RDF statements is unsuitable, too. For this reason, Ding et al. introduce RDF molecules as the finest sub-graphs that can be generated by a lossless decomposition of an RDF graph. Our provenance model represents data items on an abstract level. Thus, actual applications may use any level of granularity: RDF graphs, statements, or RDF molecules.

Hausenblas et al. [22] touch on another aspect of Web data provenance. The authors distinguish sources of Web data based on the way these sources represent RDF data. Sources may contain RDF data in a non-serialized form (e.g. in-memory, in-store) or arbitrary data in a serialized form. Sources with serialized data may be i) RDF model-compliant and standalone, ii) RDF model-compliant and embedded, or iii) non-compliant with the RDF model. Our provenance model also differentiates between the data itself and its representation in a document.

Another approach that considers provenance in the context of the Semantic Web has been developed in the Inference Web project. Da Silva et al. [12] describe a provenance infrastructure that supports “the extraction, maintenance and usage of knowledge provenance related to answers of web applications and services.” The term knowledge provenance refers to information about the origin of knowledge and about the reasoning processes used to produce answers. In [11] the authors present the Proof Markup Language to describe justifications for results of an answering engine or a reasoner. A formal definition of justifications for entailments in OWL ontologies is provided by Horridge et al. [23]. These justifications may describe the execution of a specific kind of data creation process represented by our provenance model.

3. A MODEL OF WEB DATA PROVENANCE

Provenance research in the context of databases [30] or in the context of workflows [14] usually focuses on the creation of data items such as query results and data products. In the majority of cases, these approaches apply a notion of the sources of a data item that is directly related to the creation process. To represent the provenance of data from the Web we need an additional dimension. Provenance information of Web data must comprise the aspect of publishing and accessing data on the Web. Questions such as who operates the service that provides a dataset are as important as asking for the entity that created the data. For this reason, we suggest a provenance model for data from the Web that includes both dimensions, the creation and the access of data. In this section we describe our model: we introduce the basic elements, we present the data creation dimension, and we describe the representation of data access.

3.1 Basics of the Provenance Model

Provenance information can be used for various purposes. Possible uses are the estimation of data quality, the tracing of audit trails of data, the repetition of data derivations, the determination of liabilities, and the discovery of data [29]. The main purpose of our provenance model is to support the assessment of data qualities such as accuracy, reliability, and timeliness.

We propose to describe the provenance of a specific data item from the Web (e.g. a specific RDF graph or RDF statement) by a provenance graph. The nodes of provenance graphs are provenance elements that represent pieces of the provenance information about the data, pieces such as the actual creator of a specific dataset. Our provenance model identifies different types of provenance elements and it describes the relationships between these types and, thus, between the possible provenance elements in a provenance graph. Since provenance information for a data item may comprise information about source data, a provenance graph for the data item may contain subgraphs that describe the provenance of the source data. Thus, our understanding of a data item does not only include RDF graphs and RDF statements but it also covers typical source data such as workflow results and database entries (Table 3 in Appendix B lists further examples for data items).

We broadly distinguish three types of provenance elements: actors, executions, and artifacts. An actor usually performs the execution of an action or a process which, in most cases, yields an artifact such as a specific dataset. An execution may include the use of artifacts which, in turn, might be the result of another execution. Furthermore, direct relationships between artifacts as well as between actors may exist. For instance, a specific company is responsible for its Web server.

Our model defines different element types as specializations of actors, executions, and artifacts. For example, data creators are a specific kind of actor and RDF graphs are a specific kind of artifact. Each provenance element in a provenance graph corresponds to at least one of these types. The edges in a provenance graph correspond to the relationships between element types of the adjacent provenance elements.

All provenance elements have attributes that represent provenance-specific information about the elements. For instance, a specific data creator may have a name and may work for a well-known organization. The actual attributes and their extent depend on the element type and the needs in the application scenario. For some element types, however, all conceivable provenance elements will likely have attributes of the same kind. We call these attributes universal attributes and associate them with the corresponding element types in our model.

Note that, to cover a broad variety of applications, we designed our provenance model with generality in mind. Different applications of our model may have different needs with respect to the amount and granularity of the represented provenance information. For this reason, the elements of our model abstract from actual use cases; we do not suggest a specific implementation of provenance graphs nor do we prescribe attributes that must be used for specific provenance elements. However, we give examples for various provenance element types in Appendix B.

3.2 Data Creation

The creation of a data item can be a complex process such as executing a sophisticated workflow. Thus, provenance information about a data item may include a comprehensive description of the execution of data creation processes. On the other hand, the creation of data can be as straightforward as performing a simple action such as filling out a Web form. To represent this simple action as provenance of the derived data, a few details may suffice. Furthermore, what is a simple action in one situation may be considered a complex process in another case.

Our provenance model abstracts from the different notions of data creation. Figure 1 depicts the relationships of all element types that cover the data creation dimension. The central element type is the data creation. Data creations represent the execution of actions or processes that create new data items. Thus, in the provenance graph of a specific data item, actual data creations are represented by provenance elements of the data creation type. All data creations have a creation time and use a method (cf. the universal attributes in Figure 1). Examples of data creation methods are the aforementioned completion of a Web form as well as the execution of a workflow, of a query, of a transformation, or of a reasoning process. The existence of further, provenance-related attributes of a data creation and the granularity of these attributes depend on the specific creation method and on the application of the provenance model. For instance, the execution of a query could be associated with a representation of how-provenance [20]; the inference of data could be represented with a justification [11][23].

Figure 1: Provenance information concerning the creation of data.

Provenance elements that are associated with a data creation are the created data item, data creators, source data, and creation guidelines. Data creators are actors that perform the data creation. Our model distinguishes non-human and human data creators. Non-human creators are data creating devices such as sensors and data creating services such as software agents, reasoners, query engines, or workflow engines. Human data creators, called data creating entities, are persons, groups, organizations, etc. Data creating entities may create the data directly, as in the Web form example, or they may be responsible for a non-human data creator that creates the data. A provenance-relevant attribute of all data creating entities is their relation to the created data. Further attributes depend on the actual provenance element and on the implementation of the provenance model. For example, a data creating service may implement a specific algorithm and, usually, has a version number and a developer.

A data creator often makes use of source data to create new data. Examples of source data are the content of a document used for machine learning, the statements in a knowledge base used to entail a new statement, and the entries in a database used to answer a query. The granularity by which a provenance graph represents source data depends on the use case. Note that not every data creation uses source data, as indicated by the dashed connection in Figure 1. However, all source data has provenance which should be represented as a subgraph in the provenance graph of the created data. Thus, a provenance graph may recursively contain subgraphs for source data, for the sources of source data, and so on.

Further input artifacts that may be associated with a data creation are the creation guidelines which guided the execution of the data creation. Examples of creation guidelines are transformation rules, mapping definitions, entailment rules, and database queries.

3.3 Data Access

A system that uses Web data must access this data from a provider on the Web. Information about this process and about the providers is important for a representation of provenance that aims to support the assessment of data qualities. Hence, our provenance model introduces element types that give attention to the data access dimension. Figure 2 depicts the main element types and their relationships.

Figure 2: Provenance information about data access on the Web.

Data published on the Web is embedded in a host artifact, usually a document. Following the terminology of the W3C Technical Architecture Group we call this artifact an information resource [24]. Each information resource has a type, e.g., it is an RDF document or an HTML document.

A system, the data accessor, retrieves information resources from a provider. Our provenance model allows a detailed representation of providers by distinguishing data providing services, data publishers, and service providers. A data providing service is a non-human actor, usually a Web service or a server, that processes data access requests and actually sends the information resource over the Web. Provenance-related attributes of data providing services may be a description of the software that realizes the service. Note that a data providing service in a provenance graph may also be a data creating service. For instance, a D2R server [1] creates RDF data from a relational database and provides this data as linked data in RDF documents, in HTML documents, or as the result of SPARQL queries.

Data publishers are persons, groups, or organizations that use a data providing service to publish data on the Web. Similar to the services, a data publisher may also be the data creating entity. The service provider element type represents entities such as a person, a group, or an organization that controls a data providing service. Our model introduces this type because data publishers may publish their data on platforms that are provided by a third party, the service provider. Information about this third party may be relevant as provenance information. However, a data publisher may administer its own service and, thus, could be the service provider itself.

Our provenance model represents the actual execution of a data access by an element type called data access. Provenance information that is common to all provenance elements of this type is the access time and the access method. A major access method is an HTTP-based resource request where the URI of the requested resource would be another provenance-related attribute of the corresponding provenance elements. Additional attributes of data access provenance elements may represent the content negotiation [17, Section 12] and possible redirections [17, Section 10.3] that happened during the data access. Further examples of access methods are API-based data access and its specialization, query-based access. Provenance information in these cases is the API call with its parameters and the issued query, respectively. Note that query-based data access usually is a data creation too.

Further provenance information not considered so far is the availability and validity of digital signatures. Since this information is important to assess the quality of data, our provenance model contains additional element types, namely integrity assurances, digital signatures, public keys, and signers (cf. Figure 3).

Figure 3: Signature verifiability is a part of Web data provenance.

An integrity assurance basically represents the verification of a digital signature for the signed artifact. The verification requires the public key of the signer. Integrity assurances are associated with information about the result of the verification. Digital signatures have several properties that are relevant as provenance information, e.g., the date of issue and the signature scheme [19]. The provenance element for a public key could describe the creation and the expiration date.

4. OBTAINING PROVENANCE INFORMATION

A system that applies our provenance model generates provenance graphs for data items. To create the provenance elements of such a graph the system has to collect different pieces of provenance information automatically. In this section we discuss options to obtain provenance information for Web data.

Some pieces of provenance information can be recorded by a system; for other pieces the system relies on metadata provided by third parties. Thus, we distinguish recordable provenance information and metadata-reliant provenance information. Basically, recordable provenance information is information on executions that are performed by the system itself or that can sufficiently be monitored by the system. Usually, these executions are data accesses initiated by the system, signature verifications, and local data creations. Metadata-reliant provenance information, in contrast, cannot be recorded automatically but requires the evaluation of metadata that is published on the Web. Metadata-reliant provenance information comprises information about executions inaccessible to the system as well as information about actors and artifacts involved in these executions. Furthermore, obtaining more exhaustive provenance information about certain actors involved in accessible executions may also require metadata. For instance, even if a system can record information about a self-initiated data access, a proper representation of the involved providers requires metadata.

Recording provenance information is a fundamental topic of provenance research. For instance, Bose and Frew [4] study scientific workflow management systems that aim to track provenance for data products; Horridge et al. [23] present concepts to compute justifications for entailments in ontologies; Tan [30] discusses different approaches to propagate and to compute provenance of query results in database systems. The concepts developed in these contexts can be adapted to generate recordable provenance information in the context of Web data. On that account, we focus on metadata-reliant provenance information in the remainder of this section. We identify methods to access relevant metadata on the Web, we analyze vocabularies that allow a representation of provenance-related metadata, and we study the existence of such metadata on the Web.

4.1 Accessing Provenance-Related Metadata

Provenance-relevant metadata is either directly attached to a data item or its host document or it is available as additional data on the Web. Examples of attached metadata are RDF statements about an RDF graph that contains the statements, author and creation date of blog entries added to a syndication feed, or information about an image embedded in the image file. Both attached and detached metadata may be represented in RDF using vocabularies as described in Section 4.2, or they may be data of another form. In the following, we present options to discover detached metadata on the Web.

Accessing data on the Web is often based on HTTP URIs. Since these URIs are grounded in the Domain Name System (DNS) it is possible to query a WHOIS [13] service in order to get provenance information about the accessed data item. However, the responses of WHOIS services are hardly usable for automatic evaluation because the WHOIS protocol does not prescribe a standard form to structure returned data.

Sitemaps are a source of data about the content provided by a Web server. A sitemap is an XML document that informs search engine crawlers about URLs on a website. The semantic sitemap approach [10] extends these documents with information about the location of RDF data and about alternative means to access this data (e.g. data dumps and SPARQL endpoints). Even though the information in a semantic sitemap is only marginally provenance-related, an important element is the specification of a URI that represents referenced datasets. Provided that the provider follows the linked data principles, a look-up of these URIs will yield RDF-based metadata that may describe provenance information about the datasets.

Using linked data, in any case, is an important approach to discover provenance-related metadata. As with the URI of an RDF dataset, it is possible to look up any HTTP URI that represents a piece of provenance (e.g. the URI of a data item such as a named RDF graph [9] or the URI of an actor). Moreover, collecting provenance information may involve following RDF links in order to get more complete information during the generation of provenance graphs.

Another method to discover metadata about Web resources is POWDER [28], the Protocol for Web Description Resources. POWDER introduces so-called description resources to describe resources on the Web. These descriptions are either based on RDF data or on simple keywords (i.e. tags) and they may contain provenance information.

A specific kind of actor represented by our provenance model are Web services that create or provide data. Different standards exist to describe Web services [32]. These descriptions may also contain information that is relevant for provenance graphs.

4.2 Provenance-Related Vocabularies

Various vocabularies exist that can be used to describe provenance information with RDF data. In the following, we describe these vocabularies and we relate the classes and properties defined by these vocabularies to the elements of our provenance model. Afterwards, we study the presence of the vocabularies on the Web.

A popular standard to represent general-purpose metadata are the Dublin Core Metadata Terms [16], which are available as an RDFS schema. The following properties defined by this schema can be associated with a resource to describe provenance information:

• dcterms:contributor¹, dcterms:creator – The contributor of a resource is defined as “an entity responsible for making contributions to the resource” [16] and the creator is “an entity primarily responsible for making the resource.” [16] Thus, these properties may be used to obtain information about data creators of a data item. However, the actual type of the referenced creators as well as their role in the data creation process remain unclear because the Dublin Core schema does not distinguish different types of data creators as our provenance model does. Analyzing data about the creator (or about the contributor) may yield further information (e.g. the type) that can help to derive a more precise provenance graph.

• dcterms:source – The source of a resource is “a related resource from which the described resource is derived.” [16] With this property it is possible to create provenance elements associated as source data with a data creation element.

• dcterms:created – This property specifies the creation date of a resource and can be used to set the creation time attribute associated with the execution of a data creation.

• dcterms:modified – This property specifies the date on which a resource has been changed. We propose to represent the modification of a data item as a data creation which creates a new, modified version of the original data item. The creation time attribute associated with this data creation can be set using the dcterms:modified property.

• dcterms:publisher – The publisher of a resource is “an entity responsible for making the resource available.” [16] This property may be used to obtain information about a provider of an information resource, whereas the actual type of the provider (data providing service, data publisher, or service provider) remains unclear.

• dcterms:provenance – This property links a resource to “a statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation.” [16] Due to the very general definition, it is difficult to use such a provenance statement during the creation of a provenance graph.

¹ We use the following namespace prefixes in this paper:
dcterms: http://purl.org/dc/terms/
dc11: http://purl.org/dc/elements/1.1/
foaf: http://xmlns.com/foaf/0.1/
sioc: http://rdfs.org/sioc/ns#
swp: http://www.w3.org/2004/03/trix/swp-2/
wot: http://xmlns.com/wot/0.1/
iwProv: http://inferenceweb.stanford.edu/2006/06/pml-provenance.owl#
ouzo: http://www.mygrid.org.uk/provenance#
cs: http://purl.org/vocab/changeset/schema#

The Friend of a Friend (FOAF) vocabulary [6] provides classes and properties to describe entities such as persons, organizations, groups, and software agents. FOAF-based descriptions can be used to obtain basic information about actors (e.g. names, group membership, email addresses, identifying online accounts). Furthermore, FOAF contains the property foaf:maker and its inverse foaf:made to relate the described entities to resources made by the entities. These properties can be used to identify the data creator of a data item. However, using these properties raises the same questions as for the dcterms:creator property.

The Semantically-Interlinked Online Communities (SIOC) ontology [2] describes information from online communities. The ontology associates SIOC items such as blog posts, comments, and e-mail messages with users that are identified by their online accounts. The following properties describe provenance-relevant information:

• sioc:has_creator, sioc:creator_of, sioc:has_modifier, sioc:modifier_of – These properties relate a SIOC item to a user who created it or who modified it. In contrast to the corresponding Dublin Core and FOAF properties, representing the referenced users according to our provenance model is less ambiguous: creators and modifiers referenced in a SIOC-based description are data creating entities.

• sioc:has_owner, sioc:owner_of – These properties express ownership of SIOC items. This information may provide an indication of the relation of a data publisher to provided data and, thus, might be used to set the corresponding attribute of the entity in a provenance graph.

• sioc:earlier_version, sioc:later_version, sioc:next_version, sioc:previous_version – These properties relate different versions of a SIOC item with each other and could be used to create relationships between artifacts in a provenance graph.

The Semantic Web Publishing Vocabulary (SWP) [9] enables the description of information about the provision of data. With SWP it is possible to represent the attitude of a legal person to an RDF graph. SWP supports two attitudes: claiming the graph is true and quoting the graph without a comment on its truth. These commitments towards the truth can be used to derive a data publisher’s or a data creating entity’s relation to provided or created artifacts. Furthermore, SWP can describe digests and digital signatures of RDF graphs and represent public keys. Similarly, the Web Of Trust schema (WOT) [5] enables descriptions that document the use of public key cryptography tools to sign documents. However, two differences between WOT and the signature-related part of SWP exist. First, digital signatures in SWP are represented as RDF data; WOT, in contrast, refers to signatures that are individually encoded in dedicated documents. Second, while the digital signatures described with WOT sign information resources, the signatures in SWP-based descriptions sign a specific kind of data item, namely RDF graphs. However, both SWP-based descriptions and WOT-based descriptions can be used to obtain information about public keys and digital signatures in order to represent them in a provenance graph.

Further vocabularies that can be used to describe provenance information of specific types of data items are the following:

• The Ontology Metadata Vocabulary (OMV) [27] describes ontologies. OMV includes properties for creators, contributors, reviewers, and creation and modification dates.

• The Proof Markup Language [11] describes justifications for results of an answering engine or an inference process.

• The Changeset Vocabulary [31] describes changes to RDF-based resource descriptions.

• The Ouzo Provenance Ontology [33] describes the run of a (scientific) workflow, the processed data, and the entities responsible for the workflow run.

4.3 Existence of Provenance Metadata

To study the existence of metadata on the Web that uses the aforementioned vocabularies and their provenance-relevant properties, we utilized two Web data indexes available on the Web, namely the Web service Ping the Semantic Web (PTSW) [3] and the Sindice search engine [26]. PTSW is a service that receives notifications from different applications that create, update, or discover RDF documents on the Web; PTSW aggregates these notifications and provides an up-to-date list of existing RDF documents. Furthermore, the PTSW website provides different statistics. Currently, PTSW indexes 1,073,218² RDF documents. For each of the vocabularies discussed before, Table 1 presents the number of documents registered at PTSW that use this vocabulary. As the numbers indicate, FOAF and SIOC are widely used. The other vocabularies are either not used at all or used in an insignificant number of documents.

Table 1: Number of RDF documents known to PTSW that use a vocabulary (as of Feb. 7, 2009).

vocabulary                                 occurrences
Dublin Core Metadata Terms                 121
Dublin Core Metadata Element (legacy)      9
FOAF                                       989,263
SIOC                                       127,974
SWP                                        1
WOT                                        101
Proof Markup Language                      0
Ouzo Provenance Ontology                   0
Changeset Vocabulary                       0

The statistics of PTSW may not be representative because they heavily depend on the applications that notify PTSW. For this reason, we utilized the Sindice search engine for another inquiry. Sindice indexes structured data from the Web.

To overcome these issues we propose to develop a vocabulary that enables data publishers to describe the provenance of the provided data more precisely. This new vocabulary may refine existing vocabularies. Nonetheless, the development of this new vocabulary should be based on the presented provenance model.

The second problem is the general lack of provenance-related metadata in the Web of linked data. Reasons might be the lack of suitable vocabularies, a lack of usable tools to generate provenance-related metadata, and ignorance or at least a lack of sensitization. The first two reasons can be ascribed to technical problems that should be solvable by the development of the proposed vocabulary and by the implementation of corresponding tools. The lack of sensitization is a more general problem that must be addressed by the linked data community. A possible approach may evolve based on the recently released Vocabulary of Interlinked Datasets (voiD) [34], which is a vocabulary to describe the content of RDF-based datasets and the links between different datasets. Since voiD enables the discovery and usage of linked datasets, it may raise the awareness of publishers to provide metadata for their datasets. This understanding may be used to motivate publishers to provide provenance information along with their voiD descriptions.

6. CONCLUSIONS

In this paper we propose a provenance model for Web data and we discuss options to obtain provenance information. In
We queried Sindice for documents that con- contrast to provenance research in areas such as workflows tain RDF statements with the provenance-relevant proper- and databases the analysis of the Web data provenance must ties of the vocabularies. Table 2 in Appendix A lists the include information about the access of data in the Web. number of documents in the Sindice index for each prop- Thus, our provenance model includes two dimensions: data erty. According to Sindice, Dublin Core is roughly as often creation and data access. By specifying the relationships used as FOAF,a conclusion that cannot be drawn from the of rather general types of pieces of provenance information PTSW statistics. Furthermore, in contrast to the PTSW our model describes provenance on an abstract level. This numbers, the Sindice queries reveal that the legacy Dublin generality gives applications a choice to refine the model Core metadata elements are used more widely than the rec- according to their use case. ommended new metadata terms. Consistent with the find- Based on our provenance model we describe options to ings discovered in the PTSW statistics, the SWP, the Proof obtain provenance information and we analyze vocabularies Markup Language, the Ouzo Provenance Ontology, and the to express such information. Our analysis identifies several Changeset Vocabulary are not used at all3 . Moreover, con- open questions. We aim to address these questions in the sidering that Sindice currently indexes about 48.99 million future. documents the numbers for the other vocabularies are not As further future work we will develop concepts to esti- satisfying either. Thus, we conclude that there is only very mate the trustworthiness of Web data based on our prove- little provenance-related, RDF-based metadata available on nance model. These estimations have to consider subjective the Web. assessments of the elements in a provenance graph (e.g. 
the reliability of a data creator) as well as the trustworthiness of 5. OPEN QUESTIONS the data that has been used to create the provenance graph. Our analysis of vocabularies that allow to express prove- nance information reveals two problems. First, the vocabu- 7. ACKNOWLEDGMENT laries are partly unsuitable and lack certain features. Our re- We thank Jun Zhao for her valuable feedback on our view of the vocabularies shows a lack of unambiguousness for provenance model. certain properties. In particular, it is difficult to distinguish between the different types of providers and it is impossi- ble to express the actual relationships between providers of 8. REFERENCES different types. The same holds for data creators. Further- [1] C. Bizer and R. Cyganiak. D2R Server – Publishing more, it is impossible to describe the execution of a data Relational Databases on the Semantic Web. Poster at access. These descriptions may be required to document the 5th International Semantic Web Conference the access of source data executed by a third party. (ISWC), Nov. 2006. 2 All numbers are from February 7, 2009. [2] U. Bojars and J. G. Breslin. SIOC Core Ontology 3 Occurences of 1 to 3 in Table 2 refer to documents that Specification, Revision 1.30. Online, Jan. 2009. specify the corresponding properties. Retrieved Feb. 7. [3] U. Bojars, A. Passant, F. Giasson, and J. Breslin. An Symposium on Principles of Database Systems Architecture to Discover and Query Decentralized (PODS). ACM, June 2007. RDF Data. In Proceedings of the 3rd Workshop on [21] A. Harth, A. Polleres, and S. Decker. Towards a Social Scripting for the Semantic Web (SFSW) at ESWC, Provenance Model for the Web. In Proceedings of the June 2007. Workshop on Principles of Provenance, Nov. 2007. [4] R. Bose and J. Frew. Lineage retrieval for scientific [22] M. Hausenblas, W. Slany, and D. Ayers. A data processing: A survey. ACM Computing Surveys, Performance and Scalability Metric for Virtual RDF 37(1):1–28, Mar. 
2005. Graphs. In Proceedings of the 3rd Workshop on [5] D. Brickley. Web Of Trust RDF Ontology. Online, Scripting for the Semantic Web (SFSW) at ESWC, Feb. 2004. Retrieved Feb. 7. June 2007. [6] D. Brickley and L. Miller. FOAF Vocabulary [23] M. Horridge, B. Parsia, and U. Sattler. Laconic and Specification. Online, Nov. 2007. Retrieved Feb. 7. Precise Justifications in OWL. In Proceedings of the [7] P. Buneman, S. Khanna, and W. C. Tan. Data 7th International Semantic Web Conference (ISWC). Provenance: Some Basic Issues. In Proceedings of the Springer, Oct. 2008. 20th Conference on Foundations of Software [24] I. Jacobs and N. Walsh. Architecture of the World Technology and Theoretical Computer Science Wide Web, Volume One. W3C Recommendation, (FST TCS). Springer, Dec. 2000. Online, Dec. 2004. Retrieved Feb. 7. [8] P. Buneman, S. Khanna, and W. C. Tan. Why and [25] L. Moreau, B. Plale, S. Miles, C. Goble, P. Missier, Where: A Characterization of Data Provenance. In R. Barga, Y. Simmhan, J. Futrelle, R. McGrath, Proceedings of the 8th International Conference on J. Myers, P. Paulson, S. Bowers, B. Ludaescher, Database Theory (ICDT). Springer, Jan. 2001. N. Kwasnikowska, J. Van den Bussche, T. Ellkvist, [9] J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler. J. Freire, and P. Groth. The Open Provenance Model. Named Graphs, Provenance and Trust. In Proceedings Technical report, Electronics and Computer Science, of the 14th International World Wide Web Conference University of Southampton, 2008. (WWW ). ACM Press, May 2005. [26] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, [10] R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, and H. Stenzhorn, and G. Tummarello. Sindice.com: A G. Tummarello. Semantic Sitemaps: Efficient and Document-oriented Lookup Index for Open Linked Flexible Access to Datasets on the Semantic Web. In Data. International Journal of Metadata, Semantics Proceedings of the 5th European Semantic Web and Ontologies, 3(1):37–52, 2008. Conference (ESWC). 
Springer, June 2008. [27] R. Palma, J. Hartmann, and P. Haase. OMV [11] P. P. da Silva, D. L. McGuinness, and R. Fikes. A Ontology Metadata Vocabulary for the Semantic Web, Proof Markup Language for Semantic Web Services. v2.4. Online, Jan. 2008. Retrieved Feb. 7. Information Systems, 31(4-5):381–395, June 2006. [28] K. Scheppe. Protocol for Web Description Resources [12] P. P. da Silva, D. L. McGuinness, and R. McCool. (POWDER): Primer. W3C Working Draft, Online, Knowledge Provenance Infrastructure. Data Nov. 2008. Retrieved Feb. 7. Engineering Bulletin, 26(4):26–32, Dec. 2003. [29] Y. Simmhan, B. Plale, and D. Gannon. A Survey of [13] L. Daigle. WHOIS Protocol Specification. Data Provenance in e-Science. SIGMOD Record, IETF RFC 3912, Sept. 2004. 34(3):31–36, Sept. 2005. [14] S. B. Davidson, S. C. Boulakia, A. Eyal, [30] W. C. Tan. Provenance in Databases: Past, Current, B. Ludäscher, T. M. McPhillips, S. Bowers, M. K. and Future. IEEE Data Engineering Bulletin, Anand, and J. Freire. Provenance in Scientific 30(4):3–12, Dec. 2007. Workflow Systems. IEEE Data Engineering Bulletin, [31] S. Tunnicliffe and I. Davis. Changeset Vocabulary. 30(4):44–50, Dec. 2007. Online, Mar. 2006. Retrieved Feb. 7. [15] L. Ding, T. Finin, Y. Peng, P. P. da Silva, and D. L. [32] K. Verma and A. Sheth. Semantically Annotating a McGuinness. Tracking RDF Graph Provenance using Web Service. IEEE Internet Computing, 11(2):83–85, RDF Molecules. Technical Report TR-CS-05-06, Mar. 2007. UMBC, Apr. 2005. [33] J. Zhao. A conceptual model for e-science provenance. [16] Dublin Core Metadata Initiative Usage Board. DCMI PhD thesis, University of Manchester, June 2007. Metadata Terms. Online, Jan. 2008. Retrieved Feb. 7. [34] J. Zhao, K. Alexander, M. Hausenblas, and [17] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and R. Cyganiak. Vocabulary of Interlinked Datasets. T. Berners-Lee. Hypertext Transfer Protocol – Online, Jan. 2009. Retrieved Feb. 7. HTTP/1.1. IETF RFC 2068, Jan. 1997. [18] Y. 
Gil and D. Artz. Towards Content Trust of Web Resources. Journal of Web Semantics, 5(4):227–239, APPENDIX Dec. 2007. [19] S. Goldwasser, S. Micali, and R. Rivest. A Digital A. USAGE OF VOCABULARIES Signature Scheme Secure Against Adaptive This appendix details the result of a Sindice-based search Chosen-Message Attacks. SIAM Journal on for documents that contain certain provenance-relevant prop- Computing, 17(2):281–308, Apr. 1988. erties. Table 2 lists each property together with the number [20] T. J. Green, G. Karvounarakis, and V. Tannen. of documents that contain at least one RDF statement with Provenance Semirings. In Proceedings of the 26th the property. The numbers have been recorded on February 7, 2009, by querying the Sindice search engine. B. PROVENANCE ITEM TYPES This appendix lists examples for various provenance ele- Table 2: Provenance-relevant properties and the ment types in the data creation dimension (cf. Table 3) and number of documents in which they occur at least in the data access dimension (cf. Table 4). once (according to the Sindice search engine). Table 3: Examples of provenance element types in property occurences the data creation dimension. 
dcterms:creator 134 element type examples dc11:creator about 24,150 data item RDF statement dcterms:contributor 11 RDF graph dc11:contributor 465 subgraph of an RDF graph dcterms:source 1 axiom in a knowledge base dc11:source about 3,630 data product created by workflow dcterms:created about 73,010 set of data products dc11:created about 9,710 result of a query dcterms:modified about 2,320 table in a database dc11:modified about 9,700 tuple in a database table dcterms:publisher 87 data creation completion of a Web form dc11:publisher 808 execution of a workflow dcterms:provenance 7 execution of a transformation foaf:made about 5,420 automatic reasoning foaf:maker about 29,370 execution of a database query sioc:creator of about 1,370 execution of a search query sioc:has creator about 4,520 mapping from other data models sioc:modifier of 3 machine learning sioc:has modifier 4 data creating entity person, group, organization sioc:owner of about 3,020 data creating device sensor sioc:has owner about 553 data creating service software agent sioc:earlier version 0 workflow engine sioc:later version 0 reasoner sioc:next version 3 query engine sioc:previous version 3 search index swp:assertedBy 0 data wrapper (e.g. D2R Server) swp:authority 0 (batch) script interpreter swp:quotedBy 0 source data content of a document swp:validUntil 0 (statements in) a dataset wot:assurance 135 (data in) a database wot:fingerprint 52 creation guidelines workflow model wot:hasKey 23 transformation rules wot:hex id 48 entailment/inference rules wot:identity 36 database query wot:length 43 search query wot:pubkeyAddress 54 mapping definitions wot:sigdate 8 wot:signed 3 Table 4: Examples of provenance element types in wot:signer 2 the data access dimension. 
iwProv:hasMember 1 element type examples iwProv:isMemberOf 1 data access resource-based data access iwProv:hasPublisher 1 API-based data access iwProv:hasPublicationDateTime 1 query-based data access iwProv:hasUsageDateTime 1 data providing service Web server iwProv:hasSource 1 Web service iwProv:hasInferenceEngineRule 1 Web-based query interface iwProv:usesInferenceEngine 1 data publisher person, group, organization ouzo:belongsTo 2 service provider person, group, organization ouzo:dataDerivedFrom 2 ouzo:launchedBy 2 ouzo:processInput 2 ouzo:runsWorkflow 2 cs:createdDate 3 cs:creatorName 3
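
To illustrate how the element types of the data creation dimension (Table 3) relate to the Dublin Core properties discussed earlier, the following Python sketch builds data creation elements from a small metadata record. It is only a minimal illustration under stated assumptions: the class names, the `version` label, the function `creations_from_dublin_core`, and the example URIs are hypothetical and not part of the provenance model or of any vocabulary. The sketch follows the proposal to represent a modification (dcterms:modified) as a further data creation that uses the original data item as source data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataItem:
    """An artifact such as an RDF graph (element type 'data item')."""
    uri: str
    version: str = "original"  # hypothetical label used only in this sketch


@dataclass
class DataCreation:
    """Execution of a data creation with its creation time and source data."""
    created_item: DataItem
    creation_time: Optional[str] = None
    source_data: List[DataItem] = field(default_factory=list)


def creations_from_dublin_core(uri: str, meta: Dict[str, str]) -> List[DataCreation]:
    """Derive data creation elements from Dublin Core metadata (assumed mapping)."""
    original = DataItem(uri)
    # dcterms:created sets the creation time of the first data creation.
    first = DataCreation(original, creation_time=meta.get("dcterms:created"))
    # dcterms:source yields an element associated as source data.
    if "dcterms:source" in meta:
        first.source_data.append(DataItem(meta["dcterms:source"]))
    creations = [first]
    # dcterms:modified is represented as a data creation that produces a
    # new, modified version from the original data item.
    if "dcterms:modified" in meta:
        modified = DataItem(uri, version="modified")
        creations.append(
            DataCreation(modified,
                         creation_time=meta["dcterms:modified"],
                         source_data=[original])
        )
    return creations


if __name__ == "__main__":
    example = {
        "dcterms:created": "2009-01-10",
        "dcterms:modified": "2009-02-07",
        "dcterms:source": "http://example.org/src",
    }
    for creation in creations_from_dublin_core("http://example.org/doc", example):
        print(creation.created_item.version,
              creation.creation_time,
              [item.uri for item in creation.source_data])
```

The data access dimension (Table 4) is deliberately left out here, since, as noted in Section 5, the existing vocabularies do not allow describing the execution of a data access.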