Provenance Information in the Web of Data

Olaf Hartig
Humboldt-Universität zu Berlin
Department of Computer Science
Berlin, Germany
hartig@informatik.hu-berlin.de


ABSTRACT
The openness of the Web and the ease of combining linked data from
different sources create new challenges. Systems that consume linked data
must evaluate quality and trustworthiness of the data. A common approach
for data quality assessment is the analysis of provenance information. For
this reason, this paper discusses provenance of data on the Web and
proposes a suitable provenance model. While traditional provenance
research usually addresses the creation of data, our provenance model also
represents data access, a dimension of provenance that is particularly
relevant in the context of Web data. Based on our model we identify
options to obtain provenance information and we raise open questions
concerning the publication of provenance-related metadata for linked data
on the Web.

Categories and Subject Descriptors
I.2.4 [Computing Methodologies]: Knowledge Representation Formalisms and
Methods; H.3.3 [Information Systems]: Information Search and Retrieval

Keywords
Provenance, Lineage, Web Data, Web of Data, Linked Data

Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION
Today, a large amount of RDF data is published on the Web; large datasets
are interlinked; new applications emerge that utilize this data in novel
and innovative ways. An upcoming challenge that has to be addressed in
these applications is the evaluation of qualities of the data retrieved
from the Web, qualities such as accuracy, timeliness, reliability, and
trustworthiness.

A recent study shows that one of the main factors that influence the trust
of users in Web content is provenance [18]. Thus, a common approach for
data quality assessment is the analysis of provenance information.
“Information about provenance constitutes the proof of correctness [...]
and [...] determines the quality and amount of trust [...]” [30].

Provenance information about a data item is information about the history
of the item, starting from its creation, including information about its
origins. Tan [30] distinguishes two granularities of provenance: workflow
(or coarse-grained) provenance and data (or fine-grained) provenance.
Workflow provenance represents “the entire history of the derivation of
the final output of” [30] a workflow. Davidson et al. [14] provide an
overview of provenance in workflow systems. Data provenance, in contrast,
provides a more detailed view on the derivation of single pieces of data.
A particular area of research on data provenance is provenance in
databases, which considers the provenance of query results. In this
context, Buneman et al. [8] distinguish why- and where-provenance:
why-provenance represents the origins that were involved in calculating a
single entry of a query result; where-provenance refers to the exact
locations an element of a query result entry has been extracted from.
Green et al. [20] additionally introduce how-provenance that, in contrast
to why-provenance, describes how the origins were involved in the
calculation.

While a great many approaches exist that represent provenance [4, 29, 30],
none of these explicitly addresses the characteristics of provenance of
data from the Web. Web data provenance includes the access of data items
on the Web, a piece of information not required in the context of
self-contained systems such as a DBMS or a workflow management system.

In this paper we discuss provenance of Web data. We aim to provide a base
for research on the application of provenance information to assess
qualities of linked data from the Web. Our main contributions are the
following:

   • We propose a provenance model that captures both information about
     Web-based data access and information about the creation of data.
   • We describe options to obtain provenance information by accessing
     metadata on the Web.
   • We analyze vocabularies for RDF data that allow describing provenance
     information.
   • We identify open questions concerning the publication of
     provenance-related metadata for linked data on the Web.

This paper is structured as follows. First, Section 2 reviews related
work. In Section 3 we introduce our provenance model for Web data. A
discussion of options to obtain provenance information is given in
Section 4. Finally, Section 5 raises open questions and Section 6
concludes this paper.

2. RELATED WORK
Representing and analyzing provenance has been a topic of research for
many years [4]. Simmhan et al. [29] provide a taxonomy of provenance
characteristics. The authors differentiate between data-oriented
approaches and process-oriented approaches. While data-oriented approaches focus
on data items, process-oriented approaches emphasize information about the
processes that consume and generate the data. Due to its level of
abstraction our provenance model can be used as a basis for both types of
approaches as well as for hybrid approaches.

An approach to model provenance on a more detailed level is the Open
Provenance Model introduced by Moreau et al. [25]. Similar to our model,
the authors distinguish three types of pieces of provenance information:
artifacts, processes, and agents. The Open Provenance Model represents
provenance by graphs. The nodes in these graphs represent the artifacts,
processes, and agents. The edges are directed and they have predefined
semantics depending on the type of the adjacent nodes. For instance, an
edge that connects a process with an agent means the process was
controlled by the agent. Some edges can be annotated with a use
case-specific role. Due to its more detailed representation the Open
Provenance Model can be used to realize the description of parts of a
provenance graph that complies with our more abstract model.

Buneman et al. [7] raise several open questions for data provenance in the
age of the Web. The authors identify three main issues: i) obtaining
provenance information, ii) citing components of a digital library such as
(components of) a document in another context, and iii) ensuring integrity
of citations under the assumption that cited databases evolve. We address
the first issue by discussing options to obtain provenance information in
Section 4.1.

Harth et al. [21] argue for a provenance model for the Web that includes a
“social dimension to associate provenance with the originator (typically a
person) of a given piece of information.” Given such a model it is
possible to embed provenance-based quality assessments in the social
context of users. We agree with the authors’ request. With our provenance
model we encourage the representation of human actors and their relation
to data items.

A more technical notion of provenance is represented by Ding et al. [15],
who understand the provenance of RDF data as the RDF graphs from which
parts of an analyzed RDF graph have been derived. The authors argue that
tracking complete RDF graphs is too coarse-grained and that a
representation on the level of single RDF statements is unsuitable, too.
For this reason, Ding et al. introduce RDF molecules as the finest
sub-graphs that can be generated by a lossless decomposition of an RDF
graph. Our provenance model represents data items on an abstract level.
Thus, actual applications may use any level of granularity: RDF graphs,
statements, or RDF molecules.

Hausenblas et al. [22] touch another aspect of Web data provenance. The
authors distinguish sources of Web data based on the way these sources
represent RDF data. Sources may contain RDF data in a non-serialized form
(e.g. in-memory, in-store) or arbitrary data in a serialized form. Sources
with serialized data may be i) RDF model-compliant and standalone, ii) RDF
model-compliant and embedded, or iii) non-compliant with the RDF model.
Our provenance model also differentiates between the data itself and its
representation in a document.

Another approach that considers provenance in the context of the Semantic
Web has been developed in the Inference Web project. Da Silva et al. [12]
describe a provenance infrastructure that supports “the extraction,
maintenance and usage of knowledge provenance related to answers of web
applications and services.” The term knowledge provenance refers to
information about the origin of knowledge and about the reasoning
processes used to produce answers. In [11] the authors present the Proof
Markup Language to describe justifications for results of an answering
engine or a reasoner. A formal definition of justifications for
entailments in OWL ontologies is provided by Horridge et al. [23]. These
justifications may describe the execution of a specific kind of data
creation process represented by our provenance model.

3. A MODEL OF WEB DATA PROVENANCE
Provenance research in the context of databases [30] or in the context of
workflows [14] usually focuses on the creation of data items such as query
results and data products. In the majority of cases, these approaches
apply a notion of the sources of a data item that is directly related to
the creation process. To represent the provenance of data from the Web we
need an additional dimension. Provenance information of Web data must
comprise the aspect of publishing and accessing data on the Web. Questions
such as who operates the service that provides a dataset are as important
as asking for the entity that created the data. For this reason, we
suggest a provenance model for data from the Web that includes both
dimensions, the creation and the access of data. In this section we
describe our model: we introduce the basic elements, we present the data
creation dimension, and we describe the representation of data access.

3.1 Basics of the Provenance Model
Provenance information can be used for various purposes. Possible uses are
the estimation of data quality, the tracing of audit trails of data, the
repetition of data derivations, the determination of liabilities, and the
discovery of data [29]. The main purpose of our provenance model is to
support the assessment of data qualities such as accuracy, reliability,
and timeliness.

We propose to describe the provenance of a specific data item from the Web
(e.g. a specific RDF graph or RDF statement) by a provenance graph. The
nodes of provenance graphs are provenance elements that represent pieces
of the provenance information about the data, pieces such as the actual
creator of a specific dataset. Our provenance model identifies different
types of provenance elements and it describes the relationships between
these types and, thus, between the possible provenance elements in a
provenance graph. Since provenance information for a data item may
comprise information about source data, a provenance graph for the data
item may contain subgraphs that describe the provenance of the source
data. Thus, our understanding of a data item does not only include RDF
graphs and RDF statements but it also covers typical source data such as
workflow results and database entries (Table 3 in Appendix B lists further
examples for data items).

We broadly distinguish three types of provenance elements: actors,
executions, and artifacts. An actor usually performs the execution of an
action or a process which – in most cases – yields an artifact such as a
specific dataset. An execution may include the use of artifacts which, in
turn, might be the result of another execution. Furthermore, direct
relationships between artifacts as well as between actors may exist. For
instance, a specific company is responsible for its Web server.
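
As a rough illustration of the three element types, the following Python
sketch builds a tiny provenance graph. It is only one possible encoding and
not an implementation prescribed by the model: the class names, the
relationship labels (performed_by, yields, uses, responsible_for), and the
example elements are assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical encoding; the model itself does not prescribe an implementation.
ELEMENT_KINDS = {"actor", "execution", "artifact"}


@dataclass(frozen=True)
class ProvenanceElement:
    ident: str
    kind: str  # one of ELEMENT_KINDS

    def __post_init__(self):
        if self.kind not in ELEMENT_KINDS:
            raise ValueError(f"unknown element kind: {self.kind}")


@dataclass
class ProvenanceGraph:
    elements: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source id, label, target id)

    def add(self, ident, kind):
        element = ProvenanceElement(ident, kind)
        self.elements[ident] = element
        return element

    def relate(self, source, label, target):
        # Both endpoints must already be elements of the graph.
        if source not in self.elements or target not in self.elements:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.append((source, label, target))

    def related(self, source, label):
        # Targets reachable from `source` over edges with the given label.
        return [t for s, lbl, t in self.edges if s == source and lbl == label]


graph = ProvenanceGraph()
graph.add("company", "actor")
graph.add("web_server", "actor")
graph.add("export_run", "execution")      # hypothetical execution
graph.add("source_table", "artifact")     # artifact used by the execution
graph.add("dataset", "artifact")          # artifact yielded by the execution

graph.relate("export_run", "performed_by", "web_server")
graph.relate("export_run", "uses", "source_table")
graph.relate("export_run", "yields", "dataset")
graph.relate("company", "responsible_for", "web_server")  # actor-to-actor
```

Storing edges as labeled triples keeps the sketch independent of any
particular set of relationship types; a stricter variant could check each
label against the kinds of its two endpoints.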
                        Figure 1: Provenance information concerning the creation of data.


Our model defines different element types as specializations of actors,
executions, and artifacts. For example, data creators are a specific kind
of actor and RDF graphs are a specific kind of artifact. Each provenance
element in a provenance graph corresponds to at least one of these types.
The edges in a provenance graph correspond to the relationships between
element types of the adjacent provenance elements.

All provenance elements have attributes that represent provenance-specific
information about the elements. For instance, a specific data creator may
have a name and may work for a well-known organization. The actual
attributes and their extent depend on the element type and the needs in
the application scenario. For some element types, however, all conceivable
provenance elements will likely have attributes of the same kind. We call
these attributes universal attributes and associate them with the
corresponding element types in our model.

Please note that, to cover a broad variety of applications, we designed
our provenance model with generality in mind. Different applications of
our model may have different needs with respect to the amount and
granularity of the represented provenance information. For this reason,
the elements of our model abstract from actual use cases; we do not
suggest a specific implementation of provenance graphs nor do we prescribe
attributes that must be used for specific provenance elements. However, we
give examples for various provenance element types in Appendix B.

3.2 Data Creation
The creation of a data item can be a complex process such as executing a
sophisticated workflow. Thus, provenance information about a data item may
include a comprehensive description of the execution of data creation
processes. On the other hand, the creation of data can be as
straightforward as performing a simple action such as filling out a Web
form. To represent this simple action as provenance of the derived data a
few details may suffice. Furthermore, what is a simple action in one
situation may be considered a complex process in another case.

Our provenance model abstracts from the different notions of data
creation. Figure 1 depicts the relationships of all element types that
cover the data creation dimension. The central element type is the data
creation. Data creations represent the execution of actions or processes
that create new data items. Thus, in the provenance graph of a specific
data item actual data creations are represented by provenance elements of
the data creation type. All data creations have a creation time and use a
method (cf. the universal attributes in Figure 1). Examples for data
creation methods are the aforementioned completion of a Web form as well
as the execution of a workflow, of a query, of a transformation, or of a
reasoning process. The existence of further, provenance-related attributes
of a data creation and the granularity of these attributes depend on the
specific creation method and on the application of the provenance model.
For instance, the execution of a query could be associated with a
representation of how-provenance [20]; the inference of data could be
represented with a justification [11][23].

Provenance elements that are associated with a data creation are the
created data item, data creators, source data, and creation guidelines.
Data creators are actors that perform the data creation. Our model
distinguishes non-human and human data creators. Non-human creators are
data creating devices such as sensors and data creating services such as
software agents, reasoners, query engines, or workflow engines. Human data
creators, called data creating entities, are persons, groups,
organizations, etc. Data creating entities may create the data directly –
as in the Web form example – or they are responsible for a non-human data
creator that creates the data. A provenance-relevant attribute of all data
creating entities is their relation to the created data. Further
attributes depend on the actual provenance element and on the
implementation of the provenance model. For example, a data creating
service may implement a specific algorithm and, usually, has a version
number and a developer.

A data creator often makes use of source data to create new data. Examples
for source data are the content of a document used for machine learning,
the statements in a knowledge base used to entail a new statement, and the
entries in a database used to answer a query. The granularity by which a
provenance graph represents source data depends on the use case. Note that
not every data creation uses source data, as indicated by the dashed
connection in Figure 1. However, all source data has provenance which
should be represented as a subgraph in the provenance graph of the created
data. Thus, a provenance graph may recursively contain subgraphs for
source data, for the sources of source data, and so on.

Further input artifacts that may be associated with a data creation are
the creation guidelines which guided the execution of the data creation.
Examples for creation guidelines are transformation rules, mapping
definitions, entailment rules, and database queries.

3.3 Data Access
A system that uses Web data must access this data from a provider on the
Web. Information about this process and about the providers is important
for a representation of provenance that aims to support the assessment of
data qualities. Hence, our provenance model introduces element types that
give attention to the data access dimension. Figure 2 depicts the main
element types and their relationships.

Figure 2: Provenance information about data access on the Web.

Data published on the Web is embedded in a host artifact, usually a
document. Following the terminology of the W3C Technical Architecture
Group we call this artifact an information resource [24]. Each information
resource has a type, e.g., it is an RDF document or an HTML document.

A system, the data accessor, retrieves information resources from a
provider. Our provenance model allows a detailed representation of
providers by distinguishing data providing services, data publishers, and
service providers. A data providing service is a non-human actor – usually
a Web service or a server – that processes data access requests and
actually sends the information resource over the Web. Provenance-related
attributes of data providing services may be a description of the software
that realizes the service. Note that a data providing service in a
provenance graph may also be a data creating service. For instance, a D2R
server [1] creates RDF data from a relational database and provides this
data as linked data in RDF documents, in HTML documents, or as the result
of SPARQL queries.

Data publishers are persons, groups, or organizations that use a data
providing service to publish data on the Web. Similar to the services, a
data publisher may also be the data creating entity. The service provider
element type represents entities such as a person, a group, or an
organization that controls a data providing service. Our model introduces
this type because data publishers may publish their data on platforms that
are provided by a third party, the service provider. Information about
this third party may be relevant as provenance information. However, a
data publisher may administer its own service and, thus, could be the
service provider itself.

Our provenance model represents the actual execution of a data access by
an element type called data access. Provenance information that is common
to all provenance elements of this type is the access time and the access
method. A major access method is an HTTP-based resource request, where the
URI of the requested resource would be another provenance-related
attribute of the corresponding provenance elements. Additional attributes
of data access provenance elements may represent the content negotiation
[17, Section 12] and possible redirections [17, Section 10.3] that
happened during the data access. Further examples for access methods are
API-based data access and its specialization, query-based access.
Provenance information in these cases is the API call with its parameters
and the issued query, respectively. Note that query-based data access
usually is a data creation too.

Further provenance information not considered so far is the availability
and validity of digital signatures. Since this information is important to
assess the quality of data, our provenance model contains additional
element types, namely integrity assurances, digital signatures, public
keys, and signers (cf. Figure 3).

Figure 3: Signature verifiability is a part of Web data provenance.

An integrity assurance basically represents the verification of a digital
signature for the signed artifact. The verification requires the public
key of the signer. Integrity assurances are associated with information
about the result of the verification. Digital signatures have several
properties that are relevant as provenance information, e.g., the date of
issue and the signature scheme [19]. The provenance element for a public
key could describe the creation and the expiration date.
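
The data access element type described in this section can be illustrated
with a small record structure. The following Python sketch is an example
under stated assumptions rather than part of the model: the class and field
names, the method label "http-get", and the example.org URIs are
hypothetical. It holds the two universal attributes (access time and access
method) together with the optional attributes of an HTTP-based resource
request.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DataAccess:
    # Universal attributes of every data access element.
    access_time: datetime
    access_method: str
    # Optional attributes of HTTP-based requests (hypothetical field names).
    requested_uri: Optional[str] = None
    accept_header: Optional[str] = None  # records the content negotiation
    redirections: list = field(default_factory=list)  # URIs, in visit order

    def final_uri(self):
        """Return the URI that finally served the information resource."""
        return self.redirections[-1] if self.redirections else self.requested_uri


# Example values are made up for illustration.
access = DataAccess(
    access_time=datetime(2009, 4, 20, 12, 0, tzinfo=timezone.utc),
    access_method="http-get",
    requested_uri="http://example.org/resource/Berlin",
    accept_header="application/rdf+xml",
    redirections=["http://example.org/data/Berlin"],
)
```

API-based or query-based accesses would leave the HTTP-specific fields
unset and would need further fields, for example for the API call
parameters or the issued query.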
4. OBTAINING PROVENANCE INFORMATION
A system that applies our provenance model generates provenance graphs for
data items. To create the provenance elements of such a graph the system
has to collect different pieces of provenance information automatically.
In this section we discuss options to obtain provenance information for
Web data.

Some pieces of provenance information can be recorded by a system; for
other pieces the system relies on metadata provided by third parties.
Thus, we distinguish recordable provenance information and metadata-reliant
provenance information. Basically, recordable provenance information is
information on executions that are performed by the system itself or that
can sufficiently be monitored by the system. Usually, these executions are
data accesses initiated by the system, signature verifications, and local
data creations. Metadata-reliant provenance information, in contrast,
cannot be recorded automatically but requires the evaluation of metadata
that is published on the Web. Metadata-reliant provenance information
comprises information about executions inaccessible to the system as well
as information about actors and artifacts involved in these executions.
Furthermore, obtaining more exhaustive provenance information about
certain actors involved in accessible executions may also require
metadata. For instance, even if a system can record information about a
self-initiated data access, a proper representation of the involved
providers requires metadata.

Recording provenance information is a fundamental topic of provenance
research. For instance, Bose and Frew [4] study scientific workflow
management systems that aim to track provenance for data products;
Horridge et al. [23] present concepts to compute justifications for
entailments in ontologies; Tan [30] discusses different approaches to
propagate and to compute provenance of query results in database systems.
The concepts developed in these contexts can be adapted to generate
recordable provenance information in the context of Web data. On that
account, we focus on metadata-reliant provenance information in the
remainder of this section. We identify methods to access relevant metadata
on the Web, we analyze vocabularies that allow a representation of
provenance-related metadata, and we study the existence of such metadata
on the Web.

4.1 Accessing Provenance-Related Metadata
Provenance-relevant metadata is either directly attached [...] not
prescribe a standard form to structure returned data.

Sitemaps are a source of data about the content provided by a Web server.
A sitemap is an XML document that informs search engine crawlers about
URLs on a website. The semantic sitemap approach [10] extends these
documents with information about the location of RDF data and about
alternative means to access this data (e.g. data dumps and SPARQL
endpoints). Even if the information in a semantic sitemap is only
marginally provenance-related, an important element is the specification
of a URI that represents referenced datasets. Provided that the provider
follows the linked data principles, a look-up of these URIs will yield
RDF-based metadata that may describe provenance information about the
datasets.

Using linked data, in any case, is an important approach to discover
provenance-related metadata. As with the URI of an RDF dataset it is
possible to look up any HTTP URI that represents a piece of provenance
(e.g. the URI of a data item such as a named RDF graph [9] or the URI of
an actor). Moreover, collecting provenance information may involve
following RDF links in order to get more complete information during the
generation of provenance graphs.

Another method to discover metadata about Web resources is POWDER [28],
the Protocol for Web Description Resources. POWDER introduces so-called
description resources to describe resources on the Web. These descriptions
are either based on RDF data or on simple keywords (i.e. tags) and they
may contain provenance information.

Web services that create or provide data are a specific kind of actor
represented by our provenance model. Different standards exist to describe
Web services [32]. These descriptions may also contain information that is
relevant for provenance graphs.

4.2 Provenance-Related Vocabularies
Various vocabularies exist that allow describing provenance information
with RDF data. In the following, we describe these vocabularies and we
relate the classes and properties defined by these vocabularies to the
elements of our provenance model. Afterwards, we study the presence of the
vocabularies on the Web.

A popular standard to represent general-purpose metadata is the Dublin
Core Metadata Terms [16], which are available as an RDFS schema. The
following properties defined by this schema can be associated with a
resource to describe provenance information:
                                                                   • dcterms:contributor1 , dcterms:creator – The con-
to a data item or its host document or it is available as ad-
                                                                     tributor of a resource is defined as “an entity responsi-
ditional data on the Web. Examples for attached metadata
                                                                     ble for making contributions to the resource” [16] and
are RDF statements about an RDF graph that contains the
                                                                     the creator is “an entity primarily responsible for mak-
statements, author and creation date of blog entries added
                                                                     ing the resource.” [16] Thus, these properties may be
to a syndication feed, or information about an image embed-
                                                                     used to obtain information about data creators of a
ded in the image file. Both, attached metadata and detached
                                                                     data item. However, the actual type of the referenced
metadata, may be represented in RDF using vocabularies as
described in Section 4.2 or it may be data of another form.     1
                                                                  We use the following namespace prefixes in this paper:
In the following, we present options to discover detached       dcterms: http://purl.org/dc/terms/
metadata on the Web.                                            dc11: http://purl.org/dc/elements/1.1/
  Accessing data on the Web is often based on HTTP URIs.        foaf: http://xmlns.com/foaf/0.1/
Since these URIs are grounded in the Domain Name System         sioc: http://rdfs.org/sioc/ns#
(DNS) it is possible to query a WHOIS [13] service in order     swp: http://www.w3.org/2004/03/trix/swp-2/
                                                                wot: http://xmlns.com/wot/0.1/
to get provenance information about the accessed data item.     iwProv: http://inferenceweb.stanford.edu/2006/06/pml-provenance.owl#
However, the responses of WHOIS services are hardly usable      ouzo: http://www.mygrid.org.uk/provenance#
for automatic evaluation because the WHOIS protocol does        cs: http://purl.org/vocab/changeset/schema#
     creators as well as their role in the data creation process remain unclear because the Dublin Core schema does not distinguish different types of data creators as our provenance model does. Analyzing data about the creator (or about the contributor) may yield further information (e.g. the type) that can help to derive a more precise provenance graph.

   • dcterms:source – The source of a resource is “a related resource from which the described resource is derived.” [16] With this property it is possible to create provenance elements associated as source data with a data creation element.

   • dcterms:created – This property specifies the creation date of a resource and can be used to set the creation time attribute associated with the execution of a data creation.

   • dcterms:modified – This property specifies the date on which a resource has been changed. We propose to represent the modification of a data item as a data creation which creates a new, modified version of the original data item. The creation time attribute associated with this data creation can be set using the dcterms:modified property.

   • dcterms:publisher – The publisher of a resource is “an entity responsible for making the resource available.” [16] This property may be used to obtain information about a provider of an information resource, whereas the actual type of the provider (data providing service, data publisher, or service provider) remains unclear.

   • dcterms:provenance – This property links a resource to “a statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation.” [16] Due to the very general definition, it is difficult to use such a provenance statement during the creation of a provenance graph.

The Friend of a Friend (FOAF) vocabulary [6] provides classes and properties to describe entities such as persons, organizations, groups and software agents. FOAF-based descriptions can be used to obtain basic information about actors (e.g. names, group membership, email addresses, identifying online accounts). Furthermore, FOAF contains the property foaf:maker and its inverse foaf:made to relate the described entities to resources made by the entities. These properties can be used to identify the data creator of a data item. However, using these properties raises the same questions as for the dcterms:creator property.

The Semantically-Interlinked Online Communities (SIOC) ontology [2] describes information from online communities. The ontology associates SIOC items such as blog posts, comments, and e-mail messages with users that are identified by their online accounts. The following properties describe provenance-relevant information:

   • sioc:has_creator, sioc:creator_of, sioc:has_modifier, sioc:modifier_of – These properties relate a SIOC item to a user who created it or who modified it. In contrast to the corresponding Dublin Core and FOAF properties, representing the referenced users according to our provenance model is less ambiguous: creators and modifiers referenced in a SIOC-based description are data creating entities.

   • sioc:has_owner, sioc:owner_of – These properties express ownership of SIOC items. This information may provide an indication of the relation of a data publisher to provided data and, thus, might be used to set the corresponding attribute of the entity in a provenance graph.

   • sioc:earlier_version, sioc:later_version, sioc:next_version, sioc:previous_version – These properties relate different versions of a SIOC item with each other and could be used to create relationships between artifacts in a provenance graph.

The Semantic Web Publishing Vocabulary (SWP) [9] enables the description of information about provision of data. With SWP it is possible to represent the attitude of a legal person to an RDF graph. SWP supports two attitudes: claiming the graph is true and quoting the graph without a comment on its truth. These commitments towards the truth can be used to derive a data publisher’s or a data creating entity’s relation to provided or created artifacts. Furthermore, SWP can describe digests and digital signatures of RDF graphs and represent public keys. Similarly, the Web Of Trust schema (WOT) [5] enables descriptions that document the use of public key cryptography tools to sign documents. However, two differences between WOT and the signature-related part of SWP exist. First, digital signatures in SWP are represented as RDF data; WOT, in contrast, refers to signatures that are individually encoded in dedicated documents. Second, while the digital signatures described with WOT sign information resources, the signatures in SWP-based descriptions sign a specific kind of data item, namely RDF graphs. Nevertheless, both SWP-based and WOT-based descriptions can be used to obtain information about public keys and digital signatures in order to represent them in a provenance graph.

Further vocabularies that can be used to describe provenance information of specific types of data items are the following:

   • The Ontology Metadata Vocabulary (OMV) [27] describes ontologies. OMV includes properties for creators, contributors, reviewers, and creation and modification dates.

   • The Proof Markup Language [11] describes justifications for results of an answering engine or an inference process.

   • The Changeset Vocabulary [31] describes changes to RDF-based resource descriptions.

   • The Ouzo Provenance Ontology [33] describes the run of a (scientific) workflow, the processed data, and the entities responsible for the workflow run.

4.3    Existence of Provenance Metadata

To study the existence of metadata on the Web that uses the aforementioned vocabularies and their provenance-relevant properties, we utilized two Web data indexes available on the Web, namely the Web service Ping the Semantic Web (PTSW) [3] and the Sindice search engine [26]. PTSW
is a service that receives notifications from different applications that create, update, or discover RDF documents on the Web; PTSW aggregates these notifications and provides an up-to-date list of existing RDF documents. Furthermore, the PTSW website provides different statistics. Currently, PTSW indexes 1,073,218² RDF documents. For each of the vocabularies discussed before, Table 1 presents the number of documents registered at PTSW that use this vocabulary. As the numbers indicate, FOAF and SIOC are widely used. The other vocabularies are either not used at all or used in only an insignificant number of documents.

Table 1: Number of RDF documents known to PTSW that use a vocabulary (as of Feb. 7, 2009).

    vocabulary                                 occurrences
    Dublin Core Metadata Terms                         121
    Dublin Core Metadata Element (legacy)                9
    FOAF                                           989,263
    SIOC                                           127,974
    SWP                                                  1
    WOT                                                101
    Proof Markup Language                                0
    Ouzo Provenance Ontology                             0
    Changeset Vocabulary                                 0

The statistics of PTSW may not be representative because they heavily depend on the applications that notify PTSW. For this reason, we utilized the Sindice search engine for another inquiry. Sindice indexes structured data from the Web. We queried Sindice for documents that contain RDF statements with the provenance-relevant properties of the vocabularies. Table 2 in Appendix A lists the number of documents in the Sindice index for each property. According to Sindice, Dublin Core is used roughly as often as FOAF, a conclusion that cannot be drawn from the PTSW statistics. Furthermore, in contrast to the PTSW numbers, the Sindice queries reveal that the legacy Dublin Core metadata elements are used more widely than the recommended new metadata terms. Consistent with the findings from the PTSW statistics, the SWP, the Proof Markup Language, the Ouzo Provenance Ontology, and the Changeset Vocabulary are not used at all³. Moreover, considering that Sindice currently indexes about 48.99 million documents, the numbers for the other vocabularies are not satisfactory either. Thus, we conclude that there is only very little provenance-related, RDF-based metadata available on the Web.

² All numbers are from February 7, 2009.
³ Occurrences of 1 to 3 in Table 2 refer to documents that specify the corresponding properties.

5.    OPEN QUESTIONS

Our analysis of vocabularies that can express provenance information reveals two problems. First, the vocabularies are partly unsuitable and lack certain features. Our review of the vocabularies shows that certain properties are ambiguous. In particular, it is difficult to distinguish between the different types of providers and it is impossible to express the actual relationships between providers of different types. The same holds for data creators. Furthermore, it is impossible to describe the execution of a data access. These descriptions may be required to document the access of source data executed by a third party.

To overcome these issues we propose to develop a vocabulary that enables data publishers to describe the provenance of the provided data more precisely. This new vocabulary may refine existing vocabularies. Nonetheless, the development of this new vocabulary should be based on the presented provenance model.

The second problem is the general lack of provenance-related metadata in the Web of linked data. Reasons might be the lack of suitable vocabularies, a lack of usable tools to generate provenance-related metadata, and ignorance or at least a lack of awareness. The first two reasons can be ascribed to technical problems that should be solvable by the development of the proposed vocabulary and by the implementation of corresponding tools. The lack of awareness is a more general problem that must be addressed by the linked data community. A possible approach may evolve based on the recently released Vocabulary of Interlinked Datasets (voiD) [34], which is a vocabulary to describe the content of RDF-based datasets and the links between different datasets. Since voiD enables the discovery and usage of linked datasets, it may raise publishers' awareness of the need to provide metadata for their datasets. This understanding may be used to motivate publishers to provide provenance information along with their voiD descriptions.

6.   CONCLUSIONS

In this paper we propose a provenance model for Web data and we discuss options to obtain provenance information. In contrast to provenance research in areas such as workflows and databases, the analysis of Web data provenance must include information about the access of data on the Web. Thus, our provenance model includes two dimensions: data creation and data access. By specifying the relationships of rather general types of pieces of provenance information, our model describes provenance on an abstract level. This generality gives applications a choice to refine the model according to their use case.

Based on our provenance model we describe options to obtain provenance information and we analyze vocabularies to express such information. Our analysis identifies several open questions. We aim to address these questions in the future.

As further future work we will develop concepts to estimate the trustworthiness of Web data based on our provenance model. These estimations have to consider subjective assessments of the elements in a provenance graph (e.g. the reliability of a data creator) as well as the trustworthiness of the data that has been used to create the provenance graph.

7.   ACKNOWLEDGMENT

We thank Jun Zhao for her valuable feedback on our provenance model.

8.   REFERENCES

 [1] C. Bizer and R. Cyganiak. D2R Server – Publishing Relational Databases on the Semantic Web.
     Poster at the 5th International Semantic Web Conference (ISWC), Nov. 2006.
 [2] U. Bojars and J. G. Breslin. SIOC Core Ontology Specification, Revision 1.30. Online,
     Jan. 2009. Retrieved Feb. 7.
 [3] U. Bojars, A. Passant, F. Giasson, and J. Breslin. An Architecture to Discover and Query
     Decentralized RDF Data. In Proceedings of the 3rd Workshop on Scripting for the Semantic
     Web (SFSW) at ESWC, June 2007.
 [4] R. Bose and J. Frew. Lineage retrieval for scientific data processing: A survey. ACM
     Computing Surveys, 37(1):1–28, Mar. 2005.
 [5] D. Brickley. Web Of Trust RDF Ontology. Online, Feb. 2004. Retrieved Feb. 7.
 [6] D. Brickley and L. Miller. FOAF Vocabulary Specification. Online, Nov. 2007.
     Retrieved Feb. 7.
 [7] P. Buneman, S. Khanna, and W. C. Tan. Data Provenance: Some Basic Issues. In Proceedings
     of the 20th Conference on Foundations of Software Technology and Theoretical Computer
     Science (FST TCS). Springer, Dec. 2000.
 [8] P. Buneman, S. Khanna, and W. C. Tan. Why and Where: A Characterization of Data
     Provenance. In Proceedings of the 8th International Conference on Database Theory (ICDT).
     Springer, Jan. 2001.
 [9] J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named Graphs, Provenance and Trust.
     In Proceedings of the 14th International World Wide Web Conference (WWW). ACM Press,
     May 2005.
[10] R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, and G. Tummarello. Semantic Sitemaps:
     Efficient and Flexible Access to Datasets on the Semantic Web. In Proceedings of the 5th
     European Semantic Web Conference (ESWC). Springer, June 2008.
[11] P. P. da Silva, D. L. McGuinness, and R. Fikes. A Proof Markup Language for Semantic Web
     Services. Information Systems, 31(4-5):381–395, June 2006.
[12] P. P. da Silva, D. L. McGuinness, and R. McCool. Knowledge Provenance Infrastructure.
     Data Engineering Bulletin, 26(4):26–32, Dec. 2003.
[13] L. Daigle. WHOIS Protocol Specification. IETF RFC 3912, Sept. 2004.
[14] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher, T. M. McPhillips, S. Bowers,
     M. K. Anand, and J. Freire. Provenance in Scientific Workflow Systems. IEEE Data
     Engineering Bulletin, 30(4):44–50, Dec. 2007.
[15] L. Ding, T. Finin, Y. Peng, P. P. da Silva, and D. L. McGuinness. Tracking RDF Graph
     Provenance using RDF Molecules. Technical Report TR-CS-05-06, UMBC, Apr. 2005.
[16] Dublin Core Metadata Initiative Usage Board. DCMI Metadata Terms. Online, Jan. 2008.
     Retrieved Feb. 7.
[17] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee. Hypertext Transfer
     Protocol – HTTP/1.1. IETF RFC 2068, Jan. 1997.
[18] Y. Gil and D. Artz. Towards Content Trust of Web Resources. Journal of Web Semantics,
     5(4):227–239, Dec. 2007.
[19] S. Goldwasser, S. Micali, and R. Rivest. A Digital Signature Scheme Secure Against
     Adaptive Chosen-Message Attacks. SIAM Journal on Computing, 17(2):281–308, Apr. 1988.
[20] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance Semirings. In Proceedings of
     the 26th Symposium on Principles of Database Systems (PODS). ACM, June 2007.
[21] A. Harth, A. Polleres, and S. Decker. Towards a Social Provenance Model for the Web.
     In Proceedings of the Workshop on Principles of Provenance, Nov. 2007.
[22] M. Hausenblas, W. Slany, and D. Ayers. A Performance and Scalability Metric for Virtual
     RDF Graphs. In Proceedings of the 3rd Workshop on Scripting for the Semantic Web (SFSW)
     at ESWC, June 2007.
[23] M. Horridge, B. Parsia, and U. Sattler. Laconic and Precise Justifications in OWL.
     In Proceedings of the 7th International Semantic Web Conference (ISWC). Springer,
     Oct. 2008.
[24] I. Jacobs and N. Walsh. Architecture of the World Wide Web, Volume One. W3C
     Recommendation, Online, Dec. 2004. Retrieved Feb. 7.
[25] L. Moreau, B. Plale, S. Miles, C. Goble, P. Missier, R. Barga, Y. Simmhan, J. Futrelle,
     R. McGrath, J. Myers, P. Paulson, S. Bowers, B. Ludaescher, N. Kwasnikowska,
     J. Van den Bussche, T. Ellkvist, J. Freire, and P. Groth. The Open Provenance Model.
     Technical report, Electronics and Computer Science, University of Southampton, 2008.
[26] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello.
     Sindice.com: A Document-oriented Lookup Index for Open Linked Data. International Journal
     of Metadata, Semantics and Ontologies, 3(1):37–52, 2008.
[27] R. Palma, J. Hartmann, and P. Haase. OMV Ontology Metadata Vocabulary for the Semantic
     Web, v2.4. Online, Jan. 2008. Retrieved Feb. 7.
[28] K. Scheppe. Protocol for Web Description Resources (POWDER): Primer. W3C Working Draft,
     Online, Nov. 2008. Retrieved Feb. 7.
[29] Y. Simmhan, B. Plale, and D. Gannon. A Survey of Data Provenance in e-Science. SIGMOD
     Record, 34(3):31–36, Sept. 2005.
[30] W. C. Tan. Provenance in Databases: Past, Current, and Future. IEEE Data Engineering
     Bulletin, 30(4):3–12, Dec. 2007.
[31] S. Tunnicliffe and I. Davis. Changeset Vocabulary. Online, Mar. 2006. Retrieved Feb. 7.
[32] K. Verma and A. Sheth. Semantically Annotating a Web Service. IEEE Internet Computing,
     11(2):83–85, Mar. 2007.
[33] J. Zhao. A conceptual model for e-science provenance. PhD thesis, University of
     Manchester, June 2007.
[34] J. Zhao, K. Alexander, M. Hausenblas, and R. Cyganiak. Vocabulary of Interlinked Datasets.
     Online, Jan. 2009. Retrieved Feb. 7.

APPENDIX

A. USAGE OF VOCABULARIES

This appendix details the result of a Sindice-based search for documents that contain certain provenance-relevant properties. Table 2 lists each property together with the number of documents that contain at least one RDF statement with the property. The numbers have been recorded on February 7, 2009, by querying the Sindice search engine.
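
The comparison between the legacy Dublin Core elements (dc11) and the newer Dublin Core terms (dcterms) drawn in Section 4.3 can be checked property by property against Table 2. The following Python sketch is only an illustration: it hard-codes the Dublin Core counts from Table 2 (values reported as "about N" are stored as N), and the dictionary layout and the function name legacy_more_common are assumptions introduced here, not part of the original analysis.

```python
# Document counts transcribed from Table 2 (Sindice, February 7, 2009).
# Entries reported as "about N" in the table are stored as N.
DUBLIN_CORE_COUNTS = {
    "creator": {"dcterms": 134, "dc11": 24_150},
    "contributor": {"dcterms": 11, "dc11": 465},
    "source": {"dcterms": 1, "dc11": 3_630},
    "created": {"dcterms": 73_010, "dc11": 9_710},
    "modified": {"dcterms": 2_320, "dc11": 9_700},
    "publisher": {"dcterms": 87, "dc11": 808},
}


def legacy_more_common(counts):
    """Return, sorted by name, the properties whose dc11 count exceeds the dcterms count."""
    return sorted(
        name
        for name, by_prefix in counts.items()
        if by_prefix["dc11"] > by_prefix["dcterms"]
    )


if __name__ == "__main__":
    for name in legacy_more_common(DUBLIN_CORE_COUNTS):
        row = DUBLIN_CORE_COUNTS[name]
        print(f"{name}: dc11={row['dc11']}, dcterms={row['dcterms']}")
```

For the counts in Table 2, the function returns contributor, creator, modified, publisher, and source; only created is reported in more documents with the dcterms prefix than with the dc11 prefix.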
Table 2: Provenance-relevant properties and the number of documents in which they occur at least once (according to the Sindice search engine).

    property                          occurrences
    dcterms:creator                           134
    dc11:creator                     about 24,150
    dcterms:contributor                        11
    dc11:contributor                          465
    dcterms:source                              1
    dc11:source                       about 3,630
    dcterms:created                  about 73,010
    dc11:created                      about 9,710
    dcterms:modified                  about 2,320
    dc11:modified                     about 9,700
    dcterms:publisher                          87
    dc11:publisher                            808
    dcterms:provenance                          7
    foaf:made                         about 5,420
    foaf:maker                       about 29,370
    sioc:creator_of                   about 1,370
    sioc:has_creator                  about 4,520
    sioc:modifier_of                            3
    sioc:has_modifier                           4
    sioc:owner_of                     about 3,020
    sioc:has_owner                      about 553
    sioc:earlier_version                        0
    sioc:later_version                          0
    sioc:next_version                           3
    sioc:previous_version                       3
    swp:assertedBy                              0
    swp:authority                               0
    swp:quotedBy                                0
    swp:validUntil                              0
    wot:assurance                             135
    wot:fingerprint                            52
    wot:hasKey                                 23
    wot:hex_id                                 48
    wot:identity                               36
    wot:length                                 43
    wot:pubkeyAddress                          54
    wot:sigdate                                 8
    wot:signed                                  3
    wot:signer                                  2
    iwProv:hasMember                            1
    iwProv:isMemberOf                           1
    iwProv:hasPublisher                         1
    iwProv:hasPublicationDateTime               1
    iwProv:hasUsageDateTime                     1
    iwProv:hasSource                            1
    iwProv:hasInferenceEngineRule               1
    iwProv:usesInferenceEngine                  1
    ouzo:belongsTo                              2
    ouzo:dataDerivedFrom                        2
    ouzo:launchedBy                             2
    ouzo:processInput                           2
    ouzo:runsWorkflow                           2
    cs:createdDate                              3
    cs:creatorName                              3

B.   PROVENANCE ITEM TYPES

This appendix lists examples for various provenance element types in the data creation dimension (cf. Table 3) and in the data access dimension (cf. Table 4).

Table 3: Examples of provenance element types in the data creation dimension.

    element type             examples
    data item                RDF statement
                             RDF graph
                             subgraph of an RDF graph
                             axiom in a knowledge base
                             data product created by workflow
                             set of data products
                             result of a query
                             table in a database
                             tuple in a database table
    data creation            completion of a Web form
                             execution of a workflow
                             execution of a transformation
                             automatic reasoning
                             execution of a database query
                             execution of a search query
                             mapping from other data models
                             machine learning
    data creating entity     person, group, organization
    data creating device     sensor
    data creating service    software agent
                             workflow engine
                             reasoner
                             query engine
                             search index
                             data wrapper (e.g. D2R Server)
                             (batch) script interpreter
    source data              content of a document
                             (statements in) a dataset
                             (data in) a database
    creation guidelines      workflow model
                             transformation rules
                             entailment/inference rules
                             database query
                             search query
                             mapping definitions

Table 4: Examples of provenance element types in the data access dimension.

    element type             examples
    data access              resource-based data access
                             API-based data access
                             query-based data access
    data providing service   Web server
                             Web service
                             Web-based query interface
    data publisher           person, group, organization
    service provider         person, group, organization
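
The element types of Tables 3 and 4 can be connected to the properties discussed in Section 4.2 by a simple lookup. The following Python sketch shows one possible, simplified reading of that discussion and is not a definitive mapping: the names Element, OBJECT_HINTS, and provenance_hints, the umbrella member PROVIDER, and the ex: identifiers are hypothetical and introduced only for illustration. Where Section 4.2 notes that the exact type remains unclear (dcterms:creator, dcterms:contributor, foaf:maker, dcterms:publisher), the entry is only a first guess that further metadata would have to refine.

```python
from enum import Enum


class Element(Enum):
    """Subset of the provenance element types named in Tables 3 and 4."""
    DATA_CREATING_ENTITY = "data creating entity"
    SOURCE_DATA = "source data"
    # Assumed umbrella value: Section 4.2 states that dcterms:publisher does not
    # reveal whether the provider is a data providing service, a data publisher,
    # or a service provider.
    PROVIDER = "provider (type unclear)"


# Assumed lookup: the element type that the OBJECT of a triple with the given
# property may describe, following the prose of Section 4.2.
OBJECT_HINTS = {
    "dcterms:creator": Element.DATA_CREATING_ENTITY,
    "dcterms:contributor": Element.DATA_CREATING_ENTITY,
    "foaf:maker": Element.DATA_CREATING_ENTITY,
    "sioc:has_creator": Element.DATA_CREATING_ENTITY,
    "sioc:has_modifier": Element.DATA_CREATING_ENTITY,
    "dcterms:source": Element.SOURCE_DATA,
    "dcterms:publisher": Element.PROVIDER,
}


def provenance_hints(triples):
    """Yield (object, element type) for each triple whose property is in OBJECT_HINTS."""
    for _subject, prop, obj in triples:
        hint = OBJECT_HINTS.get(prop)
        if hint is not None:
            yield obj, hint


if __name__ == "__main__":
    # Hypothetical metadata about a document ex:doc, written with prefixed names.
    example = [
        ("ex:doc", "dcterms:creator", "ex:alice"),
        ("ex:doc", "dcterms:source", "ex:raw"),
        ("ex:doc", "rdfs:label", "An example document"),
    ]
    for obj, element in provenance_hints(example):
        print(f"{obj} -> {element.value}")
```

Triples with properties outside the lookup are skipped, which mirrors the observation that only a few properties of the surveyed vocabularies carry provenance-relevant information.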