<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Provenance in Linked Data Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tope Omitola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicholas Gibbins</string-name>
          <email>g@ecs.soton.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nigel Shadbolt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligence, Agents, Multimedia (IAM) Group School of Electronics and Computer Science University of Southampton</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The open world of the (Semantic) Web is a global information space offering diverse materials of disparate quality, and the opportunity to re-use, aggregate, and integrate these materials in novel ways. The advent of Linked Data brings the potential to expose data on the Web, creating new challenges for data consumers who want to integrate these data. One challenge is the ability of users to assess the reliability and/or the accuracy of the data they come across. In this paper, we describe a light-weight provenance extension to the voiD vocabulary that allows data publishers to add provenance metadata to their datasets. These provenance metadata can be queried by consumers and used as contextual information for the integration and inter-operation of information resources on the Semantic Web.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Public Open Data</kwd>
        <kwd>Data Publication</kwd>
        <kwd>Data Consumption</kwd>
        <kwd>Semantic Web</kwd>
        <kwd>Provenance</kwd>
        <kwd>Data Integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Whom do you trust? In the human realm, you must earn the trust of others,
not assume it. You cannot, for example, expect them to simply hand you their
money based on your assurances that you are an honourable person (although
they may do so through a referral). Conversely, you will give someone else your
money or your information only after you have established that they will handle
it appropriately (responsibly). And how do you generate that trust? You
generate it not by declaring "I'm trustworthy", but by revealing as much
information about yourself as possible.</p>
      <p>The above are examples of integration and inter-operation of transactions
enabled by notions of quality and trust. Similar examples can be found on the
Web. A user on the Web may be confronted with a potentially large number
of diverse data sources of variable maturity or quality, and selecting the
high-quality, trustworthy data that are pertinent to their uses and integrating
these together may be difficult. The advent of Linked Data brings the potential
to expose data on the Web, pointing towards a clear trend where users will be
able to easily aggregate, consume, and republish data. It is therefore necessary
for end-users to be able to judge the quality and trustworthiness of the information
at hand.</p>
      <p>While in some cases the answer speaks for itself, in others the user will not
be confident of the answer unless they know how and why the answer has been
produced and where the data has come from. Users want to know about the
reliability and the accuracy of the data they see. Thus, to gain the trust of a user,
a data integration system must be able, if required, to provide an explanation or
justification for an answer. Since the answer is usually the result of a reasoning
process, the justification can be given as a derivation of the conclusion
together with the sources of information for the various steps.</p>
      <p>Provenance, also known as lineage, describes how an object came to be in
its present state, and thus it describes the evolution of the object over time.
Providing the provenance of information resources on the Web can serve as
a basis for assessing information quality, improving the contextual
information behind the generation, transformation, and integration of information
on the Web.</p>
    </sec>
    <sec id="sec-2">
      <title>Provenance</title>
      <p>There are two major research strands of provenance in the literature: data
provenance and workflow provenance. In the scientific enterprise, a workflow is typically
used to perform complex data processing tasks. A workflow can be thought of
as a set of procedural steps, computational and human, that one enacts to get from
the starting state to the goal state. Workflow provenance refers to the record
of the entire history of the derivation of the final output of the workflow. The
details of the recording vary from one experiment to another. They may depend on
the goals of the experiment, the regulatory and compliance procedures, and a
number of other things. Recording may involve the software programs,
the hardware, and the instruments used in the experiment.</p>
      <p>Data provenance, on the other hand, is concerned with the derivation
of a particular piece of data that appears in the result of a transformation step. It refers to a
description of the origins of a piece of data and the process by which it arrives
in a database.</p>
    </sec>
    <sec id="sec-3">
      <title>Past and Current Work on Provenance</title>
      <p>
        There are many surveys of existing work on provenance from the workflow [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
database [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] research communities. There has also been work on the quality
assessment of data that addresses issues of provenance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The Open
Provenance Model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] allows the characterisation of the dependencies
between "things", and it consists of a directed graph expressing such dependencies.
It is not light-weight, but it can be used to describe part of the provenance
relationships that are of concern to a dataset publisher.
      </p>
      <p>
        Berners-Lee's "Oh yeah?" button [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was meant to challenge the origins,
i.e. the provenance, of what is being asserted and to request proofs, by directly or
indirectly consulting the meta-information of what is being asserted. Named
graphs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] allow entire groups of RDF graphs to be given a URI, so that
provenance information can be attached to those graphs. The Semantic Web
Publishing Vocabulary (SWP) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is an RDF-Schema vocabulary for expressing
meta-information related to information provision and for assuring the origin of
information with digital signatures. It can be used within the named graph
framework to integrate information about provenance, assertional status, and
digital signatures of graphs. An RDF graph is a set of RDF triples, so
an RDF graph may contain a few triples or very many. The named graph
framework therefore does not give good control over the granularity of the collection of
data items to which provenance is attached. In this work, we use some elements of
the SWP.
      </p>
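      <p>As an illustration, attaching provenance to a named graph can be sketched in TriG syntax as follows. This is a minimal sketch, not an example from our datasets: the example.org URIs, the population triple, and the date value are all hypothetical; only the swp: property names come from the SWP vocabulary.</p>
      <preformat>
@prefix swp: &lt;http://www.w3.org/2004/03/trix/swp-2/&gt; .
@prefix dc:  &lt;http://purl.org/dc/elements/1.1/&gt; .
@prefix ex:  &lt;http://example.org/&gt; .

# The data itself, grouped under a graph URI
ex:graph1 {
    ex:Southampton ex:population "234600" .
}

# Provenance statements about ex:graph1 as a whole,
# held in a second named graph
ex:warrantyGraph {
    ex:graph1 swp:assertedBy ex:warrant1 .
    ex:warrant1 swp:authority ex:publisher ;
                dc:date "2010-11-01" .
}
      </preformat>
      <p>Because the provenance statements are made about the graph URI rather than about individual triples, this style attaches provenance at whatever granularity the graph happens to have.</p>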
      <p>
        The Provenance Vocabulary[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] provides classes and properties that enable providers
of Web data to publish provenance-related metadata about their data. The
vocabulary provides classes, called Artifacts, Executions, and Actors, that can be
used to specify provenance for data access and data creation at the triple level.
An Actor performs an Execution on an Artifact. In the Provenance Vocabulary,
there are different types of actors that perform different types of executions over
diverse types of artifacts. Although encoding at the triple level is fine-grained
and lets provenance data be attached to a single triple, a big dataset may contain a
large number of triples, and encoding at the triple level may lead to the provenance
information being much larger than the actual data.
      </p>
      <p>In the linked data world, data are usually collected together and provided as
datasets. The provision of provenance information for datasets' elements is
an interesting problem.</p>
    </sec>
    <sec id="sec-4">
      <title>Provenance of Linked Data and Datasets: Challenges</title>
      <p>There are two major challenges in the provision of provenance information for
linked data, namely provenance representation and provenance storage.</p>
      <sec id="sec-4-1">
        <title>Provenance Storage</title>
        <p>Provenance information can sometimes be larger than the data it describes, if the
data items under provenance control are fine-grained and the information provided
very rich. However, one can reduce storage needs by recording only those data collections
that are important for the operational aspects of the dataset publisher's business.</p>
        <p>
          Provenance can be tightly coupled to the data it describes and located in
the same data storage system, or even be embedded within the data file, as
advocated in tSPARQL [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Such approaches can ease maintaining the integrity
of provenance, but they make it harder to publish and search just the provenance. They
can also lead to a large amount of provenance information needing to be stored.
Provenance can also be stored by itself or with other metadata. Once you
decide how to store the provenance data, provenance representation itself is
another major challenge.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Provenance Representation</title>
        <p>
          There are two major approaches to representing provenance information, and
these alternative representations have implications for the cost of recording provenance
and the richness of its uses. The two approaches are:
- The inversion method: this uses the relationships between the input data,
working backwards, to derive the output data, giving the records of this trace.
Examples include queries and user-defined functions in databases that can be
inverted automatically or by explicit functions [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Here, information about
the queries and the output data may be sufficient to identify the source data.
- The annotation method: metadata on the derivation history of a datum is
collected as annotations, together with descriptions of the source data and
processes. Here, provenance is pre-computed and readily usable as metadata.
        </p>
        <p>While the inversion method is more compact than the annotation approach,
the information it provides is sparse and limited to the derivation history of the
data. The annotation method provides richer information that goes beyond
the derivation history and may include the parameters passed to the derivation
processes, the post-conditions, and so on.</p>
        <p>
          We advocate the use of the annotation method as it gives richer information
about the data and the datasets we may be interested in. The voiD (Vocabulary of
Interlinked Datasets) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] vocabulary can be employed to describe the provenance
information of the data we are interested in. voiD is an RDF-based schema
for describing datasets. With voiD, the discovery and usage of datasets can be
performed both effectively and efficiently. Using voiD also has the added benefit
of storing the provenance information with the other metadata of our datasets.
There are two core classes at the heart of voiD:
1. A dataset (void:Dataset), i.e. a collection of data, which is:
- published and maintained by a single provider,
- available as RDF,
- accessible, for example, through dereferenceable HTTP URIs or a SPARQL
(http://www.w3.org/TR/rdf-sparql-query/) endpoint.
2. The interlinking modelled by a linkset (void:Linkset). A linkset in voiD is
a subclass of a dataset, used for describing the interlinking relationship
between datasets. In each interlinking triple, the subject is a resource hosted
in one dataset and the object is a resource hosted in another dataset. This
modelling enables a flexible and powerful way to state the interlinking
between two datasets, such as how many links exist, the kind of links,
and who made these statements.
        </p>
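        <p>The two core classes can be illustrated with a small voiD description in Turtle. This is a sketch only: the ex: dataset URIs and the endpoint address are hypothetical, while the void: and dcterms: terms are taken from the published vocabularies.</p>
        <preformat>
@prefix void:    &lt;http://rdfs.org/ns/void#&gt; .
@prefix dcterms: &lt;http://purl.org/dc/terms/&gt; .
@prefix owl:     &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix ex:      &lt;http://example.org/&gt; .

# Two datasets, each a void:Dataset
ex:crimeDS a void:Dataset ;
    dcterms:publisher ex:homeOffice ;
    void:sparqlEndpoint &lt;http://example.org/sparql&gt; .

ex:popDS a void:Dataset .

# A linkset describing the interlinking between the two datasets
ex:crime2pop a void:Linkset ;
    void:target ex:crimeDS, ex:popDS ;
    void:linkPredicate owl:sameAs .
        </preformat>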
        <p>
          <bold>voidp: Provenance Extension to voiD</bold>
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a linked data(set) publisher is advised to reuse terms from well-known
vocabularies wherever possible, and to define new terms only when none can be
found in existing vocabularies.
        </p>
        <sec id="sec-4-2-1">
          <title>Reused Vocabularies and voidp Terms</title>
          <p>
            Reusing existing vocabularies takes advantage of
the ease of bringing together diverse domains within RDF, and it makes data
more reusable. By reusing vocabularies, the data is no longer isolated nor locked
within a single context designed for a single use. We adhered to this advice and
have made use of the following ontologies:
- the Provenance Vocabulary [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ],
- the Time Ontology in OWL [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ],
- the Semantic Web Publishing Vocabulary [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
          <p>In addition, the namespace for voidp is:
@prefix voidp: &lt;http://purl.org/void/provenance/ns/&gt;.
The classes are:
- Actor: here, we reuse the Actor class from the Provenance Vocabulary to specify
an entity or an object that performs an action on a particular data item (or
a data source or data set),
- Provenance: a container class for the list of DataItem(s) placed under
provenance control,
- DataItem: models the item of data placed under provenance control.</p>
          <p>The properties are:
1. activity: specifies that a particular dataset has some items
under provenance control,
2. item: specifies the item under provenance control,
3. originatingSource: the item's original source,
4. originatingSourceURI: the URI of the item's original source,
5. originatingSourceLabel: the label text used to describe the item's original
source,
6. certification: if the dataset is signed, this property contains the
signature elements. This is an important element for proving the origin of a
dataset as it is sliced and diced during its evolution,
7. swp:signature: represents the signature of the dataset,
8. swp:signatureMethod: specifies the signature method,
9. swp:authority: defines the authority of the relationship between the item
under provenance control and the dataset publisher,
10. swp:valid-from and swp:valid-until: the valid start and end
dates of that (authority) relationship,
11. processType: specifies the type of transformation or conversion procedure
carried out on the item's source, e.g. the transformation may be due to some
scripts being run on the source data,
12. prv:createdBy: specifies the actor that executes an action on the item that
is being recorded,
13. prv:performedAt: the date when the transformation was done,
14. prv:performedBy: the URI of the actor that performs the recording of the
provenance activity on the item.</p>
          <p>These classes and properties are sufficient for specifying
information for both workflow and data provenance.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <p>Our group, the EnAKTing group (http://enakting.org), is dedicated to solving fundamental problems
in achieving an effective web of linked data, and as part of our work, we make use
of some of the United Kingdom's government data. As part of our group's work,
we recently converted a set of government data files from
comma-separated values (CSV) to RDF datasets.</p>
      <sec id="sec-5-1">
        <title>Source Datasets</title>
        <p>Some of these data files were:
- Mortality data:
http://www.statistics.gov.uk/downloads/theme_population/Table_3_Deaths_Area_Local_Authority.xls,
- Population data:
http://www.statistics.gov.uk/downloads/theme_population/Mid2003_Parl_Con_quinary_est.xls,
- Energy:
http://www.decc.gov.uk/assets/decc/statistics/regional/road_transport/file45728.xls,
- CO2 emission:
http://www.decc.gov.uk/assets/decc/statistics/climate_change/1_20100122174542_e_@@_localregionalco2emissionsest20057.xls,
- Crime:
http://www.homeoffice.gov.uk/rds/pdfs09/hosb1109chap7.xls.</p>
        <p>We used voiD to describe these datasets. The datasets and their voiD
descriptions were inserted into our RDF database, 4store (http://4store.org/). The voiD
descriptions used can be found at http://152.78.189.49/voidp/. The provenance
elements can be seen in the voiD descriptions.</p>
        <p>Example Scenario Query. We may be interested in an example query such as
the following:</p>
        <p>"Give the originating URLs of the datasets for robbery and female population
for the County of Durham in the United Kingdom for 2004. Also give the CO2
emission values and total energy consumption values for that same area. Only
give datasets that are from the United Kingdom Home Office and from the
United Kingdom's Department of Energy and Climate Change."</p>
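        <p>Such a request might be sketched as a SPARQL query over the voiD/voidp descriptions along the following lines. The query shape and the label values it filters on are assumptions for illustration, not the exact contents of our store:</p>
        <preformat>
PREFIX voidp: &lt;http://purl.org/void/provenance/ns/&gt;
PREFIX void:  &lt;http://rdfs.org/ns/void#&gt;

SELECT ?dataset ?sourceURI
WHERE {
  # Datasets with items under provenance control
  ?dataset a void:Dataset ;
           voidp:activity ?prov .
  ?prov voidp:item ?item .
  # The originating source of each item
  ?item voidp:originatingSourceURI ?sourceURI ;
        voidp:originatingSourceLabel ?label .
  # Restrict to the two requested publishers
  FILTER (regex(?label, "Home Office") ||
          regex(?label, "Energy and Climate Change"))
}
        </preformat>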
        <sec id="sec-5-1-1">
          <title>Query Results</title>
          <p>Running such a query, we are given the source URLs stated in the
Source Datasets subsection above.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>The provenance of a data element can be used to elicit that data's quality and/or
trustworthiness. Data quality can be used as contextual information to aid
data integration. This paper described voidp, a light-weight provenance
extension to the voiD vocabulary that allows data publishers to add provenance
metadata to the elements of their datasets, and enumerated its classes and
properties. These provenance metadata can be used by a data integration system or
consumer for data aggregation and inter-operation.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was supported by the EnAKTing project, funded by EPSRC project
number EP/G008493/1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>K.</given-names>
            <surname>Alexander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Describing linked datasets</article-title>
          .
          <source>LDOW2009</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D.</given-names>
            <surname>Artz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          .
          <article-title>A survey of trust in computer science and the semantic web</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Cleaning up the user interface</article-title>
          . http://www.w3.org/DesignIssues/UI.html (retrieved Nov.
          <year>2010</year>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Semantic web publishing vocabulary (swp) user manual</article-title>
          . http://www4.wiwiss.fu-berlin.de/bizer/WIQA/swp/SWP-UserManual.pdf (retrieved Nov.
          <year>2010</year>
          ),
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <source>Quality-Driven Information Filtering in the Context of Web-Based Information Systems</source>
          . VDM Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hayes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Stickler</surname>
          </string-name>
          .
          <article-title>Named graphs, provenance and trust</article-title>
          .
          <source>WWW '05 Proceedings of the 14th international conference on World Wide Web</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          .
          <article-title>How to publish linked data on the web</article-title>
          . http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ (retrieved Nov.
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Provenance in databases: Why, where and how</article-title>
          .
          <source>Foundations and Trends in Databases</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          .
          <article-title>Provenance information in the web of data</article-title>
          .
          <source>LDOW2009</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          .
          <article-title>Querying trust in rdf data with tsparql</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          ,
          <volume>5554</volume>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Using web data provenance for quality assessment</article-title>
          .
          <source>Proceedings of the 1st Int. Workshop on the Role of Semantic Web in Provenance Management, ISWC</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pan</surname>
          </string-name>
          .
          <article-title>An ontology of time for the semantic web</article-title>
          .
          <source>ACM Transactions on Asian Language Processing (TALIP): Special issue on Temporal Information Processing</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Clifford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Futrelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          , et al.
          <article-title>The open provenance model core specification (v1.1)</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>Trio: A system for integrated management of data, accuracy, and lineage</article-title>
          .
          <source>CIDR</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>