<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>th International Semantic Web Conference ISWC-</string-name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Chairs</title>
      <sec id="sec-1-1">
        <title>Amit Sheth</title>
      </sec>
      <sec id="sec-1-2">
        <title>Vassilis Christophides</title>
        <p>Organizing Committee/PC Co-Chairs</p>
      </sec>
      <sec id="sec-1-3">
        <title>Juliana Freire</title>
      </sec>
      <sec id="sec-1-4">
        <title>Paolo Missier</title>
      </sec>
      <sec id="sec-1-5">
        <title>Satya S. Sahoo</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Program Committee</title>
      <sec id="sec-2-1">
        <title>Aleksander Slominski, IBM Research</title>
      </sec>
      <sec id="sec-2-2">
        <title>Bertram Ludäscher, University of California Davis</title>
      </sec>
      <sec id="sec-2-3">
        <title>Beth Plale, Indiana University</title>
      </sec>
      <sec id="sec-2-4">
        <title>Claudio Silva, University of Utah</title>
      </sec>
      <sec id="sec-2-5">
        <title>Francisco Curbera, IBM Research</title>
      </sec>
      <sec id="sec-2-6">
        <title>Giorgos Flouris, FORTH-ICS, Greece</title>
      </sec>
      <sec id="sec-2-7">
        <title>Ilkay Altintas, San Diego Supercomputer Center, UCSD</title>
      </sec>
      <sec id="sec-2-8">
        <title>James Cheney, University of Edinburgh</title>
      </sec>
      <sec id="sec-2-9">
        <title>Jun Zhao, Oxford University</title>
      </sec>
      <sec id="sec-2-10">
        <title>Kei Cheung, Yale University</title>
      </sec>
      <sec id="sec-2-11">
        <title>Krishnaprasad Thirunarayan, Wright State University</title>
      </sec>
      <sec id="sec-2-12">
        <title>Luc Moreau, University of Southampton</title>
      </sec>
      <sec id="sec-2-13">
        <title>Nirmal Mukhi, IBM Research</title>
      </sec>
      <sec id="sec-2-14">
        <title>Olivier Bodenreider, National Library of Medicine, NIH</title>
      </sec>
      <sec id="sec-2-15">
        <title>Paulo Pinheiro da Silva, University of Texas at El Paso</title>
      </sec>
      <sec id="sec-2-16">
        <title>Peter Fox, Tetherless World Research Constellation, RPI</title>
      </sec>
      <sec id="sec-2-17">
        <title>Roger Barga, Microsoft Research</title>
      </sec>
      <sec id="sec-2-18">
        <title>Sarah Cohen-Boulakia, Universite Paris-Sud</title>
      </sec>
      <sec id="sec-2-19">
        <title>Sudha Ram, Arizona State University</title>
      </sec>
      <sec id="sec-2-20">
        <title>Val Tannen, University of Pennsylvania</title>
      </sec>
      <sec id="sec-2-21">
        <title>Yogesh Simmhan, Microsoft Research</title>
        <p>© Copyright remains with the authors, and permission to reproduce material printed as SWPM09 workshop proceedings should be sought from them.
Similarly, pursuing copyright infringements, plagiarism, etc. remains the responsibility of authors.
The growing eScience infrastructure is enabling scientists to generate scientific data on
an industrial scale. Similarly, the Web 2.0 paradigm is enabling Web users to create
applications that combine data from multiple sources, popularly referred to as
“mashups”, on a large scale. The importance of managing various forms of apparently
ancillary metadata, in addition to the primary data products of eScience, Web, and
business applications is increasingly being recognized as critical for the correct
interpretation of the data. In this workshop we focus specifically on metadata that
describes the origins of the data. The term provenance from the French word
“provenir”, meaning “to come from", describes the lineage or origins of a data entity.
Provenance metadata is essential to correctly interpret the results of a process
execution, to validate data processing tools, to verify the quality of data, and to
associate measures of trust to the data. The proof layer in the Semantic Web layer cake,
corresponding to provenance information, has been identified as an important
component for the implementation of “trust mechanisms” and effective information
extraction from the Web.</p>
        <p>The primary objective of this workshop is to explore the role of Semantic Web in
addressing some of the critical challenges facing provenance management, namely:
1. Efficiently capturing and propagating provenance information as data is processed,
fragmented and recombined across multiple applications on a Web scale.
2. A common representation model for provenance for processing and analysis by
both agents and humans.
3. Interoperability of provenance information generated in distributed environments
such as myGrid.
4. Tools leveraging the Semantic Web for visualization of provenance information.
We thank the keynote speakers, all members of the program committee, authors,
invited speakers, participants and local organizers for their efforts.</p>
      </sec>
      <sec id="sec-2-22">
        <title>We look forward to a successful workshop!</title>
        <p>Juliana Freire, Paolo Missier, Satya S. Sahoo</p>
        <p>Semantic Provenance for Science Data
Products: Application to Image Data</p>
        <p>Processing
Stephan Zednik, Peter Fox, Deborah L. McGuinness, Paulo Pinheiro da Silva, Cynthia Chang
Abstract—A challenge in providing scientific data services to a broad user base is to also provide the metadata services
and tools the user base needs to correctly interpret and trust the provided data. Provenance metadata is especially vital to
establishing trust, giving the user information on the conditions under which the data originated and any processing that
was applied to generate the data product provided.</p>
        <p>In this paper, we describe our work on a federated set of data services in the area of solar coronal physics. These data
services provide a particular challenge because there is decades of existing data whose provenance we will have to
reconstruct, and because the quality of the final data product is highly sensitive to data capture conditions, information
which is not currently propagated with the data.</p>
        <p>We describe our use of semantic technologies for encoding provenance and domain knowledge and show how provenance
and domain ontologies can be used together to satisfy complex use cases. We show our progress on provenance search
and visualization tools and highlight the need for semantics in the user tools. Finally, we describe how our methods are
applicable to generic data processing systems.</p>
        <p>✦</p>
        <sec id="sec-2-22-1">
          <title>INTRODUCTION</title>
          <p>W servatory1 with extensive provenance
sup</p>
          <p>
            E aim to create a next-generation virtual
obport. Provenance is a first-class concept in our
system; with full support in our search,
explanation, and visualization tools. We require a general
provenance model that is applicable in a wide array
of domains, and integrable with domain models,
so that domain concepts can be modeled along
with provenance concepts. For these reasons we
have chosen to use the Proof Markup Language
(PML) [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ], [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ] family of OWL ontologies as our
provenance model. We show how PML can be
used to model provenance causality chains,
introduce our domain model, and show how the PML
provenance model and our science domain models
can be integrated in a manner that provides a rich
• Stephan Zednik, Peter Fox, Deborah L. McGuinness, and Cynthia
Chang are with the Rensselaer Polytechnic Institute, Tetherless
World Constellation.
• Paulo Pinheiro da Silva is with the University of Texas at El Paso,
Department of Computer Science.
          </p>
          <p>1. A virtual observatory is a collection of interoperating data
archives and software tools which utilize the internet to form a
scientific research environment in which research programs can
be conducted.
provenance infrastructure, able to model complex
scientific provenance relations.</p>
          <p>We have chosen to test our system in the
domain of solar coronal physics, using the Advanced
Coronal Observing System (ACOS) as a testbed.
The ACOS data products are the result of several
data ingest pipelines, processing observations from
three imaging instruments located at the Mauna
Loa Solar Observatory (MLSO). The ACOS data
pipelines are distributed data pipelines, operated in
part at MLSO in Hawaii and the National Center
for Atmospheric Research High Altitude
Observatory (NCAR/HAO) in Boulder, CO. ACOS has been
operational for over a decade, and has produced
terabytes of data.</p>
          <p>ACOS was chosen because its data pipelines
are typical of data ingest systems and the vast
quantity of existing data ACOS has generated in
its decades of operation provides the opportunity
to design a system geared to reconstruct, as well
as capture, provenance. One of the ACOS data
pipelines, the Chromospheric Helium-I Imaging
Photometer (CHIP) Intensity Image pipeline, is
illustrated in Figure 1. This high-level diagram is
designed to show not just the process/artifact flow
of the pipeline, but domain concepts that could
and should be captured and represented in the
provenance. In the pipeline, data (square boxes)
passes through a number of stages (ovals) each
of which can contain a number of complex
processing, analysis, human interaction, and decision
steps. Each of these stages contains domain-specific
information (dotted-lined boxes) related to the data
product provenance.</p>
          <p>Of particular interest is information in the
pipeline that is not a direct or inferred result of
the data capture event. The Observer Log is a
human-generated account of weather conditions
and system status during the instrument observing
schedule for the day. Bad weather conditions or
instrument instability, noted in the log, can have
significant negative effect on the quality of the
data observations. This information is currently
not propagated in the data pipeline nor do the
data products reference it in any way, but is an
invaluable reference in determining why an image
has been given a low-grade quality assessment.
This information is an important component of the
origin of the data image and should be represented
in its provenance.</p>
          <p>
            The motivation for this project arose from our
experiences designing and deploying a solar
terrestrial physics virtual observatory system [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ], [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ], and
from numerous discussions with the data providers
(i.e. ’roles’ in Figure 1). Among their remarks were
the following:
• Data is being used in new ways and we
frequently do not have sufficient information on
what happened to the data along the
processing stages to determine if it is suitable for a use
we did not envision.
• We often fail to capture, represent, and
propagate manually generated information that
needs to go with the data flows.
          </p>
          <p>Further, when science data and visual
representations of the data (such as the CHIP Quick Look
images) are made available to the end-user, the
product has often gone through a number of data
filtration and processing steps. If thorough
provenance metadata and processing documentation is
not captured, propagated, and made available to
the end-user; the data system is in effect a ’black
box’, and the end-user must blindly trust the
science quality of the data product and long-term
consistency of the pipeline processing.</p>
          <p>Virtual Observatories are particularly prone to
this information gap. This project traces the entire
csiatelibcraaltiibornatiimona,ge set
instrument operating
mode
general comments,
weather comments,
instrument
comments</p>
          <p>Data</p>
          <p>Capture
Quick Look
Processing
Ship to</p>
          <p>NCAR
Quality field
specifying 'Good',
'Bad', or 'Ugly' data;
added by QA</p>
          <p>Calibration,
Accounting,
Quality</p>
          <p>Rating
sub-processes.
processing algorithms,
input parameters</p>
          <p>Flatten,
Re-scale,
Rotate
Image</p>
          <p>Site/Instrument
Configuration
Instrument
Calibration
Raw Image
Files with</p>
          <p>Headers
Observer Logs
Quick Look GIF
(Visual Product)
Images with
Headers, and
Observer Logs
Calibrated Data
with Quality
value
Final Data</p>
          <p>Images
(Data Product)</p>
          <p>PI
Site
Observer
Data
Analyst
Quality
Assesment,
Data Product
Manager
End User</p>
          <p>Fig. 1.</p>
          <p>Chromospheric Helium-I Imaging
(CHIP) Intensity Image pipeline
Photometer
pipeline and accounts for all roles, processes, and
metadata as they relate to use cases which require
provenance.
2</p>
        </sec>
        <sec id="sec-2-22-2">
          <title>USE CASES</title>
          <p>During discussions with science project
participants, we developed an initial set of use cases
which reflect real user questions that cannot
presently be answered in any routine or automated
manner:
• What were the cloud cover and seeing
conditions during the observation period of this data
product artifact?
• What calibrations have been applied to this</p>
          <p>data product artifact?
• Who (person or program) added the
comments to the science data file for the best
vignetted rectangular polarization brightness
image from January 26, 2005 18:49:09UT taken
by the ACOS Mark IV polarimeter?
• Find all good CHIP He 1083 nm intensity im- pipeline. The provenance capture should be
configages on March 21, 2008. urable such that as much provenance as possible
• Why does this data look bad? can be generated based on our understanding of
These use case scenarios mix domain and prove- an earlier version of the pipeline, without forcing
nance terms in a manner that would make the us to re-run data processing.
question difficult to answer if the provenance and Finally, since ACOS is a distributed system the
domain models are independent. The cloud cover provenance infrastructure must also work as a
disand seeing conditions during observation periods is tributed system. Provenance should be gathered
recorded, but it is never directly associated with nor where processing occurs and made available as part
propagated with processed data. We could adjust of a distributed provenance store.
the data processing pipeline to pull this
information from the observation records and propagate 4 DATA MODEL
it with the data as metadata; but this would be a 4.1 Provenance Representation
very heavy-handed approach to take for any and
all information associated with a data product’s
generation and processing that a user may want
to see. Instead, we intend to build the link in the
provenance representation of the data product, so
that by following the data product’s causality graph
the system can find the weather condition records,
calibrations applied, quality control information,
etc. that are associated with the data product’s
generation or processing.</p>
          <p>These use case scenarios are representative of
many different question types that are routinely
asked for all data products produced by the ACOS
data ingest pipeline, and we believe these are
representative of use cases that are common in any
science data pipeline application.</p>
          <p>To support our provenance requirements we have
elected to use OWL ontologies to model both
domain and provenance concepts. The provenance
and domain base ontologies are independent, but
the system’s individuals reference both models (via
multiple-inheritance), so queries, filtering and
reasoning by either domain or provenance concepts
are supported. This design supports our desire
to build a maintainable system that refrains from
hardcoding solar terrestrial concepts into the base
provenance model or provenance logic.</p>
          <p>
            We have chosen as our provenance model the
Inference Web [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ] Framework’s Proof Markup
Language [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ], [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ] (PML) because of its capabilities
in representing conclusions, justifications (inference
and source usage), and explanations. Another
particularly useful aspect of the PML model is its
separation of the process engine and process rule
concepts. By defining these concepts separately,
PML can represent both the process that was
executed (PML InferenceEngine) and the rule (PML
InferenceRule) by which the executed process
operated. Another way to view this concept separation
is that PML can capture both execution history
and execution purpose. Inference rules are a
pivotal concept of the justification aspect of PML and
provide a clear mechanism for relating domain
concepts to a provenance causality graph. While
other provenance models such as Open Provenance
Model (OPM) could have provided some of the
foundation provided in PML, we found some of
the core representational constructs such as those
mentioned above to be well suited for our
applications. For further analysis of the relative benefits
of PML and OPM, see ’Towards Usable and
Interoperable Workflow Provenance: Emperical Case
Studies using PML’ [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ] and ’Domain Knowledge
and Provenance in Science Data Systems’ [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ].
3
          </p>
        </sec>
        <sec id="sec-2-22-3">
          <title>PROVENANCE REQUIREMENTS</title>
          <p>We require a provenance infrastructure that
supports queries, filtering, and reasoning by domain
concepts. A design requiring the hardwiring of
domain concepts into the provenance model, or
into the system logic that accesses the provenance
store, is undesirable because it will be difficult to
maintain and extend, and furthermore, such a
hardwired application will make interoperation more
challenging.</p>
          <p>The provenance infrastructure must also
support existing data systems, requiring little to no
modification of the processing pipeline; ACOS is
a production system, and we do not have the
opportunity to re-engineer it. The system should also
support generating some amount of provenance for
existing processed data. It is not feasible to
reprocess all existing data, and doing so with the current
pipeline may introduce discrepancies between the
newly processed products and archived products
processed on a earlier and different version of the
Fig. 2.</p>
          <p>Basic PML NodeSet</p>
          <p>In PML, a piece of information (the conclusion)
and its justification(s) are modeled as a nodeset. A
nodeset justification, known as an inference step,
is used to describe the engine or source and the
rule used to generate the nodeset conclusion. Each
inference step may specify a list of nodesets, known
as antecedents, whose conclusions it is dependent
upon. The antecedent relations between nodesets
can be used to build a causality graph, or
explanation, for the conclusion of the nodeset. The
fundamental classes and properties of a PML nodeset are
shown in Figure 2.
4.2</p>
          <p>
            Domain Representation
The VSTO2 solar-terrestrial ontology, developed
during our previous experience deploying a
semantic virtual observatory [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ], [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ] will be used
as one of our core science domain models. The
VSTO ontology provides a model for data
products, instruments, and parameters related to
solarterrestrial data systems. The VSTO ontology does
not currently describe the processing that occurs
in a typical science data ingest pipeline
(calibrations, transformations, data filtering, quality control
processes, etc.) so we are developing our own
science data processing ontology based on experience
gained during this3 project and similar work with
the MDSA4 project.
          </p>
          <p>Figure 3 illustrates some domain model concepts
from the VSTO and (in-development) science data
2. Virtual Solar Terrestrial Observatory,
http://vsto.org/
3. Semantic Provenance Capture in Data Ingest Systems
4. Multi-Sensor Data Synergy Advisor,
http://tw.rpi.edu/portal/MDSA</p>
          <p>Fig. 3.</p>
          <p>VSTO (vsto) and Science Data Processing (spcdis)
domain concepts
processing ontologies that we will be integrating
with the ACOS provenance. Of particular interest in
this example is the vsto:InstrumentOperatingMode
concept, which is defined as the configuration and
process that allows an instrument to produce the
required signal. In the solar terrestrial domain
terminology, an operating mode is not treated as
an input configuration to a process, not as an
artifact, but as a description of the state of the
instrument and the entailing process for data
capture. It is a description of how an
instrument performs data capture of a specific
parameter type. In fact, vsto:InstrumentOperatingMode
was originally modeled as a subclass of the class
vsto:AbstractProcess. The VSTO
InstrumentOperatingMode and science data processing ontology
DataCapture concepts relate to each other in much
the same way the PML InferenceRule and
InferenceEngine relate, and in the next section we will
show how they can be integrated.
4.3</p>
          <p>Provenance and Domain Model Integration
The provenance and domain ontology concepts
are integrated not in the model definitions, but
in the individuals’ declarations by taking
advantage of OWLs natural support for
multipleinheritance. Where it is deemed beneficial to
express both domain and provenance concepts,
individuals (ontology class instances) are
defined with multiple types, one type from the
provenance model and at least one type from
the domain ontologies. As an example, the
science data processing ontology may define
an individual spcdis:FlatFieldCalibration of type
spcdis:Calibration and type pmlp:InferenceRule.
The use case ’What calibrations have been applied to
this data product artifact?’ may now be answered
by querying the data product’s causality graph for
inference rule’s of type spcdis:Calibration used by
any nodeset’s justification.</p>
          <p>Figure 4 shows how individuals of the domain
concepts illustrated in Figure 3 may be mapped to
PML provenance concepts. The instrument
operating mode instance is modeled not as the conclusion
of a some nodeset’s justification that acts as an
antecedent to the data capture nodeset’s
justification, but as the rule by which data capture occurs.
The individual spcdis:CHIP is defined as both a
vsto:Photometer and a pmlp:Sensor, allowing it to
be both the source of the conclusion of the data
capture justification as well as assert any properties
for which vsto:Photometer is in the domain.</p>
          <p>We can now model a PML NodeSet using
individuals that have type and properties from
domain ontologies, as shown in Figure 5. Integrating
domain types and properties with the
provenancebased causality graph allows us to answer complex
use case scenarios such as ’What calibrations have
been applied to this data product artifact?’ by
performing reasoning on domain concepts integrated with
the individuals in the provenance.
5</p>
        </sec>
        <sec id="sec-2-22-4">
          <title>PROVENANCE CAPTURE</title>
          <p>To assist in PML generation, we describe a type
of program referred to as a PML data annotator.
A PML data annotator is a simple program whose
sole purpose is to capture the provenance of a
single decision/process in a decision system and
encode that provenance as a PML nodeset. PML
data annotator programs are run as components
of a workflow; either as part or separate to the
actual decision processing. When run as part of the
data processing, the PML data annotator invokes
the inference engine directly; extracting required
processing inputs from antecedent nodesets and
passing this information during inference engine
invocation.</p>
          <p>For the ACOS provenance capture we will utilize
a parallel workflow, where PML data annotators
do not directly invoke inference engines but
reconstruct the processing of the existing data ingest
pipelines. This architecture also supports our need
for provenance generations for archived or
preexisting data products without preprocessing of the
data. The PML data annotators are in in a workflow
that simulates the processing of the data pipeline
using analysis of existing artifacts and information
about the data processing encoded in the PML data
annotator configuration to reconstruct provenance.
The PML data annotator workflow can be
reconfigured to simulate different variations of the data
processing pipeline to generate provenance from
data processing pipelines that are no longer active.
6</p>
        </sec>
        <sec id="sec-2-22-5">
          <title>PROVENANCE SEARCH</title>
          <p>We will utilize the search and explanation
capabilities of the Inference Web toolset to provide
both a free text and guided search on provenance
and domain concepts. Guided searches generate a
SPARQL query on the provenance + domain RDF
and free text searches currently performs a standard
full text index search on the same. Search results
Fig. 6.</p>
          <p>Probe-It visualizing the provenance of a CHIP
QuickLook Visual Product
can be viewed using a number of tools including
the Inference Web browser, where the user can
explore the provenance encoding in detail, or in
Probe-It!, a second generation PML provenance
visualization tool using applet technology for greater
graph functionality, introduced in the next section.
7</p>
        </sec>
        <sec id="sec-2-22-6">
          <title>PROVENANCE VISUALIZATION</title>
          <p>
            Probe-It! [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] is graphical browser of PML-based
provenance, developed by Cyber-ShARE at the
University of Texas at El Paso. Probe-It! generates
a causality graph for all antecedents to a
specified nodeset and generates a visual
representation (where applicable) for the conclusion of all
nodesets in the graph. A Probe-It! visualization
of the CHIP QuickLook5 Visual Product is shown
in Figure 6. Our primary interest in Probe-It for
the ACOS provenance is in enabling scientists to
better understand imperfections in and processing
consequences upon science data images.
8
          </p>
        </sec>
        <sec id="sec-2-22-7">
          <title>DISCUSSION AND CONCLUSION</title>
          <p>To date we have reconstructed provenance for the
Quick Look visual product from the CHIP
Intensity data ingest and shown how weather condition
information, previously not propagated, can
provide value added to the data product. We have
introduced our work constructing a domain-aware
5. QuickLook images are lightly-calibrated visual
approximations of the data image, generated in real-time and used
primarily as quality checks on instrument operation
provenance store built using a generic, extensible
provenance model and a solar-terrestrial domain
model.</p>
          <p>We described how this provenance store can be
used to represent the relationships expressed in our
use cases, and why these use cases are important
in increasing user trust in the data products. While
we did this in one workflow setting, since the use
cases are representative of those in many scientific
workflow settings, we believe this work provides
a foundation for scientific workflow provenance
applications. We have described in brief how we
intend to use a parallel workflow to reconstruct and
generate provenance, and the semantically-enabled
provenance search, explanation, and visualization
tools we will provide for the end users.</p>
          <p>The next stage of our work will involve further
modeling of data pipeline concepts in the ACOS
provenance ontology, further documentation of the
ACOS data pipelines, and construction of PML
wrappers for newly documented sections of the
data pipelines. We also intend to prototype
semantic provenance faceted-search interfaces, move our
free text search to the Apache Lucene text search
engine, and develop new visual representations
of nodeset conclusions in our visual provenance
browser. Following the completion and testing of
the ACOS application, we will separate our
extensions that are ACOS-specific from those that are
general to science applications and release scientific
provenance module extensions for VSTO and PML
ontologies and related wrapper support tools.</p>
        </sec>
        <sec id="sec-2-22-8">
          <title>ACKNOWLEDGMENTS</title>
          <p>The SPCDIS6 project is funded by NSF, Office of
Cyber Infrastructure under the SDCI program, grant
number OCI-0721943, and in collaboration with
UTEP CyberShare Center, funded by NSF under the
CREST program, grant number HRD-0734825.
Reasoning With Provenance, Trust and all that other
Meta Knowlege in OWL
Simon Schenk , Renata Dividino and Steffen Staab
ISWeb Research Group University of Koblenz-Landau</p>
          <p>Email: sschenk, dividino, staab@uni-koblenz.de</p>
          <p>Abstract—For many tasks, such as the integration of
knowledge bases in the semantic web, one must not only handle the
knowledge itself, but also characterizations of this knowledge,
e.g.: (i) where did a knowledge item come from (i.e. provenance),
(ii) what level of trust can be assigned to a knowledge item, or
(iii) what degree of certainty is associated with it. We refer to all
such kinds of characterizations as meta knowledge. Approaches
for providing meta knowledge for query answers in relational
databases and RDF repositories, based on algebraic operations,
exist. As query answering in description logics in general does not
boil down to algebraic evaluation of tree shaped query models,
these formalizations do not easily carry over. In this paper
we propose a formalization of meta knowledge, which is still
algebraic, but allows for the computation of meta knowledge of
inferred knowledge in description logics, including reasoning with
conflicting and incomplete meta knowledge. We use pinpointing
to come up with meta knowledge formulas for description logics,
which then can be evaluated algebraically. We describe and
evaluate our prototypical implementation.</p>
          <p>I. INTRODUCTION</p>
          <p>When exploiting explicit/inferred knowledge in the semantic
web, one must not only handle the knowledge itself, but
also characterizations of this knowledge, e.g.: (i) where did
a knowledge item come from (i.e. provenance), (ii) what level
of trust can be assigned to a knowledge item, or (iii) what
degree of certainty is associated with it. We refer to all such
kinds of characterizations as meta knowledge. On the semantic
web, meta knowledge needs to be computed along with each
reasoning task.</p>
          <p>Meta knowledge can come in various, complex dimensions.</p>
          <p>
            Many simplifications done today, such as assuming trust to be
measured on a scale from 1 to 10, are not justified. In contrast,
actual information sources, modification dates, etc. should be
tracked to establish trust [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ]. We propose a flexible mechanism
for tracking meta knowledge, which meets these requirements.
          </p>
          <p>Various approaches to this problem have been proposed.</p>
          <p>
            They can be grouped in to three clusters: First, we have
extensions of logical formalisms, e.g. description logics, to
deal with a particular kind of meta knowledge. Most prominent
are extensions for reasoning with uncertainty, such as fuzzy
and probabilistic [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ] or possibilistic [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ] description logics.
          </p>
          <p>
            Other proposals exist, which are tailored to specific meta
knowledge such as trust [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ]. Second, for systems allowing
for algebraic query evaluation (such as relational databases
and SPARQL engines), more flexible mechanisms such as [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ]
and [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ] have been proposed, which allow for many kinds
of meta knowledge, but are limited to lower expressiveness
of the underlying logical formalism. Third, the expressive
system proposed by [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ] has a rather ad-hoc semantics, which
is partially defined in constructors in queries and hence can
differ in each query evaluation.
          </p>
          <p>To come up with a flexible mechanism, which at the same
time supports expressive logics and multiple kinds of meta
knowledge, a suitable formalization of meta knowledge in
a semantically precise manner is needed. Moreover, such a
mechanism must be supported with a suitable
operationalization. From the existing approaches it is clear, that integrating
an expressive meta knowledge language with an expressive
base knowledge representation language is a non-trivial task,
mainly because of the different foundations, i.e. algebra vs.
logics, of the meta knowledge and base languages.</p>
          <p>Expressive descriptions of meta knowledge in less
expressive languages (such as SPARQL based on RDF) have been
founded on a tree-based algebraic formalization. Reasoning
frameworks, however, frequently have non-tree-based
derivations used for consistency checking and querying. In order to
be able to reason with meta knowledge, which we formalize
as algebraic structure, on top of expressive base languages, we
propose a reasoning framework for meta knowledge based on
pinpointing. Pinpointing summarizes explanations for axioms
in a single boolean formula, which then can be evaluated
using a meta knowledge algebra. We provide a blackbox
algorithm for reasoning with meta knowledge, and describe our
prototypical implementation. The algorithm uses an existing
description logic reasoner for entailment checks. Hence, the
supported expressivity is that of the underlying description
logic.</p>
          <p>As a motivation, we first explain a short use case, before
laying foundations and defining the semantics of meta
knowledge. Afterwards we briefly discuss the complexity and our
prototypical implementation. We review the related work and
conclude the paper.</p>
          <p>II. USE CASE</p>
          <p>
            In a common scenario for collaborative ontology editing
we have public, living ontologies, for which users can propose
changes [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] and which are possibly interlinked through imports
or views. Applications include large medical and biological
ontologies such as SNOMED or the Gene Ontology. The
example in [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] is based on a use case at the UN’s Food and
Agricultural Organization FAO. A change can be the addition,
change, or removal of an axiom. Users have different levels
of expertise and hence their knowledge items are assigned
different degrees of trustworthiness. Moreover, there may be
conflicting changes or modifications, which make the ontology
inconsistent. When answering queries and inferring knowledge
in such systems, users need to know for example
who contributed to axioms used to infer new knowledge,
when they were last modified, and
how trustworthy they are.
          </p>
          <p>The derivation of meta knowledge can happen dynamically,
in a completely open system comparable to today’s wikis,
where the change history is available for every user.</p>
          <p>III. FOUNDATIONS</p>
          <p>As we use pinpointing as a vehicle for computing meta
knowledge, we introduce pinpointing as a foundation for
the rest of the paper and give some information of existing
algorithms for finding pinpoints.</p>
          <p>A. Pinpointing</p>
          <p>The term pinpointing has been coined for the process of
finding explanations for concluded axioms or for a discovered
inconsistency. An explanation is a minimal set of axioms,
which makes the concluded axiom true (or the theory
inconsistent, respectively). Such an explanation is called a pinpoint.</p>
          <p>While there may be multiple ways to establish the truth or
falsity of an axiom, a pinpoint describes exactly one such way.</p>
          <p>Definition 1: Pinpoint.</p>
          <p>A pinpoint for a entailed axiom A wrt. an ontology O is a
set of axioms fA1; :::; Ang from O, such that fA1; :::; Ang j=
A and 8Ai 2 fA1; :::; Ang : fA1; :::Ai 1; Ai+1; :::; Ang 6j=
A. Analogously, a pinpoint for a refuted axiom A wrt. an
ontology O is a set of axioms fA1; :::; Ang from O, such
that fA; A1; :::; Ang is inconsistent and 8Ai 2 fA1; :::; Ang :
fA; A1; :::Ai 1; Ai+1; :::; Ang is not.</p>
          <p>
            Hence, finding pinpoints for a refuted axiom corresponds
to finding the Minimum Unsatisfiable Subontologies (MUPS)
for this axiom [
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ].
          </p>
          <p>
            Pinpointing is the computation of all pinpoints for a given
axiom and ontology. The truth of the axiom can then be
computed using the pinpointing formula [
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ].
          </p>
          <p>Definition 2: Pinpointing Formula.</p>
          <p>Let A be an axiom, O an ontology and P1; :::; Pn with
Pi = fAi;1; :::; Ai;mi g the pinpoints of A wrt. O. Let lab
be a function assigning a unique label to an axiom. Then
Win=1 Vjm=ii lab(Ai;j ) is a pinpointing formula of A wrt. O.</p>
          <p>A pinpointing formula of an axiom A describes, which
(combination of) axioms need to be true in order to make
A true or inconsistent respectively.</p>
          <p>B. Finding all Pinpoints</p>
          <p>
            Algorithms for finding Pinpoints can be grouped into three
groups:
a) Finding one pinpoint: Algorithms to find one
pinpoint can either derive a pinpoint by tracking the reasoning
process of a tableaux reasoner, or use an existing reasoner
as a black box. In the latter case, a pinpoint is searched by
subsequently growing (shrinking) a subontology until it starts
(stops) entailing the axiom under question. Based on the so
derived smaller ontology the process is refined, until a pinpoint
has been found. The advantage of blackbox algorithms is that
they can support any description logic, for which a reasoner is
available [
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ]. Extending a tableaux reasoner on the other hand
is complicated, but yields better performance, as a pinpoint
can be generated in parallel to a usual subsumption check
with low overhead [
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ].
          </p>
          <p>
            b) Finding all Pinpoints using a Tableaux Reasoner:
Baader and Pen˜aloza have shown that forest tableaux with
equality blocking (and hence, reasoners for the web ontology
language OWL) can be extended to find pinpointing formulas
[
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ]. In this approach a tableaux reasoner is extended to find
not only one, but all pinpoints. Special care needs to be taken
in order to ensure termination of the tableaux algorithm. As
an advantage, the overhead for pinpointing is lower compared
to a blackbox algorithm. Moreover, this approach can derive
a compact representation of the pinpointing formula, which
might have worst-case exponential size in conjunctive normal
form. To the best of our knowledge none of the standard
reasoners for complex description logics has been extended
in this direction yet.
          </p>
          <p>
            c) Finding all Pinpoints using Blackbox Algorithms: The
most performant black-box algorithms for finding all
justifications first extract a relevant module from the overall ontology,
ensuring that this module yields the same inferences with
respect to the axiom on interest. Then, starting from a single
pinpoint, which is computed using an algorithm discussed in
paragraph III-B0a, Reiter’s Hitting Set Tree algorithm [
            <xref ref-type="bibr" rid="ref11 ref26 ref39">11</xref>
            ]
is used to compute all pinpoints by iteratively removing one
axiom from the pinpoint at hand and growing it to a full
pinpoint again [
            <xref ref-type="bibr" rid="ref12 ref27 ref40">12</xref>
            ], [
            <xref ref-type="bibr" rid="ref13 ref28 ref41">13</xref>
            ]. Using this kind of algorithm, a lot
of subsumption checks in the underlying description logic are
needed.
          </p>
          <p>
            For both, tableaux based and black box algorithms, the
worst case complexity of finding all pinpoints is rather high,
as there can be exponentially many pinpoints for any given
ontology. However, recent work has shown that in the average
case, the number is significantly lower [
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ].
          </p>
          <p>IV. SYNTAX OF META KNOWLEDGE</p>
          <p>Meta knowledge can be expressed as annotations on axioms.</p>
          <p>Annotations are of main importance for the management of
ontologies as annotations may be used to support analysis
during collaborative engineering.</p>
          <p>We associate ontology axioms with meta knowledge through
axiom annotations. Basically, an axiom annotation assigns
an annotation object to an axiom e.g. ”(brokenLimb
subClass Limb) was created by Crow on 15.01.2008”. A meta
knowledge annotation consists of an annotation URI and a
meta knowledge object specifying the value of the annotation.</p>
          <p>In our case, the meta knowledge object is a constant-value
representing who asserted/modified the axiom, when the axiom
was last modified, or the uncertainty degree of the axiom,
or a combination thereof. The grammar for meta knowledge
annotations as an extension of OWL 2 annotations1 is as
follows:
OWLAxiomAnnotation := ’OWLAxiomAnnotation’</p>
          <p>’(’OWLAxiom OWLAnnotation+’)’
OWLAnnotation := OWLConstantAnnotation
OWLConstantAnnotation := MetaKnowledgeAnnotation
MetaKnowledgeAnnotation := ’MetaKnowledgeAnnotation’</p>
          <p>’(’AnnotationURI MetaKnowledge+’)’
MetaKnowledge := CertaintyAnnotation j DateAnnotation j
SourceAnnotation j AgentAnnotation
CertaintyAnnotation := ’CertaintyAnnotation’</p>
          <p>’(’AnnotationValue’)’
SourceAnnotation := ’SourceAnnotation’ ’(’AnnotationValue’)’
DateAnnotation := ’DateAnnotation’ ’(’AnnotationValue’)’
AgentAnnotation := ’AgentAnnotation’ ’(’AnnotationValue’)’</p>
          <p>In our scenario we assume that we are looking for meta
knowledge information about all limbs which are either broken
or wrenched. Our ontology contains the axioms and meta
knowledge annotations summarized in Table I.</p>
          <p>An example of how meta knowledge is represented and
associated with OWL axioms is presented below.</p>
          <p>OWLAxiomAnnotation(ClassAssertion(limb1 Limb)</p>
          <p>MetaKnowledgeAnnotation(</p>
          <p>annot1 AgentAnnotation(Crow)))
OWLAxiomAnnotation(</p>
          <p>PropertyAssertion(limb1 isBroken true)
MetaKnowledgeAnnotation(</p>
          <p>annot2 AgentAnnotation(House)))</p>
          <p>Annotations, however, have no semantic meaning in OWL 2.</p>
          <p>All annotations are ignored by the reasoner, and they may not
themselves be structured by further axioms. For this reason,
as next step, we first define the semantics of meta knowledge,
later we describe how meta knowledge can be combined with
reasoning.</p>
          <p>V. SEMANTICS OF META KNOWLEDGE</p>
          <p>Meta knowledge can have multiple dimensions, e.g.
uncertainty, a least recently modified date or a trust metric. For this
paper, we assume that these (and possible further) dimensions
are independent of each other.</p>
          <p>1OWL 2 Web Ontology Language: Spec. and Func.-Style Syntax:
http://www.w3.org/TR/2008/WD-owl2-syntax-20081202</p>
          <p>Definition 3: Knowledge dimension. A knowledge
dimension D is an algebraic structure (BD; _D; ^D), such that
(BD; _D) and (BD; ^D) are complete semilattices.</p>
          <p>BD represents the values the meta knowledge can take,
e.g. all valid dates for the least recently modified date or a
set of knowledge sources for provenance. As (BD; _D) and
(BD; ^D) are complete semilattices, they are, in fact, also
lattices. Hence, there are minimal elements in the corresponding
orders.</p>
          <p>As an example, let I be the meta knowledge interpretation2
that is a partial function mapping axioms into the allowed
value range of a meta knowledge dimension, and A and B
be axioms of an ontology such that A 6= B. Provenance, i.e.
the set of knowledge sources a piece of knowledge is derived
from, can be modeled as:</p>
          <p>I(A _ B) = I(A) [ I(B)</p>
          <p>I(A ^ B) = I(A) [ I(B)
The least recently modified date could be modeled as:</p>
          <p>I(A _ B) = min(I(A); I(B))</p>
          <p>I(A ^ B) = max(I(A); I(B))</p>
          <p>Axioms can be assigned meta knowledge from any of the
meta knowledge dimensions. Within a single assignment, the
meta knowledge must be uniquely defined.</p>
          <p>Definition 4: Meta Knowledge Assignment.</p>
          <p>A meta knowledge assignment M is a set f(D1; d1 2
D1); :::; (Dn; dn 2 Dn)g of pairs of meta knowledge
dimensions and corresponding truth values, such that Di = Dj )
di = dj .</p>
          <p>In our running example, the meta knowledge assignment
for PropertyAssertion(limb1 isBroken true) is f(agent, Crow),
(date, 15.01.2008)g</p>
          <p>Without loss of generality we assume a fixed number of
meta knowledge dimensions. As a default value for Dn in a
meta knowledge assignment we choose ?D.</p>
          <p>To allow for reasoning with meta knowledge, we need to
formalize, how meta knowledge assignments are combined.</p>
          <p>
            How provenance [
            <xref ref-type="bibr" rid="ref14 ref42">14</xref>
            ] is a strategy, which describes how an
axiom A can be inferred from a set of axioms fA1; :::; Ang,
i.e. it is a boolean formula connecting the Ai. We call a
logical formula expressing how provenance a meta knowledge
formula. For example the following query finds all limbs, that
are either broken or wrenched:
x : Limb ^ (hx; truei : isBroken _ hx; truei : isWrenched).
          </p>
          <p>The results of this query and the corresponding meta
knowledge formulas are:</p>
          <p>limb1 #1 ^ #3 and limb2 #2 ^ #4</p>
          <p>The operators for meta knowledge dimensions extend to
meta knowledge assignments, allowing us to compute meta
knowledge for entailed knowledge by evaluating the
corresponding meta knowledge formula.</p>
          <p>Definition 5: Operations on Meta Knowledge Assignments.</p>
          <p>Let A; B be axioms and meta(A) = f(D1; x1); :::; (Dn; xn)g
and meta(B) = f(E1; y1); :::; (Em; ym)g be meta knowledge</p>
          <p>2The administrator defines the intended semantics of these properties in
order to facilitate query processing with complex expressions and pattern
combinations.
assignments. Let dim(A) be the set of meta knowledge
dimensions of A. Then meta(A)_meta(B) = f(D; x_D y)j(D; x) 2
meta(A) and (D; y) 2 meta(B)g. ^ is defined analogously.</p>
          <p>Having defined the operations on meta knowledge
assignments, we can define formulas using these operations.</p>
          <p>Definition 6: Meta Knowledge Formula.</p>
          <p>Let A be an axiom of an ontology O, lab a function assigning
a unique label to each Ai from O and lab(O) the set of all
labels of axioms in O. A meta knowledge formula for a
axiom A wrt. an ontology O is boolean formula over the set
of labels flab(A1); :::; lab(An)g of axioms fA1; :::; Ang from
O, such that for each valuation V lab(O), which makes
true, the following holds: lab (V ) j= A.</p>
          <p>The meta knowledge of an axiom A within a meta
knowledge dimension is obtained by evaluating the corresponding
meta knowledge formula after replacing axiom labels with
the corresponding meta knowledge in the dimension under
consideration.</p>
          <p>Definition 7: Meta Knowledge of an Axiom.</p>
          <p>Let meta be a function mapping from an axiom to a meta
knowledge assignment in dimension D. The meta knowledge
of an axiom A wrt. O in D is obtained by evaluating the
formula obtained from A’s meta knowledge formula wrt. O
by replacing each lab(Ai) with the corresponding meta(Ai).</p>
          <p>In our running example, if we model the agent dimension
as where provenance, the meta knowledge of the query result
for limb1 is: (agent, fCrowg) ^ (agent, fHouseg) = (agent,
fCrowg [ fHouseg) = (agent, fCrow, Houseg).</p>
          <p>
            In contrast to [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ] we omit the : operator in our
formalization, as description logics are monotonic and : in [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ]
allows for default negation. While axioms in the underlying
description logic may contain negation, this negation is not
visible on the level of meta knowledge.
          </p>
          <p>VI. EXTENDED SEMANTICS FOR CONFLICTING META</p>
          <p>KNOWLEDGE</p>
          <p>In the following we extend our model to support conflicting
meta knowledge, which can arise from conflicting changes or
meta knowledge assignments by multiple users in an axiom.</p>
          <p>Definition 8: Extended knowledge dimension. A
extended knowledge dimension D is an algebraic
structure (BD; _D; ^D; D), such that (BD; _D), (BD; ^D)
and (BD; D) are complete semilattices. The minimum of
(BD; D) is called ?D.</p>
          <p>As an example, let I be the meta knowledge interpretation
that is a partial function mapping axioms into the allowed
value range of a meta knowledge dimension A be an axiom
of an ontology, and I1 and I2 interpretations of multiple
meta knowledge assertions to A. Provenance, i.e. the set of
knowledge sources a piece of knowledge is derived from, can
be modeled as:</p>
          <p>I(A A) = I1(A) [ I2(A)
The least recently modified date could be modeled as</p>
          <p>I(A A) = max(I1(A); I2(A))</p>
          <p>Consider the following example presented in Table II and
assume that two users assert the same axiom at different times
into the example ontology:
Meta Knowledge
statedBy Crow;
modified 14-01-2008
statedBy Crow;
modified 14-01-2008
statedBy House;
modified 15-01-2008;
statedBy House;
modified 15-01-2008
ID
#1
#2
#3
#4</p>
          <p>In our running example, the meta knowledge assignment
for axiom #10 is f(agent, Crow), (date, 14.01.2008), (agent,
House), (date, 15.01.2008)g</p>
          <p>In our running example, if we model the least recently
modified date dimension, the meta knowledge of the axiom
#10 is: (date, f14.01.2008g) (date, f15.01.2008g) = (date,
max(f14.01.2008g, f15.01.2008g)) = (date, f15.01.2008g).</p>
          <p>Consider the extended semantics of meta knowledge, we
need to describe a different way of finding a meta knowledge
formula. We redefine the meta function of Definition 7,
such that it computes L of all meta knowledge assignments
available for a statement.</p>
          <p>Definition 9: Meta Knowledge of an Axiom. Extended
Definition.</p>
          <p>Let allmeta: axioms ! 2MKAssignments be a function mapping
from an axiom to all meta knowledge assignments to that
axiom in a meta knowledge dimension D. Then meta(A) is
defined as L allmeta(A).</p>
          <p>This definition of meta not only allows to aggregate meta
knowledge from multiple sources, but also to gracefully handle
unknown meta knowledge, i.e. situations where a knowledges
source does not provide a truth value for some meta knowledge
dimension.</p>
          <p>For example, we want to model the agent dimension as
where provenance, the meta knowledge of the query result for:
ClassAssertion(BrokenLimb limb1). The axiom is satisfiable,
so the corresponding pinpointing formula is #1 ^ #3 ^ #10
= (agent, fCrowg) ^ (agent, Houseg) ^ ((agent, fCrowg)
(agent, fHouseg)) = (agent, fCrow, Houseg).</p>
          <p>VII. COMPUTING META KNOWLEDGE USING PINPOINTS</p>
          <p>
            In order to allow for an algebraic evaluation of meta
knowledge dimensions, we need a single boolean formula. In meta
knowledge mechanisms like [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ], it is derived from queries
in relational algebra. When reasoning with description logics,
however, such a rather simple algebraic foundation of the
basic language does not exist. Instead, multiple axioms may be
needed to establish the truth or falsity of inferred knowledge.
          </p>
          <p>For this purpose, we have defined the meta knowledge formula
in definitions 6 and 9.</p>
          <p>As we can see above, definitions 2, 6 and 9 are quite similar.</p>
          <p>In fact, a pinpointing formula provides exactly what we need
for a meta knowledge formula: All combinations of axioms,
which can be used to establish the truth or falsity of inferred
knowledge.</p>
          <p>For this reason, when reasoning in a logic, where a
pinpointing algorithm is known, we can compute a pinpointing
formula and then derive meta knowledge as usual.</p>
          <p>VIII. COMPLEXITY</p>
          <p>The complexity of this rather naive approach for computing
meta knowledge is equivalent to the computation of
pinpointings. Due to the algebraic specification of meta knowledge
the complexity of the meta knowledge formula is polynomial.</p>
          <p>If the meta knowledge formula is in conjunctive normal
form, however, we might encounter an exponential blowup.</p>
          <p>
            Approaches for computing pinpointings like [
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ] which, rather
than representing pinpoints formula in a conjunctive normal
form, derive a compact representation of the pinpoints formula
benefit the computation of meta knowledge since they avoid
exponential blowup.
          </p>
          <p>IX. EXPERIMENTS</p>
          <p>In this section, we present the evaluation results of our
algorithm. The experiments were performed on a Windows
XP SP3 System and 512MB maximal heap space was set.</p>
          <p>Sun’s Java 1.5.0 Update 6 was used for Java-based tools.</p>
          <p>Reasoning with Meta knowledge The framework for
reasoning with meta knowledge is is available as a Java
prototype and is available as an open source
implementation at &lt;http://isweb.uni-koblenz.de/Research/MetaKnowledge&gt;
together with example of ontologies extended with meta
knowledge. The aggregation of meta knowledge is computed
based on the model presented in Section VI and Section VII.</p>
          <p>Reasoning with Pinpointing The framework for reasoning
with pinpointing is implemented with the OWL API and
the OWL-DL reasoner, Pellet3. Pellet provides the axiom
pinpointing service for debugging ontologies that, for any
arbitrary entailment derived by a reasoner from an
OWLDL knowledge base, returns the minimal set (explanations)
of source axioms that cause an inconsistency and the relation
between unsatisfiable concepts. The algorithm is black box
based. In the following experiments we compare the
processing time of our approach with reasoning with pinpointing
approach.</p>
          <p>
            Data Our sample data consists of 7 typical existing OWL
ontologies used for debugging. This dataset has already been
used for tests the computing time of laconic justifications
in [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Table III shows the number of entailments that hold in
them and provide the range of expresivity. Each ontology was
classified in order to determine the unsatisfiable classes. This
classes were selected as input (query) to compute the meta
knowledge degree and pinpoints. For each query the time to
compute all pinpoints and the meta knowledge degree was
recorded.
          </p>
          <p>3Pellet Reasoner: http://clarkparsia.com/pellet/</p>
          <p>TABLE III</p>
          <p>
            ONTOLOGIES USES IN EXPERIMENT. TABLE TAKEN FROM [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]
          </p>
          <p>Evaluation Results Table IV displays the times for
reasoning with meta knowledge and reasoning with pinpointing.</p>
          <p>For each ontology, we have computed all pinpoints for all
unsatisfiable classes and reported the overall computing time.</p>
          <p>The experiments was done 10 times and the average time was
considered. We can observe that the time for computing the
meta knowledge degree takes longer than the computation of
pinpointing (in average 4,9 ms longer). This is to be expected
since the computation of meta knowledge degree is done once
all justifications are already computed as we have shown in
Section VIII. In all in all, the processing times presented in
Table IV are still acceptable for interactive applications, and
thus this approach can be used for solutions in real time.</p>
          <p>We expect optimizations to reduce the processing time to
less than a second in the average case also for the more
complex ontologies. As we are only interested in computing
the meta knowledge, we can direct the pinpointing algorithm
to only compute those pinpoints resulting in the highest meta
knowledge values. The optimization will be reported in future
work.</p>
          <p>X. RELATED WORK</p>
          <p>Related work can be grouped into the following categories:
(i) Extensions of description logics with a particular meta
knowledge dimension, especially uncertainty. (ii) General meta
knowledge for query answering with algebraic query
languages. (iii) Extensions of description logics with general
meta knowledge and (iv) meta knowledge for other logical
formalisms.</p>
          <p>
            ad (i) Several multi-valued extensions of description logic
have been proposed: [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ] propose fuzzy and probabilistic
extensions of the DLs underlying the web ontology language
OWL. [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ] describe an extension towards a possibilistic logic.
          </p>
          <p>
            Another extension towards multi valued logic is presented
by [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ]. They target at trust and paraconsistency instead of
uncertainty. OWL 2 is extended to reasoning over logical
bilattices. Bilattices which reflect the desired trust orders are
then used for reasoning. [16] provide an extension to reasoning
in OWL with paraconsistency.
          </p>
          <p>All of these approaches have in common, that they modify
the character of models in the underlying description logic, e.g.
to fuzzy or possibilistic models. In our approach in contrast,
we reason on a meta level: While the underlying model
remains unchanged, we compute consequences of annotations
on axioms. This meta level reasoning is not possible in the
approaches proposed above. Unlike general meta knowledge,
these approaches are more tailored to a specific need and hence
reasoning is cheaper for some. Particularly for fuzzy,
possibilistic and paraconsistent description logics, the complexity of
the underlying logic carries over, while in our case additional
complexity is introduced through pinpointing.</p>
          <p>
            ad (ii) Meta knowledge to algebraic languages has been
proposed by various authors, for example for the
Semantic Web Query Language SPARQL [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ] and for relational
databases [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ]. In [17] the authors have propose a framework
for meta knowledge management with support for querying
and updating RDF/S graphs that takes into account both RDF
named graphs and RDFS inference. While the actual meta
knowledge formalisms are comparable to ours, the underlying
languages are of lower expressivity, typically Datalog. Meta
knowledge formulas in these language can directly be derived
from the tree shaped representation of a query, which is not
possible in description logics.
          </p>
          <p>
            ad (iii) [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ] propose a meta knowledge extension of OWL,
which is also based on annotation properties. Even though
meta knowlege can be expressed in ways comparable to ours,
it has a rather ad-hoc semantics, which may differ from
query to query. In our approach, meta knowledge and classical
reasoning take place in parallel. Hence, we can answer queries
such as ”Give me all results with a confidence degree of x”.
          </p>
          <p>
            In contrast, reasoning on the ontology and meta level in [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ]
is separated. As a result, queries such as the following can be
answered: ”Give me all results, which are based on axioms
with a confidence degree of x”. Although this difference
might seem quite subtle, depending on the meta knowledge
dimension, e.g. probabilistic confidence, these queries may
have very different results.
          </p>
          <p>ad (iv) [18] propose an extension of Datalog with weights,
which are based on c-semirings and can be redefined to reflect
various notions of trust and uncertainty. Our meta knowledge
dimensions are similar to c-semirings, but additionally allow
to handle conflicting meta knowledge using a third operator.</p>
          <p>As c-semirings have been investigated in great detail and
have some desirable properties4, a modification of our work
towards similar algebraic structures might introduce additional
interesting properties of meta knowledge.</p>
          <p>XI. CONCLUSION</p>
          <p>We have introduced a formalization of meta knowledge that
allows to handle conflicting and incomplete meta knowledge
on the Semantic Web. Meta knowledge per se cannot easily</p>
          <p>4Such as the fact that the cartesian product of two c-semirings again is a
c-semiring.
be built into a logical formalism such as description logics.</p>
          <p>Hence, we have provided an operationalization based on
pinpointing, in order to derive a meta knowledge formula,
which can easily be evaluated. Extensions of the approach
beyond description logics are possible, based on pinpointing.</p>
          <p>Currently, we are working on the optimization of the
algorithms for computing meta knowledge. The optimization are
possible based on the observation, that we no longer need
to compute all pinpointing formulas in oder to determine the
meta knowledge but only computing a relevant subset of all
pinpoints.</p>
          <p>XII. ACKNOWLEDGEMENTS</p>
          <p>This research was supported by the European Commission
under contract IST-FP6-026978, X-Media and contract
IST2006-027595, Lifecycle Support for Networked Ontologies
NeOn. The expressed content is the view of the authors but
not necessarily the view of the X-Media and NeOn projects.
Semantically Annotated Provenance in the Life</p>
          <p>Science Grid
Bin Cao, Beth Plale, and Girish</p>
          <p>Subramanian</p>
          <p>School of Informatics and Computing
Indiana University, Bloomington, IN, USA
{plale, bincao, subramag}@cs.indiana.edu</p>
          <p>Paolo Missier and Carole Goble</p>
          <p>School of Computer Science</p>
          <p>University of Manchester</p>
          <p>Manchester, UK
{cgoble, pmissier}@cs.man.ac.uk
Abstract— Selected semantic annotation on raw provenance
data can help bridge the gap between low level provenance
events (e.g., service invocations, data creation, message
passing) and the high-level view that the user has of his/her
investigation (e.g., data retrieval and analysis). In this initial
investigation we added semantically annotated provenance to
the Life Science Grid, a cyber-infrastructure framework
supporting interactive data exploration and automated data
analysis tools, through (i) automated data provenance
collection and (ii) automated semantic enrichment of the
collected provenance metadata. We use a paradigmatic life
sciences use case of interactive data exploration to show that
semantically annotated provenance can help users recognize
the occurrence of specific patterns of investigation from an
otherwise low-level sequence of elementary interaction events.</p>
          <p>Keywords- life sciences, provenance, semantic annotation</p>
          <p>I.</p>
          <p>INTRODUCTION</p>
          <p>
            Cyber-infrastructure frameworks for experimental
science are becoming an increasingly popular way of
interacting with a variety of analysis tools and other
computational and data resources on the Internet. Automated
provenance [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ] metadata, collected during the course of a
scientist’s interaction with the framework during a data
exploration session, can add value to the exploration process
in a number of ways: it can be used to reproduce analyses
and processes, identify the causality of a series of events,
broaden sharing and reuse of data products, support the
longterm preservation of scientific data, attribute ownership, and
determine the quality of a particular data set. Raw
provenance data, however, consists mainly of observations of
a user’s interaction with some visual interface, as well as of
system-level observations of system events (service
invocations, data creation, message passing). Unlocking the
potential of such provenance metadata requires bridging the
gap between these low level events, and the view that the
user has of his/her investigation, which is likely to be
described in terms of high-level information processing,
typically consisting of data retrieval and analysis steps that
lead to some scientific finding. The work described in this
paper stems from the hypothesis that augmenting raw
provenance metadata with selected semantic annotations
helps bridge this gap, and furthermore, that for the most part
such annotations can be obtained automatically, i.e., with
minimal user effort.
          </p>
          <p>This work is sponsored by funding from the Eli Lilly Corporation and
by the National Science Foundation award OCI-0721674 .</p>
          <p>
            We explore this hypothesis in the specific context of the
Eli Lilly open source Life Science Grid (LSG) [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ], a
cyberinfrastructure framework built from Microsoft .NET 2.0
Component Application Block (CAB) and Web Services that
couples automated data visualization and display (through
the CAB) with invocation of data sources and analysis tools
(through Web Services). The LSG is in production use inside
Eli Lilly with a more fully functioning open source version
anticipated.
          </p>
          <p>
            We approach the study by defining a paradigmatic use
case for interactive exploration of life sciences data, and used
it to drive the design of an architectural model that integrates
LSG with (i) automated data provenance collection, using the
Karma provenance framework [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ] developed at Indiana
University, and (ii) automated semantic enrichment of the
collected provenance metadata, using the Semantic-Open
Grid Service Architecture (S-OGSA) semantic annotation
framework [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ] developed at University of Manchester. The
use case is based on the data playground idea, first proposed
by Gibson et al. [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ], which builds on the hypothesis that
recognizable patterns of a complex data exploration process
may emerge from the continuous observation of direct user
interaction with data exploration and analysis tools.
          </p>
          <p>The remainder of the paper describes an initial
investigation into the potential for the use of provenance in
this scenario, specifically to help users recognize the
occurrence of specific patterns of investigation from an
otherwise low-level sequence of elementary interaction
events. Thus, in addition to describing the use case (Section
II) and presenting the technical architecture that made this
investigation possible (in Section III), we reflect upon the
type of provenance metadata that can be usefully and
inexpensively collected, semantically annotated, and
exploited to add value to scientific findings.</p>
          <p>USE CASE</p>
          <p>The use case driving our work, shown in Figure 1,
describes a realistic scenario of exploratory analysis on genes
and gene products. The example is representative of a typical
investigation method in bioinformatics, where a small set of
genes that are known to be involved in a particular disease,
in this case human diabetes, is used as a seed to grow a larger
collection of related genes, which will provide the scope for
further and possibly more expensive lab analyses. The
collection grows incrementally, in a series of iterations where
a gene pool, indicated as the working set in Figure 1, is
updated by either adding or removing some of its elements.</p>
          <p>The iteration involves a combination of access and user
interaction with public databases accessible through web
services on the web (we use the NCBI Entrez service for
searching gene details and the AmiGO browser for Gene
Ontology associations), and the use of Basic Local
Alignment Search Tool (BLAST) in order to reveal
homologous genes in model organisms, typically the mouse.</p>
          <p>Genes obtained from BLAST are again inspected by the user,
and can be selected for addition to a “working set”
maintained on behalf of a user, or discarded based on the
user’s judgment. The process can repeat possibly multiple
times, by again BLAST-ing some of the mouse genes,
leading to a larger pool of human genes that are more or less
directly related to each other through homology properties.</p>
          <p>Throughout this process, users interact with a variety of
interfaces, which LSG integrates into one single visual
environment, as described in the next section. Although the
iterations indicate a logical sequence of events, users are not
constrained by any prescribed course of action; indeed, most
of the steps can be performed in any sequence, making for a
variety of different analysis paths. At the end of the process,
it is important for users to understand how a certain final
working set of genes was accumulated: certain genes were
discovered but discarded, others were deemed worthy of
further investigation, others were first added and then
replaced by other, more promising elements. A combination
of raw provenance metadata, user-provided and
automatically added semantic annotations is used to support
the explanation process. Raw provenance includes a trace of
all the invocations to services through the LSG interface, as
well as all UI interactions. User-provided annotations include
optional descriptions that explain each update decision that
affected the working set (addition, removal), and semantic
annotations are obtained from various sources, for example a
registry of semantically annotated Web Services, as
described in the next Section.</p>
          <p>SYSTEM ARCHITECTURE</p>
          <p>We view the use case as an instance of a general user
interaction model, where events and data products are
recorded and associated to a user session, and various
annotations are associated to both the events, for example a
service invocation, and the data products, e.g., the result
message from the service (we define a session as being
delimited by user login and logout actions). Figure 2 gives an
overview of the architecture used to support this interaction
model. Individual users can configure their own personal
LSG desktop environment by selectively enabling some of
the available plugins, which control the interaction with
specific services. In addition, LSG plugins interact amongst
each other using a publish/subscribe model through an LSG
event bus, providing users with an integrated, multi-panel
interface. Thus, suppose for example that an NCBI Entrez1
plugin accepts user gene lookup requests, and sends the
corresponding gene descriptions onto the bus, while an
AmiGO2 plugin that is able to resolve Gene Ontology (GO)
terms subscribes to those descriptions. When the user
submits a request, the response triggers the AmiGO plugin,
which responds by updating its own interface with the GO
descriptions of the gene, while details of the latter are being
displayed on the NCBI Entrez plugin interface.</p>
          <p>user</p>
          <p>AmiGO
Gene</p>
          <p>Browser
NCBI
Entrez
Proxy</p>
          <p>BLAST</p>
          <p>Resources
(public + private)</p>
          <p>LSG Space
s
u
B
B
A
C
y
lli
L
a tro
ram lfec
K e
tyo tsnR
l
ilL evE
Karma Events Bus (WS+LSG)
ProveDnBance
s
e
c
ir
v
e
S
a
m
r
a
K
e
c
a
Iftr
e
n
F
D
R
*
M
P</p>
          <p>O
Karma
structure
ontology
Service Annotations
*Open Provenance Model v1.01</p>
          <p>
            We have exploited this event model to generate
elementary provenance events through the Karma
component, which is configured to snoop on the LSG event
bus, in addition to having its own instrumentation in the Web
Service proxies that mediate LSG plugin interactions with
the services. Karma structures these provenance events
according to the Open Provenance Model (OPM) [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ], a
1 NCBI Entrez is a search engine for biomedical databases
and available at
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gen.
2 AmiGO is a web service to access Gene Ontology
associations for genes and available at
http://www.geneontology.org/.
community standard for describing causal graphs through a
set of pre-defined types of nodes and their relationship.
          </p>
          <p>Throughout a user session, fragments of OPM graphs
representing single interactions are forwarded to the
SOGSA component, which performs two functions: firstly, it
analyses the OPM graph and adds semantic annotations to
some of its nodes, whenever possible and by using a variety
of annotations sources. For instance, if a node represents a
Web service invocation, and a semantically annotated
description of the service is available, then S-OGSA
augments the OPM graph by associating the annotations to
that node (a more detailed description of the annotation
architecture is described in Section C). Secondly, S-OGSA
stores the pair &lt;user session, OPM graph&gt; in its own
database, which the Provenance Visualizer can query to
present semantic provenance to the user. By having the
Provenance Visualizer implemented as a new LSG plugin
itself, the combination of these components provides users
with a seamlessly integrated feedback loop, by incrementally
displaying the effect of their actions as a rendering of
provenance metadata.</p>
          <p>Next, we elaborate on the three main components of the
integrated architecture.</p>
          <p>A. LSG</p>
          <p>Two main features make LSG an appealing platform for
our experimentation: its openness, which made it possible to
create provenance events simply by adding a subscriber to
the LSG event bus, as described earlier; and its extensibility,
which we have used to implement new plug-ins especially
for our use case. Specifically, we have used two of the
available plugins for the open source version of LSG, namely
for searching the NCBI Entrez database and for resolving
GO terms; and have implemented three new plugins:


</p>
          <p>A BLAST plugin that interacts with one of the
several publicly available BLAST services3;
A Working Set Manager, to manage the dynamic
collection of data products, in this case genes, that
represent the main outcome of the users’
investigation;
A Provenance Visualizer, which can display parts of
the provenance graph to the users (see Figure 3).</p>
          <p>The “LSG Space” in Figure 1 shows the relationship
amongst these plugins. While we exploit the event model to
automate much of the data flow across the plugins, we also
identify points in the process where we felt that explicit,
knowledge-intensive user input was desirable. Thus, for
example, while it is possible to extract a DNA sequence in
FASTA format from an NCBI gene description record, to be
used as input to BLAST, expert users prefer to have control
over the portion of the sequence, for example to include or
exclude the gene promoter regions on either side of the
sequence. This mix of automated data flow and explicit user
input offers the additional opportunity for users to add their
3 http://www.ebi.ac.uk/Tools/blast/.
own notes as explanations of their actions, for example to
comment on the choice of a wider region around a gene. This
is particularly clear in the design of the Working Set
Manager, which automatically accepts new elements, i.e.,
genes from BLAST, through the event bus, but also offer
users the opportunity to examine (accept, reject, annotate)
each of them individually.</p>
          <p>The main functions of the Karma component in this
setting are to capture raw provenance events, and to format
them according to the Open Provenance Model specification.</p>
          <p>As Karma is a general provenance collection and
management tool, it implements a generic provenance model
and set of instrumentation tools that are independent of the
application system. Instrumentation of the LSG required the
use of several forms of instrumentation. For the web hosted
data services and sources, we implemented proxy web
services that utilize instrumentation handlers in Axis2 to
collect provenance. Provenance of the CAB activity is
captured by a listener on the CAB events bus. The listener
forwards provenance relevant events to Karma. The high
level view of capture is shown in Figure 4.</p>
          <p>Remote 
Web 
Service
Proxy</p>
          <p>Life Science Grid
Gene </p>
          <p>Browser
Provenance 
Instrumentation</p>
          <p>Working Set </p>
          <p>Manager
Provenance 
Instrumentation</p>
          <p>Gene </p>
          <p>Ontology
Provenance </p>
          <p>Instrumentation</p>
          <p>BLAST
Provenance 
Instrumentation</p>
          <p>NCBI Entrez
Provenance </p>
          <p>Instrumentation</p>
          <p>Enterprise Service Bus</p>
          <p>Karma
causeID.SetServiceID("htp:/www.my
grid.org.uk/ontology#Blast_Plugin");
efectID.SetServiceID("htp:/www.my
grid.org.uk/ontology#Blast_Ebi_Web_
Service");
OTimeTypetime=newOTimeType();
time.SetClockID("default");
time.SetNoEarlierThan(System.DateT
ime.Now);
time.SetNoLaterThan(System.DateTi
me.Now);
LSGUtilitiesutil=
LSGUtilities.Instance();
stringuserID=
util.GetUser().LogonID;
WasTriggeredBynotification=new
WasTriggeredBy(efectID,causeID,
userID,time);
WSEClientApiwseClient=new
WSEClientApi();
wseClient.publish("htp:/tyr11.cs.india
na.edu:12346","LSG-EVENT",
notification.ToXml();</p>
          <p>We distinguish between black box plugins, for which it
may be possible to observe data exchange events that occur
through the LSG bus, and white box plugins, where in
addition, user interaction events that occur through a service
interface can also be detected. In practice, black boxes are
those where native web pages are displayed, so that access to
the user click-throughs on the page is limited and can only be
achieved by intercepting the HTTP requests using a proxy,
for instance, but some of the context in which the request is
made is missing. In white box components, on the other
hand, the UI is part of the plugin design, and as a
consequence we can capture user actions with full detail.</p>
          <p>According to the black-box, white-box distinction, the
Working Set manager is a white box, because all user events
can be observed along with optional user annotations, while
the native LSG plugins, the NCBI Entrez and AmiGO
plugins, are black boxes. As for the BLAST plugin, making
it a white box required the extra effort of encoding a bespoke
web-based interface to interact with the service, in order to
capture all of the important user interactions. Thus, Karma
captures the user selection, de-selection, and annotations of
genes in Working Set Manager, and the set of genes that
transit on the LSG bus, including BLAST reports.</p>
          <p>Karma maps provenance events to fragments of OPM
graphs. In its simplest form, an OPM graph consists of two
types of nodes, which represent Artifacts and Processes.</p>
          <p>
            These are shown as ovals and rectangles, respectively, in
Figure 3. Nodes are connected using directed labeled arcs,
which express properties that hold between two nodes. The
set of all legal properties is fully described in [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ], however
the following three types of properties were found to be
sufficient to express our provenance events:



process P used artifact A, for example,
NCBI_Entrez_Plugin used Entrez_Gene_ID,
artifact A wasGeneratedBy process P, for example,
Entrez_Gene_ID wasGeneratedBy
Gene_Browser_Plugin, and
process P1 wasTriggeredBy process P2, for example,
NCBI_Entrez_Plugin wasTriggeredBy
          </p>
          <p>Gene_Browser_Plugin.</p>
          <p>The first two properties express ordinary
producer/consumer relationships, while the latter is useful in
expressing the indirect interaction between two plugins that
publish and subscribe to a data element, respectively. We
also use the same property to express the fact that a plugin
controls an underlying service, i.e., in the typical situation
where a service invocation is triggered by a plugin.</p>
          <p>
            Provenance events are published as notifications to the
Web services-based message broker, WS-Messenger [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ],
where Karma is a subscriber. When a provenance
notification arrives, the corresponding provenance handler
picks it up, retrieves the raw provenance data, and stores
these data into its own provenance database, a MySQL
relational database. These raw provenance data can be used
to answer general provenance questions as well as determine
the artifact dependency and the process dependency during a
user session. Meanwhile, these data is sent to S-OGSA for
semantic annotation. Since OPM is an abstract process
model with multiple concrete serialization formats for
portability across applications, as indicated in Figure 2, we
have used the RDF 4 serialization to transfer OPM graph
fragments from Karma to the S-OGSA component.
          </p>
          <p>C. Modular Semantic Annotations using S-OGSA</p>
          <p>
            S-OGSA [
            <xref ref-type="bibr" rid="ref1 ref16 ref19 ref29 ref32 ref4">1, 4</xref>
            ] manages the persistent and stateful
associations between Grid resources, i.e., data or services,
and their annotations (or any form of related metadata),
expressed primarily as RDF graphs. Such associations,
known as semantic bindings, can be queried with SPARQL.
          </p>
          <p>S-OGSA mapping to this project has user sessions playing
the role of resources, with OPM provenance graphs produced
by Karma as their associated metadata. S-OGSA additionally
augments the input graphs with semantic annotations. Here
we focus on the latter part of the S-OGSA architecture5.</p>
          <p>The annotation architecture is based on the principle that
annotations to nodes in the RDF OPM graph will depend on
(i) the specific types of Artifact and Process nodes, and (ii)
the availability of metadata sources that can be used to derive
interesting metadata for those node types. To account for this
flexibility, we designed a modular architecture based on the
interceptor pattern, consisting of an extensible chain of
annotators, each specialized to annotate specific types of
nodes. Each annotator receives an input RDF graph,
produces an augmented version of the same graph with
annotations added to it, and forwards it to the next annotator
down the chain. As no parts of the input graph are ever
removed, annotators can be added incrementally to S-OGSA,
in a monotonic fashion. The pattern is illustrated in Figure 5.</p>
          <p>As a proof of concept, we have implemented a chain
consisting of two annotators, one for Process node of type
Web Services, and one for Artifact nodes of type Blast
report. We now describe how each of these two annotators
uses a different metadata source to produce its annotations.
4 http://www.w3.org/RDF/
5 Technically, S-OGSA relies on the Anzo RDF API for
storing its annotation graphs, and its functionality is
exposed as a RESTful Web Service.
deconfliction S was generated.</p>
          <p>This challenge was approached in a two-step process [16].</p>
          <p>Starting with a set of domain ontologies and a GILA log
instance, an analysis phase would first be conducted. This
would return the following: (i) from the domain
ontologies, a list of OWL classes and properties corresponding
to PML classes (e.g., pmlp:Agent, pmlp:InferenceRule), and
relationships (e.g., pmlj:isConsequentOf), and (ii) from the
log instance, a set of RDF based structural relations, not
captured by the domain ontologies, which correspond to PML
relationships. Following the analysis phase, a mapping phase
would be conducted, in which PML-based information would
be inserted into the log instance.</p>
          <p>Figure 4 illustrates how provenance normalization could be
applied for the example above. First, in the analysis phase,
two domain ontologies - gilcore and gilaco - are inspected to
identify the hidden provenance information from the original
GILA log. The following domain knowledge is uncovered: (i)
ACM deconflictions are represented as instances of the class
gilcore:Solution, (ii) each solution Si has a corresponding
problem state Pi, defined as an instance of gilcore:Problem,
(iii) the property gilcore:hasProblem is used to link Si to Pi (in
the figure, property names are omitted due to limited space),
(iii) the property gilaco:hasSolutionListResolveConflict
links Pi to a recursively declared list of instances of
gilcore:SolutionListResolveConflict where each list item
listi;i 1 represents an earlier solution Si 1. This list helps
define the context of a current problem, but doesnt explicitly
define the solution Si 1 used to transform Pi 1 to Pi
knowledge required for uncovering the solution generation
data flow.</p>
          <p>In the mapping phase, PML data is built in the
following steps: (i) an OWL ontology is defined for linking the
domain ontologies to PML, which asserts gilcore:Solution
and gilcore:Problem as subclasses of the OWL class
pmlp:Information, (ii) through OWL inference, instances
of these two classes will be inferred to be of type
pmlp:Information, (iii) through JENA 2 (a Java-based RDF
data processing API) and SPARQL, PML data is generated
from a GILA log instance which normalizes links between
problems and solutions (e.g. from Pi to Si 1).</p>
          <p>B. Lessons learned</p>
          <p>In this case study, we demonstrated how to use provenance
normalization to address usability challenges by generating
PML data based on both GILA’s domain ontologies and
logs. Although non-trivial domain expertise was needed to (i)
identify the provenance components in the domain ontologies
and log data, and (ii) establish mappings from the domain
ontologies to PML, such work usually ended up being a
onetime job. Subsequent generation of PML data in the mapping
phase could then be automated using off-the-shelf tools and
easily be performed.</p>
          <p>One of our future goals will involve determining ways
in which the analysis phase could be (at least partially)
2http://jena.sourceforge.net/
GILA Log</p>
          <p>Fig. 4. PML encoding based on GILA log
automated. However, as part of this, a set of constraints on
how log and domain ontology information can be structured
will be required.</p>
          <p>
            In general, many complex systems like GILA can record
provenance in their workflow logs, but usually use
domainspecific terminology and structure. A fair amount of work must
be done to individually tailor explanation interfaces for these
systems. By normalizing workflow provenance into PML, we
can more easily apply general-purpose explainers [
            <xref ref-type="bibr" rid="ref13 ref28 ref41">13</xref>
            ], [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ] to
various workflows from different domains.
          </p>
          <p>V. CASE STUDY: PC3</p>
          <p>In the Third Provenance Challenge (PC3) 3, 15 research
groups were asked to use their own approaches to: (i) generate
provenance metadata for exposing the execution of a given
workflow, (ii) use this metadata to answer a set of
provenancebased queries, (iii) export this metadata, and (iv) import
metadata from other teams and answer the queries from (ii)
using it. A common interchange format, the Open Provenance
Model (OPM) [17], was chosen for teams to import and
export their provenance metadata. The workflow investigated
in this effort was derived from the Pan-STARRS project 4,
which processes data on 99% of visible stars in the northern
hemisphere, and manages a pipeline for loading domain data
in CSV files into a relational database and validating it. Here,
control flow was viewed as the sequence of processes executed
within the workflow, subject to conditional branching (e.g.,
if a process fails to complete correctly, halt the workflow
otherwise, continue normally). Likewise, dataflow was defined
as a sequence of steps by which data would be generated and
used by processes.</p>
          <p>A. Use of PML</p>
          <p>During PC3, two important requirements emerged for us
(and many other groups) in encoding provenance capable of
answering the queries.</p>
          <p>3http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
4http://pan-starrs.ifa.hawaii.edu/</p>
          <p>First, domain-specific provenance (as with GILA and
SPCDIS) was needed to answer many of the queries. One
such query, known in PC3 as Core Query 1, reads: ”For a
given detection, which CSV file(s) contributed to it?” Here,
two domain-specific concepts are referenced: (i) a detection,
which is a type of data handled in the workflow, and (ii) a
contribution, which references a data loading sequence carried
out by the workflow.</p>
          <p>Second, many queries required the control flow of the
workflow to be tracked in parallel with the dataflow. We
viewed this challenge as consisting of two parts: (i) explicitly
distinguishing execution of operations from the operations
themselves, and (ii) representing the dependencies among
executions of operations. An example of a control-flow based
query, Optional Query 2, reads: ”Which pairs of procedures
in the workflow could be swapped and the same result still be
obtained (given the particular data input)?”</p>
          <p>The generation of provenance meeting both these
requirements that both we and other groups could answer queries
over constituted an interoperability challenge in PC3. To
address this, we used Semantic Web technologies to manage and
query provenance - both recorded from the system workflow
and imported from other teams.</p>
          <p>Specifically, we explored storing provenance as RDF data
structured around a prototype ontology containing both OPM
and PML based concepts 5. In turn, we were able to export
provenance in both the OPM and PML formats for use by other
groups (although during PC3, only OPM was used by other
groups). In both our exported OPM and PML, we were able to
handle the provenance specialization requirement mentioned
above. However, some interesting issues emerged with both
OPM and PML in control flow tracking.</p>
          <p>For OPM, these seemed to emerge from ambiguities in its
Process concept definition - which could either be viewed as
an operation, or the execution of an operation. Such ambiguity
was avoided in PML through the concepts pmlj:InferenceRule
(the operation) and pmlj:InferenceStep (its execution). While
the names InferenceRule and InferenceStep may be used most
often in logical theorem provers, they are applied in any setting
where some inference (possibly statistical or process) is used
to manipulate information thus they are easily applied in a
process setting.</p>
          <p>Likewise, PML lacked a mechanism for directly tracking
dependencies between operations and their executions.
However, OPM did provide a way to track dependencies
between instances of its Process concept - through the provided
wasTriggeredBy relation. Both PML and OPM are evolving
to meet community needs and we might expect co-evolution
and potentially inclusion or importing of some representational
features from one into the other. A comparison of the OPM
and PML models can be found at [18].</p>
          <p>B. Lessons Learned</p>
          <p>Based on our experiences with PC3, and the other case
studies, we feel that Semantic Web technologies are well suited
5http://www.cs.rpi.edu/˜michaj6/provenance/PC3OPM.owl
for representing workflow provenance (in particular, for
facilitating integration of provenance from heterogeneous sources).</p>
          <p>In addition, while we found some expressivity limitations in
PML, these could easily be fixed by adding/referencing other
ontology modules (e.g. for control flow concepts).</p>
          <p>VI. RELATED WORK</p>
          <p>
            Workflow Provenance Models. There is a diverse
literature on workflow provenance models [19]. Although these
models differ in certain aspects, they all model some general
provenance concepts, including processes, data, and
processdata dependencies [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ]. Many of them include domain
specific concepts required by applications. For instance, Taverna
[20] has included bioinformatics ontologies and the VisTrails
[21] system adds workflow description as a kind of data in
tracking user behavior in assembling workflows. PML, as a
provenance interlingua, covers these general concepts. It is
notable that PML, as an OWL ontology, can be connected to
domain concepts (without hard-coding) via ontology mapping
(declaring the rdf:type of certain domain data as a subclass of
pmlp:Information).
          </p>
          <p>Beyond the basic provenance concepts, some useful
concepts like control flow may also be captured by
workflow provenance. Furthermore, [22] identified prospective
provenance (abstract workflow descriptions) and retrospective
provenance (workflow execution logs) in a layered model,
and both types are supported by the REDUX [23], Taverna,
Pegasus [24], and Karma [25] provenance models. PML core
vocabularies only cover the basic provenance concepts in
the retrospective provenance because they were designed to
only capture generic data derivation processes. However, PML
can be extended with workflow specific modules, such as
WDO (http://trust.utep.edu/wdo/) and SAWs [26] for capturing
prospective provenance.</p>
          <p>The Open Provenance Model (OPM) is another
generalpurposed provenance model. While OPM remains technology
agnostic, PML presently provides a family of OWL ontologies
with RDF syntax. This brings about a current implementation
advantage of PML: it can be seamlessly integrated with
domain ontologies and thus support queries involving both
domain constraints and generic provenance relations.</p>
          <p>Semantic Web Vocabulary for Provenance. There are
some existing works on provenance representation in
Semantic Web communities. The Dublin Core (DC) ontology
(http://dublincore.org/documents/dc-rdf/) offers generic
provenance related properties. The Friend of a Friend (FOAF)
ontology (http://xmlns.com/foaf/spec/) offers classes and
properties for annotating entities involved in provenance, such as
people (foaf:person). It is also notable that there are some
emerging provenance ontologies [27]. These ontologies have
a good overlap with PML, especially its provenance module.</p>
          <p>However, PML differs from these works based on its
justification module which offers support for tracking complex
relationships between provenance-based entities.</p>
          <p>VII. CONCLUSION</p>
          <p>In this paper, we have shown the usage of PML in
representing workflow provenance through three case studies. In
addressing these case studies, both usability and
interoperability challenges emerged in various forms, which required
differing strategies to handle. With this, we give some final
words on both the challenges of provenance usability and
interoperability.</p>
          <p>For the usability challenge, many workflow systems (such
as SPCDIS and PC3) will rely upon domain-specific concepts
that cannot be expressed using a domain independent
representation alone. Likewise, others (GILA) will encode provenance
data using a domain-specific log that is not intuitive for a
general audience. In our case studies, PML proved effective by
(i) helping users answer queries involving both domain specific
and independent provenance knowledge, and (ii) helping with
normalization of domain-specific provenance relationships.</p>
          <p>Likewise, to address the interoperability challenge, PML
can be (and has been) easily connected to domain ontologies
and other provenance models, including OPM, via ontology
mappings (as was done with SPCDIS and GILA) and ontology
extensions (like with PC3). Here, It should be emphasized
that the interoperability challenge requires the establishment
of best practices for information exchange (as well as an
effectively designed provenance representation like PML).</p>
          <p>To establish best practices for provenance interoperability,
we stress the adoption of Semantic Web languages (such as
OWL) as a common data exchange medium. Through this,
explicit mappings can be established between concepts (for
instance, by adding ”owl:sameAs” assertions). In addition, this
would allow for a wider degree of terminology to be used for
concept descriptions (such as the PML terms ”pmlp:hasName”
and ”pmlp:hasFormat”).</p>
          <p>ACKNOWLEDGMENT</p>
          <p>This work is partially supported by NSF award
#0524481,HRD-0734825, IARPA award #FA8750-07-2-0031,
DARPA award
#FA8750-07-D-0185,#55-002001,#FA865006-C-7605.</p>
          <p>Using Web Data Provenance for Quality Assessment</p>
          <p>Olaf Hartig
Databases and Information Systems Research Group</p>
          <p>Department of Computer Science</p>
          <p>Humboldt-Universita¨t zu Berlin
Email: hartig@informatik.hu-berlin.de</p>
          <p>Jun Zhao
Image Bioinformatics Research Group</p>
          <p>Department of Zoology</p>
          <p>University of Oxford
Email: jun.zhao@zoo.ox.ac.uk</p>
          <p>Abstract—The Web of Data cannot be a trustworthy data
source unless an approach for evaluating the quality of data on
the Web is established and integrated as part of the data
publication and access process. In this paper, we propose an approach of
using provenance information about the data on the Web to assess
their quality and trustworthiness. Our contributions include a
model for Web data provenance and an assessment method that
can be adapted for specific quality criteria. We demonstrate how
this method can be used to evaluate the timeliness of data on
the Web, to reflect how up-to-date the data is. We also propose
a possible solution to deal with missing provenance information
by associating certainty values with calculated quality values.</p>
          <p>
            With the growth of the open-accessible Web of Data [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ]
the needs for evaluating the quality of the data in applications
are becoming more and more pressing. Information quality
research has been successfully applied to evaluate the quality
of organizational information and to monitor the improvement
of work practice [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ]. Quality assessment of data on Web
should be a paramount task in order to ensure that the
most appropriate and trustworthy data are made available and
delivered to the users. Scientific applications built upon the
Web of Data will be of little value if scientists are skeptical of
the quality of data; financial systems will be untrustworthy and
fragile without any policies for quality control and evaluation.
          </p>
          <p>To assess the quality of data, we need to identify the types
of information that can be used for evaluation and a method
for calculating quality values. In this paper, we present an
approach that uses provenance information to assess quality
of data on the Web; and we propose a generic assessment
procedure that can be adapted for evaluating specific
qualitycriteria, such as accuracy and timeliness.</p>
          <p>
            As the base of our approach we introduce a provenance
model tailored to the needs for tracking and tracing provenance
information about data on the Web. This model not only
represents the creation of a data item, but also describes
provenance information about the entities who made the data
accessible on the Web [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ]. We call this Web data provenance.
          </p>
          <p>Many existing information quality assessment approaches
are based on information contributed by users. In this paper,
we focus on using a quantitative approach for calculating
quality of data. This assessment approach takes three steps:
collecting the elements of provenance information needed for
quality assessment, then deciding on the influence of these
elements on the assessment, and, finally, applying a function
to calculate the quality.</p>
          <p>To demonstrate how design decisions can be made when
developing this method into assessing specific quality criteria
we walk through the development using the timeliness criterion
as an exemplar. Since the provenance information required
for quality assessment might be incomplete or fragmentary,
the assessment method must be capable to deal with missing
information. We introduce a possible approach of associating
certainty values with the calculated timeliness value.</p>
          <p>This paper is structured as follows. Section II reviews
related work and Section III introduces the model for Web data
provenance. Section IV describes our assessment method that
can be adapted for specific quality criterion, like timeliness,
as demonstrated in Section V. We conclude in Section VI.</p>
          <p>II. RELATED WORK</p>
          <p>
            In this paper we consider information quality (IQ) as “an
aggregated value of multiple IQ-criteria” [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ], such as accuracy,
completeness, believability, and timeliness. The assessment
of information quality can be regarded as “the process of
assigning numerical values (IQ-scores) to IQ-criteria” [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ]. IQ
assessment is known to be hard [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ]. Although there are many
related work on conceptualizing the problem of IQ and its
assessment, there are fewer work proposing concrete methods
for quantifying the quality assessment. In the following we first
introduce different approaches for IQ assessment in general,
and then we focus on provenance-based.
          </p>
          <p>
            Lee et al. [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ] propose a quality assessment methodology that
measures IQ from four quadrants: soundness, dependability,
usefulness, and usable information. Each dimension includes
several IQ-criteria. For example the dependability of
information is measured by its timeliness and security. A questionnaire
is designed to measure users’ feedback to each IQ-criterion
in a scale of 0-10. The assessment value for each quadrant is
computed as a mean of the measurements of its constituent
IQcriteria. Similarly, Bobrowski et al. [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ] also use questionnaires
to assess information quality. Both methods, although being
quantitative, are based on subjective, users’ inputs.
          </p>
          <p>In certain circumstances, an automatic assessment of
information quality could be feasible with available metadata
information and reliable auto-assessment techniques. Depending
on the application and users’ needs, this automatic approach
could be more desirable than a subjective, manual approach.</p>
          <p>
            Motro and Rakov propose automated assessment methods for
evaluating the soundness and completeness of data sources [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ].
          </p>
          <p>
            Gruser et al. present a prediction algorithm to learn and predict
response times of Web data sources [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ]. Ballou et al. [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ]
introduce a quantitative assessment method for measuring and
calculating the timeliness of data. Their formulas laid the
foundation for our work and will be detailed in Section V.
          </p>
          <p>
            Ballou et al.’s method for assessing the timeliness is partially
based on the provenance information of a data item, e.g., the
time when the data was obtained. Provenance metadata has
been used to evaluate other IQ-criteria, such as the
believability and trustworthiness. Wong et al. [
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ] use information about
the types of services or data involved in a data creation process
to validate the believability of derived data items. Golbeck and
Mannes [
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ] use provenance of user-contributed annotations
to compute trust values and to recommend how much a user
should trust others. This method does not compute the
trustworthiness of the annotations themselves using provenance.
          </p>
          <p>III. A MODEL FOR WEB DATA PROVENANCE</p>
          <p>
            Our provenance-based IQ assessment method is based on
our model for Web data provenance. We give a brief
introduction to the model in this section. A detailed discussion of the
model can be found in [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ].
          </p>
          <p>
            Traditional provenance research usually addresses the
creation of data. While many approaches exist that represent
provenance [
            <xref ref-type="bibr" rid="ref11 ref26 ref39">11</xref>
            ], [
            <xref ref-type="bibr" rid="ref12 ref27 ref40">12</xref>
            ], none of these explicitly addresses the
characteristics of data that was not only created but also
retrieved over the Web. Provenance of data from the Web
comprises information about the entities that published the
data and that made it accessible on the Web, information
not required in the context of self-contained systems such
as a DBMS or a workflow management system. Hence, our
model for Web data provenance comprises two dimensions:
data creation and data access.
          </p>
          <p>Our model identifies types of so called provenance elements
and the relationships between these types. The provenance
elements represent pieces of provenance information; such an
element might be the actual creator of a specific data item what
makes this element an instance of the ’data creator’ type. The
types are classified in three categories: actors, executions, and
artifacts. An actor usually performed the execution of an action
or a process which – in most cases – yielded an artifact such
as a specific data item. An execution might have included the
use of artifacts which, in turn, might be the result of another
execution. Furthermore, direct relationships between artifacts
as well as between actors may exist. For instance, a specific
company was responsible for its Web server. All other element
types are specializations of actors, executions, and artifacts.</p>
          <p>The central type in the data creation dimension is the data
creation execution by which a data item was created. A data
creation was performed by a data creator. For the creation
source data and creation guidelines could have been used by
the data creator.</p>
          <p>The data access dimension centers around data access
executions. Data accessors perform data access executions to
retrieve data items contained in documents from a provider
on the Web. To enable a detailed representation of providers
the model distinguishes data providing services that process
data access requests and send the documents over the Web,
data publishers who use data providing services to publish
their data, and service providers who operate data providing
services. Furthermore, the model represents the execution of
integrity verifications of artifacts and the results thereof.</p>
          <p>Based on the element types and their relationships identified
by our provenance model it is possible to represent provenance
of data items from the Web by, so called, provenance graphs.</p>
          <p>The nodes in these graphs are the provenance elements; the
edges correspond to the relationships between the element
types of adjacent elements; edges are labeled with the
relationship name. Notice, to allow for a wide range of applications of
our model we do not prescribe a specific granularity by which
provenance information has to be described in provenance
graphs. For instance, a data item could be a whole linked
dataset as well as a single RDF statement, depending on the
granularity required for the use case at hand. A data item
could have been created by the use of creation guidelines
and source data which also have provenance. This provenance
should be represented by subgraphs in the provenance graph
of the created data item.</p>
          <p>Formally, we represent a provenance graph as a tuple
(PE; R; type; attr) where
² PE denotes the set of provenance elements in the graph,
² R µ PE £ PE £ RN denotes the labeled edges in
the graph where RN is the set of relationship names
as introduced by our provenance model,
² type : PE ! P(T ) is a mapping that associates each
provenance element with its types where T is the set of
element types as introduced by our provenance model,
² attr : PE ! P(A £ V ) is a mapping that associates
each provenance element with additional properties
represented by attribute-value pairs where A is a set of
available attributes and V is a set of values.</p>
          <p>We do not specify the sets A and V any further because the
available attributes, possible values, and the meaning of these
depend on the use case. However, we introduce an abbreviated
notation to refer to the target of an edge in a provenance graph:</p>
          <p>rn
if (pe1; pe2; rn) 2 R we write pe1 ¡! ± = pe2.</p>
          <p>IV. PROVENANCE-BASED QUALITY ASSESSMENT</p>
          <p>The method is based on provenance graphs represented
using our provenance model. This approach should be
regarded as a blueprint for the development of actual assessment
methods that address specific scenarios and focus on specific
quality criteria. This section introduces the general method and
discusses questions that must be addressed when applying this
method for a specific quality criterion.
in a provenance graph on the particular quality of the assessed
data item. We divide the assessment procedure into three steps:
1) Generate a provenance graph for the data item;
2) Annotate the provenance graph with impact values;
3) Calculate an IQ-score for the data item from the
anno</p>
          <p>tated provenance graph.</p>
          <p>In order to use the provenance of a data item for automated
quality assessment this provenance has to be represented in
the assessment system. We propose to use provenance graphs
as introduced in Section III for this purpose. Hence, the first
step of an assessment procedure must be the generation of
such a graph for the data item that is to be assessed, i.e.,
the considered data item. This step comprises collecting the
necessary provenance information about the data item.</p>
          <p>Some, if not all, of the provenance elements might have
had an influence on certain qualities of the assessed data item.</p>
          <p>Some of these influences are known to us; others are possible
or cannot be ruled out. Both types of influences, known as
well as possible influences, have an impact on our assessment
of the qualities. We propose to represent this impact by impact
values associated with the corresponding provenance elements.</p>
          <p>For instance, the possibility of manipulating published data by
a service provider may affect the believability and the assumed
accuracy of the data; an impact value for a service provider
could represent the provider’s manipulation probability. An
example for a known influence is the execution time of a data
creation which has an impact on the timeliness assumed for
the created data item. Notice, there can be different kinds of
impact values for different types of provenance elements.</p>
          <p>The second step of our assessment procedure comprises
determining these impact values; the system adds annotations
to the provenance graph generated from step 1, associating
elements in the provenance graph with estimated impact
values. Formally, an annotated provenance graph is a pair
(pg; ann) where pg = (PE; R; type; attr) is a provenance
graph and ann : PE ! P(I) is a mapping that associates a
provenance element with a set of impact values; each impact
value (n; v) 2 I has a name n and the actual value v. For
(n; v) 2 ann(pe) we write n£pe¤ = v.</p>
          <p>In the final step the system executes a function to calculate a
value that represents the information quality of the considered
data item using the annotated provenance graph from step 2.</p>
          <p>B. Designing Actual Assessment Methods</p>
          <p>To apply our assessment approach one must first develop
the presented method into an actual assessment method that is
tailored for the quality criterion of interest. In the following
we discuss design decisions that must be considered at each
step and we specify the questions that must be addressed.</p>
          <p>The most fundamental question that must be answered in
the beginning is: For which quality criterion do we want to
apply the method? This decision influences every aspect of an
application of our approach. In the remainder we consecutively
focus on the three steps of our assessment method. However,
the design decisions for the three steps partly depend on each
other. For this reason, designing an actual assessment method
should be an iterative process.</p>
          <p>
            Considering step 1 of the assessment method it is necessary
to ensure the generation of a provenance graph that is suitable
for the assessment. To specify suitability in the given context
one has to ask: What types of provenance elements are
necessary to determine the considered information quality
and what level of detail (i.e. granularity) is necessary to
describe the provenance elements in the application scenario?
To answer these questions we propose to study the literature
that deals with the considered quality criterion. A good starting
point is Pipino et al. [
            <xref ref-type="bibr" rid="ref13 ref28 ref41">13</xref>
            ]. Based on the answers to the
above two questions, the procedure for generating provenance
graphs can be developed. Defining this procedure requires
to address the question: Where and how do we get the
provenance information to generate the provenance graph for
a data item? Basically, there are two complementary options
to obtain provenance information: some pieces of provenance
information can be recorded by the system; for other pieces
the system relies on meta-data provided by third parties.
          </p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ] we discuss these options. Furthermore, we are working
on the Provenance Vocabulary1 to enable the publication of
provenance-related metadata in the Web of data.
          </p>
          <p>The fundamental questions that have to be answered for
the implementation of step 2 are: How might each type
of provenance element influence the quality of interest and
what kind of impact values are necessary for the
application scenario? The answers to these questions substantially
depend on the considered quality criterion as well as on the
assessment function used in step 3. Notice, impact values
need not necessarily be numerical; they could also be of
a more abstract nature such as the simple weighting “high
impact”. After specifying the impact values it is necessary
to address the question: How do we determine the impact
values or where do we get them from? Some of the impact
values might already be part of the provenance information
such as the creation time in the aforementioned timeliness
example; others might be calculated based on the provenance
graph. Certain kinds of impact values could also be determined
based on user input. Another possibility is to estimate impact
values by taking background knowledge about information
consumers or providers into account. For instance, a data
creator’s credibility which influences believability assessments
could be determined based on former experiences as well as
on recommendations from other users.</p>
          <p>The main questions regarding step 3 of the assessment
method are: How can we represent the considered
information quality by a value and what function do we use to
calculate such a value from the annotated provenance graph?
Again, answering these questions fundamentally depends on
the quality criterion. The calculated value could be a single
number in a specific interval; but it could also be a vector
of numbers or an element of a set of discrete values. In any
case, it is important to specify what such a value means. The
1http://purl.org/net/provenance/
definition of the applied function depends on the impact values
introduced at step 2. For this reason, we recommend to develop
the function together with specification of the impact values.</p>
          <p>For the development of this function it is important to bear in
mind that the results of steps 1 and 2 cannot be guaranteed
to be complete in many cases; the provenance graph could
be fragmentary or some annotations could be missing due to
the lack of certain information required during steps 1 and 2.</p>
          <p>Hence, the function for step 3 must not assume to operate on
an ideal annotated provenance graph but it must be able to
deal with incompleteness.</p>
          <p>V. PROVENANCE-BASED ASSESSMENT OF TIMELINESS</p>
          <p>In this section we exemplarily apply our general assessment
approach to assess the timeliness of data from the Web. We
first give a brief introduction to timeliness and how it can be
calculated; we then illustrate the design and the execution of
the three steps of assessment; finally, we propose a way to
deal with incomplete provenance information.</p>
          <p>A. Representing and Calculating Timeliness</p>
          <p>
            Timeliness is an intrinsic IQ criterion [
            <xref ref-type="bibr" rid="ref14 ref42">14</xref>
            ] that is often
referred to as a task-dependend up-to-dateness of a data
item [
            <xref ref-type="bibr" rid="ref13 ref28 ref41">13</xref>
            ], [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Ballou et al. represent timeliness by an
absolute measure on a continuous scale from 0 to 1 where
data with 1 “meet the most strict timeliness standard” [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] and
0 is unacceptable. This timeliness measure can be calculated
using the following formula [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ]:
          </p>
          <p>
            T imeliness = (max(1 ¡ Currency=V olatility; 0))s (1)
In this formula, Volatility is “the length of time the data
remains valid” which is analogous to the shelf life of perishable
products [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ]; Currency is “the age of the data when it is
delivered to the user” [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] which can be calculated according
to [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] by the following formula:
          </p>
          <p>Currency = Delivery T ime ¡ Input T ime + Age (2)
where Delivery Time is the time when the data was delivered
to the user; Input Time is the time when the data was entered
in the system; and Age is how old the data was at Input Time.</p>
          <p>
            The exponent s in (1) is a parameter that controls the
sensitivity of Timeliness to the Currency-Volatility ratio. The
ratio should be large (e.g., s = 2) for highly volatile data and
be small (e.g., s = 0:5) for long shelf life data [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ].
          </p>
          <p>
            Note that in Ballou et al.’s paper [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] the timeliness formula
is defined in a closed “information manufacturing system”,
which processes primitive data units from outside. Hence, the
semantics of Age, Input Time, and Delivery Time might be
different w.r.t. to an open-world system, like the Web.
          </p>
          <p>On the Web, we do not have primitive data from the
outside. Instead, we have unprocessed data and derived data.</p>
          <p>Unprocessed data are data items for which the creation did
not depend on other data items; i.e., no source data was used
for their creation. Derived data, in contrast, was derived from
other data items.</p>
          <p>
            1) Timeliness of Unprocessed Data: For an unprocessed
data item, its Age is 0 because it did not exist before; its
Input Time is the time when its creation was finished; and
its Delivery Time should be “now”, i.e., the time when the
timeliness of the considered data item is assessed. This means
that the Currency values for unprocessed data items differ
only by their creation time. To calculate the Timeliness of
unprocessed data items using formula (1) we also need the
Volatility. We could speak of volatility exclusively as shelf
life, as Ballou et al. [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] do. Alternatively, we could speak of
expiry time and adapt the formula from the Sampaio et al. [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]:
          </p>
          <p>V olatility = Expiry T ime ¡ Input T ime + Age</p>
          <p>
            (3)
2) Timeliness of Derived Data: Ballou et al. compute
the timeliness of data outputs from a processing block as a
weighted average value [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ]. In our method, for a derived data
item, if it is caused by only one source data item, then it
has the same timeliness value as the source data item; if it is
caused by multiple source data items, then its timeliness value
should be a weighted average of the timeliness values of the
source data items.
          </p>
          <p>B. Constructing the Provenance Graph</p>
          <p>We adopt the calculation approach outlined in the previous
section to apply our provenance-based assessment method for
the determination of timeliness. The first step is to generate a
provenance graph for the considered data item. For this work
we assume the availability of all provenance information.</p>
          <p>Example 1: We demonstrate the method applied to
assessing timeliness of temperature measurements taken by a sensor.</p>
          <p>These measurements are unprocessed data items. They are
taken every 1 hour, and they are stored in a Web-accessible
storage device immediately. A system accesses these
measurement from the storage device for further processing; in order
to process the measures the system evaluates their timeliness.</p>
          <p>We represent the provenance of a specific measure by
a provenance graph pe = (PE; R; type; attr) which is
illustrated in Figure 1. PE contains the measure msr, the
sensor sens, the data creation cExc that produced msr, the
storage device stor, the system sys, the data access aExc,
and the document doc with which msr was retrieved during
aExc. Given msr was taken at 10:00 and doc was retrieved
at 10:13 it holds attr(cExc) = f(execT ime; 10:00)g and
attr(aExc) = f(execT ime; 10:13)g. ¤
Fig. 1. Provenance graph representing our running example (cf. Example 1).</p>
          <p>C. Adding Impact Values</p>
          <p>The second step of the assessment method includes the
annotation of the provenance graph with impact values. In order
to design this step we study the relevance of different pieces
of provenance information for the timeliness assessment. In
particular, we discuss the relation of the provenance element
types introduced by our provenance model to the calculation
approach outlined before (cf. Section V-A).</p>
          <p>Data creation executions have a direct influence on the
timeliness assessment. As discussed before, the creation time
of unprocessed data items corresponds to the input time in
formula (2). Hence, we annotate each data creation element
that is not associated with source data with a creation time
impact value. It is not necessary to explicitly determine these
kind of impact values because they are already represented
in the provenance graph as an attribute of the data creation
elements.</p>
          <p>
            Data creations that yield a derived data item have an
influence on the timeliness of this item if multiple source data
items were used (cf. Section V-A2). We reflect this influence
by another impact value: each data creation element that takes
multiple source data items as inputs is annotated with a weights
impact value. This impact value represents the weights that can
be used to calculate the weighted average of the timeliness
values of the source data items. Ballou et al. write: “The
weights could reflect the size of the data units that are merged,
their importance or some combination of attributes.” [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ]. In
this paper, we leave the choice of the weights to applications
adopting our assessment method, because this choice should
be based on actual needs from their information consumers.
          </p>
          <p>Creation guidelines may have an impact on IQ criteria such
as accuracy and reliability. However, creation guidelines have
no influence on the timeliness of the created data.</p>
          <p>Source data may have an impact on the timeliness
assessment. According to the calculation approach the timeliness
of each derived data item can be ascribed to a combination
of the timeliness values of unprocessed data items. Hence,
in the ideal case of a complete provenance graph only the
unprocessed data items have a direct influence on the
timeliness of the considered data item. While their currency can
be determined with the aforementioned creation time impact
values we also need the volatility to calculate their timeliness
using formula (1). To enable the calculation of their volatility
using formula (3) we annotate each unprocessed data item
with an expiry time impact value. We assume these impact
values can be determined based on the input from users who
configure a default expiry time for data from specific data
creators or for data with a specific content.</p>
          <p>Data creators have an influence on the volatility of
unprocessed data items as discussed before. The previously
mentioned strategy for determining the expiry time impact
values reflects this influence.</p>
          <p>Example 2: We annotate the provenance graph pg from
Example 1 with impact values as follows. The data creation
cExc is not associated with source data; hence, it has to
be annotated with a creation time impact value that refers
to its execT ime attribute: ann(cExc) = f(crtT; 10:00)g.</p>
          <p>Furthermore, pg contains the unprocessed data item msr
which has to be annotated with an expiry time impact value.</p>
          <p>It is possible to determine this value based on the information
that sens takes the measures every hour. Hence, it holds:
ann(msr) = f(expT; 11:00)g. The other elements in pg do
not have an influence on our timeliness assessment. ¤</p>
          <p>In the ideal case of a complete provenance graph the
elements that belong to the data access dimension can be
ignored for the timeliness assessment. However, in the likely
case that information about the creation of a (source) data
item is missing or that it is impossible to determine one of
the impact values introduced so far. Hence, we propose to
consider the data access related elements as fall-back. For
these data items, the Input Time in formulas (2) and (3) is
the access time associated with the corresponding data access
execution. Furthermore, the Age for these items is probably
larger than 0, assuming that they, or the data from which they
were derived, were not created at the time of the access. We
annotate each of these data items with a timeliness impact
value that represents a timeliness value estimated for them.</p>
          <p>This value could be estimated based on different data access
related provenance elements. For instance, knowing when a
data publisher updates her data may, in combination with the
access time, be an indicator for the Age. The Expiry Time
might be estimated based on information about the update
frequency of the data publisher. After all, it must be realized
that the timeliness impact values can only be estimates at best.</p>
          <p>Based on the impact values in the annotated provenance
graph it is possible to calculate timeliness by adopting
formulas (1) to (3). The recursive function t in Figure 2 implements
this idea: t incorporates (1) to (3) to calculate timeliness
at step 3 of our assessment method. For a data item with
incomplete provenance information t returns the timeliness
impact value that is annotated to this item (cf. first case in the
equation in Figure 2). For unprocessed data items t applies the
formulas (1) to (3) using the corresponding creation time and
expiry time impact values (cf. second case in the equation). For
derived data items that were created with a single source data
item t returns the timeliness value that is recursively calculated
for the source data item (cf. third case). Finally, for other
derived data items t uses the weights impact value of the
corresponding data creation element to calculate a weighted
average of the recursively calculated timeliness values of the
source items (cf. fourth case).</p>
          <p>Example 3: Based on the annotated provenance graph
(pg; ann) from Example 2 it is possible to calculate the
timeliness of msr. Since msr cre¡at!edBy ± ¡us!ed ± = ? it holds:</p>
          <p>µ now ¡ crtT£cExc¤ ´¶s
t(msr) = max³1 ¡ expT£msr¤ ¡ crtT£cExc¤ ; 0
µ
max³1 ¡ 1n1o:w00¡¡ 1100::0000 ; 0
´¶s
8&gt;timeliness£d¤
t(d) = &gt;&gt;&gt;&lt;&gt;&gt;&gt;&gt;&gt;µmax³1 ¡ exnpTow£d¡¤ ¡crtcTrt£Td£cdrec¡arte!e¡adt!eBdyB±y¤±¤ ; 0
&gt;t(ds)
&gt;&gt;:&gt;&gt;&gt;&gt;&gt;&gt; PPds2ddsc2reda¡ctr!eeda¡Bt!eyd±Bu¡ys!±edu¡±s!ewd±eiwghetig£hdtc£rde¡actr!eed¡aBt!eyd±B¤yd±s¤¢dt(sds)
if d cre¡at!edBy
if d cre¡at!edBy</p>
          <p>used
± ¡! ± = ?,
if d cre¡at!edBy
E. Dealing with Incomplete Provenance Information</p>
          <p>Our timeliness assessment method deals with incomplete
information by using alternative impact values; furthermore,
certain impact values can only be determined by estimation.</p>
          <p>
            Thus, the calculated timeliness value becomes an
approximation rather than an exact assessment. To make the degree of
inexactness explicit we propose to associate the calculated
timeliness value with a certainty value. This additional value
represents the certainty of whether the calculated timeliness
is correct. We suggest to represent certainty with a value in
the interval [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">0,1</xref>
            ] where 1 represents absolute certainty, i.e. no
doubt, and 0 represents absolute unvertainty, i.e. the calculated
timeliness value is useless. In the following we outline an
approach to calculate the certainty value during the execution
of our assessment method.
          </p>
          <p>We assume a value, initialized to 1, that is accessible
throughout the whole assessment procedure. During the
execution of steps 1 and 2 this value is incrementally decreased
whenever i) a part of the provenance graph cannot be generated
appropriately due to missing provenance information and
whenever ii) an impact value is estimated. With each decrease
the value should be reduced by a certain percentage where the
extent of this percentage may differ for different decreases.</p>
          <p>Identifying appropriate extents is subject to further research.</p>
          <p>For instance, in the case of missing provenance information
the importance of this information to the assessment may
affect the amount of reduction. Decreases due to impact value
estimation may differ depending on the reliability of the
applied estimation strategy. However, after the completion of
step 2 the decreased value represents the reliability of the
annotated provenance graph. Since the calculation in step 3 is
solely based on this graph the value also represents a certainty
regarding the correctness of the calculated timeliness value.</p>
          <p>In this paper we propose a provenance model for Web
data provenance and an assessment method for evaluating the
quality of data on the Web using provenance graphs based on
this model. Our provenance model introduces a new dimension
of provenance information, i.e. the provenance of data access,
to the existing provenance research. We are gathering feedback
to our model from different communities and we foresee
continuing development of our provenance vocabulary driven by
well-defined use cases. In this paper, we demonstrate assessing
the timeliness of data on the Web using our method. We plan
to implement this method as part of a Web data publication
framework in the near future and to apply this method to the
assessment of other quality criteria, such as accuracy. Our
method should be generic enough to incorporate subjective
quality indicators derived from Web data provenance. Existing
work on evaluating and filtering subjective quality indicators
will be considered and appropriately applied.</p>
          <p>VII. ACKNOWLEDGEMENT</p>
          <p>Part of this work is funded by EPSRC Postdoctoral
Fellowship to Dr Jun Zhao (Grant EP/G049327/1).</p>
          <p>Sudha Ram, Jun Liu
430J McClelland Hall, Department of MIS, Eller School of Management,</p>
          <p>University of Arizona, Tucson, AZ 85721
Abstract: Data Provenance refers to the “origin”, “lineage”, and “source” of data. In this work, we
examine provenance from a semantics perspective and present the W7 model, an ontological
model of data provenance. In the W7 model, provenance is conceptualized as a combination of
seven interconnected elements including “what”, “when”, “where”, “how”, “who”, “which” and
“why”. Each of these components may be used to track events that affect data during its lifetime.
The W7 model is general and extensible enough to capture provenance semantics for data in
different domains. Using the example of the Wikipedia, we illustrate how the W7 model can
capture domain or application specific provenance.
1. Introduction</p>
          <p>
            Data provenance is an overloaded term that has been defined differently by different people.
A recent survey [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ] reviews the various definitions of provenance in literature. Some researchers
define provenance as the origin or source of data [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ]. As an example, Buneman puts forth two
forms of data provenance, i.e., “why” provenance and “where” provenance [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ]. Both “why” and
“where” provenance deal with tracing the source from which the data came. Others view
provenance as metadata recording the process of experimental workflows, annotations and notes
about scientific experiments [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ]. In research such as [
            <xref ref-type="bibr" rid="ref20 ref21 ref33 ref34 ref5 ref6">5, 6</xref>
            ], the data generating processes in the
form of workflows are the primary entities for which provenance are collected. Due to the lack of
consensus on the semantics or meaning of provenance, current efforts on capturing data
provenance have focused on only one or two aspects of provenance while ignoring others. As a
result, the provenance is often incomplete and cannot be shared across applications. In response to
this challenge, we attempt to formally define the semantics of provenance that can be agreed upon
by people from different domains. To our knowledge, our research is the first of its kind to explore
the “semantics” of provenance.
          </p>
          <p>In this research, we define the W7 model, an ontology that clarifies the semantics of data
provenance. The W7 model represents data provenance as a combination of seven interconnected
elements including, “what”, “when”, “where”, “how”, “who”, “which”, and “why”. The W7
model is general and extensible enough to capture provenance semantics for data in different
domains. Using examples in Wikipedia, we illustrate how the W7 model can help define, capture,
and use data provenance.
2. Use cases and competency questions</p>
          <p>
            Following the formal methodology for ontology development proposed in [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ], we started by
collecting use cases from different domains. Given the set of use cases, a set of competency
questions were identified. The competency questions are those that our ontology must be
“competent” to answer. Our use cases and their corresponding competency questions describe a
set of requirements the ontology must satisfy. They helped us understand the intended informal
semantics of the concepts and relations to be included in the ontology. We gathered 188 use cases
from users in various domains including biology, businesses (such as the manufacturing, defense,
and pharmaceutical organizations), and physical sciences. We present some of the use cases as
well as the competency questions.
          </p>
          <p>Use Case 1: In a missile manufacturing company, an engineer performs a material test to
measure the transverse tension fatigue life of a particular material “S2/8552 glass-epoxy”. She
then publishes the results and test procedure online. Another engineer discovers the published
results years later. Before reusing the results, he verifies whether the results are valid by repeating
the test procedure, in the test environment that was described.</p>
          <p>Competency Questions: A replication of a material test requires recording provenance that is
competent in answering the following questions: 1) how was the material data created, and 2)
how was the material test conducted in terms of the test procedure, test environment, sample
condition, temperature, etc.</p>
          <p>Use Case 2: To organize the huge amounts of bio images being generated, the bio-computing
lab stores bio images on different storage devices based on their “value”. For instance, images
created by a graduate student doing an internship or images that have not been accessed for 5 five
years are deemed less valuable.</p>
          <p>Competency Questions: Use case 2 demonstrates the desire for recording “who created the bio
images” and “when the bio images have been accessed”.</p>
          <p>Use Case 3: An engineer obtains the composite material “Cycom 381/S2 Uni-glass” and
performs a test to measure the tensile strength of the composite. Another engineer in a different lab
later performs a test on the same material, again provided by the same vendor. She compares the
two results and notices significant differences. She needs to assess whether the differences are
because of different test methods or different instruments used in the test.</p>
          <p>Competency Questions: To determine the quality or reliability of material test results, it is
necessary to provide answers to the following two questions: 1) how were the results generated,
and 2) which instrument was used in created the data and what were its parameter settings?</p>
          <p>Use Case 4: A genetics researcher records in his lab notebook the reason for using specific data
records in an in silico experiment, e.g., “I chose this restriction enzyme as it cut only three times
within 200 base pairs of the SNP”.</p>
          <p>Competency Questions: The relevant question is why certain records data were used.</p>
          <p>Use Case 5: A scientist, S, is interested in rainfall and water levels in neighboring rivers and
lakes for a part of the Sierra Nevada mountain range in California. He is trying to acquire sensor
signals captured in Southern California.</p>
          <p>Competency Questions: Use Case 5 indicates the use of data provenance for data discovery. In
this use case, the question the scientist needs to answer is “where was the data measured”, so
that he can locate the appropriate data.
of interest is the event that affects each piece of data during its life cycle from birth (creation) to
death (deletion or archiving). While many of the use cases point out the need to understand the data
creation related provenance, in many cases, other life cycle events are even more useful. For
example, Use case 4 requires us to record the why associated the use of data. Also, for some
domains, the most critical provenance events are changes in the ownership of the data and
archiving of data. As a result, our provenance ontology should be competent to answer the
question of “what”, i.e., events that affect the data. Thus our ontology is anchored around the
“what” or the life cycle events affecting the data.
3. Conceptualization of provenance based on Bunge’s theory</p>
          <p>
            The use case analysis helped us identify the basic components of data provenance including
the 7 Ws (what, how, when, where, who, which, and why). We then adopt Bunge’s ontology [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ] to
define these components and identify the relationships between them.
          </p>
          <p>State, event and history: The elementary notion of Bunge’s ontology is a thing. The state of a
thing is the set of property values of the thing at a given time. Bunge’s ontology postulates that
everything changes, and every change is a change of state of things, that is the change of properties
of things. A change of state is termed an event. It follows that an event occurs when a thing
acquires or loses a property or changes the value of a property. Based on the constructs of event
and state, Bunge defines the concept of history: History of a thing is a sequence of events that
happens to the thing.</p>
          <p>Action, agent, time and space: These are constructs related to events. An event on a thing
occurs when it is acted upon by another thing, which is often a human or a software agent. An
event happens in time and space.</p>
          <p>
            Data are also “things”. Bunge’s theory regarding history and events is a perfect match for
defining data provenance and its semantics since data provenance is often referred to as the
pedigree or history of data. More importantly, our use case analysis indicates that data provenance
is really all about various events that affect data during its life cycle. Thus, the constructs in
Bunge’s ontology including history, event, action, etc. lay a theoretical foundation for defining
provenance and its components. We define provenance and the 7 Ws and develop connections
between them using the constructs in Bunge’s ontology.
by Sowa [
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ], which has been widely as a language for ontology. The boxes in CGs represent
concepts and the bubbles are the relationships. As shown in Figure 1, what, i.e., events, is the
anchor of our model. In essence, data provenance includes events and various information
(including who, how, when, where, which and why) associated with and describing the events.
          </p>
          <p>
            Tables 2 summarizes the definition of each of the 7 Ws and shows the correspondence
between the Ws and Bunge’s ontology concepts. For interested readers, please refer to our
previous research [
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ] for a more detailed discussion of each of the 7 Ws.
An event (i.e. change of state) that happens to data during its
life time
An action leading to the events. An event may occur, when it is
acted upon by another thing, which is often a human or a
software agent
Time or more accurately the duration of an event
Locations associated with an event
Agents including persons or organizations involved in an event
Instruments or software programs used in the event
          </p>
          <p>Reasons that explain why an event occurred
MATERIAL-TEST(X) is</p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref11 ref26 ref39">11</xref>
            ], Simmhan et al argue that due to the diverse needs across disciplines, it is challenging
to develop a standard model for capturing provenance. To address this concern, we developed
the W7 model as a generic ontology of provenance that captures the semantics of data
provenance and can thus be applied to various domains. However, for our model to be of any
practical use, it must be easily adaptable to address domain specific provenance needs. We use
the “type definition” mechanism developed by Sowa [
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ] to provide the domain specific
extension of the W7 model. The CG formalism enables to explicitly define the semantics of a
concept via a type definition. As an example, in the domain of design and manufacturing, how
often refers to a material test, using which material data is created. The specification of the test
and the material sample used in the test are
critical provenance that needs to be
HOW: X ATTRIBUTE TEST_SPECIFICATION captured. We thus formally define
“material test”, as shown in Figure 2.
          </p>
          <p>OBJECT MATERIAL_SAMPLE The CG in Figure 2 defines</p>
          <p>MATERIAL-TEST as a subtype of HOW.</p>
          <p>Figure 2. Type definition of the concept “material-test” A material test is carried out upon material
samples and it has an attribute “test
specification”. Type definitions represent the semantics and necessary attributes of a concept that
have been agreed upon by people in a domain and therefore can be used to provide domain
specific extensions of the W7 model.
5. Application of the W7 model – the Wikipedia example</p>
          <p>We use Wikipedia as an example to illustrate the application of the W7 model to harvest and
structure data provenance. Table 3 summarizes the application of the W7 model in Wikipedia.
What or events that affect a Wikipedia page are primarily creation, modification and destruction of
the page. Other events may include “quality assessment” (e.g., a page may be designated as a
featured page) or “change in access rights” (e.g., a page may be locked to prevent editing by
anonymous editors). The “How” construct for a page modification event may be sentence
insertion/update/deletion, link insertion/update/deletion, reference insertion/update/deletion, and
reverts (see Table 3). These are actions made by editors that may lead to the modification of a
page. Who represents the editors of a Wikipedia page. The Wikipedia distinguishes between three
types of users: 1) administrators, 2) registered editors, and 3) anonymous editors. When refers to
the time an event occurs. In the Wikipedia, a timestamp is automatically recorded in the database
whenever an event occurs. Where in the Wikipedia represents the IP address from which an editor
makes a change. Which in Wikipedia refers to bots, i.e., software that automatically edits
Wikipedia pages. The Wikipedia allows an editor to input why, i.e., justifications for a change, in
the “comment” field.</p>
          <p>Creation, modification, destruction, quality assessment, access rights change
Sentence insertion/update/deletion, link insertion /update/deletion,
insertion/update/ deletion, revert (reverting the article to a previous version)
Administrators, registered editors, and anonymous editors
Timestamps of the events
IP address of the editor
Software used in editing the page
User comments
reference</p>
          <p>
            Harvesting data provenance in the Wikipedia requires little human effort. The Mediawiki
software used by the Wikipedia is set to automatically capture the what, who, when, where, and
which. The how provenance can be derived by comparing two versions of a page using the diff
function. Only the why provenance demands manual input. Applying the W7 model to the
Wikipedia enables us to harvest provenance of the Wikipedia pages in a structured and
comprehensive way. Data provenance in the Wikipedia has widely been used to automatically
assess the quality of Wikipedia pages. As an example, [
            <xref ref-type="bibr" rid="ref12 ref27 ref40">12</xref>
            ] suggests metrics such as “rigor” (total
number of changes made for the article) and “diversity” (total number of unique editors for the
article) as measures of quality. In our recent study [
            <xref ref-type="bibr" rid="ref13 ref28 ref41">13</xref>
            ], we track every action by an editor that
affects the life of a Wikipedia article from its creation to the present time. We classify roles by
mining the provenance, i.e., various actions carried out by a contributor on an article. We then
further identify collaboration patterns based on provenance in terms of who does what. The
collaboration patterns derived from data provenance have been proved to be correlated with data
quality of Wikipedia pages.
6. Conclusion and Future Research
          </p>
          <p>In conclusion, the focus of our research is on investigating the semantics of provenance. We
have developed a generic provenance model, i.e., the W7 model, to represent these semantics. We
identify various elements of provenance such as “what”, “where”, “when”, “who”, “how”,
“which” and “why” and present the semantics of each of these elements. Our W7 model is inspired
by theoretical work such as Bunge’s ontology as well as our empirical analysis of provenance use
in many application domains. It is a generic model of data provenance and is intended to be easily
adaptable to represent domain specific provenance requirements. Using the Wikipedia as an
example application, we illustrate the use of the W7 model to harvest and track data provenance.
We are continuing to use this model to harvest and track provenance in a variety of other
application domains.
References
Lister Hill National Center for Biomedical Communications</p>
          <p>US National Library of Medicine</p>
          <p>Bethesda, Maryland, USA</p>
          <p>olivier@nlm.nih.gov
Abstract—We present a use case for provenance information in
biomedical knowledge repositories designed to support
applications including information retrieval and knowledge
discovery. We show that information about the knowledge
sources from which statements are extracted must be recorded in
addition to the statement themselves in order to support these
applications. While the storage and processing of statements has
been greatly facilitated by the emergence of powerful triple stores
and the standardization of query languages (e.g., SPARQL),
recording and exploiting provenance information (i.e., statements
about statements) remains challenging.</p>
          <p>Keywords-provenance information; use case; biomedical
knowledge repository</p>
          <p>INTRODUCTION</p>
          <p>Biomedical knowledge is produced and consumed by
biomedical researchers and health care practitioners. The
biomedical literature (textbooks and journal articles) represents
the main source of unstructured biomedical knowledge.</p>
          <p>
            Knowledge bases (e.g., model organism databases annotated to
the Gene Ontology) result from the curation of the primary
literature in an attempt to make knowledge more accessible and
actionable. Finally, ontologies represent the ultimate form of
computationable knowledge, but are often limited in scope and
tend to focus on definitional, as opposed to assertional
knowledge. Rich sets of metadata have been defined and are
collected along with the primary data, using standards such as
the Dublin Core for the literature [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ] and MIAME for gene
expression data [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ].
          </p>
          <p>
            Attempts to make knowledge accessible to agents in
addition to humans have focused on the extraction of
knowledge from unstructured sources, as well as the
interoperability of structured knowledge sources. Text mining
techniques are used to extract “predications” (i.e., statements)
from text, for example in the Semantic Medline project [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ].
          </p>
          <p>
            Metadata are often stored in an ad hoc format in order to help
associate predications with the articles from which they have
been extracted. The Linked Data initiative [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ] promotes the use
of RDF (the Resource Description Framework) [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ] to link
biomedical datasets, with a strong emphasis on shared URIs
(Uniform Resource Identifiers) in order to relate concepts
sharing the same identifiers across datasets. In most cases,
however, such repositories of linked data have little metadata,
in part because simple RDF representations make it difficult to
represent statements about statements. These examples
illustrate the difficulty of representing – let along comput ing
with – provenance information in biomedical knowledge
repositories.
          </p>
          <p>
            One such repository is being created as part of a research
project at the National Library of Medicine [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ]. It includes
knowledge extracted from Medline abstracts by text mining
tools, structured knowledge derived from existing knowledge
bases (e.g., NCBI’s Entrez system [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ]) and terminological
knowledge from the Unified Medical Language System [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ]. In
this project, we are also interested in recording and processing
information about the statements (e.g., location in the
information space and time annotations), in order to support
applications including enhanced information retrieval,
multidocument summarization, question answering and knowledge
discovery.
          </p>
          <p>In this paper, we briefly examine the types of metadata
required in the context of our biomedical knowledge
repository. In other words, we look at provenance information
through the use case of this knowledge repository and discuss
some of the issues encountered along the way and challenges
ahead.</p>
          <p>
            The four applications our repository has been designed to
support require various types of provenance information [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ].
          </p>
          <p>Common to all applications is the requirement that the origin of
any statement be identifiable (e.g., from which knowledge
sources was it extracted?, using which extraction techniques, if
any?) Because biomedical knowledge evolves over time, it is
also indispensable that some time annotation be associated with
each statement (e.g., date of publication of the article from
which the statement was extracted, date when the statement
was curated in a given knowledge base, or date when a given
ontology was last revised.) When available, the degree of
confidence associated with a given statement should also be
recorded. Confidence can be indicated by the tools used for the
production of the statements (e.g., text mining tools) or
approximated through frequency information. In the following
discussion, the association between types of applications and
types of provenance information is somewhat arbitrary and
presented essentially for illustrative purposes.</p>
          <p>The enhanced information retrieval envisioned goes beyond
keyword or concept searches and supports searches based on
relations. For example, finding all the documents in which the
statement “IL-13 inhibits COX-2” is found. Like with
traditional search engines, there is a need for associating a
document identifier with a given statement. The list of all
document identifiers associated with a given statement forms
the basic index in such a system. Conversely, indexing a
document consists in associating this document with all the
statements extracted from it by the text mining tool.</p>
          <p>B. Multi-document summarization</p>
          <p>In addition to the basic index required for information
retrieval, information is needed for the prioritization of
statements (among all relevant statements) in multi-document
summarization. Statements below a certain threshold of
confidence may be hidden as a way of restricting the amount of
information provided in the summary. Low confidence can be
indicated by a text mining tool, for example, when ambiguity in
natural language cannot be resolved by the system.</p>
          <p>C. Question answering</p>
          <p>In question answering applications, answers must be
collected from reputable sources. Here, statements from the
biomedical knowledge repository are used as potential answers
to input questions (e.g., what genes does IL-13 inhibit?) Not
only must the origin of the statement be present as for
information retrieval and summarization purposes, but
additional metatada associated with the document must also be
available (e.g., does this document come from a reputable
source, such as an article about randomized clinical in the case
of clinical effectiveness statements? Does this statement come
from a document published/a knowledge base revised
recently?) The distinction here is between metadata directly
associated with the statement (e.g., document identifiers), and
metadata about the documents themselves, indirectly associated
with the statement (reputability of the source, publication date).</p>
          <p>D. Knowledge discovery</p>
          <p>Information retrieval, summarization and question
answering can be thought of as exploiting a static repository,
mostly through look-ups in the repository, with no (or limited)
need for inference. In contrast, knowledge discovery processes
aim at inferring new knowledge from patterns of statements in
the repository. Inference is one major technique for deriving
new knowledge from existing knowledge. Production rules
provide a simple mechanism for formalizing inference and rule
engines are implemented in many systems that store
statements. Knowledge discovery systems require not only
production rules and rule engines for the production of entailed
statements from rules, but also the production of the metadata
associated with the entailed statements (i.e., inferred
provenance information). Provenance for both asserted and
inferred statements is required so that the universe of
statements can be restricted to degree of confidence, specific
time periods or sources. For example, can a path be found in a
graph, directly (asserted links) or indirectly (inferred links),
between two nodes (e.g., between a disease and a drug), when
links are restricted to a specific source? The issue here is not
only to associate provenance information to asserted
statements, but also to compute such information for inferred
statements as well.</p>
          <p>III. ISSUES AND CHALLENGES</p>
          <p>
            Limitation of naïve implementations. RDF provides a
simple mechanism for recording statements about statements
through “blank nodes” [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ]. A blank node can be used as an
identifier for the statement, each component of which – subject,
predicate and object – is then linked to it through predicates
such as hasSubject, hasPredicate and hasObject. Similarly,
provenance information can be linked to the statement
identifier (e.g., link to the article from which it is extracted
through a hasSource predicate). This mechanism, called
reification, is inefficient as it increases the number of triples
required for implementing a statement (at least one for the
relation of the blank node to each of the three components of
the original statement). Scalability issues are thus likely with
large biomedical repositories (typically several hundred million
asserted statements). Moreover, by significantly increasing the
complexity of queries, reification also puts an unnecessary
cognitive burden on the user.
          </p>
          <p>Lack of support for provenance information in
mainstream triple stores, There is currently no support for
exploiting provenance information in off-the-shelf triple stores.</p>
          <p>
            Support is generally provided for named graphs in so-called
“quad stores”, but named graphs hardly provide the level of
granularity needed for provenance information required in
biomedical applications. Beside reification, SPARQL does not
offer support for seamless processing of provenance
information. There is a need for a standardization of emerging
models of provenance (e.g., OPM [
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ]) and their efficient
implementation in triple stores.
          </p>
          <p>
            Limitations for the applications. The emerging paradigm
of linked data and mashups had met tremendous enthusiasm in
the biomedical community [
            <xref ref-type="bibr" rid="ref10 ref11 ref25 ref26 ref38 ref39">10, 11</xref>
            ]. At this early stage, the
possibility of easily integrating disparate datasets still
outweighs the lack of fine control over constraints on these data
sources. However, when applications mature beyond answering
questions such as “is there a path between this and that?” to
restricting graph traversal with constraints specific to properties
of the links (statements, not simply predicates), the lack of
standard models and implementations for provenance
information will appear as a serious limitation.
          </p>
          <p>ACKNOWLEDGMENT</p>
          <p>This research was supported in part by the Intramural
Research Program of the National Institutes of Health (NIH),
National Library of Medicine (NLM).</p>
          <p>
            REFERENCES
[
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ] “The Dublin Core® Metadata Initiative,” http://www.dublincore.org/.
[
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ] A. Brazma, “Minimum Information About a Microarray Experiment
(MIAME)--successes, failures, challenges,” ScientificWorldJournal, vol.
9, 2009, pp. 420-423.
[
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ]
[
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ]
[
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ]
          </p>
          <p>M. Fiszman, et al., “Automatic summarization of MEDLINE citations
for evidence-based medical treatment: a topic-oriented evaluation,” J
Biomed Inform, vol. 42, no. 5, 2009, pp. 801-813.
“Linked data,” http://linkeddata.org/.</p>
          <p>W3C, “Resource
http://www.w3.org/RDF/.</p>
          <p>Description</p>
          <p>Framework</p>
          <p>(RDF),”
O. Bodenreider and T.C. Rindflesch, Advanced library services:
Developing a biomedical knowledge repository to support advanced
information management applications, Technical report, Lister Hill
National Center for Biomedical Communications, National Library of
Medicine, 2006.</p>
          <p>
            E.W. Sayers, et al., “Database resources of the National Center for
Biotechnology Information,” Nucleic Acids Res, vol. 37, no. Database
issue, 2009, pp. D5-15.
[
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ]
          </p>
          <p>O. Bodenreider, “The Unified Medical Language System (UMLS):
Integrating biomedical terminology,” Nucleic Acids Res, vol. 32
Database issue, 2004, pp. D267-270.</p>
          <p>L. Moreau, et al., “The Open Provenance Model: An Overview,”
Provenance and Annotation of Data and Processes, Lecture Notes in</p>
          <p>
            Computer Science 5272, Springer, 2008, pp. 323-326.
[
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ] F. Belleau, et al., “Bio2RDF: towards a mashup to build bioinformatics
knowledge systems,” J Biomed Inform, vol. 41, no. 5, 2008, pp.
706716.
[
            <xref ref-type="bibr" rid="ref11 ref26 ref39">11</xref>
            ] K.H. Cheung, et al., “Semantic mashup of biomedical data,” J Biomed
          </p>
          <p>Inform, vol. 41, no. 5, 2008, pp. 683-686.</p>
          <p>On User Views in Scientific Workflow Systems</p>
          <p>Yi Chen, Peng Sun</p>
          <p>Arizona State University
Email: fyi, psun5g@asu.edu</p>
          <p>
            Abstract—An increasing number of scientific workflow systems
are providing support for the automated tracking and storage
of provenance information. However, the amount of provenance
information recorded can become very large, even for a single
execution of a workflow – [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ] estimates a ten-fold blowup of
the size of the original input data. There is therefore a need
to provide ways of allowing users to focus their attention on
meaningful provenance information in provenance queries. We
highlight recent work in this area on user views, showing how
they can be efficiently computed given user input on relevance,
or and how pre-existing views can be corrected to provide
accurate provenance information. We also discuss how to search
a repository of workflow specifications and their views, returning
workflows at an appropriate level of complexity with respect to
a hierarchy of views.
          </p>
          <p>
            Scientific workflow management systems (e.g.,
myGrid/Taverna [
            <xref ref-type="bibr" rid="ref11 ref26 ref39">11</xref>
            ], Kepler [
            <xref ref-type="bibr" rid="ref20 ref33 ref5">5</xref>
            ], VisTrails [
            <xref ref-type="bibr" rid="ref24 ref37 ref9">9</xref>
            ], and Chimera [
            <xref ref-type="bibr" rid="ref23 ref36 ref8">8</xref>
            ])
have become increasingly popular as a way of specifying and
executing data-intensive analyses. To ensure reproducibility of
results and track the large amount of final and intermediate
data products that are produced in a workflow execution,
many of these systems are beginning to provide support for
managing and querying provenance information.
          </p>
          <p>
            However, the amount of provenance information recorded
even for a single execution of a workflow can be extremely
large; [
            <xref ref-type="bibr" rid="ref21 ref34 ref6">6</xref>
            ] estimates a ten-fold blowup of the size of the original
input data. While databases are adept at storing and efficiently
answering queries over large amounts of information, users
are not adept at assimilating large amounts of information.
          </p>
          <p>It is therefore important to develop techniques to minimize
the cognitive overload resulting from provenance queries,
providing provenance information that is relevant to users.</p>
          <p>As an example, consider the workflow specification (a.k.a
workflow definition or schema) in Fig. 1 (a)1, which describes
a common analysis in molecular biology: Phylogenomic
inference of protein biological function. This workflow first
takes in a set of entries selected by the user from a database
(such as GenBank), and formats these entries to extract a
set of sequences, and, possibly, a set of annotations (M1).</p>
          <p>An alignment is then created (M3), and the result formatted
(M4). The user may also be interested in rectifying the
alignment (M5). M3 to M5 are repeated until the biologist
is satisfied with the result obtained. The user may also inspect
the annotations provided by GenBank (M2) and generate a set
of curated annotations; new user input is needed for this. The
1The reader should ignore the dotted boxes for now.</p>
          <p>M8</p>
          <p>Format
annotations
M7</p>
          <p>Build
Phylo tree
M6
Functional
data</p>
          <p>M9</p>
          <p>O
annotations are then formatted (M8) to be taken as input to the
phylogenetic tree reconstruction task (M7). Other annotations
are also considered: M6 takes in annotations from the user’s
lab and formats them to be taken as input to M7. From the
annotations produced by M8 (and possibly M6) together with
the alignment produced by M4, M7 provides a phylogenetic
tree labeled with functional annotations. Note that a number of
these tasks or modules (e.g. M1, M4, M8) involve formatting
and are not central to the scientific goal of the experiment,
and that edges represent the precedence and potential dataflow
between modules during an execution.</p>
          <p>The result of executing a scientific workflow is called a
run. As a workflow executes, data flows between module
invocations (or steps). For example, a run of the phylogenomics
workflow is shown in Fig. 1(b). Nodes represent steps that
are labeled by a unique step identifier and a corresponding
module name (e.g., S1:M1). Edges denote the flow of data
between steps, and are labeled accordingly (e.g., data objects
ent1,...,ent10 flow from input I to the first step S1). Note
that loops in the workflow specification are always unrolled
in the run graph, e.g., two steps S3 and S6 of M4 are shown
in the run of Fig. 1(b).</p>
          <p>M2
Annotation
Checking</p>
          <p>Modify
alignment M5
I</p>
          <p>M1
Split
entries</p>
          <p>MR3un
alignment</p>
          <p>M4Format
alignment</p>
          <p>M11</p>
          <p>M10
(a)
(b)
seq1,…,
seq100
funct1, …, funct30
exp1, …, exp6
I
annotGB1, …,
annotGB100</p>
          <p>S1:M1
ent1,…,
ent100</p>
          <p>S8:M2 annotChecked1, …, annotChecked100</p>
          <p>(S13:M10)
(S11:M11) (S12:M11)
S2:M3 S3:M4 S4:M5 S5:M3 S6:M4 S7:M7</p>
          <p>S9:M8 1
annotFile1
tree1</p>
          <p>O
align1 alignF1 fileGaps align2 alignF2</p>
          <p>S10:M6 functFile1</p>
          <p>(S14:M9)</p>
          <p>Fig. 1. Phylogenetic workflow specification (a) and run (b)
I</p>
          <p>M1</p>
          <p>M9</p>
          <p>O
I</p>
          <p>M1</p>
          <p>M9</p>
          <p>O</p>
          <p>M2
M5
M11</p>
          <p>Fig. 2. Joe’s (a) and Mary’s (b) user views.</p>
          <p>
            Data provenance in workflows is typically captured as a
set of dependencies between data objects [
            <xref ref-type="bibr" rid="ref22 ref35 ref7">7</xref>
            ]. Essentially,
the graph of Fig. 1(b) becomes one in which the nodes are
data; each edge is labeled with the module execution which
produced the data at its start and taking as one of its inputs
the data at its end. Note that module names are repeated for
every input-output pair. Thus, a query of the provenance of
the final data product tree1 would return a graph similar to
that in Fig. 1(b), which (even for this simple example) is quite
large.
          </p>
          <p>In this paper, we discuss a technique called user views which
uses composite modules, i.e. modules which may themselves
contain subworkflows, to hide portions of a workflow run and
thus simplify the workflow specification as well as provenance
information. Section II shows how users can indicate which
modules are relevant within a specification, and have a user
view automatically created around those relevant modules.</p>
          <p>Section III discusses how to refine pre-defined views to ones
which correctly portray the provenance relationships between
the input and output of composite modules. Section IV shows
how a database of specifications and their views can be
searched using keyword queries, returning workflows at an
appropriate level of complexity with respect to a hierarchy of
views.</p>
          <p>II. USER VIEWS</p>
          <p>
            As illustrated in Fig. 1 (b), a workflow run may comprise
many steps and intermediate data objects, and therefore the
amount of information provided in response to a provenance
query can be overwhelming. A user may therefore wish to
indicate which modules in the workflow specification are
relevant, and have provenance information presented with
respect to that user view. To do this, composite modules are
used as an abstraction mechanism [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ].
          </p>
          <p>As an example, for the workflow in Fig. 1 (a), user Joe might
indicate that M2: Annotation Checking, M3: Run Alignment,
and M7: Build Phylo Tree are relevant to him. In this case,
composite modules M9 and M10 would automatically be
constructed (indicated by dotted boxes labeled M9 and M10
in Fig. Fig. 1(a)), and Joe’s user view would be fM1, M2,
M9, M10g as shown in Fig. 2(a). When answering provenance
queries with respect to a user view, only data passed between
modules in the user view would be visible. Data and module
executions internal to a composite module in the view would
be hidden; this corresponds to hiding the module executions
and data shown within the dotted boxes M9 and M10 in
Fig. 1(b). Thus, the provenance for tree1 presented
according to Joe’s user view would no longer include annotFile1,
functFile1 (both pieces of data are hidden inside M9),
or align1, alignF1, fileGaps, align2 (hidden inside
M10).</p>
          <p>Views are individualized according the user’s interests. For
example, another user, Mary, may be interested in modules
M2, M3 and M7 (like Joe) but additionally interested in
M5: Modify alignment. Mary’s user view would therefore be
constructed as fM1, M2, M5, M9, M11g (shown in Fig. 2(b)),
and her view for the provenance of tree1 would expose
alignF1 and fileGaps (unlike Joe’s view) while hiding
annotFile1, functFile1, align1, and align2.</p>
          <p>More formally, a user view is a partition of the workflow
modules. It induces a “higher level” workflow in which nodes
represent composite modules in the partition (e.g., M9 and
M10) and edges are induced by dataflow between modules
in different composite modules (e.g., an edge between M10
and M9 is induced by the edge from M4 to M7 in the original
workflow). Provenance information is then seen by a user with
respect to the flow of data between modules in his view.</p>
          <p>
            In the ZOOM system [
            <xref ref-type="bibr" rid="ref18 ref3 ref31">3</xref>
            ], [
            <xref ref-type="bibr" rid="ref17 ref2 ref30">2</xref>
            ], user views are constructed
automatically given input on what modules the user finds
relevant such that (1) a composite module contains at most
one relevant (atomic) module, thus assuming the “meaning”
of that module; and (2) no data dependencies (either direct or
indirect) are introduced or removed between relevant modules.
          </p>
          <p>In this way, the meaning of the original workflow specification
is preserved, and only relevant provenance information is
provided to the user.</p>
          <p>An interesting theoretical question is whether there are
efficient algorithms for constructing a user view which obeys
conditions (1) and (2) above and which are as small as possible,
i.e. in which the number of composite modules is minimized.</p>
          <p>
            We call such a view optimum. It turns out that whether or
not the optimum user view can be constructed depends on
the graphical structure of the workflow specification [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ]: For
specifications that are general graphs, regardless of the number
of distinct modules in the input workflow and the structure of
interaction between them, the number of composite modules
can be exponentially large in the number of relevant modules
in an optimum user view for the specification. However,
for series-parallel workflow graphs [
            <xref ref-type="bibr" rid="ref14 ref42">14</xref>
            ] there is a simple,
linear time algorithm for constructing an optimum user view
for a given specification [
            <xref ref-type="bibr" rid="ref19 ref32 ref4">4</xref>
            ]. A study of scientific workflow
specifications collected from our collaborators as well as those
found at myexperiment.org has shown that over 80% of
scientific workflow specifications are series-parallel graphs,
I2
          </p>
          <p>I1
I I6</p>
          <p>M20</p>
          <p>Split
GenBank
entries</p>
          <p>M1</p>
          <p>M2
Annotation
Checking</p>
          <p>Modify
alignment</p>
          <p>M6
Functional
data</p>
          <p>O2
O6</p>
          <p>O4</p>
          <p>I2
I1</p>
          <p>I6
I</p>
          <p>M20</p>
          <p>M22</p>
          <p>O</p>
          <p>Fig. 3. Unsound User View (a) and Projected View (b)</p>
          <p>III. CORRECTING COMPOSITE MODULES FOR</p>
          <p>PROVENANCE</p>
          <p>The previous section focused on how to create views give
user input on what modules were relevant. However, composite
modules are frequently used for purposes of modularization,
abstraction and reuse when specifying workflows, and
therefore workflow views may already exist.</p>
          <p>However, unless a view is carefully designed, it may not
preserve the dataflow between modules in a workflow, and thus
can be misleading and lead to incorrect provenance analysis.</p>
          <p>For example, consider the view defined in Fig. 3(a), and
suppose a user would like to determine the provenance of
output O2 of module M20 in the projected view in Fig. 3(b).</p>
          <p>Based on the abstracted provenance graph, she would believe
that inputs I2, I1 and I6 are all involved. However, there is no
path between I6 and O2 in the original workflow; only I1 and
I2 are in the provenance of O2.</p>
          <p>
            Ideally, a view should preserve all the data dependencies
between composite tasks in the workflow, without adding
or removing dependencies. We call such a view sound with
respect to provenance. In our example, the view in Figure
3(b) indicates a data dependency path between I6 and O2,
which does not exist in the original workflow in Figure 3(a),
and thus unsound. Although it would seem natural to design
views which are sound, our survey of workflow designs in a
well-curated workflow repository [
            <xref ref-type="bibr" rid="ref1 ref16 ref29">1</xref>
            ] revealed several unsound
views. The goal of the WOLVES system [
            <xref ref-type="bibr" rid="ref13 ref28 ref41">13</xref>
            ] is therefore to
diagnose and correct unsound views.
          </p>
          <p>We prove that a view is sound if every composite task in the
view is sound. Two alternatives can be pursued for correcting
an unsound task: Splitting it into multiple smaller tasks, or
merging it with other tasks. Note that splitting composite</p>
          <p>M5
M4
Format
alignment</p>
          <p>O2
O4
O6</p>
          <p>M8</p>
          <p>Format
annotations
M7</p>
          <p>Build
Phylo tree
O
tasks refines the initial view to a lower level and provides
more provenance information. In contrast, merging tasks loses
information, as tasks that are important to the user may be
invisible after the merge. Therefore, in WOLVES we focus on
techniques that resolve an unsound view by splitting unsound
composite tasks rather than merging them. For example, we
could split M20 into three composite tasks: fM1g, fM2g,
fM6g, fM3, M4, M5g.</p>
          <p>Our goal is to correct an unsound view by splitting its
unsound composite tasks to a minimal number of tasks, each
of which is sound. However, this problem is NP-hard by
reduction from the independent set problem. To efficiently
tackle this problem, we propose two optimality criteria: weak
local optimality and strong local optimality. A weak local
optimal solution is one in which no two tasks in the resulting
view can be merged into a sound task, and strong local optimal
solution is one in which no set of two or more tasks in the
view can be merged. Weak local optimality can be achieved
with an O(n2) algorithm, and strong local optimality with
an O(n3) algorithm, where n is the number of tasks in the
workflow. The proposed algorithms are much more efficient
than the algorithm which produces an optimal solution. The
strongly local optimal algorithm often has comparable
processing efficiency to the weakly local optimal algorithm, and
produces views that are comparable to the optimal one.</p>
          <p>IV. SEARCHING WORKFLOWS THROUGH VIEWS</p>
          <p>
            An increasing number of workflow specifications and their
views are being stored, either as part of a local workflow
system or collected to form a community repository (e.g.
myexperiment.org [
            <xref ref-type="bibr" rid="ref10 ref25 ref38">10</xref>
            ]). It is therefore important for workflow
designers to be able to search these repositories and then
reuse, include or revise the retrieved workflows to simplify the
design of a new workflow.
          </p>
          <p>Techniques for finding workflows of interest are currently
limited to keyword searches based on the name of the
workflow or tags explicitly associated with the workflow, and
the result is a set of workflows shown at an arbitrary level
of detail. However, by using a notion of hierarchical user
views, this rudimentary way of searching for workflows of
interest can be significantly improved. In a hierarchical user
view, composite modules may themselves contain composite
modules, and names can be associated with each atomic or
composite module.</p>
          <p>For example, suppose that a user would like to make a
sauteed dish which uses chicken breast and coconut, and
needs a recipe. She would then issue a keyword query Q,
“chicken breast, coconut milk, saute” on a repository of recipes
(workflows) to search for relevant recipes.</p>
          <p>Now suppose that Fig. IV is one of the relevant recipes
in the repository, where all query keywords have matches.</p>
          <p>Obviously, returning the entire workflow hierarchy as a query
result, i.e. the one in which all the composite modules are
exposed, is difficult to understand since too much irrelevant
information is exposed to the user. We therefore need to find
ways of exposing relevant information in our query results.</p>
          <p>Fig. 5. Query Result for Q:fchicken breast, coconut milk, sauteg</p>
          <p>The immediate question is how to define search results
when users issue keyword queries on a repository of
workflow hierarchies. Recall that much research has been done
on keyword search on graph-structured data (e.g., relational
databases) and tree-structured data (e.g., XML), where a result
is defined as a smallest data tree that contains the query
keywords. Unfortunately, this definition of a query result is not
appropriate for workflow search as the result is not guaranteed
to capture the dependencies and dataflow among tasks that
contain keyword matches.</p>
          <p>For our example, a ”good” query result is shown in
Fig. IV, which is a query-driven view that visualizes
keyword matches shown in dashed rectangles together with
their dataflow. For example, after saute (the chicken)
until tender, we stir (it) in flour and then
add coconut milk. Note that the dataflow paths among
these tasks are not explicitly shown in the workflow
hierarchy in Fig. IV, but are derived. Also, expansion edges that
represent irrelevant views are avoided.</p>
          <p>Supporting keyword search on workflow hierarchies poses
new challenges beyond keyword search on relational and XML
data. First we need to define meaningful query results. We
propose that a query result should be a minimal query-driven view
of the workflow hierarchy that contains all keyword matches,
which is a graph containing all matches and dataflow edges</p>
          <p>V. CONCLUSION</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          , “
          <article-title>Linked Data - The Story So Far,” Int</article-title>
          .
          <source>Journal on Semantic Web and Information Systems</source>
          , Special Issue on Linked Data,
          <year>2009</year>
          , in press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Diane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Kahn</surname>
          </string-name>
          , and Y. W. Richard, “AIMQ:
          <article-title>a methodology for information quality assessment</article-title>
          ,
          <source>” Information &amp; Management</source>
          , vol.
          <volume>40</volume>
          , no.
          <issue>2</issue>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          , “
          <article-title>Provenance Information in the Web of Data,”</article-title>
          <source>in Proc. of the Linked Data on the Web Workshop at WWW</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Quality-driven query answering for integrated information systems</article-title>
          . Springer Verlag,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bobrowski</surname>
          </string-name>
          , M. Marre´, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Yankelevich</surname>
          </string-name>
          ., “
          <article-title>A homogeneous framework to measure data quality,”</article-title>
          <source>in Proc. of IQ</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Motro</surname>
          </string-name>
          and
          <string-name>
            <surname>I. Rakov</surname>
          </string-name>
          , “
          <article-title>Estimating the quality of databases,”</article-title>
          <source>in Proc. of FQAS</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Gruser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Raschid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zadorozhny</surname>
          </string-name>
          , and T. Zhan, “
          <article-title>Learning response time for websources using query feedback and application in query optimization,”</article-title>
          <source>VLDB Journal</source>
          , vol.
          <volume>9</volume>
          , no.
          <issue>1</issue>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ballou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pazer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Tayi</surname>
          </string-name>
          , “Modeling Information Manufacturing Systems to Determine Information Product Quality,” Management Science, vol.
          <volume>44</volume>
          , no.
          <issue>4</issue>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          , and L. Moreau, “
          <article-title>Provenancebased validation of e-science experiments,”</article-title>
          <source>in Proc. of ISWC</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Golbeck</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mannes</surname>
          </string-name>
          , “
          <article-title>Using Trust and Provenance for Content Filtering on the Semantic Web,”</article-title>
          <source>in Proc. of the Models of Trust for the Web Workshop at WWW</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Simmhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plale</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gannon</surname>
          </string-name>
          , “
          <article-title>A Survey of Data Provenance in e-Science,”</article-title>
          <source>SIGMOD Record</source>
          , vol.
          <volume>34</volume>
          , no.
          <issue>3</issue>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          , “Provenance in Databases: Past, Current, and Future,
          <source>” IEEE Data Engineering Bulletin</source>
          , vol.
          <volume>30</volume>
          , no.
          <issue>4</issue>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Pipino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , “
          <article-title>Data Quality Assessment,” Communications of the ACM</article-title>
          , vol.
          <volume>45</volume>
          , no.
          <issue>4</issue>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>Quality-Driven Information Filtering in the Context of WebBased Information Systems</article-title>
          . VDM Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>S. de F. Mendes Sampaio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Dong</surname>
            , and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Sampaio</surname>
          </string-name>
          , “
          <article-title>Incorporating the Timeliness Quality Dimension in Internet Query Systems,”</article-title>
          <source>in Proc. of WISE</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ram</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>"A Semiotics Framework for Analyzing Data Provenance Research,"</article-title>
          <source>Journal of computing Science and Engineering</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>221</fpage>
          -
          <lpage>248</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>"Data Provenance: Some Basic Issues,"</article-title>
          <source>Proceedings of FSTTCS</source>
          , New Delhi, India,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>"Why and Where: A Characterization of Data Provenance,"</article-title>
          <source>in Lecture Notes in Computer Science</source>
          , vol.
          <year>1973</year>
          , pp
          <fpage>316</fpage>
          -
          <lpage>330</lpage>
          , Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frew</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Bose</surname>
          </string-name>
          ,
          <article-title>"Earth System Science Workbench: A Data Management Infrastructure for Earth Science Products,"</article-title>
          <source>Proceedings of 13th International Conference on Scientific and Statistical Database Management</source>
          , Fairfax,
          <string-name>
            <surname>VA</surname>
          </string-name>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bose</surname>
          </string-name>
          ,
          <article-title>"A conceptual framework for composing and managing scientific data lineage,"</article-title>
          <source>Proceedings of 14th International Conference on Scientific and Statistical Database Management</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Greenwood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Goble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Addis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Oinn</surname>
          </string-name>
          ,
          <article-title>"Provenance of e-Science Experiments - experience from Bioinformatics,"</article-title>
          <source>Proceedings of UK e-Science All Hands Meeting</source>
          , Nottingham, UK,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grueninger</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <article-title>"Methodology for the Design and Evaluation of Ontologies,"</article-title>
          <source>Proceedings of Workshop on Basic Ontological Issues in Knowledge Sharing</source>
          , Montreal, Quebec, Canada,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bunge</surname>
          </string-name>
          , Treatise on Basic Philosophy: Vol.
          <volume>3</volume>
          :
          <string-name>
            <surname>Ontology</surname>
            <given-names>I</given-names>
          </string-name>
          :
          <article-title>The Furniture of the World</article-title>
          . Boston, MA: Reidel,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sowa</surname>
          </string-name>
          , Conceptual structures:
          <source>Information processing in Mind and Machine</source>
          . Reading, MA: Addison-Wesley,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ram</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>"Understanding the Semantics of Data Provenance to Support Active Conceptual Modeling,"</article-title>
          <source>in Lecture Notes in Computer Science</source>
          , vol.
          <volume>4521</volume>
          , pp
          <fpage>17</fpage>
          -
          <lpage>29</lpage>
          , Springer-Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Simmhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plale</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gannon</surname>
          </string-name>
          ,
          <article-title>"A Survey of Data Provenance Techniques,"</article-title>
          <source>Indiana University</source>
          ,
          <source>Technical Report IUB-CS-TR618</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lih</surname>
          </string-name>
          ,
          <article-title>"Wikipedia as Participatory Journalism: Reliable Sources? Metrics for Evaluating Collaborative Media as a News Resource,"</article-title>
          <source>Proceedings of 5th International Symposium on Online Journalism</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <article-title>"Who Does What: Collaboration Patterns in the Wikipedia and Their Impact on Data Quality,"</article-title>
          <source>Proceedings of nineteenth Annual Workshop on Information Technologies and Systems(WITS</source>
          <year>2009</year>
          ), Phoenix, Arizona, USA, December,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Altintas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berkley</surname>
          </string-name>
          , E. Jaeger,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          <article-title>Luda¨scher, and</article-title>
          <string-name>
            <given-names>S.</given-names>
            <surname>Mock</surname>
          </string-name>
          . Kepler:
          <article-title>An extensible system for design and execution of scientific workflows</article-title>
          .
          <source>In SSDBM</source>
          , pages
          <fpage>423</fpage>
          -
          <lpage>424</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Biton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Boulakia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          .
          <article-title>Zoom*userviews: Querying relevant provenance in workflow systems</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>1366</fpage>
          -
          <lpage>1369</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Biton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Boulakia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Hara</surname>
          </string-name>
          .
          <article-title>Querying and managing provenance through user views in scientific workflows</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>1072</fpage>
          -
          <lpage>1081</lpage>
          . IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Biton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khanna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          .
          <article-title>Optimizing user views for workflows</article-title>
          .
          <source>In ICDT '09: Proceedings of the 12th International Conference on Database Theory</source>
          , pages
          <fpage>310</fpage>
          -
          <lpage>323</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowers</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Luda</surname>
          </string-name>
          <article-title>¨scher. Actor-oriented design of scientific workflows</article-title>
          .
          <source>In Int. Conf. on Concept. Modeling</source>
          , pages
          <fpage>369</fpage>
          -
          <lpage>384</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ramanan</surname>
          </string-name>
          .
          <article-title>Efficient provenance storage</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <fpage>993</fpage>
          -
          <lpage>1006</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Boulakia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eyal</surname>
          </string-name>
          , B. Luda¨scher,
          <string-name>
            <surname>T. M. McPhillips</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bowers</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          <string-name>
            <surname>Anand</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Freire</surname>
          </string-name>
          .
          <article-title>Provenance in scientific workflow systems</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>44</fpage>
          -
          <lpage>50</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I. T.</given-names>
            <surname>Foster</surname>
          </string-name>
          , J.-S. Vo¨ckler, M. Wilde, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          . Chimera:
          <article-title>Avirtual data system for representing, querying, and automating data derivation</article-title>
          .
          <source>In SSDBM</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Callahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Scheidegger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Vo</surname>
          </string-name>
          .
          <article-title>Managing rapidly-evolving scientific workflows</article-title>
          .
          <source>In IPAW</source>
          , volume
          <volume>4145</volume>
          <source>of LNCS</source>
          , pages
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[10] myExperiment. http://www.myexperiment.org/workflows.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Oinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Addis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Senger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenwood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Carver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Pocock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wipat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Taverna: a tool for the composition and enactment of bioinformatics workflows</article-title>
          . Bioinformatics,
          <volume>20</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3045</fpage>
          -
          <lpage>3054</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Wise: a workflow information search engine</article-title>
          .
          <source>In ICDE</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. N.</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          . WOLVES:
          <article-title>Achieving Correct Provenance Analysis by Detecting and Resolving Unsound Workflow Views</article-title>
          .
          <source>In PVLDB</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Valdes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Tarjan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Lawler</surname>
          </string-name>
          .
          <article-title>The recognition of series parallel digraphs</article-title>
          .
          <source>SIAM J. Comput.</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>298</fpage>
          -
          <lpage>313</lpage>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>