<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Metadata Annotation through Reconstructing Provenance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul Groth</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yolanda Gil</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Magliacane</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Annotating datasets with metadata is an important part of organizing and curating data. However, it is a time-consuming process and often not done in a rigorous fashion. In this paper, we propose a new approach to annotating datasets through the use of reconstructed provenance. A detailed survey of the related work in this area is given. Additionally, we provide an overview of our approach for both reconstructing provenance and using that provenance to automatically annotate datasets with metadata. This approach leverages existing work in AI planning and change detection algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>provenance</kwd>
        <kwd>reconstruction</kwd>
        <kwd>metadata annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Affiliations</title>
      <p>1 VU University Amsterdam
2 Information Sciences Institute, University of Southern California</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>A major impediment to data aggregation and exploitation is the need to describe
datasets and their contents with appropriate metadata, so they can be
appropriately organized and prepared for analysis. Typically, simple metadata about
location and time of collection are available, but important metadata about data
properties and provenance require effort and are not typically captured.
Moreover, scientists tend to rely on spreadsheets and other data preparation software
that does not create metadata for the resulting data. Despite major investments
in infrastructure for metadata annotation, the collection of metadata remains a
challenging area for science because of the effort it requires.</p>
      <p>We are investigating a new approach that automatically derives metadata
rather than requiring scientists to provide it. The key idea is that rather than
manually annotating the metadata of many datasets, we manually annotate the
(much fewer) models that use the data. Scientists will be able to upload datasets
they have collected together with informal descriptions, but with no structured
metadata associated with them. Other scientists will download these datasets
and prepare them to be analyzed by models implemented in software. Our system
will have access to the original datasets and the prepared datasets that are input
to models.</p>
      <p>Our approach is to reconstruct the provenance of the prepared dataset, that
is, to infer what sequence of transformations could have been done to the original
dataset to obtain the final dataset. The final datasets can be assigned metadata
because of the way they are used in a model, and once the provenance is
reconstructed then the metadata can be propagated to the initial dataset. We assume
a messy environment where data is provided as is (e.g., a normal desktop file
system).</p>
      <p>Being able to reconstruct provenance is of interest because it places less of
a burden on scientists to either adapt to an underlying provenance system or
document provenance themselves. In this paper, we provide a review of related
work in possible approaches to addressing the problem of reconstructing
provenance. From this review, we outline a new approach to solving the problem of
reconstructing provenance tailored towards automatic metadata annotation.
</p>
    </sec>
    <sec id="sec-3">
      <title>Approaches to reconstructing provenance</title>
      <p>
        Provenance has been studied from a variety of perspectives. There have been
several good surveys of the provenance literature [
        <xref ref-type="bibr" rid="ref13 ref20 ref9">20, 9, 13</xref>
        ]. Here, we focus on
the specific literature related to reconstructing provenance. We begin by looking
at work directly from the provenance community. We then recast the problem
of reconstructing provenance as one of either change detection or planning and
review the related work in those two areas.
      </p>
      <sec id="sec-3-1">
        <title>Approaches from the provenance literature</title>
        <p>We classify the related work in provenance into three broad areas: mining
provenance from data, using network structures to infer provenance, and leveraging
the execution environment to reconstruct provenance.</p>
        <p>
          Mining provenance The problem of reconstructing chains of historical
evolution for a corpus of text documents is discussed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The approach consists of
clustering the documents based on their cosine similarity as vectors of terms and
ordering the documents in each cluster based on their creation time. Due to the
similarity metrics involved, this method can only reconstruct the dependencies
between the documents, while ignoring the transformations that lead from one
document to another.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], provenance is interpreted simply as the type of process that created
the data. In the considered application domain, i.e. reservoir engineering, the
same process often generates instances of semantically related concepts.
Assuming access to historical data with complete provenance information, it becomes
possible to compute confidence values for semantic associations between
concepts that have the same generating process. Given an instance of a concept,
its missing provenance can be predicted using the semantic association with the
highest confidence value and assigning the provenance of associated items. This
work overlaps with workflow mining [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], where workflows are mined from log
files.
        </p>
        <p>
          In the computational workflow environment, [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] describes an approach for
inferring service substitutions using examples found in provenance traces,
essentially, mining a high-level provenance description.
        </p>
        <p>
          Leveraging network structures Other work (e.g., [
          <xref ref-type="bibr" rid="ref17 ref3">17, 3</xref>
          ]) has proposed to
reconstruct the provenance of information based on the topology of the
underlying network. In this case provenance is intended as a provenance path, i.e.
the set of nodes and edges through which the information is communicated.
Specifically, in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], some simple techniques for the reconstruction of incomplete
provenance in an information sharing network are given. Provenance is
represented as a list of signed metadata from the nodes that have received a specific
information item. These metadata include the node identifier, the location and
time at which the node processed the item. In case of partial metadata for one
node, the missing parts can be approximated based on the metadata from the
neighboring nodes. On the other hand, if the provenance chain is incomplete, the
path of the information can be reconstructed by first listing all possible paths
(either by constructing a reachability set or by previously profiling the system) and
then matching a path that is most compatible to the known provenance, both
by total length and order of common subsequences.
        </p>
        <p>
          The problem of tracking the information provenance path in a social media
setting is defined in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The paper describes how to leverage the structure of a
social network to estimate the most likely provenance paths for a given
piece of information. The notion of provenance in this setting is limited to a list
of transmitting nodes, without differentiating between the operations that could
be performed on these nodes.
        </p>
        <p>
          Leveraging the execution environment The following approaches rely on
knowledge about the execution environment to infer or rebuild provenance
information. Work in the database community has defined the notion of a registry of
weak inverse functions, which allow the inverse of functions
applied within a database to be approximated in order to trace back
provenance [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. These approximation functions must be registered by users of the
system.
        </p>
        <p>
          In the context of stream data processing, complete provenance information
can be very large. In order to reduce the required storage, [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposes to store
only coarse-grained provenance. Through coarse-grained information about the
transformations performed on data and a temporal data model they introduce an
algorithm to reconstruct the processing window data and compute fine-grained
provenance (tuple-level).
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] discusses the need for provenance systems that are able to detect and
correct errors in provenance records. The authors consider several examples in
which the provenance information is incomplete, missing or erroneous, either
because of rogue users or failing processes, and conclude that provenance systems
should include redundancy (e.g. having several copies of the same record in
different nodes) and tamperproof mechanisms to minimize these issues.
        </p>
        <p>
          Several systems have gathered provenance information
transparently by monitoring applications at the operating system level [
          <xref ref-type="bibr" rid="ref14 ref18">18, 14</xref>
          ].
Based on knowledge of how processes run and how reads and writes to the file
system occur, these systems can reconstruct provenance information.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Reconstructing provenance as change detection</title>
        <p>There exists extensive research on reconstructing sequences of operations based
on input and output data, in particular in change detection and edit distance
algorithms. These approaches can be seen as analogous to the problem of
reconstructing provenance. We give a brief overview of this work here.</p>
        <p>Edit distance is a common similarity measure between two entities that
consists of the number of transformations required to transform one entity into
another. Algorithms for computing the edit distance can also output the
related sequence of operations, called edit script. This edit script can be seen as
corresponding to some approximate form of provenance.</p>
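      <p>To make the notion of an edit script concrete, the following sketch (illustrative code, not drawn from any of the surveyed systems) computes the Levenshtein distance between two strings and recovers one optimal edit script, which can be read as a coarse provenance record:</p>
      <preformat>
```python
def edit_script(src, dst):
    """Compute the Levenshtein distance and one optimal edit script.

    The script is a list of (operation, position, character) tuples;
    it plays the role of a coarse 'reconstructed provenance' record.
    """
    n, m = len(src), len(dst)
    # dist[i][j] = edit distance between src[:i] and dst[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep/substitute
    # Backtrace from (n, m) to (0, 0) to recover one optimal script.
    script, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and src[i - 1] == dst[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1          # characters match: no operation
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            script.append(("substitute", i - 1, dst[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            script.append(("delete", i - 1, src[i - 1]))
            i = i - 1
        else:
            script.append(("insert", i, dst[j - 1]))
            j = j - 1
    script.reverse()
    return dist[n][m], script
```
      </preformat>
      <p>For example, edit_script("kitten", "sitting") returns distance 3 together with a three-operation script (two substitutions and one insertion).</p>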
        <p>
          In the literature, there are several well-known algorithms for computing the
edit distance for different types of entities, for example strings, ordered and
unordered trees [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and graphs [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. In these cases, the set of considered
transformations usually consists of elementary operations, e.g. insert/delete node,
substitute node, etc. In general, computing the minimal edit distance for unordered
trees and graphs has been proved to be an NP-hard problem, although
polynomial algorithms have been devised for some special cases of restricted graph
structures. Other possible approaches consider using heuristics or approximating
the minimal edit distance.
        </p>
        <p>
          Leveraging domain knowledge, it becomes possible to define more efficient
heuristic solutions, e.g. for hierarchically structured data with node insert, node
delete, node update as well as subtree move and copy operations [
          <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
          ]. Other edit
distance algorithms have been tailored for specific types of data with the
corresponding operations. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] introduces an algorithm for edit distance in ordered
XML documents. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] proposes three different similarity metrics for Business
Process Models: text similarity, structural similarity and behavioral similarity.
Bao et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] compare provenance traces of several executions of the same
workflow. Provenance traces are series-parallel graphs with well-nested forking and
looping, and the set of considered edit operations (path insertion, deletion,
expansion, contraction) is different from that of the standard tree edit distance problem;
thus it is possible to define efficient polynomial-time algorithms. PROMPTDIFF
is a tool for differentiating ontologies that allows for the detection of high-level
changes, which provide richer semantics than the change primitives just discussed
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. In particular, the tool first reconstructs the basic change operations using a
set of heuristic matchers and then applies a set of rules to infer complex change
operations.
        </p>
        <p>We note that all of the above-mentioned approaches for change detection
refer to entities of the same type and optimizations are possible because of
deeper knowledge about the domain.
</p>
      </sec>
      <sec id="sec-3-3">
        <title>Reconstructing provenance as planning</title>
        <p>Another related field is the planning of composite operations from several atomic
operations based on user requirements about the output operation (i.e., AI
planning).</p>
        <p>A particular instance of this problem is automated web service composition,
i.e. the problem of creating plans composing several web services automatically
based on user requirements, possibly also taking into account the availability of
the web services and the quality of service at run-time. This work is similar to
reconstructing provenance as the data involved (i.e. service descriptions) tend to
involve complex representations that need to be connected by a set of complex
operations. However, unlike provenance, service descriptions are generally of one
format.</p>
        <p>
          The general assumption in this work is that there exists a repository of web
services and that a formal description of each web service is available, as well
as the formalization of user requirements. In most approaches the composition
is divided in two phases: synthesis, which aims at creating a plan of abstract
services, and orchestration, which substitutes the abstract services with one of
the possibly many functionally equivalent concrete services. Several surveys (e.g.
[
          <xref ref-type="bibr" rid="ref22 ref4">22, 4</xref>
          ]) describe a number of methods that have been proposed for this problem;
they can be categorized into workflow composition and AI planning.
        </p>
        <p>In workflow composition approaches a composite service can be seen as a
workflow of atomic services, so dynamic workflow methods for binding the
abstract workflow plan to concrete resources can be reused. However, these
methods often require a predefined abstract plan with the set of tasks, and in most
cases they are limited to serial and parallel composition of tasks. In AI planning
approaches, formal descriptions of the preconditions and effects of each service
are provided. From these descriptions, a plan can be generated automatically by
a logical theorem prover or an AI planner.</p>
        <p>One of the challenges in these approaches is that they need to handle
nondeterminism and partially observable states, as well as consider fault tolerance,
quality of service, and interactivity with the user during the planning phase.
These issues are not present when reconstructing provenance. Furthermore,
unlike in our domain, these approaches require formal descriptions of the data.
</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A new approach to reconstructing provenance</title>
      <p>Our approach builds upon the above work to develop a new approach to
reconstructing provenance that is less dependent on formal descriptions or extensive
domain knowledge. Figure 1 illustrates our approach with an example.</p>
      <p>The user prepares the data, typically with Excel or a programming tool such
as R, but those steps are not recorded. The prepared data is used as input to
a model (for example, the Owens-Gibbs model for estimating reaeration rates),
which the system knows takes as input date, salinity, average temperature, and
CO2 levels in that order. From that, the system infers the metadata for the
prepared dataset that was used as input, so FC1 is date, FC2 is salinity, FC3 is
temperature, and so on. Now the system searches for transformations that could
have been used to transform the initial data into the prepared data, and
hypothesizes that column FC2 was derived from truncating the values in OC6, column
FC3 from averaging each entry in OC4 and OC5, and column FC4 from OC7.
The system may not be able to figure out that FC1 was derived from OC1.
These hypothesized transformations constitute the (possibly partially)
reconstructed provenance, which the system then uses to infer semantic metadata for
the initial dataset by propagating the metadata of the prepared dataset through
the reconstructed provenance.</p>
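      <p>The column-matching step above can be sketched as a search over a small set of candidate transformations. All names and values here (truncate, average, the contents of the OC and FC columns) are invented for illustration; the actual system's transformation library is broader:</p>
      <preformat>
```python
# Candidate transformations the system could hypothesize between columns.
def truncate(col):
    """Drop the fractional part of every value (e.g. 35.7 to 35.0)."""
    return [float(int(v)) for v in col]

def average(col_a, col_b):
    """Element-wise mean of two columns."""
    return [(a + b) / 2.0 for a, b in zip(col_a, col_b)]

def identity(col):
    return list(col)

def find_derivation(original_cols, target_col):
    """Search one-step hypotheses explaining how target_col was derived.

    Returns a human-readable description of the first matching
    hypothesis, or None if no registered transformation explains it.
    """
    for name, col in original_cols.items():
        if identity(col) == target_col:
            return "identity(%s)" % name
        if truncate(col) == target_col:
            return "truncate(%s)" % name
    names = list(original_cols)
    for a in names:
        for b in names:
            if a != b and average(original_cols[a], original_cols[b]) == target_col:
                return "average(%s, %s)" % (a, b)
    return None

# Toy version of the running example: FC2 from truncating OC6,
# FC3 from averaging OC4 and OC5 (the values are invented).
original = {"OC4": [10.0, 20.0], "OC5": [30.0, 40.0], "OC6": [35.7, 12.2]}
fc2 = find_derivation(original, [35.0, 12.0])
fc3 = find_derivation(original, [20.0, 30.0])
```
      </preformat>
      <p>Here fc2 resolves to "truncate(OC6)" and fc3 to "average(OC4, OC5)"; a column like FC1 with no matching hypothesis would yield None, mirroring the partially reconstructed provenance described above.</p>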
      <p>A major technical challenge for this to work is that there is a very large
search space of possible data transformations. Another challenge is that the
transformations that users can make to a dataset are not enumerable in principle,
so the search space is unbounded. To address these challenges, there are three key
features of our approach.</p>
      <p>First, we use anytime algorithms for our search. This means that at any
time after the search starts running, the system will be able to output a partial
understanding of how the data was transformed. For example, it may have figured
out in a few minutes what six of the nine columns are, it may figure out two
more columns in an hour, but it may never be able to figure out what two other
columns are because the transformations were not defined in the system's library.</p>
      <p>Second, we use systematic search to explore the space of transformations in
a principled manner. This means that the system detects when the same partial
set of transformations was reached in two different areas of the search space,
and only spends time once to explore it further. We use heuristics to guide the
search to explore the most promising partial transformation at any given time.</p>
      <p>Third, we are developing a library of basic transformations that are
common across scientific domains. These include basic mathematical functions,
spatial and temporal data transformations, and string transformations (truncation,
prefix additions, etc.).</p>
      <p>We have developed a prototype of this system to demonstrate the approach.
It uses the A* search algorithm combined with a heuristic function based on edit
distance to infer the provenance as a sequence of transformations on the original
dataset. This search algorithm is heuristic and expands the most promising
partial sequence of transformations at each search iteration. Essentially, it combines
the approach of AI planning with similarity measures to try to come up with
a reasonable approximation of the provenance of a given dataset. The current
prototype supports only a small number of structural transformations on tabular
data (e.g. CSVs) but we are currently incorporating more.</p>
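      <p>A minimal sketch of this kind of search follows. The transformation library, cost model, and heuristic below are simplified stand-ins for the prototype's actual components, which are not specified in detail here; the heuristic counts mismatched cells as a crude edit-distance proxy:</p>
      <preformat>
```python
import heapq

# Simplified stand-ins for the prototype's transformation library.
TRANSFORMS = {
    "drop_first": lambda row: row[1:],
    "reverse":    lambda row: row[::-1],
    "double_all": lambda row: tuple(2 * v for v in row),
}

def heuristic(state, goal):
    """Crude edit-distance proxy: differing cells plus the length gap."""
    overlap = min(len(state), len(goal))
    mismatches = sum(1 for i in range(overlap) if state[i] != goal[i])
    return mismatches + abs(len(state) - len(goal))

def reconstruct(start, goal, max_cost=6):
    """A* over transformation sequences; returns the op names applied."""
    frontier = [(heuristic(start, goal), 0, start, [])]
    seen = set()   # systematic search: never re-expand a reached state
    while frontier:
        f, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if state in seen or cost == max_cost:
            continue
        seen.add(state)
        for name, op in TRANSFORMS.items():
            nxt = tuple(op(state))
            heapq.heappush(frontier,
                           (cost + 1 + heuristic(nxt, goal),
                            cost + 1, nxt, path + [name]))
    return None
```
      </preformat>
      <p>The seen set realizes the systematic-search idea (a partial transformation reached twice is explored once), and because the frontier is ordered by estimated total cost, the best partial sequence found so far is available at any time, in the spirit of an anytime algorithm.</p>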
      <p>A key element of the prototype is that we are able to identify particular cells
within the output data that can be traced back to cells in the input data. We
are currently implementing an approach to back propagate metadata about the
output data to the input data using this reconstructed provenance trace.
</p>
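      <p>The back-propagation step can be sketched as follows; the cell-level trace format and the function name are assumptions for illustration, not the prototype's actual interface:</p>
      <preformat>
```python
def propagate_metadata(trace, prepared_metadata):
    """Back-propagate column metadata through a cell-level trace.

    trace maps a prepared-dataset cell (col, row) to the original
    cells it was derived from; prepared_metadata maps prepared
    columns to semantic labels. Each original column inherits the
    labels of every prepared column it contributed to.
    """
    original_metadata = {}
    for (prep_col, _row), sources in trace.items():
        label = prepared_metadata.get(prep_col)
        if label is None:
            continue   # this prepared column was never annotated
        for (orig_col, _orig_row) in sources:
            original_metadata.setdefault(orig_col, set()).add(label)
    return original_metadata

# Toy trace echoing the running example: FC2 cells trace back to OC6,
# FC3 cells to OC4 and OC5 (via the averaging transformation).
trace = {
    ("FC2", 0): [("OC6", 0)],
    ("FC2", 1): [("OC6", 1)],
    ("FC3", 0): [("OC4", 0), ("OC5", 0)],
}
meta = propagate_metadata(trace, {"FC2": "salinity", "FC3": "temperature"})
```
      </preformat>
      <p>In this toy trace, OC6 inherits the salinity label while OC4 and OC5 both inherit the temperature label, mirroring how metadata assigned to model inputs flows back to the initial dataset.</p>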
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we have provided an overview of related work in reconstructing
provenance. Based on this overview, we have outlined a new approach to
reconstructing provenance that combines AI planning and change detection
techniques. While our current work is research in progress, it offers the community
a novel frame to think about the problem of reconstructing provenance.</p>
      <p>Importantly, reconstructing provenance provides a new solution to the
problem of metadata annotation in science. The approach requires no effort from
the scientist and would provide a number of benefits: 1) provenance would be
automatically reconstructed for tools that do not track it and are ubiquitous in
the sciences, such as Excel; 2) metadata would be automatically annotated, including
for the original data used in an analysis; 3) the reconstructed provenance could be used
to automatically prepare new data from the same initial sources (e.g. sensors).</p>
      <p>Acknowledgements. This publication was supported by the Dutch national
program COMMIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>van der Aalst</surname>
            , W., van Dongen,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herbst</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maruster</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schimm</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weijters</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Workflow mining: A survey of issues and approaches</article-title>
          .
          <source>Data &amp; Knowledge</source>
          Engineering Vol.
          <volume>47</volume>
          , No. 2. pp.
          <volume>237</volume>
          –
          <issue>267</issue>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen-Boulakia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eyal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khanna</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Differencing Provenance in Scientific Workflows</article-title>
          .
          <source>In: Proceedings of ICDE 2009</source>
          . pp.
          <volume>808</volume>
          –
          <issue>819</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barbier</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Liu, H.:
          <article-title>Information provenance in social media</article-title>
          .
          <source>In: SBP 2011</source>
          . pp.
          <volume>276</volume>
          –
          <fpage>283</fpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Baryannis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <source>Automated Web Service Composition: State of the Art and Research Challenges. Tech. rep., 409</source>
          ,
          <string-name>
            <surname>ICS-FORTH</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Belhajjame</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soiland-Reyes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Roure</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Fostering Scientific Workflow Preservation Through Discovery of Substitute Services</article-title>
          . In: eScience 2011
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bille</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A survey on tree edit distance and related problems</article-title>
          .
          <source>Theoretical Computer Science</source>
          <volume>337</volume>
          (
          <issue>1-3</issue>
          ),
          <volume>217</volume>
          –
          <fpage>239</fpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chawathe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Molina</surname>
          </string-name>
          , H.:
          <article-title>Meaningful change detection in structured data</article-title>
          .
          <source>In: ACM SIGMOD Record</source>
          . pp.
          <volume>26</volume>
          –
          <issue>37</issue>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Chawathe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Molina</surname>
          </string-name>
          , H.:
          <article-title>Change detection in hierarchically structured information</article-title>
          .
          <source>ACM SIGMOD</source>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cheney</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiticariu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          :
          <article-title>Provenance in databases: Why, how, and where</article-title>
          .
          <source>Found. Trends databases 1, 379–474 (April</source>
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Cobena</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abiteboul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Detecting changes in XML documents</article-title>
          .
          <source>In: Proceedings of ICDE</source>
          . pp.
          <volume>41</volume>
          –
          <issue>52</issue>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Deolalikar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laffitte</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Provenance as data mining: combining file system metadata with content analysis</article-title>
          .
          <source>In: First workshop on Theory and practice of provenance</source>
          . p.
          <fpage>10</fpage>
          . USENIX Association (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dijkman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumas</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Kaarik, R.,
          <string-name>
            <surname>Mendling</surname>
          </string-name>
          , J.:
          <article-title>Similarity of business process models: Metrics and evaluation</article-title>
          .
          <source>Information Systems</source>
          <volume>36</volume>
          (
          <issue>2</issue>
          ),
          <volume>498</volume>
          –516 (Apr
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Freire</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koop</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          :
          <article-title>Provenance for computational tasks: A survey</article-title>
          .
          <source>Computing in Science and Engg.</source>
          <volume>10</volume>
          ,
          <fpage>11</fpage>
          –
          <lpage>21</lpage>
          (May
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Frew</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slaughter</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic capture and reconstruction of computational provenance</article-title>
          .
          <source>Concurrency and Computation: Practice and Experience</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <fpage>485</fpage>
          –
          <lpage>496</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>A survey of graph edit distance</article-title>
          .
          <source>Pattern Analysis and Applications</source>
          <volume>13</volume>
          (
          <issue>1</issue>
          ),
          <fpage>113</fpage>
          –
          <lpage>129</lpage>
          (Jan
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Gates</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bishop</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>One of These Records Is Not Like the Others</article-title>
          .
          <source>Proceedings of the Workshop on Theory and Practice of Provenance</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Govindan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dogan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>PRONET: Network Trust Assessment Based on Incomplete Provenance</article-title>
          .
          <source>IEEE Military Communications Conference (MILCOM)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Holland</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seltzer</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braun</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muniswamy-Reddy</surname>
            ,
            <given-names>K.K.</given-names>
          </string-name>
          :
          <article-title>Passing the provenance challenge</article-title>
          .
          <source>Concurrency and Computation: Practice and Experience</source>
          <volume>20</volume>
          (
          <issue>5</issue>
          ),
          <fpage>531</fpage>
          –
          <lpage>540</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Huq</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wombacher</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Inferring fine-grained data provenance in stream data processing: reduced storage cost, high accuracy</article-title>
          .
          <source>Database and Expert Systems Applications</source>
          pp.
          <fpage>118</fpage>
          –
          <lpage>127</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Moreau</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The foundations for provenance on the web</article-title>
          .
          <source>Found. Trends Web Sci.</source>
          <volume>2</volume>
          ,
          <fpage>99</fpage>
          –
          <lpage>241</lpage>
          (
          <year>February 2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kunnatur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Tracking changes during ontology evolution</article-title>
          . In: ISWC. pp.
          <fpage>259</fpage>
          –
          <lpage>273</lpage>
          . Springer (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>A survey of automated web service composition methods</article-title>
          .
          <source>Semantic Web Services and Web Process Composition</source>
          pp.
          <fpage>43</fpage>
          –
          <lpage>54</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Woodruff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Supporting fine-grained data lineage in a database visualization environment</article-title>
          .
          <source>In: Proceedings of ICDE</source>
          . pp.
          <fpage>91</fpage>
          –
          <lpage>102</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomadam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prasanna</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Predicting Missing Provenance using Semantic Associations in Reservoir Engineering</article-title>
          .
          <source>In: Fifth IEEE International Conference on Semantic Computing (ICSC)</source>
          . pp.
          <fpage>141</fpage>
          –
          <lpage>148</lpage>
          . IEEE
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>