<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Model for Data Curation and Transformation, with Provenance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David W. Archer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>advised by Lois M. L. Delcambre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Computer Science, Portland State University</institution>
          ,
          <addr-line>Portland, OR 97207</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>One non-traditional style of data integration that has seen significant recent interest
involves joining fine-grained data (e.g., individual data values) from multiple sources to
the entities they describe, and then querying and manipulating the resulting datasets. A typical
example from the intelligence community: “Persons of interest”, perhaps adversary
bomb-makers, are profiled and tracked by integrating small amounts of data from military
patrol logs, intelligence databases, cell phone logs, and surveillance records. Such data
may be gathered by multiple analysts over extended periods, and may be queried and
manipulated in various ways during and after integration to yield new materialized views.
Data values may be inserted more than once, for example to document additional data
sources. Data may be deleted, and then later re-inserted. Multiple data values for an
attribute may be retained either temporarily or permanently. Analysts also commonly
want to record expressions of confidence or doubt in data.</p>
      <p>We call this a data curation setting, and observe that the usefulness of information in
this setting depends both on the trustworthiness of the original data and on the decisions
made in its curation. Thus detailed provenance of data (where it came from, and how,
when, in what order, and by whom it was manipulated and queried) is at least as
important in our setting as the data itself. We focus on four needs in data curation settings
unaddressed in the current literature:</p>
      <list list-type="bullet">
        <list-item><p>Representation of sets of entities with multi-valued (that is, bag-valued) attributes sharing a common schema; we call this simple non-first normal form relational (SNF2) data.</p></list-item>
        <list-item><p>Representation of provenance that may include a combination of query, DML, and DDL operations (e.g., a data value derived by a query and then updated by DML); we call this multi-provenance.</p></list-item>
        <list-item><p>The ability to easily access, manipulate, and query the combination of SNF2 data and its multi-provenance using analogs of familiar relational operators.</p></list-item>
        <list-item><p>Operators to express confirmation of, or doubt about, data.</p></list-item>
      </list>
      <p>In this work, we will contribute a new conceptual model that supports:</p>
      <list list-type="bullet">
        <list-item><p>DDL, DML, and query of SNF2 data, including multiple creation of values</p></list-item>
        <list-item><p>Multi-provenance at the dataset, entity, schema, and attribute value levels</p></list-item>
        <list-item><p>Operators for, and provenance of, user expressions of confirmation and doubt about data</p></list-item>
        <list-item><p>Access to provenance via queries, without requiring the user to understand its underlying implementation or write complex recursive queries, and via relational views of data and graphical views of provenance</p></list-item>
      </list>
      <p>We will also contribute a logical model that:</p>
      <list list-type="bullet">
        <list-item><p>Faithfully supports our conceptual model</p></list-item>
        <list-item><p>Economically represents data, its provenance, and its relational structure</p></list-item>
        <list-item><p>Provides polynomial-time-bounded query execution</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Conceptual Model</title>
      <p>In our model, a database slice consists of a finite set of datasets of SNF2 data. The
top of Figure 1 shows an example dataset (dataset D) with one entity and two attributes,
along with other datasets, in a database slice. A dataloaf is a totally ordered set of
database slices, along with a set of external sources. We refer to datasets, entities,
attributes, attribute values, and external sources collectively as components. The initial
database slice in a dataloaf is created when a DDL operation is performed to create its
first dataset. An operator in our conceptual model operates on the most recent, or current,
database slice in a dataloaf, and induces a new result database slice. The result database
slice is a copy of the current database slice, with components modified or augmented as
prescribed by the operation performed. An external source represents a data source
outside a dataloaf. An example dataloaf, with four database slices and one external
source, is shown at the center of Figure 1.</p>
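      <p>The slice-copy semantics just described can be sketched in code. The following is a minimal illustration under our own naming (DatabaseSlice, Dataloaf, apply are our labels, not the paper's API): each operator copies the current slice and modifies the copy, which becomes the new current slice.</p>
      <preformat>
```python
import copy

class DatabaseSlice:
    def __init__(self, datasets=None):
        # dataset name -> {entity id -> {attribute -> bag (list) of values}}
        self.datasets = datasets if datasets is not None else {}

class Dataloaf:
    def __init__(self):
        self.slices = []             # totally ordered set of database slices
        self.external_sources = set()

    def current(self):
        return self.slices[-1]

    def apply(self, operator):
        """Copy the current slice, let the operator modify the copy,
        and append it as the new current (result) slice."""
        if self.slices:
            result = copy.deepcopy(self.current())
        else:
            result = DatabaseSlice()  # initial slice, created by the first DDL op
        operator(result)
        self.slices.append(result)
        return result

loaf = Dataloaf()
loaf.apply(lambda s: s.datasets.update({"A": {}}))   # DDL: create dataset A
loaf.apply(lambda s: s.datasets["A"].update({"e1": {"Color": ["Blue"]}}))
assert len(loaf.slices) == 2
```
      </preformat>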
      <p>Operators induce provenance links from components in the current database slice to
components of similar type derived from them in the result database slice (e.g., from a
dataset to another dataset), or from external sources to components in the result database
slice. Provenance links have attributes, including a type (name of operator or query, or
default, or renew). When a result database slice is created from a source database slice, a
default provenance link is induced from each component in the source database slice to
each corresponding (i.e., copied, but unmodified) component in the result database slice.
An operator provenance link is induced from each component in a source database slice
(or an external source) to its modified copy (in the case of DDL or DML operations), or
to components newly derived from it (in the case of a query or external data insertion) in
the result database slice. We discuss renew provenance links below.</p>
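      <p>The induction of default versus operator provenance links can be sketched as follows; the function name and link encoding here are our own assumptions, not the paper's notation. Unmodified copies receive a default link, while components touched by the operator receive a link typed with the operator's name.</p>
      <preformat>
```python
def induce_links(source_components, result_components, modified, op_name):
    """Return provenance links from source-slice components to their
    result-slice counterparts. `modified` holds component ids the
    operator changed; all other copied components get a default link."""
    links = []
    for cid in source_components:
        if cid in result_components:
            kind = op_name if cid in modified else "default"
            links.append({"from": cid, "to": cid, "type": kind})
    return links

links = induce_links({"D.e1.Color"}, {"D.e1.Color"}, {"D.e1.Color"}, "InsertValue")
assert links == [{"from": "D.e1.Color", "to": "D.e1.Color", "type": "InsertValue"}]
```
      </preformat>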
      <p>The data definition language in our model includes operators for creation of datasets,
their attributes, and external sources, as well as deletion of datasets and attributes. The
data manipulation language includes operators for insertion, deletion, and copying of
whole datasets, individual entities, and individual attribute values, along with expressions
of confirmation and doubt in data values. The query language provides Select, Project,
Join, and Union operators, and includes predicates for selecting data based on its
provenance. The dataloaf in Figure 1 shows an example where the most recent two
operators applied were the insertion of a complete dataset, Dataset B, from an external
source X, followed by a Join of datasets A and B, resulting in dataset D.</p>
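      <p>A provenance-based selection predicate could, for instance, keep components whose provenance path reaches a given external source. The sketch below is our own illustration of such a predicate over a set of provenance links, not the paper's concrete query syntax.</p>
      <preformat>
```python
def derived_from(links, component, source):
    """True if `component` is reachable from `source` over provenance links."""
    frontier = [source]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node == component:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(link["to"] for link in links if link["from"] == node)
    return False

# Links mirroring the Figure 1 example: source X inserted as dataset B,
# then datasets A and B joined into dataset D.
links = [{"from": "X", "to": "B", "type": "Insert Dataset"},
         {"from": "B", "to": "D", "type": "Join"},
         {"from": "A", "to": "D", "type": "Join"}]
assert derived_from(links, "D", "X")
assert not derived_from(links, "C", "X")
```
      </preformat>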
      <p>When components are deleted, they are retained and tagged as expired. Expired
components are copied into result database slices, and connected by default provenance
links to their corresponding current database slice components, though they are not
available for use by operators. Re-insertion via DML of a deleted attribute value results in
a renew provenance link connecting the expired attribute value in a source database slice
to the re-inserted value in a result database slice.</p>
      <fig id="fig-1">
        <caption><p>An example dataloaf: database slices after times t-3, t-2, t-1, and t; datasets A, B, and C; external source "X" inserted as Dataset B; a Join of datasets A and B producing Dataset D (with attributes Color and Size); and, at the bottom, the provenance graph for dataset D, with default, Insert Dataset, and Join provenance links.</p></caption>
      </fig>
      <p>We define a provenance graph view that includes both data and provenance of a
selected component. The graph consists of a set of directed edges representing
provenance links, and a set of vertices representing components. The set of edges is the
subset of the provenance links in the dataloaf that connect to the selected component
either directly or via a path of provenance links. The set of vertices is the set of
components in such paths, including external sources. Figure 1 shows the provenance
graph for dataset D at the bottom.</p>
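      <p>The expire/renew behavior described above can be sketched as follows. The class and function names are our own assumptions for illustration: deletion tags a value as expired rather than removing it, and re-inserting the same value induces a renew provenance link from the expired value to the new one.</p>
      <preformat>
```python
class AttributeValue:
    def __init__(self, value):
        self.value = value
        self.expired = False

def delete_value(bag, value):
    """Tag the first live occurrence of `value` as expired (it is retained,
    but no longer available for use by operators)."""
    for av in bag:
        if av.value == value and not av.expired:
            av.expired = True
            return av
    return None

def reinsert_value(bag, value, links):
    """Re-insert a previously deleted value, inducing a renew provenance
    link from the expired value to the re-inserted one."""
    for av in bag:
        if av.value == value and av.expired:
            renewed = AttributeValue(value)
            bag.append(renewed)
            links.append({"from": id(av), "to": id(renewed), "type": "renew"})
            return renewed
    return None

bag, links = [AttributeValue("Blue")], []
delete_value(bag, "Blue")
reinsert_value(bag, "Blue", links)
assert links[0]["type"] == "renew"
```
      </preformat>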
    </sec>
    <sec id="sec-3">
      <title>Logical Model</title>
      <p>Our conceptual view of databases allows a user to visualize each database slice as a
step in time, with all components either newly derived, or “copied forward” from
identical predecessor components. In our logical model, we eliminate the redundancy
inherent in “copying forward” unchanged components during each application of an
operator. We represent the structure and provenance of all components in all datasets with
a single directed graph called a DPGraph. Components are typed vertices in a DPGraph.
Edges in a DPGraph are either structure edges or action edges. Structure edges represent
the relational structure of data. Action edges denote the derivation of one component
from another by an operator, or an update to a component by an operator. Our logical
model does not include the notion of default provenance links. In our logical model, the
current database is comprised of a set of current vertices, each of which represents the
most recent update to a component. Thus the current database may consist of vertices
derived at different times, yet it includes the most up-to-date derivation of each
component. Operators in our logical model are the same as those in our conceptual
model.</p>
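      <p>A DPGraph and its current database can be sketched as follows; the vertex and edge encodings here are our own reading of the logical model, not a prescribed implementation. A vertex is current when no update action edge leaves it, so the current database collects the most recent derivation of each component.</p>
      <preformat>
```python
class DPGraph:
    def __init__(self):
        self.vertices = {}          # component id -> component type
        self.structure_edges = []   # (parent id, child id): relational structure
        self.action_edges = []      # (from id, to id, operator): derivation/update

    def add_vertex(self, vid, vtype):
        self.vertices[vid] = vtype

    def current_vertices(self):
        """Vertices with no outgoing update action edge, i.e. the most
        recent derivation of each component."""
        superseded = {src for (src, _dst, _op) in self.action_edges}
        return {v for v in self.vertices if v not in superseded}

g = DPGraph()
g.add_vertex("v1", "attribute value")
g.add_vertex("v2", "attribute value")
g.action_edges.append(("v1", "v2", "Update"))   # v2 supersedes v1
assert g.current_vertices() == {"v2"}
```
      </preformat>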
      <p>Sub-graphs of a DPGraph can be defined to represent useful abstractions. One kind
of sub-graph represents the current contents of a dataset, with its schema (attributes) and
instance (entities and data values). We call these current dataset graphs, and note that
they directly model datasets in our conceptual model. A provenance graph is a sub-graph
of the DPGraph that represents the provenance of a component. A provenance graph is
rooted at the component whose provenance it displays, and includes all action edges and
component vertices that comprise the history of the component. Provenance graphs
directly model the provenance graph view from our conceptual model.</p>
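      <p>Extracting a provenance graph from the action edges amounts to a backward traversal from the selected component; the sketch below is our own illustration of that traversal, with assumed names rather than the paper's API.</p>
      <preformat>
```python
def provenance_graph(action_edges, root):
    """Return the action edges and component vertices that comprise
    the history of `root`, by walking action edges backwards."""
    edges, vertices, frontier = [], {root}, [root]
    while frontier:
        node = frontier.pop()
        for (src, dst, op) in action_edges:
            if dst == node and (src, dst, op) not in edges:
                edges.append((src, dst, op))
                vertices.add(src)
                frontier.append(src)
    return edges, vertices

# History mirroring the Figure 1 example.
history = [("X", "B", "Insert Dataset"), ("A", "D", "Join"), ("B", "D", "Join")]
edges, vertices = provenance_graph(history, "D")
assert vertices == {"A", "B", "D", "X"}
```
      </preformat>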
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>The non-first normal form relational model [1] includes DML, relational operators
and selection predicates [2] for attributes with non-atomic values, along with operators
(nest, unnest) to translate to and from first normal form relations. Nest and unnest are not
needed in our setting, but we adopt this model's Join and Select operators.</p>
      <p>The temporal relational model [3] addresses multi-valued data where only a single
value is applicable during a given time period, providing both a kind of non-first normal
form data and a kind of “when” provenance. Our model differs from the temporal
model because of our need to represent multiple values that are valid simultaneously.</p>
      <p>Traditional data integration is realized either by constructing common data storage,
e.g. using an ETL approach, or by providing uniform data access, e.g. a federated
database [4]. These approaches support no provenance, although they may record
workflow logs. Our work does not adopt traditional data integration workflow logs.
Instead, we track provenance of data items individually.</p>
      <p>A lazy provenance system by Cui and Widom [5] computes provenance of a query
result after the fact by computing a reverse query that finds the set of tuples that produced
the result. By constructing provenance links during operator execution, we avoid the
reverse query approach in our work.</p>
      <p>Eager provenance models support provenance at the attribute value [6], tuple [7,8],
or sub-tree [9,10] level. None support provenance at all levels (relation, attribute, tuple,
and data value), making it difficult to query provenance comprehensively. None support
SNF2 data. Eager models store provenance as relational data in additional attributes
[6,7,8], or in adjunct relations [9,10]. Querying such data requires that a user understand
the details of the provenance schema, and write queries to explicitly trace provenance of
components one “generation” at a time. Often this requires the use of recursive query
techniques and complex joins. Some eager models do not support query operators [10],
while others do not support DDL or DML [6,8,9]. Only one model from the literature
tracks both DML and query operators [7], but none support multi-provenance, and none
support expressions of confidence or doubt in data. Only one model supports provenance
for deleted components [7]. Our work adopts and augments both the per-value and
per-tuple provenance approaches. Our model adopts and augments the “DML + query”
provenance approach used in Trio, as well as leveraging the Trio approach of retaining
deleted components.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>This paper presents motivations and expected contributions of our research. At the
workshop, we plan to discuss the current state of our research and highlight areas of the
work where we seek insight. Beyond the scope of the current work, we envision a query
language for limiting aspects of, or the range of, data provenance visible in derived
datasets, as a way of providing provenance information on a “need-to-know” basis. We
also envision forensic mechanisms that would allow a user to revisit former states of a
database and track provenance forward in time to more recent versions. No current
literature on provenance models discusses these concepts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>[1]</label>
        <mixed-citation><string-name><surname>Haskin</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Lorie</surname>, <given-names>R.</given-names></string-name> “<article-title>On Extending the Functions of a Relational Database System</article-title>,” <source>In Proc. of the 1982 ACM SIGMOD International Conference</source>, <year>1982</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>[2]</label>
        <mixed-citation><string-name><surname>Jaeschke</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Schek</surname>, <given-names>H.</given-names></string-name> “<article-title>Remarks on the Algebra of Non First Normal Form Relations</article-title>,” <source>In Proc. of the 1st ACM SIGMOD Symposium on Principles of Database Systems</source>, <year>1982</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>[3]</label>
        <mixed-citation><string-name><surname>Ben-Zvi</surname>, <given-names>J.</given-names></string-name> <article-title>The Time Relational Model</article-title>. <source>PhD thesis</source>, University of California, Los Angeles, <year>1982</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[4]</label>
        <mixed-citation><string-name><surname>Sheth</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Larson</surname>, <given-names>J.</given-names></string-name> “<article-title>Federated database systems for managing distributed, heterogeneous, and autonomous databases</article-title>,” <source>ACM Comput. Surv.</source> <volume>22</volume>(<issue>3</issue>), <year>1990</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[5]</label>
        <mixed-citation><string-name><surname>Cui</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Widom</surname>, <given-names>J.</given-names></string-name> “<article-title>Lineage Tracing for General Data Warehouse Transformations</article-title>,” <source>The VLDB Journal</source> <volume>12</volume>(<issue>1</issue>), <year>2003</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[6]</label>
        <mixed-citation><string-name><surname>Bhagwat</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Chiticariu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Tan</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Vijayvargiya</surname>, <given-names>G.</given-names></string-name> “<article-title>An Annotation Management System for Relational Databases</article-title>,” <source>In Proc. of the 30th Int'l. Conference on Very Large Data Bases</source>, <year>2004</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[7]</label>
        <mixed-citation><string-name><surname>Benjelloun</surname>, <given-names>O.</given-names></string-name>, <string-name><surname>Das Sarma</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Halevy</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Theobald</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Widom</surname>, <given-names>J.</given-names></string-name> “<article-title>Databases with uncertainty and lineage</article-title>,” <source>The VLDB Journal</source> <volume>17</volume>(<issue>2</issue>), <year>2008</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[8]</label>
        <mixed-citation><string-name><surname>Green</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Karvounarakis</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Tannen</surname>, <given-names>V.</given-names></string-name> “<article-title>Provenance Semirings</article-title>,” <source>In Proc. of the 26th ACM SIGMOD-SIGACT Symposium on Principles of Database Systems</source>, <year>2007</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[9]</label>
        <mixed-citation><string-name><surname>Buneman</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Khanna</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Tan</surname>, <given-names>W.</given-names></string-name> “<article-title>Why and Where: A Characterization of Data Provenance</article-title>,” <source>In Proc. of the 8th Int'l. Conference on Database Theory</source>, <year>2001</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[10]</label>
        <mixed-citation><string-name><surname>Buneman</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Chapman</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Cheney</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Vansummeren</surname>, <given-names>S.</given-names></string-name> “<article-title>A Provenance Model for Manually Curated Data</article-title>,” <source>In Proc. of the Int'l. Provenance and Annotation Workshop</source>, <year>2006</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>