<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Data Cleaning Quality using a Data Lineage Facility</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Daniela Florescu Dennis Shasha Propel Courant Institute</institution>
          ,
          <addr-line>NYU</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <fpage>3</fpage>
      <lpage>7</lpage>
      <abstract>
<p>The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge is the design of a data flow graph that effectively generates clean data. A more general difficulty is the lack of explanation of cleaning results and of user-interaction facilities for tuning a data cleaning program. This paper presents a solution to this problem that enables users to express user interactions declaratively and to tune data cleaning programs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The development of Internet services often requires the
integration of heterogeneous sources of data. Often the
sources are unstructured whereas the intended service
requires structured data. The main challenge is to provide
consistent and error-free data (aka clean data).</p>
<p>Funded by “Instituto Superior Técnico” - Technical University of
Lisbon and by a JNICT fellowship of Program PRAXIS XXI (Portugal)</p>
      <sec id="sec-1-1">
        <p>The copyright of this paper belongs to the paper’s authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage.</p>
        <sec id="sec-1-1-2">
          <p>Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW’2001), Interlaken, Switzerland, June 4, 2001 (D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.), http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/</p>
          <p>To illustrate the difficulty of data cleaning applied to
sources of unstructured data, we first introduce a concrete
running example. The Citeseer Web site (see [Ins]) collects
all the bibliographic references in Computer Science that
appear in documents (reports, publications, etc.) available
on the Web in the form of postscript or pdf files. Using
these data, Citeseer enables Web clients to browse through
citations in order to find out, for instance, how many times
a given paper is referenced. The data used to construct the
Citeseer site is a large set of string records. The next two
records belong to this data set:</p>
          <p>[QGMW96] Dallan Quass, Ashish Gupta, Inderphal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996. Available via WWW at www-db.stanford.edu as pub/papers/self-maint.ps</p>
          <p>[12] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintanable for data warehousing, PDIS’95</p>
          <p>Establishing that these are the same paper is a challenge.
First, there is no universal record key that could
establish their identity. Second, there are several syntactic and
formatting differences between the records. Authors are
written in different formats (e.g. “Dallan Quass” and “D.
Quass”), and the name of the conference appears
abbreviated (“PDIS”) or in full text (“Conference on Parallel ...”).
Third, data can be inconsistent, such as years of publication
(“1996” and “1995”). Fourth, data can be erroneous due to
misspelling or errors introduced during the automatic
processing of postscript or pdf files, as in the title of the second
record (“maintanable” instead of “maintainable”). Finally,
records may hold different information, e.g., city and
country are missing in the second record.</p>
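<p>To make the difficulty concrete, the following sketch (our illustration, not code from the paper) compares the two titles: exact string equality fails, while an edit distance over normalized strings exposes the near-duplicate.</p>
<p>
```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

t1 = "Making Views Self-Maintainable for Data Warehousing"
t2 = "Making views self-maintanable for data warehousing"

print(t1 == t2)                               # exact match fails: False
print(edit_distance(t1.lower(), t2.lower()))  # after lowercasing, distance is 1
```
</p>
<p>A small distance relative to the string length is a strong hint that the two records denote the same paper, even though no record key exists.</p>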
          <p>The problem of data cleaning is well known for decision
support systems and data warehouses (see [CD97]).
Extraction, Transformation and Loading (ETL) tools and data
reengineering tools provide powerful software platforms to
implement a large data transformation chain, which can
extract data flows from arbitrary data sources and
progressively combine these flows through a variety of data
transformation operations until clean and appropriately
formatted data flows are obtained [RD00]. The resulting data can
then be loaded into some database. Some tools are
comprehensive but offer only limited cleaning functionality (e.g.,
Sagent [Sag], ETI [Int], Informatica [Inf]), while others are
dedicated to data cleaning (e.g., Integrity [Val]).</p>
          <p>It is important to realize that the more “dirty” the data
are, the more difficult it is to automate their cleaning with a
fixed set of transformations. In the Citeseer example, when
the years of publication are different in two records that
apparently refer to the same publication, there is no obvious
criterion for deciding which date to use; hence the user must be
explicitly consulted. In existing tools, there is no specific
support for user consultation except to write the data to a
specific file to be later analyzed by the user. In this case,
the integration of that data, after correction, into the data
cleaning program is not properly handled. Finally, in
existing tools, the process of data cleaning is unidirectional in
the sense that once the operators are executed, the only way
to analyze what was done is to inspect log files. This is an
impediment to the stepwise refinement of a data cleaning
program.</p>
          <p>This paper presents a mechanism based on exceptions
offered by a data cleaning framework to support the
refinement of data cleaning criteria. The contributions we will
emphasize are:</p>
          <p>The explicit specification of user interaction based on
exceptions that are automatically generated by the
execution of data transformation operations.</p>
          <p>A data lineage facility that enables the user to
interactively inspect intermediate data and exceptions;
backtrack the cleaning process in the graph of data
transformations; investigate the result of each data
transformation operation; and modify the input/output of a
data transformation interactively.</p>
          <p>We illustrate these contributions with a data cleaning
application for the Citeseer set of bibliographic textual
references. The paper is organized as follows. Section 2 gives
an overview of the proposed framework. Section 3 presents
the way exceptions are specified and generated. The fourth
section explains the data lineage mechanism. Section 5
shows some results and Section 6 concludes.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Data Cleaning Framework</title>
      <p>The proposed framework supports the specification of a
data cleaning application as a graph of high-level
transformations. Suppose we wish to migrate the Citeseer data set
(which is a set of strings corresponding to textual
bibliographic references) into four sets of structured and clean
data, modeled as database relations: Authors, identified by
a key and a name; Events, identified by a key and a name;
Publications, identified by a key, a title, a year, an event
key, a volume, etc; and the correspondence between
publications and authors, Publications-Authors, identified by a
publication key and an author key.</p>
      <p>A partial and high-level view of the data cleaning
strategy that we used is the following:
1. Add a key to every input record.
2. Extract from each input record, and output into four
different flows the information relative to: names of authors, titles
of publications, names of events and the association between
titles and authors.
3. Extract from each input record, and output into a
publication data flow the information relative to the volume,
number, country, city, pages, year and url of each publication.</p>
      <p>Use auxiliary dictionaries for extracting city and country
from each bibliographic reference. These dictionaries store
the correspondences between standard city/country names
and their synonyms that can be recognized.
4. Eliminate duplicates from the flows of author names, titles
and events.
5. Aggregate the duplicate-free flow of titles with the flow of
publications.</p>
      <p>The definition of each transformation consists of
determining “quality” heuristics that automatically lead to the
best accuracy (level of cleaning quality) of the results. For
considerable amounts of data with a reasonable degree of
dirtiness, it is usually impossible to apply the automatic
criteria to every record of data. To illustrate this problem,
consider the first Extract operation (step 2) in the strategy
above. The separation between the author list and the
title is assumed to be marked by one of two punctuation
marks: “;” or “.”. However, some citations use a comma between
these two fields, so it is not clear where
the author list finishes and the title starts. Another
example concerns step 4. The two titles presented in the
motivating example (starting by “Making Views...”) are
considered duplicates and need to be merged into a single title
(the correctly written instance in this case). Suppose the
consolidation phase used an automatic criterion that chooses
the longest title among duplicates. Then it could not be
decided which is the correct one among these two. In such
situations, it is important to define interaction points in the
cleaning process that indicate operations which are not
executed automatically and their corresponding input records.</p>
      <p>The user can then analyze these records and execute some
interactive procedure to handle them.</p>
      <p>Our framework [GFS+01b] permits us to model a data
cleaning program satisfying the strategy above as a data
flow graph where nodes are logical operators of the
following types: mapping, view, matching, clustering, and
merging. The input and output data flows of operators are
logically modeled as database relations.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Management of Exceptions</title>
      <p>This section describes how the generation of exceptions can
be explicitly specified within the data cleaning operators.
The semantics and syntax of each logical operator are
presented by example. A formal description of our declarative
data cleaning language and the BNF grammar for its syntax
can be found in [GFS+01a]. Exceptional records, once corrected,
are re-integrated into the data flow graph, and the logical
operations that take them as input can then be re-executed. This
functionality proved to be essential in our experiments with
Citeseer reported in Section 5.</p>
      <p>Recall that the input and output data flows of operators are
logically modeled as database relations. More specifically,
the mapping operator takes a single relation and produces
several relations as output; it can express any one-to-many
data mapping due to its use of external functions. The view
operator, which essentially corresponds to an SQL select
statement, takes several relations as input and returns a
single relation as output; it can express limited many-to-one
mappings. The matching operator takes two input relations
and produces a single output relation; it computes the
similarity between any two input records using an arbitrary
distance function. The clustering operator transforms the
result of a matching operation into a nested relation where
each nested set is a cluster of records from the input
relation; the clustering of records is arbitrary. Finally, the
merging operator takes a relation representing a set of
clusters and applies an arbitrary data mapping to the elements of
each cluster.</p>
      <p>Example 2.1: The above data cleaning strategy is mapped into
the data flow graph of Figure 1. The numbering beside each data
cleaning operation corresponds to a step in the strategy. For each
output data flow of Step 2, the duplicate elimination is mapped
into a sequence of three operations of matching, clustering, and
merging. Every other step is mapped into a single operator.</p>
      <p>Each logical operator can make use of externally defined
functions that implement domain-specific treatments such
as the normalization of strings, the extraction of substrings
from a string, the computation of the distance between two
values, etc. For each input, its execution creates one or
more regular output data flows and possibly one
exceptional data flow. Regular data flows contain records
automatically cleaned using the criteria specified in the
operator. However, the processing of the input records may
throw an exception specified by the operator, thereby
entailing the insertion of the corresponding input record into
an exceptional output data flow. The names of all the
exceptions that were raised for each input record are attached
to each exceptional record.</p>
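<p>The routing of records into regular and exceptional flows can be sketched as follows (our Python illustration; the operator and exception names are hypothetical stand-ins, not the system’s implementation).</p>
<p>
```python
def run_operator(records, transform):
    """Apply a cleaning transform to each record; records whose processing
    raises are routed to an exceptional flow, tagged with the exception name."""
    regular, exceptional = [], []
    for rec in records:
        try:
            regular.append(transform(rec))
        except Exception as exc:
            exceptional.append({"record": rec, "exceptions": [type(exc).__name__]})
    return regular, exceptional

class NoTitleExtracted(Exception):
    pass

def extract_title(citation):
    # Hypothetical extractor: the title is assumed to follow a ";".
    if ";" not in citation:
        raise NoTitleExtracted(citation)
    return citation.split(";", 1)[1].strip()

ok, bad = run_operator(
    ["A. Author; A Fine Title", "B. Author, no separator at all"],
    extract_title,
)
```
</p>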
      <p>At any stage of execution of a data cleaning program,
a data lineage mechanism enables users to browse into
exceptions, analyze their provenance in the data flow graph,
and interactively apply some corrections. As a first user
action, the cleaning criteria can be corrected. For instance,
the aforementioned criterion used to separate author names
and titles could be modified in order to include the
following heuristic: “if a word in lower case is recognized, it is
assumed the title has started just after the punctuation mark
that precedes it” (in the example above, the lower case word
is “views” and the punctuation mark that precedes it is the
“,” before “Making”, so the beginning of the title would be
correctly recognized). The corrected logical operation is
then re-executed. As a second user action, the data that led
to the generation of exceptions may be interactively
corrected. For example, the user may interactively separate a
list of authors from a title that do not have any
punctuation mark between them. The corrected data can then be
re-integrated into the data flow graph.</p>
      <p>The following mapping operator transforms the
relation DirtyData{paper} into a “target” relation
KeyDirtyData{paperKey, paper}; this corresponds to Step 1 of
Example 2.1.</p>
      <p>CREATE MAPPING AddKeytoDirtyData
FROM DirtyData
LET Key = generateKey(DirtyData.paper)
{ SELECT Key.generateKey AS paperKey,
DirtyData.paper AS paper INTO KeyDirtyData }</p>
      <p>The create clause indicates the name of the operation.
The from clause is a standard SQL from-clause that
specifies the name of the input relation of the mapping operator.
Then, the let keyword introduces a let-clause as a sequence
of one or more assignment statements.</p>
      <p>In each assignment statement, a relation can be
assigned a functional-expression, which is an expression that
involves the invocation of one or more external functions
(that have been registered in the library of functions of the
system). If the functional-expression returns a value, it
is named an atomic assignment statement. The let-clause in
the example contains an atomic assignment statement that
constructs a relation Key using an external (atomic)
function generateKey that takes as argument a variable
DirtyData.paper ranging over attribute paper of DirtyData.
Relation Key is constructed as follows. For every tuple
DirtyData(a) in DirtyData, if generateKey(a) does not
return an exception value exc, then a tuple Key(a,
generateKey(a)) is added to relation Key. Otherwise, a tuple
DirtyDataexc(a) is added to relation DirtyDataexc. We shall
say that this statement “defines” a relation Key{paper,
generateKey}.</p>
      <p>Finally, the schema of the target relation is specified by
the “{ SELECT Key.generateKey AS ... }” clause. It
indicates that the schema of KeyDirtyData is built using the
attributes of Key and DirtyData.</p>
      <p>The mapping command below transforms
KeyDirtyData defined above into four target relations, whose
schemas are specified elsewhere in the sections of the data
cleaning program that declare the externally defined
functions. The schemas of the relations returned by table
functions extractAuthorTitleEvent and extractAuthors are
{authorlist, title, event} and {id, name} respectively.
This operation corresponds to Step 2 in Figure 1.
(Footnote 1: where a is a string representing a paper.
Footnote 2: for convenience, we shall assume that the name of the
attribute holding the result of the function is the same as the
name of the function.)</p>
      <p>[Figure 1: the data flow graph of the data cleaning program.
The recoverable labels show the flows DirtyTitles, DirtyAuthors,
DirtyPubs, Countries, Cities, Titles and Authors, with Matching,
Clustering and Merging operators applied to them.]</p>
      <sec id="sec-3-7">
        <title>Mapping</title>
        <p>The assignment statements in the let-clause are named
table assignment statements since the corresponding
functional-expressions return tables. A relation can also be
assigned the result of an SQL select from where
expression that can make use of previously defined relations.</p>
        <p>The where keyword introduces a filter expressed as a
conjunctive normal form in a syntax similar to an SQL
where-clause.</p>
        <p>The syntax of the output-clause consists of one or more
select into expressions that specify the schema of each
target relation, and the constraints associated with each target
relation. Constraints can be of the following kinds: not
null, unique, foreign key and check. Their syntax is the
same as SQL assertions, but their meaning is different due
to the management of exceptions when constraints are
violated. If one of the output constraints is violated, the
execution of the mapping does not stop and the corresponding
input tuple is added to the relation Extractionexc.</p>
        <p>We now explain the semantics of the mapping
operator. The assignment statements that compose a let-clause
are evaluated in their order of appearance in the let-clause
for each tuple of the input relation. The syntax of an
assignment statement also allows assigning a relation using
an if-then-else control structure, in order to expose within
the operator the logic of the assignment. If the evaluation
of the let-clause does not throw any exception, the filter
specified by the where-clause is checked. If it returns true
and the output constraints are satisfied, the corresponding
tuples are added to the regular output relations of the
mapping.</p>
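<p>A minimal sketch of this evaluation order (our illustration; the representation of clauses as Python callables is an assumption, not the system’s implementation):</p>
<p>
```python
def run_mapping(rows, let_clause, where, constraints):
    """Evaluate the let-assignments in order for each input tuple, apply the
    where filter, then check output constraints; tuples whose processing
    throws, or that violate a constraint, go to the exception table."""
    output, exc_table = [], []
    for row in rows:
        try:
            env = dict(row)
            for name, fn in let_clause:       # evaluated in order of appearance
                env[name] = fn(env)
            if not where(env):
                continue                      # filtered out: not an exception
            if all(check(env) for check in constraints):
                output.append(env)
            else:
                exc_table.append(row)         # implicit exception (constraint violated)
        except Exception:
            exc_table.append(row)             # non-anticipated exception
    return output, exc_table

rows = [{"raw_year": "1996"}, {"raw_year": "19x6"}]
let_clause = [("year", lambda env: int(env["raw_year"]))]
output, exceptions = run_mapping(rows, let_clause,
                                 where=lambda env: True,
                                 constraints=[lambda env: env["year"] > 0])
```
</p>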
        <p>Exceptions may arise during the evaluation of the
let-clause or when output constraints are violated. The
exceptions can be classified as non-anticipated or anticipated,
depending on the place where they occur. Non-anticipated
exceptions are not explicitly declared. They are thrown by
the external functions called within the let-clause. The
implementation of the let-clause is able to handle any
exception thrown within it. If the evaluation of the let-clause
for an input relation S returns an exception value, then the
corresponding input tuple is added to a special output
table named Sexc.</p>
        <sec id="sec-3-7-1">
          <title>Matching Operator</title>
          <p>A relation containing the tuples of DirtyAuthors that did not
match (called DirtyAuthorsno_match) must be returned by
the operator.</p>
          <p>The matching operator cannot generate implicit
exceptions since there is no explicit output/constraint-clauses.
Non-anticipated or explicit exceptions may be thrown in
the let-clause. Nevertheless, the general application of a
matching operator is required to be completely automatic
and no user interaction is usually required.
The (self-)matching operator that follows takes as input
the relation DirtyAuthorsfauthorKey, nameg twice. Its
intention is to find possible duplicates within
DirtyAuthors. The let-clause has the same meaning as before
with the additional constraint that it must define a
relation, named distance, within an atomic function
assignment. Here, distance is defined using an atomic
function editDistanceAuthors computing an integer distance
value between two author names. The let-clause produces
a relation distancefauthorKey1, name1, authorKey2,
name2, editDistanceAuthorsg that contains one tuple for
every possible pair of tuples taken from DirtyAuthors. The
WHERE clause filters out the tuples of distance for which
editDistanceAuthors returned a value greater than a value
computed (by maxDist) as 15% of the maximal length of
the names compared. Finally, the INTO clause specifies the
name of the target relation (here, MatchAuthors) whose
schema is the same as distance.</p>
          <p>CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors a1, DirtyAuthors a2
LET distance = editDistanceAuthors(a1.name, a2.name)
WHERE distance &lt; maxDist(a1.name, a2.name, 15)
INTO MatchAuthors</p>
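<p>The effect of the maxDist threshold can be sketched in Python (our illustration; editDistanceAuthors is approximated here by a plain Levenshtein distance):</p>
<p>
```python
def edit_distance(a, b):
    # Levenshtein distance (dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def max_dist(a, b, percent):
    # Threshold: `percent`% of the maximal length of the two names compared.
    return max(len(a), len(b)) * percent / 100.0

def self_match(names, percent=15):
    """Self-matching: keep every pair of names whose edit distance stays
    within the maxDist threshold."""
    matches = []
    for i, a1 in enumerate(names):
        for a2 in names[i + 1:]:
            if max_dist(a1, a2, percent) >= edit_distance(a1, a2):
                matches.append((a1, a2))
    return matches
```
</p>
<p>With a 15% threshold, a one-character difference between two 12-character names counts as a match, while unrelated names are filtered out.</p>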
        </sec>
        <sec id="sec-3-7-2">
          <title>Clustering, Merging, and View Operators</title>
          <p>The external functions extractAuthorTitleEvent() or extractAuthors()
called in the Extraction mapping presented may generate
non-anticipated exceptions.</p>
          <p>Consider the second citation given as example in the
introduction. As suggested before, extractAuthorTitleEvent()
may not be able to recognize the end of the list of authors
since there are only commas to separate the different
elements of the citation. Therefore, an exception is thrown by
this function notifying that no title was extracted.</p>
          <p>Anticipated exceptions are declared and may be either
explicit or implicit. Explicit exceptions arise when a throw
clause is specified in the let-clause. There is the possibility
to explicitly throw an exception, introduced by the throw
keyword, in an if-then-else expression. Implicit exceptions
occur when some output constraint (constraint keyword) is
violated. An implicit exception is thrown in the Extraction
operation whenever one of the output constraints is violated
(e.g. a null event is extracted from one citation).</p>
          <p>Next, we briefly present the matching, clustering and
merging operations that model the duplicate elimination of
author names represented by step 4, and the view operation
that models step 5 in Figure 1. The semantics, syntax and
exceptions for each operator are presented by example, and
differences with respect to the mapping operator are
highlighted.</p>
          <p>The following is a specification of the clustering
operation by transitive closure; the schema of the target
relation clusterAuthors is
{cluster_id, DirtyAuthors_key1, DirtyAuthors_key2}.</p>
          <p>CREATE CLUSTERING clusterAuthorsByTransitiveClosure
FROM MatchAuthors
BY METHOD transitive closure
INTO clusterAuthors</p>
          <p>The clustering operator does not generate any
exceptions. Its semantics consists of applying an automatic
clustering method to the bi-dimensional space generated by a
matching operation.</p>
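<p>Clustering by transitive closure amounts to computing connected components over the matching pairs, for example with union-find (our sketch, not the system’s implementation):</p>
<p>
```python
def cluster_by_transitive_closure(match_pairs):
    """Union-find over matching pairs: each connected component is a cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in match_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for key in list(parent):
        clusters.setdefault(find(key), set()).add(key)
    return [sorted(c) for c in clusters.values()]
```
</p>
<p>For instance, the matching pairs (k1, k2), (k2, k3) and (k5, k6) yield the clusters {k1, k2, k3} and {k5, k6}.</p>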
          <p>The syntax for the components of the operator has
already been presented before. The only subtlety in the
SQL-like syntax of the matching operator is the use of the
symbol “+” after an input predicate in the from clause (it would
be “a1 +” in the above example). It indicates that a
relation containing all the tuples of DirtyAuthors that did not
match must be returned.</p>
          <p>Consider the clusterAuthors relation obtained in the
previous example. Each cluster contains a set of names. For
each cluster, a possible merging strategy is to generate a
tuple composed of a key value, e.g., generated using a
generateKey function, and a name obtained by taking the longest
author name among all the author names belonging to the
same cluster. Thus, the format of the output relation of
the merging operation would be a relation, say Authors, of
schema fauthorKey, nameg.</p>
          <p>The using clause is similar to a from clause with the
following differences. A let clause is always defined with respect to the
relation(s) indicated in the from clause. In the case of merging,
the let clause is defined with respect to the relation and attributes of
that relation indicated in the using clause. In this case we
are interested in merging each cluster into a single tuple, so
the clusterAuthors relation and the cluster_id attribute are
specified in the using clause. Essentially, each assignment
statement is evaluated by iterating over the clusters of the
input relation.</p>
          <p>The let clause is used to construct the attribute values
that will compose each tuple over the target relation. A
specific notation is introduced to ease the access to the
attribute values of the elements of a cluster, which are
identifiers. Suppose that the identifier attributes of the input
relation, say P, are associated with relations S1, S2 (inputs
of the matching). Let A be an attribute of S1. Then, if p is
a variable ranging over the attribute domain clust_id of P,
the expression S1(p).A refers to the set of tuples {x.A | x
is a tuple over S1 and the identifier of x belongs to cluster
p}.</p>
          <p>CREATE VIEW viewPublications
FROM DirtyPubs p, Titles t
WHERE p.pubKey = t.pubKey
{ SELECT p.pubkey AS pubKey, t.title AS title,
t.eventKey AS eventKey, p.volume AS volume,
p.number AS number, p.country AS country, p.city AS city,
p.pages AS pages, p.year AS year,
p.url AS url INTO Publications
CONSTRAINT NOT NULL title }</p>
          <p>The interpretation of this clause is first to compute the result
of the SQL select statement formed from the select-into
clause, the from clause, and the where clause. Then, the
set of constraints is evaluated against this result. If a
constraint is violated, exceptions are generated.</p>
          <p>The view operation above specifies an SQL join that
aggregates the information resulting from the extraction
of volume, number, year, etc. of each citation with the
corresponding titles and event information free of duplicates.</p>
          <p>The specification that follows describes the merging
operation introduced intuitively above.</p>
          <p>CREATE MERGING MergeAuthors
USING clusterAuthors(cluster_id) ca
LET name = getLongestAuthorName(DirtyAuthors(ca).name)
    key = generateKey()
{ SELECT key AS authorKey,
name AS name INTO Authors }</p>
          <p>In this example, ca is a variable ranging over the
clust_id attribute of clusterAuthors. Therefore, the expression
DirtyAuthors(ca).name refers to the set of author names
associated with all the DirtyAuthors identifiers of cluster
ca. This set is passed to the function
getLongestAuthorName, which throws an exception if there is more than one
author name with maximum length belonging to the same
cluster. This is a non-anticipated exception, and the input tuple
corresponding to the cluster that caused it is added to the
relation MergeAuthorsexc. In general, the merging operator
can also throw explicit and implicit exceptions, if a
throw-clause is specified in the let-clause or a constraint clause
is indicated, respectively.</p>
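<p>The behaviour of getLongestAuthorName and the routing of ambiguous clusters to MergeAuthorsexc can be sketched as follows (our Python illustration; the helper names are hypothetical stand-ins):</p>
<p>
```python
class AmbiguousLongestName(Exception):
    """Raised when several names of maximal length exist in one cluster."""

def get_longest_author_name(names):
    longest = max(names, key=len)
    if sum(1 for n in names if len(n) == len(longest)) > 1:
        raise AmbiguousLongestName(names)   # cannot decide automatically
    return longest

def merge_clusters(clusters, generate_key):
    merged, exceptions = [], []
    for cluster in clusters:
        try:
            merged.append({"authorKey": generate_key(),
                           "name": get_longest_author_name(cluster)})
        except AmbiguousLongestName:
            exceptions.append(cluster)      # routed to MergeAuthorsexc
    return merged, exceptions
```
</p>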
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Data Lineage Facility</title>
      <p>In this section, we present the functionality supplied to the
user for correcting exceptions and the methodology to
effectively interact during the execution of the cleaning
program. Furthermore, we present algorithms for the
incremental execution of a cleaning program. They ensure the
consistent integration of interactively modified data into the
data flow of transformations.</p>
      <sec id="sec-4-1">
        <title>4.1 Data lineage definition</title>
        <p>We first introduce a useful definition of tuple data lineage
taken from [CW00].</p>
        <p>This definition says that the lineage tuple sets, given by
the Ii*'s, derive exactly tuple t, and that each tuple in the
lineage sets does in fact contribute to t. In our framework,
the input relations of an operator are those specified in the
from-clause. Relations that are used in the let-clause,
consumed by external functions called within the let-clause,
or by clustering algorithms are named external input
relations and are not considered for data lineage. Given the
above definition, the following two propositions apply to
our cleaning operators.</p>
        <p>Proposition 4.1: If Op is a matching, mapping, clustering,
or an SPJ (Select-Project-Join) view, the lineage of an output
tuple t in an input relation I can be obtained by propagating
record identifiers: it is given by
π(I.*)[σ(I.Id1=O.Id1, ..., I.Idk=O.Idk)(I)], where Id1, ..., Idk
are the identifier attributes of I that are propagated to the
output relation O.</p>
        <p>Proposition 4.2: If Op is a merging, its input relation I
is the result of a group-by operation on attributes A1, ..., Ap.
Thus, it is sufficient to keep, for every tuple in the output
of a merging, the identifier of the partition from which it is
generated. This identifier is given by a tuple of values over
attributes A1, ..., Ap. In the merging operator MergeAuthors
specified in Section 3, which is applied after a clustering
operation, attributes A1, ..., Ap correspond to the clust_id
attribute.</p>
        <p>This technique does not work in the case of an
aggregate-select-project-join (ASPJ) view, because the set
of tuples of an input relation that contribute to a given
output tuple t cannot be summarized by a compound identifier.</p>
        <p>However, additional techniques such as keeping auxiliary
views could be used to support the tuple lineage of ASPJ
views, as suggested in [CW00].</p>
        <p>We shall say that the mapping, matching, clustering,
merging, and SPJ views are traceable, which means that
a tuple lineage for these operators can be supported by
propagating the record identifiers through operators. For
each traceable operator Op, we can define a lineage
tracing query that returns the lineage of each output tuple t
according to the operator Op: for every tuple t ∈ Oi =
Op(I1, ..., Im), we keep the identifiers of the tuples t1, ..., tm
that are such that Op⁻¹(t) = &lt;t1, ..., tm&gt;. These identifiers
permit us to obtain the lineage of any output tuple. For instance,
in the Extraction mapping specified in Section 3, the schema of
the output relations DirtyAuthors, DirtyEvents, etc. contains the
identifier of the input relation (paperKey).</p>
        <p>Our data lineage mechanism builds on this proposition.</p>
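<p>Identifier propagation makes the lineage tracing query a simple selection (our Python sketch of the idea; the relation contents are illustrative):</p>
<p>
```python
def trace_lineage(output_tuple, input_relation, id_attrs):
    """Lineage tracing query for a traceable operator: select the input
    tuples whose propagated identifier attributes match the output tuple."""
    return [t for t in input_relation
            if all(t[a] == output_tuple[a] for a in id_attrs)]

dirty_data = [{"paperKey": "p1", "paper": "citation one"},
              {"paperKey": "p2", "paper": "citation two"}]
# Output of the Extraction mapping keeps the input identifier (paperKey).
dirty_authors = [{"paperKey": "p1", "name": "D. Quass"}]

lineage = trace_lineage(dirty_authors[0], dirty_data, ["paperKey"])
```
</p>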
        <p>In the case of a merging operator, the lineage of an output
tuple is given by π(I.*)[σ(I.A1=O.A1, ..., I.Ap=O.Ap)(I)]. The
lineage of a tuple t according to a program p is
&lt;R1*, ..., Rn*&gt;, where R1*, ..., Rn* are subsets of
R1, ..., Rn, such that t* ∈ Ri* iff t* contributes to t
according to p, for i = 1, ..., n. In that case, Ri* is the
lineage of t in Ri according to p and is denoted by p⁻¹Ri(t).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Tuning data cleaning programs</title>
        <p>During the execution of a data cleaning program, the data
lineage facility offers the following functionality: (i) the
user may inspect the set of exceptional tuples, (ii) backtrack
in the data flow graph and discover how the exceptional
tuples were generated, and (iii) modify attribute values of
tuples, insert or delete a tuple of any relation of the data
flow graph in order to remedy the exceptions.</p>
        <p>The use of this functionality permits tuning a data
cleaning program. We describe the method for tuning a program
through a sequence of steps modeled by the state-transition
diagram of Figure 3. This diagram represents the data
cleaning process, which corresponds to the execution of the
whole cleaning program specified or of any
(sub-)program that composes it. A node represents a state of
the data cleaning process. Arrows correspond to transitions
with a label of the form event[condition].action, which
indicates that the action is performed whenever the event occurs
and the condition is satisfied.</p>
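        <p>The tuning loop described by Figure 3 can be sketched as a small driver (state names are paraphrased from the text; the Demo class and its methods are illustrative assumptions, not the paper's interface):

```python
# Hedged sketch of the state-transition diagram of Figure 3: run the program,
# inspect exceptions, refine code/dictionaries or modify data, then re-run.

def cleaning_process(program):
    """Drive one cleaning program until no exceptions remain."""
    trace = ["initial"]
    program.run()
    trace.append("transformations executed")
    while program.has_exceptions():
        program.inspect()                  # backtrack lineage of exceptions
        trace.append("data inspected")
        if program.refinement_useful():
            program.refine()               # fix operator code / dictionaries
            trace.append("code/dictionaries refined")
        else:
            program.modify()               # interactive tuple modification
            trace.append("data modified")
        program.run()                      # re-execute affected operators
        trace.append("transformations executed")
    trace.append("final")
    return trace

class Demo:
    """Toy program that raises exceptions once, then succeeds after refinement."""
    def __init__(self):
        self.exceptions = 1
    def run(self): pass
    def inspect(self): pass
    def refinement_useful(self): return True
    def refine(self): self.exceptions = 0
    def modify(self): self.exceptions = 0
    def has_exceptions(self): return self.exceptions > 0

states = cleaning_process(Demo())
```
</p>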
        <p>The run action corresponds to the execution of any data
cleaning program. When the execution is finished, the
process is in the transformations executed state. Three
situations may then arise.</p>
        <p>We have defined the notion of the lineage of a tuple
according to a single traceable operator. Now, we introduce the
definition of the lineage of a tuple according to a sequence of
traceable operators, i.e. a traceable data cleaning program.
Again, our definition is adapted from [CW00].
We are now able to describe the algorithm, given in Figure
2, that computes the lineage of any relation O according to a
traceable cleaning program p.</p>
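        <p>A minimal sketch of lineage tracing through a sequence of traceable operators, in the spirit of the algorithm of Figure 2 (relation and attribute names are hypothetical): the lineage of a program is obtained by applying the per-operator lineage query stage by stage, from the last operator back to the original input.

```python
# Sketch: lineage according to a sequence of traceable operators, obtained by
# composing per-operator lineage queries (adapted from the idea in [CW00]).

def op_lineage(out_tuple, input_relation, id_attr):
    """Lineage according to a single operator that propagated `id_attr`."""
    return [t for t in input_relation if t["Id"] == out_tuple[id_attr]]

def program_lineage(out_tuple, stages):
    """stages: list of (input_relation, id_attr), last operator first.
    Walk the pipeline backwards, tracing tuples one operator at a time."""
    frontier = [out_tuple]
    for input_relation, id_attr in stages:
        nxt = []
        for t in frontier:
            nxt.extend(op_lineage(t, input_relation, id_attr))
        frontier = nxt
    return frontier

dirty_data = [{"Id": 12, "citation": "..."}]
dirty_authors = [{"Id": 1, "paperKey": 12, "name": "D. Quass"}]
author = {"Id": 7, "authorKey": 1, "name": "Dallan Quass"}
src = program_lineage(author, [(dirty_authors, "authorKey"),
                               (dirty_data, "paperKey")])
```
</p>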
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Incremental execution</title>
        <p>In addition to the methodology described in Section 4.2,
the inspection of exceptions should obey the following
principle in order to prevent the user from doing redundant data
analysis and correction: the order in which the user
analyzes and corrects exceptions should always be the order
determined by the graph of transformations that models the
data cleaning program. This means that exceptions of an
operator whose regular output relation(s) contribute to the
output tuples of other operators should always be analyzed
and corrected before the exceptional tuples occurring for
those operators. Considering the Citeseer example and the
sequence of logical operations represented in Figure 1, the
correction of exceptions thrown during the mappings that
implement extractions (steps 2 and 3 of the strategy) should
precede the correction of exceptions occurring during the
merging operation that composes duplicate eliminations in
step 4.</p>
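        <p>This ordering principle amounts to processing operators in a topological order of the data flow graph. A sketch under that assumption (operator names follow the Citeseer example; the graph shape is illustrative):

```python
# Sketch: a topological sort of the operator graph (edges point from
# producer to consumer) gives a valid exception-correction order.

def correction_order(graph):
    """Kahn's algorithm; graph maps operator -> list of downstream operators."""
    indegree = {op: 0 for op in graph}
    for targets in graph.values():
        for t in targets:
            indegree[t] = indegree.get(t, 0) + 1
    ready = sorted(op for op in indegree if indegree[op] == 0)
    order = []
    while ready:
        op = ready.pop(0)
        order.append(op)
        for t in graph.get(op, []):
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    return order

graph = {
    "ExtractAuthors": ["MergeAuthors"],   # step 2 feeds duplicate elimination
    "ExtractEvents": ["MergeEvents"],     # step 3 feeds duplicate elimination
    "MergeAuthors": [],
    "MergeEvents": [],
}
order = correction_order(graph)
```
</p>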
        <p>All cleaning programs considered in the rest of the paper
are assumed to be traceable.</p>
        <p>First, if no exceptions were thrown during the execution
and no more data transformations need to be executed, the
cleaning process halts and reaches its final state. Second, if
no exceptions were thrown and there are other transformations
to be executed, their execution is triggered by the run action.
The third situation corresponds to the occurrence of exceptions
during the execution of the cleaning program and is handled as
follows.</p>
        <p>When exceptional tuples exist, the user is allowed to
inspect the exceptional and regular output relations that were
generated. In order to discover the tuples that contributed to
any exceptional tuple, according to Definition 4.1, the user
may obtain the lineage of an exceptional tuple using the
lineage algorithm presented in Figure 2. In this case, we
say the user backtracks an exceptional tuple. Once the
inspection of exceptions is finished, represented by the data
inspected state in the diagram, the user can choose
between two procedures to correct the exceptional situations.</p>
        <p>The first procedure consists in refining the logic of
operators that were wrongly specified: rewriting incorrect
code of external functions (for example, refining
extractAuthorTitleEvent() to properly separate authors from title
in the citation “D. Quass, A. Gupta, I. Mumick, J. Widom,
Making views ...”) or clustering algorithms, or adding entries
to auxiliary dictionaries that are incomplete. When the
needed refinement is concluded (code/dictionaries refined
state), the modified operators must be re-executed, as well
as those operators in the data cleaning program whose input
relations are affected by the output of the refined operators.</p>
        <p>The run action is thus re-triggered for the data cleaning
program that encloses the sub-graph constituted by those
data transformation operators.</p>
        <p>The second procedure for correcting exceptional tuples
takes place when exceptions have been thrown during the
execution of a cleaning program and no refining action can
help to correct them. This situation occurs when the
process is in the data inspected state and refinement is not
useful. The interactive data modification is then
triggered (modify action). In this phase, the user may update
any relation generated by the cleaning program in order to
disambiguate an exceptional situation that cannot be
automated (for example, decide the correct title among the
two equally sized and similar titles in our motivating
example). The user modifications can be the insertion, deletion,
or update of a tuple. Each interactively modified tuple
usually contributes, according to Definition 4.1, to one or
more tuples of arbitrary output relations in the graph of data
transformations that composes the cleaning program. The
operators that produce those output relations must then be
re-executed, and the run action is thus re-triggered. As we
will detail next, the re-execution of data cleaning operators
after interactive data modifications can, and sometimes
should, be performed in incremental mode.</p>
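        <p>Determining which operators to re-execute after an interactive modification is a reachability computation over the data flow graph. A hedged sketch (the consumers/producers_of encoding and the relation names are assumptions, not the paper's data structures):

```python
# Sketch: after an interactive modification of a relation, every operator
# whose input is (transitively) affected must be re-executed.

def affected_operators(consumers, producers_of, modified_relation):
    """consumers: relation -> operators reading it;
    producers_of: operator -> relation it writes.
    Returns the operators to re-run after `modified_relation` changes."""
    to_visit = list(consumers.get(modified_relation, []))
    rerun = set()
    while to_visit:
        op = to_visit.pop()
        if op not in rerun:
            rerun.add(op)
            out_rel = producers_of[op]
            to_visit.extend(consumers.get(out_rel, []))
    return rerun

consumers = {"DirtyAuthors": ["ClusterAuthors"],
             "ClustersA": ["MergeAuthors"],
             "Authors": []}
producers_of = {"ClusterAuthors": "ClustersA", "MergeAuthors": "Authors"}
rerun = affected_operators(consumers, producers_of, "DirtyAuthors")
```
</p>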
        <p>Each of these operators satisfies Propositions 4.1 and 4.2
and is thus traceable. Based on this principle, each incremental
operator is able to determine the output tuples that need to be
re-computed after a single tuple modification (t → t′) in its
input relation, namely the tuples to ∈ O such that
to.Id1 = ti′.Id1, ..., to.Idp = ti′.Idp for some modified input
tuple ti′ ∈ I.</p>
        <p>The mapping operator iterates over each input tuple and
produces one or more tuples per output relation. For a
single tuple modification (t → t′) in the input relation,
its incremental execution deletes mapping(t) and then
inserts mapping(t′); the remaining output tuples stay the
same. The incremental execution of the matching operator
over input relations I1 and I2, where one of the input
relations, say I1, contains the modified tuple t′, consists
in deleting matching(t, tj), where 1 ≤ j ≤ cardinality(I2),
and inserting matching(t′, tj). An analogous reasoning is
valid for SPJ views. The merging operator collapses the
tuples of the input relation that have the same value of
a given attribute, say Ai. Suppose t.Ai = ai and the
modification sets t′.Ai = a′i. The incremental execution of the
merging then consists in the following two steps. First,
merging(tj), where tj.Ai = ai or tj.Ai = a′i and
1 ≤ j ≤ cardinality(I), is deleted from the output relation O.
Second, merging(tj), for those same tuples tj,
is recomputed and inserted in the merging output relation.</p>
        <p>Figure 4 intuitively illustrates the incremental execution for
a merging operator assuming the user replaces tuple t that
belongs to its input relation I by tuple t 0.</p>
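        <p>The two-step incremental merging just described can be sketched as follows (a hypothetical merge function that keeps the longest value per group, as in MergeAuthors; only the groups for the old and new values of Ai are recomputed):

```python
# Sketch of incremental merging: when t.Ai changes from ai to a'i, only
# the groups for ai and a'i are deleted and recomputed; all other output
# tuples are untouched.

def merge_group(tuples, key):
    """Collapse the tuples of one group; here: keep the longest value."""
    members = [t for t in tuples if t["Ai"] == key]
    if not members:
        return None
    return {"Ai": key, "value": max((t["value"] for t in members), key=len)}

def incremental_merge(output, input_relation, old_key, new_key):
    """Delete merging(tj) for the affected groups, then recompute them."""
    output = [o for o in output if o["Ai"] not in (old_key, new_key)]
    for key in (old_key, new_key):
        merged = merge_group(input_relation, key)
        if merged is not None:
            output.append(merged)
    return output

O = [{"Ai": "g1", "value": "A Gupta"}, {"Ai": "g2", "value": "J Widom"}]
# user replaces t = ("g1", "A Gupta") by t' = ("g2", "Ashish Gupta"):
I = [{"Ai": "g2", "value": "Ashish Gupta"}, {"Ai": "g2", "value": "J Widom"}]
O = incremental_merge(O, I, "g1", "g2")
```
</p>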
        <p>Before we detail the new execution algorithm, we
introduce some auxiliary relations and define the notion of
conflict. Any operator Op executing in incremental mode
must take into account the user modifications previously
applied to its output O. These user-modified tuples are
stored in relations O+, for tuple insertions, and O−, for
tuple deletions, where O+ and O− have the same schema as O.
By definition of incremental mode, user modifications applied
to the input relation I are recorded in relations I+ and I−,
which have the same schema as I. In addition, Op must be able
to consistently integrate these user tuple modifications. A
conflict situation may arise during the integration of the
different types of user modifications if the user modifications
applied to input I concern the same tuple as the user
modifications applied to output O. More concretely, a conflict
exists for any two tuples t ∈ I+ and t′ ∈ (O− ∪ O+) such that
Op(t) ≅ t′, where ≅ means the two tuples have the same values
for all attributes except the attributes that compose their
identifiers. This means the incremental execution of Op would
re-insert a tuple in output O that was already deleted or
inserted by the user.</p>
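        <p>The conflict test (a user-inserted input tuple whose image under Op equals a user-deleted output tuple on all non-identifier attributes) can be sketched as follows; the attribute names and the toy operator are illustrative assumptions:

```python
# Sketch: two tuples conflict when they agree on every attribute except
# their identifiers (the relation written with the "same tuple modulo Id"
# symbol in the text).

ID_ATTRS = {"Id"}

def same_modulo_id(t1, t2):
    """True iff t1 and t2 agree on every non-identifier attribute."""
    keys = (set(t1) | set(t2)) - ID_ATTRS
    return all(t1.get(k) == t2.get(k) for k in keys)

def conflicts(op, i_plus, o_minus):
    """A user-inserted input tuple t conflicts with a user-deleted output
    tuple t2 when op(t) would re-create t2."""
    found = []
    for t in i_plus:
        out = op(t)
        for t2 in o_minus:
            if same_modulo_id(out, t2):
                found.append((t, t2))
    return found

op = lambda t: {"Id": None, "name": t["name"].strip()}  # toy mapping
i_plus = [{"Id": 3, "name": " A Gupta "}]               # user insertion in I
o_minus = [{"Id": 9, "name": "A Gupta"}]                # user deletion in O
found = conflicts(op, i_plus, o_minus)
```
</p>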
        <p>Figure 4: Execution in incremental mode of merging.</p>
        <p>Suppose the user deletes tuple ti from the output relation
O1 of operator Op1. The incremental execution of Op2 and Op3
that follows removes ti from their output relations, according
to Definition 4.3. Now, imagine that, when inspecting the
exceptional tuples generated by operator Op4, the user
decides to correct one of its exceptional tuples by inserting
tuple ti back in the output of operator Op1. The incremental
execution of Op2 that follows will then re-generate tuple ti,
thus invalidating the previous user action that had deleted
it. Some special handling must then be provided to avoid
these possible conflicting situations.</p>
        <p>The execution of an operator in incremental mode that
takes into account previous user actions applied to its
output is now sketched. Three possible user actions are
distinguished: tuple deletion, insertion, and update. As
already mentioned, a tuple update corresponds to the deletion
followed by the insertion of a tuple; we will thus explain
just the deletion and insertion user actions. If an output
tuple t is deleted by the user, the operator Op should, in
principle, never again generate t in its output relation. The
operator must then memorize that tuple t has been deleted
from its output by a user action, and recall this
information during further re-executions. If a later user action
re-generates the previously deleted tuple t in the Op input
relation, Op should automatically recognize the possible
existence of a conflict. Analogously, when an output tuple t
is inserted as the result of a user action, Op should log the
user insertion; when further user actions directly or
indirectly re-generate tuple t, the possibility of a conflict
should also be notified.</p>
        <p>Figure 6: Algorithms for logging interactive tuple deletion
or insertion.</p>
        <p>We now describe the components of the execution algorithm
in incremental mode for an operator Op that consistently
integrates the user modifications applied to its output
relation O. This algorithm is an extension of the incremental
execution presented in Definition 4.3. First, two algorithms
that permit recording user insertions and deletions in the
output relation of an operator are presented in Figure 6; we
consider that tuple t is deleted from, or inserted into, the
output relation O of operator Op.</p>
        <p>The execution in incremental mode takes as input the data
modifications applied to the input relation, represented by
I+ and I−, and the data modifications applied to the output
relation, recorded in O+ and O−. For an input table
I = (Id, A1, ..., Ak) and an output table O = (Id, B1, ..., Bl, IId),
where Id is the identifier of each table, the exceptional table
UserOpexc has the schema
(O.Id, O.B1, ..., O.Bl, O.IId, I.Id, conflictDescription).</p>
        <p>Second, the execution algorithm of an operator Op in
incremental mode that copes with these data modifications
is presented in Figure 7. The input modifications
result from the incremental execution of the operators that
produce the input relation I of Op. The output
modifications correspond to user actions performed in order to
correct exceptions that occurred during the execution of Op. The
algorithm manages the conflicts that may exist when integrating
the input and output tuple modifications. We consider the
simplest case of an operator accepting a single input relation
I and returning a single output relation O. Besides
updating output relation O according to the data modifications,
the execution of operator Op in incremental mode
generates an additional table of exceptions named UserOpexc.</p>
        <p>Whenever there exists a conflict between two user
modifications reflected in the same operator output, the tuples
that generate this conflict are added to this exceptional
table. This additional exceptional relation stores the
user-modified tuples and the identifier of the user-modified
input tuple that generates the conflict, as well as a textual
description of the conflict. In the case of matching, we
consider two input relations, I1 and I2, and the additional
output relation I1nomatch that stores the tuples of the input
relation I1 that do not match any tuple of the input relation
I2, as introduced in Section 3. For each operator whose output
tuples may be modified by the user, two additional output
relations are updated: O−, which records the tuples the user
has deleted, and O+, which records the tuples the user has
inserted.</p>
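        <p>The bookkeeping of O+ and O− for interactive deletions and insertions, in the spirit of the logging algorithms of Figure 6, can be sketched as follows (a simplified version for mapping, merging, and SPJ views; the matching variant, which additionally maintains I1nomatch, is omitted):

```python
# Sketch: a user deletion is recorded in O-, a user insertion in O+, and
# the regular output O is updated accordingly, so that later incremental
# re-executions can recall the user actions and detect conflicts.

def user_delete(t, O, O_minus):
    O_minus.append(t)            # remember the deletion across re-executions
    return [x for x in O if x != t]

def user_insert(t, O, O_plus):
    O_plus.append(t)             # remember the insertion across re-executions
    return O + [t]

O, O_plus, O_minus = [{"Id": 1}], [], []
O = user_delete({"Id": 1}, O, O_minus)
O = user_insert({"Id": 2}, O, O_plus)
```
</p>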
        <p>The matching operator needs a supplementary automatic
procedure that updates the relation Iinomatch according to
the user modifications.</p>
        <p>The following proposition states the validity of this
algorithm. Proposition 4.5: Given any operator Op that can be
executed in incremental mode, with input relations I, I+, and
I−, and output relations O, O+, and O−, and considering
any cleaning program p = Op(I), the algorithm for
incremental execution IncOpWithUser(Op, I, I+, I−, O, O+, O−)
guarantees that all conflicts are detected, whatever the
order in which the user applies data modifications
interactively.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experiments</title>
      <p>In this section, we show how exceptions are used in the
Citeseer data cleaning application and report results of
applying the methodology for cleaning data, proposed in
Section 4, using exceptions and data lineage.</p>
      <sec id="sec-5-1">
        <title>5.1 Use of exceptions and data lineage</title>
        <p>We present two examples to illustrate the use of the
mechanism of exceptions and the data lineage facility; they
lead to the refinement of the cleaning criteria and to
interactive merging, respectively.</p>
        <p>The extraction of authors, title and event name (as
specified in Section 3) throws the following exception when
trying to separate the elements of the second citation in the
motivating example:</p>
        <p>Extractionexc:
12|CiteSeerException - TitleIsEmpty
KeyDirtyData:
12|D. Quass, A. Gupta, I. Mumick, J. Widom, Making views
self-maintanable for data, PDIS’95</p>
        <p>This happens because “Making views self-maintanable
for data” is considered to be yet another author according
to the first extraction criterion, which expects a “.” or “;”
to separate the author list from the title. After modifying (as
explained in Section 2) the logic of the function
extractAuthorTitleEvent(), called within the extraction mapping, the
correct tuples are returned as output of the Extraction
operation:</p>
        <p>DirtyTitles:
12|Making views self-maintanable for data
DirtyAuthors:
12|1
12|2 ...
DirtyEvents:
12|PDIS’95</p>
        <p>Let us consider the merging operator MergeAuthors
presented in Section 3. The function getLongestAuthorName()
throws an exception if there are distinct author names in
the same cluster and more than one of them has the maximum
length. In that case, a tuple is written in the exceptional
output data flow named MergeAuthorsexc. Suppose the tuple</p>
        <p>MergeAuthorsexc: 2|EqualSizeException</p>
        <p>is inserted in the exception relation generated by the
MergeAuthors operator. The user can then trace this tuple back
to the input tuples of ClusterAuthors and DirtyAuthors that have
generated it, by soliciting the lineage of the corresponding
merging exceptional tuples and clustering output tuples.</p>
        <p>The tuples that constitute the lineage of this exceptional
tuple permit the user to discover that her interaction is
needed because the system failed to choose an author name.</p>
        <p>The user may thus insert directly the correct author name
into the Authors relation, if “A Gupta” and “H Gupta” are
indeed the same person. Otherwise, the user may update
the corresponding DirtyAuthors tuples so that they are no
longer considered as candidate matches by the matching
operator (e.g., expand to “Ashish Gupta” and “Himanshu
Gupta”).</p>
        <p>A similar interaction is required for merging titles
(recall the two titles of the motivating example: “Making
views...”), since the same criterion (the longest value) is used
to merge automatically and these titles have the same length.</p>
        <p>We consider the whole set of exceptions classified
according to three types of data transformations. Extraction
criteria cover the extraction of author names, title and
event name (step 2 of the strategy presented in the
introduction), and the extraction of volume, year, number, city
name, month, etc. from each citation (step 3 of the strategy).</p>
        <p>A normalization transformation is applied to the extracted
event names (before duplicate elimination is applied to
events) and aims at finding the standard event name in
a dictionary that maps significant keywords to standard
event names (e.g. the keywords “parallel, distributed,
information, conference” identify the standard event name
“International Conference on Parallel and Distributed
Information Systems (PDIS)”). If more than one standard event
name is matched, an exception is thrown. The user can
correct these exceptions either by refining the dictionary or by
writing directly the correct standard event name when no
further refinement is possible. The merging operations refer to
the duplicate elimination of author names, titles and event
names (step 4 in the strategy). Merging exceptions are thrown
when no unique author name, title or event name, respectively,
can be automatically determined for a cluster.</p>
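        <p>The dictionary-based normalization can be sketched as follows (the dictionary entries and the keyword-matching rule are illustrative assumptions; an exception is raised when zero or several standard names match):

```python
# Sketch: map keyword sets to standard event names; ambiguity or absence
# of a match becomes an exception for the user to resolve.

EVENT_DICTIONARY = {
    frozenset({"parallel", "distributed", "information", "conference"}):
        "International Conference on Parallel and Distributed "
        "Information Systems (PDIS)",
    frozenset({"very", "large", "databases"}):
        "International Conference on Very Large Data Bases (VLDB)",
}

class NormalizationException(Exception):
    pass

def normalize_event(raw_name):
    words = set(raw_name.lower().replace(",", " ").split())
    matches = [std for kws, std in EVENT_DICTIONARY.items()
               if kws.issubset(words)]
    if len(matches) == 0:
        raise NormalizationException("no standard event name: " + raw_name)
    if len(matches) > 1:
        raise NormalizationException("ambiguous event name: " + raw_name)
    return matches[0]

std = normalize_event("Conference on Parallel and Distributed Information Systems")
```
</p>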
        <p>From the analysis of the table, we remark that some
exceptions, such as normalization exceptions (column Normalz.),
are mostly solved by refining criteria and dictionaries, while
others, such as merging exceptions (column Merg.), require
interactive cleaning of the data. The extraction operations
(column Extract.) benefit partly from the refinement of the
extraction functions (the number of extraction exceptions was
reduced by 30%) and partly from the interactive user procedure.</p>
        <p>The last column of the table (R/P) represents two
metrics, usually used in the Information Retrieval domain, that
assess the quality of the data transformed. We used the
running set of 1,000 citations manually cleaned as reference
and call it the set of correct citations. Recall gives the
fraction of correct citations returned by the automatic cleaning
program. Precision gives the fraction of citations returned
by the automatic cleaning program that are indeed correct.</p>
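        <p>The two metrics can be computed straightforwardly from the manually cleaned reference set (a minimal sketch consistent with the definitions in the text):

```python
# Sketch: recall = |returned AND correct| / |correct|,
#         precision = |returned AND correct| / |returned|.

def recall_precision(returned, correct):
    returned, correct = set(returned), set(correct)
    hits = len(returned.intersection(correct))
    recall = hits / len(correct)
    precision = hits / len(returned)
    return recall, precision

correct = {"c1", "c2", "c3", "c4"}    # manually cleaned reference citations
returned = {"c1", "c2", "c5"}         # output of the automatic program
r, p = recall_precision(returned, correct)
```
</p>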
        <p>The reported results confirm our expectations concerning
the quality of the data obtained. After each phase of debugging
and user interaction, the accuracy improves. It improves
slightly after the refinement of code and dictionary entries
(e.g. recall increases from 0.32 to 0.36), and a superior gain
is obtained after the interactive data corrections (recall
increases to 0.74).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusions</title>
      <p>In this paper, we presented a mechanism of exceptions and
a data lineage facility that assist the user in tuning a data
cleaning program.</p>
      <p>From our experience with the whole set of Citeseer dirty
data, we concluded that for large amounts of data whose
level of dirtiness is considerably high, the data cleaning
system must be tuned using different samples of data (in
this paper we report on two sets of 1,000 dirty citations). This
corresponds to the methodology used in commercial data
cleaning projects. Data domains (e.g. bibliographic
references in Computer Science) handled for the first time
always need incremental code refinement and enrichment
of auxiliary dictionaries. The re-use of accumulated
refinements is only profitable after at least one project for a given
data domain (unless, of course, all sources of standard
knowledge, e.g. a dictionary of all universities that issue reports
in the Citeseer cleaning application, are supplied as inputs to
the cleaning process). The advantage that our data lineage
facility and mechanism of exceptions bring is to ease and
assist the user in this tuning activity.</p>
      <p>Two issues suggest future work. First, machine
learning techniques could be incorporated to correct exceptions.
When duplicate records are frequent, it would be useful to
train the system to automatically modify exceptional
tuples according to a user-established correction pattern.
Second, a limiting factor for the full incremental execution of
a data cleaning program as described in Section 4 is the
clustering operator. It would be interesting to analyze the
classes of clustering methods that permit incremental
execution.</p>
      <p>[AGS93] Ashish Gupta, Inderpal Singh Mumick, and
V.S. Subrahmanian. Maintaining Views Incrementally. In Proc.
of the ACM SIGMOD Conf. on Management of Data, 1993.</p>
      <p>[CD97] S. Chaudhuri and U. Dayal. An Overview of Data
Warehousing and OLAP Technology. SIGMOD Record, March 1997.</p>
      <p>[CW91] Stefano Ceri and Jennifer Widom. Deriving Production
Rules for Incremental View Maintenance. In Proc. of the Int.
Conf. on Very Large Data Bases, 1991.</p>
      <p>[CW00] Yingwei Cui and Jennifer Widom. Practical Lineage
Tracing in Data Warehouses. In ICDE, 2000.</p>
      <p>[RD00] Erhard Rahm and Hong Hai Do. Data Cleaning: Problems
and Current Approaches. IEEE Data Engineering Bulletin, 23(3),
September 2000.</p>
      <p>[Sag] Sagent. Sagent home page. http://www.sagenttech.com/.</p>
      <p>[Val]</p>
      <p>Informatica. Informatica home page.
http://www.informatica.com/.</p>
      <p>NEC Research Institute. Research Index (CiteSeer).
http://citeseer.nj.nec.com/.</p>
      <p>Evolutionary Technologies International. Home page of ETI.
http://www.evtech.com/.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>