<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Axiomatization and Inference Complexity over a Hierarchy of Functional Dependencies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaroslaw Szlichta</string-name>
          <email>jaroslaw.szlichta@uoit.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukasz Golab</string-name>
          <email>lgolab@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Divesh Srivastava</string-name>
          <email>divesh@research.att.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AT&amp;T Labs-Research</institution>
          ,
          <addr-line>New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Ontario Institute of Technology</institution>
          ,
          <addr-line>Oshawa</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Waterloo</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Functional dependencies (FDs) have recently been extended for data quality purposes with various notions of similarity instead of strict equality. We study these extensions in this paper. We begin by constructing a hierarchy of dependencies, showing which dependencies generalize others. We then focus on an extension of FDs that we call Antecedent Metric Functional Dependencies (AMFDs). An AMFD asserts that if two tuples have similar but not necessarily equal values of the antecedent attributes, then their consequent values must be equal. We present a sound and complete axiomatization as well as an inference algorithm for AMFDs. We compare the axiomatization of AMFDs to those of the other dependencies, and we show that while the complexity of inference for some FD extensions is quadratic or even co-NP complete, the inference problem for AMFDs remains linear, as in traditional FDs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Poor data quality is a bottleneck to e ective business decision-making. Big data
initiatives are likely to take longer, cost more, and deliver fewer bene ts without
clean data. The ability to store data is no longer a problem: according to a survey
of 586 senior executives conducted in June 2011 by the Economist Intelligence
Unit (EIU) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], less than 20% indicated data storage as a problem; however more
than 50% rated data management tasks such as cleaning as problematic.
      </p>
      <p>
        With the interest in data analytics at an all-time high, data quality has
become a critical issue in research and practice. Integrity constraints, which specify
the intended semantics and attribute relationships, are commonly used to
characterize and ensure data quality [
        <xref ref-type="bibr" rid="ref20 ref6">6, 20</xref>
        ]. In particular, Functional Dependencies
(FDs), which have traditionally been used in schema design, have recently been
extended for data consistency purposes. An FD asserts that if two tuples agree
on the left-hand-side attributes, then they must also agree on the
right-handside attributes. The idea behind various extensions of FDs is to replace strict
equality with some notion of similarity, either on the left-hand-side (see, e.g,
source
      </p>
      <p>
        A
B
C
Matching Dependencies [
        <xref ref-type="bibr" rid="ref5 ref7 ref9">7, 5, 9</xref>
        ]), on the right-hand-side (see, e.g, Metric
Functional Dependencies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Sequential Dependencies [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), or on both sides of
the dependency (see, e.g., Di erential Dependencies [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]).
      </p>
      <p>In this paper, we study these generalizations of FDs. Our rst objective is to
construct a hierarchy of dependencies, revealing which ones (strictly) generalize
others, and comparing their axiomatization and complexity of inference.</p>
      <p>We then introduce a particular generalization of FDs that we call Antecedent
Metric FDs (AMFDs). An AMFD asserts that if two tuples have similar but not
necessarily equal values of the antecedent attributes, then their consequent values
must be equal; we will compare AMFDs with related dependencies in Section 2.</p>
      <p>To illustrate the utility of AMFDs, consider the movie data set shown in
Table 1, which was put together from multiple data sources. In the process
of merging data from various sources, it is often the case that small variations
occur. For example, one source might report the movie A Beautiful Mind to have
a running time of 135 minutes, as shown in Table 1, while another source may
refer to the same movie as A Beaut. Mind and the third one as Beautiful Mind.
An AMFD ftitle, year, directorg 7! flengthg indicates that movies with similar
titles, years, and directors (up to some distance threshold, as we will discuss in
Section 2) must have equal lengths. Of course, we assume that the semantics
are such that two similar movie titles, made in similar years, by similar director
names do in fact refer to the same movie.</p>
      <p>An FD ftitle, year, directorg ! flengthg would not require the three length
values in Table 1 to be equal, even though they refer to the same movie. Thus,
AMFDs generalize FDs and can express the additional semantics of similarity.</p>
      <p>
        The inference problem is to determine whether a dependency is logically
entailed by a set of dependencies. For FDs, the inference problem has been well
studied in previous work [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. We prove that while AMFDs are more expressive
than FDs and have a more complex axiomatization, their complexity of inference
remains linear.
      </p>
      <p>The contributions of this paper are as follows.
1. Hierarchy: we construct a hierarchy of dependencies, showing which ones
generalize others and comparing their complexity of reasoning. Our hierarchy
shows which dependencies are practical and which are hard to reason about,
and suggests further research on identifying tractable extensions of FDs.
2. FD extension: we introduce AMFDs, which describe integrity constraints on
tuples with similar attribute values and are useful in data cleaning.
3. Axiomatization: we present a sound and complete axiomatization for AMFDs.</p>
      <p>Axiomatization is a rst necessary step to designing an e cient inference
procedure. Our axiomatization reveals interesting insights about inference
Relations
{ A bold capital letter represents a relation schema: R. Italic capital letters
near the beginning of the alphabet represent single attributes: A and B.
{ A small bold capital letter in italic represents a relation (a table): t.
{ Small italic letters near the end of the alphabet denote tuples: s and t.
{ Small italic letters near the beginning of the alphabet denote attribute
values: a, b and c. A small italic letter m denotes a similarity metric.
Sets
{ Italic capital letters near the end of alphabet stand for sets of attributes: X
{ XY is shorthand for X [ Y . Likewise, AX or XA stand for X [ fAg.
rules over AMFDs. For instance, the Re exivity and Augmentation axioms,
which hold for traditional FDs, are not necessary true for AMFDs.
4. Inference Procedure: we develop an inference procedure for AMFDs that
runs in linear time in the complexity of the schema4. We implemented the
inference algorithm and experimentally veri ed its e ciency.</p>
      <p>The remainder of this paper is organized as follows. In Section 2, we review
previous work, formally de ne AMFDs, and present a hierarchy of dependencies.
In Sections 3 and 4, we present a sound and complete axiomatization and an
inference procedure for AMFDs, respectively, and we compare the axiomatization
to those of other related dependencies. We conclude the paper in Section 5.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Fundamentals</title>
      <sec id="sec-2-1">
        <title>AMFDs</title>
        <p>We provide notational conventions in Table 2. To accommodate small variations
in the attribute values on the left-hand-side of the dependency, we de ne AMFDs
(De nition 2). This is a departure from traditional FDs which enforce equality
on both sides. Before we de ne AMFDs, we rst de ne a similarity operator
with a distance threshold.</p>
        <p>De nition 1. (similarity) For every attribute A in a relational schema R, we
assume a binary similarity relation ( m; ) w.r.t. some similarity metric m and
a threshold parameter 0. Speci cally, for two tuples s and t, s[A] m; t[A]
i m(s[A], t[A]) . Metric m satis es standard properties; it is symmetric,
satis es the triangle inequality and identity of indiscernibles, i.e., m(a, b) = 0
i a = b. For two tuples s, t in relation t over R, we write s[X] m, t[X] to
mean s[A1] m1; 1 t[A1], ..., s[An] mn; n t[An], where X = fA1, ..., Ang, m
= [m1, ..., mn] and = [ 1, ..., n].
4 Our inference procedure is e cient because it is done at the schema level, which is
much smaller than the size of the data.</p>
        <p>Next, we de ne AMFDs. By de nition, AMFDs generalize FDs.</p>
        <p>De nition 2. (AMFD) Let X and Y be two sets of attributes, and let mX
and X be metrics and thresholds for attributes X. Then, X 7! Y denotes an
antecedent metric FD (AMFD), read as X metrically functionally determines
Y . Let R be a relation schema that contains the attributes that appear in X and
Y , and let t be a relation instance of R. Relation t satis es X 7! Y (t j= X 7!
Y ), i for all tuples s, t 2 t, s[X] mX ; X t[X] implies s[Y ] = t[Y ]. An AMFD
X 7! Y is said to hold for R, written as R j= X 7! Y , i for each admissible
relational instance t of R, relation t satis es X 7! Y . An AMFD X 7! Y is
trivial i for all t, t j= X 7! Y .</p>
        <p>Example 1. (AMFD) Assume metrics mtitle and mdirector are edit distances with
thresholds title = 6 and director = 0, respectively, in Table 1 (movie relation).
Also assume that metric myear is an integer distance with a threshold year =
0. Therefore, Table 1 satis es the AMFD ftitle, year, directorg 7! flengthg.
2.2</p>
        <p>
          Related Work
AMFDs and traditional FDs are speci ed over a single relation. However, AMFDs
replace strict equality on the left-hand-side of the dependency with similarity.
Dependencies de ned over a single relation with similarity on the
right-handside, called Metric FDs, were proposed by Koudas et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We call them
Consequent Metric FDs (CMFDs) to distinguish them from AMFDs. The
veri cation problem over CMFDs was studied in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which is to decide whether
the instance satis es a prescribed set of dependencies. However, axiomatization
and inference were not considered.5
        </p>
        <p>
          Bertossi et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Fan [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Fan et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] studied Matching
Dependencies (MDs), which are object-identi cation constraints across multiple relations.
MDs also enforce similarity rather than equality on the left-hand-side, but allow
arbitrary Boolean similarity functions. (These similarity functions only need to
satisfy re exivity, symmetry and subsumption of equality.) On the other hand,
AMFDs are de ned over a single relation and only allow a restricted notion of
similarity, namely thresholds over similarity metrics (recall De nition 1). Fan
et al. presented a sound and complete axiomatization6 and a quadratic-time
inference procedure for MDs.
        </p>
        <p>
          Pointwise Order Dependencies (PODs) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] consider order relationship rather
than equality of attribute values. A relation satis es a POD X ,! Y if, for all
tuples s and t, for every attribute A in X, sA op t A implies that for every
attribute B in Y sB op t B, where op f&lt;; &gt;; ; ; =g. For example, in relation
5 Some of the authors of this paper solved the axiomatization and inference problems
for CMFDs in a paper currently under submission.
6 It is stated in Fan et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (without a proof of completeness) that a complete
axiomatization for MDs consists of 11 axioms, but only 9 sound axioms are presented.
sequential id
TimePolls (Table 3), the POD fdate&gt;g ,! fyear=; month=; day&gt;g holds;
however, the POD fdate&gt;g ,! fyear=; month=; day g does not hold. Ginsburg and
Hull [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] present a sound and complete axiomatization for PODs and show that
the inference problem for them is co-NP-complete.
        </p>
        <p>
          PODs are de ned over sets of attributes. On the other hand, Lexicographical
Order Dependencies (LODs) are de ned over lists of attributes [
          <xref ref-type="bibr" rid="ref17 ref19">17, 19</xref>
          ]. LODs
describe the relationship among lexicographical orderings of sets of tuples. This
is the notion of order used in SQL and in query optimization, as per the order by
operator (nested sort). A relation satis es a LOD X ,! Y if any list of its tuples
that satisfy order by X also satis es order by Y; however, not necessarily vice
versa. (X and Y denote lists of attributes.) For instance, in relation TimePolls,
the LODs timestamp ,! date and [date] ,! [year, month, day ] are true. The
default direction of the SQL order by is ascending. This can be generalized to
order-by's that mix asc and desc directions, e.g., order by name asc, age desc.
For example, in relation TimePolls, the LOD [ sequential id desc] ,!
[timestamp asc] holds. Szlichta et al. present a sound and complete axiomatization
for lexicographical order dependencies and show that the inference problem for
LODs is co-NP-complete [
          <xref ref-type="bibr" rid="ref17 ref19">17, 19</xref>
          ].
        </p>
        <p>
          Another constraint for ordered data, sequential dependencies (SDs), was
introduced in Golab et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. For example, the SD sequential id ,![4;5] timestamp
means that after sorting the data by the attribute sequential id, the gaps
between consecutive timestamps are between 4 and 5. This particular SD holds in
the TimePolls relation; however, the SD sequential id ,![6;7] timestamp does
not hold. Golab et al. present a framework for discovering which subsets of the
data obey a given SD, but axiomatization and inference were not considered.
        </p>
        <p>
          SDs were generalized in Song and Chen [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] by introducing gaps (di erential
functions) on both sides of the dependency and named Di erential
Dependencies (DDs). For instance, in the relation TimePolls, the DDs sequential id[1;1]
,! timestamp[4;5] and fdate[0;1]g ,! fyear[0;0]; month[0;0]; day[0;1]g hold.
However, the DD sequential id[1;1] ,! timestamp(5;6] does not hold. Song and Chen
present an axiomatization and show that inference problem for DDs is
co-NPcomplete.
dependencies, their complexity of inference is bookended by their immediate
ancestors and descendants in the hierarchy.
        </p>
        <p>We say that a dependency class A generalizes dependency class B i there
is a semantically preserving mapping of any dependency of class B into a set of
dependencies of class A. Class A strictly generalizes class B i A generalizes B,
however, B does not generalize A. The hierarchy in Figure 1 shows which
dependencies strictly generalize others; due to space constraints, proofs will appear in
extended version of this paper.</p>
        <sec id="sec-2-1-1">
          <title>MDDs</title>
          <p>(not studied)
MDs
(quadratic)</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Limited AMDs</title>
          <p>(linear)</p>
          <p>DDs
(co-NP-complete)
FDs
(linear)</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>AMFDs</title>
          <p>(linear)</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>CMFDs</title>
          <p>(linear - Footnote 5)</p>
        </sec>
        <sec id="sec-2-1-5">
          <title>PODs</title>
          <p>(co-NP-complete)</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>LODs</title>
          <p>(co-NP-complete)</p>
          <p>SDs
(not studied)</p>
          <p>For example, DDs strictly generalize SDs. In our example involving Table 3,
with the SD sequential id ,![4;5] time, consecutive sequence numbers can be
simulated by using on the left-hand-side of the DD a similarity metric which
returns distance one if two numbers are consecutive and zero otherwise.
Axiomatization and complexity of inference for SDs are open problems. However, since
our hierarchy indicates that DDs strictly generalize SDs, the upper bound on
the complexity of inference for SDs is co-NP complete.</p>
          <p>
            Similarly, DDs strictly generalize PODs (which strictly generalize LODs [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]).
For instance, a POD A ,! B is equivalent to a DD A[0;+1] ,! B[ 1;0]. This
suggests that the complexity results for PODs can be adapted to DDs and did
not have to be re-developed from scratch in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. AMFDs and CMFDs are also
subsumed by DDs, since DDs allow similarity both on the left-hand-side and
right-hand side. (Limited AMDs are introduced in Section 3.) Both AMFDs and
CMFDs strictly generalize FDs by replacing equality with similarity.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Axiomatization</title>
      <sec id="sec-3-1">
        <title>Soundness and Completeness</title>
        <p>
          We now present an axiomatization for AMFDs, analogous to Armstrong's
axiomatization for FDs [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ]. This provides a formal framework for reasoning about
1. Void
        </p>
        <p>X 7! fg
2. Transitivity</p>
        <p>If X 7! Y
and Y 7! Z
then X 7! Z
3 Composition</p>
        <p>If X 7! Y and Z 7! W</p>
        <p>then XZ 7! Y W
4 Decomposition</p>
        <p>If X 7! Y and Z Y</p>
        <p>then X 7! Z
Fig. 2: Axiomatization for AMFDs.</p>
        <p>5 Reduce</p>
        <p>If XZ 7! Y and X 7! Z</p>
        <p>then X 7! Y
6 Limited Re exivity</p>
        <p>If Y X and Y = 0
then X 7! Y
AMFDs. The axioms give insights into how AMFDs behave and reveal how
dependencies logically follow from others, which is not easily evident when
reasoning from rst principles. Also, a sound and complete axiomatization is necessary
for an e cient inference procedure (see Section 4).</p>
        <p>The axioms for AMFDs are presented in Figure 2. Recall that fg denotes an
empty set. Two of the axioms generate trivial dependencies that are always true:
Void and Limited Re exivity. Below we introduce additional inference rules that
follow from the axioms in Figure 2. These will be used throughout the paper,
particularly to prove that our AMFD axioms are complete.</p>
        <p>Lemma 1. (Left Augmentation) If X 7! Y , then XZ 7! Y .</p>
        <p>Proof. By Void and Composition it follows that XZ 7! Y . tu
Lemma 2. (Union) If X 7! Y and X 7! Z, then X 7! Y Z.</p>
        <p>Proof. By Composition it follows that X 7! Y Z. tu</p>
        <p>Next, we de ne closure over AMFDs. The closure of a set of attributes X is
the set of attributes that X logically determines given a set of AMFDs F .
De nition 3. (Closure X+) The AMFD-closure of set of attributes X, denoted
X+, w.r.t. a set of AMFDs F using axioms I = f1{6g in Figure 2, is de ned
as, X+ = fA j F ` X 7! Ag.</p>
        <p>Lemma 3 tells us whether a dependency follows from F using our axioms.
Lemma 3. (Closure for AMFDs) F ` X 7! Y , if and only if Y
X+.</p>
        <p>Proof. Let Y = fA1, :::, Ang. Assume Y X+. By de nition of X+, X 7! Ai
for all i 2 f1; :::; ng. Therefore, by the Union axiom, X 7! Y . To prove the other
direction, suppose X 7! Y follows from the axioms. For each i 2 f1; :::; ng, X
7! Ai by Decomposition, so Y X+. tu
Theorem 1. (Completeness) AMFD axioms are sound and complete.
Proof. The soundness proof (if F ` X 7! Y , then F j= X 7! Y ) is trivial. We
just have to show that each axiom is true. We present the completeness proof (if
F j= X 7! Y , then F ` X 7! Y ). We consider a table t with two rows, whose
template is shown in Table 4. We divide the attributes of t into three subsets:
X+, the set N , consisting of attributes in X that are not in the closure of X7,
7 For a traditional FD X 7! Y , by Re exivity all the attributes in X are also in X+.</p>
        <p>However, this is not true for AMFDs, as we will show in Example 2.
and all the remaining attributes. All the attributes of the rst row have the value
a, while for the second row, the attributes in X+ are a's, the attributes in N are
b's and the other attributes are c's. Without loss of generality, assume that for
all the attributes A in N and other attributes over t we use the same metric m,
and that a and b are similar (a m; A b) but not equal. Also, assume that the
values a and c are not similar (a 6 m; A c), and hence not equal.</p>
        <p>We rst show that all dependencies in the set of AMFDs F are satis ed by
table t (t j= F ). Since the AMFD axioms are sound, AMFDs inferred from F are
true. Note that by Void and Limited Re exivity, all trivial AMFDs are satis ed
in table t. Assume V 7! Z is in F but is not satis ed by table t. Therefore, V
fX+ [ N g because otherwise two rows of t are not similar on some attribute of
V since a 6 m; A c, and consequently an AMFD V 7! Z would not be violated.
Moreover, Z cannot be a subset of X+ (Z 6 X+), or else V 7! Z would be
satis ed by t. Let A be an attribute of Z not in X+. Since the dependency V
7! Z is in F , by Decomposition, V 7! A. Let V1 be a maximal set of attributes
such that V1 V and V1 X+. Let V2 be a maximal set of attributes such that
V2 V and V2 N . By Union and De nition 3 of closure, X 7! X+. Therefore,
by Left Augmentation and Reduce, XV2 7! A. Since N = fA j A 2 X and
A 62 X+g, V2 X. Hence, X 7! A, which is a contradiction.</p>
        <p>Our remaining proof obligation is to show that any AMFD not inferable from
the set of AMFDs F with our axioms (F 6` X 7! Y ) is not true (F 6j= X 7!
Y ). Suppose it is satis ed (F j= X 7! Y ). It follows by the construction of table
t that Y X+; otherwise, two rows of table t agree or are similar on X but
disagree on some attribute A from Y . Since Y X+, by Lemma 3 it can be
inferred that X 7! Y , which is a contradiction. Thus, whenever X 7! Y does
not follow from F by the AMFD axioms, F does not logically imply X 7! Y .
That is, the axiom system is complete over AMFDs, which ends the proof . tu
The axiomatization for AMFDs is more involved than its FDs counterpart. A
sound and complete axiomatization for traditional FDs consists of only three
axioms: Re exivity, Augmentation and Transitivity. Interestingly, Re exivity (if
Y X, then X 7! Y ) is not necessary true for AMFDs.</p>
        <p>Example 2. (lack of Re exivity) Consider table t (Table 4). Assume again that
the values a and b are not equal (a 6= b) but they are similar (a m; A b) for each
attribute A in N . Let attributes fBCDg N . Therefore, the AMFDs BCD 7!
BCD and BCD 7! BC are not satis ed in t because the values are similar on
the left hand side of the dependencies, but not equal on their right hand side.</p>
        <p>Similarly, Augmentation, which is another axiom for FDs, does not necessary
hold for AMFDs. (Augmentation states that if X 7! Y then XZ 7! Y Z.)</p>
        <p>We replaced Re exivity with Void and Limited Re exivity in the
axiomatization for AMFDs. Lack of Augmentation forced us to add Composition and
Decomposition to the axiomatization. Note that Left Augmentation (Theorem
1) holds for AMFDs. Since Re exivity does not hold for AMFDs, we had to add
Reduce. Only Transitivity (base axiom for FDs) was preserved in axiomatization.</p>
        <p>
          We also studied an axiomatization for a simpli ed version of MDs over a
single table, rather than multiple tables as originally de ned in [
          <xref ref-type="bibr" rid="ref5 ref7 ref9">5, 7, 9</xref>
          ]. We call
these limited AMDs. The main di erence between limited AMDs and AMFDs
is that the former allow arbitrary similarity functions while the latter employ
thresholds on similarity metrics. A sound and complete axiomatization for
limited AMDs consists of the following ve axioms: Void, Transitivity, Composition,
Decomposition and Reduce. The proof will appear in the extended version of
this paper; it is a simpli ed version of the proof of Theorem 1. In comparison
to AMFDs, Limited Re exivity does not hold for limited AMDs. A sound and
complete axiomatization for a full class of MDs is more complex (Footnote 6),
as it incorporates axioms that allow us to reason over multiple relations.
        </p>
        <p>Table 5 compares the axiomatization of AMFDs and AMDs with other
dependency classes. We point out several interesting observations below.</p>
        <p>In contrast to MDs and AMFDs, the distance (gap) functions for DDs are
de ned at the dependency level for each attribute instead of the schema level.
Therefore, Transitivity for DDs additionally requires an order relation over
differential functions. For instance, if we have the dependency \if the date di erence
for two tuples is 30 days, then price $50", then the dependency \if the date
di erence for two tuples is 30 days, then price $40" must also hold.</p>
        <p>There is an extra axiom for DDs (Impropriety) that accommodates
inconsistencies between dependencies (this problem does not arise in AMFDs and MDs).
For example, the following two dependencies are inconsistent since it is not
posAlgorithm 1 Inference procedure for AMFDs
Input: A set of AMFDs F, and a set of attributes X.</p>
        <p>Output: The closure of X with respect to F.
1: F unused F ; n 0
2: X n W where W = fA j A 2 X and A = 0g
if 9 V 7! Z 2 F unused and V</p>
        <p>X n+1 X n [ Z</p>
        <p>F unused
fV 7! Zg</p>
        <p>fX n [ X g then
sible to instantiate a relation that satis es both of them: a) if the date di erence
for two tuples is 30 days, then price = $50; and b) if the date di erence for two
tuples is 30 days, then price &gt; $50. Similarly, Augmentation and Re exivity
have to be modi ed for DDs to accommodate di erent distance functions used
by di erent dependencies on the same attribute. For instance, di erent distance
functions for the same attribute may result in the same actual distance.</p>
        <p>Interestingly, as we traverse the hierarchy of dependencies, the number of
axioms does not necessary decrease. There are 6 axioms for AMFDs, 7 for PODs
and 6 for LODs versus 4 for DDs; however, there are 3 axioms for FDs at
the bottom of hierarchy. There are two reasons for this. First, the axioms for
DDs are quite complex. Second, as we go down the hierarchy, the dependencies
become more specialized and therefore we may need more axioms to express
their restricted semantics, e.g., lack of Re exivity. As the dependencies become
more generalized, some axioms must be weakened, e.g., Limited Re exivity.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Inference Procedure</title>
      <p>
        A goal of a dependency theory is to develop algorithms for the inference problem.
Inference for DDs is co-NP-complete [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and for MDs it is quadratic [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Since
DDs and MDs generalize AMFDs, this sets an upper bound for the complexity of
inference for AMFDs. However, computing closure, X+, for AMFDs can be done
more e ciently. It takes time proportional to the length of the dependencies in
F , written out (linear time), which is as e cient as for FDs. (The complexity of
inference for limited AMDs is also linear; the proof will appear in the extended
version of this paper.) Algorithm 1 presents an inference procedure for AMFDs.
Our experiments have shown that it is e cient. For 10 AMFDs prescribed over
a dataset generated by the UIS Database [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the algorithm runs in time 1ms.
Example 3. (inference) Let F = fAB 7! C, ABC 7! EG, EG 7! Hg denote the
set of AMFDs. Also, let C = 0 and D &gt; 0 for all attributes D in ABEGH Let
us calculate the closure of set of attributes AB with Algorithm 1:
1) X 0 = fg; 2) X 1 = C; 3)X 2 = CEG; 4) X 3 = CEGH.
      </p>
      <p>The closure of AB is CEGH. For traditional FDs, the closure of AB is ABCEGH.
Theorem 2. (inference) Alg. 1 correctly computes the closure X+ over AMFDs.
Proof. First we show by induction on k that if Z is placed in X k in Algorithm
1, then Z is in X+.</p>
      <p>Base case: k = 0. By Limited Re exivity, X 7! W , where W = fA j A 2 X and
A = 0g.</p>
      <p>Induction step: k &gt; 0. Assume that X k 1 only consists of the attributes in X+.
Suppose Z is placed in X k because V W 7! Z 2 F unused, such that V X k 1
and W X. Since V X k 1, we know by the induction hypothesis that V
X+. Hence, by Lemma 3, X 7! V . Therefore, since XV 7! V Z by Composition,
then by Reduce and Decomposition X 7! Z. Thus, Z is in X+.</p>
      <p>Now we prove the opposite: if Z is in X+, then Z is in the set returned by
Algorithm 1. Suppose Z is in X+ but Z is not in the set returned by Algorithm
1. Consider table t similar to that in Table 4. Table t has two tuples that agree
on attributes in X n, are similar but not equal on attributes X that are not
subset of X n, and disagree on all other attributes. We claim that t satis es F .
If not, let P 7! Q be a dependency in F that is violated by t. Then P X n
[ X and Q cannot be a subset of X n [ X, if the violation happens. We used a
similar argument in the proof of Theorem 1. Thus, by Algorithm 1, Lines 4{7,
there exists X n+1, which is a contradiction. tu
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we developed a hierarchy of dependency classes and laid out the
theoretical foundations for AMFDs, which generalize traditional FDs. In future
work, we plan to investigate the following problems.</p>
      <p>
        { Determining whether a given AMFD holds on a given relation, and using
AMFDs for data cleaning, similarly to how FDs were employed in previous
data cleaning work [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
{ Algorithms for automatic discovery of dependencies have been proposed for
some dependencies, such as FDs and CFDs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Similarly, we plan to study
algorithms for discovering AMFDs.
{ We plan to explore an inference framework for multiple dependencies. For
example, the following inference rules hold: a) if an AMFD X 7! Y , then a
CMFD X 7! Y ; b) if an AMFD X 7! Y , then an FD X ! Y ; c) if an FD X
! Y , then a CMFD X 7! Y . These rules along with the axioms for AMFDs
(Figure 2) and CMFDs (Footnote 5), are sound for the integrated inference
problem with FDs, AMFDs and CMFDs. However, an open question is if this
rule set is complete and what is the complexity of the inference problem.
{ Integrity constraints have been widely used in query optimization. For
instance, FDs and LODs have been shown to be useful in simplifying queries
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Economist</given-names>
            <surname>Intelligence</surname>
          </string-name>
          <article-title>Unit analysis</article-title>
          , http://www.economistinsights.com/technologyinnovation/analysis/big-data.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>UIS</given-names>
            <surname>Data Generator</surname>
          </string-name>
          , http://www.cs.utexas.edu/users/ml/riddle/data.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>W.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          .
          <article-title>Dependency Structures of Database relationships</article-title>
          .
          <source>In Proceedings of the IFIP Congress</source>
          , pages
          <volume>580</volume>
          {
          <fpage>583</fpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Beeri</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>Computional Problems Related to the Design of Normal Form Relational Schemas</article-title>
          .
          <source>TODS</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ):,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <volume>30</volume>
          {
          <fpage>59</fpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Bertossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kolahi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          .
          <article-title>Data cleaning and query answering with matching dependencies and matching functions</article-title>
          .
          <source>In ICDT</source>
          , pages
          <volume>268</volume>
          {
          <fpage>279</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G.</given-names>
            <surname>Beskales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Golab</surname>
          </string-name>
          .
          <article-title>Sampling the repairs of functional dependency violations under hard constraints</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <volume>197</volume>
          {
          <fpage>207</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          .
          <article-title>Dependencies Revisited for Improving Data Quality</article-title>
          .
          <source>In PODS</source>
          , pages
          <volume>159</volume>
          {
          <fpage>170</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <source>Discovering Conditional Functional Dependencies. TKDE</source>
          ,
          <volume>23</volume>
          (
          <issue>5</issue>
          ):
          <volume>683</volume>
          {
          <fpage>698</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <source>Reasoning about Record Matching Rules. PVLDB</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <volume>407</volume>
          {
          <fpage>418</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ginsburg</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Hull</surname>
          </string-name>
          .
          <article-title>Order dependency in the relational model</article-title>
          .
          <source>Theoretical Computer Science</source>
          ,
          <volume>26</volume>
          (
          <issue>1</issue>
          {2):
          <volume>149</volume>
          {
          <fpage>195</fpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. L.
          <string-name>
            <surname>Golab</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Karlo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Korn</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Saha</surname>
            , and
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Sequential dependencies</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <volume>574</volume>
          {
          <fpage>585</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huhtala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karkk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Porkka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          . TANE:
          <article-title>An E cient Algorithm for Discovering Functional and Approximate Dependencies</article-title>
          .
          <source>Computer Journal</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ):
          <volume>100</volume>
          {
          <fpage>111</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Venkatasubramanian</surname>
          </string-name>
          .
          <article-title>Metric Functional Dependencies</article-title>
          . In ICDE,
          <fpage>1291</fpage>
          -
          <lpage>1294</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>M. Malkemus</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhattacharjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cranston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Koo</surname>
          </string-name>
          .
          <article-title>Predicate Derivation and Monotonicity Detection in DB2 UDB</article-title>
          .
          <string-name>
            <surname>In</surname>
            <given-names>ICDE</given-names>
          </string-name>
          ,
          <fpage>939</fpage>
          -
          <lpage>947</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Silva</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pearson</surname>
          </string-name>
          .
          <article-title>Exploiting database similarity joins for metric spaces</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>12</issue>
          ):
          <year>1922</year>
          {
          <year>1925</year>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Di erential dependencies: Reasoning and discovery</article-title>
          .
          <source>TODS</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>16</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>J. Szlichta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Godfrey</surname>
            , and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gryz</surname>
          </string-name>
          .
          <source>Fundamentals of Order Dependencies. PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <volume>1220</volume>
          {
          <fpage>1231</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>J. Szlichta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Godfrey</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gryz</surname>
            , W. Ma, P. Pawluk, and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zuzarte</surname>
          </string-name>
          .
          <article-title>Queries on dates: fast yet not blind</article-title>
          .
          <source>In EDBT 497-502</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>J. Szlichta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Godfrey</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gryz</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zuzarte</surname>
          </string-name>
          .
          <source>Expressiveness and Complexity of Order Dependencies. PVLDB</source>
          <volume>6</volume>
          (
          <issue>14</issue>
          ):
          <fpage>1858</fpage>
          -
          <lpage>1869</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>M. Volkovs</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Szlichta</surname>
            , and
            <given-names>R. J.</given-names>
          </string-name>
          <string-name>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Continuous data cleaning</article-title>
          .
          <source>In ICDE</source>
          , pages
          <volume>244</volume>
          {
          <fpage>255</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>