=Paper=
{{Paper
|id=Vol-1378/paper15
|storemode=property
|title=On Axiomatization and Inference Complexity over a Hierarchy of Functional Dependencies
|pdfUrl=https://ceur-ws.org/Vol-1378/AMW_2015_paper_15.pdf
|volume=Vol-1378
|dblpUrl=https://dblp.org/rec/conf/amw/SzlichtaGS15
}}
==On Axiomatization and Inference Complexity over a Hierarchy of Functional Dependencies==
<pdf width="1500px">https://ceur-ws.org/Vol-1378/AMW_2015_paper_15.pdf</pdf>
<pre>
    On Axiomatization and Inference Complexity
    over a Hierarchy of Functional Dependencies

             Jaroslaw Szlichta1 , Lukasz Golab2 , and Divesh Srivastava3
         1
              University of Ontario Institute of Technology, Oshawa, Canada
                              jaroslaw.szlichta@uoit.ca
                      2
                         University of Waterloo, Waterloo, Canada
                                  lgolab@uwaterloo.ca
                       3
                          AT&T Labs-Research, New Jersey, USA
                               divesh@research.att.com


      Abstract. Functional dependencies (FDs) have recently been extended
      for data quality purposes with various notions of similarity instead of
      strict equality. We study these extensions in this paper. We begin by
      constructing a hierarchy of dependencies, showing which dependencies
      generalize others. We then focus on an extension of FDs that we call
      Antecedent Metric Functional Dependencies (AMFDs). An AMFD as-
      serts that if two tuples have similar but not necessarily equal values of
      the antecedent attributes, then their consequent values must be equal.
      We present a sound and complete axiomatization as well as an inference
      algorithm for AMFDs. We compare the axiomatization of AMFDs to
      those of the other dependencies, and we show that while the complexity
      of inference for some FD extensions is quadratic or even co-NP complete,
      the inference problem for AMFDs remains linear, as in traditional FDs.


1    Introduction

Poor data quality is a bottleneck to effective business decision-making. Big data
initiatives are likely to take longer, cost more, and deliver fewer benefits without
clean data. The ability to store data is no longer a problem: according to a survey
of 586 senior executives conducted in June 2011 by the Economist Intelligence
Unit (EIU) [1], less than 20% indicated data storage as a problem; however more
than 50% rated data management tasks such as cleaning as problematic.
    With the interest in data analytics at an all-time high, data quality has be-
come a critical issue in research and practice. Integrity constraints, which specify
the intended semantics and attribute relationships, are commonly used to char-
acterize and ensure data quality [6, 20]. In particular, Functional Dependencies
(FDs), which have traditionally been used in schema design, have recently been
extended for data consistency purposes. An FD asserts that if two tuples agree
on the left-hand-side attributes, then they must also agree on the right-hand-
side attributes. The idea behind various extensions of FDs is to replace strict
equality with some notion of similarity, either on the left-hand-side (see, e.g,
                            Table 1: Movie Relation
                 source       title     length year director
                   A    A Beautiful Mind 135 2001 Ridley Scott
                   B     A Beaut. Mind    135 2001 Ridley Scott
                   C     Beautiful Mind 135 2001 Ridley Scott

Matching Dependencies [7, 5, 9]), on the right-hand-side (see, e.g, Metric Func-
tional Dependencies [13] and Sequential Dependencies [11]), or on both sides of
the dependency (see, e.g., Differential Dependencies [16]).
    In this paper, we study these generalizations of FDs. Our first objective is to
construct a hierarchy of dependencies, revealing which ones (strictly) generalize
others, and comparing their axiomatization and complexity of inference.
    We then introduce a particular generalization of FDs that we call Antecedent
Metric FDs (AMFDs). An AMFD asserts that if two tuples have similar but not
necessarily equal values of the antecedent attributes, then their consequent values
must be equal; we will compare AMFDs with related dependencies in Section 2.
    To illustrate the utility of AMFDs, consider the movie data set shown in
Table 1, which was put together from multiple data sources. In the process
of merging data from various sources, it is often the case that small variations
occur. For example, one source might report the movie A Beautiful Mind to have
a running time of 135 minutes, as shown in Table 1, while another source may
refer to the same movie as A Beaut. Mind and the third one as Beautiful Mind.
An AMFD {title, year, director} 7→ {length} indicates that movies with similar
titles, years, and directors (up to some distance threshold, as we will discuss in
Section 2) must have equal lengths. Of course, we assume that the semantics
are such that two similar movie titles, made in similar years, by similar director
names do in fact refer to the same movie.
    An FD {title, year, director} → {length} would not require the three length
values in Table 1 to be equal, even though they refer to the same movie. Thus,
AMFDs generalize FDs and can express the additional semantics of similarity.
    The inference problem is to determine whether a dependency is logically
entailed by a set of dependencies. For FDs, the inference problem has been well
studied in previous work [3, 4]. We prove that while AMFDs are more expressive
than FDs and have a more complex axiomatization, their complexity of inference
remains linear.
    The contributions of this paper are as follows.
 1. Hierarchy: we construct a hierarchy of dependencies, showing which ones
    generalize others and comparing their complexity of reasoning. Our hierarchy
    shows which dependencies are practical and which are hard to reason about,
    and suggests further research on identifying tractable extensions of FDs.
 2. FD extension: we introduce AMFDs, which describe integrity constraints on
    tuples with similar attribute values and are useful in data cleaning.
 3. Axiomatization: we present a sound and complete axiomatization for AMFDs.
    Axiomatization is a first necessary step to designing an efficient inference
    procedure. Our axiomatization reveals interesting insights about inference
                           Table 2: Notational Conventions
Relations
 – A bold capital letter represents a relation schema: R. Italic capital letters
   near the beginning of the alphabet represent single attributes: A and B.
 – A small bold capital letter in italic represents a relation (a table): t.
 – Small italic letters near the end of the alphabet denote tuples: s and t.
 – Small italic letters near the beginning of the alphabet denote attribute val-
   ues: a, b and c. A small italic letter m denotes a similarity metric.
Sets
 – Italic capital letters near the end of alphabet stand for sets of attributes: X
 – XY is shorthand for X ∪ Y . Likewise, AX or XA stand for X ∪ {A}.


    rules over AMFDs. For instance, the Reflexivity and Augmentation axioms,
    which hold for traditional FDs, are not necessary true for AMFDs.
 4. Inference Procedure: we develop an inference procedure for AMFDs that
    runs in linear time in the complexity of the schema4 . We implemented the
    inference algorithm and experimentally verified its efficiency.
    The remainder of this paper is organized as follows. In Section 2, we review
previous work, formally define AMFDs, and present a hierarchy of dependencies.
In Sections 3 and 4, we present a sound and complete axiomatization and an in-
ference procedure for AMFDs, respectively, and we compare the axiomatization
to those of other related dependencies. We conclude the paper in Section 5.


2     Fundamentals
2.1     AMFDs
We provide notational conventions in Table 2. To accommodate small variations
in the attribute values on the left-hand-side of the dependency, we define AMFDs
(Definition 2). This is a departure from traditional FDs which enforce equality
on both sides. Before we define AMFDs, we first define a similarity operator
with a distance threshold.

Definition 1. (similarity) For every attribute A in a relational schema R, we
assume a binary similarity relation (≈m,θ ) w.r.t. some similarity metric m and
a threshold parameter θ ≥ 0. Specifically, for two tuples s and t, s[A] ≈m,θ t[A]
iff m(s[A], t[A]) ≤ θ. Metric m satisfies standard properties; it is symmetric,
satisfies the triangle inequality and identity of indiscernibles, i.e., m(a, b) = 0
iff a = b. For two tuples s, t in relation t over R, we write s[X] ≈m, Θ t[X] to
mean s[A1 ] ≈m1 ,θ1 t[A1 ], ..., s[An ] ≈mn ,θn t[An ], where X = {A1 , ..., An }, m
= [m1 , ..., mn ] and Θ = [θ1 , ..., θn ].
4
    Our inference procedure is efficient because it is done at the schema level, which is
    much smaller than the size of the data.
      Next, we define AMFDs. By definition, AMFDs generalize FDs.

Definition 2. (AMFD) Let X and Y be two sets of attributes, and let mX
and ΘX be metrics and thresholds for attributes X. Then, X 7→ Y denotes an
antecedent metric FD (AMFD), read as X metrically functionally determines
Y . Let R be a relation schema that contains the attributes that appear in X and
Y , and let t be a relation instance of R. Relation t satisfies X 7→ Y (t |= X 7→
Y ), iff for all tuples s, t ∈ t, s[X] ≈mX ,ΘX t[X] implies s[Y ] = t[Y ]. An AMFD
X 7→ Y is said to hold for R, written as R |= X 7→ Y , iff for each admissible
relational instance t of R, relation t satisfies X 7→ Y . An AMFD X 7→ Y is
trivial iff for all t, t |= X 7→ Y .

Example 1. (AMFD) Assume metrics mtitle and mdirector are edit distances with
thresholds θtitle = 6 and θdirector = 0, respectively, in Table 1 (movie relation).
Also assume that metric myear is an integer distance with a threshold θyear =
0. Therefore, Table 1 satisfies the AMFD {title, year, director} 7→ {length}.


2.2     Related Work

AMFDs and traditional FDs are specified over a single relation. However, AMFDs
replace strict equality on the left-hand-side of the dependency with similarity.
Dependencies defined over a single relation with similarity on the right-hand-
side, called Metric FDs, were proposed by Koudas et al. [13]. We call them
Consequent Metric FDs (CMFDs) to distinguish them from AMFDs. The ver-
ification problem over CMFDs was studied in [13], which is to decide whether
the instance satisfies a prescribed set of dependencies. However, axiomatization
and inference were not considered.5
    Bertossi et al. [5], Fan [7] and Fan et al. [9] studied Matching Dependen-
cies (MDs), which are object-identification constraints across multiple relations.
MDs also enforce similarity rather than equality on the left-hand-side, but allow
arbitrary Boolean similarity functions. (These similarity functions only need to
satisfy reflexivity, symmetry and subsumption of equality.) On the other hand,
AMFDs are defined over a single relation and only allow a restricted notion of
similarity, namely thresholds over similarity metrics (recall Definition 1). Fan
et al. presented a sound and complete axiomatization6 and a quadratic-time
inference procedure for MDs.
    Pointwise Order Dependencies (PODs) [10] consider order relationship rather
than equality of attribute values. A relation satisfies a POD X ,→ Y if, for all
tuples s and t, for every attribute A in X, s A op t A implies that for every at-
tribute B in Y s B op t B , where op {<, >, ≤, ≥, =}. For example, in relation
5
  Some of the authors of this paper solved the axiomatization and inference problems
  for CMFDs in a paper currently under submission.
6
  It is stated in Fan et al. [9] (without a proof of completeness) that a complete
  axiomatization for MDs consists of 11 axioms, but only 9 sound axioms are presented.
                           Table 3: TimePolls Relation
              sequential id   timestamp      date year month day
                   1        20140201142320 20140201 2014 02  01
                   2        20140201142325 20140202 2014 02  02

TimePolls (Table 3), the POD {date> } ,→ {year= , month= , day > } holds; how-
ever, the POD {date> } ,→ {year= , month= , day ≤ } does not hold. Ginsburg and
Hull [10] present a sound and complete axiomatization for PODs and show that
the inference problem for them is co-NP-complete.
    PODs are defined over sets of attributes. On the other hand, Lexicographical
Order Dependencies (LODs) are defined over lists of attributes [17, 19]. LODs
describe the relationship among lexicographical orderings of sets of tuples. This
is the notion of order used in SQL and in query optimization, as per the order by
operator (nested sort). A relation satisfies a LOD X ,→ Y if any list of its tuples
that satisfy order by X also satisfies order by Y; however, not necessarily vice
versa. (X and Y denote lists of attributes.) For instance, in relation TimePolls,
the LODs timestamp ,→ date and [date] ,→ [year, month, day] are true. The
default direction of the SQL order by is ascending. This can be generalized to
order-by’s that mix asc and desc directions, e.g., order by name asc, age desc.
For example, in relation TimePolls, the LOD [−sequential id desc] ,→ [times-
tamp asc] holds. Szlichta et al. present a sound and complete axiomatization
for lexicographical order dependencies and show that the inference problem for
LODs is co-NP-complete [17, 19].
    Another constraint for ordered data, sequential dependencies (SDs), was in-
troduced in Golab et al. [11]. For example, the SD sequential id ,→[4,5] timestamp
means that after sorting the data by the attribute sequential id, the gaps be-
tween consecutive timestamps are between 4 and 5. This particular SD holds in
the TimePolls relation; however, the SD sequential id ,→[6,7] timestamp does
not hold. Golab et al. present a framework for discovering which subsets of the
data obey a given SD, but axiomatization and inference were not considered.
    SDs were generalized in Song and Chen [16] by introducing gaps (differential
functions) on both sides of the dependency and named Differential Dependen-
cies (DDs). For instance, in the relation TimePolls, the DDs sequential id[1,1]
,→ timestamp[4,5] and {date[0,1] } ,→ {year[0,0] , month[0,0] , day [0,1] } hold. How-
ever, the DD sequential id[1,1] ,→ timestamp(5,6] does not hold. Song and Chen
present an axiomatization and show that inference problem for DDs is co-NP-
complete.

2.3   Hierarchy of Dependencies
Figure 1 illustrates a hierarchy of the dependencies we discussed above as well
as a new class: MDDs. MDDs strictly generalize MDs and DDs by allowing dif-
ferential functions (on the left hand side and the right hand side) with arbitrary
similarity functions and allowing multiple tables. Below each dependency name,
we point out the complexity of inference. Observe that for the “not studied”
dependencies, their complexity of inference is bookended by their immediate
ancestors and descendants in the hierarchy.
    We say that a dependency class A generalizes dependency class B iff there
is a semantically preserving mapping of any dependency of class B into a set of
dependencies of class A. Class A strictly generalizes class B iff A generalizes B,
however, B does not generalize A. The hierarchy in Figure 1 shows which depen-
dencies strictly generalize others; due to space constraints, proofs will appear in
extended version of this paper.


                                MDDs
                               (not studied)


                MDs                                  DDs
               (quadratic)                      (co-NP-complete)


            Limited AMDs                                                  PODs              SDs
                  (linear)                                             (co-NP-complete)   (not studied)


                             AMFDs                CMFDs                   LODs
                              (linear)         (linear - Footnote 5)   (co-NP-complete)


                                                      FDs
                                                      (linear)


      Fig. 1: Hierarchy of dependencies and their complexity of inference.
    For example, DDs strictly generalize SDs. In our example involving Table 3,
with the SD sequential id ,→[4,5] time, consecutive sequence numbers can be
simulated by using on the left-hand-side of the DD a similarity metric which
returns distance one if two numbers are consecutive and zero otherwise. Axiom-
atization and complexity of inference for SDs are open problems. However, since
our hierarchy indicates that DDs strictly generalize SDs, the upper bound on
the complexity of inference for SDs is co-NP complete.
    Similarly, DDs strictly generalize PODs (which strictly generalize LODs [19]).
For instance, a POD A≥ ,→ B ≤ is equivalent to a DD A[0;+∞] ,→ B [−∞;0] . This
suggests that the complexity results for PODs can be adapted to DDs and did
not have to be re-developed from scratch in [16]. AMFDs and CMFDs are also
subsumed by DDs, since DDs allow similarity both on the left-hand-side and
right-hand side. (Limited AMDs are introduced in Section 3.) Both AMFDs and
CMFDs strictly generalize FDs by replacing equality with similarity.


3     Axiomatization

3.1   Soundness and Completeness

We now present an axiomatization for AMFDs, analogous to Armstrong’s ax-
iomatization for FDs [3, 4]. This provides a formal framework for reasoning about
 1. Void                     3 Composition                   5 Reduce
    X 7→ {}                    If X 7→ Y and Z 7→ W            If XZ 7→ Y and X 7→ Z
 2. Transitivity                 then XZ 7→ Y W                  then X 7→ Y
    If X 7→ Y                4 Decomposition                 6 Limited Reflexivity
       and Y 7→ Z              If X 7→ Y and Z ⊆ Y             If Y ⊆ X and ΘY = 0
      then X 7→ Z                then X 7→ Z                     then X 7→ Y
                          Fig. 2: Axiomatization for AMFDs.
AMFDs. The axioms give insights into how AMFDs behave and reveal how de-
pendencies logically follow from others, which is not easily evident when reason-
ing from first principles. Also, a sound and complete axiomatization is necessary
for an efficient inference procedure (see Section 4).
    The axioms for AMFDs are presented in Figure 2. Recall that {} denotes an
empty set. Two of the axioms generate trivial dependencies that are always true:
Void and Limited Reflexivity. Below we introduce additional inference rules that
follow from the axioms in Figure 2. These will be used throughout the paper,
particularly to prove that our AMFD axioms are complete.
Lemma 1. (Left Augmentation) If X 7→ Y , then XZ 7→ Y .
Proof. By Void and Composition it follows that XZ 7→ Y . t
                                                         u
Lemma 2. (Union) If X 7→ Y and X 7→ Z, then X 7→ Y Z.
Proof. By Composition it follows that X 7→ Y Z. t
                                                u
   Next, we define closure over AMFDs. The closure of a set of attributes X is
the set of attributes that X logically determines given a set of AMFDs F .
Definition 3. (Closure X + ) The AMFD-closure of set of attributes X, denoted
X + , w.r.t. a set of AMFDs F using axioms I = {1–6} in Figure 2, is defined
as, X + = {A | F ` X 7→ A}.
     Lemma 3 tells us whether a dependency follows from F using our axioms.
Lemma 3. (Closure for AMFDs) F ` X 7→ Y , if and only if Y ⊆ X + .
Proof. Let Y = {A1 , ..., An }. Assume Y ⊆ X + . By definition of X + , X 7→ Ai
for all i ∈ {1, ..., n}. Therefore, by the Union axiom, X 7→ Y . To prove the other
direction, suppose X 7→ Y follows from the axioms. For each i ∈ {1, ..., n}, X
7→ Ai by Decomposition, so Y ⊆ X + . t     u
Theorem 1. (Completeness) AMFD axioms are sound and complete.
Proof. The soundness proof (if F ` X 7→ Y , then F |= X 7→ Y ) is trivial. We
just have to show that each axiom is true. We present the completeness proof (if
F |= X 7→ Y , then F ` X 7→ Y ). We consider a table t with two rows, whose
template is shown in Table 4. We divide the attributes of t into three subsets:
X+ , the set N , consisting of attributes in X that are not in the closure of X 7 ,
7
    For a traditional FD X 7→ Y , by Reflexivity all the attributes in X are also in X + .
    However, this is not true for AMFDs, as we will show in Example 2.
                      Table 4: Table template for AMFDs.
               X + N = {A | A ∈ X and A 6∈ X + } other attributes
               a...a           a...a                   a...a
               a...a           b...b                   c...c
and all the remaining attributes. All the attributes of the first row have the value
a, while for the second row, the attributes in X + are a’s, the attributes in N are
b’s and the other attributes are c’s. Without loss of generality, assume that for
all the attributes A in N and other attributes over t we use the same metric m,
and that a and b are similar (a ≈m,θA b) but not equal. Also, assume that the
values a and c are not similar (a 6≈m,θA c), and hence not equal.
    We first show that all dependencies in the set of AMFDs F are satisfied by
table t (t |= F ). Since the AMFD axioms are sound, AMFDs inferred from F are
true. Note that by Void and Limited Reflexivity, all trivial AMFDs are satisfied
in table t. Assume V 7→ Z is in F but is not satisfied by table t. Therefore, V ⊆
{X + ∪ N } because otherwise two rows of t are not similar on some attribute of
V since a 6≈m,θA c, and consequently an AMFD V 7→ Z would not be violated.
Moreover, Z cannot be a subset of X + (Z 6⊆ X + ), or else V 7→ Z would be
satisfied by t. Let A be an attribute of Z not in X + . Since the dependency V
7→ Z is in F , by Decomposition, V 7→ A. Let V1 be a maximal set of attributes
 such that V1 ⊆ V and V1 ⊆ X + . Let V2 be a maximal set of attributes such that
 V2 ⊆ V and V2 ⊆ N . By Union and Definition 3 of closure, X 7→ X + . Therefore,
by Left Augmentation and Reduce, XV2 7→ A. Since N = {A | A ∈ X and
A 6∈ X + }, V2 ⊆ X. Hence, X 7→ A, which is a contradiction.
    Our remaining proof obligation is to show that any AMFD not inferable from
 the set of AMFDs F with our axioms (F 6` X 7→ Y ) is not true (F 6|= X 7→
Y ). Suppose it is satisfied (F |= X 7→ Y ). It follows by the construction of table
t that Y ⊆ X + ; otherwise, two rows of table t agree or are similar on X but
disagree on some attribute A from Y . Since Y ⊆ X + , by Lemma 3 it can be
inferred that X 7→ Y , which is a contradiction. Thus, whenever X 7→ Y does
not follow from F by the AMFD axioms, F does not logically imply X 7→ Y .
That is, the axiom system is complete over AMFDs, which ends the proof . t       u


3.2   Discussion
The axiomatization for AMFDs is more involved than its FDs counterpart. A
sound and complete axiomatization for traditional FDs consists of only three
axioms: Reflexivity, Augmentation and Transitivity. Interestingly, Reflexivity (if
Y ⊆ X, then X 7→ Y ) is not necessary true for AMFDs.

Example 2. (lack of Reflexivity) Consider table t (Table 4). Assume again that
the values a and b are not equal (a 6= b) but they are similar (a ≈m,θA b) for each
attribute A in N . Let attributes {BCD} ⊆ N . Therefore, the AMFDs BCD 7→
BCD and BCD 7→ BC are not satisfied in t because the values are similar on
the left hand side of the dependencies, but not equal on their right hand side.
                    Table 5: Comparison of Axiomatizations
     Dependency Class                               Axioms
         DDs [16]              Extended Reflexivity, Extended Augmentation,
                                      Extended Transitivity, Impropriety
          SDs [11]                                    N\A
         PODs [10]            Reflexivity, Augmentation, Transitivity, Reversal,
                                    Disjunction, Total Order, Impropriety
        LODs [17, 19]              Reflexivity, Transitivity, Augmentation,
                                         Suffix, Normalization, Chain
        MDs [7, 5, 9]              9 sound axioms (out of 11) appear in [9]
 limited AMDs [this paper] Void, Transitivity, Composition, Decomposition, Reduce
    AMFDs [this paper]         Void, Transitivity, Composition, Decomposition,
                                         Reduce, Limited Reflexivity
        CMFDs [13]                                 Footnote 5
        FDs [3, 4, 12]             Reflexivity, Augmentation, Transitivity


    Similarly, Augmentation, which is another axiom for FDs, does not necessary
hold for AMFDs. (Augmentation states that if X 7→ Y then XZ 7→ Y Z.)
    We replaced Reflexivity with Void and Limited Reflexivity in the axioma-
tization for AMFDs. Lack of Augmentation forced us to add Composition and
Decomposition to the axiomatization. Note that Left Augmentation (Theorem
1) holds for AMFDs. Since Reflexivity does not hold for AMFDs, we had to add
Reduce. Only Transitivity (base axiom for FDs) was preserved in axiomatization.
    We also studied an axiomatization for a simplified version of MDs over a
single table, rather than multiple tables as originally defined in [5, 7, 9]. We call
these limited AMDs. The main difference between limited AMDs and AMFDs
is that the former allow arbitrary similarity functions while the latter employ
thresholds on similarity metrics. A sound and complete axiomatization for lim-
ited AMDs consists of the following five axioms: Void, Transitivity, Composition,
Decomposition and Reduce. The proof will appear in the extended version of
this paper; it is a simplified version of the proof of Theorem 1. In comparison
to AMFDs, Limited Reflexivity does not hold for limited AMDs. A sound and
complete axiomatization for a full class of MDs is more complex (Footnote 6),
as it incorporates axioms that allow us to reason over multiple relations.
    Table 5 compares the axiomatization of AMFDs and AMDs with other de-
pendency classes. We point out several interesting observations below.
    In contrast to MDs and AMFDs, the distance (gap) functions for DDs are
defined at the dependency level for each attribute instead of the schema level.
Therefore, Transitivity for DDs additionally requires an order relation over dif-
ferential functions. For instance, if we have the dependency “if the date difference
for two tuples is ≤ 30 days, then price ≥ $50”, then the dependency “if the date
difference for two tuples is ≤ 30 days, then price ≥ $40” must also hold.
    There is an extra axiom for DDs (Impropriety) that accommodates inconsis-
tencies between dependencies (this problem does not arise in AMFDs and MDs).
For example, the following two dependencies are inconsistent since it is not pos-
Algorithm 1 Inference procedure for AMFDs
Input: A set of AMFDs F, and a set of attributes X.
Output: The closure of X with respect to F.
 1: F unused ← F ; n ← 0
 2: X n ← W where W = {A | A ∈ X and θA = 0}
 3: loop
 4:   if ∃ V 7→ Z ∈ F unused and V ⊆ {X n ∪ X } then
 5:      X n+1 ← X n ∪ Z
 6:      F unused ← F unused − {V 7→ Z}
 7:      n ← n+1
 8:   else
 9:      return X n
10:   end if
11: end loop


sible to instantiate a relation that satisfies both of them: a) if the date difference
for two tuples is ≤ 30 days, then price = $50; and b) if the date difference for two
tuples is ≤ 30 days, then price > $50. Similarly, Augmentation and Reflexivity
have to be modified for DDs to accommodate different distance functions used
by different dependencies on the same attribute. For instance, different distance
functions for the same attribute may result in the same actual distance.
    Interestingly, as we traverse the hierarchy of dependencies, the number of
axioms does not necessary decrease. There are 6 axioms for AMFDs, 7 for PODs
and 6 for LODs versus 4 for DDs; however, there are 3 axioms for FDs at
the bottom of hierarchy. There are two reasons for this. First, the axioms for
DDs are quite complex. Second, as we go down the hierarchy, the dependencies
become more specialized and therefore we may need more axioms to express
their restricted semantics, e.g., lack of Reflexivity. As the dependencies become
more generalized, some axioms must be weakened, e.g., Limited Reflexivity.

4    Inference Procedure
A goal of a dependency theory is to develop algorithms for the inference problem.
Inference for DDs is co-NP-complete [16] and for MDs it is quadratic [7]. Since
DDs and MDs generalize AMFDs, this sets an upper bound for the complexity of
inference for AMFDs. However, computing closure, X + , for AMFDs can be done
more efficiently. It takes time proportional to the length of the dependencies in
F , written out (linear time), which is as efficient as for FDs. (The complexity of
inference for limited AMDs is also linear; the proof will appear in the extended
version of this paper.) Algorithm 1 presents an inference procedure for AMFDs.
Our experiments have shown that it is efficient. For 10 AMFDs prescribed over
a dataset generated by the UIS Database [2], the algorithm runs in time ≤ 1ms.
Example 3. (inference) Let F = {AB 7→ C, ABC 7→ EG, EG 7→ H} denote the
set of AMFDs. Also, let θC = 0 and θD > 0 for all attributes D in ABEGH Let
us calculate the closure of set of attributes AB with Algorithm 1:
       1) X 0 = {}; 2) X 1 = C; 3)X 2 = CEG; 4) X 3 = CEGH.
The closure of AB is CEGH. For traditional FDs, the closure of AB is ABCEGH.

Theorem 2. (inference) Alg. 1 correctly computes the closure X + over AMFDs.

Proof. First we show by induction on k that if Z is placed in X k in Algorithm
1, then Z is in X + .
Base case: k = 0. By Limited Reflexivity, X 7→ W , where W = {A | A ∈ X and
θA = 0}.
Induction step: k > 0. Assume that X k−1 only consists of the attributes in X + .
Suppose Z is placed in X k because V W 7→ Z ∈ F unused , such that V ⊆ X k−1
and W ⊆ X. Since V ⊆ X k−1 , we know by the induction hypothesis that V ⊆
X + . Hence, by Lemma 3, X 7→ V . Therefore, since XV 7→ V Z by Composition,
then by Reduce and Decomposition X 7→ Z. Thus, Z is in X + .
    Now we prove the opposite: if Z is in X + , then Z is in the set returned by
Algorithm 1. Suppose Z is in X + but Z is not in the set returned by Algorithm
1. Consider table t similar to that in Table 4. Table t has two tuples that agree
on attributes in X n , are similar but not equal on attributes X that are not
subset of X n , and disagree on all other attributes. We claim that t satisfies F .
If not, let P 7→ Q be a dependency in F that is violated by t. Then P ⊆ X n
∪ X and Q cannot be a subset of X n ∪ X, if the violation happens. We used a
similar argument in the proof of Theorem 1. Thus, by Algorithm 1, Lines 4–7,
there exists X n+1 , which is a contradiction. t
                                               u


5   Conclusions and Future Work
In this paper, we developed a hierarchy of dependency classes and laid out the
theoretical foundations for AMFDs, which generalize traditional FDs. In future
work, we plan to investigate the following problems.

 – Determining whether a given AMFD holds on a given relation, and using
   AMFDs for data cleaning, similarly to how FDs were employed in previous
   data cleaning work [6, 7].
 – Algorithms for automatic discovery of dependencies have been proposed for
   some dependencies, such as FDs and CFDs [8]. Similarly, we plan to study
   algorithms for discovering AMFDs.
 – We plan to explore an inference framework for multiple dependencies. For
   example, the following inference rules hold: a) if an AMFD X 7→ Y , then a
   CMFD X 7→ Y ; b) if an AMFD X 7→ Y , then an FD X → Y ; c) if an FD X
   → Y , then a CMFD X 7→ Y . These rules along with the axioms for AMFDs
   (Figure 2) and CMFDs (Footnote 5), are sound for the integrated inference
   problem with FDs, AMFDs and CMFDs. However, an open question is if this
   rule set is complete and what is the complexity of the inference problem.
 – Integrity constraints have been widely used in query optimization. For in-
   stance, FDs and LODs have been shown to be useful in simplifying queries
    with group by and order by [14, 18, 17, 19] We believe that AMFDs can be
    used in similar ways to simplify SQL queries with similarity operators [15].


References
 1. Economist Intelligence Unit analysis, http://www.economistinsights.com/technology-
    innovation/analysis/big-data.
 2. UIS Data Generator, http://www.cs.utexas.edu/users/ml/riddle/data.html.
 3. W. Armstrong. Dependency Structures of Database relationships. In Proceedings
    of the IFIP Congress, pages 580–583, 1974.
 4. C. Beeri and P. Bernstein. Computional Problems Related to the Design of Normal
    Form Relational Schemas. TODS 4(1):, 4(1):30–59, 1979.
 5. L. Bertossi, S. Kolahi, and V. Lakshmanan. Data cleaning and query answering
    with matching dependencies and matching functions. In ICDT, pages 268–279,
    2011.
 6. G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional depen-
    dency violations under hard constraints. PVLDB, 3(1):197–207, 2010.
 7. W. Fan. Dependencies Revisited for Improving Data Quality. In PODS, pages
    159–170, 2008.
 8. W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering Conditional Functional De-
    pendencies. TKDE, 23(5):683–698, 2011.
 9. W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about Record Matching Rules.
    PVLDB, 2(1):407–418, 2009.
10. S. Ginsburg and R. Hull. Order dependency in the relational model. Theoretical
    Computer Science, 26(1–2):149–195, 1983.
11. L. Golab, H. Karloff, F.Korn, A. Saha, and D. Srivastava. Sequential dependencies.
    PVLDB, 2(1):574–585, 2009.
12. Y. Huhtala, J. Karkk, P. Porkka, and H. Toivonen. TANE: An Efficient Algorithm
    for Discovering Functional and Approximate Dependencies. Computer Journal,
    42(2):100–111, 1999.
13. N. Koudas, A. Saha, A. Srivastava, and S. Venkatasubramanian. Metric Functional
    Dependencies. In ICDE, 1291-1294, 2009.
14. M. Malkemus, P. S., B. Bhattacharjee, L. Cranston, T. Lai, and F. Koo. Predicate
    Derivation and Monotonicity Detection in DB2 UDB. In ICDE, 939-947, 2005.
15. Y. N. Silva and S. Pearson. Exploiting database similarity joins for metric spaces.
    PVLDB, 5(12):1922–1925, 2012.
16. S. Song and L. Chen. Differential dependencies: Reasoning and discovery. TODS,
    36(3):16, 2011.
17. J. Szlichta, P. Godfrey, and J. Gryz. Fundamentals of Order Dependencies.
    PVLDB, 5(11):1220–1231, 2012.
18. J. Szlichta, P. Godfrey, J. Gryz, W. Ma, P. Pawluk, and C. Zuzarte. Queries on
    dates: fast yet not blind. In EDBT 497-502, 2011.
19. J. Szlichta, P. Godfrey, J. Gryz, and C. Zuzarte. Expressiveness and Complexity
    of Order Dependencies. PVLDB 6(14): 1858-1869, 2013.
20. M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In
    ICDE, pages 244–255, 2014.

</pre>