-

On Axiomatization and Inference Complexity over a Hierarchy of Functional Dependencies

Jaroslaw Szlichta

jaroslaw.szlichta@uoit.ca 1

Lukasz Golab

lgolab@uwaterloo.ca 2

Divesh Srivastava

divesh@research.att.com 0 0 AT&T Labs-Research , New Jersey , USA 1 University of Ontario Institute of Technology , Oshawa , Canada 2 University of Waterloo , Waterloo , Canada

Functional dependencies (FDs) have recently been extended for data quality purposes with various notions of similarity instead of strict equality. We study these extensions in this paper. We begin by constructing a hierarchy of dependencies, showing which dependencies generalize others. We then focus on an extension of FDs that we call Antecedent Metric Functional Dependencies (AMFDs). An AMFD asserts that if two tuples have similar but not necessarily equal values of the antecedent attributes, then their consequent values must be equal. We present a sound and complete axiomatization as well as an inference algorithm for AMFDs. We compare the axiomatization of AMFDs to those of the other dependencies, and we show that while the complexity of inference for some FD extensions is quadratic or even co-NP complete, the inference problem for AMFDs remains linear, as in traditional FDs.

Poor data quality is a bottleneck to e ective business decision-making. Big data initiatives are likely to take longer, cost more, and deliver fewer bene ts without clean data. The ability to store data is no longer a problem: according to a survey of 586 senior executives conducted in June 2011 by the Economist Intelligence Unit (EIU) [ 1 ], less than 20% indicated data storage as a problem; however more than 50% rated data management tasks such as cleaning as problematic.

With the interest in data analytics at an all-time high, data quality has become a critical issue in research and practice. Integrity constraints, which specify the intended semantics and attribute relationships, are commonly used to characterize and ensure data quality [ 6, 20 ]. In particular, Functional Dependencies (FDs), which have traditionally been used in schema design, have recently been extended for data consistency purposes. An FD asserts that if two tuples agree on the left-hand-side attributes, then they must also agree on the right-handside attributes. The idea behind various extensions of FDs is to replace strict equality with some notion of similarity, either on the left-hand-side (see, e.g, source

A B C Matching Dependencies [ 7, 5, 9 ]), on the right-hand-side (see, e.g, Metric Functional Dependencies [ 13 ] and Sequential Dependencies [ 11 ]), or on both sides of the dependency (see, e.g., Di erential Dependencies [ 16 ]).

In this paper, we study these generalizations of FDs. Our rst objective is to construct a hierarchy of dependencies, revealing which ones (strictly) generalize others, and comparing their axiomatization and complexity of inference.

We then introduce a particular generalization of FDs that we call Antecedent Metric FDs (AMFDs). An AMFD asserts that if two tuples have similar but not necessarily equal values of the antecedent attributes, then their consequent values must be equal; we will compare AMFDs with related dependencies in Section 2.

To illustrate the utility of AMFDs, consider the movie data set shown in Table 1, which was put together from multiple data sources. In the process of merging data from various sources, it is often the case that small variations occur. For example, one source might report the movie A Beautiful Mind to have a running time of 135 minutes, as shown in Table 1, while another source may refer to the same movie as A Beaut. Mind and the third one as Beautiful Mind. An AMFD ftitle, year, directorg 7! flengthg indicates that movies with similar titles, years, and directors (up to some distance threshold, as we will discuss in Section 2) must have equal lengths. Of course, we assume that the semantics are such that two similar movie titles, made in similar years, by similar director names do in fact refer to the same movie.

An FD ftitle, year, directorg ! flengthg would not require the three length values in Table 1 to be equal, even though they refer to the same movie. Thus, AMFDs generalize FDs and can express the additional semantics of similarity.

The inference problem is to determine whether a dependency is logically entailed by a set of dependencies. For FDs, the inference problem has been well studied in previous work [ 3, 4 ]. We prove that while AMFDs are more expressive than FDs and have a more complex axiomatization, their complexity of inference remains linear.

The contributions of this paper are as follows. 1. Hierarchy: we construct a hierarchy of dependencies, showing which ones generalize others and comparing their complexity of reasoning. Our hierarchy shows which dependencies are practical and which are hard to reason about, and suggests further research on identifying tractable extensions of FDs. 2. FD extension: we introduce AMFDs, which describe integrity constraints on tuples with similar attribute values and are useful in data cleaning. 3. Axiomatization: we present a sound and complete axiomatization for AMFDs.

Axiomatization is a rst necessary step to designing an e cient inference procedure. Our axiomatization reveals interesting insights about inference Relations { A bold capital letter represents a relation schema: R. Italic capital letters near the beginning of the alphabet represent single attributes: A and B. { A small bold capital letter in italic represents a relation (a table): t. { Small italic letters near the end of the alphabet denote tuples: s and t. { Small italic letters near the beginning of the alphabet denote attribute values: a, b and c. A small italic letter m denotes a similarity metric. Sets { Italic capital letters near the end of alphabet stand for sets of attributes: X { XY is shorthand for X [ Y . Likewise, AX or XA stand for X [ fAg. rules over AMFDs. For instance, the Re exivity and Augmentation axioms, which hold for traditional FDs, are not necessary true for AMFDs. 4. Inference Procedure: we develop an inference procedure for AMFDs that runs in linear time in the complexity of the schema4. We implemented the inference algorithm and experimentally veri ed its e ciency.

The remainder of this paper is organized as follows. In Section 2, we review previous work, formally de ne AMFDs, and present a hierarchy of dependencies. In Sections 3 and 4, we present a sound and complete axiomatization and an inference procedure for AMFDs, respectively, and we compare the axiomatization to those of other related dependencies. We conclude the paper in Section 5. 2 2.1

Fundamentals AMFDs

We provide notational conventions in Table 2. To accommodate small variations in the attribute values on the left-hand-side of the dependency, we de ne AMFDs (De nition 2). This is a departure from traditional FDs which enforce equality on both sides. Before we de ne AMFDs, we rst de ne a similarity operator with a distance threshold.

De nition 1. (similarity) For every attribute A in a relational schema R, we assume a binary similarity relation ( m; ) w.r.t. some similarity metric m and a threshold parameter 0. Speci cally, for two tuples s and t, s[A] m; t[A] i m(s[A], t[A]) . Metric m satis es standard properties; it is symmetric, satis es the triangle inequality and identity of indiscernibles, i.e., m(a, b) = 0 i a = b. For two tuples s, t in relation t over R, we write s[X] m, t[X] to mean s[A1] m1; 1 t[A1], ..., s[An] mn; n t[An], where X = fA1, ..., Ang, m = [m1, ..., mn] and = [ 1, ..., n]. 4 Our inference procedure is e cient because it is done at the schema level, which is much smaller than the size of the data.

Next, we de ne AMFDs. By de nition, AMFDs generalize FDs.

De nition 2. (AMFD) Let X and Y be two sets of attributes, and let mX and X be metrics and thresholds for attributes X. Then, X 7! Y denotes an antecedent metric FD (AMFD), read as X metrically functionally determines Y . Let R be a relation schema that contains the attributes that appear in X and Y , and let t be a relation instance of R. Relation t satis es X 7! Y (t j= X 7! Y ), i for all tuples s, t 2 t, s[X] mX ; X t[X] implies s[Y ] = t[Y ]. An AMFD X 7! Y is said to hold for R, written as R j= X 7! Y , i for each admissible relational instance t of R, relation t satis es X 7! Y . An AMFD X 7! Y is trivial i for all t, t j= X 7! Y .

Example 1. (AMFD) Assume metrics mtitle and mdirector are edit distances with thresholds title = 6 and director = 0, respectively, in Table 1 (movie relation). Also assume that metric myear is an integer distance with a threshold year = 0. Therefore, Table 1 satis es the AMFD ftitle, year, directorg 7! flengthg. 2.2

Related Work AMFDs and traditional FDs are speci ed over a single relation. However, AMFDs replace strict equality on the left-hand-side of the dependency with similarity. Dependencies de ned over a single relation with similarity on the right-handside, called Metric FDs, were proposed by Koudas et al. [ 13 ]. We call them Consequent Metric FDs (CMFDs) to distinguish them from AMFDs. The veri cation problem over CMFDs was studied in [ 13 ], which is to decide whether the instance satis es a prescribed set of dependencies. However, axiomatization and inference were not considered.5

Bertossi et al. [ 5 ], Fan [ 7 ] and Fan et al. [ 9 ] studied Matching Dependencies (MDs), which are object-identi cation constraints across multiple relations. MDs also enforce similarity rather than equality on the left-hand-side, but allow arbitrary Boolean similarity functions. (These similarity functions only need to satisfy re exivity, symmetry and subsumption of equality.) On the other hand, AMFDs are de ned over a single relation and only allow a restricted notion of similarity, namely thresholds over similarity metrics (recall De nition 1). Fan et al. presented a sound and complete axiomatization6 and a quadratic-time inference procedure for MDs.

Pointwise Order Dependencies (PODs) [ 10 ] consider order relationship rather than equality of attribute values. A relation satis es a POD X ,! Y if, for all tuples s and t, for every attribute A in X, sA op t A implies that for every attribute B in Y sB op t B, where op f<; >; ; ; =g. For example, in relation 5 Some of the authors of this paper solved the axiomatization and inference problems for CMFDs in a paper currently under submission. 6 It is stated in Fan et al. [ 9 ] (without a proof of completeness) that a complete axiomatization for MDs consists of 11 axioms, but only 9 sound axioms are presented. sequential id TimePolls (Table 3), the POD fdate>g ,! fyear=; month=; day>g holds; however, the POD fdate>g ,! fyear=; month=; day g does not hold. Ginsburg and Hull [ 10 ] present a sound and complete axiomatization for PODs and show that the inference problem for them is co-NP-complete.

PODs are de ned over sets of attributes. On the other hand, Lexicographical Order Dependencies (LODs) are de ned over lists of attributes [ 17, 19 ]. LODs describe the relationship among lexicographical orderings of sets of tuples. This is the notion of order used in SQL and in query optimization, as per the order by operator (nested sort). A relation satis es a LOD X ,! Y if any list of its tuples that satisfy order by X also satis es order by Y; however, not necessarily vice versa. (X and Y denote lists of attributes.) For instance, in relation TimePolls, the LODs timestamp ,! date and [date] ,! [year, month, day ] are true. The default direction of the SQL order by is ascending. This can be generalized to order-by's that mix asc and desc directions, e.g., order by name asc, age desc. For example, in relation TimePolls, the LOD [ sequential id desc] ,! [timestamp asc] holds. Szlichta et al. present a sound and complete axiomatization for lexicographical order dependencies and show that the inference problem for LODs is co-NP-complete [ 17, 19 ].

Another constraint for ordered data, sequential dependencies (SDs), was introduced in Golab et al. [ 11 ]. For example, the SD sequential id ,![4;5] timestamp means that after sorting the data by the attribute sequential id, the gaps between consecutive timestamps are between 4 and 5. This particular SD holds in the TimePolls relation; however, the SD sequential id ,![6;7] timestamp does not hold. Golab et al. present a framework for discovering which subsets of the data obey a given SD, but axiomatization and inference were not considered.

SDs were generalized in Song and Chen [ 16 ] by introducing gaps (di erential functions) on both sides of the dependency and named Di erential Dependencies (DDs). For instance, in the relation TimePolls, the DDs sequential id[1;1] ,! timestamp[4;5] and fdate[0;1]g ,! fyear[0;0]; month[0;0]; day[0;1]g hold. However, the DD sequential id[1;1] ,! timestamp(5;6] does not hold. Song and Chen present an axiomatization and show that inference problem for DDs is co-NPcomplete. dependencies, their complexity of inference is bookended by their immediate ancestors and descendants in the hierarchy.

We say that a dependency class A generalizes dependency class B i there is a semantically preserving mapping of any dependency of class B into a set of dependencies of class A. Class A strictly generalizes class B i A generalizes B, however, B does not generalize A. The hierarchy in Figure 1 shows which dependencies strictly generalize others; due to space constraints, proofs will appear in extended version of this paper.

MDDs

(not studied) MDs (quadratic)

Limited AMDs

(linear)

DDs (co-NP-complete) FDs (linear)

AMFDs

(linear)

CMFDs

(linear - Footnote 5)

PODs

(co-NP-complete)

LODs

(co-NP-complete)

SDs (not studied)

For example, DDs strictly generalize SDs. In our example involving Table 3, with the SD sequential id ,![4;5] time, consecutive sequence numbers can be simulated by using on the left-hand-side of the DD a similarity metric which returns distance one if two numbers are consecutive and zero otherwise. Axiomatization and complexity of inference for SDs are open problems. However, since our hierarchy indicates that DDs strictly generalize SDs, the upper bound on the complexity of inference for SDs is co-NP complete.

Similarly, DDs strictly generalize PODs (which strictly generalize LODs [ 19 ]). For instance, a POD A ,! B is equivalent to a DD A[0;+1] ,! B[ 1;0]. This suggests that the complexity results for PODs can be adapted to DDs and did not have to be re-developed from scratch in [ 16 ]. AMFDs and CMFDs are also subsumed by DDs, since DDs allow similarity both on the left-hand-side and right-hand side. (Limited AMDs are introduced in Section 3.) Both AMFDs and CMFDs strictly generalize FDs by replacing equality with similarity.

Axiomatization Soundness and Completeness

We now present an axiomatization for AMFDs, analogous to Armstrong's axiomatization for FDs [ 3, 4 ]. This provides a formal framework for reasoning about 1. Void

X 7! fg 2. Transitivity

If X 7! Y and Y 7! Z then X 7! Z 3 Composition

If X 7! Y and Z 7! W

then XZ 7! Y W 4 Decomposition

If X 7! Y and Z Y

then X 7! Z Fig. 2: Axiomatization for AMFDs.

5 Reduce

If XZ 7! Y and X 7! Z

then X 7! Y 6 Limited Re exivity

If Y X and Y = 0 then X 7! Y AMFDs. The axioms give insights into how AMFDs behave and reveal how dependencies logically follow from others, which is not easily evident when reasoning from rst principles. Also, a sound and complete axiomatization is necessary for an e cient inference procedure (see Section 4).

The axioms for AMFDs are presented in Figure 2. Recall that fg denotes an empty set. Two of the axioms generate trivial dependencies that are always true: Void and Limited Re exivity. Below we introduce additional inference rules that follow from the axioms in Figure 2. These will be used throughout the paper, particularly to prove that our AMFD axioms are complete.

Lemma 1. (Left Augmentation) If X 7! Y , then XZ 7! Y .

Proof. By Void and Composition it follows that XZ 7! Y . tu Lemma 2. (Union) If X 7! Y and X 7! Z, then X 7! Y Z.

Proof. By Composition it follows that X 7! Y Z. tu

Next, we de ne closure over AMFDs. The closure of a set of attributes X is the set of attributes that X logically determines given a set of AMFDs F . De nition 3. (Closure X+) The AMFD-closure of set of attributes X, denoted X+, w.r.t. a set of AMFDs F using axioms I = f1{6g in Figure 2, is de ned as, X+ = fA j F ` X 7! Ag.

Lemma 3 tells us whether a dependency follows from F using our axioms. Lemma 3. (Closure for AMFDs) F ` X 7! Y , if and only if Y X+.

Proof. Let Y = fA1, :::, Ang. Assume Y X+. By de nition of X+, X 7! Ai for all i 2 f1; :::; ng. Therefore, by the Union axiom, X 7! Y . To prove the other direction, suppose X 7! Y follows from the axioms. For each i 2 f1; :::; ng, X 7! Ai by Decomposition, so Y X+. tu Theorem 1. (Completeness) AMFD axioms are sound and complete. Proof. The soundness proof (if F ` X 7! Y , then F j= X 7! Y ) is trivial. We just have to show that each axiom is true. We present the completeness proof (if F j= X 7! Y , then F ` X 7! Y ). We consider a table t with two rows, whose template is shown in Table 4. We divide the attributes of t into three subsets: X+, the set N , consisting of attributes in X that are not in the closure of X7, 7 For a traditional FD X 7! Y , by Re exivity all the attributes in X are also in X+.

However, this is not true for AMFDs, as we will show in Example 2. and all the remaining attributes. All the attributes of the rst row have the value a, while for the second row, the attributes in X+ are a's, the attributes in N are b's and the other attributes are c's. Without loss of generality, assume that for all the attributes A in N and other attributes over t we use the same metric m, and that a and b are similar (a m; A b) but not equal. Also, assume that the values a and c are not similar (a 6 m; A c), and hence not equal.

We rst show that all dependencies in the set of AMFDs F are satis ed by table t (t j= F ). Since the AMFD axioms are sound, AMFDs inferred from F are true. Note that by Void and Limited Re exivity, all trivial AMFDs are satis ed in table t. Assume V 7! Z is in F but is not satis ed by table t. Therefore, V fX+ [ N g because otherwise two rows of t are not similar on some attribute of V since a 6 m; A c, and consequently an AMFD V 7! Z would not be violated. Moreover, Z cannot be a subset of X+ (Z 6 X+), or else V 7! Z would be satis ed by t. Let A be an attribute of Z not in X+. Since the dependency V 7! Z is in F , by Decomposition, V 7! A. Let V1 be a maximal set of attributes such that V1 V and V1 X+. Let V2 be a maximal set of attributes such that V2 V and V2 N . By Union and De nition 3 of closure, X 7! X+. Therefore, by Left Augmentation and Reduce, XV2 7! A. Since N = fA j A 2 X and A 62 X+g, V2 X. Hence, X 7! A, which is a contradiction.

Our remaining proof obligation is to show that any AMFD not inferable from the set of AMFDs F with our axioms (F 6` X 7! Y ) is not true (F 6j= X 7! Y ). Suppose it is satis ed (F j= X 7! Y ). It follows by the construction of table t that Y X+; otherwise, two rows of table t agree or are similar on X but disagree on some attribute A from Y . Since Y X+, by Lemma 3 it can be inferred that X 7! Y , which is a contradiction. Thus, whenever X 7! Y does not follow from F by the AMFD axioms, F does not logically imply X 7! Y . That is, the axiom system is complete over AMFDs, which ends the proof . tu The axiomatization for AMFDs is more involved than its FDs counterpart. A sound and complete axiomatization for traditional FDs consists of only three axioms: Re exivity, Augmentation and Transitivity. Interestingly, Re exivity (if Y X, then X 7! Y ) is not necessary true for AMFDs.

Example 2. (lack of Re exivity) Consider table t (Table 4). Assume again that the values a and b are not equal (a 6= b) but they are similar (a m; A b) for each attribute A in N . Let attributes fBCDg N . Therefore, the AMFDs BCD 7! BCD and BCD 7! BC are not satis ed in t because the values are similar on the left hand side of the dependencies, but not equal on their right hand side.

Similarly, Augmentation, which is another axiom for FDs, does not necessary hold for AMFDs. (Augmentation states that if X 7! Y then XZ 7! Y Z.)

We replaced Re exivity with Void and Limited Re exivity in the axiomatization for AMFDs. Lack of Augmentation forced us to add Composition and Decomposition to the axiomatization. Note that Left Augmentation (Theorem 1) holds for AMFDs. Since Re exivity does not hold for AMFDs, we had to add Reduce. Only Transitivity (base axiom for FDs) was preserved in axiomatization.

We also studied an axiomatization for a simpli ed version of MDs over a single table, rather than multiple tables as originally de ned in [ 5, 7, 9 ]. We call these limited AMDs. The main di erence between limited AMDs and AMFDs is that the former allow arbitrary similarity functions while the latter employ thresholds on similarity metrics. A sound and complete axiomatization for limited AMDs consists of the following ve axioms: Void, Transitivity, Composition, Decomposition and Reduce. The proof will appear in the extended version of this paper; it is a simpli ed version of the proof of Theorem 1. In comparison to AMFDs, Limited Re exivity does not hold for limited AMDs. A sound and complete axiomatization for a full class of MDs is more complex (Footnote 6), as it incorporates axioms that allow us to reason over multiple relations.

Table 5 compares the axiomatization of AMFDs and AMDs with other dependency classes. We point out several interesting observations below.

In contrast to MDs and AMFDs, the distance (gap) functions for DDs are de ned at the dependency level for each attribute instead of the schema level. Therefore, Transitivity for DDs additionally requires an order relation over differential functions. For instance, if we have the dependency \if the date di erence for two tuples is 30 days, then price $50", then the dependency \if the date di erence for two tuples is 30 days, then price $40" must also hold.

There is an extra axiom for DDs (Impropriety) that accommodates inconsistencies between dependencies (this problem does not arise in AMFDs and MDs). For example, the following two dependencies are inconsistent since it is not posAlgorithm 1 Inference procedure for AMFDs Input: A set of AMFDs F, and a set of attributes X.

Output: The closure of X with respect to F. 1: F unused F ; n 0 2: X n W where W = fA j A 2 X and A = 0g if 9 V 7! Z 2 F unused and V

X n+1 X n [ Z

F unused fV 7! Zg

fX n [ X g then sible to instantiate a relation that satis es both of them: a) if the date di erence for two tuples is 30 days, then price = $50; and b) if the date di erence for two tuples is 30 days, then price > $50. Similarly, Augmentation and Re exivity have to be modi ed for DDs to accommodate di erent distance functions used by di erent dependencies on the same attribute. For instance, di erent distance functions for the same attribute may result in the same actual distance.

Interestingly, as we traverse the hierarchy of dependencies, the number of axioms does not necessary decrease. There are 6 axioms for AMFDs, 7 for PODs and 6 for LODs versus 4 for DDs; however, there are 3 axioms for FDs at the bottom of hierarchy. There are two reasons for this. First, the axioms for DDs are quite complex. Second, as we go down the hierarchy, the dependencies become more specialized and therefore we may need more axioms to express their restricted semantics, e.g., lack of Re exivity. As the dependencies become more generalized, some axioms must be weakened, e.g., Limited Re exivity. 4

Inference Procedure

A goal of a dependency theory is to develop algorithms for the inference problem. Inference for DDs is co-NP-complete [ 16 ] and for MDs it is quadratic [ 7 ]. Since DDs and MDs generalize AMFDs, this sets an upper bound for the complexity of inference for AMFDs. However, computing closure, X+, for AMFDs can be done more e ciently. It takes time proportional to the length of the dependencies in F , written out (linear time), which is as e cient as for FDs. (The complexity of inference for limited AMDs is also linear; the proof will appear in the extended version of this paper.) Algorithm 1 presents an inference procedure for AMFDs. Our experiments have shown that it is e cient. For 10 AMFDs prescribed over a dataset generated by the UIS Database [ 2 ], the algorithm runs in time 1ms. Example 3. (inference) Let F = fAB 7! C, ABC 7! EG, EG 7! Hg denote the set of AMFDs. Also, let C = 0 and D > 0 for all attributes D in ABEGH Let us calculate the closure of set of attributes AB with Algorithm 1: 1) X 0 = fg; 2) X 1 = C; 3)X 2 = CEG; 4) X 3 = CEGH.

The closure of AB is CEGH. For traditional FDs, the closure of AB is ABCEGH. Theorem 2. (inference) Alg. 1 correctly computes the closure X+ over AMFDs. Proof. First we show by induction on k that if Z is placed in X k in Algorithm 1, then Z is in X+.

Base case: k = 0. By Limited Re exivity, X 7! W , where W = fA j A 2 X and A = 0g.

Induction step: k > 0. Assume that X k 1 only consists of the attributes in X+. Suppose Z is placed in X k because V W 7! Z 2 F unused, such that V X k 1 and W X. Since V X k 1, we know by the induction hypothesis that V X+. Hence, by Lemma 3, X 7! V . Therefore, since XV 7! V Z by Composition, then by Reduce and Decomposition X 7! Z. Thus, Z is in X+.

Now we prove the opposite: if Z is in X+, then Z is in the set returned by Algorithm 1. Suppose Z is in X+ but Z is not in the set returned by Algorithm 1. Consider table t similar to that in Table 4. Table t has two tuples that agree on attributes in X n, are similar but not equal on attributes X that are not subset of X n, and disagree on all other attributes. We claim that t satis es F . If not, let P 7! Q be a dependency in F that is violated by t. Then P X n [ X and Q cannot be a subset of X n [ X, if the violation happens. We used a similar argument in the proof of Theorem 1. Thus, by Algorithm 1, Lines 4{7, there exists X n+1, which is a contradiction. tu 5

Conclusions and Future Work

In this paper, we developed a hierarchy of dependency classes and laid out the theoretical foundations for AMFDs, which generalize traditional FDs. In future work, we plan to investigate the following problems.

{ Determining whether a given AMFD holds on a given relation, and using AMFDs for data cleaning, similarly to how FDs were employed in previous data cleaning work [ 6, 7 ]. { Algorithms for automatic discovery of dependencies have been proposed for some dependencies, such as FDs and CFDs [ 8 ]. Similarly, we plan to study algorithms for discovering AMFDs. { We plan to explore an inference framework for multiple dependencies. For example, the following inference rules hold: a) if an AMFD X 7! Y , then a CMFD X 7! Y ; b) if an AMFD X 7! Y , then an FD X ! Y ; c) if an FD X ! Y , then a CMFD X 7! Y . These rules along with the axioms for AMFDs (Figure 2) and CMFDs (Footnote 5), are sound for the integrated inference problem with FDs, AMFDs and CMFDs. However, an open question is if this rule set is complete and what is the complexity of the inference problem. { Integrity constraints have been widely used in query optimization. For instance, FDs and LODs have been shown to be useful in simplifying queries

Economist

Intelligence Unit analysis , http://www.economistinsights.com/technologyinnovation/analysis/big-data.

UIS

Data Generator , http://www.cs.utexas.edu/users/ml/riddle/data.html.

Armstrong . Dependency Structures of Database relationships . In Proceedings of the IFIP Congress , pages 580 { 583 , 1974 .

Beeri and

Bernstein . Computional Problems Related to the Design of Normal Form Relational Schemas . TODS 4 ( 1 ):, 4 ( 1 ): 30 { 59 , 1979 .

Bertossi ,

Kolahi , and

Lakshmanan . Data cleaning and query answering with matching dependencies and matching functions . In ICDT , pages 268 { 279 , 2011 .

Beskales ,

I. F.

Ilyas , and

Golab . Sampling the repairs of functional dependency violations under hard constraints . PVLDB , 3 ( 1 ): 197 { 207 , 2010 .

Fan . Dependencies Revisited for Improving Data Quality . In PODS , pages 159 { 170 , 2008 .

Fan ,

Geerts ,

Li , and

Xiong . Discovering Conditional Functional Dependencies. TKDE , 23 ( 5 ): 683 { 698 , 2011 .

Fan ,

Jia ,

Li , and

Ma . Reasoning about Record Matching Rules. PVLDB , 2 ( 1 ): 407 { 418 , 2009 .

10.

Ginsburg and

Hull . Order dependency in the relational model . Theoretical Computer Science , 26 ( 1 {2): 149 { 195 , 1983 .

11. L. Golab , H.

Karlo , F.

Korn , A.

Saha , and D.

Srivastava . Sequential dependencies . PVLDB , 2 ( 1 ): 574 { 585 , 2009 .

12.

Huhtala ,

Karkk ,

Porkka , and

Toivonen . TANE: An E cient Algorithm for Discovering Functional and Approximate Dependencies . Computer Journal , 42 ( 2 ): 100 { 111 , 1999 .

13.

Koudas ,

Saha ,

Srivastava , and

Venkatasubramanian . Metric Functional Dependencies . In ICDE, 1291 - 1294 , 2009 .

14. M. Malkemus , P. S. ,

Bhattacharjee ,

Cranston ,

Lai , and

Koo . Predicate Derivation and Monotonicity Detection in DB2 UDB . In

ICDE

, 939 - 947 , 2005 .

15.

Y. N.

Silva and

Pearson . Exploiting database similarity joins for metric spaces . PVLDB , 5 ( 12 ): 1922 { 1925 , 2012 .

16.

Song and

Chen . Di erential dependencies: Reasoning and discovery . TODS , 36 ( 3 ): 16 , 2011 .

17. J. Szlichta , P.

Godfrey , and J.

Gryz . Fundamentals of Order Dependencies. PVLDB , 5 ( 11 ): 1220 { 1231 , 2012 .

18. J. Szlichta , P.

Godfrey , J.

Gryz , W. Ma, P. Pawluk, and C.

Zuzarte . Queries on dates: fast yet not blind . In EDBT 497-502 , 2011 .

19. J. Szlichta , P.

Godfrey , J.

Gryz , and C.

Zuzarte . Expressiveness and Complexity of Order Dependencies. PVLDB 6 ( 14 ): 1858 - 1869 , 2013 .

20. M. Volkovs , F.

Chiang , J.

Szlichta , and R. J.

Miller . Continuous data cleaning . In ICDE , pages 244 { 255 , 2014 .