Schema Design for Uncertain Databases? Anish Das Sarma, Jeffrey Ullman, Jennifer Widom {anish,ullman,widom}@cs.stanford.edu Stanford University Abstract. We address schema design in uncertain databases. Since uncertain data is rela- tional in nature, decomposition becomes a key issue in design. Decomposition relies on dependency theory, and primarily on functional dependencies. We study the theory of functional dependencies (FDs) for uncertain relations. We define several kinds of horizontal FDs and vertical FDs, each of which is consistent with conventional FDs when an uncertain relation doesn’t contain any uncertainty. In addition to standard forms of decompositions allowed by ordi- nary relations, our FDs allow more complex decompositions specific to uncertain data. We show how our theory of FDs can be used for lossless decomposition of uncertain relations. We then present algorithms and complexity results for three fundamental problems with respect to FDs over ordinary and uncertain relations: (1) Testing whether a relation instance satisfies an FD; (2) Finding all FDs satis- fied by a relation instance; and (3) Inferring all FDs that hold in the result of a query over uncertain relations with FDs. We also give a sound and complete ax- iomatization of horizontal and vertical FDs. We look at keys as a special case of FDs. Finally, we briefly consider uncertain data that contains confidence values. 1 Introduction With the recent increase in applications such as scientific and sensor databases, data cleaning and integration, information extraction, and approximate query processing, the field of uncertain databases is attracting considerable interest [5, 16, 19, 44]. While a large body of previous and recent work addresses issues like modeling (e.g., [3, 5, 13, 15, 25, 31, 28]), querying (e.g., [15, 17, 20, 24, 31, 41]), and designing systems (e.g., [6, 16, 19, 32, 44]) for uncertain data, there is little past work relevant to dependency theory for uncertain databases. Obviously there has been a significant amount of previous work in dependency theory for ordinary relations (refer to [36, 42]), and some work for XML data [8, 11], but none of this past work can be applied directly to uncertain databases. Dependency theory for uncertain databases introduces a number of new and inter- esting problems. Moreover, these dependencies give rise to better database designs and useful decompositions of uncertain relations, which are different in nature from decom- positions of ordinary relations. ? This work was supported by a Microsoft Graduate Fellowship, by the National Science Foun- dation under grants IIS-0324431 and IIS-0414762, and by grants from the Boeing and Hewlett- Packard Corporations. In this paper, we revisit the theory of functional dependencies (FDs) in the context of uncertain data. As in ordinary databases, FDs in uncertain databases can be useful in several ways: (1) FDs can be used to decompose uncertain relations, resulting in com- pression, faster querying, and a better overall design of the database; (2) The detection of keys (and their violations) can be useful in understanding and verifying properties of the data; (3) Knowledge of FDs can aid in efficient storage and indexing. At first, we might think that when data is uncertain, there is nothing nontrivial we can say about FDs. But we shall see that the mechanisms we know and love, both from FDs and from multivalued dependencies, appear in uncertain data! They even lead to decompositions with lossless joins or with other means to reconstruct the originals. While some decompositions are like the usual BCNF decompositions, some others are more complex in nature. We study FD theory for data models whose basic construct for uncertainty is alter- natives. Several previously proposed data models for uncertainty (for example, [4, 13, 15, 21, 25, 33, 38]) use similar constructs. The semantics of uncertain relations repre- sented in any data model (including those not based on alternatives) are defined through a set possible worlds, and we study functional dependencies directly over sets of pos- sible worlds as well. Alternatives in a tuple specify a nonempty finite set of possible values for the tuple. For example: (Thomas, Main St.) || (Tom, Maine St.) contains a tuple with two alternatives giving the two possible values for the tuple. Our data model is defined formally in Section 2. Our contributions are introduced briefly in the next few subsections, with details and results in Sections 4–9. We review past work and relate it to our contributions in Section 3, and we conclude with future work in Section 10. 1.1 FD Definitions and Axiomatization In Section 4 we define two kinds of FDs: horizontal and vertical. While horizontal FDs primarily capture dependencies across the alternatives within tuples, vertical FDs apply to a relation as a whole and capture dependencies across tuples. Over ordinary relations, our definition of horizontal and vertical FDs become conventional FDs. We then study definitions of functional dependency over sets of possible worlds and relate these definitions to horizontal and vertical FDs. In Section 5 we prove that for both horizontal and vertical FDs, the axioms of reflex- ivity, transitivity, and augmentation are sound and complete. Interestingly, our complete axiomatization for uncertain data is based on the same three axioms used for ordinary relations, even though the FD definitions appear quite different. 1.2 Decomposition In Section 6 we show that both horizontal and vertical FDs can be used for the lossless decomposition of an uncertain relation R. Given the FDs that hold in R, we give algo- rithms for the decomposition of R into a set of constituents, C(R). We then show how to obtain a SQL query Q such that executing Q on the constituents of R gives back R; i.e., Q(C(R)) ≡ R.1 1.3 Test, Find, and Infer In Section 7 we look at the following three problems, analyzing their complexity and giving algorithms for the tractable cases: – FD Testing: Given a relation R and a horizontal or vertical FD f , test whether R satisfies f . – FD Finding: Given a relation R, find all horizontal and vertical FDs that R satisfies. – FD Inference: Given a set of input relations R, the set of FDs FRi that hold in each Ri ∈ R respectively, and a query Q over R, determine all FDs that are guaranteed to hold in the result Q(R). Since our FDs degenerate to conventional FDs when there is no uncertainty, we also study the three problems above for the special case of ordinary relations, and point out interesting differences. For example, on performing a join query over ordinary relations, the set of FDs satisfied by the result can only increase. However, we can never infer any new FDs for a join query over uncertain relations! 1.4 Keys In Section 8 we study keys as a special case of FDs in greater detail. In ordinary rela- tions, keys indicate uniqueness of values. In uncertain relations, keys are more compli- cated. They can indicate the nonduplication of values horizontally, vertically, or in the entire relation. (Informally, a set of attributes X in R is nonduplicated if no X value appears twice.) First we revisit the issues of Section 1.3 for keys. We then study the relationship between horizontal and vertical keys in a relation R and the presence of nonduplicated attribute values in R. 1.5 Confidence Values Sometimes models for uncertain data include confidence values [13, 15, 16, 18, 25]. In Section 9 we briefly revisit the definitions and results from the paper when confidence values are present on alternatives. (Models with confidence values but no alternatives are probabilistic databases, not relevant to this paper.) We provide a stronger definition of FDs that allows for decomposition even in the presence of confidence values. 1 We use ≡ and not = because Q(C(R)) may return a different representation of the uncertainty in R, depending on the query-execution strategy. 2 Preliminaries Recall the definition of alternatives from the previous section. We define an uncertain relation R to be a bag of tuples, where each tuple is a nonempty, finite set of alter- natives. Semantically, an uncertain relation R represents a set of possible worlds (or possible instances), each of which is an ordinary relation. The possible worlds for an uncertain relation are a bag obtained by choosing one alternative value for each tuple, in all possible ways. Note that an uncertain relation has at least one possible world. An ordinary relation by definition has exactly one possible world. An uncertain rela- tion is equivalent to an ordinary relation if and only if every tuple contains exactly one alternative. Given an uncertain relation R, define the H-relation (for “horizontal relation”) of a tuple t ∈ R, denoted H-relation(t), as the ordinary relation consisting of all the alterna- tives in t. We say that two relations are equivalent if they have the same set of possible worlds. An uncertain relation with each tuple having a single alternative is equivalent to an ordinary relation. A model for uncertain data is unique if two different relations in the model can never be equivalent. The following theorem was proved in [22]. Theorem 1 (Uniqueness). The uncertain-relations model is unique. 2 We can compose a family of models using constructs similar to alternatives, e.g., with alternative values for one or a set of attributes (known as or-sets), or a combination of or-sets and tuple alternatives. None of the models composed of these constructs is more expressive than the uncertain relations we consider, and hence are not considered in the rest of the paper. 3 Related Work While the study of uncertainty in databases and the study of functional dependencies for ordinary relations have separately received considerable attention for several decades now, there is little history combining the two bodies of work. Dependency theory has been widely studied from the 70s and we refer the reader to [36, 42]. Uncertainty in databases also have been studied extensively in the last twenty years [2, 3, 24, 25, 31, 32], as well as more recently [5, 15–17, 19, 28, 41, 44]. However, this past work focuses on data modeling, query processing, and system design. The problem of dependency satisfaction for an uncertain relation with incomplete information was studied in [26, 30, 35]. Conceptually, given a dependency, the possible worlds of an uncertain relation are restricted to only those that satisfy the dependency. Similarly, [43] considers completions of relations with null values such that the com- pleted relation satisfies the constraint. More recently, in the information-source-tracking method [39] of modeling uncertainty, [40] addressed similar issues of adjusting the model and query answering in the presence of key constraints. Therefore, the focus in all these papers is to transform the uncertain relation into one that satisfies certain dependencies, and on query answering thereof. In comparison, we are interested in pre- serving the uncertain relation (and possible worlds) and defining when a dependency is satisfied by it. Note that an individual possible world need not satisfy the dependency in our scenario. Moreover, we address the problems of decomposition and inference, not considered in any of these previous papers. Reference [34] studies two kinds of functional dependencies for incomplete rela- tions: strong FDs, which hold in every possible world, and weak FDs, which hold in at least one possible world of the incomplete relation. Our work defines functional de- pendencies directly on the representation of uncertain relations. We exhibit benefits in decomposition when defining FDs directly over uncertain relations, instead of through possible worlds. Reference [7] briefly considers factorizing uncertain relations represented in the WSD data model. These factorizations can be thought of as decomposing a relation such that the cross product of the various components results in the original relation. While the work in [7] is restricted to cross-product factorizations, we are interested in a much wider class of decompositions. Specifically, we look at the traditional BCNF- like decompositions, as well as newer kinds of decompositions specific to uncertain data. Moreover, as in traditional database literature, our decompositions are based on functional dependencies, while functional dependencies or the associated problems of testing, finding, and inference are not addressed in [7]. Finally, our work is not to be confused with the area of inconsistent databases (e.g., [9, 10, 27, 45]), in which a database that does not satisfy a set of constraints is “repaired.” Possible repairs to the database result in a set of possible worlds, i.e., an uncertain database. Hence, an uncertain database is created because of violation of con- straints, but constraints are not defined on an uncertain database. 4 Functional Dependencies In this section we define functional dependencies over uncertain relations and prove some basic results about them. Throughout the paper we use R, S, T, . . . to refer to relations, A, B, . . . to denote single attributes, X, Y, . . . to denote sets of attributes, and when clear from the context, the relation name (e.g., R) to denote the set of all attributes in the relation. Section 4.1 presents horizontal FDs and Section 4.2 presents vertical FDs. As will be clear shortly, horizontal FDs are a straightforward adaptation of the conventional definition of FDs. Hence several results on horizontal FDs throughout the paper borrow from past literature on regular FDs. Vertical FDs are significantly more interesting and challenging. In Section 4.3 we relate both horizontal and vertical FDs to what they imply on possible worlds. 4.1 Horizontal FDs We define three kinds of horizontal FDs for an uncertain relation R. Definition 1 (Horizontal FD). 1. R satisfies FD X → Y of type H1 (denoted X →H1 Y ) if the conventional FD X → Y holds in the union of the H-relations of all tuples in R. 2. R satisfies FD X → Y of type H2 (denoted X →H2 Y ) if the conventional FD X → Y holds separately in the H-relations of all tuples in R. 3. A tuple t in R satisfies FD X → Y of type H3 (denoted X →H3 (t) Y ) if the conventional FD X → Y holds in the H-relation of tuple t. 2 A horizontal FD of type H1 is equivalent to the corresponding conventional FD if the relation R is an ordinary relation. All three FD types permit some lossless decomposi- tion of R, to be discussed in Section 6. When X →H2 Y holds in R, all H-relations in R satisfy X → Y , but the mapping between X and Y values could be different in each H-relation. In contrast when X →H1 Y holds in R, all H-relations in R satisfy X → Y with the same mapping between X and Y . X →H3 (t) Y allows some tuples (specifically, t) to satisfy X → Y while others don’t. The following straightforward theorem relates the three types of horizontal FDs. Theorem 2. For any uncertain relation R, (X →H1 Y ) ⇒ (X →H2 Y ) ⇒ (X →H3 (t) Y ), for any t in R. 2 Example 1. Consider the uncertain relations R, S, and T shown below. ID R(SSN, Name, Address) r1 (1,Thomas,Main St.) || (1,Thomas,Maine St.) r2 (2,Alice,Poplar Ave) ID S(SSN, Name, Address) s1 (1,Thomas,Main St.) || (1,Thomas,Maine St.) s2 (1,Tom,Main St.) || (2,Alice,Poplar Ave) ID T(SSN, Name, Address) t1 (1,Thomas,Main St.) || (1,Tom,Maine St.) t2 (2,Alice,Poplar Ave) R satisfies SSN→Name for all horizontal FD types, S satisfies SSN→Name for hor- izontal FD types H2 and H3 (si ) for all tuples si ∈ S but not H1 , and T satisfies SSN→Name only for FD type H3 (t2 ) and not for FD types H1 , H2 , or H3 (t1 ). 2 4.2 Vertical FDs Definition 2 (Vertical FD). An uncertain relation R satisfies a vertical FD X → Y (denoted X →V Y ) if and only if the following conditions hold: 1. ∀A ∈ (Y − X), the H-relation of each tuple in R satisfies X →→ A. (→→ denotes the conventional multivalued dependency.) 2. For any two tuples t1 , t2 in R, let T1x (T2x respectively) be the set of all tuples in H-relation(t1 ) (H-relation(t2 ) respectively) that have the value x for attribute X. Let S1x,A (S2x,A respectively) be the set of all distinct A values that appear in tuples of T1x (T2x respectively). Then ∀A ∈ (Y − X), ∀x, at least one of the following holds: (1) S1x,A = ∅, (2) S2x,A = ∅, (3) S1x,A = S2x,A . 2 Intuitively, for X →V Y to hold, a given X value should functionally determine the set of possible values for each attribute in (Y − X). The set of values for each attribute of (Y − X) only depends on the value of X and all these sets are independent of one another. In our definition we impose the condition X →→ A, ∀A ∈ (Y − X) and not the weaker X → → (Y − X) because under the weaker condition our definition would not allow certain kinds of decompositions allowed by our current definition (discussed in Section 6). More importantly, it would also not satisfy some desirable axioms such as the splitting rule2 , satisfied by our current definition (Section 5), as illustrated by the following example. Example 2. Consider the following relation R(SSN, Name, Address). ID R(SSN, Name, Address) r1 (1,Thomas,Main St.) || (1,Thomas,Maine St.) r2 (2,Bill,Poplar Ave) || (2,William,Poplar Ave) r3 (1,Thomas,Main St.) || (1,Thomas,Maine St.) || (2,Bill,Poplar Ave) || (2,William,Poplar Ave) R satisfies SSN→V NameAddress: It can be seen that SSN→→Name and SSN→ →Address in the H-relations of each tuple above (Condition 1 of Definition 2), and the set of Name and Address values for every distinct SSN value is the same for every H-relation (Condition 2 of Definition 2). R also satisfies SSN→V Name and SSN→V Address. Note that R does not satisfy any horizontal FD with only SSN on the left side. Now consider the following simple relation S(SSN, Name, Address). ID S(SSN, Name, Address) s1 (1,Tom,Main St.) || (1,Thomas,Maine St.) S does not satisfy SSN→V NameAddress. Since SSN→→NameAddress is sat- isfied in the only tuple’s H-relation in S, if we changed Condition 1 of the vertical FD definition to X → → (Y − X), then SSN→V NameAddress would be satisfied above. However, SSN→V Name or SSN→V Address would still not be satisfied, violating the splitting rule. 2 The following theorem shows that vertical FDs degenerate to regular FDs in the special case of an uncertain relation being equivalent to an ordinary relation. Theorem 3. If uncertain relation R is equivalent to an ordinary relation S, then the conventional FD X → Y holds for S if and only if X →V Y holds for R. 2 Proof. Let X →V Y hold in R. Then ∀A ∈ (Y −X), X →→ A (Condition 1 of vertical FD definition) is satisfied in the H-relation of each tuple in R. Since R is equivalent to 2 The splitting rule allows us to infer X → Z for any Z ⊆ Y when X → Y holds. an ordinary relation, every tuple has exactly one alternative. Therefore, the set of all A values corresponding to a given X value in Condition 2 is a singleton set. Hence by Condition 2, if two tuples in R have the same X value, they also have the same A value for every A ∈ (Y − X). Hence X → Y holds in S. Conversely, if X → Y holds in S, then both conditions of the vertical FD definition are satisfied for X →V Y in R, since each tuple in R corresponding to a tuple in S has one alternative. 4.3 Possible Worlds In this section we consider defining FDs in terms of possible worlds. Since the seman- tics of uncertain relations are based on a set of possible worlds, one may wonder why we can’t define FDs in terms of possible worlds, as in [34]. Next we consider the most natural definition of FDs in terms of possible worlds. We then relate this definition in terms of possible worlds to our definitions of horizontal and vertical FDs. Finally, we describe why horizontal and vertical FDs are more appropriate than the definition in terms of possible worlds, for schema design in uncertain databases. Definition 3 (Possible-World FD). An uncertain relation R, with possible worlds {P1 , . . . , Pm }, satisfies a possible-world FD X → Y (denoted X →P W Y ) if and only if each possible world Pi satisfies the conventional FD X → Y . Example 3. Consider the following single-tuple uncertain relation R(SSN, Name, Ad- dress). ID R(SSN, Name, Address) r1 (1,Tom,Main St.) || (1,Thomas,Maine St.) R has two possible worlds: P1 (SSN, N ame, Address) containing the single tuple (1,Tom,Main St.) and P2 (SSN, N ame, Address) contain- ing the single tuple (1,Thomas,Maine St.). Since both P1 and P2 satisfy SSN→NameAddress, R satisfies SSN→P W NameAddress. However, R does not satisfy the vertical FD SSN→V NameAddress, nor does it satisfy any of the horizontal FDs SSN→H1 NameAddress, SSN→H2 NameAddress, or SSN→H3 (r1 ) NameAddress. 2 Next let us try to understand how possible-world FDs relate to horizontal and ver- tical FDs. The example above showed that even if a possible-world FD X →P W Y holds, none of the corresponding horizontal or vertical FDs is guaranteed to hold. Con- versely. our next example shows that the possible-world FD cannot be inferred from horizontal FDs of type H2 or H3 (t), or from vertical FDs. Example 4. Consider relation S from Example 1: ID S(SSN, Name, Address) s1 (1,Thomas,Main St.) || (1,Thomas,Maine St.) s2 (1,Tom,Main St.) || (2,Alice,Poplar Ave) S satisfies SSN→Name for horizontal FD types H2 and H3 (si ) for all tuples si ∈ S. However, S does not satisfy SSN→P W Name: The possible world obtained by select- ing the first alternative from each of s1 and s2 does not satisfy the conventional FD SSN→Name. To prove that even vertical FDs don’t imply possible-world FDs, consider the re- lation R from Example 2. R satisfies SSN→V Name. However R does not satisfy SSN→P W Name since R has the following possible world, which does not satisfy SSN→Name. R(SSN, Name, Address) (1,Thomas,Main St.) (2,Bill,Poplar Ave) (2,William,Poplar Ave) 2 Finally, the following theorem shows that the only remaining implication between a horizontal or a vertical FD, and the corresponding possible-world FD holds. Theorem 4. For any uncertain relation R, (X →H1 Y ) ⇒ (X →P W Y ). 2 Proof. Let the union of the H-relations of all tuples in R be R. Since R satisfies (X →H1 Y ), R satisfies the conventional FD (X → Y ). Therefore, any subset of R also satisfies (X → Y ). Since each possible world of R is a subset of R, each possible world of R also satisfies (X → Y ); hence, R satisfies (X →P W Y ). While the definition of possible-world FDs described above is simple and intuitive, it isn’t suitable for schema design. In particular, unlike the definition of horizontal and vertical FDs, possible-world FDs in uncertain relations don’t give rise to useful decom- positions. Consider relation R from Example 3. While R satisfies SSN→P W Name, there is no lossless decomposition of R based on this FD. The standard BCNF decom- position of R based on SSN→Name is lossy. However, as we shall see in Section 6, both horizontal and vertical FDs always give rise to useful (BCNF-like or other) de- compositions. Hence, we focus on horizontal and vertical FDs for the rest of the paper. 5 Sound and Complete Axioms In this section we prove that the traditional Armstrong’s axioms for conventional FDs are also sound and complete for horizontal and vertical FDs over uncertain data. First we review the definitions of soundness and completeness, then Section 5.1 presents results for horizontal FDs and Section 5.2 for vertical FDs. Definition 4 (Soundness). A set of axioms A is sound if for any relation R and a set of FDs F satisfied by R, any FD f derived from F using A holds in R. 2 Definition 5 (Completeness). A set of axioms A is complete if for any set of FDs F, if FD f is true in every relation R satisfying F, then f can be derived from F using A. 2 Definition 6 (Strong Completeness). A set of axioms A is strongly complete if for any set of FDs F on attributes U , there exists a relation R(U ) satisfying all and only FDs that can be derived from F and A. 2 Intuitively strong completeness proves the existence of a relation which shows that any FD that cannot be derived from F and A is not implied by F. 5.1 Horizontal FDs Theorem 5. The following Armstrong’s axioms (where → stands for one of →H1 , →H2 , and →H3 (t) ) are sound and complete with respect to each type of horizontal FD: 1. Reflexivity: If Y ⊆ X, then X → Y . 2. Transitivity: If X → Y and Y → Z, then X → Z. 3. Augmentation: If X → Y and Z ⊆ W , then XW → Y Z. 2 Proof. For an uncertain relation R, consider the union of the H-relations S of all tuples in R. An FD X →H1 Y holds in S if and only if X → Y holds in S. Hence if any FD can be derived using a set of axioms, the corresponding H1 FD is sound and can be derived using the H1 axioms. Hence the Armstrong’s axioms are sound and complete for H1 FDs. Similarly considering H-relations for either a specific tuple t or independently for all tuples in R we argue that the Armstrong’s axioms are sound and complete for H3 (t) and H2 FDs respectively. Note that the above theorem applies to only one of H1 , H2 , and H3 (t) FDs at a time. However, if we have a combination of the different kinds of horizontal FDs, we can use Theorem 2 and Armstrong’s axioms to derive further dependencies that hold in the weakest form of horizontal FDs used to derive the result. Theorem 6. Armstrong’s axioms from Theorem 5 are strongly complete for H1 , H2 , and H3 (t) FDs with respect to uncertain relations. 2 Proof. Given a set F of horizontal FDs over a set of attributes U , we use the strong completeness of regular FDs [14] to construct an ordinary relation R(U ) which satisfies all and only regular functional dependencies derivable from F using the Armstrong’s axioms. Now we construct an uncertain relation S(U ) with a single tuple whose alter- natives are exactly the set of tuples in R. S(U ) satisfies all and only horizontal FDs derivable from F and the Armstrong’s axioms for horizontal FDs. 5.2 Vertical FDs The next three lemmas prove the soundness of Armstrong’s axioms for vertical FDs. We then prove the completeness and strong completeness of Armstrong’s axioms. Lemma 1 (Reflexivity). Given an uncertain relation R with sets of attributes X, Y ⊆ R, if Y ⊆ X, then X →V Y holds in R. 2 Proof. Since (Y − X) = ∅, it can be seen easily that both the conditions of Definition 2 are satisfied for X →V Y . Lemma 2 (Transitivity). Given an uncertain relation R with sets of attributes X, Y, Z ⊆ R, if X →V Y and Y →V Z, then X →V Z. 2 Proof. Consider a tuple t in R. By Definition 2, we have: (1) ∀Ti ∈ (Y − X), we have X →→ Ti in H-relation(t). (2) ∀Vj ∈ (Z − Y ), Y →→ Vj in H-relation(t). From (1), applying the Union rule for MVDs [14] on H-relation(t), we have X →→ (Y − X). Using Augmentation [14], we get: (3) X →→ Y in H-relation(t). Combining (2) and (3) above using the transitivity rule for MVDs [14], we have: (4) ∀Vj ∈ (Z − Y ), X →→ (Vj − Y ) in H-relation(t). Hence, (5) ∀Vj ∈ (Z − Y ), X →→ Vj in H-relation(t). Finally, since all attributes in (Z − X) are in at least one of (Z − Y ) and (Y − X), combining Equations (1) and (5), we get Condition 1 of Definition 2 for X →V Z: (6) ∀Ak ∈ (Z − X), X → → Ak in H-relation(t). Next we prove Condition 2 of Definition 2 for X →V Z. Now consider two tuples t1 and t2 in R, and a particular X value x. If X 6= x in all tuples of H-relation(t1 ) or all tuples of H-relation(t2 ), we have Condition 2 satisfied for X →V Z for X = x. If not, by the vertical FD definition for X →V Y , the set of all Y − X that appear in tuples having value x for attribute X is the same in H-relation(t1 ) and H-relation(t2 ). Call this set S1 . Consider a particular value s ∈ S1 . Since (Y − X) = s in at least one tuple each of H-relation(t1 ) and H-relation(t2 ), by the vertical FD definition for Y →V Z, the set of all A values for A ∈ (Z − Y ) values associated with s is the same in H-relation(t1 ) and H-relation(t2 ). Let this set be S2 (s). For any B ∈ (Z − X), the set of all B values that appears in tuples with value x for attribute X is the same in both H-relation(t1 ) and H-relation(t2 ): It is equal to the B attributes in ∪s∈S1 S2 (s) if B ∈ (Z − Y ) and the B attributes of S1 if B ∈ (Y − X). Hence, Condition 2 of Definition 2 for X →V Z holds. Lemma 3 (Augmentation). Given an uncertain relation R with sets of attributes X, Y, Z, W ⊆ R, if X →V Y and Z ⊆ W , then XW →V Y Z. 2 Proof. Since X →V Y , for every tuple t in R we have X →→ A in H-relation(t) for every A ∈ (Y − X). By the augmentation rule for MVDs over ordinary relations, we have XW → → A for all A ∈ (Y − X) and hence for all A ∈ (Y Z − XW ). Hence Condition 1 of Definition 2 for XW →V Y Z is satisfied. Now consider Condition 2 for XW = xw. Let t1 and t2 be two tuples in R. If either of H-relation(t1 ) and H-relation(t2 ) does not have any tuple with XW = xw, Condition 2 is satisfied. Consider an attribute A ∈ (Y Z − XW ) = (Y − XW ). Using Condition 2 for X →V Y , the set of all A values that appears with X = x is the same in H-relation(t1 ) and H-relation(t2 ). Let this set of Y − X values be S. For both H-relation(t1 ) and H-relation(t2 ) since X →→ A, A 6∈ XW , and since there exists a tuple with W = w, the set of all A values in tuples having XW = xw is also S. Hence Condition 2 is satisfied for all A ∈ (Y Z − XW ). Example 5. Recall Example 2 giving an uncertain relation in which the splitting rule is not satisfied by a weaker definition of vertical FDs. Our current definition satisfies the splitting rule as it follows from Armstrong’s axioms. 2 The following theorem proves the strong completeness, and hence also complete- ness, of Armstrong’s axioms for vertical FDs. Theorem 7 (Completeness). The Armstrong’s axioms are strongly complete for verti- cal FDs with respect to uncertain relations. 2 Proof. We use the completeness of Armstrong’s axioms for conventional FDs over or- dinary relations [14]. Consider a set of vertical FDs F over a set of attributes U . The strong completeness of regular FDs allows us to construct an ordinary relation R(U ) which satisfies all and only regular functional dependencies derivable from regular FD counterpart of F using the Armstrong’s axioms. Since R(U ) is also an uncertain rela- tion, using Theorem 3, we know that R(U ) satisfies all only those vertical FDs that can be derived from F and Armstrong’s axioms for vertical FDs. Theorem 8. The Armstrong’s axioms are a sound and complete axiomatization for ver- tical FDs with respect to uncertain relations. 2 Proof. The soundness of Armstrong’s axioms follows from Lemmas 1, 2, and 3 and the completeness follows from Theorem 7 6 Decomposition We address the problem of decomposition of an uncertain relation when it satisfies horizontal or vertical functional dependencies. We shall see that we can use techniques similar to those for conventional functional and multivalued dependencies. Some of our decompositions are like the usual BCNF decompositions for ordinary relations, but there are also new decomposition forms that can be used in some cases. We first briefly describe the standard possible-worlds semantics [3] of relational queries over uncertain relations, necessary for our discussion of decomposition, and then show how uncertain relations can be decomposed based on horizontal (Section 6.1) and vertical (Section 6.2) FDs. Consider uncertain database D (consisting of one or more uncertain relations) with possible worlds D1 , . . . , Dn . Each Di is a set of ordinary relations, one corresponding to each relation in the database. (When D contains multiple relations, the set of possible worlds of D is obtained by taking the cross-product of the set of possible worlds for each relation.) Let Q(Di ) be the ordinary relation obtained by evaluating Q on Di . The result of performing a relational query Q over D is an uncertain relation whose possible worlds are {Q(D1 ), . . . , Q(Dn )}, assuming such an uncertain relation exists. 6.1 Horizontal FDs Consider an uncertain relation R(XY Z) that satisfies X →H1 Y . We can decompose R into two relations R1 and R2 as follows. R1 (XZ) = ΠXZ (R), where Π stands for the project relational operator. R2 (XY ) is an ordinary relation consisting of all distinct XY values appearing in R.3 The decomposition of R into R1 and R2 is lossless. Informally, R1 retains all the alternatives of R but projects them onto XZ, and R2 stores the mappings between X and Y . Since X →H1 Y , each X value appears exactly once in R2 , resulting in a compressed representation of R. The following theorem proves that R can be obtained by joining R1 and R2 . Theorem 9. For any uncertain relation R(XY Z) satisfying X →H1 Y , if R1 (XZ) = ΠXZ (R) and R2 (XY ) is an ordinary relation containing all distinct XY values that appear in some alternative of R, then R ≡ (R1 1X R2 ).4 2 Proof. Let R have n possible worlds P1 (XY Z), . . . , Pn (XY Z). By the definition of R1 , the possible worlds of R1 are ΠXZ (P1 ), . . . , ΠXZ (Pn ). R2 has exactly one possi- ble world P (XY ) containing all distinct XY values in R. Hence, the possible worlds of (R1 1X R2 ) are (P 1X (ΠXZ (P1 ))), . . . , (P 1X (ΠXZ (Pn ))), which are equal to P1 , . . . , Pn since every XY value appearing in any Pi also appears in P . Example 6. Consider relation R from Example 1 satisfying SSN→H1 Name. After de- composing R as described above, we have the following two relations: R1 (SSN, Address) R2 (SSN, Name) (1,Main St.) || (1,Maine St.) (1,Thomas) (2,Poplar Ave) (2,Alice) 2 Next let us consider horizontal FD type H2 . Suppose relation R(XY Z) satisfies X →H2 Y . We now know that every H-relation in R satisfies X → Y but the same X value could be mapped to different Y values in the H-relations of different tuples. We give two schemes for decomposing R in this case. The first decomposition scheme introduces a new attribute I in R, denoting a unique tuple identifier for each tuple. All alternatives of a given tuple have the same value in R. Once we have unique identifiers in each tuple, we have the H1 horizontal FD IX →H1 Y satisfied in R. We can now decompose R into R1 (IXZ) and R2 (IXY ) as in the H1 FD case discussed above. By Theorem 9, the decomposition is lossless and R ≡ ΠXY Z (R1 1IX R2 ). The second decomposition scheme first horizontally partitions R such that X →H1 Y holds in every horizontal partition and then applies the decomposition described above. We partition the set of tuples in R, say {t1 , . . . , tn }, into m groups to form m 3 R2 can be obtained from R using a query containing the flatten operator [1] designed to create an ordinary relation from an uncertain relation. First project R onto XY , then flatten the result into an ordinary relation. 4 Recall “≡” represents equivalence of uncertain relations, i.e., same set of possible worlds. uncertain relations R1 (XY Z), . . . , Rm (XY Z). We describe the process of partition shortly. The key point is our partition ensures that X →H1 Y holds in every Ri . Each Ri is now decomposed into R1i (XZ) and R2i (XY ) based on the H1 horizontal FD. The following theorem, which follows from Theorem 9 and the horizontal partition of R into R1 , . . . , Rm , shows that our decomposition is lossless and that R can be obtained by a relational query over the decomposed components. Theorem 10. For any uncertain relation R(XY Z) satisfying X →H2 Y , the decom- position of R into R1i (XZ) and R2i (XY ), 1 ≤ i ≤ m, described above is lossless: R ≡ ∪i (R1i 1X R2i ). 2 Let us now turn to the horizontal partitioning of R into R1 , . . . , Rm . Intuitively, we would like to have the fewest partitions (i.e., smallest m) allowing us to decompose each Ri based on X →H1 Y . Two tuples t1 and t2 can appear together in any Ri if only if X →H1 Y holds in an uncertain relation composed of just t1 and t2 . In other words, t1 and t2 should not constitute a violation of the H1 FD. For every pair of tuples in R we check whether all X values are mapped to the same Y value in all alternatives of both the tuples. Let us suppose S stores all pairs of tuples that do constitute a vi- olation. Finding the fewest number of partitions is equivalent to solving the following NP-hard graph coloring problem, which can be solved approximately using well-known techniques [37]. Given a set U = {1, . . . , n} and a set S of pairs (i, j), 1 ≤ i < j ≤ n, let U1 , . . . , Um be a partition of U . That is, U = ∪m k=1 Uk and if k 6= l, Uk ∩ Ul = ∅. Find the smallest partition of U , i.e. minimize m, such that if (i, j) ∈ S, then ∀k i 6∈ Uk or j 6∈ Uk . Example 7. Consider the relation S from Example 1 satisfying SSN→H2 Name. Intro- ducing tuple identifiers s1 and s2 for the two tuples and then decomposing S according to the first scheme mentioned above, we have the following decomposed components: S(I, SSN, Name) S(I, SSN, Address) (s1 ,1,Thomas) (s1 ,1,Main St.) || (s1 ,1,Maine St.) (s2 ,1,Tom) (s2 ,1,Main St.) || (s2 ,2,Poplar Ave) (s2 ,2,Alice) Under the second scheme of decomposition s1 and s2 constitute different horizontal partitions, and each of them is decomposed similarly. 2 Finally, suppose R(XY Z) satisfies X →H3 (t) Y for a tuple t in R, we horizontally partition R into two components: R1 consisting of tuple t only and R2 consisting of the rest of the tuples. R1 is then partitioned based on X →H1 Y and R2 remains as is. 6.2 Vertical FDs Consider an uncertain relation R(XY Z) that satisfies X →V Y where Y = {A1 , . . . , Am }. Intuitively, the X value in any alternative uniquely determines the set of possible Ai values for every Ai ∈ Y , and these sets of values are independent of one an- other. We can therefore decompose R by retaining all the alternative values of R in one relation, and for each Ai , creating a relation that gives the mapping between every X value and the set of possible Ai values it determines. We have the following decomposi- tion of R into R0 (XZ), R1 (XA1 ), R2 (XA2 ), . . . , Rm (XAm ). R0 (XZ) = ΠXZ (R). Ri (XAi ) contains n tuples, one for each distinct X value that appears in R. For a given X-value x0 , suppose the set of all Ai values that appear in tuples having value x0 for attributes X is {a1i , . . . , aki i }. Then the tuple corresponding to X = x0 has ki alterna- tives (x0 , a1i ), . . . , (x0 , aki ).5 The following theorem shows that R can be reconstructed from its components using a relational query, thus proving that the decomposition is lossless. Theorem 11. For any uncertain relation R(XY Z) satisfying X →V Y , where Y = {A1 , . . . , Am }, let R0 (XZ) = ΠXZ (R). For all Ai ∈ Y , let Ri (XAi ) be an uncertain relation containing one tuple for every distinct X-value in R, with alternatives corre- sponding to all possible Ai values it appears with. Then R ≡ (R0 1X R1 1X R2 1X . . . 1X Rm ). 2 Proof. If R contains K distinct X values in its alternatives, by the definition of Ri (XAi ), each possible world of Ri contains K tuples, one for each distinct X value. The possible worlds of Ri list all possible combinations of mappings between X and Ai . Now consider S(XY ) = (R1 1X R2 1X . . . 1X Rm ). The possible worlds of S are obtained by joining all combinations of possible worlds from R1 to Rm . Since every possible world of every Ri contains exactly K tuples containing all the distinct X values, every possible world of S also has K tuples containing all distinct K values. The possible worlds of S list all combinations of mappings between X and Y . Let these possible worlds of S be Q1 (XY ), . . . , Ql (XY ). Let the set of possible worlds of R be P W (R) = {P1 (XY Z), . . . , Pn (XY Z)}. Since R0 (XZ) = ΠXZ (R), the set of possible worlds of R0 is {ΠXZ (P1 ), . . . , ΠXZ (Pn )}. Hence, the set of possible worlds of T = R 1X S, P W (T ), is: {(Q1 1X (ΠXZ (P1 ))), . . . , (Q1 1X (ΠXZ (Pn ))), . . . , (Qn 1X (ΠXZ (Pn )))}. We claim P W (R) = P W (T ): Every Pi (XY Z) ∈ P W (T ) because ΠXZ (Pi ) is joined with every combination of mappings between X and Y , and in particular it is joined with some Qj containing exactly the mappings between X and Y present in Pi . Conversely, every (Qj 1X (ΠXZ (Pi ))) is equal to Pk for some k: Consider the specific mappings between X and Y in Qj and the tuples in ΠXZ (Pi ). Pi was obtained from R by choosing one alternative from every tuple in R. Applying the definition of vertical FDs on each H-relation of R, if we replace the Y values in the alternatives chosen in Pi with Y values determined by the mappings in Qj , the resulting alternatives must also be present in the H-relations of R. Hence if we pick the resulting alternatives from each tuple, we get some possible world Pk , which is exactly equal to (Qj 1X (ΠXZ (Pi ))). 5 Ri can be obtained from R using a query containing the group-alts operator [1] designed to create an ordinary relation from an uncertain relation. First project R onto XAi , then group alternatives by X. Example 8. Consider the following slightly modified relation from Example 2 satisfy- ing SSN→V NameAddress. R(SSN, Name, Address,Phone) (1,Tom,Main St.,P1) || (1,Tom,Maine St.,P1) (2,Bill,Poplar Ave,P2) || (2,William,Poplar Ave,P2) (1,Tom,Main St.,P1) || (1,Tom,Maine St.,P1) || (2,Bill,Poplar Ave,P3) || (2,William,Poplar Ave,P3) Decomposing R as described above, we have the following three components: R0 (SSN, Phone) R1 (SSN, Name) (1,P1) (1,Tom) (2,P2) (2,Bill) || (2,William) (1,P1) || (2,P3) R2 (SSN, Address) (1,Main St.) || (1,Maine St.) (2,Poplar Ave) 2 Let us now revisit the weaker definition of vertical FDs discussed in Section 4.2, which imposes X → → Y , instead of X →→ Ai , in the H-relation of every tuple. Under this weaker definition the decomposition of R described above would be not lossless. We can however decompose R into R0 (XZ) and R1 (XY ), where like Ri (XAi ), R1 gives mappings between X and Y . The following abstract example shows that the de- composition into Ri ’s can result in an exponentially more compact representation of R. Example 9. Consider the following single tuple relation R(X, A1 , . . . , An , Z) contain- ing alternatives with X = 1, Z = 1, and all 2n combinations of 0s and 1s for A1 to An . R(X,A1 ,. . .,An ,Z) (1,0,. . .,0,1) || . . . || (1,1,. . .,1,1) R satisfies X →V Y , where Y = {A1 , . . . , An }. Decomposing R into two relations R0 (XZ) and R1 (XY ) gives: R0 (X,Z) R1 (X,A1 ,. . .,An ) (1,1) (1,0,. . .,0) || . . . || (1,1,. . .,1) However, decomposing into n + 1 relations gives an exponentially (in n) more compact representation: R0 (X,Z) R1 (X,A1 ) Rn (X,An ) ... (1,1) (1,0) || (1,1) (1,0) || (1,1) 2 7 Test, Find, and Infer In this section we consider the problems of testing (Section 7.1), finding (Section 7.2), and inference (Section 7.3) of FDs. We study the problems for conventional FDs over ordinary relations as well as for horizontal and vertical FDs over uncertain relations. Not surprisingly the solution techniques for all of the problems are similar for conven- tional and horizontal FDs, so we discuss them together. While the complexity of some problems remains the same for conventional or horizontal FDs over ordinary relations and vertical FDs over uncertain relations, some problems become more challenging for the case of vertical FDs. Even more interestingly, inference of join queries over uncer- tain relations is actually easier than that for ordinary relations! 7.1 Testing Recall the testing problem: Given a relation instance R and an FD f , determine whether R satisfies f . The testing problem is simple for conventional FDs over ordinary rela- tions. Even horizontal FD testing can be reduced easily to conventional FD testing: For horizontal FD type H1 we test whether the union of the H-relation of all tuples in R satisfies f , for FD type H2 we test whether the H-relation of each tuple in R satisfies f , and for FD type H3 (t) we test whether H-relation(t) satisfies f . Next consider testing whether uncertain relation R having tuples t1 , . . . , tn satisfies vertical FD X →V Y , where (Y − X) = {A1 , A2 , . . . , Am }, Z = (R − Y − X). This problem also can be solved easily by checking for both the conditions of Defini- tion 2, as shown in Algorithm 1. 7.2 Finding Recall the finding problem: Given a relation instance R, find all FDs satisfied by R. Finding horizontal FDs in an uncertain relation is similar to finding conventional FDs over ordinary relations. For horizontal FD type H1 we find all FDs satisfied by the union of the H-relations of all tuples in R, for FD type H2 we find all FDs satisfied by the H-relation of every tuple in R, and for FD type H3 (t) we find all FDs satisfied by H-relation(t). We address the problem of finding conventional FDs over ordinary relations in Sec- tion 7.2 and turn to finding vertical FDs over uncertain relations in Section 7.2. 1: Condition 1, the H-relation(tj ) of each tuple tj in R satisfies X → → Ai , 1 ≤ i ≤ m: For every X-value x appearing in H-relation(tj ): 1. Find the set SZ of all Z-values that appear in tuples with value x for attribute X j 2. ∀i, find the set SA i (x) of all Ai -values that appear in tuples with value x for attribute X j j j 3. Compute the cross-product of sets SA 1 (x), . . ., SA m (x), SZ (x) and for every element (a1 . . . am aZ ) in the result check whether the tuple (xa1 . . . am aZ ) is present in H- relation(tj ). If not, return “X →V Y not satisfied.” 2: Condition 2: For every X-value x appearing in any alternative of R, for each Ai , check 1 n whether SA i (x) = . . . = SA i (x). If not, return “X →V Y not satisfied.” 3: Return “X →V Y satisfied.” Algorithm 1: Testing whether R satisfies X →V Y . Conventional FDs Before solving the problem of finding conventional FDs in ordinary relations, we define the notion of closed sets for a relation and give some properties. Pre- vious work such as [29] has studied algorithms for finding all FDs in ordinary relations. Here we take an alternative approach: We give algorithms for generating all closed sets of a relation, and show that closed sets completely characterize conventional FDs for ordinary relations. Consider an ordinary relation R. We define an instance-level notion of closed sets for R. A similar schema-level notion of closed sets was first introduced in [12] as a characterization of FDs. A set of attributes X in R is a closed set for R if for any attribute A 6∈ X, R does not satisfy X → A. Intuitively, no attribute outside of X is functionally determined by X. In this section we also use traditional FD rules (such as the union and transitivity rules) at the instance-level. All rules for FD’s hold on each relation instance. The following theorem shows that closed sets completely characterize the set of FDs satisfied by R. Theorem 12. Consider an ordinary relation R, and let S be the set of all closed sets for R. R satisfies X → Y if and only if ∀s ∈ S, (X ⊆ s) ⇒ (Y ⊆ s). 2 Proof. If: Suppose ∀s ∈ S, (X ⊆ s) ⇒ (Y ⊆ s) but R does not satisfy X → Y . Let X + be the set of all attributes Ai such that X → Ai . By the union rule of FDs [14] we have X → X + . Moreover, if A 6∈ X + , then R does not satisfy X + → A (because otherwise by transitivity X → A). Therefore X + is a closed set for R and since X → X, X ⊆ X + but Y 6⊆ X + , contradicting our assumption. Only if: Suppose R satisfies X → Y . Consider a closed set s ∈ S. Suppose X ⊆ s, then s → X, and X → Y . Hence by transitivity s → Y . Therefore Y ⊆ s. Algorithm 2 shows how to generate all the closed sets for R. We first compare all pairs of tuples in R to generate a base set of closed sets as follows. Given a pair of tuples t1 and t2 in R, if t1 and t2 have the same set of values for attributes in X and different values for all other attributes, then X is a closed set: for any attribute A 6∈ X, tuples t1 and t2 express a violation of X → A. Once we have the base set of closed 1: Base Set: Initialize base set of closed sets B = ∅. For every pair of tuples t1 and t2 in R: 1. Find the set of all attributes X in which t1 and t2 have the same value. 2. Set B = B ∪ {X}. 2: All Closed Sets: Let B = {b1 , . . . , bn }. Initialize S = {b1 }. For i = 2..n, do: 1. ∀sj ∈ S, add sj ∩ bi to S. 2. Set S = S ∪ {bi }. Algorithm 2: Finding all closed sets in R. sets obtained by pairwise tuple comparisons, we successively find intersections of these closed sets to generate all closed sets. The following lemma shows that Algorithm 2 generates only correct closed sets, and the next theorem shows that it generates all the closed sets in R. Lemma 4. If X and Y are closed sets, then X ∩ Y is also a closed set. 2 Proof. Let Z = X ∩ Y . For any attribute A 6∈ Z, either A 6∈ X or A 6∈ Y . Suppose without loss of generality A 6∈ X. Then X 6→ A. Therefore, Z 6→ A. To see why, if Z → A, then using X → Z and transitivity we get X → A. Theorem 13. Algorithm 2 generates all closed sets of R. 2 Proof. We show that any closed set of R can be obtained by successive intersections of the basic set of closed sets obtained by pairwise comparisons of tuples in R in Step 1. Suppose Z is a closed set. Let A1 , . . . , Am be all attributes of R not in Z. ∀i, Z 6→ Ai . Therefore, there exists a pair of tuples ti1 and ti2 in R such that t1 and t2 have the same value for all attributes of Z but have different values for attribute Ai . In Algorithm 2 suppose we constructed base closed set Si when we compared ti1 and ti2 . We have Z ⊆ Si and Ai 6∈ Si . Therefore, Z = (S1 ∩ . . . ∩ Sm ). Vertical FDs We give a PTIME (in the combined size of the input and output) algo- rithm for finding all vertical FDs that an uncertain relation R satisfies. We first find all vertical FDs of the form X →V A, where A ∈ R and X ⊆ R, that R satisfies; we can then use the union rule of FDs to find all vertical FDs satisfied by R. (Since the union rule follows from Armstrong’s axioms, vertical FDs satisfy the union rule.) Let us consider finding all FDs of the form X →V A for one attribute A. (We repeat the procedure for each attribute A in R.) Note that if R satisfies X →V A, then ∀X 0 ⊃ X, R also satisfies X 0 →V A. We can use this idea to start by testing (R − A) →V A. If (R − A) →V A is satisfied, for every possible subset Y of (R − A) obtained by removing a single attribute from (R − A), we test whether Y →V A is satisfied by R. So on, we recurse over every set of attributes Y for which Y →V A is satisfied by R, testing subsets of attributes if they haven’t been tested before. Our algorithm finds all nontrivial FDs with singleton right sides (i.e., FDs of the form X →V A where A 6∈ X) in polynomial time in the number of such FDs. 7.3 Inference Recall the inference problem: given a set of input relations R, for each Ri ∈ R the set of FDs FRi that hold in each Ri , and a query Q over R, determine all FDs that are guaranteed to hold in the result Q(R). We consider two variants of the problem: when we only have the schema of R, and when have the relation instances for R. We consider arbitrary SPJ queries whose results can always be represented using an uncertain relation. Note from Theorem 1, whenever the result is representable using an uncertain relation, there is a unique representation. Hence we are interested in inferring FDs for the unique representation of the result. Once again, the inference problem for horizontal FDs over uncertain relations is solved as in the case of conventional FDs over ordinary relations, studied in Section 7.3. Inference of vertical FDs over uncertain relations is considered in Section 7.3. Conventional FDs The intractability of the version of the inference problem where we only have the input relation’s schema has been established in previous work [2, 23]. When the input includes the schema and data, then the inference problem can easily be seen to be polynomially-solvable. Theorem 14. Given a set of input relation instances R and an SPJ query Q over R, we can infer all nontrivial FDs with singleton right sides that are satisfied by the result in PTIME (in the combined size of the input and output). 2 Proof. Since the query can be answered in PTIME, we first obtain the query result. Then, in PTIME, we can find all the FDs satisfied by the result as discussed in Sec- tion 7.2. Vertical FDs By Theorem 3, the hardness result from Section 7.3 carries over to the case of vertical FDs over uncertain relations as well. The subclass of join queries how- ever illustrates an interesting difference between inference in ordinary relations and uncertain relations. The following theorem shows that for join queries over uncertain relations, without looking at the input data, we cannot infer any new vertical FDs involv- ing the join attribute on the right side. Therefore, this version of the inference problem is trivial, and in fact easier than inferring conventional FDs over ordinary relations! Theorem 15. Given uncertain relations R(X, B), S(B, Y ), where X and Y are a set of one or more attributes, set of FDs FR , FS that hold in R and S respectively, and join query Q = (R 1B S), no FD with B on the right side is guaranteed to hold in the result of Q. 2 Proof. We construct uncertain relation instances R(A, B) and S(B, C), where R satis- fies A →V B, S satisfies B →V C, but the join of R and S does not satisfy A →V B. (Our construction can be extended easily for schemas with X and Y constituting more than one attribute.) R(A, B) has one tuple with two alternatives: [(a, b1 ) || (a, b2 )] and S(B, C) has two tuples: (b1 , c1 ) and (b2 , c2 ). Clearly R satisfies A →V B and S satisfies B →V C. The result T (A, B, C) = R 1 S has one tuple with two alternatives: [(a, b1 , c1 ) || (a, b2 , c2 )]. Although T still satisfies B →V C, it does not satisfy A →V B. Clearly when R does not satisfy A →V B, we can similarly construct S such that the join result still does not satisfy A →V B. Note for conventional FDs over ordinary relations, when a join query is performed over a set of input relations to obtain result R, all FDs involving attributes of R satisfied by any input relation are still satisfied by R. In addition, R may satisfy new FDs not satisfied by any input relation. Interestingly, vertical FDs over uncertain relations do not display this behavior: As shown by the example in the proof above, although A →V B was satisfied by R and both the attributes of R were projected onto the result, still the result did not satisfy A →V B. Finally, as in the case of conventional FDs over ordinary relations, for inference of SPJ queries when the input includes data, we first answer the query and then find all vertical FDs in the result. 8 Keys We study the special case of keys. Given an uncertain relation R, a set of attributes X, X ⊆ R, is an H1 , H2 , H3 (t), or vertical key if X →H1 R, X →H2 R, X →H3 (t) R, or X →V R respectively. It can be seen easily that all the algorithms and results of our paper naturally carry over for the special case of keys. Theorem 16. All definitions, algorithms, and results from this paper, and in particular the complexity result of Section 7.3 and the result of Theorem 15, carry over for keys as a special case of FDs. 2 Next we study the relationship of keys to the following notions of nonduplication of attribute values. Definition 7. Given an uncertain relation R and a set of attributes X in R, we say that X is: – Vertically nonduplicated (VND) if no two alternatives from different tuples in R have the same value for attributes in X. – Horizontally nonduplicated (HND) if no two alternatives of the same tuple in R have the same value for attributes in X. – Totally nonduplicated (TND) if no two alternatives in R have the same value for attributes in X. (Equivalent to VND and HND.) The following theorem summarizes all the implication relationships we can draw between the above notions and keys. The theorem lists the strongest forms of the relationships. We can of course infer others using relationships within the defini- tions above, such as (X is NDT) ⇒ (X is NDH), or relationships of FDs, such as (X →H1 R) ⇒ (X →H2 R). Theorem 17. 1. (X is TND) ⇒ (X →H1 R) 2. (X is TND) ⇒ (X →V R) 3. (X is HND) ⇔ (X →H2 R) 2 Proof. We prove each of the three statements of the theorem below: 1. Since X is TND, X is also nonduplicated in the union of the H-relation of all tuples in R. Hence X →H1 R. 2. Since X is TND, X is HND. Hence in the H-relation of each tuple in R, ∀A ∈ R, X → → A. Moreover, no two alternatives from different tuples in R have the same value for attribute X. Hence, X →V R. 3. If X is HND, then in the H-relation of each tuple in R, X is unique. Hence, X →H2 R. Conversely, if X →H2 R, X is HND as each tuple contains a set of alternatives. 9 Confidence Values Finally, we briefly look at uncertain relations with confidence values. Confidence values are attached to each alternative. For example: (Thomas, Main St.):0.6 || (Tom, Maine St.):0.4 The sum of the confidence values of all alternatives in each tuple is equal to 1. (Refer to [15] for more details, including query semantics.) Even in the presence of confidence values, horizontal FDs are defined as before. Decompositions based on horizontal FDs described in Section 6.1 are slightly modified as follows: Recall the decomposition of uncertain relation R into R1 containing the various alternatives and R2 containing the mappings between X and Y . The confidence values of each alternative are now carried over to alternatives in R1 and the confidence value of every alternative in R2 is 1. However, decomposition based on vertical FDs becomes more complicated in the presence of confidence values. Specifically, if we disregard the confidence values and decompose based on the techniques in Section 6.2, in many cases the decomposition is necessarily lossy. Moreover, even if the decomposition can be lossless, it is not clear where to store the confidence values of alternatives in R. To solve this problem, we strengthen the definition of vertical FDs by adding a third condition as follows. Definition 8 (Vertical FD). An uncertain relation R with confidence values satisfies a vertical FD X → Y (denoted X →VC Y ) if and only if the following conditions hold: 1. ∀A ∈ (Y − X), the H-relation of each tuple in R satisfies X →→ A. (→→ denotes the conventional multivalued dependency.) 2. For any two tuples t1 , t2 in R, let T1x (T2x respectively) be the set of all tuples in H-relation(t1 ) (H-relation(t2 ) respectively) that have the value x for attribute X. Let S1x,A (S2x,A respectively) be the set of all distinct A values that appear in tuples of T1x (T2x respectively). Then ∀A ∈ (Y − X), ∀x, at least one of the following holds: (1) S1x,A = ∅, (2) S2x,A = ∅, (3) S1x,A = S2x,A . 3. For any A ∈ (Y − X), for any two tuples t1 , t2 in R, if there exist alternatives a11 , a12 ∈ t1 and a21 , a22 ∈ t2 , such that a11 .(R − A) = a12 .(R − A), a21 .(R − A) = a22 .(R − A), a11 .A = a21 .A, and a12 .A = a22 .A, then c(a11 ) c(a12 ) c(a21 ) = c(a22 ) . (aij .Z is the set of values for attributes in Z in aij and c(aij ) is the confidence value for alternative aij . 2 Intuitively, for every attribute A in (Y − X) functionally determined by X, the value of X also determines the confidence values for distinct A values. Now, when we de- compose an uncertain relation based on vertical FD X →VC Y , the distribution of confidence values for each attribute A in (Y − X) is stored with the partition contain- ing A, and the distribution of confidence values for alternative in R are stored in the partition containing all the alternatives. We illustrate using the following example. Example 10. Consider relation R from Example 8 but now with confidence values. R(SSN, Name, Address,Phone) (1,Tom,Main St.,P1):0.4 || (1,Tom,Maine St.,P1):0.6 (2,Bill,Poplar Ave,P2):0.8 || (2,William,Poplar Ave,P2):0.2 (1,Tom,Main St.,P1):0.2 || (1,Tom,Maine St.,P1):0.3 || (2,Bill,Poplar Ave,P3):0.4 || (2,William,Poplar Ave,P3):0.1 R satisfies SSN→VC NameAddress, and decomposing R as described above, we have the following three components: R0 (SSN, Phone) R1 (SSN, Name) (1,P1):1.0 (1,Tom):1.0 (2,P2):1.0 (2,Bill):0.8 || (2,William):0.2 (1,P1):0.5 || (2,P3):0.5 R2 (SSN, Address) (1,Main St.):0.4 || (1,Maine St.):0.6 (2,Poplar Ave):1.0 2 10 Conclusions and Future Work As a step toward schema design in uncertain databases, we proposed a theory of func- tional dependencies for uncertain relations. We defined horizontal and vertical FDs, which give rise to useful lossless decompositions of uncertain relations. We provided a sound and complete axiomatization of both kinds of FDs. We gave algorithms for de- composition (and reconstruction thereafter) of uncertain relations. Then we looked at the problems of testing, finding and inference of FDs for ordinary and uncertain rela- tions. Finally, we studied keys as a special case of FDs and briefly considered uncertain relations with confidence values. Our paper, obviously, does not solve all problems related to schema design or de- pendency theory for uncertain relations. Instead, it suggests several directions for future work. Of course, developing a parallel theory for multivalued and other kinds of depen- dencies for uncertain relations would be theoretically interesting. A more detailed study of uncertain relations with confidence values is yet another avenue for future work. Fi- nally, a notion of “uncertain dependency” for uncertain relations, capturing the fact that a large (but possibly not entire) fraction of a relation satisfies a dependency, might also be practically useful. References 1. TriQL: The Trio query language. Available at http://i.stanford.edu/˜widom/triql.html. 2. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. 3. S. Abiteboul, P. Kanellakis, and G. Grahne. On the Representation and Querying of Sets of Possible Worlds. Theoretical Computer Science, 78(1), 1991. 4. P. Andritsos, A. Fuxman, and R. Miller. Clean answers over dirty databases: A probabilistic approach. In Proc. of ICDE, 2006. 5. L. Antova, C. Koch, and D. Olteanu. MayBMS: Managing Incomplete Information with Probabilistic World-Set Decompositions. In Proc. of ICDE, 2007. 6. L. Antova, C. Koch, and D. Olteanu. MayBMS: Managing Incomplete Information with Probabilistic World-Set Decompositions (Demonstration). In Proc. of ICDE, 2007. 7. L. Antova, C. Koch, and D. Olteanu. World-set decompositions: Expressiveness and efficient algorithms. In Proc. of ICDT, 2007. 8. M. Arenas. Normalization theory for XML. SIGMOD Record, 35(4):57–64, 2006. 9. M. Arenas, L. Bertossi, and J. Chomicki. Answer sets for consistent query answering in inconsistent databases. TPLP, 3(4), 2003. 10. M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent Query Answers in Inconsistent Databases. In Proc. of ACM PODS, 1999. 11. M. Arenas and L. Libkin. A normal form for XML documents. TODS, 29(1):195–232, 2004. 12. W. W. Armstrong. Dependency structures of database relationships. In Proc. of IFIP, pages 580–583, 1974. 13. D. Barbará, H. Garcia-Molina, and D. Porter. The Management of Probabilistic Data. TKDE, 4(5), 1992. 14. C. Beeri, R. Fagin, and J. H. Howard. A complete axiomatization for functional and multi- valued dependencies in database relations. In Proc. of ACM SIGMOD, 1977. 15. O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In Proc. of VLDB, 2006. 16. J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and D. Suciu. MYSTIQ: a system for finding more answers by using probabilities. In Proc. of ACM SIGMOD, 2005. 17. D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. J. VLDB, 16(1):123–144, 2007. 18. R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In Proc. of VLDB, 1987. 19. R. Cheng, S. Singh, and S. Prabhakar. U-DBMS: A database system for managing constantly-evolving data. In Proc. of VLDB, 2005. 20. N. Dalvi and D. Suciu. Efficient Query Evaluation on Probabilistic Databases. In Proc. of VLDB, 2004. 21. A. Das Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working Models for Uncertain Data. In Proc. of ICDE, 2006. 22. A. Das Sarma, S. Nabar, and J. Widom. Representing uncertain data: Uniqueness, equiva- lence, minimization, and approximation. Technical report, Stanford InfoLab, 2005. Available at http://dbpubs.stanford.edu/pub/2005-38. 23. P. C. Fischer, J. H. Jou, and D. Tsou. Succinctness in dependency systems. TCS, 24, 1983. 24. N. Fuhr. A Probabilistic Framework for Vague Queries and Imprecise Information in Databases. In Proc. of VLDB, 1990. 25. N. Fuhr and T. Rölleke. A Probabilistic NF2 Relational Algebra for Imprecision in Databases. Unpublished Manuscript, 1997. 26. G. Grahne. Dependency Satisfaction in Databases with Incomplete Information. In Proc. of VLDB, 1984. 27. G. Greco, S. Greco, and E. Zumpano. A logical framework for querying and repairing inconsistent databases. IEEE Transactions on Knowledge and Data Engineering, 15(6). 28. T. J. Green and V. Tannen. Models for incomplete and probabilistic information. In Proc. of IIDB Workshop, 2006. 29. Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: an efficient algorithm for discovering functional and approximate dependencies. Comp. Journal, 42(2), 1999. 30. T. Imielinski and W. Lipski. Incomplete information and dependencies in relational databases. In Proc. of ACM SIGMOD, 1983. 31. T. Imielinski and W. Lipski. Incomplete Information in Relational Databases. Journal of the ACM, 31(4), 1984. 32. L. V. S. Lakshmanan, N. Leone, R. Ross, and V.S. Subrahmanian. ProbView: A Flexible Probabilistic Database System. ACM TODS, 22(3), 1997. 33. S. K. Lee. An extended Relational Database Model for Uncertain and Imprecise Information. In Proc. of VLDB, 1992. 34. M. Levene and G. Loizou. Axiomatisation of functional dependencies in incomplete rela- tions. Theoretical Computer Science, 206, 1998. 35. E. Lien. Multivalued dependencies with null values in relational databases. In Proc. of VLDB, 1979. 36. D. Maier. Theory of Relational Databases. Computer Science Pr, 1983. 37. V. Th. Paschos. Polynomial approximation and graph-coloring. Computing, 70(1), 2003. 38. C. Re and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In Proc. of VLDB, 2007. 39. F. Sadri. Reliability of answers to queries in relational databases. TKDE, 3(2):245–251, 1991. 40. F. Sadri. Integrity constraints in the information source tracking method. TKDE, 7(1):106– 119, 1995. 41. P. Sen and A. Deshpande. Representing and Querying Correlated Tuples in Probabilistic Databases. In Proc. of ICDE, 2007. 42. J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume I. Computer Science Press, 1988. 43. Y. Vassiliou. Functional dependencies and incomplete information. In Proc. of VLDB, 1981. 44. J. Widom. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. In Proc. of CIDR, 2005. 45. J. Wijsen. Condensed representation of database repairs for consistent query answering. In Proc. of ICDT, 2003.