On Estimating the Cardinality of Aggregate Views  Paolo Ciaccia Matteo Golfarelli Stefano Rizzi DEIS - University of Bologna 40136 Bologna, Italy fpciaccia, mgolfarelli, srizzig@deis.unibo.it and data warehouses [AGS97]. It represents facts of inter- est for the decision process into cubes in which each cell Abstract contains numerical measures which quantify the fact from different points of view, while each axis represents an in- Accurately estimating the cardinality of aggregate teresting dimension for analysis. For instance, within a 4- views is crucial for logical and physical design dimensional cube modeling the phone calls supported by a of data warehouses. While the warehouse is un- telecommunication company, the dimensions might be the der development and data are not available yet, calling number, the number called, the date, and the time the approaches based on accessing data cannot be segment in which the call is placed; each cube cell could adopted. This paper proposes an approach to es- be associated to a measure of the total duration of the calls timate the cardinality of views based on a-priori made from a given number to another number on a given information derived from the application domain. time segment and date. We face the problem by first computing satisfac- The basic mechanism to extract significant information tory bounds for the cardinality, then by capital- from the huge quantity of fine-grained data stored in base izing on these bounds to determine a good prob- cubes is aggregation according to hierarchies of attributes abilistic estimate for it. Bounds are determined rooted in dimensions [GL97]. In most application cases, by using, besides the functional dependencies ex- cubes are significantly sparse (for instance, most couples pressed by the multidimensional scheme, addi- of telephone numbers are never connected by a call in a tional domain-derived information in the form of given date), and so are the aggregate views. cardinality constraints which may bound either Accurately estimating the actual cardinality of each the cardinality of a given view or the ratio between view is crucial for logical and physical design as well as for the cardinalities of two given views. In particular, query processing and optimization [Vas00]. As a relevant we propose a bounding strategy which achieves case, consider the view materialization problem, where the an effective trade-off between the tightness of the aggregate views which are the most useful in answering bounds produced and the computational complex- the workload queries have to be selected for materializa- ity. tion (see [TB00] for a survey). Since the number of pos- sible views which can be derived by aggregating a cube is 1 Introduction and Motivation exponential in the number of attributes, most approaches assume that a constraint on the total disk space occupied The multidimensional model is the foundation for data by materialization is posed, and attempt to find the subset representation and querying in multidimensional databases of views which contemporarily satisfies this constraint and  This work has been partially supported by the D2I MURST project. minimizes the workload cost [GR00, Gup97, HRU96]. An- The copyright of this paper belongs to the paper’s authors. Permission to other case where estimation of view cardinalities is relevant copy without fee all or part of this material is granted provided that the is index selection [GHRU97]. copies are not made or distributed for direct commercial advantage. If the data warehouse has already been loaded, view car- Proceedings of the International Workshop on Design and dinalities can be quite accurately estimated by using sta- Management of Data Warehouses (DMDW’2001) tistical techniques based, say, on histograms [MD88] or Interlaken, Switzerland, June 4, 2001 sampling [HO91]. However, such techniques cannot be (D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.) applied at all if the data warehouse is still under develop- http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/ ment, and the estimation of view cardinalities is needed for P. Ciaccia, M. Golfarelli, S. Rizzi 12-1 design purposes. To obviate this, current approaches are It should be noted that we are using the term view to de- based on estimation models that only exploit the cardinal- note the set of grouping attributes used for aggregation, ity of the base cube and that of the single attribute domains while the “actual” views will typically include also one or [RS97, SDNR96], which however leads to significant over- more measures. This slight abuse in terminology is possi- estimation. ble since we are interested in determining the cardinality of In this paper we propose a novel approach to estimate views, which only depends on the grouping attributes. Definition 3 (Roll-up) Given the set V D of all possible the cardinality of views based on a-priori information de- views on D, we define on VD the roll-up partial order  rived from the application domain. Similarly to what is as follows: V W iff 8A i 2 V 9Aj 2 W j (Aj ! Ai ) 2 done when estimating the cardinality of projections in re- lational databases [CM95], we face the problem by first F + , i.e., iff W ! V . We call multidimensional lattice for D the corresponding lattice, whose top and bottom el- computing satisfactory bounds for the cardinality, then by ements are dim(D) and the empty view fg, respectively. capitalizing on these bounds to determine a good proba- We will denote with V W the view that is the least upper bilistic estimate for it. Besides the functional dependen- cies expressed by the multidimensional scheme, the bounds bound of V and W in the lattice; given a set of views S , we will briefly denote with (S ) the view that is their least we determine also take into account additional domain- derived information expressed in the form of cardinality upper bound. constraints, namely, bounds of the cardinality of some views and bounds (called k-dependencies) on the ratio be- Example 1 Consider an enterprise with branches in differ- tween the cardinalities of two views. The computation of ent cities. A simple dimensional scheme Transfers mod- bounds is based on a bounding strategy, which is aimed at eling the transfers of employees between offices might in- achieving an effective trade-off between the tightness of the clude: U = fdate; month; year; fromOÆce; fromDept; fromCity; bounds produced and the computational complexity. The paper is organized as follows. After providing some basic definitions in Section 2, in Section 3 we introduce k- toOÆce; toDept; toCity; employeeg dependencies. In Section 4 we outline our overall approach F = fdate ! month; month ! year; to estimation and show its benefits with an example. Sec- fromOÆce ! fromDept; fromOÆce ! fromCity; toOÆce ! toDept; toOÆce ! toCityg tion 5 introduces the basic properties of bounds, proposes an efficient bounding strategy, and sketches a branch-and- bound approach to determine the upper bound of the car- dinality of a given view when the cardinality constraints in thus dim(D) = f date fromOÆce toOÆce employeeg. ; ; ; Examples of views on the Transfers scheme are input do not contain k-dependencies; besides, it discusses how the strategy introduced can be improved. Section 6 V = fmonth; fromOÆce; toCity; employeeg shows how the bounds derived may be used to improve the W = fmonth; fromCity; fromDeptg cardinality estimates. Finally, Section 7 discusses the most Z = fyear; fromOÆce; toCityg interesting open issues. It is W Z = fmonth; fromOÆce; toCityg, with 2 Background and Working Example (W Z )V . 2 In this section we formalize the concept of view, define a The following notation is used throughout the rest of the partial ordering on the set of views, and present the appli- paper. Uppercase letters from the beginning of the alpha- cation domain we will use as an example. bet (A, B , : : : ) denote dimensions. Attributes which are functionally determined by another attribute, i.e. attributes Definition 1 (Dimensional Scheme) We call dimensional other than dimensions, are denoted by the corresponding scheme D a couple (U; F ) where U is a set of attributes primed letters (e.g., A ! A0 , A ! A00). Sets of attributes and F = fAi ! Aj j Ai ; Aj 2 U g is a set of functional are represented by omitting braces, thus writing ABC for dependencies (FD’s) which relate the attributes of U into a fA; B; C g. V is the view whose cardinality is to be es- set of pairwise disjoint directed trees. We call dimensions timated, while W , X , Y , and Z , possibly with subscripts the attributes A k 2 U in which the trees are rooted, i.e., (W1 ; W2; : : : ), denote generic views in VD . Finally, low- such that 8A i 2 U (Ai ! Ak ) 62 F ; let dim(D)  U ercase letters are used for the cardinalities of views and at- denote the set of dimensions of D. tributes (e.g., w is the cardinality of view W , abc is the cardinality of the view with attributes ABC , and so on). Definition 2 (View) Let D = (U; F ) be a dimensional scheme. We call view on D any subset of attributes V  U such that 8A i ; Aj 2 V (Ai ! Aj ) 62 F + , where F + 3 The k-dependencies denotes the set of all functional dependencies logically im- A k-dependency is a relevant case of cardinality constraint plied by F . which naturally generalizes a functional dependency. In the P. Ciaccia, M. Golfarelli, S. Rizzi 12-2 candidate view authors’ experience, k-dependencies are particularly use- dimensional scheme ful to characterize the knowledge of the business domain held by the experts in the field. For instance, in the trans- bounder fer domain, we might have some information concerning logical the number of destination cities for an employee, or on the dimensional scheme scheme, bounds view number of distinct areas moved to from each area. If such cardinality materializer information is in the form of bounds, it can be effectively constraints used to improve the bounds of view cardinality. cardinality estimate estimator Definition 4 (k-dependency) Let X and Y be two views on D. We say that a k-dependency (kD) holds between X workload and Y , and denote it with X ! Y , when k (k  1) is an k upper bound of the number of distinct tuples of Y which correspond to each distinct tuple of X within view X Y . Figure 1: Overall architecture for logical design Example 2 In the Transfers scheme, assume the domain our approach works in two steps. First, the bounder uses expert provides the following information: The maximum the set I of cardinality constraints supplied by the user to number of inter-department transfers of an employee dur- determine effective bounds for the cardinalities of a proper ing one year is 2. This constraint can be formalized by set of views; then, the estimator uses these bounds to derive 2 a probabilistic estimate for the cardinality of V . Note that the following kD: X ! Y , where X = f year employee ; g, Y =f toDept g. Intuitively, from this we can derive that this two-steps approach generalizes well-known paramet- the cardinality of the view f year employee toDept ; ; g can- ric models for the estimation of the cardinality of relational not exceed twice the cardinality of X . 2 queries [MCS88], and in particular those for projection size estimation [CM95], for which bounds are typically given as The kD’s have been studied in the context of relational input parameters. database theory, where they are also known as numerical The different forms of cardinality constraints we will dependencies. Grant and Minker [GM83] proved that kD’s consider are: are not finitely axiomatizable, thus no fixed set of inference 1. a lower (w ) and/or an upper (w + ) bound of the car- rules can be used to determine whether or not a given kD dinality w of a view W ; is logically implied by a set of kD’s. Nonetheless, a ba- 2. a k-dependency (X ! Y ) expressing an upper bound k sic set of rules, which naturally extend those for FD’s, was proposed in [GM83]. The rules we use, generalized to the of the ratio between the cardinalities of two views X multidimensional lattice, are: and Y . R1 : X ! Y ` X Z ! Y Z We will assume that at least the upper bounds of the car- k k dinalities of all the single attributes in the dimensional R2 : X ! Y ^ Y ! Z ` X ! Y Z k l  k l scheme are known. This assumption, which is perfectly X! Y Z ` X ! reasonable in all application domains, is necessary in order R3 : Y k k to guarantee that at least one upper bound can be deter- R4 : X ! Y ^ X ! Z ` X ! Y Z k l  k l mined for each view. The set I , together with the dimensional scheme D, uni- Note that the “union” rule R4 is not strictly needed, since it vocally determines two bounds for the cardinality of V , can be derived from rules R1 (“extension”), R2 (“transitiv- which are called the greatest lower bound and the least ity”), and R3 (“decomposition”). upper bound, denoted as v and v+ , respectively.1 The interpretation of such bounds is as follows: 4 A Framework for Estimation 1. in each instance of D that does not violate any con- The framework for this work is the logical design of mul- straint in I , the cardinality v of V is such that v 2 tidimensional databases carried out off-line, i.e. assuming [v ; v+ ]; and 2. there exist two instances, both compatible with I , that the source data cannot be directly queried to estimate the cardinality of multidimensional views. Without loss of generality, in the following we consider that estimates are where v equals v and v+ , respectively. needed for the purpose of view materialization, thus reli- We say a constraint c 2 I is redundant iff all the greatest able information on the size of the candidate views has to lower bounds and the least upper bounds determined by I be supplied to the materialization algorithm. are equal to those determined by I fcg. As sketched in Figure 1, whenever the materialization 1 For simplicity of notation, in denoting bounds we omit the depen- algorithm requires information about a candidate view V , D dence on and . I P. Ciaccia, M. Golfarelli, S. Rizzi 12-3 Definition 5 (Sound and Minimal Input) Let I be a set it is v  103. Finally, by using the model in Section 6, the of cardinality constraints on dimensional scheme D. We cardinality of V is estimated as v = 3:8  10 4. 2 say I is sound iff there exists at least one non-empty in- stance of D which satisfies all the constraints in I . We say 5 The Bounder I is minimal iff no constraint in I is redundant. The basic observation to determine bounds for view car- In this paper we will assume that the input I is sound and dinalities using bounds of the cardinalities of other views minimal. It is straightforward to derive that, in this case, is that the multidimensional lattice induces an isomorphic all the bounds in I are either greatest lower bounds or least structure over such cardinalities. In fact, from Definition 3 it follows that W Z implies w  z in each instance of upper bounds (whereas the opposite is not necessarily true). Computing the bounds implied by I turns out to be D, since Z ! W holds. This inequality also applies to a challenging combinatorial problem, even for “simple” bounds. Lemma 1 If W Z , then w  z and w+  z + . forms of cardinality constraints. For instance, it is known that the problem is NP-hard for arbitrary patterns of func- tional dependencies [CM92]. Furthermore, the actual com- Proof: (w  z ) Assume w > z . Then, there is putational effort needed to compute these bounds might an instance of D in which w  w > z  z , thus w > z , limit applicability in real-world cases. For this reason, the which is a contradiction. Similarly for w +  z + . 2 bounder is built around the concept of bounding strategy. A s bounding strategy is characterized by a couple of bound- ing functions that, given I ; D, and V , compute bounds v s As to k-dependencies, their influence on the determina- and vs+ such that vs  v and v+  vs+ both hold. In tion of bounds is summarized by the following lemma. Lemma 2 Let Z = X Y . If X ! Y , then x  z =k other terms, a bounding strategy never computes bounds k and z +  k  x+ . which are more restrictive than the ones logically implied by the input constraints, trading-off accuracy for speed of s evaluation. We say that a strategy is decoupled iff com- puting v s+ for an arbitrary view V only requires the knowl- Proof: From Definition 4 it follows immediately that, if edge of upper bounds w s+ of other views W , but no knowl- X! k Y , the cardinality z of Z is related to the cardinality edge of lower bounds w s , and vice versa. Thus, for a de- x of X by inequality z  k  x. The inequalities on bounds coupled bounding strategy, the two bounding functions can follow immediately. 2 be defined independently of each other. Turning to the estimator, our framework supports differ- In the rest of this section we first propose a decoupled ent probabilistic models. A probabilistic model is a func- strategy to compute upper bounds (Section 5.1), then we tion that, given I ; D, V , as well as bounds computed by discuss some issues related to coupled strategies (Section the bounder, provides an estimate, v, for the cardinality of 5.2). V . In general, this step can use further information from the application domain that is not suitable to derive bounds. 5.1 A Decoupled Upper Bounding Strategy Typically this is the case with information concerning aver- The bounding strategy we propose in this section, called age values (e.g., the number of transfers of each employee cover-based, relies on the concept of cover of a view to on each year is 1.5, on the average). compute upper bounds. The following are two preliminary definitions whose aim is to precisely characterize how sets Example 3 Let 104 be the number of employees who have of views and kD’s can be sinergically combined together. been transfered at least once, and let the enterprise con- sist of 103 offices distributed over 10 cities and belong- Definition 6 (Graph of a set of kD’s) Let K = fX 1 !1 k ing to one of 10 departments; let 10 3 days be the observa- Y1 ; : : : ; Xp !p Yp g be a set of kD’s. The (labelled ori- k tion period. Let V = f date fromOÆce toOÆce ; ; g. Since ented)S graph of K is G (K ) = (N; E ), with set of nodes N = i fXi ; Yig, set of edges E = fei = (Xi ; Yi ); i = each office is involved in transfers at most with every other office on each date, the first trivial upper bound of v is 103  103  103 = 109. If the maximum number of transfers 1; : : : ; pg, and labeling function  such that (e i ) = ki .2 for an employee during one year is 2, and since we con- sider 3 years, it is derived that the cardinality of the base Definition 7 (K-set of views) Let S = fW1; : : : ; W gm cube is at most 2  3 = 6 times the number of transferred be a non-empty set of views, and let K = fX 1 !1 k employees, i.e. 6  104 . Thus, the upper bound of v can be Y1 ; : : : ; X ! Y g be a set of kD’s. The couple C = kp improved to 6  10 4 as well (the cardinality of a view can- p p 2 Technically, G (K ) is a multi-graph, since two edges may share the not exceed that of its base cube). On the other hand, if we same couple of nodes. This, however, does not influence the following assume that each office is involved in at least one transfer, arguments. P. Ciaccia, M. Golfarelli, S. Rizzi 12-4 (S; K ) is called a k-set of views iff 8i = 1; : : : ; p it is ABCD Y 2 S and there exists a set of kD’s, K 0 = fW 1 !1 i j k Y1 ; : : : ; W ! Y g, such that: 1) 8i = 1; : : : ; p it is ABC kp X W with W 2 S , and 2) G (K 0 ) = (N 0; E 0 ) is a for- jp p i ji ji est, i.e., a set of disjoint directed trees. We call S -compliant AB BC CD a set K 0 with such properties. Each kD in an S -compliant set K 0 is derived from a cor- A'B responding kD in K by applying rules R1 and R3 (since, by hypothesis, it is X i Wji = Wji ). Note that N 0  S A B C D always holds and that, in general, multiple S -compliant K 0 sets can be derived from the same C , depending on how A' each Wji 2 S is chosen. Example 4 C 1 = (fA0 B; C; Dg; K ), with K = fA0B !1 k Figure 2: Roll-up relationships of views in Example 5 C; C ! k2 Dg, is a k-set of views, since K is S -compliant.  C3 = (fAB; C g; fAB ! C g). From Lemma 2 it k The same is true for C2 = (fAB; C; Dg; K ), since K 0 = immediately follows abc  ab +  k. fAB ! k1 C; C !k2 Dg is S -compliant (in fact, A 0 B AB ). On the other hand, C3 = (fB; C; Dg; K ) is not a k-set since no S -compliant set of kD’s can be found.  C4 = (fA; B; C g; fA !1 B; B !2 C g). By applying k k It is important to remark that Definition 7 requires K 0 , rule R2, we derive A ! 1 2 BC , thus abc  a+  k  k . k k 1 2 and not necessarily K , to be a forest. For instance, the couple (fAg; fA0 ! Ag) is not a k-set, though G (fA 0 !  C5 = (fA; B; C g; fA !1 B; A !2 C g). Rule R4 is k k k k Ag) is a forest, since G (fA ! Ag) is cyclic. On the other k now used to derive A ! 1 2 BC , thus abc  a+  k  k . k k 1 2 hand, C = (fA; A0; B g; fA0 !1 B; B !2 A0 g) is a k-set k k 4 (after deriving A !1 B from A0 !1 B ) even if G (fA0 !1 k k k  C6 = (fA; A0B; C g; fA0 ! Ag). According to rule k B; B !2 A0 g) is cyclic. R1 it is A 0B ! AB , and from Lemma 2 ab+  k  k k Finally, for the k-set C 5 = (fA; A0 B; A0C g; fA0 ! k a0 b+ . On the other hand, abc  ab+  c+ , thus abc  k  a0 b+  c+ . 2 Ag), two S -compliant sets, K10 = fA0 B ! Ag and k K20 = fA0 C ! Ag, can be derived. 2 k The following theorem precisely characterizes how Definition 8 (Cover) Let V 2 VD be a view on D and bounds are related to the graph of K 0 . C = (S; K ) be a k-set of views. C is called a V -cover iff V (S ). Theorem 1 (Cover-based bounding) Let V be a view and As the following example suggests, a V -cover can be C = (S; K ) be a V -cover, with S = fW1; : : : ; W g m used to bound from above the cardinality of V by generaliz- and K = fX1 !1 Y1 ; : : : ; X ! Y g. Let K 0 be an k kp ing Lemma 1 to the case of multiple views (since V (S ) S -compliant set, R(G (K 0 )) be the set of root nodes of p p holds). When also kD’s are present, Lemma 2 can be ex- the forest G (K 0 ) = (N 0 ; E 0) associated to K 0 , and let ploited to improve the bound. Since a cover must be a k-set, S0 = S N 0 stand for the set of views which are not nodes we are guaranteed that the cardinalities of some views in S in G (K 0 ). Then: can be safely “replaced” by the ki’s of the kD’s in K 0 . Example 5 Let V = ABC . Below we consider some no- Y Y p table examples of V -covers and show how each of them v  u(C ; K 0) def = ki  wj+ (1) i=1 Wj 2R(G (K ))[S0 can be used to derive an upper bound for v. In order to 0 help the reader, Figure 2 depicts the roll-up relationships between the views involved. Proof: The intuition behind the proof is that each tree  C1 = (fABCDg; ;) is a V -cover since V (S1 ) = G = (N 0; E 0 ) of G (K 0 ) contributes to u(C ; K 0) with the t t t upper bound of the cardinality of its root W t times all the ABCD. From Lemma 1 it is derived abc  abcd+. ki’s which label the edges in Et0 .  C2 = (fAB; BC g; ;) is a V -cover since V (S2 ) = Since by Definition 8 it is V (S ), it is sufficient to ABC . Since the natural join between two views is a prove that u(C ; K 0 ) is an upper bound of (S ). Since the subset of their Cartesian product, it is abc  ab +  bc+ . size of the natural join of a set of views can never exceed P. Ciaccia, M. Golfarelli, S. Rizzi 12-5 that of their Cartesian product, it is by (1) is actually independent of the one chosen. For in- card((S ))  card(((S0 ))((N 0 ))) stance, the reader may immediately verify that, in Example 4, it is u(C5; K10 ) = u(C5 ; K20 ) = k  a0 b+  a0 c+ .  card((S0 ))  card((N 0 )) Y + Y Lemma 3 Let C = (S; K ) be a V -cover, and let K10 and  w  card((N 0)) 20 j t K20 be two arbitrary S -compliant sets. It is u(C ; K 10 ) = u(C ; K20 ) def = u(C ). Wj S t Since the ki’s are Q partitioned over the trees, it is enough to prove that wt+  i20 ki , where Wt is the root of G t and 0t Coherently with Theorem 1 and Lemma 3, the cover- is the set of labels in G t, is an upper bound of card((N t0)). t cb + as: based bounding strategy computes v cb This is proved by induction on the number L of levels in G t . ( Base step (L=2). 3 In this case Gt corresponds to the += v+ if v + 2 I ; k2;q vcb minfucb(C ) j C is a V -coverg if v + 62 I : k2;1 (2) set of kD’s fWt ! W2;1 ; : : : ; Wt !2 W2;q2 g. From the unionQ rule R4 it immediately follows that card((N t0))  + wt  qi=1 2 k . where ucb(C ) is obtained by replacing w j+ with wj; + in 2;i cb Inductive step (L 1 ) L). Let Nt0 (L 1) u(C ). In general, evaluating the cover-based bound leads be the set of nodes in the first L 1 levels. By in- to a recursive computational flow; note that the “case-0” of ductive hypothesis it is card((N t0 (L 1)))  wt+  + = v+ , is correctly defined since we assumed recursion, vcb QL 1 Qql l=2 i=1 kl;i . Adding the L-th level introduces new the input I to be minimal. edges with labels kL;1; : : : ; kL;qL and corresponding ter- The space of the V -covers to be analyzed in order to minal nodes WL;1 ; : : : ; WL;qL . From the i-th of the corre- determine vcb+ has exponential size. On the other hand, the sponding kD’s we can derive (using rules R1 and R3) the following theorem shows that, under some circumstances, a kD (Nt0 (L 1)) ! W kL;i V -cover C2 can be discarded from the search space without even computing ucb (C2). L;i . From the union rule R4 it is derived: Q Theorem 2 Let C1 = (S1 ; K1) and C2 = (S2 ; K2 ) be two (N 0 (L 1)) =1! qL t i kL;i (fW 1; : : : ; W L; L;qL g) V -covers. If S1  S2 and K1 = K2 or S1 = S2 and which, due to Lemma 2, leads to: K2  K1 , then ucb(C1 )  ucb (C2). card(((Nt0 (L 1)))((fWL;1; : : : ; WL;qL g))) = 5.1.1 Reasoning without k-dependencies = card((Nt0(L))) When no k-dependencies are included among the input L 1 ql YqL + YY + YL Yql constraints I , covers degenerate into sets of views, which  kL;i  wt  kl;i = wt  kl;i 2 allows us to precisely characterize the set of V -covers i=1 l=2 i=1 l=2 i=1 that can provide useful (non redundant) bounds. To see how such covers are determined, two orthogonal aspects It is possible to prove that (1) is valid even if G (K 0 ) is are considered: a domination relationship between sets of views and the input information, I . While the former in- not a forest, provided that R(G (K 0 )) contains (at least) a duces a partial order on the bounds obtainable from V - set of nodes from which every other node in G (K 0 ) can be covers, regardless of the specific input I , the latter can be reached through a directed path. On the other hand, the used to restrict the set of useful V -covers to those including bounds determined by such “non-forest” V -covers are al- only views in I . ways redundant, meaning that a proper V -cover yielding a In this section, since we assume K = ;, we will work better bound for v can always be found. only with the S part of V -covers. Consequently, in (2), Example 6 Let V = ABC , and consider the couple ucb (C ) can be replaced by (fA; B; C g; K ) with K = fA ! k1 C; B ! k2 C g), which ucb (S ) = Y wi+ : 0 is not a k-set since the graph of K = K has two roots (A (3) and B ). The bound returned by (1) is v  k 1  k2  a+  b+ Wi 2S which is redundant, since a better bound is obviously ob- Definition 9 (Domination between sets of views) tained through the V -cover (fA; B; C g; fA !1 C g). 2 Let S1 = fW1;1; : : : ; W1;i; : : : ; W1;m g and S2 = k fW2;1; : : : ; W2;j ; : : : ; W2;ng be two sets of views. We The following lemma shows that, when multiple S - say that S1 dominates S2 , written S1 vS2 , iff S2 can compliant sets exist for a given cover, the bound returned be partitioned into m subsets S 2;1; : : : ; S2;m such that 3 The case L = 1 cannot arise, since each G has at least one edge. t W1;i(S2;i ) 8i = 1; : : : ; m. P. Ciaccia, M. Golfarelli, S. Rizzi 12-6 For instance, fA0 B; C gvfAB; CD; E g. Note that if Y3 Si vSj then (Si )(Sj ) necessarily holds, whereas the opposite is not always true (e.g., fAB; BC g6vfABCDg Y2 though ABC ABCD). V Lemma 4 Let S1 and S2 be two sets of views. If S 1 vS2 Y1= W1 W2 W3 then ucb(S1 )  ucb (S2 ). Z1 Z2 Definition 10 (Ground Views and Covers) We say that a view W is ground iff w + is in I . A V -cover is said to be Figure 3: Roll-up relationships between views in Lemma ground when all the views it includes are ground. 6, in the case n = 3 Lemma 5 Let S be a non-ground V -cover. Then there ex- 4. If S is a ground V -cover, no set S 0 such that S  S 0 ists a ground V -cover S1 such that u cb (S1 )  ucb (S ). is a minimal V -cover (from Definitions 9 and 11). Proof (sketch): Since S is not ground, at least one view 5. If a minimal V -cover S contains a ground view W , in S is not ground. By recursively applying (3), u cb (S ) will it cannot contain any other ground view W 0 such that be eventually expressed as a product of bounds in I . The W W 0 (from Definitions 9 and 11). case of strict inequality (u cb(S1 ) < ucb (S )) can arise since in this recursive process there is no guarantee that a given 5.2 Towards a Coupled Bounding Strategy ground view will be generated just once, thus its least upper The bounds we derive through the strategy described in bound might appear more than once in u cb (S ). 2 Section 5.1 are not necessarily the tightest possible ones. In fact, more complex and effective bounding strategies Definition 11 (Minimal Cover) A ground V -cover S is can be defined to the detriment of computational speed. minimal iff there is no other ground V -cover S 1 such that S1 vS holds. Basically, in these strategies the concept of cover may be extended by considering more complex patterns of views, where upper and lower bounds are used jointly. In this sec- The following theorem immediately derives from Lem- tion we present some preliminary considerations on cou- mas 4 and 5. pled strategies; for simplicity, we will assume that the input Theorem 3 (Sufficiency of Minimal Covers) It is: does not contain k-dependencies. As to upper bounding, the cover-based strategy can be minfucb(S ) j S is a V -coverg = improved by exploiting results from majorization theory, which state that the size of the natural join between two = minfucb(S ) j S is a minimal V -coverg: (4) relations is majorized when the distributions of the join at- + + For instance, let I = fab0 ; cd+; a0de+ ; a+ ; a0 ; b+ ; tribute(s) in the two relations are maximally skewed [IC91]. b0+ ; c+ ; d+ ; e+ g and V = A0B 0 CD. The min- The extension of this argument to the multidimensional lattice is as follows. Given two views W 1 and W2 such imal V -covers are fAB ; CDg, fA0 ; B 0 ; CDg, and 0 that W1 ÆW2 and W2 ÆW1 , let Y = W1W2 and let fA0 DE; B 0 ; C g. Z = W1 W2 , where is the greatest lower bound op- From the above results, several facts can be easily de- erator on the lattice; it can be proved that rived, which can be exploited to efficiently generate mini- mal V -covers by means, say, of a branch-and-bound algo- y  w1+  w2+ (z 1)(w1+ + w2+ z ) (5) It should be noted that, when W 1 W2 = fg, since the rithm: 1. A ground view W such that V W is a ground V - empty view fg has cardinality 1, (5) correctly reduces to cover (from Definition 8). (3). This result can be extended to a V -cover whose views 2. A ground view W such that arity(W ) = 1 and are connected by a linear join graph. W \ V = ; does not belong to any minimal V -cover 4 Lemma 6 Let S = fW1 ; : : : ; Wng be a V -cover; let Zi = (from Definitions 9 and 11). Wi Wi+1 , i = 1; : : : ; n 1, Y1 = W1 , and Yi+1 = 3. A ground view W such that arity(W ) > 1 and 8W 0 Yi Wi+1 , i = 1; : : : ; n 1. Then: for which W 0W it is arity(W 0 \ V ) < 2 does not yi++1  yi+  wi++1 (zi 1)(yi+ + wi++1 zi ) belong to any minimal V -cover (since C includes the cardinalities of all the attributes). for i = 1; : : : ; n 1 ; (6) 4 arity(W ) denotes the number of attributes in W . v  yn : + P. Ciaccia, M. Golfarelli, S. Rizzi 12-7 The pattern consisting of views Y i , Wi+1 , Zi , and se In our approach, denoted (“safe-estimate”), the above Yi+1 , depicted in Figure 3, can in principle be extended estimate is improved in two ways: by replacing v max with to take into account also size information on Y i Zi and + , as a mea- the upper bound computed for v, for instance v cb Wi+1 Zi , that is, on non-join attributes; this will further sure of the maximum cardinality of V , and by replacing the strengthen the upper bound. At present, we guess that the cardinality of the base cube d with an estimate, w se , of the exact computation of v + might involve taking into account cardinality of a view W such that V W . This leads to: patterns that can extend over the whole lattice. However, + ; w )  minfv+ ; w g v se = (vcb (10) besides the theoretical interest, it is important to trade-off se cb se the increased complexity with the actual gain that could be Since both vcb + and w can be considerably lower than obtained by having more accurate bounds, considering also se how bounds can be used by the estimator. vmax and d, respectively, it is usually the case that v se  A coupled strategy requires also lower bounds to be v sdnr. The rationale for (10) is that we can view the prob- computed, which is radically different from computing up- lem of estimating v as the one of distributing the tuples of per bounds. In fact, while computing an upper bound cor- view W , which are estimated to be wse , over a number of + “buckets”. vcb responds to bounding the size of a join, computing a lower bound corresponds to bounding the size of a projection, Due to the need to know w se, it is obvious that our es- where the relevant difference is that projection is a unary timation process must move downward from the top of the operator. This leads to a much simpler situation to deal lattice (whose cardinality d is typically known) following a with, in which Lemma 1 is exploited and the lower bound path leading to V . Clearly, this represents a simplification of v is computed as maxfw j w 2 I ; W V g. Differ- of the correct estimation procedure, which would require to ently from upper bounds, no combinatorial issues arise in determine v by following all the paths from dim(D) to V . computing lower bounds through this strategy; thus, com- On the other hand, this would lead to combinatorial explo- plexity is linear in the cardinality of I . sion and necessitate of highly complex probabilistic models A better bound can be obtained by using information that are well beyond the current state-of-the-art knowledge. associated to “sibling” views. Let W be a view such that From a more practical (numerical) point of view, it V \ W = ;, and Z = V W ; then: should be noted that moving from upper bounds to esti- mates leads to significant differences under specific condi- v  wz + (7) tions only. Two relevant cases should be considered, which arise from the limit behavior of Cardenas’ formula: In fact, if v < z =w+ , then the size of the Cartesian prod- + it is v  w uct of V and W would be less than z , which is impossible. 1. When wse  0:1  vcb se se + it is v  v+ 2. When wse  3  vcb se f+ 6 The Estimator The values 0:1 and 3 can thus be used to predict whether Assuming that effective bounds have been derived, cardi- the estimator will deliver results which substantially differ nality estimation must be based on a probabilistic model from those directly obtainable from the bounder. to derive an estimate, v, of the cardinality of view V . The model we adopt here is based on the Cardenas’ formula Example 7 In the Transfers scheme, we consider three in- [Car75], which states that, when throwing N distinct ob- put situations: jects into B buckets, the expected number of buckets in  which at least one object will fall can be estimated as: I1 = fdateg+ = 103 ; fyearg+ = 3; femployeeg+ = 104 ;  N ! def (B; N ) = B  1 1 B1  minfB; N g ffromOÆceg+ = ftoOÆceg+ = 103 ; ffromCityg+ = ftoCityg+ = 10; (8) ffromDeptg+ = ftoDeptg+ = 10  Within the approach proposed in [SDNR96], (8) is used to estimate v by relying on the maximum cardinality of V , I2 =I1 [ femployee; yearg !4 ffromOÆce; toOÆce; dateg  defined as the Cartesian product Q of the cardinalities of the I3 =I2 [ ffromCity; fromDeptg+ = 40; attributes in V , v max = Ai 2V ai , and on the cardinality ftoCity; toDeptg+ = 40; of the base cube, d = card(dim(D)), that is: v sdnr = (vmax ; d)  minfvmax ; dg (9) ffromCity; fromDeptg !2 ftoCity; toDeptg; 30 This formula turns out to significantly overestimate the ffromCity; fromDeptg ! ffromOÆceg; 30 cardinalities and can easily lead to violate the constraint ftoCity; toDeptg ! ftoOÆceg v sdnr  v+ . P. Ciaccia, M. Golfarelli, S. Rizzi 12-8 each year is 2, would allow the cardinality of the base Table 1: Improving upper bounds and estimates for increas- cube to be estimated as twice the cardinality of view ing domain-derived information femployee year ; g. + + input wcb vcb wse v se  Probabilistic estimates. Estimates based on Cardenas’ I1 1013 106 1013 106 formula can be improved in several ways. In particu- I2 1:2  105 1:2  105 1:2  105 7:6  104 lar, information on lower bounds could be considered I3 1:2  105 7:2  104 1:2  105 5:8  104 by exploiting the results in [CM95], as well as infor- mation concerning the distribution of attribute values date employee fromOÆce toOÆce over their domains. For this, the challenge is to derive Let W = dim(D) = f ; ; ; g fromOÆce toOÆce new models that can be applied when the data ware- be the base cube and V = f ; g be the house has not been loaded yet. view whose cardinality is to be estimated. Table 1 shows how the upper bound w cb + of W , the upper bound v + of V , cb References and the estimate vse improve as new cardinality constraints are progressively supplied. The estimate v se is based on [AGS97] R. Agrawal, A. Gupta, and S. Sarawagi. Mod- the estimate of w, wse , which is assumed to be equal to its eling Multidimensional Databases. In Proc. +. upper bound w cb 2 ICDE’97, pages 232–243, Birmingham, UK, 1997. 7 Conclusions and Open Issues [Car75] A.F. Cardenas. Analysis and Performance of Inverted Database Structures. Communica- In this paper we have shown how cardinality constraints tions of the ACM, 18(5):253–263, 1975. derived from the application domain may be employed to determine effective bounds on the cardinality of aggregate [CM92] P. Ciaccia and D. Maio. On the Complexity views and how, in turn, such bounds can be used to esti- of Finding Bounds for Projection Cardinalities mate the cardinality of the views. In order to improve the in Relational Databases. Information Systems, approach effectiveness, some issues still need to be investi- 17(6):511–515, 1992. gated. In the following we briefly discuss those we believe to be crucial: [CM95] P. Ciaccia and D. Maio. Domains and Ac- tive Domains: What This Distinction Implies  Domination. A characterization of domination be- for the Estimation of Projection Sizes in Re- tween k-sets of views, similar to that reported in Def- lational Databases. IEEE Transactions on inition 9 for sets of views, needs to be developed in Knowledge and Data Engineering, 7(4):641– order to reduce the complexity of computing upper 655, 1995. bounds in presence of k-dependencies. [GHRU97] H. Gupta, V. Harinarayan, A. Rajaraman, and  Minimality. Throughout this paper we assumed that J. Ullman. Index Selection for OLAP. In Proc. the cardinality constraints supplied by the domain ex- ICDE’97, pages 208–219, Birmingham, UK, pert are sound and non redundant. Of course, this 1997. gives rise to the problem of determining, given an in- put I , if I is sound and minimal, which we argue can [GL97] M. Gyssens and L.V .S. Lakshmanan. A Foun- dation for Multi-Dimensional Databases. In be dealt with as done for, say, functional dependencies Proc. 23rd VLDB, pages 106–115, Athens, (whose inference rules can be used both for schema Greece, 1997. normalization as well as for input minimization).  Cardinality constraints. The input knowledge may be [GM83] J. Grant and J. Minker. Numerical Depen- dencies. In H. Gallaire, J. Minker, and J.-M. further extended by considering other forms of car- Nicolas, editors, Advances in Database The- dinality constraints which are typically known to the ory, volume II. Plenum Publ. Co., 1983. experts of the application domain. For instance, while in this paper we have defined k-dependencies to ex- [GR00] M. Golfarelli and S. Rizzi. View Material- press bounds on the ratio between the cardinalities of ization for Nested GPSJ Queries. In Proc. two views, they may also be used to denote the aver- DMDW’2000, Stockholm, Sweden, 2000. age of such ratio; while this kind of knowledge cannot be used by the bounder, it allows the cardinality esti- [Gup97] H. Gupta. Selection of Views to Materialize in mations to be improved. For instance, knowing that a Data Warehouse. In Proc. ICDT’97, pages the average number of transfers for each employee on 98–112, Delphi, Greece, 1997. P. Ciaccia, M. Golfarelli, S. Rizzi 12-9 [HO91] W. Hou and G. Özsoyoglu. Statistical Es- timators for Aggregate Relational Algebra Queries. ACM Transactions on Database Sys- tems, 16(4):600–654, 1991. [HRU96] V. Harinarayan, A. Rajaraman, and J. Ullman. Implementing Data Cubes Efficiently. In Proc. ACM Sigmod Conf., pages 205–216, Montreal, Canada, 1996. [IC91] Y.E. Ioannidis and S. Christodoulakis. On the Propagation of Errors in the Size of Join Re- sults. In Proc. ACM Sigmod Conf., pages 268– 277, Denver, CO, 1991. [MCS88] M. V. Mannino, P. Chu, and T. Sager. Sta- tistical Profile Estimation in Database Sys- tems. ACM Computing Surveys, 20(3):191– 221, 1988. [MD88] M. Muralikrishna and D.J. DeWitt. Equi-depth Histograms for Estimating Selectivity Fac- tors for Multi-Dimensional Queries. In Proc. ACM Sigmod Conf., pages 28–36, Chicago, IL, 1988. [RS97] K. Ross and D. Srivastava. Fast Computation of Sparse Datacubes. In Proc. 23rd VLDB, pages 116–125, Athens, Greece, 1997. [SDNR96] A. Shukla, P. Deshpande, J. Naughton, and K. Ramasamy. Storage Estimation for Multidi- mensional Aggregates in the Presence of Hier- archies. In Proc. 22nd VLDB, pages 522–531, Mumbai, India, 1996. [TB00] D. Theodoratos and M. Bouzeghoub. A Gen- eral Framework for the View Selection Prob- lem for Data Warehouse Design and Evolu- tion. In Proc. DOLAP 2000, pages 1–8, Wash- ington, DC, 2000. [Vas00] P. Vassiliadis. Gulliver in the land of data warehousing: practical experiences and obser- vations of a researcher. In Proc. DMDW’2000, pages 12/1–12/16, Stockholm, Sweden, 2000. P. Ciaccia, M. Golfarelli, S. Rizzi 12-10