On Estimating the Cardinality of Aggregate Views 

                                  Paolo Ciaccia                     Matteo Golfarelli          Stefano Rizzi

                                                     DEIS - University of Bologna
                                                         40136 Bologna, Italy
                                              fpciaccia, mgolfarelli, srizzig@deis.unibo.it


                                                                              and data warehouses [AGS97]. It represents facts of inter-
                                                                              est for the decision process into cubes in which each cell
                                Abstract                                      contains numerical measures which quantify the fact from
                                                                              different points of view, while each axis represents an in-
      Accurately estimating the cardinality of aggregate                      teresting dimension for analysis. For instance, within a 4-
      views is crucial for logical and physical design                        dimensional cube modeling the phone calls supported by a
      of data warehouses. While the warehouse is un-                          telecommunication company, the dimensions might be the
      der development and data are not available yet,                         calling number, the number called, the date, and the time
      the approaches based on accessing data cannot be                        segment in which the call is placed; each cube cell could
      adopted. This paper proposes an approach to es-                         be associated to a measure of the total duration of the calls
      timate the cardinality of views based on a-priori                       made from a given number to another number on a given
      information derived from the application domain.                        time segment and date.
      We face the problem by first computing satisfac-                            The basic mechanism to extract significant information
      tory bounds for the cardinality, then by capital-                       from the huge quantity of fine-grained data stored in base
      izing on these bounds to determine a good prob-                         cubes is aggregation according to hierarchies of attributes
      abilistic estimate for it. Bounds are determined                        rooted in dimensions [GL97]. In most application cases,
      by using, besides the functional dependencies ex-                       cubes are significantly sparse (for instance, most couples
      pressed by the multidimensional scheme, addi-                           of telephone numbers are never connected by a call in a
      tional domain-derived information in the form of                        given date), and so are the aggregate views.
      cardinality constraints which may bound either                              Accurately estimating the actual cardinality of each
      the cardinality of a given view or the ratio between                    view is crucial for logical and physical design as well as for
      the cardinalities of two given views. In particular,                    query processing and optimization [Vas00]. As a relevant
      we propose a bounding strategy which achieves                           case, consider the view materialization problem, where the
      an effective trade-off between the tightness of the                     aggregate views which are the most useful in answering
      bounds produced and the computational complex-                          the workload queries have to be selected for materializa-
      ity.                                                                    tion (see [TB00] for a survey). Since the number of pos-
                                                                              sible views which can be derived by aggregating a cube is
1     Introduction and Motivation                                             exponential in the number of attributes, most approaches
                                                                              assume that a constraint on the total disk space occupied
The multidimensional model is the foundation for data                         by materialization is posed, and attempt to find the subset
representation and querying in multidimensional databases                     of views which contemporarily satisfies this constraint and
     This work has been partially supported by the D2I MURST project.        minimizes the workload cost [GR00, Gup97, HRU96]. An-
The copyright of this paper belongs to the paper’s authors. Permission to
                                                                              other case where estimation of view cardinalities is relevant
copy without fee all or part of this material is granted provided that the    is index selection [GHRU97].
copies are not made or distributed for direct commercial advantage.               If the data warehouse has already been loaded, view car-
Proceedings of the International Workshop on Design and                       dinalities can be quite accurately estimated by using sta-
Management of Data Warehouses (DMDW’2001)                                     tistical techniques based, say, on histograms [MD88] or
Interlaken, Switzerland, June 4, 2001                                         sampling [HO91]. However, such techniques cannot be
(D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.)                     applied at all if the data warehouse is still under develop-
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/         ment, and the estimation of view cardinalities is needed for


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                                    12-1
design purposes. To obviate this, current approaches are        It should be noted that we are using the term view to de-
based on estimation models that only exploit the cardinal-      note the set of grouping attributes used for aggregation,
ity of the base cube and that of the single attribute domains   while the “actual” views will typically include also one or
[RS97, SDNR96], which however leads to significant over-        more measures. This slight abuse in terminology is possi-
estimation.                                                     ble since we are interested in determining the cardinality of
    In this paper we propose a novel approach to estimate       views, which only depends on the grouping attributes.
                                                                Definition 3 (Roll-up) Given the set V D of all possible
the cardinality of views based on a-priori information de-
                                                                views on D, we define on VD the roll-up partial order 
rived from the application domain. Similarly to what is
                                                                as follows: V W iff 8A i 2 V 9Aj 2 W j (Aj ! Ai ) 2
done when estimating the cardinality of projections in re-
lational databases [CM95], we face the problem by first
                                                                F + , i.e., iff W ! V . We call multidimensional lattice
                                                                for D the corresponding lattice, whose top and bottom el-
computing satisfactory bounds for the cardinality, then by
                                                                ements are dim(D) and the empty view fg, respectively.
capitalizing on these bounds to determine a good proba-
                                                                We will denote with V W the view that is the least upper
bilistic estimate for it. Besides the functional dependen-
cies expressed by the multidimensional scheme, the bounds
                                                                bound of V and W in the lattice; given a set of views S ,
                                                                we will briefly denote with (S ) the view that is their least
we determine also take into account additional domain-
derived information expressed in the form of cardinality
                                                                upper bound.
constraints, namely, bounds of the cardinality of some
views and bounds (called k-dependencies) on the ratio be-       Example 1 Consider an enterprise with branches in differ-
tween the cardinalities of two views. The computation of        ent cities. A simple dimensional scheme Transfers mod-
bounds is based on a bounding strategy, which is aimed at       eling the transfers of employees between offices might in-
achieving an effective trade-off between the tightness of the   clude:
                                                                U = fdate; month; year; fromOÆce; fromDept; fromCity;
bounds produced and the computational complexity.
    The paper is organized as follows. After providing some
basic definitions in Section 2, in Section 3 we introduce k-         toOÆce; toDept; toCity; employeeg
dependencies. In Section 4 we outline our overall approach      F = fdate ! month; month ! year;
to estimation and show its benefits with an example. Sec-
                                                                     fromOÆce ! fromDept; fromOÆce ! fromCity;
                                                                     toOÆce ! toDept; toOÆce ! toCityg
tion 5 introduces the basic properties of bounds, proposes
an efficient bounding strategy, and sketches a branch-and-
bound approach to determine the upper bound of the car-
dinality of a given view when the cardinality constraints in
                                                                thus dim(D) = f      date fromOÆce toOÆce employeeg.
                                                                                       ;            ;       ;
                                                                Examples of views on the Transfers scheme are
input do not contain k-dependencies; besides, it discusses
how the strategy introduced can be improved. Section 6                 V = fmonth; fromOÆce; toCity; employeeg
shows how the bounds derived may be used to improve the                W = fmonth; fromCity; fromDeptg
cardinality estimates. Finally, Section 7 discusses the most           Z = fyear; fromOÆce; toCityg
interesting open issues.
                                                                It is W Z        = fmonth; fromOÆce; toCityg, with
2   Background and Working Example                              (W Z )V .                                                 2
In this section we formalize the concept of view, define a          The following notation is used throughout the rest of the
partial ordering on the set of views, and present the appli-    paper. Uppercase letters from the beginning of the alpha-
cation domain we will use as an example.                        bet (A, B , : : : ) denote dimensions. Attributes which are
                                                                functionally determined by another attribute, i.e. attributes
Definition 1 (Dimensional Scheme) We call dimensional           other than dimensions, are denoted by the corresponding
scheme D a couple (U; F ) where U is a set of attributes        primed letters (e.g., A ! A0 , A ! A00). Sets of attributes
and F = fAi ! Aj j Ai ; Aj 2 U g is a set of functional         are represented by omitting braces, thus writing ABC for
dependencies (FD’s) which relate the attributes of U into a     fA; B; C g. V is the view whose cardinality is to be es-
set of pairwise disjoint directed trees. We call dimensions     timated, while W , X , Y , and Z , possibly with subscripts
the attributes A k 2 U in which the trees are rooted, i.e.,     (W1 ; W2; : : : ), denote generic views in VD . Finally, low-
such that 8A i 2 U (Ai ! Ak ) 62 F ; let dim(D)  U             ercase letters are used for the cardinalities of views and at-
denote the set of dimensions of D.                              tributes (e.g., w is the cardinality of view W , abc is the
                                                                cardinality of the view with attributes ABC , and so on).
Definition 2 (View) Let D = (U; F ) be a dimensional
scheme. We call view on D any subset of attributes V  U
such that 8A i ; Aj 2 V (Ai ! Aj ) 62 F + , where F +
                                                                3   The k-dependencies
denotes the set of all functional dependencies logically im-    A k-dependency is a relevant case of cardinality constraint
plied by F .                                                    which naturally generalizes a functional dependency. In the


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                      12-2
                                                                                                            candidate view
authors’ experience, k-dependencies are particularly use-                                                                    dimensional
                                                                                                                               scheme
ful to characterize the knowledge of the business domain
held by the experts in the field. For instance, in the trans-
                                                                                           bounder
fer domain, we might have some information concerning                                                                                       logical
the number of destination cities for an employee, or on the          dimensional                                                            scheme
                                                                         scheme,        bounds                                  view
number of distinct areas moved to from each area. If such             cardinality                                            materializer
information is in the form of bounds, it can be effectively           constraints
used to improve the bounds of view cardinality.


                                                                                                         cardinality
                                                                                                           estimate
                                                                                           estimator
Definition 4 (k-dependency) Let X and Y be two views
on D. We say that a k-dependency (kD) holds between X
                                                                                                                              workload

and Y , and denote it with X ! Y , when k (k  1) is an
                               k


upper bound of the number of distinct tuples of Y which
correspond to each distinct tuple of X within view X Y .
                                                                       Figure 1: Overall architecture for logical design

Example 2 In the Transfers scheme, assume the domain             our approach works in two steps. First, the bounder uses
expert provides the following information: The maximum           the set I of cardinality constraints supplied by the user to
number of inter-department transfers of an employee dur-         determine effective bounds for the cardinalities of a proper
ing one year is 2. This constraint can be formalized by          set of views; then, the estimator uses these bounds to derive
                       2                                         a probabilistic estimate for the cardinality of V . Note that
the following kD: X ! Y , where X = f      year employee
                                              ;          g,
Y =f   toDept  g. Intuitively, from this we can derive that      this two-steps approach generalizes well-known paramet-
the cardinality of the view f   year employee toDept
                                  ;         ;       g can-       ric models for the estimation of the cardinality of relational
not exceed twice the cardinality of X .                  2       queries [MCS88], and in particular those for projection size
                                                                 estimation [CM95], for which bounds are typically given as
    The kD’s have been studied in the context of relational      input parameters.
database theory, where they are also known as numerical             The different forms of cardinality constraints we will
dependencies. Grant and Minker [GM83] proved that kD’s           consider are:
are not finitely axiomatizable, thus no fixed set of inference     1. a lower (w ) and/or an upper (w + ) bound of the car-
rules can be used to determine whether or not a given kD              dinality w of a view W ;
is logically implied by a set of kD’s. Nonetheless, a ba-
                                                                   2. a k-dependency (X ! Y ) expressing an upper bound
                                                                                                 k
sic set of rules, which naturally extend those for FD’s, was
proposed in [GM83]. The rules we use, generalized to the              of the ratio between the cardinalities of two views X
multidimensional lattice, are:                                        and Y .

       R1 :             X ! Y ` X Z ! Y Z                      We will assume that at least the upper bounds of the car-
                            k                 k

                                                                 dinalities of all the single attributes in the dimensional
       R2 : X ! Y ^ Y ! Z ` X ! Y Z
                 k          l             
                                        k l
                                                                 scheme are known. This assumption, which is perfectly
                     X! Y Z ` X !
                                                                 reasonable in all application domains, is necessary in order
       R3 :                        Y
                      k          k

                                                                 to guarantee that at least one upper bound can be deter-
       R4 : X ! Y ^ X ! Z ` X ! Y Z
                k           l             
                                        k l
                                                                 mined for each view.
                                                                    The set I , together with the dimensional scheme D, uni-
Note that the “union” rule R4 is not strictly needed, since it   vocally determines two bounds for the cardinality of V ,
can be derived from rules R1 (“extension”), R2 (“transitiv-      which are called the greatest lower bound and the least
ity”), and R3 (“decomposition”).                                 upper bound, denoted as v and v+ , respectively.1 The
                                                                 interpretation of such bounds is as follows:
4   A Framework for Estimation
                                                                   1. in each instance of D that does not violate any con-
The framework for this work is the logical design of mul-             straint in I , the cardinality v of V is such that v 2
tidimensional databases carried out off-line, i.e. assuming           [v ; v+ ]; and
                                                                   2. there exist two instances, both compatible with I ,
that the source data cannot be directly queried to estimate
the cardinality of multidimensional views. Without loss of
generality, in the following we consider that estimates are           where v equals v and v+ , respectively.
needed for the purpose of view materialization, thus reli-       We say a constraint c 2 I is redundant iff all the greatest
able information on the size of the candidate views has to       lower bounds and the least upper bounds determined by I
be supplied to the materialization algorithm.                    are equal to those determined by I fcg.
    As sketched in Figure 1, whenever the materialization           1 For simplicity of notation, in denoting bounds we omit the depen-
algorithm requires information about a candidate view V ,                 D
                                                                 dence on and .     I


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                                              12-3
Definition 5 (Sound and Minimal Input) Let I be a set           it is v  103. Finally, by using the model in Section 6, the
of cardinality constraints on dimensional scheme D. We          cardinality of V is estimated as v = 3:8  10 4.           2
say I is sound iff there exists at least one non-empty in-
stance of D which satisfies all the constraints in I . We say   5    The Bounder
I is minimal iff no constraint in I is redundant.
                                                                The basic observation to determine bounds for view car-
In this paper we will assume that the input I is sound and      dinalities using bounds of the cardinalities of other views
minimal. It is straightforward to derive that, in this case,    is that the multidimensional lattice induces an isomorphic
all the bounds in I are either greatest lower bounds or least   structure over such cardinalities. In fact, from Definition
                                                                3 it follows that W Z implies w  z in each instance of
upper bounds (whereas the opposite is not necessarily true).
    Computing the bounds implied by I turns out to be           D, since Z ! W holds. This inequality also applies to
a challenging combinatorial problem, even for “simple”          bounds.

                                                                Lemma 1 If W Z , then w              z and w+  z + .
forms of cardinality constraints. For instance, it is known
that the problem is NP-hard for arbitrary patterns of func-
tional dependencies [CM92]. Furthermore, the actual com-           Proof: (w  z ) Assume w > z . Then, there is
putational effort needed to compute these bounds might          an instance of D in which w  w > z  z , thus w > z ,
limit applicability in real-world cases. For this reason, the   which is a contradiction. Similarly for w +  z + . 2
bounder is built around the concept of bounding strategy. A
                   s
bounding strategy is characterized by a couple of bound-
ing functions that, given I ; D, and V , compute bounds v s
                                                                   As to k-dependencies, their influence on the determina-
and vs+ such that vs  v and v+  vs+ both hold. In
                                                                tion of bounds is summarized by the following lemma.

                                                                Lemma 2 Let Z = X Y . If X ! Y , then x                      z =k
other terms, a bounding strategy never computes bounds                                                    k


                                                                and z +  k  x+ .
which are more restrictive than the ones logically implied
by the input constraints, trading-off accuracy for speed of
                                      s
evaluation. We say that a strategy is decoupled iff com-
puting v s+ for an arbitrary view V only requires the knowl-
                                                                   Proof: From Definition 4 it follows immediately that, if
edge of upper bounds w s+ of other views W , but no knowl-      X!  k
                                                                      Y , the cardinality z of Z is related to the cardinality
edge of lower bounds w s , and vice versa. Thus, for a de-      x of X by inequality z  k  x. The inequalities on bounds
coupled bounding strategy, the two bounding functions can       follow immediately.                                         2
be defined independently of each other.
    Turning to the estimator, our framework supports differ-       In the rest of this section we first propose a decoupled
ent probabilistic models. A probabilistic model is a func-      strategy to compute upper bounds (Section 5.1), then we
tion that, given I ; D, V , as well as bounds computed by       discuss some issues related to coupled strategies (Section
the bounder, provides an estimate, v, for the cardinality of    5.2).
V . In general, this step can use further information from
the application domain that is not suitable to derive bounds.   5.1 A Decoupled Upper Bounding Strategy
Typically this is the case with information concerning aver-    The bounding strategy we propose in this section, called
age values (e.g., the number of transfers of each employee      cover-based, relies on the concept of cover of a view to
on each year is 1.5, on the average).                           compute upper bounds. The following are two preliminary
                                                                definitions whose aim is to precisely characterize how sets
Example 3 Let 104 be the number of employees who have           of views and kD’s can be sinergically combined together.
been transfered at least once, and let the enterprise con-
sist of 103 offices distributed over 10 cities and belong-      Definition 6 (Graph of a set of kD’s) Let K = fX 1 !1
                                                                                                                   k


ing to one of 10 departments; let 10 3 days be the observa-
                                                                Y1 ; : : : ; Xp !p Yp g be a set of kD’s. The (labelled ori-
                                                                                k

tion period. Let V = f   date fromOÆce toOÆce
                              ;            ;         g. Since   ented)S    graph of K is G (K ) = (N; E ), with set of nodes
                                                                N = i fXi ; Yig, set of edges E = fei = (Xi ; Yi ); i =
each office is involved in transfers at most with every other
office on each date, the first trivial upper bound of v is
103  103  103 = 109. If the maximum number of transfers       1; : : : ; pg, and labeling function  such that (e i ) = ki .2
for an employee during one year is 2, and since we con-
sider 3 years, it is derived that the cardinality of the base
                                                                Definition 7 (K-set of views) Let S               = fW1; : : : ; W gm


cube is at most 2  3 = 6 times the number of transferred       be a non-empty set of views, and let K = fX 1 !1
                                                                                                                                     k


employees, i.e. 6  104 . Thus, the upper bound of v can be     Y1 ; : : : ; X ! Y g be a set of kD’s. The couple C =
                                                                                kp


improved to 6  10 4 as well (the cardinality of a view can-
                                                                            p         p

                                                                   2 Technically, G (K ) is a multi-graph, since two edges may share the
not exceed that of its base cube). On the other hand, if we     same couple of nodes. This, however, does not influence the following
assume that each office is involved in at least one transfer,   arguments.


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                               12-4
(S; K ) is called a k-set of views iff 8i = 1; : : : ; p it is                                            ABCD
Y 2 S and there exists a set of kD’s, K 0 = fW 1 !1
 i                                                         j
                                                               k


Y1 ; : : : ; W ! Y g, such that: 1) 8i = 1; : : : ; p it is                                                ABC
                   kp


X W with W 2 S , and 2) G (K 0 ) = (N 0; E 0 ) is a for-
              jp             p

  i      ji             ji

est, i.e., a set of disjoint directed trees. We call S -compliant                            AB             BC           CD
a set K 0 with such properties.
   Each kD in an S -compliant set K 0 is derived from a cor-
                                                                                              A'B
responding kD in K by applying rules R1 and R3 (since,
by hypothesis, it is X i Wji = Wji ). Note that N 0  S                            A                 B              C                    D
always holds and that, in general, multiple S -compliant K 0
sets can be derived from the same C , depending on how                                  A'
each Wji 2 S is chosen.

Example 4 C 1 = (fA0 B; C; Dg; K ), with K = fA0B !1
                                                      k                   Figure 2: Roll-up relationships of views in Example 5

C; C !
     k2
        Dg, is a k-set of views, since K is S -compliant.                  C3 = (fAB; C g; fAB ! C g). From Lemma 2 it
                                                                                                                 k

The same is true for C2 = (fAB; C; Dg; K ), since K 0 =
                                                                            immediately follows abc  ab +  k.
fAB ! k1
         C; C !k2
                  Dg is S -compliant (in fact, A 0 B AB ).
On the other hand, C3 = (fB; C; Dg; K ) is not a k-set
since no S -compliant set of kD’s can be found.                            C4 = (fA; B; C g; fA !1 B; B !2 C g). By applying
                                                                                                            k                 k


   It is important to remark that Definition 7 requires K 0 ,               rule R2, we derive A !
                                                                                                 1 2 BC , thus abc  a+  k  k .
                                                                                                           k k
                                                                                                                           1 2
and not necessarily K , to be a forest. For instance, the
couple (fAg; fA0 ! Ag) is not a k-set, though G (fA 0 !                    C5 = (fA; B; C g; fA !1 B; A !2 C g). Rule R4 is
                    k                                    k                                                   k                k


Ag) is a forest, since G (fA ! Ag) is cyclic. On the other
                              k
                                                                            now used to derive A !
                                                                                                 1 2 BC , thus abc  a+  k  k .
                                                                                                           k k
                                                                                                                           1 2
hand, C = (fA; A0; B g; fA0 !1 B; B !2 A0 g) is a k-set
                                k         k
          4
(after deriving A !1 B from A0 !1 B ) even if G (fA0 !1
                   k           k                     k
                                                                           C6 = (fA; A0B; C g; fA0 ! Ag). According to rule
                                                                                                                     k


B; B !2 A0 g) is cyclic.                                                    R1 it is A 0B ! AB , and from Lemma 2 ab+  k 
       k                                                                                          k


      Finally, for the k-set C 5 = (fA; A0 B; A0C g; fA0 !
                                                               k            a0 b+ . On the other hand, abc  ab+  c+ , thus abc 
                                                                            k  a0 b+  c+ .                                     2
Ag), two S -compliant sets, K10 = fA0 B ! Ag and
                                                     k


K20 = fA0 C ! Ag, can be derived.              2
            k

                                                                       The following theorem precisely characterizes how
Definition 8 (Cover) Let V 2 VD be a view on D and                   bounds are related to the graph of K 0 .
C = (S; K ) be a k-set of views. C is called a V -cover iff
V (S ).
                                                                     Theorem 1 (Cover-based bounding) Let V be a view and
   As the following example suggests, a V -cover can be              C = (S; K ) be a V -cover, with S = fW1; : : : ; W g                      m

used to bound from above the cardinality of V by generaliz-          and K = fX1 !1 Y1 ; : : : ; X ! Y g. Let K 0 be an
                                                                                              k                          kp

ing Lemma 1 to the case of multiple views (since V (S )            S -compliant set, R(G (K 0 )) be the set of root nodes of
                                                                                                                     p            p


holds). When also kD’s are present, Lemma 2 can be ex-               the forest G (K 0 ) = (N 0 ; E 0) associated to K 0 , and let
ploited to improve the bound. Since a cover must be a k-set,         S0 = S N 0 stand for the set of views which are not nodes
we are guaranteed that the cardinalities of some views in S          in G (K 0 ). Then:
can be safely “replaced” by the ki’s of the kD’s in K 0 .
Example 5 Let V = ABC . Below we consider some no-                                               Y         Y
                                                                                                  p


table examples of V -covers and show how each of them                         v  u(C ; K 0) def
                                                                                             = ki                    wj+                      (1)
                                                                                                 i=1 Wj 2R(G (K ))[S0
can be used to derive an upper bound for v. In order to
                                                                                                                                      0


help the reader, Figure 2 depicts the roll-up relationships
between the views involved.                                               Proof: The intuition behind the proof is that each tree
      C1 = (fABCDg; ;) is a V -cover since V (S1 ) =              G = (N 0; E 0 ) of G (K 0 ) contributes to u(C ; K 0) with the
                                                                      t         t        t
                                                                     upper bound of the cardinality of its root W t times all the
       ABCD. From Lemma 1 it is derived abc  abcd+.                 ki’s which label the edges in Et0 .
      C2 = (fAB; BC g; ;) is a V -cover since V (S2 ) =              Since by Definition 8 it is V (S ), it is sufficient to
       ABC . Since the natural join between two views is a           prove that u(C ; K 0 ) is an upper bound of (S ). Since the
       subset of their Cartesian product, it is abc  ab +  bc+ .   size of the natural join of a set of views can never exceed


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                                           12-5
that of their Cartesian product, it is                                   by (1) is actually independent of the one chosen. For in-

       card((S ))  card(((S0 ))((N 0 )))
                                                                         stance, the reader may immediately verify that, in Example
                                                                         4, it is u(C5; K10 ) = u(C5 ; K20 ) = k  a0 b+  a0 c+ .
                    card((S0 ))  card((N 0 ))
                      Y + Y                                              Lemma 3 Let C = (S; K ) be a V -cover, and let K10 and
                          w  card((N 0))
                                  20
                                          j                  t
                                                                         K20 be two arbitrary S -compliant sets. It is u(C ; K 10 ) =
                                                                         u(C ; K20 ) def
                                                                                     = u(C ).
                             Wj   S           t


Since the ki’s are  Q partitioned over the trees, it is enough to
prove that wt+  i20 ki , where Wt is the root of G t and 0t              Coherently with Theorem 1 and Lemma 3, the cover-
is the set of labels in G t, is an upper bound of card((N t0)).
                           t
                                                                                                   cb       + as:
                                                                         based bounding strategy computes v cb
This is proved by induction on the number L of levels in G t .                   (
    Base step (L=2). 3 In this case Gt corresponds to the                   +=      v+                                      if v + 2 I ;
                                          k2;q                             vcb
                                                                                    minfucb(C ) j C is a V -coverg          if v + 62 I :
                      k2;1                                                                                                                   (2)
set of kD’s fWt ! W2;1 ; : : : ; Wt !2 W2;q2 g. From the
unionQ  rule R4 it immediately follows that card((N t0)) 
  +
wt  qi=1 2 k .                                                          where ucb(C ) is obtained by replacing w j+ with wj;  + in
                2;i                                                                                                              cb
    Inductive step (L            1 ) L). Let Nt0 (L 1)                   u(C ). In general, evaluating the cover-based bound leads
be the set of nodes in the first L             1 levels. By in-          to a recursive computational flow; note that the “case-0” of
ductive hypothesis it is card((N t0 (L            1)))  wt+                        + = v+ , is correctly defined since we assumed
                                                                         recursion, vcb
QL 1 Qql
   l=2     i=1
                kl;i . Adding the L-th level introduces new              the input I to be minimal.
edges with labels kL;1; : : : ; kL;qL and corresponding ter-                The space of the V -covers to be analyzed in order to
minal nodes WL;1 ; : : : ; WL;qL . From the i-th of the corre-           determine vcb+ has exponential size. On the other hand, the
sponding kD’s we can derive (using rules R1 and R3) the                  following theorem shows that, under some circumstances, a
kD (Nt0 (L         1)) ! W
                         kL;i
                                                                         V -cover C2 can be discarded from the search space without
                                                                         even computing ucb (C2).
                                  L;i   . From the union rule R4 it is
derived:
                 Q                                                       Theorem 2 Let C1 = (S1 ; K1) and C2 = (S2 ; K2 ) be two
     (N 0 (L 1)) =1!
                            qL

          t
                            i  kL;i
                                        (fW 1; : : : ; W
                                               L;          L;qL   g)     V -covers. If S1  S2 and K1 = K2 or S1 = S2 and
which, due to Lemma 2, leads to:                                         K2  K1 , then ucb(C1 )  ucb (C2).
 card(((Nt0 (L 1)))((fWL;1; : : : ; WL;qL g))) =                      5.1.1   Reasoning without k-dependencies
    = card((Nt0(L)))                                                    When no k-dependencies are included among the input
                   L 1 ql
      YqL
                +  YY              +   YL
                                           Yql
                                                                         constraints I , covers degenerate into sets of views, which
     kL;i  wt           kl;i = wt          kl;i 2                    allows us to precisely characterize the set of V -covers
      i=1          l=2 i=1             l=2 i=1
                                                                         that can provide useful (non redundant) bounds. To see
                                                                         how such covers are determined, two orthogonal aspects
   It is possible to prove that (1) is valid even if G (K 0 ) is
                                                                         are considered: a domination relationship between sets of
                                                                         views and the input information, I . While the former in-
not a forest, provided that R(G (K 0 )) contains (at least) a            duces a partial order on the bounds obtainable from V -
set of nodes from which every other node in G (K 0 ) can be              covers, regardless of the specific input I , the latter can be
reached through a directed path. On the other hand, the                  used to restrict the set of useful V -covers to those including
bounds determined by such “non-forest” V -covers are al-                 only views in I .
ways redundant, meaning that a proper V -cover yielding a
                                                                            In this section, since we assume K = ;, we will work
better bound for v can always be found.
                                                                         only with the S part of V -covers. Consequently, in (2),
Example 6 Let V = ABC , and consider the couple                          ucb (C ) can be replaced by
(fA; B; C g; K ) with K = fA !     k1
                                      C; B !
                                           k2
                                                C g), which
                                                                                             ucb (S ) =
                                                                                                          Y
                                                                                                                    wi+ :
                                    0
is not a k-set since the graph of K = K has two roots (A
                                                                                                                                             (3)
and B ). The bound returned by (1) is v  k 1  k2  a+  b+
                                                                                                          Wi   2S


which is redundant, since a better bound is obviously ob-                Definition 9 (Domination between sets of views)
tained through the V -cover (fA; B; C g; fA !1 C g).       2             Let S1 = fW1;1; : : : ; W1;i; : : : ; W1;m g and S2 =
                                             k


                                                                         fW2;1; : : : ; W2;j ; : : : ; W2;ng be two sets of views. We
  The following lemma shows that, when multiple S -                      say that S1 dominates S2 , written S1 vS2 , iff S2 can
compliant sets exist for a given cover, the bound returned               be partitioned into m subsets S 2;1; : : : ; S2;m such that
  3 The case   L = 1 cannot arise, since each G has at least one edge.
                                                  t                      W1;i(S2;i ) 8i = 1; : : : ; m.


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                                         12-6
   For instance, fA0 B; C gvfAB; CD; E g. Note that if                                            Y3
Si vSj then (Si )(Sj ) necessarily holds, whereas the
opposite is not always true (e.g., fAB; BC g6vfABCDg                                       Y2
though ABC ABCD).                                                                                       V

Lemma 4 Let S1 and S2 be two sets of views. If S 1 vS2
                                                                               Y1= W1             W2            W3
then ucb(S1 )  ucb (S2 ).
                                                                                           Z1            Z2
Definition 10 (Ground Views and Covers) We say that a
view W is ground iff w + is in I . A V -cover is said to be         Figure 3: Roll-up relationships between views in Lemma
ground when all the views it includes are ground.                   6, in the case n = 3
Lemma 5 Let S be a non-ground V -cover. Then there ex-               4. If S is a ground V -cover, no set S 0 such that S  S 0
ists a ground V -cover S1 such that u cb (S1 )  ucb (S ).              is a minimal V -cover (from Definitions 9 and 11).
   Proof (sketch): Since S is not ground, at least one view          5. If a minimal V -cover S contains a ground view W ,
in S is not ground. By recursively applying (3), u cb (S ) will         it cannot contain any other ground view W 0 such that
be eventually expressed as a product of bounds in I . The               W W 0 (from Definitions 9 and 11).
case of strict inequality (u cb(S1 ) < ucb (S )) can arise since
in this recursive process there is no guarantee that a given        5.2 Towards a Coupled Bounding Strategy
ground view will be generated just once, thus its least upper       The bounds we derive through the strategy described in
bound might appear more than once in u cb (S ).               2     Section 5.1 are not necessarily the tightest possible ones.
                                                                    In fact, more complex and effective bounding strategies
Definition 11 (Minimal Cover) A ground V -cover S is
                                                                    can be defined to the detriment of computational speed.
minimal iff there is no other ground V -cover S 1 such that
S1 vS holds.
                                                                    Basically, in these strategies the concept of cover may be
                                                                    extended by considering more complex patterns of views,
                                                                    where upper and lower bounds are used jointly. In this sec-
  The following theorem immediately derives from Lem-
                                                                    tion we present some preliminary considerations on cou-
mas 4 and 5.
                                                                    pled strategies; for simplicity, we will assume that the input
Theorem 3 (Sufficiency of Minimal Covers) It is:                    does not contain k-dependencies.
                                                                        As to upper bounding, the cover-based strategy can be
   minfucb(S ) j S is a V -coverg =                                 improved by exploiting results from majorization theory,
                                                                    which state that the size of the natural join between two
         = minfucb(S ) j S is a minimal V -coverg:            (4)   relations is majorized when the distributions of the join at-
                                    +                     +
    For instance, let I = fab0 ; cd+; a0de+ ; a+ ; a0 ; b+ ;
                                                                    tribute(s) in the two relations are maximally skewed [IC91].

b0+ ; c+ ; d+ ; e+ g and V     = A0B 0 CD. The min-
                                                                    The extension of this argument to the multidimensional
                                                                    lattice is as follows. Given two views W 1 and W2 such
imal V -covers are fAB ; CDg, fA0 ; B 0 ; CDg, and
                             0
                                                                    that W1 ÆW2 and W2 ÆW1 , let Y = W1W2 and let
fA0 DE; B 0 ; C g.                                                  Z = W1 W2 , where is the greatest lower bound op-
    From the above results, several facts can be easily de-         erator on the lattice; it can be proved that
rived, which can be exploited to efficiently generate mini-
mal V -covers by means, say, of a branch-and-bound algo-                   y  w1+  w2+     (z  1)(w1+ + w2+ z ) (5)
                                                                    It should be noted that, when W 1 W2 = fg, since the
rithm:

  1. A ground view W such that V W is a ground V -                 empty view fg has cardinality 1, (5) correctly reduces to
     cover (from Definition 8).                                     (3).
                                                                       This result can be extended to a V -cover whose views
  2. A ground view W such that arity(W ) = 1 and                    are connected by a linear join graph.
     W \ V = ; does not belong to any minimal V -cover 4            Lemma 6 Let S = fW1 ; : : : ; Wng be a V -cover; let Zi =
     (from Definitions 9 and 11).
                                                                    Wi Wi+1 , i = 1; : : : ; n 1, Y1 = W1 , and Yi+1 =
  3. A ground view W such that arity(W ) > 1 and 8W 0               Yi Wi+1 , i = 1; : : : ; n 1. Then:
     for which W 0W it is arity(W 0 \ V ) < 2 does not                 yi++1  yi+  wi++1 (zi 1)(yi+ + wi++1 zi )
     belong to any minimal V -cover (since C includes the
     cardinalities of all the attributes).                                      for i = 1; : : : ; n 1 ;                   (6)
  4   arity(W ) denotes the number of attributes in W .                     v  yn :
                                                                                  +


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                          12-7
   The pattern consisting of views Y i , Wi+1 , Zi , and                                       se
                                                                     In our approach, denoted (“safe-estimate”), the above
Yi+1 , depicted in Figure 3, can in principle be extended         estimate is improved in two ways: by replacing v max with
to take into account also size information on Y i Zi and                                                         + , as a mea-
                                                                  the upper bound computed for v, for instance v cb
Wi+1 Zi , that is, on non-join attributes; this will further      sure of the maximum cardinality of V , and by replacing the
strengthen the upper bound. At present, we guess that the         cardinality of the base cube d with an estimate, w se , of the
exact computation of v + might involve taking into account        cardinality of a view W such that V W . This leads to:
patterns that can extend over the whole lattice. However,
                                                                                        + ; w )  minfv+ ; w g
                                                                              v se = (vcb                                 (10)
besides the theoretical interest, it is important to trade-off                               se        cb   se
the increased complexity with the actual gain that could be
                                                                  Since both vcb + and w can be considerably lower than
obtained by having more accurate bounds, considering also                                 se
how bounds can be used by the estimator.                          vmax and d, respectively, it is usually the case that v se 
   A coupled strategy requires also lower bounds to be            v sdnr. The rationale for (10) is that we can view the prob-
computed, which is radically different from computing up-         lem of estimating v as the one of distributing the tuples of
per bounds. In fact, while computing an upper bound cor-          view W , which are estimated to be wse , over a number of
                                                                    + “buckets”.
                                                                  vcb
responds to bounding the size of a join, computing a lower
bound corresponds to bounding the size of a projection,               Due to the need to know w se, it is obvious that our es-
where the relevant difference is that projection is a unary       timation process must move downward from the top of the
operator. This leads to a much simpler situation to deal          lattice (whose cardinality d is typically known) following a
with, in which Lemma 1 is exploited and the lower bound           path leading to V . Clearly, this represents a simplification
of v is computed as maxfw j w 2 I ; W V g. Differ-               of the correct estimation procedure, which would require to
ently from upper bounds, no combinatorial issues arise in         determine v by following all the paths from dim(D) to V .
computing lower bounds through this strategy; thus, com-          On the other hand, this would lead to combinatorial explo-
plexity is linear in the cardinality of I .                       sion and necessitate of highly complex probabilistic models
   A better bound can be obtained by using information            that are well beyond the current state-of-the-art knowledge.
associated to “sibling” views. Let W be a view such that              From a more practical (numerical) point of view, it
V \ W = ;, and Z = V W ; then:                                   should be noted that moving from upper bounds to esti-
                                                                  mates leads to significant differences under specific condi-
                         v        wz +                     (7)   tions only. Two relevant cases should be considered, which
                                                                  arise from the limit behavior of Cardenas’ formula:
In fact, if v < z =w+ , then the size of the Cartesian prod-                            + it is v  w
uct of V and W would be less than z , which is impossible.         1. When wse  0:1  vcb       se   se

                                                                                      + it is v  v+
                                                                   2. When wse  3  vcb       se  f+
6    The Estimator
                                                                  The values 0:1 and 3 can thus be used to predict whether
Assuming that effective bounds have been derived, cardi-
                                                                  the estimator will deliver results which substantially differ
nality estimation must be based on a probabilistic model
                                                                  from those directly obtainable from the bounder.
to derive an estimate, v, of the cardinality of view V . The
model we adopt here is based on the Cardenas’ formula
                                                                  Example 7 In the Transfers scheme, we consider three in-
[Car75], which states that, when throwing N distinct ob-
                                                                  put situations:
jects into B buckets, the expected number of buckets in
                                                                       
which at least one object will fall can be estimated as:          I1 = fdateg+ = 103 ; fyearg+ = 3; femployeeg+ = 104 ;
                                         N !
            def
    (B; N ) = B  1         1 B1                 minfB; N g         ffromOÆceg+ = ftoOÆceg+ = 103 ;
                                                                      ffromCityg+ = ftoCityg+ = 10;
                                                            (8)
                                                                      ffromDeptg+ = ftoDeptg+ = 10
                                                                           
Within the approach proposed in [SDNR96], (8) is used to
estimate v by relying on the maximum cardinality of V ,
                                                                  I2 =I1 [ femployee; yearg !4 ffromOÆce; toOÆce; dateg
                                                                           
defined as the Cartesian product
                         Q       of the cardinalities of the      I3 =I2 [ ffromCity; fromDeptg+ = 40;
attributes in V , v max = Ai 2V ai , and on the cardinality           ftoCity; toDeptg+ = 40;
of the base cube, d = card(dim(D)), that is:
           v sdnr = (vmax ; d)  minfvmax ; dg             (9)
                                                                      ffromCity; fromDeptg !2 ftoCity; toDeptg;
                                                                                            30
This formula turns out to significantly overestimate the
                                                                      ffromCity; fromDeptg !   ffromOÆceg;
                                                                                       30
cardinalities and can easily lead to violate the constraint           ftoCity; toDeptg ! ftoOÆceg
v sdnr  v+ .


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                       12-8
                                                                     each year is 2, would allow the cardinality of the base
Table 1: Improving upper bounds and estimates for increas-
                                                                     cube to be estimated as twice the cardinality of view
ing domain-derived information
                                                                     femployee year
                                                                                ;     g.
                 +          +
    input      wcb         vcb          wse         v se            Probabilistic estimates. Estimates based on Cardenas’
      I1       1013        106          1013      106                formula can be improved in several ways. In particu-
      I2     1:2  105 1:2  105      1:2  105 7:6  104            lar, information on lower bounds could be considered
      I3     1:2  105 7:2  104      1:2  105 5:8  104            by exploiting the results in [CM95], as well as infor-
                                                                     mation concerning the distribution of attribute values

                        date employee fromOÆce toOÆce
                                                                     over their domains. For this, the challenge is to derive
Let W = dim(D) = f          ;          ;           ;         g
                              fromOÆce toOÆce
                                                                     new models that can be applied when the data ware-
be the base cube and V = f                ;        g be the          house has not been loaded yet.
view whose cardinality is to be estimated. Table 1 shows
how the upper bound w cb + of W , the upper bound v + of V ,
                                                    cb           References
and the estimate vse improve as new cardinality constraints
are progressively supplied. The estimate v se is based on        [AGS97]    R. Agrawal, A. Gupta, and S. Sarawagi. Mod-
the estimate of w, wse , which is assumed to be equal to its                eling Multidimensional Databases. In Proc.
               +.
upper bound w cb                                          2                 ICDE’97, pages 232–243, Birmingham, UK,
                                                                            1997.

7    Conclusions and Open Issues                                 [Car75]    A.F. Cardenas. Analysis and Performance of
                                                                            Inverted Database Structures. Communica-
In this paper we have shown how cardinality constraints                     tions of the ACM, 18(5):253–263, 1975.
derived from the application domain may be employed to
determine effective bounds on the cardinality of aggregate       [CM92]     P. Ciaccia and D. Maio. On the Complexity
views and how, in turn, such bounds can be used to esti-                    of Finding Bounds for Projection Cardinalities
mate the cardinality of the views. In order to improve the                  in Relational Databases. Information Systems,
approach effectiveness, some issues still need to be investi-               17(6):511–515, 1992.
gated. In the following we briefly discuss those we believe
to be crucial:                                                   [CM95]     P. Ciaccia and D. Maio. Domains and Ac-
                                                                            tive Domains: What This Distinction Implies
     Domination. A characterization of domination be-                      for the Estimation of Projection Sizes in Re-
      tween k-sets of views, similar to that reported in Def-               lational Databases. IEEE Transactions on
      inition 9 for sets of views, needs to be developed in                 Knowledge and Data Engineering, 7(4):641–
      order to reduce the complexity of computing upper                     655, 1995.
      bounds in presence of k-dependencies.
                                                                 [GHRU97] H. Gupta, V. Harinarayan, A. Rajaraman, and
     Minimality. Throughout this paper we assumed that                   J. Ullman. Index Selection for OLAP. In Proc.
      the cardinality constraints supplied by the domain ex-              ICDE’97, pages 208–219, Birmingham, UK,
      pert are sound and non redundant. Of course, this                   1997.
      gives rise to the problem of determining, given an in-
      put I , if I is sound and minimal, which we argue can
                                                                 [GL97]     M. Gyssens and L.V .S. Lakshmanan. A Foun-
                                                                            dation for Multi-Dimensional Databases. In
      be dealt with as done for, say, functional dependencies
                                                                            Proc. 23rd VLDB, pages 106–115, Athens,
      (whose inference rules can be used both for schema
                                                                            Greece, 1997.
      normalization as well as for input minimization).

     Cardinality constraints. The input knowledge may be        [GM83]     J. Grant and J. Minker. Numerical Depen-
                                                                            dencies. In H. Gallaire, J. Minker, and J.-M.
      further extended by considering other forms of car-
                                                                            Nicolas, editors, Advances in Database The-
      dinality constraints which are typically known to the
                                                                            ory, volume II. Plenum Publ. Co., 1983.
      experts of the application domain. For instance, while
      in this paper we have defined k-dependencies to ex-        [GR00]     M. Golfarelli and S. Rizzi. View Material-
      press bounds on the ratio between the cardinalities of                ization for Nested GPSJ Queries. In Proc.
      two views, they may also be used to denote the aver-                  DMDW’2000, Stockholm, Sweden, 2000.
      age of such ratio; while this kind of knowledge cannot
      be used by the bounder, it allows the cardinality esti-    [Gup97]    H. Gupta. Selection of Views to Materialize in
      mations to be improved. For instance, knowing that                    a Data Warehouse. In Proc. ICDT’97, pages
      the average number of transfers for each employee on                  98–112, Delphi, Greece, 1997.


P. Ciaccia, M. Golfarelli, S. Rizzi                                                                                     12-9
[HO91]       W. Hou and G. Özsoyoglu. Statistical Es-
             timators for Aggregate Relational Algebra
             Queries. ACM Transactions on Database Sys-
             tems, 16(4):600–654, 1991.
[HRU96]      V. Harinarayan, A. Rajaraman, and J. Ullman.
             Implementing Data Cubes Efficiently. In Proc.
             ACM Sigmod Conf., pages 205–216, Montreal,
             Canada, 1996.
[IC91]       Y.E. Ioannidis and S. Christodoulakis. On the
             Propagation of Errors in the Size of Join Re-
             sults. In Proc. ACM Sigmod Conf., pages 268–
             277, Denver, CO, 1991.
[MCS88]      M. V. Mannino, P. Chu, and T. Sager. Sta-
             tistical Profile Estimation in Database Sys-
             tems. ACM Computing Surveys, 20(3):191–
             221, 1988.
[MD88]       M. Muralikrishna and D.J. DeWitt. Equi-depth
             Histograms for Estimating Selectivity Fac-
             tors for Multi-Dimensional Queries. In Proc.
             ACM Sigmod Conf., pages 28–36, Chicago, IL,
             1988.
[RS97]       K. Ross and D. Srivastava. Fast Computation
             of Sparse Datacubes. In Proc. 23rd VLDB,
             pages 116–125, Athens, Greece, 1997.
[SDNR96] A. Shukla, P. Deshpande, J. Naughton, and
         K. Ramasamy. Storage Estimation for Multidi-
         mensional Aggregates in the Presence of Hier-
         archies. In Proc. 22nd VLDB, pages 522–531,
         Mumbai, India, 1996.
[TB00]       D. Theodoratos and M. Bouzeghoub. A Gen-
             eral Framework for the View Selection Prob-
             lem for Data Warehouse Design and Evolu-
             tion. In Proc. DOLAP 2000, pages 1–8, Wash-
             ington, DC, 2000.
[Vas00]      P. Vassiliadis. Gulliver in the land of data
             warehousing: practical experiences and obser-
             vations of a researcher. In Proc. DMDW’2000,
             pages 12/1–12/16, Stockholm, Sweden, 2000.


P. Ciaccia, M. Golfarelli, S. Rizzi                          12-10