View Materialization for Nested GPSJ Queries
                            Matteo Golfarelli                                                 Stefano Rizzi
                      DEIS Ð University of Bologna                              DEIS, CSITE CNR Ð University of Bologna
                   Viale Risorgimento 2, Bologna, Italy                            Viale Risorgimento 2, Bologna, Italy
                           mgolfarelli@deis.unibo.it                                         srizzi@deis.unibo.it

                                                                              are typical dimensions while quantity sold and retail price
                                                                              are measures.
                                            Abstract                             A DW implemented on a relational DBMS is usually
                                                                              organized according to the so-called star scheme [Kim96],
     View materialization is a central issue in logical                       composed by one fact table containing the measures and
     design of data warehouses since it is one of the                         one denormalized dimension table for each dimension of
     most powerful techniques to improve the                                  analysis. The primary key of each dimension table is
     response to the workload. Most approaches in the                         imported into the fact table; the primary key of the fact
     literature only focus on the aggregation patterns                        table is defined by the set of these foreign keys. Each
     required by the queries in the workload; in this                         dimension table contains a set of dimension attributes
     paper we propose an original approach to                                 defining a hierarchy of aggregation levels for the
     materialization in which the workload is                                 corresponding dimension.
     characterized by the presence of complex queries                            The basic mechanism to extract useful information
     which cannot be effectively described only by                            from elemental data in DWs is aggregation. In principle,
     their aggregation pattern. In particular, we                             all queries could be solved by directly aggregating the
     consider queries represented by nested GPSJ                              elemental data in the base fact table; for efficiency reasons,
     (Generalized Projection / Selection / Join)                              a DW typically stores, besides the elemental values of
     expressions, in which sequences of aggregate                             measures, also values summarized according to some
     operators may be applied to measures and                                 aggregation pattern, i.e., to a set of dimension attributes
     selection predicates may be formulated, at                               defining the coarseness of aggregation. Even summarized
     different granularities, on both dimensions and                          data are organized into a star scheme, whose fact table is
     measures. Other specific issues taken into account                       usually called a view; measure values are obtained by
     are related to the need for materializing derived                        applying an aggregation operator to the data in the base
     measures as well as support measures to make                             fact table or in another view with a finer pattern. Of
     algebraic operators distributive. Based on this                          course, the larger the number of queries a view can be
     query model, an efficient algorithm to determine                         used to answer, the more effective materializing it [Coh99,
     a restricted set of candidate views for                                  Yan95].
     materialization, to be fed into an optimization                             Since pre-computing all the possible views is
     algorithm, is proposed. Finally, the effectiveness                       unfeasible, several techniques to select an appropriate
     of our approach is discussed with reference to a                         subset of views to materialize have been proposed. While
     sample workload.                                                         most approaches only focus on the aggregation patterns
                                                                              required by queries, very few works analyze how the
1 Introduction                                                                presence of different operators affects aggregation. On the
                                                                              other hand, using only one operator (typically, the sum)
A data warehouse (DW) is a repository which provides                          to aggregate measures when materializing views, makes
an integrated environment for decision support. From a                        them useless for several queries.
logical point of view, a DW can be seen as a                                     In this paper we propose an original approach to
multidimensional cube where each cell contains                                materialization in which the workload is characterized by
information useful for the decision process, i.e., a set of                   the presence of complex queries which cannot be
numerical attributes called measures, while each axis                         effectively described only by their aggregation pattern. In
represents a possible dimension for analysis. For example,                    particular, we consider queries represented by Nested
in a DW modeling a chain store, time, product and store                       Generalized Projection / Selection / Join (NGPSJ)
                                                                              expressions, in which sequences of aggregate operators
 The copyright of this paper belongs to the paperÕs authors. Permission to
                                                                              may be applied to measures and selection predicates may
 copy without fee all or part of this material is granted provided that the   be defined, at different granularities, on both dimensions
 copies are not made or distributed for direct commercial advantage.          and measures. As a result, the correspondence between
 Proceedings of the International Workshop on Design                          measures in the views and in the base fact table is not
 and Management of Data Warehouses (DMDW'2000)                                necessarily one-to-one: some measures in the base fact
 Stockholm, Sweden, June 5-6, 2000                                            table may not be present at all in a view, while others
 (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)                            may correspond to several measures in the view;
                                                                              furthermore, additional (either support or derived)
 http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/


M. Golfarelli, S. Rizzi                                                                                                               10-1
measures may be included in the view in order to              aggregate on {PName} calculating the maximum quantity
correctly compute aggregations. Particular emphasis is        and the average price. Query q2 aggregates foodstuff sales
given to the effects on materialization of the contemporary   on {TMonth, PType}, calculating for each month the
presence of multiple aggregation operators in a query.        average price and the sum of the derived measure R =
   Under these assumptions, an efficient algorithm to         Qty ´ Prc . Query q3 requires to aggregate on {TMonth,
determine the restricted set of views which could be          PName} calculating the monthly sums of the quantities
usefully materialized (candidate views) is proposed. The      and that of the revenues, then to aggregate on {TYear,
algorithm builds a query view graph whose vertices            PName} calculating the sums of the monthly quantities
represent candidate views and whose arcs denote the           and the averages of the monthly revenues.
possibility of computing a view from another. The query          These examples give rise to some considerations:
view graph may then be fed into an optimization
algorithm like the ones proposed in [Gup97, Gup99]            1.   Different aggregate operators may be applied to the
which select, from the set of the candidate views, the             same measure (e.g., SUM(Qty), MAX(Qty)).
subset which minimizes the response to the workload           2.   Sequences of aggregate operators may be applied to a
under a given space constraint.                                    measure (e.g., the maximum of the sums).
   After discussing NGPSJ expressions in Section 2 and        3.   Some aggregations may require support measures in
classifying aggregate operators in Section 3, in Section 4         order to be correctly calculated from partial aggregates
we define how views will be structured. The algorithm for          (e.g., the average operator requires the cardinality of
determining candidate views is outlined in Section 5,              each partial aggregate).
while Section 6 discusses the experimental results.           4.   Calculating a derived measure from its component
                                                                   measures may determine a wrong result if the query is
1.1 Motivating Example                                             not solved directly on the base fact table (e.g.,
                                                                   SUM(Qty)*AVG(Prc) ¹ SUM(Qty*Prc)).
The simple Sales star scheme used as a working example
is defined below:
                                                              In all these cases, in order to take full advantage of
                                                              materialization, it may be necessary to include additional
STORE (StoreId, SName, SCity)                                 measures in views. Besides:
TIME (TimeId, TDay, TMonth, TYear)
PRODUCT (ProductId, PName, PType, PCategory)                  5.   The results of aggregation are affected by the
SALE (TimeId, StoreId, ProductId, Qty, Prc)                        selections which operate on its source data.

where the denormalized dimension tables leave out the         1.2 Related Literature
following hierarchies of functional dependencies:
                                                              Several works adopt graph-like structures to represent
                                                              views, but a few focus on how to build it. In [The97] an
        SName®SCity, TDay®TMonth®TYear,
                                                              algorithm to build a multiquery graph is outlined; the
            PName®PType®PCategory                             approach deals with the problem of building base fact
                                                              tables from operational databases. Besides, since only
Tuples in dimension tables are typically identified by        join, selection and classical projection operators are
surrogate keys; in the rest of the paper, for the sake of     involved, and no aggregation semantics is introduced, the
simplicity, surrogate keys will never be considered,          approach cannot be used to determine aggregate views.
assuming that tuples are identified by an attribute of the       The approach which is most closely related to ours is
hierarchy.                                                    the one described in [Aki99]. That work is improved by
   As to the workload, consider the following queries on      ours with reference to the class of queries considered
the Sales star scheme:                                        (GPSJ queries with distributive operators vs. NGPSJ
                                                              queries with distributive, algebraic and holistic operators)
   q1: ÒFor each product, find the average selling price      and to the computational cost of the graph-building
   and the maximum total quantity sold for the stores         algorithm. In [Aki99] the complexity is always
   which sold more then 1000 units of that productÓ           exponential since, for each query, all the possible
   q2: ÒFor each month and for each type of food              evaluation strategies are built and organized into a graph;
   product, find the average selling price and the total      the query view graph is then obtained by applying a set of
   revenueÓ                                                   rules that prune non-candidate views. The complexity of
                                                              our algorithm is exponential only in the worst case, to
   q3: ÒFor product of type beverage and for each year,       become polynomial in the best case.
   find the total quantity sold and the average of the           Query nesting is considered in [Ros98], where
   total monthly revenuesÓ                                    subqueries involve multiple dependent aggregates at
                                                              multiple granularities (multi-feature cubes). Though
   Query q1 requires to first aggregate data on pattern       interesting results concerning distributive and algebraic
{SName, PName} , calculating the average price and the        cubes are provided, the queries considered are quite
sum of the quantities sold, then to select the stores for     different from NGPSJ queries, since no nesting of distinct
which this sum is greater than 1000, finally to further       aggregation operators at different granularities is possible.


M. Golfarelli, S. Rizzi                                                                                               10-2
2 Query Modeling                                                  ·      the necessary joins are pushed below all projections
                                                                         and selections;
In principle, the workload for a DW is dynamic and                ·      projection and selections are coalesced as much as
unpredictable. A possible approach to cope with this fact                possible;
consists in monitoring the actual workload while the DW           ·      selections on dimensional attributes are pushed below
                                                                                                                   1
is operating. Otherwise, the designer may try to determine               all projections and selections on measures .
a core workload a priori: in fact, on the one hand, the user
typically knows in advance which kind of data analysis               In the rest of the paper we will consider NGPSJ
(s)he will carry out more often for decisional or statistical     expressions in normal form applied to a star scheme S
purposes; on the other, a substantial amount of queries are       with base fact table FT and dimension tables DT 1,...DT n,
aimed at extracting summary data to fill standard reports.        structured as follows:
    Basically, the approaches to view materialization differ
in the level of detail they adopt to model the queries in         e:      sPredMh(pPh,Mh(... sPredM1(pP1,M1(sPredA Ù PredM0
the workload. Some approaches only consider the                         ( FT >< DT 1 ... >< DTn)))...))
aggregation patterns [Bar97], while others analyze the
query features in more detail [Gup97].                            where:
    Our approach falls in the second group; in particular,
we consider queries represented by nested GPSJ                    ·      M j (j=1,...h) is a set of aggregate measures, each
expressions. A GPSJ (Generalized Projection / Selection /                defined as OP(m1,ÉmmaxOP) where OP is an aggregate
Join) expression [Gup95] is a selection s1 over a                        operator, maxOP is the number of arguments of OP
generalized projection p over a selection s2 over a set of               and mi is an algebraic expression involving measures
joins c: s 1ps 2c. The generalized projection operator,                  in Mj-1; M0 denotes the set of measures in FT.
p P,M(R), is an extension of duplicate eliminating                ·      PredMj (j=0,...h) is a conjunction/disjunction of
projection, where P denotes an aggregation pattern, i.e.,                Boolean predicates expressed on measures in Mj;
the set of group-by attributes, and M denotes a set of            ·      Pj (j=1,...h) is an aggregation pattern.
aggregate operators applied to the attributes in R. Thus,         ·      PredA is a conjunction/disjunction of Boolean
GPSJ expressions extend select-joins expressions with                    predicates expressed on dimension attributes of
aggregation grouping and group selection.                                DT 1,...DT n.
    Nesting GPSJ expressions means using the result from
an expression as the input for another; it adds expressive            We wish to emphasize that, in this paper, we are not
power to GPSJ expressions since it allows sequences of            interested in determining optimal execution plans.
aggregate operators to be used on a measure. For instance,        Expressions are written in normal form for the sake of
the queries in Section 1 can be formulated as follows:            clarity and in order to simplify their management in the
                                                                  algorithm for building the query view graph; the order in
q 1: pPName,WAVG(P,C),MAX(Q)(sQ>1000                              which operators appear within the expressions does not
   ( pPName,SName,Q=SUM(Qty),P=AVG(Prc),C=COUNT(*)                reflect the execution plans that will be used to calculate
   ( SALE >< PRODUCT >< STORE)))                                  them. For instance, pushing all the joins below the other
q2: pPType,TMonth,AVG(Prc),SUM(Qty×Prc)(sPCategory=ÕFoodstuffsÕ   operators would obviously be inefficient in several cases,
   ( SALE >< PRODUCT >< TIME))                                    and pushing some projections/selections down might be
q3: pTYear,PName,SUM(Q),AVG(R)                                    highly preferable.
   ( pTYear,TMonth,PName,Q=SUM(Qty),R=SUM(Qty×Prc)                    For the same reason, we will assume for simplicity
   ( sPType=ÕBevÕ(SALE >< PRODUCT >< TIME)))                      that the fact table is always joined to all the dimension
                                                                  tables (though in some queries, depending on the
where WAVG(m,w) computes the weighted average of                  aggregation pattern required, it may be possible to omit
measure m based on the weights w. The SQL formulation             one or more of them; see q1, q2 and q3 for instance). As to
of q1 is the following:                                           aggregation patterns, we will write them in a concise non-
                                                                  redundant form by dropping all the attributes functionally
                                                                  determined by other attributes in the pattern. For instance,
SELECT PN,SUM(P*C)/SUM(C),MAX(Q)
FROM ( SELECT PRODUCT.PName AS PN,
                                                                  q3 can be rewritten in concise form as
         STORE.SName AS SN,SUM(SALE.Qty) AS Q,
         AVG(SALE.Prc) AS P,COUNT(*) AS C
                                                                         pTYear,PName,SUM(Q),AVG(R)(pTMonth,PName,Q=SUM(Qty),R=SUM(Qty×Prc)
       FROM SALE,PRODUCT,STORE                                         ( sPType=ÕBevÕ(SALE >< PRODUCT >< TIME)))
       WHERE <...join conditions...>
       GROUP BY PRODUCT.PName,STORE.SName )
                                                                  by dropping attribute TYear from the intermediate pattern
WHERE Q>1000
                                                                  since TMonth®TYear.
GROUP BY PN                                                            1
                                                                         While in general moving a selection on a measure in different positions
                                                                  of an expression affects the expression semantics, selections on dimensional
                                                                  attributes can be placed indifferently in any position outside the joins. Placing
    The expressions above are in normal form [Gup95],             them in the innermost possible position makes the materialization algorithm
i.e.:                                                             simpler since no push-downs or pull-ups are required to compare couples of
                                                                  expressions.


M. Golfarelli, S. Rizzi                                                                                                                     10-3
    Definition 1. Given two NGPSJ expressions e i, ej                      determining if a conjunction of aggregation constraints is
    defined on star scheme S, we say that ej can be derived                feasible, and [Cab99], which presents a theoretical
    from ei (ei³ej) if, by applying a sequence of generalized              framework for studying aggregations from a declarative
    projections and selections to e i, it is possible to                   and operational point of view.
    obtain a NGPSJ expression which, put in normal
    form, is equivalent to ej.                                             3.1 Support Measures
                                                                           Query q can be solved on a view v if the NGPSJ
Derivability holds when every measure returned in ej can
                                                                           expression representing q can be derived from the one
be calculated from those returned in e i. In evaluating
                                                                           which defines v (v³q). If a distributive operator is used,
derivability, three factors must be taken into account:
                                                                           aggregation can be equivalently carried out by steps; thus,
aggregation patterns, selections on both dimension
attributes and measures, and the aggregation operators
                                                                             pTYear,SUM(Qty)(SALE >< STORE >< PRODUCT >< TIME)
applied to measures. For instance, expression
                                                                           can be derived from
      pPType,TYear,SUM(Q)(sQ>2000(pPName,TMonth,Q=SUM(Qty)
    ( sPCategory=ÕFoodstuffsÕ Ù TYear>Õ1997Õ                               R = pTMonth,Q=SUM(Qty)
    ( SALE >< PRODUCT >< TIME))))                                             ( SALE >< STORE >< PRODUCT >< TIME)

can be derived from                                                        by computing pTYear,SUM(Q)(R).
                                                                              On the other hand, if an algebraic operator is used for
R = sQ>1000(pPName,TMonth,Q=SUM(Qty)                                       aggregation, derivability holds only if the necessary
   ( SALE >< PRODUCT >< TIME))                                             support measures are added to the view. For instance, in
                                                                           order to calculate the average of a set of elemental values
by computing                                                               from a set of partial aggregates of those values, the
                                                                           cardinality of each single partial aggregate must be
pPType,TYear,SUM(Q)(sPCategory=ÕFoodstuffsÕ Ù TYear>Õ1997Õ Ù Q>2000(R)).   known. Thus,
On the other hand, expression                                                pTYear,AVG(Qty)(SALE >< STORE >< PRODUCT >< TIME)
      sQ>1000(pPName,TMonth,Q=SUM(Qty)                                     can be derived from
    ( sTDay>ÕFeb15,97Õ Ù Prc>10(SALE >< PRODUCT >< TIME)))
                                                                           R = pTMonth,Q=AVG(Qty),C=COUNT(*)
cannot be derived from R since it specifies additional                        ( SALE >< STORE >< PRODUCT >< TIME)
selections on a dimension attribute which did not appear
in R due to aggregation (TDay) and on a measure which                      by computing pTYear,WAVG(Q,C)(R). Obviously, computing
in R is not available at the granularity required (Prc).                   pTYear,AVG(Q)(R) leads to a different result.
                                                                              Finally, for a holistic operator, aggregation can never
3 Aggregate Operators                                                      be carried out by steps since the number of necessary
                                                                           support measures is not bounded.
An aggregate operator is a function mapping a multiset of
values into a single value. An attempt to define rules for                 3.2 Aggregation Sequences
aggregation of measures is presented in [Len97] and
[Sho96]: measures are classified into flow, stock and                      Aggregation sequences increase the query language
value-per-unit, and a table defines when each class of                     expressivity in two ways: by allowing two or more
measures can be aggregated depending on the operators                      different aggregate operators to be applied in sequence on a
used and on the type of hierarchy (temporal, non-                          measure, and by allowing selections to be applied on
temporal). A classification of aggregation functions which                 partially aggregated measures.
is relevant to our approach is presented in [Gra95]:                          Consider the following query:

·    Distributive, that allows aggregates to be computed                              pP2,m2=OP2(m1)(pP1,m1=OP1(m)(pP0,m(...)))
     directly from partial aggregates.
·    Algebraic, that require additional information                        where OP1¹OP2. In general, there is no way of computing
     (support measures) to calculate aggregates from                       m2 directly from m by coalescing OP1 and OP2. Besides,
     partial aggregates.                                                   the aggregation sequence determines the query result: in
·    Holistic, that do not allow aggregates to be computed                 fact, swapping OP1 and OP2 in q would change the
     from partial aggregates through a finite number of                    semantics of m2; the semantics also depends on the
     support measures.                                                     intermediate pattern P 1, which determines the partial
                                                                           aggregates on which OP2 operates.
Other works demonstrating the importance of aggregation                       As to the use of selections, consider the query
operators are [Ros98b], which studies the problem of
                                                                                          pP1,m1=OP(m)(spredm(pP0,m(...)))


M. Golfarelli, S. Rizzi                                                                                                           10-4
                                                                          pPName,MAX(Q1),(sQ1>1000(pPName,SName,Q1=SUM(Q),(v12)))
                                                                           WAVG(P1,C1)                      P1=WAVG(P,C),C1=SUM(C)


                                                               q1                           q2
      pPName,MAX(Q)(sQ>1000(pPName,SName,Q=SUM(Qty),(v0)))                                       pPType,TMonth,AVG(Prc),(sPCategory='Foodstuffs'(v0))
       WAVG(P,C)               P=AVG(Prc),C=COUNT(*)                                              SUM(Qty.Prc)

                                                                             v12
                         pPName,SName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0)                     pPType,TMonth,SUM(R),(sPCategory='Foodstuffs'(v12))
                           C=COUNT(*),R =SUM(Qty.Prc)                                         WAVG(P,C)


                                                                                v0


                    Figure 1: QVG for queries q1 and q2 (v0 = SALE >< STORE >< PRODUCT >< TIME).

in which the selection predicate on m reduces the set of                    ·        contain a subset of the tuples at a given aggregation
partial aggregates on which OP operates. In general, both                            pattern as a consequence of selections on dimension
pushing down and pulling up the selection operator alter                             attributes and measures;
the query semantics.                                                        ·        contain only a subset of the measures in the base fact
   The different semantics which may be induced by                                   table as a result of projections;
aggregation sequences play a central role in determining                    ·        include measures obtained by applying different
derivability between NGPSJ expressions; as shown in                                  aggregation sequences to the same measure;
Section 5, this may lead to a dramatic increase in the                      ·        include derived measures and support measures
number of candidate views.                                                           necessary to support queries based on algebraic
                                                                                     operators.
4 Views
                                                                            The widened definition of views given in our approach
View materialization consists in selecting, among all the                   may lead to proliferation of measures; very large tables
possible views, those that optimize the response to the                     may entail poor performance. In order to avoid this, it
workload under both a disk space and a maintenance cost                     may be convenient to adopt vertical partitioning
constraint; this can be formulated as a knapsack problem,                   techniques capable of splitting views into smaller
and is obviously NP-hard. Most methods in the literature                    fragments aimed at optimizing performance for a given
aim at determining the subset of views that could be                        workload [Gol00].
useful with reference to the workload (candidate views),                       The candidate views and the relationships between
then adopt heuristic algorithms to find sub-optimal                         them can be represented in a graph:
solutions [Gup97, Gup99]. In [Aki99], this is done with
reference to distributive aggregate functions; neither                               Definition 3. The query view graph (QVG) for
derived/support measures nor nesting of GPSJ queries are                             workload Q is the directed acyclic graph G=(V,A) such
considered. Within this paper we focus on determining                                that:
candidate views; the exponential dimension of the space of                            · V={v0,...vp}, where v0 is the base fact table joined
the possible views makes this step more complex and                                        with the dimension tables and v1,...vp are the
crucial than that of selecting the candidate views to be                                   candidate views.
materialized.                                                                         · Each arc aijÎA denotes that vi³vj. Only the
                                                                                           elemental      derivability  relationships     are
   Definition 2. Given a star scheme S and a workload                                      represented in A; those which can be obtained by
   Q={q1,...qm} on S, a candidate view is a fact table                                     transitivity are omitted.
   defined by a NGPSJ expression e on S such that:
   ($ qiÎQ | e = qi) Ú ($ qi, qj ÎQ | (e ³ qi) Ù (e ³ qj) Ù                 The QVG for a given workload is necessarily unique,
                       / | (e ³ eÕ) Ù (eÕ ³ qi) Ù (eÕ ³ qj)))
                     ( $eÕ                                                  since the set of candidate views is univocally determined
                                                                            by the set of queries2. Figure 1 represents the QVG for
   In other words, a candidate view is a fact table which                   queries q1 and q2 in Section 1; to help the reader, also the
either includes exactly the tuples returned by one query or                 expressions which allow one view to be computed from
is the ÔcoarsestÕ table including all the tuples allowing                   another are shown (in dashed call-outs).
two or more queries to be solved. In [Bar97] the authors
prove that candidate views defined as above are the only
relevant views for materialization, i.e., that materializing
                                                                                2
non-candidate views may never decrease the workload                               Actually, since some aggregation operators can be expressed in terms of
                                                                            others (e.g., AVG can be computed from SUM and COUNT), some candidate
cost. In our approach, candidate views may:                                 views can be rewritten using different operators; nevertheless, the QVG is still
                                                                            unique since the set of queries which can be computed on each candidate view
                                                                            does not change.


M. Golfarelli, S. Rizzi                                                                                                                                 10-5
                                          v                            vcurr               v                    vcurr
             v      vcurr

                                          vcurr                        v                                vt

             Case A                 Case B                       Case C                           Case D
                               Figure 2: Different relationships between couples of views.

5 Building the Query View Graph                                 Merge(S,v,vcurr):
The QVG for star scheme S and workload Q={q1,...qm} is          { Drel=Æ; // vertices in relationship D with v
built incrementally by merging each query with the QVG            for each v currÎS do
according to the expression which defines it:                     { inserted=FALSE;
                                                                    if ExploreSubGraph(v,vcurr,NULL,inserted)
                                                                    // case A holds
QVGBuilder(Q,v0):
                                                                      then return;
{ V={v0,q1};
                                                                    if not inserted then DrelÈ={vcurr};
  A={(v0,q1)};
                                                                  }
  QVG=(V,A);
                                                                  if (Drel¹Æ) then             // case D holds
  for each qiÎQ, i>1 do
                                                                    for each vjÎDrel do
  { S={vjÎV| Child(vj)=Æ};
                                                                    { vt=Ancestor(vj,v);
    // S set of the leaves in QVG;
                                                                      VÈ={vt};
    // function Child(vj)
                                                                      AÈ={(vt,v)};
    // returns {vkÎV|(vj,vk)ÎA}
                                                                      SÕ={vk|vjÎChild(vk)};
    VÈ={qi};
                                                                      AÈ={(vt,vj)};
    Merge(S,qi,NULL);
                                                                      Merge(SÕ,vt,vj);         // merge vt
  }
                                                                    }
}
                                                                }

   Merging rules are aimed at determining if and how a
query can be answered on a view belonging to the QVG.                          Figure 3: The Merge algorithm.
In general, the four possible relationships between two
candidate views v and vcurr as entailed by Definition 1 are     Boolean ExploreSubGraph(v,vcurr,vsucc,inserted):
presented in Figure 2. In case A, v and vcurr are equivalent;   // returns TRUE if v is already present
in cases B and C, one of the views can be derived from          // among the predecessors of vcurr
the other; in case D no derivation relationship can be          { Case Compare(v,vcurr) of:
established. In case D, vt is defined as follows:                 { ÔAÕ: { V-={v};
                                                                           for each vjÎChild(v) do
   Definition 4. Given two candidate views vi and vj,                      // v already exists as vcurr
   the ancestor of vi and vj is the candidate view vt such                 { A-={(v,vj)}; AÈ={(vcurr,vj)}; }
   that (1) both vi and vj can be derived from vt and (2)                  RemoveRedundantArcs();
   for each other candidate view vtÕ which satisfies (1), it               return TRUE;
   is vtÕ³vt (in the worst case, it is vt º v0).                         }
                                                                    ÔBÕ: { AÈ={(vcurr,v)};       // vcurr³v
    The pseudo-code for the Merge algorithm is shown in                    if vsucc¹NULL then A-={(vcurr,vsucc)};
Figure 3. The algorithm finds the correct position for                     inserted=TRUE;
inserting v in the QVG, starting from its leaves and                     }
descending towards its root v0. If v is already present (case       ÔCÕ: { AÈ={(v,vcurr)};       // v³vcurr
A), no additional views must be inserted and Merge                         if vsucc¹NULL then A-={(v,vsucc)};
terminates. Otherwise, if while descending a path a                        for each vj|vcurrÎChild(vj) do
derivability relationship (case B) with v is found, the                      if ExploreSubGraph(v,vj,vcurr)
correct insertion position on that path has been reached; if                  then return TRUE;
a D relationship (no derivability) among v and some view                 }
vcurr is found, the path is abandoned and vcurr is inserted         ÔDÕ:           // no derivability relationship
into a set Drel to be further examined. At the end, for each      }
view in Drel, a new candidate view vt is determined by            return FALSE;
function Ancestor and recursively merged into the QVG.          }
    The ExploreSubGraph function carries out a depth-
first exploration of the sub-graph including all the
predecessors of vcurr; its pseudo-code is shown in Figure 4.          Figure 4: The ExploreSubGraph function.


M. Golfarelli, S. Rizzi                                                                                                 10-6
           pPType,TMonth,SUM(R),(sPCategory='Foodstuffs'(v12))                         p TYear,PName, (pTMonth,PName,Q1=SUM(Q),(sPType='Bev'(v13)))
             WAVG(P,C)                                                                   AVG(R1),SUM(Q1)      R1=SUM(R)


      q1                      q2                                                  q1                           q2              q3


               v12                                                                            v12                   v13
                     pPName,SName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0)
                      C=COUNT(*),R =SUM(Qty.Prc)                                                          pPName,SName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0)
                                                                                                           C=COUNT(*),R =SUM(Qty.Prc)

      v0                                                                                       v0
                            pPName,MAX(Q1),(sQ1>1000(pPName,SName,Q1=SUM(Q),(v12)))
                              WAVG(P1,C1)                P1=WAVG(P,C),C1=SUM(C)


            p TYear,PName,(sPType='Bev'(v23))                                                 pPType,TMonth,SUM(R),(sPCategory='Foodstuffs'(v23))
                                                                                                WAVG(P,C)
              AVG(R),SUM(Q)


                                                                       q1                           q2                        q3
       q1             q2                     q3

                                                                                                                 v23
                                   v23
                      v12             pPName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0)
                                         C=COUNT(*),R =SUM(Qty.Prc)                                 v12
                                                                                                            pPName,TMonth,SUM(Q),WAVG(P,C),(v12)
                                                                                                              SUM(C),SUM(R)

                      v0                                                                            v0


                                         Figure 5: Construction of the QVG for queries q1, q2, q3.

   In Figure 5, the steps of the QVGBuilder algorithm for                        search space is pruned by neglecting all the vertices for
the queries proposed in Section 1 are shown. View v13                            which a derivability relationship with the new vertex to
does not appear in the final QVG since it is equivalent to                       be positioned can be inferred from the graph structure.
v12 (case A). As to v23, the search is restricted to arc
(q2,v12) since a derivability relationship (case B) is found                     6 Experimental Tests
and the other can be inferred.
   If we restrict to considering only the aggregation                            We tested our approach on the TPC-D benchmark; the
patterns, the QVG resulting from the algorithm                                   workload includes four standard TPC-D queries and three
degenerates to the MDred Lattice presented in [Bar97].                           additional NGPSJ queries. In order to demonstrate the
With reference to that approach, ours has a lower                                effectiveness of our approach, we compared the results
computation complexity (in terms of comparisons                                  obtained by applying the same materialization heuristics
between expressions) since the derivability relationships                        [Bar97] to our QVG and to the lattice defined in [Bar97]
expressed by the QVG are actively used to restrict the                           (considering the same global space constraint). The cost
search to the sub-graph where the new vertex can be                              function adopted expresses the total number of disk pages
inserted. In fact, the complexity of Merge for adding the                        which must be accessed in order to solve each query,
(i+1)-th query to the QVG reaches O(2i) only in the worst                        taking both the view size and the query selectivity into
case, in which (1) the QVG already contains 2 i views and                        account. The results are summarized in Figure 6.
(2) all the candidate views solving both qi+1 and each                               The low execution cost for the QVG-based approach is
element in the power set of the i queries in the QVG must                        mainly due to the fact that, since NPSGJ views can
be created, i.e., relationship D holds for all the views in                      express the nesting of aggregation operators, queries can
the QVG. In the best case the QVG degenerates into a                             be executed on views aggregated at a ÒhigherÓ pattern.
path, and the complexity drops to O(1). In the other                             Besides, since NPSGJ views are more closely tailored on
cases, the complexity depends on the specific                                    queries than classical views, more NPSGJ views typically
characteristics of the queries and on the order in which                         fit the same space constraint. These considerations are
they are merged. The algorithm is efficient since the                            further supported by observing how, in Figure 6,


M. Golfarelli, S. Rizzi                                                                                                                                    10-7
materializing the maximum number of useful NPSGJ                           [Bar97]   E. Baralis, S. Paraboschi and E. Teniente.
views (7, one for each query in the workload) requires                               Materialized      view     selection    in
only 1.1 GB, and entails a lower cost than materializing                             multidimensional database. Proc. 23rd Int.
the same number of classical views.                                                  Conf. on Very Large Data Bases, Athens,
                                                                                     Greece, pp. 156-165, 1997.
7 Conclusions                                                              [Cab99]   L. Cabibbo and R. Torlone. A framework for
In this paper we have presented an original approach to                              the investigation of aggregate functions in
view materialization in which NGPSJ expressions are                                  database queries. Proc. Int. Conf. on Database
used to model both queries and candidate views. An                                   Theory, Jerusalem, Israel, 1999.
algorithm for selecting the minimal set of candidate views                 [Coh99]   S. Cohen, W. Nutt and A. Serebrenik.
by incrementally inserting queries into a directed acyclic                           Algorithms for rewriting aggregate queries
graph has been proposed; the algorithm is efficient since,                           using views. Proc. Int. Workshop on Design
every time a query is inserted, only a portion of the graph                          and Management of Data Warehouses,
is visited.                                                                          Heidelberg, Germany, 1999.
    Our approach relies on the existence of a method for
comparing two relational algebra expressions and                           [Gol00]   M. Golfarelli, D. Maio and S. Rizzi. Applying
determining their ancestor [Nut98]. The procedure for                                Vertical Fragmentation Techniques in Logical
comparing two NGPSJ expressions can be seen as an                                    Design of Multidimensional Databases. To
                                                                                                         nd
iteration of the comparison between two GPSJ                                         appear in Proc. 2 Int. Conf. on Data
expressions, which was proposed in [Gup95].                                          Warehousing and Knowledge Discovery,
Comparison proceeds from inside the two nested                                       Greenwich, 2000.
expressions, considering at each step two units having the                 [Gra95]   J. Gray, A. Bosworth, A. Lyman and H.
form p(s(R)); while in [Gup95] R denotes a set of joins,                             Pirahesh. Data-Cube: a relational aggregation
here it denotes the result of the NGPSJ expressions                                  operator generalizing group-by, cross-tab and
analyzed up to the previous step. Since the problem of                               sub-totals. Technical Report MSR-TR-95-22,
comparing two GPSJ expression is undecidable [Ros98b],                               Microsoft Research, 1995.
the ancestor determined will sometimes be sub-optimal.
Our future work will focus on devising a comparison                        [Gup95]   A. Gupta, V. Harinarayan and D. Quass.
algorithm specifically oriented to NGPSJ expressions, by                             Aggregate-query    processing  in   data-
taking their peculiarities into account.                                             warehousing environments. Proc. 21st Int.
                                                                                     Conf. on Very Large Data Bases, Zurich,
                                                                                     Switzerland, 1995.
                                              QVG       LAT
                                                                           [Gup97]   H. Gupta. Selection of views to materialize in a
                      300                                                            data warehouse. Proc. Int. Conf. on Database
                                                                                     Theory, Delphi, Greece, pp. 98-112, 1997.
                      250 4
 thousands of pages


                                  5
                                                                           [Gup99]   H. Gupta and I.S. Mumick. Selection of views
                      200                                                            to materialize under a maintenance cost
                                        5                                            constraint. Proc. Int. Conf. on Database
                      150                      6
                             5                                                       Theory, Jerusalem, Israel, 1999.
                                                    6    6
                      100         6                            6     7
                                                                           [Kim96] R. Kimball. The data warehouse toolkit. John
                                                                                   Wiley & Sons, 1996.
                       50               6      7    7     7    7     7
                                                                           [Len97]   H.J. Lenz and A. Shoshani. Summarizability
                       0
                                                                                     in OLAP and statistical databases. Proc.
                            0.8   0.9   1.0   1.1 1.2    1.3   1.4   1.5
                                                                                     Statistical  and    Scientific  Database
                                                 GB
                                                                                     Management, Washington, 1997.

Figure 6: Execution costs for the QVG-based and the                        [Nut98]   W. Nutt, Y. Sagiv and S. Shurin. Deciding
lattice-based approaches, on a sample workload, in                                   equivalences among aggregate queries. Proc.
function of the global space constraint. Besides each value                          17th Symposium on Principles of Database
point, the number of views materialized.                                             Systems, 1998.
                                                                           [Ros98]   K. A. Ross, D. Srivastava, and D.
                                                                                     Chatziantoniou. Complex aggregation at
References                                                                           multiple granularities. Proc. Int. Conf. on
                                                                                     Extending Database Technology, 1998.
[Aki99]                       M. Akinde and M. Bšhlen. Constructing
                              GPSJ view graphs. Proc. Int. Workshop on     [Ros98b] K. A. Ross, D. Srivastava, P. J. Stuckey and
                              Design and Management of Data Warehouses,             S. Sudarshan. Foundations of aggregation
                              Heidelberg, Germany, 1999.


M. Golfarelli, S. Rizzi                                                                                                         10-8
          constraints. Theoretical Computer Science,
          193(1-2):149-179, 1998.
[Sho96]   A. Shoshani. OLAP and statistical databases:
          similarities and differences. Proc. Int. Conf. on
          Information and Knowledge Management,
          Rockville, Maryland, 1996.
[The97]   D. Theodoratos and T. Sellis. Data warehouse
          configuration. Proc. 23rd Int. Conf. on Very
          Large Data Bases, Athens, Greece, pp. 126-
          135, 1997.
[Yan95]   W.P. Yan and P. Larson. Eager and lazy
          aggregation. Proc. 21st Int. Conf. on Very
          Large Data Bases, Zurich, Switzerland, pp.
          345-357, 1995.


M. Golfarelli, S. Rizzi                                       10-9