View Materialization for Nested GPSJ Queries Matteo Golfarelli Stefano Rizzi DEIS Ð University of Bologna DEIS, CSITE CNR Ð University of Bologna Viale Risorgimento 2, Bologna, Italy Viale Risorgimento 2, Bologna, Italy mgolfarelli@deis.unibo.it srizzi@deis.unibo.it are typical dimensions while quantity sold and retail price are measures. Abstract A DW implemented on a relational DBMS is usually organized according to the so-called star scheme [Kim96], View materialization is a central issue in logical composed by one fact table containing the measures and design of data warehouses since it is one of the one denormalized dimension table for each dimension of most powerful techniques to improve the analysis. The primary key of each dimension table is response to the workload. Most approaches in the imported into the fact table; the primary key of the fact literature only focus on the aggregation patterns table is defined by the set of these foreign keys. Each required by the queries in the workload; in this dimension table contains a set of dimension attributes paper we propose an original approach to defining a hierarchy of aggregation levels for the materialization in which the workload is corresponding dimension. characterized by the presence of complex queries The basic mechanism to extract useful information which cannot be effectively described only by from elemental data in DWs is aggregation. In principle, their aggregation pattern. In particular, we all queries could be solved by directly aggregating the consider queries represented by nested GPSJ elemental data in the base fact table; for efficiency reasons, (Generalized Projection / Selection / Join) a DW typically stores, besides the elemental values of expressions, in which sequences of aggregate measures, also values summarized according to some operators may be applied to measures and aggregation pattern, i.e., to a set of dimension attributes selection predicates may be formulated, at defining the coarseness of aggregation. Even summarized different granularities, on both dimensions and data are organized into a star scheme, whose fact table is measures. Other specific issues taken into account usually called a view; measure values are obtained by are related to the need for materializing derived applying an aggregation operator to the data in the base measures as well as support measures to make fact table or in another view with a finer pattern. Of algebraic operators distributive. Based on this course, the larger the number of queries a view can be query model, an efficient algorithm to determine used to answer, the more effective materializing it [Coh99, a restricted set of candidate views for Yan95]. materialization, to be fed into an optimization Since pre-computing all the possible views is algorithm, is proposed. Finally, the effectiveness unfeasible, several techniques to select an appropriate of our approach is discussed with reference to a subset of views to materialize have been proposed. While sample workload. most approaches only focus on the aggregation patterns required by queries, very few works analyze how the 1 Introduction presence of different operators affects aggregation. On the other hand, using only one operator (typically, the sum) A data warehouse (DW) is a repository which provides to aggregate measures when materializing views, makes an integrated environment for decision support. From a them useless for several queries. logical point of view, a DW can be seen as a In this paper we propose an original approach to multidimensional cube where each cell contains materialization in which the workload is characterized by information useful for the decision process, i.e., a set of the presence of complex queries which cannot be numerical attributes called measures, while each axis effectively described only by their aggregation pattern. In represents a possible dimension for analysis. For example, particular, we consider queries represented by Nested in a DW modeling a chain store, time, product and store Generalized Projection / Selection / Join (NGPSJ) expressions, in which sequences of aggregate operators The copyright of this paper belongs to the paperÕs authors. Permission to may be applied to measures and selection predicates may copy without fee all or part of this material is granted provided that the be defined, at different granularities, on both dimensions copies are not made or distributed for direct commercial advantage. and measures. As a result, the correspondence between Proceedings of the International Workshop on Design measures in the views and in the base fact table is not and Management of Data Warehouses (DMDW'2000) necessarily one-to-one: some measures in the base fact Stockholm, Sweden, June 5-6, 2000 table may not be present at all in a view, while others (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.) may correspond to several measures in the view; furthermore, additional (either support or derived) http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/ M. Golfarelli, S. Rizzi 10-1 measures may be included in the view in order to aggregate on {PName} calculating the maximum quantity correctly compute aggregations. Particular emphasis is and the average price. Query q2 aggregates foodstuff sales given to the effects on materialization of the contemporary on {TMonth, PType}, calculating for each month the presence of multiple aggregation operators in a query. average price and the sum of the derived measure R = Under these assumptions, an efficient algorithm to Qty ´ Prc . Query q3 requires to aggregate on {TMonth, determine the restricted set of views which could be PName} calculating the monthly sums of the quantities usefully materialized (candidate views) is proposed. The and that of the revenues, then to aggregate on {TYear, algorithm builds a query view graph whose vertices PName} calculating the sums of the monthly quantities represent candidate views and whose arcs denote the and the averages of the monthly revenues. possibility of computing a view from another. The query These examples give rise to some considerations: view graph may then be fed into an optimization algorithm like the ones proposed in [Gup97, Gup99] 1. Different aggregate operators may be applied to the which select, from the set of the candidate views, the same measure (e.g., SUM(Qty), MAX(Qty)). subset which minimizes the response to the workload 2. Sequences of aggregate operators may be applied to a under a given space constraint. measure (e.g., the maximum of the sums). After discussing NGPSJ expressions in Section 2 and 3. Some aggregations may require support measures in classifying aggregate operators in Section 3, in Section 4 order to be correctly calculated from partial aggregates we define how views will be structured. The algorithm for (e.g., the average operator requires the cardinality of determining candidate views is outlined in Section 5, each partial aggregate). while Section 6 discusses the experimental results. 4. Calculating a derived measure from its component measures may determine a wrong result if the query is 1.1 Motivating Example not solved directly on the base fact table (e.g., SUM(Qty)*AVG(Prc) ¹ SUM(Qty*Prc)). The simple Sales star scheme used as a working example is defined below: In all these cases, in order to take full advantage of materialization, it may be necessary to include additional STORE (StoreId, SName, SCity) measures in views. Besides: TIME (TimeId, TDay, TMonth, TYear) PRODUCT (ProductId, PName, PType, PCategory) 5. The results of aggregation are affected by the SALE (TimeId, StoreId, ProductId, Qty, Prc) selections which operate on its source data. where the denormalized dimension tables leave out the 1.2 Related Literature following hierarchies of functional dependencies: Several works adopt graph-like structures to represent views, but a few focus on how to build it. In [The97] an SName®SCity, TDay®TMonth®TYear, algorithm to build a multiquery graph is outlined; the PName®PType®PCategory approach deals with the problem of building base fact tables from operational databases. Besides, since only Tuples in dimension tables are typically identified by join, selection and classical projection operators are surrogate keys; in the rest of the paper, for the sake of involved, and no aggregation semantics is introduced, the simplicity, surrogate keys will never be considered, approach cannot be used to determine aggregate views. assuming that tuples are identified by an attribute of the The approach which is most closely related to ours is hierarchy. the one described in [Aki99]. That work is improved by As to the workload, consider the following queries on ours with reference to the class of queries considered the Sales star scheme: (GPSJ queries with distributive operators vs. NGPSJ queries with distributive, algebraic and holistic operators) q1: ÒFor each product, find the average selling price and to the computational cost of the graph-building and the maximum total quantity sold for the stores algorithm. In [Aki99] the complexity is always which sold more then 1000 units of that productÓ exponential since, for each query, all the possible q2: ÒFor each month and for each type of food evaluation strategies are built and organized into a graph; product, find the average selling price and the total the query view graph is then obtained by applying a set of revenueÓ rules that prune non-candidate views. The complexity of our algorithm is exponential only in the worst case, to q3: ÒFor product of type beverage and for each year, become polynomial in the best case. find the total quantity sold and the average of the Query nesting is considered in [Ros98], where total monthly revenuesÓ subqueries involve multiple dependent aggregates at multiple granularities (multi-feature cubes). Though Query q1 requires to first aggregate data on pattern interesting results concerning distributive and algebraic {SName, PName} , calculating the average price and the cubes are provided, the queries considered are quite sum of the quantities sold, then to select the stores for different from NGPSJ queries, since no nesting of distinct which this sum is greater than 1000, finally to further aggregation operators at different granularities is possible. M. Golfarelli, S. Rizzi 10-2 2 Query Modeling · the necessary joins are pushed below all projections and selections; In principle, the workload for a DW is dynamic and · projection and selections are coalesced as much as unpredictable. A possible approach to cope with this fact possible; consists in monitoring the actual workload while the DW · selections on dimensional attributes are pushed below 1 is operating. Otherwise, the designer may try to determine all projections and selections on measures . a core workload a priori: in fact, on the one hand, the user typically knows in advance which kind of data analysis In the rest of the paper we will consider NGPSJ (s)he will carry out more often for decisional or statistical expressions in normal form applied to a star scheme S purposes; on the other, a substantial amount of queries are with base fact table FT and dimension tables DT 1,...DT n, aimed at extracting summary data to fill standard reports. structured as follows: Basically, the approaches to view materialization differ in the level of detail they adopt to model the queries in e: sPredMh(pPh,Mh(... sPredM1(pP1,M1(sPredA Ù PredM0 the workload. Some approaches only consider the ( FT >< DT 1 ... >< DTn)))...)) aggregation patterns [Bar97], while others analyze the query features in more detail [Gup97]. where: Our approach falls in the second group; in particular, we consider queries represented by nested GPSJ · M j (j=1,...h) is a set of aggregate measures, each expressions. A GPSJ (Generalized Projection / Selection / defined as OP(m1,ÉmmaxOP) where OP is an aggregate Join) expression [Gup95] is a selection s1 over a operator, maxOP is the number of arguments of OP generalized projection p over a selection s2 over a set of and mi is an algebraic expression involving measures joins c: s 1ps 2c. The generalized projection operator, in Mj-1; M0 denotes the set of measures in FT. p P,M(R), is an extension of duplicate eliminating · PredMj (j=0,...h) is a conjunction/disjunction of projection, where P denotes an aggregation pattern, i.e., Boolean predicates expressed on measures in Mj; the set of group-by attributes, and M denotes a set of · Pj (j=1,...h) is an aggregation pattern. aggregate operators applied to the attributes in R. Thus, · PredA is a conjunction/disjunction of Boolean GPSJ expressions extend select-joins expressions with predicates expressed on dimension attributes of aggregation grouping and group selection. DT 1,...DT n. Nesting GPSJ expressions means using the result from an expression as the input for another; it adds expressive We wish to emphasize that, in this paper, we are not power to GPSJ expressions since it allows sequences of interested in determining optimal execution plans. aggregate operators to be used on a measure. For instance, Expressions are written in normal form for the sake of the queries in Section 1 can be formulated as follows: clarity and in order to simplify their management in the algorithm for building the query view graph; the order in q 1: pPName,WAVG(P,C),MAX(Q)(sQ>1000 which operators appear within the expressions does not ( pPName,SName,Q=SUM(Qty),P=AVG(Prc),C=COUNT(*) reflect the execution plans that will be used to calculate ( SALE >< PRODUCT >< STORE))) them. For instance, pushing all the joins below the other q2: pPType,TMonth,AVG(Prc),SUM(Qty×Prc)(sPCategory=ÕFoodstuffsÕ operators would obviously be inefficient in several cases, ( SALE >< PRODUCT >< TIME)) and pushing some projections/selections down might be q3: pTYear,PName,SUM(Q),AVG(R) highly preferable. ( pTYear,TMonth,PName,Q=SUM(Qty),R=SUM(Qty×Prc) For the same reason, we will assume for simplicity ( sPType=ÕBevÕ(SALE >< PRODUCT >< TIME))) that the fact table is always joined to all the dimension tables (though in some queries, depending on the where WAVG(m,w) computes the weighted average of aggregation pattern required, it may be possible to omit measure m based on the weights w. The SQL formulation one or more of them; see q1, q2 and q3 for instance). As to of q1 is the following: aggregation patterns, we will write them in a concise non- redundant form by dropping all the attributes functionally determined by other attributes in the pattern. For instance, SELECT PN,SUM(P*C)/SUM(C),MAX(Q) FROM ( SELECT PRODUCT.PName AS PN, q3 can be rewritten in concise form as STORE.SName AS SN,SUM(SALE.Qty) AS Q, AVG(SALE.Prc) AS P,COUNT(*) AS C pTYear,PName,SUM(Q),AVG(R)(pTMonth,PName,Q=SUM(Qty),R=SUM(Qty×Prc) FROM SALE,PRODUCT,STORE ( sPType=ÕBevÕ(SALE >< PRODUCT >< TIME))) WHERE <...join conditions...> GROUP BY PRODUCT.PName,STORE.SName ) by dropping attribute TYear from the intermediate pattern WHERE Q>1000 since TMonth®TYear. GROUP BY PN 1 While in general moving a selection on a measure in different positions of an expression affects the expression semantics, selections on dimensional attributes can be placed indifferently in any position outside the joins. Placing The expressions above are in normal form [Gup95], them in the innermost possible position makes the materialization algorithm i.e.: simpler since no push-downs or pull-ups are required to compare couples of expressions. M. Golfarelli, S. Rizzi 10-3 Definition 1. Given two NGPSJ expressions e i, ej determining if a conjunction of aggregation constraints is defined on star scheme S, we say that ej can be derived feasible, and [Cab99], which presents a theoretical from ei (ei³ej) if, by applying a sequence of generalized framework for studying aggregations from a declarative projections and selections to e i, it is possible to and operational point of view. obtain a NGPSJ expression which, put in normal form, is equivalent to ej. 3.1 Support Measures Query q can be solved on a view v if the NGPSJ Derivability holds when every measure returned in ej can expression representing q can be derived from the one be calculated from those returned in e i. In evaluating which defines v (v³q). If a distributive operator is used, derivability, three factors must be taken into account: aggregation can be equivalently carried out by steps; thus, aggregation patterns, selections on both dimension attributes and measures, and the aggregation operators pTYear,SUM(Qty)(SALE >< STORE >< PRODUCT >< TIME) applied to measures. For instance, expression can be derived from pPType,TYear,SUM(Q)(sQ>2000(pPName,TMonth,Q=SUM(Qty) ( sPCategory=ÕFoodstuffsÕ Ù TYear>Õ1997Õ R = pTMonth,Q=SUM(Qty) ( SALE >< PRODUCT >< TIME)))) ( SALE >< STORE >< PRODUCT >< TIME) can be derived from by computing pTYear,SUM(Q)(R). On the other hand, if an algebraic operator is used for R = sQ>1000(pPName,TMonth,Q=SUM(Qty) aggregation, derivability holds only if the necessary ( SALE >< PRODUCT >< TIME)) support measures are added to the view. For instance, in order to calculate the average of a set of elemental values by computing from a set of partial aggregates of those values, the cardinality of each single partial aggregate must be pPType,TYear,SUM(Q)(sPCategory=ÕFoodstuffsÕ Ù TYear>Õ1997Õ Ù Q>2000(R)). known. Thus, On the other hand, expression pTYear,AVG(Qty)(SALE >< STORE >< PRODUCT >< TIME) sQ>1000(pPName,TMonth,Q=SUM(Qty) can be derived from ( sTDay>ÕFeb15,97Õ Ù Prc>10(SALE >< PRODUCT >< TIME))) R = pTMonth,Q=AVG(Qty),C=COUNT(*) cannot be derived from R since it specifies additional ( SALE >< STORE >< PRODUCT >< TIME) selections on a dimension attribute which did not appear in R due to aggregation (TDay) and on a measure which by computing pTYear,WAVG(Q,C)(R). Obviously, computing in R is not available at the granularity required (Prc). pTYear,AVG(Q)(R) leads to a different result. Finally, for a holistic operator, aggregation can never 3 Aggregate Operators be carried out by steps since the number of necessary support measures is not bounded. An aggregate operator is a function mapping a multiset of values into a single value. An attempt to define rules for 3.2 Aggregation Sequences aggregation of measures is presented in [Len97] and [Sho96]: measures are classified into flow, stock and Aggregation sequences increase the query language value-per-unit, and a table defines when each class of expressivity in two ways: by allowing two or more measures can be aggregated depending on the operators different aggregate operators to be applied in sequence on a used and on the type of hierarchy (temporal, non- measure, and by allowing selections to be applied on temporal). A classification of aggregation functions which partially aggregated measures. is relevant to our approach is presented in [Gra95]: Consider the following query: · Distributive, that allows aggregates to be computed pP2,m2=OP2(m1)(pP1,m1=OP1(m)(pP0,m(...))) directly from partial aggregates. · Algebraic, that require additional information where OP1¹OP2. In general, there is no way of computing (support measures) to calculate aggregates from m2 directly from m by coalescing OP1 and OP2. Besides, partial aggregates. the aggregation sequence determines the query result: in · Holistic, that do not allow aggregates to be computed fact, swapping OP1 and OP2 in q would change the from partial aggregates through a finite number of semantics of m2; the semantics also depends on the support measures. intermediate pattern P 1, which determines the partial aggregates on which OP2 operates. Other works demonstrating the importance of aggregation As to the use of selections, consider the query operators are [Ros98b], which studies the problem of pP1,m1=OP(m)(spredm(pP0,m(...))) M. Golfarelli, S. Rizzi 10-4 pPName,MAX(Q1),(sQ1>1000(pPName,SName,Q1=SUM(Q),(v12))) WAVG(P1,C1) P1=WAVG(P,C),C1=SUM(C) q1 q2 pPName,MAX(Q)(sQ>1000(pPName,SName,Q=SUM(Qty),(v0))) pPType,TMonth,AVG(Prc),(sPCategory='Foodstuffs'(v0)) WAVG(P,C) P=AVG(Prc),C=COUNT(*) SUM(Qty.Prc) v12 pPName,SName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0) pPType,TMonth,SUM(R),(sPCategory='Foodstuffs'(v12)) C=COUNT(*),R =SUM(Qty.Prc) WAVG(P,C) v0 Figure 1: QVG for queries q1 and q2 (v0 = SALE >< STORE >< PRODUCT >< TIME). in which the selection predicate on m reduces the set of · contain a subset of the tuples at a given aggregation partial aggregates on which OP operates. In general, both pattern as a consequence of selections on dimension pushing down and pulling up the selection operator alter attributes and measures; the query semantics. · contain only a subset of the measures in the base fact The different semantics which may be induced by table as a result of projections; aggregation sequences play a central role in determining · include measures obtained by applying different derivability between NGPSJ expressions; as shown in aggregation sequences to the same measure; Section 5, this may lead to a dramatic increase in the · include derived measures and support measures number of candidate views. necessary to support queries based on algebraic operators. 4 Views The widened definition of views given in our approach View materialization consists in selecting, among all the may lead to proliferation of measures; very large tables possible views, those that optimize the response to the may entail poor performance. In order to avoid this, it workload under both a disk space and a maintenance cost may be convenient to adopt vertical partitioning constraint; this can be formulated as a knapsack problem, techniques capable of splitting views into smaller and is obviously NP-hard. Most methods in the literature fragments aimed at optimizing performance for a given aim at determining the subset of views that could be workload [Gol00]. useful with reference to the workload (candidate views), The candidate views and the relationships between then adopt heuristic algorithms to find sub-optimal them can be represented in a graph: solutions [Gup97, Gup99]. In [Aki99], this is done with reference to distributive aggregate functions; neither Definition 3. The query view graph (QVG) for derived/support measures nor nesting of GPSJ queries are workload Q is the directed acyclic graph G=(V,A) such considered. Within this paper we focus on determining that: candidate views; the exponential dimension of the space of · V={v0,...vp}, where v0 is the base fact table joined the possible views makes this step more complex and with the dimension tables and v1,...vp are the crucial than that of selecting the candidate views to be candidate views. materialized. · Each arc aijÎA denotes that vi³vj. Only the elemental derivability relationships are Definition 2. Given a star scheme S and a workload represented in A; those which can be obtained by Q={q1,...qm} on S, a candidate view is a fact table transitivity are omitted. defined by a NGPSJ expression e on S such that: ($ qiÎQ | e = qi) Ú ($ qi, qj ÎQ | (e ³ qi) Ù (e ³ qj) Ù The QVG for a given workload is necessarily unique, / | (e ³ eÕ) Ù (eÕ ³ qi) Ù (eÕ ³ qj))) ( $eÕ since the set of candidate views is univocally determined by the set of queries2. Figure 1 represents the QVG for In other words, a candidate view is a fact table which queries q1 and q2 in Section 1; to help the reader, also the either includes exactly the tuples returned by one query or expressions which allow one view to be computed from is the ÔcoarsestÕ table including all the tuples allowing another are shown (in dashed call-outs). two or more queries to be solved. In [Bar97] the authors prove that candidate views defined as above are the only relevant views for materialization, i.e., that materializing 2 non-candidate views may never decrease the workload Actually, since some aggregation operators can be expressed in terms of others (e.g., AVG can be computed from SUM and COUNT), some candidate cost. In our approach, candidate views may: views can be rewritten using different operators; nevertheless, the QVG is still unique since the set of queries which can be computed on each candidate view does not change. M. Golfarelli, S. Rizzi 10-5 v vcurr v vcurr v vcurr vcurr v vt Case A Case B Case C Case D Figure 2: Different relationships between couples of views. 5 Building the Query View Graph Merge(S,v,vcurr): The QVG for star scheme S and workload Q={q1,...qm} is { Drel=Æ; // vertices in relationship D with v built incrementally by merging each query with the QVG for each v currÎS do according to the expression which defines it: { inserted=FALSE; if ExploreSubGraph(v,vcurr,NULL,inserted) // case A holds QVGBuilder(Q,v0): then return; { V={v0,q1}; if not inserted then DrelÈ={vcurr}; A={(v0,q1)}; } QVG=(V,A); if (Drel¹Æ) then // case D holds for each qiÎQ, i>1 do for each vjÎDrel do { S={vjÎV| Child(vj)=Æ}; { vt=Ancestor(vj,v); // S set of the leaves in QVG; VÈ={vt}; // function Child(vj) AÈ={(vt,v)}; // returns {vkÎV|(vj,vk)ÎA} SÕ={vk|vjÎChild(vk)}; VÈ={qi}; AÈ={(vt,vj)}; Merge(S,qi,NULL); Merge(SÕ,vt,vj); // merge vt } } } } Merging rules are aimed at determining if and how a query can be answered on a view belonging to the QVG. Figure 3: The Merge algorithm. In general, the four possible relationships between two candidate views v and vcurr as entailed by Definition 1 are Boolean ExploreSubGraph(v,vcurr,vsucc,inserted): presented in Figure 2. In case A, v and vcurr are equivalent; // returns TRUE if v is already present in cases B and C, one of the views can be derived from // among the predecessors of vcurr the other; in case D no derivation relationship can be { Case Compare(v,vcurr) of: established. In case D, vt is defined as follows: { ÔAÕ: { V-={v}; for each vjÎChild(v) do Definition 4. Given two candidate views vi and vj, // v already exists as vcurr the ancestor of vi and vj is the candidate view vt such { A-={(v,vj)}; AÈ={(vcurr,vj)}; } that (1) both vi and vj can be derived from vt and (2) RemoveRedundantArcs(); for each other candidate view vtÕ which satisfies (1), it return TRUE; is vtÕ³vt (in the worst case, it is vt º v0). } ÔBÕ: { AÈ={(vcurr,v)}; // vcurr³v The pseudo-code for the Merge algorithm is shown in if vsucc¹NULL then A-={(vcurr,vsucc)}; Figure 3. The algorithm finds the correct position for inserted=TRUE; inserting v in the QVG, starting from its leaves and } descending towards its root v0. If v is already present (case ÔCÕ: { AÈ={(v,vcurr)}; // v³vcurr A), no additional views must be inserted and Merge if vsucc¹NULL then A-={(v,vsucc)}; terminates. Otherwise, if while descending a path a for each vj|vcurrÎChild(vj) do derivability relationship (case B) with v is found, the if ExploreSubGraph(v,vj,vcurr) correct insertion position on that path has been reached; if then return TRUE; a D relationship (no derivability) among v and some view } vcurr is found, the path is abandoned and vcurr is inserted ÔDÕ: // no derivability relationship into a set Drel to be further examined. At the end, for each } view in Drel, a new candidate view vt is determined by return FALSE; function Ancestor and recursively merged into the QVG. } The ExploreSubGraph function carries out a depth- first exploration of the sub-graph including all the predecessors of vcurr; its pseudo-code is shown in Figure 4. Figure 4: The ExploreSubGraph function. M. Golfarelli, S. Rizzi 10-6 pPType,TMonth,SUM(R),(sPCategory='Foodstuffs'(v12)) p TYear,PName, (pTMonth,PName,Q1=SUM(Q),(sPType='Bev'(v13))) WAVG(P,C) AVG(R1),SUM(Q1) R1=SUM(R) q1 q2 q1 q2 q3 v12 v12 v13 pPName,SName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0) C=COUNT(*),R =SUM(Qty.Prc) pPName,SName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0) C=COUNT(*),R =SUM(Qty.Prc) v0 v0 pPName,MAX(Q1),(sQ1>1000(pPName,SName,Q1=SUM(Q),(v12))) WAVG(P1,C1) P1=WAVG(P,C),C1=SUM(C) p TYear,PName,(sPType='Bev'(v23)) pPType,TMonth,SUM(R),(sPCategory='Foodstuffs'(v23)) WAVG(P,C) AVG(R),SUM(Q) q1 q2 q3 q1 q2 q3 v23 v23 v12 pPName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0) C=COUNT(*),R =SUM(Qty.Prc) v12 pPName,TMonth,SUM(Q),WAVG(P,C),(v12) SUM(C),SUM(R) v0 v0 Figure 5: Construction of the QVG for queries q1, q2, q3. In Figure 5, the steps of the QVGBuilder algorithm for search space is pruned by neglecting all the vertices for the queries proposed in Section 1 are shown. View v13 which a derivability relationship with the new vertex to does not appear in the final QVG since it is equivalent to be positioned can be inferred from the graph structure. v12 (case A). As to v23, the search is restricted to arc (q2,v12) since a derivability relationship (case B) is found 6 Experimental Tests and the other can be inferred. If we restrict to considering only the aggregation We tested our approach on the TPC-D benchmark; the patterns, the QVG resulting from the algorithm workload includes four standard TPC-D queries and three degenerates to the MDred Lattice presented in [Bar97]. additional NGPSJ queries. In order to demonstrate the With reference to that approach, ours has a lower effectiveness of our approach, we compared the results computation complexity (in terms of comparisons obtained by applying the same materialization heuristics between expressions) since the derivability relationships [Bar97] to our QVG and to the lattice defined in [Bar97] expressed by the QVG are actively used to restrict the (considering the same global space constraint). The cost search to the sub-graph where the new vertex can be function adopted expresses the total number of disk pages inserted. In fact, the complexity of Merge for adding the which must be accessed in order to solve each query, (i+1)-th query to the QVG reaches O(2i) only in the worst taking both the view size and the query selectivity into case, in which (1) the QVG already contains 2 i views and account. The results are summarized in Figure 6. (2) all the candidate views solving both qi+1 and each The low execution cost for the QVG-based approach is element in the power set of the i queries in the QVG must mainly due to the fact that, since NPSGJ views can be created, i.e., relationship D holds for all the views in express the nesting of aggregation operators, queries can the QVG. In the best case the QVG degenerates into a be executed on views aggregated at a ÒhigherÓ pattern. path, and the complexity drops to O(1). In the other Besides, since NPSGJ views are more closely tailored on cases, the complexity depends on the specific queries than classical views, more NPSGJ views typically characteristics of the queries and on the order in which fit the same space constraint. These considerations are they are merged. The algorithm is efficient since the further supported by observing how, in Figure 6, M. Golfarelli, S. Rizzi 10-7 materializing the maximum number of useful NPSGJ [Bar97] E. Baralis, S. Paraboschi and E. Teniente. views (7, one for each query in the workload) requires Materialized view selection in only 1.1 GB, and entails a lower cost than materializing multidimensional database. Proc. 23rd Int. the same number of classical views. Conf. on Very Large Data Bases, Athens, Greece, pp. 156-165, 1997. 7 Conclusions [Cab99] L. Cabibbo and R. Torlone. A framework for In this paper we have presented an original approach to the investigation of aggregate functions in view materialization in which NGPSJ expressions are database queries. Proc. Int. Conf. on Database used to model both queries and candidate views. An Theory, Jerusalem, Israel, 1999. algorithm for selecting the minimal set of candidate views [Coh99] S. Cohen, W. Nutt and A. Serebrenik. by incrementally inserting queries into a directed acyclic Algorithms for rewriting aggregate queries graph has been proposed; the algorithm is efficient since, using views. Proc. Int. Workshop on Design every time a query is inserted, only a portion of the graph and Management of Data Warehouses, is visited. Heidelberg, Germany, 1999. Our approach relies on the existence of a method for comparing two relational algebra expressions and [Gol00] M. Golfarelli, D. Maio and S. Rizzi. Applying determining their ancestor [Nut98]. The procedure for Vertical Fragmentation Techniques in Logical comparing two NGPSJ expressions can be seen as an Design of Multidimensional Databases. To nd iteration of the comparison between two GPSJ appear in Proc. 2 Int. Conf. on Data expressions, which was proposed in [Gup95]. Warehousing and Knowledge Discovery, Comparison proceeds from inside the two nested Greenwich, 2000. expressions, considering at each step two units having the [Gra95] J. Gray, A. Bosworth, A. Lyman and H. form p(s(R)); while in [Gup95] R denotes a set of joins, Pirahesh. Data-Cube: a relational aggregation here it denotes the result of the NGPSJ expressions operator generalizing group-by, cross-tab and analyzed up to the previous step. Since the problem of sub-totals. Technical Report MSR-TR-95-22, comparing two GPSJ expression is undecidable [Ros98b], Microsoft Research, 1995. the ancestor determined will sometimes be sub-optimal. Our future work will focus on devising a comparison [Gup95] A. Gupta, V. Harinarayan and D. Quass. algorithm specifically oriented to NGPSJ expressions, by Aggregate-query processing in data- taking their peculiarities into account. warehousing environments. Proc. 21st Int. Conf. on Very Large Data Bases, Zurich, Switzerland, 1995. QVG LAT [Gup97] H. Gupta. Selection of views to materialize in a 300 data warehouse. Proc. Int. Conf. on Database Theory, Delphi, Greece, pp. 98-112, 1997. 250 4 thousands of pages 5 [Gup99] H. Gupta and I.S. Mumick. Selection of views 200 to materialize under a maintenance cost 5 constraint. Proc. Int. Conf. on Database 150 6 5 Theory, Jerusalem, Israel, 1999. 6 6 100 6 6 7 [Kim96] R. Kimball. The data warehouse toolkit. John Wiley & Sons, 1996. 50 6 7 7 7 7 7 [Len97] H.J. Lenz and A. Shoshani. Summarizability 0 in OLAP and statistical databases. Proc. 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 Statistical and Scientific Database GB Management, Washington, 1997. Figure 6: Execution costs for the QVG-based and the [Nut98] W. Nutt, Y. Sagiv and S. Shurin. Deciding lattice-based approaches, on a sample workload, in equivalences among aggregate queries. Proc. function of the global space constraint. Besides each value 17th Symposium on Principles of Database point, the number of views materialized. Systems, 1998. [Ros98] K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at References multiple granularities. Proc. Int. Conf. on Extending Database Technology, 1998. [Aki99] M. Akinde and M. Bšhlen. Constructing GPSJ view graphs. Proc. Int. Workshop on [Ros98b] K. A. Ross, D. Srivastava, P. J. Stuckey and Design and Management of Data Warehouses, S. Sudarshan. Foundations of aggregation Heidelberg, Germany, 1999. M. Golfarelli, S. Rizzi 10-8 constraints. Theoretical Computer Science, 193(1-2):149-179, 1998. [Sho96] A. Shoshani. OLAP and statistical databases: similarities and differences. Proc. Int. Conf. on Information and Knowledge Management, Rockville, Maryland, 1996. [The97] D. Theodoratos and T. Sellis. Data warehouse configuration. Proc. 23rd Int. Conf. on Very Large Data Bases, Athens, Greece, pp. 126- 135, 1997. [Yan95] W.P. Yan and P. Larson. Eager and lazy aggregation. Proc. 21st Int. Conf. on Very Large Data Bases, Zurich, Switzerland, pp. 345-357, 1995. M. Golfarelli, S. Rizzi 10-9