<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>View Materialization for Nested GPSJ Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Golfarelli</string-name>
          <email>mgolfarelli@deis.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Rizzi</string-name>
          <email>srizzi@deis.unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DEIS Ð University of Bologna</institution>
          ,
          <addr-line>Viale Risorgimento 2, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DEIS, CSITE CNR Ð University of Bologna</institution>
          ,
          <addr-line>Viale Risorgimento 2, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1998</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>View materialization is a central issue in logical design of data warehouses since it is one of the most powerful techniques to improve the response to the workload. Most approaches in the literature only focus on the aggregation patterns required by the queries in the workload; in this paper we propose an original approach to materialization in which the workload is characterized by the presence of complex queries which cannot be effectively described only by their aggregation pattern. In particular, we consider queries represented by nested GPSJ (Generalized Projection / Selection / Join) expressions, in which sequences of aggregate operators may be applied to measures and selection predicates may be formulated, at different granularities, on both dimensions and measures. Other specific issues taken into account are related to the need for materializing derived measures as well as support measures to make algebraic operators distributive. Based on this query model, an efficient algorithm to determine a restricted set of candidate views for materialization, to be fed into an optimization algorithm, is proposed. Finally, the effectiveness of our approach is discussed with reference to a sample workload.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>A data warehouse (DW) is a repository which provides
an integrated environment for decision support. From a
logical point of view, a DW can be seen as a
multidimensional cube where each cell contains
information useful for the decision process, i.e., a set of
numerical attributes called measures, while each axis
represents a possible dimension for analysis. For example,
in a DW modeling a chain store, time, product and store
The copyright of this paper belongs to the paperÕs authors. Permission to
copy without fee all or part of this material is granted provided that the
copies are not made or distributed for direct commercial advantage.
Proceedings of the International Workshop on Design
and Management of Data Warehouses (DMDW'2000)
Stockholm, Sweden, June 5-6, 2000
(M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)
are typical dimensions while quantity sold and retail price
are measures.</p>
      <p>A DW implemented on a relational DBMS is usually
organized according to the so-called star scheme [Kim96],
composed by one fact table containing the measures and
one denormalized dimension table for each dimension of
analysis. The primary key of each dimension table is
imported into the fact table; the primary key of the fact
table is defined by the set of these foreign keys. Each
dimension table contains a set of dimension attributes
defining a hierarchy of aggregation levels for the
corresponding dimension.</p>
      <p>The basic mechanism to extract useful information
from elemental data in DWs is aggregation. In principle,
all queries could be solved by directly aggregating the
elemental data in the base fact table; for efficiency reasons,
a DW typically stores, besides the elemental values of
measures, also values summarized according to some
aggregation pattern, i.e., to a set of dimension attributes
defining the coarseness of aggregation. Even summarized
data are organized into a star scheme, whose fact table is
usually called a view; measure values are obtained by
applying an aggregation operator to the data in the base
fact table or in another view with a finer pattern. Of
course, the larger the number of queries a view can be
used to answer, the more effective materializing it [Coh99,
Yan95].</p>
      <p>Since pre-computing all the possible views is
unfeasible, several techniques to select an appropriate
subset of views to materialize have been proposed. While
most approaches only focus on the aggregation patterns
required by queries, very few works analyze how the
presence of different operators affects aggregation. On the
other hand, using only one operator (typically, the sum)
to aggregate measures when materializing views, makes
them useless for several queries.</p>
      <p>In this paper we propose an original approach to
materialization in which the workload is characterized by
the presence of complex queries which cannot be
effectively described only by their aggregation pattern. In
particular, we consider queries represented by Nested
Generalized Projection / Selection / Join (NGPSJ)
expressions, in which sequences of aggregate operators
may be applied to measures and selection predicates may
be defined, at different granularities, on both dimensions
and measures. As a result, the correspondence between
measures in the views and in the base fact table is not
necessarily one-to-one: some measures in the base fact
table may not be present at all in a view, while others
may correspond to several measures in the view;
furthermore, additional (either support or derived)
measures may be included in the view in order to
correctly compute aggregations. Particular emphasis is
given to the effects on materialization of the contemporary
presence of multiple aggregation operators in a query.</p>
      <p>Under these assumptions, an efficient algorithm to
determine the restricted set of views which could be
usefully materialized (candidate views) is proposed. The
algorithm builds a query view graph whose vertices
represent candidate views and whose arcs denote the
possibility of computing a view from another. The query
view graph may then be fed into an optimization
algorithm like the ones proposed in [Gup97, Gup99]
which select, from the set of the candidate views, the
subset which minimizes the response to the workload
under a given space constraint.</p>
      <p>After discussing NGPSJ expressions in Section 2 and
classifying aggregate operators in Section 3, in Section 4
we define how views will be structured. The algorithm for
determining candidate views is outlined in Section 5,
while Section 6 discusses the experimental results.</p>
      <sec id="sec-1-1">
        <title>1.1 Motivating Example</title>
        <p>The simple Sales star scheme used as a working example
is defined below:</p>
        <sec id="sec-1-1-1">
          <title>STORE (StoreId, SName, SCity) TIME (TimeId, TDay, TMonth, TYear) PRODUCT (ProductId, PName, PType, PCategory) SALE (TimeId, StoreId, ProductId, Qty, Prc)</title>
          <p>where the denormalized dimension tables leave out the
following hierarchies of functional dependencies:
SName®SCity, TDay®TMonth®TYear,</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>PName®PType®PCategory</title>
          <p>Tuples in dimension tables are typically identified by
surrogate keys; in the rest of the paper, for the sake of
simplicity, surrogate keys will never be considered,
assuming that tuples are identified by an attribute of the
hierarchy.</p>
          <p>As to the workload, consider the following queries on
the Sales star scheme:
q1: ÒFor each product, find the average selling price
and the maximum total quantity sold for the stores
which sold more then 1000 units of that productÓ
q2: ÒFor each month and for each type of food
product, find the average selling price and the total
revenueÓ
q3: ÒFor product of type beverage and for each year,
find the total quantity sold and the average of the
total monthly revenuesÓ</p>
          <p>Query q1 requires to first aggregate data on pattern
{SName, PName}, calculating the average price and the
sum of the quantities sold, then to select the stores for
which this sum is greater than 1000, finally to further
aggregate on {PName} calculating the maximum quantity
and the average price. Query q2 aggregates foodstuff sales
on {TMonth, PType}, calculating for each month the
average price and the sum of the derived measure R =
Qty´Prc. Query q3 requires to aggregate on {TMonth,
PName} calculating the monthly sums of the quantities
and that of the revenues, then to aggregate on {TYear,
PName} calculating the sums of the monthly quantities
and the averages of the monthly revenues.</p>
          <p>These examples give rise to some considerations:
1.
2.
3.
4.</p>
          <p>Different aggregate operators may be applied to the
same measure (e.g., SUM(Qty), MAX(Qty)).</p>
          <p>Sequences of aggregate operators may be applied to a
measure (e.g., the maximum of the sums).</p>
          <p>Some aggregations may require support measures in
order to be correctly calculated from partial aggregates
(e.g., the average operator requires the cardinality of
each partial aggregate).</p>
          <p>Calculating a derived measure from its component
measures may determine a wrong result if the query is
not solved directly on the base fact table (e.g.,
SUM(Qty)*AVG(Prc) ¹ SUM(Qty*Prc)).</p>
          <p>In all these cases, in order to take full advantage of
materialization, it may be necessary to include additional
measures in views. Besides:
5.</p>
          <p>The results of aggregation are affected by the
selections which operate on its source data.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>1.2 Related Literature</title>
        <p>Several works adopt graph-like structures to represent
views, but a few focus on how to build it. In [The97] an
algorithm to build a multiquery graph is outlined; the
approach deals with the problem of building base fact
tables from operational databases. Besides, since only
join, selection and classical projection operators are
involved, and no aggregation semantics is introduced, the
approach cannot be used to determine aggregate views.</p>
        <p>The approach which is most closely related to ours is
the one described in [Aki99]. That work is improved by
ours with reference to the class of queries considered
(GPSJ queries with distributive operators vs. NGPSJ
queries with distributive, algebraic and holistic operators)
and to the computational cost of the graph-building
algorithm. In [Aki99] the complexity is always
exponential since, for each query, all the possible
evaluation strategies are built and organized into a graph;
the query view graph is then obtained by applying a set of
rules that prune non-candidate views. The complexity of
our algorithm is exponential only in the worst case, to
become polynomial in the best case.</p>
        <p>Query nesting is considered in [Ros98], where
subqueries involve multiple dependent aggregates at
multiple granularities (multi-feature cubes). Though
interesting results concerning distributive and algebraic
cubes are provided, the queries considered are quite
different from NGPSJ queries, since no nesting of distinct
aggregation operators at different granularities is possible.
In principle, the workload for a DW is dynamic and
unpredictable. A possible approach to cope with this fact
consists in monitoring the actual workload while the DW
is operating. Otherwise, the designer may try to determine
a core workload a priori: in fact, on the one hand, the user
typically knows in advance which kind of data analysis
(s)he will carry out more often for decisional or statistical
purposes; on the other, a substantial amount of queries are
aimed at extracting summary data to fill standard reports.</p>
        <p>Basically, the approaches to view materialization differ
in the level of detail they adopt to model the queries in
the workload. Some approaches only consider the
aggregation patterns [Bar97], while others analyze the
query features in more detail [Gup97].</p>
        <p>Our approach falls in the second group; in particular,
we consider queries represented by nested GPSJ
expressions. A GPSJ (Generalized Projection / Selection /
Join) expression [Gup95] is a selection s1 over a
generalized projection p over a selection s2 over a set of
joins c: s1ps2c. The generalized projection operator,
pP,M(R), is an extension of duplicate eliminating
projection, where P denotes an aggregation pattern, i.e.,
the set of group-by attributes, and M denotes a set of
aggregate operators applied to the attributes in R. Thus,
GPSJ expressions extend select-joins expressions with
aggregation grouping and group selection.</p>
        <p>Nesting GPSJ expressions means using the result from
an expression as the input for another; it adds expressive
power to GPSJ expressions since it allows sequences of
aggregate operators to be used on a measure. For instance,
the queries in Section 1 can be formulated as follows:
q1:
q2:
q3:</p>
        <p>pPName,WAVG(P,C),MAX(Q)(sQ&gt;1000
( pPName,SName,Q=SUM(Qty),P=AVG(Prc),C=COUNT(*)
( SALE &gt;&lt;PRODUCT &gt;&lt;STORE)))</p>
        <p>pPType,TMonth,AVG(Prc),SUM(Qty×Prc)(sPCategory=ÕFoodstuffsÕ
( SALE &gt;&lt;PRODUCT &gt;&lt;TIME))</p>
        <p>pTYear,PName,SUM(Q),AVG(R)
( pTYear,TMonth,PName,Q=SUM(Qty),R=SUM(Qty×Prc)
( sPType=ÕBevÕ(SALE &gt;&lt;PRODUCT &gt;&lt;TIME)))
where WAVG(m,w) computes the weighted average of
measure m based on the weights w. The SQL formulation
of q1 is the following:
SELECT PN,SUM(P*C)/SUM(C),MAX(Q)
FROM ( SELECT PRODUCT.PName AS PN,</p>
        <p>STORE.SName AS SN,SUM(SALE.Qty) AS Q,</p>
        <p>AVG(SALE.Prc) AS P,COUNT(*) AS C
FROM SALE,PRODUCT,STORE
WHERE &lt;...join conditions...&gt;</p>
        <p>GROUP BY PRODUCT.PName,STORE.SName )
WHERE Q&gt;1000
GROUP BY PN</p>
        <p>The expressions above are in normal form [Gup95],
i.e.:
·
·
·
·
·
·
·
the necessary joins are pushed below all projections
and selections;
projection and selections are coalesced as much as
possible;
selections on dimensional attributes are pushed below
all projections and selections on measures1.</p>
        <p>In the rest of the paper we will consider NGPSJ
expressions in normal form applied to a star scheme S
with base fact table FT and dimension tables DT1,...DTn,
structured as follows:
e:</p>
        <p>sPredMh(pPh,Mh(... sPredM1(pP1,M1(sPredA Ù PredM0
( FT &gt;&lt;DT1 ... &gt;&lt;DTn)))...))
where:</p>
        <p>Mj (j=1,...h) is a set of aggregate measures, each
defined as OP(m1,ÉmmaxOP) where OP is an aggregate
operator, maxOP is the number of arguments of OP
and mi is an algebraic expression involving measures
in Mj-1; M0 denotes the set of measures in FT.</p>
        <p>PredMj (j=0,...h) is a conjunction/disjunction of
Boolean predicates expressed on measures in Mj;
Pj (j=1,...h) is an aggregation pattern.</p>
        <p>PredA is a conjunction/disjunction of Boolean
predicates expressed on dimension attributes of
DT1,...DTn.</p>
        <p>We wish to emphasize that, in this paper, we are not
interested in determining optimal execution plans.
Expressions are written in normal form for the sake of
clarity and in order to simplify their management in the
algorithm for building the query view graph; the order in
which operators appear within the expressions does not
reflect the execution plans that will be used to calculate
them. For instance, pushing all the joins below the other
operators would obviously be inefficient in several cases,
and pushing some projections/selections down might be
highly preferable.</p>
        <p>For the same reason, we will assume for simplicity
that the fact table is always joined to all the dimension
tables (though in some queries, depending on the
aggregation pattern required, it may be possible to omit
one or more of them; see q1, q2 and q3 for instance). As to
aggregation patterns, we will write them in a concise
nonredundant form by dropping all the attributes functionally
determined by other attributes in the pattern. For instance,
q3 can be rewritten in concise form as</p>
        <p>pTYear,PName,SUM(Q),AVG(R)(pTMonth,PName,Q=SUM(Qty),R=SUM(Qty×Prc)
( sPType=ÕBevÕ(SALE &gt;&lt;PRODUCT &gt;&lt;TIME)))
by dropping attribute TYear from the intermediate pattern
since TMonth®TYear.</p>
        <p>1 While in general moving a selection on a measure in different positions
of an expression affects the expression semantics, selections on dimensional
attributes can be placed indifferently in any position outside the joins. Placing
them in the innermost possible position makes the materialization algorithm
simpler since no push-downs or pull-ups are required to compare couples of
expressions.
Definition 1. Given two NGPSJ expressions ei, ej
defined on star scheme S, we say that ej can be derived
from ei (ei³ej) if, by applying a sequence of generalized
projections and selections to ei, it is possible to
obtain a NGPSJ expression which, put in normal
form, is equivalent to ej.</p>
        <p>Derivability holds when every measure returned in ej can
be calculated from those returned in ei. In evaluating
derivability, three factors must be taken into account:
aggregation patterns, selections on both dimension
attributes and measures, and the aggregation operators
applied to measures. For instance, expression</p>
        <p>pPType,TYear,SUM(Q)(sQ&gt;2000(pPName,TMonth,Q=SUM(Qty)
( sPCategory=ÕFoodstuffsÕ Ù TYear&gt;Õ1997Õ
( SALE &gt;&lt;PRODUCT &gt;&lt;TIME))))
can be derived from
R = sQ&gt;1000(pPName,TMonth,Q=SUM(Qty)</p>
        <p>( SALE &gt;&lt;PRODUCT &gt;&lt;TIME))
by computing
pPType,TYear,SUM(Q)(sPCategory=ÕFoodstuffsÕ Ù TYear&gt;Õ1997Õ Ù Q&gt;2000(R)).
On the other hand, expression</p>
        <p>sQ&gt;1000(pPName,TMonth,Q=SUM(Qty)
( sTDay&gt;ÕFeb15,97Õ Ù Prc&gt;10(SALE &gt;&lt;PRODUCT &gt;&lt;TIME)))
cannot be derived from R since it specifies additional
selections on a dimension attribute which did not appear
in R due to aggregation (TDay) and on a measure which
in R is not available at the granularity required (Prc).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Aggregate Operators</title>
      <p>An aggregate operator is a function mapping a multiset of
values into a single value. An attempt to define rules for
aggregation of measures is presented in [Len97] and
[Sho96]: measures are classified into flow, stock and
value-per-unit, and a table defines when each class of
measures can be aggregated depending on the operators
used and on the type of hierarchy (temporal,
nontemporal). A classification of aggregation functions which
is relevant to our approach is presented in [Gra95]:
·
·
·</p>
      <p>Distributive, that allows aggregates to be computed
directly from partial aggregates.</p>
      <p>Algebraic, that require additional information
(support measures) to calculate aggregates from
partial aggregates.</p>
      <p>Holistic, that do not allow aggregates to be computed
from partial aggregates through a finite number of
support measures.</p>
      <p>Other works demonstrating the importance of aggregation
operators are [Ros98b], which studies the problem of
determining if a conjunction of aggregation constraints is
feasible, and [Cab99], which presents a theoretical
framework for studying aggregations from a declarative
and operational point of view.</p>
      <sec id="sec-2-1">
        <title>3.1 Support Measures</title>
        <p>Query q can be solved on a view v if the NGPSJ
expression representing q can be derived from the one
which defines v (v³q). If a distributive operator is used,
aggregation can be equivalently carried out by steps; thus,
pTYear,SUM(Qty)(SALE &gt;&lt;STORE &gt;&lt;PRODUCT &gt;&lt;TIME)
can be derived from
R = pTMonth,Q=SUM(Qty)</p>
        <p>( SALE &gt;&lt;STORE &gt;&lt;PRODUCT &gt;&lt;TIME)
by computing pTYear,SUM(Q)(R).</p>
        <p>On the other hand, if an algebraic operator is used for
aggregation, derivability holds only if the necessary
support measures are added to the view. For instance, in
order to calculate the average of a set of elemental values
from a set of partial aggregates of those values, the
cardinality of each single partial aggregate must be
known. Thus,</p>
        <p>pTYear,AVG(Qty)(SALE &gt;&lt;STORE &gt;&lt;PRODUCT &gt;&lt;TIME)
can be derived from
R = pTMonth,Q=AVG(Qty),C=COUNT(*)</p>
        <p>( SALE &gt;&lt;STORE &gt;&lt;PRODUCT &gt;&lt;TIME)
by computing pTYear,WAVG(Q,C)(R). Obviously, computing
pTYear,AVG(Q)(R) leads to a different result.</p>
        <p>Finally, for a holistic operator, aggregation can never
be carried out by steps since the number of necessary
support measures is not bounded.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Aggregation Sequences</title>
        <p>Aggregation sequences increase the query language
expressivity in two ways: by allowing two or more
different aggregate operators to be applied in sequence on a
measure, and by allowing selections to be applied on
partially aggregated measures.</p>
        <p>Consider the following query:</p>
        <p>pP2,m2=OP2(m1)(pP1,m1=OP1(m)(pP0,m(...)))
where OP1¹OP2. In general, there is no way of computing
m2 directly from m by coalescing OP1 and OP2. Besides,
the aggregation sequence determines the query result: in
fact, swapping OP1 and OP2 in q would change the
semantics of m2; the semantics also depends on the
intermediate pattern P1, which determines the partial
aggregates on which OP2 operates.</p>
        <p>As to the use of selections, consider the query
pP1,m1=OP(m)(spredm(pP0,m(...)))
q1
q2
pPName,MAX(Q)(sQ&gt;1000(pPName,SName,Q=SUM(Qty),(v0)))</p>
        <p>WAVG(P,C) P=AVG(Prc),C=COUNT(*)
pPType,TMonth,AVG(Prc),(sPCategory='Foodstuffs'(v0))</p>
        <p>SUM(Qty.Prc)
pPName,SName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0)</p>
        <p>C=COUNT(*),R =SUM(Qty.Prc)
in which the selection predicate on m reduces the set of
partial aggregates on which OP operates. In general, both
pushing down and pulling up the selection operator alter
the query semantics.</p>
        <p>The different semantics which may be induced by
aggregation sequences play a central role in determining
derivability between NGPSJ expressions; as shown in
Section 5, this may lead to a dramatic increase in the
number of candidate views.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Views</title>
      <p>View materialization consists in selecting, among all the
possible views, those that optimize the response to the
workload under both a disk space and a maintenance cost
constraint; this can be formulated as a knapsack problem,
and is obviously NP-hard. Most methods in the literature
aim at determining the subset of views that could be
useful with reference to the workload (candidate views),
then adopt heuristic algorithms to find sub-optimal
solutions [Gup97, Gup99]. In [Aki99], this is done with
reference to distributive aggregate functions; neither
derived/support measures nor nesting of GPSJ queries are
considered. Within this paper we focus on determining
candidate views; the exponential dimension of the space of
the possible views makes this step more complex and
crucial than that of selecting the candidate views to be
materialized.</p>
      <p>Definition 2. Given a star scheme S and a workload
Q={q1,...qm} on S, a candidate view is a fact table
defined by a NGPSJ expression e on S such that:
($ qiÎQ | e = qi) Ú ($ qi, qj ÎQ | (e ³ qi) Ù (e ³ qj) Ù
( $/ eÕ | (e ³ eÕ) Ù (eÕ ³ qi) Ù (eÕ ³ qj)))</p>
      <p>In other words, a candidate view is a fact table which
either includes exactly the tuples returned by one query or
is the ÔcoarsestÕ table including all the tuples allowing
two or more queries to be solved. In [Bar97] the authors
prove that candidate views defined as above are the only
relevant views for materialization, i.e., that materializing
non-candidate views may never decrease the workload
cost. In our approach, candidate views may:
v12
v0
·
·
·
·
contain a subset of the tuples at a given aggregation
pattern as a consequence of selections on dimension
attributes and measures;
contain only a subset of the measures in the base fact
table as a result of projections;
include measures obtained by applying different
aggregation sequences to the same measure;
include derived measures and support measures
necessary to support queries based on algebraic
operators.</p>
      <p>The widened definition of views given in our approach
may lead to proliferation of measures; very large tables
may entail poor performance. In order to avoid this, it
may be convenient to adopt vertical partitioning
techniques capable of splitting views into smaller
fragments aimed at optimizing performance for a given
workload [Gol00].</p>
      <p>The candidate views and the relationships between
them can be represented in a graph:
·
Definition 3. The query view graph (QVG) for
workload Q is the directed acyclic graph G=(V,A) such
that:
·</p>
      <p>V={v0,...vp}, where v0 is the base fact table joined
with the dimension tables and v1,...vp are the
candidate views.</p>
      <p>Each arc aijÎA denotes that vi³vj. Only the
elemental derivability relationships are
represented in A; those which can be obtained by
transitivity are omitted.</p>
      <p>The QVG for a given workload is necessarily unique,
since the set of candidate views is univocally determined
by the set of queries2. Figure 1 represents the QVG for
queries q1 and q2 in Section 1; to help the reader, also the
expressions which allow one view to be computed from
another are shown (in dashed call-outs).</p>
      <p>2 Actually, since some aggregation operators can be expressed in terms of
others (e.g., AVG can be computed from SUM and COUNT), some candidate
views can be rewritten using different operators; nevertheless, the QVG is still
unique since the set of queries which can be computed on each candidate view
does not change.
10-5
v</p>
      <p>vcurr
Case A
vcurr
vcurr
v
v</p>
      <p>vcurr
vt
Case D
Case B</p>
      <p>Case C
5 Building the Query View Graph
The QVG for star scheme S and workload Q={q1,...qm} is
built incrementally by merging each query with the QVG
according to the expression which defines it:
QVGBuilder(Q,v0):
{ V={v0,q1};</p>
      <p>A={(v0,q1)};
QVG=(V,A);
for each qiÎQ, i&gt;1 do
{ S={vjÎV| Child(vj)=Æ};
// S set of the leaves in QVG;
// function Child(vj)
// returns {vkÎV|(vj,vk)ÎA}
VÈ={qi};</p>
      <p>Merge(S,qi,NULL);
}
}</p>
      <p>Merging rules are aimed at determining if and how a
query can be answered on a view belonging to the QVG.
In general, the four possible relationships between two
candidate views v and vcurr as entailed by Definition 1 are
presented in Figure 2. In case A, v and vcurr are equivalent;
in cases B and C, one of the views can be derived from
the other; in case D no derivation relationship can be
established. In case D, vt is defined as follows:</p>
      <p>Definition 4. Given two candidate views vi and vj,
the ancestor of vi and vj is the candidate view vt such
that (1) both vi and vj can be derived from vt and (2)
for each other candidate view vtÕ which satisfies (1), it
is vtÕ³vt (in the worst case, it is vt º v0).</p>
      <p>The pseudo-code for the Merge algorithm is shown in
Figure 3. The algorithm finds the correct position for
inserting v in the QVG, starting from its leaves and
descending towards its root v0. If v is already present (case
A), no additional views must be inserted and Merge
terminates. Otherwise, if while descending a path a
derivability relationship (case B) with v is found, the
correct insertion position on that path has been reached; if
a D relationship (no derivability) among v and some view
vcurr is found, the path is abandoned and vcurr is inserted
into a set Drel to be further examined. At the end, for each
view in Drel, a new candidate view vt is determined by
function Ancestor and recursively merged into the QVG.</p>
      <p>The ExploreSubGraph function carries out a
depthfirst exploration of the sub-graph including all the
predecessors of vcurr; its pseudo-code is shown in Figure 4.
Merge(S,v,vcurr):
{ Drel=Æ; // vertices in relationship D with v
for each vcurrÎS do
{ inserted=FALSE;
if ExploreSubGraph(v,vcurr,NULL,inserted)
// case A holds</p>
      <p>then return;
if not inserted then DrelÈ={vcurr};
}
if (Drel¹Æ) then
for each vjÎDrel do
{ vt=Ancestor(vj,v);</p>
      <p>VÈ={vt};
AÈ={(vt,v)};
SÕ={vk|vjÎChild(vk)};
AÈ={(vt,vj)};</p>
      <p>Merge(SÕ,vt,vj);
}
}
// case D holds
// merge vt
pPName,MAX(Q1),(sQ1&gt;1000(pPName,SName,Q1=SUM(Q),(v12)))</p>
      <p>WAVG(P1,C1) P1=WAVG(P,C),C1=SUM(C)
q3
q1
v23
pPName,TMonth,Q=SUM(Qty),P=AVG(Prc),(v0)</p>
      <p>C=COUNT(*),R =SUM(Qty.Prc)
p TYear,PName, (pTMonth,PName,Q1=SUM(Q),(sPType='Bev'(v13)))
AVG(R1),SUM(Q1) R1=SUM(R)
p TYear,PName,(sPType='Bev'(v23))</p>
      <p>AVG(R),SUM(Q)
pPType,TMonth,SUM(R),(sPCategory='Foodstuffs'(v23))</p>
      <p>WAVG(P,C)
pPName,TMonth,SUM(Q),WAVG(P,C),(v12)</p>
      <p>SUM(C),SUM(R)</p>
      <p>In Figure 5, the steps of the QVGBuilder algorithm for
the queries proposed in Section 1 are shown. View v13
does not appear in the final QVG since it is equivalent to
v12 (case A). As to v23, the search is restricted to arc
(q2,v12) since a derivability relationship (case B) is found
and the other can be inferred.</p>
      <p>If we restrict to considering only the aggregation
patterns, the QVG resulting from the algorithm
degenerates to the MDred Lattice presented in [Bar97].
With reference to that approach, ours has a lower
computation complexity (in terms of comparisons
between expressions) since the derivability relationships
expressed by the QVG are actively used to restrict the
search to the sub-graph where the new vertex can be
inserted. In fact, the complexity of Merge for adding the
(i+1)-th query to the QVG reaches O(2i) only in the worst
case, in which (1) the QVG already contains 2i views and
(2) all the candidate views solving both qi+1 and each
element in the power set of the i queries in the QVG must
be created, i.e., relationship D holds for all the views in
the QVG. In the best case the QVG degenerates into a
path, and the complexity drops to O(1). In the other
cases, the complexity depends on the specific
characteristics of the queries and on the order in which
they are merged. The algorithm is efficient since the
search space is pruned by neglecting all the vertices for
which a derivability relationship with the new vertex to
be positioned can be inferred from the graph structure.</p>
    </sec>
    <sec id="sec-4">
      <title>6 Experimental Tests</title>
      <p>We tested our approach on the TPC-D benchmark; the
workload includes four standard TPC-D queries and three
additional NGPSJ queries. In order to demonstrate the
effectiveness of our approach, we compared the results
obtained by applying the same materialization heuristics
[Bar97] to our QVG and to the lattice defined in [Bar97]
(considering the same global space constraint). The cost
function adopted expresses the total number of disk pages
which must be accessed in order to solve each query,
taking both the view size and the query selectivity into
account. The results are summarized in Figure 6.</p>
      <p>The low execution cost for the QVG-based approach is
mainly due to the fact that, since NPSGJ views can
express the nesting of aggregation operators, queries can
be executed on views aggregated at a ÒhigherÓ pattern.
Besides, since NPSGJ views are more closely tailored on
queries than classical views, more NPSGJ views typically
fit the same space constraint. These considerations are
further supported by observing how, in Figure 6,
10-7
materializing the maximum number of useful NPSGJ
views (7, one for each query in the workload) requires
only 1.1 GB, and entails a lower cost than materializing
the same number of classical views.</p>
    </sec>
    <sec id="sec-5">
      <title>7 Conclusions</title>
      <p>In this paper we have presented an original approach to
view materialization in which NGPSJ expressions are
used to model both queries and candidate views. An
algorithm for selecting the minimal set of candidate views
by incrementally inserting queries into a directed acyclic
graph has been proposed; the algorithm is efficient since,
every time a query is inserted, only a portion of the graph
is visited.</p>
      <p>Our approach relies on the existence of a method for
comparing two relational algebra expressions and
determining their ancestor [Nut98]. The procedure for
comparing two NGPSJ expressions can be seen as an
iteration of the comparison between two GPSJ
expressions, which was proposed in [Gup95].
Comparison proceeds from inside the two nested
expressions, considering at each step two units having the
form p(s(R)); while in [Gup95] R denotes a set of joins,
here it denotes the result of the NGPSJ expressions
analyzed up to the previous step. Since the problem of
comparing two GPSJ expression is undecidable [Ros98b],
the ancestor determined will sometimes be sub-optimal.
Our future work will focus on devising a comparison
algorithm specifically oriented to NGPSJ expressions, by
taking their peculiarities into account.</p>
      <p>QVG</p>
      <p>LAT
300
s250 4
e
g
a
fp200
o
snd150 5
a
su100
o
h
t 50
0
5
6
5
6
6
7</p>
      <p>GB
6
7
6
7
6
7
7
7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
[Aki99]</p>
      <p>M. Akinde and M. Bhlen. Constructing
GPSJ view graphs. Proc. Int. Workshop on
Design and Management of Data Warehouses,
Heidelberg, Germany, 1999.</p>
      <p>E. Baralis, S. Paraboschi and E. Teniente.
Materialized view selection in
multidimensional database. Proc. 23rd Int.
Conf. on Very Large Data Bases, Athens,
Greece, pp. 156-165, 1997.</p>
      <p>L. Cabibbo and R. Torlone. A framework for
the investigation of aggregate functions in
database queries. Proc. Int. Conf. on Database
Theory, Jerusalem, Israel, 1999.
[Coh99] S. Cohen, W. Nutt and A. Serebrenik.</p>
      <p>Algorithms for rewriting aggregate queries
using views. Proc. Int. Workshop on Design
and Management of Data Warehouses,
Heidelberg, Germany, 1999.
[Bar97]
[Cab99]
[Gol00]
[Gra95]
[Gup95]
[Gup97]
[Gup99]
[Len97]
[Nut98]
[Ros98]</p>
      <p>M. Golfarelli, D. Maio and S. Rizzi. Applying
Vertical Fragmentation Techniques in Logical
Design of Multidimensional Databases. T o
appear in Proc. 2nd Int. Conf. on Data
Warehousing and Knowledge Discovery,
Greenwich, 2000.</p>
      <p>J. Gray, A. Bosworth, A. Lyman and H.
Pirahesh. Data-Cube: a relational aggregation
operator generalizing group-by, cross-tab and
sub-totals. Technical Report MSR-TR-95-22,
Microsoft Research, 1995.</p>
      <p>
        A. Gupta, V. Harinarayan and D. Quass.
Aggregate-query processing in
datawarehousing environments. Proc. 21st Int.
Conf. on Very Large Data Bases, Zurich,
S
        <xref ref-type="bibr" rid="ref1">witzerland, 1995</xref>
        .
      </p>
      <p>H. Gupta. Selection of views to materialize in a
data warehouse. Proc. Int. Conf. on Database
Theory, Delphi, Greece, pp. 98-112, 1997.</p>
      <p>H. Gupta and I.S. Mumick. Selection of views
to materialize under a maintenance cost
constraint. Proc. Int. Conf. on Database
Theory, Jerusalem, Israel, 1999.</p>
      <p>H.J. Lenz and A. Shoshani. Summarizability
in OLAP and statistical databases. Proc.
Statistical and Scientific Database
Management, Washington, 1997.</p>
      <p>W. Nutt, Y. Sagiv and S. Shurin. Deciding
equivalences among aggregate queries. Proc.
17th Symposium on Principles of Database
Systems, 1998.</p>
      <p>K. A. Ross, D. Srivastava, and D.
Chatziantoniou. Complex aggregation at
multiple granularities. Proc. Int. Conf. on
Extending Database Technology, 1998.
[Kim96] R. Kimball. The data warehouse toolkit. John</p>
      <p>Wiley &amp; Sons, 1996.
[Ros98b] K. A. Ross, D. Srivastava, P. J. Stuckey and
S. Sudarshan. Foundations of aggregation
[Sho96]
[The97]
[Yan95]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>W.P.</given-names>
            <surname>Yan</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Eager and lazy aggregation</article-title>
          .
          <source>Proc. 21st Int. Conf. on Very Large Data Bases</source>
          , Zurich, Switzerland, pp.
          <fpage>345</fpage>
          -
          <lpage>357</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>