<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael O. Akinde</string-name>
          <email>strategy@cs.auc.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael H. B o¨hlen</string-name>
          <email>boehlen@cs.auc.dk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Aalborg University</institution>
          ,
          <addr-line>Fredrik Bajers Vej 7 E-1, DK-9220 Aalborg Øst</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Aalborg University</institution>
          ,
          <addr-line>Fredrik Bajers Vej 7 E-1, DK-9220 Aalborg Øst</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <fpage>8</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>A data warehouse collects and maintains integrated information from heterogeneous data sources for OLAP and decision support. An important task in data warehouse design is the selection of views to materialize, in order to minimize the response time and maintenance cost of generalized project-select-join (GPSJ) queries. We discuss how to construct GPSJ view graphs. GPSJ view graphs are directed acyclic graphs, used to compactly encode and represent different possible ways of evaluating a set of GPSJ queries. Our view graph construction algorithm, GPSJVIEWGRAPHBUILDER, incrementally constructs GPSJ view graphs based on a set of merge rules. We provide a set of merging rules to construct GPSJ view graphs in the presence of duplicate sensitive and insensitive aggregates. The merging algorithm used in GPSJVIEWGRAPHBUILDER ensures that each node is correctly added to the view graph, and employs the merge rules to ensure that relationships between nodes from different queries are incorporated into the view graph.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The copyright of this paper belongs to the paper’s authors. Permission to
copy without fee all or part of this material is granted provided that the
copies are not made or distributed for direct commercial advantage.</p>
    </sec>
    <sec id="sec-2">
      <title>Proceedings of the International Workshop on Design and</title>
    </sec>
    <sec id="sec-3">
      <title>Management of Data Warehouses (DMDW’99)</title>
      <p>Heidelberg, Germany, 14. - 15.6. 1999
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/
1</p>
      <sec id="sec-3-1">
        <title>Introduction</title>
        <p>A data warehouse is a repository of integrated
information from multiple, independent data sources available for
querying and analysis. As data warehouses contain
integrated information, often spanning long periods of time,
they tend to be orders of magnitude larger than
conventional operational databases; ranging from hundreds of
gigabytes to terabytes in size. The workload is typically
query-intensive, with many complex queries that may
access millions of records and perform many joins and
aggregates.</p>
        <p>Three costs must be balanced during physical database
design for warehouses: (1) the cost of answering queries,
(2) the cost of maintaining the warehouse, and (3) the cost
of secondary storage. The cost of (1) can be reduced by
materializing (precomputing) frequently asked queries as
materialized views in the data warehouse, but this increases
the maintenance costs of the warehouse. The problem of
selecting an appropriate set of views and indexes to
materialize in a data warehouse is referred to as the
viewselection [Gup97] or data warehouse configuration
problem [TS97].</p>
        <p>For the purpose of our discussion, we use the
following terminology (precise definitions follow later). A
GPSJ query is a generalized project-select-join query,
i.e., a project-select-join query extended with aggregation,
grouping, and group selection. This class of queries is
the single most important one used in data warehousing
[Kim96]. A GPSJ query graph is a directed acyclic graph.</p>
        <p>It represents a specific strategy to evaluate a GPSJ query.</p>
        <p>A GPSJ expression DAG compactly encodes different
possibilities to evaluate a GPSJ query. It combines multiple
GPSJ query graphs into a single graph. Finally, a GPSJ
view graph encodes multiple evaluation strategies for
different queries. Put differently, a GPSJ view graph
integrates several GPSJ expression DAGs.</p>
        <p>In this paper, we concentrate on the construction of</p>
        <p>GPSJ view graphs. The integrator takes the set of GPSJ
GPSJ Queries
Q1, Q2,, . . . , Qn
accounts(a personid, a name, a balance,</p>
        <p>a type, a date)
expression DAGs as input, and merges them into a GPSJ
view graph according to a set of merging rules. The
integrator and its embedding into the data warehouse
configuration tool are illustrated in Figure 1. Given a set of queries
for the data warehouse and metadata (such as query
frequencies, schema information, etc.), the tool returns a data
warehouse configuration.
: : : G(X) Q1; Q2; Qn, the GPSJ view graph is
incremenmerge rules used in the GPSJVIEWGRAPHBUILDER
algorithm.</p>
        <p>Specifically, given a set of data warehouse queries,
tally constructed from the set of associated GPSJ
expresG(A1) Figure 2 gives two simple expression DAGs, and
G(X) them to derive the GPSJ view graph displayed on the</p>
        <p>G(A2), for these two queries. Equivalence nodes,
representing views which can be materialized in the data
warehouse, are denoted in the figures using ovals and a label.</p>
        <p>Operation nodes are denoted using rectangles. The
integrator takes these expression DAGs as input, and merges
left. Note that during the merging process, additional
operation nodes and edges may be derived, such as the node
and edges connecting A0 to A1.
: : : ; sion DAGs, G(Q1); G(Q2); G(Qn) by the algorithm
using a set of merging rules. We give a set of basic rules to
integrate GPSJ expression DAGs into GPSJ view graphs.</p>
        <p>To the best of our knowledge, the rules and algorithms for
merging such graphs in the presence of aggregation and
grouping have not previously been considered in the
literature.</p>
        <p>In order to ensure an optimal solution to the
viewselection problem, it is necessary to generate the entire
GPSJ view graph. However, the complexity (space and
time) of algorithms for solving the view-selection problem
optimally is exponential in the number of nodes in the
expression graph [TS97, Gup97, GM99]. Although we do not
directly represent the conventional (duplicate-preserving)
projection in the GPSJ view graphs of this framework, as
they can be discarded when they appear as interior nodes
in the graph [RSS96, YKL97] and can be expressed using
a generalized projection when appearing as the outermost
nodes [GHQ95], the complete GPSJ view graph in the
presence of aggregation and grouping alone can still be huge.</p>
        <p>Therefore, we provide a number of rules which can be used
to reduce the size of the view graph.</p>
        <p>The problem of constructing view graphs such as those
described in this paper is most often considered in relation
to view-selection algorithms or physical database design.</p>
        <p>Roussopolous [Rou82] presents an algorithm for
generating LAP schemas, which closely resemble our view graphs,
however, his algorithm does not consider aggregate
functions and grouping.</p>
        <p>Other papers on view-selection employing view-graph
like structures [RSS96, TS97, GM99] either do not
consider the construction of these structures, or consider only
select-join views without aggregation.</p>
        <p>Gupta [Gup97] suggests that the AND-OR view graph
be constructed using the expression AND-OR DAGs of the
queries. These expression AND-OR DAGs are to contain
only those views which will be considered (useful) for the
view selection algorithm. However, it is unclear how to
determine these “useful” views in the presence of
aggregation, and therefore how to construct the AND-OR view
graph. This is the problem considered in this paper.</p>
        <p>A major part of the integrator is the
GPSJVIEWGRAPHBUILDER algorithm. The algorithm starts with a (possibly
empty) GPSJ view graph and incrementally builds up the
“final” graph by integrating GPSJ expression DAGs into
it. The behavior of the algorithm is controlled by a set of
merging rules, which are also part of the integrator. This
flexibility allows us to extend the GPSJ view graph
framework to consider more general view graphs and various
graph input types, simply by adding or modifying the set of
1.2</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Paper Outline</title>
      <p>The paper is structured as follows. Section 2 describes the
aggregation framework and notation used in this paper, and
gives rules for recomputing aggregates using their disjoint
sets. Section 3 defines the GPSJ view graphs, and gives the
8-2
A1</p>
      <p>πa_personid, a_name [sum(a_balance*count(*)]
graphical notation used to illustrate the graphs. Section 4
gives rules and the algorithm for merging expression DAGs
for GPSJ queries. In Section 5 we give some rules for
reducing the size of the GPSJ view graph. Section 6
concludes the paper and points to future research directions.</p>
      <p>A2</p>
      <p>A1</p>
      <p>A2
A’
G F where denotes the set of group-by attributes and
deF = : : : ; notes a set of aggregate functions f1; f2; fn over</p>
      <p>We use the generalized projection (GP) operator [GHQ95],</p>
      <p>G[F (A)], to represent aggregation. Generalized
projection is an extension of duplicate eliminating projection,
attributes in the attribute set A.</p>
      <p>In this paper, we consider only distributive aggregate
functions, i.e., aggregate functions that can be computed
by partitioning their inputs into disjoint sets. The SQL
aggregate functions count, sum, min, and max, are all
distributive. The algebraic aggregate function avg can be
expressed in this framework using sum/count.</p>
      <p>Aggregate functions can be divided into the duplicate
sensitive aggregates (referred to as DSAs), such as count
and sum, and the duplicate insensitive aggregates (referred
to as non-DSAs), such as max and min. These
characteristics are of importance when computing an
aggregate function from its disjoint sets. In general, non-DSAs
can always be computed from views which contain the
same aggregate, or the attribute of the aggregate; e.g.,
A;B R) A[max(B)]( A[max(B)](R). In order to
do a similar transformation with DSAs, we need additional
information about the number of duplicates. This
information can be acquired using a count. We refer to the
process of computing aggregates from their group-by
attributes using a count as duplicate compensation.</p>
      <p>Table 1 gives the rules for computing aggregates from
the partial results of previously computed aggregates or the
group-by attributes. The prerequisite attributes are those
π R
πA[F]
σ R
Q A query graph for a query is a graph, where each leaf
lem [Gup97, TS97, GM99], usually model the problem as
some form of graph structure representing multiple queries.</p>
      <p>We use GPSJ view graphs for this purpose.
node corresponds to a base table used to define Q, and each
non-leaf node is an operator with associated children. The
algebraic expression computed at the root node is
equivalent to Q. Query graphs are used in query optimizers to
determine the cost of a particular way of evaluating a query.</p>
      <p>We refer to the leaves and root nodes of the query graph as
“equivalence” nodes and the non-leaf nodes as “operation”
nodes.</p>
      <p>Expression DAGs are used to compactly represent the
space of the equivalent query graphs of a single query as a
directed acyclic graph. An expression DAG is a bipartite
directed acyclic graph with equivalence nodes and
operation nodes. An equivalence node (with the possible
exception of leaf nodes) has edges to one or more operation
nodes. An operation node consists of an operator, edges to
either one or more predecessors that are equivalence nodes,
and an edge to the derived equivalence node. We denote
an equivalence node by the algebraic expression it
computes. Its predecessor operation nodes correspond to
various query graphs that yield a result that is algebraically
equivalent to the label of the equivalence node. The leaves
of an expression DAG are equivalence nodes
corresponding to base tables. Expression DAGs are used in rule-based
optimizers.</p>
      <p>The GPSJ view graph is a multi-query expression DAG,
i.e., it represents expression DAGs of several queries in a
single DAG. Each equivalence node in the view graph
corresponds to a view, which can be materialized in a data
warehouse. Like the expression DAG, the leaf equivalence
nodes of a GPSJ view graph correspond to the base tables
of the data warehouse. We define the equivalence nodes of</p>
      <p>GPSJ view graphs as follows:
v equivalence node in the GPSJ view graph is annotated
: : : ; Q1; Q2; Qn if for each query Qi, there is a subgraph
G(X) graph having the base tables as the leaves is
G(X) Gi(Qi) in that is an expression DAG of Qi. Each</p>
    </sec>
    <sec id="sec-5">
      <title>Definition 3.2 (GPSJ View Graphs) A directed acyclic</title>
      <p>called a GPSJ view graph for the queries (or views)
with the query frequency fv (frequency of queries on v),
update frequency gv (frequency of update on v), and the
size sv of the view if materialized.</p>
      <p>Update frequence and size of the views in the GPSJ
view graph can be re-calculated after the construction of
the GPSJ view graph. Therefore, the only parameter of
interest during the construction of the GPSJ view graph is
the query frequency fv. We will not consider the update
frequency or size in the rest of this paper. Also, to avoid
cluttering up the figures, we will not annotate the graphs
with query frequencies in the examples.</p>
      <p>An important characteristic of the GPSJ view graph is its
similarity to AND-OR view graphs. This means that
standard view selection algorithms [Gup97, GM99] can, with
little or no modification, be used on GPSJ view graphs.
4</p>
      <sec id="sec-5-1">
        <title>Constructing GPSJ View Graphs</title>
        <p>= (C ;).
represented using a join on an empty condition
Example 4.1 Consider the table accounts of
Example 1.1 and the SQL query :</p>
        <sec id="sec-5-1-1">
          <title>SELECT a_name, max(a_balance)</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>FROM accounts</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>GROUP BY a_name</title>
          <p>2 = 8 in Example 4.1 has 25 different possible
groupn number of attributes used in the view, and is the number
G(Q) G(Qi) of into the view graph, is controlled by the
R m jection over a base table ) is O(2n m), where is the</p>
          <p>For the purposes of this paper, we assume that the
complete expression DAG has been generated. This issue has
no effect on the construction algorithm itself, as the
inputs of this algorithm is simply a set of graphs, however
complete expression DAGs are required to construct
complete GPSJ view graphs. The complete expression DAG of
aggregate queries are typically very large, e.g., the space
complexity of the number of equivalence nodes in a simple
aggregate query (i.e., one constructed using a single
proof attributes in the base table.</p>
          <p>For example, the complete expression DAG of the query
ings, namely all combination of attributes that include
a name and a balance. Using these 8 groupings we
end up with over 51 possible distinct query graphs. In
addition to these, we could also construct query graphs with
max(a balance) instead of grouping on a balance,
resulting in more than a hundred possible query graphs for
this query.</p>
          <p>For the purposes of algorithms such as those presented
in [Gup97, GM99], we can not simply choose an “optimal”
query graph, as we might then miss important equivalence
nodes which could be shared by other queries. However,
without knowledge about the other queries being
considered, it is impossible to know which views, and thus, which
nodes of the view graph will be useful for the purposes of
the view-selection algorithm, without sacrificing the
optimality of the solution. In Section 5, we will consider rules
for reducing the size of GPSJ view graphs. Given
information about the set of queries being considered, this can be
done without sacrificing the performance guarantee of the
view selection algorithms being used.</p>
          <p>The incremental merging of each expression DAG
set of merging rules which we discuss below. The merge
G(Q) 2. Annotate each equivalence node in with
G(X) 1. Let the initial GPSJ View Graph contain
: : : ; relations used in the queries Q1; Q2; Qn.
2 3. For each G(Qi) G(Q):
equivalence nodes corresponding to the base
fv.</p>
          <p>MERGEALGORITHM(G(X);G(Qi)) (cf. Figure 10)
G(Q) equivalence nodes in different DAGs of are correctly
rules ensure that potential structural relationships between
incorporated in the GPSJ view graph.
‘ ’ We use standard inference rules on the form to
The rules and merge algorithm are specified so as to ensure
that it is not necessary to iterate in the graph to discover
whether two queries can be computed from each other
using a single operation. Assuming that the full expression
DAG of a query has been materialized, three rules are
sufficient to ensure that all derivations between nodes in the
graphs of two different graphs will be derived. Note that
the set of views handled by the merge algorithm (i.e., GPSJ
views) can easily be extended by the introduction of
additional merge rules for other operators (e.g., outer join,
union, etc.).
state that, given and , we can infer ’.</p>
          <p>V 1
V1
V1
πG1[F1]</p>
          <p>V2
v 2 Each equivalence node VX is a triple hvL; Op; Odi,
G(X) ; is a tuple hVX OX i, where VX is the set of
equivao 2 Each operation node OX is a triple hoL; Ep; Edi,
Definition 4.1 (Graph Definition) A DAG or view graph
lence nodes, and OX is the set of operation nodes.
where vL is the label of the equivalence node. Op and Od
are a set of labels, where Op is the set of predecessor
operations and Od is the set of derivative operations.
where oL is the label of the operation node. Ep and Ed are
a set of labels, where Ep denotes one or more predecessor
equivalence nodes, and Ed is the derived equivalence node.</p>
          <p>We refer to Ep and Ed as the predecessor and derivative
expressions.
F 0 The set of aggregate functions is derived from F1 and</p>
          <p>F2 using the rules for recomputing distributive aggregate
functions from their component parts (see Table 1).
G(X) briefly describe the notation of the algorithm. We let
G(Q) and represent the GPSJ view graph and the
expres</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4.2 The Merging Algorithm</title>
      <p>Before we present the algorithm for merging, we shall
sion DAG of the query Q, respectively. We formally
describe an expression DAG or view graph, following the
description in Section 3.</p>
    </sec>
    <sec id="sec-7">
      <title>Rule 4.3 (Derivation from a Join Node)</title>
      <p>iff the join condition C1 restricts the same attributes as C2,
and the condition C1 is more restrictive than C2.</p>
      <p>We define the graph as a set of equivalence nodes and
operation nodes (rather than as a set of nodes and edges) as
these two kinds of nodes are treated separately in the
algorithm. Recall our claim in Section 3 regarding the similarity
of GPSJ view graphs to AND-OR view graphs. To
transform an expression DAG or view graph defined as above
into an AND-OR DAG, we simply transform the operation
nodes into edges; each operation node corresponds to an
AND arc in the AND-OR DAG framework.</p>
      <p>We refer to the operation nodes connecting an
equivalence node to its derivative nodes as derivative operation
nodes, and the operation nodes connecting it to its deriving
equivalence nodes as predecessor operation nodes.
= + Let fvj fvj fvi .
G(Q) The Expression DAG
G(X) The GPSJ View Graph
= If vi vj then do
62 If vi VX then
G(X) = ; Let hVX OXi
= ; If VQ then
2 For each vi VQ:
2 For each vj VX :
62 If vi is a child of vj and vi VX
G(Q) = Let hVQ; OQi</p>
    </sec>
    <sec id="sec-8">
      <title>Input</title>
    </sec>
    <sec id="sec-9">
      <title>Output</title>
      <p>The modified GPSJ View Graph G(X)0</p>
    </sec>
    <sec id="sec-10">
      <title>Method</title>
      <p>Return G(X).</p>
      <p>%% Step 1 - Check for Equality</p>
      <p>Merge(vi; vj )
Else
%% Step 2 - Check Ancestry
then</p>
      <p>Add(vi,G(X)).</p>
      <p>Else
%% Step 3 - Attempt to Derive
If vi is derivable from vj using
Rules 4.1-4.3 or (vj is derivable
from vi using Rules 4.1-4.3 and
vj is not a child of vi) then do</p>
      <p>Add(vi,G(X)).</p>
      <p>Add the deriving operation node
to vi and vj .</p>
      <p>EndIf
v i
Opp1 Op p2 . . . . . Oppm
P \ ) 6= of vj , we test reOps(vi) DerOps(vj ;.</p>
      <p>To test whether a view vi is a child (i.e., directly derived)
In order to simplify the presentation of the algorithm,
we assume that the list of the views VQ of the expression</p>
    </sec>
    <sec id="sec-11">
      <title>Definition 4.2 (Predecessor &amp; Derivative Operations)</title>
      <p>Given a DAG as in Figure 9, we define the predecessor
operations PreOps and derivative operations DerOps of the
equivalence node vi as:
A1
GP1
acc1
G(X)</p>
      <p>A2
GP2
acc2
G(Q)</p>
      <p>A1</p>
      <p>A2
GP1</p>
      <p>GP2
acc1</p>
      <p>G(X)</p>
    </sec>
    <sec id="sec-12">
      <title>Merge</title>
      <p>sults in the following configurations:
G(X) = ; v = Let hVX OX i, hvL; Op; Odi
Add(v; G(X))</p>
      <sec id="sec-12-1">
        <title>Let Ov be the operation nodes denoted by the set of labels</title>
        <p>= [ v VX VX
= [ OX OX Ov
Od.</p>
        <p>G(Q) equivalence nodes in during the processing of the
G(X) in can actually have operations connecting them to</p>
        <p>To add the operations (or set of operations) of an
equivalence node v1 to v2, we first update the expressions of
the derivative operation nodes referencing v1 to reference
v2. We then add the derivative operations of v1 to v2, and
add the operation nodes to the graph. The merge function
ensures that any child vk of vi will also be detected as a
child of vj . Note that this implies that equivalence nodes
merge algorithm (as shown in Example 4.3). Due to the
ordering imposed on VQ however, we are always ensured that
any predecessor equivalence nodes will be merged into the
view graph first. Finally, we update the query frequency fv
of the merged view in G(X).</p>
        <p>At the lowest level of the query graph, views
correspond to base relations. In this case, recognizing
identical views is a simple matter of name matching. For the
levels above, the recognition is done by checking whether
the derivation of a pair of views is equivalent. The
problem of testing for equivalence of GPSJ queries is
considered in [GHQ95, SDJL96, NSS98, RSSS98]. The problem
can be simplified by normalizing the expressions used to
identify the equivalence nodes when constructing the
expression DAGs.</p>
        <p>If the view vi is a child of some vj , and vi or an identical
view does not already exist in G(X), then we add vi to the</p>
        <p>GPSJ view graph G(X), using the Add function:</p>
      </sec>
      <sec id="sec-12-2">
        <title>Q1 SELECT a_name,max(a_balance)</title>
      </sec>
      <sec id="sec-12-3">
        <title>FROM accounts</title>
      </sec>
      <sec id="sec-12-4">
        <title>GROUP BY a_name</title>
        <p>Example 4.4 Consider the tables
accounts(a_personid,a_name,a_balance,
a_type,a_date)
transaction(t_personid,t_change,t_date)
with the following three SQL queries:</p>
      </sec>
      <sec id="sec-12-5">
        <title>Q2 SELECT</title>
      </sec>
      <sec id="sec-12-6">
        <title>FROM</title>
      </sec>
      <sec id="sec-12-7">
        <title>WHERE</title>
        <p>a_name,count(t_personid)
accounts,transaction
t_personid = a_personid</p>
      </sec>
      <sec id="sec-12-8">
        <title>AND t_change &gt; 250</title>
      </sec>
      <sec id="sec-12-9">
        <title>GROUP BY a_name</title>
      </sec>
      <sec id="sec-12-10">
        <title>Q3 SELECT</title>
      </sec>
      <sec id="sec-12-11">
        <title>FROM</title>
      </sec>
      <sec id="sec-12-12">
        <title>WHERE</title>
        <p>a_name,a_balance
accounts,transaction
t_personid = a_personid</p>
      </sec>
      <sec id="sec-12-13">
        <title>AND t_change &gt; 500</title>
      </sec>
      <sec id="sec-12-14">
        <title>GROUP BY a_name,a_balance</title>
        <p>Q2
M M [ fQ1; Q2; Q3g to select a set of views , such that are
A2 T ize and 1.</p>
        <p>In the view graph of Figure 14, each of the equivalence
nodes represents a view which could be materialized in the
data warehouse to answer one of the queries Q1, Q2, or
Q3, while the paths in the graphs represent ways of
computing the queries from these views. It is thus possible to
apply a view selection algorithm to select and identify
useful sets of views to materialize. For example, if we wished
self-maintainable (i.e., can be maintained on changes to the
base tables without referencing these), we could
materialThe actual running time of view selection algorithms
depends on the size and structure of the generated view graph.</p>
        <p>The full expression DAG of an aggregate query is typically
huge (cf. the very simple query in Example 4.1); this
results in a blow-up of the size of the view graph structure
and longer running time or larger space requirements of the
view selection algorithm. A solution to the performance
problem of view selection algorithms is to prune the search
space of the algorithm [TS97, YKL97]. This corresponds
to reducing the size of GPSJ view graphs.</p>
        <p>Obviously, a view selection algorithm cannot consider
or select a view, if that view is not represented in the GPSJ
view graph. Therefore, care must be taken to ensure that we
do not remove equivalence nodes which might be
considered by the view selection algorithm, if we wish to
maintain a performance guarantee for the view selection, such
as those presented in [Gup97, GM99].</p>
        <p>Recall our assumption that the complete expression
DAG is generated for each query, thus ensuring that the
8-9</p>
        <p>T1
with respect to the set of queries, it follows that V1 is not
a “useful” view for the view selection algorithm. Recall
the O(2n m) space complexity for the number of
equivalence nodes in the simple aggregate query expression DAG.</p>
        <p>Rule 5.1 reduces the number of group-by attributes
considered in the expression DAG (the factor n), thus minimizing
the exponent.</p>
        <p>Before giving the second rule, we first define
superfluous aggregates.</p>
        <p>GPSJ view graph is also complete. However, given
knowledge about the complete set of queries to be considered, it
is possible to identify views which will never be considered
by the view selection algorithm. Removing such views
allows us to reduce the size and complexity of the GPSJ view
graph structure.</p>
        <p>We present two rules, which reduce the size of the GPSJ
view graph without affecting subsequent view selection
algorithms, i.e., these algorithms will still select the same set
of views to materialize. This reduction, or pruning, of the
graph structure can occur at several different stages of the
configuration tool architecture (cf. Figure 1), either as a
pre- or post-optimization of the Integrator component, or
as part of the expression DAG generation.</p>
        <p>The correctness of Rule 5.2 follows from the principles
of duplicate compensation (cf. Section 2). Since
superfluous aggregates do not add additional value for the view
selection algorithm (i.e., if a query can be computed from a
view containing superfluous aggregates, then it can equally
8-10</p>
        <p>Q2
Q1
πa_name, a_balance []</p>
        <p>AT5
[NSS98]
[RSS96]
well be computed from one containing no superfluous
aggregates), we eliminate such views to reduce the size of the
view graph.
6</p>
        <sec id="sec-12-14-1">
          <title>Conclusion</title>
          <p>The selection of which views to materialize is one of the
most important decisions in data warehouse design. The
view selection (or data warehouse configuration) problem
is to select a set of views to materialize in the data
warehouse, so as to optimize the total query response time
under some space or maintenance cost constraints. We
discuss the view graph structures required by view selection
algorithms such as those proposed in [Gup97, GM99]. We
are not aware of any papers considering the construction
of such graph structures in the presence of aggregation and
grouping.</p>
          <p>We describe an algorithm,
GPSJVIEWGRAPHBUILDER for the construction of GPSJ view graphs from
the expression DAGs of GPSJ queries based on a set of
merge rules. We define a set of rules for merging complete
expression DAGs, and give a merge algorithm for carrying
out the incremental merging of an expression DAG with
the GPSJ view graph. We also give a number of rules for
reducing the size of the view graph constructed, while still
allowing us to keep the performance guarantee of the view
selection algorithms used.</p>
          <p>In our future work, we intend to investigate the
following issues:</p>
          <p>There is a lot of scope for reducing the size of the GPSJ
view graph generated for the view selection algorithm. It
would be interesting to attempt to further examine the
effects of “non-optimal” pruning strategies on the view
selection algorithms considered. Is it possible to tailor view
selection algorithms to particular pruning strategies?</p>
          <p>The GPSJ view graph framework can easily be extended
with operations such as outer join, union, and other
relational operators by extending the merge rules of the
algorithm. This would allow us to consider a broader class of
views.
Werner Nutt, Yehoshua Sagiv, and Sara
Shurin. Deciding equivalences among
aggregate queries. In Proceedings of the Seventeenth
ACM SIGACT-SIGMOD-SIGART Symposium
on Principles of Database Systems, pages 214–
223, Seattle, Washington, USA, June 1998.
ACM Press.</p>
          <p>N. Roussopolous. The Logical Access Path
Schema of a Database. IEEE Transactions
in Software Engineering, SE-8(6):563–573,
November 1982.</p>
          <p>Kenneth A. Ross, Divesh Srivastava, and S.
Sudarshan. Materialized View Maintenance and
Integrity Constraint Checking: Trading Space
for Time. In Proceedings of the 1996 ACM
SIGMOD International Conference on
Management of Data, pages 447–458, Montreal,
Quebec, Canada, June 1996.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [GHQ95]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Harinarayan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Quass</surname>
          </string-name>
          .
          <article-title>Aggregate-Query Processing in Data Warehousing Environments</article-title>
          . In Umeshwar Dayal,
          <string-name>
            <surname>Peter M. D. Gray</surname>
          </string-name>
          , and Shojiro Nishio, editors,
          <source>Proceedings of the Twenty-first International Conference on Very large Databases. Zurich</source>
          , Switzerland,
          <year>September 1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [GM99]
          <article-title>Himanshu Gupta and Inderpal Singh Mumick</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>Selection of Views to Materialize Under a Maintenance Cost Constraint</article-title>
          . To appear
          <source>in the Proceedings of the ICDT'99</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>[RSSS98] K. A. Ross</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          <string-name>
            <surname>Stuckey</surname>
            , and
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sudarshan</surname>
          </string-name>
          .
          <source>Foundations of Aggregation Constraints. Theoretical Computer Science</source>
          ,
          <volume>193</volume>
          :
          <fpage>149</fpage>
          -
          <lpage>179</lpage>
          ,
          <year>February 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [SDJL96]
          <string-name>
            <given-names>Divesh</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Shaul Dar,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          , and
          <string-name>
            <surname>Alon</surname>
            <given-names>Y. Levy.</given-names>
          </string-name>
          <article-title>Answering Queries with Aggregation Using Views</article-title>
          .
          <source>In Proceedings of the 22nd Annual International Conference on Very Large Data Bases</source>
          , Bombay, India,
          <year>September 1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [TS97]
          <article-title>Dimitri Theodoratos and Timos Sellis. Data Warehouse Configuration</article-title>
          .
          <source>In Proccedings of the Twenty-third International Conference on Very Large Data Bases</source>
          , pages
          <fpage>126</fpage>
          -
          <lpage>135</lpage>
          , Athens, Greece,
          <year>August 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [YKL97]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Karlapalem</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Algorithms for Materialized View Design in Data Warehousing Environment</article-title>
          . In Umeshwar Dayal,
          <string-name>
            <surname>Peter M. D. Gray</surname>
          </string-name>
          , and Shojiro Nishio, editors,
          <source>Proceedings of the Twenty-third International Conference on Very Large Databases</source>
          ,
          <year>August 1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>