Constructing GPSJ View Graphs

                            Michael O. Akinde                                        Michael H. Böhlen
                      Department of Computer Science                           Department of Computer Science
                            Aalborg University                                       Aalborg University
                         Fredrik Bajers Vej 7 E-1                                 Fredrik Bajers Vej 7 E-1
                      DK–9220 Aalborg Øst, Denmark                             DK–9220 Aalborg Øst, Denmark
                            strategy@cs.auc.dk                                       boehlen@cs.auc.dk


                                                                             1 Introduction
                                                                             A data warehouse is a repository of integrated informa-
                                Abstract                                     tion from multiple, independent data sources available for
                                                                             querying and analysis. As data warehouses contain inte-
                                                                             grated information, often spanning long periods of time,
      A data warehouse collects and maintains in-                            they tend to be orders of magnitude larger than conven-
      tegrated information from heterogeneous data                           tional operational databases; ranging from hundreds of gi-
      sources for OLAP and decision support. An im-                          gabytes to terabytes in size. The workload is typically
      portant task in data warehouse design is the selec-                    query-intensive, with many complex queries that may ac-
      tion of views to materialize, in order to minimize                     cess millions of records and perform many joins and ag-
      the response time and maintenance cost of gener-                       gregates.
      alized project-select-join (GPSJ) queries.
                                                                                 Three costs must be balanced during physical database
                                                                             design for warehouses: (1) the cost of answering queries,
      We discuss how to construct GPSJ view graphs.
                                                                             (2) the cost of maintaining the warehouse, and (3) the cost
      GPSJ view graphs are directed acyclic graphs,
                                                                             of secondary storage. The cost of (1) can be reduced by ma-
      used to compactly encode and represent differ-
                                                                             terializing (precomputing) frequently asked queries as ma-
      ent possible ways of evaluating a set of GPSJ
                                                                             terialized views in the data warehouse, but this increases
      queries. Our view graph construction algorithm,
                                                                             the maintenance costs of the warehouse. The problem of
      GPSJV IEW G RAPH BUILDER, incrementally con-
                                                                             selecting an appropriate set of views and indexes to ma-
      structs GPSJ view graphs based on a set of merge
                                                                             terialize in a data warehouse is referred to as the view-
      rules. We provide a set of merging rules to con-
                                                                             selection [Gup97] or data warehouse configuration prob-
      struct GPSJ view graphs in the presence of du-
                                                                             lem [TS97].
      plicate sensitive and insensitive aggregates. The
      merging algorithm used in GPSJV IEW G RAPH -                               For the purpose of our discussion, we use the fol-
      BUILDER ensures that each node is correctly                            lowing terminology (precise definitions follow later). A
      added to the view graph, and employs the merge                         GPSJ query is a generalized project-select-join query,
      rules to ensure that relationships between nodes                       i.e., a project-select-join query extended with aggregation,
      from different queries are incorporated into the                       grouping, and group selection. This class of queries is
      view graph.                                                            the single most important one used in data warehousing
                                                                             [Kim96]. A GPSJ query graph is a directed acyclic graph.
                                                                             It represents a specific strategy to evaluate a GPSJ query.
                                                                             A GPSJ expression DAG compactly encodes different pos-
The copyright of this paper belongs to the paper’s authors. Permission to
copy without fee all or part of this material is granted provided that the   sibilities to evaluate a GPSJ query. It combines multiple
copies are not made or distributed for direct commercial advantage.          GPSJ query graphs into a single graph. Finally, a GPSJ
Proceedings of the International Workshop on Design and                      view graph encodes multiple evaluation strategies for dif-
Management of Data Warehouses (DMDW’99)                                      ferent queries. Put differently, a GPSJ view graph inte-
Heidelberg, Germany, 14. - 15.6. 1999                                        grates several GPSJ expression DAGs.
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.)                          In this paper, we concentrate on the construction of
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/        GPSJ view graphs. The integrator takes the set of GPSJ


M.O. Akinde, M.H. Böhlen                                                                                                            8-1
expression DAGs as input, and merges them into a GPSJ                 merge rules used in the GPSJV IEW G RAPH BUILDER algo-
view graph according to a set of merging rules. The inte-             rithm.
grator and its embedding into the data warehouse configu-                 Specifically, given a set of data warehouse queries,
ration tool are illustrated in Figure 1. Given a set of queries       Q1 ; Q2 ; : : : Qn , the GPSJ view graph G(X ) is incremen-
for the data warehouse and metadata (such as query fre-               tally constructed from the set of associated GPSJ expres-
quencies, schema information, etc.), the tool returns a data          sion DAGs, G(Q1 ); G(Q2 ); : : : ; G(Qn ) by the algorithm
warehouse configuration.                                              using a set of merging rules. We give a set of basic rules to
                                                                      integrate GPSJ expression DAGs into GPSJ view graphs.
 GPSJ Queries
 Q1, Q 2,, . . . , Qn
                        Metadata                    Data Warehouse    To the best of our knowledge, the rules and algorithms for
                                                     Configuration
                                                                      merging such graphs in the presence of aggregation and
                                                                      grouping have not previously been considered in the litera-
       Expression DAG                               View Selection    ture.
          Generator                                    Algorithm          In order to ensure an optimal solution to the view-
                                                                      selection problem, it is necessary to generate the entire
    Expression DAGs                                 GPSJ View Graph   GPSJ view graph. However, the complexity (space and
                                       Integrator
    G(Q 1), G(Q 2 ), . . . , G(Q n )                     G(V)         time) of algorithms for solving the view-selection problem
                                                                      optimally is exponential in the number of nodes in the ex-
                                                                      pression graph [TS97, Gup97, GM99]. Although we do not
          Figure 1: Data Warehouse Configuration Tool                 directly represent the conventional (duplicate-preserving)
                                                                      projection in the GPSJ view graphs of this framework, as
                                                                      they can be discarded when they appear as interior nodes
Example 1.1 Consider the table:                                       in the graph [RSS96, YKL97] and can be expressed using
accounts(a personid, a name, a balance,                               a generalized projection when appearing as the outermost
         a type, a date)                                              nodes [GHQ95], the complete GPSJ view graph in the pres-
                                                                      ence of aggregation and grouping alone can still be huge.
and the SQL queries:                                                  Therefore, we provide a number of rules which can be used
                                                                      to reduce the size of the view graph.
      A1 SELECT   a_name,max(a_balance)
         FROM     accounts
         GROUP BY a_name                                              1.1 Related Work
                                                                      The problem of constructing view graphs such as those de-
      A2 SELECT       a_personid,a_name,
                                                                      scribed in this paper is most often considered in relation
                      sum(a_balance)
                                                                      to view-selection algorithms or physical database design.
             FROM     accounts
             GROUP BY a_name                                          Roussopolous [Rou82] presents an algorithm for generat-
                                                                      ing LAP schemas, which closely resemble our view graphs,
                                                                      however, his algorithm does not consider aggregate func-
  Figure 2 gives two simple expression DAGs, G(A1) and                tions and grouping.
G(A2), for these two queries. Equivalence nodes, repre-                  Other papers on view-selection employing view-graph
senting views which can be materialized in the data ware-             like structures [RSS96, TS97, GM99] either do not con-
house, are denoted in the figures using ovals and a label.            sider the construction of these structures, or consider only
Operation nodes are denoted using rectangles. The inte-               select-join views without aggregation.
grator takes these expression DAGs as input, and merges                  Gupta [Gup97] suggests that the AND-OR view graph
them to derive the GPSJ view graph G(X ) displayed on the             be constructed using the expression AND-OR DAGs of the
left. Note that during the merging process, additional op-            queries. These expression AND-OR DAGs are to contain
eration nodes and edges may be derived, such as the node              only those views which will be considered (useful) for the
and edges connecting A0 to A1 .                                       view selection algorithm. However, it is unclear how to
                                                                      determine these “useful” views in the presence of aggre-
    A major part of the integrator is the GPSJV IEW G RAPH -
                                                                      gation, and therefore how to construct the AND-OR view
BUILDER algorithm. The algorithm starts with a (possibly
                                                                      graph. This is the problem considered in this paper.
empty) GPSJ view graph and incrementally builds up the
“final” graph by integrating GPSJ expression DAGs into
                                                                      1.2 Paper Outline
it. The behavior of the algorithm is controlled by a set of
merging rules, which are also part of the integrator. This            The paper is structured as follows. Section 2 describes the
flexibility allows us to extend the GPSJ view graph frame-            aggregation framework and notation used in this paper, and
work to consider more general view graphs and various                 gives rules for recomputing aggregates using their disjoint
graph input types, simply by adding or modifying the set of           sets. Section 3 defines the GPSJ view graphs, and gives the


M.O. Akinde, M.H. Böhlen                                                                                                      8-2
                                                       A2                                                      A1                                             A2


           A1              πa_personid, a_name [sum(a_balance*count(*)]                                                             πa_personid, a_name [sum(a_balance*count(*)]
                                                                                                                  πa_name [max(a_balance)]
                                                                  πa_personid, a_name [sum(a_balance)]                                                                    πa_personid, a_name [sum(a_balanc
     πa_name [max(a_balance)]              A’
                                                                                                                                                         A’


      accounts
                            πa_personid, a_name, a_balance []                                                                         πa_personid, a_name, a_balance []
                                                                                                         πa_name [max(a_balance)]


                                                  accounts                                                                                               accounts


      G(A1)                                           G(A2)                                                                                      G(X)

                                     Figure 2: Combining two GPSJ Expression DAGs into a GPSJ View Graph

graphical notation used to illustrate the graphs. Section 4                                                 group-by attribute and/or aggregate functions required in a
gives rules and the algorithm for merging expression DAGs                                                   view for the computation of the aggregate in the first col-
for GPSJ queries. In Section 5 we give some rules for re-                                                   umn. For example, we can compute Q =  [sum(A)]R
ducing the size of the GPSJ view graph. Section 6 con-                                                      given the view V = A [count() as cnt](R) as Q =
cludes the paper and points to future research directions.                                                  [sum(A  cnt)](V ).
                                                                                                               Aggregate                   Prerequisite                                 Computed
2 Aggregation Framework                                                                                                                    Attributes                                   Aggregate
We use the generalized projection (GP) operator [GHQ95],                                                       count()                    count() as cnt                              sum(cnt)
G [F (A)], to represent aggregation. Generalized projec-                                                      count(A)                    count(A) as cntA                             sum(cntA)
tion is an extension of duplicate eliminating projection,                                                      count(A)                    A,count() as cnt                            sum(cnt)
where G denotes the set of group-by attributes and F de-                                                       sum(A)                      sum(A) as sumA                               sum(sumA)
notes a set of aggregate functions F = f1 ; f2 ; : : : ; fn over                                               sum(A)                      A, count() as cnt                           sum(A  cnt)
attributes in the attribute set A.                                                                             max(A)                      max(A) as maxA                               max(maxA)
    In this paper, we consider only distributive aggregate                                                     max(A)                      A                                            max(A)
functions, i.e., aggregate functions that can be computed                                                      min(A)                      min(A) as minA                               min(minA)
by partitioning their inputs into disjoint sets. The SQL                                                       min(A)                      A                                            min(A)
aggregate functions count, sum, min, and max, are all
distributive. The algebraic aggregate function avg can be
expressed in this framework using sum/count.                                                                Table 1: Computing Aggregates from partial results and/or
    Aggregate functions can be divided into the duplicate                                                   group-by attributes
                                                                                                               An example of the application of such rules is the deriv-
                                                                                                            ing of A1 or A2 from the node A0 in Figure 2.
sensitive aggregates (referred to as DSAs), such as count
and sum, and the duplicate insensitive aggregates (referred
to as non-DSAs), such as max and min. These char-
acteristics are of importance when computing an aggre-                                                      3 GPSJ View Graphs
gate function from its disjoint sets. In general, non-DSAs                                                  A GPSJ query is PSJ query enhanced with grouping and
can always be computed from views which contain the                                                         aggregation. More precisely, a GPSJ query is any query
same aggregate, or the attribute of the aggregate; e.g.,                                                    which can be written in GP normal form [GHQ95] (i.e., a
A [max(B )](A;B R)  A [max(B )](R). In order to                                                         selection, 1 , over a generalized projection,  , over a se-
do a similar transformation with DSAs, we need additional                                                   lection, 2 , over a set of joins, X ,: 1 2 X ). A large class
information about the number of duplicates. This infor-                                                     of queries can be expressed as GPSJ queries, in particular
mation can be acquired using a count. We refer to the                                                       all SELECT-FROM-WHERE-GROUP BY-HAVING queries
process of computing aggregates from their group-by at-                                                     can be reduced to this form if the attributes/aggregate
tributes using a count as duplicate compensation.                                                           functions in the GROUP BY and HAVING clauses appear
    Table 1 gives the rules for computing aggregates from                                                   in the SELECT clause, no aggregate functions use the
the partial results of previously computed aggregates or the                                                DISTINCT keyword, and the WHERE clauses are con-
group-by attributes. The prerequisite attributes are those                                                  junctive. Algorithms for solving the view-selection prob-


M.O. Akinde, M.H. Böhlen                                                                                                                                                                                     8-3
lem [Gup97, TS97, GM99], usually model the problem as                        represented using a join on an empty condition
some form of graph structure representing multiple queries.                  (C = ;).
We use GPSJ view graphs for this purpose.
    A query graph for a query Q is a graph, where each leaf          In GPSJ expression DAGs and GPSJ view graphs the
node corresponds to a base table used to define Q, and each      three basic operations are denoted using the graphical no-
non-leaf node is an operator with associated children. The       tation of Figure 3.
algebraic expression computed at the root node is equiv-
alent to Q. Query graphs are used in query optimizers to            π R             σR                         RQ
determine the cost of a particular way of evaluating a query.
We refer to the leaves and root nodes of the query graph as
“equivalence” nodes and the non-leaf nodes as “operation”           πA [F]          σ [C]                       [C]


nodes.
    Expression DAGs are used to compactly represent the
space of the equivalent query graphs of a single query as a          R               R                R                 Q
directed acyclic graph. An expression DAG is a bipartite
directed acyclic graph with equivalence nodes and oper-          Figure 3: (a) Generalized Projection, (b) Selection, and (c)
ation nodes. An equivalence node (with the possible ex-          Join in GPSJ View Graphs
ception of leaf nodes) has edges to one or more operation
nodes. An operation node consists of an operator, edges to
either one or more predecessors that are equivalence nodes,      Definition 3.2 (GPSJ View Graphs) A directed acyclic
and an edge to the derived equivalence node. We denote           graph G(X ) having the base tables as the leaves is
an equivalence node by the algebraic expression it com-          called a GPSJ view graph for the queries (or views)
putes. Its predecessor operation nodes correspond to var-        Q1 ; Q2 ; : : : ; Qn if for each query Qi , there is a subgraph
ious query graphs that yield a result that is algebraically      Gi (Qi ) in G(X ) that is an expression DAG of Qi . Each
equivalent to the label of the equivalence node. The leaves      equivalence node v in the GPSJ view graph is annotated
of an expression DAG are equivalence nodes correspond-           with the query frequency fv (frequency of queries on v ),
ing to base tables. Expression DAGs are used in rule-based       update frequency gv (frequency of update on v ), and the
optimizers.                                                      size sv of the view if materialized.
    The GPSJ view graph is a multi-query expression DAG,
i.e., it represents expression DAGs of several queries in a          Update frequence and size of the views in the GPSJ
single DAG. Each equivalence node in the view graph cor-         view graph can be re-calculated after the construction of
responds to a view, which can be materialized in a data          the GPSJ view graph. Therefore, the only parameter of in-
warehouse. Like the expression DAG, the leaf equivalence         terest during the construction of the GPSJ view graph is
nodes of a GPSJ view graph correspond to the base tables         the query frequency fv . We will not consider the update
of the data warehouse. We define the equivalence nodes of        frequency or size in the rest of this paper. Also, to avoid
GPSJ view graphs as follows:                                     cluttering up the figures, we will not annotate the graphs
                                                                 with query frequencies in the examples.
Definition 3.1 (Equivalence Nodes of View Graphs)                    An important characteristic of the GPSJ view graph is its
Let R be the set of base tables in the data warehouse. We        similarity to AND-OR view graphs. This means that stan-
define the set of equivalence nodes V of a GPSJ view graph       dard view selection algorithms [Gup97, GM99] can, with
recursively as:                                                  little or no modification, be used on GPSJ view graphs.

 1. If Ri 2 R, then Ri 2 V .                                     4 Constructing GPSJ View Graphs
 2. If Ri ; Rj 2 V , then:                                       The GPSJ view graph of a set of queries Q1 ; Q2 ; : : : ; Qn
      (a)  [C ](Ri ) 2 V , where  [C ](Ri ) selects the sub-   is constructed by merging the expression DAGs G (Q) =
          set of the tuples of Ri that satisfies the condition   G(Q1 ); G(Q2 ); : : : ; G(Qn ) for each of these queries.
          C.                                                     Example 4.1 Consider the table accounts of Exam-
      (b) G [F ](Ri ) 2 V , where G [A] is a generalized       ple 1.1 and the SQL query :
          projection on Ri . G is a subset of the attributes
          in Ri and F are a set of aggregate functions over          SELECT   a_name, max(a_balance)
          Ri .                                                       FROM     accounts
      (c) 1 [C ](Ri ; Rj ) 2 V , where ./ [C ](Ri ; Rj ) is
                                                                     GROUP BY a_name
          a join between Ri and Rj on the join condi-            Figure 4 gives three equivalent query graphs for this query,
          tion C . The Cartesian product of two views is         and the expression DAG derived from these query graphs.


M.O. Akinde, M.H. Böhlen                                                                                                   8-4
                                                                                                                                                A’


              A’                            A’                                  A’
                                                                                                        GP4 =    πa_name [max(a_balance)]                           GP5 =   πa_name [max(a_balance)]


    πa_name [max(a_balance)]      πa_name [max(a_balance)]           πa_name [max(a_balance)]                   A1                 GP1 = πa_name [max(a_balance)]           A2


         accounts
                               πa_personid, a_name, a_balance []   πa_name, a_type   [max(a_balance)]
                                                                                                        GP2 = πa_personid, a_name, a_balance []              GP3 = πa_name, a_type    [max(a_balance)]


                                       accounts                           accounts                                                          accounts

                                                                                                                      Expression DAG of QG1, QG2, and QG3
            QG1                            QG2                               QG3


                                           Figure 4: Equivalent query graphs and the resulting expression DAG

    For the purposes of this paper, we assume that the com-
                                                                                                            1. Let the initial GPSJ View Graph G(X ) contain
plete expression DAG has been generated. This issue has                                                        equivalence nodes corresponding to the base
no effect on the construction algorithm itself, as the in-                                                     relations used in the queries Q1 ; Q2 ; : : : ; Qn .
puts of this algorithm is simply a set of graphs, however                                                   2. Annotate each equivalence node in G (Q) with
complete expression DAGs are required to construct com-                                                        fv .

plete GPSJ view graphs. The complete expression DAG of                                                      3. For each G(Qi ) 2 G (Q):
aggregate queries are typically very large, e.g., the space                                                    M ERGE A LGORITHM(G(X ); G(Qi )) (cf.                                 Figure 10)
complexity of the number of equivalence nodes in a simple
aggregate query (i.e., one constructed using a single pro-                                                                 Figure 5: GPSJV IEW G RAPH BUILDER
jection over a base table R) is O(2n,m ), where m is the
number of attributes used in the view, and n is the number                                                rules ensure that potential structural relationships between
of attributes in the base table.                                                                          equivalence nodes in different DAGs of G (Q) are correctly
                                                                                                          incorporated in the GPSJ view graph.
    For example, the complete expression DAG of the query
in Example 4.1 has 25,2 = 8 different possible group-
ings, namely all combination of attributes that include                                                   4.1 Merge Rules
a name and a balance. Using these 8 groupings we                                                          The rules and merge algorithm are specified so as to ensure
end up with over 51 possible distinct query graphs. In ad-                                                that it is not necessary to iterate in the graph to discover
dition to these, we could also construct query graphs with                                                whether two queries can be computed from each other us-
max(a balance) instead of grouping on a balance,                                                          ing a single operation. Assuming that the full expression
resulting in more than a hundred possible query graphs for                                                DAG of a query has been materialized, three rules are suf-
this query.                                                                                               ficient to ensure that all derivations between nodes in the
    For the purposes of algorithms such as those presented                                                graphs of two different graphs will be derived. Note that
in [Gup97, GM99], we can not simply choose an “optimal”                                                   the set of views handled by the merge algorithm (i.e., GPSJ
query graph, as we might then miss important equivalence                                                  views) can easily be extended by the introduction of ad-
nodes which could be shared by other queries. However,                                                    ditional merge rules for other operators (e.g., outer join,
without knowledge about the other queries being consid-                                                   union, etc.).
ered, it is impossible to know which views, and thus, which                                                   We use standard inference rules on the form ;  ` ' to
nodes of the view graph will be useful for the purposes of                                                state that, given and , we can infer '.
the view-selection algorithm, without sacrificing the opti-
mality of the solution. In Section 5, we will consider rules                                              Rule 4.1 (Derivation from a Selection Node)
for reducing the size of GPSJ view graphs. Given informa-
tion about the set of queries being considered, this can be                                                  V1 = [C1 ](V ); V2 = [C2 ](V ) ` V1 = [C1 ](V2 )
done without sacrificing the performance guarantee of the
view selection algorithms being used.                                                                     iff the selection condition C1 restricts the same attributes
    The incremental merging of each expression DAG                                                        as C2 , and the condition C1 is more restrictive than C2 .
G(Qi ) of G (Q) into the view graph, is controlled by the
set of merging rules which we discuss below. The merge


M.O. Akinde, M.H. Böhlen                                                                                                                                                                                8-5
                   V1                                                                          V1

                                  σ [C 1]
                                                                                               [C 1]   σ [C 1]
                    σ [C 1]                  V2
                               σ [C 2]
                                                                                               V
                                                                                                   2
                           V
                                                                                               [C 2]

             Figure 6: Illustration of Rule 4.1
                                                                                   V                        V
Rule 4.2 (Derivation from a GP Node)                                                   i                         j


V1 = G1 [F1 ](V ); V2 = G2 [F2 ](V ) ` V1 = G1 [F 0 ](V2 )                  Figure 8: Illustration of Rule 4.3

iff:                                                              4.2 The Merging Algorithm
  1. G1  G2 ,                                                    Before we present the algorithm for merging, we shall
                                                                  briefly describe the notation of the algorithm. We let G(X )
  2. for each DSA fi (ai ) 2 F1 there exists a correspond-        and G(Q) represent the GPSJ view graph and the expres-
     ing DSA fi (ai ) 2 F2 , or a count() 2 F2 and               sion DAG of the query Q, respectively. We formally de-
     ai 2 G2 , and                                                scribe an expression DAG or view graph, following the de-
  3. for each non-DSA fi (ai ) 2 F1 there exists a corre-         scription in Section 3.
     sponding non-DSA fi (ai ) 2 F2 or ai 2 G2 .
                                                                  Definition 4.1 (Graph Definition) A DAG or view graph
The set of aggregate functions F 0 is derived from F1 and         G(X ) is a tuple hVX ; OX i, where VX is the set of equiva-
F2 using the rules for recomputing distributive aggregate         lence nodes, and OX is the set of operation nodes.
functions from their component parts (see Table 1).                  Each equivalence node v 2 VX is a triple hvL ; Op ; Od i,
                                                                  where vL is the label of the equivalence node. Op and Od
                                                                  are a set of labels, where Op is the set of predecessor oper-
                   V1                                             ations and Od is the set of derivative operations.
                                 π G1[F1’]                           Each operation node o 2 OX is a triple hoL ; Ep ; Ed i,
                                                                  where oL is the label of the operation node. Ep and Ed are
                 πG1[F1]
                                             V2                   a set of labels, where Ep denotes one or more predecessor
                                                                  equivalence nodes, and Ed is the derived equivalence node.
                                π G2 [F2]
                                                                  We refer to Ep and Ed as the predecessor and derivative
                                                                  expressions.
                   V
                                                                  Example 4.2 Consider the expression DAG in Figure 4.
             Figure 7: Illustration of Rule 4.2                   Following Definition 4.1, we can define the expression
                                                                  DAG G(X ) as follows:

Rule 4.3 (Derivation from a Join Node)                                  G(X ) =hVX ; OX i
                                                                        VX = fhaccounts; fg; fGP 1; GP 2; GP 3gi;
          V1 = 1[C1 ](Vi ; Vj );V2 = 1[C2 ](Vi ; Vj ) `                        hA1; fGP 2g; fGP 4gi;
                                V1 = [C1 ](V2 )                               hA2; fGP 3g; fGP 5gi;
iff the join condition C1 restricts the same attributes as C2 ,                hA0 ; fGP 1; GP 4; GP 5g; fgig
and the condition C1 is more restrictive than C2 .                      OX = fhGP 1; faccountsg; fA0gi;
     Rule 4.3 can also be used to handle the view V de-                        hGP 2; faccountsg; fA1gi;
rived from a Cartesian product. This is done by treating                       hGP 3; faccountsg; fA2gi;
the Cartesian product as a join on the empty condition, i.e.,
any join condition C will always be more restrictive than
                                                                               hGP 4; fA1g; fA0gi;
the conditions of the Cartesian product.                                       hGP 5; fA2g; fA0gig


M.O. Akinde, M.H. Böhlen                                                                                                  8-6
    We define the graph as a set of equivalence nodes and
                                                                   Input
operation nodes (rather than as a set of nodes and edges) as           The GPSJ View Graph G(X )
these two kinds of nodes are treated separately in the algo-           The Expression DAG G(Q)
rithm. Recall our claim in Section 3 regarding the similarity      Output
                                                                       The modified GPSJ View Graph G(X )0
of GPSJ view graphs to AND-OR view graphs. To trans-               Method
form an expression DAG or view graph defined as above                  Let G(X ) = hVX ; OX i
into an AND-OR DAG, we simply transform the operation                  Let G(Q) = hVQ ; OQ i
nodes into edges; each operation node corresponds to an                If VQ = ; then
                                                                          Return G(X ).
AND arc in the AND-OR DAG framework.                                   For each vi 2 VQ :
    We refer to the operation nodes connecting an equiva-                 For each vj 2 VX :
lence node to its derivative nodes as derivative operation                   %% Step 1 - Check for Equality
nodes, and the operation nodes connecting it to its deriving                 If vi = vj then do
                                                                                Merge(vi ; vj )
equivalence nodes as predecessor operation nodes.                               Let fvj = fvj + fvi .
                                                                             Else
                 Opd1 Op d2 . . . . . Opdn                                      %% Step 2 - Check Ancestry
                                                                                If vi is a child of vj and vi 62 VX
                                                                                then
                                                                                    Add(vi ,G(X )).
                                                                                Else
                                                                                    %% Step 3 - Attempt to Derive
                              vi                                                    If vi is derivable from vj using
                                                                                    Rules 4.1-4.3 or (vj is derivable
                                                                                    from vi using Rules 4.1-4.3 and
                                                                                    vj is not a child of vi ) then do
                                                                                       If vi 62 VX then
                 Opp1 Op p2 . . . . . Oppm                                                 Add(vi ,G(X )).
                                                                                       Add the deriving operation node
                                                                                       to vi and vj .
                                                                                    EndIf
Figure 9: Predecessor and derivative operation nodes of the               End For
equivalence node vi                                                    End For
                                                                       Return G(X ).

Definition 4.2 (Predecessor & Derivative Operations)                    Figure 10: M ERGE A LGORITHM(G(X ); G(Q))
Given a DAG as in Figure 9, we define the predecessor op-
erations PreOps and derivative operations DerOps of the                Merge(v1; v2)
equivalence node vi as:                                                Let G(X ) = hVX ; OX i
                                                                       Let E be the set of expressions
         PreOps(vi ) = fOpp1 ; Opp2 ; : : : ; Oppm g
         DerOps(vi ) = fOpd1 ; Opd2 ; : : : ; Opdn g                      in the operation nodes
                                                                          denoted by the labels of
   To test whether a view vi is a child (i.e., directly derived)          DerOps(v1 ).
of vj , we test PreOps(vi ) \ DerOps(vj ) 6= ;.                        If v1 2 E then:
   In order to simplify the presentation of the algorithm,                  E = (E n fv1 g) [ fv2 g
we assume that the list of the views VQ of the expression              DerOps(v2 ) = DerOps(v2 ) [ DerOps(v1 )
DAG G(Q) is an ordered set of views v1 ; v2 ; : : : ; vn such          Let Ov1 be the set of operation
that no view vi is derived from a view vi+j , where i and j                nodes denoted by the labels
are positive numbers. Such an ordering is easily imposed                   of DerOps(v1 ).
using a topological sort. The M ERGE A LGORITHM is given               OX = OX [ Ov1
in Figure 10. The explanations follow below.
   The intuition behind the merging algorithm is as fol-           Example 4.3 Consider the view graph G(X ) = hVX ; OX i
lows. We consider each view vi of the query graph G(Q)             and the DAG G(Q) = hVQ ; OQ i, where:
in connection with the views of G(X ), considering the
leaves/base tables of G(Q) first. The time complexity of                VX = fhacc1; fg; fGP 1gi; hA1; fGP 1g; fgig
M ERGE A LGORITHM is O(jVX j (jVX j + jVQ j)), assuming                OX = fhGP 1; facc1g; fA1gig
that no views in G(Q) match with those in G(X ), which is               VQ = fhacc2; fg; fGP 2gi; hA2; fGP 2g; fgig
a worst case.                                                           OQ = fhGP 2; facc2g; fA2gig
   If the views vi and vj are identical, we merge the two
views into a single equivalence node in G(X ), using the           We assume acc1 and acc2 are equivalent and/or represent
Merge function, which is defined as follows:                       the same view and perform Merge(acc2; acc1). This re-


M.O. Akinde, M.H. Böhlen                                                                                             8-7
sults in the following configurations:                                   Add(v; G(X))
                                                                         Let G(X ) = hVX ; OX i, v = hvL ; Op ; Od i
                 DerOps(acc2) = fGP 2g                                   Let Ov be the operation nodes
                 E = facc2; A2g                                             denoted by the set of labels
                                                                            Od .
We then replace acc2 with acc1, to get E = facc1; A2g.                   VX = VX [ v
Finally, we add DerOps(acc2) to DerOps(acc1) and add                     OX = OX [ Ov
the operation node GP 2 to OX to obtain the following con-
figuration of G(X ):                                                    Because of the ordering imposed on G(Q), we only
                                                                    need to add the derivative operation nodes Od of v , as the
 VX = fhacc1; fg; fGP 1; GP 2gi; hA1; fGP 1g; fgig                  algorithm ensures that the predecessor nodes will have been
 OX = fhGP 1; facc1g; fA1gi; hGP 2; facc1g; fA2gig                  added previously.
                                                                        Finally, we apply Rules 4.1 to 4.3 to check whether
                                                                    either of the views vi or vj can be derived from the other.
                                                                    The test to check whether vj is a child of vi is required
    A1            A2                   A1             A2
                                                                    to avoid duplicate operation nodes being introduced in this
                                                                    step, since G(X ) is not an ordered list like G(Q). Rules 4.1
                                                                    to 4.3 provide us with information about the deriving oper-
                                                                    ation node o0 , which can then be added to Op and Od of vi
    GP1          GP2                   GP1     GP2

                                                                    and vj as appropriate, to connect the two nodes. It is criti-
   acc1         acc2                  acc1
                                                                    cal for the correctness of the view selection algorithm, that
                                                                    it has full information about the connectivity of the view
   G(X)         G(Q)                  G(X)                          graph.
                          Merge                                     Example 4.4 Consider the tables

           Figure 11: Illustration of Example 4.3                   accounts(a_personid,a_name,a_balance,
                                                                             a_type,a_date)
    To add the operations (or set of operations) of an equiv-       transaction(t_personid,t_change,t_date)
alence node v1 to v2 , we first update the expressions of
the derivative operation nodes referencing v1 to reference          with the following three SQL queries:
v2 . We then add the derivative operations of v1 to v2 , and            Q1 SELECT   a_name,max(a_balance)
add the operation nodes to the graph. The merge function
ensures that any child vk of vi will also be detected as a
                                                                           FROM     accounts
child of vj . Note that this implies that equivalence nodes
                                                                           GROUP BY a_name
in G(X ) can actually have operations connecting them to
equivalence nodes in G(Q) during the processing of the
                                                                        Q2 SELECT     a_name,count(t_personid)
                                                                           FROM       accounts,transaction
merge algorithm (as shown in Example 4.3). Due to the or-
dering imposed on VQ however, we are always ensured that
                                                                           WHERE      t_personid = a_personid
                                                                                      AND t_change > 250
any predecessor equivalence nodes will be merged into the
view graph first. Finally, we update the query frequency fv
                                                                             GROUP BY a_name
of the merged view in G(X ).                                            Q3 SELECT     a_name,a_balance
    At the lowest level of the query graph, views corre-                   FROM       accounts,transaction
spond to base relations. In this case, recognizing identi-                 WHERE      t_personid = a_personid
cal views is a simple matter of name matching. For the                                AND t_change > 500
levels above, the recognition is done by checking whether                    GROUP BY a_name,a_balance
the derivation of a pair of views is equivalent. The prob-
lem of testing for equivalence of GPSJ queries is consid-              The expression DAGs are given in Figure 12. For sim-
ered in [GHQ95, SDJL96, NSS98, RSSS98]. The problem                 plicity of presentation, we give only very simple expres-
can be simplified by normalizing the expressions used to            sion DAGs, each representing two possible query evalua-
identify the equivalence nodes when constructing the ex-            tion paths.
pression DAGs.                                                         We initialize the GPSJ view graph G(X ) with the two
    If the view vi is a child of some vj , and vi or an identical   base table nodes accounts and transaction (A and
view does not already exist in G(X ), then we add vi to the         T in the figures). The initial merging of Q1 results in a
GPSJ view graph G(X ), using the Add function:                      “graph” similar to that of Q1 in Figure 12, except for the
                                                                    presence of the unconnected equivalence node T .


M.O. Akinde, M.H. Böhlen                                                                                                    8-8
              Q1

                          πa_name [max(a_balance)]
        πa_name [max(a_balance)]
                                                                                                                                  Q3
                                        A1

                    πa_personid, a_name[max(a_balance)]
                                                                                                         πa_name, a_balance []             πa_name, a_balance []

               A
                                                                                                               AT4                                AT5

                                                 Q2                                                     σ [t_change > 500]                    [t_personid=a_personid]


                        πa_name [count(t_personid)]       πa_name [count(t_personid)]

                                                                                                          A2                 AT1                            T2
                               AT2                               AT3

                                                                                        πa_personid, a_name, a_balance []                                   σ [t_change > 500]
                       σ [t_change > 250]
                                                                                                                                 [t_personid=a_personid]
                                                             [t_personid=a_personid]


                                                                                                                    A                           T

                         A1                   AT1                          T1

        πa_personid, a_name[max(a_balance)]                                σ [t_change > 250]
                                              [t_personid=a_personid]


                                   A                           T


                                   Figure 12: The Expression DAGs of the three queries Q1, Q2, and Q3

    Figure 13 shows the GPSJ view graph after Q2 has been                                      5 Reducing the GPSJ View Graph
merged into the graph, and Figure 14 illustrates the com-
plete view graph after Q3 has been merged into the graph.                                      The actual running time of view selection algorithms de-
In Figure 14 additional operations have been added to the                                      pends on the size and structure of the generated view graph.
graph using the rules of Section 4.1 to derive A1 and Q1                                       The full expression DAG of an aggregate query is typically
from A2, AT 4 from AT 2, and T 2 from T 1.                                                     huge (cf. the very simple query in Example 4.1); this re-
                                                                                               sults in a blow-up of the size of the view graph structure
    In the view graph of Figure 14, each of the equivalence
                                                                                               and longer running time or larger space requirements of the
nodes represents a view which could be materialized in the
data warehouse to answer one of the queries Q1, Q2, or
                                                                                               view selection algorithm. A solution to the performance
Q3, while the paths in the graphs represent ways of com-                                       problem of view selection algorithms is to prune the search
                                                                                               space of the algorithm [TS97, YKL97]. This corresponds
puting the queries from these views. It is thus possible to
                                                                                               to reducing the size of GPSJ view graphs.
apply a view selection algorithm to select and identify use-
ful sets of views to materialize. For example, if we wished                                        Obviously, a view selection algorithm cannot consider
to select a set of views M , such that M [fQ1; Q2; Q3g are                                     or select a view, if that view is not represented in the GPSJ
self-maintainable (i.e., can be maintained on changes to the                                   view graph. Therefore, care must be taken to ensure that we
base tables without referencing these), we could material-                                     do not remove equivalence nodes which might be consid-
ize A2 and T 1.                                                                                ered by the view selection algorithm, if we wish to main-
                                                                                               tain a performance guarantee for the view selection, such
                                                                                               as those presented in [Gup97, GM99].
                                                                                                   Recall our assumption that the complete expression
                                                                                               DAG is generated for each query, thus ensuring that the


M.O. Akinde, M.H. Böhlen                                                                                                                                                        8-9
                                        Q1                                                    Q2


                                                                      πa_name [count(t_personid)]     πa_name [count(t_personid)]


                                                                             AT2                              AT3


                                      πa_name [max(a_balance)]       σ [t_change > 250]                   [t_personid=a_personid]


                                                                       A1                   AT1                         T1

                        πa_name [max(a_balance)]      πa_personid, a_name[max(a_balance)]                               σ [t_change > 250]
                                                                                        [t_personid=a_personid]


                                                                                 A                          T


                                         Figure 13: The view graph after merging Q1 and Q2
GPSJ view graph is also complete. However, given knowl-                              with respect to the set of queries, it follows that V1 is not
edge about the complete set of queries to be considered, it                          a “useful” view for the view selection algorithm. Recall
is possible to identify views which will never be considered                         the O(2n,m ) space complexity for the number of equiva-
by the view selection algorithm. Removing such views al-                             lence nodes in the simple aggregate query expression DAG.
lows us to reduce the size and complexity of the GPSJ view                           Rule 5.1 reduces the number of group-by attributes consid-
graph structure.                                                                     ered in the expression DAG (the factor n), thus minimizing
    We present two rules, which reduce the size of the GPSJ                          the exponent.
view graph without affecting subsequent view selection al-                              Before giving the second rule, we first define superflu-
gorithms, i.e., these algorithms will still select the same set                      ous aggregates.
of views to materialize. This reduction, or pruning, of the
graph structure can occur at several different stages of the                         Definition 5.1 (Superfluous Aggregates) An
configuration tool architecture (cf. Figure 1), either as a                          aggregate function fi (ai ) is superfluous, if it can be com-
pre- or post-optimization of the Integrator component, or                            puted from other components of the same GP. Let A [F ] be
as part of the expression DAG generation.                                            a GP over a table. A non-DSA fi (ai ) 2 F is superfluous
                                                                                     if ai 2 A. A DSA fi (ai ) 2 F is superfluous if ai 2 A and
Rule 5.1 (Pruning Group-by attributes) Let Ai be the                                 count(ai ) 2 F .
set of group-by attributes, attributes used in join and se-
lection conditions, and attributes used in aggregate func-                           Rule 5.2 (Eliminate Superfluous Aggregates)
tions in Qi . Then A = fA1 ; A2 ; : : : ; An g includes all the                      Let Agb be the set of group-by attributes and F be a set
attributes used in the queries Q1 ; Q2 ; : : : ; Qn . Then the                       of aggregate function over the attributes Aagg , and Av =
only views which are of interest for the view selection algo-                        Agb \ Aagg . If V 0 = Agb [F ](V ) is a view in the expres-
rithm constructed using a GP are those with group-by Agb ,                           sion DAG and Av 6= ;, then we replace V 0 with a view
where Agb  A.                                                                       V 00 = Agb [F 0 ](V ). F 0 is identical to F , minus the su-
                                                                                     perfluous aggregates. If a DSA fi can be made superfluous
    Intuitively, we are only interested in those views                               by introducing a count function to F , we add the count
which contain attributes used in the set of queries                                  and remove fi .
Q1 ; Q2 ; : : : ; Qn . If we have a view V1 = A1 ;A2 [F ](V )
in our expression DAG, where A1  A and A2 6 A, then                                   The correctness of Rule 5.2 follows from the principles
we can create a view V2 = A1 [F ](V ) which can be used                             of duplicate compensation (cf. Section 2). Since super-
to answer the same set of queries as V1 and whose benefit                            fluous aggregates do not add additional value for the view
(i.e., measure of the usefulness of the view) will always be                         selection algorithm (i.e., if a query can be computed from a
greater than or equal to that of V2 . Since V2 can replace V1                        view containing superfluous aggregates, then it can equally


M.O. Akinde, M.H. Böhlen                                                                                                                    8-10
                                                                                                                                                Q3

                                                                 Q2                                                                                             πa_name, a_balance []
                                                                                                                        πa_name, a_balance []

                 Q1
                                                                                                                                                AT4                   AT5
                                                                              πa_name [count(t_personid)]
                                                πa_name [count(t_personid)]                                        σ [t_change > 500]


                                                       AT2                                AT3


               πa_name [max(a_balance)]         σ [t_change > 250]                    [t_personid=a_personid]           σ [t_change > 500]

                                      πa_name [max(a_balance)]
                                                                                                                                                [t_personid=a_personid]

                                                                                                             AT1
                                  A1
 πa_name [max(a_balance)]            πa_personid, a_name[max(a_balance)]                                                                              T2
                                                                                                        [t_personid=a_personid]

                                                                 A2
                                                                                                                                        σ [t_change > 500]
               πa_personid, a_name[max(a_balance)]
                                                 πa_personid, a_name, a_balance []
                                                                                                     T1

                                                                                                            σ [t_change > 250]
                                                                 A                                                                       T


                                               Figure 14: The view graph after merging Q1, Q2, and Q3


M.O. Akinde, M.H. Böhlen                                                                                                                                                       8-11
well be computed from one containing no superfluous ag-         [Gup97]   H. Gupta. Selection of Views to Materialize in
gregates), we eliminate such views to reduce the size of the              a Data Warehouse. In Proceedings of the Sixth
view graph.                                                               ICDT, pages 98–112, 1997.
                                                                [Kim96] R. Kimball. The Data Warehouse Toolkit. John
6 Conclusion                                                            Wiley & Sons, Inc., 1996.
The selection of which views to materialize is one of the       [NSS98] Werner Nutt, Yehoshua Sagiv, and Sara
most important decisions in data warehouse design. The                  Shurin. Deciding equivalences among aggre-
view selection (or data warehouse configuration) problem                gate queries. In Proceedings of the Seventeenth
is to select a set of views to materialize in the data ware-            ACM SIGACT-SIGMOD-SIGART Symposium
house, so as to optimize the total query response time un-              on Principles of Database Systems, pages 214–
der some space or maintenance cost constraints. We dis-                 223, Seattle, Washington, USA, June 1998.
cuss the view graph structures required by view selection               ACM Press.
algorithms such as those proposed in [Gup97, GM99]. We
are not aware of any papers considering the construction        [Rou82]   N. Roussopolous. The Logical Access Path
of such graph structures in the presence of aggregation and               Schema of a Database. IEEE Transactions
grouping.                                                                 in Software Engineering, SE-8(6):563–573,
    We describe an algorithm, GPSJV IEW G RAPH -                          November 1982.
BUILDER for the construction of GPSJ view graphs from
the expression DAGs of GPSJ queries based on a set of           [RSS96] Kenneth A. Ross, Divesh Srivastava, and S. Su-
merge rules. We define a set of rules for merging complete              darshan. Materialized View Maintenance and
expression DAGs, and give a merge algorithm for carrying                Integrity Constraint Checking: Trading Space
out the incremental merging of an expression DAG with                   for Time. In Proceedings of the 1996 ACM
the GPSJ view graph. We also give a number of rules for                 SIGMOD International Conference on Manage-
reducing the size of the view graph constructed, while still            ment of Data, pages 447–458, Montreal, Que-
allowing us to keep the performance guarantee of the view               bec, Canada, June 1996.
selection algorithms used.                                      [RSSS98] K. A. Ross, D. Srivastava, P. J. Stuckey, and
    In our future work, we intend to investigate the follow-             S. Sudarshan. Foundations of Aggregation
ing issues:                                                              Constraints. Theoretical Computer Science,
    There is a lot of scope for reducing the size of the GPSJ            193:149–179, February 1998.
view graph generated for the view selection algorithm. It
would be interesting to attempt to further examine the ef-      [SDJL96] Divesh Srivastava, Shaul Dar, H. V. Jagadish,
fects of “non-optimal” pruning strategies on the view se-                and Alon Y. Levy. Answering Queries with Ag-
lection algorithms considered. Is it possible to tailor view             gregation Using Views. In Proceedings of the
selection algorithms to particular pruning strategies?                   22nd Annual International Conference on Very
    The GPSJ view graph framework can easily be extended                 Large Data Bases, Bombay, India, September
with operations such as outer join, union, and other rela-               1996.
tional operators by extending the merge rules of the algo-
                                                                [TS97]    Dimitri Theodoratos and Timos Sellis. Data
rithm. This would allow us to consider a broader class of
                                                                          Warehouse Configuration. In Proccedings of
views.
                                                                          the Twenty-third International Conference on
                                                                          Very Large Data Bases, pages 126–135, Athens,
References                                                                Greece, August 1997.
[GHQ95] A. Gupta, V. Harinarayan, and D. Quass.                 [YKL97] J. Yang, K. Karlapalem, and Q. Li. Algorithms
        Aggregate-Query Processing in Data Warehous-                    for Materialized View Design in Data Ware-
        ing Environments. In Umeshwar Dayal, Pe-                        housing Environment. In Umeshwar Dayal, Pe-
        ter M. D. Gray, and Shojiro Nishio, editors,                    ter M. D. Gray, and Shojiro Nishio, editors, Pro-
        Proceedings of the Twenty-first International                   ceedings of the Twenty-third International Con-
        Conference on Very large Databases. Zurich,                     ference on Very Large Databases, August 1997.
        Switzerland, September 1995.

[GM99]     Himanshu Gupta and Inderpal Singh Mumick.
           Selection of Views to Materialize Under a
           Maintenance Cost Constraint. To appear in the
           Proceedings of the ICDT’99, 1999.


M.O. Akinde, M.H. Böhlen                                                                                           8-12