=Paper= {{Paper |id=Vol-28/paper-11 |storemode=property |title=Storing auxiliary data for efficient maintenance and lineage tracing of complex views |pdfUrl=https://ceur-ws.org/Vol-28/paper11.pdf |volume=Vol-28 |dblpUrl=https://dblp.org/rec/conf/dmdw/CuiW00 }} ==Storing auxiliary data for efficient maintenance and lineage tracing of complex views== https://ceur-ws.org/Vol-28/paper11.pdf
                  Storing Auxiliary Data for Efficient Maintenance
                       and Lineage Tracing of Complex Views∗

                                          Yingwei Cui and Jennifer Widom
                                   Computer Science Department, Stanford University
                                          {cyw, widom}@db.stanford.edu


                                                                            gorithm that selects the optimal set of auxil-
                                                                            iary views is far too expensive in many cases;
                            Abstract                                        (2) two heuristic algorithms that we present
                                                                            select good (often optimal) sets of auxiliary
     As views in a data warehouse become more                               views in a much shorter time; (3) even aux-
     complex, the view maintenance process can                              iliary views selected by a very simple al-
     become very complicated and potentially                                gorithm can significantly reduce the overall
     very inefficient. Storing auxiliary views in                           view maintenance and lineage tracing cost.
     the warehouse can reduce the complexity and
     improve the efficiency of view maintenance,                        1 Introduction
     and the same auxiliary views can help in ef-
                                                                        Data warehousing systems collect data from multiple,
     ficiently answering lineage tracing queries
                                                                        distributed sources and integrate the information as
     over the warehouse views. In this paper, we
                                                                        materialized views in local databases [CD97, IK93,
     study the problem of selecting auxiliary views
                                                                        LW95, Wid95]. Users can then perform data analy-
     to materialize in order to minimize the total
                                                                        sis and mining on the warehouse views. The mate-
     view maintenance and lineage tracing cost.
                                                                        rialized views in the warehouse need to be kept up-
     We consider relational views with arbitrary
                                                                        to-date when data at the sources changes. As the
     use of aggregation operators, and we define
                                                                        view definitions become more complex in order to
     an initial search space for our optimization
                                                                        support sophisticated data analyses, the view mainte-
     problem based on a normal form for such
                                                                        nance process can become very complicated and po-
     view definitions. We present several auxiliary
                                                                        tentially very inefficient. Most previous work on view
     view selection algorithms, and to study their
                                                                        maintenance, e.g., [CW91, GMS93, LW95, LYC+99,
     performance we conduct experiments using
                                                                        Qua96], considers simple views containing at most
     the TPC-D benchmark in addition to synthetic
                                                                        one level of aggregation. In order to efficiently main-
     view definitions and statistics. The results of
                                                                        tain complex views which may contain multiple levels
     our experiments show: (1) the exhaustive al-
                                                                        of aggregation, it is clearly advantageous to store aux-
    ∗
    This work was supported by the National Science Foundation          iliary data in addition to the original view to reduce
under grant IIS-9811947, by Sagent Technology Inc., and by an           overall view maintenance cost.
equipment grant from IBM Corporation.
                                                                            From a different perspective, for in-depth analysis
The copyright of this paper belongs to the paper’s authors. Per-
                                                                        of warehouse data sometimes it is useful to be able
mission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct         to “drill through” from selected interesting (or possi-
commercial advantage.                                                   bly erroneous) view data to the original source data
Proceedings of the International Workshop on Design                     that derived the view data. We call this process trac-
and Management of Data Warehouses (DMDW’2000)                           ing the lineage of the view data [CWW97]. To trace
Stockholm, Sweden, June 5-6, 2000                                       the lineage of a view data item efficiently, the ware-
(M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)                       house also needs to store auxiliary data—to reduce the
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/   computation cost at the warehouse, and to reduce or




Y. Cui, J. Widom                                                                                                           11-1
entirely avoid expensive source accesses for lineage          • Two heuristic algorithms that we present select
tracing. It turns out that the same auxiliary data that         good (often optimal) sets of auxiliary views in a
can be used to improve the performance of view main-            much shorter time.
tenance as discussed in the first paragraph also can
improve the performance of lineage tracing queries.           • Even auxiliary views selected by a very sim-
Therefore, the problems of selecting auxiliary data for         ple algorithm can significantly reduce the overall
the two purposes are closely related, and we study the          view maintenance and lineage tracing cost.
problems together.
   The auxiliary data is stored as materialized views in    1.1   Outline of Paper
the warehouse, called auxiliary views (as opposed to
the original warehouse views, which we call primary         The remainder of the paper proceeds as follows. Sec-
views). Given a complex relational primary view,            tion 2 covers related work. Section 3 presents pre-
there are numerous possible sets of auxiliary views to      liminary material on materialized views, view main-
materialize for view maintenance and lineage tracing,       tenance, and lineage tracing, including a running ex-
with significant performance tradeoffs. In general, the     ample. Section 4 introduces the auxiliary views we
more auxiliary views we materialize, the more effi-         consider for efficient view maintenance and lineage
ciently we can maintain and trace the lineage of data in    tracing, and defines the search space for selecting aux-
the primary view. However, the auxiliary views them-        iliary views to materialize. Section 5 describes the
selves also need to be maintained, so materializing too     cost model and statistics we use for estimating view
many auxiliary views can increase overall cost.             maintenance and lineage tracing costs, and for study-
   Previous work has studied the selection of views         ing the performance of our auxiliary view selection al-
to materialize for answering queries, e.g., [HRU96,         gorithms. Section 6 presents several algorithms for se-
Gup97], and the selection of auxiliary views for ef-        lecting auxiliary views within our search space. Sec-
ficient maintenance of given primary views, e.g.,           tion 7 compares the performance of the algorithms us-
[LQA97, RSS96]. (Further discussion of this work            ing experiments on the TPC-D benchmark, as well as
appears in Section 2.) In [CW00], we introduced             using a variety of synthetic view definitions and statis-
the idea of materializing auxiliary views to minimize       tics.
overall view maintenance and lineage tracing cost,
and we studied the problem in the context of select-        2 Related Work
project-join (SPJ) primary views. This paper inves-
tigates the more difficult and general problem of re-       Previous work related to this paper falls into three
lational views with arbitrary use of aggregation and        categories: selecting views to materialize in order to
SPJ operators. As we will see, it is an expensive com-      minimize query costs, e.g., [HRU96, Gup97], select-
                                                            ing auxiliary views to materialize in order to min-
binatorial problem, and our overall approach differs
from [CW00].                                                imize the cost of maintaining given primary views,
                                                            e.g., [LQA97, RSS96], and our own previous work
   In this paper, we first define a normal form for
                                                            in lineage tracing and view maintenance [CWW97,
the primary view definition, which suggests an ini-
tial search space of possible auxiliary views. We then      CW00].
propose a variety of algorithms for selecting auxiliary        [HRU96] proposes a greedy algorithm for selecting
views within this search space. Finally, we compare         auxiliary views to materialize, with the goal of mini-
empirically the running time of our algorithms and the      mizing the cost of queries over aggregate views given
optimality of the auxiliary view sets they select, using    certain constraints such as the maximum number of
the TPC-D benchmark [TPC96] in addition to a suite          views that can be materialized. The work considers
of synthetic view definitions and statistics. The results   data-cube views only, and can make certain simpli-
of our experiments show:                                    fying assumptions based on this restriction. [Gup97]
                                                            extends the work in [HRU96] to general relational
  • The exhaustive algorithm that selects the opti-         views, and proves that the auxiliary view selection
    mal set of auxiliary views is far too expensive         problem under maintenance cost constraints is NP-
    in many cases.                                          hard.




Y. Cui, J. Widom                                                                                                11-2
    [RSS96] proposes an exhaustive algorithm for se-             rithm, and we compare the performance of our
lecting auxiliary views to optimize view maintenance,            algorithms (both running time and quality of so-
and suggests simple search space pruning strategies              lution) through experiments.
when the view is too complex for exhaustive search.
[LQA97] presents an A* algorithm for selecting aux-        3 Preliminaries
iliary views and indexes on different join combina-        We now introduce the relational materialized views
tions for SPJ view maintenance. Both [RSS96] and           we consider, as well as the processes of view main-
[LQA97] consider a single algorithm for selecting          tenance and lineage tracing, using a running example.
auxiliary views (and indexes in the case of [LQA97]),      Along the way, we illustrate why materializing auxil-
designed specifically for optimizing view mainte-          iary views is important for view maintenance and lin-
nance. They consider as potential auxiliary views all      eage tracing, and why it is useful to consider the two
nodes in all possible relevant query plans, making the     problems together.
search space doubly exponential in the view definition
size.                                                      3.1   Materialized Views
    We introduced lineage tracing for relational data
warehouses in [CWW97], presenting a formal frame-          To answer a variety of user queries efficiently, a data
work and basic algorithms. In [CW00], we introduced        warehouse typically computes and stores a number of
the problem of selecting auxiliary views to simulta-       materialized views [LW95]. In this paper, we con-
neously reduce view maintenance and lineage trac-          sider relational views with arbitrary use of aggrega-
ing costs, and we considered the restricted case of        tion, selection, projection, and join operators, which
SPJ views. We suggested several alternative auxil-         we call ASPJ views. We use an algebraic representa-
iary view schemes and compared their performance.          tion for the operators: α for grouping and aggregation,
In this paper, we tackle the problem for complex rela-     σ for selection, π for (duplicate-eliminating) projec-
tional views with arbitrary use of aggregation and SPJ     tion, and ./ for join. A view definition is presented
operators. Arbitrarily complex primary views make          using a rooted operator DAG with source tables at the
the auxiliary view selection problem more compli-          leaves.
cated and expensive than for SPJ views, and we take a         Any ASPJ view definition v can be transformed
different approach to solving it than for the restricted   into an equivalent form v 0 composed of α-π-σ-./
case considered in [CW00]. We introduce a normal           operator sequences, by commuting and combining
form for our view definitions that suggests an initial     some select-project-join operators in the view defini-
(still exponential) search space for useful auxiliary      tion [CWW97]. We call the resulting form v’s ASPJ
view sets, and then we consider heuristic algorithms       normal form, and we call each α-π-σ-./ sequence a
that explore various view sets in this search space.       segment. An example will be given shortly. In ASPJ
    Our work differs from the previous work discussed      normal form, a segment may omit the π, σ, or ./ op-
above in several ways:                                     erator, but each segment except the topmost must in-
                                                           clude a non-trivial aggregation operator (or it would
  • Unlike all previous work besides our own, we           be merged with an adjacent segment). Since our view
    consider lineage tracing as well as view main-         definitions are DAGs, they may contain multiple ref-
    tenance costs when selecting auxiliary views to        erences to a source table or to a segment at any level.
    materialize.                                              We say that a view is an n-level ASPJ view if
  • Instead of considering a doubly exponential            traversing from the root to any leaf in its normalized
    search space of auxiliary views (as in [HRU96,         definition crosses at most n segments. The fan-out of
    Gup97, LQA97, RSS96]), or a very simple                a segment is the number of operands of the segment’s
    fixed set (as in [CW00]), we explore a “mid-           join operator, or 1 if there is no join.
    dle ground” based on our view definition normal
                                                           Example 3.1 (Materialized View and Normal Form)
    form.
                                                           Consider a data warehouse for a department store
  • We propose several different auxiliary view se-        chain based on the following four tables, some or all
    lection algorithms, as opposed to a single algo-       of which may reside at remote source databases.




Y. Cui, J. Widom                                                                                             11-3
  • Store(store-id, city, expenses)                              sequence of queries and updates called the mainte-
    gives the city and monthly operating expenses of             nance procedure. For 1-level ASPJ views we use the
    each store. We assume that each city contains at             maintenance procedures from [GMS93, Qua96]. The
    most one store, and that the operating expenses              following example shows a 1-level ASPJ view and its
    do not include employee salaries.                            maintenance procedure.

  • Product(product-id, price, cost)                             Example 3.2 (View Maintenance Procedure)
    gives the retail price and wholesale cost of each            Consider the source tables from Example 3.1 and
    product item.                                                a 1-level ASPJ view Profit corresponding to the
                                                                 middle α-./ segment in Figure 2:
  • Sales(store-id, product-id, num)
    gives the expected monthly number of sales for                   CREATE VIEW Profit AS
                                                                     SELECT store-id, SUM(num*(price−cost)) AS profit
    each product at each store.
                                                                     FROM Sales, Product
                                                                     WHERE Sales.product-id = Product.product-id
  • Employee(emp-id, store-id,                                       GROUP BY store-id
    salary) gives the monthly salary of each
    employee at each store.                                      Suppose the Sales table changes over time, and
                                                                 a set of insertions and deletions to the table are
Consider a materialized view HighProfit that                     stored in delta tables ∆Sales and ∇Sales, respec-
keeps track of those cities whose stores are very                tively. The resulting changes to the view Profit
profitable, i.e., whose monthly income exceeds ex-               (∆Profit and ∇Profit) can be computed by the
penses by at least $100,000. An SQL definition for               following maintenance procedure, which uses the
HighProfit is shown in Figure 1, and its normal-                 summary-delta approach from [Qua96]:
ized view definition tree is shown in Figure 2. We                   SELECT store-id, SUM(profit) AS profit INTO SummaryDelta
use αG,aggr(A) to represent grouping and aggregation,                FROM (SELECT store-id, (num*(price−cost)) as profit
where G is a list of grouping attributes, and aggr(A)                      FROM ∆Sales, Product
abbreviates a list of aggregate functions over attributes                  WHERE ∆Sales.product-id = Product.product-id)
                                                                         UNION
in set A [CWW97].1 HighProfit is a 2-level ASPJ
                                                                         (SELECT store-id, -1*(num*(price−cost)) as profit
view containing three segments: the topmost π-σ-./                         FROM ∇Sales, Product
segment with fan-out 3, the leftmost α segment with                        WHERE ∇Sales.product-id = Product.product-id)
fan-out 1, and the middle α-./ segment with fan-out                  GROUP BY store-id
2.                                                     2             SELECT * INTO ∇Profit
                                                                     FROM Profit
                                                                     WHERE store-id IN
3.2    View Maintenance Procedures
                                                                         (SELECT store-id FROM SummaryDelta)
Materialized views must be maintained to keep their                  SELECT store-id, SUM(profit) INTO ∆Profit
contents up-to-date as the source tables they are de-                FROM ∇Profit UNION SummaryDelta
fined over change. We assume a standard incremental                  GROUP BY store-id
view maintenance approach, as in [GMS93, Qua96].                 2
Insertions and deletions to each source table are mon-
itored and recorded in delta tables (∆ and ∇ respec-                For an n-level ASPJ view where n > 1, to com-
tively) in the warehouse. Updates are modeled as                 pute the changes to the entire view we can com-
deletions followed by insertions. During view mainte-            pute the changes for one segment at a time using
nance, changes to the view (also expressed as deltas)            the maintenance procedure for 1-level views, prop-
are computed based on the source delta tables, the               agating the deltas upward through the view defini-
view contents, and the source data, using a predefined           tion DAG. Just as we needed source table Prod-
   1
                                                                 uct along with ∆Sales and ∇Sales to compute
     This operator is similar to the generalized projection of
[GHQ95], but we distinguish between projection and aggregation
                                                                 the deltas for Profit in Example 3.2, to compute
operators because of the way our segments and auxiliary views    the deltas for a higher-level segment we may need
are defined.                                                     deltas and/or full contents for each lower segment.




Y. Cui, J. Widom                                                                                                        11-4
 CREATE VIEW HighProfit AS                                                              HighProfit
 SELECT city
 FROM Store,                                                                                  π city
   (SELECT store-id, SUM(num*(price−cost)) AS profit                                          σ profit - expenses - salaries
    FROM Sales, Product                                                                           > 100000
    WHERE Sales.product-id = Product.product-id
    GROUP BY store-id) AS P,                                          α store-id,             α store-id,
   (SELECT store-id, SUM(salary) AS salaries                            sum(salary)             sum(num * (price - cost))
    FROM Employee                                                       as salaries             as profit
    GROUP BY store-id) AS E
 WHERE Store.store-id = E.store-id
   AND E.store-id = P.store-id
   AND P.profit−E.salaries−Store.expenses > 100000                  Employee          Sales       Product                   Store


      Figure 1: SQL definition for HighProfit                      Figure 2: Normal form for HighProfit

For example, suppose we want to compute the deltas          given tuple t ∈ V , t’s lineage in T1 , . . . , Tm accord-
for HighProfit given ∆Profit and ∇Profit,                   ing to v can be computed with the following query:
where again Profit corresponds to the middle α-./
segment as in Example 3.2. We need to join the delta        T Qt,v = SplitT1 ,...,Tm (σC∧G=t.G (T1 ./ · · · ./ Tm ))
tables for Profit with source table Store, as well
as with the leftmost α segment in Figure 2, then per-       where Ti , denotes the schema of table Ti , i =
form the selection and projection in the topmost seg-       1..m, and Split is an operator that breaks a ta-
ment. If we materialize an auxiliary view Salary cor-       ble into multiple projections: SplitA1 ,...,Am (T ) =
responding to the leftmost α segment, we can signif-        hπA1 (T ), . . . , πAm (T )i.2 Given a tuple set T ⊆ V ,
icantly improve the performance of the maintenance          we can simultaneously trace all the tuples in T with:
procedure by avoiding recomputation of the aggre-
                                                             T QT,v = SplitT1 ,...,Tm (σC (T1 ./ · · · ./ Tm ) n T )
gate values. In addition to materializing “intermedi-
ate” views, if source tables are remote and expensive       Example 3.3 (Lineage Tracing Query) Consider
(or impossible) to access, we may want to replicate         the view Profit in Example 3.2. The lineage
some or all of the source tables as auxiliary views at      tracing query for a tuple t ∈ Profit is
the warehouse.
    Note that our approach in this paper applies to all             T Qt,Profit = SplitT1 ,T2 (σC (T1 ./ T2 ))
views, including those with non-incrementally main-
tainable aggregates (e.g., min, max). In the presence       where T1 = Sales, T2 = Product, and C =
of such aggregates, the maintenance procedure must          “store-id = t.store-id”. The SQL presentation of
involve some recomputation, but auxiliary views may         the query is as follows:
still be of benefit.
                                                               SELECT Sales.* INTO LN Sales
                                                               FROM Sales, Product
3.3   Lineage Tracing                                          WHERE Sales.product-id = Product.product-id AND
                                                               Sales.store-id = t.store-id
Given a materialized view in a data warehouse, in ad-
dition to issuing regular queries or performing other          SELECT Product.* INTO LN Product
kinds of analysis over the view, we may want to trace          FROM Sales, Product
the lineage of selected “interesting” tuples in the view.      WHERE Sales.product-id = Product.product-id AND
                                                               Sales.store-id = t.store-id
The lineage of a view tuple is defined as the set of
original source tuples that derived the given view tu-      where LN Sales and LN Product contain the
ple. To trace the lineage of a view tuple, we use a         lineage of t according to view Profit in the source
predefined sequence of queries called tracing queries       tables Sales and Product, respectively.          2
(TQs) [CWW97].                                                 2
                                                                 When we execute the tracing query, the selection condition
   Given a 1-level ASPJ view V whose definition is          σC∧G=t.G is pushed down to individual Ti ’s whenever possible
v = αG,aggr(B) (πA (σC (T1 ./ · · · ./ Tm ))), and          to improve tracing query performance.




Y. Cui, J. Widom                                                                                                                    11-5
   To trace the lineage of an n-level ASPJ view where     lineage tracing queries take advantage of the auxiliary
n > 1, we logically define an intermediate view for       views (Section 4.2). Finally, we formally define the
each segment, and then recursively trace through the      auxiliary view selection problem and estimate the size
hierarchy of intermediate views top-down. At each         of our search space (Section 4.3).
level, we use the tracing query for a 1-level ASPJ view
to compute the lineage for the current traced tuples      4.1   The Auxiliary Views We Consider
with respect to the intermediate views or source ta-      Let us first define two types of potentially useful aux-
bles at the next level below. The necessary interme-      iliary views, based on a single segment. (Similar aux-
diate results can either be computed at tracing time,     iliary views were introduced in the context of SPJ pri-
or we can materialize certain intermediate results as     mary views in [CW00].) Any segment can be thought
auxiliary views for the purpose of lineage tracing.       of as a view definition v = αG,aggr(B) (πA (σC (T1 ./
   For example, to trace the lineage of a tuple t in      · · · ./ Tm ))), where each T1 , . . . , Tm is either a
view HighProfit, we logically define intermedi-           source table or a lower-level segment (view). Let V
ate views Salary and Profit corresponding to              denote the materialization of v over T1 , . . . , Tm .
the leftmost α segment and middle α-./ segment, re-
spectively, in Figure 2. We trace the lineage of tu-       1. Lineage View (LV) for v: We can store the inter-
ple t in Salary, Profit, and Store, producing                 mediate result LV (v) = σC (T1 ./ · · · ./ Tm )
hLN Salary, LN Profit, LN Storei, using the fol-              to help trace the lineage of tuples in V . We can
lowing tracing query:                                         rewrite the lineage tracing queries in Section 3.3
                                                              using LV (v) as:
     T Q = SplitT1 ,T2 ,T3 (σC (T1 ./ T2 ./ T3 ))
                                                                  T Qt,v = SplitT1 ,...,Tm (σG=t.G (LV (v)))
where T1 = Salary, T2 = Profit, T3 = Store, and                  T QT,v = SplitT1 ,...,Tm (LV (v) n T )
C = “profit − expenses − salaries > 100000
∧city = t.city”. Then, we further trace the lineage             The maintenance procedure for V also can be
of the tuples in LN Salary and LN Profit in the                 simplified. If LV (v) is materialized, then we
source tables to produce LN Employee, LN Sales,                 compute ∆LV (v) and ∇LV (v), and the query
and LN Product. As with view maintenance, ma-                   for computing the summary-delta table in the
terializing rather than recomputing intermediate re-            maintenance procedure (Section 3.2) can be
sults can significantly improve tracing performance.            rewritten as:
Since lineage tracing queries always return data from
                                                                SummaryDelta =αG,aggr(B) (αG,aggr(B) (∆LV )
source tables by definition, replicating (portions of)
the source data at the warehouse may be advanta-                                     ∪ αG,−aggr(B) (∇LV ))
geous, for the same reasons outlined in Section 3.2.

4 Auxiliary Views for View Maintenance and                 2. Split Lineage Tables (SLTs) for v: For a view
  Lineage Tracing                                             (segment) v 0 whose joins is many-to-many,
                                                              LV (v 0 ) can be very large and inefficient to main-
As motivated in Section 3, it may be advantageous
                                                              tain. Thus, another possibility is to “split” the
to materialize certain auxiliary views in a data ware-
                                                              Lineage View and store a set of smaller tables:
house to improve the performance of view mainte-
                                                              SLTi (v) = πTi (σC (T1 ./ · · · ./ Tm )), i =
nance and lineage tracing. View maintenance pro-
                                                              1..m. The lineage tracing queries can then be
cedures and lineage tracing queries use the auxil-
                                                              rewritten using the SLTs as:
iary views to avoid recomputations and expensive
source queries, thereby reducing maintenance and                T Qt,v =SplitT1 ,...,Tm (σG=t.G (σC ((SLT1 (v)
query costs. There are many possible sets of auxiliary
                                                                         n T ) ./ · · · ./ (SLTm (v) n T )) n T ))
views to materialize. In this section, we first specify
a number of potentially useful auxiliary views for ar-          T QT,v =SplitT1 ,...,Tm (σC ((SLT1 (v) n T )
bitrary n-level ASPJ primary views (Section 4.1). We                     ./ · · · ./ (SLTm (v) n T )) n T )
then discuss how view maintenance procedures and




Y. Cui, J. Widom                                                                                               11-6
     Although these tracing queries look much more                                           HighProfit

     complex than with LV, performance can some-                                                 π
     times be much better due to the smaller size of                                             σ        LV2 or SLT 2

     the SLTs. Furthermore, as with LV, the mainte-          AG 1
     nance procedure for V can use the deltas for the                  α                         α        AG2
                                                                               LV1 or SLT1
     SLTs to be much more efficient. See [CW00] for
     details.
                                                            BT 1                 BT 2                          BT 3              BT 4
   Given a general ASPJ view definition in normal
form, in addition to considering Lineage Views and                  Employee             Sales       Product             Store
Split Lineage Tables for each segment, we also may
consider storing copies of some or all of the source ta-   Figure 3: Possible auxiliary views for HighProfit
bles, to avoid expensive (remote) source queries dur-
ing view maintenance and lineage tracing. We refer                  the source tables. For each source table R, we
to these source table copies in the warehouse as Base               decide whether to store a Base Table (BT) copy
Tables (BTs). Finally, maintaining the results of in-               of R. If BT is not materialized, we may need to
termediate aggregations in the view (AGs) also can be               issue queries directly to source table R for view
very helpful in view maintenance and lineage tracing,               maintenance and lineage tracing.3
as motivated in Section 3.
   To summarize, we divide the normalized view def-        Example 4.1 (Auxiliary Views) Recall our example
inition into three types of components, and for each       view HighProfit from Figure 2. Figure 3 shows
type of component we have certain choices of possi-        the three ASPJ segments in the view definition and all
ble auxiliary views to materialize:                        of the possible auxiliary views we consider material-
                                                           izing for HighProfit.                               2
 1. Topmost Segment: the segment at the root of
    the view definition DAG. Note that the α, π, σ,            Notice that because of our search space reduction,
    and/or ./ operators (but not all of them) may be       it is possible that there are useful auxiliary views we
    omitted in this segment. Also note that the top-       are not considering, notably different join combina-
    most node corresponds to the primary view itself,      tions in the case of a many-way join. This special
    so its contents are always materialized. We may        case is considered in detail in [LQA97], and we could
    further choose to materialize the Lineage View         extend our search space accordingly.
    (LV) or the Split Lineage Tables (SLTs) for this
    segment, but not both. (If we store one, then stor-    4.2       Using Auxiliary Views for View Maintenance
    ing the other will not further reduce the lineage                and Lineage Tracing
    tracing or overall maintenance cost.)
                                                           In Section 4.1 we gave examples of how to rewrite
 2. Intermediate Segment: a non-root segment that          queries for view maintenance and lineage tracing us-
    is defined over the source tables and/or other seg-    ing auxiliary views. In general, when we have a set
    ments. Note that the π, σ, and/or ./ operators         of auxiliary views available, there may be more than
    may be omitted in this segment, but the α opera-       one way to rewrite a query to take advantage of aux-
    tor is always present. For an intermediate seg-        iliary views. We assume that the “best” rewriting is
    ment, we consider materializing the following          selected, and this assumption is reflected in the cost
    auxiliary views:                                       model we present in Section 5.
     (a) The contents of the α node (AG)                       As an example, Figure 4 shows the rewritings of the
                                                           lineage tracing query TQ for a tuple t according to the
     (b) The Lineage View (LV) or the Split Lineage
                                                           topmost segment in the definition of HighProfit
         Tables (SLTs), but not both
                                                           using the auxiliary views in Figure 3:
 3. Source Table: We assume that all local selec-             3
                                                                Most existing data warehousing systems automatically store
    tion conditions in the view—predicates that in-        a copy of each source table in the warehouse. However, as we will
    volve a single source table—are pushed down to         see in Section 6, sometimes it is not beneficial to store a copy.




Y. Cui, J. Widom                                                                                                                 11-7
            T Q1 =SplitSalary,Profit,Store(σprofit−expenses−salaries>100000∧city=t.city (AG1 ./ AG2 ./ BT4 ))
            T Q2 =SplitSalary,Profit,Store(σcity=t.city (LV2 ))
                                                                     Salary
            T Q3 =SplitSalary,Profit,Store(σcity=t.city (SLT2                 ./ SLT2Profit ./ SLT2Store))


                                               Figure 4: Tracing query rewritings

    Suppose that LV2 , AG1 , and BT4 are materialized.
                                                                                   Table 1: Statistics for cost estimation
Then we could use query T Q2 , or (among other op-
tions) we could use a tracing query similar to T Q1                      Parameter                            Description
that recomputes the contents of AG2 . In this case it                                 usage statistics (for each primary view)
is likely that T Q2 would be chosen as the best query                    query rate      # of tracing queries per unit time period
rewriting based on the available auxiliary views.                        query size      # of tuples traced per query
                                                                                       usage statistics (for each source table)
4.3    The View Selection Problem and the Search                         update rate # of source table updates per unit time period
                                                                         update size # of changed tuples per source table update
       Space Size
                                                                                        data statistics (for each source table)
We have shown that various auxiliary views can be                        tuple num       # of tuples in a source table
used in the view maintenance and lineage tracing pro-                    tuple size      size of tuples in a source table (in bytes)
cesses. Our goal is to select among the choices of aux-                                data statistics (for each view segment)
iliary views described in Section 4.1 a set that mini-                   fan-out         # of joined tables
mizes overall cost: the cost of lineage tracing plus the                 join ratio      # of joining tuples / # of tuples in cross-product
                                                                         select ratio # of selected tuples / # of tuples before selection
cost of maintaining the primary and auxiliary views.                     proj ratio      # of bytes projected / tuple size before projection
Our cost model is described in Section 5. Here let                       aggr ratio      # of aggr tuples / # of tuples before aggregation
us consider the size of our search space. Suppose we                              system statistics (for each source or warehouse)
have an n-level ASPJ view in normal form, and con-                       block size      # of bytes in a block
sider a balanced view definition tree4 with a fan-out                    disk cost       cost to read/write a disk block (in ms/block)
of m in each segment. There is one topmost segment,                      net cost        network transmission cost (in ms/byte)
and for that segment we have 3 auxiliary view options:
LV, SLTs, or nothing (case 1 in Section 4.1). There                    of possible auxiliary view sets for HighProfit is
are m1 + m2 + · · · + mn−1 = O(mn−1 ) intermediate                     26 · 32 = 384.                                 2
segments, each having 2 options for case 2(a) in Sec-
tion 4.1 (AG or nothing) and 3 options for case 2(b)                   The number of choices for HighProfit is quite
(LV, SLTs, or nothing). Finally, there are mn source                   manageable. However, real warehouse views tend to
tables, each having 2 options: BT or nothing. There-                   have much higher fan-outs, as well as possibly more
fore, the size of the entire search space is                           levels. As we will see in Section 7, even for a view
                           n−1 )       n             n
                                                                       with only 2 levels and average fan-out of 5, we can-
         31 · (2 · 3)O(m           · 2(m ) = O(2m )                    not consider all possible auxiliary view sets due to the
                                                                       large search space.
If k is the total number of components in the view def-
inition, where a component is a segment or a source
                                                                       5 Cost Model
table, then k = O(mn ) and the search space size is
O(2k ).                                                                In this section we present the model that we use to es-
                                                                       timate view maintenance and lineage tracing costs for
Example 4.2 (Search Space Size) Consider our ex-                       a given primary view and set of auxiliary views. The
ample view HighProfit (Figure 3). The number                           statistics our cost model relies on are listed in Table 1.
   4
     A tree is balanced if each leaf node in the tree has the same
                                                                       Values for these statistics are set for each experiment,
depth. We consider this view definition shape since it represents      as described in Section 7. We briefly outline our cost
the largest search space size for an n-level view.                     estimation procedure as follows.




Y. Cui, J. Widom                                                                                                                     11-8
    Let cost(Q, s) denote the estimated cost of evaluat-       Finally, the total cost is the combination of lineage
ing a query Q at the warehouse given a set of statistics    tracing cost and view maintenance cost:
s. Q could be a lineage tracing query, or a query or
update in a view maintenance procedure. To compute                total cost(v, A, s) = q(v, A, s) + m(v, A, s)
cost(Q, s) we use a fairly conventional cost model
                                                            In our experiments, we measure the optimality of
for relational queries in a distributed database setting,
                                                            given sets of auxiliary views, by which we mean how
similar to, e.g., [LQA97, Ull89, ZGMHW95]. Details
                                                            close the sets of views come to the set that yields the
are omitted, but the cost formulas rely on all of the
                                                            lowest estimated cost. For a given primary view v and
statistics from Table 1, and assume no indexes.
                                                            statistics s, let Aopt denote the set of auxiliary views
    Suppose we have a primary view v and a set of aux-
                                                            within our search space (Section 4) with the lowest
iliary views A = {v1 , ..., vn }. To trace the lineage of
                                                            total cost total cost(v, Aopt , s). For a set of auxiliary
tuples in the primary view given the materialized aux-
                                                            views A, we define the optimality of A as:
iliary views in A, there are various possible rewrit-
ings of the lineage tracing queries using the auxiliary                                total cost(v, Aopt , s)
views (recall Section 4.2). Our cost model selects the              optimality(A) =
                                                                                        total cost(v, A, s)
sequence of lineage tracing queries with the lowest es-
timated cost. Let q(v, A, s) denote the estimated lin-      6 Algorithms for Selecting Auxiliary Views
eage tracing cost for a given primary view v, set of        Having defined the search space for the optimization
auxiliary views A, and statistics s:                        problem and the cost model that we use, we now in-
                            X                               troduce four different algorithms for selecting a set
              q(v, A, s) =       cost(Qi , s)
                           1..m
                                                            of auxiliary views within the search space. The in-
                                                            put to each algorithm is the primary view definition v
where Q1 , . . . , Qm is the set of tracing queries se-
                                                            in ASPJ normal form, and a set of statistics s as speci-
lected for v given auxiliary view set A. Note that the
                                                            fied in Section 5. The output is a set of auxiliary views
lineage query rate and the average number of tuples
                                                            A.
traced in a lineage query (part of our usage statistics
in Table 1) are included in the input statistics s, and     6.1    Exhaustive Algorithm
thus are incorporated into the lineage cost estimated
by q(v, A, s).                                              The exhaustive algorithm enumerates all choices in
   Maintenance costs are incurred both for the pri-         the search space, estimates the cost of each choice,
mary view v and for the auxiliary views in A =              and picks the cheapest one. For our example view
{v1 , ..., vn }. As with lineage tracing, when there are    HighProfit, the exhaustive algorithm considers
multiple possible rewritings for the view maintenance       all 384 possible combinations of auxiliary views (re-
queries and updates using the auxiliary views in A,         call Figure 3). We set a sample set of statistics s
our cost model selects the ones with the lowest es-         as shown in Table 2, including statistics for view
timated cost. Let m(v, A, s) denote the estimated           HighProfit, source tables Employee, Sales,
maintenance cost for a given primary view v, set of         Product, and Store, as well as each ASPJ seg-
auxiliary views A, and statistics s:                        ment in the view definition (Figure 3). Over this set
                             X                              of statistics, the exhaustive algorithm selects A =
               m(v, A, s) =      cost(Mi , s)               {BT4 , AG1 , AG2 , SLT1 , LV2 }. The exhaustive algo-
                           1..n                             rithm always finds the optimal auxiliary view set ac-
where M1 , . . . , Mn is the set of maintenance queries     cording to our cost model. However, the complexity
and updates selected to maintain primary view v and         of the algorithm is the same as the search space size:
the auxiliary views in A. Note that the source ta-          O(2k ) where k is the number of components in the
ble update rate and average number of source tuples         view definition (recall Section 4.3).
changed in each update (part of our usage statistics in
                                                            6.2    Naive Algorithm
Table 1) are included in the input statistics s, and thus
are incorporated into the maintenance cost estimated        At the other end of the spectrum, we con-
by m(v, A, s).                                              sider a naive algorithm that selects a fixed set




Y. Cui, J. Widom                                                                                                 11-9
                                               Table 2: Statistics for view HighProfit

      Parameter name                                                       Values
                                  HighProfit    Employee     Sales    Product   Store    segment 1   segment 2   segment 3
      query rate (#/unit time)       100
      query size (tuples)             1
      update rate (#/unit time)                     10        10         10       0
      update size (tuples)                           1       100         1        0
      tuple num                                   10000    1000000    100000     100
      tuple size (bytes)                           1000      500        500      400
      fan-out                                                                               1            2           3
      join ratio                                                                                      0.0002      0.0001
      select ratio                                                                                                  0.1
      proj ratio                                                                            0.1         0.1         0.1
      aggr ratio                                                                           0.01        0.001        0.2
      block size (bytes)             8K             8K        8K        8K       8K
      disk cost (ms/block)            1              1         1         1        1
      net cost (ms/byte)              0          0.00001    0.0001    0.0001   0.0001


of auxiliary views: Lineage Views (LVs) for the                      not guarantee an optimal answer, nor even an answer
topmost and all intermediate segments, aggrega-                      within some percentage of optimal. In Section 7.3, we
tion results (AGs) for all intermediate segments,                    will see a scenario where the greedy algorithm per-
and all Base Tables (BTs). For example view                          forms very poorly.
HighProfit, the naive algorithm selects A =
{BT1 , BT2 , BT3 , BT4 , AG1 , AG2 , LV1 , LV2 }. Even               6.4   Three-Step Algorithm
though this naive fixed set of auxiliary views may not
be optimal—in fact it can be arbitrarily bad compared                Our last algorithm divides the auxiliary view selec-
to the optimal set—our experimental results in Sec-                  tion process into three phases. See Figure 5. In the
tion 7 show that the naive algorithm selects reasonably              first phase, we use a greedy approach to add auxiliary
good view sets in many cases, especially considering                 views of the AG and BT types only. In the second
its simplicity. The complexity of the naive algorithm                phase, we decide for the topmost and each intermedi-
is O(1).                                                             ate segment whether to add LV or SLTs. At this point,
                                                                     it may turn out that some of the AG or BT views se-
                                                                     lected in the first phase are no longer beneficial given
6.3     Greedy Algorithm
                                                                     the LV or SLT views selected in the second phase, and
We also consider a conventional greedy algorithm.                    they incur maintenance cost. Thus, in third phase we
This algorithm initializes the auxiliary view set A to               remove AG and BT views that are not beneficial, and
be empty. In each iteration, it adds into A the auxiliary            we do so in a greedy manner.
view (not yet in A) that brings the most benefit, i.e.,                  For example view HighProfit and the same
reduces the total cost the most given the current set of             statistics used in Sections 6.1 and 6.3, the three-step
views in A. Iteration continues until there are no more              algorithm also selects the optimal set of auxiliary
auxiliary views outside A that can further reduce the                views. In phase 1, it selects AG and BT views in
total cost. For our example view HighProfit and                      the following order: AG2 , AG1 , BT2 , BT3 , BT4 . In
the same sample statistics used in the exhaustive algo-              phase 2, it selects SLT1 and LV2 . In phase 3, it re-
rithm (Section 6.1), the greedy algorithm selects the                moves BT2 and BT3 (in that order) because they are
optimal set of auxiliary views in the following order:               no longer beneficial given the views selected in phase
AG2 , SLT1 , AG1 , LV2 , BT4 .                                       2.
   The greedy algorithm has complexity O(k2 ) in-                        The three-step algorithm has complexity O(k2 ),
stead of O(2k ) as in the exhaustive algorithm, and it               which is the same as the greedy algorithm, but its ac-
selects the optimal auxiliary view set in most cases                 tual running time is less than the greedy algorithm by
(see Section 7). However, the greedy algorithm can-                  a linear factor. The first phase of the three-step algo-




Y. Cui, J. Widom                                                                                                        11-10
        input: primary view v, statistics s
        output: auxiliary view set A
        begin
           A ← ∅;

           // phase 1: use greedy algorithm on AG and BT nodes
           Aall ← all possible auxiliary views for v;
           while true do
               for each vi ∈ Aall of type AG or BT such that vi 6∈ A do
                   benefiti ← total cost(v, A, s) − total cost(v, A ∪ {vi }, s);
               pick vi with the highest benefiti ;
               if benefiti ≤ 0 then break else A ← A ∪ {vi };
           endwhile;

           // phase 2: decide LV and SLTs
           for the topmost and each intermediate segment do
               // Let LV and SLT be the Lineage View and Split Lineage Tables for the segment
               cost1 ← cost(v, A ∪ {LV }, s);
               cost2 ← cost(v, A ∪ {SLT }, s);
               if cost1 ≥ cost2 > cost(v, A, s) then A ← A ∪ {LV }
               else if cost2 > cost1 > cost(v, A, s) then A ← A ∪ {SLT };
           endfor;

           // phase 3: remove useless AGs and BTs
           while true do
               for each vi ∈ A of type AG or BT do
                   benefiti ← total cost(v, A − {vi }, s) − total cost(v, A, s);
               pick vi with the lowest benefiti ;
               if benefiti > 0 then break else A ← A − {vi };
           endwhile;

           return A;
        end


                                         Figure 5: The three-step algorithm

rithm is faster than the greedy algorithm since it only       practical option is to combine the two algorithms:
selects from the AG and BT views, instead of from             run both algorithms and select whichever answer has
all auxiliary views. The second phase is linear in the        lower estimated cost. The running time of the com-
number of segments. The third phase only examines             bined algorithm remains O(k2 ).
the AG and BT views selected in the first phase, which
is a small number in most cases.                              7 Performance Study
   Like the greedy algorithm, the three-step algorithm        In this section, we study the performance of the four
usually selects the optimal set of auxiliary views (see       algorithms specified in Section 6, comparing their
Section 7). However, also like the greedy algorithm,          running times and the optimality of the answers they
the three-step algorithm cannot make any guarantees           produce. We also compare the cost of the answers pro-
about the optimality of its answers. In Section 7.3 we        duced by these algorithms against the cost of storing
will see a scenario where the three-step algorithm per-       no auxiliary views. In Section 7.1, we present results
forms poorly. Interestingly, in the case we show where        of experiments using the schema, statistics, and some
the three-step algorithm performs poorly, the greedy          views from the TPC-D benchmark. In Section 7.2,
algorithm performs well, and vice-versa. Thus, one            we present results of experiments using more complex




Y. Cui, J. Widom                                                                                              11-11
            Q5                             Q11                         Q17                                 V1

             α                             π                            α                                  α         levels = 2
                                           σ                            π                                  π         fan-out = 3
             π
                                                                                                           σ
             σ                                                          σ
                                       α         α
                                       π         π                                                 α       α        α
Customer             Region                σ                    Part         α                     π       π        π
    Order         Nation                                                     π                     σ       σ        σ
      Lineitem Supplier
                               PartSupp        Nation              Lineitem
                                      Supplier                                                 R1 R2 R3 R4 R5 R6 R7 R8 R9

Figure 6: Materialized views for TPC-D experiments                                            Figure 9: Structure of view v1

                                                                exhaustive                      levels   fan-out   query/update ratio
 100%
                                                                greedy                   v1       2         3             100
                                                                three-step               v2       2         3              10
  80%                                                           naive                    v3       2         3              1
                                                                none                     v4       2         3             0.1
  60%                                                                                    v5       2         3              0
                                                                                         v6       6         1              10
  40%
                                                                                         v7       2         5              10
  20%
                                                                                          Figure 10: Synthetic configurations
    0%
                    Q5           Q11             Q17
                                                                                 der are the fact tables according to the benchmark
          Figure 7: Optimality for TPC-D views                                   specification. For views, we select queries Q5 , Q11 ,
                                                                                 and Q17 from the benchmark, since they are relatively
                                                                                 complex and differ somewhat from each other. In each
            exhaustive        greedy       three-step   naive      none
                                                                                 case, we treat the benchmark query as the definition
   Q5         11.15            4.79           2.38      0.08       0.09
                                                                                 of our primary materialized view to be stored at the
   Q11        14.07            0.94           0.38      0.03       0.04
   Q17         0.62            0.29           0.10      0.02       0.03          warehouse. The general structure of each of the three
                                                                                 views is shown in Figure 6. The complete list of sta-
      Figure 8: Running time in seconds for TPC-D                                tistical settings (recall Table 1) for our three TPC-D
                                                                                 experiments is shown in Appendix A, Tables 3–5.
synthetic view definitions. Since the greedy and three-                              Recall that we are comparing five algorithms—the
step algorithms perform quite well in all of the experi-                         four algorithms from Section 6, as well as the “algo-
ments in Sections 7.1 and 7.2, in Section 7.3 we show                            rithm” that selects no auxiliary views (which we call
experiments illustrating that greedy and three-step can                          algorithm none). Figure 7 plots the optimality of the
perform poorly.                                                                  five algorithms for each of the TPC-D views we con-
                                                                                 sider. Recall from Section 5 that optimality is defined
7.1      TPC-D Experiments
                                                                                 as the cost of the optimal auxiliary view set divided
Our first set of experiments is based on the TPC-D                               by the cost of the chosen view set. Figure 8 plots the
benchmark [TPC96]. We use the schema of tables                                   running time of the algorithms. Note that algorithms
Customer, Order, Lineitem, Supplier, Na-                                         naive and none do incur a small running time, which
tion, Region, PartSupp, and Part from the                                        is the time required to compute the cost of their one
benchmark for our experiments. The table statistics                              solution.
we use correspond to a scaling factor of 1. The re-                                  We can see from Figure 7 that storing no auxiliary
maining statistics from Table 1 are set according to                             views can be dramatically worse than storing some,
the benchmark and commonly used database and net-                                even those selected by the naive algorithm. We also
work system settings. For example, we set the update                             see from Figure 7 that the greedy and three-step al-
rate for the Lineitem and Order tables to be much                                gorithms select the optimal auxiliary view set for all
higher than other tables, since Lineitem and Or-                                 three views, and from Figure 8 we see that they do so




Y. Cui, J. Widom                                                                                                                        11-12
                                                                                             exhaustive
                                                                                             greedy
  100%
                                                                                             three-step

      80%                                                                                    naive
                                                                                             none

      60%


      40%


      20%


           0%
                          V1       V2           V3       V4     V5     V6        V7


                                           Figure 11: Optimality for synthetic views

             exhaustive   greedy   three-step    naive   none    for view v7 the exhaustive algorithm never finished,
      v1      2411.68      3.18       0.67       0.05    0.07    so optimality is measured against the auxiliary view
      v2      2414.86      3.18       0.99       0.03    0.03
                                                                 set selected by the greedy algorithm.) The greedy and
      v3      2455.74      3.28       0.71       0.08    0.03
      v4      2493.06      3.17       0.63       0.03    0.05    three-step algorithms find their answer in a small frac-
      v5      2376.53      3.93       0.64       0.03    0.03    tion of the running time required by the exhaustive
      v6       757.49      0.99       0.17       0.02    0.02    algorithm, and three-step is much faster than greedy.
      v7                  156.36     35.29       0.33    0.21    Another interesting result is that algorithm none per-
                                                                 forms better when the query/update ratio is lower (ex-
  Figure 12: Running time for synthetic views (sec)
                                                                 periments v3 –v5 ). However, we should not infer that
in a small fraction of the running time required by the          the benefit of auxiliary views is primarily for lineage
exhaustive algorithm. We also note that the three-step           tracing. In fact, in experiment v5 the query/update ra-
algorithm runs considerably faster than the greedy al-           tio is set to 0 (indicating view maintenance only), and
gorithm.                                                         we still see significant benefit to using auxiliary views.
                                                                    Next, we consider in more detail how the running
7.2        Synthetic Experiments                                 times of the exhaustive, greedy, and three-step algo-
                                                                 rithms are affected by view complexity. In Figure 13,
Our next set of experiments is conducted using syn-              we consider views where we fix the number of levels
thetic views and data statistics. The views we consider          at 2 and increase the fan-out from 1 to 9. The exhaus-
all have a regular tree definition, as illustrated by view       tive, greedy, and three-step algorithms become pro-
v1 in Figure 9. We consider seven different views.               hibitive when the fan-out exceeds 3, 7, and 8, respec-
Figure 10 summarizes the “shape” of each view defi-              tively. We see similar behavior in Figure 14, where we
nition tree (number of levels and fan-out of each seg-           fix the fan-out at 2 and increase the number of levels
ment), along with the query/update ratio, which repre-           from 1 to 8.
sents the ratio of the average number of tracing queries
per unit time to the average number of source updates
                                                                 7.3   When Greedy and Three-Step Fail
(recall Table 1). The complete set of statistical set-
tings for the seven experiments is summarized in Ap-             The greedy and three-step algorithms select the op-
pendix A, Tables 6–8.                                            timal (or in one case very near to optimal) auxiliary
   Figure 11 plots the optimality of our five algorithms         view set in all of the experiments reported in Sec-
for each of the seven synthetic views we consider. Fig-          tions 7.1 and 7.2. However, there are cases in which
ure 12 plots the running time of the algorithms. The             these algorithms fail to pick an optimal or even near-
greedy and three-step algorithms always select the op-           optimal answer.
timal auxiliary view set or, in the one case of the three-          Consider a simple view definition v and the auxil-
step algorithm on v1 , very near to optimal. (Actually,          iary views that are considered for v in Figure 15. By




Y. Cui, J. Widom                                                                                                    11-13
                10000                                                                           V
                                             exhaustive
     running time (sec)
                          8000                   greedy                                         α
                                              three-step
                          6000                                                                  π
                                                                                LV or SLTs
                          4000                                                                  σ
                          2000
                                                                                    BTR                      BTS
                             0
                                 1   2   3      5 6 7 8
                                                 4               9                          R            S
                                             fan-out
                           Figure 13: Running time vs. fan-out           Figure 15: View structure used for v8 and v9

                10000                                                                                               exhaustive

                                             exhaustive                                                             greedy
     running time (sec)




                          8000                   greedy
                                                                      100%
                                                                                                                    three-step
                                              three-step               80%                                          naive
                          6000                                                                                      none

                                                                       60%
                          4000
                                                                       40%
                          2000
                                                                       20%

                             0
                                 1   2       3
                                             4    5     6    7   8      0%
                                                                                       V8           V9
                                            # levels
                          Figure 14: Running time vs. # levels                Figure 16: Optimality for v8 and v9

                                                                     8 Conclusion
setting the data statistics (Table 1) to different values,
                                                                     We have examined the problem of selecting auxiliary
we can vary the costs and benefits of the four differ-
                                                                     views to materialize in a data warehouse in order to
ent auxiliary views. We have set two different con-
                                                                     reduce the cost of view maintenance and lineage trac-
figurations, which we call v8 and v9 , both based on
                                                                     ing for complex primary views. We specify a normal
the view in Figure 15. The complete set of statistical
                                                                     form for view definitions and use it to define an initial
settings for these two experiments is summarized in
                                                                     search space of potentially beneficial auxiliary views.
Appendix A, Tables 9 and 10. In particular, we set the
                                                                     We presented four algorithms for exploring the search
source table network costs for v8 to be much higher
                                                                     space and selecting a set of auxiliary views: exhaus-
than for v9 , and we set the join ratio of v8 higher (less
                                                                     tive, greedy, three-step, and naive. We compared the
selective) than v9 . The optimal set of auxiliary views
                                                                     optimality and running time of the algorithms using
for v8 is {BTR , BTS }, and the optimal set for v9 is
                                                                     experiments based on the TPC-D benchmark, as well
{LV }.
                                                                     as on a variety of synthetic views and statistics.
   Figure 16 plots the optimality of all five algorithms                Our experiments indicate that in terms of running
on views v8 and v9 . In particular, the greedy al-                   time and optimality, the three-step algorithm appears
gorithm performs extremely poorly on v8 (selecting                   to be the best, although the running time of the greedy
{LV } instead of {BTR , BTS }), while the three-step                 algorithm probably also is fast enough in practice for
algorithm misses the optimal solution for v9 (select-                most complex warehouse views. (The exhaustive al-
ing {BTR , BTS } instead of {LV }). However, as sug-                 gorithm, on the other hand, becomes intractable quite
gested in Section 6, if we use a combined algorithm                  quickly.) Both the greedy and three-step algorithms
that runs both greedy and three-step and then selects                find the optimal auxiliary view set in most cases, al-
the lower-cost solution, we will select the optimal                  though we have shown (complementary) situations
view set for both v8 and v9 .                                        in which either one algorithm or the other performs




Y. Cui, J. Widom                                                                                                        11-14
poorly. Our experiments also illustrate that even a                     In Proc. of the Twenty-First Inter-
naive selection of auxiliary views reduces overall cost                 national Conference on Very Large
dramatically in most cases, underscoring the impor-                     Data Bases, pages 358–369, Zurich,
tance of materializing auxiliary views for the dual pur-                Switzerland, September 1995.
poses of view maintenance and lineage tracing in a
warehousing environment.                                    [GMS93]     A. Gupta, I. S. Mumick, and V. Sub-
   Although we have presented our work in the con-                      rahmanian. Maintaining views in-
text of selecting an auxiliary view set for a single pri-               crementally. In Proc. of the ACM
mary warehouse view, our approach extends easily to                     SIGMOD International Conference on
considering multiple primary views together. Then                       Management of Data, pages 157–166,
the cost of an auxiliary view may be “shared” if the                    Washington, DC, May 1993.
auxiliary view is beneficial to view maintenance or
                                                            [Gup97]     H. Gupta. Selection of views to mate-
lineage tracing for more than one primary view. Fur-
                                                                        rialize in a data warehouse. In Proc. of
thermore, although we have studied an environment in
                                                                        the Sixth International Conference on
which both view maintenance and lineage tracing are
                                                                        Database Theory, pages 98–112, Del-
important, if only one type of activity is present our
                                                                        phi, Greece, January 1997.
algorithms remain applicable.
                                                            [HRU96]     V. Harinarayan, A. Rajaraman, and
References                                                              J. D. Ullman. Implementing data
[CD97]          S. Chaudhuri and U. Dayal.    An                        cubes efficiently. In Proc. of the ACM
                overview of data warehousing and                        SIGMOD International Conference on
                OLAP technology. SIGMOD Record,                         Management of Data, pages 205–216,
                26(1):65–74, March 1997.                                Montreal, Canada, June 1996.

[CW91]          S. Ceri and J. Widom.         Deriv-        [IK93]      W.H. Inmon and C. Kelley. Rdb/VMS:
                ing production rules for incremen-                      Developing the Data Warehouse.
                tal view maintenance.       In Proc.                    QED Publishing Group, Boston,
                of the International Conference on                      Massachussetts, 1993.
                Very Large Databases, pages 577–589,
                Barcelona, Spain, September 1991.           [LQA97]     W.J. Labio, D. Quass, and B. Adel-
                                                                        berg. Physical database design for
[CW00]          Y. Cui and J. Widom. Practical lineage                  data warehousing. In Proc. of the
                tracing in data warehouses. To appear                   Thirteenth International Conference
                in Proc. of the Sixteenth International                 on Data Engineering, pages 277–288,
                Conference on Data Engineering, San                     Birmingham, UK, April 1997.
                Diego, California, February 2000.
                                                            [LW95]      D. Lomet and J. Widom, editors. Spe-
[CWW97]         Y. Cui, J. Widom, and J.L. Wiener.                      cial Issue on Materialized Views and
                Tracing the lineage of view data                        Data Warehousing, IEEE Data Engi-
                in a warehousing environment.                           neering Bulletin 18(2), June 1995.
                Technical report, Stanford Univer-
                sity Database Group, November               [LYC+ 99]   W. Labio, J. Yang, Y. Cui, H. Garcia-
                1997.      Available at http://www-                     Molina, and J. Widom. Performance
                db.stanford.edu/pub/papers/lineage-                     issues in incremental warehouse main-
                full.ps.                                                tenance. Technical report, Stanford
                                                                        University Database Group, October
[GHQ95]         A. Gupta, V. Harinarayan, and                           1999.       Available at http://www-
                D. Quass. Aggregate-query process-                      db.stanford.edu/pub/papers/whips-
                ing in data warehousing environments.                   wm.ps.




Y. Cui, J. Widom                                                                                         11-15
[Qua96]       D. Quass. Maintenance expressions          A   Statistical Settings for Experiments
              for views with aggregation. In Proc. of
              the Workshop on Materialized Views:
              Techniques and Applications, pages
              110–118, Montreal, Canada, June
              1996.

[RSS96]       K.A. Ross, D. Srivastava, and S. Su-
              darshan. Materialized view mainte-
              nance and integrity constraint check-
              ing: Trading space for time. In Proc.
              of the ACM SIGMOD International
              Conference on Management of Data,
              pages 447–458, Montreal, Canada,
              June 1996.

[TPC96]       Transaction Processing Performance
              Council. TPC-D Benchmark Specifi-
              cation, Version 1.2, 1996. Available at:
              http://www.tpc.org/.

[Ull89]       J. D. Ullman.        Database and
              Knowledge-base Systems (Vol 2).
              Computer Science Press, 1989.

[Wid95]       J. Widom. Research problems in data
              warehousing. In Proc. of the Fourth In-
              ternational Conference on Information
              and Knowledge Management, pages
              25–30, Baltimore, Maryland, Novem-
              ber 1995.

[ZGMHW95] Y. Zhuge, H. Garcia-Molina, J. Ham-
          mer, and J. Widom. View mainte-
          nance in a warehousing environment.
          In Proc. of the ACM SIGMOD Interna-
          tional Conference on Management of
          Data, pages 316–327, San Jose, Cali-
          fornia, May 1995.




Y. Cui, J. Widom                                                                                    11-16
                                                    Table 3: Statistics for Q5

       Parameter name                                                         Values
                                   Q5      Customer         Order      LineItem Supplier      Nation   Region       segment 1
       query rate (#/unit time)    100
       query size (tuples)          1
       update rate (#/unit time)                  1           10          40          1          0          0
       update size (tuples)                       1           10          10          1          0          0
       tuple num                               150000      1500000     6000000      10000       25          5
       tuple size (bytes)                        300         100         300         200       100         100
       fan-out                                                                                                          6
       join ratio                                                                                                    0.00001
       select ratio                                                                                                    0.01
       proj ratio                                                                                                      0.1
       aggr ratio                                                                                                     0.001
       block size (bytes)          8K            8K           8K            8K        8K        8K       8K
       disk cost (ms/block)         1             1            1             1         1         1        1
       net cost (ms/byte)           0          0.0001      0.00001       0.00001    0.0001    0.0001   0.0001




                                                    Table 4: Statistics for Q11

           Parameter name                                                      Values
                                         Q11     PartSupp     Supplier     Nation segment 1     segment 2        segment 3
           query rate (#/unit time)      100
           query size (tuples)            1
           update rate (#/unit time)                 10          1            0
           update size (tuples)                      1           1            0
           tuple num                              800000       10000         25
           tuple size (bytes)                       100         200          100
           fan-out                                                                      1            3               3
           join ratio                                                                             0.0006          0.0006
           select ratio                                                                0.1         0.04             0.04
           proj ratio                                                                  0.6          0.1             0.1
           aggr ratio                                                                             0.005            0.005
           block size (bytes)            8K         8K          8K           8K
           disk cost (ms/block)           1          1           1            1
           net cost (ms/byte)             0       0.0001      0.0001       0.0001




Y. Cui, J. Widom                                                                                                                11-17
                                                  Table 5: Statistics for Q17

                      Parameter name                                         Values
                                                   Q17     Part         Lineitem segment 1        segment 2
                      query rate (#/unit time)     100
                      query size (tuples)           1
                      update rate (#/unit time)              10            10
                      update size (tuples)                   1             10
                      tuple num                           200000        6000000
                      tuple size (bytes)                    200           300
                      fan-out                                                             3            1
                      join ratio                                                       0.0002
                      select ratio                                                      0.002
                      proj ratio                                                         0.1           0.2
                      aggr ratio                                                        0.001         0.03
                      block size (bytes)            8K      8K             8K
                      disk cost (ms/block)           1       1              1
                      net cost (ms/byte)             0    0.0001        0.00001




                                     Table 6: Statistics for synthetic views v1 –v5

               Parameter name                                                  Values
                                            v1       v2    v3      v4     v5      each source table   each segment
               query rate (#/unit time)    1000     100    10      1      0
               query size (tuples)           1       1      1      1      1
               update rate (#/unit time)                                                 10
               update size (tuples)                                                      1
               tuple num                                                               10000
               tuple size (bytes)                                                       100
               levels                        2        2    2       2      2
               fan-out                                                                                         3
               join ratio                                                                                    0.001
               select ratio                                                                                   0.2
               proj ratio                                                                                     0.4
               aggr ratio                                                                                     0.1
               block size (bytes)           8K       8K    8K      8K     8K             8K
               disk cost (ms/block)          1        1     1       1      1              1
               net cost (ms/byte)            0        0     0       0      0           0.0001




Y. Cui, J. Widom                                                                                                     11-18
                 Table 7: Statistics for v6                                              Table 8: Statistics for v7

Parameter name                             Values                       Parameter name                             Values
                             v6   each source table      each segment                                v7   each source table     each segment
query rate (#/unit time)    100                                         query rate (#/unit time)    100
query size (tuples)          1                                          query size (tuples)          1
update rate (#/unit time)                10                             update rate (#/unit time)                 10
update size (tuples)                      1                             update size (tuples)                       1
tuple num                             1000000                           tuple num                               10000
tuple size (bytes)                      1000                            tuple size (bytes)                       100
levels                      6                                           levels                      2
fan-out                                                       1         fan-out                                                      5
join ratio                                                              join ratio                                                 0.001
select ratio                                                 0.5        select ratio                                                0.2
proj ratio                                                   0.8        proj ratio                                                  0.4
aggr ratio                                                   0.2        aggr ratio                                                  0.1
block size (bytes)          8K             8K                           block size (bytes)          8K            8K
disk cost (ms/block)         1              1                           disk cost (ms/block)         1             1
net cost (ms/byte)           0           0.0001                         net cost (ms/byte)           0          0.0001




                 Table 9: Statistics for v8                                             Table 10: Statistics for v9

Parameter name                              Values                      Parameter name                             Values
                             v8     R1         R2        segment 1                                   v9    R1         R2         segment 1
query rate (#/unit time)    100                                         query rate (#/unit time)    100
query size (tuples)          1                                          query size (tuples)          1
update rate (#/unit time)            0          0                       update rate (#/unit time)           0          10
update size (tuples)                 0          0                       update size (tuples)                0          20
tuple num                         100000     100000                     tuple num                         10000     1000000
tuple size (bytes)                 1000       1000                      tuple size (bytes)                1000        1000
fan-out                                                       2         fan-out                                                       2
join ratio                                                  0.001       join ratio                                                0.00003
select ratio                                                 0.2        select ratio                                                 0.2
proj ratio                                                   0.6        proj ratio                                                   0.6
aggr ratio                                                   0.2        aggr ratio                                                   0.2
block size (bytes)          8K      8K             8K                   block size (bytes)          8K      8K           8K
disk cost (ms/block)         1       1              1                   disk cost (ms/block)         1       1            1
net cost (ms/byte)           0     0.01           0.01                  net cost (ms/byte)           0    0.0001        0.001




     Y. Cui, J. Widom                                                                                                            11-19