=Paper=
{{Paper
|id=Vol-28/paper-11
|storemode=property
|title=Storing auxiliary data for efficient maintenance and lineage tracing of complex views
|pdfUrl=https://ceur-ws.org/Vol-28/paper11.pdf
|volume=Vol-28
|dblpUrl=https://dblp.org/rec/conf/dmdw/CuiW00
}}
==Storing auxiliary data for efficient maintenance and lineage tracing of complex views==
Storing Auxiliary Data for Efficient Maintenance
and Lineage Tracing of Complex Views∗
Yingwei Cui and Jennifer Widom
Computer Science Department, Stanford University
{cyw, widom}@db.stanford.edu
gorithm that selects the optimal set of auxil-
iary views is far too expensive in many cases;
Abstract (2) two heuristic algorithms that we present
select good (often optimal) sets of auxiliary
As views in a data warehouse become more views in a much shorter time; (3) even aux-
complex, the view maintenance process can iliary views selected by a very simple al-
become very complicated and potentially gorithm can significantly reduce the overall
very inefficient. Storing auxiliary views in view maintenance and lineage tracing cost.
the warehouse can reduce the complexity and
improve the efficiency of view maintenance, 1 Introduction
and the same auxiliary views can help in ef-
Data warehousing systems collect data from multiple,
ficiently answering lineage tracing queries
distributed sources and integrate the information as
over the warehouse views. In this paper, we
materialized views in local databases [CD97, IK93,
study the problem of selecting auxiliary views
LW95, Wid95]. Users can then perform data analy-
to materialize in order to minimize the total
sis and mining on the warehouse views. The mate-
view maintenance and lineage tracing cost.
rialized views in the warehouse need to be kept up-
We consider relational views with arbitrary
to-date when data at the sources changes. As the
use of aggregation operators, and we define
view definitions become more complex in order to
an initial search space for our optimization
support sophisticated data analyses, the view mainte-
problem based on a normal form for such
nance process can become very complicated and po-
view definitions. We present several auxiliary
tentially very inefficient. Most previous work on view
view selection algorithms, and to study their
maintenance, e.g., [CW91, GMS93, LW95, LYC+99,
performance we conduct experiments using
Qua96], considers simple views containing at most
the TPC-D benchmark in addition to synthetic
one level of aggregation. In order to efficiently main-
view definitions and statistics. The results of
tain complex views which may contain multiple levels
our experiments show: (1) the exhaustive al-
of aggregation, it is clearly advantageous to store aux-
∗
This work was supported by the National Science Foundation iliary data in addition to the original view to reduce
under grant IIS-9811947, by Sagent Technology Inc., and by an overall view maintenance cost.
equipment grant from IBM Corporation.
From a different perspective, for in-depth analysis
The copyright of this paper belongs to the paper’s authors. Per-
of warehouse data sometimes it is useful to be able
mission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct to “drill through” from selected interesting (or possi-
commercial advantage. bly erroneous) view data to the original source data
Proceedings of the International Workshop on Design that derived the view data. We call this process trac-
and Management of Data Warehouses (DMDW’2000) ing the lineage of the view data [CWW97]. To trace
Stockholm, Sweden, June 5-6, 2000 the lineage of a view data item efficiently, the ware-
(M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.) house also needs to store auxiliary data—to reduce the
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/ computation cost at the warehouse, and to reduce or
Y. Cui, J. Widom 11-1
entirely avoid expensive source accesses for lineage • Two heuristic algorithms that we present select
tracing. It turns out that the same auxiliary data that good (often optimal) sets of auxiliary views in a
can be used to improve the performance of view main- much shorter time.
tenance as discussed in the first paragraph also can
improve the performance of lineage tracing queries. • Even auxiliary views selected by a very sim-
Therefore, the problems of selecting auxiliary data for ple algorithm can significantly reduce the overall
the two purposes are closely related, and we study the view maintenance and lineage tracing cost.
problems together.
The auxiliary data is stored as materialized views in 1.1 Outline of Paper
the warehouse, called auxiliary views (as opposed to
the original warehouse views, which we call primary The remainder of the paper proceeds as follows. Sec-
views). Given a complex relational primary view, tion 2 covers related work. Section 3 presents pre-
there are numerous possible sets of auxiliary views to liminary material on materialized views, view main-
materialize for view maintenance and lineage tracing, tenance, and lineage tracing, including a running ex-
with significant performance tradeoffs. In general, the ample. Section 4 introduces the auxiliary views we
more auxiliary views we materialize, the more effi- consider for efficient view maintenance and lineage
ciently we can maintain and trace the lineage of data in tracing, and defines the search space for selecting aux-
the primary view. However, the auxiliary views them- iliary views to materialize. Section 5 describes the
selves also need to be maintained, so materializing too cost model and statistics we use for estimating view
many auxiliary views can increase overall cost. maintenance and lineage tracing costs, and for study-
Previous work has studied the selection of views ing the performance of our auxiliary view selection al-
to materialize for answering queries, e.g., [HRU96, gorithms. Section 6 presents several algorithms for se-
Gup97], and the selection of auxiliary views for ef- lecting auxiliary views within our search space. Sec-
ficient maintenance of given primary views, e.g., tion 7 compares the performance of the algorithms us-
[LQA97, RSS96]. (Further discussion of this work ing experiments on the TPC-D benchmark, as well as
appears in Section 2.) In [CW00], we introduced using a variety of synthetic view definitions and statis-
the idea of materializing auxiliary views to minimize tics.
overall view maintenance and lineage tracing cost,
and we studied the problem in the context of select- 2 Related Work
project-join (SPJ) primary views. This paper inves-
tigates the more difficult and general problem of re- Previous work related to this paper falls into three
lational views with arbitrary use of aggregation and categories: selecting views to materialize in order to
SPJ operators. As we will see, it is an expensive com- minimize query costs, e.g., [HRU96, Gup97], select-
ing auxiliary views to materialize in order to min-
binatorial problem, and our overall approach differs
from [CW00]. imize the cost of maintaining given primary views,
e.g., [LQA97, RSS96], and our own previous work
In this paper, we first define a normal form for
in lineage tracing and view maintenance [CWW97,
the primary view definition, which suggests an ini-
tial search space of possible auxiliary views. We then CW00].
propose a variety of algorithms for selecting auxiliary [HRU96] proposes a greedy algorithm for selecting
views within this search space. Finally, we compare auxiliary views to materialize, with the goal of mini-
empirically the running time of our algorithms and the mizing the cost of queries over aggregate views given
optimality of the auxiliary view sets they select, using certain constraints such as the maximum number of
the TPC-D benchmark [TPC96] in addition to a suite views that can be materialized. The work considers
of synthetic view definitions and statistics. The results data-cube views only, and can make certain simpli-
of our experiments show: fying assumptions based on this restriction. [Gup97]
extends the work in [HRU96] to general relational
• The exhaustive algorithm that selects the opti- views, and proves that the auxiliary view selection
mal set of auxiliary views is far too expensive problem under maintenance cost constraints is NP-
in many cases. hard.
Y. Cui, J. Widom 11-2
[RSS96] proposes an exhaustive algorithm for se- rithm, and we compare the performance of our
lecting auxiliary views to optimize view maintenance, algorithms (both running time and quality of so-
and suggests simple search space pruning strategies lution) through experiments.
when the view is too complex for exhaustive search.
[LQA97] presents an A* algorithm for selecting aux- 3 Preliminaries
iliary views and indexes on different join combina- We now introduce the relational materialized views
tions for SPJ view maintenance. Both [RSS96] and we consider, as well as the processes of view main-
[LQA97] consider a single algorithm for selecting tenance and lineage tracing, using a running example.
auxiliary views (and indexes in the case of [LQA97]), Along the way, we illustrate why materializing auxil-
designed specifically for optimizing view mainte- iary views is important for view maintenance and lin-
nance. They consider as potential auxiliary views all eage tracing, and why it is useful to consider the two
nodes in all possible relevant query plans, making the problems together.
search space doubly exponential in the view definition
size. 3.1 Materialized Views
We introduced lineage tracing for relational data
warehouses in [CWW97], presenting a formal frame- To answer a variety of user queries efficiently, a data
work and basic algorithms. In [CW00], we introduced warehouse typically computes and stores a number of
the problem of selecting auxiliary views to simulta- materialized views [LW95]. In this paper, we con-
neously reduce view maintenance and lineage trac- sider relational views with arbitrary use of aggrega-
ing costs, and we considered the restricted case of tion, selection, projection, and join operators, which
SPJ views. We suggested several alternative auxil- we call ASPJ views. We use an algebraic representa-
iary view schemes and compared their performance. tion for the operators: α for grouping and aggregation,
In this paper, we tackle the problem for complex rela- σ for selection, π for (duplicate-eliminating) projec-
tional views with arbitrary use of aggregation and SPJ tion, and ./ for join. A view definition is presented
operators. Arbitrarily complex primary views make using a rooted operator DAG with source tables at the
the auxiliary view selection problem more compli- leaves.
cated and expensive than for SPJ views, and we take a Any ASPJ view definition v can be transformed
different approach to solving it than for the restricted into an equivalent form v 0 composed of α-π-σ-./
case considered in [CW00]. We introduce a normal operator sequences, by commuting and combining
form for our view definitions that suggests an initial some select-project-join operators in the view defini-
(still exponential) search space for useful auxiliary tion [CWW97]. We call the resulting form v’s ASPJ
view sets, and then we consider heuristic algorithms normal form, and we call each α-π-σ-./ sequence a
that explore various view sets in this search space. segment. An example will be given shortly. In ASPJ
Our work differs from the previous work discussed normal form, a segment may omit the π, σ, or ./ op-
above in several ways: erator, but each segment except the topmost must in-
clude a non-trivial aggregation operator (or it would
• Unlike all previous work besides our own, we be merged with an adjacent segment). Since our view
consider lineage tracing as well as view main- definitions are DAGs, they may contain multiple ref-
tenance costs when selecting auxiliary views to erences to a source table or to a segment at any level.
materialize. We say that a view is an n-level ASPJ view if
• Instead of considering a doubly exponential traversing from the root to any leaf in its normalized
search space of auxiliary views (as in [HRU96, definition crosses at most n segments. The fan-out of
Gup97, LQA97, RSS96]), or a very simple a segment is the number of operands of the segment’s
fixed set (as in [CW00]), we explore a “mid- join operator, or 1 if there is no join.
dle ground” based on our view definition normal
Example 3.1 (Materialized View and Normal Form)
form.
Consider a data warehouse for a department store
• We propose several different auxiliary view se- chain based on the following four tables, some or all
lection algorithms, as opposed to a single algo- of which may reside at remote source databases.
Y. Cui, J. Widom 11-3
• Store(store-id, city, expenses) sequence of queries and updates called the mainte-
gives the city and monthly operating expenses of nance procedure. For 1-level ASPJ views we use the
each store. We assume that each city contains at maintenance procedures from [GMS93, Qua96]. The
most one store, and that the operating expenses following example shows a 1-level ASPJ view and its
do not include employee salaries. maintenance procedure.
• Product(product-id, price, cost) Example 3.2 (View Maintenance Procedure)
gives the retail price and wholesale cost of each Consider the source tables from Example 3.1 and
product item. a 1-level ASPJ view Profit corresponding to the
middle α-./ segment in Figure 2:
• Sales(store-id, product-id, num)
gives the expected monthly number of sales for CREATE VIEW Profit AS
SELECT store-id, SUM(num*(price−cost)) AS profit
each product at each store.
FROM Sales, Product
WHERE Sales.product-id = Product.product-id
• Employee(emp-id, store-id, GROUP BY store-id
salary) gives the monthly salary of each
employee at each store. Suppose the Sales table changes over time, and
a set of insertions and deletions to the table are
Consider a materialized view HighProfit that stored in delta tables ∆Sales and ∇Sales, respec-
keeps track of those cities whose stores are very tively. The resulting changes to the view Profit
profitable, i.e., whose monthly income exceeds ex- (∆Profit and ∇Profit) can be computed by the
penses by at least $100,000. An SQL definition for following maintenance procedure, which uses the
HighProfit is shown in Figure 1, and its normal- summary-delta approach from [Qua96]:
ized view definition tree is shown in Figure 2. We SELECT store-id, SUM(profit) AS profit INTO SummaryDelta
use αG,aggr(A) to represent grouping and aggregation, FROM (SELECT store-id, (num*(price−cost)) as profit
where G is a list of grouping attributes, and aggr(A) FROM ∆Sales, Product
abbreviates a list of aggregate functions over attributes WHERE ∆Sales.product-id = Product.product-id)
UNION
in set A [CWW97].1 HighProfit is a 2-level ASPJ
(SELECT store-id, -1*(num*(price−cost)) as profit
view containing three segments: the topmost π-σ-./ FROM ∇Sales, Product
segment with fan-out 3, the leftmost α segment with WHERE ∇Sales.product-id = Product.product-id)
fan-out 1, and the middle α-./ segment with fan-out GROUP BY store-id
2. 2 SELECT * INTO ∇Profit
FROM Profit
WHERE store-id IN
3.2 View Maintenance Procedures
(SELECT store-id FROM SummaryDelta)
Materialized views must be maintained to keep their SELECT store-id, SUM(profit) INTO ∆Profit
contents up-to-date as the source tables they are de- FROM ∇Profit UNION SummaryDelta
fined over change. We assume a standard incremental GROUP BY store-id
view maintenance approach, as in [GMS93, Qua96]. 2
Insertions and deletions to each source table are mon-
itored and recorded in delta tables (∆ and ∇ respec- For an n-level ASPJ view where n > 1, to com-
tively) in the warehouse. Updates are modeled as pute the changes to the entire view we can com-
deletions followed by insertions. During view mainte- pute the changes for one segment at a time using
nance, changes to the view (also expressed as deltas) the maintenance procedure for 1-level views, prop-
are computed based on the source delta tables, the agating the deltas upward through the view defini-
view contents, and the source data, using a predefined tion DAG. Just as we needed source table Prod-
1
uct along with ∆Sales and ∇Sales to compute
This operator is similar to the generalized projection of
[GHQ95], but we distinguish between projection and aggregation
the deltas for Profit in Example 3.2, to compute
operators because of the way our segments and auxiliary views the deltas for a higher-level segment we may need
are defined. deltas and/or full contents for each lower segment.
Y. Cui, J. Widom 11-4
CREATE VIEW HighProfit AS HighProfit
SELECT city
FROM Store, π city
(SELECT store-id, SUM(num*(price−cost)) AS profit σ profit - expenses - salaries
FROM Sales, Product > 100000
WHERE Sales.product-id = Product.product-id
GROUP BY store-id) AS P, α store-id, α store-id,
(SELECT store-id, SUM(salary) AS salaries sum(salary) sum(num * (price - cost))
FROM Employee as salaries as profit
GROUP BY store-id) AS E
WHERE Store.store-id = E.store-id
AND E.store-id = P.store-id
AND P.profit−E.salaries−Store.expenses > 100000 Employee Sales Product Store
Figure 1: SQL definition for HighProfit Figure 2: Normal form for HighProfit
For example, suppose we want to compute the deltas given tuple t ∈ V , t’s lineage in T1 , . . . , Tm accord-
for HighProfit given ∆Profit and ∇Profit, ing to v can be computed with the following query:
where again Profit corresponds to the middle α-./
segment as in Example 3.2. We need to join the delta T Qt,v = SplitT1 ,...,Tm (σC∧G=t.G (T1 ./ · · · ./ Tm ))
tables for Profit with source table Store, as well
as with the leftmost α segment in Figure 2, then per- where Ti , denotes the schema of table Ti , i =
form the selection and projection in the topmost seg- 1..m, and Split is an operator that breaks a ta-
ment. If we materialize an auxiliary view Salary cor- ble into multiple projections: SplitA1 ,...,Am (T ) =
responding to the leftmost α segment, we can signif- hπA1 (T ), . . . , πAm (T )i.2 Given a tuple set T ⊆ V ,
icantly improve the performance of the maintenance we can simultaneously trace all the tuples in T with:
procedure by avoiding recomputation of the aggre-
T QT,v = SplitT1 ,...,Tm (σC (T1 ./ · · · ./ Tm ) n T )
gate values. In addition to materializing “intermedi-
ate” views, if source tables are remote and expensive Example 3.3 (Lineage Tracing Query) Consider
(or impossible) to access, we may want to replicate the view Profit in Example 3.2. The lineage
some or all of the source tables as auxiliary views at tracing query for a tuple t ∈ Profit is
the warehouse.
Note that our approach in this paper applies to all T Qt,Profit = SplitT1 ,T2 (σC (T1 ./ T2 ))
views, including those with non-incrementally main-
tainable aggregates (e.g., min, max). In the presence where T1 = Sales, T2 = Product, and C =
of such aggregates, the maintenance procedure must “store-id = t.store-id”. The SQL presentation of
involve some recomputation, but auxiliary views may the query is as follows:
still be of benefit.
SELECT Sales.* INTO LN Sales
FROM Sales, Product
3.3 Lineage Tracing WHERE Sales.product-id = Product.product-id AND
Sales.store-id = t.store-id
Given a materialized view in a data warehouse, in ad-
dition to issuing regular queries or performing other SELECT Product.* INTO LN Product
kinds of analysis over the view, we may want to trace FROM Sales, Product
the lineage of selected “interesting” tuples in the view. WHERE Sales.product-id = Product.product-id AND
Sales.store-id = t.store-id
The lineage of a view tuple is defined as the set of
original source tuples that derived the given view tu- where LN Sales and LN Product contain the
ple. To trace the lineage of a view tuple, we use a lineage of t according to view Profit in the source
predefined sequence of queries called tracing queries tables Sales and Product, respectively. 2
(TQs) [CWW97]. 2
When we execute the tracing query, the selection condition
Given a 1-level ASPJ view V whose definition is σC∧G=t.G is pushed down to individual Ti ’s whenever possible
v = αG,aggr(B) (πA (σC (T1 ./ · · · ./ Tm ))), and to improve tracing query performance.
Y. Cui, J. Widom 11-5
To trace the lineage of an n-level ASPJ view where lineage tracing queries take advantage of the auxiliary
n > 1, we logically define an intermediate view for views (Section 4.2). Finally, we formally define the
each segment, and then recursively trace through the auxiliary view selection problem and estimate the size
hierarchy of intermediate views top-down. At each of our search space (Section 4.3).
level, we use the tracing query for a 1-level ASPJ view
to compute the lineage for the current traced tuples 4.1 The Auxiliary Views We Consider
with respect to the intermediate views or source ta- Let us first define two types of potentially useful aux-
bles at the next level below. The necessary interme- iliary views, based on a single segment. (Similar aux-
diate results can either be computed at tracing time, iliary views were introduced in the context of SPJ pri-
or we can materialize certain intermediate results as mary views in [CW00].) Any segment can be thought
auxiliary views for the purpose of lineage tracing. of as a view definition v = αG,aggr(B) (πA (σC (T1 ./
For example, to trace the lineage of a tuple t in · · · ./ Tm ))), where each T1 , . . . , Tm is either a
view HighProfit, we logically define intermedi- source table or a lower-level segment (view). Let V
ate views Salary and Profit corresponding to denote the materialization of v over T1 , . . . , Tm .
the leftmost α segment and middle α-./ segment, re-
spectively, in Figure 2. We trace the lineage of tu- 1. Lineage View (LV) for v: We can store the inter-
ple t in Salary, Profit, and Store, producing mediate result LV (v) = σC (T1 ./ · · · ./ Tm )
hLN Salary, LN Profit, LN Storei, using the fol- to help trace the lineage of tuples in V . We can
lowing tracing query: rewrite the lineage tracing queries in Section 3.3
using LV (v) as:
T Q = SplitT1 ,T2 ,T3 (σC (T1 ./ T2 ./ T3 ))
T Qt,v = SplitT1 ,...,Tm (σG=t.G (LV (v)))
where T1 = Salary, T2 = Profit, T3 = Store, and T QT,v = SplitT1 ,...,Tm (LV (v) n T )
C = “profit − expenses − salaries > 100000
∧city = t.city”. Then, we further trace the lineage The maintenance procedure for V also can be
of the tuples in LN Salary and LN Profit in the simplified. If LV (v) is materialized, then we
source tables to produce LN Employee, LN Sales, compute ∆LV (v) and ∇LV (v), and the query
and LN Product. As with view maintenance, ma- for computing the summary-delta table in the
terializing rather than recomputing intermediate re- maintenance procedure (Section 3.2) can be
sults can significantly improve tracing performance. rewritten as:
Since lineage tracing queries always return data from
SummaryDelta =αG,aggr(B) (αG,aggr(B) (∆LV )
source tables by definition, replicating (portions of)
the source data at the warehouse may be advanta- ∪ αG,−aggr(B) (∇LV ))
geous, for the same reasons outlined in Section 3.2.
4 Auxiliary Views for View Maintenance and 2. Split Lineage Tables (SLTs) for v: For a view
Lineage Tracing (segment) v 0 whose joins is many-to-many,
LV (v 0 ) can be very large and inefficient to main-
As motivated in Section 3, it may be advantageous
tain. Thus, another possibility is to “split” the
to materialize certain auxiliary views in a data ware-
Lineage View and store a set of smaller tables:
house to improve the performance of view mainte-
SLTi (v) = πTi (σC (T1 ./ · · · ./ Tm )), i =
nance and lineage tracing. View maintenance pro-
1..m. The lineage tracing queries can then be
cedures and lineage tracing queries use the auxil-
rewritten using the SLTs as:
iary views to avoid recomputations and expensive
source queries, thereby reducing maintenance and T Qt,v =SplitT1 ,...,Tm (σG=t.G (σC ((SLT1 (v)
query costs. There are many possible sets of auxiliary
n T ) ./ · · · ./ (SLTm (v) n T )) n T ))
views to materialize. In this section, we first specify
a number of potentially useful auxiliary views for ar- T QT,v =SplitT1 ,...,Tm (σC ((SLT1 (v) n T )
bitrary n-level ASPJ primary views (Section 4.1). We ./ · · · ./ (SLTm (v) n T )) n T )
then discuss how view maintenance procedures and
Y. Cui, J. Widom 11-6
Although these tracing queries look much more HighProfit
complex than with LV, performance can some- π
times be much better due to the smaller size of σ LV2 or SLT 2
the SLTs. Furthermore, as with LV, the mainte- AG 1
nance procedure for V can use the deltas for the α α AG2
LV1 or SLT1
SLTs to be much more efficient. See [CW00] for
details.
BT 1 BT 2 BT 3 BT 4
Given a general ASPJ view definition in normal
form, in addition to considering Lineage Views and Employee Sales Product Store
Split Lineage Tables for each segment, we also may
consider storing copies of some or all of the source ta- Figure 3: Possible auxiliary views for HighProfit
bles, to avoid expensive (remote) source queries dur-
ing view maintenance and lineage tracing. We refer the source tables. For each source table R, we
to these source table copies in the warehouse as Base decide whether to store a Base Table (BT) copy
Tables (BTs). Finally, maintaining the results of in- of R. If BT is not materialized, we may need to
termediate aggregations in the view (AGs) also can be issue queries directly to source table R for view
very helpful in view maintenance and lineage tracing, maintenance and lineage tracing.3
as motivated in Section 3.
To summarize, we divide the normalized view def- Example 4.1 (Auxiliary Views) Recall our example
inition into three types of components, and for each view HighProfit from Figure 2. Figure 3 shows
type of component we have certain choices of possi- the three ASPJ segments in the view definition and all
ble auxiliary views to materialize: of the possible auxiliary views we consider material-
izing for HighProfit. 2
1. Topmost Segment: the segment at the root of
the view definition DAG. Note that the α, π, σ, Notice that because of our search space reduction,
and/or ./ operators (but not all of them) may be it is possible that there are useful auxiliary views we
omitted in this segment. Also note that the top- are not considering, notably different join combina-
most node corresponds to the primary view itself, tions in the case of a many-way join. This special
so its contents are always materialized. We may case is considered in detail in [LQA97], and we could
further choose to materialize the Lineage View extend our search space accordingly.
(LV) or the Split Lineage Tables (SLTs) for this
segment, but not both. (If we store one, then stor- 4.2 Using Auxiliary Views for View Maintenance
ing the other will not further reduce the lineage and Lineage Tracing
tracing or overall maintenance cost.)
In Section 4.1 we gave examples of how to rewrite
2. Intermediate Segment: a non-root segment that queries for view maintenance and lineage tracing us-
is defined over the source tables and/or other seg- ing auxiliary views. In general, when we have a set
ments. Note that the π, σ, and/or ./ operators of auxiliary views available, there may be more than
may be omitted in this segment, but the α opera- one way to rewrite a query to take advantage of aux-
tor is always present. For an intermediate seg- iliary views. We assume that the “best” rewriting is
ment, we consider materializing the following selected, and this assumption is reflected in the cost
auxiliary views: model we present in Section 5.
(a) The contents of the α node (AG) As an example, Figure 4 shows the rewritings of the
lineage tracing query TQ for a tuple t according to the
(b) The Lineage View (LV) or the Split Lineage
topmost segment in the definition of HighProfit
Tables (SLTs), but not both
using the auxiliary views in Figure 3:
3. Source Table: We assume that all local selec- 3
Most existing data warehousing systems automatically store
tion conditions in the view—predicates that in- a copy of each source table in the warehouse. However, as we will
volve a single source table—are pushed down to see in Section 6, sometimes it is not beneficial to store a copy.
Y. Cui, J. Widom 11-7
T Q1 =SplitSalary,Profit,Store(σprofit−expenses−salaries>100000∧city=t.city (AG1 ./ AG2 ./ BT4 ))
T Q2 =SplitSalary,Profit,Store(σcity=t.city (LV2 ))
Salary
T Q3 =SplitSalary,Profit,Store(σcity=t.city (SLT2 ./ SLT2Profit ./ SLT2Store))
Figure 4: Tracing query rewritings
Suppose that LV2 , AG1 , and BT4 are materialized.
Table 1: Statistics for cost estimation
Then we could use query T Q2 , or (among other op-
tions) we could use a tracing query similar to T Q1 Parameter Description
that recomputes the contents of AG2 . In this case it usage statistics (for each primary view)
is likely that T Q2 would be chosen as the best query query rate # of tracing queries per unit time period
rewriting based on the available auxiliary views. query size # of tuples traced per query
usage statistics (for each source table)
4.3 The View Selection Problem and the Search update rate # of source table updates per unit time period
update size # of changed tuples per source table update
Space Size
data statistics (for each source table)
We have shown that various auxiliary views can be tuple num # of tuples in a source table
used in the view maintenance and lineage tracing pro- tuple size size of tuples in a source table (in bytes)
cesses. Our goal is to select among the choices of aux- data statistics (for each view segment)
iliary views described in Section 4.1 a set that mini- fan-out # of joined tables
mizes overall cost: the cost of lineage tracing plus the join ratio # of joining tuples / # of tuples in cross-product
select ratio # of selected tuples / # of tuples before selection
cost of maintaining the primary and auxiliary views. proj ratio # of bytes projected / tuple size before projection
Our cost model is described in Section 5. Here let aggr ratio # of aggr tuples / # of tuples before aggregation
us consider the size of our search space. Suppose we system statistics (for each source or warehouse)
have an n-level ASPJ view in normal form, and con- block size # of bytes in a block
sider a balanced view definition tree4 with a fan-out disk cost cost to read/write a disk block (in ms/block)
of m in each segment. There is one topmost segment, net cost network transmission cost (in ms/byte)
and for that segment we have 3 auxiliary view options:
LV, SLTs, or nothing (case 1 in Section 4.1). There of possible auxiliary view sets for HighProfit is
are m1 + m2 + · · · + mn−1 = O(mn−1 ) intermediate 26 · 32 = 384. 2
segments, each having 2 options for case 2(a) in Sec-
tion 4.1 (AG or nothing) and 3 options for case 2(b) The number of choices for HighProfit is quite
(LV, SLTs, or nothing). Finally, there are mn source manageable. However, real warehouse views tend to
tables, each having 2 options: BT or nothing. There- have much higher fan-outs, as well as possibly more
fore, the size of the entire search space is levels. As we will see in Section 7, even for a view
n−1 ) n n
with only 2 levels and average fan-out of 5, we can-
31 · (2 · 3)O(m · 2(m ) = O(2m ) not consider all possible auxiliary view sets due to the
large search space.
If k is the total number of components in the view def-
inition, where a component is a segment or a source
5 Cost Model
table, then k = O(mn ) and the search space size is
O(2k ). In this section we present the model that we use to es-
timate view maintenance and lineage tracing costs for
Example 4.2 (Search Space Size) Consider our ex- a given primary view and set of auxiliary views. The
ample view HighProfit (Figure 3). The number statistics our cost model relies on are listed in Table 1.
4
A tree is balanced if each leaf node in the tree has the same
Values for these statistics are set for each experiment,
depth. We consider this view definition shape since it represents as described in Section 7. We briefly outline our cost
the largest search space size for an n-level view. estimation procedure as follows.
Y. Cui, J. Widom 11-8
Let cost(Q, s) denote the estimated cost of evaluat- Finally, the total cost is the combination of lineage
ing a query Q at the warehouse given a set of statistics tracing cost and view maintenance cost:
s. Q could be a lineage tracing query, or a query or
update in a view maintenance procedure. To compute total cost(v, A, s) = q(v, A, s) + m(v, A, s)
cost(Q, s) we use a fairly conventional cost model
In our experiments, we measure the optimality of
for relational queries in a distributed database setting,
given sets of auxiliary views, by which we mean how
similar to, e.g., [LQA97, Ull89, ZGMHW95]. Details
close the sets of views come to the set that yields the
are omitted, but the cost formulas rely on all of the
lowest estimated cost. For a given primary view v and
statistics from Table 1, and assume no indexes.
statistics s, let Aopt denote the set of auxiliary views
Suppose we have a primary view v and a set of aux-
within our search space (Section 4) with the lowest
iliary views A = {v1 , ..., vn }. To trace the lineage of
total cost total cost(v, Aopt , s). For a set of auxiliary
tuples in the primary view given the materialized aux-
views A, we define the optimality of A as:
iliary views in A, there are various possible rewrit-
ings of the lineage tracing queries using the auxiliary total cost(v, Aopt , s)
views (recall Section 4.2). Our cost model selects the optimality(A) =
total cost(v, A, s)
sequence of lineage tracing queries with the lowest es-
timated cost. Let q(v, A, s) denote the estimated lin- 6 Algorithms for Selecting Auxiliary Views
eage tracing cost for a given primary view v, set of Having defined the search space for the optimization
auxiliary views A, and statistics s: problem and the cost model that we use, we now in-
X troduce four different algorithms for selecting a set
q(v, A, s) = cost(Qi , s)
1..m
of auxiliary views within the search space. The in-
put to each algorithm is the primary view definition v
where Q1 , . . . , Qm is the set of tracing queries se-
in ASPJ normal form, and a set of statistics s as speci-
lected for v given auxiliary view set A. Note that the
fied in Section 5. The output is a set of auxiliary views
lineage query rate and the average number of tuples
A.
traced in a lineage query (part of our usage statistics
in Table 1) are included in the input statistics s, and 6.1 Exhaustive Algorithm
thus are incorporated into the lineage cost estimated
by q(v, A, s). The exhaustive algorithm enumerates all choices in
Maintenance costs are incurred both for the pri- the search space, estimates the cost of each choice,
mary view v and for the auxiliary views in A = and picks the cheapest one. For our example view
{v1 , ..., vn }. As with lineage tracing, when there are HighProfit, the exhaustive algorithm considers
multiple possible rewritings for the view maintenance all 384 possible combinations of auxiliary views (re-
queries and updates using the auxiliary views in A, call Figure 3). We set a sample set of statistics s
our cost model selects the ones with the lowest es- as shown in Table 2, including statistics for view
timated cost. Let m(v, A, s) denote the estimated HighProfit, source tables Employee, Sales,
maintenance cost for a given primary view v, set of Product, and Store, as well as each ASPJ seg-
auxiliary views A, and statistics s: ment in the view definition (Figure 3). Over this set
X of statistics, the exhaustive algorithm selects A =
m(v, A, s) = cost(Mi , s) {BT4 , AG1 , AG2 , SLT1 , LV2 }. The exhaustive algo-
1..n rithm always finds the optimal auxiliary view set ac-
where M1 , . . . , Mn is the set of maintenance queries cording to our cost model. However, the complexity
and updates selected to maintain primary view v and of the algorithm is the same as the search space size:
the auxiliary views in A. Note that the source ta- O(2k ) where k is the number of components in the
ble update rate and average number of source tuples view definition (recall Section 4.3).
changed in each update (part of our usage statistics in
6.2 Naive Algorithm
Table 1) are included in the input statistics s, and thus
are incorporated into the maintenance cost estimated At the other end of the spectrum, we con-
by m(v, A, s). sider a naive algorithm that selects a fixed set
Y. Cui, J. Widom 11-9
Table 2: Statistics for view HighProfit
Parameter name Values
HighProfit Employee Sales Product Store segment 1 segment 2 segment 3
query rate (#/unit time) 100
query size (tuples) 1
update rate (#/unit time) 10 10 10 0
update size (tuples) 1 100 1 0
tuple num 10000 1000000 100000 100
tuple size (bytes) 1000 500 500 400
fan-out 1 2 3
join ratio 0.0002 0.0001
select ratio 0.1
proj ratio 0.1 0.1 0.1
aggr ratio 0.01 0.001 0.2
block size (bytes) 8K 8K 8K 8K 8K
disk cost (ms/block) 1 1 1 1 1
net cost (ms/byte) 0 0.00001 0.0001 0.0001 0.0001
of auxiliary views: Lineage Views (LVs) for the not guarantee an optimal answer, nor even an answer
topmost and all intermediate segments, aggrega- within some percentage of optimal. In Section 7.3, we
tion results (AGs) for all intermediate segments, will see a scenario where the greedy algorithm per-
and all Base Tables (BTs). For example view forms very poorly.
HighProfit, the naive algorithm selects A =
{BT1 , BT2 , BT3 , BT4 , AG1 , AG2 , LV1 , LV2 }. Even 6.4 Three-Step Algorithm
though this naive fixed set of auxiliary views may not
be optimal—in fact it can be arbitrarily bad compared Our last algorithm divides the auxiliary view selec-
to the optimal set—our experimental results in Sec- tion process into three phases. See Figure 5. In the
tion 7 show that the naive algorithm selects reasonably first phase, we use a greedy approach to add auxiliary
good view sets in many cases, especially considering views of the AG and BT types only. In the second
its simplicity. The complexity of the naive algorithm phase, we decide for the topmost and each intermedi-
is O(1). ate segment whether to add LV or SLTs. At this point,
it may turn out that some of the AG or BT views se-
lected in the first phase are no longer beneficial given
6.3 Greedy Algorithm
the LV or SLT views selected in the second phase, and
We also consider a conventional greedy algorithm. they incur maintenance cost. Thus, in third phase we
This algorithm initializes the auxiliary view set A to remove AG and BT views that are not beneficial, and
be empty. In each iteration, it adds into A the auxiliary we do so in a greedy manner.
view (not yet in A) that brings the most benefit, i.e., For example view HighProfit and the same
reduces the total cost the most given the current set of statistics used in Sections 6.1 and 6.3, the three-step
views in A. Iteration continues until there are no more algorithm also selects the optimal set of auxiliary
auxiliary views outside A that can further reduce the views. In phase 1, it selects AG and BT views in
total cost. For our example view HighProfit and the following order: AG2 , AG1 , BT2 , BT3 , BT4 . In
the same sample statistics used in the exhaustive algo- phase 2, it selects SLT1 and LV2 . In phase 3, it re-
rithm (Section 6.1), the greedy algorithm selects the moves BT2 and BT3 (in that order) because they are
optimal set of auxiliary views in the following order: no longer beneficial given the views selected in phase
AG2 , SLT1 , AG1 , LV2 , BT4 . 2.
The greedy algorithm has complexity O(k2 ) in- The three-step algorithm has complexity O(k2 ),
stead of O(2k ) as in the exhaustive algorithm, and it which is the same as the greedy algorithm, but its ac-
selects the optimal auxiliary view set in most cases tual running time is less than the greedy algorithm by
(see Section 7). However, the greedy algorithm can- a linear factor. The first phase of the three-step algo-
Y. Cui, J. Widom 11-10
input: primary view v, statistics s
output: auxiliary view set A
begin
A ← ∅;
// phase 1: use greedy algorithm on AG and BT nodes
Aall ← all possible auxiliary views for v;
while true do
for each vi ∈ Aall of type AG or BT such that vi 6∈ A do
benefiti ← total cost(v, A, s) − total cost(v, A ∪ {vi }, s);
pick vi with the highest benefiti ;
if benefiti ≤ 0 then break else A ← A ∪ {vi };
endwhile;
// phase 2: decide LV and SLTs
for the topmost and each intermediate segment do
// Let LV and SLT be the Lineage View and Split Lineage Tables for the segment
cost1 ← cost(v, A ∪ {LV }, s);
cost2 ← cost(v, A ∪ {SLT }, s);
if cost1 ≥ cost2 > cost(v, A, s) then A ← A ∪ {LV }
else if cost2 > cost1 > cost(v, A, s) then A ← A ∪ {SLT };
endfor;
// phase 3: remove useless AGs and BTs
while true do
for each vi ∈ A of type AG or BT do
benefiti ← total cost(v, A − {vi }, s) − total cost(v, A, s);
pick vi with the lowest benefiti ;
if benefiti > 0 then break else A ← A − {vi };
endwhile;
return A;
end
Figure 5: The three-step algorithm
rithm is faster than the greedy algorithm since it only practical option is to combine the two algorithms:
selects from the AG and BT views, instead of from run both algorithms and select whichever answer has
all auxiliary views. The second phase is linear in the lower estimated cost. The running time of the com-
number of segments. The third phase only examines bined algorithm remains O(k2 ).
the AG and BT views selected in the first phase, which
is a small number in most cases. 7 Performance Study
Like the greedy algorithm, the three-step algorithm In this section, we study the performance of the four
usually selects the optimal set of auxiliary views (see algorithms specified in Section 6, comparing their
Section 7). However, also like the greedy algorithm, running times and the optimality of the answers they
the three-step algorithm cannot make any guarantees produce. We also compare the cost of the answers pro-
about the optimality of its answers. In Section 7.3 we duced by these algorithms against the cost of storing
will see a scenario where the three-step algorithm per- no auxiliary views. In Section 7.1, we present results
forms poorly. Interestingly, in the case we show where of experiments using the schema, statistics, and some
the three-step algorithm performs poorly, the greedy views from the TPC-D benchmark. In Section 7.2,
algorithm performs well, and vice-versa. Thus, one we present results of experiments using more complex
Y. Cui, J. Widom 11-11
Q5 Q11 Q17 V1
α π α α levels = 2
σ π π fan-out = 3
π
σ
σ σ
α α
π π α α α
Customer Region σ Part α π π π
Order Nation π σ σ σ
Lineitem Supplier
PartSupp Nation Lineitem
Supplier R1 R2 R3 R4 R5 R6 R7 R8 R9
Figure 6: Materialized views for TPC-D experiments Figure 9: Structure of view v1
exhaustive levels fan-out query/update ratio
100%
greedy v1 2 3 100
three-step v2 2 3 10
80% naive v3 2 3 1
none v4 2 3 0.1
60% v5 2 3 0
v6 6 1 10
40%
v7 2 5 10
20%
Figure 10: Synthetic configurations
0%
Q5 Q11 Q17
der are the fact tables according to the benchmark
Figure 7: Optimality for TPC-D views specification. For views, we select queries Q5 , Q11 ,
and Q17 from the benchmark, since they are relatively
complex and differ somewhat from each other. In each
exhaustive greedy three-step naive none
case, we treat the benchmark query as the definition
Q5 11.15 4.79 2.38 0.08 0.09
of our primary materialized view to be stored at the
Q11 14.07 0.94 0.38 0.03 0.04
Q17 0.62 0.29 0.10 0.02 0.03 warehouse. The general structure of each of the three
views is shown in Figure 6. The complete list of sta-
Figure 8: Running time in seconds for TPC-D tistical settings (recall Table 1) for our three TPC-D
experiments is shown in Appendix A, Tables 3–5.
synthetic view definitions. Since the greedy and three- Recall that we are comparing five algorithms—the
step algorithms perform quite well in all of the experi- four algorithms from Section 6, as well as the “algo-
ments in Sections 7.1 and 7.2, in Section 7.3 we show rithm” that selects no auxiliary views (which we call
experiments illustrating that greedy and three-step can algorithm none). Figure 7 plots the optimality of the
perform poorly. five algorithms for each of the TPC-D views we con-
sider. Recall from Section 5 that optimality is defined
7.1 TPC-D Experiments
as the cost of the optimal auxiliary view set divided
Our first set of experiments is based on the TPC-D by the cost of the chosen view set. Figure 8 plots the
benchmark [TPC96]. We use the schema of tables running time of the algorithms. Note that algorithms
Customer, Order, Lineitem, Supplier, Na- naive and none do incur a small running time, which
tion, Region, PartSupp, and Part from the is the time required to compute the cost of their one
benchmark for our experiments. The table statistics solution.
we use correspond to a scaling factor of 1. The re- We can see from Figure 7 that storing no auxiliary
maining statistics from Table 1 are set according to views can be dramatically worse than storing some,
the benchmark and commonly used database and net- even those selected by the naive algorithm. We also
work system settings. For example, we set the update see from Figure 7 that the greedy and three-step al-
rate for the Lineitem and Order tables to be much gorithms select the optimal auxiliary view set for all
higher than other tables, since Lineitem and Or- three views, and from Figure 8 we see that they do so
Y. Cui, J. Widom 11-12
exhaustive
greedy
100%
three-step
80% naive
none
60%
40%
20%
0%
V1 V2 V3 V4 V5 V6 V7
Figure 11: Optimality for synthetic views
exhaustive greedy three-step naive none for view v7 the exhaustive algorithm never finished,
v1 2411.68 3.18 0.67 0.05 0.07 so optimality is measured against the auxiliary view
v2 2414.86 3.18 0.99 0.03 0.03
set selected by the greedy algorithm.) The greedy and
v3 2455.74 3.28 0.71 0.08 0.03
v4 2493.06 3.17 0.63 0.03 0.05 three-step algorithms find their answer in a small frac-
v5 2376.53 3.93 0.64 0.03 0.03 tion of the running time required by the exhaustive
v6 757.49 0.99 0.17 0.02 0.02 algorithm, and three-step is much faster than greedy.
v7 156.36 35.29 0.33 0.21 Another interesting result is that algorithm none per-
forms better when the query/update ratio is lower (ex-
Figure 12: Running time for synthetic views (sec)
periments v3 –v5 ). However, we should not infer that
in a small fraction of the running time required by the the benefit of auxiliary views is primarily for lineage
exhaustive algorithm. We also note that the three-step tracing. In fact, in experiment v5 the query/update ra-
algorithm runs considerably faster than the greedy al- tio is set to 0 (indicating view maintenance only), and
gorithm. we still see significant benefit to using auxiliary views.
Next, we consider in more detail how the running
7.2 Synthetic Experiments times of the exhaustive, greedy, and three-step algo-
rithms are affected by view complexity. In Figure 13,
Our next set of experiments is conducted using syn- we consider views where we fix the number of levels
thetic views and data statistics. The views we consider at 2 and increase the fan-out from 1 to 9. The exhaus-
all have a regular tree definition, as illustrated by view tive, greedy, and three-step algorithms become pro-
v1 in Figure 9. We consider seven different views. hibitive when the fan-out exceeds 3, 7, and 8, respec-
Figure 10 summarizes the “shape” of each view defi- tively. We see similar behavior in Figure 14, where we
nition tree (number of levels and fan-out of each seg- fix the fan-out at 2 and increase the number of levels
ment), along with the query/update ratio, which repre- from 1 to 8.
sents the ratio of the average number of tracing queries
per unit time to the average number of source updates
7.3 When Greedy and Three-Step Fail
(recall Table 1). The complete set of statistical set-
tings for the seven experiments is summarized in Ap- The greedy and three-step algorithms select the op-
pendix A, Tables 6–8. timal (or in one case very near to optimal) auxiliary
Figure 11 plots the optimality of our five algorithms view set in all of the experiments reported in Sec-
for each of the seven synthetic views we consider. Fig- tions 7.1 and 7.2. However, there are cases in which
ure 12 plots the running time of the algorithms. The these algorithms fail to pick an optimal or even near-
greedy and three-step algorithms always select the op- optimal answer.
timal auxiliary view set or, in the one case of the three- Consider a simple view definition v and the auxil-
step algorithm on v1 , very near to optimal. (Actually, iary views that are considered for v in Figure 15. By
Y. Cui, J. Widom 11-13
10000 V
exhaustive
running time (sec)
8000 greedy α
three-step
6000 π
LV or SLTs
4000 σ
2000
BTR BTS
0
1 2 3 5 6 7 8
4 9 R S
fan-out
Figure 13: Running time vs. fan-out Figure 15: View structure used for v8 and v9
10000 exhaustive
exhaustive greedy
running time (sec)
8000 greedy
100%
three-step
three-step 80% naive
6000 none
60%
4000
40%
2000
20%
0
1 2 3
4 5 6 7 8 0%
V8 V9
# levels
Figure 14: Running time vs. # levels Figure 16: Optimality for v8 and v9
8 Conclusion
setting the data statistics (Table 1) to different values,
We have examined the problem of selecting auxiliary
we can vary the costs and benefits of the four differ-
views to materialize in a data warehouse in order to
ent auxiliary views. We have set two different con-
reduce the cost of view maintenance and lineage trac-
figurations, which we call v8 and v9 , both based on
ing for complex primary views. We specify a normal
the view in Figure 15. The complete set of statistical
form for view definitions and use it to define an initial
settings for these two experiments is summarized in
search space of potentially beneficial auxiliary views.
Appendix A, Tables 9 and 10. In particular, we set the
We presented four algorithms for exploring the search
source table network costs for v8 to be much higher
space and selecting a set of auxiliary views: exhaus-
than for v9 , and we set the join ratio of v8 higher (less
tive, greedy, three-step, and naive. We compared the
selective) than v9 . The optimal set of auxiliary views
optimality and running time of the algorithms using
for v8 is {BTR , BTS }, and the optimal set for v9 is
experiments based on the TPC-D benchmark, as well
{LV }.
as on a variety of synthetic views and statistics.
Figure 16 plots the optimality of all five algorithms Our experiments indicate that in terms of running
on views v8 and v9 . In particular, the greedy al- time and optimality, the three-step algorithm appears
gorithm performs extremely poorly on v8 (selecting to be the best, although the running time of the greedy
{LV } instead of {BTR , BTS }), while the three-step algorithm probably also is fast enough in practice for
algorithm misses the optimal solution for v9 (select- most complex warehouse views. (The exhaustive al-
ing {BTR , BTS } instead of {LV }). However, as sug- gorithm, on the other hand, becomes intractable quite
gested in Section 6, if we use a combined algorithm quickly.) Both the greedy and three-step algorithms
that runs both greedy and three-step and then selects find the optimal auxiliary view set in most cases, al-
the lower-cost solution, we will select the optimal though we have shown (complementary) situations
view set for both v8 and v9 . in which either one algorithm or the other performs
Y. Cui, J. Widom 11-14
poorly. Our experiments also illustrate that even a In Proc. of the Twenty-First Inter-
naive selection of auxiliary views reduces overall cost national Conference on Very Large
dramatically in most cases, underscoring the impor- Data Bases, pages 358–369, Zurich,
tance of materializing auxiliary views for the dual pur- Switzerland, September 1995.
poses of view maintenance and lineage tracing in a
warehousing environment. [GMS93] A. Gupta, I. S. Mumick, and V. Sub-
Although we have presented our work in the con- rahmanian. Maintaining views in-
text of selecting an auxiliary view set for a single pri- crementally. In Proc. of the ACM
mary warehouse view, our approach extends easily to SIGMOD International Conference on
considering multiple primary views together. Then Management of Data, pages 157–166,
the cost of an auxiliary view may be “shared” if the Washington, DC, May 1993.
auxiliary view is beneficial to view maintenance or
[Gup97] H. Gupta. Selection of views to mate-
lineage tracing for more than one primary view. Fur-
rialize in a data warehouse. In Proc. of
thermore, although we have studied an environment in
the Sixth International Conference on
which both view maintenance and lineage tracing are
Database Theory, pages 98–112, Del-
important, if only one type of activity is present our
phi, Greece, January 1997.
algorithms remain applicable.
[HRU96] V. Harinarayan, A. Rajaraman, and
References J. D. Ullman. Implementing data
[CD97] S. Chaudhuri and U. Dayal. An cubes efficiently. In Proc. of the ACM
overview of data warehousing and SIGMOD International Conference on
OLAP technology. SIGMOD Record, Management of Data, pages 205–216,
26(1):65–74, March 1997. Montreal, Canada, June 1996.
[CW91] S. Ceri and J. Widom. Deriv- [IK93] W.H. Inmon and C. Kelley. Rdb/VMS:
ing production rules for incremen- Developing the Data Warehouse.
tal view maintenance. In Proc. QED Publishing Group, Boston,
of the International Conference on Massachussetts, 1993.
Very Large Databases, pages 577–589,
Barcelona, Spain, September 1991. [LQA97] W.J. Labio, D. Quass, and B. Adel-
berg. Physical database design for
[CW00] Y. Cui and J. Widom. Practical lineage data warehousing. In Proc. of the
tracing in data warehouses. To appear Thirteenth International Conference
in Proc. of the Sixteenth International on Data Engineering, pages 277–288,
Conference on Data Engineering, San Birmingham, UK, April 1997.
Diego, California, February 2000.
[LW95] D. Lomet and J. Widom, editors. Spe-
[CWW97] Y. Cui, J. Widom, and J.L. Wiener. cial Issue on Materialized Views and
Tracing the lineage of view data Data Warehousing, IEEE Data Engi-
in a warehousing environment. neering Bulletin 18(2), June 1995.
Technical report, Stanford Univer-
sity Database Group, November [LYC+ 99] W. Labio, J. Yang, Y. Cui, H. Garcia-
1997. Available at http://www- Molina, and J. Widom. Performance
db.stanford.edu/pub/papers/lineage- issues in incremental warehouse main-
full.ps. tenance. Technical report, Stanford
University Database Group, October
[GHQ95] A. Gupta, V. Harinarayan, and 1999. Available at http://www-
D. Quass. Aggregate-query process- db.stanford.edu/pub/papers/whips-
ing in data warehousing environments. wm.ps.
Y. Cui, J. Widom 11-15
[Qua96] D. Quass. Maintenance expressions A Statistical Settings for Experiments
for views with aggregation. In Proc. of
the Workshop on Materialized Views:
Techniques and Applications, pages
110–118, Montreal, Canada, June
1996.
[RSS96] K.A. Ross, D. Srivastava, and S. Su-
darshan. Materialized view mainte-
nance and integrity constraint check-
ing: Trading space for time. In Proc.
of the ACM SIGMOD International
Conference on Management of Data,
pages 447–458, Montreal, Canada,
June 1996.
[TPC96] Transaction Processing Performance
Council. TPC-D Benchmark Specifi-
cation, Version 1.2, 1996. Available at:
http://www.tpc.org/.
[Ull89] J. D. Ullman. Database and
Knowledge-base Systems (Vol 2).
Computer Science Press, 1989.
[Wid95] J. Widom. Research problems in data
warehousing. In Proc. of the Fourth In-
ternational Conference on Information
and Knowledge Management, pages
25–30, Baltimore, Maryland, Novem-
ber 1995.
[ZGMHW95] Y. Zhuge, H. Garcia-Molina, J. Ham-
mer, and J. Widom. View mainte-
nance in a warehousing environment.
In Proc. of the ACM SIGMOD Interna-
tional Conference on Management of
Data, pages 316–327, San Jose, Cali-
fornia, May 1995.
Y. Cui, J. Widom 11-16
Table 3: Statistics for Q5
Parameter name Values
Q5 Customer Order LineItem Supplier Nation Region segment 1
query rate (#/unit time) 100
query size (tuples) 1
update rate (#/unit time) 1 10 40 1 0 0
update size (tuples) 1 10 10 1 0 0
tuple num 150000 1500000 6000000 10000 25 5
tuple size (bytes) 300 100 300 200 100 100
fan-out 6
join ratio 0.00001
select ratio 0.01
proj ratio 0.1
aggr ratio 0.001
block size (bytes) 8K 8K 8K 8K 8K 8K 8K
disk cost (ms/block) 1 1 1 1 1 1 1
net cost (ms/byte) 0 0.0001 0.00001 0.00001 0.0001 0.0001 0.0001
Table 4: Statistics for Q11
Parameter name Values
Q11 PartSupp Supplier Nation segment 1 segment 2 segment 3
query rate (#/unit time) 100
query size (tuples) 1
update rate (#/unit time) 10 1 0
update size (tuples) 1 1 0
tuple num 800000 10000 25
tuple size (bytes) 100 200 100
fan-out 1 3 3
join ratio 0.0006 0.0006
select ratio 0.1 0.04 0.04
proj ratio 0.6 0.1 0.1
aggr ratio 0.005 0.005
block size (bytes) 8K 8K 8K 8K
disk cost (ms/block) 1 1 1 1
net cost (ms/byte) 0 0.0001 0.0001 0.0001
Y. Cui, J. Widom 11-17
Table 5: Statistics for Q17
Parameter name Values
Q17 Part Lineitem segment 1 segment 2
query rate (#/unit time) 100
query size (tuples) 1
update rate (#/unit time) 10 10
update size (tuples) 1 10
tuple num 200000 6000000
tuple size (bytes) 200 300
fan-out 3 1
join ratio 0.0002
select ratio 0.002
proj ratio 0.1 0.2
aggr ratio 0.001 0.03
block size (bytes) 8K 8K 8K
disk cost (ms/block) 1 1 1
net cost (ms/byte) 0 0.0001 0.00001
Table 6: Statistics for synthetic views v1 –v5
Parameter name Values
v1 v2 v3 v4 v5 each source table each segment
query rate (#/unit time) 1000 100 10 1 0
query size (tuples) 1 1 1 1 1
update rate (#/unit time) 10
update size (tuples) 1
tuple num 10000
tuple size (bytes) 100
levels 2 2 2 2 2
fan-out 3
join ratio 0.001
select ratio 0.2
proj ratio 0.4
aggr ratio 0.1
block size (bytes) 8K 8K 8K 8K 8K 8K
disk cost (ms/block) 1 1 1 1 1 1
net cost (ms/byte) 0 0 0 0 0 0.0001
Y. Cui, J. Widom 11-18
Table 7: Statistics for v6 Table 8: Statistics for v7
Parameter name Values Parameter name Values
v6 each source table each segment v7 each source table each segment
query rate (#/unit time) 100 query rate (#/unit time) 100
query size (tuples) 1 query size (tuples) 1
update rate (#/unit time) 10 update rate (#/unit time) 10
update size (tuples) 1 update size (tuples) 1
tuple num 1000000 tuple num 10000
tuple size (bytes) 1000 tuple size (bytes) 100
levels 6 levels 2
fan-out 1 fan-out 5
join ratio join ratio 0.001
select ratio 0.5 select ratio 0.2
proj ratio 0.8 proj ratio 0.4
aggr ratio 0.2 aggr ratio 0.1
block size (bytes) 8K 8K block size (bytes) 8K 8K
disk cost (ms/block) 1 1 disk cost (ms/block) 1 1
net cost (ms/byte) 0 0.0001 net cost (ms/byte) 0 0.0001
Table 9: Statistics for v8 Table 10: Statistics for v9
Parameter name Values Parameter name Values
v8 R1 R2 segment 1 v9 R1 R2 segment 1
query rate (#/unit time) 100 query rate (#/unit time) 100
query size (tuples) 1 query size (tuples) 1
update rate (#/unit time) 0 0 update rate (#/unit time) 0 10
update size (tuples) 0 0 update size (tuples) 0 20
tuple num 100000 100000 tuple num 10000 1000000
tuple size (bytes) 1000 1000 tuple size (bytes) 1000 1000
fan-out 2 fan-out 2
join ratio 0.001 join ratio 0.00003
select ratio 0.2 select ratio 0.2
proj ratio 0.6 proj ratio 0.6
aggr ratio 0.2 aggr ratio 0.2
block size (bytes) 8K 8K 8K block size (bytes) 8K 8K 8K
disk cost (ms/block) 1 1 1 disk cost (ms/block) 1 1 1
net cost (ms/byte) 0 0.01 0.01 net cost (ms/byte) 0 0.0001 0.001
Y. Cui, J. Widom 11-19