Storing Auxiliary Data for Efficient Maintenance and Lineage Tracing of Complex Views∗ Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University {cyw, widom}@db.stanford.edu gorithm that selects the optimal set of auxil- iary views is far too expensive in many cases; Abstract (2) two heuristic algorithms that we present select good (often optimal) sets of auxiliary As views in a data warehouse become more views in a much shorter time; (3) even aux- complex, the view maintenance process can iliary views selected by a very simple al- become very complicated and potentially gorithm can significantly reduce the overall very inefficient. Storing auxiliary views in view maintenance and lineage tracing cost. the warehouse can reduce the complexity and improve the efficiency of view maintenance, 1 Introduction and the same auxiliary views can help in ef- Data warehousing systems collect data from multiple, ficiently answering lineage tracing queries distributed sources and integrate the information as over the warehouse views. In this paper, we materialized views in local databases [CD97, IK93, study the problem of selecting auxiliary views LW95, Wid95]. Users can then perform data analy- to materialize in order to minimize the total sis and mining on the warehouse views. The mate- view maintenance and lineage tracing cost. rialized views in the warehouse need to be kept up- We consider relational views with arbitrary to-date when data at the sources changes. As the use of aggregation operators, and we define view definitions become more complex in order to an initial search space for our optimization support sophisticated data analyses, the view mainte- problem based on a normal form for such nance process can become very complicated and po- view definitions. We present several auxiliary tentially very inefficient. Most previous work on view view selection algorithms, and to study their maintenance, e.g., [CW91, GMS93, LW95, LYC+99, performance we conduct experiments using Qua96], considers simple views containing at most the TPC-D benchmark in addition to synthetic one level of aggregation. In order to efficiently main- view definitions and statistics. The results of tain complex views which may contain multiple levels our experiments show: (1) the exhaustive al- of aggregation, it is clearly advantageous to store aux- ∗ This work was supported by the National Science Foundation iliary data in addition to the original view to reduce under grant IIS-9811947, by Sagent Technology Inc., and by an overall view maintenance cost. equipment grant from IBM Corporation. From a different perspective, for in-depth analysis The copyright of this paper belongs to the paper’s authors. Per- of warehouse data sometimes it is useful to be able mission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct to “drill through” from selected interesting (or possi- commercial advantage. bly erroneous) view data to the original source data Proceedings of the International Workshop on Design that derived the view data. We call this process trac- and Management of Data Warehouses (DMDW’2000) ing the lineage of the view data [CWW97]. To trace Stockholm, Sweden, June 5-6, 2000 the lineage of a view data item efficiently, the ware- (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.) house also needs to store auxiliary data—to reduce the http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/ computation cost at the warehouse, and to reduce or Y. Cui, J. Widom 11-1 entirely avoid expensive source accesses for lineage • Two heuristic algorithms that we present select tracing. It turns out that the same auxiliary data that good (often optimal) sets of auxiliary views in a can be used to improve the performance of view main- much shorter time. tenance as discussed in the first paragraph also can improve the performance of lineage tracing queries. • Even auxiliary views selected by a very sim- Therefore, the problems of selecting auxiliary data for ple algorithm can significantly reduce the overall the two purposes are closely related, and we study the view maintenance and lineage tracing cost. problems together. The auxiliary data is stored as materialized views in 1.1 Outline of Paper the warehouse, called auxiliary views (as opposed to the original warehouse views, which we call primary The remainder of the paper proceeds as follows. Sec- views). Given a complex relational primary view, tion 2 covers related work. Section 3 presents pre- there are numerous possible sets of auxiliary views to liminary material on materialized views, view main- materialize for view maintenance and lineage tracing, tenance, and lineage tracing, including a running ex- with significant performance tradeoffs. In general, the ample. Section 4 introduces the auxiliary views we more auxiliary views we materialize, the more effi- consider for efficient view maintenance and lineage ciently we can maintain and trace the lineage of data in tracing, and defines the search space for selecting aux- the primary view. However, the auxiliary views them- iliary views to materialize. Section 5 describes the selves also need to be maintained, so materializing too cost model and statistics we use for estimating view many auxiliary views can increase overall cost. maintenance and lineage tracing costs, and for study- Previous work has studied the selection of views ing the performance of our auxiliary view selection al- to materialize for answering queries, e.g., [HRU96, gorithms. Section 6 presents several algorithms for se- Gup97], and the selection of auxiliary views for ef- lecting auxiliary views within our search space. Sec- ficient maintenance of given primary views, e.g., tion 7 compares the performance of the algorithms us- [LQA97, RSS96]. (Further discussion of this work ing experiments on the TPC-D benchmark, as well as appears in Section 2.) In [CW00], we introduced using a variety of synthetic view definitions and statis- the idea of materializing auxiliary views to minimize tics. overall view maintenance and lineage tracing cost, and we studied the problem in the context of select- 2 Related Work project-join (SPJ) primary views. This paper inves- tigates the more difficult and general problem of re- Previous work related to this paper falls into three lational views with arbitrary use of aggregation and categories: selecting views to materialize in order to SPJ operators. As we will see, it is an expensive com- minimize query costs, e.g., [HRU96, Gup97], select- ing auxiliary views to materialize in order to min- binatorial problem, and our overall approach differs from [CW00]. imize the cost of maintaining given primary views, e.g., [LQA97, RSS96], and our own previous work In this paper, we first define a normal form for in lineage tracing and view maintenance [CWW97, the primary view definition, which suggests an ini- tial search space of possible auxiliary views. We then CW00]. propose a variety of algorithms for selecting auxiliary [HRU96] proposes a greedy algorithm for selecting views within this search space. Finally, we compare auxiliary views to materialize, with the goal of mini- empirically the running time of our algorithms and the mizing the cost of queries over aggregate views given optimality of the auxiliary view sets they select, using certain constraints such as the maximum number of the TPC-D benchmark [TPC96] in addition to a suite views that can be materialized. The work considers of synthetic view definitions and statistics. The results data-cube views only, and can make certain simpli- of our experiments show: fying assumptions based on this restriction. [Gup97] extends the work in [HRU96] to general relational • The exhaustive algorithm that selects the opti- views, and proves that the auxiliary view selection mal set of auxiliary views is far too expensive problem under maintenance cost constraints is NP- in many cases. hard. Y. Cui, J. Widom 11-2 [RSS96] proposes an exhaustive algorithm for se- rithm, and we compare the performance of our lecting auxiliary views to optimize view maintenance, algorithms (both running time and quality of so- and suggests simple search space pruning strategies lution) through experiments. when the view is too complex for exhaustive search. [LQA97] presents an A* algorithm for selecting aux- 3 Preliminaries iliary views and indexes on different join combina- We now introduce the relational materialized views tions for SPJ view maintenance. Both [RSS96] and we consider, as well as the processes of view main- [LQA97] consider a single algorithm for selecting tenance and lineage tracing, using a running example. auxiliary views (and indexes in the case of [LQA97]), Along the way, we illustrate why materializing auxil- designed specifically for optimizing view mainte- iary views is important for view maintenance and lin- nance. They consider as potential auxiliary views all eage tracing, and why it is useful to consider the two nodes in all possible relevant query plans, making the problems together. search space doubly exponential in the view definition size. 3.1 Materialized Views We introduced lineage tracing for relational data warehouses in [CWW97], presenting a formal frame- To answer a variety of user queries efficiently, a data work and basic algorithms. In [CW00], we introduced warehouse typically computes and stores a number of the problem of selecting auxiliary views to simulta- materialized views [LW95]. In this paper, we con- neously reduce view maintenance and lineage trac- sider relational views with arbitrary use of aggrega- ing costs, and we considered the restricted case of tion, selection, projection, and join operators, which SPJ views. We suggested several alternative auxil- we call ASPJ views. We use an algebraic representa- iary view schemes and compared their performance. tion for the operators: α for grouping and aggregation, In this paper, we tackle the problem for complex rela- σ for selection, π for (duplicate-eliminating) projec- tional views with arbitrary use of aggregation and SPJ tion, and ./ for join. A view definition is presented operators. Arbitrarily complex primary views make using a rooted operator DAG with source tables at the the auxiliary view selection problem more compli- leaves. cated and expensive than for SPJ views, and we take a Any ASPJ view definition v can be transformed different approach to solving it than for the restricted into an equivalent form v 0 composed of α-π-σ-./ case considered in [CW00]. We introduce a normal operator sequences, by commuting and combining form for our view definitions that suggests an initial some select-project-join operators in the view defini- (still exponential) search space for useful auxiliary tion [CWW97]. We call the resulting form v’s ASPJ view sets, and then we consider heuristic algorithms normal form, and we call each α-π-σ-./ sequence a that explore various view sets in this search space. segment. An example will be given shortly. In ASPJ Our work differs from the previous work discussed normal form, a segment may omit the π, σ, or ./ op- above in several ways: erator, but each segment except the topmost must in- clude a non-trivial aggregation operator (or it would • Unlike all previous work besides our own, we be merged with an adjacent segment). Since our view consider lineage tracing as well as view main- definitions are DAGs, they may contain multiple ref- tenance costs when selecting auxiliary views to erences to a source table or to a segment at any level. materialize. We say that a view is an n-level ASPJ view if • Instead of considering a doubly exponential traversing from the root to any leaf in its normalized search space of auxiliary views (as in [HRU96, definition crosses at most n segments. The fan-out of Gup97, LQA97, RSS96]), or a very simple a segment is the number of operands of the segment’s fixed set (as in [CW00]), we explore a “mid- join operator, or 1 if there is no join. dle ground” based on our view definition normal Example 3.1 (Materialized View and Normal Form) form. Consider a data warehouse for a department store • We propose several different auxiliary view se- chain based on the following four tables, some or all lection algorithms, as opposed to a single algo- of which may reside at remote source databases. Y. Cui, J. Widom 11-3 • Store(store-id, city, expenses) sequence of queries and updates called the mainte- gives the city and monthly operating expenses of nance procedure. For 1-level ASPJ views we use the each store. We assume that each city contains at maintenance procedures from [GMS93, Qua96]. The most one store, and that the operating expenses following example shows a 1-level ASPJ view and its do not include employee salaries. maintenance procedure. • Product(product-id, price, cost) Example 3.2 (View Maintenance Procedure) gives the retail price and wholesale cost of each Consider the source tables from Example 3.1 and product item. a 1-level ASPJ view Profit corresponding to the middle α-./ segment in Figure 2: • Sales(store-id, product-id, num) gives the expected monthly number of sales for CREATE VIEW Profit AS SELECT store-id, SUM(num*(price−cost)) AS profit each product at each store. FROM Sales, Product WHERE Sales.product-id = Product.product-id • Employee(emp-id, store-id, GROUP BY store-id salary) gives the monthly salary of each employee at each store. Suppose the Sales table changes over time, and a set of insertions and deletions to the table are Consider a materialized view HighProfit that stored in delta tables ∆Sales and ∇Sales, respec- keeps track of those cities whose stores are very tively. The resulting changes to the view Profit profitable, i.e., whose monthly income exceeds ex- (∆Profit and ∇Profit) can be computed by the penses by at least $100,000. An SQL definition for following maintenance procedure, which uses the HighProfit is shown in Figure 1, and its normal- summary-delta approach from [Qua96]: ized view definition tree is shown in Figure 2. We SELECT store-id, SUM(profit) AS profit INTO SummaryDelta use αG,aggr(A) to represent grouping and aggregation, FROM (SELECT store-id, (num*(price−cost)) as profit where G is a list of grouping attributes, and aggr(A) FROM ∆Sales, Product abbreviates a list of aggregate functions over attributes WHERE ∆Sales.product-id = Product.product-id) UNION in set A [CWW97].1 HighProfit is a 2-level ASPJ (SELECT store-id, -1*(num*(price−cost)) as profit view containing three segments: the topmost π-σ-./ FROM ∇Sales, Product segment with fan-out 3, the leftmost α segment with WHERE ∇Sales.product-id = Product.product-id) fan-out 1, and the middle α-./ segment with fan-out GROUP BY store-id 2. 2 SELECT * INTO ∇Profit FROM Profit WHERE store-id IN 3.2 View Maintenance Procedures (SELECT store-id FROM SummaryDelta) Materialized views must be maintained to keep their SELECT store-id, SUM(profit) INTO ∆Profit contents up-to-date as the source tables they are de- FROM ∇Profit UNION SummaryDelta fined over change. We assume a standard incremental GROUP BY store-id view maintenance approach, as in [GMS93, Qua96]. 2 Insertions and deletions to each source table are mon- itored and recorded in delta tables (∆ and ∇ respec- For an n-level ASPJ view where n > 1, to com- tively) in the warehouse. Updates are modeled as pute the changes to the entire view we can com- deletions followed by insertions. During view mainte- pute the changes for one segment at a time using nance, changes to the view (also expressed as deltas) the maintenance procedure for 1-level views, prop- are computed based on the source delta tables, the agating the deltas upward through the view defini- view contents, and the source data, using a predefined tion DAG. Just as we needed source table Prod- 1 uct along with ∆Sales and ∇Sales to compute This operator is similar to the generalized projection of [GHQ95], but we distinguish between projection and aggregation the deltas for Profit in Example 3.2, to compute operators because of the way our segments and auxiliary views the deltas for a higher-level segment we may need are defined. deltas and/or full contents for each lower segment. Y. Cui, J. Widom 11-4 CREATE VIEW HighProfit AS HighProfit SELECT city FROM Store, π city (SELECT store-id, SUM(num*(price−cost)) AS profit σ profit - expenses - salaries FROM Sales, Product > 100000 WHERE Sales.product-id = Product.product-id GROUP BY store-id) AS P, α store-id, α store-id, (SELECT store-id, SUM(salary) AS salaries sum(salary) sum(num * (price - cost)) FROM Employee as salaries as profit GROUP BY store-id) AS E WHERE Store.store-id = E.store-id AND E.store-id = P.store-id AND P.profit−E.salaries−Store.expenses > 100000 Employee Sales Product Store Figure 1: SQL definition for HighProfit Figure 2: Normal form for HighProfit For example, suppose we want to compute the deltas given tuple t ∈ V , t’s lineage in T1 , . . . , Tm accord- for HighProfit given ∆Profit and ∇Profit, ing to v can be computed with the following query: where again Profit corresponds to the middle α-./ segment as in Example 3.2. We need to join the delta T Qt,v = SplitT1 ,...,Tm (σC∧G=t.G (T1 ./ · · · ./ Tm )) tables for Profit with source table Store, as well as with the leftmost α segment in Figure 2, then per- where Ti , denotes the schema of table Ti , i = form the selection and projection in the topmost seg- 1..m, and Split is an operator that breaks a ta- ment. If we materialize an auxiliary view Salary cor- ble into multiple projections: SplitA1 ,...,Am (T ) = responding to the leftmost α segment, we can signif- hπA1 (T ), . . . , πAm (T )i.2 Given a tuple set T ⊆ V , icantly improve the performance of the maintenance we can simultaneously trace all the tuples in T with: procedure by avoiding recomputation of the aggre- T QT,v = SplitT1 ,...,Tm (σC (T1 ./ · · · ./ Tm ) n T ) gate values. In addition to materializing “intermedi- ate” views, if source tables are remote and expensive Example 3.3 (Lineage Tracing Query) Consider (or impossible) to access, we may want to replicate the view Profit in Example 3.2. The lineage some or all of the source tables as auxiliary views at tracing query for a tuple t ∈ Profit is the warehouse. Note that our approach in this paper applies to all T Qt,Profit = SplitT1 ,T2 (σC (T1 ./ T2 )) views, including those with non-incrementally main- tainable aggregates (e.g., min, max). In the presence where T1 = Sales, T2 = Product, and C = of such aggregates, the maintenance procedure must “store-id = t.store-id”. The SQL presentation of involve some recomputation, but auxiliary views may the query is as follows: still be of benefit. SELECT Sales.* INTO LN Sales FROM Sales, Product 3.3 Lineage Tracing WHERE Sales.product-id = Product.product-id AND Sales.store-id = t.store-id Given a materialized view in a data warehouse, in ad- dition to issuing regular queries or performing other SELECT Product.* INTO LN Product kinds of analysis over the view, we may want to trace FROM Sales, Product the lineage of selected “interesting” tuples in the view. WHERE Sales.product-id = Product.product-id AND Sales.store-id = t.store-id The lineage of a view tuple is defined as the set of original source tuples that derived the given view tu- where LN Sales and LN Product contain the ple. To trace the lineage of a view tuple, we use a lineage of t according to view Profit in the source predefined sequence of queries called tracing queries tables Sales and Product, respectively. 2 (TQs) [CWW97]. 2 When we execute the tracing query, the selection condition Given a 1-level ASPJ view V whose definition is σC∧G=t.G is pushed down to individual Ti ’s whenever possible v = αG,aggr(B) (πA (σC (T1 ./ · · · ./ Tm ))), and to improve tracing query performance. Y. Cui, J. Widom 11-5 To trace the lineage of an n-level ASPJ view where lineage tracing queries take advantage of the auxiliary n > 1, we logically define an intermediate view for views (Section 4.2). Finally, we formally define the each segment, and then recursively trace through the auxiliary view selection problem and estimate the size hierarchy of intermediate views top-down. At each of our search space (Section 4.3). level, we use the tracing query for a 1-level ASPJ view to compute the lineage for the current traced tuples 4.1 The Auxiliary Views We Consider with respect to the intermediate views or source ta- Let us first define two types of potentially useful aux- bles at the next level below. The necessary interme- iliary views, based on a single segment. (Similar aux- diate results can either be computed at tracing time, iliary views were introduced in the context of SPJ pri- or we can materialize certain intermediate results as mary views in [CW00].) Any segment can be thought auxiliary views for the purpose of lineage tracing. of as a view definition v = αG,aggr(B) (πA (σC (T1 ./ For example, to trace the lineage of a tuple t in · · · ./ Tm ))), where each T1 , . . . , Tm is either a view HighProfit, we logically define intermedi- source table or a lower-level segment (view). Let V ate views Salary and Profit corresponding to denote the materialization of v over T1 , . . . , Tm . the leftmost α segment and middle α-./ segment, re- spectively, in Figure 2. We trace the lineage of tu- 1. Lineage View (LV) for v: We can store the inter- ple t in Salary, Profit, and Store, producing mediate result LV (v) = σC (T1 ./ · · · ./ Tm ) hLN Salary, LN Profit, LN Storei, using the fol- to help trace the lineage of tuples in V . We can lowing tracing query: rewrite the lineage tracing queries in Section 3.3 using LV (v) as: T Q = SplitT1 ,T2 ,T3 (σC (T1 ./ T2 ./ T3 )) T Qt,v = SplitT1 ,...,Tm (σG=t.G (LV (v))) where T1 = Salary, T2 = Profit, T3 = Store, and T QT,v = SplitT1 ,...,Tm (LV (v) n T ) C = “profit − expenses − salaries > 100000 ∧city = t.city”. Then, we further trace the lineage The maintenance procedure for V also can be of the tuples in LN Salary and LN Profit in the simplified. If LV (v) is materialized, then we source tables to produce LN Employee, LN Sales, compute ∆LV (v) and ∇LV (v), and the query and LN Product. As with view maintenance, ma- for computing the summary-delta table in the terializing rather than recomputing intermediate re- maintenance procedure (Section 3.2) can be sults can significantly improve tracing performance. rewritten as: Since lineage tracing queries always return data from SummaryDelta =αG,aggr(B) (αG,aggr(B) (∆LV ) source tables by definition, replicating (portions of) the source data at the warehouse may be advanta- ∪ αG,−aggr(B) (∇LV )) geous, for the same reasons outlined in Section 3.2. 4 Auxiliary Views for View Maintenance and 2. Split Lineage Tables (SLTs) for v: For a view Lineage Tracing (segment) v 0 whose joins is many-to-many, LV (v 0 ) can be very large and inefficient to main- As motivated in Section 3, it may be advantageous tain. Thus, another possibility is to “split” the to materialize certain auxiliary views in a data ware- Lineage View and store a set of smaller tables: house to improve the performance of view mainte- SLTi (v) = πTi (σC (T1 ./ · · · ./ Tm )), i = nance and lineage tracing. View maintenance pro- 1..m. The lineage tracing queries can then be cedures and lineage tracing queries use the auxil- rewritten using the SLTs as: iary views to avoid recomputations and expensive source queries, thereby reducing maintenance and T Qt,v =SplitT1 ,...,Tm (σG=t.G (σC ((SLT1 (v) query costs. There are many possible sets of auxiliary n T ) ./ · · · ./ (SLTm (v) n T )) n T )) views to materialize. In this section, we first specify a number of potentially useful auxiliary views for ar- T QT,v =SplitT1 ,...,Tm (σC ((SLT1 (v) n T ) bitrary n-level ASPJ primary views (Section 4.1). We ./ · · · ./ (SLTm (v) n T )) n T ) then discuss how view maintenance procedures and Y. Cui, J. Widom 11-6 Although these tracing queries look much more HighProfit complex than with LV, performance can some- π times be much better due to the smaller size of σ LV2 or SLT 2 the SLTs. Furthermore, as with LV, the mainte- AG 1 nance procedure for V can use the deltas for the α α AG2 LV1 or SLT1 SLTs to be much more efficient. See [CW00] for details. BT 1 BT 2 BT 3 BT 4 Given a general ASPJ view definition in normal form, in addition to considering Lineage Views and Employee Sales Product Store Split Lineage Tables for each segment, we also may consider storing copies of some or all of the source ta- Figure 3: Possible auxiliary views for HighProfit bles, to avoid expensive (remote) source queries dur- ing view maintenance and lineage tracing. We refer the source tables. For each source table R, we to these source table copies in the warehouse as Base decide whether to store a Base Table (BT) copy Tables (BTs). Finally, maintaining the results of in- of R. If BT is not materialized, we may need to termediate aggregations in the view (AGs) also can be issue queries directly to source table R for view very helpful in view maintenance and lineage tracing, maintenance and lineage tracing.3 as motivated in Section 3. To summarize, we divide the normalized view def- Example 4.1 (Auxiliary Views) Recall our example inition into three types of components, and for each view HighProfit from Figure 2. Figure 3 shows type of component we have certain choices of possi- the three ASPJ segments in the view definition and all ble auxiliary views to materialize: of the possible auxiliary views we consider material- izing for HighProfit. 2 1. Topmost Segment: the segment at the root of the view definition DAG. Note that the α, π, σ, Notice that because of our search space reduction, and/or ./ operators (but not all of them) may be it is possible that there are useful auxiliary views we omitted in this segment. Also note that the top- are not considering, notably different join combina- most node corresponds to the primary view itself, tions in the case of a many-way join. This special so its contents are always materialized. We may case is considered in detail in [LQA97], and we could further choose to materialize the Lineage View extend our search space accordingly. (LV) or the Split Lineage Tables (SLTs) for this segment, but not both. (If we store one, then stor- 4.2 Using Auxiliary Views for View Maintenance ing the other will not further reduce the lineage and Lineage Tracing tracing or overall maintenance cost.) In Section 4.1 we gave examples of how to rewrite 2. Intermediate Segment: a non-root segment that queries for view maintenance and lineage tracing us- is defined over the source tables and/or other seg- ing auxiliary views. In general, when we have a set ments. Note that the π, σ, and/or ./ operators of auxiliary views available, there may be more than may be omitted in this segment, but the α opera- one way to rewrite a query to take advantage of aux- tor is always present. For an intermediate seg- iliary views. We assume that the “best” rewriting is ment, we consider materializing the following selected, and this assumption is reflected in the cost auxiliary views: model we present in Section 5. (a) The contents of the α node (AG) As an example, Figure 4 shows the rewritings of the lineage tracing query TQ for a tuple t according to the (b) The Lineage View (LV) or the Split Lineage topmost segment in the definition of HighProfit Tables (SLTs), but not both using the auxiliary views in Figure 3: 3. Source Table: We assume that all local selec- 3 Most existing data warehousing systems automatically store tion conditions in the view—predicates that in- a copy of each source table in the warehouse. However, as we will volve a single source table—are pushed down to see in Section 6, sometimes it is not beneficial to store a copy. Y. Cui, J. Widom 11-7 T Q1 =SplitSalary,Profit,Store(σprofit−expenses−salaries>100000∧city=t.city (AG1 ./ AG2 ./ BT4 )) T Q2 =SplitSalary,Profit,Store(σcity=t.city (LV2 )) Salary T Q3 =SplitSalary,Profit,Store(σcity=t.city (SLT2 ./ SLT2Profit ./ SLT2Store)) Figure 4: Tracing query rewritings Suppose that LV2 , AG1 , and BT4 are materialized. Table 1: Statistics for cost estimation Then we could use query T Q2 , or (among other op- tions) we could use a tracing query similar to T Q1 Parameter Description that recomputes the contents of AG2 . In this case it usage statistics (for each primary view) is likely that T Q2 would be chosen as the best query query rate # of tracing queries per unit time period rewriting based on the available auxiliary views. query size # of tuples traced per query usage statistics (for each source table) 4.3 The View Selection Problem and the Search update rate # of source table updates per unit time period update size # of changed tuples per source table update Space Size data statistics (for each source table) We have shown that various auxiliary views can be tuple num # of tuples in a source table used in the view maintenance and lineage tracing pro- tuple size size of tuples in a source table (in bytes) cesses. Our goal is to select among the choices of aux- data statistics (for each view segment) iliary views described in Section 4.1 a set that mini- fan-out # of joined tables mizes overall cost: the cost of lineage tracing plus the join ratio # of joining tuples / # of tuples in cross-product select ratio # of selected tuples / # of tuples before selection cost of maintaining the primary and auxiliary views. proj ratio # of bytes projected / tuple size before projection Our cost model is described in Section 5. Here let aggr ratio # of aggr tuples / # of tuples before aggregation us consider the size of our search space. Suppose we system statistics (for each source or warehouse) have an n-level ASPJ view in normal form, and con- block size # of bytes in a block sider a balanced view definition tree4 with a fan-out disk cost cost to read/write a disk block (in ms/block) of m in each segment. There is one topmost segment, net cost network transmission cost (in ms/byte) and for that segment we have 3 auxiliary view options: LV, SLTs, or nothing (case 1 in Section 4.1). There of possible auxiliary view sets for HighProfit is are m1 + m2 + · · · + mn−1 = O(mn−1 ) intermediate 26 · 32 = 384. 2 segments, each having 2 options for case 2(a) in Sec- tion 4.1 (AG or nothing) and 3 options for case 2(b) The number of choices for HighProfit is quite (LV, SLTs, or nothing). Finally, there are mn source manageable. However, real warehouse views tend to tables, each having 2 options: BT or nothing. There- have much higher fan-outs, as well as possibly more fore, the size of the entire search space is levels. As we will see in Section 7, even for a view n−1 ) n n with only 2 levels and average fan-out of 5, we can- 31 · (2 · 3)O(m · 2(m ) = O(2m ) not consider all possible auxiliary view sets due to the large search space. If k is the total number of components in the view def- inition, where a component is a segment or a source 5 Cost Model table, then k = O(mn ) and the search space size is O(2k ). In this section we present the model that we use to es- timate view maintenance and lineage tracing costs for Example 4.2 (Search Space Size) Consider our ex- a given primary view and set of auxiliary views. The ample view HighProfit (Figure 3). The number statistics our cost model relies on are listed in Table 1. 4 A tree is balanced if each leaf node in the tree has the same Values for these statistics are set for each experiment, depth. We consider this view definition shape since it represents as described in Section 7. We briefly outline our cost the largest search space size for an n-level view. estimation procedure as follows. Y. Cui, J. Widom 11-8 Let cost(Q, s) denote the estimated cost of evaluat- Finally, the total cost is the combination of lineage ing a query Q at the warehouse given a set of statistics tracing cost and view maintenance cost: s. Q could be a lineage tracing query, or a query or update in a view maintenance procedure. To compute total cost(v, A, s) = q(v, A, s) + m(v, A, s) cost(Q, s) we use a fairly conventional cost model In our experiments, we measure the optimality of for relational queries in a distributed database setting, given sets of auxiliary views, by which we mean how similar to, e.g., [LQA97, Ull89, ZGMHW95]. Details close the sets of views come to the set that yields the are omitted, but the cost formulas rely on all of the lowest estimated cost. For a given primary view v and statistics from Table 1, and assume no indexes. statistics s, let Aopt denote the set of auxiliary views Suppose we have a primary view v and a set of aux- within our search space (Section 4) with the lowest iliary views A = {v1 , ..., vn }. To trace the lineage of total cost total cost(v, Aopt , s). For a set of auxiliary tuples in the primary view given the materialized aux- views A, we define the optimality of A as: iliary views in A, there are various possible rewrit- ings of the lineage tracing queries using the auxiliary total cost(v, Aopt , s) views (recall Section 4.2). Our cost model selects the optimality(A) = total cost(v, A, s) sequence of lineage tracing queries with the lowest es- timated cost. Let q(v, A, s) denote the estimated lin- 6 Algorithms for Selecting Auxiliary Views eage tracing cost for a given primary view v, set of Having defined the search space for the optimization auxiliary views A, and statistics s: problem and the cost model that we use, we now in- X troduce four different algorithms for selecting a set q(v, A, s) = cost(Qi , s) 1..m of auxiliary views within the search space. The in- put to each algorithm is the primary view definition v where Q1 , . . . , Qm is the set of tracing queries se- in ASPJ normal form, and a set of statistics s as speci- lected for v given auxiliary view set A. Note that the fied in Section 5. The output is a set of auxiliary views lineage query rate and the average number of tuples A. traced in a lineage query (part of our usage statistics in Table 1) are included in the input statistics s, and 6.1 Exhaustive Algorithm thus are incorporated into the lineage cost estimated by q(v, A, s). The exhaustive algorithm enumerates all choices in Maintenance costs are incurred both for the pri- the search space, estimates the cost of each choice, mary view v and for the auxiliary views in A = and picks the cheapest one. For our example view {v1 , ..., vn }. As with lineage tracing, when there are HighProfit, the exhaustive algorithm considers multiple possible rewritings for the view maintenance all 384 possible combinations of auxiliary views (re- queries and updates using the auxiliary views in A, call Figure 3). We set a sample set of statistics s our cost model selects the ones with the lowest es- as shown in Table 2, including statistics for view timated cost. Let m(v, A, s) denote the estimated HighProfit, source tables Employee, Sales, maintenance cost for a given primary view v, set of Product, and Store, as well as each ASPJ seg- auxiliary views A, and statistics s: ment in the view definition (Figure 3). Over this set X of statistics, the exhaustive algorithm selects A = m(v, A, s) = cost(Mi , s) {BT4 , AG1 , AG2 , SLT1 , LV2 }. The exhaustive algo- 1..n rithm always finds the optimal auxiliary view set ac- where M1 , . . . , Mn is the set of maintenance queries cording to our cost model. However, the complexity and updates selected to maintain primary view v and of the algorithm is the same as the search space size: the auxiliary views in A. Note that the source ta- O(2k ) where k is the number of components in the ble update rate and average number of source tuples view definition (recall Section 4.3). changed in each update (part of our usage statistics in 6.2 Naive Algorithm Table 1) are included in the input statistics s, and thus are incorporated into the maintenance cost estimated At the other end of the spectrum, we con- by m(v, A, s). sider a naive algorithm that selects a fixed set Y. Cui, J. Widom 11-9 Table 2: Statistics for view HighProfit Parameter name Values HighProfit Employee Sales Product Store segment 1 segment 2 segment 3 query rate (#/unit time) 100 query size (tuples) 1 update rate (#/unit time) 10 10 10 0 update size (tuples) 1 100 1 0 tuple num 10000 1000000 100000 100 tuple size (bytes) 1000 500 500 400 fan-out 1 2 3 join ratio 0.0002 0.0001 select ratio 0.1 proj ratio 0.1 0.1 0.1 aggr ratio 0.01 0.001 0.2 block size (bytes) 8K 8K 8K 8K 8K disk cost (ms/block) 1 1 1 1 1 net cost (ms/byte) 0 0.00001 0.0001 0.0001 0.0001 of auxiliary views: Lineage Views (LVs) for the not guarantee an optimal answer, nor even an answer topmost and all intermediate segments, aggrega- within some percentage of optimal. In Section 7.3, we tion results (AGs) for all intermediate segments, will see a scenario where the greedy algorithm per- and all Base Tables (BTs). For example view forms very poorly. HighProfit, the naive algorithm selects A = {BT1 , BT2 , BT3 , BT4 , AG1 , AG2 , LV1 , LV2 }. Even 6.4 Three-Step Algorithm though this naive fixed set of auxiliary views may not be optimal—in fact it can be arbitrarily bad compared Our last algorithm divides the auxiliary view selec- to the optimal set—our experimental results in Sec- tion process into three phases. See Figure 5. In the tion 7 show that the naive algorithm selects reasonably first phase, we use a greedy approach to add auxiliary good view sets in many cases, especially considering views of the AG and BT types only. In the second its simplicity. The complexity of the naive algorithm phase, we decide for the topmost and each intermedi- is O(1). ate segment whether to add LV or SLTs. At this point, it may turn out that some of the AG or BT views se- lected in the first phase are no longer beneficial given 6.3 Greedy Algorithm the LV or SLT views selected in the second phase, and We also consider a conventional greedy algorithm. they incur maintenance cost. Thus, in third phase we This algorithm initializes the auxiliary view set A to remove AG and BT views that are not beneficial, and be empty. In each iteration, it adds into A the auxiliary we do so in a greedy manner. view (not yet in A) that brings the most benefit, i.e., For example view HighProfit and the same reduces the total cost the most given the current set of statistics used in Sections 6.1 and 6.3, the three-step views in A. Iteration continues until there are no more algorithm also selects the optimal set of auxiliary auxiliary views outside A that can further reduce the views. In phase 1, it selects AG and BT views in total cost. For our example view HighProfit and the following order: AG2 , AG1 , BT2 , BT3 , BT4 . In the same sample statistics used in the exhaustive algo- phase 2, it selects SLT1 and LV2 . In phase 3, it re- rithm (Section 6.1), the greedy algorithm selects the moves BT2 and BT3 (in that order) because they are optimal set of auxiliary views in the following order: no longer beneficial given the views selected in phase AG2 , SLT1 , AG1 , LV2 , BT4 . 2. The greedy algorithm has complexity O(k2 ) in- The three-step algorithm has complexity O(k2 ), stead of O(2k ) as in the exhaustive algorithm, and it which is the same as the greedy algorithm, but its ac- selects the optimal auxiliary view set in most cases tual running time is less than the greedy algorithm by (see Section 7). However, the greedy algorithm can- a linear factor. The first phase of the three-step algo- Y. Cui, J. Widom 11-10 input: primary view v, statistics s output: auxiliary view set A begin A ← ∅; // phase 1: use greedy algorithm on AG and BT nodes Aall ← all possible auxiliary views for v; while true do for each vi ∈ Aall of type AG or BT such that vi 6∈ A do benefiti ← total cost(v, A, s) − total cost(v, A ∪ {vi }, s); pick vi with the highest benefiti ; if benefiti ≤ 0 then break else A ← A ∪ {vi }; endwhile; // phase 2: decide LV and SLTs for the topmost and each intermediate segment do // Let LV and SLT be the Lineage View and Split Lineage Tables for the segment cost1 ← cost(v, A ∪ {LV }, s); cost2 ← cost(v, A ∪ {SLT }, s); if cost1 ≥ cost2 > cost(v, A, s) then A ← A ∪ {LV } else if cost2 > cost1 > cost(v, A, s) then A ← A ∪ {SLT }; endfor; // phase 3: remove useless AGs and BTs while true do for each vi ∈ A of type AG or BT do benefiti ← total cost(v, A − {vi }, s) − total cost(v, A, s); pick vi with the lowest benefiti ; if benefiti > 0 then break else A ← A − {vi }; endwhile; return A; end Figure 5: The three-step algorithm rithm is faster than the greedy algorithm since it only practical option is to combine the two algorithms: selects from the AG and BT views, instead of from run both algorithms and select whichever answer has all auxiliary views. The second phase is linear in the lower estimated cost. The running time of the com- number of segments. The third phase only examines bined algorithm remains O(k2 ). the AG and BT views selected in the first phase, which is a small number in most cases. 7 Performance Study Like the greedy algorithm, the three-step algorithm In this section, we study the performance of the four usually selects the optimal set of auxiliary views (see algorithms specified in Section 6, comparing their Section 7). However, also like the greedy algorithm, running times and the optimality of the answers they the three-step algorithm cannot make any guarantees produce. We also compare the cost of the answers pro- about the optimality of its answers. In Section 7.3 we duced by these algorithms against the cost of storing will see a scenario where the three-step algorithm per- no auxiliary views. In Section 7.1, we present results forms poorly. Interestingly, in the case we show where of experiments using the schema, statistics, and some the three-step algorithm performs poorly, the greedy views from the TPC-D benchmark. In Section 7.2, algorithm performs well, and vice-versa. Thus, one we present results of experiments using more complex Y. Cui, J. Widom 11-11 Q5 Q11 Q17 V1 α π α α levels = 2 σ π π fan-out = 3 π σ σ σ α α π π α α α Customer Region σ Part α π π π Order Nation π σ σ σ Lineitem Supplier PartSupp Nation Lineitem Supplier R1 R2 R3 R4 R5 R6 R7 R8 R9 Figure 6: Materialized views for TPC-D experiments Figure 9: Structure of view v1 exhaustive levels fan-out query/update ratio 100% greedy v1 2 3 100 three-step v2 2 3 10 80% naive v3 2 3 1 none v4 2 3 0.1 60% v5 2 3 0 v6 6 1 10 40% v7 2 5 10 20% Figure 10: Synthetic configurations 0% Q5 Q11 Q17 der are the fact tables according to the benchmark Figure 7: Optimality for TPC-D views specification. For views, we select queries Q5 , Q11 , and Q17 from the benchmark, since they are relatively complex and differ somewhat from each other. In each exhaustive greedy three-step naive none case, we treat the benchmark query as the definition Q5 11.15 4.79 2.38 0.08 0.09 of our primary materialized view to be stored at the Q11 14.07 0.94 0.38 0.03 0.04 Q17 0.62 0.29 0.10 0.02 0.03 warehouse. The general structure of each of the three views is shown in Figure 6. The complete list of sta- Figure 8: Running time in seconds for TPC-D tistical settings (recall Table 1) for our three TPC-D experiments is shown in Appendix A, Tables 3–5. synthetic view definitions. Since the greedy and three- Recall that we are comparing five algorithms—the step algorithms perform quite well in all of the experi- four algorithms from Section 6, as well as the “algo- ments in Sections 7.1 and 7.2, in Section 7.3 we show rithm” that selects no auxiliary views (which we call experiments illustrating that greedy and three-step can algorithm none). Figure 7 plots the optimality of the perform poorly. five algorithms for each of the TPC-D views we con- sider. Recall from Section 5 that optimality is defined 7.1 TPC-D Experiments as the cost of the optimal auxiliary view set divided Our first set of experiments is based on the TPC-D by the cost of the chosen view set. Figure 8 plots the benchmark [TPC96]. We use the schema of tables running time of the algorithms. Note that algorithms Customer, Order, Lineitem, Supplier, Na- naive and none do incur a small running time, which tion, Region, PartSupp, and Part from the is the time required to compute the cost of their one benchmark for our experiments. The table statistics solution. we use correspond to a scaling factor of 1. The re- We can see from Figure 7 that storing no auxiliary maining statistics from Table 1 are set according to views can be dramatically worse than storing some, the benchmark and commonly used database and net- even those selected by the naive algorithm. We also work system settings. For example, we set the update see from Figure 7 that the greedy and three-step al- rate for the Lineitem and Order tables to be much gorithms select the optimal auxiliary view set for all higher than other tables, since Lineitem and Or- three views, and from Figure 8 we see that they do so Y. Cui, J. Widom 11-12 exhaustive greedy 100% three-step 80% naive none 60% 40% 20% 0% V1 V2 V3 V4 V5 V6 V7 Figure 11: Optimality for synthetic views exhaustive greedy three-step naive none for view v7 the exhaustive algorithm never finished, v1 2411.68 3.18 0.67 0.05 0.07 so optimality is measured against the auxiliary view v2 2414.86 3.18 0.99 0.03 0.03 set selected by the greedy algorithm.) The greedy and v3 2455.74 3.28 0.71 0.08 0.03 v4 2493.06 3.17 0.63 0.03 0.05 three-step algorithms find their answer in a small frac- v5 2376.53 3.93 0.64 0.03 0.03 tion of the running time required by the exhaustive v6 757.49 0.99 0.17 0.02 0.02 algorithm, and three-step is much faster than greedy. v7 156.36 35.29 0.33 0.21 Another interesting result is that algorithm none per- forms better when the query/update ratio is lower (ex- Figure 12: Running time for synthetic views (sec) periments v3 –v5 ). However, we should not infer that in a small fraction of the running time required by the the benefit of auxiliary views is primarily for lineage exhaustive algorithm. We also note that the three-step tracing. In fact, in experiment v5 the query/update ra- algorithm runs considerably faster than the greedy al- tio is set to 0 (indicating view maintenance only), and gorithm. we still see significant benefit to using auxiliary views. Next, we consider in more detail how the running 7.2 Synthetic Experiments times of the exhaustive, greedy, and three-step algo- rithms are affected by view complexity. In Figure 13, Our next set of experiments is conducted using syn- we consider views where we fix the number of levels thetic views and data statistics. The views we consider at 2 and increase the fan-out from 1 to 9. The exhaus- all have a regular tree definition, as illustrated by view tive, greedy, and three-step algorithms become pro- v1 in Figure 9. We consider seven different views. hibitive when the fan-out exceeds 3, 7, and 8, respec- Figure 10 summarizes the “shape” of each view defi- tively. We see similar behavior in Figure 14, where we nition tree (number of levels and fan-out of each seg- fix the fan-out at 2 and increase the number of levels ment), along with the query/update ratio, which repre- from 1 to 8. sents the ratio of the average number of tracing queries per unit time to the average number of source updates 7.3 When Greedy and Three-Step Fail (recall Table 1). The complete set of statistical set- tings for the seven experiments is summarized in Ap- The greedy and three-step algorithms select the op- pendix A, Tables 6–8. timal (or in one case very near to optimal) auxiliary Figure 11 plots the optimality of our five algorithms view set in all of the experiments reported in Sec- for each of the seven synthetic views we consider. Fig- tions 7.1 and 7.2. However, there are cases in which ure 12 plots the running time of the algorithms. The these algorithms fail to pick an optimal or even near- greedy and three-step algorithms always select the op- optimal answer. timal auxiliary view set or, in the one case of the three- Consider a simple view definition v and the auxil- step algorithm on v1 , very near to optimal. (Actually, iary views that are considered for v in Figure 15. By Y. Cui, J. Widom 11-13 10000 V exhaustive running time (sec) 8000 greedy α three-step 6000 π LV or SLTs 4000 σ 2000 BTR BTS 0 1 2 3 5 6 7 8 4 9 R S fan-out Figure 13: Running time vs. fan-out Figure 15: View structure used for v8 and v9 10000 exhaustive exhaustive greedy running time (sec) 8000 greedy 100% three-step three-step 80% naive 6000 none 60% 4000 40% 2000 20% 0 1 2 3 4 5 6 7 8 0% V8 V9 # levels Figure 14: Running time vs. # levels Figure 16: Optimality for v8 and v9 8 Conclusion setting the data statistics (Table 1) to different values, We have examined the problem of selecting auxiliary we can vary the costs and benefits of the four differ- views to materialize in a data warehouse in order to ent auxiliary views. We have set two different con- reduce the cost of view maintenance and lineage trac- figurations, which we call v8 and v9 , both based on ing for complex primary views. We specify a normal the view in Figure 15. The complete set of statistical form for view definitions and use it to define an initial settings for these two experiments is summarized in search space of potentially beneficial auxiliary views. Appendix A, Tables 9 and 10. In particular, we set the We presented four algorithms for exploring the search source table network costs for v8 to be much higher space and selecting a set of auxiliary views: exhaus- than for v9 , and we set the join ratio of v8 higher (less tive, greedy, three-step, and naive. We compared the selective) than v9 . The optimal set of auxiliary views optimality and running time of the algorithms using for v8 is {BTR , BTS }, and the optimal set for v9 is experiments based on the TPC-D benchmark, as well {LV }. as on a variety of synthetic views and statistics. Figure 16 plots the optimality of all five algorithms Our experiments indicate that in terms of running on views v8 and v9 . In particular, the greedy al- time and optimality, the three-step algorithm appears gorithm performs extremely poorly on v8 (selecting to be the best, although the running time of the greedy {LV } instead of {BTR , BTS }), while the three-step algorithm probably also is fast enough in practice for algorithm misses the optimal solution for v9 (select- most complex warehouse views. (The exhaustive al- ing {BTR , BTS } instead of {LV }). However, as sug- gorithm, on the other hand, becomes intractable quite gested in Section 6, if we use a combined algorithm quickly.) Both the greedy and three-step algorithms that runs both greedy and three-step and then selects find the optimal auxiliary view set in most cases, al- the lower-cost solution, we will select the optimal though we have shown (complementary) situations view set for both v8 and v9 . in which either one algorithm or the other performs Y. Cui, J. Widom 11-14 poorly. Our experiments also illustrate that even a In Proc. of the Twenty-First Inter- naive selection of auxiliary views reduces overall cost national Conference on Very Large dramatically in most cases, underscoring the impor- Data Bases, pages 358–369, Zurich, tance of materializing auxiliary views for the dual pur- Switzerland, September 1995. poses of view maintenance and lineage tracing in a warehousing environment. [GMS93] A. Gupta, I. S. Mumick, and V. Sub- Although we have presented our work in the con- rahmanian. Maintaining views in- text of selecting an auxiliary view set for a single pri- crementally. In Proc. of the ACM mary warehouse view, our approach extends easily to SIGMOD International Conference on considering multiple primary views together. Then Management of Data, pages 157–166, the cost of an auxiliary view may be “shared” if the Washington, DC, May 1993. auxiliary view is beneficial to view maintenance or [Gup97] H. Gupta. Selection of views to mate- lineage tracing for more than one primary view. Fur- rialize in a data warehouse. In Proc. of thermore, although we have studied an environment in the Sixth International Conference on which both view maintenance and lineage tracing are Database Theory, pages 98–112, Del- important, if only one type of activity is present our phi, Greece, January 1997. algorithms remain applicable. [HRU96] V. Harinarayan, A. Rajaraman, and References J. D. Ullman. Implementing data [CD97] S. Chaudhuri and U. Dayal. An cubes efficiently. In Proc. of the ACM overview of data warehousing and SIGMOD International Conference on OLAP technology. SIGMOD Record, Management of Data, pages 205–216, 26(1):65–74, March 1997. Montreal, Canada, June 1996. [CW91] S. Ceri and J. Widom. Deriv- [IK93] W.H. Inmon and C. Kelley. Rdb/VMS: ing production rules for incremen- Developing the Data Warehouse. tal view maintenance. In Proc. QED Publishing Group, Boston, of the International Conference on Massachussetts, 1993. Very Large Databases, pages 577–589, Barcelona, Spain, September 1991. [LQA97] W.J. Labio, D. Quass, and B. Adel- berg. Physical database design for [CW00] Y. Cui and J. Widom. Practical lineage data warehousing. In Proc. of the tracing in data warehouses. To appear Thirteenth International Conference in Proc. of the Sixteenth International on Data Engineering, pages 277–288, Conference on Data Engineering, San Birmingham, UK, April 1997. Diego, California, February 2000. [LW95] D. Lomet and J. Widom, editors. Spe- [CWW97] Y. Cui, J. Widom, and J.L. Wiener. cial Issue on Materialized Views and Tracing the lineage of view data Data Warehousing, IEEE Data Engi- in a warehousing environment. neering Bulletin 18(2), June 1995. Technical report, Stanford Univer- sity Database Group, November [LYC+ 99] W. Labio, J. Yang, Y. Cui, H. Garcia- 1997. Available at http://www- Molina, and J. Widom. Performance db.stanford.edu/pub/papers/lineage- issues in incremental warehouse main- full.ps. tenance. Technical report, Stanford University Database Group, October [GHQ95] A. Gupta, V. Harinarayan, and 1999. Available at http://www- D. Quass. Aggregate-query process- db.stanford.edu/pub/papers/whips- ing in data warehousing environments. wm.ps. Y. Cui, J. Widom 11-15 [Qua96] D. Quass. Maintenance expressions A Statistical Settings for Experiments for views with aggregation. In Proc. of the Workshop on Materialized Views: Techniques and Applications, pages 110–118, Montreal, Canada, June 1996. [RSS96] K.A. Ross, D. Srivastava, and S. Su- darshan. Materialized view mainte- nance and integrity constraint check- ing: Trading space for time. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 447–458, Montreal, Canada, June 1996. [TPC96] Transaction Processing Performance Council. TPC-D Benchmark Specifi- cation, Version 1.2, 1996. Available at: http://www.tpc.org/. [Ull89] J. D. Ullman. Database and Knowledge-base Systems (Vol 2). Computer Science Press, 1989. [Wid95] J. Widom. Research problems in data warehousing. In Proc. of the Fourth In- ternational Conference on Information and Knowledge Management, pages 25–30, Baltimore, Maryland, Novem- ber 1995. [ZGMHW95] Y. Zhuge, H. Garcia-Molina, J. Ham- mer, and J. Widom. View mainte- nance in a warehousing environment. In Proc. of the ACM SIGMOD Interna- tional Conference on Management of Data, pages 316–327, San Jose, Cali- fornia, May 1995. Y. Cui, J. Widom 11-16 Table 3: Statistics for Q5 Parameter name Values Q5 Customer Order LineItem Supplier Nation Region segment 1 query rate (#/unit time) 100 query size (tuples) 1 update rate (#/unit time) 1 10 40 1 0 0 update size (tuples) 1 10 10 1 0 0 tuple num 150000 1500000 6000000 10000 25 5 tuple size (bytes) 300 100 300 200 100 100 fan-out 6 join ratio 0.00001 select ratio 0.01 proj ratio 0.1 aggr ratio 0.001 block size (bytes) 8K 8K 8K 8K 8K 8K 8K disk cost (ms/block) 1 1 1 1 1 1 1 net cost (ms/byte) 0 0.0001 0.00001 0.00001 0.0001 0.0001 0.0001 Table 4: Statistics for Q11 Parameter name Values Q11 PartSupp Supplier Nation segment 1 segment 2 segment 3 query rate (#/unit time) 100 query size (tuples) 1 update rate (#/unit time) 10 1 0 update size (tuples) 1 1 0 tuple num 800000 10000 25 tuple size (bytes) 100 200 100 fan-out 1 3 3 join ratio 0.0006 0.0006 select ratio 0.1 0.04 0.04 proj ratio 0.6 0.1 0.1 aggr ratio 0.005 0.005 block size (bytes) 8K 8K 8K 8K disk cost (ms/block) 1 1 1 1 net cost (ms/byte) 0 0.0001 0.0001 0.0001 Y. Cui, J. Widom 11-17 Table 5: Statistics for Q17 Parameter name Values Q17 Part Lineitem segment 1 segment 2 query rate (#/unit time) 100 query size (tuples) 1 update rate (#/unit time) 10 10 update size (tuples) 1 10 tuple num 200000 6000000 tuple size (bytes) 200 300 fan-out 3 1 join ratio 0.0002 select ratio 0.002 proj ratio 0.1 0.2 aggr ratio 0.001 0.03 block size (bytes) 8K 8K 8K disk cost (ms/block) 1 1 1 net cost (ms/byte) 0 0.0001 0.00001 Table 6: Statistics for synthetic views v1 –v5 Parameter name Values v1 v2 v3 v4 v5 each source table each segment query rate (#/unit time) 1000 100 10 1 0 query size (tuples) 1 1 1 1 1 update rate (#/unit time) 10 update size (tuples) 1 tuple num 10000 tuple size (bytes) 100 levels 2 2 2 2 2 fan-out 3 join ratio 0.001 select ratio 0.2 proj ratio 0.4 aggr ratio 0.1 block size (bytes) 8K 8K 8K 8K 8K 8K disk cost (ms/block) 1 1 1 1 1 1 net cost (ms/byte) 0 0 0 0 0 0.0001 Y. Cui, J. Widom 11-18 Table 7: Statistics for v6 Table 8: Statistics for v7 Parameter name Values Parameter name Values v6 each source table each segment v7 each source table each segment query rate (#/unit time) 100 query rate (#/unit time) 100 query size (tuples) 1 query size (tuples) 1 update rate (#/unit time) 10 update rate (#/unit time) 10 update size (tuples) 1 update size (tuples) 1 tuple num 1000000 tuple num 10000 tuple size (bytes) 1000 tuple size (bytes) 100 levels 6 levels 2 fan-out 1 fan-out 5 join ratio join ratio 0.001 select ratio 0.5 select ratio 0.2 proj ratio 0.8 proj ratio 0.4 aggr ratio 0.2 aggr ratio 0.1 block size (bytes) 8K 8K block size (bytes) 8K 8K disk cost (ms/block) 1 1 disk cost (ms/block) 1 1 net cost (ms/byte) 0 0.0001 net cost (ms/byte) 0 0.0001 Table 9: Statistics for v8 Table 10: Statistics for v9 Parameter name Values Parameter name Values v8 R1 R2 segment 1 v9 R1 R2 segment 1 query rate (#/unit time) 100 query rate (#/unit time) 100 query size (tuples) 1 query size (tuples) 1 update rate (#/unit time) 0 0 update rate (#/unit time) 0 10 update size (tuples) 0 0 update size (tuples) 0 20 tuple num 100000 100000 tuple num 10000 1000000 tuple size (bytes) 1000 1000 tuple size (bytes) 1000 1000 fan-out 2 fan-out 2 join ratio 0.001 join ratio 0.00003 select ratio 0.2 select ratio 0.2 proj ratio 0.6 proj ratio 0.6 aggr ratio 0.2 aggr ratio 0.2 block size (bytes) 8K 8K 8K block size (bytes) 8K 8K 8K disk cost (ms/block) 1 1 1 disk cost (ms/block) 1 1 1 net cost (ms/byte) 0 0.01 0.01 net cost (ms/byte) 0 0.0001 0.001 Y. Cui, J. Widom 11-19