=Paper=
{{Paper
|id=Vol-19/paper-4
|storemode=property
|title=Speeding up Warehouse Physical Design Using a Randomized Algorithm
|pdfUrl=https://ceur-ws.org/Vol-19/paper3.pdf
|volume=Vol-19
|dblpUrl=https://dblp.org/rec/conf/dmdw/LeeH99
}}
==Speeding up Warehouse Physical Design Using a Randomized Algorithm==
Speeding Up Warehouse Physical Design
Using A Randomized Algorithm
Minsoo Lee and Joachim Hammer
Dept. of Computer & Information Science & Engineering
University of Florida
Gainesville, FL 32611-6120
{mslee, jhammer}@cise.ufl.edu
1 Introduction
Abstract A data warehouse stores information that is collected
from multiple, heterogeneous information sources for the
A data warehouse stores information that is purpose of complex querying and analysis [IK93, Wid95].
collected from multiple, heterogeneous The information in the warehouse is typically processed
information sources for the purpose of complex and integrated before it is loaded in order to detect and
querying and analysis. Information in the resolve any inconsistencies and discrepancies among
warehouse is typically stored in the form of related data items from different sources. Since the
materialized views. One of the most important amount of information in a data warehouse tends to be
tasks when designing a warehouse is the selection large and queries may involve hundreds of complex
of materialized views to be maintained in the aggregates at a time, the organization of the data
warehouse. The goal is to select a set of views in warehouse becomes a critical factor in supporting efficient
such a way as to minimize the total query online analytical query processing (OLAP) as well as in
allowing periodic maintenance of the warehouse contents.
response time over all queries, given a limited
Data in the warehouse is often organized in summary
amount of time for maintaining the views
tables, or materialized views [Rou97], which represent
(maintenance-cost view selection problem). The pre-computed portions of the most frequently asked
paper focuses on an efficient solution to the queries. In this way, the warehouse query processor
maintenance-cost view selection problem using a avoids having to scan the large data sets for each query, a
genetic algorithm for computing a near-optimal task that is even more wasteful if the query occurs
set of views. Specifically, we explore the view frequently. However, in order to keep these materialized
selection problem in the context of OR view views consistent with the data at the sources, the views
graphs. We show that our approach represents a have to be maintained. Rather than periodically refreshing
dramatic improvement in time complexity over the entire view, a process that may be time consuming and
existing search-based approaches using wasteful, a view can be maintained in an incremental
heuristics. Our analysis shows that the algorithm fashion, whereby only the portions of the view which are
consistently yields a solution that lies within 10% affected by the changes in the relevant sources are
of the optimal query benefit while at the same updated [GM95, ZGH+95].
time exhibiting only a linear increase in Besides this so-called view maintenance or update cost,
execution time. We have implemented a each materialized view in the warehouse also requires
additional storage space which must be taken into account
prototype version of our algorithm which is used
when deciding which and how many views to materialize.
to simulate the measurements used in the analysis For example, given a set of frequently asked OLAP
of our approach. queries, materializing all possible views will certainly
increase query response time but will also raise the update
costs for the warehouse and may exceed the available
storage capacity. Thus by trading space for time and vice
versa, the warehouse administrator must carefully decide
on a particular warehouse configuration which balances
The copyright of this paper belongs to the paper’s authors. Permission to copy
the three important factors given above: query response
without fee all or part of this material is granted provided that the copies are not
made or distributed for direct commercial advantage. time, maintenance cost, and storage space. The problem of
Proceedings of the International Workshop on Design and selecting a set of materialized views for a particular
Management of Data Warehouses (DMDW'99)
warehouse configuration which represents a desirable
Heidelberg, Germany, 14. - 15. 6. 1999
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/
M. Lee, J. Hammer 3-1
balance among the three costs is known as the view View indices are similar to views except that instead of
selection problem1. storing the tuples in the views directly, each tuple in the
In this paper we focus on a solution to the maintenance- view index consists of pointers to the tuples in the base
cost view selection problem which minimizes query relations that derive the view tuple. The algorithm is based
response time given varying upper bounds on the on A* to find an optimal set of view indexes but uses a
maintenance cost, assuming unlimited amount of storage very simple cost model for updating the view which does
space because storage space is regarded cheap and not a not take into account which subviews have been selected.
critical resource. Specifically, we explore the view As a result, the maintenance cost for the selected view set
selection problem in the context of OR view graphs, in is not very realistic.
which any view can be computed from any of its related More recently, Ross et al. [RSS96] and Labio et al.
views. Although the view selection problem has been [LQA97] have examined the same problem using
addressed previously (e.g., see [Gup97], [GM99], exhaustive search algorithms that make use of heuristics
[TS97]), existing algorithms do not perform well when for pruning the search space. The work by Labio et al. is
computing warehouse configurations involving more than an extension of the work by Ross et al. considering
20-25 views or so. In those cases, the search space indexes and also improving upon the efficiency of the
becomes too large for any kind of exhaustive search algorithm. In addition, Labio et al. are the first to provide
method and even the best heuristics can only compute a valuable set of rules and guidelines for choosing a set of
acceptable solutions for a small set of special cases of the views and indexes when their algorithm cannot compute
problem. To this end, we have designed a solution the optimal warehouse configuration within a reasonable
involving randomization techniques which have proven time due to the complexity of the solution. Similarly,
successful in other combinatorial problems. We show that Theodoratos et al. [TS97] present an exhaustive search
our solution is superior to existing solutions in terms of algorithm with pruning to find a warehouse configuration
both its expected run-time behavior as well as the quality for answering a set of queries given unlimited space for
of the warehouse configurations found. The analysis storing the views. Their work also focuses on minimizing
proves that our genetic algorithm yields a solution that lies query evaluation and view maintenance.
within 90% of the optimal query benefit while at the same [HRU96] present and analyze greedy algorithms for
time exhibiting only a linear increase in execution time. selection of views in the special case of “data cubes” that
We expect our algorithm to be useful in data warehouse come within 63% of the optimal configuration. However,
design; most importantly in those scenarios where the their calculations do not figure in the update costs for the
queries which are supported by the existing warehouse selected views.
views change frequently, making it necessary to Our work is most closely related to that of Gupta [GM99]
reconfigure the warehouse efficiently and quickly. who has used both the greedy approach as well as the A*
Supporting data warehouse evolution in this way may algorithm for solving the maintenance-cost view selection
increase the usefulness of the data warehousing concept problem in the context of both OR/AND view graphs and
even further. the general case of AND-OR view graphs. His approach
The paper is organized as follows. In the Sec. 2 we also balances query response time and view maintenance
present an overview of the related work. Sec. 3 describes cost while assuming an unlimited amount of storage space.
our technical approach. Specifically, we briefly introduce
the idea behind genetic algorithms (which are a special
class of randomized algorithms) and how we are using the
3 Technical Approach
technique to find an efficient solution to the view selection In [Gup97] the view selection problem is stated to be NP-
problem. In Sec. 4 we describe the implementation of our hard (see, for example, [Coo71, GJ79]), as there is a
prototype which was used to generate the simulation runs straightforward reduction to the minimum set cover
which we present and analyze in Sec. 5. Sec. 6 concludes problem. Roughly speaking, it is very difficult to find an
the paper with a summary of our results and future plans. optimal solution to problems in this class because of the
fact that the solution space grows exponentially as the
2 Related Research problem size increases. Although some good solutions for
NP-hard problems in general and the view selection
All of the related work on view selection uses some form problem in specific exist, such approaches encounter
of greedy strategy or heuristics-based searching technique significant problems with performance when the problem
to avoid having to exhaustively traverse the solution space size grows above a certain limit. More recent approaches
in search of the optimal solution. The problem of selecting use randomized algorithms in solving NP-hard problems.
additional structures for materialization was first studied Randomized algorithms are based on statistical concepts
by Roussopoulos [Rou82] who proposed to materialize where the large search space can be explored randomly
view indices rather than the actual views themselves. using an evaluation function to guide the search process
closer to the desired goal. Randomized algorithms can
1
find a reasonable solution within a relatively short period
Sometimes the problem is also referred to as the view index of time by trading executing time for quality. Although the
selection problem (VIS) when the solution includes a resulting solution is only near-optimal, this reduction is
recommendation on which index structures should be not as drastic as the reduction in execution time. Usually,
maintained in support of the materialized views.
M. Lee, J. Hammer 3-2
the solution is within a few percentage points of the mutation operations, respectively. Step ④ evaluates the
optimal solution which makes randomized algorithm an population that is created. A function called the fitness
attractive alternative to traditional approaches such as the function, which evaluates the superiority of a genome, is
ones outlined in Sec. 2, for example. used in this process. The fitness of each genome can be
Our approach uses a genetic algorithm, which is one form gathered and used as a metric to evaluate the improvement
of a randomized algorithm. The motivation to use genetic made in the new generation. This fitness value is also used
algorithms in solving the view selection problem was during the selection process (in step ②) in the next
based on the observation that data warehouses can have a iteration to select superior genomes for the next
large number of views and the queries that must be population. Also, the genome with the best fitness so far is
supported may change very frequently. Thus, a fast saved. We now explain how this algorithm can be
solution is needed to provide new configurations for the adapted to solve the view selection problem.
data warehouse: an ideal starting point for the genetic
algorithm. However, genetic algorithms do not provide a 3.2 An Improved Solution to the View Selection
magical solution by themselves and their success (or Problem
failure) often depends on the proper problem
specification, the set-up of the algorithm, as well as the In order to apply a genetic algorithm approach to solve the
outcome of the extremely difficult and tedious fine-tuning view selection problem, two requirements must be met:
of the algorithm that must be performed during many test (1) We need to find a string representation of a candidate
runs. After a brief overview of genetic algorithms, we solution and, (2) we need to be able to define the fitness
provide details on how to apply these techniques to design function, the crossover operator, and the mutation
an optimal solution to the view selection problem. operator as outlined above.
Specifically, we elaborate on a suitable representation of In [Mic94], the authors discuss several solutions to
the solution space as well as the necessary evaluation popular problems using genetic algorithms, including the
functions needed by our genetic algorithm. 0/1 knapsack problem (see for example [Aho83]). The
similarity of the view selection problem to the 0/1
knapsack problem gives us a hint on how to apply the
3.1 Genetic Algorithms
genetic algorithm strategies in our context. However, to
The idea behind the Genetic Algorithm (GA) [Gol89] our knowledge nobody has yet to apply genetic algorithm
comes from imitating how living organisms evolve into techniques to solving view selection and the solutions
superior populations from one generation to the next. The presented here represent our own approach.
genetic algorithm works as follows. A pool of genomes is
initially established. Each genome represents a possible 3.2.1 Problem Specification
solution for the problem to be solved. This pool of
The problem to be solved can be stated as follows
genomes is called a population. The population will
[GM99]: Given an OR view graph G and a quantity
undergo changes and create a new population. Each
representing the total maintenance time limit, select a set
population is referred to as a generation. Starting with an
of views that minimizes the total query response time and
initial generation t, the sequence of subsequent
also does not exceed the total maintenance time limit. An
populations is referred to as generation t+1, t+2, ..., and
OR view graph is composed of a set of views where each
so on. After several generations, it is expected that the
view in the graph can be constructed from other source
population t+k should be composed of genomes which are
views in one or more ways, but each derivation involves
superior to the genomes in population t. By superior
only one other view. In other words, only one view among
genomes we mean genomes which represent a solution
the source views is needed to compute a view. An
that is closer to the optimal solution (based on a so-called
example of an OR view graph is the data cube [GCB+97]
fitness evaluation).
where each view can be constructed in many different
The Genetic Algorithm repeatedly executes the following
ways but each derivation only involves one view. A
four steps:
sample OR view graph is shown in Figure 1.
① t = t +1
② select P(t) from P(t-1)
③ recombine P(t)
④ evaluate P(t)
a
In step ①, a new generation t is created by increasing the OR
generation variable t by one. In step ② superior genomes view s
among the previous population P(t-1) are selected and
used as the basis for composing the genomes in the new b c d
OR OR OR
population P(t). A statistical method, for example, the
roulette wheel method [Mic94], is used to select those
genomes which are superior. base tables g h i k
In step ③, several operations are executed on paired or
individual genomes to create new genomes in the
population. These operations are called crossover and
Figure 1: Sample OR view graph for four views.
M. Lee, J. Hammer 3-3
For example, in this figure, the view labeled a can be information between the two, thereby creating two new
computed from any of the views labeled b, c, d (the same genomes. Crossover works as follows:
is true for the views b, c, d).
As mentioned in the introduction, we are limiting our Each genome is selected with a probability of pc.
experiments to OR view graphs at this point since the cost Pair genomes.
model is considerably less complex than that for the other For each pair, do the following :
two view graph models (i.e., AND and AND-OR view // Assuming two genomes
graphs). However, despite their simpler cost model, OR // g1 = (b1 b2 ... bpos bpos+1 ... bm) and
view graphs are useful and appear frequently in // g2 = (c1 c2 ... cpos cpos+1 ... cm)
warehouses used for decision support in the form of data (1) Randomly decide a crossover point pos.
cubes as indicated above. (2) Exchange information among genomes,
Future versions of this paper will address the cases of and replace g1, g2 with g1', g2'
AND view graphs as well as AND-OR view graphs. We // (ex) g1' = (b1 b2 ... bpos cpos+1 ... cm) and
are also deferring the problem of index selection for the // g2' = (c1 c2 ... cpos bpos+1 ... bm)
next version of our algorithm. However, in Sec. 6 we
briefly mention how index selection can be added into the The mutation operator works as follows.
problem domain in a straightforward manner.
For all genomes,
3.2.2 Problem Solution For each bit in genome,
mutate (flip) bit with probability of pm
Step 1 – Representation of the Solution. The selection, crossover, mutation and evaluation
A genome represents a candidate solution of the problem (described in Step 4) processes will be repeated in a loop
to be solved. A genome can be represented appropriately until the termination condition is satisfied. The
as a string as follows: Either as a binary string composed termination condition is reached after 400 generations.
of 0s and 1s, or as a string of alphanumeric characters. The values used for the probabilities and termination
The content of the string is flexible, but the representation condition are based on empirical values used in other
of the solution must be carefully designed so that it is examples, and although reference [Mic94] mentioned that
possible to properly represent all possible solutions. As no improvement was observed after 500 generations, we
alphanumeric strings are good for representing solutions have reduced this value to 400 in our experiments since
to ordering problems and binary strings are used for our algorithm converged more rapidly.
selection problems, we use a binary string to represent a
solution. The views in the OR view graph may be Step 4 – Evaluation Process
enumerated as v1, v2, …, vm where m is the total number The fitness function measures how good a solution (i.e., a
of views. We can represent a selection of these views as a genome) is by providing a fitness value as follows: If the
binary string of m bits. If the bit in position i (starting fitness is high, the solution is closer to an optimal
from the leftmost bit as position 1) is 1, view vi is solution; if the fitness is low, the solution is far away from
selected. Otherwise, view vi is not selected. For example, the optimal solution. There are many possible fitness
the bit string 001101001 encodes the fact that only views functions and finding the best possible one (i.e., one that
v3, v4, v6 and v9 are selected. can truthfully evaluate the quality of a particular
warehouse configuration) requires a lot of fine-tuning.
Step 2 – Initialization of the Population. For our problem, the fitness function has to evaluate a
The initial population consists of a pool of randomly genome (i.e., a set of selected views to materialize) with
generated bit strings of size m. In future implementations, respect to the query benefit as well as with respect to the
however, it is straightforward to start with an initial maintenance constraint. This is similar to the goal of the
population which represents a favorable configuration 0/1 knapsack problem, for example, where the goal is to
based on external knowledge about the problem and its maximize the profit of the packed load while satisfying a
solution. It will be interesting to see if and how this affects specific capacity constraint of the knapsack. The
the quality as well as the run-time of our algorithm. For difference is that in the view selection problem, when a
the experiments described in this paper, we have chosen a view is selected, the benefit will not only depend on the
population size of 30. view itself but also on other views that are selected. A
good way to model such a complex problem is by
Step 3 – Selection, Crossover, Mutation, Termination introducing a penalty value as part of the fitness function.
This penalty value will reduce the fitness if the
The selection process uses the roulette wheel method. The maintenance constraint is not satisfied. When the
crossover and mutation operators are assigned maintenance constraint is satisfied, the penalty value will
probabilities pc and pm, respectively. The specific values have no effect and only the query benefit should be
used in our simulation are 0.001 and 0.9. The crossover evaluated. We have applied the penalty value in three
operation is applied to two genomes by exchanging different ways when calculating the fitness: Subtract mode
(S), Divide mode (D), and Subtract & Divide mode (SD).
M. Lee, J. Hammer 3-4
The subtract mode will calculate the fitness by subtracting [LH99] which is available for download via ftp and http
the penalty value from the query benefit. Because the from our publication server.
fitness value is not allowed to have a negative value, the
fitness is set to 0 when the result of the calculation 4 Prototype Implementation
becomes negative (i.e., the penalty value exceeds the
query benefit). The divide mode will divide the query We used version 2.4.3 of the genetic algorithm toolkit
benefit with the penalty value in an effort to reduce the from MIT called GAlib [MIT] to develop a prototype of
query benefit. When the penalty value is less than 1, the the algorithm described above. The toolkit supports
division is not performed in order to prevent the fitness various types of genetic algorithms including mutation
from increasing. The subtract & divide mode combines and crossover operators, built-in genome types such as 1-
the two methods discussed above. If the query benefit is dimensional or 2-dimensional strings, and a statistics
larger than the penalty value, the subtract mode is used. If gathering tool that can provide summarized information
the penalty value is larger than the query benefit, the about each generation during a single run of the genetic
divide mode is used. The penalty value can be calculated algorithm. The prototype was written entirely in C++
using a penalty function, which will be discussed using Microsoft Visual C++ as our development platform.
afterwards. Thus, we have defined a fitness function, Since the toolkit did not provide any libraries to encode a
called Eval, as follows: fitness function based on the evaluation strategies
discussed above, we had to encode our own. The fitness
①
Subtract mode (S): function we developed can calculate the fitness in nine
Eval(x)=B(G,M)-Pen(x) (if B(G,M)-Pen(x)≥ 0) different ways by pairing each type of penalty mode with
=0 (if B(G,M)-Pen(x)<0) each type of penalty function; in our implementation, we
for all x[i]=1, vi ∈ M can control the way the penalty is calculated and applied
for all x[i]=0, vi ∉ M in the fitness function by setting the value of a variable
which indicates the desired strategy. This allows us to
②
Divide mode (D):
Eval(x)=B(G,M) / Pen(x) (if Pen(x)>1) switch back and forth between the different penalty modes
=B(G,M) (if Pen(x)≤ 1) when conducting our experiments. The fitness function
needs to evaluate each genome using the cost values given
③
Subtract&Divide mode (SD): by the OR-view graph and the maintenance cost limit
Eval(x)=B(G,M)-Pen(x) (if B(G,M)>Pen(x)) (e.g., given by the warehouse administrator). For this
=B(G,M)/Pen(x) (if Pen(x)≥ B(G,M) and purpose, additional cost functions which, when given a
Pen(x) >1) genome can calculate the total query cost and the total
=B(G,M) (if Pen(x)≥ B(G,M) and maintenance cost of the selected views represented by the
Pen(x) ≤ 1) genome, must be encoded. The OR-view graph has the
where B is the query benefit function, Pen is the penalty related costs shown in Table 1. Each node (=view) in the
function, x is a genome, G is the OR view graph, and M is graph, has associated with it a read cost of the view (RC),
the set of selected views given by x. a query frequency (QF) and an update frequency (UF).
The penalty function itself can also take on various forms. Each edge of the graph, which denotes the relationship
For example, we have experimented with logarithmic, among the views, is associated with a query cost (QC) and
linear and exponential penalty functions as shown in ④,⑤, a maintenance cost (MC).
and ⑥.
Table 1: Cost parameters for OR view graphs.
④
Logarithmic penalty (LG):
Pen(x) = log 2 ( 1 + ρ ( U(M) - S ) ) Para- Description
meter
Pen(x) = ( 1 + ρ ( U(M) - S ) ) ⑤
Linear penalty (LN):
RC Read Cost of the view, also used to
Node represent the size of the view.
Pen(x) = ( 1 + ρ ( U(M) - S ) ) ⑥
Exponential penalty (EX): (View) QF Query Frequency, represents the
2
number of queries on the view
where ρ is defined as a constant calculated from the given during a given time interval.
OR-view graph G, U(M) is the total maintenance cost of UF Update Frequency, represents the
the set of materialized views M, and S is the maintenance number of updates on the view
cost constraint. during a given time interval.
We combined the three types of penalty modes (i.e., S, D, QC Query Cost, represents the cost for
SD) and the three types of penalty functions (i.e., LG, LN, Edge calculating a view from one of its
EX) in our prototype to evaluate and determine the best source views.
possible strategy for solving the view selection problem. MC Maintenance Cost represents the
Please note, the details as well as the formulas for the cost for updating a view using one
query benefit function B(G,M), the total maintenance cost of its source views.
U(M), and ρ are provided in a full version of this paper
M. Lee, J. Hammer 3-5
The total query cost for the selected views represented by 5.1 Quality of Solutions
a genome is calculated by summing over all calculated
Initially, we used all nine different fitness functions to
minimum cost paths from each selected view to another
conduct the experiments. The quality of the solutions was
selected view or a base table. Each minimum cost path is
measured as a ratio of the optimal total query cost
composed of all of the QC values of the edges on the path
(obtained using the exhaustive search) over the computed
and the RC values of the final selected view or base table.
total query cost (obtained using the genetic algorithm).
This calculation is implemented by using a depth-first
The ratio was computed and averaged over several runs.
traversal of the OR view graph.
It was initially expected that the ratio would always be
The total maintenance cost is calculated similarly, but the
less than 100%. However, we observed that the solutions
cost of each minimum cost path is composed of only the
produced by the genetic algorithm sometimes resulted in a
UC values of the edges. The detailed formulae and
higher than specified maintenance cost but lower than
examples are given in [LH99]. An OR-view graph
expected overall query cost: in those cases, the total query
generator, which can randomly generate OR-views based
cost obtained was lower than the query cost obtained by
on the density and using the parameter ranges given for
strictly adhering to the maintenance constraint value (i.e.,
each parameter of the graph, was also developed for
the ratio exceeded 100%). This was very interesting in the
experimental purpose. In addition, we implemented an
sense that although a maintenance cost constraint may be
exhaustive search algorithm to find the optimal solution
given, it may be beneficial to use it more as a guideline
(at least for small warehouse configurations) in order to be
(within certain limits) rather than as a strict policy.
able to compare the quality of our GA-based solution to
Actually, in [GM99] the inverted-tree greedy heuristic
the optimal one for each test case.
also does not guarantee a strict maintenance cost
constraint, but satisfies a limit within twice the constraint
5 Evaluation of the Algorithm value.
Our genetic algorithm was developed and evaluated using The nine different strategies used in our initial set of
a Pentium II 450 MHz PC running Windows NT 4.0. We experiments are denoted LG-S, LG-D, LG-SD, LN-S, LN-
performed two kinds of evaluations. First, the nine D, LN-SD, EX-S, EX-D, EX-SD, where LG, LN, EX
strategies for the fitness functions (see Sec. 3.2.2) were denote the different penalty functions and S, D, SD denote
compared in terms of the quality of the generated the different ways of applying those penalty functions (as
solutions with respect to the optimal solutions. Second, we described in Sec. 3.2.2).
compared the run-time behavior of the genetic algorithm After an initial experiment, the logarithmic penalty
to the exhaustive search algorithm in order to gain insight functions (LG-S, LG-D, LG-SD) did not perform well,
into the efficiency of our approach. especially LG-S and LG-SD. The reason was that the
The OR-view graphs that were used in the experiments logarithmic penalty function makes the penalty value too
were as follows. The number of base tables was fixed to small to have a noticeable effect on the fitness value.
10 tables. The number of views varied from 5 to 20 views. Thus, for LG-S and LG-SD, our algorithm always tried to
The edge density of the graph varied from 15% to 30% to maximize the query benefit while ignoring the
50% to 75%. The ranges for the values of all the maintenance cost constraint by yielding a solution that
important parameters of the OR-view graphs are shown in materializes all of the views. LG-D and several others
Table 2. The maintenance cost constraint for the problem such as LN-S, EX-S did not result in such extreme
was set to 50, 100, 300, and 500. Please note that one solutions but tended to fluctuate wildly over the
possible interpretation of these values is to view them as maintenance cost limit, sometimes exceeding it by as
time limits on how long the warehouse is expected to be much as 10,000%! Therefore, we disregard these
unavailable due to maintenance or as the amount of data strategies in our figures and only show the results from the
that must be read etc. remaining strategies, namely LN-D, LN-SD, EX-D, EX-
SD as depicted in Figures 2 and 3. Figure 2 shows the
results of averaging over the ratios of optimal total query
Table 2: Range of parameter values for the simulated cost (based on a strict maintenance constraint) over GA
OR-view graphs total query costs. Figure 3 shows the results of averaging
over the ratios of GA total maintenance cost over the
Para- Description maintenance constraint. The values are arranged in tuples
meter in lexicographical order:
RC 100-10,000 for base tables (RC for (density, number of views, maintenance
Node views are calculated from source constraint).
(View) views) The density changes occur at the points 1, 65, 129 and
QF 0.1 - 0.9 193 on the x-axis, each increasing the densities. The
UF 0.1- 0.9 numbers of views are shown in increasing order within a
QC 10 - 80 % of RC of source view given density. The maintenance cost is also shown in
Edge MC 10 – 150% of QC increasing order within each set of views.
The results in Figure 3 show that the LN-D and LN-SD
still exhibit a considerable amount of fluctuation (about
M. Lee, J. Hammer 3-6
110
108
total query cost ratio
106
(optimal/GA) %
104 LN-D
102
LN-SD
100
98 EX-D
96 EX-SD
94
92
90
1
13
25
37
49
61
73
85
97
109
121
133
145
157
169
181
193
205
217
229
241
253
(density,# of views, maint. constraint)
Figure 2: Average ratios of (optimal total query cost/GA total query cost)
450
total maintenance cost ratio
400
350
(GA/optimal) %
300 LN-D
250
LN-SD
200
EX-D
150
100 EX-SD
50
0
1
13
25
37
49
61
73
85
97
109
121
133
145
157
169
181
193
205
217
229
241
253
(density,# of views, maint.constraint)
Figure 3: Average ratios of (GA total maintenance cost/maintenance constraint)
380%) for the maintenance cost. This was especially by using this algorithm, we have actual proof that it is
noticeable for low density OR-view graphs where the extremely time consuming to obtain an optimal solution
penalty values resulted in values which were too small to when the number of views exceeds 20 views. As a result,
enforce the maintenance constraint. If we discard these we limited our experiments to only 20 views. Although
two strategies from our consideration, Figure 2 shows that better heuristics exist (which still have polynomial time
the remaining EX-D and EX-SD strategies obtain a total complexity), this particular experiment is intended to give
query cost ratio that is guaranteed to always be over 90% the reader a feel for the performance capabilities of our
which is very close to the optimal solution. Furthermore, genetic algorithm. From the figures we can see that the
the maintenance cost is always within two times the value execution time for the exhaustive algorithm increases
of the maintenance cost. Thus, EX-D and EX-SD exponentially within each given density as it goes up to 20
represent good fitness functions for our genetic algorithm. views. Our genetic algorithm on the other hand exhibits
Note that this result is also very close to the one that was linear behavior. As the density grows, the slope of the
verified in theory in the inverted-tree greedy heuristics linear graph increases only slightly. The genetic algorithm
proposed by [GM99]. took approximately 1/80th of the time of an exhaustive
search of an OR-view graph with density of 75% and 20
5.2 Execution Time views. As the number of views goes up to 30 and beyond,
this ratio is expected to be much more impressive.
Figures 4 and 5 show a comparison in execution time Furthermore, the quality of the solution generated by the
between our genetic algorithm and the exhaustive search genetic algorithm remains very close to the optimal.
averaged over the sample OR view graphs. The exhaustive
search algorithm was developed to obtain the optimal
solutions for comparing the qualities of the solutions, and
M. Lee, J. Hammer 3-7
120
100
80
time (sec)
60
Exhaustive Search
40
20
0
1
13
25
37
49
61
73
85
97
109
121
133
145
157
169
181
193
205
217
229
241
253
(density, # of views, maint. constraint)
Figure 4: Execution time for exhaustive search algorithm
1.6
1.4
1.2
LN-D
time (sec)
1
LN-SD
0.8
EX-D
0.6
EX-SD
0.4
0.2
0
1
13
25
37
49
61
73
85
97
109
121
133
145
157
169
181
193
205
217
229
241
253
(density, # of views, maint. cost)
Figure 5: Execution time for genetic algorithm
6 Conclusion a certain threshold instead of always computing all
generations.
In this paper we have shown that our genetic algorithm is • Expanding our approach to include AND-OR view
superior to existing solutions to the view selection graphs as well as indexes. The first is straightforward.
problem in the context of OR view graphs. Specifically, The latter is more complicated as we have to modify
our genetic algorithm consistently yields a solution that the problem representation. One possible approach
comes within 10% of the optimal solution (which we may be as follows: Add the indexes related to a view
verified by running an exhaustive search on OR view immediately after the bit position of the view.
graphs with up to 20 views) while at the same time However, crossover and mutation operations need to
exhibiting a linear run-time behavior. A penalty function be carefully redesigned since only when a view is
has been included in the fitness function, and experimental selected for materialization can the associated indexes
results show that the EX-D and EX-SD penalty functions be selected. See [LH99] for more details on how we
produce the best results for the maintenance cost view plan on incorporating index selection into our
selection problem. We believe that this algorithm can algorithm.
become an invaluable tool for warehouse evolution, • Lastly, genetic algorithms are well suited for
especially for those data warehouses with a large number exploiting parallelism. For further improvement in the
of views and frequent changes to the queries which are performance of the devised algorithm, a parallel
supported by the given warehouse configuration. version may be devised.
In the future, we are considering the following
improvements:
• Generate an initial population based on knowledge of a References
possible solution rather than using random
configurations. [Aho83] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, Data
• Experiment with several other crossover or mutation Structures and Algorithms, Addison-Wesley Publishing
operators to speed up convergence even further. Company, Reading, MA, 1983.
• Implement a more flexible termination condition that
can interrupt the algorithm when the solution lies with
M. Lee, J. Hammer 3-8
[Coo71] S.A. Cook, “The Complexity of Theorem [Mic94] Z. Michalewicz, Genetic Algorithms + Data
Proving Procedure,” Annual ACM SIGACT Symposium on Structures = Evolution Programs, Sringer-Verlag, New
Theory of Computing, pp. 151-158, 1971. York, New York, NY, 1994.
[GJ79] M.R. Garey and D.S. Johnson, Computers and [MIT] MIT Technology Lab, “GAlib: A C++ Library of
Intractability – A Guide to the Theory of NP- Genetic Algorithm Components”, URL,
Completeness, San Francisco, 1979. http://lancet.mit.edu/ga/.
[GCB+97] J. Gray, S. Chaudhuri, A. Bosworth, A. [RSS96] K.A. Ross, D. Srivastava, and S. Sudarshan,
Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. “Materialized view maintenance and integrity constraint
Pirahesh, “Data Cube: A Relational Aggregation Operator checking: Trading space for time,” SIGMOD Record
Generalizing Group-By, Cross-Tab, and Sub-Totals,” (ACM Special Interest Group on Management of Data),
Data Mining and Knowledge Discovery, 1:1, pp. 29-53, 25:2, pp. 447-458, 1996.
1997.
[Rou82] N. Roussopoulos, “View Indexing inrelational
[GM95] A. Gupta and I.S. Mumick, “Maintenance of Databases,” ACM Transactions on Database Systems, 7:2,
Materialized Views: Problems, Techniques, and pp. 258-290, 1982.
Applications,” Data Engineering Bulletin, Special Issue
on Materialized Views and Data Warehousing, 18:2, pp.
[Rou97] N. Roussopoulos, “Materialized Views and Data
3-18, 1995.
Warehouses,” in Proceedings of the KRDB, December
1997.
[Gup97] H. Gupta, “Selection of Views to Materialize in a
Data Warehouse,” in Proceedings of the International
[TS97] D. Theodoratos and T.K. Sellis, “Data
Conference on Database Theory, pp. 98-112, Delphi,
Warehouse Configuration,” in Proceedings of the Twenty-
Greece, January 1997.
third International Conference on Very Large Databases,
Athens, Greece, pp. 126-135, August 1997.
[GM99] H. Gupta and I. Mumick, "Selection of Views to
Materialize Under a Maintenance Cost Constraint," in [Wid95] J. Widom, “Research Problems in Data
Proceedings of the International Conference on Database Warehousing,” in Proceedings of the Fourth International
Theory, Jerusalem, Israel, pp. 453-470, January 1999. Conference on Information and Knowledge Management,
Baltimore, Maryland, pp. 25-30, November 1995.
[Gol89] D.E. Goldberg, Genetic Algorithms in Search,
Optimization, and Machine Learning, p.412, Addison- [ZGH+95] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J.
Wesley, Mass., 1989. Widom, “View Maintenance in a Warehousing
Environment,” SIGMOD Record (ACM Special Interest
[HRU96] V. Harinarayan, A. Rajaraman, and J.D. Group on Management of Data), 24:2, pp. 316-27, 1995.
Ullman, “Implementing data cubes efficiently,” SIGMOD
Record (ACM Special Interest Group on Management of
Data), 25:2, pp. 205-216, 1996.
[IK93] W.H. Inmon and C. Kelley, Rdb/VMS:
Developing the Data Warehouse, QED Publishing Group,
Boston, London, Toronto, 1993.
[LQA97] W. Labio, D. Quass, and B. Adelberg, “Physical
Database Design for Data Warehouses,” in Proceedings of
the International Conference on Data Engineering,
Birmingham, England, pp. 277-288, March 1997.
[LH99] M. Lee and J. Hammer, “Speeding Up
Warehouse Physical Design Using A Randomized
Algorithm,” University of Florida, Gainesville, FL,
Technical Report April 1999.
M. Lee, J. Hammer 3-9