=Paper=
{{Paper
|id=Vol-19/paper-4
|storemode=property
|title=Speeding up Warehouse Physical Design Using a Randomized Algorithm
|pdfUrl=https://ceur-ws.org/Vol-19/paper3.pdf
|volume=Vol-19
|dblpUrl=https://dblp.org/rec/conf/dmdw/LeeH99
}}
==Speeding up Warehouse Physical Design Using a Randomized Algorithm==
<pdf width="1500px">https://ceur-ws.org/Vol-19/paper3.pdf</pdf>
<pre>
                                            Speeding Up Warehouse Physical Design
                                                Using A Randomized Algorithm
                                                       Minsoo Lee and Joachim Hammer
                                            Dept. of Computer & Information Science & Engineering
                                                             University of Florida
                                                         Gainesville, FL 32611-6120
                                                        {mslee, jhammer}@cise.ufl.edu


                                                                         1      Introduction
                                         Abstract                                     A data warehouse stores information that is collected
                                                                                      from multiple, heterogeneous information sources for the
     A data warehouse stores information that is                                      purpose of complex querying and analysis [IK93, Wid95].
     collected           from         multiple,          heterogeneous                The information in the warehouse is typically processed
     information sources for the purpose of complex                                   and integrated before it is loaded in order to detect and
     querying and analysis. Information in the                                        resolve any inconsistencies and discrepancies among
     warehouse is typically stored in the form of                                     related data items from different sources. Since the
     materialized views. One of the most important                                    amount of information in a data warehouse tends to be
     tasks when designing a warehouse is the selection                                large and queries may involve hundreds of complex
     of materialized views to be maintained in the                                    aggregates at a time, the organization of the data
     warehouse. The goal is to select a set of views in                               warehouse becomes a critical factor in supporting efficient
     such a way as to minimize the total query                                        online analytical query processing (OLAP) as well as in
                                                                                      allowing periodic maintenance of the warehouse contents.
     response time over all queries, given a limited
                                                                                      Data in the warehouse is often organized in summary
     amount of time for maintaining the views
                                                                                      tables, or materialized views [Rou97], which represent
     (maintenance-cost view selection problem). The                                   pre-computed portions of the most frequently asked
     paper focuses on an efficient solution to the                                    queries. In this way, the warehouse query processor
     maintenance-cost view selection problem using a                                  avoids having to scan the large data sets for each query, a
     genetic algorithm for computing a near-optimal                                   task that is even more wasteful if the query occurs
     set of views. Specifically, we explore the view                                  frequently. However, in order to keep these materialized
     selection problem in the context of OR view                                      views consistent with the data at the sources, the views
     graphs. We show that our approach represents a                                   have to be maintained. Rather than periodically refreshing
     dramatic improvement in time complexity over                                     the entire view, a process that may be time consuming and
     existing          search-based             approaches            using           wasteful, a view can be maintained in an incremental
     heuristics. Our analysis shows that the algorithm                                fashion, whereby only the portions of the view which are
     consistently yields a solution that lies within 10%                              affected by the changes in the relevant sources are
     of the optimal query benefit while at the same                                   updated [GM95, ZGH+95].
     time exhibiting only a linear increase in                                        Besides this so-called view maintenance or update cost,
     execution time. We have implemented a                                            each materialized view in the warehouse also requires
                                                                                      additional storage space which must be taken into account
     prototype version of our algorithm which is used
                                                                                      when deciding which and how many views to materialize.
     to simulate the measurements used in the analysis                                For example, given a set of frequently asked OLAP
     of our approach.                                                                 queries, materializing all possible views will certainly
                                                                                      increase query response time but will also raise the update
                                                                                      costs for the warehouse and may exceed the available
                                                                                      storage capacity. Thus by trading space for time and vice
                                                                                      versa, the warehouse administrator must carefully decide
                                                                                      on a particular warehouse configuration which balances
 The copyright of this paper belongs to the paper’s authors. Permission to copy
                                                                                      the three important factors given above: query response
 without fee all or part of this material is granted provided that the copies are not
 made or distributed for direct commercial advantage.                                 time, maintenance cost, and storage space. The problem of
 Proceedings of the International Workshop on Design and selecting a set of materialized views for a particular
 Management of Data Warehouses (DMDW'99)
                                                                                      warehouse configuration which represents a desirable
 Heidelberg, Germany, 14. - 15. 6. 1999
 (S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.)
 http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/


M. Lee, J. Hammer                                                                                                                           3-1
balance among the three costs is known as the view              View indices are similar to views except that instead of
selection problem1.                                             storing the tuples in the views directly, each tuple in the
In this paper we focus on a solution to the maintenance-        view index consists of pointers to the tuples in the base
cost view selection problem which minimizes query               relations that derive the view tuple. The algorithm is based
response time given varying upper bounds on the                 on A* to find an optimal set of view indexes but uses a
maintenance cost, assuming unlimited amount of storage          very simple cost model for updating the view which does
space because storage space is regarded cheap and not a         not take into account which subviews have been selected.
critical resource. Specifically, we explore the view            As a result, the maintenance cost for the selected view set
selection problem in the context of OR view graphs, in          is not very realistic.
which any view can be computed from any of its related          More recently, Ross et al. [RSS96] and Labio et al.
views. Although the view selection problem has been             [LQA97] have examined the same problem using
addressed previously (e.g., see [Gup97], [GM99],                exhaustive search algorithms that make use of heuristics
[TS97]), existing algorithms do not perform well when           for pruning the search space. The work by Labio et al. is
computing warehouse configurations involving more than          an extension of the work by Ross et al. considering
20-25 views or so. In those cases, the search space             indexes and also improving upon the efficiency of the
becomes too large for any kind of exhaustive search             algorithm. In addition, Labio et al. are the first to provide
method and even the best heuristics can only compute            a valuable set of rules and guidelines for choosing a set of
acceptable solutions for a small set of special cases of the    views and indexes when their algorithm cannot compute
problem. To this end, we have designed a solution               the optimal warehouse configuration within a reasonable
involving randomization techniques which have proven            time due to the complexity of the solution. Similarly,
successful in other combinatorial problems. We show that        Theodoratos et al. [TS97] present an exhaustive search
our solution is superior to existing solutions in terms of      algorithm with pruning to find a warehouse configuration
both its expected run-time behavior as well as the quality      for answering a set of queries given unlimited space for
of the warehouse configurations found. The analysis             storing the views. Their work also focuses on minimizing
proves that our genetic algorithm yields a solution that lies   query evaluation and view maintenance.
within 90% of the optimal query benefit while at the same       [HRU96] present and analyze greedy algorithms for
time exhibiting only a linear increase in execution time.       selection of views in the special case of “data cubes” that
We expect our algorithm to be useful in data warehouse          come within 63% of the optimal configuration. However,
design; most importantly in those scenarios where the           their calculations do not figure in the update costs for the
queries which are supported by the existing warehouse           selected views.
views change frequently, making it necessary to                 Our work is most closely related to that of Gupta [GM99]
reconfigure the warehouse efficiently and quickly.              who has used both the greedy approach as well as the A*
Supporting data warehouse evolution in this way may             algorithm for solving the maintenance-cost view selection
increase the usefulness of the data warehousing concept         problem in the context of both OR/AND view graphs and
even further.                                                   the general case of AND-OR view graphs. His approach
The paper is organized as follows. In the Sec. 2 we             also balances query response time and view maintenance
present an overview of the related work. Sec. 3 describes       cost while assuming an unlimited amount of storage space.
our technical approach. Specifically, we briefly introduce
the idea behind genetic algorithms (which are a special
class of randomized algorithms) and how we are using the
                                                                3    Technical Approach
technique to find an efficient solution to the view selection   In [Gup97] the view selection problem is stated to be NP-
problem. In Sec. 4 we describe the implementation of our        hard (see, for example, [Coo71, GJ79]), as there is a
prototype which was used to generate the simulation runs        straightforward reduction to the minimum set cover
which we present and analyze in Sec. 5. Sec. 6 concludes        problem. Roughly speaking, it is very difficult to find an
the paper with a summary of our results and future plans.       optimal solution to problems in this class because of the
                                                                fact that the solution space grows exponentially as the
2    Related Research                                           problem size increases. Although some good solutions for
                                                                NP-hard problems in general and the view selection
All of the related work on view selection uses some form        problem in specific exist, such approaches encounter
of greedy strategy or heuristics-based searching technique      significant problems with performance when the problem
to avoid having to exhaustively traverse the solution space     size grows above a certain limit. More recent approaches
in search of the optimal solution. The problem of selecting     use randomized algorithms in solving NP-hard problems.
additional structures for materialization was first studied     Randomized algorithms are based on statistical concepts
by Roussopoulos [Rou82] who proposed to materialize             where the large search space can be explored randomly
view indices rather than the actual views themselves.           using an evaluation function to guide the search process
                                                                closer to the desired goal. Randomized algorithms can
1
                                                                find a reasonable solution within a relatively short period
  Sometimes the problem is also referred to as the view index   of time by trading executing time for quality. Although the
selection problem (VIS) when the solution includes a            resulting solution is only near-optimal, this reduction is
recommendation on which index structures should be              not as drastic as the reduction in execution time. Usually,
maintained in support of the materialized views.


M. Lee, J. Hammer                                                                                                        3-2
the solution is within a few percentage points of the          mutation operations, respectively. Step ④ evaluates the
optimal solution which makes randomized algorithm an           population that is created. A function called the fitness
attractive alternative to traditional approaches such as the   function, which evaluates the superiority of a genome, is
ones outlined in Sec. 2, for example.                          used in this process. The fitness of each genome can be
Our approach uses a genetic algorithm, which is one form       gathered and used as a metric to evaluate the improvement
of a randomized algorithm. The motivation to use genetic       made in the new generation. This fitness value is also used
algorithms in solving the view selection problem was           during the selection process (in step ②) in the next
based on the observation that data warehouses can have a       iteration to select superior genomes for the next
large number of views and the queries that must be             population. Also, the genome with the best fitness so far is
supported may change very frequently. Thus, a fast             saved. We now explain how this algorithm can be
solution is needed to provide new configurations for the       adapted to solve the view selection problem.
data warehouse: an ideal starting point for the genetic
algorithm. However, genetic algorithms do not provide a        3.2      An Improved Solution to the View Selection
magical solution by themselves and their success (or                    Problem
failure) often depends on the proper problem
specification, the set-up of the algorithm, as well as the     In order to apply a genetic algorithm approach to solve the
outcome of the extremely difficult and tedious fine-tuning     view selection problem, two requirements must be met:
of the algorithm that must be performed during many test       (1) We need to find a string representation of a candidate
runs. After a brief overview of genetic algorithms, we         solution and, (2) we need to be able to define the fitness
provide details on how to apply these techniques to design     function, the crossover operator, and the mutation
an optimal solution to the view selection problem.             operator as outlined above.
Specifically, we elaborate on a suitable representation of     In [Mic94], the authors discuss several solutions to
the solution space as well as the necessary evaluation         popular problems using genetic algorithms, including the
functions needed by our genetic algorithm.                     0/1 knapsack problem (see for example [Aho83]). The
                                                               similarity of the view selection problem to the 0/1
                                                               knapsack problem gives us a hint on how to apply the
3.1    Genetic Algorithms
                                                               genetic algorithm strategies in our context. However, to
The idea behind the Genetic Algorithm (GA) [Gol89]             our knowledge nobody has yet to apply genetic algorithm
comes from imitating how living organisms evolve into          techniques to solving view selection and the solutions
superior populations from one generation to the next. The      presented here represent our own approach.
genetic algorithm works as follows. A pool of genomes is
initially established. Each genome represents a possible       3.2.1      Problem Specification
solution for the problem to be solved. This pool of
                                                               The problem to be solved can be stated as follows
genomes is called a population. The population will
                                                               [GM99]: Given an OR view graph G and a quantity
undergo changes and create a new population. Each
                                                               representing the total maintenance time limit, select a set
population is referred to as a generation. Starting with an
                                                               of views that minimizes the total query response time and
initial generation t, the sequence of subsequent
                                                               also does not exceed the total maintenance time limit. An
populations is referred to as generation t+1, t+2, ..., and
                                                               OR view graph is composed of a set of views where each
so on. After several generations, it is expected that the
                                                               view in the graph can be constructed from other source
population t+k should be composed of genomes which are
                                                               views in one or more ways, but each derivation involves
superior to the genomes in population t. By superior
                                                               only one other view. In other words, only one view among
genomes we mean genomes which represent a solution
                                                               the source views is needed to compute a view. An
that is closer to the optimal solution (based on a so-called
                                                               example of an OR view graph is the data cube [GCB+97]
fitness evaluation).
                                                               where each view can be constructed in many different
The Genetic Algorithm repeatedly executes the following
                                                               ways but each derivation only involves one view. A
four steps:
                                                               sample OR view graph is shown in Figure 1.
       ① t = t +1
       ② select P(t) from P(t-1)
       ③ recombine P(t)
       ④ evaluate P(t)
                                                                                                   a
In step ①, a new generation t is created by increasing the                                                 OR
generation variable t by one. In step ② superior genomes                    view s
among the previous population P(t-1) are selected and
used as the basis for composing the genomes in the new                                    b        c            d
                                                                                              OR       OR               OR
population P(t). A statistical method, for example, the
roulette wheel method [Mic94], is used to select those
genomes which are superior.                                             base tables   g       h        i            k

In step ③, several operations are executed on paired or
individual genomes to create new genomes in the
population. These operations are called crossover and
                                                                     Figure 1: Sample OR view graph for four views.


M. Lee, J. Hammer                                                                                                            3-3
For example, in this figure, the view labeled a can be            information between the two, thereby creating two new
computed from any of the views labeled b, c, d (the same          genomes. Crossover works as follows:
is true for the views b, c, d).
As mentioned in the introduction, we are limiting our              Each genome is selected with a probability of pc.
experiments to OR view graphs at this point since the cost         Pair genomes.
model is considerably less complex than that for the other         For each pair, do the following :
two view graph models (i.e., AND and AND-OR view                     // Assuming two genomes
graphs). However, despite their simpler cost model, OR               // g1 = (b1 b2 ... bpos bpos+1 ... bm) and
view graphs are useful and appear frequently in                      // g2 = (c1 c2 ... cpos cpos+1 ... cm)
warehouses used for decision support in the form of data            (1) Randomly decide a crossover point pos.
cubes as indicated above.                                           (2) Exchange information among genomes,
Future versions of this paper will address the cases of                 and replace g1, g2 with g1', g2'
AND view graphs as well as AND-OR view graphs. We                        // (ex) g1' = (b1 b2 ... bpos cpos+1 ... cm) and
are also deferring the problem of index selection for the                //      g2' = (c1 c2 ... cpos bpos+1 ... bm)
next version of our algorithm. However, in Sec. 6 we
briefly mention how index selection can be added into the         The mutation operator works as follows.
problem domain in a straightforward manner.
                                                                   For all genomes,
3.2.2    Problem Solution                                             For each bit in genome,
                                                                            mutate (flip) bit with probability of pm
Step 1 – Representation of the Solution.                          The selection, crossover, mutation and evaluation
A genome represents a candidate solution of the problem           (described in Step 4) processes will be repeated in a loop
to be solved. A genome can be represented appropriately           until the termination condition is satisfied. The
as a string as follows: Either as a binary string composed        termination condition is reached after 400 generations.
of 0s and 1s, or as a string of alphanumeric characters.          The values used for the probabilities and termination
The content of the string is flexible, but the representation     condition are based on empirical values used in other
of the solution must be carefully designed so that it is          examples, and although reference [Mic94] mentioned that
possible to properly represent all possible solutions. As         no improvement was observed after 500 generations, we
alphanumeric strings are good for representing solutions          have reduced this value to 400 in our experiments since
to ordering problems and binary strings are used for              our algorithm converged more rapidly.
selection problems, we use a binary string to represent a
solution. The views in the OR view graph may be                   Step 4 – Evaluation Process
enumerated as v1, v2, …, vm where m is the total number           The fitness function measures how good a solution (i.e., a
of views. We can represent a selection of these views as a        genome) is by providing a fitness value as follows: If the
binary string of m bits. If the bit in position i (starting       fitness is high, the solution is closer to an optimal
from the leftmost bit as position 1) is 1, view vi is             solution; if the fitness is low, the solution is far away from
selected. Otherwise, view vi is not selected. For example,        the optimal solution. There are many possible fitness
the bit string 001101001 encodes the fact that only views         functions and finding the best possible one (i.e., one that
v3, v4, v6 and v9 are selected.                                   can truthfully evaluate the quality of a particular
                                                                  warehouse configuration) requires a lot of fine-tuning.
Step 2 – Initialization of the Population.                        For our problem, the fitness function has to evaluate a
The initial population consists of a pool of randomly             genome (i.e., a set of selected views to materialize) with
generated bit strings of size m. In future implementations,       respect to the query benefit as well as with respect to the
however, it is straightforward to start with an initial           maintenance constraint. This is similar to the goal of the
population which represents a favorable configuration             0/1 knapsack problem, for example, where the goal is to
based on external knowledge about the problem and its             maximize the profit of the packed load while satisfying a
solution. It will be interesting to see if and how this affects   specific capacity constraint of the knapsack. The
the quality as well as the run-time of our algorithm. For         difference is that in the view selection problem, when a
the experiments described in this paper, we have chosen a         view is selected, the benefit will not only depend on the
population size of 30.                                            view itself but also on other views that are selected. A
                                                                  good way to model such a complex problem is by
Step 3 – Selection, Crossover, Mutation, Termination              introducing a penalty value as part of the fitness function.
                                                                  This penalty value will reduce the fitness if the
The selection process uses the roulette wheel method. The         maintenance constraint is not satisfied. When the
crossover and mutation operators are assigned                     maintenance constraint is satisfied, the penalty value will
probabilities pc and pm, respectively. The specific values        have no effect and only the query benefit should be
used in our simulation are 0.001 and 0.9. The crossover           evaluated. We have applied the penalty value in three
operation is applied to two genomes by exchanging                 different ways when calculating the fitness: Subtract mode
                                                                  (S), Divide mode (D), and Subtract & Divide mode (SD).


M. Lee, J. Hammer                                                                                                           3-4
The subtract mode will calculate the fitness by subtracting    [LH99] which is available for download via ftp and http
the penalty value from the query benefit. Because the          from our publication server.
fitness value is not allowed to have a negative value, the
fitness is set to 0 when the result of the calculation         4      Prototype Implementation
becomes negative (i.e., the penalty value exceeds the
query benefit). The divide mode will divide the query          We used version 2.4.3 of the genetic algorithm toolkit
benefit with the penalty value in an effort to reduce the      from MIT called GAlib [MIT] to develop a prototype of
query benefit. When the penalty value is less than 1, the      the algorithm described above. The toolkit supports
division is not performed in order to prevent the fitness      various types of genetic algorithms including mutation
from increasing. The subtract & divide mode combines           and crossover operators, built-in genome types such as 1-
the two methods discussed above. If the query benefit is       dimensional or 2-dimensional strings, and a statistics
larger than the penalty value, the subtract mode is used. If   gathering tool that can provide summarized information
the penalty value is larger than the query benefit, the        about each generation during a single run of the genetic
divide mode is used. The penalty value can be calculated       algorithm. The prototype was written entirely in C++
using a penalty function, which will be discussed              using Microsoft Visual C++ as our development platform.
afterwards. Thus, we have defined a fitness function,          Since the toolkit did not provide any libraries to encode a
called Eval, as follows:                                       fitness function based on the evaluation strategies
                                                               discussed above, we had to encode our own. The fitness
                                                 ①
Subtract mode (S):                                             function we developed can calculate the fitness in nine
Eval(x)=B(G,M)-Pen(x) (if B(G,M)-Pen(x)≥ 0)                    different ways by pairing each type of penalty mode with
       =0                (if B(G,M)-Pen(x)<0)                  each type of penalty function; in our implementation, we
           for all x[i]=1, vi ∈ M                              can control the way the penalty is calculated and applied
           for all x[i]=0, vi ∉ M                              in the fitness function by setting the value of a variable
                                                               which indicates the desired strategy. This allows us to
                                          ②
Divide mode (D):
Eval(x)=B(G,M) / Pen(x) (if Pen(x)>1)                          switch back and forth between the different penalty modes
       =B(G,M)          (if Pen(x)≤ 1)                         when conducting our experiments. The fitness function
                                                               needs to evaluate each genome using the cost values given

                                              ③
Subtract&Divide mode (SD):                                     by the OR-view graph and the maintenance cost limit
Eval(x)=B(G,M)-Pen(x) (if B(G,M)>Pen(x))                       (e.g., given by the warehouse administrator). For this
       =B(G,M)/Pen(x) (if Pen(x)≥ B(G,M) and                   purpose, additional cost functions which, when given a
                        Pen(x) >1)                             genome can calculate the total query cost and the total
       =B(G,M)         (if Pen(x)≥ B(G,M) and                  maintenance cost of the selected views represented by the
                       Pen(x) ≤ 1)                             genome, must be encoded. The OR-view graph has the
where B is the query benefit function, Pen is the penalty      related costs shown in Table 1. Each node (=view) in the
function, x is a genome, G is the OR view graph, and M is      graph, has associated with it a read cost of the view (RC),
the set of selected views given by x.                          a query frequency (QF) and an update frequency (UF).
The penalty function itself can also take on various forms.    Each edge of the graph, which denotes the relationship
For example, we have experimented with logarithmic,            among the views, is associated with a query cost (QC) and
linear and exponential penalty functions as shown in ④,⑤,      a maintenance cost (MC).
and ⑥.
                                                                     Table 1: Cost parameters for OR view graphs.
                                    ④
Logarithmic penalty (LG):
  Pen(x) = log 2 ( 1 + ρ ( U(M) - S ) )                                     Para-               Description
                                                                            meter
  Pen(x) = ( 1 + ρ ( U(M) - S ) ) ⑤
Linear penalty (LN):
                                                                             RC     Read Cost of the view, also used to
                                                                    Node            represent the size of the view.
  Pen(x) = ( 1 + ρ ( U(M) - S ) ) ⑥
Exponential penalty (EX):                                          (View)    QF     Query Frequency, represents the
                                  2
                                                                                    number of queries on the view
where ρ is defined as a constant calculated from the given                          during a given time interval.
OR-view graph G, U(M) is the total maintenance cost of                       UF     Update Frequency, represents the
the set of materialized views M, and S is the maintenance                           number of updates on the view
cost constraint.                                                                    during a given time interval.
We combined the three types of penalty modes (i.e., S, D,                    QC     Query Cost, represents the cost for
SD) and the three types of penalty functions (i.e., LG, LN,        Edge             calculating a view from one of its
EX) in our prototype to evaluate and determine the best                             source views.
possible strategy for solving the view selection problem.                   MC      Maintenance Cost represents the
Please note, the details as well as the formulas for the                            cost for updating a view using one
query benefit function B(G,M), the total maintenance cost                           of its source views.
U(M), and ρ are provided in a full version of this paper


M. Lee, J. Hammer                                                                                                     3-5
The total query cost for the selected views represented by     5.1    Quality of Solutions
a genome is calculated by summing over all calculated
                                                               Initially, we used all nine different fitness functions to
minimum cost paths from each selected view to another
                                                               conduct the experiments. The quality of the solutions was
selected view or a base table. Each minimum cost path is
                                                               measured as a ratio of the optimal total query cost
composed of all of the QC values of the edges on the path
                                                               (obtained using the exhaustive search) over the computed
and the RC values of the final selected view or base table.
                                                               total query cost (obtained using the genetic algorithm).
This calculation is implemented by using a depth-first
                                                               The ratio was computed and averaged over several runs.
traversal of the OR view graph.
                                                               It was initially expected that the ratio would always be
The total maintenance cost is calculated similarly, but the
                                                               less than 100%. However, we observed that the solutions
cost of each minimum cost path is composed of only the
                                                               produced by the genetic algorithm sometimes resulted in a
UC values of the edges. The detailed formulae and
                                                               higher than specified maintenance cost but lower than
examples are given in [LH99]. An OR-view graph
                                                               expected overall query cost: in those cases, the total query
generator, which can randomly generate OR-views based
                                                               cost obtained was lower than the query cost obtained by
on the density and using the parameter ranges given for
                                                               strictly adhering to the maintenance constraint value (i.e.,
each parameter of the graph, was also developed for
                                                               the ratio exceeded 100%). This was very interesting in the
experimental purpose. In addition, we implemented an
                                                               sense that although a maintenance cost constraint may be
exhaustive search algorithm to find the optimal solution
                                                               given, it may be beneficial to use it more as a guideline
(at least for small warehouse configurations) in order to be
                                                               (within certain limits) rather than as a strict policy.
able to compare the quality of our GA-based solution to
                                                               Actually, in [GM99] the inverted-tree greedy heuristic
the optimal one for each test case.
                                                               also does not guarantee a strict maintenance cost
                                                               constraint, but satisfies a limit within twice the constraint
5      Evaluation of the Algorithm                             value.
Our genetic algorithm was developed and evaluated using        The nine different strategies used in our initial set of
a Pentium II 450 MHz PC running Windows NT 4.0. We             experiments are denoted LG-S, LG-D, LG-SD, LN-S, LN-
performed two kinds of evaluations. First, the nine            D, LN-SD, EX-S, EX-D, EX-SD, where LG, LN, EX
strategies for the fitness functions (see Sec. 3.2.2) were     denote the different penalty functions and S, D, SD denote
compared in terms of the quality of the generated              the different ways of applying those penalty functions (as
solutions with respect to the optimal solutions. Second, we    described in Sec. 3.2.2).
compared the run-time behavior of the genetic algorithm        After an initial experiment, the logarithmic penalty
to the exhaustive search algorithm in order to gain insight    functions (LG-S, LG-D, LG-SD) did not perform well,
into the efficiency of our approach.                           especially LG-S and LG-SD. The reason was that the
The OR-view graphs that were used in the experiments           logarithmic penalty function makes the penalty value too
were as follows. The number of base tables was fixed to        small to have a noticeable effect on the fitness value.
10 tables. The number of views varied from 5 to 20 views.      Thus, for LG-S and LG-SD, our algorithm always tried to
The edge density of the graph varied from 15% to 30% to        maximize the query benefit while ignoring the
50% to 75%. The ranges for the values of all the               maintenance cost constraint by yielding a solution that
important parameters of the OR-view graphs are shown in        materializes all of the views. LG-D and several others
Table 2. The maintenance cost constraint for the problem       such as LN-S, EX-S did not result in such extreme
was set to 50, 100, 300, and 500. Please note that one         solutions but tended to fluctuate wildly over the
possible interpretation of these values is to view them as     maintenance cost limit, sometimes exceeding it by as
time limits on how long the warehouse is expected to be        much as 10,000%! Therefore, we disregard these
unavailable due to maintenance or as the amount of data        strategies in our figures and only show the results from the
that must be read etc.                                         remaining strategies, namely LN-D, LN-SD, EX-D, EX-
                                                               SD as depicted in Figures 2 and 3. Figure 2 shows the
                                                               results of averaging over the ratios of optimal total query
Table 2: Range of parameter values for the simulated           cost (based on a strict maintenance constraint) over GA
                  OR-view graphs                               total query costs. Figure 3 shows the results of averaging
                                                               over the ratios of GA total maintenance cost over the
             Para-               Description                   maintenance constraint. The values are arranged in tuples
             meter                                             in lexicographical order:
              RC     100-10,000 for base tables (RC for        (density, number of views, maintenance
     Node            views are calculated from source          constraint).
    (View)           views)                                    The density changes occur at the points 1, 65, 129 and
             QF      0.1 - 0.9                                 193 on the x-axis, each increasing the densities. The
             UF      0.1- 0.9                                  numbers of views are shown in increasing order within a
             QC      10 - 80 % of RC of source view            given density. The maintenance cost is also shown in
    Edge     MC      10 – 150% of QC                           increasing order within each set of views.
                                                               The results in Figure 3 show that the LN-D and LN-SD
                                                               still exhibit a considerable amount of fluctuation (about


M. Lee, J. Hammer                                                                                                       3-6
                                     110
                                     108
        total query cost ratio

                                     106
           (optimal/GA) %

                                     104                                                                                                                               LN-D
                                     102
                                                                                                                                                                       LN-SD
                                     100
                                      98                                                                                                                               EX-D
                                      96                                                                                                                               EX-SD
                                      94
                                      92
                                      90
                                           1
                                                13
                                                      25
                                                           37
                                                                49
                                                                     61
                                                                          73
                                                                               85
                                                                                    97
                                                                                         109
                                                                                               121
                                                                                                     133
                                                                                                           145
                                                                                                                 157
                                                                                                                       169
                                                                                                                             181
                                                                                                                                   193
                                                                                                                                         205
                                                                                                                                               217
                                                                                                                                                     229
                                                                                                                                                           241
                                                                                                                                                                 253
                                                                          (density,# of views, maint. constraint)

                                                     Figure 2: Average ratios of (optimal total query cost/GA total query cost)


                                     450
      total maintenance cost ratio


                                     400
                                     350
             (GA/optimal) %


                                     300                                                                                                                               LN-D
                                     250
                                                                                                                                                                       LN-SD
                                     200
                                                                                                                                                                       EX-D
                                     150
                                     100                                                                                                                               EX-SD
                                      50
                                       0
                                           1
                                               13
                                                     25
                                                           37
                                                                49
                                                                     61
                                                                          73
                                                                               85
                                                                                    97
                                                                                         109
                                                                                               121
                                                                                                     133
                                                                                                           145
                                                                                                                 157
                                                                                                                       169
                                                                                                                             181
                                                                                                                                   193
                                                                                                                                         205
                                                                                                                                               217
                                                                                                                                                     229
                                                                                                                                                           241
                                                                                                                                                                 253
                                                                          (density,# of views, maint.constraint)

                                               Figure 3: Average ratios of (GA total maintenance cost/maintenance constraint)

380%) for the maintenance cost. This was especially                                                    by using this algorithm, we have actual proof that it is
noticeable for low density OR-view graphs where the                                                    extremely time consuming to obtain an optimal solution
penalty values resulted in values which were too small to                                              when the number of views exceeds 20 views. As a result,
enforce the maintenance constraint. If we discard these                                                we limited our experiments to only 20 views. Although
two strategies from our consideration, Figure 2 shows that                                             better heuristics exist (which still have polynomial time
the remaining EX-D and EX-SD strategies obtain a total                                                 complexity), this particular experiment is intended to give
query cost ratio that is guaranteed to always be over 90%                                              the reader a feel for the performance capabilities of our
which is very close to the optimal solution. Furthermore,                                              genetic algorithm. From the figures we can see that the
the maintenance cost is always within two times the value                                              execution time for the exhaustive algorithm increases
of the maintenance cost. Thus, EX-D and EX-SD                                                          exponentially within each given density as it goes up to 20
represent good fitness functions for our genetic algorithm.                                            views. Our genetic algorithm on the other hand exhibits
Note that this result is also very close to the one that was                                           linear behavior. As the density grows, the slope of the
verified in theory in the inverted-tree greedy heuristics                                              linear graph increases only slightly. The genetic algorithm
proposed by [GM99].                                                                                    took approximately 1/80th of the time of an exhaustive
                                                                                                       search of an OR-view graph with density of 75% and 20
5.2                  Execution Time                                                                    views. As the number of views goes up to 30 and beyond,
                                                                                                       this ratio is expected to be much more impressive.
Figures 4 and 5 show a comparison in execution time                                                    Furthermore, the quality of the solution generated by the
between our genetic algorithm and the exhaustive search                                                genetic algorithm remains very close to the optimal.
averaged over the sample OR view graphs. The exhaustive
search algorithm was developed to obtain the optimal
solutions for comparing the qualities of the solutions, and


M. Lee, J. Hammer                                                                                                                                                             3-7
                   120

                   100

                          80
    time (sec)


                          60
                                                                                                                                                                                                Exhaustive Search
                          40

                          20

                          0
                               1
                                       13
                                             25
                                                   37
                                                         49
                                                               61
                                                                     73
                                                                           85
                                                                                  97
                                                                                        109
                                                                                                121
                                                                                                        133
                                                                                                                145
                                                                                                                        157
                                                                                                                                169
                                                                                                                                        181
                                                                                                                                                193
                                                                                                                                                        205
                                                                                                                                                                217
                                                                                                                                                                        229
                                                                                                                                                                              241
                                                                                                                                                                                    253
                                                                     (density, # of views, maint. constraint)


                                                              Figure 4: Execution time for exhaustive search algorithm

                          1.6
                          1.4
                          1.2
                                                                                                                                                                                                       LN-D
             time (sec)


                               1
                                                                                                                                                                                                       LN-SD
                          0.8
                                                                                                                                                                                                       EX-D
                          0.6
                                                                                                                                                                                                       EX-SD
                          0.4
                          0.2
                               0
                                   1
                                        13
                                              25
                                                    37
                                                          49
                                                                61
                                                                      73
                                                                             85
                                                                                   97
                                                                                          109
                                                                                                  121
                                                                                                          133
                                                                                                                  145
                                                                                                                          157
                                                                                                                                  169
                                                                                                                                          181
                                                                                                                                                  193
                                                                                                                                                          205
                                                                                                                                                                  217
                                                                                                                                                                          229
                                                                                                                                                                                241
                                                                                                                                                                                          253
                                                                            (density, # of views, maint. cost)

                                                                          Figure 5: Execution time for genetic algorithm


6                Conclusion                                                                                                     a certain threshold instead of always computing all
                                                                                                                                generations.
In this paper we have shown that our genetic algorithm is                                                                     • Expanding our approach to include AND-OR view
superior to existing solutions to the view selection                                                                            graphs as well as indexes. The first is straightforward.
problem in the context of OR view graphs. Specifically,                                                                         The latter is more complicated as we have to modify
our genetic algorithm consistently yields a solution that                                                                       the problem representation. One possible approach
comes within 10% of the optimal solution (which we                                                                              may be as follows: Add the indexes related to a view
verified by running an exhaustive search on OR view                                                                             immediately after the bit position of the view.
graphs with up to 20 views) while at the same time                                                                              However, crossover and mutation operations need to
exhibiting a linear run-time behavior. A penalty function                                                                       be carefully redesigned since only when a view is
has been included in the fitness function, and experimental                                                                     selected for materialization can the associated indexes
results show that the EX-D and EX-SD penalty functions                                                                          be selected. See [LH99] for more details on how we
produce the best results for the maintenance cost view                                                                          plan on incorporating index selection into our
selection problem. We believe that this algorithm can                                                                           algorithm.
become an invaluable tool for warehouse evolution,                                                                            • Lastly, genetic algorithms are well suited for
especially for those data warehouses with a large number                                                                        exploiting parallelism. For further improvement in the
of views and frequent changes to the queries which are                                                                          performance of the devised algorithm, a parallel
supported by the given warehouse configuration.                                                                                 version may be devised.
In the future, we are considering the following
improvements:
• Generate an initial population based on knowledge of a                                                                      References
    possible solution rather than using random
    configurations.                                                                                                           [Aho83] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, Data
• Experiment with several other crossover or mutation                                                                         Structures and Algorithms, Addison-Wesley Publishing
    operators to speed up convergence even further.                                                                           Company, Reading, MA, 1983.
• Implement a more flexible termination condition that
    can interrupt the algorithm when the solution lies with


M. Lee, J. Hammer                                                                                                                                                                                                   3-8
[Coo71] S.A. Cook, “The Complexity of Theorem               [Mic94] Z. Michalewicz, Genetic Algorithms + Data
Proving Procedure,” Annual ACM SIGACT Symposium on          Structures = Evolution Programs, Sringer-Verlag, New
Theory of Computing, pp. 151-158, 1971.                     York, New York, NY, 1994.

[GJ79] M.R. Garey and D.S. Johnson, Computers and           [MIT] MIT Technology Lab, “GAlib: A C++ Library of
Intractability – A Guide to the Theory of     NP-           Genetic    Algorithm       Components”,      URL,
Completeness, San Francisco, 1979.                          http://lancet.mit.edu/ga/.

[GCB+97] J. Gray, S. Chaudhuri, A. Bosworth, A.             [RSS96] K.A. Ross, D. Srivastava, and S. Sudarshan,
Layman, D. Reichart, M. Venkatrao, F. Pellow, and H.        “Materialized view maintenance and integrity constraint
Pirahesh, “Data Cube: A Relational Aggregation Operator     checking: Trading space for time,” SIGMOD Record
Generalizing Group-By, Cross-Tab, and Sub-Totals,”          (ACM Special Interest Group on Management of Data),
Data Mining and Knowledge Discovery, 1:1, pp. 29-53,        25:2, pp. 447-458, 1996.
1997.
                                                            [Rou82] N. Roussopoulos, “View Indexing inrelational
[GM95] A. Gupta and I.S. Mumick, “Maintenance of            Databases,” ACM Transactions on Database Systems, 7:2,
Materialized Views: Problems, Techniques, and               pp. 258-290, 1982.
Applications,” Data Engineering Bulletin, Special Issue
on Materialized Views and Data Warehousing, 18:2, pp.
                                                            [Rou97] N. Roussopoulos, “Materialized Views and Data
3-18, 1995.
                                                            Warehouses,” in Proceedings of the KRDB, December
                                                            1997.
[Gup97] H. Gupta, “Selection of Views to Materialize in a
Data Warehouse,” in Proceedings of the International
                                                            [TS97] D. Theodoratos and T.K. Sellis, “Data
Conference on Database Theory, pp. 98-112, Delphi,
                                                            Warehouse Configuration,” in Proceedings of the Twenty-
Greece, January 1997.
                                                            third International Conference on Very Large Databases,
                                                            Athens, Greece, pp. 126-135, August 1997.
[GM99] H. Gupta and I. Mumick, "Selection of Views to
Materialize Under a Maintenance Cost Constraint," in        [Wid95] J. Widom, “Research Problems in Data
Proceedings of the International Conference on Database     Warehousing,” in Proceedings of the Fourth International
Theory, Jerusalem, Israel, pp. 453-470, January 1999.       Conference on Information and Knowledge Management,
                                                            Baltimore, Maryland, pp. 25-30, November 1995.
[Gol89] D.E. Goldberg, Genetic Algorithms in Search,
Optimization, and Machine Learning, p.412, Addison-         [ZGH+95] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J.
Wesley, Mass., 1989.                                        Widom, “View Maintenance in a Warehousing
                                                            Environment,” SIGMOD Record (ACM Special Interest
[HRU96] V. Harinarayan, A. Rajaraman, and J.D.              Group on Management of Data), 24:2, pp. 316-27, 1995.
Ullman, “Implementing data cubes efficiently,” SIGMOD
Record (ACM Special Interest Group on Management of
Data), 25:2, pp. 205-216, 1996.

[IK93] W.H. Inmon and C. Kelley, Rdb/VMS:
Developing the Data Warehouse, QED Publishing Group,
Boston, London, Toronto, 1993.

[LQA97] W. Labio, D. Quass, and B. Adelberg, “Physical
Database Design for Data Warehouses,” in Proceedings of
the International Conference on Data Engineering,
Birmingham, England, pp. 277-288, March 1997.

[LH99] M. Lee and J. Hammer, “Speeding Up
Warehouse Physical Design Using A Randomized
Algorithm,” University of Florida, Gainesville, FL,
Technical Report April 1999.


M. Lee, J. Hammer                                                                                               3-9

</pre>