=Paper=
{{Paper
|id=Vol-28/paper-9
|storemode=property
|title=A quality-based framework for physical data warehouse design
|pdfUrl=https://ceur-ws.org/Vol-28/paper9.pdf
|volume=Vol-28
|dblpUrl=https://dblp.org/rec/conf/dmdw/BouzeghoubK00
}}
==A quality-based framework for physical data warehouse design==
A Quality-Based Framework for Physical
Data Warehouse Design
Mokrane Bouzeghoub, Zoubida Kedad
Laboratoire PRiSM, Université de Versailles
45, avenue des Etats-Unis
78035 Versailles Cedex, France
Mokrane.Bouzeghoub@prism.uvsq.fr, Zoubida.Kedad@prism.uvsq.fr
Abstract organize this data into multidimensional structures
which are suitable for decision making, and (iv)
Data warehousing is a software infrastructure refresh it periodically to maintain the data up to date
which supports OLAP applications by and accurate. Many problems related with this
providing a collection of tools which allow approach have been addressed, such as data
data extraction and cleaning, data integration extraction and cleaning, selection of the best views to
and aggregation, and data organization into materialize, update propagation, and
multidimensional structures which are multidimensional representation and manipulation of
suitable for decision making. At the design data.
level, a data warehouse is defined as a
hierarchy of view expressions. One key View materialization is a physical design technique
problem of the design of a data warehouse is which has been introduced to optimize database
the selection of a set of views to materialize queries by making persistent some intermediate
in order to meet the performance results, possibly shared by different queries [Gupta &
requirements, that is, to minimize both the Mumick 95]. It is seen as a caching technique which
evaluation cost of the users queries and the contributes to optimize queries. Additionally, in the
maintenance cost of the materialized data warehousing context, view materialization
expressions. Existing approaches either avoids overloading sources that supply operational
propose algorithms that explore the search data, and therefore makes data warehouse
space by enumerating and evaluating all the applications partially and momentarily independent
possible solutions, or introduce heuristics that of the data sources. This introduces two
restrict the search space. In this paper, we complementary constraints: (i) the first requires that
propose a framework to support the physical all OLAP queries must be evaluated using only the
design of a data warehouse; it is based on a materialized views, (ii) the second requires that
logical model, called the view synthesis materialized views must be updated periodically with
graph. A level of overlapping is defined to respect to data changes OLAP applications want to
characterize the graph. This level allows to see, and to a level of freshness required for this data.
order the search space. The framework Consequently, one key issue during the physical
integrates some quality factors that can be design of a data warehouse is the selection of the set
used either to drive the exploration of the of views to materialize in order to optimize both the
search space, or to reduce the search space, query evaluation and the maintenance cost.
or both.
The general approach to select the best views to
1. Introduction materialize in a data warehouse consists in exploring
a solution space which is defined by all the plans
Data warehousing is a software infrastructure which derived from a multi-query graph which integrates all
supports OLAP applications by providing a collection the input queries. Two kinds of exploring techniques
of tools which (i) collect data from a set of distributed are reported in the literature: the first is an exhaustive
heterogeneous sources, (ii) clean and integrate this evaluation of all the possible materialization
data into a uniform representation, (iii) aggregate and scenarios [Theodoratos & Sellis 97], the second
proposes heuristics to reduce the size of the search
space [Theodoratos et al 99][Yang et al
The copyright of this paper belongs to the paper’s authors. Permission to
copy without fee all or part of this material is granted provided that the
97][Ligoudistianos et al 99]. The algorithms of the
copies are not made or distributed for direct commercial advantage. second class are obviously less expensive compared to
Proceedings of the International Workshop on Design those of the first class, however, they do not
and Management of Data Warehouses (DMDW'2000) guarantee to derive the best solution as the search
Stockholm, Sweden, June 5-6, 2000 space is arbitrarily reduced using some heuristics.
(M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/
M. Bouzeghoub, Z. Kedad 9-1
Algorithms of both classes are driven by cost framework for the selection of the best materialization
functions that formalize the compromise between the scenario. Section 5 presents the related works and
query evaluation cost and the maintenance cost, section 6 concludes with further research.
called the operational cost. Our general comments on
these approaches are the following: 2. The General Framework for Quality-
Based Physical Design in Data
• Both approaches search for the materialization
Warehousing
scenario having the minimal operational cost, without
considering whether this minimal cost coincides or As recalled before, algorithms for materialized view
not with user needs. In practical applications, it is selection are generally based on a cost function that
often not necessary to search for the best solution but combines query evaluation cost and view
for the solution that best fits the user requirements. maintenance cost. Given a cost function, the selection
Indeed, it is not necessary to spend too much time to problem is posed as an exhaustive search of the best
find the solution with a minimal cost while an materialization scenario, possibly driven by some
intermediate solution whose cost fits user requirement heuristics that restrict the search space under
may be sufficient. On the other side, if the solution consideration. The search space itself is generally
having the minimal cost is far from user needs, it's represented as a multi-query graph, which is the
important to highlight the mismatch between what union of all possible plans of all user views (or
the user requires and what the technology is able to queries).
provide, especially when the costs of all the existing
solutions are monotonic and predictable. The idea behind our framework is to make the
materialized view selection problem as practical as
• Heuristic-based approaches provide faster possible by providing a context within which derived
algorithms to derive reasonable solutions but do not solutions can be evaluated with respect to their
provide any characterization of these solutions with relevance to user requirements in terms of quality
respect to those derived from the exhaustive search. factors. Quality factors are considered as a good way
Besides this point, they do not provide any other to handle non-functional requirements. They can play
quality factor - e.g. flexibility or evolution - which a central role in the algorithms which select a
can convince that the derived solution is not useless. materialization scenario. Indeed, the complexity of
the view selection problem is such that running the
• It is hard in both approaches to introduce corresponding algorithms for real applications
additional requirements in the corresponding remains a challenge. Thus, introducing quality factors
algorithms, such as concurrency control and security may, on the one hand, help in reducing the search
for example, which may also influence the space, and on the other hand, allow to characterize
operational cost. the derived solutions.
A very few research has been done in handling We assume that users are able to specify these quality
quality factors in logical and physical data warehouse factors in a formal way as numerical values, logical
design. A substantial effort has been done within the assertions or any other mathematical formula. As an
European DWQ project where a generic model of example of requirement for the selection of
quality has been defined [Jarke et al 99] [Vassiliadis materialized views, one may give boundaries to the
et al 99] [Jeusfeld et al 98] to capture the set of operational cost, and the exploration procedure will
quality factors associated to the various data therefore consider only solutions which match this
warehouse objects, and to perform their measurement requirement. Two categories of quality factors can be
in a specific quality goal. It was particularly shown distinguished:
that the refreshment process [Bouzeghoub et al 99] - Those who define or constrain the cost function,
and the selection of the materialized views such as conformance to space limitation, upper limit
[Bouzeghoub & Theodoratos 99] are demonstrative to query response time, freshness of data;
examples where quality can drive the design process. - Those who impose some desired features on the
Following the work presented in [Bouzeghoub & derived materialization scenario (the selected plan),
Theodoratos 99], this paper is a step further to such as evolution, maintainability or traceability.
elaborate a general framework which allows to
integrate in a more systematic way several quality Although the second category of quality factors is not
factors into the physical design process. directly involved in the cost function, it may help in
the generation of the heuristics which can drive a
This paper is organized as follows: in section 2, we selective search, i.e., it might be a part of the
present the general framework for physical design. selection algorithm.
Section 3 gives some preliminary definitions of the
fundamental concepts that constitute the baseline of The second category of quality factors constrains the
our framework. Section 4 presents the use of this structure of the views. They are generally related to
M. Bouzeghoub, Z. Kedad 9-2
the level of overlapping between view definitions. 3. Preliminary definitions
This overlapping level is very important in physical
design; the higher it is, the more complex is the In this section, we present the formalisms and the
selection problem. Indeed, considering a set of n user definitions used in the proposed framework for data
views (or queries) whose plans do not overlap, the warehouse design. We assume that the input queries
complexity of the selection problem is reduced to the are represented by relational expressions. In the
complexity of the optimization of a list of n remaining of the paper, these queries are called users
independent queries. Except for this very specific views. Our framework is based on a graph
case, user views in data warehouse systems generally representing the logical perception of a data
overlap. We assume that the complexity of the warehouse, and called the view synthesis graph.
materialized views selection algorithms is not
independent of the level of overlapping since any The design process of the logical schema of a data
materialization scenario has a direct influence on the warehouse is out of the scope of this paper; it starts
evaluation of all user views which share the same from the representation of the different user views.
materialized views. Moreover, evolution of a physical These views represent the users needs, that is, the
design as well as traceability and many other different aggregated data corresponding to the user
quality factors can also be correlated to the level of requirements, computed using a given set of relations
overlapping of views. A part of our design framework located in the data sources. We assume that the user
is consequently based on these structural features. views are represented by relational expressions. In a
data warehouse, the user views are not necessarily
Figure 1 shows a simplified view of the framework, disjoint. A given user view is represented by a set of
which uses a class of quality factors which may be equivalent relational expressions corresponding to
defined on a given search space. This class can be different plans.
divided in two subclasses, the first one concerning the
quality factors which constrain the cost function, and The result of this design process is the logical
the second concerning those which characterize the schema, called the view synthesis graph (VSG). The
structural aspect of a materialization scenario. The design process of the VSG produces successively the
quality factors of the latter subclass may be used to following representations:
generate heuristics which restrict the search space or
• The multi-query graph (MQGs): we assume that
which allow to qualify the selected solutions. They
can also be used to select a representation which each user view is a relational expression. Given such
satisfies some level of overlapping. This an expression, a set of equivalent plans is considered;
representation will then be used as a restricted search they are represented by a MQG, which is presented in
space on which the selection procedure is performed. section 3.1.
• The multi-view graph (MVG): this graph
Section 4 presents the usage of this framework
through an example. represents the integration of all the MQGs
corresponding to the set of user views. Therefore, it
represents for each user view a set of relational
expressions. The MVG is presented in section 3.2.
SEARCH SPACE The properties of this graph are given in section 3.3
• The view synthesis graph (VSG): this graph is
DefinedOn generated from the MVG and it is such that each user
view is represented by a single expression. This graph
Inputs Inputs
QUALITY FACTORS is presented in section 3.4, along with the definition
of the level of overlapping.
SpecializedAs SpecializedAs
One important feature of the VSG is that it shows a
COST FUNCTION STRUCTURAL set of sub-expressions which are shared by several
PARAMETERS PROPERTIES user views. The following sub-sections describe in
ain
s more details the different graphs and their properties.
nstr Inputs
Inputs Co
4 o
sT
rm
onfo 3.1. The multi-query graph
SELECTION C CHARACTERIZATION
PROCEDURE 2 PROCEDURE
Let us consider a user view Vi. This view can be
Outputs 3 Outputs
represented by a set of equivalent algebraic
Inpu
ts
expressions denoted EVi and such that:
SELECTED SELECTED EVi = {e1Vi, e2Vi, ...., ekVi}
SCENARIOS 1 BelongsTo REPRESENTATION
The algebraic expressions corresponding to the user
view Vi are denoted ekVi. They use a set of operations
Figure 1: The general framework for
quality based design
M. Bouzeghoub, Z. Kedad 9-3
V(A,G,H)
V(A,G,H)
R1(A, B, C) R2(C, G, E) R3(E, H) R1(A, B, C) R2(C, G, E) R3(E, H)
(a) (b)
Figure 2 : The multi-query graph associated with a user view
such as join, project and restrict operations, or some distributivity or associativity. Figure 2 gives an
aggregate function. Each expression represents a example of a user view V and the associated multi-
given plan for the user view Vi. In order to limit the query graph.
number of considered plans, we will consider only
those for which the project operation is the last one. 3.2. The multi-view graph
Each user view being represented by a multi-query
Given a user view Vi and the associated set of
graph, the multi-view graph (MVG) consists in
equivalent expressions EVi, the multi-query graph
integrating all the MQGs into a single graph. It
represents each element ejVi of the set EVi. Each
therefore represents a set of equivalent expressions for
expression ejVi is either an expression without any
each user view. More formally, let us consider the
project operation, or an expression in which the
following notations:
project operation is the last one. Note that this
V = {V1, V2, ...., Vn } the set of user views,
restriction will reduce the number of expressions in
EVi = {e1Vi, e2Vi , ...., ekVi} the set of equivalent
the MQG, but not the number of common sub-
expressions corresponding to the view Vi.
expressions. The multi-query graph does not
The multi-view graph denoted MVG is such that:
represent all the equivalent expressions to the initial
MVG = {EV1, EV2, ....., EVn}
view, because the search of these expressions may
lead to an infinite set of expressions. This graph
The multi-view graph is a set whose elements are sets
represents only a sub-set of the possible equivalent
of equivalent expressions associated with the user
expressions which will be used as an input to generate
views.
the multi-view graph presented in the next section.
The plans of the set EVi can be generated using the
properties of the algebraic operations, such as
V1 (A, B, C, D, E, F) V2 (A, G, H)
V3 (A, B, E, F)
R1(B,D) R2 (A,B,C) R3 (E,F,A) R4 (G,H,E)
Source S1 Source S2 Source S3
Join operation Restriction operation Projection operation
Figure 3 : Example of a multi-view graph (MVG)
M. Bouzeghoub, Z. Kedad 9-4
Figure 3 shows an example of the graphical As a consequence, the level of sharing of the sub-
representation of an MVG, built by integrating the expression e is such that:
MQGs corresponding to three user views V1(A, B, C, level_of_sharing(e) > 0
D, E, F), V2(A, G, H) and V3(A, B, E, F). In this In other words, two expressions representing distinct
example, we can notice that some sub-expressions are views are dependent if they have a common sub-
used by more than one user view. An ellipse expression e in the multi-view graph.
represents these expressions.
An expression eiVk representing a view Vk is
The MVG graph represents a set of equivalent independent if it does not share any sub-expression
expressions for all the user views of the data with another plan ejVl representing a distinct view Vl.
warehouse. If we consider the graphical That is, the level of sharing of each sub-expression e
representation of a MVG, each node in the graph of eiVk equals zero. In this case, the considered plan
represents either a source relation, a sub-expression, has no common sub-expression with any of the other
or an expression corresponding to a user view. The plans representing a distinct view in the multi-view
graph shows all the existing sub-expressions used by graph.
each user view. Each sub-expression may be either
shared by several plans, or used in only one plan. 3.3.3. Level of dependency of an expression
representing a view
3.3. Properties of the multi-view graph The level of dependency of an expression ejVi
The properties which characterize the sub-expressions representing a user view Vi in a multi-view graph
and the plans in a MVG are defined below. MVG is the number of distinct user views Vl such
that there exists an expression ekVl which share a sub-
3.3.1. Level of sharing of a sub-expression expression e with ejVi. Let us consider the set of user
views V represented by the MVG; the level of
The level of sharing of a sub-expression is the
dependency of an expression ejVi is a function defined
number of distinct views which uses this sub-
as follows:
expression; let us consider the multi-view graph
MVG corresponding to a set of user views V, and a
level_ of_dependency(eiVj)
sub-expression in MVG denoted e. The level of
sharing of the sub-expression e in the graph MVG is
level = 0
a function defined as follows:
for each Vl not marked in V, l ≠ j
level_ of_sharing(e) for each ekVl in EVk
for each e ∈ eiVj
level = 0 if (e ∈ ekVl) then
for each Vi in V mark Vi
for each ejVi in EVi level = level + 1
if (e ∈ ejVi) and (Vi not marked) then endif
mark Vi endfor
level = level + 1 endfor
endif endfor
endfor return (level)
endfor end;
return (level)
end; 3.3.4. Redundant expressions
An expression eVi representing the user view Vi is
The level of sharing characterizes a sub-expression in redundant if there exists a distinct expression e'Vi
the multi-view graph. Each expression corresponding representing the same user view Vi and such that:
to a user view in the multi-view graph can be level_of_dependency(eVi) < level_of_dependency(e'Vi)
characterized according to the following properties. In other words, if e'Vi is the expression having the
highest level of dependency, then all the expressions
3.3.2. Dependency between two expressions representing the same view Vi and having a lower
corresponding to distinct views level of dependency are said to be redundant.
Let us consider two expressions eiV1 and ejV2
representing respectively the user views V1 and V2 in 3.4. The view synthesis graph
a multi-view graph MVG. The two expressions eiV1 The view synthesis graph is a canonical
and ejV2 are said to be dependent if there exists a sub- representation derived from the MVG. This
expression e in the multi-view graph such that e is representation is such that each user view is
used in both eiV1 and ejV2. represented by exactly one plan. For each user view,
M. Bouzeghoub, Z. Kedad 9-5
among all the plans represented by the MVG, only keeping for each view the expression having the
one will be kept in the VSG. highest level of dependency, that is by eliminating the
redundant expressions. The resulting VSG is given in
One possible way to built the VSG is to use the notion figure 4.
of redundancy: a view synthesis graph can be defined
as a set of expressions such that each user view is V1(A,B,C,D,E,F) V2(A,G,H)
represented by exactly one expression, and in which
each plan eVi representing the user view Vi is not V3(A,B,E,F)
redundant in the corresponding MVG. This means
that the VSG is built in order to maximize the
number of shared sub-expressions.
By definition and construction, the multi-view graph
is unique. The view synthesis graph is not unique. For
a given view Vi, there may exist more than one
expression having the highest level of dependency,
and consequently, the same MVG may lead to more
RB2(A,B,C,E,F)
than one VSG. The level of overlapping which
RB3(A,G,E)
characterizes a VSG is defined below. RB1(B,D)
Level of overlapping of a view synthesis graph
R1(B,D) R2(A,B,C) R3(E,F,A) R4(G,H,E)
The level of overlapping of a view synthesis graph is Source S1 Source S2 Source S3
the maximal level of dependency of the view
expressions represented in this graph. Consider a Figure 4 : Example of a logical model
VSG containing a number k of user views denoted Vi,
each of them described by the expression denoted eVi. Each non-redundant plan corresponding to a user
A level of dependency denoted ni characterizes each view is represented by an algebraic tree. The
expression. The level of overlapping of this VSG, expressions RB1, RB2 and RB3 are the base views.
denoted n(VSG) is such that: Each of them is computed from the data provided by a
single data source. These expressions can be viewed
n(VSG) = Max(ni, i= 1, k) at as the extractors (or wrappers) associated with each
source. The expression RB2 is an intermediate view
The VSG corresponding to a given MVG and having shared by the three user views V1, V2 and V3.
the highest level of overlapping must satisfy the
following conditions: 4. Using the framework for data
warehouse design
• To each user view of the data warehouse is
associated a single expression in the VSG. The key issue during the physical design of a data
• All the expressions in the VSG are not warehouse is to find a set of views to materialize in
redundant. order to meet some user requirements. Generally,
these requirements are expressed in terms of a
The resulting VSG is a hierarchy of views. Besides maintenance cost, and an evaluation cost. The
the user views, this hierarchy contains intermediate combination of these two costs is the operational cost
views and base views defined as follows: of the data warehouse. The goal of the physical
design is to find the set of views to materialize in
• Each sub-expression computed from the relations order to minimize this operational cost.
provided by a single data source is defined as a base
view. Existing approaches for the selection of the views to
• A given sub-expression in the VSG is an materialize either perform an exhaustive exploration
intermediate view if it satisfies the following of all the possible materialization scenarios, and
conditions: evaluate them to find the one that minimizes the
1. this sub-expression is shared by two or more operational cost, or use heuristics in order to reduce
plans in the VSG the number of scenarios to evaluate. In this section,
2. this sub-expression is neither a base view, nor we show how the proposed framework can be used for
a user view. the physical design of a data warehouse. This
framework allows taking into account some cost
Starting from the MQG given in figure 3, the VSG function parameters as well as some structural
which maximizes the level of overlapping is built by properties reflecting some desired quality criteria.
M. Bouzeghoub, Z. Kedad 9-6
The considered search space is partially ordered, and VSGpi denotes a view synthesis graph and m is the
may be explored using different strategies. number of view synthesis graphs associated with the
data warehouse and having the level of overlapping p.
In the following, we first present the ordering of the Each graph VSGpi is such that n(VSGpi) = p.
search space, then the cost function parameters and
the structural properties, and finally, we illustrate the In the set Sp, the different representations can be
use of the framework by an example of instantiation. ordered by considering subsets of representations
having the same number of shared sub-expressions.
4.1. Ordering the search space Therefore, the set Sp can be partitioned in several
We consider that the data warehouse is represented by subsets, considering the number of shared sub-
a view synthesis graph. Given the set of the user expressions in the representations. A subset of Sp
views of the data warehouse, the corresponding view denoted Snp contains all the representations whose
synthesis graph is not unique, as we have seen in level of overlapping is p and whose number of shared
section 3. Besides this point, given a view synthesis expressions is n.
graph, several candidate sets of materialized views
may be derived, each of them representing a Figure 5 illustrates the partitions of the search space
materialization scenario to evaluate. If we denote by according to the two criteria n and p. The search
VSGi a view synthesis graph, and MSi the sets of space is first partitioned in several sets Sp according
materialization scenarios which may be derived from to the level of overlapping. Then each set Sp is
the graph VSGi, the search space for the selection of partitioned in several sets Snp according to the
the materialized views denoted S is such that: number of shared expressions in the VSG.
S = MS1 ∪ MS2 ∪ ....∪ MSn, where n is the number
Once the search space is ordered, it can be explored
of view synthesis graphs associated with the data
in several ways: for example, the exploration could be
warehouse. Therefore, exploring the search space S
done starting from the representations having the
consists in considering every materialization scenario
highest level of overlapping, that is, representations
for every possible view synthesis graph representing
showing the most shared expressions. These
the data warehouse.
representations could then be explored according to a
descending order of their number of shared sub-
This search space can be ordered according to two
expressions. In our framework, the exploration
criteria:
strategy and the selection of the scenarios may be
• the level of overlapping of the view synthesis
done according to two kinds of factors: the cost
graphs representing the data warehouse,
function parameters and the structural properties
• the number of shared expressions in the view
required for the representation of the data warehouse.
synthesis graphs.
Both factors are presented in sections 4.2. and 4.3.
Given a multi-view graph, the set of view synthesis
graphs having the level of overlapping p is denoted
Sp, and is such that: Sp = {VSGp1, VSGp2, ... ,
VSGpm},
Search Space
Partitioning
the search space
according to the
level of overlapping
Sn Sn-1 S0
VSGn1 VSGn2 … VSGnm VSGn-11 VSGn-12 … VSGn-1p … … VSG01 VSG02 … VSG0l
… … … … .. … … … … ..
Partitioning
the search space Sin-1 Si-1n-1 Si-kn-1
according to the
number of shared VSGn-11 VSGn-12 VSGn-13 … … VSGn-1p
expressions
Figure 5 : Partitioning the search space
M. Bouzeghoub, Z. Kedad 9-7
The exploration of the search space is costly, and it is
4.2. The cost function parameters not necessary to search for an optimal solution which
The evaluation of a materialization scenario is done is far beyond the actual performance needs. The
using a cost function which takes into account the exploration may stop when a solution fitting the user
different factors having an impact over the overall requirements is found. Another possible cost function
cost of the scenario. Finding such function is a parameter could be a space limitation. All the cost
domain specific problem, because the factors to function parameters are used to evaluate and select
consider vary from one context to another. the materialization scenarios.
In the existing approaches for the selection of 4.3. The structural properties
materialized views, the different materialization The structural properties are related to the
scenarios are compared by considering their representation of the data warehouse. They express
respective operational cost, which is a combination of some quality requirements specified by the user for
a maintenance cost and an evaluation cost. In the the materialization scenario. Let us consider for
following, we recall the general cost function which is example the traceability criteria. Given a set of user
widely used in the literature related to the data views, one possible requirement is to trace the
warehouses physical design. intermediate results for this set in order to find the
origin of an error or a misunderstanding of some
Each materialized view in the data warehouse has aggregate value. For example, a set of aggregate
maintenance cost, which is the cost of propagating values computed by different user views may be
the source updates in this view. The maintenance cost contradictory. To find the origin of the contradiction,
of the data warehouse, denoted M, is the sum of the the materialized views of the data warehouse will be
maintenance costs of all the materialized views. We checked in order to find the computation which gave
denote Vmi, i ranging from 1 to n, the materialized raise to the error. It is obvious to see that the higher is
views of the data warehouse, f mi the maintenance the number of user views which use the same
frequency for the view Vi and CVi the maintenance materialized expressions, the less is the cost of
cost of the view Vi; the maintenance cost of the data tracing a given error, because in this case, the number
warehouse is such that: of materialized expressions to check is low compared
M = Σ i=1, n f mi . C Vmi to the number of expressions to consider if the users
views share very few materialized expressions.
The user views are evaluated against the set of
materialized views. They are denoted Vui, i ranging Consequently, in order to ease the tracing process, it
from 1 to p. Each of these user views has an would be more suitable to select the materialized
evaluation cost denoted EVui and a frequency denoted f views among the set of expressions which are used by
e
i. The global evaluation cost of a materialization the maximal number of user views, that is, over the
scenario denoted E is such that: view synthesis graph having the highest level of
E = Σ i=1, p f ei . EVui overlapping. The best materialization scenario will
therefore represent a compromise between the cost
The operational cost of a materialization scenario parameters and the level of overlapping.
denoted OC is a weighted sum function of the
maintenance cost M and the evaluation cost E: In our framework, the structural properties may be
handled in the following ways:
OC = E + wM • they may be used to constrain the selection
procedure, that is to define an exploration strategy in
The best set of materialized views is the one the search space; in the example of the traceability
corresponding to the minimal operational cost OC. criteria, the search space may be explored
exhaustively starting from the representations having
The w factor is a parameter which expresses the the highest level of overlapping,
relative importance between the evaluation cost and • they may be used to reduce the search space, by
the maintenance cost. Besides this factor, other selecting a sub-set of representations satisfying them,
parameters may be considered. For example, the users • and finally, they may be used after the
or the administrators of the data warehouse may state determination of the scenario which best fits the cost
a value or an interval representing the acceptable function parameters, in order to check whether this
operational cost of the data warehouse. These scenario conforms to the structural properties or not.
boundaries of the operational cost may be very useful
to confront the cost of a scenario against the
expectations of the users in terms of performance.
M. Bouzeghoub, Z. Kedad 9-8
4.4. An instantiation of the general framework The characterization procedure is not performed, and
the paths used in the instantiation are represented by
We will now present an example of selection of a
the continuous lines. If p is the maximal level of
materialization scenario considering, additionally to
overlapping which can be derived from the multi-
the ordering criteria of the search space, the following
view graph, then the corresponding set Sp contains
factors: first, the users or the administrators of the
the sub-expressions which are shared by the maximal
data warehouse have stated an interval representing
number of user views. These sub-expressions will be
the acceptable operating cost of the data warehouse;
first considered when determining the views to
second, some traceability requirements have been
materialize. Multiple materialization scenarios can be
expressed by the users. These requirements will be
generated from a single graph VSGpi. Each of them
used to constrain the selection procedure.
should be evaluated. After considering the graphs of a
set Sp, the graphs of the set Sp-1 are considered, and
If we consider that the user requirements in terms of
the cost of the materialization scenarios is evaluated
performance are expressed by the interval [l1, l2],
for each graph VSGp-1i. This process iterates until a
where l1 and l2 represents the lowest and the highest
solution is found within the boundaries specified by
acceptable operational cost respectively, the
the users.
exploration of the ordered search space will stop if the
operational cost OC corresponding to a
This exploration of the search space starting from the
materialization scenario msi is such that:
representations having the maximal level of
overlapping is described by the following general
l1 ≤ OC(msi) ≤ l2.
algorithm which uses the following notations:
Considering the traceability criteria, the approach for
• MVG is the multi-view graph describing the data
the selection of the views to materialize will focus on
warehouse,
the shared sub-expressions, and choose the
• Sp the set of view synthesis graph derived from
materialized views in the set of sub-expressions
MVG and having the level of overlapping p,
shared by the maximal number of views. The
• pmax the maximal level of overlapping of the
exploration strategy is then to consider first
expressions in MVG,
representations having the highest level of p
• VSG i a view synthesis graph whose level of
overlapping. The instantiation of our general
overlapping is p,
framework considering these requirements is given in
• [l1, l2] the interval representing the user
figure 6.
requirements, that is the interval in which the
operational cost varies.
• MSp the set of materialization scenarios derived
SEARCH SPACE
from the view synthesis graphs whose level of
overlapping is p,
• msi a materialization scenario,
DefinedOn
• OC(msi) the operational cost of the
Inputs
Inputs materialization scenario msi.
QUALITY FACTORS
SpecializedAs SpecializedAs Exploration procedure
p = pmax ;
STRUCTURAL
COST FUNCTION PROPERTIES Generate Sp from MVG;
PARAMETERS ns (maximal level of While (solution not found)
rai
(user cost interval [l1,l2])
Const overlapping) If Sp ≠ ∅ Then
For each VSGpi in Sp
o
4
sT
Inputs
rm
Inputs
Generate MSp
nfo
Co
SELECTION 2 CHARACTERIZATION For each msi in MSp
PROCEDURE PROCEDURE If l1 ≤ OC(msi) ≤ l2
Then (msi) is an acceptable solution;
Outputs 3 In
puts Outputs Endif
SELECTED
Endfor
SELECTED
SCENARIOS 1 BelongsTo
REPRESENTATION Endfor
Endif
p = p-1;
Figure 6 : Instantiation of the general framework
Generate Sp from MVG;
Endwhile
End
M. Bouzeghoub, Z. Kedad 9-9
We did not present in this paper the details of the exploration is done according to a strategy which may
selection of the candidate sets of views to materialize; be driven by the user requirements. These
each set of candidate materialized views should allow requirements may be either some cost function
to compute all the user views, that is, each user view parameters or some structural properties. Existing
can be rewritten using the set of materialized views. approaches introduce cost function parameters such
as space limitation, but they do not integrate the
5. Related works structural properties which express some quality
factors such as evolutivity, security or traceability.
Several approaches have been proposed for the
physical design of a data warehouse. The problem
addressed by these approaches is to find a set of views
6. Conclusion
to materialize in order to meet the performance In this paper we have presented an extensible
requirements. These performance requirements are framework for data warehouse physical design. This
expressed as an operational cost, which is a framework is based on a logical schema, which is a
combination of an evaluation cost and a maintenance canonical representation of the data warehouse views.
cost. Some of the approaches dealing with this design This logical model is built from subsets of the
problem are described in [Baralis et al 97], possible plans for each user view, which are
[Theodoratos & Sellis 97], [Theodoratos & Sellis 98], integrated in an intermediate representation called the
[Theodoratos et al 99], [Ligoudistianos et al 99] and multi-view graph. The logical model, called the view
[Yang et al 97]. synthesis graph (VSG), is derived from the multi-
view graph. A level of overlapping characterizes this
Two kinds of solutions are generally proposed for the model. It represents the maximal number of views
selection of the views to materialize in a data sharing the same expression in the graph.
warehouse, either solutions consisting in an
exhaustive exploration of the possible materialization The framework for physical design uses the VSG as a
solutions to find to optimal one, or solutions based on representation of the views of a data warehouse. Since
some heuristics which reduce the exploration. In this VSG is not unique, the search space for the
[Yang et al 97], two algorithms for selecting the selection of the views to materialize is composed of
views to materialize are proposed; the first one is an all the scenarios derived from all the possible VSG.
exhaustive exploration of all the possible plans of the This search space can be ordered considering the
users queries to find an optimal multiple view level of overlapping and the number of shared
processing plan, the second one is an heuristic expressions of the representations from which the
algorithm which considers only the optimal plan for scenarios are derived.
each user query.
This ordered search space can be explored in different
In [Baralis et al 97], the proposed algorithm ways; one possible exploration strategy consists in
addresses the materialized view selection problem in giving priority to the most shared sub-expressions
the context of multidimensional databases. during the selection of the views to materialize, and
[Theodoratos & Sellis 97] and [Theodoratos & Sellis therefore exploring first the representations having
98] present the view selection problem as a state the highest level of overlapping. The exploration is
space problem. In [Theodoratos et al 99], a pruning guided by some users requirements, which can either
algorithm and a greedy algorithm are presented to be related to the cost function, such as the upper an
select the set of views to materialize. But the lower bounds representing acceptable operational
heuristics used do not guarantee to find the optimal cost, or related to some structural properties of the
solution. [Ligoudistianos et al 99] proposes an A* logical model, such as evolutivity.
algorithm which delivers the optimal solution, and
presents some heuristics to prune the considered state Further works will consist in identifying sets of
space. quality factors related to the physical design of the
data warehouses, and study the impact of these factors
In this paper, rather than a new approach for the on both the selection and the characterization
selection of the best set of materialized views, we procedures of the proposed framework.
tried to define the main features of a framework for
data warehouse physical design. This physical design References
is done on a search space consisting of a set of
representations of the data warehouse, considering [Baralis et al 97] E. Baralis, S. Paraboschi, E.
the level of overlapping and the number of shared Teniente. Materialized View Selection in a
expressions which characterize each representation. Multidimensional Environment. In Proc. of the
This level allows to order the search space, and the 23rd International Conference on Very Large Data
Bases (VLDB'97), Athens, Greece, August 1997.
M. Bouzeghoub, Z. Kedad 9 - 10
[Bouzeghoub et al 99] Bouzeghoub M., Fabret F., [Hull & Zhou 96] R. Hull, G. Zhou. A Framework for
Matulovic M., Modeling the Data Warehouse Supporting Data Integration Using the Materialized
Refreshment Process as a Workflow Application. and Virtual Approches. In Proceedings of the ACM
Proceedings of the International Workshop on SIGMOD International Conference on
Design and Management of Data Warehouses Management of Data, Montreal, Canada, June
(DMDW'99), Heidelberg, Germany, June 1999. 1996.
[Bouzeghoub & Theodoratos 99] Bouzeghoub M., [Jarke et al 97] M. Jarke, Y. Vassiliou. Foundation of
Theodoratos D. Data Currency Quality Factors in Data Warehouse Quality: an Overview of the DWQ
Data Warehouse Design. Proceedings of the project . In Proceedings of the 2nd Internat. Conf.
International Workshop on Design and on Information Quality, Cambridge, Mass, 1997.
Management of Data Warehouses (DMDW'99), [Jarke et al 99] M. Jarke, M. Lenzerini, Y. Vassiliou,
Heidelberg, Germany, June 1999. P. Vassiliadis. (edits). Fundamentals of Data
Warehouses. Springer, 1999.
[Calvanese et al 98] D. Calvanese, G. De Giacomo,
M. Lenzerini, D. Nardi, R. Rosati. Source [Jarke et al 99a] Jarke M, Jeusfeld M.A., Quix C.,
Integration integration : conceptual modeling and Vassiliadis P., "Architecture and Quality in Data
reasoning support. In Proceedings of the 3rd IFCIS Warehouses", Proceeding of the 10th International
International Conference on Cooperative Conference on Advanced Information Systems
Information Systems (CoopIS’98), New York City, Engineering (CAiSE'98), Pisa, Italy, June 1998.
USA, August 1998.
[Jeusfeld et al 98] Jeusfeld, M. A., Quix C., Jarke M.,
[Chawathe et al 94] S. Chawathe, H. Garcia-Molina, "Design and Analysis of Quality Information for
J. Hammer, K. Ireland, Y. Papakonstantinou, J. Data Warehouses", Proceedings of the 17th
Ullman, J. Widom. The TSIMMIS project : Internat. Conf. on Conceptual Modeling (ER'98),
Integration of Hererogeneous Information Sources. Singapore, november 1998.
In Proceedings of the 10th Meeting of the
Information Processing Society of Japan (IPSJ'94), [Kedad 99] Z. Kedad. Techniques d’intégration dans
Tokyo, Japan, October 1994. les systèmes d’information multi-source. PhD
Thesis, Laboratoire PriSM, University of Versailles,
[Chaudhuri & Dayal 96] S. Chaudhuri, A. Dayal. An France, January 1999.
Overview of Data Warehousing and OLAP
Technology. Tutorial, In Proceedings of the 22th [Kedad & Bouzeghoub 99] Z. Kedad, M.
International Conference on Very Large Data Bouzeghoub. Conception de systèmes d’information
Bases (VLDB'96), Mumbai, India, September multi-source. In Proceedings of INFORSID’99,
1996. Toulon, France, June 1999.
[Garcia-Molina et al 95] H. Garcia-Molina, Y. [Levy et al 96] A.Y. Levy, A. Rajaraman, J.J.
Papakonstantinou, D. Quass, A. Rajamaran, Y. Ordille. Query-Answering Algorithms for
Sagiv, J. Ullman, V. Vassalos, J. Widom. The Information Agents. In Proceedings of the 13th
TSIMMIS approach to mediation: Data model and National Conference on Artificial Intelligence and
Languages. In 2nd International Workshop on Next 8th Innovative Applications of Artificial
Generation Information Technologies and Systems Intelligence Conference (AAAI 96, IAAI 96),
(NGITS '95), Naharia, Israel, June 1995. Portland, Oregon, August 1996.
[Gupta & Mumick 95] A. Gupta, I.S. Mumick, [Levy et al 96a] A.Y. Levy, A. Rajaraman, J.J.
Maintenance of materialized views: Problems, Ordille. Querying Heterogeneous Information
Techniques, and applications. In IEEE Data Sources Using Source Description. In Proceedings
Engineering Bulletin, Special Issue on Materialised of the 22th International Conference on Very Large
Views and Data Warehousing, 18(2), june 1995 Data Bases (VLDB'96), Mumbai, India, September
1996.
[Hammer et al 95] J. Hammer, H. Garcia-Molina, J.
Widom, W. Labio, Y. Zhuge. The Stanford Data [Ligoudistianos et al 99] S. Ligoudistianos, T. Sellis,
Warehousing Project. IEEE Data Engineering D. Theodoratos, Y. Vassiliou. Heuristic Algorithms
Bulletin, Special Issue on Materialized Views and For Designing The Data Warehouse with SPJ
Data Warehousing, 18(2), June 1995 Views. In Proceedings of the International
Conference on Data Warehousing and Knowledge
Discovery (DAWAK'99), 1999.
M. Bouzeghoub, Z. Kedad 9 - 11
[Theodoratos & Sellis 97] D. Theodoratos, T. Sellis.
The Data Warehouse Configuration. In
Proceedings of the 23rd International Conference
on Very Large Data Bases (VLDB'97), Athens,
Greece, August 1997.
[Theodoratos & Sellis 98] D. Theodoratos, T. Sellis.
Data Warehouse Schema and Instance Design. In
Proceedings of the 17th International Conference
on Conceptual Modeling (ER'98), Singapore,
November 1998.
[Theodoratos et al 99] D. Theodoratos, S.
Ligoudistianos, T. Sellis. Designing the Global
Data Warehouse with SPJ Views. In Proceedings of
the 11th Conference on Advanced Information
Systems Engineering (CAISE'99), Heidelberg,
Germany, June 1999.
[Vassiliadis et al 99] P. Vassiliadis, M. Bouzeghoub,
C. Quix. Towards quality-oriented data warehouse
usage and evolution. Proceedings of the 11th
Conference on Advanced Information Systems
Engineering (CAiSE'99), Heidelberg, Germany,
June 1999.
[Widom 95] J. Widom. Research Problems in Data
Warehousing. In Proceedings of the 4th
International Conference on Information and
Knowledge Management (CIKM'95), Baltimore,
Maryland, November 1995.
[Wierner et al 96] J.L. Wiener, H. Gupta, W.J. Labio,
Y. Zhuge, H. Garcia-Molina, J. Widom. A System
Prototype for Warehouse View Maintenance. In
VIEWS 96
[Yang et al 97] J. Yang, K. Karlapalem, S. Li.
Algorithms for materialized view design in data
warehousing Environment. In Proceedings of the
23rd International Conference on Very Large Data
Bases (VLDB'97), Athens, Greece, August 1997.
[Zhou et al 95] G. Zhou, R. Hull, R. King, J.C.
Franchitti. Data Integration and Warehousing using
H2O. IEEE Data Engineering Bulletin, Special
Issue on Materialized Views and Data
Warehousing, 18(2), June 1995
M. Bouzeghoub, Z. Kedad 9 - 12