<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Quality-Based Framework for Physical Data Warehouse Design</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Mokrane Bouzeghoub, Zoubida Kedad Laboratoire PRiSM, Université de Versailles 45</institution>
          ,
          <addr-line>avenue des Etats-Unis 78035 Versailles Cedex</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data warehousing is a software infrastructure which supports OLAP applications by providing a collection of tools which allow data extraction and cleaning, data integration and aggregation, and data organization into multidimensional structures which are suitable for decision making. At the design level, a data warehouse is defined as a hierarchy of view expressions. One key problem of the design of a data warehouse is the selection of a set of views to materialize in order to meet the performance requirements, that is, to minimize both the evaluation cost of the users queries and the maintenance cost of the materialized expressions. Existing approaches either propose algorithms that explore the search space by enumerating and evaluating all the possible solutions, or introduce heuristics that restrict the search space. In this paper, we propose a framework to support the physical design of a data warehouse; it is based on a logical model, called the view synthesis graph. A level of overlapping is defined to characterize the graph. This level allows to order the search space. The framework integrates some quality factors that can be used either to drive the exploration of the search space, or to reduce the search space, or both.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data warehousing is a software infrastructure which
supports OLAP applications by providing a collection
of tools which (i) collect data from a set of distributed
heterogeneous sources, (ii) clean and integrate this
data into a uniform representation, (iii) aggregate and</p>
      <p>The general approach to select the best views to
materialize in a data warehouse consists in exploring
a solution space which is defined by all the plans
derived from a multi-query graph which integrates all
the input queries. Two kinds of exploring techniques
are reported in the literature: the first is an exhaustive
evaluation of all the possible materialization
scenarios [Theodoratos &amp; Sellis 97], the second
proposes heuristics to reduce the size of the search
space [Theodoratos et al 99][Yang et al
97][Ligoudistianos et al 99]. The algorithms of the
second class are obviously less expensive compared to
those of the first class, however, they do not
guarantee to derive the best solution as the search
space is arbitrarily reduced using some heuristics.
Algorithms of both classes are driven by cost
functions that formalize the compromise between the
query evaluation cost and the maintenance cost,
called the operational cost. Our general comments on
these approaches are the following:
• Both approaches search for the materialization
scenario having the minimal operational cost, without
considering whether this minimal cost coincides or
not with user needs. In practical applications, it is
often not necessary to search for the best solution but
for the solution that best fits the user requirements.
Indeed, it is not necessary to spend too much time to
find the solution with a minimal cost while an
intermediate solution whose cost fits user requirement
may be sufficient. On the other side, if the solution
having the minimal cost is far from user needs, it's
important to highlight the mismatch between what
the user requires and what the technology is able to
provide, especially when the costs of all the existing
solutions are monotonic and predictable.
• Heuristic-based approaches provide faster
algorithms to derive reasonable solutions but do not
provide any characterization of these solutions with
respect to those derived from the exhaustive search.
Besides this point, they do not provide any other
quality factor - e.g. flexibility or evolution - which
can convince that the derived solution is not useless.
• It is hard in both approaches to introduce
additional requirements in the corresponding
algorithms, such as concurrency control and security
for example, which may also influence the
operational cost.</p>
      <p>A very few research has been done in handling
quality factors in logical and physical data warehouse
design. A substantial effort has been done within the
European DWQ project where a generic model of
quality has been defined [Jarke et al 99] [Vassiliadis
et al 99] [Jeusfeld et al 98] to capture the set of
quality factors associated to the various data
warehouse objects, and to perform their measurement
in a specific quality goal. It was particularly shown
that the refreshment process [Bouzeghoub et al 99]
and the selection of the materialized views
[Bouzeghoub &amp; Theodoratos 99] are demonstrative
examples where quality can drive the design process.
Following the work presented in [Bouzeghoub &amp;
Theodoratos 99], this paper is a step further to
elaborate a general framework which allows to
integrate in a more systematic way several quality
factors into the physical design process.</p>
      <p>This paper is organized as follows: in section 2, we
present the general framework for physical design.
Section 3 gives some preliminary definitions of the
fundamental concepts that constitute the baseline of
our framework. Section 4 presents the use of this
framework for the selection of the best materialization
scenario. Section 5 presents the related works and
section 6 concludes with further research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The General Framework for Quality</title>
    </sec>
    <sec id="sec-3">
      <title>Based Physical Design in Data</title>
    </sec>
    <sec id="sec-4">
      <title>Warehousing</title>
      <p>As recalled before, algorithms for materialized view
selection are generally based on a cost function that
combines query evaluation cost and view
maintenance cost. Given a cost function, the selection
problem is posed as an exhaustive search of the best
materialization scenario, possibly driven by some
heuristics that restrict the search space under
consideration. The search space itself is generally
represented as a multi-query graph, which is the
union of all possible plans of all user views (or
queries).</p>
      <p>The idea behind our framework is to make the
materialized view selection problem as practical as
possible by providing a context within which derived
solutions can be evaluated with respect to their
relevance to user requirements in terms of quality
factors. Quality factors are considered as a good way
to handle non-functional requirements. They can play
a central role in the algorithms which select a
materialization scenario. Indeed, the complexity of
the view selection problem is such that running the
corresponding algorithms for real applications
remains a challenge. Thus, introducing quality factors
may, on the one hand, help in reducing the search
space, and on the other hand, allow to characterize
the derived solutions.</p>
      <p>We assume that users are able to specify these quality
factors in a formal way as numerical values, logical
assertions or any other mathematical formula. As an
example of requirement for the selection of
materialized views, one may give boundaries to the
operational cost, and the exploration procedure will
therefore consider only solutions which match this
requirement. Two categories of quality factors can be
distinguished:
- Those who define or constrain the cost function,
such as conformance to space limitation, upper limit
to query response time, freshness of data;
- Those who impose some desired features on the
derived materialization scenario (the selected plan),
such as evolution, maintainability or traceability.
Although the second category of quality factors is not
directly involved in the cost function, it may help in
the generation of the heuristics which can drive a
selective search, i.e., it might be a part of the
selection algorithm.</p>
      <p>The second category of quality factors constrains the
structure of the views. They are generally related to
the level of overlapping between view definitions.
This overlapping level is very important in physical
design; the higher it is, the more complex is the
selection problem. Indeed, considering a set of n user
views (or queries) whose plans do not overlap, the
complexity of the selection problem is reduced to the
complexity of the optimization of a list of n
independent queries. Except for this very specific
case, user views in data warehouse systems generally
overlap. We assume that the complexity of the
materialized views selection algorithms is not
independent of the level of overlapping since any
materialization scenario has a direct influence on the
evaluation of all user views which share the same
materialized views. Moreover, evolution of a physical
design as well as traceability and many other different
quality factors can also be correlated to the level of
overlapping of views. A part of our design framework
is consequently based on these structural features.</p>
      <p>DefinedOn</p>
      <sec id="sec-4-1">
        <title>4 Constrains</title>
      </sec>
      <sec id="sec-4-2">
        <title>ConformsTo</title>
        <p>2
3
In this section, we present the formalisms and the
definitions used in the proposed framework for data
warehouse design. We assume that the input queries
are represented by relational expressions. In the
remaining of the paper, these queries are called users
views. Our framework is based on a graph
representing the logical perception of a data
warehouse, and called the view synthesis graph.
The design process of the logical schema of a data
warehouse is out of the scope of this paper; it starts
from the representation of the different user views.
These views represent the users needs, that is, the
aggregated data corresponding to the user
requirements, computed using a given set of relations
located in the data sources. We assume that the user
views are represented by relational expressions. In a
data warehouse, the user views are not necessarily
disjoint. A given user view is represented by a set of
equivalent relational expressions corresponding to
different plans.</p>
        <p>The result of this design process is the logical
schema, called the view synthesis graph (VSG). The
design process of the VSG produces successively the
following representations:
• The multi-query graph (MQGs): we assume that
each user view is a relational expression. Given such
an expression, a set of equivalent plans is considered;
they are represented by a MQG, which is presented in
section 3.1.
• The multi-view graph (MVG): this graph
represents the integration of all the MQGs
corresponding to the set of user views. Therefore, it
represents for each user view a set of relational
expressions. The MVG is presented in section 3.2.
The properties of this graph are given in section 3.3
• The view synthesis graph (VSG): this graph is
generated from the MVG and it is such that each user
view is represented by a single expression. This graph
is presented in section 3.4, along with the definition
of the level of overlapping.</p>
        <p>One important feature of the VSG is that it shows a
set of sub-expressions which are shared by several
user views. The following sub-sections describe in
more details the different graphs and their properties.</p>
        <sec id="sec-4-2-1">
          <title>3.1. The multi-query graph</title>
          <p>Let us consider a user view Vi. This view can be
represented by a set of equivalent algebraic
expressions denoted EVi and such that:</p>
          <p>EVi = {e1Vi, e2Vi, ...., ekVi}
The algebraic expressions corresponding to the user
view Vi are denoted ekVi. They use a set of operations
R1(A, B, C) R2(C, G, E)</p>
          <p>R3(E, H)
(a)
R1(A, B, C) R2(C, G, E)
(b)</p>
          <p>R3(E, H)
such as join, project and restrict operations, or some
aggregate function. Each expression represents a
given plan for the user view Vi. In order to limit the
number of considered plans, we will consider only
those for which the project operation is the last one.
Given a user view Vi and the associated set of
equivalent expressions EVi, the multi-query graph
represents each element ejVi of the set EVi. Each
expression ejVi is either an expression without any
project operation, or an expression in which the
project operation is the last one. Note that this
restriction will reduce the number of expressions in
the MQG, but not the number of common
subexpressions. The multi-query graph does not
represent all the equivalent expressions to the initial
view, because the search of these expressions may
lead to an infinite set of expressions. This graph
represents only a sub-set of the possible equivalent
expressions which will be used as an input to generate
the multi-view graph presented in the next section.
The plans of the set EVi can be generated using the
properties of the algebraic operations, such as
distributivity or associativity. Figure 2 gives an
example of a user view V and the associated
multiquery graph.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>3.2. The multi-view graph</title>
          <p>Each user view being represented by a multi-query
graph, the multi-view graph (MVG) consists in
integrating all the MQGs into a single graph. It
therefore represents a set of equivalent expressions for
each user view. More formally, let us consider the
following notations:
V = {V1, V2, ...., Vn } the set of user views,
EVi = {e1Vi, e2Vi , ...., ekVi} the set of equivalent
expressions corresponding to the view Vi.</p>
          <p>The multi-view graph denoted MVG is such that:
MVG = {EV1, EV2, ....., EVn}
The multi-view graph is a set whose elements are sets
of equivalent expressions associated with the user
views.</p>
          <p>V1 (A, B, C, D, E, F)
V2 (A, G, H)</p>
          <p>V3 (A, B, E, F)
R1(B,D)
Source S1
R2 (A,B,C)</p>
          <p>R3 (E,F,A)
Source S2</p>
          <p>R4 (G,H,E)
Source S3
Join operation</p>
          <p>Restriction operation</p>
          <p>Projection operation
Figure 3 shows an example of the graphical
representation of an MVG, built by integrating the
MQGs corresponding to three user views V1(A, B, C,
D, E, F), V2(A, G, H) and V3(A, B, E, F). In this
example, we can notice that some sub-expressions are
used by more than one user view. An ellipse
represents these expressions.</p>
          <p>The MVG graph represents a set of equivalent
expressions for all the user views of the data
warehouse. If we consider the graphical
representation of a MVG, each node in the graph
represents either a source relation, a sub-expression,
or an expression corresponding to a user view. The
graph shows all the existing sub-expressions used by
each user view. Each sub-expression may be either
shared by several plans, or used in only one plan.</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>3.3. Properties of the multi-view graph</title>
          <p>The properties which characterize the sub-expressions
and the plans in a MVG are defined below.</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>3.3.1. Level of sharing of a sub-expression</title>
          <p>The level of sharing of a sub-expression is the
number of distinct views which uses this
subexpression; let us consider the multi-view graph
MVG corresponding to a set of user views V, and a
sub-expression in MVG denoted e. The level of
sharing of the sub-expression e in the graph MVG is
a function defined as follows:
level_ of_sharing(e)
level = 0
for each Vi in V
for each ejVi in EVi
if (e ∈ ejVi) and (Vi not marked) then
mark Vi
level = level + 1
endif
endfor
endfor
return (level)
end;
The level of sharing characterizes a sub-expression in
the multi-view graph. Each expression corresponding
to a user view in the multi-view graph can be
characterized according to the following properties.</p>
        </sec>
        <sec id="sec-4-2-5">
          <title>3.3.2. Dependency between corresponding to distinct views</title>
          <p>Let us consider two expressions eiV1 and ejV2
representing respectively the user views V1 and V2 in
a multi-view graph MVG. The two expressions eiV1
and ejV2 are said to be dependent if there exists a
subexpression e in the multi-view graph such that e is
used in both eiV1 and ejV2.
two
expressions
As a consequence, the level of sharing of the
subexpression e is such that:</p>
          <p>level_of_sharing(e) &gt; 0
In other words, two expressions representing distinct
views are dependent if they have a common
subexpression e in the multi-view graph.</p>
          <p>An expression eiVk representing a view Vk is
independent if it does not share any sub-expression
with another plan ejVl representing a distinct view Vl.
That is, the level of sharing of each sub-expression e
of eiVk equals zero. In this case, the considered plan
has no common sub-expression with any of the other
plans representing a distinct view in the multi-view
graph.</p>
        </sec>
        <sec id="sec-4-2-6">
          <title>3.3.3. Level of dependency of an expression representing a view</title>
          <p>The level of dependency of an expression ejVi
representing a user view Vi in a multi-view graph
MVG is the number of distinct user views Vl such
that there exists an expression ekVl which share a
subexpression e with ejVi. Let us consider the set of user
views V represented by the MVG; the level of
dependency of an expression ejVi is a function defined
as follows:
level_ of_dependency(eiVj)
level = 0
for each Vl not marked in V, l ≠ j
for each ekVl in EVk
for each e ∈ eiVj
if (e ∈ ekVl) then
mark Vi
level = level + 1
endif
endfor
endfor
endfor
return (level)
end;</p>
        </sec>
        <sec id="sec-4-2-7">
          <title>3.3.4. Redundant expressions</title>
          <p>An expression eVi representing the user view Vi is
redundant if there exists a distinct expression e'Vi
representing the same user view Vi and such that:
level_of_dependency(eVi) &lt; level_of_dependency(e'Vi)
In other words, if e'Vi is the expression having the
highest level of dependency, then all the expressions
representing the same view Vi and having a lower
level of dependency are said to be redundant.</p>
        </sec>
        <sec id="sec-4-2-8">
          <title>3.4. The view synthesis graph</title>
          <p>The view synthesis graph is a canonical
representation derived from the MVG. This
representation is such that each user view is
represented by exactly one plan. For each user view,
among all the plans represented by the MVG, only
one will be kept in the VSG.</p>
          <p>One possible way to built the VSG is to use the notion
of redundancy: a view synthesis graph can be defined
as a set of expressions such that each user view is
represented by exactly one expression, and in which
each plan eVi representing the user view Vi is not
redundant in the corresponding MVG. This means
that the VSG is built in order to maximize the
number of shared sub-expressions.</p>
          <p>By definition and construction, the multi-view graph
is unique. The view synthesis graph is not unique. For
a given view Vi, there may exist more than one
expression having the highest level of dependency,
and consequently, the same MVG may lead to more
than one VSG. The level of overlapping which
characterizes a VSG is defined below.</p>
          <p>Level of overlapping of a view synthesis graph
The level of overlapping of a view synthesis graph is
the maximal level of dependency of the view
expressions represented in this graph. Consider a
VSG containing a number k of user views denoted Vi,
each of them described by the expression denoted eVi.
A level of dependency denoted ni characterizes each
expression. The level of overlapping of this VSG,
denoted n(VSG) is such that:</p>
          <p>n(VSG) = Max(ni, i= 1, k)
The VSG corresponding to a given MVG and having
the highest level of overlapping must satisfy the
following conditions:
• To each user view of the data warehouse is
associated a single expression in the VSG.
• All the expressions in the VSG are not
redundant.</p>
          <p>The resulting VSG is a hierarchy of views. Besides
the user views, this hierarchy contains intermediate
views and base views defined as follows:
• Each sub-expression computed from the relations
provided by a single data source is defined as a base
view.
• A given sub-expression in the VSG is an
intermediate view if it satisfies the following
conditions:
1. this sub-expression is shared by two or more
plans in the VSG
2. this sub-expression is neither a base view, nor
a user view.</p>
          <p>Starting from the MQG given in figure 3, the VSG
which maximizes the level of overlapping is built by
keeping for each view the expression having the
highest level of dependency, that is by eliminating the
redundant expressions. The resulting VSG is given in
figure 4.</p>
          <p>V1(A,B,C,D,E,F)
V2(A,G,H)</p>
          <p>V3(A,B,E,F)
RB1(B,D)
R1(B,D)
SourceS1</p>
          <p>RB2(A,B,C,E,F)</p>
          <p>RB3(A,G,E)
R2(A,B,C) R3(E,F,A)</p>
          <p>SourceS2</p>
          <p>R4(G,H,E)
SourceS3</p>
          <p>Figure 4 : Example of a logical model
Each non-redundant plan corresponding to a user
view is represented by an algebraic tree. The
expressions RB1, RB2 and RB3 are the base views.
Each of them is computed from the data provided by a
single data source. These expressions can be viewed
at as the extractors (or wrappers) associated with each
source. The expression RB2 is an intermediate view
shared by the three user views V1, V2 and V3.
framework
for
data</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Using the warehouse design</title>
      <p>The key issue during the physical design of a data
warehouse is to find a set of views to materialize in
order to meet some user requirements. Generally,
these requirements are expressed in terms of a
maintenance cost, and an evaluation cost. The
combination of these two costs is the operational cost
of the data warehouse. The goal of the physical
design is to find the set of views to materialize in
order to minimize this operational cost.</p>
      <p>Existing approaches for the selection of the views to
materialize either perform an exhaustive exploration
of all the possible materialization scenarios, and
evaluate them to find the one that minimizes the
operational cost, or use heuristics in order to reduce
the number of scenarios to evaluate. In this section,
we show how the proposed framework can be used for
the physical design of a data warehouse. This
framework allows taking into account some cost
function parameters as well as some structural
properties reflecting some desired quality criteria.
The considered search space is partially ordered, and
may be explored using different strategies.</p>
      <p>In the following, we first present the ordering of the
search space, then the cost function parameters and
the structural properties, and finally, we illustrate the
use of the framework by an example of instantiation.</p>
      <sec id="sec-5-1">
        <title>4.1. Ordering the search space</title>
        <p>We consider that the data warehouse is represented by
a view synthesis graph. Given the set of the user
views of the data warehouse, the corresponding view
synthesis graph is not unique, as we have seen in
section 3. Besides this point, given a view synthesis
graph, several candidate sets of materialized views
may be derived, each of them representing a
materialization scenario to evaluate. If we denote by
VSGi a view synthesis graph, and MSi the sets of
materialization scenarios which may be derived from
the graph VSGi, the search space for the selection of
the materialized views denoted S is such that:
S = MS1 ∪ MS2 ∪ ....∪ MSn, where n is the number
of view synthesis graphs associated with the data
warehouse. Therefore, exploring the search space S
consists in considering every materialization scenario
for every possible view synthesis graph representing
the data warehouse.</p>
        <p>This search space can be ordered according to two
criteria:
• the level of overlapping of the view synthesis
graphs representing the data warehouse,
• the number of shared expressions in the view
synthesis graphs.</p>
        <p>Given a multi-view graph, the set of view synthesis
graphs having the level of overlapping p is denoted
Sp, and is such that: Sp = {VSGp1, VSGp2, ... ,
VSGpm},
VSGpi denotes a view synthesis graph and m is the
number of view synthesis graphs associated with the
data warehouse and having the level of overlapping p.
Each graph VSGpi is such that n(VSGpi) = p.
In the set Sp, the different representations can be
ordered by considering subsets of representations
having the same number of shared sub-expressions.
Therefore, the set Sp can be partitioned in several
subsets, considering the number of shared
subexpressions in the representations. A subset of Sp
denoted Snp contains all the representations whose
level of overlapping is p and whose number of shared
expressions is n.</p>
        <p>Figure 5 illustrates the partitions of the search space
according to the two criteria n and p. The search
space is first partitioned in several sets Sp according
to the level of overlapping. Then each set Sp is
partitioned in several sets Snp according to the
number of shared expressions in the VSG.</p>
        <p>Once the search space is ordered, it can be explored
in several ways: for example, the exploration could be
done starting from the representations having the
highest level of overlapping, that is, representations
showing the most shared expressions. These
representations could then be explored according to a
descending order of their number of shared
subexpressions. In our framework, the exploration
strategy and the selection of the scenarios may be
done according to two kinds of factors: the cost
function parameters and the structural properties
required for the representation of the data warehouse.
Both factors are presented in sections 4.2. and 4.3.
…</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>VSGnm</title>
      <p>Sn-1
VSGn-11 VSGn-12 …</p>
    </sec>
    <sec id="sec-7">
      <title>VSGn-1p</title>
      <p>… …
S0
VSG01 VSG02 … VSG0l</p>
      <sec id="sec-7-1">
        <title>4.2. The cost function parameters</title>
        <p>The evaluation of a materialization scenario is done
using a cost function which takes into account the
different factors having an impact over the overall
cost of the scenario. Finding such function is a
domain specific problem, because the factors to
consider vary from one context to another.</p>
        <p>In the existing approaches for the selection of
materialized views, the different materialization
scenarios are compared by considering their
respective operational cost, which is a combination of
a maintenance cost and an evaluation cost. In the
following, we recall the general cost function which is
widely used in the literature related to the data
warehouses physical design.</p>
        <p>Each materialized view in the data warehouse has
maintenance cost, which is the cost of propagating
the source updates in this view. The maintenance cost
of the data warehouse, denoted M, is the sum of the
maintenance costs of all the materialized views. We
denote Vmi, i ranging from 1 to n, the materialized
views of the data warehouse, f mi the maintenance
frequency for the view Vi and CVi the maintenance
cost of the view Vi; the maintenance cost of the data
warehouse is such that:</p>
        <p>M = Σ i=1, n f mi . C Vmi
The user views are evaluated against the set of
materialized views. They are denoted Vui, i ranging
from 1 to p. Each of these user views has an
evaluation cost denoted EVui and a frequency denoted f
ei. The global evaluation cost of a materialization
scenario denoted E is such that:</p>
        <p>E = Σ i=1, p f ei . EVui
The operational cost of a materialization scenario
denoted OC is a weighted sum function of the
maintenance cost M and the evaluation cost E:</p>
        <p>OC = E + wM
The best set of materialized views is the one
corresponding to the minimal operational cost OC.
The w factor is a parameter which expresses the
relative importance between the evaluation cost and
the maintenance cost. Besides this factor, other
parameters may be considered. For example, the users
or the administrators of the data warehouse may state
a value or an interval representing the acceptable
operational cost of the data warehouse. These
boundaries of the operational cost may be very useful
to confront the cost of a scenario against the
expectations of the users in terms of performance.
The exploration of the search space is costly, and it is
not necessary to search for an optimal solution which
is far beyond the actual performance needs. The
exploration may stop when a solution fitting the user
requirements is found. Another possible cost function
parameter could be a space limitation. All the cost
function parameters are used to evaluate and select
the materialization scenarios.</p>
      </sec>
      <sec id="sec-7-2">
        <title>4.3. The structural properties</title>
        <p>The structural properties are related to the
representation of the data warehouse. They express
some quality requirements specified by the user for
the materialization scenario. Let us consider for
example the traceability criteria. Given a set of user
views, one possible requirement is to trace the
intermediate results for this set in order to find the
origin of an error or a misunderstanding of some
aggregate value. For example, a set of aggregate
values computed by different user views may be
contradictory. To find the origin of the contradiction,
the materialized views of the data warehouse will be
checked in order to find the computation which gave
raise to the error. It is obvious to see that the higher is
the number of user views which use the same
materialized expressions, the less is the cost of
tracing a given error, because in this case, the number
of materialized expressions to check is low compared
to the number of expressions to consider if the users
views share very few materialized expressions.
Consequently, in order to ease the tracing process, it
would be more suitable to select the materialized
views among the set of expressions which are used by
the maximal number of user views, that is, over the
view synthesis graph having the highest level of
overlapping. The best materialization scenario will
therefore represent a compromise between the cost
parameters and the level of overlapping.</p>
        <p>In our framework, the structural properties may be
handled in the following ways:
• they may be used to constrain the selection
procedure, that is to define an exploration strategy in
the search space; in the example of the traceability
criteria, the search space may be explored
exhaustively starting from the representations having
the highest level of overlapping,
• they may be used to reduce the search space, by
selecting a sub-set of representations satisfying them,
• and finally, they may be used after the
determination of the scenario which best fits the cost
function parameters, in order to check whether this
scenario conforms to the structural properties or not.</p>
      </sec>
      <sec id="sec-7-3">
        <title>4.4. An instantiation of the general framework</title>
        <p>We will now present an example of selection of a
materialization scenario considering, additionally to
the ordering criteria of the search space, the following
factors: first, the users or the administrators of the
data warehouse have stated an interval representing
the acceptable operating cost of the data warehouse;
second, some traceability requirements have been
expressed by the users. These requirements will be
used to constrain the selection procedure.</p>
        <p>If we consider that the user requirements in terms of
performance are expressed by the interval [l1, l2],
where l1 and l2 represents the lowest and the highest
acceptable operational cost respectively, the
exploration of the ordered search space will stop if the
operational cost OC corresponding to a
materialization scenario msi is such that:</p>
        <p>l1 ≤OC(msi) ≤l2.</p>
        <p>Considering the traceability criteria, the approach for
the selection of the views to materialize will focus on
the shared sub-expressions, and choose the
materialized views in the set of sub-expressions
shared by the maximal number of views. The
exploration strategy is then to consider first
representations having the highest level of
overlapping. The instantiation of our general
framework considering these requirements is given in
figure 6.
The characterization procedure is not performed, and
the paths used in the instantiation are represented by
the continuous lines. If p is the maximal level of
overlapping which can be derived from the
multiview graph, then the corresponding set Sp contains
the sub-expressions which are shared by the maximal
number of user views. These sub-expressions will be
first considered when determining the views to
materialize. Multiple materialization scenarios can be
generated from a single graph VSGpi. Each of them
should be evaluated. After considering the graphs of a
set Sp, the graphs of the set Sp-1 are considered, and
the cost of the materialization scenarios is evaluated
for each graph VSGp-1i. This process iterates until a
solution is found within the boundaries specified by
the users.</p>
        <p>This exploration of the search space starting from the
representations having the maximal level of
overlapping is described by the following general
algorithm which uses the following notations:
• MVG is the multi-view graph describing the data
warehouse,
• Sp the set of view synthesis graph derived from
MVG and having the level of overlapping p,
• pmax the maximal level of overlapping of the
expressions in MVG,
• VSGpi a view synthesis graph whose level of
overlapping is p,
• [l1, l2] the interval representing the user
requirements, that is the interval in which the
operational cost varies.
• MSp the set of materialization scenarios derived
from the view synthesis graphs whose level of
overlapping is p,
• msi a materialization scenario,
• OC(msi) the operational cost of the
materialization scenario msi.</p>
        <p>Exploration procedure
p = pmax ;
Generate Sp from MVG;
While (solution not found)
If Sp ≠ ∅ Then</p>
        <p>For each VSGpi in Sp</p>
        <p>Generate MSp
For each msi in MSp</p>
        <p>If l1 ≤OC(msi) ≤l2
Then (msi) is an acceptable solution;</p>
        <p>Endif</p>
        <p>Endfor</p>
        <p>Endfor
Endif
p = p-1;
Generate Sp from MVG;
Endwhile
End
We did not present in this paper the details of the
selection of the candidate sets of views to materialize;
each set of candidate materialized views should allow
to compute all the user views, that is, each user view
can be rewritten using the set of materialized views.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>5. Related works</title>
      <p>Several approaches have been proposed for the
physical design of a data warehouse. The problem
addressed by these approaches is to find a set of views
to materialize in order to meet the performance
requirements. These performance requirements are
expressed as an operational cost, which is a
combination of an evaluation cost and a maintenance
cost. Some of the approaches dealing with this design
problem are described in [Baralis et al 97],
[Theodoratos &amp; Sellis 97], [Theodoratos &amp; Sellis 98],
[Theodoratos et al 99], [Ligoudistianos et al 99] and
[Yang et al 97].</p>
      <p>Two kinds of solutions are generally proposed for the
selection of the views to materialize in a data
warehouse, either solutions consisting in an
exhaustive exploration of the possible materialization
solutions to find to optimal one, or solutions based on
some heuristics which reduce the exploration. In
[Yang et al 97], two algorithms for selecting the
views to materialize are proposed; the first one is an
exhaustive exploration of all the possible plans of the
users queries to find an optimal multiple view
processing plan, the second one is an heuristic
algorithm which considers only the optimal plan for
each user query.</p>
      <p>In [Baralis et al 97], the proposed algorithm
addresses the materialized view selection problem in
the context of multidimensional databases.
[Theodoratos &amp; Sellis 97] and [Theodoratos &amp; Sellis
98] present the view selection problem as a state
space problem. In [Theodoratos et al 99], a pruning
algorithm and a greedy algorithm are presented to
select the set of views to materialize. But the
heuristics used do not guarantee to find the optimal
solution. [Ligoudistianos et al 99] proposes an A*
algorithm which delivers the optimal solution, and
presents some heuristics to prune the considered state
space.</p>
      <p>In this paper, rather than a new approach for the
selection of the best set of materialized views, we
tried to define the main features of a framework for
data warehouse physical design. This physical design
is done on a search space consisting of a set of
representations of the data warehouse, considering
the level of overlapping and the number of shared
expressions which characterize each representation.
This level allows to order the search space, and the
exploration is done according to a strategy which may
be driven by the user requirements. These
requirements may be either some cost function
parameters or some structural properties. Existing
approaches introduce cost function parameters such
as space limitation, but they do not integrate the
structural properties which express some quality
factors such as evolutivity, security or traceability.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Conclusion</title>
      <p>In this paper we have presented an extensible
framework for data warehouse physical design. This
framework is based on a logical schema, which is a
canonical representation of the data warehouse views.
This logical model is built from subsets of the
possible plans for each user view, which are
integrated in an intermediate representation called the
multi-view graph. The logical model, called the view
synthesis graph (VSG), is derived from the
multiview graph. A level of overlapping characterizes this
model. It represents the maximal number of views
sharing the same expression in the graph.</p>
      <p>The framework for physical design uses the VSG as a
representation of the views of a data warehouse. Since
this VSG is not unique, the search space for the
selection of the views to materialize is composed of
all the scenarios derived from all the possible VSG.
This search space can be ordered considering the
level of overlapping and the number of shared
expressions of the representations from which the
scenarios are derived.</p>
      <p>This ordered search space can be explored in different
ways; one possible exploration strategy consists in
giving priority to the most shared sub-expressions
during the selection of the views to materialize, and
therefore exploring first the representations having
the highest level of overlapping. The exploration is
guided by some users requirements, which can either
be related to the cost function, such as the upper an
lower bounds representing acceptable operational
cost, or related to some structural properties of the
logical model, such as evolutivity.</p>
      <p>Further works will consist in identifying sets of
quality factors related to the physical design of the
data warehouses, and study the impact of these factors
on both the selection and the characterization
procedures of the proposed framework.
9 - 10
[Bouzeghoub et al 99] Bouzeghoub M., Fabret F.,
Matulovic M., Modeling the Data Warehouse
Refreshment Process as a Workflow Application.
Proceedings of the International Workshop on
Design and Management of Data Warehouses
(DMDW'99), Heidelberg, Germany, June 1999.
[Hull &amp; Zhou 96] R. Hull, G. Zhou. A Framework for
Supporting Data Integration Using the Materialized
and Virtual Approches. In Proceedings of the ACM
SIGMOD International Conference on
Management of Data, Montreal, Canada, June
1996.
[Bouzeghoub &amp; Theodoratos 99] Bouzeghoub M.,
Theodoratos D. Data Currency Quality Factors in
Data Warehouse Design. Proceedings of the
International Workshop on Design and
Management of Data Warehouses (DMDW'99),
Heidelberg, Germany, June 1999.
[Calvanese et al 98] D. Calvanese, G. De Giacomo,
M. Lenzerini, D. Nardi, R. Rosati. Source
Integration integration : conceptual modeling and
reasoning support. In Proceedings of the 3rd IFCIS
International Conference on Cooperative
Information Systems (CoopIS’98), New York City,
USA, August 1998.
[Chawathe et al 94] S. Chawathe, H. Garcia-Molina,
J. Hammer, K. Ireland, Y. Papakonstantinou, J.
Ullman, J. Widom. The TSIMMIS project :
Integration of Hererogeneous Information Sources.
In Proceedings of the 10th Meeting of the
Information Processing Society of Japan (IPSJ'94),
Tokyo, Japan, October 1994.
[Chaudhuri &amp; Dayal 96] S. Chaudhuri, A. Dayal. An
Overview of Data Warehousing and OLAP
Technology. Tutorial, In Proceedings of the 22th
International Conference on Very Large Data
Bases (VLDB'96), Mumbai, India, September
1996.
[Garcia-Molina et al 95] H. Garcia-Molina, Y.</p>
      <p>Papakonstantinou, D. Quass, A. Rajamaran, Y.
Sagiv, J. Ullman, V. Vassalos, J. Widom. The
TSIMMIS approach to mediation: Data model and
Languages. In 2nd International Workshop on Next
Generation Information Technologies and Systems
(NGITS '95), Naharia, Israel, June 1995.
[Gupta &amp; Mumick 95] A. Gupta, I.S. Mumick,
Maintenance of materialized views: Problems,
Techniques, and applications. In IEEE Data
Engineering Bulletin, Special Issue on Materialised
Views and Data Warehousing, 18(2), june 1995
[Hammer et al 95] J. Hammer, H. Garcia-Molina, J.</p>
      <p>Widom, W. Labio, Y. Zhuge. The Stanford Data
Warehousing Project. IEEE Data Engineering
Bulletin, Special Issue on Materialized Views and
Data Warehousing, 18(2), June 1995</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Baralis et al 97]
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paraboschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Teniente</surname>
          </string-name>
          .
          <article-title>Materialized View Selection in a Multidimensional Environment</article-title>
          .
          <source>In Proc. of the 23rd International Conference on Very Large Data Bases (VLDB'97)</source>
          , Athens, Greece,
          <year>August 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Jarke et al 97]
          <string-name>
            <surname>M. Jarke</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Vassiliou</surname>
          </string-name>
          .
          <article-title>Foundation of Data Warehouse Quality: an Overview of the DWQ project</article-title>
          .
          <source>In Proceedings of the 2nd Internat. Conf. on Information Quality</source>
          , Cambridge, Mass,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Jarke et al 99]
          <string-name>
            <surname>M. Jarke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Vassiliou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Vassiliadis</surname>
          </string-name>
          . (edits).
          <source>Fundamentals of Data Warehouses</source>
          . Springer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Jarke et al 99a]
          <string-name>
            <surname>Jarke</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeusfeld</surname>
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quix</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vassiliadis</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>"Architecture and Quality in Data Warehouses"</article-title>
          ,
          <source>Proceeding of the 10th International Conference on Advanced Information Systems Engineering (CAiSE'98)</source>
          , Pisa, Italy,
          <year>June 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Jeusfeld et al 98]
          <string-name>
            <surname>Jeusfeld</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quix</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jarke</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <article-title>"Design and Analysis of Quality Information for Data Warehouses"</article-title>
          ,
          <source>Proceedings of the 17th Internat. Conf. on Conceptual Modeling (ER'98)</source>
          , Singapore, november
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Kedad 99]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kedad</surname>
          </string-name>
          .
          <article-title>Techniques d'intégration dans les systèmes d'information multi-source</article-title>
          .
          <source>PhD Thesis</source>
          , Laboratoire PriSM, University of Versailles, France,
          <year>January 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Kedad &amp; Bouzeghoub 99]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kedad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouzeghoub</surname>
          </string-name>
          .
          <article-title>Conception de systèmes d'information multi-source</article-title>
          .
          <source>In Proceedings of INFORSID'99</source>
          ,
          <string-name>
            <surname>Toulon</surname>
          </string-name>
          , France,
          <year>June 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Levy et al 96]
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.J.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Ordille</surname>
          </string-name>
          .
          <article-title>Query-Answering Algorithms for Information Agents</article-title>
          .
          <source>In Proceedings of the 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference (AAAI 96, IAAI 96)</source>
          , Portland, Oregon,
          <year>August 1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Levy et al 96a]
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.J.</given-names>
            <surname>Ordille</surname>
          </string-name>
          .
          <article-title>Querying Heterogeneous Information Sources Using Source Description</article-title>
          .
          <source>In Proceedings of the 22th International Conference on Very Large Data Bases (VLDB'96)</source>
          , Mumbai, India,
          <year>September 1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Ligoudistianos et al 99]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ligoudistianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Theodoratos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Vassiliou</surname>
          </string-name>
          .
          <article-title>Heuristic Algorithms For Designing The Data Warehouse with SPJ Views</article-title>
          .
          <source>In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DAWAK'99)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Theodoratos &amp; Sellis 97]
          <string-name>
            <given-names>D.</given-names>
            <surname>Theodoratos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>The Data Warehouse Configuration</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97)</source>
          , Athens, Greece,
          <year>August 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Theodoratos &amp; Sellis 98]
          <string-name>
            <given-names>D.</given-names>
            <surname>Theodoratos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Data Warehouse Schema and Instance Design</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Conceptual Modeling (ER'98)</source>
          , Singapore,
          <year>November 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Theodoratos et al 99]
          <string-name>
            <given-names>D.</given-names>
            <surname>Theodoratos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ligoudistianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Designing the Global Data Warehouse with SPJ Views</article-title>
          .
          <source>In Proceedings of the 11th Conference on Advanced Information Systems Engineering (CAISE'99)</source>
          , Heidelberg, Germany,
          <year>June 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Vassiliadis et al 99]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouzeghoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          .
          <article-title>Towards quality-oriented data warehouse usage and evolution</article-title>
          .
          <source>Proceedings of the 11th Conference on Advanced Information Systems Engineering (CAiSE'99)</source>
          , Heidelberg, Germany,
          <year>June 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Widom 95]
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          . Research Problems in Data Warehousing.
          <source>In Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM'95)</source>
          , Baltimore, Maryland,
          <year>November 1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Wierner et al 96]
          <string-name>
            <surname>J.L. Wiener</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          <string-name>
            <surname>Labio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhuge</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>A System Prototype for Warehouse View Maintenance</article-title>
          .
          <source>In VIEWS 96</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Yang et al 97]
          <string-name>
            <surname>J. Yang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Karlapalem</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Algorithms for materialized view design in data warehousing Environment</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97)</source>
          , Athens, Greece,
          <year>August 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>[Zhou</surname>
            et al 95]
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hull</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>King</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          <string-name>
            <surname>Franchitti</surname>
          </string-name>
          .
          <article-title>Data Integration and Warehousing using H2O</article-title>
          . IEEE Data Engineering Bulletin,
          <source>Special Issue on Materialized Views and Data Warehousing</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ), June 1995
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>