=Paper= {{Paper |id=Vol-28/paper-9 |storemode=property |title=A quality-based framework for physical data warehouse design |pdfUrl=https://ceur-ws.org/Vol-28/paper9.pdf |volume=Vol-28 |dblpUrl=https://dblp.org/rec/conf/dmdw/BouzeghoubK00 }} ==A quality-based framework for physical data warehouse design== https://ceur-ws.org/Vol-28/paper9.pdf
                                      A Quality-Based Framework for Physical
                                              Data Warehouse Design


                                                      Mokrane Bouzeghoub, Zoubida Kedad
                                                   Laboratoire PRiSM, Université de Versailles
                                                            45, avenue des Etats-Unis
                                                         78035 Versailles Cedex, France
                                         Mokrane.Bouzeghoub@prism.uvsq.fr, Zoubida.Kedad@prism.uvsq.fr

                                        Abstract                             organize this data into multidimensional structures
                                                                             which are suitable for decision making, and (iv)
       Data warehousing is a software infrastructure                         refresh it periodically to maintain the data up to date
       which supports OLAP applications by                                   and accurate. Many problems related with this
       providing a collection of tools which allow                           approach have been addressed, such as data
       data extraction and cleaning, data integration                        extraction and cleaning, selection of the best views to
       and aggregation, and data organization into                           materialize,       update       propagation,       and
       multidimensional structures which are                                 multidimensional representation and manipulation of
       suitable for decision making. At the design                           data.
       level, a data warehouse is defined as a
       hierarchy of view expressions. One key                                View materialization is a physical design technique
       problem of the design of a data warehouse is                          which has been introduced to optimize database
       the selection of a set of views to materialize                        queries by making persistent some intermediate
       in order to meet the performance                                      results, possibly shared by different queries [Gupta &
       requirements, that is, to minimize both the                           Mumick 95]. It is seen as a caching technique which
       evaluation cost of the users queries and the                          contributes to optimize queries. Additionally, in the
       maintenance cost of the materialized                                  data warehousing context, view materialization
       expressions. Existing approaches either                               avoids overloading sources that supply operational
       propose algorithms that explore the search                            data, and therefore makes data warehouse
       space by enumerating and evaluating all the                           applications partially and momentarily independent
       possible solutions, or introduce heuristics that                      of the data sources. This introduces two
       restrict the search space. In this paper, we                          complementary constraints: (i) the first requires that
       propose a framework to support the physical                           all OLAP queries must be evaluated using only the
       design of a data warehouse; it is based on a                          materialized views, (ii) the second requires that
       logical model, called the view synthesis                              materialized views must be updated periodically with
       graph. A level of overlapping is defined to                           respect to data changes OLAP applications want to
       characterize the graph. This level allows to                          see, and to a level of freshness required for this data.
       order the search space. The framework                                 Consequently, one key issue during the physical
       integrates some quality factors that can be                           design of a data warehouse is the selection of the set
       used either to drive the exploration of the                           of views to materialize in order to optimize both the
       search space, or to reduce the search space,                          query evaluation and the maintenance cost.
       or both.
                                                                             The general approach to select the best views to
1. Introduction                                                              materialize in a data warehouse consists in exploring
                                                                             a solution space which is defined by all the plans
Data warehousing is a software infrastructure which                          derived from a multi-query graph which integrates all
supports OLAP applications by providing a collection                         the input queries. Two kinds of exploring techniques
of tools which (i) collect data from a set of distributed                    are reported in the literature: the first is an exhaustive
heterogeneous sources, (ii) clean and integrate this                         evaluation of all the possible materialization
data into a uniform representation, (iii) aggregate and                      scenarios [Theodoratos & Sellis 97], the second
                                                                             proposes heuristics to reduce the size of the search
                                                                             space [Theodoratos et al 99][Yang et al
The copyright of this paper belongs to the paper’s authors. Permission to
copy without fee all or part of this material is granted provided that the
                                                                             97][Ligoudistianos et al 99]. The algorithms of the
copies are not made or distributed for direct commercial advantage.          second class are obviously less expensive compared to
Proceedings of the International Workshop on Design                          those of the first class, however, they do not
and Management of Data Warehouses (DMDW'2000)                                guarantee to derive the best solution as the search
Stockholm, Sweden, June 5-6, 2000                                            space is arbitrarily reduced using some heuristics.
(M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/



M. Bouzeghoub, Z. Kedad                                                                                                           9-1
Algorithms of both classes are driven by cost             framework for the selection of the best materialization
functions that formalize the compromise between the       scenario. Section 5 presents the related works and
query evaluation cost and the maintenance cost,           section 6 concludes with further research.
called the operational cost. Our general comments on
these approaches are the following:                       2. The General Framework for Quality-
                                                          Based   Physical  Design   in  Data
• Both approaches search for the materialization
                                                          Warehousing
scenario having the minimal operational cost, without
considering whether this minimal cost coincides or        As recalled before, algorithms for materialized view
not with user needs. In practical applications, it is     selection are generally based on a cost function that
often not necessary to search for the best solution but   combines query evaluation cost and view
for the solution that best fits the user requirements.    maintenance cost. Given a cost function, the selection
Indeed, it is not necessary to spend too much time to     problem is posed as an exhaustive search of the best
find the solution with a minimal cost while an            materialization scenario, possibly driven by some
intermediate solution whose cost fits user requirement    heuristics that restrict the search space under
may be sufficient. On the other side, if the solution     consideration. The search space itself is generally
having the minimal cost is far from user needs, it's      represented as a multi-query graph, which is the
important to highlight the mismatch between what          union of all possible plans of all user views (or
the user requires and what the technology is able to      queries).
provide, especially when the costs of all the existing
solutions are monotonic and predictable.                  The idea behind our framework is to make the
                                                          materialized view selection problem as practical as
• Heuristic-based approaches provide faster               possible by providing a context within which derived
algorithms to derive reasonable solutions but do not      solutions can be evaluated with respect to their
provide any characterization of these solutions with      relevance to user requirements in terms of quality
respect to those derived from the exhaustive search.      factors. Quality factors are considered as a good way
Besides this point, they do not provide any other         to handle non-functional requirements. They can play
quality factor - e.g. flexibility or evolution - which    a central role in the algorithms which select a
can convince that the derived solution is not useless.    materialization scenario. Indeed, the complexity of
                                                          the view selection problem is such that running the
• It is hard in both approaches to introduce              corresponding algorithms for real applications
additional requirements in the corresponding              remains a challenge. Thus, introducing quality factors
algorithms, such as concurrency control and security      may, on the one hand, help in reducing the search
for example, which may also influence the                 space, and on the other hand, allow to characterize
operational cost.                                         the derived solutions.

A very few research has been done in handling             We assume that users are able to specify these quality
quality factors in logical and physical data warehouse    factors in a formal way as numerical values, logical
design. A substantial effort has been done within the     assertions or any other mathematical formula. As an
European DWQ project where a generic model of             example of requirement for the selection of
quality has been defined [Jarke et al 99] [Vassiliadis    materialized views, one may give boundaries to the
et al 99] [Jeusfeld et al 98] to capture the set of       operational cost, and the exploration procedure will
quality factors associated to the various data            therefore consider only solutions which match this
warehouse objects, and to perform their measurement       requirement. Two categories of quality factors can be
in a specific quality goal. It was particularly shown     distinguished:
that the refreshment process [Bouzeghoub et al 99]        - Those who define or constrain the cost function,
and the selection of the materialized views               such as conformance to space limitation, upper limit
[Bouzeghoub & Theodoratos 99] are demonstrative           to query response time, freshness of data;
examples where quality can drive the design process.      - Those who impose some desired features on the
Following the work presented in [Bouzeghoub &             derived materialization scenario (the selected plan),
Theodoratos 99], this paper is a step further to          such as evolution, maintainability or traceability.
elaborate a general framework which allows to
integrate in a more systematic way several quality        Although the second category of quality factors is not
factors into the physical design process.                 directly involved in the cost function, it may help in
                                                          the generation of the heuristics which can drive a
This paper is organized as follows: in section 2, we      selective search, i.e., it might be a part of the
present the general framework for physical design.        selection algorithm.
Section 3 gives some preliminary definitions of the
fundamental concepts that constitute the baseline of      The second category of quality factors constrains the
our framework. Section 4 presents the use of this         structure of the views. They are generally related to


M. Bouzeghoub, Z. Kedad                                                                                     9-2
the level of overlapping between view definitions.                                            3. Preliminary definitions
This overlapping level is very important in physical
design; the higher it is, the more complex is the                                             In this section, we present the formalisms and the
selection problem. Indeed, considering a set of n user                                        definitions used in the proposed framework for data
views (or queries) whose plans do not overlap, the                                            warehouse design. We assume that the input queries
complexity of the selection problem is reduced to the                                         are represented by relational expressions. In the
complexity of the optimization of a list of n                                                 remaining of the paper, these queries are called users
independent queries. Except for this very specific                                            views. Our framework is based on a graph
case, user views in data warehouse systems generally                                          representing the logical perception of a data
overlap. We assume that the complexity of the                                                 warehouse, and called the view synthesis graph.
materialized views selection algorithms is not
independent of the level of overlapping since any                                             The design process of the logical schema of a data
materialization scenario has a direct influence on the                                        warehouse is out of the scope of this paper; it starts
evaluation of all user views which share the same                                             from the representation of the different user views.
materialized views. Moreover, evolution of a physical                                         These views represent the users needs, that is, the
design as well as traceability and many other different                                       aggregated data corresponding to the user
quality factors can also be correlated to the level of                                        requirements, computed using a given set of relations
overlapping of views. A part of our design framework                                          located in the data sources. We assume that the user
is consequently based on these structural features.                                           views are represented by relational expressions. In a
                                                                                              data warehouse, the user views are not necessarily
Figure 1 shows a simplified view of the framework,                                            disjoint. A given user view is represented by a set of
which uses a class of quality factors which may be                                            equivalent relational expressions corresponding to
defined on a given search space. This class can be                                            different plans.
divided in two subclasses, the first one concerning the
quality factors which constrain the cost function, and                                        The result of this design process is the logical
the second concerning those which characterize the                                            schema, called the view synthesis graph (VSG). The
structural aspect of a materialization scenario. The                                          design process of the VSG produces successively the
quality factors of the latter subclass may be used to                                         following representations:
generate heuristics which restrict the search space or
                                                                                              •    The multi-query graph (MQGs): we assume that
which allow to qualify the selected solutions. They
can also be used to select a representation which                                             each user view is a relational expression. Given such
satisfies some level of overlapping. This                                                     an expression, a set of equivalent plans is considered;
representation will then be used as a restricted search                                       they are represented by a MQG, which is presented in
space on which the selection procedure is performed.                                          section 3.1.
                                                                                              • The multi-view graph (MVG): this graph
Section 4 presents the usage of this framework
through an example.                                                                           represents the integration of all the MQGs
                                                                                              corresponding to the set of user views. Therefore, it
                                                                                              represents for each user view a set of relational
                                                                                              expressions. The MVG is presented in section 3.2.
                              SEARCH SPACE                                                    The properties of this graph are given in section 3.3
                                                                                              • The view synthesis graph (VSG): this graph is
                                      DefinedOn                                               generated from the MVG and it is such that each user
                                                                                              view is represented by a single expression. This graph
Inputs                                                                               Inputs
                              QUALITY FACTORS                                                 is presented in section 3.4, along with the definition
                                                                                              of the level of overlapping.
              SpecializedAs                                SpecializedAs


                                                                                              One important feature of the VSG is that it shows a
         COST FUNCTION                                       STRUCTURAL                       set of sub-expressions which are shared by several
          PARAMETERS                                          PROPERTIES                      user views. The following sub-sections describe in
                                               ain
                                                  s                                           more details the different graphs and their properties.
                                           nstr                        Inputs
           Inputs                        Co
                                     4                o
                                                    sT
                                                  rm
                                              onfo                                            3.1. The multi-query graph
          SELECTION                          C             CHARACTERIZATION
          PROCEDURE                      2                    PROCEDURE
                                                                                              Let us consider a user view Vi. This view can be
          Outputs                             3                            Outputs
                                                                                              represented by a set of equivalent algebraic
                                                  Inpu
                                                      ts
                                                                                              expressions denoted EVi and such that:
            SELECTED                                          SELECTED                                      EVi = {e1Vi, e2Vi, ...., ekVi}
            SCENARIOS             1 BelongsTo              REPRESENTATION
                                                                                              The algebraic expressions corresponding to the user
                                                                                              view Vi are denoted ekVi. They use a set of operations
            Figure 1: The general framework for
                    quality based design


M. Bouzeghoub, Z. Kedad                                                                                                                         9-3
                                                                                                                    V(A,G,H)
                            V(A,G,H)




               R1(A, B, C) R2(C, G, E)         R3(E, H)                                           R1(A, B, C) R2(C, G, E)         R3(E, H)
                                               (a)                                                  (b)

                           Figure 2 : The multi-query graph associated with a user view

such as join, project and restrict operations, or some                      distributivity or associativity. Figure 2 gives an
aggregate function. Each expression represents a                            example of a user view V and the associated multi-
given plan for the user view Vi. In order to limit the                      query graph.
number of considered plans, we will consider only
those for which the project operation is the last one.                      3.2. The multi-view graph
                                                                            Each user view being represented by a multi-query
Given a user view Vi and the associated set of
                                                                            graph, the multi-view graph (MVG) consists in
equivalent expressions EVi, the multi-query graph
                                                                            integrating all the MQGs into a single graph. It
represents each element ejVi of the set EVi. Each
                                                                            therefore represents a set of equivalent expressions for
expression ejVi is either an expression without any
                                                                            each user view. More formally, let us consider the
project operation, or an expression in which the
                                                                            following notations:
project operation is the last one. Note that this
                                                                            V = {V1, V2, ...., Vn } the set of user views,
restriction will reduce the number of expressions in
                                                                            EVi = {e1Vi, e2Vi , ...., ekVi} the set of equivalent
the MQG, but not the number of common sub-
                                                                            expressions corresponding to the view Vi.
expressions. The multi-query graph does not
                                                                            The multi-view graph denoted MVG is such that:
represent all the equivalent expressions to the initial
                                                                            MVG = {EV1, EV2, ....., EVn}
view, because the search of these expressions may
lead to an infinite set of expressions. This graph
                                                                            The multi-view graph is a set whose elements are sets
represents only a sub-set of the possible equivalent
                                                                            of equivalent expressions associated with the user
expressions which will be used as an input to generate
                                                                            views.
the multi-view graph presented in the next section.
The plans of the set EVi can be generated using the
properties of the algebraic operations, such as

                                     V1 (A, B, C, D, E, F)                            V2 (A, G, H)

                                                                                                                V3 (A, B, E, F)




                         R1(B,D)                                      R2 (A,B,C)      R3 (E,F,A)           R4 (G,H,E)

                         Source S1                                               Source S2                 Source S3

                                       Join operation        Restriction operation           Projection operation



                                     Figure 3 : Example of a multi-view graph (MVG)


M. Bouzeghoub, Z. Kedad                                                                                                                      9-4
Figure 3 shows an example of the graphical                 As a consequence, the level of sharing of the sub-
representation of an MVG, built by integrating the         expression e is such that:
MQGs corresponding to three user views V1(A, B, C,                         level_of_sharing(e) > 0
D, E, F), V2(A, G, H) and V3(A, B, E, F). In this          In other words, two expressions representing distinct
example, we can notice that some sub-expressions are       views are dependent if they have a common sub-
used by more than one user view. An ellipse                expression e in the multi-view graph.
represents these expressions.
                                                           An expression eiVk representing a view Vk is
The MVG graph represents a set of equivalent               independent if it does not share any sub-expression
expressions for all the user views of the data             with another plan ejVl representing a distinct view Vl.
warehouse. If we consider the graphical                    That is, the level of sharing of each sub-expression e
representation of a MVG, each node in the graph            of eiVk equals zero. In this case, the considered plan
represents either a source relation, a sub-expression,     has no common sub-expression with any of the other
or an expression corresponding to a user view. The         plans representing a distinct view in the multi-view
graph shows all the existing sub-expressions used by       graph.
each user view. Each sub-expression may be either
shared by several plans, or used in only one plan.         3.3.3. Level of dependency of an expression
                                                           representing a view
3.3. Properties of the multi-view graph                    The level of dependency of an expression ejVi
The properties which characterize the sub-expressions      representing a user view Vi in a multi-view graph
and the plans in a MVG are defined below.                  MVG is the number of distinct user views Vl such
                                                           that there exists an expression ekVl which share a sub-
3.3.1. Level of sharing of a sub-expression                expression e with ejVi. Let us consider the set of user
                                                           views V represented by the MVG; the level of
The level of sharing of a sub-expression is the
                                                           dependency of an expression ejVi is a function defined
number of distinct views which uses this sub-
                                                           as follows:
expression; let us consider the multi-view graph
MVG corresponding to a set of user views V, and a
                                                              level_ of_dependency(eiVj)
sub-expression in MVG denoted e. The level of
sharing of the sub-expression e in the graph MVG is
                                                              level = 0
a function defined as follows:
                                                              for each Vl not marked in V, l ≠ j
   level_ of_sharing(e)                                           for each ekVl in EVk
                                                                      for each e ∈ eiVj
   level = 0                                                              if (e ∈ ekVl) then
   for each Vi in V                                                           mark Vi
       for each ejVi in EVi                                                   level = level + 1
           if (e ∈ ejVi) and (Vi not marked) then                         endif
               mark Vi                                                endfor
               level = level + 1                                  endfor
           endif                                              endfor
       endfor                                                 return (level)
   endfor                                                     end;
   return (level)
   end;                                                    3.3.4. Redundant expressions
                                                           An expression eVi representing the user view Vi is
The level of sharing characterizes a sub-expression in     redundant if there exists a distinct expression e'Vi
the multi-view graph. Each expression corresponding        representing the same user view Vi and such that:
to a user view in the multi-view graph can be              level_of_dependency(eVi) < level_of_dependency(e'Vi)
characterized according to the following properties.       In other words, if e'Vi is the expression having the
                                                           highest level of dependency, then all the expressions
3.3.2. Dependency between           two    expressions     representing the same view Vi and having a lower
corresponding to distinct views                            level of dependency are said to be redundant.
Let us consider two expressions eiV1 and ejV2
representing respectively the user views V1 and V2 in      3.4. The view synthesis graph
a multi-view graph MVG. The two expressions eiV1           The view synthesis graph is a canonical
and ejV2 are said to be dependent if there exists a sub-   representation derived from the MVG. This
expression e in the multi-view graph such that e is        representation is such that each user view is
used in both eiV1 and ejV2.                                represented by exactly one plan. For each user view,


M. Bouzeghoub, Z. Kedad                                                                                      9-5
among all the plans represented by the MVG, only         keeping for each view the expression having the
one will be kept in the VSG.                             highest level of dependency, that is by eliminating the
                                                         redundant expressions. The resulting VSG is given in
One possible way to built the VSG is to use the notion   figure 4.
of redundancy: a view synthesis graph can be defined
as a set of expressions such that each user view is                    V1(A,B,C,D,E,F)                    V2(A,G,H)
represented by exactly one expression, and in which
each plan eVi representing the user view Vi is not                                                                        V3(A,B,E,F)
redundant in the corresponding MVG. This means
that the VSG is built in order to maximize the
number of shared sub-expressions.

By definition and construction, the multi-view graph
is unique. The view synthesis graph is not unique. For
a given view Vi, there may exist more than one
expression having the highest level of dependency,
and consequently, the same MVG may lead to more
                                                                                         RB2(A,B,C,E,F)
than one VSG. The level of overlapping which
                                                                                                                       RB3(A,G,E)
characterizes a VSG is defined below.                     RB1(B,D)

Level of overlapping of a view synthesis graph
                                                          R1(B,D)                          R2(A,B,C)       R3(E,F,A)    R4(G,H,E)
The level of overlapping of a view synthesis graph is     Source S1                                Source S2             Source S3
the maximal level of dependency of the view
expressions represented in this graph. Consider a                     Figure 4 : Example of a logical model
VSG containing a number k of user views denoted Vi,
each of them described by the expression denoted eVi.    Each non-redundant plan corresponding to a user
A level of dependency denoted ni characterizes each      view is represented by an algebraic tree. The
expression. The level of overlapping of this VSG,        expressions RB1, RB2 and RB3 are the base views.
denoted n(VSG) is such that:                             Each of them is computed from the data provided by a
                                                         single data source. These expressions can be viewed
             n(VSG) = Max(ni, i= 1, k)                   at as the extractors (or wrappers) associated with each
                                                         source. The expression RB2 is an intermediate view
The VSG corresponding to a given MVG and having          shared by the three user views V1, V2 and V3.
the highest level of overlapping must satisfy the
following conditions:                                    4. Using the framework                                 for        data
                                                         warehouse design
•   To each user view of the data warehouse is
associated a single expression in the VSG.               The key issue during the physical design of a data
• All the expressions in the VSG are not                 warehouse is to find a set of views to materialize in
redundant.                                               order to meet some user requirements. Generally,
                                                         these requirements are expressed in terms of a
The resulting VSG is a hierarchy of views. Besides       maintenance cost, and an evaluation cost. The
the user views, this hierarchy contains intermediate     combination of these two costs is the operational cost
views and base views defined as follows:                 of the data warehouse. The goal of the physical
                                                         design is to find the set of views to materialize in
•    Each sub-expression computed from the relations     order to minimize this operational cost.
provided by a single data source is defined as a base
view.                                                    Existing approaches for the selection of the views to
• A given sub-expression in the VSG is an                materialize either perform an exhaustive exploration
intermediate view if it satisfies the following          of all the possible materialization scenarios, and
conditions:                                              evaluate them to find the one that minimizes the
    1. this sub-expression is shared by two or more      operational cost, or use heuristics in order to reduce
        plans in the VSG                                 the number of scenarios to evaluate. In this section,
    2. this sub-expression is neither a base view, nor   we show how the proposed framework can be used for
        a user view.                                     the physical design of a data warehouse. This
                                                         framework allows taking into account some cost
Starting from the MQG given in figure 3, the VSG         function parameters as well as some structural
which maximizes the level of overlapping is built by     properties reflecting some desired quality criteria.



M. Bouzeghoub, Z. Kedad                                                                                                     9-6
The considered search space is partially ordered, and             VSGpi denotes a view synthesis graph and m is the
may be explored using different strategies.                       number of view synthesis graphs associated with the
                                                                  data warehouse and having the level of overlapping p.
In the following, we first present the ordering of the            Each graph VSGpi is such that n(VSGpi) = p.
search space, then the cost function parameters and
the structural properties, and finally, we illustrate the         In the set Sp, the different representations can be
use of the framework by an example of instantiation.              ordered by considering subsets of representations
                                                                  having the same number of shared sub-expressions.
4.1. Ordering the search space                                    Therefore, the set Sp can be partitioned in several
We consider that the data warehouse is represented by             subsets, considering the number of shared sub-
a view synthesis graph. Given the set of the user                 expressions in the representations. A subset of Sp
views of the data warehouse, the corresponding view               denoted Snp contains all the representations whose
synthesis graph is not unique, as we have seen in                 level of overlapping is p and whose number of shared
section 3. Besides this point, given a view synthesis             expressions is n.
graph, several candidate sets of materialized views
may be derived, each of them representing a                       Figure 5 illustrates the partitions of the search space
materialization scenario to evaluate. If we denote by             according to the two criteria n and p. The search
VSGi a view synthesis graph, and MSi the sets of                  space is first partitioned in several sets Sp according
materialization scenarios which may be derived from               to the level of overlapping. Then each set Sp is
the graph VSGi, the search space for the selection of             partitioned in several sets Snp according to the
the materialized views denoted S is such that:                    number of shared expressions in the VSG.
S = MS1 ∪ MS2 ∪ ....∪ MSn, where n is the number
                                                                  Once the search space is ordered, it can be explored
of view synthesis graphs associated with the data
                                                                  in several ways: for example, the exploration could be
warehouse. Therefore, exploring the search space S
                                                                  done starting from the representations having the
consists in considering every materialization scenario
                                                                  highest level of overlapping, that is, representations
for every possible view synthesis graph representing
                                                                  showing the most shared expressions. These
the data warehouse.
                                                                  representations could then be explored according to a
                                                                  descending order of their number of shared sub-
This search space can be ordered according to two
                                                                  expressions. In our framework, the exploration
criteria:
                                                                  strategy and the selection of the scenarios may be
• the level of overlapping of the view synthesis
                                                                  done according to two kinds of factors: the cost
graphs representing the data warehouse,
                                                                  function parameters and the structural properties
• the number of shared expressions in the view
                                                                  required for the representation of the data warehouse.
synthesis graphs.
                                                                  Both factors are presented in sections 4.2. and 4.3.
Given a multi-view graph, the set of view synthesis
graphs having the level of overlapping p is denoted
Sp, and is such that: Sp = {VSGp1, VSGp2, ... ,
VSGpm},

                                                           Search Space

          Partitioning
        the search space
        according to the
      level of overlapping

       Sn                                        Sn-1                                        S0
       VSGn1    VSGn2        …   VSGnm      VSGn-11 VSGn-12   …     VSGn-1p       … …        VSG01 VSG02 … VSG0l

                   … … … … ..                                                                       … … … … ..



               Partitioning
            the search space             Sin-1                    Si-1n-1          Si-kn-1
             according to the
            number of shared             VSGn-11 VSGn-12      VSGn-13       … …    VSGn-1p
               expressions

                                         Figure 5 : Partitioning the search space



M. Bouzeghoub, Z. Kedad                                                                                             9-7
                                                          The exploration of the search space is costly, and it is
4.2. The cost function parameters                         not necessary to search for an optimal solution which
The evaluation of a materialization scenario is done      is far beyond the actual performance needs. The
using a cost function which takes into account the        exploration may stop when a solution fitting the user
different factors having an impact over the overall       requirements is found. Another possible cost function
cost of the scenario. Finding such function is a          parameter could be a space limitation. All the cost
domain specific problem, because the factors to           function parameters are used to evaluate and select
consider vary from one context to another.                the materialization scenarios.

In the existing approaches for the selection of           4.3. The structural properties
materialized views, the different materialization         The structural properties are related to the
scenarios are compared by considering their               representation of the data warehouse. They express
respective operational cost, which is a combination of    some quality requirements specified by the user for
a maintenance cost and an evaluation cost. In the         the materialization scenario. Let us consider for
following, we recall the general cost function which is   example the traceability criteria. Given a set of user
widely used in the literature related to the data         views, one possible requirement is to trace the
warehouses physical design.                               intermediate results for this set in order to find the
                                                          origin of an error or a misunderstanding of some
Each materialized view in the data warehouse has          aggregate value. For example, a set of aggregate
maintenance cost, which is the cost of propagating        values computed by different user views may be
the source updates in this view. The maintenance cost     contradictory. To find the origin of the contradiction,
of the data warehouse, denoted M, is the sum of the       the materialized views of the data warehouse will be
maintenance costs of all the materialized views. We       checked in order to find the computation which gave
denote Vmi, i ranging from 1 to n, the materialized       raise to the error. It is obvious to see that the higher is
views of the data warehouse, f mi the maintenance         the number of user views which use the same
frequency for the view Vi and CVi the maintenance         materialized expressions, the less is the cost of
cost of the view Vi; the maintenance cost of the data     tracing a given error, because in this case, the number
warehouse is such that:                                   of materialized expressions to check is low compared
                M = Σ i=1, n f mi . C Vmi                 to the number of expressions to consider if the users
                                                          views share very few materialized expressions.
The user views are evaluated against the set of
materialized views. They are denoted Vui, i ranging       Consequently, in order to ease the tracing process, it
from 1 to p. Each of these user views has an              would be more suitable to select the materialized
evaluation cost denoted EVui and a frequency denoted f    views among the set of expressions which are used by
e
  i. The global evaluation cost of a materialization      the maximal number of user views, that is, over the
scenario denoted E is such that:                          view synthesis graph having the highest level of
                  E = Σ i=1, p f ei . EVui                overlapping. The best materialization scenario will
                                                          therefore represent a compromise between the cost
The operational cost of a materialization scenario        parameters and the level of overlapping.
denoted OC is a weighted sum function of the
maintenance cost M and the evaluation cost E:             In our framework, the structural properties may be
                                                          handled in the following ways:
                    OC = E + wM                           • they may be used to constrain the selection
                                                          procedure, that is to define an exploration strategy in
The best set of materialized views is the one             the search space; in the example of the traceability
corresponding to the minimal operational cost OC.         criteria, the search space may be explored
                                                          exhaustively starting from the representations having
The w factor is a parameter which expresses the           the highest level of overlapping,
relative importance between the evaluation cost and       • they may be used to reduce the search space, by
the maintenance cost. Besides this factor, other          selecting a sub-set of representations satisfying them,
parameters may be considered. For example, the users      • and finally, they may be used after the
or the administrators of the data warehouse may state     determination of the scenario which best fits the cost
a value or an interval representing the acceptable        function parameters, in order to check whether this
operational cost of the data warehouse. These             scenario conforms to the structural properties or not.
boundaries of the operational cost may be very useful
to confront the cost of a scenario against the
expectations of the users in terms of performance.



M. Bouzeghoub, Z. Kedad                                                                                        9-8
4.4. An instantiation of the general framework                                           The characterization procedure is not performed, and
                                                                                         the paths used in the instantiation are represented by
We will now present an example of selection of a
                                                                                         the continuous lines. If p is the maximal level of
materialization scenario considering, additionally to
                                                                                         overlapping which can be derived from the multi-
the ordering criteria of the search space, the following
                                                                                         view graph, then the corresponding set Sp contains
factors: first, the users or the administrators of the
                                                                                         the sub-expressions which are shared by the maximal
data warehouse have stated an interval representing
                                                                                         number of user views. These sub-expressions will be
the acceptable operating cost of the data warehouse;
                                                                                         first considered when determining the views to
second, some traceability requirements have been
                                                                                         materialize. Multiple materialization scenarios can be
expressed by the users. These requirements will be
                                                                                         generated from a single graph VSGpi. Each of them
used to constrain the selection procedure.
                                                                                         should be evaluated. After considering the graphs of a
                                                                                         set Sp, the graphs of the set Sp-1 are considered, and
If we consider that the user requirements in terms of
                                                                                         the cost of the materialization scenarios is evaluated
performance are expressed by the interval [l1, l2],
                                                                                         for each graph VSGp-1i. This process iterates until a
where l1 and l2 represents the lowest and the highest
                                                                                         solution is found within the boundaries specified by
acceptable operational cost respectively, the
                                                                                         the users.
exploration of the ordered search space will stop if the
operational cost OC corresponding to a
                                                                                         This exploration of the search space starting from the
materialization scenario msi is such that:
                                                                                         representations having the maximal level of
                                                                                         overlapping is described by the following general
                            l1 ≤ OC(msi) ≤ l2.
                                                                                         algorithm which uses the following notations:
Considering the traceability criteria, the approach for
                                                                                         •   MVG is the multi-view graph describing the data
the selection of the views to materialize will focus on
                                                                                         warehouse,
the shared sub-expressions, and choose the
                                                                                         • Sp the set of view synthesis graph derived from
materialized views in the set of sub-expressions
                                                                                         MVG and having the level of overlapping p,
shared by the maximal number of views. The
                                                                                         • pmax the maximal level of overlapping of the
exploration strategy is then to consider first
                                                                                         expressions in MVG,
representations having the highest level of                                                       p
                                                                                         • VSG i a view synthesis graph whose level of
overlapping. The instantiation of our general
                                                                                         overlapping is p,
framework considering these requirements is given in
                                                                                         • [l1, l2] the interval representing the user
figure 6.
                                                                                         requirements, that is the interval in which the
                                                                                         operational cost varies.
                                                                                         • MSp the set of materialization scenarios derived

                        SEARCH SPACE
                                                                                         from the view synthesis graphs whose level of
                                                                                         overlapping is p,
                                                                                         • msi a materialization scenario,
                                       DefinedOn
                                                                                         • OC(msi)        the operational cost of the
                                                                                Inputs
 Inputs                                                                                  materialization scenario msi.
                        QUALITY FACTORS
     SpecializedAs                                          SpecializedAs                    Exploration procedure
                                                                                             p = pmax ;
                                                              STRUCTURAL
     COST FUNCTION                                             PROPERTIES                    Generate Sp from MVG;
      PARAMETERS                               ns            (maximal level of               While (solution not found)
                                            rai
  (user cost interval [l1,l2])
                                       Const                    overlapping)                 If Sp ≠ ∅ Then
                                                                                                For each VSGpi in Sp
                                                    o




                                   4
                                                  sT




                                                                       Inputs
                                                rm




          Inputs
                                                                                                    Generate MSp
                                             nfo
                                           Co




      SELECTION                        2                   CHARACTERIZATION                         For each msi in MSp
      PROCEDURE                                              PROCEDURE                                  If l1 ≤ OC(msi) ≤ l2
                                                                                                        Then (msi) is an acceptable solution;
      Outputs                                  3 In
                                                    puts                Outputs                         Endif
                                                                SELECTED
                                                                                                    Endfor
          SELECTED
          SCENARIOS              1 BelongsTo
                                                             REPRESENTATION                     Endfor
                                                                                             Endif
                                                                                             p = p-1;
  Figure 6 : Instantiation of the general framework
                                                                                             Generate Sp from MVG;
                                                                                             Endwhile
                                                                                             End


M. Bouzeghoub, Z. Kedad                                                                                                                     9-9
We did not present in this paper the details of the        exploration is done according to a strategy which may
selection of the candidate sets of views to materialize;   be driven by the user requirements. These
each set of candidate materialized views should allow      requirements may be either some cost function
to compute all the user views, that is, each user view     parameters or some structural properties. Existing
can be rewritten using the set of materialized views.      approaches introduce cost function parameters such
                                                           as space limitation, but they do not integrate the
5. Related works                                           structural properties which express some quality
                                                           factors such as evolutivity, security or traceability.
Several approaches have been proposed for the
physical design of a data warehouse. The problem
addressed by these approaches is to find a set of views
                                                           6. Conclusion
to materialize in order to meet the performance            In this paper we have presented an extensible
requirements. These performance requirements are           framework for data warehouse physical design. This
expressed as an operational cost, which is a               framework is based on a logical schema, which is a
combination of an evaluation cost and a maintenance        canonical representation of the data warehouse views.
cost. Some of the approaches dealing with this design      This logical model is built from subsets of the
problem are described in [Baralis et al 97],               possible plans for each user view, which are
[Theodoratos & Sellis 97], [Theodoratos & Sellis 98],      integrated in an intermediate representation called the
[Theodoratos et al 99], [Ligoudistianos et al 99] and      multi-view graph. The logical model, called the view
[Yang et al 97].                                           synthesis graph (VSG), is derived from the multi-
                                                           view graph. A level of overlapping characterizes this
Two kinds of solutions are generally proposed for the      model. It represents the maximal number of views
selection of the views to materialize in a data            sharing the same expression in the graph.
warehouse, either solutions consisting in an
exhaustive exploration of the possible materialization     The framework for physical design uses the VSG as a
solutions to find to optimal one, or solutions based on    representation of the views of a data warehouse. Since
some heuristics which reduce the exploration. In           this VSG is not unique, the search space for the
[Yang et al 97], two algorithms for selecting the          selection of the views to materialize is composed of
views to materialize are proposed; the first one is an     all the scenarios derived from all the possible VSG.
exhaustive exploration of all the possible plans of the    This search space can be ordered considering the
users queries to find an optimal multiple view             level of overlapping and the number of shared
processing plan, the second one is an heuristic            expressions of the representations from which the
algorithm which considers only the optimal plan for        scenarios are derived.
each user query.
                                                           This ordered search space can be explored in different
In [Baralis et al 97], the proposed algorithm              ways; one possible exploration strategy consists in
addresses the materialized view selection problem in       giving priority to the most shared sub-expressions
the context of multidimensional databases.                 during the selection of the views to materialize, and
[Theodoratos & Sellis 97] and [Theodoratos & Sellis        therefore exploring first the representations having
98] present the view selection problem as a state          the highest level of overlapping. The exploration is
space problem. In [Theodoratos et al 99], a pruning        guided by some users requirements, which can either
algorithm and a greedy algorithm are presented to          be related to the cost function, such as the upper an
select the set of views to materialize. But the            lower bounds representing acceptable operational
heuristics used do not guarantee to find the optimal       cost, or related to some structural properties of the
solution. [Ligoudistianos et al 99] proposes an A*         logical model, such as evolutivity.
algorithm which delivers the optimal solution, and
presents some heuristics to prune the considered state     Further works will consist in identifying sets of
space.                                                     quality factors related to the physical design of the
                                                           data warehouses, and study the impact of these factors
In this paper, rather than a new approach for the          on both the selection and the characterization
selection of the best set of materialized views, we        procedures of the proposed framework.
tried to define the main features of a framework for
data warehouse physical design. This physical design       References
is done on a search space consisting of a set of
representations of the data warehouse, considering         [Baralis et al 97] E. Baralis, S. Paraboschi, E.
the level of overlapping and the number of shared            Teniente. Materialized View Selection in a
expressions which characterize each representation.          Multidimensional Environment. In Proc. of the
This level allows to order the search space, and the         23rd International Conference on Very Large Data
                                                             Bases (VLDB'97), Athens, Greece, August 1997.


M. Bouzeghoub, Z. Kedad                                                                                     9 - 10
[Bouzeghoub et al 99] Bouzeghoub M., Fabret F.,         [Hull & Zhou 96] R. Hull, G. Zhou. A Framework for
  Matulovic M., Modeling the Data Warehouse               Supporting Data Integration Using the Materialized
  Refreshment Process as a Workflow Application.          and Virtual Approches. In Proceedings of the ACM
  Proceedings of the International Workshop on            SIGMOD       International     Conference       on
  Design and Management of Data Warehouses                Management of Data, Montreal, Canada, June
  (DMDW'99), Heidelberg, Germany, June 1999.              1996.

[Bouzeghoub & Theodoratos 99] Bouzeghoub M.,            [Jarke et al 97] M. Jarke, Y. Vassiliou. Foundation of
  Theodoratos D. Data Currency Quality Factors in         Data Warehouse Quality: an Overview of the DWQ
  Data Warehouse Design. Proceedings of the               project . In Proceedings of the 2nd Internat. Conf.
  International  Workshop      on Design     and          on Information Quality, Cambridge, Mass, 1997.
  Management of Data Warehouses (DMDW'99),              [Jarke et al 99] M. Jarke, M. Lenzerini, Y. Vassiliou,
  Heidelberg, Germany, June 1999.                         P. Vassiliadis. (edits). Fundamentals of Data
                                                          Warehouses. Springer, 1999.
[Calvanese et al 98] D. Calvanese, G. De Giacomo,
 M. Lenzerini, D. Nardi, R. Rosati. Source              [Jarke et al 99a] Jarke M, Jeusfeld M.A., Quix C.,
 Integration integration : conceptual modeling and        Vassiliadis P., "Architecture and Quality in Data
 reasoning support. In Proceedings of the 3rd IFCIS       Warehouses", Proceeding of the 10th International
 International    Conference     on     Cooperative       Conference on Advanced Information Systems
 Information Systems (CoopIS’98), New York City,          Engineering (CAiSE'98), Pisa, Italy, June 1998.
 USA, August 1998.
                                                        [Jeusfeld et al 98] Jeusfeld, M. A., Quix C., Jarke M.,
[Chawathe et al 94] S. Chawathe, H. Garcia-Molina,        "Design and Analysis of Quality Information for
  J. Hammer, K. Ireland, Y. Papakonstantinou, J.          Data Warehouses", Proceedings of the 17th
  Ullman, J. Widom. The TSIMMIS project :                 Internat. Conf. on Conceptual Modeling (ER'98),
  Integration of Hererogeneous Information Sources.       Singapore, november 1998.
  In Proceedings of the 10th Meeting of the
  Information Processing Society of Japan (IPSJ'94),    [Kedad 99] Z. Kedad. Techniques d’intégration dans
  Tokyo, Japan, October 1994.                            les systèmes d’information multi-source. PhD
                                                         Thesis, Laboratoire PriSM, University of Versailles,
[Chaudhuri & Dayal 96] S. Chaudhuri, A. Dayal. An        France, January 1999.
  Overview of Data Warehousing and OLAP
  Technology. Tutorial, In Proceedings of the 22th      [Kedad      & Bouzeghoub 99] Z. Kedad, M.
  International Conference on Very Large Data             Bouzeghoub. Conception de systèmes d’information
  Bases (VLDB'96),     Mumbai, India, September           multi-source. In Proceedings of INFORSID’99,
  1996.                                                   Toulon, France, June 1999.

[Garcia-Molina et al 95] H. Garcia-Molina, Y.           [Levy et al 96] A.Y. Levy, A. Rajaraman, J.J.
  Papakonstantinou, D. Quass, A. Rajamaran, Y.           Ordille.    Query-Answering     Algorithms      for
  Sagiv, J. Ullman, V. Vassalos, J. Widom. The           Information Agents. In Proceedings of the 13th
  TSIMMIS approach to mediation: Data model and          National Conference on Artificial Intelligence and
  Languages. In 2nd International Workshop on Next       8th Innovative Applications of Artificial
  Generation Information Technologies and Systems        Intelligence Conference (AAAI 96, IAAI 96),
  (NGITS '95), Naharia, Israel, June 1995.               Portland, Oregon, August 1996.

[Gupta & Mumick 95] A. Gupta, I.S. Mumick,              [Levy et al 96a] A.Y. Levy, A. Rajaraman, J.J.
  Maintenance of materialized views: Problems,            Ordille. Querying Heterogeneous Information
  Techniques, and applications. In IEEE Data              Sources Using Source Description. In Proceedings
  Engineering Bulletin, Special Issue on Materialised     of the 22th International Conference on Very Large
  Views and Data Warehousing, 18(2), june 1995            Data Bases (VLDB'96), Mumbai, India, September
                                                          1996.
[Hammer et al 95] J. Hammer, H. Garcia-Molina, J.
 Widom, W. Labio, Y. Zhuge. The Stanford Data           [Ligoudistianos et al 99] S. Ligoudistianos, T. Sellis,
 Warehousing Project. IEEE Data Engineering               D. Theodoratos, Y. Vassiliou. Heuristic Algorithms
 Bulletin, Special Issue on Materialized Views and        For Designing The Data Warehouse with SPJ
 Data Warehousing, 18(2), June 1995                       Views. In Proceedings of the International
                                                          Conference on Data Warehousing and Knowledge
                                                          Discovery (DAWAK'99), 1999.


M. Bouzeghoub, Z. Kedad                                                                                  9 - 11
[Theodoratos & Sellis 97] D. Theodoratos, T. Sellis.
  The    Data    Warehouse     Configuration.    In
  Proceedings of the 23rd International Conference
  on Very Large Data Bases (VLDB'97), Athens,
  Greece, August 1997.

[Theodoratos & Sellis 98] D. Theodoratos, T. Sellis.
  Data Warehouse Schema and Instance Design. In
  Proceedings of the 17th International Conference
  on Conceptual Modeling (ER'98), Singapore,
  November 1998.

[Theodoratos et al 99] D. Theodoratos, S.
  Ligoudistianos, T. Sellis. Designing the Global
  Data Warehouse with SPJ Views. In Proceedings of
  the 11th Conference on Advanced Information
  Systems Engineering (CAISE'99), Heidelberg,
  Germany, June 1999.

[Vassiliadis et al 99] P. Vassiliadis, M. Bouzeghoub,
  C. Quix. Towards quality-oriented data warehouse
  usage and evolution. Proceedings of the 11th
  Conference on Advanced Information Systems
  Engineering (CAiSE'99), Heidelberg, Germany,
  June 1999.

[Widom 95] J. Widom. Research Problems in Data
  Warehousing.     In Proceedings of the 4th
  International Conference on Information and
  Knowledge Management (CIKM'95), Baltimore,
  Maryland, November 1995.

[Wierner et al 96] J.L. Wiener, H. Gupta, W.J. Labio,
  Y. Zhuge, H. Garcia-Molina, J. Widom. A System
  Prototype for Warehouse View Maintenance. In
  VIEWS 96

[Yang et al 97] J. Yang, K. Karlapalem, S. Li.
  Algorithms for materialized view design in data
  warehousing Environment. In Proceedings of the
  23rd International Conference on Very Large Data
  Bases (VLDB'97), Athens, Greece, August 1997.

[Zhou et al 95] G. Zhou, R. Hull, R. King, J.C.
  Franchitti. Data Integration and Warehousing using
  H2O. IEEE Data Engineering Bulletin, Special
  Issue on Materialized Views and Data
  Warehousing, 18(2), June 1995




M. Bouzeghoub, Z. Kedad                                 9 - 12