                    Modeling the Data Warehouse Refreshment Process
                             as a Workflow Application

         Mokrane Bouzeghoub(*)(**), Françoise Fabret(*), Maja Matulovic-Broqué(*)
                          (*) INRIA Rocquencourt, France
               (**) Laboratoire PRiSM, Université de Versailles, France
                          Mokrane.Bouzeghoub@prism.uvsq.fr


                                          Abstract

This article is a position paper on the nature of data warehouse refreshment, which is often defined as a view maintenance problem or as a loading process. We show that the refreshment process is more complex than the view maintenance problem, and different from the loading process. We conceptually define the refreshment process as a workflow whose activities depend on the available products for data extraction, cleaning and integration, and whose coordination events depend on the application domain and on the required quality in terms of data freshness. Implementation of this process is clearly distinguished from its conceptual modelling.


1. Introduction

    Data warehousing is a new technology which provides software infrastructure for decision support systems and OLAP applications. Data warehouses collect data from heterogeneous and distributed sources. This data is aggregated and then customized with respect to organizational criteria defined by OLAP applications. The data warehouse can be defined as a hierarchy of data stores which goes from source data to highly aggregated data (data marts). Between these two extreme data stores, we can find various other stores depending on the requirements of OLAP applications. One of these stores is the operational data store, which reflects source data in a uniform and clean representation. The corporate data warehouse (CDW) contains highly aggregated data and can be organized into a multidimensional structure. Data extracted from each source can also be stored in intermediate data recipients. Obviously, this hierarchy of data stores is a logical way to represent the data flows which go from the sources to the data marts. Not all of these stores are necessarily materialized, and if they are, they can simply constitute different layers of the same database.

    Figure 1 shows a typical data warehouse architecture. This is a logical view whose operational implementation receives many different answers in the data warehousing products. Depending on each data source, extraction and cleaning can be done by the same wrapper or by distinct tools. Similarly, data reconciliation (also called multi-source cleaning) can be separated from or merged with data integration (multi-source operations). High-level aggregation can be seen as a set of computation techniques ranging from simple statistical functions to advanced data mining algorithms. Customisation techniques may vary from one data mart to another, depending on the way decision makers want to see the elaborated data.

____________
The copyright of this paper belongs to the paper’s authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage.

The research presented in this paper is supported by the European Commission under the Esprit Program LTR project 'DWQ: Foundations of Data Warehouse Quality'.

Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'99)
Heidelberg, Germany, 14.-15.6.1999
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/




M. Bouzeghoub, F. Fabret, M. Matulovic-Broqué                                                                                                     6-1
[Figure: data sources feed per-source extraction/cleaning components; reconciliation/integration produces the operational data store (ODS); high-level aggregation produces the corporate data warehouse (CDW); customisation produces the data marts; metadata sits above the whole architecture.]

                                        Figure 1: Data warehouse architecture
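The data flow of Figure 1 can be sketched as a composition of per-source preparation steps followed by multi-source integration, aggregation and customisation. The following minimal sketch is our own illustration; every function name, store name and data value in it is invented and not taken from the paper or from any product:

```python
# Illustrative sketch of the Figure 1 data flow.
# All names and values here are invented for illustration.

def extract(source):
    """Per-source extraction: pull raw records from one source."""
    return list(source)

def clean(records):
    """Single-source cleaning: drop records with missing fields."""
    return [r for r in records if None not in r.values()]

def integrate(per_source_records):
    """Reconciliation + integration: merge all sources into the ODS."""
    ods = []
    for records in per_source_records:
        ods.extend(records)
    return ods

def aggregate(ods):
    """High-level aggregation: here, a simple count per key -> CDW."""
    cdw = {}
    for r in ods:
        cdw[r["key"]] = cdw.get(r["key"], 0) + 1
    return cdw

def customise(cdw, keys):
    """Customisation: each data mart sees only the keys it asked for."""
    return {k: cdw[k] for k in keys if k in cdw}

sources = [
    [{"key": "a", "v": 1}, {"key": "b", "v": None}],  # source 1
    [{"key": "a", "v": 2}],                            # source 2
]
ods = integrate([clean(extract(s)) for s in sources])
cdw = aggregate(ods)
mart = customise(cdw, ["a"])
```

The point of the sketch is only the layering: each store (ODS, CDW, data mart) is the output of one stage, and whether a stage's output is actually materialized is an implementation choice, as the text above notes.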

    The refreshment of a data warehouse is an important process which determines the effective usability of the data collected and aggregated from the sources. Indeed, the quality of data provided to the decision makers depends on the capability of the data warehouse system to convey the changes made at the data sources, in a reasonable time, from the sources to the data marts. Most design decisions are then concerned with the choice of data structures and update techniques that optimise the refreshment of the data warehouse.

    There is considerable confusion in the literature concerning data warehouse refreshment. Indeed, this process is often either reduced to a view maintenance problem or confused with the data loading phase. Our purpose in this paper is to show that data warehouse refreshment is more complex than the view maintenance problem, and different from the loading process. We define the refreshment process as a workflow whose activities depend on the available products for data extraction, cleaning and integration, and whose triggering events depend on the application domain and on the required quality in terms of data freshness.

    The objective of the following sections is to describe the refreshment process tasks and to demonstrate how they can be organised as a workflow. Section 2 discusses the differences between the refreshment process on one side and data loading and view maintenance on the other side. Section 3 defines the generic workflow which logically represents the refreshment process, with examples of workflow scenarios. Section 4 defines the semantics of the refreshment process in terms of workflow design decisions. Section 5 concludes with a summary of the main ideas of this position paper and some implementation issues.

2. View maintenance, data loading and data refreshment

    Data refreshment in data warehouses is generally confused with data loading as done during the initial phase, or with update propagation through a set of materialized views. Both analogies are wrong. The following paragraphs argue the differences between data loading and data refreshment, and between view maintenance and data refreshment.

Data loading vs. data refreshment

The data warehouse loading phase consists in the initial data warehouse instantiation, that is, the initial computation of the data warehouse content. This initial loading is globally a sequential process of four steps (Figure 2): (i) preparation, (ii) integration, (iii) high-level aggregation and (iv) customisation. The
first step is done for each source and consists in data extraction, data cleaning and possibly data archiving before or after cleaning. Archiving data in a history can be used both for synchronisation purposes between sources having different access frequencies and for some specific temporal queries. The second step consists in data reconciliation and integration, that is, multi-source cleaning of data originating from heterogeneous sources and derivation of the base relations (or base views) of the operational data store (ODS). The third step consists in the computation of aggregated views from base views. While the data extracted from the sources and integrated in the ODS is considered as ground data with a very low level of aggregation, the data in the corporate data warehouse (CDW) is generally highly summarised using aggregation functions. The fourth step consists in the derivation and customisation of the user views which define the data marts. Customisation refers to the various presentations needed by the users for multidimensional data.

[Figure: the loading activities stacked bottom-up: the preparation phase (data extraction, data cleaning, history management); the integration phase (data reconciliation, history management); update propagation; customization.]

             Figure 2: Data loading activities

The main feature of the loading phase is that it constitutes the latest stage of the data warehouse design project. Before the end of the data loading, the data warehouse does not yet exist for the users. Consequently, there is no constraint on the response time. In contrast, with respect to the data sources, the loading phase requires more availability.

The data flow which describes the loading phase can serve as a basis to define the refreshment process, but the corresponding workflows are different. The workflow of the refreshment process is dynamic and can evolve with users’ needs and with source evolution, while the workflow of the initial loading process is static and defined with respect to current user requirements and current sources.

    The difference between the refreshment process and the loading process lies mainly in the following. First, the refreshment process may exhibit complete asynchronism between its different activities (preparation, integration, aggregation and customisation). Second, there may be a high level of parallelism within the preparation activity itself, each data source having its own availability window and its own extraction strategy. The synchronization is done by the integration activity. Another difference lies in source availability. While the loading phase requires a long period of availability, the refreshment phase should not overload the operational applications which use the data sources. Hence, each source provides a specific access frequency and a restricted availability duration. Finally, there are more constraints on response time for the refreshment process than for the loading process. Indeed, with respect to the users, the data warehouse does not exist before the initial loading, so the computation time is included within the design project duration. After the initial loading, the data becomes visible and should satisfy user requirements in terms of data availability, accessibility and freshness.

View maintenance vs. data refreshment

    The propagation of changes during the refreshment process is done through a set of independent activities, among which we find the maintenance of the views stored at the ODS and CDW levels. The view maintenance phase consists in propagating a change raised in a given source over a set of views stored at the ODS or CDW level. Such a phase is a classical materialized view maintenance problem except that, in data warehouses, the changes to propagate into the aggregated views are not exactly those that occurred in the sources, but the result of pre-treatments
performed by other refreshment activities such as data cleaning and multi-source data reconciliation.

    The view maintenance problem has been intensively studied in the database research community. Major work done in this area is synthesized in [BFG+99] and [ThBo 99]. Most of the references focus on the problems raised by the maintenance of a set of materialized (also called concrete) views derived from a set of base relations when the current state of the base relations is modified. The main results concern:
• Self-maintainability: results concerning self-maintainability are generalized to a set of views: a set of views V is self-maintainable with respect to the changes to the underlying base relations if the changes may be propagated into every view in V without querying the base relations (i.e., the information stored in the concrete views plus the instance of the changes is sufficient to maintain the views).
• Coherent and efficient update propagation: various algorithms are provided to schedule update propagation through each individual view, taking care of interdependencies between views, which may lead to possible inconsistencies. For this purpose, auxiliary views are often introduced to facilitate update propagation and to enforce self-maintainability.

Results on the self-maintainability of a set of views are of great interest in the data warehouse context, and it is commonly admitted that the set of views stored in a data warehouse has to be globally self-maintainable. The rationale behind this recommendation is that self-maintainability is a strong requirement imposed by the operational sources in order not to overload their regular activity.

    As stated in the previous section, research on data warehouse refreshment has mainly focused on update propagation through materialized views. Many papers have been published on this topic, but very few are devoted to the whole refreshment process as defined before. We consider view maintenance just as one step of the complete refreshment process. Other steps concern data cleaning, data reconciliation, data customisation and, if needed, data archiving. On the other hand, extraction and cleaning strategies may vary from one source to another, as well as update propagation, which may vary from one user view to another, depending for example on the desired freshness of data. So the data warehouse refreshment process cannot be limited to a view maintenance process.

    To summarize the previous discussion, we can say that a refreshment process is a complex system which may be composed of asynchronous and parallel activities that need a certain monitoring. The refreshment process is an event-driven system which evolves frequently, following the evolution of data sources and user requirements. Users, data warehouse administrators and data source administrators may impose specific constraints such as, respectively, freshness of data, space limitation of the ODS or CDW, and access frequency to sources. There is no simple and unique refreshment strategy which is suitable for all data warehouse applications, for all data warehouse users, or for the whole data warehouse lifetime.

3. The refreshment process is a workflow

    A workflow is a set of coordinated activities which might be manual or automated activities performed by actors [Scha 98]. Workflow concepts have been used in various application domains such as business process modeling [HaCh 93], cooperative applications modeling [CSCW 96] [Lawr 97], and database transaction modeling [AAE+ 95] [Bern 98]. Depending on the application domain, activities and coordination are defined using appropriate specification languages such as state-chart diagrams and Petri nets [WoKr 93], or active rules [CCPP 95]. In spite of this diversity of applications and representations, most workflow users refer more or less to the concepts and terminology defined by the Workflow Management Coalition [WFMC 95]. Workflow systems are supposed to provide high-level flexibility to recursively decompose and merge activities, and to allow dynamic reorganization of the workflow process. These features are typically useful in the context of data warehouse refreshment, as the activities are performed by market products whose functionalities and scope differ from one product to another.

    In the following subsections, we show how the refreshment process can be defined as a workflow application. We illustrate the interest of this approach by the ability to define different scenarios depending on user requirements, source constraints and data warehouse constraints. We show that these scenarios may evolve through time to fulfill
evolution of any of the previous requirements and constraints.

3.1. The workflow of the refreshment process

    The refreshment process aims to propagate changes raised in the data sources to the data warehouse stores. This propagation is done through a set of independent activities (extraction, cleaning, integration, ...) that can be organized in different ways, depending on the semantics one wants to assign to the refreshment process and on the quality one wants to achieve. The ordering of these activities and the context in which they are executed define this semantics and influence this quality. Ordering and context result from the analysis of view definitions, data source constraints and user requirements in terms of quality factors. In the following subsections, we describe the refreshment activities and their organization as a workflow. Then we give examples of different workflow scenarios to show how refreshment may be a dynamic and evolving process. Finally, we summarize the different perspectives through which a given refreshment scenario should be considered.

Refreshment activities

    The refreshment process is similar to the loading process in its data flow but, while the loading process is a massive feeding of the data warehouse, the refreshment process captures the differential changes held in the sources and propagates them through the hierarchy of data stores in the data warehouse. The preparation step extracts from each source the data that characterises the changes that have occurred in this source since the last extraction. As for loading, this data is cleaned and possibly archived before its integration. The integration step reconciliates the source changes coming from multiple sources and adds them to the ODS. The aggregation step incrementally recomputes the hierarchy of aggregated views using these changes. The customisation step propagates the summarized data to the data marts. As for the loading phase, this is a logical decomposition whose operational implementation receives many different answers in the data warehouse products. This logical view allows a certain traceability of the refreshment process. Figure 3 shows the activities of the refreshment process as well as a sample of the coordinating events.

[Figure: the refreshment activities stacked bottom-up: data extraction, data cleaning, history management, data integration, history management, update propagation and customization, each triggered by temporal/external events, by termination events (after-extraction, after-cleaning, after-integration, after-propagation) or by before-events (before-integration, before-propagation, before-customization).]

                             Figure 3: The generic workflow for the refreshment process
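One way to read the generic workflow of Figure 3 is as a set of (event → activity) bindings, where a triggered activity actually executes only when its input store holds pending changes, and its termination event chains the next activity. The sketch below is our own minimal illustration of that reading; the engine, its class names and the store names are all assumptions, not part of the paper:

```python
# Minimal event-driven workflow sketch: each activity is bound to a
# triggering event and runs only if its input store holds changes.
# All class, event and store names are invented for illustration.

class Activity:
    def __init__(self, name, in_store, out_store, emits):
        self.name = name
        self.in_store = in_store    # store the activity consumes
        self.out_store = out_store  # store the activity feeds
        self.emits = emits          # termination event raised on success

class Engine:
    def __init__(self):
        self.stores = {}            # store name -> list of pending changes
        self.bindings = {}          # event name -> activity

    def bind(self, event, activity):
        self.bindings[event] = activity

    def raise_event(self, event, log=None):
        log = [] if log is None else log
        act = self.bindings.get(event)
        if act is None:
            return log
        pending = self.stores.get(act.in_store, [])
        if not pending:             # triggered but input store is empty:
            return log              # the activity is not executed
        self.stores[act.in_store] = []
        self.stores.setdefault(act.out_store, []).extend(pending)
        log.append(act.name)
        # the termination event chains the downstream activity
        return self.raise_event(act.emits, log)

engine = Engine()
engine.bind("temporal",
            Activity("extraction", "source_log", "extracted", "after-extraction"))
engine.bind("after-extraction",
            Activity("cleaning", "extracted", "cleaned", "after-cleaning"))
engine.bind("after-cleaning",
            Activity("integration", "cleaned", "ods", "after-integration"))

engine.stores["source_log"] = ["change1", "change2"]
executed = engine.raise_event("temporal")   # fires the whole chain
```

Rebinding an activity to a different event (a temporal event instead of a termination event, say) changes the scenario without touching the activity itself, which is the flexibility the workflow view is meant to buy.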




Coordination of activities

    In workflow systems, activities are coordinated by control flows, which may be notifications of process commitment, emails issued by agents, temporal events, or any other triggering events. In the refreshment process, coordination is done through a wide range of event types.

    We can distinguish several event types which may trigger and synchronize the refreshment activities. They might be temporal events, termination events (dashed lines in Figure 3) or any other user-defined events. Depending on the refreshment scenario, one can choose an appropriate set of event types which allows the correct level of synchronization to be achieved.

    Activities of the refreshment workflow are not executed as soon as they are triggered; they may depend on the current state of the input data stores. For example, if the extraction is triggered periodically, it is actually executed only when there are effective changes in the source log file. If the cleaning process is triggered immediately after the extraction process, it is actually executed only if the extraction process has gathered some source changes. Consequently, the state of the input data store of each activity may be considered as a condition to effectively execute this activity.

    Within the workflow which represents the refreshment process, activities may have different origins and different semantics; the refreshment strategy is logically considered as independent of what the activities actually do. However, at the operational level, some activities can be merged (e.g., extraction and cleaning), and some others decomposed (e.g., integration). The flexibility claimed for workflow systems should allow the refreshment activities and the coordinating events to be dynamically tailored.

    There may be another way to represent the workflow and its triggering strategies. Indeed, instead of considering external events such as temporal events or termination events of the different activities, we can consider data changes as events. Hence, each input data store of the refreshment workflow is considered as an event queue that triggers the corresponding activity. However, to be able to represent different refreshment strategies, this approach needs a parametric synchronization mechanism which allows the activities to be triggered at the right moment. This can be done by introducing composite events which combine, for example, data change events and temporal events. Another alternative is to put locks on data stores and remove them after an activity or a set of activities decides to commit. In the case of a long-term synchronization policy, as may sometimes happen in some data warehouses, this latter approach is not sufficient.

The workflow agents

    Two main agent types are involved in the refreshment workflow: human agents, who define requirements, constraints and strategies, and computer agents, which process activities. Among human agents we can distinguish users, the data warehouse administrator and source administrators. Among computer agents, we can mention source management systems, database systems used for the data warehouse and data marts, wrappers and mediators. For simplicity, agents are not represented in the refreshment workflow, which concentrates on the activities and their coordination.

3.2. Defining refreshment scenarios

    To illustrate different workflow scenarios, we consider the following example which concerns three national Telecom billing sources represented by three relations S1, S2, and S3. Each relation has the same (simplified) schema: (#PC, date, duration, cost). An aggregated view V with schema (avg-duration, avg-cost, country) is defined in a data warehouse from these sources as the average duration and cost of a phone call in each of the three countries associated with the sources, during the last 6 months. We assume that the construction of the view follows the steps explained before. During the preparation step, the data of the last six months contained in each source is cleaned (e.g., all cost units are translated into Euros). Then, during the integration phase, a base relation R with schema (date, duration, cost, country) is constructed by unioning the data coming from each source and generating an extra attribute (country). Finally, the view is computed using aggregates (Figure 4).

    We can define another refreshment scenario with the same sources and similar views. This scenario mirrors the average duration and cost for each day instead of for the last six months. This leads to changing the frequency of extraction, cleaning, integration and propagation. Figure 5 gives such a possible scenario. Frequencies of source extractions are those which are allowed by source administrators. Source 3 is permanently available.
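The construction of V described above can be sketched concretely. The code below is our own illustration of the cleaning step (cost units translated into Euros), the integration step (union plus a generated country attribute) and a differential aggregation step that refreshes the averages from running sums instead of rescanning R; the conversion rates, country codes and tuple values are all invented:

```python
# Sketch of the Telecom example: three sources with schema
# (#PC, date, duration, cost); view V has schema (avg-duration,
# avg-cost, country). Rates, countries and values are invented.

RATES = {"S1": 1.0, "S2": 6.56, "S3": 166.4}    # local unit -> Euro (invented)
COUNTRY = {"S1": "FR", "S2": "DE", "S3": "IT"}  # source -> country (invented)

def clean(source, rows):
    """Single-source cleaning: translate all cost units into Euros."""
    return [(pc, d, dur, cost / RATES[source]) for (pc, d, dur, cost) in rows]

def integrate(changes_by_source):
    """Union the cleaned changes into R, generating the country attribute."""
    return [(d, dur, cost, COUNTRY[s])
            for s, rows in changes_by_source.items()
            for (_pc, d, dur, cost) in rows]

def propagate(summary, r_changes):
    """Differential aggregation: update running (count, sum) pairs per
    country so V's averages are refreshed without rescanning R."""
    for (_d, dur, cost, country) in r_changes:
        n, sdur, scost = summary.get(country, (0, 0.0, 0.0))
        summary[country] = (n + 1, sdur + dur, scost + cost)
    return {c: (sdur / n, scost / n) for c, (n, sdur, scost) in summary.items()}

summary = {}
batch1 = integrate({"S1": clean("S1", [(1, "d1", 10, 2.0), (2, "d1", 20, 4.0)])})
v = propagate(summary, batch1)   # V after the first refreshment
batch2 = integrate({"S1": clean("S1", [(3, "d2", 40, 6.0)])})
v = propagate(summary, batch2)   # only the new changes are scanned
```

The running sums in `summary` play the role of the auxiliary views mentioned in Section 2: they are what makes V maintainable from the differential changes alone, without querying the sources again.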

[Workflow diagram: activities S1/S2/S3 Data Extraction, S1/S2 DataCleaning, S2 History management, Data Integration, History management, Update Propagation and Customization, connected by the events EveryEndTrimester, EveryEndMonth, AfterExtraction, EveryBeginningTrimester, BeforeIntegration, AfterIntegration, AfterArchiving, BeforeCustomization and BeforeQueryEvaluation.]

                                         Figure 4: First example of refreshment scenario
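A scenario such as the one in Figure 4 can be read as a declarative wiring of triggering events to activities. The sketch below is illustrative only: the activity and event names are taken from Figures 4 and 5, but the dictionary encoding and the helper function are inventions for this example, not from the paper.

```python
# Each activity is mapped to the events that trigger it.
scenario_fig4 = {
    "S1 Data Extraction": ["EveryEndTrimester"],
    "S2 Data Extraction": ["EveryEndMonth"],
    "S1 DataCleaning":    ["AfterExtraction"],
    "S2 DataCleaning":    ["AfterExtraction"],
    "Data Integration":   ["EveryBeginningTrimester"],
    "Update Propagation": ["AfterIntegration"],
    "Customization":      ["BeforeQueryEvaluation"],
}

# The Figure 5 variant keeps the activity set but changes only the
# triggering events, e.g. raising the extraction frequencies.
scenario_fig5 = dict(scenario_fig4, **{
    "S1 Data Extraction": ["Every3Hours"],
    "S2 Data Extraction": ["EveryHour"],
    "Data Integration":   ["BeforeIntegration"],
})

def activities_triggered_by(scenario, event):
    """Return the activities that a given event triggers in a scenario."""
    return [a for a, evs in scenario.items() if event in evs]
```

With this encoding, comparing two refreshment scenarios reduces to comparing two small tables; for instance `activities_triggered_by(scenario_fig5, "EveryHour")` yields `["S2 Data Extraction"]`.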




[Workflow diagram: activities S1/S2/S3 Data Extraction, S1/S2 DataCleaning, Data Integration, Update Propagation and Customization, connected by the events Every3Hours, EveryHour, AfterExtraction, BeforeIntegration, BeforePropagation, BeforeCustomization and BeforeQueryEvaluation.]

                                      Figure 5: Second example of refreshment scenario
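The difference between the two scenarios is essentially a difference in the firing instants of the periodic temporal events. A small sketch, under an assumed and deliberately simplified semantics of events such as Every3Hours and EveryHour:

```python
from datetime import datetime, timedelta

def firings(start, period, until):
    """Firing instants of a periodic temporal event within [start, until)."""
    instants, t = [], start
    while t < until:
        instants.append(t)
        t += period
    return instants

day = datetime(1999, 6, 14)
s1 = firings(day, timedelta(hours=3), day + timedelta(days=1))  # Every3Hours
s2 = firings(day, timedelta(hours=1), day + timedelta(days=1))  # EveryHour

# Over one day, S2 is extracted three times as often as S1 (24 vs 8 firings).
```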

    When the refreshment activities are long-term activities, or when the DWA wants to apply validation procedures between activities, temporal events or activity terminations can be used to synchronize the whole refreshment process. In general, the quality requirements may impose a certain synchronization strategy. For example, if users desire high freshness of data, each update in a source should be mirrored as soon as possible in the views. Consequently, this determines the synchronization strategy: trigger the extraction after each change in a source; trigger the integration, when semantically relevant, after the commit of each data source; propagate changes through the views immediately after integration; and customize the user views in the data marts.

Refreshment scheduling

    The refreshment process can be viewed from different perspectives:

•   Client-driven refreshment, which describes the part of the process that is triggered on demand by the users. This part mainly concerns update propagation from the ODS to the aggregated views. The on-demand strategy can be defined for all aggregated views, or only for those for which the freshness of data is related to the date of querying.

•   Source-driven refreshment, which defines the part of the process that is triggered by changes made in the sources. This part concerns the preparation phase. The independence between sources can be used to define different preparation strategies, depending on the sources. Some sources may be associated with cleaning procedures, others not. Some sources need a history of the extracted data, others not. For some sources, the cleaning is done on the fly during the extraction; for others, after the extraction or on the history of these changes. The triggering of the extraction may also differ from one source to another: different events can be defined, such as temporal events (periodic or at a fixed absolute time), each change detected on the source, or a demand from the integration process.



•   ODS-driven refreshment, which defines the part of the process that is automatically monitored by the data warehouse system. This part concerns the integration phase. It may be triggered at a synchronization point defined with respect to the ending of the preparation phase. Integration can be considered as a whole, concerning all the source changes at the same time; in this case, it can be triggered by an external event, which might be a temporal event or the ending of the preparation phase of the last source. The integration can also be sequenced with respect to the termination of the preparation phase of each source, that is, each extraction is integrated as soon as its cleaning is finished. The ODS can also monitor the preparation phase and the aggregation phase by generating the relevant events that trigger the activities of these phases.

    In the simplest case, one of the first two approaches is used as a single strategy. In a more complex case, there may be as many strategies as there are sources or high-level aggregated views. In between, there may be, for example, four different strategies corresponding to the previous four phases. For some given user views, one can apply the client-driven strategy (pull strategy), while for other views one can apply the ODS-driven strategy (push strategy). Similarly, some sources may be solicited through a pull strategy while others apply a push strategy.

    The strategy to choose depends on the semantic parameters, but also on the tools available to perform the refreshment activities (extraction, cleaning, integration). Some extraction tools also perform the cleaning on the fly, while some integrators immediately propagate changes up to the high-level views. The generic workflow in Figure 3 is thus a logical view of the refreshment process. It shows the main identified activities and the potential event types which can trigger them.

4. Semantics of the refreshment process

    As we have seen in the previous example scenarios, the view definition is not sufficient to fix the semantics of the refreshment process. Indeed, the query which defines a view does not specify whether this view operates on a history or not, how this history is sampled, whether the changes of a given source should be integrated each hour or each week, or which data timestamp should be taken when integrating changes of different sources. Nor does the view definition include the specific filters defined in the cleaning process, such as choosing the same measure for certain attributes, rounding the values of some attributes, or eliminating some confidential data. Consequently, based on the same view definitions, the refreshment process may produce different results depending on all these extra parameters, which have to be fixed independently, outside the queries which define the views.

    The result of a query against a view V occurring at time t depends on two main parameters associated with the refreshment strategy implemented by the data warehouse. First, it depends on the change extraction capabilities of each source. For instance, changes in source S1 can be extracted as soon as they occur, while changes in source S2 can be captured only during the last night of the month. This determines the availability of the changes from a source, and hence impacts the data freshness. It also impacts the data coherence, because time discrepancies may occur in the view: the average may incorporate fresh data from S1 and old data from S2. Second, it depends on the time needed to compute the change to the view from the changes to the sources.

    In fact, the two previous parameters may be repeated as many times as there are intermediate storages between the sources and the view. For instance, suppose that the result of the preparation step is stored. The availability parameter then characterizes the moment at which the integration process is capable of accessing the result of a preparation step. Thus, if each result is only available at the end of the month, then the integration can only be performed at that time, and the view will consequently reflect changes that occurred in the sources only once per month.

    Another parameter influences the result of a query against V. It characterizes the actualization of the data contained in each source. For instance, source S1 can be updated at the end of every week, while source S2 is updated two days before the end of every month. If a query is posed against V at the end of the second week of a month, the effect of the phone calls that occurred since the beginning of the month in the country associated with source S2 may not be reflected in V, and hence in the result of the query. Thus, the value of this parameter determines the difference that may exist between the state of the view reflected by the data warehouse and the state of the view in the real world. Because this




parameter is fixed and out of the control of the data warehouse application (it is actually part of the source operational applications), we do not consider it here.

    The previous discussion has shown how the refreshment process can depend on some parameters, independently of the choice of materialized views, and how these parameters impact the semantics of the process. It also shows that building an efficient refreshment strategy with respect to application requirements (e.g. data freshness, computation time of queries and views, data accuracy) depends on various parameters related to:
•   source constraints (e.g. availability windows, frequency of change),
•   data warehouse system limits (e.g. storage space limits, functional limits).

    Finally, the main lesson drawn from the previous examples and discussion is:

    The operational semantics of the refreshment process can be defined as the set of all design decisions that contribute to providing the users with relevant data fulfilling the quality requirements.

    Some of these design decisions are inherited from the design of the initial loading; others are specific to the refreshment itself. The decisions inherited from the initial loading concern, on the one hand, the view definitions and the structure of the data flow between the sources and the data marts, and, on the other hand, the semantics of the loading activities, that is, cleaning rules, integration rules, etc.

    The design decisions which are specific to the refreshment semantics are those that determine:
•   the moment when each refreshment task takes place in the global process,
•   the way the different refreshment tasks are synchronized,
•   the way the shared data is made visible to the corresponding tasks.

These design decisions are specified by defining:
•   the decomposition of the refreshment process into elementary tasks (e.g. cleaning of some specific source, partial integration of two given changes originating from two different sources, or detection and cleaning in a single task for another source),
•   the ordering of these tasks,
•   the events initiating the tasks. The events set the rhythm of the refreshment process, and, depending on this rhythm, the freshness and the accuracy of the data may be quite different.

5. Implementation issues

    With respect to implementation, different solutions can be considered. The conceptual definition of the refreshment process by means of a workflow naturally leads to envisioning an implementation under the control of a commercial workflow system, provided that the latter supplies the event types and all the features needed by the refreshment scenario. Another solution, which we have preferred in [BFM+ 98], consists in using active rules executed under a certain operational semantics. The rationale behind our choice is the flexibility and the evolvability provided by active rules. Indeed, the refreshment strategy is not defined once and for all; it may evolve with the user needs, which may result in changes to the definition of the materialized views or to the desired quality factors. It may also evolve when the actual values of the quality factors degrade with the evolution of the data warehouse feeding, or with the technology used to implement it. Consequently, in order to master the complexity and the evolution of the data warehouse, it is important to provide a flexible technology which can accommodate both. This is what active rules are meant to provide. A prototype has been developed and demonstrated in the context of the DWQ European research project on data warehouses. However, active rules cannot be considered as an alternative to the workflow representation: the workflow is a conceptual view of the refreshment process, while active rules are an operational implementation of the workflow.

6. Concluding remarks

    This paper has presented an analysis of the refreshment process in data warehouse applications. We have demonstrated that the refreshment process can be reduced neither to a view maintenance process nor to a loading process. We have shown through a simple example that the refreshment of a data warehouse can be conceptually viewed as a workflow process. We have identified the different tasks of the workflow and shown how they can be organized in different refreshment scenarios, leading



to different refreshment semantics. We have highlighted the design decisions impacting the refreshment semantics, and we have shown how these decisions may be related to quality factors such as data freshness and to constraints such as source availability and accessibility.

References

[AAE+ 95] Alonso, G., Agrawal, D., El Abbadi, A., Kamath, M., Gunther, R., Mohan, C., Advanced Transaction Models in Workflow Contexts, IBM Research Report RJ 9970, IBM Research Division, 1995.

[AGW 97] Adelberg, B., Garcia-Molina, H., Widom, J., The STRIP Rule System for Efficiently Maintaining Derived Data, Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, 1997.

[BeNe 98] Bernstein, P., Newcomer, E., Principles of Transaction Processing, Morgan Kaufmann Publishers, 1998.

[BFM+ 98] Bouzeghoub, M., Fabret, F., Matulovic, M., Simon, E., "A Toolkit Approach for Developing Efficient and Customizable Active Rule Systems", DWQ Technical Report, October 1998.

[BFM+ 99] Bouzeghoub, M., Fabret, F., Galhardas, H., Matulovic, M., Pereira, J., Simon, E., "Data Warehouse Refreshment", in Fundamentals of Data Warehouses, Chapter 4, M. Jarke et al. (eds.), Springer, 1999.

[BMF+ 97] Bouzeghoub, M., Fabret, F., Llirbat, F., Matulovic, M., Simon, E., "Designing Data Warehouse Refreshment Systems", DWQ Technical Report, October 1997.

[CGL+ 96] Colby, L., Griffin, T., Libkin, L., Mumick, I. S., Trickey, H., Algorithms for Deferred View Maintenance, Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, Canada, 1996.

[ChD 97] Chaudhuri, S., Dayal, U., An Overview of Data Warehousing and OLAP Technology, SIGMOD Record, Vol. 26, No. 1, March 1997.

[CCPP 95] Casati, F., Ceri, S., Pernici, B., Pozzi, G., Conceptual Modeling of Workflows, Proceedings of the International Conference on Object-Oriented and Entity-Relationship Modeling, Australia, Springer, 1995.

[CSCW 96] ACM 1996 Conference on Computer-Supported Cooperative Work – Cooperating Communities (Proceedings), Ackerman, M. S. (ed.), ACM, Boston, MA, November 1996.

[HaC 93] Hammer, M., Champy, J., Reengineering the Corporation: A Manifesto for Business Revolution, Harper, New York, 1993.

[JaV 97] Jarke, M., Vassiliou, Y., Foundations of Data Warehouse Quality: An Overview of the DWQ Project, Proceedings of the 2nd International Conference on Information Quality, Cambridge, Mass., 1997.

[Lawr 97] Lawrence, P. (ed.), Workflow Handbook, Wiley and WfMC, 1997.

[Sch 98] Schael, T., Workflow Management Systems for Process Organisations, Second edition, Springer, 1998.

[ThBo 99] Theodoratos, D., Bouzeghoub, M., Data Currency Quality Factors in Data Warehouse Design, Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, June 1999.

[WAN 98] Wang, R. Y., "A Product Perspective on Total Data Quality Management", Communications of the ACM, 1998.

[WoKr 93] Woetzel, G., Kreifelts, T., The Use of Petri Nets for Modeling Workflow with the Domino System, Proceedings of the Workshop on Computer Supported Cooperative Work, Petri Nets and Related Formalisms, Chicago, 1993.

[WFMC 95] WFMC-TC-1003 (The Workflow Management Coalition), The Workflow Reference Model, Version 1.1, January 1995.

[ZGJ+ 95] Zhuge, Y., Garcia-Molina, H., Hammer, J., Widom, J., View Maintenance in a Warehousing Environment, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 316-327, 1995.



