=Paper=
{{Paper
|id=Vol-19/paper-7
|storemode=property
|title=Modeling the Data Warehouse Refreshment Process as a Workflow Application
|pdfUrl=https://ceur-ws.org/Vol-19/paper6.pdf
|volume=Vol-19
|dblpUrl=https://dblp.org/rec/conf/dmdw/BouzeghoubFM99
}}
==Modeling the Data Warehouse Refreshment Process as a Workflow Application==
Mokrane Bouzeghoub(*)(**), Françoise Fabret(*), Maja Matulovic-Broqué(*)
(*) INRIA Rocquencourt, France
(**) Laboratoire PRiSM, Université de Versailles, France
Mokrane.Bouzeghoub@prism.uvsq.fr
Abstract

This article is a position paper on the nature of data warehouse refreshment, which is often defined as a view maintenance problem or as a loading process. We will show that the refreshment process is more complex than the view maintenance problem, and different from the loading process. We conceptually define the refreshment process as a workflow whose activities depend on the available products for data extraction, cleaning and integration, and whose coordination events depend on the application domain and on the required quality in terms of data freshness. Implementation of this process is clearly distinguished from its conceptual modelling.

1. Introduction

Data warehousing is a new technology which provides software infrastructure for decision support systems and OLAP applications. Data warehouses collect data from heterogeneous and distributed sources. This data is aggregated and then customized with respect to organizational criteria defined by OLAP applications. The data warehouse can be defined as a hierarchy of data stores which goes from source data to the highly aggregated data (data marts). Between these two extreme data stores, we can find different other stores depending on the requirements of OLAP applications. One of these stores is the operational data store, which reflects source data in a uniform and clean representation. The corporate data warehouse (CDW) contains highly aggregated data and can be organized into a multidimensional structure. Data extracted from each source can also be stored in intermediate data recipients. Obviously, this hierarchy of data stores is a logical way to represent the data flows which go from the sources to the data marts. All these stores are not necessarily materialized, and if they are, they can just constitute different layers of the same database.

Figure 1 shows a typical data warehouse architecture. This is a logical view whose operational implementation receives many different answers in the data warehousing products. Depending on each data source, extraction and cleaning can be done by the same wrapper or by distinct tools. Similarly, data reconciliation (also called multi-source cleaning) can be separated from or merged with data integration (multi-source operations). High-level aggregation can be seen as a set of computation techniques ranging from simple statistical functions to advanced data mining algorithms. Customisation techniques may vary from one data mart to another, depending on the way decision makers want to see the elaborated data.

The copyright of this paper belongs to the paper's authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage. The research presented in this paper is supported by the European Commission under the Esprit Program LTR project 'DWQ: Foundations of Data Warehouse Quality'.
Proceedings of the International Workshop on Design and
Management of Data Warehouses (DMDW'99)
Heidelberg, Germany, 14. - 15. 6. 1999
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/
M. Bouzeghoub, F. Fabret, M. Matulovic-Broqué 6-1
(Figure 1 diagram: DATA SOURCES feed EXTRACTION and CLEANING into the ODS; RECONCILIATION and INTEGRATION feed the CDW; HIGH-LEVEL AGGREGATION and CUSTOMISATION feed the DATA MARTS; META DATA spans the whole architecture.)
Figure 1: Data warehouse architecture
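The layered architecture of Figure 1 can be sketched as a chain of derivations, one per store. The following toy pipeline (names, filters and sample data are our own illustration, not from the paper) shows the direction of the data flow:

```python
# Toy pipeline mirroring the hierarchy of data stores in Figure 1:
# data sources -> (extraction/cleaning) -> ODS -> (aggregation) -> CDW
# -> (customisation) -> data marts.  Rows are (key, value) pairs.

def extract_and_clean(source_rows):
    """Preparation: drop malformed rows (here: None stands for dirty data)."""
    return [row for row in source_rows if row is not None]

def integrate(per_source_rows):
    """Reconciliation/integration: merge the cleaned sources into the ODS."""
    ods = []
    for rows in per_source_rows:
        ods.extend(rows)
    return ods

def aggregate(ods_rows):
    """High-level aggregation: sum per key, as a stand-in for the CDW views."""
    cdw = {}
    for key, value in ods_rows:
        cdw[key] = cdw.get(key, 0) + value
    return cdw

def customise(cdw, wanted_keys):
    """Customisation: a data mart is a user-specific slice of the CDW."""
    return {key: cdw[key] for key in wanted_keys if key in cdw}

sources = [[("a", 1), None, ("b", 2)], [("a", 3)]]
ods = integrate(extract_and_clean(s) for s in sources)
cdw = aggregate(ods)
mart = customise(cdw, ["a"])
```

In the paper's terms, none of these layers need be materialized: each function may just define another layer of the same database.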
The refreshment of a data warehouse is an important process which determines the effective usability of the data collected and aggregated from the sources. Indeed, the quality of data provided to the decision makers depends on the capability of the data warehouse system to convey, in a reasonable time, the changes made at the data sources from the sources to the data marts. Most of the design decisions are then concerned with the choice of data structures and update techniques that optimise the refreshment of the data warehouse.

There is considerable confusion in the literature concerning data warehouse refreshment. Indeed, this process is often either reduced to a view maintenance problem or confused with the data loading phase. Our purpose in this paper is to show that data warehouse refreshment is more complex than the view maintenance problem, and different from the loading process. We define the refreshment process as a workflow whose activities depend on the available products for data extraction, cleaning and integration, and whose triggering events depend on the application domain and on the required quality in terms of data freshness.

The objective of the following sections is to describe the refreshment process tasks and to demonstrate how they can be organised as a workflow. Section 2 argues about the differences between the refreshment process on the one side and data loading and view maintenance on the other side. Section 3 defines the generic workflow which logically represents the refreshment process, with examples of workflow scenarios. Section 4 defines the semantics of the refreshment process in terms of workflow design decisions. Section 5 concludes with a summary of the main ideas of this position paper and some implementation issues.

2. View maintenance, data loading and data refreshment

Data refreshment in data warehouses is generally confused with data loading as done during the initial phase, or with update propagation through a set of materialized views. Both analogies are wrong. The following paragraphs argue about the differences between data loading and data refreshment, and between view maintenance and data refreshment.

Data loading vs. data refreshment

The data warehouse loading phase consists in the initial data warehouse instantiation, that is the initial computation of the data warehouse content. This initial loading is globally a sequential process of four steps (Figure 2): (i) preparation, (ii) integration, (iii) high level aggregation and (iv) customisation. The
first step is done for each source and consists in data extraction, data cleaning and possibly data archiving before or after cleaning. Archiving data in a history can be used both for synchronisation purposes between sources having different access frequencies and for some specific temporal queries. The second step consists in data reconciliation and integration, that is, multi-source cleaning of data originating from heterogeneous sources, and derivation of the base relations (or base views) of the operational data store (ODS). The third step consists in the computation of aggregated views from base views. While the data extracted from the sources and integrated in the ODS is considered as ground data with very low level aggregation, the data in the corporate data warehouse (CDW) is generally highly summarised using aggregation functions. The fourth step consists in the derivation and customisation of the user views which define the data marts. Customisation refers to the various presentations needed by the users for multidimensional data.

(Figure 2 diagram: Preparation Phase — Data Extraction, Data Cleaning, History management; Integration Phase — Data Reconciliation, History management; then Update Propagation and Customization.)

Figure 2: Data loading activities

The main feature of the loading phase is that it constitutes the latest stage of the data warehouse design project. Before the end of the data loading, the data warehouse does not yet exist for the users. Consequently, there is no constraint on the response time. But, in contrast, with respect to the data sources, the loading phase requires more availability.

The data flow which describes the loading phase can serve as a basis to define the refreshment process, but the corresponding workflows are different. The workflow of the refreshment process is dynamic and can evolve with users' needs and with source evolution, while the workflow of the initial loading process is static and defined with respect to current user requirements and current sources.

The difference between the refreshment process and the loading process lies mainly in the following. First, the refreshment process may have a complete asynchronism between its different activities (preparation, integration, aggregation and customisation). Second, there may be a high level of parallelism within the preparation activity itself, each data source having its own availability window and its own strategy of extraction. The synchronization is done by the integration activity. Another difference lies in the source availability. While the loading phase requires a long period of availability, the refreshment phase should not overload the operational applications which use the data sources. Hence, each source provides a specific access frequency and a restricted availability duration. Finally, there are more constraints on response time for the refreshment process than for the loading process. Indeed, with respect to the users, the data warehouse does not exist before the initial loading, so the computation time is included within the design project duration. After the initial loading, the data becomes visible and should satisfy user requirements in terms of data availability, accessibility and freshness.

View maintenance vs. data refreshment

The propagation of changes during the refreshment process is done through a set of independent activities among which we find the maintenance of the views stored at the ODS and CDW levels. The view maintenance phase consists in propagating a certain change raised in a given source over a set of views stored at the ODS or CDW level. Such a phase is a classical materialized view maintenance problem except that, in data warehouses, the changes to propagate into the aggregated views are not exactly those that occurred in the sources, but the result of pre-treatments
performed by other refreshment activities such as data cleaning and multi-source data reconciliation.

The view maintenance problem has been intensively studied in the database research community. Major work done in this area is synthesized in [BFG+99] and [ThBo 99]. Most of the references focus on the problems raised by the maintenance of a set of materialized (also called concrete) views derived from a set of base relations when the current state of the base relations is modified. The main results concern:
• Self-maintainability: results concerning self-maintainability are generalized for a set of views: a set of views V is self-maintainable with respect to the changes to the underlying base relations if the changes may be propagated into every view in V without querying the base relations (i.e. the information stored in the concrete views plus the instance of the changes is sufficient to maintain the views).
• Coherent and efficient update propagation: various algorithms are provided to schedule update propagation through each individual view, taking care of the interdependencies between views, which may lead to possible inconsistencies. For this purpose, auxiliary views are often introduced to facilitate update propagation and to enforce self-maintainability.

Results about the self-maintainability of a set of views are of great interest in the data warehouse context, and it is commonly admitted that the set of views stored in a data warehouse has to be globally self-maintainable. The rationale behind this recommendation is that self-maintainability is a strong requirement imposed by the operational sources in order not to overload their regular activity.

As stated in the previous section, research on data warehouse refreshment has mainly focused on update propagation through materialized views. Many papers have been published on this topic, but very few are devoted to the whole refreshment process as defined before. We consider view maintenance as just one step of the complete refreshment process. Other steps concern data cleaning, data reconciliation, data customisation, and if needed data archiving. On the other hand, extraction and cleaning strategies may vary from one source to another, as well as update propagation, which may vary from one user view to another, depending for example on the desired freshness of data. So the data warehouse refreshment process cannot be limited to a view maintenance process.

To summarize the previous discussion, we can say that a refreshment process is a complex system which may be composed of asynchronous and parallel activities that need a certain monitoring. The refreshment process is an event-driven system which evolves frequently, following the evolution of data sources and user requirements. Users, data warehouse administrators and data source administrators may impose specific constraints such as, respectively, freshness of data, space limitation of the ODS or CDW, and access frequency to sources. There is no simple and unique refreshment strategy which is suitable for all data warehouse applications, for all data warehouse users, or for the whole data warehouse lifetime.

3. The refreshment process is a workflow

A workflow is a set of coordinated activities, which might be manual or automated activities performed by actors [Scha 98]. Workflow concepts have been used in various application domains such as business process modeling [HaCh 93], cooperative applications modeling [CSCW 96] [Lawr 97], and database transaction modeling [AAE+ 95] [Bern 98]. Depending on the application domain, activities and coordination are defined using appropriate specification languages such as state-chart diagrams and Petri nets [WoKr 93], or active rules [CCPP 95]. In spite of this diversity of applications and representations, most workflow users refer more or less to the concepts and terminology defined by the Workflow Management Coalition [WFMC 95]. Workflow systems are supposed to provide high-level flexibility to recursively decompose and merge activities, and to allow dynamic reorganization of the workflow process. These features are typically useful in the context of data warehouse refreshment, as the activities are performed by market products whose functionalities and scope differ from one product to another.

In the following subsections, we show how the refreshment process can be defined as a workflow application. We illustrate the interest of this approach by the ability to define different scenarios depending on user requirements, source constraints and data warehouse constraints. We show that these scenarios may evolve through time to fulfill
the evolution of any of the previous requirements and constraints.

3.1. The workflow of the refreshment process

The refreshment process aims to propagate the changes raised in the data sources to the data warehouse stores. This propagation is done through a set of independent activities (extraction, cleaning, integration, ...) that can be organized in different ways, depending on the semantics one wants to assign to the refreshment process and on the quality one wants to achieve. The ordering of these activities and the context in which they are executed define this semantics and influence this quality. Ordering and context result from the analysis of view definitions, data source constraints and user requirements in terms of quality factors. In the following subsections, we describe the refreshment activities and their organization as a workflow. Then we give examples of different workflow scenarios to show how refreshment may be a dynamic and evolving process. Finally, we summarize the different perspectives through which a given refreshment scenario should be considered.

Refreshment activities

The refreshment process is similar to the loading process in its data flow but, while the loading process is a massive feeding of the data warehouse, the refreshment process captures the differential changes held in the sources and propagates them through the hierarchy of data stores in the data warehouse. The preparation step extracts from each source the data that characterises the changes that have occurred in this source since the last extraction. As for loading, this data is cleaned and possibly archived before its integration. The integration step reconciles the source changes coming from multiple sources and adds them to the ODS. The aggregation step incrementally recomputes the hierarchy of aggregated views using these changes. The customisation step propagates the summarized data to the data marts. As for the loading phase, this is a logical decomposition whose operational implementation receives many different answers in the data warehouse products. This logical view allows a certain traceability of the refreshment process. Figure 3 shows the activities of the refreshment process as well as a sample of the coordinating events.
(Figure 3 diagram: Data Extraction, Data Cleaning, History management, Data Integration, History management, Update Propagation and Customization, coordinated by temporal/external events and by Before/After events such as After-Cleaning, Before-Integration, After-Integration, Before-Propagation, After-Propagation and Before-Customization.)
Figure 3: The generic workflow for the refreshment process
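The coordination style of Figure 3 can be illustrated with a toy event dispatcher in which each activity subscribes to the event types that trigger it and raises a termination event on completion. The event names loosely follow the figure; the dispatcher itself is our own sketch, not a mechanism proposed by the paper:

```python
# Toy event dispatcher for the refreshment workflow of Figure 3: each
# activity subscribes to its triggering events and raises a termination
# event, which in turn triggers the next activity in the chain.

subscriptions = {}        # event type -> activities triggered by it
trace = []                # observed event sequence

def on(event_type, activity):
    subscriptions.setdefault(event_type, []).append(activity)

def raise_event(event_type):
    trace.append(event_type)
    for activity in subscriptions.get(event_type, []):
        activity()

# Chain the activities through termination events (names are illustrative).
on("Temporal-event",    lambda: raise_event("After-Extraction"))
on("After-Extraction",  lambda: raise_event("After-Cleaning"))
on("After-Cleaning",    lambda: raise_event("After-Integration"))
on("After-Integration", lambda: raise_event("After-Propagation"))

raise_event("Temporal-event")   # one temporal event drives the whole chain
```

Replacing a subscription is all it takes to reorganize the chain, which is the dynamic-reorganization property claimed for workflow systems above.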
Coordination of activities

In workflow systems, activities are coordinated by control flows which may be notifications of process commitment, emails issued by agents, temporal events, or any other trigger events. In the refreshment process, coordination is done through a wide range of event types.

We can distinguish several event types which may trigger and synchronize the refreshment activities. They might be temporal events, termination events (dashed lines in Figure 3) or any other user-defined events. Depending on the refreshment scenario, one can choose an appropriate set of event types which allows the correct level of synchronization to be achieved.

Activities of the refreshment workflow are not executed as soon as they are triggered; they may depend on the current state of the input data stores. For example, if the extraction is triggered periodically, it is actually executed only when there are effective changes in the source log file. If the cleaning process is triggered immediately after the extraction process, it is actually executed only if the extraction process has gathered some source changes. Consequently, the state of the input data store of each activity may be considered as a condition to effectively execute this activity.

Within the workflow which represents the refreshment process, activities may be of different origins and different semantics; the refreshment strategy is logically considered as independent of what the activities actually do. However, at the operational level, some activities can be merged (e.g., extraction and cleaning), and some others decomposed (e.g. integration). The flexibility claimed for workflow systems should make it possible to dynamically tailor the refreshment activities and the coordinating events.

There may be another way to represent the workflow and its triggering strategies. Indeed, instead of considering external events such as temporal events or termination events of the different activities, we can consider data changes as events. Hence, each input data store of the refreshment workflow is considered as an event queue that triggers the corresponding activity. However, to be able to represent different refreshment strategies, this approach needs a parametric synchronization mechanism which allows the activities to be triggered at the right moment. This can be done by introducing composite events which combine, for example, data change events and temporal events. Another alternative is to put locks on data stores and remove them after an activity or a set of activities decides to commit. In the case of a long-term synchronization policy, as may sometimes happen in some data warehouses, this latter approach is not sufficient.

The workflow agents

Two main agent types are involved in the refreshment workflow: human agents who define requirements, constraints and strategies, and computer agents which process activities. Among human agents we can distinguish users, the data warehouse administrator and source administrators. Among computer agents, we can mention source management systems, database systems used for the data warehouse and data marts, wrappers and mediators. For simplicity, agents are not represented in the refreshment workflow, which concentrates on the activities and their coordination.

3.2. Defining refreshment scenarios

To illustrate different workflow scenarios, we consider the following example which concerns three national Telecom billing sources represented by three relations S1, S2, and S3. Each relation has the same (simplified) schema: (#PC, date, duration, cost). An aggregated view V with schema (avg-duration, avg-cost, country) is defined in a data warehouse from these sources as the average duration and cost of a phone call in each of the three countries associated with the sources, during the last 6 months. We assume that the construction of the view follows the steps explained before. During the preparation step, the data of the last six months contained in each source is cleaned (e.g., all cost units are translated into Euros). Then, during the integration phase, a base relation R with schema (date, duration, cost, country) is constructed by unioning the data coming from each source and generating an extra attribute (country). Finally, the view is computed using aggregates (Figure 4).

We can define another refreshment scenario with the same sources and similar views. This scenario mirrors the average duration and cost for each day instead of for the last six months. This leads to changing the frequency of extraction, cleaning, integration and propagation. Figure 5 gives such a
possible scenario. The frequencies of source extractions are those allowed by the source administrators. Source 3 is permanently available.
(Figure 4 diagram: S1 Data Extraction every end of trimester, S2 Data Extraction every end of month, S3 Data Extraction; S1 and S2 Data Cleaning after extraction; S2 History management after archiving; Data Integration every beginning of trimester; then Update Propagation, History management and Customization chained by After-Integration, Before-Customization and Before-QueryEvaluation events.)
Figure 4: First example of refreshment scenario
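The view construction of this first scenario can be sketched in a few lines; here Python stands in for the relational operators, and the sample rows and country names are our own illustration:

```python
# Sketch of scenario 1: sources S1, S2, S3 with schema (#PC, date,
# duration, cost) are unioned into R (date, duration, cost, country),
# from which V (avg-duration, avg-cost, country) is computed.

S = {  # sample rows, one source per country (data invented for the sketch)
    "France":  [(1, "1999-01-03", 10, 2.0), (2, "1999-02-07", 20, 4.0)],
    "Germany": [(3, "1999-01-10", 30, 6.0)],
    "Italy":   [(4, "1999-03-01", 40, 8.0), (5, "1999-03-02", 60, 12.0)],
}

# Integration: union the (already cleaned) sources, generating the
# extra attribute country.
R = [(date, duration, cost, country)
     for country, rows in S.items()
     for (_pc, date, duration, cost) in rows]

# Aggregation: average duration and cost of a call, per country.
groups = {}
for date, duration, cost, country in R:
    durations, costs = groups.setdefault(country, ([], []))
    durations.append(duration)
    costs.append(cost)
V = {country: (sum(d) / len(d), sum(c) / len(c))
     for country, (d, c) in groups.items()}
```

The refreshment semantics discussed later is invisible here: nothing in this computation says how often R and V are recomputed, which is precisely the paper's point.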
(Figure 5 diagram: S1 Data Extraction every 3 hours, S2 Data Extraction every hour, S3 Data Extraction; S1 and S2 Data Cleaning after extraction; Data Integration, Update Propagation and Customization chained by Before-Integration, Before-Propagation, Before-Customization and Before-QueryEvaluation events.)
Figure 5: Second example of refreshment scenario
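The two scenarios differ only in their triggering frequencies. A toy simulation of the second scenario's schedule follows; the time unit, the six-hour horizon and the change instants for S3 are our own assumptions (the figure gives no explicit temporal event for S3, so we model it as change-driven):

```python
# Toy simulation of the second scenario's extraction schedule: S1 every
# 3 hours, S2 every hour; S3 (permanently available) is modelled here as
# change-driven, which is our assumption for the sketch.

frequencies = {"S1": 3, "S2": 1}         # period in hours
s3_change_hours = [2, 5]                 # invented change instants for S3
extractions = {"S1": 0, "S2": 0, "S3": 0}

for hour in range(1, 7):                 # simulate a six-hour horizon
    for source, period in frequencies.items():
        if hour % period == 0:           # temporal (periodic) trigger
            extractions[source] += 1
    if hour in s3_change_hours:          # change-driven trigger
        extractions["S3"] += 1
```

Changing a scenario thus amounts to changing the entries of a schedule, not the activities themselves.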
When the refreshment activities are long-term activities, or when the DWA wants to apply validation procedures between activities, temporal events or activity terminations can be used to synchronize the whole refreshment process. In general, the quality requirements may impose a certain synchronization strategy. For example, if users desire high freshness for data, this means that each update in a source should be mirrored as soon as possible in the views. Consequently, this determines the strategy of synchronization: trigger the extraction after each change in a source; trigger the integration, when semantically relevant, after the commit of each data source; propagate changes through views immediately after integration; and customize the user views in the data marts.

Refreshment scheduling

The refreshment process can be viewed through different perspectives:

• Client-driven refreshment, which describes the part of the process which is triggered on demand by the users. This part mainly concerns update propagation from the ODS to the aggregated views. The on-demand strategy can be defined for all aggregated views or only for those for which the freshness of data is related to the date of querying.

• Source-driven refreshment, which defines the part of the process which is triggered by changes made in the sources. This part concerns the preparation phase. The independence between sources can be used as a way to define different preparation strategies, depending on the sources. Some sources may be associated with cleaning procedures, others not. Some sources need a history of the extracted data, others not. For some sources, the cleaning is done on the fly during the extraction, for some others after the extraction or on the history of these changes. The triggering of the extraction may also be different from one source to another. Different events can be defined, such as temporal events (periodic or fixed absolute time), after each change detected on the source, or on demand from the integration process.
• ODS-driven refreshment, which defines the part of the process which is automatically monitored by the data warehouse system. This part concerns the integration phase. It may be triggered at a synchronization point defined with respect to the ending of the preparation phase. Integration can be considered as a whole and concern all the source changes at the same time. In this case, it can be triggered by an external event which might be a temporal event or the ending of the preparation phase of the last source. The integration can also be sequenced with respect to the termination of the preparation phase of each source, that is, each extraction is integrated as soon as its cleaning is finished. The ODS can also monitor the preparation phase and the aggregation phase by generating the relevant events that trigger the activities of these phases.

In the simplest case, one of the first two approaches is used as a single strategy. In a more complex case, there may be as many strategies as the number of sources or high-level aggregated views. In between, there may be, for example, four different strategies corresponding to the previous four phases. For some given user views, one can apply the client-driven strategy (pull strategy), while for other views one can apply the ODS-driven strategy (push strategy). Similarly, some sources are solicited through a pull strategy while others apply a push strategy.

The strategy to choose depends on the semantic parameters but also on the tools available to perform the refreshment activities (extraction, cleaning, integration). Some extraction tools also do the cleaning on the fly, while some integrators immediately propagate changes up to the high-level views. Hence, the generic workflow in Figure 3 is a logical view of the refreshment process. It shows the main identified activities and the potential event types which can trigger them.

4. Semantics of the refreshment process

As we have seen in the previous examples of scenarios, the view definition is not sufficient to fix the semantics of the refreshment process. Indeed, the query which defines a view does not specify whether this view operates on a history or not, how this history is sampled, whether the changes of a given source should be integrated each hour or each week, and which data timestamp should be taken when integrating changes of different sources. The view definition does not include the specific filters defined in the cleaning process, such as choosing the same measure for certain attributes, rounding the values of some attributes, or eliminating some confidential data. Consequently, based on the same view definitions, the refreshment process may produce different results depending on all these extra parameters, which have to be fixed independently, outside the queries which define the views.

The result of a query against view V occurring at time t depends on two main parameters associated with the refreshment strategy implemented by the data warehouse. First, it depends on the change extraction capabilities of each source. For instance, changes in source S1 can be extracted as soon as they occur, while changes in source S2 can be captured only during the last night of the month. This determines the availability of the changes from a source, and hence impacts the data freshness. It also impacts the data coherence, because time discrepancies may occur in the view: the average may incorporate fresh data from S1 and old data from S2. Second, it depends on the time needed to compute the change to the view from the changes to the sources.

In fact, the two previous parameters may be repeated as many times as there are intermediate storages between the sources and the view. For instance, suppose that the result of the preparation step is stored. The availability parameter characterizes the moment at which the integration process is capable of accessing the result of a preparation step. Thus, if each result is only available at the end of the month, then the integration can only be performed at that time, and the view will consequently reflect changes that occurred in the sources only once per month.

Another parameter influences the result of a query against V. It characterizes the actualization of the data contained in each source. For instance, source S1 can be updated at the end of every week while source S2 is updated two days before the end of every month. If a query is posed against V at the end of the second week of a month, the effect of the phone calls that occurred since the beginning of the month in the country associated with source S2 will possibly not be reflected by V, and hence by the result of the query. Thus, the value of this parameter determines the difference that may exist between the state of the view reflected by the data warehouse and the state of the view in the real world. Because this
parameter is fixed and out of the control of the data warehouse application (it is actually part of the source operational applications), we do not consider it.

The previous discussion has shown how the refreshment process can depend on some parameters, independently of the choice of materialized views, and how these parameters impact the semantics of the process. It also shows that building an efficient refreshment strategy with respect to application requirements (e.g. data freshness, computation time of queries and views, data accuracy) depends on various parameters related to:
• source constraints (e.g. availability windows, frequency of change), and
• data warehouse system limits (e.g. storage space limit, functional limits).

Finally, the main lesson drawn from the previous examples and discussion is:

The operational semantics of the refreshment process can be defined as the set of all design decisions that contribute to providing the users with relevant data, fulfilling the quality requirements.

Some of these design decisions are inherited from the design of the initial loading; others are specific to the refreshment itself. The first design decisions inherited from the design of the initial loading concern the view definitions and the structure of the data flow between the sources and the data marts. The second design decisions inherited from the design of the initial loading are the semantics of the loading activities, that is, cleaning rules, integration rules, etc.

The design decisions which are specific to the refreshment semantics are those that determine:
• the moment when each refreshment task takes place in the global process,
• the way the different refreshment tasks are synchronized,
• the way the shared data is made visible to the corresponding tasks.

These design decisions are specified by defining:
• The decomposition of the refreshment process into elementary tasks (e.g. cleaning of some specific source, partial integration of two given changes originated from two different sources, detection and cleaning in a unique task for another source).
• The ordering of these tasks.
• The events initiating the tasks. The events set the rhythm of the refreshment process and, depending on this rhythm, the freshness and the accuracy of the data may be quite different.

5. Implementation issues

With respect to implementation, different solutions can be considered. The conceptual definition of the refreshment process as a workflow naturally leads to an implementation under the control of an off-the-shelf workflow system, provided that the latter supplies the event types and all the features needed by the refreshment scenario. Another solution, which we have preferred in [BFM+ 98], consists in using active rules executed under a certain operational semantics. The rationale behind our choice is the flexibility and evolvability provided by active rules. Indeed, the refreshment strategy is not defined once and for all; it may evolve with the user needs, which may result in changes to the definition of materialized views or to the desired quality factors. It may also evolve when the actual values of the quality factors degrade with the evolution of the data warehouse feeding or with the technology used to implement it. Consequently, in order to master the complexity and the evolution of the data warehouse, it is important to provide a flexible technology which can accommodate this complexity and evolution. This is what active rules are meant to provide. A prototype has been developed and demonstrated in the context of the DWQ European research project on data warehouses. However, active rules cannot be considered as an alternative to the workflow representation: the workflow is a conceptual view of the refreshment process, while active rules are an operational implementation of the workflow.

6. Concluding remarks

This paper has presented an analysis of the refreshment process in data warehouse applications. We have demonstrated that the refreshment process can be reduced neither to a view maintenance process nor to a loading process. We have shown through a simple example that the refreshment of a data warehouse can be conceptually viewed as a workflow process. We have identified the different tasks of the workflow and shown how they can be organized into different refreshment scenarios, leading
to different refreshment semantics. We have highlighted design decisions impacting the refreshment semantics, and we have shown how these decisions may be related to some quality factors, such as data freshness, and to some constraints, such as source availability and accessibility.

References

[AAE+ 95] Alonso, G., Agrawal, D., El Abbadi, A., Kamath, M., Günthör, R., Mohan, C., Advanced Transaction Models in Workflow Context, IBM Research Report RJ 9970, IBM Research Division, 1995.

[AGW 97] Adelberg, B., Garcia-Molina, H., Widom, J., The STRIP Rule System for Efficiently Maintaining Derived Data, Proc. of the ACM SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, USA, 1997.

[BeNe 98] Bernstein, P., Newcomer, E., Principles of Transaction Processing, Morgan Kaufmann Publ., 1998.

[BFM+ 98] Bouzeghoub, M., Fabret, F., Matulovic, M., Simon, E., "A Toolkit Approach for Developing Efficient and Customizable Active Rule Systems", DWQ Technical Report, October 1998.

[BFM+ 99] Bouzeghoub, M., Fabret, F., Galhardas, H., Matulovic, M., Pereira, J., Simon, E., "Data Warehouse Refreshment", in Fundamentals of Data Warehouses, Chapter 4, M. Jarke et al. (eds.), Springer, 1999.

[BMF+ 97] Bouzeghoub, M., Fabret, F., Llirbat, F., Matulovic, M., Simon, E., "Designing Data Warehouse Refreshment Systems", DWQ Technical Report, October 1997.

[CGL+ 96] Colby, L., Griffin, T., Libkin, L., Mumick, I. S., Trickey, H., Algorithms for Deferred View Maintenance, Proc. of the ACM SIGMOD Int. Conf. on Management of Data, Montreal, Canada, 1996.

[ChD 97] Chaudhuri, S., Dayal, U., An Overview of Data Warehousing and OLAP Technology, SIGMOD Record, Vol. 26, No. 1, March 1997.

[CCPP 95] Casati, F., Ceri, S., Pernici, B., Pozzi, G., Conceptual Modeling of Workflows, Proc. of the Int. Conf. on Object-Oriented and Entity-Relationship Approach, Australia, Springer, 1995.

[CSCW 96] ACM 1996 Conference on Computer-Supported Cooperative Work – Cooperating Communities (Proceedings), Ackerman, M.S. (ed.), ACM, Boston, MA, November 1996.

[HaC 93] Hammer, M., Champy, J., Reengineering the Corporation: A Manifesto for Business Revolution, Harper, New York, 1993.

[JaV 97] Jarke, M., Vassiliou, Y., Foundations of Data Warehouse Quality: An Overview of the DWQ Project, Proc. of the 2nd Int. Conference on Information Quality, Cambridge, Mass., 1997.

[Lawr 97] Lawrence, P. (ed.), Workflow Handbook, Wiley and WfMC, 1997.

[Sch 98] Schael, T., Workflow Management Systems for Process Organisations, Second edition, Springer, 1998.

[ThBo 99] Theodoratos, D., Bouzeghoub, M., Data Currency Quality Factors in Data Warehouse Design, Proc. of the Int. Workshop on Design and Management of Data Warehouses (DMDW'99), Heidelberg, Germany, June 1999.

[WAN 98] Wang, R. Y., "A Product Perspective on Total Data Quality Management", Communications of the ACM, Vol. 41, No. 2, 1998.

[WoKr 93] Woetzel, G., Kreifelts, T., The Use of Petri Nets for Modeling Workflow with the Domino System, Proc. of the Workshop on Computer Supported Cooperative Work, Petri Nets and Related Formalisms, Chicago, 1993.

[WFMC 95] WFMC-TC-1003 (The Workflow Management Coalition), The Workflow Reference Model, version 1.1, January 1995.

[ZGJ+ 95] Zhuge, Y., Garcia-Molina, H., Hammer, J., Widom, J., View Maintenance in a Warehousing Environment, Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 316-327, 1995.