=Paper= {{Paper |id=Vol-256/paper-13 |storemode=property |title=Data Warehouse Evolution Framework |pdfUrl=https://ceur-ws.org/Vol-256/submission_4.pdf |volume=Vol-256 |dblpUrl=https://dblp.org/rec/conf/syrcodis/Solodovnikova07 }} ==Data Warehouse Evolution Framework== https://ceur-ws.org/Vol-256/submission_4.pdf
                     Data Warehouse Evolution Framework

                                             © Darja Solodovnikova
                          Department of Computer Science, University of Latvia
                                      darja.solodovnikova@lu.lv


                       Abstract                              necessity to adapt data warehouse is given. In Section 3
                                                             the related work is presented. In Section 4 the proposed
    In this paper a data warehouse framework that            data warehouse evolution framework that supports
    supports data warehouse evolution is                     changes in data sources and propagates them to the data
    presented. The framework is able to handle not           warehouse is discussed. We conclude with directions
    only changes in data sources, but also direct            for future work in Section 5.
    changes in a data warehouse schema. In the
    framework the data warehouse versions are                2 Motivating Example
    supported in the development environment as
    well as in reports in the user environment.                  Data warehouse schema evolves frequently when
                                                             business requirements are changed or extended or a
1 Introduction                                               schema is adapted after changes in data sources.
                                                                 As an example, let us consider a data warehouse that
    Data warehouses integrate information from various       stores information about students’ activities in a
distributed and autonomous data sources that can             learning management system (LMS). This data
change in the course of time. Therefore a data               warehouse contained one fact table with measures: hits
warehouse has to be adaptable to any changes that can        and time, which records the duration of students’
happen in underlying data sources. Besides business          activity. These measures could be analyzed by the used
requirements often change at the client level. That can      course, a tool in this course and time, when the activity
cause changes to the data warehouse model. All these         occurred. The activity of all students was summarized.
changes in data sources or business requirements can             During the operation of the aforementioned data
invalidate existing schemata and data extraction,            warehouse the users complained that the information
transformation and loading (ETL) processes of the data       available in it is insufficient because the existing
warehouse. This is why these changes need to be              scheme did not satisfy the desirable granularity.
handled properly. In many cases the existing data            Besides, it was decided to store also data about the
warehouse can be adapted to changes.                         activities of lecturers of courses. Therefore, the new
    Simple adaptation of the data warehouse schema can       dimensions that describe the particular user and his or
cause a loss of history when some previously available       her role in a course were created.
data structures are deleted. To solve problems of history        To solve the evolution problems, the administrator
losses, it is necessary to keep data warehouse versions.     had to create a new data warehouse schema and ETL
Schema versioning means that a change in the data            processes. It required much time and resources, but
warehouse schema creates a new schema version that is        finally the second data warehouse version was created.
assigned a timestamp or other user-defined identifier.       But still there was an open question how to automate
    In this paper a data warehouse framework is              the data warehouse adaptation and how users can work
discussed. The framework supports data warehouse             with two data warehouse versions, because traditional
schema evolution that can happen for different reasons,      reporting tools and query languages do not support the
including cases when schemas of data sources are             concurrent work with many schema versions.
changed. The supported changes are insertion, deletion
and renaming of a source relation, insertion, deletion,      3 Related work
renaming and change of a type of a source relation
attribute. The proposed framework not only automates             In the literature there are various solutions for the
the evolution of a data warehouse schema or creation of      data warehouse evolution problems, which are the data
a new version, but also allows to adapt ETL processes        warehouse adaptation after the changes in source data
and existing reports on a data warehouse schema.             and schemata as well as business requirements. In [1]
    The rest of this paper is organized as follows. In       the primitive evolution operations that occur over the
Section 2 the motivating example that demonstrates the       data warehouse schema are defined. The necessary
                                                             adaptation activities of the data warehouse schema and
Proceedings of the Spring Young Researcher's Colloquium On   instances are formally specified for each operation. This
Database and Information Systems SYRCoDIS, Moscow, Russia,   paper only considers changes raised by alterations of
2007                                                         business requirements.
    In [15] the existing techniques for schema evolution    warehouse version. Issues related to queries to a
are integrated in the new quality-oriented framework.       multiversion data warehouse are considered in [14].
The author proposes schema evolution operations (e.g.            In [17] the definition of a multidimensional schema
attribute insertion or deletion) and describes quality      that supports schema versioning is given. This
factors that they affect.                                   definition is very similar to the one given in [2], the
    In [10] the evolution operations that change the data   difference is that the first one supports versioning. The
warehouse schema are considered. For each operation,        version evolution operations that result in versioning of
the formal semantics of the changes for star and            the data warehouse schema are formalized.
snowflake schemata are given.                                    Structural and content changes in dimensions of a
    The above mentioned papers do not address the           data warehouse are discussed in [9]. A
problems of the data warehouse adaptation after             multidimensional model and its’ instances are defined.
changes in data sources. One of the approaches for          Dimension structural and instance update operators are
solving these problems is adaptation of the data            formally specified and their effect is studied over
warehouse schema and ETL processes. In [11] the             materialized views over dimension levels.
author proposes the solution, which is based on the              In [13] a temporal multidimensional data model is
transformation of schemata of data sources by               proposed, which allows to track history of dimension
transformation primitives. When the data source schema      updates. In the model elements of dimension schemas
changes the information in the transformation               and/or instances are assigned the timestamp when they
specification is used to adapt the data warehouse           were the part of a dimension. The query language
schema and ETL processes.                                   TOLAP is also presented that supports queries over the
    In [19] the authors consider mapping adaptation         proposed data model both over data and metadata.
after the changes in data source schemata. Here                  In [4,5] a method to support data and structure
mapping specifies how data instances of one schema          versions of dimensions is proposed. The method allows
correspond to data instances of another. The authors        tracking history and comparing data, using temporal
propose the algorithm that detects mappings affected by     modes of presentation that is data mapping into the
changes in data sources and generates rewritings that       particular structure version. The authors define the
are consistent with the semantics of the mapped             conceptual model based on the multiversion fact table.
schemata.                                                        The above mentioned papers consider only one kind
    In the paper [12] both as view approach for data        of evolution problems, for example, changes in a
integration is proposed. The basis of this approach is      schema of a data warehouse raised by evolving business
schema transformation primitives, which specify how         requirements, adaptation of a data warehouse after
the global schema is obtained from local schemata.          changes in data sources or data warehouse versioning
From these transformations it is also possible to infer,    and querying multiversion data warehouse. In our
how local schemata can be obtained from the global          approach we propose the framework that is able to solve
schema. The evolution of local and global schemata is       all these kinds of evolution problems.
also discussed. For schema adaptation, similarly as in
the previously mentioned papers, the information about      4 Data Warehouse Evolution Framework
transformations is used.
    In many papers [3,6,16] a data warehouse is defined         To support the data warehouse adaptation after
as a set of materialized views over data sources. These     changes in source schemata and versioning, we propose
papers study the problems of how to rewrite a view          the data warehouse framework depicted in Figure 1.
definition and adapt view extent after changes in source
                                                            4.1 Components of the Framework
data and schemata. In [16] the authors study reasons for
schema changes and possible data warehouse adaptation           The framework is composed of the development
issues for dynamic sources. The framework of the            environment and user environment. In the development
evolvable view environment is presented, which adapts       environment the data warehouse metadata repository
view definition and extent after changes in data sources.   and other components, which will be described later,
    Several authors [7,8,20] propose the data warehouse     are located and ETL processes and change processing is
schema versioning approach to solve the problems of         conducted. In the user environment reports on one or
schema evolution. In [8] the authors propose to store       several data warehouse versions are defined and
augmented schemata together with schema versions.           executed.
When schema changes occur, firstly the new schema               The basic element of the development environment
version is produced and then, for the previous versions,    is the adaptation component that processes changes in
augmented schemata are created and populated with           relations and attributes of source schemata, identifies
data.                                                       the potential changes in a data warehouse and possible
    In [7] the metadata model that supports schema          new versions, adapts a data warehouse schema or
versioning for data warehouses is introduced. Metadata      creates a new version according to the choice of the
management solutions in a multiversion data warehouse       data warehouse administrator, creates the necessary
are also proposed in [20], where one of the discussed       version metadata and adapts ETL processes. To realize
issues is metadata support for detection of changes in      this functionality, the adaptation component uses data
sources and propagation of them to the designated data      from the metadata repository.
                                                          transportation procedure transfers data warehouse data
                                                          from the development environment into the user
                                                          environment and version metadata into the reporting
                                                          metadata repository.
                                                              In addition to the data warehouse, in the user
                                                          environment there is also the reporting metadata
                                                          repository that contains the version metadata, which are
                                                          transferred from the mapping repository of the
                                                          development environment, and the reporting metadata,
                                                          which are created by a data warehouse developer by the
                                                          reports definition tool and are used by the reporting tool
                                                          for generation of reports. Data warehouse users work
                                                          with the reporting tool that allows to define ad-hoc
                                                          queries, display reports as tables and graphs and analyze
                                                          data using hierarchies.
                                                              Using links between data warehouse versions in the
                                                          metadata repository, the reporting tool can run queries
                                                          on multiple data warehouse schema versions or one
                                                          version. In case of many versions, a user can choose
                                                          which version will be used to display results of a query.
                                                              In the user environment an access mechanism is also
                                                          implemented. It is the metadata that defines which
                                                          reports can be used by the particular user. These
                                                          metadata are used by the reporting tool. The access
        Fig. 1. Data Warehouse Evolution Framework        mechanism is set by the developer by the reports
                                                          definition tool.
    The metadata management tool that incorporates the
graphical user interface client tool is used by           4.2 Framework Operation
administrator or developer to design a data warehouse         The proposed framework is able to handle source
schema and specify ETL processes. The metadata            changes that can influence a data warehouse as well as
management tool maintains the static part of the          other changes of a data warehouse schema. If the data
mapping repository of the metadata repository, where      warehouse schema is changed by the administrator then
the metadata of the last data warehouse version and       all changes are conducted by the metadata management
mappings, which define the logics of ETL processes,       tool, which allows to create a new data warehouse
are stored. In addition the mapping repository includes   version or alter an old version. The metadata of ETL
also three another parts. In the adaptation part the      processes in the mapping repository are adapted
adaptation component stores information about             according to a new data warehouse version.
dependencies of data warehouse elements and source            The source changes are processed before the
elements used for a data warehouse adaptation. The        execution of ETL processes in the development
version control mechanism defines rules for necessity     environments. Initially the adaptation component
and logics of creation of a new data warehouse version.   analyzes the changes in the source change repository
The version metadata stores information about data        and detects changes that affect a data warehouse
warehouse versions, which is necessary for definition     schema and ETL processes. The adaptation component
and execution of reports, including links between         processes these changes using data from the adaptation
different versions. The detailed description of the       part of the mapping repository and the version control
metadata repository as well as the operation of the       mechanism and, for each change, generates solutions
adaptation component with adaptation options for each     that create a new data warehouse version or adapt the
change and presentation of the implemented prototype      data warehouse schema and ETL processes. The
of the adaptation framework is found in the paper [18].   administrator is informed about all changes and their
    The metadata repository includes also the data        adaptation and version options. The administrator
warehouse change repository, which accumulates the        chooses the most suitable solutions that must be
potential changes of a data warehouse schema and          implemented according to the business requirements.
version creation options. The administrator chooses the       If the administrator decides to create a new data
most suitable options that are applied. Special agents    warehouse version, the adaptation component changes
are incorporated into data sources. These agents track    the version metadata in the mapping repository to
changes in source schemata and accumulate them in the     reflect the new data warehouse version. If the
source change repository.                                 administrator chooses to conduct adaptation without
    ETL processes are generated by the metadata           creation of a new version, the adaptation component
deployment tool that uses the metadata from the static    does not need to change the version metadata.
part of the mapping repository. The data warehouse            The adaptation component adjusts the metadata of
loader executes generated ETL scripts. The data           the data source, data warehouse and specification of
ETL processes in the adaptation part of the mapping        [5] M. Body, M. Miquel, Y. Bedard, A. Tchounikine.
repository according to the chosen solutions. The              Handling Evolutions in Multidimensional
adaptation component also creates a new data                   Structures. In Proc. of the 19th Intl. Conference on
warehouse version or adapts the existing data                  Data Engineering, Bangalore, India (2003)
warehouse directly in the database. The adaptation         [6] S. Chen, X. Zhang, E.A. Rundensteiner. A
component generates the metadata deployment script,            Compensation-based Approach for Materialized
which is executed by the metadata deployment tool that         View Maintenance in Distributed Environments. In
generates the executable ETL process script. The ETL           Computer Science Technical Report, Worcester
process is executed by the data warehouse loader.              Polytechnic Institute, Worcester, MA, USA (2004)
    When a data warehouse schema is changed, the           [7] J. Eder, C. Koncilia, T. Morzy. The COMET
reporting and version metadata, which describe a data          Metamodel for Temporal Data Warehouses. In
warehouse, in the reporting metadata repository in the         LNCS, Vol. 2348, Springer-Verlag, (2002) 83–99
user environment become inadequate to a changed            [8] M. Golfarelli, J. Lechtenbörger, S. Rizzi, G.
schema or new schema version and reports on a data             Vossen. Schema Versioning in Data Warehouses.
warehouse can not run any more. Therefore, during the          In LNCS, Vol. 3289, (2004) 415–428
data transfer from the development environment into        [9] C.A. Hurtado, A. O. Mendelzon, A.A. Vaisman.
the user environment the data transportation procedure         Maintaining Data Cubes under Dimension Updates.
updates also the reporting and version metadata to             In Proc. of the 15th Intl. Conference on Data
reflect the new data warehouse schema.                         Engineering, Washington, DC (1999) 346-357
                                                           [10] C.E. Kaas, T.B. Pedersen, B.D. Rasmussen.
5 Conclusions and Future Work                                  Schema Evolution for Stars and Snowflakes. In
                                                               Proc. of the 6th Intl. Conference ICEIS, Porto,
    We proposed the data warehouse evolution                   Portugal (2004) 425-433
framework. This framework was developed by                 [11] A. Marotta. Data Warehouse Design and
expanding the data warehouse adaptation framework              Maintenance through Schema Transformations,
[18], which was previously designed and implemented.           Master thesis, Universidad de la República
The adaptation framework could automatically detect            Uruguay, (2000).
changes in schemata of data sources and adapt a data       [12] P. McBrien, A. Poulovassilis. Data Integration by
warehouse schema and ETL processes, according to the           Bi-Directional Schema Transformation Rules. In
administrator’s decision.                                      Proc. of the 19th Intl. Conference ICDE,
    Unlike the adaptation framework, the evolution             Bangalore, India (2003) 227-238
framework is able to handle not only changes in data       [13] A.O. Mendelzon, A.A. Vaisman. Temporal Queries
sources, but also direct changes in a data warehouse           in OLAP. In Proc. of the 26th Intl. Conference
schema. The second important difference is the fact that       VLDB, Cairo (2000) 242-253
in the evolution framework the data warehouse versions     [14] T. Morzy, R. Wrembel. On Querying Versions of
are supported in the development environment as well           Multiversion Data Warehouse. In Proc. of the 7th
as in reports in the user environment.                         ACM Intl. Workshop DOLAP, Washington, DC,
    The proposed framework differs from other                  USA (2004) 92–101
solutions of data warehouse evolution problems             [15] C. Quix. Repository Support for Data Warehouse
presented in the literature by the fact that it supports       Evolution. In Proc. of the Intl. Workshop DMDW,
many evolution problems at once, not just one problem.         Heidelberg, Germany (1999)
    Further it is planned to transform the developed       [16] E.A. Rundensteiner, A. Koeller, X. Zhang.
prototype of the adaptation framework to conform to            Maintaining Data Warehouses over Changing
the aforementioned evolution framework.                        Information Sources. In Communications of the
                                                               ACM, Vol. 43, New York, NY, USA (2000) 57-62
References                                                 [17] M.K. Shahzad, J.A. Nasir, M.A. Pasha. CEV-DW:
                                                               Creation and Evolution of Versions in Data
[1] M. Blaschka. FIESTA: A Framework for Schema
                                                               Warehouse. In Asian Journal of Information
    Evolution in Multidimensional Databases, PhD
                                                               Technology, 4(10) (2005) 910-917
    thesis, Technische Universitat Munchen, Germany
                                                           [18] D. Solodovnikova, L. Niedrite. Data Warehouse
    (2000)
                                                               Adaptation after the Changes in Source Schemata.
[2] M. Blaschka, C. Sapia, G. Hofling. On Schema
                                                               In Proc. of the 7th Intl. Baltic Conference on
    Evolution in Multidimensional Databases. In LNCS,
                                                               Databases and Information Systems, Vilnius,
    Vol. 1676, Springer-Verlag (1999) 153–164
                                                               Lithuania (2006) 52-63
[3] Z. Bellahsene. Schema Evolution in Data
                                                           [19] A.A. Vaisman, A.O. Mendelzon, W. Ruaro, S.G.
    Warehouses. In Knowledge and Information
                                                               Cymerman. Supporting Dimension Updates in an
    Systems, 4, Springer-Verlag, (2002) 283-304
                                                               OLAP Server. In LNCS, Vol. 2348, Springer-
[4] M. Body, M. Miquel, Y. Bedard, A. Tchounikine.
                                                               Verlag (2002) 67-82
    A Multidimensional and Multiversion Structure for
                                                           [20] R. Wrembel, B. Bebel. Metadata Management in a
    OLAP Applications. In Proc. of the 5th ACM Intl.
                                                               Multiversion Data Warehouse. In LNCS, Vol. 3761,
    Workshop DOLAP, McLean, Virginia (2002) 1-6
                                                               Springer-Verlag (2005) 1347-1364