Gulliver in the land of data warehousing: practical experiences and
                      observations of a researcher

                                                                            Panos Vassiliadis
                                                               National Technical University of Athens,
 Department of Electrical and Computer Engineering, Computer Science Division, Knowledge and
                                      Database Systems Laboratory, Zografou 15773, Athens, Greece
                                                                         pvassil@dbnet.ece.ntua.gr


                                         Abstract

     The gap between researchers and practitioners is widely discussed in the IT community.
     The purpose of this paper is to show the issues which occupy both research and
     practice, and the extent to which these issues overlap, in the field of data
     warehousing. To achieve this goal we first present the current status and tendencies
     in data warehouse research. Then we list several practical problems as they appear in
     the relevant literature, based also on our personal experience. Finally, we try to
     place the relationship between research and practice in a unified big picture.

 The copyright of this paper belongs to the paper’s authors. Permission to copy without fee
 all or part of this material is granted provided that the copies are not made or
 distributed for direct commercial advantage.
 Proceedings of the International Workshop on Design and Management of Data Warehouses
 (DMDW'2000)
 Stockholm, Sweden, June 5-6, 2000
 (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)
 http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/
 P.Vassiliadis                                                                      12-1

1. Introduction

The gap between researchers and practitioners is widely discussed in the IT community. The
situation regarding data warehousing seems to follow the general pattern where practitioners
complain that their practical problems are overlooked by research and researchers are
generally dissatisfied with the acceptance of their ideas in industry. Let us quote some
excerpts from the results of the previous DMDW workshop [GJSV99]: «Although many solutions
were developed for interesting subproblems... combining these partial and often very
abstract and formal solutions to an overall design methodology and warehousing strategy is
still left to the practitioners...», «... the influence of the research results on the
commercial stream of data warehouse products is very limited...», «The gap between data
warehouse practice and research became obvious ...». The purpose of this paper is to show
the issues which occupy research and practice, and the extent to which these issues overlap.
The ultimate goal is to show possible new areas of research, based on practical problems,
and at the same time to give an idea of how practice could benefit from research results
which seem to be largely ignored.
To this end we divide the paper into three parts. The first part appears in Section 2, where
we present the «good news» for data warehousing and, more specifically, the current status
of the data warehouse industry in terms of profit and sales, as well as the status of
research. To present the status of the research we have listed and classified the papers
relevant to data warehousing in three major database conferences during the last five years
and tried to show the tendencies of the research based on this study. The second part of the
paper deals with problems and failures during data warehouse projects and appears in
Section 3. The discussion is based both on the relevant literature (which is surprisingly
small) and on the author’s personal experiences. Based on the problems which we detect, we
then proceed to relate the data warehouse lifecycle to potential problems and solutions
proposed by the research community. Finally, we give some concluding remarks on the reasons
for the gap between the research and practice communities.

2. The Good News: Money and Research

There is good news for the data warehouse field: sales are increasing at high rates and
research is maintaining a steady focus on the field. We will briefly summarize the
importance of the field by mentioning the financial figures in subsection 2.1 and quickly
proceed to subsection 2.2 where we discuss the main subject of this
section, which is the status and the tendencies of research in data warehousing.

2.1       The Money

Selling products related to data warehousing is a money-making business. As mentioned in a
report by Merrill Lynch at the end of 1998 [ShTy98], the estimation was that the data
warehousing market was going to expand in the next few years. The numbers are surprisingly
large: the data mart market was expected to have a 40% compound annual growth rate (CAGR)
and the RDBMS sales for data warehouse purposes a CAGR of 25%, reaching total sales of
$2.2 billion. The OLAP report [Pend00] mentions that sales have reached $2.5 billion for
OLAP tools (including implementation services) and are expected to grow at a rate of 20% in
2000, with a CAGR of 19% over a five-year period. Fig. 1 shows the estimated sales, along
with the CAGR for six categories of tools.

                                                    1998    1999    2000    2001    2002  CAGR (%)
        RDBMS sales for DW                         900.0  1110.0  1390.0  1750.0  2200.0      25.0
        Data Marts                                  92.4   125.0   172.0   243.0   355.0      40.0
        ETL tools                                  101.0   125.0   150.0   180.0   210.0      20.1
        Data Quality                                48.0    55.0    64.5    76.0    90.0      17.0
        Metadata Management                         35.0    40.0    46.0    53.0    60.0      14.4
        OLAP (including implementation services)*   2000    2500    3000    3600    4000      18.9

      Fig. 1 Estimated sales in millions of dollars [ShTy98] (*estimates are from [Pend00]).

2.2       The Research

Research in the field of data warehousing is flourishing. Sessions dedicated to data
warehousing have appeared in most of the major conferences of the data management
discipline. Several workshops have appeared [GJSV99, DOLAP] and there is even a dedicated
conference for data warehouse issues [DaWaK].
To obtain an overview of the tendencies of research in the past five years we have selected
three prestigious database conferences, namely PODS, SIGMOD and VLDB, and classified those
of their papers which are relevant to the data warehouse area. We included any papers we
found relevant to data warehousing, except for the ones relevant to data mining (to retain a
clear-cut separation between the two fields). We restricted ourselves to just three
conferences, since our goal is to give a general feeling of the situation in the research
field, rather than conduct a thorough survey of the topic. Based on the content of the
papers, we classified them into several categories, shown in Fig. 3. For reasons of better
presentation and understanding, we group these categories into larger groups, referred to as
“super-categories”. Of course, several papers could fit in more than one category; still we
followed a naïve approach and attributed each paper to only one category. Naturally, we do
not claim to be perfect: it is possible that some papers have been left out of our study, or
classified under a category which was not the most suitable. We apologize in advance for any
such occurrences, although we scrutinized the proceedings to avoid this kind of problem.
Also, it is possible that the contribution of a paper in one category could be accompanied
by results in another “correlated” category. We believe that the results which we present
are not far from the ones which could be produced from a more elaborate categorization of
the papers, which would take this issue into consideration. Still, there is no proof for
this statement and the issue remains open (although we believe it is outside the scope of
this paper).
As one can see in Fig. 2, the number of papers seems to reach stability. Although the
research interest is rather young (only 5 years old) we anticipate that the tendency is to
keep a standard number of papers in the major conferences. The drop in the number of papers
in 1998 could be easily explained by the strange explosion in the number of papers relevant
to data mining during that particular year. It is very interesting to see that during the
last five years there have been 99 papers relevant to data warehousing, which makes about
20 papers per year on average.
We have identified 22 categories of research fields where the interest of the researchers
has been drawn. In the sequel, we list the most popular of them (Fig. 4).
- Data warehouse design: the problem lies in detecting the set of views to materialize in
  the data warehouse, in order to achieve the optimal operational cost (i.e., the combined
  cost of querying and refreshing the contents of the warehouse).
- Query rewriting: the problem lies in reusing existing views to rewrite a query posed over
  the sources. An alternative name for the problem could be ‘Answering queries using views’.
- Integration: this is a wide area covering several issues. The general context is that
  several sources containing operational data exist in the environment of the data
  warehouse and a unique interface must be provided in order to query / update them. The
  problem of integration is definitely larger than the area of data warehousing, especially
  with the current advances in Web technology. Note that in our survey we excluded all
  papers on integration that seemed clearly oriented towards semi-structured or Web data.
- Processing for relational aggregates: the area includes structures and algorithms for the
  efficient processing of aggregate queries. We distinguish this area from query rewriting,
  in the sense that these papers deal with results that could directly be implemented in a
  DBMS. We also distinguish the area from the papers involving processing for cubes, which
  we found more focused on MOLAP databases.
- View maintenance: the problem lies in keeping the data warehouse views in accordance with
  the changes happening in the source data.

                                Number of Papers by Year

                                        1995    1996    1997    1998    1999
                       No. of Papers       9      20      26      19      25

                       Fig. 2 Number of papers in PODS/SIGMOD/VLDB by year.

The big picture of the area is made clear in Fig. 5, classifying the papers in higher-level
super-categories. The classification is based on the grouping of Fig. 3. The most popular
super-categories so far have been Query Processing, View Technology, Integration and
Redundancy Exploitation. Query processing involves all techniques to efficiently process
requests and answer queries. It involves six categories and 29% of the research performed in
the past years. View technology is also a large category, focused on view maintenance
techniques as well as the physical data warehouse design process. Integration, which has
been previously described, involves producing a single interface for the processing of
distributed heterogeneous data, along with query processing techniques for that purpose and
resolution of conflicts at the schema level. Redundancy exploitation is a field of interest
mostly to theoreticians, involving query containment and rewriting.

        Category                                       Super-Category
        Incomplete information                         Incomplete information
        Data integration                               Integration
        Integration in general
        Query processing over integrated data
        Schema integration
        OLAP modeling                                  OLAP modeling
        Caching                                        Query Processing
        Iceberg queries
        Processing for aggregate queries
        Processing for cubes
        Query processing in general
        Top N queries
        Query containment                              Redundancy Exploitation
        Query rewriting
        Clustering                                     Storage Management
        Indexing
        Storage for cubes
        Storage in general
        Detecting changes in the sources               View Technology
        Data warehouse design
        Size estimation for views
        View maintenance

        Fig. 3 Grouping of paper categories to super categories

Probably the most interesting graph is depicted in Fig. 6, grouping the papers by year and
super-category. In this figure we see the evolution over time. One can see a declining
interest in view technology issues, which is rather natural since people originally thought
of data warehouses as collections of materialized views. Although we believe that this
attitude is still present in the research community, there seems to be a level of saturation
in the problems regarding view technology. At the same time, the interest in query
processing rises continuously from year to year, probably due to the standard tendency of
database researchers towards this field.
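
As a rough cross-check of the numbers quoted in this section, the following sketch recomputes the CAGR column of Fig. 1, the total and average of the yearly paper counts of Fig. 2, and the category counts of Fig. 3. It is only an illustration: the function and variable names are hypothetical and not part of the original study, and the growth rate is assumed to be compounded over the four yearly steps from 1998 to 2002.

```python
# Illustrative cross-check of the figures quoted in Section 2.
# All identifiers are hypothetical; only the numbers come from Figs. 1-3.


def cagr_percent(first: float, last: float, steps: int) -> float:
    """Compound annual growth rate in percent over `steps` yearly steps."""
    return ((last / first) ** (1.0 / steps) - 1.0) * 100.0


# Fig. 1: (1998 estimate, 2002 estimate, printed CAGR %), in millions of dollars.
FIG1_SALES = {
    "RDBMS sales for DW": (900.0, 2200.0, 25.0),
    "Data Marts": (92.4, 355.0, 40.0),
    "ETL tools": (101.0, 210.0, 20.1),
    "Data Quality": (48.0, 90.0, 17.0),
    "Metadata Management": (35.0, 60.0, 14.4),
    "OLAP": (2000.0, 4000.0, 18.9),
}

# Fig. 2: relevant papers per year in PODS/SIGMOD/VLDB.
FIG2_PAPERS = {1995: 9, 1996: 20, 1997: 26, 1998: 19, 1999: 25}

# Fig. 3: paper categories grouped by super-category.
FIG3_GROUPS = {
    "Incomplete information": ["Incomplete information"],
    "Integration": ["Data integration", "Integration in general",
                    "Query processing over integrated data",
                    "Schema integration"],
    "OLAP modeling": ["OLAP modeling"],
    "Query Processing": ["Caching", "Iceberg queries",
                         "Processing for aggregate queries",
                         "Processing for cubes",
                         "Query processing in general", "Top N queries"],
    "Redundancy Exploitation": ["Query containment", "Query rewriting"],
    "Storage Management": ["Clustering", "Indexing", "Storage for cubes",
                           "Storage in general"],
    "View Technology": ["Detecting changes in the sources",
                        "Data warehouse design", "Size estimation for views",
                        "View maintenance"],
}


def recomputed_cagrs() -> dict:
    """Recompute each CAGR from the 1998 and 2002 estimates of Fig. 1."""
    # Assumption: growth is compounded over the four steps 1998 -> 2002.
    return {name: round(cagr_percent(first, last, 4), 1)
            for name, (first, last, _printed) in FIG1_SALES.items()}


if __name__ == "__main__":
    for name, value in recomputed_cagrs().items():
        print(f"{name}: recomputed {value}%, printed {FIG1_SALES[name][2]}%")
    total = sum(FIG2_PAPERS.values())
    print(f"papers: {total} in total, {total / len(FIG2_PAPERS):.1f} per year")
    print(f"categories: {sum(len(c) for c in FIG3_GROUPS.values())}")
```

Under these assumptions the recomputed growth rates agree with the CAGR column of Fig. 1 to one decimal place, the yearly counts sum to 99 (19.8 per year, which the text rounds to 20), and the grouping of Fig. 3 contains 22 categories, six of them under Query Processing.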


[Bar chart: Papers by Category. Vertical axis: number of papers (0 to 18, in steps of 2). Horizontal axis: paper categories (rotated labels not recoverable from the extraction).]


                                                                                                                                                                                                                           in
                                                                                                        in
                                                                                er
                 ai


                                                                                                                                                  er
                                                                                                                  in


                                                                                                                                                                                                                                     n
                                                           D
                                                 ry


                                                                                                                            P

                                                                                                                                      co
                           ra


                                                                  ng


                                                                                                                                                                                 e
                                                                                                                                                                  ng
                                                                                          p
                                      n


                                                                                                                                                                                                                                              e
                                                                                                                                                                                                                                  io
                                                                                                                                                                                                              in
                 m


                                                                                                                                                                                                                        a
                                                                                                    a


                                                                                                                                                                            ag
                                                                               ov


                                                                                                                                              eb
                                                                                                                           LA
                                                                                     To
                                              ue
                                    io


                                                                                                                                                                                                                                         ag
                                                                                                                 e


                                                                                                                                                                                                                       m
                                                                                                 at
                                                                  si


                                                                                                                                                                                                                                 at
                          fo


                                                                                                                                                                  si
                                                                                                                                    ry
                                                                                                             et


                                                                                                                                                                                                          s
                                    at
            ew


                                                                                                                                                                           or
                                                                                                                                             Ic
                                                                 es
                                             Q


                                                                                                                       O
                                                                        ng


                                                                                                                                                                                                                                         or
                                                                                                                                                                                                                   he
                                                                                                D


                                                                                                                                                                                                                            tim
                                                                                                                                                                 es


                                                                                                                                                                                                         ge
                                                                                                                                 ue
                                                                                                          pl
                     ng

                                gr


                                                                                                                                                                       St


                                                                                                                                                                                                                                      St
            Vi


                                                             oc

                                                                       si


                                                                                                        m


                                                                                                                                                                                                               Sc
                                                                                                                                                             oc


                                                                                                                                                                                                     an
                                                                                                                             Q
                               te


                                                                                                                                                                                                                           es
                     si


                                                                      es


                                                                                                     co
                                                           Pr
                 es


                                                                                                                                                            pr
                           In


                                                                                                                                                                                                  ch


                                                                                                                                                                                                                       ze
                                                                  oc


                                                                                                 In
                oc


                                                                                                                                                       ry


                                                                                                                                                                                                g


                                                                                                                                                                                                                   Si
                                                                 pr


                                                                                                                                                   ue


                                                                                                                                                                                            tin
            Pr


                                                             ry


                                                                                                                                                                                           ec
                                                                                                                                                  Q
                                                           ue


                                                                                                                                                                                       et
                                                        Q


                                                                                                                                                                                       D
                                           Fig. 4 Number of papers in PODS/SIGMOD/VLDB by category.


                         Papers by Super Category

              Super category              Number of papers
              Incomplete information                     3
              OLAP modeling                              3
              Storage Management                         6
              Redundancy Exploitation                   12
              Integration                               20
              View Technology                           26
              Query Processing                          29


                                    Fig. 5 Number of papers in PODS/SIGMOD/VLDB by super category.


                         Papers by year and type

 Year   Incomplete    Integration   OLAP       Query        Redundancy     Storage      View
        information                 modeling   Processing   Exploitation   Management   Technology
 1995   -             -             -          2            2              -            5
 1996   1             6             -          5            2              -            6
 1997   2             3             1          6            2              3            9
 1998   -             4             -          6            4              2            3
 1999   -             7             2          10           2              1            3


                      Fig. 6 Number of papers in PODS/SIGMOD/VLDB by year and super category.
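
The yearly counts of Fig. 6 and the totals of Fig. 5 can be cross-checked with a few lines of code. The following is only a minimal sketch in Python: the dictionaries are transcribed from the two figures, empty cells of Fig. 6 are assumed to mean zero papers, and the names PAPERS_BY_YEAR, FIG5_TOTALS, totals_by_category and yearly_share are hypothetical helpers introduced here for illustration.

```python
# Counts transcribed from Fig. 6 (papers per year and super category).
# Assumption: a category with no entry for a year had zero papers that year.
PAPERS_BY_YEAR = {
    1995: {"Query Processing": 2, "Redundancy Exploitation": 2,
           "View Technology": 5},
    1996: {"Incomplete information": 1, "Integration": 6,
           "Query Processing": 5, "Redundancy Exploitation": 2,
           "View Technology": 6},
    1997: {"Incomplete information": 2, "Integration": 3, "OLAP modeling": 1,
           "Query Processing": 6, "Redundancy Exploitation": 2,
           "Storage Management": 3, "View Technology": 9},
    1998: {"Integration": 4, "Query Processing": 6,
           "Redundancy Exploitation": 4, "Storage Management": 2,
           "View Technology": 3},
    1999: {"Integration": 7, "OLAP modeling": 2, "Query Processing": 10,
           "Redundancy Exploitation": 2, "Storage Management": 1,
           "View Technology": 3},
}

# Totals transcribed from Fig. 5 (papers per super category, 1995-1999).
FIG5_TOTALS = {
    "Incomplete information": 3,
    "OLAP modeling": 3,
    "Storage Management": 6,
    "Redundancy Exploitation": 12,
    "Integration": 20,
    "View Technology": 26,
    "Query Processing": 29,
}


def totals_by_category(by_year):
    """Sum the yearly counts into one total per super category."""
    totals = {}
    for counts in by_year.values():
        for category, n in counts.items():
            totals[category] = totals.get(category, 0) + n
    return totals


def yearly_share(by_year, category):
    """Return, per year, the fraction of that year's papers in `category`."""
    shares = {}
    for year, counts in sorted(by_year.items()):
        shares[year] = counts.get(category, 0) / sum(counts.values())
    return shares


if __name__ == "__main__":
    # The per-year counts of Fig. 6 should add up to the totals of Fig. 5.
    assert totals_by_category(PAPERS_BY_YEAR) == FIG5_TOTALS
    for year, share in yearly_share(PAPERS_BY_YEAR, "Integration").items():
        print(year, f"{share:.0%}")
```

Looking at shares rather than raw counts is a design choice of this sketch: the number of data warehousing papers differs from year to year, so a relative measure may give a fairer picture of which categories gain or lose interest.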


There are areas, like incomplete information and storage management, which
seem to lose interest as time passes. Redundancy exploitation keeps a steady
level of interest due to its dedicated audience of theoreticians. Integration
and OLAP modeling seem to gain interest at the same time. The probable reason
for the former is the criticism against the materialized nature of data
warehousing. As for the latter, it is possible that the lack of a standard
OLAP model plays a role in the increasing interest in this category.

3. Data Warehouse Problems and Failures

An objective observer facing the facts of the previous section would directly
conclude that the area of data warehousing thrives and that the potential for
further growth is more than probable. Although this seems to be a quite
accurate description of the situation, we argue that a data warehouse project
is a great risk and is definitely endangered by several factors. We intend to
back up this statement with concrete arguments based both on our personal
practical experience in the field and on the relevant literature.

 Category of Factors      Factors
 Design Factors           Lack of metadata management
                          Problematic data engineering
                          Unrealistic schema design
                          Client tools are neglected or dominate the design
                          No design method is used
 Technical Factors        Choice of wrong components
                          Vendor claims are not tested
                          No examination of volume of queries, data sets and network traffic
 Procedural Factors       Improper project scope
                          Bad use of pilot projects
                          User communities are not involved in the design
                          No test of new management requirements
                          Lack of training for stakeholders
 Sociotechnical Factors   Data warehouses cross organizational treaty lines
                          Data ownership and access are reconsidered due to the presence of a data warehouse
                          The work practices of user communities are affected

       Fig. 7 Factors affecting the failure of data warehousing projects [Dema97].

A very good discussion on the problems of data warehousing projects is found
in [Dema97]. The paper mentions the logical fact that nobody really speaks
about data warehousing failures and goes on to group the reasons for the
failure of a data warehousing project into four categories, namely design,
technical, procedural and sociotechnical factors (Fig. 7).

According to [ShTy98], the average time for the construction of a data
warehouse is 12 to 36 months and the average cost for its implementation is
between $1 million and $1.5 million. Data marts are a less risky expenditure,
since they cost hundreds of thousands of dollars and take less than a year to
implement. Still, if a project of such nature depends on so many factors in
order to succeed, then the self-contemplating statements on the state of the
art in data warehouse management are rather unrealistic. In the sequel, we
take a short look at the particular factors of failure for data warehouse
projects.

As far as the design factors are concerned, there is an obvious lack of a
“textbook” methodology for the design of a data warehouse. There are no
standard, or even widely accepted, metadata management techniques¹ or
languages, data engineering techniques or design methodologies for data
warehouses. Rather, proprietary solutions from vendors, or do-it-yourself
advice from experts, seem to define the landscape. If we look at the relevant
research papers, the picture is disappointing: the three major conferences on
data management are not really concerned with issues like metadata management
or design methodologies for data warehouses. There exist, though, relevant
areas such as the research on physical data warehouse design and on
integration issues. Still, a closer look reveals that the research seems to
target problems not really close to the practical ones. For example, the
assumptions made for the design problem (knowledge of user queries, their
sizes and frequencies) are rather unrealistic with respect to practical cases.
Also, the integration problem is definitely oriented toward a uniform API to
distributed sources, i.e., to languages and mechanisms that enable the
querying of data. Still, problems like extraction, transformation and
cleaning, which can take up to 80% of the time spent in the development of a
data warehouse [Dema97], seem to be ignored by the research community.

¹ [ShTy98] reports that the lack of a common metadata standard (despite the
existence of the MDIS standard at the end of 1998) is the basic source of
concern for metadata management tools.

The technical factors also reveal the absence of research that confronts
practical problems. There exist, of course, standards for the evaluation of
software components, but there is a gap in the evaluation and choice of
hardware components. As one can see in Fig. 8, hardware (disk, processor and
network costs) takes up to 60% of a data warehouse budget. Critical software
(DBMS and client tools) which is purchased (and not developed in-house) takes
up to 16% of the budget. To our knowledge, there are no papers that deal with
the issue of hardware/software selection for data warehouse environments. As
for the estimation of the sizes of queries, data sets and network traffic, a
closer look at the


appendix will reveal only one (!) paper on the estimation of view sizes
[SDNR96]. The fact that the average size of data warehouses increases year by
year makes the problem even tougher. Back in 1996 the average data warehouse
size was estimated to be around 250 GB. In today’s data explosion there is
even talk about scientific data warehouses of 40 TB [SGKT00]. This means that
despite Moore’s law and the drop in the cost of storage units, size is still a
problem for data warehousing. The increasing number of users adds to the
complexity of the problem. [ShTy98] mentions the case of a data warehouse
involving 20,000 users with an increase of 2,000 users per year. Obviously,
estimating the size of materialized views or user queries is of great
importance in this context.

              DW Design Costs

 Cost item                             Share
 Disk storage                            30%
 Processor costs                         20%
 Integration and transformation          15%
 DBMS                                    10%
 Network costs                           10%
 Access/analysis tools                    6%
 Metadata design                          5%
 Activity monitor                         2%
 Data monitor                             2%

 Fig. 8 Data warehouse design costs according to Bill Inmon [Inmo97]

              Recurring DW costs

 Cost item                                                                 Share
 DW refreshment                                                              55%
 Servicing data mart requests for data                                       21%
 Monitoring of activity and data                                              7%
 End-user training                                                            6%
 Metadata management                                                          3%
 Periodic verification of the conformance to the enterprise data model        2%
 Summary table usage analysis                                                 2%
 Security administration                                                      1%
 Occasional reorganization of data                                            1%
 Data archiving                                                               1%
 Capacity planning                                                            1%

 Fig. 9 Data warehouse recurring costs according to Bill Inmon [Inmo97]

warehouse. We refer the interested reader to [Gree00, Dema97] for further
probing on this very interesting issue.

As for the sociotechnical issues, it is also very interesting to briefly
discuss the relevant factors, since there is very little reference to this
kind of problem in the literature. According to [Dema97], breaking the
organizational treaties is a consequence of the fact that the data warehouse
may reorganize the way the organization works and intrude on the functional or
subjective domain of the stakeholders. For example, imposing a particular
client tool on the users invades the users’ desktop, which is considered to be
their personal “territory”. The problems due to data ownership and access are
grouped in two categories. First, data ownership is power within an
organization. Any attempt to share or take control over somebody else’s data
is equivalent to a loss of power for this particular stakeholder. Secondly, no
division or department can claim to possess 100% clean, error-free data. The
possibility of revealing the data quality problems within the information
system of the department is definitely frustrating for the affected
stakeholders. Finally, the invasion of work practices comes down to the
psychological fact that no user community seems to be really willing to shift
from gut feeling or experience to objective, data-driven management (see
[Dema97] for a broader discussion). To top off the skepticism about the
non-technical problems and reasons for failure, ethical considerations can be
added to the big picture of data warehousing. In [Smit97] several such
thoughts are presented: Is it fair to use customers’ data to harm their
relationships with their suppliers/customers? Is it fair to use such data to
intrude on your customers’ know-how? Is it fair to use customers’ data to
change the structure of your organization in a way that is detrimental to your
customers? Is it fair to use personal data of individual customers without any
prior notice?

Most of the aforementioned reasons for failure are backed up by other
testimonial literature (e.g., [Paul97], [ShTy98]).

3.1      Personal Experience
The author has been involved in both research and practical data warehouse
projects during the last six years. Our research experience was mainly the
European
The procedural and sociotechnical reasons are not really                                  basic research project “DWQ: Foundations for Data
technical reasons with which we should expect the                                         Warehouse Quality” [JaVa97]. Obviously, some of the
research society to deal with. We mention them for                                        criticism and comments in this paper are influenced by
reasons of completeness and in order to show how                                          the research conducted in this project. We apologize for
sensitive a project like the construction of a data                                       this clear bias; still, since this paper presents the author’s
warehouse is. The procedural factors involve reasons for                                  personal judgments we believe that we should make clear
deficiencies concerning the deployment of the data                                        what has possibly influenced our opinion.
warehouse. Apart from classical problems in IS                                            The author has also been involved in three rather small
management, it is important to notice that the role of user                               practical data warehouse projects. The first involved
communities is crucial: the end-users must be trained to                                  loading data from all the health centers (i.e., hospitals,
the new technologies and included in the design of the                                    provincial medical centers and other special kinds of


centers) in Greece into an enterprise data warehouse. The loading of data was performed annually
and the querying was supposed to be performed mostly by pre-canned reports. Still, quite a lot
of flexibility was provided to the user to filter, roll up, drill down and drill through the
data. The data warehouse was rather small and its construction took around 12 months. The major
problems encountered were not technical, since (a) the size of data was not so big, (b) the
refreshment window was not a problem and (c) there was no real problem in reconciling the source
data. Still, there were major problems with the administration team of the legacy system due to
the following reasons:
− Lack of training of the target administration team. The people administering the legacy
    COBOL-based system were the ones who would administer the new system, too. Still, this was
    their first experience with relational technology and this was definitely a cultural shock
    for them.
− Involvement of the administration team of the legacy system in the design of the new system.
    Although it is clear that no data warehouse can be built without the involvement of the
    source administrators, our personal experience suggests that this should be limited to the
    construction of the data warehouse enterprise model (or even only to the reverse engineering
    of legacy data). Any attempt to include people without the proper background in a process
    they do not really understand seems to jeopardize the whole effort, rather than train or
    accustom them to the new system.
− Poor quality of legacy data. The toughest problem in this particular project was the cleaning
    of data. Each circuit in the schema seemed to be a sui generis situation. Most important, we
    faced big difficulties trying to convince the administrators of the legacy system of the
    poor quality of their data. Another big problem was the detection of which sources were
    reliable. In a COBOL system there is too much redundancy, since each application uses its
    own data store. Every now and then, the different COBOL files are synchronized, although
    this is not always 100% successful. When building the data warehouse, it is a hard task to
    determine the quality of each candidate data source.
− Data warehouse evolution. The business rules for the data warehouse are likely to change even
    during the construction of the warehouse itself. The problem is hard, since (a) it sets the
    whole project back in schedule and cost, (b) it psychologically frustrates the development
    team and (c) the lack of a metadata management repository makes it almost impossible to
    detect which part of the database or the applications has to be synchronized with the new
    situation. Imagine, for example, the case where the primary key of a fact table has to
    change a couple of weeks before completing the project. In our case, we had to detect and
    evolve around 50 pre-canned reports as well as all the refreshment processes of the fact
    table and the materialized views that used it. It was only the consistent naming of all the
    software components that helped us perform this task.
Note that the project experienced no political problems. The data warehouse was requested by the
same department that previously owned the data. The new system would still be under the control
of this particular department and would thus synchronize and clean the information they provided
to higher management. Note also that we never came into direct contact with the end-users: this
was supposed to be a task undertaken by the administration team of this particular department.
Thus, we have no knowledge of the actual success of this project.
On a second occasion, we had to build a data warehouse with pension data. The data were to be
updated monthly and used by pre-canned reports. The size of data involved a few million rows per
month. The source data relied again on a COBOL-based legacy system. The project lasted nine
months and could be characterized more as the construction of a data mart rather than the
construction of a full data warehouse. In this case, the major problem was of political nature:
different departments were involved in the ownership of the information. The people
administering the legacy system were definitely affected by the construction of the warehouse.
These people:
− would lose the full ownership of the information (which translates to sheer power in the IT
    department);
− would have to take care of the transportation and conversion of the data in their own system
    (which means extra workload for both people and systems); and
− would see any deficiencies of the information they produced revealed (a fact of enormous
    importance and effect in the public sector).
Bearing all this in mind, it is quite straightforward to understand the difficulties raised.
Moreover, it was interesting to see that the higher management, although committed to the idea
of constructing the data warehouse, was unable to force things to happen and had to take an
approach that peacefully resolved any problems that occurred, in order to salvage the project
from total failure.
Another problem we had to face in this project was the difficulty in constructing the extraction
and cleaning software. The extraction of data from the legacy systems is a highly complex,
error-prone and tiring procedure. To give an idea of the problem, let us mention the case where
the problem involved detecting relevant data from a COBOL file, converting EBCDIC to ASCII
format, unpacking the packed numbers, reducing all address fields to a standard format and
loading the result into a table in the data warehouse. Apart from the standard tool offered by
Oracle for these purposes (SQL*Loader) we did not use any commercial tool for these tasks. This
seems to be the tactic followed by the majority of data warehousing projects. According to
[ShTy98] most of the companies


contacted for their survey estimate that more than 1/3 of the cost and time are spent on ETL
tasks during the development process. Still, in spite of the obvious importance of this process,
the vast majority of them developed their own application instead of using a tool to facilitate
the process. [ShTy98] also reports that data quality products are expensive and hard to use.
Based on the problem of time and budget constraints for a data warehouse project, [ShTy98]
estimates that such products are going to grow modestly in the next few years (with almost the
lowest CAGR of all the product categories).
Political problems were apparent in a third case where the project failed. The organization
possessed four legacy systems, all of different kinds (COBOL, Excel and dBase files as well as a
relational system). A pilot data mart involving a subset of one of the legacy systems had
already been successful and the management was enthusiastic about the whole idea. Still, the
project failed before it even started. As we had also observed in the previous case, it seems to
be a common phenomenon that the people administering the legacy system take a little time until
they understand what is politically happening to them once a data warehouse is built. In this
particular case the reaction was quick and absolute: no data were to be given from the largest
legacy system, since its administrators simply refused to provide them. The project was thus
canceled. The lesson we learned in this case is that it takes more than an enthusiastic
management and a successful pilot for a data warehouse to succeed. Later, we learned that the
warehouse project started again; still, we have no knowledge of the fate of this new effort.

3.2 Relationship between Practical Problems and Research Issues

In this section we would like to relate the data warehouse lifecycle with potential problems and
solutions offered by technology to tackle these particular problems. The first problem in this
task is the lack of a concrete “textbook-style” methodology. Reading the two classical books on
data warehousing [Inmo96, Kimb96] one gets the feeling that they provide tips and solutions for
fragments of the whole process, rather than a concrete methodology for the data warehouse
practitioner. We use as a template methodology the one proposed in an Appendix of [Inmo96] and
try to relate it to potential problems and technological solutions offered by research. We list
only the aforementioned problems and research categories. Again, we do not claim that either
list is exhaustive, but rather indicative.
As we can see in Fig. 10 there are areas where research has contributed a lot to the practical
problems. For example, several issues of the view technology super-category are (or at least,
can be) somehow used by practitioners in data warehouse design and implementation. Also, several
topics of the integration super-category can be exploited in practical cases.
Apart from these successes, there are two issues that clearly depict the gap between research
and practice. On the one hand, there is an unclear picture with respect to the extent that
practice has exploited the results of research. Query processing and storage management are two
research fields aiming to empower the technology providers (i.e., the software and hardware
vendors) with better techniques for the storage and acquisition of information. To our
knowledge, it is not clear to what extent these results have been incorporated in commercial
products. The extent to which results in the field of incomplete information and redundancy
exploitation can be exploited is another pending issue. The former seemed to be a rather
promising research field but the lack of research interest in recent years seems to be
discouraging for its further exploitation. The latter is a clear field but we believe that its
practical exploitation will take time to be implemented. As far as the data warehouse designer
is concerned, the cases where the determination of the intensional subsumption of two data
stores is useful are rather limited. Instead, it is the extensional properties of the data
source that count (an issue not really apparent in database research). Finally, OLAP modeling
could be very useful in the logical definition of the data warehouse, but the lack of a standard
multidimensional hierarchical model seems to drive designers to ad-hoc, proprietary solutions.
Still, the relational counterpart, in the form of the ER diagram and the relational model, seems
to be a promising precedent.
On the other hand, of course, there seem to be rather big gaps in the table of Fig. 10, with
respect to steps in the data warehouse lifecycle which are not supported by the conducted
research. The data model analysis could clearly be helped by improved techniques of metadata
management (and standards) as well as by data engineering methods that enable the designer to
understand and model data and processes better. Breadbox analysis and technical assessment are
clearly under-estimated by the research community. Techniques to analyze data volume, network
traffic, relevance and quality of software components would be greatly appreciated by data
warehouse designers. The extraction process is also suffering from lack of help from the
research community: as already mentioned, most research performed has been dedicated to what
should be extracted (instead of how this extraction is performed). The practical aspects of
extraction are clearly neglected (e.g., declarative languages and visual interfaces for the
management of the extraction process, automation of the extraction programs, etc.). The problem
is vast due to the sui generis nature of each kind of source (ASCII data are different from ISAM
or database data) and of each particular source itself. The peculiarities of the conversion
process are also, more or less, neglected.
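
To make the conversion steps mentioned in Section 3.1 more concrete (converting EBCDIC to ASCII, unpacking the packed numbers, reducing address fields to a standard format), a minimal Python sketch follows. It is not the software used in the projects described above: the record layout, the field names, the choice of the cp037 code page and the address normalization rule are all assumptions made for illustration, and the final load step (e.g., via SQL*Loader) is omitted.

```python
from decimal import Decimal

# Hypothetical fixed-width record layout: (field name, offset, length in bytes).
LAYOUT = [("name", 0, 10), ("address", 10, 20), ("amount", 30, 4)]


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COBOL COMP-3) field.

    Each byte holds two 4-bit digits; the last nibble is the sign
    (0xD negative, 0xC or 0xF positive/unsigned).
    """
    if not raw:
        raise ValueError("empty packed field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    if sign not in (0x0C, 0x0D, 0x0F) or any(d > 9 for d in nibbles):
        raise ValueError(f"invalid packed field: {raw.hex()}")
    value = int("".join(str(d) for d in nibbles) or "0")
    if sign == 0x0D:
        value = -value
    # Shift the decimal point `scale` places to the left.
    return Decimal(value).scaleb(-scale)


def normalize_address(text: str) -> str:
    """Assumed standard form: single spaces, upper case."""
    return " ".join(text.split()).upper()


def convert_record(record: bytes) -> dict:
    """Turn one EBCDIC fixed-width record into a dict of Python values."""
    row = {}
    for name, offset, length in LAYOUT:
        chunk = record[offset:offset + length]
        if name == "amount":
            row[name] = unpack_comp3(chunk, scale=2)
        else:
            # cp037 is one common EBCDIC code page; the real one is site-specific.
            row[name] = chunk.decode("cp037").strip()
    row["address"] = normalize_address(row["address"])
    return row
```

Even this toy version shows why the task is tedious: every field needs its own decoding rule, and the layout has to be recovered from the COBOL source before any of it can run.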




                 Phase            Lifecycle step                         Description                                                 Potential Problems                                        Solutions offered by the
                                                                                                                                                                                               research
                                  Decision to build the warehouse                                                                    Improper project scope
                                                                                                                                     Bad use of pilot projects
                                                                                                                                     Data warehouses cross organizational treaty lines
                                                                                                                                     Data ownership and access are reconsidered
                                                                                                                                     The work practices of user communities are affected
                 Design           Data Model Analysis                    Conceptual and logical model                                No design method is used                                  OLAP modeling
                                                                                                                                     User communities are not involved in the design
                                                                                                                                     Lack of metadata management
                                                                                                                                     Problematic data engineering
                                                                                                                                     Lack of training of the target administration team
                                                                                                                                     Excessive involvement of the administration team of the
                                                                                                                                     legacy system in the design of the new system
                                  Breadbox Analysis                      Size estimation for the data                                No examination of volume of queries, data sets and        Size estimation for views
                                                                                                                                     network traffic
                                  Technical Assessment                   Definition of technical requirements                        No test of new management requirements
                                  Technical Environment Preparation      Definition of network, storage, OS, software components,    Client tools are neglected or dominate the design
                                                                         etc.
                                                                                                                                     Choice of wrong components
                                                                                                                                     Vendor claims are not tested
                                  Subject Area (per subject)             Decision which subject area to populate
                                  Source System Analysis (per subject)   Identification of proper source for the data and reverse    Difficulty in determining which source is appropriate,
                                                                         engineering of the selected source                          due to quality problems
                                  Data Warehouse Database Design         Physical database design for the data warehouse             Unrealistic schema design                                 Physical DW design, Indexing
                 DW               Program Specifications (per subject)   Formalize the interface between source data and             Data warehouse evolution                                  View Maintenance, Data &
                 implementation                                          warehouse                                                                                                             Schema Integration
                                  Programming (per subject)              Construction of the appropriate software for ETL purposes   Poor quality of legacy data                               Detecting changes in the
                                                                                                                                                                                               sources
                                                                                                                                     Difficulty in constructing the S/W correctly
                                  Population (per subject)               Load the warehouse with data                                Difficulty in using the data quality tools
                 Report           Determine data needed                  Decide which part of the data warehouse covers the data
                 Implementation                                          for the report
                 (per report)
                                  Program to extract data                Write a program to get the data from the DW
                                  Customize the data                     Customize the data for the user's intuition
                                  Refine the analysis                    Is the report suitable for what it was intended?
                                  Usage                                  Use the reports                                             Lack of training for stakeholders
                                  Institutionalize                       Should the report be institutionalized?
                                                   Fig. 10 Data warehouse lifecycle steps, potential problems and solutions offered by the research community
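
The "Breadbox Analysis" step of Fig. 10 can be illustrated by a back-of-the-envelope calculation. The Python sketch below is a hypothetical illustration, not a technique taken from the cited literature: the row width, the number of retained loads, the index overhead and the dimension cardinalities are assumed values, and the view-size figure is only the trivial upper bound (a group-by cannot have more rows than the fact table, nor more than the product of the cardinalities of its grouping attributes), not the estimator of [SDNR96].

```python
from math import prod


def fact_table_bytes(rows_per_load: int, bytes_per_row: int,
                     loads_retained: int, index_overhead_pct: int = 30) -> int:
    """Rough size of a fact table after a number of periodic loads.

    index_overhead_pct is an assumed allowance for indexes, in percent.
    """
    raw = rows_per_load * bytes_per_row * loads_retained
    return raw * (100 + index_overhead_pct) // 100


def view_rows_upper_bound(fact_rows: int, cardinalities: list[int]) -> int:
    """Trivial upper bound on the rows of an aggregate (group-by) view."""
    return min(fact_rows, prod(cardinalities))


if __name__ == "__main__":
    # Assumed figures: 3 million rows per monthly load, 200 bytes per row,
    # 12 months retained.
    size = fact_table_bytes(3_000_000, 200, 12)
    print(f"fact table: about {size / 1e9:.2f} GB")
    # A view grouped by month (12), region (50) and category (400), all assumed.
    print("view rows at most:", view_rows_upper_bound(36_000_000, [12, 50, 400]))
```

A calculation of this kind does not replace measurements of query volume or network traffic, but it gives the designer a first order of magnitude before any hardware or software component is chosen.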
We believe that a turn in the interest of the research community from the virtual querying of
distributed heterogeneous data sources and the intensional reconciliation to practical aspects
of extraction of materialized data could benefit the practitioners a lot.
Finally, it seems to be unclear to what extent procedural and sociotechnical factors (involved
mostly at the beginning and the end of a data warehouse project) could benefit from the use of
new technology suggested by research results. This fuzziness alone is a very good reason for
research on the part of academia. As reported in [GJSV99], a significant contribution could also
be made by business administration sciences, e.g., in the way the data warehouse is introduced
in the corporation.

4. Conclusions

Normally, this is the place for an optimistic message, or the ringing of the bell. For a change,
we will do neither. There are two issues, though, that we would like to touch upon as concluding
remarks. First, is it really the case that research and practice are so far apart? In our humble
opinion, the answer is negative. Although research has targeted only a fraction of the possible
areas where practitioners could need assistance, the technological contribution of the research
society is significant. For example, let us mention the case of data warehouse refreshment.
Despite the problems in the extraction step, which we have already mentioned, the refreshment
process is of significant importance for the proper operation of the data warehouse. The
recurring costs for data warehouse refreshment come up to 55% of the overall cost for running a
data warehouse (Fig. 9). Still, the contribution is only in areas where the existing technology
could be enhanced, without any methodological results or groundbreaking research in new fields.
Secondly, why is it that researchers are found away from the practical problems of data
warehousing? This is a widely discussed issue (e.g., there is a standard debate in the
Communications of the ACM magazine). We point out only a few reasons that have come to our
attention:
- It is possible that several researchers are not aware of the real-world problems. The major
     motivation for writing this paper was a discussion with a visiting researcher to our
     department. This person has devoted a great deal of time, programming and energy to the
     data warehouse design problem. Still, he believed that the data warehouse is simply a set
     of “DECLARE VIEW” statements. Clearly, this was a problem of lack of direct contact with
     practical problems.
- It is not always rewarding, in terms of research, to deal with practical problems. The
     extraction process of our case study, which we mentioned in Section 3, might give an
     example for this statement. Which researcher would feel happy to work on such a ‘dirty’
     problem, knowing that it will be too hard to make publications out of such an effort? It is
     not strange, thus, that so much theoretical work has been devoted to view maintenance
     issues, with respect to what should be propagated to the warehouse, while few research
     efforts have been made as to how this extraction and propagation is to be made. We believe
     that it would be really hard for a paper concerning practical automation techniques for the
     data extraction task to convince an academic audience. The last Asilomar report [BBC+98]
     states the need for “groundbreaking” instead of “delta” research; still, it is not clear
     which practical issues concerning data warehousing qualify under this definition.
- The rules that govern the behavior of science also apply in the case of data warehousing. It
     is commonly agreed that it is the Paradigm that determines the interesting problems and not
     vice-versa. In our case, the paradigm set by the papers of Codd and Selinger et al. has,
     more or less, set the landscape for the research in the data warehouse field, too. For
     example, although a great deal of work has been devoted to query processing for aggregate
     queries, these queries are still treated in isolation. Still, an OLAP session is a sequence
     of steps, which have some logical interrelationship. How many papers do you know dealing
     with this particular property of OLAP? As another example, we simply recall the technical
     and design problems mentioned in Section 3, which, although of great importance, are not
     addressed by the research. We believe that one of the reasons for this situation is the
     non-standard nature of these problems, which puts them outside the scope of the relational
     paradigm.
As for the future, it is hard to make any predictions. Is data warehousing going to be virtual
(making all our comments on the integration problem void, and the research conducted in this
field highly useful)? Is there going to be a shift towards methodological issues in data
warehouses? Are the gaps in Fig. 10 going to be filled? Although the answer is ‘I don’t know’
(at least on our part), it is a challenge to work on these issues, contributing thus to the
closing of the gap between research and practice and making data warehousing an easier and less
risky endeavor for practitioners and organizations.

5. References

 [BBC+98]       P.A. Bernstein, M.L. Brodie, S. Ceri, D.J. DeWitt, M.J. Franklin, H. Garcia-Molina,
                J. Gray, G. Held, J.M. Hellerstein, H.V. Jagadish, M. Lesk, D. Maier, J.F. Naughton,
                H. Pirahesh, M. Stonebraker, J.D. Ullman. The Asilomar Report on Database Research.
                SIGMOD Record 27(4): 74-80 (1998)
 [Comp96]       ComputerWire Inc. Data Warehouse


                Economics:        ROI     doubts?    Data    [Pend00]   N. Pendse, February 24, 2000. The
                Warehouse Tools Bulletin, November                      OLAP        Report.      Available     at
                1996.               Available           at              http://www.olapreport.com/Market.htm.
                http://www.computerwire.com/dwtb/free        [SDNR96]   A. Shukla, P. Deshpande, J.F. Naughton,
                /2112_182.htm                                           K. Ramasamy. Storage Estimation for
 [DaWaK]        International Conference on Data                        Multidimensional Aggregates in the
                Warehousing and Knowledge Discovery                     Presence of Hierarchies. In Proceedings
                (DaWaK).        http://www.informatik.uni-              of 22nd International Conference on Very
                trier.de/~ley/db/conf/dawak/index.html                  Large Databases (VLDB), Mumbai India
 [Dema97]        M. Demarest. The politics of data                      1996.
                warehousing.            Available       at   [SGKT00]   A. Szalay, J. Gray, P. Kunszt, A. Thakar.
                http://www.hevanet.com/demarest/marc/                   Designing and Mining Multi-Terabyte
                dwpol.html                                              Astronomy        Archives.      SIGMOD
 [DOLAP]        International     Workshop      on Data                 Conference 2000. Also available at
                Warehousing and OLAP (DOLAP).                           http://www.research.microsoft.com/~gra
                http://www.pages.drexel.edu/faculty/son                 y/
                giy/dolap.html,                              [ShTy98]   C. Shilakes, J. Tylman. Enterprise
                http://www.informatik.uni-                              Information Portals. Enterprise Software
                trier.de/~ley/db/conf/dolap/index.html                  Team. November 1998. Available at
 [GJSV99]       S. Gatziu, M.A. Jeusfeld, M. Staudt, Y.                 www.sagemaker.com/company/downloa
                Vassiliou. Design and Management of                     ds/eip/indepth.pdf.
                Data Warehouses - Report on the              [Smit97]    J. Smith. Do Data Warehouses
                DMDW’99 Workshop. SIGMOD Record                         Challenge      Fair     Play?     Beyond
                28(4), December 1999. Refers to the                     Computing, 6(4), May 1997. Available at
                International Workshop DMDW’99 at                       www.beyondcomputingmag.com/archive
                CAiSE’99, Heidelberg, Germany, June                     /1997/5-97/ethics.html
                1999. Online version available at
                http://sunsite.informatik.rwth-
                aachen.de/Publications/CEUR-WS/Vol-
                19
 [Gree00]        L. Greenfield. Data Warehousing
                Political    Issues.     February   2000.
                Available                               at
                http://www.dwinfocenter.ord/politics.ht
                ml
 [Inmo96]       W.H. Inmon. Building the Data
                Warehouse. John Wiley & Sons, March
                1996.
 [Inmo97]        B. Inmon. The Data Warehouse Budget.
                DM Review Magazine, January 1997.
                Available                               at
                http://www.dmreview.com/master.cfm?
                NavID=55&EdID=1315
 [JaVa97]       M. Jarke, Y. Vassiliou. Foundations of
                data warehouse quality – a review of the
                DWQ project. In Proc. 2nd Intl.
                Conference Information Quality (IQ-97),
                Cambridge, Mass., 1997. Available in
                http://www.dblab.ece.ntua.gr/~dwq
 [Kimb96]       R. Kimbal. The Data Warehouse Toolkit:
                Practical Techniques for Building
                Dimensional Data Warehouses. John
                Wiley & Sons, February 1996.
 [Paul97]        L.G. Paul. Anatomy of a failure. CIO
                Magazine.       November      15,   1997.
                Available                               at
                http://www.cio.com/archive/enterprise/1
                11597_data_content.html


P.Vassiliadis                                                                                               12-11
Appendix
Paper                                                                                   Category
1995 - PODS
Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, Divesh Srivastava. Answering        Query rewriting
Queries Using Views. 95-104.
Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman. Answering Queries Using Templates   Query rewriting
with Binding Patterns. 105-112.
H. V. Jagadish, Inderpal Singh Mumick, Abraham Silberschatz. View Maintenance Issues    View maintenance
for the Chronicle Data Model. 113-124.
1995 - SIGMOD
Ashish Gupta, Inderpal Singh Mumick, Kenneth A. Ross. Adapting Materialized Views       View maintenance
after Redefinitions. 211-222.
Yue Zhuge, Hector Garcia-Molina, Joachim Hammer, Jennifer Widom. View Maintenance       View maintenance
in a Warehousing Environment. 316-327.
Timothy Griffin, Leonid Libkin. Incremental Maintenance of Views with Duplicates.       View maintenance
328-339.
James J. Lu, Guido Moerkotte, Joachim Schü, V. S. Subrahmanian. Efficient Maintenance   View maintenance
of Materialized Mediated Views. 340-351.
1995 - VLDB
Weipeng P. Yan, Per-Åke Larson. Eager Aggregation and Lazy Aggregation. 345-357.        Processing for aggregates
Ashish Gupta, Venky Harinarayan, Dallan Quass. Aggregate-Query Processing in Data       Processing for aggregates
Warehousing Environments. 358-369.
1996 - PODS
Alon Y. Levy, Anand Rajaraman, Jeffrey D. Ullman. Answering Queries Using Limited       Query rewriting
External Processors. 227-237.
1996 - SIGMOD
Richard Hull, Gang Zhou. A Framework for Supporting Data Integration Using the          Data integration
Materialized and Virtual Approaches. 481-492.
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes          DW design
Efficiently. 205-216.
Leonid Libkin, Rona Machlin, Limsoon Wong. A Query Language for Multidimensional        Processing for cubes
Arrays: Design, Implementation, and Optimization Techniques. 228-239.
Sudhir Rao, Antonio Badia, Dirk Van Gucht. Providing Better Support for a Class of      Query processing in general
Decision Support Queries. 217-227.
Kenneth A. Ross, Divesh Srivastava, S. Sudarshan. Materialized View Maintenance and     View maintenance
Integrity Constraint Checking: Trading Space for Time. 447-458.
Latha S. Colby, Timothy Griffin, Leonid Libkin, Inderpal Singh Mumick, Howard           View maintenance
Trickey. Algorithms for Deferred View Maintenance. 469-480.
1996 - VLDB
Peter Scheuermann, Junho Shim, Radek Vingralek. WATCHMAN: A Data Warehouse              Caching
Intelligent Cache Manager. 51-62.
Alon Y. Levy, Anand Rajaraman, Joann J. Ordille. Querying Heterogeneous Information     Data integration
Sources Using Source Descriptions. 251-262.
Alon Y. Levy. Obtaining Complete Answers from Incomplete Databases. 402-412.            Data integration
Wilburt Labio, Hector Garcia-Molina. Efficient Snapshot Differential Algorithms for     Detecting changes in the sources
Data Warehousing. 63-74.
Curtis E. Dyreson. Information Retrieval from an Incomplete Data Cube. 532-543.         Incomplete information
Laks V. S. Lakshmanan, Fereidoon Sadri, Iyer N. Subramanian. SchemaSQL - A Language     Integration in general
for Interoperability in Relational Multi-Database Systems. 239-250.
Yannis Papakonstantinou, Serge Abiteboul, Hector Garcia-Molina. Object Fusion in        Integration in general
Mediator Systems. 413-424.
Mark W. W. Vermeer, Peter M. G. Apers. The Role of Integrity Constraints in Database    Integration in general
Interoperation. 425-435.
Damianos Chatziantoniou, Kenneth A. Ross. Querying Multiple Features of Groups in       Processing for aggregates
Relational Databases. 295-306.
Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton,    Processing for aggregates
Raghu Ramakrishnan, Sunita Sarawagi. On the Computation of Multidimensional
Aggregates. 506-521.
Divesh Srivastava, Shaul Dar, H. V. Jagadish, Alon Y. Levy. Answering Queries with      Query rewriting
Aggregation Using Views. 318-329.
Amit Shukla, Prasad Deshpande, Jeffrey F. Naughton, Karthikeyan Ramasamy. Storage       Size estimation for views
Estimation for Multidimensional Aggregates in the Presence of Hierarchies. 522-531.
Martin Staudt, Matthias Jarke. Incremental Maintenance of Externally Materialized       View maintenance
Views. 75-86.
1997 - PODS
Ching-Tien Ho, Jehoshua Bruck, Rakesh Agrawal. Partial-Sum Queries in Data Cubes        Processing for cubes
Using Covering Codes. 228-237.
Catriel Beeri, Alon Y. Levy, Marie-Christine Rousset. Rewriting Queries Using Views     Query rewriting
in Description Logics. 99-108.
Oliver M. Duschka, Michael R. Genesereth. Answering Recursive Queries Using Views.      Query rewriting
109-116.
1997 - SIGMOD
Joseph M. Hellerstein, Peter J. Haas, Helen Wang. Online Aggregation. 171-182.          Incomplete information
Patrick E. O’Neil, Dallan Quass. Improved Query Performance with Variant Indexes.       Indexing
38-49.
Ching-Tien Ho, Rakesh Agrawal, Nimrod Megiddo, Ramakrishnan Srikant. Range Queries      Processing for cubes
in OLAP Data Cubes. 73-88.
Yihong Zhao, Prasad Deshpande, Jeffrey F. Naughton. An Array-Based Algorithm for        Processing for cubes
Simultaneous Multidimensional Aggregates. 159-170.
Nick Roussopoulos, Yannis Kotidis, Mema Roussopoulos. Cubetree: Organization of and     Storage for cubes
Bulk Updates on the Data Cube. 89-99.
Michael J. Carey, Donald Kossmann. On Saying “Enough Already!” in SQL. 219-230.         Top N queries
Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick. Maintenance of Data      View maintenance
Cubes and Summary Tables in a Warehouse. 100-111.
Brad Adelberg, Hector Garcia-Molina, Jennifer Widom. The STRIP Rule System For          View maintenance
Efficiently Maintaining Derived Data. 147-158.
Dallan Quass, Jennifer Widom. On-Line Warehouse View Maintenance. 393-404.              View maintenance
Latha S. Colby, Akira Kawaguchi, Daniel F. Lieuwen, Inderpal Singh Mumick, Kenneth A.   View maintenance
Ross. Supporting Multiple View Maintenance Policies. 405-416.
Divyakant Agrawal, Amr El Abbadi, Ambuj K. Singh, Tolga Yurek. Efficient View           View maintenance
Maintenance at Data Warehouses. 417-427.
1997 - VLDB
Dimitri Theodoratos, Timos K. Sellis. Data Warehouse Configuration. 126-135.            DW design
Jian Yang, Kamalakar Karlapalem, Qing Li. Algorithms for Materialized View Design in    DW design
Data Warehousing Environment. 136-145.
Elena Baralis, Stefano Paraboschi, Ernest Teniente. Materialized Views Selection in a   DW design
Multidimensional Database. 156-165.
Christos Faloutsos, H. V. Jagadish, Nikolaos Sidiropoulos. Recovering Information       Incomplete information
from Summary Data. 36-45.
Vasilis Vassalos, Yannis Papakonstantinou. Describing and Using Query Capabilities of   Integration in general
Heterogeneous Sources. 256-265.
Mary Tork Roth, Peter M. Schwarz. Don’t Scrap It, Wrap It! A Wrapper Architecture for   Integration in general
Legacy Data Sources. 266-275.
Marc Gyssens, Laks V. S. Lakshmanan. A Foundation for Multi-dimensional Databases.      OLAP modeling
106-115.
Kenneth A. Ross, Divesh Srivastava. Fast Computation of Sparse Datacubes. 116-125.      Processing for aggregates
Damianos Chatziantoniou, Kenneth A. Ross. Groupwise Processing of Relational Queries.   Processing for aggregates
476-485.
Laura M. Haas, Donald Kossmann, Edward L. Wimmers, Jun Yang. Optimizing Queries         Query processing over integrated data
Across Diverse Data Sources. 276-285.
H. V. Jagadish, P. P. S. Narayan, S. Seshadri, S. Sudarshan, Rama Kanneganti.           Storage in general
Incremental Organization for Data Recording and Warehousing. 16-25.
Nam Huyn. Multiple-View Self-Maintenance in Data Warehousing Environments. 26-35.       View maintenance
1998 - PODS
John R. Smith, Chung-Sheng Li, Vittorio Castelli, Anant Jhingran. Dynamic Assembly of   DW design
Views in Data Cubes. 274-283.
Phokion G. Kolaitis, David L. Martin, Madhukar N. Thakur. On the Complexity of the      Query containment
Containment Problem for Conjunctive Queries with Built-in Predicates. 197-204.
Phokion G. Kolaitis, Moshe Y. Vardi. Conjunctive-Query Containment and Constraint       Query containment
Satisfaction. 205-213.
Werner Nutt, Yehoshua Sagiv, Sara Shurin. Deciding Equivalences Among Aggregate         Query containment
Queries. 214-223.
Serge Abiteboul, Oliver M. Duschka. Complexity of Answering Queries Using               Query rewriting
Materialized Views. 254-263.
1998 - SIGMOD
Chee Yong Chan, Yannis E. Ioannidis. Bitmap Index Design and Evaluation. 355-366.       Indexing
Prasad Deshpande, Karthikeyan Ramasamy, Amit Shukla, Jeffrey F. Naughton. Caching       Processing for aggregates
Multidimensional Queries Using Chunks. 259-270.
Yihong Zhao, Prasad Deshpande, Jeffrey F. Naughton, Amit Shukla. Simultaneous           Processing for aggregates
Optimization and Evaluation of Multiple Dimensional Queries. 271-282.
Jun Rao, Kenneth A. Ross. Reusing Invariants: A New Strategy for Correlated Queries.    Query processing in general
37-48.
Subbu N. Subramanian, Shivakumar Venkataraman. Cost-Based Optimization of Decision      Query processing over integrated data
Support Queries Using Transient Views. 319-330.
Renée J. Miller. Using Schematically Heterogeneous Structures. 189-200.                 Schema integration
Yannis Kotidis, Nick Roussopoulos. An Alternative Storage Organization for ROLAP        Storage for cubes
Aggregate Views Based on Cubetrees. 249-258.
1998 - VLDB
Amit Shukla, Prasad Deshpande, Jeffrey F. Naughton. Materialized View Selection for     DW design
Multidimensional Datasets. 488-499.
Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, Jeffrey D.        Iceberg queries
Ullman. Computing Iceberg Queries Efficiently. 299-310.
Frédéric Gingras, Laks V. S. Lakshmanan. nD-SQL: A Multi-Dimensional Language for       Integration in general
Interoperability and OLAP. 134-145.
Fernando de Ferreira Rezende, Klaudia Hergula. The Heterogeneity Problem and            Integration in general
Middleware Technology: Experiences with and Performance of Database Gateways. 146-157.
Guido Moerkotte. Small Materialized Aggregates: A Light Weight Index Structure for      Processing for aggregates
Data Warehousing. 476-487.
Michael J. Carey, Donald Kossmann. Reducing the Braking Distance of an SQL Query        Top N queries
Engine. 158-169.
Hector Garcia-Molina, Wilburt Labio, Jun Yang. Expiring Data in a Warehouse. 500-511.   View maintenance
1999 - PODS
Howard J. Karloff, Milena Mihail. On the Complexity of the View-Selection Problem.      DW design
167-173.
Sara Cohen, Werner Nutt, A. Serebrenik. Rewriting Aggregate Queries Using Views.        Query rewriting
155-166.
Stéphane Grumbach, Maurizio Rafanelli, Leonardo Tininini. Querying Aggregate Data.      Query rewriting
174-184.
1999 - SIGMOD
H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava. Snakes and Sandwiches:        Clustering
Optimal Clustering Strategies for a Data Warehouse. 37-48.
Yannis Kotidis, Nick Roussopoulos. DynaMat: A Dynamic View Management System for        DW design
Data Warehouses. 371-382.
Kevin S. Beyer, Raghu Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg         Iceberg queries
CUBEs. 359-370.
Ramana Yerneni, Chen Li, Hector Garcia-Molina, Jeffrey D. Ullman. Computing             Integration in general
Capabilities of Mediators. 443-454.
Peter J. Haas, Joseph M. Hellerstein. Ripple Joins for Online Aggregation. 287-298.     Processing for aggregates
Arunprasad P. Marathe, Kenneth Salem. Query Processing Techniques for Arrays.           Query processing for arrays
323-334.
Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Y. Levy, Daniel S. Weld. An      Query processing over integrated data
Adaptive Query Execution System for Data Integration. 299-310.
Chen-Chuan K. Chang, Hector Garcia-Molina. Mind Your Vocabulary: Query Mapping          Query processing over integrated data
Across Heterogeneous Information Sources. 335-346.
Wilburt Labio, Ramana Yerneni, Hector Garcia-Molina. Shrinking the Warehouse Update     View maintenance
Window. 383-394.
1999 - VLDB
Vanja Josifovski, Tore Risch. Integrating Heterogenous Overlapping Databases through    Integration in general
Object-Oriented Transformations. 435-446.
Felix Naumann, Ulf Leser, Johann Christoph Freytag. Quality-driven Integration of       Integration in general
Heterogenous Information Systems. 447-458.
Alin Deutsch, Lucian Popa, Val Tannen. Physical Data Independence, Constraints, and     Integration in general
Optimization with Universal Plans. 459-470.
Laks V. S. Lakshmanan, Fereidoon Sadri, Subbu N. Subramanian. On Efficiently            Integration in general
Implementing SchemaSQL on an SQL Database System. 471-482.
H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava. What can Hierarchies do for   OLAP modeling
Data Warehouses? 530-541.
Torben Bach Pedersen, Christian S. Jensen, Curtis E. Dyreson. Extending Practical       OLAP modeling
Pre-Aggregation in On-Line Analytical Processing. 663-674.
Kian-Lee Tan, Cheng Hian Goh, Beng Chin Ooi. Online Feedback for Nested Aggregate       Processing for aggregates
Queries with Multi-Threading. 18-29.
Alfons Kemper, Donald Kossmann, Christian Wiesner. Generalised Hash Teams for Join      Processing for aggregates
and Group-by. 30-41.
Chee Yong Chan, Yannis E. Ioannidis. Hierarchical Prefix Cubes for Range-Sum Queries.   Processing for aggregates
675-686.
Sunita Sarawagi. Explaining Differences in Multidimensional Aggregates. 42-53.          Processing for cubes
Jianzhong Li, Doron Rotem, Jaideep Srivastava. Aggregation Algorithms for Very Large    Processing for cubes
Compressed Data Warehouses. 651-662.
Surajit Chaudhuri, Luis Gravano. Evaluating Top-k Selection Queries. 399-410.           Top N queries
Donko Donjerkovic, Raghu Ramakrishnan. Probabilistic Optimization of Top N Queries.     Top N queries
411-422.