Gulliver in the land of data warehousing: practical experiences and observations of a researcher

Panos Vassiliadis
National Technical University of Athens, Department of Electrical and Computer Engineering, Computer Science Division, Knowledge and Database Systems Laboratory, Zografou 15773, Athens, Greece
pvassil@dbnet.ece.ntua.gr

Abstract
The gap between researchers and practitioners is widely discussed in the IT community. The purpose of this paper is to show the issues which occupy both research and practice, and the extent to which these issues have any overlap, in the field of data warehousing. To achieve this goal we first present the current status and tendencies in data warehouse research. Then we list several practical problems as they appear in the relevant literature, based also on our personal experience. Finally, we try to bring the relationship between research and practice into a unified big picture.

1. Introduction
The gap between researchers and practitioners is widely discussed in the IT community. The situation regarding data warehousing seems to follow the general pattern where practitioners complain that their practical problems are overlooked by research and researchers are generally dissatisfied with the acceptance of their ideas in industry. Let us quote some excerpts from the results of the previous DMDW workshop [GJSV99]: «Although many solutions were developed for interesting subproblems... combining these partial and often very abstract and formal solutions to an overall design methodology and warehousing strategy is still left to the practitioners...», «... the influence of the research results on the commercial stream of data warehouse products is very limited...», «The gap between data warehouse practice and research became obvious ...».
The purpose of this paper is to show the issues which occupy research and practice, and the extent to which these issues have any overlap. The ultimate goal is to show possible new areas of research, based on practical problems, and at the same time to give an idea of how practice could benefit from research results which seem to be rather ignored.
To this end we divide the paper in three parts. The first part appears in Section 2, where we present the «good news» for data warehousing and, more specifically, the current status of the data warehouse industry in terms of profit and sales, as well as the status of research. To present the status of the research we have listed and classified the papers relevant to data warehousing in three major database conferences during the last five years and tried to show the tendencies of the research based on this study. The second part of the paper deals with problems and failures during data warehouse projects and appears in Section 3. The discussion is based both on the relevant literature (which is surprisingly small) and on the author’s personal experiences. Based on the problems which we detect in the previous paragraphs, we then proceed to relate the data warehouse lifecycle with potential problems and solutions proposed by the research community. Finally, we give some concluding remarks on the reasons for the gap between the research and practice communities.

The copyright of this paper belongs to the paper’s authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage.

2. The Good News: Money and Research
There is good news for the data warehouse field: sales are increasing at high rates and research is achieving a standard focus on the field.
We will briefly summarize the importance of the field by mentioning the financial figures in subsection 2.1 and quickly proceed to subsection 2.2 where we discuss the main subject of this section, which is the status and the tendencies of the research in data warehousing.

Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW'2000), Stockholm, Sweden, June 5-6, 2000 (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.) http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/ P.Vassiliadis 12-1

2.1 The Money
Selling products related to data warehousing is a business making money. As mentioned in a report by Merrill Lynch at the end of 1998 [ShTy98], the estimation was that the data warehousing market was going to expand in the next few years. The numbers are surprisingly large: the data mart market was expected to have a 40% compounded annual growth rate (CAGR) and the RDBMS sales for data warehouse purposes a CAGR of 25%, reaching total sales of $2.2 billion. The OLAP report [Pend00] mentions that the sales have reached $2.5 billion for OLAP tools (including implementation services) and that they are expected to grow at a 20% rate in 2000 and at a CAGR of 19% for a five-year period. Fig. 1 shows the estimated sales, along with the CAGR, for six categories of tools.

Category                                      1998    1999    2000    2001    2002  CAGR (%)
RDBMS sales for DW                           900.0  1110.0  1390.0  1750.0  2200.0      25.0
Data Marts                                    92.4   125.0   172.0   243.0   355.0      40.0
ETL tools                                    101.0   125.0   150.0   180.0   210.0      20.1
Data Quality                                  48.0    55.0    64.5    76.0    90.0      17.0
Metadata Management                           35.0    40.0    46.0    53.0    60.0      14.4
OLAP (including implementation services)*     2000    2500    3000    3600    4000      18.9
Fig. 1 Estimated sales in millions of dollars [ShTy98] (*estimates are from [Pend00]).

2.2 The Research
Research in the field of data warehousing is flourishing. Sessions dedicated to data warehousing have appeared in most of the major conferences of the data management discipline. Several workshops have appeared [GJSV99, DOLAP] and there is even a dedicated conference for data warehouse issues [DaWaK].
To obtain an overview of the tendencies of research in the past five years we have selected three prestigious database conferences, namely PODS, SIGMOD and VLDB, and classified their papers which are relevant to the data warehouse area. We included any papers we found relevant to data warehousing, except for the ones relevant to data mining (to retain a clear-cut separation between the two fields). We restricted ourselves to just three conferences, since our goal is to give a general feeling of the situation in the research field, rather than conduct a thorough survey of the topic. Based on the content of the papers, we classified them into several categories, shown in Fig. 3. For reasons of better presentation and understanding, we group these categories into larger groups, referred to as “super-categories”. Of course, several papers could fit in more than one category; still, we followed a naïve approach and attributed each paper to only one category. Naturally, we do not claim to be perfect: it is possible that some papers have been left out of our study, or classified under a category which was not the most suitable. We apologize in advance for any such occurrences, although we scrutinized the proceedings to avoid this kind of problem. Also, it is possible that the contribution of a paper in one category could be accompanied by results in another “correlated” category. We believe that the results which we present are not far from the ones which could be produced from a more elaborate categorization of the papers, which would take this issue into consideration. Still, there is no proof for this statement and the issue remains open (although we believe it is outside the scope of this paper).
As one can see in Fig. 2, the number of papers seems to reach stability. Although the research interest is rather young (only 5 years old) we anticipate that the tendency is to keep a standard number of papers in the major conferences. The drop in the number of papers in 1998 could be easily justified by the strange explosion in the number of papers relevant to data mining during that particular year. It is very interesting to see that during the last five years there have been 99 papers relevant to data warehousing, which makes 20 papers per year on average.

Year             1995  1996  1997  1998  1999
No. of Papers       9    20    26    19    25
Fig. 2 Number of papers in PODS/SIGMOD/VLDB by year.

We have identified 22 categories of research fields where the interest of the researchers has been drawn. In the sequel, we list the most popular out of them (Fig. 4).
- Data warehouse design: the problem lies in detecting the set of views to materialize in the data warehouse, in order to achieve the optimal operational cost (i.e., the combined cost of querying and refreshing the contents of the warehouse).
- Query rewriting: the problem lies in reusing existing views, to rewrite a query posed over the sources. An alternative name for the problem could be ‘Answering queries using views’.
- Integration: this is a wide area covering several issues. The general context is that several sources containing operational data exist in the environment of the data warehouse and a unique interface must be provided in order to query / update them. The problem of integration is definitely larger than the area of data warehousing, especially with the current advances in Web technology. Note that in our survey we excluded all papers on integration that seemed clearly oriented towards semi-structured or Web data.
- Processing for relational aggregates: the area includes structures and algorithms for the efficient processing of aggregate queries. We discriminate this area from query rewriting, in the sense that these papers deal with results that could directly be implemented in a DBMS. We also discriminate the area from the papers involving processing for cubes, which we found more focused on MOLAP databases.
- View maintenance: the problem lies in keeping the data warehouse views in accordance with the changes happening in the source data.
The big picture of the area is made clear in Fig. 5, classifying the papers in higher-level super-categories. The classification is based on the grouping of Fig. 3. The most popular super-categories so far have been Query Processing, View Technology, Integration and Redundancy Exploitation. Query processing involves all techniques to efficiently process requests and answer queries. It involves six categories and 29% of the research performed in the past years. View technology is also a large category, focused on view maintenance techniques as well as the physical data warehouse design process. Integration, which has been previously described, involves producing a single interface for the processing of distributed heterogeneous data, along with query processing techniques for that cause and resolution of conflicts at the schema level. Redundancy exploitation is a field where theoreticians are mostly interested, involving query containment and rewriting.

Category                                Super-Category
Incomplete information                  Incomplete information
Data integration                        Integration
Integration in general
Query processing over integrated data
Schema integration
OLAP modeling                           OLAP modeling
Caching                                 Query Processing
Iceberg queries
Processing for aggregate queries
Processing for cubes
Query processing in general
Top N queries
Query containment                       Redundancy Exploitation
Query rewriting
Clustering                              Storage Management
Indexing
Storage for cubes
Storage in general
Detecting changes in the sources        View Technology
Data warehouse design
Size estimation for views
View maintenance
Fig. 3 Grouping of paper categories to super categories

Probably the most interesting graph is depicted in Fig. 6, grouping the papers by year and super-category. In this figure we see the evolution with respect to the passing of time. One can see a dropping interest in the view technology issues, which is rather normal since people originally thought of data warehouses as collections of materialized views. Although we believe that this attitude is still present in the research community, there seems to be a level of saturation in the problems regarding view technology. At the same time, the interest in query processing rises continuously from year to year, probably due to the standard tendency of database researchers towards this field.

[Bar chart "Papers by Category": one bar per category of Fig. 3, vertical axis from 0 to 18 papers.]
Fig. 4 Number of papers in PODS/SIGMOD/VLDB by category.

Super Category             Papers
Incomplete information          3
OLAP modeling                   3
Storage Management              6
Redundancy Exploitation        12
Integration                    20
View Technology                26
Query Processing               29
Fig. 5 Number of papers in PODS/SIGMOD/VLDB by super category.
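The figures quoted in this section can be cross-checked with simple arithmetic. The following Python sketch is only an illustration under stated assumptions: the numbers are transcribed from Fig. 1, Fig. 2 and Fig. 5, all names (`SALES`, `cagr_percent`, `share_percent`) are hypothetical, and the CAGR is assumed to be computed as (last/first)^(1/4) - 1 over the four yearly steps from 1998 to 2002, which may not be exactly how [ShTy98] and [Pend00] derived their rates.

```python
# Sanity check of the numbers quoted in Section 2 (illustrative sketch only).
# Values are transcribed from Fig. 1, Fig. 2 and Fig. 5; names are hypothetical.

# (sales in 1998, sales in 2002, reported CAGR in %), in millions of dollars
SALES = {
    "RDBMS sales for DW": (900.0, 2200.0, 25.0),
    "Data Marts": (92.4, 355.0, 40.0),
    "ETL tools": (101.0, 210.0, 20.1),
    "Data Quality": (48.0, 90.0, 17.0),
    "Metadata Management": (35.0, 60.0, 14.4),
    "OLAP": (2000.0, 4000.0, 18.9),
}

PAPERS_BY_YEAR = {1995: 9, 1996: 20, 1997: 26, 1998: 19, 1999: 25}

PAPERS_BY_SUPER_CATEGORY = {
    "Incomplete information": 3,
    "OLAP modeling": 3,
    "Storage Management": 6,
    "Redundancy Exploitation": 12,
    "Integration": 20,
    "View Technology": 26,
    "Query Processing": 29,
}


def cagr_percent(first: float, last: float, periods: int) -> float:
    """Compound annual growth rate in percent (assumed formula, see lead-in)."""
    return ((last / first) ** (1.0 / periods) - 1.0) * 100.0


def share_percent(counts: dict, key: str) -> float:
    """Share of one entry in the total of all entries, in percent."""
    return 100.0 * counts[key] / sum(counts.values())


if __name__ == "__main__":
    for name, (first, last, reported) in SALES.items():
        computed = cagr_percent(first, last, periods=4)
        print(f"{name}: computed {computed:.1f}%, reported {reported:.1f}%")

    total = sum(PAPERS_BY_YEAR.values())
    print("total papers:", total)
    print("papers per year:", total / len(PAPERS_BY_YEAR))
    print("query processing share:",
          round(share_percent(PAPERS_BY_SUPER_CATEGORY, "Query Processing")))
```

Under these assumptions the per-year and per-super-category counts add up to the same total, and the recomputed growth rates agree with the reported ones to one decimal place.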
Papers by year and type 12 10 8 6 4 2 0 Incomplete Query Redundancy Storage View Integration OLAP modeling inf ormation Processing Exploitation Management Technology 1995 2 2 5 1996 1 6 5 2 6 1997 2 3 1 6 2 3 9 1998 4 6 4 2 3 1999 7 2 10 2 1 3 Fig. 6 Number of papers in PODS/SIGMOD/VLDB by year and super category. P.Vassiliadis 12-4 There are areas like incomplete information and storage four categories, namely design, technical, procedural and management which seem to lose interest as time passes. sociotechnical factors (Fig. 7). Redundancy exploitation keeps a standard interest due to According to [ShTy98], the average time for the its dedicated audience of theoreticians. Integration and construction of a data warehouse is 12 to 36 months and OLAP modeling seem to gain interest at the same time. the average cost for its implementation is between $1 The probable reasons for the former are due to the million to $1.5 million. Data marts are a less risky criticism against the materialized nature of data expenditure, since they cost hundreds of thousands of warehousing. As for the latter, it is possible that the lack dollars and take less than a year to implement. Still, if a of a standard OLAP model plays its role to the increasing project of such nature is dependent on so many factors in interest in this category. order to succeed, then the self-contemplating statements on the state-of-the-art on data warehouse management are 3. Data Warehouse Problems and Failures rather unrealistic. In the sequel, we will take a short look to the particular factors of failure for data warehouse An objective observer facing the facts of the previous projects. As far as the design factors are concerned, there section would directly conclude that the area of data is an obvious deficit in the part of a “textbook” warehousing thrives and the potential for further growth is methodology for the design of a data warehouse. There more than probable. 
Although this seems to be a quite are no standard, or even widely accepted, metadata accurate description of the situation, we argue that a data management techniques1 or languages, data engineering warehouse project is a great risk and is definitely techniques or design methodologies for data warehouses. endangered by several factors. We intend to back up this Rather, proprietary solutions from vendors, or do-it- statement by concrete arguments based both on our yourself advice from experts seem to define the personal practical experience in the field and relevant landscape. If we look to the relevant research papers, the literature. picture is disappointing: the three major conferences on data management are not really concerned with issues like Category of Factors metadata management or design methodologies for data Factors warehouses. There exist, though, relevant areas such as Design Factors Lack of metadata management the research on the physical data warehouse design and Problematic data engineering the integration issues. Still, a closer look will reveal that Unrealistic schema design the research seems to target problems not really close to Client tools are neglected or dominate the the practical ones. For example, the assumptions made for design the design problem are rather unrealistic (knowledge of No design method is used user queries, their sizes and frequencies) with respect to Technical Factors Choice of wrong components practical cases. Also, the integration problem is definitely Vendor claims are not tested oriented toward a uniform API to distributed sources, i.e., No examination of volume of queries, data sets to languages and mechanisms that enable the querying of and network traffic data. 
Still, problems like extraction, transformation and Procedural Improper project scope cleaning which can take up to 80% of the time spent in Factors the development of a data warehouse [Dema97], seem to Bad use of pilot projects be ignored by the research community. User communities are not involved in the design The technical factors also reveal the absence of research No test of new management requirements in the confrontation of practical problems. There exist, of Lack of training for stakeholders course, standards for the evaluation of software Sociotechnical Data warehouses cross organizational treaty components, but there is a gap in the evaluation and Factors lines choice of hardware components. As one can see in Fig. 8, Data ownership and access are reconsidered hardware costs up to 60% of a data warehouse budget due to the presence of a data warehouse (disk, processor and network costs). Critical software The work practices of user communities are (DBMS and client tools) which is purchased (and not affected developed in-site) take up to 16% of the budget. There are no papers to our knowledge that deal with issue of Fig. 7 Factors affecting the failure of data hardware/software selection for data warehouse warehousing projects [Dema97]. environments. As for the estimation of the sizes of queries, data sets and network traffic, a closer look to the A very good discussion on the problems of data warehousing projects is found in [Dema97]. The paper 1 [ShTy98] reports that the lack of a common metadata mentions the logical fact that nobody really speaks about standard (despite the existence of the MDIS standard at data warehousing failures and goes on to group the the end of 1998) is the basic source for concern for reasons for the failure of a data warehousing project into metadata management tools. P.Vassiliadis 12-5 appendix will reveal only one (!) paper on the estimation warehouse. We refer the interested reader to [Gree00, of view sizes [SDNR96]. 
The fact that the average size of Dema97] for further probing on this very interesting data warehouses increases year by year makes the issue. problem even tougher. Back in 1996 the average data As for the sociotechnical issues, it is also very interesting warehouse size was estimated to be around 250 GB. In to briefly discuss the relevant factors, since there is very today’s data explosion there is even talk about scientific little reference to this kind of problems in the literature. data warehouses of 40 TB [SGKT00]. This means that According to [Dema97], breaking the organizational despite Moore’s law and the drop in the cost of storage treaties is a consequence of the fact that the data units, size is still a problem for data warehousing. The warehouse may reorganize the way the organization increasing number of users increases the complexity of works and intrude the functional or subjective domain of the problem. [ShTy98] mentions the case of a data the stakeholders. For example, imposing a particular warehouse involving 20.000 users with an annual increase client tool to the users invades the users’ desktop, which of 2.000 users per year. Obviously, estimating the size of is considered to be their personal “territory”. The materialized views or user queries is of great importance, problems due to the data ownership and access are in this context. grouped in two categories. First, data ownership is power within an organization. Any attempt to share or take control over somebody else’s data is equivalent with loss metadata of power of this particular stakeholder. Secondly, no activity design 5% monitor division or department can claim to possess 100% clean, 2% data monitor access/analysi 2% error-free data. The possibility of revealing the data s tools 6% disk storage quality problems within the information system of the DBMS 30% department is definitely frustrating for the affected 10% stakeholders. 
Finally, the invasion in the work practice network costs reduces to the psychological reason that no user 10% processor integration costs community seems to be really willing to shift from gut and 20% feeling or experience to objective, data driven transformation 15% DW Design Costs management (see [Dema97] for a broader discussion). To top the entire skepticism about the non-technical Fig. 8 Data warehouse design costs according to Bill problems and reasons of failure, ethical considerations Inmon [Inmo97] can be added to the big picture of data warehousing. In [Smit97] several such thoughts are presented: Is it fair to use customers’ data to harm their relationships with their periodic security verification of administration occasional suppliers/customers? Is it fair to use such data to intrude reorganization of the conformance to the enterprise 1% data your customers’ know-how? Is it fair to use customers’ summary table usage analysis data model 1% data archiving data to change the structure of your organization in a way 2% 2% 1% that is detrimental to your customers? Is it fair to use metadata capacity management planning personal data of individual customers without any prior 1% 3% notice? end-user training DW refreshment Most of the aforementioned reasons for failure are backed 55% 6% up from other testimonial literature (e.g., [Paul97], monitoring of activity and data servicing data [ShTy98]). 7% mart requests for data 21% Recurring DW costs 3.1 Personal Experience Fig. 9 Data warehouse recurring costs according to The author has been involved in both research and Bill Inmon [Inmo97] practical data warehouse projects, during the last six years. Our research experience was mainly the European The procedural and sociotechnical reasons are not really basic research project “DWQ: Foundations for Data technical reasons with which we should expect the Warehouse Quality” [JaVa97]. Obviously, some of the research society to deal with. 
We mention them for criticism and comments in this paper are influenced by reasons of completeness and in order to show how the research conducted in this project. We apologize for sensitive a project like the construction of a data this clear bias; still, since this paper presents the author’s warehouse is. The procedural factors involve reasons for personal judgments we believe that we should make clear deficiencies concerning the deployment of the data what has possibly influenced our opinion. warehouse. Apart from classical problems in IS The author has also been involved in three rather small management, it is important to notice that the role of user practical data warehouse projects. The first involved communities is crucial: the end-users must be trained to loading data from all the health centers (i.e., hospitals, the new technologies and included in the design of the provincial medical centers and other special kinds of P.Vassiliadis 12-6 centers) in Greece into an enterprise data warehouse. The as well as all the refreshment processes of the fact loading of data was performed annually and the querying table and the materialized views that used it. It was was supposed to be performed mostly by pre-canned only the consistent naming of all the software reports. Still, quite a lot of flexibility was provided to the components that helped us perform this task. user to filter, roll-up drill-down and drill-through the data. Note that the project experienced no political problems. The data warehouse was rather small and its construction The data warehouse was requested by the same took around 12 months. The major problems encountered department that previously owned the data. 
The new were not technical, since (a) the size of data was not so system would still be under the control of this particular big, (b) the refreshment window was not a problem and department and would thus synchronize and clean the (c) there was no real problem in reconciling the source information they provided to higher management. Note data. Still, there were major problems with the also that we never came to direct contact with the end- administration team of the legacy system due to the users: this was supposed to be a task undertaken by the following reasons: administration team of this particular department. Thus, − Lack of training of the target administration team. we have no knowledge for the real success of this project. The people administering the legacy COBOL-based In a second occasion, we had to build a data warehouse system were the ones who would administer the new with pension data. The data were to be updated monthly system, too. Still, this was their first experience with and used by pre-canned reports. The size of data involved the relational technology and this was definitely a a few million rows per month. The source data relied cultural shock for them. again on a COBOL-based legacy system. The project − Involvement of the administration team of the legacy lasted nine months and could be characterized more as the system in the design of the new system. Although it is construction of a data mart rather than the construction of clear that no data warehouse can be built without the a full data warehouse. In this case, the major problem was involvement of the source administrators, our personal of political nature: different departments were involved in experience suggests that this should be limited to the the ownership of the information. The people construction of the data warehouse enterprise model administering the legacy system were definitely affected (or even only to the reverse engineering of legacy by the construction of the warehouse. 
These people data). Any attempt to include people without the − would lose the full ownership of the information proper background in a process they do not really (which translates to sheer power in the IT understand, seems to jeopardize the while effort, department); rather than train / accustom them to the new system. − would have to take care of the transportation and − Poor quality of legacy data. The toughest problem in conversion of the data in their own system (which this particular problem was the cleaning of data. Each means extra workload for both people and systems) circuit in the schema seemed to be a sui generis and situation. Most important, we faced big difficulties − any deficiencies of the information they produced trying to convince the administrators of the legacy would be revealed (a fact of enormous importance and system for the poor quality of their data. Another big effect in the public sector). problem was the detection of which sources were Bearing all this in mind, it quite straightforward to reliable. In a COBOL system there is too much understand the difficulties raised. Moreover, it was redundancy, since each application uses its own data interesting to see that the higher management, although store. Every now and then, the different COBOL files committed to the idea of constructing the data warehouse, are synchronized, although this is not always 100% was unable to force things to happen and had to take an successful. When building the data warehouse, it is a approach that peacefully resolved any problems that hard task to determine the quality of each candidate occurred, in order to salvage the project from total failure. data source. Another problem we had to face in this project was the − Data warehouse evolution. The business rules for the difficulty in constructing the extraction and cleaning data warehouse are likely to change even during the software. 
construction of the warehouse itself. The problem is hard, since (a) it sets the whole project back in schedule and cost, (b) it psychologically frustrates the development team and (c) the lack of a metadata management repository makes it almost impossible to detect which part of the database or the applications has to be synchronized with the new situation. Imagine, for example, the case where the primary key of a fact table has to change a couple of weeks before completing the project. In our case, we had to detect and evolve around 50 pre-canned reports

The extraction of data from the legacy systems is a highly complex, error-prone and tiring procedure. To give an idea of the problem, let us mention the case where the task involved detecting relevant data in a COBOL file, converting EBCDIC to ASCII format, unpacking the packed numbers, reducing all address fields to a standard format and loading the result into a table in the data warehouse. Apart from the standard tool offered by Oracle for these purposes (SQL*Loader) we did not use any commercial tool for these tasks. This seems to be the tactic followed by the majority of data warehousing projects. According to [ShTy98], most of the companies contacted for their survey estimate that more than 1/3 of the cost and time is spent on ETL tasks during the development process. Still, in spite of the obvious importance of this process, the vast majority of them developed their own application instead of using a tool to facilitate the process. [ShTy98] also reports that data quality products are expensive and hard to use. Based on the time and budget constraints of data warehouse projects, [ShTy98] estimates that the market for such products will grow only modestly in the next few years (with almost the lowest CAGR of all the product categories).

Political problems were apparent in a third case, where the project failed. The organization possessed four legacy systems, all of a different kind (COBOL, Excel and dBase files as well as a relational system). A pilot data mart involving a subset of one of the legacy systems had already been successful and the management was enthusiastic about the whole idea. Still, the project failed before it even started. As we had also observed in the previous case, it seems to be a common phenomenon that the people administrating the legacy system take a little time until they understand what is politically happening to them once a data warehouse is built. In this particular case the reaction was quick and absolute: no data were to be given from the largest legacy system, since its administrators simply refused to provide them. The project was thus canceled. The lesson we learnt in this case is that it takes more than an enthusiastic management and a successful pilot for a data warehouse to succeed. Later, we learned that the warehouse project started again, but we have no knowledge of the fate of this new effort.

3.2 Relationship between Practical Problems and Research Issues

In this section we would like to relate the data warehouse lifecycle to potential problems and to the solutions offered by technology to tackle these particular problems. The first problem in this task is the lack of a concrete "textbook-style" methodology. Reading the two classical books on data warehousing [Inmo96, Kimb96] one gets the feeling that they provide tips and solutions for fragments of the whole process, rather than a concrete methodology for the data warehouse practitioner. We use as a template methodology the one proposed in an Appendix of [Inmo96] and try to relate it to potential problems and technological solutions offered by research. We list only the aforementioned problems and research categories. Again, we do not claim that either list is exhaustive, but rather indicative.

As we can see in Fig. 10, there are areas where research has contributed a lot to the practical problems. For example, several issues of the view technology super-category are (or at least, can be) used by practitioners in data warehouse design and implementation. Also, several topics of the integration super-category can be exploited in practical cases.

Apart from these successes, there are two issues that clearly depict the gap between research and practice. On the one hand, there is an unclear picture of the extent to which practice has exploited the results of research. Query processing and storage management are two research fields aiming to empower the technology providers (i.e., the software and hardware vendors) with better techniques for the storage and acquisition of information. To our knowledge, it is not clear to what extent these results have been incorporated in commercial products. The extent to which results in the fields of incomplete information and redundancy exploitation can be exploited is another pending issue. The former seemed to be a rather promising research field, but the lack of research interest in later years seems to be discouraging for its further exploitation. The latter is a clear field, but we believe that its practical exploitation will take time to be implemented. As far as the data warehouse designer is concerned, the cases where determining the intensional subsumption of two data stores is useful are rather limited. Instead, it is the extensional properties of the data source that count (an issue not really apparent in database research). Finally, OLAP modeling could be very useful in the logical definition of the data warehouse, but the lack of a standard multidimensional hierarchical model seems to drive designers to ad-hoc, proprietary solutions. Still, the relational counterpart, in the form of the ER diagram and the relational model, seems to be a promising precedent.

On the other hand, of course, there seem to be rather big gaps in the table of Fig. 10, with respect to steps in the data warehouse lifecycle which are not supported by the conducted research. The data model analysis could clearly be helped by improved techniques of metadata management (and standards) as well as by data engineering methods that enable the designer to understand and model data and processes better. Breadbox analysis and technical assessment are clearly under-estimated by the research community. Techniques to analyze data volume, network traffic, and the relevance and quality of software components would be greatly appreciated by data warehouse designers. The extraction process also suffers from a lack of help from the research community: as already mentioned, most research performed has been dedicated to what should be extracted (instead of how this extraction is performed). The practical aspects of extraction are clearly neglected (e.g. declarative languages and visual interfaces for the management of the extraction process, automation of the extraction programs, etc.). The problem is vast due to the sui generis nature of each kind of source (ASCII data are different from ISAM or database data) and of each particular source itself. The peculiarities of the conversion process are also, more or less, neglected.
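To make the extraction case of Section 3.1 more concrete (a COBOL file with EBCDIC text, packed numbers and free-form address fields), the following Python sketch shows what one such hand-written conversion step might look like. It is an illustration under stated assumptions, not the code used in the project: the record layout (RECORD_LAYOUT), the field names, the choice of the cp037 EBCDIC code page, the accepted sign nibbles and the very crude address normalization are all hypothetical.

```python
import re
from decimal import Decimal

# Hypothetical fixed-width record layout: (field name, offset, length, kind).
RECORD_LAYOUT = [
    ("customer", 0, 10, "text"),
    ("address", 10, 20, "address"),
    ("amount", 30, 4, "packed"),  # assumed: 7 digits, the last 2 are decimals
]
RECORD_LENGTH = 34
AMOUNT_SCALE = 2          # assumed number of implied decimal places
EBCDIC_CODEC = "cp037"    # assumed code page; real sources may use another


def unpack_comp3(raw, scale=0):
    """Decode a packed-decimal (COMP-3) field: two digits per byte,
    with the sign stored in the low nibble of the last byte."""
    if not raw:
        raise ValueError("empty packed field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    if any(n > 9 for n in nibbles):
        raise ValueError("invalid digit nibble in packed field")
    # Assumed convention: C or F means positive, D means negative.
    if sign not in (0x0C, 0x0D, 0x0F):
        raise ValueError("unexpected sign nibble")
    value = int("".join(str(n) for n in nibbles))
    if sign == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def decode_text(raw):
    """Convert an EBCDIC text field to a Python string, dropping padding."""
    return raw.decode(EBCDIC_CODEC).rstrip()


def normalize_address(text):
    """Very crude 'standard format': collapse whitespace and upper-case."""
    return re.sub(r"\s+", " ", text).strip().upper()


def convert_record(record):
    """Turn one fixed-width legacy record into a row ready for loading."""
    if len(record) != RECORD_LENGTH:
        raise ValueError("unexpected record length")
    row = {}
    for name, offset, length, kind in RECORD_LAYOUT:
        raw = record[offset:offset + length]
        if kind == "packed":
            row[name] = unpack_comp3(raw, AMOUNT_SCALE)
        elif kind == "address":
            row[name] = normalize_address(decode_text(raw))
        else:
            row[name] = decode_text(raw)
    return row
```

In practice the code page, the field layout and the sign conventions differ from source to source, which is the sui generis nature of sources discussed above; the converted rows would then typically be written to a flat file and handed to a bulk loader.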
Columns: Phase; Lifecycle step; Description; Potential problems; Solutions offered by the research

Phase: Decision to build the warehouse
  Potential problems: Improper project scope; Bad use of pilot projects; Data warehouses cross organizational treaty lines; Data ownership and access are reconsidered; The work practices of user communities are affected

Phase: Design
  Lifecycle step: Data Model Analysis
    Description: Conceptual and logical model
    Potential problems: No design method is used; User communities are not involved in the design; Lack of metadata management; Problematic data engineering; Lack of training of the target administration team; Excessive involvement of the administration team of the legacy system in the design of the new system
    Solutions offered by the research: OLAP modeling
  Lifecycle step: Breadbox Analysis
    Description: Size estimation for the data
    Potential problems: No examination of volume of queries, data sets and network traffic
    Solutions offered by the research: Size estimation for views
  Lifecycle step: Technical Assessment
    Description: Definition of technical requirements
    Potential problems: No test of new management requirements
  Lifecycle step: Technical Environment Preparation
    Description: Definition of network, storage, OS, software components, etc.
    Potential problems: Client tools are neglected or dominate the design; Choice of wrong components; Vendor claims are not tested
  Lifecycle step: Subject Area (per subject)
    Description: Decision which subject area to populate
  Lifecycle step: Source System Analysis (per subject)
    Description: Identification of proper source for the data and reverse engineering of the selected source
    Potential problems: Difficulty in determining which source is appropriate, due to quality problems
  Lifecycle step: Data Warehouse Database Design
    Description: Physical database design for the data warehouse
    Potential problems: Unrealistic schema design
    Solutions offered by the research: Physical DW design, Indexing

Phase: DW implementation
  Lifecycle step: Program Specifications (per subject)
    Description: Formalize the interface between source data and warehouse
    Potential problems: Data warehouse evolution
    Solutions offered by the research: View Maintenance, Data & Schema Integration
  Lifecycle step: Programming (per subject)
    Description: Construction of the appropriate software for ETL purposes
    Potential problems: Poor quality of legacy data; Difficulty in constructing the S/W correctly
    Solutions offered by the research: Detecting changes in the sources
  Lifecycle step: Population (per subject)
    Description: Load the warehouse with data
    Potential problems: Difficulty in using the data quality tools

Phase: Report Implementation (per report)
  Lifecycle step: Determine data needed
    Description: Decide which part of the data warehouse covers the data for the report
  Lifecycle step: Program to extract data
    Description: Write a program to get the data from the DW
  Lifecycle step: Customize the data
    Description: Customize the data for the user's intuition
  Lifecycle step: Refine the analysis
    Description: Is the report suitable for what it was intended?

Phase: Usage
  Lifecycle step: Use the reports
    Potential problems: Lack of training for stakeholders
  Lifecycle step: Institutionalize
    Description: Should the report be institutionalized?

Fig. 10 Data warehouse lifecycle steps, potential problems and solutions offered by the research community

We believe that a shift in the interest of the research community, from the virtual querying of distributed heterogeneous data sources and intensional reconciliation towards the practical aspects of extracting materialized data, could benefit practitioners a lot. Finally, it is unclear to what extent procedural and sociotechnical factors (involved mostly at the beginning and the end of a data warehouse project) could benefit from the use of new technology suggested by research results. This fuzziness alone is a very good reason for research on the part of academia. As reported in [GJSV99], a significant contribution could also be made by the business administration sciences, e.g., in the way the data warehouse is introduced in the corporation.

4. Conclusions

Normally, this is the place for an optimistic message, or the ringing of the bell. For a change, we will do neither. There are two issues, though, that we would like to touch on as concluding remarks. First, is it really the case that research and practice are so far apart? In our humble opinion, the answer is negative. Although research has targeted only a fraction of the possible areas where practitioners could need assistance, the technological contribution of the research community is significant. For example, let us mention the case of data warehouse refreshment. Despite the problems in the extraction step, which we have already mentioned, the refreshment process is of significant importance for the proper operation of the data warehouse. The recurring costs for data warehouse refreshment come up to 55% of the overall cost for running a data warehouse (Fig. 9). Still, the contribution is only in areas where the existing technology could be enhanced, without any methodological results or groundbreaking research in new fields.

Secondly, why is it that researchers are found away from the practical problems of data warehousing? This is a widely discussed issue (e.g., there is a standing debate in the Communications of the ACM magazine). We point out only a few reasons that have come to our attention:
- It is possible that several researchers are not aware of the real-world problems. The major motivation for writing this paper was a discussion with a researcher visiting our department. This person has devoted a great deal of time, programming and energy to the data warehouse design problem. Still, he believed that the data warehouse is simply a set of "DECLARE VIEW" statements. Clearly, this was a problem of lack of direct contact with practical problems.
- It is not always rewarding, in terms of research, to deal with practical problems. The extraction process of our case study, which we mentioned in Section 3, might give an example for this statement. Which researcher would feel happy to work on such a 'dirty' problem, knowing that it will be too hard to make publications out of such an effort? It is not strange, thus, that so much theoretical work has been devoted to view maintenance issues, with respect to what should be propagated to the warehouse, while few research efforts have been made as to how this extraction and propagation is to be made. We believe that it would be really hard for a paper concerning practical automation techniques for the data extraction task to convince an academic audience. The last Asilomar report [BBC+98] states the need for "groundbreaking" instead of "delta" research; still, it is not clear which practical issues concerning data warehousing qualify under this definition.
- The rules that govern the behavior of science also apply in the case of data warehousing. It is commonly agreed that it is the Paradigm that determines the interesting problems and not vice-versa. In our case, the paradigm set by the papers of Codd and Selinger et al. has, more or less, set the landscape for the research in the data warehouse field, too. For example, although a great deal of work has been devoted to query processing for aggregate queries, these queries are still treated in isolation. Yet an OLAP session is a sequence of steps, which have some logical interrelationship. How many papers do you know dealing with this particular property of OLAP? As another example, we simply recall the technical and design problems mentioned in Section 3, which, although of great importance, are not addressed by the research. We believe that one of the reasons for this situation is the non-standard nature of these problems, which puts them outside the scope of the relational paradigm.

As for the future, it is hard to make any predictions. Is data warehousing going to be virtual (making all our comments on the integration problem void, and the research conducted in this field highly useful)? Is there going to be a shift towards methodological issues in data warehouses? Are the gaps in Fig. 10 going to be filled? Although the answer is 'I don't know', at least on our part, it is a challenge to work on these issues, thus contributing to the closing of the gap between research and practice and making data warehousing an easier and less risky endeavor for practitioners and organizations.

5. References

[BBC+98] P.A. Bernstein, M.L. Brodie, S. Ceri, D.J. DeWitt, M.J. Franklin, H. Garcia-Molina, J. Gray, G. Held, J.M. Hellerstein, H.V. Jagadish, M. Lesk, D. Maier, J.F. Naughton, H. Pirahesh, M. Stonebraker, J.D. Ullman. The Asilomar Report on Database Research. SIGMOD Record 27(4): 74-80 (1998)
[Comp96] ComputerWire Inc. Data Warehouse Economics: ROI doubts? Data Warehouse Tools Bulletin, November 1996. Available at http://www.computerwire.com/dwtb/free/2112_182.htm
[DaWaK] International Conference on Data Warehousing and Knowledge Discovery (DaWaK). http://www.informatik.uni-trier.de/~ley/db/conf/dawak/index.html
[Dema97] M. Demarest. The politics of data warehousing. Available at http://www.hevanet.com/demarest/marc/dwpol.html
[DOLAP] International Workshop on Data Warehousing and OLAP (DOLAP). http://www.pages.drexel.edu/faculty/songiy/dolap.html, http://www.informatik.uni-trier.de/~ley/db/conf/dolap/index.html
[GJSV99] S. Gatziu, M.A. Jeusfeld, M. Staudt, Y. Vassiliou. Design and Management of Data Warehouses - Report on the DMDW'99 Workshop. SIGMOD Record 28(4), December 1999. Refers to the International Workshop DMDW'99 at CAiSE'99, Heidelberg, Germany, June 1999. Online version available at http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19
[Gree00] L. Greenfield. Data Warehousing Political Issues. February 2000. Available at http://www.dwinfocenter.ord/politics.html
[Inmo96] W.H. Inmon. Building the Data Warehouse. John Wiley & Sons, March 1996.
[Inmo97] B. Inmon. The Data Warehouse Budget. DM Review Magazine, January 1997. Available at http://www.dmreview.com/master.cfm?NavID=55&EdID=1315
[JaVa97] M. Jarke, Y. Vassiliou. Foundations of data warehouse quality – a review of the DWQ project. In Proc. 2nd Intl. Conference Information Quality (IQ-97), Cambridge, Mass., 1997. Available at http://www.dblab.ece.ntua.gr/~dwq
[Kimb96] R. Kimball. The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley & Sons, February 1996.
[Paul97] L.G. Paul. Anatomy of a failure. CIO Magazine. November 15, 1997. Available at http://www.cio.com/archive/enterprise/111597_data_content.html
[Pend00] N. Pendse, February 24, 2000. The OLAP Report. Available at http://www.olapreport.com/Market.htm.
[SDNR96] A. Shukla, P. Deshpande, J.F. Naughton, K. Ramasamy. Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. In Proceedings of 22nd International Conference on Very Large Databases (VLDB), Mumbai India 1996.
[SGKT00] A. Szalay, J. Gray, P. Kunszt, A. Thakar. Designing and Mining Multi-Terabyte Astronomy Archives. SIGMOD Conference 2000. Also available at http://www.research.microsoft.com/~gray/
[ShTy98] C. Shilakes, J. Tylman. Enterprise Information Portals. Enterprise Software Team. November 1998. Available at www.sagemaker.com/company/downloads/eip/indepth.pdf.
[Smit97] J. Smith. Do Data Warehouses Challenge Fair Play? Beyond Computing, 6(4), May 1997. Available at www.beyondcomputingmag.com/archive/1997/5-97/ethics.html

Appendix

Paper | Category

1995 - PODS
Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, Divesh Srivastava. Answering Queries Using Views. 95-104. | Query rewriting
Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman. Answering Queries Using Templates with Binding Patterns. 105-112. | Query rewriting
H. V. Jagadish, Inderpal Singh Mumick, Abraham Silberschatz. View Maintenance Issues for the Chronicle Data Model. 113-124. | View maintenance

1995 - SIGMOD
Ashish Gupta, Inderpal Singh Mumick, Kenneth A. Ross. Adapting Materialized Views after Redefinitions. 211-222. | View maintenance
Yue Zhuge, Hector Garcia-Molina, Joachim Hammer, Jennifer Widom. View Maintenance in a Warehousing Environment. 316-327. | View maintenance
Timothy Griffin, Leonid Libkin. Incremental Maintenance of Views with Duplicates. 328-339. | View maintenance
James J. Lu, Guido Moerkotte, Joachim Schü, V. S. Subrahmanian. Efficient Maintenance of Materialized Mediated Views. 340-351. | View maintenance

1995 - VLDB
Weipeng P. Yan, Per-Åke Larson. Eager Aggregation and Lazy Aggregation. 345-357. | Processing for aggregates
Ashish Gupta, Venky Harinarayan, Dallan Quass. Aggregate-Query Processing in Data Warehousing Environments. 358-369. | Processing for aggregates

1996 - PODS
Alon Y. Levy, Anand Rajaraman, Jeffrey D. Ullman. Answering Queries Using Limited External Processors. 227-237. | Query rewriting

1996 - SIGMOD
Richard Hull, Gang Zhou. A Framework for Supporting Data Integration Using the Materialized and Virtual Approaches. 481-492. | Data integration
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. 205-216. | DW design
Leonid Libkin, Rona Machlin, Limsoon Wong.
A Query Language for Multidimensional Arrays: Design, Implementation, and Optimization Techniques. 228-239. | Processing for cubes
Sudhir Rao, Antonio Badia, Dirk Van Gucht. Providing Better Support for a Class of Decision Support Queries. 217-227. | Query processing in general
Kenneth A. Ross, Divesh Srivastava, S. Sudarshan. Materialized View Maintenance and Integrity Constraint Checking: Trading Space for Time. 447-458. | View maintenance
Latha S. Colby, Timothy Griffin, Leonid Libkin, Inderpal Singh Mumick, Howard Trickey. Algorithms for Deferred View Maintenance. 469-480. | View maintenance

1996 - VLDB
Peter Scheuermann, Junho Shim, Radek Vingralek. WATCHMAN: A Data Warehouse Intelligent Cache Manager. 51-62. | Caching
Alon Y. Levy, Anand Rajaraman, Joann J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. 251-262. | Data integration
Alon Y. Levy. Obtaining Complete Answers from Incomplete Databases. 402-412. | Data integration
Wilburt Labio, Hector Garcia-Molina. Efficient Snapshot Differential Algorithms for Data Warehousing. 63-74. | Detecting changes in the sources
Curtis E. Dyreson. Information Retrieval from an Incomplete Data Cube. 532-543. | Incomplete information
Laks V. S. Lakshmanan, Fereidoon Sadri, Iyer N. Subramanian. SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems. 239-250. | Integration in general
Yannis Papakonstantinou, Serge Abiteboul, Hector Garcia-Molina. Object Fusion in Mediator Systems. 413-424. | Integration in general
Mark W. W. Vermeer, Peter M. G. Apers. The Role of Integrity Constraints in Database Interoperation. 425-435. | Integration in general
Damianos Chatziantoniou, Kenneth A. Ross. Querying Multiple Features of Groups in Relational Databases. 295-306. | Processing for aggregates
Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, Sunita Sarawagi. On the Computation of Multidimensional Aggregates. 506-521. | Processing for aggregates
Divesh Srivastava, Shaul Dar, H. V. Jagadish, Alon Y. Levy. Answering Queries with Aggregation Using Views. 318-329. | Query rewriting
Amit Shukla, Prasad Deshpande, Jeffrey F. Naughton, Karthikeyan Ramasamy. Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. 522-531. | Size estimation for views
Martin Staudt, Matthias Jarke. Incremental Maintenance of Externally Materialized Views. 75-86. | View maintenance

1997 - PODS
Ching-Tien Ho, Jehoshua Bruck, Rakesh Agrawal. Partial-Sum Queries in Data Cubes Using Covering Codes. 228-237. | Processing for cubes
Catriel Beeri, Alon Y. Levy, Marie-Christine Rousset. Rewriting Queries Using Views in Description Logics. 99-108. | Query rewriting
Oliver M. Duschka, Michael R. Genesereth. Answering Recursive Queries Using Views. 109-116. | Query rewriting

1997 - SIGMOD
Joseph M. Hellerstein, Peter J. Haas, Helen Wang. Online Aggregation. 171-182. | Incomplete information
Patrick E. O'Neil, Dallan Quass. Improved Query Performance with Variant Indexes. 38-49. | Indexing
Ching-Tien Ho, Rakesh Agrawal, Nimrod Megiddo, Ramakrishnan Srikant. Range Queries in OLAP Data Cubes. 73-88. | Processing for cubes
Yihong Zhao, Prasad Deshpande, Jeffrey F. Naughton. An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. 159-170. | Processing for cubes
Nick Roussopoulos, Yannis Kotidis, Mema Roussopoulos. Cubetree: Organization of and Bulk Updates on the Data Cube. 89-99. | Storage for cubes
Michael J. Carey, Donald Kossmann. On Saying "Enough Already!" in SQL. 219-230. | Top N queries
Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick. Maintenance of Data Cubes and Summary Tables in a Warehouse. 100-111. | View maintenance
Brad Adelberg, Hector Garcia-Molina, Jennifer Widom. The STRIP Rule System For Efficiently Maintaining Derived Data. 147-158. | View maintenance
Dallan Quass, Jennifer Widom. On-Line Warehouse View Maintenance. 393-404. | View maintenance
Latha S. Colby, Akira Kawaguchi, Daniel F. Lieuwen, Inderpal Singh Mumick, Kenneth A. Ross. Supporting Multiple View Maintenance Policies. 405-416. | View maintenance
Divyakant Agrawal, Amr El Abbadi, Ambuj K. Singh, Tolga Yurek. Efficient View Maintenance at Data Warehouses. 417-427. | View maintenance

1997 - VLDB
Dimitri Theodoratos, Timos K. Sellis. Data Warehouse Configuration. 126-135. | DW design
Jian Yang, Kamalakar Karlapalem, Qing Li. Algorithms for Materialized View Design in Data Warehousing Environment. 136-145. | DW design
Elena Baralis, Stefano Paraboschi, Ernest Teniente. Materialized Views Selection in a Multidimensional Database. 156-165. | DW design
Christos Faloutsos, H. V. Jagadish, Nikolaos Sidiropoulos. Recovering Information from Summary Data. 36-45. | Incomplete information
Vasilis Vassalos, Yannis Papakonstantinou. Describing and Using Query Capabilities of Heterogeneous Sources. 256-265. | Integration in general
Mary Tork Roth, Peter M. Schwarz. Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. 266-275. | Integration in general
Marc Gyssens, Laks V. S. Lakshmanan. A Foundation for Multi-dimensional Databases. 106-115. | OLAP modeling
Kenneth A. Ross, Divesh Srivastava. Fast Computation of Sparse Datacubes. 116-125. | Processing for aggregates
Damianos Chatziantoniou, Kenneth A. Ross. Groupwise Processing of Relational Queries. 476-485. | Processing for aggregates
Laura M. Haas, Donald Kossmann, Edward L. Wimmers, Jun Yang. Optimizing Queries Across Diverse Data Sources. 276-285. | Query processing over integrated data
H. V. Jagadish, P. P. S. Narayan, S. Seshadri, S. Sudarshan, Rama Kanneganti. Incremental Organization for Data Recording and Warehousing. 16-25. | Storage in general
Nam Huyn. Multiple-View Self-Maintenance in Data Warehousing Environments. 26-35. | View maintenance

1998 - PODS
John R. Smith, Chung-Sheng Li, Vittorio Castelli, Anant Jhingran. Dynamic Assembly of Views in Data Cubes. 274-283. | DW design
Phokion G. Kolaitis, David L. Martin, Madhukar N. Thakur. On the Complexity of the Containment Problem for Conjunctive Queries with Built-in Predicates. 197-204. | Query containment
Phokion G. Kolaitis, Moshe Y. Vardi. Conjunctive-Query Containment and Constraint Satisfaction. 205-213. | Query containment
Werner Nutt, Yehoshua Sagiv, Sara Shurin. Deciding Equivalences Among Aggregate Queries. 214-223. | Query containment
Serge Abiteboul, Oliver M. Duschka. Complexity of Answering Queries Using Materialized Views. 254-263. | Query rewriting

1998 - SIGMOD
Chee Yong Chan, Yannis E. Ioannidis. Bitmap Index Design and Evaluation. 355-366. | Indexing
Prasad Deshpande, Karthikeyan Ramasamy, Amit Shukla, Jeffrey F. Naughton. Caching Multidimensional Queries Using Chunks. 259-270. | Processing for aggregates
Yihong Zhao, Prasad Deshpande, Jeffrey F. Naughton, Amit Shukla. Simultaneous Optimization and Evaluation of Multiple Dimensional Queries. 271-282. | Processing for aggregates
Jun Rao, Kenneth A. Ross. Reusing Invariants: A New Strategy for Correlated Queries. 37-48. | Query processing in general
Subbu N. Subramanian, Shivakumar Venkataraman. Cost-Based Optimization of Decision Support Queries Using Transient Views. 319-330. | Query processing over integrated data
Renée J. Miller. Using Schematically Heterogeneous Structures. 189-200. | Schema integration
Yannis Kotidis, Nick Roussopoulos. An Alternative Storage Organization for ROLAP Aggregate Views Based on Cubetrees. 249-258. | Storage for cubes

1998 - VLDB
Amit Shukla, Prasad Deshpande, Jeffrey F. Naughton. Materialized View Selection for Multidimensional Datasets. 488-499. | DW design
Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, Jeffrey D. Ullman. Computing Iceberg Queries Efficiently. 299-310. | Iceberg queries
Frédéric Gingras, Laks V. S. Lakshmanan. nD-SQL: A Multi-Dimensional Language for Interoperability and OLAP. 134-145. | Integration in general
Fernando de Ferreira Rezende, Klaudia Hergula. The Heterogeneity Problem and Middleware Technology: Experiences with and Performance of Database Gateways. 146-157. | Integration in general
Guido Moerkotte. Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. 476-487. | Processing for aggregates
Michael J. Carey, Donald Kossmann. Reducing the Braking Distance of an SQL Query Engine. 158-169. | Top N queries
Hector Garcia-Molina, Wilburt Labio, Jun Yang. Expiring Data in a Warehouse. 500-511. | View maintenance

1999 - PODS
Howard J. Karloff, Milena Mihail. On the Complexity of the View-Selection Problem. 167-173. | DW design
Sara Cohen, Werner Nutt, A. Serebrenik. Rewriting Aggregate Queries Using Views. 155-166. | Query rewriting
Stéphane Grumbach, Maurizio Rafanelli, Leonardo Tininini. Querying Aggregate Data. 174-184. | Query rewriting

1999 - SIGMOD
H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava. Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. 37-48. | Clustering
Yannis Kotidis, Nick Roussopoulos. DynaMat: A Dynamic View Management System for Data Warehouses. 371-382. | DW design
Kevin S. Beyer, Raghu Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. 359-370. | Iceberg queries
Ramana Yerneni, Chen Li, Hector Garcia-Molina, Jeffrey D. Ullman. Computing Capabilities of Mediators. 443-454. | Integration in general
Peter J. Haas, Joseph M. Hellerstein. Ripple Joins for Online Aggregation. 287-298. | Processing for aggregates
Arunprasad P. Marathe, Kenneth Salem. Query Processing Techniques for Arrays. 323-334. | Query processing for arrays
Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Y. Levy, Daniel S. Weld. An Adaptive Query Execution System for Data Integration. 299-310. | Query processing over integrated data
Chen-Chuan K. Chang, Hector Garcia-Molina. Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. 335-346. | Query processing over integrated data
Wilburt Labio, Ramana Yerneni, Hector Garcia-Molina. Shrinking the Warehouse Update Window. 383-394. | View maintenance

1999 - VLDB
Vanja Josifovski, Tore Risch. Integrating Heterogenous Overlapping Databases through Object-Oriented Transformations. 435-446. | Integration in general
Felix Naumann, Ulf Leser, Johann Christoph Freytag. Quality-driven Integration of Heterogenous Information Systems. 447-458. | Integration in general
Alin Deutsch, Lucian Popa, Val Tannen. Physical Data Independence, Constraints, and Optimization with Universal Plans. 459-470. | Integration in general
Laks V. S. Lakshmanan, Fereidoon Sadri, Subbu N. Subramanian. On Efficiently Implementing SchemaSQL on an SQL Database System. 471-482. | Integration in general
H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava. What can Hierarchies do for Data Warehouses? 530-541. | OLAP modeling
Torben Bach Pedersen, Christian S. Jensen, Curtis E. Dyreson. Extending Practical Pre-Aggregation in On-Line Analytical Processing. 663-674. | OLAP modeling
Kian-Lee Tan, Cheng Hian Goh, Beng Chin Ooi. Online Feedback for Nested Aggregate Queries with Multi-Threading. 18-29. | Processing for aggregates
Alfons Kemper, Donald Kossmann, Christian Wiesner. Generalised Hash Teams for Join and Group-by. 30-41. | Processing for aggregates
Chee Yong Chan, Yannis E. Ioannidis. Hierarchical Prefix Cubes for Range-Sum Queries. 675-686. | Processing for aggregates
Sunita Sarawagi. Explaining Differences in Multidimensional Aggregates. 42-53. | Processing for cubes
Jianzhong Li, Doron Rotem, Jaideep Srivastava. Aggregation Algorithms for Very Large Compressed Data Warehouses. 651-662. | Processing for cubes
Surajit Chaudhuri, Luis Gravano. Evaluating Top-k Selection Queries. 399-410. | Top N queries
Donko Donjerkovic, Raghu Ramakrishnan. Probabilistic Optimization of Top N Queries. 411-422. | Top N queries