MAC: Conceptual Data Modeling for OLAP Aris Tsois Nikos Karayannidis Knowledge and Database Knowledge and Database Systems Laboratory, NTUA Systems Laboratory, NTUA atsois@dblab.ece.ntua.gr nikos@dblab.ece.ntua.gr Timos Sellis Knowledge and Database Systems Laboratory Department of Electrical and Computer Engineering National Technical University of Athens (NTUA) Zografou 15773, Athens, Greece timos@dblab.ece.ntua.gr The typical data flow path involves the gathering of data from various sources into data warehouse systems and Abstract then the usage of those data in the multidimensional analysis process through the use of OLAP applications. In this paper we address the issue of conceptual modeling Multidimensional analysis mainly involves the of data used in multidimensional analysis. We view the computation of aggregated information using a large problem from the end-user point of view and we describe volume of detailed data. The information is analysed a set of requirements for the conceptual modeling of real- based on its detailed or derived properties (dimensions) world OLAP scenarios. Based on those requirements we using an almost static business model (hierarchies). The then define a new conceptual model that intends to reader is referred to [ChDa97] [Inmo96] [Kimb97] capture the static properties of the involved information. [Olap97] for an overview of Data Warehousing and In its definition we use a minimal set of well-understood OLAP. In the following we will assume the reader to be OLAP concepts like dimensions, levels, hierarchies, familiar with the terminology used in those areas. measures and cubes. The central concept of the model is the Multidimensional Aggregation Cube (MAC), which A fundamental issue faced by vendors of OLAP gives a broad and flexible definition to the notion of a applications as well as by researchers in the OLAP multidimensional cube. We evaluate our model against domain is the modeling of data. The well-studied other existing multidimensional models and show that conceptual and logical models used in other database MAC offers a unique combination of modeling skills. Our areas, like the E/R model or the relational model, do not main contribution is the definition of the basic concepts of seam to be sufficient for the OLAP case our model; although the set of requirements and the ([Kimb96][TBC99][S++98][Kimb97]). Vendors have evaluation of all related models against those adopted various models, while standardization bodies and requirements represent an additional result. researchers have developed and studied additional models. All those models share some common concepts like measures or hierarchies but there is still no formally 1 Introduction defined and widely accepted (logical or conceptual) data model. As proved by the history of the relational model a In the last years On-Line Analytical Processing (OLAP) common data model is the key for the collaboration and [Codd93] has become a major research area in the the rapid progress in an area. database community [ChDa97]. The OLAP research is tightly coupled with the research in data warehouses, In this paper we address the problem of modeling real- which are considered to be the information sources based world OLAP scenarios at the conceptual level. The on which On-Line Analytical Processing is performed. current common practice is to use the well-known E/R model [BCN92] and then to annotate the schema with The copyright of this paper belongs to the paper’s authors. Permission to copy any additional OLAP specific information. Still, various without fee all or part of this material is granted provided that the copies are not authors argue that the E/R model is not appropriate for made or distributed for direct commercial advantage. OLAP scenarios since concepts like dimensions, Proceedings of the International Workshop on Design and hierarchies and cubes can only be partially represented. Management of Data Warehouses (DMDW'2001) As a result, two publications ([TBC99] and [S++98]) Interlaken, Switzerland, June 4, 2001 already proposed extensions to the E/R model for the (D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.) multidimensional paradigm. Their approach is mainly http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/ suitable for the ODS (operational data store) part of data A. Tsois, N. Karayannidis, T. Sellis 5-1 warehouses as they concentrate on the representation of into the data warehouse from the OLTP system. For each the source-detailed data. sales transaction the OLTP system records the following information: In this paper we consider a slightly different approach, • The date of the transaction. where the information used in multidimensional analysis • The cashier ID where the transaction took place. is the primary target of our modeling concepts. The • The ID of the products being sold. information used in such an analysis process is mainly • The customer ID. aggregated data at various aggregation levels, or • The sales price for each product being sold. combination of such levels. Furthermore, the dimensions, the particular aggregation levels as well as the various We assume that all the above information is somehow hierarchies defined on dimensions represent information stored in the data warehouse. We are not going to talk used during the analysis. about the design process of the data warehouse, neither about how data is loaded from the OLTP system since our In order to define a useful conceptual model we first model does not address those issues. The MAC model, investigate a set of example queries and derive a list of which we propose, is mainly suitable for the users of the modeling requirements. Based on those requirements we data warehouse, the persons that analyze the information then define the concepts of our model and their semantics. through the use of an OLAP application. The central concept of our model is the Multidimensional Aggregation Cube (MAC), which is equivalent to an n- As described by a plethora of OLAP papers [Mendel], the way relationship relating measure values to a set of multidimensional analysis is mainly based on drill-down, dimension values. A careful definition of dimension roll-up, slice and dice operations that are performed on a values allows a single MAC to represent measure values multidimensional view of data. Measures values are of arbitrary aggregation levels. This is an essential selected and aggregated using various predefined difference with respect to the other conceptual models and dimensions, dimension levels and hierarchies. The can be used to simplify the schema of the various OLAP dimension levels, the aggregation paths defined by scenarios. An additional novelty of our model is the hierarchies, the dimensions and the measures are the main explicit modeling of analysis paths, a feature quite concepts used in the analysis. For our example scenario important for OLAP applications. assume that the analysis is performed on the sales price (price of sold items) using the hierarchies defined in Generally speaking, the concepts used in Figure 1. multidimensional analysis are mapped directly to corresponding concepts of the MAC model. As a result, $QDO\VLV 3DWKV 'LPHQVLRQV the proposed MAC model allows OLAP scenarios to be 5HVLGHQFH 5HVLGHQFH 5HVLGHQFH 3 &XVWRPHU modeled in a natural and straightforward way. $UHD &LW\ 5HJLRQ Furthermore, its abilities to model complex dimensions &OLHQW and hierarchies and the broad definition of cubes makes it 3 &XVWRPHU 3URIHVVLRQ suitable for highly complicated OLAP applications. 3 'DWH 0RQWK _s*1.1 there is a functional dependency among them. In our Year.1999 Store_Area._x ALL ALL _s example, cell_A refers to sales measured per Product and Cashier while cell_C refers to sales measured per Brand Table 2: The query Q4 and Cashier. Those cells, although defined at different levels of granularity, can be part of the same cube. This is An additional advantage of our approach is that the most due to the definition of the dimension domain, which typical OLAP operations: drill-down, roll-up, slice and states that all members of all participating levels are valid dice are translated to simple selection queries on the cube. dimension values. For example, a drill-down on the results of Q4 to the Product level can be expressed by simply replacing Brand We believe that the above property of our cubes is crucial with Product at the Item coordinate of Table 2. for the compact and intuitive representation of multidimensional data. As explained in the section 2, the Nevertheless, there is an important argument against this OLAP users usually handle data defined over various modeling approach. The cube can represent cells with levels of granularity. Queries may impose selection measures that are functionally dependent but it cannot conditions on various dimension levels and may require guaranty their consistency. The measure of cell_C could A. Tsois, N. Karayannidis, T. Sellis 5-8 have been 50 in Table 1 without violating in any way the During the last few years a plethora of multidimensional definition of the cube C1. Still, from a semantically point data models for data warehouses and OLAP have been of view this value would be inconsistent with the values proposed. A comparison of some of them can be found in of cell_A and cell_B. This inability to guaranty [VaSe99] and [SBH99]. We are currently aware of 12 consistency comes from total absence of the involved models that have been published in research papers. Most aggregation functions. Each measure of a cube is of them are logical data models and only few ([TBC99] semantically related to some aggregation. In the above [S++98]) can be considered as purely conceptual. Each of examples the measure Sum of sales is obviously related those models has taken a somehow different modeling to the SUM aggregation function. Still, the cube definition approach ranging from a simple global table to does not include this relationship making impossible any sophisticated object classes. consistency checking. In order to demonstrate that our model is not ‘yet another’ We have intentionally chosen not to include the multidimensional model we evaluated all 12 published aggregation functions in the MAC model for a number of models against the requirements described in section 2. reasons. We decided not to define operations, not to This may not be fair for the purely logical models since include aggregation functions and not to cover any other the requirements represent conceptual modeling needs. functional aspects of data in the current model because Still, the evaluation is done only to demonstrate that none those aspects are orthogonal to the defined concepts. The of the models published so far has the expressive power model can be extended to include aggregation functions, of MAC. In fact the evaluation shows that one consistency-checking algorithms and operations without requirement is not satisfied by any of the models and even changing any of the existing MAC concepts. Furthermore, for the remaining requirements there is no model a separate functional model can cover those needs. It satisfying all of them. Since our model can satisfy all the seams that a separate functional model is more suitable requirements of the evaluation we argue that our proposal because each application domain requires its own specific is an improvement to the existing status. aggregation functions, operations and consistency semantics. The requirements of our evaluation are presented in the following list. Each requirement states what the model For example in some applications it might be acceptable should be capable of representing within a schema. to view and analyze data that includes inconsistent parts 1. Levels within dimension (even in the form of simple (maybe due to missing, incomplete or wrong attributes). information). A separate functional model can be tailored 2. Grouping/classification relationships among levels. to cope with operations on such data. By forcing 3. Many-to-many type of grouping/classification consistency at the data modeling level we would make relationships. our model inappropriate for those applications. 4. N-way grouping/classification relationships that relate n dimension levels. 4 Related Work 5. Grouping/classification relationships that do not require total participation of the involved levels. Modeling multidimensional data is not an OLAP specific 6. Analysis paths. issue. In the database community, several research areas 7. Multiple measures as part of one concept. like statistical databases, scientific databases, 8. Measures defined at any granularity level – for each geographical databases and temporal databases deal with involved dimension. multidimensional data. Still, each of these areas has 9. Measure values defined over various granularity particular modeling needs and has developed specialized levels as part of one concept. multidimensional data models. The area closest to data 10. Measure values characterized, for some of its warehouses and OLAP is the statistical database area dimensions, by more than one dimension level [Shos97] where several multidimensional models have members. been proposed [OOM85], [RR91]. In fact those models where proposed long before the appearance of the term Note that the aggregation level of a measure value is the “OLAP” [Codd93]. lowest dimension level that can be used to characterize this value. Also, an analysis path is a lattice of In the data warehouse and OLAP area the first grouping/classification relationships defined on a set of multidimensional data models where developed by levels. This lattice prevents the user from performing a product vendors as the research in the OLAP domain has meaningless (according to the schema designer) drill- followed the evolution of industrial products. Vendors as down or roll-up operation to an arbitrary -outside the still using and developing their own data models. Also, lattice- level of the dimension. various standardization bodies have defined their own models [Meta97] [Olap97] [TPC99]. Due to space limitations we are not going to discuss any of the previously referenced models but refer the reader to [VaSe99] for an overview and comparison of those models. A. Tsois, N. Karayannidis, T. Sellis 5-9 1 2 3 4 5 6 7 8 9 10 [AGS97]) do so by providing the appropriate [AGS97] 9 9 transformation operations. So, this requirement does not [CaTo98] 9 9 9 9 mean that dimensions and measures are represented in the [DaTh97] 9 9 9 same manner by the model. We believe that our model [GoRi98] 9 9 9 9 9 9 can easily support this requirement through the definition [GyLa97] 9 9 9 9 9 of the proper transformation operations (initially called [Lehn98] 9 9 9 9 Push and Pull by [AGS97]). [LiWa96] 9 9 9 9 [PeJe99] 9 9 9 9 9 9 9 9 5 Conclusions [S++98] 9 9 9 9 9 In this paper we addressed the problem of conceptual [TBC99] 9 9 9 9 9 9 modeling of data used in multidimensional analysis. We [Truj99] 9 9 9 presented a set of modeling requirements through the use [Vass98] 9 9 9 of examples and with those requirements in mind we MAC 9 9 9 9 9 9 9 9 9 9 defined a new conceptual data model, named MAC. The proposed model uses concepts familiar to OLAP users, Table 3: Evaluation of multidimensional models like dimensions, levels, paths, measures and cubes. Those concepts are properly defined in order to allow modeling The result of our evaluation is shown in Table 3. Note that of complicated real-world scenarios. Our evaluation and some of the models ([GyLa97], [Truj99], [AGS97]) comparison to previously published models showed that represent relationships among levels using user-defined MAC offers a unique combination of modeling skills. Our functions, which are then used in operations. Also, other model is the first user-centric conceptual model to define models ([LiWa96], [Lehn98], [DaTh97]) leave the cubes as multi-granularity relationships making both relationships to be defined by the particular data instances schemas and queries much more simple and intuitive. The and provide no schema definition for them. In both cases model defines dimension levels, drilling relationships, we considered that the requirements 2,3,4,5 involving dimension paths and dimensions as first-class and grouping/classification relationships as part of the schema standalone concepts, making it possible to share those are not met. concepts among multiple cubes. Furthermore, the complexity of drilling relationships and the usage of The requirement not met by any of the models is the analysis paths in the definition of dimensions are concept of an analysis path. We believe that this additional novelties of our model taking a step beyond the information is an important structural part of the classical multiple hierarchies. Finally, note that the dimension design and it should be represented at the definition of dimension domains implicitly represents a conceptual level. straightforward method for semantic query optimization at both the schema and the instance level. Although our model seams to be able to model a broader range of OLAP scenario than other proposed models, Future work includes the definition of MAC as an there are a few requirements mentioned in several papers extension to the E/R model and the research of a suitable ([Codd93], [PeJe99], [TBC99] [GyLa97]), which are not logical model on which concepts of our model can be satisfied by MAC. In our opinion, the most important of mapped. We also plan to define a functional model that such requirements is the support for correct aggregation will include aggregation functions, derived measures, and of data. As described in [LeSh97] the measures cannot operations and will define the summarizability [LeSh97] always be consistently aggregated by an arbitrary of measures as well as other consistency rules. aggregation function. In order to provide support for correct aggregations the model must include additional Acknowledgements information regarding measures and This work has been partially funded by the European’s grouping/classification relationships. Our model does not Union Information Society Technologies Programme include such additional information since we believe that (IST) under project EDITH (IST-1999-20722). this kind of information, as well as information about aggregation functions and derived measures, can be References described by an independent functional model which will supplement MAC. [AGS97] R. Agrawal, A. Gupta, S. Sarawagi: Modeling Multidimensional Databases. Proc. of the ICDE 1997. A second important requirement stated by various papers ([GyLa97], [PeJe99], [AGS97]) is the need for symmetric [BCN92] C. Batini, S. Ceri, S. Navathe: Conceptual treatment of dimensions and measures. It is important to Database Design. Benjamin/Cummings, 1992. note that what the authors finally mean by symmetric treatment is the ability to transform a measure into a [CaTo98] L. Cabibbo, R. Torlone: A Logical Approach to dimension and the other way around. All models claiming Multidimensional Databases. Proc. of the EDBT to support this requirement ([GyLa97], [PeJe99], 1998. A. Tsois, N. Karayannidis, T. Sellis 5-10 [ChDa97] S. Chaudhuri, U. Dayal: An overview of Data [PeJe99] T. B. Pedersen, C. S. Jensen: Multidimensional Warehousing and OLAP technology. ACM SIGMOD Data Modeling of Complex Data. Proc. of the ICDE Record, 26(1), March 1997. 1999. [Codd93] E. F. Codd: Providing OLAP to user-analysts: [RR91] M. Rafanelli, F.L. Ricci: A functional model An IT mandate. E.F. Codd and Associates, 1993. for macro-databases. SIGMOD Record, 20(1), March 1991. [DaTh97] A. Datta, H. Thomas: A conceptual Model and an algebra for On-Line Analytical Processing in Data [SBH99] C. Sapia, M. Blaschka, G. Höfling: An Warehouse. Proc. of the WITS 1997. Overview of Multidimensional Data Models for OLAP. Technical Report 1999. [G++96] J. Gray et al.: Data Cube: A Relational http://www.forwiss.tu-muenchen.de/ Aggregation Operator Generalizing Group-By, Cross- Tab and Sub-Totals. Proc. of the ICDE 1996. [S++98] C. Sapia, M. Blaschka, G. Höfling, B. Dinter: Extending the E/R model for the Multidimensional [GoRi98] M. Golfarelli, S. Rizzi: A Methodological Paradigm. Proc. of the DWDM 1998. Framework for Data Warehouse Design. Proc. of the DOLAP 1998 [Shos97] A. Shoshani: OLAP and statistical databases: Similarities and differences. Proc. of the PODS 1997. [GyLa97] M. Gyssens, L.V.S. Lakshmanan: A Foundation for Multi-Dimensional Databases. Proc. [TBC99] N. Tryfona, F. Busborg, J. G. B. Chistiansen: of the VLDB 1997. starER: A Conceptual Model for Data Warehouse Design. Proc. of the DOLAP 1999. [Inmo96] W.H. Inmon: Building the Data Warehouse. John Wiley & Sons, 1996. [TPC99] TPC: TPC Benchmark H and TPC Benchmark R. Transaction Processing Council. June 1999. [Kimb96] R. Kimball: The Data Warehouse Toolkit. John http://www.tpc.org/ Wiley & Sons, 1996. [Truj99] J. Trujillo: The GOLD model: An Object [Kimb97] R. Kimball: A Dimensional Modeling Oriented multidimensional data model for Manifesto. DBMS, August 1997. multidimensional databases. Proc. of the ECOOP 1999. [Lehn98] W. Lehner: Modeling Large Scale OLAP Scenarios. Proc. of the EDBT 1998 [VaSe99] P. Vassiliadis, T. Sellis: A Survey of Logical Models for OLAP Databases. SIGMOD Record 28(4), [LeSh97] H. Lenz, A. Shoshani: Summarizability in Dec. 1999. OLAP and Statistical Databases. In Proc. of the SSDBM 1997. [Vass98] P. Vassiliadis: Modeling Multidimensional Databases, cube and cube operations. Proc. of the [LiWa96] C. Li, X. S. Wang: A Data Model for SSDBM 1998. Supporting On-Line Analytical Processing. In Proc. of the CIKM 1996. [Mendel] A. O. Mendelzon: Data warehousing and OLAP: a research-oriented bibliography . http://www.cs.toronto.edu/~mendel/dwbib.html. [Meta91] Metadata Coalition: Meta Data Interchange Specification. (MDIS Version 1.1), August 1997. [Olap97] OLAP Council: OLAP and OLAP Server Definitions. 1997 http://www.olapcouncil.org/reasearch/glossary.htm [OOM85] G. Ozsoyoglu, M. Ozsoyoglu, F. Mata: A Language and a Physical Organization Technique for Summary Tables. Proc. of the SIGMOD 1985. A. Tsois, N. Karayannidis, T. Sellis 5-11