Divide and Aggregate: Caching Multidimensional Objects

Wolfgang Lehner, Jens Albrecht, Wolfgang Hümmer
Database Systems (IMMD6), University of Erlangen-Nuremberg
Martensstr. 3, 91058 Erlangen, Germany
{lehner, jalbrecht, huemmer}@informatik.uni-erlangen.de

Abstract

A common optimization technique in the OLAP application domain is the use of summary aggregate data. For an appropriate support of analysis hot spots, we propose a query-based aggregate cache of 'multidimensional objects'. The major drawback of previous query caching methods is that new queries must be completely subsumed by a single cached object in order to utilize it. Our solution is the introduction of 'set-derivability', which allows the combination of several aggregates to derive a single query. The set of cached multidimensional objects is dynamically determined using both the users' access patterns and semantic knowledge of the dimensional structures. Experimental results show that average cost reductions of over 50% are easily reached while spending only 10% of additional storage for summary data.

1 Introduction

Online Analytical Processing (OLAP) offers analysts efficient interactive access to consistent business data, typically stored in a data warehouse. From the physical database design point of view, the two main obstacles to providing efficient multidimensional access to those data are to choose the right partitioning scheme and the right selection of summary tables. Whereas partitioning is a common concept in physical database design, the generation of summary tables is specific to the OLAP application domain with its mostly read-only data access. Due to limited storage capacity, not all possible aggregations may be precomputed. In this paper we present a dynamic caching strategy for multidimensional data incorporating both partitioning schemes and summary tables. Our strategy is easy to implement in relational OLAP servers and provides excellent support of analysis hot spots. Experiments show that average cost reductions of over 50% are easily reached while spending only 10% of additional storage for summary data.

Derivability

To speed up queries, multidimensional objects are cached and reused during runtime. The framework of multidimensional objects provides a consistent and formalized view of database queries as well as of materialized summary data. They may be visualized as a partition (restricted to certain selection predicates) of an aggregated data cube (using several grouping attributes) specified by a regular star-query. A necessary prerequisite for reuse is derivability ([Klug82], [Fink82]). Derivability in the context of partitioned aggregates concerns the following two aspects:

• compatibility of grouping attributes: The precomputed data from which a query may be derived has to have an equal or finer granularity with regard to the corresponding dimensional structures.*
• compatibility of selection predicates: The precomputed partition of summary data must not be more restrictive than the query which is the subject of derivation.

Since predicate comparison in general is NP-complete ([SuKN89]), recent work either does not consider partitioning at all ([HaRU96], [BaPT97]), is limited to a test for equality ([ShSV99]), or is reduced to statically predefined partitions (chunks; [DRSN98]).

Contribution

To overcome these limits and to provide a simple and powerful strategy to select and release precomputed summary data with limited scope, this paper contributes the following:

• set-derivability: Our approach is not limited to deriving a single query from a single summary data table but computes a nearly optimal construction plan.
The plan derives the query from a set of precomputed summary data partitions ('patch-working on multidimensional objects'; section 4).
• use of dimensional structures: Since we allow partitioning only with respect to classification hierarchies (and not for dimensional attributes), algorithms to compose a multidimensional aggregate from a set of cached objects become feasible.
• dynamic caching policy: Since static precomputation of complete data cubes does not adequately support data analysis hot spots, our approach decides 'on-the-fly' which summary data partitions (multidimensional objects) are worth keeping materialized. Moreover, materialization of the same data cube partition at different aggregation levels is explicitly allowed, and the size of a partition is not a system parameter but implicitly determined by the user.
• easy to integrate: Since we perform algebraic optimization, our algorithm may be easily integrated into relational OLAP servers.
• high average cost reduction: As our experiments show, we achieve high cost reductions with only a very modest storage overhead.

*) To illustrate these dependencies, [HaRU96] introduced the notion of an aggregation lattice.

Structure of the Paper

In the next section, we give a comprehensive overview of recent work related to the selection and use of materialized aggregates in the context of OLAP and data warehousing. The third section formally introduces the one- and multidimensional structures of the considered multidimensional model. Based on the notion of multidimensional objects, section 4 defines the concept of set-derivability using 'patch-working' and outlines the four influence factors used to compute the overall benefit of a single cached multidimensional object. Finally, results of various experiments are presented in section 5. The paper concludes with a summary.

2 Related Work

The general idea of precomputing summary data is rooted in the application area of 'Statistical and Scientific Databases'. Several proposals like [ChMM88] were made, but the idea became popular with the emergence of 'Data Warehousing' together with 'Online Analytical Processing' and the resulting need for efficient and mostly read-only access to aggregates in a multidimensional context ([ChSh95]).

Static versus Adaptive Precomputation and Caching Schemes

A first classification of existing work may be made according to the flexibility of the scheme with regard to the user access behavior. Given a set of relational queries, the algorithm of [YaKL97] computes the global query execution plan and selects the node of the resulting plan with the highest number of references for precomputation. [Gupt97] extends this procedure in the context of AND/OR-graphs by considering alternative local query execution plans. Based on the concept of an aggregation lattice, the algorithm specified in [HaRU96] ([GHRU97] additionally considers existing index structures) statically selects those nodes of the lattice with the best cost saving ratio in relation to the additional storage required by each node. Since this approach shows a complexity of O(n³), with n the number of possible nodes of an aggregation lattice, the work of [BaPT97] introduces heuristics to reduce this number and make the approach more applicable.

In contrast to these static algorithms, dynamic or adaptive strategies try to incrementally optimize the set of precomputed aggregates. On the relational level, [ShSV99] materializes the results of arbitrarily complex queries. Unfortunately, by using hash codes they are able to reuse materialized queries only in the case of an exact match. This is not acceptable in a real OLAP scenario. For the case of an exact match with regard to the group-by attributes, the recent work of [DRSN98] proposes a chunk-based caching strategy. The materialization of chunks (see below) is resolved according to a classical CLOCK scheme, thus directly reflecting the user access behavior. In [KoRo99] a similar approach is presented. Again, a single query can only be answered from a single materialized fragment.

Table-, Query-, Chunk-based Partitioning Schemes

The second classification criterion of existing strategies is based on the unit of precomputation. Basically there are three levels, namely table-, query-, and chunk-based strategies.

Algorithms like [HaRU96], [GHRU97], and [BaPT97] do not support partitioning and are therefore classified as table-based strategies. Compatibility of selection predicates does not have to be checked, since all summary tables are defined on an unrestricted scope (the second condition of derivability is trivially satisfied). However, this implies that analysis hot spots like "the current period" are not reflected appropriately, because the algorithms materialize only summary tables containing all periods.

At the opposite end, chunk-based partitioning schemes like [DRSN98] define relatively small aggregate partitions of a fixed size (chunks). Queried chunks are cached and later reused to answer new queries. On the one hand, [DRSN98] reports very high cost saving ratios in their experiments. On the other hand, however, the implementation of special bit-wise index structures to identify single chunks is extremely expensive. Moreover, they use a special chunk-based file system, preventing a seamless integration into relational OLAP servers. Finally, the choice of the chunk size must be statically determined and is crucial for the efficiency of the whole system.

The third class of partitioning schemes is made up by query-based partitioning schemes ([SDJL96], [ShSV99], [YaKL97], [GuHQ95]).

Example 1: The set of dimensions of a typical OLAP sample scenario consists of a product, a time, and a shop dimension.
Within the shop dimension, for example, the primary attribute corresponds to a unique shop identifier. A regional classification may be defined by the sequence city → state → region → country. The set of dimensional attributes may be defined as { shop-type, purchase-class, selling-area, shop-window-size }. In real OLAP scenarios, a single dimension (usually a product or a customer dimension) may easily have 20 different attributes.

Single user queries reflect the unit of caching and make the selection of the right partition size obsolete. Recent work reuses results of former queries either by query containment or by exact match ([SDJL96], [ShSV99]), where the incoming query must be completely derivable from the materialized query. Moreover, it is further argued against query-based partitioning ([DRSN98]) that several materialized queries may address a common part of a data cube. This would result in redundant storage of that part and decrease the utilization of the additional storage.

3 The Multidimensional Model

The following section introduces the dimensional and multidimensional structures which build the formal representation of the multidimensional model. The description is divided into the presentation of dimensions and of multidimensional objects, both of which are necessary to define a multidimensional database ([LeAW98]).

3.1 Dimensional Structures

Dimensional structures are defined in the conceptual schema design phase of a database. The semantic information coded into these structures is heavily used for cache replacement.

Definition 1: A dimension D of a multidimensional database is a set of attributes D = { PA } ∪ { CA1, ..., CAp } ∪ { DA1, ..., DAq } with the following set of functional dependencies between these attributes:
• ∀i (1≤i≤p): PA→CAi and ∀i (1≤i≤q): PA→DAi
  The primary attribute PA of a dimension functionally determines all other attributes of a single dimension. The instances of the primary attribute are called dimensional elements.
• ∀i (1≤i≤p-1): CAi→CAi+1
  The classification attributes CAi of a dimension are used to define a multi-level grouping of the single dimensional elements and are ordered according to their functional dependencies. The aggregation level or granularity of a classification attribute CAi is given by its index i. By definition, the primary attribute has aggregation level 0 (PA = CA0).
• ∀i,j with i≠j (1≤i,j≤q): DAi ↛ DAj
  The dimensional attributes DAi of a dimension represent all attributes describing properties of the dimensional elements.

Although the concept of dimensional attributes is more complex ([LeAW98], [ABD+99]), dimensional attributes may be seen as additional single-level hierarchies on a dimension, which implies that they are functionally independent of each other. Moreover, it is worth mentioning that this definition does not explicitly consider multiple hierarchies with more than one level. Multiple hierarchies would introduce a partial order on the classification attributes (instead of a total order for the single classification hierarchy). This extension would not affect any of the presented derivability or caching algorithms but would complicate the presentation. Therefore, we only consider a single classification hierarchy.

3.2 Multidimensional Structures

The basic data structure in the multidimensional context, the notion of a multidimensional object, is of fundamental importance. With regard to the caching algorithms described in the following section, all incoming queries as well as materialized results of former queries are internally represented as multidimensional objects.

Definition 2: A multidimensional object (MO) is a triple M = (M, S, G) consisting of a measure M, an n-ary vector of classes S = (s1, ..., sn) denoting the scope, and the granularity specification G.
A measure is a tuple M = [N, A, O] where N is a reserved name for the corresponding fact, A ∈ { FLOW, STOCK, VALUEperUNIT } is the aggregation type, and O ∈ { NONE, SUM, AVG, COUNT } is the aggregation function applied to that specific fact.
A granularity specification is a tuple G = [L, P] where L (the 'aggregation Level') is an n-ary vector of classification attributes, one for each dimension (L ∈ { CA^1_0, ..., CA^1_p } ⊗ ... ⊗ { CA^n_0, ..., CA^n_p }), and P (the 'Properties') is a set of dimensional attributes (P ⊆ { DA^1_1, ..., DA^1_q } ∪ ... ∪ { DA^n_1, ..., DA^n_q }).

Example 2: The formal representation of a sales data cube for video products in Germany would be as follows:
M = ( [ SALES, FLOW, SUM ],
      ( p.group = 'Video', s.country = 'Germany' ),
      [ (p.family, s.region), { p.brand, s.shoptype } ] )

The aggregation type determines the set of applicable operations on the numeric values to ensure correct summarizability ([LeSh97]). For example, all aggregation operations are applicable to sales figures (A = FLOW), whereas the sum operation would not be allowed on figures denoting the price of single articles (A = VALUEperUNIT). The aggregation type STOCK prohibits summation in the temporal dimension. Finally, the size of a multidimensional object M at the instance level is denoted by |M|.

Relations and Operations on Multidimensional Objects

To formally define 'derivability' on multidimensional objects, we need to introduce the '≤'-relation for granularity specifications and the intersection of scopes of multidimensional objects. Moreover, we need intersection and difference operators on multidimensional objects themselves.

Definition 3: A granularity specification G = [L, P] is finer than or equal to a granularity specification G' = [L', P'], i.e. G ≤ G', if and only if
• all aggregation levels of L are lower than or equal to the aggregation levels of L', and
• P is a superset of the dimensional attributes with regard to P', i.e. P' ⊆ P,
or
• G represents raw data, i.e. L = (PA1, ..., PAn).

Definition 4: The intersection of multidimensional scopes is defined as the component-wise intersection of the classes in each dimension, i.e. by the homomorphism
S ∩ S' = (s1, ..., sn) ∩ (s'1, ..., s'n) := ( (s1 ∩ s'1), ..., (sn ∩ s'n) ).
The intersection is empty if two classes si and s'i do not show a parent-child relationship within the corresponding classification hierarchy of dimension i.

It is important to note that the problem of intersecting classes in a classification hierarchy may thus be solved by checking simple parent-child relationships. This concept is fundamental for the applicability of the set-derivability mechanism.

Definition 5: The intersection of multidimensional objects M = (M, S, G) and M' = (M', S', G'), i.e. M ∩ M', is defined as M ∩ M' = (M, S ∩ S', G) if and only if
• both objects refer to the same measure, i.e. M = M',
• the intersection of the scopes is not empty, i.e. S ∩ S' ≠ ∅,
• the granularity specification of M is finer than or equal to the granularity specification of M', i.e. G ≤ G'.
As seen in this definition, the coarser multidimensional object M' is implicitly refined to the granularity level G.

To be able to introduce the concept of set-derivability by substitution ('patch-working'; section 4), we need to define the difference of two multidimensional objects. As illustrated in figure 1 for the two-dimensional case, the difference of two n-dimensional cubes would in the worst case result in a set of 3ⁿ−1 convex residual cubes, assuming exactly 3 classification nodes partitioning each dimension. It can be shown that intelligent slicing can reduce this number to 2n residual cubes ([ChMM88], proof of proposition 4).

[Figure 1: Difference of multidimensional objects M − M'; in two dimensions the worst case yields 3²−1 = 8 residual cubes, while intelligent slicing yields 2·2 = 4.]

Informally, the cube is cut iteratively for each dimension into at most three slices; two slices are residual cubes and are not considered any longer; the third slice contains the operand and is sliced further in the remaining dimensions. For n dimensions we obtain 2n residual cubes. In the general case, with dimensions consisting of an arbitrary number of classification nodes, for each dimension i (1≤i≤n) the minimum number pi of partitioning classification nodes including the operand to be subtracted has to be determined. Then the number of residual cubes will be Σ_{1≤i≤n} (pi − 1), by the same argumentation as above. For more detailed information refer to [AlGL99].

Definition 6: The difference of two multidimensional objects M = (M, S, G) and M' = (M', S', G'), i.e. M − M', is defined as a set { (M, S1, G), ..., (M, Sk, G) } where k is the number of residual cubes and the Si are the scopes of the residual objects obeying the following conditions:
S = S' ∪ (S1 ∪ ... ∪ Sk) and S' ∩ (S1 ∪ ... ∪ Sk) = ∅.

Derivability of Multidimensional Objects

To reuse the results of former queries, derivability of multidimensional objects as the units of caching is a necessary prerequisite. The following definition of total derivability will be relaxed in the next section to partial derivability, which is sufficient to define set-derivability.

Definition 7: A measure M = (N, A, O) is derivable from a measure M' = (N', A', O') if and only if
• the measures refer to the same real-world entity, i.e. N = N' and A = A',
• the operations are compatible, i.e. O = O' or O' = NONE.
Note that semi-additive functions like AVG are split into their additive counterparts ({SUM, COUNT} and a final division). The additive operations build a regular subject of the caching mechanism. Thus, by materializing multidimensional objects with the applied operations SUM and COUNT, an AVG-query will never be directly but only indirectly cached.
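To illustrate definitions 3 and 4, the following is a minimal Python sketch of the granularity comparison and of class intersection via parent-child checks. The hierarchy, all class names, and the data layout are invented for illustration; this is not the paper's implementation.

```python
# Toy model: one classification hierarchy as a child -> parent mapping.
PARENT = {
    'erlangen': 'bavaria', 'munich': 'bavaria',
    'bavaria': 'south', 'south': 'germany',
}

def ancestors(c):
    """Yield all ancestors of class c up to the hierarchy root."""
    while c in PARENT:
        c = PARENT[c]
        yield c

def intersect_classes(a, b):
    """Intersection of two classes: the finer one if they stand in an
    ancestor (parent-child) relationship, otherwise None (empty)."""
    if a == b:
        return a
    if a in ancestors(b):        # a is an ancestor, so b is finer
        return b
    if b in ancestors(a):        # b is an ancestor, so a is finer
        return a
    return None                  # siblings: empty intersection

def intersect_scopes(s, s_prime):
    """Definition 4: component-wise intersection of two scopes."""
    parts = tuple(intersect_classes(a, b) for a, b in zip(s, s_prime))
    return None if None in parts else parts

def finer_or_equal(levels, levels_prime):
    """First condition of definition 3 on the aggregation levels
    (0 = primary attribute, larger index = coarser granularity)."""
    return all(l <= lp for l, lp in zip(levels, levels_prime))

print(intersect_classes('bavaria', 'erlangen'))   # 'erlangen'
print(intersect_classes('munich', 'erlangen'))    # None (siblings)
print(finer_or_equal((0, 1), (1, 1)))             # True
```

Because two classes of one hierarchy either nest or are disjoint, no general predicate-subsumption test ([SuKN89]) is needed here — exactly the property the set-derivability mechanism relies on.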
Definition 8: A multidimensional object M = (M, S, G) is totally derivable from a multidimensional object M' = (M', S', G') if and only if
• the measure M in M is derivable from the measure M' in M',
• M' contains M, i.e. S ∩ S' = S,
• the granularity specification of M' is finer than that of M, i.e. G' ≤ G.

4 Set-Derivability using 'Patch-Working'

Although derivability is the fundamental concept for the use of precomputed aggregates, in many cases definition 8 is too restrictive. Consider the example in figure 2. The query object MQ is not derivable from any of M1, M2, or M3 alone. However, it can be derived from a combination of M1, M2 and M3.

[Figure 2: Patch-working — the query object MQ overlaps the cached objects M1, M2 and M3.]

The example illustrates the problem of previous query caching methods ([ShSV99], [SDJL96]), where a materialized aggregate can only be accessed if the complete query is derivable from this single cached object. The reason is that query containment can be checked with a relatively simple algorithm, whereas the determination of the overlapping regions for arbitrary predicates is NP-complete ([SuKN89]). However, in the context of multidimensional objects, where the predicates consist of scope definitions corresponding to classes in a classification hierarchy, this problem can be reduced to the determination of ancestor and sibling relationships.

Definition 9: A multidimensional object M = (M, S, G) is partially derivable from a multidimensional object M' = (M', S', G'), i.e. M' „ M, if and only if
• conditions (1) and (3) of total derivability hold for M and M',
• M and M' are overlapping, i.e. S ∩ S' ≠ ∅.
In this case, M' is said to support M, and M ∩ M' is called the support of M' for M.

From the set of supporters, i.e. the potential candidates, an appropriate set of partitions, or patches, has to be selected in order to answer a query. Such a set is called a substitution for the query.

Definition 10: A set S = { 〈M1, M1'〉, ..., 〈Mk, Mk'〉 } with Mi = (Mi, Si, Gi) and Mi' = (Mi', Si', Gi') is called a substitution for a multidimensional object MQ if and only if
• ∀i (1≤i≤k): Mi = MQ and Gi = GQ,
• ∀i (1≤i≤k): Mi' „ Mi,
• SQ = ∪i Si,
• ∀i,j (1≤i,j≤k) with i≠j: Si ∩ Sj = ∅.
In this case MQ is said to be set-derivable from { M1', ..., Mk' }. Each patch in S has two components 〈Mi, Mi'〉, denoting that partition Mi of MQ is to be computed from Mi'. All partitions must be disjoint and their union must result in MQ.

The Query Optimization Problem

In general, the cache contains several, possibly overlapping aggregates with different granularities which support a query. Therefore, the problem of the multidimensional query optimizer is to choose a substitution that allows the computation of the query at minimal cost. This problem is very similar to the well-known knapsack problem ([CoLR90]), where a thief tries to select the most valuable set of items with different values and weights. A common method is to use a greedy strategy to approximate the optimal solution of the NP-complete 0/1 knapsack problem. In the case of the fractional knapsack problem, where the thief may take arbitrary fractions of items, the greedy approach actually yields the optimum. The problem of finding an optimal substitution can be reduced to the fractional knapsack problem if the cost to compute a fraction of one multidimensional object from another one is monotonous in the size of the support.

Definition 11: A cost function Cost(M, M'), determining the cost to compute the support of a multidimensional object M from M', is monotonous if for any two arbitrary multidimensional objects M' and M'' the following holds:
|M ∩ M'| < |M ∩ M''| ⇒ Cost(M, M') < Cost(M, M'').

In the case of a monotonous cost function, an algorithm selecting the cheapest substitution for each data cell in the multidimensional object M automatically finds the optimal substitution ('greedy-choice property'; [CoLR90]). However, especially in the context of relational databases, monotony of the cost function may not be given. In a relational backend, the computation of a multidimensional object results in range queries on the affected tables representing the supporting multidimensional objects. In the absence of indexes, range queries on relational tables usually result in full table scans, destroying the property of monotony.

In order to find a substitution for a query we use the greedy strategy outlined in algorithm 1. The algorithm gets the query MQ and the set C of all materialized multidimensional objects as input parameters. It returns a substitution S and the estimated cost to compute MQ. As long as there are still uncomputed fragments remaining (line 5), the candidate with the least access cost per cell is selected for each fragment (lines 9-21). After the selected patch is added to the solution S (line 24), it must be removed from the remainder R using the difference operator (line 28).
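As a complement to the pseudocode of algorithm 1, the greedy loop can be sketched in Python over a deliberately simplified model, in which a multidimensional object is reduced to a plain set of cells and Cost(M, M') to a fixed scan cost per cell. All names and the cost model are our own illustrative assumptions, not the paper's implementation.

```python
# Illustrative greedy 'patch-working' substitution over a toy model.
# A solution always exists as long as the raw data is among the
# candidates (cf. the remark on C containing the raw data objects).

def substitute(query_cells, candidates):
    """candidates: {name: (cells, cost_per_cell)}.
    Returns (patches, total_cost) with patches = [(covered_cells, name)]."""
    remainder = [set(query_cells)]        # R := { MQ }
    patches, total = [], 0.0              # S := {}, cost := 0
    while remainder:                      # while R is not empty
        frag = remainder.pop()
        best, best_cells, best_rel = None, None, float('inf')
        for name, (cells, cost_per_cell) in candidates.items():
            support = frag & cells
            if not support:               # candidate does not support frag
                continue
            if cost_per_cell < best_rel:  # cheapest cost per covered cell
                best, best_cells, best_rel = name, support, cost_per_cell
        patches.append((best_cells, best))
        total += best_rel * len(best_cells)
        rest = frag - best_cells          # uncovered difference stays in R
        if rest:
            remainder.append(rest)
    return patches, total

cached = {'agg': ({'a', 'b'}, 1.0), 'raw': ({'a', 'b', 'c', 'd'}, 5.0)}
patches, cost = substitute({'a', 'b', 'c'}, cached)
print(cost)   # 7.0: cells a, b from 'agg' (cost 2), cell c from 'raw' (cost 5)
```

Note that with a constant per-cell cost the sketch trivially has the monotony property of definition 11; the interesting cases discussed above are exactly those where a relational backend breaks it.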
Algorithm 1: Substitution of a multidimensional object ('patch-working')

Input: set of existing multidimensional objects C = {M1, ..., Mn},
       multidimensional object MQ representing the query
Output: substitution S for MQ

 1  // initialize remainder, patch-work and cost
 2  R := { MQ }; S := ∅; costtotal := 0
 3
 4  // iterate over all remaining fragments
 5  While ( R ≠ ∅ )
 6    Mbest := ∅; costrel := ∞;
 7
 8    // find cheapest substitution for each remaining patch
 9    Foreach M ∈ R
10      // loop through all candidates
11      Foreach Mcand ∈ C
12        // next candidate if this one is not supporting M
13        If ( Not ( Mcand „ M ) )
14          Next;
15        End If
16        // if relative cost of new candidate is less
17        If ( costrel > Cost(M, Mcand) / |M ∩ Mcand| )
18          Mbest := Mcand;
19          costrel := Cost(M, Mcand) / |M ∩ Mcand|;
20        End If
21      End Foreach
22
23      // add the best patch to the result
24      S := S ∪ { 〈M ∩ Mbest, Mbest〉 };
25      costtotal := costtotal + Cost(M, Mbest);
26
27      // remove the new patch from the remainder
28      R := R \ { M } ∪ (M − Mbest);
29    End Foreach
30  End While
31  Return (S, costtotal)
32  End

Since M may be only partially derivable from Mbest, the uncovered difference M − Mbest must remain in R. It is worth mentioning here that a solution is always possible if C also contains the raw data objects.

Cache Replacement

The knowledge of the OLAP application domain provides not only a lot of semantic information about the data model but also about the users' behavior. In order to incorporate reference density as well as semantic knowledge, the benefit in our approach is defined by the four factors of the weighted relative reference density, the absolute benefit, the degree of relationship, and the reconstruction cost ([ADB+99]):

• Weighted Relative Reference Density DM: The weighted relative reference density is used to approximate the traditional least reference density (LRD) in the multidimensional context. A reference counter is increased only by the fraction of M that was actually referenced.
• Absolute Benefit AM: The absolute benefit is based on the size, the granularity, and the scope of the multidimensional object and therefore generalizes the benefit definitions of [HaRU96] as well as [DRSN98].
• Degree of Relationship RM(MQ): In the OLAP context, it is very likely that an object in the cache that can be reached within a few navigational operations will be used with high probability. Therefore, the number of navigation operations necessary to transform the last query MQ into a materialized, i.e. cached, multidimensional object M defines the degree of relationship for M.
• Reconstruction Cost CM(MQ, C): Since cached multidimensional objects may be computed from the set of multidimensional objects currently in the cache, it may prove beneficial to replace those multidimensional objects which can be reconstructed most easily.

Since the factors have very different orders of magnitude, they are normalized by their average values before the overall benefit BM(MQ, C) of a multidimensional object M for a cache configuration C after query MQ is computed as a linear combination. The overall benefit is used as the indicator of whether the set of cached multidimensional objects is changed or not.

5 Experimental Results

The following section illustrates the optimization potential of the presented algorithms for caching multidimensional objects. Our proposed dynamic caching strategy is analyzed for different cache sizes and influence factors in the computation of the overall benefit. Furthermore, we compare our strategy with the static precomputation approach of [HaRU96].

Experimental Setup

The proposed data structures and algorithms are implemented in the multidimensional OLAP server CUBESTAR. All query processing and cache management is based on algebraic operations on multidimensional objects. However, multidimensional objects are physically stored in a relational backend (Oracle 8.0.4; DEC Alpha 3000/800, 1 CPU, 192 MByte RAM) in a star-schema-like manner (ROLAP system). For each patch in the optimal substitution for the query (algorithm 1), the query processor generates an SQL statement; these statements are put together by UNION ALL operations to yield the final result. We could not completely eliminate side-effects of the system, especially of the database buffer, but multiple runs confirmed the following results.

In the experimental scenario, two user traces, one simulating 1000 operations of a single user and another one simulating 1500 steps of five independent users, were used to investigate the behavior of the cache for varying cache sizes and configurations.

[Figure 3: Cost saving ratio for different combinations of cache parameters — a) single user, 1000 steps; b) multiple users, 1500 steps. Parameter combinations: D and R; D, A, R and C; A and C; C alone.]

Influence of the Cache Parameters

The first set of graphs in figure 3 illustrates the impact of the four cache parameters using a cache size of 20% of the raw data volume. The figures show the development of the cost saving ratios for the single- and the multi-user trace.

Multidimensional Database Schema and Query Generation

The schema of the multidimensional database consists of three dimensions with five, four and six levels, respectively, and a set of up to 14 dimensional attributes.
If sponding raw data cube was randomly filled with a sparsity all parameters are enabled, cost savings of about 56% in the of 3% resulting in about 250.000 tuples in the relational single user case and 50% in the multi user case were representation. To simulate the cache replacement behav- obtained. By trying all possible combinations of parame- ior, a new query is constructed from the previous query by ters it turned out that for both the single and the multi user applying a single operation in a certain dimension with a traces the usage of the relative reference density D and the predefined probability. Since a typical OLAP user does not relationship degree R yields the best results of 60% and switch very often between dimensions, in 80% of the cases 52% respectively. On the contrary, the worst results of 35% the operation was executed on the same dimension and in and 28%, respectively, were achieved using the absolute 20% of the cases the operation involved a different dimen- benefit A and the reconstruction cost C. sion as the previous one. The set of possible operations It is interesting to note that the number of potential users included slice (scope changes to a child node), unslice of a materialized aggregate, reflected by the absolute bene- (scope changes to a parent node), drill-down (granularity fit A, is not suitable for the determination the overall bene- changes to the next lower level), roll-up (granularity fit. Algorithms like [HaRU96] are based solely on this fac- changes to the next higher level), split (add a dimensional tor. Furthermore, it turns out that the general critics accord- attribute) and merge (remove a dimensional attribute). ing to query-based caching schemes namely the possibly redundant storage of the same part of data cube in different Performance Metric objects is of minor interest, because the influence factor of the reconstruction cost also does very badly. 
Since in general the cache management aims at maximiz- ing the cost saving ratio ([ShSV99], [DRSN98]), we Influence of the Cache Size defined this metric in the context of multidimensional objects. The cost saving ratio CSR on cached multidimen- The diagram in figure 4 shows the simulation results for sional objects is defined by the division of the sum of the different cache sizes varying over 10.000, 25.000, 50.000, costs of the queries using the cache content and the total 100.000, and 500.000 tuples. Compared to the raw data cost to compute the queries from the non-aggregated raw volume these values result in an additional storage over- data. head of 4%, 10%, 20%, 40% and 200%, respectively. For these simulations the best combination of cache parame- W. Lehner, J. Albrecht, W. Hümmer 2-7 Summary 70% 60% single user, 1000 steps 50% 5 users, CSR 40% 1500 steps 30% Operations: 20% 15% slice / 6% unslice 14% drill-down / 30% roll-up 10% 20% split / 15% merge 0 4% 10% 20% 40% 200% Figure 4: Query trace configurations and cost savings with varying cache sizes ters, i.e. D and R, was used. As can be seen, already with a only supply an insufficient number of aggregates. More- relatively small cache of 4% of the raw data size, cost over, because general static approaches do not take into reductions of over 50% in the single user mode and over account specific partitions, i.e. regions of interest or hot 40% in the multi user mode are possible. However, the cost spots, much of the additional storage space is wasted for savings in the single user mode do not increase much over regions of aggregates which are never or seldom requested. 60% for any cache size beyond 50.000 tuples. The figure illustrates that even for multiple users a cache of a fairly 6 Summary small size yields cost reductions not much worse than for a single user, i.e. above 50%. 
Comparison to Static Precomputation of Aggregates

In order to compare our method to other materialization methods for aggregates, we implemented a static precomputation algorithm based on [HaRU96]‡. This algorithm computes a set of aggregates with an unrestricted scope based on a benefit definition similar to the absolute benefit. Figure 4 shows the cost saving ratio for different cache sizes. If only 4% of additional storage is provided, almost no cost reduction is possible for the static algorithm (therefore not in the figure), because the size of almost all useful aggregates with an unrestricted scope is larger than the cache size. Although the cost saving ratio of the static approach rises with increasing cache size, the values for the dynamic strategies remain unrivaled. Even for a cache size twice as large as the raw data volume, the cost saving ratios for the single and multi user cases are only about 40% and 30%, respectively. Note that the dynamic strategy yields over 50% with 1/50 of that cache size.

The reason for the bad results of the static approach is that the size of the aggregation lattice grows exponentially in the number of possible group-by combinations. In most application scenarios this number is high, because several evaluation criteria on a single dimension are used, like dimensional attributes or multiple hierarchies. Using a reasonable amount of additional storage, static methods can only supply an insufficient number of aggregates. Moreover, because general static approaches do not take into account specific partitions, i.e. regions of interest or hot spots, much of the additional storage space is wasted on regions of aggregates which are never or seldom requested.

‡. Unfortunately, we could not compare our approach with the work of [DRSN98], since they require a very special 'chunk-based' file system in the context of their Paradise database system.

6 Summary

In this paper, we present a novel method for the dynamic selection of materialized aggregates in the context of multidimensional database systems. Our proposed solution is based on the notion of multidimensional objects, which provide a consistent view of queries as well as of materialized aggregates. The use of multidimensional objects drastically simplifies the determination of intersections and differences of queries. Therefore, we are able to present a query-based caching strategy that overcomes the limitations of query containment by the concept of set-derivability. Set-derivability is achieved by a patch-working mechanism which (in the case of a monotonous cost function) computes the cheapest substitution of a query by a set of materialized aggregates. The cache replacement strategy is based on the computation of an overall benefit for each element in the aggregate cache, which consists of a linear combination of four different influence parameters. Finally, we give the results of various experiments. Average cost reductions of 50% to 60% may be obtained with only 10% additional storage capacity.

The proposed optimization technique must not be seen as an alternative to traditional physical optimization methods like indexes or fragmentation, but as an addition. Our method not only explicitly supports fragmentation of the fact table but makes it even more beneficial. If aggregates are only materialized for hot spots, a single query may utilize both fragments of materialized aggregates and fragments of raw data in one operation. Our proposed strategy may easily be integrated into most relational OLAP servers, because all optimizations involve only algebraic operations on common multidimensional structures.

Currently, we are exploring the integration of indexes on the multidimensional objects and the benefit of using special linearizations. Another line of research is the use of the fragment-based patch-working algorithm in a parallel server environment.
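The replacement strategy described in the summary, an overall benefit computed per cached object as a linear combination of the four influence parameters, can be sketched in outline as follows. The class layout, the admission/eviction loop, and the concrete weights are assumptions for illustration; the paper states only that the overall benefit is a linear combination of the relative reference density D, the relationship degree R, the absolute benefit A, and the reconstruction cost C, and that the combination of D and R performed best in the experiments.

```python
from dataclasses import dataclass

@dataclass
class CachedObject:
    """A materialized aggregate in the cache (simplified stand-in)."""
    name: str
    size: int       # storage occupied, in tuples
    D: float = 0.0  # relative reference density
    R: float = 0.0  # relationship degree
    A: float = 0.0  # absolute benefit
    C: float = 0.0  # reconstruction cost

def overall_benefit(obj, w=(1.0, 1.0, 0.0, 0.0)):
    """Linear combination of the four influence parameters.

    The default weights mimic the best-performing setting reported
    in the experiments (only D and R enabled); the actual weighting
    scheme of the paper is not spelled out here and is assumed.
    """
    wD, wR, wA, wC = w
    return wD * obj.D + wR * obj.R + wA * obj.A + wC * obj.C

def admit(cache, capacity, new_obj):
    """Evict lowest-benefit objects until `new_obj` fits, then insert it.

    `new_obj` is rejected if it exceeds the total capacity or if making
    room would require evicting objects of higher overall benefit.
    Returns the resulting cache contents as a list.
    """
    if new_obj.size > capacity:
        return cache
    cache = sorted(cache, key=overall_benefit)   # lowest benefit first
    used = sum(o.size for o in cache)
    while cache and used + new_obj.size > capacity:
        victim = cache[0]
        if overall_benefit(victim) >= overall_benefit(new_obj):
            return cache                         # keep the better objects
        cache.pop(0)
        used -= victim.size
    if used + new_obj.size <= capacity:
        cache.append(new_obj)
    return cache
```

In this sketch an incoming multidimensional object displaces cached aggregates only when its estimated benefit exceeds theirs, which is the behavior the benefit-driven replacement strategy aims for.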
References

ABD+99 Albrecht, J.; Bauer, A.; Deyerling, O.; Günzel, H.; Hümmer, W.; Lehner, W.; Schlesinger, L.: Management of Multidimensional Aggregates for Efficient Online Analytical Processing. In: International Database Engineering and Applications Symposium (IDEAS'99, Montreal, Canada, August 1-3), 1999

AlGL99 Albrecht, J.; Günzel, H.; Lehner, W.: Set-Derivability of Multidimensional Aggregates. In: First International Conference on Data Warehousing and Knowledge Discovery (DaWaK'99, Florence, Italy, August 30 - September 1), 1999

BaPT97 Baralis, E.; Paraboschi, S.; Teniente, E.: Materialized Views Selection in a Multidimensional Database. In: 23rd International Conference on Very Large Data Bases (VLDB'97, Athens, Greece, Aug. 25-29), 1997, pp. 156-165

ChMM88 Chen, M.C.; McNamee, L.; Melkanoff, M.: A Model of Summary Data and its Applications in Statistical Databases. In: Proceedings of the 4th International Working Conference on Statistical and Scientific Database Management (4SSDBM, Rome, Italy, June 21-23), 1988, pp. 356-372

ChSh95 Chaudhuri, S.; Shim, K.: An Overview of Cost-based Optimization of Queries with Aggregates. In: IEEE Data Engineering Bulletin 18(1995)3, pp. 3-9

CoLR90 Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.: Introduction to Algorithms. Cambridge (MA): The MIT Press; New York: McGraw-Hill, 1990

DRSN98 Deshpande, P.M.; Ramasamy, K.; Shukla, A.; Naughton, J.F.: Caching Multidimensional Queries Using Chunks. In: Proceedings of the 27th International Conference on Management of Data (SIGMOD'98, Seattle (WA), USA, June 2-4), 1998, pp. 259-270

Fink82 Finkelstein, S.: Common Expression Analysis in Database Applications. In: Proceedings of the International Conference on Management of Data (SIGMOD'82, Orlando (FL), USA, June 2-4), 1982, pp. 235-245

GHRU97 Gupta, H.; Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Index Selection for OLAP. In: Proceedings of the 13th International Conference on Data Engineering (ICDE'97, Birmingham, Great Britain, April 7-11), 1997, pp. 208-219

GuHQ95 Gupta, A.; Harinarayan, V.; Quass, D.: Aggregate-Query Processing in Data Warehousing Environments. In: Proceedings of the 21st International Conference on Very Large Data Bases (VLDB'95, Zürich, Switzerland, Sept. 11-15), 1995, pp. 358-369

Gupt97 Gupta, H.: Selection of Views to Materialize in a Data Warehouse. In: Proceedings of the 6th International Conference on Database Theory (ICDT'97, Delphi, Greece, Jan. 8-10), 1997, pp. 98-112

HaRU96 Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Implementing Data Cubes Efficiently. In: Proceedings of the 25th International Conference on Management of Data (SIGMOD'96, Montreal, Quebec, Canada, June 4-6), 1996, pp. 205-216

Klug82 Klug, A.: Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate Functions. In: Journal of the ACM 29(1982)3, pp. 699-717

KoRo99 Kotidis, Y.; Roussopoulos, N.: DynaMat: A Dynamic View Management System for Data Warehouses. In: International Conference on Management of Data (SIGMOD'99, Philadelphia (PA), USA, June 1-3), 1999, pp. 371-382

LeAW98 Lehner, W.; Albrecht, J.; Wedekind, H.: Normal Forms for Multidimensional Databases. In: Proceedings of the 10th International Conference on Scientific and Statistical Data Management (SSDBM'98, Capri, Italy, July 1-3), 1998, pp. 63-72

LeSh97 Lenz, H.-J.; Shoshani, A.: Summarizability in OLAP and Statistical Data Bases. In: Proceedings of the 9th International Conference on Statistical and Scientific Database Management (9SSDBM, Olympia (WA), USA, Aug. 11-13), 1997, pp. 132-143

SDJL96 Srivastava, D.; Dar, S.; Jagadish, H.V.; Levy, A.Y.: Answering Queries with Aggregation Using Views. In: Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB'96, Bombay, India, Sept. 3-6), 1996, pp. 318-329

ShSV99 Shim, J.; Scheuermann, P.; Vingralek, R.: Dynamic Caching of Query Results for Decision Support Systems. In: Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM'99, Cleveland (OH), USA, July 28-30), 1999, pp. 254-263

SuKN89 Sun, X.-H.; Kamel, N.; Ni, L.M.: Solving Implication Problems in Database Applications. In: Proceedings of the 18th International Conference on Management of Data (SIGMOD'89, Portland (OR), USA, May 31 - June 2), 1989, pp. 185-192

YaKL97 Yang, J.; Karlapalem, K.; Li, Q.: Algorithms for Materialized View Design in Data Warehousing Environment. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB'97, Athens, Greece, Aug. 25-29), 1997, pp. 136-145