=Paper=
{{Paper
|id=Vol-28/paper-2
|storemode=property
|title=Divide and aggregate: caching multidimensional objects
|pdfUrl=https://ceur-ws.org/Vol-28/paper2.pdf
|volume=Vol-28
|dblpUrl=https://dblp.org/rec/conf/dmdw/LehnerAH00
}}
==Divide and aggregate: caching multidimensional objects==
Wolfgang Lehner, Jens Albrecht, Wolfgang Hümmer
Database Systems (IMMD6)
University of Erlangen-Nuremberg
Martensstr. 3, 91058 Erlangen, Germany
{lehner, jalbrecht, huemmer}@informatik.uni-erlangen.de
Abstract

A common optimization technique in the OLAP application domain is the use of summary aggregate data. For an appropriate support of analysis hot spots, we propose a query-based aggregate cache of 'multidimensional objects'. The major drawback of previous query caching methods is that new queries must be completely subsumed by a single cached object in order to utilize it. Our solution is the introduction of 'set-derivability', which allows the combination of several aggregates to derive a single query. The set of cached multidimensional objects is dynamically determined using both user access patterns and semantic knowledge of dimensional structures. Experimental results show that average cost reductions of over 50% are easily reached while spending only 10% of additional storage for summary data.

1 Introduction

Online Analytical Processing (OLAP) offers analysts efficient interactive access to consistent business data, typically stored in a data warehouse. From the physical database design point of view, the two main obstacles to providing efficient multidimensional access to those data are choosing the right partitioning scheme and the right selection of summary tables. Whereas partitioning is a common concept in physical database design, the generation of summary tables is specific to the OLAP application domain with its mostly read-only data access. Due to limited storage capacity, not all possible aggregations may be precomputed. In this paper we present a dynamic caching strategy for multidimensional data incorporating both partitioning schemes and summary tables. Our strategy is easy to implement in relational OLAP servers and provides excellent support of analysis hot spots. Experiments show that average cost reductions of over 50% are easily reached while spending only 10% of additional storage for summary data.

Derivability

To speed up queries, multidimensional objects are cached and reused during runtime. The framework of multidimensional objects provides a consistent and formalized view of database queries as well as of materialized summary data. They may be visualized as a partition (restricted to certain selection predicates) of an aggregated data cube (using grouping attributes) specified by a regular star query.

A necessary prerequisite for reuse is derivability ([Klug82], [Fink82]). Derivability in the context of partitioned aggregates concerns the following two aspects:
• compatibility of grouping attributes: The precomputed data from which a query may be derived has to have an equal or finer granularity with regard to the corresponding dimensional structures (to illustrate these dependencies, [HaRU96] introduced the notion of an aggregation lattice).
• compatibility of selection predicates: The precomputed partition of summary data must not be more restrictive than the query which is the subject of derivation.

Since predicate comparison is NP-complete in general ([SuKN89]), recent work either does not consider partitioning at all ([HaRU96], [BaPT97]), is limited to a test for equality ([ShSV99]), or is reduced to statically predefined partitions (chunks; [DRSN98]).

Contribution

To overcome these limits and to provide a simple and powerful strategy to select and release precomputed summary data with limited scope, this paper contributes the following:
• set-derivability: Our approach is not limited to deriving a single query from a single summary data table but computes a nearly optimal construction plan for a query based on a set of precomputed summary data partitions ('patch-working' on multidimensional objects; section 4).
• use of dimensional structures: Since we allow partitioning only with respect to classification hierarchies (and not for dimensional attributes), algorithms to compose a multidimensional aggregate from a set of cached objects become feasible.
• dynamic caching policy: Since static precomputation of complete data cubes does not adequately support data analysis hot spots, our approach decides on the fly which summary data partitions (multidimensional objects) are worth keeping materialized. Moreover, materialization of the same data cube partition at different aggregation levels is explicitly allowed, and the size of a partition is not a system parameter but is implicitly determined by the user.
• easy integration: Since we perform algebraic optimization, our algorithm may be easily integrated into relational OLAP servers.
• high average cost reduction: As our experiments show, we achieve high cost reductions at almost no cost, with only a very modest storage overhead.

Structure of the Paper

In the next section, we give a comprehensive overview of recent work related to the selection and use of materialized aggregates in the context of OLAP and data warehousing. The third section formally introduces the one- and multidimensional structures of the considered multidimensional model. Based on the notion of multidimensional objects, section 4 defines the concept of set-derivability using 'patch-working' and outlines the four influence factors used to compute the overall benefit of a single cached multidimensional object. Finally, results of various experiments are presented in section 5. The paper concludes with a summary.

2 Related Work

The general idea of precomputing summary data is rooted in the application area of 'Statistical and Scientific Databases'. Several proposals like [ChMM88] were made, but the idea became popular with the emergence of 'Data Warehousing' together with 'Online Analytical Processing' and the resulting need for efficient and mostly read-only access to aggregates in a multidimensional context ([ChSh95]).

Static versus Adaptive Precomputation and Caching Schemes

A first classification of existing work may be made according to the flexibility of the scheme with regard to user access behavior. Given a set of relational queries, the algorithm of [YaKL97] computes the global query execution plan and selects the node of the resulting plan with the highest number of references for precomputation. [Gupt97] extends this procedure in the context of AND/OR graphs by considering alternative local query execution plans. Based on the concept of an aggregation lattice, the algorithm specified in [HaRU96] ([GHRU97] additionally considers existing index structures) statically selects those nodes of the lattice with the best ratio of cost savings to the additional storage required by that node. Since this approach has a complexity of O(n³), with n the number of possible nodes of the aggregation lattice, the work of [BaPT97] introduces heuristics to reduce this number and make the approach more applicable.

In contrast to these static algorithms, dynamic or adaptive strategies try to incrementally optimize the set of precomputed aggregates. On the relational level, [ShSV99] materializes the results of arbitrarily complex queries. Unfortunately, by using hash codes they are able to reuse materialized queries only in the case of an exact match, which is not acceptable in a real OLAP scenario. For the case of an exact match with regard to the group-by attributes, the recent work of [DRSN98] proposes a chunk-based caching strategy. The materialization of chunks (see below) is resolved according to a classical CLOCK scheme, thus directly reflecting the user access behavior. A similar approach is presented in [KoRo99]. Again, a single query can only be answered from a single materialized fragment.

Table-, Query-, and Chunk-based Partitioning Schemes

The second classification criterion for existing strategies is based on the unit of precomputation. Basically there are three levels, namely table-, query-, and chunk-based strategies.

Algorithms like [HaRU96], [GHRU97], and [BaPT97] do not support partitioning and are therefore classified as table-based strategies. Compatibility of selection predicates does not have to be checked, since all summary tables are defined on an unrestricted scope (the second condition of derivability is trivial). However, this implies that analysis hot spots like "the current period" are not reflected appropriately, because the algorithms materialize only summary tables containing all periods.

In contrast, chunk-based partitioning schemes like [DRSN98] define relatively small aggregate partitions of a fixed size (chunks). Queried chunks are cached and later reused to answer new queries. On the one hand, [DRSN98] reports very high cost saving ratios in their experiments. On the other hand, the implementation of special bitwise index structures to identify single chunks is extremely expensive. Moreover, they use a special chunk-based file system, preventing a seamless integration into relational OLAP servers. Finally, the chunk size must be statically determined and is crucial for the efficiency of the whole system.
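The two derivability conditions introduced above (compatibility of grouping attributes and of selection predicates) can be sketched in a few lines. This is an illustrative model, not the paper's implementation: granularities are per-dimension aggregation levels (0 = primary attribute, larger = coarser) and scopes are per-dimension sets of selected classification nodes; all names and data are invented.

```python
# Hypothetical sketch of the two derivability checks; names are illustrative.

def granularity_compatible(cached_levels, query_levels):
    """Condition 1: the cached aggregate must have an equal or finer
    granularity than the query in every dimension."""
    return all(c <= q for c, q in zip(cached_levels, query_levels))

def scope_compatible(cached_scope, query_scope):
    """Condition 2: the cached partition must not be more restrictive
    than the query, i.e. the query scope lies inside the cached one."""
    return all(q <= c for c, q in zip(cached_scope, query_scope))

def derivable(cached_levels, cached_scope, query_levels, query_scope):
    return (granularity_compatible(cached_levels, query_levels)
            and scope_compatible(cached_scope, query_scope))

# A cache at the finest granularity covering Bavaria and Berlin can
# answer a coarser query restricted to Bavarian video sales:
print(derivable((0, 0), ({"Bavaria", "Berlin"}, {"video", "audio"}),
                (1, 1), ({"Bavaria"}, {"video"})))  # → True
```

Predicate comparison stays cheap here precisely because scopes are modeled as classification nodes rather than arbitrary predicates, mirroring the argument made in the text.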
The third class of partitioning schemes is made up of query-based partitioning schemes ([SDJL96], [ShSV99], [YaKL97], [GuHQ95]). Single user queries reflect the unit of caching and make the selection of the right partition size obsolete. Recent work reuses results of former queries either by query containment or by exact match ([SDJL96], [ShSV99]), where the incoming query must be completely derivable from the materialized query. Moreover, it is further argued against query-based partitioning ([DRSN98]) that several materialized queries may address a common part of a data cube. This would result in redundant storage of that part and decrease the utilization of the additional storage.

3 The Multidimensional Model

The following section introduces the dimensional and multidimensional structures which build the formal representation of the multidimensional model. The description is divided into the presentation of dimensions and multidimensional objects, both of which are necessary to define a multidimensional database ([LeAW98]).

3.1 Dimensional Structures

Dimensional structures are defined in the conceptual schema design phase of a database. The semantic information coded into these structures is heavily used for cache replacement.

Definition 1: A dimension D of a multidimensional database is a set of attributes D = { PA } ∪ { CA1, ..., CAp } ∪ { DA1, ..., DAq } with the following set of functional dependencies between these attributes:
• ∀i (1≤i≤p): PA → CAi and ∀i (1≤i≤q): PA → DAi
  The primary attribute PA of a dimension functionally determines all other attributes of a single dimension. The instances of the primary attribute are called dimensional elements.
• ∀i (1≤i≤p−1): CAi → CAi+1
  The classification attributes CAi of a dimension are used to define a multi-level grouping of single dimensional elements and are ordered according to their functional dependencies. The aggregation level or granularity of a classification attribute CAi is denoted by i. By definition, the primary attribute has aggregation level 0 (PA = CA0).
• ∀i,j with i≠j (1≤i,j≤q): DAi ↛ DAj
  The dimensional attributes DAi of a dimension represent all attributes describing properties of the dimensional elements.

Although the concept of dimensional attributes is more complex ([LeAW98], [ABD+99]), dimensional attributes may be seen as additional single-level hierarchies on a dimension, which implies that they are functionally independent of each other.†

† Moreover, it is worth mentioning that this definition does not explicitly consider multiple hierarchies with more than one level. Multiple hierarchies would introduce a partial order on the classification attributes (instead of a total order for a single classification hierarchy). This extension would not affect any of the presented derivability or caching algorithms but would complicate the presentation. Therefore, we only consider a single classification hierarchy.

Example 1: The set of dimensions of a typical OLAP sample scenario consists of a product, time, and shop dimension. Within the shop dimension, for example, the primary attribute corresponds to a unique shop identifier. A regional classification may be defined by the sequence city → state → region → country. The set of dimensional attributes may be defined as { shop-type, purchase-class, selling-area, shop-window-size }. In real OLAP scenarios, a single dimension (usually a product or a customer dimension) may easily have 20 different attributes.

3.2 Multidimensional Structures

The basic data structure in the multidimensional context, the notion of a multidimensional object, is of fundamental importance. With regard to the caching algorithms described in the following section, all incoming queries as well as materialized results of former queries are internally represented as multidimensional objects.

Definition 2: A multidimensional object (MO) is a triple M = (M, S, G) consisting of a measure M, an n-ary vector of classes S = (s1, ..., sn) denoting the scope, and the granularity specification G.
A measure is a tuple M = [N, A, O] where N is a reserved name for the corresponding fact, A ∈ { FLOW, STOCK, VALUEperUNIT } is the aggregation type, and O ∈ { NONE, SUM, AVG, COUNT } is the applied aggregation function on that specific fact.
A granularity specification is a tuple G = [L, P] where L ('aggregation level') is an n-ary vector of classification attributes, one per dimension (L ∈ { CA^1_0, ..., CA^1_p } ⊗ ... ⊗ { CA^n_0, ..., CA^n_p }), and P ('properties') is a set of dimensional attributes (P ⊆ { DA^1_1, ..., DA^1_q } ∪ ... ∪ { DA^n_1, ..., DA^n_q }).

Example 2: The formal representation of a sales data cube for video products in Germany would be as follows:
M = ( [ SALES, FLOW, SUM ],
      ( p.group = 'Video', s.country = 'Germany' ),
      [ (p.family, s.region), { p.brand, s.shoptype } ] )

The aggregation type determines the set of applicable aggregation operations on the numeric values to ensure correct summarizability ([LeSh97]).
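Definitions 1 and 2 translate naturally into small record types. The following is an illustrative encoding, not the authors' code; all class names, field names, and the sample shop dimension are assumptions made for this sketch.

```python
# Illustrative encoding of Definitions 1 and 2 (hypothetical names).
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Dimension:
    primary: str                     # PA, aggregation level 0
    classification: Tuple[str, ...]  # CA1..CAp, coarser with rising index
    attributes: FrozenSet[str]       # DA1..DAq, mutually independent

    def level(self, attr: str) -> int:
        """Aggregation level: 0 for the primary attribute, i for CAi."""
        if attr == self.primary:
            return 0
        return self.classification.index(attr) + 1


@dataclass(frozen=True)
class Measure:                       # M = [N, A, O]
    name: str
    agg_type: str                    # FLOW | STOCK | VALUEperUNIT
    operation: str                   # NONE | SUM | AVG | COUNT


@dataclass(frozen=True)
class MObject:                       # M = (M, S, G) with G = [L, P]
    measure: Measure
    scope: Tuple[str, ...]           # S: one class per dimension
    levels: Tuple[str, ...]          # L: one classification attr per dim
    properties: FrozenSet[str]       # P: set of dimensional attributes


# The shop dimension of Example 1:
shop = Dimension("shop_id", ("city", "state", "region", "country"),
                 frozenset({"shop-type", "selling-area"}))
print(shop.level("region"))  # → 3
```

Freezing the dataclasses makes multidimensional objects hashable, which is convenient when they serve as cache keys.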
For example, all aggregation operations are applicable to sales figures (A = FLOW), whereas the sum operation would not be allowed on figures denoting the price of single articles (A = VALUEperUNIT). The aggregation type STOCK prohibits summation in the temporal dimension. Finally, the size of a multidimensional object M at the instance level is denoted by |M|.

Relations and Operations on Multidimensional Objects

To formally define 'derivability' on multidimensional objects, we need to introduce the '≤'-relation for the granularity specification and the intersection of scopes of multidimensional objects. Moreover, we need intersection and difference operators on multidimensional objects themselves.

Definition 3: A granularity specification G = [L, P] is finer than or equal to a granularity specification G' = [L', P'], i.e. G ≤ G', if and only if
• all aggregation levels of L are lower than or equal to the aggregation levels of L', and
• P is a superset of dimensional attributes with regard to P', i.e. P' ⊆ P,
or
• G represents raw data, i.e. L = (PA1, ..., PAn).

Definition 4: The intersection of multidimensional scopes is defined as the component-wise intersection of the classes in each dimension, i.e. by the homomorphism
S ∩ S' = (s1, ..., sn) ∩ (s'1, ..., s'n) := ( (s1 ∩ s'1), ..., (sn ∩ s'n) )
The intersection is empty if two classes si and s'i do not show a parent-child relationship within the corresponding classification hierarchy of dimension i.

It is important to note that the problem of intersecting classes in a classification hierarchy may be solved by checking simple parent-child relationships. This concept is fundamental for the applicability of the set-derivability mechanism.

Definition 5: The intersection of multidimensional objects M = (M, S, G) and M' = (M', S', G'), i.e. M ∩ M', is defined as M ∩ M' = (M, S ∩ S', G) if and only if
• both objects refer to the same measure, i.e. M = M',
• the intersection of the scopes is not empty, i.e. S ∩ S' ≠ ∅, and
• the granularity specification of M is finer than or equal to the granularity specification of M', i.e. G ≤ G'.
As seen in this definition, the coarser multidimensional object M' is implicitly refined to the granularity level G.

To be able to introduce the concept of set-derivability by substitution ('patch-working'; section 4), we need to define the difference of two multidimensional objects. As illustrated in figure 1 for the two-dimensional case, the difference of two n-dimensional cubes would in the worst case result in a set of 3^n − 1 convex residual cubes, assuming exactly 3 classification nodes partitioning each dimension (for n = 2: 3² − 1 = 8). It can be shown that intelligent slicing can reduce this number to 2n residual cubes (for n = 2: 2·2 = 4; [ChMM88], proof of proposition 4). Informally, the cube is cut iteratively for each dimension into at most three slices; two slices are residual cubes and are not considered any longer; the third slice contains the operand and is sliced further in the remaining dimensions. For n dimensions we obtain 2n residual cubes.

[Figure 1: Difference of multidimensional objects M − M']

In the general case, with dimensions consisting of an arbitrary number of classification nodes, for each dimension i (1≤i≤n) the minimum number p_i of partitioning classification nodes including the operand to be subtracted has to be determined. The number of residual cubes is then Σ_{i=1..n} (p_i − 1), by the same argument as above. For more detailed information refer to [AlGL99].

Definition 6: The difference of two multidimensional objects M = (M, S, G) and M' = (M', S', G'), i.e. M − M', is defined as a set { (M, S1, G), ..., (M, Sk, G) }, where k is the number of residual cubes and the Si are the scopes of the residual objects obeying the following conditions:
S = S' ∪ ⋃_{i=1..k} Si and ⋂_{i=1..k} Si = ∅.

Derivability of Multidimensional Objects

To reuse the results of former queries, derivability of multidimensional objects as the units of caching is a necessary prerequisite. The following definition of total derivability will be relaxed in the next section to partial derivability, which is sufficient to define set-derivability.

Definition 7: A measure M = (N, A, O) is derivable from a measure M' = (N', A', O') if and only if
• the measures refer to the same real world entity (N = N' and A = A'), and
• the operations are compatible, i.e. O = O' or O' = NONE.
Note that semi-additive functions like AVG are split into their additive counterparts ({ SUM, COUNT }) and a division. The additive operations are a regular subject of the caching mechanism. Thus, by materializing multidimensional objects with the applied operations SUM and COUNT, an AVG query is never cached directly but only indirectly.
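The iterative slicing argument behind Definition 6 can be demonstrated on axis-aligned boxes. The sketch below models scopes as tuples of half-open integer intervals and is purely illustrative (it ignores classification hierarchies): subtracting an inner box dimension by dimension leaves at most two residual boxes per dimension, i.e. 2n in total.

```python
# Sketch of the 2n-residual-cube slicing; boxes are tuples of
# (lo, hi) half-open intervals. Not the paper's implementation.

def box_difference(outer, inner):
    """Residual boxes of outer \\ inner, assuming inner lies in outer."""
    residuals = []
    current = list(outer)
    for d, ((lo, hi), (ilo, ihi)) in enumerate(zip(outer, inner)):
        if lo < ilo:                      # slice below the operand
            box = list(current); box[d] = (lo, ilo)
            residuals.append(tuple(box))
        if ihi < hi:                      # slice above the operand
            box = list(current); box[d] = (ihi, hi)
            residuals.append(tuple(box))
        current[d] = (ilo, ihi)           # keep the slice containing inner
    return residuals

outer = ((0, 10), (0, 10))
inner = ((3, 6), (4, 8))
parts = box_difference(outer, inner)
print(len(parts))  # → 4, i.e. 2n residual boxes for n = 2
```

The residuals are pairwise disjoint and, together with the subtracted operand, cover the outer box, matching the two conditions of Definition 6.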
Definition 8: A multidimensional object M = (M, S, G) is totally derivable from a multidimensional object M' = (M', S', G') if and only if
• the measure M in M is derivable from the measure M' in M',
• M' contains M, i.e. S ∩ S' = S, and
• the granularity specification of M' is finer than that of M, i.e. G' ≤ G.

4 Set-Derivability using 'Patch-Working'

Although derivability is the fundamental concept for the use of precomputed aggregates, in many cases definition 8 is too restrictive. Consider the example in figure 2: the query object MQ is not derivable from any of M1, M2, or M3 alone. However, it can be derived from a combination of M1, M2, and M3.

[Figure 2: Patch-working; the query object MQ overlaps the cached objects M1, M2, and M3]

The example illustrates the problem of previous query caching methods ([ShSV99], [SDJL96]), where a materialized aggregate can only be accessed if the complete query is derivable from this single cached object. This is because query containment can be checked with a relatively simple algorithm, whereas the determination of the overlapping regions for arbitrary predicates is NP-complete ([SuKN89]). However, in the context of multidimensional objects, where the predicates consist of scope definitions corresponding to classes in a classification hierarchy, this problem can be reduced to the determination of ancestor and sibling relationships.

Definition 9: A multidimensional object M = (M, S, G) is partially derivable from a multidimensional object M' = (M', S', G') if and only if
• conditions (1) and (3) of total derivability hold for M and M', and
• M and M' overlap, i.e. S ∩ S' ≠ ∅.
In this case, M' is said to support M, and M ∩ M' is called the support of M' for M.

From the set of supporters, i.e. the potential candidates, an appropriate set of partitions, or patches, has to be selected in order to answer a query. Such a set is called a substitution for the query.

Definition 10: A set S = { ⟨M1, M1'⟩, ..., ⟨Mk, Mk'⟩ } with Mi = (Mi, Si, Gi) and Mi' = (Mi', Si', Gi') is called a substitution for a multidimensional object MQ if and only if
• ∀i (1≤i≤k): Mi = MQ and Gi = GQ,
• ∀i (1≤i≤k): Mi' supports Mi,
• SQ = ⋃i Si, and
• ∀i,j (1≤i,j≤k) with i≠j: Si ∩ Sj = ∅.
In this case MQ is said to be set-derivable from { M1', ..., Mk' }.
Each patch in S has two components ⟨Mi, Mi'⟩ denoting that partition Mi of MQ is to be computed from Mi'. All partitions must be disjoint and their union must result in MQ.

The Query Optimization Problem

In general, the cache contains several possibly overlapping aggregates with different granularities which support a query. Therefore, the problem of the multidimensional query optimizer is to choose a substitution that allows the computation of the query at minimal cost. This problem is very similar to the well-known knapsack problem ([CoLR90]), where a thief tries to select the most valuable set of items with different values and weights. A common method is to use a greedy strategy to approximate the optimal solution of the NP-complete 0/1 knapsack problem. In the case of the fractional knapsack problem, where the thief may take arbitrary fractions of the items, the greedy approach actually yields the optimum. The problem of finding an optimal substitution can be reduced to the fractional knapsack problem if the cost to compute a fraction of one multidimensional object from another is monotonous in the size of the support.

Definition 11: A cost function Cost(M, M') determining the cost to compute the support of a multidimensional object M from M' is monotonous if for any two arbitrary multidimensional objects M' and M'' the following holds:
|M ∩ M'| < |M ∩ M''| ⇒ Cost(M, M') < Cost(M, M'').

In the case of a monotonous cost function, an algorithm selecting the cheapest substitution for each data cell in the multidimensional object M automatically finds the optimal substitution ('greedy-choice property'; [CoLR90]). However, especially in the context of relational databases, monotony of the cost function may not be given. In a relational backend, the computation of a multidimensional object results in range queries on the affected tables representing the supporting multidimensional objects. In the absence of indexes, range queries on relational tables usually result in full table scans, destroying the property of monotony.

In order to find a substitution for a query, we use the greedy strategy outlined in algorithm 1. The algorithm gets the query MQ and the set C of all materialized multidimensional objects as input parameters. It returns a substitution S and the estimated cost to compute MQ. As long as there are still uncomputed fragments remaining (line 5), the candidate with the least access cost per cell for each fragment is selected (lines 9-21). After the selected patch is added to the solution S (line 24), it must be removed from the remainder R using the difference operator (line 28).
Algorithm: Substitution (Patch-Working)
Input: set of existing multidimensional objects C = {M1, ..., Mn},
       multidimensional object representing the query MQ
Output: substitution S for MQ

 1  // initialize remainder, patch-work and cost
 2  R := { MQ }; S := ∅; costtotal := 0
 3
 4  // iterate until all fragments are covered
 5  While ( R ≠ ∅ )
 6    Mbest := ∅; costrel := ∞;
 7
 8    // find cheapest substitution for each remaining patch
 9    Foreach M ∈ R
10      // loop through all candidates
11      Foreach Mcand ∈ C
12        // next candidate if this one is not supporting M
13        If ( Not ( Mcand supports M ) )
14          Next;
15        End If
16        // if relative cost of new candidate is less
17        If ( costrel > Cost(M, Mcand) / |M ∩ Mcand| )
18          Mbest := Mcand;
19          costrel := Cost(M, Mcand) / |M ∩ Mcand|;
20        End If
21      End Foreach
22
23      // add the best patch to the result
24      S := S ∪ { ⟨M ∩ Mbest, Mbest⟩ };
25      costtotal := costtotal + Cost(M, Mbest);
26
27      // remove the new patch from the remainder
28      R := R \ { M } ∪ (M − Mbest);
29    End Foreach
30  End While
31  Return (S, costtotal)
32  End

Algorithm 1: Substitution of a multidimensional object ('patch-working')
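Algorithm 1 can be rendered as a short Python sketch. To stay self-contained, multidimensional objects are modeled as plain sets of cells and the default cost function charges the full size of the supporting object (roughly the full-table-scan case discussed above); both simplifications are stand-ins for the paper's object and cost models.

```python
# Toy rendering of Algorithm 1 ('patch-working'); object and cost
# models are simplified stand-ins, not the paper's implementation.

def substitution(query, candidates, cost=lambda frag, cand: len(cand)):
    remainder = [set(query)]          # R := { MQ }
    patches, total_cost = [], 0       # S := ∅ ; costtotal := 0
    while remainder:                  # while R ≠ ∅
        frag = remainder.pop()
        best, best_rel = None, float("inf")
        for cand in candidates:       # loop through all candidates
            support = frag & cand
            if not support:           # candidate does not support frag
                continue
            rel = cost(frag, cand) / len(support)
            if rel < best_rel:        # cheapest access cost per cell
                best, best_rel = cand, rel
        if best is None:              # guaranteed not to happen if the
            raise ValueError("cache must contain the raw data objects")
        support = frag & best
        patches.append((frozenset(support), frozenset(best)))
        total_cost += cost(frag, best)
        rest = frag - best            # M − Mbest stays in the remainder
        if rest:
            remainder.append(rest)
    return patches, total_cost

q = {1, 2, 3, 4}
cache = [{1, 2}, {3, 9}, {4, 7, 8}]
patches, c = substitution(q, cache)
print(sorted(sorted(p) for p, _ in patches))  # → [[1, 2], [3], [4]]
```

As in the text, the resulting patches are disjoint and their union covers the query, so the result is a substitution in the sense of Definition 10.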
Since M may only be partially derivable from Mbest, the uncovered difference M − Mbest must remain in R. It is worth mentioning that a solution is always possible if C also contains the raw data objects.

Cache Replacement

The knowledge of the OLAP application domain provides not only a lot of semantic information about the data model but also about the users' behavior. In order to incorporate reference density as well as semantic knowledge, the benefit in our approach is defined by four factors: the weighted relative reference density, the absolute benefit, the degree of relationship, and the reconstruction cost ([ADB+99]):
• Weighted relative reference density DM: The weighted relative reference density is used to approximate the traditional least reference density (LRD) in the multidimensional context. A reference counter is increased only by the fraction of M that was actually referenced.
• Absolute benefit AM: The absolute benefit is based on the size, the granularity, and the scope of the multidimensional object and therefore generalizes the benefit definitions of [HaRU96] as well as [DRSN98].
• Degree of relationship RM(MQ): In the OLAP context, it is very likely that an object in the cache that can be reached within a few navigational operations will be used with high probability. Therefore, the number of navigation operations necessary to transform the last query MQ into a materialized, i.e. cached, multidimensional object M defines the degree of relationship for M.
• Reconstruction cost CM(MQ, C): Since cached multidimensional objects may be computed from the set of multidimensional objects currently in the cache, it may prove beneficial to replace those multidimensional objects which can be reconstructed most easily.

Since the factors have very different orders of magnitude, they are normalized by their average values before the overall benefit BM(MQ, C) of a multidimensional object M for a cache configuration C after query MQ is computed as a linear combination. The overall benefit is used as the indicator of whether the set of cached multidimensional objects changes or not.

5 Experimental Results

The following section illustrates the optimization potential of the presented algorithms for caching multidimensional objects. Our proposed dynamic caching strategy is analyzed with respect to different cache sizes and influence factors for the computation of the overall benefit. Furthermore, we compare our strategy with the static precomputation approach of [HaRU96].

Experimental Setup

The proposed data structures and algorithms are implemented in the multidimensional OLAP server CUBESTAR. All query processing and cache management is based on algebraic operations on multidimensional objects. However, multidimensional objects are physically stored in a relational backend (Oracle 8.0.4; DEC Alpha 3000/800, 1 CPU, 192 MByte RAM) in a star-schema-like manner (ROLAP system).
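The normalize-then-combine scheme used for the overall benefit BM(MQ, C) in the cache replacement policy above can be sketched as follows. The factor values and weights are invented for illustration; the paper does not publish concrete weights.

```python
# Hedged sketch of the overall-benefit computation: each factor
# (D, A, R, C) is normalized by its average over the cached objects
# and the results are combined linearly. All numbers are invented.

def overall_benefit(factors, weights):
    """factors: name -> list of per-object values;
    weights: name -> coefficient of the linear combination.
    Returns one benefit value per cached object."""
    n = len(next(iter(factors.values())))
    benefits = [0.0] * n
    for name, values in factors.items():
        avg = sum(values) / len(values)
        for i, v in enumerate(values):
            benefits[i] += weights[name] * (v / avg)  # normalize, then weight
    return benefits

factors = {"D": [4.0, 1.0],      # weighted relative reference density
           "A": [100.0, 300.0],  # absolute benefit
           "R": [1.0, 3.0],      # degree of relationship
           "C": [10.0, 50.0]}    # reconstruction cost
weights = {"D": 1.0, "A": 1.0, "R": 1.0, "C": 1.0}
print(overall_benefit(factors, weights))
```

Normalizing by the average puts all four factors on a comparable scale, which is exactly why the raw values, with their very different orders of magnitude, cannot be summed directly.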
Experimental Results
70% 70%
60% 60%
50% 50%
Parameters:
40% 40% D and R
CSR
CSR
D, A, R and C
30% 30% A and C
C alone
20% 20%
10% 10%
0% 0%
0 200 400 600 800 1000 0 200 400 600 800 1000 1200 1500
a) single user, 1000 steps b) multiple users, 1500 steps
Figure 3: Cost saving ratio for different combinations of cache parameters
ates an SQL statement, which are put together by UNION In the experimental scenario, two user traces, one simulat-
ALL operations to yield the final result. We could not com- ing 1000 operations of single user and another one 1500
pletely eliminate side-effects of the system, especially of steps of five independent users, were used to investigate the
the database buffer, but multiple runs confirmed the follow- behavior of the cache for varying cache sizes and configu-
ing results. rations.
Multidimensional Database Schema and Query Influence of the Cache Parameters
Generation
The first set of graphs in figure 3 illustrates the impact of
The schema of the multidimensional database consists of the four cache parameters using a cache size of 20% of the
three dimensions with five, four and six levels, and a set of raw data volume. The figures show the development of the
up to 14 dimensional attributes, respectively. The corresponding raw data cube was randomly filled with a sparsity of 3%, resulting in about 250.000 tuples in the relational representation. To simulate the cache replacement behavior, a new query is constructed from the previous query by applying a single operation in a certain dimension with a predefined probability. Since a typical OLAP user does not switch very often between dimensions, in 80% of the cases the operation was executed on the same dimension and in 20% of the cases the operation involved a different dimension than the previous one. The set of possible operations included slice (scope changes to a child node), unslice (scope changes to a parent node), drill-down (granularity changes to the next lower level), roll-up (granularity changes to the next higher level), split (add a dimensional attribute) and merge (remove a dimensional attribute).

Performance Metric

Since in general the cache management aims at maximizing the cost saving ratio ([ShSV99], [DRSN98]), we defined this metric in the context of multidimensional objects. The cost saving ratio CSR on cached multidimensional objects is defined as the ratio between the sum of the costs saved by the queries using the cache content and the total cost to compute the queries from the non-aggregated raw data.

cost saving ratios for the single and the multi user trace. If all parameters are enabled, cost savings of about 56% in the single user case and 50% in the multi user case were obtained. By trying all possible combinations of parameters, it turned out that for both the single and the multi user traces the usage of the relative reference density D and the relationship degree R yields the best results of 60% and 52%, respectively. On the contrary, the worst results of 35% and 28%, respectively, were achieved using the absolute benefit A and the reconstruction cost C.

It is interesting to note that the number of potential users of a materialized aggregate, reflected by the absolute benefit A, is not suitable for the determination of the overall benefit. Algorithms like [HaRU96] are based solely on this factor. Furthermore, it turns out that the general criticism of query-based caching schemes, namely the possibly redundant storage of the same parts of the data cube in different objects, is of minor importance, because the influence factor of the reconstruction cost C also performs very badly.

Influence of the Cache Size

The diagram in figure 4 shows the simulation results for different cache sizes varying over 10.000, 25.000, 50.000, 100.000, and 500.000 tuples. Compared to the raw data volume, these values result in an additional storage overhead of 4%, 10%, 20%, 40% and 200%, respectively.
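As a concrete illustration of the simulation setup and the performance metric described above, the following sketch generates a query trace as a random walk over the six OLAP operations and computes a cost saving ratio. All cost figures, the random seed and the helper names are hypothetical; the sketch illustrates the mechanism, not the cost model actually used in the experiments.

```python
import random

# Hypothetical sketch (not the paper's implementation): a query trace is a
# random walk over the six OLAP operations, staying on the same dimension
# as the previous step with probability 0.8, and the cost saving ratio (CSR)
# relates the cost of answering the trace via the cache to the raw-data cost.

OPERATIONS = ["slice", "unslice", "drill-down", "roll-up", "split", "merge"]

def generate_trace(num_steps, num_dimensions, same_dim_prob=0.8, seed=42):
    """Build a trace of (operation, dimension) steps."""
    rng = random.Random(seed)
    dim = rng.randrange(num_dimensions)
    trace = []
    for _ in range(num_steps):
        if rng.random() >= same_dim_prob:  # in 20% of the cases: switch dimension
            dim = rng.choice([d for d in range(num_dimensions) if d != dim])
        trace.append((rng.choice(OPERATIONS), dim))
    return trace

def cost_saving_ratio(cached_costs, raw_costs):
    """CSR: fraction of the raw-data evaluation cost saved by the cache."""
    return 1.0 - sum(cached_costs) / sum(raw_costs)
```

For instance, if evaluating a trace through the cache costs 44% of its raw-data evaluation cost, the resulting CSR is 0.56.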
For these simulations the best combination of cache parameters, i.e. D and R, was used.

W. Lehner, J. Albrecht, W. Hümmer 2-7

Figure 4: Query trace configurations and cost savings with varying cache sizes (CSR over additional storage of 4%, 10%, 20%, 40% and 200%; traces: single user with 1000 steps and 5 users with 1500 steps; operations: 15% slice / 6% unslice, 14% drill-down / 30% roll-up, 20% split / 15% merge)

As can be seen, already with a relatively small cache of 4% of the raw data size, cost reductions of over 50% in the single user mode and over 40% in the multi user mode are possible. However, the cost savings in the single user mode do not increase much over 60% for any cache size beyond 50.000 tuples. The figure illustrates that even for multiple users a cache of a fairly small size yields cost reductions not much worse than for a single user, i.e. above 50%.

Comparison to Static Precomputation of Aggregates

In order to compare our method to other materialization methods for aggregates, we implemented a static precomputation algorithm based on [HaRU96]‡. This algorithm computes a set of aggregates with an unrestricted scope based on a benefit definition similar to the absolute benefit. Figure 4 shows the cost saving ratio for different cache sizes. If only 4% of additional storage is provided, almost no cost reduction is possible for the static algorithm (therefore it is not shown in the figure), because the size of almost all useful aggregates with an unrestricted scope is larger than the cache size. Although the cost saving ratio of the static approach rises with increasing cache size, the values for the dynamic strategies remain unrivaled. Even for a cache size twice as large as the raw data volume, the cost saving ratios for the single and multi user cases are only about 40% and 30%, respectively. Note that the dynamic strategy yields over 50% with 1/50 of that cache size.

The reason for the bad results of the static approach is that the size of the aggregation lattice grows exponentially in the number of possible group-by combinations. In most application scenarios this number is high, because several evaluation criteria on a single dimension are used, like dimensional attributes or multiple hierarchies. Using a reasonable amount of additional storage, static methods can only supply an insufficient number of aggregates. Moreover, because general static approaches do not take into account specific partitions, i.e. regions of interest or hot spots, much of the additional storage space is wasted on regions of aggregates which are never or only seldom requested.

‡. Unfortunately, we could not compare our approach with the work of [DRSN98], since they require a very special ‘chunk-based’ file system in the context of their Paradise database system.

6 Summary

In this paper, we present a novel method for the dynamic selection of materialized aggregates in the context of multidimensional database systems. Our proposed solution is based on the notion of multidimensional objects, which provide a consistent view of queries as well as of materialized aggregates. The use of multidimensional objects drastically simplifies the determination of intersections and differences of queries. Therefore, we are able to present a query-based caching strategy which overcomes the limitations of query containment by the concept of set-derivability. Set-derivability is achieved by a patch-working mechanism which (in the case of a monotonic cost function) computes the cheapest substitution of a query by a set of materialized aggregates. The cache replacement strategy is based on an overall benefit computation for each element in the aggregate cache, which consists of a linear combination of four different influence parameters. Finally, we give the results of various experiments: average cost reductions of 50% to 60% may be obtained with only 10% additional storage capacity.

The proposed optimization technique must not be seen as an alternative to traditional physical optimization methods like indexing or fragmentation, but as an addition. Our method not only explicitly supports fragmentation of the fact table but makes it even more beneficial. If aggregates are only materialized for hot spots, a single query may utilize both fragments of materialized aggregates and fragments of raw data in one operation. Our proposed strategy may easily be integrated into most relational OLAP servers, because all optimizations involve only algebraic operations on common multidimensional structures.
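The replacement policy summarized above, an overall benefit computed as a linear combination of the four influence parameters A, C, D and R, can be sketched as follows. The weights, object names and parameter values are invented for illustration, with the weighting restricted to D and R as the experiments suggest.

```python
# Hypothetical sketch of the benefit-based replacement described above:
# each cached multidimensional object carries four influence parameters
# (absolute benefit A, reconstruction cost C, relative reference density D,
# relationship degree R); the overall benefit is a weighted linear
# combination, and the object with the lowest benefit is evicted first.
# All weights and values are made up; the paper's exact parameter
# definitions are not reproduced here.

PARAMS = ("A", "C", "D", "R")

def overall_benefit(obj, weights):
    """Linear combination of the four influence parameters."""
    return sum(weights[p] * obj[p] for p in PARAMS)

def eviction_victim(cache, weights):
    """Cached object with the lowest overall benefit."""
    return min(cache, key=lambda name: overall_benefit(cache[name], weights))

# Best-performing combination in the experiments: only D and R contribute.
DR_ONLY = {"A": 0.0, "C": 0.0, "D": 0.5, "R": 0.5}

cache = {  # normalized, invented parameter values per cached object
    "sales_by_region": {"A": 0.9, "C": 0.1, "D": 0.2, "R": 0.1},
    "sales_by_month":  {"A": 0.2, "C": 0.8, "D": 0.9, "R": 0.7},
}
```

Under these invented values, `eviction_victim(cache, DR_ONLY)` picks `sales_by_region` (benefit 0.15) over `sales_by_month` (benefit 0.8), even though its absolute benefit A is high.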
Currently, we are exploring the integration of indexes on the multidimensional objects and the benefit of using special linearizations. Another research direction is the use of the fragment-based patch-working algorithm in a parallel server environment.

References

ABD+99 Albrecht, J.; Bauer, A.; Deyerling, O.; Günzel, H.; Hümmer, W.; Lehner, W.; Schlesinger, L.: Management of Multidimensional Aggregates for Efficient Online Analytical Processing. In: International Database Engineering and Applications Symposium (IDEAS’99, Montreal, Canada, August 1-3), 1999

AlGL99 Albrecht, J.; Günzel, H.; Lehner, W.: Set-Derivability of Multidimensional Aggregates. In: First International Conference on Data Warehousing and Knowledge Discovery (DaWaK’99, Florence, Italy, August 30 - September 1), 1999

BaPT97 Baralis, E.; Paraboschi, S.; Teniente, E.: Materialized Views Selection in a Multidimensional Database. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97, Athens, Greece, Aug. 25-29), 1997, pp. 156-165

ChMM88 Chen, M.C.; McNamee, L.; Melkanoff, M.: A Model of Summary Data and its Applications in Statistical Databases. In: Proceedings of the 4th International Working Conference on Statistical and Scientific Database Management (4SSDBM, Rome, Italy, June 21-23), 1988, pp. 356-372

ChSh95 Chaudhuri, S.; Shim, K.: An Overview of Cost-based Optimization of Queries with Aggregates. In: IEEE Data Engineering Bulletin 18(1995)3, pp. 3-9

CoLR90 Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.: Introduction to Algorithms. Cambridge (MA): The MIT Press; New York: McGraw-Hill, 1990

DRSN98 Deshpande, P.M.; Ramasamy, K.; Shukla, A.; Naughton, J.F.: Caching Multidimensional Queries Using Chunks. In: Proceedings of the 27th International Conference on Management of Data (SIGMOD’98, Seattle (WA), USA, June 2-4), 1998, pp. 259-270

Fink82 Finkelstein, S.: Common Expression Analysis in Database Applications. In: Proceedings of the International Conference on Management of Data (SIGMOD’82, Orlando (FL), USA, June 2-4), 1982, pp. 235-245

GHRU97 Gupta, H.; Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Index Selection for OLAP. In: Proceedings of the 13th International Conference on Data Engineering (ICDE’97, Birmingham, Great Britain, April 7-11), 1997, pp. 208-219

GuHQ95 Gupta, A.; Harinarayan, V.; Quass, D.: Aggregate-Query Processing in Data Warehousing Environments. In: Proceedings of the 21st International Conference on Very Large Data Bases (VLDB’95, Zürich, Switzerland, Sept. 11-15), 1995, pp. 358-369

Gupt97 Gupta, H.: Selection of Views to Materialize in a Data Warehouse. In: Proceedings of the 6th International Conference on Database Theory (ICDT’97, Delphi, Greece, Jan. 8-10), 1997, pp. 98-112

HaRU96 Harinarayan, V.; Rajaraman, A.; Ullman, J.D.: Implementing Data Cubes Efficiently. In: Proceedings of the 25th International Conference on Management of Data (SIGMOD’96, Montreal, Quebec, Canada, June 4-6), 1996, pp. 205-216

Klug82 Klug, A.: Equivalence of Relational Algebra and Relational Calculus Query Languages Having Aggregate Functions. In: Journal of the ACM 29(1982)3, pp. 699-717

KoRo99 Kotidis, Y.; Roussopoulos, N.: DynaMat: A Dynamic View Management System for Data Warehouses. In: International Conference on Management of Data (SIGMOD’99, Philadelphia (PA), USA, June 1-3), 1999, pp. 371-382

LeAW98 Lehner, W.; Albrecht, J.; Wedekind, H.: Normal Forms for Multidimensional Databases. In: Proceedings of the 10th International Conference on Scientific and Statistical Data Management (SSDBM’98, Capri, Italy, July 1-3), 1998, pp. 63-72

LeSh97 Lenz, H.-J.; Shoshani, A.: Summarizability in OLAP and Statistical Data Bases. In: Proceedings of the 9th International Conference on Statistical and Scientific Database Management (9SSDBM, Olympia (WA), USA, Aug. 11-13), 1997, pp. 132-143

SDJL96 Srivastava, D.; Dar, S.; Jagadish, H.V.; Levy, A.Y.: Answering Queries with Aggregation Using Views. In: Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB’96, Bombay, India, Sept. 3-6), 1996, pp. 318-329

ShSV99 Shim, J.; Scheuermann, P.; Vingralek, R.: Dynamic Caching of Query Results for Decision Support Systems. In: Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM’99, Cleveland (OH), USA, July 28-30), 1999, pp. 254-263

SuKN89 Sun, X.-H.; Kamel, N.; Ni, L.M.: Solving Implication Problems in Database Applications. In: Proceedings of the 18th International Conference on Management of Data (SIGMOD’89, Portland (OR), USA, May 31-June 2), 1989, pp. 185-192

YaKL97 Yang, J.; Karlapalem, K.; Li, Q.: Algorithms for Materialized View Design in Data Warehousing Environment. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97, Athens, Greece, Aug. 25-29), 1997, pp. 136-145