Divide and Aggregate: Caching Multidimensional Objects
                                Wolfgang Lehner, Jens Albrecht, Wolfgang Hümmer
                                                 Database Systems (IMMD6)
                                              University of Erlangen-Nuremberg
                                           Martensstr. 3, 91058 Erlangen, Germany
                                  {lehner, jalbrecht, huemmer}@informatik.uni-erlangen.de


                                                                 schemes and summary tables. Our strategy is easy to imple-
                                                                 ment in relational OLAP servers and provides excellent
                         Abstract                                support of analysis hot spots. Experiments show that aver-
    A common optimization technique in the OLAP                  age cost reductions of over 50% are easily reached while
    application domain is the use of summary                     spending only 10% of additional storage for summary data.
    aggregate data. For an appropriate support of
    analysis hot spots, we propose a query based                 Derivability
    aggregate cache of ‘multidimensional objects’.               To speed up queries, multidimensional objects are cached
    The major drawback of previous query caching                 and reused during runtime. The framework of Multidimen-
    methods is that new queries must be completely               sional Objects provide a consistent and formalized view to
    subsumed by a single cached object in order to               database queries as well as to materialized summary data.
    utilize it. Our solution is the introduction of ‘set-        They may be visualized as a partition (restricted to certain
    derivability’ which allows the combination of                selection predicates) of an aggregated data cube (using
    several aggregates to derive a single query. The set         grouping attributes) specified by a regular star-query.
    of cached multidimensional objects is                           A necessary prerequisite for reuse is derivability
    dynamically determined using both users access               ([Klug82], [Fink82]). Derivability in the context of parti-
    patterns and semantical knowledge of                         tioned aggregates concerns the following two aspects:
    dimensional structures. Experimental results                   • compatibility of grouping attributes:
    show that average cost reductions of over 50% are                 The precomputed data from which a query may be
    easily reached while spending only 10% of                         derived has to have an equal or finer granularity with
    additional storage for summary data.                              regard to the corresponding dimensional structures.*
                                                                   • compatibility of selection predicates:
1 Introduction                                                        This condition for derivability states that the precom-
                                                                      puted partition of summary data must not be more
Online Analytical Processing (OLAP) offers analysts an                restrictive than the query which is the subject of deri-
efficient interactive access to consistent business data, typ-        vation.
ically stored in a data warehouse. From the physical data-            Since predicate comparison in general is NP-complete
base design point of view, the two main obstacles to pro-             ([SuKN89]), recent work either does not consider par-
vide efficient multidimensional access to those data are to           titioning at all ([HaRU96], [BaPT97]), is limited to test
choose the right partitioning scheme and the right selection          for equality ([ShSV99]) or reduced to statically pre-
of summary tables. Whereas partitioning is a common con-              defined partitions (chunks; [DRSN98]).
cept in physical database design, the generation of sum-
mary tables is specific to the OLAP application domain           Contribution
with mostly read-only data access. Due to a limited storage
capacity not all possible aggregations may be precom-            To overcome these limits and to provide a simple and pow-
puted. In this paper we present a dynamic caching strategy       erful strategy to select and release precomputed summary
for multidimensional data incorporating both partitioning        data with limited scope, this paper contributes the follow-
                                                                 ing:
                                                                   • set-derivability: Our approach is not limited to derive
                                                                      a single query from a single summary data table but
                                                                      computes a nearly optimal construction plan for a


                                                                    *. To illustrate these dependencies [HaRU96] intro-
                                                                       duced to notion of an aggregation lattice.


W. Lehner, J. Albrecht, W. Hümmer                                                                                          2-1
                                                                                                                 Related Work


    query based on a set of precomputed summary data            algorithm of [YaKL97] computes the global query execu-
    partitions (‘patch-working on multidimensional              tion plan and selects the node of the resulting plan with the
    objects’; section 4).                                       highest number of references for precomputation. [Gupt97]
  • use of dimensional structures: Since we allow parti-        extends this procedure in the context of AND/OR-graphs
    tioning only with respect to classification hierarchies     by considering alternative local query execution plans.
    (and not for dimensional attributes) algorithms to com-     Based on the concept of an aggregation lattice, the algo-
    pose a multidimensional aggregate of a set of cached        rithm specified in [HaRU96] ([GHRU97] additionally con-
    objects become feasible.                                    siders existing index structures) statically selects those
  • dynamic caching policy: Since static precomputation         nodes of the lattice with the best cost saving ratio in relation
    of complete data cubes does not adequately support          to the additional storage required by that node. Since this
    data analysis hot spots, our approach decides ‘on-the-      approach shows a complexity of O(n3) with n as the num-
    fly’ which summary data partitions (multidimensional        ber of possible nodes of an aggregation lattice, the work of
    objects) are worth to keep materialized. Moreover,          [BaPT97] introduces heuristics to reduce this number and
    materialization of the same data cube partition at dif-     make the approach more applicable.
    ferent aggregation levels is explicitly allowed and the         In opposite to these static algorithms, dynamic or adap-
    size of a partition is not a system parameter but implic-   tive strategies try to incrementally optimize the set of pre-
    itly determined by the user.                                computed aggregates. On the relational level, [ShSV99]
  • easy to integrate: Since we perform algebraic optimi-       materializes the results of arbitrary complex queries.
    zation, our algorithm may be easily integrated into         Unfortunately, by using hash codes they are able to reuse
    relational OLAP-servers.                                    materialized queries only in the case of an exact match.
  • high average cost reduction: As our experiments show,       This is not acceptable in a real OLAP scenario. In the case
    we achieve high cost reductions with only very decent       of an exact match with regard to the group-by attributes, the
    storage overhead at almost no cost.                         recent work of [DRSN98] proposes a chunk-based caching
                                                                strategy. The materialization of chunks (see below) is
Structure of the Paper                                          resolved according to a classical CLOCK scheme, thus
                                                                directly reflecting the user access behavior. In [KoRo99] a
In the next section, we give a comprehensive overview of
                                                                similar approach is presented. Again, a single query can
recent work which is related to selection and use of materi-
                                                                only be answered from a single materialized fragment.
alized aggregates in the context of OLAP and data ware-
housing. The third section formally introduces the one- and
                                                                Table-, Query-, Chunk-based Partitioning Schemes
multidimensional structures of the considered multidimen-
sional model. Based on the notion of multidimensional           The second classification criteria of existing strategies is
objects, section 4 defines the concept of set-derivability      based on the unit of precomputation. Basically there are
using ‘patch-working’ and outlines the four different influ-    three levels, namely table-, query-, and chunk-based strat-
ence factors to compute the overall benefit of a single         egies.
cached multidimensional object. Finally, results of various        Algorithms like [HaRU96], [GHRU97], and [BaPT97]
experiments are presented in section 5. The paper con-          do not support partitioning and are therefore classified as
cludes with a summary.                                          table-based strategies. Compatibility of selection predi-
                                                                cates has not to be checked since all summary tables are
2 Related Work                                                  defined on an unrestricted scope (the second condition of
                                                                derivability is trivial). However, this implies that analysis
The general idea of precomputing summary data is rooted         hot spots like “the current period” are not reflected appro-
in the application area of ‘Statistical and Scientific Data-    priately, because the algorithms materialize only summary
bases’. Several proposals like [ChMM88] were made but           tables containing all periods.
the idea became popular with the emergence of ‘Data                On the opposite, chunk-based partitioning schemes like
Warehousing’ together with ‘Online Analytical Process-          [DRSN98] define relatively small aggregate partitions of a
ing’ and the resulting need for an efficient and mostly read-   fixed size (chunks). Queried chunks are cached and later
only access to aggregates in a multidimensional context         reused to answer new queries. On the one hand, [DRSN98]
([ChSh95]).                                                     reports very high cost saving ratios in their experiments. On
                                                                the other hand however, the implementation of special bit-
Static versus Adaptive Precomputation and Caching               wise index structures to identify single chunks is extremely
Schemes                                                         expensive. Moreover, they use a special chunk-based file
                                                                system preventing a seamless integration into relational
A first classification of existing work may be specified
                                                                OLAP servers. At last, the choice of the chunk size must be
according to the flexibility of the scheme with regard to
                                                                statically determined and is crucial for the efficiency of the
user access behavior. Given a set of relational queries, the
                                                                whole system.


W. Lehner, J. Albrecht, W. Hümmer                                                                                           2-2
                                                                                                       The Multidimensional Model


    The third class of different partitioning schemes is made   Example 1: The set of dimensions of a typical OLAP sam-
up by query-based partitioning schemes ([SDJL96],                  ple scenario consists of a product, time, and shop di-
[ShSV99], [YaKL97], [GuHQ95]). Single user queries                 mension. Within the shop dimension for example, the
reflect the unit of caching and make the selection of the          primary attribute corresponds to a unique shop identi-
right partition size obsolete. Recent work reuses results of       fier. A regional classification may be defined by the se-
former queries either by query containment or by exact             quence city → state → region → country. The set of di-
match ([SDJL96], [ShSV99]) where the incoming query                mensional attributes may be defined as { shop-type,
must be completely derivable from the materialized query.          purchase-class, selling-area, shop-window-size }. In
Moreover, it is further argued against query-based parti-          real OLAP scenarios, a single (usually a product or a
tioning ([DRSN98]) that several materialized queries may           customer) dimension may easily have 20 different at-
address a common part of a data cube. This would result in         tributes.
a redundant storage of that part and decrease the utilization
of the additional storage.                                      3.2       Multidimensional Structures
                                                                The basic data structure in the multidimensional context,
3 The Multidimensional Model                                    the notion of a multidimensional object, is of fundamental
                                                                importance. With regard to the caching algorithms
The following section introduces the dimensional and mul-
                                                                described in the following section, all incoming queries as
tidimensional structures which build the formal representa-
                                                                well as materialized results of former queries are internally
tion of the multidimensional model. The description is
                                                                represented as multidimensional objects.
divided into the presentation of dimensions and multidi-
mensional objects both of which are necessary to define a       Definition 2: A multidimensional object (MO) is a triple M
multidimensional database ([LeAW98]).                               = (M, S, G) consisting of a measure M, an n-ary vector
                                                                    of classes S = (s1, ..., sn) denoting the scope, and the
3.1    Dimensional Structures                                       granularity specification G.
                                                                    A measure is a tuple M = [N, A, O] where N is a re-
Dimensional structures are defined in the conceptual                served name for the corresponding fact, A ∈ { FLOW,
schema design phase of a database. The semantic informa-            STOCK, VALUEperUNIT } is the aggregation type,
tion coded into these structures is heavily used for cache          and O ∈ { NONE, SUM, AVG, COUNT } is the ap-
replacement.                                                        plied aggregation function on that specific fact.
Definition 1: A dimension D of a multidimensional data-             A granularity specification is a tuple G = [L, P] where
    base is a set of attributes D = { PA } ∪ { CA1, ..., CAp        L (‘aggregation Level’) is an n-ary vector of classifica-
    } ∪ { DA1, ..., DAq } and the following set of function-        tion attributes of each dimension
                                                                                   1           1                   n           n
    al dependencies between these attributes:                          (L ∈ { CA 0 , ..., CA p } ⊗ ... ⊗ { CA 0 , ..., CA p } ),
                                                                                                   1                               n
      • ∀i (1≤i≤p): PA→CAi and ∀i (1≤i≤q): PA→DAi                      and P (‘Properties’) is a set of dimensional attributes
        The primary attribute PA of a dimension func-                              1           1                   n           n
        tionally determines all other attributes of a single           (P ∈ { DA 1 , ..., DA q } ∪ ... ∪ { DA 1 , ..., DA q } ).
                                                                                                   1                               n
        dimension. The instances of the primary attribute       Example 2: The formal representation of a sales data cube
        are called dimensional elements.                             for video products in Germany would be as follows:
      • ∀i (1≤i≤p-1): CAi→ CAi+1                                     M = ( [ SALES , FLOW, SUM ],
        The classification attributes CAi of a dimension                     ( p.group = ‘Video’, s.country = ‘Germany’ ),
        are used to define a multi-level grouping of single                  [ (p.family, s.region), { p.brand, s.shoptype } ] )
        dimensional elements and are ordered according             The aggregation type determines the set of applicable
        to their functional dependencies. The aggregation       operations on the numeric values to ensure correct summa-
        level or granularity i of a classification attribute    rizability ([LeSh97]). For example, all aggregation opera-
        CAi is denoted as CAi. By definition, the primary
        attribute has the aggregation level 0 (PA = CA0).
      • ∀i,j i≠j (1≤i,j≤q): DAi → / DAj                               †. Moreover it is worth to mention that this definition
        The dimensional attributes DAi of a dimension                    does not explicitly consider multiple hierarchies with
        represent all attributes describing properties of the            more than one level. Multiple hierarchies would
        dimensional elements.                                            introduce a partial order on the classification
                                                                         attributes (instead of a total order for the single clas-
   Although the concept of dimensional attributes is more
                                                                         sification hierarchy). This extension would not affect
complex ([LeAW98], [ABD+99]), dimensional attributes
                                                                         any of the presented derivability or caching algo-
may be seen as additional single-level hierarchies on a                  rithms but would complicate the presentation. There-
dimension, which implies that they are functionally inde-                fore, we only consider a single classification
pendent of each other†.                                                  hierarchy.


W. Lehner, J. Albrecht, W. Hümmer                                                                                                      2-3
                                                                                                  The Multidimensional Model


tions are applicable to sales figures (A = FLOW), whereas        ence of two n-dimensional cubes would result in the worst
the sum operation would not be allowed on figures denot-         case in a set of 3n-1 convex residual cubes - assuming
ing the price of single articles (A = VALUEperUNIT). The         exactly 3 classification nodes partitioning each dimension.
aggregation type STOCK prohibits summation in the tem-               It can be shown                       32-1=8
poral dimension. Finally the size of a multidimensional          that      intelligent
object M at the instance level is denoted by |M |.               slicing can reduce
                                                                 this number to 2n
Relations and Operations on Multidimensional Objects             residual        cubes             -           =
                                                                 ([ChMM88],
To formally define ‘derivability’ on multidimensional
                                                                 proof               of     M           M’
objects, we need to introduce the ‘≤‘-relation for the gran-
                                                                 proposition 4).
ularity specification and the intersection of scopes of mul-
                                                                 Informally,        the                    2*2=4
tidimensional objects. Moreover, we need intersection and
                                                                 cube is cut itera-            Figure 1: Difference of
difference operators of multidimensional objects them-
                                                                 tively for each              multidimensional objects
selves.
                                                                 dimension         into
Definition 3: A granularity specification G = [L, P] is finer    maximal three slices; two slices are residual cubes and are
    than or equal to a granularity specification                 not considered any longer; the third slice contains the oper-
    G’ = [L’, P’], i.e. G ≤ G’, if and only if                   and and is sliced further in the remaining dimensions. For
      • all aggregation levels of L are lower or equal than      n dimensions we obtain 2n residual cubes.
        the aggregation levels of L’                                 In the general case with dimensions consisting of an
      • P is a superset of dimensional attributes with           arbitrary number of classification nodes for each dimension
        regard to P’, i.e. P’ ⊆ P                                i (1≤i≤n) the minimum number pi of partitioning classifica-
    or                                                           tion nodes including the operand to be subtracted has to be
                                                                 determined. Then the number of residual cubes will be
      • G represents raw data, i.e. L = (PA1, ..., PAn).
                                                                 ∑ ( pi – 1 ) , 1≤i≤n, because of the same argumentation as
Definition 4: The intersection of multidimensional scopes        above.For more detailed information refer to [AlGL99].
    is defined as the component-wise intersection of the         Definition 6: The difference of two multidimensional ob-
    classes in each dimension, i.e. by the homomorphism              jects M = (M, S, G) and M’ = (M’,S’,G’), i.e. M − M’
    S ∩ S’ = (s1, ..., sn) ∩ (s’1, ..., s’n)                         is defined as a set {(M, S1, G)1, ..., (M, Sk, G)k} where
                       := ( (s1 ∩ s’1), ..., (sn ∩ s’n) )            k is the number of residual cubes and Si are the scopes
    The intersection is empty if two classes si and s’i do           of residual objects obeying the following conditions:
    not show a parent-child relationship within the corre-
                                                                                            ∪ Si and i∩
                                                                                              k            k
    sponding classification hierarchy of dimension i.                           S = S' ∪                 Si = ∅ .
                                                                                            i=1       =1
   It is important to note that the problem of intersecting
                                                                 Derivability of Multidimensional Objects
classes in a classification hierarchy may be solved by
checking the simple parent-child relationships. This con-        To reuse the results of former queries, derivability of mul-
cept is fundamental for the applicability of the set-deriv-      tidimensional objects as the units of caching is a necessary
ability mechanism.                                               prerequisite. The following definition of the total derivabil-
Definition 5: The intersection of multidimensional objects       ity will be relaxed in the next section to the partial deriv-
    M = (M, S, G) and M’ = (M’,S’,G’), i.e. M ∩ M’ is de-        ability which is sufficient to define the set-derivability.
    fined as M ∩ M’ = (M, S ∩ S’, G) if and only if              Definition 7: A measure M = (N, A, O) is derivable from a
       • both objects refer to the same measure, i.e.                measure M’= (N’, A’, O’) if and only if
          M = M’                                                        • the measures refer to the same real world entity
       • the intersection of the scopes is not empty, i.e. S ∩             (N = N’ and A = A’)
          S’ ≠ ∅                                                        • the operations are compatible, i.e., O = O’ or
       • the granularity specification of M is finer than or               O’ = NONE
          equal to the granularity specification of M’, i.e.        Note that semi-additive functions like AVG are split into
          G ≤ G’.                                                the additive and non-additive counterparts ({SUM,
    As seen in this definition the coarser multidimensional      COUNT} and division ). The additive operations build a
object M’ is implicitly refined to the granularity level G.      regular subject of the caching mechanism. Thus by materi-
    To be able to introduce the concept of set-derivability by   alizing multidimensional objects with the applied opera-
substitution (‘patch-working’; section 4), we need to define     tions SUM and COUNT, an AVG-query will never be
the difference of two multidimensional objects. As illus-        directly but only indirectly cached.
trated in figure 1 for the two-dimensional case, the differ-


W. Lehner, J. Albrecht, W. Hümmer                                                                                          2-4
                                                                                    Set-Derivability using ‘Patch-Working’


Definition 8: A multidimensional object M = (M, S, G) is               • SQ = ∪i Si,
    totally derivable from a multidimensional object                   • ∀i, j (1≤i,j≤k) with i≠j: Si ∩ Sj = ∅.
    M’ = (M’,S’,G’), if and only if                                 In this case M is said to be set-derivable from { M1’,
      • the measure M in M is derivable from the measure            ..., Mk’ }
        M’ in M’                                                    Each patch in S has two components 〈Mi, Mi’ 〉 denoting
      • M’ contains M, i.e. S ∩ S’ = S                          that partition Mi of MQ is to be computed from Mi’. All par-
      • the granularity specification of M’ is finer than M,    titions must be disjoint and their union must result in MQ.
        i.e. G’ ≤ G
                                                                The Query Optimization Problem
4 Set-Derivability using ‘Patch-Working’                        In general, the cache contains several possibly overlapping
                                                                aggregates with different granularities which support a
Although derivabil-
                                                                query. Therefore, the problem of the multidimensional
ity is the fundamen-
                                                                query optimizer is to choose a substitution that allows the
tal concept for the
                                                                computation of the query at minimal cost. This problem is
use of precomputed M1               MQ           M3
                                                                very similar to the well-known knapsack problem
aggregates, in many
                                                                ([CoLR90]), where a thieve tries to select the most valuable
cases definition 8 it
                                                                set of things with different values and weights. A common
is too restrictive.
                                   M2                           method is to use a greedy strategy to approximate the opti-
Consider the exam-
                                                                mal solution for the NP-complete 0/1 knapsack problem. In
ple in figure 2. The
                             Figure 2: Patch-Working            the case of the fractional knapsack problem where the
query object MQ is
                                                                thieve may take arbitrary fractions of objects the greedy
not derivable from
                                                                approach actually yields the optimum. The problem of
any of M1, M2, or M3 alone. However, it can be derived
                                                                finding an optimal substitution can be reduced to the frac-
from a combination of M1, M2 and M3. The example illus-
                                                                tional knapsack, if the cost to compute a fraction of one
trates the problem of previous query caching methods
                                                                multidimensional object from another one is monotonous
([ShSV99], [SDJL96]) where a materialized aggregate can
                                                                in the size of the support.
only be accessed if the complete query is derivable from
this single cached object. This is because query contain-       Definition 11: A cost function Cost(M, M’ ) determining
ment can be checked with a relatively simple algorithm              the cost to compute the support of a multidimensional
where the determination of the overlapping regions for              object M from M’ is monotonous, if for any two arbi-
arbitrary predicates is NP-complete ([SuKN89]). However,            trary multidimensional objects M’ and M’’ holds:
in the context of multidimensional objects where the pred-               | M ∩ M’ | < | M ∩ M’’ |
icates consist of scope definitions corresponding to classes                         ⇒ Cost(M, M’ ) < Cost(M, M’’ ).
in a classification hierarchy this problem can be reduced to        In the case of a monotonous cost function, an algorithm
the determination of ancestor and sibling relationships.        selecting the cheapest substitution for each data cell in the
Definition 9: A multidimensional object M = (M, S, G) is        multidimensional object M automatically finds the optimal
    partially derivable from a multidimensional object          substitution (‘Greedy-Choice property’; [CoLR90]). How-
    M’ = ( M’, S’, G’), i.e. M’ „ M, if and only if             ever, especially in the context of relational databases
                                                                monotony of the cost function may not be given. In a rela-
      • the conditions (1) and (3) of the total derivability
                                                                tional backend the computation of a multidimensional
         hold for M and M’
                                                                object results in range queries on the affected tables repre-
      • M and M’ are overlapping, i.e. S ∩ S’ ≠ ∅
                                                                senting the supporting multidimensional objects. In the
    In this case, M’ is said to support M, and M ∩ M’ is
                                                                absence of indexes, range queries on relational tables usu-
    called the support of M’ for M.
                                                                ally result in full table scans destroying the property of
   From the set of supporters, i.e. the potential candidates,   monotony.
an appropriate set of partitions, or patches, has to be             In order to find a substitution for a query we use the
selected in order to answer a query. Such a set is called a     greedy strategy outlined in algorithm 1. The algorithm gets
substitution for the query. .                                   the query MQ and the set C of all materialized multidimen-
Definition 10: A set S = { 〈M1, M1’ 〉,...,〈Mk, Mk’ 〉 } with     sional objects as input parameters. It returns a substitution
    Mi = (Mi, Si, Gi) and Mi’ = (Mi’, Si’, Gi’) is called a     S and the estimated cost to compute MQ. As long as there
    substitution for a multidimensional object MQ, if and       are still uncomputed fragments remaining (line 5), the can-
    only if                                                     didate with the least access cost per cell for each fragment
                                                                is selected (lines 9-21). After the selected patch is added to
      • ∀i (1≤i≤k): Mi = MQ, Gi = GQ, and                       the solution S (line 24) it must be removed from the remain-
      • ∀i (1≤i≤k): Mi’ „ Mi,                                   der R using the difference operator (line 28). Since M may


W. Lehner, J. Albrecht, W. Hümmer                                                                                         2-5
                                                                                                                         Experimental Results


  Algorithm: Substitution (Patch Working)
  Input: set of existing multidimensional objects C = {M1, ..., Mn}
          multidimensional object representing the query MQ,
  Output: substitution S for MQ
                                                                          16              // if relative cost of new candidate is less
                                                                          17              If ( costrel > Cost(M, Mcand) / |M ∩ Mcand| )
  1    // initialize remainder, patch-work and cost
                                                                          18                    Mbest := Mcand;
  2    R := { MQ }; S := ∅; costtotal := 0
                                                                          19                    costrel := Cost(M, Mcand) / |M ∩ Mcand|;
  3
                                                                          20              End If
  4    // iterate over all
                                                                          21         End Foreach
  5    While ( R ≠ ∅)
                                                                          22
  6          MBest := ∅; costrel := ∞;
                                                                          23         // add the best patch to the result
  7
                                                                          24         S := S ∪ { 〈M∩Mbest, Mbest〉 };
  8        // find cheapest substitution for each remaining patch
                                                                          25         costtotal := costtotal+Cost(M, Mbest);
  9        Foreach M ∈ R
                                                                          26
  10             // loop through all candidates
                                                                          27         // remove the new patch from the remainder
  11             Foreach Mcand ∈ C
                                                                          28         R := R \ { M } ∪ (M-Mbest);
  12                  // next candidate if this one is not supporting M
                                                                          29    End Foreach
  13                  If ( Not ( Mcand „ M ) )
                                                                          30 End While
  14                       Next;
                                                                          31 Return (S, costtotal)
  15                  End If
                                                                          32 End

                                 Algorithm 1: Substitution of a multidimensional object (‘patch-working’)
only be partially derivable from Mbest the uncovered differ-                 • Reconstruction Cost CM(MQ, C):
ence M-Mbest must remain in R. It is worth to mention here                     Since cached multidimensional objects may be com-
that a solution is always possible if C contains also the raw                  puted from the set of multidimensional objects being
data objects                                                                   currently an element of the cache, it may prove benefi-
                                                                               cial if those multidimensional objects are replaced
Cache Replacement                                                              which can be reconstructed most easily.
                                                                              Since the factors have very different orders of magni-
The knowledge of the OLAP application domain provides
                                                                          tude, they are normalized by their average values before the
not only a lot of semantic information about the data model
                                                                          overall benefit BM(MQ, C) of a multidimensional object M
but also about the users behavior. In order to incorporate
                                                                          for a cache configuration C after query MQ is computed by
reference density as well as semantic knowledge the benefit
                                                                          a linear combination. The overall benefit is used as the indi-
in our approach is defined by the four factors of the
                                                                          cator, whether there is a change of the set of cached multi-
weighted relative reference density, the absolute benefit,
                                                                          dimensional objects or not.
the degree of relationship, and the reconstruction cost
([ADB+99]):
  • Weighted Relative Reference Density DM                                5 Experimental Results
     The weighted relative reference density is used to
                                                                          The following section illustrates the optimization potential
     approximate the traditional least reference density
                                                                          of the presented algorithms for caching multidimensional
     (LRD) in the multidimensional context. A reference
                                                                          objects. Our proposed dynamic caching strategy is ana-
     counter is increased only by the fraction of M that was
                                                                          lyzed according to different cache sizes and influence fac-
     actually referenced.
                                                                          tors for the computation of the overall benefit. Further-
  • Absolute Benefit AM:
                                                                          more, we compare our strategy with the static precomputa-
     The absolute benefit is based on the size, the granular-
                                                                          tion approach of [HaRU96].
     ity, and the scope of the multidimensional object and
     therefore generalizes the benefit definitions from
                                                                          Experimental Setup
     [HaRU96] as well as [DRSN98].
  • Degree of Relationship RM(MQ):                                        The proposed data structures and algorithms are imple-
     In the OLAP context, it is very likely that an object in             mented in the multidimensional OLAP server CUBESTAR.
     the cache that can be reached within a few navigational              All query processing and cache management is based on
     operations will be used with a high probability. There-              algebraic operations on multidimensional objects. How-
     fore, the number of navigation operations which are                  ever, multidimensional objects are physically stored in a
     necessary to transform the last query MQ into a mate-                relational backend (Oracle 8.0.4; DEC Alpha 3000/800,
     rialized, i.e. cached multidimensional object M defines              1CPU, 192MByte RAM) in a star-schema-like manner
     the degree of relationship for M.                                    (ROLAP system). For each patch in the optimal substitu-
                                                                          tion for the query (algorithm 1), the query processor gener-


W. Lehner, J. Albrecht, W. Hümmer                                                                                                          2-6
                                                                                                                          Experimental Results


       70%                                                        70%

       60%                                                        60%

       50%                                                        50%
                                                                                                                           Parameters:
       40%                                                        40%                                                          D and R
 CSR


                                                            CSR
                                                                                                                               D, A, R and C
       30%                                                        30%                                                          A and C
                                                                                                                               C alone
       20%                                                        20%

       10%                                                        10%

       0%                                                         0%
             0   200     400       600         800   1000               0   200    400   600   800 1000 1200    1500

                  a) single user, 1000 steps                                 b) multiple users, 1500 steps

                               Figure 3: Cost saving ratio for different combinations of cache parameters

ates an SQL statement, which are put together by UNION                            In the experimental scenario, two user traces, one simulat-
ALL operations to yield the final result. We could not com-                       ing 1000 operations of single user and another one 1500
pletely eliminate side-effects of the system, especially of                       steps of five independent users, were used to investigate the
the database buffer, but multiple runs confirmed the follow-                      behavior of the cache for varying cache sizes and configu-
ing results.                                                                      rations.

Multidimensional Database Schema and Query                                        Influence of the Cache Parameters
Generation
                                                                                  The first set of graphs in figure 3 illustrates the impact of
The schema of the multidimensional database consists of                           the four cache parameters using a cache size of 20% of the
three dimensions with five, four and six levels, and a set of                     raw data volume. The figures show the development of the
up to 14 dimensional attributes, respectively. The corre-                         cost saving ratios for the single and the multi user trace. If
sponding raw data cube was randomly filled with a sparsity                        all parameters are enabled, cost savings of about 56% in the
of 3% resulting in about 250.000 tuples in the relational                         single user case and 50% in the multi user case were
representation. To simulate the cache replacement behav-                          obtained. By trying all possible combinations of parame-
ior, a new query is constructed from the previous query by                        ters it turned out that for both the single and the multi user
applying a single operation in a certain dimension with a                         traces the usage of the relative reference density D and the
predefined probability. Since a typical OLAP user does not                        relationship degree R yields the best results of 60% and
switch very often between dimensions, in 80% of the cases                         52% respectively. On the contrary, the worst results of 35%
the operation was executed on the same dimension and in                           and 28%, respectively, were achieved using the absolute
20% of the cases the operation involved a different dimen-                        benefit A and the reconstruction cost C.
sion as the previous one. The set of possible operations                              It is interesting to note that the number of potential users
included slice (scope changes to a child node), unslice                           of a materialized aggregate, reflected by the absolute bene-
(scope changes to a parent node), drill-down (granularity                         fit A, is not suitable for the determination the overall bene-
changes to the next lower level), roll-up (granularity                            fit. Algorithms like [HaRU96] are based solely on this fac-
changes to the next higher level), split (add a dimensional                       tor. Furthermore, it turns out that the general critics accord-
attribute) and merge (remove a dimensional attribute).                            ing to query-based caching schemes namely the possibly
                                                                                  redundant storage of the same part of data cube in different
Performance Metric                                                                objects is of minor interest, because the influence factor of
                                                                                  the reconstruction cost also does very badly.
Since in general the cache management aims at maximiz-
ing the cost saving ratio ([ShSV99], [DRSN98]), we
                                                                                  Influence of the Cache Size
defined this metric in the context of multidimensional
objects. The cost saving ratio CSR on cached multidimen-                          The diagram in figure 4 shows the simulation results for
sional objects is defined by the division of the sum of the                       different cache sizes varying over 10.000, 25.000, 50.000,
costs of the queries using the cache content and the total                        100.000, and 500.000 tuples. Compared to the raw data
cost to compute the queries from the non-aggregated raw                           volume these values result in an additional storage over-
data.                                                                             head of 4%, 10%, 20%, 40% and 200%, respectively. For
                                                                                  these simulations the best combination of cache parame-


W. Lehner, J. Albrecht, W. Hümmer                                                                                                              2-7
                                                                                                                       Summary


                          70%

                          60%                                                            single user,
                                                                                         1000 steps
                          50%
                                                                                         5 users,
                    CSR   40%                                                            1500 steps

                          30%                                                       Operations:

                          20%                                                       15% slice / 6% unslice
                                                                                    14% drill-down / 30% roll-up
                          10%                                                       20% split / 15% merge

                            0
                                4%       10%      20%        40%   200%
                          Figure 4: Query trace configurations and cost savings with varying cache sizes
ters, i.e. D and R, was used. As can be seen, already with a        only supply an insufficient number of aggregates. More-
relatively small cache of 4% of the raw data size, cost             over, because general static approaches do not take into
reductions of over 50% in the single user mode and over             account specific partitions, i.e. regions of interest or hot
40% in the multi user mode are possible. However, the cost          spots, much of the additional storage space is wasted for
savings in the single user mode do not increase much over           regions of aggregates which are never or seldom requested.
60% for any cache size beyond 50.000 tuples. The figure
illustrates that even for multiple users a cache of a fairly        6 Summary
small size yields cost reductions not much worse than for a
single user, i.e. above 50%.                                        In this paper, we present a novel method for the dynamic
                                                                    selection of materialized aggregates in the context of mul-
Comparison to Static Precomputation of Aggregates                   tidimensional database systems. Our proposed solution is
                                                                    based on the notion of multidimensional objects which pro-
In order to compare our method to other materialization
                                                                    vide a consistent view to queries as well as to materialized
methods of aggregations, we implemented a static precom-
                                                                    aggregates. The use of multidimensional objects drastically
putation algorithm based on [HaRU96]‡. This algorithm
                                                                    simplifies the determination of intersections and differ-
computes a set of aggregates with an unrestricted scope
                                                                    ences of queries. Therefore, we are able to present a query-
based on a benefit definition similar to the absolute benefit.
                                                                    based caching strategy overcoming the limitations of query
Figure 4 shows the cost saving ratio for different cache
                                                                    containment by the concept of set-derivability. Set-deriv-
sizes. If only 4% of additional storage are provided, there
                                                                    ability is achieved by a patch-working mechanism, which
is almost no cost reduction possible for the static algorithm
                                                                    (in the case of a monotonous cost function) computes the
(therefore not in the figure), because the size of almost all
                                                                    cheapest substitution of a query by a set of materialized
useful aggregates with an unrestricted scope is larger than
                                                                    aggregates. The cache replacement strategy is based on the
the cache size. Although with increasing cache size the cost
                                                                    overall benefit computation for each element in the aggre-
saving ratio of the static approach rises, the values for the
                                                                    gate cache which consists of a linear combination of four
dynamic strategies remain unrivaled. Even for a cache size
                                                                    different influence parameters. At last we give the results of
twice as large as the raw data volume the cost saving ratios
                                                                    various experiments. Average cost reductions of 50% to
for the single and multi user cases are only about 40% and
                                                                    60% may be obtained by only 10% additional storage
30%, respectively. Note that the dynamic strategy yields
                                                                    capacity.
over 50% with 1/50 of that cache size.
                                                                        The proposed optimization technique must not be seen
    The reason for the bad results of the static approach is
                                                                    as an alternative to traditional physical optimization meth-
that the size of the aggregation lattice grows exponentially
                                                                    ods like indexes or fragmentation, but as an addition. Our
in the number of possible group-by combinations. In most
                                                                    method not only explicitly supports fragmentation of the
application scenarios this number is high, because several
                                                                    fact table but makes it even more beneficial. If aggregates
evaluation criteria on a single dimension are used, like
                                                                    are only materialized for hot spots, a single query may uti-
dimensional attributes or multiple hierarchies. Using a rea-
                                                                    lize both fragments of materialized aggregates and frag-
sonable amount of additional storage, static methods can
                                                                    ments of raw data in one operation. Our proposed strategy
                                                                    may easily be integrated into most relational OLAP serv-
   ‡. Unfortunately, we could not compare our approach              ers, because all optimizations involve only algebraic oper-
      with the work of [DRSN98], since they require a very          ations on common multidimensional structures.
      special ‘chunk-based’ file system in the context of
      their Paradise database system.


W. Lehner, J. Albrecht, W. Hümmer                                                                                             2-8
                                                                                                               Summary


   Currently, we are exploring the integration of indexes on    GHRU97 Gupta, H.; Harinarayan, V.; Rajaraman, A.;
the multidimensional objects and the benefit of using spe-             Ullman, J.D.: Index Selection for OLAP. In:
cial linearizations. Another research is the use of the frag-          Proceedings of the 13th International
ment-based patch-working algorithm in a parallel server                Conference on Data Engineering (ICDE’97,
environment.                                                           Birmingham, Great Britain, April 7-11), 1997,
                                                                       pp. 208-219
References                                                      GuHQ95 Gupta, A.; Harinarayan, V.; Quass, D.:
ABD+99 Albrecht, J.; Bauer, A.; Deyerling, O.; Günzel,                 Aggregate-Query      Processing     in    Data
       H.; Hümmer, W.; Lehner, W.; Schlesinger, L.:                    Warehousing Environments. In: Proceedings of
       Management of multidimensional Aggregates                       the 21th International Conference on Very
       for efficient Online Analytical Processing. In:                 Large Data Bases (VLDB’95, Zürich,
       International Database Engineering and                          Schwitzerland, Sept. 11-15), 1995, pp. 358-369
       Applications Symposium (IDEAS'99, Montreal,              Gupt97    Gupta, H.: Selection of Views to Materialize in
       Canada, August 1-3), 1999                                          a Data Warehouse. In: Proceedings of the 6th
AlGL99     Albrecht, J.; Günzel, H.; Lehner, W.: Set-                     International Conference on Database Theory
           Derivability of Multidimensional Aggregates.                   (ICDT‘97, Delphi, Greece, Jan. 8-10), 1997,
           In: First International Conference on Data                     pp. 98-112
           Warehousing and Knowledge Discovery                  HaRU96    Harinarayan, V.; Rajaraman, A.; Ullman, J.D.:
           (DaWaK’99, Florence, Italy, August 30 -                        Implementing Data Cubes Efficiently. In:
           September 1), 1999                                             Proceedings of the 25th International
BaPT97     Baralis, E.; Paraboschi, S.; Teniente, E.:                     Conference on Management of Data,
           Materialized    Views    Selection    in    a                  (SIGMOD’96, Montreal, Quebec, Canada,
           Multidimensional     Database.    In:    23rd                  June 4-6), 1996, pp. 205-216
           International Conference on Very Large Data          Klug82    Klug, A.: Equivalence of Relational Algebra
           Bases (VLDB’97, Athens, Greece, Aug 25-29),                    and Relational Calculus Query Languages
           1997, pp. 156-165                                              Having Aggregate Functions. In: Journal of the
ChMM88 Chen, M. C.; McNamee, L.; Melkanoff, M.: A                         ACM 29(1982)3, pp. 699-717
       Model of Summary Data and its Applications in            KoRo99    Kotidis, Y.; Roussopoulos, N.: DynaMat: A
       Statistical Databases. In: Proceedings of the 4th                  Dynamic View Management System for Data
       International     Working      Conference      on                  Warehouses. In: International Conference on
       Statistical     and      Scientific   Database                     Management      of    Data     (SIGMOD’99,
       Management (4SSDBM, Rome, Italy, June 21-                          Philadephia (PA), U.S.A., June 1-3), 1999, pp.
       23), 1988, pp. 356-372                                             371-382
ChSh95     Chaudhuri, S.; Shim, K.: An Overview of Cost-        LeAW98 Lehner, W.; Albrecht, J.; Wedekind, H.: Normal
           based Optimization of Queries with                          Forms for Multidimensional Databases. In:
           Aggregates. In: IEEE Data Engineering                       Proceedings of the 10th International
           Bulletin 18(1995)3, pp. 3-9                                 Conference on Scientific and Statistical Data
CoLR90     Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.:             Management (SSDBM’98, Capri, Italy, July 1-
           Introduction to Algorithms, Cambridge (MA),                 3), 1998, pp. 63-72
           London: The MIT Press; New York u.a.:                LeSh97    Lenz, H.-J.; Shoshani, A.: Summarizability in
           McGraw-Hill Book Company, 1990                                 OLAP and Statistical Data Bases. In:
DRSN98 Deshpande, P.M.; Ramasamy, K.; Shukla, A.;                         Proceedings of the 9th International
       Naughton, J.F.: Caching Multidimensional                           Conference on Statistical and Scientific
       Queries Using Chunks. In: Proceedings of the                       Database Management (9SSDBM, Olympia
       27th International Conference on Management                        (WA), USA, Aug. 11-13), 1997, pp. 132-143
       of Data (SIGMOD’98, Seattle (WA), USA,                   SDJL96    Srivastava, D.; Dar, S.; Jagadish, H.V.; Levy,
       June 2-4), 1998, pp. 259-270                                       A.Y.: Answering Queries with Aggregation
Fink82     Finkelstein, S.: Common Expression Analysis                    Using Views. In: Proceedings of 22th
           in Database Applications. In: Proceedings of                   International Conference on Very Large Data
           the International Conference on the                            Bases (VLDB’96, Bombay, India, Sept. 3-6),
           Management of Data (SIGMOD’82, Orlando                         1996, pp. 318-329
           (FL), USA, June 2-4), 1982, pp. 235-245


W. Lehner, J. Albrecht, W. Hümmer                                                                                    2-9
                                                             Summary


ShSV99    Shim, J.; Scheuermann, P.; Vingralek, R.:
          Dynamic Caching of Query Results for
          Decision Support Systems. In: Proceedings of
          the 11th International Conference on Scientific
          and Statistical Database Management
          (SSDB’99, Cleveland (OH), U.S.A., July. 28-
          30), 1999, pp. 254-263
SuKN89    Sun, X.-H.; Kamel, N.; Ni, L.M.: Solving
          Implication     Problems      in Database
          Applications. In: Proceedings of the 18th
          International Conference on Management of
          Data (SIGMOD’89, Portland (OR), USA, May
          31-June 2), 1989, pp. 185-192
YaKL97    Yang, J.; Karlapalem, K.; Li, Q.: Algorithms for
          Materialized View Design in Data Warehousing
          Environment. In: Proceedings of the 23rd
          International Conference on Very Large Data
          Bases (VLDB’97, Athens, Greece, Aug. 25-
          29), 1997, pp. 136-145


W. Lehner, J. Albrecht, W. Hümmer                               2-10