=Paper= {{Paper |id=Vol-28/paper-1 |storemode=property |title=Processing relational OLAP queries with UB-Trees and multidimensional hierarchical clustering |pdfUrl=https://ceur-ws.org/Vol-28/paper1.pdf |volume=Vol-28 |dblpUrl=https://dblp.org/rec/conf/dmdw/MarklB00 }} ==Processing relational OLAP queries with UB-Trees and multidimensional hierarchical clustering== https://ceur-ws.org/Vol-28/paper1.pdf
           Processing Relational OLAP Queries with UB-Trees and
                  Multidimensional Hierarchical Clustering
                                   Volker Markl                                                        Rudolf Bayer
                          Bayerisches Forschungszentrum                                           Institut für Informatik
                           für Wissensbasierte Systeme                                        Technische Universität München
                                   (FORWISS)                                                     Orleanstraße 34, D-81667
                           Orleansstraße 34, D-81667,                                               München, Germany
                               München, Germany                                                      bayer@in.tum.de
                             volker.markl@forwiss.de


                                                                          1 Introduction
                                          Abstract                                    The most established relational data models for data
                                                                                      warehousing applications are the star schema and the
    Multidimensional access methods like the UB-                                      snowflake schema. In both approaches there is a central
    Tree can be used to accelerate almost any query                                   fact table that contains the measures and the dimension
    processing operation, if proper query processing                                  tables are situated around it. The connection between a
    algorithms are used: Relational queries or SQL                                    fact tuple and the corresponding dimension members is
    queries consist of restrictions, projections,                                     realized via foreign key relationships. In the star schema
    ordering, grouping and aggregation, and join                                      the dimension tables are completely denormalized while
    operations. In the presence of multidimensional                                   in the snowflake schema they may be normalized. Que-
    restrictions or sorting, multidimensional range                                   ries usually contain restrictions on multiple dimension
    query or Tetris algorithms efficiently process                                    tables (e.g., only sales for specific customer group and for
    these operations. In addition, these algorithms                                   a specific time period are asked) that are then used as
    also efficiently support queries that generate                                    restrictions on the usually very large fact table.
    some hierarchical restrictions (for instance by                                   In this paper we investigate, how UB-Trees and the Tet-
    following 1:n foreign key relationships). In this                                 ris algorithm in combination with our technique of mul-
    paper we investigate the impacts on query                                         tidimensional hierarchical clustering by hierarchy inter-
    processing in RDBMS when using UB-Trees and                                       leaving (MHC/HI) may be used to accelerate relational
    multidimensional hierarchical clustering for                                      query processing with a special focus on star-joins, the
    physical data organization. We illustrate the                                     most frequent operation of query processing for relational
    benefits by performance measurements of                                           data warehouses. With our technique of MHC/HI we spe-
    queries for a star schema from a real world                                       cifically cluster the data with respect to the foreign key
    application of a SAP business information                                         relationships defined by the star- or snowflake schema of
    warehouse. The performance results reported in                                    a data warehouse. We also illustrate the physical data
    this paper were measured with our prototype                                       modeling with MHC/HI for a SAP business information
    implementation of UB-Trees on top of Oracle 8.                                    warehouse and give a performance analysis for this real-
    We compare the performance of UB-Trees to                                         world application. Our performance analysis gives in-
    native query processing techniques of Oracle,                                     sight into the real world data distribution of 6 GB of sales
    namely access via an index organized table,                                       data from a fruit juice company and shows, why MHC/HI
    which essentially stores a relation in a clustered                                in combination with UB-Trees is superior to classical
    B*-Tree, and access via a full table scan of an                                   bitmap indexes, clustering B*-Trees and parallel full
    entire relation. In addition we measure the                                       table scans.
    performance of the intersection of multiple                                       Our paper is organized as follows: Section 2 surveys re-
    bitmap indexes to answer multidimensional                                         lated work, in Section 3 we present the basic concepts of
    range queries.                                                                    UB-Trees, their algorithms and MHC/HI. Section 4 pre-
                                                                                      sents the SAP business warehouse schema that we use for
                                                                                      our analysis. In Section 5 we analyze the data distribu-
                                                                                      tion of the real world data. Section 6 gives a performance
 The copyright of this paper belongs to the paper’s authors. Permission to copy
 without fee all or part of this material is granted provided that the copies are not
                                                                                      analysis and performance measurements with our proto-
 made or distributed for direct commercial advantage.                                 type implementation. Section 7 concludes our paper and
 Proceedings of the International Workshop on Design and gives an outlook on future work.
 Management of Data Warehouses (DMDW’2000)
 Stockholm, Sweden, June 5-6, 2000
 (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)
 http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/




V. Markl, R. Bayer                                                                                                                           1-1
2 Related Work                                                Z-Region [α : β ] is the space covered by an interval on
                                                              the Z-Curve and is defined by two Z-Addresses α and β.
Performance optimization has been well studied in the         We call β the region address of [α : β ].Each Z-Region
field of OLTP systems [Gra93]. Due to the completely          maps exactly onto one page on secondary storage, i.e., to
different query characteristics of OLAP applications in       one leaf page of the B-Tree.
comparison to OLTP new questions have to be addressed
here. The performance problem is heavily linked to the
physical data model.
The index selection problem for ROLAP application is
widely discussed in the research community [GHR+97,
Sar97]. Especially bitmap indexes have been proposed to
speed up ROLAP applications because of their compact-
                                                                         (a)               (b)                (c)
ness and support of star joins [CI98]. A common way of
performance improvement is the usage of materialized                               Figure 2: Z-regions
views - often in combination with indexing methods
                                                              For an 8×8 universe, i.e., s = 3 and d = 2, Figure1b
[TS97, Moe98, WB98]. Due to the large number of pos-
                                                              shows the corresponding Z-addresses. Figure2a shows
sible views a selection problem exists besides the mainte-
                                                              the Z-region [4: 20] and Figure2b shows a partitioning
nance issue [Gup97, SDN+96, SDN+98]. Clustering of
                                                              with five Z-regions [0 : 3], [4 : 20], [21: 35], [36 : 47]
OLAP data plays a key role in providing good perform-
                                                              and [48 : 63]. Assuming a page capacity of 2 points,
ance. Clustering has been well researched in the field of
                                                              Figure2c shows ten points, which create the partitioning
access methods. B-Trees, for instance, provide one-di-
                                                              of Figure2b. The details of the UB-Tree algorithms are
mensional clustering. Multidimensional clustering has
                                                              described in [Bay97a, Mar99].
been discussed in the field of multidimensional access
methods. See [GG97] and [Sam90] for excellent surveys
of almost all of these methods. [ZSL98] addresses the         3.1 Bit Interleaving
issue of hierarchical clustering for the one-dimensional      The performance of the UB-Tree crucially relies on an
case.                                                         efficient implementation of the Z-address calculation. For
Most work on applying multidimensional indexes to             tuples of positive integer numbers, bit interleaving im-
RDBMS discusses restrictions by range queries [SRF97,         mediately yields an algorithm to calculate a Z-address
NHS84, Gut84, OM84, LS90]. [JL98] accelerates range           from the binary representation of a tuple.
queries with aggregations by storing aggregated data in       It is easy to incorporate varying attribute lengths, i.e.,
R-Trees. [MRB99] and [Bay97b] are the basis of our ap-        attributes with different resolutions, into this algorithm:
proach, where joins and sorted processing of data organ-      When the limit of the resolution (i.e., the last bit) in one
ized by a multidimensional index are investigated.            attribute is reached, this attribute is not used for bit-in-
                                                              terleaving any further. The number of bits in each further
3 The UB-Tree                                                 step is reduced in this case.
                                                              For each attribute Ai we denote the number of distinct
The basic idea of the UB-Tree [Bay97a, Mar99] is to use       values of its domain by ri , i.e., ri = |Ai |
a space-filling curve to partition a multidimensional uni-    Definition 1: The number of steps for attribute Ai of a
verse. Using the Z-Curve (Figure 1a) the UB-Tree pre-         domain with cardinality1 ri is determined by its resolu-
serves the multidimensional clustering. A Z-Address α =       tion: steps(i) = log2ri
Z(x) is the ordinal number of the key attributes of a tuple
x on the Z-Curve, which can be efficiently computed by        Definition 2: The length of step k in bits (i.e., the num-
bit-interleaving [OM84]. A standard B-Tree is used to         ber of dimensions in step k) is:
index the tuples taking the linear Z-Address of the tuples            steplength(k) = |i | steps(i) ≥ k and i ∈ D}|
as keys.                                                      Figure 3 shows this generalized algorithm for bit inter-
                                  0 1 2 3 4 5 6 7             leaving.
                              0   0 1 4 5 16 17 20 21         Input: x : tuple
                                                              Output: Z-address α
                              1   2 3 6 7 18 19 22 23
                              2   8 9 12 13 24 25 28 29
                              3   10 11 14 15 26 27 30 31
                              4   32 33 36 37 48 49 52 53     for step = 1 to max({steps(rj) | j ∈ D})
                              5   34 35 38 39 50 51 54 55        for i = 1 to steplength(step)
                              6   40 41 44 45 56 57 60 61           copy bit step of xi to bit i of αstep
                              7   42 43 46 47 58 59 62 63
                                                                 end for
                (a)                       (b)                 end for
           Figure 1: Z-curve and Z-addresses                         Figure 3: Bit-Interleaving to calculate α = Z(x)
The fundamental innovation of UB-Trees is the concept         With r = max({steps(rj) | j ∈ D}) bit interleaving has a
of Z-Regions to create a disjunctive partitioning of the                                                             (
                                                              CPU-complexity of O(d⋅r) bit operations (resp. O ∑id=1 ri    )
multidimensional space. This allows for very efficient
processing of multidimensional range queries [Mar99]. A       1
                                                                  Note that we defined ri to be 2v for some v ∈ 1R



V. Markl, R. Bayer                                                                                                       1-2
for attributes of different length). The same holds for the    NATION = “USA” is mapped to an interval restriction in
inverse algorithm Z-1 that calculates the Cartesian coor-      a linear domain. With this technique, a star join on a
dinates of a tuple from its address.                           schema with d dimensions therefore creates a d-
                                                               dimensional interval restriction on the fact table which
3.2 Range Queries on UB-Trees                                  then may efficiently be processed by the UB-Tree.
                                                               In general one can imagine any foreign key relationship
To answer a range query, only those Z-regions, which           to be used for such a clustering. In the following we il-
properly intersect the query box, must be fetched from         lustrate MHC/HI by hierarchical relationships as they
the database and thus from the disk. Initially the range       usually occur in data warehousing applications. In
query algorithm calculates and retrieves the first Z-region    ROLAP hierarchies are usually modeled implicitly by a
that is overlapped by the query-box. Then the next inter-      set of attributes A1, ..., An where Ai corresponds to hierar-
secting Z-region is calculated and retrieved. This is re-      chy level i and level 1 is the root of the hierarchy.
peated until a minimal cover for the query box has been        Many attributes in relational DBMS in general and in
constructed, i.e., the region that contains the ending         data warehouses in particular have an actual domain of a
point of the query box has been retrieved.                     very small set of values. A typical example (cf. Figure 4a)
It is important to note that for any Z-region the calcula-     is the attribute REGION of the dimension table
tion of the next point intersecting the query box is per-      CUSTOMER, which has an actual domain of 8 values
formed solely in main memory. [Bay97a] describes an            (Southern Europe, Central Europe, Northern Europe,
algorithm that for a Z-region [α : β ] and a query box Q       Western Europe, North America, Latin America, Asia,
calculates the smallest Z-address γ > β that intersects Q      Australia). By MHC/HI, these long strings are replaced
in time exponential to the number of dimensions. A lin-        by numbers. In the corresponding tables small numbers
ear algorithm which is solely based on bit operations is       or bit strings are stored instead of long, space wasting
described in [Mar99] and [RMF+00].                             character strings. If all records must be retrieved that
                                                               concern Europe, only an interval [0;3] on this number is
3.3 The Tetris Algorithm                                       necessary. Otherwise every member of region must be
With the Tetris algorithm [MZB99], tables organized by         specified. The mapped numbers are called surrogates
a UB-Tree can be read in any key sort order in O(n) disk       because they replace the character strings. This data type
accesses where n is the number of pages of the table or        is called enumeration type.
the minimal number of regions covering a query box.
                                                                             REGION          f(REGION)
The Tetris algorithm is a generalization of the multidi-
mensional range query algorithm that efficiently com-                        South Europe    0
bines sort operations with the evaluation of multi-attrib-                   Middle Europe   1
ute restrictions. The basic idea is to use the partial sort                  Northern Europe 2
order imposed by a multidimensional partitioning in or-                      Western Europe 3
der to process a table in some total sort order. Essentially                 North America   4
a plane sweep over a query space defined by restrictions                     Latin America   5
on a multidimensionally partitioned table is performed.                      Asia            6
The direction of the sweep is determined by the sort at-                     Australia       7
tribute. Initially the algorithm calculates the first Z-re-
gion that is overlapped by the query box, retrieves it and                                      (a)
caches it in main memory. Then it continues to read and
cache the next Z-regions with respect to the requested                              CUSTOMER
sort order, until a complete thinnest possible slice of the
                                                               0 SouthEurope ... 4 North America ...6 Asia            7 Australia
query box (in the sorting dimension) has been read. Then
the cached tuples of this slice are sorted in main memory,
                                                                             0 Canada 1 USA          0 Wholesale 1 Retail
returned in sort order to the caller and removed from
cache. The algorithm proceeds reading the next slice,                          0 Wholesale     1 Retail         ... Kana´s Sushi Bar   ...
until all Z-regions intersecting the query box have been
processed. Only disk pages overlapping the query space                                       ... 2 Bar    ...
are accessed. With sufficient, but modest, cache memory
                                                                                       ... Joe‘s Sports Bar     ...
each disk page is accessed only once.
                                                                                                  (b)
3.4 Multidimensional Hierarchical Clustering
                                                                  Figure 4: Surrogates for REGION and the Customer
Multidimensional Hierarchical Clustering by Hierarchy                                  Hierarchy
Interleaving (MHC/HI, [MRB99]) clusters a multidimen-
sional data set on disk with respect to the hierarchical       This surrogate technique is generalized to all levels of the
relationships in multiple dimensions. This is achieved by      hierarchies. Figure 4b shows a part of the customer
creating surrogate values that ensure that a hierarchical      hierarchy with the surrogates propagated to all levels of
restriction like REGION = “North America” and                  the hierarchy.




V. Markl, R. Bayer                                                                                                                           1-3
To efficiently encode hierarchies, we introduce the con-      used to accelerate star-joins, the most frequent operation
cept of compound surrogates for hierarchies. Since we         of query processing for relational data warehouses.
require hierarchies to form a disjoint partitioning of a
dimension, a uniquely identifying compound surrogate          4.1 The Juice & More Schema
for each child node of a hierarchy member exists and can
be recursively calculated by concatenating the compound       We use the schema of the beverages supplier ‘Juice &
surrogate of the member with the running number of the        More’, a real customer of one of our project partners2. In
child node. Because only efficient bit operations are nec-    the data warehouse of ‘Juice & More’ data is organized
essary, the computation is extremely fast. The ord func-      along the following four dimensions: CUSTOMER,
tion returns the corresponding surrogate of a hierarchy       PRODUCT, DISTRIBUTION and TIME. Figure 5a
member in the specified level.                                shows the hierarchies over the dimensions (the number
The hierarchy path North America     Å     USAÅ  Retail Å     in parentheses specifies the maximum number of level
                                                              members).
Bar (cf. Figure 4b) has the compound surrogate:
ordCustomer (North America) ο ordNorth America (USA) ο             PRODUCT             CUSTOMER             DISTRIBUTION              TIME
ordUSA(Retail) ο ordRetail(Bar) = 4 ο 1 ο 1 ο 2.
The upper limit of the domain for surrogates of level i is
calculated as the maximum fan-out (number of children)             All Products        All Customer         All Distributions        All Time

of all members of level i–1 of a hierarchy H, i.e., surro-
gates(H, i) = max {cardinality(children(H, m)) where m              Type (5)             Region (8)             Sales                Year (3)
∈ level(H, i - 1)}                                                                                          Organization (5)
The compound surrogate can be interpreted as a number.
                                                                                                                                    Month (12)
In the above example – using the surrogate length of                Brand (8)            Nation (7)
                                                                                                              Distribution
Figure 4a with a length of 10 bits for the customer surro-                                                    Channel (3)
gate (3 for REGION, 3 for NATION, 1 for Trade Type
                                                                  Category (19)        Trade Type (2)
and 3 for Business Type) the hierarchy path results in the
compound surrogate 10000110102 = 538.
Usually growth expectations for a hierarchy are known             Container (10)      Business Type (7)
well in advance. Often hierarchy trees are even static.
Therefore it is possible to determine a reasonable number                                             (a)
of bits for storing each surrogate of the compound surro-
                                                                        PRODUCT
gate of a hierarchy. The overall number of bits necessary               2180 rows
                                                                                                                       DISTRIBUTION
                                                                                                                          12 rows
to store a compound surrogate is relatively small. For                  PRODKEY                                          DISTKEY
instance, a hierarchy tree with four branches on 8 levels                 TYPE                                          SALESORG
already represents 48 = 65536 partitions and is stored in                BRAND                    FACT                   CHANNEL
only 16 bits. The maximum length of the compound sur-                  CATEGORY                 26M rows                      ...
rogates for the ‘Juice & More’ schema that we used for                 CONTAINER                PRODKEY

our analysis can be computed from the maximum fan-out                       ...                 CUSTKEY

of the hierarchy levels given in Figure 4a.                                                      DISTKEY

                                                                        CUSTOMER                 TIMEKEY
This very compact fixed length encoding preserves the                    7064 rows
                                                                                                  SALES
lexicographic order on the hierarchy levels. Thus, point                CUSTKEY                 DISTCOST
restrictions on upper hierarchy levels result in range re-               REGION                       ...                      TIME
strictions on the finest granularity of a hierarchy. For                 NATION                                               36 rows

instance, the point restriction NATION = “USA” on the                  TRADE-TYPE                                            TIMEKEY

second level of the CUSTOMER hierarchy with f(“North                  BUSINESS-TYPE                                           YEAR
                                                                            ...                                              MONTH
America”) = 4 = 1002 and f(“USA”) = 1 = 0012 maps to
the range restriction cscustomer between 528 =                                                     (b)
10000100002 and 543 = 10000111112. Thus, a star join
with this surrogate encoding for the foreign keys of a fact       Figure 5: Hierarchies in the ‘Juice & More’ schema and
table results in a range restriction on each compound                         the corresponding star schema
surrogate, if some hierarchy level of each dimension is       The ROLAP data model for the ‘Juice & More’ schema
restricted to a point. In the same way intervals on the       (Figure 5b) is a typical star schema with one fact table
children of one hierarchy level result in a range of the      FACT and a table for each of the 4 dimensions. Let
corresponding compound surrogates (e.g., YEAR = 1998          ‘SALES’ and ‘DISTCOST’ be some of the measures in
and MONTH between April and June). A star join on a           the fact table. We used the methodologies of surrogates
schema with d dimensions therefore creates a d-dimen-         and multidimensional hierarchical clustering as
sional interval restriction on the fact table.                described in Section 3.4 for clustering the fact table of
                                                              ‘Juice & More’ with UB-Trees.
4 Schema and Queries of ‘Juice & More’
In this section we investigate, how UB-Trees and the          2
                                                               The company and the data presented here have been
Tetris algorithm in combination with MHC/HI may be            made anonymous.



V. Markl, R. Bayer                                                                                                                               1-4
In the following we describe some example queries            SELECT SALESORG, CHANNEL, SUM(DistCost)
involving star joins for the ‘Juice & More’ schema. These    FROM    Fact F, Distribution D, Product P
queries were taken from the decision support system of       WHERE   F.DistKey = D.DistKey AND
‘Juice & More’. Thus next to the investigation of                    F.ProductKey = P.ProductKey AND
multidimensional hierarchical clustering this section is             P.Type = X
interesting from a second point of view: The highly          GROUP BY D.SalesOrg,D.Channel
skewed data distribution of the ‘Juice & More’ will prove                 Figure 7: Distribution cost (Q2)
that UB-Trees, the Tetris algorithm and MHC/HI are not
only applicable to laboratory environment tests with         Query 3 (Q3, cf. Figure 8) restricts all dimensions on the
generated data, but also prove their efficiency in the       first level of the hierarchies. This query also uses the
practical application scenarios. In order to show the        Tetris algorithm on MHC/HI compound surrogates.
highly skewed data distribution we included an entire        SELECT SUM(SALES)
section displaying the one-dimensional data distribution     FROM Fact F, Distribution D, Product P,
of ‘Juice & More’ for every dimension.                               Customer C, Time T
We then present measurements performed with our              WHERE   F.DistKey = D.DistKey AND
prototype implementation of the UB-Tree on top of                    F.TimeKey = T. TimeKey AND
Oracle 8. For the evaluation of our clustering technique             F.CustKey = C.CustKey AND
we defined a benchmark with 36 queries. In comparison                F.ProdKey = P.ProdKey AND
we also conducted measurements with native Oracle 8                  P.Type = t AND D.SalesOrg = s AND
access methods: parallel full table scan (FTS) and bitmap            T.Year = y AND C.Region = r
indexes (BII). For these measurements we used a              Figure 8: Partial match query in first hierarchy level (Q3)
completely denormalized fact table, that is, no additional
joins next to the star join had to be performed to answer
the queries. The bitmap indexes were created on each         5 Data Distributions of the ‘Juice & More’
hierarchy level. We did not include secondary indexes in     Fact Table
our comparison measurements because earlier
experiments showed that they are neither competitive to      The data of ‘Juice & More’ is real world data; the data
the UB-Tree nor to FTS or BII [MZB99].                       distribution of both the fact table and the dimension ta-
                                                             bles is highly skewed: The dimensions are neither dis-
4.2     Queries on the ‘Juice & More’ Schema                 tributed uniformly nor are independent. The original fact
                                                             table consisted of 823.464 tuples (about 175 MB). To get
In the following we present typical queries that are taken
from the ‘Juice & More’ SAP business information             a realistic large data cube, the fact table was enlarged to
warehouse. We will use these queries to illustrate our       26.350.848 tuples (about 6 GB). Our project partner im-
approach and we will present performance measurements        plemented an augmentation algorithm with minimal im-
for exactly these queries in Section 6.2.                    pact on the data distribution (see [Pie98]).
Query 1 (Q1, cf. Figure 6) computes the sales for a given    In the following we show some charts that describe the
product group (TYPE and BRAND specified as (X1,              one-dimensional data distribution for each dimension.
X2)) and a given customer group (NATION and                  However, we once more stress that the dimensions are
REGION specified as (Y1, Y2)) for the months from            not independent (e.g., some customers always order the
October to December of 1993. This query uses the UB-         same subset of products, some customers or products only
Tree range query algorithm on MHC/HI surrogates.             exist for a certain time, etc.). Thus in general, the overall
However, surrogates are only used for clustering and
                                                             selectivity of a query restricting several dimensions is not
indexing and thus are transparent to the user.
                                                             the product of the selectivities of the one-dimensional
SELECT SUM(Sales)                                            restrictions (which is shown in the following charts). We
FROM Fact F, Customer C, Product P, Time T                   will see this deviation in Section 6.
WHERE  F.ProdKey = P.ProdKey AND                             The fact table of ‘Juice & More’ stores several measures
    F.CustKey = C.CustKey AND
                                                             (e.g., distribution cost, sales) aggregated on a daily basis
    P.Type = X1 AND P.Brand = X2 AND
                                                             with respect to the dimensions time, customer, product
    C.Region = Y1 AND C.Nation = Y2 AND
    F.TimeKey = T.TimeKey AND T.Year = 1993 AND
                                                             and distribution. Since the data is sensitive real-world
    T.Month >= October AND T.Month <= December
                                                             business data, it is not possible to show the labels/names
              Figure 6: Time Interval (Q1)                   of the hierarchy members in the charts.
Query 2 (Q2, cf. Figure 7) calculates the cost of            5.1      The Time Dimension
distribution of the products of type X for each
distribution channel. This query uses the Tetris algorithm   The time dimension of ‘Juice & More’ consists of a two-
on MHC/HI surrogates in order to efficiently calculate       level hierarchy of months and years. The test data stored
aggregations.                                                the years from 1993 to 1995. As shown in Figure 5a, the
                                                             days of the time dimension are organized by a two level




V. Markl, R. Bayer                                                                                                    1-5
hierarchy (year and month). Figure 9 shows the cumu-                                                                                                                                                                                                                                                                                                                            Multidimensional hierarchical clustering ensures that a
lated data distribution of the fact table with respect to the                                                                                                                                                                                                                                                                                                                   restriction in the first hierarchy level will result in a 1%,
time dimension grouped by year and month. The hori-                                                                                                                                                                                                                                                                                                                             27%, 6%, 32% respectively 34% reduction of I/Os which
zontal axis displays the hierarchy members, with “All” at                                                                                                                                                                                                                                                                                                                       are necessary to retrieve the result set. Without exactly
the very bottom (i.e., the lowest level), the years 1993,                                                                                                                                                                                                                                                                                                                       showing the relationship between the hierarchy members,
1994, and 1995 in the middle and the twelve months for                                                                                                                                                                                                                                                                                                                          we show the skew of the data distribution over the first
each year above the year. The arrows in the horizontal                                                                                                                                                                                                                                                                                                                          four product levels in Figure 11.
axis indicate the relationship between the members of
neighboring hierarchy levels.                                                                                                                                                                                                                                                                                                                                                       4%
Thus the fact table of ‘Juice & More’ is almost uniformly
distributed with respect to the time dimension. The
                                                                                                                                                                                                                                                                                                                                                                                    3%
minimum number of facts for one month is 2,70% of the
fact table (in January 1993, December 1993 and January
1995), whereas the maximum number of facts per month                                                                                                                                                                                                                                                                                                                                2%
is 2,83% (in March of each of the three years). Thus with
a multidimensional clustering using 5 bits for time, re-                                                                                                                                                                                                                                                                                                                            1%
strictions to one month in the time dimension (=1/36)
can be expected to reduce the amount of data to around
                                                                                                                                                                                                                                                                                                                                                                                    0%
1/25 = 1/32.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   product
                                                                                                                                                                                                                                                                                                                                                                                               Figure 11: Product – first four hierachy levels
                                                                                                                                                    2,83%




                                                                                                                                                                                                                                                                            2,83%
                            2,83%




                                                                                                                                                                                                                                                                                                          2,82%
                                                          2,82%




                                                                                                                                                                        2,81%
                                                                                                                                                                                  2,82%




                                                                                                                                                                                                                                                                                                2,82%
                                                2,81%




                                                                                                                      2,81%




                                                                                                                                                                                                                                              2,81%




                                                                                                                                                                                                                                                                                                                                                                      2,81%
                                                                                                            2,80%




                                                                                                                                                                                                                                    2,80%




                                                                                                                                                                                                                                                                                                                                                            2,80%
                                                                                                                                                                                                                                                                                      2,79%




 3,0%
                                      2,79%




                                                                                                                                                              2,79%
                                                                    2,78%




                                                                                                                                                                                            2,78%




                                                                                                                                                                                                                                                                                                                    2,78%
                                                                                                                                                                                                                                                                                                                              2,77%
                                                                              2,77%




                                                                                                                                                                                                      2,77%
                                                                                                  2,75%




                                                                                                                                                                                                                          2,75%




                                                                                                                                                                                                                                                                  2,75%




                                                                                                                                                                                                                                                                                                                                                  2,75%
                  2,75%




                                                                                                                                          2,75%




                                                                                                                                                                                                                                                                                                                                        2,72%
                                                                                        2,71%




                                                                                                                                                                                                                2,71%




                                                                                                                                                                                                                                                        2,70%
        2,70%




                                                                                                                                2,70%




 2,5%

                                                                                                                                                                                                                                                                                                                                                                                5.2                            The Customer Dimension
 2,0%
                                                                                                                                                                                                                                                                                                                                                                                According to business administration literature 20% of
 1,5%
                                                                                                                                                                                                                                                                                                                                                                                the customers contribute to 80% of the business. The
                                                                                                                                                                                                                                                                                                                                                                                customer dimension of ‘Juice & More’ is a typical exam-
 1,0%
                                                                                                                                                                                                                                                                                                                                                                                ple for a classification of customers in such a company. A
                                                                                                                                                                                                                                                                                                                                                                                high number of customers (in this case 88%, the very left
 0,5%                                                                                                                                                                                                                                                                                                                                                                           entry of Figure 12) are not classified (maybe they are not
                                                                                                                                                                                                                                                                                                                                                                                interesting for the company because of small turnover or
 0,0%                                                                                                                                                                                                                                                                                                                                                                           it is not possible to find a classification). The classified
        1993001
                  1993002
                            1993003
                                      1993004
                                                1993005
                                                          1993006
                                                                    1993007
                                                                              1993008
                                                                                        1993009
                                                                                                  1993010
                                                                                                            1993011
                                                                                                                      1993012
                                                                                                                                1994001
                                                                                                                                          1994002
                                                                                                                                                    1994003
                                                                                                                                                              1994004
                                                                                                                                                                        1994005
                                                                                                                                                                                  1994006
                                                                                                                                                                                            1994007
                                                                                                                                                                                                      1994008
                                                                                                                                                                                                                1994009
                                                                                                                                                                                                                          1994010
                                                                                                                                                                                                                                    1994011
                                                                                                                                                                                                                                              1994012
                                                                                                                                                                                                                                                        1995001
                                                                                                                                                                                                                                                                  1995002
                                                                                                                                                                                                                                                                            1995003
                                                                                                                                                                                                                                                                                      1995004
                                                                                                                                                                                                                                                                                                1995005
                                                                                                                                                                                                                                                                                                          1995006
                                                                                                                                                                                                                                                                                                                    1995007
                                                                                                                                                                                                                                                                                                                              1995008
                                                                                                                                                                                                                                                                                                                                        1995009
                                                                                                                                                                                                                                                                                                                                                  1995010
                                                                                                                                                                                                                                                                                                                                                            1995011
                                                                                                                                                                                                                                                                                                                                                                      1995012




                                                                                                                                                                                                                                                                                                                                                                                customer groups contain 0% to 3% of the tuples stored in
                                                                                                                                                                                                                                                                                                                                                                                the table.3 The hierarchical relationships on the horizon-
        Figure 9: Data Distribution in the Time Dimension
                                                                                                                                                                                                                                                                                                                                                                                tal axis show that the hierarchy of the customer dimen-
                                                                                                                                                                                                                                                                                                                                                                                sion is not balanced, since several hierarchy members
5.2                                        The Product Dimension                                                                                                                                                                                                                                                                                                                just have one child.
The top level of the hierarchy on product has five entries.                                                                                                                                                                                                                                                                                                                         3,0%
                                                                                                                                                                                                                                                                                                                                                                                           87,33%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               2,96%
The data distribution is quite skewed, there are three                                                                                                                                                                                                                                                                                                                              2,5%

product groups to which 93% of all tuples of the fact ta-
                                                                                                                                                                                                                                                                                                                                                                                                                                                               1,72%




                                                                                                                                                                                                                                                                                                                                                                                    2,0%
                                                                                                                                                                                                                                                                                                                                                                                                                                       1,64%




ble belong. 1% of the data is unclassified. The distribu-
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       1,27%




tion of the first level of the product hierarchy is illus-                                                                                                                                                                                                                                                                                                                          1,5%


trated in Figure 10.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,77%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,76%
                                                                                                                                                                                                                                                                                                                                                                                                                               0,73%




                                                                                                                                                                                                                                                                                                                                                                                    1,0%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,55%
                                                                                                                                                                                                                                                                                                                                                                                                               0,28%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,27%


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,21%




                                                                                                                                                                                                                                                                                                                                                                                    0,5%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,13%
                                                                                                                                                                                                                                                                                                                                                                                               0,14%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,13%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,16%
                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,15%




                                                                                                                         1%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,08%
                                                                                                                                                                                                                                                                                                                                                                                                       0,12%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,09%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,06%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,05%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,06%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,07%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,06%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,07%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,02%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,04%
                                                                                                                                                                                                                                                                                                                                                                                                                       0,04%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                       0,03%




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               0,02%
                                                                                                                                                                                                                                                                                                                                                                                                                                               0,00%




                                                                                                                                                                                                                                                                                                                                                                                    0,0%



                                                                                                                                                                                                                                    unclassified
                                                                    34%                                                                   27%
                                                                                                                                                                                                                                    910
                                                                                                                                                                                                                                    912
                                                                                                                                                              6%                                                                    920
                                                                                                                                                                                                                                    922
                                                                                                                                                                                                                                                                                                                                                                                           Figure 12: Customer – first four hierachy levels
                                                                                                            32%


                                                                                                                                                                                                                                                                                                                                                                                3
                                                                                                                                                                                                                                                                                                                                                                                  Note that the data in the ‘Juice & More’ warehouse is
                                           Figure 10: Product – first hierachy level                                                                                                                                                                                                                                                                                            aggregated on a daily basis, thus the amount of data is
                                                                                                                                                                                                                                                                                                                                                                                usually compressed for large customers, thus the
                                                                                                                                                                                                                                                                                                                                                                                proportion of large customers is reduced.



V. Markl, R. Bayer                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     1-6
The consequence of the small number of classified cus-                                                                              6.1                  Performance Analysis
tomers is that queries the restriction CUSTOMER will
result in a selectivity of at most 3% and therefore the                                                                             Figure 15 shows the compound surrogates for the ‘Juice
overall result set will be small.                                                                                                   & More’ data warehouse, which are calculated as fixed
                                                                                                                                    length compound surrogates in [MRB99]. For any of the
                                                                                                                                    4 hierarchies the length of the compound surrogate does
5.2        The Distribution Dimension                                                                                               not exceed 15 bits and thus can be stored in a single inte-
                                                                                                                                    ger value. These compound surrogates are used as attrib-
There are seven entries on the first level of the distribu-                                                                         utes for each of the four dimensions of ‘Juice & More’ to
tion hierarchy. The data distribution of the fact table with                                                                        calculate the Z-address for each tuple of the ‘Juice &
respect to the distribution dimension is highly skewed.                                                                             More’ fact table.
Figure 13 shows the distribution of the first hierarchy
level (i.e., sales organization), while Figure 14 shows the
distribution of facts in the fact table for each distribution                                                                                         14243 14243 142 4 434 14243
                                                                                                                                    cs product = p15 p14 p13 p12 p11 p10 p9 p8 p 7 p 6 p5 p 4 p3 p 2 p1
                                                                                                                                                                         level 1                         level 2                       level 3                            level 4


                                                                                                                                                       123 123 123 123
channel of each sales organization. Again the arrows
indicate the hierarchical relationship of the members of                                                                            cs customer      =c c c c c c c c c c
                                                                                                                                                                     10 9 8                 7 6 5                  4               3 2 1



                                                                                                                                                    {123
neighboring hierarchy levels with the hierarchy root                                                                                                                   level1               level 2             level 3            level 4
“All” at the bottom of the Figure.                                                                                                  cs time        =t t t t t t
                                                                                                                                                           6 5 4 3 2 1
                                                                                                                                                         level1          level 2


                                                           8% 13%
                                                                                                                                    cs distribution
                                                                                                                                                         123 123
                                                                                                                                                       =d d d d d          5
                                                                                                                                                                            level 1
                                                                                                                                                                                   4      3          2
                                                                                                                                                                                                 level 2
                                                                                                                                                                                                            1


                                            16%                                 13%                                                   Figure 15: Compound surrogates for each dimension of
                                                                                                                                                       ‘Juice & More’
                                                                                 7%
                                                                                                                                    The UB-Tree for the ‘Juice & More’ fact table consists of
                                                        37%                     6%
                                                                                                                                    P = 878362 pages, which corresponds to: l = log2 P =
                                                                                                                                    log2878362 = 19,7 bits that are in any case necessary to
                                                                                                                                    address the bit-interleaved multidimensional space. With
          1                  2                3              4                  5                 6               8                 bit interleaving in the order of dimensions PRODUCT,
                                                                                                                                    CUSTOMER, TIME, and DISTRIBUTION the Z-address
         Figure 13: Distribution – first hierachy level                                                                             α for a tuple of the ‘Juice & More’ fact table is calcu-
                                                                                                                                    lated as:
                                                                                                                                          1444444442444444443 { 14444444244444443
           10,30%




                                  10,30%




 12,5%                                                                                                                              α = p15 c10 t 6 d 5 p14 c9 t 5 d 4 p13 c8t 4 d 3 p12 c7 t 3 d 2 p11c 6 t 2 d 1 p10 c5t 1 p 9 c 4 p8 c 3 p 7 c 2 p 6 c1 p5 p 4 p 3 p 2 p1
                                                                                     12,50%

                                                                                               12,50%

                                                                                                         12,50%
                                                                       12,50%




                                                                                                                                                       completely partitione d for any data distributi on           partly split        partitione d depending on the data distribution
                                                                                                                            8,45%




 10,0%                                                                                                                              The first 19 bits of the Z-address are guaranteed to be
                                                                                                                                    used to partition the four dimensional universe of the
                                                     6,66%

                                                              5,84%




  7,5%                                                                                                                              ‘Juice & More’ fact table. This means that the binary
                                                                                                                   4,05%




  5,0%
                                                                                                                                    strings p15p14p13p12p11 of the compound surrogate of
                                                                                                                                    product, c10c9c8c7c6 of customer, t6t5t4t3t2 of time and
                     2,20%




                                            2,20%




  2,5%                                                                                                                              d5d4d3d2 of distribution are used to partition the universe.
                                                                                                                                    For each of the four dimensions the first hierarchy level
  0,0%                                                                                                                              is completely used for the partitioning. The second hier-
          09 1
                    10 1
                                 09 2
                                           10 2
                                                    12 3
                                                             11 4
                                                                      13 5
                                                                                    14 5
                                                                                              15 5
                                                                                                        16 6
                                                                                                                  17 6
                                                                                                                           19 8




                                                                                                                                    archy level is used to a large extent to partition the uni-
                                                                                                                                    verse. Therefore a restriction in the first hierarchy level
      Figure 14: Distribution – first two hierachy levels                                                                           will result in a reduction of the number of pages as de-
                                                                                                                                    termined by the data distribution of Section 5, i.e., a re-
                                                                                                                                    striction of the product main group to “910” will reduce
6 Performance Analysis of MHC/HI for                                                                                                the number of pages to be retrieved to 27%, a restriction
‘Juice & More’                                                                                                                      to “912” will result in a reduction to 6,41%. This holds
                                                                                                                                    for the restriction of the first hierarchy level in any di-
In this section we first analyze the performance of
                                                                                                                                    mension. If the top hierarchy level is restricted in several
MHC/HI on UB-Trees analytically and then, taking the
                                                                                                                                    dimensions and the independence assumptions holds for
data distributions and query selectivities into account,
                                                                                                                                    these dimensions, the reduction is multiplicative. Figure
verify our expectations by comparing the number of re-
                                                                                                                                    16 shows the predicted selectivity calculated as the prod-
trieved pages with the product of the selectivities over all
                                                                                                                                    uct of the selectivity in each dimension, the actual selec-
dimensions. We then also give response times which
                                                                                                                                    tivity and the loaded pages in percent of the entire pages
show the superiority of UB-Trees and MHC/HI over
                                                                                                                                    in the database for two queries, which restrict the first
traditional    indexes    even    for     our     prototype
                                                                                                                                    hierarchy level of three out of four dimensions.
implementation.




V. Markl, R. Bayer                                                                                                                                                                                                                                                                        1-7
 scust- spro      sdistri-   stime Pages    Predic-    Actual loaded     sion in average 99,99% of the tuples on the pages con-
 omer     duc     bution     in    loaded   ted        selecti- pages    tributed to the result set. A standard deviation of less
 in       in      in %       %              selecti-   vity in in %      than 0,001 for these measurements means that the multi-
 %        %t                                vity in    %                 dimensional hierarchical clustering is perfect for multi-
                                            %                            dimensional restrictions in the first hierarchy level.
                                                                         When additionally restricting the second hierarchy level
 0,78 6,4         37,5       100 178        0,0182     0,0199   0,0207   in average 557 pages were loaded, where 10,7% of the
                                                                         tuples did not contribute to the result set. The standard
                                                                         deviation here was 0,06. Additionally restricting the third
 87,3 6,4         37,5       100 18996      2,0981     2,1618   2,1627   hierarchy level of each dimension usually created result
                                                                         sets with only one page.
 Figure 16: Restrictions in the first hierarchy level in 3 of            Thus multidimensional hierarchical restrictions are very
                       4 dimensions                                      well processed by UB-Trees storing compound surrogates
                                                                         which are created by the multidimensional hierarchical
However, the data is not independently distributed in the                clustering technique. For star schemas as used in present
entire 4-dimensional universe of ‘Juice & More’. In this                 data warehousing applications this approach significantly
case the predicted selectivity does not describe the actual              speeds up query performance and reduce resource re-
selectivity anymore. Thus some bits of the first 19 bits are             quirements in disk space and processing time.
correlated. This means that not all combinations of these
bits occur and some partitioning will take places in the                 6.2                       Performance Measurements
bits below bit number 19 of the Z-address. In this case
the second level may already be completely partitioned                   The measurements for ‘Juice & More’ were performed on
and even a third hierarchy level partitioning may have                   a SUN Enterprise with four 300 MHz UltraSPARC proc-
started for some dimensions. A typical part of the multi-                essors and 2 GB RAM under Solaris 2.6. As secondary
dimensional space where this will happen is the customer                 storage a partition on a SPARCstorage array with RAID
hierarchy “unclassified”, which stores 87,34% of the                     level 0 (6 disks striping, 5-6 MB/s transfer rate per disk)
customers. At most the three bits c10c9c8 of the customer                was used. All measurements were done in a single-user
hierarchy are needed to distinguish these customers from                 environment.
all other customers. Thus for the unspecified customers                  It is important to note that our implementation still
the bits c7c6 of the first 19 bits of the Z-address are cor-             causes significant overhead due to the fact that we have
related to c10c9c8 and two further bits may be used for                  implemented the UB-Tree on top of a DBMS and not in
partitioning. Thus d1 will be split completely, p10 will be              the kernel itself. First, the number of SQL statements that
used for the partitioning and t1 will be partly split (c5 is             have to be processed (UB: 1 statement for each page in
also correlated to c10c9c8). Our measurements show that                  the result set, Oracle 8 methods: 1 statement in total)
this effect also holds for other dimensions. Figure 17                   leads to extensive inter-process communication (about
shows queries where the first two hierarchy levels of                    30% of the total processing time) and DBMS overhead
CUSTOMER, PRODUCT and TIME are restricted,                               (e.g., parsing of statements). Second, our table is larger
whereas the DISTRIBUTION dimension is not restricted.                    than the one for the FTS and the bitmap indexes due to
The selectivity predicted by the cost functions here differs             unimplemented compressing techniques in the UB-Tree
from the actual selectivity of the query because of de-                  (for 8 KB pages: UB: 878362 pages, FTS: 723539 pages,
pendencies in the data distribution. However, the per-                   BII: FTS+31134 pages).
centage of pages loaded is similar to the actual selectivity             Figure 18 shows result set sizes and response times of the
of each query.                                                           three example queries (Section 4.2). Q1 shows that the
                                                                         UB-Tree with multidimensional clustering is over 2
scusto-    spro-      sdis- stime    Pa- predicted actual       loaded   times faster than BII even for very small result sets. Q3
mer in     duct in    tribu- in %    ges selecti-   selec-      pages    which is processed by the unoptimized UB-Tree at least
%          %          tion           loa- vity in % tivity      in %     10 times faster than with any other access method un-
                      in             ded            in %                 dermines this observation.
                      %

2,95       3,64       100 38,4       312 0,0414        0,0310   0,0355
                                                                                             300
                                                                                                              285          232         275
                                                                         Reponse Time in s




2,95       27,99 100 38,4            782 0,0318        0,0851   0,0890
                                                                                             200
                                                                                                                         186
 Figure 17: Restriction in the first two hierarchy levels in
                     3 of 4 dimensions                                                       100
                                                                                                                                                      FTS
                                                                                                          5         118               10          Bitmap
Each of the 878362 pages of the ‘Juice & More’ fact table
                                                                                                      2                           1          UB
stores 30 tuples. All of the measurements showed that                                          0
                                                                                                     Q1             Q2           Q3
when restricting the first hierarchy level in each dimen-




V. Markl, R. Bayer                                                                                                                                          1-8
                                                                                                      19 lists the number of hierarchy levels that by each query
                                                      Percentage                                      are restricted to a point for each dimension. Since the
                            Query       Loaded Tuples of Database                                     data is non-uniformly distributed, the selectivity of each
                            Q1                   8160       0,03%                                     query depends on the exact point restriction, not only on
                            Q2                1696416       6,52%                                     the number of restricted hierarchy levels. We thus
                            Q3                  19752       0,08%                                     present two instances of the queries, each of which has a
          Figure 18: Query response times and result set sizes                                        different selectivity.
                                                                                                                          500
The result set of query Q2 is quite large but the almost                                                                                                                                 472

perfect clustering factor of the UB-Tree (in average more
than 29 out of 30 tuples/page belong to the result set) still                                                             400
leads to a speed up of more than 30 % in comparison to
BII. The time for FTS for Q2 differs from the times for                                                                                               313
                                                                                                                                        287
Q1 and Q3 due to the less complex WHERE clause of the                                                                                                              283                         288




                                                                                                        time in seconds
                                                                                                                          300                                                      274
                                                                                                                                                                                                       UB
statement. The number of comparison operations is                                                                                                                                                242
                                                                                                                                                                                                       BM
therefore much smaller than for the other queries, which                                                                                                                                               FTS
causes the faster execution.                                                                                              200
                                                                                                                                                             171
All these results on real data show how well the multidi-
mensional hierarchical clustering with UB-Trees works
                                                                                                                          100
in practice and the accuracy of our theoretical cost model                                                                                    69 78
                                                                                                                                                                         56
                                                                                                                                                            39                44
[MZB99]. In total more than 77% of all benchmark que-
                                                                                                                                  3 6
ries (28 out of 36) showed a speed up between a factor of                                                                   0
1.3 and 10 over traditional techniques.                                                                                         Q4 (0,3176%) Q5 (3,4383%) Q6 (2,0981%) Q7 (1,1346%) Q8 (34,000%)


                     Cust- Pro- Distr-                      Time         Select-         Select-                                Figure 21: Performance Q4-Q8 (instance 2)
                     omer duct ibution                                   ivity           ivity
                                                                         Instan-         Instan-      7 Conclusions and Future Work
                                                                         ce 1 in         ce 2 in      We have presented multidimensional hierarchical clus-
                                                                         %               %            tering by hierarchy interleaving (MHC/HI) as a physical
                                                                                                      data modeling technique for OLAP applications in rela-
Q4                   2            2         0               1            0,0414 0,3176                tional databases. Our performance measurements have
                                                                                                      shown that MHC/HI in combination with UB-Trees and
Q5                   1            1         1               1            0,0070 3,4383
                                                                                                      the Tetris algorithm significantly reduces the number of
Q6                   1            1         1               0            0,0182 2,0981                random accesses to the fact table for star joins and other
                                                                                                      queries with restrictions in multiple hierarchies. This
Q7                   1            0         0               1            1,1346 1,1346                results in considerably lower response times for typical
Q8                   0            1         0               1            6,0000 34,000                data warehousing queries with star joins. For a 6 GB re-
                                                                                                      tail data SAP business information warehouse, our pro-
Figure 19: Restricted hierarchies and selectivities for five                                          totype implementation of MHC/HI accelerates the proc-
   queries against the ‘Juice & More’ data warehouse                                                  essing of star-join queries more than a factor of two com-
                    300
                                                                                                      pared to bitmap indexes, clustering B*-Trees or parallel
                               284,53
                                            277,01
                                                         269,05         274,44                        full table scans. For dimensionalities typical for data
                                                                                                      warehousing, only I/O-time linear in size of the result set
                    250                                                                231,65         prior to aggregation and sublinear temporary storage are
                                                                                                      necessary to aggregate parts of a fact table of a star or
                    200                                                            186                snowflake schema. Thus secondary storage space and
  Time in Seconds




                                                                                                UB
                                                                                                      pre-computation time for many aggregates and bitmap
                    150                                                                         BM    indexes can be avoided. In addition the widely discussed
                                                                                 118            FTS   view maintenance problem is minimized. Depending on
                    100
                                                                                                      the query, temporary storage requirements for sorting are
                                                                                                      reduced by several orders of magnitude. Our clustering
                                                                   56                                 approach also holds not only for ROLAP but also for
                                                                        44
                    50
                                                                                                      MOLAP implementations of a data warehouse since both
                            2 5          1 5          1 4                                             ROLAP fact tables and MOLAP data cubes can be clus-
                      0                                                                               tered in this way.
                          Q4 (0,0414%) Q5 (0,0070%) Q6 (0,0182%) Q7 (1,1346%) Q8 (6,0000%)
                                                                                                      Our future work includes deriving a set of cost based de-
                          Figure 20: Performance Q4-Q8 (instance 1)                                   cision rules for how to – possibly multidimensionally -
                                                                                                      index a relation for a given set of queries. We also intend
Figures 20 and 21 list two instances for each of five                                                 to apply our model to cost based query optimization and
further queries of that benchmark. For each query Figure                                              combine the model with multidimensional histograms in



V. Markl, R. Bayer                                                                                                                                                                                     1-9
order to take dependencies and correlations between the    [MRB99]V. Markl, F. Ramsak, and R. Bayer. Improving
attributes into account.                                           OLAP Performance by Multidimensional
                                                                   Hierarchical Clustering. Proc. of IDEAS’99,
8 References                                                       1999.
                                                           [MZB99] V. Markl, M. Zirkel, and R. Bayer. Processing
[Bay97a] R. Bayer. The universal B-Tree for                        Operations with Restrictions in Relational
        multidimensional Indexing: General Concepts.               Database Management Systems without external
        WWCA ’97, Tsukuba, Japan, p- 10-11,                        Sorting. ICDE, 1999.
        Springer Verlag, March, 1997.                      [NHS84] J. Nievergelt, H. Hinterberger, and K. C.
[Bay97b] R. Bayer. UB-Trees and UB-Cache – A new                   Sevcik. The Grid-File. ACM TODS, 9(1),
        Processing Paradigm for Database Systems.                  March 1984, pp. 38-71.
        Technical Report TUM-I9722, TU München,            [OM84] J. A. Orenstein and T.H. Merret. A Class of
        1997.                                                      Data Structures for Associate Searching.
[CD97] S. Chaudhuri and U. Dayal. An Overview of                   SIGMOD-PODS Conf., Portland, Oregon, 1984,
        Data Warehousing and OLAP Technologies.                    pp. 294-305.
        SIGMOD Rec. 26(1), 1997.                           [OQ97] P. O´Neill and D. Quass. Improved Query
[CI98] C. Chan and Y. Ioannidis. Bitmap Index Design               Performance with Variant Indexes. ACM
        and Evaluation. SIGMOD, 1998.                              SIGMOD Intl. Conf. On Management of Data,
[Dat88] C.J. Date. A Guide to the SQL Standard,2nd                 Tucson, Arizona, pp. 38-49,1997.
        edition. Addison-Wesley, 1988.                     [Pie98] R. Pieringer. Evaluation of the UB-Tree in the
[GG97] V. Gaede and O. Günther. Multidimensional                   SAP Environment.           Master Thesis, TU
        Access Methods. ACM Computing Surveys                      München, 1998.
        30(2), 1997.                                       [RMF+00] F. Ramsak, V. Markl, R. Fenk et al.
[GHR+97] H. Gupta, V. Harinarayan, A. Rajaraman, and               Integrating the UB-Tree into a data-basesystem
        D. Ullman. Index Selection for OLAP. ICDE,                 kernel. VLDB2000, Cairo, 2000.
        1997.                                              [Sar97] S. Sarawagi. Indexing OLAP data. Data Eng.
[GR97] J. Gray and A. Reuter. Transaction Processing:              Bulletin 20 (1), pp. 36-43, 1997.
        Concepts and Techniques. Morgan Kaufmann           [Sam90] H. Samet. The Design and Analysis of Spatial
        Publishers, 1997.                                          Data Structures. Addison Wesley, 1990
[Gra93] G. Graefe. Query Evaluation Techniques for         [SDN+96] A. Shukla, P. Deshpande, J. Naughton, and K.
        Large Databases. ACM Computing Surveys 25,                 Ramasamy.         Storage     Estimation    for
        1993, pp. 73-170.                                          Multidimensional Aggregates in the Presence of
[Gup97] H. Gupta. Selection of Views to Materialize in a           Hierarchies. VLDB Conference. 1996.
        Data Warehouse. Proc. of the Intl. Conference      [SDN+98] A. Shukla, P. Deshpande, and J. Naughton.
        on Database Theory, Athens, Greece, January                Materialized        View       Selection    for
        1997.                                                      Multidimensional Datasets. SIGMOD Conf.,
[Gut84] A. Guttman. R-Trees: A dynamic Index                       1998.
        Structure for spatial Searching. Proc. of ACM      [TS97] D. Theodoratos and T. K. Sellis. Data
        SIGMOD Conf., 1984, pp. 47-57.                             Warehouse Configuration. VLDB, 1997, p. 126-
[HAM+97] C.T. Ho, R. Agrawal, N. Megiddo, and R.                   135.
        Srikant. Range Queries in OLAP Data Cubes.         [SRF97] T. K. Sellis, N. Roussopoulos, C. Faloutsos:
        SIGMOD Conf., 1997, pp. 73-88.                             Multidimensional Access Methods: Trees Have
[JL98] Jürgens, M.; Lenz, H.J.: The R*a-Tree: An                   Grown Everywhere. VLDB, 1997, P. 13-14.
        improved R*-Tree with Materialized Data for        [WB98] M.C.Wu and A.P. Buchmann. Encoded Bitmap
        Supporting Range Queries on OLAP-Data,                     Indexing for Data Warehouses. ICDE, Orlando,
        DWDOT Workshop, Vienna, 1998.                              1998.
[Kim96] R. Kimball. The Data Warehouse Toolkit. John
        Wiley & Sons, New York. 1996.
[LS90] Lomet, D.; Salzberg, B.: The hB-Tree: A
        Multiattribute Indexing Method with good
        guaranteed Performance. ACM TODS, 15(4),
        1990, pp. 625 – 658.
[Mar99] V. Markl. MISTRAL: Processing Relational
        Queries using a Multidimensional Access
        Method. Ph.D. Thesis, TU München, 1999.
[Moe98] G. Moerkotte. Small Materialized Aggregates:
        A Light Weight Index Structure for Data
        Warehousing. VLDB Conference. New York,
        USA, 1998.




V. Markl, R. Bayer                                                                                           1-10