=Paper=
{{Paper
|id=Vol-28/paper-1
|storemode=property
|title=Processing relational OLAP queries with UB-Trees and multidimensional hierarchical clustering
|pdfUrl=https://ceur-ws.org/Vol-28/paper1.pdf
|volume=Vol-28
|dblpUrl=https://dblp.org/rec/conf/dmdw/MarklB00
}}
==Processing relational OLAP queries with UB-Trees and multidimensional hierarchical clustering==
Processing Relational OLAP Queries with UB-Trees and
Multidimensional Hierarchical Clustering
Volker Markl Rudolf Bayer
Bayerisches Forschungszentrum Institut für Informatik
für Wissensbasierte Systeme Technische Universität München
(FORWISS) Orleanstraße 34, D-81667
Orleansstraße 34, D-81667, München, Germany
München, Germany bayer@in.tum.de
volker.markl@forwiss.de
1 Introduction
Abstract The most established relational data models for data
warehousing applications are the star schema and the
Multidimensional access methods like the UB- snowflake schema. In both approaches there is a central
Tree can be used to accelerate almost any query fact table that contains the measures and the dimension
processing operation, if proper query processing tables are situated around it. The connection between a
algorithms are used: Relational queries or SQL fact tuple and the corresponding dimension members is
queries consist of restrictions, projections, realized via foreign key relationships. In the star schema
ordering, grouping and aggregation, and join the dimension tables are completely denormalized while
operations. In the presence of multidimensional in the snowflake schema they may be normalized. Que-
restrictions or sorting, multidimensional range ries usually contain restrictions on multiple dimension
query or Tetris algorithms efficiently process tables (e.g., only sales for specific customer group and for
these operations. In addition, these algorithms a specific time period are asked) that are then used as
also efficiently support queries that generate restrictions on the usually very large fact table.
some hierarchical restrictions (for instance by In this paper we investigate, how UB-Trees and the Tet-
following 1:n foreign key relationships). In this ris algorithm in combination with our technique of mul-
paper we investigate the impacts on query tidimensional hierarchical clustering by hierarchy inter-
processing in RDBMS when using UB-Trees and leaving (MHC/HI) may be used to accelerate relational
multidimensional hierarchical clustering for query processing with a special focus on star-joins, the
physical data organization. We illustrate the most frequent operation of query processing for relational
benefits by performance measurements of data warehouses. With our technique of MHC/HI we spe-
queries for a star schema from a real world cifically cluster the data with respect to the foreign key
application of a SAP business information relationships defined by the star- or snowflake schema of
warehouse. The performance results reported in a data warehouse. We also illustrate the physical data
this paper were measured with our prototype modeling with MHC/HI for a SAP business information
implementation of UB-Trees on top of Oracle 8. warehouse and give a performance analysis for this real-
We compare the performance of UB-Trees to world application. Our performance analysis gives in-
native query processing techniques of Oracle, sight into the real world data distribution of 6 GB of sales
namely access via an index organized table, data from a fruit juice company and shows, why MHC/HI
which essentially stores a relation in a clustered in combination with UB-Trees is superior to classical
B*-Tree, and access via a full table scan of an bitmap indexes, clustering B*-Trees and parallel full
entire relation. In addition we measure the table scans.
performance of the intersection of multiple Our paper is organized as follows: Section 2 surveys re-
bitmap indexes to answer multidimensional lated work, in Section 3 we present the basic concepts of
range queries. UB-Trees, their algorithms and MHC/HI. Section 4 pre-
sents the SAP business warehouse schema that we use for
our analysis. In Section 5 we analyze the data distribu-
tion of the real world data. Section 6 gives a performance
The copyright of this paper belongs to the paper’s authors. Permission to copy
without fee all or part of this material is granted provided that the copies are not
analysis and performance measurements with our proto-
made or distributed for direct commercial advantage. type implementation. Section 7 concludes our paper and
Proceedings of the International Workshop on Design and gives an outlook on future work.
Management of Data Warehouses (DMDW’2000)
Stockholm, Sweden, June 5-6, 2000
(M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/
V. Markl, R. Bayer 1-1
2 Related Work Z-Region [α : β ] is the space covered by an interval on
the Z-Curve and is defined by two Z-Addresses α and β.
Performance optimization has been well studied in the We call β the region address of [α : β ].Each Z-Region
field of OLTP systems [Gra93]. Due to the completely maps exactly onto one page on secondary storage, i.e., to
different query characteristics of OLAP applications in one leaf page of the B-Tree.
comparison to OLTP new questions have to be addressed
here. The performance problem is heavily linked to the
physical data model.
The index selection problem for ROLAP application is
widely discussed in the research community [GHR+97,
Sar97]. Especially bitmap indexes have been proposed to
speed up ROLAP applications because of their compact-
(a) (b) (c)
ness and support of star joins [CI98]. A common way of
performance improvement is the usage of materialized Figure 2: Z-regions
views - often in combination with indexing methods
For an 8×8 universe, i.e., s = 3 and d = 2, Figure1b
[TS97, Moe98, WB98]. Due to the large number of pos-
shows the corresponding Z-addresses. Figure2a shows
sible views a selection problem exists besides the mainte-
the Z-region [4: 20] and Figure2b shows a partitioning
nance issue [Gup97, SDN+96, SDN+98]. Clustering of
with five Z-regions [0 : 3], [4 : 20], [21: 35], [36 : 47]
OLAP data plays a key role in providing good perform-
and [48 : 63]. Assuming a page capacity of 2 points,
ance. Clustering has been well researched in the field of
Figure2c shows ten points, which create the partitioning
access methods. B-Trees, for instance, provide one-di-
of Figure2b. The details of the UB-Tree algorithms are
mensional clustering. Multidimensional clustering has
described in [Bay97a, Mar99].
been discussed in the field of multidimensional access
methods. See [GG97] and [Sam90] for excellent surveys
of almost all of these methods. [ZSL98] addresses the 3.1 Bit Interleaving
issue of hierarchical clustering for the one-dimensional The performance of the UB-Tree crucially relies on an
case. efficient implementation of the Z-address calculation. For
Most work on applying multidimensional indexes to tuples of positive integer numbers, bit interleaving im-
RDBMS discusses restrictions by range queries [SRF97, mediately yields an algorithm to calculate a Z-address
NHS84, Gut84, OM84, LS90]. [JL98] accelerates range from the binary representation of a tuple.
queries with aggregations by storing aggregated data in It is easy to incorporate varying attribute lengths, i.e.,
R-Trees. [MRB99] and [Bay97b] are the basis of our ap- attributes with different resolutions, into this algorithm:
proach, where joins and sorted processing of data organ- When the limit of the resolution (i.e., the last bit) in one
ized by a multidimensional index are investigated. attribute is reached, this attribute is not used for bit-in-
terleaving any further. The number of bits in each further
3 The UB-Tree step is reduced in this case.
For each attribute Ai we denote the number of distinct
The basic idea of the UB-Tree [Bay97a, Mar99] is to use values of its domain by ri , i.e., ri = |Ai |
a space-filling curve to partition a multidimensional uni- Definition 1: The number of steps for attribute Ai of a
verse. Using the Z-Curve (Figure 1a) the UB-Tree pre- domain with cardinality1 ri is determined by its resolu-
serves the multidimensional clustering. A Z-Address α = tion: steps(i) = log2ri
Z(x) is the ordinal number of the key attributes of a tuple
x on the Z-Curve, which can be efficiently computed by Definition 2: The length of step k in bits (i.e., the num-
bit-interleaving [OM84]. A standard B-Tree is used to ber of dimensions in step k) is:
index the tuples taking the linear Z-Address of the tuples steplength(k) = |i | steps(i) ≥ k and i ∈ D}|
as keys. Figure 3 shows this generalized algorithm for bit inter-
0 1 2 3 4 5 6 7 leaving.
0 0 1 4 5 16 17 20 21 Input: x : tuple
Output: Z-address α
1 2 3 6 7 18 19 22 23
2 8 9 12 13 24 25 28 29
3 10 11 14 15 26 27 30 31
4 32 33 36 37 48 49 52 53 for step = 1 to max({steps(rj) | j ∈ D})
5 34 35 38 39 50 51 54 55 for i = 1 to steplength(step)
6 40 41 44 45 56 57 60 61 copy bit step of xi to bit i of αstep
7 42 43 46 47 58 59 62 63
end for
(a) (b) end for
Figure 1: Z-curve and Z-addresses Figure 3: Bit-Interleaving to calculate α = Z(x)
The fundamental innovation of UB-Trees is the concept With r = max({steps(rj) | j ∈ D}) bit interleaving has a
of Z-Regions to create a disjunctive partitioning of the (
CPU-complexity of O(d⋅r) bit operations (resp. O ∑id=1 ri )
multidimensional space. This allows for very efficient
processing of multidimensional range queries [Mar99]. A 1
Note that we defined ri to be 2v for some v ∈ 1R
V. Markl, R. Bayer 1-2
for attributes of different length). The same holds for the NATION = “USA” is mapped to an interval restriction in
inverse algorithm Z-1 that calculates the Cartesian coor- a linear domain. With this technique, a star join on a
dinates of a tuple from its address. schema with d dimensions therefore creates a d-
dimensional interval restriction on the fact table which
3.2 Range Queries on UB-Trees then may efficiently be processed by the UB-Tree.
In general one can imagine any foreign key relationship
To answer a range query, only those Z-regions, which to be used for such a clustering. In the following we il-
properly intersect the query box, must be fetched from lustrate MHC/HI by hierarchical relationships as they
the database and thus from the disk. Initially the range usually occur in data warehousing applications. In
query algorithm calculates and retrieves the first Z-region ROLAP hierarchies are usually modeled implicitly by a
that is overlapped by the query-box. Then the next inter- set of attributes A1, ..., An where Ai corresponds to hierar-
secting Z-region is calculated and retrieved. This is re- chy level i and level 1 is the root of the hierarchy.
peated until a minimal cover for the query box has been Many attributes in relational DBMS in general and in
constructed, i.e., the region that contains the ending data warehouses in particular have an actual domain of a
point of the query box has been retrieved. very small set of values. A typical example (cf. Figure 4a)
It is important to note that for any Z-region the calcula- is the attribute REGION of the dimension table
tion of the next point intersecting the query box is per- CUSTOMER, which has an actual domain of 8 values
formed solely in main memory. [Bay97a] describes an (Southern Europe, Central Europe, Northern Europe,
algorithm that for a Z-region [α : β ] and a query box Q Western Europe, North America, Latin America, Asia,
calculates the smallest Z-address γ > β that intersects Q Australia). By MHC/HI, these long strings are replaced
in time exponential to the number of dimensions. A lin- by numbers. In the corresponding tables small numbers
ear algorithm which is solely based on bit operations is or bit strings are stored instead of long, space wasting
described in [Mar99] and [RMF+00]. character strings. If all records must be retrieved that
concern Europe, only an interval [0;3] on this number is
3.3 The Tetris Algorithm necessary. Otherwise every member of region must be
With the Tetris algorithm [MZB99], tables organized by specified. The mapped numbers are called surrogates
a UB-Tree can be read in any key sort order in O(n) disk because they replace the character strings. This data type
accesses where n is the number of pages of the table or is called enumeration type.
the minimal number of regions covering a query box.
REGION f(REGION)
The Tetris algorithm is a generalization of the multidi-
mensional range query algorithm that efficiently com- South Europe 0
bines sort operations with the evaluation of multi-attrib- Middle Europe 1
ute restrictions. The basic idea is to use the partial sort Northern Europe 2
order imposed by a multidimensional partitioning in or- Western Europe 3
der to process a table in some total sort order. Essentially North America 4
a plane sweep over a query space defined by restrictions Latin America 5
on a multidimensionally partitioned table is performed. Asia 6
The direction of the sweep is determined by the sort at- Australia 7
tribute. Initially the algorithm calculates the first Z-re-
gion that is overlapped by the query box, retrieves it and (a)
caches it in main memory. Then it continues to read and
cache the next Z-regions with respect to the requested CUSTOMER
sort order, until a complete thinnest possible slice of the
0 SouthEurope ... 4 North America ...6 Asia 7 Australia
query box (in the sorting dimension) has been read. Then
the cached tuples of this slice are sorted in main memory,
0 Canada 1 USA 0 Wholesale 1 Retail
returned in sort order to the caller and removed from
cache. The algorithm proceeds reading the next slice, 0 Wholesale 1 Retail ... Kana´s Sushi Bar ...
until all Z-regions intersecting the query box have been
processed. Only disk pages overlapping the query space ... 2 Bar ...
are accessed. With sufficient, but modest, cache memory
... Joe‘s Sports Bar ...
each disk page is accessed only once.
(b)
3.4 Multidimensional Hierarchical Clustering
Figure 4: Surrogates for REGION and the Customer
Multidimensional Hierarchical Clustering by Hierarchy Hierarchy
Interleaving (MHC/HI, [MRB99]) clusters a multidimen-
sional data set on disk with respect to the hierarchical This surrogate technique is generalized to all levels of the
relationships in multiple dimensions. This is achieved by hierarchies. Figure 4b shows a part of the customer
creating surrogate values that ensure that a hierarchical hierarchy with the surrogates propagated to all levels of
restriction like REGION = “North America” and the hierarchy.
V. Markl, R. Bayer 1-3
To efficiently encode hierarchies, we introduce the con- used to accelerate star-joins, the most frequent operation
cept of compound surrogates for hierarchies. Since we of query processing for relational data warehouses.
require hierarchies to form a disjoint partitioning of a
dimension, a uniquely identifying compound surrogate 4.1 The Juice & More Schema
for each child node of a hierarchy member exists and can
be recursively calculated by concatenating the compound We use the schema of the beverages supplier ‘Juice &
surrogate of the member with the running number of the More’, a real customer of one of our project partners2. In
child node. Because only efficient bit operations are nec- the data warehouse of ‘Juice & More’ data is organized
essary, the computation is extremely fast. The ord func- along the following four dimensions: CUSTOMER,
tion returns the corresponding surrogate of a hierarchy PRODUCT, DISTRIBUTION and TIME. Figure 5a
member in the specified level. shows the hierarchies over the dimensions (the number
The hierarchy path North America Å USAÅ Retail Å in parentheses specifies the maximum number of level
members).
Bar (cf. Figure 4b) has the compound surrogate:
ordCustomer (North America) ο ordNorth America (USA) ο PRODUCT CUSTOMER DISTRIBUTION TIME
ordUSA(Retail) ο ordRetail(Bar) = 4 ο 1 ο 1 ο 2.
The upper limit of the domain for surrogates of level i is
calculated as the maximum fan-out (number of children) All Products All Customer All Distributions All Time
of all members of level i–1 of a hierarchy H, i.e., surro-
gates(H, i) = max {cardinality(children(H, m)) where m Type (5) Region (8) Sales Year (3)
∈ level(H, i - 1)} Organization (5)
The compound surrogate can be interpreted as a number.
Month (12)
In the above example – using the surrogate length of Brand (8) Nation (7)
Distribution
Figure 4a with a length of 10 bits for the customer surro- Channel (3)
gate (3 for REGION, 3 for NATION, 1 for Trade Type
Category (19) Trade Type (2)
and 3 for Business Type) the hierarchy path results in the
compound surrogate 10000110102 = 538.
Usually growth expectations for a hierarchy are known Container (10) Business Type (7)
well in advance. Often hierarchy trees are even static.
Therefore it is possible to determine a reasonable number (a)
of bits for storing each surrogate of the compound surro-
PRODUCT
gate of a hierarchy. The overall number of bits necessary 2180 rows
DISTRIBUTION
12 rows
to store a compound surrogate is relatively small. For PRODKEY DISTKEY
instance, a hierarchy tree with four branches on 8 levels TYPE SALESORG
already represents 48 = 65536 partitions and is stored in BRAND FACT CHANNEL
only 16 bits. The maximum length of the compound sur- CATEGORY 26M rows ...
rogates for the ‘Juice & More’ schema that we used for CONTAINER PRODKEY
our analysis can be computed from the maximum fan-out ... CUSTKEY
of the hierarchy levels given in Figure 4a. DISTKEY
CUSTOMER TIMEKEY
This very compact fixed length encoding preserves the 7064 rows
SALES
lexicographic order on the hierarchy levels. Thus, point CUSTKEY DISTCOST
restrictions on upper hierarchy levels result in range re- REGION ... TIME
strictions on the finest granularity of a hierarchy. For NATION 36 rows
instance, the point restriction NATION = “USA” on the TRADE-TYPE TIMEKEY
second level of the CUSTOMER hierarchy with f(“North BUSINESS-TYPE YEAR
... MONTH
America”) = 4 = 1002 and f(“USA”) = 1 = 0012 maps to
the range restriction cscustomer between 528 = (b)
10000100002 and 543 = 10000111112. Thus, a star join
with this surrogate encoding for the foreign keys of a fact Figure 5: Hierarchies in the ‘Juice & More’ schema and
table results in a range restriction on each compound the corresponding star schema
surrogate, if some hierarchy level of each dimension is The ROLAP data model for the ‘Juice & More’ schema
restricted to a point. In the same way intervals on the (Figure 5b) is a typical star schema with one fact table
children of one hierarchy level result in a range of the FACT and a table for each of the 4 dimensions. Let
corresponding compound surrogates (e.g., YEAR = 1998 ‘SALES’ and ‘DISTCOST’ be some of the measures in
and MONTH between April and June). A star join on a the fact table. We used the methodologies of surrogates
schema with d dimensions therefore creates a d-dimen- and multidimensional hierarchical clustering as
sional interval restriction on the fact table. described in Section 3.4 for clustering the fact table of
‘Juice & More’ with UB-Trees.
4 Schema and Queries of ‘Juice & More’
In this section we investigate, how UB-Trees and the 2
The company and the data presented here have been
Tetris algorithm in combination with MHC/HI may be made anonymous.
V. Markl, R. Bayer 1-4
In the following we describe some example queries SELECT SALESORG, CHANNEL, SUM(DistCost)
involving star joins for the ‘Juice & More’ schema. These FROM Fact F, Distribution D, Product P
queries were taken from the decision support system of WHERE F.DistKey = D.DistKey AND
‘Juice & More’. Thus next to the investigation of F.ProductKey = P.ProductKey AND
multidimensional hierarchical clustering this section is P.Type = X
interesting from a second point of view: The highly GROUP BY D.SalesOrg,D.Channel
skewed data distribution of the ‘Juice & More’ will prove Figure 7: Distribution cost (Q2)
that UB-Trees, the Tetris algorithm and MHC/HI are not
only applicable to laboratory environment tests with Query 3 (Q3, cf. Figure 8) restricts all dimensions on the
generated data, but also prove their efficiency in the first level of the hierarchies. This query also uses the
practical application scenarios. In order to show the Tetris algorithm on MHC/HI compound surrogates.
highly skewed data distribution we included an entire SELECT SUM(SALES)
section displaying the one-dimensional data distribution FROM Fact F, Distribution D, Product P,
of ‘Juice & More’ for every dimension. Customer C, Time T
We then present measurements performed with our WHERE F.DistKey = D.DistKey AND
prototype implementation of the UB-Tree on top of F.TimeKey = T. TimeKey AND
Oracle 8. For the evaluation of our clustering technique F.CustKey = C.CustKey AND
we defined a benchmark with 36 queries. In comparison F.ProdKey = P.ProdKey AND
we also conducted measurements with native Oracle 8 P.Type = t AND D.SalesOrg = s AND
access methods: parallel full table scan (FTS) and bitmap T.Year = y AND C.Region = r
indexes (BII). For these measurements we used a Figure 8: Partial match query in first hierarchy level (Q3)
completely denormalized fact table, that is, no additional
joins next to the star join had to be performed to answer
the queries. The bitmap indexes were created on each 5 Data Distributions of the ‘Juice & More’
hierarchy level. We did not include secondary indexes in Fact Table
our comparison measurements because earlier
experiments showed that they are neither competitive to The data of ‘Juice & More’ is real world data; the data
the UB-Tree nor to FTS or BII [MZB99]. distribution of both the fact table and the dimension ta-
bles is highly skewed: The dimensions are neither dis-
4.2 Queries on the ‘Juice & More’ Schema tributed uniformly nor are independent. The original fact
table consisted of 823.464 tuples (about 175 MB). To get
In the following we present typical queries that are taken
from the ‘Juice & More’ SAP business information a realistic large data cube, the fact table was enlarged to
warehouse. We will use these queries to illustrate our 26.350.848 tuples (about 6 GB). Our project partner im-
approach and we will present performance measurements plemented an augmentation algorithm with minimal im-
for exactly these queries in Section 6.2. pact on the data distribution (see [Pie98]).
Query 1 (Q1, cf. Figure 6) computes the sales for a given In the following we show some charts that describe the
product group (TYPE and BRAND specified as (X1, one-dimensional data distribution for each dimension.
X2)) and a given customer group (NATION and However, we once more stress that the dimensions are
REGION specified as (Y1, Y2)) for the months from not independent (e.g., some customers always order the
October to December of 1993. This query uses the UB- same subset of products, some customers or products only
Tree range query algorithm on MHC/HI surrogates. exist for a certain time, etc.). Thus in general, the overall
However, surrogates are only used for clustering and
selectivity of a query restricting several dimensions is not
indexing and thus are transparent to the user.
the product of the selectivities of the one-dimensional
SELECT SUM(Sales) restrictions (which is shown in the following charts). We
FROM Fact F, Customer C, Product P, Time T will see this deviation in Section 6.
WHERE F.ProdKey = P.ProdKey AND The fact table of ‘Juice & More’ stores several measures
F.CustKey = C.CustKey AND
(e.g., distribution cost, sales) aggregated on a daily basis
P.Type = X1 AND P.Brand = X2 AND
with respect to the dimensions time, customer, product
C.Region = Y1 AND C.Nation = Y2 AND
F.TimeKey = T.TimeKey AND T.Year = 1993 AND
and distribution. Since the data is sensitive real-world
T.Month >= October AND T.Month <= December
business data, it is not possible to show the labels/names
Figure 6: Time Interval (Q1) of the hierarchy members in the charts.
Query 2 (Q2, cf. Figure 7) calculates the cost of 5.1 The Time Dimension
distribution of the products of type X for each
distribution channel. This query uses the Tetris algorithm The time dimension of ‘Juice & More’ consists of a two-
on MHC/HI surrogates in order to efficiently calculate level hierarchy of months and years. The test data stored
aggregations. the years from 1993 to 1995. As shown in Figure 5a, the
days of the time dimension are organized by a two level
V. Markl, R. Bayer 1-5
hierarchy (year and month). Figure 9 shows the cumu- Multidimensional hierarchical clustering ensures that a
lated data distribution of the fact table with respect to the restriction in the first hierarchy level will result in a 1%,
time dimension grouped by year and month. The hori- 27%, 6%, 32% respectively 34% reduction of I/Os which
zontal axis displays the hierarchy members, with “All” at are necessary to retrieve the result set. Without exactly
the very bottom (i.e., the lowest level), the years 1993, showing the relationship between the hierarchy members,
1994, and 1995 in the middle and the twelve months for we show the skew of the data distribution over the first
each year above the year. The arrows in the horizontal four product levels in Figure 11.
axis indicate the relationship between the members of
neighboring hierarchy levels. 4%
Thus the fact table of ‘Juice & More’ is almost uniformly
distributed with respect to the time dimension. The
3%
minimum number of facts for one month is 2,70% of the
fact table (in January 1993, December 1993 and January
1995), whereas the maximum number of facts per month 2%
is 2,83% (in March of each of the three years). Thus with
a multidimensional clustering using 5 bits for time, re- 1%
strictions to one month in the time dimension (=1/36)
can be expected to reduce the amount of data to around
0%
1/25 = 1/32. product
Figure 11: Product – first four hierachy levels
2,83%
2,83%
2,83%
2,82%
2,82%
2,81%
2,82%
2,82%
2,81%
2,81%
2,81%
2,81%
2,80%
2,80%
2,80%
2,79%
3,0%
2,79%
2,79%
2,78%
2,78%
2,78%
2,77%
2,77%
2,77%
2,75%
2,75%
2,75%
2,75%
2,75%
2,75%
2,72%
2,71%
2,71%
2,70%
2,70%
2,70%
2,5%
5.2 The Customer Dimension
2,0%
According to business administration literature 20% of
1,5%
the customers contribute to 80% of the business. The
customer dimension of ‘Juice & More’ is a typical exam-
1,0%
ple for a classification of customers in such a company. A
high number of customers (in this case 88%, the very left
0,5% entry of Figure 12) are not classified (maybe they are not
interesting for the company because of small turnover or
0,0% it is not possible to find a classification). The classified
1993001
1993002
1993003
1993004
1993005
1993006
1993007
1993008
1993009
1993010
1993011
1993012
1994001
1994002
1994003
1994004
1994005
1994006
1994007
1994008
1994009
1994010
1994011
1994012
1995001
1995002
1995003
1995004
1995005
1995006
1995007
1995008
1995009
1995010
1995011
1995012
customer groups contain 0% to 3% of the tuples stored in
the table.3 The hierarchical relationships on the horizon-
Figure 9: Data Distribution in the Time Dimension
tal axis show that the hierarchy of the customer dimen-
sion is not balanced, since several hierarchy members
5.2 The Product Dimension just have one child.
The top level of the hierarchy on product has five entries. 3,0%
87,33%
2,96%
The data distribution is quite skewed, there are three 2,5%
product groups to which 93% of all tuples of the fact ta-
1,72%
2,0%
1,64%
ble belong. 1% of the data is unclassified. The distribu-
1,27%
tion of the first level of the product hierarchy is illus- 1,5%
trated in Figure 10.
0,77%
0,76%
0,73%
1,0%
0,55%
0,28%
0,27%
0,21%
0,5%
0,13%
0,14%
0,13%
0,16%
0,15%
1%
0,08%
0,12%
0,09%
0,06%
0,05%
0,06%
0,07%
0,06%
0,07%
0,02%
0,04%
0,04%
0,03%
0,02%
0,00%
0,0%
unclassified
34% 27%
910
912
6% 920
922
Figure 12: Customer – first four hierachy levels
32%
3
Note that the data in the ‘Juice & More’ warehouse is
Figure 10: Product – first hierachy level aggregated on a daily basis, thus the amount of data is
usually compressed for large customers, thus the
proportion of large customers is reduced.
V. Markl, R. Bayer 1-6
The consequence of the small number of classified cus- 6.1 Performance Analysis
tomers is that queries the restriction CUSTOMER will
result in a selectivity of at most 3% and therefore the Figure 15 shows the compound surrogates for the ‘Juice
overall result set will be small. & More’ data warehouse, which are calculated as fixed
length compound surrogates in [MRB99]. For any of the
4 hierarchies the length of the compound surrogate does
5.2 The Distribution Dimension not exceed 15 bits and thus can be stored in a single inte-
ger value. These compound surrogates are used as attrib-
There are seven entries on the first level of the distribu- utes for each of the four dimensions of ‘Juice & More’ to
tion hierarchy. The data distribution of the fact table with calculate the Z-address for each tuple of the ‘Juice &
respect to the distribution dimension is highly skewed. More’ fact table.
Figure 13 shows the distribution of the first hierarchy
level (i.e., sales organization), while Figure 14 shows the
distribution of facts in the fact table for each distribution 14243 14243 142 4 434 14243
cs product = p15 p14 p13 p12 p11 p10 p9 p8 p 7 p 6 p5 p 4 p3 p 2 p1
level 1 level 2 level 3 level 4
123 123 123 123
channel of each sales organization. Again the arrows
indicate the hierarchical relationship of the members of cs customer =c c c c c c c c c c
10 9 8 7 6 5 4 3 2 1
{123
neighboring hierarchy levels with the hierarchy root level1 level 2 level 3 level 4
“All” at the bottom of the Figure. cs time =t t t t t t
6 5 4 3 2 1
level1 level 2
8% 13%
cs distribution
123 123
=d d d d d 5
level 1
4 3 2
level 2
1
16% 13% Figure 15: Compound surrogates for each dimension of
‘Juice & More’
7%
The UB-Tree for the ‘Juice & More’ fact table consists of
37% 6%
P = 878362 pages, which corresponds to: l = log2 P =
log2878362 = 19,7 bits that are in any case necessary to
address the bit-interleaved multidimensional space. With
1 2 3 4 5 6 8 bit interleaving in the order of dimensions PRODUCT,
CUSTOMER, TIME, and DISTRIBUTION the Z-address
Figure 13: Distribution – first hierachy level α for a tuple of the ‘Juice & More’ fact table is calcu-
lated as:
1444444442444444443 { 14444444244444443
10,30%
10,30%
12,5% α = p15 c10 t 6 d 5 p14 c9 t 5 d 4 p13 c8t 4 d 3 p12 c7 t 3 d 2 p11c 6 t 2 d 1 p10 c5t 1 p 9 c 4 p8 c 3 p 7 c 2 p 6 c1 p5 p 4 p 3 p 2 p1
12,50%
12,50%
12,50%
12,50%
completely partitione d for any data distributi on partly split partitione d depending on the data distribution
8,45%
10,0% The first 19 bits of the Z-address are guaranteed to be
used to partition the four dimensional universe of the
6,66%
5,84%
7,5% ‘Juice & More’ fact table. This means that the binary
4,05%
5,0%
strings p15p14p13p12p11 of the compound surrogate of
product, c10c9c8c7c6 of customer, t6t5t4t3t2 of time and
2,20%
2,20%
2,5% d5d4d3d2 of distribution are used to partition the universe.
For each of the four dimensions the first hierarchy level
0,0% is completely used for the partitioning. The second hier-
09 1
10 1
09 2
10 2
12 3
11 4
13 5
14 5
15 5
16 6
17 6
19 8
archy level is used to a large extent to partition the uni-
verse. Therefore a restriction in the first hierarchy level
Figure 14: Distribution – first two hierachy levels will result in a reduction of the number of pages as de-
termined by the data distribution of Section 5, i.e., a re-
striction of the product main group to “910” will reduce
6 Performance Analysis of MHC/HI for the number of pages to be retrieved to 27%, a restriction
‘Juice & More’ to “912” will result in a reduction to 6,41%. This holds
for the restriction of the first hierarchy level in any di-
In this section we first analyze the performance of
mension. If the top hierarchy level is restricted in several
MHC/HI on UB-Trees analytically and then, taking the
dimensions and the independence assumptions holds for
data distributions and query selectivities into account,
these dimensions, the reduction is multiplicative. Figure
verify our expectations by comparing the number of re-
16 shows the predicted selectivity calculated as the prod-
trieved pages with the product of the selectivities over all
uct of the selectivity in each dimension, the actual selec-
dimensions. We then also give response times which
tivity and the loaded pages in percent of the entire pages
show the superiority of UB-Trees and MHC/HI over
in the database for two queries, which restrict the first
traditional indexes even for our prototype
hierarchy level of three out of four dimensions.
implementation.
V. Markl, R. Bayer 1-7
scust- spro sdistri- stime Pages Predic- Actual loaded sion in average 99,99% of the tuples on the pages con-
omer duc bution in loaded ted selecti- pages tributed to the result set. A standard deviation of less
in in in % % selecti- vity in in % than 0,001 for these measurements means that the multi-
% %t vity in % dimensional hierarchical clustering is perfect for multi-
% dimensional restrictions in the first hierarchy level.
When additionally restricting the second hierarchy level
0,78 6,4 37,5 100 178 0,0182 0,0199 0,0207 in average 557 pages were loaded, where 10,7% of the
tuples did not contribute to the result set. The standard
deviation here was 0,06. Additionally restricting the third
87,3 6,4 37,5 100 18996 2,0981 2,1618 2,1627 hierarchy level of each dimension usually created result
sets with only one page.
Figure 16: Restrictions in the first hierarchy level in 3 of Thus multidimensional hierarchical restrictions are very
4 dimensions well processed by UB-Trees storing compound surrogates
which are created by the multidimensional hierarchical
However, the data is not independently distributed in the clustering technique. For star schemas as used in present
entire 4-dimensional universe of ‘Juice & More’. In this data warehousing applications this approach significantly
case the predicted selectivity does not describe the actual speeds up query performance and reduce resource re-
selectivity anymore. Thus some bits of the first 19 bits are quirements in disk space and processing time.
correlated. This means that not all combinations of these
bits occur and some partitioning will take places in the 6.2 Performance Measurements
bits below bit number 19 of the Z-address. In this case
the second level may already be completely partitioned The measurements for ‘Juice & More’ were performed on
and even a third hierarchy level partitioning may have a SUN Enterprise with four 300 MHz UltraSPARC proc-
started for some dimensions. A typical part of the multi- essors and 2 GB RAM under Solaris 2.6. As secondary
dimensional space where this will happen is the customer storage a partition on a SPARCstorage array with RAID
hierarchy “unclassified”, which stores 87,34% of the level 0 (6 disks striping, 5-6 MB/s transfer rate per disk)
customers. At most the three bits c10c9c8 of the customer was used. All measurements were done in a single-user
hierarchy are needed to distinguish these customers from environment.
all other customers. Thus for the unspecified customers It is important to note that our implementation still
the bits c7c6 of the first 19 bits of the Z-address are cor- causes significant overhead due to the fact that we have
related to c10c9c8 and two further bits may be used for implemented the UB-Tree on top of a DBMS and not in
partitioning. Thus d1 will be split completely, p10 will be the kernel itself. First, the number of SQL statements that
used for the partitioning and t1 will be partly split (c5 is have to be processed (UB: 1 statement for each page in
also correlated to c10c9c8). Our measurements show that the result set, Oracle 8 methods: 1 statement in total)
this effect also holds for other dimensions. Figure 17 leads to extensive inter-process communication (about
shows queries where the first two hierarchy levels of 30% of the total processing time) and DBMS overhead
CUSTOMER, PRODUCT and TIME are restricted, (e.g., parsing of statements). Second, our table is larger
whereas the DISTRIBUTION dimension is not restricted. than the one for the FTS and the bitmap indexes due to
The selectivity predicted by the cost functions here differs unimplemented compressing techniques in the UB-Tree
from the actual selectivity of the query because of de- (for 8 KB pages: UB: 878362 pages, FTS: 723539 pages,
pendencies in the data distribution. However, the per- BII: FTS+31134 pages).
centage of pages loaded is similar to the actual selectivity Figure 18 shows result set sizes and response times of the
of each query. three example queries (Section 4.2). Q1 shows that the
UB-Tree with multidimensional clustering is over 2
scusto- spro- sdis- stime Pa- predicted actual loaded times faster than BII even for very small result sets. Q3
mer in duct in tribu- in % ges selecti- selec- pages which is processed by the unoptimized UB-Tree at least
% % tion loa- vity in % tivity in % 10 times faster than with any other access method un-
in ded in % dermines this observation.
%
2,95 3,64 100 38,4 312 0,0414 0,0310 0,0355
300
285 232 275
Reponse Time in s
2,95 27,99 100 38,4 782 0,0318 0,0851 0,0890
200
186
Figure 17: Restriction in the first two hierarchy levels in
3 of 4 dimensions 100
FTS
5 118 10 Bitmap
Each of the 878362 pages of the ‘Juice & More’ fact table
2 1 UB
stores 30 tuples. All of the measurements showed that 0
Q1 Q2 Q3
when restricting the first hierarchy level in each dimen-
V. Markl, R. Bayer 1-8
19 lists the number of hierarchy levels that by each query
Percentage are restricted to a point for each dimension. Since the
Query Loaded Tuples of Database data is non-uniformly distributed, the selectivity of each
Q1 8160 0,03% query depends on the exact point restriction, not only on
Q2 1696416 6,52% the number of restricted hierarchy levels. We thus
Q3 19752 0,08% present two instances of the queries, each of which has a
Figure 18: Query response times and result set sizes different selectivity.
500
The result set of query Q2 is quite large but the almost 472
perfect clustering factor of the UB-Tree (in average more
than 29 out of 30 tuples/page belong to the result set) still 400
leads to a speed up of more than 30 % in comparison to
BII. The time for FTS for Q2 differs from the times for 313
287
Q1 and Q3 due to the less complex WHERE clause of the 283 288
time in seconds
300 274
UB
statement. The number of comparison operations is 242
BM
therefore much smaller than for the other queries, which FTS
causes the faster execution. 200
171
All these results on real data show how well the multidi-
mensional hierarchical clustering with UB-Trees works
100
in practice and the accuracy of our theoretical cost model 69 78
56
39 44
[MZB99]. In total more than 77% of all benchmark que-
3 6
ries (28 out of 36) showed a speed up between a factor of 0
1.3 and 10 over traditional techniques. Q4 (0,3176%) Q5 (3,4383%) Q6 (2,0981%) Q7 (1,1346%) Q8 (34,000%)
Cust- Pro- Distr- Time Select- Select- Figure 21: Performance Q4-Q8 (instance 2)
omer duct ibution ivity ivity
Instan- Instan- 7 Conclusions and Future Work
ce 1 in ce 2 in We have presented multidimensional hierarchical clus-
% % tering by hierarchy interleaving (MHC/HI) as a physical
data modeling technique for OLAP applications in rela-
Q4 2 2 0 1 0,0414 0,3176 tional databases. Our performance measurements have
shown that MHC/HI in combination with UB-Trees and
Q5 1 1 1 1 0,0070 3,4383
the Tetris algorithm significantly reduces the number of
Q6 1 1 1 0 0,0182 2,0981 random accesses to the fact table for star joins and other
queries with restrictions in multiple hierarchies. This
Q7 1 0 0 1 1,1346 1,1346 results in considerably lower response times for typical
Q8 0 1 0 1 6,0000 34,000 data warehousing queries with star joins. For a 6 GB re-
tail data SAP business information warehouse, our pro-
Figure 19: Restricted hierarchies and selectivities for five totype implementation of MHC/HI accelerates the proc-
queries against the ‘Juice & More’ data warehouse essing of star-join queries more than a factor of two com-
300
pared to bitmap indexes, clustering B*-Trees or parallel
284,53
277,01
269,05 274,44 full table scans. For dimensionalities typical for data
warehousing, only I/O-time linear in size of the result set
250 231,65 prior to aggregation and sublinear temporary storage are
necessary to aggregate parts of a fact table of a star or
200 186 snowflake schema. Thus secondary storage space and
Time in Seconds
UB
pre-computation time for many aggregates and bitmap
150 BM indexes can be avoided. In addition the widely discussed
118 FTS view maintenance problem is minimized. Depending on
100
the query, temporary storage requirements for sorting are
reduced by several orders of magnitude. Our clustering
56 approach also holds not only for ROLAP but also for
44
50
MOLAP implementations of a data warehouse since both
2 5 1 5 1 4 ROLAP fact tables and MOLAP data cubes can be clus-
0 tered in this way.
Q4 (0,0414%) Q5 (0,0070%) Q6 (0,0182%) Q7 (1,1346%) Q8 (6,0000%)
Our future work includes deriving a set of cost based de-
Figure 20: Performance Q4-Q8 (instance 1) cision rules for how to – possibly multidimensionally -
index a relation for a given set of queries. We also intend
Figures 20 and 21 list two instances for each of five to apply our model to cost based query optimization and
further queries of that benchmark. For each query Figure combine the model with multidimensional histograms in
V. Markl, R. Bayer 1-9
order to take dependencies and correlations between the [MRB99]V. Markl, F. Ramsak, and R. Bayer. Improving
attributes into account. OLAP Performance by Multidimensional
Hierarchical Clustering. Proc. of IDEAS’99,
8 References 1999.
[MZB99] V. Markl, M. Zirkel, and R. Bayer. Processing
[Bay97a] R. Bayer. The universal B-Tree for Operations with Restrictions in Relational
multidimensional Indexing: General Concepts. Database Management Systems without external
WWCA ’97, Tsukuba, Japan, p- 10-11, Sorting. ICDE, 1999.
Springer Verlag, March, 1997. [NHS84] J. Nievergelt, H. Hinterberger, and K. C.
[Bay97b] R. Bayer. UB-Trees and UB-Cache – A new Sevcik. The Grid-File. ACM TODS, 9(1),
Processing Paradigm for Database Systems. March 1984, pp. 38-71.
Technical Report TUM-I9722, TU München, [OM84] J. A. Orenstein and T.H. Merret. A Class of
1997. Data Structures for Associate Searching.
[CD97] S. Chaudhuri and U. Dayal. An Overview of SIGMOD-PODS Conf., Portland, Oregon, 1984,
Data Warehousing and OLAP Technologies. pp. 294-305.
SIGMOD Rec. 26(1), 1997. [OQ97] P. O´Neill and D. Quass. Improved Query
[CI98] C. Chan and Y. Ioannidis. Bitmap Index Design Performance with Variant Indexes. ACM
and Evaluation. SIGMOD, 1998. SIGMOD Intl. Conf. On Management of Data,
[Dat88] C.J. Date. A Guide to the SQL Standard,2nd Tucson, Arizona, pp. 38-49,1997.
edition. Addison-Wesley, 1988. [Pie98] R. Pieringer. Evaluation of the UB-Tree in the
[GG97] V. Gaede and O. Günther. Multidimensional SAP Environment. Master Thesis, TU
Access Methods. ACM Computing Surveys München, 1998.
30(2), 1997. [RMF+00] F. Ramsak, V. Markl, R. Fenk et al.
[GHR+97] H. Gupta, V. Harinarayan, A. Rajaraman, and Integrating the UB-Tree into a data-basesystem
D. Ullman. Index Selection for OLAP. ICDE, kernel. VLDB2000, Cairo, 2000.
1997. [Sar97] S. Sarawagi. Indexing OLAP data. Data Eng.
[GR97] J. Gray and A. Reuter. Transaction Processing: Bulletin 20 (1), pp. 36-43, 1997.
Concepts and Techniques. Morgan Kaufmann [Sam90] H. Samet. The Design and Analysis of Spatial
Publishers, 1997. Data Structures. Addison Wesley, 1990
[Gra93] G. Graefe. Query Evaluation Techniques for [SDN+96] A. Shukla, P. Deshpande, J. Naughton, and K.
Large Databases. ACM Computing Surveys 25, Ramasamy. Storage Estimation for
1993, pp. 73-170. Multidimensional Aggregates in the Presence of
[Gup97] H. Gupta. Selection of Views to Materialize in a Hierarchies. VLDB Conference. 1996.
Data Warehouse. Proc. of the Intl. Conference [SDN+98] A. Shukla, P. Deshpande, and J. Naughton.
on Database Theory, Athens, Greece, January Materialized View Selection for
1997. Multidimensional Datasets. SIGMOD Conf.,
[Gut84] A. Guttman. R-Trees: A dynamic Index 1998.
Structure for spatial Searching. Proc. of ACM [TS97] D. Theodoratos and T. K. Sellis. Data
SIGMOD Conf., 1984, pp. 47-57. Warehouse Configuration. VLDB, 1997, p. 126-
[HAM+97] C.T. Ho, R. Agrawal, N. Megiddo, and R. 135.
Srikant. Range Queries in OLAP Data Cubes. [SRF97] T. K. Sellis, N. Roussopoulos, C. Faloutsos:
SIGMOD Conf., 1997, pp. 73-88. Multidimensional Access Methods: Trees Have
[JL98] Jürgens, M.; Lenz, H.J.: The R*a-Tree: An Grown Everywhere. VLDB, 1997, P. 13-14.
improved R*-Tree with Materialized Data for [WB98] M.C.Wu and A.P. Buchmann. Encoded Bitmap
Supporting Range Queries on OLAP-Data, Indexing for Data Warehouses. ICDE, Orlando,
DWDOT Workshop, Vienna, 1998. 1998.
[Kim96] R. Kimball. The Data Warehouse Toolkit. John
Wiley & Sons, New York. 1996.
[LS90] Lomet, D.; Salzberg, B.: The hB-Tree: A
Multiattribute Indexing Method with good
guaranteed Performance. ACM TODS, 15(4),
1990, pp. 625 – 658.
[Mar99] V. Markl. MISTRAL: Processing Relational
Queries using a Multidimensional Access
Method. Ph.D. Thesis, TU München, 1999.
[Moe98] G. Moerkotte. Small Materialized Aggregates:
A Light Weight Index Structure for Data
Warehousing. VLDB Conference. New York,
USA, 1998.
V. Markl, R. Bayer 1-10