Introduction Abstract

Mj nj nj = c +

Tree Based Indexes vs. Bitmap Indexes: A Performance Study

Marcus J u¨rgens

juergens@inf.fu-berlin.de 1

Hans-J. Lenz

hjlenz@wiwiss.fu-berlin.de 0 0 Institute of Statistics and Econometrics, Freie Universita ̈t Berlin , Germany 1 Institute of Statistics and Econometrics, Institute of Computer Science, Freie Universita ̈t Berlin , Germany

1 4 1 1

Data warehouses are used to store large amounts of data. This data is often used for On-Line Analytical Processing (OLAP). Short response times are essential for on-line decision support. Common approaches to reach this goal in read-mostly environments are the precomputation of materialized views and the use of index structures. In this paper, a framework is presented to evaluate different index structures analytically depending on nine parameters for the use in a data warehouse environment. The framework is applied to four different index structures to evaluate which structure works best for range queries. We show that all parameters influence the performance. Additionally, we show why bitmap index structures use modern disks better than traditional tree structures and why bitmaps will supplant the tree based index structures in the future.

Introduction Abstract

d 2 N Number of dimensions. The dimensionality of the data denotes the number of attributes. This parameter is important for the performance of a system. For the task of indexing eight dimensional data a different structure is better suited than for indexing one or two-dimensional data.

There are nine parameters which influence the performance of index structures. We group them into four different categories: data specific, query specific, system specific, and disk specific parameters. First we describe the different groups with their parameters.

Data specific parameters describe the data that has to be indexed. The three important parameters are: defines the used parameters that influence the performance of index structures. Section 3 explains the framework and defines the rules of how the high dimensional data is aggregated to two-dimensional data. In Section 4, four different index structures are modeled, and it is shown how these structures are applied in the framework. Section 5 describes the experiments and results of the application of the framework. The last section summarizes the approach and gives an outlook.

2 Input parameters

= range queries with query box size qs 0. In order to have Query specific parameters hold information about the queries processed by the system. In our approach we concentrate on range queries. Point queries can be expressed as only scalar values as parameters, we assume range queries that can be described with the two scalar values: = 0:04 qs (0 qs 1). A value of qs means that the Query box size. The size of the query box is given proportionate to the size of the data space and is denoted by query box fills 4 percent of the data space. to the PISA model [JL99]. In the following, uniformly distributed data is assumed. The models for the bitmaps are not affected by other distributions of data. R parison with the -tree if the number of indexed tut Number of stored tuples. The number of tuples influences the performance of different structures. In [JL98] it is shown that the Ra-tree improves in comples is increased. c an attribute is the number of different values an at[0; the range of 1). It is assumed that the attribute carc that each attribute can have different real numbers in

Cardinality of data space. The cardinality of the range of tribute may have. It is obvious that for attributes like gender, where there are at most three possible values (male, female, NULL) different index structures are better than for attributes like social security number or telephone number, where the attributes may have millions of different values. In the following, we assume dinality is the same in all dimensions. This assumption makes the model simple and can be relaxed later to make the experiments more realistic. The cardinality of the range of the jth attribute is denoted by cj . 2.1 2.3

System specific parameters

It is assumed that the positions of the query boxes are uniformly distributed over the data space.

System specific parameters are parameters that can be chosen by the database administrator (DBA) of a system. We assume that the DBMS has its own access to the disk system and does not use the I/O functions of the operating system:

Another parameter may be the distribution type of data (e. g. normally distributed or uniformly distributed data). To keep the framework simple, this parameter is not considered here, but the framework can easily be extended to take the actual distribution of data into account. In this case the models for the tree structures can be changed according Scale factor. Due to the fact that the access time, and not the available disk space, is the limiting factor, redundancy of stored data is accepted in order to be more time efficient. This is especially true in the context of data warehouses where materialized views occupy a large portion of disk space. The more data is materialized the more queries can be answered without accessing base data and the faster the systems are. Some random block. In five years, this factor will probably be increased to 36 to 72. One could argue that by this time index structures will only be of limited use, because sequential scans will be faster for most queries than the use of index structures. This is true if the amount of data is kept fixed. But the capacity of disks (and the amount of stored data) is increasing even faster than the bandwidth. Therefore, the time for scanning a whole disk will increase, and it will still be necessary to index the data. However, the index structures in the future have to be better according to the changed parameters than structures used today. The described framework supports the investigation as to which index structure is best suited for a certain application and given parameters. 3

The Framework

3.1

Configuration

In the following section, a framework is described to compare different index structures in respect to all parameters.

Grouping the above defined parameters together, we get a vector of nine scalar parameters sf = 2 structures. E. g. a scale factor of means that the index structures have the same property that they can trade space and time. In this approach we use bitmap indexes that are time optimal under given space constraints. Here, more space is added to the bitmap indexes than space is occupied by the investigated tree bitmaps can occupy twice the space used by the trees. (3) Index structures are usually not stored in main memory, but in the secondary memory and have to be read from secondary memory and transfered into main memory. In contrast to many approaches that compare index structures and only count the number of external I/Os and neglect the fact that blocks can be read sequentially much faster than reading blocks randomly, we take this fact into account and model it. Here the behavior of disks are modeled by two parameters: bw Bandwidth. The bandwidth of a disk is the speed [MB/s] in which the disk can read data and transfer it into main memory. Due to the fact that the disks are getting (physically) smaller and the data is stored more densely on the disks, this speed increases by approximately 40 % per year [BG98], [PK98]. bw From the bandwidth and the blocksize b, the time ts for one sequential access can be calculated as the time for reading and transferring one block of data into main memory bw The fact that the bandwidth is increasing much faster than the latency time tl is decreasing, widens the gap between a sequential and a random access. With todays (1999) disks it is (with a reasonable large block size) ten to twenty times faster to read a sequential block than a e cific vector a configuration. There is one dependency B the blocksize, a set of values is defined as to characterize a certain experimental setup. We call a spebetween the parameters. The number of dimensions of the data space must be greater than or equal to the number of dimensions in that the query box is restricted. In the rest of the paper, we will consider only configurations in which qd d.

For each parameter a set of values is defined. E. g. for f2048; 4096; 8192; 16384g. For the other parameters, sets = are defined similarly and denoted by capitalized letters.

The set of all configurations is defined as 2.4

Latency time. The second parameter is the average time the read/write heads need to get to a certain position and to start reading the desired data. This time is the latency time tl of a disk system. On average, this is the SeekTime+RotationTime/2. This parameter depends mainly on the rotation speed of the disk. The rotation speed does not increase with the same rate as the bandwith bw. It increases only by approximately 8 % per year [BG98], [PK98].

The time tr for a random block access is calculated as the time for bringing the read/write head to a certain position plus the time to read one block of data and transferring it into main memory Qs

Tljqd

(1) (2) 7d R-Tree 7d Ra-tree 7d eenqcuoadlietdy bitmap 7d enracondgeed

bitmap qR

(e2E)^(d=d0)^(b=b0) d of dimensions that have to be stored. In addition, there Here, we calculate the space needed for building up a tree based index structure. Due to the fact that there are many more leaf nodes than inner nodes (inner nodes occupy less than 2 % in our experiments) we consider only leaf nodes.

The number of leaf nodes is the same for structures that use aggregated data and for structures that neglect aggregates in the inner nodes.

The space (in bytes) used by one entry of a leaf node depends on the cardinality of the attributes cj and the number has to be a pointer (TID) to the data itself. b blocksize and on the size of the data entries sdata. The The maximum fanout of data pages depends on the chosen bigger the blocksize, the more data entries can be stored on each block. The exact quantity Mdata is calculated by

2d 3.2 hi(d0; b0)

(4)

Functions like min or max can be used similarly. If the user is very optimistic, she may use the min function. If the user is very pessimistic, the max function should be applied.

Two other aggregation approaches are investigated. In one approach, the data is aggregated by counting the number of cases in which each index structure is the best (fastest) one. Another approach is to calculate the median. 1-4 The results of the two other approaches are just a little bit different from the sum-aggregation method. Therefore, and because of space limitations, the other approaches are not further presented here.

Estimators for the tree based index structures Application of the framework to compare four index structures

To show how the framework works, the framework is applied to compare different index structures. The space used and the time ti used for processing a range query by each index structure is computed.

sdata

Mdata

In the rest of the paper, the GRID Model with uniformly distributed data as described in [JL99] is used. The number (6) (7) (5) ( 8 ) q where the query box is expressed as a tuple ; (q1; q2; qd) as described before. q q = (0:5; The rectangle represents the query box 0:5). [0; r = : 1)2. Therefore, the length of one rectangle is: 81

In Figure 2a), there are 64 data nodes assumed to be uniformly distributed over the two-dimensional data space All gray shaded rectangles in Figure 2a) represent leaf nodes that have to be accessed when an index structure like the STR-tree is used. In Figure 2a), these are 16 blocks.

The pages on the leaf node level are stored in random order on the disk. In the best case, they are stored according to one dimension or according to a space filling curve [IK94].

These savings decrease dramatically when the number of dimensions increases. We assume that the blocks are not ordered. For each page access to disk, one random access is necessary. The time estimation t1 for the tree without aggregated data is the number of necessary page accesses multiplied by the time used for one random access = 1 ^ equality encoded indexes by Bi Bi Bij . In [CI98], x in Table 1. If the value of is set to 3, the corresponding x indexes convert the value of in a number system with base j j x = 2 y + 3 of the first and base 2 of the second digit: z, y z where is represented by B21; B11; B01 and is represented x a column in a relational DBMS with 6 different values First we present the general idea of bitmap indexing and show how space and time can be traded. Assume there is from 0 to 5. The traditional bitmaps generate one bitmap vector (B0 B5) for each of the 6 different values as shown bit in vector B3 is set to 1. Otherwise the bit is set to 0.

This index structure has the great disadvantage, that one bitmap vector is necessary for each value of x. Therefore, this structure can only be used for attributes that have only few different extensions.

The next two index structures that are used for the application of the framework are equality and range encoded bitmap indexes. In this example equality encoded bitmap by B10 and B00 (cf. Table 1). Range encoded bitmaps are optimized for range queries. They can be calculated from the the authors show how time and space are traded in detail.

They focus on four different points of this trade: time optimal, space optimal, “knee”, and time optimal under given space constraint. Here we concentrate on time optimal bitmap indexes under a given space constraint. To compare the bitmap structures with tree structures, we assume that

(count, sum) The time estimation for the tree with use of aggregated data is the product of the number of accessed pages and the time for each random block access The two time estimators t1 and t2 are used in the framework to estimate the time for a given configuration to process a query if uniformly distributed data is assumed. 4.2

Estimators for bitmap indexing techniques

query box + 1; 1 qj r 0 1 0 1 a) Without use of aggregated b) With use of aggregated data data of pages that have to be accessed on leaf node level can be calculated as The idea of aggregated data in the inner nodes of an index structure is described in detail in [JL98]. The inner nodes store in addition to the reference to its successors some aggregated data about their successors (count and sum in this example, cf. Figure 3). If aggregated data is used, there is no access necessary to rectangles completely contained in the query box. This number is calculated by All pages that contain data that is completely contained in the query box do not have to be accessed. In the example in Figure 2b), access to four blocks is saved. The number (10) (11) value traditional bitmap | 1 B2 0 0 0 1 1 0 . .

. sf ture needs to store the data multiplied by scale factor . the space constraint depends on the space the tree strucFirst the size of each bitmap vector (e. g. B1) is calculated.

The number of blocks needed to store one bitmap vector is given by {z rji {z nji rji M by tree structures. Therefore, will be set dependent on M Let denote the number of bitmap vectors that can be stored by the system. In this model it is assumed, that the space for bitmap vectors is proportional to the space needed the blocks allocated by the tree structure and a scale factor sf, which is one of the input parameters of our model. j 2 f1; ; has to be performed for each dg. An additional The Mj’s can be used to calculate the base of the encoded bitmap indexes in each dimension. Note that in our examples the cj will all have the same value, but this can not be assumed in general. For equality and range encoded bitmap indexing techniques we get different structures. Therefore, we have to distinguish between the two bitmap indexing techniques. Having the Mj and cj at hand, the bases of the index can be calculated as shown in Figure 4.

For an equality encoded bitmap index, the algorithm to calculate the base is sketched in Figure 4. This algorithm optimization step (not shown here), improves the performance of the bitmap index structures [CI98]. The base in (15) (16) In the rest of the paper, the base is denoted as 1

bji 2 bji + bi v 1 vector needs a random block access, the remaining ac

The details can be found in [CI98]. From the number of bitmap vectors that have to be scanned and the size of the bitmap vectors in blocks the time estimate for the equality encoded bitmap index is calculated. The first block of a cesses can be read sequentially if the blocks of each bitmap vector are stored sequentially 1)ts)Ee (20) If range encoded bitmap indexing technique is used, the number of bitmap vectors that have to be scanned is given by (nj rj )(bj bj rj bj The time to execute one range query with range encoded bitmap vectors is given by The estimators t3 and t4 are used to estimate the time the different bitmap indexing techniques need to access the data. Together with the functions t1 and t2 this set of functions provides the base data for the framework. 5

Experiments and Results

bw ments the bandwidth and the latency time tl are set

In this section results of experiments where the framework has been applied are presented. In the following experito fixed values. The remaining seven parameters are varied and all possible combinations of the sets in Table 2 are considered (under the constraint qd d). This yields in 475.200 different combinations for each index structure.

For each of these combinations the four functions ti are evaluated. Then the framework is applied as described in 2 n(n 1) = 21 Chapter 4. There are different possibilities how to aggregate the seven dimensional space into a twodimensional space. Due to space limitations only four twodimensional results are presented here for todays disk systems and two results for disk system as expected in five years are shown. sf mensions. A large scale factor favors the range encoded b advantages when the blocksize is increased because bigb In Figure 6e) the blocksize and the number of dimend the number of dimensions is varied. As in the Figure 6e), d sions are compared. As expected, the trees have some sf dom block accesses. In Figure 6f) the scale factor and ger blocks yield in a higher fanout and therefore in less ranthe bitmaps become faster with an increasing number of dibitmap index in comparison to the equality encoded bitmap index. Generally speaking, the range encoded index seems to be better than the equality encoded index. In [CI98] the authors came to the same results with different experiments. This shows that our results are really meaningful.

However, our method is more general and considers more parameters. bw if we assume that the bandwidth is increasing by 40 % In the field of new computer technology it is very difficult and risky to make any predictions for the future. But each year and the latency time tl is decreasing by only 8 % per year (like they did during the last years), the presented models can be used to predict the performance of index structures in the future. Here we give some results for disk systems expected to be available in five years and compare them with results of todays disk systems.

With the given bases for the equality and range encoded bitmap indexes it is possible to estimate the time needed by both structures. The number of bitmaps that have to be scanned for a specific configuration according to [CI98] is Dimensions

Dimensions

g)Legend indexstructure symbolinfigure STR-treewithoutaggregateddata

STR-treewithaggregateddata equalityencodedbitmapindex rangeencodedbitmapindex M. Ju¨rgens,H.J.Lenz

Acknowledgments

Part of this research was supported by the German Research Society, Berlin-Brandenburg Graduate School in Distributed Information Systems (DFG grant no. GRK 316) [BG98]

Dina Bitton and Jim Gray. The rebirth of database machines. invited talk at VLDB, 1998. 6

Summary & Outlook

[Gut84] [IK94] [Inf97] [JL98] In the field of data warehouses fast access to large amounts of data is crucial. Many index structures support fast access to OLTP data, but perform poorly in read-mostly environments when aggregated data over large sets of data has to be calculated. We have presented a framework to compare different index structures for the use in a data warehouse environment. The framework has been applied to compare four different index structures depending on nine parameters. We have shown that all parameters influence the results and therefore should be taken into account when comparing index structures. In addition, we have shown that due to changes in disk technology, bitmap indexing techniques will probably outperform the traditional tree based index structures in the future.

R tree: An improved -tree with materialMarcus Ju¨ rgens and Hans-J. Lenz. The Raized data for supporting range queries on OLAP-data. In Proceedings of the International Workshop on Data Warehouse Design and OLAP Technology (DWDOT 98), Vienna, pages 186–191. IEEE Computer Society Press, 1998.

Yva´n J. Garc´ia, Mario A. L o´pez, and J. Edgington. STR: A simple and efficient algorithm for R-tree packing. In Proceedings of the 13th International Conference on Data Engineering (ICDE), pages 497–506. IEEE Computer Society Press, 1997.

Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, SIGMOD’84, Proceedings of Annual Meeting, Boston, Massachusetts, pages 47–57, 1984. [GHRU97] Himanshu Gupta, Venky Harinaryan, Anand

Rajaraman, and Jeffery D. Ullman. Index selection for OLAP. In Proceedings of the International Conference on Data Engineering (ICDE), pages 208–219, 1997.

[GLE97] [BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf

Schneider, and Bernhard Seeger. The R*tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 322–331. ACM

Press, New York, N. Y., 1990.

Chee-Young Chan and Yannis E. Ioannidis.

Bitmap index design and evaluation. In Proceedings of the International Conference on

Management of Data, 1998. 1-9 Christos Faloutsos Ibrahim Kamell. Hilbert Rtree: An improved R-tree using fractals. In Proceedings of the 20st Conference on Very Large Data Bases (VLDB), pages 500–509, 1994.

Informix. Indexing for the enterprise data warehouse. white paper, 1997. Available at http://www.informix.com. [JL99]

Marcus Ju¨ rgens and Hans-J. Lenz. PISA: Performance models of index structures with and without aggregated data. In Proceedings of [OQ97] [PK98] [RL85] [Syb97]

the 8th International Conference on Statistical and Scientific Database Management (SSDBM) , 1999 .

Patrick O'Neil and Dallan

Quass . Improved query performance with variant indexes . SIGMOD Record (ACM Special Interest Group on Management of Data) , 26 ( 2 ): 38 - 49 , 1997 .

Hardware technology trends and database opportunities . invited talk at SIGMOD , 1998 .

Nick

Roussopoulos and

Daniel

Leifker . Direct spatial search on pictorial databases using packed R-trees . In ACM SIGMOD (International Conference on Management of Data) , pages 17 - 31 , Austin, Texas, May 1985 .

Sybase . Adaptive server IQ . white paper , 1997 . availabe at http://www.sybase.