<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Mj nj nj
= c +</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Tree Based Indexes vs. Bitmap Indexes: A Performance Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcus J u¨rgens</string-name>
          <email>juergens@inf.fu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hans-J. Lenz</string-name>
          <email>hjlenz@wiwiss.fu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Statistics and Econometrics, Freie Universita ̈t Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Statistics and Econometrics, Institute of Computer Science, Freie Universita ̈t Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <volume>1</volume>
      <issue>4</issue>
      <fpage>1</fpage>
      <lpage>1</lpage>
      <abstract>
        <p>Data warehouses are used to store large amounts of data. This data is often used for On-Line Analytical Processing (OLAP). Short response times are essential for on-line decision support. Common approaches to reach this goal in read-mostly environments are the precomputation of materialized views and the use of index structures. In this paper, a framework is presented to evaluate different index structures analytically depending on nine parameters for the use in a data warehouse environment. The framework is applied to four different index structures to evaluate which structure works best for range queries. We show that all parameters influence the performance. Additionally, we show why bitmap index structures use modern disks better than traditional tree structures and why bitmaps will supplant the tree based index structures in the future.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction Abstract</title>
      <p>d 2 N Number of dimensions. The dimensionality of the
data denotes the number of attributes. This
parameter is important for the performance of a system. For
the task of indexing eight dimensional data a
different structure is better suited than for indexing one or
two-dimensional data.</p>
      <p>There are nine parameters which influence the performance
of index structures. We group them into four different
categories: data specific, query specific, system specific, and
disk specific parameters. First we describe the different
groups with their parameters.</p>
      <p>Data specific parameters describe the data that has to be
indexed. The three important parameters are:
defines the used parameters that influence the performance
of index structures. Section 3 explains the framework and
defines the rules of how the high dimensional data is
aggregated to two-dimensional data. In Section 4, four
different index structures are modeled, and it is shown how
these structures are applied in the framework. Section 5
describes the experiments and results of the application of
the framework. The last section summarizes the approach
and gives an outlook.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Input parameters</title>
      <p>= range queries with query box size qs 0. In order to have
Query specific parameters hold information about the
queries processed by the system. In our approach we
concentrate on range queries. Point queries can be expressed as
only scalar values as parameters, we assume range queries
that can be described with the two scalar values:
= 0:04 qs (0 qs 1). A value of qs means that the
Query box size. The size of the query box is given
proportionate to the size of the data space and is denoted by
query box fills 4 percent of the data space.
to the PISA model [JL99]. In the following, uniformly
distributed data is assumed. The models for the bitmaps are
not affected by other distributions of data.
R parison with the -tree if the number of indexed
tut Number of stored tuples. The number of tuples
influences the performance of different structures. In
[JL98] it is shown that the Ra-tree improves in
comples is increased.
c an attribute is the number of different values an
at[0; the range of 1). It is assumed that the attribute
carc that each attribute can have different real numbers in</p>
      <p>Cardinality of data space. The cardinality of the range of
tribute may have. It is obvious that for attributes like
gender, where there are at most three possible values
(male, female, NULL) different index structures are
better than for attributes like social security number or
telephone number, where the attributes may have
millions of different values. In the following, we assume
dinality is the same in all dimensions. This
assumption makes the model simple and can be relaxed later
to make the experiments more realistic. The
cardinality of the range of the jth attribute is denoted by cj .
2.1
2.3</p>
      <sec id="sec-2-1">
        <title>System specific parameters</title>
        <p>It is assumed that the positions of the query boxes are
uniformly distributed over the data space.</p>
        <p>System specific parameters are parameters that can be
chosen by the database administrator (DBA) of a system. We
assume that the DBMS has its own access to the disk
system and does not use the I/O functions of the operating
system:</p>
        <p>Another parameter may be the distribution type of data
(e. g. normally distributed or uniformly distributed data).
To keep the framework simple, this parameter is not
considered here, but the framework can easily be extended to
take the actual distribution of data into account. In this case
the models for the tree structures can be changed according
Scale factor. Due to the fact that the access time, and not
the available disk space, is the limiting factor,
redundancy of stored data is accepted in order to be more
time efficient. This is especially true in the context of
data warehouses where materialized views occupy a
large portion of disk space. The more data is
materialized the more queries can be answered without
accessing base data and the faster the systems are. Some
random block. In five years, this factor will probably be
increased to 36 to 72. One could argue that by this time
index structures will only be of limited use, because
sequential scans will be faster for most queries than the use
of index structures. This is true if the amount of data is kept
fixed. But the capacity of disks (and the amount of stored
data) is increasing even faster than the bandwidth.
Therefore, the time for scanning a whole disk will increase, and
it will still be necessary to index the data. However, the
index structures in the future have to be better according
to the changed parameters than structures used today. The
described framework supports the investigation as to which
index structure is best suited for a certain application and
given parameters.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Framework</title>
      <p>3.1</p>
      <sec id="sec-3-1">
        <title>Configuration</title>
        <p>In the following section, a framework is described to
compare different index structures in respect to all parameters.</p>
        <p>Grouping the above defined parameters together, we get a
vector of nine scalar parameters
sf = 2 structures. E. g. a scale factor of means that the
index structures have the same property that they can
trade space and time. In this approach we use bitmap
indexes that are time optimal under given space
constraints. Here, more space is added to the bitmap
indexes than space is occupied by the investigated tree
bitmaps can occupy twice the space used by the trees.
(3)
Index structures are usually not stored in main memory,
but in the secondary memory and have to be read from
secondary memory and transfered into main memory. In
contrast to many approaches that compare index structures
and only count the number of external I/Os and neglect the
fact that blocks can be read sequentially much faster than
reading blocks randomly, we take this fact into account and
model it. Here the behavior of disks are modeled by two
parameters:
bw Bandwidth. The bandwidth of a disk is the speed
[MB/s] in which the disk can read data and transfer
it into main memory. Due to the fact that the disks
are getting (physically) smaller and the data is stored
more densely on the disks, this speed increases by
approximately 40 % per year [BG98], [PK98].
bw From the bandwidth and the blocksize b, the time
ts for one sequential access can be calculated as the time
for reading and transferring one block of data into main
memory
bw The fact that the bandwidth is increasing much faster
than the latency time tl is decreasing, widens the gap
between a sequential and a random access. With todays
(1999) disks it is (with a reasonable large block size) ten
to twenty times faster to read a sequential block than a
e cific vector a configuration. There is one dependency
B the blocksize, a set of values is defined as
to characterize a certain experimental setup. We call a
spebetween the parameters. The number of dimensions of the
data space must be greater than or equal to the number of
dimensions in that the query box is restricted. In the rest
of the paper, we will consider only configurations in which
qd d.</p>
        <p>For each parameter a set of values is defined. E. g. for
f2048; 4096; 8192; 16384g. For the other parameters, sets
=
are defined similarly and denoted by capitalized letters.</p>
        <p>The set of all configurations is defined as
2.4</p>
        <p>Latency time. The second parameter is the average time
the read/write heads need to get to a certain position
and to start reading the desired data. This time is the
latency time tl of a disk system. On average, this is the
SeekTime+RotationTime/2. This parameter depends
mainly on the rotation speed of the disk. The
rotation speed does not increase with the same rate as the
bandwith bw. It increases only by approximately 8 %
per year [BG98], [PK98].</p>
        <p>The time tr for a random block access is calculated as the
time for bringing the read/write head to a certain position
plus the time to read one block of data and transferring it
into main memory
Qs</p>
        <p>Qd</p>
        <sec id="sec-3-1-1">
          <title>Tljqd</title>
          <p>(1)
(2)
7d R-Tree
7d Ra-tree
7d eenqcuoadlietdy
bitmap
7d enracondgeed</p>
          <p>bitmap
qR</p>
          <p>(e2E)^(d=d0)^(b=b0)
d of dimensions that have to be stored. In addition, there
Here, we calculate the space needed for building up a tree
based index structure. Due to the fact that there are many
more leaf nodes than inner nodes (inner nodes occupy less
than 2 % in our experiments) we consider only leaf nodes.</p>
          <p>The number of leaf nodes is the same for structures that use
aggregated data and for structures that neglect aggregates in
the inner nodes.</p>
          <p>The space (in bytes) used by one entry of a leaf node
depends on the cardinality of the attributes cj and the number
has to be a pointer (TID) to the data itself.
b blocksize and on the size of the data entries sdata. The
The maximum fanout of data pages depends on the chosen
bigger the blocksize, the more data entries can be stored on
each block. The exact quantity Mdata is calculated by</p>
          <p>2d
3.2
hi(d0; b0)</p>
          <p>(4)</p>
          <p>Functions like min or max can be used similarly. If the
user is very optimistic, she may use the min function. If
the user is very pessimistic, the max function should be
applied.</p>
          <p>Two other aggregation approaches are investigated. In
one approach, the data is aggregated by counting the
number of cases in which each index structure is the best
(fastest) one. Another approach is to calculate the median.
1-4
The results of the two other approaches are just a little bit
different from the sum-aggregation method. Therefore, and
because of space limitations, the other approaches are not
further presented here.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Estimators for the tree based index structures</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Application of the framework to compare four index structures</title>
      <p>To show how the framework works, the framework is
applied to compare different index structures. The space used
and the time ti used for processing a range query by each
index structure is computed.</p>
      <p>sdata</p>
      <sec id="sec-4-1">
        <title>Mdata</title>
        <p>
          In the rest of the paper, the GRID Model with uniformly
distributed data as described in [JL99] is used. The number
(6)
(7)
(5)
(
          <xref ref-type="bibr" rid="ref1">8</xref>
          )
q where the query box is expressed as a tuple
; (q1; q2; qd) as described before.
q q = (0:5; The rectangle represents the query box 0:5).
[0; r = : 1)2. Therefore, the length of one rectangle is: 81
        </p>
        <p>In Figure 2a), there are 64 data nodes assumed to be
uniformly distributed over the two-dimensional data space
All gray shaded rectangles in Figure 2a) represent leaf
nodes that have to be accessed when an index structure like
the STR-tree is used. In Figure 2a), these are 16 blocks.</p>
        <p>The pages on the leaf node level are stored in random order
on the disk. In the best case, they are stored according to
one dimension or according to a space filling curve [IK94].</p>
        <p>These savings decrease dramatically when the number of
dimensions increases. We assume that the blocks are not
ordered. For each page access to disk, one random access
is necessary. The time estimation t1 for the tree without
aggregated data is the number of necessary page accesses
multiplied by the time used for one random access
= 1 ^ equality encoded indexes by Bi Bi Bij . In [CI98],
x in Table 1. If the value of is set to 3, the corresponding
x indexes convert the value of in a number system with base
j j
x = 2 y + 3 of the first and base 2 of the second digit: z,
y z where is represented by B21; B11; B01 and is represented
x a column in a relational DBMS with 6 different values
First we present the general idea of bitmap indexing and
show how space and time can be traded. Assume there is
from 0 to 5. The traditional bitmaps generate one bitmap
vector (B0 B5) for each of the 6 different values as shown
bit in vector B3 is set to 1. Otherwise the bit is set to 0.</p>
        <p>This index structure has the great disadvantage, that one
bitmap vector is necessary for each value of x. Therefore,
this structure can only be used for attributes that have only
few different extensions.</p>
        <p>The next two index structures that are used for the
application of the framework are equality and range encoded
bitmap indexes. In this example equality encoded bitmap
by B10 and B00 (cf. Table 1). Range encoded bitmaps are
optimized for range queries. They can be calculated from the
the authors show how time and space are traded in detail.</p>
        <p>They focus on four different points of this trade: time
optimal, space optimal, “knee”, and time optimal under given
space constraint. Here we concentrate on time optimal
bitmap indexes under a given space constraint. To compare
the bitmap structures with tree structures, we assume that</p>
        <p>(count, sum)
The time estimation for the tree with use of aggregated data
is the product of the number of accessed pages and the time
for each random block access
The two time estimators t1 and t2 are used in the
framework to estimate the time for a given configuration to
process a query if uniformly distributed data is assumed.
4.2</p>
        <sec id="sec-4-1-1">
          <title>Estimators for bitmap indexing techniques</title>
          <p>query box
+ 1; 1 qj
r
0 1 0 1
a) Without use of aggregated b) With use of aggregated
data data
of pages that have to be accessed on leaf node level can be
calculated as
The idea of aggregated data in the inner nodes of an index
structure is described in detail in [JL98]. The inner nodes
store in addition to the reference to its successors some
aggregated data about their successors (count and sum in this
example, cf. Figure 3). If aggregated data is used, there is
no access necessary to rectangles completely contained in
the query box. This number is calculated by
All pages that contain data that is completely contained in
the query box do not have to be accessed. In the example
in Figure 2b), access to four blocks is saved. The number
(10)
(11)
value
traditional bitmap
|
1
B2
0
0
0
1
1
0
.
.</p>
          <p>.
sf ture needs to store the data multiplied by scale factor .
the space constraint depends on the space the tree
strucFirst the size of each bitmap vector (e. g. B1) is calculated.</p>
          <p>The number of blocks needed to store one bitmap vector is
given by
{z
rji
{z
nji rji
M by tree structures. Therefore, will be set dependent on
M Let denote the number of bitmap vectors that can be
stored by the system. In this model it is assumed, that the
space for bitmap vectors is proportional to the space needed
the blocks allocated by the tree structure and a scale factor
sf, which is one of the input parameters of our model.
j 2 f1; ; has to be performed for each dg. An additional
The Mj’s can be used to calculate the base of the encoded
bitmap indexes in each dimension. Note that in our
examples the cj will all have the same value, but this can not be
assumed in general. For equality and range encoded bitmap
indexing techniques we get different structures. Therefore,
we have to distinguish between the two bitmap indexing
techniques. Having the Mj and cj at hand, the bases of the
index can be calculated as shown in Figure 4.</p>
          <p>For an equality encoded bitmap index, the algorithm to
calculate the base is sketched in Figure 4. This algorithm
optimization step (not shown here), improves the
performance of the bitmap index structures [CI98]. The base in
(15)
(16)
In the rest of the paper, the base is denoted as
1</p>
          <p>bji
2
bji
+ bi
v 1 vector needs a random block access, the remaining
ac</p>
          <p>The details can be found in [CI98]. From the number of
bitmap vectors that have to be scanned and the size of the
bitmap vectors in blocks the time estimate for the
equality encoded bitmap index is calculated. The first block of a
cesses can be read sequentially if the blocks of each bitmap
vector are stored sequentially
1)ts)Ee
(20)
If range encoded bitmap indexing technique is used, the
number of bitmap vectors that have to be scanned is given
by
(nj
rj )(bj
bj
rj bj
The time to execute one range query with range encoded
bitmap vectors is given by
The estimators t3 and t4 are used to estimate the time the
different bitmap indexing techniques need to access the
data. Together with the functions t1 and t2 this set of
functions provides the base data for the framework.
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <p>bw ments the bandwidth and the latency time tl are set</p>
      <p>In this section results of experiments where the framework
has been applied are presented. In the following
experito fixed values. The remaining seven parameters are
varied and all possible combinations of the sets in Table 2 are
considered (under the constraint qd d). This yields in
475.200 different combinations for each index structure.</p>
      <p>For each of these combinations the four functions ti are
evaluated. Then the framework is applied as described in
2
n(n 1) = 21 Chapter 4. There are different possibilities
how to aggregate the seven dimensional space into a
twodimensional space. Due to space limitations only four
twodimensional results are presented here for todays disk
systems and two results for disk system as expected in five
years are shown.
sf mensions. A large scale factor favors the range encoded
b advantages when the blocksize is increased because
bigb In Figure 6e) the blocksize and the number of
dimend the number of dimensions is varied. As in the Figure 6e),
d sions are compared. As expected, the trees have some
sf dom block accesses. In Figure 6f) the scale factor and
ger blocks yield in a higher fanout and therefore in less
ranthe bitmaps become faster with an increasing number of
dibitmap index in comparison to the equality encoded bitmap
index. Generally speaking, the range encoded index seems
to be better than the equality encoded index. In [CI98]
the authors came to the same results with different
experiments. This shows that our results are really meaningful.</p>
      <p>However, our method is more general and considers more
parameters.
bw if we assume that the bandwidth is increasing by 40 %
In the field of new computer technology it is very
difficult and risky to make any predictions for the future. But
each year and the latency time tl is decreasing by only 8 %
per year (like they did during the last years), the presented
models can be used to predict the performance of index
structures in the future. Here we give some results for disk
systems expected to be available in five years and compare
them with results of todays disk systems.</p>
      <p>With the given bases for the equality and range encoded
bitmap indexes it is possible to estimate the time needed
by both structures. The number of bitmaps that have to be
scanned for a specific configuration according to [CI98] is
Dimensions</p>
      <p>Dimensions</p>
      <p>g)Legend
indexstructure symbolinfigure
STR-treewithoutaggregateddata</p>
      <p>STR-treewithaggregateddata
equalityencodedbitmapindex
rangeencodedbitmapindex
M. Ju¨rgens,H.J.Lenz</p>
      <sec id="sec-5-1">
        <title>Acknowledgments</title>
        <p>Part of this research was supported by the German
Research Society, Berlin-Brandenburg Graduate School
in Distributed Information Systems (DFG grant no.
GRK 316)
[BG98]</p>
        <p>Dina Bitton and Jim Gray. The rebirth of
database machines. invited talk at VLDB,
1998.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Summary &amp; Outlook</title>
      <p>[Gut84]
[IK94]
[Inf97]
[JL98]
In the field of data warehouses fast access to large amounts
of data is crucial. Many index structures support fast access
to OLTP data, but perform poorly in read-mostly
environments when aggregated data over large sets of data has to
be calculated. We have presented a framework to compare
different index structures for the use in a data warehouse
environment. The framework has been applied to compare
four different index structures depending on nine
parameters. We have shown that all parameters influence the
results and therefore should be taken into account when
comparing index structures. In addition, we have shown that
due to changes in disk technology, bitmap indexing
techniques will probably outperform the traditional tree based
index structures in the future.</p>
      <p>R tree: An improved -tree with
materialMarcus Ju¨ rgens and Hans-J. Lenz. The
Raized data for supporting range queries on
OLAP-data. In Proceedings of the
International Workshop on Data Warehouse Design
and OLAP Technology (DWDOT 98), Vienna,
pages 186–191. IEEE Computer Society Press,
1998.</p>
      <p>Yva´n J. Garc´ia, Mario A. L o´pez, and J.
Edgington. STR: A simple and efficient algorithm
for R-tree packing. In Proceedings of the 13th
International Conference on Data
Engineering (ICDE), pages 497–506. IEEE Computer
Society Press, 1997.</p>
      <p>Antonin Guttman. R-trees: A dynamic index
structure for spatial searching. In Beatrice
Yormark, editor, SIGMOD’84, Proceedings of
Annual Meeting, Boston, Massachusetts, pages
47–57, 1984.
[GHRU97] Himanshu Gupta, Venky Harinaryan, Anand</p>
      <p>Rajaraman, and Jeffery D. Ullman. Index
selection for OLAP. In Proceedings of the
International Conference on Data Engineering
(ICDE), pages 208–219, 1997.</p>
      <p>[GLE97]
[BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf</p>
      <p>Schneider, and Bernhard Seeger. The
R*tree: An efficient and robust access method for
points and rectangles. In Proceedings of the
ACM SIGMOD International Conference on
Management of Data, pages 322–331. ACM</p>
      <p>Press, New York, N. Y., 1990.</p>
      <p>Chee-Young Chan and Yannis E. Ioannidis.</p>
      <p>Bitmap index design and evaluation. In
Proceedings of the International Conference on</p>
      <p>Management of Data, 1998.
1-9
Christos Faloutsos Ibrahim Kamell. Hilbert
Rtree: An improved R-tree using fractals. In
Proceedings of the 20st Conference on Very
Large Data Bases (VLDB), pages 500–509,
1994.</p>
      <p>Informix. Indexing for the enterprise data
warehouse. white paper, 1997. Available at
http://www.informix.com.
[JL99]</p>
      <p>Marcus Ju¨ rgens and Hans-J. Lenz. PISA:
Performance models of index structures with and
without aggregated data. In Proceedings of
[OQ97]
[PK98]
[RL85]
[Syb97]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>the 8th International Conference on Statistical and Scientific Database Management (SSDBM)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Patrick O'Neil</surname>
            and
            <given-names>Dallan</given-names>
          </string-name>
          <string-name>
            <surname>Quass</surname>
          </string-name>
          .
          <article-title>Improved query performance with variant indexes</article-title>
          .
          <source>SIGMOD Record (ACM Special Interest Group on Management of Data)</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <fpage>38</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>Hardware technology trends and database opportunities</article-title>
          .
          <source>invited talk at SIGMOD</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Nick</given-names>
            <surname>Roussopoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Leifker</surname>
          </string-name>
          .
          <article-title>Direct spatial search on pictorial databases using packed R-trees</article-title>
          .
          <source>In ACM SIGMOD (International Conference on Management of Data)</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>31</lpage>
          , Austin, Texas, May
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Sybase</surname>
          </string-name>
          .
          <article-title>Adaptive server IQ</article-title>
          .
          <source>white paper</source>
          ,
          <year>1997</year>
          . availabe at http://www.sybase.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>