<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Table 5: JDI After Successive Dierences JC</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1999</year>
      </pub-date>
      <volume>19</volume>
      <issue>3</issue>
      <fpage>11</fpage>
      <lpage>12</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>2.2 Projection Index
2.1 Bitmap Indexes
2 Variant Indexes
K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-2
of dimensions. In such cases queries must be evaluated
large amount of data and most of the decompressed
In Section 2 we describe several indexing approaches
a data warehousing environment.</p>
      <p>So far, indexes have generally been considered
septhis paper we concentrate on this approach.</p>
      <p>We believe that it should in fact be more ecien t to
odically, and during this time no queries are allowed.</p>
      <p>Data warehouses are typically updated only
pericesses [RH95]. Although decompression adds to query
processing cost, the [RH95] paper showed that in
dates is no longer an issue, it is possible to use more
using indexes to ecien tly access base data.
than compensates for the increased processing time.
reducing storage costs, it also could reduce query
is the cost of storage due to index structures.
DataIndexes. Compression has several benets. Apart from
databases the reduced number of disk accesses more
data warehouses, storage costs are very high and so
various compression techniques. In Section 4 we
disThis allows the batch update process to reorganize the
infeasible, as this grows exponentially with the number
maintaining indexes in the presence of concurrent
upprocessing time by reducing the number of disk
achave been possible otherwise. Since the problems of
indexes to an optimal form, in a way that would not
specialized indexes to speed up query evaluation. In
compress a data warehouse as most queries access a
The remainder of this paper is organized as follows.
dexes provide indexing at no extra cost apart from
of indexes and in Section 5 we conclude the paper.
that of storing the data, thus saving a lot of space. We
less. Also, precomputing all necessary tables is often
data will actually be required to evaluate the query.
cuss how compression may be applied to certain types
propose compressing the data in the form of
DataInIn this paper we explore how compression performs in
used in data warehousing and in Section 3 we review
arate from the base data. Given the large size of
half the total number of rows. A Row ID
representamore CPU ecien t than RowIDs. The common
option will occupy 4 n bytes in all while a bitmap
reprechine words. Thus, Bitmap operations are a lot faster.
case the entries for both key values will have about
ing only two possible values Male or Female. In this
ticular table and a one at ordinal position i indicates
mous disk savings especially if n is large. There is disk
A further optimization described in [PQ97] is to
diis the number of distinct attributes. They represent it
be accessed for range selections.
may have several references. In such a case, it is more
to a RID list.
simplest of compression methods is to convert it back
particular, base 2 representation gives bitslice indexes.
think of each entry being represented in base n where n
[CY98] draws a parallel between bitmap indexing
vide the table into segments of a xed n umber of rows
[WB98] describes methods of ecien tly coding Bitslice
that the row is present in the list. Thus, we have ith
than 1/No of bits in a RowID. Also Bitmaps are a lot
or an RID list. This not only saves space, but also
ecien t to represent the list as a bitmap. Consider a
such an index is called a Bitmap Index. The following
in dieren t bases and discuss space time tradeos. In
take several instructions, while most processors have
total number of rows in the table. This results in
enorsaving as long as average density of a bitmap is greater
bitmap of length equal to the number of rows in a
parsingle instructions to directly AND, OR and NOT
maindexes so that the least number of bitmaps have to
fragment and each fragment can be stored as a bitmap
saves disk I/O in combining bitmaps. This is because
When bitmaps are sparse one can compress them. The
and number systems. The idea is that when you orient
some fragments may not have any set bits at all and
erations of taking union and intersecting RowID lists
data.
sentation will occupy only 2 n=8 bytes where n is the
hence while ANDing, the other bitmap may not be
rebitmaps vertically and place them side by side you can
each. The bitmaps corresponding to each segment is a
example illustrates why these are more ecien t in an
just a new way of representing the list of RowIDs and
OLAP environment. Consider a column Gender
havtrieved at all. This is particularly useful for clustered
of rows in the table. However, in OLAP each entry
four columns is stored as two BDIs, one of one column
single BDI if it is known that most queries will access
ordinal mapping between them.</p>
      <p>BDI. The BDI generalizes projection indexes by
allowing any number of columns to be in a single BDI.
columns. Clearly, Projection indexing is free in terms
easy to form the original table if we have projection
and the other of 3 columns. The dotted lines show the
suited for OLAP queries. Also notice that it is very
It would be useful to store two or more columns in a
of storage. A graphical representation of the BDI is
As we saw earlier projection indexes are particularly
shown in Figure 1. In this gure the original table of
the table where each partition may have one or more
them together. Thus, the BDI is a vertical partition of
jection Indexes on all columns, we need not store the
original tables at all. Hence, we get Projection Indexes
indexes on each of the columns. Thus, if we force
Profree of storage cost. This is the main idea behind the
imum size for the column or use a B-Tree to access
read may bring in more than one value in the
Foundcorresponding to a very dense FoundSet (A bitmap
and retrieve only a few of the columns. Thus, such an
ber provided the column is xed length. However, if
Set. OLAP queries have very low selectivity queries
is because more values t onto a page, and a single
index is extremely useful.
the column is variable length, we can either x a
maxvalues using the row number. The Projection index
is very useful when one has to retrieve column values
having a set bit for every row to be retrieved). This
tial integrity is strictly enforced on all star schemas.</p>
      <p>Indexes.
data in a warehouse in which the Indexing is
essenDataIndexes [KA98] are a new way to represent the
An assumption made in DataIndexes is that
referendexes and algorithms to evaluate queries using these
In this section, we describe the two types of
DataIntially free in terms of storage. The rest of the paper
assumes we have the data in the form of DataIndexes.</p>
      <p>BDIs. If the foreign key column is also stored as BDIs,
column value any more. In fact, we may take lesser
in terms of storage as one does not need to store the
sion Table. The gure uses dotted lines to show the
we must join it every time with the corresponding
colTimekey as foreign key referencing the Time
Dimenumn of the dimension table. The BDIs rely on ordinal
mapping. Thus, it is a good idea to store the
ordiConsider storing all dimensional and fact tables as
the maximum ordinal number. Such a JDI is shown
column. This is called a JDI. In a way this
precomstorage if the width of the key column is larger than
nal number of the row it references in the foreign key
putes the join. Thus it is a join index. This too is free
in Figure 2. The sales table is the fact table having
stored for each key value which is quite space ecien t.
formed in a single pass of the fact table. Also groups
are not stored in the index. Thus, the Groupset
Index is ecien t for aggregation. Also, only an integer is
not having any representative rows in the fact table
The Star Join Index concatenates column values
index that uses RIDs of a dimensional table instead
every dimension table.
table. Thus, to compute a query one simply ORs all
nation of the columns of a table. Bitmapped Join
Inall the resultant bitmaps for each of the Dimension
tafrom dieren t dimensional tables and lists RIDs in the
bitmaps for selections on each table and then ANDs
through a common join. Most Join Indexes represent
of search key values to index the records of the Fact
bles. It is clear that only one index is necessary for
with this index is that they are not ecien tly
recombidexes solve this problem. It consists of using a bitmap
fact table for each concatenated value. The problem
A join index by denition is an index on one table for a
nant. Thus, an index has to be made for every
combithe fully precomputed join in dieren t ways.
quantity that involves a column value of another table
the rows in the dimension tables which join with the
particular fact table row. A B-Tree Index is then built
want. The fact table rows are then clustered so that
values in the dimension tables. In other words, the
This index improves the eciency of grouping queries
appropriate depending on the queries a user would
fact table is sorted based on the ordinal position of
clustered together on disk. Further, the successive
ing to a primary key or a hierarchy which ever is more
sponding group in the fact table is stored. The end of
using the concatenation of the dimension key values
First, the dimensional tables must be ordered
accorda group is simply one before the beginning of the next
and using the same ordering among them as in the
groups are in the same order as a nested loop on the
value the ordinal number of the starting of the
correrows with the same foreign keys on all dimensions are
by creating a special index and clustering the data.
groups of the fact tables. For each concatenated key
group. It is easy to see that a group by can be
persteps 9 to 12. Clearly, this algorithm is very ecien t
of the relevant BDIs of the fact table. The algorithm
load only the blocks that have some row in the
correfact table BDIs having columns in : This is done by CF
steps 6 to 8. If this check succeeds for every JDI then
mensional BDIs and reading the corresponding row of
predicate in the query on any of these columns one may
to the rowset we scan all JDIs on a table belonging RF
the output record is built by using the in memory
disponding rowset. Now, for all records corresponding
the join, F be the fact table, be the set of columns CD
The SJL algorithm starts by loading all BDIs
correand let be the fact table columns that contribute CF
is there in the corresponding rowset. This is done by
dimension tables may be huge and such large amounts
each of the dimension tables and for the Fact table. RF
The SJL uses the JDI to do a star join in a single pass
and dicult to impro ve upon for the dense rowsets of
that the row if the table is selected) : : : on ith R1 RjDj
having columns in to be stored in memory. Some CD
Let D be the set of dimension tables participating in
This can be done by a scan of the appropriate BDIs.
typical OLAP queries. However, it requires all BDIs
rowsets (a bitmap having the bit checked to denote ith
of dimensional tables that contribute to the join result
of memory need not be available. Thus, we discuss
to the join result. To answer the query rst w e form
to D and check whether the row pointed to by the JDI
sponding to columns in (Steps 1 to 4). If there is a CD
is shown in Algorithm 1.
ble and also the merge of the intermediate results as
advantages in many cases. The scheme is to divide
compared to the memory available and the number of
tive in the number of partitions of the BDIs which
similar large dimension tables. The number of
partimemory along with some of the partitions of the other
We now describe a scheme that will reduce the
memNow, one can do a join in a manner very similar to
the large dimension tables that are accessed by most
nique is also done for caching in [ND97].
stored contiguously in separate les. A similar
techtions will depend on the size of the dimension tables
the size of which is such that it ts comfortably into
Clearly we have avoided repeated scans of the fact
taory requirements of SJL and thus provide performance
of partitions. Once this is done, we load a new
combithat only one of the partitions is replaced. One such
dimensions into memory and then scan the JDIs in
are higher than this one in the nested loop order. In
order is a nested loop order in which we go forward
the le corresponding to this particular combination
bination is chosen in a gray code order in the sense
nation and repeat the same procedure. The new
comin case of SJS. However, we may scan some parts of
of partitions of the partitioned dimension tables are
records referencing rows of a particular combination
queries. Next, the fact table is organized such that
the SJL, but requiring less memory. The algorithm is
some BDIs several times. The number is
multiplicathe larger dimension tables into horizontal partitions,
and backward alternately on each of the dimensions.
to get a combination of partitions of the partitioned
3.1 Compression Techniques
3 Compression
K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-6
memory, we may treat more than 1 partition of the
memory is very small and the fact table is not much
this scheme.
in number, indexing overheads also increase. Also, if
SJS may even be better.
tions of BDIs : : : ; then column of BDI is B1; B2; Bk B1
shall see later, compression using our schemes suers
as you horizontally partition the data. The partitions
are indexed and as the number of partitions increase
bigger than these dimensional tables and the query
BDI as a single partition and gain in performance. Bk
other words, if : : : ; are the number of parti- n1; n2; nk
Given this, it might seem that making ner and ner
: : : 1 times. Thus, if we have additional B1 B2 Bk
scanned once, is scanned times, is scanned B2 B1 Bk
involves many of these large dimensional tables, then
We may get several additional advantages out of
chunks would do no harm. This is not the case. As we
that his method cannot be improved upon by any in- 15 7 6 6 5
it assumes that all characters are equally probable, but as it involves doing innite precision arithmetic in a
cally. The adaptive a vor has only one pass. Initially, The implementation details are much more complex
signed shorter codes. Human w as also able to prove 13 11
the length of the code is closest to of log2(probability 0 1 24 0 1
symbol). Thus, frequently occurring characters are
asthe frequencies 15, 7, 6, 6 and 5 respectively. First, the 3.1.2 Arithmetic Coding
given the frequencies of each of the characters.
Supthe compressed form. In the gure, the nal range is
the variable length of the codes.
be implemented on a VLSI chip [RS91] that can en- tation involves too many divisions and multiplications
individual symbols are laid out as a string of leaf nodes
tegral bit width coding stream. Note that each code
Here is an illustration of how the codes are assigned Figure 3: Human Coding
represented as an interval between two reals. Initially,
passes. In the rst pass it scans or samples the data There are adaptive and non-adaptive methods for
data. mentally. One does not want to keep the entire
comtional to its probability of occurrence. For example, if
second it actually compresses with the tree built stati- similar to the Human coding described previously.</p>
      <p>The previous steps are repeated until one free interval increases. In the end, the number in the
inHuman coding. The non-adaptive a vor has two put.
that are going to be connected by a binary tree. Each Arithmetic coding [WC87] improves on Human
codis ecien tly implementable in hardware, i.e. it can end. Another serious problem is that the
implemencode and decode at speeds as fast as the disk transfer which makes it slow.
and the two children are removed. assigned to that character. In the gure, when e is
ennon-adaptive schemes assume that the data will have speeds. The result is absolutely no compression
oversome properties. A short description of some popular head. However, these chips are not yet manufactured
node has a weight which is equal to the frequency of ing by removing the restriction that integral number
bit, while the other is set to 1. is encoded the range progressively decreases and the
There are both adaptive and non-adaptive a vors of [0.2,0.26) and hence 0.25 requiring only 2 bits is
outthe symbol. Then, the following steps are done to ob- of bits be used. It actually represents each character
assigned a code of integral number of bits such that
pose there are only 5 characters A, B, C, D and E with
node is left which is made the ROOT. terval requiring the least number of bits is output as
A parent node for these two is created with weight there were only 4 characters a, e, i and u and they
compression algorithms follows. and would still be very expensive.
taken from the parent node when decoding a 0 similarly when a is encountered next. As the string
has a unique prex to enable correct decoding despite A B C D E
The two free nodes with the lowest weights are that range is [0,1). Each character is assigned an
interlocated. val in the range [0,1) and the width of which is
proporOne of the child nodes is designated as the path countered the range is narrowed to [0.2,0.5) and then
3.1.1 Human Coding ROOT
others. In Human coding [HD52], each character is 0 1
equal to the sum of the two child nodes. occurred with frequencies in the ratio of 2:3:1:4 then
in most meaningful data occur with the same fre- 0 1
number of bits needed to specify any number in that
it keeps updating its model as it learns more about the nite precision computer and also outputting
increHuman coding uses the fact that not all characters
quency. Some characters are more frequent than the 39
a possible assignment is shown in Figure 4. When a
Another good feature of Human coding is that it pressed number in memory and output it only at the
and gets the frequencies of the characters, while in the arithmetic encoding too, and the dierences are v ery
tain the tree in Figure 3. in a fractional number of bits. The compressed data is
The parent node is added to the list of free nodes character appears the range is narrowed to the fraction
0.2
u
a
u
i
e
a
0.5
e</p>
      <p>ea</p>
      <p>DICTIONARY</p>
      <p>LOOK-AHEAD</p>
      <p>LZW at the record level performs better than Human
glish documents are not compressed by much using
there are very frequent updates or reorganizations,
speed.
peated within a single column value which itself is of
a few tens of bytes. Thus, after going through the
only dierence is that at the attribute level we
cancoding for their datasets. However, one has to sample
pect to have built a tree that results in substantial
the block level, both the algorithms perform similarly,
consist of words from a very selective range. Long
Enexpansion. The adaptive a vor of LZW will also do
equally badly for the same reason. One cannot
exchoices we have. Let us consider them one by one
It should be clear that LZ77 and LZW are the only
the data. Also, as pointed out in [IW94], non-adaptive
carefully and this does well only when most values will
tionary used in both cases will be very similar. The
compression speed may become an issue and then it
at the attribute level as well as the block level. At
though there is some loss in compression ratio for both;
do not expect long sequences of characters being
reLZ77 algorithm it is clear that it will not do well at
le level. In both cases we sample data and the
dichowever the dierence between their compression
raFor the non-adaptive version of LZW at the attribute
not over-step attribute boundaries while compressing
level, nothing much has changed as compared to the
the attribute level. It might even result in a slight
LZ77 unless updates are huge and very frequent. If
is better to use LZW as it compresses also at a good
text. Now, consider the non-adaptive version of LZW.</p>
      <p>At the attribute level, things are quite dieren t. We
tios is not really signican t. Thus, we must choose
compression by going through such a small amount of
2. For text les, the dictionary based techniques
niques.
perform substantially better than statistical
techqueries.
compressors compress it to around 50% of the original
sented in 2 to 3 bytes, rounding to the next higher
it. However, in real life warehouses, some columns
will be best to store such data without compressing
by their very nature will be expected to have most
not make sense. Thus, if many of the queries require
will occupy 4 bytes(capable of representing more than
the measures which are mostly numerical. These are
be very large balances, though rarely. In other words,
totally random uniformly distributed over the range
in the higher order bits and the lower order bits.
BeThe width of the column will be large, as there can
arithmetic coding look appropriate for such data.</p>
      <p>Clearly, compressing these at the attribute level does
The most often queried columns in the fact table are
of balances in a bank is likely to be such a column.
then each compressed attribute occupies 3 to 4 bytes.</p>
      <p>Let us now concentrate on compressing at the block
if such a column is actually 8 bytes in width, most
the ones which are mostly aggregated in warehouse
of in compressing numerical data. If the numbers are
First, consider compressing them at the attribute
values will use only 4 bytes. Statistical techniques like
Note that the major skew in the frequency of digits,
level. First, we must nd out what w e can make use
of 5, 10 or 100. To illustrate what we mean, we have
attribute as the data will become variable length and
This would result in compression ratio of 0% to 25%.
few attributes from each block, it is not a good idea
constructed an example column in Table 2. A column
do not compress by as much as text, and most good
level. It is reasonable to assume that most elds
no redundancy and no algorithm will compress it. It
that can be represented in that width, then there is
values in the lower and middle range or be multiples
number. Add another byte to store a pointer to the
which is exploited by statistical coding techniques, is
size. This means that a column value will be
repre4 or less. Such numbers, represented in binary 1030)
to compress the numerical columns.
4.3 The JDIs
Table 2: Redundancy in Numeric Columns Table 3: Another Redundancy in Numeric Columns
4000
20000
K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-11
4.2.1 Another Redundancy
tics at the bit level since computing for ten sets is too
may be benecial. In this example we will use
statister to have a dieren t frequency table for every byte of
we have common statistics for all bit positions, then
pressed to = :461. Also, the 0:9log20:9 0:1log20:1
= :971. Thus, we may com- 0:6log20:6 0:4log20:4
where i denotes the character. For this example the
much and the simple case is sucien t to illustrate the
the frequencies become 60% ones and 40% zeroes.
sion eciency should impro ve by quite a lot. Here is an
other bits are not compressed so eectiv e
compresconcept. Assume a 32 bit attribute. Let 90% of the
of having one common frequency table, it will be
betpress to 86.7% instead of 97.1% of the total size.
Huthe algorithm that should be used due to its greater
speed of compression and expansion.
bits in the most signican t 8 bits be 0’s. Assume an
sides, they have dieren t kinds of skews. Thus, instead
example to show how having dieren t frequency tables
equal distribution in the lower 24 Arithmetic cod- bPits.
sion is (0:461 8 + 1 24)=32 = :867. However, if
With this frequency table it can be compressed to
ing compresses to a maximum of [WC87] pilog2pi
in speed. Since the width is small, overhead of storing
the width of the column at the price of a small decrease
the frequency distributions is not much and
compresbits are the characters. Therefore, it can be
comman coding does slightly worse than this and this is
as shown in Table 5.
; . Also, the dimensional tables must be JA; JB; JC JD
If we observe carefully, we nd that the dier- JC
code them. It should be clear that this results in great
ences between successive values is much smaller than
take successive dierences modulo 7 w e get the column
dimension table between the literals and then Human
can be converted to a bitmap if it is dense resulting
this will be the case in a large warehouse. Assume that
mensional tables A, B, C, D. The fact table will have
the size of the dimensional table C is 7. Then, if we
close together and the dierences between successive
sorted using the hierarchy in their attributes (as
deConsider a warehouse having a fact table F and 4
diin greater compression. This is illustrated in Figure 6.
scribed in the Section 4.4) so that related values are
Run Length Encoding looks ideal for such data.
AnAn additional advantage is that the bitmap may be
if the resultant bitmap is empty, one may not read the
at least the JDIs look extremely compressible. JA; JB
other important observation is that the repeats are
They will have groups of the same number repeated
several times. Table 4 illustrates this.
compression of and compressing them more is JA; JB
anded with the rowset in memory while doing SJL, and
sive dierences modulo the size of the corresponding
Thus, we must do some kind of organization of data.
the actual values and it is not dicult to visualize that
JDI entries after sorting is small. When this is done,
of no interest.
lengths and the literal which is repeating separately,
the literals will be in sorted groups. This sorted group
in sorted order. In other words if we store the
Runrun-lengths. Another possible way is to take
succes4 JDIs ; on A, B, C and D respectively. JA; JB; JC JD
Let us sort the data on JDIs in ascending order by
table having a hierarchy of date, month, year, etc. If
the hierarchy while is the lowest). This will re- Ak
long runs are not always expected but there is a good
possibility, it is better to do block level LZ77 as it
subHowever, we may order the dimensional table
accordsult in all repetitions getting clustered. Now, we may
zontal partitioning and groupset indexes (i.e. a group
have a hierarchy among their columns. The values
to occur several times in that column. Thus, one
too.
ing to the hierarchy, just as we did in the case of
horimay consider dictionary based methods to be the best.
by on attributes : : : where is highest in A1; A2 Ak A1
in columns that are higher in the hierarchy are likely
The dimensional tables usually consist of text and
sumes Run Length Encoding and does well otherwise,
apply RLE. A perfect example is the time dimension
D5
Main
D3
Table
D4
D2
D1
8
5
No. of Rows
20086
5
3443
26685
An important benet due to the isolation of
that are not compressed do not suer from an y kind of
of the same table. If the tables were not vertically
ineciency due to the compression of other columns
columns provided by the DataIndexes is that columns
processing.
cause the records become variable length.
adverse eect on even queries involving only the
un(or even expansion) and hence ineciency in query
partitioned, compressing one column would have an
compression method will result in lesser compression
compressed columns of the table. This is mainly
bescheme applied on data that was not expected by the
These decisions are very important as any compression
the fact table describing which are the basic dyes to
We have implemented these techniques, and this
secin terms of amounts of some primary dyes. The
cention gives a summary of how various techniques
peraround 3 to 4 records for every referenced key from
tral fact table has 5 JDIs. Table 6 gives a description
of the shade. The largest dimension table, D5, has
popular paint company. It has recipes of 26,685 shades
in this table is the only real numeric data we have.
of the tables.</p>
      <p>D4 contains only one attribute which is the name
formed on real data. The data has some of
characbe used and in what quantity. The quantity column
teristics of an actual warehouse. The data is from a
usually going to be processed, while making
dethe group-bys in most queries.
cisions regarding the organization. For instance,
Workload: One must consider the queries that are
the sort order may be determined as the order of
mance if disk is precious to him.</p>
      <p>Cost of Disk Space: If compression is adding
overhead, the user will compromise on query
perforis very high and reorganization cost has to be kept
at a minimum, then we must sort rst b y the time
dimension.</p>
      <p>Frequency of updates: If the frequency of updates
such a case, after dierencing we will have long runs
tables respectively. The table is sorted by the JDIs
are compressed to almost nothing as a result of run
rows. Due to the very low cardinality of D2 and D3,
pany, the table is also sorted by C1, C3. Thus, we can
Table D5 has four columns- C1, C2, C3, C4. C1 is
Human Coding it. C4 was directly Human coded
ing J4 and J5 was that it was mostly populated with
and converted into an integer by multiplying by 10 and
deed, the table above shows that block level LZW
com30% to 35% of its original size. This implies a
combe compressed. Let us begin with D4. It has only one
as most of the values were very low. C1 has runs of
does better for the other columns. The results are
created by the company, it so happens that J5 is also
C1. These are dieren tiated by their C2 value. Thus
pressing to less than 2K. Otherwise, Human Coding
Overall, the entire data got compressed to around
on J2, J3, J4 in that order. Again, since the data is
between them. The table was sorted by C1, C2. It
single values and C2 has sequences such as 1; 2; : : :. In
It only has 5 JDIs-J1 to J5 referencing the 5 dimension
that the data is represented in variable length format
pressing it. Each attribute is terminated by a newline
nity for dictionary based techniques. Thus, dictionary
And due to the same reason as above we expect
dicNow we consider compressing the fact table, ’Main’.
to denote its end. The result of compressing this using
ferences. Otherwise, one should use Human Coding.
(not xed length b y padding with blanks) before
comMain, D4, D5 are the only tables large enough to
However, an important observation while
compressthe other JDIs-J1, J4, J5 as for the case of D5 and the
tionary techniques to perform better on them.
Ina table with 16 rows. C4 is a oat column denoting
JDI, LZW should be used after taking successive
difof 0’s and 1’s respectively, which is a perfect
opportuthen the value was stored in BCD with no separators
length encoding. The same experiments were done on
the key which is referenced and has values from 1 to
almost sorted. In fact, J5 is equal to J4 in several
5815. The table has 3 to 4 rows with the same value of
Block level LZW is in Table 7.
techniques after dierencing do slightly better
comresults are summarized in Table 9.
column, called ’Name’ which consists of text. Note
pressed the same to much lesser. Thus, when patterns
quantity of the referenced item. C4 was pre-processed
like runs and sequences are expected to occur in the
compress C1, C2, C3 by successive dierencing and
so happens that since this is data created by the
comlong sequences of successive numbers like 1; 2; 3 : : :.</p>
      <p>J2 and J3 will have very long runs. As a result they
summarized in Table 8.</p>
      <p>C1, C2 forms a candidate key. C3 is a JDI referencing
Compressed Size
12.0K
Di/Hu
Di/LZW
Di/Hu
Di/LZW
Di/Hu
Di/LZW
Technique
Original Size
35.0K
CR
75%
84%
82%
45%
38%
35%
45%
75%
Di/Hu
Technique
Di/LZW
Human
Di/Hu
Di/LZW
Di/Hu
Di/LZW
LZW
61%
CR
67%
73%
61%
62%
90%
JDI
Type
JDI
JDI
JDI
JDI
JDI
JDI
JDI
JDI
Type
JDI
JDI
Numeric
Numeric
JDI
As stated previously this data has been synthetically
a warehouse to be. However, a 50% compression ratio
pression ratio of 65% to 70%, which is quite impressive.
created and is a lot more ordered than one can expect
would not be surprising for most warehouses.</p>
      <p>ToolKit. John Wiley and Sons, 1996.
[KIM96] Ralph Kimball. The Data Warehousing
[CY98]
[KA98]
References
K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-16
Acknowledgements
5 Conclusion and Future Work
processing. It was only because of the DataIndex
repvery ecien t, both in terms of space as well as query
most compression techniques (even techniques of the
of SJL enabling us to avoid using SJS as much as
that particular type of data to occupy lesser space than
compressed. Also, a large amount of data displays
houses stored as DataIndexes and discussed when each
any general technique would otherwise result in.
AnWe found that representing data as DataIndexes is
schemes had several drawbacks in the sense that the
suers. This is because the groups get split across
Section 2.5.5 we proposed a horizontal partitioning
nore the compression overhead in comparison to
savsimilarities (in the sense of compressibility) down a
extent that we don’t have to do SJS for most queries,
The contribution of this work is twofold. First, In
as before on queries involving columns that are not
not distribute their overhead to the other columns. In
good compression.
ing of I/O. A lot of research has to be done to
deumn are possible. Secondly, dieren t types of columns,
possible. Secondly, we described several compression
speeds have increased rapidly. Today, we cannot
igcolumn and therefore we expect compression ratios of
with dieren t semantics, could be compressed
indepenof them should be used. It might have appeared then
that partitioning nely is a good idea. However, the
each partition. Thus, we must partition only to the
level. We were particularly impressed by the hardware
dently using dieren t techniques which work best for
so that we get the ecien t joining of SJL along with
several partitions and have to be coded separately in
implementation of Human coding. However, their
future) to improve substantially when used on
vertically partitioned data.
sign fast compression and decompression algorithms
algorithms that could be used to compress data
waredon’t need very often and get the same performance
resentation that things like run length coding on a
colmore partitioning that is done, the more compression
scheme in order to reduce the memory requirements
one uses Human coding to compress all their data
tree seemed to be hard coded into the chip. Also, no
which give good compression ratios even at the block
other words, we can compress some columns which we
In the recent past, both processing power and I/O
other feature is that compressing some columns does
[GHQ95] A. Gupta, V. Harinarayan, and D. Quass.
housing. Proc. of the VLDB 1995.</p>
      <p>Aggregate-Query Processing in Data
Ware[HD52] Human D. A. A Method for Construction
Electr. Radio Engg. 40.9 September 1952.
of Minimum Redundancy Codes. Proc. Inst.</p>
      <p>Chee-Yong Chan, Yannis E. Ioannidis.</p>
      <p>Bitmap Index Design and Evaluation. Proc.
of SIGMOD 1998.
[IW94] B. Iyer, D. Wilhite Data Compression
Support in Databases. Proc. of 20th VLDB
Conference, 1994.</p>
      <p>Bitmap Indexing for Data Warehouses. Proc.</p>
      <p>ICDE, Orlando, February 1998.
[BW98] A. Buchmann and Ming-Chuan Wu. Encoded
mamritham, Helen Thomas, Igor Viguier.
"Have Your Data and Index It Too":
Ecient Storage and Indexing for Data
Warehouses. Technical Report, 1998.</p>
      <p>Anindya Datta, Bongki Moon, Krithi
RaThe research reported here was supported in part by
the National Science Foundation grant number
IRI9619588.
which stores data in little space with absolutely no
today! It would be an interesting venture to gure out
data that Human coding is not. In this way, they
niques must be designed which are implementable in
Before designing a compression technique, one
properties.</p>
      <p>Transfer Speeds of disks). More importantly,
techment ecien tly in hardware (at speeds as fast as Data
could complement each other and result in a system
extra time overhead.</p>
      <p>A good area for future research is to determine what
what makes the other techniques dicult to
implecan do well at all kinds of data. We did not have
the opportunity to analyze data in actual warehouses.
other redundancies typically occur in warehouses, so
hardware at high speeds and are good at compressing
pect in the data, for no single compression method
that algorithms can then be designed to exploit these
needs to determine what kind of redundancy to
ex</p>
      <p>Patrick O’Neil, Dallan Quass. Improved
Proc. of SIGMOD, 1997.</p>
      <p>Query Performance with Variant Indexes.
gatekeeper.dec.com/pub/DEC/SRC/researchM. Burrows, D. J. Wheeler. A Block
Sorting Lossless Data Compression
Alreports/SRC-124.ps.Z
gorithm. SRC research report 124,
tion, 1998.
housing and OLAP. Submitted for
PublicaRamamritham, Bongki Moon. Applying
Parallel Processing Techniques in Data
WareAnindya Datta, Debra VanderMeer, Krithi
formance Data Compression. IEEE
Computer 17.6 June 1984.</p>
      <p>Terry A. Welch. A Technique for High
Per</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>