-

0 Table 5: JDI After Successive Dierences JC

1999

19 3 11 12

2.2 Projection Index 2.1 Bitmap Indexes 2 Variant Indexes K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-2 of dimensions. In such cases queries must be evaluated large amount of data and most of the decompressed In Section 2 we describe several indexing approaches a data warehousing environment.

So far, indexes have generally been considered septhis paper we concentrate on this approach.

We believe that it should in fact be more ecien t to odically, and during this time no queries are allowed.

Data warehouses are typically updated only pericesses [RH95]. Although decompression adds to query processing cost, the [RH95] paper showed that in dates is no longer an issue, it is possible to use more using indexes to ecien tly access base data. than compensates for the increased processing time. reducing storage costs, it also could reduce query is the cost of storage due to index structures. DataIndexes. Compression has several benets. Apart from databases the reduced number of disk accesses more data warehouses, storage costs are very high and so various compression techniques. In Section 4 we disThis allows the batch update process to reorganize the infeasible, as this grows exponentially with the number maintaining indexes in the presence of concurrent upprocessing time by reducing the number of disk achave been possible otherwise. Since the problems of indexes to an optimal form, in a way that would not specialized indexes to speed up query evaluation. In compress a data warehouse as most queries access a The remainder of this paper is organized as follows. dexes provide indexing at no extra cost apart from of indexes and in Section 5 we conclude the paper. that of storing the data, thus saving a lot of space. We less. Also, precomputing all necessary tables is often data will actually be required to evaluate the query. cuss how compression may be applied to certain types propose compressing the data in the form of DataInIn this paper we explore how compression performs in used in data warehousing and in Section 3 we review arate from the base data. Given the large size of half the total number of rows. A Row ID representamore CPU ecien t than RowIDs. The common option will occupy 4 n bytes in all while a bitmap reprechine words. Thus, Bitmap operations are a lot faster. case the entries for both key values will have about ing only two possible values Male or Female. In this ticular table and a one at ordinal position i indicates mous disk savings especially if n is large. There is disk A further optimization described in [PQ97] is to diis the number of distinct attributes. They represent it be accessed for range selections. may have several references. In such a case, it is more to a RID list. simplest of compression methods is to convert it back particular, base 2 representation gives bitslice indexes. think of each entry being represented in base n where n [CY98] draws a parallel between bitmap indexing vide the table into segments of a xed n umber of rows [WB98] describes methods of ecien tly coding Bitslice that the row is present in the list. Thus, we have ith than 1/No of bits in a RowID. Also Bitmaps are a lot or an RID list. This not only saves space, but also ecien t to represent the list as a bitmap. Consider a such an index is called a Bitmap Index. The following in dieren t bases and discuss space time tradeos. In take several instructions, while most processors have total number of rows in the table. This results in enorsaving as long as average density of a bitmap is greater bitmap of length equal to the number of rows in a parsingle instructions to directly AND, OR and NOT maindexes so that the least number of bitmaps have to fragment and each fragment can be stored as a bitmap saves disk I/O in combining bitmaps. This is because When bitmaps are sparse one can compress them. The and number systems. The idea is that when you orient some fragments may not have any set bits at all and erations of taking union and intersecting RowID lists data. sentation will occupy only 2 n=8 bytes where n is the hence while ANDing, the other bitmap may not be rebitmaps vertically and place them side by side you can each. The bitmaps corresponding to each segment is a example illustrates why these are more ecien t in an just a new way of representing the list of RowIDs and OLAP environment. Consider a column Gender havtrieved at all. This is particularly useful for clustered of rows in the table. However, in OLAP each entry four columns is stored as two BDIs, one of one column single BDI if it is known that most queries will access ordinal mapping between them.

BDI. The BDI generalizes projection indexes by allowing any number of columns to be in a single BDI. columns. Clearly, Projection indexing is free in terms easy to form the original table if we have projection and the other of 3 columns. The dotted lines show the suited for OLAP queries. Also notice that it is very It would be useful to store two or more columns in a of storage. A graphical representation of the BDI is As we saw earlier projection indexes are particularly shown in Figure 1. In this gure the original table of the table where each partition may have one or more them together. Thus, the BDI is a vertical partition of jection Indexes on all columns, we need not store the original tables at all. Hence, we get Projection Indexes indexes on each of the columns. Thus, if we force Profree of storage cost. This is the main idea behind the imum size for the column or use a B-Tree to access read may bring in more than one value in the Foundcorresponding to a very dense FoundSet (A bitmap and retrieve only a few of the columns. Thus, such an ber provided the column is xed length. However, if Set. OLAP queries have very low selectivity queries is because more values t onto a page, and a single index is extremely useful. the column is variable length, we can either x a maxvalues using the row number. The Projection index is very useful when one has to retrieve column values having a set bit for every row to be retrieved). This tial integrity is strictly enforced on all star schemas.

Indexes. data in a warehouse in which the Indexing is essenDataIndexes [KA98] are a new way to represent the An assumption made in DataIndexes is that referendexes and algorithms to evaluate queries using these In this section, we describe the two types of DataIntially free in terms of storage. The rest of the paper assumes we have the data in the form of DataIndexes.

BDIs. If the foreign key column is also stored as BDIs, column value any more. In fact, we may take lesser in terms of storage as one does not need to store the sion Table. The gure uses dotted lines to show the we must join it every time with the corresponding colTimekey as foreign key referencing the Time Dimenumn of the dimension table. The BDIs rely on ordinal mapping. Thus, it is a good idea to store the ordiConsider storing all dimensional and fact tables as the maximum ordinal number. Such a JDI is shown column. This is called a JDI. In a way this precomstorage if the width of the key column is larger than nal number of the row it references in the foreign key putes the join. Thus it is a join index. This too is free in Figure 2. The sales table is the fact table having stored for each key value which is quite space ecien t. formed in a single pass of the fact table. Also groups are not stored in the index. Thus, the Groupset Index is ecien t for aggregation. Also, only an integer is not having any representative rows in the fact table The Star Join Index concatenates column values index that uses RIDs of a dimensional table instead every dimension table. table. Thus, to compute a query one simply ORs all nation of the columns of a table. Bitmapped Join Inall the resultant bitmaps for each of the Dimension tafrom dieren t dimensional tables and lists RIDs in the bitmaps for selections on each table and then ANDs through a common join. Most Join Indexes represent of search key values to index the records of the Fact bles. It is clear that only one index is necessary for with this index is that they are not ecien tly recombidexes solve this problem. It consists of using a bitmap fact table for each concatenated value. The problem A join index by denition is an index on one table for a nant. Thus, an index has to be made for every combithe fully precomputed join in dieren t ways. quantity that involves a column value of another table the rows in the dimension tables which join with the particular fact table row. A B-Tree Index is then built want. The fact table rows are then clustered so that values in the dimension tables. In other words, the This index improves the eciency of grouping queries appropriate depending on the queries a user would fact table is sorted based on the ordinal position of clustered together on disk. Further, the successive ing to a primary key or a hierarchy which ever is more sponding group in the fact table is stored. The end of using the concatenation of the dimension key values First, the dimensional tables must be ordered accorda group is simply one before the beginning of the next and using the same ordering among them as in the groups are in the same order as a nested loop on the value the ordinal number of the starting of the correrows with the same foreign keys on all dimensions are by creating a special index and clustering the data. groups of the fact tables. For each concatenated key group. It is easy to see that a group by can be persteps 9 to 12. Clearly, this algorithm is very ecien t of the relevant BDIs of the fact table. The algorithm load only the blocks that have some row in the correfact table BDIs having columns in : This is done by CF steps 6 to 8. If this check succeeds for every JDI then mensional BDIs and reading the corresponding row of predicate in the query on any of these columns one may to the rowset we scan all JDIs on a table belonging RF the output record is built by using the in memory disponding rowset. Now, for all records corresponding the join, F be the fact table, be the set of columns CD The SJL algorithm starts by loading all BDIs correand let be the fact table columns that contribute CF is there in the corresponding rowset. This is done by dimension tables may be huge and such large amounts each of the dimension tables and for the Fact table. RF The SJL uses the JDI to do a star join in a single pass and dicult to impro ve upon for the dense rowsets of that the row if the table is selected) : : : on ith R1 RjDj having columns in to be stored in memory. Some CD Let D be the set of dimension tables participating in This can be done by a scan of the appropriate BDIs. typical OLAP queries. However, it requires all BDIs rowsets (a bitmap having the bit checked to denote ith of dimensional tables that contribute to the join result of memory need not be available. Thus, we discuss to the join result. To answer the query rst w e form to D and check whether the row pointed to by the JDI sponding to columns in (Steps 1 to 4). If there is a CD is shown in Algorithm 1. ble and also the merge of the intermediate results as advantages in many cases. The scheme is to divide compared to the memory available and the number of tive in the number of partitions of the BDIs which similar large dimension tables. The number of partimemory along with some of the partitions of the other We now describe a scheme that will reduce the memNow, one can do a join in a manner very similar to the large dimension tables that are accessed by most nique is also done for caching in [ND97]. stored contiguously in separate les. A similar techtions will depend on the size of the dimension tables the size of which is such that it ts comfortably into Clearly we have avoided repeated scans of the fact taory requirements of SJL and thus provide performance of partitions. Once this is done, we load a new combithat only one of the partitions is replaced. One such dimensions into memory and then scan the JDIs in are higher than this one in the nested loop order. In order is a nested loop order in which we go forward the le corresponding to this particular combination bination is chosen in a gray code order in the sense nation and repeat the same procedure. The new comin case of SJS. However, we may scan some parts of of partitions of the partitioned dimension tables are records referencing rows of a particular combination queries. Next, the fact table is organized such that the SJL, but requiring less memory. The algorithm is some BDIs several times. The number is multiplicathe larger dimension tables into horizontal partitions, and backward alternately on each of the dimensions. to get a combination of partitions of the partitioned 3.1 Compression Techniques 3 Compression K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-6 memory, we may treat more than 1 partition of the memory is very small and the fact table is not much this scheme. in number, indexing overheads also increase. Also, if SJS may even be better. tions of BDIs : : : ; then column of BDI is B1; B2; Bk B1 shall see later, compression using our schemes suers as you horizontally partition the data. The partitions are indexed and as the number of partitions increase bigger than these dimensional tables and the query BDI as a single partition and gain in performance. Bk other words, if : : : ; are the number of parti- n1; n2; nk Given this, it might seem that making ner and ner : : : 1 times. Thus, if we have additional B1 B2 Bk scanned once, is scanned times, is scanned B2 B1 Bk involves many of these large dimensional tables, then We may get several additional advantages out of chunks would do no harm. This is not the case. As we that his method cannot be improved upon by any in- 15 7 6 6 5 it assumes that all characters are equally probable, but as it involves doing innite precision arithmetic in a cally. The adaptive a vor has only one pass. Initially, The implementation details are much more complex signed shorter codes. Human w as also able to prove 13 11 the length of the code is closest to of log2(probability 0 1 24 0 1 symbol). Thus, frequently occurring characters are asthe frequencies 15, 7, 6, 6 and 5 respectively. First, the 3.1.2 Arithmetic Coding given the frequencies of each of the characters. Supthe compressed form. In the gure, the nal range is the variable length of the codes. be implemented on a VLSI chip [RS91] that can en- tation involves too many divisions and multiplications individual symbols are laid out as a string of leaf nodes tegral bit width coding stream. Note that each code Here is an illustration of how the codes are assigned Figure 3: Human Coding represented as an interval between two reals. Initially, passes. In the rst pass it scans or samples the data There are adaptive and non-adaptive methods for data. mentally. One does not want to keep the entire comtional to its probability of occurrence. For example, if second it actually compresses with the tree built stati- similar to the Human coding described previously.

The previous steps are repeated until one free interval increases. In the end, the number in the inHuman coding. The non-adaptive a vor has two put. that are going to be connected by a binary tree. Each Arithmetic coding [WC87] improves on Human codis ecien tly implementable in hardware, i.e. it can end. Another serious problem is that the implemencode and decode at speeds as fast as the disk transfer which makes it slow. and the two children are removed. assigned to that character. In the gure, when e is ennon-adaptive schemes assume that the data will have speeds. The result is absolutely no compression oversome properties. A short description of some popular head. However, these chips are not yet manufactured node has a weight which is equal to the frequency of ing by removing the restriction that integral number bit, while the other is set to 1. is encoded the range progressively decreases and the There are both adaptive and non-adaptive a vors of [0.2,0.26) and hence 0.25 requiring only 2 bits is outthe symbol. Then, the following steps are done to ob- of bits be used. It actually represents each character assigned a code of integral number of bits such that pose there are only 5 characters A, B, C, D and E with node is left which is made the ROOT. terval requiring the least number of bits is output as A parent node for these two is created with weight there were only 4 characters a, e, i and u and they compression algorithms follows. and would still be very expensive. taken from the parent node when decoding a 0 similarly when a is encountered next. As the string has a unique prex to enable correct decoding despite A B C D E The two free nodes with the lowest weights are that range is [0,1). Each character is assigned an interlocated. val in the range [0,1) and the width of which is proporOne of the child nodes is designated as the path countered the range is narrowed to [0.2,0.5) and then 3.1.1 Human Coding ROOT others. In Human coding [HD52], each character is 0 1 equal to the sum of the two child nodes. occurred with frequencies in the ratio of 2:3:1:4 then in most meaningful data occur with the same fre- 0 1 number of bits needed to specify any number in that it keeps updating its model as it learns more about the nite precision computer and also outputting increHuman coding uses the fact that not all characters quency. Some characters are more frequent than the 39 a possible assignment is shown in Figure 4. When a Another good feature of Human coding is that it pressed number in memory and output it only at the and gets the frequencies of the characters, while in the arithmetic encoding too, and the dierences are v ery tain the tree in Figure 3. in a fractional number of bits. The compressed data is The parent node is added to the list of free nodes character appears the range is narrowed to the fraction 0.2 u a u i e a 0.5 e

DICTIONARY

LOOK-AHEAD

LZW at the record level performs better than Human glish documents are not compressed by much using there are very frequent updates or reorganizations, speed. peated within a single column value which itself is of a few tens of bytes. Thus, after going through the only dierence is that at the attribute level we cancoding for their datasets. However, one has to sample pect to have built a tree that results in substantial the block level, both the algorithms perform similarly, consist of words from a very selective range. Long Enexpansion. The adaptive a vor of LZW will also do equally badly for the same reason. One cannot exchoices we have. Let us consider them one by one It should be clear that LZ77 and LZW are the only the data. Also, as pointed out in [IW94], non-adaptive carefully and this does well only when most values will tionary used in both cases will be very similar. The compression speed may become an issue and then it at the attribute level as well as the block level. At though there is some loss in compression ratio for both; do not expect long sequences of characters being reLZ77 algorithm it is clear that it will not do well at le level. In both cases we sample data and the dichowever the dierence between their compression raFor the non-adaptive version of LZW at the attribute not over-step attribute boundaries while compressing level, nothing much has changed as compared to the the attribute level. It might even result in a slight LZ77 unless updates are huge and very frequent. If is better to use LZW as it compresses also at a good text. Now, consider the non-adaptive version of LZW.

At the attribute level, things are quite dieren t. We tios is not really signican t. Thus, we must choose compression by going through such a small amount of 2. For text les, the dictionary based techniques niques. perform substantially better than statistical techqueries. compressors compress it to around 50% of the original sented in 2 to 3 bytes, rounding to the next higher it. However, in real life warehouses, some columns will be best to store such data without compressing by their very nature will be expected to have most not make sense. Thus, if many of the queries require will occupy 4 bytes(capable of representing more than the measures which are mostly numerical. These are be very large balances, though rarely. In other words, totally random uniformly distributed over the range in the higher order bits and the lower order bits. BeThe width of the column will be large, as there can arithmetic coding look appropriate for such data.

Clearly, compressing these at the attribute level does The most often queried columns in the fact table are of balances in a bank is likely to be such a column. then each compressed attribute occupies 3 to 4 bytes.

Let us now concentrate on compressing at the block if such a column is actually 8 bytes in width, most the ones which are mostly aggregated in warehouse of in compressing numerical data. If the numbers are First, consider compressing them at the attribute values will use only 4 bytes. Statistical techniques like Note that the major skew in the frequency of digits, level. First, we must nd out what w e can make use of 5, 10 or 100. To illustrate what we mean, we have attribute as the data will become variable length and This would result in compression ratio of 0% to 25%. few attributes from each block, it is not a good idea constructed an example column in Table 2. A column do not compress by as much as text, and most good level. It is reasonable to assume that most elds no redundancy and no algorithm will compress it. It that can be represented in that width, then there is values in the lower and middle range or be multiples number. Add another byte to store a pointer to the which is exploited by statistical coding techniques, is size. This means that a column value will be repre4 or less. Such numbers, represented in binary 1030) to compress the numerical columns. 4.3 The JDIs Table 2: Redundancy in Numeric Columns Table 3: Another Redundancy in Numeric Columns 4000 20000 K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-11 4.2.1 Another Redundancy tics at the bit level since computing for ten sets is too may be benecial. In this example we will use statister to have a dieren t frequency table for every byte of we have common statistics for all bit positions, then pressed to = :461. Also, the 0:9log20:9 0:1log20:1 = :971. Thus, we may com- 0:6log20:6 0:4log20:4 where i denotes the character. For this example the much and the simple case is sucien t to illustrate the the frequencies become 60% ones and 40% zeroes. sion eciency should impro ve by quite a lot. Here is an other bits are not compressed so eectiv e compresconcept. Assume a 32 bit attribute. Let 90% of the of having one common frequency table, it will be betpress to 86.7% instead of 97.1% of the total size. Huthe algorithm that should be used due to its greater speed of compression and expansion. bits in the most signican t 8 bits be 0’s. Assume an sides, they have dieren t kinds of skews. Thus, instead example to show how having dieren t frequency tables equal distribution in the lower 24 Arithmetic cod- bPits. sion is (0:461 8 + 1 24)=32 = :867. However, if With this frequency table it can be compressed to ing compresses to a maximum of [WC87] pilog2pi in speed. Since the width is small, overhead of storing the width of the column at the price of a small decrease the frequency distributions is not much and compresbits are the characters. Therefore, it can be comman coding does slightly worse than this and this is as shown in Table 5. ; . Also, the dimensional tables must be JA; JB; JC JD If we observe carefully, we nd that the dier- JC code them. It should be clear that this results in great ences between successive values is much smaller than take successive dierences modulo 7 w e get the column dimension table between the literals and then Human can be converted to a bitmap if it is dense resulting this will be the case in a large warehouse. Assume that mensional tables A, B, C, D. The fact table will have the size of the dimensional table C is 7. Then, if we close together and the dierences between successive sorted using the hierarchy in their attributes (as deConsider a warehouse having a fact table F and 4 diin greater compression. This is illustrated in Figure 6. scribed in the Section 4.4) so that related values are Run Length Encoding looks ideal for such data. AnAn additional advantage is that the bitmap may be if the resultant bitmap is empty, one may not read the at least the JDIs look extremely compressible. JA; JB other important observation is that the repeats are They will have groups of the same number repeated several times. Table 4 illustrates this. compression of and compressing them more is JA; JB anded with the rowset in memory while doing SJL, and sive dierences modulo the size of the corresponding Thus, we must do some kind of organization of data. the actual values and it is not dicult to visualize that JDI entries after sorting is small. When this is done, of no interest. lengths and the literal which is repeating separately, the literals will be in sorted groups. This sorted group in sorted order. In other words if we store the Runrun-lengths. Another possible way is to take succes4 JDIs ; on A, B, C and D respectively. JA; JB; JC JD Let us sort the data on JDIs in ascending order by table having a hierarchy of date, month, year, etc. If the hierarchy while is the lowest). This will re- Ak long runs are not always expected but there is a good possibility, it is better to do block level LZ77 as it subHowever, we may order the dimensional table accordsult in all repetitions getting clustered. Now, we may zontal partitioning and groupset indexes (i.e. a group have a hierarchy among their columns. The values to occur several times in that column. Thus, one too. ing to the hierarchy, just as we did in the case of horimay consider dictionary based methods to be the best. by on attributes : : : where is highest in A1; A2 Ak A1 in columns that are higher in the hierarchy are likely The dimensional tables usually consist of text and sumes Run Length Encoding and does well otherwise, apply RLE. A perfect example is the time dimension D5 Main D3 Table D4 D2 D1 8 5 No. of Rows 20086 5 3443 26685 An important benet due to the isolation of that are not compressed do not suer from an y kind of of the same table. If the tables were not vertically ineciency due to the compression of other columns columns provided by the DataIndexes is that columns processing. cause the records become variable length. adverse eect on even queries involving only the un(or even expansion) and hence ineciency in query partitioned, compressing one column would have an compression method will result in lesser compression compressed columns of the table. This is mainly bescheme applied on data that was not expected by the These decisions are very important as any compression the fact table describing which are the basic dyes to We have implemented these techniques, and this secin terms of amounts of some primary dyes. The cention gives a summary of how various techniques peraround 3 to 4 records for every referenced key from tral fact table has 5 JDIs. Table 6 gives a description of the shade. The largest dimension table, D5, has popular paint company. It has recipes of 26,685 shades in this table is the only real numeric data we have. of the tables.

D4 contains only one attribute which is the name formed on real data. The data has some of characbe used and in what quantity. The quantity column teristics of an actual warehouse. The data is from a usually going to be processed, while making dethe group-bys in most queries. cisions regarding the organization. For instance, Workload: One must consider the queries that are the sort order may be determined as the order of mance if disk is precious to him.

Cost of Disk Space: If compression is adding overhead, the user will compromise on query perforis very high and reorganization cost has to be kept at a minimum, then we must sort rst b y the time dimension.

Frequency of updates: If the frequency of updates such a case, after dierencing we will have long runs tables respectively. The table is sorted by the JDIs are compressed to almost nothing as a result of run rows. Due to the very low cardinality of D2 and D3, pany, the table is also sorted by C1, C3. Thus, we can Table D5 has four columns- C1, C2, C3, C4. C1 is Human Coding it. C4 was directly Human coded ing J4 and J5 was that it was mostly populated with and converted into an integer by multiplying by 10 and deed, the table above shows that block level LZW com30% to 35% of its original size. This implies a combe compressed. Let us begin with D4. It has only one as most of the values were very low. C1 has runs of does better for the other columns. The results are created by the company, it so happens that J5 is also C1. These are dieren tiated by their C2 value. Thus pressing to less than 2K. Otherwise, Human Coding Overall, the entire data got compressed to around on J2, J3, J4 in that order. Again, since the data is between them. The table was sorted by C1, C2. It single values and C2 has sequences such as 1; 2; : : :. In It only has 5 JDIs-J1 to J5 referencing the 5 dimension that the data is represented in variable length format pressing it. Each attribute is terminated by a newline nity for dictionary based techniques. Thus, dictionary And due to the same reason as above we expect dicNow we consider compressing the fact table, ’Main’. to denote its end. The result of compressing this using ferences. Otherwise, one should use Human Coding. (not xed length b y padding with blanks) before comMain, D4, D5 are the only tables large enough to However, an important observation while compressthe other JDIs-J1, J4, J5 as for the case of D5 and the tionary techniques to perform better on them. Ina table with 16 rows. C4 is a oat column denoting JDI, LZW should be used after taking successive difof 0’s and 1’s respectively, which is a perfect opportuthen the value was stored in BCD with no separators length encoding. The same experiments were done on the key which is referenced and has values from 1 to almost sorted. In fact, J5 is equal to J4 in several 5815. The table has 3 to 4 rows with the same value of Block level LZW is in Table 7. techniques after dierencing do slightly better comresults are summarized in Table 9. column, called ’Name’ which consists of text. Note pressed the same to much lesser. Thus, when patterns quantity of the referenced item. C4 was pre-processed like runs and sequences are expected to occur in the compress C1, C2, C3 by successive dierencing and so happens that since this is data created by the comlong sequences of successive numbers like 1; 2; 3 : : :.

J2 and J3 will have very long runs. As a result they summarized in Table 8.

C1, C2 forms a candidate key. C3 is a JDI referencing Compressed Size 12.0K Di/Hu Di/LZW Di/Hu Di/LZW Di/Hu Di/LZW Technique Original Size 35.0K CR 75% 84% 82% 45% 38% 35% 45% 75% Di/Hu Technique Di/LZW Human Di/Hu Di/LZW Di/Hu Di/LZW LZW 61% CR 67% 73% 61% 62% 90% JDI Type JDI JDI JDI JDI JDI JDI JDI JDI Type JDI JDI Numeric Numeric JDI As stated previously this data has been synthetically a warehouse to be. However, a 50% compression ratio pression ratio of 65% to 70%, which is quite impressive. created and is a lot more ordered than one can expect would not be surprising for most warehouses.

ToolKit. John Wiley and Sons, 1996. [KIM96] Ralph Kimball. The Data Warehousing [CY98] [KA98] References K. Goyal, K. Ramamritham, A. Datta, H. Thomas 11-16 Acknowledgements 5 Conclusion and Future Work processing. It was only because of the DataIndex repvery ecien t, both in terms of space as well as query most compression techniques (even techniques of the of SJL enabling us to avoid using SJS as much as that particular type of data to occupy lesser space than compressed. Also, a large amount of data displays houses stored as DataIndexes and discussed when each any general technique would otherwise result in. AnWe found that representing data as DataIndexes is schemes had several drawbacks in the sense that the suers. This is because the groups get split across Section 2.5.5 we proposed a horizontal partitioning nore the compression overhead in comparison to savsimilarities (in the sense of compressibility) down a extent that we don’t have to do SJS for most queries, The contribution of this work is twofold. First, In as before on queries involving columns that are not not distribute their overhead to the other columns. In good compression. ing of I/O. A lot of research has to be done to deumn are possible. Secondly, dieren t types of columns, possible. Secondly, we described several compression speeds have increased rapidly. Today, we cannot igcolumn and therefore we expect compression ratios of with dieren t semantics, could be compressed indepenof them should be used. It might have appeared then that partitioning nely is a good idea. However, the each partition. Thus, we must partition only to the level. We were particularly impressed by the hardware dently using dieren t techniques which work best for so that we get the ecien t joining of SJL along with several partitions and have to be coded separately in implementation of Human coding. However, their future) to improve substantially when used on vertically partitioned data. sign fast compression and decompression algorithms algorithms that could be used to compress data waredon’t need very often and get the same performance resentation that things like run length coding on a colmore partitioning that is done, the more compression scheme in order to reduce the memory requirements one uses Human coding to compress all their data tree seemed to be hard coded into the chip. Also, no which give good compression ratios even at the block other words, we can compress some columns which we In the recent past, both processing power and I/O other feature is that compressing some columns does [GHQ95] A. Gupta, V. Harinarayan, and D. Quass. housing. Proc. of the VLDB 1995.

Aggregate-Query Processing in Data Ware[HD52] Human D. A. A Method for Construction Electr. Radio Engg. 40.9 September 1952. of Minimum Redundancy Codes. Proc. Inst.

Chee-Yong Chan, Yannis E. Ioannidis.

Bitmap Index Design and Evaluation. Proc. of SIGMOD 1998. [IW94] B. Iyer, D. Wilhite Data Compression Support in Databases. Proc. of 20th VLDB Conference, 1994.

Bitmap Indexing for Data Warehouses. Proc.

ICDE, Orlando, February 1998. [BW98] A. Buchmann and Ming-Chuan Wu. Encoded mamritham, Helen Thomas, Igor Viguier. "Have Your Data and Index It Too": Ecient Storage and Indexing for Data Warehouses. Technical Report, 1998.

Anindya Datta, Bongki Moon, Krithi RaThe research reported here was supported in part by the National Science Foundation grant number IRI9619588. which stores data in little space with absolutely no today! It would be an interesting venture to gure out data that Human coding is not. In this way, they niques must be designed which are implementable in Before designing a compression technique, one properties.

Transfer Speeds of disks). More importantly, techment ecien tly in hardware (at speeds as fast as Data could complement each other and result in a system extra time overhead.

A good area for future research is to determine what what makes the other techniques dicult to implecan do well at all kinds of data. We did not have the opportunity to analyze data in actual warehouses. other redundancies typically occur in warehouses, so hardware at high speeds and are good at compressing pect in the data, for no single compression method that algorithms can then be designed to exploit these needs to determine what kind of redundancy to ex

Patrick O’Neil, Dallan Quass. Improved Proc. of SIGMOD, 1997.

Query Performance with Variant Indexes. gatekeeper.dec.com/pub/DEC/SRC/researchM. Burrows, D. J. Wheeler. A Block Sorting Lossless Data Compression Alreports/SRC-124.ps.Z gorithm. SRC research report 124, tion, 1998. housing and OLAP. Submitted for PublicaRamamritham, Bongki Moon. Applying Parallel Processing Techniques in Data WareAnindya Datta, Debra VanderMeer, Krithi formance Data Compression. IEEE Computer 17.6 June 1984.

Terry A. Welch. A Technique for High Per