-

CorBit: Leveraging Correlations for Compressing Bitmap Indexes

Xi Lyu

Andreas Kipf

Pascal Pfeil

Dominik Horn

Jana Giceva

Tim Kraska

1 0 Amazon Web Services , Oskar-von-Miller-Ring 20, 80333 München , Germany 1 Massachusetts Institute of Technology , 77 Massachusetts Ave, Cambridge, MA 02139 , United States 2 Technical University of Munich , Boltzmannstr. 3, 85748 Garching bei München , Germany

A bitmap index is a secondary index structure that supports equality and range predicates. In its simplest form, a bitmap index stores one bitmap per unique column value indicating qualifying tuples. To use such indexes in large-scale data warehousing, they need to be space eficient. Existing schemes such as Roaring can compress individual bitmaps but do not consider cross-column compression. In this paper, we introduce CorBit, a technique that leverages column correlations to compress bitmap indexes on a given table. The high-level idea is to only store the bits that need to be flipped (the dif) when encoding the bitmaps of a column that correlates with another column that already has a bitmap index in place. CorBit automatically determines which columns to store an index for and which column bitmaps to dif-encode, minimizing the overall size of the index. Compared to Roaring, CorBit consumes 9.1% less space on the DMV dataset while incurring a 12.6% runtime overhead.

1. Introduction press multiple bitmap indexes defined on a single table.

On a high level, CorBit stores the bit diferences between Bitmap indexes are a well-known indexing technique in two given bitmap indexes and only fully materializes data warehousing. For example, YouTube’s SQL engine, one of the two indexes. We also introduce a new corGoogle Procella, uses Roaring bitmaps for indexing adver- relation metric, Total Reduced Popcnt, that captures the tising experiments. Such indexes can achieve tremendous compression opportunity between two columns. We furspeedups for selective queries, e.g., Procella claims 500× ther propose a greedy algorithm to decide which bitmap faster queries due to such indexes [1]. indexes to materialize and which indexes to dif-encode,

A straightforward implementation of a bitmap index minimizing overall index size. is to store one bitmap per unique column value. For We experiment with three real-world datasets and example, the Germany bitmap would indicate the posi- show that CorBit improves upon Roaring in space while tions (row numbers) of all tuples that have Germany as incurring a small runtime overhead. a column value. To trade precision for footprint, bitmap indexes can be used at the block level and individual column values can be binned together. However, even with 2. Compressing Bitmap Indexes these optimizations, bitmap indexes can consume too much space to use them in large-scale data warehousing. Basic Idea. If two columns in a table are highly correBitmap compression schemes such as Roaring [2, 3, 4] or lated, we only materialize the bitmap index of one colTree-Encoded Bitmaps [5] do very well in compressing umn. For each unique value in the other column, we individual bitmaps while maintaining fast lookup perfor- find the closest bitmap in the materialized column w.r.t. mance, but they do not consider cross-column compres- Hamming distance [6] and only store the diferences in sion opportunities. its bitmap. If the Hamming distance is small, the dif

In this paper, we introduce CorBit, a new bitmap com- encoded bitmaps will be sparse, allowing various bitmap pression scheme that exploits column correlations to com- compression methods such as Roaring or run-length encoding to save more space.

Joint Workshops at 49th International Conference on Very Large Data For example, the two columns Language and Country Bases (VLDBW’23) — Workshop on Applied AI for Database Systems in a user information table are usually highly correlated. and Applications (AIDB’23), August 28 - September 1, 2023, Vancouver, A sample contingency table is shown in Table 1. The Canada skewed frequency counts in the table indicate that if we * Corresponding author. would materialize the bitmaps of the Country column, †Work done while at Massachusetts Institute of Technology. using dif-encoding for the Language column would save p$fexipi.@lyaum@atzuomn..dcoe m(X(.PL.yPufe);ilk);ipdfo@mahmoarnzo@na.cmomazo(An..coKmipf()D; . Horn); space. E.g., for the value German in Language we choose jana.giceva@in.tum.de (J. Giceva); kraska@mit.edu (T. Kraska) the Germany bitmap as reference and store the diference © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License of the two bitmaps, i.e., the XOR of the two bitmaps. In CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) tMoecaosnusriidnegr fCoorroruerlactrioosns.-cTooludmecnidceowmhpircehsscioolunmscnhpeamires, 0.0 2 4 k 6 8 10 2 4 k 6 8 10 we need to measure the correlation between columns. A naïve approach would be to consider all pairs of columns. Figure 1: The efectiveness of correlation metrics More specifically, for each bitmap in one column, we find the closest one in the bitmaps of the other column w.r.t.

Hamming distance [6], and finally sum up all the dis- by using these candidate reference columns and the totances as a measure of correlation. The smaller the total, tal potential space savings obtained by using the actual the higher the correlation between the two columns. optimal candidate reference columns. Thus, we define

Unfortunately, the naïve approach is computationally the efectiveness of metric as expensive. A better method would be to find a suitable ∑︀ caonrdreelficaiteendt cmoelutrmicntphaaitrcs.anThheerlepfoursei,dweenthifayvepoetveanlutiaatlelyd a ,, = ∑︀==11 ((,, *,,)) number of metrics that measure the correlation of categorical features: e.g., Cramer’s V [7], Tschuprow’s T [8], Pearson’s Contingency Coeficient [9], and Approximate Functional Dependency Error [10]. Additionally, we also designed our own metric called Total Reduced Popcnt based on the contingency table. It directly measures how much the total population count (i.e., the number of “1” bits), also known as Hamming weight, of the difencoded bitmaps can be reduced compared to materialization. This metric is defined as where , represents the -th best candidate reference column selected by metric for column , * , represents the -th optimal candidate reference column for column , and (, ) denotes the amount of saved space for when using as a reference column.

Furthermore, the overall efectiveness of metric across the entire relation is defined as the weighted average of the efectiveness on each column. Here we use the size obtained using Roaring encoding as weights for columns, giving higher weight to larger columns.

The results of the experiments that evaluate the efectiveness of all metrics are shown in Figure 1. The key observation is that for all evaluated datasets, the Total Reduced Popcnt metric consistently performs well and as such is a good proxy for potential savings. , = ∑︁ max{0, max (2| =∧=| − | =|)}

∈ ∈ where and are two columns in relation , and or denotes the set of unique values in column or . This metric can be easily calculated based on the contingency table of column and , and is highly positively related to the saved space. This approach only requires traversing the dataset once and then utilizing contingency tables for subsequent calculations.

We have conducted several experiments to evaluate the efectiveness of these metrics in our scenario. We define the efectiveness of a metric as follows: given a parameter , for each column in relation , we select the top columns with the highest correlation metric value as candidate reference columns. We calculate the ratio between the total potential space savings obtained Compressing a Bitmap Index. The previous section focused on measuring the correlation between two columns, but in real-world scenarios, there are often multiple categorical columns involved. What we ultimately need is a Directed Acyclic Graph (DAG) * that represents the dependency relationships between the columns.

Specifically, each node in this DAG represents a column, with at most one outgoing edge pointing to another node to reference its bitmaps to exploit correlations and save space. Nodes without outgoing edges will be directly materialized. For performance reasons, we currently do not consider chained dependencies.

Gender 2MB

4. Run the greedy algorithm to obtain an approxi

mate optimal dependency graph * . 5. Compute the original bitmaps for preserved nodes and dif-encoded bitmaps for preserved edges in * , and persist them using Roaring.

To determine the optimal solution, we annotate the

actual sizes of the materialized column for each node in a complete digraph , with the size of the dif-encoded bitmaps of column when referencing column as cost for each directed edge (, ), as shown in Figure 2. Lookups. To retrieve the bitmap for a specific value in Finally, we can employ dynamic programming to obtain a column, if it is stored as an XOR bitmap, we read both the optimal dependency graph * . the XOR bitmap itself and the referenced bitmap, and

However, the aforementioned approach is excessively perform an XOR operation on them. If the value is stored expensive. Fortunately, we have identified an appropriate as the original bitmap, we can directly read and return it. correlation metric, namely Total Reduced Popcnt, which strongly correlates with saved space. What we annotate here is the Total Popcnt, the sum of the population counts 3. Evaluation of bitmaps of each column, which we can calculate easily.

Once we obtain an annotated complete digraph , we We evaluate our method using three real-world datasets can employ a greedy algorithm to obtain a near-optimal and compare it with Roaring in terms of space consumpsolution. This involves greedily selecting the best edge, tion and lookup latency. which has the highest diference between the node cost Experimental Setup. We conduct our experiments on a and edge cost without violating the mentioned properties. machine with 256 GiB of RAM, an Intel(R) Xeon(R) CPU We add this edge to * and continue the process as long E5-2660 v2 @ 2.20GHz, and a Toshiba MG03ACA100 as it results in a reduction in space consumption. 1 TiB 7200 RPM Hard Drive. We utilize the oficial C++

Finally, we compute and store the original bitmaps and implementation of Roaring1 for serializing, deserializing, XOR bitmaps based on the dependency graph. Since we and XORing Roaring bitmaps. use Total Reduced Popcnt, its intermediate results allow We evaluate our approach using the DMV [11], us to select, for each bitmap, the closest bitmap in its Taxi [12], and Diabetes [13] datasets, as shown in Table referenced column w.r.t. Hamming distance. This is also 2. For each dataset, we select categorical columns with one of the advantages of employing our metric. a proportion of unique values not exceeding 10%. For

For performance considerations, we store a XOR CorBit, we conduct experiments with various threshold bitmap only if it saves of the space compared to the values for parameter ranging from 10% to 95%. original bitmap. We choose 80% as the default value for To measure the average lookup latency, for each the threshold . By adjusting the parameter , users can lookup, we uniformly select a column from the dataset strike a balance between performance and space. To save and then uniformly select a value from that column. We more space, a smaller value of can be chosen, such then ask CorBit or Roaring to return the corresponding as 50%. On the other hand, if performance is of greater bitmap and measure the latency. concern, a larger value of , such as 90%, can be selected.

We experiment with diferent thresholds in Section 3.

In summary, compressing the bitmap index works as follows: Size versus Latency. The size and latency of the CorBit and Roaring compression methods on three real-world datasets are shown in Figure 3. On the DMV dataset (Figure 3a), CorBit ( = 80%) achieves space savings of 9.1% compared to Roaring, with a 12.6% increase in latency. By adjusting the threshold value , we can achieve a maximum space saving of up to 14.6%. However, if performance is more critical, by adjusting to 95% we can still get a space saving of 6.5% while only experiencing a 2.6% increase in latency.

1CRoaring: https://github.com/RoaringBitmap/CRoaring 1. Initialize a contingency table for each pair of

columns. 2. Scan the data once while updating the frequencies

in the contingency tables. 3. Calculate the Total Reduced Popcnt for each pair

of columns based on their contingency tables. 0% 70% 181.3MB 17690.6%MB 18840.6%MB 60

0% 90% 203.0MB 181789.905.M8%MBB 0% 2% 4% Spa6c%eSaving Rate

8% 10% 12% 14% (a) DMV (12,300,116 rows) 2.67MB -2% 0% 2%

On the Taxi dataset (Figure 3b), CorBit ( = 80%) 1.7% more file content compared to Roaring on average. achieved a space savings of 2.4% relative to Roaring, with Therefore, it does not significantly impact performance in similar query latency. Similarly, on the Diabetes dataset terms of disk access and bitmap deserialization. However, (Figure 3c), CorBit ( = 80%) has comparable perfor- CorBit ( = 80%) requires additional XOR operations mance to Roaring with space savings of 7.4%. between the XOR bitmaps and the referenced bitmaps, Size per Column. Table 3 shows the top 5 columns with incurring a latency increase. On average, approximately the highest space savings in the DMV dataset. We observe 7.4 us are spent on XOR operations, accounting for 11.5% that the Scoflaw Indicator references the Revocation of the total latency.

Indicator, and Zip and County reference City. These relationships are reasonable based on the meanings of Table 4 these columns, and CorBit has successfully discovered Latency breakdown on DMV the correlations, leading to space savings. CorBit ( = 80%) Roaring

It is worth noting that there is a strong positive correlation between saving rate and Total Reduced Popcnt. In Component µ s % µ s % addition to demonstrating its efectiveness by the metric Disk access 29.8 46.4% 30.1 52.7% shown in Figure 1, it further substantiates its efective- Roaring deserialization 26.5 41.4% 26.2 46.0% ness when choosing reference columns. XOR operation 7.4 11.5%

For Taxi, the Total Amount column has the highest Sum 64.2 100.0% 57.0 100.0% savings rate, by referencing Tip Amount. For Diabetes, the columns with the highest space savings are Acetohexamide, Glimepiride Pioglitazone, and Metformin Rosiglitazone. These columns reference Metformin Pioglitazone, 4. Conclusions because most of the time, none of the three drugs are We have introduced CorBit, a bitmap index compression prescribed with Metformin Pioglitazone. CorBit uses this scheme that can exploit cross-column correlations. We observation to save space by exploiting the correlation. have shown that CorBit can achieve significant space savLatency Breakdown. To further analyze latency, we ings on three real-world datasets. We have introduced profiled both CorBit ( = 80%) and Roaring and iden- a threshold parameter that allows for trading of comtified the three main components contributing to the pression rate for query performance. In future work, we overall latency: disk access, Roaring bitmap deserializa- plan to extend our work to support other compression tion, and the XOR operations. The breakdown of the time schemes, such as run-length, frame-of-reference, and spent on each component is shown in Table 4. dictionary encoding.

Our observation is that CorBit ( = 80%) reads only Acknowledgments. This research is supported by Google, Intel, and Microsoft as part of DSAIL at MIT, and NSF IIS 1900933. This research was also sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the oficial policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Pascal Pfeil and Dominik Horn were supported by a fellowship within the IFI programme of the German Academic Exchange Service (DAAD). [5] H. Lang, A. Beischl, V. Leis, P. Boncz, T. Neumann,

A. Kemper, Tree-encoded bitmaps, in: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 937–967. [6] R. W. Hamming, Error detecting and error correcting codes, The Bell system technical journal 29 (1950) 147–160. [7] H. Cramér, Mathematical methods of statistics, vol

ume 26, Princeton university press, 1999. [8] H. Rietz, Aa tschuprow, grundbegrife und grund

probleme der korrelationstheorie (1926). [9] K. Pearson, Vii. mathematical contributions to the theory of evolution.—iii. regression, heredity, and panmixia, Philosophical Transactions of the Royal Society of London. Series A, containing papers of a [1] B. Chattopadhyay, P. Dutta, W. Liu, O. Tinn, A. Mc- mathematical or physical character (1896) 253–318. cormick, A. Mokashi, P. Harvey, H. Gonzalez, D. Lo- [10] S. Kruse, F. Naumann, Eficient discovery of apmax, S. Mittal, et al., Procella: Unifying serving and proximate dependencies, Proceedings of the VLDB analytical data at youtube (2019). Endowment 11 (2018) 759–772. [2] S. Chambi, D. Lemire, O. Kaser, R. Godin, Better [11] J. Huang, B. Chen, L. Luo, S. Yue, I. Ounis, Dvmbitmap performance with roaring bitmaps, Soft- car: A large-scale automotive dataset for visual ware: practice and experience 46 (2016) 709–719. ImnaterkrnetaitniognraelseCaorcnhfearnedncaeppolnicaBtiigonDsa,tian:(B20ig22DIaEtEaE), [3] D. Lemire, G. Ssi-Yan-Kai, O. Kaser, Consistently IEEE, 2022, pp. 4140–4147.

faster and smaller compressed bitmaps with roaring, [12] N. Y. C. Taxi, L. C. (TLC), Yellow taxi trip Software: Practice and Experience 46 (2016) 1547– records, https://www1.nyc.gov/site/tlc/about/ 1569. tlc-trip-record-data.page. accessed May 1, 2023, [4] D. Lemire, O. Kaser, N. Kurz, L. Deri, C. O’Hara, 2023.

F. Saint-Jacques, G. Ssi-Yan-Kai, Roaring bitmaps: [13] J. Clore, K. Cios, J. DeShazo, B. Strack, DiImplementation of an optimized software library, abetes 130-US hospitals for years 1999-2008, Software: Practice and Experience 48 (2018) 867– UCI Machine Learning Repository, 2014. DOI: 895. https://doi.org/10.24432/C5230J.