<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CorBit: Leveraging Correlations for Compressing Bitmap Indexes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xi Lyu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Kipf</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Pfeil</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominik Horn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jana Giceva</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Kraska</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon Web Services</institution>
          ,
          <addr-line>Oskar-von-Miller-Ring 20, 80333 München</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Massachusetts Institute of Technology</institution>
          ,
          <addr-line>77 Massachusetts Ave, Cambridge, MA 02139</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technical University of Munich</institution>
          ,
          <addr-line>Boltzmannstr. 3, 85748 Garching bei München</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A bitmap index is a secondary index structure that supports equality and range predicates. In its simplest form, a bitmap index stores one bitmap per unique column value indicating qualifying tuples. To use such indexes in large-scale data warehousing, they need to be space eficient. Existing schemes such as Roaring can compress individual bitmaps but do not consider cross-column compression. In this paper, we introduce CorBit, a technique that leverages column correlations to compress bitmap indexes on a given table. The high-level idea is to only store the bits that need to be flipped (the dif) when encoding the bitmaps of a column that correlates with another column that already has a bitmap index in place. CorBit automatically determines which columns to store an index for and which column bitmaps to dif-encode, minimizing the overall size of the index. Compared to Roaring, CorBit consumes 9.1% less space on the DMV dataset while incurring a 12.6% runtime overhead.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. Introduction
press multiple bitmap indexes defined on a single table.</p>
      <p>On a high level, CorBit stores the bit diferences between
Bitmap indexes are a well-known indexing technique in two given bitmap indexes and only fully materializes
data warehousing. For example, YouTube’s SQL engine, one of the two indexes. We also introduce a new
corGoogle Procella, uses Roaring bitmaps for indexing adver- relation metric, Total Reduced Popcnt, that captures the
tising experiments. Such indexes can achieve tremendous compression opportunity between two columns. We
furspeedups for selective queries, e.g., Procella claims 500× ther propose a greedy algorithm to decide which bitmap
faster queries due to such indexes [1]. indexes to materialize and which indexes to dif-encode,</p>
      <p>A straightforward implementation of a bitmap index minimizing overall index size.
is to store one bitmap per unique column value. For We experiment with three real-world datasets and
example, the Germany bitmap would indicate the posi- show that CorBit improves upon Roaring in space while
tions (row numbers) of all tuples that have Germany as incurring a small runtime overhead.
a column value. To trade precision for footprint, bitmap
indexes can be used at the block level and individual
column values can be binned together. However, even with 2. Compressing Bitmap Indexes
these optimizations, bitmap indexes can consume too
much space to use them in large-scale data warehousing. Basic Idea. If two columns in a table are highly
correBitmap compression schemes such as Roaring [2, 3, 4] or lated, we only materialize the bitmap index of one
colTree-Encoded Bitmaps [5] do very well in compressing umn. For each unique value in the other column, we
individual bitmaps while maintaining fast lookup perfor- find the closest bitmap in the materialized column w.r.t.
mance, but they do not consider cross-column compres- Hamming distance [6] and only store the diferences in
sion opportunities. its bitmap. If the Hamming distance is small, the
dif</p>
      <p>In this paper, we introduce CorBit, a new bitmap com- encoded bitmaps will be sparse, allowing various bitmap
pression scheme that exploits column correlations to com- compression methods such as Roaring or run-length
encoding to save more space.</p>
      <p>Joint Workshops at 49th International Conference on Very Large Data For example, the two columns Language and Country
Bases (VLDBW’23) — Workshop on Applied AI for Database Systems in a user information table are usually highly correlated.
and Applications (AIDB’23), August 28 - September 1, 2023, Vancouver, A sample contingency table is shown in Table 1. The
Canada skewed frequency counts in the table indicate that if we
* Corresponding author. would materialize the bitmaps of the Country column,
†Work done while at Massachusetts Institute of Technology. using dif-encoding for the Language column would save
p$fexipi.@lyaum@atzuomn..dcoe m(X(.PL.yPufe);ilk);ipdfo@mahmoarnzo@na.cmomazo(An..coKmipf()D; . Horn); space. E.g., for the value German in Language we choose
jana.giceva@in.tum.de (J. Giceva); kraska@mit.edu (T. Kraska) the Germany bitmap as reference and store the diference
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License of the two bitmaps, i.e., the XOR of the two bitmaps. In
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org)
tMoecaosnusriidnegr fCoorroruerlactrioosns.-cTooludmecnidceowmhpircehsscioolunmscnhpeamires, 0.0 2 4 k 6 8 10 2 4 k 6 8 10
we need to measure the correlation between columns. A
naïve approach would be to consider all pairs of columns. Figure 1: The efectiveness of correlation metrics
More specifically, for each bitmap in one column, we find
the closest one in the bitmaps of the other column w.r.t.</p>
      <p>Hamming distance [6], and finally sum up all the dis- by using these candidate reference columns and the
totances as a measure of correlation. The smaller the total, tal potential space savings obtained by using the actual
the higher the correlation between the two columns. optimal  candidate reference columns. Thus, we define</p>
      <p>Unfortunately, the naïve approach is computationally the efectiveness of metric  as
expensive. A better method would be to find a suitable ∑︀
caonrdreelficaiteendt cmoelutrmicntphaaitrcs.anThheerlepfoursei,dweenthifayvepoetveanlutiaatlelyd a ,, = ∑︀==11 ((,, *,,))
number of metrics that measure the correlation of
categorical features: e.g., Cramer’s V [7], Tschuprow’s T [8],
Pearson’s Contingency Coeficient [9], and Approximate
Functional Dependency Error [10]. Additionally, we also
designed our own metric called Total Reduced Popcnt
based on the contingency table. It directly measures
how much the total population count (i.e., the number
of “1” bits), also known as Hamming weight, of the
difencoded bitmaps can be reduced compared to
materialization. This metric is defined as
where , represents the -th best candidate reference
column selected by metric  for column , * ,
represents the -th optimal candidate reference column for
column , and (,  ) denotes the amount of saved
space for  when using  as a reference column.</p>
      <p>Furthermore, the overall efectiveness of metric 
across the entire relation  is defined as the weighted
average of the efectiveness on each column. Here we
use the size obtained using Roaring encoding as weights
for columns, giving higher weight to larger columns.</p>
      <p>The results of the experiments that evaluate the
efectiveness of all metrics are shown in Figure 1. The key
observation is that for all evaluated datasets, the Total
Reduced Popcnt metric consistently performs well and as
such is a good proxy for potential savings.
 , = ∑︁ max{0, max (2| =∧=| − |  =|)}</p>
      <p>∈ ∈
where  and  are two columns in relation , and 
or  denotes the set of unique values in column  or
. This metric can be easily calculated based on the
contingency table of column  and , and is highly
positively related to the saved space. This approach only
requires traversing the dataset once and then utilizing
contingency tables for subsequent calculations.</p>
      <p>We have conducted several experiments to evaluate
the efectiveness of these metrics in our scenario. We
define the efectiveness of a metric as follows: given a
parameter , for each column  in relation , we select
the top  columns with the highest correlation metric
value as candidate reference columns. We calculate the
ratio between the total potential space savings obtained
Compressing a Bitmap Index. The previous
section focused on measuring the correlation between two
columns, but in real-world scenarios, there are often
multiple categorical columns involved. What we ultimately
need is a Directed Acyclic Graph (DAG) * that
represents the dependency relationships between the columns.</p>
      <p>Specifically, each node in this DAG represents a column,
with at most one outgoing edge pointing to another node
to reference its bitmaps to exploit correlations and save
space. Nodes without outgoing edges will be directly
materialized. For performance reasons, we currently do
not consider chained dependencies.</p>
      <p>Gender
2MB</p>
    </sec>
    <sec id="sec-2">
      <title>4. Run the greedy algorithm to obtain an approxi</title>
      <p>mate optimal dependency graph * .
5. Compute the original bitmaps for preserved
nodes and dif-encoded bitmaps for preserved
edges in * , and persist them using Roaring.</p>
    </sec>
    <sec id="sec-3">
      <title>To determine the optimal solution, we annotate the</title>
      <p>actual sizes of the materialized column for each node in
a complete digraph , with the size of the dif-encoded
bitmaps of column  when referencing column  as
cost for each directed edge (, ), as shown in Figure 2. Lookups. To retrieve the bitmap for a specific value in
Finally, we can employ dynamic programming to obtain a column, if it is stored as an XOR bitmap, we read both
the optimal dependency graph * . the XOR bitmap itself and the referenced bitmap, and</p>
      <p>However, the aforementioned approach is excessively perform an XOR operation on them. If the value is stored
expensive. Fortunately, we have identified an appropriate as the original bitmap, we can directly read and return it.
correlation metric, namely Total Reduced Popcnt, which
strongly correlates with saved space. What we annotate
here is the Total Popcnt, the sum of the population counts 3. Evaluation
of bitmaps of each column, which we can calculate easily.</p>
      <p>Once we obtain an annotated complete digraph , we We evaluate our method using three real-world datasets
can employ a greedy algorithm to obtain a near-optimal and compare it with Roaring in terms of space
consumpsolution. This involves greedily selecting the best edge, tion and lookup latency.
which has the highest diference between the node cost Experimental Setup. We conduct our experiments on a
and edge cost without violating the mentioned properties. machine with 256 GiB of RAM, an Intel(R) Xeon(R) CPU
We add this edge to * and continue the process as long E5-2660 v2 @ 2.20GHz, and a Toshiba MG03ACA100
as it results in a reduction in space consumption. 1 TiB 7200 RPM Hard Drive. We utilize the oficial C++</p>
      <p>Finally, we compute and store the original bitmaps and implementation of Roaring1 for serializing, deserializing,
XOR bitmaps based on the dependency graph. Since we and XORing Roaring bitmaps.
use Total Reduced Popcnt, its intermediate results allow We evaluate our approach using the DMV [11],
us to select, for each bitmap, the closest bitmap in its Taxi [12], and Diabetes [13] datasets, as shown in Table
referenced column w.r.t. Hamming distance. This is also 2. For each dataset, we select categorical columns with
one of the advantages of employing our metric. a proportion of unique values not exceeding 10%. For</p>
      <p>For performance considerations, we store a XOR CorBit, we conduct experiments with various threshold
bitmap only if it saves  of the space compared to the values for parameter  ranging from 10% to 95%.
original bitmap. We choose 80% as the default value for To measure the average lookup latency, for each
the threshold . By adjusting the parameter , users can lookup, we uniformly select a column from the dataset
strike a balance between performance and space. To save and then uniformly select a value from that column. We
more space, a smaller value of  can be chosen, such then ask CorBit or Roaring to return the corresponding
as 50%. On the other hand, if performance is of greater bitmap and measure the latency.
concern, a larger value of , such as 90%, can be selected.</p>
      <p>We experiment with diferent thresholds in Section 3.</p>
      <p>In summary, compressing the bitmap index works as
follows:
Size versus Latency. The size and latency of the CorBit
and Roaring compression methods on three real-world
datasets are shown in Figure 3. On the DMV dataset
(Figure 3a), CorBit ( = 80%) achieves space savings of
9.1% compared to Roaring, with a 12.6% increase in
latency. By adjusting the threshold value , we can achieve
a maximum space saving of up to 14.6%. However, if
performance is more critical, by adjusting  to 95% we can
still get a space saving of 6.5% while only experiencing a
2.6% increase in latency.</p>
      <p>1CRoaring: https://github.com/RoaringBitmap/CRoaring
1. Initialize a contingency table for each pair of</p>
      <p>columns.
2. Scan the data once while updating the frequencies</p>
      <p>in the contingency tables.
3. Calculate the Total Reduced Popcnt for each pair</p>
      <p>of columns based on their contingency tables.
0%
70%
181.3MB 17690.6%MB
18840.6%MB
60</p>
      <p>0%
90%
203.0MB 181789.905.M8%MBB
0% 2% 4% Spa6c%eSaving Rate</p>
      <p>8% 10% 12% 14%
(a) DMV (12,300,116 rows)
2.67MB
-2% 0%
2%</p>
      <p>On the Taxi dataset (Figure 3b), CorBit ( = 80%) 1.7% more file content compared to Roaring on average.
achieved a space savings of 2.4% relative to Roaring, with Therefore, it does not significantly impact performance in
similar query latency. Similarly, on the Diabetes dataset terms of disk access and bitmap deserialization. However,
(Figure 3c), CorBit ( = 80%) has comparable perfor- CorBit ( = 80%) requires additional XOR operations
mance to Roaring with space savings of 7.4%. between the XOR bitmaps and the referenced bitmaps,
Size per Column. Table 3 shows the top 5 columns with incurring a latency increase. On average, approximately
the highest space savings in the DMV dataset. We observe 7.4 us are spent on XOR operations, accounting for 11.5%
that the Scoflaw Indicator references the Revocation of the total latency.</p>
      <p>Indicator, and Zip and County reference City. These
relationships are reasonable based on the meanings of Table 4
these columns, and CorBit has successfully discovered Latency breakdown on DMV
the correlations, leading to space savings. CorBit ( = 80%) Roaring</p>
      <p>It is worth noting that there is a strong positive
correlation between saving rate and Total Reduced Popcnt. In Component µ s % µ s %
addition to demonstrating its efectiveness by the metric Disk access 29.8 46.4% 30.1 52.7%
shown in Figure 1, it further substantiates its efective- Roaring deserialization 26.5 41.4% 26.2 46.0%
ness when choosing reference columns. XOR operation 7.4 11.5%</p>
      <p>For Taxi, the Total Amount column has the highest Sum 64.2 100.0% 57.0 100.0%
savings rate, by referencing Tip Amount. For Diabetes,
the columns with the highest space savings are
Acetohexamide, Glimepiride Pioglitazone, and Metformin
Rosiglitazone. These columns reference Metformin Pioglitazone, 4. Conclusions
because most of the time, none of the three drugs are We have introduced CorBit, a bitmap index compression
prescribed with Metformin Pioglitazone. CorBit uses this scheme that can exploit cross-column correlations. We
observation to save space by exploiting the correlation. have shown that CorBit can achieve significant space
savLatency Breakdown. To further analyze latency, we ings on three real-world datasets. We have introduced
profiled both CorBit (  = 80%) and Roaring and iden- a threshold parameter that allows for trading of
comtified the three main components contributing to the pression rate for query performance. In future work, we
overall latency: disk access, Roaring bitmap deserializa- plan to extend our work to support other compression
tion, and the XOR operations. The breakdown of the time schemes, such as run-length, frame-of-reference, and
spent on each component is shown in Table 4. dictionary encoding.</p>
      <p>Our observation is that CorBit ( = 80%) reads only
Acknowledgments. This research is supported by Google, Intel,
and Microsoft as part of DSAIL at MIT, and NSF IIS 1900933. This
research was also sponsored by the United States Air Force Research
Laboratory and the United States Air Force Artificial Intelligence
Accelerator and was accomplished under Cooperative Agreement
Number FA8750-19-2-1000. The views and conclusions contained
in this document are those of the authors and should not be
interpreted as representing the oficial policies, either expressed or
implied, of the United States Air Force or the U.S. Government. The
U.S. Government is authorized to reproduce and distribute reprints
for Government purposes notwithstanding any copyright notation
herein. Pascal Pfeil and Dominik Horn were supported by a
fellowship within the IFI programme of the German Academic Exchange
Service (DAAD).
[5] H. Lang, A. Beischl, V. Leis, P. Boncz, T. Neumann,</p>
      <p>A. Kemper, Tree-encoded bitmaps, in: Proceedings
of the 2020 ACM SIGMOD International
Conference on Management of Data, 2020, pp. 937–967.
[6] R. W. Hamming, Error detecting and error
correcting codes, The Bell system technical journal 29
(1950) 147–160.
[7] H. Cramér, Mathematical methods of statistics,
vol</p>
      <p>ume 26, Princeton university press, 1999.
[8] H. Rietz, Aa tschuprow, grundbegrife und
grund</p>
      <p>probleme der korrelationstheorie (1926).
[9] K. Pearson, Vii. mathematical contributions to the
theory of evolution.—iii. regression, heredity, and
panmixia, Philosophical Transactions of the Royal
Society of London. Series A, containing papers of a
[1] B. Chattopadhyay, P. Dutta, W. Liu, O. Tinn, A. Mc- mathematical or physical character (1896) 253–318.
cormick, A. Mokashi, P. Harvey, H. Gonzalez, D. Lo- [10] S. Kruse, F. Naumann, Eficient discovery of
apmax, S. Mittal, et al., Procella: Unifying serving and proximate dependencies, Proceedings of the VLDB
analytical data at youtube (2019). Endowment 11 (2018) 759–772.
[2] S. Chambi, D. Lemire, O. Kaser, R. Godin, Better [11] J. Huang, B. Chen, L. Luo, S. Yue, I. Ounis,
Dvmbitmap performance with roaring bitmaps, Soft- car: A large-scale automotive dataset for visual
ware: practice and experience 46 (2016) 709–719. ImnaterkrnetaitniognraelseCaorcnhfearnedncaeppolnicaBtiigonDsa,tian:(B20ig22DIaEtEaE),
[3] D. Lemire, G. Ssi-Yan-Kai, O. Kaser, Consistently IEEE, 2022, pp. 4140–4147.</p>
      <p>faster and smaller compressed bitmaps with roaring, [12] N. Y. C. Taxi, L. C. (TLC), Yellow taxi trip
Software: Practice and Experience 46 (2016) 1547– records, https://www1.nyc.gov/site/tlc/about/
1569. tlc-trip-record-data.page. accessed May 1, 2023,
[4] D. Lemire, O. Kaser, N. Kurz, L. Deri, C. O’Hara, 2023.</p>
      <p>F. Saint-Jacques, G. Ssi-Yan-Kai, Roaring bitmaps: [13] J. Clore, K. Cios, J. DeShazo, B. Strack,
DiImplementation of an optimized software library, abetes 130-US hospitals for years 1999-2008,
Software: Practice and Experience 48 (2018) 867– UCI Machine Learning Repository, 2014. DOI:
895. https://doi.org/10.24432/C5230J.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>