Redesigning Query Engines for White-box Compression

Diego Tomé
tome@cwi.nl
CWI Amsterdam, NL
Supervised by Peter Boncz




Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020, Tokyo, Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Modern columnar databases heavily use compression to reduce memory footprint and boost query execution. These techniques, however, are implemented as a "black box", since their decompression logic is hard-coded and part of the table scan infrastructure. We proposed a novel compression model called white-box compression that views compression actions as functions over the physical columns stored in a block. Because these functions become visible as expressions in the query plan, many more optimizations can be made by the database system, boosting query execution speed. These functions are learnt from the data and also allow the data to be stored much more compactly, by decomposing string values, storing data in appropriate data types automatically, and exploiting correlations between columns.

White-box compression opens up a whole new set of research questions. We started with (1) How to learn white-box compression expressions (functions) from the data automatically? This Ph.D. research will subsequently study (2) How to leverage white-box compression with (run-time) query optimizations? (3) How can we integrate white-box compression in a query engine, if the white-box functions may be different for each block of data?
1.   INTRODUCTION

Data compression is widely used in analytical databases to reduce data storage size as well as data transfer sizes (over the network, disk, RAM), and to provide faster query execution. It is especially effective in columnar databases, where data pertaining to the same column is stored contiguously, because compression algorithms perform better on data with low information entropy [3]. While data transfer can benefit from the improved compression ratio in columnar databases, query execution might suffer from slow decompression, requiring careful consideration of which compression technique should be applied.
The literature provides a number of compression techniques for databases. On the one hand, there are general-purpose compression methods based on Huffman coding [12], arithmetic coding [19], and Lempel-Ziv [20]. While they achieve good compression ratios, encoding/decoding speeds are relatively low, typically impacting query performance. For this reason, columnar databases rely on compression methods that are more lightweight, such as Run-Length Encoding (RLE), Frame-of-Reference (FOR), and Dictionary compression (DICT) [3]. These lightweight schemes take into account some knowledge of the data type and distribution, also resulting in good compression ratios but with much higher (de)compression speed.
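To make the distinction concrete, the following minimal Python sketch (ours alone, purely illustrative; no system uses exactly this code, and the list-based representations are our own choices) shows two of the lightweight schemes named above, DICT and RLE:

    def dict_encode(values):
        # Dictionary compression: map each distinct value to a small code.
        d = {}
        codes = [d.setdefault(v, len(d)) for v in values]
        return codes, {code: v for v, code in d.items()}

    def dict_decode(codes, dictionary):
        return [dictionary[c] for c in codes]

    def rle_encode(values):
        # Run-Length Encoding: store (run length, value) pairs.
        runs = []
        for v in values:
            if runs and runs[-1][1] == v:
                runs[-1][0] += 1
            else:
                runs.append([1, v])
        return runs

    def rle_decode(runs):
        return [v for count, v in runs for _ in range(count)]

    assert dict_decode(*dict_encode(["a", "b", "a"])) == ["a", "b", "a"]
    assert rle_decode(rle_encode([7, 7, 7, 3])) == [7, 7, 7, 3]

Both decoders are simple, branch-light loops, which is why such schemes decompress far faster than general-purpose codecs.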
A limitation of the state of the art is that these existing techniques are implemented as a "black box" in current database systems: the decompression logic is hidden inside the scan. The query plan is not aware of any decompression step, and the current approach is to eagerly decompress all the data in the scan, keeping the execution engine oblivious of the compression techniques, wasting optimization opportunities such as predicate evaluation over partially decompressed data. Furthermore, we observed that in real-life datasets, data is often encoded in wrong data types (typically as strings), contains codes that combine string and numerical parts (to which lightweight compression is not applicable), and/or has highly correlated columns [9]. Regarding the latter, columnar formats like Parquet compress datasets with strong column correlations worse than row formats such as Avro, because of the general-purpose compression typically slapped on top of these formats (in a row format, the co-located redundancies in a row get compressed away). Column stores store each column independently and lose this opportunity.
With white-box compression we proposed a completely new framework for compression in database systems. A table is a set of logical columns that is reconstructed by applying data-dependent functions over so-called physical columns that are actually stored. The compression function is therefore itself data (meta-data), stored in the block header. This function is learnt from the data when writing it into blocks. We showed in [9] that this is feasible, that it greatly reduces storage size, and that it provides interesting query optimization opportunities.
However, we argue that to make this new idea usable in data systems, we need to explore new query optimization opportunities and redesign the query engine. White-box compression not only introduces unexplored opportunities for selection push-down, late decompression, and compressed execution; it also brings a significant challenge: when a block of data is
written to disk, the white-box compression model is learnt depending on the characteristics of that block of data. As such, each block may use a (slightly) different compressed representation. This already used to be the case in traditional compression, but since black-box compression hides the compression from the query engine, this is only felt in the scan. With white-box compression, the compression functions, which are computational query plan expressions (in our current model; they could become even more adventurous as follow-up work), will change continuously during query execution, whenever data from a new block is processed. This calls for a database system that continuously re-optimizes its query plans and adapts them to the current characteristics of the data.
Paper Structure. The rest of this paper is structured as follows. Section 2 provides an overview of related work. In Section 3, we describe the white-box compression model and discuss possible solutions for the challenges that arise. Section 4 covers white-box decompression and its integration in the query engine. Finally, in Section 5, we discuss our research plan.
2.   RELATED WORK

The problem of efficient compression in database systems has received significant attention over the years [10, 16, 21, 3]. As a result, compression has been well explored in columnar databases for efficient query processing [21, 3, 5, 11, 15, 8, 14]. In these databases, one can achieve a good balance between compression ratio and performance with lightweight techniques.

Lightweight techniques enabled advances in exploiting compressed data during the data processing pipeline, the so-called compressed execution [3, 4]. This brought performance improvements by allowing the push-down of predicates to compressed data in the scan operator [18]. In more advanced analytical systems like Hyper [14], data is stored in self-contained blocks, allowing predicate push-down and selective scans on compressed data. Nevertheless, they have to sacrifice storage by keeping a byte-addressable representation.

Decompression beyond scans has also been explored [6], by re-compressing the data in between operators. The goal is to reduce the memory footprint of large intermediates, but it is limited to the column-at-a-time processing model. In white-box compression, decompression is part of the query plan, which allows the optimizer to delay decompression and push down predicates to partially decompressed data.

Google recently described its query engine Procella and its new big-data file format Artus [7]. Artus introduces customized versions of lightweight techniques like RLE, Delta, and Dictionary encoding. In the Artus API, RLE columns and dictionary indices are directly exposed to the query engine, allowing the engine to aggressively push computations down to the data format. In white-box compression, we also want to expose the compressed data to the query engine. Rather than implementing FOR explicitly, white-box compression sees FOR as just another function, one that adds a constant base to a physical column holding a small integer. RLE in white-box compression stores a logical column as two (much shorter) physical columns holding (count, value). On top of that, white-box compression allows much more complex data transformations, as well as the exploitation of column correlations (e.g. different dictionary-encoded columns using the same physical column holding the codes).
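To make this concrete, here is a minimal sketch (ours alone, assuming nothing about Artus internals or our eventual implementation beyond the description above) of FOR and RLE written as functions over physical columns, visible to the engine as plan expressions:

    # FOR: the logical column is a constant base added to a small-integer
    # physical column; the engine sees this as an expression, not a codec.
    def const(value, n):
        return [value] * n

    def add(xs, ys):
        return [x + y for x, y in zip(xs, ys)]

    P = [3, 5, 4]                                  # physical small ints
    logical_for = add(const(10000, len(P)), P)     # [10003, 10005, 10004]

    # RLE: one logical column stored as two much shorter physical columns.
    def rle(counts, values):
        return [v for c, v in zip(counts, values) for _ in range(c)]

    logical_rle = rle([3, 1], ["A", "B"])          # ["A", "A", "A", "B"]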
More recent research in compression has moved towards a storage method for decomposing string attributes in column stores [13]. This decomposition of strings relies on finding patterns to split each string into segments and compressing them independently. White-box compression, on the other hand, is a more generic model: it is not restricted to strings and, as mentioned, can leverage column correlations. Furthermore, [13] restricts query execution optimizations to the scan library, while in our approach compression functions become computational expressions that are part of the query plan. Making this work over blocks that are compressed differently is one of the main challenges addressed in this thesis.
3.   WHITE-BOX COMPRESSION

In white-box compression, we define an operator as a function o that takes as input zero or more columns and optional metadata information and outputs a column: o : [C × C × ...] × [M] → C. The domain of o is composed of columns and metadata and the co-domain is a set of columns. A column is defined by its data type and the values that it contains. All the values that do not fit the chosen data representation are considered exceptions and receive special treatment. In the model, these values are stored separately in physical exception columns that can themselves be further white-box compressed.
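In code terms, this operator signature could be sketched as follows (a minimal sketch; the Column and Operator names and the exception representation are our own illustrative choices, not the thesis' actual types):

    from dataclasses import dataclass, field
    from typing import Any, Callable, Sequence

    @dataclass
    class Column:
        dtype: str
        values: list
        # Values that do not fit dtype go to a physical exception column,
        # which can itself be white-box compressed further.
        exceptions: dict = field(default_factory=dict)  # position -> value

    # o : [C x C x ...] x [M] -> C
    Operator = Callable[[Sequence[Column], Any], Column]

    def map_op(cols: Sequence[Column], metadata: Any) -> Column:
        # Example operator: dictionary lookup; metadata is the dictionary.
        (codes,) = cols
        return Column("varchar", [metadata[c] for c in codes.values])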
Logical columns                                        Physical columns
  A             B                                        P   Q
  "GSA_8350"    "GENERAL SERVICES ADMINISTRATION"        0   8350
  "GSA_8351"    "GENERAL SERVICES ADMINISTRATION"        0   8351
  "HHS_2072"    "HEALTH AND HUMAN SERVICES"              1   2072
  "TREAS_4791"  "TREASURY"                               2   4791
  "TREAS_4792"  "TREASURY"                               2   4792
  "HHS_2073"    "HEALTH AND HUMAN SERVICES"              1   2073
  "GSA_8352"    "GENERAL SERVICES ADMINISTRATION"        0   8352

  Decompression functions:
    A = concat(map(P, dictAP), const("_"), format(Q, "%d"))
    B = map(P, dictBP)

  Dictionary AP: 0 "GSA", 1 "HHS", 2 "TREAS"
  Dictionary BP: 0 "GENERAL SERVICES ADMINISTRATION",
                 1 "HEALTH AND HUMAN SERVICES", 2 "TREASURY"

Figure 1: White-box compression model applied on logical columns A and B.

Figure 1 illustrates a table compressed with white-box compression. Column A has a particular pattern composed of a string prefix, an underscore character, and a number. In this case, it is not possible to compress this column with dictionary encoding because it has a high cardinality. Therefore, the only way of compressing this column would be using a heavyweight technique like LZ4 or some other variant, which has the problem of slow de/compression.
For cases not covered by any lightweight technique, the white-box model can enable compression by changing the physical representation of columns. For instance, column A can be decomposed into three other columns that can be further compressed: the prefix string can now be stored using dictionary AP, the underscore as a constant, and the number can be stored as an integer that can be further compressed with some other lightweight technique.
Another compression opportunity arises when column B is stored using dictionary BP. In this case, we are able to represent one column as a function of another (i.e. column correlation) by identifying the same association between dictionaries AP and BP. As a result, only one physical column (i.e. P) is stored, together with the two dictionaries and the decompression functions to reconstruct the logical columns.
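A runnable sketch of exactly this reconstruction, with the operators of Figure 1 written as plain Python functions (the operator names come from the figure; the implementations are our own):

    P = [0, 0, 1, 2, 2, 1, 0]
    Q = [8350, 8351, 2072, 4791, 4792, 2073, 8352]
    dictAP = {0: "GSA", 1: "HHS", 2: "TREAS"}
    dictBP = {0: "GENERAL SERVICES ADMINISTRATION",
              1: "HEALTH AND HUMAN SERVICES",
              2: "TREASURY"}

    def map_(col, d):        return [d[v] for v in col]
    def const(v, n):         return [v] * n
    def format_(col, fmt):   return [fmt % v for v in col]
    def concat(*cols):       return ["".join(parts) for parts in zip(*cols)]

    # A = concat(map(P, dictAP), const("_"), format(Q, "%d"))
    A = concat(map_(P, dictAP), const("_", len(P)), format_(Q, "%d"))
    # B = map(P, dictBP): B is a function of the same physical column P.
    B = map_(P, dictBP)

    assert A[0] == "GSA_8350"
    assert B[0] == "GENERAL SERVICES ADMINISTRATION"

Note how the correlation pays off: the seven-row table needs only one code column P, one small-integer column Q, and two tiny dictionaries.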
Column Correlations. The particular focus of this work on correlations is based on their frequent occurrence in real-world datasets. We have recently introduced the Public BI benchmark [1], a real-world dataset derived from 46 of the biggest Tableau workbooks [2]. While exploring this data, we noticed that it tends to comprise patterns not found in synthetic database benchmarks like TPC-H and TPC-DS. In this dataset, data is often skewed in terms of value and frequency distribution, and it is correlated across columns. In particular, most of the correlations are between nominal values (i.e. strings).
                                                                 Q. In this scenario, the query optimizer is able to push-down
   In our work [9], we formally define the white-box compres-
                                                                 predicates straight to physical columns, avoiding unneces-
sion model and propose a learning algorithm to identify pat-
                                                                 sary decompression steps. The challenge in this approach is
terns on the data. Our initial approach already doubles the
                                                                 the black-box decompression steps still present, which lim-
compression factor on the public BI benchmark [1]. White-
                                                                 its the ability of the execution engine to operate over the
box compression is very effective here because this data has
                                                                 physical columns in compressed format.
many string columns. In Table 1 we show the data type dis-
                                                                    In the right part of Figure 2, we illustrate the optimized
tribution for logical and physical columns on the public BI
                                                                 version of the query plan fused with decompression. The
benchmark. Thanks to white-box compression the volume
                                                                 optimizer can adapt the plan by pruning some decompres-
of strings is reduced and the dataset becomes more com-
                                                                 sion steps and pushing-down predicates to the partially de-
pressible. Integer columns are stored in smaller types while
                                                                 compressed data. The column compressed with frame-of-
boolean columns are represented as constant operations in-
                                                                 reference becomes also a white-box representation repre-
side an expression. Correlated columns play an important
                                                                 sented by a column of values and the reference with dif-
role since we noticed a reduction of 70% in the number of
                                                                 ference operator.
columns after white-box compression.
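The learning algorithm itself is described in [9]; as a hedged illustration only, the kind of pattern detection involved for a column like A could look like this regex-based split (the pattern and the exception threshold are invented for the example):

    import re

    PATTERN = re.compile(r"^([A-Z]+)_([0-9]+)$")

    def detect_split(sample, max_exceptions=0.05):
        # Does this column sample fit the '<prefix>_<number>' shape of
        # column A in Figure 1? If (almost) all values match, return the
        # split physical columns; non-matching values become exceptions.
        prefixes, numbers, exceptions = [], [], {}
        for i, v in enumerate(sample):
            m = PATTERN.match(v)
            if m:
                prefixes.append(m.group(1))
                numbers.append(int(m.group(2)))
            else:
                exceptions[i] = v
        if len(exceptions) <= max_exceptions * len(sample):
            return prefixes, numbers, exceptions
        return None

    print(detect_split(["GSA_8350", "HHS_2072", "TREAS_4791"]))
    # -> (['GSA', 'HHS', 'TREAS'], [8350, 2072, 4791], {})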
4.   WHITE-BOX DECOMPRESSION

In the previous section we showed how white-box compression is defined and its main advantages over black-box compression. We now discuss fast decompression and its integration in the query plan.

In the white-box model, the decompression functions to rebuild the logical columns are stored in the block metadata. Decompression will be implemented in two phases. In phase 1, the compressed data is first unpacked using fast SIMD codecs into byte-addressable physical columns. This partially decompressed representation can be kept as columnar vectors in the query plan as long as needed; once an operator like a join, group-by, or projection requires the logical representation, the second phase of the decompression is performed (lazy decompression). Many operations can thus take place on data still in (partially) compressed form. The second phase is the full decompression of physical columns into logical columns.
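A small sketch of the two phases (the block and codec shapes are our own; a real engine would use SIMD codecs in phase 1, not plain Python):

    def rle_decode(runs):
        return [v for count, v in runs for _ in range(count)]

    # Phase 1: unpack compressed bytes into byte-addressable physical
    # columns (here, Python lists stand in for SIMD-decoded vectors).
    def phase1_unpack(block):
        return {name: decode(data)
                for name, (decode, data) in block["physical"].items()}

    # Phase 2: evaluate the white-box expression to rebuild a logical
    # column, only when an operator actually needs it (lazy).
    def phase2_reconstruct(physical, expr):
        return expr(physical)

    block = {"physical": {"P": (rle_decode, [(3, 0), (2, 1)])}}
    physical = phase1_unpack(block)          # {"P": [0, 0, 0, 1, 1]}
    dictBP = {0: "GENERAL SERVICES ADMINISTRATION",
              1: "HEALTH AND HUMAN SERVICES"}
    B = phase2_reconstruct(physical, lambda p: [dictBP[v] for v in p["P"]])

Selections and other cheap operators can run between the two phases, on the partially decompressed physical vectors.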
[Figure 2 (diagram): two query plans for the query SELECT t2.B, t1.C FROM t1, t2 WHERE t2.B = t1.C AND A LIKE 'TREAS%'. The left panel ("White-box decompression") black-box decodes physical columns X, P, Q (LZ4, FOR, DICT) and evaluates the LIKE predicate on the reconstructed logical column A; the right panel ("Adaptive White-box decompression") pushes the predicate down as a selection σ(P=2) on the physical column and prunes decompression steps, with decompression logic differing per block.]

Figure 2: Fusing decompression and query execution on top of white-box compressed data.

Figure 2 depicts our proposal for decompression and query execution on a white-box representation. In the left part, we show an unoptimized approach where the physical columns are black-box decompressed into columns X, P, and Q. In this scenario, the query optimizer is able to push down predicates straight to physical columns, avoiding unnecessary decompression steps. The problem with this approach is that black-box decompression steps are still present, which limits the ability of the execution engine to operate over the physical columns in compressed format.

In the right part of Figure 2, we illustrate the optimized version of the query plan, fused with decompression. The optimizer can adapt the plan by pruning some decompression steps and pushing down predicates to the partially decompressed data. The column compressed with frame-of-reference also becomes a white-box representation, represented by a column of values and the reference, with a difference operator.
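As an illustration of the σ(P=2) push-down of Figure 2 (a sketch using our own representation of dictionaries and selection vectors):

    # A LIKE 'TREAS%' can be evaluated once on the tiny dictionary and
    # rewritten into a cheap integer test on the physical column P.
    def pushdown_like(dictionary, prefix):
        return {code for code, s in dictionary.items()
                if s.startswith(prefix)}

    dictAP = {0: "GSA", 1: "HHS", 2: "TREAS"}
    matching = pushdown_like(dictAP, "TREAS")      # {2}

    P = [0, 0, 1, 2, 2, 1, 0]
    selection = [i for i, code in enumerate(P) if code in matching]
    # selection == [3, 4]: equivalent to the sigma(P = 2) selection in
    # the optimized plan, with no string reconstruction at all.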
Adaptive Query Processing. In [9], we considered a simple approach in which the entire logical column would use the same decompression function for the whole table. If these functions may change for each block of data, integration of white-box compression in the query engine becomes
trickier: whenever a block of data is read, we would have to stop query execution, re-instantiate a new query plan, and re-optimize. We argue that a vectorized engine is more likely to succeed in quickly handling such changes than an alternative approach with a JIT-compiled engine, which would introduce recompilation latency for each new data block. To further save time, we propose to split query optimization into a main strategical phase that is executed once before execution, and lightweight tactical re-optimization that is performed whenever a new data block is brought in, adapting the strategic query execution plan to it.
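A sketch of this two-level scheme (all function names and the plan representation are ours, not the thesis' actual engine):

    # Strategic phase: run once per query (join ordering etc.).
    def strategic_optimize(query):
        return {"query": query, "join_order": ["t1", "t2"]}

    # Tactical phase: run per block, cheap; adapts the plan to the
    # decompression expressions found in this block's header.
    def tactical_optimize(plan, block_exprs):
        return dict(plan, exprs=block_exprs)

    def execute(query, blocks):
        plan = strategic_optimize(query)
        for block in blocks:
            tactical = tactical_optimize(plan, block["header"]["exprs"])
            for vector in block["vectors"]:
                pass  # vectorized operators process (partially)
                      # compressed vectors here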
5.   RESEARCH PLAN

We believe white-box compression is the foundational idea for a next generation of database engines, where compressed columnar execution is critical to leverage powerful SIMD units. It also unlocks many optimization possibilities by learning data representations from the data, and it is less vulnerable to the ill-designed database layouts that are often observed in cloud usage situations.

We describe the following as the aspects that shall be explored in this research during the next two years:
(De)compression Library: We start with developing a basic library for compression and decompression of data in the white-box representation. Our first goal is to keep this library independent of any database system and to evaluate compression and decompression speeds; a possible shape for its interface is sketched below.
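One possible interface (assumed by us for illustration, not the library's actual API):

    # Compressing a block learns a white-box model: it returns physical
    # columns plus the expression trees needed to rebuild logical ones.
    def compress_block(logical_columns):
        raise NotImplementedError  # the learning algorithm of [9] goes here

    # Decompression replays the stored expressions over physical columns.
    def decompress_block(physical_columns, expressions):
        return {name: expr(physical_columns)
                for name, expr in expressions.items()}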
Storage Layout: Besides the (de)compression library, it is necessary to define how expression trees will be stored and instantiated during query execution. Therefore, the second step is the definition of a file layout to represent data in the white-box representation on disk. We plan to keep all the information for white-box decompression in block headers that will be instantiated whenever a block is loaded; a possible header layout is sketched below.
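One possible (assumed) block layout, with a self-describing header carrying the expression trees and dictionaries:

    import json, struct

    def write_block(f, expressions, physical_payload):
        # expressions: JSON-serializable expression trees + dictionaries
        header = json.dumps(expressions).encode()
        f.write(struct.pack("<I", len(header)))   # 4-byte header length
        f.write(header)                           # the white-box model
        f.write(physical_payload)                 # packed physical columns

    def read_header(f):
        (hlen,) = struct.unpack("<I", f.read(4))
        return json.loads(f.read(hlen).decode())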
Compressed Execution: In the white-box compression model, the push-down of database operators into the decompression tree becomes more transparent. It is not so clear, however, which kinds of expressions can be built aiming at compressed execution, or which database operators allow such execution. To clarify these questions, we will perform a careful investigation of which database operators benefit from compressed execution and how to generate decompression expressions that enable such an approach.
Vectorization vs. JIT Compilation: With adaptive query processing on a white-box representation, a JIT-compiled engine might suffer from compilation overhead. For every new block, the decoder has to be JIT-compiled, and the overhead grows with the number of different compression schemes per column. Therefore, in this thesis, we narrow down our design to a vectorized query engine, where changes in the decompression logic can best be handled and interpreted. We will consider DuckDB [17] as our target, since it is the only open-source columnar store with a vectorized execution engine.
Tactical Query Optimization: Compressed execution offers opportunities to better use SIMD resources (thinner data can execute in more lanes in parallel), but also allows for early pruning of data using cheap(er) tests on thin fragments of the data. It can also lead to hash tables that are smaller, and thus faster, and to reduced network communication. However, in order to apply these optimizations in the face of continuously changing white-box compression functions, we need to quickly re-optimize query plans when new data arrives. For this purpose, we plan to split query optimization into two phases: a heavy strategical phase that includes join ordering and is executed only once, and a tactical phase that applies cheap optimizations that specifically leverage compressed-execution opportunities.

6.   REFERENCES

[1] Public BI Benchmark. https://github.com/cwida/public_bi_benchmark.
[2] Tableau Public. https://public.tableau.com.
[3] D. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-oriented Database Systems. In SIGMOD, pages 671–682, 2006.
[4] D. J. Abadi. Query Execution in Column-oriented Database Systems. PhD thesis, 2008. AAI0820132.
[5] C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based Order-preserving String Compression for Main Memory Column Stores. In SIGMOD, pages 283–296, 2009.
[6] P. Damme, A. Ungethüm, J. Pietrzyk, A. Krause, D. Habich, and W. Lehner. MorphStore: Analytical Query Engine with a Holistic Compression-Enabled Processing Model. arXiv preprint arXiv:2004.09350, 2020.
[7] B. Chattopadhyay et al. Procella: Unifying Serving and Analytical Data at YouTube. PVLDB, pages 2022–2034, 2019.
[8] Z. Feng, E. Lo, B. Kao, and W. Xu. ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout. In SIGMOD, pages 31–46, 2015.
[9] B. Ghita, D. G. Tomé, and P. A. Boncz. White-box Compression: Learning and Exploiting Compact Table Representations. In CIDR, 2020.
[10] G. Graefe and L. D. Shapiro. Data Compression and Database Performance. In Symposium on Applied Computing, pages 22–27, April 1991.
[11] B. Hentschel, M. S. Kester, and S. Idreos. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In SIGMOD, pages 857–872, 2018.
[12] D. A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, pages 1098–1101, 1952.
[13] H. Jiang, C. Liu, Q. Jin, J. Paparrizos, and A. J. Elmore. PIDS: Attribute Decomposition for Improved Compression and Query Performance in Columnar Storage. PVLDB, 2020.
[14] H. Lang, T. Mühlbauer, F. Funke, P. A. Boncz, T. Neumann, and A. Kemper. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage Using Both Vectorization and Compilation. In SIGMOD, pages 311–326, 2016.
[15] Y. Li and J. M. Patel. BitWeaving: Fast Scans for Main Memory Data Processing. In SIGMOD, pages 289–300, 2013.
[16] O. Polychroniou and K. A. Ross. Efficient Lightweight Compression Alongside Fast Scans. In DaMoN@SIGMOD, pages 9:1–9:6, 2015.
[17] M. Raasveldt and H. Mühleisen. DuckDB: An Embeddable Analytical Database. In SIGMOD, pages 1981–1984, 2019.
[18] V. Raman et al. DB2 with BLU Acceleration: So Much More Than Just a Column Store. PVLDB, pages 1080–1091, 2013.
[19] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic Coding for Data Compression. Commun. ACM, pages 520–540, 1987.
[20] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, pages 337–343, 1977.
[21] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU Cache Compression. In ICDE, pages 59–, 2006.