=Paper=
{{Paper
|id=Vol-2652/paper09
|storemode=property
|title=Redesigning Query Engines for White-box Compression
|pdfUrl=https://ceur-ws.org/Vol-2652/paper09.pdf
|volume=Vol-2652
|authors=Diego Tomé
|dblpUrl=https://dblp.org/rec/conf/vldb/Tome20
}}
==Redesigning Query Engines for White-box Compression==
Diego Tomé, tome@cwi.nl, CWI Amsterdam, NL. Supervised by Peter Boncz.

Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020, Tokyo, Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Modern columnar databases heavily use compression to reduce memory footprint and boost query execution. These techniques, however, are implemented as a "black box", since their decompression logic is hard-coded and part of the table scan infrastructure. We proposed a novel compression model called White-box compression that views compression actions as functions over the physical columns stored in a block. Because these functions become visible as expressions in the query plan, many more optimizations can be made by the database system, boosting query execution speed. These functions are learnt from the data and also allow the data to be stored much more compactly, by decomposing string values, storing data in appropriate data types automatically, and exploiting correlations between columns.

White-box compression opens up a whole new set of research questions. We started with (1) How to learn white-box compression expressions (functions) from the data automatically? This Ph.D. research will subsequently study (2) How to leverage white-box compression with (run-time) query optimizations? and (3) How can we integrate white-box compression in a query engine, if the white-box functions may be different for each block of data?

1. INTRODUCTION

Data compression is widely used in analytical databases to reduce data storage size as well as data transfer sizes (over the network, disk, RAM), and to provide faster query execution. This is often effective in columnar databases, where the data pertaining to the same column are stored contiguously, because compression algorithms perform better on data with low information entropy [3]. While data transfer can benefit from the improved compression ratio in columnar databases, query execution might suffer from slow decompression, requiring careful consideration of which compression technique should be applied.

The literature provides a number of compression techniques for databases. On the one hand, there are general-purpose compression methods based on Huffman coding [12], arithmetic coding [19], and Lempel-Ziv [20]. While they achieve good compression ratios, encoding/decoding speeds are relatively low, typically impacting query performance. For this reason, columnar databases rely on compression methods that are more lightweight, such as Run Length Encoding (RLE), Frame-of-Reference (FOR), and Dictionary compression (DICT) [3]. These lightweight schemes take into account some knowledge of the data type and distribution, resulting also in good compression ratios but much higher (de)compression speed.
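To make these three schemes concrete before contrasting them with the white-box model, here is a small illustrative sketch (ours, not from the paper) of RLE, FOR, and DICT encoding in Python; the helper names and example values are invented for illustration.

<pre>
# Illustrative sketch of the lightweight schemes named above (RLE, FOR, DICT).
# Example data and helper names are ours, not the paper's.

def rle_encode(values):
    """Run Length Encoding: store (run length, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1
        else:
            runs.append([1, v])
    return [(count, value) for count, value in runs]

def for_encode(values):
    """Frame-of-Reference: store a base plus small per-value offsets."""
    base = min(values)
    return base, [v - base for v in values]

def dict_encode(values):
    """Dictionary compression: store a dictionary plus integer codes."""
    dictionary = sorted(set(values))
    codes = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [codes[v] for v in values]

print(rle_encode(["A", "A", "A", "B", "B"]))   # [(3, 'A'), (2, 'B')]
print(for_encode([10004, 10007, 10005]))       # (10004, [0, 3, 1])
print(dict_encode(["US", "NL", "US"]))         # (['NL', 'US'], [1, 0, 1])
</pre>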
A limitation of the state of the art is that these existing techniques are implemented as a "black box" in current database systems, and the decompression logic is hidden inside the scan. The query plan is not aware of any decompression step, while the current approach is to eagerly decompress all the data in the scan, keeping the execution engine oblivious of the compression techniques and wasting optimization opportunities like predicate evaluation over partially decompressed data. Furthermore, we observed that in real-life datasets, data is often encoded in wrong data types (typically as strings), contains codes that combine string and numerical parts (to which lightweight compression is not applicable), and/or has highly correlated columns [9]. Regarding the latter, columnar formats like Parquet compress datasets with strong column correlations worse than row formats such as Avro, because of the general-purpose compression typically slapped on top of these formats (in a row format, the co-located redundancies in a row get compressed away). Column stores store each column independently and lose this opportunity.

With white-box compression we proposed a completely new framework for compression in database systems. A table is a set of logical columns that is reconstructed by applying data-dependent functions over so-called physical columns that are stored. The compression function is therefore also data (meta-data), stored in the block header. This function is learnt from the data when writing it into blocks. We showed in [9] that this is feasible, that it greatly reduces storage size, and that it provides interesting query optimization opportunities.

However, we argue that to make this new idea usable in data systems, we need to explore new query optimization opportunities and redesign the query engine. White-box compression not only introduces unexplored opportunities for selection push-down, late decompression, and compressed execution; it also introduces a significant challenge: when a block of data is written to disk, the white-box compression model is learnt depending on the characteristics of that block of data. As such, each block may use a (slightly) different compressed representation. This already used to be the case in traditional compression, but since black-box compression hides the compression from the query engine, this is only felt in the scan. With white-box compression, the compression functions, which are computational query plan expressions (in our current model; they could become even more adventurous as follow-up work), will change continuously during query execution, whenever data from a new block is processed. This calls for a database system that continuously re-optimizes its query plans and adapts them to the current characteristics of the data.

Paper Structure. The rest of this paper is structured as follows. Section 2 provides an overview of related work. Section 3 describes the white-box compression model, and Section 4 discusses white-box decompression and its integration in the query plan, together with possible solutions for the challenges that arise. Finally, Section 5 presents our research plan.

<pre>
Logical columns A, B and stored physical columns P, Q:

  A            | B                                 | P | Q
  "GSA_8350"   | "GENERAL SERVICES ADMINISTRATION" | 0 | 8350
  "GSA_8351"   | "GENERAL SERVICES ADMINISTRATION" | 0 | 8351
  "HHS_2072"   | "HEALTH AND HUMAN SERVICES"       | 1 | 2072
  "TREAS_4791" | "TREASURY"                        | 2 | 4791
  "TREAS_4792" | "TREASURY"                        | 2 | 4792
  "HHS_2073"   | "HEALTH AND HUMAN SERVICES"       | 1 | 2073
  "GSA_8352"   | "GENERAL SERVICES ADMINISTRATION" | 0 | 8352

Decompression functions:
  A = concat(map(P, dictAP), const("_"), format(Q, "%d"))
  B = map(P, dictBP)

Dictionary AP: 0 -> "GSA", 1 -> "HHS", 2 -> "TREAS"
Dictionary BP: 0 -> "GENERAL SERVICES ADMINISTRATION",
               1 -> "HEALTH AND HUMAN SERVICES",
               2 -> "TREASURY"
</pre>
Figure 1: White-box compression model applied on logical columns A and B.
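As a concrete reading of Figure 1, the following minimal sketch (ours) evaluates the two decompression expressions in Python, with plain lists standing in for physical column vectors; the helpers map_, const, format_ and concat only mirror the operators shown in the figure and are not the paper's implementation.

<pre>
# Minimal sketch of the Figure 1 decompression functions; lists stand in for
# physical column vectors, helper names mirror the expressions in the figure.
dict_AP = ["GSA", "HHS", "TREAS"]
dict_BP = ["GENERAL SERVICES ADMINISTRATION",
           "HEALTH AND HUMAN SERVICES",
           "TREASURY"]

P = [0, 0, 1, 2, 2, 1, 0]                       # shared dictionary codes
Q = [8350, 8351, 2072, 4791, 4792, 2073, 8352]  # numeric suffixes

def map_(column, dictionary):
    return [dictionary[code] for code in column]

def format_(column, fmt):
    return [fmt % value for value in column]

def const(value, n):
    return [value] * n

def concat(*columns):
    return ["".join(parts) for parts in zip(*columns)]

# A = concat(map(P, dictAP), const("_"), format(Q, "%d"))
A = concat(map_(P, dict_AP), const("_", len(P)), format_(Q, "%d"))
# B = map(P, dictBP)
B = map_(P, dict_BP)

assert A[0] == "GSA_8350" and B[0] == "GENERAL SERVICES ADMINISTRATION"
</pre>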
2. RELATED WORK

The problem of efficient compression in database systems has received significant attention over the years [10, 16, 21, 3]. As a result, compression has been well explored in columnar databases for efficient query processing [21, 3, 5, 11, 15, 8, 14]. In these databases, one can achieve a good balance between compression ratio and performance with lightweight techniques.

The lightweight techniques provided advances in exploiting compressed data during the data processing pipeline, the so-called compressed execution [3, 4]. As a result, they brought performance improvements by allowing the push-down of predicates to compressed data in the scan operator [18]. In more advanced analytical systems like HyPer [14], data is stored in self-contained blocks allowing predicate push-down and selective scans on compressed data. Nevertheless, they have to sacrifice storage by keeping a byte-addressable representation.

Decompression beyond scans has also been explored [6], by re-compressing the data in between operators. The goal is to reduce the memory footprint of large intermediates, but it is limited to the column-at-a-time processing model. In white-box compression, decompression is part of the query plan, which allows the optimizer to delay decompression and push down predicates to partially decompressed data.

Google recently described its query engine called Procella and its new big data file format Artus [7]. Artus introduces customized versions of lightweight techniques like RLE, Delta, and Dictionary encoding. In the API of Artus, RLE columns and dictionary indices are directly exposed to the query engine, allowing the engine to aggressively push computations down to the data format. In white-box compression, we also want to expose the compressed data to the query engine. Rather than implementing FOR explicitly, white-box compression sees FOR as an additional function that adds a constant base to a physical column holding a small integer. RLE in white-box compression stores a logical column as two (much shorter) physical columns holding (count, value). On top of that, white-box compression allows much more complex data transformations, as well as the exploitation of column correlations (e.g. different dictionary-encoded columns using the same physical column holding the codes).

More recent research in compression has moved towards a storage method for decomposing string attributes in column stores [13]. This decomposition of strings relies on finding patterns to split the string into segments and compress them independently. White-box compression, on the other hand, is a more generic model, not restricted to strings, which as mentioned can also leverage column correlations. Furthermore, [13] restricts query execution optimizations to the scan library, while in our approach compression functions become computational expressions that are part of the query plan. Making that work over blocks that are compressed differently is one main challenge addressed in this thesis.

3. WHITE-BOX COMPRESSION

In white-box compression, we define an operator as a function o that takes as input zero or more columns and optional metadata information and outputs a column: o : [C × C × ...] × [M] → C. The domain of o is composed of columns and metadata, and the co-domain is the set of columns. A column is defined by its data type and the values that it contains. All the values that do not fit the chosen data representation are considered exceptions and receive special treatment. In the model, these values are stored separately in physical exception columns that can themselves be further white-box compressed.
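The sketch below (ours, not the paper's API) illustrates this operator signature with two of the functions mentioned in Section 2, FOR and RLE, expressed as white-box decompression operators over physical columns; the Column class and function names are invented for illustration.

<pre>
# Illustrative sketch (ours) of the operator definition o : [C x C x ...] x [M] -> C,
# with FOR and RLE shown as such functions over physical columns.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Column:
    dtype: str            # a column is defined by its data type...
    values: List[Any]     # ...and the values it contains

def for_decompress(offsets: Column, base: int) -> Column:
    """FOR as a white-box function: add a constant base to small integers."""
    return Column("integer", [base + v for v in offsets.values])

def rle_decompress(counts: Column, values: Column) -> Column:
    """RLE as a white-box function over two (much shorter) physical columns."""
    out = []
    for count, value in zip(counts.values, values.values):
        out.extend([value] * count)
    return Column(values.dtype, out)

# Values that do not fit the chosen physical representation are exceptions and
# would live in a separate physical exception column (not shown here).
physical = Column("tinyint", [0, 3, 1])
print(for_decompress(physical, base=10004).values)   # [10004, 10007, 10005]
print(rle_decompress(Column("integer", [3, 2]),
                     Column("varchar", ["A", "B"])).values)   # ['A','A','A','B','B']
</pre>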
Figure 1 illustrates a table compressed with white-box compression. Column A has a particular pattern composed of a string prefix, an underscore character, and a number. In this case, it is not possible to compress this column with dictionary encoding because it has a high cardinality. Therefore, the only way of compressing this column would be by using a heavyweight technique like LZ4 or some other variant, which has the problem of slow de/compression.

For cases not covered by any lightweight technique, the white-box model can enable compression by changing the physical representation of columns. For instance, column A can be decomposed into three other columns that can be further compressed. The prefix string can now be stored as the dictionary AP, the underscore as a constant, and the number can be stored as an integer that can be further compressed with some other lightweight technique.

Another compression opportunity arises because column B is stored in the dictionary BP. In this case, we are able to represent one column as a function of another (i.e. column correlation) by identifying the same association between dictionaries AP and BP. As a result, only one physical column (i.e. P) is stored, together with the two dictionaries and the decompression function to reconstruct the logical columns.

Column Correlations. The particular focus of this work on correlations is based on their frequent occurrence in real-world datasets. We have recently introduced the Public BI benchmark [1], a real-world dataset derived from 46 of the biggest Tableau workbooks [2]. While exploring this data, we noticed that it tends to comprise patterns not found in synthetic database benchmarks like TPC-H and TPC-DS. In this dataset, data is often skewed in terms of value and frequency distribution and it is correlated across columns. In particular, most of the correlations are between nominal values (i.e. strings).

In our work [9], we formally define the white-box compression model and propose a learning algorithm to identify patterns in the data. Our initial approach already doubles the compression factor on the Public BI benchmark [1]. White-box compression is very effective here because this data has many string columns. In Table 1 we show the data type distribution for logical and physical columns on the Public BI benchmark. Thanks to white-box compression, the volume of strings is reduced and the dataset becomes more compressible. Integer columns are stored in smaller types, while boolean columns are represented as constant operations inside an expression. Correlated columns play an important role, since we noticed a reduction of 70% in the number of columns after white-box compression.

<pre>
Type/Column   Logical   Physical
varchar        80.3%      3.0%
tinyint         0%       31.7%
smallint       13.7%     60.4%
double          2.3%      0.3%
decimal         2.1%      4.6%
integer         0.9%      0%
boolean         0.7%      0%
</pre>
Table 1: Data type distribution for logical and physical columns on the Public BI benchmark. Physical columns are the result of applying white-box compression.

4. WHITE-BOX DECOMPRESSION

In the previous section we showed how white-box compression is defined and its main advantages over black-box compression. We now discuss fast decompression and its integration in the query plan.

In the white-box model, the decompression functions to rebuild the logical columns are stored in the block metadata. Decompression will be implemented in two phases. In phase 1, the compressed data is first unpacked using fast SIMD codecs into byte-addressable physical columns. This partially decompressed representation can be kept as columnar vectors in the query plan as long as needed, and once an operator like a join, group-by, or projection requires the logical representation, the second phase of the decompression is performed (lazy decompression). Many operations can thus take place on the data still in (partially) compressed form. The second phase is the full decompression of physical columns into logical columns.
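A minimal sketch (ours) of this two-phase, lazy scheme: phase 1 unpacks a block into plain physical column vectors (a real engine would use fast SIMD codecs), and phase 2 evaluates the block's stored decompression expression only when an operator asks for logical values; all names and the tiny example block are invented for illustration.

<pre>
# Sketch (ours) of lazy, two-phase decompression.
class LazyLogicalColumn:
    def __init__(self, expression, physical):
        self.expression = expression   # read from the block header (metadata)
        self.physical = physical       # phase-1 output: physical column vectors
        self._logical = None           # filled lazily

    def logical(self):
        # Phase 2: full decompression, triggered by a join, group-by or
        # projection that needs the logical representation.
        if self._logical is None:
            self._logical = self.expression(self.physical)
        return self._logical

# Phase 1 for a Figure 1 style block: unpack P and Q (here already plain lists;
# a real engine would use SIMD codecs to produce byte-addressable vectors).
physical = {"P": [0, 0, 1], "Q": [8350, 8351, 2072]}
dict_AP = ["GSA", "HHS", "TREAS"]

def expr_A(cols):
    # concat(map(P, dictAP), const("_"), format(Q, "%d")) for this block
    return [dict_AP[p] + "_" + str(q) for p, q in zip(cols["P"], cols["Q"])]

A = LazyLogicalColumn(expr_A, physical)
# ... predicates may run here on the physical columns, still compressed ...
print(A.logical())    # ['GSA_8350', 'GSA_8351', 'HHS_2072']
</pre>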
[Figure 2 contrasts two plans for the example query SELECT t2.B, t1.C FROM t1, t2 WHERE t2.B = t1.C AND A LIKE 'TREAS%'. Left ("White-box decompression"): the scan black-box decodes the block (LZ4, FOR, DICT headers) into physical columns X, P, Q; the decompression expressions (concat, map, format, plus exception handling) rebuild the logical columns C, A, B; query execution (σ A LIKE 'TREAS%', join, Γ B,C) runs on top. Right ("Adaptive White-box decompression"): decompression is fused with query execution; the predicate is pushed down as σ P=2 on the physical column, only the logical columns C and B are rebuilt, and the decompression logic may differ per block.]
Figure 2: Fusing decompression and query execution on top of white-box compressed data.

Figure 2 depicts our proposal for decompression and query execution on a white-box representation. In the left part, we show an unoptimized approach where the physical columns are black-box decompressed into columns X, P, and Q. In this scenario, the query optimizer is able to push down predicates straight to the physical columns, avoiding unnecessary decompression steps. The challenge in this approach is that black-box decompression steps are still present, which limits the ability of the execution engine to operate over the physical columns in compressed format.

In the right part of Figure 2, we illustrate the optimized version of the query plan, fused with decompression. The optimizer can adapt the plan by pruning some decompression steps and pushing down predicates to the partially decompressed data. The column compressed with frame-of-reference also becomes a white-box representation, represented by a column of values and the reference, combined with a difference operator.
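To make the push-down of Figure 2 concrete, here is a small sketch (ours) of how the predicate A LIKE 'TREAS%' can be rewritten against the physical column P: since the prefix of A comes from map(P, dictAP), the string predicate reduces to a test on the dictionary codes (σ P=2 in the figure), so the filter runs on still-compressed data.

<pre>
# Sketch (ours) of the predicate push-down in Figure 2: A LIKE 'TREAS%' is
# rewritten against physical column P, whose dictionary supplies A's prefix.
dict_AP = ["GSA", "HHS", "TREAS"]

def push_down_prefix_predicate(prefix):
    """Translate a prefix predicate on logical A into a set of codes on P."""
    return {code for code, s in enumerate(dict_AP) if s.startswith(prefix)}

P = [0, 0, 1, 2, 2, 1, 0]
matching_codes = push_down_prefix_predicate("TREAS")      # {2}
selection = [i for i, code in enumerate(P) if code in matching_codes]
print(selection)   # rows 3 and 4 qualify, without ever materializing column A
</pre>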
Adaptive Query Processing. In [9], we considered a simple approach in which the entire logical column would use the same decompression function for the whole table. If these functions may change for each block of data, integration of white-box compression in the query engine becomes more tricky: whenever a block of data is read, we would have to stop the query execution, re-instantiate a new query plan, and re-optimize. We argue that a vectorized engine is more likely to succeed in quickly handling such changes than an alternative approach with a JIT-compiled engine, which would introduce recompilation latency for each new data block. To further save time, we propose to split query optimization into a main strategical phase that is executed once before execution, and to perform lightweight tactical re-optimization whenever a new data block is brought in and the strategic query execution plan is adapted to it.

5. RESEARCH PLAN

We believe white-box compression is the foundational idea for a next generation of database engines, where compressed columnar execution is critical to leverage powerful SIMD units. It also unlocks many optimization possibilities by learning data representations from the data, and it is less vulnerable to the ill-designed database layouts that are often observed in cloud usage situations.

We describe the following as the aspects that shall be explored in this research during the next two years:

(De)compression Library: We start with developing a basic library for compression and decompression of data in the white-box representation. Our first goal is to have this library independent of any database system and to evaluate compression and decompression speeds.

Storage Layout: Besides the de/compression library, it is necessary to define how expression trees will be stored and instantiated during query execution. Therefore, the second step is the definition of a file layout to represent data in the white-box representation on disk. We plan to have all the information for white-box decompression in block headers that will be instantiated whenever a block is loaded.

Compressed Execution: In a white-box compression model, the push-down of database operators within the decompression tree becomes more transparent. It is not so clear, however, which kinds of expressions can be built aiming at compressed execution, or which database operators allow such execution. To clarify these questions we will perform a careful investigation of which database operators benefit from compressed execution and how to generate decompression expressions that enable such an approach.

Vectorization vs. JIT Compilation: With adaptive query processing on a white-box representation, a JIT-compiled engine might suffer from compilation overhead. For every new block, the decoder has to be JIT-compiled, and the overhead grows with the number of different compression schemes per column. Therefore, in this thesis, we narrow down our design to a vectorized query engine, where changes in the decompression logic can be best handled and interpreted. We will consider DuckDB [17] as our target since it is the only open-source columnar store with a vectorized execution engine.

Tactical Query Optimization: Compressed execution offers opportunities to better use SIMD resources (thinner data can execute in more lanes in parallel), but also allows for early pruning of data using cheap(er) tests on thin fragments of the data. It can also lead to hash tables that are smaller and thus faster, and to network communications that are reduced. However, in order to apply these optimizations in the face of continuously changing white-box compression functions, we need to quickly re-optimize query plans when new data arrives. For this purpose, we plan to split query optimization into two phases: a heavy strategical phase that includes join ordering and is executed only once, and a tactical phase that applies cheap optimizations that specifically leverage compressed execution opportunities.
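As an illustration of the proposed split, the following sketch (ours, not an actual DuckDB integration) separates a one-time strategical phase from a cheap tactical phase that re-instantiates the decompression expressions and push-downs whenever execution moves to a block with a different white-box model; all structures are invented stand-ins.

<pre>
# Sketch (ours) of two-phase optimization: strategical plan built once,
# tactical adaptation re-run per block in a vectorized, block-at-a-time loop.
def optimize_strategical(query):
    # Heavy, one-time phase: join ordering and operator selection would go here.
    return {"query": query, "tactical": None}

def optimize_tactical(plan, block_header):
    # Cheap per-block phase: instantiate this block's decompression expressions
    # and decide which predicates can run on (partially) compressed columns.
    plan["tactical"] = {
        "expressions": block_header["expressions"],
        "pushdowns": [p for p in plan["query"]["predicates"]
                      if p["column"] in block_header["expressions"]],
    }
    return plan

def execute(query, blocks):
    plan = optimize_strategical(query)        # executed once, before execution
    for block in blocks:                      # vectorized block-at-a-time reading
        plan = optimize_tactical(plan, block["header"])
        yield plan["tactical"]                # stand-in for executing the vectors

query = {"predicates": [{"column": "A", "op": "LIKE", "value": "TREAS%"}]}
blocks = [{"header": {"expressions": {"A": "concat(map(P, dictAP), ...)"}}},
          {"header": {"expressions": {"A": "map(P, dictA2)"}}}]  # different model
for tactical in execute(query, blocks):
    print(tactical["pushdowns"])
</pre>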
6. REFERENCES

[1] Public BI Benchmark. https://github.com/cwida/public_bi_benchmark.
[2] Tableau Public. https://public.tableau.com.
[3] D. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-oriented Database Systems. In SIGMOD, pages 671–682, 2006.
[4] D. J. Abadi. Query Execution in Column-oriented Database Systems. PhD thesis, 2008. AAI0820132.
[5] C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based Order-preserving String Compression for Main Memory Column Stores. In SIGMOD, pages 283–296, 2009.
[6] P. Damme, A. Ungethüm, J. Pietrzyk, A. Krause, D. Habich, and W. Lehner. MorphStore: Analytical Query Engine with a Holistic Compression-Enabled Processing Model. arXiv preprint arXiv:2004.09350, 2020.
[7] B. Chattopadhyay et al. Procella: Unifying Serving and Analytical Data at YouTube. PVLDB, pages 2022–2034, 2019.
[8] Z. Feng, E. Lo, B. Kao, and W. Xu. ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout. In SIGMOD, pages 31–46, 2015.
[9] B. Ghita, D. G. Tomé, and P. A. Boncz. White-box Compression: Learning and Exploiting Compact Table Representations. In CIDR, 2020.
[10] G. Graefe and L. D. Shapiro. Data Compression and Database Performance. In Symposium on Applied Computing, pages 22–27, April 1991.
[11] B. Hentschel, M. S. Kester, and S. Idreos. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In SIGMOD, pages 857–872, 2018.
[12] D. A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, pages 1098–1101, 1952.
[13] H. Jiang, C. Liu, Q. Jin, J. Paparrizos, and A. J. Elmore. PIDS: Attribute Decomposition for Improved Compression and Query Performance in Columnar Storage. PVLDB, 2020.
[14] H. Lang, T. Mühlbauer, F. Funke, P. A. Boncz, T. Neumann, and A. Kemper. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage Using Both Vectorization and Compilation. In SIGMOD, pages 311–326, 2016.
[15] Y. Li and J. M. Patel. BitWeaving: Fast Scans for Main Memory Data Processing. In SIGMOD, pages 289–300, 2013.
[16] O. Polychroniou and K. A. Ross. Efficient Lightweight Compression Alongside Fast Scans. In DAMON@SIGMOD, pages 9:1–9:6, 2015.
[17] M. Raasveldt and H. Mühleisen. DuckDB: An Embeddable Analytical Database. In SIGMOD, pages 1981–1984, 2019.
[18] V. Raman et al. DB2 with BLU Acceleration: So Much More Than Just a Column Store. PVLDB, pages 1080–1091, 2013.
[19] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic Coding for Data Compression. Commun. ACM, pages 520–540, 1987.
[20] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, pages 337–343, 1977.
[21] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU Cache Compression. In ICDE, page 59, 2006.