Position Caching in a Column-Store with Late Materialization: An Initial Study

Viacheslav Galaktionov 1,2, Evgeniy Klyuchikov 1,2, George Chernishev 1,2,3
1 Information Systems Engineering Lab, JetBrains Research
2 Saint Petersburg State University, Russia
3 National Research University Higher School of Economics
{viacheslav.galaktionov,evgeniy.klyuchikov,chernishev}@gmail.com

ABSTRACT
A common technique to speed up DBMS query processing is to cache parts of query results and reuse them later. In this paper we propose a novel approach aimed specifically at caching intermediates in a late-materialization-oriented column-store. The idea of our approach is to cache positions (row numbers) instead of data values. The small size of the positional representation is a valuable advantage: the cache can accommodate more entries and consider intermediates that involve "heavy" operators, e.g. joins of large tables.

Position caching thrives in late materialization environments, since position exchange is prevalent in them. In particular, expensive predicates and heavy joins are usually processed based on positions. Our approach is able to cache them efficiently, thus significantly reducing system load.

To assess the importance of intermediates, our position caching technique features a cost model based on usage statistics and complexity estimations. Furthermore, to allow intermediate reuse for queries that are not fully identical, we propose an efficient query containment checking algorithm. Several policies for cache population and eviction are also proposed. Finally, our approach is enhanced with lightweight compression schemes.

Experimental evaluation was performed using a stream of randomly generated Star-Schema-Benchmark-like queries. It showed up to a 3x improvement in query run times. Additionally, compressing the intermediates reduces the space requirements by up to 2x without a noticeable performance overhead.

© Copyright 2020 for this paper held by its author(s). Published in the proceedings of DOLAP 2020 (March 30, 2020, Copenhagen, Denmark, co-located with EDBT/ICDT 2020) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[Figure 1: Late materialization example, adapted from [17]. The plan evaluates the query: Select R1.B, R1.C, R2.E, R2.H, R3.F From R1, R2, R3 Where R1.A = R2.D AND R2.G = R3.K.]

1 INTRODUCTION
Late materialization [1] is a query processing technique that aims to operate on positions as long as possible. Essentially, it defers tuple reconstruction to later stages of a query plan. The goal is to conserve disk bandwidth by: 1) reading individual columns only when they are necessary for processing at a given point, and 2) organizing column access in such an order that subsequent reads require fewer disk accesses due to data filtering.

This approach leads to unconventional query plans with novel opportunities [10] for efficient query processing. In such query plans operators: 1) usually ingest only the necessary columns, and 2) exchange not only data, but also positions. Positions are essentially row ids which are later used to request data. An example of late materialization processing is presented in Figure 1. Here, for column G, the query processor reads only the values that have passed the join predicate. For columns E and H it is even better: the query processor has to read only the values that have passed both join predicates. Of course, the benefit of such a processing scheme depends on many parameters: join filtering power, data distribution, and disk parameters such as seek times and bandwidth.

Late materialization has formed the basis of several column-oriented DBMSes [1]. Its many aspects, such as the design of LM-enabled operators [3, 17] and query processing schemes [6, 11], have been extensively studied. However, some aspects have received less attention.

In this paper we further investigate one of them — intermediate result caching. Unlike many existing caching techniques, our approach does not cache the data itself. Instead, it aims to cache useful data in positional form, which allows us to:
(1) reduce the memory footprint, since table row data, which can be arbitrarily large, is replaced with a single position (a 32-bit number);
(2) use the memory thus saved to store intermediates that are closer to the bottom of the plan. They are more costly to compute, but on the other hand they have a higher chance of reuse.
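To make point (1) concrete, consider the following minimal C++ sketch (our own illustration, not PosDB code): a filter over a wide table can record its result as a vector of 32-bit row numbers rather than copies of the selected rows.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A value-based intermediate would have to copy every selected row,
// however wide the rows are.
struct Row {
    int32_t a;
    std::string b, c;  // potentially large payload columns
};

// A positional intermediate stores one 32-bit row number per selected
// row, regardless of row width.
std::vector<uint32_t> filterPositions(const std::vector<Row>& table,
                                      int32_t threshold) {
    std::vector<uint32_t> positions;
    for (uint32_t pos = 0; pos < table.size(); ++pos)
        if (table[pos].a > threshold)
            positions.push_back(pos);  // cache 4 bytes instead of a whole row
    return positions;
}
```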
Our caching approach also employs query containment checks, which allow reusing intermediates from queries that are not fully identical, as long as one of them subsumes the other.

To demonstrate the viability of the proposed position caching, we have implemented it inside PosDB — a distributed disk-based column-store with late materialization. Since PosDB currently lacks a full-fledged cost-based optimizer, we had to devise a number of simpler heuristic algorithms. While our approach is simpler than existing methods, it still allows us to obtain considerable performance gains.

Apart from that, we have also implemented compression for cached positions. Keeping only sequences of integers favours compression, which allows us to reduce the memory footprint even further. For now, we have implemented only delta encoding and null suppression, since our experiments indicated that these methods result in good space savings with a small overhead.

Overall, the contribution of this paper is the following:
(1) A novel approach to caching intermediates in a column-store that caches positions instead of data.
(2) A study of the applicability of compression to caching inside a column-store.
(3) An experimental evaluation of several policies for cache population and eviction.

2 RELATED WORK
Caching intermediates. The idea of reusing intermediate results for query evaluation in databases has existed for a long time. It comes in two formulations. In the first one, there is a number of simultaneously running queries, and the goal is to generate query plans that maximize result sharing. This is called multi-query optimization [15, 16]. The second option is caching: parts of query plans that were evaluated in the past are stored and reused when an appropriate query arrives. In this paper we consider the second option.

The next two papers study caching of intermediates inside the MonetDB [11] system family, an in-memory column-store with late materialization. In reference [12] the authors propose to exploit MonetDB's operator-at-a-time processing model. The idea is to cache and reuse BAT by-products of query execution. The proposed approach featured plan subsumption and both inter- and intra-query reuse. The study [13] addresses the case of pipelined operator evaluation inside the Vectorwise database engine.

In reference [9] another mechanism for in-memory result caching was proposed. The idea is to cache internal data structures of a database engine in order to lower memory pressure and preserve cache and register locality; in particular, hash tables that are produced by hash-joins and hash-aggregations.

Disk-based caching was considered in reference [4]. Apart from being able to cache data, this system can also cache positions. The authors also proposed a cost-based optimizer to find the best plan. Experiments showed that position caching performs better than data caching when cache entries are stored on disk.

Caching and compression. Compression has been used in databases for a long time [19]; however, it is in column-stores that it brought the most significant benefits. Compression in column-stores differs from row-stores: 1) columnar storage results in homogeneity of data, which in turn leads to better compression rates, and 2) compression should be lightweight, i.e. it should not excessively load the CPU.

We reference the two papers we believe to be the most important; many others can be found. The authors of reference [2] studied several lightweight compression algorithms in a disk-based column-store. They compared their behavior in the case of eager decompression and in the case when the query engine operated on compressed data directly. They also presented a decision tree for selecting a compression method depending on data characteristics. Compression for in-memory column-stores was studied in reference [20]. The authors proposed and evaluated three different compression schemes which aimed to use CPU resources efficiently.

To the best of our knowledge, no paper has considered compression in the context of caching intermediates in column-stores.
3 PROPOSED APPROACH
Preliminaries and prerequisites. A common workflow of a DBMS caching subsystem is as follows. When a query arrives, its plan is generated and inspected for subtrees that a) can be reused later, or b) can reuse the current cache. If any are found, they interact with the cache during query evaluation by either a) putting the resulting data into the cache or b) reusing cached data instead of evaluating the subtree.

In this study we assume that the target DBMS represents query plans as trees of operators, each of which returns values or positions exclusively. As we target late materialization, most of the filtering and join operators are positional and (due to OLAP specifics) appear to be the most costly parts of a query plan. Our caching approach intentionally targets only the positional data. As we stated in the introduction, "positional" cache entries are lightweight and more susceptible to compression. Therefore, "heavy" operators from the bottom of the tree (usually costly filters and joins) can be cached with more ease. Thus, a significant decrease in system load can be achieved.

An essential drawback of position-only caching is repeated reading of data values. However, the disk subsystem decreases this overhead: caching is usually applied when the queries are somewhat similar in terms of the disk pages they reference, so the pages necessary for materialization are often already stored inside the buffer manager.

Caching: general overview. Our caching algorithm consists of pre- and post-processing of query plans. The former adds special operators to gather statistical information and substitutes subtrees with existing cache entries. The latter collects the final statistical information and updates the cache if needed.

The cache consists of a query history and a set of cache entries (each being a list of position blocks). The query history is a list of subtrees of the N most recently executed queries, together with the subtrees that are still cached. Each history entry contains a pointer to the corresponding cache entry (if any), a high-level description of the subtree, its result's size, and a usage statistic. A usage statistic is a set of queries the entry can be potentially useful for. All this information is used in the cost models of the cache population and eviction policies.

There might not be a perfectly matching cache entry for a subtree, but at the same time there may be an entry that provides a larger, subsuming result. This is determined with a containment check. If the check passes, the original subtree is replaced by this cache entry with a special filtering operator on top.
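The following C++ sketch shows one possible shape of these structures; all type and field names are ours, not PosDB's.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <memory>
#include <set>
#include <string>
#include <vector>

// A position block: a fixed-size batch of row numbers (simplified).
using PositionBlock = std::vector<uint32_t>;

// A cache entry holds the positional result of a subtree.
struct CacheEntry {
    std::vector<PositionBlock> blocks;
};

// One history record per recently executed (or still cached) subtree.
struct HistoryEntry {
    std::string description;            // high-level description of the subtree
    std::shared_ptr<CacheEntry> entry;  // null if the result is not cached
    std::size_t resultBlocks = 0;       // result size, measured by PBCounter
    std::set<std::uint64_t> usableBy;   // usage statistic: ids of queries that
                                        // could potentially reuse this entry
};

// The cache itself: a bounded history plus a block budget.
struct PositionCache {
    std::deque<HistoryEntry> history;   // subtrees of the N most recent queries
    std::size_t capacityBlocks = 0;     // total cache budget in position blocks
};
```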
Caching: preprocessing. Preprocessing is the most sophisticated part, see Figure 2. The algorithm iterates through the query plan and for each subtree S: a) checks if it fits into the cache, b) adds a special operator PBCounter on top of it, c) updates the cache history, and d) tries to replace S with a (possibly filtered) cache entry.

The first two steps are related to size estimation of intermediates. An auxiliary PBCounter is used to count the number of blocks passed during query evaluation. This number is then used to: a) update the benefit measure of S (described later), and b) filter out future subtrees that are guaranteed not to fit into the cache.

After size estimation has been taken care of, we add a history entry for S and update the usage statistics of other entries that can (potentially) be used by S. The usage statistics in their turn participate in the cost model of the population and eviction policies. At this point we try to admit the best yet-uncached entry into the cache.

Finally, after the cache has been adjusted, we search for a cache entry that can replace S with minimum additional filtering. If one is found, we use a special operator FetchCache instead of evaluating S directly. This operator passes position blocks onward, decompressing them if needed.

Caching: postprocessing. This stage has two goals: gather the results of every PBCounter that is a part of the current query's operator tree, and calculate the benefit estimates for each intermediate.

[Figure 2: PosDB caching workflow: preprocessing. The flowchart loops over the plan's subtrees S, wraps cacheable ones in PBCounter(S), saves them in the cache history, updates containment and usability statistics, consults the cache policies to admit the uncached entry with maximum benefit, and replaces S with FetchCache(T) or Filter(FetchCache(T)) when a fitting cached intermediate T is found.]
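A compact C++ sketch of this pass follows, with stubbed helpers standing in for the real size estimation, history, and policy machinery (all names are hypothetical):

```cpp
#include <optional>
#include <vector>

// Invented stand-ins for PosDB's operator tree machinery.
struct Subtree { bool instrumented = false; bool replaced = false; };
struct CachedRef { bool needsFilter = false; };

// Stubbed helpers; the real system consults size estimates, the cache
// history, and the population policy here.
bool mayFitIntoCache(const Subtree&) { return true; }
void saveInHistory(const Subtree&) {}
void updateUsageStatistics(const Subtree&) {}
void admitBestUncachedEntry() {}
std::optional<CachedRef> bestContainingEntry(const Subtree&) { return std::nullopt; }

void preprocess(std::vector<Subtree>& subtrees) {
    for (Subtree& s : subtrees) {
        // (a) + (b): instrument subtrees whose result may fit into the
        // cache, so that PBCounter can measure the actual result size.
        if (mayFitIntoCache(s))
            s.instrumented = true;             // wrap with PBCounter(S)
        // (c): record the subtree, refresh statistics, adjust the cache.
        saveInHistory(s);
        updateUsageStatistics(s);
        admitBestUncachedEntry();              // population policy, e.g. MIN
        // (d): reuse a cached, possibly subsuming, result when available.
        if (auto t = bestContainingEntry(s))
            s.replaced = true;                 // FetchCache(T), plus a Filter
                                               // on top if t->needsFilter
    }
}
```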
Policies for cache population and eviction. We have considered several different policies for adding and removing intermediates. We assume that for each history entry a numeric benefit can be calculated. We also define the total benefit as the product of the benefit and the usage frequency (the number of queries in the entry's usage statistic).

For cache population we have considered:
• MIN: choose an uncached intermediate whose total benefit is larger than the minimum among cached intermediates.
• AVG: choose an uncached intermediate whose total benefit is larger than the average among cached intermediates.

Cache eviction has a larger number of options:
• LRU (Least Recently Used): evict intermediates that were not used for the longest time.
• LFU (Least Frequently Used): evict intermediates with the shortest history of usage.
• LCS (Largest Cache Space): evict intermediates that occupy the largest amount of space.
• BENEFIT: evict entries that have the lowest benefit.
• BENEFITHIST: evict entries that have the lowest total benefit.
• RANDOM: evict random intermediates.

Cost model. The benefit of a query history entry is defined as a function of:
• Subtree complexity: a weighted sum of the subtree's operators. Each operator type corresponds to a constant weight, which is a model parameter;
• Working set complexity: an estimated amount of data the query executor requests from the storage system to process the subtree;
• Actual result size: computed by the corresponding PBCounter operator;
• Usage statistics: a list of unique query numbers that could (or did) use it.

Note that we rely on two kinds of complexity: subtree and working set complexity. While the former favours "structurally complex" subtrees, the latter favours operations on larger amounts of data. Together they allow us to distinguish simple queries over large data from complex queries over small data.

Overall, the following formula is used to calculate the benefit:

    Benefit(I) = (Complexity(I) × EstimatedSize(I)) / ActualSize(I).
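As a sketch, the benefit formula and the BENEFIT eviction rule could look as follows in C++ (field names are ours; how the two complexity terms combine into Complexity(I) is left abstract):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One record per history entry, with invented field names.
struct Intermediate {
    double complexity;        // combined subtree + working set complexity
    double estimatedSize;     // estimated data requested from storage
    double actualSizeBlocks;  // measured by PBCounter
    std::size_t usageCount;   // number of queries in the usage statistic
};

// Benefit(I) = Complexity(I) * EstimatedSize(I) / ActualSize(I)
double benefit(const Intermediate& i) {
    return i.complexity * i.estimatedSize / i.actualSizeBlocks;
}

// Total benefit additionally weighs in how often the entry is usable;
// this is what BENEFITHIST and the population policies operate on.
double totalBenefit(const Intermediate& i) {
    return benefit(i) * static_cast<double>(i.usageCount);
}

// BENEFIT eviction: drop the cached entry with the lowest benefit.
// Precondition: 'cached' is non-empty.
std::size_t pickVictim(const std::vector<Intermediate>& cached) {
    return std::min_element(cached.begin(), cached.end(),
                            [](const Intermediate& a, const Intermediate& b) {
                                return benefit(a) < benefit(b);
                            }) - cached.begin();
}
```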
Query Containment. Implementing a caching subsystem implies solving the query containment problem. It arises when it is necessary to determine which cache entry corresponds to a newly arrived query. Given the sheer expressive power of SQL, it is hard to check even the simple equivalence of two queries. However, in order to build an efficient caching subsystem, it is essential to find cached entries that can be used as base tables to evaluate the given query. For example, consider two queries:
• SELECT * FROM T1 WHERE T1.A > 100
• SELECT * FROM T1 WHERE T1.A > 110
If the first query is cached, then the second one can use the stored answer with additional filtering. Depending on the predicate selectivity and hardware parameters, this approach may result in savings.

It is well known that in the general case checking query containment is computationally very demanding. For example, it was shown [8] that the problem of conjunctive query containment is NP-complete. Therefore, we had to restrict the admissible query class to "mostly" conjunctive SPJ (Select-Project-Join) queries. "Mostly" means that OR can appear in the WHERE clause only to list possible individual values of an attribute, e.g. T1.A = 5 OR T1.A = 6 OR T1.A = 7 is allowed, while T1.A < 5 OR T1.A = 10 is not. The resulting class is therefore limited, but big enough to cover all queries of the Star Schema Benchmark (SSB). Note that, since only positional data is involved, there is no need to check containment of aggregation or sorting.

In order to perform containment checks, we have designed a special data structure to represent query metadata. Essentially, it records: 1) the set of attribute fragments covered, 2) all conducted joins, and 3) all predicates that were used for filtering. Obtaining such a structure is just a matter of traversing the corresponding operator tree.

Now, in order to perform a containment check given two such structures, all that is required is to make sure they share their sets of attribute fragments and joins, and that every predicate from the first structure is implied by some predicate from the second structure. If that condition holds, then the first structure describes a containing query: its cached result covers every row the second query needs.
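A simplified C++ rendering of the check, assuming predicates are per-attribute ranges (the OR-of-individual-values case from above is omitted for brevity; all names are ours):

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical metadata representation. Predicates are modeled as
// per-attribute closed intervals, which covers the conjunctive case.
struct Range { long lo, hi; };

struct QueryMeta {
    std::set<std::string> fragments;          // attribute fragments covered
    std::set<std::string> joins;              // conducted joins, canonicalized
    std::map<std::string, Range> predicates;  // conjunctive filters
};

// p1 is implied by p2 iff p2's interval lies within p1's.
bool implies(const Range& p2, const Range& p1) {
    return p1.lo <= p2.lo && p2.hi <= p1.hi;
}

// True iff 'cached' contains 'query': equal fragment and join sets, and
// every predicate of 'cached' is implied by some predicate of 'query'.
bool contains(const QueryMeta& cached, const QueryMeta& query) {
    if (cached.fragments != query.fragments || cached.joins != query.joins)
        return false;
    for (const auto& [attr, p1] : cached.predicates) {
        auto it = query.predicates.find(attr);
        if (it == query.predicates.end() || !implies(it->second, p1))
            return false;
    }
    return true;
}
```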
Compression. Cached position blocks appear to be a good target for compression. First, the columnar representation guarantees data homogeneity, and positions are just integers of a fixed size. Secondly, the structure of a query plan explicitly restricts the possible position block structure. For example, large tables like SSB's LINEORDER are usually treated in a way that preserves their initial order; therefore, the corresponding position stream is ordered and relatively dense. On the contrary, small tables which participate in joins often provide sparse and unordered sequences of positions.

Turning to concrete compression strategies, we have tried different encoding methods and found delta encoding and null suppression to be the most appropriate. The former narrows down the range of absolute values inside each position block, especially when they are in ascending or descending order. The latter then transforms 32-bit (signed or unsigned) numbers into 16-bit or 8-bit ones. Thus, up to four times space compression can be achieved with the almost negligible overhead of two for loops.

Two other apparently efficient compression strategies are RLE and Range. However, experiments demonstrated their limited use in practice: after the first filtering or join operator, consecutive positions are usually distinct or repeat only a little. In the latter case delta + null suppression give a comparable or better result.

We have also assessed and turned down some of the more complex compression methods, such as Dictionary compression and LZW. They either require significant time overhead or fail to provide noticeable space gains, encoding numbers with almost the same numbers.
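A sketch of the scheme under hypothetical names; the first loop computes deltas, the second selects the narrowest width, matching the "two for loops" mentioned above:

```cpp
#include <cstdint>
#include <vector>

// First loop: delta encoding. Store differences between consecutive
// positions; for an ordered, dense stream the deltas are small.
std::vector<int32_t> deltaEncode(const std::vector<uint32_t>& positions) {
    std::vector<int32_t> deltas(positions.size());
    uint32_t prev = 0;
    for (size_t i = 0; i < positions.size(); ++i) {
        deltas[i] = static_cast<int32_t>(positions[i] - prev);
        prev = positions[i];
    }
    return deltas;
}

// Second loop: null suppression. If every delta of the block fits into
// 8 (or 16) bits, the block can be re-stored with the narrower width,
// giving up to 4x space savings over raw 32-bit positions.
enum class Width { W8 = 1, W16 = 2, W32 = 4 };  // bytes per number

Width chooseWidth(const std::vector<int32_t>& deltas) {
    Width w = Width::W8;
    for (int32_t d : deltas) {
        if (d < INT16_MIN || d > INT16_MAX) return Width::W32;
        if (d < INT8_MIN  || d > INT8_MAX)  w = Width::W16;
    }
    return w;
}
```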
4 POSDB
To evaluate our approach we have implemented it inside PosDB — a distributed column-store with a processing model that adheres to late materialization. Each PosDB query plan consists of two parts: position-based and tuple-based (see Figure 3b). In the former (the bottom one), operators exchange only positions, which are represented as a generalized join index [18] (Figure 3a). In our system this data structure is used to represent the results of filter and join operators. For a filter, it is just a list of positions (row numbers) that satisfy a predicate. The same structure can describe the results of an arbitrary number of joins, e.g. consider the join index presented in Figure 3a. Here, the second row of T1 was joined with the first row of T2, and then the resulting tuple was joined with the second row of T3.

The upper part of any plan consists of operators that work with tuples, similarly to classic row-stores. The lowest of these operators is responsible for materialization, i.e. it transforms join indexes into tuples of actual values. Such materializing operators perform aggregation and window functions.

Often, positional operators (e.g. Join, Filter) require data values in addition to positions. To provide such operators with data, a set of auxiliary entities called readers was introduced. For example, ColumnReader retrieves the values of a specific attribute, while SyncReader provides the values of several attributes synchronously.

PosDB is both a distributed and a parallel column-store. It is distributed in terms of both data and query execution: a table may be fragmented and replicated across several nodes. A number of table-level fragmentation methods are supported: round-robin, hash, and range partitioning. Query distribution allows it to run a query on different nodes, possibly with individual query plan parts residing on distinct nodes. Query distribution is implemented by two pairs of special operators: {SendPos, ReceivePos} and {SendTuple, ReceiveTuple}. Therefore, both positional and tuple operators can be executed on arbitrary nodes, regardless of where their children reside.

Both inter- and intra-query parallelism [14] are supported. To implement intra-query parallelism, two special operators were created: Asynchronizer executes an operator tree in a separate thread, and UnionAll collects data from several subtrees that are executed in their own threads.

Recently [5], fully distributed operators such as distributed join and aggregation were added. A detailed description of the PosDB architecture can be found in papers [5, 7].

[Figure 3: PosDB internals. (a) Example of a join index over tables T1, T2, T3; (b) query plan example for the SSB query: SELECT sum(lo_revenue), d_year, p_brand1 FROM lineorder, date, part, supplier WHERE lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_category = 'MFGR#12' and s_region = 'AMERICA' GROUP BY d_year, p_brand1 ORDER BY d_year, p_brand1.]
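The generalized join index of Figure 3a can be sketched as follows (a minimal illustration with invented names, not PosDB's actual structures):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A generalized join index in the spirit of [18]. Each column corresponds
// to one joined table; the i-th index row aligns the positions of the
// base-table rows that joined with each other.
struct JoinIndex {
    std::vector<std::string> tables;             // e.g. {"T1", "T2", "T3"}
    std::vector<std::vector<uint32_t>> columns;  // columns[t][i]: row of tables[t]
};

// For a plain filter the index degenerates into a single position list.
JoinIndex filterIndex(std::string table, std::vector<uint32_t> rows) {
    return JoinIndex{{std::move(table)}, {std::move(rows)}};
}

// The example from Figure 3a (1-based rows): row 2 of T1 joined row 1
// of T2, and the resulting tuple joined row 2 of T3.
JoinIndex figure3aRow() {
    return JoinIndex{{"T1", "T2", "T3"}, {{2}, {1}, {2}}};
}
```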
5 EXPERIMENTAL EVALUATION
We have evaluated the proposed approach with two experiments, designed to measure: a) the impact of different eviction policies on query execution performance, and b) the impact of compression on the amount of space used for storing intermediates.

We fixed the size of the cache history to the 50 last queries and used only the MIN population policy. The cache size was varied from 2 to 32 thousand position blocks, each of them being 32 KB large when uncompressed.

Finally, we relied on SSB with SF 10 and conducted a series of runs, each consisting of a thousand random "SSB-like" queries. These queries differ from the usual SSB ones only in the way the filtering predicates are generated: in addition to randomly choosing the boundaries, we also randomly select the comparison relation. For example, DATE.D_YEAR = 1993 can become DATE.D_YEAR < 1995.

Experimental environment. We used a laptop with an Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz; 8 GB RAM; HDD Samsung ST1000LM024, 5400 rpm; Funtoo Linux, kernel version 5.2.7; gcc 9.2.0.

Query execution performance. First, we measured the performance of query execution when different eviction policies are used. The results are given in Figure 4 and show that with smaller cache sizes, the choice of a strategy had no significant impact on query performance. With larger cache sizes, however, BENEFIT turned out to be the best and RANDOM the worst.

Another interesting outcome is that BENEFITHIST, which took history into account, resulted in worse performance than BENEFIT, which did not. We suppose that relying on the usage frequency alone results in a longer eviction time for already-cold cache entries (which were hot some time ago). We believe the situation could be improved by also considering the times of the several last uses.

[Figure 4: Query execution performance. Execution time (s) for the BENEFIT, BENEFITHIST, LRU, LFU, LCS, and RANDOM policies as a function of cache size (upper axis in blocks, lower axis in MB).]

Space consumption due to compression. Next, we enabled compression for intermediates and ran the same experiment using only the BENEFIT, LRU, and RANDOM eviction policies. Figure 5 shows the maximum amount of main memory used to store the intermediates during a particular run. The values were initially measured in bytes and then divided by the standard size of a position block (32 KB).

It is worth noting that enabling compression did not have any noticeable effect on query execution times. Therefore, for the sake of brevity, the corresponding graph has been omitted.

[Figure 5: Memory usage before and after compression. Used cache space, in 32 KB position blocks, for the BENEFIT, LRU, and RANDOM policies with and without compression, as a function of cache size (upper axis in blocks, lower axis in MB).]

6 CONCLUSION AND FUTURE WORK
In this paper we have proposed a novel approach to in-memory positional caching inside a column-store with late materialization. Compared to data caching, positional caching allows the system to save space by storing only integers, which can be easily compressed. Experiments performed in PosDB demonstrated the viability of this approach. We have also evaluated a number of policies for cache population and eviction, including our own. Future work will include an evaluation of caching in a distributed environment.

REFERENCES
[1] Daniel Abadi, Peter Boncz, and Stavros Harizopoulos. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Now Publishers Inc., Hanover, MA, USA.
[2] Daniel Abadi, Samuel Madden, and Miguel Ferreira. 2006. Integrating Compression and Execution in Column-oriented Database Systems. In SIGMOD '06.
[3] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden. 2007. Materialization Strategies in a Column-Oriented DBMS. In ICDE '07. 466–475.
[4] Chungmin Melvin Chen and Nicholas Roussopoulos. 1994. The Implementation and Performance Evaluation of the ADMS Query Optimizer: Integrating Query Result Caching and Matching. In EDBT '94. 323–336.
[5] George Chernishev, Viacheslav Galaktionov, Valentin Grigorev, Evgeniy Klyuchikov, Evgeniy Slobodkin, and Kirill Smirnov. [n.d.]. PosDB 2019: A Distributed Disk-Based Column-Store with Late Materialization (submitted). Proc. VLDB Endow. (Oct. [n.d.]).
[6] G. A. Chernishev, V. A. Galaktionov, V. D. Grigorev, E. S. Klyuchikov, and K. K. Smirnov. 2018. PosDB: An Architecture Overview. Programming and Computer Software 44, 1 (Jan. 2018), 62–74.
[7] G. A. Chernishev, V. A. Galaktionov, V. D. Grigorev, E. S. Klyuchikov, and K. K. Smirnov. 2018. PosDB: An Architecture Overview. Programming and Computer Software 44, 1 (Jan. 2018), 62–74.
[8] Rada Chirkova. 2009. Query Containment. Springer US, 2249–2253.
[9] Kayhan Dursun, Carsten Binnig, Ugur Cetintemel, and Tim Kraska. 2017. Revisiting Reuse in Main Memory Database Systems. In SIGMOD '17.
[10] Stavros Harizopoulos, Daniel Abadi, and Peter Boncz. 2009. Column-Oriented Database Systems. VLDB 2009 Tutorial. nms.csail.mit.edu/~stavros/pubs/tutorial2009-column_stores.pdf
[11] Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender, and Martin L. Kersten. 2012. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng. Bull. 35, 1 (2012).
[12] Milena G. Ivanova, Martin L. Kersten, Niels J. Nes, and Romulo A. P. Gonçalves. 2010. An Architecture for Recycling Intermediates in a Column-store. ACM Trans. Database Syst. 35, 4, Article 24 (Oct. 2010).
[13] Fabian Nagel, Peter Boncz, and Stratis D. Viglas. 2013. Recycling in Pipelined Query Evaluation. In ICDE '13. IEEE Computer Society, 338–349.
[14] M. Tamer Ozsu. 2007. Principles of Distributed Database Systems (3rd ed.). Prentice Hall Press, Upper Saddle River, NJ, USA.
[15] Prasan Roy and S. Sudarshan. 2009. Multi-Query Optimization. Springer US, Boston, MA, 1849–1852.
[16] Timos K. Sellis. 1988. Multiple-query Optimization. ACM Trans. Database Syst. 13, 1 (March 1988), 23–52.
[17] Dimitris Tsirogiannis, Stavros Harizopoulos, Mehul A. Shah, Janet L. Wiener, and Goetz Graefe. 2009. Query Processing Techniques for Solid State Drives. In SIGMOD '09. Association for Computing Machinery, 59–72.
[18] Patrick Valduriez. 1987. Join Indices. ACM Trans. Database Syst. 12, 2 (June 1987), 218–246.
[19] Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. 2000. The Implementation and Performance of Compressed Databases. SIGMOD Rec. 29, 3 (Sept. 2000), 55–67.
[20] M. Zukowski, S. Heman, N. Nes, and P. Boncz. 2006. Super-Scalar RAM-CPU Cache Compression. In ICDE '06.