=Paper=
{{Paper
|id=Vol-3163/paper5
|storemode=property
|title=Efficient External Sorting in DuckDB
|pdfUrl=https://ceur-ws.org/Vol-3163/BICOD21_paper_9.pdf
|volume=Vol-3163
|authors=Laurens Kuiper,Mark Raasveldt,Hannes Mühleisen
|dblpUrl=https://dblp.org/rec/conf/bncod/KuiperRM21
}}
==Efficient External Sorting in DuckDB==
Efficient External Sorting in DuckDB

Laurens Kuiper, Mark Raasveldt and Hannes Mühleisen
CWI, Amsterdam

Abstract

Interactive data analysis is often conveniently done on personal computers that have limited memory. Current analytical data management systems rely almost exclusively on main memory for computation. When the data size exceeds the memory limit, many systems cannot complete queries or resort to an external execution strategy that assumes a high I/O cost. These strategies are often much slower than the in-memory strategy. However, I/O cost has gone down: most modern laptops have fast NVMe storage. We believe that the difference between in-memory and external execution does not have to be this big. We implement a parallel external sorting operator in DuckDB that demonstrates this. Experimental results with our implementation show that even when the data size far exceeds the memory size, the performance loss is negligible. From this result, we conclude that it is possible to have a graceful degradation from in-memory to external sorting.

Keywords: Sorting, Parallel Sorting, In-Memory Sorting, Disk-Based External Sorting, Relational Databases

1. Introduction

It is not uncommon for database systems to have hundreds or even thousands of gigabytes of RAM at their disposal. High-performance systems such as HyPer [1] and ClickHouse [2] fully utilize the available memory and perform much better on analytical workloads than their traditional disk-based counterparts. Because these systems usually run on machines with such large memory capacities, the assumption is often that the workload fits in memory.

[Figure 1: DuckDB's row layout and row heap.]

While laptops have also enjoyed increased memory capacity, their physical design has limited space. Therefore they typically have only 16GB of memory. Laptops are often used in interactive data analysis, with tools like Pandas [3] and dplyr [4], showing that there is a need for analytical data management technology that runs on a laptop. However, these tools operate only in memory. As a result, users cannot process datasets that are even slightly larger than memory on their own machine.

Disk-based database systems, on the other hand, have long solved the problem of processing larger-than-memory datasets. These systems are, however, generally much slower than in-memory systems on analytical workloads. When a user wants to process a larger-than-memory dataset using an in-memory system, usually one of two things happens: 1) the system throws an error stating that it is out of memory, or 2) the system switches to an external strategy that is much less efficient than the in-memory strategy, which results in a slow execution time, even when, for example, the input is only 10% larger than memory. Fast queries may become slow or run into an error when a table grows in size, creating a frustrating experience for users.

We can mitigate this problem by implementing operators such that they optimally use the amount of available memory and only write data to disk when this is necessary. I/O quickly becomes the bottleneck on machines with low-bandwidth storage devices. However, most modern laptops have NVMe storage with high write speeds, making I/O less of a limiting factor.

We have implemented a parallel, external sorting operator in DuckDB [5] that demonstrates this. Our implementation seamlessly transitions from in-memory to external sorting by storing data in buffer-managed blocks that are offloaded to disk using a least-recently-used queue, similar to LeanStore [6]. Transitioning from in-memory to disk is made possible by DuckDB's unified internal row layout, shown in Figure 1, which can be spilled to disk using pointer swizzling [7].
We compare our implementation against four other systems using an improvised relational sorting benchmark on two tables from TPC-DS [8]. Our implementation achieves excellent performance when data fits in memory and shows a graceful degradation in performance as we go over the limit of available memory.

BICOD'21: British International Conference on Databases, December 9–10, 2021, London, UK. laurens.kuiper@cwi.nl (L. Kuiper); raasveld@cwi.nl (M. Raasveldt); hannes.muehleisen@cwi.nl (H. Mühleisen). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Sorting Relational Data

Sorting is one of the most well-studied problems in computing science. Research in this area forms the basis of sorting in database systems but focuses mostly on sorting large arrays or key/value pairs. Sorting is more complex for relational data: many different types need to be supported, as well as NULL values, and there can be multiple order clauses. Besides sorting the order clause (key) columns, all other selected (payload) columns need to be re-ordered as well.

[Figure 2: Binary string encoding. The original data in (a) is represented as (b) on a little-endian machine. It is encoded as (c) when ordered for a query with order clauses c_birth_country DESC, c_birth_year ASC. Descending order flips the bits. The string "GERMANY" is padded to ensure a fixed size.]

In 2006, Goetz Graefe surveyed sorting in database systems [9]. The most important takeaway from this survey, when it comes to performance, is that the cost of sorting is dominated by comparing values and re-ordering data. Anything that makes either of these two operations cheaper will have an impact on the overall speed.

There are two obvious ways to implement a comparator in a column store when there are multiple ORDER BY clauses:

1. Iterate through the clauses: compare columns until we find one that is not equal, or until we have compared all columns. This comparator jumps between columns, causing random access in memory.
2. Sort the data entirely by the first clause, then sort by the second clause, but only where the first clause was equal, and so on. This approach requires multiple passes over the data.

The binary string comparison technique [10] improves sorting performance by simplifying the comparator. It encodes all columns in the ORDER BY clause into a single binary sequence that, when compared using memcmp, yields the correct overall sorting order. This encoding also yields the correct order with a byte-by-byte radix sort. Although this technique has existed since the days of System R, not many systems use it today, opting instead for one of the approaches listed above.

We implement this comparator and opt for fixed-size encodings, which can be re-ordered more easily. For variable-size types such as strings, we can therefore only encode a prefix; we compare the whole string only when prefixes are equal. The encoding is shown in Figure 2. Not shown in the figure are NULL values and collations. NULL values are handled by prefixing each value with an additional byte denoting whether the value is NULL. Collations are handled by evaluating the collation function before encoding the prefix of the string.

The other high cost of sorting is re-ordering data. Systems that use columnar storage must re-order all selected columns, which causes a random access pattern for each of them. Row-based systems only have to pay the cost of this random access pattern once. When we select many columns, this becomes a considerable advantage.

A few relational operators are inherently row-based, such as joins and aggregations. For vectorized execution engines, it is common practice to physically convert vectors to and from a row layout for these operators using scatter-gather. We argue that this layout should also be used for sorting, as sorting is a row-based operator as well.

We show DuckDB's row layout in Figure 1. Rows have a fixed size and can easily be re-ordered while sorting. We represent variable-sized columns with pointers into a string heap where the data resides. The heap data does not have to be re-ordered while sorting in memory. Each fixed-size row has an additional pointer field that points to its "heap row". This pointer is not used for in-memory sorting but is crucial for external sorting, which we explain in Section 2.2.

2.1. Parallel Sorting

DuckDB uses Morsel-Driven Parallelism [11], a framework for parallel query execution. For the sorting operator, this means that multiple threads collect a roughly equal amount of data from the input table in parallel. We use this parallelism by letting each thread sort its collected data using radix sort. After this initial sorting phase, each thread has one or more sorted blocks of data, which must be combined into the final sorted result using merge sort.

There are two main ways of implementing merge sort: K-way merge and cascade merge. K-way merge merges K lists into one sorted list in a single pass and is traditionally used for external sorting because it minimizes I/O [12]. Cascade merge is used for in-memory sorting because it is more efficient than K-way merge: it merges two lists of sorted data at a time until only one sorted list remains.

Recent work on K-way external merge sort [13] on devices with flash memory reduces execution time by 20% to 35% compared to standard external merge sort. Salah et al. [14] show that K-way merge can achieve better performance than cascade merge when it comes to in-place sorting. This work on K-way merging may seem attractive for our implementation, but cascade merge is still the more attractive option for in-memory sorting. We also need to deal with variable-size data, which complicates in-place sorting. We do not want to compromise in-memory sorting, and therefore choose cascade merge.

Cascade merge is embarrassingly parallel when there are many more sorted blocks than available threads. As the blocks get merged, however, there will not be enough blocks left to keep all threads busy. In the final round, when we merge two blocks to create the final sorted result, there is no parallelism: one thread processes all the data. We parallelize this using Merge Path [15]. Merge Path pre-computes where the blocks intersect, creating partitions that can be merged independently of each other. Binary search efficiently computes these partitions.

2.2. External Sorting

To sort more data than fits in memory, we write blocks of sorted data to disk. Rather than actively doing this in our sorting implementation, DuckDB's buffer manager decides when to do this: when memory is full.

Writing data to disk is trivial for fixed-size types but non-trivial for variable-size types, as the pointers in our row layout would be invalidated. To be able to offload variable-sized types as well, we use pointer swizzling. When we are sorting externally, we convert the pointers in the row layout to offsets, as shown in Figure 3.

[Figure 3: DuckDB's internal row layout (swizzled).]

We replace the 8-byte pointer field with an 8-byte offset, which denotes where the strings of this row reside in the heap block. We also replace the pointers to the string values within the row with an 8-byte relative offset. This offset denotes how far this particular string is located from the start of this row's heap row. Using relative offsets within rows rather than absolute offsets is very useful during sorting: these relative offsets stay constant and do not need to be updated when we copy the row.

With this dual-purpose row-wise representation, we achieve an almost seamless transition between in-memory and external sorting. The only difference between the two is swizzling and unswizzling each pointer once, and re-ordering the heap during sorting.

3. Evaluation

In this section, we evaluate DuckDB's sorting implementation. We compare against ClickHouse [2], HyPer [1], Pandas [3] and SQLite [16]. HyPer and ClickHouse are full-blown analytical database systems that focus on parallel in-memory computation. Pandas is single-threaded and in-memory, and SQLite is a more traditional (single-threaded) disk-based system.

To demonstrate how the systems perform in an environment with limited RAM, we run our experiments on a 2020 MacBook Pro. It has 16GB of memory and a fast SSD with a write speed of over 3GB/s. HyPer does not yet run natively on ARM CPUs, so we emulate it using Rosetta 2. See our blog [17] for an experiment on the x86 CPU architecture.

Benchmarking sorting in database systems is not straightforward. We would like to measure only the time it takes to sort, with as little noise as possible. Therefore, we cannot use SELECT queries, as the client-server protocol would quickly dominate query runtime [18]. To approach a fair comparison, we measure the end-to-end time of queries that sort the data and write the result to a temporary table, i.e., CREATE TEMPORARY TABLE output AS SELECT ... FROM ... ORDER BY ...;. For Pandas, we use sort_values with inplace=False to mimic this query. To measure stable end-to-end query time, we run each query 5 times and report the median run time. Scripts for our experiments are available on GitHub: https://github.com/lnkuiper/experiments/tree/master/sorting.

We have created a relational sorting benchmark on the customer and catalog_sales tables from TPC-DS [8]. The row counts at different scale factors are shown in Table 1.

Table 1: TPC-DS table row counts for customer and catalog_sales at scale factors 10, 100, and 300.

  scale factor | catalog_sales | customer
  10           |    14,401,261 |   500,000
  100          |   143,997,065 | 2,000,000
  300          |   260,014,080 | 5,000,000

TPC-DS tables are challenging for sorting implementations because they are wide (many columns, unlike the tables in TPC-H) and have a mix of fixed- and variable-sized types. catalog_sales has 34 columns, all fixed-size types: integer and double. customer has 18 columns, fixed- and variable-size: 10 integers and 8 strings.

We sort the catalog_sales table on cs_quantity and cs_item_sk, and select an increasing number of payload columns. This experiment tests both the system's ability to sort and its ability to re-order the payload. We show the results of this experiment in Figure 4.

[Figure 4: catalog_sales ordered by cs_quantity, cs_item_sk, with an increasing number of payload columns, at SF10, SF100, and SF300. These columns have few unique values, creating a difficult challenge for sorting algorithms.]
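The key-encoding idea can be illustrated with a small sketch (an illustrative Python version, not DuckDB's actual code): integers are encoded big-endian with the sign bit flipped, strings as fixed-size padded prefixes, a leading byte marks NULLs, and all bytes are inverted for descending clauses, so that a plain byte-wise comparison of the concatenated keys yields the query's sort order. The column names and the 12-byte prefix width are taken from the Figure 2 example.

```python
def encode_int(value, descending=False):
    # Big-endian with the sign bit flipped: signed integers then compare
    # correctly as unsigned byte strings. A leading byte marks NULL,
    # so NULLs sort first in ascending order.
    if value is None:
        b = bytes(5)
    else:
        b = b"\x01" + (value + (1 << 31)).to_bytes(4, "big")
    return bytes(x ^ 0xFF for x in b) if descending else b

def encode_str(value, width, descending=False):
    # Fixed-size prefix, zero-padded, so the encoded keys stay fixed-size.
    b = value.encode()[:width].ljust(width, b"\x00")
    return bytes(x ^ 0xFF for x in b) if descending else b

def sort_key(row):
    # ORDER BY c_birth_country DESC, c_birth_year ASC (as in Figure 2)
    country, year = row
    return encode_str(country, 12, descending=True) + encode_int(year)

rows = [("GERMANY", 1992), ("NETHERLANDS", 1992), ("GERMANY", 1924)]
rows.sort(key=sort_key)
# NETHERLANDS sorts before GERMANY (DESC); equal countries by year ASC
```

Because the keys are plain byte strings, the same encoding also works directly as input to a byte-by-byte radix sort, which is how the paper uses it.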
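The cascade strategy described here can be sketched as follows (a simplified Python illustration, not DuckDB's implementation): sorted blocks are merged pairwise, round by round, until a single sorted block remains; in a real executor, each pairwise merge in a round can run on its own thread.

```python
import heapq

def cascade_merge(blocks):
    # Merge two sorted blocks at a time until one sorted block remains.
    blocks = [list(b) for b in blocks]
    while len(blocks) > 1:
        nxt = [list(heapq.merge(blocks[i], blocks[i + 1]))
               for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:  # odd block count: carry the last block over
            nxt.append(blocks[-1])
        blocks = nxt
    return blocks[0]

print(cascade_merge([[1, 4], [2, 3], [0, 5], [6]]))  # [0, 1, 2, 3, 4, 5, 6]
```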
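The partitioning step of Merge Path can be sketched as follows (a minimal Python illustration, not the DuckDB code): for each diagonal d, a binary search finds the split (i, j) with i + j = d such that the first i elements of one block and the first j of the other are exactly the d smallest elements, so each thread can then merge its partition independently.

```python
def merge_path_split(a, b, d):
    # Binary search for the split (i, j), i + j = d, where a[:i] and b[:j]
    # together contain the d smallest elements of the merged result.
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] < b[d - i - 1]:
            lo = i + 1  # the split point lies further right in a
        else:
            hi = i
    return lo, d - lo

a, b = [1, 3, 5, 7], [2, 4, 6, 8]
threads = 4
total = len(a) + len(b)
splits = [merge_path_split(a, b, t * total // threads)
          for t in range(threads + 1)]
# Each partition merges independently; sorted() stands in for a 2-way merge.
parts = [sorted(a[i0:i1] + b[j0:j1])
         for (i0, j0), (i1, j1) in zip(splits, splits[1:])]
merged = [x for part in parts for x in part]  # fully sorted result
```

Each thread touches a disjoint rank range of the output, which is why the final two-block merge round parallelizes without any coordination between threads.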
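The pointer-to-offset conversion can be sketched conceptually (a toy Python model with integer "addresses", not DuckDB's actual code): subtracting the heap block's base address turns pointers into offsets that survive a round trip to disk, and adding the new base address after reloading restores valid pointers, even if the block comes back at a different address.

```python
class Row:
    def __init__(self, key, heap_ptr):
        self.key = key
        self.heap_ptr = heap_ptr  # absolute "address" of this row's heap row

def swizzle(rows, heap_base):
    # Pointer -> offset: the block can now be written to disk safely.
    for r in rows:
        r.heap_ptr -= heap_base

def unswizzle(rows, new_heap_base):
    # Offset -> pointer: valid again, even though the block was reloaded
    # at a different address.
    for r in rows:
        r.heap_ptr += new_heap_base

rows = [Row("a", 0x1000), Row("b", 0x1020)]
swizzle(rows, heap_base=0x1000)        # offsets 0x00 and 0x20
unswizzle(rows, new_heap_base=0x8000)  # block now lives at 0x8000
```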
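The measurement methodology (CREATE TEMPORARY TABLE ... AS SELECT ... ORDER BY ..., median of five runs) can be sketched with SQLite, one of the systems under comparison; this is a simplified harness with a hypothetical table, not the paper's actual benchmark scripts.

```python
import sqlite3
import statistics
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k INTEGER, payload TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(i % 97, f"row{i}") for i in range(100_000)])

def run_once():
    con.execute("DROP TABLE IF EXISTS output")
    start = time.perf_counter()
    # Sorting into a temporary table avoids measuring the cost of
    # transferring the result over the client protocol.
    con.execute("CREATE TEMPORARY TABLE output AS "
                "SELECT * FROM t ORDER BY k, payload")
    return time.perf_counter() - start

times = [run_once() for _ in range(5)]
print(f"median sort time: {statistics.median(times):.4f}s")
```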
The grey vertical lines in the SF100 plot of Figure 4 indicate at which points the payload columns take up 50% and 100% of the amount of available memory.

The table fits in memory at SF10, and the systems' performances are in the same ballpark, except for SQLite; DuckDB is the clear winner. At SF100, at around 14 selected payload columns, the input table takes up 50% of memory. It is common for systems not to sort data in place, but to copy it to a new location, which requires double the amount of memory. The figure clearly shows this: all systems except DuckDB and SQLite run into an error due to running out of memory and are unable to complete the benchmark.

ClickHouse switches to an external sorting strategy, which is much slower than its in-memory strategy. Therefore, adding a few payload columns results in a runtime that is orders of magnitude higher. Despite switching strategy, ClickHouse still runs into an out-of-memory error.

HyPer uses the mmap system call, which creates a mapping between a block of memory and a file. This allows HyPer to continue for a while when the data no longer fits in memory, before running into an error as well. As we can see, the runtime becomes very slow before HyPer runs out of memory: random memory access becomes random disk access.

Surprisingly, Pandas can load the dataset at SF100 because macOS dynamically increases the swap size. Most operating systems do not do this, and Pandas would then not load the dataset at all. Pandas relies on NumPy's [19] single-threaded quicksort implementation. Pandas shows impressive performance, partly because it already has the input data fully materialized in memory and does not have to stream data through an execution pipeline from the input to the output table like most DBMSes.

Meanwhile, DuckDB and SQLite do not show a visible difference in performance when the data no longer fits in memory. SQLite always opts for a traditional external sorting strategy, resulting in robust, but overall slower, performance than DuckDB.

In our next benchmark, on the customer table, we test how well the systems can sort by strings and by integers. Both comparing and re-ordering strings are much more expensive than comparing and re-ordering numeric types.

[Figure 5: customer ordered by different column types: integer columns (c_birth_year, c_birth_month, c_birth_day), and string columns (c_first_name, c_last_name).]

We show the result of this experiment in Figure 5. As expected, sorting by strings is slower than sorting by integers for most systems. In this experiment, the payload also includes string columns. Pandas has an advantage here because it already has the strings in memory and most likely only needs to re-order pointers to them. The database systems need to copy strings twice: once when reading the input table, and again when creating the output table. Profiling in DuckDB reveals that the actual sorting takes less than a second at SF300; most time is spent on (de)serializing strings. See [17] for more details on the difference between integers and strings.

4. Conclusion and Future Work

In this paper, we presented our parallel external sorting implementation in DuckDB. We compared it against four data management systems using a relational sorting benchmark based on TPC-DS. Three of the four systems perform well in memory, but crash as the data size goes over the amount of available memory. DuckDB is the only system under benchmark that performs well both in memory and externally. These results demonstrate that it is possible to implement a sorting operator that is efficient in memory and shows a graceful degradation in performance as the input size exceeds the memory limit.

4.1. Future Work

It is unclear how much each technique contributed to end-to-end performance. Quantifying these contributions, e.g. through simulation, is an area of future research.

Our sorting implementation uses a row layout that can be offloaded to storage using pointer swizzling. Other blocking operators could benefit from this layout as well, for instance join, aggregation, and window. Incorporating this layout would enable external computation for these operators too. Implementing these operators such that their performance degrades gracefully as the input size exceeds the memory limit is an area of future research.

References

[1] A. Kemper, T. Neumann, HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots, in: S. Abiteboul, K. Böhm, C. Koch, K. Tan (Eds.), Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, IEEE Computer Society, 2011, pp. 195–206. doi:10.1109/ICDE.2011.5767867.
[2] B. Imasheva, A. Nakispekov, A. Sidelkovskaya, A. Sidelkovskiy, The Practice of Moving to Big Data on the Case of the NoSQL Database, ClickHouse, in: H. A. L. Thi, H. M. Le, T. P. Dinh (Eds.), Optimization of Complex Systems: Theory, Models, Algorithms and Applications, WCGO 2019, World Congress on Global Optimization, Metz, France, 8-10 July, 2019, volume 991 of Advances in Intelligent Systems and Computing, Springer, 2019, pp. 820–828. doi:10.1007/978-3-030-21803-4_82.
[3] W. McKinney, et al., Data structures for statistical computing in Python, in: Proceedings of the 9th Python in Science Conference, volume 445, Austin, TX, 2010, pp. 51–56.
[4] H. Wickham, R. François, L. Henry, K. Müller, dplyr: A Grammar of Data Manipulation, 2021. URL: https://CRAN.R-project.org/package=dplyr. R package version 1.0.7.
[5] M. Raasveldt, H. Mühleisen, DuckDB: An Embeddable Analytical Database, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1981–1984. doi:10.1145/3299869.3320212.
[6] V. Leis, M. Haubenschild, A. Kemper, T. Neumann, LeanStore: In-Memory Data Management beyond Main Memory, in: 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, IEEE Computer Society, 2018, pp. 185–196. doi:10.1109/ICDE.2018.00026.
[7] A. Kemper, D. Kossmann, Adaptable Pointer Swizzling Strategies in Object Bases: Design, Realization, and Quantitative Analysis, VLDB J. 4 (1995) 519–566. URL: http://www.vldb.org/journal/VLDBJ4/P519.pdf.
[8] M. Poess, B. Smith, L. Kollar, P. Larson, TPC-DS, Taking Decision Support Benchmarking to the Next Level, in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD '02, Association for Computing Machinery, New York, NY, USA, 2002, pp. 582–587. doi:10.1145/564691.564759.
[9] G. Graefe, Implementing Sorting in Database Systems, ACM Comput. Surv. 38 (2006) 10–es. doi:10.1145/1132960.1132964.
[10] M. W. Blasgen, R. G. Casey, K. P. Eswaran, An Encoding Method for Multifield Sorting and Indexing, Commun. ACM 20 (1977) 874–878. doi:10.1145/359863.359892.
[11] V. Leis, P. Boncz, A. Kemper, T. Neumann, Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age, in: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 743–754. doi:10.1145/2588555.2610507.
[12] D. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, 1973.
[13] R. Jackson, J. Gresl, R. Lawrence, Efficient External Sorting for Memory-Constrained Embedded Devices with Flash Memory, ACM Trans. Embed. Comput. Syst. 20 (2021). doi:10.1145/3446976.
[14] A. Salah, K. Li, Q. Liao, M. Hashem, Z. Li, A. T. Chronopoulos, A. Y. Zomaya, A Time-Space Efficient Algorithm for Parallel k-Way In-Place Merging Based on Sequence Partitioning and Perfect Shuffle, ACM Trans. Parallel Comput. 7 (2020). doi:10.1145/3391443.
[15] O. Green, S. Odeh, Y. Birk, Merge Path - A Visually Intuitive Approach to Parallel Merging, CoRR abs/1406.2628 (2014). URL: http://arxiv.org/abs/1406.2628. arXiv:1406.2628.
[16] R. D. Hipp, SQLite, 2021. URL: https://www.sqlite.org/index.html.
[17] L. Kuiper, Fastest table sort in the West - Redesigning DuckDB's sort, 2021. URL: https://duckdb.org/2021/08/27/external-sorting.html.
[18] M. Raasveldt, H. Mühleisen, Don't Hold My Data Hostage: A Case for Client Protocol Redesign, Proc. VLDB Endow. 10 (2017) 1022–1033. doi:10.14778/3115404.3115408.
[19] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with NumPy, Nature 585 (2020) 357–362. doi:10.1038/s41586-020-2649-2.