Assessing Non-volatile Memory in Modern Heterogeneous Storage Landscape using a Write-optimized Storage Stack

Sajad Karim¹, Johannes Wünsche², David Broneske¹, Michael Kuhn² and Gunter Saake¹
¹ Databases and Software Engineering, Otto von Guericke University Magdeburg
² Parallel Computing and I/O, Otto von Guericke University Magdeburg

34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
sajad.karim@ovgu.de (S. Karim); johannes.wuensche@ovgu.de (J. Wünsche); david.broneske@ovgu.de (D. Broneske); michael.kuhn@ovgu.de (M. Kuhn); gunter.saake@ovgu.de (G. Saake)
ORCID: 0009-0002-4910-8453 (S. Karim); 0000-0002-5304-7262 (J. Wünsche); 0000-0002-9580-740X (D. Broneske); 0000-0001-8167-8574 (M. Kuhn); 0000-0001-9576-8474 (G. Saake)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract

Non-volatile memory (NVM), or persistent memory, is a promising and emerging storage technology that has not only disrupted the long-established memory hierarchy but also invalidated the programming paradigm used in traditional database management systems and file systems. It bridges the gap between primary and secondary storage and, hence, shares the characteristics of both categories. However, there currently exists no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous with NVM. Therefore, a general-purpose storage engine, Haura, is utilized to study the benefits of the modern storage landscape. In this work, NVM is integrated into the storage stack of Haura, and we study the access patterns of the storage devices involved and their impact on the performance of Haura. Our work shows that NVM performs best under sequential workloads and that random access performs better with larger block sizes. Furthermore, the block size has a significant impact on the performance of the storage devices, with smaller block sizes favoring NVM and larger block sizes favoring NVMe-supported devices.

Keywords: Non-Volatile Memory, Persistent Memory, Storage Class Memory, Non-Volatile Random Access Memory, Persistent Memory Programming, Modern Heterogeneous Storage Landscape, Write-Optimized Storage Engine (Haura)

1. Introduction

According to Kazemie [1], the volume of data in 2011 was around 1.8 zettabytes, and its volume doubles approximately every two years. At least, that was before the onset of COVID-19, whose outbreak further fueled this growth as the use of digital services rose exponentially. This data deluge is unprecedented and has created new challenges for the database management systems and file systems that are used in a wide range of applications for data analysis and management.

Traditional database management systems and file systems are developed considering the typical storage hierarchy, where memory is fast but volatile and limited, and secondary storage is persistent and vast but has high latency. In such systems, the data is logically split into two sets of copies: working and consistent copies. The working copy resides in main memory, whereas the persistent copy resides on one or more secondary storage devices. Also, making data persistent is an error-prone process, as problems like crashes and race conditions can corrupt data or leave it in an inconsistent state. Therefore, strategies like journaling and copy-on-write are used to ensure the consistency of data. Moreover, the scalability of DRAM resulted in main-memory database systems [2, 3, 4]. However, scaling DRAM further has become quite a challenging task [5], and also, because of its energy consumption, the solution is unaffordable for most businesses.

Persistent memory, on the other hand, is considered an alternative to deal with the above-mentioned issues. It is a new category in the storage hierarchy that is non-volatile, byte-addressable, provides DRAM-like latency, and offers much higher capacity than DRAM. This new storage class has opened opportunities not only for new system designs but also for enhancements in existing storage engines. For instance, some work has already been done in traditional database systems in which NVM is used to improve traditional disk-based (centralized/decentralized) logging [6, 7, 8, 9]. In prior work, NVM is used as a buffer between DRAM and secondary storage devices [10, 5, 11]. Moreover, several index data structures, like NVTree [12] and FPTree [13], have been introduced that exploit the properties of NVM. Nevertheless, there is presently no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous after the addition of NVM; in consequence, there is a research gap in this direction.

In order to investigate the benefits of a common storage engine that manages all the storage devices in the modern storage landscape, a prototype of a general-purpose storage engine, called Haura [14, 15], is used. It runs in user space and supports object and key-value interfaces. The key contributions of our work are as follows:

• We used PMDK (Persistent Memory Development Kit) to supplement Haura with persistent memory. PMDK is an open-source (C/C++) kit that offers different libraries/utilities to interact with persistent memory. We used the most appropriate library after a brief evaluation process.
• We investigated the impact of the above-mentioned change on Haura and used persistent memory to store the B𝜖-tree nodes (Section 2.3).
• We identified the access patterns for different storage devices that supplement or affect the throughput of Haura.

The remainder of this paper is structured as follows. Section 2 provides background on non-volatile memory and the related programming techniques, and also briefly describes Haura. The implementation is discussed in Section 3, followed by the evaluation in Section 4. Next, Section 5 details the related work, and the paper concludes with a summary and open challenges in Section 6.

2. Background

In this section, we discuss non-volatile memory and different programming models to access it. We then describe Haura and briefly touch upon its key components.

2.1. Non-Volatile Memory

Persistent Memory (PMem), Storage Class Memory (SCM), Non-Volatile Memory (NVM), and Non-Volatile RAM (NVRAM) are names often used to address this new class of storage. It sits between primary and secondary storage in the typical storage hierarchy, and it is also considered a disruptive technology, as it has disrupted the traditional memory paradigm. It is non-volatile, has DRAM-like latency, and offers much higher capacity than DRAM. It is byte-addressable, and the fact that the CPU can access it directly at cache-line granularity demands a different architecture than the one used in typical storage engines. Some example technologies are Phase Change Memory [16], Spin Transfer Torque RAM (STT-RAM) [17], Carbon NanoTube RAM (NRAM, NanoRAM) [18], and memristors [19].

Presently, only Intel® produces persistent memory modules, under the brand name Intel® Optane™ DC Persistent Memory. It offers different generations that vary in performance and capacity, and the modules are designed to be used with specific generations of Intel®'s processors, Intel® Xeon® Scalable Processors, for instance. They are available in the DIMM form factor and are compatible with conventional DDR4 sockets. They co-exist with conventional DDR4 DRAM DIMMs and use the same memory channel. The internal granularity of the modules is 256 bytes, and they can be operated in three different modes: memory, app direct, and dual.

In the memory mode, NVRAM supplements DRAM: DRAM acts as an L4 cache and NVRAM as the main (volatile) memory. The host memory controller integrated into the processor manages the movement of data between DRAM and NVRAM. In the app direct mode, on the other hand, NVRAM is a persistent memory module to which applications have direct access; it is still byte-addressable, and applications can use it as a storage device. Lastly, in the dual mode, part of the NVRAM can be allocated to applications, and the rest can be utilized as non-volatile memory.

2.2. Non-Volatile Memory Programming

Typical programming models categorize data structures into two broad categories: memory-resident and storage-resident data structures [20]. This is mainly due to the underlying system architecture, where main memory is attached directly to the memory bus while secondary storage, due to its high latency, communicates with the system via an I/O controller. The models operate on the data in main memory at byte granularity and ensure its persistency by explicitly writing it to secondary storage. A key challenge of such models is to ensure the consistency and integrity of data across storage classes that have different characteristics. For example, the integrity of data in main memory could be ensured using mutexes, whereas the consistency and durability of data on secondary storage are ensured using strategies like journaling and write-ahead logging [21].

The above-mentioned programming paradigm cannot be followed when working with persistent memory, as it is attached to the memory bus and is non-volatile. Therefore, a new model is inevitable that simultaneously addresses all the atomicity and consistency issues, like concurrency and power failure, for instance [22, 20, 23]. Moreover, persistent memory is a comparatively new technology, and writing software with all these considerations requires in-depth knowledge of the hardware and caches. Therefore, several APIs are available that handle the hardware-related intricacies internally. PMDK¹ (Persistent Memory Development Kit) is one such example, which is based on the SNIA NVM programming model².

SNIA (Storage Networking Industry Association) proposes different programming models [22, 24, 20, 23] to program persistent memory. The simplest way is to use the module as a block device and access it using a standard file API. Another approach is via an optimized file system, ext4 and XFS in Linux and NTFS in Windows, that is adapted specifically for persistent memory. This approach, contrary to the previous one, allows small read and write operations and is more efficient. Last but not least, DAX, or Direct Access, is another approach in which the persistent memory is accessed as a memory-mapped file. Nevertheless, contrary to memory-mapping files on secondary storage, the operating system does not maintain pages for persistent memory in main memory.

¹ https://pmem.io/pmdk/
² https://snia.org/tech_activities/standards/curr_standards/npm

2.3. Haura

Haura is a general-purpose and write-optimized tiered storage stack that runs in user space and supports object and key-value interfaces [14].
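The write optimization mentioned above rests on the B𝜖-tree, whose central idea — buffering updates as messages in the upper tree levels and flushing them downward in batches — can be sketched as a toy, single-buffer model. This is an illustration of the general technique only; Haura's actual node layout, message types, and flushing policy differ.

```rust
use std::collections::BTreeMap;

// Toy sketch of the B𝜖-tree write path: updates are appended as messages
// to a buffer held in the upper tree levels and are only flushed down to
// the leaves in batches once the buffer fills up. Illustrative only.
struct BeTree {
    buffer: Vec<(u64, i64)>,    // pending messages: (key, value)
    buffer_capacity: usize,
    leaves: BTreeMap<u64, i64>, // stand-in for the persisted leaf level
    flushes: usize,             // number of batched flushes performed
}

impl BeTree {
    fn new(buffer_capacity: usize) -> Self {
        Self { buffer: Vec::new(), buffer_capacity, leaves: BTreeMap::new(), flushes: 0 }
    }

    // A write is cheap: it only appends a message to the buffer.
    fn insert(&mut self, key: u64, value: i64) {
        self.buffer.push((key, value));
        if self.buffer.len() >= self.buffer_capacity {
            self.flush();
        }
    }

    // One large batched write replaces many small random ones.
    fn flush(&mut self) {
        for (k, v) in self.buffer.drain(..) {
            self.leaves.insert(k, v);
        }
        self.flushes += 1;
    }

    // A read must consult the buffered messages first (newest wins).
    fn get(&self, key: u64) -> Option<i64> {
        self.buffer.iter().rev().find(|(k, _)| *k == key).map(|(_, v)| *v)
            .or_else(|| self.leaves.get(&key).copied())
    }
}

fn main() {
    let mut tree = BeTree::new(4);
    for i in 0..10 {
        tree.insert(i, i as i64);
    }
    assert_eq!(tree.flushes, 2);      // two batched flushes of four messages each
    assert_eq!(tree.get(9), Some(9)); // answered from the buffer
    assert_eq!(tree.get(3), Some(3)); // already flushed to the leaf level
}
```

The batching is what makes the structure write-optimized: the cost of pushing a node to a storage device is amortized over all the messages it carries.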
It has the ability to handle multiple datasets, provides data integrity, and supports advanced features like snapshots, data placement, and fail-over strategies. The core of the engine is the B𝜖-tree, which is the sole index data structure in the engine. The engine supports block storage devices, solid-state drives, for instance, and also has its own caching layer using DRAM (separate from the operating system's). It also offers features like data striping, mirroring, and parity. It follows the ZFS [25] architecture and uses a similar layered approach [14], where layers, or modules, interact with each other using interfaces. A schematic diagram of Haura is presented in Figure 1.

Figure 1: A layered conceptual diagram illustrating the main components of Haura. The objects in grey represent the main modules, whereas the objects in yellow and sky blue are the helper modules and classes, respectively. The key classes in the modules are represented using white blocks.

Haura was initially built as a key-value storage engine in which arbitrary-sized keys and values can be stored; an extension to support objects was made later in [15], in which the ObjectStore module was added to the stack. The ObjectStore module exposes the necessary routines to interact with the engine and supports all the primitive operations like create, read, write, and query. It uses the same key-value interface to store the objects. However, a key challenge it addresses is the transformation of objects into key-value pairs. In the key-value version of Haura, although the keys and the values can have variable sizes, there was still an upper limit defined on their sizes. Objects, on the other hand, can contain data of several gigabytes; therefore, a mechanism is devised in which objects are split into chunks, and each chunk is assigned a unique identifier. Moreover, the object name is primarily the key of the object; it can have a variable size and can spread over a few kilobytes. Therefore, an indirection is added where the object name and other metadata are stored separately, and a unique fixed-size identifier is assigned to each chunk. Furthermore, the ObjectStore module, via the Database module, maintains two different datasets to store the data and the metadata of the objects. The first dataset stores the chunks of the object, whereas the second dataset stores the indirection-related information and other metadata, like modification time and size, to name a few.

The Database module controls and manages all the activities regarding a database. A database in Haura consists of one or more datasets and their respective snapshots. Datasets and snapshots are actually B𝜖-trees. Moreover, the module also maintains a separate B𝜖-tree, named the root tree, to store all the information regarding the database. For example, it maintains the active datasets and their pointers into storage. It also maintains information regarding the usage of the storage devices in the form of bitmaps.

The Tree module contains the actual implementation of the B𝜖-tree, encapsulates all the tree-related operations, and exposes the methods to the upper layer.

The DataManagement module ensures the persistence of the underlying data and its retrieval when requested. It internally interacts with a wide range of modules, especially the helper modules, and plays a vital role in achieving their internal functionalities. The cache module, for instance, is managed by this module. The write and update requests from the upper layer first land in this module and are then passed on to the cache module. Similarly, in the case of a cache miss, this module fetches the required data using the StoragePool module and passes the data to the cache module for later usage.
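The object-to-key-value mapping described earlier in this section — splitting an object into fixed-size chunks keyed by a fixed-size identifier, with the variable-length name kept in a separate metadata entry — can be sketched as follows. All identifiers and layout details here are hypothetical and simplified from the description above, not Haura's actual on-disk format.

```rust
use std::collections::BTreeMap;

const CHUNK_SIZE: usize = 4; // tiny for illustration; real chunks are far larger

// Hypothetical sketch of the indirection: object names map to a fixed-size
// object id plus length, and chunk keys are (object id, chunk index).
struct ObjectStore {
    next_id: u64,
    meta: BTreeMap<String, (u64, usize)>,  // name -> (object id, total length)
    chunks: BTreeMap<(u64, u64), Vec<u8>>, // (object id, chunk index) -> data
}

impl ObjectStore {
    fn new() -> Self {
        Self { next_id: 0, meta: BTreeMap::new(), chunks: BTreeMap::new() }
    }

    fn put(&mut self, name: &str, data: &[u8]) {
        let id = self.next_id;
        self.next_id += 1;
        self.meta.insert(name.to_string(), (id, data.len()));
        // Split the object into fixed-size chunks, each with its own key.
        for (i, chunk) in data.chunks(CHUNK_SIZE).enumerate() {
            self.chunks.insert((id, i as u64), chunk.to_vec());
        }
    }

    fn get(&self, name: &str) -> Option<Vec<u8>> {
        // One metadata lookup resolves the name to the fixed-size id ...
        let &(id, len) = self.meta.get(name)?;
        let mut out = Vec::with_capacity(len);
        // ... then a range scan over (id, *) reassembles the chunks in order.
        for (_, chunk) in self.chunks.range((id, 0)..(id + 1, 0)) {
            out.extend_from_slice(chunk);
        }
        Some(out)
    }
}

fn main() {
    let mut store = ObjectStore::new();
    let data = b"a small object split into chunks".to_vec();
    store.put("linux-5.12.13.tar.xz", &data);
    assert_eq!(store.chunks.len(), 8); // 32 bytes / 4-byte chunks
    assert_eq!(store.get("linux-5.12.13.tar.xz"), Some(data));
}
```

The design choice mirrors the text: lookups by name touch only the small metadata dataset, while the bulk data lives under compact fixed-size keys that sort contiguously per object.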
Moreover, this module is also responsible for the compression and decompression of the data, and it uses the compression module for this purpose. Furthermore, the decision as to which blocks on the storage media are used to write the data is also taken in this module, and it uses the AllocationHandler module to allocate the blocks. Last but not least, it communicates with the StoragePool module to perform the write and read I/O operations.

The StoragePool module performs two key operations. First, it maintains queues for asynchronous I/O operations. Second, it dispatches the I/O calls to the respective virtual devices in the Vdev module. It also exposes methods for synchronous calls, for which it bypasses the queues. The interface of this layer matches the Vdev module; however, it requires an additional parameter to address the desired virtual device in the Vdev module, as the module may contain more than one virtual device.

Lastly, the Vdev module provides different implementations to interact with the storage devices, and these are referred to as virtual devices in the system. Currently, Haura supports single, mirror, and parity implementations. The single version works on a single storage device, and it has two further sub-implementations, file and memory, for SSD/HDD and DRAM (as volatile storage), respectively. It is the simplest implementation provided by this module; it is not fault-tolerant, and the underlying data is lost in case of error or failure. The other implementations, mirror and parity, are introduced to support mirroring and parity functionality, respectively.

3. Implementation

In this section, we briefly discuss the implementation of the new virtual device for persistent memory and touch upon the important steps considered during this phase.

3.1. Programming Model Selection

PMDK provides several high- and low-level libraries to interact with persistent memory. Haura, on the other hand, is also a well-developed engine and expects a virtual device to implement a certain interface. Therefore, in the initial phase, a list of properties was formulated to set a criterion; each new implementation of a virtual device must adhere to this list to work properly with the existing interfaces in Haura. The properties are:

• Haura stores the nodes of the B𝜖-tree using virtual devices; therefore, the virtual device should be able to perform read and write operations in varied block sizes, from a few kilobytes to megabytes.
• The virtual device should be able to perform both synchronous and asynchronous calls to the underlying storage device.
• Haura uses bitmaps to manage the allocation of the blocks on the storage devices. It partitions the whole space into equal-size blocks and allocates and de-allocates the blocks internally; therefore, this bookkeeping is not required in the virtual device or any library used in it.
• Haura uses the copy-on-write technique to update the nodes. It first copies nodes to main memory, applies changes to the nodes, and then writes the data back to the device in a new location. It never performs in-place updates.

Considering the above properties, libpmem and libpmemblk from PMDK's library suite suit the current architecture of Haura and can be used to implement the functionality.

Libpmemblk is a high-level library that provides functionality to manage an array of fixed-size blocks. The blocks can be updated and read using their indices in the array. It does not provide byte-level access to the blocks, and any update requires re-writing the whole block. Libpmem, in contrast, is one of the low-level libraries in the kit, and the other high-level libraries are built on top of it. It wraps the basic operations exposed by the operating system and adds optimizations for persistent memory.

The other libraries from PMDK's suite, like libpmemobj and libpmemkv, are not a suitable choice because, first, they do not provide the required interface to implement the virtual device, and second, they internally perform operations that are already addressed in Haura. For example, libpmemobj internally implements object-store functionality on top of memory-mapped files, whereas the current architecture of Haura expects the virtual device to perform raw read and write operations at a specified location in memory. Moreover, the management of key-value data at the library level, as in the case of libpmemkv, is the core functionality of Haura and is therefore redundant.

The final selection from the shortlisted³ libraries was made using an experiment, the results of which are shown in Figure 2. First, it is quite evident from the graph that accessing persistent memory via a memory-aware file API is not feasible, as its latency for reading and writing the data, especially with small buffers, is significantly high. A prominent drop in its latency can be observed as the buffer size increases, but it remains higher than the rest. The approaches that compete with each other are libpmem and libpmemblk. Libpmemblk, in some instances, performed better than libpmem, but its performance is worst with small buffers, and it also crashes when the block size exceeds 8 MB. Therefore, libpmem is finally selected, as it is (comparatively) consistent throughout the experiment.

Figure 2: An object (size 5 GB) is written and read sequentially multiple times with different block sizes using libpmem, libpmemblk, and a standard file API.

³ The PMem-aware file API is also added to the list for comparison.

3.2. Rust Wrapper for libpmem

Haura is written in Rust, and while PMDK supports other languages, only its C/C++ support is fully tested. Therefore, the second step, after selecting the library from PMDK's suite, involved writing a Rust wrapper for the selected library (i.e., libpmem). In this regard, Rust provides a utility named bindgen that generates the FFI⁴ bindings to C/C++ libraries. The tool requires two files to generate the bindings. The first file is wrapper.h, which should contain all the header files and declarations that the target application intends to use. The other file is build.rs, which should contain the details regarding the generation of the bindings. This file is part of the folder structure followed in Rust; before compiling the code, the compiler looks for this file in the root folder and executes it, so that the bindings are generated (and made available) before the actual program is built.

Once the bindings were generated, the next step involved writing the methods to perform read and write operations on persistent memory and exposing them for use in the respective virtual device in the Vdev module, which is described in the following section.

⁴ Foreign function interface (FFI) is a method to invoke calls from a library written and compiled in a different language.

3.3. NVM as a Virtual Device

As discussed in Section 2.3, Haura interacts with storage devices using different virtual device implementations in the Vdev module; currently, there are four implementations available, namely file, memory, parity1, and mirror. Similarly, a new implementation of a virtual device for persistent memory is added to the Vdev module. Moreover, as mentioned in Section 3.1, the library from PMDK was chosen with careful consideration to avoid any architecture-related changes to Haura. Therefore, the new virtual device implementation exposes a similar interface as the other implementations, and it internally makes use of the wrapper library from the previous section to perform the storage-specific operations.

The integration of the new virtual device required alterations in a few traits⁵ and structs in different modules. For example, the DataManagement module interacts with the StoragePool module using a trait called StoragePoolLayer; StoragePoolUnit, a struct in the StoragePool module, implements StoragePoolLayer and maintains the information regarding the configured virtual devices in an array of type StorageTier. Furthermore, the type StorageTier is an array of Dev, and Dev is an enum that stores the instance of the associated virtual device. The enum Dev provides three different variants, Leaf, Mirror, and Parity1; Leaf further offers two variants, File and Memory. All these traits and structs are affected by the new implementation.

Furthermore, virtual devices are accessed using different traits that define distinct behaviors. The first trait is Vdev. It exposes functions to query different properties and states of the virtual devices; for example, it can be used to fetch the id and size of a device. The traits VdevWrite and VdevRead, on the other hand, as their names suggest, are used to perform write and read operations on virtual devices and provide methods to perform the operations synchronously and asynchronously. These traits are implemented for the new virtual device. Last but not least, further changes were added to make the virtual device visible to Haura through the configuration details.

⁵ A trait, in Rust, can be considered an equivalent to an interface in object-oriented languages like Java and C#.

4. Performance Analysis

In this section, we analyze the impact of the newly added virtual device on Haura. We start by testing Haura for different workloads, with the existing best-performing implementation of the virtual device as the baseline. We then study the impact with different configurations, with different thread counts and cache sizes, for instance.
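Before turning to the measurements, the trait structure from Section 3.3 can be sketched as follows. The trait names mirror the paper (Vdev, VdevRead, VdevWrite), but the signatures are simplified, synchronous-only guesses, not Haura's actual API, and a plain Vec stands in for the pmem-mapped region that the libpmem wrapper would provide.

```rust
// Illustrative sketch of the trait split described in Section 3.3:
// one trait for device properties, and separate traits for reads and writes.
trait Vdev {
    fn id(&self) -> u32;
    fn size(&self) -> u64;
}
trait VdevRead {
    fn read(&self, offset: u64, buf: &mut [u8]);
}
trait VdevWrite {
    fn write(&mut self, offset: u64, data: &[u8]);
}

// A PMem-backed leaf device would implement all three; here a Vec stands in
// for the memory-mapped persistent-memory region.
struct PMemVdev {
    id: u32,
    mem: Vec<u8>,
}

impl Vdev for PMemVdev {
    fn id(&self) -> u32 { self.id }
    fn size(&self) -> u64 { self.mem.len() as u64 }
}
impl VdevRead for PMemVdev {
    fn read(&self, offset: u64, buf: &mut [u8]) {
        let o = offset as usize;
        // With libpmem this would read directly from the mapped region.
        buf.copy_from_slice(&self.mem[o..o + buf.len()]);
    }
}
impl VdevWrite for PMemVdev {
    fn write(&mut self, offset: u64, data: &[u8]) {
        let o = offset as usize;
        self.mem[o..o + data.len()].copy_from_slice(data);
        // With libpmem, a pmem_persist-style flush of the affected
        // cache lines would follow here.
    }
}

fn main() {
    let mut dev = PMemVdev { id: 0, mem: vec![0; 4096] };
    dev.write(128, b"node bytes");
    let mut buf = [0u8; 10];
    dev.read(128, &mut buf);
    assert_eq!(&buf, b"node bytes");
    assert_eq!(dev.size(), 4096);
}
```

Splitting the behaviors across traits lets the StoragePool layer dispatch reads and writes generically over any configured device, which is why the new PMem device only had to implement these traits to slot into the existing stack.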
Experimental Setup devices using different virtual device implementations The experiments are performed on a dual-socket server in the Vdev module, and currently, there are four imple- having Intel® Xeon® Gold 5220R with 2.20 GHz base fre- mentations available namely, file, memory, parity1, and 4 5 Foreign function interface (FFI) is a method to invoke calls from a A trait, in Rust, can be considered as an equivalent to an interface library written and compiled in a different language. in object-oriented languages like Java and C# [SK22d]. PMem - Write The key reason behind the poor performance of the PMem - Read read I/O is the layout in which Haura stores the data. 1,000 SSD NVMe - Write SSD NVMe - Read Haura starts by splitting the objects into chunks and I/O (MB) SSD SATA - Write transforming them into messages. The messages are 500 SSD SATA - Read then pushed into the root node, and they descend gradu- ally to the target leaf node, and during the descent, they 0 are buffered in internal nodes and flushed down to the child node only when the buffer is full. Later, when the 0 0 0 1 1 2 2 3 3 4 5 5 6 7 7 7 4 6 :1 0:0 0:1 0:2 0:3 0:4 0:5 1:0 1:1 1:3 1:4 1:5 2:0 2:2 2:3 2:4 1:2 2:1 sync operation is performed, Haura follows the postorder -0 Runtime (min:s) [26] approach to persist the data that does not guarantee the ordering of the chunks on the storage device. On Figure 3: Three different executions representing the sequen- the other hand, when Haura fetches an object, it starts tial I/O throughput of Haura when configured with different fetching its chunks sequentially from the root node first storage devices. The data is recorded at an interval of 500ms. and keeps fetching the child nodes until it reaches the leaf node or finds the messages for the queried chunk. quency and each CPU contains 24 physical cores and each Therefore, this reading approach cannot benefit from se- core supports two threads. 
Each CPU-socket contains quential access as the tree data is already stored in the two integrated memory controllers (iMCs) with three postorder layout. Furthermore, the other main reason memory channels, each channel (except the last ones) that applies to both scenarios is the use of a single thread connected to one PMem and DRAM DIMM resulting in 4 that leaves the device underutilized. Last but not least, interleaved6 PMem and 6 DRAM DIMMs per socket. The the reason the write I/Os performed better is because PMem used is 128 GB Intel® ™ DC Persistent Memory they were asynchronous calls where the thread was ca- Series 100 DIMMs, resulting in a total persistent memory pable of issuing multiple asynchronous I/Os using the capacity of 1 TB (128 GB x 4 DIMMs x 2 sockets), and the asynchronous programming technique, whereas the read capacity of DRAM is 384 GB (32 GB x 6 DIMMs x 2 sock- I/Os were synchronous calls. ets). Moreover, the server contains two NUMA nodes Moreover, another interesting pattern that surfaced each with 48 logical cores, 4 PMem DIMMs, and 6 DRAM during the detailed analysis is, the relative performance DIMMs. However, to avoid memory access overhead, the of the storage devices was not consistent all the time. As experiments are run on socket 0. Lastly, the machine shown in Figure 4, the difference between PMem and runs Ubuntu 20.04.3 LTS (5.4.0-126-generic), and PMem SSD NVMe is significant for small block sizes, however, is accessed in the app direct Mode using fsdax 7 . the difference shrinks considerably for large blocks. 4.2. Sequential Workload 4.3. Random I/O and Worker Threads In this experiment, Haura is configured for three different This experiment evaluates the impact of the number of storage devices; PMem, SSD NVMe, and SSD SATA. The threads and cache size on Haura when configured with experiment writes 5 objects, each size 5 GB, and then different storage devices. 
The workload then reads them sequentially in the same order; however, the write requests are asynchronous, which allows multiple write requests to be dispatched at once, whereas the read requests are synchronous.

The results in Figure 3 show that PMem performed better than the rest for the write I/O, while it lagged behind SSD NVMe for the read I/O. However, the resulting throughput for PMem is in both cases quite far from the expected values, because, as per the specifications, a single PMem DIMM (with four cache lines) can write and read up to 1,800 and 6,800 MB/s⁸, respectively⁹.

Figure 4: The plots group the calls (write and read, respectively) from Figure 3 with respect to their payloads (write and read times over the block size, with per-device averages for PMem, SSD NVMe, and SSD SATA).

The script writes an archive file¹⁰ (size 1011 MB) as an object to the engine. The file contains 80,690 entries, with metadata stored in the first 9.3 MiB, which contains the central directory used to locate the individual files. Moreover, the scenarios with circles (Figure 5) store the metadata on the first device (i.e., SSD SATA) and the remaining content on the second mentioned device. Lastly, the script fetches 50,000 files randomly¹¹.

The results are presented in Figure 5, and it is evident from all sub-plots that the execution time improves with the increase in the worker threads and the cache size. The scenarios that performed worst are the ones that used SSD SATA to store the contents and a faster device to store the metadata of the file. However, a minute difference can be observed: with PMem, the execution time is slightly worse for fewer than 6 threads; nevertheless, the difference diminishes, and with the increase in threads (e.g., 30 threads), PMem performed better than SSD NVMe. On the other hand, a significant difference in performance can be seen when only PMem or SSD NVMe is used to store the whole file. As can be seen, SSD NVMe performs marginally better when fewer threads are used; however, as the thread count passes 9 threads, PMem surpasses SSD NVMe with a significant difference at the end.

Figure 5: The impact of threads and different cache sizes on Haura when used with different storage configurations (cache sizes of 16, 32, and 64 MB; configurations SSD NVMe/SSD SATA, PMem/SSD SATA, SSD NVMe, and PMem).

4.4. Node-Size Significance

An intriguing behavior we came across while discussing the sequential workload is that the size of the payload influences the performance of Haura. Therefore, to further investigate this behavior, this experiment evaluates Haura (for PMem and SSD NVMe) with different internal and leaf node sizes, which thus far were set to 4 MB for both node types. The experiment first sets the size of the internal nodes, then repeats the experiment with different leaf node sizes, and writes and reads an object with a size of 128 MB sequentially.

The results are illustrated using heatmaps in Figure 6. First, it is evident that both storage devices share almost the same temperatures. Second, the engine performs worst for small block sizes, especially for internal nodes. However, the performance improves with the increase in the size, and the concentration of blue color indicates that the engine performs better under large block sizes.

One reason for the high temperatures is the height of the tree, which grows deep when the nodes, internal nodes in particular, are small. For instance, when the node size is 512 bytes, an object of size 128 MB would result in 261,376 nodes¹², and the internal nodes contain a limited number of messages and pivots, whereas, when the node size is 4 MB, the tree would only need 32 nodes to accommodate the object. Therefore, when the tree is deep, Haura spends considerable time flushing and merging the nodes. Moreover, another obvious reason is that, when the nodes are small, multiple requests are dispatched to the storage devices. However, further analysis is required to capture the time taken only by the virtual devices.

Figure 6: Multiple end-to-end executions to analyze the impact of internal and leaf node sizes on the throughput of Haura (heatmaps of write and read times over internal and leaf node sizes from 2¹ to 2¹³ KB).

⁶ In DIMM interleaving, the data is interleaved across the DIMMs as per the configured block size (i.e., 4 KB in the current settings).
⁷ https://docs.pmem.io/ndctl-user-guide/managing-namespaces
⁸ https://www.intel.de/content/www/de/de/products/docs/memory-storage/optane-persistent-memory/optane-dc-persistent-memory-brief.html
⁹ SSD NVMe can perform sequential read and write operations of up to 3,200 and 2,000 MB/s and random read and write operations of up to 540,000 and 55,000 IOPS. SSD SATA can perform sequential read and write operations of up to 550 and 510 MB/s and random read and write operations of up to 86,000 and 30,000 IOPS, respectively.
¹⁰ https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.12.13.tar.xz
¹¹ https://docs.rs/xoshiro/latest/xoshiro/struct.Xoshiro256Plus.html
¹² The actual count of the nodes for an object of size 128 MB is higher than 261,376 because each object chunk is assigned a key as well.
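The effect of the node size on the tree size can be checked with a back-of-the-envelope calculation. The sketch below is a simplification, not Haura's exact accounting: it assumes the 128 MB object is 128 MiB, counts only a lower bound on the leaf nodes (internal nodes and per-chunk keys come on top, cf. footnote 12), and `min_leaf_count` is a hypothetical helper, not an engine API.

```python
import math

MiB = 1024 * 1024

def min_leaf_count(object_size: int, node_size: int) -> int:
    """Lower bound on the number of leaf nodes needed when each leaf
    stores at most `node_size` bytes of the object's payload."""
    return math.ceil(object_size / node_size)

# 128 MiB object with 4 MiB nodes: only a few dozen leaves are needed.
print(min_leaf_count(128 * MiB, 4 * MiB))  # 32

# The same object with 512-byte nodes needs hundreds of thousands of
# leaves before internal nodes are even counted.
print(min_leaf_count(128 * MiB, 512))      # 262144
```

With 4 MB nodes, the bound of 32 leaves matches the roughly 32 nodes reported above; with 512-byte nodes, the 262,144 leaves are already in the region of the reported 261,376 nodes, and the additional deep hierarchy of small internal nodes, each holding few messages and pivots, explains the time spent flushing and merging.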
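The DIMM interleaving mentioned in footnote 6 can be illustrated with a small sketch. This is only a model of the striping scheme: the real mapping is performed by the memory controller and the configured interleave set, not in software, and `dimm_for_offset` is a hypothetical helper with the 4-DIMM, 4 KB configuration from this setup as defaults.

```python
def dimm_for_offset(offset: int, num_dimms: int = 4, block: int = 4096) -> int:
    """Index of the DIMM that backs a byte offset when data is striped
    across `num_dimms` DIMMs in `block`-sized units."""
    return (offset // block) % num_dimms

# Consecutive 4 KB blocks rotate over the DIMMs, so a sequential access
# stream is spread across the whole interleave set.
print([dimm_for_offset(i * 4096) for i in range(6)])  # [0, 1, 2, 3, 0, 1]
```

This also shows why sequential accesses benefit most: they engage all DIMMs in turn, whereas accesses smaller than the interleave block repeatedly hit a single DIMM.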
5. Related Work

In existing engines, NVM is mostly utilized to improve the caching and recovery of the engines. In [6, 7], the logging component uses NVM at different levels, which improves the logging and recovery of the engine. Moreover, [27] discusses three different logging techniques and implements their equivalent NVM designs. The results show that in-place update is the most appropriate technique for NVM. Furthermore, SOFORT [28] and FOEDUS [29] are examples of main memory database systems that utilize NVM to improve the recovery of the system.

In some literature, NVM is also utilized as a buffer, and there are two main designs in this approach. The first is to use NVM to supplement DRAM, as already mentioned in [28, 29, 30], and the second is to use NVM as another layer between DRAM and secondary storage; in this regard, a technique called three-tier buffer management is suggested in [31].

Furthermore, data structures are also optimized to exploit the full potential of NVM. FPTree [13] is an NVM-aware B+-tree that stores leaf nodes in NVM and internal nodes in DRAM, and it performs better than other NVM-optimized trees, for instance, NV-Tree [12] and wBTree [32]. Moreover, FOEDUS [29] also uses a customized tree called Master-Tree.

Our work enables Haura to persist the tree nodes on NVM, which, along with an allocation strategy, can be used to improve recovery and caching. However, the migration of the persisted nodes is presently not possible. Haura can also store the internal nodes on NVM, as done in FPTree [13]. Lastly, depending on the size of the data, the engine can be used as an NVM-DRAM engine.

6. Conclusion

In this work, Haura, a general-purpose and write-optimized storage engine, is used to study the characteristics of the modern storage landscape, which has become more heterogeneous with the advent of PMem, a promising technology that shares the characteristics of primary and secondary storage and has disrupted the traditional memory paradigm. A few important findings our work uncovered are the following: first, persistent memory performs optimally when accessed using the largest possible blocks in random workloads. Second, the size of the cache and the thread count impact Haura's throughput. Last but not least, the size of the nodes also determines the throughput of the engine, with the internal node size having more influence. The insights gathered in this paper can be used to significantly improve Haura's performance and further exploit the characteristics of PMem. However, two aspects that still need to be investigated are the use of in-place updates for PMem and accessing it using devdax¹³, which produces better results than DAX in certain cases [33].

References

[1] U. Kazemi, A survey of big data: challenges and specifications, CiiT IJSETA (2018).
[2] F. Faerber, et al., Main memory database systems, Foundations and Trends® in Databases (2017).
[3] P.-Å. Larson, J. Levandoski, Modern main-memory database systems, VLDB Endowment (2016).
[4] J. DeBrabant, et al., Anti-caching: A new approach to database management system architecture, VLDB Endowment (2013).
[5] I. Oukid, et al., Storage class memory and databases: Opportunities and challenges, it - Information Technology (2017).
[6] J. N. Gray, Notes on data base operating systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 1978, pp. 393–481.
[7] T. Wang, et al., Scalable logging through emerging non-volatile memory, Proceedings of the VLDB Endowment 7 (2014).
[8] D. Lomet, R. Anderson, et al., How the Rdb/VMS data sharing system became fast, DEC Cambridge Research Lab Technical Report CRL 92 (1992).
[9] J. Huang, K. Schwan, M. K. Qureshi, NVRAM-aware logging in transaction systems, VLDB Endowment (2014).
[10] P. Götze, et al., Data management on non-volatile memory: a perspective, Datenbank-Spektrum (2018).
[11] A. Eisenman, et al., Reducing DRAM footprint with NVM in Facebook, EuroSys, 2018.
[12] J. Yang, et al., NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems, USENIX FAST, 2015.
[13] I. Oukid, et al., FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for SCM, SIGMOD, 2016.
[14] F. Wiedemann, Modern Storage Stack with Key-Value Store Interface and Snapshots Based on CoW Bε-Trees, 2018.
[15] T. Höppner, Design and Implementation of an Object Store with Tiered Storage, 2021.
[16] A. Faraclas, et al., Modeling of set and reset operations of phase-change memory cells, IEEE Electron Device Letters (2011).
[17] A. K. Mishra, et al., Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs, ISCA, IEEE, 2011.
[18] B. Gervasi, Will carbon nanotube memory replace DRAM?, IEEE Micro (2019).
[19] D. B. Strukov, G. S. Snider, D. R. Stewart, R. S. Williams, The missing memristor found, Nature (2008).
[20] S. Scargall, Programming persistent memory: A comprehensive guide for developers, Springer Nature, 2020.
[21] S. Sippu, et al., Transaction processing: Management of the logical database and its underlying structure, Springer, 2015.
[22] A. Baldassin, et al., Persistent Memory: A Survey of Programming Support and Implementations, ACM CSUR (2021).
[23] A. Rudoff, Persistent memory programming, ;login: The Usenix Magazine (2017).
[24] A. Rudoff, Programming models for emerging non-volatile memory technologies, USENIX & SAGE (2013).
[25] W. Fuzong, G. Helin, Z. Jian, Dynamic data compression algorithm selection for big data processing on local file system, in: Proceedings of the International Conference on Computer Science and Artificial Intelligence, 2018, pp. 110–114.
[26] G. Valiente, Algorithms on trees and graphs, volume 112, Springer, 2002.
[27] J. Arulraj, et al., Let's talk about storage & recovery methods for non-volatile memory database systems, SIGMOD, 2015.
[28] I. Oukid, et al., SOFORT: A hybrid SCM-DRAM storage engine for fast data recovery, DaMoN, 2014.
[29] H. Kimura, FOEDUS: OLTP engine for a thousand cores and NVRAM, SIGMOD, 2015.
[30] L. Lersch, et al., An analysis of LSM caching in NVRAM, DaMoN, 2017.
[31] A. van Renen, et al., Managing non-volatile memory in database systems, SIGMOD, 2018.
[32] S. Chen, et al., Rethinking database algorithms for phase change memory, CIDR, 2011.
[33] B. Daase, et al., Maximizing persistent memory bandwidth utilization for OLAP workloads, SIGMOD, 2021.

¹³ https://docs.pmem.io/ndctl-user-guide/managing-namespaces