-

strategies like journaling and copy

1613-0073

Memory in Modern Heterogeneous Storage Landscape using a Write-optimized Storage Stack

Sajad Karim

sajad.karim@ovgu.de 0 2

Johannes Wünsche

johannes.wuensche@ovgu.de 1 2

David Broneske

david.broneske@ovgu.de 0 2

Michael Kuhn

michael.kuhn@ovgu.de 1 2

Gunter Saake

gunter.saake@ovgu.de 0 2

Non-Volatile Memory, Persistent Memory, Storage Class Memory, Non-Volatile Random Access Memory, Persistent Memory

0 Databases and Software Engineering, Otto von Guericke University Magdeburg 1 Parallel Computing and I/O, Otto von Guericke University Magdeburg 2 Programming, Modern Heterogeneous Storage Landscape, Write-Optimized Storage Engine , Haura

Non-volatile memory (NVM), or persistent memory, is a promising and emerging storage technology that has not only disrupted the typical long-established memory hierarchy but also invalidated the proclaimed programming paradigm used in traditional database management systems and file systems. It bridges the gap between primary and secondary storage and, hence, shares the characteristics of both categories. However, currently, there exists no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous after NVM. Therefore, a general-purpose storage engine, Haura, is utilized to study the benefits of the modern storage landscape. In this work, NVM is integrated into the storage stack of Haura and studied the patterns for modern storage devices involved and their impact on the performance of Haura. Our work shows, NVM performs best under sequential workloads, but random access is better with larger block sizes. Furthermore, the block size has a significant impact on the performance of storage devices, with smaller block sizes favoring NVM and larger block sizes favoring NVMe-supported devices.

1. Introduction According to Kazemie [1], the volume of data in 2011 was around 1.8 Zettabytes, and its volume doubles approximately every two years. At least it was before the onset of COVID-19 whose outbreak further fueled its growth when the use of digital services rose exponentially. This data deluge is unprecedented and has created new challenges for database management systems and file systems which are used in a wide range of applications for data analysis and management.

The traditional database management systems and file systems are developed considering the typical storage hierarchy where memory is fast but volatile and limited, and secondary storage is persistent and vast but has high latency. In such systems, the data is logically split into two sets of copies; working and consistent copies. The working copy resides in main memory, whereas the persistent copy resides on one or more secondary storage devices. Also, making data persistent is an error-prone process as problems like crashes and race conditions can nEvelop-O (G. Saake)

In prior work, NVM is used as a bufer between DRAM and secondary storage devices [10, 5, 11]. Moreover, several index data structures like NVTree [12] and FPTree Nevertheless, presently, there is no common storage en

of the modern storage landscape, which has become more heterogeneous after the addition of NVM, and in consequence, there is a research gap in this direction.

In order to investigate the benefits of a common storage engine that manages all the storage devices in the modern storage landscape, a prototype of a general

CEUR ceur-ws.org

It runs in user space and supports object and key-value

interfaces. The key contributions of our work are as follows: • We used PMDK (Persistent Memory Development Kit) to supplement Haura with persistent memory. PMDK is an open-source (C/C++) kit that ofers diferent libraries/utilities to interact with persistent memory. We used the most appropriate library after a brief evaluation process. • We investigated the impact of the abovementioned change on Haura and used persistent memory to store the B -tree nodes (Section 2.3). • We identified the access patterns for diferent storage devices that supplement or afect the throughput of Haura.

They are available in DIMM form factor and compatible

with conventional DDR4 sockets. They co-exist with conventional DDR4 DRAM DIMMs and use the same memory channel. The internal granularity of the modules is 256 bytes and they can be operated in three diferent modes; memory, app direct, and dual modes.

In the memory mode, NVRAM supplements DRAM where DRAM acts as an L4 cache and NVRAM as the main (volatile) memory. The host memory controller integrated into the processor manages the movement of data between DRAM and NVRAM. On the other hand, in the app direct mode, NVRAM is a persistent memory module where the applications have direct access to the device, it is still byte-addressable, and applications can use it as a storage device. Lastly, in the dual mode, part of the NVRAM can be allocated to applications, and the rest can be utilized as non-volatile memory. 2.2. Non-Volatile Memory Programming

The remainder of this paper is structured as follows. Section 2 provides background on non-volatile memory

and the related programming techniques, and also briefly describes Haura. The implementation phase is discussed in Section 3 which is then followed by the evaluation in

Section 4. Next, Section 5 details the related work and the

paper concludes with a summary and open challenges in

Section 6.

The typical programming models categorize the data structures into two broad categories; memory resident and storage resident data structures [20]. It is mainly due to the underlying system architecture where main memory is attached directly to the memory bus and secondary storage, due to its high latency, communicates 2. Background with the system via an I/O controller. The models operate on the data in main memory at byte granularity and enIn this section, we discuss non-volatile memory and dif- sure its persistency by explicitly writing it to secondary ferent programming models to access it. We then describe storage. A key challenge of such models is to ensure Haura and briefly touch upon its key components. the consistency and integrity of data across all storage classes that share diferent characteristics. For example, 2.1. Non-Volatile Memory the integrity of data in main memory could be ensured using mutexes whereas the consistency and durability of Persistent Memory (PMem), Storage Class Memory data on secondary storage are ensured using strategies (SCM), Non-Volatile Memory (NVM), and Non-Volatile like journaling and write-ahead logging [21]. RAM (NVRAM) are the names often used to address The above-mentioned programming paradigm cannot this new class of storage. It sits between primary and be followed when working with persistent memory as secondary storage in the typical storage hierarchy, and it is attached to the memory bus and is non-volatile. it is also considered a disruptive technology as it has Therefore, a new model is inevitable that simultaneously disrupted the traditional memory paradigm. It is non- addresses all the atomicity and consistency issues, like volatile, has DRAM-like latency, and ofers much higher concurrency and power failure, for instance [22, 20, 23]. capacity than DRAM. It is byte-addressable, and its prop- Moreover, persistent memory is a comparatively new erty to be directly accessible using the cache lines by the technology, and writing software with all the considerCPU demands a diferent architecture than the one used ations requires an in-depth knowledge of the hardware in typical storage engines. Some example technologies and cache. Therefore, several APIs are available that hanare Phase Change Memory [16], Spin Transfer Torque dle the hardware-related intricacies internally. PMDK1 RAM (STT-RAM) [17], Carbon NanoTube RAM (NRAM, (Persistent Memory Development Kit) is one such examNanoRAM) [18], and Memristors [19]. ple which is based on SNIA NVM programming model2.

Presently, only Intel® produces persistent memory modules under the brand named Intel® Optane™ DC

Persistent Memory. It ofers diferent generations that

vary in performance and capacity, and the modules are designed to be used with specific generations of Intel ®’s

1https://pmem.io/pmdk/

processors, Intel® Xeon Scalable Processors, for instance. 2https://snia.org/tech_activities/standards/curr_standards/npm Interface ... Superblock ...

Cursor Snapshot

Dataset ...

Node

Messages Interface Leaf Interface Interface Interface

SNIA (Storage and Networking Industry Association) proposes diferent programming models [ 22, 24, 20, 23] to program persistent memory. The simplest way is to use the module as a block device and access it using a standard file API. Another approach is via an optimized ifle system, ext and XFS in Linux and NTFS in Windows, that is adapted specifically for persistent memory. This approach, contrary to the previous one, allows small readand-write operations and is more eficient. Last but not least, DAX or Direct Access is another approach in that the persistent memory is accessed as a memory-mapped ifle. Nevertheless, contrary to memory mapping files on secondary storage, the operating system does not maintain pages for persistent memory in main memory. 2.3. Haura Haura is a general-purpose and write-optimized tiered storage stack that runs in user space and supports object and key-value interfaces [14]. It has the ability to handle multiple datasets, provides data integrity, and supports advanced features like snapshots, data placement, and fail-over strategies. The core of the engine is the B tree, which is the sole index data structure in the engine.

The engine supports block storage devices, solid-state drives, for instance, and also has its own caching layer using DRAM (separate from the operating system). It also ofers features like data striping, mirroring, and parity. It follows ZFS [25] architecture and uses a similar layered approach [14] where layers or modules interact with each other using interfaces. A schematic diagram of Haura is presented in Figure 1.

Haura was initially built as a key-value storage engine

where arbitrary-sized keys and values can be stored and an extension to support objects was made later in [15] in that the ObjectStore module was added to the stack. The ObjectStore module exposes the necessary routines to interact with the engine and supports all the primitive operations like create, read, write, and query. It uses the same key-value interface to store the objects. However, a key challenge it addresses is the transformation of objects into key-value pairs. Since, in the key-value version of Haura, although the keys and the values can have a variable size, there was still an upper limit defined on their sizes. Objects, on the other hand, can contain data of several gigabytes; therefore, a mechanism is devised in that objects are split into chunks, and each chunk is assigned a unique identifier. Moreover, the object name is primarily the key of the object, it can have a variable size and can spread over a few kilobytes, therefore, an indirection is added where the object name and other metadata are stored separately, and a unique fixed-size identifier is assigned to each chunk. Furthermore, the

ObjectStore module, via the Database module, maintains two diferent datasets to store the data and the metadata

...

Cache ... Configuration Compression Allocator Checksum

ObjectStore Database Tree DataManagement StoragePool Vdev

File API ...

File

Parity

Mirror Disk 1 Disk 2 Disk 3 Disk 4 Disk N of the objects. The first dataset stores the chunks of the object, whereas the second dataset stores the indirectionrelated information and other metadata, like modification time and size, to name a few.

The Database module controls and manages all the

activities regarding a database. A database in Haura consists of one or more datasets and their respective snapshots. Datasets and snapshots are actually B -trees.

Moreover, it also maintains a separate B -tree, named

the root tree, to store all the information regarding the database. For example, it maintains active datasets and their pointers in the storage. It also maintains information regarding the usage of storage devices in the form of bitmaps.

The Tree module contains the actual implementation

of the B -tree and encapsulates all the tree-related operations and exposes the methods to the upper layer.

The DataManagement module ensures the persistence

of the underlying data and its retrieval when requested.

However, it internally interacts with a wide range of

modules, especially with the helper modules, and plays a vital role in achieving their internal functionalities. The cache module, for instance, is managed by this module.

The write and update requests from the upper layer first land into this module and then passed on to the cache module. Similarly, in the case of a cache miss, this module fetches the required data using the StoragePool module and are passes the data to the cache module for later usage. Moreover, this module is also responsible for the • The virtual device should be able to perform both compression and decompression of the data, and it uses synchronous and asynchronous calls to the unthe compression module for this purpose. Furthermore, derlying storage device. the decision as to which blocks on storage media are to be • Haura uses bitmaps to manage the allocation of used to write the data is also taken in this module, and it the blocks on the storage devices. It partitions the uses the AllocationHandler module to allocate the blocks. whole space into equal-size blocks and allocates Last but not least, it communicates with the StoragePool and de-allocates the blocks internally, therefore, module to perform the write and read I/O operations. this bookkeeping is not required in the virtual

The StoragePool module performs two key operations. device or any library used in it.

First, it maintains queues for asynchronous I/O opera- • Haura uses the copy-on-write technique to uptions. Second, it dispatches the I/O calls to the respective date the nodes. It first copies nodes to the main virtual devices in the Vdev module. However, it also memory, applies changes to the nodes, and then exposes the methods for synchronous calls where it by- writes the data back to the device in a new locapasses the queues. The interface of this layer matches tion. It never performs in-place updates. with the Vdev module, however, it requires an additional parameter to communicate with the desired virtual de- Now considering the above properties, libpmem and libpvice in the Vdev module as the module may contain more memblk from PMDK’s persistent library suite suits the than one virtual device. current architecture of Haura and can be used to imple

Lastly, the Vdev module provides diferent implemen- ment the functionality.

tations to interact with the storage devices, and they are Libpmemblk is a high-level library that provides funcreferred to as virtual devices in the system. Currently, tionality to manage an array of fixed-size blocks. The Haura supports single, mirror, and parity implementa- blocks can be updated and read using their indices in the tions. The single version of the implementations works array. It does not provide byte-level access to the blocks, on a single storage device, and it has two further sub- and any update requires the re-writing of the whole block. implementations, file and memory, for SSD/HDD and Whereas, libpmem is one of the low-level libraries in the DRAM (as volatile storage) respectively. It is the simplest kit, and the other high-level libraries are built on top of implementation provided by this module, it is not fault- it. It wraps the basic operations exposed by an operating tolerant, and the underlying data is lost in case of error system and adds optimizations for persistent memory. or failure. The other implementations, as their names The other libraries from the PMDK’s persistent suite, suggest, mirror and parity, are introduced to support like libpmemobj and libpmemkv, are not a suitable choice mirroring and parity functionalities respectively. because; first, they do not provide the required interface to implement the virtual device, and second, they internally perform the operations that are already addressed in 3. Implementation Haura. For example, libpmemobj internally implements In this section, we briefly discuss the implementation of the object store functionality on top of memorymapped the new virtual device for persistent memory and touch ifles, whereas the current architecture of Haura expects upon the important steps considered during this phase. the virtual device to perform raw read and write operations on a specified location in the memory. Moreover, the management of key-value data at the library level, 3.1. Programming Model Selection as in the case of libpmemkv, is the core functionality of PMDK provides several high and low-level libraries to Haura and is therefore redundant. interact with persistent memory. Haura, on the other The final selection from the shortlisted 3 libraries is hand, is also a well-developed engine and expects a vir- made using an experiment, which results are in Figure 2. tual device to implement a certain interface. Therefore, First, it is quite evident from the graph that the approach in the initial phase, a list of properties is formulated to for accessing persistent memory via a memory-aware set a criterion, and each new implementation of a virtual ifle API is not a feasible approach as its latency to read device must adhere to the list to work properly with the and write the data, especially using small bufers, is sigexisting interfaces in Haura. The properties are: nificantly high. Nevertheless, a prominent drop can be observed in its latency with the increase in bufer size • Haura stores the nodes of the B -tree using virtual but it is still higher than the rest. The approaches that devices, therefore, the virtual device should be compete with each other are libpmem and libpmemblk. able to perform read and write operations in var- Libpmemblk, in some instances, performed better than ied block sizes, from a few kilobytes to megabytes. libpmem, but its performance is worst with small bufers, 3PMem-aware file API is also added to the list for comparison.

libpmem mirror. Similarly, a new implementation of a virtual de()se itre102 lFiiblepmAePmIblk vMicoerefoovrepre,rassisfuternthtemr emmeonrtiyoniseaddidneSdectotiothne3V.1d, ethvemliobdraurley. iTm W from PMDK is chosen with careful consideration to avoid 100 any architecture-related changes to Haura. Therefore, this new virtual device implementation exposes a similar ) 101 interface as available in the other implementations and (s d it internally makes use of the wrapper library mentioned ieTm eaR in the previous section to perform the storage-specific operations. 10064B128B256B512B1KB2KB4KB8KB16KB32KB64KB128KB256KB512KB1MB2MB4MB8MB16MB32MB altTerhaetioinntseginraatifoenwotfratihtse5 naenwd svtrirutcutaslindedviifecreenrteqmuoirde-d

Bufer Size ules. For example, the DataManagement module interacts with the StoragePool module using a trait called

Figure 2: An object (size 5 GB) is written and read sequen- StoragePoolLayer, and StoragePoolUnit, which tially multiple times with diferent block sizes using libpmem, is a struct in the StoragePool module, implements libpmemblk, and a standard File API. StoragePoolLayer and maintains the information regarding the configured virtual devices in an array of type and also it crashes when the block size exceeds 8 MB. StorageTier. Furthermore, the type StorageTier is Therefore, libpmem is finally selected as it is (compara- an array of Dev, and Dev is an enum that stores the intively) consistent throughout the experiment. stance of the associated virtual device. The enum Dev provides three diferent types of features, Leaf, Mirror, 3.2. Rust Wrapper for libpmem and Parity1. Leaf further ofers two diferent features,

File and Memory. All these mentioned traits and structs Haura is written in Rust, and PMDK supports other lan- are afected by the new implementation.

guages in that support for only C/C++ is fully tested, Furthermore, virtual devices are accessed using difertherefore, the second step, after selecting the library from ent traits that define distinct behaviors. The first trait PMDK’s suite, involved writing a wrapper for the selected is Vdev. It exposes functions to query diferent proplibrary (i.e. libpmem) in Rust. In this regard, Rust pro- erties and states of the virtual devices. For example, it vides a utility named bindgen that generates the FFI4 can be used to fetch the id and size of the device. On bindings to C/C++ libraries. The tool requires two files to the other hand, the traits VdevWrite and VdevRead, as generate the bindings. The first file is wrapper.h which their name suggests, are used to perform read and write should contain all the header files and declarations that operations on virtual devices, and provide methods to perthe target application intends to use. The other file is form the operations synchronously and asynchronously. build.rs which should contain the details regarding the These traits are implemented for the new virtual device. generation of the bindings. This file is part of the folder Last but not least, other changes have been added to make structure followed in Rust and its compiler, before the the virtual device visible to Haura through configuration compilation of the code, looks for this file in the root details. folder and executes it so that the bindings can be generated (and made available) before the execution of the actual program. 4. Performance Analysis

Once the bindings are generated, the next step involved

writing the methods to perform read and write operations on persistent memory and exposing them to be used in its respective virtual device in the Vdev module which is mentioned in the following section.

In this section, we analyze the impact of the newly added virtual device on Haura. We start by testing Haura for diferent workloads, and the baseline is the existing bestperforming implementation of the virtual device. We then study the impact with diferent configurations, with diferent thread counts and cache sizes, for instance. 3.3. NVM as a Virtual Device

As discussed in Section 2.3, Haura interacts with storage devices using diferent virtual device implementations in the Vdev module, and currently, there are four implementations available namely, file, memory, parity1, and

4.1. Experimental Setup

The experiments are performed on a dual-socket server

having Intel® Xeon® Gold 5220R with 2.20 GHz base fre

4Foreign function interface (FFI) is a method to invoke calls from a

library written and compiled in a diferent language.

5A trait, in Rust, can be considered as an equivalent to an interface

in object-oriented languages like Java and C# [SK22d].

PMem - Write The key reason behind the poor performance of the PMem - Read read I/O is the layout in which Haura stores the data.

)B 1,000 SSSSSSDDD SNNAVVTMMAee- --WRWreirtaietde tHraanusrfaorsmtairntsg btyhesmpliitntitnogmthesesaogbejesc.tsThinetomcehsusangkess aanrde I(/OM 500 SSD SATA - Read tahlleyntoputshheetdaringetot ltehaef rnooodten, oadned, daunrdinthgetyhededsecsecnedntg,rtahdeuy0 are bufered in internal nodes and flushed down to the -0:10 0:00 0:10 0:21 0:31 0:42 0:52 1:03 1:13 1:24 1:34 1:45 1:55 2:06 2:16 2:27 2:37 2:47 child node only when the bufer is full. Later, when the sync operation is performed, Haura follows the postorder

Runtime (min:s) [26] approach to persist the data that does not guarantee

the ordering of the chunks on the storage device. On Figure 3: Three diferent executions representing the sequen- the other hand, when Haura fetches an object, it starts tial I/O throughput of Haura when configured with diferent fetching its chunks sequentially from the root node first storage devices. The data is recorded at an interval of 500ms. and keeps fetching the child nodes until it reaches the leaf node or finds the messages for the queried chunk. quency and each CPU contains 24 physical cores and each Therefore, this reading approach cannot benefit from secore supports two threads. Each CPU-socket contains quential access as the tree data is already stored in the two integrated memory controllers (iMCs) with three postorder layout. Furthermore, the other main reason memory channels, each channel (except the last ones) that applies to both scenarios is the use of a single thread connected to one PMem and DRAM DIMM resulting in 4 that leaves the device underutilized. Last but not least, interleaved6 PMem and 6 DRAM DIMMs per socket. The the reason the write I/Os performed better is because PMem used is 128 GB Intel® ™ DC Persistent Memory they were asynchronous calls where the thread was caSeries 100 DIMMs, resulting in a total persistent memory pable of issuing multiple asynchronous I/Os using the capacity of 1 TB (128 GB x 4 DIMMs x 2 sockets), and the asynchronous programming technique, whereas the read capacity of DRAM is 384 GB (32 GB x 6 DIMMs x 2 sock- I/Os were synchronous calls. ets). Moreover, the server contains two NUMA nodes Moreover, another interesting pattern that surfaced each with 48 logical cores, 4 PMem DIMMs, and 6 DRAM during the detailed analysis is, the relative performance DIMMs. However, to avoid memory access overhead, the of the storage devices was not consistent all the time. As experiments are run on socket 0. Lastly, the machine shown in Figure 4, the diference between PMem and runs Ubuntu 20.04.3 LTS (5.4.0-126-generic), and PMem SSD NVMe is significant for small block sizes, however, is accessed in the app direct Mode using fsdax 7. the diference shrinks considerably for large blocks. 4.2. Sequential Workload In this experiment, Haura is configured for three diferent storage devices; PMem, SSD NVMe, and SSD SATA. The experiment writes 5 objects, each size 5 GB, and then reads them sequentially in the same order, however, the write requests are asynchronous, which allows multiple write requests to be dispatched, whereas the read requests are synchronous.

The results in Figure 3 show that PMem performed

better than the rest for the write I/O, and it lagged behind

SSD NVMe for the read I/O. But, the resulted throughput

for PMem in both cases is quite of from the expected values because as per the specifications, a single PMem DIMM (with four cache-lines) can write and read up to 1,800 and 6,800 MB/s8, respectively9. 6In DIMM interleaving, the data is interleaved as per the configured block size (i.e., 4 KB in the current settings.) across the DIMMs.

7https://docs.pmem.io/ndctl-user-guide/managing-namespaces 8https://www.intel.de/content/www/de/de/products/

docs/memory-storage/optane-persistent-memory/ optane-dc-persistent-memory-brief.html

9SSD NVMe can perform sequential read and write operations of up

4.3. Random I/O and Worker Threads This experiment evaluates the impact of the number of threads and cache size on Haura when configured with diferent storage devices. It writes an archive file 10 (size 1011 MB) as an object to the engine. The file contains 80,690 entries with metadata stored in the first 9.3 MiB that contain the central directory to locate the individual ifles. Moreover, the scenarios with circles (Figure 5) store the metadata on the first device (i.e. SSD SATA) and the remaining content on the second mentioned device.

Lastly, the script fetches 50,000 files randomly 11. The results are presented in Figure 5, and it is evident

from all sub-plots that the execution time improves with the increase in the worker threads and cache size.

The scenarios that performed worst are the ones that used SSD SATA to store the contents and a faster deto 3,200 and 2,000 MB/s and random read and write operations of up to 540,000 and 55,000 IOPS. SSD SATA can perform sequential read and write operations of up to 550 and 510 MB/s and random read and write operations of up to 86,000 and 30,000 IOPS, respectively. 10https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.12.13.tar.xz 11https://docs.rs/xoshiro/latest/xoshiro/struct.Xoshiro256Plus.html d 102 a e R ) sm 100 ( e m iT 10− 2 t 102 e i r W )s 100 m ( e im10− 2 T 4

SSD SATA SSD SATA (Avg) SSD NVMe SSD NVMe (Avg)

PMem PMem (Avg) 840 1,21,3 0828 4.4. Node-Size Significance

An intriguing behavior we came across while discussing

the sequential workload is that the size of the payload influences the performance of Haura. Therefore, to further investigate the behavior, this experiment evaluates

Haura (for PMem and SSD NVMe) with diferent internal

and leaf node sizes, that is, thus far set to 4 MB for both node types. The experiment first sets the size of the interIn existing engines, NVM is mostly utilized to improve the caching and recovery of the engines. In [6, 7], the logging component uses NVM at diferent levels that improve the logging and recovery of the engine. Moreover, [27] discusses three diferent logging techniques and implements their equivalent NVM designs. The results show that in-place update is the most appropriate technique for NVM. Furthermore, SOFORT [28] and FOE12The actual count of the nodes for an object size 128 MB is higher than 261376 because each object chunk is assigned a key as well.

DUS [29] are examples of main memory database systems [3] P.-Å. Larson, J. Levandoski, Modern main-memory database that utilize NVM to improve the recovery of the system. systems, VLDB Endowment (2016).

In some literature, NVM is also utilized as a bufer and [4] J. DeBrabant, et al., Anti-caching: A new approach to database management system architecture, VLDB Endowment (2013). there are two main designs in this approach. The first [5] I. Oukid, et al., Storage class memory and databases: Opportuis to use NVM to supplement DRAM which is already nities and challenges, it-Information Technology (2017). mentioned in [28, 29, 30], and the second is to use NVM [6] J. N. Gray, Notes on data base operating systems, Springer as another layer between DRAM and secondary storage Berlin Heidelberg, Berlin, Heidelberg, 1978, pp. 393–481. and in this regard, a technique called three-tier bufer [7] T. Wang, et al., Scalable logging through emerging non-volatile memory, Proceedings of the VLDB Endowment 7 (2014). management is suggested in [31]. [8] D. Lomet, R. Anderson, et al., How the Rdb/VMS data sharing

Furthermore, data structures are also optimized to ex- system became fast, DEC Cambridge Research Lab Technical ploit the full potential of NVM. FPTree [13] is an NVM- Report CRL 92 (1992). aware B+-tree that stores leaf nodes in NVM and inter- [9] J. Huang, K. Schwan, M. K. Qureshi, NVRAM-aware logging nal nodes in DRAM, and it performs better than other in transaction systems, VLDB Endowment (2014). [10] P. Götze, et al., Data management on non-volatile memory: a NVM-optimized trees, NV-Tree [12] and wBTree [32], for perspective, Datenbank-Spektrum (2018). instance. Moreover, FOEDUS [29] also uses a customized [11] A. Eisenman, et al., Reducing DRAM footprint with NVM in tree called Master-Tree. Facebook, EuroSys, 2018.

Our work enables Haura to persist the tree nodes on [12] J. Yang, et al., NV-Tree: Reducing Consistency Cost for NVMNVM, which, along with an allocation strategy, can be based Single Level Systems, USENIX FAST, 2015. [13] I. Oukid, et al., FPTree: A hybrid SCM-DRAM persistent and used to improve recovery and caching. However, the concurrent B-tree for SCM, MOD, SIGMOD, 2016. migration of the persisted nodes is presently not possible. [14] F. Wiedemann, Modern Storage Stack with Key-Value Store Haura can also store the internal nodes on NVM as done Interface and Snapshots Based on CoW B -Trees, 2018. in FPTree [13]. Lastly, depending on the size of the data, [15] T. Höppner, Design and Implementation of an Object Store the engine can be used as NVM-DRAM engine. with Tiered Storage, 2021. [16] A. Faraclas, et al., Modeling of set and reset operations of phasechange memory cells, IEEE electron device letters (2011). 6. Conclusion [17] A. K. Mishra, et al., Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs, ISCA, IEEE, 2011. [18] B. Gervasi, Will carbon nanotube memory replace DRAM?, In this work, Haura, a general-purpose and write- IEEE Micro (2019). optimized storage engine is used to study the character- [19] D. B. Strukov, G. S. Snider, D. R. Stewart, R. S. Williams, The istics of the modern storage landscape that has become missing memristor found, nature (2008). more heterogeneous with the advent of PMem which is [20] S. Scargall, Programming persistent memory: A comprehensive guide for developers, Springer Nature, 2020. a promising technology that shares the characteristics [21] S. Sippu, et al., Transaction processing: Management of the of primary and secondary storage and has disrupted the logical database and its underlying structure, Springer, 2015. traditional memory paradigm. A few important findings [22] A. Baldassin, et al., Persistent Memory: A Survey of Programour work uncovered are; first, persistent memory per- ming Support and Implementations, ACM CSUR (2021). forms optimally when accessed using the largest possible [23] A. Rudof, Persistent memory programming, Login: The Usenix Magazine (2017). blocks in random workloads. Second, the size of the [24] A. Rudof, Programming models for emerging non-volatile cache and thread count impact Haura’s throughput. Last memory technologies, USENIX & SAGE (2013). but not least, the size of the nodes also determines the [25] W. Fuzong, G. Helin, Z. Jian, Dynamic data compression throughput of the engine with the internal node having algorithm selection for big data processing on local file system, more influence. The insights gathered in this paper can in: Proceedings of the International Conference on Computer Science and Artificial Intelligence, 2018, pp. 110–114. be used to significantly improve Haura’s performance [26] G. Valiente, Algorithms on trees and graphs, volume 112, and further exploit the characteristics of PMem. How- Springer, 2002. ever, two aspects that need to be investigated are the [27] J. Arulraj, et al., Let’s talk about storage & recovery methods use of in-place updates for PMem and accessing it using for non-volatile memory database systems, SIGMOD, 2015. devdax13 that produces better results than DAX in certain [28] I. Oukid, et al., SOFORT: A hybrid SCM-DRAM storage engine for fast data recovery, DaMoN, 2014. cases [33]. [29] H. Kimura, FOEDUS: OLTP engine for a thousand cores and NVRAM, ACM MOD, SIGMOD, 2015.

References [30] L. Lersch, et al., An analysis of LSM caching in NVRAM, DaMoN, 2017. [1] U. Kazemi, A survey of big data: challenges and specifications, [31] A. van Renen, et al., Managing non-volatile memory in

CiiT IJSETA (2018). database systems, MOD, 2018.

[2] F. Faerber, et al., Main memory database systems, Foundations [32] S. Chen, et al., Rethinking database algorithms for phase and Trends® in Databases (2017). change memory, Cidr, 2011.

[33] B. Daase, et al., Maximizing persistent memory bandwidth utilization for OLAP workloads, PODS SIGMOD, 2021.