Assessing Non-volatile Memory in Modern Heterogeneous Storage Landscape using a Write-optimized Storage Stack

Sajad Karim¹, Johannes Wünsche², David Broneske¹, Michael Kuhn² and Gunter Saake¹
¹ Databases and Software Engineering, Otto von Guericke University Magdeburg
² Parallel Computing and I/O, Otto von Guericke University Magdeburg

34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
sajad.karim@ovgu.de (S. Karim); johannes.wuensche@ovgu.de (J. Wünsche); david.broneske@ovgu.de (D. Broneske); michael.kuhn@ovgu.de (M. Kuhn); gunter.saake@ovgu.de (G. Saake)
ORCID: 0009-0002-4910-8453 (S. Karim); 0000-0002-5304-7262 (J. Wünsche); 0000-0002-9580-740X (D. Broneske); 0000-0001-8167-8574 (M. Kuhn); 0000-0001-9576-8474 (G. Saake)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract

Non-volatile memory (NVM), or persistent memory, is a promising and emerging storage technology that has not only disrupted the long-established memory hierarchy but also invalidated the programming paradigm used in traditional database management systems and file systems. It bridges the gap between primary and secondary storage and, hence, shares the characteristics of both categories. However, there currently exists no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous with NVM. Therefore, a general-purpose storage engine, Haura, is utilized to study the benefits of the modern storage landscape. In this work, NVM is integrated into the storage stack of Haura, and we study the access patterns of the storage devices involved and their impact on the performance of Haura. Our work shows that NVM performs best under sequential workloads and that random access performs better with larger block sizes. Furthermore, the block size has a significant impact on the performance of the storage devices, with smaller block sizes favoring NVM and larger block sizes favoring NVMe-supported devices.

Keywords: Non-Volatile Memory, Persistent Memory, Storage Class Memory, Non-Volatile Random Access Memory, Persistent Memory Programming, Modern Heterogeneous Storage Landscape, Write-Optimized Storage Engine (Haura)

1. Introduction

According to Kazemie [1], the volume of data in 2011 was around 1.8 zettabytes, and its volume doubles approximately every two years. At least, that was before the onset of COVID-19, whose outbreak further fueled this growth as the use of digital services rose exponentially. This data deluge is unprecedented and has created new challenges for the database management systems and file systems that are used in a wide range of applications for data analysis and management.

Traditional database management systems and file systems are developed considering the typical storage hierarchy, where memory is fast but volatile and limited, and secondary storage is persistent and vast but has high latency. In such systems, the data is logically split into two sets of copies: working and consistent copies. The working copy resides in main memory, whereas the persistent copy resides on one or more secondary storage devices. Also, making data persistent is an error-prone process, as problems like crashes and race conditions can corrupt data or leave it in an inconsistent state. Therefore, strategies like journaling and copy-on-write are used to ensure the consistency of data. Moreover, the scalability of DRAM resulted in main-memory database systems [2, 3, 4]. However, scaling DRAM further has become quite a challenging task [5], and also, because of its energy consumption, the solution is unaffordable for most businesses.

Persistent memory, on the other hand, is considered an alternative to deal with the above-mentioned issues. It is a new category in the storage hierarchy that is non-volatile, byte-addressable, provides DRAM-like latency, and offers much higher capacity than DRAM. This new storage class has opened opportunities not only for new system designs but also for enhancements in existing storage engines. For instance, some work has already been done in traditional database systems in which NVM is used to improve traditional disk-based (centralized/decentralized) logging [6, 7, 8, 9]. In prior work, NVM is used as a buffer between DRAM and secondary storage devices [10, 5, 11]. Moreover, several index data structures, like NVTree [12] and FPTree [13], have been introduced that exploit the properties of NVM. Nevertheless, there is presently no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous after the addition of NVM; in consequence, there is a research gap in this direction.

In order to investigate the benefits of a common storage engine that manages all the storage devices in the modern storage landscape, a prototype of a general-purpose storage engine, called Haura [14, 15], is used. It runs in user space and supports object and key-value interfaces. The key contributions of our work are as follows:

• We used PMDK (Persistent Memory Development Kit) to supplement Haura with persistent memory. PMDK is an open-source (C/C++) kit that offers different libraries/utilities to interact with persistent memory. We used the most appropriate library after a brief evaluation process.
• We investigated the impact of the above-mentioned change on Haura and used persistent memory to store the B𝜖-tree nodes (Section 2.3).
• We identified the access patterns for different storage devices that supplement or affect the throughput of Haura.

The remainder of this paper is structured as follows. Section 2 provides background on non-volatile memory and the related programming techniques, and also briefly describes Haura. The implementation is discussed in Section 3, followed by the evaluation in Section 4. Next, Section 5 details the related work, and the paper concludes with a summary and open challenges in Section 6.

2. Background

In this section, we discuss non-volatile memory and different programming models to access it. We then describe Haura and briefly touch upon its key components.

2.1. Non-Volatile Memory

Persistent Memory (PMem), Storage Class Memory (SCM), Non-Volatile Memory (NVM), and Non-Volatile RAM (NVRAM) are names often used to address this new class of storage. It sits between primary and secondary storage in the typical storage hierarchy, and it is also considered a disruptive technology, as it has disrupted the traditional memory paradigm. It is non-volatile, has DRAM-like latency, and offers much higher capacity than DRAM. It is byte-addressable, and the fact that the CPU can access it directly at cache-line granularity demands a different architecture than the one used in typical storage engines. Some example technologies are Phase Change Memory [16], Spin Transfer Torque RAM (STT-RAM) [17], Carbon NanoTube RAM (NRAM, NanoRAM) [18], and memristors [19].

Presently, only Intel® produces persistent memory modules, under the brand name Intel® Optane™ DC Persistent Memory. It offers different generations that vary in performance and capacity, and the modules are designed to be used with specific generations of Intel®'s processors, Intel® Xeon® Scalable Processors, for instance. They are available in the DIMM form factor and are compatible with conventional DDR4 sockets. They co-exist with conventional DDR4 DRAM DIMMs and use the same memory channel. The internal granularity of the modules is 256 bytes, and they can be operated in three different modes: memory, app direct, and dual.

In the memory mode, NVRAM supplements DRAM: DRAM acts as an L4 cache and NVRAM as the main (volatile) memory. The host memory controller integrated into the processor manages the movement of data between DRAM and NVRAM. In the app direct mode, on the other hand, NVRAM is a persistent memory module to which applications have direct access; it is still byte-addressable, and applications can use it as a storage device. Lastly, in the dual mode, part of the NVRAM can be allocated to applications, and the rest can be utilized as non-volatile memory.

2.2. Non-Volatile Memory Programming

Typical programming models categorize data structures into two broad categories: memory-resident and storage-resident data structures [20]. This is mainly due to the underlying system architecture, where main memory is attached directly to the memory bus while secondary storage, due to its high latency, communicates with the system via an I/O controller. The models operate on the data in main memory at byte granularity and ensure its persistency by explicitly writing it to secondary storage. A key challenge of such models is to ensure the consistency and integrity of data across storage classes that have different characteristics. For example, the integrity of data in main memory could be ensured using mutexes, whereas the consistency and durability of data on secondary storage are ensured using strategies like journaling and write-ahead logging [21].

The above-mentioned programming paradigm cannot be followed when working with persistent memory, as it is attached to the memory bus and is non-volatile. Therefore, a new model is inevitable that simultaneously addresses all the atomicity and consistency issues, like concurrency and power failure, for instance [22, 20, 23]. Moreover, persistent memory is a comparatively new technology, and writing software with all these considerations requires in-depth knowledge of the hardware and caches. Therefore, several APIs are available that handle the hardware-related intricacies internally. PMDK¹ (Persistent Memory Development Kit) is one such example, which is based on the SNIA NVM programming model².

SNIA (Storage Networking Industry Association) proposes different programming models [22, 24, 20, 23] to program persistent memory. The simplest way is to use the module as a block device and access it using a standard file API. Another approach is via an optimized file system, ext4 and XFS in Linux and NTFS in Windows, that is adapted specifically for persistent memory. This approach, contrary to the previous one, allows small read and write operations and is more efficient. Last but not least, DAX, or Direct Access, is another approach in which the persistent memory is accessed as a memory-mapped file. Nevertheless, contrary to memory-mapping files on secondary storage, the operating system does not maintain pages for persistent memory in main memory.

¹ https://pmem.io/pmdk/
² https://snia.org/tech_activities/standards/curr_standards/npm

2.3. Haura

Haura is a general-purpose and write-optimized tiered storage stack that runs in user space and supports object and key-value interfaces [14].
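The write optimization mentioned above rests on the B𝜖-tree, whose central idea — buffering updates as messages in the upper tree levels and flushing them downward in batches — can be sketched as a toy, single-buffer model. This is an illustration of the general technique only; Haura's actual node layout, message types, and flushing policy differ.

```rust
use std::collections::BTreeMap;

// Toy sketch of the B𝜖-tree write path: updates are appended as messages
// to a buffer held in the upper tree levels and are only flushed down to
// the leaves in batches once the buffer fills up. Illustrative only.
struct BeTree {
    buffer: Vec<(u64, i64)>,    // pending messages: (key, value)
    buffer_capacity: usize,
    leaves: BTreeMap<u64, i64>, // stand-in for the persisted leaf level
    flushes: usize,             // number of batched flushes performed
}

impl BeTree {
    fn new(buffer_capacity: usize) -> Self {
        Self { buffer: Vec::new(), buffer_capacity, leaves: BTreeMap::new(), flushes: 0 }
    }

    // A write is cheap: it only appends a message to the buffer.
    fn insert(&mut self, key: u64, value: i64) {
        self.buffer.push((key, value));
        if self.buffer.len() >= self.buffer_capacity {
            self.flush();
        }
    }

    // One large batched write replaces many small random ones.
    fn flush(&mut self) {
        for (k, v) in self.buffer.drain(..) {
            self.leaves.insert(k, v);
        }
        self.flushes += 1;
    }

    // A read must consult the buffered messages first (newest wins).
    fn get(&self, key: u64) -> Option<i64> {
        self.buffer.iter().rev().find(|(k, _)| *k == key).map(|(_, v)| *v)
            .or_else(|| self.leaves.get(&key).copied())
    }
}

fn main() {
    let mut tree = BeTree::new(4);
    for i in 0..10 {
        tree.insert(i, i as i64);
    }
    assert_eq!(tree.flushes, 2);      // two batched flushes of four messages each
    assert_eq!(tree.get(9), Some(9)); // answered from the buffer
    assert_eq!(tree.get(3), Some(3)); // already flushed to the leaf level
}
```

The batching is what makes the structure write-optimized: the cost of pushing a node to a storage device is amortized over all the messages it carries.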
It has the ability to handle multiple datasets, provides data integrity, and supports advanced features like snapshots, data placement, and fail-over strategies. The core of the engine is the B𝜖-tree, which is the sole index data structure in the engine. The engine supports block storage devices, solid-state drives, for instance, and also has its own caching layer using DRAM (separate from the operating system's). It also offers features like data striping, mirroring, and parity. It follows the ZFS [25] architecture and uses a similar layered approach [14], where layers, or modules, interact with each other using interfaces. A schematic diagram of Haura is presented in Figure 1.

Figure 1: A layered conceptual diagram illustrating the main components of Haura. The objects in grey represent the main modules, whereas the objects in yellow and sky blue are the helper modules and classes, respectively. The key classes in the modules are represented using white blocks.

Haura was initially built as a key-value storage engine in which arbitrary-sized keys and values can be stored; an extension to support objects was made later in [15], in which the ObjectStore module was added to the stack. The ObjectStore module exposes the necessary routines to interact with the engine and supports all the primitive operations like create, read, write, and query. It uses the same key-value interface to store the objects. However, a key challenge it addresses is the transformation of objects into key-value pairs. In the key-value version of Haura, although the keys and the values can have variable sizes, there was still an upper limit defined on their sizes. Objects, on the other hand, can contain data of several gigabytes; therefore, a mechanism is devised in which objects are split into chunks, and each chunk is assigned a unique identifier. Moreover, the object name is primarily the key of the object; it can have a variable size and can spread over a few kilobytes. Therefore, an indirection is added where the object name and other metadata are stored separately, and a unique fixed-size identifier is assigned to each chunk. Furthermore, the ObjectStore module, via the Database module, maintains two different datasets to store the data and the metadata of the objects. The first dataset stores the chunks of the object, whereas the second dataset stores the indirection-related information and other metadata, like modification time and size, to name a few.

The Database module controls and manages all the activities regarding a database. A database in Haura consists of one or more datasets and their respective snapshots. Datasets and snapshots are actually B𝜖-trees. Moreover, the module also maintains a separate B𝜖-tree, named the root tree, to store all the information regarding the database. For example, it maintains the active datasets and their pointers into storage. It also maintains information regarding the usage of the storage devices in the form of bitmaps.

The Tree module contains the actual implementation of the B𝜖-tree, encapsulates all the tree-related operations, and exposes the methods to the upper layer.

The DataManagement module ensures the persistence of the underlying data and its retrieval when requested. It internally interacts with a wide range of modules, especially the helper modules, and plays a vital role in achieving their internal functionalities. The cache module, for instance, is managed by this module. The write and update requests from the upper layer first land in this module and are then passed on to the cache module. Similarly, in the case of a cache miss, this module fetches the required data using the StoragePool module and passes the data to the cache module for later usage.
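The object-to-key-value mapping described earlier in this section — splitting an object into fixed-size chunks keyed by a fixed-size identifier, with the variable-length name kept in a separate metadata entry — can be sketched as follows. All identifiers and layout details here are hypothetical and simplified from the description above, not Haura's actual on-disk format.

```rust
use std::collections::BTreeMap;

const CHUNK_SIZE: usize = 4; // tiny for illustration; real chunks are far larger

// Hypothetical sketch of the indirection: object names map to a fixed-size
// object id plus length, and chunk keys are (object id, chunk index).
struct ObjectStore {
    next_id: u64,
    meta: BTreeMap<String, (u64, usize)>,  // name -> (object id, total length)
    chunks: BTreeMap<(u64, u64), Vec<u8>>, // (object id, chunk index) -> data
}

impl ObjectStore {
    fn new() -> Self {
        Self { next_id: 0, meta: BTreeMap::new(), chunks: BTreeMap::new() }
    }

    fn put(&mut self, name: &str, data: &[u8]) {
        let id = self.next_id;
        self.next_id += 1;
        self.meta.insert(name.to_string(), (id, data.len()));
        // Split the object into fixed-size chunks, each with its own key.
        for (i, chunk) in data.chunks(CHUNK_SIZE).enumerate() {
            self.chunks.insert((id, i as u64), chunk.to_vec());
        }
    }

    fn get(&self, name: &str) -> Option<Vec<u8>> {
        // One metadata lookup resolves the name to the fixed-size id ...
        let &(id, len) = self.meta.get(name)?;
        let mut out = Vec::with_capacity(len);
        // ... then a range scan over (id, *) reassembles the chunks in order.
        for (_, chunk) in self.chunks.range((id, 0)..(id + 1, 0)) {
            out.extend_from_slice(chunk);
        }
        Some(out)
    }
}

fn main() {
    let mut store = ObjectStore::new();
    let data = b"a small object split into chunks".to_vec();
    store.put("linux-5.12.13.tar.xz", &data);
    assert_eq!(store.chunks.len(), 8); // 32 bytes / 4-byte chunks
    assert_eq!(store.get("linux-5.12.13.tar.xz"), Some(data));
}
```

The design choice mirrors the text: lookups by name touch only the small metadata dataset, while the bulk data lives under compact fixed-size keys that sort contiguously per object.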
Moreover, this module is also responsible for the compression and decompression of the data, and it uses the compression module for this purpose. Furthermore, the decision as to which blocks on the storage media are used to write the data is also taken in this module, and it uses the AllocationHandler module to allocate the blocks. Last but not least, it communicates with the StoragePool module to perform the write and read I/O operations.

The StoragePool module performs two key operations. First, it maintains queues for asynchronous I/O operations. Second, it dispatches the I/O calls to the respective virtual devices in the Vdev module. It also exposes methods for synchronous calls, for which it bypasses the queues. The interface of this layer matches the Vdev module; however, it requires an additional parameter to address the desired virtual device in the Vdev module, as the module may contain more than one virtual device.

Lastly, the Vdev module provides different implementations to interact with the storage devices, and these are referred to as virtual devices in the system. Currently, Haura supports single, mirror, and parity implementations. The single version works on a single storage device, and it has two further sub-implementations, file and memory, for SSD/HDD and DRAM (as volatile storage), respectively. It is the simplest implementation provided by this module; it is not fault-tolerant, and the underlying data is lost in case of error or failure. The other implementations, mirror and parity, are introduced to support mirroring and parity functionality, respectively.

3. Implementation

In this section, we briefly discuss the implementation of the new virtual device for persistent memory and touch upon the important steps considered during this phase.

3.1. Programming Model Selection

PMDK provides several high- and low-level libraries to interact with persistent memory. Haura, on the other hand, is also a well-developed engine and expects a virtual device to implement a certain interface. Therefore, in the initial phase, a list of properties was formulated to set a criterion; each new implementation of a virtual device must adhere to this list to work properly with the existing interfaces in Haura. The properties are:

• Haura stores the nodes of the B𝜖-tree using virtual devices; therefore, the virtual device should be able to perform read and write operations in varied block sizes, from a few kilobytes to megabytes.
• The virtual device should be able to perform both synchronous and asynchronous calls to the underlying storage device.
• Haura uses bitmaps to manage the allocation of the blocks on the storage devices. It partitions the whole space into equal-size blocks and allocates and de-allocates the blocks internally; therefore, this bookkeeping is not required in the virtual device or any library used in it.
• Haura uses the copy-on-write technique to update the nodes. It first copies nodes to main memory, applies changes to the nodes, and then writes the data back to the device in a new location. It never performs in-place updates.

Considering the above properties, libpmem and libpmemblk from PMDK's library suite suit the current architecture of Haura and can be used to implement the functionality.

Libpmemblk is a high-level library that provides functionality to manage an array of fixed-size blocks. The blocks can be updated and read using their indices in the array. It does not provide byte-level access to the blocks, and any update requires re-writing the whole block. Libpmem, in contrast, is one of the low-level libraries in the kit, and the other high-level libraries are built on top of it. It wraps the basic operations exposed by the operating system and adds optimizations for persistent memory.

The other libraries from PMDK's suite, like libpmemobj and libpmemkv, are not a suitable choice because, first, they do not provide the required interface to implement the virtual device, and second, they internally perform operations that are already addressed in Haura. For example, libpmemobj internally implements object-store functionality on top of memory-mapped files, whereas the current architecture of Haura expects the virtual device to perform raw read and write operations at a specified location in memory. Moreover, the management of key-value data at the library level, as in the case of libpmemkv, is the core functionality of Haura and is therefore redundant.

The final selection from the shortlisted³ libraries was made using an experiment, the results of which are shown in Figure 2. First, it is quite evident from the graph that accessing persistent memory via a memory-aware file API is not feasible, as its latency for reading and writing the data, especially with small buffers, is significantly high. A prominent drop in its latency can be observed as the buffer size increases, but it remains higher than the rest. The approaches that compete with each other are libpmem and libpmemblk. Libpmemblk, in some instances, performed better than libpmem, but its performance is worst with small buffers, and it also crashes when the block size exceeds 8 MB. Therefore, libpmem is finally selected, as it is (comparatively) consistent throughout the experiment.

Figure 2: An object (size 5 GB) is written and read sequentially multiple times with different block sizes using libpmem, libpmemblk, and a standard file API.

³ The PMem-aware file API is also added to the list for comparison.

3.2. Rust Wrapper for libpmem

Haura is written in Rust, and while PMDK supports other languages, only its C/C++ support is fully tested. Therefore, the second step, after selecting the library from PMDK's suite, involved writing a Rust wrapper for the selected library (i.e., libpmem). In this regard, Rust provides a utility named bindgen that generates the FFI⁴ bindings to C/C++ libraries. The tool requires two files to generate the bindings. The first file is wrapper.h, which should contain all the header files and declarations that the target application intends to use. The other file is build.rs, which should contain the details regarding the generation of the bindings. This file is part of the folder structure followed in Rust; before compiling the code, the compiler looks for this file in the root folder and executes it, so that the bindings are generated (and made available) before the actual program is built.

Once the bindings were generated, the next step involved writing the methods to perform read and write operations on persistent memory and exposing them for use in the respective virtual device in the Vdev module, which is described in the following section.

⁴ Foreign function interface (FFI) is a method to invoke calls from a library written and compiled in a different language.

3.3. NVM as a Virtual Device

As discussed in Section 2.3, Haura interacts with storage devices using different virtual device implementations in the Vdev module; currently, there are four implementations available, namely file, memory, parity1, and mirror. Similarly, a new implementation of a virtual device for persistent memory is added to the Vdev module. Moreover, as mentioned in Section 3.1, the library from PMDK was chosen with careful consideration to avoid any architecture-related changes to Haura. Therefore, the new virtual device implementation exposes a similar interface as the other implementations, and it internally makes use of the wrapper library from the previous section to perform the storage-specific operations.

The integration of the new virtual device required alterations in a few traits⁵ and structs in different modules. For example, the DataManagement module interacts with the StoragePool module using a trait called StoragePoolLayer; StoragePoolUnit, a struct in the StoragePool module, implements StoragePoolLayer and maintains the information regarding the configured virtual devices in an array of type StorageTier. Furthermore, the type StorageTier is an array of Dev, and Dev is an enum that stores the instance of the associated virtual device. The enum Dev provides three different variants, Leaf, Mirror, and Parity1; Leaf further offers two variants, File and Memory. All these traits and structs are affected by the new implementation.

Furthermore, virtual devices are accessed using different traits that define distinct behaviors. The first trait is Vdev. It exposes functions to query different properties and states of the virtual devices; for example, it can be used to fetch the id and size of a device. The traits VdevWrite and VdevRead, on the other hand, as their names suggest, are used to perform write and read operations on virtual devices and provide methods to perform the operations synchronously and asynchronously. These traits are implemented for the new virtual device. Last but not least, further changes were added to make the virtual device visible to Haura through the configuration details.

⁵ A trait, in Rust, can be considered an equivalent to an interface in object-oriented languages like Java and C#.

4. Performance Analysis

In this section, we analyze the impact of the newly added virtual device on Haura. We start by testing Haura for different workloads, with the existing best-performing implementation of the virtual device as the baseline. We then study the impact with different configurations, with different thread counts and cache sizes, for instance.
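Before turning to the measurements, the trait structure from Section 3.3 can be sketched as follows. The trait names mirror the paper (Vdev, VdevRead, VdevWrite), but the signatures are simplified, synchronous-only guesses, not Haura's actual API, and a plain Vec stands in for the pmem-mapped region that the libpmem wrapper would provide.

```rust
// Illustrative sketch of the trait split described in Section 3.3:
// one trait for device properties, and separate traits for reads and writes.
trait Vdev {
    fn id(&self) -> u32;
    fn size(&self) -> u64;
}
trait VdevRead {
    fn read(&self, offset: u64, buf: &mut [u8]);
}
trait VdevWrite {
    fn write(&mut self, offset: u64, data: &[u8]);
}

// A PMem-backed leaf device would implement all three; here a Vec stands in
// for the memory-mapped persistent-memory region.
struct PMemVdev {
    id: u32,
    mem: Vec<u8>,
}

impl Vdev for PMemVdev {
    fn id(&self) -> u32 { self.id }
    fn size(&self) -> u64 { self.mem.len() as u64 }
}
impl VdevRead for PMemVdev {
    fn read(&self, offset: u64, buf: &mut [u8]) {
        let o = offset as usize;
        // With libpmem this would read directly from the mapped region.
        buf.copy_from_slice(&self.mem[o..o + buf.len()]);
    }
}
impl VdevWrite for PMemVdev {
    fn write(&mut self, offset: u64, data: &[u8]) {
        let o = offset as usize;
        self.mem[o..o + data.len()].copy_from_slice(data);
        // With libpmem, a pmem_persist-style flush of the affected
        // cache lines would follow here.
    }
}

fn main() {
    let mut dev = PMemVdev { id: 0, mem: vec![0; 4096] };
    dev.write(128, b"node bytes");
    let mut buf = [0u8; 10];
    dev.read(128, &mut buf);
    assert_eq!(&buf, b"node bytes");
    assert_eq!(dev.size(), 4096);
}
```

Splitting the behaviors across traits lets the StoragePool layer dispatch reads and writes generically over any configured device, which is why the new PMem device only had to implement these traits to slot into the existing stack.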
Experimental Setup devices using different virtual device implementations The experiments are performed on a dual-socket server in the Vdev module, and currently, there are four imple- having Intel® Xeon® Gold 5220R with 2.20 GHz base fre- mentations available namely, file, memory, parity1, and 4 5 Foreign function interface (FFI) is a method to invoke calls from a A trait, in Rust, can be considered as an equivalent to an interface library written and compiled in a different language. in object-oriented languages like Java and C# [SK22d]. PMem - Write The key reason behind the poor performance of the PMem - Read read I/O is the layout in which Haura stores the data. 1,000 SSD NVMe - Write SSD NVMe - Read Haura starts by splitting the objects into chunks and I/O (MB) SSD SATA - Write transforming them into messages. The messages are 500 SSD SATA - Read then pushed into the root node, and they descend gradu- ally to the target leaf node, and during the descent, they 0 are buffered in internal nodes and flushed down to the child node only when the buffer is full. Later, when the 0 0 0 1 1 2 2 3 3 4 5 5 6 7 7 7 4 6 :1 0:0 0:1 0:2 0:3 0:4 0:5 1:0 1:1 1:3 1:4 1:5 2:0 2:2 2:3 2:4 1:2 2:1 sync operation is performed, Haura follows the postorder -0 Runtime (min:s) [26] approach to persist the data that does not guarantee the ordering of the chunks on the storage device. On Figure 3: Three different executions representing the sequen- the other hand, when Haura fetches an object, it starts tial I/O throughput of Haura when configured with different fetching its chunks sequentially from the root node first storage devices. The data is recorded at an interval of 500ms. and keeps fetching the child nodes until it reaches the leaf node or finds the messages for the queried chunk. quency and each CPU contains 24 physical cores and each Therefore, this reading approach cannot benefit from se- core supports two threads. 
Each CPU-socket contains quential access as the tree data is already stored in the two integrated memory controllers (iMCs) with three postorder layout. Furthermore, the other main reason memory channels, each channel (except the last ones) that applies to both scenarios is the use of a single thread connected to one PMem and DRAM DIMM resulting in 4 that leaves the device underutilized. Last but not least, interleaved6 PMem and 6 DRAM DIMMs per socket. The the reason the write I/Os performed better is because PMem used is 128 GB Intel® ™ DC Persistent Memory they were asynchronous calls where the thread was ca- Series 100 DIMMs, resulting in a total persistent memory pable of issuing multiple asynchronous I/Os using the capacity of 1 TB (128 GB x 4 DIMMs x 2 sockets), and the asynchronous programming technique, whereas the read capacity of DRAM is 384 GB (32 GB x 6 DIMMs x 2 sock- I/Os were synchronous calls. ets). Moreover, the server contains two NUMA nodes Moreover, another interesting pattern that surfaced each with 48 logical cores, 4 PMem DIMMs, and 6 DRAM during the detailed analysis is, the relative performance DIMMs. However, to avoid memory access overhead, the of the storage devices was not consistent all the time. As experiments are run on socket 0. Lastly, the machine shown in Figure 4, the difference between PMem and runs Ubuntu 20.04.3 LTS (5.4.0-126-generic), and PMem SSD NVMe is significant for small block sizes, however, is accessed in the app direct Mode using fsdax 7 . the difference shrinks considerably for large blocks. 4.2. Sequential Workload 4.3. Random I/O and Worker Threads In this experiment, Haura is configured for three different This experiment evaluates the impact of the number of storage devices; PMem, SSD NVMe, and SSD SATA. The threads and cache size on Haura when configured with experiment writes 5 objects, each size 5 GB, and then different storage devices. 
The workload then reads them sequentially in the same order; however, the write requests are asynchronous, which allows multiple write requests to be dispatched at once, whereas the read requests are synchronous.

The results in Figure 3 show that PMem performed better than the rest for the write I/O, while it lagged behind SSD NVMe for the read I/O. However, the resulting throughput for PMem is in both cases quite far from the expected values, because, as per the specifications, a single PMem DIMM (with four cache lines) can write and read up to 1,800 and 6,800 MB/s⁸, respectively⁹.

Figure 4: The plots group the calls (write and read, respectively) from Figure 3 with respect to their payloads (write and read times over the block size, with per-device averages for PMem, SSD NVMe, and SSD SATA).

The script writes an archive file¹⁰ (size 1011 MB) as an object to the engine. The file contains 80,690 entries, with metadata stored in the first 9.3 MiB, which contains the central directory used to locate the individual files. Moreover, the scenarios with circles (Figure 5) store the metadata on the first device (i.e., SSD SATA) and the remaining content on the second mentioned device. Lastly, the script fetches 50,000 files randomly¹¹.

The results are presented in Figure 5, and it is evident from all sub-plots that the execution time improves with the increase in the worker threads and the cache size. The scenarios that performed worst are the ones that used SSD SATA to store the contents and a faster device to store the metadata of the file. However, a minute difference can be observed: with PMem, the execution time is slightly worse for fewer than 6 threads; nevertheless, the difference diminishes, and with the increase in threads (e.g., 30 threads), PMem performed better than SSD NVMe. On the other hand, a significant difference in performance can be seen when only PMem or SSD NVMe is used to store the whole file. As can be seen, SSD NVMe performs marginally better when fewer threads are used; however, as the thread count passes 9 threads, PMem surpasses SSD NVMe with a significant difference at the end.

Figure 5: The impact of threads and different cache sizes on Haura when used with different storage configurations (cache sizes of 16, 32, and 64 MB; configurations SSD NVMe/SSD SATA, PMem/SSD SATA, SSD NVMe, and PMem).

4.4. Node-Size Significance

An intriguing behavior we came across while discussing the sequential workload is that the size of the payload influences the performance of Haura. Therefore, to further investigate this behavior, this experiment evaluates Haura (for PMem and SSD NVMe) with different internal and leaf node sizes, which thus far were set to 4 MB for both node types. The experiment first sets the size of the internal nodes, then repeats the experiment with different leaf node sizes, and writes and reads an object with a size of 128 MB sequentially.

The results are illustrated using heatmaps in Figure 6. First, it is evident that both storage devices share almost the same temperatures. Second, the engine performs worst for small block sizes, especially for internal nodes. However, the performance improves with the increase in the size, and the concentration of blue color indicates that the engine performs better under large block sizes.

One reason for the high temperatures is the height of the tree, which grows deep when the nodes, internal nodes in particular, are small. For instance, when the node size is 512 bytes, an object of size 128 MB would result in 261,376 nodes¹², and the internal nodes contain a limited number of messages and pivots, whereas, when the node size is 4 MB, the tree would only need 32 nodes to accommodate the object. Therefore, when the tree is deep, Haura spends considerable time flushing and merging the nodes. Moreover, another obvious reason is that, when the nodes are small, multiple requests are dispatched to the storage devices. However, further analysis is required to capture the time taken only by the virtual devices.

Figure 6: Multiple end-to-end executions to analyze the impact of internal and leaf node sizes on the throughput of Haura (heatmaps of write and read times over internal and leaf node sizes from 2¹ to 2¹³ KB).

⁶ In DIMM interleaving, the data is interleaved across the DIMMs as per the configured block size (i.e., 4 KB in the current settings).
⁷ https://docs.pmem.io/ndctl-user-guide/managing-namespaces
⁸ https://www.intel.de/content/www/de/de/products/docs/memory-storage/optane-persistent-memory/optane-dc-persistent-memory-brief.html
⁹ SSD NVMe can perform sequential read and write operations of up to 3,200 and 2,000 MB/s and random read and write operations of up to 540,000 and 55,000 IOPS. SSD SATA can perform sequential read and write operations of up to 550 and 510 MB/s and random read and write operations of up to 86,000 and 30,000 IOPS, respectively.
¹⁰ https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.12.13.tar.xz
¹¹ https://docs.rs/xoshiro/latest/xoshiro/struct.Xoshiro256Plus.html
¹² The actual count of the nodes for an object of size 128 MB is higher than 261,376 because each object chunk is assigned a key as well.
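The effect of the node size on the tree size can be checked with a back-of-the-envelope calculation. The sketch below is a simplification, not Haura's exact accounting: it assumes the 128 MB object is 128 MiB, counts only a lower bound on the leaf nodes (internal nodes and per-chunk keys come on top, cf. footnote 12), and `min_leaf_count` is a hypothetical helper, not an engine API.

```python
import math

MiB = 1024 * 1024

def min_leaf_count(object_size: int, node_size: int) -> int:
    """Lower bound on the number of leaf nodes needed when each leaf
    stores at most `node_size` bytes of the object's payload."""
    return math.ceil(object_size / node_size)

# 128 MiB object with 4 MiB nodes: only a few dozen leaves are needed.
print(min_leaf_count(128 * MiB, 4 * MiB))  # 32

# The same object with 512-byte nodes needs hundreds of thousands of
# leaves before internal nodes are even counted.
print(min_leaf_count(128 * MiB, 512))      # 262144
```

With 4 MB nodes, the bound of 32 leaves matches the roughly 32 nodes reported above; with 512-byte nodes, the 262,144 leaves are already in the region of the reported 261,376 nodes, and the additional deep hierarchy of small internal nodes, each holding few messages and pivots, explains the time spent flushing and merging.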
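The DIMM interleaving mentioned in footnote 6 can be illustrated with a small sketch. This is only a model of the striping scheme: the real mapping is performed by the memory controller and the configured interleave set, not in software, and `dimm_for_offset` is a hypothetical helper with the 4-DIMM, 4 KB configuration from this setup as defaults.

```python
def dimm_for_offset(offset: int, num_dimms: int = 4, block: int = 4096) -> int:
    """Index of the DIMM that backs a byte offset when data is striped
    across `num_dimms` DIMMs in `block`-sized units."""
    return (offset // block) % num_dimms

# Consecutive 4 KB blocks rotate over the DIMMs, so a sequential access
# stream is spread across the whole interleave set.
print([dimm_for_offset(i * 4096) for i in range(6)])  # [0, 1, 2, 3, 0, 1]
```

This also shows why sequential accesses benefit most: they engage all DIMMs in turn, whereas accesses smaller than the interleave block repeatedly hit a single DIMM.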
5. Related Work

In existing engines, NVM is mostly utilized to improve the caching and recovery of the engines. In [6, 7], the logging component uses NVM at different levels, which improves the logging and recovery of the engine. Moreover, [27] discusses three different logging techniques and implements their equivalent NVM designs. The results show that in-place update is the most appropriate technique for NVM. Furthermore, SOFORT [28] and FOEDUS [29] are examples of main memory database systems that utilize NVM to improve the recovery of the system.

In some literature, NVM is also utilized as a buffer, and there are two main designs in this approach. The first is to use NVM to supplement DRAM, as already mentioned in [28, 29, 30], and the second is to use NVM as another layer between DRAM and secondary storage; in this regard, a technique called three-tier buffer management is suggested in [31].

Furthermore, data structures are also optimized to exploit the full potential of NVM. FPTree [13] is an NVM-aware B+-tree that stores leaf nodes in NVM and internal nodes in DRAM, and it performs better than other NVM-optimized trees, for instance, NV-Tree [12] and wBTree [32]. Moreover, FOEDUS [29] also uses a customized tree called Master-Tree.

Our work enables Haura to persist the tree nodes on NVM, which, along with an allocation strategy, can be used to improve recovery and caching. However, the migration of the persisted nodes is presently not possible. Haura can also store the internal nodes on NVM, as done in FPTree [13]. Lastly, depending on the size of the data, the engine can be used as an NVM-DRAM engine.

6. Conclusion

In this work, Haura, a general-purpose and write-optimized storage engine, is used to study the characteristics of the modern storage landscape, which has become more heterogeneous with the advent of PMem, a promising technology that shares the characteristics of primary and secondary storage and has disrupted the traditional memory paradigm. A few important findings our work uncovered are the following: first, persistent memory performs optimally when accessed using the largest possible blocks in random workloads. Second, the size of the cache and the thread count impact Haura's throughput. Last but not least, the size of the nodes also determines the throughput of the engine, with the internal node size having more influence. The insights gathered in this paper can be used to significantly improve Haura's performance and further exploit the characteristics of PMem. However, two aspects that still need to be investigated are the use of in-place updates for PMem and accessing it using devdax¹³, which produces better results than DAX in certain cases [33].

References

[1] U. Kazemi, A survey of big data: challenges and specifications, CiiT IJSETA (2018).
[2] F. Faerber, et al., Main memory database systems, Foundations and Trends® in Databases (2017).
[3] P.-Å. Larson, J. Levandoski, Modern main-memory database systems, VLDB Endowment (2016).
[4] J. DeBrabant, et al., Anti-caching: A new approach to database management system architecture, VLDB Endowment (2013).
[5] I. Oukid, et al., Storage class memory and databases: Opportunities and challenges, it - Information Technology (2017).
[6] J. N. Gray, Notes on data base operating systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 1978, pp. 393–481.
[7] T. Wang, et al., Scalable logging through emerging non-volatile memory, Proceedings of the VLDB Endowment 7 (2014).
[8] D. Lomet, R. Anderson, et al., How the Rdb/VMS data sharing system became fast, DEC Cambridge Research Lab Technical Report CRL 92 (1992).
[9] J. Huang, K. Schwan, M. K. Qureshi, NVRAM-aware logging in transaction systems, VLDB Endowment (2014).
[10] P. Götze, et al., Data management on non-volatile memory: a perspective, Datenbank-Spektrum (2018).
[11] A. Eisenman, et al., Reducing DRAM footprint with NVM in Facebook, EuroSys, 2018.
[12] J. Yang, et al., NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems, USENIX FAST, 2015.
[13] I. Oukid, et al., FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for SCM, SIGMOD, 2016.
[14] F. Wiedemann, Modern Storage Stack with Key-Value Store Interface and Snapshots Based on CoW Bε-Trees, 2018.
[15] T. Höppner, Design and Implementation of an Object Store with Tiered Storage, 2021.
[16] A. Faraclas, et al., Modeling of set and reset operations of phase-change memory cells, IEEE Electron Device Letters (2011).
[17] A. K. Mishra, et al., Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs, ISCA, IEEE, 2011.
[18] B. Gervasi, Will carbon nanotube memory replace DRAM?, IEEE Micro (2019).
[19] D. B. Strukov, G. S. Snider, D. R. Stewart, R. S. Williams, The missing memristor found, Nature (2008).
[20] S. Scargall, Programming persistent memory: A comprehensive guide for developers, Springer Nature, 2020.
[21] S. Sippu, et al., Transaction processing: Management of the logical database and its underlying structure, Springer, 2015.
[22] A. Baldassin, et al., Persistent Memory: A Survey of Programming Support and Implementations, ACM CSUR (2021).
[23] A. Rudoff, Persistent memory programming, ;login: The Usenix Magazine (2017).
[24] A. Rudoff, Programming models for emerging non-volatile memory technologies, USENIX & SAGE (2013).
[25] W. Fuzong, G. Helin, Z. Jian, Dynamic data compression algorithm selection for big data processing on local file system, in: Proceedings of the International Conference on Computer Science and Artificial Intelligence, 2018, pp. 110–114.
[26] G. Valiente, Algorithms on trees and graphs, volume 112, Springer, 2002.
[27] J. Arulraj, et al., Let's talk about storage & recovery methods for non-volatile memory database systems, SIGMOD, 2015.
[28] I. Oukid, et al., SOFORT: A hybrid SCM-DRAM storage engine for fast data recovery, DaMoN, 2014.
[29] H. Kimura, FOEDUS: OLTP engine for a thousand cores and NVRAM, SIGMOD, 2015.
[30] L. Lersch, et al., An analysis of LSM caching in NVRAM, DaMoN, 2017.
[31] A. van Renen, et al., Managing non-volatile memory in database systems, SIGMOD, 2018.
[32] S. Chen, et al., Rethinking database algorithms for phase change memory, CIDR, 2011.
[33] B. Daase, et al., Maximizing persistent memory bandwidth utilization for OLAP workloads, SIGMOD, 2021.

¹³ https://docs.pmem.io/ndctl-user-guide/managing-namespaces