=Paper=
{{Paper
|id=Vol-3714/paper4
|storemode=property
|title=Assessing Non-volatile Memory in Modern Heterogeneous Storage Landscape using a Write-optimized Storage Stack
|pdfUrl=https://ceur-ws.org/Vol-3714/paper4.pdf
|volume=Vol-3714
|authors=Sajad Karim,Johannes Wünsche,David Broneske,Michael Kuhn,Gunter Saake
|dblpUrl=https://dblp.org/rec/conf/gvd/KarimWB0S23
}}
==Assessing Non-volatile Memory in Modern Heterogeneous Storage Landscape using a Write-optimized Storage Stack==
Sajad Karim¹, Johannes Wünsche², David Broneske¹, Michael Kuhn² and Gunter Saake¹

¹ Databases and Software Engineering, Otto von Guericke University Magdeburg
² Parallel Computing and I/O, Otto von Guericke University Magdeburg
Abstract

Non-volatile memory (NVM), or persistent memory, is a promising and emerging storage technology that has not only disrupted the long-established memory hierarchy but also invalidated the programming paradigm used in traditional database management systems and file systems. It bridges the gap between primary and secondary storage and, hence, shares the characteristics of both categories. However, there currently exists no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous with the addition of NVM. Therefore, a general-purpose storage engine, Haura, is utilized to study the benefits of the modern storage landscape. In this work, NVM is integrated into the storage stack of Haura, and we study the access patterns of the modern storage devices involved and their impact on the performance of Haura. Our work shows that NVM performs best under sequential workloads, whereas random access benefits from larger block sizes. Furthermore, the block size has a significant impact on the performance of the storage devices, with smaller block sizes favoring NVM and larger block sizes favoring NVMe-supported devices.

Keywords

Non-Volatile Memory, Persistent Memory, Storage Class Memory, Non-Volatile Random Access Memory, Persistent Memory Programming, Modern Heterogeneous Storage Landscape, Write-Optimized Storage Engine (Haura)
1. Introduction

According to Kazemi [1], the volume of data in 2011 was around 1.8 zettabytes, and it doubles approximately every two years. At least, that was the rate before the onset of COVID-19, whose outbreak further fueled this growth as the use of digital services rose exponentially. This data deluge is unprecedented and has created new challenges for database management systems and file systems, which are used in a wide range of applications for data analysis and management.

Traditional database management systems and file systems are developed for the typical storage hierarchy, where memory is fast but volatile and limited, and secondary storage is persistent and vast but has high latency. In such systems, the data is logically split into two sets of copies: a working copy and a consistent copy. The working copy resides in main memory, whereas the persistent copy resides on one or more secondary storage devices. Making data persistent is an error-prone process, as problems like crashes and race conditions can corrupt data or leave it in an inconsistent state. Therefore, strategies like journaling and copy-on-write are used to ensure the consistency of data. Moreover, the scalability of DRAM resulted in main-memory database systems [2, 3, 4]. However, scaling DRAM further has become quite a challenging task [5], and because of its energy consumption, this solution is unaffordable for most businesses.

Persistent memory, on the other hand, is considered an alternative that addresses the above-mentioned issues. It is a new category in the storage hierarchy that is non-volatile and byte-addressable, provides DRAM-like latency, and offers much higher capacity than DRAM. This new storage class has opened opportunities not only for new system designs but also for enhancements to existing storage engines. For instance, prior work in traditional database systems uses NVM to improve traditional disk-based (centralized/decentralized) logging [6, 7, 8, 9], or employs NVM as a buffer between DRAM and secondary storage devices [10, 5, 11]. Moreover, several index data structures, like NV-Tree [12] and FPTree [13], exploit the properties of NVM. Nevertheless, there is presently no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous after the addition of NVM; in consequence, there is a research gap in this direction.

In order to investigate the benefits of a common storage engine that manages all the storage devices in the modern storage landscape, a prototype of a general-purpose storage engine, called Haura [14, 15], is used.
It runs in user space and supports object and key-value interfaces. The key contributions of our work are as follows:

• We used PMDK (Persistent Memory Development Kit) to supplement Haura with persistent memory. PMDK is an open-source (C/C++) kit that offers different libraries and utilities to interact with persistent memory. We used the most appropriate library after a brief evaluation process.
• We investigated the impact of the above-mentioned change on Haura and used persistent memory to store the B𝜖-tree nodes (Section 2.3).
• We identified the access patterns for different storage devices that supplement or affect the throughput of Haura.

The remainder of this paper is structured as follows. Section 2 provides background on non-volatile memory and the related programming techniques, and also briefly describes Haura. The implementation is discussed in Section 3, which is followed by the evaluation in Section 4. Next, Section 5 details the related work, and the paper concludes with a summary and open challenges in Section 6.

2. Background

In this section, we discuss non-volatile memory and different programming models to access it. We then describe Haura and briefly touch upon its key components.

2.1. Non-Volatile Memory

Persistent Memory (PMem), Storage Class Memory (SCM), Non-Volatile Memory (NVM), and Non-Volatile RAM (NVRAM) are names often used for this new class of storage. It sits between primary and secondary storage in the typical storage hierarchy, and it is considered a disruptive technology, as it has disrupted the traditional memory paradigm. It is non-volatile, has DRAM-like latency, and offers much higher capacity than DRAM. It is byte-addressable, and the fact that the CPU can access it directly via cache lines demands a different architecture than the one used in typical storage engines. Example technologies are Phase Change Memory [16], Spin Transfer Torque RAM (STT-RAM) [17], Carbon NanoTube RAM (NRAM, NanoRAM) [18], and Memristors [19].

Presently, only Intel® produces persistent memory modules, under the brand name Intel® Optane™ DC Persistent Memory. It offers different generations that vary in performance and capacity, and the modules are designed to be used with specific generations of Intel®'s processors, Intel® Xeon® Scalable Processors, for instance. They are available in DIMM form factor and are compatible with conventional DDR4 sockets. They co-exist with conventional DDR4 DRAM DIMMs and use the same memory channel. The internal granularity of the modules is 256 bytes, and they can be operated in three different modes: memory, app direct, and dual mode.

In the memory mode, NVRAM supplements DRAM: DRAM acts as an L4 cache, and NVRAM serves as the main (volatile) memory. The host memory controller integrated into the processor manages the movement of data between DRAM and NVRAM. In the app direct mode, on the other hand, NVRAM is a persistent memory module to which applications have direct access; it is still byte-addressable, and applications can use it as a storage device. Lastly, in the dual mode, part of the NVRAM can be allocated to applications (app direct), and the rest can be operated in memory mode.

2.2. Non-Volatile Memory Programming

Typical programming models categorize data structures into two broad categories: memory-resident and storage-resident data structures [20]. This is mainly due to the underlying system architecture, where main memory is attached directly to the memory bus while secondary storage, due to its high latency, communicates with the system via an I/O controller. The models operate on the data in main memory at byte granularity and ensure its persistency by explicitly writing it to secondary storage. A key challenge of such models is to ensure the consistency and integrity of data across storage classes with different characteristics. For example, the integrity of data in main memory can be ensured using mutexes, whereas the consistency and durability of data on secondary storage are ensured using strategies like journaling and write-ahead logging [21].

The above-mentioned programming paradigm cannot be followed when working with persistent memory, as it is attached to the memory bus and is non-volatile. Therefore, a new model is inevitable that simultaneously addresses all the atomicity and consistency issues, like concurrency and power failure [22, 20, 23]. Moreover, persistent memory is a comparatively new technology, and writing software with all the necessary considerations requires in-depth knowledge of the hardware and caches. Therefore, several APIs are available that handle the hardware-related intricacies internally. PMDK (Persistent Memory Development Kit, https://pmem.io/pmdk/) is one such example; it is based on the SNIA NVM programming model (https://snia.org/tech_activities/standards/curr_standards/npm).
SNIA (the Storage Networking Industry Association) proposes different programming models [22, 24, 20, 23] to program persistent memory. The simplest way is to use the module as a block device and access it using a standard file API. Another approach is via an optimized file system, such as ext4 and XFS on Linux and NTFS on Windows, that is adapted specifically for persistent memory. This approach, contrary to the previous one, allows small read and write operations and is more efficient. Last but not least, DAX, or Direct Access, is an approach in which the persistent memory is accessed as a memory-mapped file. However, contrary to memory-mapping files on secondary storage, the operating system does not maintain pages for persistent memory in main memory.
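To make the DAX approach concrete, the following minimal Rust sketch (using the libc crate) memory-maps a file on an fsdax-mounted file system and makes a store durable with msync; the mount point /mnt/pmem and the file name are hypothetical, and a PMem-optimized library such as libpmem would replace the plain msync with user-space cache-line flushes where possible.

```rust
// Minimal sketch of the DAX access model: map a file that lives on a
// PMem-aware file system and persist a store. On a DAX mount, the mapping
// exposes the persistent memory itself; no page-cache copy is kept in DRAM.
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let len = 4096;
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/mnt/pmem/example.dat")?; // hypothetical fsdax mount point
    file.set_len(len as u64)?;

    unsafe {
        let addr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            file.as_raw_fd(),
            0,
        );
        assert_ne!(addr, libc::MAP_FAILED);

        // Store through the mapping at byte granularity ...
        std::ptr::copy_nonoverlapping(b"hello".as_ptr(), addr as *mut u8, 5);

        // ... and flush it to the persistence domain before unmapping.
        libc::msync(addr, len, libc::MS_SYNC);
        libc::munmap(addr, len);
    }
    Ok(())
}
```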
2.3. Haura

Haura is a general-purpose and write-optimized tiered storage stack that runs in user space and supports object and key-value interfaces [14]. It has the ability to handle multiple datasets, provides data integrity, and supports advanced features like snapshots, data placement, and fail-over strategies. The core of the engine is the B𝜖-tree, which is the sole index data structure in the engine. The engine supports block storage devices, solid-state drives, for instance, and also has its own caching layer using DRAM (separate from the operating system). It also offers features like data striping, mirroring, and parity. It follows the ZFS [25] architecture and uses a similar layered approach [14], where layers or modules interact with each other using interfaces. A schematic diagram of Haura is presented in Figure 1.

Figure 1: A layered conceptual diagram illustrating the main components of Haura. The objects in grey represent the main modules, whereas the objects in yellow and sky blue are the helper modules and classes, respectively. The key classes in the modules are represented using white blocks.

Haura was initially built as a key-value storage engine in which arbitrary-sized keys and values can be stored; an extension to support objects was made later in [15], in which the ObjectStore module was added to the stack. The ObjectStore module exposes the necessary routines to interact with the engine and supports all the primitive operations like create, read, write, and query. It uses the same key-value interface to store the objects. However, a key challenge it addresses is the transformation of objects into key-value pairs: in the key-value version of Haura, although the keys and values can have a variable size, there is still an upper limit on their sizes. Objects, on the other hand, can contain data of several gigabytes; therefore, a mechanism is devised in which objects are split into chunks, and each chunk is assigned a unique fixed-size identifier. Moreover, since the object name is primarily the key of the object and can have a variable size spreading over a few kilobytes, an indirection is added where the object name and other metadata are stored separately. Furthermore, the ObjectStore module, via the Database module, maintains two different datasets to store the data and the metadata of the objects. The first dataset stores the chunks of the object, whereas the second dataset stores the indirection-related information and other metadata, like modification time and size, to name a few.
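To illustrate this chunking scheme, the following Rust sketch splits an object into fixed-size chunks and derives a fixed-size key per chunk from an object identifier and a chunk index; the chunk size and key layout are illustrative assumptions, not Haura's actual encoding.

```rust
// Hypothetical object-to-key-value chunking: each chunk gets a fixed-size
// (object id, chunk index) key, while the variable-length object name lives
// in a separate metadata dataset. This is an illustration, not Haura's
// actual on-device layout.
const CHUNK_SIZE: usize = 128 * 1024;

fn chunk_keys(object_id: u64, data: &[u8]) -> Vec<([u8; 12], &[u8])> {
    data.chunks(CHUNK_SIZE)
        .enumerate()
        .map(|(i, chunk)| {
            let mut key = [0u8; 12];
            key[..8].copy_from_slice(&object_id.to_be_bytes());
            key[8..].copy_from_slice(&(i as u32).to_be_bytes());
            (key, chunk) // fixed-size key, chunk payload as the value
        })
        .collect()
}

fn main() {
    let object = vec![0u8; 300 * 1024]; // a 300 KiB object yields 3 chunks
    let pairs = chunk_keys(42, &object);
    assert_eq!(pairs.len(), 3);
    // The metadata dataset would map the object's variable-size name to
    // object id 42 together with modification time, size, and so on.
}
```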
The Database module controls and manages all the activities regarding a database. A database in Haura consists of one or more datasets and their respective snapshots; datasets and snapshots are actually B𝜖-trees. Moreover, the module maintains a separate B𝜖-tree, named the root tree, to store all the information regarding the database. For example, it maintains the active datasets and their pointers into storage. It also maintains information regarding the usage of the storage devices in the form of bitmaps.

The Tree module contains the actual implementation of the B𝜖-tree, encapsulates all the tree-related operations, and exposes the methods to the upper layer.

The DataManagement module ensures the persistence of the underlying data and its retrieval when requested. It internally interacts with a wide range of modules, especially the helper modules, and plays a vital role in their internal functionalities. The cache module, for instance, is managed by this module. Write and update requests from the upper layer first land in this module and are then passed on to the cache module. Similarly, in the case of a cache miss, this module fetches the required data using the StoragePool module and passes the data to the cache module for later usage. Moreover, this module is also responsible for the compression and decompression of the data, for which it uses the compression module. Furthermore, the decision as to which blocks on the storage media are used to write the data is also taken in this module, and it uses the AllocationHandler module to allocate the blocks. Last but not least, it communicates with the StoragePool module to perform the write and read I/O operations.

The StoragePool module performs two key operations. First, it maintains queues for asynchronous I/O operations. Second, it dispatches the I/O calls to the respective virtual devices in the Vdev module. It also exposes methods for synchronous calls, for which it bypasses the queues. The interface of this layer matches that of the Vdev module; however, it requires an additional parameter to address the desired virtual device, as the Vdev module may contain more than one virtual device.

Lastly, the Vdev module provides different implementations for interacting with the storage devices, which are referred to as virtual devices in the system. Currently, Haura supports single, mirror, and parity implementations. The single implementation works on a single storage device and has two further sub-implementations, file and memory, for SSD/HDD and DRAM (as volatile storage), respectively. It is the simplest implementation provided by this module; it is not fault-tolerant, and the underlying data is lost in case of an error or failure. The other implementations, mirror and parity, are introduced to support mirroring and parity functionalities, respectively.

3. Implementation

In this section, we briefly discuss the implementation of the new virtual device for persistent memory and touch upon the important steps considered during this phase.

3.1. Programming Model Selection

PMDK provides several high- and low-level libraries to interact with persistent memory. Haura, on the other hand, is a well-developed engine and expects a virtual device to implement a certain interface. Therefore, in the initial phase, a list of properties is formulated as a criterion, and each new implementation of a virtual device must adhere to this list to work properly with the existing interfaces in Haura. The properties are:

• Haura stores the nodes of the B𝜖-tree using virtual devices; therefore, the virtual device should be able to perform read and write operations with varied block sizes, from a few kilobytes to megabytes.
• The virtual device should be able to perform both synchronous and asynchronous calls to the underlying storage device.
• Haura uses bitmaps to manage the allocation of blocks on the storage devices. It partitions the whole space into equal-size blocks and allocates and de-allocates the blocks internally; therefore, this bookkeeping is not required in the virtual device or any library used in it.
• Haura uses the copy-on-write technique to update the nodes. It first copies nodes to main memory, applies the changes to the nodes, and then writes the data back to the device in a new location. It never performs in-place updates.

Considering the above properties, libpmem and libpmemblk from PMDK's library suite suit the current architecture of Haura and can be used to implement the functionality.

Libpmemblk is a high-level library that provides functionality to manage an array of fixed-size blocks. The blocks can be updated and read using their indices in the array. It does not provide byte-level access to the blocks, and any update requires re-writing the whole block. Libpmem, in contrast, is one of the low-level libraries in the kit, and the other high-level libraries are built on top of it. It wraps the basic operations exposed by the operating system and adds optimizations for persistent memory.

The other libraries from PMDK's suite, like libpmemobj and libpmemkv, are not a suitable choice because, first, they do not provide the required interface to implement the virtual device, and second, they internally perform operations that are already addressed in Haura. For example, libpmemobj internally implements object store functionality on top of memory-mapped files, whereas the current architecture of Haura expects the virtual device to perform raw read and write operations at a specified location in the memory. Moreover, the management of key-value data at the library level, as in the case of libpmemkv, is the core functionality of Haura and is therefore redundant.

The final selection from the shortlisted libraries (a PMem-aware file API is also added to the list for comparison) is made using an experiment whose results are shown in Figure 2. First, it is evident from the graph that accessing persistent memory via a memory-aware file API is not feasible, as its latency to read and write data, especially with small buffers, is significantly high. A prominent drop can be observed in its latency as the buffer size increases, but it remains higher than the rest. The approaches that compete with each other are libpmem and libpmemblk. Libpmemblk, in some instances, performed better than libpmem, but its performance is worst with small buffers, and it also crashes when the block size exceeds 8 MB. Therefore, libpmem is finally selected, as it is (comparatively) consistent throughout the experiment.

Figure 2: An object (size 5 GB) is written and read sequentially multiple times with different block sizes using libpmem, libpmemblk, and a standard file API.
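To illustrate what the selected library provides, the following Rust sketch calls libpmem's write path directly; the function signatures are transcribed from libpmem's C API (in the actual implementation, they are generated by bindgen, see Section 3.2), the pool path is hypothetical, and error handling is elided.

```rust
// Raw libpmem write path: map a pool file, copy a buffer to an offset, and
// persist it in one call -- the "write at a specified location" primitive
// the new virtual device needs. Signatures follow libpmem's C API.
use std::os::raw::{c_char, c_int, c_void};

#[link(name = "pmem")]
extern "C" {
    fn pmem_map_file(path: *const c_char, len: usize, flags: c_int, mode: u32,
                     mapped_lenp: *mut usize, is_pmemp: *mut c_int) -> *mut c_void;
    fn pmem_memcpy_persist(dest: *mut c_void, src: *const c_void, len: usize) -> *mut c_void;
    fn pmem_unmap(addr: *mut c_void, len: usize) -> c_int;
}

const PMEM_FILE_CREATE: c_int = 1 << 0; // from libpmem's headers

fn main() {
    let path = std::ffi::CString::new("/mnt/pmem/haura.pool").unwrap(); // hypothetical pool file
    let (mut mapped_len, mut is_pmem) = (0usize, 0 as c_int);
    unsafe {
        let base = pmem_map_file(path.as_ptr(), 8 << 20, PMEM_FILE_CREATE, 0o666,
                                 &mut mapped_len, &mut is_pmem);
        assert!(!base.is_null());

        // Copy 4 KiB to offset 4096 and flush it to the media in one call.
        let buf = vec![0xABu8; 4096];
        let dest = (base as *mut u8).add(4096) as *mut c_void;
        pmem_memcpy_persist(dest, buf.as_ptr() as *const c_void, buf.len());

        pmem_unmap(base, mapped_len);
    }
}
```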
3.2. Rust Wrapper for libpmem

Haura is written in Rust, while PMDK's support for languages other than C/C++ is not fully tested. Therefore, the second step, after selecting the library from PMDK's suite, involved writing a Rust wrapper for the selected library (i.e., libpmem). In this regard, Rust provides a utility named bindgen that generates FFI bindings to C/C++ libraries (a foreign function interface, FFI, is a mechanism to invoke calls into a library written and compiled in a different language). The tool requires two files to generate the bindings. The first file is wrapper.h, which should contain all the header files and declarations that the target application intends to use. The other file is build.rs, which should contain the details regarding the generation of the bindings. This file is part of the folder structure followed in Rust: before compiling the code, the build system looks for this file in the root folder and executes it, so that the bindings are generated (and made available) before the build of the actual program.
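As an illustration of this setup, a minimal build.rs could look as follows; the wrapper.h contents and the allowlist pattern are assumptions, while the Builder calls follow bindgen's documented API. The generated bindings are then pulled into the crate with include!(concat!(env!("OUT_DIR"), "/bindings.rs")).

```rust
// build.rs -- sketch of generating FFI bindings for libpmem with bindgen.
// Assumes wrapper.h contains `#include <libpmem.h>` and that libpmem is
// installed where the linker can find it.
fn main() {
    println!("cargo:rerun-if-changed=wrapper.h"); // re-run if declarations change
    println!("cargo:rustc-link-lib=pmem");        // link against libpmem

    let bindings = bindgen::Builder::default()
        .header("wrapper.h")
        .allowlist_function("pmem_.*") // only the libpmem entry points
        .generate()
        .expect("unable to generate libpmem bindings");

    let out = std::path::PathBuf::from(std::env::var("OUT_DIR").unwrap());
    bindings
        .write_to_file(out.join("bindings.rs"))
        .expect("could not write bindings");
}
```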
Once the bindings are generated, the next step involved writing methods to perform read and write operations on persistent memory and exposing them for use by the respective virtual device in the Vdev module, which is described in the following section.
3.3. NVM as a Virtual Device

As discussed in Section 2.3, Haura interacts with storage devices using different virtual device implementations in the Vdev module; currently, there are four implementations available, namely file, memory, parity1, and mirror. Similarly, a new implementation of a virtual device for persistent memory is added to the Vdev module. Moreover, as mentioned in Section 3.1, the library from PMDK was chosen with careful consideration to avoid any architecture-related changes to Haura. Therefore, the new virtual device implementation exposes a similar interface as the other implementations, and it internally makes use of the wrapper library mentioned in the previous section to perform the storage-specific operations.

The integration of the new virtual device required alterations to a few traits and structs in different modules (a trait in Rust can be considered an equivalent to an interface in object-oriented languages like Java and C#). For example, the DataManagement module interacts with the StoragePool module using a trait called StoragePoolLayer, and StoragePoolUnit, a struct in the StoragePool module, implements StoragePoolLayer and maintains the information regarding the configured virtual devices in an array of type StorageTier. Furthermore, the type StorageTier is an array of Dev, and Dev is an enum that stores the instance of the associated virtual device. The enum Dev provides three different variants, Leaf, Mirror, and Parity1; Leaf further offers two variants, File and Memory. All these traits and structs are affected by the new implementation.

Furthermore, virtual devices are accessed using different traits that define distinct behaviors. The first trait is Vdev. It exposes functions to query different properties and states of the virtual devices; for example, it can be used to fetch the id and size of a device. The traits VdevWrite and VdevRead, on the other hand, are used, as their names suggest, to perform write and read operations on virtual devices, and they provide methods to perform the operations synchronously and asynchronously. These traits are implemented for the new virtual device. Last but not least, other changes have been made to make the virtual device visible to Haura through the configuration details.
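The shape of these traits can be sketched as follows; the signatures are simplified, synchronous stand-ins for Haura's actual definitions (which are asynchronous and generic), and PMemVdev is a hypothetical name for the new implementation, shown only to indicate how a persistent-memory device slots in alongside the file and memory variants.

```rust
// Simplified stand-ins for the Vdev/VdevRead/VdevWrite traits described
// above (the real definitions are asynchronous and generic).
use std::io;

trait Vdev {
    fn id(&self) -> &str; // query properties/state of the device
    fn size(&self) -> u64;
}

trait VdevRead {
    fn read(&self, offset: u64, buf: &mut [u8]) -> io::Result<()>;
}

trait VdevWrite {
    fn write(&self, offset: u64, buf: &[u8]) -> io::Result<()>;
}

// Hypothetical persistent-memory implementation: in practice it would hold
// the base pointer returned by pmem_map_file via the generated bindings.
struct PMemVdev {
    id: String,
    len: u64,
}

impl Vdev for PMemVdev {
    fn id(&self) -> &str { &self.id }
    fn size(&self) -> u64 { self.len }
}

impl VdevRead for PMemVdev {
    fn read(&self, _offset: u64, _buf: &mut [u8]) -> io::Result<()> {
        // Copy from the mapped region at the given offset.
        unimplemented!("ptr::copy_nonoverlapping from base + offset")
    }
}

impl VdevWrite for PMemVdev {
    fn write(&self, _offset: u64, _buf: &[u8]) -> io::Result<()> {
        // Persist the buffer at a new location (copy-on-write, never in
        // place), e.g. via pmem_memcpy_persist.
        unimplemented!("pmem_memcpy_persist to base + offset")
    }
}
```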
4. Performance Analysis

In this section, we analyze the impact of the newly added virtual device on Haura. We start by testing Haura with different workloads, where the baseline is the existing best-performing implementation of the virtual device. We then study the impact of different configurations, with different thread counts and cache sizes, for instance.

4.1. Experimental Setup

The experiments are performed on a dual-socket server with Intel® Xeon® Gold 5220R processors with a 2.20 GHz base frequency; each CPU contains 24 physical cores, and each core supports two threads. Each CPU socket contains two integrated memory controllers (iMCs) with three memory channels; each channel holds one DRAM DIMM, and each channel (except the last ones) is additionally connected to one PMem DIMM, resulting in 4 interleaved PMem DIMMs and 6 DRAM DIMMs per socket (with DIMM interleaving, the data is interleaved across the DIMMs according to the configured block size, i.e., 4 KB in the current settings). The PMem used is 128 GB Intel® Optane™ DC Persistent Memory Series 100 DIMMs, resulting in a total persistent memory capacity of 1 TB (128 GB x 4 DIMMs x 2 sockets), and the capacity of DRAM is 384 GB (32 GB x 6 DIMMs x 2 sockets). Moreover, the server contains two NUMA nodes, each with 48 logical cores, 4 PMem DIMMs, and 6 DRAM DIMMs. However, to avoid memory access overhead, the experiments are run on socket 0. Lastly, the machine runs Ubuntu 20.04.3 LTS (5.4.0-126-generic), and PMem is accessed in the app direct mode using fsdax (https://docs.pmem.io/ndctl-user-guide/managing-namespaces).

4.2. Sequential Workload

In this experiment, Haura is configured with three different storage devices: PMem, SSD NVMe, and SSD SATA. The experiment writes 5 objects, each of size 5 GB, and then reads them sequentially in the same order; however, the write requests are asynchronous, which allows multiple write requests to be dispatched, whereas the read requests are synchronous.

Figure 3: Three different executions representing the sequential I/O throughput of Haura when configured with different storage devices. The data is recorded at an interval of 500 ms.

The results in Figure 3 show that PMem performed better than the rest for the write I/O, and it lagged behind SSD NVMe for the read I/O. However, the resulting throughput for PMem in both cases is quite far from the expected values, because, as per the specifications, a single PMem DIMM (with four cache lines) can write and read up to 1,800 and 6,800 MB/s, respectively (https://www.intel.de/content/www/de/de/products/docs/memory-storage/optane-persistent-memory/optane-dc-persistent-memory-brief.html). For comparison, SSD NVMe can perform sequential read and write operations of up to 3,200 and 2,000 MB/s and random read and write operations of up to 540,000 and 55,000 IOPS, while SSD SATA can perform sequential read and write operations of up to 550 and 510 MB/s and random read and write operations of up to 86,000 and 30,000 IOPS, respectively.

The key reason behind the poor performance of the read I/O is the layout in which Haura stores the data. Haura starts by splitting the objects into chunks and transforming them into messages. The messages are then pushed into the root node, and they descend gradually to the target leaf node; during the descent, they are buffered in internal nodes and flushed down to a child node only when the buffer is full. Later, when the sync operation is performed, Haura follows a postorder [26] approach to persist the data, which does not guarantee the ordering of the chunks on the storage device. On the other hand, when Haura fetches an object, it starts fetching its chunks sequentially from the root node and keeps fetching the child nodes until it reaches the leaf node or finds the messages for the queried chunk. Therefore, this reading approach cannot benefit from sequential access, as the tree data is stored in the postorder layout. Furthermore, the other main reason, which applies to both scenarios, is the use of a single thread, which leaves the device underutilized. Last but not least, the write I/Os performed better because they were asynchronous calls, where the thread was capable of issuing multiple asynchronous I/Os, whereas the read I/Os were synchronous calls.

Moreover, another interesting pattern that surfaced during the detailed analysis is that the relative performance of the storage devices was not consistent all the time. As shown in Figure 4, the difference between PMem and SSD NVMe is significant for small block sizes; however, the difference shrinks considerably for large blocks.

Figure 4: The plots group the calls (write and read, respectively) from Figure 3 with respect to their payloads.
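The write path described in this section can be summarized with a small schematic; this is an illustration of the general B𝜖-tree buffering mechanism with arbitrary capacities and types, not Haura's actual node code.

```rust
// Schematic B^epsilon-tree write path: messages accumulate in an internal
// node's buffer and are flushed one level down only when the buffer fills.
// Capacities, key ranges, and types are arbitrary illustrations.
use std::collections::BTreeMap;

const BUFFER_CAPACITY: usize = 4;

enum Node {
    Internal {
        buffer: BTreeMap<u64, Vec<u8>>, // pending messages, keyed by target
        children: Vec<Node>,            // assumed non-empty
        pivots: Vec<u64>,               // children[i] covers keys < pivots[i]
    },
    Leaf {
        entries: BTreeMap<u64, Vec<u8>>,
    },
}

impl Node {
    fn insert(&mut self, key: u64, value: Vec<u8>) {
        match self {
            Node::Leaf { entries } => {
                entries.insert(key, value);
            }
            Node::Internal { buffer, children, pivots } => {
                buffer.insert(key, value);
                if buffer.len() >= BUFFER_CAPACITY {
                    // Flush the buffered messages towards the leaves; the
                    // real engine flushes only the fullest child's share and
                    // rewrites the affected nodes copy-on-write.
                    for (k, v) in std::mem::take(buffer) {
                        let idx = pivots
                            .iter()
                            .position(|p| k < *p)
                            .unwrap_or(children.len() - 1);
                        children[idx].insert(k, v);
                    }
                }
            }
        }
    }
}

fn main() {
    let mut root = Node::Internal {
        buffer: BTreeMap::new(),
        children: vec![
            Node::Leaf { entries: BTreeMap::new() },
            Node::Leaf { entries: BTreeMap::new() },
        ],
        pivots: vec![100],
    };
    for k in 0..10 {
        root.insert(k * 25, vec![0u8; 8]); // triggers buffer flushes
    }
}
```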
4.3. Random I/O and Worker Threads

This experiment evaluates the impact of the number of threads and the cache size on Haura when configured with different storage devices. It writes an archive file (https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.12.13.tar.xz, size 1011 MB) as an object to the engine. The file contains 80,690 entries, with the metadata stored in the first 9.3 MiB, which contains the central directory used to locate the individual files. Moreover, the scenarios with circles (Figure 5) store the metadata on the first device (i.e., SSD SATA) and the remaining content on the second mentioned device. Lastly, the script fetches 50,000 files randomly (using a Xoshiro256Plus generator, https://docs.rs/xoshiro/latest/xoshiro/struct.Xoshiro256Plus.html).

The results are presented in Figure 5, and it is evident from all sub-plots that the execution time improves with the increase in worker threads and cache size.

Figure 5: The impact of threads and different cache sizes on Haura when used with different storage configurations.

The scenarios that performed worst are the ones that used SSD SATA to store the contents and a faster device to store the metadata of the file.
However, only a minute difference can be observed: with PMem, the execution time is slightly worse for fewer than 6 threads; nevertheless, the difference diminishes, and with more threads (e.g., 30 threads), PMem performed better than SSD NVMe. On the other hand, a significant difference in performance can be seen when only PMem or SSD NVMe is used to store the whole file. As can be seen, SSD NVMe performs marginally better when fewer threads are used; however, as the thread count passes 9, PMem surpasses SSD NVMe, with a significant difference at the end.

4.4. Node-Size Significance

An intriguing behavior we came across while discussing the sequential workload is that the size of the payload influences the performance of Haura. Therefore, to further investigate this behavior, this experiment evaluates Haura (for PMem and SSD NVMe) with different internal and leaf node sizes, which thus far were set to 4 MB for both node types. The experiment first sets the size of the internal nodes, then repeats the experiment with different leaf node sizes, and writes and reads an object of size 128 MB sequentially.

Figure 6: Multiple end-to-end executions to analyze the impact of internal and leaf node sizes on the throughput of Haura.

The results are illustrated using a heatmap in Figure 6. First, it is evident that both storage devices share almost the same temperatures. Second, the engine performs worst for small block sizes, especially for internal nodes. However, the performance improves with increasing size, and the concentration of blue color indicates that the engine performs better under large block sizes.

One reason for the high temperatures is the height of the tree, which grows deep when the nodes, internal nodes in particular, are small. For instance, when the node size is 512 bytes, an object of size 128 MB would result in 261,376 nodes (the actual count is even higher, because each object chunk is assigned a key as well), and the internal nodes contain only a limited number of messages and pivots, whereas when the node size is 4 MB, the tree only needs 32 nodes to accommodate the object. Therefore, when the tree is deep, Haura spends considerable time flushing and merging the nodes. Moreover, another obvious reason is that, when the nodes are small, multiple requests are dispatched to the storage devices. However, further analysis is required to capture the time taken only by the virtual devices.

5. Related Work

In existing engines, NVM is mostly utilized to improve the caching and recovery of the engines. In [6, 7], the logging component uses NVM at different levels to improve the logging and recovery of the engine. Moreover, [27] discusses three different logging techniques and implements their equivalent NVM designs; the results show that in-place update is the most appropriate technique for NVM. Furthermore, SOFORT [28] and FOEDUS [29] are examples of main memory database systems that utilize NVM to improve the recovery of the system.
In some literature, NVM is also utilized as a buffer, and there are two main designs in this approach. The first is to use NVM to supplement DRAM, as already mentioned for [28, 29, 30], and the second is to use NVM as another layer between DRAM and secondary storage; in this regard, a technique called three-tier buffer management is suggested in [31].

Furthermore, data structures have also been optimized to exploit the full potential of NVM. FPTree [13] is an NVM-aware B+-tree that stores leaf nodes in NVM and internal nodes in DRAM, and it performs better than other NVM-optimized trees, NV-Tree [12] and wBTree [32], for instance. Moreover, FOEDUS [29] also uses a customized tree called Master-Tree.

Our work enables Haura to persist the tree nodes on NVM, which, along with an allocation strategy, can be used to improve recovery and caching. However, the migration of the persisted nodes is presently not possible. Haura can also store the internal nodes on NVM, as done in FPTree [13]. Lastly, depending on the size of the data, the engine can be used as an NVM-DRAM engine.

6. Conclusion

In this work, Haura, a general-purpose and write-optimized storage engine, is used to study the characteristics of the modern storage landscape, which has become more heterogeneous with the advent of PMem, a promising technology that shares the characteristics of primary and secondary storage and has disrupted the traditional memory paradigm. A few important findings our work uncovered are: first, persistent memory performs optimally when accessed using the largest possible blocks in random workloads. Second, the cache size and thread count impact Haura's throughput. Last but not least, the size of the nodes also determines the throughput of the engine, with the internal nodes having more influence. The insights gathered in this paper can be used to significantly improve Haura's performance and further exploit the characteristics of PMem. However, two aspects that still need to be investigated are the use of in-place updates for PMem and accessing PMem using devdax (https://docs.pmem.io/ndctl-user-guide/managing-namespaces), which produces better results than DAX in certain cases [33].

References

[1] U. Kazemi, A survey of big data: challenges and specifications, CiiT IJSETA (2018).
[2] F. Faerber, et al., Main memory database systems, Foundations and Trends® in Databases (2017).
[3] P.-Å. Larson, J. Levandoski, Modern main-memory database systems, VLDB Endowment (2016).
[4] J. DeBrabant, et al., Anti-caching: A new approach to database management system architecture, VLDB Endowment (2013).
[5] I. Oukid, et al., Storage class memory and databases: Opportunities and challenges, it-Information Technology (2017).
[6] J. N. Gray, Notes on data base operating systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 1978, pp. 393–481.
[7] T. Wang, et al., Scalable logging through emerging non-volatile memory, Proceedings of the VLDB Endowment 7 (2014).
[8] D. Lomet, R. Anderson, et al., How the Rdb/VMS data sharing system became fast, DEC Cambridge Research Lab Technical Report CRL 92 (1992).
[9] J. Huang, K. Schwan, M. K. Qureshi, NVRAM-aware logging in transaction systems, VLDB Endowment (2014).
[10] P. Götze, et al., Data management on non-volatile memory: a perspective, Datenbank-Spektrum (2018).
[11] A. Eisenman, et al., Reducing DRAM footprint with NVM in Facebook, EuroSys, 2018.
[12] J. Yang, et al., NV-Tree: Reducing consistency cost for NVM-based single level systems, USENIX FAST, 2015.
[13] I. Oukid, et al., FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for SCM, SIGMOD, 2016.
[14] F. Wiedemann, Modern storage stack with key-value store interface and snapshots based on CoW B𝜖-trees, 2018.
[15] T. Höppner, Design and implementation of an object store with tiered storage, 2021.
[16] A. Faraclas, et al., Modeling of set and reset operations of phase-change memory cells, IEEE Electron Device Letters (2011).
[17] A. K. Mishra, et al., Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs, ISCA, IEEE, 2011.
[18] B. Gervasi, Will carbon nanotube memory replace DRAM?, IEEE Micro (2019).
[19] D. B. Strukov, G. S. Snider, D. R. Stewart, R. S. Williams, The missing memristor found, Nature (2008).
[20] S. Scargall, Programming persistent memory: A comprehensive guide for developers, Springer Nature, 2020.
[21] S. Sippu, et al., Transaction processing: Management of the logical database and its underlying structure, Springer, 2015.
[22] A. Baldassin, et al., Persistent memory: A survey of programming support and implementations, ACM CSUR (2021).
[23] A. Rudoff, Persistent memory programming, ;login: The Usenix Magazine (2017).
[24] A. Rudoff, Programming models for emerging non-volatile memory technologies, USENIX & SAGE (2013).
[25] W. Fuzong, G. Helin, Z. Jian, Dynamic data compression algorithm selection for big data processing on local file system, in: Proceedings of the International Conference on Computer Science and Artificial Intelligence, 2018, pp. 110–114.
[26] G. Valiente, Algorithms on trees and graphs, volume 112, Springer, 2002.
[27] J. Arulraj, et al., Let's talk about storage & recovery methods for non-volatile memory database systems, SIGMOD, 2015.
[28] I. Oukid, et al., SOFORT: A hybrid SCM-DRAM storage engine for fast data recovery, DaMoN, 2014.
[29] H. Kimura, FOEDUS: OLTP engine for a thousand cores and NVRAM, SIGMOD, 2015.
[30] L. Lersch, et al., An analysis of LSM caching in NVRAM, DaMoN, 2017.
[31] A. van Renen, et al., Managing non-volatile memory in database systems, SIGMOD, 2018.
[32] S. Chen, et al., Rethinking database algorithms for phase change memory, CIDR, 2011.
[33] B. Daase, et al., Maximizing persistent memory bandwidth utilization for OLAP workloads, SIGMOD, 2021.