<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>strategies like journaling and copy</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Memory in Modern Heterogeneous Storage Landscape using a Write-optimized Storage Stack</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sajad Karim</string-name>
          <email>sajad.karim@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Wünsche</string-name>
          <email>johannes.wuensche@ovgu.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Broneske</string-name>
          <email>david.broneske@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Kuhn</string-name>
          <email>michael.kuhn@ovgu.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gunter Saake</string-name>
          <email>gunter.saake@ovgu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Non-Volatile Memory, Persistent Memory, Storage Class Memory, Non-Volatile Random Access Memory, Persistent Memory</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Databases and Software Engineering, Otto von Guericke University Magdeburg</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Parallel Computing and I/O, Otto von Guericke University Magdeburg</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Programming, Modern Heterogeneous Storage Landscape, Write-Optimized Storage Engine</institution>
          ,
          <addr-line>Haura</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Non-volatile memory (NVM), or persistent memory, is a promising and emerging storage technology that has not only disrupted the typical long-established memory hierarchy but also invalidated the proclaimed programming paradigm used in traditional database management systems and file systems. It bridges the gap between primary and secondary storage and, hence, shares the characteristics of both categories. However, currently, there exists no common storage engine built particularly to study the characteristics of the modern storage landscape, which has become more heterogeneous after NVM. Therefore, a general-purpose storage engine, Haura, is utilized to study the benefits of the modern storage landscape. In this work, NVM is integrated into the storage stack of Haura and studied the patterns for modern storage devices involved and their impact on the performance of Haura. Our work shows, NVM performs best under sequential workloads, but random access is better with larger block sizes. Furthermore, the block size has a significant impact on the performance of storage devices, with smaller block sizes favoring NVM and larger block sizes favoring NVMe-supported devices.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. Introduction
According to Kazemie [1], the volume of data in 2011 was
around 1.8 Zettabytes, and its volume doubles
approximately every two years. At least it was before the onset
of COVID-19 whose outbreak further fueled its growth
when the use of digital services rose exponentially. This
data deluge is unprecedented and has created new
challenges for database management systems and file systems
which are used in a wide range of applications for data
analysis and management.</p>
      <p>The traditional database management systems and file
systems are developed considering the typical storage
hierarchy where memory is fast but volatile and limited,
and secondary storage is persistent and vast but has high
latency. In such systems, the data is logically split into
two sets of copies; working and consistent copies. The
working copy resides in main memory, whereas the
persistent copy resides on one or more secondary storage
devices. Also, making data persistent is an error-prone
process as problems like crashes and race conditions can
nEvelop-O
(G. Saake)</p>
    </sec>
    <sec id="sec-2">
      <title>In prior work, NVM is used as a bufer between DRAM and secondary storage devices [10, 5, 11]. Moreover, several index data structures like NVTree [12] and FPTree</title>
    </sec>
    <sec id="sec-3">
      <title>Nevertheless, presently, there is no common storage en</title>
      <p>of the modern storage landscape, which has become more
heterogeneous after the addition of NVM, and in
consequence, there is a research gap in this direction.</p>
    </sec>
    <sec id="sec-4">
      <title>In order to investigate the benefits of a common storage engine that manages all the storage devices in the modern storage landscape, a prototype of a general</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-5">
      <title>It runs in user space and supports object and key-value</title>
      <p>interfaces. The key contributions of our work are as
follows:
• We used PMDK (Persistent Memory Development
Kit) to supplement Haura with persistent
memory. PMDK is an open-source (C/C++) kit that
ofers diferent libraries/utilities to interact with
persistent memory. We used the most
appropriate library after a brief evaluation process.
• We investigated the impact of the
abovementioned change on Haura and used persistent
memory to store the B -tree nodes (Section 2.3).
• We identified the access patterns for diferent
storage devices that supplement or afect the
throughput of Haura.</p>
    </sec>
    <sec id="sec-6">
      <title>They are available in DIMM form factor and compatible</title>
      <p>with conventional DDR4 sockets. They co-exist with
conventional DDR4 DRAM DIMMs and use the same
memory channel. The internal granularity of the modules
is 256 bytes and they can be operated in three diferent
modes; memory, app direct, and dual modes.</p>
      <p>In the memory mode, NVRAM supplements DRAM
where DRAM acts as an L4 cache and NVRAM as the
main (volatile) memory. The host memory controller
integrated into the processor manages the movement of
data between DRAM and NVRAM. On the other hand,
in the app direct mode, NVRAM is a persistent memory
module where the applications have direct access to the
device, it is still byte-addressable, and applications can
use it as a storage device. Lastly, in the dual mode, part
of the NVRAM can be allocated to applications, and the
rest can be utilized as non-volatile memory.
2.2. Non-Volatile Memory Programming</p>
    </sec>
    <sec id="sec-7">
      <title>The remainder of this paper is structured as follows.</title>
    </sec>
    <sec id="sec-8">
      <title>Section 2 provides background on non-volatile memory</title>
      <p>and the related programming techniques, and also briefly
describes Haura. The implementation phase is discussed
in Section 3 which is then followed by the evaluation in</p>
    </sec>
    <sec id="sec-9">
      <title>Section 4. Next, Section 5 details the related work and the</title>
      <p>paper concludes with a summary and open challenges in</p>
    </sec>
    <sec id="sec-10">
      <title>Section 6.</title>
      <p>The typical programming models categorize the data
structures into two broad categories; memory resident
and storage resident data structures [20]. It is mainly
due to the underlying system architecture where main
memory is attached directly to the memory bus and
secondary storage, due to its high latency, communicates
2. Background with the system via an I/O controller. The models operate
on the data in main memory at byte granularity and
enIn this section, we discuss non-volatile memory and dif- sure its persistency by explicitly writing it to secondary
ferent programming models to access it. We then describe storage. A key challenge of such models is to ensure
Haura and briefly touch upon its key components. the consistency and integrity of data across all storage
classes that share diferent characteristics. For example,
2.1. Non-Volatile Memory the integrity of data in main memory could be ensured
using mutexes whereas the consistency and durability of
Persistent Memory (PMem), Storage Class Memory data on secondary storage are ensured using strategies
(SCM), Non-Volatile Memory (NVM), and Non-Volatile like journaling and write-ahead logging [21].
RAM (NVRAM) are the names often used to address The above-mentioned programming paradigm cannot
this new class of storage. It sits between primary and be followed when working with persistent memory as
secondary storage in the typical storage hierarchy, and it is attached to the memory bus and is non-volatile.
it is also considered a disruptive technology as it has Therefore, a new model is inevitable that simultaneously
disrupted the traditional memory paradigm. It is non- addresses all the atomicity and consistency issues, like
volatile, has DRAM-like latency, and ofers much higher concurrency and power failure, for instance [22, 20, 23].
capacity than DRAM. It is byte-addressable, and its prop- Moreover, persistent memory is a comparatively new
erty to be directly accessible using the cache lines by the technology, and writing software with all the
considerCPU demands a diferent architecture than the one used ations requires an in-depth knowledge of the hardware
in typical storage engines. Some example technologies and cache. Therefore, several APIs are available that
hanare Phase Change Memory [16], Spin Transfer Torque dle the hardware-related intricacies internally. PMDK1
RAM (STT-RAM) [17], Carbon NanoTube RAM (NRAM, (Persistent Memory Development Kit) is one such
examNanoRAM) [18], and Memristors [19]. ple which is based on SNIA NVM programming model2.</p>
      <p>Presently, only Intel® produces persistent memory
modules under the brand named Intel® Optane™ DC</p>
    </sec>
    <sec id="sec-11">
      <title>Persistent Memory. It ofers diferent generations that</title>
      <p>vary in performance and capacity, and the modules are
designed to be used with specific generations of Intel ®’s</p>
    </sec>
    <sec id="sec-12">
      <title>1https://pmem.io/pmdk/</title>
      <p>processors, Intel® Xeon Scalable Processors, for instance. 2https://snia.org/tech_activities/standards/curr_standards/npm
Interface
... Superblock
...</p>
      <p>Cursor
Snapshot</p>
      <p>Dataset
...</p>
      <p>Node</p>
      <p>Messages
Interface
Leaf
Interface
Interface
Interface</p>
      <p>SNIA (Storage and Networking Industry Association)
proposes diferent programming models [ 22, 24, 20, 23]
to program persistent memory. The simplest way is to
use the module as a block device and access it using a
standard file API. Another approach is via an optimized
ifle system, ext and XFS in Linux and NTFS in Windows,
that is adapted specifically for persistent memory. This
approach, contrary to the previous one, allows small
readand-write operations and is more eficient. Last but not
least, DAX or Direct Access is another approach in that
the persistent memory is accessed as a memory-mapped
ifle. Nevertheless, contrary to memory mapping files
on secondary storage, the operating system does not
maintain pages for persistent memory in main memory.
2.3. Haura
Haura is a general-purpose and write-optimized tiered
storage stack that runs in user space and supports object
and key-value interfaces [14]. It has the ability to handle
multiple datasets, provides data integrity, and supports
advanced features like snapshots, data placement, and
fail-over strategies. The core of the engine is the B
tree, which is the sole index data structure in the engine.</p>
      <p>The engine supports block storage devices, solid-state
drives, for instance, and also has its own caching layer
using DRAM (separate from the operating system). It also
ofers features like data striping, mirroring, and parity. It
follows ZFS [25] architecture and uses a similar layered
approach [14] where layers or modules interact with each
other using interfaces. A schematic diagram of Haura is
presented in Figure 1.</p>
      <sec id="sec-12-1">
        <title>Haura was initially built as a key-value storage engine</title>
        <p>where arbitrary-sized keys and values can be stored and
an extension to support objects was made later in [15]
in that the ObjectStore module was added to the stack.
The ObjectStore module exposes the necessary routines
to interact with the engine and supports all the primitive
operations like create, read, write, and query. It uses the
same key-value interface to store the objects. However, a
key challenge it addresses is the transformation of objects
into key-value pairs. Since, in the key-value version
of Haura, although the keys and the values can have a
variable size, there was still an upper limit defined on
their sizes. Objects, on the other hand, can contain data
of several gigabytes; therefore, a mechanism is devised
in that objects are split into chunks, and each chunk is
assigned a unique identifier. Moreover, the object name
is primarily the key of the object, it can have a variable
size and can spread over a few kilobytes, therefore, an
indirection is added where the object name and other
metadata are stored separately, and a unique fixed-size
identifier is assigned to each chunk. Furthermore, the</p>
      </sec>
      <sec id="sec-12-2">
        <title>ObjectStore module, via the Database module, maintains two diferent datasets to store the data and the metadata</title>
        <p>...</p>
        <p>Cache
... Configuration Compression
Allocator
Checksum</p>
        <p>ObjectStore
Database
Tree
DataManagement
StoragePool
Vdev</p>
        <p>File API
...</p>
        <p>File</p>
        <p>Parity</p>
        <p>Mirror
Disk 1 Disk 2 Disk 3 Disk 4
Disk N
of the objects. The first dataset stores the chunks of the
object, whereas the second dataset stores the
indirectionrelated information and other metadata, like modification
time and size, to name a few.</p>
      </sec>
      <sec id="sec-12-3">
        <title>The Database module controls and manages all the</title>
        <p>activities regarding a database. A database in Haura
consists of one or more datasets and their respective
snapshots. Datasets and snapshots are actually B -trees.</p>
        <sec id="sec-12-3-1">
          <title>Moreover, it also maintains a separate B -tree, named</title>
          <p>the root tree, to store all the information regarding the
database. For example, it maintains active datasets and
their pointers in the storage. It also maintains
information regarding the usage of storage devices in the form
of bitmaps.</p>
        </sec>
      </sec>
      <sec id="sec-12-4">
        <title>The Tree module contains the actual implementation</title>
        <p>of the B -tree and encapsulates all the tree-related
operations and exposes the methods to the upper layer.</p>
      </sec>
      <sec id="sec-12-5">
        <title>The DataManagement module ensures the persistence</title>
        <p>of the underlying data and its retrieval when requested.</p>
      </sec>
      <sec id="sec-12-6">
        <title>However, it internally interacts with a wide range of</title>
        <p>modules, especially with the helper modules, and plays a
vital role in achieving their internal functionalities. The
cache module, for instance, is managed by this module.</p>
        <p>The write and update requests from the upper layer first
land into this module and then passed on to the cache
module. Similarly, in the case of a cache miss, this module
fetches the required data using the StoragePool module
and are passes the data to the cache module for later
usage. Moreover, this module is also responsible for the • The virtual device should be able to perform both
compression and decompression of the data, and it uses synchronous and asynchronous calls to the
unthe compression module for this purpose. Furthermore, derlying storage device.
the decision as to which blocks on storage media are to be • Haura uses bitmaps to manage the allocation of
used to write the data is also taken in this module, and it the blocks on the storage devices. It partitions the
uses the AllocationHandler module to allocate the blocks. whole space into equal-size blocks and allocates
Last but not least, it communicates with the StoragePool and de-allocates the blocks internally, therefore,
module to perform the write and read I/O operations. this bookkeeping is not required in the virtual</p>
        <sec id="sec-12-6-1">
          <title>The StoragePool module performs two key operations. device or any library used in it.</title>
          <p>First, it maintains queues for asynchronous I/O opera- • Haura uses the copy-on-write technique to
uptions. Second, it dispatches the I/O calls to the respective date the nodes. It first copies nodes to the main
virtual devices in the Vdev module. However, it also memory, applies changes to the nodes, and then
exposes the methods for synchronous calls where it by- writes the data back to the device in a new
locapasses the queues. The interface of this layer matches tion. It never performs in-place updates.
with the Vdev module, however, it requires an additional
parameter to communicate with the desired virtual de- Now considering the above properties, libpmem and
libpvice in the Vdev module as the module may contain more memblk from PMDK’s persistent library suite suits the
than one virtual device. current architecture of Haura and can be used to
imple</p>
        </sec>
        <sec id="sec-12-6-2">
          <title>Lastly, the Vdev module provides diferent implemen- ment the functionality.</title>
          <p>tations to interact with the storage devices, and they are Libpmemblk is a high-level library that provides
funcreferred to as virtual devices in the system. Currently, tionality to manage an array of fixed-size blocks. The
Haura supports single, mirror, and parity implementa- blocks can be updated and read using their indices in the
tions. The single version of the implementations works array. It does not provide byte-level access to the blocks,
on a single storage device, and it has two further sub- and any update requires the re-writing of the whole block.
implementations, file and memory, for SSD/HDD and Whereas, libpmem is one of the low-level libraries in the
DRAM (as volatile storage) respectively. It is the simplest kit, and the other high-level libraries are built on top of
implementation provided by this module, it is not fault- it. It wraps the basic operations exposed by an operating
tolerant, and the underlying data is lost in case of error system and adds optimizations for persistent memory.
or failure. The other implementations, as their names The other libraries from the PMDK’s persistent suite,
suggest, mirror and parity, are introduced to support like libpmemobj and libpmemkv, are not a suitable choice
mirroring and parity functionalities respectively. because; first, they do not provide the required interface
to implement the virtual device, and second, they
internally perform the operations that are already addressed in
3. Implementation Haura. For example, libpmemobj internally implements
In this section, we briefly discuss the implementation of the object store functionality on top of memorymapped
the new virtual device for persistent memory and touch ifles, whereas the current architecture of Haura expects
upon the important steps considered during this phase. the virtual device to perform raw read and write
operations on a specified location in the memory. Moreover,
the management of key-value data at the library level,
3.1. Programming Model Selection as in the case of libpmemkv, is the core functionality of
PMDK provides several high and low-level libraries to Haura and is therefore redundant.
interact with persistent memory. Haura, on the other The final selection from the shortlisted 3 libraries is
hand, is also a well-developed engine and expects a vir- made using an experiment, which results are in Figure 2.
tual device to implement a certain interface. Therefore, First, it is quite evident from the graph that the approach
in the initial phase, a list of properties is formulated to for accessing persistent memory via a memory-aware
set a criterion, and each new implementation of a virtual ifle API is not a feasible approach as its latency to read
device must adhere to the list to work properly with the and write the data, especially using small bufers, is
sigexisting interfaces in Haura. The properties are: nificantly high. Nevertheless, a prominent drop can be
observed in its latency with the increase in bufer size
• Haura stores the nodes of the B -tree using virtual but it is still higher than the rest. The approaches that
devices, therefore, the virtual device should be compete with each other are libpmem and libpmemblk.
able to perform read and write operations in var- Libpmemblk, in some instances, performed better than
ied block sizes, from a few kilobytes to megabytes. libpmem, but its performance is worst with small bufers,
3PMem-aware file API is also added to the list for comparison.</p>
          <p>libpmem mirror. Similarly, a new implementation of a virtual
de()se itre102 lFiiblepmAePmIblk vMicoerefoovrepre,rassisfuternthtemr emmeonrtiyoniseaddidneSdectotiothne3V.1d, ethvemliobdraurley.
iTm W from PMDK is chosen with careful consideration to avoid
100 any architecture-related changes to Haura. Therefore,
this new virtual device implementation exposes a similar
) 101 interface as available in the other implementations and
(s d it internally makes use of the wrapper library mentioned
ieTm eaR in the previous section to perform the storage-specific
operations.
10064B128B256B512B1KB2KB4KB8KB16KB32KB64KB128KB256KB512KB1MB2MB4MB8MB16MB32MB altTerhaetioinntseginraatifoenwotfratihtse5 naenwd svtrirutcutaslindedviifecreenrteqmuoirde-d</p>
        </sec>
      </sec>
      <sec id="sec-12-7">
        <title>Bufer Size ules. For example, the DataManagement module interacts with the StoragePool module using a trait called</title>
        <p>Figure 2: An object (size 5 GB) is written and read sequen- StoragePoolLayer, and StoragePoolUnit, which
tially multiple times with diferent block sizes using libpmem, is a struct in the StoragePool module, implements
libpmemblk, and a standard File API. StoragePoolLayer and maintains the information
regarding the configured virtual devices in an array of type
and also it crashes when the block size exceeds 8 MB. StorageTier. Furthermore, the type StorageTier is
Therefore, libpmem is finally selected as it is (compara- an array of Dev, and Dev is an enum that stores the
intively) consistent throughout the experiment. stance of the associated virtual device. The enum Dev
provides three diferent types of features, Leaf, Mirror,
3.2. Rust Wrapper for libpmem and Parity1. Leaf further ofers two diferent features,</p>
      </sec>
      <sec id="sec-12-8">
        <title>File and Memory. All these mentioned traits and structs</title>
      </sec>
      <sec id="sec-12-9">
        <title>Haura is written in Rust, and PMDK supports other lan- are afected by the new implementation.</title>
        <p>guages in that support for only C/C++ is fully tested, Furthermore, virtual devices are accessed using
difertherefore, the second step, after selecting the library from ent traits that define distinct behaviors. The first trait
PMDK’s suite, involved writing a wrapper for the selected is Vdev. It exposes functions to query diferent
proplibrary (i.e. libpmem) in Rust. In this regard, Rust pro- erties and states of the virtual devices. For example, it
vides a utility named bindgen that generates the FFI4 can be used to fetch the id and size of the device. On
bindings to C/C++ libraries. The tool requires two files to the other hand, the traits VdevWrite and VdevRead, as
generate the bindings. The first file is wrapper.h which their name suggests, are used to perform read and write
should contain all the header files and declarations that operations on virtual devices, and provide methods to
perthe target application intends to use. The other file is form the operations synchronously and asynchronously.
build.rs which should contain the details regarding the These traits are implemented for the new virtual device.
generation of the bindings. This file is part of the folder Last but not least, other changes have been added to make
structure followed in Rust and its compiler, before the the virtual device visible to Haura through configuration
compilation of the code, looks for this file in the root details.
folder and executes it so that the bindings can be
generated (and made available) before the execution of the
actual program. 4. Performance Analysis</p>
      </sec>
      <sec id="sec-12-10">
        <title>Once the bindings are generated, the next step involved</title>
        <p>writing the methods to perform read and write operations
on persistent memory and exposing them to be used in
its respective virtual device in the Vdev module which is
mentioned in the following section.</p>
        <p>In this section, we analyze the impact of the newly added
virtual device on Haura. We start by testing Haura for
diferent workloads, and the baseline is the existing
bestperforming implementation of the virtual device. We
then study the impact with diferent configurations, with
diferent thread counts and cache sizes, for instance.
3.3. NVM as a Virtual Device</p>
      </sec>
      <sec id="sec-12-11">
        <title>As discussed in Section 2.3, Haura interacts with storage devices using diferent virtual device implementations in the Vdev module, and currently, there are four implementations available namely, file, memory, parity1, and</title>
        <p>4.1. Experimental Setup</p>
      </sec>
      <sec id="sec-12-12">
        <title>The experiments are performed on a dual-socket server</title>
        <p>having Intel® Xeon® Gold 5220R with 2.20 GHz base
fre</p>
      </sec>
      <sec id="sec-12-13">
        <title>4Foreign function interface (FFI) is a method to invoke calls from a</title>
        <p>library written and compiled in a diferent language.</p>
      </sec>
      <sec id="sec-12-14">
        <title>5A trait, in Rust, can be considered as an equivalent to an interface</title>
        <p>in object-oriented languages like Java and C# [SK22d].</p>
        <sec id="sec-12-14-1">
          <title>PMem - Write The key reason behind the poor performance of the</title>
        </sec>
        <sec id="sec-12-14-2">
          <title>PMem - Read read I/O is the layout in which Haura stores the data.</title>
          <p>)B 1,000 SSSSSSDDD SNNAVVTMMAee- --WRWreirtaietde tHraanusrfaorsmtairntsg btyhesmpliitntitnogmthesesaogbejesc.tsThinetomcehsusangkess aanrde
I(/OM 500 SSD SATA - Read tahlleyntoputshheetdaringetot ltehaef rnooodten, oadned,
daunrdinthgetyhededsecsecnedntg,rtahdeuy0 are bufered in internal nodes and flushed down to the
-0:10 0:00 0:10 0:21 0:31 0:42 0:52 1:03 1:13 1:24 1:34 1:45 1:55 2:06 2:16 2:27 2:37 2:47 child node only when the bufer is full. Later, when the
sync operation is performed, Haura follows the postorder</p>
        </sec>
        <sec id="sec-12-14-3">
          <title>Runtime (min:s) [26] approach to persist the data that does not guarantee</title>
          <p>the ordering of the chunks on the storage device. On
Figure 3: Three diferent executions representing the sequen- the other hand, when Haura fetches an object, it starts
tial I/O throughput of Haura when configured with diferent fetching its chunks sequentially from the root node first
storage devices. The data is recorded at an interval of 500ms. and keeps fetching the child nodes until it reaches the
leaf node or finds the messages for the queried chunk.
quency and each CPU contains 24 physical cores and each Therefore, this reading approach cannot benefit from
secore supports two threads. Each CPU-socket contains quential access as the tree data is already stored in the
two integrated memory controllers (iMCs) with three postorder layout. Furthermore, the other main reason
memory channels, each channel (except the last ones) that applies to both scenarios is the use of a single thread
connected to one PMem and DRAM DIMM resulting in 4 that leaves the device underutilized. Last but not least,
interleaved6 PMem and 6 DRAM DIMMs per socket. The the reason the write I/Os performed better is because
PMem used is 128 GB Intel® ™ DC Persistent Memory they were asynchronous calls where the thread was
caSeries 100 DIMMs, resulting in a total persistent memory pable of issuing multiple asynchronous I/Os using the
capacity of 1 TB (128 GB x 4 DIMMs x 2 sockets), and the asynchronous programming technique, whereas the read
capacity of DRAM is 384 GB (32 GB x 6 DIMMs x 2 sock- I/Os were synchronous calls.
ets). Moreover, the server contains two NUMA nodes Moreover, another interesting pattern that surfaced
each with 48 logical cores, 4 PMem DIMMs, and 6 DRAM during the detailed analysis is, the relative performance
DIMMs. However, to avoid memory access overhead, the of the storage devices was not consistent all the time. As
experiments are run on socket 0. Lastly, the machine shown in Figure 4, the diference between PMem and
runs Ubuntu 20.04.3 LTS (5.4.0-126-generic), and PMem SSD NVMe is significant for small block sizes, however,
is accessed in the app direct Mode using fsdax 7. the diference shrinks considerably for large blocks.
4.2. Sequential Workload
In this experiment, Haura is configured for three diferent
storage devices; PMem, SSD NVMe, and SSD SATA. The
experiment writes 5 objects, each size 5 GB, and then
reads them sequentially in the same order, however, the
write requests are asynchronous, which allows multiple
write requests to be dispatched, whereas the read requests
are synchronous.</p>
        </sec>
      </sec>
      <sec id="sec-12-15">
        <title>The results in Figure 3 show that PMem performed</title>
        <p>better than the rest for the write I/O, and it lagged behind</p>
      </sec>
      <sec id="sec-12-16">
        <title>SSD NVMe for the read I/O. But, the resulted throughput</title>
        <p>for PMem in both cases is quite of from the expected
values because as per the specifications, a single PMem
DIMM (with four cache-lines) can write and read up to
1,800 and 6,800 MB/s8, respectively9.
6In DIMM interleaving, the data is interleaved as per the configured
block size (i.e., 4 KB in the current settings.) across the DIMMs.</p>
      </sec>
      <sec id="sec-12-17">
        <title>7https://docs.pmem.io/ndctl-user-guide/managing-namespaces</title>
      </sec>
      <sec id="sec-12-18">
        <title>8https://www.intel.de/content/www/de/de/products/</title>
        <p>docs/memory-storage/optane-persistent-memory/
optane-dc-persistent-memory-brief.html</p>
      </sec>
      <sec id="sec-12-19">
        <title>9SSD NVMe can perform sequential read and write operations of up</title>
        <p>4.3. Random I/O and Worker Threads
This experiment evaluates the impact of the number of
threads and cache size on Haura when configured with
diferent storage devices. It writes an archive file 10 (size
1011 MB) as an object to the engine. The file contains
80,690 entries with metadata stored in the first 9.3 MiB
that contain the central directory to locate the individual
ifles. Moreover, the scenarios with circles (Figure 5) store
the metadata on the first device (i.e. SSD SATA) and
the remaining content on the second mentioned device.</p>
        <sec id="sec-12-19-1">
          <title>Lastly, the script fetches 50,000 files randomly 11.</title>
        </sec>
      </sec>
      <sec id="sec-12-20">
        <title>The results are presented in Figure 5, and it is evident</title>
        <p>from all sub-plots that the execution time improves with
the increase in the worker threads and cache size.</p>
        <p>The scenarios that performed worst are the ones that
used SSD SATA to store the contents and a faster
deto 3,200 and 2,000 MB/s and random read and write operations of up
to 540,000 and 55,000 IOPS. SSD SATA can perform sequential read
and write operations of up to 550 and 510 MB/s and random read
and write operations of up to 86,000 and 30,000 IOPS, respectively.
10https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.12.13.tar.xz
11https://docs.rs/xoshiro/latest/xoshiro/struct.Xoshiro256Plus.html
d 102
a
e
R
)
sm 100
(
e
m
iT 10− 2
t 102
e
i
r
W
)s 100
m
(
e
im10− 2
T 4</p>
        <p>SSD SATA SSD SATA (Avg)
SSD NVMe SSD NVMe (Avg)</p>
        <p>PMem PMem (Avg)
840 1,21,3
0828
4.4. Node-Size Significance</p>
      </sec>
      <sec id="sec-12-21">
        <title>An intriguing behavior we came across while discussing</title>
        <p>the sequential workload is that the size of the payload
influences the performance of Haura. Therefore, to
further investigate the behavior, this experiment evaluates</p>
      </sec>
      <sec id="sec-12-22">
        <title>Haura (for PMem and SSD NVMe) with diferent internal</title>
        <p>and leaf node sizes, that is, thus far set to 4 MB for both
node types. The experiment first sets the size of the
interIn existing engines, NVM is mostly utilized to improve
the caching and recovery of the engines. In [6, 7], the
logging component uses NVM at diferent levels that
improve the logging and recovery of the engine.
Moreover, [27] discusses three diferent logging techniques
and implements their equivalent NVM designs. The
results show that in-place update is the most appropriate
technique for NVM. Furthermore, SOFORT [28] and
FOE12The actual count of the nodes for an object size 128 MB is higher
than 261376 because each object chunk is assigned a key as well.</p>
        <p>DUS [29] are examples of main memory database systems [3] P.-Å. Larson, J. Levandoski, Modern main-memory database
that utilize NVM to improve the recovery of the system. systems, VLDB Endowment (2016).</p>
        <p>In some literature, NVM is also utilized as a bufer and [4] J. DeBrabant, et al., Anti-caching: A new approach to database
management system architecture, VLDB Endowment (2013).
there are two main designs in this approach. The first [5] I. Oukid, et al., Storage class memory and databases:
Opportuis to use NVM to supplement DRAM which is already nities and challenges, it-Information Technology (2017).
mentioned in [28, 29, 30], and the second is to use NVM [6] J. N. Gray, Notes on data base operating systems, Springer
as another layer between DRAM and secondary storage Berlin Heidelberg, Berlin, Heidelberg, 1978, pp. 393–481.
and in this regard, a technique called three-tier bufer [7] T. Wang, et al., Scalable logging through emerging non-volatile
memory, Proceedings of the VLDB Endowment 7 (2014).
management is suggested in [31]. [8] D. Lomet, R. Anderson, et al., How the Rdb/VMS data sharing</p>
        <p>Furthermore, data structures are also optimized to ex- system became fast, DEC Cambridge Research Lab Technical
ploit the full potential of NVM. FPTree [13] is an NVM- Report CRL 92 (1992).
aware B+-tree that stores leaf nodes in NVM and inter- [9] J. Huang, K. Schwan, M. K. Qureshi, NVRAM-aware logging
nal nodes in DRAM, and it performs better than other in transaction systems, VLDB Endowment (2014).
[10] P. Götze, et al., Data management on non-volatile memory: a
NVM-optimized trees, NV-Tree [12] and wBTree [32], for perspective, Datenbank-Spektrum (2018).
instance. Moreover, FOEDUS [29] also uses a customized [11] A. Eisenman, et al., Reducing DRAM footprint with NVM in
tree called Master-Tree. Facebook, EuroSys, 2018.</p>
        <p>Our work enables Haura to persist the tree nodes on [12] J. Yang, et al., NV-Tree: Reducing Consistency Cost for
NVMNVM, which, along with an allocation strategy, can be based Single Level Systems, USENIX FAST, 2015.
[13] I. Oukid, et al., FPTree: A hybrid SCM-DRAM persistent and
used to improve recovery and caching. However, the concurrent B-tree for SCM, MOD, SIGMOD, 2016.
migration of the persisted nodes is presently not possible. [14] F. Wiedemann, Modern Storage Stack with Key-Value Store
Haura can also store the internal nodes on NVM as done Interface and Snapshots Based on CoW B -Trees, 2018.
in FPTree [13]. Lastly, depending on the size of the data, [15] T. Höppner, Design and Implementation of an Object Store
the engine can be used as NVM-DRAM engine. with Tiered Storage, 2021.
[16] A. Faraclas, et al., Modeling of set and reset operations of
phasechange memory cells, IEEE electron device letters (2011).
6. Conclusion [17] A. K. Mishra, et al., Architecting on-chip interconnects for
stacked 3D STT-RAM caches in CMPs, ISCA, IEEE, 2011.
[18] B. Gervasi, Will carbon nanotube memory replace DRAM?,
In this work, Haura, a general-purpose and write- IEEE Micro (2019).
optimized storage engine is used to study the character- [19] D. B. Strukov, G. S. Snider, D. R. Stewart, R. S. Williams, The
istics of the modern storage landscape that has become missing memristor found, nature (2008).
more heterogeneous with the advent of PMem which is [20] S. Scargall, Programming persistent memory: A
comprehensive guide for developers, Springer Nature, 2020.
a promising technology that shares the characteristics [21] S. Sippu, et al., Transaction processing: Management of the
of primary and secondary storage and has disrupted the logical database and its underlying structure, Springer, 2015.
traditional memory paradigm. A few important findings [22] A. Baldassin, et al., Persistent Memory: A Survey of
Programour work uncovered are; first, persistent memory per- ming Support and Implementations, ACM CSUR (2021).
forms optimally when accessed using the largest possible [23] A. Rudof, Persistent memory programming, Login: The
Usenix Magazine (2017).
blocks in random workloads. Second, the size of the [24] A. Rudof, Programming models for emerging non-volatile
cache and thread count impact Haura’s throughput. Last memory technologies, USENIX &amp; SAGE (2013).
but not least, the size of the nodes also determines the [25] W. Fuzong, G. Helin, Z. Jian, Dynamic data compression
throughput of the engine with the internal node having algorithm selection for big data processing on local file system,
more influence. The insights gathered in this paper can in: Proceedings of the International Conference on Computer
Science and Artificial Intelligence, 2018, pp. 110–114.
be used to significantly improve Haura’s performance [26] G. Valiente, Algorithms on trees and graphs, volume 112,
and further exploit the characteristics of PMem. How- Springer, 2002.
ever, two aspects that need to be investigated are the [27] J. Arulraj, et al., Let’s talk about storage &amp; recovery methods
use of in-place updates for PMem and accessing it using for non-volatile memory database systems, SIGMOD, 2015.
devdax13 that produces better results than DAX in certain [28] I. Oukid, et al., SOFORT: A hybrid SCM-DRAM storage engine
for fast data recovery, DaMoN, 2014.
cases [33]. [29] H. Kimura, FOEDUS: OLTP engine for a thousand cores and
NVRAM, ACM MOD, SIGMOD, 2015.</p>
        <p>References [30] L. Lersch, et al., An analysis of LSM caching in NVRAM,
DaMoN, 2017.
[1] U. Kazemi, A survey of big data: challenges and specifications, [31] A. van Renen, et al., Managing non-volatile memory in</p>
        <sec id="sec-12-22-1">
          <title>CiiT IJSETA (2018). database systems, MOD, 2018.</title>
          <p>[2] F. Faerber, et al., Main memory database systems, Foundations [32] S. Chen, et al., Rethinking database algorithms for phase
and Trends® in Databases (2017). change memory, Cidr, 2011.</p>
          <p>[33] B. Daase, et al., Maximizing persistent memory bandwidth
utilization for OLAP workloads, PODS SIGMOD, 2021.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>