<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards MRAM Byte-Addressable Persistent Memory in Edge Database Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luís Meruje Ferreira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fábio Coelho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Orlando Pereira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INESC TEC, Campus da Faculdade de Engenharia da Universidade do Porto</institution>
          ,
          <addr-line>Rua Dr. Roberto Frias, 4200-465 Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Minho, Campus de Gualtar, Rua da Universidade</institution>
          ,
          <addr-line>4710-057 Braga</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <fpage>98</fpage>
      <lpage>111</lpage>
      <abstract>
        <p>There is a growing demand for persistent data in IoT, edge, and similar resource-constrained devices. However, standard FLASH memory-based solutions present performance, energy, and reliability limitations in these applications. We propose MRAM persistent memory as an alternative to FLASH-based storage. Preliminary experimental results show that its performance, power consumption, and reliability in typical database workloads are competitive for resource-constrained devices. This opens up new opportunities, as well as challenges, for small-scale database systems. MRAM is tested for its raw performance and applicability to key-value and relational database systems on resource-constrained devices. Improvements of as much as three orders of magnitude in write performance for key-value systems were observed in comparison to an alternative NAND FLASH-based device.</p>
      </abstract>
      <kwd-group>
        <kwd>MRAM</kwd>
        <kwd>edge databases</kwd>
        <kwd>persistent memory</kwd>
        <kwd>microcontroller</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The following contributions are provided:
• Comparison of MRAM and FLASH storage characteristics - We compare the nominal endurance, energy expenditure, storage capacity, and monetary cost of each type of device as advertised by the corresponding vendors. Moreover, we determine and discuss the advantages and disadvantages of each type of device.
• Comparison of MRAM and FLASH storage performance - We experimentally compare the throughput of each type of device under varying I/O operation sizes, both in terms of their raw performance and their performance under relevant use cases for resource-constrained devices, namely key-value and relational database systems.</p>
      <p>To evaluate MRAM’s capabilities, a new prototype was developed, combining a state-of-the-art MCU with an MRAM memory device. Results show that MRAM is capable of providing full throughput at much smaller I/O operations when compared to FLASH storage, enabling it to provide 3 orders of magnitude better performance in key-value applications. For the case of relational databases, MRAM can forego FLASH-specific mechanisms such as wear-leveling, thus freeing resources that can be used instead by the DataBase Management System’s (DBMS) query engine.</p>
      <p>The rest of the paper is organized as follows: Section 2 provides the necessary background on MCUs and MPUs, FLASH storage, and MRAM persistent memory; Section 3 compares the characteristics of both storage solutions in terms of vendor-provided information; Section 4 details how different data management systems were adapted to work with MRAM memory; Section 5 provides the results of our practical evaluation; and Section 6 discusses the results. Finally, Section 7 draws conclusions and provides possible paths for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>MPUs vs MCUs Microprocessors (MPUs) are processing units including multiple processing cores (e.g., up to 5 cores in recent offerings [15]), with frequencies over 1GHz, usually coupled to a few GB of memory (under 10) and hundreds of GB of storage (e.g., SD cards). MPUs usually serve as intermediaries between cloud and end devices, i.e., IoT or edge gateways [13, 16, 11]. The Raspberry Pi is a popular MPU-based device that is recurrently used by researchers in IoT- and edge-related publications [17, 18]. MPUs, due to their greater amount of resources and support for the appropriate primitives, often support the use of operating systems such as Linux, so developing systems for these devices is very similar to doing so for a commodity server.</p>
      <p>Microcontrollers (MCUs), on the other hand, are limited to up to 1 MB of memory, a few MB of internal FLASH storage, and single-core processing units with frequencies under 0.5GHz. The capabilities of these devices can be extended by adding external memory and storage. They are mainly used as end devices in IoT systems, such as sensors or actuators, and are frequently powered by batteries, so power consumption is of major concern. Furthermore, they do not support full-fledged operating systems, so programming MCUs is a lower-level experience.</p>
      <p>MCUs constitute the first layer of edge databases, often being the data generators of these systems [16, 9, 10, 13]. Historically, this data was mostly offloaded to MPUs or more capable cloud nodes; however, with the increase in the number of MCU devices, concerns about network overload started to arise, as the increasing number of parallel connections to centralized processing servers imposes a large load on network resources [19]. Furthermore, publications have shown that for an MCU, data transmission can consume more energy than local storage and processing [6]. When coupled with the fact that the capabilities of MCUs continue to increase and that, compared to MPU systems, MCUs are more affordable and consume less energy [20], it makes sense to push as many processing and storage tasks as possible towards MCUs.</p>
      <p>FLASH storage FLASH storage is the most common type of storage solution used in both MPU and MCU devices, with the two main types of FLASH used being NOR FLASH and NAND FLASH.
• NOR FLASH. NOR FLASH memory has fast read speeds, but less storage capacity and a higher cost per byte than NAND. As such, NOR FLASH memory is often used to store application code in MCUs, since code tends to be small and fast read speeds mean faster execution times (although developers can use unused NOR space however they want). Overall, NOR FLASH memory tends to be used in predominantly read-intensive workloads.
• NAND FLASH. NAND FLASH memory, on the other hand, because it is cheaper, provides better write performance, and offers greater storage capacity, is the most widely used of the two technologies. It is the underlying technology of most digital storage media, such as SD cards and SSDs.</p>
      <sec id="sec-2-1">
        <p>FLASH storage devices (both NAND and NOR) are organized into blocks. Each block contains a series of pages (e.g., 128 pages), where each page can store, for example, 8KB of data [21].</p>
        <p>Erase operations are the only way in flash memory to convert data bits to 1. Furthermore, the smallest unit that can be erased is a block, affecting multiple pages. Generally, NOR FLASH tends to have much slower erase speeds than NAND FLASH.</p>
        <p>Program (i.e., write) operations are done at the page level, and data bits can only be changed from 1 to 0. This means that a data section can only be written once after each erase operation. Similarly, read operations are also performed at the page level.</p>
        <p>The fact that the lowest unit of control in FLASH storage is a page hinders its overall performance. For example, if an operation affects only a part of a page, the entire page must still be read or written, and the unwanted data will be ignored. In cases where the data to be read or written fits within a single page, but the data happens to be unaligned such that it is split between two pages, both pages must be read or written. Erase operations being performed at the block level restricts write operations. For example, for a write operation to be performed over a page which is not erased, and the data in the remaining pages to be kept, all pages in the block must be erased and rewritten. This is why erase operations are often delayed until multiple pages have been marked for deletion.</p>
        <p>Furthermore, FLASH storage devices support a relatively low number of erase-write cycles per storage block, after which a block can no longer be modified. Therefore, systems must adopt wear-leveling mechanisms, where write operations are carefully spread so that no block is subjected to substantially more write operations than the others. Finally, FLASH storage devices provide asymmetrical performance: accessing random addresses is slower than accessing sequential positions, and write operations are slower than read operations.</p>
      </sec>
      <sec id="sec-2-2">
        <title>MRAM persistent memory</title>
        <p>Magnetoresistive Random Access Memory (MRAM) is a type of persistent memory where data is truly byte-addressable. For the case of the devices showcased here, data is organized into 1- or 2-byte cells, with the possibility for each byte to be read or written independently. Furthermore, data bits can be freely converted between 0 and 1 by write operations, forgoing the need for data to be erased.</p>
        <p>Compared to FLASH storage, MRAM provides better read/write performance, more write cycles per cell, as well as symmetric performance for sequential and random accesses. Furthermore, due to being byte-addressable, reads and writes can reach their maximum throughput even with very small operations, whereas FLASH storage only achieves maximum throughput for operations involving multiple kilobytes of data.</p>
        <p>MRAM presents similar characteristics to 3D XPoint [22], the byte-addressable persistent storage technology on which Intel Optane is based. Contrary to 3D XPoint, however, MRAM chips are available for use with MPUs and MCUs, whereas Intel Optane is only available for more capable computers. Despite MRAM’s technology being available since the 1980s [23], it was only recently that significant advances in performance and chip density have made MRAM attractive for data management applications. It is important to understand how these new MRAM chips compare to current FLASH storage technologies, in order to understand the viability of MRAM as either a replacement, or complement, to the standard FLASH technologies currently in use.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3. MRAM vs FLASH</title>
        <p>To understand the viability of MRAM as an alternative to FLASH storage, four MRAM devices with increasing capacity and read/write performance are compared with NAND FLASH and NOR FLASH devices. The MRAM devices chosen were: (M1) AS3004316 [24], (M2) MR4A16BMA35 [25], (M3) EMxxLx [26] and (M4) EMD4E001GAS2 [27]. The corresponding M1-M4 notations are used for each of these devices to ease referencing during the rest of this Section. As for FLASH storage, MT29F128G08AJAAAWP-ITZ:A [21] was selected to represent NAND FLASH and MT28EW512ABA1HPC-0SIT [28] was chosen to represent NOR FLASH.</p>
        <p>Furthermore, the following characteristics for each device are analyzed:
• Read/Write/Erase throughput - the throughput of a device for read, write, and erase operations. The metric considered was Megabytes per second (MB/s). Note that, as per the discussion in Section 2, erase operations do not apply to MRAM devices.
• Capacity - the amount of data a given device is able to store, in megabits.
• Endurance - the number of writes or erases that a particular data cell can endure before the vendor no longer guarantees correct functioning of the data cell.
• Energy - the amount of energy required to perform a write operation. The metric considered was nanojoules per byte written.
• Cost - the monetary cost of a given device per amount of storage capacity. The metric considered was euros per megabit of storage capacity.</p>
        <p>The values of the characteristics analyzed for each device are presented in Table 1. Values were calculated based on information made available by each device’s datasheet. For performance throughput, the values presented correspond to the maximum nominal values. The energy consumption figures are based on either peak consumption or typical consumption values, depending on the information made available by vendors.</p>
        <p>Performance All considered MRAM devices outperform both FLASH devices in write performance, between 2.42× and 1040×. As for read performance, both FLASH devices are outperformed by the M3 and M4 MRAM devices by a factor of between 1.18× and 11×. Furthermore, NAND FLASH is 479× faster than NOR FLASH when erasing a block of data. Since MRAM can override data without first deleting it, its operations are not affected by erase performance, which also greatly simplifies the management of data being stored on MRAM, when compared to FLASH storage.</p>
        <p>Endurance MRAM supports at least 100000× more operations per cell than FLASH memories, and some devices claim an unlimited number of operations during the lifetime of the chip. As such, there is no need for employing wear-leveling mechanisms, meaning less operational overhead. This also translates into a longer life for the device, making it a better choice for scenarios with high data churn.</p>
        <p>Energy In the case of MRAM, the energy required to write a single byte has an inverse correlation with its throughput performance. All MRAM devices show a lower energy consumption when writing data compared to FLASH devices, requiring 4×-13× less energy compared to NAND FLASH, and 42×-127× less energy compared to NOR FLASH.</p>
        <p>The two major drawbacks of MRAM are capacity and cost.</p>
        <p>Capacity The most capable MRAM device, M4, has a storage capacity of 1000 Megabits, which is 2× the capacity of the NOR FLASH device, but 128× less than the capacity of the NAND FLASH device. Recent advances have achieved multi-Gb capacity in single MRAM chips [29]; however, we have not considered these devices for analysis, as they are not yet widely available for commercial use, with vendors marketing those devices only for space-grade applications. Although this is still significantly less than the hundreds of gigabits that a NAND chip can support, it may be enough for current edge and IoT persistent storage requirements.</p>
        <p>Cost MRAM has a higher cost per MB than NAND and NOR FLASH. The M4 MRAM device (the least expensive per byte) is 4.26× more expensive than the representative NOR FLASH device and 81× more expensive than the NAND FLASH device. However, there is a logarithmic relationship between the capacity of the MRAM chip and its price per megabit, i.e., as the density increases, the price decreases significantly. If the MRAM chip density continues to increase and this relationship is maintained, we can expect the gap between the cost of MRAM and FLASH memory chips to decrease.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4. Data Systems on MRAM</title>
        <p>Three systems were either implemented or adapted to run over MRAM to understand how MRAM memory can impact each of the two use cases previously identified for data storage in resource-constrained devices: key-value stores and relational database systems. Since MRAM works similarly to common volatile Random-Access Memory (RAM), two structures commonly used for in-memory key-value storage were selected: a Linear Probing Hash Table (LPHT) and a Cache-Line Hash Table (CLHT) [30]. Since MRAM is persistent, such data structures can easily be adapted to provide the equivalent of a key-value store. For comparison, RocksDB, a well-established persistent key-value store, was selected as a baseline. Since RocksDB is a more complex system than the selected hash tables, a more capable computation unit was assigned to it, to offset the increased computational overhead (see Section 5).</p>
        <p>For the case of relational databases, we needed a system that could easily be adapted to run on either an MPU or MCU without changing its core functionality, in order to provide a fair comparison. With that objective, SQLite was selected, since portability across different operating systems is guaranteed by its separate OS layer, which allows for custom implementations. Each of these systems interacts with MRAM through a custom driver which supports write and read operations in multiples of 1, 2, 4 or 8 bytes. Below, we detail how each system was adapted to run over MRAM.</p>
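        <p>The driver itself is not listed in the paper; as an illustrative sketch only, its interface might look as follows, with a RAM array standing in for the memory-mapped MRAM region and the names mram_write/mram_read being our own stand-ins. The point to note is that, because MRAM is byte-addressable and overwritable, the driver reduces to splitting a request into 1/2/4/8-byte accesses, with no page buffering or erase step.</p>

```c
#include <stdint.h>
#include <string.h>

/* RAM array standing in for the memory-mapped MRAM window (4 Mb = 512 KB).
 * On the real board this would be a fixed external-bus address range. */
#define MRAM_CAPACITY (512u * 1024u)
static uint8_t mram[MRAM_CAPACITY];

/* Write `len` bytes at `addr`, split into 8/2/1-byte accesses.
 * No erase is needed: MRAM cells can be overwritten freely. */
int mram_write(uint32_t addr, const void *src, uint32_t len) {
    if (addr > MRAM_CAPACITY || len > MRAM_CAPACITY - addr) return -1;
    const uint8_t *p = (const uint8_t *)src;
    while (len >= 8) { memcpy(&mram[addr], p, 8); addr += 8; p += 8; len -= 8; }
    while (len >= 2) { memcpy(&mram[addr], p, 2); addr += 2; p += 2; len -= 2; }
    if (len) mram[addr] = *p;
    return 0;
}

/* Read `len` bytes from `addr` into `dst`. */
int mram_read(uint32_t addr, void *dst, uint32_t len) {
    if (addr > MRAM_CAPACITY || len > MRAM_CAPACITY - addr) return -1;
    memcpy(dst, &mram[addr], len);
    return 0;
}
```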
      </sec>
      <sec id="sec-2-4">
        <p>Update operations, however, would need further mechanisms to ensure crash consistency. As it is unclear from vendor datasheets whether such elementary operations guarantee atomicity, this issue deserves further study.</p>
      </sec>
      <sec id="sec-2-5">
        <p>Cache-Line Hash Table The CLHT [30] is a dynamic hash table that increases its size as more pairs are added. The table consists of a series of buckets, where each bucket contains a set of key-value pairs, a lock, and a pointer to the next bucket. As such, keys are hashed into positions of the hash table, where each position is composed of a linked list of buckets. CLHT supports insert, read, and remove operations. CLHT’s main advantage is the fact that each bucket is sized to fit into a cache line, thus greatly accelerating consecutive operations in the same bucket, a common occurrence both when inserting and when fetching key-value pairs.</p>
        <p>To run a CLHT on MRAM, a series of modifications were applied to the original implementation [31], more specifically to the lock-based version. First, locking was disabled, as the prototype developed only has a single core (see Section 5 for setup details). Although a lock-free version is also provided, that version of CLHT uses snapshotting mechanisms to allow concurrent operation, which incurs computational overhead that is undesirable on an MCU. Secondly, all read and write operations of the hash table on the underlying storage device are redirected through the MRAM driver. Third, a simple custom heap memory area was implemented on MRAM, since the original implementation relied on malloc for space allocation, which caused memory fragmentation when enforcing alignment constraints. By using our own heap implementation, no memory space is wasted. Our heap implementation currently only supports allocating more space; we leave implementing deallocation and defragmentation operations to future work. Finally, the size of the bucket and key-value pair was adjusted to fit the cache line size of the MCU selected to interface with the MRAM device. Each key or value occupies 4 bytes, and a bucket is set to a size of 32 bytes, holding 3 key-value pairs and additional metadata. The rest of the codebase remained unchanged.</p>
        <p>Linear Probing Hash Table The LPHT was implemented from scratch, supporting Insert, Read and Update operations. It separates MRAM’s space into two sections: one for metadata, which keeps track of the occupation state of each key-value slot, and a second for data, which stores the actual key-value pairs. The size of these pairs must be set before the hash table is used, and all key-value pairs share the same size.</p>
        <p>Information on occupied slots is stored in an array of bits, where each bit keeps the occupation state of a key-value pair slot. If the bit is set to one, the slot is occupied; otherwise, it is free.</p>
        <p>• Insert Operation - Insert operations are performed through the put(key,value) command. When the put() command is called, the key is hashed into one of the key-value slots. If the slot is occupied, a try is made for the slot that follows immediately after, and so on, until an empty slot is found. When an empty slot is found, the key-value pair is written into that slot, and then the bit indicating that the slot is occupied is set to 1. If no slot is found, the hash table is full and the insert operation fails.
• Update Operation - Update operations are also performed when the put() command is called. If, during an insert operation, the key is found already stored in the hash table, the corresponding value is replaced with the new one, i.e., update operations replace the old value with a new one.
• Read Operation - Read operations are performed through a get(key) command. Similarly to an insert operation, a read is performed by hashing the key to a slot, and traversing the corresponding and successive slots until either the key is found in an occupied slot, in which case the value is returned; or until an empty slot is found, or all slots are traversed, returning a null value in that case.</p>
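        <p>The LPHT operations described above can be sketched as follows. This is an illustrative reimplementation, not the authors’ code: key and value sizes are fixed to 4 bytes for brevity, RAM arrays stand in for the MRAM metadata and data sections, and the hash function (FNV-1a) is an arbitrary choice. Note the insert ordering: the pair is written first and its occupation bit is set only afterwards, so a crash between the two steps leaves the slot logically free.</p>

```c
#include <stdint.h>
#include <string.h>

#define SLOTS   1024u   /* capacity is fixed before the table is used */
#define KV_SIZE 4u      /* all keys and all values share the same size */

static uint8_t meta[SLOTS / 8];            /* occupation bitmap (metadata section) */
static uint8_t data[SLOTS][2 * KV_SIZE];   /* key-value pairs (data section) */

static int occupied(uint32_t s) { return (meta[s / 8] >> (s % 8)) & 1; }

static uint32_t hash_key(const uint8_t *key) {  /* FNV-1a, an arbitrary choice */
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < KV_SIZE; i++) { h ^= key[i]; h *= 16777619u; }
    return h % SLOTS;
}

/* put(key,value): insert, or update if the key is already stored.
 * Returns 0 on success, -1 if the table is full. */
int lpht_put(const uint8_t *key, const uint8_t *val) {
    uint32_t s = hash_key(key);
    for (uint32_t probed = 0; probed < SLOTS; probed++, s = (s + 1) % SLOTS) {
        if (occupied(s)) {
            if (memcmp(data[s], key, KV_SIZE) == 0) {   /* update: replace value */
                memcpy(data[s] + KV_SIZE, val, KV_SIZE);
                return 0;
            }
            continue;                                    /* linear probing */
        }
        memcpy(data[s], key, KV_SIZE);                   /* 1. write the pair   */
        memcpy(data[s] + KV_SIZE, val, KV_SIZE);
        meta[s / 8] |= (uint8_t)(1u << (s % 8));         /* 2. then set the bit */
        return 0;
    }
    return -1;
}

/* get(key): returns a pointer to the value, or NULL if the key is absent. */
const uint8_t *lpht_get(const uint8_t *key) {
    uint32_t s = hash_key(key);
    for (uint32_t probed = 0; probed < SLOTS; probed++, s = (s + 1) % SLOTS) {
        if (!occupied(s)) return NULL;   /* empty slot ends the probe chain */
        if (memcmp(data[s], key, KV_SIZE) == 0) return data[s] + KV_SIZE;
    }
    return NULL;
}
```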
      </sec>
      <sec id="sec-2-6">
        <p>Although not implemented, removing a key-value pair is as simple as flipping the occupation bit corresponding to the affected pair to 0.</p>
        <p>Each operation in the MRAM memory is split into 16-bit or 8-bit operations, which are performed one at a time over the memory. Assuming that these operations are atomic, insert, remove (if implemented), and read operations are crash-consistent, meaning that in case of failure, the hash table would guarantee a consistent state.</p>
        <p>SQLite SQLite is a highly portable embedded relational database. However, it is more commonly used on MPUs, since previous MCUs were not able to run this database system [7]. Even so, with advances in MCU capabilities, and by augmenting an MCU with MRAM, we were able to successfully run SQLite on an STM32 (a popular line of MCUs). To do so, a custom OS portability layer is required [32]. The OS layer establishes how SQLite interacts with the underlying file system and OS calls.</p>
      </sec>
      <sec id="sec-2-7">
        <p>It includes functions for retrieving random values and the current time, as well as functions for opening, reading, writing, and closing files.</p>
        <p>To build the custom OS layer, three components were required: the OS layer implementation itself; LittleFS [33], a file system for MCUs; and the MRAM driver. The MRAM driver performs low-level read and write operations on the MRAM. LittleFS, in turn, provides a lightweight file system that requires only a handful of functions to be implemented, such as writing and reading data to the storage medium. In this case, this functionality is provided to LittleFS through the MRAM driver. Finally, the custom OS layer makes use of LittleFS to implement file operations, while OS functions such as random number generation are implemented using functions provided by native STM32 libraries.</p>
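        <p>How such a file system plugs into the MRAM driver can be illustrated with the following sketch. The function shapes are modeled loosely on LittleFS’s block-device hooks (read, prog, erase), but the simplified signatures and the RAM-backed buffer below are our stand-ins, not the real lfs_config callbacks; they show only that, on byte-addressable MRAM, each hook reduces to a plain copy and erase becomes a no-op.</p>

```c
#include <stdint.h>
#include <string.h>

/* RAM buffer standing in for the MRAM region, presented to the file
 * system as BLOCK_COUNT blocks of BLOCK_SIZE bytes each. */
#define BLOCK_SIZE  4096u
#define BLOCK_COUNT 128u
static uint8_t mram_fs[BLOCK_SIZE * BLOCK_COUNT];

/* Read `size` bytes at offset `off` inside block `block`. */
int bd_read(uint32_t block, uint32_t off, void *buf, uint32_t size) {
    if (block >= BLOCK_COUNT || off + size > BLOCK_SIZE) return -1;
    memcpy(buf, &mram_fs[block * BLOCK_SIZE + off], size);
    return 0;
}

/* Program (write) `size` bytes. On FLASH this would require the block to
 * have been erased first; on MRAM it is a plain overwrite. */
int bd_prog(uint32_t block, uint32_t off, const void *buf, uint32_t size) {
    if (block >= BLOCK_COUNT || off + size > BLOCK_SIZE) return -1;
    memcpy(&mram_fs[block * BLOCK_SIZE + off], buf, size);
    return 0;
}

/* Erase a block. MRAM needs no erasing, so the hook is a no-op, kept
 * only because block-oriented file-system APIs expect it. */
int bd_erase(uint32_t block) {
    return block < BLOCK_COUNT ? 0 : -1;
}
```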
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <p>We perform a series of experiments to assess the viability of MRAM as a suitable alternative, or complement, to current FLASH-based storage. First, the raw performance of the considered devices is evaluated, and then their performance is compared under key-value and relational database scenarios. For this purpose, a custom circuit board was designed and produced to interface the STM32 with the MRAM device.</p>
      <sec id="sec-3-1">
        <p>For the experimental setup, two devices were used: an STM32 MCU with MRAM memory, and an MPU, more specifically a Raspberry Pi 3B, with an SD card as its storage medium (i.e., NAND FLASH storage). The main characteristics of each are described in Table 2.</p>
        <p>The STM32H743ZI microcontroller (MCU) [34] is a single-core, 32-bit, 480MHz processing unit that comes with 2MB of NOR FLASH memory and 1MB of RAM memory. The MCU connects to an AS3004316 MRAM memory [24], with 4Mb of storage capacity and 35ns access time for both read and write operations of either 8 or 16 bits. This MCU has 16 Kilobytes of L1 cache for instructions, and 16 Kilobytes of L1 cache for data. By default, both caches are disabled. For the tests depicted here, the instruction cache is always enabled; however, the data cache is set depending on the test being run. Whenever the data cache is used, it is set as write-through, so that any write to the cache is immediately persisted to MRAM memory.</p>
        <p>The Raspberry Pi 3B is driven by a 64-bit BCM2837 microprocessor (MPU), boasting 4 cores at 1.2GHz. It has 1GB of RAM memory, and uses a SanDisk Extreme SD Card with 32GB of storage capacity.</p>
        <p>Notice that the MRAM uses between 10×-100× less energy than the SD Card (estimations based on [36, 37] and on [38, 39, 37]), and that the Raspberry Pi has considerably more computational power and memory resources than the STM32H743ZI MCU.</p>
        <p>For easy reference, the names STM32 (as well as MRAM) and RPi (or one of NAND FLASH or SD Card setup) are used throughout this section to describe the MCU and the MPU based setups, respectively.</p>
        <p>It is possible to interface both NAND and NOR FLASH, as well as MRAM, with both MPUs and MCUs. However, this specific setup was selected as it was the option with the greatest potential for success, given that a custom circuit board had to be designed and produced.</p>
        <p>5.1. Raw performance evaluation The read and write throughput capabilities of the storage mediums in each device are evaluated, both in sequential and random access scenarios. Furthermore, the relation between I/O block size and throughput performance is evaluated.</p>
        <p>Testing methodology For MRAM, a random string with length equal to the desired operation size was generated and written to the device, either to sequential or random addresses. As for reads, blocks of data of the desired size were read, from random or sequential addresses. The addresses were selected before the test was run. In the case of random addresses, duplicates are allowed, so a particular location may be overwritten multiple times. As for sequential addresses, if the maximum address is reached, operations wrap around to the initial address. All tests run until 500MB are read or written. The STM32’s L1 data cache is disabled for this test. In the case of the SD Card, fio, an open-source I/O tester [40], was used. Each test runs for 20 seconds, with a ramp-up time of 2 seconds. We chose the following settings for fio: the engine chosen was libaio; iodepth is set to 20; the direct option is set to 1; and there is only 1 job running at a time. Results were averaged over 5 independent runs. The direct option only allows operation sizes equal to or greater than the page size of the device, so operation sizes for the SD Card start at 512 bytes.</p>
        <p>[Figure 1: sequential and random read/write throughput of MRAM and of the RPi’s NAND FLASH, as a function of operation size in bytes (log scale up to 2^26).]</p>
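        <p>The sequential-address wrap-around used in the methodology can be sketched as a small helper; the constant and function name are our own, assuming the 4 Mb (512 KB) AS3004316 device:</p>

```c
#include <stdint.h>

#define MRAM_CAPACITY (512u * 1024u)  /* 4 Mb = 512 KB */

/* Next sequential test address: advance by the operation size, and wrap
 * around to the start once the next operation would run past the end of
 * the device, as in the methodology above. */
uint32_t next_seq_addr(uint32_t addr, uint32_t op_size) {
    addr += op_size;
    if (addr + op_size > MRAM_CAPACITY)
        addr = 0;
    return addr;
}
```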
        <p>Figure 1 shows the performance of both MRAM and the RPi’s NAND FLASH under read/write sequential/random workloads with varying request sizes. MRAM is able to achieve its maximum throughput with I/O blocks as small as 4 bytes for writes, and 512 bytes for reads, due to its byte-addressability, and it maintains that level of throughput as the block size increases. Maximum speeds of 34MB/s for writes and 29MB/s for reads were observed. Both random and sequential read/write patterns presented identical performance (notice the overlapping lines in Figure 1). The SD card achieved 22MB/s for reads, and 26MB/s for writes, at block sizes of multiple kilobytes. Furthermore, random accesses present lower performance than sequential accesses. In conclusion, the MRAM device is able to provide higher throughput than the SD Card storage on the Raspberry Pi for all block sizes, especially at I/O operation sizes under 4KB. We confirm that, in the case of MRAM, random or sequential accesses have no impact on performance, with the results for both types of accesses being almost exactly the same. However, we notice that there is a difference in performance between write and read operations, with write operations outperforming read operations. We leave determining the cause for this discrepancy to future work.</p>
        <p>In the case of the SD card, we note that the write speed is also higher than the read speed, which is uncommon for FLASH storage. This is, however, in line with the results found in previous tests of SD card performance with Raspberry Pis [41].</p>
        <sec id="sec-3-1-1">
          <title>5.2. Impact on key-value systems</title>
          <p>Key-value stores are one of the identified use cases for data management systems in IoT- and edge-related settings, where resource-constrained devices are used to store and process data, so the impact of using MRAM on key-value systems is evaluated. In this experiment, I/O operations of varying sizes are executed over different key-value systems. The objective of the experiment is to evaluate how the previously identified advantage in raw performance affects these systems. Both an LPHT and a CLHT, a hash table previously adapted to work with Intel Optane [42], are implemented on the STM32 over MRAM. Since data stored in MRAM is persistent, both hash tables provide a similar service to a persistent key-value store, although with less functionality. We compare them with RocksDB, a popular persistent key-value store, running on the RPi. We run RocksDB both with and without fsync, a configuration which, when turned on, guarantees persistence for each write operation. Single- and multi-threaded execution is also considered for the case of RocksDB. We acknowledge that RocksDB is a more complex system than a simple hash table, but the RPi’s MPU gives it a significant computational advantage over the hash tables running on the STM32. We also include results without fsync, giving RocksDB the advantage of not having to persist its write-ahead log (WAL) on every single write operation.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>Testing methodology For the key-value scenarios, a series of string arrays was generated separately, in order to ensure that the operations submitted to each of the evaluated systems are identical, and that the data generation process does not affect performance estimation. Datasets composed of arrays of randomly generated 2-, 4-, 8-, 16-, 32-, 64-, 128-, and 256-byte strings were built. String deduplication was not performed, making it possible to have multiple put operations for the same key. The size of each dataset is equal to roughly 50% of the storage capacity of the MRAM memory (i.e., 2Mb). For the 2-, 4-, 8-, 16-, 32-, 64-, 128-, and 256-byte datasets, 65536, 32768, 16384, 8192, 4096, 2048, 1024, and 512 entries were generated, respectively. For each byte size, 5 different arrays were generated.</p>
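        <p>The entry counts follow directly from the 50% sizing rule: each entry is later stored as a key-value pair holding the string twice (key and value), so for a string size s the dataset contains (2 Mb in bytes) / (2 × s) entries. A quick check of this arithmetic (the helper name is ours):</p>

```c
#include <stdint.h>

/* Entries per dataset: half of the 4 Mb (512 KB) MRAM capacity, divided
 * by the footprint of one pair (the string is used as both key and
 * value, so a pair occupies 2 * string_size bytes). */
uint32_t dataset_entries(uint32_t string_size) {
    const uint32_t budget = (512u * 1024u) / 2u;  /* 2 Mb = 256 KB */
    return budget / (2u * string_size);
}
```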
        <p>In the case of RocksDB versus LPHT (Section 5.2.1)
the write experiment progresses as follows. One of the 5
datasets with 2 byte strings is selected. For each string A
in the dataset, an operation of the type put(A,A) is
performed, using the same string for both the key and value
fields. The performance of each system is then averaged
over the 5 different datasets for the same byte size. The
same procedure is followed for the remaining byte sizes,
and a similar procedure is followed for the read
workload, but with get(A) instead of put(A,A) operations. For
the case of RocksDB, different combinations of fsync (on
or off) and number of client threads are tested, as they
have a significant impact on system performance.</p>
        <p>[Figure 2 legend: MRAM Hashmap, RocksDB-nofsync-1thread,
RocksDB-fsync-1thread, RocksDB-nofsync-6threads,
RocksDB-fsync-6threads (write); MRAM Hashmap, RocksDB-1thread,
RocksDB-6threads (read); x-axis: key/value size (bytes), 2^1 to 2^8.]</p>
        <p>When
multiple client threads are used, the elements of each
dataset are split as equally as possible amongst them.
Since RocksDB with fsync turned on performs
significantly more slowly, tests targeting this setup are limited to
5000 put operations per data set.</p>
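<p>"Split as equally as possible" can be made precise: each of the T client threads receives &#x230A;N/T&#x230B; elements, and the remainder is handed out one element at a time. A small helper sketch (our own, for illustration):</p>

```c
/* Number of dataset elements assigned to 0-based thread t when n
 * elements are split as evenly as possible across nthreads. */
static unsigned share(unsigned n, unsigned nthreads, unsigned t) {
    return n / nthreads + (t < n % nthreads ? 1u : 0u);
}
```

<p>For the 6-thread RocksDB runs over the 4-byte dataset (32768 entries), two threads receive 5462 elements and four receive 5461.</p>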
        <p>In the case of RocksDB versus CLHT (Section 5.2.2),
the key and value field sizes are fixed to 4 bytes each, so
only the 4 byte data sets are used, since the size of buckets
must align with the size of a cache line. Furthermore,
each run is fixed to 20000 put operations, due to the added
space occupied by CLHT’s additional structures. Similar
to LPHT, CLHT is initialized with space to fit 2 times the
amount of data that is inserted in each test.</p>
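<p>The cache-line constraint mentioned above is the reason the key and value sizes are pinned: CLHT packs a bucket's lock word, keys, and values into a single cache line so that probing a bucket touches exactly one line. A hedged sketch of such a bucket for a 32-byte line (the STM32H7's L1 line size; the original x86 CLHT targets 64-byte lines with 8-byte fields, so the field counts here are illustrative):</p>

```c
#include <stdint.h>

#define CACHE_LINE 32u

/* One bucket filling exactly one 32-byte L1 cache line: a lock word
 * (unused in the single-threaded MCU port), three 4-byte keys, three
 * 4-byte values, and an overflow link. */
typedef struct {
    uint32_t lock;
    uint32_t key[3];
    uint32_t val[3];
    uint32_t next;   /* offset of an overflow bucket, 0 if none */
} __attribute__((aligned(CACHE_LINE))) clht_bucket;

_Static_assert(sizeof(clht_bucket) == CACHE_LINE,
               "bucket must fill one cache line exactly");
```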
        <p>For both LPHT and CLHT, at the end of each run, a
consistency check is performed, where each of the stored
values is retrieved from the Hash Table, and checked for
correctness. We highlight that for the specific case of
LPHT, we observed up to 0.002% of pairs missing
from the table when checking for consistency in the scope of
a run. We consider this to be due to a problem with our
circuit board design for the MRAM memory chip, since
slowing down the speed of the memory eliminates these
errors.</p>
        <p>Data L1 cache was enabled for all tests involving
key-value systems, with the cache policy set to write-through.
Since the experiments resulted in
disparities of multiple orders of magnitude, a logarithmic scale
is used for the y-axis of all diagrams, which represent
operations per second.
5.2.1. LPHT vs RocksDB
Figure 2 compares the number of operations per second
that RocksDB and the LPHT are able to perform, when
the size of a single key and corresponding value increases.
The size represented on the horizontal axis, in bytes,
corresponds to the size of a single key, or a single value.
This experiment’s conclusion is that the MRAM setup
outperforms the NAND FLASH alternative in almost all
scenarios. For write operations (left side of Figure 2),
MRAM outperforms all RocksDB setups. However, as
the key/value size increases, the difference between the
STM32 setup and the RocksDB setups where fsync is
turned off shrinks. At a key/value size of 4 bytes, MRAM
is able to perform 35× more operations per second than
RocksDB with 1 thread and no fsync. But, when the size
is increased to 256 bytes, the ratio between the two is
only 1.4× (still in favor of MRAM).</p>
        <p>The LPHT running on MRAM memory guarantees
persistence on each write operation, so the RocksDB
setups that more closely resemble it are the ones where
fsync is enforced. When guaranteeing persistence at each
operation, the multithreaded RocksDB setup is vastly
outperformed by the STM32’s Hash Table, with LPHT
performing between 134× and 3837× more operations
per second.</p>
        <p>For the case of read operations (right side of Figure 2),
LPHT is able to outperform the multithreaded RocksDB
for key/value sizes under 32 bytes, performing between
1.64× and 6.69× more operations per second. For the
case of the single threaded RocksDB, the Hash Table is
able to outperform RocksDB for key/value sizes under 128
bytes, with up to 20× more read operations per second.</p>
        <p>We conclude that the raw performance advantage of
MRAM over NAND Flash translates into a significant
advantage in key-value systems, especially for smaller
key-value sizes. For this use case, trading computational
power for storage performance is the correct approach,
indicating that the main bottleneck of these systems is
indeed the FLASH storage device.</p>
        <p>It is interesting to note that RocksDB's performance for
keys/values with 2 bytes is significantly better than for
the remaining sizes. The most likely reason for this is the
big number of duplicate values present in the dataset used
for this particular experiment, which allows RocksDB to
easily keep all values in its block cache. We also note that
for key/value sizes above the previously stated, RocksDB
is able to perform more read operations per second than
LPHT.</p>
        <p>In both cases, the performance of LPHT declines at
a faster rate than RocksDB as key/value size increases.
This can be due to the fact that as keys get bigger, the
computational effort to compute their hash value increases.
Since the STM32 has less computational power than the
RPi, this effect will be more noticeable.
5.2.2. CLHT vs RocksDB
Figure 3 shows how CLHT compares to different
configurations of RocksDB while inserting key/value pairs
with 4 bytes each. When compared to RocksDB running
without the fsync option and a single thread (the best
scenario for the no fsync configuration), CLHT is able
to perform 11× more write operations per second.
Compared to RocksDB's best scenario with fsync being
enforced, where RocksDB uses 6 threads, which is also the
scenario that most closely resembles the persistence that
MRAM provides with each write, CLHT is able to
perform 1827× more write operations per second. In terms
of reads, MRAM outperforms RocksDB with 6 threads by
9×.</p>
        <p>[Figure 3: Comparison between CLHT running on MRAM and
RocksDB running on NAND FLASH in RPi. Key/value size: 4 bytes;
series: CLHT on MRAM, RocksDB-nofsync-1 thread,
RocksDB-nofsync-6 threads, RocksDB-fsync-1 thread,
RocksDB-fsync-6 threads.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.3. Impact on relational database system</title>
        <p>Finally, on a more complex scenario, SQLite's
performance is compared when running on the STM32 over
MRAM, and on the RPi. This allows for a comparison
of the same exact system across the two platforms. The
results of running SQLite on the STM32 with a custom
OS layer are compared against SQLite running on the RPi
with NAND FLASH (using the default UNIX OS layer).</p>
      </sec>
      <sec id="sec-3-4">
        <title>Testing methodology</title>
        <p>For SQLite, a schema consisting
of a single table representing a sensor is used. The
table consists of four columns of the integer type:
timestamp, device_id, zone, and pressure. Each insert operation
inserts a new record which increments the timestamp
of the previously inserted record, and generates random
values for the remaining columns. Each run reads or
writes a total of 5000 rows, but the number of inserted
values or selected rows per transaction varies. For
example, in the first test 5000 transactions are executed, with
a single read or write operation being performed in each
transaction. For the last test, however, only 50
transactions are performed, with 100 values being selected or
inserted in each transaction. Results depict an average
of five independent runs, and all SQLite files are deleted
between runs. SQL queries are generated prior to the
test, so that throughput estimation is not affected by the
time spent generating those queries. STM32's L1 data
cache is enabled for all SQLite experiments, enforcing a
write-through policy.</p>
        <p>[Figure 4: Comparison between SQLite running on STM32's
MRAM and RPi's NAND FLASH. Series: STM32-MRAM (insert),
STM32-MRAM (read), RPi-NAND (insert), RPi-NAND (read);
x-axis: rows/transaction (0–100); y-axis: rows/s, log scale.]</p>
        <p>Figure 4 shows the number of rows either inserted
or read per second, in relation to the number of rows
affected by a single transaction. Unlike in the case of
key-value stores, it is not enough to perform a direct swap of
the storage medium from NAND Flash to MRAM for the
STM32 to outperform the MPU-based device (RPi) in a relational
database scenario. That is because relational databases
impose a greater computational overhead, thus giving the
advantage to the more capable RPi. Even so, the greater
performance of MRAM for small write operations
enables the STM32 to achieve a performance that is close to
that of the RPi for insert transactions affecting very few
rows. For the experiment where each write transaction
performs only two insert operations, the RPi outperforms
the STM32 by only 1.02×. As the number of insert
statements per transaction increases, SQLite performs bigger
I/O operations, which decreases the performance gap of
the two storage media, and allows the RPi to perform
up to 1.48× more insert operations per second than the
STM32. In the case of select operations, the RPi is able
to read around 2.21× more rows per second than the
STM32 across all types of transactions. We conclude
that for relational databases MRAM can assist an MCU
to achieve a performance similar to an MPU's for basic
operations while consuming less energy and having fewer
computational resources. However, a direct substitution
of the storage media is not sufficient for the MCU to
outperform the MPU. One possible way to further improve
performance for SQLite in the MCU would be to shed
the additional computational overhead that is imposed
by the FLASH oriented mechanisms, such as the
wear-leveling mechanism in LittleFS, and the Write-Ahead Log
in SQLite.</p>
        <p>6. Discussion
The first conclusion to draw from this work is that MRAM
provides a big advantage in small I/O operations. MRAM
adoption can be particularly interesting for key-value
applications, such as edge Time-Series Databases (TSDBs)
and Key-Value Stores, which often handle small
key/value pairs [14]. Furthermore, MRAM can provide
strong consistency guarantees, since all write operations
are immediately persisted. As depicted in Figure 2, the
impact of using fsync (i.e., persisting every write) with
FLASH memory is significant. Thus, critical applications
in sensor networks, i.e., smart health care or industrial
IoT, might benefit from considering this technology.</p>
        <p>MRAM imposes less computational overhead on
systems, as it does not require the wear-leveling, batching, or
sequential ordering mechanisms which are often used by
FLASH based systems. This opens the way to lowering
systems' complexity when using MRAM.
This can be of special importance for relational database
systems in resource constrained devices. In such
computationally limited devices, MRAM allows forgoing FLASH
focused mechanisms, freeing computational capacity that
can instead be used by the DBMS's query engine. This
can help support the ongoing effort to enable more
features in MCU relational databases, since current options
have to severely limit the number of supported features
in order to fit available resources [7, 6]. Furthermore,
these MCUs provide additional resources such as Direct
Memory Access (DMA) controllers, which enable data to
be moved between storage devices without CPU
intervention, and dedicated hashing controllers, which calculate
hash values likewise without CPU intervention; both can be
explored to further increase database performance while
putting less load on the CPU. In the case of key-value
systems, MRAM could enable more functionality to be
shifted from MPU devices to MCU devices, while still
improving MCU battery lifetime. For example, in the
case of wearable sensors, data has to be uploaded in its
entirety to a more capable MPU to calculate statistics on
the gathered data (e.g., [11, 43]) due to lack of CPU power,
which coincidentally increases the amount of data
transmitted, increasing the rate at which the sensors' battery
is drained. By consuming a lower amount of
computational capacity, MRAM can allow the MCU to make those
calculations locally, thus only transmitting the already
processed data. This data will be smaller, and allow the
MCU to conserve more energy by moving the load on the
MCU towards computation in place of data transmission.
MRAM will be specifically appealing in scenarios where
key-values are small (i.e., small write operations), the
most common occurrence in key-value systems [14], and
where said data needs to be persisted. This may be a
requirement for critical systems such as those involving
medical scenarios or public services management (e.g.,
smart grid applications).</p>
        <p>Both solutions share the same price bracket; however,
in our approach CPU is traded for memory performance.
We believe this to be the correct choice for the case of
edge databases, since storage is the primary bottleneck.
However, we must take into account that MRAM has a
significantly lower storage capacity per chip (up to 8Gb
per chip [44]) and a greater price per space unit. In total,
the STM32 used in this work could directly support up
to 512Mb of MRAM memory. As such, the main
contribution given by MRAM to edge systems, at the moment,
is not in storage capacity, which is the case for FLASH,
but rather in performance, energy expenditure, and
endurance. As such, a hybrid approach could provide the
best of both worlds (i.e., MRAM and FLASH). MRAM
could be combined with more conventional FLASH
storage, e.g., an SD Card, to achieve both better performance
and durability, while still ensuring a large amount of
storage space. With the perspective of decreasing prices
(see Section 3), MRAM-only storage may also be a
possibility in the future. MRAM memory may also pave the
way to instant recoverability if used as an alternative
to non-persistent program memory. Energy-wise, the
considered MRAM setup has a power profile 10× smaller
when compared with the NAND FLASH, which provides
a positive impact for edge applications.</p>
        <p>We hypothesize two use cases for MRAM use, to
better clarify how this technology can benefit edge data
management systems.</p>
        <p>Relational database use case - Picture a scenario
where each sensor runs its own relational database over
FLASH (e.g., [6]). At any given moment, a sensor may
be queried for its data; however, it is limited to only a
few operations, such as select, update, delete and insert
operations, or simple join operations. More complex
operations, such as nested queries, are not supported, due
to a lack of CPU power which would make the time to
complete the query unacceptable. Thus, the client must
issue only the innermost select query, and process the
received data locally, possibly requiring further queries
to complete the original query. This means that more
data will be transmitted to the client than the data needed
to answer the original query, therefore more energy will
be used by the MCU.</p>
        <p>Now, replace the storage device with either MRAM only,
or a hybrid MRAM and FLASH solution. MRAM having a
lower management complexity frees up part of the
computational budget, which can now be used by the query
engine to support faster processing. Furthermore, faster
performance means less I/O waiting time, equating to
fewer unused processor cycles. With the extra
computational budget attributed to the query engine we are now
able to support nested queries. By executing the entire
query in one go, only the minimum required amount of
data is transmitted to the client, optimizing the amount
of energy used.</p>
        <p>Key-value use case - Picture a scenario where a
patient wears an MCU based and battery powered sensor
that takes heart related measurements. Storing and
processing the data locally using FLASH storage would be
too computationally expensive for the MCU, so instead
those measurements are transmitted in raw form to a
more capable MPU, where ECG data is extracted from
the raw data. This transmission of data drains the MCU's
battery, requiring frequent recharging of the medical
sensor device. If instead MRAM storage was used, the MCU
could potentially have enough processing power left to
extract the ECG data locally, and only relay relevant
information to the MPU, extending the operational lifetime of
the charge cycle.</p>
        <p>The conducted experiments used an M1 MRAM device
(Table 1), as it allowed us to create a prototype in a shorter
time frame. Employing faster M3 or M4 devices could
potentially increase the observed performance, which we
reserve for future work.</p>
        <p>Similar to MRAM, there are a series of other persistent
memory technologies which can be considered for use
with database systems. We consider the comparison of
MRAM against other types of persistent memory to be
outside the scope of this work, but we encourage
interested parties to check related work which provides
that analysis [45]. As for how previous work with the
popular Intel Optane persistent memory can be applied
to MRAM, we believe there are multiple reasons why
such work may not be applicable here. The Intel Optane
line is composed of more complex devices which
comprise multiple data storage chips, with non-persistent
caching mechanisms and capability for concurrent
operations. Related work in Intel Optane enabled key-value
stores, for example, focuses on providing consistency
guarantees given non-persistent write operations (i.e.,
involving caching) and maximizing concurrency related
performance [14, 42]. Some optimizations are also based
on optimizing the use of the libraries provided for Intel
Optane access. In contrast, databases for MCUs, as
analyzed here, have a single execution thread. Furthermore,
the targeted MRAM device does not support concurrent
operations and does not provide caching mechanisms.
The MRAM memory is accessed in the same way as
normal memory: through a pointer to a particular address
which is mapped to a location in the MRAM memory.
As such, the set of problems for systems targeting Intel
Optane is not the same as for MRAM systems.</p>
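<p>The pointer-based access model described above can be made concrete with a short sketch. The base address below is a placeholder (the real one depends on how the STM32's external memory controller maps the chip), and a RAM buffer stands in for the device so the snippet is host-runnable:</p>

```c
#include <stdint.h>

/* On the STM32 this would be the external-memory bank the MRAM is
 * wired to, e.g. (volatile uint16_t *)0x60000000u; that address is
 * illustrative, and here a plain RAM buffer stands in for the
 * mapped device. */
static uint16_t fake_mram[256];
#define MRAM ((volatile uint16_t *)fake_mram)

/* A store to the mapped region IS the durable write: there is no
 * page program, erase, flush, or fsync step, unlike with FLASH. */
static void mram_store(unsigned word, uint16_t v) { MRAM[word] = v; }
static uint16_t mram_load(unsigned word) { return MRAM[word]; }
```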
      </sec>
      <sec id="sec-3-5">
        <title>7. Conclusion</title>
        <p>Research into the use of persistent byte-addressable
memory for database systems has been focused on
data center-scale applications, namely supported by Intel
Optane products [46, 47]. Results show, however, that
byte-addressable persistent memory should also be explored for
use in resource constrained data management systems.
This paper shows that MRAM provides several
advantages over NAND FLASH alternatives. At the hardware
level, MRAM enables 5 orders of magnitude more write
operations per cell, thus making it practically
impervious to cell wear-out. Furthermore, random and
sequential accesses have identical performance, and maximum
throughput is achieved with writes as small as 4 bytes,
and reads of 512 bytes.</p>
        <p>MRAM shows a throughput advantage on all I/O block
sizes when compared to FLASH, particularly for block
sizes under 32KB. This was observed in the Raw
Performance tests, but also in the Hash Table tests, despite
the latter being a more complex workload, with the exception
that for key/value sizes greater than 32 bytes, RocksDB
on the NAND Flash alternative outperforms
MRAM's LPHT. The relational database test with SQLite
showed that although MRAM can help MCUs reach a
performance close to that of an MPU for a relational database,
a direct replacement of NAND FLASH with MRAM is not
sufficient for the MCU to outperform the MPU.
However, MRAM allows many of the mechanisms that are
currently used to accommodate FLASH to be avoided,
opening the way for new architectures, directed specifically
at MRAM, that could outperform MPUs.</p>
        <p>In a nutshell, MRAM presents a big advantage over
NAND FLASH in small I/O operations, being able to
achieve full throughput at operation sizes of just a few
bytes. Furthermore, performance is not affected by
random access patterns. The virtually infinite endurance of
MRAM memory avoids the need for any wear-leveling
mechanisms, and its low power consumption contributes
to extend the lifetime of battery powered MCUs. Nominal
values also point to MRAM being able to achieve a
significantly higher peak throughput than FLASH storage.</p>
        <p>Thus, MRAM can allow for systems which are simpler to
implement, have higher performance, and consume less
energy.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <sec id="sec-4-1">
        <p>This project has received funding from the European
Union's Horizon 2020 research and innovation programme
under grant agreement No 857237. The sole
responsibility for the content of this publication lies with the
authors. It does not necessarily reflect the opinion of
the European Commission (EC). The EC is not
responsible for any use that may be made of the information
contained therein. It is also funded by National Funds
through the FCT — Fundação para a Ciência e a
Tecnologia (Portuguese Foundation for Science and Technology)
PhD grant (PD/BD/151402/2021).</p>
        <p>spruj40c.pdf?ts=1687780680209, rev. C.</p>
        <p>[16] M. Li, D. Ganesan, P. Shenoy, Presto: Feedback-driven data management in sensor networks, IEEE/ACM Transactions on Networking 17 (2009) 1256–1269.</p>
        <p>[17] C. Wang, X. Huang, J. Qiao, T. Jiang, L. Rui, J. Zhang, R. Kang, J. Feinauer, K. McGrail, P. Wang, et al., Apache IoTDB: Time-series database for internet of things, Proc. VLDB Endow. 13 (2020) 2901–2904.</p>
        <p>[18] F. Wu, C. Qiu, T. Wu, M. Yuce, Edge-based hybrid system implementation for long-range safety and healthcare IoT applications, IEEE Internet of Things Journal 8 (2021) 9970–9980.</p>
        <p>[19] S. Alamouti, F. Arjomandi, M. Burger, Hybrid edge cloud: A pragmatic approach for decentralized cloud computing, IEEE Communications Magazine 60 (2022) 16–29.</p>
        <p>[20] AWS, What's the difference between microprocessors and microcontrollers?, 2023. URL: https://aws.amazon.com/pt/compare/the-difference-between-microprocessors-microcontrollers/.</p>
        <p>[21] M. Technology, MT29F128G08AJAAAWP-ITZ:A, NAND flash memory, rev. H, 2014. URL: https://pt.mouser.com/datasheet/2/671/micron_technology_micts06235-1-1759187.pdf.</p>
        <p>[22] Intel, 3D XPoint™: A breakthrough in non-volatile memory technology, 2015. URL: https://www.intel.com/content/www/us/en/architecture-and-technology/intel-micron-3d-xpoint-webcast.html.</p>
        <p>[23] J. Heidecker, MRAM Technology Status, Technical Report, NASA, 2013.</p>
        <p>[24] A. Technology, AS3004316, parallel persistent SRAM memory, rev. T, 2022. URL: https://pt.mouser.com/datasheet/2/1122/1Mb_32Mb_Parallel_x16_MRAM_2-1949428.pdf.</p>
        <p>[25] E. Technologies, MR4A16B, rev. 11.7, 2018. URL: https://pt.mouser.com/datasheet/2/144/MR4A16B_Datasheet-1511254.pdf.</p>
        <p>[26] E. Technologies, EMxxLX, expanded serial peripheral interface (xSPI) industrial STT-MRAM persistent memory, rev. 2.9, 2022. URL: https://www.everspin.com/supportdocs/all.</p>
        <p>[27] E. Technologies, EMD4E001GAS2, 1Gb non-volatile ST-DDR4 spin-transfer torque MRAM, rev. 1.2, 2020. URL: https://www.mouser.com/datasheet/2/144/EMD4E001GAS2_1_2_08252020-1923803.pdf.</p>
        <p>[28] M. Technology, MT28EW512ABA1HPC-0SIT TR, parallel NOR flash embedded memory, rev. I, 2018. URL: https://media-www.micron.com/-/media/client/global/documents/products/data-sheet/nor-flash/parallel/mt28ew_mt28fw/mt28ew_qlkp_512_aba_0sit.pdf.</p>
        <p>[29] A. Technology, Avalanche technology - products - space grade, 2023. URL: https://www.avalanche-technology.com/products/discrete-mram/space.</p>
        <p>[30] T. David, R. Guerraoui, V. Trigonakis, Asynchronized concurrency: The secret to scaling concurrent search data structures, ACM SIGARCH Computer Architecture News 43 (2015) 631–644.</p>
        <p>[31] E. Distributed Computing Laboratory, CLHT, 2013. URL: https://github.com/LPD-EPFL/CLHT.</p>
        <p>[32] SQLite, The SQLite OS interface or "VFS", 2023. URL: https://www.sqlite.org/vfs.html.</p>
        <p>[33] littlefs project, littlefs, 2023. URL: https://github.com/littlefs-project/littlefs.</p>
        <p>[34] STMicroelectronics, STM32H743ZI, 2023. URL: https://www.st.com/en/microcontrollers-microprocessors/stm32h743zi.html.</p>
        <p>[35] SanDisk, SanDisk Extreme® microSDXC™ UHS-I card, 2023. URL: https://www.westerndigital.com/products/memory-cards/sandisk-extreme-uhs-i-microsd#SDSQXAF-032G-GN6MA.</p>
        <p>[36] G. P. Perrucci, F. H. P. Fitzek, J. Widmer, Survey on energy consumption entities on the smartphone platform, in: 2011 IEEE 73rd Vehicular Technology Conference (VTC Spring), 2011, pp. 1–6. doi:10.1109/VETECS.2011.5956528.</p>
        <p>[37] SanDisk, SanDisk® industrial microSD card datasheet, 2016. URL: https://images-na.ssl-images-amazon.com/images/I/91tTtUMDM3L.pdf.</p>
        <p>[38] S. Crawford, How secure digital memory cards work, 2011. URL: https://computer.howstuffworks.com/secure-digital-memory-cards.htm.</p>
        <p>[39] Samsung, MicroSD PRO Endurance, 2023. URL: https://semiconductor.samsung.com/consumer-storage/memory-card/micro-sd-pro-endurance/.</p>
        <p>[40] J. Axboe, fio - flexible I/O tester, 2014. URL: https://github.com/axboe/fio.</p>
        <p>[41] A. Piltch, Best microSD cards for Raspberry Pi 2023, 2023. URL: https://www.tomshardware.com/best-picks/raspberry-pi-microsd-cards.</p>
        <p>[42] S. K. Lee, J. Mohan, S. Kashyap, T. Kim, V. Chidambaram, Recipe: Converting concurrent DRAM indexes to persistent-memory indexes, in: Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, pp. 462–477.</p>
        <p>[43] H. Dubey, J. Yang, N. Constant, A. M. Amiri, Q. Yang, K. Makodiya, Fog data: Enhancing telehealth big data through fog computing, in: Proceedings of the ASE BigData &amp; SocialInformatics 2015, 2015, pp. 1–6.</p>
        <p>[44] A. Technology, AS308G208, space-grade high performance dual-quad serial persistent SRAM memory, rev. E, 2023. URL: https://www.avalanche-technology.com/wp-content/uploads/1G-8Gb-Dual-QSPI-Space-Grade-Serial-E-01_10_2023.pdf.</p>
        <p>[45] S. Kargar, F. Nawab, Challenges and future directions for energy, latency, and lifetime improvements in NVMs, Distributed and Parallel Databases (2022) 1–27.</p>
        <p>[46] A. Shanbhag, N. Tatbul, D. Cohen, S. Madden, Large-scale in-memory analytics on Intel® Optane™ DC persistent memory, in: Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020, pp. 1–8.</p>
        <p>[47] Y. Wu, K. Park, R. Sen, B. Kroth, J. Do, Lessons learned from the early performance evaluation of Intel Optane DC persistent memory in DBMS, in: Proceedings of the 16th International Workshop on Data Management on New Hardware, 2020, pp. 1–3.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>