Exploring Storage Bottlenecks in Linux-based Embedded Systems

Russell Joyce
Real-Time Systems Research Group
Department of Computer Science
University of York, UK
russell.joyce@york.ac.uk

Neil Audsley
Real-Time Systems Research Group
Department of Computer Science
University of York, UK
neil.audsley@york.ac.uk

ABSTRACT
With recent advances in non-volatile memory technologies and embedded hardware, large, high-speed persistent-storage devices can now realistically be used in embedded systems. Traditional models of storage systems, including the implementation in the Linux kernel, assume the performance of storage devices to be far slower than CPU and system memory speeds, encouraging extensive caching and buffering over direct access to storage hardware. In an embedded system, however, processing and memory resources are limited while storage hardware can still operate at full speed, causing this balance to shift, and leading to the observation of performance bottlenecks caused by the operating system rather than the speed of storage devices themselves.

In this paper, we present performance and profiling results from high-speed storage devices attached to a Linux-based embedded system, showing that the kernel's standard file I/O operations are inadequate for such a set-up, and that 'direct I/O' may be preferable for certain situations. Examination of the results identifies areas where potential improvements may be made in order to reduce CPU load and increase maximum storage throughput.

Categories and Subject Descriptors
D.4.7 [Operating Systems]: Organization and Design—Real-time systems and embedded systems

General Terms
Design, Measurement, Performance

Keywords
Linux, storage

EWiLi'15, October 8th, 2015, Amsterdam, The Netherlands. Copyright retained by the authors.

1. INTRODUCTION
Traditionally, access to persistent storage has been orders of magnitude slower than volatile system memory, especially when performing random data accesses, due to the high latency and low bandwidth associated with the mechanical operation of hard disk drives, as well as the constant increase in CPU and memory speeds over time. Despite the deceleration of single-core CPU scaling in recent years, the main bottleneck associated with accessing non-volatile storage in a general-purpose system is still typically the storage device itself.

Linux (along with many other operating systems) uses a number of methods to reduce the impact that slow storage devices cause on overall system performance. Firstly, main memory is heavily used to cache data between block device accesses, avoiding unnecessary repeated reads of the same data from disk. This also helps in the efficient operation of file systems, as structures describing the position of files on a disk can be cached for fast retrieval. Secondly, buffers are provided for data flowing to and from persistent storage, which allow applications to spend less time waiting on disk operations, as these can be performed asynchronously by the operating system without the application necessarily waiting for their completion. Finally, sophisticated scheduling and data layout algorithms can be used to optimise the data that is written to a device, taking advantage of idle CPU time caused by the system waiting for I/O operations to complete.

For a general-purpose Linux system, these techniques can have a large positive effect on the efficient use of storage – memory and the CPU often far outperform the speed of a hard disk drive, so any use of them to reduce disk accesses is desirable. However, this relationship between CPU, memory and storage speeds does not hold in all situations, and therefore these techniques may not always provide a benefit to the performance of a system.
The limited resources of a typical embedded system can skew the balance between storage and CPU speed, which can cause issues for a number of embedded applications that require fast and reliable access to storage. Examples of these include applications that receive streaming data over a high-speed interface that must be stored in real-time, such as data being sent from sensors or video feeds, perhaps with intermediate processing being performed using hardware accelerators.

This paper considers effects that the limited CPU and memory speeds of an embedded system can have on a fast storage device – due to the change in balance between relative speeds, the system cannot be expected to perform in the same way as a typical computer, with certain performance bottlenecks shifting away from storage hardware limitations and into software operations.

Results are presented in section 3 from basic testing of storage devices in an embedded system, showing that sequential storage operations experience bottlenecks caused by CPU limitations rather than the speed of the storage hardware if standard Linux file operations are used. Removing reliance on the page cache (through direct I/O) is shown to improve performance for large block sizes, especially on a fast SSD, due to the reduction in the number of times data is copied in main memory.

Potential solutions briefly presented in section 5 suggest that restructuring the storage stack to favour device accesses over memory and CPU usage in this type of system, as well as more radical changes such as the introduction of hardware accelerators, may reduce the negative effects of CPU limitations on storage speeds.

2. PROBLEM SUMMARY
Recent advances in flash memory technology have caused the widespread adoption of Solid-State Drives (SSDs), which offer far faster storage access compared to mechanical hard drives, along with other benefits such as lower energy consumption and more-uniform access times. It is anticipated that over the next several years, further advances in non-volatile memory technologies will accelerate the increasing trend in storage device speeds, potentially allowing for large, non-volatile memory devices that operate with similar performance to volatile RAM. At a certain point, fast storage speeds, relative to CPU and system memory speeds, will cause a critical change in the balance of a system, requiring a significant reconsideration of an operating system's approach to storage access [10].

At present, this shift in the balance of system performance is beginning to affect the embedded world, where processing and memory speeds are typically low due to constraints such as energy usage, size and cost, but where fast solid-state storage still has the potential to run at the same speed as in a more powerful system. For example, an embedded system consisting of a slow, low-core-count CPU and slowly-clocked memory connected to a high-end, desktop-grade SSD has a far different balance between storage, memory and CPU than is expected by the operating system design. While such a system may run Linux perfectly adequately for many tasks, it will not be able to take advantage of the full speed of the SSD using traditional methods of storage access, due to bottlenecks elsewhere in the system.

Before fast solid-state storage was common, non-volatile storage in an embedded system would often consist of slow flash memory, due to the high energy consumption and low durability of faster mechanical media, meaning the potential increase in secondary storage speeds provided by SSDs is even greater in embedded systems than in many general-purpose systems. An increase in the general storage requirements and expectations for systems, driven by fields such as multimedia and 'big data' processing, has also accelerated the adoption of fast solid-state storage in embedded systems.
2.1 Buffered vs Direct I/O
The Linux storage model relies heavily on the buffering and caching of data in system memory, typically requiring data to be copied multiple times before it reaches its ultimate destination. The kernel provides the 'direct I/O' file access method to reduce the amount of memory activity involved in reading and writing data from a block device, allowing data to be copied directly to and from an application's memory space without being transferred via the page cache. While this allows applications more-direct access to storage devices, it can also create restrictions and have a severe negative impact on storage speeds if used incorrectly.

[Figure 1: The operation of direct I/O in the Linux kernel, bypassing the page cache]

In the past, there has been some resistance to the direct I/O functionality of Linux [11], partly due to the benefits of utilising the page cache that are removed with direct I/O, and the large disparity between CPU/memory and storage speeds meaning there were rarely any situations where the overhead of additional memory copies was significant enough to cause a slowdown. However, when storage is fast and the speed of copying data around memory is slow, using direct I/O can provide a significant performance improvement if certain criteria are met.

Figure 1 shows the basic principles of standard and direct I/O, with direct I/O bypassing the page cache and removing the need to copy data from one area of memory to another between storage devices and applications.

One of the main issues with direct I/O is the large overhead caused when dealing with data in small block sizes. Even when using a fast storage device, reading and writing small amounts of data is far slower per byte than larger sizes, due to constant overheads in communication and processing that do not scale with block size. Without kernel buffers in place to help optimise disk accesses, applications that use small block sizes will suffer greatly in storage speed when using direct I/O, compared to when utilising the kernel's data caching mechanisms, which will queue requests to more efficiently access hardware. The performance of accessing large block sizes on a storage device does not suffer from this issue, however, so applications that either inherently use large block sizes, or use their own caching mechanisms to emulate large block accesses, can use direct I/O effectively where required.

A further issue with the implementation of direct I/O in the Linux kernel is that it is not standardised, and is not part of the POSIX specification, so its behaviour and safety cannot necessarily be guaranteed for all situations. The formal definition of the O_DIRECT flag for the open() system call is simply to "try to minimize cache effects of the I/O to and from this file" [1], which may be interpreted differently (or not at all) by various file systems and kernel versions.
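As a concrete illustration of these restrictions, the following minimal C sketch opens a file with the O_DIRECT flag and performs one large, aligned write. The file path, alignment and block size are assumptions for the example: direct I/O generally requires the user buffer, transfer length and file offset to be aligned to the device's logical block size, and a 4KiB alignment with 1MiB transfers satisfies this on most current hardware, although the exact requirements vary between devices, file systems and kernel versions.

/* Minimal sketch of a direct I/O write on Linux.  The path is hypothetical,
 * and the alignment and block size are assumptions that suit typical devices;
 * the real requirements depend on the device, file system and kernel. */
#define _GNU_SOURCE                 /* exposes O_DIRECT in <fcntl.h> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT  4096
#define BLOCK_SIZE (1024 * 1024)    /* 1MiB: large enough to amortise per-request overheads */

int main(void)
{
    void *buf;
    int fd;

    /* The buffer itself must be aligned, as must the transfer size and offset. */
    if (posix_memalign(&buf, ALIGNMENT, BLOCK_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0, BLOCK_SIZE);

    fd = open("/mnt/ssd/stream.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");   /* fails if the file system does not support it */
        free(buf);
        return 1;
    }

    /* Each write bypasses the page cache and is passed to the device driver. */
    if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}

Meeting these alignment rules is the application's responsibility: on most file systems a misaligned direct request simply fails (typically with EINVAL) rather than falling back to buffered I/O.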
One advantage of the kernel using its page cache to store a copy of data is the ability to access that data at a later time without having to load it from secondary storage, however this will have no benefit if data is solely being written to or read from a disk as part of a streaming application, because by the time the data is needed a second time it is likely that it has already been purged from the cache.

2.2 Real-time Storage Implications
Many high-performance storage applications require consideration of real-time constraints, due to external producers or consumers of data running independently of the system storing it – if a storage system cannot save or provide data at the required speed then critical information may be lost or the system may malfunction. While solid-state storage devices have far more consistent access times than mechanical storage, making them more suitable for time-critical applications, if the CPU of a system is proving to be a bottleneck in storage access times, the ability to maintain a consistent speed of data access relies heavily on CPU utilisation.

If the CPU can be removed as far as possible from the operation of copying data to storage, the impact of other processes on this will be reduced, increasing the predictability of storage operations and making real-time guarantees more possible. This could be achieved through methods such as hardware acceleration, as well as simplification of the software storage stack.

2.3 Motivation
A number of examples exist where fast and reliable access to storage is required by an embedded system, which may be limited by CPU or memory resources when standard Linux file system operations are used.

Embedded accelerators are increasingly being investigated for use in high-performance computing environments, due to their energy efficiency when compared to traditional server hardware [8, 7]. CPU usage when performing storage operations has also been identified as an issue in server situations, using large amounts of energy compared to storage devices themselves, and motivating research into how storage systems can be made to be more efficient [4, 9].

Standalone embedded systems that use storage devices also create motivation for efficient access to storage, for applications such as logging sensor data and recording high-bandwidth video streams [6]. Often, external data sources will have constraints on the speeds required for their storage, for example, with the number of frames of video that must be stored each second, so any methods that can help to meet these requirements while keeping energy usage at a minimum are desirable.

Consider a basic Linux application that reads a stream of data from a network interface and writes it to a continuous file on secondary storage using standard file operations. Disregarding any other system activity and additional operations performed by the file system, data will be copied a minimum of six times on its path from network to disk:

1. From the network device to a buffer in the device driver
2. From the driver's buffer to a general network-layer kernel buffer
3. From the kernel buffer to the application's memory space
4. From the application's memory space to a kernel file buffer
5. From the file buffer to the storage device driver
6. From the driver's buffer to the storage device itself

This process has little impact on overall throughput if either storage or network speed is slow relative to main memory and CPU, however as soon as this balance changes, any additional memory copying can have a severe impact. Techniques such as DMA can help to reduce the CPU load related to copying data from one memory location to another, however this relies on hardware and driver support, and does not fully tackle the inefficiencies of unnecessary memory copies.
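The sketch below shows the shape of such a receive-and-store application when written with standard file operations; the socket is assumed to be already connected and the output path is hypothetical. Even in this minimal form, every chunk of data crosses the user/kernel boundary twice (copies 3 and 4 in the list above), in addition to the copies performed inside the network and storage stacks.

/* Sketch of a naive 'receive and store' loop using standard file I/O.
 * 'sock' is assumed to be an already-connected socket descriptor, and the
 * output path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int store_stream(int sock)
{
    char buf[CHUNK];
    ssize_t n;

    int fd = open("/mnt/ssd/capture.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    while ((n = read(sock, buf, sizeof(buf))) > 0) {    /* kernel -> user copy (3) */
        if (write(fd, buf, n) != n) {                   /* user -> kernel copy (4) */
            perror("write");
            close(fd);
            return -1;
        }
    }

    close(fd);
    return 0;
}

Opening the output file with O_DIRECT would remove copy 4, at the cost of the alignment and block-size restrictions discussed in section 2.1; the remaining copies happen inside the kernel and drivers and are not visible to the application.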
3. EXPERIMENTAL WORK
In order to examine the effects that a slow system can have on the performance of storage devices, and to identify the potential bottlenecks present in the Linux storage stack, we performed a number of experiments with storage operations while collecting profiling and system performance information.

3.1 Experimental Set-up
The experimental set-up consisted of an Avnet ZedBoard Mini-ITX development board connected to storage devices using its PCI Express Gen2 x4 connector. The ZedBoard Mini-ITX provides a Xilinx Zynq-7000 system-on-chip, which combines a dual-core ARM Cortex-A9 processor (clocked at 666MHz) with a large amount of FPGA fabric, alongside 1GiB of DDR3 RAM and many other on-board peripherals.

The system uses Linux 3.18 (based on the Xilinx 2015.2 branch) running on the ARM cores, while an AXI-to-PCIe bridge design is programmed on the FPGA to provide an interface between the processor and PCI Express devices.

To provide a range of results, two storage devices were tested with the system: a Western Digital Blue 500GB SATA III hard disk drive, connected through a Startech SATA III RAID card; and an Intel SSD 750 400GB. While both devices use the same PCI Express interface for their physical connection to the board, the RAID card uses AHCI for its logical storage interface, whereas the SSD uses the more efficient NVMe interface.

Due to limitations of the high-speed serial transceiver hardware on the Zynq SoC, the speed of the SSD interface is limited to PCI Express Gen2 x4 (from its native Gen3 x4), reducing the maximum four-lane bandwidth from 3940MB/s to 2000MB/s. While this is still far faster than the 600MB/s maximum of the SATA-III interface used by the HDD, it means the SSD will never achieve its advertised maximum capable speed of 2200MB/s in this hardware set-up.

3.2 Data Copy Tests
To obtain an indication of the operating speeds of the storage devices at various block sizes with minimal external overhead, we performed basic testing using the Linux dd utility. For write tests, /dev/zero was used as a source file, and for read tests, /dev/null was used as a destination. Both storage devices were freshly formatted with an ext4 file system before each test.

For each block size and storage device, four tests were performed: reading from a file on the device, writing to a file on the device, and reading and writing with direct I/O enabled (using the iflag=direct and oflag=direct operands of dd respectively). Additionally, read and write tests were performed with a 512MiB tmpfs RAM disk (/dev/shm) in order to determine possible maximum speeds when no external storage devices or low-level drivers were involved.
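The exact invocations are not reproduced here, but the tests take roughly the following form (the mount point is hypothetical, and the count shown gives a 20GiB transfer at a 1MiB block size, matching the sequential transfers described below); omitting the direct flags gives the corresponding standard I/O tests:

    dd if=/dev/zero of=/mnt/test/data.bin bs=1M count=20480 oflag=direct
    dd if=/mnt/test/data.bin of=/dev/null bs=1M count=20480 iflag=direct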
Each test was performed with and without the capture of system resource usage and collection of profiling data, so results could be gathered without any additional overheads caused by these measurements.

With the secondary storage devices, data was recorded for 20GiB sequential transfers, and with the RAM disk, 256MiB transfers were used due to the lower available space. Sequential transfers are used as they represent the type of problem that is likely to be encountered when requiring high-speed storage in an embedded system – reading and writing streams of contiguous data – as well as being simple to implement and test. Storage devices, especially mechanical hard disks, generally perform faster with sequential transfers than with random accesses, and operating and file system overheads are also likely to be greater for non-sequential access patterns, so further experimentation will be necessary to determine whether the same effects are present when using different I/O patterns.

3.3 System Resource Usage and Profiling
To collect information about system resource usage during each test, we used the dstat utility [2] to capture memory usage, CPU usage and storage device transfer speeds each second.

Additionally, to determine the amount of execution time that is spent in each relevant function within the user application, kernel and associated libraries during the tests, we ran dd within the full-system profiler, operf, part of the OProfile suite of tools [3]. The impact on performance caused by profiling is kept to a minimum through support from the CPU hardware and the kernel performance events subsystem, however slight overheads are likely while the profiler is running, potentially causing slower speeds and slight differences in observed data.

3.4 Results
The following results were gathered using the system and methods described above, in order to investigate various aspects of storage operations.

[Figure 2: Plots of average read and write speeds for each device and block size tested]

3.4.1 Read and Write Speeds
Figure 2 shows the average read and write speeds for a number of block sizes when transferring data to or from the storage devices and RAM disk.

The standard read and write speeds for both storage devices are very similar, with the SSD only performing slightly faster than the HDD for all block sizes tested. This suggests that bottlenecks exist outside of the storage devices in the test system, either caused by the CPU or system memory bandwidth, as it is expected that the SSD should perform significantly faster than the HDD in both read and write speed.

For the 512B block size, speeds are slower on both devices, however there is little difference in speed once block sizes increase above this. This slow speed could be due to the significant number of extra context switches at a low block size being a bottleneck, rather than the factors limiting storage operations at 4KiB and above.

The consistently slightly higher speeds seen with the SSD are likely to be caused by it using an NVMe logical interface to communicate with the operating system, compared to the less-efficient AHCI interface used by the SATA HDD. If the storage operations are indeed experiencing a CPU bottleneck, then the more-efficient low-level drivers of NVMe would allow for this higher speed. This could be confirmed by repeating the tests with a SATA SSD connected to the same RAID card as the HDD, instead of using a separate NVMe device.

When both reading and writing using standard I/O, RAM disk performance is far higher than both non-volatile storage devices. This was expected even when bottlenecks exist outside of the storage devices themselves, as the kernel optimises accesses to tmpfs file systems by avoiding the page cache, thus requiring fewer memory copy operations.
3.4.2 Impact of Direct I/O
When performing the same tests with direct I/O enabled, speeds to the storage devices are generally higher when the block size is sufficiently large to overcome the overheads involved, such as increased communication with hardware. 512B and 4KiB block sizes are slower than the standard write tests, as the kernel cannot cache data and write it to the device in larger blocks, but larger block sizes are faster.

It appears that the HDD is limited by other factors when block sizes of 512KiB and above are used, which may be due to the inefficiencies of AHCI, or simply the speed limitations of the disk itself. This is reinforced by the HDD direct I/O read speeds being approximately equal to, or lower than standard I/O speeds to the device, rather than seeing the performance increases of the SSD.

For the SSD, maximum direct write speeds are over double those of standard I/O, and maximum direct read speeds also show a significant improvement, however these are both still far lower than the rated speeds of the device. A further bottleneck appears to be encountered between 1MiB and 16MiB direct I/O block sizes, suggesting that at this point the block size is large enough to overcome any communication and driver overheads and the earlier limitations experienced with non-direct I/O are once again affecting speeds. This speed limit (at around 230MiB/s) also matches the write speed limit of the RAM disk when using block sizes between 512KiB and 16MiB, suggesting that both the SSD and RAM disk are experiencing the same bottleneck here.

[Figure 3: Plot of average single-core kernel CPU usage for each block size tested]

3.4.3 CPU Usage
Figure 3 shows the mean single-core kernel CPU usages across block sizes for each test, where single-core figures are calculated as the maximum of the two cores for each sample recorded. In general, it can be seen that a large amount of CPU time is spent in the kernel across the tests, with all but HDD direct I/O using an entire CPU core of processing for large block sizes, strongly suggesting that the bottlenecks implied by the speed results are caused by inadequate processing power.

The low system CPU usage of the HDD direct I/O tests suggests that the bottleneck may indeed be the disk itself, unlike the SSD tests, which show more clear, consistent limits in their transfer speeds.

For the SSD direct I/O write test, the 16MiB block size where the speed bottleneck begins corresponds to where CPU usage reaches 100%, further suggesting that the bottleneck is caused by processing on the CPU.

Further experimentation to test the direct impact that CPU speed has on the storage speeds could be carried out by repeating the tests while altering the clock speed of the ARM core, or limiting the number of CPU cores available to the operating system.

[Figure 4: Plot of proportion of time spent in memory copy functions for each block size tested]

3.4.4 Profiling Results
Results from profiling show that for both read and write tests, a large amount of CPU time is spent copying data between user and kernel areas of memory. Figure 4 shows the percentage of total execution time spent in the kernel functions __copy_to_user (for read) and __copy_from_user (for write), used for copying data to and from user space respectively. The direct I/O write tests spend no time in these functions, but instead a large amount of time is spent flushing the CPU data cache in the v7_flush_kern_dcache_area function.

Both read and direct I/O operations, which rely on more immediate access to storage devices, additionally spend a large amount of CPU time waiting for device locks to be released in the _raw_spin_unlock_irq function.
4. RELATED WORK
There are several areas of related work that suggest the current position of storage in a system architecture needs rethinking, due to the introduction of fast storage technologies, and due to inefficiencies in the software storage stack and file systems. Their focus is not entirely on embedded systems, but also on the increasing demand for efficient and fast storage in high-performance computing environments.

Refactor, reduce, recycle.
The discussion in [10] advocates the necessary simplification of software storage stacks through refactoring and reduction, in order to make them able to fully utilise emerging high-speed non-volatile memory technologies, and lower the relative processor impact caused by fast I/O. It demonstrates that due to the large increase in storage speeds available with these devices, the traditional balance between slow storage and fast CPU speed is broken, and that improvements can be made through changes to the way storage is handled by the operating system. The work does not explicitly reference embedded systems, however the same theories apply in greater measure, due to even greater restrictions on processing resources.

Hardware file systems.
Efforts such as [5] and [12] attempt to improve the storage performance in an embedded system by offloading certain file system operations to hardware accelerator cores on an FPGA. While the hardware file system implementation in [5] is quite limited and specialised in its operation, it is motivated by a similar need to optimise storage access beyond what was possible with the CPU in the target system. These hardware file system accelerators are aimed more towards usage in high-performance computing environments than stand-alone embedded systems.
5. DISCUSSION AND FURTHER WORK
The results presented in section 3 show that when CPU resources are sufficiently constrained, there are clear bottlenecks in storage operations, besides the access times of storage devices themselves. In order to utilise the full potential of high-speed storage devices in an embedded Linux environment, and to avoid their use degrading the operation of other tasks running in the system, changes must be made to the storage stack to optimise how they are accessed.

5.1 Potential Solutions
There are several potential solutions to the problems covered, ranging from optimisations in existing software implementations to more radical system architecture changes.

5.1.1 VFS Optimisations
It may be possible to reduce storage overheads through restructuring the storage stack in Linux to better optimise it for high-speed storage with lower CPU usage. Results from profiling may be used to identify the areas of the storage stack that are performing particularly inefficiently, or that are simply unnecessary for the required tasks. Such optimisations would potentially require large changes to the structure of the Linux kernel.

5.1.2 Improved Direct I/O
The performance results show that using direct I/O can give a major boost to performance, especially with the fast SSD, if block sizes are above a reasonable threshold for disk access operations, but can also severely reduce performance if used for small block sizes.

Given its potential benefits, a reimplemented pseudo-direct I/O could operate with the benefits of direct I/O for large block sizes, but attempt to efficiently buffer storage device accesses when block sizes are below a practical limit. Standardising direct I/O so its operation can be guaranteed across file systems and kernel versions would also allow its usage to be more widely accepted.
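As a rough user-space illustration of this idea (the proposal here concerns the kernel itself, and all names in the sketch are hypothetical), a writer could stage small writes in an aligned buffer and only issue large, aligned direct writes once a threshold is reached:

/* User-space illustration of 'pseudo-direct I/O': small writes are staged in
 * an aligned buffer and only submitted to an O_DIRECT descriptor once a full,
 * well-aligned block has accumulated.  Names and sizes are hypothetical. */
#include <string.h>
#include <unistd.h>

#define STAGE_SIZE (1024 * 1024)    /* flush threshold: 1MiB, suitably aligned for O_DIRECT */

struct staged_writer {
    int    fd;        /* descriptor opened with O_DIRECT */
    char  *stage;     /* posix_memalign()'d buffer of STAGE_SIZE bytes */
    size_t used;      /* bytes currently staged */
};

/* Flush the staging buffer with a single large, aligned write.  A complete
 * implementation must pad or fall back to buffered I/O for a final partial
 * block, since direct I/O requires aligned transfer lengths. */
static int staged_flush(struct staged_writer *w)
{
    if (w->used == 0)
        return 0;
    if (write(w->fd, w->stage, w->used) != (ssize_t)w->used)
        return -1;
    w->used = 0;
    return 0;
}

/* Accept writes of any size, deferring device access until enough data has
 * accumulated to make a large direct transfer worthwhile. */
int staged_write(struct staged_writer *w, const void *data, size_t len)
{
    while (len > 0) {
        size_t space = STAGE_SIZE - w->used;
        size_t chunk = len < space ? len : space;

        memcpy(w->stage + w->used, data, chunk);
        w->used += chunk;
        data = (const char *)data + chunk;
        len -= chunk;

        if (w->used == STAGE_SIZE && staged_flush(w) != 0)
            return -1;
    }
    return 0;
}

A kernel-level version of the same policy could switch between the cached and direct paths based on request size, so that applications would not need to opt in or manage alignment themselves.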
5.1.3 Hardware Acceleration
One possible method of relieving CPU load during storage operations would be to introduce hardware acceleration into a system, in order to perform some of the tasks associated with the software storage stack in hardware instead. These accelerators could range from simple direct memory access (DMA) units, used to perform the expensive memory copy operations without taking up CPU time, to more-complex file-system-aware accelerators that access the storage device directly, effectively shifting the hardware/software divide further up the storage stack.

Introducing hardware that can access storage independently of the CPU may give an advantage for applications that use large streams of data, as more than just the storage device can be attached to the hardware. For example, a hardware accelerator could directly receive data from a hardware video encoder and write it straight to a file on persistent storage, with little CPU intervention and no buffering required in main system memory.

5.2 Further Work
There is potential for much deeper investigation into the operation of fast storage devices in an embedded Linux environment, in order to fully understand the bottlenecks involved and propose more comprehensive solutions.

While the results presented in section 3 highlight some examples of circumstances where storage speeds are heavily limited by areas other than storage devices themselves, they only focus on basic tests working with sequential data on a single file system type and clean devices. Further experiments will be carried out with other I/O patterns, such as random reads/writes, and benchmarks based on real-world usage patterns, in order to better gauge the scope of the issue and the focus for improvements.

Further work will also involve modifying areas of the test platform, such as the CPU clock speed and the number of available cores, in order to give insight on the direct effect this has on results. Alternative platforms, such as more-powerful server hardware, can be used to test exactly how much of a limiting effect the embedded hardware has on storage capabilities.

As well as experimental work on existing implementations, practical work to test the feasibility of solutions suggested above will be necessary in order to improve on the current situation. Modelling storage system operation based on experimental results may assist in implementation work, through the identification of areas that can be improved and giving a base on which to test solutions in a more abstract way.

6. REFERENCES
[1] open(2) – Linux Programmer's Manual. Release 4.02.
[2] Dstat: Versatile resource statistics tool, Mar. 2012. Online: http://dag.wiee.rs/home-made/dstat/.
[3] OProfile – A System Profiler for Linux, Aug. 2015. Online: http://oprofile.sourceforge.net/.
[4] A. M. Caulfield et al. Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing. In Proc. 2010 ACM/IEEE Int. Conf. High Performance Computing, Networking, Storage, and Analysis, New Orleans, LA, Nov. 2010.
[5] A. Mendon. The case for a Hardware Filesystem. PhD thesis, University of North Carolina at Charlotte, NC, 2012.
[6] National Instruments. Data acquisition: I/O for embedded systems. White Paper, Oct. 2012. Available: http://www.ni.com/white-paper/7021/en/.
[7] A. Putnam et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Proc. 41st Int. Symp. Computer Architectures, June 2014.
[8] R. Sass, W. Kritikos, A. Schmidt, S. Beeravolu, and P. Beeraka. Reconfigurable computing cluster (RCC) project: Investigating the feasibility of FPGA-based petascale computing. In Proc. 15th Annu. IEEE Symp. Field-Programmable Custom Computing Machines, Apr. 2007.
[9] P. Sehgal, V. Tarasov, and E. Zadok. Evaluating performance and energy in file system server workloads. In Proc. 8th USENIX Conf. File and Storage Technologies, Feb. 2010.
[10] S. Swanson and A. M. Caulfield. Refactor, reduce, recycle: Restructuring the I/O stack for the future of storage. Computer, 46(8):52–59, Aug. 2013.
[11] L. Torvalds. Re: O_DIRECT question. Linux Kernel Mailing List, Jan. 2007. Available: https://lkml.org/lkml/2007/1/11/121.
[12] V. Varadarajan, S. K. R, A. Nedunchezhian, and R. Parthasarathi. A reconfigurable hardware to accelerate directory search. In Proc. IEEE Int. Conf. High Performance Computing, Dec. 2009.