                                Exploring Storage Bottlenecks
                             in Linux-based Embedded Systems

                             Russell Joyce                                        Neil Audsley
                Real-Time Systems Research Group                      Real-Time Systems Research Group
                 Department of Computer Science                        Department of Computer Science
                       University of York, UK                                University of York, UK
                    russell.joyce@york.ac.uk                              neil.audsley@york.ac.uk

EWiLi'15, October 8th, 2015, Amsterdam, The Netherlands.
Copyright retained by the authors.

ABSTRACT
With recent advances in non-volatile memory technologies and embedded hardware, large, high-speed persistent-storage devices can now realistically be used in embedded systems. Traditional models of storage systems, including the implementation in the Linux kernel, assume the performance of storage devices to be far slower than CPU and system memory speeds, encouraging extensive caching and buffering over direct access to storage hardware. In an embedded system, however, processing and memory resources are limited while storage hardware can still operate at full speed, causing this balance to shift, and leading to the observation of performance bottlenecks caused by the operating system rather than the speed of storage devices themselves.

In this paper, we present performance and profiling results from high-speed storage devices attached to a Linux-based embedded system, showing that the kernel's standard file I/O operations are inadequate for such a set-up, and that 'direct I/O' may be preferable in certain situations. Examination of the results identifies areas where potential improvements may be made in order to reduce CPU load and increase maximum storage throughput.

Categories and Subject Descriptors
D.4.7 [Operating Systems]: Organization and Design—Real-time systems and embedded systems

General Terms
Design, Measurement, Performance

Keywords
Linux, storage

1. INTRODUCTION
Traditionally, access to persistent storage has been orders of magnitude slower than volatile system memory, especially when performing random data accesses, due to the high latency and low bandwidth associated with the mechanical operation of hard disk drives, as well as the constant increase in CPU and memory speeds over time. Despite the deceleration of single-core CPU scaling in recent years, the main bottleneck associated with accessing non-volatile storage in a general-purpose system is still typically the storage device itself.

Linux (along with many other operating systems) uses a number of methods to reduce the impact that slow storage devices have on overall system performance. Firstly, main memory is heavily used to cache data between block device accesses, avoiding unnecessary repeated reads of the same data from disk. This also helps the efficient operation of file systems, as structures describing the position of files on a disk can be cached for fast retrieval. Secondly, buffers are provided for data flowing to and from persistent storage, which allow applications to spend less time waiting on disk operations, as these can be performed asynchronously by the operating system without the application necessarily waiting for their completion. Finally, sophisticated scheduling and data layout algorithms can be used to optimise the data that is written to a device, taking advantage of idle CPU time caused by the system waiting for I/O operations to complete.
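As a minimal illustration of this buffering behaviour (a sketch only, not taken from the test code used in this paper; the file name is hypothetical and error handling is kept to a minimum), the following C fragment performs a standard write(), which normally returns as soon as the data has been copied into the page cache, and then calls fsync() to force the cached data out to the underlying device:

/* buffered_write.c - minimal illustration of buffered (page-cached) I/O.
 * write() typically returns once the data is in the kernel page cache;
 * fsync() blocks until the cached data has reached the storage device.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t block = 1024 * 1024;          /* 1 MiB block */
    char *buf = malloc(block);
    if (buf == NULL)
        return EXIT_FAILURE;
    memset(buf, 0xAA, block);

    int fd = open("testfile.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Completes quickly: the data is copied into the page cache and
     * written back to the device asynchronously by the kernel. */
    if (write(fd, buf, block) != (ssize_t)block)
        perror("write");

    /* Blocks until the data has actually reached the device. */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    free(buf);
    return EXIT_SUCCESS;
}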
For a general-purpose Linux system, these techniques can have a large positive effect on the efficient use of storage – memory and the CPU often far outperform the speed of a hard disk drive, so any use of them to reduce disk accesses is desirable. However, this relationship between CPU, memory and storage speeds does not hold in all situations, and therefore these techniques may not always provide a benefit to the performance of a system.

The limited resources of a typical embedded system can skew the balance between storage and CPU speed, which can cause issues for a number of embedded applications that require fast and reliable access to storage. Examples include applications that receive streaming data over a high-speed interface that must be stored in real time, such as data being sent from sensors or video feeds, perhaps with intermediate processing being performed using hardware accelerators.

This paper considers the effects that the limited CPU and memory speeds of an embedded system can have on a fast storage device – due to the change in balance between relative speeds, the system cannot be expected to perform in the same way as a typical computer, with certain performance bottlenecks shifting away from storage hardware limitations and into software operations.

Results are presented in section 3 from basic testing of storage devices in an embedded system, showing that sequential storage operations experience bottlenecks caused by CPU limitations rather than the speed of the storage hardware if standard Linux file operations are used. Removing reliance on the page cache (through direct I/O) is shown to improve performance for large block sizes, especially on a fast SSD, due to the reduction in the number of times data is copied in main memory.

Potential solutions briefly presented in section 5 suggest that restructuring the storage stack to favour device accesses over memory and CPU usage in this type of system, as well as more radical changes such as the introduction of hardware accelerators, may reduce the negative effects of CPU limitations on storage speeds.

2. PROBLEM SUMMARY
Recent advances in flash memory technology have caused the widespread adoption of Solid-State Drives (SSDs), which offer far faster storage access compared to mechanical hard drives, along with other benefits such as lower energy consumption and more-uniform access times. It is anticipated that over the next several years, further advances in non-volatile memory technologies will accelerate the increasing trend in storage device speeds, potentially allowing for large, non-volatile memory devices that operate with similar performance to volatile RAM. At a certain point, fast storage speeds, relative to CPU and system memory speeds, will cause a critical change in the balance of a system, requiring a significant reconsideration of an operating system's approach to storage access [10].

At present, this shift in the balance of system performance is beginning to affect the embedded world, where processing and memory speeds are typically low due to constraints such as energy usage, size and cost, but where fast solid-state storage still has the potential to run at the same speed as in a more powerful system. For example, an embedded system consisting of a slow, low-core-count CPU and slowly-clocked memory connected to a high-end, desktop-grade SSD has a far different balance between storage, memory and CPU than is expected by the operating system design. While such a system may run Linux perfectly adequately for many tasks, it will not be able to take advantage of the full speed of the SSD using traditional methods of storage access, due to bottlenecks elsewhere in the system.

Before fast solid-state storage was common, non-volatile storage in an embedded system would often consist of slow flash memory, due to the high energy consumption and low durability of faster mechanical media, meaning the potential increase in secondary storage speeds provided by SSDs is even greater in embedded systems than in many general-purpose systems. An increase in the general storage requirements and expectations of systems, driven by fields such as multimedia and 'big data' processing, has also accelerated the adoption of fast solid-state storage in embedded systems.

2.1 Buffered vs Direct I/O
The Linux storage model relies heavily on the buffering and caching of data in system memory, typically requiring data to be copied multiple times before it reaches its ultimate destination. The kernel provides the 'direct I/O' file access method to reduce the amount of memory activity involved in reading and writing data from a block device, allowing data to be copied directly to and from an application's memory space without being transferred via the page cache. While this allows applications more-direct access to storage devices, it can also create restrictions and have a severe negative impact on storage speeds if used incorrectly. In the past, there has been some resistance to the direct I/O functionality of Linux [11], partly due to the benefits of utilising the page cache that are removed with direct I/O, and the large disparity between CPU/memory and storage speeds meaning there were rarely any situations where the overhead of additional memory copies was significant enough to cause a slowdown. However, when storage is fast and the speed of copying data around memory is slow, using direct I/O can provide a significant performance improvement if certain criteria are met.

Figure 1: The operation of direct I/O in the Linux kernel, bypassing the page cache

Figure 1 shows the basic principles of standard and direct I/O, with direct I/O bypassing the page cache and removing the need to copy data from one area of memory to another between storage devices and applications.

One of the main issues with direct I/O is the large overhead caused when dealing with data in small block sizes. Even when using a fast storage device, reading and writing small amounts of data is far slower per byte than larger sizes, due to constant overheads in communication and processing that do not scale with block size. Without kernel buffers in place to help optimise disk accesses, applications that use small block sizes will suffer greatly in storage speed when using direct I/O, compared to when utilising the kernel's data caching mechanisms, which queue requests to access hardware more efficiently. The performance of accessing large block sizes on a storage device does not suffer from this issue, however, so applications that either inherently use large block sizes, or use their own caching mechanisms to emulate large block accesses, can use direct I/O effectively where required.

A further issue with the implementation of direct I/O in the Linux kernel is that it is not standardised, and is not part of the POSIX specification, so its behaviour and safety cannot necessarily be guaranteed for all situations. The formal definition of the O_DIRECT flag for the open() system call is simply to "try to minimize cache effects of the I/O to and from this file" [1], which may be interpreted differently (or not at all) by various file systems and kernel versions.
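As a concrete sketch of the restrictions involved (an illustrative example, not the test code used in section 3; behaviour may vary between file systems and kernel versions, and the file name and alignment value are assumptions), the following C fragment opens a file with O_DIRECT and performs a single aligned write. Direct I/O transfers on Linux generally require the buffer address, transfer size and file offset to be aligned, typically to the logical block size of the underlying device, which is why the buffer is allocated with posix_memalign():

/* direct_write.c - minimal sketch of a single direct I/O write.
 * Assumes a 4096-byte alignment is acceptable to the file system and
 * device; the required alignment can be queried per device if needed.
 */
#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096;               /* assumed device block size */
    const size_t block = 16 * 1024 * 1024;   /* 16 MiB transfer */
    void *buf = NULL;

    /* O_DIRECT requires a suitably aligned user buffer. */
    if (posix_memalign(&buf, align, block) != 0)
        return EXIT_FAILURE;
    memset(buf, 0, block);

    int fd = open("testfile.dat",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");   /* e.g. not supported by the file system */
        return EXIT_FAILURE;
    }

    /* The transfer size and current file offset must also be aligned. */
    if (write(fd, buf, block) < 0)
        perror("write");            /* EINVAL often indicates an alignment problem */

    close(fd);
    free(buf);
    return EXIT_SUCCESS;
}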
2.2 Real-time Storage Implications
Many high-performance storage applications require consideration of real-time constraints, due to external producers or consumers of data running independently of the system storing it – if a storage system cannot save or provide data at the required speed then critical information may be lost or the system may malfunction. While solid-state storage devices have far more consistent access times than mechanical storage, making them more suitable for time-critical applications, if the CPU of a system is proving to be a bottleneck in storage access times, the ability to maintain a consistent speed of data access relies heavily on CPU utilisation.

If the CPU can be removed as far as possible from the operation of copying data to storage, the impact of other processes on this will be reduced, increasing the predictability of storage operations and making real-time guarantees more feasible. This could be achieved through methods such as hardware acceleration, as well as simplification of the software storage stack.

2.3 Motivation
A number of examples exist where fast and reliable access to storage is required by an embedded system, which may be limited by CPU or memory resources when standard Linux file system operations are used.

Embedded accelerators are increasingly being investigated for use in high-performance computing environments, due to their energy efficiency when compared to traditional server hardware [8, 7]. CPU usage when performing storage operations has also been identified as an issue in server situations, using large amounts of energy compared to the storage devices themselves, and motivating research into how storage systems can be made more efficient [4, 9].

Standalone embedded systems that use storage devices also create motivation for efficient access to storage, for applications such as logging sensor data and recording high-bandwidth video streams [6]. Often, external data sources will have constraints on the speeds required for their storage, for example the number of frames of video that must be stored each second, so any methods that can help to meet these requirements while keeping energy usage at a minimum are desirable.

Consider a basic Linux application that reads a stream of data from a network interface and writes it to a continuous file on secondary storage using standard file operations. Disregarding any other system activity and additional operations performed by the file system, data will be copied a minimum of six times on its path from network to disk (a minimal sketch of such an application is given after the list):

1. From the network device to a buffer in the device driver
2. From the driver's buffer to a general network-layer kernel buffer
3. From the kernel buffer to the application's memory space
4. From the application's memory space to a kernel file buffer
5. From the file buffer to the storage device driver
6. From the driver's buffer to the storage device itself
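The following sketch (illustrative only, with the socket set-up omitted, a hypothetical block size chosen for simplicity, and minimal error handling) shows the user-space half of such an application; the recv()/write() pair in the loop accounts for copies 3 and 4 above, while the remaining copies happen inside the kernel, the drivers and the hardware:

/* stream_to_disk.c - sketch of the user-space half of a network-to-disk
 * streaming application. recv() performs copy 3 (kernel socket buffer to
 * application memory) and write() performs copy 4 (application memory to
 * the kernel page cache); the other copies occur in drivers and hardware.
 * 'sock' is assumed to be an already-connected socket descriptor.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define BLOCK_SIZE (1024 * 1024)    /* 1 MiB application buffer */

int stream_to_file(int sock, const char *path)
{
    char *buf = malloc(BLOCK_SIZE);
    if (buf == NULL)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    for (;;) {
        ssize_t received = recv(sock, buf, BLOCK_SIZE, 0);   /* copy 3 */
        if (received <= 0)                                   /* EOF or error */
            break;

        ssize_t total = 0;
        while (total < received) {
            ssize_t written = write(fd, buf + total,         /* copy 4 */
                                    (size_t)(received - total));
            if (written < 0)
                goto out;
            total += written;
        }
    }

out:
    close(fd);
    free(buf);
    return 0;
}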
This process has little impact on overall throughput if either storage or network speed is slow relative to main memory and CPU; however, as soon as this balance changes, any additional memory copying can have a severe impact. Techniques such as DMA can help to reduce the CPU load related to copying data from one memory location to another, however this relies on hardware and driver support, and does not fully tackle the inefficiencies of unnecessary memory copies.

One advantage of the kernel using its page cache to store a copy of data is the ability to access that data at a later time without having to load it from secondary storage; however, this has no benefit if data is solely being written to or read from a disk as part of a streaming application, because by the time the data is needed a second time it is likely that it has already been purged from the cache.

3. EXPERIMENTAL WORK
In order to examine the effects that a slow system can have on the performance of storage devices, and to identify the potential bottlenecks present in the Linux storage stack, we performed a number of experiments with storage operations while collecting profiling and system performance information.

3.1 Experimental Set-up
The experimental set-up consisted of an Avnet ZedBoard Mini-ITX development board connected to storage devices using its PCI Express Gen2 x4 connector. The ZedBoard Mini-ITX provides a Xilinx Zynq-7000 system-on-chip, which combines a dual-core ARM Cortex-A9 processor (clocked at 666MHz) with a large amount of FPGA fabric, alongside 1GiB of DDR3 RAM and many other on-board peripherals.

The system uses Linux 3.18 (based on the Xilinx 2015.2 branch) running on the ARM cores, while an AXI-to-PCIe bridge design is programmed on the FPGA to provide an interface between the processor and PCI Express devices.

To provide a range of results, two storage devices were tested with the system: a Western Digital Blue 500GB SATA III hard disk drive, connected through a StarTech SATA III RAID card; and an Intel SSD 750 400GB. While both devices use the same PCI Express interface for their physical connection to the board, the RAID card uses AHCI for its logical storage interface, whereas the SSD uses the more efficient NVMe interface.

Due to limitations of the high-speed serial transceiver hardware on the Zynq SoC, the speed of the SSD interface is limited to PCI Express Gen2 x4 (from its native Gen3 x4), reducing the maximum four-lane bandwidth from 3940MB/s to 2000MB/s. While this is still far faster than the 600MB/s maximum of the SATA III interface used by the HDD, it means the SSD will never achieve its advertised maximum speed of 2200MB/s in this hardware set-up.

3.2 Data Copy Tests
To obtain an indication of the operating speeds of the storage devices at various block sizes with minimal external overhead, we performed basic testing using the Linux dd utility. For write tests, /dev/zero was used as a source file, and for read tests, /dev/null was used as a destination. Both storage devices were freshly formatted with an ext4 file system before each test.

For each block size and storage device, four tests were performed: reading from a file on the device, writing to a file on the device, and reading and writing with direct I/O enabled (using the iflag=direct and oflag=direct operands of dd respectively). Additionally, read and write tests were performed with a 512MiB tmpfs RAM disk (/dev/shm) in order to determine possible maximum speeds when no external storage devices or low-level drivers were involved. Each test was performed with and without the capture of system resource usage and collection of profiling data, so results could be gathered without any additional overheads caused by these measurements.
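For reference, the following C sketch approximates what the dd write tests measure; it is not the exact methodology used here (the file name, transfer sizes, alignment value and the final fsync() are illustrative assumptions). It times a sequential write of a given total size at a given block size, optionally using O_DIRECT:

/* seq_write_bench.c - rough equivalent of a dd write test: writes 'total'
 * bytes sequentially in 'block'-sized chunks and reports the average
 * throughput. The 4096-byte alignment is an assumption.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double seq_write_mib_s(const char *path, size_t block,
                              long long total, int use_direct)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC | (use_direct ? O_DIRECT : 0);
    int fd = open(path, flags, 0644);
    if (fd < 0)
        return -1.0;

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, block) != 0) {  /* aligned for O_DIRECT */
        close(fd);
        return -1.0;
    }
    memset(buf, 0, block);                         /* mimic reading /dev/zero */

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (long long done = 0; done < total; done += (long long)block)
        if (write(fd, buf, block) != (ssize_t)block)
            break;

    fsync(fd);                                     /* include write-back time */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;

    close(fd);
    free(buf);
    return (double)total / (1024.0 * 1024.0) / secs;
}

int main(void)
{
    /* e.g. 1 GiB total in 1 MiB blocks, buffered then direct */
    printf("buffered: %.1f MiB/s\n",
           seq_write_mib_s("testfile.dat", 1 << 20, 1LL << 30, 0));
    printf("direct:   %.1f MiB/s\n",
           seq_write_mib_s("testfile.dat", 1 << 20, 1LL << 30, 1));
    return 0;
}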
With the secondary storage devices, data was recorded for 20GiB sequential transfers, and with the RAM disk, 256MiB transfers were used due to the lower available space. Sequential transfers are used as they represent the type of problem that is likely to be encountered when requiring high-speed storage in an embedded system – reading and writing streams of contiguous data – as well as being simple to implement and test. Storage devices, especially mechanical hard disks, generally perform faster with sequential transfers than with random accesses, and operating and file system overheads are also likely to be greater for non-sequential access patterns, so further experimentation will be necessary to determine whether the same effects are present when using different I/O patterns.

3.3 System Resource Usage and Profiling
To collect information about system resource usage during each test, we used the dstat utility [2] to capture memory usage, CPU usage and storage device transfer speeds each second.

Additionally, to determine the amount of execution time that is spent in each relevant function within the user application, kernel and associated libraries during the tests, we ran dd within the full-system profiler, operf, part of the OProfile suite of tools [3]. The impact on performance caused by profiling is kept to a minimum through support from the CPU hardware and the kernel performance events subsystem; however, slight overheads are likely while the profiler is running, potentially causing slower speeds and slight differences in observed data.

3.4 Results
The following results were gathered using the system and methods described above, in order to investigate various aspects of storage operations.

3.4.1 Read and Write Speeds
Figure 2 shows the average read and write speeds for a number of block sizes when transferring data to or from the storage devices and RAM disk.

Figure 2: Plots of average read and write speeds for each device and block size tested

The standard read and write speeds for both storage devices are very similar, with the SSD only performing slightly faster than the HDD for all block sizes tested. This suggests that bottlenecks exist outside of the storage devices in the test system, caused either by the CPU or by system memory bandwidth, as the SSD would otherwise be expected to perform significantly faster than the HDD in both read and write speed.

For the 512B block size, speeds are slower on both devices; however, there is little difference in speed once block sizes increase above this. This slow speed could be due to the significant number of extra context switches at a low block size being a bottleneck, rather than the factors limiting storage operations at 4KiB and above.

The consistently slightly higher speeds seen with the SSD are likely to be caused by it using an NVMe logical interface to communicate with the operating system, compared to the less-efficient AHCI interface used by the SATA HDD. If the storage operations are indeed experiencing a CPU bottleneck, then the more-efficient low-level drivers of NVMe would allow for this higher speed. This could be confirmed by repeating the tests with a SATA SSD connected to the same RAID card as the HDD, instead of using a separate NVMe device.

When both reading and writing using standard I/O, RAM disk performance is far higher than that of both non-volatile storage devices. This was expected even when bottlenecks exist outside of the storage devices themselves, as the kernel optimises accesses to tmpfs file systems by avoiding the page cache, thus requiring fewer memory copy operations.

3.4.2 Impact of Direct I/O
When performing the same tests with direct I/O enabled, speeds to the storage devices are generally higher when the block size is sufficiently large to overcome the overheads involved, such as increased communication with hardware. 512B and 4KiB block sizes are slower than the standard write tests, as the kernel cannot cache data and write it to the device in larger blocks, but larger block sizes are faster.

It appears that the HDD is limited by other factors when block sizes of 512KiB and above are used, which may be due to the inefficiencies of AHCI, or simply the speed limitations of the disk itself. This is reinforced by the HDD direct I/O read speeds being approximately equal to, or lower than, standard I/O speeds to the device, rather than showing the performance increases of the SSD.

For the SSD, maximum direct write speeds are over double those of standard I/O, and maximum direct read speeds also show a significant improvement; however, these are both still far lower than the rated speeds of the device. A further bottleneck appears to be encountered between 1MiB and 16MiB direct I/O block sizes, suggesting that at this point the block size is large enough to overcome any communication and driver overheads and the earlier limitations experienced with non-direct I/O are once again affecting speeds. This speed limit (at around 230MiB/s) also matches the write speed limit of the RAM disk when using block sizes between 512KiB and 16MiB, suggesting that both the SSD and RAM disk are experiencing the same bottleneck here.

Figure 3: Plot of average single-core kernel CPU usage for each block size tested

Figure 4: Plot of proportion of time spent in memory copy functions for each block size tested

3.4.3 CPU Usage
Figure 3 shows the mean single-core kernel CPU usage across block sizes for each test, where single-core figures are calculated as the maximum of the two cores for each sample recorded. In general, it can be seen that a large amount of CPU time is spent in the kernel across the tests, with all but HDD direct I/O using an entire CPU core of processing for large block sizes, strongly suggesting that the bottlenecks implied by the speed results are caused by inadequate processing power.

The low system CPU usage of the HDD direct I/O tests suggests that the bottleneck may indeed be the disk itself, unlike the SSD tests, which show clearer, more consistent limits in their transfer speeds.

For the SSD direct I/O write test, the 16MiB block size where the speed bottleneck begins corresponds to where CPU usage reaches 100%, further suggesting that the bottleneck is caused by processing on the CPU.

Further experimentation to test the direct impact that CPU speed has on storage speeds could be carried out by repeating the tests while altering the clock speed of the ARM cores, or limiting the number of CPU cores available to the operating system.

3.4.4 Profiling Results
Results from profiling show that for both read and write tests, a large amount of CPU time is spent copying data between user and kernel areas of memory. Figure 4 shows the percentage of total execution time spent in the kernel functions __copy_to_user (for read) and __copy_from_user (for write), used for copying data to and from user space respectively. The direct I/O write tests spend no time in these functions, but instead a large amount of time is spent flushing the CPU data cache in the v7_flush_kern_dcache_area function.

Both read and direct I/O operations, which rely on more immediate access to storage devices, additionally spend a large amount of CPU time waiting for device locks to be released in the _raw_spin_unlock_irq function.

4. RELATED WORK
There are several areas of related work that suggest the current position of storage in a system architecture needs rethinking, due to the introduction of fast storage technologies, and due to inefficiencies in the software storage stack and file systems. Their focus is not entirely on embedded systems, but also on the increasing demand for efficient and fast storage in high-performance computing environments.

Refactor, reduce, recycle.
The discussion in [10] advocates the necessary simplification of software storage stacks through refactoring and reduction, in order to make them able to fully utilise emerging high-speed non-volatile memory technologies, and lower the relative processor impact caused by fast I/O. It demonstrates that due to the large increase in storage speeds available with these devices, the traditional balance between slow storage and fast CPU speed is broken, and that improvements can be made through changes to the way storage is handled by the operating system. The work does not explicitly reference embedded systems, however the same theories apply in greater measure, due to even greater restrictions on processing resources.

Hardware file systems.
Efforts such as [5] and [12] attempt to improve storage performance in an embedded system by offloading certain file system operations to hardware accelerator cores on an FPGA. While the hardware file system implementation in [5] is quite limited and specialised in its operation, it is motivated by a similar need to optimise storage access beyond what was possible with the CPU in the target system. These hardware file system accelerators are aimed more towards usage in high-performance computing environments than stand-alone embedded systems.

5. DISCUSSION AND FURTHER WORK
The results presented in section 3 show that when CPU resources are sufficiently constrained, there are clear bottlenecks in storage operations besides the access times of the storage devices themselves. In order to utilise the full potential of high-speed storage devices in an embedded Linux environment, and to avoid their use degrading the operation of other tasks running in the system, changes must be made to the storage stack to optimise how they are accessed.

5.1 Potential Solutions
There are several potential solutions to the problems covered, ranging from optimisations in existing software implementations to more radical system architecture changes.

5.1.1 VFS Optimisations
It may be possible to reduce storage overheads by restructuring the storage stack in Linux to better optimise it for high-speed storage with lower CPU usage. Results from profiling may be used to identify the areas of the storage stack that are performing particularly inefficiently, or that are simply unnecessary for the required tasks. Such optimisations would potentially require large changes to the structure of the Linux kernel.

5.1.2 Improved Direct I/O
The performance results show that using direct I/O can give a major boost to performance, especially with the fast SSD, if block sizes are above a reasonable threshold for disk access operations, but can also severely reduce performance if used for small block sizes.

Given its potential benefits, a reimplemented pseudo-direct I/O could operate with the benefits of direct I/O for large block sizes, but attempt to efficiently buffer storage device accesses when block sizes are below a practical limit. Standardising direct I/O so its operation can be guaranteed across file systems and kernel versions would also allow its usage to be more widely accepted.
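One way to approximate this idea in user space today (a sketch only, under the assumptions of a fixed 4096-byte alignment, strictly sequential writes and minimal error handling; the authors' proposal targets the kernel itself) is a small write-combining layer that accumulates small writes in an aligned staging buffer and flushes it to an O_DIRECT descriptor only when a large, aligned block has been assembled:

/* pseudo_direct.c - user-space sketch of 'pseudo-direct I/O': small writes
 * are staged in an aligned buffer and flushed to an O_DIRECT descriptor in
 * large blocks. Handling of the final partial block is omitted for brevity.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FLUSH_SIZE (4 * 1024 * 1024)   /* flush in 4 MiB direct writes */

struct pdio {
    int    fd;
    char  *stage;                      /* aligned staging buffer */
    size_t used;
};

struct pdio *pdio_open(const char *path)
{
    struct pdio *p = calloc(1, sizeof(*p));
    if (p == NULL)
        return NULL;
    if (posix_memalign((void **)&p->stage, 4096, FLUSH_SIZE) != 0) {
        free(p);
        return NULL;
    }
    p->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (p->fd < 0) {
        free(p->stage);
        free(p);
        return NULL;
    }
    return p;
}

/* Buffer a small write; issue a large direct write whenever the staging
 * buffer fills, keeping the file offset aligned. */
ssize_t pdio_write(struct pdio *p, const void *data, size_t len)
{
    const char *src = data;
    size_t remaining = len;

    while (remaining > 0) {
        size_t chunk = FLUSH_SIZE - p->used;
        if (chunk > remaining)
            chunk = remaining;
        memcpy(p->stage + p->used, src, chunk);
        p->used += chunk;
        src += chunk;
        remaining -= chunk;

        if (p->used == FLUSH_SIZE) {           /* full, aligned block */
            if (write(p->fd, p->stage, FLUSH_SIZE) != FLUSH_SIZE)
                return -1;
            p->used = 0;
        }
    }
    return (ssize_t)len;
}

void pdio_close(struct pdio *p)
{
    close(p->fd);                              /* final partial block dropped */
    free(p->stage);
    free(p);
}

Note that this user-space approach still performs one copy into the staging buffer, so it mainly saves per-request overhead and page-cache management rather than the copy itself; a kernel-level implementation, as suggested above, could avoid the additional copy as well.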
                                                                  [5] A. Mendon. The case for a Hardware Filesystem. PhD
5.1.3    Hardware Acceleration                                        thesis, University of North Carolina at Charlotte, NC,
                                                                      2012.
   One possible method of relieving CPU load during storage
operations would be to introduce hardware acceleration into       [6] National Instruments. Data acquisition: I/O for
a system, in order to perform some of the tasks associated            embedded systems. White Paper, Oct. 2012. Available:
with the software storage stack in hardware instead. These            http://www.ni.com/white-paper/7021/en/.
accelerators could range from simple direct memory access         [7] A. Putnam et al. A reconfigurable fabric for
(DMA) units, used to perform the expensive memory copy                accelerating large-scale datacenter services. In Proc.
operations without taking up CPU time, or more-complex                41st Int. Symp. Computer Architectures, June 2014.
file-system-aware accelerators that access the storage device     [8] R. Sass, W. Kritikos, A. Schmidt, S. Beeravolu, and
directly, effectively shifting the hardware/software divide           P. Beeraka. Reconfigurable computing cluster (RCC)
further up the storage stack.                                         project: Investigating the feasibility of FPGA-based
   Introducing hardware that can access storage indepen-              petascale computing. In Proc. 15th Annu. IEEE
dently of the CPU may give an advantage for applications              Symp. Field-Programmable Custom Computing
that use large streams of data, as more than just the stor-           Machines, Apr. 2007.
age device can be attached to the hardware. For example,          [9] P. Sehgal, V. Tarasov, and E. Zadok. Evaluating
a hardware accelerator could directly receive data from a             performance and energy in file system server
hardware video encoder and write it straight to a file on per-        workloads. In Proc. 8th USENIX Conf. File and
sistent storage, with little CPU intervention and no buffering        Storage Technologies, Feb. 2010.
required in main system memory.                                  [10] S. Swanson and A. M. Caulfield. Refactor, reduce,
                                                                      recycle: Restructuring the I/O stack for the future of
5.2     Further Work                                                  storage. Computer, 46(8):52–59, Aug. 2013.
   There is potential for much deeper investigation into the     [11] L. Torvalds. Re: O DIRECT question. Linux Kernel
operation of fast storage devices in an embedded Linux en-            Mailing List, Jan. 2007. Available:
vironment, in order to fully understand the bottlenecks in-           https://lkml.org/lkml/2007/1/11/121.
volved and propose more comprehensive solutions.                 [12] V. Varadarajan, S. K. R, A. Nedunchezhian, and
   While the results presented in section 3 highlight some            R. Parthasarathi. A reconfigurable hardware to
examples of circumstances where storage speeds are heavily            accelerate directory search. In Proc. IEEE Int. Conf.
limited by areas other than storage devices themselves, they          High Performance Computing, Dec. 2009.
only focus on basic tests working with sequential data on