<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GCMA: Guaranteed Contiguous Memory Allocator</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>SeongJae Park</string-name>
          <email>sjpark@dcslab.snu.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minchan Kim</string-name>
          <email>minchan.kim@lge.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heon Y. Yeom</string-name>
          <email>yeom@snu.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>EWiLi'15, October 8th, 2015, Amsterdam, The Nether-</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LG Electronics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Seoul National University</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>lands., Copyright retained by the authors.</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>While demand for physically contiguous memory allocation is still alive, especially in embedded system, existing solutions are insu cient. The most adapted solution is reservation technique. Though it serves allocation well, it could severely degrade memory utilization. There are hardware solutions like Scatter/Gather DMA and IOMMU. However, cost of these additional hardware is too excessive for low-end devices. CMA is a software solution of Linux that aims to solve not only allocation but also memory utilization problem. However, in real environment, CMA could incur unpredictably slow latency and could often fail to allocate contiguous memory due to its complex design. We introduce a new solution for the above problem, GCMA (Guaranteed Contiguous Memory Allocator). It guarantees not only memory space e ciency but also fast latency and success by using reservation technique and letting only immediately discardable to use the area e ciently. Our evaluation on Raspberry Pi 2 shows 15 to 130 times faster and more predictable allocation latency without system performance degradation compared to CMA.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Because resource requirement of processes is not predictable,
keeping high resource availability with high utilization has
always been one of the hardest challenges. Memory
management has not been exception either. Especially, large
memory allocation was pretty much impossible because of the
fragmentation problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The problem seemed to be
completely solved with the introduction of virtual memory [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
In real world, however, the demand for physically contiguous
memory still exists [
        <xref ref-type="bibr" rid="ref13 ref5 ref6">5, 6, 13</xref>
        ]. Typical examples are various
embedded devices such as video codec or camera which
requires large bu ers.
      </p>
      <p>A traditional solution for the problem was memory
reservation technique. The technique reserves su cient amount
of contiguous memory during boot time and use the area
only for contiguous memory allocation. This technique
guarantees fast and successful allocation but could degrade
memory space e ciency if contiguous memory allocation does not
use whole reserved area e ciently. There are alternative
solutions using additional hardware like Scatter/Gather DMA
or IOMMU. They solve the problem by helping devices to
access discontiguous memory as if it was contiguous, like
virtual memory system does to processes. However, additional</p>
      <p>still-blogbench</p>
      <sec id="sec-1-1">
        <title>Baseline CMA</title>
        <p>tiy 1.0
l
iab 0.8
rob 0.6
eP 0.4
v
ltia 0.2
um0.0
u
C
0
1,000
2,000
3,000
4,000
5,000</p>
      </sec>
      <sec id="sec-1-2">
        <title>Latency (milli-seconds)</title>
      </sec>
      <sec id="sec-1-3">
        <title>Baseline CMA</title>
      </sec>
      <sec id="sec-1-4">
        <title>GCMA</title>
        <p>hardwitla1re.0can be too expensive to low-end devices.</p>
        <p>
          y
Linuabx0.8kernel has a subsystem for the problem, namely,
i
CMAro(C0.o6ntiguous Memory Allocator) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. It basically
folb
lows tePhe0.r4eservation technique and boosts memory e ciency
v
by alltiow0.i2ng the reserved area to be used by other movable
a
l
pages.umA0.t0 the same time, it keeps contiguous memory
allocationuCbe a0vailabl1e,00b0y mig2,r0a0t0ing m3,o0v0a0ble 4p,a0g0e0s ou5t,0o0f0 the
reserved area if those movable pages are required for
contiguous memory allocationL.atency (milli-sCecMonAdsi)s not very popular
        </p>
        <p>
          However,
because memory migration usually takes a long time and
could even fail from time to time. Figure 1 shows CDF of
latency for a photo shot on Raspberry Pi 2 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] under
memory intensive background task, which is a realistic workload.
Baseline means Raspberry Pi 2 using the reservation
technique and CMA means modi ed to use CMA as alternative.
Maximum latency for a photo shot using CMA is about 9.8
seconds while it is 1.6 seconds with reservation technique.
Normally, 9.8 seconds for a photo shot is not acceptable
even in the worst case. As a result, most system uses the
reservation technique or additional hardware as a last resort
despite expensive cost.
        </p>
        <p>
          To this end, we propose a new contiguous memory
allocator subsystem, GCMA (Guaranteed Contiguous Memory
Allocator) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], that guarantees not only memory space e
ciency but also fast response time and successful allocation
by carefully selecting additional clients foPragtehe1 reserved area
as appropriate ones only. In this paper, we introduce the
idea, design and evaluation results of GCMA
implementation. GCMA on Raspberry Pi 2 shows 15 to 130 times faster
and much predictable response time of contiguous pages
allocation without system performance degradation compared
to CMA, and zero overhead on latency of a photo shot under
background task compared to the reservation technique.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Virtual Memory Management</title>
      <p>
        Virtual memory [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] system gives a process an illusion that it
owns entire contiguous memory space by letting the process
use logical address which can be later mapped into
discontiguous physical memory space. Because processes access
memory using only logical address, system can easily
allocate large contiguous memory to processes without
fragmentation problem. However, giving every process the illusion
cannot be done easily in reality because physical memory in
system is usually smaller than the total sum of each process's
illusion memory space. To deal with this problem, kernel
conceptually expands memory using another large storage
like hard disk drive or SSD. However, because large
storages are usually much slower than memory, only pages that
will not be accessed soon should be in the storage. When
memory is full and a process requires one more page, kernel
moves content of a page that is not expected to be accessed
soon into the storage and gives the page to the process.
      </p>
      <p>
        In detail, pages of a process can be classi ed into 2 types,
le-backed page and anonymous page. File-backed page
have content of a le that are cached in memory via page
cache [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] for fast I/O. The page can be freed after
synchronizing their content with the backing le. If the page
is already synchronized with, it can be freed immediately.
Anonymous page is a page that has no backing le. A page
allocated from heap using malloc() can be an example.
Because anonymous page has no backing le, it cannot be freed
before its content is stored safely elsewhere. For this
reason, system provides backing storage for anonymous pages,
namely, swap device.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Contiguous Memory Allocation</title>
      <p>Because MMU, a hardware unit that translates logical memory
address into physical location of the memory, is sitting
behind CPU, devices that communicate with system memory
directly without CPU have to deal with physical memory
address rather than taking advantage of the virtual
memory. If the device, for example, requires large memory for an
image processing bu er, the bu er should be contiguous in
physical address space. However, allocating physically
contiguous memory is hard and cannot always be guaranteed
due to fragmentation problem.</p>
      <p>
        Some devices support hardware facilities that are helpful
for the problem, such as Scatter/Gather DMA or IOMMU [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Scatter/Gather DMA can gather bu er from scattered
memory regions. IOMMU provides contiguous memory address
space illusion to devices similar as MMU. However, neither
Scatter/Gather CMA nor IOMMU is free. Supporting them
requires additional price and energy, which can be critical in
low-end device market unlike high-end market. Moreover, it
seems that the trend would not disappear in the near
future due to advance of IoT paradigm and emerging low-end
smartphone markets. Low-end devices market is not the
only eld that could utilize e cient contiguous memory
allocation. One good example is huge page [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] management.
Because huge page could severely reduce TLB over ow,
efcient huge page management is important for system with
memory intensive workloads like high performance
computing area. In this case, neither Scatter/Gather DMA nor
IOMMU can be a solution because they do not guarantee
real physically contiguous memory. This paper focuses on
the low-end device problem, though.
      </p>
      <p>
        Linux kernel already provides a software solution called
CMA [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. CMA, which is based on reservation technique,
applies the basic concept of virtual memory that pages in
virtual memory can move to any other place in physical
memory. It reserves memory for contiguous memory
allocation during boot time as reservation technique does. In the
meantime, CMA lets movable pages to reside in the reserved
area to keep memory utilization. If contiguous memory
allocation is issued, appropriate movable pages are migrated
out from the reserved area and the allocation is done using
the freed area. However, unlike our hope, page migration is
a pretty complex process. First of all, it should do content
copying and re-arranging of mapping between logical address
and physical address. Second, there could be a thread
holding the movable page in kernel context. Because the thread
is already holding the page, it cannot be migrated until the
holding thread releases it. In consequence, CMA guarantees
neither fast latency nor success. Another solution for fast
allocation latency based on CMA idea was proposed by Jeong
et al.[
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. It achieves the goal by restricting use of reserved
area as evicted clean le cache rather than movable pages.
Because anonymous pages cannot be in the reserved area,
the memory utilization could su er under non le intensive
workload despite its su ciently fast latency.
3.
      </p>
    </sec>
    <sec id="sec-5">
      <title>GCMA: GUARANTEED CMA</title>
    </sec>
    <sec id="sec-6">
      <title>3.1 Mistake on Design of CMA</title>
      <p>In conceptual level, basic design of CMA is as below:
1. Reserve contiguous memory space and let contiguous
memory allocation to be primary client of the area.
2. Share the reserved area with secondary clients;
3. Reclaim memory used by secondary clients whenever
a primary client requests.</p>
      <p>In detail, CMA chooses movable pages as its secondary
client while using page migration as a reclamation method.
Actually, basic design of CMA has no problem; the
problem of CMA is that movable pages are actually not an
appropriate candidate for secondary client. Because of many
limitations of page migration described in Section 2.2,
movable pages cannot easily give up the reserved area. That's
why CMA su ers from slow latency and sometimes even
fails. This problem can be avoided by selecting appropriate
candidates instead of movable pages and managing them
effectively.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Appropriate Secondary Clients</title>
      <p>We have seen a bad candidate for secondary clients,
movable pages. The problem is that movable pages cannot be
freed for the primary client (contiguous memory allocation)
immediately because of high cost and possible failures of
page migration. Therefore, appropriate secondary clients
should satisfy the following three requirements. First of all,
it should be freed with an a ordable cost. Second, it should
not be accessed frequently because the area could be
suddenly discarded for primary client. Finally, it should be out
of kernel scope to avoid pinning of the page to other threads.</p>
      <sec id="sec-7-1">
        <title>Swap Device</title>
      </sec>
      <sec id="sec-7-2">
        <title>File</title>
        <p>For these purposes, nal chance cache for clean pages
that evicting from page cache and swapped out pages from
system memory (Section 2.1) are good candidates.
Evicting clean pages can be freed immediately without any cost.
Swapping pages can be freed just after write-back. Those
pages would not be accessed soon because they are chosen
as reclaim target which kernel expected so. They are out
of kernel scope because they are already evicted or swapped
out. Additionally, system can keep high memory utilization
because the nal chance cache can utilize not only le-backed
pages, but also anonymous pages. As consequence, GCMA
can be designed as a contiguous memory allocator using nal
chance cache for evicting clean pages and swapping pages as
its secondary clients. In other words, GCMA carries out
two missions. A contiguous memory allocator and a
temporal shelter for evicting clean pages and swapping pages is
that.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.3 Limitation and Optimization</title>
      <p>Although secondary clients of GCMA are e ective enough
for contiguous memory allocation, it could have adverse
effect on system performance compared to CMA. While
movable pages, the secondary client of CMA, would be located
in reserved area with only slight overhead, secondary clients
of GCMA requires reclaim overhead before they are located
in reserved area and it would consume processing power.
Moreover, swapping pages would require write-back to swap
device. It could consume I/O resource as well.</p>
      <p>
        To avoid the limitation, GCMA utilizes the secondary
client as write-through mode cache [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. With write-through
mode, content of pages being swapped out will be written
to GCMA reserved area and backing swap device
simultaneously. After that, GCMA can discard the page when
necessary without incurring additional overhead because the
content is already in swap device. Though using write-through
mode could enhance GCMA latency, system performance
could be degraded because it would consume much I/O
resource. To minimize the e ect, we suggest constructing the
swap device with a compressed memory block device[
        <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
        ],
which could enhance write-through speed and swapping
performance. We recommend Zram[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as a swap device, which
is an o cial recommendation from Google [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for Android
devices with low memory. After those optimization applied,
GCMA proved it has no performance degradation at all from
our evaluation owing to simple design and e cient
implementation. Detailed evaluation setup and results are
described in Section 5.
      </p>
    </sec>
    <sec id="sec-9">
      <title>4. IMPLEMENTATION</title>
      <p>To minimize migration e ort of CMA users, GCMA shares
CMA interface. Therefore, CMA client code can use GCMA
alternatively by changing only one function call code from
its initialization or turning on a kernel con guration option</p>
      <sec id="sec-9-1">
        <title>Cleancache</title>
      </sec>
      <sec id="sec-9-2">
        <title>Frontswap CMA Interface</title>
      </sec>
      <sec id="sec-9-3">
        <title>Dmem Interface</title>
      </sec>
      <sec id="sec-9-4">
        <title>Dmem Logic</title>
      </sec>
      <sec id="sec-9-5">
        <title>Reserved Area</title>
        <p>without any code change.</p>
        <p>GCMA is implemented as a small independent Linux
kernel module to keep code simple rather than increasing
comprehensive hooks in kernel. Only a few lines of CMA code
have been modi ed for interface integration and
experimental features for evaluation. As evidence, GCMA
implementation has been easily ported to various Linux versions, which
have always been done in few minutes. In total, our
implementation uses only about 1,300 lines of code for main
implementation and only about 200 lines of code for changes to
CMA. GCMA is published as a free / open source software
under GPL license at https://github.com/sjp38/linux.
gcma/releases/tag/gcma/rfc/v2.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4.1 Secondary Client Implementation</title>
      <p>
        Final chance cache for evicting clean pages and swapping
pages can make system performance improvements. The
fact has been well known in the Linux kernel community and
facilities for the chance, namely, Cleancache and Frontswap [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
have been proposed by the community. As those names
imply, they are nal chance cache for evicting clean pages and
swapping pages. To encourage more exible system con
guration, community implemented only front-end and interface
of them in Linux kernel and left back-end implementation to
other modules. Once the back-end implementation has been
registered, page cache and swap layer tries to put evicting
clean pages and swapping pages inside that back-end
implementation as soon as the victim is selected. If putting those
pages succeeds, future reference of those pages will be read
back from that back-end implementation which would be
much faster than le or swap device. Otherwise, the victim
would be pushed back to the le or swap device. Because
the design ts perfectly with the secondary clients of GCMA,
GCMA uses it by implementing those back-ends rather than
inventing the interface again.
      </p>
      <p>Back-end for Cleancache and Frontswap has similar
requirements because they are just software caches. To remove
redundancy, we implemented another abstraction for them,
namely, dmem (discardable memory). It is a key-value
storage that any entry can be suddenly evicted. It uses a hash
table to keep key-value entry and constructs buckets of the
table with red-black tree. It also tracks LRU list of entries
and evicts bunch of LRU entries if the storage becomes too
narrow as normal caches do. GCMA implements Cleancache
and Frontswap back-ends simply using the dmem by giving
its reserved area as an area for values of key-value pairs and
let Cleancache and Frontswap use it with information for
evicting clean pages and swapping pages as a key and
content of the page as a value. When GCMA needs a page used
by Cleancache or Frontswap for contiguous memory
allocation, it evicts the page associated entry with dmem interface.</p>
      <p>Figure 3 depicts overall architecture of GCMA.
CMA</p>
      <p>GCMA
)s 1,800
d
n
o
ec 1,200
s
i
l
l
i(m 600
y
c
n
e
t
a
L
0
20,000
30,000</p>
      <sec id="sec-10-1">
        <title>Number of requested contiguous pages</title>
        <p>5.1</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>EVALUATION</title>
    </sec>
    <sec id="sec-12">
      <title>Evaluation Environments</title>
      <p>Component
Processors
Memory
Storage</p>
      <p>Speci cation
900 MHz quad-core ARM Cortex-A7
1 GiB LPDDR2 SDRAM</p>
      <p>Class 10 SanDisk 16 GiB microSD card</p>
      <p>
        We use Raspberry Pi 2 Model B single-board computer [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
as our evaluation environment. It is one of the most popular
low-end mini computers in the world. Detailed speci cation
is described in Table 1.
      </p>
      <p>Page 1</p>
      <p>
        Detailed system con gurations we use in evaluations is
described in Table 2. Because Raspberry Pi development
team provides forked Linux kernel optimized for Raspberry
Pi via Github, we use the kernel rather than vanilla kernel.
The kernel we selected is based on Linux v3.18.11. We
represent the kernel as Linux rpi-v3.18.11. Raspberry
Pi has used reservation technique for contiguous memory
allocation from the beginning. After Linux implemented
CMA, Raspberry Pi started to support CMA from
November 2012 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However, Raspberry Pi development team
has found CMA problem and decided to support CMA in
uno cial way only [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. As a result, Raspberry Pi default
con guration is reservation technique yet.
      </p>
      <p>
        Our evaluation focus on the following three factors: First,
latency of CMA and GCMA. Second, e ect of CMA and
GCMA on a real workload, Raspberry Pi Camera [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Third,
e ect of CMA and GCMA to system performance.
      </p>
    </sec>
    <sec id="sec-13">
      <title>5.2 Latency of Contiguous Memory Allocation</title>
      <p>To compare contiguous memory allocation latency of CMA
and GCMA, we issue contiguous memory requests with
varying allocation sizes. The rst request is for 64 pages (256
KiB) and the size is doubled for subsequent requests until
the last request is for 32,768 pages (128 MiB). We give 2
seconds interval between each allocation to minimize any e ect
to other processes and the system. We repeat this 30 times
with both CMA and GCMA con guration.</p>
      <p>The average latencies of CMA and GCMA are shown in
Figure 4. GCMA shows 14.89x to 129.41x faster latency
compared to CMA. Moreover, CMA failed once for 32,768
pages allocation while GCMA did not even though the
workload has run with no background jobs. Even without any
background job, CMA could meet secondary client page that
need to be moved out because any movable page can be
located inside CMA reserved area. In the case, moving the
page out would consume lots of time and could fail in worst
case. The worst case actually happened once during 32,768
pages allocation. On the other hand, because only
evicting clean pages and swapping pages can be located inside
reserved area of GCMA, GCMA wouldn't need to discard
pages out unless memory intensive background workload
exists. Moreover, CMA code is much bigger and slower than
GCMA because CMA has to consider complicate situations
of movable pages and migration while GCMA needs to
consider only dmem. That's why GCMA shows much lower
latency than CMA even without any background workload.</p>
    </sec>
    <sec id="sec-14">
      <title>5.3 Distribution of Latencies</title>
      <p>For predictability comparison between CMA and GCMA, we
do the contiguous memory allocation workload again with
only 1,024 contiguous pages requests and draw CDF of
latencies. In this evaluation, we rst do the workload without
any background job to show latency of CMA and GCMA
itself without any interference. After that, we do the
workload with a background job to show latencies under realistic
memory stress. For the background job, we use Blogbench
benchmark. The benchmark is a portable le system
benchmark that simulates a real-world busy blog server.</p>
      <p>The result is shown in Figure 5. In ideal case without
any other background jobs, GCMA latencies mostly lie
under 1 millisecond while CMA latencies are anywhere from
4 milliseconds to even more than 100 milliseconds. Under
Blogbench as a background job, GCMA latencies mostly lie
under 2 milliseconds while CMA latencies lie from 4
milliseconds to even more than 4 seconds. The result says that
GCMA latency is not only fast but also predictable and
insensitive to background stress compared to CMA.</p>
      <p>Even without a background workload, CMA shows much
dispersed latency than GCMA because CMA is slower than
GCMA basically due to complexity and has more chance of
secondary client page clean-up. With a background
workload, CMA could meet more secondary client pages to
cleanup because any movable page of background job can be
inside CMA reserved area. In contrast, because only evicting
clean pages and swapping pages can be inserted to GCMA
reserved area, GCMA would not meet secondary client pages
to clean-up unless the background job has su ciently large
memory footprint.</p>
    </sec>
    <sec id="sec-15">
      <title>5.4 Raspberry Pi Camera Latency</title>
      <p>Though result from Section 5.3 is impressive, it shows only
a potential of GCMA, not any performance e ect on a real
workload. That's why we evaluate latency of photo shots
using Raspberry Pi Camera in this section.
CMA</p>
      <sec id="sec-15-1">
        <title>GCMA</title>
        <p>CMA</p>
        <p>GCMA</p>
        <p>
          For a photo shot, Raspberry Pi 2 usually allocates 64
contiguous pages 25 times asynchronously using a kernel thread
and keeps the allocated memory without release as long as
possible [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The maintained memory is used for
subsequent photo shots. Therefore, only the rst photo shot is
a ected from contiguous memory allocation. For convenient
evaluation, we con gure the system to release every memory
allocated for camera as soon as possible to make subsequent
photo shots to be also a ected.
        </p>
        <p>We measure latency of each photo shot using Raspberry
Pi Camera 30 times under every con guration with 10
seconds interval between each shot to eliminate any e ect from
previous shots. Each con guration uses Reservation
technique, CMA, and GCMA for the photo shot. To explicitly
show the e ect of contiguous memory allocation, we
measure contiguous memory allocation latencies caused for each
photo shot under CMA and GCMA.</p>
        <p>Without a background job, every con guration shows
similar photo shot latency of about 0.9 seconds. There is no
di erence between CMA and GCMA though GCMA shows
much faster latency than CMA from Section 5.3. There
are two reasons behind this. First, though CMA is about
15 times faster than CMA for each allocation required by
the camera, absolute time for the allocation is only 1
millisecond for CMA, 69 micro-seconds for GCMA. Because a
photo shot issues the allocation only 25 times,Ptahgee a1llocation
latency comprises only a small portion of the entire photo
shot latency. Second, CMA can do allocation almost
without any page migration if there are no background jobs and
the number of required pages is not many. In other words,
the environment is too optimal.</p>
        <p>
          To show real latency under a realistic environment, we
do the camera workload with Blogbench in background as
we did in Section 5.3 for a simulation of a realistic
background workload. To minimize scheduling e ect on latency,
we set priority of photo shot process higher than that of
background workload using nice [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] command.
        </p>
        <p>
          The result is described in Figure 6. CMA shows much
slower camera shot latency while GCMA shows similar
latency with Baseline using Reservation technique. In the
worst case, CMA even requires about 9.8 seconds to do a
photo shot while GCMA requires 1.04 seconds in the worst
case. Contiguous memory allocation latencies also show
similar but more dramatic result. Latencies of CMA lie between
4 milliseconds and 10 seconds while GCMA's latencies lie
between 750 micro-seconds and 3.32 milliseconds. This result
means contiguous memory allocation using CMA produces
extremely high and unpredictable latency under a realistic
situation. With this evaluation result, it's not surprising
that CMA on Raspberry Pi is not o cially supported [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
5.5
        </p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>Effect on System Performance</title>
      <p>lat ctx(usec)
bw le rd(MiB/s)
bw mem rd(MiB/s)
bw mem wr(MiB/s)</p>
      <p>
        Finally, to show performance e ect of CMA and GCMA
on system, we measure system performance under every
conguration using two benchmarks. First, we run a
microbenchmark called lmbench [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], 3 times and get an average
of results to show how performance of OS primitives is
affected by CMA and GCMA.
      </p>
      <p>The result is depicted in Table 3. Each line shows an
averaged measurement of context switch latency, le read
bandwidth, memory read bandwidth, and memory write
bandwidth. CMA and GCMA tend to show improved
performance compared to Baseline becPaaugsee1 of memory
utilization though the di erence is not so big. Di erences between
CMA and GCMA are just in margin of error.</p>
      <p>To show performance under realistic workload, we run
Blogbench job 3 times and measure average score
normalized by Baseline con guration result. At the same time,
we continuously issue photo shots on the background with
10 seconds interval as described in Section 5.4 to simulate
realistic contiguous memory allocation stress on system.</p>
      <p>The result is shown in Figure 7. Writes / reads
performances with background job are represented as Writes /
cam and Reads / cam. CMA and GCMA show enhanced
performance for every case though the enhancements are
not remarkably big. Those performance gains came from
enlarged memory space that are rescued from reservation.
GCMA shows even better enhancement than CMA owing to
the light overhead of reserved area management that
beneted from its simple design and characteristic of secondary
clients. Moreover, GCMA get less performance degradation
from background contiguous memory allocations than CMA
owing to its fast latency. As a summary, GCMA is not only
faster than CMA but also more helpful for system
performance.</p>
      <sec id="sec-16-1">
        <title>Baseline CMA</title>
      </sec>
      <sec id="sec-16-2">
        <title>GCMA</title>
        <p>CMA
GCMA
1.1
0.7</p>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>6. CONCLUSION</title>
      <p>Physically contiguous memory allocation is still a big
problem for low-end embedded devices. For example, Raspberry
Pi, a popular credit card sized computer, still uses
reservation technique despite of low memory utilization because
hardware solutions such as Scatter/Gather DMA or IOMMU
were too expensive and a software solution, CMA, was not
e ective.</p>
      <p>We introduced GCMA, a contiguous memory allocator
that guarantees fast latency, successful allocation, and
reasonable memory space e ciency. It achieves those goals by
using the reservation technique and e ectively utilizing the
reserved area. From our evaluation on Raspberry Pi 2, while
CMA increases latency of a photo shot on Raspberry Pi from
1 second to 9 seconds in the worst case, GCMA shows no
additional latency compared to the reservation technique. Our
evaluation also shows that GCMA not only outperforms
latency but also improves system performance as CMA does.</p>
    </sec>
    <sec id="sec-18">
      <title>7. ACKNOWLEDGEMENTS</title>
      <p>This work was supported by the National Research
Foundation of Korea(NRF) grant funded by the Korea
government(MSIP) (No. 0421-20150075).
Page 1</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arcangeli</surname>
          </string-name>
          .
          <article-title>Transparent hugepage support</article-title>
          .
          <source>KVM Forum</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Corbet</surname>
          </string-name>
          . Cleancache and Frontswap. http://lwn.net/Articles/386090/,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Denning</surname>
          </string-name>
          .
          <article-title>Before memory was virtual</article-title>
          .
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freedman</surname>
          </string-name>
          .
          <article-title>The compression cache: virtual memory compression for handheld computers</article-title>
          .
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <surname>S. Maeng.</surname>
          </string-name>
          <article-title>DaaC: device-reserved memory as an eviction-based le cache</article-title>
          .
          <source>In in Proc. 21th Int. Conf. Compilers</source>
          , architectures
          <article-title>and synthesis for embedded systems, page 191</article-title>
          . ACM Press, Oct.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Maeng</surname>
          </string-name>
          .
          <article-title>Rigorous rental memory management for embedded systems</article-title>
          .
          <source>ACM Transactions on Embedded Computing Systems</source>
          ,
          <volume>12</volume>
          (1s):
          <fpage>1</fpage>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Johnstone</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Wilson</surname>
          </string-name>
          .
          <article-title>The memory fragmentation problem</article-title>
          .
          <source>ACM SIGPLAN Notices</source>
          ,
          <volume>34</volume>
          (
          <issue>3</issue>
          ):
          <volume>26</volume>
          {
          <fpage>36</fpage>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Jouppi</surname>
          </string-name>
          .
          <article-title>Cache write policies and performance</article-title>
          .
          <source>In Proceedings of the 20th annual international symposium on Computer architecture - ISCA '93</source>
          , volume
          <volume>21</volume>
          , pages
          <fpage>191</fpage>
          {
          <fpage>201</fpage>
          , New York, New York, USA,
          <year>June 1993</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>MacKenzie</surname>
          </string-name>
          . nice.
          <source>Linux man page</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Magenheimer</surname>
          </string-name>
          .
          <article-title>In kernel memory compression</article-title>
          . http://lwn.net/Articles/545244/,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>McVoy</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Staelin</surname>
          </string-name>
          .
          <article-title>lmbench: Portable Tools for Performance Analysis</article-title>
          .
          <source>USENIX annual technical conference</source>
          ,
          <source>1996. Page 1</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] msperl. Rpicon g. http://elinux.org/RPicon g,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nazarewicz</surname>
          </string-name>
          .
          <article-title>Contiguous Memory Allocator</article-title>
          .
          <source>In Proceedings of LinuxCon Europe</source>
          <year>2012</year>
          . LinuxFoundation,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          . introduce gcma. http://lwn.net/Articles/619865/,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.-W. Robert</given-names>
            <surname>Love</surname>
          </string-name>
          .
          <article-title>The Bu er Cache. In Linux-Kernel Manual: Guidelines for the Design and Implementation of Kernel 2.6</article-title>
          , page 348.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T. C. Ruth</given-names>
            <surname>Suehle</surname>
          </string-name>
          .
          <article-title>Automatically Share Memory</article-title>
          . In Raspberry Pi Hacks: Tips &amp;
          <article-title>Tools for Making Things with the Inexpensive Linux Computer</article-title>
          , page 95.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Team</surname>
          </string-name>
          . Low RAM. http://s.android.com/devices/tech/low-ram.html,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Upton</surname>
          </string-name>
          . Camera board available for sale! https://www.raspberrypi.org/camera-board
          <string-name>
            <surname>-</surname>
          </string-name>
          availablefor-sale/,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.</given-names>
            <surname>Upton</surname>
          </string-name>
          .
          <source>Raspberry Pi 2 on sale now at $35</source>
          . https://www.raspberrypi.org/raspberry-pi-2
          <string-name>
            <surname>-</surname>
          </string-name>
          on-sale/,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>