=Paper= {{Paper |id=Vol-1464/ewili15_12 |storemode=property |title=GCMA: Guaranteed Contiguous Memory Allocator |pdfUrl=https://ceur-ws.org/Vol-1464/ewili15_12.pdf |volume=Vol-1464 |dblpUrl=https://dblp.org/rec/conf/ewili/ParkKY15 }} ==GCMA: Guaranteed Contiguous Memory Allocator== https://ceur-ws.org/Vol-1464/ewili15_12.pdf
GCMA: Guaranteed Contiguous Memory Allocator

SeongJae Park, Seoul National University, sjpark@dcslab.snu.ac.kr
Minchan Kim, LG Electronics, minchan.kim@lge.com
Heon Y. Yeom, Seoul National University, yeom@snu.ac.kr



ABSTRACT

While the demand for physically contiguous memory allocation is still alive, especially in embedded systems, existing solutions are insufficient. The most widely adopted solution is the reservation technique. Though it serves allocations well, it can severely degrade memory utilization. There are hardware solutions such as Scatter/Gather DMA and IOMMU, but the cost of this additional hardware is excessive for low-end devices. CMA is a software solution in Linux that aims to solve not only the allocation problem but also the memory utilization problem. In real environments, however, CMA can incur unpredictably slow latency and can often fail to allocate contiguous memory due to its complex design.

We introduce a new solution for the above problem, GCMA (Guaranteed Contiguous Memory Allocator). It guarantees not only memory space efficiency but also fast latency and allocation success by using the reservation technique and letting only immediately discardable pages use the reserved area. Our evaluation on Raspberry Pi 2 shows 15 to 130 times faster and more predictable allocation latency, without system performance degradation, compared to CMA.

[Figure 1 (CDF of latency in milliseconds, Baseline vs. CMA): CDF of a photo shot on Raspberry Pi 2 under a background task.]

1. INTRODUCTION

Because the resource requirements of processes are not predictable, keeping resource availability high while also keeping utilization high has always been one of the hardest challenges. Memory management has been no exception. In particular, large memory allocation used to be pretty much impossible because of the fragmentation problem [7]. The problem seemed to be completely solved with the introduction of virtual memory [3]. In the real world, however, the demand for physically contiguous memory still exists [5, 6, 13]. Typical examples are embedded devices such as video codecs or cameras, which require large buffers.

A traditional solution to the problem is the memory reservation technique. The technique reserves a sufficient amount of contiguous memory at boot time and uses the area only for contiguous memory allocation. It guarantees fast and successful allocation, but it can degrade memory space efficiency if contiguous memory allocations do not use the whole reserved area. There are alternative solutions using additional hardware, such as Scatter/Gather DMA or an IOMMU. They solve the problem by helping devices access discontiguous memory as if it were contiguous, much as the virtual memory system does for processes. However, the additional hardware can be too expensive for low-end devices.

The Linux kernel has a subsystem for the problem, namely CMA (Contiguous Memory Allocator) [13]. It basically follows the reservation technique and boosts memory efficiency by allowing the reserved area to be used by other movable pages. At the same time, it keeps contiguous memory allocation available by migrating movable pages out of the reserved area when they are required for a contiguous memory allocation. However, CMA is not very popular, because memory migration usually takes a long time and can even fail from time to time. Figure 1 shows the CDF of the latency of a photo shot on Raspberry Pi 2 [19] under a memory-intensive background task, a realistic workload. Baseline means Raspberry Pi 2 using the reservation technique, and CMA means the system modified to use CMA instead. The maximum latency of a photo shot using CMA is about 9.8 seconds, while it is 1.6 seconds with the reservation technique. A 9.8-second photo shot is normally not acceptable, even in the worst case. As a result, most systems use the reservation technique or additional hardware as a last resort despite the expensive cost.

To this end, we propose a new contiguous memory allocator subsystem, GCMA (Guaranteed Contiguous Memory Allocator) [14], that guarantees not only memory space efficiency but also fast response time and successful allocation by carefully restricting the secondary clients of the reserved area to appropriate ones only. In this paper, we introduce the idea, design, and evaluation results of the GCMA implementation. GCMA on Raspberry Pi 2 shows 15 to 130 times faster and much more predictable response time for contiguous page allocation, without system performance degradation compared to CMA, and zero overhead on the latency of a photo shot under a background task compared to the reservation technique.

EWiLi'15, October 8th, 2015, Amsterdam, The Netherlands.
Copyright retained by the authors.
2. BACKGROUND

2.1 Virtual Memory Management

The virtual memory [3] system gives a process the illusion that it owns the entire contiguous memory space by letting the process use logical addresses, which can later be mapped to discontiguous physical memory. Because processes access memory using only logical addresses, the system can easily allocate large contiguous memory to processes without the fragmentation problem. However, giving every process this illusion is not easy in practice, because the physical memory in a system is usually smaller than the total sum of each process's illusory memory space. To deal with this problem, the kernel conceptually expands memory using larger storage such as a hard disk drive or SSD. Because such storage is usually much slower than memory, only pages that will not be accessed soon should be kept there. When memory is full and a process requires one more page, the kernel moves the content of a page that is not expected to be accessed soon to the storage and gives the page to the process.

In detail, the pages of a process can be classified into two types: file-backed pages and anonymous pages. A file-backed page holds the content of a file that is cached in memory via the page cache [15] for fast I/O. The page can be freed after synchronizing its content with the backing file; if it is already synchronized, it can be freed immediately. An anonymous page is a page that has no backing file; a page allocated from the heap using malloc() is an example. Because an anonymous page has no backing file, it cannot be freed before its content is stored safely elsewhere. For this reason, the system provides backing storage for anonymous pages, namely the swap device.

2.2 Contiguous Memory Allocation

Because the MMU, the hardware unit that translates logical memory addresses into physical memory locations, sits behind the CPU, devices that communicate with system memory directly, without the CPU, have to deal with physical memory addresses rather than taking advantage of virtual memory. If a device requires, for example, a large buffer for image processing, the buffer must be contiguous in the physical address space. However, allocating physically contiguous memory is hard and cannot always be guaranteed due to the fragmentation problem.

Some devices support hardware facilities that help with the problem, such as Scatter/Gather DMA or an IOMMU [13]. Scatter/Gather DMA can gather a buffer from scattered memory regions. An IOMMU provides a contiguous memory address space illusion to devices, similar to what the MMU does for processes. However, neither Scatter/Gather DMA nor an IOMMU is free. Supporting them requires additional price and energy, which can be critical in the low-end device market, unlike the high-end market. Moreover, the demand seems unlikely to disappear in the near future, due to the advance of the IoT paradigm and emerging low-end smartphone markets. The low-end device market is not the only field that could utilize efficient contiguous memory allocation. One good example is huge page [1] management. Because huge pages can severely reduce TLB pressure, efficient huge page management is important for systems with memory-intensive workloads, such as high performance computing. In this case, neither Scatter/Gather DMA nor an IOMMU can be a solution, because they do not guarantee real physically contiguous memory. This paper focuses on the low-end device problem, though.

The Linux kernel already provides a software solution called CMA [13]. CMA, which is based on the reservation technique, applies the basic concept of virtual memory: pages in virtual memory can move to any other place in physical memory. It reserves memory for contiguous memory allocation at boot time, as the reservation technique does. In the meantime, CMA lets movable pages reside in the reserved area to keep memory utilization high. When a contiguous memory allocation is issued, the relevant movable pages are migrated out of the reserved area and the allocation is served from the freed area. However, page migration is a fairly complex process. First of all, it must copy the page content and re-arrange the mapping between logical and physical addresses. Second, a thread could be holding the movable page in kernel context; because the thread is holding the page, it cannot be migrated until the holding thread releases it. In consequence, CMA guarantees neither fast latency nor success. Another solution for fast allocation latency based on the CMA idea was proposed by Jeong et al. [5, 6]. It achieves the goal by restricting the use of the reserved area to evicted clean file cache rather than movable pages. Because anonymous pages cannot be in the reserved area, memory utilization could suffer under non-file-intensive workloads despite its sufficiently fast latency.

3. GCMA: GUARANTEED CMA

3.1 Mistake in the Design of CMA

At a conceptual level, the basic design of CMA is as follows:

1. Reserve contiguous memory space and let contiguous memory allocation be the primary client of the area.

2. Share the reserved area with secondary clients.

3. Reclaim memory used by secondary clients whenever a primary client requests it.

In detail, CMA chooses movable pages as its secondary client and uses page migration as the reclamation method. The basic design of CMA has no problem in itself; the problem is that movable pages are not an appropriate candidate for the secondary client. Because of the many limitations of page migration described in Section 2.2, movable pages cannot easily give up the reserved area. That is why CMA suffers from slow latency and sometimes even fails. This problem can be avoided by selecting appropriate candidates instead of movable pages and managing them effectively.

3.2 Appropriate Secondary Clients

We have seen a bad candidate for the secondary client: movable pages. The problem is that movable pages cannot be freed for the primary client (contiguous memory allocation) immediately, because of the high cost and possible failure of page migration. Appropriate secondary clients should therefore satisfy three requirements. First, their pages should be freeable at an affordable cost. Second, the pages should not be accessed frequently, because the area can be suddenly discarded for the primary client. Finally, the pages should be out of kernel scope, to avoid being pinned by other threads.
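The reserve/share/reclaim pattern of Section 3.1 can be sketched as a small user-space simulation. This is an illustration only, not kernel code; `reserved_area`, `secondary_get`, and `alloc_contig` are hypothetical names of our own:

```c
#include <stdbool.h>
#include <string.h>

#define AREA_PAGES 16

/* One byte per page of the reserved area:
 * 'F' = free, 'S' = borrowed by a secondary client,
 * 'P' = owned by the primary client (contiguous allocation). */
static char reserved_area[AREA_PAGES];

static void init_area(void)
{
	memset(reserved_area, 'F', sizeof(reserved_area));
}

/* Step 2: a secondary client borrows an otherwise idle page. */
static bool secondary_get(int page)
{
	if (reserved_area[page] != 'F')
		return false;
	reserved_area[page] = 'S';
	return true;
}

/* Steps 1 and 3: the primary client takes a contiguous range,
 * reclaiming any pages borrowed by secondary clients first.
 * For CMA, "reclaim" means page migration (slow, can fail);
 * for GCMA, it means discarding, which always succeeds quickly. */
static bool alloc_contig(int start, int count)
{
	for (int i = start; i < start + count; i++)
		reserved_area[i] = 'F';	/* reclaim from secondary */
	for (int i = start; i < start + count; i++)
		reserved_area[i] = 'P';	/* hand over to primary */
	return true;
}
```

The whole difference between CMA and GCMA lives in the reclaim step of `alloc_contig()`: CMA must migrate live movable pages, while GCMA merely drops discardable cached copies.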
[Figure 2: GCMA workflow. (Processes issue normal allocations to system memory and contiguous allocations to the GCMA reserved area; reclaim and swap paths lead to the file and swap device.)]

For these purposes, a final chance cache for clean pages being evicted from the page cache and for pages being swapped out of system memory (Section 2.1) is a good candidate. Clean pages under eviction can be freed immediately without any cost. Pages under swap-out can be freed just after write-back. These pages will not be accessed soon, because the kernel chose them as reclaim targets for precisely that reason. They are out of kernel scope because they have already been evicted or swapped out. Additionally, the system can keep memory utilization high, because the final chance cache can hold not only file-backed pages but also anonymous pages. In consequence, GCMA can be designed as a contiguous memory allocator that uses a final chance cache for evicted clean pages and swapped-out pages as its secondary client. In other words, GCMA carries out two missions: it is a contiguous memory allocator and a temporary shelter for clean pages being evicted and pages being swapped out.

3.3 Limitation and Optimization

Although the secondary clients of GCMA are effective enough for contiguous memory allocation, they could have an adverse effect on system performance compared to CMA. While movable pages, the secondary client of CMA, can be placed in the reserved area with only slight overhead, the secondary clients of GCMA require reclaim work before they are placed in the reserved area, which consumes processing power. Moreover, pages being swapped out require write-back to the swap device, which consumes I/O resources as well.

To mitigate this limitation, GCMA utilizes the secondary client area as a write-through cache [8]. In write-through mode, the content of a page being swapped out is written to the GCMA reserved area and the backing swap device simultaneously. GCMA can then discard the page whenever necessary without additional overhead, because its content is already in the swap device. Though write-through mode improves GCMA latency, system performance could be degraded because it consumes significant I/O resources. To minimize the effect, we suggest constructing the swap device from a compressed memory block device [4, 10], which improves write-through speed and swapping performance. We recommend Zram [10] as the swap device; it is officially recommended by Google [17] for Android devices with low memory. With these optimizations applied, GCMA showed no performance degradation at all in our evaluation, owing to its simple design and efficient implementation. The detailed evaluation setup and results are described in Section 5.

4. IMPLEMENTATION

To minimize the migration effort for CMA users, GCMA shares the CMA interface. Therefore, CMA client code can use GCMA instead by changing only one function call in its initialization, or by turning on a kernel configuration option without any code change.

GCMA is implemented as a small, independent Linux kernel module to keep the code simple, rather than adding comprehensive hooks in the kernel. Only a few lines of CMA code have been modified, for interface integration and for experimental features used in the evaluation. As evidence of this simplicity, the GCMA implementation has been ported to various Linux versions, each time in a few minutes. In total, our implementation uses only about 1,300 lines of code for the main implementation and only about 200 lines of changes to CMA. GCMA is published as free / open source software under the GPL license at https://github.com/sjp38/linux.gcma/releases/tag/gcma/rfc/v2.

4.1 Secondary Client Implementation

A final chance cache for clean pages being evicted and pages being swapped out can improve system performance. This fact has been well known in the Linux kernel community, and facilities for such a cache, namely Cleancache and Frontswap [2], have been proposed by the community. As the names imply, they are final chance caches for evicted clean pages and swapped-out pages, respectively. To encourage flexible system configuration, the community implemented only their front ends and interfaces in the Linux kernel and left the back-end implementation to other modules. Once a back-end implementation is registered, the page cache and swap layers try to put evicted clean pages and swapped-out pages into that back end as soon as a victim page is selected. If putting a page succeeds, future references to it are read back from the back end, which is much faster than the file or swap device. Otherwise, the victim is pushed to the file or swap device. Because this design fits perfectly with the secondary clients of GCMA, GCMA implements these back ends rather than inventing the interface again.

The back ends for Cleancache and Frontswap have similar requirements, because both are just software caches. To remove redundancy, we implemented another abstraction for them, namely dmem (discardable memory). It is a key-value storage from which any entry can be evicted at any time. It uses a hash table to keep key-value entries and constructs the buckets of the table with red-black trees. It also tracks an LRU list of entries and evicts a batch of LRU entries if the storage becomes too tight, as normal caches do. GCMA implements the Cleancache and Frontswap back ends simply using dmem: it provides its reserved area as the space for the values of key-value pairs and lets Cleancache and Frontswap use it, with the identifying information of an evicted clean page or swapped-out page as the key and the content of the page as the value. When GCMA needs a page used by Cleancache or Frontswap for a contiguous memory allocation, it evicts the associated entry through the dmem interface.

[Figure 3: GCMA overall architecture. (Cleancache and Frontswap feed the dmem interface; dmem logic and GCMA logic manage the reserved area, and GCMA shares the CMA interface with CMA logic.)]

Figure 3 depicts the overall architecture of GCMA.
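The dmem abstraction described above can be sketched as a toy user-space key-value store with LRU eviction. The real dmem hashes keys into red-black-tree buckets over the reserved area; here a flat array, a global tick counter, and the names `dmem_put`, `dmem_get`, and `dmem_evict` are all our illustrative assumptions:

```c
#include <stdbool.h>
#include <string.h>

#define DMEM_ENTRIES 4		/* tiny capacity, to force eviction */

struct dmem_entry {
	long key;		/* e.g. a packed (swap type, offset) */
	char value[64];		/* page content would go here */
	unsigned long last_use;	/* for LRU victim selection */
	bool used;
};

static struct dmem_entry store[DMEM_ENTRIES];
static unsigned long clock_tick;

/* Evict the least-recently-used entry; dmem may drop any entry at
 * any time, so callers must keep a durable copy (write-through). */
static int dmem_evict(void)
{
	int victim = -1;
	for (int i = 0; i < DMEM_ENTRIES; i++)
		if (store[i].used &&
		    (victim < 0 || store[i].last_use < store[victim].last_use))
			victim = i;
	if (victim >= 0)
		store[victim].used = false;
	return victim;
}

static void dmem_put(long key, const char *value)
{
	int slot = -1;
	for (int i = 0; i < DMEM_ENTRIES; i++)
		if (!store[i].used) { slot = i; break; }
	if (slot < 0)
		slot = dmem_evict();	/* store full: make room */
	store[slot].key = key;
	strncpy(store[slot].value, value, sizeof(store[slot].value) - 1);
	store[slot].last_use = ++clock_tick;
	store[slot].used = true;
}

static bool dmem_get(long key, char *out, size_t len)
{
	for (int i = 0; i < DMEM_ENTRIES; i++)
		if (store[i].used && store[i].key == key) {
			strncpy(out, store[i].value, len - 1);
			out[len - 1] = '\0';
			store[i].last_use = ++clock_tick;
			return true;
		}
	return false;		/* evicted: fall back to file or swap */
}
```

Because GCMA writes through to the durable copy in the file or swap device, a `false` from `dmem_get()` is just a cache miss, which is exactly what lets GCMA discard entries instantly for a contiguous allocation.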
[Figure 4 (average latency in milliseconds vs. number of requested contiguous pages, up to ~30,000): Averaged allocation latency of CMA and GCMA.]

5. EVALUATION

5.1 Evaluation Environments

Component    Specification
Processors   900 MHz quad-core ARM Cortex-A7
Memory       1 GiB LPDDR2 SDRAM
Storage      Class 10 SanDisk 16 GiB microSD card

Table 1: Raspberry Pi 2 Model B specifications.

We use the Raspberry Pi 2 Model B single-board computer [19] as our evaluation environment. It is one of the most popular low-end mini computers in the world. Its detailed specification is described in Table 1.

Name       Specification
Baseline   Linux rpi-v3.18.11 + 100 MiB swap + 256 MiB reserved area
CMA        Linux rpi-v3.18.11 + 100 MiB swap + 256 MiB CMA area
GCMA       Linux rpi-v3.18.11 + 100 MiB Zram swap + 256 MiB GCMA area

Table 2: Configurations for evaluations.

The detailed system configurations used in our evaluations are described in Table 2. Because the Raspberry Pi development team provides a forked Linux kernel optimized for Raspberry Pi via GitHub, we use that kernel rather than the vanilla kernel. The kernel we selected is based on Linux v3.18.11; we refer to it as Linux rpi-v3.18.11. Raspberry Pi has used the reservation technique for contiguous memory allocation from the beginning. After Linux introduced CMA, Raspberry Pi started to support CMA in November 2012 [16]. However, the Raspberry Pi development team found problems with CMA and decided to support it only unofficially [12]. As a result, the Raspberry Pi default configuration still uses the reservation technique.

Our evaluation focuses on three factors: first, the latency of CMA and GCMA; second, the effect of CMA and GCMA on a real workload, the Raspberry Pi Camera [18]; third, the effect of CMA and GCMA on system performance.

5.2 Latency of Contiguous Memory Allocation

To compare the contiguous memory allocation latency of CMA and GCMA, we issue contiguous memory requests with varying allocation sizes. The first request is for 64 pages (256 KiB), and the size is doubled for each subsequent request until the last request, for 32,768 pages (128 MiB). We leave a 2-second interval between allocations to minimize any effect on other processes and the system. We repeat this 30 times with both the CMA and GCMA configurations.

The average latencies of CMA and GCMA are shown in Figure 4. GCMA shows 14.89x to 129.41x faster latency than CMA. Moreover, CMA failed once for the 32,768-page allocation while GCMA never failed, even though the workload ran with no background jobs. Even without any background job, CMA can encounter secondary client pages that need to be moved out, because any movable page can be located inside the CMA reserved area. In that case, moving the pages out consumes a lot of time and can fail in the worst case; that worst case actually happened once during a 32,768-page allocation. On the other hand, because only clean pages being evicted and pages being swapped out can be located inside the GCMA reserved area, GCMA does not need to discard pages unless a memory-intensive background workload exists. Moreover, the CMA code path is much bigger and slower than GCMA's, because CMA has to handle the complicated situations of movable pages and migration, while GCMA needs to consider only dmem. That is why GCMA shows much lower latency than CMA even without any background workload.

5.3 Distribution of Latencies

To compare the predictability of CMA and GCMA, we run the contiguous memory allocation workload again with requests of 1,024 contiguous pages only and draw the CDF of the latencies. In this evaluation, we first run the workload without any background job, to show the latency of CMA and GCMA themselves without interference. After that, we run the workload with a background job, to show latencies under realistic memory stress. For the background job, we use the Blogbench benchmark, a portable file system benchmark that simulates a real-world busy blog server.

The result is shown in Figure 5. In the ideal case, without background jobs, GCMA latencies mostly lie under 1 millisecond, while CMA latencies range from 4 milliseconds to more than 100 milliseconds. With Blogbench as a background job, GCMA latencies mostly lie under 2 milliseconds, while CMA latencies range from 4 milliseconds to more than 4 seconds. The result shows that GCMA latency is not only low but also predictable and insensitive to background stress, compared to CMA.

Even without a background workload, CMA shows a much more dispersed latency than GCMA, because CMA is fundamentally slower due to its complexity and has more chances of secondary client page clean-up. With a background workload, CMA encounters even more secondary client pages to clean up, because any movable page of the background job can be inside the CMA reserved area. In contrast, because only clean pages being evicted and pages being swapped out can be inserted into the GCMA reserved area, GCMA rarely meets secondary client pages to clean up unless the background job has a sufficiently large memory footprint.

5.4 Raspberry Pi Camera Latency

Though the results in Section 5.3 are impressive, they show only the potential of GCMA, not its performance effect on a real workload. Therefore, in this section we evaluate the latency of photo shots using the Raspberry Pi Camera.
[Figure: two CDF panels, (a) Without Blogbench and (b) With Blogbench; x-axis: latency (milli-seconds, log scale), y-axis: cumulative probability; lines: CMA, GCMA.]

Figure 5: CDF of 1,024 contiguous pages allocation latency.

   For a photo shot, Raspberry Pi 2 usually allocates 64
contiguous pages 25 times asynchronously using a kernel
thread and keeps the allocated memory without releasing it
for as long as possible [16]. The retained memory is used for
subsequent photo shots; therefore, only the first photo shot
is affected by contiguous memory allocation. For convenient
evaluation, we configure the system to release every piece of
memory allocated for the camera as soon as possible, so that
subsequent photo shots are also affected.
   We measure the latency of each photo shot using the
Raspberry Pi Camera 30 times under every configuration,
with a 10-second interval between shots to eliminate any
effect from previous shots. The configurations use the
reservation technique, CMA, and GCMA for the photo shot.
To explicitly show the effect of contiguous memory allocation,
we also measure the contiguous memory allocation latencies
incurred by each photo shot under CMA and GCMA.
   Without a background job, every configuration shows a
similar photo shot latency of about 0.9 seconds. There is no
difference between CMA and GCMA even though GCMA
shows much lower latency than CMA in Section 5.3. There
are two reasons behind this. First, though GCMA is about
15 times faster than CMA for each allocation required by
the camera, the absolute allocation time is only 1 millisecond
for CMA and 69 microseconds for GCMA. Because a photo
shot issues the allocation only 25 times, the allocation latency
comprises only a small portion of the entire photo shot
latency. Second, CMA can allocate almost without any page
migration if there are no background jobs and the number of
required pages is small. In other words, the environment is
too favorable.
   To show the real latency under a realistic environment, we
run the camera workload with Blogbench in the background,
as we did in Section 5.3, to simulate a realistic background
workload. To minimize scheduling effects on latency, we set
the priority of the photo shot process higher than that of the
background workload using the nice [9] command.
   The result is described in Figure 6. CMA shows a much
slower camera shot latency while GCMA shows latency
similar to the Baseline using the reservation technique. In the
worst case, CMA requires about 9.8 seconds for a photo shot
while GCMA requires 1.04 seconds. The contiguous memory
allocation latencies show a similar but more dramatic result:
latencies of CMA lie between 4 milliseconds and 10 seconds,
while GCMA's latencies lie between 750 microseconds and
3.32 milliseconds. This result means that contiguous memory
allocation using CMA produces extremely high and
unpredictable latency under a realistic situation. With this
evaluation result, it is not surprising that CMA on Raspberry
Pi is not officially supported [12].

5.5    Effect on System Performance
Finally, to show the performance effect of CMA and GCMA
on the system, we measure system performance under every
configuration using two benchmarks. First, we run a
micro-benchmark called lmbench [11] 3 times and take the
average of the results to show how the performance of OS
primitives is affected by CMA and GCMA.

                        Baseline    CMA        GCMA
   lat_ctx (usec)       147.36      143.525    142.93
   bw_file_rd (MiB/s)   511.77      517.6      519.73
   bw_mem_rd (MiB/s)    1426.33     1438.33    1434.5
   bw_mem_wr (MiB/s)    696.5       701.25     699.966

          Table 3: LMbench measurement result.

   The result is depicted in Table 3. Each row shows the
averaged measurement of context switch latency, file read
bandwidth, memory read bandwidth, and memory write
bandwidth. CMA and GCMA tend to show improved
performance compared to Baseline because of improved
memory utilization, though the difference is not large.
Differences between CMA and GCMA are within the margin
of error.
   To show performance under a realistic workload, we run
the Blogbench job 3 times and measure the average score
normalized by the Baseline configuration result. At the same
time, we continuously issue photo shots in the background
with a 10-second interval, as described in Section 5.4, to
simulate realistic contiguous memory allocation stress on the
system.
   The result is shown in Figure 7. Write and read performance
with the background job are represented as Writes / cam and
Reads / cam. CMA and GCMA show enhanced performance
in every case, though the enhancements are not remarkably
big. Those performance gains come from the enlarged memory
space rescued from reservation. GCMA shows even better
enhancement than CMA owing to the light overhead of its
reserved area management, which benefits from its simple
design and the characteristics of secondary clients. Moreover,
GCMA suffers less performance degradation from background
contiguous memory allocations than CMA owing to its low
latency. In summary, GCMA is not only faster than CMA
but also more helpful for system performance.
[Figure: two CDF panels, (a) Photo shot and (b) Contiguous memory allocation; x-axis: latency (milli-seconds), y-axis: cumulative probability; lines: Baseline, CMA, GCMA.]

Figure 6: CDF of photo shot and following memory allocation latencies under Blogbench.


[Figure: bar chart of normalized Blogbench scores (0.7 to 1.1) for CMA and GCMA across four cases: Writes, Reads, Writes / cam, Reads / cam.]

Figure 7: Blogbench performance with CMA and GCMA.

6.    CONCLUSION
Physically contiguous memory allocation is still a big
problem for low-end embedded devices. For example,
Raspberry Pi, a popular credit-card-sized computer, still uses
the reservation technique despite its low memory utilization,
because hardware solutions such as Scatter/Gather DMA or
IOMMU were too expensive and a software solution, CMA,
was not effective.
   We introduced GCMA, a contiguous memory allocator
that guarantees fast latency, successful allocation, and
reasonable memory space efficiency. It achieves those goals
by using the reservation technique and effectively utilizing
the reserved area. In our evaluation on Raspberry Pi 2, while
CMA increases the latency of a photo shot from 1 second to
9 seconds in the worst case, GCMA shows no additional
latency compared to the reservation technique. Our evaluation
also shows that GCMA not only achieves lower latency but
also improves system performance, as CMA does.

7.    ACKNOWLEDGEMENTS
This work was supported by the National Research Foundation
of Korea (NRF) grant funded by the Korea government
(MSIP) (No. 0421-20150075).

8.    REFERENCES
 [1] A. Arcangeli. Transparent hugepage support. KVM
     Forum, 2010.
 [2] J. Corbet. Cleancache and Frontswap.
     http://lwn.net/Articles/386090/, 2010.
 [3] P. Denning. Before memory was virtual. 1997.
 [4] M. Freedman. The compression cache: virtual memory
     compression for handheld computers. 2000.
 [5] J. Jeong, H. Kim, J. Hwang, J. Lee, and S. Maeng.
     DaaC: device-reserved memory as an eviction-based
     file cache. In Proc. 21st Int. Conf. on Compilers,
     Architectures and Synthesis for Embedded Systems,
     page 191. ACM Press, Oct. 2012.
 [6] J. Jeong, H. Kim, J. Hwang, J. Lee, and S. Maeng.
     Rigorous rental memory management for embedded
     systems. ACM Transactions on Embedded Computing
     Systems, 12(1s):1, Mar. 2013.
 [7] M. S. Johnstone and P. R. Wilson. The memory
     fragmentation problem. ACM SIGPLAN Notices,
     34(3):26–36, Mar. 1999.
 [8] N. P. Jouppi. Cache write policies and performance. In
     Proc. 20th Annual Int. Symposium on Computer
     Architecture (ISCA '93), pages 191–201. ACM Press,
     June 1993.
 [9] D. MacKenzie. nice. Linux man page, 2010.
[10] D. Magenheimer. In-kernel memory compression.
     http://lwn.net/Articles/545244/, 2013.
[11] L. McVoy and C. Staelin. lmbench: portable tools for
     performance analysis. In USENIX Annual Technical
     Conference, 1996.
[12] msperl. RPiconfig. http://elinux.org/RPiconfig, 2014.
[13] M. Nazarewicz. Contiguous Memory Allocator. In
     Proceedings of LinuxCon Europe 2012. Linux
     Foundation, 2012.
[14] S. Park. introduce gcma.
     http://lwn.net/Articles/619865/, 2014.
[15] R. Love. The Buffer Cache. In Linux-Kernel Manual:
     Guidelines for the Design and Implementation of
     Kernel 2.6, page 348. 2005.
[16] R. Suehle and T. Callaway. Automatically Share
     Memory. In Raspberry Pi Hacks: Tips & Tools for
     Making Things with the Inexpensive Linux Computer,
     page 95. 2013.
[17] Android Team. Low RAM.
     http://s.android.com/devices/tech/low-ram.html, 2013.
[18] E. Upton. Camera board available for sale!
     https://www.raspberrypi.org/camera-board-available-
     for-sale/, 2013.
[19] E. Upton. Raspberry Pi 2 on sale now at $35.
     https://www.raspberrypi.org/raspberry-pi-2-on-sale/,
     2015.