<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Microkernel Dedicated for Dynamic Partial Reconfiguration on ARM-FPGA Platform</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tian Xia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Christophe Prévotet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabienne Nouvel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université Europe de Bretagne</institution>
          ,
          <addr-line>France INSA, IETR, UMR 6164, F-35708 RENNES</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>This paper describes the first implementation of a custom microkernel on an ARM-FPGA platform capable of managing reconfigurable hardware parts dynamically. After describing the structure of the proposed microkernel, we focus on a custom system task that deals with reconfiguration management and is associated with a dedicated scheduling mechanism. We describe the hardware platform to which the microkernel has been ported and provide a use-case application to demonstrate the feasibility of the approach. Finally, we provide quantitative results in terms of reconfiguration overhead and microkernel timing performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>During the last decades, with the development of
commodity field-programmable gate arrays (FPGAs),
reconfigurable computing has gained increasing
attention for its potential in exploiting hardware resources.
Through time-multiplexed sharing of the FPGA fabric, a
higher integration of functionalities can be achieved. The
main drawback of traditional FPGA reconfiguration
is the lack of flexibility, because the whole fabric must
be reconfigured even when only a part of the FPGA
needs to be modified. As a consequence,
enormous time overhead and power consumption are incurred,
which severely limits reconfiguration in embedded systems.</p>
      <p>
        As a solution, a more advanced technique that reconfigures
particular areas of an FPGA while the rest continues
executing has been proposed and is known as Dynamic
Partial Reconfiguration (DPR). This technique has proved
promising in the embedded domain because
of its runtime adaptivity for hardware algorithms and its lower
power consumption compared to large-scale static circuits
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With DPR, hardware accelerators can be
dynamically dispatched and managed, becoming as flexible as
software functions.
      </p>
      <p>
        On the other hand, with the widespread use of
handheld devices, the reliability and security of embedded
systems have become a serious concern. Microkernels are a
promising answer because they allow the user
to execute various applications (commodity APIs, real-time
tasks, etc.), each in its own isolated container, which ensures
isolation and thus security. Consequently, microkernels have been
a popular research trend in the embedded systems domain for many
years [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In this paper, we describe and study a custom embedded
microkernel on a hybrid ARM-FPGA Zynq-7000 platform
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This microkernel is a revised version of the NOVA
microhypervisor [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and it integrates the management
and scheduling of reconfigurable hardware resources. The
proposed architecture allows for dynamic management of
SW/HW tasks, secure task isolation and efficient SW/HW
communication.
      </p>
      <p>The remainder of the paper is organized as follows:
Section 2 presents current research on the management of DPR
architectures. In Section 3, the overall architecture of the
proposed platform is introduced. Section 4 focuses on the design
and implementation of the microkernel, with a detailed
introduction to the hardware task management and scheduling
mechanisms. In Section 5, we present a case study to
demonstrate the capabilities of the proposed microkernel. Finally,
Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Compared with the traditional full reconfiguration
mechanism, the DPR technique offers the following major
advantages [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
      </p>
      <sec id="sec-2-1">
        <title>Reduced hardware resource utilization</title>
      </sec>
      <sec id="sec-2-2">
        <title>Improved design efficiency</title>
        <p>Reduced reconfiguration latency and better robustness</p>
        <p>
          Despite the enhanced flexibility provided by DPR
techniques, the reconfiguration overhead remains a crucial issue
in practice. In modern high-end FPGAs, which may have
tens of millions of configuration points, a single reconfiguration
of a complex module is very time-consuming.
Numerous studies have proposed efficient hardware
reconfiguration management with dedicated architecture and
OS support. A custom DPR controller was introduced in
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to realize high-speed on-chip reconfiguration. In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], a
dedicated operating system, CAP-OS, was proposed to provide
clients with hardware task management and priority-based
scheduling. Further research was carried out in the OveRSoC
project, which provided a high-level model
for efficiently simulating and validating embedded
RTOSes for reconfigurable platforms [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Most research on
traditional DPR devices (i.e. the Virtex FPGA family) employs
embedded processors such as MicroBlaze or PowerPC, whose
computing ability is relatively limited.
        </p>
        <p>Compared to classical devices, the Zynq-7000 platform
integrates an ARM Cortex-A9 processor with various
on-board resources and opens up enormous possibilities for
embedded techniques. On this platform, the programmable
fabric is considered a unique auxiliary computing resource
for a fully capable processing system, and reconfiguration
management is expected to be one of many tasks in
the system. Hence, a dedicated kernel is the ideal solution for
dispatching both hardware and software resources rationally.</p>
        <p>
          While considerable efforts have been made to port
microkernel techniques to traditional embedded systems, such as
OKL4 from Open Kernel Labs [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], most existing
microkernels on ARM do not consider reconfigurable hardware.
Instead, most of these works only use a microkernel to
manage heterogeneous platforms, i.e. software and static
hardware parts. For example, in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], an L4 kernel is ported to
manage hardware and software tasks, but without using
dynamic reconfiguration. In parallel, the work in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] discussed
reconfiguration management on the Zynq platform at the
application level, without any operating system and
thus with poor flexibility.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. PROPOSED PLATFORM</title>
      <p>The motivation of the proposed hybrid ARM-FPGA
platform framework is to establish a user-practical environment
with a highly abstract microkernel. The management of
hardware resources is integrated as a user application, with
relatively easy access. Both software and hardware tasks are
registered and scheduled by a custom microkernel. A block
diagram of the proposed platform is shown in Fig. 1.</p>
      <p>On the proposed platform, computing resources are
divided into the Processing System (PS) and the Programmable Logic
(PL). On the PS side, a simplified microkernel hosts
multiple software applications, including guest OSes and user
applications executing in user space. Each
application is housed in an individual isolated space of the
microkernel, referred to as an execution context (EC).
By scheduling and switching ECs, the ARM processor is
time-multiplexed among guests. The
FPGA fabric is engaged in several hardware acceleration
operations, which execute concurrently with SW tasks.
A dedicated hardware task management routine is proposed to
control and reconfigure the hardware accelerators
dynamically. This routine runs as a guest of the microkernel and
is scheduled whenever the management of HW tasks is
required. The mechanism is described
in detail in Section 4. In this way, the FPGA resource is
seen as a standard user application by the microkernel, and
hardware and software tasks can thereby be managed
concurrently in our framework.</p>
      <p>[Fig. 1: block diagram of the proposed platform. Labels: PS, PL, DDR (HW task data, bit files), User App, I/O Driver, HW Task Manager, Microkernel, ARM Cortex-A9, AXI Interconnection.]</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Hardware Tasks</title>
      <p>In a reconfigurable embedded system, hardware tasks are
implemented using functional fabric structures in the FPGA,
which can be user-defined computing blocks or commercial
IP cores. As shown in Fig. 1, the FPGA fabric is divided
into multiple partially reconfigurable black boxes, or containers,
each capable of housing a hardware task
independently. These containers are defined as partially reconfigurable
regions (PRRs). The hardware task running in each
container can be switched at run time under the control of the
hardware task manager. Blocks of different sizes are
allocated to different PRRs for different task purposes.</p>
      <p>The fabric information of a hardware
task is held in a bitstream file. Bitstream
files can be stored in various memory devices and are
accessed via a simple look-up table. Note that the container
corresponding to each HW task always has the same
constrained location in the FPGA. A HW task is dispatched by
transferring the corresponding bitstream file to its assigned
PRR. Normally, HW tasks with similar or closely related
functionalities should be assigned to the same PRR, so that the
coherence of HW task interfaces can be guaranteed. Each
HW task should have one corresponding SW application to
monitor and control its behaviour.</p>
      <p>One of the crucial features of hardware tasks is
the reconfiguration overhead, which is linearly correlated with
the size of the bitstream and thus with the PRR size. This means
that all HW tasks implemented in the same PRR have the
same reconfiguration time overhead.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 HW/SW Task Communication</title>
      <p>To connect the PL with the PS, two interface types based on the
standard AXI bus protocol are employed. The AXI GP interface
offers a unified mapping to the processor, is accessed like
normal memory, and is intended for low-speed
general-purpose communication. As shown in Fig. 1, the processing
system controls two master AXI GP interfaces as its
main means of configuring the HW tasks and reading back
their states.</p>
      <p>AXI HP is aimed at high-performance data exchange
with burst transfers, which may move data blocks as large
as 4KB in one burst and are sufficient for generic data
processing applications. On our platform, four AXI HP interfaces
are used to access both on-chip
memory (OCM) and DDR. Since HW tasks access AXI HP as
masters, data is fetched and written back without involving
the processor, allowing the processor to run simultaneously
with HW tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Reconfiguration Interface</title>
      <p>Two methods for partial reconfiguration are supported on
the Zynq platform: the Processor Configuration Access Port
(PCAP) and the Internal Configuration Access Port (ICAP).
Using PCAP, as shown in the datapath of Fig. 1, the PS can
initiate bitstream transfers from memory to the PL
through the Device Configuration Interface (DevCfg) at high
throughput (130MB/s). In contrast, ICAP is designed for
self-configuration from the PL side, with an AXI4-Lite interface as
the transfer port. Such a mechanism severely limits the
reconfiguration speed (19MB/s). ICAP is also less attractive
because it requires additional hardware resources and
occupies at least one AXI interface. On our platform, PCAP
is selected for its better compatibility with software
applications and its higher throughput.</p>
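      <p>As a rough illustration of the throughput gap, the sketch below estimates the ideal transfer time of a partial bitstream over each port, using only the two throughput figures quoted above. The 1 MB bitstream size, the function name and the reading of "MB" as 10^6 bytes are assumptions made for the example, not values measured on the platform; setup latency is ignored.</p>
      <preformat>
```python
# Throughputs quoted in the text, in bytes per second (MB taken as 10**6 bytes).
PCAP_BYTES_PER_S = 130e6
ICAP_BYTES_PER_S = 19e6


def transfer_time_ms(bitstream_bytes, bytes_per_s):
    """Ideal transfer time in milliseconds, ignoring any setup latency."""
    return bitstream_bytes / bytes_per_s * 1e3


# Hypothetical 1 MB partial bitstream (assumed size, for illustration only).
size = 1_000_000
pcap_ms = transfer_time_ms(size, PCAP_BYTES_PER_S)
icap_ms = transfer_time_ms(size, ICAP_BYTES_PER_S)
print(f"PCAP: {pcap_ms:.2f} ms, ICAP: {icap_ms:.2f} ms")
```
      </preformat>
      <p>Because this model is linear in the bitstream size, the ratio between the two estimates stays at 130/19 (about 6.8) whatever size is chosen.</p>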
    </sec>
    <sec id="sec-7">
      <title>3.4 PRR Controller Block</title>
      <p>As shown in Fig. 1, a PRR controller block is introduced
to monitor and manage the states of HW tasks. This block
runs as a state machine under the supervision of the HW
task manager. Through the AXI GP interface, we have
implemented a group of configuration registers (PPR_reg_group)
that are memory-mapped and accessible to the
processor. By configuring these registers, a SW service is
able to set up HW tasks, for example by defining working modes
and data addresses. Since the number of PRRs is fixed in advance, we
provide each PRR with its own PPR_reg_group for configuration. The
meaning of the register contents is left to the user so that it can
be adapted to different HW tasks. Table 1 describes the configuration of
this register group.</p>
      <sec id="sec-7-1">
        <title>3.4.1 Reconfiguration security</title>
        <p>A PRR reconfiguration normally entails a switch of HW
task. The PRR controller is designed to
guarantee HW task security by avoiding invalid output data
and undesired task states. Based on these considerations,
the following features are included:</p>
        <p>In a multi-block pipeline structure, the
pipeline should be emptied before any HW task switch,
so that invalid output data are avoided.</p>
        <p>To maintain the integrity of the data structure being
processed, the PRR controller prevents a reconfiguration
from interrupting a data frame.</p>
        <p>A reset should be asserted to initialize the reconfigured
PRR before it is allowed to be activated.</p>
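        <p>The three rules above can be read as an ordering constraint on the controller's state machine. The following is a minimal software model of that ordering, not the actual hardware FSM: the state names, the class and the boolean inputs are all hypothetical and chosen only for this sketch.</p>
        <preformat>
```python
from enum import Enum, auto


class PrrState(Enum):
    RUNNING = auto()
    DRAINING = auto()        # waiting for an empty pipeline and a frame boundary
    RECONFIGURING = auto()
    RESET = auto()


class PrrController:
    """Toy model of the three safety rules (hypothetical names throughout)."""

    def __init__(self):
        self.state = PrrState.RUNNING

    def request_reconfiguration(self):
        # A switch request never starts the reconfiguration immediately.
        if self.state is PrrState.RUNNING:
            self.state = PrrState.DRAINING

    def step(self, pipeline_empty, frame_done, transfer_done):
        if self.state is PrrState.DRAINING:
            # Rules 1 and 2: pipeline drained and current data frame finished.
            if pipeline_empty and frame_done:
                self.state = PrrState.RECONFIGURING
        elif self.state is PrrState.RECONFIGURING:
            if transfer_done:
                self.state = PrrState.RESET
        elif self.state is PrrState.RESET:
            # Rule 3: one reset step before the new task is activated.
            self.state = PrrState.RUNNING
        return self.state
```
        </preformat>
        <p>In this model a request issued in the middle of a data frame simply keeps the controller in the draining state until both conditions hold.</p>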
      </sec>
      <sec id="sec-7-2">
        <title>3.4.2 Interrupts Management</title>
        <p>The PRR controller is able to generate general-purpose
interrupts through the Shared Peripheral Interrupts (SPI)
connected to the generic interrupt controller (GIC). Eight SPI
lines are used to provide the PS with HW task
information such as task completion or critical errors.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4. REAL TIME MICROKERNEL</title>
      <p>To facilitate the management of multiple guest SW
applications and HW tasks, we developed a simplified microkernel
based on Mini-NOVA, a revision of the NOVA
hypervisor. In this section, we propose a dedicated HW task manager
service and a scheduling strategy to support dynamic PR
management.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Microkernel Description</title>
      <p>The microkernel runs directly on bare-metal hardware. By
implementing the basic OS functionalities, it provides
user applications with an abstraction layer of the hardware platform.
Using a microkernel allows a higher security level to be
achieved through virtualization. One of the essential features of this
microkernel is security, so the principle of least privilege is
strictly followed in our framework to ensure a
minimal trusted computing base (TCB). A small TCB
also improves performance, since context switches
execute more quickly.</p>
      <p>The proposed microkernel has simplified functionality and
reduced complexity, which makes it more suitable for
embedded systems and also more adaptable. Since the initial
Mini-NOVA was designed for the x86 architecture, several
modifications have been made so that it executes on the ARM Cortex-A9
available on the Zynq-7000 platform. In addition,
new mechanisms and a new scheduling strategy have
been added to the system. The main features of the
proposed microkernel are:</p>
      <p>Modified bootloader and boot sequence for both the Zynq
platform (e.g. FPGA initialization, DDR
initialization, etc.) and the ARM Cortex-A9 processor (e.g. kernel
boot, user boot, page table, exception vector, etc.)</p>
      <p>Separate virtual memory mappings for kernel and user
space, with an isolated execution context for
each user application</p>
      <p>System calls and IRQs provided to user applications</p>
      <p>A specific priority-based round-robin scheduler to support PR</p>
      <sec id="sec-9-1">
        <title>Support for virtualized OSes (e.g. uC/OS-II)</title>
        <p>Note that, to minimize the TCB size of the
kernel and guarantee system security, most board-specific
support APIs and services are implemented in user space,
including the HW task manager, AXI support, and support for
on-board peripheral resources (UART, SD card, interrupt
controller, TTC timer, etc.).</p>
        <p>The virtual memory space of our system is divided into
several domains. As described in Table 2, the kernel space
and user space are isolated from each other by virtual mapping.
The upper 256MB of the memory space is
assigned to the microkernel, whereas user applications
execute in the lower memory space. Besides the user space and
kernel space, an extra space of up to 256MB is allocated to
store the bitstream files of the HW tasks. This area
is configured to be accessible only from user space.</p>
        <p>The execution context (EC) is the major kernel object;
it is the abstraction of a user thread or application in
kernel space. Each EC is exclusively attached to one
user application and maintains that
application's state, such as the CPU/FPU register state,
stack location, and scheduling sequence. By resuming its
EC, a given task can be completely restored. When
sensitive operations (page allocation, thread creation, cache
operations, etc.) are required, user space accesses the
kernel services through system calls, which are also
handled through an EC.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4.2 HW Task Manager</title>
      <p>The HW task manager is defined as a special user
application serving other applications. Though executed in user
space, this service cooperates closely with the kernel and is
an essential part of the PR control flow in the system. Its
features are described in the following.</p>
      <sec id="sec-10-1">
        <title>4.2.1 Bitstreams Management</title>
        <p>The switch of HW tasks is based on the download of
different bitstream files. As introduced in Section 3, each
bitstream corresponds to one HW task, and its PRR container
is fixed in advance (but not exclusive), which also determines its
reconfiguration overhead. All bitstream files are loaded into
the HW task memory space shown in Table 2 at the kernel
boot stage. Each available bitstream file is given a
descriptor, defined by a bit_descriptor class. We also
created a look-up table of all bit_descriptor objects, indexed by
a unique ID number. In fact, of the object members given in
Table 3, bit_descriptor::id is the only information that a
normal user application needs to know about HW tasks. Other
pieces of information, such as location and length, are
used only by the HW task manager.</p>
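        <p>A descriptor table of this kind might be sketched as follows. The field names follow the members mentioned in the text (ID, location, length) plus a target PRR; since Table 3 is not reproduced here, the exact member list, the addresses and the lengths below are assumptions made for illustration.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitDescriptor:
    id: int        # unique ID, the only field user applications need
    addr: int      # location of the bitstream in the HW task memory space
    length: int    # bitstream length in bytes
    prr_id: int    # target PRR, fixed in advance


# Look-up table indexed by ID; addresses and lengths are made-up values.
BIT_TABLE = {
    d.id: d
    for d in (
        BitDescriptor(id=1, addr=0x3000_0000, length=0x4_0000, prr_id=1),
        BitDescriptor(id=2, addr=0x3004_0000, length=0x4_0000, prr_id=1),
        BitDescriptor(id=3, addr=0x3008_0000, length=0x8_0000, prr_id=2),
    )
}


def lookup(hw_id):
    """Return the descriptor for hw_id, or None if no such HW task exists."""
    return BIT_TABLE.get(hw_id)
```
        </preformat>
        <p>Descriptors 1 and 2 share the same PRR in this example, which mirrors the remark that a container is fixed in advance but not exclusive to one HW task.</p>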
      </sec>
      <sec id="sec-10-2">
        <title>4.2.2 Calling the HW Task Manager</title>
        <p>Any attempt to dispatch, reconfigure, modify or disable
HW tasks must go through the HW task
manager. In other words, operations on HW tasks are isolated
from other user applications. We employ this mechanism
to ensure the security of the FPGA fabric. For user
applications that cooperate with HW tasks, the only
accessible memory space is the HW task data section, which
is used for bulk SW/HW data exchange.</p>
        <p>As described in Section 3, the behavior of HW tasks is
controlled by writing parameter values to their
corresponding PRR configuration registers. The meaning of these
parameters is defined by the user application and is of no
concern to the HW task manager. The only information
required by the HW manager is the ID of the HW task and
the arguments to be transferred to the register group.</p>
        <p>[Fig. 2: block diagram of the execution of the HW task manager. Labels: SW Application, Syscall_HW_Manager(1, 0, arg01, arg02, arg03), Rescheduling(), Syscall_yield(), EC, HW Manager, kernel/user, bitfile descriptor table (id, addr, length, delay, prr_id; example row 1, A1, L1, D1, 1), prr_ transferbitfile(), PCAP, HW task data (Bitfile1 to Bitfile3), AXI4-Lite config, AXI4, Proc_status, PPR_status, Int_status, PPR_delay, arg01 to arg03, Regs1 to Regs3, PRR1 to PRR3.]</p>
        <p>A block diagram describing the execution of the HW task
manager is shown in Fig. 2. As shown, a dedicated
system call from user space requires the kernel to launch
the HW task manager. Arguments are passed through to
the HW manager. The prototype of this system call
is:</p>
        <p>Syscall_HW_Manager(HW_id, irq_en, arg01, arg02, arg03)</p>
        <p>On handling this system call, the kernel invokes a
reschedule and returns to user space, passing control to the
HW task manager. In this process, the arguments are also
delivered to the HW manager. The HW manager compares
HW_id with the HW task currently executing. If the requested task is already
implemented in the PRR, only the parameters are changed by
writing the arguments to the register group; otherwise, a PCAP
transfer is configured to reload the target PRR with
the desired HW task. The irq_en argument indicates
whether the PL interrupt is enabled for the corresponding
PRR, by setting values in the PRR Int_status register. After
completing the required operation, the HW task manager
hands control back to the previously interrupted application.</p>
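        <p>The decision made by the HW manager on each call could be sketched as below. This is a simplified model for a single target PRR: the FakePrr class, its attributes and the choice to write the arguments after a reload as well are assumptions of this sketch, not details confirmed by the text.</p>
        <preformat>
```python
class FakePrr:
    """Stand-in for one PRR and its register group (hypothetical model)."""

    def __init__(self):
        self.loaded_hw_id = None
        self.regs = [0, 0, 0]
        self.irq_enabled = False
        self.pcap_transfers = 0


def hw_manager_handle(prr, hw_id, irq_en, arg01, arg02, arg03):
    """Sketch of the decision described above for a single target PRR."""
    if prr.loaded_hw_id != hw_id:
        # Requested task is not in the PRR: start a PCAP reload.
        prr.pcap_transfers += 1
        prr.loaded_hw_id = hw_id
    # Assumption: in both cases the arguments end up in the register group.
    prr.regs = [arg01, arg02, arg03]
    prr.irq_enabled = bool(irq_en)
    return prr
```
        </preformat>
        <p>Calling the function twice with the same HW_id only updates the registers the second time, which is the cheap path the text describes.</p>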
        <p>In some cases, a PR request cannot be served
immediately. As in the scenarios described in Section 3, a HW
task may be in the middle of processing a data frame and not
ready for reconfiguration. In such situations, to avoid
monopolizing the CPU, the HW task manager enters a pull-up
(suspended) state and yields the CPU to other SW applications.
When the data frame has been completely processed, the target
PRR informs the HW task manager by triggering the
IRQ_Reco_rdy interrupt, and the service is then relaunched to start
the PCAP bitstream download.</p>
        <p>One major drawback of the PR technique is its significant
reconfiguration time overhead. To reduce its effect on
performance, we abandon the polling-for-done mechanism:
the completion of a PCAP transfer is not reported to
the HW task manager. Once the HW task manager
launches the PCAP transfer, it gives up the CPU and waits
for the next call. A HW task is set to start operating
automatically as soon as reconfiguration is done, so the
reconfiguration time overhead is overlapped with CPU
operations. SW applications can synchronize with the state of a
HW task through its general-purpose IRQ. This
functionality is enabled by the PRR Int_status register. For example,
imagine a simple application with an image display SW
task that uses a HW image filter accelerator. The accelerator
fetches the target image and writes the results back to memory
through AXI4 automatically. Once the image processing is [...]</p>
        <p>[Fig. 3: priority-based scheduling. Labels: List_prio[prio_top], List_prio[2], List_prio[1], prio_top = 1.]</p>
      </sec>
      <sec id="sec-10-3">
        <title>4.2.3 HW Task Manager API</title>
        <p>The DevCfg driver API is provided by the Xilinx SDK
tool and handles non-secure and secure PCAP transfers.
Besides the DevCfg API, several additional functions were
developed to simplify HW management.
Table 4 lists and describes the API supporting HW task
management.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>4.3 SW Tasks Scheduling</title>
      <p>The scheduling strategy of SW tasks in Mini-NOVA is a
priority-based round-robin mechanism. The scheduler
manages the execution sequence by manipulating ECs. Each EC
obtains its own priority level at its creation, which is
changeable afterwards. Within the same priority level, SW tasks
share the CPU through round-robin scheduling. Among
different priority levels, high-priority tasks will always preempt
low-priority tasks since the scheduler always selects the
highest priority EC and dispatches the SW task attached to it.</p>
      <p>Basically, all general SW tasks execute at the same
priority level (1 by default). However, to fulfill the timing
constraints of specific requests, such as real-time tasks and PR
requests, different priority levels are introduced. Such
tasks should have a higher priority so that they can be
dispatched in time. Since our current system mainly deals
with HW management, only the HW task manager is
discussed here.</p>
      <p>Fig. 3 presents the priority-based scheduling mechanism.
At each priority level, ECs are organized as a
doubly linked queue, which is indexed by the list_prio[] structure.
list_prio[] is a list of EC pointers indexed by priority
level: each list_prio[] element points to the EC queue of one
priority level. The run queue is composed of the EC queues of the
different priority levels, and prio_top identifies the
highest priority level in the current run queue. When reschedule()
is invoked, prio_top is used to access the
highest-priority EC queue by dispatching list_prio[prio_top]. Once
dispatched, the queue keeps executing until another
reschedule() is invoked.</p>
      <p>As shown in Fig. 3, the EC of the HW task manager is
registered in the microkernel at its creation with a default
priority level of 2. Initially, the HW manager is not included in
the run queue, as in Fig. 3(a). When Syscall_HW_Manager() is
executed, the microkernel calls HW_Manager_Enqueue()
to add the HW task manager to the run queue, as shown
in Fig. 3(b). Then the reschedule() function is called to
update the schedule and dispatch the HW task manager as
the highest-priority EC by selecting list_prio[2]. When the
HW task manager finishes its task or enters the pull-up
state, HW_Manager_Dequeue() is called to remove the EC of
the HW task manager from the run queue, as shown in Fig.
3(a), so that low-priority SW tasks are permitted to execute.
Through this strategy, the PR of a HW accelerator is able
to preempt other SW tasks, and a quick response for HW
task management is guaranteed.</p>
      <p>[Fig. 3, panels (a) and (b). Labels: run_queue, HW_Manager_Enqueue(), HW_Manager_Dequeue(), Activated, Suspended.]</p>
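      <p>The interplay between the per-priority queues, prio_top and the enqueue/dequeue of the HW task manager can be modelled in a few lines. The sketch below is a toy model only: a Python deque stands in for the doubly linked EC queue, ECs are plain strings, and the class and method names are assumptions chosen to echo the identifiers in the text.</p>
      <preformat>
```python
from collections import deque


class Scheduler:
    """Toy priority round-robin run queue; names mirror the text loosely."""

    def __init__(self, levels=3):
        # list_prio[p] is the queue of runnable ECs at priority p.
        self.list_prio = [deque() for _ in range(levels)]
        self.prio_top = 0

    def enqueue(self, ec, prio):
        self.list_prio[prio].append(ec)
        self.prio_top = max(self.prio_top, prio)

    def dequeue(self, ec, prio):
        self.list_prio[prio].remove(ec)
        # Recompute the highest non-empty priority level.
        self.prio_top = max(
            (p for p, q in enumerate(self.list_prio) if q), default=0
        )

    def reschedule(self):
        """Pick the next EC from the highest-priority queue, round-robin."""
        queue = self.list_prio[self.prio_top]
        if not queue:
            return None
        ec = queue[0]
        queue.rotate(-1)   # move the dispatched EC to the back of its queue
        return ec
```
      </preformat>
      <p>With two applications at level 1, enqueuing a manager at level 2 makes every reschedule() return the manager until it is dequeued, after which round-robin among the level 1 applications resumes.</p>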
    </sec>
    <sec id="sec-12">
      <title>5. USE-CASE IMPLEMENTATION</title>
      <p>In order to test the SW/HW scheduling mechanism on the
platform, a use-case application based on a real scenario is
proposed. In this scenario, a mobile wireless terminal
is capable of dynamically changing its configuration in order to
obtain the best level of performance according to the
channel conditions. For example, if the channel is very noisy, the
transmitter uses a simple but very efficient QAM
modulation to the detriment of the throughput. As soon
as the channel conditions allow the throughput to be increased,
the mobile device may reconfigure itself to change its internal
hardware modulator and rapidly adapt to the environment.</p>
    </sec>
    <sec id="sec-13">
      <title>5.1 Implementation Description</title>
      <p>In the proposed use-case scenario, the application is
divided into two main software tasks running on the processor
and two additional hardware tasks running in the FPGA.</p>
      <p>The SW_ChannelSensor task performs a channel
estimation in order to evaluate the maximum level of performance
that can be obtained in terms of throughput and error rate. The
SW_HardwareManager task is an instance of the HW task
manager described in Section 4.</p>
      <p>Concerning the hardware parts, two sets of reconfigurable HW
tasks have been considered, which deal respectively with
the modulation scheme and the IFFT used in the OFDM
context. The HW_Modulation task determines the
modulation to be implemented, i.e. the constellation size.
In this work, three constellation sizes have been considered:
4-QAM, 16-QAM and 64-QAM. For the second
hardware task, HW_IFFT, several configurations have also been
implemented according to the number of points.
In our application, a range of I-FFT sizes
(from 256 points to 8192 points) was implemented,
depending on the channel bandwidth to be considered. All HW
tasks execute in their corresponding PRRs (PRR0 - PRR3).</p>
      <p>Since HW_Modulation and HW_IFFT execute as a pipeline,
reconfiguring either of these HW tasks would suspend the entire
pipeline. To minimize this significant time overhead, we
propose a multiple-path structure, depicted in the block
diagram of Fig. 4. Both HW_Modulation and HW_IFFT
consist of a pair of identical PRRs. While the current PRR
continues working, SW_ChannelSensor may change the HW
task by reconfiguring the other PRR and activating the new
datapath after reconfiguration. Thus, the overhead caused
by reconfiguration is reduced.</p>
      <p>[Fig. 4: multiple-path structure. Labels: PRR0 (QAM1), PRR1 (QAM2), PRR2 (IFFT1), PRR3 (IFFT2), HW_Modulation, HW_IFFT, HW_IFFT8192 task, I, Q.]</p>
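      <p>The dual-PRR arrangement behaves like a ping-pong buffer: one region is active while the other is reloaded, and the roles swap at a frame boundary. A minimal model is sketched below; the PrrPair class and its method names are hypothetical, and the task names are plain strings used only for illustration.</p>
      <preformat>
```python
class PrrPair:
    """Two identical PRRs; one is active while the other may be reloaded."""

    def __init__(self, initial_task):
        self.tasks = [initial_task, None]
        self.active = 0

    def reconfigure_standby(self, new_task):
        # The active PRR keeps processing while the idle one is reloaded.
        standby = 1 - self.active
        self.tasks[standby] = new_task
        return standby

    def switch(self):
        # Called at a data-frame boundary, after the reload has completed.
        self.active = 1 - self.active
        return self.tasks[self.active]
```
      </preformat>
      <p>For example, a pair created with a 256-point I-FFT can have its standby region loaded with a 512-point variant and then switch to it, without the active region ever being interrupted.</p>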
      <p>With an 18,800-bit data frame size and a 100MHz FPGA
clock frequency, a Gantt chart of the proposed
scenario on our platform is given in Fig. 5. The application
begins with the SW_ChannelSensor task deciding to change
the hardware configuration because the channel
conditions are not suitable for the default configuration (a QAM4
modulation scheme and a 256-point I-FFT). In this case, the
task calls SW_HardwareManager to handle its request
to switch the I-FFT mode to 512 points (t1 - t2). Since PRR3
is idle and ready for reconfiguration, SW_HardwareManager
launches the PCAP transfer to implement HW_IFFT512 in
PRR3 while the QAM4-IFFT256 pipeline continues
computing (t2 - t5). After the completion of the PCAP transfer (t5),
the pipeline waits for the data frame currently in progress to be
completely processed (t5 - t6) before HW_IFFT512 is
activated at t6. The same procedure is executed again when
SW_ChannelSensor decides to switch from QAM4
modulation to QAM16 (t7 - t12). Some attributes of the SW/HW tasks
are listed in Table 5.</p>
    </sec>
    <sec id="sec-14">
      <title>5.2 Discussion</title>
      <p>As shown in the Gantt chart, the major overhead of
reconfiguration is fully circumvented because SW and HW
tasks run in parallel. For data processing, the only
overhead caused by a HW task switch is the delay
required to finish processing a complete data frame (worst case 0.168
ms, for the 8192-point I-FFT). Owing to the simplified
kernel and scheduling mechanism, the response to a PR request
is quick (0.0119 ms). Note that the
large reconfiguration overhead of the I-FFT tasks results from the
large, computation-intensive structure of the I-FFT blocks.
Implemented with the Xilinx PlanAhead tool, an I-FFT block consumes
5600 LUTs and 1600 slices, which is 13% of the on-chip FPGA
resources. In a static FPGA design, implementing
multiple I-FFT blocks with different point sizes would cost
considerable FPGA area, whereas on our platform only 26% of the FPGA
resources (2 I-FFT PRRs) are needed to host all of them.
The chip cost is thus significantly reduced.</p>
      <p>[Table 5, partially recovered. Column "Resource Usage": no, no, no, 2%, 13%.]</p>
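      <p>The figures quoted in this discussion can be cross-checked with a short calculation. The sketch below only restates numbers given in the text (100 MHz FPGA clock, 0.168 ms worst-case frame delay, 13% of resources per I-FFT region, two regions per stage). The function names are hypothetical, and expressing the delay in FPGA clock cycles assumes that it is spent entirely in the 100 MHz clock domain.</p>
      <preformat>
```python
FPGA_CLOCK_HZ = 100_000_000   # 100 MHz FPGA clock (from the text)
WORST_CASE_DRAIN_MS = 0.168   # worst-case frame completion delay (from the text)
IFFT_REGION_SHARE_PCT = 13    # FPGA resources used by one I-FFT region (from the text)
PRRS_PER_STAGE = 2            # identical PRRs per stage (from the text)


def drain_cycles(delay_ms, clock_hz):
    """Convert a delay in milliseconds to a whole number of clock cycles."""
    return round(delay_ms * 1e-3 * clock_hz)


def stage_share_pct(region_share_pct, regions):
    """Total resource share (percent) reserved by a stage of identical PRRs."""
    return region_share_pct * regions


cycles = drain_cycles(WORST_CASE_DRAIN_MS, FPGA_CLOCK_HZ)      # 16800 cycles
share = stage_share_pct(IFFT_REGION_SHARE_PCT, PRRS_PER_STAGE)  # 26 percent
```
      </preformat>
      <p>The 26% share is independent of how many I-FFT sizes are supported, since every size is loaded into one of the same two regions.</p>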
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION</title>
      <p>In this paper, we have presented a custom ARM-specific
microkernel on a partially reconfigurable FPGA platform.
This approach makes it possible to manage reconfigurable
HW accelerators and SW tasks dynamically through a specific
scheduling mechanism. Efforts have been made to maximize the
performance of the FPGA fabric and minimize the overhead
caused by partial reconfiguration. We are currently working
on the virtualization of guest OSes. By hosting different
OSes on top of the microkernel, we intend to establish a
complete virtualizable embedded system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Coutinho</surname>
          </string-name>
          , and W. Luk, "
          <article-title>Reconfigurable computing: Productivity and performance,"</article-title>
          <source>in Asilomar Conference on Signals, Systems and Computers</source>
          , pp.
          <volume>685</volume>
          –
          <issue>689</issue>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Heiser</surname>
          </string-name>
          , "
          <article-title>The role of virtualization in embedded systems,"</article-title>
          <source>in Proceedings of the 1st Workshop on Isolation and Integration in Embedded Systems</source>
          , pp.
          <volume>11</volume>
          –
          <issue>16</issue>
          , ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] "UG585: Zynq-7000
          <source>All Programmable SoC Technical Reference Manual," Xilinx Inc</source>
          .,
          <year>March 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>U.</given-names>
            <surname>Steinberg</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Kauer</surname>
          </string-name>
          , "
          <article-title>NOVA: a microhypervisor-based secure virtualization architecture,"</article-title>
          <source>in Proceedings of the 5th European Conference on Computer Systems</source>
          , pp.
          <volume>209</volume>
          –
          <issue>222</issue>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Claus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Stechele,
          <string-name>
            <given-names>L.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hübner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Becker</surname>
          </string-name>
          , "
          <article-title>A multi-platform controller allowing for maximum dynamic partial reconfiguration throughput,"</article-title>
          <source>in Field Programmable Logic and Applications</source>
          , pp.
          <volume>535</volume>
          –
          <issue>538</issue>
          , IEEE,
          <year>September 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Göhringer</surname>
          </string-name>
          , M. Hübner,
          <string-name>
            <given-names>E. N.</given-names>
            <surname>Zeutebouo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Becker</surname>
          </string-name>
          , "
          <article-title>Operating system for runtime reconfigurable multiprocessor systems,"</article-title>
          <source>International Journal of Reconfigurable Computing</source>
          , vol.
          <volume>2011</volume>
          ,
          <year>January 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Prevotet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Benkhelifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Granado</surname>
          </string-name>
          , et al., "
          <article-title>A framework for the exploration of RTOS dedicated to the management of hardware reconfigurable resources,"</article-title>
          <source>in International Conference on Reconfigurable Computing and FPGAs</source>
          , pp.
          <volume>61</volume>
          –
          <issue>66</issue>
          , IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cui</surname>
          </string-name>
          , et al., "
          <article-title>Microkernel hypervisor for a hybrid ARM-FPGA platform,"</article-title>
          <source>in 24th International Conference on Application-Specific Systems, Architectures and Processors</source>
          , pp.
          <volume>219</volume>
          –
          <issue>226</issue>
          , IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Vipin</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Fahmy</surname>
          </string-name>
          ,
          <article-title>"A high speed open source controller for FPGA partial reconfiguration,"</article-title>
          <source>in FPT</source>
          , pp.
          <volume>61</volume>
          –
          <issue>66</issue>
          , IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>