=Paper= {{Paper |id=Vol-1291/ewili14_7 |storemode=property |title=Microkernel Dedicated for Dynamic Partial Reconfiguration on ARM-FPGA Platform |pdfUrl=https://ceur-ws.org/Vol-1291/ewili14_7.pdf |volume=Vol-1291 |dblpUrl=https://dblp.org/rec/conf/ewili/XiaPN14 }} ==Microkernel Dedicated for Dynamic Partial Reconfiguration on ARM-FPGA Platform== https://ceur-ws.org/Vol-1291/ewili14_7.pdf
Microkernel Dedicated for Dynamic Partial Reconfiguration on ARM-FPGA Platform

Tian Xia, Jean-Christophe Prévotet and Fabienne Nouvel
Université Europe de Bretagne, France
INSA, IETR, UMR 6164, F-35708 RENNES
{tian.xia; jean-christophe.prevotet; fabienne.nouvel}@insa-rennes.fr


EWiLi’14, November 2014, Lisbon, Portugal.
Copyright retained by the authors.

ABSTRACT
This paper describes the first implementation of a custom microkernel on an ARM-FPGA platform capable of managing reconfigurable hardware parts dynamically. After describing the structure of the proposed microkernel, we focus on a custom system task that handles reconfiguration management and is associated with a dedicated scheduling mechanism. We describe the hardware platform to which the microkernel has been ported and provide a use case application to demonstrate the feasibility of the approach. At the end of this paper, we provide quantitative results in terms of reconfiguration overhead and microkernel timing performance.

Keywords
Microkernel, Real-Time Systems, FPGA, Embedded System, Reconfigurable Architectures

1.    INTRODUCTION
Over the last decades, with the development of commodity field-programmable gate arrays (FPGAs), reconfigurable computing has gained increasing attention for its potential in exploiting hardware resources. Through time-multiplexed sharing of the FPGA fabric, a higher integration of functionalities can be achieved. The main drawback of traditional FPGA reconfigurable computing is its lack of flexibility, because the whole fabric has to be reconfigured even when only a part of the FPGA needs to be modified. As a consequence, enormous time overhead and power consumption are produced, which severely limits reconfiguration in embedded systems.

As a solution, a more advanced technique that reconfigures particular areas of an FPGA while the rest continues executing has been proposed and is known as Dynamic Partial Reconfiguration (DPR). This technique has proved to be quite promising in the embedded domain because of its runtime adaptivity for hardware algorithms and its lower power consumption compared to large-scale static circuits [1]. With DPR, hardware accelerators can be dynamically dispatched and managed, becoming as flexible as software functions.

On the other hand, with the widespread use of handheld devices, the reliability and security of embedded systems have become a serious concern. Microkernels constitute a promising approach because they allow the user to execute various applications (commodity APIs, real-time tasks, etc.) in their own isolated containers to ensure isolation and thus security. Consequently, they have been a popular research trend in the embedded systems domain for many years [2].

In this paper, we describe and study a custom embedded microkernel on a hybrid ARM-FPGA Zynq-7000 platform [3]. This microkernel is a revised version of the NOVA microhypervisor [4], and is integrated with the management and scheduling of reconfigurable hardware resources. The proposed architecture allows for dynamic management of SW/HW tasks, secure task isolation and efficient SW/HW communication.

The remainder of the paper is organized as follows: Section 2 presents current research on the management of DPR architectures. In Section 3, the overall architecture of the proposed platform is introduced. Section 4 focuses on the design and implementation of the microkernel, with a detailed introduction to the hardware task management and scheduling mechanisms. In Section 5, we present a case study to demonstrate the capabilities of the proposed microkernel. Finally, Section 6 concludes the paper.

2.    RELATED WORK
Compared with the traditional full reconfiguration mechanism, the DPR technique benefits from the following major advantages [3]:

     • Reduced hardware resource utilization
     • Improved design efficiency
     • Reduced reconfiguration latency and better robustness

Despite the enhanced flexibility provided by DPR techniques, the reconfiguration overhead remains a crucial issue in practice. In modern high-end FPGAs, which may have tens of millions of configuration points, one reconfiguration of a complex module will be very time-consuming. Numerous studies have proposed efficient hardware reconfiguration management with dedicated architectures and OS support. A custom DPR controller was introduced in [5] to realize high-speed on-chip reconfiguration. In [6], a
specific operating system, CAP-OS, was proposed to provide clients with hardware task management and priority-based scheduling. Further research was carried out in the OveRSoC project, which provided a high-level abstraction model to efficiently simulate and validate embedded RTOSes for reconfigurable platforms [7]. Most research on traditional DPR devices (i.e. Virtex FPGA family) employs embedded processors such as MicroBlaze or PowerPC, whose computing ability is relatively limited.

Compared to classical devices, the Zynq-7000 platform integrates the ARM Cortex-A9 processor with various on-board resources and opens up enormous possibilities for embedded techniques. In this platform, the programmable fabric is considered as a unique auxiliary computing resource to this fully capable processing system, and reconfiguration management is expected to be one of many tasks in the system. Hence, a specific kernel is the ideal solution to rationally dispatch both hardware and software resources.

While considerable efforts have been made to port microkernel techniques to traditional embedded systems, such as the OKL4 from Open Kernel Labs [2], most existing microkernels on ARM do not consider reconfigurable hardware. Instead, most works only use a microkernel to manage heterogeneous platforms, i.e. software and static hardware parts. For example, in [8], an L4 kernel is ported to manage hardware and software tasks, but without using dynamic reconfiguration. In parallel, research in [9] discussed reconfiguration management on the Zynq platform at the application level, without using any operating system and thus with poor flexibility.

3.   PROPOSED PLATFORM
The motivation of the proposed hybrid ARM-FPGA platform framework is to establish a practical user environment with a highly abstract microkernel. The management of hardware resources is integrated as a user application, with relatively easy access. Both software and hardware tasks are registered and scheduled by a custom microkernel. A block diagram of the proposed platform is shown in Fig. 1.

On the proposed platform, computing resources are divided into the Processing System (PS) and the Programmable Logic (PL). On the PS side, a simplified microkernel hosts multiple software applications, including guest OSes and user applications executing within the user space. Each application is housed in an individual isolated space of the microkernel, which is referred to as an execution context (EC). By scheduling and switching ECs, the ARM processor is shared among guests according to time multiplexing. The FPGA fabric is engaged in several hardware acceleration operations, which execute concurrently with SW tasks. A specific hardware task management routine is proposed to control and reconfigure the hardware accelerators dynamically. This routine runs as a guest of the microkernel and is scheduled whenever the management of HW tasks is required. The mechanism related to this part will be described in detail in Section 4. In this way, the FPGA resource is seen as a standard user application by the microkernel, and hardware and software tasks can thereby be managed concurrently in our framework.

3.1   Hardware Tasks
In a reconfigurable embedded system, hardware tasks are implemented using functional fabric structures in the FPGA, which can be user-defined computing blocks or commercial IP cores. As shown in Fig. 1, the FPGA fabric is divided into multiple partially reconfigurable black boxes or containers, which are capable of housing hardware tasks independently. These containers are defined as partially reconfigurable regions (PRRs). The hardware task running in each container is run-time switchable under the control of the hardware task manager. Blocks of different sizes are allocated to different PRRs for different task purposes.

The resource that holds the fabric information of a hardware task is contained in a bitstream file. Different bitstream files can be stored in various memory devices and be accessed via a simple look-up table. Note that the container corresponding to each HW task always has the same constrained location in the FPGA. A HW task is dispatched by transferring the corresponding bitstream file to the assigned PRR. Normally, HW tasks with similar or close functionalities should be assigned to the same PRR, so that the coherence of HW task interfaces can be guaranteed. Each HW task should have one corresponding SW application to monitor and control its behaviour.

One of the crucial features of hardware tasks is the reconfiguration overhead, which scales linearly with the size of the bitstream and thus with the PRR size. This means that all HW tasks implemented in the same PRR will have the same time overhead for reconfiguration.

3.2   HW/SW Task Communication
To connect the PL with the PS, two interface types based on the standard AXI bus protocol are employed. The AXI GP interface offers a unified mapping to the processor, is accessed like normal memory, and is intended for low-speed general-purpose communication. As shown in Fig. 1, the processing system controls two master AXI GP interfaces as the main means to configure and read back the states of the HW tasks.
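To make the memory-mapped access pattern concrete, here is a minimal Python sketch. It is an illustration only, not the authors' code: it models a register window as a byte array, and the class name `RegisterWindow`, the window size and the register offset `0x04` are hypothetical. The base address `0x40000000` is the start of the PL range that the paper lists in Table 2.

```python
import struct

PL_BASE = 0x40000000  # start of the PL range in the paper's Table 2


class RegisterWindow:
    """Toy model of a memory-mapped register window (hypothetical helper)."""

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.mem = bytearray(size)  # stands in for the PL-side registers

    def _offset(self, addr: int) -> int:
        # Translate an absolute address into an offset and reject
        # out-of-range or unaligned 32-bit accesses.
        off = addr - self.base
        if off < 0 or off + 4 > len(self.mem) or off % 4 != 0:
            raise ValueError(f"bad register address {addr:#x}")
        return off

    def write32(self, addr: int, value: int) -> None:
        # A register write is an ordinary 32-bit store to a mapped address.
        struct.pack_into("<I", self.mem, self._offset(addr), value & 0xFFFFFFFF)

    def read32(self, addr: int) -> int:
        # Reading back a state is an ordinary 32-bit load.
        return struct.unpack_from("<I", self.mem, self._offset(addr))[0]


window = RegisterWindow(PL_BASE, size=64)
MODE_REG = PL_BASE + 0x04            # hypothetical offset of a mode register
window.write32(MODE_REG, 0x3)        # configure the HW task
readback = window.read32(MODE_REG)   # read its state back
```

On real hardware the byte array would be replaced by the mapped PL address range; the point of the sketch is only that configuration and read-back reduce to plain loads and stores.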
AXI HP is aimed at high-performance data exchange with burst transfers, which may move data blocks as large as 4KB in one burst, and is sufficient for generic data processing applications. On our platform, 4 AXI HP interfaces are used and are in charge of accessing both the on-chip memory (OCM) and the DDR. Since HW tasks access AXI HP as masters, data is fetched and written back without involving the processor, allowing the processor to run simultaneously with HW tasks.
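As a rough illustration of the burst limit mentioned above, the sketch below splits one buffer transfer into chunks of at most 4 KB. It rests on stated assumptions and is not the platform's transfer logic: the function name `split_into_bursts` and the example address are hypothetical, the 4 KB figure is the one quoted in the text, and keeping each chunk inside a 4 KB-aligned window is an added assumption that mirrors the AXI rule that a burst must not cross a 4 KB address boundary.

```python
MAX_BURST = 4096  # bytes per burst, as quoted in the text


def split_into_bursts(addr: int, length: int, max_burst: int = MAX_BURST):
    """Return (address, size) chunks that never cross a max_burst boundary."""
    if length < 0:
        raise ValueError("length must be non-negative")
    bursts = []
    end = addr + length
    while addr < end:
        # Next aligned boundary strictly above the current address.
        boundary = (addr // max_burst + 1) * max_burst
        size = min(boundary, end) - addr
        bursts.append((addr, size))
        addr += size
    return bursts


# Example: a 10000-byte frame at an illustrative, unaligned address.
bursts = split_into_bursts(0x30000100, 10000)
```

With an unaligned start address the first and last chunks are shorter than 4 KB, so the number of bursts depends on alignment as well as on the buffer length.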

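Figure 1: Diagram of the Proposed Hybrid Platform
[Block diagram. PS side: DDR holding HW task data and bit files; ARM Cortex-A9 running the microkernel with a user app, an I/O driver and the HW task manager; GIC; DevCfg and PCAP; AXI interconnection. PL side: PRR controller with an AXI4-HP master port, an AXI4-Lite slave port, one PPR reg group and one config register set per PRR; PRR1, PRR2, PRR3.]

3.3   Reconfiguration Interface
Two methods for partial reconfiguration are supported on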
the Zynq platform: the Processor Configuration Access Port (PCAP) and the Internal Configuration Access Port (ICAP). Using PCAP, as shown in the datapath of Fig. 1, the PS can initiate bitstream transfers from memory to the PL through the Device Configuration Interface (DevCfg) at high throughput (130MB/s). In contrast, ICAP is designed for self-configuration from the PL side with an AXI4-Lite interface as transfer port. Such a mechanism severely limits the reconfiguration speed (19MB/s). ICAP is also less attractive because it requires additional hardware resources and will occupy at least one AXI interface. On our platform, PCAP is selected for its better compatibility with software applications and its higher throughput.

3.4     PRR Controller Block
As shown in Fig. 1, a PRR controller block is introduced to monitor and manage the states of HW tasks. This block runs as a state machine under the supervision of the HW task manager. Through the AXI GP interface, we have implemented a group of configuration registers (PPR reg group) which are mapped into memory space and accessible to the processor. By configuring these registers, a SW service is able to set up HW tasks, for example by defining working modes and data addresses. Since the number of PRRs is fixed in advance, we provide each PRR with a PPR reg group for configuration. The content of the registers is left to user definition so that it can be adjusted to different HW tasks. Table 1 describes the configuration of this register group.

Table 1: Description of PRR Configuration Registers

   Reg Name         Width   Description
   Proc_status      16      Mark process status:
                            Bit[0]: start data processing
                            Bit[1]: pause data processing
                            Bit[2]: interrupt handling over
   PRR_Int_status   16      Mark interrupt status:
                            Bit[0:7]: PRRs interrupt enable
                            Bit[8:15]: PRRs interrupt status
   PRR_status       16      Mark PRR enable status:
                            Bit[0:7]: PRRs enable
                            Bit[8:15]: PRRs switch enable
   PRR_Reco_rdy     16      PRR's status for reconfiguration:
                            Bit[0:7]: PRR ready
   PRR_delay        32      Time overhead for current reconfiguration
   PRR_gpr[7:0]     32      General-purpose registers defined by user: HW
                            task ID, working mode, parameters, etc.

3.4.1     Reconfiguration security
A PRR reconfiguration normally requires a switch of HW task. The PRR controller is designed to guarantee HW task security by avoiding invalid data output and undesired task states. Based on these considerations, the following features are included:

   • In the case of a multi-block pipeline structure, the pipeline should be emptied before any HW task switch, so that invalid output data are avoided.
   • To maintain the integrity of the data structure being processed, the PRR controller prevents a reconfiguration from interrupting a data frame.
   • A reset should be asserted to initialize the reconfigured PRR before it is allowed to be activated.

3.4.2     Interrupts Management
The PRR controller is able to generate general-purpose interrupts through the Shared Peripheral Interrupts (SPI) connected to the generic interrupt controller (GIC). Eight SPI resources are used to provide the PS with different HW task information such as task completion or critical errors.

4.    REAL TIME MICROKERNEL
To facilitate the management of multiple guest SW applications and HW tasks, we developed a simplified microkernel based on Mini-NOVA, a revision of the NOVA hypervisor. In this section, we propose a specific HW task manager service and a scheduling strategy to support dynamic PR management.

4.1     Microkernel Description
The microkernel runs on top of bare-metal hardware. By implementing the basic OS functionalities, the microkernel establishes an abstraction layer between the hardware platform and user applications. Using a microkernel allows a higher security level to be achieved through virtualization. One of the essential features of this microkernel is security, so the principle of least privilege is strictly followed in our framework to make sure that a minimal trusted computing base (TCB) is achieved. Such a feature also improves performance through quicker context switches.

The proposed microkernel has simplified functionality and reduced complexity, which makes it more suitable for embedded systems and also more adaptable. Since the initial Mini-NOVA is designed for the x86 architecture, several modifications have been made so that it executes on the ARM Cortex-A9 available on the Zynq-7000 platform. In addition, further mechanisms and a new scheduling strategy have been added to the system. The main features of the proposed microkernel are:

     • Modified bootloader and boot sequence for both the Zynq platform (e.g. FPGA initialization, DDR initialization, etc.) and the ARM Cortex-A9 processor (e.g. kernel boot, user boot, paging table, exception vector, etc.)
     • Separate virtual memory mapping for kernel and user space while providing an isolated execution context for each user application
     • System calls and IRQs provided to user applications
     • Specific priority-based round-robin scheduling to support PR
     • Support for a virtualized OS (e.g. uC/OS-II)

We should note that, to minimize the TCB size of the kernel and guarantee system security, most board-specific support APIs and services are implemented in user space, including the HW task manager, AXI support, and support for on-board peripheral resources (UART, SD card, interrupt controller, TTC timer, etc.).

The virtual memory space of our system is divided into several domains. As described in Table 2, the kernel space and user space are access-isolated by virtual mapping. A range of 256MB of memory space on the upper side is allocated to the microkernel, whereas user applications execute in the lower memory space. Besides the user space and kernel space, an extra space of up to 256MB is allocated to store the bitstream files of the HW tasks. This area is programmed to be accessible only from the user space.

The execution context (EC) is the major kernel object, which is the abstraction of user threads or applications in
the kernel space. Each EC is exclusively attached to one user application and is able to maintain and manipulate that application's features, such as the CPU/FPU register state, stack location, and scheduling sequence. By resuming its EC, a given task can be completely restored. When sensitive operations (page allocation, thread creation, cache operation, etc.) are required, the user space may access the kernel services by generating system calls, which are also handled through an EC.

Table 2: System Address Mapping

   Name         Addr Range                 Accessibility       Description
   Kernel       0xC0000000 - 0xDFFFFFFF    Kernel              Kernel space
   User         0x0 - 0x2FFFFFFF           Kernel, User        User space
   HW Task      0x30000000 - 0x3FFFFFFF    Kernel, User, PL    Bitstreams, HW task data
   PL           0x40000000 - 0xBFFFFFFF    User (AXI GP)       PL Memory Space
   Peripheral   0xE0000000 - 0xFDFFFFFF    Kernel, User        Platform and Peripheral regs

4.2      HW Task Manager
The HW task manager is defined as a special user application serving other applications. Though executed in user space, this service cooperates closely with the kernel and is an essential part of the PR control flow in the system. In the following, we describe its different features.

4.2.1          Bitstreams Management
The switch of HW tasks is based on the download of different bitstream files. As introduced in Section 3, each bitstream corresponds to one HW task, and its PRR container is fixed in advance (but not exclusive), which also determines its reconfiguration overhead. All bitstream files are loaded into the HW task memory space shown in Table 2 at the kernel bootload stage. A descriptor is provided for each available bitstream file by defining a bit_descriptor class. We also created a look-up table for all bit_descriptor objects, indexed by a unique ID number. In fact, of the object members given in Table 3, bit_descriptor::id is the only information that a normal user application needs to know about HW tasks. Other pieces of information, such as location and length, are only used by the HW task manager.

Table 3: Structure of the bit descriptor Class

   Obj. member   id           addr              len              delay                prr_id
   Contents      HW task ID   Bitfile Address   Bitfile Length   Reconfig. Overhead   PRR ID

4.2.2          Calling the HW Task Manager
Any attempt to dispatch, reconfigure, modify or disable HW tasks should be accomplished by the HW task manager. In other words, operations on HW tasks are isolated from other user applications. We employed this mechanism to ensure the security of the FPGA fabric. For user applications which cooperate with HW tasks, the only accessible memory space is the HW task data section, which is used for massive SW/HW data exchange.

As described in Section 3, the behavior of HW tasks is controlled by writing parameter values to their corresponding PRR configuration registers, for which the contexts of parameters are defined by user application and are not in

Figure 2: Execution of the HW Task Manager
[Block diagram. Labels: SW application EC issuing Syscall_HW_Manager(1, 0, arg01, arg02, arg03); Rescheduling() and Syscall_yield() between the SW App EC and the HW Manager EC across the kernel/user boundary; bitfile descriptor table with columns id, addr, length, delay, prr_id and an example row (1, A1, L1, D1, 1); prr_transferbitfile() sending Bitfile1 to Bitfile3 and HW task data through PCAP; arg01 to arg03 written over AXI4-Lite into the register groups Regs1 to Regs3 (Proc_status, PRR status, Int_status, PRR delay) of PRR1 to PRR3.]

A block diagram describing the execution of the HW task manager is shown in Fig. 2. As demonstrated, a specific system call from user space requires the kernel to launch the HW task manager. Arguments are passed through to the HW manager. The prototype of this specific system call is:

   Syscall_HW_Manager(HW_id, irq_en, arg01, arg02, arg03)

By handling this system call, the kernel invokes a reschedule process and returns to user space, passing control to the HW task manager. In this process, the arguments are also delivered to the HW manager. The HW manager compares HW_id with the executing HW task. If the task is already implemented in the PRR, then only the parameters are changed by writing the arguments to the register group; otherwise a PCAP transfer is configured to reload the target PRR with the desired HW task. The irq_en argument indicates whether the PL interrupt is enabled for the corresponding PRR by setting values in the PRR_Int_status register. After accomplishing the required operation, the HW task manager gives back control to the previously interrupted application.

In some cases, a PR request cannot be acknowledged immediately. As in the scenarios described in Section 3, a HW task may be in the middle of processing a data frame and not ready for reconfiguration. In such situations, to avoid monopolizing the CPU, the HW task manager is suspended and gives up its CPU usage to other SW applications. When the data frame is completely processed, the target PRR informs the HW task manager by triggering the IRQ_Reco_rdy interrupt, and the service is then relaunched to start the PCAP bitstream download.

One major drawback of the PR technique is its significant reconfiguration time overhead. To reduce its effect on performance, we abandon the polling-for-done mechanism. Instead, the completion of a PCAP transfer is not acknowledged to the HW task manager. Once the HW task manager launches the PCAP transfer, it gives up CPU control and waits for the next call. A HW task is set to automatically start an operation as soon as reconfiguration is done, so that the reconfiguration time overhead overlaps with CPU operations. SW applications can be synchronized with a HW task state through its general-purpose IRQ. This functionality is enabled by the PRR_Int_status register. For example, imagine a simple application with an image displayer SW
the concern of the HW task manager. All the information                             task that is using a HW Image filter accelerator. It will
required by the HW manager are the ID of HW task and                                fetch the target image and write the results back to memory
the arguments to be transferred to the register group.                              through AXI4 automatically. Once the image processing is
finished, the HW filter generates an IRQ to inform the displayer task that another
operation can be executed.

4.2.3 HW Task Manager API
   The driver API of DevCfg is supported by the Xilinx SDK tool, which deals with the
non-secure/secure PCAP transfer. Besides the DevCfg API, several additional functions were
developed to facilitate and simplify the HW management. The API supporting HW task
management is listed and described in Table 4.

             Table 4: HW Task Manager API
 API                                   Description
 XDcfg_Initialize();                   Instantiate DevCfg.
 AXI4_lite_Init();                     Instantiate master AXI4-Lite.
 fpga_start(); fpga_pause();           Pass signals to the whole FPGA fabric by writing
 fpga_interrupt_done()                 values to the corresponding PRR controller registers.
 check_current_ppr(HW_id)              Check the IDs of the currently implemented HW tasks
                                       to determine whether PR is necessary.
 check_reco_rdy(HW_id)                 Check if the target PRR is ready for PR; if not,
                                       use sys_yield() to quit the HW manager.
 prr_set_mode(HW_id, irq_en,           Set up PRR_Int_status and the register group
 arg01, arg02, arg03)                  of a specific PRR.
 prr_transferbitfile(HW_id)            Launch PCAP to transfer the target bitstream.
 prr_register_read(off, val)           Basic access method to all PRR controller
 prr_register_write(off, val)          registers through master AXI4-Lite.

4.3 SW Tasks Scheduling
   The scheduling strategy of SW tasks in Mini-NOVA is a priority-based round-robin
mechanism. The scheduler manages the execution sequence by manipulating ECs. Each EC
obtains its own priority level at its creation, which can be changed afterwards. Within
the same priority level, SW tasks share the CPU through round-robin scheduling. Among
different priority levels, high-priority tasks always preempt low-priority tasks, since
the scheduler always selects the highest-priority EC and dispatches the SW task attached
to it.
   Basically, all general SW tasks execute at the same priority level (1 by default).
However, to fulfill the timing constraints of specific requests such as real-time tasks
and PR requests, different priority levels are introduced. In this case, specific tasks
are given a higher priority so that they can be dispatched in time. Since our current
system mainly deals with HW management, only the HW task manager is discussed here.
   Fig. 3 presents the scheduling mechanism based on priority. At each priority level, ECs
are organized as a doubly linked queue, which is indexed by a list_prio[] structure.
list_prio[] is a list of EC pointers indexed by priority level: each list_prio[] element
points to the EC queue of a certain priority level. The run_queue is composed of the EC
queues of the different priority levels, and the prio_top signal identifies the highest
priority level in the current run_queue. When reschedule() is invoked, prio_top is used to
access the highest-priority-level EC queue by dispatching list_prio[prio_top]. Once
dispatched, the queue keeps executing until another reschedule() is invoked.
   As shown in Fig. 3, the EC of the HW task manager is registered in the microkernel at
its creation with a default priority level of 2. Initially, the HW manager is not included
in the run_queue, as in Fig. 3(a). When Syscall_HW_Manager() is executed, the microkernel
launches HW_Manager_Enqueue() to add the HW task manager to the run_queue, as shown in
Fig. 3(b). Then, the reschedule() function is launched to update the schedule and dispatch
the HW task manager as the highest-priority EC by selecting list_prio[2]. When the HW task
manager finishes its task or enters the pull-up state, HW_Manager_Dequeue() is called to
remove the EC of the HW task manager from the run_queue, as shown in Fig. 3(a), so that
low-priority SW tasks are permitted to execute. Through this strategy, the PR of a HW
accelerator is able to preempt other SW tasks, and a quick response for HW task management
is guaranteed.

[Figure 3 diagram labels: list_prio[1], list_prio[2], ..., list_prio[prio_top] indexing the
run_queue; Task1, Task2, Task3 queued at level 1; HW Manager queued at level 2 in (b);
HW_Manager_Enqueue() leads from (a) to (b), HW_Manager_Dequeue() from (b) to (a);
legend: Activated, Suspended.]
Figure 3: Microkernel Scheduling Mechanism. (a) prio_top=1; (b) prio_top=2

5. USE-CASE IMPLEMENTATION
   In order to test the SW/HW scheduling mechanism on the platform, a use-case application
based on a real scenario has been proposed. In this scenario, a mobile wireless terminal
is capable of dynamically changing its configuration in order to obtain the best level of
performance according to the channel conditions. For example, if the channel is very
noisy, the transmitter will use a simple but very efficient QAM modulation to the
detriment of the throughput. As soon as the channel conditions allow the throughput to be
increased, the mobile device may reconfigure itself to change its inner hardware modulator
and rapidly adapt to the environment.

5.1 Implementation Description
   In the proposed use-case scenario, the application is divided into two main software
tasks running on the processor and two additional hardware tasks running in the FPGA.
   The SW_ChannelSensor task performs a channel estimation in order to evaluate the
maximum level of performance to be obtained in terms of throughput and error rate. The
SW_HardwareManager is an instance of the HW task manager, as described in Section 4.
   Concerning the hardware parts, two sets of reconfigurable HW tasks have been
considered, which respectively deal with the modulation scheme and the IFFT used in the
OFDM context. The HW_Modulation task deals with the nature of the modulation to be
implemented, i.e. the constellation size. In this work, three constellation sizes have
been considered: 4-QAM, 16-QAM and 64-QAM. Regarding the second hardware task, HW_IFFT,
several configurations have also been implemented according to the number of points to
consider. In our application, a range of I-FFT sizes (from 256 points to 8192 points) was
implemented, depending on the channel bandwidth to be considered. All HW tasks execute in
their corresponding PRRs (PRR0 - PRR3).
   Since HW_Modulation and HW_IFFT execute in a pipeline, the reconfiguration of these HW
tasks will suspend the entire pipeline. To minimize this significant time overhead, we
propose a multiple-path structure. A block diagram depicts this
structure in Fig. 4. Both HW_Modulation and HW_IFFT consist of a pair of identical PRRs.
While the current PRR continues working, SW_ChannelSensor may alter the HW task by
reconfiguring the other PRR and activating the new datapath after reconfiguration. Thus,
the overhead caused by reconfiguration is reduced.

[Figure 4 diagram labels: AXI4, FIFO1, FIFO2, CrossBar; HW_Modulation (HW_QAM4, HW_QAM16,
HW_QAM64) in PRR0 (QAM1) and PRR1 (QAM2); HW_IFFT (HW_IFFT256 ... HW_IFFT8192) in PRR2
(IFFT1) and PRR3 (IFFT2); outputs I and Q.]
Figure 4: Use-Case Implementation

   With a data frame size of 18,800 bits and a 100 MHz FPGA clock frequency, a Gantt chart
of the proposed scenario on our platform is given in Fig. 5. The application begins with
the SW_ChannelSensor task deciding to change the hardware configuration because the
channel conditions are not suitable for the default configuration (a QAM4 modulation
scheme and a 256-point I-FFT). In this case, the task calls the SW_HardwareManager to
manage its request to switch the I-FFT mode to 512 points (t1 - t2). Since PRR3 is idle
and ready for reconfiguration, SW_HardwareManager launches the PCAP transfer to implement
HW_IFFT512 in PRR3 while the QAM4-IFFT256 pipeline continues computing (t2 - t5). After
the completion of the PCAP transfer (t5), the pipeline waits for the data frame currently
being processed to complete (t5 - t6) before HW_IFFT512 is activated at t6. The same
procedure is executed again when SW_ChannelSensor decides to switch from QAM4 modulation
to QAM16 (t7 - t12). Some attributes of the SW/HW tasks are listed in Table 5.

[Figure 5 chart rows: SW_HWManager, SW_ChannelSensor, Microkernel, PCAP Transfer,
PRR0 (HW_QAM4), PRR1 (HW_QAM16), PRR2 (HW_IFFT256), PRR3 (HW_IFFT512); time axis t with
marks t1 to t12; legend: SW/HW Execution, Reconfiguration, Pipeline suspension.
t1: syscall_HW_Manager()   t2: reschedule()   t3: prr_transferbitfile(); syscall_yield()
t4: reschedule()           t5: PCAP_Done      t6: Data frame over]
Figure 5: Gantt Chart of the Tasks' Execution

             Table 5: SW/HW Tasks' Attributes
 Task name            Type   Execution Time (ms)      Reconfig. Time (ms)   Resource Usage
 SW_ChannelSensor     SW     3                        no                    no
 SW_HW_Manager        SW     0.0096                   no                    no
 EC Switch            SW     0.00232                  no                    no
 HW_QAM (4/16/64)     HW     0.09-0.03 (1 frame)      0.231                 2%
 HW_IFFT (256-8192)   HW     0.006-0.168 (1 frame)    2.72                  13%

5.2 Discussion
   As shown in the Gantt chart, the major overhead of reconfiguration is fully
circumvented by running SW and HW tasks in parallel. For data processing, the only
overhead caused by the HW task switch is the delay required to finish processing a
complete data frame (worst case 0.168 ms, for an 8192-point I-FFT). Due to the simplified
kernel and scheduling mechanism, a quick response to PR requests is achieved (0.0119 ms).
We should note that the tremendous reconfiguration overhead of the I-FFT tasks results
from the massive, computing-intensive structure of the I-FFT blocks. Implemented with the
Xilinx PlanAhead tool, an I-FFT block consumes 5600 LUTs and 1600 SLICEs, which takes up
to 13% of the FPGA resources on chip. For static FPGA circuits, implementing multiple
I-FFT blocks with different numbers of points would cost considerable FPGA area, while on
our platform only 26% of the FPGA resources (2 I-FFT blocks) are used to hold multiple
I-FFT blocks. Thus the chip cost is significantly reduced.

6. CONCLUSION
   In this paper, we have presented a custom ARM-specific microkernel on a partially
reconfigurable FPGA platform. This approach makes it possible to dynamically manage
reconfigurable HW accelerators and SW tasks by means of a specific scheduling mechanism.
Efforts have been made to maximize the performance of the FPGA fabric and minimize the
overhead caused by partial reconfiguration. We are currently working on the virtualization
of guest OSes. By implementing different OSes on top of the microkernel, we intend to
establish a complete virtualizable embedded system.

7. REFERENCES
[1] D. Thomas, J. Coutinho, and W. Luk, “Reconfigurable computing: Productivity and
    performance,” in Asilomar Conference on Signals, Systems and Computers, pp. 685–689,
    2009.
[2] G. Heiser, “The role of virtualization in embedded systems,” in Proceedings of the 1st
    Workshop on Isolation and Integration in Embedded Systems, pp. 11–16, ACM, 2008.
[3] “UG585: Zynq-7000 All Programmable SoC Technical Reference Manual,” Xilinx Inc.,
    March 2013.
[4] U. Steinberg and B. Kauer, “NOVA: a microhypervisor based secure virtualization
    architecture,” in Proceedings of the 5th European Conference on Computer Systems,
    pp. 209–222, 2010.
[5] C. Claus, B. Zhang, W. Stechele, L. Braun, M. Hübner, and J. Becker, “A multi-platform
    controller allowing for maximum dynamic partial reconfiguration throughput,” in Field
    Programmable Logic and Applications, pp. 535–538, IEEE, September 2008.
[6] D. Göhringer, M. Hübner, E. N. Zeutebouo, and J. Becker, “Operating system for runtime
    reconfigurable multiprocessor systems,” International Journal of Reconfigurable
    Computing, vol. 2011, January 2011.
[7] J. C. Prevotet, A. Benkhelifa, B. Granado, et al., “A framework for the exploration of
    RTOS dedicated to the management of hardware reconfigurable resources,” in
    International Conference on Reconfigurable Computing and FPGAs, pp. 61–66, IEEE, 2008.
[8] K. D. Pham, A. K. Jain, J. Cui, et al., “Microkernel hypervisor for a hybrid ARM-FPGA
    platform,” in 24th International Conference on Application-Specific Systems,
    Architectures and Processors, pp. 219–226, IEEE, 2013.
[9] K. Vipin and S. A. Fahmy, “A high speed open source controller for FPGA partial
    reconfiguration,” in FPT, pp. 61–66, IEEE, 2012.
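
   As an illustration of the priority-based round-robin scheduling described in Section
4.3, the following C fragment gives a minimal sketch of a run queue built from per-level
circular, doubly linked EC queues. It is not the Mini-NOVA source code: the names
list_prio, prio_top and reschedule() follow the paper, whereas ec_t, ec_enqueue(),
ec_dequeue() and PRIO_LEVELS are hypothetical names introduced only for this example, and
level 0 is assumed to be unused.

```c
#include <assert.h>
#include <stddef.h>

#define PRIO_LEVELS 4   /* assumed number of levels; level 0 is left unused */

/* Execution context (EC): a node of a circular, doubly linked queue. */
typedef struct ec {
    const char *name;
    int prio;                 /* caller must keep 1 <= prio < PRIO_LEVELS */
    struct ec *prev, *next;
} ec_t;

static ec_t *list_prio[PRIO_LEVELS]; /* list_prio[p]: next EC to run at level p */
static int prio_top;                 /* highest non-empty level, 0 if none */

/* Append an EC at the tail of its priority queue and raise prio_top if needed. */
static void ec_enqueue(ec_t *ec)
{
    ec_t *head = list_prio[ec->prio];
    if (head == NULL) {
        ec->prev = ec->next = ec;
        list_prio[ec->prio] = ec;
    } else {
        ec->next = head;
        ec->prev = head->prev;
        head->prev->next = ec;
        head->prev = ec;
    }
    if (ec->prio > prio_top)
        prio_top = ec->prio;
}

/* Unlink an EC from its queue and lower prio_top past any empty levels. */
static void ec_dequeue(ec_t *ec)
{
    if (ec->next == ec) {
        list_prio[ec->prio] = NULL;
    } else {
        ec->prev->next = ec->next;
        ec->next->prev = ec->prev;
        if (list_prio[ec->prio] == ec)
            list_prio[ec->prio] = ec->next;
    }
    ec->prev = ec->next = NULL;
    while (prio_top > 0 && list_prio[prio_top] == NULL)
        prio_top--;
}

/* Select the EC to dispatch: the head of the highest-priority queue.
 * The head pointer is then advanced so that ECs of the same level
 * take turns (round-robin). Returns NULL if the run queue is empty. */
static ec_t *reschedule(void)
{
    ec_t *next = list_prio[prio_top];
    if (next != NULL)
        list_prio[prio_top] = next->next;
    return next;
}
```

   In this sketch, enqueuing an EC at level 2 while Task1 to Task3 sit at level 1 raises
prio_top to 2, so every call to reschedule() returns the level-2 EC until it is dequeued,
after which the level-1 ECs resume their rotation. This mirrors the transitions between
Fig. 3(a) and Fig. 3(b), under the assumptions stated above.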