Dataflow-based Adaptation Framework with Coarse-Grained Reconfigurable Accelerators*

Claudio Rubattu

University of Sassari, 07100 Sassari, Italy
claudio.rubattu@uniss.it
Univ Rennes, INSA Rennes, IETR, UMR CNRS 6164, 35708 Rennes, France
claudio.rubattu@insa-rennes.fr

Abstract. Today, the demand for adaptive systems is constantly growing, especially in hard-constrained contexts such as Cyber-Physical Systems. However, the efficient management of such platforms requires dealing with several issues such as real-time execution, energy saving and dynamic context changes. Such strict requirements imply a high flexibility of the application and of the architecture on which it is executed. Runtime managers offer the possibility to dynamically schedule and map an application on the available software processing units. However, hardware acceleration may also be necessary for computationally-intensive workloads that depend on the running functionality, additionally complicating runtime management. Coarse-Grained Reconfigurable (CGR) accelerators have the ability to switch among different domain-specific functionalities with a small overhead. To support energy and time adaptivity in heterogeneous systems, and to exploit multi-core architectures and CGR accelerators, this work proposes the combination of the SPIDER software runtime manager and the dataflow-to-hardware MDC design suite for CGR accelerators.

Keywords: Coarse-Grained Reconfiguration · MDC · SPIDER · FPGA · Dataflow MoC · HW/SW Co-design · Runtime Manager · Datapath Merging

1 Context and Motivation

In the last few years, besides the concepts of embedded and interconnected systems, Cyber-Physical Systems (CPS) have become known and studied with interest by the scientific community.
These systems are capable of monitoring and controlling physical elements and comprise heterogeneous components that interact with each other in different modalities depending on the context in which they operate. Design and maintenance of such systems are extremely complex because of their multidisciplinary nature, the elaborate requirements, the heterogeneity of their components and the continuous communication between physical and computing layers. Adaptation according to uncertain events and to changing functional and non-functional requirements is the most important challenge for the developers. In this context, the Cross-layer modEl-based fRamework for multi-oBjective dEsign of Reconfigurable systems in unceRtain hybRid envirOnments (CERBERO) H2020¹ European project aims at developing a design environment for CPS based on two main elements: i) a cross-layer and model-based approach to describe, optimize and analyze the system according to different views; and ii) an extended adaptivity of the computation to the system state as well as to its environment, provided by an autonomous reconfiguration engine. This work is mainly focused on the second CERBERO element, and, in particular, on the study of the ability of the system to dynamically reconfigure itself according to its state and to its environment.

* This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732105.

Compared to the approaches reviewed in Section 2, the proposed framework is focused on providing a combination of state-of-the-art functionalities within the CERBERO project. To do that, in Section 3, an open-source adaptive management system is proposed, portable over several embedded software systems and heterogeneous CGR hardware accelerators.
CGR units of this type are hardware blocks that accelerate a given set of application functionalities, switching among them without significant reconfiguration time overheads. Moreover, CGR accelerators can be implemented on both ASIC and FPGA platforms. In order to combine these Processing Elements (PEs) with multi-core architectures, an integration activity between the SPIDER² dataflow-based runtime manager (see Section 2.1) and the MDC³ dataflow-to-hardware suite (see Section 2.2) has been conducted. Both tools are based on the dataflow Model of Computation (MoC), which can be used to separate temporal and functional problems in hardware design. Moreover, the modularity of dataflow representations favors a natural splitting of the computation into different blocks, making it possible to automatically map them onto heterogeneous PEs. Thus, the proposed idea is to leverage a software-hardware design flow to develop and manage dataflow-based autonomous reconfigurable systems, as requested by the CERBERO model-based approach to system design. Challenges and steps related to the implementation of the integration framework are presented in Section 4.

¹ http://www.cerbero-h2020.eu
² https://github.com/preesm/spider
³ http://sites.unica.it/rpct/

2 Background

Several tools capable of handling runtime adaptivity have been proposed in the literature. Most approaches concentrate on software management and rely on existing software frameworks [1, 5–7, 9, 14, 15, 20, 21]. Hardware runtime adaptivity is managed in [18], in which data routing in a specific HoneyComb processor array hardware has been considered, and in [2], in which, besides the software runtime handling, hardware tasks can be implemented in FPGA devices. However, among the proposed frameworks only a few are open-source and effectively available [2, 5, 6, 15].
Among these, [5] and [15] are High Performance Computing (HPC) management systems that operate on top of large-scale facilities composed of multiple CPUs and GPUs. In [6], the framework does not consider hardware acceleration, which is today compulsory in most High Performance Embedded Computing (HPEC) systems, such as embedded video processing systems, embedded deep learning, telecommunication and computer vision systems [19]. Although the framework presented in [2] can handle hardware tasks, these are specific to Intel FPGA devices.

As mentioned in Section 1, the proposed framework combines features provided by state-of-the-art tools and consists of two existing tools: the SPIDER runtime manager and the MDC suite.

2.1 SPIDER

The Synchronous Parameterized and Interfaced Dataflow Embedded Runtime (SPIDER) is a runtime manager for applications described through the Parameterized and Interfaced Synchronous Dataflow (PiSDF) MoC and executed on heterogeneous multi-core architectures [8]. When compared to the Dataflow Process Network (DPN) MoC, PiSDF increases processing and communication predictability, serving as input information for multicore and multisystem partitioning, at the cost of some expressiveness, i.e. some unpredictable application behaviors cannot be modeled with PiSDF. The SPIDER runtime is currently available for ARM/Linux-based architectures, Intel x86 platforms, Keystone II architectures from Texas Instruments, and MPPA256 from Kalray.
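The predictability that PiSDF gains from fixed, parameterized token rates can be illustrated with the standard balance equations of synchronous dataflow: once every parameter is resolved, the number of firings of each actor per graph iteration is known before execution. The following is a minimal sketch of that computation and not SPIDER code; the three-actor graph, the actor names and the parameter N are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def _lcm(a, b):
    return a * b // gcd(a, b)


def repetition_vector(actors, edges):
    """Solve the balance equations q[src] * prod == q[dst] * cons.

    edges: list of (src, dst, production_rate, consumption_rate).
    Returns the smallest positive integer firing counts. Raises
    ValueError for inconsistent rates. Assumes a connected graph.
    """
    ratio = {actors[0]: Fraction(1)}  # firing ratios relative to the first actor
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, prod, cons = edge
            if src in ratio and dst in ratio:
                if ratio[src] * prod != ratio[dst] * cons:
                    raise ValueError("inconsistent rates")
            elif src in ratio:
                ratio[dst] = ratio[src] * prod / cons
            elif dst in ratio:
                ratio[src] = ratio[dst] * cons / prod
            else:
                continue  # neither endpoint reached yet, retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational ratios to the smallest integer solution.
    scale = reduce(_lcm, (r.denominator for r in ratio.values()), 1)
    counts = {a: int(r * scale) for a, r in ratio.items()}
    common = reduce(gcd, counts.values())
    return {a: c // common for a, c in counts.items()}


# Hypothetical graph: read -(N:2)-> filter -(1:N)-> write
N = 4  # assumed parameter value, resolved at runtime
edges = [("read", "filter", N, 2), ("filter", "write", 1, N)]
print(repetition_vector(["read", "filter", "write"], edges))
# prints {'read': 2, 'filter': 4, 'write': 1}
```

Changing N changes the result, but for any resolved value the firing counts, and therefore the amount of processing and communication per iteration, are fixed before the actors run.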
In order to ensure independence between application and platform levels, the SPIDER runtime structure consists of the following layers:

– the Application Layer: composed of dataflow actors and PiSDF specifications describing a stream processing application;
– the Runtime Layer: the core of the runtime manager, consisting of a master part called global runtime (GRT), which handles scheduling and mapping of the application, and slave elements named local runtimes (LRTs), which execute the actors according to the current schedule decided by the GRT;
– the Hardware Specific Layer: a platform-dependent component designed to manage inter-core communication and synchronization.

Figure 1 shows the execution scheme of SPIDER. The GRT (master) schedules the application (1) and sends the execution order based on the mapping decisions (2). The LRTs (slaves) execute the actors present in their dedicated job queues (3). Jobs are data structures containing the information about synchronization, data and code needed to properly perform one instance of an actor on a specific slave. An LRT can be implemented over general- or special-purpose processors, as well as accelerators. During code execution, LRTs exchange data tokens through a pool of data FIFOs (4). Once the processing of the actor is completed, the LRTs send new parameter values (if any) to the GRT (5). Indeed, a parameter value can be set dynamically by a configuration actor mapped on an LRT, influencing the algorithm execution. Moreover, the GRT also receives the execution traces (the actor start and end times based on the same timing reference) from the LRTs (6).

Fig. 1. The SPIDER runtime internal scheme.
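The six steps of the execution scheme described above can be mimicked with a small single-threaded simulation. This is only an illustrative sketch under simplifying assumptions: the class and function names (Job, LocalRuntime, global_runtime), the two example actors and the counter-based clock are hypothetical and do not reflect the actual SPIDER API, and the LRTs run one after the other instead of in parallel.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Job:
    actor: str          # actor instance to fire
    fn: Callable        # actor code: input tokens -> (output tokens, new params)
    inputs: List[str]   # names of the data FIFOs to read from
    outputs: List[str]  # names of the data FIFOs to write to


class LocalRuntime:
    """Hypothetical LRT: fires the jobs found in its own job queue."""

    def __init__(self, name, fifos):
        self.name = name
        self.fifos = fifos  # shared pool of data FIFOs
        self.jobs = deque()

    def run(self, clock):
        params, traces = {}, []
        while self.jobs:
            job = self.jobs.popleft()                               # (3) fire actor
            start = clock()
            tokens = [self.fifos[f].popleft() for f in job.inputs]  # (4) read tokens
            out, new_params = job.fn(*tokens)
            for fifo, token in zip(job.outputs, out):
                self.fifos[fifo].append(token)                      # (4) write tokens
            params.update(new_params)                               # (5) new parameters
            traces.append((job.actor, self.name, start, clock()))   # (6) timing trace
        return params, traces


def global_runtime(schedule, mapping, lrts, clock):
    """Hypothetical GRT: dispatches jobs in schedule order, gathers feedback."""
    params, traces = {}, []
    for job in schedule:                    # (1) order decided by the master
        lrt = lrts[mapping[job.actor]]
        lrt.jobs.append(job)                # (2) send the execution order
        new_params, new_traces = lrt.run(clock)
        params.update(new_params)
        traces.extend(new_traces)
    return params, traces


# Usage with two made-up actors mapped on two LRTs.
fifos = {"a": deque(), "b": deque()}
lrts = {"lrt0": LocalRuntime("lrt0", fifos), "lrt1": LocalRuntime("lrt1", fifos)}
ticks = count()
clock = lambda: next(ticks)                 # single shared timing reference
schedule = [
    Job("src", lambda: ([3], {"size": 3}), [], ["a"]),   # configuration-like actor
    Job("double", lambda x: ([2 * x], {}), ["a"], ["b"]),
]
params, traces = global_runtime(schedule, {"src": "lrt0", "double": "lrt1"}, lrts, clock)
print(params, list(fifos["b"]), traces)
```

In this sketch the master waits for each job before dispatching the next one, which keeps the example deterministic; a real runtime would let the LRTs consume their queues concurrently and synchronize through the FIFOs.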
2.2 MDC

The Multi-Dataflow Composer (MDC) is a toolset for the design and development of CGR systems based on the Dataflow Process Network (DPN) MoC [11]. This design suite is composed of two main sub-components:

– the Multi-Dataflow Generator (MDG): a dataflow-based model-to-model compiler that, given an input set of dataflow networks describing the functionalities to be executed in hardware, generates a high-level multi-dataflow specification of the system by leveraging datapath merging techniques [17];
– the Platform Composer (PC): a dataflow-to-hardware synthesizer that, given the mentioned multi-dataflow specification, the hardware description of the dataflow actors and the protocol used for communication among them, generates a CGR hardware accelerator.

MDC also offers other features that optimize the generated systems and favor their integration in real environments, such as:

– a structural profiler [12] that, taking into account the low-level feedback coming from a priori synthesis of the generated platform, is capable of identifying the optimal multi-dataflow configuration depending on a set of metrics (area, power, frequency);
– a power manager [3, 4] that automatically applies power saving strategies, such as clock- and power-gating, at system modelling level;
– a rapid prototyper [16] embedding the generated CGR systems into ready-to-use platform-dependent IPs (for Xilinx devices).

3 CGR Adaptation Framework

SPIDER and MDC show complementary characteristics that motivate their integration. SPIDER provides software scheduling and memory management at runtime for multi-core architectures. However, the processing elements supported by SPIDER do not include reconfigurable hardware blocks, and adaptivity is based on a priori knowledge of several metrics (latency, throughput and memory utilization) evaluated with respect to changes in software parameters.
On the contrary, MDC provides a model-to-model compiler capable of merging several dataflow applications, as well as a dataflow-to-hardware synthesizer that implements CGR systems. Moreover, MDC profiles CGR system configurations, providing different metrics (area, power, frequency), and includes a power manager that offers clock- and power-gating techniques.

Fig. 2. Adaptation scheme for heterogeneous hardware/software reconfigurable systems.

The proposed framework aims at exploiting the features of SPIDER and MDC in order to respectively manage software and hardware reconfigurability at runtime. Tool integration is also based on the use of dataflow models with similar properties in order to favor the estimation of metrics such as latency, throughput and energy. The main idea behind these metrics is to use CGR blocks as slave processing elements in the target system, and to re-schedule these processing elements from a host processor at runtime using models of the instantaneous hardware behavior.

The adaptation architecture is illustrated in Figure 2. An application graph, conforming to the PiSDF MoC, is dynamically scheduled by SPIDER. Depending on the scheduling, a hardware system composed of ARM cores and CGR accelerators (implemented in modern Xilinx or Intel SoC FPGAs) performs the computation. Software and hardware monitoring provides feedback to SPIDER about the current execution of the tasks. Monitoring will be provided through the integration of Papify⁴, an event-based performance monitoring tool [10], and MDC⁵. In addition, reconfiguration/re-scheduling can also be triggered by sensors in order to adapt the computing layer to environment changes or system needs.
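To make the feedback loop of the adaptation scheme more concrete, the sketch below shows one possible way a manager could combine per-PE estimates with monitoring feedback to re-map an actor between an ARM core and a CGR accelerator. It is an assumption-laden illustration and not the decision strategy of SPIDER or MDC: the AdaptationEngine class, the exponential smoothing factor, the PE names and all latency and energy figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    latency_ms: float  # modeled latency of one actor firing
    energy_mj: float   # modeled energy of one actor firing


class AdaptationEngine:
    """Hypothetical engine: keeps (actor, PE) estimates and picks a mapping."""

    def __init__(self, model, alpha=0.5):
        self.model = model  # {(actor, pe): Estimate}
        self.alpha = alpha  # weight given to the newest measurement

    def observe(self, actor, pe, latency_ms, energy_mj):
        # Monitoring feedback: blend the measurement into the current estimate.
        est = self.model[(actor, pe)]
        a = self.alpha
        est.latency_ms = (1 - a) * est.latency_ms + a * latency_ms
        est.energy_mj = (1 - a) * est.energy_mj + a * energy_mj

    def choose(self, actor, deadline_ms):
        # Lowest-energy PE that meets the deadline, else the fastest PE.
        candidates = [(pe, est) for (act, pe), est in self.model.items() if act == actor]
        feasible = [(pe, est) for pe, est in candidates if est.latency_ms <= deadline_ms]
        if feasible:
            return min(feasible, key=lambda c: c[1].energy_mj)[0]
        return min(candidates, key=lambda c: c[1].latency_ms)[0]


# Made-up figures for a single actor that can run on an ARM core or a CGR block.
engine = AdaptationEngine({
    ("fir", "arm0"): Estimate(latency_ms=8.0, energy_mj=2.0),
    ("fir", "cgr"): Estimate(latency_ms=2.0, energy_mj=5.0),
})
before = engine.choose("fir", deadline_ms=10.0)   # ARM core meets the deadline
engine.observe("fir", "arm0", latency_ms=16.0, energy_mj=2.0)  # core slowed down
after = engine.choose("fir", deadline_ms=10.0)    # re-map to the CGR accelerator
print(before, after)  # prints arm0 cgr
```

The same selection could be triggered by a sensor event that tightens the deadline instead of by a monitoring update; only the arguments passed to choose would change.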
4 Open Issues and Research Plan

The challenges that this research plan intends to address are summarized as follows:

i) management of hardware/software adaptivity in systems for which low energy and time overheads are required;
ii) performance predictability related to the different configurations of the system;
iii) hardware/software fault robustness in architectures composed of heterogeneous processing elements, such as ARM cores and CGR accelerators.

In order to guarantee fast reconfiguration and the correct execution of the application, CGR acceleration and hardware/software monitoring will be considered respectively. In addition, a strategy based on modeling of the architecture and of the chosen applications will be used to achieve energy and latency predictability, similarly to what is proposed in [13].

This research plan is part of the activities of the H2020 CERBERO European project, which will be assessed in three different use cases: planetary exploration (led by Thales Alenia Space), smart marine vehicles for ocean monitoring (led by AmbieSense), and a smart travelling system for electric vehicles (led by TNO in cooperation with Centro Ricerche Fiat). The proposed flow will be used to test applications deriving from the scenarios of the Thales Alenia Space and AmbieSense use cases. To build the proposed flow, the following three steps have been envisioned:

– Step 1. Integrate MDC and SPIDER by combining software and hardware adaptation according to variable application parameters;
– Step 2. Verify this approach with respect to relevant CERBERO key performance indicators (such as latency, throughput and energy);
– Step 3. Derive a proof of concept of the proposed approach in the context of CERBERO project use case scenarios.

The work done so far has been focused on the state-of-the-art adaptation tools and open issues, the study of the tools identified for the integration, and the strategies to address the problems described above.
Further developments of the proposed flow will be necessary to improve the decision-making strategy that SPIDER considers for the adaptability of the system. The runtime manager inputs should embed a more detailed description of the application to be executed and the model of the architecture on which it has to be performed. Moreover, besides managing the scenarios known during the development phase, it would be necessary to also take into account statistical data collected from experience over a certain period of time, in order to cope with events that were not foreseen at design time. This information should be based on the user interrupts, the external environment and the hardware and software monitors.

⁴ https://github.com/Papify
⁵ This activity is planned within the CERBERO project and is currently ongoing.

References

1. Assayad, I., Girault, A.: Adaptive Mapping for Multiple Applications on Parallel Architectures. In: Sabir, E., García Armada, A., Ghogho, M., Debbah, M. (eds.) Ubiquitous Networking. pp. 584–595 (2017)
2. Bragg, G.M., Leech, C.R., Balsamo, D., Davis, J.J., Wachter, E.W., Merrett, G., Constantinides, G.A., Al-Hashimi, B.: An application- and platform-agnostic control and monitoring framework for multicore systems. In: 3rd International Conference on Pervasive and Embedded Computing (2018)
3. Fanni, T., Sau, C., Meloni, P., Raffo, L., Palumbo, F.: Power and clock gating modelling in coarse grained reconfigurable systems. In: Proceedings of the ACM International Conference on Computing Frontiers, CF’16. pp. 384–391 (2016)
4. Fanni, T., Sau, C., Raffo, L., Palumbo, F.: Automated Power Gating Methodology for Dataflow-based Reconfigurable Systems. In: Proceedings of the 12th ACM International Conference on Computing Frontiers, CF’15. pp. 61:1–61:6 (2015)
5. Gautier, T., Lima, J.V.F., Maillard, N., Raffin, B.: XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures.
In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. pp. 1299–1308 (2013)
6. Gerostathopoulos, I., Bures, T., Hnetynka, P., Keznikl, J., Kit, M., Plasil, F., Plouzeau, N.: Self-adaptation in software-intensive cyberphysical systems: From system goals to architecture configurations. Journal of Systems and Software 122, 378–397 (2016)
7. Han, M., Park, J., Baek, W.: CHRT: A criticality- and heterogeneity-aware runtime system for task-parallel applications. In: Design, Automation Test in Europe Conference Exhibition (DATE), 2017. pp. 942–945 (2017)
8. Heulot, J., Pelcat, M., Desnos, K., Nezan, J.F., Aridhi, S.: SPIDER: A Synchronous Parameterized and Interfaced Dataflow-based RTOS for multicore DSPS. In: 2014 6th European Embedded Design in Education and Research Conference (EDERC). pp. 167–171 (2014)
9. Hormati, A.H., Choi, Y., Kudlur, M., Rabbah, R., Mudge, T., Mahlke, S.: Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures. In: 2009 18th International Conference on Parallel Architectures and Compilation Techniques. pp. 214–223 (2009)
10. Madroñal, D., Morvan, A., Lazcano, R., Salvador, R., Desnos, K., Juárez, E., Sanz, C.: Automatic Instrumentation of Dataflow Applications using PAPI. In: Proceedings of the 15th ACM International Conference on Computing Frontiers, CF’18. pp. 232–235 (2018)
11. Palumbo, F., Fanni, T., Sau, C., Meloni, P.: Power-Awarness in Coarse-Grained Reconfigurable Multi-Functional Architectures: A Dataflow Based Strategy. J. Signal Process. Syst. 87(1), 81–106 (2017)
12. Palumbo, F., Sau, C., Raffo, L.: Coarse-grained reconfiguration: dataflow-based power management. IET Computers & Digital Techniques 9(1), 36–48 (2015)
13. Pelcat, M., Mercat, A., Desnos, K., Maggiani, L., Liu, Y., Heulot, J., Nezan, J., Hamidouche, W., Ménard, D., Bhattacharyya, S.S.: Reproducible Evaluation of System Efficiency with a Model of Architecture: From Theory to Practice.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems pp. 1–14 (2018)
14. Quan, W., Pimentel, A.D.: A hierarchical run-time adaptive resource allocation framework for large-scale MPSoC systems. Design Automation for Embedded Systems 20(4), 311–339 (2016)
15. Robson, M.P., Buch, R., Kale, L.V.: Runtime Coordinated Heterogeneous Tasks in Charm++. In: 2016 Second International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). pp. 40–43 (2016)
16. Sau, C., Fanni, L., Meloni, P., Raffo, L., Palumbo, F.: Reconfigurable coprocessors synthesis in the MPEG-RVC domain. In: International Conference on ReConFigurable Computing and FPGAs, ReConFig 2015. pp. 1–8 (2015)
17. Souza, C.C.d., Lima, A.M., Araujo, G., Moreano, N.B.: The Datapath Merging Problem in Reconfigurable Systems: Complexity, Dual Bounds and Heuristic Evaluation. J. Exp. Algorithmics 10 (2005)
18. Thomas, A., Becker, J.: Dynamic Adaptive Runtime Routing Techniques in Multigrain Reconfigurable Hardware Architectures. In: Becker, J., Platzner, M., Vernalde, S. (eds.) Field Programmable Logic and Application. pp. 115–124 (2004)
19. Wolf, M.: High-Performance Embedded Computing, Second Edition: Applications in Cyber-Physical Systems and Mobile Computing. 2nd edn. (2014)
20. Yun, J., Park, J., Baek, W.: HARS: A heterogeneity-aware runtime system for self-adaptive multithreaded applications. In: 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC). pp. 1–6 (2015)
21. Zhang, F., Cao, J., Khan, S.U., Li, K., Hwang, K.: A task-level adaptive MapReduce framework for real-time streaming data in healthcare applications. Future Generation Computer Systems 43-44, 149–160 (2015)