=Paper=
{{Paper
|id=Vol-3145/short11
|storemode=property
|title=A Prototyping and Evaluation Framework for Research on Timing-analysable Memory Hierarchies for Embedded Multicore SoCs
|pdfUrl=https://ceur-ws.org/Vol-3145/short11.pdf
|volume=Vol-3145
|authors=Florian Haas,Sebastian Altmeyer
|dblpUrl=https://dblp.org/rec/conf/cerciras/HaasA21
}}
==A Prototyping and Evaluation Framework for Research on Timing-analysable Memory Hierarchies for Embedded Multicore SoCs==
<pdf width="1500px">https://ceur-ws.org/Vol-3145/short11.pdf</pdf>
<pre>
A Prototyping and Evaluation Framework for
Research on Timing-analysable Memory Hierarchies
for Embedded Multicore SoCs
Florian Haas, Sebastian Altmeyer
University of Augsburg, Germany


Abstract
Research on memory hierarchies with regard to the non-functional requirements of embedded multicore
systems demands a framework that supports the prototyping and evaluation of new methods. In current
multicore processors, accesses to shared resources by arbitrary tasks lead to interference, which can
result in timing violations of high-priority tasks. However, incorporating all potential interference
in the schedulability analysis leads to an enormous overestimation of the task execution times, and
requires a full analysis of all tasks running on the system. Enhancements in the memory hierarchy
can provide isolation to restrict potential interference, thus improving the worst-case performance.
To research modifications in the memory hierarchy of a multicore processor, a prototyping and
evaluation framework is required. This paper describes the design of such a framework, and outlines
the individual parts and their interconnections.

Keywords
parallel real-time system, memory hierarchy, FPGA prototyping framework


1. Introduction
The performance of multicore processors is strongly desired in various domains of embedded
systems to satisfy the increasing demand for computational power. Complex algorithms and
software systems, e.g. in autonomous driving, benefit from high-performance general-purpose
shared-memory multicores. However, these processors do not meet the typical real-time and
safety requirements, and thus cannot be used without performance-degrading and laborious
software mechanisms. Elaborate techniques have been developed to further improve the
average-case performance of such processors, for example the increasing depth of the memory
hierarchy. These techniques and the shared resources, like last-level caches, buses, and main
memory, make the calculation of tight worst-case execution time (WCET) bounds for the tasks in a
time-critical system a major challenge.

CERCIRAS WS01: 1st Workshop on Connecting Education and Research Communities for an Innovative Resource
Aware Society
Email: haas@es-augsburg.de (F. Haas); altmeyer@es-augsburg.de (S. Altmeyer)
URL: https://es-augsburg.de/haas (F. Haas); https://es-augsburg.de/altmeyer (S. Altmeyer)
ORCID: 0000-0002-4714-2562 (F. Haas); 0000-0002-2487-7144 (S. Altmeyer)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   The crucial problem is the lack of guaranteed freedom from interference between tasks that run
on separate cores. Thus, an arbitrary low-priority task is able to influence the timing behaviour
of another, potentially high-priority task on a different core. This can happen through accesses
to shared resources, for example shared caches or the main memory [1]. As a consequence, a
schedulability analysis of the overall system with only minimal overestimation becomes nearly
impossible for more than a few cores and deeper memory hierarchies.
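
To illustrate why accounting for every potential interference inflates the bound, the following
minimal sketch assumes a shared bus with round-robin arbitration, where each access of the task
under analysis can be delayed by at most one access from every other core. The arbiter model, the
function name, and all numbers are illustrative assumptions and are not taken from the paper.

```python
def interference_aware_wcet(isolated_wcet, accesses, access_latency, cores):
    """Upper bound on the execution time (in cycles) on a round-robin shared bus.

    Assumption: in the worst case, every bus access of the task waits for one
    access of each of the other (cores - 1) cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    worst_case_delay = accesses * (cores - 1) * access_latency
    return isolated_wcet + worst_case_delay


if __name__ == "__main__":
    # Hypothetical task: 100,000 cycles in isolation, 2,000 bus accesses of 20 cycles each.
    for cores in (1, 2, 4, 8):
        bound = interference_aware_wcet(100_000, 2_000, 20, cores)
        print(f"{cores} cores: bound = {bound} cycles")
```

Under these assumptions the bound grows linearly with the number of cores, even if the other
cores rarely access the bus in practice, which is the overestimation described above.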
   The general objective of research on this topic is to facilitate predictable performance, with
minimal overestimation of timing bounds, by reducing the sources of potential interference on
shared resources. Existing software-based approaches, e.g. performance counter monitors [2]
or program modification during compilation, are limited, as they can either only detect excessive
interference, or have to be applied to all tasks of the system. Thus, hardware mechanisms
promise a better lever to control the behaviour of any task on the system. However, to research
hardware-implemented methods, a proper evaluation platform is required. For example, a
hardware implementation of a memory bandwidth reservation mechanism like MemGuard [3]
could be evaluated and compared with other approaches. To research potential improvements
of shared resource accesses under timing constraints, a realistic model of a typical memory
hierarchy is needed in the first place. Microarchitecture simulators with multicore configurations
exist, but their processor-centric design does not support a prototype implementation and a
realistic evaluation. Further, the evaluation system needs to be capable of executing realistic
benchmarks, both for prototyping different ideas and for a thorough evaluation of their impact
on the performance.
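
The idea behind a bandwidth reservation mechanism can be sketched as a per-core access budget
that is replenished at the start of every regulation period. The model below is a strongly
simplified illustration in the spirit of such mechanisms. It is neither the MemGuard
implementation nor a hardware design, and the class name, budgets, and period length are
hypothetical.

```python
class BandwidthRegulator:
    """Per-core memory access budgets, replenished every regulation period."""

    def __init__(self, budgets, period):
        self.budgets = dict(budgets)      # core id -> allowed accesses per period
        self.period = period              # length of a regulation period in cycles
        self.remaining = dict(budgets)    # budget left in the current period
        self.current_period = 0

    def request(self, core, cycle):
        """Return True if the access is granted, False if the core is throttled.

        Assumption: cycles are passed in non-decreasing order.
        """
        period_index = cycle // self.period
        if period_index != self.current_period:
            # A new period has started: replenish all budgets.
            self.current_period = period_index
            self.remaining = dict(self.budgets)
        if self.remaining[core] > 0:
            self.remaining[core] -= 1
            return True
        return False


if __name__ == "__main__":
    regulator = BandwidthRegulator({0: 3, 1: 1}, period=10)
    for cycle in (0, 1, 2, 10):
        print(cycle, regulator.request(1, cycle))
```

In this sketch, core 1 is granted one access per period and throttled afterwards, so the
interference it can cause on other cores within a period is bounded by its budget.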
   Previous work focused mostly on fault tolerance of parallel systems, but the research always
involved shared-memory systems. Different systems have been used to evaluate the proposed
methods, from software-only approaches on typical desktop and server hardware, over the
Gem5 simulator, to FPGA prototypes. As a side effect of the conducted implementations and
evaluations, some experience with diverse platforms has been collected. The work on a
software-only fault tolerance mechanism [4, 5] showed the numerous restrictions of an unmodifiable
hardware implementation. To overcome these limitations, later research was undertaken on the
Gem5 microarchitecture simulator, where a customized hardware transactional memory was
built into the memory hierarchy [6, 7]. However, since the simulator focuses on the detailed
simulation of the processor cores themselves, it provides only a rather functional memory hierarchy
with limited timing accuracy. Switching to an FPGA prototype with multiple MicroBlaze
softcores [8] showed the difficulties of integrating hardware and software parts with non-open
processor cores. Overall, these experiences affirm the demand for an open system to prototype
and evaluate memory hierarchies for future research ideas.
   This paper describes a prototyping and evaluation framework for embedded multicore
systems, and outlines the assembly of the individual parts into a synthesizable design for both
simulation and prototyping on an FPGA. The framework is based on Chipyard, which supports
design and evaluation of full-system hardware, using the Rocket Chip generator and its in-order
RISC-V CPUs. The main benefit of Chipyard is the configurability and customizability of the
involved modules. The interconnects could also be replaced with a NoC to research manycore
systems, or a combination of both with shared-memory clusters connected through a NoC.
Based on the proposed framework, the research on elements of the memory hierarchy will be
facilitated to improve the applicability of multicore processors in embedded systems.


2. Requirements for Research on Timing Predictable Shared-memory Multicore Systems
To approach the objective of calculating tight WCET bounds for time-sensitive tasks in shared-
memory multicore systems, the potential interferences on shared resources have to be identified
and measured first. While such evaluations can be performed on existing hardware, potential
new methods to prevent or restrict interferences require customisable hardware components.
  A system that enables the modification and enhancement of individual elements in the
memory hierarchy should fulfil the following requirements:

    • Customisable hardware to extend or modify elements of the memory hierarchy
    • Measurement of the overall performance and counting of individual accesses to shared
      resources
    • Independence from the CPU architecture
    • Scalable number of processor cores
    • Hardware cost estimation of extensions and customisations
    • Fast feedback on the functional correctness of the implementation
    • Fast and approximate evaluation of the simulated model
    • Accurate full-system evaluation on an FPGA

These requirements are satisfied by our proposed framework, for which the Chipyard project
provides a promising foundation. It is a natural choice, since it is built around the open
RISC-V ecosystem, and allows individual elements of the memory hierarchy to be customized or
replaced. It further supports simulation and FPGA synthesis from the same code.


3. Overview of the Framework
The evaluation framework builds upon existing open-source projects that have been developed
in recent years around the prevalent RISC-V architecture.

3.1. Chipyard
Chipyard [9] simplifies the process of designing full-system hardware by integrating all
necessary parts, from CPU cores to supplementary logic that connects the devices of an FPGA
evaluation board. Fig. 1 illustrates the individual parts of Chipyard: Processor cores can be
created, for example, with the Rocket Chip Generator, which generates configurable and
customizable cores that implement the RISC-V instruction set, either in-order Rocket cores, or
the more complex and powerful out-of-order BOOM cores. Besides the L1 caches provided by the
Rocket Chip Generator, second-level caches and different kinds of interconnecting buses can be
generated. There is also code provided to connect to and communicate with peripheral devices
like UART and JTAG.
  The generated Verilog code can be further compiled with Verilator for a simulation of the
overall system, or with FireSim, which additionally allows simulating DDR3 main memory.
Alternatively, individual FPGA shells wrap the code with a harness to connect the units of the
SoC to I/O pins of a concrete FPGA, to build a prototype running on an FPGA evaluation board.
Such a prototype is able to communicate with the built-in peripheral devices like UART and JTAG,
as well as the off-chip DDR memory.

[Figure 1 (diagram): Chipyard, comprising the Rocket Chip Generator and peripherals, feeds either
FPGA shells, yielding an FPGA prototype, or Verilator / FireSim, yielding a simulation.]
Figure 1: Overview of generators of Chipyard [9]. The resulting code can be synthesised for an FPGA
or simulated.

3.2. Rocket Chip Generator
The Rocket Chip Generator [10] produces designs of an SoC with multiple processor cores, a
memory hierarchy, and interconnects. Fig. 2 depicts a generated chip with four processor tiles,
each consisting of an in-order Rocket RISC-V core and L1 instruction and data caches, L2 cache
banks with the memory bus, and additional buses for peripheral devices, DMA devices, and control
units like the boot ROM and interrupt controllers. All processor tiles and all individual buses
are connected through a shared system bus, which is typically implemented as a crossbar, but
can also be configured as a ring bus.

3.3. Memory Hierarchy Evaluation Framework
A common objective of research on memory hierarchies for real-time systems is to reduce
interferences on shared resources. From this, the main elements of the system under evaluation
are derived: All units that control access to shared resources, like the peripheral bus, or the L2
cache, are of interest, as well as the private L1 caches that are connected to the shared system
bus. In Fig. 3, these elements are shown below the processor cores, which are not of special
interest for interference analysis. All accesses to shared resources that originate in the cores
have to pass through the L1 instruction or data caches, which can control the communication.
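
The selection of the elements of interest can be sketched as a reachability check: a component
is a shared resource if more than one core can reach it. The two-core topology and all component
names below are hypothetical and only loosely follow the structure described above.

```python
from collections import defaultdict

# Hypothetical two-core topology; edges point from a requester to the component it accesses.
TOPOLOGY = {
    "core0": ["l1i0", "l1d0"],
    "core1": ["l1i1", "l1d1"],
    "l1i0": ["system_bus"], "l1d0": ["system_bus"],
    "l1i1": ["system_bus"], "l1d1": ["system_bus"],
    "system_bus": ["l2_cache", "periph_bus"],
    "l2_cache": [], "periph_bus": [],
}


def reachable(topology, start):
    """Return all components that can be reached from `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for successor in topology.get(node, []):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def shared_resources(topology, cores):
    """Return the sorted names of components reachable from more than one core."""
    users = defaultdict(set)
    for core in cores:
        for component in reachable(topology, core):
            users[component].add(core)
    return sorted(name for name, used_by in users.items() if len(used_by) > 1)


if __name__ == "__main__":
    print(shared_resources(TOPOLOGY, ["core0", "core1"]))
```

The private L1 caches are not reported as shared in this model, but they are the last private
stage before the system bus, which is why they are a natural place to control the communication.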
[Figure 2 (diagram): four Rocket tiles, each with a core and L1I/L1D caches, attached to the
system bus; below the system bus, two L2 banks lead to the memory bus, the peripheral and control
buses serve the boot ROM, interrupts, etc., and the front bus serves DMA.]
Figure 2: The Rocket Chip [10] consists of multiple processor tiles, and devices connected through
dedicated buses, like memory and peripherals. All parts of the chip communicate through the system
bus.

   The prototyping flow from implementing a design of one or more specific parts of the memory
hierarchy to code generation and simulation or evaluation is depicted in Fig. 4. Unit tests can
provide fast checks of the functional correctness of the implemented or modified mechanisms.
After passing these tests, Verilog code is generated, which can be simulated with Verilator
to test the design with a set of benchmarks. The simulation provides fast feedback on the
behaviour of the system, to compare different potential implementations before running the
full evaluation of the synthesised bitstream on the FPGA. The evaluation of the design on the
FPGA provides accurate timing measurements of the individual tasks, and a trace log of accesses
to shared resources. These results make it possible to quantify the improvements of the
implemented memory hierarchy modifications, and enable the detection of timing violations or
forbidden interferences.

[Figure 3 (diagram): two cores, each with L1I and L1D caches, connected through the system bus to
the L2 cache and the peripheral bus.]
Figure 3: Elements of interest to evaluate interferences in the memory hierarchy: private L1 caches,
shared L2 caches, buses that connect shared resources, and the shared system bus itself.

[Figure 4 (diagram): design (Chisel) and unit tests (Chisel) lead to generation (Verilog); the
Verilog is simulated (Verilator) with benchmarks, with results fed back to the design, and
synthesised (bitstream) for evaluation (FPGA), which yields timing results and a resource access log.]
Figure 4: Overview of the prototyping flow with the evaluation framework. Results of the unit tests
provide immediate feedback, which can be further tested in the Verilator simulation. The evaluation of
the bitstream on the FPGA provides accurate timing results and logs of the resource accesses.

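
How a trace log of shared-resource accesses could be checked for forbidden interferences is
sketched below. The log format, a list of (cycle, core, resource) tuples, and the isolation
policy, a mapping from each resource to the cores permitted to use it, are assumptions made for
illustration. The paper does not define a log format.

```python
from collections import Counter


def analyse_trace(trace, allowed):
    """Count accesses per (core, resource) and collect accesses that violate `allowed`.

    trace:   iterable of (cycle, core, resource) tuples (hypothetical log format)
    allowed: dict mapping each resource to the set of cores permitted to access it;
             accesses to resources missing from the dict are reported as violations
    """
    counts = Counter()
    violations = []
    for cycle, core, resource in trace:
        counts[(core, resource)] += 1
        if core not in allowed.get(resource, set()):
            violations.append((cycle, core, resource))
    return counts, violations


if __name__ == "__main__":
    # Hypothetical policy: each core owns one L2 bank.
    policy = {"l2_bank0": {0}, "l2_bank1": {1}}
    log = [(10, 0, "l2_bank0"), (12, 1, "l2_bank1"), (15, 1, "l2_bank0"), (20, 0, "l2_bank0")]
    counts, violations = analyse_trace(log, policy)
    print(dict(counts))
    print(violations)
```

The per-core counts correspond to the access measurements, while the list of violations points
to the cycles at which an isolation mechanism under evaluation failed to prevent an interference.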
   The possibility to connect a debugger to the simulation, as well as to the system on the
FPGA, facilitates the detection of implementation faults, and provides detailed insight into the
behaviour of the system under specific circumstances when needed.
   With the feedback loop between the design and the simulation, available computational
capabilities can be leveraged to compare numerous different design variations, to select a few
designs of interest for the full evaluation on the FPGA.
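
The selection step of this feedback loop can be sketched as a ranking of simulated design
variants. In the example below, the variant names, benchmark names, and latency values are
invented, and ranking by the worst observed latency is only one possible criterion.

```python
def select_for_fpga(sim_results, keep):
    """Return the `keep` variants with the lowest worst-case simulated latency.

    sim_results: dict mapping a variant name to a dict of benchmark name -> latency in cycles
    """
    worst = {name: max(latencies.values()) for name, latencies in sim_results.items()}
    # Sort by worst-case latency; break ties by name to keep the result deterministic.
    ranked = sorted(worst, key=lambda name: (worst[name], name))
    return ranked[:keep]


if __name__ == "__main__":
    # Hypothetical simulation results for three design variants and two benchmarks.
    results = {
        "baseline": {"bench_a": 480, "bench_b": 510},
        "tdm_bus": {"bench_a": 300, "bench_b": 295},
        "partitioned_l2": {"bench_a": 310, "bench_b": 305},
    }
    print(select_for_fpga(results, keep=2))
```

Only the selected variants would then go through synthesis, which is the most time-consuming
step of the flow.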


4. Conclusion & Future Work
This paper described the design of a prototyping and evaluation framework for research on
memory hierarchies, as a step towards the overall objective of enabling high-performance
multicore processors in embedded real-time systems. The framework is built upon existing
open-source projects around the RISC-V architecture, connecting the different tools together. It
integrates all the required steps to automatically generate the Verilog code, compile and run the
simulation, synthesise the bitstream and program the FPGA with it, and run the evaluation.
   The next step is to implement the basic tool chain for automatic unit tests, code generation,
simulation, and synthesis. Afterwards, measurement facilities in the individual components
of the memory hierarchy have to be added to evaluate the behaviour of the system under
parallel workloads. Such workloads first have to be identified based on use-cases from different
industries, and reconstructed by a set of different benchmarks.
   Based upon the proposed framework, research on new approaches for controlling
interferences on shared resources within shared-memory multicores can take off.


Acknowledgments
This work is partially supported by the CERCIRAS COST Action no. CA19135 funded by COST.


References
 [1] C. Maiza, H. Rihani, J. M. Rivas, J. Goossens, S. Altmeyer, R. I. Davis, A Survey of Timing
     Verification Techniques for Multi-Core Real-Time Systems, ACM Comput. Surv. 52 (2019).
     doi:10.1145/3323212.
 [2] J. Freitag, S. Uhrig, T. Ungerer, Virtual timing isolation for mixed-criticality systems, in:
     Euromicro Conference on Real-Time Systems (ECRTS), 2018, pp. 13:1–13:23.
     doi:10.4230/LIPIcs.ECRTS.2018.13.
 [3] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, L. Sha, MemGuard: Memory Bandwidth
     Reservation System for Efficient Performance Isolation in Multi-core Platforms, in: Real-
     Time and Embedded Technology and Applications Symposium (RTAS), 2013, pp. 55–64.
     doi:10.1109/RTAS.2013.6531079.
 [4] F. Haas, S. Weis, T. Ungerer, G. Pokam, Y. Wu, Fault-Tolerant Execution on COTS Multi-core
     Processors with Hardware Transactional Memory Support, in: Architecture of Computing
     Systems (ARCS), 2017, pp. 16–30. doi:10.1007/978-3-319-54999-6_2.
 [5] F. Haas, Fault-tolerant Execution of Parallel Applications on x86 Multi-core Processors
     with Hardware Transactional Memory, PhD thesis, Universität Augsburg, 2019. URL:
     https://opus.bibliothek.uni-augsburg.de/opus4/59566.
 [6] R. Amslinger, S. Weis, C. Piatka, F. Haas, T. Ungerer, Redundant Execution on
     Heterogeneous Multi-cores Utilizing Transactional Memory, in: Architecture of Computing
     Systems (ARCS), 2018, pp. 155–167. doi:10.1007/978-3-319-77610-1_12.
 [7] C. Piatka, R. Amslinger, F. Haas, S. Weis, S. Altmeyer, T. Ungerer, Investigating transactional
     memory for high performance embedded systems, in: Architecture of Computing Systems
     (ARCS), 2020, pp. 97–108. doi:10.1007/978-3-030-52794-5_8.
 [8] R. Amslinger, C. Piatka, F. Haas, S. Weis, T. Ungerer, S. Altmeyer, Hardware multiversioning
     for fail-operational multithreaded applications, in: International Symposium on Computer
     Architecture and High Performance Computing (SBAC-PAD), 2020, pp. 20–27.
     doi:10.1109/SBAC-PAD49847.2020.00014.
 [9] A. Amid, D. Biancolin, A. Gonzalez, D. Grubb, S. Karandikar, H. Liew, A. Magyar, H. Mao,
     A. Ou, N. Pemberton, P. Rigge, C. Schmidt, J. Wright, J. Zhao, Y. S. Shao, K. Asanović,
     B. Nikolić, Chipyard: Integrated Design, Simulation, and Implementation Framework for
     Custom SoCs, IEEE Micro 40 (2020) 10–21. doi:10.1109/MM.2020.2996616.
[10] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt,
     J. Hauser, A. Izraelevitz, et al., The Rocket Chip Generator, Technical Report UCB/EECS-
     2016-17, EECS Department, University of California, Berkeley, 2016.

</pre>