Caching in a Mixed-Criticality 5G Radio Base Station

Emad Jacob Maroun1,*, Luca Pezzarossa1 and Martin Schoeberl1
1 Technical University of Denmark, Department of Applied Mathematics and Computer Science

Abstract
Telecommunication is a critical driver of economic and social development. 5G technologies are state-of-the-art in telecommunication, setting strong and open-ended requirements for implementing systems. Current systems implementing baseband technologies in 5G depend on hardware separation to ensure high- and low-criticality tasks do not interfere in such a way as to violate guarantees. To increase performance and lower costs, this paper sets the research direction for future mixed-criticality systems that can handle both the high- and low-criticality tasks of the baseband unit. We analyze the 5G requirements and the common systems that currently implement them. We propose using T-CREST as the research platform, with a specific architecture targeting mixed-criticality workloads. We present two cache proposals that reduce the interference of low-criticality tasks on high-criticality tasks while ensuring high cache utilization and efficiency. The first cache proposal uses timeouts to automatically free cache lines reserved for high-criticality tasks. The second proposal uses contention tracking to limit how much low-criticality tasks may influence high-criticality tasks. Lastly, we propose a third cache architecture that unifies the method and stack caches unique to T-CREST into a single level-2 cache.

Keywords
5G, T-CREST, real-time systems, low latency, caches, radio baseband



1. Introduction

Socio-technical evolution depends on mobile communications as a critical driver of economic and social development [1]. As such, the evolution of communication technologies is essential in enabling societal development. 5G is the state of the art in mobile communication technologies, promising unprecedented speeds, ultra-low latency, and massive connectivity. With its lofty promises, implementing 5G communication networks is a significant industrial challenge. Continued investment in 5G technologies is needed to reach beyond the minimal promises of the technology. Improvements in technical implementations will ensure better service characteristics for customers and users at lower costs.

One critical aspect of telecommunications technology is the radio base station (RBS), which provides wireless transmission to and from mobile devices. The 5G functionality is implemented in these RBSs. Continued improvement of the RBS is critical to staying at the forefront of the industry. As such, research on how to best implement an RBS for optimized performance and cost ensures long-term competitiveness in the industry.

The requirements of 5G introduce a hierarchy of prioritized tasks that the RBS has to complete. The RBS, therefore, becomes a mixed-criticality system [2], where minimum guarantees are upheld to ensure critical tasks are completed correctly and in a timely fashion. Non-critical tasks, on the other hand, need to be performed as fast as possible; however, they only need to provide good quality of service (QoS) on average, so they may be de-prioritized to ensure that critical tasks meet their deadlines. To ensure non-critical tasks do not interfere with the critical ones, hardware systems are divided into several layers with differing responsibilities, correlating to the open systems interconnection (OSI) model [3]. This hardware division makes it easier to control interference but decreases resource utilization, which hurts performance and price. Therefore, we are interested in investigating future system designs that incorporate mixed-criticality system research to merge the currently divided systems into a single platform able to handle the varying criticality of tasks. While the current heavy use of shared scratchpads and the phased execution model [4] gives high predictability to the systems managing OSI layer 1, it is wasteful and difficult to unify with the shared caches used in the systems managing OSI layer 2. Therefore, innovative techniques are needed to facilitate the unification of the layer 1 and layer 2 systems into a single hardware system.

This paper addresses the challenge of sharing a level 2 (L2) cache between different tasks executing on different cores while still delivering low-latency execution of critical tasks. We propose to use the T-CREST platform [5] to explore solutions to the challenges around memory management for mixed-criticality systems by presenting three distinct caching architectures for future exploration. All solutions center on regulating access to cache lines for high- and low-criticality jobs. More specifically, we propose two shared caches that use timeouts and contention tracking, respectively, to limit the interference of low-criticality tasks on high-criticality ones, as well as an L2 cache that unifies the split caches unique to T-CREST, since they exhibit unique access characteristics that can be sped up predictably.

The contributions of this paper are: (1) a description of common 5G RBS technologies and implementations, (2) a discussion of the challenges future systems face in the pursuit of lower cost, higher efficiency, and improved performance, and (3) three proposals for caching architectures that we intend to explore to address the challenges described.

The rest of this paper is structured as follows. The following section provides background on how current systems implement 5G and their challenges. Section 3 introduces the T-CREST platform and how it can be used as a basis for research into a mixed-criticality system. Section 4 discusses the three cache architecture proposals. Section 5 presents related work, and Section 6 concludes the paper.

3rd Workshop on Resource AWareness of Systems and Society (RAW 2024), July 2–5, 2024, Maribor, Slovenia
* Corresponding author.
ejama@dtu.dk (E. J. Maroun); lpez@dtu.dk (L. Pezzarossa); masca@dtu.dk (M. Schoeberl)
ORCID: 0000-0002-3675-3376 (E. J. Maroun); 0000-0002-0863-2526 (L. Pezzarossa); 0000-0003-2366-382X (M. Schoeberl)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
[Figure 1 here: a block diagram of a hypothetical baseband unit. Cluster 0 holds four DSPs (DSP 0–3), each with private instruction and data caches (I$, D$); Clusters 1 and 2 hold four accelerators each (Acc 0–7); each cluster has a shared scratchpad. Two cluster-shared scratchpads, a hardware scheduler, and off-chip main memory (DRAM) complete the system.]

Figure 1: Hypothetical baseband unit architecture.



2. 5G Radio Baseband

During the initial phases of the 5G specification, three usage scenarios were identified as being critical for the future of mobile communications [1]:

Enhanced Mobile Broadband (eMBB): Focuses on providing significantly higher data rates and capacity compared to previous telecommunication generations, enabling applications such as high-definition video streaming, virtual reality, and augmented reality. This scenario covers the day-to-day activities of private users and data-heavy but less critical industrial applications.

Ultra-Reliable and Low Latency Communications (URLLC): Emphasizes ultra-reliable and low-latency communication, critical for applications that demand real-time responsiveness and mission-critical reliability, including autonomous vehicles, remote surgery, and industrial automation.

Massive Machine Type Communications (mMTC): Targets the connectivity of a massive number of devices using minimal energy, enabling the Internet of Things (IoT) to scale to unprecedented levels, facilitating applications such as smart cities, industrial IoT, and environmental monitoring.

These scenarios resulted in a requirement specification that includes the following criteria [6]:

   • Peak Data Rate: 20 Gbit/s download, 10 Gbit/s upload. This applies only in ideal conditions.
   • Transmission Latency: 4 ms for eMBB, 1 ms for URLLC. This is the latency added by the 5G network to the overall communication latency between endpoints.
   • Device Mobility: up to 500 km/h for rural eMBB, less for denser areas.
   • Density: up to 1,000,000 devices per square kilometer in the mMTC scenario.

Note how each requirement applies in specific scenarios and is not necessary in others. For example, the peak data rate is unnecessary for scenarios covered by URLLC or mMTC. Meanwhile, the extreme latency requirement of 1 ms only applies to URLLC.

An RBS must manage these diverse requirements and, therefore, becomes a mixed-criticality system. For example, tasks within the URLLC scenario must be prioritized over eMBB tasks to uphold the URLLC latency requirements. Not only do we have a range of priorities, but these priorities may also change as usage changes. Adapting to ongoing changes in network usage is, therefore, a critical aspect of implementing 5G.

2.1. System Architecture

Typical RBS systems are divided into three hardware units:

   1. The Remote Radio Unit (RRU). It is immediately connected to the antennas and handles the initial input stream from the antennas. The antenna streams are initially processed in this unit and grouped into user streams (e.g., 8 antenna streams are compressed to one group) to be sent to the next unit.
   2. The Baseband Unit (BBU). It takes the input streams from the RRU and further processes them. The RRU and BBU together constitute the physical layer of the OSI model (layer 1), handling the physical aspects of transmitting and receiving wireless 5G signals [7].
   3. The Layer 2 unit. It handles the data link layer of the OSI model (layer 2). This includes Medium Access Control (MAC) and Radio Link Control (RLC) tasks.

The varying characteristics of the workloads of the different units result in different hardware designs. While both the BBU and layer 2 must handle high- and low-criticality tasks, they do so in different ways. This research aims to explore a merged system that handles the BBU and layer 2 tasks in one hardware system. The new system is to be centered around the design of a BBU but explore technologies that allow layer 2 tasks to run efficiently.

2.2. Baseband Unit

The BBU system handles physical layer tasks centered around signal processing of incoming and outgoing transmissions. Its design ensures maximum predictability at the expense of resource utilization efficiency. Figure 1 provides an overview of the system. It is not meant to be representative of any specific system but to give an idea of the components often present and their interactions.

2.2.1. Hardware

We focus on systems centered around a clustered and heterogeneous design. Each cluster contains a set of processors or accelerators (for illustration, we show four in Figure 1). First, the general computing capability is provided by digital signal processor (DSP) cores with high predictability [8]. Each DSP has a private instruction and data cache and shares a
single scratchpad memory with the other processors in the cluster.

The other clusters contain acceleration cores for specific and common workloads. The accelerators in each cluster also share a scratchpad. The exact architecture of the accelerators is out of the scope of this paper.

The clusters may also share scratchpads; two are shown as an example. These split scratchpads handle different data with specific access characteristics. For example, some configuration data might be mostly read and changed rarely, while user-specific data may be updated continuously.

Lastly, a hardware scheduler can be present to orchestrate task execution on the relevant cores and movement of data. We omit describing any other application-specific devices or connections to peripherals.

2.3. Data Processing

Data processing starts once every millisecond. While the RRU is processing the antenna streams, the BBU starts with a set of configuration tasks that prepare for the delivery of data from the RRU. These configuration tasks must run on the DSP cores to, e.g., configure the accelerators before they start executing. This could result in configuration data initially going to one of the cluster-shared scratchpads, from where it is moved to the cluster scratchpads as needed. This data starts in the shared scratchpad of the core running the job and is off-loaded to the cluster-shared scratchpad when the configuration job is done. In parallel with the configuration tasks, the data from the RRU is being loaded into the cluster-shared scratchpads. When that is ready, proper processing tasks can begin executing on DSPs or accelerators as needed.

We consider only strict data access characteristics for the tasks: all shared data is read-only, and user-specific data is segmented into the relevant tasks and updated only by the task currently being worked on. At no point are two tasks working on the same user data. These strict data access characteristics mean that synchronization and coherence are not issues we will consider.

2.3.1. Phased Execution

The use of scratchpads in the BBU reduces the variability in execution times. However, this requires methodical orchestration to ensure each job has the needed data. As such, every job is divided into three phases:

   1. Read: Any data a task requires is moved onto its cluster's scratchpad from the cluster-shared scratchpads.
   2. Execute: The task's job is executed to completion without needing to access memory other than the cluster's scratchpad.
   3. Write: All the data previously fetched for the job, which has been updated, is written back to the main memory.

This is a classic implementation of phased execution [4, 9], also called the simple-task model [10]. The task scheduler ensures that a task's Execute is only scheduled on a processor when its corresponding Read has terminated on the same cluster. Data movement is performed using DMAs, allowing processors to execute other jobs' Execute phase in parallel with data movements.
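To make the three phases concrete, the following C-style sketch shows how one job could be driven through them. It is only an illustration under assumed primitives: dma_copy, dma_wait, and the job_t layout are hypothetical, not the API of any existing BBU or of T-CREST.

    #include <stddef.h>

    /* Hypothetical platform primitives, assumed for illustration only. */
    void dma_copy(void *dst, const void *src, size_t n); /* start an asynchronous DMA */
    void dma_wait(void);                                 /* block until the DMA completes */

    typedef struct {
        void (*entry)(void *scratch); /* Execute phase: touches only the scratchpad */
        const void *input;            /* source data in cluster-shared memory */
        void *output;                 /* write-back destination in main memory */
        size_t in_size, out_size;
    } job_t;

    /* Read -> Execute -> Write. The scheduler only starts the Execute
     * phase once the Read DMA for this job has finished on the same
     * cluster; meanwhile, the core may run another job's Execute phase. */
    void run_job_phased(job_t *job, void *scratch_partition) {
        dma_copy(scratch_partition, job->input, job->in_size);   /* Read */
        dma_wait();
        job->entry(scratch_partition);                           /* Execute */
        dma_copy(job->output, scratch_partition, job->out_size); /* Write */
        dma_wait();
    }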
A cluster's scratchpad is partitioned so that each running job has exclusive access to its memory portion. If two tasks use the same data, the Read phase of each will load that data into their respective partitions. This means data might be duplicated in the cluster scratchpads. However, such shared data is rarely written to, and synchronization is explicitly handled at the application level; therefore, this is not an issue.

2.4. Layer 2 Design

The common computing architectures for layer 2 are more traditional, with, e.g., superscalar cores and standard caching. The workload on the system requires less stringent predictability than the BBU, allowing for a more traditional design. The tasks also require higher performance, provided by the more complex design at the cost of predictability. To ensure high-criticality tasks meet their deadlines, the hardware resources can be partitioned by clusters and intentionally over-provisioned.

Layer 2, therefore, can have much wastage where high-criticality tasks are concerned. This unit's more complex design makes it challenging to ensure tasks meet their deadlines. The only way to ensure the deadlines are met is to provide the tasks with such an overabundance of resources that even when low-criticality tasks interfere, the high-criticality tasks will not be adversely affected. Therefore, the inefficient use of resources in layer 2 is a supporting reason for merging the layer 2 subsystem with the BBU subsystem.

2.5. Challenges

We aim to research new methods for implementing 5G RBS technologies to achieve better performance at lower cost. Therefore, the current challenges of increased costs and lower performance must be alleviated in any future system.

Challenge 1: The primary challenge for the above-mentioned RBS systems is the divided hardware architecture. The physical division ensures that high-criticality tasks can meet their deadlines, but it increases costs and reduces overall performance. First, the separation necessitates manufacturing two physical systems, which is costly. Second, the separation means the two systems cannot share resources, reducing the efficient use of available resources.

Challenge 2: On the BBU system specifically, there is also a challenge with efficient use of resources. While using scratchpads ensures execution-time predictability for all tasks, it also forces data duplication. If two tasks use the same data, that data is moved into both tasks' scratchpad partitions. This wastes both scratchpad memory and memory bandwidth. It is especially prevalent with configuration data, which is often shared between many tasks and does not change often. The data loaded into the scratchpads is also loaded on a pessimistic basis. Some tasks may only need part of the data, meaning some data might be unnecessarily loaded into the scratchpads.

Challenge 3: Memory bandwidth is wasted when dependent tasks use the same data. The Write phase in the BBU system always runs after the Execute phase. A subsequent job using the same data must reload it in its Read phase. This is sub-optimal in cases where the subsequent task can run on the same cluster as the first task. In such a case, it would be better to omit the Write phase of the first task and the Read phase of the second.
3. The T-CREST Platform

We propose to use the T-CREST platform as a basis for research into future platforms for 5G RBS. This section describes the platform's current capabilities and how they relate to the challenges present in divided RBS systems.

3.1. T-CREST and Patmos

The Patmos processor [11] is designed to serve real-time systems. Several Patmos cores are combined with a network-on-chip, a memory arbitration tree, and a memory controller into the time-predictable multi-core platform T-CREST [5]. As such, T-CREST provides techniques that make task execution time more predictable and reduce the worst-case execution time (WCET). Around the Patmos cores, it builds a platform of time-predictable components to reduce WCET analysis complexity and increase accuracy. T-CREST uses networks-on-chip [12, 13, 14] that ensure data is moved between processing cores with a known maximum latency. For accessing shared main memory, T-CREST uses a dedicated arbitration tree-based network-on-chip [15]. Regardless of how many cores are accessing the memory, each access will be serviced within a bounded latency.

Patmos uses an in-order pipeline to ensure every instruction has a known and constant execution time. To exploit instruction-level parallelism predictably, Patmos is also a very long instruction word (VLIW) architecture with a dual-issue pipeline. VLIW architectures are a predictable way of increasing performance without increasing complexity [16, 17]. Patmos executes instructions in bundles of up to two instructions. The compiler must designate instructions as part of a bundle by setting a specific bit in the first instruction. All Patmos instructions are predicated: based on one of eight predicate registers, each instruction is either enabled or disabled. If the predicate register's value is true, the instruction is enabled, meaning it executes normally. If the value is false, the instruction is disabled and does not affect registers or memory; it effectively becomes a no-op. However, the execution time of a disabled instruction is the same as when enabled. Predicated instructions allow the compiler to minimize execution time variability or even eliminate it entirely [18].
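As a rough illustration of why predication helps, consider the two C functions below. The first compiles to a branch whose timing depends on the data; the second is shaped like the if-converted code a compiler can produce with predication, where both outcomes cost the same. This is a sketch of the concept, not actual Patmos compiler output.

    /* Branchy version: execution time depends on whether the branch is taken. */
    int clamp_branchy(int x, int max) {
        if (x > max)
            x = max;
        return x;
    }

    /* If-converted version: the comparison sets a predicate, and the
     * assignment is performed as a predicated operation. Enabled or
     * disabled, the operation takes the same time, so the function's
     * execution time is constant. */
    int clamp_predicated(int x, int max) {
        int p = (x > max);  /* plays the role of a predicate register */
        return p ? max : x; /* select instead of branch */
    }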
3.2. Predictable Caching

While caching is usually associated with unpredictability and difficulties for static analysis, T-CREST deploys two predictable and easily analyzable caches. The first is a method cache [19] that replaces a traditional instruction cache in Patmos [20]. The method cache caches whole functions or parts of functions (sub-functions) such that instruction fetching never misses except at specific points. The compiler manages this cache by splitting the code into blocks that fit in the method cache and inserting cache-fill instructions where needed. In the Patmos ISA, function call and return instructions ensure that the callee or the caller is in the method cache. To support sub-function caching, Patmos has cache-filling variants of branch instructions. Using a method cache limits the places where cache misses can occur to the specific cache-filling instructions. The method cache is therefore simpler to model for an analyzer aiming to provide tight WCET bounds [21].

The second unique cache of T-CREST is the stack cache [22]. It caches function-local data, which is often accessed predictably and can be loaded at function entry and exit points. Accessing this data is also done without experiencing cache misses. The compiler also manages the stack cache, setting it up and tearing it down at function entries and exits and using stack-targeting load and store instruction variants. An analyzer can assume any stack-targeting instruction will hit in the stack cache. Therefore, the cache only needs to be modeled to account for the stack setup and tear-down time [23]. Data accesses that are not function-local may still go through the conventional data cache or circumvent all caching to target the main memory directly.

These two cache architectures are supported by the Platin WCET analyzer [24]. Platin models instruction execution and tracks which blocks of code are likely to be in the method cache at a given point. It accounts for this at each control-flow point to know whether a method-cache miss is likely and how many bytes would have to be loaded. For the stack cache, it models the program stack's size at any point and tracks the stack-cache-control instructions added by the compiler. At points where the stack must grow, Platin knows whether the cache has free space or needs to spill some of the program stack to main memory.

3.3. Missing Capabilities

The T-CREST platform is missing some features and capabilities compared to the BBU system. We enumerate these missing capabilities and highlight how we might either simulate them using existing capabilities or implement them into the platform as part of the research project.

3.3.1. Acceleration and Clustering

The specific processing requirements of an RBS mean dedicated accelerators can be used for maximum efficiency. The T-CREST platform does not include anything resembling these accelerators. Likewise, the T-CREST platform does not use any clustering, whose benefit is mainly driven by a multi-layered intermediate memory, which we discuss in the next section.

As this research mainly focuses on the efficient use of resources, notably memory, we will not investigate or implement any hardware acceleration. Instead, we will use the Patmos cores as substitutes for specific accelerators. We will implement clustering into the T-CREST platform so that each cluster can be designated to execute specific tasks. This will allow us to treat one cluster as a substitute for a BBU DSP cluster and others as different types of acceleration clusters.

3.3.2. Hierarchical Memory

The Patmos cores of T-CREST are each paired with private caches, as described earlier. However, no further hierarchy of intermediate memory exists. In contrast, the BBU system contains three levels of intermediate storage: first, each DSP (or accelerator) has its own caches; second, each cluster has a shared scratchpad; lastly, cluster-shared scratchpads are present as a last level for storing various types of data.

A multi-layered memory hierarchy is necessary for the experiments to be representative, especially given the unique data access characteristics. Therefore, we will build a second layer of intermediate memory, which is shared between the Patmos cores of each cluster. We will omit a last memory layer, as any methods of managing the second layer we develop can be transferred to the rest of the layers of a real-world system.
[Figure 2 here: the proposed system. Three clusters (Cluster 0–2), each containing two Patmos cores (Core 0–5) with private method, stack, and data caches (M$, S$, D$) and a cluster-shared cache; a scheduler connected to all clusters; and a memory controller to the off-chip main memory (SRAM).]

Figure 2: Proposed T-CREST system for researching novel cache architectures.



3.3.3. Hardware-Assisted Scheduling

The BBU systems often use hardware to accelerate scheduling. T-CREST does not implement any hardware that can assist with scheduling. While using a hardware scheduler in the BBU system ensures that the extreme number of tasks gets scheduled in a reasonable time, the smaller scale of this project's prototypes can likely be handled by software-managed scheduling.

Therefore, the initial proposed system will not have any scheduling hardware; dedicated Patmos cores will replace it to handle the scheduling. Software-defined scheduling is a flexible way to test our scheduling strategies as the system matures. Moving to a hardware scheduler should be easily doable at later stages of the research, once the scheduling has been studied and techniques chosen. Patmos already supports adding custom devices and accelerators [25]. A hardware scheduler is a device that interacts with the rest of the clusters, memories, and processors and issues commands in the same manner a Patmos core would.

3.4. Proposed System Architecture

Figure 2 shows a diagram of our proposed system. It comprises three clusters, each with a set of Patmos cores with private split caches (method, stack, and data) and a shared cluster cache. The cores use the T-CREST memory tree to access the shared cache, providing predictable and low-latency access. The clusters use the T-CREST memory tree to connect to the memory controller, which manages access to the off-chip main memory. A shared bus (above the clusters in Figure 2) facilitates cross-cluster and cross-core communication. This allows a Patmos core or a hardware scheduler device to issue scheduling commands to the whole system.

This system architecture will allow research on efficiently managing the cluster caches. The different clusters can simulate the DSP or accelerator clusters of the BBU system, while the cluster-shared scratchpads of that system do not introduce new challenges. Therefore, limiting ourselves to the two levels of cache (private and cluster caches) will allow for fruitful experimentation during the research.

4. Cache Proposals

To start addressing the challenge of merging layer 1 and layer 2 systems, we focus on the challenge of using a shared cache in each cluster. As described earlier, the BBU architecture sacrifices the efficient use of resources to ensure low variability in execution times. We aim to maximize resource usage in the proposed system while maintaining low variability. We propose exploring three caching solutions that address the challenges of predictable caching: (1) a criticality timeout cache, (2) a contention tracking cache, and (3) a unified method/stack cache.

4.1. Criticality Timeout Cache

In cases where strict predictability is unnecessary but flexibility and utilization efficiency are essential, we propose a cache using a partitioning approach based on cache-line timeouts. For this cache, we need an n-way set-associative cache configuration, which we can configure at the granularity of cache ways. Each cache way can be assigned either a criticality or a task/core ID (we will use criticality moving forward).

In this proposal, each cache way can be assigned either high or low criticality. Cache lines can be used by high- or low-criticality tasks; however, high-criticality tasks are naturally preferred. A low-criticality task cannot evict a high-criticality cache line. Therefore, to avoid starvation of low-criticality tasks, at least one way must not be assigned to the high-criticality tasks.

When a high-criticality access arrives, a cache line in one high-criticality way is tagged as being occupied by that criticality, and an associated timeout begins. As long as the timeout is not reached, accesses of low-criticality tasks cannot evict the cache line. If there is no access to the line before the timeout is reached, the line's criticality is downgraded, allowing low-criticality jobs to evict the line. The cache can either be configured right before each job starts executing, or the criticalities can be configured ahead of time to match the tasks that will run on the cluster. With timeouts, there is no need to explicitly release any data, as the timeout mechanism will do so automatically. Configuring the cache is done by setting the criticality of a cache way. When a way is configured with a criticality, all its cache lines will prefer accesses of that criticality, as described above.
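A minimal sketch of the mechanism is given below, assuming a per-line reservation flag and countdown; the data layout, the timeout value, and all names are our own assumptions for illustration, not a finished design.

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS           8
    #define TIMEOUT_CYCLES 4096  /* illustrative reservation lifetime */

    typedef struct {
        uint32_t tag;
        bool     valid;
        bool     reserved;  /* occupied by a high-criticality task */
        uint32_t timeout;   /* cycles until the reservation lapses */
    } line_t;

    /* Aging: when the timeout expires, the line is downgraded so that
     * low-criticality tasks may evict it; no explicit release is needed. */
    void age_set(line_t set[WAYS]) {
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].reserved)
                continue;
            if (set[w].timeout == 0)
                set[w].reserved = false;  /* reservation lapsed */
            else
                set[w].timeout--;
        }
    }

    /* Victim selection: a low-criticality access may never evict a line
     * with a live high-criticality reservation. Returns -1 if every
     * candidate is reserved; since at least one way is never assigned
     * high criticality, low-criticality tasks cannot starve. */
    int pick_victim(const line_t set[WAYS], bool high_crit_access) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid)
                return w;                 /* prefer unused lines */
        for (int w = 0; w < WAYS; w++)
            if (high_crit_access || !set[w].reserved)
                return w;
        return -1;
    }

    /* On a fill by a high-criticality access into a high-criticality way,
     * the reservation and its timeout are (re)started. */
    void fill(line_t *l, uint32_t tag, bool high_crit_access, bool way_is_high) {
        l->tag      = tag;
        l->valid    = true;
        l->reserved = high_crit_access && way_is_high;
        l->timeout  = l->reserved ? TIMEOUT_CYCLES : 0;
    }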
A significant drawback of this approach is its unpredictability. Because timeouts might cause a cache line to be evicted even when it might be used in the future, it can be difficult for a WCET analysis tool to track which cache lines have reached their timeout and which have not. The effect of the timeouts on WCET bounds can be challenging to estimate and would require dedicated analysis. However, such analysis can also be omitted, as this cache architecture is better suited for measurement-based WCET estimation. With detailed testing and measurements, obtaining a sufficiently safe WCET bound should be feasible.

This cache architecture is designed for high utilization and low scheduling complexity. Because it reserves each cache line individually, only the necessary subset of a cache way is reserved at a given time. Cache lines that either timed out or were not used by the job are free to be used by low-criticality tasks, increasing the utilization of the cache. In this proposal, we also do not pre-load data into the cache, meaning only data that is used will be loaded. Therefore, we avoid both bandwidth and cache-space wastage from loading data that is not used. When a job stops executing, its associated cache lines will eventually time out and release their contents automatically. The scheduler, therefore, does not need to manage the phased execution of jobs, reducing the pressure on the scheduler.
4.2. Contention Tracking Cache

In this proposal, a combination of contention tracking in the cache and contention-aware task scheduling allows for maximal cache utilization through dynamic partitioning, with high predictability through cache contention tracking and mitigation.

In a multicore system without shared caches, the execution time of a job is affected by the cache's behavior without that behavior being affected by other jobs. Through cache analysis, we can bound the execution time attributable to the cache by estimating the number of cache misses that will occur. When the cache is shared, this analysis is no longer possible, as the interference of other jobs will cause additional cache misses in a manner that cannot be estimated. In this proposal, we want to let the task scheduler limit the contention that a job is allowed to experience such that it is guaranteed to meet its deadline.

We give two example types of contention: (1) A job 𝐽1 experiences a contention event if a cache line 𝐶1 it populated with data 𝐷1 is evicted by an access by another job 𝐽2. This is because 𝐽1 will experience a cache miss on its next access to 𝐷1 that it would not have experienced had 𝐽2 not interfered. (2) 𝐽1 also experiences a contention event if a cache miss on 𝐷1 results in the eviction of a cache line that 𝐽1 also populated in the same cache set (with data 𝐷2). This event is a contention with any other job that has at least one populated cache line in the same set. Without the other jobs, 𝐽1 would have populated an empty cache line instead of evicting one of its own populated lines. The evicted line will cause a cache miss in the future when 𝐽1 needs to access 𝐷2 again.

We only consider contention between different jobs. Self-contention also happens in private caches and is, therefore, already managed by the cache analysis for the private cache.

We limit the maximum allowed contention as defined above to ensure that a job meets its deadline without interference from other jobs. The scheduler configures the cache with a maximum allowed contention, and the cache controller tracks contention by checking and counting the above contention events for each job. When a job reaches its contention limit, any cache access that would cause a contention event against it is blocked or mitigated. For example, say 𝐽1 is high criticality and 𝐽2 is not. As long as 𝐽1 has not reached its contention limit, the cache treats accesses from both jobs equally. When the limit is reached, contention events between 𝐽1 and 𝐽2 are mitigated. In the case of the first event type, accesses from 𝐽2 that would cause an eviction of 𝐽1's cache lines are rejected by the cache; the access must then be rerouted directly to the main memory, which the system must support. In the second event type, if the default replacement policy would have 𝐽1 evict its own cache line in the set, it instead evicts a cache line from 𝐽2.
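A sketch of how the controller could combine tracking and mitigation is given below; the structures, the charging rules, and the stand-in replacement policy are our own simplifications of the scheme, not a complete design (e.g., interactions between several jobs that are all at their limits are ignored).

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 8

    typedef struct { uint32_t tag; bool valid; uint8_t owner; } line_t;

    typedef struct {
        uint32_t events;  /* contention events suffered by this job so far */
        uint32_t limit;   /* maximum allowed contention, set by the scheduler */
    } job_state_t;

    /* Stand-in for the cache's default replacement policy (e.g., LRU). */
    static int default_policy(const line_t set[WAYS]) { return 0; }

    /* Returns the way an access from `job` may fill on a miss, or -1 if
     * the access must bypass the cache and go directly to main memory. */
    int choose_way(line_t set[WAYS], job_state_t jobs[], uint8_t job) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid)
                return w;                  /* empty line: no contention */

        int victim = default_policy(set);
        uint8_t owner = set[victim].owner;

        if (owner != job) {
            /* Event type (1): evicting another job's line. */
            if (jobs[owner].events >= jobs[owner].limit)
                return -1;                 /* blocked: reroute to main memory */
            jobs[owner].events++;          /* charge the event to the victim's job */
        } else if (jobs[job].events >= jobs[job].limit) {
            /* Event type (2), mitigated: a job at its limit evicts another
             * job's line instead of one of its own. */
            for (int w = 0; w < WAYS; w++)
                if (set[w].owner != job) { victim = w; break; }
        } else {
            /* Event type (2): self-eviction while other jobs occupy the set. */
            for (int w = 0; w < WAYS; w++)
                if (set[w].owner != job) { jobs[job].events++; break; }
        }
        return victim;
    }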
Setting the contention limit is the responsibility of the job scheduler. Through traditional static WCET analysis under the assumption of private caches, jobs get their WCET bound. Any excess time between the bound and the task deadline is therefore open to contention. Before the scheduler starts a job, it sets the contention limit, ensuring that the WCET of the job, with contention, still meets the deadline. The contention limit can be static and calculated as part of the schedulability analysis. It can also be dynamic, so that the scheduler adjusts it to the runtime conditions: if the task was started early, the contention limit is increased to match the available slack time; if the task was started late, the limit is reduced or set to zero to ensure that the deadline is still met.
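As a sketch of this calculation, assuming each contention event costs at most one extra main-memory reload (the names and the cost model are ours):

    #include <stdint.h>

    /* Derive a job's contention limit from its slack. Illustrative only. */
    uint32_t contention_limit(uint32_t wcet,         /* WCET bound, private cache assumed */
                              uint32_t deadline,     /* relative deadline in cycles */
                              uint32_t start_delay,  /* how late the job was started */
                              uint32_t miss_penalty) /* worst-case cost of one event */
    {
        if (deadline <= start_delay + wcet)
            return 0;                 /* no slack left: forbid all contention */
        return (deadline - start_delay - wcet) / miss_penalty;
    }

A static limit would simply use start_delay = 0 at analysis time.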
This proposal's major strength is that it disconnects the analysis of tasks with differing criticalities. Because of the contention limit, high-criticality tasks will never be adversely affected by low-criticality tasks. Therefore, we just need to ensure that all high-criticality tasks meet their deadlines with other methods.¹ The proposal also does not statically partition or lock the cache. At worst, when a contention limit is reached, the cache is dynamically partitioned automatically, simply by prioritizing the jobs that have reached the limit. This maximizes cache utilization. It also allows maximizing the performance of low-criticality tasks as long as they do not adversely affect any high-criticality tasks.

¹ For example, we could use partitioning between high-criticality tasks only.

This proposal does increase the complexity of the cache controller, which needs to track contention events and mitigate them for jobs that have reached their contention limit. Each cache line needs to be associated with a job (or core), each job needs a contention counter, and logic is needed to ensure the correct mitigation at contention limits. The proposal also increases scheduler complexity. This complexity can be initially lowered by simply having statically determined contention limits. However, further work should explore dynamically determined limits, which would increase the workload on the scheduler.

4.3. Unified Method/Stack Cache

The Patmos processor of T-CREST uses the special method and stack caches. While these caches have been researched for their impact on predictability, and the Platin analyzer implements analyses for them, additional work is needed to integrate them into a shared L2 cache. Therefore, we propose investigating a shared L2 cache that integrates the features of both the method cache and the stack cache. It is meant to complement either a traditional L2 data cache or a scratchpad, with extended research avenues towards a fully integrated L2 cache that supports the method, stack, and data caches. This proposal can also complement either of the previous proposals.
                      Criticality   Contention   Unified
                      Timeout       Tracking     Method/Stack
 All Data                 ✓             ✓             ✗
 Shared                   ✓             ✓             ✗
 Mixed-Criticality        ✓             ✓             ✗
 Analyzable               ✗             ✓*            ✓
 Needs Scheduling         ✗*            ✓*            ✗
 Guaranteed               ✗             ✓*            ✓

Table 1: Comparison between features of the three cache proposals.
                                                                  but only in the sense that it simplifies mixed-criticality anal-
                                                                  ysis by disallowing interference between tasks of different
data caches. This proposal can also complement either of
                                                                  criticalities. For tasks with the same criticality, the cache
the previous proposals.
                                                                  does not provide any assistance but does not complicate
   The method and stack caches have particular access pat-
                                                                  the analysis. The Unified Method/Stack Cache is the most
terns to their data. The method cache accesses a block of
                                                                  analyzable. Analyzers can reuse the analysis done for the
code at a time, pre-loading a complete block at once. It also
                                                                  separate method and stack caches and likely reuse it for
uses a first-in, first-out (FIFO) replacement policy to account
                                                                  the unified one with different configurations and minor
for functions earlier in the call stack being less likely to be
                                                                  customization.
called again soon. On the other hand, the stack cache is
                                                                     The proposals also differ in how much support is needed
not backed by main memory unless some data is spilled
                                                                  from the job scheduler at runtime. The Priority Timeout
when the cache is full. This allows the L2 cache to store
                                                                  Cache can be implemented without scheduler support if the
the spilled stack data first without sending it to the main
                                                                  way-based partitioning is configured ahead of time. If the
memory. Access to this stored data would have the same
                                                                  partitioning is done dynamically, it would be the scheduler’s
characteristics as access to the stack cache. Additionally,
                                                                  responsibility. The Contention Tracking Cache needs sup-
when space is tight in the L2 cache, the replacement policy
                                                                  port from the scheduler to ensure the amount of allowed
is the same as the stack cache: spill the data furthest up the
                                                                  contention is within the correct limit. The scheduler needs
stack.
                                                                  to account for when a high-criticality job is started so that
   An open question is how to partition the cache between
                                                                  an appropriate contention limit is chosen. A static approach
the method and stack data. Since both have a replacement
                                                                  can also be used where the contention limit is chosen ahead
policy that depends on reaching the space limit, a policy
                                                                  of time. However, that does not provide much benefit com-
is needed for deciding how much of the cache should be
                                                                  pared to traditional partitioning. The Unified Method/stack
meant for the methods, and how much should be used for
                                                                  Cache needs no scheduling support at all. The only thing
the stack. We should also investigate if this division can be
                                                                  that might be configurable would be how much of the cache
dynamically configured such that if the stack is not expected
                                                                  is prioritized for methods or the stack. However, this could
to use much space, then most of the L2 cache should be saved
                                                                  better be done by the program itself, e.g., through compiler
for the methods and vice versa. A different approach could
                                                                  management of the cache.
be to say that the stack gets priority up to a point. When
                                                                     Lastly, each cache has different guarantees on its behavior.
the stack needs to store more data, methods are evicted to
                                                                  The Priority Timeout Cache provides priority guarantees for
make room up to a point (e.g., half the L2 cache size). Any
                                                                  only a specific time. If that is not managed such that it does
space not used by the stack cache can store methods. This
                                                                  not run out, programs cannot be guaranteed that a specific
can also be done in reverse, where the method data gets
                                                                  amount of the cache is reserved for them. While giving
priority.
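In the sketch, the line-granularity bookkeeping and the half-cache threshold are assumptions for illustration, not a settled design.

    #include <stdint.h>

    typedef struct {
        uint32_t capacity;    /* total L2 size in cache lines */
        uint32_t stack_limit; /* stack has priority up to here, e.g., capacity / 2 */
        uint32_t stack_used;
        uint32_t method_used;
    } l2_t;

    /* Make room for one new line of stack data. While the stack is under
     * its priority limit, the oldest method block is evicted (FIFO order);
     * past the limit, the stack spills its own furthest-up data instead. */
    void reserve_stack_line(l2_t *c) {
        if (c->stack_used + c->method_used == c->capacity) {
            if (c->stack_used < c->stack_limit && c->method_used > 0)
                c->method_used--; /* evict the oldest method block */
            else
                c->stack_used--;  /* spill the stack data furthest up the stack */
        }
        c->stack_used++;
    }

The reverse policy, where the method data gets priority, swaps the roles of the two counters.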
                                                                  no guarantees on partitioning, the Contention Tracking
   An open question that would need answering following
                                                                  cache guarantees how much contention could affect a job.
the above initial research, would be how to implement a
                                                                  However, this is only contention from lower criticality job
unified method/stack cache that is also shared between cores.
                                                                  contention, and so does not make any guarantees about
Since each core has a distinct stack, and is also likely to use
                                                                  the contention from similar-criticality tasks. However, the
different functions, we need to explore ways for a single
                                                                  Unified Method/Stack Cache is predictable and guarantees
cache to effectively manage multiple stacks and call trees.
                                                                  similar behavior to the split caches.

4.4. Discussion
                                                                  5. Related Work
The three caching proposals—Criticality Timeout Cache,
Contention Tracking Cache, and Unified Method/Stack               Shared caches are a significant challenge for predictability
Cache—each address the challenge of predictable caching           due to their inherent nature of allowing multiple cores to
in different ways. Table 1 compared the various features          access the same cache [26]. This can lead to contention
of our proposals. The first big difference is between the         and unpredictable performance. However, several solutions
Unified Method/Stack Cache and the two other caches. The          have been proposed to address this issue, including cache
5. Related Work

Shared caches are a significant challenge for predictability due to their inherent nature of allowing multiple cores to access the same cache [26]. This can lead to contention and unpredictable performance. However, several solutions have been proposed to address this issue, including cache partitioning and locking [27].

Partitioning is a technique that divides the shared cache into several partitions, each dedicated to a specific core [28]. This approach can significantly improve predictability by reducing contention [29]. Way-based partitioning involves dividing the cache ways among different cores.
Each core is assigned a specific number of ways in the cache, ensuring exclusive access to those ways. This method can effectively isolate the cache activities of different cores, improving predictability. On the other hand, index-based partitioning involves dividing the cache sets among different cores. Each core is assigned specific sets in the cache, ensuring exclusive access. This method is more flexible than way-based partitioning because the number of sets is usually large, allowing for finer-grained partitioning. However, a given set maps to specific address ranges; therefore, this method requires more detailed memory management. Page coloring is often used to partition the cache [30]: the address space is divided into colors associated with the cache sets, and assigning colors to tasks/cores provides the partitioning, assuming an assignment can be found that provides the correct memory for each task/core. The cache hardware can also support index-based partitioning for various benefits [31, 32]. However, some form of software management will always be needed. The sketch below shows the arithmetic behind page coloring.
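The set and color of an address follow directly from the cache geometry. The parameters below are generic examples for illustration, not the actual T-CREST L2 configuration.

    /* Page coloring arithmetic for an example cache geometry. */
    #include <stdint.h>

    #define LINE_SIZE  32   /* bytes per cache line */
    #define NUM_SETS   512  /* sets in the shared L2 */
    #define PAGE_SIZE  4096 /* OS page size in bytes */

    /* The sets span LINE_SIZE * NUM_SETS = 16 KiB = 4 pages, so there
     * are 4 colors; pages of different colors never conflict in the
     * cache. */
    #define NUM_COLORS ((LINE_SIZE * NUM_SETS) / PAGE_SIZE)

    static inline uint32_t set_of(uint32_t addr) {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    static inline uint32_t color_of(uint32_t addr) {
        return (addr / PAGE_SIZE) % NUM_COLORS;
    }

Restricting each core's allocations to a disjoint subset of colors then yields exactly the index-based partitioning described above.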
Cache locking is another technique used to improve predictability in shared caches [33]. With locking, specific cache lines can be locked to prevent them from being evicted, ensuring they are always available for the cores that need them. This can significantly reduce cache misses and improve predictability. Locking can, however, be costly. Lock management involves tracking the locked cache lines, increasing hardware complexity, and adding locking to a cache can reduce its capacity or speed, depending on how fine-grained the locking is. Locking also reduces cache utilization, as any unused locked content cannot be evicted to free up cache lines for needed data. A typical usage pattern is sketched below.
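The usual pattern is to preload and then pin the data a critical task must always hit on. In this sketch, cache_lock and cache_unlock are hypothetical stand-ins for a hardware locking interface, not an existing T-CREST API.

    /* Pinning a latency-critical table with hypothetical locking calls. */
    #include <stddef.h>
    #include <stdint.h>

    extern void cache_lock(const void *addr, size_t bytes);   /* hypothetical */
    extern void cache_unlock(const void *addr, size_t bytes); /* hypothetical */

    static uint16_t coeffs[128]; /* table the critical task always needs */

    void critical_init(void) {
        volatile uint16_t sink;
        size_t i;
        /* Touch every element so the table is resident... */
        for (i = 0; i < 128; i++) sink = coeffs[i];
        (void)sink;
        /* ...then lock it: accesses become guaranteed hits for WCET
         * analysis, but these 256 bytes are unavailable to all other
         * users until cache_unlock() is called. */
        cache_lock(coeffs, sizeof coeffs);
    }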
T-CREST has enabled much research within various aspects of real-time systems [5]. Because all of T-CREST's components are predictable, it is possible to implement constant execution-time code based on the single-path paradigm [34, 18]. Single-path code has an inherently high overhead, necessitating optimizations to reduce the executed code [35], make the best use of Patmos' dual-issue pipeline [36, 17], and use custom register allocation techniques [37]. The combination of T-CREST and single-path code has been shown to be competitive with off-the-shelf ARM processors for a real-time application [38]. Research is also ongoing to port the Lingua Franca coordination language to T-CREST to enable the creation of complete real-time systems within one framework [39, 40].
6. Conclusion

The increasing importance of 5G technologies necessitates continuous research and development into the hardware systems implementing the technology. The diverse requirement specifications of this new technology call for a system with varying degrees of strictness and performance. Existing systems were designed with the minimal 5G guarantees in mind, ensuring the hard requirements, e.g., low latency, were met before softer requirements like throughput. This focus resulted in a physically divided system to achieve the goals.

To increase future systems' performance while maintaining the older systems' guarantees, this paper sets the research direction into a mixed-criticality 5G RBS with merged BBU and layer 2 systems. The system should be able to execute high-criticality tasks, like those required by the URLLC 5G scenario, and low-criticality QoS tasks, like those for eMBB, in one SoC. By analyzing the 5G requirement specifications and the common system architecture, we propose using the T-CREST platform as the research platform for future mixed-criticality systems. We propose a specific system architecture that best leverages the existing architecture's strengths and increases its performance through shared caches. We propose three specific research directions within shared L2 caches for clustered systems. The various proposals have distinct strengths and weaknesses that will be further explored in future work.

Acknowledgment

This work is partially supported by the CERCIRAS (Connecting Education and Research Communities for an Innovative Resource Aware Society) COST Action no. CA19135 funded by COST (European Cooperation in Science and Technology).

References

 [1] International Telecommunication Union - Radiocommunication Sector, IMT Vision - Framework and overall objectives of the future development of IMT for 2020 and beyond, Technical Report M.2083-0, International Telecommunication Union, 2015.
 [2] A. Burns, R. I. Davis, Mixed criticality systems – a review (February 2022), 2022.
 [3] ISO/IEC 7498-1:1994(E), Information technology – Open Systems Interconnection – Basic Reference Model: The Basic Model, Technical Report 7498-1:1994, International Organization for Standardization, 1996.
 [4] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, R. Kegley, A predictable execution model for COTS-based embedded systems, in: 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium, IEEE, 2011, pp. 269–279.
 [5] M. Schoeberl, S. Abbaspour, B. Akesson, N. Audsley, R. Capasso, J. Garside, K. Goossens, S. Goossens, S. Hansen, R. Heckmann, S. Hepp, B. Huber, A. Jordan, E. Kasapaki, J. Knoop, Y. Li, D. Prokesch, W. Puffitsch, P. Puschner, A. Rocha, C. Silva, J. Sparsø, A. Tocchi, T-CREST: Time-predictable multi-core architecture for embedded systems, Journal of Systems Architecture 61 (2015) 449–471. doi:10.1016/j.sysarc.2015.04.002.
 [6] International Telecommunication Union - Radiocommunication Sector, Minimum requirements related to technical performance for IMT-2020 radio interface(s), Technical Report M.2410-0, International Telecommunication Union, 2017.
 [7] Z. Kong, J. Gong, C.-Z. Xu, K. Wang, J. Rao, eBase: A baseband unit cluster testbed to improve energy-efficiency for cloud radio access network, in: 2013 IEEE International Conference on Communications (ICC), IEEE, 2013, pp. 4222–4227.
 [8] E. Tell, A. Nilsson, D. Liu, A programmable DSP core for baseband processing, in: The 3rd International IEEE-NEWCAS Conference, IEEE, 2005, pp. 403–406.
 [9] J. Arora, C. Maia, S. A. Rashid, G. Nelissen, E. Tovar, Schedulability analysis for 3-phase tasks with partitioned fixed-priority scheduling, Journal of Systems Architecture 131 (2022) 102706.
[10] H. Kopetz, Real-Time Systems, Kluwer Academic, Boston, MA, USA, 1997.
[11] M. Schoeberl, W. Puffitsch, S. Hepp, B. Huber, D. Prokesch, Patmos: A time-predictable microprocessor, Real-Time Systems 54(2) (2018) 389–423. doi:10.1007/s11241-018-9300-4.
[12] M. Schoeberl, F. Brandner, J. Sparsø, E. Kasapaki, A statically scheduled time-division-multiplexed network-on-chip for real-time systems, in: Proceedings of the 6th International Symposium on Networks-on-Chip (NOCS), IEEE, Lyngby, Denmark, 2012, pp. 152–160. doi:10.1109/NOCS.2012.25.
[13] E. Kasapaki, M. Schoeberl, R. B. Sørensen, C. T. Müller, K. Goossens, J. Sparsø, Argo: A real-time network-on-chip architecture with an efficient GALS implementation, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24 (2016) 479–492. doi:10.1109/TVLSI.2015.2405614.
[14] M. Schoeberl, Exploration of network interface architectures for a real-time network-on-chip, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), IEEE, United States, 2024. doi:10.1109/ISORC61049.2024.10551364.
[15] M. Schoeberl, D. V. Chong, W. Puffitsch, J. Sparsø, A time-predictable memory network-on-chip, in: Proceedings of the 14th International Workshop on Worst-Case Execution Time Analysis (WCET 2014), Madrid, Spain, 2014, pp. 53–62. doi:10.4230/OASIcs.WCET.2014.53.
[16] J. Yan, W. Zhang, A time-predictable VLIW processor and its compiler support, Real-Time Systems 38 (2008) 67–84. doi:10.1007/s11241-007-9030-5.
[17] E. J. Maroun, M. Schoeberl, P. Puschner, Predictable and optimized single-path code for predicated processors, Journal of Systems Architecture (2024) 103214.
[18] E. J. Maroun, M. Schoeberl, P. Puschner, Compiler-directed constant execution time on flat memory systems, in: 2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC), 2023, pp. 64–75. doi:10.1109/ISORC58943.2023.00019.
[19] M. Schoeberl, A time predictable instruction cache for a Java processor, in: On the Move to Meaningful Internet Systems 2004: Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES 2004), volume 3292 of LNCS, Springer, Agia Napa, Cyprus, 2004, pp. 371–382. doi:10.1007/b102133.
[20] P. Degasperi, S. Hepp, W. Puffitsch, M. Schoeberl, A method cache for Patmos, in: Proceedings of the 17th IEEE Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC 2014), IEEE, Reno, Nevada, USA, 2014, pp. 100–108. doi:10.1109/ISORC.2014.47.
[21] B. Huber, S. Hepp, M. Schoeberl, Scope-based method cache analysis, in: Proceedings of the 14th International Workshop on Worst-Case Execution Time Analysis (WCET 2014), Madrid, Spain, 2014, pp. 73–82. doi:10.4230/OASIcs.WCET.2014.73.
[22] S. Abbaspour, F. Brandner, M. Schoeberl, A time-predictable stack cache, in: Proceedings of the 9th Workshop on Software Technologies for Embedded and Ubiquitous Systems, 2013.
[23] A. Jordan, F. Brandner, M. Schoeberl, Static analysis of worst-case stack cache behavior, in: Proceedings of the 21st International Conference on Real-Time Networks and Systems (RTNS 2013), ACM, New York, NY, USA, 2013, pp. 55–64. doi:10.1145/2516821.2516828.
[24] E. J. Maroun, E. Dengler, C. Dietrich, S. Hepp, H. Herzog, B. Huber, J. Knoop, D. Wiltsche-Prokesch, P. Puschner, P. Raffeck, et al., The platin multi-target worst-case analysis tool, in: 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024.
[25] C. Pircher, A. Baranyai, C. Lehr, M. Schoeberl, Accelerator interface for Patmos, in: 2021 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC), 2021.
[26] B. C. Ward, J. L. Herman, C. J. Kenna, J. H. Anderson, Making shared caches more predictable on multicore platforms, in: 2013 25th Euromicro Conference on Real-Time Systems, IEEE, 2013, pp. 157–167.
[27] G. Gracioli, A. Alhammad, R. Mancuso, A. A. Fröhlich, R. Pellizzoni, A survey on cache management mechanisms for real-time embedded systems, ACM Computing Surveys (CSUR) 48 (2015) 1–36.
[28] S. Mittal, A survey of techniques for cache partitioning in multicore processors, ACM Computing Surveys (CSUR) 50 (2017) 1–39.
[29] X. Vera, B. Lisper, J. Xue, Data caches in multitasking hard real-time systems, in: RTSS 2003, 24th IEEE Real-Time Systems Symposium, IEEE, 2003, pp. 154–165.
[30] T. Lugo, S. Lozano, J. Fernández, J. Carretero, A survey of techniques for reducing interference in real-time applications on multicore platforms, IEEE Access 10 (2022) 21853–21882.
[31] A. Chousein, R. N. Mahapatra, Fully associative cache partitioning with don't care bits for real-time applications, ACM SIGBED Review 2 (2005) 35–38.
[32] M. Lee, S. Kim, Time-sensitivity-aware shared cache architecture for multi-core embedded systems, The Journal of Supercomputing 75 (2019) 6746–6776.
[33] S. Mittal, A survey of techniques for cache locking, ACM Transactions on Design Automation of Electronic Systems (TODAES) 21 (2016) 1–24.
[34] P. Puschner, A. Burns, Writing temporally predictable code, in: Proceedings of the Seventh IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 2002), IEEE Computer Society, Washington, DC, USA, 2002, pp. 85–94. doi:10.1109/WORDS.2002.1000040.
[35] E. J. Maroun, M. Schoeberl, P. Puschner, Constant-Loop Dominators for Single-Path Code Optimization, in: P. Wägemann (Ed.), 21st International Workshop on Worst-Case Execution Time Analysis (WCET 2023), volume 114 of Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2023, pp. 7:1–7:13. URL: https://drops.dagstuhl.de/opus/volltexte/2023/18436. doi:10.4230/OASIcs.WCET.2023.7.
[36] E. J. Maroun, M. Schoeberl, P. Puschner, Compiling for time-predictability with dual-issue single-path code, Journal of Systems Architecture 118 (2021) 1–11.
[37] E. Maroun, M. Schoeberl, P. Puschner, Two-step register allocation for implementing single-path code, in: Proceedings of the 2024 IEEE 27th International
     Symposium on Real-Time Distributed Computing
     (ISORC), IEEE, United States, 2024. doi:10.1109/ISORC61049.2024.10551362.
[38] M. Platzer, P. Puschner, A real-time application with
     fully predictable task timing, in: 2020 IEEE 23rd Inter-
     national Symposium on Real-Time Distributed Com-
     puting (ISORC), IEEE, 2020, pp. 43–46.
[39] E. Khodadad, L. Pezzarossa, M. Schoeberl, Towards Lingua Franca on the Patmos processor, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), 2024.
[40] M. Schoeberl, E. Khodadad, S. Lin, E. J. Maroun,
     L. Pezzarossa, E. A. Lee, Invited Paper: Worst-Case
     Execution Time Analysis of Lingua Franca Applica-
     tions, in: T. Carle (Ed.), 22nd International Work-
     shop on Worst-Case Execution Time Analysis (WCET
     2024), volume 121 of Open Access Series in Informat-
     ics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für
     Informatik, Dagstuhl, Germany, 2024, pp. 4:1–4:13.
     doi:10.4230/OASIcs.WCET.2024.4.