=Paper=
{{Paper
|id=Vol-3867/paper5
|storemode=property
|title=Caching in a Mixed-Criticality 5G Radio Base Station
|pdfUrl=https://ceur-ws.org/Vol-3867/paper5.pdf
|volume=Vol-3867
|authors=Emad Jacob Maroun,Luca Pezzarossa,Martin Schoeberl
|dblpUrl=https://dblp.org/rec/conf/raw2/MarounPS24
}}
==Caching in a Mixed-Criticality 5G Radio Base Station==
Emad Jacob Maroun1,*, Luca Pezzarossa1 and Martin Schoeberl1

1 Technical University of Denmark, Department of Applied Mathematics and Computer Science
Abstract
Telecommunication is a critical driver of economic and social development. 5G technologies are the state of the art in telecommunication, setting strong and open-ended requirements for the systems that implement them. Current systems implementing baseband technologies in 5G depend on hardware separation to ensure high- and low-criticality tasks do not interfere in a way that violates guarantees. To increase performance and lower costs, this paper sets a research direction toward future mixed-criticality systems that can handle both the high- and low-criticality tasks of the baseband unit. We analyze the 5G requirements and the common systems that currently implement them. We propose using T-CREST as the research platform, with a specific architecture targeting mixed-criticality workloads. We present two cache proposals that reduce the interference of low-criticality tasks on high-criticality tasks while ensuring high cache utilization and efficiency. The first cache proposal uses timeouts to automatically free cache lines reserved for high-criticality tasks. The second proposal uses contention tracking to limit how much low-criticality tasks may influence high-criticality tasks. Lastly, we propose a third cache architecture that unifies the method and stack caches unique to T-CREST into a single level-2 cache.
Keywords
5g, t-crest, real-time systems, low latency, caches, radio baseband
1. Introduction

Socio-technical evolution is dependent on mobile communications as a critical driver of economic and social development [1]. As such, the evolution of communication technologies is essential in enabling societal development. 5G is the state of the art in mobile communication technologies, promising unprecedented speeds, ultra-low latency, and massive connectivity capabilities. With its lofty promises, implementing 5G communication networks is a significant industrial challenge. Continued investment in 5G technologies is needed to reach beyond the minimal promises of the technology. Improvements in technical implementations will ensure better service characteristics for customers and users at lower costs.

One critical aspect of telecommunications technology is the radio base station (RBS), which provides wireless transmission to and from mobile devices. The 5G functionality is implemented in these RBSs. Continued improvement of the RBS is critical to staying at the forefront of the industry. As such, research on how best to implement an RBS for optimized performance and cost ensures long-term competitiveness in the industry.

The requirements of 5G introduce a hierarchy of prioritized tasks that the RBS has to complete. The RBS, therefore, becomes a mixed-criticality system [2], where minimum guarantees are upheld to ensure critical tasks are completed correctly and in a timely fashion. Non-critical tasks, on the other hand, need to be performed as fast as possible; however, they only need to provide good quality of service (QoS) on average, so they may be de-prioritized to ensure that critical tasks meet their deadlines. To ensure non-critical tasks do not interfere with the critical ones, hardware systems are divided into several layers with differing responsibilities correlating to the open systems interconnection (OSI) model [3]. This hardware division makes it easier to control interference but decreases resource utilization, which hurts performance and price. Therefore, we are interested in investigating future system designs that incorporate mixed-criticality system research to merge the currently divided systems into a single platform that can handle the varying criticality of tasks. While the current heavy use of shared scratchpads and the phased execution model [4] gives high predictability to the systems managing OSI layer 1, it is wasteful and difficult to unify with the use of shared caches in the systems managing OSI layer 2. Therefore, innovative techniques are needed to facilitate the unification of the layer 1 and layer 2 systems into a unified hardware system.

This paper addresses the challenge of sharing a level 2 (L2) cache between different tasks executing on different cores while still delivering low-latency execution of critical tasks. We propose to use the T-CREST platform [5] to explore solutions to the challenges around memory management for mixed-criticality systems by presenting three distinct caching architectures for future exploration. All solutions are centered around regulating access to different cache lines for high- and low-criticality jobs. More specifically, we propose two shared caches that use timeouts and contention tracking to limit the interference of low-criticality tasks on high-criticality ones, as well as an L2 cache that unifies the split caches unique to T-CREST, since they exhibit unique access characteristics that can be sped up predictably.

The contributions of this paper are: (1) a description of common 5G RBS technologies and implementations, (2) a discussion of the challenges future systems face in the pursuit of lower cost, higher efficiency, and improved performance, and (3) three proposals for caching architectures that we intend to explore to address the challenges described.

The rest of this paper is structured as follows. The following section provides some background on how current systems implement 5G and their challenges. Section 3 introduces the T-CREST platform and how it can be used as a basis for research into a mixed-criticality system. Section 4 discusses the three cache architecture proposals. Section 5 presents related work and Section 6 concludes the paper.

3rd workshop on Resource AWareness of Systems and Society (RAW 2024), July 2–5, 2024, Maribor, Slovenia
* Corresponding author.
ejama@dtu.dk (E. J. Maroun); lpez@dtu.dk (L. Pezzarossa); masca@dtu.dk (M. Schoeberl)
0000-0002-3675-3376 (E. J. Maroun); 0000-0002-0863-2526 (L. Pezzarossa); 0000-0003-2366-382X (M. Schoeberl)
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Figure 1: Hypothetical baseband unit architecture. (The original figure shows three clusters: one with four DSPs, each with private instruction and data caches (I$, D$), and two with four accelerators each; every cluster has a shared scratchpad. A scheduler, two cluster-shared scratchpads, and off-chip DRAM main memory complete the system.)

2. 5G Radio Baseband

During the initial phases of the 5G specification, three usage scenarios were identified as being critical for the future of
mobile communications [1]:

Enhanced Mobile Broadband (eMBB): Focuses on providing significantly higher data rates and capacity compared to previous telecommunication generations, enabling applications such as high-definition video streaming, virtual reality, and augmented reality. This scenario covers the day-to-day activities of private users and data-heavy but less critical industrial applications.

Ultra-Reliable and Low Latency Communications (URLLC): Emphasizes ultra-reliable and low-latency communication, critical for applications that demand real-time responsiveness and mission-critical reliability, including autonomous vehicles, remote surgery, and industrial automation.

Massive Machine Type Communications (mMTC): Targets the connectivity of a massive number of devices using minimal energy, enabling the Internet of Things (IoT) to scale to unprecedented levels, facilitating applications such as smart cities, industrial IoT, and environmental monitoring.

These scenarios resulted in a requirement specification that includes the following criteria [6]:

• Peak Data Rate: 20 Gbit/s download, 10 Gbit/s upload. This applies only in ideal conditions.
• Transmission Latency: 4 ms for eMBB, 1 ms for URLLC. This is the latency added by the 5G network to the overall communication latency between endpoints.
• Device Mobility: up to 500 km/h for rural eMBB, less for denser areas.
• Density: up to 1,000,000 devices per square kilometer in the mMTC scenario.

Note how each requirement applies in specific scenarios and is not necessary in others. For example, the peak data rate is unnecessary for scenarios covered by URLLC or mMTC. Meanwhile, the extreme latency requirement of 1 ms only applies to URLLC.

An RBS must manage these diverse requirements and, therefore, becomes a mixed-criticality system. For example, tasks within the URLLC scenario must be prioritized over eMBB tasks to uphold the URLLC latency requirements. Not only do we have a range of priorities, but these priorities may also change as usage changes. Adapting to ongoing changes in network usage is, therefore, a critical aspect of implementing 5G.

2.1. System Architecture

Typical RBS systems are divided into three hardware units:

1. The Remote Radio Unit (RRU). It is immediately connected to the antennas and handles the initial input stream from the antennas. The antenna streams are initially processed in this unit and grouped into user streams (e.g., 8 antenna streams are compressed into one group) to be sent to the next unit.
2. The Baseband Unit (BBU). It takes the input streams from the RRU and further processes them. The RRU and BBU units together constitute the physical layer of the OSI model (layer 1), handling the physical aspects of transmitting and receiving wireless 5G signals [7].
3. The Layer 2 unit, which handles the data link layer of the OSI model (layer 2). This includes Medium Access Control (MAC) and Radio Link Control (RLC) tasks.

The varying characteristics of the workloads of the different units result in different hardware designs. While both the BBU and layer 2 must handle high- and low-criticality tasks, they do so in different ways. This research aims to explore a merged system to handle the BBU and layer 2 tasks in one hardware system. The new system is to be centered around the design of a BBU but explore technologies that allow layer 2 tasks to run efficiently.

2.2. Baseband Unit

The BBU system handles physical layer tasks centered around signal processing of incoming and outgoing transmissions. Its design ensures maximum predictability at the expense of resource utilization efficiency. Figure 1 provides an overview of the system. It is not meant to be representative of any specific system but to give an idea of the components often present and their interactions.

2.2.1. Hardware

We focus on systems centered around a clustered and heterogeneous design. Each cluster contains a set of processors or accelerators (for illustration, we show four in Figure 1). First, the general computing capability is provided by digital signal processor (DSP) cores with high predictability [8]. Each DSP has a private instruction and data cache and shares a single scratchpad memory with the other processors in the cluster.

The other clusters contain acceleration cores for specific and common workloads. The accelerators in each cluster also share a scratchpad. The exact architecture of the accelerators is out of the scope of this paper.

The clusters may also share scratchpads; two are shown as an example. These split scratchpads handle different data with specific access characteristics. For example, some configuration data might be mostly read and changed rarely, while user-specific data may be updated continuously.

Lastly, a hardware scheduler can be present to orchestrate task execution on the relevant cores and the movement of data. We have omitted any other application-specific devices or connections to peripherals.

2.3. Data Processing

Data processing starts once every millisecond. While the RRU is processing the antenna streams, the BBU starts with a set of configuration tasks that prepare for the delivery of data from the RRU. These configuration tasks must run on the DSP cores to, e.g., configure the accelerators before they start executing. This could result in configuration data initially going to one of the cluster-shared scratchpads, from where it is moved to the cluster scratchpads as needed. This data starts in the shared scratchpad of the core running the job and is off-loaded to the cluster-shared scratchpad when the configuration job is done. In parallel with the configuration tasks, the data from the RRU is being loaded into the cluster-shared scratchpads. When that is ready, the proper processing tasks can begin executing on DSPs or accelerators as needed.

We consider only strict data access characteristics for the tasks. All shared data is read-only. User-specific data is segmented into the relevant tasks and updated only by the task currently being worked on. At no point are two tasks working on the same user data. These strict data access characteristics mean that synchronization and coherence are not issues we will consider.

2.3.1. Phased Execution

The use of scratchpads in the BBU reduces the variability in execution times. However, this requires methodical orchestration to ensure each job has the needed data. As such, every job is divided into three phases:

1. Read: Any data a task requires is moved onto its cluster's scratchpad from the cluster-shared scratchpads.
2. Execute: The task's job is executed to completion without needing to access memory other than the cluster's scratchpad.
3. Write: All the data previously fetched for the job that has been updated is written back to the main memory.

This is a classic implementation of phased execution [4, 9], also called the simple-task model [10]. The task scheduler ensures that a task's Execute is only scheduled on a processor when its corresponding Read has terminated on the same cluster. Data movement is performed using DMAs, allowing processors to execute other jobs' Execute phases in parallel with data movements.

A cluster's scratchpad is partitioned so that each running job has exclusive access to its memory portion. If two tasks use the same data, the Read of each will load that data into their respective partitions. This means data might be duplicated in the cluster scratchpads. However, such shared data is rarely written to, and synchronization is explicitly handled at the application level and is, therefore, not an issue.

2.4. Layer 2 Design

The common computing architectures for layer 2 are more traditional, with, e.g., superscalar cores and standard caching. The workload on the system requires less stringent predictability than the BBU, allowing for a more traditional design. The tasks also require higher performance, provided by the more complex design at the cost of predictability. To ensure high-criticality tasks meet their deadlines, the hardware resources can be partitioned by clusters and intentionally over-provisioned.

Layer 2, therefore, can have much wastage where high-criticality tasks are concerned. This unit's more complex design makes it challenging to ensure tasks meet their deadlines. The only possibility of ensuring the deadlines are met is to provide the tasks with such an overabundance of resources that even when low-criticality tasks interfere, the high-criticality tasks will not be adversely affected. Therefore, the inefficient use of resources in layer 2 is a supporting reason for merging the layer 2 subsystem with the BBU subsystem.

2.5. Challenges

We aim to research new methods for implementing 5G RBS technologies to achieve better performance at lower cost. Therefore, the current challenges of increased costs and lower performance must be alleviated in any future system.

Challenge 1: The primary challenge for the above-mentioned RBS systems is the divided hardware architecture. The physical division ensures that high-criticality tasks can meet their deadlines, but it increases costs and reduces overall performance. First, the separation necessitates manufacturing two physical systems, which is costly. Second, the separation means the two systems cannot share resources, reducing the efficient use of the available resources.

Challenge 2: On the BBU system specifically, there is also a challenge with the efficient use of resources. While using scratchpads ensures execution-time predictability for all tasks, it also forces data duplication. If two tasks use the same data, that data is moved into both tasks' scratchpad partitions. This is a waste of both scratchpad memory and memory bandwidth. This is especially prevalent with configuration data, which is often shared between many tasks and does not change often. The data loaded into the scratchpads is also loaded on a pessimistic basis. Some tasks may only need part of the data, meaning some data might be unnecessarily loaded into the scratchpads.

Challenge 3: Memory bandwidth is wasted when dependent tasks use the same data. The Write phase in the BBU system always runs after the Execute phase. A subsequent job using the same data must reload it in its Read phase. This is sub-optimal in cases where the subsequent task can run on the same cluster as the first task. In such a case, omitting the Write phase of the first task and the Read phase of the second task would be better.
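The three-phase job model and the redundant traffic highlighted in Challenge 3 can be illustrated with a small simulation. This sketch is not from the paper: the job names, data sets, and the word-granularity cost model (one bus word per address moved) are illustrative assumptions.

```python
# Minimal sketch of phased execution (Read -> Execute -> Write): all main-memory
# traffic happens in the Read and Write phases, never during Execute.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    read_set: set   # addresses the job needs in the scratchpad
    write_set: set  # addresses the job updates and writes back

@dataclass
class Cluster:
    scratchpad: set = field(default_factory=set)  # addresses currently held locally
    bus_words: int = 0                            # words moved over the memory bus

    def run(self, job):
        # Read phase: DMA every required address in from main memory.
        self.scratchpad = set(job.read_set)
        self.bus_words += len(job.read_set)
        # Execute phase: touches only the cluster scratchpad (no bus traffic).
        # Write phase: updated data is written back to main memory.
        self.bus_words += len(job.write_set)

j1 = Job("j1", read_set={0, 1, 2, 3}, write_set={2, 3})
j2 = Job("j2", read_set={2, 3, 4}, write_set={4})  # consumes j1's output

c = Cluster()
c.run(j1)
c.run(j2)
print(c.bus_words)  # 4+2 (j1) + 3+1 (j2) = 10 bus words
```

Under this model, words 2 and 3 are written back by j1 and immediately reloaded by j2 on the same cluster, costing four bus words. Omitting j1's Write and j2's Read for exactly this shared data is the saving Challenge 3 points to.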
3. The T-CREST Platform Accessing this data is also done without experiencing cache
misses. The compiler also manages the stack cache, setting
We propose to use the T-CREST platform as a basis for it up and tearing it down at function entry and exits and
research into future platforms for 5G RBS. This section de- using stack-targeting load and store instruction variants.
scribes the platform’s current capabilities and how they An analyzer can assume any stack-targeting instruction will
relate to the challenges present in divided RBS systems. hit in the stack cache. Therefore, the cache size must only
be modeled to account for the stack setup and tear-down
3.1. T-CREST and Patmos time [23]. Data accesses that are not function-local may still
go through the conventional data cache or circumvent all
The Patmos processor [11] is designed to serve real-time caching to target the main memory directly.
systems. Several Patmos cores are combined with a network- These two cache architectures are supported by the Platin
on-chip, a memory arbitration tree, and a memory controller WCET-analyzer [24]. Platin models instruction execution
to the time-predictable multi-core platform T-CREST [5]. As and tracks which blocks of code are likely to be in the
such, T-CREST provides techniques that make task execu- method cache at a given point. It accounts for this at control-
tion time more predictable and reduce the worst-case execu- flow point to know whether a method-cache miss is likely
tion time (WCET). Around the Patmos cores, it builds a plat- and how many bytes would have to be loaded. For the stack
form with time-predictable components to reduce WCET cache, it models the program stack’s size at any point and
analysis complexity and increase accuracy. T-CREST uses tracks stack-cache-control instructions added by the com-
networks-on-chips [12, 13, 14] that ensure data is moved be- piler. At points where the stack must grow, Platin knows
tween processing cores with a known maximum latency. For whether the cache has free space or needs to spill some of
accessing shared main memory, T-CREST uses the dedicated the program stack to main memory.
arbitration tree-based network-on-chip [15]. Regardless of
how many cores are accessing the memory, each access will
be serviced within a bounded latency. 3.3. Missing Capabilities
Patmos uses an in-order pipeline to ensure every instruc- The T-CREST platform is missing some features and capa-
tion has a known and constant execution time. To exploit bilities compared to the BBU system. We will enumerate
instruction-level parallelism predictably, Patmos is also a these missing capabilities and highlight how we might ei-
very long instruction-word (VLIW) architecture with a dual- ther simulate them using existing capabilities or discuss how
issue pipeline. VLIW architectures are a predictable way to implement them into the platform as part of the research
of increasing performance without increasing complexity project.
[16, 17]. Patmos executes instructions in bundles of up to
two instructions. The compiler must designate instructions 3.3.1. Acceleration and Clustering
as part of a bundle by setting a specific bit in the first in-
struction. All Patmos instructions are predicated: Based on The specific processing requirements of an RBS means dedi-
one of eight predicate registers, each instruction is either cated accelerators can be used for maximum efficiency. The
enabled or disabled. If the predicate register’s value is true, T-CREST platform does not include anything resembling
the instruction is enabled, meaning it executes normally. If these accelerators. Likewise, the T-CREST platform does
the value is false, the instruction is disabled and does not not use any clustering, whose benefit is mainly driven by a
affect registers or memory. It effectively becomes a noop. multi-layered intermediate memory, which we will discuss
However, the execution time of disabled instructions is the in the next section.
same as when enabled. Predicated instructions allow the As this research mainly focuses on the efficient use of
compiler to minimize execution time variability or even resources, notably memory, we will not investigate or im-
eliminate it entirely [18]. plement any hardware acceleration. Instead, we will use
the Patmos cores as substitutes for specific accelerators. We
will implement clustering into the T-CREST platform so
3.2. Predictable Caching
that each cluster can be designated to be allowed to execute
While caching is usually associated with unpredictability specific tasks. This will allow us to treat one cluster as a
and difficulties for static analysis, T-CREST deploys two pre- substitute for a BBU DSP cluster and others for different
dictable and easily analyzable caches. The first is a method types of acceleration clusters.
cache [19] that replaces a traditional instruction cache in
Patmos [20]. the method cache caches whole or parts of func- 3.3.2. Hierarchical Memory
tions (sub-functions) such that instruction fetching never
misses except at specific points. The compiler manages this The Patmos cores of T-CREST are each paired with private
cache by splitting the code into blocks that fit in the method caches, as described earlier. However, no further hierarchy
cache and inserting cache-fill instructions where needed. of intermediate memory exists. In contrast, the BBU system
For the Patmos ISA function call and return instructions contains three levels of intermediate storage: First, each
ensure that the callee or the caller are in the method cache. DSP (or accelerator) has its caches. Second, each cluster has
To support sub-function caching Patmos has cache filling a shared scratchpad. lastly, cluster-shared scratchpads are
variants of branch instructions. Using a method cache limits present for a last level of storing various types of data.
the number of places cache misses can occur to the specific A multi-layered memory hierarchy is necessary for the ex-
cache-filling instructions. The method cache is simpler to periments to be representative, especially given the unique
model for an analyzer to provide tight WCET bounds [21]. data access characteristics. Therefore, we will build a sec-
The second unique cache of the T-CREST is the stack ond layer of intermediate memory, which is shared between
cache [22]. It caches function-local data, often accessed pre- the Patmos cores of each cluster. We will omit a last mem-
dictably, and can be loaded at function entry and exit points. ory layer, as any methods of managing the second layers
Scheduler
Cluster 0 Cluster 1 Cluster 2
Core 0 Core 1 Core 2 Core 3 Core 4 Core 5
M$ S$ D$ M$ S$ D$ M$ S$ D$ M$ S$ D$ M$ S$ D$ M$ S$ D$
Shared Cache Shared Cache Shared Cache
Memory Controller Off-Chip Main Memory SRAM
Figure 2: Proposed T-CREST system for researching novel cache architectures.
we develop can be transferred to the rest of the layers of a 4. Cache Proposals
real-world system.
To start addressing the challenge of merging layer 1 and
3.3.3. Hardware-Assisted Scheduling layer 2 systems, we focus on the challenge of using a shared
cache in each cluster. As described earlier, the BBU archi-
The BBU systems often use hardware to accelerate schedul- tecture sacrifices the efficient use of resources to ensure low
ing. T-CREST does not implement any hardware that can variability in execution times. We aim to maximize resource
assist with scheduling. While using a hardware scheduler in usage in the proposed system while maintaining low vari-
the BBU system ensures that the extreme amount of tasks ability. We propose exploring three caching solutions that
gets scheduled in a reasonable time, the smaller scale of address the challenges of predictable caching: (1) a critical-
this project’s prototypes can likely handled by software- ity timeout cache, (2) a contention tracking cache, and (3) a
managed scheduling. unified method/stack cache.
Therefore, the initial proposed system will not have any
scheduling hardware, but dedicated Patmos cores will re-
4.1. Criticality Timeout Cache
place it to handle the scheduling. Software-defined schedul-
ing can be a flexible way to test our scheduling strate- In cases where strict predictability is unnecessary but flex-
gies as the system matures. Moving to a hardware sched- ibility and utilization efficiency are essential, we propose
uler should be easily doable at later stages of research, a cache using a partitioning approach based on cache line
where the scheduling has been studied and techniques cho- timeouts. For that cache, we need an n-way set associa-
sen. Patmos already supports adding custom devices and tive cache configuration. We can configure the cache at the
accelerators[25]. A hardware scheduler is a device that inter- granularity of cache ways. Each cache way can be assigned
acts with the rest of the clusters, memories, and processors either a criticality or a task/core ID (we will use criticality
and issues commands in the same manner a Patmos core moving forward).
would. In this proposal, each cache way can be assigned either
to high or low criticality. Cache lines can be used by high-
3.4. Proposed System Architecture or low-criticality tasks. However, naturally high-criticality
tasks are preferred. A low-criticality task cannot evict a
Figure 2 shows a diagram of our proposed system. It com- high-criticality cache line. Therefore, to avoid starvation of
prises three clusters, each with a set of Patmos cores with low-criticality tasks, at least one way must not be assigned
private split caches (Method, Stack, and Data) and a shared for the high-criticality tasks.
cluster cache. The cores use the T-CREST memory tree to When an access of the high criticality arrives, a cache line
access the shared cache, providing us with predictable and in one high-criticality way is tagged as being occupied by
low-latency access. The clusters use the T-CREST memory that criticality, and an associated timeout begins. As long
tree to connect to the memory controller, which manages as the timeout is not reached, accesses of low-criticality
access to the off-chip, main memory. A shared bus (in gray tasks cannot evict the cache line. If there is no access to
above the clusters) facilitates cross-cluster and cross-core the line before the timeout is reached, the line criticality
communication. This allows a Patmos core or a hardware is downgraded, allowing low-criticality jobs to evict the
device scheduler to issue scheduling commands to the whole line. The cache can either be configured right before each
system. job starts executing, or the criticalities can be configured
This system architecture will allow research on efficiently ahead of time to match the tasks that will run on the cluster.
managing the cluster caches. The different clusters can With timeouts, there is no need to explicitly release any
simulate the DSP or accelerator clusters on the BBU system, data, as the timeout mechanism will do so automatically.
while the cluster-shared scratchpads of that system do not Configuring the cache is done by setting the criticality of a
introduce new challenges. Therefore, limiting ourselves cache way. When a way is configured with a criticality, all
to the two levels of cache (private and cluster caches) will its cache lines will prefer accesses from that criticality, as
allow for fruitful experimentation during the research. described above.
A significant drawback of this approach is its unpre-
dictability. Because timeouts might cause a cache line to be
evicted even when it might be used in the future, it can be
difficult for a WCET analysis tool to track which cache lines contention event will be blocked or mitigated. For example,
have reached the deadline and which have not. The effect say 𝐽1 is high criticality, and 𝐽2 is not. As long as 𝐽1 has not
of the timeouts on WCET bounds can be challenging to esti- reached its contention limit, the cache treats accesses from
mate and would require dedicated analysis. However, it can both jobs equally. When the limit is reached, contention
also be omitted, as this cache architecture is better suited events are mitigated between 𝐽1 and 𝐽2 . In the case of the
for measurement-based WCET estimation. With detailed first event type, accesses from 𝐽2 that would cause an evic-
testing and measurements, getting a sufficiently safe WCET tion of 𝐽1 ’s cache lines would be rejected by the cache. The
bound should be feasible. access must then be rerouted directly to the main memory,
This cache architecture is designed for high utilization which the system must have support for. In the second event
and low scheduling complexity. Because it reserves each type, if the default replacement policy would have 𝐽1 evict
cache line, only the necessary subset of a cache way is its own cache line in the set, it would instead evict a cache
reserved at a given time. Cache lines that either timed out or line from 𝐽2 .
were not used by the job are free to be used by low-criticality Setting the contention limit is the responsibility of the
tasks, increasing the utilization of the cache. In this proposal, job scheduler. Through traditional static WCET analysis
we also do not pre-load data into the cache. This means with the assumption of private caches, jobs get their WCET
only data that is used will be loaded. Therefore, we avoid bound. Any excess time between the bound and the task
both bandwidth wastage and cache space wastage when deadline is therefore open to contention. Before the sched-
loading data that is not used. When a job stops executing, uler starts a job, it sets the contention limit, ensuring the
its associated cache lines will eventually time out and release WCET of the job, with contention, still meets the deadline.
their contents automatically. The scheduler, therefore, does The contention limit can be static, and it can be calculated
not need to manage the phased execution of jobs, reducing as part of schedulability analysis. It can also be dynamic,
the pressure on the scheduler. so the scheduler changes it for the runtime condition. If
4.2. Contention Tracking Cache

In this proposal, a combination of contention tracking in the cache and contention-aware task scheduling will allow for maximal cache utilization through dynamic partitioning, with high predictability through cache contention tracking and mitigation.

In a multicore system without shared caches, the execution time of a job is affected by the cache behavior without that behavior being affected by other jobs. Through cache analysis, we can bound the execution time attributable to the cache. This is done by estimating the number of cache misses that will occur. When the cache is shared, this analysis is no longer possible, as the interference of other jobs will cause additional cache misses in a manner that cannot be estimated. In this proposal, we want to let the task scheduler limit the contention that a job is allowed to experience, such that it is guaranteed to meet its deadline.

We give two example types of contention: (1) A job 𝐽1 experiences a contention event if a cache line 𝐶1 it populated with data 𝐷1 is evicted by an access by another job 𝐽2. This is because 𝐽1 will experience a cache miss on the next access to 𝐷1 that it would not have experienced if 𝐽2 had not interfered. (2) 𝐽1 also experiences a contention event if a cache miss when accessing 𝐷1 results in the eviction of a cache line that 𝐽1 also populated in the same cache set (with data 𝐷2). This event is a contention with any other job that has at least one populated cache line in the same set. Without the other jobs, 𝐽1 would have populated an empty cache line instead of evicting one of its other populated lines. The evicted line will cause a cache miss in the future when 𝐽1 needs to access 𝐷2 again.

We only consider contention between different jobs. Self-contention also happens in private caches and is, therefore, already managed by the cache analysis for the private cache.
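The two event types can be made concrete with a small model of a single cache set using oldest-first replacement; the structure and names are illustrative only:

```python
# Model of the two contention-event types on a single cache set with
# oldest-first replacement. Names and structure are illustrative only.

class CacheSet:
    def __init__(self, ways: int):
        self.ways = ways
        self.lines = []               # (job, data) pairs, oldest first

    def insert(self, job: int, data: str, events: dict) -> None:
        if len(self.lines) == self.ways:
            victim_job, _ = self.lines.pop(0)
            if victim_job != job:
                # Type (1): this access evicts another job's line; the
                # event is charged to the evicted (victim) job.
                events[victim_job] = events.get(victim_job, 0) + 1
            elif any(j != job for j, _ in self.lines):
                # Type (2): the job evicts its own line, but only because
                # other jobs also occupy lines in this set.
                events[job] = events.get(job, 0) + 1
        self.lines.append((job, data))

# Type (1): J2's access evicts J1's line.
ev1 = {}
s1 = CacheSet(ways=2)
s1.insert(1, "D1", ev1)
s1.insert(2, "X", ev1)
s1.insert(2, "Y", ev1)    # evicts J1's line: event charged to J1

# Type (2): J1 evicts its own line only because J2 occupies the set.
ev2 = {}
s2 = CacheSet(ways=2)
s2.insert(1, "D1", ev2)
s2.insert(2, "X", ev2)
s2.insert(1, "D2", ev2)   # J1 misses and evicts its own D1: charged to J1
```

Note that pure self-contention, where only the job's own lines occupy the set, is deliberately not counted, matching the proposal.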
We limit the maximum allowed contention as defined above to ensure that a job meets its deadline without interference from other jobs. The scheduler will configure the cache with a maximum allowed contention. The cache controller will track contention by checking and counting the above contention events for each job. When a job reaches its contention limit, any cache access that would cause a contention event is mitigated. In the first event type, accesses from 𝐽2 that would cause an eviction of 𝐽1's cache lines would be rejected by the cache. The access must then be rerouted directly to the main memory, which the system must have support for. In the second event type, if the default replacement policy would have 𝐽1 evict its own cache line in the set, it would instead evict a cache line from 𝐽2.

Setting the contention limit is the responsibility of the job scheduler. Through traditional static WCET analysis with the assumption of private caches, jobs get their WCET bound. Any excess time between the bound and the task deadline is therefore open to contention. Before the scheduler starts a job, it sets the contention limit, ensuring the WCET of the job, with contention, still meets the deadline. The contention limit can be static, calculated as part of the schedulability analysis. It can also be dynamic, with the scheduler adjusting it to the runtime conditions. If the task was started early, the contention limit is increased to match the slack time available. If the task was started late, the limit is reduced or set to zero to ensure that the deadline is still met.

This proposal's major strength is that it disconnects the analysis of tasks with differing criticalities. Because of the contention limit, high-criticality tasks will never be adversely affected by low-criticality tasks. Therefore, we just need to ensure that all high-criticality tasks meet their deadlines with other methods.¹ It also does not statically partition or lock the cache. At worst, when a contention limit is reached, the cache will be dynamically partitioned automatically, simply by prioritizing the jobs that have reached the limit. This maximizes cache utilization. It also allows for maximizing the performance of low-criticality tasks as long as they do not adversely affect any high-criticality tasks.

This proposal does increase the complexity of the cache controller, which needs to track contention events and mitigate them for jobs that have reached their contention limit. Each cache line needs to be associated with a job (or core), each job needs a contention counter, and logic needs to ensure the correct mitigation at the contention limit. The proposal also increases scheduler complexity. This complexity can initially be lowered by using statically determined contention limits. However, further work should explore dynamically determined limits, which would increase the workload on the scheduler.

¹ For example, we could use partitioning between high-criticality tasks only.
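The slack-based limit setting described above amounts to a small computation; the following sketch assumes a fixed cache-miss penalty, and all values and names are illustrative:

```python
# Sketch: derive a contention limit from the slack between a job's
# private-cache WCET bound and its deadline. Values are illustrative.

def contention_limit(wcet_private: int, deadline: int, start: int,
                     miss_penalty: int) -> int:
    """Number of extra cache misses the job can absorb and still finish
    by its deadline, given its actual start time."""
    slack = deadline - start - wcet_private
    return max(0, slack // miss_penalty)

# WCET bound of 800 cycles, deadline at cycle 1000, 20-cycle miss penalty:
print(contention_limit(800, 1000, start=0, miss_penalty=20))    # 10 extra misses
print(contention_limit(800, 1000, start=150, miss_penalty=20))  # late start: 2
print(contention_limit(800, 1000, start=300, miss_penalty=20))  # too late: 0
```

The last case shows the dynamic policy described above: a job started too late is given a limit of zero, i.e., no contention is tolerated.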
4.3. Unified Method/Stack Cache

The Patmos processor on T-CREST uses the special method and stack caches. While these caches have been researched for their impact on predictability, and the Platin analyzer has analysis implementations for them, additional work is needed to integrate them into a shared L2 cache. Therefore, we propose investigating a shared L2 cache that integrates the features of both the method cache and the stack cache. It is meant to complement either a traditional L2 data cache or scratchpad, with extended research avenues for a fully integrated L2 cache that supports the method-, stack-, and data caches. This proposal can also complement either of the previous proposals.

The method and stack caches have particular access patterns to their data. The method cache accesses a block of code at a time, pre-loading a complete block at once. It also uses a first-in, first-out (FIFO) replacement policy to account for functions earlier in the call stack being less likely to be called again soon. On the other hand, the stack cache is not backed by main memory unless some data is spilled when the cache is full. This allows the L2 cache to store the spilled stack data first without sending it to the main memory. Access to this stored data would have the same characteristics as access to the stack cache. Additionally, when space is tight in the L2 cache, the replacement policy is the same as the stack cache's: spill the data furthest up the stack.
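A minimal model of these two replacement behaviors in a unified L2 might look as follows; the structure and all names are our own sketch, not a proposed design:

```python
# Sketch of the unified L2's two replacement behaviors: method blocks
# are replaced in FIFO order, while spilled stack data leaves the L2
# "furthest up the stack" first (written back to main memory).
# Class, method, and field names are our own illustration.

from collections import deque

class UnifiedL2:
    def __init__(self, capacity: int):
        self.capacity = capacity       # total lines
        self.methods = deque()         # FIFO of (method_id, size) blocks
        self.stack = []                # spilled stack lines, innermost last

    def used(self) -> int:
        return sum(size for _, size in self.methods) + len(self.stack)

    def load_method(self, method_id: int, size: int) -> None:
        # Pre-load a whole method block, evicting the oldest blocks first.
        while self.used() + size > self.capacity and self.methods:
            self.methods.popleft()
        self.methods.append((method_id, size))

    def spill_stack(self, line: int) -> None:
        # Stack spills land in the L2 first; when full, the line furthest
        # up the stack moves on to main memory (or, lacking spilled stack
        # data, the oldest method block is evicted in this sketch).
        while self.used() + 1 > self.capacity:
            if self.stack:
                self.stack.pop(0)      # write-back to main memory
            else:
                self.methods.popleft()
        self.stack.append(line)
```

Which side yields space when the cache is full is exactly the partitioning policy question discussed next; the sketch arbitrarily lets stack spills displace method blocks.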
An open question is how to partition the cache between the method and stack data. Since both have a replacement policy that depends on reaching the space limit, a policy is needed for deciding how much of the cache should be meant for the methods, and how much should be used for the stack. We should also investigate if this division can be dynamically configured such that if the stack is not expected to use much space, then most of the L2 cache should be saved for the methods, and vice versa. A different approach could be to say that the stack gets priority up to a point. When the stack needs to store more data, methods are evicted to make room up to a point (e.g., half the L2 cache size). Any space not used by the stack cache can store methods. This can also be done in reverse, where the method data gets priority.
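The priority-up-to-a-point division could be sketched as an admission check; the half-capacity split and all names are assumptions for illustration:

```python
# Sketch of the "stack gets priority up to a point" division: stack
# spills may evict method data only until the stack occupies half the
# L2 capacity; beyond that, stack data bypasses the L2 to main memory.
# All constants and names are illustrative.

L2_CAPACITY = 1024          # lines
STACK_LIMIT = L2_CAPACITY // 2

def admit_stack_spill(stack_used: int, method_used: int) -> str:
    """Decide where a new stack-spill line goes."""
    if stack_used + method_used < L2_CAPACITY:
        return "free line"            # space available, no eviction
    if stack_used < STACK_LIMIT:
        return "evict method line"    # stack still under its priority limit
    return "to main memory"           # stack over limit: bypass the L2

print(admit_stack_spill(stack_used=100, method_used=200))   # free line
print(admit_stack_spill(stack_used=100, method_used=924))   # evict method line
print(admit_stack_spill(stack_used=512, method_used=512))   # to main memory
```

The reverse policy mentioned above would simply swap the roles of the stack and method counters in the limit check.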
An open question that would need answering following the above initial research is how to implement a unified method/stack cache that is also shared between cores. Since each core has a distinct stack, and is also likely to use different functions, we need to explore ways for a single cache to effectively manage multiple stacks and call trees.

Table 1
Comparison between features of the three cache proposals.

                      Priority    Contention    Unified
                      Timeout     Tracking      Method/Stack
  All Data               ✓            ✓              ✗
  Shared                 ✓            ✓              ✗
  Mixed-Criticality      ✓            ✓              ✗
  Analyzable             ✗            ✓*             ✓
  Needs Scheduling       ✗*           ✓*             ✗
  Guaranteed             ✗            ✓*             ✓
4.4. Discussion

The three caching proposals (the Priority Timeout Cache, the Contention Tracking Cache, and the Unified Method/Stack Cache) each address the challenge of predictable caching in different ways. Table 1 compares the various features of our proposals. The first big difference is between the Unified Method/Stack Cache and the two other caches. The Priority Timeout and Contention Tracking caches both support all program data, whereas the Unified Method/Stack Cache only supports instruction data (methods) and stack data. Even more specifically, the stack cache does not support all stack data, only data that does not need an address, as the stack cache is not backed by main memory. Any data whose address is taken in the program cannot be put in the stack cache, going instead to the shadow stack, which is backed by main memory. Another big difference between the Unified Method/Stack Cache and the others is that the proposal does not share the cache between multiple cores, which also means it does not alleviate any challenges for mixed-criticality systems.

Analyzability differs between all the cache proposals. The Priority Timeout Cache does not support analyzability very well, as it is difficult for analyzers to track when cache lines have timed out. The Contention Tracking Cache is analyzable, but only in the sense that it simplifies mixed-criticality analysis by disallowing interference between tasks of different criticalities. For tasks of the same criticality, the cache does not provide any assistance, but neither does it complicate the analysis. The Unified Method/Stack Cache is the most analyzable. Analyzers can reuse the analysis done for the separate method and stack caches and likely reuse it for the unified one with different configurations and minor customization.

The proposals also differ in how much support is needed from the job scheduler at runtime. The Priority Timeout Cache can be implemented without scheduler support if the way-based partitioning is configured ahead of time. If the partitioning is done dynamically, it would be the scheduler's responsibility. The Contention Tracking Cache needs support from the scheduler to ensure the amount of allowed contention is within the correct limit. The scheduler needs to account for when a high-criticality job is started so that an appropriate contention limit is chosen. A static approach can also be used, where the contention limit is chosen ahead of time. However, that does not provide much benefit compared to traditional partitioning. The Unified Method/Stack Cache needs no scheduling support at all. The only thing that might be configurable is how much of the cache is prioritized for methods or for the stack. However, this could better be done by the program itself, e.g., through compiler management of the cache.

Lastly, each cache gives different guarantees on its behavior. The Priority Timeout Cache provides priority guarantees only for a specific time. If the timeout is not managed such that it does not run out, programs cannot be guaranteed that a specific amount of the cache is reserved for them. While giving no guarantees on partitioning, the Contention Tracking Cache guarantees how much contention can affect a job. However, this covers only contention from lower-criticality jobs and thus makes no guarantees about contention from similar-criticality tasks. The Unified Method/Stack Cache, in contrast, is predictable and guarantees behavior similar to that of the split caches.

5. Related Work

Shared caches are a significant challenge for predictability due to their inherent nature of allowing multiple cores to access the same cache [26]. This can lead to contention and unpredictable performance. However, several solutions have been proposed to address this issue, including cache partitioning and locking [27].

Partitioning is a technique that divides the shared cache into several partitions, each dedicated to a specific core [28]. This approach can significantly improve predictability by reducing contention [29].
Way-based partitioning involves dividing the cache ways among different cores. Each core is assigned a specific number of ways in the cache, ensuring exclusive access to those ways. This method can effectively isolate the cache activities of different cores, improving predictability. On the other hand, index-based partitioning involves dividing the cache sets among different cores. Each core is assigned specific sets in the cache, ensuring exclusive access. This method is more flexible than way-based partitioning because the number of sets is usually large, allowing for finer-grained partitioning. However, a given set maps to specific address ranges. Therefore, this method requires more detailed memory management. Page coloring is often used to partition the cache [30]. The address space is divided into colors associated with the cache sets. Assigning colors to tasks/cores provides the partitioning, assuming an assignment that provides the correct memory for each task/core is found. The cache hardware can also support index-based partitioning for various benefits [31, 32]. However, some form of software management will always be needed.
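As a concrete illustration of page coloring, the color of a physical page is given by the page-number bits that overlap the cache set index; the cache geometry below is assumed for the example:

```python
# Sketch of page coloring: the color of a physical page is the part of
# the page number that selects the cache set, so pages of one color map
# to a fixed group of sets. Sizes here are illustrative.

PAGE_SIZE = 4096        # bytes
LINE_SIZE = 64          # bytes
NUM_SETS = 1024         # cache sets (e.g., a 1 MiB, 16-way cache)
SETS_PER_PAGE = PAGE_SIZE // LINE_SIZE          # 64 sets touched by one page
NUM_COLORS = NUM_SETS // SETS_PER_PAGE          # 16 colors

def color_of(phys_addr: int) -> int:
    """Color = page-number bits that fall inside the set index."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

# Pages 0 and 16 share a color and thus compete for the same cache
# sets; pages 0 and 1 do not.
print(color_of(0 * PAGE_SIZE), color_of(16 * PAGE_SIZE), color_of(1 * PAGE_SIZE))
```

Giving each task or core pages of disjoint colors then partitions the sets in software, which is exactly the memory-management burden noted above.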
Cache locking is another technique used to improve predictability in shared caches [33]. With locking, specific cache lines can be locked to prevent them from being evicted, ensuring they are always available for the necessary cores. This can significantly reduce cache misses and improve predictability. However, locking can be costly. Lock management involves tracking the locked cache lines, increasing hardware complexity. Adding locking to a cache can reduce its capacity or speed, depending on how fine-grained the locking is. Locking also reduces cache utilization, as any unused locked content cannot be evicted to free up the cache lines for needed data.

T-CREST has enabled much research within various aspects of real-time systems [5]. Because all of T-CREST's components are predictable, it is possible to implement constant execution-time code based on the single-path paradigm [34, 18]. Single-path code has an inherently high overhead, necessitating optimizations to reduce the executed code [35], make the best use of Patmos' dual-issue pipeline [36, 17], and use custom register allocation techniques [37]. The combination of T-CREST and single-path code has been shown to be competitive with off-the-shelf ARM processors for a real-time application [38]. Research is also ongoing to port the Lingua Franca coordination language to T-CREST to enable the creation of complete real-time systems within one framework [39, 40].

6. Conclusion

The increasing importance of 5G technologies necessitates continuous research and development into the hardware systems implementing the technology. The diverse requirement specifications of this new technology call for a system with varying degrees of strictness and performance. Existing systems were designed with the minimal 5G guarantees in mind, ensuring the hard requirements, e.g., low latency, were met before softer requirements like throughput. This focus resulted in a divided physical system to achieve the goals.

To increase future systems' performance while maintaining the older systems' guarantees, this paper sets the research direction towards a mixed-criticality 5G RBS with merged BBU and layer-2 systems. The system should be able to execute high-criticality tasks, like those required by the URLLC 5G scenario, and low-criticality QoS tasks, like those for eMBB, in one SoC. By analyzing the 5G requirement specifications and the common system architecture, we propose using the T-CREST platform as the research platform for future mixed-criticality systems. We propose a specific system architecture that best leverages the existing system architecture's strengths and increases its performance through shared caches. We propose three specific research directions within shared L2 caches for clustered systems. The various proposals have distinct strengths and weaknesses that will be further explored in future work.

Acknowledgment

This work is partially supported by the CERCIRAS (Connecting Education and Research Communities for an Innovative Resource Aware Society) COST Action no. CA19135 funded by COST (European Cooperation in Science and Technology).

References

[1] International Telecommunication Union - Radiocommunication Sector, IMT Vision - Framework and overall objectives of the future development of IMT for 2020 and beyond, Technical Report M.2083-0, International Telecommunication Union, 2015.
[2] A. Burns, R. I. Davis, Mixed criticality systems: a review (February 2022), 2022.
[3] ISO/IEC 7498-1:1994(E), Information technology - Open Systems Interconnection - Basic Reference Model: The Basic Model, Technical Report 7498-1:1994, International Organization for Standardization, 1996.
[4] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, R. Kegley, A predictable execution model for COTS-based embedded systems, in: 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium, IEEE, 2011, pp. 269–279.
[5] M. Schoeberl, S. Abbaspour, B. Akesson, N. Audsley, R. Capasso, J. Garside, K. Goossens, S. Goossens, S. Hansen, R. Heckmann, S. Hepp, B. Huber, A. Jordan, E. Kasapaki, J. Knoop, Y. Li, D. Prokesch, W. Puffitsch, P. Puschner, A. Rocha, C. Silva, J. Sparsø, A. Tocchi, T-CREST: Time-predictable multi-core architecture for embedded systems, Journal of Systems Architecture 61 (2015) 449–471. doi:10.1016/j.sysarc.2015.04.002.
[6] International Telecommunication Union - Radiocommunication Sector, Minimum requirements related to technical performance for IMT-2020 radio interface(s), Technical Report M.2410-0, International Telecommunication Union, 2017.
[7] Z. Kong, J. Gong, C.-Z. Xu, K. Wang, J. Rao, eBase: A baseband unit cluster testbed to improve energy-efficiency for cloud radio access network, in: 2013 IEEE International Conference on Communications (ICC), IEEE, 2013, pp. 4222–4227.
[8] E. Tell, A. Nilsson, D. Liu, A programmable DSP core for baseband processing, in: The 3rd International IEEE-NEWCAS Conference, IEEE, 2005, pp. 403–406.
[9] J. Arora, C. Maia, S. A. Rashid, G. Nelissen, E. Tovar, Schedulability analysis for 3-phase tasks with partitioned fixed-priority scheduling, Journal of Systems Architecture 131 (2022) 102706.
[10] H. Kopetz, Real-Time Systems, Kluwer Academic, Boston, MA, USA, 1997.
[11] M. Schoeberl, W. Puffitsch, S. Hepp, B. Huber, D. Prokesch, Patmos: A time-predictable microprocessor, Real-Time Systems 54(2) (2018) 389–423. doi:10.1007/s11241-018-9300-4.
[12] M. Schoeberl, F. Brandner, J. Sparsø, E. Kasapaki, A statically scheduled time-division-multiplexed network-on-chip for real-time systems, in: Proceedings of the 6th International Symposium on Networks-on-Chip (NOCS), IEEE, Lyngby, Denmark, 2012, pp. 152–160. doi:10.1109/NOCS.2012.25.
[13] E. Kasapaki, M. Schoeberl, R. B. Sørensen, C. T. Müller, K. Goossens, J. Sparsø, Argo: A real-time network-on-chip architecture with an efficient GALS implementation, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24 (2016) 479–492. doi:10.1109/TVLSI.2015.2405614.
[14] M. Schoeberl, Exploration of network interface architectures for a real-time network-on-chip, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), IEEE, 2024. doi:10.1109/ISORC61049.2024.10551364.
[15] M. Schoeberl, D. V. Chong, W. Puffitsch, J. Sparsø, A time-predictable memory network-on-chip, in: Proceedings of the 14th International Workshop on Worst-Case Execution Time Analysis (WCET 2014), Madrid, Spain, 2014, pp. 53–62. doi:10.4230/OASIcs.WCET.2014.53.
[16] J. Yan, W. Zhang, A time-predictable VLIW processor and its compiler support, Real-Time Syst. 38 (2008) 67–84. doi:10.1007/s11241-007-9030-5.
[17] E. J. Maroun, M. Schoeberl, P. Puschner, Predictable and optimized single-path code for predicated processors, Journal of Systems Architecture (2024) 103214.
[18] E. J. Maroun, M. Schoeberl, P. Puschner, Compiler-directed constant execution time on flat memory systems, in: 2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC), 2023, pp. 64–75. doi:10.1109/ISORC58943.2023.00019.
[19] M. Schoeberl, A time predictable instruction cache for a Java processor, in: On the Move to Meaningful Internet Systems 2004: Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES 2004), volume 3292 of LNCS, Springer, Agia Napa, Cyprus, 2004, pp. 371–382. doi:10.1007/b102133.
[20] P. Degasperi, S. Hepp, W. Puffitsch, M. Schoeberl, A method cache for Patmos, in: Proceedings of the 17th IEEE Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC 2014), IEEE, Reno, Nevada, USA, 2014, pp. 100–108. doi:10.1109/ISORC.2014.47.
[21] B. Huber, S. Hepp, M. Schoeberl, Scope-based method cache analysis, in: Proceedings of the 14th International Workshop on Worst-Case Execution Time Analysis (WCET 2014), Madrid, Spain, 2014, pp. 73–82. doi:10.4230/OASIcs.WCET.2014.73.
[22] S. Abbaspour, F. Brandner, M. Schoeberl, A time-predictable stack cache, in: Proceedings of the 9th Workshop on Software Technologies for Embedded and Ubiquitous Systems, 2013.
[23] A. Jordan, F. Brandner, M. Schoeberl, Static analysis of worst-case stack cache behavior, in: Proceedings of the 21st International Conference on Real-Time Networks and Systems (RTNS 2013), ACM, New York, NY, USA, 2013, pp. 55–64. doi:10.1145/2516821.2516828.
[24] E. J. Maroun, E. Dengler, C. Dietrich, S. Hepp, H. Herzog, B. Huber, J. Knoop, D. Wiltsche-Prokesch, P. Puschner, P. Raffeck, et al., The platin multi-target worst-case analysis tool, in: 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024.
[25] C. Pircher, A. Baranyai, C. Lehr, M. Schoeberl, Accelerator interface for Patmos, in: 2021 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC), 2021.
[26] B. C. Ward, J. L. Herman, C. J. Kenna, J. H. Anderson, Making shared caches more predictable on multicore platforms, in: 2013 25th Euromicro Conference on Real-Time Systems, IEEE, 2013, pp. 157–167.
[27] G. Gracioli, A. Alhammad, R. Mancuso, A. A. Fröhlich, R. Pellizzoni, A survey on cache management mechanisms for real-time embedded systems, ACM Computing Surveys (CSUR) 48 (2015) 1–36.
[28] S. Mittal, A survey of techniques for cache partitioning in multicore processors, ACM Computing Surveys (CSUR) 50 (2017) 1–39.
[29] X. Vera, B. Lisper, J. Xue, Data caches in multitasking hard real-time systems, in: RTSS 2003. 24th IEEE Real-Time Systems Symposium, IEEE, 2003, pp. 154–165.
[30] T. Lugo, S. Lozano, J. Fernández, J. Carretero, A survey of techniques for reducing interference in real-time applications on multicore platforms, IEEE Access 10 (2022) 21853–21882.
[31] A. Chousein, R. N. Mahapatra, Fully associative cache partitioning with don't care bits for real-time applications, ACM SIGBED Review 2 (2005) 35–38.
[32] M. Lee, S. Kim, Time-sensitivity-aware shared cache architecture for multi-core embedded systems, The Journal of Supercomputing 75 (2019) 6746–6776.
[33] S. Mittal, A survey of techniques for cache locking, ACM Transactions on Design Automation of Electronic Systems (TODAES) 21 (2016) 1–24.
[34] P. Puschner, A. Burns, Writing temporally predictable code, in: Proceedings of the Seventh IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 2002), IEEE Computer Society, Washington, DC, USA, 2002, pp. 85–94. doi:10.1109/WORDS.2002.1000040.
[35] E. J. Maroun, M. Schoeberl, P. Puschner, Constant-Loop Dominators for Single-Path Code Optimization, in: P. Wägemann (Ed.), 21st International Workshop on Worst-Case Execution Time Analysis (WCET 2023), volume 114 of Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2023, pp. 7:1–7:13. URL: https://drops.dagstuhl.de/opus/volltexte/2023/18436. doi:10.4230/OASIcs.WCET.2023.7.
[36] E. J. Maroun, M. Schoeberl, P. Puschner, Compiling for time-predictability with dual-issue single-path code, Journal of Systems Architecture 118 (2021) 1–11.
[37] E. Maroun, M. Schoeberl, P. Puschner, Two-step register allocation for implementing single-path code, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), IEEE, 2024. doi:10.1109/ISORC61049.2024.10551362.
[38] M. Platzer, P. Puschner, A real-time application with fully predictable task timing, in: 2020 IEEE 23rd International Symposium on Real-Time Distributed Computing (ISORC), IEEE, 2020, pp. 43–46.
[39] E. Khodadad, L. Pezzarossa, M. Schoeberl, Towards Lingua Franca on the Patmos processor, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), 2024.
[40] M. Schoeberl, E. Khodadad, S. Lin, E. J. Maroun, L. Pezzarossa, E. A. Lee, Invited Paper: Worst-Case Execution Time Analysis of Lingua Franca Applications, in: T. Carle (Ed.), 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024), volume 121 of Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2024, pp. 4:1–4:13. doi:10.4230/OASIcs.WCET.2024.4.