Caching in a Mixed-Criticality 5G Radio Base Station

Emad Jacob Maroun*, Luca Pezzarossa and Martin Schoeberl
Technical University of Denmark, Department of Applied Mathematics and Computer Science
3rd Workshop on Resource Awareness of Systems and Society (RAW 2024), July 2–5, 2024, Maribor, Slovenia
* Corresponding author. ejama@dtu.dk (E. J. Maroun); lpez@dtu.dk (L. Pezzarossa); masca@dtu.dk (M. Schoeberl)

Abstract
Telecommunication is a critical driver of economic and social development. 5G technologies are the state of the art in telecommunication, setting strong and open-ended requirements for implementing systems. Current systems implementing 5G baseband technologies depend on hardware separation to ensure that high- and low-criticality tasks do not interfere in ways that violate guarantees. To increase performance and lower costs, this paper sets the research direction for future mixed-criticality systems that can handle both the high- and low-criticality tasks of the baseband unit. We analyze the 5G requirements and the common systems that currently implement them. We propose using T-CREST as the research platform, with a specific architecture targeting mixed-criticality workloads. We present two cache proposals that reduce the interference of low-criticality tasks on high-criticality tasks while ensuring high cache utilization and efficiency. The first cache proposal uses timeouts to automatically free cache lines reserved for high-criticality tasks. The second proposal uses contention tracking to limit how much low-criticality tasks may influence high-criticality tasks. Lastly, we propose a third cache architecture that unifies the method and stack caches unique to T-CREST into a single level-2 cache.

Keywords
5G, T-CREST, real-time systems, low latency, caches, radio baseband

1. Introduction

Socio-technical evolution depends on mobile communications as a critical driver of economic and social development [1]. As such, the evolution of communication technologies is essential in enabling societal development. 5G is the state of the art in mobile communication technologies, promising unprecedented speeds, ultra-low latency, and massive connectivity. Given these lofty promises, implementing 5G communication networks is a significant industrial challenge. Continued investment in 5G technologies is needed to reach beyond the minimal promises of the technology. Improvements in technical implementations will ensure better service characteristics for customers and users at lower costs.

One critical aspect of telecommunications technology is the radio base station (RBS), which provides wireless transmission to and from mobile devices.
The 5G functionality is implemented in these RBSs. Continued improvement of the RBS is critical to staying at the forefront of the industry. As such, research on how best to implement an RBS, optimizing performance and cost, ensures long-term competitiveness in the industry.

The requirements of 5G introduce a hierarchy of prioritized tasks that the RBS has to complete. The RBS, therefore, becomes a mixed-criticality system [2], where minimum guarantees are upheld to ensure critical tasks are completed correctly and in a timely fashion. Non-critical tasks, on the other hand, need to be performed as fast as possible; however, they only need to provide good quality of service (QoS) on average, so they may be de-prioritized to ensure that critical tasks meet their deadlines. To ensure non-critical tasks do not interfere with the critical ones, hardware systems are divided into several layers with differing responsibilities, correlating to the open systems interconnection (OSI) model [3]. This hardware division makes it easier to control interference, but it decreases resource utilization, which hurts performance and price. Therefore, we are interested in investigating future system designs that incorporate mixed-criticality system research to merge the currently divided systems into a single platform that can handle the varying criticality of tasks. While the current heavy use of shared scratchpads and the phased execution model [4] gives high predictability to the systems managing OSI layer 1, it is wasteful and difficult to unify with the shared caches used in the systems managing OSI layer 2. Therefore, innovative techniques are needed to facilitate the unification of the layer 1 and layer 2 systems into a single hardware system.

This paper addresses the challenge of sharing a level-2 (L2) cache between different tasks executing on different cores while still delivering low-latency execution of critical tasks. We propose to use the T-CREST platform [5] to explore solutions to the challenges around memory management for mixed-criticality systems by presenting three distinct caching architectures for future exploration. All solutions are centered around regulating access to different cache lines for high- and low-criticality jobs. More specifically, we propose two shared caches that use timeouts and contention tracking, respectively, to limit the interference of low-criticality tasks on high-criticality ones, as well as an L2 cache that unifies the split caches unique to T-CREST, since they exhibit unique access characteristics that can be sped up predictably.

The contributions of this paper are: (1) a description of common 5G RBS technologies and implementations, (2) a discussion of the challenges future systems face in the pursuit of lower cost, higher efficiency, and improved performance, and (3) three proposals for caching architectures that we intend to explore to address the challenges described.

The rest of this paper is structured into five sections. The following section provides background on how current systems implement 5G and their challenges. Section 3 introduces the T-CREST platform and how it can be used as a basis for research into a mixed-criticality system. Section 4 discusses the three cache architecture proposals. Section 5 presents related work, and Section 6 concludes the paper.

2. 5G Radio Baseband

[Figure 1: Hypothetical baseband unit architecture. The diagram shows a DSP cluster (four DSPs with private instruction and data caches), two accelerator clusters (four accelerators each), a shared scratchpad per cluster, a scheduler, two cluster-shared scratchpads, and off-chip DRAM main memory.]

During the initial phases of the 5G specification, three usage scenarios were identified as critical for the future of mobile communications [1]:
Enhanced Mobile Broadband (eMBB): Focuses on providing significantly higher data rates and capacity compared to previous telecommunication generations, enabling applications such as high-definition video streaming, virtual reality, and augmented reality. This scenario covers the day-to-day activities of private users and data-heavy but less critical industrial applications.

Ultra-Reliable and Low-Latency Communications (URLLC): Emphasizes ultra-reliable and low-latency communication, critical for applications that demand real-time responsiveness and mission-critical reliability, including autonomous vehicles, remote surgery, and industrial automation.

Massive Machine-Type Communications (mMTC): Targets the connectivity of a massive number of devices using minimal energy, enabling the Internet of Things (IoT) to scale to unprecedented levels and facilitating applications such as smart cities, industrial IoT, and environmental monitoring.

These scenarios resulted in a requirement specification that includes the following criteria [6]:

• Peak data rate: 20 Gbit/s download, 10 Gbit/s upload. This applies only under ideal conditions.
• Transmission latency: 4 ms for eMBB, 1 ms for URLLC. This is the latency the 5G network adds to the overall communication latency between endpoints.
• Device mobility: up to 500 km/h for rural eMBB, less for denser areas.
• Density: up to 1,000,000 devices per square kilometer in the mMTC scenario.

Note how each requirement applies in specific scenarios and is not necessary in others. For example, the peak data rate is unnecessary for the scenarios covered by URLLC or mMTC. Meanwhile, the extreme latency requirement of 1 ms applies only to URLLC.

An RBS must manage these diverse requirements and, therefore, becomes a mixed-criticality system.
For example, tasks within the URLLC scenario must be prioritized over eMBB tasks to uphold the URLLC latency requirements. Not only do we have a range of priorities, but these priorities may also change as usage changes. Adapting to ongoing changes in network usage is, therefore, a critical aspect of implementing 5G.

2.1. System Architecture

Typical RBS systems are divided into three hardware units:

1. The Remote Radio Unit (RRU). It is directly connected to the antennas and handles the initial input streams from them. The antenna streams are initially processed in this unit and grouped into user streams (e.g., 8 antenna streams are compressed into one group) to be sent to the next unit.
2. The Baseband Unit (BBU). It takes the input streams from the RRU and processes them further. The RRU and BBU together constitute the physical layer of the OSI model (layer 1), handling the physical aspects of transmitting and receiving wireless 5G signals [7].
3. The Layer 2 unit. It handles the data link layer of the OSI model (layer 2), which includes Medium Access Control (MAC) and Radio Link Control (RLC) tasks.

The varying characteristics of the different units' workloads result in different hardware designs. While both the BBU and layer 2 must handle high- and low-criticality tasks, they do so in different ways. This research aims to explore a merged system that handles the BBU and layer 2 tasks in one hardware system. The new system is to be centered around the design of a BBU but explore technologies that allow layer 2 tasks to run efficiently.

2.2. Baseband Unit

The BBU system handles physical-layer tasks centered around signal processing of incoming and outgoing transmissions. Its design ensures maximum predictability at the expense of resource utilization efficiency. Figure 1 provides an overview of the system. It is not meant to be representative of any specific system but to give an idea of the components often present and their interactions.

2.2.1. Hardware

We focus on systems centered around a clustered and heterogeneous design. Each cluster contains a set of processors or accelerators (for illustration, we show four in Figure 1). First, general computing capability is provided by digital signal processor (DSP) cores with high predictability [8]. Each DSP has private instruction and data caches and shares a single scratchpad memory with the other processors in the cluster.

The other clusters contain acceleration cores for specific and common workloads. The accelerators in each cluster also share a scratchpad. The exact architecture of the accelerators is out of the scope of this paper.

The clusters may also share scratchpads; two are shown as an example. These split scratchpads handle different data with specific access characteristics. For example, some configuration data might be mostly read and changed rarely, while user-specific data may be updated continuously.

Lastly, a hardware scheduler can be present to orchestrate task execution on the relevant cores and the movement of data. We have omitted describing any other application-specific devices or connections to peripherals.

2.3. Data Processing

Data processing starts once every millisecond. While the RRU is processing the antenna streams, the BBU starts with a set of configuration tasks that prepare for the delivery of data from the RRU. These configuration tasks must run on the DSP cores to, e.g., configure the accelerators before they start executing. This could result in configuration data initially going to one of the cluster-shared scratchpads, from where it is moved to the cluster scratchpads as needed. This data starts in the shared scratchpad of the core running the job and is off-loaded to the cluster-shared scratchpad when the configuration job is done. In parallel with the configuration tasks, the data from the RRU is loaded into the cluster-shared scratchpads. When that is ready, the proper processing tasks can begin executing on DSPs or accelerators as needed.

We assume strict data access characteristics for the tasks. All shared data is read-only. User-specific data is segmented into the relevant tasks and updated only by the task currently working on it. At no point are two tasks working on the same user data. These strict data access characteristics mean that synchronization and coherence are not issues we will consider.

2.3.1. Phased Execution

The use of scratchpads in the BBU reduces the variability in execution times. However, it requires methodical orchestration to ensure each job has the data it needs. As such, every job is divided into three phases:

1. Read: Any data the task requires is moved onto its cluster's scratchpad from the cluster-shared scratchpads.
2. Execute: The task's job is executed to completion without needing to access any memory other than the cluster's scratchpad.
3. Write: All the data previously fetched for the job that has been updated is written back to main memory.

This is a classic implementation of phased execution [4, 9], also called the simple-task model [10]. The task scheduler ensures that a task's Execute phase is only scheduled on a processor when its corresponding Read phase has terminated on the same cluster. Data movement is performed using DMAs, allowing processors to run other jobs' Execute phases in parallel with data movements (see the sketch below).

A cluster's scratchpad is partitioned so that each running job has exclusive access to its portion of the memory. If two tasks use the same data, the Read phase of each will load that data into their respective partitions. This means data might be duplicated in the cluster scratchpads. However, such shared data is rarely written to, and synchronization is explicitly handled at the application level; it is therefore not an issue.
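To make the three-phase model concrete, the following C sketch drives a single job through its phases. It is a minimal illustration under our own naming: job_t, dma_read, and dma_write are hypothetical placeholders, not an existing BBU API.

```c
#include <stddef.h>

/* Illustrative types; a real BBU runtime would define these. */
typedef struct {
    void  *spm_base;     /* job's exclusive partition in the cluster scratchpad */
    size_t in_size;      /* bytes to fetch in the Read phase */
    size_t out_size;     /* bytes to write back in the Write phase */
    void  *main_mem_in;  /* source of the input data in main memory */
    void  *main_mem_out; /* destination of the updated data */
    void (*entry)(void *spm_base); /* Execute-phase entry point */
} job_t;

/* Hypothetical DMA primitives; assumed to block until the transfer completes. */
void dma_read(void *dst_spm, const void *src_main, size_t n);
void dma_write(void *dst_main, const void *src_spm, size_t n);

void run_job(job_t *job)
{
    /* Read: fetch all input into the job's scratchpad partition.
       The DMA runs while the cores execute other jobs' Execute phases. */
    dma_read(job->spm_base, job->main_mem_in, job->in_size);

    /* Execute: runs to completion touching only the scratchpad, so its
       execution time does not depend on main-memory traffic. */
    job->entry(job->spm_base);

    /* Write: copy the updated data back to main memory. */
    dma_write(job->main_mem_out, job->spm_base, job->out_size);
}
```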
2.4. Layer 2 Design

The common computing architectures for layer 2 are more traditional, with, e.g., superscalar cores and standard caching. The workload on the system requires less stringent predictability than the BBU, allowing for a more traditional design. The tasks also require higher performance, which the more complex design provides at the cost of predictability. To ensure high-criticality tasks meet their deadlines, the hardware resources can be partitioned by clusters and intentionally over-provisioned.

Layer 2, therefore, can have much wastage where high-criticality tasks are concerned. This unit's more complex design makes it challenging to ensure tasks meet their deadlines. The only way to ensure the deadlines are met is to provide the tasks with such an overabundance of resources that, even when low-criticality tasks interfere, the high-criticality tasks are not adversely affected. Therefore, the inefficient use of resources in layer 2 is a supporting reason for merging the layer 2 subsystem with the BBU subsystem.

2.5. Challenges

We aim to research new methods for implementing 5G RBS technologies to achieve better performance at lower cost. Therefore, the current challenges of increased costs and lower performance must be alleviated in any future system.

Challenge 1: The primary challenge for the above-mentioned RBS systems is the divided hardware architecture. The physical division ensures that high-criticality tasks can meet their deadlines, but it increases costs and reduces overall performance. First, the separation necessitates manufacturing two physical systems, which is costly. Second, the separation means the two systems cannot share resources, reducing the efficient use of the available resources.

Challenge 2: On the BBU system specifically, there is also a challenge with the efficient use of resources. While using scratchpads ensures execution-time predictability for all tasks, it also forces data duplication. If two tasks use the same data, that data is moved into both tasks' scratchpad partitions, wasting both scratchpad memory and memory bandwidth. This is especially prevalent with configuration data, which is often shared between many tasks and does not change often. The data loaded into the scratchpads is also loaded on a pessimistic basis. Some tasks may only need part of the data, meaning some data might be unnecessarily loaded into the scratchpads.

Challenge 3: Memory bandwidth is wasted when dependent tasks use the same data. The Write phase in the BBU system always runs after the Execute phase. A subsequent job using the same data must reload it in its Read phase. This is sub-optimal in cases where the subsequent task can run on the same cluster as the first task. In such a case, omitting the Write phase of the first task and the Read phase of the second task would be better.
3. The T-CREST Platform

We propose to use the T-CREST platform as a basis for research into future platforms for 5G RBS. This section describes the platform's current capabilities and how they relate to the challenges present in divided RBS systems.

3.1. T-CREST and Patmos

The Patmos processor [11] is designed to serve real-time systems. Several Patmos cores are combined with a network-on-chip, a memory arbitration tree, and a memory controller into the time-predictable multi-core platform T-CREST [5]. As such, T-CREST provides techniques that make task execution time more predictable and reduce the worst-case execution time (WCET). Around the Patmos cores, it builds a platform of time-predictable components to reduce WCET analysis complexity and increase accuracy. T-CREST uses networks-on-chip [12, 13, 14] that ensure data is moved between processing cores with a known maximum latency. For accessing shared main memory, T-CREST uses a dedicated arbitration-tree-based network-on-chip [15]. Regardless of how many cores are accessing the memory, each access will be serviced within a bounded latency.

Patmos uses an in-order pipeline to ensure every instruction has a known and constant execution time. To exploit instruction-level parallelism predictably, Patmos is also a very long instruction word (VLIW) architecture with a dual-issue pipeline. VLIW architectures are a predictable way of increasing performance without increasing complexity [16, 17]. Patmos executes instructions in bundles of up to two instructions. The compiler must designate instructions as part of a bundle by setting a specific bit in the first instruction. All Patmos instructions are predicated: based on one of eight predicate registers, each instruction is either enabled or disabled. If the predicate register's value is true, the instruction is enabled, meaning it executes normally. If the value is false, the instruction is disabled and does not affect registers or memory; it effectively becomes a no-op. However, the execution time of a disabled instruction is the same as when it is enabled. Predicated instructions allow the compiler to minimize execution-time variability or even eliminate it entirely [18].
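As an illustration of what predication enables, the following C fragment contrasts a branching function with a branchless, single-path equivalent of the kind the compiler can produce via if-conversion. Modeling the predicate as a 0/1 integer is our simplification of the hardware mechanism, not Patmos syntax.

```c
/* Branching version: execution time depends on the input. */
int clamp_branch(int x, int max) {
    if (x > max)
        x = max;
    return x;
}

/* If-converted equivalent: both "paths" are evaluated and the predicate
   selects the result, so the time is input-independent. On Patmos, the
   compiler would emit this as two predicated instructions guarded by
   complementary predicate registers. */
int clamp_single_path(int x, int max) {
    int p = (x > max);            /* predicate-register analogue, 0 or 1 */
    return p * max + (1 - p) * x; /* both operands computed, one selected */
}
```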
3.2. Predictable Caching

While caching is usually associated with unpredictability and difficulties for static analysis, T-CREST deploys two predictable and easily analyzable caches. The first is a method cache [19] that replaces a traditional instruction cache in Patmos [20]. The method cache caches whole functions or parts of functions (sub-functions) such that instruction fetching never misses except at specific points. The compiler manages this cache by splitting the code into blocks that fit in the method cache and inserting cache-fill instructions where needed. In the Patmos ISA, function call and return instructions ensure that the callee or the caller is in the method cache. To support sub-function caching, Patmos has cache-filling variants of branch instructions. Using a method cache limits the places where cache misses can occur to the specific cache-filling instructions. The method cache is thus simpler for an analyzer to model when providing tight WCET bounds [21].

The second unique cache of T-CREST is the stack cache [22]. It caches function-local data, which is often accessed predictably and can be loaded at function entry and exit points. Accessing this data is also done without experiencing cache misses. The compiler manages the stack cache as well, setting it up and tearing it down at function entries and exits and using stack-targeting load and store instruction variants. An analyzer can assume that any stack-targeting instruction will hit in the stack cache. Therefore, the cache must only be modeled to account for the stack setup and tear-down time [23]. Data accesses that are not function-local may still go through the conventional data cache or circumvent all caching to target the main memory directly.

These two cache architectures are supported by the Platin WCET analyzer [24]. Platin models instruction execution and tracks which blocks of code are likely to be in the method cache at a given point. It accounts for this at control-flow points to know whether a method-cache miss is likely and how many bytes would have to be loaded. For the stack cache, it models the program stack's size at any point and tracks the stack-cache-control instructions added by the compiler. At points where the stack must grow, Platin knows whether the cache has free space or needs to spill some of the program stack to main memory.
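As an illustration of the behavior Platin models, here is a minimal software model of the stack-cache control operations (reserve, free, and ensure) following their description in [22, 23]. The structure, function names, and the eager spill/fill are our simplifications.

```c
#include <assert.h>

/* Simplified model: the cache holds the top `occupied` words of the
   program stack; anything below has been spilled to main memory. */
typedef struct {
    unsigned capacity;  /* stack-cache size in words */
    unsigned occupied;  /* words of the stack currently cached */
} stack_cache_t;

/* Reserve k words on function entry. Returns the number of words
   spilled to main memory, the only cost a reserve can incur. */
unsigned sc_reserve(stack_cache_t *sc, unsigned k) {
    unsigned spill = 0;
    assert(k <= sc->capacity); /* the compiler guarantees frames fit */
    if (sc->occupied + k > sc->capacity) {
        spill = sc->occupied + k - sc->capacity; /* deepest words leave */
        sc->occupied = sc->capacity;
    } else {
        sc->occupied += k;
    }
    return spill;
}

/* Free the callee's k-word frame on exit; never accesses main memory. */
void sc_free(stack_cache_t *sc, unsigned k) {
    sc->occupied = (k > sc->occupied) ? 0 : sc->occupied - k;
}

/* After returning, ensure the caller's k-word frame is cached again.
   Returns the number of words filled from main memory. */
unsigned sc_ensure(stack_cache_t *sc, unsigned k) {
    unsigned fill = 0;
    assert(k <= sc->capacity);
    if (sc->occupied < k) {
        fill = k - sc->occupied;
        sc->occupied = k;
    }
    return fill;
}
```

Platin's stack-cache analysis essentially evaluates this model statically: since spills and fills can occur only at these control points, it can bound their cost per call site.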
3.3. Missing Capabilities

The T-CREST platform is missing some features and capabilities compared to the BBU system. We enumerate these missing capabilities below and highlight how we might either simulate them using existing capabilities or implement them in the platform as part of the research project.

3.3.1. Acceleration and Clustering

The specific processing requirements of an RBS mean that dedicated accelerators can be used for maximum efficiency. The T-CREST platform does not include anything resembling these accelerators. Likewise, the T-CREST platform does not use any clustering, whose benefit is mainly driven by a multi-layered intermediate memory, which we discuss in the next section.

As this research mainly focuses on the efficient use of resources, notably memory, we will not investigate or implement any hardware acceleration. Instead, we will use Patmos cores as substitutes for specific accelerators. We will implement clustering in the T-CREST platform so that each cluster can be designated to execute specific tasks. This will allow us to treat one cluster as a substitute for a BBU DSP cluster and others as substitutes for different types of acceleration clusters.

3.3.2. Hierarchical Memory

The Patmos cores of T-CREST are each paired with private caches, as described earlier. However, no further hierarchy of intermediate memory exists. In contrast, the BBU system contains three levels of intermediate storage: first, each DSP (or accelerator) has its own caches; second, each cluster has a shared scratchpad; lastly, cluster-shared scratchpads are present as a last level for storing various types of data.

A multi-layered memory hierarchy is necessary for the experiments to be representative, especially given the unique data access characteristics. Therefore, we will build a second layer of intermediate memory that is shared between the Patmos cores of each cluster. We will omit a last memory layer, as any methods we develop for managing the second layer can be transferred to the remaining layers of a real-world system.

3.3.3. Hardware-Assisted Scheduling

BBU systems often use hardware to accelerate scheduling. T-CREST does not implement any hardware that can assist with scheduling. While a hardware scheduler in the BBU system ensures that the extreme number of tasks gets scheduled in a reasonable time, the smaller scale of this project's prototypes can likely be handled by software-managed scheduling.

Therefore, the initial proposed system will not have any scheduling hardware; dedicated Patmos cores will replace it and handle the scheduling. Software-defined scheduling is a flexible way to test our scheduling strategies as the system matures. Moving to a hardware scheduler should be easily doable at later stages of the research, once the scheduling has been studied and techniques chosen. Patmos already supports adding custom devices and accelerators [25]. A hardware scheduler is a device that interacts with the rest of the clusters, memories, and processors and issues commands in the same manner a Patmos core would.

3.4. Proposed System Architecture

[Figure 2: Proposed T-CREST system for researching novel cache architectures. The diagram shows a scheduler; three clusters, each with two Patmos cores that have private method, stack, and data caches (M$, S$, D$) and share a cluster cache; a memory controller; and off-chip SRAM main memory.]

Figure 2 shows a diagram of our proposed system. It comprises three clusters, each with a set of Patmos cores with private split caches (method, stack, and data) and a shared cluster cache. The cores use the T-CREST memory tree to access the shared cache, providing predictable and low-latency access. The clusters use the T-CREST memory tree to connect to the memory controller, which manages access to the off-chip main memory. A shared bus (in gray above the clusters in Figure 2) facilitates cross-cluster and cross-core communication.
This allows a Patmos core or a hardware scheduler device to issue scheduling commands to the whole system.

This system architecture will allow research on efficiently managing the cluster caches. The different clusters can simulate the DSP or accelerator clusters of the BBU system, while the cluster-shared scratchpads of that system do not introduce new challenges. Therefore, limiting ourselves to the two levels of cache (private and cluster caches) will allow for fruitful experimentation during the research.

4. Cache Proposals

To start addressing the challenge of merging the layer 1 and layer 2 systems, we focus on the challenge of using a shared cache in each cluster. As described earlier, the BBU architecture sacrifices the efficient use of resources to ensure low variability in execution times. We aim to maximize resource usage in the proposed system while maintaining low variability. We propose exploring three caching solutions that address the challenges of predictable caching: (1) a criticality timeout cache, (2) a contention tracking cache, and (3) a unified method/stack cache.

4.1. Criticality Timeout Cache

In cases where strict predictability is unnecessary but flexibility and utilization efficiency are essential, we propose a cache using a partitioning approach based on cache-line timeouts. This cache needs an n-way set-associative configuration and can be configured at the granularity of cache ways. Each cache way can be assigned either a criticality or a task/core ID (we use criticality in the following).

In this proposal, each cache way is assigned either high or low criticality. Cache lines can be used by high- or low-criticality tasks; however, high-criticality tasks are naturally preferred. A low-criticality task cannot evict a high-criticality cache line. Therefore, to avoid starvation of low-criticality tasks, at least one way must not be assigned to the high-criticality tasks.

When a high-criticality access arrives, a cache line in one of the high-criticality ways is tagged as being occupied by that criticality, and an associated timeout begins. As long as the timeout is not reached, accesses by low-criticality tasks cannot evict the cache line. If there is no access to the line before the timeout is reached, the line's criticality is downgraded, allowing low-criticality jobs to evict the line. The cache can either be configured right before each job starts executing, or the criticalities can be configured ahead of time to match the tasks that will run on the cluster. With timeouts, there is no need to explicitly release any data, as the timeout mechanism does so automatically. The cache is configured by setting the criticality of a cache way. When a way is configured with a criticality, all its cache lines prefer accesses of that criticality, as described above. A sketch of the resulting replacement policy is given below.

A significant drawback of this approach is its unpredictability. Because timeouts might cause a cache line to be evicted even when it might be used in the future, it can be difficult for a WCET analysis tool to track which cache lines have reached their timeout and which have not. The effect of the timeouts on WCET bounds can be challenging to estimate and would require dedicated analysis. However, this analysis can also be omitted, as this cache architecture is better suited for measurement-based WCET estimation. With detailed testing and measurements, obtaining a sufficiently safe WCET bound should be feasible.

This cache architecture is designed for high utilization and low scheduling complexity. Because it reserves individual cache lines, only the necessary subset of a cache way is reserved at a given time. Cache lines that either timed out or were not used by the job are free to be used by low-criticality tasks, increasing the utilization of the cache. In this proposal, we also do not pre-load data into the cache, which means only data that is actually used will be loaded. Therefore, we avoid both the bandwidth wastage and the cache-space wastage of loading data that is not used. When a job stops executing, its associated cache lines will eventually time out and release their contents automatically. The scheduler, therefore, does not need to manage the phased execution of jobs, reducing the pressure on the scheduler.
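The following C sketch outlines the replacement behavior we envision for one cache set. The tick-based timer, the constants, and all names are illustrative assumptions, not a finished hardware design.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define TIMEOUT_TICKS 1024  /* assumed reservation length */

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     high_crit; /* line currently reserved by a high-criticality access */
    uint32_t timer;     /* remaining ticks of the reservation */
} line_t;

typedef struct {
    line_t way[WAYS];
    bool   way_high[WAYS]; /* per-way configuration, set by software */
} set_t;

/* Called periodically: an expired reservation is downgraded, so
   low-criticality tasks may evict the line; no explicit release needed. */
void tick(set_t *s) {
    for (int w = 0; w < WAYS; w++)
        if (s->way[w].high_crit && --s->way[w].timer == 0)
            s->way[w].high_crit = false;
}

/* Pick a victim way for an access of the given criticality. Returns -1
   if a low-criticality access finds no evictable line; it must then
   bypass the cache and go directly to main memory. */
int choose_victim(const set_t *s, bool access_is_high) {
    for (int w = 0; w < WAYS; w++)      /* invalid lines are always free */
        if (!s->way[w].valid) return w;
    if (access_is_high) {               /* prefer the high-criticality ways */
        for (int w = 0; w < WAYS; w++)
            if (s->way_high[w]) return w;
        return 0;
    }
    for (int w = 0; w < WAYS; w++)      /* only unreserved lines may be evicted */
        if (!s->way[w].high_crit) return w;
    return -1;
}

/* On a refill: a high-criticality access arms the reservation timer.
   A hit by a high-criticality access would re-arm it the same way. */
void install(set_t *s, int w, uint32_t tag, bool access_is_high) {
    s->way[w] = (line_t){ .tag = tag, .valid = true,
                          .high_crit = access_is_high && s->way_high[w],
                          .timer = access_is_high ? TIMEOUT_TICKS : 0 };
}
```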
4.2. Contention Tracking Cache

In this proposal, a combination of contention tracking in the cache and contention-aware task scheduling allows for maximal cache utilization through dynamic partitioning, with high predictability through cache-contention tracking and mitigation.

In a multicore system without shared caches, the execution time of a job is affected by the cache behavior without that behavior being affected by other jobs. Through cache analysis, we can bound the execution time attributable to the cache by estimating the number of cache misses that will occur. When the cache is shared, this analysis is no longer possible, as the interference of other jobs will cause additional cache misses in a manner that cannot be estimated. In this proposal, we let the task scheduler limit the contention that a job is allowed to experience such that it is guaranteed to meet its deadline.

We give two example types of contention: (1) A job 𝐽1 experiences a contention event if a cache line 𝐶1 that it populated with data 𝐷1 is evicted by an access from another job 𝐽2. This is because 𝐽1 will experience a cache miss on its next access to 𝐷1 that it would not have experienced had 𝐽2 not interfered. (2) 𝐽1 also experiences a contention event if a cache miss occurring when accessing 𝐷1 results in the eviction of a cache line that 𝐽1 also populated in the same cache set (with data 𝐷2). This event is a contention with any other job that has at least one populated cache line in the same set. Without the other jobs, 𝐽1 would have populated an empty cache line instead of evicting one of its own populated lines. The evicted line will cause a cache miss in the future when 𝐽1 needs to access 𝐷2 again.

We only consider contention between different jobs. Self-contention also happens in private caches and is, therefore, already managed by the cache analysis for the private cache.

We limit the maximum allowed contention as defined above to ensure that a job meets its deadline without interference from other jobs. The scheduler configures the cache with a maximum allowed contention. The cache controller tracks contention by checking and counting the above contention events for each job. When a job reaches its contention limit, any cache access that would cause a contention event will be blocked or mitigated. For example, say 𝐽1 is high criticality and 𝐽2 is not. As long as 𝐽1 has not reached its contention limit, the cache treats accesses from both jobs equally. When the limit is reached, contention events between 𝐽1 and 𝐽2 are mitigated. In the case of the first event type, accesses from 𝐽2 that would cause an eviction of 𝐽1's cache lines are rejected by the cache. The access must then be rerouted directly to the main memory, which the system must support. In the case of the second event type, if the default replacement policy would have 𝐽1 evict its own cache line in the set, it instead evicts a cache line from 𝐽2.
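A minimal sketch of this mitigation decision, assuming per-line job IDs and per-job counters in the controller; the names and encoding are ours.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_JOBS 8

typedef struct {
    uint32_t events[MAX_JOBS]; /* contention events suffered so far */
    uint32_t limit[MAX_JOBS];  /* set by the scheduler before the job starts */
} contention_t;

typedef enum { EVICT_OK, EVICT_OTHER, BYPASS } decision_t;

/* Job `req` misses and the default replacement policy proposes to evict
   a line owned by `victim`. Decide what the controller does. */
decision_t on_eviction(contention_t *c, unsigned req, unsigned victim,
                       bool set_shared_with_others)
{
    if (victim == req) {
        /* Event type (2): self-eviction in a set that other jobs also
           occupy; it only counts as contention if the set is shared. */
        if (set_shared_with_others && c->events[req] >= c->limit[req])
            return EVICT_OTHER;  /* evict another job's line instead */
        if (set_shared_with_others)
            c->events[req]++;
        return EVICT_OK;
    }
    /* Event type (1): evicting another job's line counts against the
       victim; at its limit, the access is rerouted to main memory. */
    if (c->events[victim] >= c->limit[victim])
        return BYPASS;
    c->events[victim]++;
    return EVICT_OK;
}
```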
Setting the contention limit is the responsibility of the job scheduler. Through traditional static WCET analysis under the assumption of private caches, jobs get their WCET bounds. Any excess time between the bound and the task deadline is therefore open to contention. Before the scheduler starts a job, it sets the contention limit, ensuring that the WCET of the job, including contention, still meets the deadline. The contention limit can be static, calculated as part of the schedulability analysis. It can also be dynamic, with the scheduler adjusting it to the runtime conditions. If the task was started early, the contention limit is increased to match the available slack time. If the task was started late, the limit is reduced or set to zero to ensure that the deadline is still met.
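As a sketch of how such a limit could be derived (our notation, not taken from an existing analysis): let D be the job's deadline, t_s its actual start time, C its WCET bound under the private-cache assumption, and p_miss the worst-case penalty of one additional cache miss. Since each contention event costs at most one extra miss, the scheduler can grant at most

```latex
n_{\mathrm{limit}} = \left\lfloor \frac{D - t_s - C}{p_{\mathrm{miss}}} \right\rfloor
```

contention events while still guaranteeing the deadline.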
This proposal's major strength is that it disconnects the analysis of tasks with differing criticalities. Because of the contention limit, high-criticality tasks will never be adversely affected by low-criticality tasks. Therefore, we just need to ensure that all high-criticality tasks meet their deadlines by other means (for example, using partitioning between the high-criticality tasks only). The proposal also does not statically partition or lock the cache. At worst, when a contention limit is reached, the cache is dynamically partitioned automatically, simply by prioritizing the jobs that have reached the limit. This maximizes cache utilization. It also allows maximizing the performance of low-criticality tasks as long as they do not adversely affect any high-criticality tasks.

This proposal does increase the complexity of the cache controller, which needs to track contention events and mitigate them for jobs that have reached their contention limit. Each cache line needs to be associated with a job (or core), each job needs a contention counter, and logic is needed to ensure the correct mitigation at the contention limits. The proposal also increases scheduler complexity. This complexity can initially be lowered by using statically determined contention limits. However, further work should explore dynamically determined limits, which would increase the workload on the scheduler.

4.3. Unified Method/Stack Cache

The Patmos processor of T-CREST uses the special method and stack caches. While these caches have been researched for their impact on predictability, and the Platin analyzer has analysis implementations for them, additional work is needed to integrate them into a shared L2 cache. Therefore, we propose investigating a shared L2 cache that integrates the features of both the method cache and the stack cache. It is meant to complement either a traditional L2 data cache or a scratchpad, with extended research avenues toward a fully integrated L2 cache that supports the method, stack, and data caches. This proposal can also complement either of the previous proposals.

The method and stack caches have particular access patterns to their data. The method cache accesses a block of code at a time, pre-loading a complete block at once. It also uses a first-in, first-out (FIFO) replacement policy to account for functions earlier in the call stack being less likely to be called again soon. The stack cache, on the other hand, is not backed by main memory unless some data is spilled when the cache is full. This allows the L2 cache to store the spilled stack data first without sending it to the main memory. Access to this stored data would have the same characteristics as access to the stack cache. Additionally, when space is tight in the L2 cache, the replacement policy is the same as the stack cache's: spill the data furthest up the stack.

An open question is how to partition the cache between the method and stack data. Since both have a replacement policy that depends on reaching the space limit, a policy is needed for deciding how much of the cache should be meant for methods and how much for the stack. We should also investigate whether this division can be configured dynamically, such that if the stack is not expected to use much space, most of the L2 cache is saved for methods, and vice versa. A different approach could be to give the stack priority up to a point: when the stack needs to store more data, methods are evicted to make room, up to a limit (e.g., half the L2 cache size). Any space not used by the stack cache can store methods. This can also be done in reverse, where the method data gets priority. A sketch of the stack-priority variant follows below.

An open question that would need answering following the above initial research is how to implement a unified method/stack cache that is also shared between cores. Since each core has a distinct stack and is also likely to use different functions, we need to explore ways for a single cache to effectively manage multiple stacks and call trees.
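The following C sketch illustrates the stack-priority variant under the assumed half-capacity threshold; the structure and names are ours, and it only decides the space policy, not the FIFO bookkeeping.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t capacity;      /* unified L2 size in bytes */
    size_t stack_bytes;   /* currently used by spilled stack data */
    size_t method_bytes;  /* currently used by method blocks */
} unified_l2_t;

/* The stack may displace method blocks until it owns half the cache;
   beyond that, further spills go directly to main memory. */
bool stack_may_grow(const unified_l2_t *c, size_t n) {
    size_t free_bytes = c->capacity - c->stack_bytes - c->method_bytes;
    if (n <= free_bytes)
        return true;                            /* fits without evicting methods */
    return c->stack_bytes + n <= c->capacity / 2; /* evict methods (FIFO) to fit */
}

/* Methods may always use whatever the stack does not currently occupy. */
size_t method_space(const unified_l2_t *c) {
    return c->capacity - c->stack_bytes;
}
```

The reverse (method-priority) variant would simply swap the roles of the two regions in this decision.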
4.4. Discussion

The three caching proposals (Criticality Timeout Cache, Contention Tracking Cache, and Unified Method/Stack Cache) each address the challenge of predictable caching in different ways. Table 1 compares the various features of our proposals.

Table 1
Comparison between features of the three cache proposals.

                      Priority   Contention   Unified
                      Timeout    Tracking     Method/Stack
  All Data               ✓           ✓            ✗
  Shared                 ✓           ✓            ✗
  Mixed-Criticality      ✓           ✓            ✗
  Analyzable             ✗           ✓*           ✓
  Needs Scheduling       ✗*          ✓*           ✗
  Guaranteed             ✗           ✓*           ✓

The first big difference is between the Unified Method/Stack Cache and the two other caches. The Priority Timeout and Contention Tracking caches both support all program data, whereas the Unified Method/Stack Cache only supports instruction data (methods) and stack data. Even more specifically, the traditional stack cache does not support all stack data, only data that does not need an address, as the stack cache is not backed by main memory. Any data whose address is taken in the program cannot be put in the stack cache and goes instead to the shadow stack, which is backed by main memory. Another big difference between the Unified Method/Stack Cache and the others is that this proposal does not share the cache between multiple cores, which also means it does not alleviate any challenges for mixed-criticality systems.

Analyzability differs between all the cache proposals. The Priority Timeout Cache does not support analyzability very well, as it is difficult for analyzers to track when cache lines have timed out. The Contention Tracking Cache is analyzable, but only in the sense that it simplifies mixed-criticality analysis by disallowing interference between tasks of different criticalities. For tasks with the same criticality, the cache does not provide any assistance, but it does not complicate the analysis either. The Unified Method/Stack Cache is the most analyzable. Analyzers can reuse the analysis done for the separate method and stack caches and likely apply it to the unified one with different configurations and minor customization.

The proposals also differ in how much support is needed from the job scheduler at runtime. The Priority Timeout Cache can be implemented without scheduler support if the way-based partitioning is configured ahead of time. If the partitioning is done dynamically, it would be the scheduler's responsibility. The Contention Tracking Cache needs support from the scheduler to ensure the amount of allowed contention stays within the correct limit. The scheduler needs to account for when a high-criticality job is started so that an appropriate contention limit is chosen. A static approach can also be used, where the contention limit is chosen ahead of time. However, that does not provide much benefit compared to traditional partitioning. The Unified Method/Stack Cache needs no scheduling support at all. The only thing that might be configurable is how much of the cache is prioritized for methods or for the stack. However, this could better be done by the program itself, e.g., through compiler management of the cache.

Lastly, each cache provides different guarantees on its behavior. The Priority Timeout Cache provides priority guarantees only for a specific time. If that time is not managed so that it does not run out, programs cannot be guaranteed that a specific amount of the cache is reserved for them. While giving no guarantees on partitioning, the Contention Tracking Cache guarantees how much contention can affect a job. However, this covers only contention from lower-criticality jobs, so it makes no guarantees about contention from similar-criticality tasks. The Unified Method/Stack Cache, in contrast, is predictable and guarantees behavior similar to that of the split caches.
5. Related Work

Shared caches are a significant challenge for predictability due to their inherent nature of allowing multiple cores to access the same cache [26]. This can lead to contention and unpredictable performance. However, several solutions have been proposed to address this issue, including cache partitioning and locking [27].

Partitioning is a technique that divides the shared cache into several partitions, each dedicated to a specific core [28]. This approach can significantly improve predictability by reducing contention [29]. Way-based partitioning divides the cache ways among different cores. Each core is assigned a specific number of ways in the cache, ensuring exclusive access to those ways. This method can effectively isolate the cache activities of different cores, improving predictability. Index-based partitioning, on the other hand, divides the cache sets among different cores. Each core is assigned specific sets in the cache, ensuring exclusive access. This method is more flexible than way-based partitioning because the number of sets is usually large, allowing for finer-grained partitioning. However, a given set maps to specific address ranges; therefore, this method requires more detailed memory management. Page coloring is often used to partition the cache [30]. The address space is divided into colors associated with the cache sets. Assigning colors to tasks/cores provides the partitioning, assuming an assignment can be found that provides the correct memory for each task/core. The cache hardware can also support index-based partitioning for various benefits [31, 32]. However, some form of software management will always be needed.
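To illustrate page coloring, the following C fragment computes the color of a physical address for an assumed cache geometry (4 KiB pages, 64-byte lines, 1024 sets, i.e., 1 MiB if 16-way). Two tasks given disjoint color sets can then never map to the same cache sets.

```c
#include <stdint.h>

#define PAGE_SIZE   4096u
#define LINE_SIZE     64u
#define NUM_SETS    1024u

#define SETS_PER_PAGE (PAGE_SIZE / LINE_SIZE)    /* 64 sets covered by one page */
#define NUM_COLORS    (NUM_SETS / SETS_PER_PAGE) /* 16 distinct colors */

/* A page's color is determined by the overlap between its physical page
   number and the cache's set-index bits; pages of the same color compete
   for the same sets. */
unsigned page_color(uintptr_t phys_addr) {
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}
```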
Cache locking is another technique used to improve predictability in shared caches [33]. With locking, specific cache lines can be locked to prevent them from being evicted, ensuring they are always available for the necessary cores. This can significantly reduce cache misses and improve predictability. Locking can, however, be costly. Lock management involves tracking the locked cache lines, which increases hardware complexity. Adding locking to a cache can reduce its capacity or speed, depending on how fine-grained the locking is. Locking also reduces cache utilization, as unused locked content cannot be evicted to free up cache lines for needed data.

T-CREST has enabled much research within various aspects of real-time systems [5]. Because all of T-CREST's components are predictable, it is possible to implement constant-execution-time code based on the single-path paradigm [34, 18]. Single-path code has an inherently high overhead, necessitating optimizations to reduce the executed code [35], make the best use of Patmos' dual-issue pipeline [36, 17], and use custom register allocation techniques [37]. The combination of T-CREST and single-path code has been shown to be competitive with off-the-shelf ARM processors for a real-time application [38]. Research is also ongoing to port the Lingua Franca coordination language to T-CREST to enable the creation of complete real-time systems within one framework [39, 40].

6. Conclusion

The increasing importance of 5G technologies necessitates continuous research and development of the hardware systems implementing the technology. The diverse requirement specifications of this new technology necessitate a system with varying degrees of strictness and performance. Existing systems were designed with the minimal 5G guarantees in mind, ensuring the hard requirements, e.g., low latency, were met before softer requirements like throughput. This focus resulted in a physically divided system.

To increase future systems' performance while maintaining the older systems' guarantees, this paper sets the research direction toward a mixed-criticality 5G RBS with merged BBU and layer 2 systems. The system should be able to execute high-criticality tasks, like those required by the URLLC 5G scenario, and low-criticality QoS tasks, like those for eMBB, in one SoC. Based on our analysis of the 5G requirement specifications and the common system architecture, we propose using the T-CREST platform as the research platform for future mixed-criticality systems. We propose a specific system architecture that best leverages the existing system architecture's strengths and increases its performance through shared caches. Finally, we propose three specific research directions within shared L2 caches for clustered systems. The various proposals have distinct strengths and weaknesses that will be further explored in future work.

Acknowledgment

This work is partially supported by the CERCIRAS (Connecting Education and Research Communities for an Innovative Resource Aware Society) COST Action no. CA19135, funded by COST (European Cooperation in Science and Technology).
References

[1] International Telecommunication Union - Radiocommunication Sector, IMT Vision - Framework and overall objectives of the future development of IMT for 2020 and beyond, Technical Report M.2083-0, International Telecommunication Union, 2015.
[2] A. Burns, R. I. Davis, Mixed criticality systems - a review (February 2022), 2022.
[3] ISO/IEC 7498-1:1994(E), Information technology - Open Systems Interconnection - Basic Reference Model: The Basic Model, Technical Report 7498-1:1994, International Organization for Standardization, 1996.
[4] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, R. Kegley, A predictable execution model for COTS-based embedded systems, in: 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium, IEEE, 2011, pp. 269–279.
[5] M. Schoeberl, S. Abbaspour, B. Akesson, N. Audsley, R. Capasso, J. Garside, K. Goossens, S. Goossens, S. Hansen, R. Heckmann, S. Hepp, B. Huber, A. Jordan, E. Kasapaki, J. Knoop, Y. Li, D. Prokesch, W. Puffitsch, P. Puschner, A. Rocha, C. Silva, J. Sparsø, A. Tocchi, T-CREST: Time-predictable multi-core architecture for embedded systems, Journal of Systems Architecture 61 (2015) 449–471. doi:10.1016/j.sysarc.2015.04.002.
[6] International Telecommunication Union - Radiocommunication Sector, Minimum requirements related to technical performance for IMT-2020 radio interface(s), Technical Report M.2410-0, International Telecommunication Union, 2017.
[7] Z. Kong, J. Gong, C.-Z. Xu, K. Wang, J. Rao, eBase: A baseband unit cluster testbed to improve energy-efficiency for cloud radio access network, in: 2013 IEEE International Conference on Communications (ICC), IEEE, 2013, pp. 4222–4227.
[8] E. Tell, A. Nilsson, D. Liu, A programmable DSP core for baseband processing, in: The 3rd International IEEE-NEWCAS Conference, 2005, IEEE, 2005, pp. 403–406.
[9] J. Arora, C. Maia, S. A. Rashid, G. Nelissen, E. Tovar, Schedulability analysis for 3-phase tasks with partitioned fixed-priority scheduling, Journal of Systems Architecture 131 (2022) 102706.
[10] H. Kopetz, Real-Time Systems, Kluwer Academic, Boston, MA, USA, 1997.
[11] M. Schoeberl, W. Puffitsch, S. Hepp, B. Huber, D. Prokesch, Patmos: A time-predictable microprocessor, Real-Time Systems 54(2) (2018) 389–423. doi:10.1007/s11241-018-9300-4.
[12] M. Schoeberl, F. Brandner, J. Sparsø, E. Kasapaki, A statically scheduled time-division-multiplexed network-on-chip for real-time systems, in: Proceedings of the 6th International Symposium on Networks-on-Chip (NOCS), IEEE, Lyngby, Denmark, 2012, pp. 152–160. doi:10.1109/NOCS.2012.25.
[13] E. Kasapaki, M. Schoeberl, R. B. Sørensen, C. T. Müller, K. Goossens, J. Sparsø, Argo: A real-time network-on-chip architecture with an efficient GALS implementation, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24 (2016) 479–492. doi:10.1109/TVLSI.2015.2405614.
[14] M. Schoeberl, Exploration of network interface architectures for a real-time network-on-chip, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), IEEE, 2024. doi:10.1109/ISORC61049.2024.10551364.
[15] M. Schoeberl, D. V. Chong, W. Puffitsch, J. Sparsø, A time-predictable memory network-on-chip, in: Proceedings of the 14th International Workshop on Worst-Case Execution Time Analysis (WCET 2014), Madrid, Spain, 2014, pp. 53–62. doi:10.4230/OASIcs.WCET.2014.53.
[16] J. Yan, W. Zhang, A time-predictable VLIW processor and its compiler support, Real-Time Systems 38 (2008) 67–84. doi:10.1007/s11241-007-9030-5.
[17] E. J. Maroun, M. Schoeberl, P. Puschner, Predictable and optimized single-path code for predicated processors, Journal of Systems Architecture (2024) 103214.
[18] E. J. Maroun, M. Schoeberl, P. Puschner, Compiler-directed constant execution time on flat memory systems, in: 2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC), 2023, pp. 64–75. doi:10.1109/ISORC58943.2023.00019.
[19] M. Schoeberl, A time predictable instruction cache for a Java processor, in: On the Move to Meaningful Internet Systems 2004: Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES 2004), volume 3292 of LNCS, Springer, Agia Napa, Cyprus, 2004, pp. 371–382. doi:10.1007/b102133.
[20] P. Degasperi, S. Hepp, W. Puffitsch, M. Schoeberl, A method cache for Patmos, in: Proceedings of the 17th IEEE Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC 2014), IEEE, Reno, Nevada, USA, 2014, pp. 100–108. doi:10.1109/ISORC.2014.47.
[21] B. Huber, S. Hepp, M. Schoeberl, Scope-based method cache analysis, in: Proceedings of the 14th International Workshop on Worst-Case Execution Time Analysis (WCET 2014), Madrid, Spain, 2014, pp. 73–82. doi:10.4230/OASIcs.WCET.2014.73.
[22] S. Abbaspour, F. Brandner, M. Schoeberl, A time-predictable stack cache, in: Proceedings of the 9th Workshop on Software Technologies for Embedded and Ubiquitous Systems, 2013.
[23] A. Jordan, F. Brandner, M. Schoeberl, Static analysis of worst-case stack cache behavior, in: Proceedings of the 21st International Conference on Real-Time Networks and Systems (RTNS 2013), ACM, New York, NY, USA, 2013, pp. 55–64. doi:10.1145/2516821.2516828.
[24] E. J. Maroun, E. Dengler, C. Dietrich, S. Hepp, H. Herzog, B. Huber, J. Knoop, D. Wiltsche-Prokesch, P. Puschner, P. Raffeck, et al., The Platin multi-target worst-case analysis tool, in: 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024), Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2024.
[25] C. Pircher, A. Baranyai, C. Lehr, M. Schoeberl, Accelerator interface for Patmos, in: 2021 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC), 2021.
[26] B. C. Ward, J. L. Herman, C. J. Kenna, J. H. Anderson, Making shared caches more predictable on multicore platforms, in: 2013 25th Euromicro Conference on Real-Time Systems, IEEE, 2013, pp. 157–167.
[27] G. Gracioli, A. Alhammad, R. Mancuso, A. A. Fröhlich, R. Pellizzoni, A survey on cache management mechanisms for real-time embedded systems, ACM Computing Surveys (CSUR) 48 (2015) 1–36.
[28] S. Mittal, A survey of techniques for cache partitioning in multicore processors, ACM Computing Surveys (CSUR) 50 (2017) 1–39.
[29] X. Vera, B. Lisper, J. Xue, Data caches in multitasking hard real-time systems, in: RTSS 2003. 24th IEEE Real-Time Systems Symposium, IEEE, 2003, pp. 154–165.
[30] T. Lugo, S. Lozano, J. Fernández, J. Carretero, A survey of techniques for reducing interference in real-time applications on multicore platforms, IEEE Access 10 (2022) 21853–21882.
[31] A. Chousein, R. N. Mahapatra, Fully associative cache partitioning with don't care bits for real-time applications, ACM SIGBED Review 2 (2005) 35–38.
[32] M. Lee, S. Kim, Time-sensitivity-aware shared cache architecture for multi-core embedded systems, The Journal of Supercomputing 75 (2019) 6746–6776.
[33] S. Mittal, A survey of techniques for cache locking, ACM Transactions on Design Automation of Electronic Systems (TODAES) 21 (2016) 1–24.
[34] P. Puschner, A. Burns, Writing temporally predictable code, in: Proceedings of the Seventh IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 2002), IEEE Computer Society, Washington, DC, USA, 2002, pp. 85–94. doi:10.1109/WORDS.2002.1000040.
[35] E. J. Maroun, M. Schoeberl, P. Puschner, Constant-loop dominators for single-path code optimization, in: P. Wägemann (Ed.), 21st International Workshop on Worst-Case Execution Time Analysis (WCET 2023), volume 114 of Open Access Series in Informatics (OASIcs), Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2023, pp. 7:1–7:13. URL: https://drops.dagstuhl.de/opus/volltexte/2023/18436. doi:10.4230/OASIcs.WCET.2023.7.
[36] E. J. Maroun, M. Schoeberl, P. Puschner, Compiling for time-predictability with dual-issue single-path code, Journal of Systems Architecture 118 (2021) 1–11.
[37] E. Maroun, M. Schoeberl, P. Puschner, Two-step register allocation for implementing single-path code, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), IEEE, 2024. doi:10.1109/ISORC61049.2024.10551362.
[38] M. Platzer, P. Puschner, A real-time application with fully predictable task timing, in: 2020 IEEE 23rd International Symposium on Real-Time Distributed Computing (ISORC), IEEE, 2020, pp. 43–46.
[39] E. Khodadad, L. Pezzarossa, M. Schoeberl, Towards Lingua Franca on the Patmos processor, in: Proceedings of the 2024 IEEE 27th International Symposium on Real-Time Distributed Computing (ISORC), 2024.
[40] M. Schoeberl, E. Khodadad, S. Lin, E. J. Maroun, L. Pezzarossa, E. A. Lee, Invited paper: Worst-case execution time analysis of Lingua Franca applications, in: T. Carle (Ed.), 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024), volume 121 of Open Access Series in Informatics (OASIcs), Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2024, pp. 4:1–4:13. doi:10.4230/OASIcs.WCET.2024.4.