A Program State Machine Based Virtual Processing Model in SystemC Tim Schmidt1 , Kim Grüttner2 , Rainer Dömer1 , Achim Rettberg3 1 University of California, Irvine, USA 2 OFFIS – Institute for Information Technology, Oldenburg, Germany 3 Carl von Ossietzky University Oldenburg, Germany ABSTRACT to raise the level of abstraction and support early design deci- The Program State Machine (PSM) Model of Computation sions. In [3], different abstraction levels have been proposed. offers a rich set of modeling elements to describe behavioral The specification level enables untimed modeling of function- and structural hierarchy, concurrency, synchronization, state ality and causality between behaviors in an executable model. transitions and timing. With the rising software complexity Behaviors can be statically composed in a sequential order, of today’s embedded systems, the use of Real-Time Operat- as finite-state machine or parallel. Communication between ing Systems (RTOS) has become state-of-the-art for nearly functions is described using double handshake channels with all System-on-Chip designs. Regrettably, the PSM model message passing and shared variables. The architecture level itself has insufficient support for the specification of the pre- introduces processing elements (PE) that execute behaviors emptive dynamic scheduling behavior of an RTOS. In this in sequential order. Behaviors are annotated with delays to paper, we propose a model for dynamically dispatching PSM specify the estimated execution times on the PEs. Commu- models on a virtual processing element. Our model aims nication between PEs is described through message passing to abstract from the targeted RTOS and the processor core communication with annotated delays. The implementation through execution time annotations and a flexible preemp- level adds instruction and cycle accurate timing for PEs and tive scheduler model. Mapping a PSM model to a set of signal level protocols with cycle accurate communication scheduled virtual processing elements only requires minor times. model transformation and enables early exploration of dif- ferent processing element mappings and scheduling policies. APPLICATION APPLICATION APPLICATION APPLICATION Our virtual processing model for PSMs is realized on top COMM & SYNC COMM & SYNC RTOS COMM & SYNC CHANNELS CHANNELS CHANNELS of the SystemC library. We evaluate the proposed virtual COMM & SYNC VIRTUAL PROCESSING INSTRUCTION SET RTOS MODEL processing model using a Canny edge detection filter. CHANNELS UNIT SIMULATOR SLDL SLDL SLDL SLDL (a) specification model (b) virtual processing model (c) architecture model (d) implementation model 1. INTRODUCTION The development process of state-of-the-art embedded Figure 1: Extension of a virtual processing model (see [4]) systems is complex and affects many different disciplines. Among others, the design process requires hardware and software design decisions. Today’s system complexity of em- In this work we focus on systems that implement their bedded Multi-Processor System-on-Chip (MPSoC) designs behavior in software, which can be mapped and executed is continuing at an almost exponential rate [2]. The strongly on different PEs of an MPSoC. Fig. 1 (a), (c) and (d) show growing complexity includes the integration of the function- the layers for executable MPSoC models proposed in [4]. ality as well as the associated software complexity. The All models are executed on top of a SLDL, e.g. SpecC or development of hardware is associated with immense costs. SystemC. The Application is user-defined behavior to be ex- Consequently, whenever possible, designers prefer software ecuted on the MPSoC. The layers between Application and solutions and realize algorithms on software processors. SLDL introduce communication, scheduling and timing tech- To cope with the increasing complexity and the time-to- niques to enable a stepwise refinement from the specification market pressure, new design methodologies are required. One level down to the implementation. design challenge for embedded system designers is mapping The design step from a non-scheduled specification model the functionality on the individual processing elements while (a) to a complete RTOS scheduled architecture model (c) meeting the required extra-functional properties (e.g. timing is a complex refinement step. It requires to transform the and power consumption) at minimal cost. To support this application or behavioral description into a process- or thread- challenge, System-Level Design Languages (SLDLs) enable based model and to use the configuration and scheduling primitives of the selected RTOS. At this time the choice of the task granularity and the supported scheduling primitives have usually been taken. At this level, the comparison of different task granularities and different scheduling policies, supported by different RTOSs induces major redesign effort. In this paper, we introduce a virtual processing model (b) to support a smoother refinement from the unscheduled specification to an RTOS scheduled architecture model for PSMs. EWiLi’14, November 2014, Lisbon, Portugal. Our approach supports PSM modeling at the specification Copyright retained by the authors. layer and enables early estimation of dynamic scheduling effects when mapping parallel behaviors on the same PE. Multi-Processing) or one global ready queue (Symmetric This step can be performed without any behavior to process Multi-Processing) used for dynamic process to core dispatch- or RTOS specific refinements. This way, designers can profit ing. The proposed extension supports the concept of Trans- from simple specification model modifications in combina- action Level Modeling (TLM) for intra-core communication. tion with early estimated execution time annotations, thus Both solutions focus on a process level RTOS abstraction at enabling early decisions regarding PE allocation, behavior the architecture and implementation level including features to process refinement granularity, process to PE binding and like process creation and interrupt handling. In contrast, scheduling policy selection. our proposed approach avoids process-level RTOS opera- The rest of the paper is organized as follows. In Section 2 tions and operates on an estimated execution time annotated we analyze related work. Next, we discuss the design of the specification model. Moreover, our solution keeps commu- virtual processing model in Section 3. Followed by a brief nication abstract and each processing element has its own description of the implementation in Section 4, we evaluate ready queue. After exploration, based on our virtual platform the new virtual processing model using a Canny filter for still model, we can transform the scheduled specification model image edge detection in Section 5. Finally, we conclude our into an RTOS model on the architecture level. work in Section 6. The timing accuracy and therefore the simulation perfor- mance of [4, 13] is limited by the fixed minimal resolution of 2. RELATED WORK discrete time advances. An extension deploying techniques with respect to preemptive scheduling models has been pre- An early proposal of a generic RTOS model based on Sys- sented in [14]. The Result Oriented Modeling collects and temC has been published in [11]. The presented abstract consumes consecutive timing annotations while still handling RTOS model achieves time-accurate task preemption via Sys- preemptions accurately. temC events and models time passing via a delay() method. A two layer for modeling approach for software task schedul- The RTOS overhead can be modeled as well. Two different ing considering shared resources has been proposed in [6, 7]. task scheduling schemes are studied: the first one uses a The design starts with an Application Layer (AL) model, dedicated thread for the scheduler, while the second is based which describes the functionality in terms of software tasks, on cooperative procedure calls, avoiding this overhead. Al- hardware modules and shared communication objects. These though in this approach explicit inter-task communication modeling elements are mapped on modeling elements of the resources are required (message queue, ...), the simulation Virtual Target Architecture Layer (VTAL): software process- time advances simultaneously as the tasks consume their ing elements with an RTOS model similar to [14], hardware delays. processing elements with fixed static scheduling, memories In [8], an RTOS modeling tool is presented. Its main and SystemC TLM for modeling shared buses and point-to- purpose is to accurately model an existing RTOS on top of point communication channels. Communication is realized SystemC. A system designer cannot directly use it. In this ap- via Remote Method Invocation (RMI) via shared buses or proach, the next RTOS “event” (interrupt, scheduling event, dedicated point-to-point channels. The individually mapped etc.) is predicted during run-time. This improves simula- software tasks can be annotated with Estimated Execution tion speed, but requires deeper knowledge of the underlying Time (EET) blocks that represent computation time. The system. design flow is supported with preemptive and cooperative An RTOS based scheduling approach with focus on pre- scheduling strategies, as well as deadline driven strategies. cise interrupt scheduling has been proposed in [17]. For This approach covers specification and architecture level this purpose, a separate scheduler is introduced to handle modeling. The main difference to our approach is that the incoming interrupt requests. Timing annotations and syn- RTOS model works with explicit tasks (i.e. processes). Our chronization within user tasks are handled by a replacement model could be refined as well to [6, 7] after PSM scheduling of the SystemC wait(). In [16] an annotation method for exploration. time estimation has been presented that supports flexible simulation and validation of real-time constraints for task migration between different target processors. The concept 3. VIRTUAL PROCESSING MODEL allows preemptive scheduling in the context of priority-based scheduling, supporting nested interrupts. 3.1 Basic Modeling Elements All mentioned solutions above work on architecture level We use an expressive subset of the program state machine models and allow to create and handle processes and in- (PSM) model of computation (MoC) to describe the func- terrupts on an RTOS specific abstraction. Our solution tionality of a system. A hierarchical PSM model with the addresses scheduling at a higher abstraction level and keeps corresponding thread graph is shown in Fig. 2(a) and (b). A communication at specification abstraction, thus no need for sequential composition of n behaviors describes a total execu- interrupts. tion order, denoted as a n dimensional tuple: (beh1 , ..., behn ). Several approaches based on abstract task graphs [9, 10, The execution starts with beh1 and finishes with behn . A 12, 15] have been proposed as well. In this case, a pure parallel composition of n behaviors describes a partial execu- functional SystemC model is mapped onto an architecture tion order, denoted as a set of n behaviors {beh1 , ..., behn }. model including an abstract RTOS. The mapping requires The parent behavior of a parallel composition of child behav- an abstract task graph of the model, where estimated ex- iors will not finish until all child behaviors have completed ecution times can be annotated on a per-task basis only, (Fork-Join semantics). The finite-state machine behavior ignoring control-flow dependent durations. This reduces the composition is a special case of a sequential execution from achievable accuracy. a start state to an end state. The execution order is defined The proposed RTOS model in [4] can be implemented on by state transitions. top of any SLDL (see Fig. 1) that supports the concepts The virtual processing model supports communication be- of process handling and time modeling. An extension of tween behaviors via double handshake channels (synchro- this approach [13] presents a high-level, host-compiled multi- nized) and shared variables (unsynchronized). A double core RTOS simulator. This multi-core processor model can handshake channel operates in rendezvous fashion (see Fig. 3). run more than one process simultaneously which can be or- When the data is transferred from the sender to the receiver, ganized by a separate ready queue per core (Asymmetric both behaviors resume their execution at the same time. A A A → ttop ing algorithms. We decided to provide these two fundamental B → ttop C B D [t1,t2] = fork(ttop) strategies because more complex strategies can be easily de- E D → t2 rived from them. C → t1 E → t2 C2 {C1,C2,C3}→t1 F → t2 C1 G F H [t3,t4] = fork(t2) 3.3.1 Fixed Priority G → t3 H → t4 C3 t2 = join(t3,t4) The fixed priorities are statically defined. The scheduler F → t2 J J → t2 always executes the behavior that has the highest priority ttop = join(t1,t2) and is ready to execute. We are interested in making the A → ttop a) PSM model b) Thread graph process of priority assignment to the behaviors on the virtual processing model as simple as possible. The designer assigns Processing Element PE1 Processing Element: Scheduler A PE1 Strategy fixed priorities only to leaf behaviors. A hierarchical higher B Sequence: behavior cannot hide the priority of a leaf behavior. C D A E Parallel: Set of ready behaviors Set of ready behaviors C2 B F C1, C2, C3, C1 G H F C1, C2, C3 FSM: Sequence: G, H C D inner behaviors leaf behaviors inner behaviors leaf behaviors C3 J State: State: State: Parallel: E J C1 C2 C3 F a) Before scheduling decision b) After simulation of behavior F Scheduler S: fixed priority G H Figure 4: Differentiation between the priority of inner and c) Mapped behavior on a PE d) Nested behaviors leaf behaviors (example based on Fig. 2) Figure 2: PSM model (a), which is mapped on a PE (c), The scheduler distinguishes between inner and leaf be- with corresponding thread graph (b) and nesting tree (d) haviors. As shown in Fig. 4(a), the set of ready behaviors can be categorized into two subsets. Inner behaviors al- ways have an infinite high priority and leaf behaviors have a communicating behavior can be blocked through communica- fixed priority defined by the designer. The scheduler always tion for some time, until the other communication partner is prefers an inner behavior over a leaf behavior. Let’s assume ready and the handshake has been completed. To achieve a the behaviors C1, C2, C3, and F are ready to execute. In high utilization of a PE the scheduler will be able to preempt this situation, the scheduler selects behavior F because F is blocked behaviors. an inner behavior (see Fig. 2(c)) and has infinite priority. Behavior F has two child behaviors G and H, which are leaf Receiver Communication Receiver behaviors. Behavior F is waiting until the child behaviors interface interface Behavior B1 Channel C1 Behavior B2 direction on channel on channel have completed. Fig. 4(b) shows both child behaviors G and send() s0 H added to the set of behaviors ready to execute. Port C1 blocked receive() with s1 sender B1 B2 Port with 3.3.2 Round Robin interface time All mapped behaviors on a processing element get time receiver Double handshake channel interface slices of the same length. If a behavior has terminated, the a) Communication diagram b) Sequence diagram scheduler selects immediately the next running behavior. The scheduler executes the behaviors in a circular order. If a Figure 3: Double handshake protocol behavior is not ready to execute, the next ready behavior with respect to the circular iteration is chosen. Fig. 5 depicts a round robin scheduling example. The parallel behaviors 3.2 Processing Elements G and H are mapped on the same processing element. The behaviors G and H are scheduled by round robin and the A PE maintains a single thread of execution (i.e. a single time slice is 5 time units. In the following, we keep the core processor). We associate exactly one scheduler with focus on behavior G that requests once 3 and once 12 time one PE one scheduling algorithm. We use the terms simu- units. The start behavior is arbitrary because the model lated time and simulation time to express the amount of task in the figure does not define one; we assume the simulation execution time currently simulated. The terms simulation starts with behavior G. Behavior G requests 3 time units, execution time and execution time refer to the amount of time computes, and requests 5 more time units. The request of the simulator requires on the host computer1 . Fig. 2(c) shows 3 time units can be consumed completely in one time slice; the scheduler S which is assigned to PE1 and associated however, the following request is too complex. 2 more time with the fixed priority scheduling strategy. In this case all units can be consumed after behavior H preempts behavior behaviors mapped on PE1 need a specific fixed priority, such G and can start executing. At time 10, behavior G is active that the scheduler can make a scheduling decision. Further- again and continues consuming the remaining 10 time units. more, we assume that computation is only in leaf behaviors (C1, C2, C3, E, G, H and J in Fig. 2), and hierarchical first time request second time behaviors (A, B, C, D, and F in Fig. 2) describe the causal of 3 time units request of 12 time untis chain of execution in the model that must be followed by the Processing Element: PE1 Behavior of G: scheduler. G H 1. Request 3 time units G G H G H G 2. Compute something 3. Request 12 time units 0 2 4 6 8 10 12 14 16 18 20 22 24 3.3 Scheduling Scheduler: Round robin, time slice: 5 time units 4. Compute something 5. ... simulated time 1. time 2. time 3. time We describe now the concept of scheduling for PSM models slice slice slice for the two requested fixed priority and round robin schedul- 1 Figure 5: Round robin scheduling Which is of course dependent on the host CPU, clock frequency etc. and a comparison between simulation execution times of different model is only possible on the same reference simulation host. Depending on the scheduling algorithm and the commu- nication status, a behavior can have one of the following status: ready, low priority ... status: ready, high priority driven by thread t1 driven by thread t2 states (see Fig. 6): ready, running, communicating and wait- ing. A ready behavior can be selected by the scheduler and waste_time(5ms) B1 B2 waste_time(3ms) port->receive() executed on the associated processing element. A behavior waste_time(7ms) has the state running, if it is currently executing on the ... C1 ... mapped processing element. A running behavior can be preempted in two different ways: (a) end of the time-slice, as defined by the scheduling algorithm (waiting state), (b) Figure 8: Scheduling with communication blocked communication request on double handshake channel (communicating). If a behavior has completed, its status is terminated. Fig. 6 shows all possible transitions between the We allow preemption of running behaviors. The function described states. then of preemption is to suspend the current running pro- cess and to resume a process the scheduling algorithm has end of communicating start of communication communication selected to be executed next. If we suspend a process waiting executes estimated activation computation termination on a timed event,we have to interrupt the waiting process, ready running terminated store how much time the process has already consumed, and finished estimated consume the remaining time later. For this reason, the Sys- preempted computation resume temC wait() statement in Line 6 is sensitive to two different completed after communication waiting or-composed sets of events. The first event max_time notifies the thread after the provided time slice of the scheduler is over. The second parameter is an or-composed event list. Figure 6: State automaton of a behavior All behaviors that start communication add the correspond- ing synchronization event of the channel to that list. This mechanism allows preempting a current running behavior by a behavior with higher priority that has completed commu- 4. IMPLEMENTATION nication. 4.1 Processing Element Algorithm 1 function waste time(requested time) The class osss_processing_element represents a PE and in- herits from the osss_behaviour. Fig. 7 shows the extension of 1: req time ← requested time 2: while req time > 0 do the OSSS-Behaviour class diagram [5]. The designer derives 3: max time ← available time(requested time) a class from osss_processing_element class and defines in the 4: start time ← current time constructor the execution order of the mapped behaviors on 5: behavior status ← running the highest hierarchical level. 6: wait(max time or registered communication events) 7: req time ← req time - (current time - start time) 8: if req time = 0 then osss_fixed_priorty_strategy osss_round_robin_strategy osss_module sc_module 9: return IF:Class #wait(…): void 0..* sc_port 10: else 1 osss_scheduler osss_behaviour 1 0..* IF:Class 11: behavior status ← waiting < < virtual > > schedule(): void <>+main(): void osss_port < < virtual > > available_time(): sc_time #osss_seq(osss_sequential_behaviour_list): void 1 12: dispatch() 1 dispatch(): void #osss_fsm(osss_state_transition_list): void osss_composite_behaviour 13: end if 1 #osss_par(osss_parallel_behaviour_list): void 1 <>#init(): void main(): void 14: end while osss_processing_element <>#final(): void osss_fsm_behaviour osss_scheduling_attributes 1 osss_parallel_behaviour osss_state sc_core::sc_time: m_simulated_time process_status: m_process_status 0..1 osss_sequential_behaviour Each PE represents a single core processor. For this reason, osss_priority*: m_prio osss_initial_state osss_end_state only one behavior can be ready for the SLDL scheduler. Otherwise, the scheduler would execute multiple behaviors Figure 7: OSSS-Behaviour class extensions (bold boxes) in parallel on the same PE. Communicating behaviors are an exception because they are waiting for their synchronization In the following, we describe how the basic scheduling event. If the process of a communicating behavior would be algorithms are designed to support preemption and commu- suspended, the behavior would ignore the synchronization nication. event. The function dispatch() (see Alg. 2) guarantees this requirement. The set of behaviors is stored in two lists, 4.2 Simulation of Time namely a list for inner behaviors and a list for leaf behaviors. Fig. 8 shows an example where the behaviors B1 and B2 The first behavior in the list of ready behaviors defines the are mapped on the same PE under fixed priority schedul- next running behavior. In this situation behavior B1 needs ing strategy. Thread t1 is associated with behavior B1 and to be suspended and behavior B2 should be ready. thread t2 with behavior B2. We assume behavior B2 is running and B1 ready (see Fig. 6) at the beginning and 4.3 Scheduling Strategy neither B1 nor B2 have consumed any time. Thread t2 We decided to separate the scheduler and the scheduling starts executing the main() function of B2 and enters the 3 strategies (see Fig. 7). The designer derives a class from milliseconds estimated waste_time() function of the sched- osss_scheduler. The function schedule() takes a list of all uler, see Alg. 1. The while loop runs until the entire re- leaf behaviors as argument. The function moves the next quested time of a timing annotation has been consumed. executing behavior to the beginning of the list. The func- At the beginning of the loop, thread t2 calls the function tion available_time defines the size of a consumable time available_time(requested_time), which asks the scheduling quantum. strategy how much of the totally requested time can be con- sumed. In our example, the scheduling strategy is fixed priority and the current behavior has the highest priority. In 5. EXPERIMENTS AND EVALUATION this case, the fixed priority scheduler accepts the complete In order to evaluate our proposed virtual processing model, time (i.e. 3 milliseconds). we focus on scheduling different partitions of a specification Algorithm 2 function dispatch() Simulated Speedup LoC ∆ [LoC] ∆ [%] time [ms] 1: Unsorted List: inner behaviors, leaf behaviors 2: Process successor process ← null Specification 40 - 1497 - - 3: Process current process ← get current process() IndividualPE 160 1 1541 44 2.9 4: if inner behaviors = ∅ then BlurX 140 1.14 1760 219 14.2 5: schedule(leaf behaviors) Blur2X 120 1.33 1840 80 4.5 6: successor process ← first element(leaf behaviors) Blur2X2Y 80 2.00 1897 57 3.1 7: else 8: successor process ← first element(inner behavior) 9: end if Table 1: Comparison of selected design metrics 10: for all Behavior b in inner behavior do 11: if process(b) 6= current process and status(b) 6= communicating then 12: suspend(process(b)) The following models, as shown in Fig. 9, are evaluated: 13: end if 14: end for Specification (untimed, without virtual PE), Individual 15: if current process 6= successor process then PE (timed blur leaf behaviors, all behaviors mapped on a sin- 16: resume(successor process) gle PE), BlurX (timed blur leaf behaviors, blurX4 mapped 17: suspend(current process) 18: end if to P E2, synchronization between P E1 and P E2 via double handshake channel (behavior blurX4 on P E2 can only start if behavior blurX_par on P E1 has been entered)), Blur2X Processing Element PE1 Processing Element PE1 Sync (timed blur leaf behaviors, parallel composition of blurX3 Apply_Hysteresis canny Apply_Hysteresis canny Processing Element PE2 and blurX4 mapped to P E2, synchronization of parallel com- prep gaussian_smooth prep gaussian_smooth Sync position like in BlurX) and Blur2X2Y (timed blur leaf blurX_par blurX_par blurX1 blurX2 blurX3 blurX4 blurX1 blurX2 blurX3 sync blurX4 20ms behaviors, parallel composition of blurX3 and blurX4 and 20ms 20ms 20ms 20ms 20ms 20ms 20ms Sync parallel composition of blurY3 and blurY4 mapped to P E2, blurY1 blurY2 blurY3 blurY_par blurY4 blurY1 blurY2 blurY3 blurY_par blurY4 Scheduler S synchronization like in Blur2X with additional synchroniza- 20ms 20ms 20ms 20ms 20ms 20ms 20ms 20ms tion barrier between sequential composition of parallel blur blur_done blur_done behaviors). Derivative_X_Y Derivative_X_Y When neglecting communication (shared array access), Non_Max_Supp Non_Max_Supp Magnitude_X_Y Magnitude_X_Y synchronization and scheduling (including context switch- Scheduler S Scheduler S ing) times, our model’s total simulated times, as expected b) BlurX a) IndividualPE Processing Element PE1 Sync Processing Element PE1 Sync by Amdahl’s law, are shown in Tab. 1. We compare the Apply_Hysteresis canny Apply_Hysteresis canny complexity of the different models using a simple Lines of Processing Processing prep gaussian_smooth Element PE2 prep gaussian_smooth Element PE2 Code (LoC) metric. The major effort was to allocate new Sync Sync blurY_par blurY_par channels and behaviors for synchronization. For instance, blurX1 blurX2 blurX1 blurX2 20ms 20ms sync sync blurX4 20ms blurX3 20ms 20ms 20ms sync sync blurX3 20ms blurX4 20ms for the model BlurX a new PE and three sync behaviors blurY_par Sync blurY_par Sync have been instantiated. Furthermore, the double handshake blurY1 20ms blurY2 20ms blurY3 20ms blurY4 20ms Scheduler S blurY1 20ms blurY2 20ms sync sync channel was hierarchically bound from the ports of the PE’s blurY3 blurY4 blur_done blur_done 20ms 20ms to the ports of the blur leaf behaviors. Scheduler S In the following, we discuss the simulation of the four Derivative_X_Y Derivative_X_Y Magnitude_X_Y Non_Max_Supp Magnitude_X_Y Non_Max_Supp virtual processing models using a round robin scheduler with different time slice granularities from 1ns up to 100, 000ns Scheduler S Scheduler S c) Blur2X d) Blur2X2Y on each individual PE. We measured the number of context switches and associated them with a constant cost of 1 ms. Figure 9: Partitioning of the Canny edge detector Fig. 10 shows the simulated time of model Blur2X2Y with context switching costs for a round robin scheduling of P E1 and P E2. As expected, we can observe that fine grained time slices < 100 ns have a huge impact on the overall simulated model and measure the simulated time, count the sched- time. On the other hand, the responsiveness (although not uler calls and measure the execution time of the SystemC necessary for the image filter design) rises. program. The count of scheduler calls is compared to the ex- pected number of scheduler calls depending on the scheduling Simula'on  'me  of  the  model  Blur2X2Y  with  'me   strategy. Based on the results, we can show that the commu- annota'ons  for  context  switches     240,000   nication via shared variables and double handshake channels, simula'on  'me  [ms]   220,000   as well as the individual scheduler on the PEs, do not violate 200,000   180,000   the causal execution order defined in the specification model. 160,000   140,000   The design we use for evaluation is a Canny edge detector 120,000   [1], a graphical filter to detect edges in a gray-scale image. As 100,000   80,000   starting point, we used an existing SpecC PSM model of the 60,000   filter, transformed it into an OSSS-Behaviour [5] model, and 1   10   100   1,000   10,000   100,000   Time  slice  of  the  individual  schedulers  [ns]   implemented different virtual processing models, as shown in Fig. 9. Communication is performed via shared variables and double handshake channels. An array containing the Figure 10: Simulation with costs for context switches complete image is shared among the blur behaviors. Each of these behaviors manipulates pixels on non-overlapping Fig. 11 visualizes the ratio between the simulated time for tiles of the image. In the Canny algorithm blurring is the context switches and computation for a 20 ms leaf behavior most computationally intensive block and therefore we map timing annotation. When the time slice is very short, almost different combinations of blur leaf behaviors to a second 70 % of the simulated time is spent on context switches. processing element. For our evaluation, we only require Fig. 12 shows the measured execution time of the various timing annotations for parallel and mapped leaf behaviors. models on an Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00 We annotated each blur behavior with 20ms. GHz with 4 GB RAM running Fedora 12 Linux using the Ra$o  between  context  switch  and  computa$on  $me  for   with different time slice granularities. So far we did not the  model  Blur2X2Y    Ra$o  between  context  switch   100%   evaluate our simulation results against measurements on a and  computa$on  $me   80%   real platform. To do a trade-off between simulation speed Computa2on   60%   2me   and accuracy, a comparison with measurement results is 40%   Context   necessary and part of future work. Currently, the context 20%   switch  2me   switch penalty and communication delay is handled in a very 0%   simple way and after the conduction of measurement trails these timing models will be refined. 1   2   5         0   0   1,   2,   5,   10 0   20 0     10 20 50 0 0 0 00 10 20 50 00 00 00 0 ,0 ,0 Time  slice  of  individual  scheduler  [ns]   Acknowledgement Figure 11: Ratio between context switches and computation time This work has been partially supported by the ARAMiS project (01IS11035M) and the EMC2 collaborative ARTEMIS project (01IS14002R), both funded by the German Federal time command. From this measurement, we can observe that Ministry of Research and Education (BMBF). the execution time of the individual models is proportional to the number of context switches, as shown in Fig. 10. References We traced the individual scheduler calls and compared the [1] J. Canny. A Computational Approach to Edge Detection. Pat- tern Analysis and Machine Intelligence, IEEE Transactions Execu&on  &me  for  round  robin  scheduled  processing   on, PAMI-8(6):679–698, nov. 1986. elements   [2] D. D. Gajski, S. Abdi, A. Gerstlauer, and G. Schirner. Embedded 1.100   System Design: Modeling, Synthesis and Verification. Springer, 1.000   Spec.  Model   1st edition, 2009. Execu&on  &me  [sec]   0.900   [3] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao. 0.800   Individual  PE   0.700   SpecC: Specification Language and Methodology. Springer, 1 Blur  X   edition, 2000. 0.600   0.500   Blur  2X   [4] A. Gerstlauer, H. Yu, and D. D. Gajski. RTOS Modeling for 0.400   Blur  2X  2Y   System Level Design. In Proceedings of DATE. IEEE Computer 0.300   Society, 2003. 0.200   [5] K. Grüttner and W. Nebel. Modelling Program–State Machines 1   10   100   1,000   10,000   100,000   in SystemC. In Forum on Specification and Design Languages Time  slice  of  the  individual  schedulers  [ns]   2008, 09 2008. [6] P. Hartmann, H. Kleen, P. Reinkemeier, and W. Nebel. Efficient Figure 12: Simulation execution times of canny edge detector modelling and simulation of embedded software multi-tasking us- partitionings ing SystemC and OSSS. In Specification, Verification and De- sign Languages, 2008. FDL 2008. Forum on, pages 19–24, Sept 2008. execution order of leaf behaviors (i.e. the causal chain) [7] P. A. Hartmann, K. Grüttner, A. Rettberg, and I. Podolski. Distributed Resource-Aware Scheduling for Multi-Core Architec- between the specification and the Virtual Processing Models. tures with SystemC. In Distributed, Parallel and Biologically In the specification model, the blur behavior’s execution Inspired Systems, volume 329, pages 181–192. Springer, 2010. order was blurX1, ... , blurX4, while in the different VPM [8] Z. He, A. Mok, and C. Peng. Timed RTOS modeling for Embed- ded System Design. In Real Time and Embedded Technology models the execution order changed to blurX4, ... , blurX1. and Applications Symposium (RTAS’05), 2005. Even though the ordering is different, validity of causality [9] S. Huss and S. Klaus. Assessment of Real-Time Operating Sys- for parallel compositions (partial order) only requires to be tems Characteristics in Embedded Systems Design by SystemC order isomorph, which is the case. models of RTOS services. In Proceedings of Design & Verifica- tion Conference and Exibition (DVCon’07), 2007. [10] T. Kempf, M. Dörper, R. Leupers, G. Ascheid, H. Meyr, T. Ko- gel, and B. Vanthournout. A Modular Simulation Framework for 6. CONCLUSION AND OUTLOOK Spatial and Temporal Task Mapping onto Multi-Processor SoC In this paper, we extended the proposed methodology in Platforms. In Proceedings of DATE, 2005. [4] and introduced a novel virtual processing model for PSM [11] R. Le Moigne, O. Pasquier, and J. Calvez. A Generic RTOS Model for Real-Time Systems Simulation with SystemC. In Pro- based models. The existing methodology allowed scheduling ceedings of DATE, 2004. of processes using generic RTOS primitives. The design step [12] S. Mahadevan, M. Storgaard, J. Madsen, and K. Virk. Arts: A from a non-scheduled specification model to a process-based System-Level Framework for Modeling MPSoC Components and Analysis of their Causality. In 13th IEEE International Sympo- RTOS scheduled architecture model is a major refinement sium on Modeling, Analysis, and Simulation of Computer and step. For this reason, we proposed to introduce an inter- Telecommunication Systems, 2005. mediate model, called virtual processing model. This model [13] P. Razaghi and A. Gerstlauer. Host-Compiled Multicore RTOS Simulator for Embedded Real-Time Software Development. In enables to add a scheduler with user defined scheduling al- Proceedings of DATE. IEEE, 2011. gorithm to a behavior, called virtual processing unit. This [14] G. Schirner and R. Dömer. Introducing preemptive scheduling flexible scheduling annotation enables fast and easy explo- in abstract RTOS models using result oriented modeling. In Proceedings of DATE, New York, NY, USA, 2008. ACM. ration, regarding scheduling granularities of behaviors and [15] M. Streubühr, J. Falk, C. Haubelt, J. Teich, R. Dorsch, and assignments of scheduling policies. After successful explo- T. Schlipf. Task Accurate Performance Modeling in SystemC ration, the behavior to process transformation and RTOS for Real-Time Multi-Processor Architectures. In Proceedings of DATE, 2006. configuration for architecture refinement can be performed. [16] H. Zabel and W. Müller. An Efficient Time Annotation Tech- We have sketched how to use SystemC to implement our nique in Abstract RTOS Simulations for Multiprocessor Task virtual processing model. Furthermore, we have integrated Migration. In Distributed, Parallel and Biologically Inspired the virtual processing model into the OSSS-Behaviour li- Systems, volume 271 of IFIP Advances in Information and Communication Technology. Springer, 2008. brary, supporting program state machine (PSM) modeling [17] H. Zabel, W. Müller, and A. Gerstlauer. Accurate RTOS Mod- in SystemC. Our implementation concept allows designers to eling and Analysis with SystemC. In W. Ecker, W. Müller, and implement new scheduling strategies. For the evaluation, we R. Dömer, editors, Hardware-dependent Software, pages 233– 260. Springer Netherlands, 2009. used a Canny filter design and created different behavior par- titions and scheduler configurations. The evaluation showed that our extension retains the functional causalities of the original PSM model when using a round robin scheduling