A Program State Machine Based Virtual Processing Model
                      in SystemC

                         Tim Schmidt1 , Kim Grüttner2 , Rainer Dömer1 , Achim Rettberg3
                                                1
                                                  University of California, Irvine, USA
                             2
                                 OFFIS – Institute for Information Technology, Oldenburg, Germany
                                      3
                                        Carl von Ossietzky University Oldenburg, Germany


ABSTRACT                                                            to raise the level of abstraction and support early design deci-
The Program State Machine (PSM) Model of Computation                sions. In [3], different abstraction levels have been proposed.
offers a rich set of modeling elements to describe behavioral       The specification level enables untimed modeling of function-
and structural hierarchy, concurrency, synchronization, state       ality and causality between behaviors in an executable model.
transitions and timing. With the rising software complexity         Behaviors can be statically composed in a sequential order,
of today’s embedded systems, the use of Real-Time Operat-           as finite-state machine or parallel. Communication between
ing Systems (RTOS) has become state-of-the-art for nearly           functions is described using double handshake channels with
all System-on-Chip designs. Regrettably, the PSM model              message passing and shared variables. The architecture level
itself has insufficient support for the specification of the pre-   introduces processing elements (PE) that execute behaviors
emptive dynamic scheduling behavior of an RTOS. In this             in sequential order. Behaviors are annotated with delays to
paper, we propose a model for dynamically dispatching PSM           specify the estimated execution times on the PEs. Commu-
models on a virtual processing element. Our model aims              nication between PEs is described through message passing
to abstract from the targeted RTOS and the processor core           communication with annotated delays. The implementation
through execution time annotations and a flexible preemp-           level adds instruction and cycle accurate timing for PEs and
tive scheduler model. Mapping a PSM model to a set of               signal level protocols with cycle accurate communication
scheduled virtual processing elements only requires minor           times.
model transformation and enables early exploration of dif-
ferent processing element mappings and scheduling policies.              APPLICATION               APPLICATION                APPLICATION                 APPLICATION

Our virtual processing model for PSMs is realized on top                                               COMM & SYNC                 COMM & SYNC         RTOS    COMM & SYNC
                                                                                                        CHANNELS                    CHANNELS                    CHANNELS
of the SystemC library. We evaluate the proposed virtual                     COMM & SYNC        VIRTUAL PROCESSING                                       INSTRUCTION SET
                                                                                                                                RTOS MODEL
processing model using a Canny edge detection filter.                         CHANNELS                  UNIT                                                SIMULATOR
                                                                               SLDL                       SLDL                       SLDL                       SLDL

                                                                     (a) specification model   (b) virtual processing model   (c) architecture model   (d) implementation model

1.    INTRODUCTION
   The development process of state-of-the-art embedded             Figure 1: Extension of a virtual processing model (see [4])
systems is complex and affects many different disciplines.
Among others, the design process requires hardware and
software design decisions. Today’s system complexity of em-            In this work we focus on systems that implement their
bedded Multi-Processor System-on-Chip (MPSoC) designs               behavior in software, which can be mapped and executed
is continuing at an almost exponential rate [2]. The strongly       on different PEs of an MPSoC. Fig. 1 (a), (c) and (d) show
growing complexity includes the integration of the function-        the layers for executable MPSoC models proposed in [4].
ality as well as the associated software complexity. The            All models are executed on top of a SLDL, e.g. SpecC or
development of hardware is associated with immense costs.           SystemC. The Application is user-defined behavior to be ex-
Consequently, whenever possible, designers prefer software          ecuted on the MPSoC. The layers between Application and
solutions and realize algorithms on software processors.            SLDL introduce communication, scheduling and timing tech-
   To cope with the increasing complexity and the time-to-          niques to enable a stepwise refinement from the specification
market pressure, new design methodologies are required. One         level down to the implementation.
design challenge for embedded system designers is mapping              The design step from a non-scheduled specification model
the functionality on the individual processing elements while       (a) to a complete RTOS scheduled architecture model (c)
meeting the required extra-functional properties (e.g. timing       is a complex refinement step. It requires to transform the
and power consumption) at minimal cost. To support this             application or behavioral description into a process- or thread-
challenge, System-Level Design Languages (SLDLs) enable             based model and to use the configuration and scheduling
                                                                    primitives of the selected RTOS. At this time the choice of
                                                                    the task granularity and the supported scheduling primitives
                                                                    have usually been taken. At this level, the comparison of
                                                                    different task granularities and different scheduling policies,
                                                                    supported by different RTOSs induces major redesign effort.
                                                                    In this paper, we introduce a virtual processing model (b)
                                                                    to support a smoother refinement from the unscheduled
                                                                    specification to an RTOS scheduled architecture model for
                                                                    PSMs.
EWiLi’14, November 2014, Lisbon, Portugal.                             Our approach supports PSM modeling at the specification
Copyright retained by the authors.                                  layer and enables early estimation of dynamic scheduling
effects when mapping parallel behaviors on the same PE.           Multi-Processing) or one global ready queue (Symmetric
This step can be performed without any behavior to process        Multi-Processing) used for dynamic process to core dispatch-
or RTOS specific refinements. This way, designers can profit      ing. The proposed extension supports the concept of Trans-
from simple specification model modifications in combina-         action Level Modeling (TLM) for intra-core communication.
tion with early estimated execution time annotations, thus        Both solutions focus on a process level RTOS abstraction at
enabling early decisions regarding PE allocation, behavior        the architecture and implementation level including features
to process refinement granularity, process to PE binding and      like process creation and interrupt handling. In contrast,
scheduling policy selection.                                      our proposed approach avoids process-level RTOS opera-
   The rest of the paper is organized as follows. In Section 2    tions and operates on an estimated execution time annotated
we analyze related work. Next, we discuss the design of the       specification model. Moreover, our solution keeps commu-
virtual processing model in Section 3. Followed by a brief        nication abstract and each processing element has its own
description of the implementation in Section 4, we evaluate       ready queue. After exploration, based on our virtual platform
the new virtual processing model using a Canny filter for still   model, we can transform the scheduled specification model
image edge detection in Section 5. Finally, we conclude our       into an RTOS model on the architecture level.
work in Section 6.                                                   The timing accuracy and therefore the simulation perfor-
                                                                  mance of [4, 13] is limited by the fixed minimal resolution of
2.   RELATED WORK                                                 discrete time advances. An extension deploying techniques
                                                                  with respect to preemptive scheduling models has been pre-
   An early proposal of a generic RTOS model based on Sys-        sented in [14]. The Result Oriented Modeling collects and
temC has been published in [11]. The presented abstract           consumes consecutive timing annotations while still handling
RTOS model achieves time-accurate task preemption via Sys-        preemptions accurately.
temC events and models time passing via a delay() method.            A two layer for modeling approach for software task schedul-
The RTOS overhead can be modeled as well. Two different           ing considering shared resources has been proposed in [6, 7].
task scheduling schemes are studied: the first one uses a         The design starts with an Application Layer (AL) model,
dedicated thread for the scheduler, while the second is based     which describes the functionality in terms of software tasks,
on cooperative procedure calls, avoiding this overhead. Al-       hardware modules and shared communication objects. These
though in this approach explicit inter-task communication         modeling elements are mapped on modeling elements of the
resources are required (message queue, ...), the simulation       Virtual Target Architecture Layer (VTAL): software process-
time advances simultaneously as the tasks consume their           ing elements with an RTOS model similar to [14], hardware
delays.                                                           processing elements with fixed static scheduling, memories
   In [8], an RTOS modeling tool is presented. Its main           and SystemC TLM for modeling shared buses and point-to-
purpose is to accurately model an existing RTOS on top of         point communication channels. Communication is realized
SystemC. A system designer cannot directly use it. In this ap-    via Remote Method Invocation (RMI) via shared buses or
proach, the next RTOS “event” (interrupt, scheduling event,       dedicated point-to-point channels. The individually mapped
etc.) is predicted during run-time. This improves simula-         software tasks can be annotated with Estimated Execution
tion speed, but requires deeper knowledge of the underlying       Time (EET) blocks that represent computation time. The
system.                                                           design flow is supported with preemptive and cooperative
   An RTOS based scheduling approach with focus on pre-           scheduling strategies, as well as deadline driven strategies.
cise interrupt scheduling has been proposed in [17]. For          This approach covers specification and architecture level
this purpose, a separate scheduler is introduced to handle        modeling. The main difference to our approach is that the
incoming interrupt requests. Timing annotations and syn-          RTOS model works with explicit tasks (i.e. processes). Our
chronization within user tasks are handled by a replacement       model could be refined as well to [6, 7] after PSM scheduling
of the SystemC wait(). In [16] an annotation method for           exploration.
time estimation has been presented that supports flexible
simulation and validation of real-time constraints for task
migration between different target processors. The concept        3.    VIRTUAL PROCESSING MODEL
allows preemptive scheduling in the context of priority-based
scheduling, supporting nested interrupts.                         3.1    Basic Modeling Elements
   All mentioned solutions above work on architecture level          We use an expressive subset of the program state machine
models and allow to create and handle processes and in-           (PSM) model of computation (MoC) to describe the func-
terrupts on an RTOS specific abstraction. Our solution            tionality of a system. A hierarchical PSM model with the
addresses scheduling at a higher abstraction level and keeps      corresponding thread graph is shown in Fig. 2(a) and (b). A
communication at specification abstraction, thus no need for      sequential composition of n behaviors describes a total execu-
interrupts.                                                       tion order, denoted as a n dimensional tuple: (beh1 , ..., behn ).
   Several approaches based on abstract task graphs [9, 10,       The execution starts with beh1 and finishes with behn . A
12, 15] have been proposed as well. In this case, a pure          parallel composition of n behaviors describes a partial execu-
functional SystemC model is mapped onto an architecture           tion order, denoted as a set of n behaviors {beh1 , ..., behn }.
model including an abstract RTOS. The mapping requires            The parent behavior of a parallel composition of child behav-
an abstract task graph of the model, where estimated ex-          iors will not finish until all child behaviors have completed
ecution times can be annotated on a per-task basis only,          (Fork-Join semantics). The finite-state machine behavior
ignoring control-flow dependent durations. This reduces the       composition is a special case of a sequential execution from
achievable accuracy.                                              a start state to an end state. The execution order is defined
   The proposed RTOS model in [4] can be implemented on           by state transitions.
top of any SLDL (see Fig. 1) that supports the concepts              The virtual processing model supports communication be-
of process handling and time modeling. An extension of            tween behaviors via double handshake channels (synchro-
this approach [13] presents a high-level, host-compiled multi-    nized) and shared variables (unsynchronized). A double
core RTOS simulator. This multi-core processor model can          handshake channel operates in rendezvous fashion (see Fig. 3).
run more than one process simultaneously which can be or-         When the data is transferred from the sender to the receiver,
ganized by a separate ready queue per core (Asymmetric            both behaviors resume their execution at the same time. A
                                 A                                                                   A → ttop                              ing algorithms. We decided to provide these two fundamental
                                                                                                     B → ttop
                   C
                                 B
                                                D
                                                                             [t1,t2] = fork(ttop)                                          strategies because more complex strategies can be easily de-
                                                E
                                                                                                                D → t2                     rived from them.
                                                                                C → t1                          E → t2
                            C2
                                                                        {C1,C2,C3}→t1                           F → t2
             C1
                                        G
                                                F
                                                        H
                                                                                                                  [t3,t4] = fork(t2)       3.3.1          Fixed Priority
                                                                                               G → t3              H → t4
                  C3
                                                                                                                  t2 = join(t3,t4)            The fixed priorities are statically defined. The scheduler
                                                                                                                F → t2
                                                J
                                                                                                                J → t2
                                                                                                                                           always executes the behavior that has the highest priority
                                                                               ttop = join(t1,t2)                                          and is ready to execute. We are interested in making the
                                                                                                     A → ttop
                       a) PSM model                                                         b) Thread graph
                                                                                                                                           process of priority assignment to the behaviors on the virtual
                                                                                                                                           processing model as simple as possible. The designer assigns
                  Processing Element PE1                                                 Processing Element:            Scheduler
                             A
                                                                                                PE1                      Strategy          fixed priorities only to leaf behaviors. A hierarchical higher
                                 B                                                             Sequence:                                   behavior cannot hide the priority of a leaf behavior.
                  C                         D                                                     A
                                            E
                                                                                                Parallel:                                             Set of ready behaviors                                           Set of ready behaviors
                        C2
                                                                                                   B
                                            F                                                                                                                                                                                              C1, C2, C3,
           C1
                                      G             H                                                                                                F                    C1, C2, C3
                                                                                FSM:                                Sequence:                                                                                                                 G, H
                                                                                 C                                     D
                                                                                                                                             inner behaviors             leaf behaviors                       inner behaviors           leaf behaviors
                  C3
                                            J                       State:      State:     State:                    Parallel:
                                                                                                            E                          J
                                                                     C1          C2         C3                          F                         a) Before scheduling decision                                    b) After simulation of behavior F

                  Scheduler S: fixed priority                                                                       G        H
                                                                                                                                           Figure 4: Differentiation between the priority of inner and
         c) Mapped behavior on a PE                                                   d) Nested behaviors
                                                                                                                                           leaf behaviors (example based on Fig. 2)
Figure 2: PSM model (a), which is mapped on a PE (c),
                                                                                                                                              The scheduler distinguishes between inner and leaf be-
with corresponding thread graph (b) and nesting tree (d)
                                                                                                                                           haviors. As shown in Fig. 4(a), the set of ready behaviors
                                                                                                                                           can be categorized into two subsets. Inner behaviors al-
                                                                                                                                           ways have an infinite high priority and leaf behaviors have a
communicating behavior can be blocked through communica-                                                                                   fixed priority defined by the designer. The scheduler always
tion for some time, until the other communication partner is                                                                               prefers an inner behavior over a leaf behavior. Let’s assume
ready and the handshake has been completed. To achieve a                                                                                   the behaviors C1, C2, C3, and F are ready to execute. In
high utilization of a PE the scheduler will be able to preempt                                                                             this situation, the scheduler selects behavior F because F is
blocked behaviors.                                                                                                                         an inner behavior (see Fig. 2(c)) and has infinite priority.
                                                                                                                                           Behavior F has two child behaviors G and H, which are leaf
     Receiver           Communication                Receiver                                                                              behaviors. Behavior F is waiting until the child behaviors
     interface                                       interface          Behavior B1            Channel C1               Behavior B2
                          direction
    on channel                                      on channel                                                                             have completed. Fig. 4(b) shows both child behaviors G and
                                                                                      send()
                                                                                                                                    s0
                                                                                                                                           H added to the set of behaviors ready to execute.
    Port                         C1                              blocked                                        receive()
    with                                                                                                                            s1
  sender               B1              B2
                                                               Port
                                                               with
                                                                                                                                           3.3.2          Round Robin
 interface                                                                                                                          time      All mapped behaviors on a processing element get time
                                                             receiver
         Double handshake channel                           interface                                                                      slices of the same length. If a behavior has terminated, the
              a) Communication diagram                                                     b) Sequence diagram                             scheduler selects immediately the next running behavior. The
                                                                                                                                           scheduler executes the behaviors in a circular order. If a
                        Figure 3: Double handshake protocol                                                                                behavior is not ready to execute, the next ready behavior
                                                                                                                                           with respect to the circular iteration is chosen. Fig. 5 depicts
                                                                                                                                           a round robin scheduling example. The parallel behaviors
3.2          Processing Elements                                                                                                           G and H are mapped on the same processing element. The
                                                                                                                                           behaviors G and H are scheduled by round robin and the
   A PE maintains a single thread of execution (i.e. a single
                                                                                                                                           time slice is 5 time units. In the following, we keep the
core processor). We associate exactly one scheduler with
                                                                                                                                           focus on behavior G that requests once 3 and once 12 time
one PE one scheduling algorithm. We use the terms simu-
                                                                                                                                           units. The start behavior is arbitrary because the model
lated time and simulation time to express the amount of task
                                                                                                                                           in the figure does not define one; we assume the simulation
execution time currently simulated. The terms simulation
                                                                                                                                           starts with behavior G. Behavior G requests 3 time units,
execution time and execution time refer to the amount of time
                                                                                                                                           computes, and requests 5 more time units. The request of
the simulator requires on the host computer1 . Fig. 2(c) shows
                                                                                                                                           3 time units can be consumed completely in one time slice;
the scheduler S which is assigned to PE1 and associated
                                                                                                                                           however, the following request is too complex. 2 more time
with the fixed priority scheduling strategy. In this case all
                                                                                                                                           units can be consumed after behavior H preempts behavior
behaviors mapped on PE1 need a specific fixed priority, such
                                                                                                                                           G and can start executing. At time 10, behavior G is active
that the scheduler can make a scheduling decision. Further-
                                                                                                                                           again and continues consuming the remaining 10 time units.
more, we assume that computation is only in leaf behaviors
(C1, C2, C3, E, G, H and J in Fig. 2), and hierarchical                                                                                                                                           first time request     second time
behaviors (A, B, C, D, and F in Fig. 2) describe the causal                                                                                                                                         of 3 time units request of 12 time untis

chain of execution in the model that must be followed by the                                                                                Processing Element: PE1         Behavior of G:

scheduler.                                                                                                                                       G             H            1. Request 3 time units       G    G      H       G        H        G
                                                                                                                                                                            2. Compute something
                                                                                                                                                                            3. Request 12 time units    0 2 4 6 8 10 12 14 16 18 20 22 24
3.3          Scheduling                                                                                                                      Scheduler: Round robin,
                                                                                                                                              time slice: 5 time units
                                                                                                                                                                            4. Compute something
                                                                                                                                                                            5. ...
                                                                                                                                                                                                                                           simulated time
                                                                                                                                                                                                         1. time 2. time 3. time
  We describe now the concept of scheduling for PSM models                                                                                                                                                slice   slice   slice
for the two requested fixed priority and round robin schedul-
1                                                                                                                                                                  Figure 5: Round robin scheduling
 Which is of course dependent on the host CPU, clock frequency
etc. and a comparison between simulation execution times of different
model is only possible on the same reference simulation host.                                                                                Depending on the scheduling algorithm and the commu-
nication status, a behavior can have one of the following                                                                                       status: ready, low priority        ...        status: ready, high priority
                                                                                                                                                   driven by thread t1                            driven by thread t2
states (see Fig. 6): ready, running, communicating and wait-
ing. A ready behavior can be selected by the scheduler and                                                                                 waste_time(5ms)                    B1         B2
                                                                                                                                                                                                       waste_time(3ms)
                                                                                                                                                                                                       port->receive()
executed on the associated processing element. A behavior                                                                                                                                              waste_time(7ms)

has the state running, if it is currently executing on the                                                                                                                         ...
                                                                                                                                                                                                             C1              ...
mapped processing element. A running behavior can be
preempted in two different ways: (a) end of the time-slice,
as defined by the scheduling algorithm (waiting state), (b)                                                                                Figure 8: Scheduling with communication
blocked communication request on double handshake channel
(communicating). If a behavior has completed, its status is
terminated. Fig. 6 shows all possible transitions between the                                                                      We allow preemption of running behaviors. The function
described states.                                                                                                               then of preemption is to suspend the current running pro-
                                                                                                                                cess and to resume a process the scheduling algorithm has
              end of                            communicating                      start of
           communication                                                        communication
                                                                                                                                selected to be executed next. If we suspend a process waiting
                                              executes estimated
     activation                                  computation                              termination                           on a timed event,we have to interrupt the waiting process,
                            ready                                         running                            terminated         store how much time the process has already consumed, and
                                              finished estimated
                                                                                                                                consume the remaining time later. For this reason, the Sys-
                                                                                       preempted
                                                  computation
                                                             resume
                                                                                                                                temC wait() statement in Line 6 is sensitive to two different
            completed after
            communication                                                  waiting                                              or-composed sets of events. The first event max_time notifies
                                                                                                                                the thread after the provided time slice of the scheduler is
                                                                                                                                over. The second parameter is an or-composed event list.
                    Figure 6: State automaton of a behavior                                                                     All behaviors that start communication add the correspond-
                                                                                                                                ing synchronization event of the channel to that list. This
                                                                                                                                mechanism allows preempting a current running behavior by
                                                                                                                                a behavior with higher priority that has completed commu-
4.        IMPLEMENTATION                                                                                                        nication.
4.1          Processing Element
                                                                                                                                Algorithm 1 function waste time(requested time)
  The class osss_processing_element represents a PE and in-
herits from the osss_behaviour. Fig. 7 shows the extension of                                                                    1:   req time ← requested time
                                                                                                                                 2:   while req time > 0 do
the OSSS-Behaviour class diagram [5]. The designer derives                                                                       3:      max time ← available time(requested time)
a class from osss_processing_element class and defines in the                                                                    4:      start time ← current time
constructor the execution order of the mapped behaviors on                                                                       5:      behavior status ← running
the highest hierarchical level.                                                                                                  6:      wait(max time or registered communication events)
                                                                                                                                 7:      req time ← req time - (current time - start time)
                                                                                                                                 8:      if req time = 0 then
 osss_fixed_priorty_strategy        osss_round_robin_strategy
                                                                      osss_module        sc_module                               9:         return
                                                                                                                     IF:Class
                                                                      #wait(…): void
                                                                                             0..*
                                                                                                            sc_port             10:      else
                                                                                                   1
              osss_scheduler                                 osss_behaviour
                                                                                                  1 0..*             IF:Class   11:         behavior status ← waiting
 < < virtual > > schedule(): void              <<virtual>>+main(): void                                    osss_port
 < < virtual > > available_time(): sc_time     #osss_seq(osss_sequential_behaviour_list): void 1
                                                                                                                                12:         dispatch()
                                                                                                                 1
 dispatch(): void                              #osss_fsm(osss_state_transition_list): void         osss_composite_behaviour     13:      end if
                   1                           #osss_par(osss_parallel_behaviour_list): void
                   1                           <<virtual>>#init(): void
                                                                                                   main(): void                 14:   end while
     osss_processing_element                   <<virtual>>#final(): void
                                                                                                   osss_fsm_behaviour
     osss_scheduling_attributes                       1                                         osss_parallel_behaviour
                                                                        osss_state
 sc_core::sc_time: m_simulated_time
 process_status: m_process_status
                                             0..1                                               osss_sequential_behaviour          Each PE represents a single core processor. For this reason,
 osss_priority*: m_prio                                   osss_initial_state   osss_end_state
                                                                                                                                only one behavior can be ready for the SLDL scheduler.
                                                                                                                                Otherwise, the scheduler would execute multiple behaviors
  Figure 7: OSSS-Behaviour class extensions (bold boxes)                                                                        in parallel on the same PE. Communicating behaviors are an
                                                                                                                                exception because they are waiting for their synchronization
  In the following, we describe how the basic scheduling                                                                        event. If the process of a communicating behavior would be
algorithms are designed to support preemption and commu-                                                                        suspended, the behavior would ignore the synchronization
nication.                                                                                                                       event. The function dispatch() (see Alg. 2) guarantees this
                                                                                                                                requirement. The set of behaviors is stored in two lists,
4.2          Simulation of Time                                                                                                 namely a list for inner behaviors and a list for leaf behaviors.
   Fig. 8 shows an example where the behaviors B1 and B2                                                                        The first behavior in the list of ready behaviors defines the
are mapped on the same PE under fixed priority schedul-                                                                         next running behavior. In this situation behavior B1 needs
ing strategy. Thread t1 is associated with behavior B1 and                                                                      to be suspended and behavior B2 should be ready.
thread t2 with behavior B2. We assume behavior B2 is
running and B1 ready (see Fig. 6) at the beginning and                                                                          4.3      Scheduling Strategy
neither B1 nor B2 have consumed any time. Thread t2                                                                                We decided to separate the scheduler and the scheduling
starts executing the main() function of B2 and enters the 3                                                                     strategies (see Fig. 7). The designer derives a class from
milliseconds estimated waste_time() function of the sched-                                                                      osss_scheduler. The function schedule() takes a list of all
uler, see Alg. 1. The while loop runs until the entire re-                                                                      leaf behaviors as argument. The function moves the next
quested time of a timing annotation has been consumed.                                                                          executing behavior to the beginning of the list. The func-
At the beginning of the loop, thread t2 calls the function                                                                      tion available_time defines the size of a consumable time
available_time(requested_time), which asks the scheduling                                                                       quantum.
strategy how much of the totally requested time can be con-
sumed. In our example, the scheduling strategy is fixed
priority and the current behavior has the highest priority. In                                                                  5.      EXPERIMENTS AND EVALUATION
this case, the fixed priority scheduler accepts the complete                                                                      In order to evaluate our proposed virtual processing model,
time (i.e. 3 milliseconds).                                                                                                     we focus on scheduling different partitions of a specification
Algorithm 2 function dispatch()                                                                                                                                                           Simulated                  Speedup                    LoC              ∆ [LoC]        ∆ [%]
                                                                                                                                                                                          time [ms]
 1: Unsorted List: inner behaviors, leaf behaviors
 2: Process successor process ← null                                                                                           Specification                                                  40                              -                 1497                   -          -
 3: Process current process ← get current process()                                                                            IndividualPE                                                  160                           1                    1541                44          2.9
 4: if inner behaviors = ∅ then                                                                                                BlurX                                                         140                          1.14                  1760                219         14.2
 5:    schedule(leaf behaviors)                                                                                                Blur2X                                                        120                          1.33                  1840                 80          4.5
 6:    successor process ← first element(leaf behaviors)                                                                       Blur2X2Y                                                       80                          2.00                  1897                 57          3.1
 7: else
 8:    successor process ← first element(inner behavior)
 9: end if                                                                                                                          Table 1: Comparison of selected design metrics
10: for all Behavior b in inner behavior do
11:    if process(b) 6= current process and
       status(b) 6= communicating then
12:       suspend(process(b))                                                                                                    The following models, as shown in Fig. 9, are evaluated:
13:    end if
14: end for                                                                                                                   Specification (untimed, without virtual PE), Individual
15: if current process 6= successor process then                                                                              PE (timed blur leaf behaviors, all behaviors mapped on a sin-
16:    resume(successor process)                                                                                              gle PE), BlurX (timed blur leaf behaviors, blurX4 mapped
17:    suspend(current process)
18: end if
                                                                                                                              to P E2, synchronization between P E1 and P E2 via double
                                                                                                                              handshake channel (behavior blurX4 on P E2 can only start
                                                                                                                              if behavior blurX_par on P E1 has been entered)), Blur2X
                    Processing Element PE1                                      Processing Element PE1     Sync
                                                                                                                              (timed blur leaf behaviors, parallel composition of blurX3
               Apply_Hysteresis        canny                               Apply_Hysteresis       canny
                                                                                                           Processing
                                                                                                          Element PE2
                                                                                                                              and blurX4 mapped to P E2, synchronization of parallel com-
                 prep    gaussian_smooth                                     prep    gaussian_smooth
                                                                                                             Sync             position like in BlurX) and Blur2X2Y (timed blur leaf
                               blurX_par                                                     blurX_par
      blurX1    blurX2   blurX3 blurX4                            blurX1   blurX2   blurX3
                                                                                                sync
                                                                                                            blurX4
                                                                                                            20ms
                                                                                                                              behaviors, parallel composition of blurX3 and blurX4 and
      20ms      20ms     20ms     20ms                            20ms     20ms     20ms
                                                                                                             Sync
                                                                                                                              parallel composition of blurY3 and blurY4 mapped to P E2,
      blurY1    blurY2   blurY3
                                  blurY_par
                                     blurY4                       blurY1   blurY2   blurY3
                                                                                             blurY_par
                                                                                                blurY4
                                                                                                          Scheduler S         synchronization like in Blur2X with additional synchroniza-
      20ms      20ms     20ms        20ms                         20ms     20ms     20ms        20ms
                                                                                                                              tion barrier between sequential composition of parallel blur
                  blur_done                                                   blur_done
                                                                                                                              behaviors).
  Derivative_X_Y                                                  Derivative_X_Y                                                 When neglecting communication (shared array access),
                          Non_Max_Supp                                                Non_Max_Supp
  Magnitude_X_Y                                                   Magnitude_X_Y                                               synchronization and scheduling (including context switch-
                 Scheduler S                                                 Scheduler S                                      ing) times, our model’s total simulated times, as expected
                                                                              b) BlurX
               a) IndividualPE
                    Processing Element PE1     Sync                             Processing Element PE1     Sync
                                                                                                                              by Amdahl’s law, are shown in Tab. 1. We compare the
               Apply_Hysteresis        canny                               Apply_Hysteresis       canny                       complexity of the different models using a simple Lines of
                                                 Processing                                                  Processing
                 prep    gaussian_smooth        Element PE2                  prep    gaussian_smooth        Element PE2       Code (LoC) metric. The major effort was to allocate new
                                                    Sync                                                        Sync
                                  blurY_par                                                  blurY_par                        channels and behaviors for synchronization. For instance,
      blurX1    blurX2                                            blurX1   blurX2
      20ms      20ms
                         sync        sync      blurX4
                                               20ms
                                                         blurX3
                                                         20ms     20ms     20ms
                                                                                     sync       sync       blurX3
                                                                                                           20ms
                                                                                                                     blurX4
                                                                                                                     20ms
                                                                                                                              for the model BlurX a new PE and three sync behaviors
                                  blurY_par           Sync                                   blurY_par            Sync
                                                                                                                              have been instantiated. Furthermore, the double handshake
      blurY1
      20ms
                blurY2
                20ms
                         blurY3
                         20ms
                                     blurY4
                                     20ms       Scheduler S
                                                                  blurY1
                                                                  20ms
                                                                           blurY2
                                                                           20ms
                                                                                     sync       sync                          channel was hierarchically bound from the ports of the PE’s
                                                                                                           blurY3    blurY4
                  blur_done                                                   blur_done
                                                                                                           20ms      20ms     to the ports of the blur leaf behaviors.
                                                                                                            Scheduler S          In the following, we discuss the simulation of the four
  Derivative_X_Y                                                  Derivative_X_Y

  Magnitude_X_Y
                          Non_Max_Supp
                                                                  Magnitude_X_Y
                                                                                      Non_Max_Supp                            virtual processing models using a round robin scheduler with
                                                                                                                              different time slice granularities from 1ns up to 100, 000ns
                 Scheduler S                                                 Scheduler S
                 c) Blur2X                                                  d) Blur2X2Y
                                                                                                                              on each individual PE. We measured the number of context
                                                                                                                              switches and associated them with a constant cost of 1 ms.
          Figure 9: Partitioning of the Canny edge detector                                                                   Fig. 10 shows the simulated time of model Blur2X2Y with
                                                                                                                              context switching costs for a round robin scheduling of P E1
                                                                                                                              and P E2. As expected, we can observe that fine grained time
                                                                                                                              slices < 100 ns have a huge impact on the overall simulated
model and measure the simulated time, count the sched-                                                                        time. On the other hand, the responsiveness (although not
uler calls and measure the execution time of the SystemC                                                                      necessary for the image filter design) rises.
program. The count of scheduler calls is compared to the ex-
pected number of scheduler calls depending on the scheduling
                                                                                                                                                                              Simula'on	
  'me	
  of	
  the	
  model	
  Blur2X2Y	
  with	
  'me	
  
strategy. Based on the results, we can show that the commu-                                                                                                                          annota'ons	
  for	
  context	
  switches	
  	
  
                                                                                                                                                                    240,000	
  
nication via shared variables and double handshake channels,
                                                                                                                                     simula'on	
  'me	
  [ms]	
  

                                                                                                                                                                    220,000	
  
as well as the individual scheduler on the PEs, do not violate                                                                                                      200,000	
  
                                                                                                                                                                    180,000	
  
the causal execution order defined in the specification model.                                                                                                      160,000	
  
                                                                                                                                                                    140,000	
  
   The design we use for evaluation is a Canny edge detector                                                                                                        120,000	
  
[1], a graphical filter to detect edges in a gray-scale image. As                                                                                                   100,000	
  
                                                                                                                                                                     80,000	
  
starting point, we used an existing SpecC PSM model of the                                                                                                           60,000	
  
filter, transformed it into an OSSS-Behaviour [5] model, and                                                                                                                      1	
         10	
              100	
             1,000	
          10,000	
       100,000	
  
                                                                                                                                                                                            Time	
  slice	
  of	
  the	
  individual	
  schedulers	
  [ns]	
  
implemented different virtual processing models, as shown
in Fig. 9. Communication is performed via shared variables
and double handshake channels. An array containing the                                                                           Figure 10: Simulation with costs for context switches
complete image is shared among the blur behaviors. Each
of these behaviors manipulates pixels on non-overlapping                                                                        Fig. 11 visualizes the ratio between the simulated time for
tiles of the image. In the Canny algorithm blurring is the                                                                    context switches and computation for a 20 ms leaf behavior
most computationally intensive block and therefore we map                                                                     timing annotation. When the time slice is very short, almost
different combinations of blur leaf behaviors to a second                                                                     70 % of the simulated time is spent on context switches.
processing element. For our evaluation, we only require                                                                       Fig. 12 shows the measured execution time of the various
timing annotations for parallel and mapped leaf behaviors.                                                                    models on an Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00
We annotated each blur behavior with 20ms.                                                                                    GHz with 4 GB RAM running Fedora 12 Linux using the
                                                        Ra$o	
  between	
  context	
  switch	
  and	
  computa$on	
  $me	
  for	
                                                    with different time slice granularities. So far we did not
                                                                             the	
  model	
  Blur2X2Y	
  

  Ra$o	
  between	
  context	
  switch	
  
                                              100%	
                                                                                                                                 evaluate our simulation results against measurements on a
             and	
  computa$on	
  $me	
  
                                                     80%	
                                                                                                                           real platform. To do a trade-off between simulation speed
                                                                                                                                                              Computa2on	
  
                                                     60%	
                                                                                                    2me	
  
                                                                                                                                                                                     and accuracy, a comparison with measurement results is
                                                     40%	
                                                                                                    Context	
  
                                                                                                                                                                                     necessary and part of future work. Currently, the context
                                                     20%	
                                                                                                    switch	
  2me	
        switch penalty and communication delay is handled in a very
                                                      0%	
  
                                                                                                                                                                                     simple way and after the conduction of measurement trails
                                                                                                                                                                                     these timing models will be refined.
                                                                 1	
  
                                                                         2	
  
                                                                                      5	
  

                                                                                      0	
  
                                                                                      0	
  

                                                                                  1, 	
  
                                                                                  2, 	
  
                                                                                  5, 	
  
                                                                                 10 0	
  
                                                                                 20 0	
  
                                                                                        	
                                                                                     10
                                                                                     20
                                                                                     50


                                                                                      0
                                                                                       0
                                                                                       0


                                                                                     00
                                                                                    10
                                                                                    20
                                                                                    50
                                                                                    00
                                                                                    00
                                                                                    00

                                                                                     0
                                                                                   ,0
                                                                                   ,0
                                                                                 Time	
  slice	
  of	
  individual	
  scheduler	
  [ns]	
  

                                                                                                                                                                                     Acknowledgement
Figure 11: Ratio between context switches and computation
time                                                                                                                                                                                 This work has been partially supported by the ARAMiS
                                                                                                                                                                                     project (01IS11035M) and the EMC2 collaborative ARTEMIS
                                                                                                                                                                                     project (01IS14002R), both funded by the German Federal
time command. From this measurement, we can observe that                                                                                                                             Ministry of Research and Education (BMBF).
the execution time of the individual models is proportional
to the number of context switches, as shown in Fig. 10.                                                                                                                              References
We traced the individual scheduler calls and compared the                                                                                                                             [1] J. Canny. A Computational Approach to Edge Detection. Pat-
                                                                                                                                                                                          tern Analysis and Machine Intelligence, IEEE Transactions
                                                                       Execu&on	
  &me	
  for	
  round	
  robin	
  scheduled	
  processing	
                                              on, PAMI-8(6):679–698, nov. 1986.
                                                                                                   elements	
                                                                         [2] D. D. Gajski, S. Abdi, A. Gerstlauer, and G. Schirner. Embedded
                                          1.100	
                                                                                                                                         System Design: Modeling, Synthesis and Verification. Springer,
                                          1.000	
                                                                                                             Spec.	
  Model	
            1st edition, 2009.
           Execu&on	
  &me	
  [sec]	
  

                                          0.900	
                                                                                                                                     [3] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao.
                                          0.800	
                                                                                                             Individual	
  PE	
  
                                          0.700	
  
                                                                                                                                                                                          SpecC: Specification Language and Methodology. Springer, 1
                                                                                                                                                              Blur	
  X	
                 edition, 2000.
                                          0.600	
  
                                          0.500	
                                                                                                             Blur	
  2X	
            [4] A. Gerstlauer, H. Yu, and D. D. Gajski. RTOS Modeling for
                                          0.400	
  
                                                                                                                                                              Blur	
  2X	
  2Y	
          System Level Design. In Proceedings of DATE. IEEE Computer
                                          0.300	
                                                                                                                                         Society, 2003.
                                          0.200	
                                                                                                                                     [5] K. Grüttner and W. Nebel. Modelling Program–State Machines
                                                               1	
               10	
            100	
          1,000	
         10,000	
        100,000	
  
                                                                                                                                                                                          in SystemC. In Forum on Specification and Design Languages
                                                                                    Time	
  slice	
  of	
  the	
  individual	
  schedulers	
  [ns]	
  
                                                                                                                                                                                          2008, 09 2008.
                                                                                                                                                                                      [6] P. Hartmann, H. Kleen, P. Reinkemeier, and W. Nebel. Efficient
Figure 12: Simulation execution times of canny edge detector                                                                                                                              modelling and simulation of embedded software multi-tasking us-
partitionings                                                                                                                                                                             ing SystemC and OSSS. In Specification, Verification and De-
                                                                                                                                                                                          sign Languages, 2008. FDL 2008. Forum on, pages 19–24, Sept
                                                                                                                                                                                          2008.
execution order of leaf behaviors (i.e. the causal chain)                                                                                                                             [7] P. A. Hartmann, K. Grüttner, A. Rettberg, and I. Podolski.
                                                                                                                                                                                          Distributed Resource-Aware Scheduling for Multi-Core Architec-
between the specification and the Virtual Processing Models.                                                                                                                              tures with SystemC. In Distributed, Parallel and Biologically
In the specification model, the blur behavior’s execution                                                                                                                                 Inspired Systems, volume 329, pages 181–192. Springer, 2010.
order was blurX1, ... , blurX4, while in the different VPM                                                                                                                            [8] Z. He, A. Mok, and C. Peng. Timed RTOS modeling for Embed-
                                                                                                                                                                                          ded System Design. In Real Time and Embedded Technology
models the execution order changed to blurX4, ... , blurX1.                                                                                                                               and Applications Symposium (RTAS’05), 2005.
Even though the ordering is different, validity of causality                                                                                                                          [9] S. Huss and S. Klaus. Assessment of Real-Time Operating Sys-
for parallel compositions (partial order) only requires to be                                                                                                                             tems Characteristics in Embedded Systems Design by SystemC
order isomorph, which is the case.                                                                                                                                                        models of RTOS services. In Proceedings of Design & Verifica-
                                                                                                                                                                                          tion Conference and Exibition (DVCon’07), 2007.
                                                                                                                                                                                     [10] T. Kempf, M. Dörper, R. Leupers, G. Ascheid, H. Meyr, T. Ko-
                                                                                                                                                                                          gel, and B. Vanthournout. A Modular Simulation Framework for
6.   CONCLUSION AND OUTLOOK                                                                                                                                                               Spatial and Temporal Task Mapping onto Multi-Processor SoC
   In this paper, we extended the proposed methodology in                                                                                                                                 Platforms. In Proceedings of DATE, 2005.
[4] and introduced a novel virtual processing model for PSM                                                                                                                          [11] R. Le Moigne, O. Pasquier, and J. Calvez. A Generic RTOS
                                                                                                                                                                                          Model for Real-Time Systems Simulation with SystemC. In Pro-
based models. The existing methodology allowed scheduling                                                                                                                                 ceedings of DATE, 2004.
of processes using generic RTOS primitives. The design step                                                                                                                          [12] S. Mahadevan, M. Storgaard, J. Madsen, and K. Virk. Arts: A
from a non-scheduled specification model to a process-based                                                                                                                               System-Level Framework for Modeling MPSoC Components and
                                                                                                                                                                                          Analysis of their Causality. In 13th IEEE International Sympo-
RTOS scheduled architecture model is a major refinement                                                                                                                                   sium on Modeling, Analysis, and Simulation of Computer and
step. For this reason, we proposed to introduce an inter-                                                                                                                                 Telecommunication Systems, 2005.
mediate model, called virtual processing model. This model                                                                                                                           [13] P. Razaghi and A. Gerstlauer. Host-Compiled Multicore RTOS
                                                                                                                                                                                          Simulator for Embedded Real-Time Software Development. In
enables to add a scheduler with user defined scheduling al-                                                                                                                               Proceedings of DATE. IEEE, 2011.
gorithm to a behavior, called virtual processing unit. This                                                                                                                          [14] G. Schirner and R. Dömer. Introducing preemptive scheduling
flexible scheduling annotation enables fast and easy explo-                                                                                                                               in abstract RTOS models using result oriented modeling. In
                                                                                                                                                                                          Proceedings of DATE, New York, NY, USA, 2008. ACM.
ration, regarding scheduling granularities of behaviors and                                                                                                                          [15] M. Streubühr, J. Falk, C. Haubelt, J. Teich, R. Dorsch, and
assignments of scheduling policies. After successful explo-                                                                                                                               T. Schlipf. Task Accurate Performance Modeling in SystemC
ration, the behavior to process transformation and RTOS                                                                                                                                   for Real-Time Multi-Processor Architectures. In Proceedings of
                                                                                                                                                                                          DATE, 2006.
configuration for architecture refinement can be performed.                                                                                                                          [16] H. Zabel and W. Müller. An Efficient Time Annotation Tech-
We have sketched how to use SystemC to implement our                                                                                                                                      nique in Abstract RTOS Simulations for Multiprocessor Task
virtual processing model. Furthermore, we have integrated                                                                                                                                 Migration. In Distributed, Parallel and Biologically Inspired
the virtual processing model into the OSSS-Behaviour li-                                                                                                                                  Systems, volume 271 of IFIP Advances in Information and
                                                                                                                                                                                          Communication Technology. Springer, 2008.
brary, supporting program state machine (PSM) modeling                                                                                                                               [17] H. Zabel, W. Müller, and A. Gerstlauer. Accurate RTOS Mod-
in SystemC. Our implementation concept allows designers to                                                                                                                                eling and Analysis with SystemC. In W. Ecker, W. Müller, and
implement new scheduling strategies. For the evaluation, we                                                                                                                               R. Dömer, editors, Hardware-dependent Software, pages 233–
                                                                                                                                                                                          260. Springer Netherlands, 2009.
used a Canny filter design and created different behavior par-
titions and scheduler configurations. The evaluation showed
that our extension retains the functional causalities of the
original PSM model when using a round robin scheduling