=Paper=
{{Paper
|id=Vol-1291/ewili14_3
|storemode=property
|title=A Program State Machine Based Virtual Processing Model in SystemC
|pdfUrl=https://ceur-ws.org/Vol-1291/ewili14_3.pdf
|volume=Vol-1291
|dblpUrl=https://dblp.org/rec/conf/ewili/SchmidtGDR14
}}
==A Program State Machine Based Virtual Processing Model in SystemC==
A Program State Machine Based Virtual Processing Model
in SystemC
Tim Schmidt1 , Kim Grüttner2 , Rainer Dömer1 , Achim Rettberg3
1
University of California, Irvine, USA
2
OFFIS – Institute for Information Technology, Oldenburg, Germany
3
Carl von Ossietzky University Oldenburg, Germany
ABSTRACT to raise the level of abstraction and support early design deci-
The Program State Machine (PSM) Model of Computation sions. In [3], different abstraction levels have been proposed.
offers a rich set of modeling elements to describe behavioral The specification level enables untimed modeling of function-
and structural hierarchy, concurrency, synchronization, state ality and causality between behaviors in an executable model.
transitions and timing. With the rising software complexity Behaviors can be statically composed in a sequential order,
of today’s embedded systems, the use of Real-Time Operat- as finite-state machine or parallel. Communication between
ing Systems (RTOS) has become state-of-the-art for nearly functions is described using double handshake channels with
all System-on-Chip designs. Regrettably, the PSM model message passing and shared variables. The architecture level
itself has insufficient support for the specification of the pre- introduces processing elements (PE) that execute behaviors
emptive dynamic scheduling behavior of an RTOS. In this in sequential order. Behaviors are annotated with delays to
paper, we propose a model for dynamically dispatching PSM specify the estimated execution times on the PEs. Commu-
models on a virtual processing element. Our model aims nication between PEs is described through message passing
to abstract from the targeted RTOS and the processor core communication with annotated delays. The implementation
through execution time annotations and a flexible preemp- level adds instruction and cycle accurate timing for PEs and
tive scheduler model. Mapping a PSM model to a set of signal level protocols with cycle accurate communication
scheduled virtual processing elements only requires minor times.
model transformation and enables early exploration of dif-
ferent processing element mappings and scheduling policies. APPLICATION APPLICATION APPLICATION APPLICATION
Our virtual processing model for PSMs is realized on top COMM & SYNC COMM & SYNC RTOS COMM & SYNC
CHANNELS CHANNELS CHANNELS
of the SystemC library. We evaluate the proposed virtual COMM & SYNC VIRTUAL PROCESSING INSTRUCTION SET
RTOS MODEL
processing model using a Canny edge detection filter. CHANNELS UNIT SIMULATOR
SLDL SLDL SLDL SLDL
(a) specification model (b) virtual processing model (c) architecture model (d) implementation model
1. INTRODUCTION
The development process of state-of-the-art embedded Figure 1: Extension of a virtual processing model (see [4])
systems is complex and affects many different disciplines.
Among others, the design process requires hardware and
software design decisions. Today’s system complexity of em- In this work we focus on systems that implement their
bedded Multi-Processor System-on-Chip (MPSoC) designs behavior in software, which can be mapped and executed
is continuing at an almost exponential rate [2]. The strongly on different PEs of an MPSoC. Fig. 1 (a), (c) and (d) show
growing complexity includes the integration of the function- the layers for executable MPSoC models proposed in [4].
ality as well as the associated software complexity. The All models are executed on top of a SLDL, e.g. SpecC or
development of hardware is associated with immense costs. SystemC. The Application is user-defined behavior to be ex-
Consequently, whenever possible, designers prefer software ecuted on the MPSoC. The layers between Application and
solutions and realize algorithms on software processors. SLDL introduce communication, scheduling and timing tech-
To cope with the increasing complexity and the time-to- niques to enable a stepwise refinement from the specification
market pressure, new design methodologies are required. One level down to the implementation.
design challenge for embedded system designers is mapping The design step from a non-scheduled specification model
the functionality on the individual processing elements while (a) to a complete RTOS scheduled architecture model (c)
meeting the required extra-functional properties (e.g. timing is a complex refinement step. It requires to transform the
and power consumption) at minimal cost. To support this application or behavioral description into a process- or thread-
challenge, System-Level Design Languages (SLDLs) enable based model and to use the configuration and scheduling
primitives of the selected RTOS. At this time the choice of
the task granularity and the supported scheduling primitives
have usually been taken. At this level, the comparison of
different task granularities and different scheduling policies,
supported by different RTOSs induces major redesign effort.
In this paper, we introduce a virtual processing model (b)
to support a smoother refinement from the unscheduled
specification to an RTOS scheduled architecture model for
PSMs.
EWiLi’14, November 2014, Lisbon, Portugal. Our approach supports PSM modeling at the specification
Copyright retained by the authors. layer and enables early estimation of dynamic scheduling
effects when mapping parallel behaviors on the same PE. Multi-Processing) or one global ready queue (Symmetric
This step can be performed without any behavior to process Multi-Processing) used for dynamic process to core dispatch-
or RTOS specific refinements. This way, designers can profit ing. The proposed extension supports the concept of Trans-
from simple specification model modifications in combina- action Level Modeling (TLM) for intra-core communication.
tion with early estimated execution time annotations, thus Both solutions focus on a process level RTOS abstraction at
enabling early decisions regarding PE allocation, behavior the architecture and implementation level including features
to process refinement granularity, process to PE binding and like process creation and interrupt handling. In contrast,
scheduling policy selection. our proposed approach avoids process-level RTOS opera-
The rest of the paper is organized as follows. In Section 2 tions and operates on an estimated execution time annotated
we analyze related work. Next, we discuss the design of the specification model. Moreover, our solution keeps commu-
virtual processing model in Section 3. Followed by a brief nication abstract and each processing element has its own
description of the implementation in Section 4, we evaluate ready queue. After exploration, based on our virtual platform
the new virtual processing model using a Canny filter for still model, we can transform the scheduled specification model
image edge detection in Section 5. Finally, we conclude our into an RTOS model on the architecture level.
work in Section 6. The timing accuracy and therefore the simulation perfor-
mance of [4, 13] is limited by the fixed minimal resolution of
2. RELATED WORK discrete time advances. An extension deploying techniques
with respect to preemptive scheduling models has been pre-
An early proposal of a generic RTOS model based on Sys- sented in [14]. The Result Oriented Modeling collects and
temC has been published in [11]. The presented abstract consumes consecutive timing annotations while still handling
RTOS model achieves time-accurate task preemption via Sys- preemptions accurately.
temC events and models time passing via a delay() method. A two layer for modeling approach for software task schedul-
The RTOS overhead can be modeled as well. Two different ing considering shared resources has been proposed in [6, 7].
task scheduling schemes are studied: the first one uses a The design starts with an Application Layer (AL) model,
dedicated thread for the scheduler, while the second is based which describes the functionality in terms of software tasks,
on cooperative procedure calls, avoiding this overhead. Al- hardware modules and shared communication objects. These
though in this approach explicit inter-task communication modeling elements are mapped on modeling elements of the
resources are required (message queue, ...), the simulation Virtual Target Architecture Layer (VTAL): software process-
time advances simultaneously as the tasks consume their ing elements with an RTOS model similar to [14], hardware
delays. processing elements with fixed static scheduling, memories
In [8], an RTOS modeling tool is presented. Its main and SystemC TLM for modeling shared buses and point-to-
purpose is to accurately model an existing RTOS on top of point communication channels. Communication is realized
SystemC. A system designer cannot directly use it. In this ap- via Remote Method Invocation (RMI) via shared buses or
proach, the next RTOS “event” (interrupt, scheduling event, dedicated point-to-point channels. The individually mapped
etc.) is predicted during run-time. This improves simula- software tasks can be annotated with Estimated Execution
tion speed, but requires deeper knowledge of the underlying Time (EET) blocks that represent computation time. The
system. design flow is supported with preemptive and cooperative
An RTOS based scheduling approach with focus on pre- scheduling strategies, as well as deadline driven strategies.
cise interrupt scheduling has been proposed in [17]. For This approach covers specification and architecture level
this purpose, a separate scheduler is introduced to handle modeling. The main difference to our approach is that the
incoming interrupt requests. Timing annotations and syn- RTOS model works with explicit tasks (i.e. processes). Our
chronization within user tasks are handled by a replacement model could be refined as well to [6, 7] after PSM scheduling
of the SystemC wait(). In [16] an annotation method for exploration.
time estimation has been presented that supports flexible
simulation and validation of real-time constraints for task
migration between different target processors. The concept 3. VIRTUAL PROCESSING MODEL
allows preemptive scheduling in the context of priority-based
scheduling, supporting nested interrupts. 3.1 Basic Modeling Elements
All mentioned solutions above work on architecture level We use an expressive subset of the program state machine
models and allow to create and handle processes and in- (PSM) model of computation (MoC) to describe the func-
terrupts on an RTOS specific abstraction. Our solution tionality of a system. A hierarchical PSM model with the
addresses scheduling at a higher abstraction level and keeps corresponding thread graph is shown in Fig. 2(a) and (b). A
communication at specification abstraction, thus no need for sequential composition of n behaviors describes a total execu-
interrupts. tion order, denoted as a n dimensional tuple: (beh1 , ..., behn ).
Several approaches based on abstract task graphs [9, 10, The execution starts with beh1 and finishes with behn . A
12, 15] have been proposed as well. In this case, a pure parallel composition of n behaviors describes a partial execu-
functional SystemC model is mapped onto an architecture tion order, denoted as a set of n behaviors {beh1 , ..., behn }.
model including an abstract RTOS. The mapping requires The parent behavior of a parallel composition of child behav-
an abstract task graph of the model, where estimated ex- iors will not finish until all child behaviors have completed
ecution times can be annotated on a per-task basis only, (Fork-Join semantics). The finite-state machine behavior
ignoring control-flow dependent durations. This reduces the composition is a special case of a sequential execution from
achievable accuracy. a start state to an end state. The execution order is defined
The proposed RTOS model in [4] can be implemented on by state transitions.
top of any SLDL (see Fig. 1) that supports the concepts The virtual processing model supports communication be-
of process handling and time modeling. An extension of tween behaviors via double handshake channels (synchro-
this approach [13] presents a high-level, host-compiled multi- nized) and shared variables (unsynchronized). A double
core RTOS simulator. This multi-core processor model can handshake channel operates in rendezvous fashion (see Fig. 3).
run more than one process simultaneously which can be or- When the data is transferred from the sender to the receiver,
ganized by a separate ready queue per core (Asymmetric both behaviors resume their execution at the same time. A
A A → ttop ing algorithms. We decided to provide these two fundamental
B → ttop
C
B
D
[t1,t2] = fork(ttop) strategies because more complex strategies can be easily de-
E
D → t2 rived from them.
C → t1 E → t2
C2
{C1,C2,C3}→t1 F → t2
C1
G
F
H
[t3,t4] = fork(t2) 3.3.1 Fixed Priority
G → t3 H → t4
C3
t2 = join(t3,t4) The fixed priorities are statically defined. The scheduler
F → t2
J
J → t2
always executes the behavior that has the highest priority
ttop = join(t1,t2) and is ready to execute. We are interested in making the
A → ttop
a) PSM model b) Thread graph
process of priority assignment to the behaviors on the virtual
processing model as simple as possible. The designer assigns
Processing Element PE1 Processing Element: Scheduler
A
PE1 Strategy fixed priorities only to leaf behaviors. A hierarchical higher
B Sequence: behavior cannot hide the priority of a leaf behavior.
C D A
E
Parallel: Set of ready behaviors Set of ready behaviors
C2
B
F C1, C2, C3,
C1
G H F C1, C2, C3
FSM: Sequence: G, H
C D
inner behaviors leaf behaviors inner behaviors leaf behaviors
C3
J State: State: State: Parallel:
E J
C1 C2 C3 F a) Before scheduling decision b) After simulation of behavior F
Scheduler S: fixed priority G H
Figure 4: Differentiation between the priority of inner and
c) Mapped behavior on a PE d) Nested behaviors
leaf behaviors (example based on Fig. 2)
Figure 2: PSM model (a), which is mapped on a PE (c),
The scheduler distinguishes between inner and leaf be-
with corresponding thread graph (b) and nesting tree (d)
haviors. As shown in Fig. 4(a), the set of ready behaviors
can be categorized into two subsets. Inner behaviors al-
ways have an infinite high priority and leaf behaviors have a
communicating behavior can be blocked through communica- fixed priority defined by the designer. The scheduler always
tion for some time, until the other communication partner is prefers an inner behavior over a leaf behavior. Let’s assume
ready and the handshake has been completed. To achieve a the behaviors C1, C2, C3, and F are ready to execute. In
high utilization of a PE the scheduler will be able to preempt this situation, the scheduler selects behavior F because F is
blocked behaviors. an inner behavior (see Fig. 2(c)) and has infinite priority.
Behavior F has two child behaviors G and H, which are leaf
Receiver Communication Receiver behaviors. Behavior F is waiting until the child behaviors
interface interface Behavior B1 Channel C1 Behavior B2
direction
on channel on channel have completed. Fig. 4(b) shows both child behaviors G and
send()
s0
H added to the set of behaviors ready to execute.
Port C1 blocked receive()
with s1
sender B1 B2
Port
with
3.3.2 Round Robin
interface time All mapped behaviors on a processing element get time
receiver
Double handshake channel interface slices of the same length. If a behavior has terminated, the
a) Communication diagram b) Sequence diagram scheduler selects immediately the next running behavior. The
scheduler executes the behaviors in a circular order. If a
Figure 3: Double handshake protocol behavior is not ready to execute, the next ready behavior
with respect to the circular iteration is chosen. Fig. 5 depicts
a round robin scheduling example. The parallel behaviors
3.2 Processing Elements G and H are mapped on the same processing element. The
behaviors G and H are scheduled by round robin and the
A PE maintains a single thread of execution (i.e. a single
time slice is 5 time units. In the following, we keep the
core processor). We associate exactly one scheduler with
focus on behavior G that requests once 3 and once 12 time
one PE one scheduling algorithm. We use the terms simu-
units. The start behavior is arbitrary because the model
lated time and simulation time to express the amount of task
in the figure does not define one; we assume the simulation
execution time currently simulated. The terms simulation
starts with behavior G. Behavior G requests 3 time units,
execution time and execution time refer to the amount of time
computes, and requests 5 more time units. The request of
the simulator requires on the host computer1 . Fig. 2(c) shows
3 time units can be consumed completely in one time slice;
the scheduler S which is assigned to PE1 and associated
however, the following request is too complex. 2 more time
with the fixed priority scheduling strategy. In this case all
units can be consumed after behavior H preempts behavior
behaviors mapped on PE1 need a specific fixed priority, such
G and can start executing. At time 10, behavior G is active
that the scheduler can make a scheduling decision. Further-
again and continues consuming the remaining 10 time units.
more, we assume that computation is only in leaf behaviors
(C1, C2, C3, E, G, H and J in Fig. 2), and hierarchical first time request second time
behaviors (A, B, C, D, and F in Fig. 2) describe the causal of 3 time units request of 12 time untis
chain of execution in the model that must be followed by the Processing Element: PE1 Behavior of G:
scheduler. G H 1. Request 3 time units G G H G H G
2. Compute something
3. Request 12 time units 0 2 4 6 8 10 12 14 16 18 20 22 24
3.3 Scheduling Scheduler: Round robin,
time slice: 5 time units
4. Compute something
5. ...
simulated time
1. time 2. time 3. time
We describe now the concept of scheduling for PSM models slice slice slice
for the two requested fixed priority and round robin schedul-
1 Figure 5: Round robin scheduling
Which is of course dependent on the host CPU, clock frequency
etc. and a comparison between simulation execution times of different
model is only possible on the same reference simulation host. Depending on the scheduling algorithm and the commu-
nication status, a behavior can have one of the following status: ready, low priority ... status: ready, high priority
driven by thread t1 driven by thread t2
states (see Fig. 6): ready, running, communicating and wait-
ing. A ready behavior can be selected by the scheduler and waste_time(5ms) B1 B2
waste_time(3ms)
port->receive()
executed on the associated processing element. A behavior waste_time(7ms)
has the state running, if it is currently executing on the ...
C1 ...
mapped processing element. A running behavior can be
preempted in two different ways: (a) end of the time-slice,
as defined by the scheduling algorithm (waiting state), (b) Figure 8: Scheduling with communication
blocked communication request on double handshake channel
(communicating). If a behavior has completed, its status is
terminated. Fig. 6 shows all possible transitions between the We allow preemption of running behaviors. The function
described states. then of preemption is to suspend the current running pro-
cess and to resume a process the scheduling algorithm has
end of communicating start of
communication communication
selected to be executed next. If we suspend a process waiting
executes estimated
activation computation termination on a timed event,we have to interrupt the waiting process,
ready running terminated store how much time the process has already consumed, and
finished estimated
consume the remaining time later. For this reason, the Sys-
preempted
computation
resume
temC wait() statement in Line 6 is sensitive to two different
completed after
communication waiting or-composed sets of events. The first event max_time notifies
the thread after the provided time slice of the scheduler is
over. The second parameter is an or-composed event list.
Figure 6: State automaton of a behavior All behaviors that start communication add the correspond-
ing synchronization event of the channel to that list. This
mechanism allows preempting a current running behavior by
a behavior with higher priority that has completed commu-
4. IMPLEMENTATION nication.
4.1 Processing Element
Algorithm 1 function waste time(requested time)
The class osss_processing_element represents a PE and in-
herits from the osss_behaviour. Fig. 7 shows the extension of 1: req time ← requested time
2: while req time > 0 do
the OSSS-Behaviour class diagram [5]. The designer derives 3: max time ← available time(requested time)
a class from osss_processing_element class and defines in the 4: start time ← current time
constructor the execution order of the mapped behaviors on 5: behavior status ← running
the highest hierarchical level. 6: wait(max time or registered communication events)
7: req time ← req time - (current time - start time)
8: if req time = 0 then
osss_fixed_priorty_strategy osss_round_robin_strategy
osss_module sc_module 9: return
IF:Class
#wait(…): void
0..*
sc_port 10: else
1
osss_scheduler osss_behaviour
1 0..* IF:Class 11: behavior status ← waiting
< < virtual > > schedule(): void <>+main(): void osss_port
< < virtual > > available_time(): sc_time #osss_seq(osss_sequential_behaviour_list): void 1
12: dispatch()
1
dispatch(): void #osss_fsm(osss_state_transition_list): void osss_composite_behaviour 13: end if
1 #osss_par(osss_parallel_behaviour_list): void
1 <>#init(): void
main(): void 14: end while
osss_processing_element <>#final(): void
osss_fsm_behaviour
osss_scheduling_attributes 1 osss_parallel_behaviour
osss_state
sc_core::sc_time: m_simulated_time
process_status: m_process_status
0..1 osss_sequential_behaviour Each PE represents a single core processor. For this reason,
osss_priority*: m_prio osss_initial_state osss_end_state
only one behavior can be ready for the SLDL scheduler.
Otherwise, the scheduler would execute multiple behaviors
Figure 7: OSSS-Behaviour class extensions (bold boxes) in parallel on the same PE. Communicating behaviors are an
exception because they are waiting for their synchronization
In the following, we describe how the basic scheduling event. If the process of a communicating behavior would be
algorithms are designed to support preemption and commu- suspended, the behavior would ignore the synchronization
nication. event. The function dispatch() (see Alg. 2) guarantees this
requirement. The set of behaviors is stored in two lists,
4.2 Simulation of Time namely a list for inner behaviors and a list for leaf behaviors.
Fig. 8 shows an example where the behaviors B1 and B2 The first behavior in the list of ready behaviors defines the
are mapped on the same PE under fixed priority schedul- next running behavior. In this situation behavior B1 needs
ing strategy. Thread t1 is associated with behavior B1 and to be suspended and behavior B2 should be ready.
thread t2 with behavior B2. We assume behavior B2 is
running and B1 ready (see Fig. 6) at the beginning and 4.3 Scheduling Strategy
neither B1 nor B2 have consumed any time. Thread t2 We decided to separate the scheduler and the scheduling
starts executing the main() function of B2 and enters the 3 strategies (see Fig. 7). The designer derives a class from
milliseconds estimated waste_time() function of the sched- osss_scheduler. The function schedule() takes a list of all
uler, see Alg. 1. The while loop runs until the entire re- leaf behaviors as argument. The function moves the next
quested time of a timing annotation has been consumed. executing behavior to the beginning of the list. The func-
At the beginning of the loop, thread t2 calls the function tion available_time defines the size of a consumable time
available_time(requested_time), which asks the scheduling quantum.
strategy how much of the totally requested time can be con-
sumed. In our example, the scheduling strategy is fixed
priority and the current behavior has the highest priority. In 5. EXPERIMENTS AND EVALUATION
this case, the fixed priority scheduler accepts the complete In order to evaluate our proposed virtual processing model,
time (i.e. 3 milliseconds). we focus on scheduling different partitions of a specification
Algorithm 2 function dispatch() Simulated Speedup LoC ∆ [LoC] ∆ [%]
time [ms]
1: Unsorted List: inner behaviors, leaf behaviors
2: Process successor process ← null Specification 40 - 1497 - -
3: Process current process ← get current process() IndividualPE 160 1 1541 44 2.9
4: if inner behaviors = ∅ then BlurX 140 1.14 1760 219 14.2
5: schedule(leaf behaviors) Blur2X 120 1.33 1840 80 4.5
6: successor process ← first element(leaf behaviors) Blur2X2Y 80 2.00 1897 57 3.1
7: else
8: successor process ← first element(inner behavior)
9: end if Table 1: Comparison of selected design metrics
10: for all Behavior b in inner behavior do
11: if process(b) 6= current process and
status(b) 6= communicating then
12: suspend(process(b)) The following models, as shown in Fig. 9, are evaluated:
13: end if
14: end for Specification (untimed, without virtual PE), Individual
15: if current process 6= successor process then PE (timed blur leaf behaviors, all behaviors mapped on a sin-
16: resume(successor process) gle PE), BlurX (timed blur leaf behaviors, blurX4 mapped
17: suspend(current process)
18: end if
to P E2, synchronization between P E1 and P E2 via double
handshake channel (behavior blurX4 on P E2 can only start
if behavior blurX_par on P E1 has been entered)), Blur2X
Processing Element PE1 Processing Element PE1 Sync
(timed blur leaf behaviors, parallel composition of blurX3
Apply_Hysteresis canny Apply_Hysteresis canny
Processing
Element PE2
and blurX4 mapped to P E2, synchronization of parallel com-
prep gaussian_smooth prep gaussian_smooth
Sync position like in BlurX) and Blur2X2Y (timed blur leaf
blurX_par blurX_par
blurX1 blurX2 blurX3 blurX4 blurX1 blurX2 blurX3
sync
blurX4
20ms
behaviors, parallel composition of blurX3 and blurX4 and
20ms 20ms 20ms 20ms 20ms 20ms 20ms
Sync
parallel composition of blurY3 and blurY4 mapped to P E2,
blurY1 blurY2 blurY3
blurY_par
blurY4 blurY1 blurY2 blurY3
blurY_par
blurY4
Scheduler S synchronization like in Blur2X with additional synchroniza-
20ms 20ms 20ms 20ms 20ms 20ms 20ms 20ms
tion barrier between sequential composition of parallel blur
blur_done blur_done
behaviors).
Derivative_X_Y Derivative_X_Y When neglecting communication (shared array access),
Non_Max_Supp Non_Max_Supp
Magnitude_X_Y Magnitude_X_Y synchronization and scheduling (including context switch-
Scheduler S Scheduler S ing) times, our model’s total simulated times, as expected
b) BlurX
a) IndividualPE
Processing Element PE1 Sync Processing Element PE1 Sync
by Amdahl’s law, are shown in Tab. 1. We compare the
Apply_Hysteresis canny Apply_Hysteresis canny complexity of the different models using a simple Lines of
Processing Processing
prep gaussian_smooth Element PE2 prep gaussian_smooth Element PE2 Code (LoC) metric. The major effort was to allocate new
Sync Sync
blurY_par blurY_par channels and behaviors for synchronization. For instance,
blurX1 blurX2 blurX1 blurX2
20ms 20ms
sync sync blurX4
20ms
blurX3
20ms 20ms 20ms
sync sync blurX3
20ms
blurX4
20ms
for the model BlurX a new PE and three sync behaviors
blurY_par Sync blurY_par Sync
have been instantiated. Furthermore, the double handshake
blurY1
20ms
blurY2
20ms
blurY3
20ms
blurY4
20ms Scheduler S
blurY1
20ms
blurY2
20ms
sync sync channel was hierarchically bound from the ports of the PE’s
blurY3 blurY4
blur_done blur_done
20ms 20ms to the ports of the blur leaf behaviors.
Scheduler S In the following, we discuss the simulation of the four
Derivative_X_Y Derivative_X_Y
Magnitude_X_Y
Non_Max_Supp
Magnitude_X_Y
Non_Max_Supp virtual processing models using a round robin scheduler with
different time slice granularities from 1ns up to 100, 000ns
Scheduler S Scheduler S
c) Blur2X d) Blur2X2Y
on each individual PE. We measured the number of context
switches and associated them with a constant cost of 1 ms.
Figure 9: Partitioning of the Canny edge detector Fig. 10 shows the simulated time of model Blur2X2Y with
context switching costs for a round robin scheduling of P E1
and P E2. As expected, we can observe that fine grained time
slices < 100 ns have a huge impact on the overall simulated
model and measure the simulated time, count the sched- time. On the other hand, the responsiveness (although not
uler calls and measure the execution time of the SystemC necessary for the image filter design) rises.
program. The count of scheduler calls is compared to the ex-
pected number of scheduler calls depending on the scheduling
Simula'on
'me
of
the
model
Blur2X2Y
with
'me
strategy. Based on the results, we can show that the commu- annota'ons
for
context
switches
240,000
nication via shared variables and double handshake channels,
simula'on
'me
[ms]
220,000
as well as the individual scheduler on the PEs, do not violate 200,000
180,000
the causal execution order defined in the specification model. 160,000
140,000
The design we use for evaluation is a Canny edge detector 120,000
[1], a graphical filter to detect edges in a gray-scale image. As 100,000
80,000
starting point, we used an existing SpecC PSM model of the 60,000
filter, transformed it into an OSSS-Behaviour [5] model, and 1
10
100
1,000
10,000
100,000
Time
slice
of
the
individual
schedulers
[ns]
implemented different virtual processing models, as shown
in Fig. 9. Communication is performed via shared variables
and double handshake channels. An array containing the Figure 10: Simulation with costs for context switches
complete image is shared among the blur behaviors. Each
of these behaviors manipulates pixels on non-overlapping Fig. 11 visualizes the ratio between the simulated time for
tiles of the image. In the Canny algorithm blurring is the context switches and computation for a 20 ms leaf behavior
most computationally intensive block and therefore we map timing annotation. When the time slice is very short, almost
different combinations of blur leaf behaviors to a second 70 % of the simulated time is spent on context switches.
processing element. For our evaluation, we only require Fig. 12 shows the measured execution time of the various
timing annotations for parallel and mapped leaf behaviors. models on an Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00
We annotated each blur behavior with 20ms. GHz with 4 GB RAM running Fedora 12 Linux using the
Ra$o
between
context
switch
and
computa$on
$me
for
with different time slice granularities. So far we did not
the
model
Blur2X2Y
Ra$o
between
context
switch
100%
evaluate our simulation results against measurements on a
and
computa$on
$me
80%
real platform. To do a trade-off between simulation speed
Computa2on
60%
2me
and accuracy, a comparison with measurement results is
40%
Context
necessary and part of future work. Currently, the context
20%
switch
2me
switch penalty and communication delay is handled in a very
0%
simple way and after the conduction of measurement trails
these timing models will be refined.
1
2
5
0
0
1,
2,
5,
10 0
20 0
10
20
50
0
0
0
00
10
20
50
00
00
00
0
,0
,0
Time
slice
of
individual
scheduler
[ns]
Acknowledgement
Figure 11: Ratio between context switches and computation
time This work has been partially supported by the ARAMiS
project (01IS11035M) and the EMC2 collaborative ARTEMIS
project (01IS14002R), both funded by the German Federal
time command. From this measurement, we can observe that Ministry of Research and Education (BMBF).
the execution time of the individual models is proportional
to the number of context switches, as shown in Fig. 10. References
We traced the individual scheduler calls and compared the [1] J. Canny. A Computational Approach to Edge Detection. Pat-
tern Analysis and Machine Intelligence, IEEE Transactions
Execu&on
&me
for
round
robin
scheduled
processing
on, PAMI-8(6):679–698, nov. 1986.
elements
[2] D. D. Gajski, S. Abdi, A. Gerstlauer, and G. Schirner. Embedded
1.100
System Design: Modeling, Synthesis and Verification. Springer,
1.000
Spec.
Model
1st edition, 2009.
Execu&on
&me
[sec]
0.900
[3] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao.
0.800
Individual
PE
0.700
SpecC: Specification Language and Methodology. Springer, 1
Blur
X
edition, 2000.
0.600
0.500
Blur
2X
[4] A. Gerstlauer, H. Yu, and D. D. Gajski. RTOS Modeling for
0.400
Blur
2X
2Y
System Level Design. In Proceedings of DATE. IEEE Computer
0.300
Society, 2003.
0.200
[5] K. Grüttner and W. Nebel. Modelling Program–State Machines
1
10
100
1,000
10,000
100,000
in SystemC. In Forum on Specification and Design Languages
Time
slice
of
the
individual
schedulers
[ns]
2008, 09 2008.
[6] P. Hartmann, H. Kleen, P. Reinkemeier, and W. Nebel. Efficient
Figure 12: Simulation execution times of canny edge detector modelling and simulation of embedded software multi-tasking us-
partitionings ing SystemC and OSSS. In Specification, Verification and De-
sign Languages, 2008. FDL 2008. Forum on, pages 19–24, Sept
2008.
execution order of leaf behaviors (i.e. the causal chain) [7] P. A. Hartmann, K. Grüttner, A. Rettberg, and I. Podolski.
Distributed Resource-Aware Scheduling for Multi-Core Architec-
between the specification and the Virtual Processing Models. tures with SystemC. In Distributed, Parallel and Biologically
In the specification model, the blur behavior’s execution Inspired Systems, volume 329, pages 181–192. Springer, 2010.
order was blurX1, ... , blurX4, while in the different VPM [8] Z. He, A. Mok, and C. Peng. Timed RTOS modeling for Embed-
ded System Design. In Real Time and Embedded Technology
models the execution order changed to blurX4, ... , blurX1. and Applications Symposium (RTAS’05), 2005.
Even though the ordering is different, validity of causality [9] S. Huss and S. Klaus. Assessment of Real-Time Operating Sys-
for parallel compositions (partial order) only requires to be tems Characteristics in Embedded Systems Design by SystemC
order isomorph, which is the case. models of RTOS services. In Proceedings of Design & Verifica-
tion Conference and Exibition (DVCon’07), 2007.
[10] T. Kempf, M. Dörper, R. Leupers, G. Ascheid, H. Meyr, T. Ko-
gel, and B. Vanthournout. A Modular Simulation Framework for
6. CONCLUSION AND OUTLOOK Spatial and Temporal Task Mapping onto Multi-Processor SoC
In this paper, we extended the proposed methodology in Platforms. In Proceedings of DATE, 2005.
[4] and introduced a novel virtual processing model for PSM [11] R. Le Moigne, O. Pasquier, and J. Calvez. A Generic RTOS
Model for Real-Time Systems Simulation with SystemC. In Pro-
based models. The existing methodology allowed scheduling ceedings of DATE, 2004.
of processes using generic RTOS primitives. The design step [12] S. Mahadevan, M. Storgaard, J. Madsen, and K. Virk. Arts: A
from a non-scheduled specification model to a process-based System-Level Framework for Modeling MPSoC Components and
Analysis of their Causality. In 13th IEEE International Sympo-
RTOS scheduled architecture model is a major refinement sium on Modeling, Analysis, and Simulation of Computer and
step. For this reason, we proposed to introduce an inter- Telecommunication Systems, 2005.
mediate model, called virtual processing model. This model [13] P. Razaghi and A. Gerstlauer. Host-Compiled Multicore RTOS
Simulator for Embedded Real-Time Software Development. In
enables to add a scheduler with user defined scheduling al- Proceedings of DATE. IEEE, 2011.
gorithm to a behavior, called virtual processing unit. This [14] G. Schirner and R. Dömer. Introducing preemptive scheduling
flexible scheduling annotation enables fast and easy explo- in abstract RTOS models using result oriented modeling. In
Proceedings of DATE, New York, NY, USA, 2008. ACM.
ration, regarding scheduling granularities of behaviors and [15] M. Streubühr, J. Falk, C. Haubelt, J. Teich, R. Dorsch, and
assignments of scheduling policies. After successful explo- T. Schlipf. Task Accurate Performance Modeling in SystemC
ration, the behavior to process transformation and RTOS for Real-Time Multi-Processor Architectures. In Proceedings of
DATE, 2006.
configuration for architecture refinement can be performed. [16] H. Zabel and W. Müller. An Efficient Time Annotation Tech-
We have sketched how to use SystemC to implement our nique in Abstract RTOS Simulations for Multiprocessor Task
virtual processing model. Furthermore, we have integrated Migration. In Distributed, Parallel and Biologically Inspired
the virtual processing model into the OSSS-Behaviour li- Systems, volume 271 of IFIP Advances in Information and
Communication Technology. Springer, 2008.
brary, supporting program state machine (PSM) modeling [17] H. Zabel, W. Müller, and A. Gerstlauer. Accurate RTOS Mod-
in SystemC. Our implementation concept allows designers to eling and Analysis with SystemC. In W. Ecker, W. Müller, and
implement new scheduling strategies. For the evaluation, we R. Dömer, editors, Hardware-dependent Software, pages 233–
260. Springer Netherlands, 2009.
used a Canny filter design and created different behavior par-
titions and scheduler configurations. The evaluation showed
that our extension retains the functional causalities of the
original PSM model when using a round robin scheduling