<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Synthesis of Heterogeneous CPU-GPU Embedded Applications from a UML Profile</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Innovation, Design and Engineering - IDT Mälardalen University</institution>
          ,
          <addr-line>Västerås</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern embedded systems present an ever increasing complexity, and model-driven engineering has been shown to be helpful in mitigating it. In our previous work we exploited the power of model-driven engineering to develop a round-trip approach for aiding the evaluation and assessment of the preservation of extra-functional properties from models to code. In addition, we showed how the round-trip approach could be employed to evaluate different deployment strategies, with a focus on homogeneous CPU-based platforms. Since the assortment of target platforms in the embedded domain is inevitably shifting to heterogeneous solutions, our goal is to broaden the scope of the round-trip approach towards mixed CPU-GPU configurations. In this work we focus on the modelling of heterogeneous deployment and on the enhancement of the current automatic code generator to synthesize code targeting such heterogeneous configurations.</p>
      </abstract>
      <kwd-group>
        <kwd>model-driven engineering</kwd>
        <kwd>code synthesis</kwd>
        <kwd>heterogeneous systems</kwd>
        <kwd>embedded systems</kwd>
        <kwd>CHESS-ML</kwd>
        <kwd>UML</kwd>
        <kwd>MARTE</kwd>
        <kwd>ALF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The complexity of embedded systems is increasing at a high pace; development
processes based on code-centric approaches therefore tend to become highly complex and
error-prone, demanding the adoption of more powerful and automated mechanisms.
Model-Driven Engineering (MDE) has been proven capable of reducing development
complexity by abstracting away details that are typical of code-centric approaches
but not needed at design level. More specifically, the focus shifts from
hand-written code to models through which the system can be analysed and validated early;
furthermore, the application is meant to be automatically generated from them through
model transformation mechanisms.</p>
      <p>Automating the code generation phase is considered essential for
making MDE a viable substitute for code-centric approaches, especially in
industry. Powerful code generation mechanisms can improve the quality and
maintainability of the final application as well as enforce its consistency with the source models. In this
way, results from analysis performed at model level are more likely to be valid at code
level too (and the other way around). Development-wise, effective code generation can
positively affect time-to-market as well as overall costs and risks. Additionally,
generated code is meant to achieve higher and more consistent quality than hand-written
code with respect to errors, maintainability and readability.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we proposed automated round-trip engineering support for MDE of
embedded systems with a focus on the preservation of extra-functional properties (EFPs).
The round-trip support is made of four core steps. The first step consists of modelling the
system through a structural design in terms of components, a behavioural description by
means of state-machines and action code, as well as a deployment model describing the
allocation of software components to operating system processes. Then, from the
information contained in the design model, we automatically generate fully functional code
to be run as a single-process<sup>1</sup> application on single-core CPU-based platforms. Once the
application is generated, we monitor its execution on the target platform and measure
selected EFPs. Finally, the gathered values are back-propagated to the design model and, after
their evaluation, the models can be manually tuned to generate more resource-efficient
code.
      </p>
      <p>
        The round-trip support has been validated in industrial settings, where the
need arose to extend code generation to account for more complex platforms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Therefore, we proposed and developed preliminary extensions in order to enable the
generation of multiprocess applications on CPUs, which we employed for deployment
assessment in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Nevertheless, the expectation for embedded systems to be able to
process vast amounts of data, even in real-time, is spreading among several different
domains and a possible solution to make embedded systems fulfil this expectation is
the adoption of hardware technologies based on heterogeneous configurations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A
common scenario is represented by mixed CPU-GPU configurations where, e.g., input
data comes into a CPU, which in turn may exploit one or more GPUs as coprocessors
for parallel processing of large blocks of data.
      </p>
      <p>The crossover from homogeneous to heterogeneous platforms brings along new
research challenges ranging from modelling to coding of the embedded system. In our
case, while generally increasing the performance of the resulting applications, the
introduction of heterogeneity adds complexity to modelling and generating the
application. In fact, since different processing units (e.g., CPUs and GPUs) usually
employ different formalisms and mechanisms for code execution, the design models
should contain the information needed by the generation process to map model entities
to code artefacts written in different target languages as well as to generate the
communication code needed for the interaction between CPUs and GPUs.</p>
      <p>Pursuing this direction, the contributions of this work are (i) the identification of
the modelling means to specify heterogeneous deployment, especially regarding the
allocation of single component functions to GPU cores, and (ii) the extension of the code
synthesis to generate heterogeneous applications to be run on mixed CPU-GPU
platforms. Moreover, we aim at maintaining a deployment-agnostic
specification of the functional characteristics of the system, while modelling the platform-
and deployment-specific details as extra-functional annotations that drive the generation of
the heterogeneous application.</p>
      <p>The remainder of the paper is organised as follows. Section 2 describes the scope
of the work and its contextual delimitation. The relation of our contribution to the state
of the art is given in Section 3. Section 4 depicts the means we identified for modelling
heterogeneous allocation and deployment as well as for enabling the synthesis of
applications to be run on mixed CPU-GPU platforms. The paper is concluded in Section 5
with a discussion of the means for the proposed solution as well as current limitations
and planned future work.</p>
      <p><sup>1</sup> We refer to a process as an independent execution unit that only interacts with other processes
via interprocess communication mechanisms (managed by the operating system).</p>
    </sec>
    <sec id="sec-2">
      <title>2 Context</title>
      <p>
        In our work we employ the CHESS Modelling Language (CHESS-ML) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], defined
within the CHESS project as a UML profile [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], including subsets of the MARTE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
and SysML profiles. CHESS-ML is part of the CHESS framework<sup>2</sup>, which
leverages the Papyrus [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] Project [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], an open-source environment for editing Eclipse
Modeling Framework (EMF) models and particularly supporting UML and related profiles
such as SysML and MARTE, on the Eclipse platform. CHESS-ML allows the
specification of a system together with relevant EFPs such as predictability, dependability and
security. Moreover, it supports a development methodology expressly based on separation
of concerns; distinct design views address distinct concerns. In addition, CHESS-ML
supports component-based development as prescribed by the UML Superstructure [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        For the functional definition of the system in CHESS, UML component and
composite component diagrams are employed to model the structural aspects, while
state-machines are used to express functional behaviour. The Action Language for Foundational
UML (ALF) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is used to define the actual behaviour of the component operations
(also addressed in this paper as functions). In this way, we reach the necessary
expressive power to be able to generate the full implementation directly from the functional
models with no need for manual fine-tuning of the code after its generation. In
compliance with the principle of separation of concerns adopted in CHESS-ML, the functional
models are decorated with extra-functional information thereby ensuring that the
definition of the functional entities is not altered.
      </p>
      <p>
        The target languages are C++, for code portions running on CPU, and CUDA
C/C++ [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for code portions to be deployed on GPU. The application is run on OSE,
a commercial and industrial real-time operating system developed by Enea [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which
provides direct and asynchronous message passing for communication
and synchronisation between tasks. This allows tasks to run on different processors or
cores, utilising the same message-based communication model as on a single processor.
This programming model provides the advantage of avoiding the use of shared memory
among tasks. In OSE, the runnable real-time entity equivalent to a task is called a process,
and the messages that are passed between processes are referred to as signals (thus, the
terms process and task can be considered synonyms in this paper).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Related Work</title>
      <p>
        Overall, a number of different approaches have been proposed for the generation of
multicore systems, starting from different abstraction levels, such as in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
Nevertheless, the input needed for these approaches is at a very low abstraction level
and the output is meant to complement code artefacts that are generated elsewhere or
already exist. In our solution the whole implementation is meant to be generated from the
design models in a single transformation process.
      </p>
      <p><sup>2</sup> Download at http://www.math.unipd.it/~azovi/CHESS/CHESS_3.2/.</p>
      <p>
        Different approaches aiming at achieving code generation for embedded systems
can be found in the literature but, despite the numerous attempts, this still represents an
open research issue, especially when it comes to the generation of code to be run on
heterogeneous platforms. The most concrete attempt at heterogeneous code generation for
CPU-GPU configurations has been proposed by Rodrigues et al. in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] where the
authors define a code generation process from UML–MARTE models to OpenCL. While
similar to ours in terms of the underlying idea, the approach proposed in that work
assumes more detailed platform information in the model and targets only GPU-related
code. In the approach we propose, the aim is to provide an environment that allows
end-users to freely model systems and thereby allocate components and their functions
to either CPUs or GPUs, leaving the burden of communication code between CPU and
GPU to the code generation process.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] the authors aim at automating the task of determining the appropriate
memory usage as well as the coding of data transfer between memories. This work could be
exploited in the next steps of our research towards possible optimizations of the
code to be generated. Additionally, attempts to automatically generate CUDA from C
have also been proposed, as in [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ], and they could provide useful guidance for the
definition and implementation of our ALF to CUDA C/C++ transformation chain.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 Modelling and Synthesizing the Heterogeneous Application</title>
      <p>In this section we define the means we identified for modelling heterogeneous
deployment and thereby enabling the generation of applications to be run on mixed CPU-GPU
platforms from the design models.</p>
      <sec id="sec-4-1">
        <title>4.1 Modelling Heterogeneous Deployment</title>
        <p>As previously mentioned, the functional definition of the system is modelled in
CHESS-ML by means of UML components as well as state-machines. Moreover, ALF is
leveraged to implement the actual behaviour of the components’ operations. Note that the
functional specification of the system is meant to remain deployment-agnostic, thus
leaving platform and deployment details to be modelled as extra-functional decorations,
as described later in this section.</p>
        <p>The first step to enhance the code generation from models to target heterogeneous
platforms is to identify the modelling means for those points of variability that cannot
be embedded in the transformation process. More specifically, while we already allow
the allocation of components to CPU cores via OSE processes, we want to enable the
modeller to define the allocation of single component functions to GPU cores as well.</p>
        <p>In Fig. 1 we show a simplified version of the vision system of an Autonomous
Underwater Vehicle (AUV), focusing on the functional definition and on the allocation of
components and functions to hardware resources. The system is represented by the
composite component Vision System, containing the component VisionManager_impl
of type VisionManager, the components FrontCamera and BottomCamera of type
StereoCamera, which represent the two camera systems of the AUV, and the Filter
component of type StereoMatcher. The components are allocated to different OSE
processes, i.e., Process A, Process B, Process C, Process D, of type OSE
Process and stereotyped with MARTE’s MemoryPartition. The processes are then
allocated to the two-core CPU chip defined as MARTE’s hwProcessor.</p>
        <p>The allocation is modelled by means of MARTE’s allocated (on allocated
components and resources) and allocate (dotted arrows between elements with an
allocation relationship). Moreover, the ID of the core on which the process is meant to be allocated
is specified through a MARTE nfpConstraint called Core ID. Let us suppose
that we want to allocate Filter’s function f_sum() to the core with ID = 1 of the
GPU chip. This is done by:
– Modelling an allocate link between Filter and the hwProcessor GPU
chip;
– Specifying the function (i.e., f_sum()) to be allocated to the GPU core through a
decoration of the allocate link with MARTE’s assign;
– Decorating the allocate link with an nfpConstraint called Core ID for
specifying the core on which to allocate the function, an nfpConstraint called GridD
for the definition of the grid dimension, and an nfpConstraint called BlockN
for the definition of the thread block size.</p>
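        <p>As a rough illustration of the information these decorations carry, the following C++ sketch stores the three constraint values per allocated function and lets a generator ask whether a function is bound to a GPU core. The names GpuAllocation, DeploymentModel and total_threads are hypothetical and do not belong to CHESS-ML or MARTE; the values used in the usage note are those of the f_sum() example.</p>
        <preformat><![CDATA[
```cpp
#include <map>
#include <string>

// Hypothetical in-memory form of the GPU allocation decorations
// (Core ID, GridD, BlockN) attached to an allocate link in the model.
struct GpuAllocation {
    int core_id;  // value of the Core ID constraint
    int grid_d;   // value of GridD: number of thread blocks in the grid
    int block_n;  // value of BlockN: number of threads per block
};

// Maps "Component.function" to its GPU allocation, if any.
// Functions without an entry stay on the CPU process of their component.
class DeploymentModel {
public:
    void allocate_to_gpu(const std::string& function, const GpuAllocation& a) {
        gpu_[function] = a;
    }

    // Returns a pointer to the allocation, or nullptr for CPU-only functions.
    const GpuAllocation* gpu_allocation(const std::string& function) const {
        std::map<std::string, GpuAllocation>::const_iterator it = gpu_.find(function);
        return it == gpu_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, GpuAllocation> gpu_;
};

// Total number of GPU threads implied by an allocation.
inline int total_threads(const GpuAllocation& a) {
    return a.grid_d * a.block_n;
}
```
]]></preformat>
        <p>With Core ID = 1, GridD = 2 and BlockN = 1024, total_threads yields 2048 threads; a function without an entry, such as caller(), remains on the CPU process of its component.</p>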
        <p>Note that, while in Fig. 1 all the details (functional, extra-functional and deployment)
are exposed in a single view, in the actual CHESS-ML model they are placed in
separate views (i.e., functional, extra-functional, deployment) to enforce separation of
concerns.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Generating the Application</title>
        <p>
          The generation process consists of a set of model transformations. Starting from
the CHESS-ML model of the system under development, we translate the structural
definition from component, composite component and state-machine diagrams through
a model-to-model transformation chain<sup>3</sup>. Regarding the translation of state-machines,
our approach resembles the state design pattern, as defined in [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and considers the
component owning the state-machine as the context for the related states.
        </p>
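        <p>A minimal C++ sketch of the state design pattern, with the owning component acting as context, is given below. It only illustrates the shape such code can take and is not the output of the actual generator; the states Idle and Processing, the event trigger() and the counter processed are invented for the example.</p>
        <preformat><![CDATA[
```cpp
#include <memory>
#include <string>
#include <utility>

class Component;  // context: the component that owns the state-machine

// One class per state; each event is a virtual operation.
class State {
public:
    virtual ~State() {}
    virtual std::string name() const = 0;
    // Handles the event and returns the next state, or nullptr to stay.
    virtual std::unique_ptr<State> on_trigger(Component& ctx) = 0;
};

class Component {
public:
    explicit Component(std::unique_ptr<State> initial) : state_(std::move(initial)) {}

    // Delegates the event to the current state and performs the transition.
    void trigger() {
        std::unique_ptr<State> next = state_->on_trigger(*this);
        if (next) state_ = std::move(next);
    }

    std::string current_state() const { return state_->name(); }

    int processed = 0;  // context data updated by state actions

private:
    std::unique_ptr<State> state_;
};

class Idle : public State {
public:
    std::string name() const override { return "Idle"; }
    std::unique_ptr<State> on_trigger(Component& ctx) override;
};

class Processing : public State {
public:
    std::string name() const override { return "Processing"; }
    std::unique_ptr<State> on_trigger(Component& ctx) override;
};

inline std::unique_ptr<State> Idle::on_trigger(Component&) {
    return std::unique_ptr<State>(new Processing());
}

inline std::unique_ptr<State> Processing::on_trigger(Component& ctx) {
    ++ctx.processed;  // action executed on the context before leaving the state
    return std::unique_ptr<State>(new Idle());
}
```
]]></preformat>
        <p>Returning the next state from the handler, instead of letting a state replace itself through the context, keeps the running state object alive until its handler has finished.</p>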
        <p>
          As prescribed in its specification [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], the execution semantics for ALF is specified
by a formal mapping to foundational UML (fUML) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. There are three prescribed
ways in which ALF execution semantics may be implemented [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], namely interpretive
execution, compilative execution and translational execution. In our code generation,
we provided a solution towards the translational execution of ALF, focusing on the
minimum conformance level (as defined in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]), by means of model-to-model
transformations which are introduced in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>Fig. 1: Modelling of Heterogeneous Deployment in Papyrus</p>
        <p>
          <sup>3</sup> More details on the transformation process can be found in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]; the complete description of the generation process and the involved artefacts for multiprocess applications is currently under
submission.
        </p>
        <p>
          In the following we describe the generation principles for producing the
communication code needed to call functions allocated to GPUs (i.e., f_sum()) from functions
allocated to CPUs, as well as for translating the code of f_sum(), specified in the model in terms of ALF,
to CUDA C/C++ code. Let us suppose that the function caller() in FrontCamera
calls Filter’s f_sum(), which is allocated to the GPU core with Core ID = 1; the
code related to the two functions is depicted in the following ALF-like code snippet.
        </p>
        <preformat>
// caller in ALF-like notation
public caller(in p1[], in p2[]) {
  int N = 10000;
  int res[N];
  Filter.f_sum(p1, p2, N, res);
}
...
</preformat>
        <p>
          As depicted in the code snippet, function f_sum() computes the sum of the arrays given
as input parameters (in a[], in b[]) and puts the result in the output parameter array (out
result[]); for simplicity we statically define the arrays’ length as N. On the one
hand, the function caller() has to be generated as a standard C++ function, and therefore
can be handled by the code generator in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. On the other hand, f_sum(), which is
deployed on a GPU core, needs to be translated into CUDA C/C++ and communication
code has to be generated in order to allow caller() to call it.
        </p>
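        <p>For reference, the sequential computation that the ALF body of f_sum() is described as performing can be written in plain C++ as follows. This is a hand-written sketch based on the description above, neither the ALF source nor generator output; the names f_sum_sequential and caller_sequential are assumptions.</p>
        <preformat><![CDATA[
```cpp
#include <cstddef>
#include <vector>

// Sequential reference for f_sum(): result[i] = a[i] + b[i] for i in [0, n).
inline void f_sum_sequential(const int* a, const int* b, std::size_t n, int* result) {
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Usage sketch mirroring caller(): the caller owns the result buffer.
// Assumes p2 holds at least as many elements as p1.
inline std::vector<int> caller_sequential(const std::vector<int>& p1,
                                          const std::vector<int>& p2) {
    std::vector<int> res(p1.size());
    f_sum_sequential(p1.data(), p2.data(), p1.size(), res.data());
    return res;
}
```
]]></preformat>
        <p>The iterations of the loop are independent of each other, which is what makes this body a candidate for element-wise parallel execution.</p>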
        <p>The idea is to enhance the generation process to produce (i) communication code
whenever the code generator runs into a call (in caller()) to a function (f_sum()) allocated
to a GPU core, and (ii) the parallel code which corresponds to the sequential ALF code
defined for it (f_sum()). Code 1.2 depicts the generated C++ caller() function as well as
the generated CUDA C/C++ code in terms of the kernel f_sum(), representing the
translated body of f_sum(), and f_sum_caller(), which represents the function implementing
the communication code needed to call the kernel.</p>
        <preformat>
 1  // .cpp file
 2  // caller in C++ (.cpp file)
 3  void caller(int *p1, int *p2) {
 4    int N = 10000;
 5    int res[N];
 6    Filter.f_sum_caller(p1, p2, N, res);
 7  }
    ...
21    // pointers to device parameters
22    int *p1_d, *p2_d, *res_d;
23    // size, in bytes, of each array
24    size_t bytes = N * sizeof(int);
25    // allocate memory for parameters
26    cudaMalloc(&amp;p1_d, bytes);
27    cudaMalloc(&amp;p2_d, bytes);
28    cudaMalloc(&amp;res_d, bytes);
29    // copy host memory to device memory (cudaMemcpy, since the buffers come from cudaMalloc)
30    cudaMemcpy(p1_d, p1, bytes, cudaMemcpyHostToDevice);
31    cudaMemcpy(p2_d, p2, bytes, cudaMemcpyHostToDevice);
32    // init grid and block dimensions
33    dim3 dimGrid(2);
34    dim3 dimBlock(1024);
35    // select GPU device
36    cudaSetDevice(1);
37    // call kernel function
38    f_sum&lt;&lt;&lt;dimGrid, dimBlock&gt;&gt;&gt;(p1_d, p2_d, res_d, N);
39    // copy device memory to host memory
40    cudaMemcpy(res, res_d, bytes, cudaMemcpyDeviceToHost);
41    // deallocate
42    cudaFree(res_d);
43    cudaFree(p1_d);
44    cudaFree(p2_d);
45  }
</preformat>
        <p>Code 1.2: Generated Functions in C++ and CUDA C/C++</p>
        <p>The following steps are performed to generate the kernel f_sum() and the
communication code for it to be called. Firstly, since the body of f_sum() is defined in terms
of sequential computation, we parallelize it by substituting the iterating for loop with
a multithreaded parallel sum, in which each thread in the block sums the respective i-th
array elements (lines 10-16). This step is currently meant to be provided only in a
semi-automatic fashion, hence requiring manual fine-tuning in more complex cases.</p>
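        <p>The index arithmetic behind such a per-thread sum can be emulated on the host in plain C++, as sketched below. The explicit index parameters stand in for CUDA's blockIdx.x, blockDim.x, threadIdx.x and gridDim.x. The grid-stride loop is an assumption of this sketch rather than something stated for the generated kernel: it lets a launch with fewer threads than elements (for instance 2 blocks of 1024 threads for N = 10000) still cover the whole array. The function names are hypothetical.</p>
        <preformat><![CDATA[
```cpp
#include <cstddef>
#include <vector>

// Work done by one logical GPU thread. The index parameters play the role of
// CUDA's blockIdx.x, blockDim.x, threadIdx.x and gridDim.x.
inline void f_sum_thread(const int* a, const int* b, int* result, std::size_t n,
                         std::size_t block_idx, std::size_t block_dim,
                         std::size_t thread_idx, std::size_t grid_dim) {
    std::size_t first = block_idx * block_dim + thread_idx;  // global thread id
    std::size_t stride = grid_dim * block_dim;               // total thread count
    for (std::size_t i = first; i < n; i += stride) {
        result[i] = a[i] + b[i];
    }
}

// Host-side emulation of a launch with grid_dim blocks of block_dim threads.
// On a GPU the iterations of these two loops would run concurrently.
inline void f_sum_emulated_launch(const int* a, const int* b, int* result,
                                  std::size_t n, std::size_t grid_dim,
                                  std::size_t block_dim) {
    for (std::size_t blk = 0; blk < grid_dim; ++blk) {
        for (std::size_t thr = 0; thr < block_dim; ++thr) {
            f_sum_thread(a, b, result, n, blk, block_dim, thr, grid_dim);
        }
    }
}
```
]]></preformat>
        <p>Every element index is visited by exactly one logical thread, so no synchronisation between threads is needed for this computation.</p>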
        <p>The next step is to create a communication function called f_sum_caller() (lines
18-45), which is called by caller() and provides the CUDA-related
operations needed to call the kernel f_sum(). To do this, a pointer for each of
the parameters (both in and out) of f_sum() is declared and given device memory through the
cudaMalloc() API (lines 22-28). These pointers are used for exchanging data between
CPU and GPU via host and device memories. The values carried by the in parameters are then
copied from host memory to device memory (lines 30-31). For buffers obtained from
cudaMalloc(), this copy is performed through the cudaMemcpy() API with the
cudaMemcpyHostToDevice direction; cudaMemcpyToSymbol() addresses statically declared
device symbols instead.</p>
        <p>As depicted in Fig. 1, the modeller defines the grid dimension (i.e., 2 thread blocks) by
nfpConstraint GridD as well as the thread block size (i.e., 1024 threads) by nfpConstraint
BlockN as decorations of the allocation of f_sum() on the GPU core with Core ID =
1. The generation process locates this information in the model and uses it to declare
the actual dimensions of both grid and block (lines 33-34); the GPU core’s ID is used
to select the device to be used, through the cudaSetDevice(ID) API (line 36).</p>
        <p>At this point we generate the call to the kernel f_sum() using the CUDA-specific
syntax (line 38). When the computation is completed, we move the result held in the
device memory back to the host memory (line 40), using cudaMemcpy() with the
cudaMemcpyDeviceToHost direction since the result buffer was obtained from cudaMalloc().
Finally, we release the allocated resources through the cudaFree()
API (lines 42-44) and end the parallel computation.</p>
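        <p>The steps above lend themselves to a template-like model-to-text step. The C++ sketch below assembles the body of such a communication function as a string from a small record of model data. The record GpuCall and the function emit_communication_body are hypothetical, and the actual generator is built on model transformations rather than string concatenation. The sketch assumes int arrays of length N, uses cudaMemcpy() for the transfers, and emits the device selection first, on the assumption that the buffers should be allocated on the selected device.</p>
        <preformat><![CDATA[
```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of a call to a GPU-allocated function, as it could
// be collected from the model (names and fields are illustrative).
struct GpuCall {
    std::string kernel;                // e.g. "f_sum"
    std::vector<std::string> inputs;   // in parameters, e.g. {"p1", "p2"}
    std::vector<std::string> outputs;  // out parameters, e.g. {"res"}
    int device;                        // from Core ID
    int grid;                          // from GridD
    int block;                         // from BlockN
};

// Emits the body of the communication function as CUDA C/C++ text.
inline std::string emit_communication_body(const GpuCall& c) {
    std::vector<std::string> all = c.inputs;
    all.insert(all.end(), c.outputs.begin(), c.outputs.end());

    std::ostringstream out;
    out << "cudaSetDevice(" << c.device << ");\n";
    out << "size_t bytes = N * sizeof(int);\n";
    for (const std::string& p : all)
        out << "int *" << p << "_d; cudaMalloc(&" << p << "_d, bytes);\n";
    for (const std::string& p : c.inputs)
        out << "cudaMemcpy(" << p << "_d, " << p
            << ", bytes, cudaMemcpyHostToDevice);\n";
    out << "dim3 dimGrid(" << c.grid << ");\n";
    out << "dim3 dimBlock(" << c.block << ");\n";
    out << c.kernel << "<<<dimGrid, dimBlock>>>(";
    for (const std::string& p : all) out << p << "_d, ";
    out << "N);\n";
    for (const std::string& p : c.outputs)
        out << "cudaMemcpy(" << p << ", " << p
            << "_d, bytes, cudaMemcpyDeviceToHost);\n";
    for (const std::string& p : all)
        out << "cudaFree(" << p << "_d);\n";
    return out.str();
}
```
]]></preformat>
        <p>For the f_sum() example (inputs p1 and p2, output res, device 1, grid 2, block 1024), the emitted text contains the same kinds of calls as Code 1.2, one allocation and one release per parameter and one transfer per direction.</p>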
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Discussion and Conclusion</title>
      <p>
        The actual development of the proposed approach is carried out by (i) identifying the
means to model deployment on mixed CPU-GPU configurations, (ii) improving the
intermediate artefacts employed by the code generator, and described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], to host
CUDA-related information, and (iii) defining model-to-model and model-to-text
transformations to carry out the actual code generation. Currently, the parallelisation of
ALF code is only provided in a semi-automatic manner and is thus able to translate only rather
simple cases of parallelizable code (e.g., for loops, both simple and nested). Future
enhancements of the approach will therefore focus on broadening the set of covered cases.
Nevertheless, as can be noticed in the proposed example, in order to call f_sum() from
caller() (fewer than 20 lines of code together), we would have needed to manually write
more than 25 lines of communication code (per call). This hints at the
usefulness of automating the generation of communication code, thereby relieving the
end-user of an error-prone and time-consuming burden. Moreover, once we have
finalized the necessary information to model heterogeneous allocation (e.g., CoreID,
GridD, BlockN), we intend to produce custom stereotypes, concentrating constraints
and allocations in a single place, that would be folded into the CHESS-ML profile.
      </p>
      <p>
        In this work we focused on the allocation of entire ALF functions to GPU cores
for parallel computation. Since ALF allows possible parallelization to be specified at a
finer-grained level, as for the block and for statements, through the parallel
annotation, we will introduce the possibility of allocating only such specific portions to
the GPU core. According to the fUML semantics [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], the presence of a parallel
annotation does not imply the implementation of actual parallelism on the execution
platform; therefore the deployment-independence of the system’s functional
description would not be jeopardised. The parallel annotation will in fact be taken into
account only if the owning function is allocated to a GPU core, and in that
case, instead of computing the entire function in parallel, only the annotated statements
would be.
      </p>
      <p>
        Possible future directions could target the definition of (semi-)automatic allocation
of components to processes and functions to either CPU or GPU cores, in order to
optimize performance and/or to decrease communication overhead. In order to achieve this,
a first step would be the definition of a more detailed memory model, both in terms of
the actual hardware resource as well as the allocation of components and
functions/statements to it. Moreover, enhancements of the monitoring features as well as the
backpropagation capabilities would be required for exploiting the round-trip approach in [
        <xref ref-type="bibr" rid="ref1 ref3">1,
3</xref>
        ]. Finally, it is important to remark that, even if applied in the context of CHESS-ML
as an enhancement of the round-trip support, the solution described in this work does not
depend on any CHESS-specific stereotype, which makes it more generally applicable
to approaches leveraging UML, MARTE and ALF.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciccozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cicchetti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjödin</surname>
          </string-name>
          .
          <article-title>Round-Trip Support for Extra-functional Property Management in Model-Driven Engineering of Embedded Systems</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciccozzi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjödin</surname>
          </string-name>
          .
          <article-title>Enhancing the Generation of Correct-by-construction Code from Design Models for Complex Embedded Systems</article-title>
          . In ETFA. IEEE,
          <year>July 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciccozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saadatmand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cicchetti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjödin</surname>
          </string-name>
          .
          <article-title>An Automated Round-trip Support Towards Deployment Assessment in Component-based Embedded Systems</article-title>
          . In CBSE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D.</given-names>
            <surname>Hallmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Asberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Nolte</surname>
          </string-name>
          .
          <article-title>Towards using the Graphics Processing Unit (GPU) for embedded systems</article-title>
          .
          <source>In ETFA</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Cicchetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciccozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mazzini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Puri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Panunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zovi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Vardanega</surname>
          </string-name>
          . CHESS:
          <article-title>a model-driven engineering tool environment for aiding the development of complex industrial systems</article-title>
          .
          <source>In ASE</source>
          , pages
          <fpage>362</fpage>
          -
          <lpage>365</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Bran</given-names>
            <surname>Selic</surname>
          </string-name>
          .
          <article-title>Unified Modeling Language (UML)</article-title>
          .
          <source>In Wiley Encyclopedia of Computer Science and Engineering</source>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Taha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radermacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gérard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Dekeyser</surname>
          </string-name>
          .
          <article-title>MARTE: UML-based Hardware Design from Modelling to Simulation</article-title>
          .
          <source>In FDL</source>
          , pages
          <fpage>274</fpage>
          -
          <lpage>279</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
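          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Gérard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          ,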
          <string-name>
            <given-names>P.</given-names>
            <surname>Tessier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Selic</surname>
          </string-name>
          .
          <article-title>Papyrus: A UML2 Tool for Domain-Specific Language Modeling</article-title>
          .
          <source>In Model-Based Engineering of Embedded Real-Time Systems</source>
          , pages
          <fpage>361</fpage>
          -
          <lpage>368</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <collab>Eclipse Projects</collab>
          . Papyrus. http://www.eclipse.org/papyrus/, Last Accessed:
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <collab>Object Management Group (OMG)</collab>
          .
          <source>UML Superstructure Specification V2.3</source>
          . http://www.omg.org/spec/UML/2.3/Superstructure/PDF/, Last Accessed:
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <collab>OMG</collab>
          .
          <article-title>Action Language For Foundational UML - ALF</article-title>
          . http://www.omg.org/spec/ALF/, Last Accessed:
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <collab>Nvidia</collab>
          .
          <article-title>Get Started with CUDA C/C++</article-title>
          . https://developer.nvidia.com/get-started-cuda-cc, Last Accessed:
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <collab>Enea</collab>
          .
          <article-title>The architectural advantages of Enea OSE in telecom applications</article-title>
          . http://www.enea.com/software/products/rtos/ose/, Last Accessed:
          <year>February 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>J.</given-names>
            <surname>Piat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pelcat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Raulet</surname>
          </string-name>
          .
          <article-title>Multicore code generation from interface based hierarchy</article-title>
          .
          <source>In DASIP '09</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>M.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Deriving High-Performance Real-Time Multicore Systems Based on Simulink Applications</article-title>
          .
          <source>In DASC'11</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vellore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Carloni</surname>
          </string-name>
          .
          <article-title>Recursion-driven parallel code generation for multi-core platforms</article-title>
          .
          <source>In DATE'10</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>A.W.O.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guyomarc'h</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Dekeyser</surname>
          </string-name>
          .
          <article-title>An MDE Approach for Automatic Code Generation from UML/MARTE to OpenCL</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          ,
          <volume>15</volume>
          :
          <fpage>46</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>S.-Z.</given-names>
            <surname>Ueng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lathara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Baghsorkhi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-M. W.</given-names>
            <surname>Hwu</surname>
          </string-name>
          .
          <article-title>CUDA-Lite: Reducing GPU Programming Complexity</article-title>
          .
          <source>In Languages and Compilers for Parallel Computing</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Springer-Verlag,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>M.</given-names>
            <surname>Baskaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramanujam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Sadayappan</surname>
          </string-name>
          .
          <article-title>Automatic C-to-CUDA Code Generation for Affine Programs</article-title>
          .
          <source>In Compiler Construction</source>
          , volume
          <volume>6011</volume>
          of
          <series>LNCS</series>
          , pages
          <fpage>244</fpage>
          -
          <lpage>263</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>M.</given-names>
            <surname>Amini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Creusillet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Even</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Keryell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Goubier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. O.</given-names>
            <surname>McMahon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-X.</given-names>
            <surname>Pasquier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Péan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Villalon</surname>
          </string-name>
          .
          <article-title>Par4All: From Convex Array Regions to Heterogeneous Computing</article-title>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>E.</given-names>
            <surname>Gamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Helm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vlissides</surname>
          </string-name>
          .
          <source>Design Patterns: Elements of Reusable Object-Oriented Software</source>
          .
          <publisher-name>Addison-Wesley Longman Publishing Co., Inc.</publisher-name>
          ,
          <publisher-loc>Boston, MA, USA</publisher-loc>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <collab>OMG</collab>
          .
          <article-title>Foundational Subset For Executable UML Models (FUML)</article-title>
          . http://www.omg.org/spec/FUML/1.1/, Last Accessed:
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>F.</given-names>
            <surname>Ciccozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cicchetti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjödin</surname>
          </string-name>
          .
          <article-title>Towards Translational Execution of Action Language for Foundational UML</article-title>
          .
          <source>In SEAA</source>
          ,
          <year>September 2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>