<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Programming parallel pipelines using non-parallel C# code</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Brabec</string-name>
          <email>brabec@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Bednárek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software Engineering</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1003</volume>
      <fpage>82</fpage>
      <lpage>87</lpage>
      <abstract>
        <p>Parallel and high-performance code is usually created as imperative code in FORTRAN, C, or C++ with the help of parallel environments like OpenMP or Intel TBB. However, learning these languages is quite difficult compared to C# or Java. Although these modern languages have numerous parallel features, they lack the automatic parallelization or load-distribution features known from specialized parallel environments. Due to the referential nature of C# and Java, the principles of parallel environments like OpenMP cannot be directly transferred to these languages. We investigated the idea of using C# as a programming language for a parallel system based on nonlinear pipelines. In this paper, we propose the architecture of such a system and describe some key steps that we have already taken towards the future goal of extracting both the pipeline structure and the code of the nodes from the C# source code.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <p>Parallel programs are usually designed within the framework of a specific paradigm like thread-based, task-based, or pipeline parallelism. Such a framework is either explicitly used by the programmer in the form of a library like Intel TBB, or it is hidden inside a compiler capable of automatic parallelization like C++/OpenMP.</p>
        <p>
          Pipeline parallelism is a paradigm which receives increasing attention due to its relation to stream processing; in its generalized, branched-pipeline form it is also sufficient for data-processing applications including relational or RDF databases [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The explicit specification of data flow in a pipeline also helps in NUMA or distributed applications where the cost of data movement is important [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Unfortunately, pipeline parallelism has not been studied as thoroughly as other forms of parallelism – while automatic parallelization within a thread-based or task-based framework has been implemented in many systems, including FORTRAN, C, and C++ compilers, extracting pipeline structure from program code is still in the stage of experiments [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Bobox [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is a parallel execution environment based on generalized branched pipelines which connect a set of execution units called boxes. A Bobox application is composed of two components: the model, which describes how the boxes are interconnected, and the box code, i.e. the implementation of all boxes used in the model.
        </p>
        <p>Fig. 1. The basic architecture of Bobox</p>
        <p>As shown in Fig. 1, the code of the individual boxes is compiled from their C++ source code and linked together with the Bobox system code at run time. The run-time representation of the model, called the instantiated model, is created by the model instantiator from the text-based model description and the binary box code. After instantiation, the model is assigned to a set of CPUs and executed.</p>
        <p>
          When created by humans, Bobox models are usually written in Bobolang [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], a declarative language whose purpose and principles are similar to netlist languages like SPICE [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Bobox models may also be generated from a query language using a language front-end, e.g. the SPARQL front-end [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>Nowadays, Bobox boxes are programmed in C++ within tight restrictions imposed by the framework interface. Although simpler than explicit thread-based or message-based parallelism, programming in Bobox is still a tedious and error-prone task.</p>
        <p>In this paper, we propose a Bobox front-end which transforms the box code from C# to C++. During the transformation, code is added to control pipelines and synchronization.</p>
        <p>In the advanced version of the architecture, the front-end also extracts the model from a C# program, approaching the goal of automatic parallelization in the Bobox environment.</p>
        <p>While the advanced version is a future goal, we have already taken the critical steps towards the basic version. We have studied and implemented the key analytical part of the proposed system, the CIL analyzer; in particular, we thoroughly studied the aspects of C# which make the problem different from known compiler algorithms for C++ or FORTRAN code.</p>
        <p>The rest of the paper is organized as follows: In Section 2, we describe the motivation for our project and the goals that resulted from it. Section 3 describes the architecture of the proposed solution as well as the justification for the use of C#; throughout this section, we also compare our approach to related work. In Section 4, we discuss the technical details associated with the choice of C# and the key components of our system. In the Conclusion, we describe the current status and the future development of the project.</p>
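        <p>To make the box/model terminology concrete, the following is a minimal Python sketch (hypothetical and greatly simplified – not the Bobox API): the boxes are purely serial transformations of data streams, and the model only wires them together.</p>
        <preformat>
```python
# Hypothetical sketch of the box/model idea (not the actual Bobox API):
# a "box" is a serial transformation, the "model" wires boxes together.

def source_box():
    # produces a stream of data items
    yield from range(5)

def square_box(inp):
    # purely serial transformation of its input stream
    for x in inp:
        yield x * x

def sink_box(inp):
    # consumes the stream and materializes the result
    return list(inp)

def run_model():
    # the "model": source -> square -> sink
    return sink_box(square_box(source_box()))

print(run_model())  # [0, 1, 4, 9, 16]
```
        </preformat>
        <p>Real Bobox models are branched pipelines of many such boxes, and the parallelism comes from executing independent boxes concurrently.</p>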
      </sec>
      <sec id="sec-1-2">
        <title>Motivation</title>
        <p>The principles of Bobox, developed in accordance with the general pipeline parallelism paradigm, determine the means that a developer in Bobox possesses. As we will show in the following paragraphs, the stress on maximum performance causes programming in Bobox to be less straightforward than the pipeline approach promises.</p>
        <sec id="sec-1-2-1">
          <title>Parallelism in Bobox</title>
          <p>Bobox design principles impose some crucial restrictions upon the behavior of individual boxes. In particular, a box shall always execute purely serially; thus, any parallel execution occurs only among boxes at the plan level. This approach corresponds to inter-operator parallelism in databases.</p>
          <p>
            In order to improve the degree of parallelism, most Bobox models require replication of boxes and introduction of data splitters and mergers as described in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. The replication is done by the model parallelizer at compile time, using the knowledge of crucial parameters of the run-time environment like the number of cores.
          </p>
          <p>The architecture of the compile-time part of the improved Bobox system is shown in Fig. 2; the run-time part remains the same as in Fig. 1.</p>
          <p>Fig. 2. Improving parallelism of Bobox models</p>
          <p>For a correct and meaningful transformation, the model parallelizer must know essential properties of the boxes, like statelessness or order sensitivity, as well as estimates of their quantitative behavior (e.g. the input-to-output data size ratio). These properties are described in box metadata.</p>
          <p>Currently, there is no mechanism to check whether the implementation of a box really satisfies the properties declared in its metadata. For database-like applications, this fact is negligible because the effect of the individual boxes corresponds to physical algebra operators whose properties are well understood.</p>
          <p>On the other hand, when Bobox is used as a parallel engine for general computing, the individual boxes correspond to routines, tasks, or similar elements of a parallel algorithm whose behavior is not always clearly defined. An error in box metadata may cause trouble similar to errors known from parallel programming like race conditions, and detecting and correcting these errors may be as demanding as checking for race conditions. This fact undermines the Bobox aspiration to be a simpler programming environment than general parallel programming systems.</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>Programming in Bobox</title>
          <p>Furthermore, coding the individual boxes is not as simple as it may seem with respect to the simple principles of Bobox.</p>
          <p>Most algorithms are described naturally using loops taking input data elements one by one. However, for performance reasons, the data in Bobox are transmitted and received in blocks called envelopes. Consequently, the code of a box must explicitly handle envelope receiving and sending and, thus, deviate from the simple one-by-one arrangement. Explicit envelope handling may be quite painful, especially in cases where the inputs and/or outputs are not synchronous (e.g. in the ordered merge algorithm).</p>
          <p>In addition, the original Bobox principles required that the code of a box never enter a blocking call. This required restructuring the code so that envelope handling is done outside of the main box routine. Although this principle corresponds to event-driven programming, which has been used successfully for years, it is unnatural in the context of most numerical and many data-processing algorithms.</p>
          <p>The problem of blocking calls was solved in later versions of Bobox by the use of fibers, i.e. lightweight threads which allow suspending the execution of box code anywhere. However, this solution comes at the cost of stack switching and, thus, slightly worse performance, mainly due to a larger number of cache misses.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>Goals</title>
          <p>As demonstrated in the previous paragraphs, using the pipeline paradigm under ultimate performance requirements leads to several problematic arrangements in Bobox. It became obvious that implementing boxes directly is quite a difficult task and that returning to a natural implementation of algorithms would require a substantial change to the programming environment.</p>
          <p>A natural programming environment for Bobox shall unload the burden of communication and envelope handling from the programmer. For performance reasons, the envelope handling shall not be hidden in run-time libraries – it is necessary to transform the code from natural one-by-one loops into event-driven code.</p>
          <p>In addition, the programming environment shall also maintain the coherence between the box code and the box metadata, either by checking whether the box code satisfies the box properties given in advance or by generating the box metadata from the box implementation.</p>
          <p>Furthermore, the programming environment may assist with fine-grained parallelism: if the box as a whole satisfies the conditions necessary for the coarse-grained parallelism achieved by pipelining or partitioning, then it likely satisfies similar conditions for applying vector instructions.</p>
        </sec>
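        <p>The contrast between natural one-by-one loops and explicit envelope handling can be sketched as follows (a simplified Python illustration, not actual Bobox code): the same running-sum computation is written once over individual items and once over envelopes, where the box code must carry its state across block boundaries itself.</p>
        <preformat>
```python
# Simplified illustration (not Bobox code) of one-by-one vs. envelope processing.

def running_sum_one_by_one(items):
    # natural formulation: consume input elements one by one
    total = 0
    out = []
    for x in items:
        total += x
        out.append(total)
    return out

def running_sum_envelopes(envelopes):
    # envelope formulation: data arrive in blocks; the box must carry
    # its state (total) across envelope boundaries explicitly
    total = 0
    out_envelopes = []
    for env in envelopes:
        out = []
        for x in env:
            total += x
            out.append(total)
        out_envelopes.append(out)
    return out_envelopes

flat = running_sum_one_by_one([1, 2, 3, 4])
blocked = running_sum_envelopes([[1, 2], [3, 4]])
# same result, more bookkeeping
assert flat == [c for env in blocked for c in env]
```
        </preformat>
        <p>Even in this trivial case, the envelope version needs explicit state threading; with asynchronous inputs (e.g. an ordered merge), the bookkeeping grows considerably.</p>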
      </sec>
      <sec id="sec-1-3">
        <title>Approach</title>
        <p>The goals defined in the previous section naturally lead to the concept of code transformation and/or translation from a user-friendly programming environment to the C++ box code.</p>
        <sec id="sec-1-3-1">
          <title>Language</title>
          <p>The use of C++ at the output stage is dictated by the implementation language of the Bobox core and the unmatched performance of the code generated by C++ compilers.</p>
          <p>On the other hand, the language at the input side is a subject of discussion. Given the output language, the use of C++ would be natural; however, analyzing and transforming C++ code is extremely difficult because of its complex syntax and permissive pointer semantics. Furthermore, the formerly widespread knowledge of C++ has nowadays retracted to devoted programming professionals – in an e-science environment, they are not always available.</p>
          <p>Since Bobox is targeted at scientific and data-intensive computation beyond the borders of numerical computation, languages like FORTRAN or Mathematica were disqualified due to their poor ability to handle sophisticated data structures.</p>
          <p>There were many attempts to introduce a non-imperative programming language for parallel programming, like Lustre, F#, or PigLatin. None of the new languages attracted sufficient attention of programmers, rendering them useless for a general-programming environment.</p>
          <p>Given the observations mentioned above, our choice narrowed to modern, widely accepted, strongly typed general-programming languages – Java and C#. Although they are only the least bad choice among our options, these languages have at least two important advantages: First, there are many programmers fluent in them. Second, both languages compile via standardized bytecodes – thus, our implementation may, hopefully, use the bytecode produced by standard compilers, bypassing the tedious implementation of a specialized language front-end.</p>
          <p>For our system, we finally decided to use C#, although the preference over Java was somewhat arbitrary.</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>Architecture</title>
          <p>
            The architecture of the proposed system is shown in Fig. 3. The boxes are implemented in C# and compiled by a third-party C# compiler (Microsoft Visual Studio or Mono). The compiler produces an intermediate representation called CIL, standardized by ECMA/ISO/IEC [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. The CIL code is then analyzed and box metadata are created. The analyzed intermediate code is then passed to the box generator which generates the C++ source code of the boxes. The rest of the process is the same as in Fig. 2 – the code is compiled by a third-party C++ compiler (Microsoft Visual Studio or GNU C++) while the box metadata is used by the model parallelizer.
          </p>
          <p>In the proposed system, an application consists of a model description and box code just like in the plain system from Fig. 1, with the visible difference that the code of the boxes is implemented in C# instead of C++. Nevertheless, the new system offers the following advantages: The envelope handling is added to the code automatically, allowing the programmer to focus on the nature of the algorithm. The box metadata, required for the application of the model parallelizer as in Fig. 2, are extracted automatically from the source code, ensuring their coherence.</p>
        </sec>
        <sec id="sec-1-3-3">
          <title>Advanced architecture</title>
          <p>Figure 4 shows an advanced version of the proposed architecture. Here, the source code consists of the C# code of the complete application. This is compiled into CIL as in the previous case. The advanced analyzer breaks the application code into boxes and extracts the model automatically from the global structure of the code. The following phases are the same as before.</p>
          <p>
            The advanced version is far more ambitious than the basic architecture; it essentially consists of automatic coarse-grained parallelization of C# code. Such a level of program transformation has long been known for FORTRAN [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], it was successfully implemented for C [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], and a similar goal was achieved with the help of profiling information in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Among languages with referential semantics, coarse-grained parallelization was attempted in Java [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. However, no such attempt has been described for C# yet.
          </p>
        </sec>
        <sec id="sec-1-3-4">
          <title>The effect of referential semantics</title>
          <p>Of course, C# and Java differ from the target C++ language by their reliance on referential semantics – to compile from C# or Java to C++, one must either simulate the referential semantics in C++, or restrain the input code from using the referential semantics.</p>
          <p>
            When used on local variables or stand-alone classes, the reference nature may be stripped off by object inlining as shown in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. However, this technique does not work on link-based data structures, including many standard containers. It means that the standard container library must be replaced by a different set of containers that will discourage the use of references. This fact may certainly confuse programmers used to the standard containers; nevertheless, learning a new set of containers is certainly easier than switching to another language (C++) completely.
          </p>
        </sec>
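        <p>The effect of referential semantics can be seen in a few lines of Python, which shares the reference nature of C# and Java (illustration only): two names may denote the same object, so a write through one name is visible through the other, and an analyzer that cannot prove two names distinct must conservatively assume a dependence between accesses through them.</p>
        <preformat>
```python
# Illustration of referential semantics: two names, one object.
a = [1, 2, 3]
b = a          # b aliases a: no copy is made
b[0] = 99      # a write through b ...
assert a[0] == 99   # ... is observable through a

c = list(a)    # an explicit copy breaks the alias
c[1] = -1
assert a[1] == 2    # a is unaffected; accesses to a and c are independent
```
        </preformat>
        <p>Value semantics in C++ makes the second, copy-based behavior the default, which is why the reference nature of C# code must be either simulated or restricted during translation.</p>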
      </sec>
    </sec>
    <sec id="sec-2">
      <title>The analyzer</title>
      <p>The structure of the CIL analyzer closely follows the series of transformations and analyses used to prepare the code for parallelization. The steps and their order are as follows:</p>
      <list list-type="simple">
        <list-item>
          <p>– Preliminary transformations</p>
        </list-item>
        <list-item>
          <p>– Preliminary code analysis</p>
        </list-item>
        <list-item>
          <p>– Dependence testing</p>
        </list-item>
      </list>
      <p>
        The following paragraphs briefly discuss the most important steps; details may be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Preliminary transformations</title>
        <p>This step includes procedure integration and code verification. Procedure integration (also called inlining) replaces calls to procedures with their bodies – of course, such a transformation leads to code expansion and is impossible in the case of recursion. However, given our motivation and architecture, it is applicable, and it is easier than the inter-procedural analysis that is usually necessary before automatic parallelization.</p>
        <p>The main purpose of procedure integration in our system is to remove unnecessary dependencies caused by parameter passing in the referential semantics of C#. Even though the procedures bound by a call could be analyzed independently, the integration allows the flow of data to be accurately analyzed.</p>
        <p>Code verification is a process designed to check whether the code follows the restrictions required for translation to box code. It is performed after procedure integration, and it must make sure that the final code does not contain any unsafe code, forbidden instructions, or forbidden constructs, including prohibited library elements.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Preliminary code analysis</title>
        <p>This step gathers information about the control-flow constructs and then creates a list of all variables used in the method. Both types of information are later used during dependence detection, since a dependence may be based on data or on control flow. This step does not contain any transformations or optimizations, and the code is not modified here.</p>
        <p>This analysis recognizes five different types of constructs: loops, if/else branches, switch statements, protected blocks, and return statements.</p>
        <p>Variable recognition is not a simple task, since the fields of an object shall be considered separate variables whenever possible; however, a fall-back to considering the object as a whole must be available when necessary.</p>
        <p>In addition, there are special temporary variables created on the stack as a result of some operation and later consumed by some other instruction. These variables are recognized by a stack simulator, and they represent the relationship between the instructions that constitute separate commands.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Aliasing</title>
        <p>Aliasing is the name for the fact that multiple symbols (may) represent the same memory location. If the analyzer is not able to determine which pointers or references reference the same memory, then it must conservatively assume that they can reference the same memory.</p>
        <p>Aliasing in .NET is simplified by two important facts. There are no pointers, and the references are controlled by the type system, which forbids certain references to address the same object. Another important fact is that a reference must always address a valid object; it cannot be assigned some random address.</p>
        <p>In addition, the procedure integration used in this work can remove parameter aliasing, because the formal parameters are removed in the process.</p>
        <p>Regardless of these factors, exact analysis of aliasing is an algorithmically unsolvable problem, so the analyzer always uses a heuristic-based conservative approximation.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Dependence testing</title>
        <p>
          Dependence testing is the most difficult part of this project. The CIL code is transformed to a structure that can be analyzed by well-known algorithms of dependence testing [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Note that the procedure integration done previously allows us to bypass the inter-procedural version of dependence testing.
        </p>
        <p>There are two important facts that help dependence testing in our case. First, no pointers are allowed and there are no arbitrary addresses, because everything must represent valid, allocated objects. Second, local variables are completely private and cannot be modified anywhere outside the method, with the only exception of reference parameters – and it is possible to check whether a local variable has been passed by reference or not.</p>
        <p>Parameters and local variables represent independent memory locations that can be accessed only by the method itself, because passing parameters by reference was ruled out by procedure integration. Therefore, all reads and writes to different local variables or arguments are independent operations that do not collide with each other. However, there may be collisions when a field is accessed using two local variables referring to a single object.</p>
        <p>Stack variables represent values added to and removed from the stack, and every such variable is written and read just once. Every stack variable simply represents a single true dependence whose source is in the instruction that created the variable and whose sink is in the instruction that consumes it.</p>
        <p>Two field variables can access the same memory only when they access the same field in the same object; otherwise they are independent. To prove independence between fields, it is necessary to keep track of the object they belong to, and all possible dependences must be considered when this object cannot be properly monitored.</p>
        <p>
          Arrays represent the best opportunity for parallelization, but their analysis is the most difficult. The subscript analysis is a complex problem which can be handled in several degrees of conservative approximation, presented for instance in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Induction variables are defined by loop iterations, and they are essential to understanding the behavior of a loop. Given the syntax of loops in C#, it is more reliable to analyze the behavior of the individual variables regardless of their presence in the loop heading.</p>
        <p>Before the core dependence testing, the loops and their induction variables have been identified and the array subscripts have been reconstructed, along with multidimensional arrays. The analysis of aliasing should provide some help for the testing, and all the variables which have not been separated may be treated as a single variable for the purposes of this analysis.</p>
        <p>
          With all this information at hand, dependence testing is a matter of applying the appropriate algorithms presented in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>In the advanced version of the architecture, the model generator must transform the dependence graph of the analyzed code into a Bobox model. Although it is essentially possible to do this in a one-to-one manner, such a model would contain boxes so small that the execution would suffer from communication overhead and cache misses. To create effective models, a careful cache-aware decomposition strategy will be required – this is the most intricate item in our future work.</p>
      </sec>
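      <p>As a concrete example of the dependence-testing machinery, the classic GCD test (covered e.g. in [2]) decides whether two affine subscripts a*i + b and c*j + d of the same array can ever name the same element: an integer solution exists only if gcd(a, c) divides d - b. A minimal sketch, assuming affine subscripts with nonzero coefficients:</p>
      <preformat>
```python
from math import gcd

def gcd_test(a, b, c, d):
    # Dependence between writes to X[a*i + b] and reads of X[c*j + d]
    # is possible only if gcd(a, c) divides (d - b); otherwise the two
    # subscripts can never be equal, and independence is proven.
    g = gcd(a, c)
    return (d - b) % g == 0

# X[2*i] vs X[2*j + 1]: even vs odd indices, never equal -> independent
assert gcd_test(2, 0, 2, 1) is False
# X[2*i] vs X[4*j + 2]: gcd 2 divides 2 -> a dependence may exist
assert gcd_test(2, 0, 4, 2) is True
```
      </preformat>
      <p>Note the asymmetry typical of conservative analyses: a negative answer proves independence, while a positive answer only means that a dependence may exist and stronger tests (or a conservative assumption) are needed.</p>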
    </sec>
    <sec id="sec-3">
      <title>Conclusion and future work</title>
      <sec id="sec-3-1">
        <p>We have successfully implemented key parts of the CIL analyzer as described in the previous section. This implementation answered the main open problems associated with the proposed architecture; namely, it allowed us to state the following:</p>
        <p>The reference nature of C# does not create significant additional obstacles in the code analysis required for parallelization; in particular, most aliases and false dependences generated by references may be removed by procedure integration.</p>
        <p>The intermediate language (CIL) used by C# compilers contains enough information to perform the required analysis; in particular, we developed a stack simulator to accurately analyse the data flow in a CIL procedure.</p>
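        <p>The idea behind the stack simulator can be illustrated by a small sketch (hypothetical and greatly simplified compared to real CIL): symbolically executing a stack-based instruction sequence recovers, for every stack temporary, which instruction produced it and which consumed it – exactly one true dependence per temporary, as noted in Section 4.</p>
        <preformat>
```python
# Greatly simplified sketch of a stack simulator for a CIL-like
# stack machine: recover producer -> consumer (def-use) pairs.

def simulate(instructions):
    stack = []        # indices of the instructions that produced each value
    dependences = []  # (producer_index, consumer_index) pairs
    for i, (op, _) in enumerate(instructions):
        if op == "ldc":        # push a constant: produces one value
            stack.append(i)
        elif op == "add":      # pops two values, pushes their sum
            for _ in range(2):
                dependences.append((stack.pop(), i))
            stack.append(i)
        elif op == "stloc":    # pops one value into a local variable
            dependences.append((stack.pop(), i))
    return dependences

# ldc 1; ldc 2; add; stloc 0   ~   local0 = 1 + 2
prog = [("ldc", 1), ("ldc", 2), ("add", None), ("stloc", 0)]
print(sorted(simulate(prog)))  # [(0, 2), (1, 2), (2, 3)]
```
        </preformat>
        <p>The real analyzer must additionally handle branches, calls, and the full CIL instruction set, but the recovered def-use pairs play the same role there.</p>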
        <p>Note, however, that these observations are valid when assuming C# code that serves the motivation described in Sec. 2.</p>
        <p>It is doubtful whether our observations apply to arbitrary C# code – at least, the use of procedure integration disqualifies recursive code. Nevertheless, some phases of the analysis may be usable outside our constraints as well.</p>
        <p>To complete our goals, the box generator has to be implemented. We believe that all the evil was hidden in the details of the analyzer, so there is hopefully no algorithmically difficult part in the generator. On the other hand, the quality of the code produced by the generator strongly affects the performance of the whole system; thus, the design of the generator requires extreme care.</p>
        <p>Last but not least, although the system may be
essentially usable as is, any real-life use of our system
will require a set of containers to replace the prohibited
reference-based standard containers.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. ISO/IEC 23271:
          <year>2012</year>
          .
          <article-title>Information technology – Common Language Infrastructure (CLI)</article-title>
          .
          <source>Technical report, ISO/IEC JTC1/SC22</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Randy</given-names>
            <surname>Allen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Kennedy</surname>
          </string-name>
          .
          <article-title>Optimizing compilers for modern architectures</article-title>
          . Morgan Kaufmann San Francisco,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. David Bednárek, Jiří Dokulil, Jakub Yaghob, and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Zavoral</surname>
          </string-name>
          .
          <article-title>Data-flow awareness in parallel data processing</article-title>
          . In Giancarlo Fortino, Costin Badica, Michele Malgeri, and Rainer Unland, editors,
          <source>Intelligent Distributed Computing VI</source>
          , volume
          <volume>446</volume>
          <source>of Studies in Computational Intelligence</source>
          , pages
          <fpage>149</fpage>
          -
          <lpage>154</lpage>
          . Springer Berlin Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Michal</given-names>
            <surname>Brabec</surname>
          </string-name>
          .
          <article-title>Analysis of automatic program parallelization based on bytecode</article-title>
          .
          <source>Diploma thesis</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Zoran Budimlić, Mackale Joyner, and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Kennedy</surname>
          </string-name>
          .
          <article-title>Improving compilation of java scientific applications</article-title>
          .
          <source>Int. J. High Perform. Comput. Appl.</source>
          ,
          <volume>21</volume>
          (
          <issue>3</issue>
          ):
          <fpage>251</fpage>
          -
          <lpage>265</lpage>
          ,
          <year>August 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Zbyněk</given-names>
            <surname>Falt</surname>
          </string-name>
          ,
          <article-title>Miroslav Čermák, Jiří Dokulil, and Filip Zavoral. Parallel SPARQL query processing using Bobox</article-title>
          .
          <source>International Journal On Advances in Intelligent Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          and 4):
          <fpage>302</fpage>
          -
          <lpage>314</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Seema</given-names>
            <surname>Hiranandani</surname>
          </string-name>
          , Ken Kennedy, and
          <string-name>
            <surname>Chau-Wen Tseng</surname>
          </string-name>
          .
          <article-title>Compiling Fortran D for MIMD distributedmemory machines</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>35</volume>
          (
          <issue>8</issue>
          ):
          <fpage>66</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>August 1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Steven</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Muchnick</surname>
          </string-name>
          .
          <article-title>Advanced compiler design implementation</article-title>
          . Morgan Kaufmann Publishers,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Laurence</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Nagel</surname>
          </string-name>
          .
          <article-title>SPICE2: A Computer Program to Simulate Semiconductor Circuits</article-title>
          .
          <source>PhD thesis</source>
          , EECS Department, University of California, Berkeley,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Frank</surname>
            <given-names>Otto</given-names>
          </string-name>
          , Victor Pankratius, and Walter F. Tichy. XJava:
          <article-title>Exploiting parallelism with object-oriented stream programming</article-title>
          .
          <source>In Henk Sips</source>
          , Dick Epema, and
          <string-name>
            <surname>Hai-Xiang</surname>
            <given-names>Lin</given-names>
          </string-name>
          , editors,
          <source>Euro-Par 2009 Parallel Processing</source>
          , volume
          <volume>5704</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>875</fpage>
          -
          <lpage>886</lpage>
          . Springer Berlin Heidelberg,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sean</surname>
            <given-names>Rul</given-names>
          </string-name>
          , Hans Vandierendonck, and Koen De Bosschere.
          <article-title>A profile-based tool for finding pipeline parallelism in sequential programs</article-title>
          .
          <source>Parallel Computing</source>
          ,
          <volume>36</volume>
          (
          <issue>9</issue>
          ):
          <fpage>531</fpage>
          -
          <lpage>551</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>W.</given-names>
            <surname>Thies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Amarasinghe</surname>
          </string-name>
          .
          <article-title>A practical approach to exploiting coarse-grained pipeline parallelism in C programs</article-title>
          . In
          <source>MICRO 2007: 40th Annual IEEE/ACM International Symposium on Microarchitecture</source>
          , pages
          <fpage>356</fpage>
          -
          <lpage>369</lpage>
          ,
          <year>2007</year>
          .
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>