<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Programming parallel pipelines using non-parallel C# code</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michal Brabec</string-name>
          <email>brabec@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Bednárek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software Engineering</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1003</volume>
      <fpage>82</fpage>
      <lpage>87</lpage>
      <abstract>
        <p>Parallel and high-performance code is usually created as imperative code in FORTRAN, C, or C++ with the help of parallel environments like OpenMP or Intel TBB. However, learning these languages is quite difficult compared to C# or Java. Although these modern languages have numerous parallel features, they lack the automatic parallelization or load-distribution features known from specialized parallel environments. Due to the referential nature of C# and Java, the principles of parallel environments like OpenMP cannot be directly transferred to these languages. We investigated the idea of using C# as a programming language for a parallel system based on nonlinear pipelines. In this paper, we propose the architecture of such a system and describe some key steps that we have already taken towards the future goal of extracting both the pipeline structure and the code of the nodes from the C# source code.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <p>Parallel programs are usually designed within the framework of a specific paradigm like thread-based, task-based, or pipeline parallelism. Such a framework is either explicitly used by the programmer in the form of a library like Intel TBB, or it is hidden inside a compiler capable of automatic parallelization like C++/OpenMP.</p>
        <p>
          Pipeline parallelism is a paradigm which receives increasing attention due to its relation to stream processing; in its generalized, branched-pipeline form it is also sufficient for data-processing applications including relational or RDF databases [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The explicit specification of data flow in a pipeline also helps in NUMA or distributed applications where the cost of data movement is important [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          Unfortunately, pipeline parallelism has not been studied as thoroughly as other forms of parallelism – while automatic parallelization within a thread-based or task-based framework has been implemented in many systems, including FORTRAN, C, and C++ compilers, extracting pipeline structure from program code is still in the stage of experiments [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Bobox [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is a parallel execution environment based on generalized branched pipelines which connect a set of execution units called boxes. A Bobox application is composed of two components: the model, which describes how the boxes are interconnected, and the box code, i.e. the implementation of all boxes used in the model.
        </p>
        <p>Fig. 1. The basic architecture of Bobox</p>
        <p>As shown in Fig. 1, the code of the individual boxes is compiled from their C++ source code and linked together with the Bobox system code at run time. The run-time representation of the model, called the instantiated model, is created by the model instantiator from the text-based model description and the binary box code. After instantiation, the model is assigned to a set of CPUs and executed.</p>
        <p>
          When created by humans, Bobox models are usually written in Bobolang [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], a declarative language whose purpose and principles are similar to netlist languages like SPICE [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Bobox models may also be generated from a query language using a language front-end, e.g. the SPARQL front-end [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>Nowadays, Bobox boxes are programmed in C++ within tight restrictions imposed by the framework interface. Although simpler than explicit thread-based or message-based parallelism, programming in Bobox is still a tedious and error-prone task.</p>
        <p>In this paper, we propose a Bobox front-end which transforms the box code from C# to C++. During the transformation, code is added to control pipelines and synchronization.</p>
        <p>In the advanced version of the architecture, the front-end also extracts the model from a C# program, approaching the goal of automatic parallelization in the Bobox environment.</p>
        <p>While the advanced version is a future goal, we have already taken the critical steps towards the basic version. We have studied and implemented the key analytical part of the proposed system, the CIL analyzer; in particular, we thoroughly studied the aspects of C# which make the problem different from known compiler algorithms for C++ or FORTRAN code.</p>
        <p>The rest of the paper is organized as follows: In Section 2, we describe the motivation for our project and the goals that resulted from it. Section 3 describes the architecture of the proposed solution as well as the justification for the use of C#; throughout this section, we also compare our approach to related work. In Section 4, we discuss the technical details associated with the choice of C# and the key components of our system. In the Conclusion, we describe the current status and the future development of the project.</p>
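        <p>To make the box/model terminology concrete, the following is a minimal Python sketch (hypothetical and greatly simplified – not the Bobox API): the boxes are purely serial transformations of data streams, and the model only wires them together.</p>
        <preformat>
```python
# Hypothetical sketch of the box/model idea (not the actual Bobox API):
# a "box" is a serial transformation, the "model" wires boxes together.

def source_box():
    # produces a stream of data items
    yield from range(5)

def square_box(inp):
    # purely serial transformation of its input stream
    for x in inp:
        yield x * x

def sink_box(inp):
    # consumes the stream and materializes the result
    return list(inp)

def run_model():
    # the "model": source -> square -> sink
    return sink_box(square_box(source_box()))

print(run_model())  # [0, 1, 4, 9, 16]
```
        </preformat>
        <p>Real Bobox models are branched pipelines of many such boxes, and the parallelism comes from executing independent boxes concurrently.</p>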
      </sec>
      <sec id="sec-1-2">
        <title>Motivation</title>
        <p>The principles of Bobox, developed in accordance with the general pipeline parallelism paradigm, determine the means that a developer in Bobox possesses. As we will show in the following paragraphs, the stress on maximum performance causes programming in Bobox to be less straightforward than the pipeline approach promises.</p>
        <sec id="sec-1-2-1">
          <title>Parallelism in Bobox</title>
          <p>Bobox design principles impose some crucial restrictions upon the behavior of individual boxes. In particular, a box shall always execute purely serially; thus, any parallel execution occurs only among boxes at the plan level. This approach corresponds to inter-operator parallelism in databases.</p>
          <p>
            In order to improve the degree of parallelism, most Bobox models require replication of boxes and introduction of data splitters and mergers as described in [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. The replication is done by the model parallelizer at compile time, using the knowledge of crucial parameters of the run-time environment like the number of cores.
          </p>
          <p>The architecture of the compile-time part of the improved Bobox system is shown in Fig. 2; the run-time part remains the same as in Fig. 1.</p>
          <p>Fig. 2. Improving parallelism of Bobox models</p>
          <p>For a correct and meaningful transformation, the model parallelizer must know essential properties of the boxes, like statelessness or order sensitivity, as well as estimates of their quantitative behavior (e.g. the input-to-output data size ratio). These properties are described in box metadata.</p>
          <p>Currently, there is no mechanism to check whether the implementation of a box really satisfies the properties declared in its metadata. For database-like applications, this fact is negligible because the effect of the individual boxes corresponds to physical algebra operators whose properties are well understood.</p>
          <p>On the other hand, when Bobox is used as a parallel engine for general computing, the individual boxes correspond to routines, tasks, or similar elements of a parallel algorithm whose behavior is not always clearly defined. An error in box metadata may cause trouble similar to errors known from parallel programming like race conditions, and detecting and correcting these errors may be as demanding as checking for race conditions. This fact undermines the Bobox aspiration to be a simpler programming environment than general parallel programming systems.</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>Programming in Bobox</title>
          <p>Furthermore, coding the individual boxes is not as simple as it may seem with respect to the simple principles of Bobox.</p>
          <p>Most algorithms are described naturally using loops taking input data elements one by one. However, for performance reasons, the data in Bobox are transmitted and received in blocks called envelopes. Consequently, the code of a box must explicitly handle envelope receiving and sending and, thus, deviate from the simple one-by-one arrangement. Explicit envelope handling may be quite painful, especially in cases where the inputs and/or outputs are not synchronous (e.g. in the ordered merge algorithm).</p>
          <p>In addition, the original Bobox principles required that the code of a box never enter a blocking call. This required restructuring the code so that envelope handling is done outside of the main box routine. Although this principle corresponds to event-driven programming, which has been used successfully for years, it is unnatural in the context of most numerical and many data-processing algorithms.</p>
          <p>The problem of blocking calls was solved in later versions of Bobox by the use of fibers, i.e. lightweight threads which allow suspending the execution of box code anywhere. However, this solution comes at the cost of stack switching and, thus, slightly worse performance, mainly due to a larger number of cache misses.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>Goals</title>
          <p>As demonstrated in the previous paragraphs, using the pipeline paradigm under ultimate performance requirements leads to several problematic arrangements in Bobox. It became obvious that implementing boxes directly is quite a difficult task and that returning to a natural implementation of algorithms would require a substantial change to the programming environment.</p>
          <p>A natural programming environment for Bobox shall unload the burden of communication and envelope handling from the programmer. For performance reasons, the envelope handling shall not be hidden in run-time libraries – it is necessary to transform the code from natural one-by-one loops into event-driven code.</p>
          <p>In addition, the programming environment shall also maintain the coherence between the box code and the box metadata, either by checking whether the box code satisfies the box properties given in advance or by generating the box metadata from the box implementation.</p>
          <p>Furthermore, the programming environment may assist with fine-grained parallelism: if the box as a whole satisfies the conditions necessary for the coarse-grained parallelism achieved by pipelining or partitioning, then it likely satisfies similar conditions for applying vector instructions.</p>
        </sec>
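        <p>The contrast between natural one-by-one loops and explicit envelope handling can be sketched as follows (a simplified Python illustration, not actual Bobox code): the same running-sum computation is written once over individual items and once over envelopes, where the box code must carry its state across block boundaries itself.</p>
        <preformat>
```python
# Simplified illustration (not Bobox code) of one-by-one vs. envelope processing.

def running_sum_one_by_one(items):
    # natural formulation: consume input elements one by one
    total = 0
    out = []
    for x in items:
        total += x
        out.append(total)
    return out

def running_sum_envelopes(envelopes):
    # envelope formulation: data arrive in blocks; the box must carry
    # its state (total) across envelope boundaries explicitly
    total = 0
    out_envelopes = []
    for env in envelopes:
        out = []
        for x in env:
            total += x
            out.append(total)
        out_envelopes.append(out)
    return out_envelopes

flat = running_sum_one_by_one([1, 2, 3, 4])
blocked = running_sum_envelopes([[1, 2], [3, 4]])
# same result, more bookkeeping
assert flat == [c for env in blocked for c in env]
```
        </preformat>
        <p>Even in this trivial case, the envelope version needs explicit state threading; with asynchronous inputs (e.g. an ordered merge), the bookkeeping grows considerably.</p>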
      </sec>
      <sec id="sec-1-3">
        <title>Approach</title>
        <p>The goals defined in the previous section naturally lead to the concept of code transformation and/or translation from a user-friendly programming environment to the C++ box code.</p>
        <sec id="sec-1-3-1">
          <title>Language</title>
          <p>The use of C++ at the output stage is dictated by the implementation language of the Bobox core and the unmatched performance of the code generated by C++ compilers.</p>
          <p>On the other hand, the language at the input side is a subject of discussion. Given the output language, the use of C++ would be natural; however, analyzing and transforming C++ code is extremely difficult because of its complex syntax and permissive pointer semantics. Furthermore, the formerly widespread knowledge of C++ has nowadays retracted to devoted programming professionals – in an e-science environment, they are not always available.</p>
          <p>Since Bobox is targeted at scientific and data-intensive computation beyond the borders of numerical computation, languages like FORTRAN or Mathematica were disqualified due to their poor ability to handle sophisticated data structures.</p>
          <p>There were many attempts to introduce a non-imperative programming language for parallel programming, like Lustre, F#, or PigLatin. None of the new languages attracted sufficient attention of programmers, rendering them useless for a general-programming environment.</p>
          <p>Given the observations mentioned above, our choice narrowed to modern, widely accepted, strongly typed general-programming languages – Java and C#. Although they are only the least bad choice among our options, these languages have at least two important advantages: First, there are many programmers fluent in them. Second, both languages compile via standardized bytecodes – thus, our implementation may, hopefully, use the bytecode produced by standard compilers, bypassing the tedious implementation of a specialized language front-end.</p>
          <p>For our system, we finally decided to use C#, although the preference over Java was somewhat arbitrary.</p>
        </sec>
        <sec id="sec-1-3-2">
          <title>Architecture</title>
          <p>
            The architecture of the proposed system is shown in Fig. 3. The boxes are implemented in C# and compiled by a third-party C# compiler (Microsoft Visual Studio or Mono). The compiler produces an intermediate representation called CIL, standardized by ECMA/ISO/IEC [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. The CIL code is then analyzed and box metadata are created. The analyzed intermediate code is then passed to the box generator which generates the C++ source code of the boxes. The rest of the process is the same as in Fig. 2 – the code is compiled by a third-party C++ compiler (Microsoft Visual Studio or GNU C++) while the box metadata is used by the model parallelizer.
          </p>
          <p>In the proposed system, an application consists of a model description and box code just like in the plain system from Fig. 1, with the visible difference that the code of the boxes is implemented in C# instead of C++. Nevertheless, the new system offers the following advantages: The envelope handling is added to the code automatically, allowing the programmer to focus on the nature of the algorithm. The box metadata, required for the application of the model parallelizer as in Fig. 2, are extracted automatically from the source code, ensuring their coherence.</p>
        </sec>
        <sec id="sec-1-3-3">
          <title>Advanced architecture</title>
          <p>Figure 4 shows an advanced version of the proposed architecture. Here, the source code consists of the C# code of the complete application. This is compiled into CIL as in the previous case. The advanced analyzer breaks the application code into boxes and extracts the model automatically from the global structure of the code. The following phases are the same as before.</p>
          <p>
            The advanced version is far more ambitious than the basic architecture; it essentially consists of automatic coarse-grained parallelization of C# code. Such a level of program transformation has long been known for FORTRAN [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], it was successfully implemented for C [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], and a similar goal was achieved with the help of profiling information in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Among languages with referential semantics, coarse-grained parallelization was attempted in Java [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. However, no such attempt has been described for C# yet.
          </p>
        </sec>
        <sec id="sec-1-3-4">
          <title>The effect of referential semantics</title>
          <p>Of course, C# and Java differ from the target C++ language by their reliance on referential semantics – to compile from C# or Java to C++, one must either simulate the referential semantics in C++, or restrain the input code from using the referential semantics.</p>
          <p>
            When used on local variables or stand-alone classes, the reference nature may be stripped off by object inlining as shown in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. However, this technique does not work on link-based data structures, including many standard containers. It means that the standard container library must be replaced by a different set of containers that will discourage the use of references. This fact may certainly confuse programmers used to the standard containers; nevertheless, learning a new set of containers is certainly easier than switching to another language (C++) completely.
          </p>
        </sec>
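        <p>The effect of referential semantics can be seen in a few lines of Python, which shares the reference nature of C# and Java (illustration only): two names may denote the same object, so a write through one name is visible through the other, and an analyzer that cannot prove two names distinct must conservatively assume a dependence between accesses through them.</p>
        <preformat>
```python
# Illustration of referential semantics: two names, one object.
a = [1, 2, 3]
b = a          # b aliases a: no copy is made
b[0] = 99      # a write through b ...
assert a[0] == 99   # ... is observable through a

c = list(a)    # an explicit copy breaks the alias
c[1] = -1
assert a[1] == 2    # a is unaffected; accesses to a and c are independent
```
        </preformat>
        <p>Value semantics in C++ makes the second, copy-based behavior the default, which is why the reference nature of C# code must be either simulated or restricted during translation.</p>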
      </sec>
    </sec>
    <sec id="sec-2">
      <title>The analyzer</title>
      <p>The structure of the CIL analyzer closely follows the series of transformations and analyses used to prepare the code for parallelization. The steps and their order are as follows:</p>
      <list list-type="simple">
        <list-item>
          <p>– Preliminary transformations</p>
        </list-item>
        <list-item>
          <p>– Preliminary code analysis</p>
        </list-item>
        <list-item>
          <p>– Dependence testing</p>
        </list-item>
      </list>
      <p>
        The following paragraphs briefly discuss the most important steps; details may be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Preliminary transformations</title>
        <p>This step includes procedure integration and code verification. Procedure integration (also called inlining) replaces calls to procedures with their bodies – of course, such a transformation leads to code expansion and is impossible in the case of recursion. However, given our motivation and architecture, it is applicable, and it is easier than the inter-procedural analysis that is usually necessary before automatic parallelization.</p>
        <p>The main purpose of procedure integration in our system is to remove unnecessary dependencies caused by parameter passing in the referential semantics of C#. Even though the procedures bound by a call could be analyzed independently, the integration allows the flow of data to be accurately analyzed.</p>
        <p>Code verification is a process designed to check whether the code follows the restrictions required for translation to box code. It is performed after procedure integration, and it must make sure that the final code does not contain any unsafe code, forbidden instructions, or forbidden constructs, including prohibited library elements.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Preliminary code analysis</title>
        <p>This step gathers information about the control-flow constructs and then creates a list of all variables used in the method. Both types of information are later used during dependence detection, since a dependence may be based on data or on control flow. This step does not contain any transformations or optimizations, and the code is not modified here.</p>
        <p>This analysis recognizes five different types of constructs: loops, if/else branches, switch statements, protected blocks, and return statements.</p>
        <p>Variable recognition is not a simple task, since the fields of an object shall be considered separate variables whenever possible; however, a fall-back to considering the object as a whole must be available when necessary.</p>
        <p>In addition, there are special temporary variables created on the stack as a result of some operation and later consumed by some other instruction. These variables are recognized by a stack simulator, and they represent the relationship between the instructions that constitute separate commands.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Aliasing</title>
        <p>Aliasing is the name for the fact that multiple symbols (may) represent the same memory location. If the analyzer is not able to determine which pointers or references reference the same memory, then it must conservatively assume that they can reference the same memory.</p>
        <p>Aliasing in .NET is simplified by two important facts. There are no pointers, and the references are controlled by the type system, which forbids certain references to address the same object. Another important fact is that a reference must always address a valid object; it cannot be assigned some random address.</p>
        <p>In addition, the procedure integration used in this work can remove parameter aliasing, because the formal parameters are removed in the process.</p>
        <p>Regardless of these factors, exact analysis of aliasing is an algorithmically unsolvable problem, so the analyzer always uses a heuristic-based conservative approximation.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Dependence testing</title>
        <p>
          Dependence testing is the most difficult part of this project. The CIL code is transformed to a structure that can be analyzed by well-known algorithms of dependence testing [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Note that the procedure integration done previously allows us to bypass the inter-procedural version of dependence testing.
        </p>
        <p>There are two important facts that help dependence testing in our case. First, no pointers are allowed and there are no arbitrary addresses, because everything must represent valid, allocated objects. Second, local variables are completely private and cannot be modified anywhere outside the method, with the only exception of reference parameters – and it is possible to check whether a local variable has been passed by reference or not.</p>
        <p>Parameters and local variables represent independent memory locations that can be accessed only by the method itself, because passing parameters by reference was ruled out by procedure integration. Therefore, all reads and writes to different local variables or arguments are independent operations that do not collide with each other. However, there may be collisions when a field is accessed using two local variables referring to a single object.</p>
        <p>Stack variables represent values added to and removed from the stack, and every such variable is written and read just once. Every stack variable simply represents a single true dependence whose source is in the instruction that created the variable and whose sink is in the instruction that consumes it.</p>
        <p>Two field variables can access the same memory only when they access the same field in the same object; otherwise they are independent. To prove independence between fields, it is necessary to keep track of the object they belong to, and all possible dependences must be considered when this object cannot be properly monitored.</p>
        <p>
          Arrays represent the best opportunity for parallelization, but their analysis is the most difficult. The subscript analysis is a complex problem which can be handled in several degrees of conservative approximation, presented for instance in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Induction variables are defined by loop iterations, and they are essential to understanding the behavior of a loop. Given the syntax of loops in C#, it is more reliable to analyze the behavior of the individual variables regardless of their presence in the loop heading.</p>
        <p>Before the core dependence testing, the loops and their induction variables have been identified and the array subscripts have been reconstructed, along with multidimensional arrays. The analysis of aliasing should provide some help for the testing, and all the variables which have not been separated may be treated as a single variable for the purposes of this analysis.</p>
        <p>
          With all this information at hand, dependence testing is a matter of applying the appropriate algorithms presented in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>In the advanced version of the architecture, the model generator must transform the dependence graph of the analyzed code into a Bobox model. Although it is essentially possible to do this in a one-to-one manner, such a model would contain boxes so small that the execution would suffer from communication overhead and cache misses. To create effective models, a careful cache-aware decomposition strategy will be required – this is the most intricate item in our future work.</p>
      </sec>
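      <p>As a concrete example of the dependence-testing machinery, the classic GCD test (covered e.g. in [2]) decides whether two affine subscripts a*i + b and c*j + d of the same array can ever name the same element: an integer solution exists only if gcd(a, c) divides d - b. A minimal sketch, assuming affine subscripts with nonzero coefficients:</p>
      <preformat>
```python
from math import gcd

def gcd_test(a, b, c, d):
    # Dependence between writes to X[a*i + b] and reads of X[c*j + d]
    # is possible only if gcd(a, c) divides (d - b); otherwise the two
    # subscripts can never be equal, and independence is proven.
    g = gcd(a, c)
    return (d - b) % g == 0

# X[2*i] vs X[2*j + 1]: even vs odd indices, never equal -> independent
assert gcd_test(2, 0, 2, 1) is False
# X[2*i] vs X[4*j + 2]: gcd 2 divides 2 -> a dependence may exist
assert gcd_test(2, 0, 4, 2) is True
```
      </preformat>
      <p>Note the asymmetry typical of conservative analyses: a negative answer proves independence, while a positive answer only means that a dependence may exist and stronger tests (or a conservative assumption) are needed.</p>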
    </sec>
    <sec id="sec-3">
      <title>Conclusion and future work</title>
      <sec id="sec-3-1">
        <p>We have successfully implemented key parts of the CIL analyzer as described in the previous section. This implementation answered the main open problems associated with the proposed architecture; namely, it allowed us to state the following:</p>
        <p>The reference nature of C# does not create significant additional obstacles in the code analysis required for parallelization; in particular, most aliases and false dependences generated by references may be removed by procedure integration.</p>
        <p>The intermediate language (CIL) used by C# compilers contains enough information to perform the required analysis; in particular, we developed a stack simulator to accurately analyse the data flow in a CIL procedure.</p>
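        <p>The idea behind the stack simulator can be illustrated by a small sketch (hypothetical and greatly simplified compared to real CIL): symbolically executing a stack-based instruction sequence recovers, for every stack temporary, which instruction produced it and which consumed it – exactly one true dependence per temporary, as noted in Section 4.</p>
        <preformat>
```python
# Greatly simplified sketch of a stack simulator for a CIL-like
# stack machine: recover producer -> consumer (def-use) pairs.

def simulate(instructions):
    stack = []        # indices of the instructions that produced each value
    dependences = []  # (producer_index, consumer_index) pairs
    for i, (op, _) in enumerate(instructions):
        if op == "ldc":        # push a constant: produces one value
            stack.append(i)
        elif op == "add":      # pops two values, pushes their sum
            for _ in range(2):
                dependences.append((stack.pop(), i))
            stack.append(i)
        elif op == "stloc":    # pops one value into a local variable
            dependences.append((stack.pop(), i))
    return dependences

# ldc 1; ldc 2; add; stloc 0   ~   local0 = 1 + 2
prog = [("ldc", 1), ("ldc", 2), ("add", None), ("stloc", 0)]
print(sorted(simulate(prog)))  # [(0, 2), (1, 2), (2, 3)]
```
        </preformat>
        <p>The real analyzer must additionally handle branches, calls, and the full CIL instruction set, but the recovered def-use pairs play the same role there.</p>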
        <p>Note, however, that these observations are valid when assuming C# code that serves the motivation described in Sec. 2.</p>
        <p>It is doubtful whether our observations apply to arbitrary C# code – at least, the use of procedure integration disqualifies recursive code. Nevertheless, some phases of the analysis may be usable outside our constraints as well.</p>
        <p>To complete our goals, the box generator has to be implemented. We believe that all the evil was hidden in the details of the analyzer, so there is hopefully no algorithmically difficult part in the generator. On the other hand, the quality of the code produced by the generator strongly affects the performance of the whole system; thus, the design of the generator requires extreme care.</p>
        <p>Last but not least, although the system may be
essentially usable as is, any real-life use of our system
will require a set of containers to replace the prohibited
reference-based standard containers.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. ISO/IEC 23271:
          <year>2012</year>
          .
          <article-title>Information technology – Common Language Infrastructure (CLI)</article-title>
          .
          <source>Technical report, ISO/IEC JTC1/SC22</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Randy</given-names>
            <surname>Allen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Kennedy</surname>
          </string-name>
          .
          <article-title>Optimizing compilers for modern architectures</article-title>
          . Morgan Kaufmann San Francisco,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. David Bednárek, Jiří Dokulil, Jakub Yaghob, and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Zavoral</surname>
          </string-name>
          .
          <article-title>Data-flow awareness in parallel data processing</article-title>
          . In Giancarlo Fortino, Costin Badica, Michele Malgeri, and Rainer Unland, editors,
          <source>Intelligent Distributed Computing VI</source>
          , volume
          <volume>446</volume>
          <source>of Studies in Computational Intelligence</source>
          , pages
          <fpage>149</fpage>
          -
          <lpage>154</lpage>
          . Springer Berlin Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Michal</given-names>
            <surname>Brabec</surname>
          </string-name>
          .
          <article-title>Analysis of automatic program parallelization based on bytecode</article-title>
          .
          <source>Diploma thesis</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Zoran Budimlić, Mackale Joyner, and
          <string-name>
            <given-names>Ken</given-names>
            <surname>Kennedy</surname>
          </string-name>
          .
          <article-title>Improving compilation of java scientific applications</article-title>
          .
          <source>Int. J. High Perform. Comput. Appl.</source>
          ,
          <volume>21</volume>
          (
          <issue>3</issue>
          ):
          <fpage>251</fpage>
          -
          <lpage>265</lpage>
          ,
          <year>August 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Zbyněk</given-names>
            <surname>Falt</surname>
          </string-name>
          ,
          <article-title>Miroslav Čermák, Jiří Dokulil, and Filip Zavoral. Parallel SPARQL query processing using Bobox</article-title>
          .
          <source>International Journal On Advances in Intelligent Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          and 4):
          <fpage>302</fpage>
          -
          <lpage>314</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Seema</given-names>
            <surname>Hiranandani</surname>
          </string-name>
          , Ken Kennedy, and
          <string-name>
            <surname>Chau-Wen Tseng</surname>
          </string-name>
          .
          <article-title>Compiling Fortran D for MIMD distributedmemory machines</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>35</volume>
          (
          <issue>8</issue>
          ):
          <fpage>66</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>August 1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Steven</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Muchnick</surname>
          </string-name>
          .
          <article-title>Advanced compiler design implementation</article-title>
          . Morgan Kaufmann Publishers,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Laurence</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Nagel</surname>
          </string-name>
          .
          <article-title>SPICE2: A Computer Program to Simulate Semiconductor Circuits</article-title>
          .
          <source>PhD thesis</source>
          , EECS Department, University of California, Berkeley,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Frank</surname>
            <given-names>Otto</given-names>
          </string-name>
          , Victor Pankratius, and Walter F. Tichy. XJava:
          <article-title>Exploiting parallelism with object-oriented stream programming</article-title>
          .
          <source>In Henk Sips</source>
          , Dick Epema, and
          <string-name>
            <surname>Hai-Xiang</surname>
            <given-names>Lin</given-names>
          </string-name>
          , editors,
          <source>Euro-Par 2009 Parallel Processing</source>
          , volume
          <volume>5704</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>875</fpage>
          -
          <lpage>886</lpage>
          . Springer Berlin Heidelberg,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sean</surname>
            <given-names>Rul</given-names>
          </string-name>
          , Hans Vandierendonck, and Koen De Bosschere.
          <article-title>A profile-based tool for finding pipeline parallelism in sequential programs</article-title>
          .
          <source>Parallel Computing</source>
          ,
          <volume>36</volume>
          (
          <issue>9</issue>
          ):
          <fpage>531</fpage>
          -
          <lpage>551</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>W.</given-names>
            <surname>Thies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Amarasinghe</surname>
          </string-name>
          .
          <article-title>A practical approach to exploiting coarse-grained pipeline parallelism in C programs</article-title>
          . In
          <source>MICRO 2007: 40th Annual IEEE/ACM International Symposium on Microarchitecture</source>
          , pages
          <fpage>356</fpage>
          -
          <lpage>369</lpage>
          ,
          <year>2007</year>
          .
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>