<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model-Driven Transformations for Mapping Parallel Algorithms on Parallel Computing Platforms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ethem Arkin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bedir Tekinerdogan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aselsan MGEO</institution>
          ,
          <addr-line>Ankara</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bilkent University, Dept. of Computer Engineering</institution>
          ,
          <addr-line>Ankara, Turkey</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Keywords: Model Driven Software Development, Parallel Computing, High Performance Computing, Domain Specific Language, Tool Support</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>One of the important problems in parallel computing is the mapping of the parallel algorithm to the parallel computing platform. Hereby, for each parallel node the corresponding code must be implemented. For platforms with a limited number of processing nodes this can be done manually. However, in case the parallel computing platform consists of hundreds of thousands of processing nodes, the manual coding of the parallel algorithms becomes intractable and error-prone. Moreover, a change of the parallel computing platform requires considerable coding effort and time. In this paper we present a model-driven approach for generating the code of selected parallel algorithms to be mapped on parallel computing platforms. We describe the required platform independent metamodel, and the model-to-model and the model-to-text transformation patterns. We illustrate our approach for the parallel matrix multiplication algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The famous Moore's law, which states that the processing power doubles every
eighteen months, is coming to an end due to the physical limitations of a single
processor [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To keep increasing the processing power, the current trend
is towards applying parallel computing on multiple nodes. Unlike serial computing in which
instructions are executed serially, multiple processing elements are used to execute the
program instructions in parallel. An important challenge in parallel computing is the mapping of
the parallel algorithm to the parallel computing platform. The mapping of the algorithm
requires the analysis of the algorithm, writing the code for the algorithm and deploying it on the
nodes of the parallel computing platform. This mapping can be done
manually in case we are dealing with a limited number of processing nodes. However, the current
trend shows the dramatic increase of the number of processing nodes for parallel computing
platforms with now about hundreds of thousands of nodes providing petascale to exascale
level processing power [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As a consequence, mapping the parallel algorithm to such computing
platforms has become intractable for the human parallel computing engineer.
      </p>
      <p>Once the mapping has been realized, in due time the parallel computing platform might
need to evolve or change completely. In that case the overall mapping process must be redone
from the beginning, requiring considerable time and effort.</p>
      <p>In this paper we provide a model-driven approach for both the mapping of parallel
algorithms to parallel computing platform, and the evolution of the parallel computing platform. In
essence our approach is based on the model-driven architecture design paradigm that makes a
distinction between platform independent models and platform specific models or code. We
provide a platform independent metamodel for parallel computing platform and define the
model-to-model transformation patterns for realizing the platform specific parallel computing
platforms. Further we provide the model-to-text transformation patterns for realizing the code
from the platform specific models.</p>
      <p>The remainder of the paper is organized as follows. In section 2, we describe the problem
statement. Section 3 presents the implementation approach for mapping the parallel algorithm
to parallel computing platform by the help of model transformations. Section 4 presents the
related work and finally we conclude the paper in section 5.</p>
      <p>
        2 Problem Statement
To define a feasible mapping, the parallel algorithm needs to be analyzed and a proper
configuration of the given parallel computing platform is required to meet the corresponding quality
requirements for power consumption, efficiency and memory usage. To illustrate the problem
we will use the parallel matrix multiplication algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The pseudo code of the
algorithm is shown in Fig. 1a. The matrix multiplication algorithm recursively decomposes the
matrix into subdivisions and multiplies the smaller matrices to be summed up to find the
resulting matrix. The algorithm is actually composed of three different sections. The first serial
section is the multiplication of subdivision matrix elements (line 3), which is followed by a
recursive multiplication call for each subdivision (lines 5-12). The final part of the algorithm
defines the summation of the multiplication results for each subdivision (lines 13-16).
      </p>
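      <p>The decomposition described above can be sketched in plain Python (an illustrative translation of the pseudo code of Fig. 1a, not the authors' implementation; the base case tests the matrix size instead of an explicit recursion depth):</p>

```python
def split(M):
    """Split a matrix into four equally sized quadrants M00, M01, M10, M11."""
    h = len(M) // 2
    return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
            [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])

def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matrix_multiply(A, B):
    """Recursive block multiplication following the structure of Fig. 1a."""
    if len(A) == 1:                          # serial section: scalar multiply (line 3)
        return [[A[0][0] * B[0][0]]]
    A00, A01, A10, A11 = split(A)
    B00, B01, B10, B11 = split(B)
    # recursive multiplications of the subdivisions (lines 5-12)
    P0 = matrix_multiply(A00, B00); P1 = matrix_multiply(A01, B10)
    P2 = matrix_multiply(A00, B01); P3 = matrix_multiply(A01, B11)
    P4 = matrix_multiply(A10, B00); P5 = matrix_multiply(A11, B10)
    P6 = matrix_multiply(A10, B01); P7 = matrix_multiply(A11, B11)
    # serial summation sections (lines 13-16)
    C00, C01, C10, C11 = add(P0, P1), add(P2, P3), add(P4, P5), add(P6, P7)
    # reassemble the result matrix from the four quadrants
    top = [r0 + r1 for r0, r1 in zip(C00, C01)]
    bottom = [r0 + r1 for r0, r1 in zip(C10, C11)]
    return top + bottom
```

      <p>Each recursive multiplication corresponds to a section that can be allocated to a node; the eight products P0 to P7 are independent of each other and can therefore run in parallel.</p>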
      <p>
        Given a physical parallel computing platform consisting of a set of nodes, we need to
define the mapping of the different sections to the nodes. In this context, the logical
configuration is a view of the physical configuration that defines the logical communication structure
among the physical nodes. Typically, for the same physical configuration we can have many
different logical configurations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. An example of a logical configuration is shown in Fig. 1b.
In this paper we assume that a feasible logical configuration is selected and the mapping of the
code needs to be realized.
      </p>
      <p>1:  procedure MatrixMultiply(A, B, n):
2:    if n = 1 then
3:      C = A * B
4:    endif
5:    P0 = MatrixMultiply(A00, B00, n-1)
6:    P1 = MatrixMultiply(A01, B10, n-1)
7:    P2 = MatrixMultiply(A00, B01, n-1)
8:    P3 = MatrixMultiply(A01, B11, n-1)
9:    P4 = MatrixMultiply(A10, B00, n-1)
10:   P5 = MatrixMultiply(A11, B10, n-1)
11:   P6 = MatrixMultiply(A10, B01, n-1)
12:   P7 = MatrixMultiply(A11, B11, n-1)
13:   C00 = P0 + P1
14:   C01 = P2 + P3
15:   C10 = P4 + P5
16:   C11 = P6 + P7</p>
      <p>Fig. 1. Matrix Multiplication Algorithm (a) to be mapped on (b) logical configuration platform</p>
      <p>In Fig. 2 the code for node 0 is defined, which sends the sub-matrices to the other nodes
(1, 2, 3). Lines 9 to 14 define the code for receiving the matrices in node 1. A similar code is
implemented for the nodes 2 and 3 (not shown in the figure). Line 16 defines a so-called barrier
to let the process wait until all the sub-matrices have been distributed and received by all the
nodes. After the distribution of the sub-matrices to the nodes, each node runs the code as
defined in lines 17-18 and, as such, multiplies the received sub-matrices. Once the multiplication
is finalized, the results are submitted to node 0, which is shown in lines 19-22 for node 1 (the code
for nodes 2 and 3 is not shown). Lines 23 to 25 define the collection of the results in node 0.
Line 27 defines a barrier to complete this process. Finally, in lines 28 to 33 the results are
summed in node 0 to compute the resulting matrix C.</p>
      <p>3 Implementation Approach
Our approach for mapping the parallel algorithm to the parallel computing platform builds on
model-driven development approaches. The overall approach is shown in Fig. 3. In the first
step of the approach the parallel computing algorithm is analyzed to define and characterize
the sections that need to be allocated to the nodes of the parallel computing platform. In the
second step, the plan is defined for allocating the algorithm sections to the corresponding
nodes of the logical computing platform. In the third step the code for each serial section is
manually implemented. The fourth step includes the implementation or reuse of predefined
model transformations to generate the code for parallel sections. The final step includes the
deployment of the code on the physical configuration platform. The details of the steps are
described in the following sub-sections.</p>
      <p>[Fig. 3 diagram steps: 1. Analyze Algorithm; 2. Define the Plan for the Allocation of
the Algorithm Sections; 3. Implement the Serial Code Sections; 4. Implement/Reuse Model
Transformations to Generate Code; 5. Deploy the Code on the Physical Configuration Platform]</p>
      <p>Fig. 3. Approach for Generating/Developing and Deployment of Parallel Algorithm Code</p>
      <p>3.1 Analyze Algorithm
The analysis of the parallel algorithm identifies the separate sections of the algorithm and
characterizes these as serial or parallel sections. Here, a section is defined as a coherent set of
instructions in the algorithm. A serial section defines the part of the algorithm that needs to
run serially on nodes without interacting with other nodes. A parallel section defines the part
of the algorithm that runs on each node and interacts with other nodes. For example, the matrix
multiplication algorithm (Fig. 1a) has four main sections as shown in Table 1.</p>
      <p>Table 1. Analysis of algorithm sections
No. | Algorithm Section | Section Type
1 | Distribute the sub-matrices | PAR
2 | C = A * B | SER
3 | Collect matrix multiply results | PAR
4 | C00 = P0 + P1; C01 = P2 + P3; C10 = P4 + P5; C11 = P6 + P7 | SER
The first section defines the distribution of the sub-matrices to the different nodes. This
section is characterized as a parallel section (PAR). The second section is characterized as serial
(SER) and defines the set of instructions for the multiplication of the sub-matrices. The third
section is a parallel section and defines the collection of the results of the matrix
multiplications. Finally, the fourth section is characterized as serial and defines the summation of the
results to derive the final matrix.</p>
      <p>
        3.2 Define the Plan for the Allocation of the Algorithm Sections
The next step of the implementation approach is to define the plan for mapping the algorithm
sections to logical configurations. Usually many different logical configurations can be
derived for a given parallel algorithm and parallel computing platform. We refer to our earlier
paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in which we define the overall approach for deriving feasible logical configuration
alternatives with respect to speed-up and efficiency metrics. In this paper we assume that a
feasible logical configuration has been selected and elaborate on the generation of the
implementation of the algorithm sections.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Table 2. Plan for allocating sections to nodes</title>
      <p>No. | Algorithm Section | Section Type | Plan
1 | Distribute the sub-matrices | PAR | pattern of four nodes, dominating node (0, 0)
2 | C = A * B | SER | runs on each node
3 | Collect matrix multiply results | PAR | pattern of four nodes, dominating node (0, 0)
4 | C00 = P0 + P1; C01 = P2 + P3; C10 = P4 + P5; C11 = P6 + P7 | SER | runs on each node</p>
      <p>The allocation of the sections to the nodes depends on the type of the sections. The plan for
the matrix multiplication algorithm is shown in the fourth column of Table 2. Here we assume
that each serial section runs on each node (section 2 and 4). The plan for allocating the parallel
sections is defined as a pattern of nodes. The rectangles represent the nodes; the arrows
represent the interactions (distribution or collection) among the nodes. Further, each node is
assigned an id defining the coordinate of the node in the logical configuration. For section 1 the
distribution of the data is represented as a pattern of four nodes in which the dominating node
is the node with coordinate (0, 0). The arrows in the pattern show the distribution of the
submatrices from the dominating node to the other nodes. For section 3 the pattern represents the
collection of the results of the multiplications to provide the final matrix.</p>
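      <p>The two parallel patterns can be illustrated as communication sets over node coordinates (a minimal sketch with tuple coordinates; the function names are assumptions, not the paper's tooling):</p>

```python
def distribution_pattern(dominating, nodes):
    """Section-1 pattern: the dominating node sends a sub-matrix to every other node."""
    return [(dominating, n) for n in nodes if n != dominating]

def collection_pattern(dominating, nodes):
    """Section-3 pattern: every other node sends its result back to the dominating node."""
    return [(n, dominating) for n in nodes if n != dominating]

# 2x2 logical configuration of Table 2, dominated by node (0, 0)
nodes = [(0, 0), (0, 1), (1, 0), (1, 1)]
scatter = distribution_pattern((0, 0), nodes)
gather = collection_pattern((0, 0), nodes)
```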
      <p>
        In the given example we have assumed a logical configuration consisting of four nodes. Of
course for larger configurations defining the allocation plan becomes more difficult. Hereby,
the required plan is not drawn completely but defined as a set of patterns that can be used to
generate the actual logical configuration. For example, scaling the patterns of Table 2 can be
used to generate the logical configuration of Fig. 1b. For more details about the generation of
larger logical configurations from predefined patterns we refer to our earlier paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
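      <p>Generating a larger logical configuration from the four-node pattern can be sketched as tiling the pattern over a node grid (a simplification of the generation approach of [2]; the function and the grouping it returns are assumptions):</p>

```python
def scale_pattern(rows, cols):
    """Tile the 2x2 pattern over a rows x cols grid of nodes; each tile
    instance is a group dominated by its top-left node."""
    groups = []
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            members = [(r + dr, c + dc) for dr in (0, 1) for dc in (0, 1)]
            groups.append((members[0], members))   # dominating node first
    return groups

groups = scale_pattern(4, 4)   # four 2x2 tiles covering a 16-node configuration
```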
      <sec id="sec-2-1">
        <title>3.3 Implement the Serial Code Sections</title>
        <p>Once the plan for allocating the algorithm sections to the logical configuration is defined we
can start the implementation of the algorithm sections. Hereby, the code for the serial sections
is implemented manually.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Table 3. Implementation of the serial sections</title>
      <p>No. | Algorithm Section | Implementation
1 | Distribute the sub-matrices | Will be generated
2 | C = A * B | C0 = A.0 * B.0; C1 = A.1 * B.1
3 | Collect matrix multiply results | Will be generated
4 | C00 = P0 + P1; C01 = P2 + P3; C10 = P4 + P5; C11 = P6 + P7 | C00 = P.0 + P.1; C01 = P.2 + P.3; C10 = P.4 + P.5; C11 = P.6 + P.7
The code for the parallel sections is generated using the model-transformation patterns as
defined in the next sub-section. The third column of Table 3 shows the implementation of the
serial sections of the matrix multiplication algorithm. Note that the implementation is in
alignment with the complete implementation of the algorithm as shown in Fig. 2.</p>
      <sec id="sec-3-1">
        <title>3.4 Model Transformations</title>
        <p>After analyzing the algorithm, implementing the code for serial algorithm sections and
defining the plan for mapping these sections to the logical configuration, the code for the parallel
sections will be generated. To support platform independence this code generation process is
realized in two steps using model-to-model transformation and model-to-text transformation.
These transformation steps are described below.</p>
      </sec>
      <sec id="sec-3-2">
        <title>M odel-to-M odel T r ansfor mation.</title>
        <p>
          For different parallel computing platforms, there are several parallel programming
languages such as MPI, OpenMP, MPL, and CILK [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. According to the characteristics of the
parallel computing platforms, different programming languages can be selected. Later on, in case of
changing requirements, a different platform might need to be selected. To cope with the
platform independence and the platform evolution problem we apply the concepts as defined in
the Model-Driven Architecture (MDA) paradigm [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Accordingly, we make a distinction
between platform independent models (PIM), platform specific models (PSM) and the source
code. The generic model-to-model transformation process is shown in Fig.4.
        </p>
        <p>Fig. 4. Model-to-model transformation.</p>
        <p>Here the transformation process takes as input a platform independent model called the
parallel algorithm mapping model. This model defines the mapping of the algorithm sections to the
logical configuration. The model conforms to the parallel algorithm mapping metamodel
which we will explain later in the section. The output of the transformation process is a
platform specific model, called parallel computing platform specific model. Similarly this model
conforms to its own metamodel, which typically represents the model of the language of the
platform (e.g. MPI metamodel). The platform specific model will be later used to generate the
code using model-to-text transformation patterns.</p>
        <p>[Fig. 5 listing: XText grammar rules for Algorithm, Section, SerialSection,
ParallelSection, LogicalConfiguration, Tile, Core, Pattern, and Communication]</p>
        <p>Fig. 5. Concrete Syntax of the Parallel Algorithm Mapping Metamodel (PAMM)</p>
        <p>The grammar for the parallel algorithm mapping metamodel is defined in XText in the
Eclipse IDE and shown in Fig.5. Here, Algorithm consists of Sections, which can be either a
ParallelSection or SerialSection. Each section can itself have other sections. In the grammar
the serial sections are related to code implementations in the code block. The parallel sections
include the data about the mapping plan that is determined with the logical configuration.
LogicalConfiguration consists of Tile entities, where a Tile is either a single Core (processing
unit) or a Pattern with tiles and communications between these tiles. Together, the logical
configuration assets, i.e. the cores and patterns, compose the plan for mapping the algorithm to the
logical configuration.</p>
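        <p>The structure that the grammar defines can be mirrored roughly with Python dataclasses (the class layout below is our own sketch for illustration, not the XText definition):</p>

```python
from dataclasses import dataclass, field

@dataclass
class Core:                      # a single processing unit with grid indices
    i: int
    j: int

@dataclass
class Pattern:                   # tiles plus the communications among them
    name: str
    tiles: list = field(default_factory=list)
    comms: list = field(default_factory=list)

@dataclass
class SerialSection:             # serial sections carry their code implementation
    name: str
    code: str = ""

@dataclass
class ParallelSection:           # parallel sections carry a mapping pattern
    name: str
    pattern: Pattern = None
    sections: list = field(default_factory=list)   # sections may nest

# the MultiplyBlock of Fig. 6: a parallel scatter (B2S) wrapping a serial multiply
b2s = Pattern("B2S", tiles=[Core(0, 0), Core(0, 1), Core(1, 0), Core(1, 1)])
multiply = ParallelSection("MultiplyBlock", pattern=b2s,
                           sections=[SerialSection("Multiply", code="C = A * B")])
```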
        <p>Fig. 6 shows, for example, the parallel algorithm mapping model for the matrix
multiplication algorithm. In the figure two sections, MultiplyBlock and SumBlock, are defined. In
the MultiplyBlock section the matrices are divided into sub-matrices and scattered by using the
B2S pattern. The B2S pattern is a predefined pattern in the toolset indicating the pattern for
section 1 as defined in the fourth column of Table 2. This multiply block also contains a
Multiply serial section which contains the serial implementation of the multiply operation. In the
SumBlock section, the resulting matrices are gathered by the pattern B2G which is predefined
for section 3 as shown in the fourth column of Table 2. The SumBlock serial section contains
the serial code for summation of the resulting sub-matrices.</p>
        <p>Fig. 6. Parallel Algorithm Mapping Model for the Matrix Multiplication Algorithm</p>
        <p>Once the platform independent parallel algorithm mapping model is defined we can
transform it to the required platform specific model. We assume, for example, that the aim is to
generate an MPI model. Fig. 7 shows the grammar of the MPI metamodel that is again defined
using XText. In the metamodel each MPI model consists of a group of entities, which include
MPISection, Process, Node, and Communication. Each section consists of processes and
communication among these processes. Each Process allocates to a Node. Each
communication defines the destination and target process.</p>
        <p>[Fig. 7 listing: XText grammar rules for MpiModel, MpiGroup, MpiSection, Process, Node, and Communication]</p>
        <p>Fig. 7. Grammar of the MPI Metamodel</p>
        <p>
          The model-driven transformation rules refer to elements of both the PAMM and the
parallel computing platform specific metamodel, in this case the MPI Metamodel. The M2M
transformation rules are implemented using the ATL [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] transformation language. The
transformation rules are shown in Fig.8. As shown in the figure we have implemented four different
rules which define the transformations of mapping patterns to MPI sections, cores to processes
and communications to MPI communications.
        </p>
        <p>The rule Algorithm2MpiModel is defined as the main rule of the transformation. The rule
Pattern2Section transforms the algorithm pattern sections to MpiSection within the MpiGroup.
The rule Core2Process transforms the cores as defined in the patterns to the processes in
MpiSection. Each process is transformed from the core with the data of rank calculated from
the index values of the core. Similarly, Comm2Comm transforms the communications that are
defined in the patterns, to the communications in MPISection.</p>
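        <p>The effect of the Core2Process and Comm2Comm rules can be sketched in Python (an illustration of the described rules, not the ATL source; the row-major rank formula is an assumption consistent with deriving ranks from the index values of the core):</p>

```python
def core_to_rank(i, j, xsize):
    """Core2Process: derive an MPI rank from the core's grid indices (row-major)."""
    return i * xsize + j

def pattern_to_section(pattern, xsize):
    """Pattern2Section: map every core to a process, and every communication
    (Comm2Comm) to a pair of MPI ranks."""
    processes = [core_to_rank(c[0], c[1], xsize) for c in pattern["cores"]]
    comms = [(core_to_rank(*f, xsize), core_to_rank(*t, xsize))
             for f, t in pattern["comms"]]
    return {"processes": processes, "comms": comms}

# scatter pattern of Table 2: node (0, 0) distributes to the three other nodes
b2s = {"cores": [(0, 0), (0, 1), (1, 0), (1, 1)],
       "comms": [((0, 0), (0, 1)), ((0, 0), (1, 0)), ((0, 0), (1, 1))]}
section = pattern_to_section(b2s, xsize=2)
```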
        <p>The MPI model, which is the result of the model-to-model transformation, is shown in Fig. 9.
The MPI model includes the MpiSection with processes that will run on each node,
communications from a destination process to target process and the serial code section
implementation. This MPI model is now ready for model-to-text transformation to generate the final MPI
source code.</p>
        <p>Fig. 9. Part of the MPI model generated by model-to-model transformation</p>
        <p>Model-to-Text Transformation.
The generated PSM includes the mapping of the processes specific to the parallel computing
platform. Subsequently, this PSM is used to generate the source code. The model-to-text
transformation pattern for this is shown in Fig. 10.</p>
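        <p>What the model-to-text step emits per section can be sketched with plain string templates (a simplified stand-in for the XPand template of Fig. 11; the helper names and the abbreviated argument lists are assumptions):</p>

```python
# guarded send/receive templates; the elided MPI arguments are abbreviated in comments
ISEND = 'if (rank == {f}) {{ MPI_Isend({data} /* size, type, dest {t}, ... */); }}'
IRECV = 'if (rank == {t}) {{ MPI_Irecv({data} /* size, type, src {f}, ... */); }}'

def emit_section(comms, serial_code):
    """Emit guarded send/receive code per communication, then the serial
    code of the section, then a barrier that synchronizes the section."""
    lines = []
    for c in comms:
        lines.append(ISEND.format(f=c["frm"], t=c["to"], data=c["data"]))
        lines.append(IRECV.format(f=c["frm"], t=c["to"], data=c["data"]))
    lines.append(serial_code)
    lines.append("MPI_Barrier(MPI_COMM_WORLD);")
    return "\n".join(lines)

code = emit_section([{"frm": 0, "to": 1, "data": "A1"}], "C1 = A1 * B1;")
```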
        <p>[Fig. 10 diagram: the MPI model, conforming to the MPI metamodel, is transformed by an
M2T transformation into MPI source code]</p>
        <p>
          Fig. 10. Example model transformation chain of MPI model
The model-to-text transformation template is defined in the XPand [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] transformation language. To map the sections to the parallel computing platform,
for each section the communication operations for the data are generated for the target and
destination process ranks (lines 6 to 11). Subsequently, the serial code implementation is imported to
the source code in line 13. For each section, a barrier code is implemented to synchronize the
section processes (line 14). The resulting code of the transformation is the code as defined in Fig. 2.
1:  «DEFINE main FOR MpiModel»
2:  «MPI initialization code»
3:  «FOREACH sections AS section»
4:  «FOREACH section.processes AS process»
5:  «FOREACH process.communications AS comm»
6:  if (rank == «comm.from.rank») {
7:    MPI_Isend(«comm.fromData.name», «comm.fromData.size», MPI_«comm.fromData.type»,
8:          «comm.to.rank», «comm.from.rank», MPI_COMM_WORLD, &amp;request); }
9:  if (rank == «comm.to.rank») {
10:   MPI_Irecv(«comm.toData.name», «comm.fromData.size», MPI_«comm.toData.type»,
11:         «comm.from.rank», MPI_ANY_TAG, MPI_COMM_WORLD, &amp;request); }
12: «ENDFOREACH»
13: «section.code»
14: MPI_Barrier(MPI_COMM_WORLD);
15: «ENDFOREACH»«ENDFOREACH»
16: «Final code»
        </p>
        <p>Fig. 11. Transformation template from MPI metamodel to MPI source code</p>
        <p>
          3.5 Deploy Code on Physical Configuration
The resulting code of the previous steps needs to be deployed on the physical configuration.
The deployment can be done manually or using tool support in case of large configurations. In
the literature various tools can be found which concern the automatic deployment of the code
to the nodes of a parallel computing platform. We refer to, for example, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ][
          <xref ref-type="bibr" rid="ref15">15</xref>
          ][
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for further
details.
        </p>
        <sec id="sec-3-2-1">
          <title>4 Related Work</title>
          <p>
            Several papers have been published in the domain of model-transformations for parallel
computing. Palyart et al. [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] propose an approach for using model-driven engineering in high
performance computing. They focus on automated support for the design of a high
performance computing application based on the distinction of different domain expertise such as
physical configuration, numerical computing, and application architecture.
          </p>
          <p>
            Bigot and Perez [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] adopt HLCM, a hierarchical and generic component model with
connectors, originally designed for high performance applications. The authors report on their
experience with metamodeling and model transformation to implement HLCM. Gamatié et al.
[
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] introduced the GASPARD design framework that uses model transformations for
massively parallel embedded systems. They refined MARTE models based on the Model
Driven Engineering paradigm. They provide tool support to automatically generate code from
high-level specifications. Taillard et al. [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] implemented a graphical framework for
integrating new metamodels into the GASPARD framework. They used the MDE paradigm to generate
OpenMP, Fortran, or C code.
          </p>
          <p>Similar to our approach, the above studies generate source code for high performance
computing. The main difference of our approach is its focus on the mapping of algorithm sections to
parallel computing platforms.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>5 Conclusion</title>
          <p>In this paper we have described the model transformations needed to implement the mapping
of a parallel algorithm to a parallel computing platform. In alignment with the MDA paradigm
the approach is based on separating the platform independent parallel computing model from
the platform specific parallel computing model and the source code. The model
transformations not only help the parallel programming engineer to generate code but also
provide support for easier portability in case of platform evolution. We have illustrated the
approach for the MPI platform but the approach is generic. In our future work we will elaborate
on the application of model-driven approaches to parallel computing platform and focus on
optimizing the values for metrics which are important for mapping parallel algorithms to
parallel computing platforms.
References</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. ATL:
          <article-title>ATL Transformation Language</article-title>
          . http://www.eclipse.org/atl/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Arkin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tekinerdogan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Imre</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Model-Driven Approach for Supporting the Mapping of Parallel Algorithms to Parallel Computing Platforms</article-title>
          .
          <source>Proc. of the ACM/IEEE 16th International Conference on Model Driven Engineering Languages and Systems</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bigot</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>On Model-Driven Engineering to implement a Component Assembly Compiler for High Performance Computing</article-title>
          .
          <source>Journées sur l'Ingénierie Dirigée par les Modèles (IDM)</source>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cumberland</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herban</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irvine</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shuey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Luisier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Rapid parallel systems deployment: techniques for overnight clustering</article-title>
          .
          <source>In Proceedings of the 22nd conference on Large installation system administration conference (LISA'08)</source>
          .
          <source>USENIX Association</source>
          , Berkeley, CA, USA,
          <fpage>49</fpage>
          -
          <lpage>57</lpage>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering</article-title>
          .
          <publisher-name>Addison-Wesley Longman Publishing Co., Inc.</publisher-name>
          , Boston, MA, USA. (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          <article-title>The physical limits of computing</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>26</lpage>
          , May-June. (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gamatié</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Beux</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piel</surname>
            ,
            <given-names>É.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ben Atitallah</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etien</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marquet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dekeyser</surname>
            ,
            <given-names>J.-L.</given-names>
          </string-name>
          <article-title>A Model-Driven Design Framework for Massively Parallel Embedded Systems</article-title>
          .
          <source>ACM Trans. Embed. Comput. Syst</source>
          .
          <volume>10</volume>
          ,
          <issue>4</issue>
          , Article 39. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hoffmann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neubauer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Deployment and configuration of distributed systems</article-title>
          .
          <source>In Proceedings of the 4th international SDL and MSC conference on System Analysis and Modeling (SAM'04)</source>
          , Daniel Amyot and Alan W. Williams (Eds.). Springer-Verlag, Berlin, Heidelberg,
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kogge</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borkar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campbell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dally</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denneau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franzon</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harrod</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiller</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karp</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keckler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lucas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richards</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scarpelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snavely</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sterling</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yelick</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          .
          <article-title>Exascale Computing Study: Technology Challenges in Achieving Exascale Systems</article-title>
          . DARPA. (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Scalable parallel matrix multiplication on distributed memory parallel computers</article-title>
          .
          <source>In Proc. of the 14th International Parallel and Distributed Processing Symposium (IPDPS 2000)</source>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>314</lpage>
          . (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          <article-title>Cramming More Components Onto Integrated Circuits</article-title>
          .
          <source>Proceedings of the IEEE</source>
          , vol.
          <volume>86</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>85</lpage>
          . (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          MPI: A Message-Passing Interface Standard, version 1.1. http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Object Management Group (OMG).
          <source>Model Driven Architecture (MDA)</source>
          , ormsc/2001-07-01.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Palyart</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lugato</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ober</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bruel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>MDE4HPC: an approach for using model-driven engineering in high-performance computing</article-title>
          .
          <source>In Proceedings of the 15th international conference on Integrating System and Software Modeling (SDL'11)</source>
          , Iulian Ober and Ileana Ober (Eds.).
          <source>Springer-Verlag</source>
          , Berlin, Heidelberg,
          <fpage>247</fpage>
          -
          <lpage>261</lpage>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Stawinska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurzyniec</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stawinski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sunderam</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <article-title>Automated Deployment Support for Parallel Distributed Computing</article-title>
          .
          <source>In Proc. of the 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP '07)</source>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Taillard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guyomarc'h</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dekeyser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>A Graphical Framework for High Performance Computing Using An MDE Approach</article-title>
          .
          <source>In Proc. of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP '08)</source>
          . IEEE Computer Society, Washington, DC, USA,
          <fpage>165</fpage>
          -
          <lpage>173</lpage>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Talia</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Models and Trends in Parallel Programming</article-title>
          .
          <source>Parallel Algorithms and Applications</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>2</issue>
          :
          <fpage>145</fpage>
          -
          <lpage>180</lpage>
          . (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          Xpand, openArchitectureWare. http://wiki.eclipse.org/Xpand.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kakulapati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kale</surname>
            ,
            <given-names>L.V.</given-names>
          </string-name>
          <article-title>BigSim: a parallel simulator for performance prediction of extremely large parallel machines</article-title>
          .
          <source>In Proc. of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004)</source>
          , p.
          <fpage>78</fpage>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>