Proc. of the 16th Workshop “From Object to Agents” (WOA15), June 17-19, Naples, Italy

An AO system for OO-GPU programming

Andrea Fornaia, Christian Napoli, Giuseppe Pappalardo, and Emiliano Tramontana
Department of Mathematics and Informatics, University of Catania, Viale A. Doria 6, 95125 Catania, Italy
{fornaia, napoli, pappalardo, tramontana}@dmi.unict.it

Abstract—Recent technologies, like general purpose computing on GPU, have a major limitation: the difficulties that developers face when implementing parallel code using device-oriented languages. This paper aims to assist developers by automatically producing snippets of code handling GPU-oriented tasks. Our proposed approach is based on Aspect-Oriented Programming and generates modules in CUDA C compliant code, which are encapsulated and connected by means of JNI. By means of a set of predefined functions we separate the application code from device-dependent concerns, including device memory allocation and management. Moreover, bandwidth utilisation and core occupancy are automatically handled in order to minimise the overhead caused by host-to-device communications and the computational imbalance, which often hampers the effective speedup of GPU-parallelised code.

Keywords—Code generation, GPU programming, separation of concerns.

I. INTRODUCTION

For device-oriented parallel code, such as distributed high performance computing (HPC) and hardware-dependent paradigms, developers have the additional task, when building an application, of taking into account the physical structure of the host (or network) [1], [2]. Developers have to consider parallelisation and communication overhead, the required bandwidth, etc. [3]. Therefore, developers strive to achieve in their solutions both flexibility and high modularity. This results in increased development time and costs, sometimes with low-performing code.
Moreover, current development tools do not offer a sufficient abstraction level, and instead provide a low degree of modularity to applications [4]. Furthermore, developers should take into account very complex scenarios in order to parameterise their code, e.g. for different sizes of the solution domain, different numbers of threads and blocks, and different data sizes, and therefore different solutions to obtain maximum core occupancy and transfer bandwidth. As a result it is difficult to separate the different concerns, overcharging the developer (and the code) with the handling of a multitude of responsibilities [5].

This paper proposes a new paradigmatic solution letting developers that use object-oriented (OO) code develop GPU-specific code or low-level device-oriented code by means of a friendly toolbox. This toolbox uses cooperating agents to assist the development of scalable modular code that takes advantage of GPU devices, freeing developers from the need to handle a device-oriented language. It also provides the management of memory allocations and overall communications between host and computing device. Aspect-oriented programming [6], [7] is used as a glue to enhance an application with environment-specific choices, such as the selection of a specific task-driven code at runtime. Therefore, our proposed approach brings a substantial improvement in terms of modularity, performance, reusability of code and separation of concerns [8], [9], [10], [11], [12].

The developed toolbox provides enhanced reusability for parallel computation of previously written code (both object based and device oriented), by using several agents which interpret the behaviour of the OO code and use a dedicated translation utility [13], [14]. Tools such as LIME [15] or AeminiumGPU [16] derive a customised language and a run-time environment; however, they require specific compilers and force developers to use a non-standard programming language, while giving no options for standard code reusability (it is impossible for such paradigms to take advantage of reused sequential OO code and obtain parallel versions).

While OpenCL provides developers with fine-grain control of host and kernel code, the handling of low-level details is a significant overhead for the developer. The proposed toolbox, instead, requires no knowledge of the device-oriented language: the developer writes standard OO code, and takes care of connecting the toolbox by using some annotations.

Recent works, such as [17], have partially automated several processes in the field of code control to avoid conflicts or misleading behaviours, but even in this case it is ultimately the programmer’s responsibility to structure their code in the appropriate manner. An approach has been derived to mechanically determine how a program accesses data [18], and other analysers have focused on extracting the structure of a software system to determine some structural properties [19], [20], [21], [22], [23], [24], [25], [26], [27]. Such analyses are paramount for assessing the possibility to transform a program in such a way as to have parallelism while avoiding data inconsistency. In [28] a Java software system has been presented, based on an approach that derives an entirely new set of syntactical rules for the use of a proprietary meta-compiler. To the best of our knowledge, no significant further step has been made towards a high-level and self-contained toolbox for the easy development of GPU-oriented software within an OO paradigm.

II. GPGPU AND CUDA PROGRAMMING

GPU programmers have to consider the underlying hardware in order to write any GPU-enabled code (from now on simply GPU kernel). Graphics processors provide a large number of simple multithreaded cores, offering the potential for dramatic speedups for a variety of general purpose applications when compared to sequential CPU computation [29], [30], [31], [32], [33].

The launch of the Nvidia™ CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without any knowledge of OpenGL, DirectX or the graphics pipeline. A CUDA-enabled GPU is composed of several MIMD (multiple instruction, multiple data) multiprocessors that contain a set of SIMD (single instruction, multiple data) processors. Each multiprocessor has a shared memory that can be accessed by each of its processors, and also shares a bigger global memory common to all the multiprocessors.

Basically, a CUDA kernel makes use of threading between the SIMD processors, where a single computation is performed [3]. Moreover, the GPU card allows an advanced geometrical enumeration for threads, described by a 3-dimensional structure along the 3 spatial axes (even if the z axis is actually only a logical extension) [5]. Furthermore, it is possible to collect a set of threads in logical 3-dimensional blocks that are executed on the same multiprocessor.

In the CUDA programming model, an application consists of a host program that executes on the CPU and other parallel kernel programs executing on the GPU [34], [35]. A kernel program is executed by a set of parallel threads. The host program can dynamically allocate device global memory on the GPU and copy data to (and from) such a memory from (and to) the memory on the CPU. Moreover, the host program can dynamically set the number of threads that run on a kernel program. Threads are organised in blocks, and each block has its own shared memory, which can be accessed only by the threads on the same block.
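To make the host-kernel split concrete, the following minimal CUDA C sketch (our illustration, not code produced by the proposed toolbox) shows a host program that allocates device global memory, copies data to it, launches a kernel on a grid of blocks of threads, and copies the results back:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Each thread scales one element; the grid of blocks covers the whole array.
    __global__ void scale(float *v, float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= k;   // bounds check: n may not be a multiple of the block size
    }

    int main() {
        const int n = 1 << 20;
        float *h = (float *) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));               // device global memory
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;                                        // threads per block
        int blocks = (n + threads - 1) / threads;                 // blocks needed to cover n
        scale<<<blocks, threads>>>(d, 2.0f, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);                              // prints 2.0
        cudaFree(d);
        free(h);
        return 0;
    }

The bounds check inside the kernel is the usual idiom when the data size is not an exact multiple of the block size.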
It is paramount that interactions between CPU and GPU are minimised, since this avoids communication bottlenecks and delays due to data transfers. Necessary data transfers should maximise the bandwidth usage, i.e. CPU and GPU should perform as few interactions as possible and transfer a large amount of data each time.

A. Bandwidth measurements

Bandwidth is indeed one of the most important factors for performance. Best practice in CUDA C programming recommends that almost all GPU adaptation changes to code should be made in the context of how they affect bandwidth.

Bandwidth can be dramatically affected by the choice of memory in which data is stored, by how the data is laid out and the order in which it is accessed, as well as by other factors due to the computation itself. In order to obtain an accurate estimate of the possible performances it is required to calculate the effective bandwidth, which generally differs strongly from the theoretical bandwidth (the latter is much greater than the former). The theoretical maximum bandwidth B_{TH} is

    B_{TH} = n_M C_M R_M    (1)

where C_M is the maximum memory clock, n_M is the width of the memory interface, and R_M is the memory data rate (1 if single rate, 2 if double rate, etc.). Moreover, to obtain an accurate estimate of the effective bandwidth B_{EFF}, such computations should be performed at execution time by means of the following equation

    B_{EFF} = (B_R + B_W) / t    (2)

where B_R is the number of bytes read per kernel, B_W is the number of bytes written per kernel, and t is the time. On the other hand, B_{EFF} cannot be computed beforehand, but only after observing the runtime execution. The presented solution enables us to perform these operations and to obtain a real-time estimate of the bandwidth occupancy ratio

    BOR = B_{EFF} / B_{TH}    (3)

At runtime it could be useful to compute B_{EFF} as

    B_{EFF} = n_O \left( \prod_{i=1}^{D} l_i \right) \mathrm{sizeof}(TYPE)    (4)

where n_O is the number of operations (e.g. 2 for read and write), D the maximum number of dimensions of the data structure in transfer, l_i the length along the i-th dimension, and sizeof(TYPE) the size in bytes of one unit of data of the specified type. Therefore it follows that

    BOR = \frac{n_O}{n_M C_M R_M} \prod_{i=1}^{D} l_i \, \mathrm{sizeof}(TYPE)    (5)

Since we are interested in host-to-device transfers (and vice versa), n_O = 2; moreover, C_M and R_M depend on the hardware and are immutable during the execution. Therefore, given the fixed constant

    K_{HW} = \frac{n_O}{n_M C_M R_M}    (6)

it follows that

    BOR = K_{HW} \prod_{i=1}^{D} l_i \, \mathrm{sizeof}(TYPE)    (7)

This latter gives the exact bandwidth usage.
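As a worked example of equation (1), a hypothetical card with a 256-bit memory interface (n_M = 32 bytes), a 3 GHz memory clock C_M and double data rate (R_M = 2) would yield B_{TH} = 32 · 3·10^9 · 2 = 192 GB/s. The sketch below (our illustration, not the toolbox’s actual monitoring code) measures B_{EFF} for a host-device round trip with CUDA events, as in equation (2), and derives an estimate of B_{TH} from the device properties, assuming double data rate:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int n = 1 << 24;                     // number of elements transferred
        size_t bytes = n * sizeof(float);
        float *h = (float *) malloc(bytes), *d;
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // one write to the device...
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // ...and one read back: n_O = 2
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms;
        cudaEventElapsedTime(&ms, t0, t1);                 // elapsed time in milliseconds
        double b_eff = 2.0 * bytes / (ms / 1e3);           // eq. (2): (B_R + B_W) / t

        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        double b_th = (p.memoryBusWidth / 8.0)             // n_M, reported in bits
                    * (p.memoryClockRate * 1e3)            // C_M, reported in kHz
                    * 2.0;                                 // R_M = 2 assumed (DDR)

        printf("B_EFF = %.2f GB/s, BOR = %.3f\n", b_eff / 1e9, b_eff / b_th);
        cudaFree(d);
        free(h);
        return 0;
    }

Note that such a round trip crosses the PCIe bus, so the measured ratio stays well below 1 even in the best case.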
B. Memory optimisations

While the bandwidth occupancy ratio gives us an estimate of the performance of the code, in order to improve such performance a major part of the effort should be directed toward memory optimisations. A performant code should indeed maximise the bandwidth occupancy ratio, and such a bandwidth is best served by using as much fast memory and as little slow-access memory as possible (this practice applies both to the device memory and to the host memory).

In order to gain performance, it is important to reduce the number of data transfers between host and device, sometimes also by running directly on the GPU portions of serial code (or even portions of code the CPU could easily outperform).

For the same reason, data structures could be created both on the device and on the host in order to serve as intermediate buffers. Such a buffer could also be useful to avoid small transfers, by organising larger transfers, which perform better even in case of non-contiguous regions of memory (these would be packed into a unique compact buffer and then unpacked at their destination).

A major improvement in memory usage is finally granted by using page-locked memory, also known as pinned memory. By using pinned memory the bandwidth usage should be maximised (here limiting the discussion to the transfers between host and device). In order to use pinned memory the CUDA libraries provide the cudaHostAlloc() and cudaHostRegister() functions: the first allocates a region of memory in pinned modality, while the latter is used to pin the memory on the fly, without allocating a separate buffer.

While the use of pinned memory could improve performance, this practice is likely to be difficult for developers, who risk taking on too many responsibilities. Moreover, the usage of pinned memory does not give a general solution for every code, since pinned memory is a scarce resource and an excessive use could end up in an overall reduction of the system performance. Finally, memory pinning is often a heavyweight operation when compared to normal memory management. This results in a trade-off that should be carefully analysed before taking any action. The proposed solution is intended to spare developers from such concerns, by taking care of this issue with automatic evaluations and countermeasures.
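A minimal sketch of the two routes to pinned memory named above (ours; it assumes a single device and omits error checking):

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        size_t bytes = 1 << 26;                    // 64 MB
        float *pinned, *pageable = (float *) malloc(bytes), *d;
        cudaMalloc((void **)&d, bytes);

        // Route 1: allocate the buffer directly in pinned modality.
        cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

        // Route 2: pin an existing pageable buffer on the fly.
        cudaHostRegister(pageable, bytes, cudaHostRegisterDefault);

        cudaMemcpy(d, pinned, bytes, cudaMemcpyHostToDevice);    // full-bandwidth transfer
        cudaMemcpy(d, pageable, bytes, cudaMemcpyHostToDevice);  // also fast once registered

        cudaHostUnregister(pageable);              // pinned memory is scarce: release it early
        cudaFreeHost(pinned);
        cudaFree(d);
        free(pageable);
        return 0;
    }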
C. Asynchronous transfers

In a standard situation the developer may decide to transfer data between host and device by means of the cudaMemcpy() function, which is a blocking transfer. In other words, such an operation constitutes a barrier and returns the control to the thread only after the entire data transfer is completed.

The CUDA architecture offers a different solution for memory transfer by means of the cudaMemcpyAsync() function, which is a non-blocking variant of the previous one. This function returns control immediately, with the related consequences. Moreover, this function requires the use of pinned memory and, for safety reasons, of the so-called streams. A stream is a sequence of operations that are performed on the device in a certain order. Streams must be properly used with asynchronous transfers, in order to access data only after they have been transferred. On the other hand, different streams can be overlapped. Asynchronous transfers enable us to overlap data transfers with computations, therefore their proper use can tremendously increase performance; however, they can be very tricky for the developer, and again assistance could be required. This kind of assistance is also provided by our developed solution.
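The following sketch (ours, not code generated by the toolbox) splits a pinned buffer into two chunks and processes them in two streams, so that the transfer of one chunk overlaps the computation on the other:

    #include <cuda_runtime.h>

    __global__ void work(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = 2.0f * v[i] + 1.0f;
    }

    int main() {
        const int n = 1 << 22, half = n / 2;
        float *h, *d;
        cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault); // async copies need pinned memory
        cudaMalloc((void **)&d, n * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int c = 0; c < 2; ++c) {              // one chunk per stream
            float *hc = h + c * half, *dc = d + c * half;
            cudaMemcpyAsync(dc, hc, half * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            work<<<(half + 255) / 256, 256, 0, s[c]>>>(dc, half);
            cudaMemcpyAsync(hc, dc, half * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        // Within one stream the copy-kernel-copy order is preserved;
        // across the two streams, copies and kernels overlap.
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);

        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }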
D. Cores occupancy

Another key point in order to maximise GPU performance is core occupancy. While a task should run unconstrained, its workload should be correctly designed so as to take advantage of a number of threads that exactly matches the number of available GPU cores. Best practice recommends keeping the multiprocessors on the device as busy as possible. It follows that a poorly balanced workload will result in suboptimal performance. Hence, it is important to implement a correct application design with an optimal task distribution on threads and blocks. The proposed toolbox aims to spare developers from such an effort.
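As an illustration of what such automatic sizing can look like, here is a sketch of ours (cudaOccupancyMaxPotentialBlockSize is available from CUDA 6.5 onwards and is not necessarily what the toolbox uses internally):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void kernel(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));

        // Ask the runtime for the block size that maximises multiprocessor occupancy.
        int minGrid, block;
        cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, kernel, 0, 0);

        int grid = (n + block - 1) / block;        // enough blocks to cover the domain...
        if (grid < minGrid) grid = minGrid;        // ...and to keep every multiprocessor busy
        kernel<<<grid, block>>>(d, n);

        printf("block = %d, grid = %d\n", block, grid);
        cudaFree(d);
        return 0;
    }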
III. THE PROPOSED TOOLBOX

The proposed toolbox strictly separates all the algorithmic development related to the application domain from the other concerns (i.e. GPU handling), which should not be taken into account by the programmer. All GPU device-related management is performed automatically by the toolbox (e.g. the choice of the best number of threads and blocks, the modifications needed to enable the code for asynchronous execution, etc.).

This toolbox aims at providing simplified and modular support for GPU computing that developers could use without having to learn how to program in CUDA. The purpose of this work is to develop such a toolbox for OO programming, to run specific tasks on the MIMD environment provided by a GPU accelerator without any need to deviate from an OO paradigm and the related OO language (e.g. Java).

Figure 1 shows the proposed solution, which consists of three main agents:

• Proxy agents
• Broker
• Platform agents

[Fig. 1. An overall schema of the developed agent-oriented system: the Proxy agent hosts the GUI, the Interpreter and the OO compiler, and faces the USER; the Broker hosts the Translator and the Linker; the Platform agent hosts the Injector, the code repository maintained by EXPERTs, and the dedicated compiler.]

The proxy agent provides a graphical user interface (GUI), typically intended as a web portal, to upload OO code complying with several syntactic constraints (such as the proper use of annotations). The uploaded code is then interpreted by an interpreter software module, which creates an XML file to instruct the translator on the behaviour that some portions of code have to obtain. Finally, the proxy agent contains the OO-oriented compiler, which has the responsibility to link all the produced software modules with the unchanged portion of the original OO code and then compile it, creating an executable binary. The binary is finally returned to the user by means of the same GUI when ready.

The said translator module is part of the Broker; it receives an XML description of the behaviours from the interpreter. While the latter has the responsibility to detect the code behaviours and accordingly prepare an input for the translator, the interpreter has nothing to do with the translation of the code itself. The translator maintains a reference to the different platform agents, which are designed to match different hardware infrastructures. The translator is then able to understand the interpreted behaviours and to choose the proper architecture-oriented platform agent.

Once this agent has been chosen, the translator instructs it to inject some portions of translated code; on the other hand, the translator itself prepares the unchanged code to be linked with the compiled device-dedicated software modules. The linker module has this latter responsibility: linking the remaining OO code with the generated device-dedicated executables. This responsibility is all but trivial, since any choice regarding the best linking approach is up to the linker (e.g. whether to use external functions, native interfaces, or other approaches that let the generated binary be called from the OO portion of the code).

The effective injection of code and generation of executable software modules is performed within the dedicated Platform agent, whose core module is the injector. As the name suggests, the responsibility of this module is to inject device-oriented code in order to create a separately compiled software module, to be then linked by the broker, as said before. The injector has knowledge of a selection of codes within a related repository. Based on the indications given by the translator, the injector requests the corresponding code from the repository. The latter is maintained by experts of the target architecture to which the repository is related. The proposed approach is similar to a wiki project where experts add new code to be used. Of course the uploaded code must be compliant with all the standards of the device for which it is intended and, moreover, should be accompanied by an adequate descriptor within the constraints of the presented agent-oriented system itself. Finally, the injected code is compiled with a dedicated compiler; a binary file is produced and passed to the linker, which performs its duties, closing the cycle.

In the following section we will give some details regarding the injection procedure and the involved modules.
IV. THE DESIGNED MODULES

A. Compile time

In our approach, fragments of device-related code are automatically linked by using a predefined library of common general-purpose functions. Moreover, the designed system makes it possible to define custom CUDA compliant tasks to be executed on the GPU. At runtime, a component which makes use of aspect-oriented programming provides the optimal management of memory transfers and allocation, by monitoring the effective allocations, initialisations and values of the stored variables, both on the host and on the device. By including the predefined classes, or by using the precompiled executable as a linked binary, it is possible to use a predefined set of functions and also to implement custom CUDA compliant functions and then invoke them within an OO paradigm by means of the Java Native Interface (JNI).

The approach provides the developer with a set of functions that, when invoked, take care of all the needed management, including: the data transfers between CPU and device, the memory allocations or the use of pinned memory, the possible asynchronous execution of different threads, and the optimal sizing and dimensioning with respect to the best performing number of threads and blocks. Such functions are implemented as part of a set of choices made by the toolbox in order to obtain CUDA compliant code which satisfies certain requirements that we call behaviours.

A behaviour is a set of fixed parameters concerning the management of the GPU card and all the related optimisations that do not involve the application logic. While the developer is responsible for the application logic, the implementation of the said behaviour (therefore all the choices and the related implementation in terms of specific kernels, functions, parameters and strategies) is up to the presented support. Finally, the developer can also freely compose predefined behaviours or create new ones.

B. Code repository

A predefined set of behaviours and related implementing functions are included in a code repository, so that the programmer will have no need to directly implement CUDA kernels and calls, nor even to use the C or C++ languages. With our approach, a developer could write his code in an OO programming language such as Java, and then use a few of the classes in our repository to provide some parameters, in order to configure the whole ensemble of application and CUDA code (examples of parameters are the number of threads and blocks, the maximum bandwidth, etc.), or let our toolbox take all the decisions.

As far as CUDA compliant code is concerned, this is implemented by the toolbox: some function pointers are predefined and enlisted, so that the application developer will have the possibility to choose among a given set of functions, or to manually add to the list a custom function written in CUDA C compliant code. In this way the developer could even write non-OO code (specifically CUDA code) and then use it within a more comprehensive OO application.

Any function defined in the library, hence also any custom function, is a __device__ function: the inputs of such GPU functions consist of a unique pointer to the ensemble of data, operands and outputs; in addition to this pointer, some control data are given. Note that such code will be generated by the toolbox, which takes care of all such details. These functions are then executed on the GPU device and called by a properly generated __global__ kernel. The provided functions work for an arbitrary number of parameters (i.e. operands) and functions.
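A minimal sketch of this dispatch scheme (our illustration of the general technique, not the toolbox’s generated code; the names are ours, the signatures are simplified, and a device-side table of __device__ function pointers requires compute capability 2.0 or later):

    #include <cuda_runtime.h>
    #include <cstdio>

    typedef float (*op_t)(float);

    // Library-style __device__ functions, selectable at kernel launch time.
    __device__ float op_square(float x) { return x * x; }
    __device__ float op_negate(float x) { return -x; }

    // Device-side table of the enlisted operations.
    __device__ op_t ops[2] = { op_square, op_negate };

    // A generated-style __global__ kernel dispatching to the selected function.
    __global__ void apply(float *data, int n, int which) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = ops[which](data[i]);
    }

    int main() {
        const int n = 256;
        float h[256];
        for (int i = 0; i < n; ++i) h[i] = (float) i;
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        apply<<<1, n>>>(d, n, 0);                  // 0 selects op_square
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[3] = %f\n", h[3]);               // prints 9.0
        cudaFree(d);
        return 0;
    }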
C. Code injection choice

This agent-oriented system was created to make use of several classes that properly realise a usable set of data assisting the computation to be performed on a GPU. These classes are adapted and interpreted as C-like structures of standard types, which are then transferred to the GPU device.

The JNI layer provides the needed “glue” to manage calls and data transfers towards the C++ side, which will use such classes as primitive structures. While the memory address of an object is not available within a Java framework, once objects are passed by means of JNI calls to the C++ layer, it becomes possible (within the C++ portion of code) to manipulate and pass data by means of their memory addresses. This makes it possible to ignore the number of dimensions of the arrays, matrices and tensors handled by such data types.
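A sketch of such a JNI bridge, compiled with nvcc (ours; the Java-side declaration public native void myGPUmethod(double[] data) in a class MyGPUclass is hypothetical, echoing the names used later in Figure 5):

    #include <jni.h>
    #include <cuda_runtime.h>

    __global__ void scale(double *v, double k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= k;
    }

    // Native counterpart of the hypothetical Java method MyGPUclass.myGPUmethod.
    extern "C" JNIEXPORT void JNICALL
    Java_MyGPUclass_myGPUmethod(JNIEnv *env, jobject self, jdoubleArray data) {
        jsize n = env->GetArrayLength(data);
        // Obtain a C view of the Java array (may be a copy, depending on the JVM).
        jdouble *h = env->GetDoubleArrayElements(data, NULL);

        double *d;
        cudaMalloc((void **)&d, n * sizeof(double));
        cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);
        cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d);

        // Copy the results back into the Java array and release the C view.
        env->ReleaseDoubleArrayElements(data, h, 0);
    }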
An important strategy has been used to reduce the size of the data transferred; it consists of an a priori selection and rearrangement of the operands and functions, encoded, as said before, in unique arrays. Since the application developer has to provide the starting and ending points of the operands, before any allocation or communication to the device the toolbox rearranges only the necessary part of the data in a communication buffer. Under the stated conditions, the memory allocated and the data transferred to the device are minimised; on the other hand, the total size of such a buffer maximises the bandwidth usage. Another advantage of this selection of data is to minimise the operations on the device due to the indexing of the operands.

Another important feature of the proposed toolbox is the simplified interface for memory transfers between host and device. It is known that, as far as execution time is concerned, generally the most costly part is memory allocation and data transfer from the host to the GPU device and vice versa, which, for the best part, is at the origin of the total overhead. Memory transfer is not only expensive at runtime: it is considered the trickiest and most misleading part of GPU programming, being also expensive in terms of coding time.

This toolbox takes care of memory allocations on the device and offers an advanced management of communications between host and GPU device. For this reason, e.g. when a variable is used twice on the GPU device during the same execution of the program, and it is not reassigned or redeclared in the meantime, the toolbox will avoid repeating the memory transfer, preserving a copy on the device for future use. This feature gives an easy way for the programmer to develop GPU-ready code without any need to take care of these tricky side concerns, focusing only on the algorithm she wants to implement.

D. Translator module

As said above, our toolbox uses several independent agents in order to generate device-oriented software modules that can be connected with any other Java application code by using the Java Native Interface. The software modules are created by means of a common nvcc compiler, without any other precompiler. In fact, the toolbox comes with the needed computational libraries, which can also be precompiled and linked to an existing software system.

In order to enable application developers to produce modular code, the proposed translation system makes use of behaviours to obtain certain features of the code. Such behaviours could be intended for CUDA-compliant code in the same way as design patterns are intended for OO code. In such a context, aspect-oriented programming (AOP) has been proven effective to implement OO design patterns while preserving the independence of classes and the separation of concerns [36]. In the same way, AOP is useful in order to connect an OO application with CUDA native GPU code.

Since the proposed toolbox takes advantage of some specified parameters and several annotations given by the developer in order to identify the correct behaviour (or composition of behaviours), the adopted solution can be classified as an Annotated Aspect oriented solution (AA). In such an AA solution, several aspects are responsible for the interception of the relevant OO methods, whose execution is ultimately substituted with native code, interacting with the remaining portion of the software by means of JNI. In this case, we want to run JNI instances driven by some parameters. Some of them are embedded into the OO code, others have to be evaluated at runtime, e.g. the core occupancy, whether or not to use pinned memory, and the effective bandwidth utilisation as in equation (2). While the presented toolbox performs all the needed computation by means of a meta-layer which reflects on the OO code, in order to correctly interpret the desired behaviour it is up to the developer to annotate his application code. In order to minimise this concern, we have developed an easy way to set up all the needed parameters and to select the portions of code to parallelise.

Some annotations are given by the toolbox itself and are part of a library. Among such default annotations, some are used to let the aspects inject the appropriate code at the right points within the code. Figure 2 shows the annotation @GPUparal, which allows identifying a class that has to be substituted with CUDA compliant code; then some code is executed on the GPU card and connected with the OO software by means of JNI. Such a parallel execution could be organised into several streams by means of the @GPUstream annotation. The latter takes as input an ID in order to univocally identify an execution stream (mandatory in case of asynchronous execution and pinned memory utilisation). Moreover, the @GPUparal annotation allows the developer to define mandatory behaviours for certain classes or methods. Such mandatory behaviours could be defined along with the implementation of application methods and classes, and become proper directives when the implemented methods (or classes) are called (or used).

    public @interface GPUstream {
        int value();
    }

    public @interface GPUparal {
        String fixed();
    }

    public class Behaviour {
        private static Behaviour b = new Behaviour();
        private String status;
        private String tmpStatus;

        private Behaviour() { }

        public static Behaviour getInstance() {
            return b;
        }

        public String eval(String s) { return null; /* body elided in the paper */ }
        public String wise() { return null; /* body elided in the paper */ }
        public String add(String s) { return null; /* body elided in the paper */ }
        public void set(String s) { /* body elided in the paper */ }

        public void init(String a) {
            add(a);
            add(eval(tmpStatus));
            add(eval(wise()));
        }

        public String get() { return null; /* body elided in the paper */ }
    }

Fig. 2. Examples of predefined annotations and classes

Some specific behaviours can be implemented so as to be enabled at a given moment in the code, e.g. before a call to a method; then method set() on an instance of class Behaviour has to be called (see also Figure 3). The said method set() would then override any other behaviour, except for the mandatory behaviours.

Figure 3 shows how an application takes advantage of the proposed toolbox by means of Java annotations. These set important parameters, or communicate some global behaviour, which is then implemented for all the parallelised code. Another behaviour could be introduced by the aspects when appropriate.

    public class Main {
        @GPU_param(threads=64, blocks=8, async=1,
                   pinned=1, buff=1024, streams=1, mixbehav=0,
                   fixed="threads,async,pinned")
        public static void main(String[] args) {
            // ...
        }
    }

    @GPUparal(fixed="default")
    public class MyClass {
        // ...
        @GPUstream(1)
        public void bigMethod1() {
            // ...
            Behaviour.getInstance().set("none");
            smallMethod_S1();
            // ...
        }

        public void bigMethod_S0() { }

        @GPUstream(2)
        public void bigMethod3() {
            // ...
            otherClass.smallMethod_OC();
            // ...
            smallMethod_S2();
            // ...
        }
    }

Fig. 3. Examples of application customisation by using Java annotations
E. Injection module

The developed aspect GPUinjector (see Figure 4) takes into account the behaviour resulting from: (i) the class-related annotation @GPUparal, (ii) the method-related annotation @GPUstream, and (iii) the directives given by means of the predefined annotation @GPU_param. When the aspect intercepts a called method for a given instance, it observes all the behavioural directives (given in the code by means of method set() of class Behaviour). Moreover, it takes into account the overall ensemble of parameters and circumstances that intervene at runtime. This latter reasoning could lead the aspect to find a more profitable or advantageous setup, in order to enrich or modify the given set of behaviours (except in the case of mandatory behaviours).

    public aspect GPUinjector {
        pointcut GPUpar(GPUparal ann, Object obj):
            this(obj) && execution(@GPUparal void *.*(..))
            && @annotation(ann);

        void around(GPUparal ann, Object obj):
                GPUpar(ann, obj) {
            try {
                // ...
                Behaviour.getInstance().init(ann.fixed());
                // ...
            } catch (Exception e) { }
        }
    }

    extern "C" JNIEXPORT void JNICALL
    Java_Injected_CUDA_code(JNIEnv* env, Gstruct gpudata) {
        env-> ...
    }

Fig. 4. Aspect for CUDA code injection

At the beginning of an application execution, as soon as class Behaviour is loaded, it is populated with the proper values; then, using the mandatory instructions, several behaviours are configured. Such mandatory instructions, as well as all the other initialisation directives, are given in the application code by means of the default annotations.

When an instance of an annotated application class is intercepted by GPUinjector, the class-related annotations are taken and then stored in an appropriate hash table handled by the aspect. The same elaboration is performed for method-related annotations, which are stored in another dedicated hash table. Data stored in such hash tables are shared by the several instances of the same class, once they are intercepted, since the same parallelisation is desired for all instances of the same class.

Data gathered from method-related and class-related annotations are then merged with the mandatory behaviours and the other specifications given by the application developer, in order to exclude conflicting behaviours. Then the resulting behaviours are given by the attribute tmpStatus, as extracted by the aspect from the annotations. Such behaviours are evaluated by method eval() in class Behaviour, in order to check the compatibility with the mandatory behaviours before modifying the general behaviour encoded as the string status.

Finally, aspect GPUinjector enables us to integrate on-the-fly new behaviours for some advantageous circumstances. This latter integration is made by means of method wise() in class Behaviour. After the evaluation, the behaviours are added (or modified) with method add() on the same class.

When the whole image of the behaviour is composed, the aspect calls the code generator, which joins preexistent portions of code related to each behaviour composition (or custom-made CUDA compliant code linked by the developer to a certain composition of default or custom behaviours). Figure 5 shows the Java code, annotations and calls that connect with our toolbox.

    import java.lang.annotation.*;
    import Gclass.*;

    @GPUparal(fixed="none")
    public class MyGPUclass {
        public void myGPUmethod(Ptype data) {
        }
    }

    public class Test {
        @GPU_param(threads=48, async=0, mixbehav=1,
                   fixed="threads,split1D")
        public static void main(String[] args) {
            Ptype data = new Ptype();
            Behaviour.getInstance().set("default,split1D");
            new MyGPUclass().myGPUmethod(data);
        }
    }

Fig. 5. Example of Java code before injection

V. CONCLUDING REMARKS

The possibility to have an easy-to-use and modular toolbox for GPGPU programming opens an entire new range of possibilities in the field of fast and performance-oriented computing. By means of our toolbox, the developer need not use any external or proprietary compiler. Consequently, this toolbox offers virtually unlimited reusability, with the possibility to link with CUDA-compliant code. The implementation of advanced features is made by using aspect-oriented code. In this way, it is possible to have a high customisation level for the definition of new behaviours, and performance improvements due to an advanced management of the allocation and freeing of memory on the device. This also provides the means to control the lifecycle of variables stored on the device. All this without compromising the modularity and simplicity of the implementation, which are the main driving forces of this work.

Moreover, the presented toolbox works as an integrated translational utility for the automatic conversion of sequential OO code, as well as for the integration of CUDA-compliant code, providing an advanced interpretation method. In this way, programmers that intend to make use of the advantages offered by this toolbox will be able to reuse written code by translating it into a GPU-enabled implementation, with a robust compatibility between this toolbox and any independent OO code.
REFERENCES

[1] K. R. Jackson, L. Ramakrishnan, K. Muriki, S. Canon, S. Cholia, J. Shalf, H. J. Wasserman, and N. J. Wright, “Performance analysis of high performance computing applications on the Amazon Web Services cloud,” in Proceedings of the International Conference on Cloud Computing Technology and Science (CloudCom), pp. 159–168, IEEE, 2010.
[2] C. Napoli, G. Pappalardo, and E. Tramontana, “Improving files availability for BitTorrent using a diffusion model,” in Proceedings of the 23rd IEEE International WETICE Conference, pp. 191–196, IEEE, 2014.
[3] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, and K. Skadron, “A performance study of general-purpose applications on graphics processors using CUDA,” Journal of Parallel and Distributed Computing, vol. 68, pp. 1370–1380, 2008.
[4] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
[5] A. Rueda and L. Ortega, “Geometric algorithms on CUDA,” Journal of Virtual Reality and Broadcasting, no. 200, 2008.
[6] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. Loingtier, and J. Irwin, “Aspect-oriented programming,” in ECOOP ’97 Object-Oriented Programming, pp. 220–242, 1997.
[7] G. Borowik, M. Woźniak, A. Fornaia, R. Giunta, C. Napoli, G. Pappalardo, and E. Tramontana, “A software architecture assisting workflow executions on cloud resources,” International Journal of Electronics and Telecommunications, vol. 61, no. 1, pp. 17–23, 2015. DOI: 10.1515/eletel-2015-0002.
[8] R. Giunta, G. Pappalardo, and E. Tramontana, “An aspect-generated approach for the integration of applications into grid,” in Proceedings of the Symposium on Applied Computing (SAC), ACM, 2007.
[9] C. Napoli, F. Bonanno, and G. Capizzi, “An hybrid neuro-wavelet approach for long-term prediction of solar wind,” in IAU Symposium, no. 274, pp. 247–249, 2010.
[10] R. Giunta, G. Pappalardo, and E. Tramontana, “Superimposing roles for design patterns into application classes by means of aspects,” in Proceedings of the Symposium on Applied Computing (SAC), ACM, 2012.
[11] M. Woźniak, Z. Marszałek, M. Gabryel, and R. K. Nowicki, “Modified merge sort algorithm for large scale data sets,” in Lecture Notes in Artificial Intelligence (ICAISC 2013), vol. 7895, pp. 612–622, 2013. DOI: 10.1007/978-3-642-38610-7_56.
[12] E. Tramontana, “Automatically characterising components with concerns and reducing tangling,” in Proceedings of the Computer Software and Applications Conference (COMPSAC) Workshop QUORS, IEEE, 2013.
[13] C. Krueger, “Software reuse,” ACM Computing Surveys (CSUR), vol. 24, no. 2, pp. 131–183, 1992.
[14] C. Napoli, G. Pappalardo, and E. Tramontana, “An agent-driven semantical identifier using radial basis neural networks and reinforcement learning,” in XV Workshop “Dagli Oggetti agli Agenti”, vol. 1260, CEUR-WS, 2014.
[15] J. Auerbach, D. F. Bacon, P. Cheng, and R. Rabbah, “Lime: a Java-compatible and synthesizable language for heterogeneous architectures,” in Proceedings of the ACM International Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pp. 89–108, ACM, 2010.
[16] A. Fonseca and B. Cabral, “AeminiumGPU: an intelligent framework for GPU programming,” in Proceedings of Facing the Multicore-Challenge III, Stuttgart, 2012.
[17] M. Boyer, K. Skadron, and W. Weimer, “Automated dynamic analysis of CUDA programs,” in Third Workshop on Software Tools for MultiCore Systems, 2008.
[18] M. Mongiovi, G. Giannone, A. Fornaia, G. Pappalardo, and E. Tramontana, “Combining static and dynamic data flow analysis: a hybrid approach for detecting data leaks in Java applications,” in Proceedings of the Symposium on Applied Computing (SAC), ACM, 2015.
[19] A. Calvagna and E. Tramontana, “Delivering dependable reusable components by expressing and enforcing design decisions,” in Proceedings of the Computer Software and Applications Conference (COMPSAC) Workshop QUORS, pp. 493–498, IEEE, July 2013.
[20] F. Bonanno, G. Capizzi, S. Coco, C. Napoli, A. Laudani, and G. Lo Sciuto, “Optimal thicknesses determination in a multilayer structure to improve the SPP efficiency for photovoltaic devices by an hybrid FEM–cascade neural network based approach,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 355–362, IEEE, 2014.
[21] R. Giunta, G. Pappalardo, and E. Tramontana, “A redundancy-based attack detection technique for Java Card bytecode,” in Proceedings of the International WETICE Conference, pp. 384–389, IEEE, 2014.
[22] C. Napoli, G. Pappalardo, and E. Tramontana, “Using modularity metrics to assist move method refactoring of large systems,” in Proceedings of the International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS), pp. 529–534, IEEE, 2013.
[23] G. Pappalardo and E. Tramontana, “Suggesting extract class refactoring opportunities by measuring strength of method interactions,” in Proceedings of the Asia Pacific Software Engineering Conference (APSEC), IEEE, 2013.
[24] F. Bonanno, G. Capizzi, and C. Napoli, “Some remarks on the application of RNN and PRNN for the charge-discharge simulation of advanced lithium-ions battery energy storage,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 941–945, IEEE, 2012.
[25] E. Tramontana, “Detecting extra relationships for design patterns roles,” in Proceedings of AsianPLoP, March 2014.
[26] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, “A cascade neural network architecture investigating surface plasmon polaritons propagation for thin metals in OpenMP,” in Artificial Intelligence and Soft Computing, pp. 22–33, Springer International Publishing, 2014.
[27] G. Capizzi, F. Bonanno, and C. Napoli, “A new approach for lead-acid batteries modeling by local cosine,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 1074–1079, IEEE, 2010.
[28] M. Ioki, S. Hozumi, and S. Chiba, “Writing a modular GPGPU program in Java,” in Proceedings of the Workshop on Modularity in Systems Software (MISS), New York, NY, USA, pp. 27–32, ACM, 2012.
[29] H. Nguyen, GPU Gems 3. Addison-Wesley, 2008.
[30] C. Napoli, G. Pappalardo, E. Tramontana, Z. Marszałek, D. Połap, and M. Woźniak, “Simplified firefly algorithm for 2D image key-points search,” in Proceedings of the IEEE Symposium on Computational Intelligence for Human-like Intelligence, pp. 118–125, IEEE, 2014.
[31] G. Capizzi, F. Bonanno, and C. Napoli, “Recurrent neural network-based control strategy for battery energy storage in generation systems with intermittent renewable energy sources,” in Proceedings of the IEEE International Conference on Clean Electrical Power (ICCEP), pp. 336–340, IEEE, 2011.
[32] M. Woźniak and D. Połap, “Basic concept of cuckoo search algorithm for 2D images processing with some research results: an idea to apply cuckoo search algorithm in 2D images key-points search,” in Proceedings of the 11th International Conference on Signal Processing and Multimedia Applications (SIGMAP 2014), Vienna, Austria, pp. 157–164, SciTePress, 2014. DOI: 10.5220/0005015801570164.
[33] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, “A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSSs by using WRNN predictors and GPU parallel solutions,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 1077–1084, IEEE, 2014.
[34] Nvidia Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2007.
[35] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappalà, “A cloud-distributed GPU architecture for pattern identification in segmented detectors big-data surveys,” The Computer Journal, p. bxu147, 2014.
[36] R. Giunta, G. Pappalardo, and E. Tramontana, “Aspects and annotations for controlling the roles application classes play for design patterns,” in Proceedings of the Asia Pacific Software Engineering Conference (APSEC), pp. 306–314, IEEE, 2011.