Proc. of the 16th Workshop “From Object to Agents” (WOA15), June 17-19, Naples, Italy

An AO system for OO-GPU programming

Andrea Fornaia, Christian Napoli, Giuseppe Pappalardo, and Emiliano Tramontana
Department of Mathematics and Informatics, University of Catania, Viale A. Doria 6, 95125 Catania, Italy
{fornaia, napoli, pappalardo, tramontana}@dmi.unict.it

Abstract—Recent technologies, like general purpose computing on GPU, have a major limitation: the difficulties that developers face when implementing parallel code using device-oriented languages. This paper aims to assist developers by automatically producing snippets of code handling GPU-oriented tasks. Our proposed approach is based on Aspect-Oriented Programming and generates modules in CUDA C compliant code, which are encapsulated and connected by means of JNI. By means of a set of predefined functions we separate the application code from device-dependent concerns, including device memory allocation and management. Moreover, bandwidth utilisation and core occupancy are automatically handled in order to minimise the overhead caused by host-to-device communications and the computational imbalance, which often hampers the effective speedup of GPU-parallelised code.

Keywords—Code generation, GPU programming, separation of concerns.

I. INTRODUCTION

For device-oriented parallel code, such as distributed high performance computing (HPC) and hardware-dependent paradigms, developers have the additional task, when building an application, of taking into account the physical structure of the host (or network) [1], [2]. Developers have to consider parallelisation and communication overhead, the required bandwidth, etc. [3]. Therefore, developers strive to achieve in their solutions both flexibility and high modularity. This results in increased development time and costs, sometimes with low-performing code.
Moreover, current development tools do not offer a sufficient abstraction level, and instead provide a low degree of modularity to applications [4]. Furthermore, developers should take into account very complex scenarios in order to parameterise their code, e.g. for different sizes of the solution domain, different numbers of threads and blocks, and different data sizes, and therefore different solutions to obtain maximum core occupancy and transfer bandwidth. As a result it is difficult to separate the different concerns, overcharging the developer (and the code) with the handling of a multitude of responsibilities [5].

This paper proposes a new paradigmatic solution letting developers that use object-oriented (OO) code develop GPU-specific code or low-level device-oriented code by means of a friendly toolbox. This toolbox uses cooperating agents to assist the development of scalable modular code that takes advantage of GPU devices, freeing developers from the need to handle a device-oriented language. It also provides the management of memory allocations and overall communications between host and computing device. Aspect-oriented programming [6], [7] is used as a glue to enhance an application with environment-specific choices, such as the selection of a specific task-driven code at runtime. Therefore, our proposed approach brings a substantial improvement in terms of modularity, performance, reusability of code and separation of concerns [8], [9], [10], [11], [12].

The developed toolbox provides enhanced reusability for parallel computation of previously written code (both object based and device oriented), by using several agents which interpret the behaviour of the OO code and use a dedicated translation utility [13], [14]. Tools such as LIME [15] or AeminiumGPU [16] derive a customised language and a run-time environment; however, they require specific compilers and force developers to use a non-standard programming language, while giving no options for standard code reusability (it is impossible for such paradigms to take advantage of reused sequential OO code and obtain parallel versions).

While OpenCL provides developers with fine-grain control of host and kernel code, the handling of low-level details is a significant overhead for the developer. The proposed toolbox, instead, requires no knowledge of the device-oriented language: the developer writes standard OO code, and takes care of connecting the toolbox by using some annotations.

Recent works, such as [17], have partially automated several processes in the field of code control to avoid conflicts or misleading behaviours, but even in this case it is ultimately the programmer’s responsibility to structure their code in the appropriate manner. An approach has been derived to mechanically determine how a program accesses data [18], and other analysers have focused on extracting the structure of a software system to determine some structural properties [19], [20], [21], [22], [23], [24], [25], [26], [27]. Such analyses are paramount for assessing the possibility to transform a program in such a way as to have parallelism while avoiding data inconsistency. In [28] a Java software system has been presented, based on an approach that derives an entirely new set of syntactical rules for the use of a proprietary meta-compiler. To the best of our knowledge, no significant further step has been made towards a high-level and self-contained toolbox for the easy development of GPU-oriented software within an OO paradigm.

II. GPGPU AND CUDA PROGRAMMING

GPU programmers have to consider the underlying hardware in order to write any GPU-enabled code (from now on simply GPU kernel). Graphics processors provide a large number of simple multithreaded cores, offering the potential for dramatic speedups for a variety of general purpose applications when compared to sequential CPU computation [29], [30], [31], [32], [33].

The launch of the Nvidia™ CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without any knowledge of OpenGL, DirectX or the graphics pipeline. A CUDA-enabled GPU is composed of several MIMD (multiple instruction, multiple data) multiprocessors that contain a set of SIMD (single instruction, multiple data) processors. Each multiprocessor has a shared memory that can be accessed by each of its processors, and also shares a bigger global memory common to all the multiprocessors.

Basically, a CUDA kernel makes use of threading between the SIMD processors, where a single computation is performed [3]. Moreover, the GPU card allows an advanced geometrical enumeration for threads, described by a 3-dimensional structure along the 3 spatial axes (even if the z axis is actually only a logical extension) [5]. Furthermore, it is possible to collect a set of threads in logical 3-dimensional blocks that are executed on the same multiprocessor.

In the CUDA programming model, an application consists of a host program that executes on the CPU and other parallel kernel programs executing on the GPU [34], [35]. A kernel program is executed by a set of parallel threads. The host program can dynamically allocate device global memory on the GPU and copy data to (and from) such a memory from (and to) the memory on the CPU. Moreover, the host program can dynamically set the number of threads that run on a kernel program. Threads are organised in blocks, and each block has its own shared memory, which can be accessed only by the threads on the same block.
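To make the host-kernel split concrete, the following minimal CUDA C sketch (our illustration, not code produced by the proposed toolbox) shows a host program that allocates device global memory, copies data to it, launches a kernel on a grid of blocks of threads, and copies the results back:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Each thread scales one element; the grid of blocks covers the whole array.
    __global__ void scale(float *v, float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= k;   // bounds check: n may not be a multiple of the block size
    }

    int main() {
        const int n = 1 << 20;
        float *h = (float *) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));               // device global memory
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;                                        // threads per block
        int blocks = (n + threads - 1) / threads;                 // blocks needed to cover n
        scale<<<blocks, threads>>>(d, 2.0f, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);                              // prints 2.0
        cudaFree(d);
        free(h);
        return 0;
    }

The bounds check inside the kernel is the usual idiom when the data size is not an exact multiple of the block size.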
It is paramount that interactions between CPU and GPU are minimised, since this avoids communication bottlenecks and delays due to data transfers. Necessary data transfers should maximise the bandwidth usage, i.e. CPU and GPU should perform as few interactions as possible and transfer a large amount of data each time.

A. Bandwidth measurements

Bandwidth is indeed one of the most important factors for performance. Best practice in CUDA C programming recommends that almost all GPU adaptation changes to code should be made in the context of how they affect bandwidth.

Bandwidth can be dramatically affected by the choice of memory in which data is stored, by how the data is laid out and the order in which it is accessed, as well as by other factors due to the computation itself. In order to obtain an accurate estimate of the possible performances it is required to calculate the effective bandwidth, which generally differs strongly from the theoretical bandwidth (the latter is much greater than the former). The theoretical maximum bandwidth B_{TH} is

    B_{TH} = n_M C_M R_M    (1)

where C_M is the maximum memory clock, n_M is the width of the memory interface, and R_M is the memory data rate (1 if single rate, 2 if double rate, etc.). Moreover, to obtain an accurate estimate of the effective bandwidth B_{EFF}, such computations should be performed at execution time by means of the following equation

    B_{EFF} = (B_R + B_W) / t    (2)

where B_R is the number of bytes read per kernel, B_W is the number of bytes written per kernel, and t is the time. On the other hand, B_{EFF} cannot be computed beforehand, but only after observing the runtime execution. The presented solution enables us to perform these operations and to obtain a real-time estimate of the bandwidth occupancy ratio

    BOR = B_{EFF} / B_{TH}    (3)

At runtime it could be useful to compute B_{EFF} as

    B_{EFF} = n_O \left( \prod_{i=1}^{D} l_i \right) \mathrm{sizeof}(TYPE)    (4)

where n_O is the number of operations (e.g. 2 for read and write), D the maximum number of dimensions of the data structure in transfer, l_i the length along the i-th dimension, and sizeof(TYPE) the size in bytes of one unit of data of the specified type. Therefore it follows that

    BOR = \frac{n_O}{n_M C_M R_M} \prod_{i=1}^{D} l_i \, \mathrm{sizeof}(TYPE)    (5)

Since we are interested in host-to-device transfers (and vice versa), n_O = 2; moreover, C_M and R_M depend on the hardware and are immutable during the execution. Therefore, given the fixed constant

    K_{HW} = \frac{n_O}{n_M C_M R_M}    (6)

it follows that

    BOR = K_{HW} \prod_{i=1}^{D} l_i \, \mathrm{sizeof}(TYPE)    (7)

This latter gives the exact bandwidth usage.
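As a worked example of equation (1), a hypothetical card with a 256-bit memory interface (n_M = 32 bytes), a 3 GHz memory clock C_M and double data rate (R_M = 2) would yield B_{TH} = 32 · 3·10^9 · 2 = 192 GB/s. The sketch below (our illustration, not the toolbox’s actual monitoring code) measures B_{EFF} for a host-device round trip with CUDA events, as in equation (2), and derives an estimate of B_{TH} from the device properties, assuming double data rate:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const int n = 1 << 24;                     // number of elements transferred
        size_t bytes = n * sizeof(float);
        float *h = (float *) malloc(bytes), *d;
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // one write to the device...
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // ...and one read back: n_O = 2
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms;
        cudaEventElapsedTime(&ms, t0, t1);                 // elapsed time in milliseconds
        double b_eff = 2.0 * bytes / (ms / 1e3);           // eq. (2): (B_R + B_W) / t

        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        double b_th = (p.memoryBusWidth / 8.0)             // n_M, reported in bits
                    * (p.memoryClockRate * 1e3)            // C_M, reported in kHz
                    * 2.0;                                 // R_M = 2 assumed (DDR)

        printf("B_EFF = %.2f GB/s, BOR = %.3f\n", b_eff / 1e9, b_eff / b_th);
        cudaFree(d);
        free(h);
        return 0;
    }

Note that such a round trip crosses the PCIe bus, so the measured ratio stays well below 1 even in the best case.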
B. Memory optimisations

While the bandwidth occupancy ratio gives us an estimate of the performance of the code, in order to improve such performance a major part of the effort should be directed toward memory optimisations. A performant code should indeed maximise the bandwidth occupancy ratio, and such a bandwidth is best served by using as much fast memory and as little slow-access memory as possible (this practice applies both to the device memory and to the host memory).

In order to gain performance, it is important to reduce the number of data transfers between host and device, sometimes also by running directly on the GPU portions of serial code (or even portions of code the CPU could easily outperform).

For the same reason, data structures could be created both on the device and on the host in order to serve as intermediate buffers. Such a buffer could also be useful to avoid small transfers, by organising larger transfers, which perform better even in case of non-contiguous regions of memory (these would be packed into a unique compact buffer and then unpacked at their destination).

A major improvement in memory usage is finally granted by using page-locked memory, also known as pinned memory. By using pinned memory the bandwidth usage should be maximised (here limiting the discussion to the transfers between host and device). In order to use pinned memory the CUDA libraries provide the cudaHostAlloc() and cudaHostRegister() functions: the first allocates a region of memory in pinned modality, while the latter is used to pin the memory on the fly, without allocating a separate buffer.

While the use of pinned memory could improve performance, this practice is likely to be difficult for developers, who risk taking on too many responsibilities. Moreover, the usage of pinned memory does not give a general solution for every code, since pinned memory is a scarce resource and an excessive use could end up in an overall reduction of the system performance. Finally, memory pinning is often a heavyweight operation when compared to normal memory management. This results in a trade-off that should be carefully analysed before taking any action. The proposed solution is intended to spare developers from such concerns, by taking care of this issue with automatic evaluations and countermeasures.
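A minimal sketch of the two routes to pinned memory named above (ours; it assumes a single device and omits error checking):

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        size_t bytes = 1 << 26;                    // 64 MB
        float *pinned, *pageable = (float *) malloc(bytes), *d;
        cudaMalloc((void **)&d, bytes);

        // Route 1: allocate the buffer directly in pinned modality.
        cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);

        // Route 2: pin an existing pageable buffer on the fly.
        cudaHostRegister(pageable, bytes, cudaHostRegisterDefault);

        cudaMemcpy(d, pinned, bytes, cudaMemcpyHostToDevice);    // full-bandwidth transfer
        cudaMemcpy(d, pageable, bytes, cudaMemcpyHostToDevice);  // also fast once registered

        cudaHostUnregister(pageable);              // pinned memory is scarce: release it early
        cudaFreeHost(pinned);
        cudaFree(d);
        free(pageable);
        return 0;
    }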
C. Asynchronous transfers

In a standard situation the developer may decide to transfer data between host and device by means of the cudaMemcpy() function, which is a blocking transfer. In other words, such an operation constitutes a barrier and returns the control to the thread only after the entire data transfer is completed.

The CUDA architecture offers a different solution for memory transfer by means of the cudaMemcpyAsync() function, which is a non-blocking variant of the previous one. This function returns control immediately, with the related consequences. Moreover, this function requires the use of pinned memory and, for safety reasons, of the so-called streams. A stream is a sequence of operations that are performed on the device in a certain order. Streams must be properly used with asynchronous transfers, in order to access data only after they have been transferred. On the other hand, different streams can be overlapped. Asynchronous transfers enable us to overlap data transfers with computations, therefore their proper use can tremendously increase performance; however, they can be very tricky for the developer, and again assistance could be required. This kind of assistance is also provided by our developed solution.
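The following sketch (ours, not code generated by the toolbox) splits a pinned buffer into two chunks and processes them in two streams, so that the transfer of one chunk overlaps the computation on the other:

    #include <cuda_runtime.h>

    __global__ void work(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = 2.0f * v[i] + 1.0f;
    }

    int main() {
        const int n = 1 << 22, half = n / 2;
        float *h, *d;
        cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault); // async copies need pinned memory
        cudaMalloc((void **)&d, n * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int c = 0; c < 2; ++c) {              // one chunk per stream
            float *hc = h + c * half, *dc = d + c * half;
            cudaMemcpyAsync(dc, hc, half * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            work<<<(half + 255) / 256, 256, 0, s[c]>>>(dc, half);
            cudaMemcpyAsync(hc, dc, half * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        // Within one stream the copy-kernel-copy order is preserved;
        // across the two streams, copies and kernels overlap.
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);

        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }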
D. Cores occupancy

Another key point in order to maximise GPU performance is core occupancy. While a task should run unconstrained, its workload should be correctly designed so as to take advantage of a number of threads that exactly matches the number of available GPU cores. Best practice recommends keeping the multiprocessors on the device as busy as possible. It follows that a poorly balanced workload will result in suboptimal performance. Hence, it is important to implement a correct application design with an optimal task distribution on threads and blocks. The proposed toolbox aims to spare developers from such an effort.
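As an illustration of what such automatic sizing can look like, here is a sketch of ours (cudaOccupancyMaxPotentialBlockSize is available from CUDA 6.5 onwards and is not necessarily what the toolbox uses internally):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void kernel(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));

        // Ask the runtime for the block size that maximises multiprocessor occupancy.
        int minGrid, block;
        cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, kernel, 0, 0);

        int grid = (n + block - 1) / block;        // enough blocks to cover the domain...
        if (grid < minGrid) grid = minGrid;        // ...and to keep every multiprocessor busy
        kernel<<<grid, block>>>(d, n);

        printf("block = %d, grid = %d\n", block, grid);
        cudaFree(d);
        return 0;
    }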
III. THE PROPOSED TOOLBOX

The proposed toolbox strictly separates all the algorithmic development related to the application domain from the other concerns (i.e. GPU handling), which should not be taken into account by the programmer. All GPU device-related management is performed automatically by the toolbox (e.g. the choice of the best number of threads and blocks, the modifications needed to enable the code for asynchronous execution, etc.).

This toolbox aims at providing simplified and modular support for GPU computing that developers could use without having to learn how to program in CUDA. The purpose of this work is to develop such a toolbox for OO programming, to run specific tasks on the MIMD environment provided by a GPU accelerator without any need to deviate from an OO paradigm and the related OO language (e.g. Java).

Figure 1 shows the proposed solution, which consists of three main agents:

• Proxy agents
• Broker
• Platform agents

[Fig. 1. An overall schema of the developed agent-oriented system: the Proxy agent hosts the GUI, the Interpreter and the OO compiler, and faces the USER; the Broker hosts the Translator and the Linker; the Platform agent hosts the Injector, the code repository maintained by EXPERTs, and the dedicated compiler.]

The proxy agent provides a graphical user interface (GUI), typically intended as a web portal, to upload OO code complying with several syntactic constraints (such as the proper use of annotations). The uploaded code is then interpreted by an interpreter software module, which creates an XML file to instruct the translator on the behaviour that some portions of code have to obtain. Finally, the proxy agent contains the OO-oriented compiler, which has the responsibility to link all the produced software modules with the unchanged portion of the original OO code and then compile it, creating an executable binary. The binary is finally returned to the user by means of the same GUI when ready.

The said translator module is part of the Broker; it receives an XML description of the behaviours from the interpreter. While the latter has the responsibility to detect the code behaviours and accordingly prepare an input for the translator, the interpreter has nothing to do with the translation of the code itself. The translator maintains a reference to the different platform agents, which are designed to match different hardware infrastructures. The translator is then able to understand the interpreted behaviours and to choose the proper architecture-oriented platform agent.

Once this agent has been chosen, the translator instructs it to inject some portions of translated code; on the other hand, the translator itself prepares the unchanged code to be linked with the compiled device-dedicated software modules. The linker module has this latter responsibility: linking the remaining OO code with the generated device-dedicated executables. This responsibility is all but trivial, since any choice regarding the best linking approach is up to the linker (e.g. whether to use external functions, native interfaces, or other approaches that let the generated binary be called from the OO portion of the code).

The effective injection of code and generation of executable software modules is performed within the dedicated Platform agent, whose core module is the injector. As the name suggests, the responsibility of this module is to inject device-oriented code in order to create a separately compiled software module, to be then linked by the broker, as said before. The injector has knowledge of a selection of codes within a related repository. Based on the indications given by the translator, the injector requests the corresponding code from the repository. The latter is maintained by experts of the target architecture to which the repository is related. The proposed approach is similar to a wiki project where experts add new code to be used. Of course the uploaded code must be compliant with all the standards of the device for which it is intended and, moreover, should be accompanied by an adequate descriptor within the constraints of the presented agent-oriented system itself. Finally, the injected code is compiled with a dedicated compiler; a binary file is produced and passed to the linker, which performs its duties, closing the cycle.

In the following section we will give some details regarding the injection procedure and the involved modules.
IV. THE DESIGNED MODULES

A. Compile time

In our approach, fragments of device-related code are automatically linked by using a predefined library of common general-purpose functions. Moreover, the designed system makes it possible to define custom CUDA compliant tasks to be executed on the GPU. At runtime, a component which makes use of aspect-oriented programming provides the optimal management of memory transfers and allocation, by monitoring the effective allocations, initialisations and values of the stored variables, both on the host and on the device. By including the predefined classes, or by using the precompiled executable as a linked binary, it is possible to use a predefined set of functions and also to implement custom CUDA compliant functions and then invoke them within an OO paradigm by means of the Java Native Interface (JNI).

The approach provides the developer with a set of functions that, when invoked, take care of all the needed management, including: the data transfers between CPU and device, the memory allocations or the use of pinned memory, the possible asynchronous execution of different threads, and the optimal sizing and dimensioning with respect to the best performing number of threads and blocks. Such functions are implemented as part of a set of choices made by the toolbox in order to obtain CUDA compliant code which satisfies certain requirements that we call behaviours.

A behaviour is a set of fixed parameters concerning the management of the GPU card and all the related optimisations that do not involve the application logic. While the developer is responsible for the application logic, the implementation of the said behaviour (therefore all the choices and the related implementation in terms of specific kernels, functions, parameters and strategies) is up to the presented support. Finally, the developer can also freely compose predefined behaviours or create new ones.

B. Code repository

A predefined set of behaviours and related implementing functions are included in a code repository, so that the programmer will have no need to directly implement CUDA kernels and calls, nor even to use the C or C++ languages. With our approach, a developer could write his code in an OO programming language such as Java, and then use a few of the classes in our repository to provide some parameters, in order to configure the whole ensemble of application and CUDA code (examples of parameters are the number of threads and blocks, the maximum bandwidth, etc.), or let our toolbox take all the decisions.

As far as CUDA compliant code is concerned, this is implemented by the toolbox: some function pointers are predefined and enlisted, so that the application developer will have the possibility to choose among a given set of functions, or to manually add to the list a custom function written in CUDA C compliant code. In this way the developer could even write non-OO code (specifically CUDA code) and then use it within a more comprehensive OO application.

Any function defined in the library, hence also any custom function, is a __device__ function: the inputs of such GPU functions consist of a unique pointer to the ensemble of data, operands and outputs; in addition to this pointer, some control data are given. Note that such code will be generated by the toolbox, which takes care of all such details. These functions are then executed on the GPU device and called by a properly generated __global__ kernel. The provided functions work for an arbitrary number of parameters (i.e. operands) and functions.
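A minimal sketch of this dispatch scheme (our illustration of the general technique, not the toolbox’s generated code; the names are ours, the signatures are simplified, and a device-side table of __device__ function pointers requires compute capability 2.0 or later):

    #include <cuda_runtime.h>
    #include <cstdio>

    typedef float (*op_t)(float);

    // Library-style __device__ functions, selectable at kernel launch time.
    __device__ float op_square(float x) { return x * x; }
    __device__ float op_negate(float x) { return -x; }

    // Device-side table of the enlisted operations.
    __device__ op_t ops[2] = { op_square, op_negate };

    // A generated-style __global__ kernel dispatching to the selected function.
    __global__ void apply(float *data, int n, int which) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = ops[which](data[i]);
    }

    int main() {
        const int n = 256;
        float h[256];
        for (int i = 0; i < n; ++i) h[i] = (float) i;
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        apply<<<1, n>>>(d, n, 0);                  // 0 selects op_square
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[3] = %f\n", h[3]);               // prints 9.0
        cudaFree(d);
        return 0;
    }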
C. Code injection choice

This agent-oriented system was created to make use of several classes that properly realise a usable set of data assisting the computation to be performed on a GPU. These classes are adapted and interpreted as C-like structures of standard types, which are then transferred to the GPU device.

The JNI layer provides the needed “glue” to manage calls and data transfers towards the C++ side, which will use such classes as primitive structures. While the memory address of an object is not available within a Java framework, once objects are passed by means of JNI calls to the C++ layer, it becomes possible (within the C++ portion of code) to manipulate and pass data by means of their memory addresses. This makes it possible to ignore the number of dimensions of the arrays, matrices and tensors handled by such data types.
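A sketch of such a JNI bridge, compiled with nvcc (ours; the Java-side declaration public native void myGPUmethod(double[] data) in a class MyGPUclass is hypothetical, echoing the names used later in Figure 5):

    #include <jni.h>
    #include <cuda_runtime.h>

    __global__ void scale(double *v, double k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= k;
    }

    // Native counterpart of the hypothetical Java method MyGPUclass.myGPUmethod.
    extern "C" JNIEXPORT void JNICALL
    Java_MyGPUclass_myGPUmethod(JNIEnv *env, jobject self, jdoubleArray data) {
        jsize n = env->GetArrayLength(data);
        // Obtain a C view of the Java array (may be a copy, depending on the JVM).
        jdouble *h = env->GetDoubleArrayElements(data, NULL);

        double *d;
        cudaMalloc((void **)&d, n * sizeof(double));
        cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);
        cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d);

        // Copy the results back into the Java array and release the C view.
        env->ReleaseDoubleArrayElements(data, h, 0);
    }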
An important strategy has been used to reduce the size of the data transferred; it consists of an a priori selection and rearrangement of the operands and functions, encoded, as said before, in unique arrays. Since the application developer has to provide the starting and ending points of the operands, before any allocation or communication to the device the toolbox rearranges only the necessary part of the data in a communication buffer. Under the stated conditions, the memory allocated and the data transferred to the device are minimised; on the other hand, the total size of such a buffer maximises the bandwidth usage. Another advantage of this selection of data is to minimise the operations on the device due to the indexing of the operands.

Another important feature of the proposed toolbox is the simplified interface for memory transfers between host and device. It is known that, as far as execution time is concerned, generally the most costly part is memory allocation and data transfer from the host to the GPU device and vice versa, which, for the best part, is at the origin of the total overhead. Memory transfer is not only expensive at runtime: it is considered the trickiest and most misleading part of GPU programming, being also expensive in terms of coding time.

This toolbox takes care of memory allocations on the device and offers an advanced management of communications between host and GPU device. For this reason, e.g. when a variable is used twice on the GPU device during the same execution of the program, and it is not reassigned or redeclared in the meantime, the toolbox will avoid repeating the memory transfer, preserving a copy on the device for future use. This feature gives an easy way for the programmer to develop GPU-ready code without any need to take care of these tricky side concerns, focusing only on the algorithm she wants to implement.

D. Translator module

As said above, our toolbox uses several independent agents in order to generate device-oriented software modules that can be connected with any other Java application code by using the Java Native Interface. The software modules are created by means of a common nvcc compiler, without any other precompiler. In fact, the toolbox comes with the needed computational libraries, which can also be precompiled and linked to an existing software system.

In order to enable application developers to produce modular code, the proposed translation system makes use of behaviours to obtain certain features of the code. Such behaviours could be intended for CUDA-compliant code in the same way as design patterns are intended for OO code. In such a context, aspect-oriented programming (AOP) has been proven effective to implement OO design patterns while preserving the independence of classes and the separation of concerns [36]. In the same way, AOP is useful in order to connect an OO application with CUDA native GPU code.

Since the proposed toolbox takes advantage of some specified parameters and several annotations given by the developer in order to identify the correct behaviour (or composition of behaviours), the adopted solution can be classified as an Annotated Aspect oriented solution (AA). In such an AA solution, several aspects are responsible for the interception of the relevant OO methods, whose execution is ultimately substituted with native code, interacting with the remaining portion of the software by means of JNI. In this case, we want to run JNI instances driven by some parameters. Some of them are embedded into the OO code, others have to be evaluated at runtime, e.g. the core occupancy, whether or not to use pinned memory, and the effective bandwidth utilisation as in equation (2). While the presented toolbox performs all the needed computation by means of a meta-layer which reflects on the OO code, in order to correctly interpret the desired behaviour it is up to the developer to annotate his application code. In order to minimise this concern, we have developed an easy way to set up all the needed parameters and to select the portions of code to parallelise.

Some annotations are given by the toolbox itself and are part of a library. Among such default annotations, some are used to let the aspects inject the appropriate code at the right points within the code. Figure 2 shows the annotation @GPUparal, which allows identifying a class that has to be substituted with CUDA compliant code; then some code is executed on the GPU card and connected with the OO software by means of JNI. Such a parallel execution could be organised into several streams by means of the @GPUstream annotation. The latter takes as input an ID in order to univocally identify an execution stream (mandatory in case of asynchronous execution and pinned memory utilisation). Moreover, the @GPUparal annotation allows the developer to define mandatory behaviours for certain classes or methods. Such mandatory behaviours could be defined along with the implementation of application methods and classes, and become proper directives when the implemented methods (or classes) are called (or used).

    public @interface GPUstream {
        int value();
    }

    public @interface GPUparal {
        String fixed();
    }

    public class Behaviour {
        private static Behaviour b = new Behaviour();
        private String status;
        private String tmpStatus;

        private Behaviour() { }

        public static Behaviour getInstance() {
            return b;
        }

        public String eval(String s) { return null; /* body elided in the paper */ }
        public String wise() { return null; /* body elided in the paper */ }
        public String add(String s) { return null; /* body elided in the paper */ }
        public void set(String s) { /* body elided in the paper */ }

        public void init(String a) {
            add(a);
            add(eval(tmpStatus));
            add(eval(wise()));
        }

        public String get() { return null; /* body elided in the paper */ }
    }

Fig. 2. Examples of predefined annotations and classes

Some specific behaviours can be implemented so as to be enabled at a given moment in the code, e.g. before a call to a method; then method set() on an instance of class Behaviour has to be called (see also Figure 3). The said method set() would then override any other behaviour, except for the mandatory behaviours.

Figure 3 shows how an application takes advantage of the proposed toolbox by means of Java annotations. These set important parameters, or communicate some global behaviour, which is then implemented for all the parallelised code. Another behaviour could be introduced by the aspects when appropriate.

    public class Main {
        @GPU_param(threads=64, blocks=8, async=1,
                   pinned=1, buff=1024, streams=1, mixbehav=0,
                   fixed="threads,async,pinned")
        public static void main(String[] args) {
            // ...
        }
    }

    @GPUparal(fixed="default")
    public class MyClass {
        // ...
        @GPUstream(1)
        public void bigMethod1() {
            // ...
            Behaviour.getInstance().set("none");
            smallMethod_S1();
            // ...
        }

        public void bigMethod_S0() { }

        @GPUstream(2)
        public void bigMethod3() {
            // ...
            otherClass.smallMethod_OC();
            // ...
            smallMethod_S2();
            // ...
        }
    }

Fig. 3. Examples of application customisation by using Java annotations
E. Injection module

The developed aspect GPUinjector (see Figure 4) takes into account the behaviour resulting from: (i) the class-related annotation @GPUparal, (ii) the method-related annotation @GPUstream, and (iii) the directives given by means of the predefined annotation @GPU_param. When the aspect intercepts a called method for a given instance, it observes all the behavioural directives (given in the code by means of method set() of class Behaviour). Moreover, it takes into account the overall ensemble of parameters and circumstances that intervene at runtime. This latter reasoning could lead the aspect to find a more profitable or advantageous setup, in order to enrich or modify the given set of behaviours (except in the case of mandatory behaviours).

    public aspect GPUinjector {
        pointcut GPUpar(GPUparal ann, Object obj):
            this(obj) && execution(@GPUparal void *.*(..))
            && @annotation(ann);

        void around(GPUparal ann, Object obj):
                GPUpar(ann, obj) {
            try {
                // ...
                Behaviour.getInstance().init(ann.fixed());
                // ...
            } catch (Exception e) { }
        }
    }

    extern "C" JNIEXPORT void JNICALL
    Java_Injected_CUDA_code(JNIEnv* env, Gstruct gpudata) {
        env-> ...
    }

Fig. 4. Aspect for CUDA code injection

At the beginning of an application execution, as soon as class Behaviour is loaded, it is populated with the proper values; then, using the mandatory instructions, several behaviours are configured. Such mandatory instructions, as well as all the other initialisation directives, are given in the application code by means of the default annotations.

When an instance of an annotated application class is intercepted by GPUinjector, the class-related annotations are taken and then stored in an appropriate hash table handled by the aspect. The same elaboration is performed for method-related annotations, which are stored in another dedicated hash table. Data stored in such hash tables are shared by the several instances of the same class, once they are intercepted, since the same parallelisation is desired for all instances of the same class.

Data gathered from method-related and class-related annotations are then merged with the mandatory behaviours and the other specifications given by the application developer, in order to exclude conflicting behaviours. Then the resulting behaviours are given by the attribute tmpStatus, as extracted by the aspect from the annotations. Such behaviours are evaluated by method eval() in class Behaviour, in order to check the compatibility with the mandatory behaviours before modifying the general behaviour encoded as the string status.

Finally, aspect GPUinjector enables us to integrate on-the-fly new behaviours for some advantageous circumstances. This latter integration is made by means of method wise() in class Behaviour. After the evaluation, the behaviours are added (or modified) with method add() on the same class.

When the whole image of the behaviour is composed, the aspect calls the code generator, which joins preexistent portions of code related to each behaviour composition (or custom-made CUDA compliant code linked by the developer to a certain composition of default or custom behaviours). Figure 5 shows the Java code, annotations and calls that connect with our toolbox.

    import java.lang.annotation.*;
    import Gclass.*;

    @GPUparal(fixed="none")
    public class MyGPUclass {
        public void myGPUmethod(Ptype data) {
        }
    }

    public class Test {
        @GPU_param(threads=48, async=0, mixbehav=1,
                   fixed="threads,split1D")
        public static void main(String[] args) {
            Ptype data = new Ptype();
            Behaviour.getInstance().set("default,split1D");
            new MyGPUclass().myGPUmethod(data);
        }
    }

Fig. 5. Example of Java code before injection

V. CONCLUDING REMARKS

The possibility to have an easy-to-use and modular toolbox for GPGPU programming opens an entire new range of possibilities in the field of fast and performance-oriented computing. By means of our toolbox, the developer need not use any external or proprietary compiler. Consequently, this toolbox offers virtually unlimited reusability, with the possibility to link with CUDA-compliant code. The implementation of advanced features is made by using aspect-oriented code. In this way, it is possible to have a high customisation level for the definition of new behaviours, and performance improvements due to an advanced management of the allocation and freeing of memory on the device. This also provides the means to control the lifecycle of variables stored on the device. All this without compromising the modularity and simplicity of the implementation, which are the main driving forces of this work.

Moreover, the presented toolbox works as an integrated translational utility for the automatic conversion of sequential OO code, as well as for the integration of CUDA-compliant code, providing an advanced interpretation method. In this way, programmers that intend to make use of the advantages offered by this toolbox will be able to reuse written code by translating it into a GPU-enabled implementation, with a robust compatibility between this toolbox and any independent OO code.
REFERENCES

[1] K. R. Jackson, L. Ramakrishnan, K. Muriki, S. Canon, S. Cholia, J. Shalf, H. J. Wasserman, and N. J. Wright, “Performance analysis of high performance computing applications on the Amazon Web Services cloud,” in Proceedings of the International Conference on Cloud Computing Technology and Science (CloudCom), pp. 159–168, IEEE, 2010.
[2] C. Napoli, G. Pappalardo, and E. Tramontana, “Improving files availability for BitTorrent using a diffusion model,” in Proceedings of the 23rd IEEE International WETICE Conference, pp. 191–196, IEEE, 2014.
[3] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, and K. Skadron, “A performance study of general-purpose applications on graphics processors using CUDA,” Journal of Parallel and Distributed Computing, vol. 68, pp. 1370–1380, 2008.
[4] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
[5] A. Rueda and L. Ortega, “Geometric algorithms on CUDA,” Journal of Virtual Reality and Broadcasting, no. 200, 2008.
[6] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. Loingtier, and J. Irwin, “Aspect-oriented programming,” in ECOOP ’97 Object-Oriented Programming, pp. 220–242, 1997.
[7] G. Borowik, M. Woźniak, A. Fornaia, R. Giunta, C. Napoli, G. Pappalardo, and E. Tramontana, “A software architecture assisting workflow executions on cloud resources,” International Journal of Electronics and Telecommunications, vol. 61, no. 1, pp. 17–23, 2015. DOI: 10.1515/eletel-2015-0002.
[8] R. Giunta, G. Pappalardo, and E. Tramontana, “An aspect-generated approach for the integration of applications into grid,” in Proceedings of the Symposium on Applied Computing (SAC), ACM, 2007.
[9] C. Napoli, F. Bonanno, and G. Capizzi, “An hybrid neuro-wavelet approach for long-term prediction of solar wind,” in IAU Symposium, no. 274, pp. 247–249, 2010.
[10] R. Giunta, G. Pappalardo, and E. Tramontana, “Superimposing roles for design patterns into application classes by means of aspects,” in Proceedings of the Symposium on Applied Computing (SAC), ACM, 2012.
[11] M. Woźniak, Z. Marszałek, M. Gabryel, and R. K. Nowicki, “Modified merge sort algorithm for large scale data sets,” in Lecture Notes in Artificial Intelligence (ICAISC 2013), vol. 7895, pp. 612–622, 2013. DOI: 10.1007/978-3-642-38610-7_56.
[12] E. Tramontana, “Automatically characterising components with concerns and reducing tangling,” in Proceedings of the Computer Software and Applications Conference (COMPSAC) Workshop QUORS, IEEE, 2013.
[13] C. Krueger, “Software reuse,” ACM Computing Surveys (CSUR), vol. 24, no. 2, pp. 131–183, 1992.
[14] C. Napoli, G. Pappalardo, and E. Tramontana, “An agent-driven semantical identifier using radial basis neural networks and reinforcement learning,” in XV Workshop “Dagli Oggetti agli Agenti”, vol. 1260, CEUR-WS, 2014.
[15] J. Auerbach, D. F. Bacon, P. Cheng, and R. Rabbah, “Lime: a Java-compatible and synthesizable language for heterogeneous architectures,” in Proceedings of the ACM International Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pp. 89–108, ACM, 2010.
[16] A. Fonseca and B. Cabral, “AeminiumGPU: an intelligent framework for GPU programming,” in Proceedings of Facing the Multicore-Challenge III, Stuttgart, 2012.
[17] M. Boyer, K. Skadron, and W. Weimer, “Automated dynamic analysis of CUDA programs,” in Third Workshop on Software Tools for MultiCore Systems, 2008.
[18] M. Mongiovi, G. Giannone, A. Fornaia, G. Pappalardo, and E. Tramontana, “Combining static and dynamic data flow analysis: a hybrid approach for detecting data leaks in Java applications,” in Proceedings of the Symposium on Applied Computing (SAC), ACM, 2015.
[19] A. Calvagna and E. Tramontana, “Delivering dependable reusable components by expressing and enforcing design decisions,” in Proceedings of the Computer Software and Applications Conference (COMPSAC) Workshop QUORS, pp. 493–498, IEEE, July 2013.
[20] F. Bonanno, G. Capizzi, S. Coco, C. Napoli, A. Laudani, and G. Lo Sciuto, “Optimal thicknesses determination in a multilayer structure to improve the SPP efficiency for photovoltaic devices by an hybrid FEM–cascade neural network based approach,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 355–362, IEEE, 2014.
[21] R. Giunta, G. Pappalardo, and E. Tramontana, “A redundancy-based attack detection technique for Java Card bytecode,” in Proceedings of the International WETICE Conference, pp. 384–389, IEEE, 2014.
[22] C. Napoli, G. Pappalardo, and E. Tramontana, “Using modularity metrics to assist move method refactoring of large systems,” in Proceedings of the International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS), pp. 529–534, IEEE, 2013.
[23] G. Pappalardo and E. Tramontana, “Suggesting extract class refactoring opportunities by measuring strength of method interactions,” in Proceedings of the Asia Pacific Software Engineering Conference (APSEC), IEEE, 2013.
[24] F. Bonanno, G. Capizzi, and C. Napoli, “Some remarks on the application of RNN and PRNN for the charge-discharge simulation of advanced lithium-ions battery energy storage,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 941–945, IEEE, 2012.
[25] E. Tramontana, “Detecting extra relationships for design patterns roles,” in Proceedings of AsianPLoP, March 2014.
[26] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, “A cascade neural network architecture investigating surface plasmon polaritons propagation for thin metals in OpenMP,” in Artificial Intelligence and Soft Computing, pp. 22–33, Springer International Publishing, 2014.
[27] G. Capizzi, F. Bonanno, and C. Napoli, “A new approach for lead-acid batteries modeling by local cosine,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 1074–1079, IEEE, 2010.
[28] M. Ioki, S. Hozumi, and S. Chiba, “Writing a modular GPGPU program in Java,” in Proceedings of the Workshop on Modularity in Systems Software (MISS), New York, NY, USA, pp. 27–32, ACM, 2012.
[29] H. Nguyen, GPU Gems 3. Addison-Wesley, 2008.
[30] C. Napoli, G. Pappalardo, E. Tramontana, Z. Marszałek, D. Połap, and M. Woźniak, “Simplified firefly algorithm for 2D image key-points search,” in Proceedings of the IEEE Symposium on Computational Intelligence for Human-like Intelligence, pp. 118–125, IEEE, 2014.
[31] G. Capizzi, F. Bonanno, and C. Napoli, “Recurrent neural network-based control strategy for battery energy storage in generation systems with intermittent renewable energy sources,” in Proceedings of the IEEE International Conference on Clean Electrical Power (ICCEP), pp. 336–340, IEEE, 2011.
[32] M. Woźniak and D. Połap, “Basic concept of cuckoo search algorithm for 2D images processing with some research results: an idea to apply cuckoo search algorithm in 2D images key-points search,” in Proceedings of the 11th International Conference on Signal Processing and Multimedia Applications (SIGMAP 2014), Vienna, Austria, pp. 157–164, SciTePress, 2014. DOI: 10.5220/0005015801570164.
[33] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, “A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSSs by using WRNN predictors and GPU parallel solutions,” in Proceedings of the International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), pp. 1077–1084, IEEE, 2014.
[34] Nvidia Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2007.
[35] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappalà, “A cloud-distributed GPU architecture for pattern identification in segmented detectors big-data surveys,” The Computer Journal, p. bxu147, 2014.
[36] R. Giunta, G. Pappalardo, and E. Tramontana, “Aspects and annotations for controlling the roles application classes play for design patterns,” in Proceedings of the Asia Pacific Software Engineering Conference (APSEC), pp. 306–314, IEEE, 2011.