<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An AO system for OO-GPU programming</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Fornaia</surname><given-names>Andrea</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Napoli</surname><given-names>Christian</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Pappalardo</surname><given-names>Giuseppe</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Tramontana</surname><given-names>Emiliano</given-names></name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Informatics, University of Catania</institution>,
          <addr-line>Viale A. Doria 6, 95125 Catania</addr-line>,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Recent technologies, like general purpose computing on GPU (GPGPU), have a major limitation consisting in the difficulties that developers face when implementing parallel code using device-oriented languages. This paper aims to assist developers by automatically producing snippets of code handling GPU-oriented tasks. Our proposed approach is based on Aspect-Oriented Programming and generates modules in CUDA C compliant code, which are encapsulated and connected by means of JNI. By means of a set of predefined functions we separate the application code from device-dependent concerns, including device memory allocation and management. Moreover, bandwidth utilisation and core occupancy are automatically handled in order to minimise the overhead caused by host-to-device communications and the computational imbalance, which often hampers the effective speedup of GPU-parallelised code.</p>
      </abstract>
      <kwd-group>
        <kwd>Code generation</kwd>
        <kwd>GPU programming</kwd>
        <kwd>separation of concerns</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        For device-oriented parallel code, such as distributed
high performance computing (HPC) and hardware-dependent
paradigms, developers have an additional task when building
an application: taking into account the physical
structure of the host (or network) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Developers have to
consider parallelisation and communication overhead, the required
bandwidth etc. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, developers strive to achieve
in their solutions both flexibility and high modularity. This
results in increased development time and costs, sometimes
with low-performing code. Moreover, current development
tools do not offer a sufficient abstraction level, and instead provide
a low degree of modularity to applications [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore,
developers should take into account very complex scenarios
in order to parameterise their code, e.g. for different sizes of
the solution domains, different numbers of threads and blocks,
different data sizes, and therefore different solutions to obtain
maximum core occupancy and transfer bandwidth. As a result
it is difficult to separate the different concerns, overloading
the developer (and the code) with the handling of a multitude
of responsibilities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        This paper proposes a new paradigmatic solution that lets
developers use object-oriented (OO) code to develop
GPU-specific code or low-level device-oriented code by means of a
friendly toolbox. This toolbox uses cooperating agents to assist
the development of scalable modular code that takes advantage
of GPU devices, freeing developers from the need to handle a
device-oriented language. It also provides the management of
memory allocations and of overall communications between host
and computing device. Aspect-oriented programming [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
is used as a glue to enhance an application with
environment-specific choices, such as the selection of a specific task-driven
code at runtime. Therefore, our proposed approach brings a
substantial improvement in terms of modularity, performance,
reusability of code and separation of concerns [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        The developed toolbox provides enhanced reusability for
parallel computation of previously written code (both
object-based and device-oriented), by using several agents which
interpret the behaviour of the OO code and use a dedicated
translation utility [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Tools such as LIME [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or
AeminiumGPU [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] derive a customised language and a
run-time environment; however, they require specific compilers and
force developers to use a non-standard programming language,
while giving no options for standard code reusability (it is
impossible for such paradigms to take advantage of reused
sequential OO code to obtain parallel versions).
      </p>
      <p>While OpenCL provides developers with fine-grain control
of host and kernel code, the handling of low-level details
is a significant burden for the developer. The proposed
toolbox, instead, requires no knowledge of the device-oriented
language: the developer writes standard OO code
and takes care of connecting it to the toolbox by using some
annotations.</p>
      <p>
        Recent works, such as [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], have partially automated
several processes in the field of code control to avoid conflicts
or misleading behaviours, but even in this case it is ultimately
the programmer’s responsibility to structure the code in
the appropriate manner. An approach has been derived to
mechanically determine how a program accesses data [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
and, other analysers have focused on extracting the structure of
a software system to determine some structural properties [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], [27]. Such analyses
are paramount for assessing the possibility of transforming a
program in such a way as to obtain parallelism while avoiding
data inconsistency. In [28] a Java software system has been
presented, based on an approach that derives an entirely new
set of syntactical rules for the use of a proprietary
metacompiler. To the best of our understanding, no significant further
step has been made towards a high-level and self-contained
toolbox for the easy development of GPU-oriented software within
an OO paradigm.
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. GPGPU AND CUDA PROGRAMMING</title>
      <p>GPU programmers have to consider the underlying
hardware in order to write any GPU-enabled code (from now on,
simply a GPU kernel). Graphics processors provide a large number of
simple multithreaded cores, offering the potential for dramatic
speedups for a variety of general-purpose applications when
compared to sequential CPU computation [29], [30], [31],
[32], [33].
      </p>
      <p>In the following, BR denotes the number of bytes read per kernel,
BW the number of bytes written per kernel, and t the elapsed time.
While the theoretical bandwidth can be computed beforehand, the
effective bandwidth BEFF can only be obtained by observing the
runtime execution. The presented solution enables us to perform
these operations and to obtain a real-time estimate of the bandwidth
occupancy ratio</p>
      <p>BOR = BEFF / BTH (3)</p>
      <p>where BTH is the theoretical maximum bandwidth and BEFF the
effective bandwidth, both derived below. At runtime it could be
useful to compute BEFF as</p>
      <p>BEFF = nO · (∏ i=1..D li) · sizeof(TYPE) (4)</p>
      <p>where nO is the number of operations (e.g. 2 for read and
write), D the maximum number of dimensions of the data
structure in transfer, li the length along the i-th dimension,
and sizeof(TYPE) the size in bytes of one unit of data of
the specified type (nM, CM and RM, defined below, characterise
the memory hardware). Therefore it follows that</p>
      <p>BOR = (nO / (nM · CM · RM)) · (∏ i=1..D li) · sizeof(TYPE) (5)</p>
    </sec>
    <sec id="sec-3">
      <title>The CUDA programming model</title>
      <p>
        The launch of the Nvidia CUDA technology has opened
a new era for GPGPU computing, allowing the design and
implementation of parallel GPU-oriented algorithms without
any knowledge of OpenGL, DirectX or the graphics pipeline. A
CUDA-enabled GPU is composed of several MIMD (multiple
instruction, multiple data) multiprocessors that contain a set
of SIMD (single instruction, multiple data) processors. Each
multiprocessor has a shared memory that can be accessed by
each of its processors, and also shares a bigger global memory
common to all the multiprocessors. Basically, a CUDA kernel
distributes threads over the SIMD processors, each performing
a single computation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Moreover, the GPU
card allows an advanced geometrical enumeration for threads,
described by a 3-dimensional structure along the 3 spatial axes
(even if the z axis is actually only a logical extension) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Furthermore, it is possible to collect a set of threads into
logical 3-dimensional blocks that are executed on the same
multiprocessor.
      </p>
      <p>In the CUDA programming model, an application consists of
a host program that executes on the CPU and one or more parallel
kernel programs executing on the GPU [34], [35]. A kernel
program is executed by a set of parallel threads. The host
program can dynamically allocate device global memory on
the GPU and copy data to (and from) such memory from
(and to) the memory on the CPU. Moreover, the host program
can dynamically set the number of threads that run a kernel
program. Threads are organised in blocks, and each block has
its own shared memory, which can be accessed only by the
threads of the same block.</p>
      <p>It is paramount that interactions between CPU and GPU are
minimised, as this avoids communication bottlenecks and delays
due to data transfers. Necessary data transfers should
maximise the bandwidth usage, i.e. CPU and GPU should interact
as few times as possible and transfer a large amount of
data each time.</p>
      <sec id="sec-3-1">
        <title>A. Bandwidth measurements</title>
        <p>Bandwidth is indeed one of the most important factors
for performance. Best practice in CUDA C programming
recommends that almost all GPU adaptation changes to code
be made in the light of how they affect bandwidth.</p>
        <p>Bandwidth can be dramatically affected by the choice of
memory in which data is stored, how the data is laid out and
the order in which it is accessed, as well as other factors
due to the computation itself. In order to obtain an accurate
estimate of the achievable performance, it is necessary to calculate
the effective bandwidth which, generally, differs strongly from
the theoretical bandwidth (the latter is much greater than the
former). The theoretical maximum bandwidth BTH is</p>
        <p>BTH = nM · CM · RM (1)</p>
        <p>where CM is the maximum memory clock, nM is the number
of bits of the memory interface, and RM is the memory data
rate (1 if single rate, 2 if double rate, etc.). Moreover, to
obtain an accurate estimate of the effective bandwidth BEFF,
such computations should be performed at execution time by
means of the following equation</p>
        <p>BEFF = (BR + BW) / t (2)</p>
        <p>Since we are interested in host-to-device transfers (and
vice versa), we have nO = 2; moreover, CM and RM depend on
the hardware and cannot change during the execution.
Therefore, given the fixed constant</p>
        <p>KHW = nO / (nM · CM · RM) (6)</p>
        <p>it follows that</p>
        <p>BOR = KHW · (∏ i=1..D li) · sizeof(TYPE) (7)</p>
        <p>This latter gives the exact bandwidth usage.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Optimisation practices</title>
      <sec id="sec-4-1">
        <title>B. Memory optimisations</title>
        <p>While the bandwidth occupancy ratio gives us an estimate
of the performance of the code, in order to improve such
performance a major part of the effort should be directed
toward memory optimisations. A performant code should
indeed maximise the bandwidth occupancy ratio, but such a
bandwidth is best served by using as much fast memory and
as little slow-access memory as possible (this practice applies
both to the device memory and to the host memory).</p>
        <p>In order to gain performance, it is important to reduce the
number of data transfers between host and device, sometimes
even by running directly on the GPU portions of serial code
(i.e. portions of code on which the CPU could easily outperform the GPU).</p>
        <p>For the same reason data structures could be created both in
the device and in the host in order to serve as an intermediate
buffer. Such a buffer could also be useful to avoid small
transfers, organising larger transfers which should perform
better even in case of non-contiguous regions of memory
(these would be packed in an unique compact buffer and then
unpacked at their destination).</p>
        <p>A major improvement in memory usage is finally granted
by using page-locked memory, also known as pinned
memory. By using pinned memory the bandwidth usage
should be maximised (limited, that is, to the transfers
between host and device). In order to use pinned memory
the CUDA libraries provide the cudaHostAlloc() and
cudaHostRegister() functions: the former allocates a region
of memory in pinned modality, while the latter is used to pin
the memory on the fly without allocating a separate buffer.</p>
        <p>While the use of pinned memory can improve
performance, this practice is likely to be difficult for developers,
who risk taking on too many responsibilities. Moreover, the
usage of pinned memory does not give a general solution
for every code, since pinned memory is a scarce resource and
an excessive use could end up in an overall reduction of
the system performance. Finally, pinning memory is often
a heavyweight operation when compared to normal memory
management. This results in a trade-off situation which should
be carefully analysed before taking any action. The proposed
solution is intended to spare the developers from such concerns
by taking care of this issue with automatic evaluations and
countermeasures.</p>
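        <p>The trade-off just described can be made concrete with a small
sketch. The policy below is hypothetical (class name, threshold and
budget are ours, not the toolbox's actual components): it pins a buffer
only when the expected transfer volume justifies the pinning cost, and
never beyond a fixed pinned-memory budget.</p>

```java
// Hypothetical sketch (not the paper's actual component): a policy that
// decides whether a host buffer is worth pinning, balancing the bandwidth
// gain of pinned transfers against the pinning cost and its scarcity.
public class PinnedMemoryPolicy {
    private final long minTransferBytes;   // below this, pinning overhead dominates
    private long pinnedBudgetBytes;        // pinned memory is a scarce resource

    public PinnedMemoryPolicy(long minTransferBytes, long pinnedBudgetBytes) {
        this.minTransferBytes = minTransferBytes;
        this.pinnedBudgetBytes = pinnedBudgetBytes;
    }

    /** True if the buffer should be pinned; reserves budget when it is. */
    public boolean shouldPin(long bufferBytes, int expectedReuses) {
        // Pinning pays off for large or repeatedly transferred buffers,
        // and only while the pinned budget is not exhausted.
        boolean worthIt = bufferBytes * (long) expectedReuses >= minTransferBytes;
        if (worthIt && bufferBytes <= pinnedBudgetBytes) {
            pinnedBudgetBytes -= bufferBytes;
            return true;
        }
        return false;
    }
}
```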
      </sec>
      <sec id="sec-4-2">
        <title>C. Asynchronous transfers</title>
        <p>In a standard situation the developer may decide to
transfer data between host and device by means of the
cudaMemcpy() function, which is a blocking transfer. In
other words, such an operation constitutes a barrier and returns
the control to the thread only after the entire data transfer is
completed.</p>
        <p>The CUDA architecture offers a different solution for memory
transfer by means of the cudaMemcpyAsync() function,
which is a non-blocking variant of the previous one: it returns
control to the host thread immediately, with all the related consequences.
Moreover, this function requires the use of pinned memory
and, for safety reasons, the use of so-called streams. A stream
is a sequence of operations that are performed on the device
in a given order. Streams must be properly used with
asynchronous transfers in order to access data
only after they have been transferred; on the other hand,
different streams can be overlapped. Asynchronous transfers
enable us to overlap data transfers with computations, therefore
their proper use can tremendously increase performance;
however, they can be very tricky for the developer, and again
assistance may be required. This kind of assistance, too, is
provided by our solution.</p>
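        <p>The ordering guarantees of streams can be mimicked on the host
side with a small Java analogy (ours, not the CUDA API): each stream is
modelled as a single-threaded FIFO executor, so operations within one
stream run in submission order, while distinct streams may overlap.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Host-side analogy of CUDA streams (our illustration, not the CUDA API):
// each stream is a FIFO queue, so operations submitted to the same stream
// run in order, while operations in distinct streams may overlap in time.
public class StreamAnalogy {
    public static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        ExecutorService stream0 = Executors.newSingleThreadExecutor();
        ExecutorService stream1 = Executors.newSingleThreadExecutor();
        try {
            // In-stream ordering: the copy completes before the kernel starts.
            stream0.submit(() -> log.add("s0:copyToDevice"));
            stream0.submit(() -> log.add("s0:kernel"));
            // A second stream may interleave freely with the first one.
            stream1.submit(() -> log.add("s1:copyToDevice"));
            stream0.shutdown();
            stream1.shutdown();
            stream0.awaitTermination(5, TimeUnit.SECONDS);
            stream1.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }
}
```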
      </sec>
      <sec id="sec-4-3">
        <title>D. Cores occupancy</title>
        <p>Another key point in order to maximise the GPU
performances is the core occupancy. While a task should run
unconstrained, its workload should be correctly designed so
as to take advantage of a number of threads that exactly
matches the number of available GPU cores. The best practice
recommends to keep the multiprocessors on the device as busy
as possible. It follows that a poorly balanced workload will
result in suboptimal performances. Hence, it is important to
implement a correct application design with an optimal task
distribution on threads and blocks. The proposed toolbox aims
to spare the developers from such an effort.</p>
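        <p>The sizing decision described above can be sketched as follows
(class and method names are ours, for illustration only): one thread is
assigned per element, and the element count is rounded up to whole
blocks so that no element is left without a thread.</p>

```java
// Illustrative sketch (names are ours): choose a launch configuration that
// keeps the multiprocessors busy, given the problem size and device limits.
public class LaunchConfig {
    final int threadsPerBlock;
    final int blocks;

    LaunchConfig(int threadsPerBlock, int blocks) {
        this.threadsPerBlock = threadsPerBlock;
        this.blocks = blocks;
    }

    /** One thread per element, rounded up to whole blocks. */
    static LaunchConfig forProblem(int elements, int maxThreadsPerBlock) {
        int tpb = Math.min(elements, maxThreadsPerBlock);
        int blocks = (elements + tpb - 1) / tpb;   // ceiling division
        return new LaunchConfig(tpb, blocks);
    }
}
```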
        <p>[Figure 1: architecture of the proposed toolbox. A Proxy agent
(GUI, interpreter, OO compiler) serves the user; a Broker (translator,
linker) mediates; a Platform agent (injector, code repository, dedicated
compiler) targets the device; experts maintain the code repository.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>III. THE PROPOSED TOOLBOX</title>
      <p>The aim is to strictly separate all the algorithmic development
related to the application domain from other concerns (i.e. the
GPU handling), which should not be taken into account by
the programmer. All the GPU device-related management is
performed automatically by the proposed toolbox (e.g. the
choice of the best number of threads and blocks, the needed
modifications to the code in order to enable it for asynchronous
stream execution, etc.).</p>
      <p>This toolbox aims at providing a simplified and modular
support for GPU computing that developers could use without
having to learn how to program in CUDA. The purpose of this
work is to develop such a toolbox for OO Programming to run
specific tasks on the MIMD environment provided by a GPU
accelerator without any need to divert from an OO paradigm
and the related OO language (e.g. Java).</p>
      <p>Figure 1 shows the proposed solution, which consists of
three main agents: the Proxy agent, the Broker, and the
Platform agent.</p>
    </sec>
    <sec id="sec-8">
      <title>Agents and their responsibilities</title>
      <p>The proxy agent provides a graphical user interface (GUI),
typically intended as a web portal, to upload
OO code complying with several syntactic constraints (such
as the proper use of annotations). The uploaded code is then
interpreted by an interpreter software module, which creates
an XML file to instruct the translator on the behaviour that
some portions of code have to exhibit. Finally, the proxy agent
contains the OO compiler, which has the responsibility
to link all the produced software modules with the unchanged
portion of the original OO code and then compile it, creating
an executable binary. The binary is finally returned to the user
by means of the same GUI when ready.</p>
      <p>The said translator module is part of the Broker; it receives
an XML description of the behaviours from the interpreter.
While the interpreter has the responsibility to detect the code
behaviours and accordingly prepare an input for the translator,
it has nothing to do with the translation of the code itself.
The translator maintains a reference to the different platform
agents, which are designed to match different hardware
infrastructures. The translator is thus able to understand the
interpreted behaviours and then choose the proper
architecture-oriented platform agent. Once the agent has been chosen, the
translator instructs it to inject some portions of translated code;
meanwhile, the translator itself prepares the unchanged
code to be linked with the compiled device-dedicated
software modules. The linker module has the latter responsibility,
linking the remaining OO code with the generated
device-dedicated executables. This responsibility is anything but trivial,
since any choice regarding the best linking approach is up
to the linker (e.g. whether to use external functions, native
interfaces, or other approaches that let the
generated binary be called from the OO portion of the code).</p>
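      <p>The dispatch just described can be summarised by a minimal sketch
(all names below are ours; the actual Broker is considerably richer):
the translator selects the platform agent registered for the target
architecture and asks it to inject the code implementing a behaviour.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the dispatch described above (all names are ours):
// the translator picks a platform agent according to the target architecture.
public class Broker {
    interface PlatformAgent { String inject(String behaviour); }

    private final Map<String, PlatformAgent> agents = new HashMap<>();

    void register(String architecture, PlatformAgent agent) {
        agents.put(architecture, agent);
    }

    /** Choose the architecture-oriented agent and ask it to inject code. */
    String translate(String architecture, String behaviour) {
        PlatformAgent agent = agents.get(architecture);
        if (agent == null) throw new IllegalArgumentException("no agent for " + architecture);
        return agent.inject(behaviour);
    }
}
```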
      <p>The effective injection of code and generation of executable
software modules is performed within the dedicated Platform
agent. The core module of the Platform agent is the injector.
As the name suggests, the responsibility of this module is to
inject device-oriented code in order to create a separately
compiled software module to be then linked by the broker,
as said before. The injector has knowledge of a selection of
code snippets within a related repository. Based on the indications
given by the translator, the injector requests the corresponding
code from the repository. The latter is maintained by experts of
the target architecture to which the repository is related. The
proposed approach is similar to a wiki project where experts
add new code to be used. Of course the uploaded code must
be compliant with all the standards of the device for which
it is intended, and, moreover, should be accompanied by an
adequate descriptor within the constraints of the presented
agent-oriented system itself. Finally, the injected code is
compiled with a dedicated compiler; a binary file is produced
and passed to the linker, which performs its duties, closing the
cycle.</p>
      <p>In the following section we will give some details regarding
the injection procedure and the involved modules.</p>
    </sec>
    <sec id="sec-9">
      <title>IV. THE DESIGNED MODULES</title>
      <sec id="sec-9-1">
        <title>A. Compile time</title>
        <p>In our approach, fragments of device-related code are
automatically linked by using a predefined library of common
functions of general purpose. Moreover, the designed system
makes it possible to define custom CUDA compliant tasks to be
executed on the GPU. At runtime, a component, which makes
use of aspect-oriented programming, provides optimal
management of memory transfers and allocation by
monitoring the effective allocations, initialisations, and values of the
stored variables both on the host and on the device. By including the
predefined classes or by using the precompiled executable as a
linked binary, it is possible to use a predefined set of functions
and also to implement custom CUDA compliant functions and
then invoke them within an OO paradigm by means of Java
Native Interface (JNI).</p>
        <p>The approach provides the developer with a set of functions
that, when invoked, take care of all needed management,
including: the data transfers between CPU and device, the
memory allocations or the use of pinned memory, the possible
asynchronous execution of different threads, the optimal sizing
and dimensioning with respect to best performing number of
threads and blocks. Such functions are implemented as part
of a set of choices made by the toolbox in order to obtain
a CUDA compliant code which satisfies certain requirements
that we call behaviours.</p>
        <p>A behaviour is a set of fixed parameters concerning the
management of the GPU card and all the related optimisation
that does not involve the application logic. While the developer
is responsible for the application logic, the implementation of
the said behaviour (therefore all the choices and related
implementation in terms of specific kernels, functions, parameters
and strategies) are up to the presented support. Finally, the
developer can also freely compose predefined behaviours or
create new ones.</p>
      </sec>
      <sec id="sec-9-2">
        <title>B. Code repository</title>
        <p>A predefined set of behaviours and related
implementing functions are included in a code repository, so that the
programmer will have no need to directly implement CUDA
kernels and calls, nor even to use C or C++ languages. With
our approach, a developer could write his code in an OO
programming language such as Java, and then use a few of the
classes in our repository to provide some parameters, in order
to configure the whole ensemble of application and CUDA
code (examples of parameters are the number of threads and
blocks, the maximum bandwidth, etc.), or let our toolbox take
all the decisions.</p>
        <p>As far as CUDA compliant code is concerned, this is
implemented by the toolbox: some function pointers are predefined
and listed, so that the application developer has the
possibility to choose among a given set of functions, or to
manually add to the list a custom function written in CUDA
C compliant code. In this way the developer could even write
non-OO code (specifically CUDA code) and then use it within
a more comprehensive OO application.</p>
        <p>Any function defined in the library, hence also any custom
function, is a __device__ function: the inputs of such GPU
functions consist of a unique pointer to the ensemble of data,
operands and outputs; in addition to this pointer some control
data are given. Note that such a code will be generated by the
toolbox, which takes care of all such details. These functions
are then executed on the GPU device and called by a properly
generated __global__ kernel.</p>
        <p>The provided functions work for an arbitrary number of
parameters (i.e. operands) and functions.</p>
      </sec>
      <sec id="sec-9-3">
        <title>C. Code injection choice</title>
        <p>This agent oriented system was created to make use of
several classes that properly realise a usable set of data
assisting the computation to be performed on a GPU. These
classes are adapted and interpreted as C-like defined structures
of standard types, which are then transferred to the GPU
device.</p>
        <p>The JNI layer provides the needed “glue” to manage calls
and data transfer towards the C++ side, which will use such
classes as primitive structures. While the memory address of
an object is not available under a Java framework, once objects
are passed by means of JNI calls to the C++ layer, it becomes
possible (within the C++ portion of code) to manipulate and
pass data by means of their memory addresses. This makes
it possible to ignore the number of dimensions for arrays,
matrices, tensors, handled by such data types.</p>
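        <p>The adaptation of OO data into C-like structures can be
illustrated with a hypothetical helper (Flattener and its methods are
ours, not the toolbox's real classes): a matrix is flattened into the
single contiguous array handed across the JNI boundary, with the
dimension lengths passed alongside as control data.</p>

```java
// Hypothetical illustration (not the toolbox's real classes): flatten a 2-D
// matrix into the single contiguous array handed across the JNI boundary,
// with the dimension lengths passed as control data.
public class Flattener {
    /** Row-major copy of a rectangular matrix into one contiguous buffer. */
    static double[] flatten(double[][] m) {
        int rows = m.length, cols = m[0].length;
        double[] flat = new double[rows * cols];
        for (int i = 0; i < rows; i++)
            System.arraycopy(m[i], 0, flat, i * cols, cols);
        return flat;
    }

    /** Inverse operation, performed on the receiving side. */
    static double[][] unflatten(double[] flat, int rows, int cols) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            System.arraycopy(flat, i * cols, m[i], 0, cols);
        return m;
    }
}
```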
        <p>An important strategy has been used to reduce the size
of data transferred and consists of an a priori selection and
rearrangement of the operands and functions encoded, as said
before, in unique arrays.</p>
        <p>Since the application developer has to provide the starting
and ending points of the operands, before any allocation or
communication in and to the device, the toolbox rearranges
only the necessary part of the data in a communication buffer.
Under the stated conditions the memory allocated and the data
transferred to the device are minimised, while the
total size of such a buffer maximises the bandwidth usage.
Another advantage of this selection of data is that it minimises
the operations on the device due to the indexing of the operands.</p>
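        <p>The packing step can be sketched as follows (TransferBuffer and
its signature are ours, for illustration): only the [start, end) slice of
each operand is copied into one compact buffer, so that several small
transfers become a single large one.</p>

```java
// Sketch (with our own names) of the packing step described above: only the
// [start, end) slice of each operand is copied into one compact buffer, so
// that several small transfers become a single large one.
public class TransferBuffer {
    static double[] pack(double[][] operands, int[] start, int[] end) {
        int total = 0;
        for (int i = 0; i < operands.length; i++) total += end[i] - start[i];
        double[] buffer = new double[total];
        int pos = 0;
        for (int i = 0; i < operands.length; i++) {
            int len = end[i] - start[i];
            // Non-contiguous regions are packed back to back.
            System.arraycopy(operands[i], start[i], buffer, pos, len);
            pos += len;
        }
        return buffer;
    }
}
```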
        <p>Another important feature of the proposed toolbox is the
simplified interface to memory transfers between host and
device. It is known that, as far as execution time is concerned,
generally the most costly part is memory allocation and data
transfer from the host to the GPU device and vice versa,
which, for the most part, is at the origin of the total overhead.
Memory transfer is not only expensive at runtime: it is also
considered the trickiest and most misleading part of GPU
programming, being expensive in terms of coding time as well.</p>
        <p>This toolbox takes care of memory allocations on the
device and offers an advanced management of communications
between host and GPU device. For this reason, e.g. when a
variable is used twice on the GPU device during the same
execution of the program, and if it is not reassigned or
redeclared in the meantime, the toolbox will avoid repeating
the memory transfer, preserving a copy on the device for future
use. This feature gives the programmer an easy way to
develop GPU-ready code without any need to take care of these
tricky side concerns, focusing only on the algorithm she
wants to implement.</p>
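        <p>The transfer-avoidance rule can be modelled with a small sketch
(DeviceCache and its version counter are our illustration, not the
toolbox's real bookkeeping): a variable already resident on the device
is copied again only if it was reassigned on the host.</p>

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (ours) of the transfer-avoidance rule described above: a variable
// already on the device is not copied again unless it was reassigned.
public class DeviceCache {
    private final Map<String, Integer> onDevice = new HashMap<>(); // name -> version
    int transfers = 0;

    /** Transfer the variable only if the device copy is missing or stale. */
    void ensureOnDevice(String name, int hostVersion) {
        Integer devVersion = onDevice.get(name);
        if (devVersion == null || devVersion != hostVersion) {
            transfers++;                       // simulate the host-to-device copy
            onDevice.put(name, hostVersion);
        }
    }
}
```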
      </sec>
      <sec id="sec-9-4">
        <title>D. Translator module</title>
        <p>As said above, our toolbox uses several independent agents
in order to generate device-oriented software modules that can
be connected with any other Java application code by using
the Java Native Interface. The software modules are created
by means of the standard nvcc compiler without any other
precompiler. In fact, the toolbox comes with the needed computational
libraries, which can also be precompiled and linked to an
existing software system.</p>
        <p>In order to enable application developers to produce
modular code, the proposed translation system makes use of
behaviours to obtain certain features of the code. Such behaviours
could be intended for CUDA-compliant code in the same
way as Design Patterns are intended for OO code. In such a
context, aspect-oriented programming (AOP) has been proven
effective to implement OO design patterns while preserving the
independence of classes and the separation of concerns [36].
In the same way, AOP is useful in order to connect an OO
application with CUDA native GPU code.</p>
        <p>Since the proposed toolbox takes advantage of some
specified parameters and several annotations given by the developer
in order to identify the correct behaviour (or composition of
behaviours), the adopted solution could be classified as an
Annotated Aspect-oriented (AA) solution. In such an AA
solution several aspects are responsible for the interception
of the relevant OO methods, whose execution is ultimately
substituted with native code, interacting with the remaining
portion of the software by means of JNI. In this case, we
want to run JNI instances driven by some parameters. Some
of them are embedded into the OO code, others should be
evaluated at runtime, e.g. the core occupancy, whether or not
to use pinned memory, the effective bandwidth utilisation as in
equation (2). While the presented toolbox makes all the needed
computations by means of a meta-layer which reflects on the
OO code, in order to correctly interpret the desired behaviour
it is up to the developer to annotate his application code. In
order to minimise this concern we have developed an easy way
to set up all the needed parameters and to select the portions
of code to parallelise.</p>
        <p><preformat>public @interface GPUstream {
  int value();
}
public @interface GPUparal {
  String fixed();
}
public class Behaviour {
  private static Behaviour b = new Behaviour();
  private String status;
  private String tmpStatus;
  private Behaviour() { }
  public static Behaviour getInstance() {
    return b;
  }
  public String eval(String s) { }
  public String wise() { }
  public String add(String s) { }
  public void set(String s) { }
  public void init(String a) {
    add(a);
    add(eval(tmpStatus));
    add(eval(wise()));
  }
  public String get() { }
}</preformat></p>
        <p>Some annotations are provided by the toolbox itself as
part of a library. Among such default annotations, some
are used to let the aspects inject the appropriate code at
the right points within the application. Figure 2 shows annotation
@GPUparal, which identifies a class that has to be
substituted with CUDA-compliant code; some code is then
executed on the GPU card and connected with the OO software by
means of JNI. Such a parallel execution can be organised into
several streams by means of the @GPUstream annotation. The
latter takes as input an ID that univocally identifies an
execution stream (mandatory in case of asynchronous execution
and pinned memory utilisation). Moreover, the @GPUparal
annotation allows the developer to define mandatory behaviours
for certain classes or methods. Such mandatory behaviours
can be defined along with the implementation of application
methods and classes, and become proper directives when the
implemented methods (or classes) are called (or used), as in
Figure 3:</p>
        <p>public class Main {
    @GPU_param(threads=64, blocks=8, async=1,
        pinned=1, buff=1024, streams=1, mixbehav=0,
        fixed="threads,async,pinned")
    public static void main(String[] args) {
        // ...
    }
}</p>
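        <p>To make concrete how such annotation parameters become available at runtime, the following self-contained sketch redefines a minimal @GPUstream with runtime retention and reads the stream ID reflectively, much as an intercepting aspect would. The names mirror the toolbox, but this snippet is an illustration under our own assumptions, not the toolbox code itself:</p>

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

public class StreamDemo {
    // Minimal stand-in for the toolbox annotation: runtime retention
    // is required so that aspects can read the stream ID reflectively.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface GPUstream {
        int value(); // the execution stream ID
    }

    @GPUstream(1)
    public void kernelLikeMethod() { /* body substituted with native code */ }

    public static void main(String[] args) throws Exception {
        Method m = StreamDemo.class.getMethod("kernelLikeMethod");
        int streamId = m.getAnnotation(GPUstream.class).value();
        System.out.println("stream ID = " + streamId);
    }
}
```

        <p>Runtime retention is the key design point: without it, the annotation would be discarded by the compiler and the interception layer could not associate the method with a stream.</p>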
        <p>Some specific behaviours can be enabled at a given
moment in the execution, e.g. just before a call
to a method: in that case, method set() has to be called on
the instance of class Behaviour (see also Figure 3). The said
method set() then overrides any other behaviour,
except for the mandatory behaviours.</p>
        <p>Figure 3 shows how an application takes advantage of the
proposed toolbox by means of Java annotations. These set
important parameters or communicate some global behaviour,
which is then applied to all the parallelised code.
Additional behaviours can be introduced by the aspects when
appropriate.</p>
      </sec>
      <sec id="sec-9-5">
        <title>E. Injection module</title>
        <p>The developed aspect GPUinjector (see Figure 4) takes
into account the behaviour resulting from: (i) the class-related
annotation @GPUparal, (ii) the method-related annotation
@GPUstream, and (iii) the directives given by means of the
predefined annotation @GPU_param. When the aspect
intercepts a method called on a given instance, it observes all
behavioural directives (given in the code by means of
method set() of class Behaviour). Moreover, it takes into
account the overall ensemble of parameters and circumstances
that intervene at runtime. This latter reasoning can lead the
aspect to find a more profitable setup, enriching or
modifying the given set of behaviours (except in the
case of mandatory behaviours).</p>
        <p>At the beginning of an application execution, as soon
as class Behaviour is loaded, it is populated with proper
data values; then, according to the mandatory instructions, several
behaviours are configured. Such mandatory instructions, as
well as all the other initialisation directives, are given in the
application code by means of the default annotations. Figure 4
shows the aspect and the JNI entry point:</p>
        <p>public aspect GPUinjector {
    pointcut GPUpar(GPUparal ann, Object obj):
        this(obj) &amp;&amp; execution(@GPUparal void *.*(..))
        &amp;&amp; @annotation(ann);
    void around(GPUparal ann, Object obj):
        GPUpar(ann, obj) {
        try {
            // ...
            Behaviour.getInstance().init(ann.fixed());
            // ...
        } catch (Exception e) { }
    }
}
extern "C" JNIEXPORT void JNICALL
Java_Injected_CUDAcode(JNIEnv* env, Gstruct gpudata) {
    env-&gt;...
}</p>
        <p>When an instance of an annotated application class is
intercepted by GPUinjector, the class-related annotations
are read and stored in an appropriate hash table handled
by the aspect. The same processing is performed for
method-related annotations, which are stored in another dedicated hash
table. Data stored in such hash tables are shared by the several
instances of the same class, once they are intercepted, since
the same parallelisation is desired for all instances of the same
class.</p>
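        <p>As a sketch of this per-class caching (the map shapes, names and the string stand-in for parsed directives below are our assumptions, not the toolbox internals), the aspect can memoise the parsed directives keyed by the declaring class, so that every intercepted instance of the same class reuses the same entry:</p>

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache mirroring the two hash tables handled by GPUinjector:
// one for class-related directives, one for method-related ones.
public class DirectiveCache {
    private final Map<Class<?>, String> classDirectives = new ConcurrentHashMap<>();
    private final Map<String, String> methodDirectives = new ConcurrentHashMap<>();

    // Parse the class-level annotation only once; later instances reuse it.
    public String forClass(Class<?> c) {
        return classDirectives.computeIfAbsent(c,
            k -> "parsed:" + k.getSimpleName()); // stands in for real parsing
    }

    // Method entries are keyed by "fully.qualified.ClassName#methodName".
    public String forMethod(Class<?> c, String method) {
        return methodDirectives.computeIfAbsent(c.getName() + "#" + method,
            k -> "parsed:" + k);
    }
}
```

        <p>Since both lookups memoise their result, the (potentially expensive) reflective parsing happens once per class, which matches the observation that all instances of a class share the same parallelisation.</p>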
        <p>Data gathered from method-related and class-related
annotations are then merged with the mandatory behaviours and the other
specifications given by the application developer, in order to
exclude conflicting behaviours. The resulting behaviours
are held by attribute tmpStatus, as extracted by the
aspect from the annotations. Such behaviours are evaluated by
method eval() of class Behaviour, which checks their
compatibility with the mandatory behaviours before modifying
the general behaviour encoded as string status.</p>
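        <p>A minimal sketch of this compatibility check follows; the comma-separated status encoding follows the "default,split1D" style of Figure 5, while the concrete conflict rule (a "no-" prefixed behaviour disabling a mandatory one) is an assumption made purely for illustration:</p>

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative merge of candidate behaviours with mandatory ones:
// the status is a comma-separated string, as in "default,split1D".
public class BehaviourMerge {
    private final Set<String> mandatory = new LinkedHashSet<>();
    private final Set<String> status = new LinkedHashSet<>();

    public BehaviourMerge(String fixed) {
        for (String s : fixed.split(",")) mandatory.add(s.trim());
        status.addAll(mandatory); // mandatory behaviours are always present
    }

    // Accept a candidate behaviour only if it does not try to disable
    // a mandatory one (here encoded as a "no-" prefixed form of it).
    public String eval(String candidate) {
        for (String m : mandatory)
            if (candidate.equals("no-" + m)) return null; // conflicting
        return candidate;
    }

    public void add(String behaviour) {
        if (behaviour != null) status.add(behaviour);
    }

    public String get() { return String.join(",", status); }
}
```

        <p>With mandatory behaviours "pinned,async", adding the candidate "split1D" succeeds, while "no-pinned" is rejected by eval() and never reaches the status string.</p>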
        <p>Finally, aspect GPUinjector enables us to integrate
new behaviours on-the-fly under advantageous circumstances.
This integration is performed by means of method wise()
of class Behaviour. After the evaluation, the behaviours are
added (or modified) with method add() of the same class.</p>
        <p>Once the whole behaviour has been composed,
the aspect calls the code generator, which joins pre-existing
portions of code related to each behaviour composition (or
custom-made CUDA-compliant code that the developer has linked
to a certain composition of default or custom behaviours).
Figure 5 shows the Java code, annotations and calls that
connect an application with our toolbox.</p>
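        <p>The joining step can be pictured as a lookup from behaviour names to pre-existing CUDA-compliant fragments, concatenated into the generated module. The fragment names and their CUDA-flavoured contents below are invented for illustration and do not reproduce the toolbox's actual snippet library:</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative code generator: each known behaviour maps to a
// pre-existing CUDA-compliant fragment; the generator joins the
// fragments selected by the composed behaviour string.
public class SnippetJoiner {
    private final Map<String, String> fragments = new LinkedHashMap<>();

    public SnippetJoiner() {
        fragments.put("default", "cudaMalloc(&d_buf, n);\n");
        fragments.put("split1D", "int i = blockIdx.x * blockDim.x + threadIdx.x;\n");
        fragments.put("pinned",  "cudaHostAlloc(&h_buf, n, cudaHostAllocDefault);\n");
    }

    // behaviours: a comma-separated composition, e.g. "default,split1D"
    public String generate(String behaviours) {
        StringBuilder out = new StringBuilder();
        for (String b : behaviours.split(","))
            out.append(fragments.getOrDefault(b.trim(), ""));
        return out.toString();
    }
}
```

        <p>Unknown behaviour names simply contribute nothing, so a composition degrades gracefully rather than failing the whole generation.</p>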
        <p>The availability of an easy-to-use and modular toolbox
for GPGPU programming opens an entirely new range of
possibilities in the field of fast, performance-oriented computing.
By means of our toolbox the developer need not use any
external or proprietary compiler. Consequently, this toolbox
offers virtually unlimited reusability, with the possibility to link
with CUDA-compliant code. Figure 5 shows how an application
selects a behaviour and invokes a parallelised method:</p>
        <p>Behaviour.getInstance().set("default,split1D");
MyGPUclass.myGPUmethod(data);</p>
        <p>The implementation of advanced features relies on
aspect-oriented code. In this way, it is possible to achieve a high
level of customisation for the definition of new behaviours, as well
as performance improvements due to an advanced management of
the allocation and freeing of memory on the device; this provides
the means to control the lifecycle of variables stored on the device.
All this is obtained without compromising the modularity and
simplicity of the implementation, which are the main driving
forces of this work.</p>
        <p>Moreover, the presented toolbox works as an integrated
translation utility for the automatic conversion of
sequential OO code, as well as for the integration of
CUDA-compliant code, providing an advanced interpretation method.
In this way, programmers who intend to take advantage of this
toolbox will be able to reuse previously written
code by translating it into a GPU-enabled implementation,
with robust compatibility between this toolbox and any
independent OO code.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Muriki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Canon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cholia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shalf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Wasserman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Wright</surname>
          </string-name>
          , “
          <article-title>Performance analysis of high performance computing applications on the amazon web services cloud</article-title>
          ,”
          <source>in Proceedings of International Conference on Cloud Computing Technology and Science (CloudCom)</source>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          , IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>Improving files availability for bittorrent using a diffusion model</article-title>
          ,
          <source>” in 23rd IEEE International WETICE Conference</source>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>196</lpage>
          , IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Che</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Boyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tarjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sheaffer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Skadron</surname>
          </string-name>
          , “
          <article-title>A performance study of general-purpose applications on graphics processors using cuda</article-title>
          ,
          <source>” Journal of Parallel and Distribuited Computing</source>
          , vol.
          <volume>68</volume>
          , pp.
          <fpage>1370</fpage>
          -
          <lpage>1380</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nickolls</surname>
          </string-name>
          , I. Buck,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garland</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Skadron</surname>
          </string-name>
          , “
          <article-title>Scalable parallel programming with cuda</article-title>
          ,
          <source>” Queue</source>
          , vol.
          <volume>6</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rueda</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Ortega</surname>
          </string-name>
          , “Geometric algorithms on cuda,
          <source>” Journal of Virtual Reality and Broadcasting</source>
          , no.
          <issue>200</issue>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kiczales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lamping</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mendhekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Loingtier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Irwin</surname>
          </string-name>
          , “
          <article-title>Aspect-oriented programming</article-title>
          ,
          <source>” ECOOP 97 Object Oriented Programming</source>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>242</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Borowik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornaia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Giunta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>A software architecture assisting workflow executions on cloud resources</article-title>
          ,”
          <source>International Journal of Electronics and Telecommunications</source>
          , vol.
          <volume>61</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>23</lpage>
          ,
          <year>2015</year>
          . DOI: 10.1515/eletel-2015-0002.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Giunta</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>An aspect-generated approach for the integration of applications into grid</article-title>
          .,
          <source>” in Proceedings of the symposium on Applied computing, ACM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , and G. Capizzi, “
          <article-title>An hybrid neuro-wavelet approach for long-term prediction of solar wind</article-title>
          ,”
          <source>in IAU Symposium</source>
          , no.
          <issue>274</issue>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>249</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Giunta</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>Superimposing roles for design patterns into application classes by means of aspects,”</article-title>
          <source>in Proceedings of Symposium on Applied Computing (SAC)</source>
          , ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Marszałek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabryel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          , “
          <article-title>Modified merge sort algorithm for large scale data sets</article-title>
          ,
          <source>” Lecture Notes in Artificial Intelligence - ICAISC</source>
          '
          <year>2013</year>
          , vol.
          <volume>7895</volume>
          , pp.
          <fpage>612</fpage>
          -
          <lpage>622</lpage>
          ,
          <year>2013</year>
          . DOI: 10.1007/978-3-642-38610-7_56.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          , “
          <article-title>Automatically characterising components with concerns and reducing tangling</article-title>
          ,”
          <source>in Proceedings of Computer Software and Applications Conference</source>
          (COMPSAC) workshop QUORS
          , IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Krueger</surname>
          </string-name>
          , “Software reuse,
          <source>” ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>24</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>183</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>An agent-driven semantical identifier using radial basis neural networks and reinforcement learning</article-title>
          ,”
          <source>in XV Workshop “Dagli Oggetti agli Agenti”</source>
          , vol.
          <volume>1260</volume>
          , CEUR-WS,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Auerbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Bacon</surname>
          </string-name>
          , P. Cheng, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rabbah</surname>
          </string-name>
          , “
          <article-title>Lime: a javacompatible and synthesizable language for heterogeneous architectures</article-title>
          .,
          <source>” in Proceedings of the ACM International Conference on ObjectOriented Programming Systems, Languages, and Applications</source>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>108</lpage>
          , ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fonseca</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Cabral</surname>
          </string-name>
          , “
          <article-title>Aeminiumgpu: An intelligent framework for gpu programming</article-title>
          ,”
          <source>in Proceedings of Facing the Multicore-Challenge III</source>
          , (Stuttgart),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Boyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Skadron</surname>
          </string-name>
          , and W. Weimer, “
          <article-title>Automated dynamic analysis of cuda programs</article-title>
          ,”
          <source>in Third Workshop on Software Tools for MultiCore Systems</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mongiovi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fornaia</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>Combining static and dynamic data flow analysis: a hybrid approach for detecting data leaks in Java applications</article-title>
          ,”
          <source>in Proceedings of Symposium on Applied Computing (SAC)</source>
          , ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Calvagna</surname>
          </string-name>
          and E. Tramontana, “
          <article-title>Delivering dependable reusable components by expressing and enforcing design decisions</article-title>
          ,”
          <source>in Proceedings of Computer Software and Applications Conference (COMPSAC) Workshop QUORS</source>
          , pp.
          <fpage>493</fpage>
          -
          <lpage>498</lpage>
          , IEEE,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laudani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo</surname>
          </string-name>
          <string-name>
            <surname>Sciuto</surname>
          </string-name>
          , “
          <article-title>Optimal thicknesses determination in a multilayer structure to improve the spp efficiency for photovoltaic devices by an hybrid femcascade neural network based approach</article-title>
          ,” in Power Electronics, Electrical Drives,
          <source>Automation and Motion (SPEEDAM)</source>
          ,
          <source>2014 International Symposium on</source>
          , pp.
          <fpage>355</fpage>
          -
          <lpage>362</lpage>
          , IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Giunta</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>A redundancy-based attack detection technique for java card bytecode</article-title>
          ,”
          <source>in Proceedings of International WETICE Conference</source>
          , pp.
          <fpage>384</fpage>
          -
          <lpage>389</lpage>
          , IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>Using modularity metrics to assist move method refactoring of large systems</article-title>
          ,”
          <source>in Proceedings of International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS)</source>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>534</lpage>
          , IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          and E. Tramontana, “
          <article-title>Suggesting extract class refactoring opportunities by measuring strength of method interactions</article-title>
          ,”
          <source>in Proceedings of Asia Pacific Software Engineering Conference (APSEC)</source>
          , IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , G. Capizzi, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , “
          <article-title>Some remarks on the application of rnn and prnn for the charge-discharge simulation of advanced lithium-ions battery energy storage</article-title>
          ,” in Power Electronics, Electrical Drives,
          <source>Automation and Motion (SPEEDAM)</source>
          ,
          <source>2012 International Symposium on</source>
          , pp.
          <fpage>941</fpage>
          -
          <lpage>945</lpage>
          , IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          , “
          <article-title>Detecting extra relationships for design patterns roles</article-title>
          ,” in Proceedings of AsianPlop,
          <year>March 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>G. Lo</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana,
          <article-title>“A cascade neural network architecture investigating surface plasmon polaritons propagation for thin metals in openmp</article-title>
          ,
          <source>” in Artificial Intelligence and Soft Computing</source>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>33</lpage>
          , Springer International Publishing,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>GPU Gems 3</article-title>
          .
          <source>Addison-Wesley</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Marszałek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          , and M. Woźniak, “
          <article-title>Simplified firefly algorithm for 2d image key-points search</article-title>
          ,” in
          <source>2014 IEEE Symposium on Computational Intelligence for Human-like Intelligence</source>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>125</lpage>
          , IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , “
          <article-title>Recurrent neural network-based control strategy for battery energy storage in generation systems with intermittent renewable energy sources</article-title>
          ,” in
          <source>IEEE International Conference on Clean Electrical Power (ICCEP)</source>
          , pp.
          <fpage>336</fpage>
          -
          <lpage>340</lpage>
          , IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          , “
          <article-title>Basic concept of cuckoo search algorithm for 2D images processing with some research results: An idea to apply cuckoo search algorithm in 2D images key-points search</article-title>
          ,” in
          <source>SIGMAP 2014 - Proceedings of the 11th International Conference on Signal Processing and Multimedia Applications</source>
          ,
          <source>Part of ICETE 2014 - 11th International Joint Conference on e-Business and Telecommunications</source>
          , (28-30 August, Vienna, Austria), pp.
          <fpage>157</fpage>
          -
          <lpage>164</lpage>
          , SciTePress,
          <year>2014</year>
          . DOI: 10.5220/0005015801570164.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Bonanno</surname>
          </string-name>
          , G. Capizzi,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSs by using WRNN predictors and GPU parallel solutions</article-title>
          ,” in
          <source>Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), 2014 International Symposium on</source>
          , pp.
          <fpage>1077</fpage>
          -
          <lpage>1084</lpage>
          , IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <collab>NVIDIA Corporation</collab>
          ,
          <source>NVIDIA CUDA Compute Unified Device Architecture Programming Guide</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          , E. Tramontana, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zappalà</surname>
          </string-name>
          , “
          <article-title>A cloud-distributed GPU architecture for pattern identification in segmented detectors big-data surveys</article-title>
          ,”
          <source>The Computer Journal</source>
          , p.
          <fpage>bxu147</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Giunta</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>Aspects and annotations for controlling the roles application classes play for design patterns</article-title>
          ,”
          <source>in Proceedings of the Asia Pacific Software Engineering Conference (APSEC)</source>
          , pp.
          <fpage>306</fpage>
          -
          <lpage>314</lpage>
          , IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>