Parallel Computing Solutions for Linear Combination of Filters

Andrea Cavarra, Dario Caramagno
University of Catania
Viale A. Doria 6, 95125 Catania, Italy

Abstract—GPUs (Graphics Processing Units) are the future of high performance computing and provide a parallel programming model for general purpose applications through the CUDA programming interface. The programming model of the GPU architecture is significantly different from the traditional CPU one. This paper presents the advantages of the GPU architecture by proposing an algorithm for the linear combination of digital filters in image processing, implemented as a parallel GPU version (a sequential CPU version has also been implemented for comparison). The use of the parallel CUDA architecture has enabled us to take advantage of the GPU, obtaining an increase in performance compared to the CPU.

Index Terms—Matrix Convolution, Image Processing, GPGPU, Parallel Computing, CUDA.

I. INTRODUCTION

The evolution of computers has made it possible to perform ever more complex calculations and has improved the performance of simulations in scientific fields. Over a few decades, technology has raised the CPU (Central Processing Unit) frequency to an upper limit of about 3 GHz and introduced multi-core processors (dual core, quad core and octa core), though with limits imposed by power dissipation and rising temperature.

One of the solutions proposed to push past such limits is the use of the GPU (Graphics Processing Unit). Typically, GPUs handle with high performance the huge amount of data required by three-dimensional graphical applications, and this has led to the introduction of GPGPU (General-Purpose computing on GPU). GPGPU programming was initially implemented through two kinds of APIs (Application Programming Interfaces): the complex OpenGL [1] and DirectX [2].

In recent years NVIDIA has developed CUDA (Compute Unified Device Architecture). This platform is much simpler than the previous APIs and allows programming in a high level language based on C, using a parallel computing model. NVIDIA CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without requiring any knowledge of OpenGL. The computational power of these architectures has grown considerably compared to CPUs, thanks to the GPU's ability to perform a huge number of simple operations in parallel. GPU cores are much simpler than CPU cores; this allows the realization of GPU architectures in which the silicon area devoted to control and management is greatly reduced compared to that of a CPU, in favor of a large increase in the number of cores [3].

GPUs therefore allow us to compute in parallel, in an optimal way, code of a certain complexity, such as the application of digital filters to two-dimensional arrays. This paper proposes the use of the GPU architecture to implement digital filters with parallel code that computes a linear combination of digital filters applied to images. The aim is to compare the execution speed of the parallel code with that of the sequential code. The remaining parts of the paper explain the parallel and serial approaches.

II. DIGITAL FILTERS

A digital filter performs mathematical operations on discrete-time sampled signals in order to enhance or reduce certain characteristics of the signal. Such filters are implemented as software components, often through mathematical functions or matrices, and loaded into processors with programmable hardware. Cost and speed depend closely on the processor used.

The application of digital filters offers enormous advantages over analog filters, since the possible transfer functions are much more flexible for digital filters. The main advantages are:

1) high accuracy, due to the absence of physical components;
2) a digital filter is easier to design and implement, automatically modifying its frequency response as the input changes;
3) flexibility to change the digital filter parameters without changing the system hardware;
4) easy simulation and design of the filters, which, being implemented in software, reduce system complexity.

These properties are essential to the implementation of a high quality filter. As mentioned previously, these advantages are accompanied by limitations of speed and cost, related to the processor used, and of frequency. The frequency limit is described by the Nyquist theorem:

    fs > 2B    (1)

which imposes a maximum limit on the frequencies the filter can handle in order to avoid additional aliasing distortion of the signals. Here B is the signal bandwidth and fs is the sampling frequency.
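To make the idea of "mathematical operations on discrete-time sampled signals" concrete, the following minimal C sketch applies a 3-tap moving-average FIR filter to a short sampled signal. The sample values and the filter coefficients are illustrative choices of ours, not taken from the paper:

    #include <stdio.h>

    #define N 8  /* number of input samples */

    int main(void)
    {
        /* a short discrete-time input signal (illustrative values) */
        float x[N] = {0.0f, 1.0f, 4.0f, 9.0f, 16.0f, 9.0f, 4.0f, 1.0f};
        float y[N] = {0.0f};

        /* 3-tap moving average: y[n] = (x[n-1] + x[n] + x[n+1]) / 3;
           the two border samples are left unfiltered for simplicity */
        for (int n = 1; n < N - 1; n++)
            y[n] = (x[n - 1] + x[n] + x[n + 1]) / 3.0f;

        for (int n = 0; n < N; n++)
            printf("y[%d] = %f\n", n, y[n]);
        return 0;
    }

Changing the three coefficients changes the frequency response without touching any hardware, which is precisely advantage 3) above.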
Digital filters are widely used in image filtering and image processing. This paper examines, implements and tests an algorithm to filter digital images with a GPU implementation.

A. Digital filters for image processing

Digital filters are an essential tool for image processing. A digital image can be defined as a two-dimensional array, or matrix, in which each element represents a pixel of the image. Images are processed by an algorithm: the filtering process, also known as convolution, applies a mask, i.e. a matrix of standard size such as 3x3, 5x5 or 9x9, to the image. By means of the convolution, this mask applies different mathematical operators to the image to achieve digital image processing.

Fig. 1: 2D Convolution

The use of filters in digital image processing allows operations such as:

1) the extraction of information from the image, such as the detection of boundaries;
2) the enhancement of details, such as an increase of light intensity or color contrast;
3) the elimination or reduction of noise in images.

This processing has a high computational cost, especially for large images. For this reason a sequential approach, based on the CPU, is not very efficient, and parallel approaches have been introduced to increase performance. This paper presents an algorithm that applies a linear combination of filters as local processing, based on a model of parallel computation on GPU architectures. For our local processing filters, the image processing method consists of applying a function to each original pixel value and to an appropriate range of neighboring pixels, within the radius of the filter matrix. This method is based on the convolution operator, for which a brief explanation is needed.

B. Convolution

As previously said, the algorithm executes an operation on an input matrix and pre-established filters, in order to provide as a result a linear combination of the input. The 2D convolution between two continuous functions, f(x, y) and g(x, y), is defined by the formula:

    f(x, y) ∗ g(x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(α, β) g(x − α, y − β) dα dβ    (2)

This expression is brought to the discrete domain by assuming f(x, y) and g(x, y) to be two discrete arrays of limited size, giving the double sum:

    f(x, y) ∗ g(x, y) = Σ_{α=0}^{M−1} Σ_{β=0}^{N−1} f(α, β) g(x − α, y − β)    (3)

for x = 0, 1, . . . , M − 1 and y = 0, 1, . . . , N − 1.

In discrete time this convolution operation amounts to a simple "local average" over a range whose amplitude is defined by the kernel size, producing a value of the output matrix at the same position as the source data, as shown in figure 1. The operation is carried out from left to right, for each element of the original matrix, obtaining the resulting convolution matrix. The convolution algorithm therefore implements the digital filtering operation: it suffices to modify the kernel in order to change the type of filter.
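A minimal sequential C implementation of the discrete convolution (3) with a 3x3 mask might look as follows. This is our sketch, not the paper's code; in particular, treating out-of-range neighbours as zero is an assumption, since the paper does not state its border policy:

    #define KORDER  3                     /* kernel (mask) order: 3x3 */
    #define KRADIUS ((KORDER - 1) / 2)    /* kernel radius */

    /* Convolve a rows-by-cols image, stored row by row in a 1D array,
       with a KORDER x KORDER kernel. */
    void convolve2d(const float *in, float *out, int rows, int cols,
                    const float kernel[KORDER][KORDER])
    {
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                float acc = 0.0f;
                for (int i = 0; i < KORDER; i++) {
                    for (int j = 0; j < KORDER; j++) {
                        int yy = y + i - KRADIUS;
                        int xx = x + j - KRADIUS;
                        if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                            acc += kernel[i][j] * in[yy * cols + xx];
                    }
                }
                /* output lands at the same position as the source pixel */
                out[y * cols + x] = acc;
            }
        }
    }

The four nested loops make the O(rows x cols x KORDER^2) cost explicit; this is the work that the parallel version distributes across GPU threads.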
We have studied the digital filtering operation of an image, in ASCII format, through two filters of ninth order. The next section describes the convolution operation implemented as a parallel algorithm that can be executed on a CUDA GPU.

III. PARALLEL COMPUTING ON GPU

This paper presents the implementation of the convolution algorithm for a two-dimensional matrix of input data with a linear combination of filters, using a parallel model. Two versions of the same algorithm are presented in order to compare the efficiency of the parallel computation with that of the sequential one, and thereby demonstrate how parallel computing support is far more suitable for high performance computing.

A. GPU Architecture

In recent decades the high performance computing sector has grown exponentially with the widespread use of GPGPU. The use of GPUs in this ever-expanding domain can exploit the power of parallel computing for increasingly complex simulations, thanks to the inherently parallel nature of GPUs.

CUDA compatible GPUs are based on an architecture made up of a number of MIMD (Multiple Instruction Multiple Data) multiprocessors called Streaming Multiprocessors, whose number depends on the specification and performance class of the GPU. These Streaming Multiprocessors are the basic units of the GPU architecture; they are implemented as SIMD (Single Instruction Multiple Data) units, which NVIDIA calls SIMT (Single Instruction Multiple Threads), each containing 8 processors called Streaming Processors, or CUDA cores. In this architecture each Streaming Multiprocessor is able to create, manage, schedule and execute groups of 32 threads called warps. A warp executes one instruction at a time, so efficiency is maximized when all 32 threads of a warp agree on their execution path. When one or more blocks of threads are assigned to a multiprocessor for execution, they are partitioned into warps, scheduled by a warp scheduler and executed one at a time. Each of these processors can perform simple mathematical operations (such as addition, subtraction, multiplication, etc.) on integers or floating point numbers. Inside each multiprocessor there are also a shared memory, accessible only by the processors of that multiprocessor, caches for instructions and data, and, finally, a unit for decoding the instructions. Each multiprocessor also has access to a global memory shared among all GPU multiprocessors, called Device Memory [3].

The programming model of CUDA C organizes the program into a sequential part executed on the CPU, the host, which mainly performs memory allocation and kernel calls, and a parallel part, called kernel, executed on the GPU, the device. This structure requires the presence, inside the program, of instructions that are executed sequentially on the host, interspersed with calls to the kernel that carry out entire parts of the program in parallel on the device, as shown in figure 2.

Fig. 2: Model of programming in CUDA

A kernel is the model of parallel execution of code on a device; it is defined as a grid divided into a certain number of blocks, each of which is assigned to a multiprocessor. Inside each block there is a number of fundamental computational units called threads. Kernels, or grids, are executed sequentially with respect to one another, while blocks and threads are executed in parallel, following a SIMT data-parallel model. Each thread belongs to a single block and is uniquely identified within the kernel by an index; in this way memory addressing is simplified, especially when processing multi-dimensional data. Within each block the kernel also has a shared memory accessible only to the threads of that block. The logical subdivision of a kernel into grid and blocks is a crucial aspect of CUDA code for obtaining parallelization; careful organization and management of the threads allows the implementation of more efficient code.
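The host/device structure just described can be summarized by the following minimal CUDA C skeleton. It is a sketch of the general pattern, not the paper's program; the trivial scaling kernel and all names are our own illustration:

    #include <cuda_runtime.h>

    /* Device code (the kernel): each thread processes one element. */
    __global__ void scale(float *data, int n, float factor)
    {
        int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id < n)              /* guard threads beyond the data size */
            data[id] *= factor;
    }

    /* Host code: allocation, transfers and the kernel call. */
    int main(void)
    {
        const int n = 1024;
        float h_data[1024];
        for (int i = 0; i < n; i++) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice);

        int threads = 256;                        /* threads per block  */
        int blocks = (n + threads - 1) / threads; /* blocks in the grid */
        scale<<<blocks, threads>>>(d_data, n, 2.0f);

        cudaMemcpy(h_data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        return 0;
    }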
IV. PRESENTATION OF THE ALGORITHM

The algorithm developed in this work executes the convolution of an input matrix A with pre-established kernels, generating an output matrix Ã given by the linear combination of the results according to the formula:

    A ⟹ Ã = α(k1 ∗ A) + β(k2 ∗ A) + γA    (4)

While executing this operation we want to generate the best parallelized code, with the aim of improving its performance compared to the serial version. The algorithm can be summed up in three basic steps:

1) acquisition of the input data matrix;
2) convolution of the input matrix with the pre-established kernels;
3) linear combination of the results (see the sketch below).

Obviously, each of these steps is performed both by the sequential version of the code and by the parallel one. The following details the basic steps outlined above, focusing our attention on the parallel algorithm.
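As an illustration of step 3, a per-element CUDA kernel computing the combination (4) could look as follows, assuming the two convolutions k1 ∗ A and k2 ∗ A have already been computed in step 2. Names and structure are our sketch, not the paper's code:

    /* One thread combines one element of the two convolved matrices
       with the corresponding element of the original input A, as in
       eq. (4): Atilde = alpha*(k1*A) + beta*(k2*A) + gamma*A. */
    __global__ void linear_combination(const float *convK1,
                                       const float *convK2,
                                       const float *A, float *Atilde,
                                       int n, float alpha, float beta,
                                       float gamma)
    {
        int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id < n)
            Atilde[id] = alpha * convK1[id] + beta * convK2[id]
                       + gamma * A[id];
    }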
A. Parallel algorithm solution

The parallel solution of the algorithm starts on the host with code that acquires the input: it reads the data from a text file in ASCII code and stores them in an array. Instead of a 2D array, which is complex to manage, we used a one-dimensional array in which the rows of the input matrix are laid out one after the other; the data are then reconstructed through an index. In this way the data can be transferred from host to GPU with the standard function cudaMemcpy(). After choosing the number of threads and organizing the blocks, the kernel call is performed in order to carry out the parallel computation.

Inside the kernel a thread is assigned to each input element, so that the execution of the code is entrusted to a set of parallel threads arranged in blocks and indexed using the following formula:

    id = threadIdx.x + (blockIdx.x*blockDim.x);

Each thread is assigned a position in the input array and computes the convolution in parallel with the other threads of its block. Since the input data are stored as a one-dimensional array, it was necessary to implement a jump mechanism that allows moving between the various locations of the array in order to compute the products between the elements of the kernel (in our case a simple 3x3 filter) and the components located in the neighborhood of the source location. This problem is solved using the radius of the kernel, defined as:

    Kernel_radius = (Kernel_order - 1)/2;

In this way we can identify all the elements lying in the range of the position of interest. The kernel radius is subtracted from the input array index, and a variable that increases cyclically through a simple for loop is added to it. This lets us multiply, in parallel, the elements of the input with the kernel elements lying in the right positions around the central element of the first row of the kernel, as shown below:
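The paper's listing is truncated at this point in the source. The following is a hedged reconstruction, under the indexing scheme just described, of the per-thread loop, extended from the first kernel row to all three rows; the width and height parameters, the variable names and the border check are our assumptions:

    /* Reconstruction sketch: one thread computes one output pixel of the
       convolution with a 3x3 filter, walking the flattened 1D input
       through the jump mechanism (one row = 'width' elements). */
    __global__ void convolve(const float *input, float *output,
                             const float *kernel, int width, int height)
    {
        const int Kernel_order  = 3;
        const int Kernel_radius = (Kernel_order - 1) / 2;

        int id = threadIdx.x + (blockIdx.x * blockDim.x);
        if (id >= width * height) return;

        int x = id % width;   /* column recovered from the 1D index */
        int y = id / width;   /* row recovered from the 1D index    */

        float sum = 0.0f;
        for (int i = 0; i < Kernel_order; i++) {       /* kernel rows    */
            for (int j = 0; j < Kernel_order; j++) {   /* kernel columns */
                int yy = y + i - Kernel_radius;
                int xx = x + j - Kernel_radius;
                if (yy >= 0 && yy < height && xx >= 0 && xx < width)
                    sum += input[yy * width + xx]
                         * kernel[i * Kernel_order + j];
            }
        }
        output[id] = sum;
    }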