<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Analysis of Roberts Edge Detection Using CUDA and OpenGL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Marco Calì</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valeria Di Mauro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Catania</institution>
          ,
          <addr-line>Catania</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>55</fpage>
      <lpage>62</lpage>
      <abstract>
<p>The evolution of high-performance and programmable graphics processing units (GPUs) has generated considerable advancements in graphics and parallel computing. In this paper we present a Roberts filter based edge detection algorithm using the CUDA and OpenGL architectures. The basic idea is to use the Pixel Buffer Object (PBO) to create images with CUDA on a pixel-by-pixel basis and display them using OpenGL. The images can then be processed by applying a Roberts filter for edge detection. Finally, the paper describes the results of an extensive measurement campaign, as well as several comparisons between the code performance on CPUs and GPUs. The results are very promising, since the GPU parallel version offers much higher performance than the CPU sequential version: the execution time of the GPU parallel version is much lower than that of its sequential equivalent. Index Terms: CUDA, GPU, Image Processing, OpenGL, PBO, Roberts Edge Detection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Technological development has increased computer
performance, and applications can perform more complex
tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Recently, the field of information retrieval has
seen tremendous growth thanks to the newly available
computational power and architectural support. A
large variety of scientific and industrial applications requires
the analysis of 2D images. In order to extract information from
images, it is often possible to apply specific filters to perform
segmentation, edge identification, sharpening, etc. One of the most
commonly used edge detection systems is based on the Roberts
method.
      </p>
      <p>In this paper we present a GPU parallel algorithm
that processes a large number of images in a short time. The solution
can then be readily used for the realization of real-time devices
and systems for industrial applications. A real-time execution
requires optimizing the processing time for image elaboration
and analysis. In this work we use the Compute Unified
Device Architecture (CUDA): an NVIDIA architecture which
provides support for interacting with the Graphics Processing Unit
(GPU) for general-purpose computing. The solution presented
works with pgm grayscale images through functions implemented
with the OpenGL libraries. Examples of the features
enabled by such functions are:
loading an image in RAM;
mapping an image to a Pixel Buffer Object (PBO);
fast transfer of the PBO data to GPU memory.</p>
      <p>OpenGL libraries provide direct control of the image, as
well as making it possible to track the changes of the image
without developer intervention. In this paper we have also measured
the execution times and compared them to heterogeneous code
execution on the CPU, and then we have observed the speedup
advantages. The experiments were performed taking into account
different image sizes and numbers of images.
Copyright © 2016 held by the authors.</p>
      <p>In Section II the basis of the edge recognition operators will be
introduced, with particular attention to the Roberts operator
and the respective kernel. In Section III an introduction to
GPU programming is presented, as well as the multithreading
logic used in CUDA; moreover, the use of the OpenGL libraries for
image processing is explained with reference to the Pixel Buffer
Object. Portions of the developed algorithm and the related
comments are discussed in Section IV. Finally, in Section V
a performance analysis is presented, as well as a comparison
between the sequential and the parallel algorithm.</p>
    </sec>
    <sec id="sec-2">
      <title>II. IMAGE PROCESSING</title>
      <p>
        The digital filters used in image pre-processing can be
represented as mathematical operators. Such digital filters may be
used to reduce noise, improve contrast, separate objects from
the background, etc. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It is possible to obtain
substantial image improvement using parallel processing, also
for real-time solutions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Among the different types of image
enhancement algorithms, spatial convolution kernel filtering
represents the most computationally demanding scenario.
      </p>
      <p>A convolution kernel replaces every pixel with a new one
whose value is based on the relationship between the old
pixel value and the values of the pixels that surround it. In such a
convolution procedure, two functions are overlaid: the pixel values
of the original image are stored in memory, and the mask of the
convolution kernel is applied to them. Kernels can vary in
dimension, determining a different number of neighbouring
pixels involved in the convolution. The kernel operates on the
image by replacing the original pixels one by one; therefore
the operation must be performed by applying the convolution
operator to each pixel in the image. A typical convolution
kernel mask operation is represented in Fig. 1.</p>
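      <p>As a concrete illustration, the per-pixel convolution step described above can be sketched in plain C. This is our own minimal sketch, not code from the paper: the helper name convolve3x3 and the fixed 3x3 kernel size are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical sketch: convolve one interior pixel (x, y) of a w-wide image
 * with a 3x3 kernel mask. The new value is the weighted sum of the old pixel
 * and its eight neighbours, the relationship described in the text. */
int convolve3x3(const int *img, int w, int x, int y, const int k[9])
{
    int acc = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            acc += img[(y + dy) * w + (x + dx)] * k[(dy + 1) * 3 + (dx + 1)];
    return acc;
}
```

With the identity kernel (a single 1 in the centre) the pixel is left unchanged; larger kernels would involve more neighbouring pixels, as noted above.
      </p>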
      <p>The fields of image processing and computer vision often
make use of procedures like edge detection. Such a procedure
is particularly useful for feature extraction. Edge
detection discards information in order to focus only on a
specific set of variations in the image, based on the geometric
and structural characteristics of the examined objects, in order
to recognize their borders. An example of such a variation is
constituted by a region of pixels where the light intensity
undergoes abrupt changes. Those rapidly varying regions can
be identified as the edges of a certain object in the image. The
edge detection method used here is based on the digital Roberts filter. The
Roberts filter operator approximates the intensity gradient of
the brightness using two different kernels:
Gx = [ -1 0 ; 0 +1 ]   (1)
Gy = [ 0 -1 ; +1 0 ]   (2)</p>
      <p>Gx, also called the horizontal kernel, is able to enhance the
horizontal component of the intensity gradient at each point; Gy, also
called the vertical kernel, enhances the vertical
component. When combined, Gx and Gy give the Roberts
kernel</p>
      <p>|G| = sqrt(Gx^2 + Gy^2)</p>
      <p>When G is applied to an image, a threshold is used to
determine which values indicate a boundary and which do
not. As the gradient value of a pixel increases, the corresponding
lines tend to white.</p>
    </sec>
    <sec id="sec-3">
      <title>III. GPU COMPUTING</title>
      <p>
        The GPU architecture consists of a scalable number
of streaming multiprocessors (SMs), each containing eight
streaming processor (SP) cores, two special function units
(SFUs), a multithreaded instruction fetch and issue unit, a
read-only constant cache, and a 16 KB read/write shared
memory [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The SM executes a batch of 32 threads together, called
a warp [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Unlike SIMD instructions, the concept of a warp
is not exposed to programmers; rather, programmers write a
program for one thread and then specify the number of parallel
threads in a block, and the number of blocks in a kernel grid.
      </p>
      <p>In an SM, all the threads of a block are executed together.
On the other hand, an SM can manage multiple concurrently
running blocks. The number of blocks running on an SM is
determined by the resource requirements of each block, such
as its registers and shared memory usage. (SIMD and MIMD are
types of parallel architectures identified in Flynn's taxonomy,
which basically says that computers have Single or Multiple
streams of Instructions processing Single or Multiple Data.)
We will use the adjective active for the blocks executed on one
SM at a certain moment. One block is typically executed as
several warps, and the number of warps is the number of threads
in the block divided by the number of threads contained in one
warp. The latter on Nvidia GPU cards has generally been 32, but
could change in the future for new card models.</p>
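      <p>The warp count just described follows from a simple ceiling division; a small illustrative helper (the names are ours, and warp_size would be 32 on the Nvidia cards discussed here):

```c
/* Warps needed to execute one block: the block's thread count divided by
 * the warp size, rounded up (a partially filled warp still occupies a slot). */
int warps_per_block(int threads_per_block, int warp_size)
{
    return (threads_per_block + warp_size - 1) / warp_size;
}
```
      </p>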
      <p>
        The CUDA architecture makes it possible to have direct access
to the GPU instruction set, enabling us to use such a GPU
card for general-purpose parallel computing. Management
and programming are supported by the CUDA APIs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Each CUDA thread is mapped to a GPU core. The GPU
card can execute one or more kernel grids, while the
streaming multiprocessors (SMs) execute one or more
blocks. The CUDA architecture provides APIs and directives in
order to be compatible with different standard programming
languages such as C, C++, Fortran, etc. The main advantages
are then the easy implementation possibilities offered to
developers who have experience with the said standard
programming languages, and the possibility to achieve highly
modular software systems [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Moreover, the CUDA
architecture has proven to be a more efficient way to
port developed software systems with respect to other GPU
oriented technologies such as Direct3D and OpenGL.
      </p>
      <p>
        Nevertheless, the OpenGL (Open Graphics Library) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] offers a great number of software tools for the
development of GPU oriented code. OpenGL offers
cross-language and multi-platform application programming
interfaces (APIs) generally used in rendering applications. These
APIs are used to interact with a GPU to achieve
hardware-accelerated rendering; therefore OpenGL allows applications
to use advanced graphics on relatively small systems. It is
possible to mix OpenGL and CUDA technology in order to
enhance the performance of an application. An example is
given by the use of the PBO (Pixel Buffer Object), which makes
it possible to import multidimensional structures on a
pixel-by-pixel basis and display them using the related OpenGL
APIs. The advantage of such an approach is to directly obtain
an efficient mapping of pixels into threads [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In
fact, a PBO is a specific portion of the video memory in which
it is possible to render images that can be transformed into
textures. Another important advantage of the approach is the
speed of pixel data transfer through a DMA (Direct Memory
Access) channel [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The OpenGL PBO mechanism also
permits asynchronous data transfers between the host and the
device [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. It is therefore important to correctly schedule
the workload between different memory transfers in order to
maximise the performance obtainable with the asynchronous
approach.
      </p>
      <p>The texture data are loaded from an image source (an image
file or a video stream) that can be directly loaded into a PBO,
which is controlled by OpenGL. Fig. 2 gives a simplified
schema of the texture transfer using the Pixel Buffer Object. Of
course, while the transfer is asynchronous, a certain amount of
CPU workload is still needed in order to transfer data
from the host memory to the PBO. Then, after such transfers,
the GPU controllers (driven by the OpenGL drivers)
copy the data from the PBO to a texture object. This means that
OpenGL performs a DMA transfer operation without wasting
CPU cycles, so the CPU benefits from a lower workload and
can perform other operations without waiting for such a transfer
to be completed.</p>
      <p>
        Since in this work we make use of both OpenGL and CUDA
directives, in order to correctly map the pixels using the PBO,
it is first necessary to create a GL context (which is OS specific)
and a CUDA buffer registration. After that it is necessary to set
up the GL viewport and coordinate system [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], generate one
or more GL buffers to be shared with the CUDA application
and, subsequently, register these buffers within the application
itself (see Fig. 3).
      </p>
      <p>
        The algorithm used to upload data using PBOs [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is as
follows.
      </p>
      <p>Fig. 4. Upload Application Layout.
1) Creating a PBO on the GPU using glGenBuffers
2) Binding the PBO to the unpack buffer target
3) Allocating buffer space on the GPU with glBufferData
4) Mapping the PBO to CPU memory, denying GPU access
for now; glMapBuffer returns a pointer to the place in
GPU memory where the PBO resides
5) Copying data from CPU to GPU using the pointer returned
by glMapBuffer
6) Unmapping the PBO (glUnmapBuffer) to allow the GPU
full access to the PBO again
7) Transferring data from the buffer to a texture target
8) Unbinding the PBO to allow for normal operation again
Steps one to three are only necessary during initialization,
while steps four to eight have to be performed every time the
texture needs to be updated.</p>
      <p>Fig. 4 shows a generic upload application layout.</p>
      <p>B. Download</p>
      <p>PBOs can also be used to download data back to the CPU.
There is one problem which must be addressed though; as
the download is an asynchronous operation (using DMA) one
must make sure that the GPU does not clear the render buffer
before the transfer is complete.</p>
      <p>The following is a description of how to download data
using PBOs.</p>
      <p>1) Generating a PBO on the GPU using glGenBuffers
2) Binding the PBO to the pack buffer target
3) Allocating buffer space on the GPU according to the data
size using glBufferData
4) Deciding which framebuffer to read using glReadBuffer.
One can also read directly from a texture using
glGetTexImage, skipping the next step
5) Using glReadPixels to read pixel data from the targeted
framebuffer into the bound PBO. This call does not block,
as it is asynchronous when reading into a PBO as opposed
to CPU controlled memory. Then mapping the PBO to CPU
memory, denying GPU access for now; glMapBuffer returns
a pointer to the place in GPU memory where the PBO resides
6) Copying data from GPU to CPU using the pointer returned
by glMapBuffer; during the copy, each value is kept within
the boundaries of the PBO data range:
// Boundary control: Median range control
if Median lower than inf then
return inf
else if Median higher than sup then
return sup
end
return (unsigned char) Median
// because the PBO works using char data
7) Unmapping the PBO (glUnmapBuffer) to allow the GPU
full access to the PBO again
8) Unbinding the PBO to allow for normal operation again
Steps one to three are only necessary during initialization,
while steps four to eight need to be performed every time
new data have to be downloaded. However, downloads and
uploads still involve a GPU context switch and cannot be done in
parallel with GPU processing or drawing. Multiple PBOs
can potentially speed up the transfers.</p>
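      <p>The boundary-control step embedded in the copy above, which clamps the computed value to the range the PBO can store before narrowing it to a char, can be sketched in plain C; the function name and the [inf, sup] parameters are our own illustration:

```c
/* Clamp a computed value into [inf, sup] before narrowing it to
 * unsigned char, because the PBO stores char-sized data. */
unsigned char clamp_median(int median, int inf, int sup)
{
    if (median < inf)
        return (unsigned char)inf;
    if (median > sup)
        return (unsigned char)sup;
    return (unsigned char)median;
}
```
      </p>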
    </sec>
    <sec id="sec-4">
      <title>IV. ROBERTS ALGORITHM</title>
      <p>The Roberts edge detection filter can be broken down into different
steps. First we need to set up the image data objects which
will store the intermediate and final results, obtaining the size
of the input image with the OpenGL libraries. Next, we have
to perform the computation using the Roberts operators and the sum of
the Roberts operators to create an edge image with CUDA.
The Roberts edge detection procedure is displayed in Fig. 5.</p>
      <p>
        The algorithm using the features provided by OpenGL and
CUDA will be described below [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>We now describe how to create a Roberts convolution kernel
and apply it to an image in CUDA. The following pseudocode
is for Roberts edge detection.
// Roberts Edge Detection
Matrix2x2_original_image(upper_left, upper_right, lower_left, lower_right)
// Horizontal kernel
principal_diagonal = lower_right - upper_left;
// Vertical kernel
secondary_diagonal = lower_left - upper_right;
// Module of gradient
Median = square(abs(principal_diagonal) + abs(secondary_diagonal))</p>
      <p>The following pseudocode copies the original image. The
main operations are: copying the image and inserting the original data.
// Define the memory space to be used on the GPU
*pointer_new_array = pointer_original_array + blockIdx.x * Dimension_z
cycle (start at threadIdx.x; stop at width; increase by blockDim.x)
Then, a texture lookup is performed in a given 2D sampler with
the function tex2D and associated to the pointer of the new array
at the position given by the cycle step.</p>
      <p>The elaboration of the image is performed by means of the
following steps: initializing an array in CUDA for the elaborated
image and inserting the original data.
// Define the memory space to be used on the GPU
*pointer_new_array = pointer_original_array + blockIdx.x * Dimension_z
cycle (start at threadIdx.x; stop at Dimension_z; increase by blockDim.x)
Then, a texture lookup is performed in a given 2D sampler with
the function tex2D and the result is inserted into the Roberts matrix.
upper_left  = tex2D(lookup, coordinate_to_perform_lookup - 1, dimension_x - 1)
upper_right = tex2D(lookup, coordinate_to_perform_lookup,     dimension_x - 1)
lower_left  = tex2D(lookup, coordinate_to_perform_lookup - 1, dimension_x)
lower_right = tex2D(lookup, coordinate_to_perform_lookup,     dimension_x)
Finally, the Roberts matrix is associated to the pointer of the
new array.</p>
      <p>Then, it is necessary to define the setup of the texture in an
external C function, according to the following steps.</p>
      <p>Describe the format of the value that is
returned when fetching the texture through the
"cudaChannelFormatDesc" structure:
if the image is "pgm" then
assign the format unsigned char
else
assign the format uchar4
end
Allocate a CUDA array using the cudaMallocArray()
function, wrapped in checkCudaErrors, which reports the
error string in case of errors.</p>
      <p>Copy from the memory area pointed to by src to
the CUDA array dst using the cudaMemcpyToArray()
function, specifying the direction of the copy with
cudaMemcpyHostToDevice.</p>
      <p>For deleting the texture with an external C function, the
following steps are needed.</p>
      <p>Release the CUDA array using the cudaFreeArray()
function; the array must have been returned by a previous
call to cudaMallocArray().</p>
      <p>Wrap the global call that sets up the texture and
threads in an external void C function.</p>
      <p>Bind the CUDA array to the texture reference tex through the
cudaBindTextureToArray() function.
switch over the different cases do
case original image: set up the texture and threads; break
case image elaborated with ROBERTS: set up the texture and threads; break
endsw</p>
      <p>Unbind the texture bound to tex.</p>
      <p>For elaborating an image with OpenGL the code below
is used. Such a graphics library is designed to aid in
rendering computer graphics, typically by providing
optimized versions of functions that handle common rendering
tasks. It can be realized purely by code running on the CPU,
common in embedded systems, or by code running on hardware
accelerated by a GPU, more common in PCs. By employing
these functions, a program can prepare an image to be output
to a monitor. These libraries load the image data into memory,
map between screen and world coordinates, generate texture
mipmaps and also ensure interoperability with other third-party
libraries and SDKs.</p>
      <p>More specifically, for function display() the following
steps have to be performed.</p>
      <p>Map the graphics resources (the PBO) for access
by CUDA using the cudaGraphicsMapResources()
function.</p>
      <p>Return a pointer through which the mapped
graphics resource may be accessed using
cudaGraphicsResourceGetMappedPointer()
function.</p>
      <p>Unmap the graphics resources in PBO (and once
unmapped, the resources can not be accessed
by CUDA until they are mapped again) using
cudaGraphicsUnmapResources().</p>
      <p>Use the function glClear() with the bitwise OR of the masks
that indicate the buffers to be cleared; here the mask
GL_COLOR_BUFFER_BIT indicates the buffers currently
enabled for color writing.</p>
      <p>Bind a named texture to a texturing target using
glBindTexture() function; then bind a named buffer
object using glBindBuffer() function, and taking it
from PBO buffer.</p>
      <p>Specify a two-dimensional texture subimage, with target,
texture, level, xoffset, yoffset, width, height, format, type,
pixels, using glTexSubImage2D() function.</p>
      <p>Disable the server-side GL_DEPTH_TEST capability with
glDisable(), and enable the server-side GL_TEXTURE_2D
capability with glEnable().</p>
      <p>Set texture parameters by means of
glTexParameterf(), with target GL_TEXTURE_2D,
which specifies the target texture of the active
texture unit, and pname GL_TEXTURE_MIN_FILTER,
GL_TEXTURE_MAG_FILTER, GL_TEXTURE_WRAP_S,
GL_TEXTURE_WRAP_T, which specifies the symbolic
name of a single-valued texture parameter, and finally
param GL_LINEAR, GL_REPEAT, which specifies the
value of pname.</p>
      <p>Specify the primitive or primitives that will be
created from vertices presented between glBegin and
the subsequent glEnd (the primitives are specified by
glVertex2f and glTexCoord2f).
Perform a buffer swap on the layer in use for the current
window through glutSwapBuffers.
Function reshape() consists of the following main steps.</p>
      <p>Specify the affine transformation of x and y from
normalized device coordinates to window coordinates using
glViewport.</p>
      <p>Specify that GL_PROJECTION is the target for subsequent
matrix operations by means of glMatrixMode.
Replace the current matrix with the identity matrix
through glLoadIdentity, and describe a
transformation that produces a parallel projection using glOrtho.
Specify that GL_MODELVIEW is the target for subsequent
matrix operations by means of glMatrixMode.</p>
      <p>Replace the current matrix with the identity matrix using
glLoadIdentity.</p>
      <p>For function cleanup() the following steps are
performed.</p>
      <p>Unregister a graphics resource for access by CUDA
using cudaGraphicsUnregisterResource, with
resource cuda pbo.</p>
      <p>Bind a named buffer object using glBindBuffer
taking it from GL_PIXEL_UNPACK_BUFFER.</p>
      <p>Delete a buffer object named by the elements of
pbo buffer by means of glDeleteBuffers, and
delete a texture named by the elements of texid using
glDeleteTextures.</p>
      <p>In addition to the above functions, a main() function has
been implemented in order to initialize data, load a default
image, run the processing functions, and save results.</p>
    </sec>
    <sec id="sec-6">
      <title>V. PERFORMANCE ANALYSIS</title>
      <p>The performance analysis has been done by comparing the
execution time of the parallel computation with its sequential
counterpart. An Intel i7 processor and an NVIDIA GeForce 845M
GPU have been used. Detailed technical specifications are
given in TABLE I and TABLE II. The code has been compiled
using the CUDA runtime v.7 and CUDA compute capability 5.</p>
      <p>TABLE I. GPU TECHNICAL SPECIFICATION</p>
      <p>The tests were carried out using the same grayscale image
in four different sizes, in order to evaluate temporal variations
for different sizes.</p>
      <p>The image size has been varied from 128x128 pixels to
1024x1024 pixels. For each size, after loading the image, the
algorithm makes it possible to select either the original
image or the one processed (through the use of Roberts edge
detection) by using a selection menu, as shown in Fig. 6.</p>
      <p>Roberts operator provides the edge detection of the image
at small scales. If an object with a jagged boundary is present,
as it is shown in Fig. 6 (left), Roberts operator will find the
edges at each spike and twist of the perimeter as in Fig. 6
(right). The operator is sensitive to high frequency noise in
the image and will generate only local edge data instead of
recovering the global structure of a boundary.</p>
      <p>For each image, multiple runs were carried out and the
largest processing times were recorded, using the
cudaEventElapsedTime() function for the GPU version
and the SDK timer function for the CPU version. Then the
average was calculated for the different image sizes, as shown in
TABLE III.</p>
      <p>The execution times for the sequential and parallel versions
on CPU and GPU are plotted in Fig. 7. Looking at the data
in Fig. 7, we can state that the average CPU processing time
is greater than the average GPU processing time; in particular,
as the image size increases, the sequential version of
the program has a considerably longer execution time. For each
image size, the average processing time ratio between CPU
and GPU has been computed, which is in the range between
4.1 and 4.4, and it is shown in Fig. 8.</p>
      <p>Fig. 9 shows the comparison between the average processing
times on the GPU and CPU for a number of images equal to
50. It illustrates that the CPU time is higher than the GPU
time. Moreover, if we process a greater number of images,
the time difference becomes considerable.</p>
    </sec>
    <sec id="sec-7">
      <title>VI. RELATED WORKS</title>
      <p>Edge detection can be carried out by different types
of operators such as Roberts, Canny, Deriche, Differential,
Prewitt, and Sobel. Examples regarding how to use Sobel edge
detection can be seen in the following articles.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], the authors have used the Sobel kernel for edge
detection. The processing times were compared using code
developed for the CPU and the GPU, in order to observe the
improvements provided by GPU computing. Furthermore, a
study has been performed on the timing given by the use of a
field-programmable gate array (FPGA), noting that the running
times are comparable to the GPU. The authors have included in
the code the steps for loading the file, storing it in memory, the
data transport from CPU to GPU and back, and the visualization
in a window on screen, without direct control of the images by
the OpenGL libraries.
      </p>
      <p>
        Shah [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] has proposed an implementation of the Sobel
filter on a GeForce GT 130. Image processing is performed
in the CUDA environment with GPU computing; the OpenGL
libraries provide control of the frame and its insertion into
memory. In the end, the execution times were compared with
each other, taking into account a heterogeneous code on the CPU
and observing the speedup improvements on the GPU.
      </p>
    </sec>
    <sec id="sec-8">
      <title>VII. CONCLUSIONS</title>
      <p>As shown by the comparison between the execution time of
the parallel computation and its sequential counterpart, the
GPU performance results indicate that a significant speedup can be
achieved. In fact, Roberts edge detection can obtain a substantial
speedup compared to the CPU based implementations,
with an average CPU over GPU processing time ratio in
the range 4.1 to 4.4.</p>
      <p>The time reduction is significant in a great number of
processing tasks, allowing the detection of images in high
quality and in real time. This is due to GPUs, which
provide a novel and efficient acceleration technique for image
processing, cheaper than a dedicated hardware implementation.</p>
      <p>In future work, we plan to implement code which
allows independent and parallel execution cores, so as to further
reduce the execution times. The unstructured programming
logic provides freedom of architecture and an implementation
of algorithms able to execute more articulated functions. The
capacity to achieve high speeds depends on the number of cores
and on a program architecture using parallelism in the best
way, to reduce the execution times.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>An agent-driven semantical identifier using radial basis neural networks and reinforcement learning</article-title>
          ,” in XV Workshop ”From Objects to Agents
          <source>” (WOA)</source>
          , vol.
          <volume>1260</volume>
          . Catania, Italy: CEUR-WS,
          <year>September 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          , E. Tramontana,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          , and M. Woźniak, “
          <article-title>Toward work groups classification based on probabilistic neural network approach,”</article-title>
          <source>in Proceedings of International Conference on Artificial Intelligence and Soft Computing (ICAISC)</source>
          , ser. Springer LNCS, Zakopane, Poland,
          <year>June 2015</year>
          , vol.
          <volume>9119</volume>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Marszałek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          , and M. Woźniak, “
          <article-title>Simplified firefly algorithm for 2d image key-points search</article-title>
          ,” in Symposium on Computational Intelligence for Humanlike Intelligence (CHILI), ser.
          <source>Symposium Series on Computational Intelligence (SSCI)</source>
          . IEEE,
          <year>2014</year>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>125</lpage>
          . [Online]. Available: http://dx.doi.org/10.1109/CIHLI.2014.7013395
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , E. Tramontana, G. Capizzi,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Starczewski</surname>
          </string-name>
          , “
          <article-title>A multiscale image compressor with rbfnn and discrete wavelet decomposition</article-title>
          ,”
          <source>in Proceedings of IEEE International Joint Conference on Neural Networks (IJCNN)</source>
          , Killarney, Ireland,
          <year>July 2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          , DOI: 10.1109/IJCNN.2015.7280461.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabryel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nowicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , and E. Tramontana, “
          <article-title>Can we process 2d images using artificial bee colony?</article-title>
          ”
          <source>in Proceedings of International Conference on Artificial Intelligence and Soft Computing (ICAISC)</source>
          , ser. Springer LNCS, Zakopane, Poland,
          <year>June 2015</year>
          , vol.
          <volume>9119</volume>
          , pp.
          <fpage>660</fpage>
          -
          <lpage>671</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Capizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lo Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , E. Tramontana, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          , “
          <article-title>Automatic classification of fruit defects based on co-occurrence matrix and neural networks</article-title>
          ,”
          <source>in Proceedings of IEEE Federated Conference on Computer Science and Information Systems (FedCSIS)</source>
          , Lodz, Poland,
          <year>September 2015</year>
          , pp.
          <fpage>873</fpage>
          -
          <lpage>879</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wade</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Millar</surname>
          </string-name>
          , “
          <article-title>Fpga-powered display controllers enhance isr video in real time</article-title>
          ,” Z Microsystems,
          <year>May 2012</year>
          . [Online]. Available: http://mil-embedded.com/articles/fpga-powered-isr-video-real-time/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Lindholm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nickolls</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oberman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Montrym</surname>
          </string-name>
          , “
          <article-title>Nvidia tesla: A unified graphics and computing architecture</article-title>
          ,”
          <source>IEEE Micro</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>March 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] (
          <year>2009</year>
          )
          <article-title>NVIDIA's next generation CUDA compute architecture: Fermi</article-title>
          . NVIDIA Corporation. [Online]. Available: http://www.nvidia.it/page/home.html
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. Lal</given-names>
            <surname>Shimpi</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wilson</surname>
          </string-name>
          , “
          <article-title>Nvidia's geforce 8800 (g80): Gpus re-architected for directx 10</article-title>
          ,” AnandTech,
          <year>November 2006</year>
          . [Online]. Available: http://www.anandtech.com/show/2116
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Giunta</surname>
          </string-name>
          , G. Pappalardo, and E. Tramontana, “
          <article-title>AODP: refactoring code to provide advanced aspect-oriented modularization of design patterns</article-title>
          ,”
          <source>in Proceedings of ACM Symposium on Applied Computing (SAC)</source>
          , Riva del Garda, Italy,
          <year>March 2012</year>
          , pp.
          <fpage>1243</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          and E. Tramontana, “
          <article-title>Suggesting extract class refactoring opportunities by measuring strength of method interactions</article-title>
          ,”
          <source>in Proceedings of Asia Pacific Software Engineering Conference (APSEC)</source>
          . Bangkok, Thailand: IEEE,
          <year>December 2013</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          , “
          <article-title>Automatically characterising components with concerns and reducing tangling</article-title>
          ,”
          <source>in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS</source>
          , Kyoto, Japan,
          <year>July 2013</year>
          , pp.
          <fpage>499</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] “Opengl wiki,” OpenGL.org. [Online]. Available: https://www.opengl.org/wiki/Main_Page</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Segal</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Akeley</surname>
          </string-name>
          ,
          <source>The OpenGL Graphics System: A Specification (Version 4.0)</source>
          . OpenGL.org,
          <year>March 2010</year>
          . [Online]. Available: https://www.opengl.org/registry/doc/glspec40.core.20100311.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shreiner</surname>
          </string-name>
          , G. Sellers,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kessenich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Licea-Kane</surname>
          </string-name>
          ,
          <article-title>OpenGL Programming Guide Eighth Edition</article-title>
          . Addison-Wesley,
          <year>March 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Mahmoudi</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Manneback</surname>
          </string-name>
          , “
          <article-title>Parallel image processing on gpu with cuda and opengl</article-title>
          .”
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Ahn</surname>
          </string-name>
          .
          <article-title>Opengl pixel buffer object (pbo)</article-title>
          . [Online]. Available: http://www.songho.ca/opengl/gl_pbo.html
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          Khronos Group.
          <article-title>ARB_pixel_buffer_object, Architecture Review Board (ARB)</article-title>
          . [Online]. Available: http://www.opengl.org/registry/specs/ARB/pixel_buffer_object.txt
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shah</surname>
          </string-name>
          , “
          <article-title>Performance analysis of sobel edge detection filter on gpu using cuda &amp; opengl</article-title>
          ,”
          <source>International Journal for Research in Applied Science and Engineering Technology (IJRASET)</source>
          , vol.
          <volume>1</volume>
          , Issue III, ISSN: 2321-9653,
          <year>October 2013</year>
          . [Online]. Available: http://www.ijraset.com/
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sandgren</surname>
          </string-name>
          , “
          <article-title>Transfer time reduction of data transfers between cpu and gpu</article-title>
          ,” Teknisk-naturvetenskaplig fakultet UTH-enheten,
          <year>July 2013</year>
          . [Online]. Available: http://www.teknat.uu.se/student
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <article-title>Nvidia cuda library documentation 4.1</article-title>
          . NVIDIA Corporation. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/modules.html
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chouchene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. E.</given-names>
            <surname>Sayadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Said</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Atri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tourki</surname>
          </string-name>
          , “
          <article-title>Efficient implementation of sobel edge detection algorithm on cpu, gpu and fpga</article-title>
          ,”
          <source>International Journal of Advanced Media and Communication</source>
          , vol.
          <volume>5</volume>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>117</lpage>
          ,
          <year>April 2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>