                Parallel Computing Solutions for Linear
                         Combination of Filters
                              Andrea Cavarra                                             Dario Caramagno
                          University of Catania                                        University of Catania
                  Viale A. Doria 6, 95125 Catania, Italy                       Viale A. Doria 6, 95125 Catania, Italy



    Abstract—GPUs (Graphics Processing Units) are the future of high performance computing and provide a parallel programming model for general purpose applications thanks to the CUDA programming interface. The programming model of the GPU architecture is significantly different from the traditional CPU one. This paper presents the advantages of the GPU architecture by proposing an algorithm for the linear combination of digital filters in image processing, implemented as a parallel GPU version (a sequential CPU version has also been implemented for comparison). The use of the CUDA parallel processing architecture has enabled us to take advantage of the GPU, allowing an increase in performance compared to the CPU.
    Index Terms—Matrices Convolution, Image processing, GPGPU, Parallel Computing, CUDA.

  Copyright © 2016 held by the authors.

                         I. INTRODUCTION

   The evolution of computers has made it possible to perform more complex calculations and has improved the performance of simulations in scientific fields. Over a few decades, technology has raised the CPU (Central Processing Unit) frequency to a practical upper limit of about 3 GHz and has moved to multiple cores (dual core, quad core and octa core), unfortunately running into the limits imposed by power dissipation and increasing temperature.
   One of the solutions proposed to push such limits is the use of the GPU (Graphics Processing Unit). Typically, GPUs handle the huge amount of data of three-dimensional graphical applications with high performance, which led to the introduction of GPGPU (General-Purpose Computing on GPU). GPGPU programming was initially implemented using two types of APIs (Application Programming Interfaces): the complex OpenGL [1] and DirectX [2].
   In recent years NVIDIA has developed CUDA (Compute Unified Device Architecture). This platform is much simpler than the previous APIs and allows programming in a high level language based on C, using a model of parallel computing. NVIDIA CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without requiring any knowledge of OpenGL. Nowadays the computational power of these architectures has grown considerably compared to CPUs, thanks to the GPU's ability to perform a huge number of simple operations in parallel. GPU cores are much simpler than CPU cores; this allows the realization of GPU architectures in which the silicon area devoted to control and management is much smaller than in the CPU, with a large increase in the number of cores [3].
   GPUs allow us to compute optimally, in parallel mode, code of a certain complexity, such as the application of digital filters to two-dimensional arrays. This paper proposes the use of the GPU architecture for the implementation of digital filters through parallel code that applies a linear combination of digital filters to images. The aim is to compare the speed of the parallel execution with that of the sequential code. The remaining parts of the paper are dedicated to explaining the parallel and serial approaches.

                        II. DIGITAL FILTERS

   A digital filter performs mathematical operations on discrete-time sampled signals in order to enhance or reduce certain characteristics of the signal. Such filters are implemented by software components, often through mathematical functions or matrices, and loaded inside processors with programmable hardware. Cost and speed closely depend on the processor used.
   The application of digital filters has enormous advantages over analog filters, as the achievable transfer functions are much more flexible for digital filters. The main advantages are:
   1) high accuracy due to the absence of physical components;
   2) a digital filter is easier to design and implement, automatically modifying its frequency response as the input changes;
   3) flexibility to change the digital filter parameters without changing the system hardware;
   4) easy simulation and design of the filters, being implemented in software with a reduction of system complexity.
These properties are essential to the implementation of a high quality filter. As mentioned previously, these advantages are accompanied by limitations of speed and cost, related to the processor used, and of frequency. The frequency limit is described by the Nyquist theorem through the following formula:

                            fs > 2B                           (1)

which imposes a maximum limit on the frequencies handled by the filter, in order to reduce the effects of additional aliasing distortion of the signals. The term B is the signal bandwidth and fs is the sampling frequency; for example, a signal band-limited to B = 4 kHz must be sampled at more than 8 kHz. Digital filters are widely used in image filtering and image processing. This paper will examine, implement and test an algorithm to filter digital images with a GPU implementation.
A. Digital filters for image processing
    Digital filters are an essential tool for image processing. A digital image can be defined as a two-dimensional array, or matrix, in which each element represents a pixel of the image. Images are processed by an algorithm: the filtering process, also known as convolution, applies a mask, a matrix of standard size such as 3x3, 5x5 or 9x9. This mask applies different mathematical operators to the image by means of a convolution in order to achieve digital image processing.
    The use of filters in digital image processing allows us to perform operations like:
    1) the extraction of information from the image, such as the detection of boundaries in images;
    2) the enhancement of details, such as the increase of light intensity or color contrast;
    3) the elimination or reduction of disturbances in images.
    This processing carries a high computational cost, especially for large images. For this reason a sequential approach, based on the CPU, is not very efficient, so parallel approaches have been introduced to increase performance. This paper presents an algorithm for the application of a linear combination of filters as local processing, based on a model of parallel computation on GPU architectures. For our local processing filters, the image processing method consists of applying a function to each original pixel value and to an appropriate range of surrounding pixels, within the radius of the filter matrix. This method is based on the convolution operator, for which a brief explanation is needed.
                                                                        computing support is far more suitable for high performance
B. Convolution                                                          computing.
   As previously said, the algorithm executes an operation between an input matrix and pre-established filters, in order to provide as a result a linear combination of the input. The 2D convolution between two continuous functions, f(x, y) and g(x, y), is defined by the formula:

   f(x, y) ∗ g(x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(α, β) g(x − α, y − β) dα dβ        (2)

This expression can be brought to the discrete domain by assuming f(x, y) and g(x, y) to be two discrete arrays of limited size, as the double sum:

   f(x, y) ∗ g(x, y) = Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} f(m, n) g(x − m, y − n)        (3)

for x = 0, 1, . . . , M − 1 and y = 0, 1, . . . , N − 1.
   In discrete time this convolution operation amounts to a simple "local average" over a range of amplitude defined by the size of the kernel used, producing a value of the output matrix at the same position as the source data, as shown in figure 1.

                        Fig. 1: 2D Convolution

   This operation is carried out from left to right, for each element of the original matrix, obtaining the resulting convolution matrix. Therefore, the convolution algorithm implements the digital filtering operation, and it suffices to modify the kernel in order to change the type of filter. We have studied the digital filtering of an image, in ASCII format, through two ninth-order filters. The next section describes the convolution operation implemented as a parallel algorithm that can be executed on a CUDA GPU.
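   To fix ideas, the following is a minimal sketch of the sequential (CPU) version of this operation, assuming a 3x3 mask, row-major storage of the image in a one-dimensional array and zero-padding at the borders; function and variable names are illustrative, not taken from the paper's code.

   /* Illustrative sketch: sequential 2D filtering with a 3x3 mask.     */
   /* The mask is applied directly (for symmetric masks this coincides  */
   /* with convolution); out-of-range neighbours are treated as zero.   */
   void convolve2d(const float *in, float *out, int rows, int cols,
                   const float mask[3][3])
   {
       int radius = 1;                              /* (3 - 1) / 2 */
       for (int y = 0; y < rows; y++) {
           for (int x = 0; x < cols; x++) {
               float acc = 0.0f;
               for (int i = -radius; i <= radius; i++) {
                   for (int j = -radius; j <= radius; j++) {
                       int yy = y + i, xx = x + j;
                       if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                           acc += in[yy * cols + xx] * mask[i + radius][j + radius];
                   }
               }
               out[y * cols + x] = acc;             /* same position as the source pixel */
           }
       }
   }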




                 III. PARALLEL COMPUTING ON GPU

   This paper presents the implementation of the convolution algorithm for a two-dimensional matrix of input data with a linear combination of filters, using a parallel model. Two versions of the same algorithm are presented in order to compare the efficiency of the parallel computation with that of the sequential one, and thus demonstrate how parallel computing support is far more suitable for high performance computing.

A. GPU Architecture

   In recent decades the high performance computing sector has seen exponential growth with the widespread use of GPGPU. The use of GPUs in this ever-expanding field makes it possible to exploit the power of parallel computing for increasingly complex simulations, thanks to the inherently parallel nature of GPUs.
   CUDA compatible GPUs are, in fact, based on an architecture made up of a number of MIMD (Multiple Instruction Multiple Data) multiprocessors called Streaming Multiprocessors, whose number depends on the specification and performance class of the GPU. These Streaming Multiprocessors are the basic units of the GPU architecture and are implemented as SIMD (Single Instruction Multiple Data) units, which NVIDIA calls SIMT (Single Instruction Multiple Threads); each contains 8 processors called Streaming Processors, or CUDA Cores. In this architecture, each Streaming Multiprocessor is able to create, manage, schedule and execute groups of 32 threads called warps. A warp executes one instruction at a time, so efficiency is maximized when all 32 threads of a warp agree on their execution path. When one or more blocks of threads are assigned to a multiprocessor to run, they are partitioned into warps, scheduled by a warp scheduler and executed one at a time. Each of these processors can perform simple mathematical operations (such as addition, subtraction, multiplication, etc.) on integers or floating point numbers. Inside each multiprocessor there are also a shared memory, accessible only by the processors of the same multiprocessor, caches for instructions and for data and, finally, a unit for decoding the instructions.
   Each multiprocessor has access to a global memory, shared among all the GPU multiprocessors and called Device Memory [3]. The CUDA C programming model organizes the program in a sequential part executed on the CPU (the host), mainly performing memory allocation and kernel calls, and in a parallel part, called kernel, executed on the GPU (the device). This structure requires the presence, inside the program, of instructions that are executed sequentially on the host, interspersed with calls to the kernel that allow us to carry out entire parts of the program in parallel, on the device, as shown in figure 2.

              Fig. 2: Model of programming in CUDA
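   The model of figure 2 can be made concrete with a minimal sketch of the typical host/device sequence; all names and sizes here are illustrative, not taken from the paper's code.

   /* Illustrative sketch of the CUDA host/device model:             */
   /* sequential host code interspersed with a parallel kernel call. */
   __global__ void scale(float *data, int n)    /* parallel part: runs on the device */
   {
       int id = threadIdx.x + blockIdx.x * blockDim.x;
       if (id < n)
           data[id] *= 2.0f;
   }

   int main(void)                               /* sequential part: runs on the host */
   {
       const int n = 1024;
       float h_data[1024], *d_data;
       for (int i = 0; i < n; i++) h_data[i] = (float)i;
       cudaMalloc(&d_data, n * sizeof(float));                 /* device allocation   */
       cudaMemcpy(d_data, h_data, n * sizeof(float),
                  cudaMemcpyHostToDevice);                     /* host -> device copy */
       scale<<<(n + 255) / 256, 256>>>(d_data, n);             /* kernel call         */
       cudaMemcpy(h_data, d_data, n * sizeof(float),
                  cudaMemcpyDeviceToHost);                     /* device -> host copy */
       cudaFree(d_data);
       return 0;
   }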
   A kernel is the model of parallel execution of the code on a device: it is defined as a grid, divided into a certain number of blocks, each of which is assigned to a multiprocessor. Inside each block there is a number of fundamental computational units called threads.
   Kernels, or grids, are executed sequentially with respect to one another, while blocks and threads are executed in parallel, adopting a SIMT data-parallel model. Each thread belongs to a single block and is uniquely identified within the kernel by an index assigned to it. In this manner memory addressing is simplified, especially in the case of processing multidimensional data.
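   For multidimensional data, the grid and the blocks can themselves be given two dimensions, so that each thread derives a unique row and column index; the sketch below, with illustrative names (W, H, in, out are not taken from the paper's code), assigns one thread per element of a W x H matrix.

   /* Illustrative sketch: 2D grid/block subdivision for a W x H matrix. */
   __global__ void copy2d(const float *in, float *out, int W, int H)
   {
       int col = threadIdx.x + blockIdx.x * blockDim.x;   /* unique column index */
       int row = threadIdx.y + blockIdx.y * blockDim.y;   /* unique row index    */
       if (row < H && col < W)
           out[row * W + col] = in[row * W + col];        /* one thread per element */
   }
   /* launch: enough 16x16-thread blocks to cover the whole matrix, e.g.             */
   /* copy2d<<<dim3((W + 15) / 16, (H + 15) / 16), dim3(16, 16)>>>(in, out, W, H);   */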
   In each of the blocks the kernel also has a shared memory, accessible only to the threads of the same block. The logical subdivision of a kernel into grid and blocks is a crucial aspect of CUDA code for obtaining parallelization: a careful organization and management of the threads allows the implementation of more efficient code.

           IV. PRESENTATION OF THE ALGORITHM

   The algorithm developed in this work executes the convolution operation between an input matrix and two pre-established kernels, generating an output matrix à given by the linear combination of the results, obtained using this formula:

           A ⟹ Ã = α(k1 ∗ A) + β(k2 ∗ A) + γA              (4)

   While executing this operation, we want to generate the best parallelized code, for the purpose of improving its performance compared to the serial code version. The algorithm can be summed up in three basic steps:
   1) acquisition of the input data matrix;
   2) convolution of the input matrix with the pre-established kernels;
   3) linear combination of the results (see the sketch below).
   Obviously, each of these steps is performed both by the sequential version of the code and by the parallel one. The following details these basic steps, focusing our attention on the parallel algorithm.
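   As an illustration of step 3, assume the two convolutions k1 ∗ A and k2 ∗ A have already been computed into device arrays c1 and c2 (all names and coefficients here are illustrative, not the paper's code): formula (4) then reduces to an element-wise combination that maps naturally onto one thread per element.

   /* Illustrative sketch: element-wise linear combination of formula (4). */
   __global__ void combine(const float *c1, const float *c2, const float *a,
                           float *out, int n,
                           float alpha, float beta, float gamma)
   {
       int id = threadIdx.x + blockIdx.x * blockDim.x;
       if (id < n)            /* Ã[id] = α(k1 ∗ A)[id] + β(k2 ∗ A)[id] + γA[id] */
           out[id] = alpha * c1[id] + beta * c2[id] + gamma * a[id];
   }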
A. Parallel algorithm solution

   The parallel solution of the algorithm starts on the host with code that acquires the input: it reads the data from a text file in ASCII code and stores them in an array in memory. Instead of a 2D array for the data, complex to manage, we have used a one-dimensional array in which the rows of the input matrix are stored one after the other, so that the data can be addressed through a single index. In this way the data are transferred from host to GPU with the standard function cudaMemcpy(). After choosing the number of threads and having organized the blocks, the call to the kernel is performed, in order to carry out the parallel computation.
   Inside the kernel a thread is assigned to each input element, so that the execution of the code can be entrusted to a set of parallel threads arranged in blocks and indexed using the following formula:

   id = threadIdx.x + (blockIdx.x*blockDim.x);

   Each thread is thus assigned a position in the input array, and it computes the convolution in parallel with the other threads of each block. Since the input data are acquired in the form of a one-dimensional array, it was necessary to implement a jump mechanism allowing us to move between the various locations of the array, in order to perform the products between the elements of the kernel (in our case a simple 3x3 filter) and the components located in the neighborhood of the source location. This problem is solved using the radius of the kernel, defined as:

   Kernel_radius = (Kernel_order - 1)/2;

   With the radius we can identify all the elements lying in the range of the position of interest: the kernel radius is subtracted from the input array index, and a variable that increases cyclically in a simple for loop is added. This lets us multiply, in parallel, the elements of the input by the kernel elements lying in the right position around the central element of the first row of the kernel, as shown below:

   for(i=0;i
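   A minimal sketch of a complete per-thread convolution along these lines, combining the id formula, the kernel radius and the jump mechanism over the linearized array, is given below; it assumes a 3x3 filter, row-major storage and zero-padding at the borders, and all names are illustrative, not the paper's code.

   /* Illustrative sketch: per-thread convolution over the linearized array. */
   __global__ void convolve(const float *in, float *out, const float *k,
                            int rows, int cols, int kernel_order)
   {
       int id = threadIdx.x + (blockIdx.x * blockDim.x);  /* position in the input array */
       if (id >= rows * cols) return;
       int radius = (kernel_order - 1) / 2;               /* Kernel_radius */
       int x = id % cols, y = id / cols;
       float acc = 0.0f;
       for (int i = 0; i < kernel_order; i++) {           /* rows of the mask    */
           for (int j = 0; j < kernel_order; j++) {       /* columns of the mask */
               int yy = y - radius + i;                   /* jump to neighbour row    */
               int xx = x - radius + j;                   /* jump to neighbour column */
               if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                   acc += in[yy * cols + xx] * k[i * kernel_order + j];
           }
       }
       out[id] = acc;
   }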