                Parallel Computing Solutions for Linear
                         Combination of Filters
                              Andrea Cavarra                                             Dario Caramagno
                          University of Catania                                        University of Catania
                  Viale A. Doria 6, 95125 Catania, Italy                       Viale A. Doria 6, 95125 Catania, Italy



    Abstract—GPUs (Graphics Processing Units) are the future of high performance computing and provide a parallel programming model for general purpose applications thanks to the CUDA programming interface. The programming model of the GPU architecture is significantly different from the traditional CPU one. This paper presents the advantages of the GPU architecture by proposing an algorithm for the linear combination of digital filters in image processing, implemented as a parallel GPU version (a sequential CPU version has also been implemented for comparison). The use of the CUDA parallel processing architecture has enabled us to take advantage of the GPU, allowing an increase in performance compared to the CPU.
    Index Terms—Matrices Convolution, Image processing, GPGPU, Parallel Computing, CUDA.

  Copyright © 2016 held by the authors.

                         I. INTRODUCTION

   The evolution of computers has made it possible to perform more complex calculations and has improved the performance of simulations in scientific fields. Over a few decades, technology has raised the CPU (Central Processing Unit) frequency to a practical upper limit of about 3 GHz and has moved to multiple cores (dual core, quad core and octa core), unfortunately running into the limits imposed by power dissipation and increasing temperature.
   One of the solutions proposed to push such limits is the use of the GPU (Graphics Processing Unit). Typically, GPUs handle the huge amount of data of three-dimensional graphical applications with high performance, which led to the introduction of GPGPU (General-Purpose Computing on GPU). GPGPU programming was initially implemented using two types of APIs (Application Programming Interfaces): the complex OpenGL [1] and DirectX [2].
   In recent years NVIDIA has developed CUDA (Compute Unified Device Architecture). This platform is much simpler than the previous APIs and allows programming in a high level language based on C, using a model of parallel computing. NVIDIA CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without requiring any knowledge of OpenGL. Nowadays the computational power of these architectures has grown considerably compared to CPUs, thanks to the GPU's ability to perform a huge number of simple operations in parallel. GPU cores are much simpler than CPU cores; this allows the realization of GPU architectures in which the silicon area devoted to control and management is much smaller than in the CPU, with a large increase in the number of cores [3].
   GPUs allow us to compute optimally, in parallel mode, code of a certain complexity, such as the application of digital filters to two-dimensional arrays. This paper proposes the use of the GPU architecture for the implementation of digital filters through parallel code that applies a linear combination of digital filters to images. The aim is to compare the speed of the parallel execution with that of the sequential code. The remaining parts of the paper are dedicated to explaining the parallel and serial approaches.

                        II. DIGITAL FILTERS

   A digital filter performs mathematical operations on discrete-time sampled signals in order to enhance or reduce certain characteristics of the signal. Such filters are implemented by software components, often through mathematical functions or matrices, and loaded inside processors with programmable hardware. Cost and speed closely depend on the processor used.
   The application of digital filters has enormous advantages over analog filters, as the achievable transfer functions are much more flexible for digital filters. The main advantages are:
   1) high accuracy due to the absence of physical components;
   2) a digital filter is easier to design and implement, automatically modifying its frequency response as the input changes;
   3) flexibility to change the digital filter parameters without changing the system hardware;
   4) easy simulation and design of the filters, being implemented in software with a reduction of system complexity.
These properties are essential to the implementation of a high quality filter. As mentioned previously, these advantages are accompanied by limitations of speed and cost, related to the processor used, and of frequency. The frequency limit is described by the Nyquist theorem through the following formula:

                            fs > 2B                           (1)

which imposes a maximum limit on the frequencies handled by the filter, in order to reduce the effects of additional aliasing distortion of the signals. The term B is the signal bandwidth and fs is the sampling frequency; for example, a signal band-limited to B = 4 kHz must be sampled at more than 8 kHz. Digital filters are widely used in image filtering and image processing. This paper will examine, implement and test an algorithm to filter digital images with a GPU implementation.
A. Digital filters for image processing
    Digital filters are an essential tool for image processing. A digital image can be defined as a two-dimensional array, or matrix, in which each element represents a pixel of the image. Images are processed by an algorithm: the filtering process, also known as convolution, applies a mask, a matrix of standard size such as 3x3, 5x5 or 9x9. This mask applies different mathematical operators to the image by means of a convolution in order to achieve digital image processing.
    The use of filters in digital image processing allows us to perform operations like:
    1) the extraction of information from the image, such as the detection of boundaries in images;
    2) the enhancement of details, such as the increase of light intensity or color contrast;
    3) the elimination or reduction of disturbances in images.
    This processing carries a high computational cost, especially for large images. For this reason a sequential approach, based on the CPU, is not very efficient, so parallel approaches have been introduced to increase performance. This paper presents an algorithm for the application of a linear combination of filters as local processing, based on a model of parallel computation on GPU architectures. For our local processing filters, the image processing method consists of applying a function to each original pixel value and to an appropriate range of surrounding pixels, within the radius of the filter matrix. This method is based on the convolution operator, for which a brief explanation is needed.
                                                                        computing support is far more suitable for high performance
B. Convolution                                                          computing.
   As previously said, the algorithm executes an operation between an input matrix and pre-established filters, in order to provide as a result a linear combination of the input. The 2D convolution between two continuous functions, f(x, y) and g(x, y), is defined by the formula:

   f(x, y) ∗ g(x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(α, β) g(x − α, y − β) dα dβ        (2)

This expression can be brought to the discrete domain by assuming f(x, y) and g(x, y) to be two discrete arrays of limited size, as the double sum:

   f(x, y) ∗ g(x, y) = Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} f(m, n) g(x − m, y − n)        (3)

for x = 0, 1, . . . , M − 1 and y = 0, 1, . . . , N − 1.
   In discrete time this convolution operation amounts to a simple "local average" over a range of amplitude defined by the size of the kernel used, producing a value of the output matrix at the same position as the source data, as shown in figure 1.

                        Fig. 1: 2D Convolution

   This operation is carried out from left to right, for each element of the original matrix, obtaining the resulting convolution matrix. Therefore, the convolution algorithm implements the digital filtering operation, and it suffices to modify the kernel in order to change the type of filter. We have studied the digital filtering of an image, in ASCII format, through two ninth-order filters. The next section describes the convolution operation implemented as a parallel algorithm that can be executed on a CUDA GPU.
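   To fix ideas, the following is a minimal sketch of the sequential (CPU) version of this operation, assuming a 3x3 mask, row-major storage of the image in a one-dimensional array and zero-padding at the borders; function and variable names are illustrative, not taken from the paper's code.

   /* Illustrative sketch: sequential 2D filtering with a 3x3 mask.     */
   /* The mask is applied directly (for symmetric masks this coincides  */
   /* with convolution); out-of-range neighbours are treated as zero.   */
   void convolve2d(const float *in, float *out, int rows, int cols,
                   const float mask[3][3])
   {
       int radius = 1;                              /* (3 - 1) / 2 */
       for (int y = 0; y < rows; y++) {
           for (int x = 0; x < cols; x++) {
               float acc = 0.0f;
               for (int i = -radius; i <= radius; i++) {
                   for (int j = -radius; j <= radius; j++) {
                       int yy = y + i, xx = x + j;
                       if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                           acc += in[yy * cols + xx] * mask[i + radius][j + radius];
                   }
               }
               out[y * cols + x] = acc;             /* same position as the source pixel */
           }
       }
   }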




                 III. PARALLEL COMPUTING ON GPU

   This paper presents the implementation of the convolution algorithm for a two-dimensional matrix of input data with a linear combination of filters, using a parallel model. Two versions of the same algorithm are presented in order to compare the efficiency of the parallel computation with that of the sequential one, and thus demonstrate how parallel computing support is far more suitable for high performance computing.

A. GPU Architecture

   In recent decades the high performance computing sector has seen exponential growth with the widespread use of GPGPU. The use of GPUs in this ever-expanding field makes it possible to exploit the power of parallel computing for increasingly complex simulations, thanks to the inherently parallel nature of GPUs.
   CUDA compatible GPUs are, in fact, based on an architecture made up of a number of MIMD (Multiple Instruction Multiple Data) multiprocessors called Streaming Multiprocessors, whose number depends on the specification and performance class of the GPU. These Streaming Multiprocessors are the basic units of the GPU architecture and are implemented as SIMD (Single Instruction Multiple Data) units, which NVIDIA calls SIMT (Single Instruction Multiple Threads); each contains 8 processors called Streaming Processors, or CUDA Cores. In this architecture, each Streaming Multiprocessor is able to create, manage, schedule and execute groups of 32 threads called warps. A warp executes one instruction at a time, so efficiency is maximized when all 32 threads of a warp agree on their execution path. When one or more blocks of threads are assigned to a multiprocessor to run, they are partitioned into warps, scheduled by a warp scheduler and executed one at a time. Each of these processors can perform simple mathematical operations (such as addition, subtraction, multiplication, etc.) on integers or floating point numbers. Inside each multiprocessor there are also a shared memory, accessible only by the processors of the same multiprocessor, caches for instructions and for data and, finally, a unit for decoding the instructions.
   Each multiprocessor has access to a global memory, shared among all the GPU multiprocessors and called Device Memory [3]. The CUDA C programming model organizes the program in a sequential part executed on the CPU (the host), mainly performing memory allocation and kernel calls, and in a parallel part, called kernel, executed on the GPU (the device). This structure requires the presence, inside the program, of instructions that are executed sequentially on the host, interspersed with calls to the kernel that allow us to carry out entire parts of the program in parallel, on the device, as shown in figure 2.

              Fig. 2: Model of programming in CUDA
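   The model of figure 2 can be made concrete with a minimal sketch of the typical host/device sequence; all names and sizes here are illustrative, not taken from the paper's code.

   /* Illustrative sketch of the CUDA host/device model:             */
   /* sequential host code interspersed with a parallel kernel call. */
   __global__ void scale(float *data, int n)    /* parallel part: runs on the device */
   {
       int id = threadIdx.x + blockIdx.x * blockDim.x;
       if (id < n)
           data[id] *= 2.0f;
   }

   int main(void)                               /* sequential part: runs on the host */
   {
       const int n = 1024;
       float h_data[1024], *d_data;
       for (int i = 0; i < n; i++) h_data[i] = (float)i;
       cudaMalloc(&d_data, n * sizeof(float));                 /* device allocation   */
       cudaMemcpy(d_data, h_data, n * sizeof(float),
                  cudaMemcpyHostToDevice);                     /* host -> device copy */
       scale<<<(n + 255) / 256, 256>>>(d_data, n);             /* kernel call         */
       cudaMemcpy(h_data, d_data, n * sizeof(float),
                  cudaMemcpyDeviceToHost);                     /* device -> host copy */
       cudaFree(d_data);
       return 0;
   }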
   A kernel is the model of parallel execution of the code on a device: it is defined as a grid, divided into a certain number of blocks, each of which is assigned to a multiprocessor. Inside each block there is a number of fundamental computational units called threads.
   Kernels, or grids, are executed sequentially with respect to one another, while blocks and threads are executed in parallel, adopting a SIMT data-parallel model. Each thread belongs to a single block and is uniquely identified within the kernel by an index assigned to it. In this manner memory addressing is simplified, especially in the case of processing multidimensional data.
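   For multidimensional data, the grid and the blocks can themselves be given two dimensions, so that each thread derives a unique row and column index; the sketch below, with illustrative names (W, H, in, out are not taken from the paper's code), assigns one thread per element of a W x H matrix.

   /* Illustrative sketch: 2D grid/block subdivision for a W x H matrix. */
   __global__ void copy2d(const float *in, float *out, int W, int H)
   {
       int col = threadIdx.x + blockIdx.x * blockDim.x;   /* unique column index */
       int row = threadIdx.y + blockIdx.y * blockDim.y;   /* unique row index    */
       if (row < H && col < W)
           out[row * W + col] = in[row * W + col];        /* one thread per element */
   }
   /* launch: enough 16x16-thread blocks to cover the whole matrix, e.g.             */
   /* copy2d<<<dim3((W + 15) / 16, (H + 15) / 16), dim3(16, 16)>>>(in, out, W, H);   */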
   In each of the blocks the kernel also has a shared memory, accessible only to the threads of the same block. The logical subdivision of a kernel into grid and blocks is a crucial aspect of CUDA code for obtaining parallelization: a careful organization and management of the threads allows the implementation of more efficient code.

           IV. PRESENTATION OF THE ALGORITHM

   The algorithm developed in this work executes the convolution operation between an input matrix and two pre-established kernels, generating an output matrix à given by the linear combination of the results, obtained using this formula:

           A ⟹ Ã = α(k1 ∗ A) + β(k2 ∗ A) + γA              (4)

   While executing this operation, we want to generate the best parallelized code, for the purpose of improving its performance compared to the serial code version. The algorithm can be summed up in three basic steps:
   1) acquisition of the input data matrix;
   2) convolution of the input matrix with the pre-established kernels;
   3) linear combination of the results (see the sketch below).
   Obviously, each of these steps is performed both by the sequential version of the code and by the parallel one. The following details these basic steps, focusing our attention on the parallel algorithm.
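   As an illustration of step 3, assume the two convolutions k1 ∗ A and k2 ∗ A have already been computed into device arrays c1 and c2 (all names and coefficients here are illustrative, not the paper's code): formula (4) then reduces to an element-wise combination that maps naturally onto one thread per element.

   /* Illustrative sketch: element-wise linear combination of formula (4). */
   __global__ void combine(const float *c1, const float *c2, const float *a,
                           float *out, int n,
                           float alpha, float beta, float gamma)
   {
       int id = threadIdx.x + blockIdx.x * blockDim.x;
       if (id < n)            /* Ã[id] = α(k1 ∗ A)[id] + β(k2 ∗ A)[id] + γA[id] */
           out[id] = alpha * c1[id] + beta * c2[id] + gamma * a[id];
   }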
A. Parallel algorithm solution

   The parallel solution of the algorithm starts on the host with code that acquires the input: it reads the data from a text file in ASCII code and stores them in an array in memory. Instead of a 2D array for the data, complex to manage, we have used a one-dimensional array in which the rows of the input matrix are stored one after the other, so that the data can be addressed through a single index. In this way the data are transferred from host to GPU with the standard function cudaMemcpy(). After choosing the number of threads and having organized the blocks, the call to the kernel is performed, in order to carry out the parallel computation.
   Inside the kernel a thread is assigned to each input element, so that the execution of the code can be entrusted to a set of parallel threads arranged in blocks and indexed using the following formula:

   id = threadIdx.x + (blockIdx.x*blockDim.x);

   Each thread is thus assigned a position in the input array, and it computes the convolution in parallel with the other threads of each block. Since the input data are acquired in the form of a one-dimensional array, it was necessary to implement a jump mechanism allowing us to move between the various locations of the array, in order to perform the products between the elements of the kernel (in our case a simple 3x3 filter) and the components located in the neighborhood of the source location. This problem is solved using the radius of the kernel, defined as:

   Kernel_radius = (Kernel_order - 1)/2;

   With the radius we can identify all the elements lying in the range of the position of interest: the kernel radius is subtracted from the input array index, and a variable that increases cyclically in a simple for loop is added. This lets us multiply, in parallel, the elements of the input by the kernel elements lying in the right position around the central element of the first row of the kernel, as shown below:

   for(i=0;i
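   A minimal sketch of a complete per-thread convolution along these lines, combining the id formula, the kernel radius and the jump mechanism over the linearized array, is given below; it assumes a 3x3 filter, row-major storage and zero-padding at the borders, and all names are illustrative, not the paper's code.

   /* Illustrative sketch: per-thread convolution over the linearized array. */
   __global__ void convolve(const float *in, float *out, const float *k,
                            int rows, int cols, int kernel_order)
   {
       int id = threadIdx.x + (blockIdx.x * blockDim.x);  /* position in the input array */
       if (id >= rows * cols) return;
       int radius = (kernel_order - 1) / 2;               /* Kernel_radius */
       int x = id % cols, y = id / cols;
       float acc = 0.0f;
       for (int i = 0; i < kernel_order; i++) {           /* rows of the mask    */
           for (int j = 0; j < kernel_order; j++) {       /* columns of the mask */
               int yy = y - radius + i;                   /* jump to neighbour row    */
               int xx = x - radius + j;                   /* jump to neighbour column */
               if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                   acc += in[yy * cols + xx] * k[i * kernel_order + j];
           }
       }
       out[id] = acc;
   }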