Parallel Computing Solutions for Linear Combination of Filters

Andrea Cavarra, Dario Caramagno
University of Catania
Viale A. Doria 6, 95125 Catania, Italy

Abstract—GPUs (Graphics Processing Units) are the future of high performance computing and provide a parallel programming model for general purpose applications through the CUDA programming interface. The programming model of the GPU architecture is significantly different from the traditional CPU one. This paper presents the advantages of the GPU architecture by proposing an algorithm for the linear combination of digital filters in image processing, implemented as a parallel GPU version (a sequential CPU version has also been implemented for comparison). The use of the parallel CUDA architecture has enabled us to take advantage of the GPU, obtaining an increase in performance compared to the CPU.

Index Terms—Matrix Convolution, Image Processing, GPGPU, Parallel Computing, CUDA.

I. INTRODUCTION

The evolution of computers has made it possible to perform ever more complex calculations and has improved the performance of simulations in scientific fields. Over a few decades, technology has raised the CPU (Central Processing Unit) frequency to an upper limit of about 3 GHz and introduced multi-core processors (dual core, quad core and octa core), though with limits imposed by power dissipation and rising temperature.

One of the solutions proposed to push past such limits is the use of the GPU (Graphics Processing Unit). Typically, GPUs handle with high performance the huge amount of data required by three-dimensional graphical applications, and this has led to the introduction of GPGPU (General-Purpose computing on GPU). GPGPU programming was initially implemented through two kinds of APIs (Application Programming Interfaces): the complex OpenGL [1] and DirectX [2].

In recent years NVIDIA has developed CUDA (Compute Unified Device Architecture). This platform is much simpler than the previous APIs and allows programming in a high level language based on C, using a parallel computing model. NVIDIA CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without requiring any knowledge of OpenGL. The computational power of these architectures has grown considerably compared to CPUs, thanks to the GPU's ability to perform a huge number of simple operations in parallel. GPU cores are much simpler than CPU cores; this allows the realization of GPU architectures in which the silicon area devoted to control and management is greatly reduced compared to that of a CPU, in favor of a large increase in the number of cores [3].

GPUs therefore allow us to compute in parallel, in an optimal way, code of a certain complexity, such as the application of digital filters to two-dimensional arrays. This paper proposes the use of the GPU architecture to implement digital filters with parallel code that computes a linear combination of digital filters applied to images. The aim is to compare the execution speed of the parallel code with that of the sequential code. The remaining parts of the paper explain the parallel and serial approaches.

II. DIGITAL FILTERS

A digital filter performs mathematical operations on discrete-time sampled signals in order to enhance or reduce certain characteristics of the signal. Such filters are implemented as software components, often through mathematical functions or matrices, and loaded into processors with programmable hardware. Cost and speed depend closely on the processor used.

The application of digital filters offers enormous advantages over analog filters, since the possible transfer functions are much more flexible for digital filters. The main advantages are:

1) high accuracy, due to the absence of physical components;
2) a digital filter is easier to design and implement, automatically modifying its frequency response as the input changes;
3) flexibility to change the digital filter parameters without changing the system hardware;
4) easy simulation and design of the filters, which, being implemented in software, reduce system complexity.

These properties are essential to the implementation of a high quality filter. As mentioned previously, these advantages are accompanied by limitations of speed and cost, related to the processor used, and of frequency. The frequency limit is described by the Nyquist theorem:

    fs > 2B    (1)

which imposes a maximum limit on the frequencies the filter can handle in order to avoid additional aliasing distortion of the signals. Here B is the signal bandwidth and fs is the sampling frequency.
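To make the idea of "mathematical operations on discrete-time sampled signals" concrete, the following minimal C sketch applies a 3-tap moving-average FIR filter to a short sampled signal. The sample values and the filter coefficients are illustrative choices of ours, not taken from the paper:

    #include <stdio.h>

    #define N 8  /* number of input samples */

    int main(void)
    {
        /* a short discrete-time input signal (illustrative values) */
        float x[N] = {0.0f, 1.0f, 4.0f, 9.0f, 16.0f, 9.0f, 4.0f, 1.0f};
        float y[N] = {0.0f};

        /* 3-tap moving average: y[n] = (x[n-1] + x[n] + x[n+1]) / 3;
           the two border samples are left unfiltered for simplicity */
        for (int n = 1; n < N - 1; n++)
            y[n] = (x[n - 1] + x[n] + x[n + 1]) / 3.0f;

        for (int n = 0; n < N; n++)
            printf("y[%d] = %f\n", n, y[n]);
        return 0;
    }

Changing the three coefficients changes the frequency response without touching any hardware, which is precisely advantage 3) above.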
Digital filters are widely used in image filtering and image processing. This paper examines, implements and tests an algorithm to filter digital images with a GPU implementation.

A. Digital filters for image processing

Digital filters are an essential tool for image processing. A digital image can be defined as a two-dimensional array, or matrix, in which each element represents a pixel of the image. Images are processed by an algorithm: the filtering process, also known as convolution, applies a mask, i.e. a matrix of standard size such as 3x3, 5x5 or 9x9, to the image. By means of the convolution, this mask applies different mathematical operators to the image to achieve digital image processing.

Fig. 1: 2D Convolution

The use of filters in digital image processing allows operations such as:

1) the extraction of information from the image, such as the detection of boundaries;
2) the enhancement of details, such as an increase of light intensity or color contrast;
3) the elimination or reduction of noise in images.

This processing has a high computational cost, especially for large images. For this reason a sequential approach, based on the CPU, is not very efficient, and parallel approaches have been introduced to increase performance. This paper presents an algorithm that applies a linear combination of filters as local processing, based on a model of parallel computation on GPU architectures. For our local processing filters, the image processing method consists of applying a function to each original pixel value and to an appropriate range of neighboring pixels, within the radius of the filter matrix. This method is based on the convolution operator, for which a brief explanation is needed.

B. Convolution

As previously said, the algorithm executes an operation on an input matrix and pre-established filters, in order to provide as a result a linear combination of the input. The 2D convolution between two continuous functions, f(x, y) and g(x, y), is defined by the formula:

    f(x, y) ∗ g(x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(α, β) g(x − α, y − β) dα dβ    (2)

This expression is brought to the discrete domain by assuming f(x, y) and g(x, y) to be two discrete arrays of limited size, giving the double sum:

    f(x, y) ∗ g(x, y) = Σ_{α=0}^{M−1} Σ_{β=0}^{N−1} f(α, β) g(x − α, y − β)    (3)

for x = 0, 1, . . . , M − 1 and y = 0, 1, . . . , N − 1.

In discrete time this convolution operation amounts to a simple "local average" over a range whose amplitude is defined by the kernel size, producing a value of the output matrix at the same position as the source data, as shown in figure 1. The operation is carried out from left to right, for each element of the original matrix, obtaining the resulting convolution matrix. The convolution algorithm therefore implements the digital filtering operation: it suffices to modify the kernel in order to change the type of filter.
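A minimal sequential C implementation of the discrete convolution (3) with a 3x3 mask might look as follows. This is our sketch, not the paper's code; in particular, treating out-of-range neighbours as zero is an assumption, since the paper does not state its border policy:

    #define KORDER  3                     /* kernel (mask) order: 3x3 */
    #define KRADIUS ((KORDER - 1) / 2)    /* kernel radius */

    /* Convolve a rows-by-cols image, stored row by row in a 1D array,
       with a KORDER x KORDER kernel. */
    void convolve2d(const float *in, float *out, int rows, int cols,
                    const float kernel[KORDER][KORDER])
    {
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                float acc = 0.0f;
                for (int i = 0; i < KORDER; i++) {
                    for (int j = 0; j < KORDER; j++) {
                        int yy = y + i - KRADIUS;
                        int xx = x + j - KRADIUS;
                        if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                            acc += kernel[i][j] * in[yy * cols + xx];
                    }
                }
                /* output lands at the same position as the source pixel */
                out[y * cols + x] = acc;
            }
        }
    }

The four nested loops make the O(rows x cols x KORDER^2) cost explicit; this is the work that the parallel version distributes across GPU threads.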
We have studied the digital filtering operation of an image, in ASCII format, through two filters of ninth order. The next section describes the convolution operation implemented as a parallel algorithm that can be executed on a CUDA GPU.

III. PARALLEL COMPUTING ON GPU

This paper presents the implementation of the convolution algorithm for a two-dimensional matrix of input data with a linear combination of filters, using a parallel model. Two versions of the same algorithm are presented in order to compare the efficiency of the parallel computation with that of the sequential one, and thereby demonstrate how parallel computing support is far more suitable for high performance computing.

A. GPU Architecture

In recent decades the high performance computing sector has grown exponentially with the widespread use of GPGPU. The use of GPUs in this ever-expanding domain can exploit the power of parallel computing for increasingly complex simulations, thanks to the inherently parallel nature of GPUs.

CUDA compatible GPUs are based on an architecture made up of a number of MIMD (Multiple Instruction Multiple Data) multiprocessors called Streaming Multiprocessors, whose number depends on the specification and performance class of the GPU. These Streaming Multiprocessors are the basic units of the GPU architecture; they are implemented as SIMD (Single Instruction Multiple Data) units, which NVIDIA calls SIMT (Single Instruction Multiple Threads), each containing 8 processors called Streaming Processors, or CUDA cores. In this architecture each Streaming Multiprocessor is able to create, manage, schedule and execute groups of 32 threads called warps. A warp executes one instruction at a time, so efficiency is maximized when all 32 threads of a warp agree on their execution path. When one or more blocks of threads are assigned to a multiprocessor for execution, they are partitioned into warps, scheduled by a warp scheduler and executed one at a time. Each of these processors can perform simple mathematical operations (such as addition, subtraction, multiplication, etc.) on integers or floating point numbers. Inside each multiprocessor there are also a shared memory, accessible only by the processors of that multiprocessor, caches for instructions and data, and, finally, a unit for decoding the instructions. Each multiprocessor also has access to a global memory shared among all GPU multiprocessors, called Device Memory [3].

The programming model of CUDA C organizes the program into a sequential part executed on the CPU, the host, which mainly performs memory allocation and kernel calls, and a parallel part, called kernel, executed on the GPU, the device. This structure requires the presence, inside the program, of instructions that are executed sequentially on the host, interspersed with calls to the kernel that carry out entire parts of the program in parallel on the device, as shown in figure 2.

Fig. 2: Model of programming in CUDA

A kernel is the model of parallel execution of code on a device; it is defined as a grid divided into a certain number of blocks, each of which is assigned to a multiprocessor. Inside each block there is a number of fundamental computational units called threads. Kernels, or grids, are executed sequentially with respect to one another, while blocks and threads are executed in parallel, following a SIMT data-parallel model. Each thread belongs to a single block and is uniquely identified within the kernel by an index; in this way memory addressing is simplified, especially when processing multi-dimensional data. Within each block the kernel also has a shared memory accessible only to the threads of that block. The logical subdivision of a kernel into grid and blocks is a crucial aspect of CUDA code for obtaining parallelization; careful organization and management of the threads allows the implementation of more efficient code.
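The host/device structure just described can be summarized by the following minimal CUDA C skeleton. It is a sketch of the general pattern, not the paper's program; the trivial scaling kernel and all names are our own illustration:

    #include <cuda_runtime.h>

    /* Device code (the kernel): each thread processes one element. */
    __global__ void scale(float *data, int n, float factor)
    {
        int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id < n)              /* guard threads beyond the data size */
            data[id] *= factor;
    }

    /* Host code: allocation, transfers and the kernel call. */
    int main(void)
    {
        const int n = 1024;
        float h_data[1024];
        for (int i = 0; i < n; i++) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice);

        int threads = 256;                        /* threads per block  */
        int blocks = (n + threads - 1) / threads; /* blocks in the grid */
        scale<<<blocks, threads>>>(d_data, n, 2.0f);

        cudaMemcpy(h_data, d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        return 0;
    }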
IV. PRESENTATION OF THE ALGORITHM

The algorithm developed in this work executes the convolution of an input matrix A with pre-established kernels, generating an output matrix Ã given by the linear combination of the results according to the formula:

    A ⟹ Ã = α(k1 ∗ A) + β(k2 ∗ A) + γA    (4)

While executing this operation we want to generate the best parallelized code, with the aim of improving its performance compared to the serial version. The algorithm can be summed up in three basic steps:

1) acquisition of the input data matrix;
2) convolution of the input matrix with the pre-established kernels;
3) linear combination of the results (see the sketch below).

Obviously, each of these steps is performed both by the sequential version of the code and by the parallel one. The following details the basic steps outlined above, focusing our attention on the parallel algorithm.
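As an illustration of step 3, a per-element CUDA kernel computing the combination (4) could look as follows, assuming the two convolutions k1 ∗ A and k2 ∗ A have already been computed in step 2. Names and structure are our sketch, not the paper's code:

    /* One thread combines one element of the two convolved matrices
       with the corresponding element of the original input A, as in
       eq. (4): Atilde = alpha*(k1*A) + beta*(k2*A) + gamma*A. */
    __global__ void linear_combination(const float *convK1,
                                       const float *convK2,
                                       const float *A, float *Atilde,
                                       int n, float alpha, float beta,
                                       float gamma)
    {
        int id = threadIdx.x + blockIdx.x * blockDim.x;
        if (id < n)
            Atilde[id] = alpha * convK1[id] + beta * convK2[id]
                       + gamma * A[id];
    }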
A. Parallel algorithm solution

The parallel solution of the algorithm starts on the host with code that acquires the input: it reads the data from a text file in ASCII code and stores them in an array. Instead of a 2D array, which is complex to manage, we used a one-dimensional array in which the rows of the input matrix are laid out one after the other; the data are then reconstructed through an index. In this way the data can be transferred from host to GPU with the standard function cudaMemcpy(). After choosing the number of threads and organizing the blocks, the kernel call is performed in order to carry out the parallel computation.

Inside the kernel a thread is assigned to each input element, so that the execution of the code is entrusted to a set of parallel threads arranged in blocks and indexed using the following formula:

    id = threadIdx.x + (blockIdx.x*blockDim.x);

Each thread is assigned a position in the input array and computes the convolution in parallel with the other threads of its block. Since the input data are stored as a one-dimensional array, it was necessary to implement a jump mechanism that allows moving between the various locations of the array in order to compute the products between the elements of the kernel (in our case a simple 3x3 filter) and the components located in the neighborhood of the source location. This problem is solved using the radius of the kernel, defined as:

    Kernel_radius = (Kernel_order - 1)/2;

In this way we can identify all the elements lying in the range of the position of interest. The kernel radius is subtracted from the input array index, and a variable that increases cyclically through a simple for loop is added to it. This lets us multiply, in parallel, the elements of the input with the kernel elements lying in the right positions around the central element of the first row of the kernel, as shown below:
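The paper's listing is truncated at this point in the source. The following is a hedged reconstruction, under the indexing scheme just described, of the per-thread loop, extended from the first kernel row to all three rows; the width and height parameters, the variable names and the border check are our assumptions:

    /* Reconstruction sketch: one thread computes one output pixel of the
       convolution with a 3x3 filter, walking the flattened 1D input
       through the jump mechanism (one row = 'width' elements). */
    __global__ void convolve(const float *input, float *output,
                             const float *kernel, int width, int height)
    {
        const int Kernel_order  = 3;
        const int Kernel_radius = (Kernel_order - 1) / 2;

        int id = threadIdx.x + (blockIdx.x * blockDim.x);
        if (id >= width * height) return;

        int x = id % width;   /* column recovered from the 1D index */
        int y = id / width;   /* row recovered from the 1D index    */

        float sum = 0.0f;
        for (int i = 0; i < Kernel_order; i++) {       /* kernel rows    */
            for (int j = 0; j < Kernel_order; j++) {   /* kernel columns */
                int yy = y + i - Kernel_radius;
                int xx = x + j - Kernel_radius;
                if (yy >= 0 && yy < height && xx >= 0 && xx < width)
                    sum += input[yy * width + xx]
                         * kernel[i * Kernel_order + j];
            }
        }
        output[id] = sum;
    }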