Performances of a Parallel CUDA Program for a Biorthogonal Wavelet Filter

Antony Scivoletto, University of Catania, Viale A. Doria 6, 95125 Catania, Italy, antonyscivoletto@gmail.com
Nella Romano, University of Catania, Viale A. Doria 6, 95125 Catania, Italy, nromano919@gmail.com

Copyright © 2016 held by the authors.

Abstract—Parallel high-performance computing technologies have encountered tremendous growth, especially in the last decade, and they have made a strong impact on a variety of areas concerning mathematical and engineering fields. In this work we analyze the behavior of a parallel algorithm developed to process a random signal through a biorthogonal discrete wavelet filter using CUDA, and we compare the parallel implementation against similar sequential CPU code. The simulations are conducted using two computers: the first one is a conventional host; the second one is a server. This comparison allows us to underline the differences, in terms of time performance, between hardware architectures with different specifications. The experimental results show that, if the signal is decomposed into a rather substantial number of samples, the parallel code timings on both the server GPU and the ordinary host GPU beat the corresponding sequential code on the same machines.

Index Terms—GPU programming, performances, wavelet

I. INTRODUCTION

The last decades have been characterized by large shifts in the perceived value of parallel computing. In these years high-performance computing technologies have seen substantial growth, and this is the case for general purpose computing on GPU, or rather GPGPU. The intensive use of GPU parallelization techniques has given the possibility to implement complex simulations and hard computational tasks [1], [2], [3], [4], [5], [6]. For a long time, one of the most important methods used to improve the computational capability of devices has been increasing the processor clock speed. Unfortunately, because of various fundamental limitations in the manufacturing of integrated circuits, it is not possible to extract much additional computational power from existing architectures by increasing the processor clock speed. Supercomputer manufacturers were then able to obtain large performance gains by strongly increasing the number of processors: they could have tens or hundreds of thousands of processor cores working concurrently. CPU manufacturers have announced plans for 12- or 16-core CPUs. In this way, a big number of simple multithreaded cores in GPU cards can offer the potential for strong speedups in several general purpose applications, when compared with the corresponding sequential CPU computation, confirming that parallel computing has arrived for good [7].

Nevertheless, the architecture of GPU cards gives rise to some programming issues, such as sharing resources, which are managed through a lock or a token. In shared memory machines, a single address space and a global memory are shared between multiple processors; each processor owns a local cache, and its values have to be made coherent with the global memory by the operating system. Data can be exchanged among processors simply by placing the values in a predefined location and synchronizing appropriately [8]. Other issues, which complicate development and have no counterpart in the sequential world, include: finding and expressing concurrency, managing data distributions, managing communication among processors, balancing the computational load, and simply implementing the parallel algorithm correctly. Therefore, implementing parallel programs is more difficult than implementing sequential ones [8], [9]. In fact, in sequential programming the programmer has to implement a code that is correct and efficient to execute. Parallel programming involves the same issues, and adds several additional challenges, some of which are described above.

In this paper we implement a parallel code on the CUDA architecture, subjecting a random signal to a biorthogonal discrete low-pass wavelet filter. In particular, this work is structured as follows. In section 2 we present some essential concepts for introducing wavelet theory and we refer to specific techniques, such as subband coding, in order to process a given signal with the biorthogonal wavelet filter in the discrete time case. In section 3 we describe the main features of the CUDA architecture, used in this work for the parallelization procedure. In section 4 some basic concepts on the definition of a digital filter and its implementation are given through the mathematical convolution operation. Section 5 examines the serial and parallel algorithms with which the wavelet filter implementation and the signal processing are achieved. In section 6 we show the results obtained by experimental simulations and compare the time performance of the two codes executed on two machines with different technical specifications. Finally, in the last two sections we deduce the consequent results and conclusions, and we also report the works related to the subject matter described in this paper.

II. WAVELET THEORY

A. Preliminaries

Only in recent years has wavelet theory been developed as a unifying framework for a large number of techniques thought for wave signal processing applications, such as multiresolution analysis, subband coding and wavelet series expansions. The idea of treating a signal at various scales and analyzing it with different resolutions has emerged independently in many mathematics, physics and engineering fields [10]. The interest of the signal processing community became strong when Daubechies and Mallat, giving a further contribution to wavelet theory, established connections to discrete signal processing results. Since then, a number of theoretical as well as practical contributions have been made on various aspects of this topic, and nowadays it is growing rapidly [10], [11].

B. Wavelets in the continuous domain

Wavelets are continuous basis functions constructed mainly in order to satisfy certain mathematical properties. A wavelet is defined as a function with zero mean value which lies both above and below the x-axis, looking like a "wave" [12]. It is possible to give a more rigorous definition of a wavelet. In fact, if we consider a multiresolution decomposition of L^2(R), that is

∅ ⊂ V_0 ⊂ ⋯ ⊂ V_j ⊂ V_{j+1} ⊂ ⋯ ⊂ L^2(R)

and we call W_j the orthogonal complement of V_j, then we can define a wavelet as a function ψ(x) such that the set {ψ(x − l), l ∈ Z} is a Riesz basis of W_0 and, moreover, subject to the following constraints [13]:

1) ∫_{−∞}^{+∞} ψ(x) dx = 0

2) ‖ψ(x)‖² = ∫_{−∞}^{+∞} ψ(x) ψ*(x) dx = 1
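For instance, a simple function satisfying both constraints (reported here only as an illustration; it is not the wavelet adopted in this work) is the Haar wavelet:

\psi(x) =
\begin{cases}
 1 & 0 \le x < \tfrac{1}{2} \\
-1 & \tfrac{1}{2} \le x < 1 \\
 0 & \text{otherwise}
\end{cases}

which integrates to zero and has unit norm; smoother wavelets, such as those shown in figure 1, satisfy the same two conditions.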
The idea behind wavelets is that, by stretching and translating one such main wavelet function ψ(t) (called the mother wavelet), we can represent finer parts of a function or a signal by simple linear combinations:

f(t) = Σ_{j,k} b_{j,k} ψ_{j,k}(t)    (1)

where the b_{j,k} are called the wavelet coefficients of the function f in the wavelet basis, given by the inner product of f with ψ_{j,k}. Moreover, a whole family of wavelet functions can be obtained by just shifting and scaling the mother one according to the law [12]:

ψ_{j,k}(t) = 2^{j/2} ψ(2^j t − k),  j, k ∈ Z    (2)

In the procedure of wavelet dilation and translation, we can see that for a large j, ψ_{j,k}(t) is short and of high frequency; smaller values of j, instead, give long wavelet functions of low frequency. This offers us the possibility to analyze a function at different scales [10]. Typical examples of wavelet functions are shown in figure 1.

Fig. 1: Example of wavelet functions [14]

C. Discrete Time Case

Now we focus our attention on discrete time signals, since this is the case of interest for the implementation of a wavelet filter as a parallel program which exploits the computational capabilities of GPUs. In the discrete time case two methods were developed independently, namely subband coding and multiresolution signal analysis. Two notions which are important for wavelet theory are scale and resolution. Scale is related to the size of the signal, while resolution refers to the amount of detail present in the signal. In order to understand these concepts better, we can note that the scale parameter in discrete wavelet analysis behaves as follows [10]:
• for large scales, dilated wavelets take a global view of a subsampled signal;
• for small scales, contracted wavelets analyze small details of the signal.

1) Multiresolution analysis: With this method we can derive a lower resolution signal by lowpass filtering with a half-band low-pass filter having impulse response h(n). This results in a signal y(n) where

y(n) = Σ_{k=−∞}^{+∞} h(k) x(2n − k)    (3)

Now, based on the subsampled version of x(n), we want to find an approximation, a(n), to the original: this is done by inserting a zero between every sample, because we need a signal at the original scale for comparison. In general, a(n) is not going to be equal to x(n); therefore we compute the difference between a(n) and x(n) with the following formula:

d(n) = x(n) − a(n)

However, there is some redundancy, since a signal with sampling rate f_s is mapped into two signals d(n) and y(n) with sampling rates f_s and f_s/2, respectively [10].

2) Subband coding schemes: The method of multiresolution analysis described in [10] is characterized by a redundant set of samples. We now look at a different scheme where no such redundancy appears. This method is called the subband coding scheme, first widely used in voice compression. The idea behind this technique is based on the following methodology: a pair of filters is used, consisting of a low-pass and a high-pass filter; we decompose a sequence x(n) into two subsequences at half rate, or half resolution, by means of orthogonal filters. This process can be iterated on either or both subsequences. In particular, to obtain finer frequency resolution at lower frequencies, we iterate the scheme on the lower band only. Each iteration halves the width of the low band (in fact it increases its frequency resolution by two), but because of the subsampling by two, its time resolution is halved as well. Schematically, this is shown in figure 2.

Fig. 2: Filter bank tree of the Discrete Wavelet Transform, implemented with two discrete time filters (low-pass and high-pass, respectively) and subsampling by two [12]

An important feature of this discrete algorithm is its relatively low complexity. Actually, the following somewhat surprising result emerges: regardless of the depth of the tree in figure 2, the complexity is linear in the number of input samples, with a constant factor which depends on the length of the filter [10]. A minimal sketch of a single decomposition level is reported below.
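The following C fragment is only an illustrative sketch of one analysis level of such a filter bank, written for this discussion and not taken from the original code: it assumes a generic low-pass/high-pass FIR pair h and g of length L, and a simple symmetric extension at the borders.

/* Illustrative sketch (not from the paper): one analysis level of the
 * filter bank of figure 2. The input x of length n is filtered by a
 * low-pass filter h and a high-pass filter g, both of length L, and
 * each output is downsampled by two, as in Eq. (3). */
static float sample_sym(const float *x, int n, int k) {
    if (k < 0)  k = -k;            /* symmetric extension at the left border  */
    if (k >= n) k = 2*n - 2 - k;   /* symmetric extension at the right border */
    return x[k];
}

void analysis_step(const float *x, int n, const float *h, const float *g,
                   int L, float *approx, float *detail) {
    for (int m = 0; m < n/2; m++) {              /* n/2 output samples per band */
        float a = 0.0f, d = 0.0f;
        for (int k = 0; k < L; k++) {
            a += h[k] * sample_sym(x, n, 2*m - k);   /* low-pass branch  */
            d += g[k] * sample_sym(x, n, 2*m - k);   /* high-pass branch */
        }
        approx[m] = a;                           /* coarse approximation     */
        detail[m] = d;                           /* detail (residual) signal */
    }
}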
D. Biorthogonality

The wavelet filter implemented in this work is the 3.7 biorthogonal one. A biorthogonal wavelet is a wavelet for which the associated wavelet transform is invertible but not necessarily orthogonal. In the biorthogonal case there are two scaling functions φ and φ̃, which may generate different multiresolution analyses, associated with different wavelet functions ψ and ψ̃, where ψ̃ is used in the analysis and ψ in the synthesis. In addition, the scaling functions φ and φ̃ and the wavelet functions ψ and ψ̃ are related by duality in the following sense [15]:

∫ ψ_{j,k}(x) ψ̃_{j′,k′}(x) dx = 0

as soon as j ≠ j′ or k ≠ k′, and even

∫ φ_{0,k}(x) φ̃_{0,k′}(x) dx = 0

as soon as k ≠ k′ [16]. Concerning the biorthogonal wavelet filter, by using two wavelets, one for decomposition and the other for reconstruction, interesting properties are derived. Unlike orthogonal wavelets, biorthogonal ones require two different filters: one for the analysis and the other one for the synthesis of an input. The first number indicates the order of the synthesis filter, while the second number refers to the order of the analysis filter [17]. Unfortunately, the frequency responses of biorthogonal filters may not show any symmetry, and the energy of the decomposed signal may also not equal the energy of the original signal [12]. Nevertheless, there are some reasons behind using biorthogonal wavelets, related to some requirements to satisfy. One such condition is that they should have compact support: if this constraint were not satisfied, then we would need to use IIR filters to approximate the function, but IIR filters are more difficult to treat and they also require extra computing power.

III. GPU PARALLEL COMPUTING WITH CUDA

The advent of multicore CPUs and manycore GPUs has allowed the development of applications that transparently scale their parallelism to take advantage of the increasing number of processor cores [18]. The launch of the Nvidia CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without any knowledge of OpenGL, DirectX or the graphics pipeline [19]. CUDA (Compute Unified Device Architecture) is a novel technology for general purpose computing on the GPU, which lets users develop general GPU programs easily. A CUDA GPU has the following advantages [18], [19]:
• general programming environment: CUDA uses C programming tools and a C compiler, which give programs better compatibility and portability;
• more powerful parallel computing capability: a CUDA graphics card employs more transistors for computing;
• higher bandwidth: it can be strongly affected by the choice of memory in which data is stored, how the data is laid out and the order in which it is accessed;
• instruction operation: a CUDA GPU supports integer and bit operations.

According to these considerations, programmable GPUs have evolved into highly parallel, manycore processors with considerable computational power. Comparing the performance of the GeForce GTX 680, GeForce GTX 580 and GeForce GTX 480 with Sandy Bridge and Bloomfield Intel CPUs of around 2009, we can notice that since 2001 the theoretical floating point capability of GPUs has increased more rapidly than that of CPUs; furthermore, the memory bandwidth follows a similar trend, mainly looking at the years since 2003 [20]. The main difference in floating point capability between the CPU and the GPU is due to the specialization of the latter for compute-intensive, highly parallel computation.
A CUDA-enabled GPU is composed of several MIMD (multiple instruction, multiple data) multiprocessors that contain a set of SIMD (single instruction, multiple data) processors. Each multiprocessor has a shared memory that can be accessed by each of its processors, and it also shares a bigger global memory common to all the multiprocessors [21]. Shared memory buffers reside physically on the GPU, as opposed to residing in off-chip DRAM; because of this, the latency to access shared memory tends to be far lower than that of typical global memory buffers [7].

The CUDA hardware architecture has the following novel features [19]:
• general write/read global memory: the GPU can acquire data from any location of the global memory and also put data to any location, almost as flexibly as a CPU;
• shared memory placed on chip: it lets threads in the same multiprocessor get data immediately, avoiding frequent accesses to global memory;
• thread synchronization: threads in a thread group can be synchronized with each other, so they can communicate and collaborate to solve complex problems.

In the CUDA programming model, an application consists of a host program that executes on the CPU and of parallel kernel programs executing on the GPU. A kernel program is executed by a set of parallel threads, where a thread is a subdivision of a process into two or more sub-processes, which are executed concurrently by a single processor computer system (multithreading). The host program can dynamically allocate device global memory on the GPU and copy data to/from such memory from/to the memory on the CPU. Moreover, the host program can dynamically fix the number of threads used for running a kernel program. A set of threads forms a block, and each block has its own shared memory, which can be accessed only by the threads of the same block [21]. Note that interactions between CPU and GPU should be minimized in order to avoid communication bottlenecks and delays due to data transfers.

IV. SIGNAL PROCESSING BY WAVELET FILTER

As already widely stated, the main target of this paper is the implementation and performance analysis of an algorithm built for emulating the processing of a random signal through a 3.7 biorthogonal wavelet filter. However, before entering the description and performance analysis of the different algorithms used for the sequential and the GPU parallel computing versions, we want to give the theoretical basis about the mathematical meaning of a digital FIR filter.

A. Digital filter and Convolution

It is extremely useful to observe that any type of FIR digital filter can be interpreted mathematically as the convolution product between a function f, which represents the signal to scan, and a function h, which is the impulse response of the LTI system through which the signal passes, in our case represented by the filter. In the case, such as that of our interest, in which two discrete time functions S and h are given, it is possible to define the convolution operation as follows:

Y(j) = Σ_{i=0}^{N} S(j − i) h(i)    (4)

In other words, this operation can also be seen as the application of a function h, which represents the filter, to a function S which instead symbolizes the signal to filter. In particular, the terms h(i) are called the filter coefficients, while N is the order of the filter.

The convolution of a discrete time signal may be described by a sequence of steps achieved in the following way (a minimal C sketch is given right after this list):
1) the function S is reflected: S(i) → S(−i);
2) S is slid over h with a shift of j: S(j − i);
3) for each shift j, all the products between S and h are added.
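As an illustration of Eq. (4) and of the three steps above, the following minimal C sketch (ours, not one of the paper's listings) computes the convolution of a signal S of length n with a filter h of order N; out-of-range indices are simply skipped here, whereas the codes discussed in Section V handle the boundary with a symmetric extension.

/* Direct FIR convolution, Eq. (4): Y(j) = sum_{i=0..N} S(j-i) * h(i).
 * Out-of-range signal indices are skipped in this sketch; the codes in
 * Section V use a symmetric extension at the boundary instead. */
void fir_convolve(const float *S, int n, const float *h, int N, float *Y) {
    for (int j = 0; j < n; j++) {          /* one output sample per shift j */
        float acc = 0.0f;
        for (int i = 0; i <= N; i++) {     /* N is the filter order         */
            int k = j - i;                 /* index of the shifted signal   */
            if (k >= 0 && k < n)
                acc += S[k] * h[i];        /* accumulate the products       */
        }
        Y[j] = acc;
    }
}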
B. The 3.7 biorthogonal wavelet filter

The application of wavelet decomposition techniques to the numerical analysis of a signal seems very promising, because it highlights the "zooming" property, which allows a very good representation of critical points, such as irregularities, discontinuities, and so on [20]. The wavelet transform can be realized simply and effectively through a filter bank. In particular, the decomposition phase, adopted in this work, provides a low-pass and a high-pass filtering, followed by decimation by two; the reconstruction phase, instead, processes the inputs separately, through expansion and filtering, and then sums them. These operations have to be iterated according to the number of decomposition levels with which we want to achieve the wavelet transform [22].

The following table shows the sixteen 3.7 biorthogonal decomposition wavelet filter coefficients taken into account in this work [23].

Filter coefficient    Value
h0     0.0030210861
h1    -0.0090632583
h2    -0.0168317654
h3     0.0746639851
h4     0.0313329787
h5    -0.3011591259
h6    -0.0264992409
h7     0.9516421219
h8     0.9516421219
h9    -0.0264992409
h10   -0.3011591259
h11    0.0313329787
h12    0.0746639851
h13   -0.0168317654
h14   -0.0090632583
h15    0.0030210861
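In code, these sixteen decomposition coefficients can simply be kept in a constant array; the definition below is a convenience of ours (the paper only lists the values in the table above) and the names FILTRO and bior37_dec are illustrative.

/* Decomposition coefficients of the 3.7 biorthogonal wavelet filter,
 * copied from the table above (sixteen taps). */
#define FILTRO 16
static const float bior37_dec[FILTRO] = {
     0.0030210861f, -0.0090632583f, -0.0168317654f,  0.0746639851f,
     0.0313329787f, -0.3011591259f, -0.0264992409f,  0.9516421219f,
     0.9516421219f, -0.0264992409f, -0.3011591259f,  0.0313329787f,
     0.0746639851f, -0.0168317654f, -0.0090632583f,  0.0030210861f
};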
V. ALGORITHMS

Having understood how to build a digital filter from a mathematical point of view, we now see how to implement it in the C language. We propose two different versions of the code, whose target is the translation of the convolution operation, previously described, between the signal and the wavelet filter. The first version refers to a programming strategy that uses the parallel CUDA architecture, where the heart of the program, called the kernel, is defined in host memory as a global function and is executed in GPU memory, or rather on the device. Later we will analyze the temporal performance of the two programs and compare them, in order to highlight the differences and observe whether it is advantageous to use one implementation strategy or the other.

A. Parallel code in CUDA

The first implementation we want to present is related to the construction of a wavelet filter with a parallel programming model. The parallel code is implemented in the C language; however, we introduce changes to a conventional C code and take advantage of all the typical CUDA structures by executing certain functions on the GPU device. Once the C libraries and the library for device functions are loaded, we fix the maximum number of threads that each block draws in device memory. Obviously, this amount of threads is linked to the hardware features of the GPU cards. As regards the simulations presented in this paper, we use two graphics cards: a 480-core GeForce and a 1536-core Tesla Kepler. The number of blocks descends from these specifications, being calculated as the ratio between the total amount of data to process and the number of threads per block. We are now ready to describe the global main function which, as such, is called on the host and is then executed on the device: this function is exactly the kernel.

// kernel definition
__global__ void wavelet(float *sign, float *fil, float *y, int m) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= sample) return;
    int P = m * sample;
    for (int i = 0; i < filtro; i++)
        if (id - i > 0) y[P + id] += sign[id - i] * fil[i];
        else            y[P + id] += sign[i - id] * fil[i];
    int div = (int) powf(2, m);
    sign[id] = sign[id - (id % div)];
    return;
}
Listing 1: kernel definition

In the kernel, an instruction which performs the convolution operation between the signal S and the filter h is implemented. In fact, if filtro represents the number of wavelet filter coefficients (in our case filtro = 16) and sample is the amount of random signal samples taken into account, then each id-th component of the output signal y is obtained as the sum of the products between the i-th component of the filter and the (id − i)-th component of the signal; id is the thread index with which we exploit the computational capability of the various threads in GPU memory, and whose range should cover the whole length of the signal. In other words, the wavelet filter can be graphically thought of as a floating window, initially aligned with the first sample of the signal (id = 0), which is then slid until it covers all the samples.

Obviously, while probing the entire signal there will be situations in which, according to the values assumed by the index id and based on the position i of the filter coefficient, the difference id − i will be negative; this would require the translation of the signal to "negative" instants, i.e. points where the signal does not exist: this is known as the boundary problem. Therefore, it is necessary to extend the signal at the instants in which it is undefined; this can be done in several ways. The strategy adopted here is the symmetric extension, particularly suitable for biorthogonal filters [22]. For this reason, in the wavelet function there is a conditional instruction which chooses the (id − i)-th sample of S or the (i − id)-th one, to avoid the boundary problem.

As we can see from Listing 1, the wavelet function takes as input not only the pointer-to-float variables sign, fil and y, but also another parameter, m, that is the number of the current iteration. In fact, as we shall soon see, the purpose of this paper is to adopt the subband coding technique in a tree structure, iterating the kernel a number of times equal to num_iter in order to isolate, step by step, the residuals and details of the signal and get different levels of resolution. Therefore we also achieve the scaling procedure by organizing, step by step, the samples in groups of 2^m elements such that they assume the same value in each group. This operation is implemented as

int div = (int) powf(2, m);
sign[id] = sign[id - (id % div)];
Listing 2: scaling instructions

Inside the int main(), the actual variables corresponding to the formal variables are defined and, for each of them, memory allocation is carried out. We also introduce three other variables dev_sign, dev_fil and dev_y that play a similar role on the GPU device.

// allocation in host memory
sign_host = (float *) malloc(sample * sizeof(float));
fil_host  = (float *) malloc(filtro * sizeof(float));
// allocation in device memory
cudaMalloc((void**)&dev_sign, sample * sizeof(float));
cudaMalloc((void**)&dev_fil,  filtro * sizeof(float));
cudaMalloc((void**)&dev_y,    num_iter * sample * sizeof(float));
Listing 3: malloc and cudaMalloc

Later, after loading the samples of the signal and the filter coefficients on the arrays sign_host and fil_host respectively, we proceed with the data transfer to the device using cudaMemcpy commands. These transfers constitute a bottleneck in the performance of a code and therefore should be used as sparingly as possible. For this reason we decide to accumulate the data of the num_iter iterations in a single output array y, whose size is equal to num_iter * sample. Afterwards we carry out another cudaMemcpy to return the output data from the device to the host.

// data transfer from host to device
cudaMemcpy(dev_sign, sign_host, sample * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_fil,  fil_host,  filtro * sizeof(float), cudaMemcpyHostToDevice);
Listing 4: cudaMemcpy instructions

Finally, we save the output data of each single iteration and we can free the allocated host and device memory through the free(...) and cudaFree(...) commands. The clock(...) function, from the standard C library, is useful to measure the time of this and of the other program described in the following section. A sketch of how these host-side steps fit together is given below.
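To show how the fragments above may be combined, the following host-side sketch continues from Listings 3 and 4 (it reuses their variables) and adds the launch configuration, the iteration loop, the copy back to the host and the timing; none of this code is listed in full in the paper, so the grid/block computation, the variable y_host and the call to cudaDeviceSynchronize() are our assumptions.

// Hypothetical host-side sequence following Listings 3 and 4 (our sketch).
// threads_per_block is 480 on the GeForce and 1536 on the Tesla K10, as
// chosen in Section VI; num_iter is the number of decomposition levels.
int blocks = (sample + threads_per_block - 1) / threads_per_block;

clock_t start = clock();                              /* start the timer        */
for (int m = 0; m < num_iter; m++)                    /* one launch per level   */
    wavelet<<<blocks, threads_per_block>>>(dev_sign, dev_fil, dev_y, m);
cudaDeviceSynchronize();                              /* wait for all kernels   */

/* copy the accumulated output (num_iter * sample floats) back to the host */
cudaMemcpy(y_host, dev_y, num_iter * sample * sizeof(float), cudaMemcpyDeviceToHost);
double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;   /* seconds  */

cudaFree(dev_sign); cudaFree(dev_fil); cudaFree(dev_y);        /* device memory */
free(sign_host); free(fil_host); free(y_host);                 /* host memory   */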
B. Sequential program

Another version of the code that performs the same task refers to a sequential programming model. By this we mean the transposition of the signal filtering into a conventional program written in the C language. In this case the first thing to do is to define the function which achieves the convolution product and corresponds to the parallel code kernel, as shown below. As we can observe, such a function is similar to the kernel described above. Here the thread index disappears and is replaced by an index with the same name, but referring to a for loop in which each element y(id) is computed in a sequential way.

void wavelet(float *sign, float *fil, float *y, int m) {
    for (int id = 0; id < sample; id++) {
        int P = m * sample;
        for (int i = 0; i < filtro; i++)
            if (id - i > 0) y[P + id] += sign[id - i] * fil[i];
            else            y[P + id] += sign[i - id] * fil[i];
        int div = (int) pow(2, m);
        sign[id] = sign[id - (id % div)];
    }
    return;
}
Listing 5: sequential function

Clearly all the typical CUDA instructions have disappeared; nevertheless, in order to compare this code fairly with the parallel one, we need to replace them with similar instructions that emulate the same behavior. This explains why, instead of cudaMalloc, we find three additional variables d_sign_host, d_fil_host and d_y_host allocated in host memory, which substitute the arrays dev_sign, dev_fil and dev_y. Then, after the execution finishes, the cudaMemcpy calls are replaced by the assignments listed below, unnecessary for the results but useful to take into account in the evaluation of time (and performance).

d_sign_host = sign_host;
d_fil_host  = fil_host;
d_y_host    = y_host;

Finally, in this model we use the clock() function in order to measure the execution time of the program; a minimal timing loop is sketched below.
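The sequential driver itself is not listed in the paper; a plausible minimal form of it, under the same naming assumptions used above, is simply:

/* Hypothetical sequential driver (not listed in the paper): run the
 * wavelet() function of Listing 5 once per decomposition level and
 * measure the elapsed CPU time with clock() from time.h. */
clock_t t0 = clock();
for (int m = 0; m < num_iter; m++)
    wavelet(sign_host, fil_host, y_host, m);
double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;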
The goal now is to benchmark the two codes and compare the results.

VI. EXPERIMENTAL RESULTS

We now analyse the performance of the sequential and parallel codes. In order to pursue this target, we change the number of samples that makes up the examined signal and we evaluate the resulting time trend as the number of samples gradually increases. The simulations of both codes are conducted on two computers with different technical specifications. In fact, the first machine is used as a normal PC for student laboratory applications; the other one is dedicated to specific applications as a server, and has high reliability, increased performance and additional features. For this reason, in order to distinguish them, we denote the first computer as PC and the second one as Server.

A. Technical specifications

We describe the main features of the two machines and of their respective CPU and graphics card.

1) PC: It has an AMD Athlon 64 X2 CPU with a clock frequency of up to 2.9 GHz and an NVIDIA GeForce GTX 480 GPU. This GPU shows the following engine and memory specifications [24]: it has 480 CUDA cores, a graphics clock of about 700 MHz and a texture fill rate of 42 billion texels per second. The interfaces are designed for a PCI Express 2.0 x16 accelerator, with a peak bandwidth of up to 2 GByte per second and with 1536 MByte of standard GDDR5 memory, whose locations are connected through a 384-bit memory interface.

2) Server: In the server we find an Intel Xeon CPU and an NVIDIA Tesla Kepler K10 graphics card; the card has on-board memory of which 4 GByte is dedicated to the GPU, and it supports a PCI Express Gen 3 bus interface.

B. Simulations

Since the technical features of the PC GPU are very different from the Server one, it is clear that for the parallel code we have to adjust the number of threads per block according to the type of GPU used: if we work with the GeForce, this parameter has to be equal to 480; if the code is executed on the server with the Tesla Kepler K10 card, then we should set it equal to 1536.

The simulation of the parallel code on the PC gives the result represented graphically in Figure 3a. Figure 3a shows an almost linear trend throughout the range from 1000 to 10 million samples, with a computation time varying from 150 to 300 ms for the first 10000 samples. Up to 5 million samples the computing time increases by about 20 seconds; in a range of equal width, from 5 to 10 million samples, the computing time increment is about 15 seconds, compared with the theoretically expected 20 s that we would obtain if the curve had not undergone a downward level shift. A similar trend is given in Figure 3b for the simulations of the parallel code on the server. In this case the computing time is initially larger by about one order of magnitude, and grows roughly linearly up to 9 million samples, where the execution time of the program is 17.4 s. Subsequently, from 9 to 10 million samples, the computing time increases by only 200 ms.

Fig. 3: Computing time of the parallel code in PC and Server; (a) computing time of the parallel code in PC for a varying amount of samples, (b) computing time of the parallel code in Server for a varying amount of samples

A comparison between the PC and server runs, for the parallel and the sequential version, is shown in Figures 4a and 4b. We can see that, for the PC, for a number of samples lower than 15000 the sequential code timing is below the curve of the parallel code; in the range between 20000 and 70000 samples the two curves intersect several times, showing an almost swinging behavior; from 70000 samples onward, instead, the CPU line is above the GPU one and, as the number of samples increases, their distance gets larger and larger, reaching 25 seconds at 10 million samples. Server simulations show, instead, that the parallel code starts performing better than the sequential one only when 3 million samples are reached, a value much higher than the 70000 seen for the PC. We should note that in this case the gap between the two curves is very small: in fact, in the range up to 3 million samples the sequential code is faster than the parallel one by about 2 s; once the point where the two codes have similar performance is crossed, the GPU starts to work better, gaining a time advantage of about 4.5 s in the neighborhood of 10 million samples.

Fig. 4: Comparison between GPU and CPU performance in PC and Server; (a) parallel and sequential program in PC, (b) parallel and sequential program in Server
Figure 5 gives a global overview of the behavior of the CPUs and GPUs in PC and Server. As clearly shown in this graph, once a high enough amount of samples is reached (almost 3 million), the best performance in terms of saved execution time is obtained when we implement and execute the parallel code using the server GPU: this result was quite predictable, because with this card we use 1536 cores compared to the 480 of the GeForce GPU. Nevertheless, on careful observation, although the server GPU performance is better than the GeForce one, being able to gain between 10 and 25 s over the latter, it is equally true that there are no considerable differences with respect to what happens by exploiting the server CPU potential rather than the Kepler GPU: in this case the separation between the Intel Xeon CPU curve and the Tesla Kepler one is only a few seconds. This result is due to the fact that the server, because of its features and applications, has more advanced hardware equipment than a normal PC; in fact the Xeon CPU, with its 6 cores per socket and a clock frequency of up to 3 GHz, can provide very satisfactory results, not very different from those offered by the Tesla Kepler GPU. On the other hand, the difference between the use of a parallel code and of a sequential one is more pronounced when we look at a PC with an average CPU and compare it with the GeForce GTX 480 GPU, which belongs to the high-end market and, with its 1536 MByte of installed RAM, its GDDR5 video memory and its high operating frequencies, is certainly one of the most important products made by the NVIDIA corporation [24].

Fig. 5: Comparison among parallel and sequential programs on PC and Server

According to what has been argued so far, some significant information can be derived by plotting all the different ratio combinations between GPU and CPU timings for PC and server. From Figure 6a we clearly observe that for the first 20000 samples the sequential code on the CPU is faster than the GPU program; the situation is reversed past this value of samples, beyond which the parallel code becomes approximately 1.5 times faster than the sequential one.

Fig. 6: Ratio CPU/GPU for PC and Server; (a) ratio between AMD Athlon CPU and NVIDIA GeForce GPU times in PC, (b) ratio between Intel Xeon CPU and NVIDIA Tesla K10 GPU times in Server

As shown in figure 6b, for the server we have a similar situation.
However, here the GPU shows its advantages only after passing 3 million samples. Moreover, in this graph the performance of the parallel code is better by about a factor of 1.5, compared with the sequential one, at the upper limit of our observation range.

The situation is a bit different when we take mixed ratios between the GPU performance of one machine and the CPU performance of the other one. If we take into account figure 7a, we see a downward trend: for an amount of samples equal to 1000, the times of the server GPU are greater than those of the PC CPU, even by over two orders of magnitude. The performance of the examined GPU and CPU becomes equal when we reach 500000 samples. Going forward, this ratio becomes smaller and smaller, until it settles around 0.3 in the range from 9 to 10 million samples.

Fig. 7: Mixed ratios among GPUs and CPUs; (a) mixed ratio between NVIDIA Tesla K10 GPU and AMD Athlon CPU times, (b) mixed ratio between NVIDIA GeForce GPU and Intel Xeon CPU times

A monotonically similar trend is shown in graph 7b. However, for the reasons seen previously, the ratio between the GPU time of the PC and the CPU time of the server always remains higher than one, demonstrating that the GeForce performance is always inferior when compared with the Intel Xeon CPU. Finally, if we compare the execution times of the two GPUs with each other, we observe that initially the curve remains below unity in the range from 1000 to about 800000 samples, indicating that the PC GPU works better than the Server one, with values up to 10 times lower. Instead, in the remaining part of the observation range, the GeForce execution time is about twice that of the Tesla Kepler K10 (see figure 8).

Fig. 8: Ratio between PC and Server GPU

VII. RELATED WORKS

We now mention some works closer to this one that have been carried out in the parallel computing context, in particular reflecting on the potential of GPU cards applied to wavelet filters. In [25] the authors deepen this concept with an algorithm centered on GPU cards to reconstruct 3D wavelets, using fragment programs to minimize the data transfer and the fragment processing overhead; for this purpose they have proposed a novel scheme that uses tileboards as the primary layout to organize the 3D wavelet coefficients.
In article [26], the authors present an algorithm that performs the complete SIMD discrete wavelet transform (DWT) convolution on a GPU, which brings a significant performance improvement on a normal PC, without additional costs. Although the forward and inverse wavelet transforms are mathematically different, the algorithm they propose unifies them into an almost identical process that can be efficiently implemented on a GPU. The DWT has a wide range of applications, from signal processing to video and image compression. For this reason, in [27] the authors have shown that this transformation, by means of the lifting scheme, can be performed in a memory- and computation-efficient way on the GPU, through Nvidia's CUDA computing paradigm. In particular, their design exploits three major hardware architectures for the 2D DWT (row-column, line-based, block-based) and describes a hybrid method between the row-column and the block-based methods.

Another work, [28], presents an improved version of an algorithm suitable to compute the 2D DWT on the GPU. In their work the authors have adapted this method for an existing algorithm computing the vertical transform pass. By taking advantage of the computational power of a GPU when implementing a wavelet transform, they demonstrate that the computation time can be substantially reduced.

Work [29] proposes a method which analyzes the behavior of several parallel algorithms developed to compute the two-dimensional DWT, in particular using both OpenMP over a multicore platform and CUDA over a GPU. They have found that significant execution time improvements are achieved on both multicore platforms and GPUs. Likewise, in the paper [30] the authors explore the potential of OpenCL in accelerating the DWT computation and analyze the programmability, portability and performance aspects of this language; their experimental analysis is done using NVIDIA and AMD drivers that support OpenCL.

The authors of paper [31] present an overview of wavelet-based multiresolution analysis, discussing the continuous wavelet transform in its simplest form and looking at orthogonal, biorthogonal and semiorthogonal wavelets. Many authors have applied the strategy of parallel computing using wavelets to the image encoding field. In the paper [32] the computation steps in JPEG2000 are examined, particularly in Tier-1, and novel, GPGPU-compatible, parallel processing methods through wavelets for the sample-level coding of the images are developed. In work [33], close to the previous one, the authors focus their attention on accelerating a JPEG2000 encoder by using GPGPU.

Finally, [34] presents a GPGPU framework with the corresponding parallel computing solution for wavelet-based image denoising, using off-the-shelf consumer-grade programmable GPUs. Their experimental results show that the framework gains applicability in data parallelism and satisfactory performance in accelerating computations for wavelet-based denoising.

VIII. CONCLUSIONS

In this work we have analyzed the computational capacity of a parallel code, written in the CUDA environment, which implements the filtering of a random signal using a biorthogonal wavelet filter, and we have compared it with a sequential code that performs the same task; we have executed both codes on two computers having different technical specifications.

The simulations make it clear that, except in the case in which the number of samples is very low, the adoption of a parallel programming technique is certainly advantageous in terms of execution time. In particular, beyond 3 million samples the server GPU performance predominates over the CPU of the same computer and over both the GPU and the CPU of the PC, confirming the strong potential of parallel computing in terms of efficiency, computational capacity and execution time. The potential for future GPU performance increases presents great opportunities for demanding applications, including computational graphics, computer vision, and a wide range of high-performance computing applications [35].

ACKNOWLEDGMENT

The authors would like to thank Giacomo Capizzi and Christian Napoli for leading them into the field of parallel GPU computing with CUDA and for their support in drafting and reviewing this paper.
REFERENCES

[1] G. Pappalardo and E. Tramontana, "Automatically discovering design patterns and assessing concern separations for applications," in Proceedings of the ACM Symposium on Applied Computing (SAC), Dijon, France, April 2006, pp. 1591–1596.
[2] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappala, "A Cloud-Distributed GPU Architecture for Pattern Identification in Segmented Detectors Big-Data Surveys," The Computer Journal, 2014.
[3] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, "A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSS by using WRNN predictors and GPU parallel solutions," in Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), 2014 International Symposium on. IEEE, 2014, pp. 1077–1084.
[4] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, "A cascade neural network architecture investigating surface plasmon polaritons propagation for thin metals in OpenMP," in Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC), ser. Springer LNCS, vol. 8467, Zakopane, Poland, June 2014, pp. 22–33.
[5] C. Napoli, G. Pappalardo, E. Tramontana, R. Nowicki, J. Starczewski, and M. Woźniak, "Toward work groups classification based on probabilistic neural network approach," in Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC), ser. Springer LNCS, vol. 9119, Zakopane, Poland, June 2015, pp. 79–89.
[6] M. Woźniak, D. Połap, M. Gabryel, R. Nowicki, C. Napoli, and E. Tramontana, "Can we process 2D images using artificial bee colony?" in Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC), ser. Springer LNCS, vol. 9119, Zakopane, Poland, June 2015, pp. 660–671.
[7] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, NVIDIA.
[8] Introduction to parallel programming. [Online]. Available: courses.cs.washington.edu
[9] E. Tramontana, "Managing Evolution Using Cooperative Designs and a Reflective Architecture," in Reflection and Software Engineering, ser. Springer LNCS, vol. 1826, 2000.
[10] O. Rioul and M. Vetterli, "Wavelets and Signal Processing," IEEE SP Magazine, October 1991.
[11] C. Napoli, G. Pappalardo, and E. Tramontana, "An agent-driven semantical identifier using radial basis neural networks and reinforcement learning," in XV Workshop "From Objects to Agents" (WOA), vol. 1260, Catania, Italy: CEUR-WS, September 2014.
[12] J. Akhtar, Optimization of biorthogonal wavelet filters for signal and image compression, February 2001.
[13] C. Napoli, G. Pappalardo, and E. Tramontana, "A Hybrid Neuro-Wavelet Predictor for QoS Control and Stability," in Proceedings of Advances in Artificial Intelligence (AI*IA), ser. Springer LNCS, vol. 9119, Torino, Italy, December 2013, pp. 527–538.
[14] [Online]. Available: http://www.bssaonline.org
[15] M. A. Salem, N. Ghamry, and B. Meffert, "Daubechies versus biorthogonal wavelets for moving object detection in traffic monitoring systems," 2009.
[16] S. Mallat, A Wavelet Tour of Signal Processing, October 2008.
[17] M. P. Chaudhary and G. Lalit, "Lifting Scheme Using HAAR and Biorthogonal Wavelets for Image Compression," International Journal of Engineering Research and Applications, 2003.
[18] "CUDA C Programming Guide, version 4.2," pp. 1–5.
[19] Z. Yang, Y. Zhu, and Y. Pu, "Parallel Image Processing Based on CUDA," in Proceedings of the International Conference on Computer Science and Software Engineering, 2008.
[20] NVIDIA CUDA C Programming Guide, April 2012.
[21] A. Fornaia, C. Napoli, G. Pappalardo, and E. Tramontana, "An AO system for OO-GPU programming," in Proceedings of the 16th Workshop "From Objects to Agents" (WOA15), Naples, Italy, June 2015.
[22] L. Verdoliva, "La trasformata wavelet," 2009–2010.
[23] [Online]. Available: http://wavelets.pybytes.com/wavelet/bior3.7
[24] [Online]. Available: http://www.nvidia.it/object/geforce-family-it.html
[25] A. Garcia and H.-W. Shen, "GPU-Based 3D Wavelet Reconstruction with Tileboarding," The Visual Computer, vol. 21, no. 8–10, pp. 755–763, 2005.
[26] T.-T. Wong, C.-S. Leung, P.-A. Heng, and J. Wang, "Discrete Wavelet Transform on Consumer-Level Graphics Hardware," IEEE Transactions on Multimedia, vol. 9, no. 3, pp. 668–673, 2007.
[27] W. van der Laan, A. Jalba, and J. Roerdink, "Accelerating Wavelet Lifting on Graphics Hardware Using CUDA," IEEE Transactions on Parallel and Distributed Systems, vol. 22, August 2010.
[28] H. Moen, "Wavelet transforms and efficient implementation on the GPU," May 2007.
[29] V. Galiano, O. Lopez, M. Malumbres, and H. Migallon, "Parallel strategies for 2D Discrete Wavelet Transform in shared memory systems and GPUs," Springer Science, 2012.
[30] B. Sharma and N. Vydyanathan, "Parallel discrete wavelet transform using the Open Computing Language: a performance and portability study," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, April 2010.
[31] B. Jawerth and W. Sweldens, "An Overview of Wavelet Based Multiresolution Analyses," SIAM Review, vol. 36, no. 3, pp. 377–412, 1994.
[32] R. Le, I. Bahar, and J. Mundy, "A Novel Parallel Tier-1 Coder for JPEG2000 Using GPUs," in Application Specific Processors, IEEE, June 2011.
[33] M. Ahmadvand and A. Ezhdehakosh, "GPU-based implementation of JPEG2000 encoder," in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2012.
[34] Y. Su and Z. Xu, "Parallel implementation of wavelet-based image denoising on programmable PC-grade graphics hardware," Signal Processing, vol. 90, no. 8, pp. 2396–2411, 2010.
[35] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, "GPUs and the Future of Parallel Computing," IEEE Computer Society, September/October 2011.