Performances of a Parallel CUDA Program for a Biorthogonal Wavelet Filter

Antony Scivoletto, University of Catania, Viale A. Doria 6, 95125 Catania, Italy, antonyscivoletto@gmail.com
Nella Romano, University of Catania, Viale A. Doria 6, 95125 Catania, Italy, nromano919@gmail.com

Copyright © 2016 held by the authors.

Abstract—Parallel high-performance computing technologies have encountered tremendous growth, especially in the last decade, and they have made a strong impact on a variety of areas concerning mathematical and engineering fields. In this work we analyze the behavior of a parallel algorithm developed to process a random signal through a biorthogonal discrete wavelet filter using CUDA, and we compare the parallel implementation against similar sequential CPU code. The simulations are conducted using two computers: the first one is a conventional host; the second one is a server. This comparison allows us to underline the differences, in terms of time performance, between hardware architectures with different specifications. The experimental results show that, if the signal is decomposed into a rather substantial number of samples, the parallel code timings on both the server GPU and the ordinary host GPU beat the corresponding sequential code on the same machines.

Index Terms—GPU programming, performances, wavelet

I. INTRODUCTION

The last decades have been characterized by large shifts in the perceived value of parallel computing. In these years high-performance computing technologies have seen substantial growth, and this is the case for general purpose computing on GPU, or rather GPGPU. The intensive use of GPU parallelization techniques has given the possibility to implement complex simulations and hard computational tasks [1], [2], [3], [4], [5], [6]. For a long time, one of the most important methods used to improve the computational capability of devices has been increasing the processor clock speed. Unfortunately, because of various fundamental limitations in the manufacturing of integrated circuits, it is not possible to extract much additional computational power from existing architectures by increasing the processor clock speed. Supercomputer manufacturers were then able to obtain large performance gains by strongly increasing the number of processors: they could have tens or hundreds of thousands of processor cores working concurrently. CPU manufacturers have announced plans for 12- or 16-core CPUs. In this way, a big number of simple multithreaded cores in GPU cards can offer the potential for strong speedups in several general purpose applications, when compared with the corresponding sequential CPU computation, confirming that parallel computing has arrived for good [7].

Nevertheless, the architecture of GPU cards gives rise to some programming issues, such as sharing resources, which are managed through a lock or a token. In shared memory machines, a single address space and a global memory are shared between multiple processors; each processor owns a local cache, and its values have to be made coherent with the global memory by the operating system. Data can be exchanged among processors simply by placing the values in a predefined location and synchronizing appropriately [8]. Other issues, which complicate development and have no counterpart in the sequential world, include: finding and expressing concurrency, managing data distributions, managing communication among processors, balancing the computational load, and simply implementing the parallel algorithm correctly. Therefore, implementing parallel programs is more difficult than implementing sequential ones [8], [9]. In fact, in sequential programming the programmer has to implement a code that is correct and efficient to execute. Parallel programming involves the same issues, and adds several additional challenges, some of which are described above.

In this paper we implement a parallel code on the CUDA architecture, subjecting a random signal to a biorthogonal discrete low-pass wavelet filter. In particular, this work is structured as follows. In section 2 we present some essential concepts for introducing wavelet theory and we refer to specific techniques, such as subband coding, in order to process a given signal with the biorthogonal wavelet filter in the discrete time case. In section 3 we describe the main features of the CUDA architecture, used in this work for the parallelization procedure. In section 4 some basic concepts on the definition of a digital filter and its implementation are given through the mathematical convolution operation. Section 5 examines the serial and parallel algorithms with which the wavelet filter implementation and the signal processing are achieved. In section 6 we show the results obtained by experimental simulations and compare the time performance of the two codes executed on two machines with different technical specifications. Finally, in the last two sections we deduce the consequent results and conclusions, and we also report the works related to the subject matter described in this paper.

II. WAVELET THEORY

A. Preliminaries

Only in recent years has wavelet theory been developed as a unifying framework for a large number of techniques thought for wave signal processing applications, such as multiresolution analysis, subband coding and wavelet series expansions. The idea of treating a signal at various scales and analyzing it with different resolutions has emerged independently in many mathematics, physics and engineering fields [10]. The interest of the signal processing community became strong when Daubechies and Mallat, giving a further contribution to wavelet theory, established connections to discrete signal processing results. Since then, a number of theoretical as well as practical contributions have been made on various aspects of this topic, and nowadays it is growing rapidly [10], [11].

B. Wavelets in the continuous domain

Wavelets are continuous basis functions constructed mainly in order to satisfy certain mathematical properties. A wavelet is defined as a function with zero mean value which lies both above and below the x-axis, looking like a "wave" [12]. It is possible to give a more rigorous definition of a wavelet. In fact, if we consider a multiresolution decomposition of L^2(R), that is

∅ ⊂ V_0 ⊂ ⋯ ⊂ V_j ⊂ V_{j+1} ⊂ ⋯ ⊂ L^2(R)

and we call W_j the orthogonal complement of V_j, then we can define a wavelet as a function ψ(x) such that the set {ψ(x − l), l ∈ Z} is a Riesz basis of W_0 and, moreover, subject to the following constraints [13]:

1) ∫_{−∞}^{+∞} ψ(x) dx = 0

2) ‖ψ(x)‖² = ∫_{−∞}^{+∞} ψ(x) ψ*(x) dx = 1
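For instance, a simple function satisfying both constraints (reported here only as an illustration; it is not the wavelet adopted in this work) is the Haar wavelet:

\psi(x) =
\begin{cases}
 1 & 0 \le x < \tfrac{1}{2} \\
-1 & \tfrac{1}{2} \le x < 1 \\
 0 & \text{otherwise}
\end{cases}

which integrates to zero and has unit norm; smoother wavelets, such as those shown in figure 1, satisfy the same two conditions.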
The idea behind wavelets is that, by stretching and translating one such main wavelet function ψ(t) (called the mother wavelet), we can represent finer parts of a function or a signal by simple linear combinations:

f(t) = Σ_{j,k} b_{j,k} ψ_{j,k}(t)    (1)

where the b_{j,k} are called the wavelet coefficients of the function f in the wavelet basis, given by the inner product of f with ψ_{j,k}. Moreover, a whole family of wavelet functions can be obtained by just shifting and scaling the mother one according to the law [12]:

ψ_{j,k}(t) = 2^{j/2} ψ(2^j t − k),  j, k ∈ Z    (2)

In the procedure of wavelet dilation and translation, we can see that for a large j, ψ_{j,k}(t) is short and of high frequency; smaller values of j, instead, give long wavelet functions of low frequency. This offers us the possibility to analyze a function at different scales [10]. Typical examples of wavelet functions are shown in figure 1.

Fig. 1: Example of wavelet functions [14]

C. Discrete Time Case

Now we focus our attention on discrete time signals, since this is the case of interest for the implementation of a wavelet filter as a parallel program which exploits the computational capabilities of GPUs. In the discrete time case two methods were developed independently, namely subband coding and multiresolution signal analysis. Two notions which are important for wavelet theory are scale and resolution. Scale is related to the size of the signal, while resolution refers to the amount of detail present in the signal. In order to understand these concepts better, we can note that the scale parameter in discrete wavelet analysis behaves as follows [10]:
• for large scales, dilated wavelets take a global view of a subsampled signal;
• for small scales, contracted wavelets analyze small details of the signal.

1) Multiresolution analysis: With this method we can derive a lower resolution signal by lowpass filtering with a half-band low-pass filter having impulse response h(n). This results in a signal y(n) where

y(n) = Σ_{k=−∞}^{+∞} h(k) x(2n − k)    (3)

Now, based on the subsampled version of x(n), we want to find an approximation, a(n), to the original: this is done by inserting a zero between every sample, because we need a signal at the original scale for comparison. In general, a(n) is not going to be equal to x(n); therefore we compute the difference between a(n) and x(n) with the following formula:

d(n) = x(n) − a(n)

However, there is some redundancy, since a signal with sampling rate f_s is mapped into two signals d(n) and y(n) with sampling rates f_s and f_s/2, respectively [10].

2) Subband coding schemes: The method of multiresolution analysis described in [10] is characterized by a redundant set of samples. We now look at a different scheme where no such redundancy appears. This method is called the subband coding scheme, first widely used in voice compression. The idea behind this technique is based on the following methodology: a pair of filters is used, consisting of a low-pass and a high-pass filter; we decompose a sequence x(n) into two subsequences at half rate, or half resolution, by means of orthogonal filters. This process can be iterated on either or both subsequences. In particular, to obtain finer frequency resolution at lower frequencies, we iterate the scheme on the lower band only. Each iteration halves the width of the low band (in fact it increases its frequency resolution by two), but because of the subsampling by two, its time resolution is halved as well. Schematically, this is shown in figure 2.

Fig. 2: Filter bank tree of the Discrete Wavelet Transform, implemented with two discrete time filters (low-pass and high-pass, respectively) and subsampling by two [12]

An important feature of this discrete algorithm is its relatively low complexity. Actually, the following somewhat surprising result emerges: regardless of the depth of the tree in figure 2, the complexity is linear in the number of input samples, with a constant factor which depends on the length of the filter [10]. A minimal sketch of a single decomposition level is reported below.
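The following C fragment is only an illustrative sketch of one analysis level of such a filter bank, written for this discussion and not taken from the original code: it assumes a generic low-pass/high-pass FIR pair h and g of length L, and a simple symmetric extension at the borders.

/* Illustrative sketch (not from the paper): one analysis level of the
 * filter bank of figure 2. The input x of length n is filtered by a
 * low-pass filter h and a high-pass filter g, both of length L, and
 * each output is downsampled by two, as in Eq. (3). */
static float sample_sym(const float *x, int n, int k) {
    if (k < 0)  k = -k;            /* symmetric extension at the left border  */
    if (k >= n) k = 2*n - 2 - k;   /* symmetric extension at the right border */
    return x[k];
}

void analysis_step(const float *x, int n, const float *h, const float *g,
                   int L, float *approx, float *detail) {
    for (int m = 0; m < n/2; m++) {              /* n/2 output samples per band */
        float a = 0.0f, d = 0.0f;
        for (int k = 0; k < L; k++) {
            a += h[k] * sample_sym(x, n, 2*m - k);   /* low-pass branch  */
            d += g[k] * sample_sym(x, n, 2*m - k);   /* high-pass branch */
        }
        approx[m] = a;                           /* coarse approximation     */
        detail[m] = d;                           /* detail (residual) signal */
    }
}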
D. Biorthogonality

The wavelet filter implemented in this work is the 3.7 biorthogonal one. A biorthogonal wavelet is a wavelet for which the associated wavelet transform is invertible but not necessarily orthogonal. In the biorthogonal case there are two scaling functions φ and φ̃, which may generate different multiresolution analyses, associated with different wavelet functions ψ and ψ̃, where ψ̃ is used in the analysis and ψ in the synthesis. In addition, the scaling functions φ and φ̃ and the wavelet functions ψ and ψ̃ are related by duality in the following sense [15]:

∫ ψ_{j,k}(x) ψ̃_{j′,k′}(x) dx = 0

as soon as j ≠ j′ or k ≠ k′, and even

∫ φ_{0,k}(x) φ̃_{0,k′}(x) dx = 0

as soon as k ≠ k′ [16]. Concerning the biorthogonal wavelet filter, by using two wavelets, one for decomposition and the other for reconstruction, interesting properties are derived. Unlike orthogonal wavelets, biorthogonal ones require two different filters: one for the analysis and the other one for the synthesis of an input. The first number indicates the order of the synthesis filter, while the second number refers to the order of the analysis filter [17]. Unfortunately, the frequency responses of biorthogonal filters may not show any symmetry, and the energy of the decomposed signal may also not equal the energy of the original signal [12]. Nevertheless, there are some reasons behind using biorthogonal wavelets, related to some requirements to satisfy. One such condition is that they should have compact support: if this constraint were not satisfied, then we would need to use IIR filters to approximate the function, but IIR filters are more difficult to treat and they also require extra computing power.

III. GPU PARALLEL COMPUTING WITH CUDA

The advent of multicore CPUs and manycore GPUs has allowed the development of applications that transparently scale their parallelism to take advantage of the increasing number of processor cores [18]. The launch of the Nvidia CUDA technology has opened a new era for GPGPU computing, allowing the design and implementation of parallel GPU-oriented algorithms without any knowledge of OpenGL, DirectX or the graphics pipeline [19]. CUDA (Compute Unified Device Architecture) is a novel technology for general purpose computing on the GPU, which lets users develop general GPU programs easily. A CUDA GPU has the following advantages [18], [19]:
• general programming environment: CUDA uses C programming tools and a C compiler, which give programs better compatibility and portability;
• more powerful parallel computing capability: a CUDA graphics card employs more transistors for computing;
• higher bandwidth: it can be strongly affected by the choice of memory in which data is stored, how the data is laid out and the order in which it is accessed;
• instruction operation: a CUDA GPU supports integer and bit operations.

According to these considerations, programmable GPUs have evolved into highly parallel, manycore processors with considerable computational power. Comparing the performance of the GeForce GTX 680, GeForce GTX 580 and GeForce GTX 480 with Sandy Bridge and Bloomfield Intel CPUs of around 2009, we can notice that since 2001 the theoretical floating point capability of GPUs has increased more rapidly than that of CPUs; furthermore, the memory bandwidth follows a similar trend, mainly looking at the years since 2003 [20]. The main difference in floating point capability between the CPU and the GPU is due to the specialization of the latter for compute-intensive, highly parallel computation.
A CUDA-enabled GPU is composed of several MIMD (multiple instruction, multiple data) multiprocessors that contain a set of SIMD (single instruction, multiple data) processors. Each multiprocessor has a shared memory that can be accessed by each of its processors, and it also shares a bigger global memory common to all the multiprocessors [21]. Shared memory buffers reside physically on the GPU, as opposed to residing in off-chip DRAM; because of this, the latency to access shared memory tends to be far lower than that of typical global memory buffers [7].

The CUDA hardware architecture has the following novel features [19]:
• general write/read global memory: the GPU can acquire data from any location of the global memory and also put data to any location, almost as flexibly as a CPU;
• shared memory placed on chip: it lets threads in the same multiprocessor get data immediately, avoiding frequent accesses to global memory;
• thread synchronization: threads in a thread group can be synchronized with each other, so they can communicate and collaborate to solve complex problems.

In the CUDA programming model, an application consists of a host program that executes on the CPU and of parallel kernel programs executing on the GPU. A kernel program is executed by a set of parallel threads, where a thread is a subdivision of a process into two or more sub-processes, which are executed concurrently by a single processor computer system (multithreading). The host program can dynamically allocate device global memory on the GPU and copy data to/from such memory from/to the memory on the CPU. Moreover, the host program can dynamically fix the number of threads used for running a kernel program. A set of threads forms a block, and each block has its own shared memory, which can be accessed only by the threads of the same block [21]. Note that interactions between CPU and GPU should be minimized in order to avoid communication bottlenecks and delays due to data transfers.

IV. SIGNAL PROCESSING BY WAVELET FILTER

As already widely stated, the main target of this paper is the implementation and performance analysis of an algorithm built for emulating the processing of a random signal through a 3.7 biorthogonal wavelet filter. However, before entering the description and performance analysis of the different algorithms used for the sequential and the GPU parallel computing versions, we want to give the theoretical basis about the mathematical meaning of a digital FIR filter.

A. Digital filter and Convolution

It is extremely useful to observe that any type of FIR digital filter can be interpreted mathematically as the convolution product between a function f, which represents the signal to scan, and a function h, which is the impulse response of the LTI system through which the signal passes, in our case represented by the filter. In the case, such as that of our interest, in which two discrete time functions S and h are given, it is possible to define the convolution operation as follows:

Y(j) = Σ_{i=0}^{N} S(j − i) h(i)    (4)

In other words, this operation can also be seen as the application of a function h, which represents the filter, to a function S which instead symbolizes the signal to filter. In particular, the terms h(i) are called the filter coefficients, while N is the order of the filter.

The convolution of a discrete time signal may be described by a sequence of steps achieved in the following way (a minimal C sketch is given right after this list):
1) the function S is reflected: S(i) → S(−i);
2) S is slid over h with a shift of j: S(j − i);
3) for each shift j, all the products between S and h are added.
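As an illustration of Eq. (4) and of the three steps above, the following minimal C sketch (ours, not one of the paper's listings) computes the convolution of a signal S of length n with a filter h of order N; out-of-range indices are simply skipped here, whereas the codes discussed in Section V handle the boundary with a symmetric extension.

/* Direct FIR convolution, Eq. (4): Y(j) = sum_{i=0..N} S(j-i) * h(i).
 * Out-of-range signal indices are skipped in this sketch; the codes in
 * Section V use a symmetric extension at the boundary instead. */
void fir_convolve(const float *S, int n, const float *h, int N, float *Y) {
    for (int j = 0; j < n; j++) {          /* one output sample per shift j */
        float acc = 0.0f;
        for (int i = 0; i <= N; i++) {     /* N is the filter order         */
            int k = j - i;                 /* index of the shifted signal   */
            if (k >= 0 && k < n)
                acc += S[k] * h[i];        /* accumulate the products       */
        }
        Y[j] = acc;
    }
}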
B. The 3.7 biorthogonal wavelet filter

The application of wavelet decomposition techniques to the numerical analysis of a signal seems very promising, because it highlights the "zooming" property, which allows a very good representation of critical points, such as irregularities, discontinuities, and so on [20]. The wavelet transform can be realized simply and effectively through a filter bank. In particular, the decomposition phase, adopted in this work, provides a low-pass and a high-pass filtering, followed by decimation by two; the reconstruction phase, instead, processes the inputs separately, through expansion and filtering, and then sums them. These operations have to be iterated according to the number of decomposition levels with which we want to achieve the wavelet transform [22].

The following table shows the sixteen 3.7 biorthogonal decomposition wavelet filter coefficients taken into account in this work [23].

Filter coefficient    Value
h0     0.0030210861
h1    -0.0090632583
h2    -0.0168317654
h3     0.0746639851
h4     0.0313329787
h5    -0.3011591259
h6    -0.0264992409
h7     0.9516421219
h8     0.9516421219
h9    -0.0264992409
h10   -0.3011591259
h11    0.0313329787
h12    0.0746639851
h13   -0.0168317654
h14   -0.0090632583
h15    0.0030210861
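In code, these sixteen decomposition coefficients can simply be kept in a constant array; the definition below is a convenience of ours (the paper only lists the values in the table above) and the names FILTRO and bior37_dec are illustrative.

/* Decomposition coefficients of the 3.7 biorthogonal wavelet filter,
 * copied from the table above (sixteen taps). */
#define FILTRO 16
static const float bior37_dec[FILTRO] = {
     0.0030210861f, -0.0090632583f, -0.0168317654f,  0.0746639851f,
     0.0313329787f, -0.3011591259f, -0.0264992409f,  0.9516421219f,
     0.9516421219f, -0.0264992409f, -0.3011591259f,  0.0313329787f,
     0.0746639851f, -0.0168317654f, -0.0090632583f,  0.0030210861f
};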
V. ALGORITHMS

Having understood how to build a digital filter from a mathematical point of view, we now see how to implement it in the C language. We propose two different versions of the code, whose target is the translation of the convolution operation, previously described, between the signal and the wavelet filter. The first version refers to a programming strategy that uses the parallel CUDA architecture, where the heart of the program, called the kernel, is defined in host memory as a global function and is executed in GPU memory, or rather on the device. Later we will analyze the temporal performance of the two programs and compare them, in order to highlight the differences and observe whether it is advantageous to use one implementation strategy or the other.

A. Parallel code in CUDA

The first implementation we want to present is related to the construction of a wavelet filter with a parallel programming model. The parallel code is implemented in the C language; however, we introduce changes to a conventional C code and take advantage of all the typical CUDA structures by executing certain functions on the GPU device. Once the C libraries and the library for device functions are loaded, we fix the maximum number of threads that each block draws in device memory. Obviously, this amount of threads is linked to the hardware features of the GPU cards. As regards the simulations presented in this paper, we use two graphics cards: a 480-core GeForce and a 1536-core Tesla Kepler. The number of blocks descends from these specifications, being calculated as the ratio between the total amount of data to process and the number of threads per block. We are now ready to describe the global main function which, as such, is called on the host and is then executed on the device: this function is exactly the kernel.

// kernel definition
__global__ void wavelet(float *sign, float *fil, float *y, int m) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= sample) return;
    int P = m * sample;
    for (int i = 0; i < filtro; i++)
        if (id - i > 0) y[P + id] += sign[id - i] * fil[i];
        else            y[P + id] += sign[i - id] * fil[i];
    int div = (int) powf(2, m);
    sign[id] = sign[id - (id % div)];
    return;
}
Listing 1: kernel definition

In the kernel, an instruction which performs the convolution operation between the signal S and the filter h is implemented. In fact, if filtro represents the number of wavelet filter coefficients (in our case filtro = 16) and sample is the amount of random signal samples taken into account, then each id-th component of the output signal y is obtained as the sum of the products between the i-th component of the filter and the (id − i)-th component of the signal; id is the thread index with which we exploit the computational capability of the various threads in GPU memory, and whose range should cover the whole length of the signal. In other words, the wavelet filter can be graphically thought of as a floating window, initially aligned with the first sample of the signal (id = 0), which is then slid until it covers all the samples.

Obviously, while probing the entire signal there will be situations in which, according to the values assumed by the index id and based on the position i of the filter coefficient, the difference id − i will be negative; this would require the translation of the signal to "negative" instants, i.e. points where the signal does not exist: this is known as the boundary problem. Therefore, it is necessary to extend the signal at the instants in which it is undefined; this can be done in several ways. The strategy adopted here is the symmetric extension, particularly suitable for biorthogonal filters [22]. For this reason, in the wavelet function there is a conditional instruction which chooses the (id − i)-th sample of S or the (i − id)-th one, to avoid the boundary problem.

As we can see from Listing 1, the wavelet function takes as input not only the pointer-to-float variables sign, fil and y, but also another parameter, m, that is the number of the current iteration. In fact, as we shall soon see, the purpose of this paper is to adopt the subband coding technique in a tree structure, iterating the kernel a number of times equal to num_iter in order to isolate, step by step, the residuals and details of the signal and get different levels of resolution. Therefore we also achieve the scaling procedure by organizing, step by step, the samples in groups of 2^m elements such that they assume the same value in each group. This operation is implemented as

int div = (int) powf(2, m);
sign[id] = sign[id - (id % div)];
Listing 2: scaling instructions

Inside the int main(), the actual variables corresponding to the formal variables are defined and, for each of them, memory allocation is carried out. We also introduce three other variables dev_sign, dev_fil and dev_y that play a similar role on the GPU device.

// allocation in host memory
sign_host = (float *) malloc(sample * sizeof(float));
fil_host  = (float *) malloc(filtro * sizeof(float));
// allocation in device memory
cudaMalloc((void**)&dev_sign, sample * sizeof(float));
cudaMalloc((void**)&dev_fil,  filtro * sizeof(float));
cudaMalloc((void**)&dev_y,    num_iter * sample * sizeof(float));
Listing 3: malloc and cudaMalloc

Later, after loading the samples of the signal and the filter coefficients on the arrays sign_host and fil_host respectively, we proceed with the data transfer to the device using cudaMemcpy commands. These transfers constitute a bottleneck in the performance of a code and therefore should be used as sparingly as possible. For this reason we decide to accumulate the data of the num_iter iterations in a single output array y, whose size is equal to num_iter * sample. Afterwards we carry out another cudaMemcpy to return the output data from the device to the host.

// data transfer from host to device
cudaMemcpy(dev_sign, sign_host, sample * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_fil,  fil_host,  filtro * sizeof(float), cudaMemcpyHostToDevice);
Listing 4: cudaMemcpy instructions

Finally, we save the output data of each single iteration and we can free the allocated host and device memory through the free(...) and cudaFree(...) commands. The clock(...) function, from the standard C library, is useful to measure the time of this and of the other program described in the following section. A sketch of how these host-side steps fit together is given below.
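To show how the fragments above may be combined, the following host-side sketch continues from Listings 3 and 4 (it reuses their variables) and adds the launch configuration, the iteration loop, the copy back to the host and the timing; none of this code is listed in full in the paper, so the grid/block computation, the variable y_host and the call to cudaDeviceSynchronize() are our assumptions.

// Hypothetical host-side sequence following Listings 3 and 4 (our sketch).
// threads_per_block is 480 on the GeForce and 1536 on the Tesla K10, as
// chosen in Section VI; num_iter is the number of decomposition levels.
int blocks = (sample + threads_per_block - 1) / threads_per_block;

clock_t start = clock();                              /* start the timer        */
for (int m = 0; m < num_iter; m++)                    /* one launch per level   */
    wavelet<<<blocks, threads_per_block>>>(dev_sign, dev_fil, dev_y, m);
cudaDeviceSynchronize();                              /* wait for all kernels   */

/* copy the accumulated output (num_iter * sample floats) back to the host */
cudaMemcpy(y_host, dev_y, num_iter * sample * sizeof(float), cudaMemcpyDeviceToHost);
double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;   /* seconds  */

cudaFree(dev_sign); cudaFree(dev_fil); cudaFree(dev_y);        /* device memory */
free(sign_host); free(fil_host); free(y_host);                 /* host memory   */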
B. Sequential program

Another version of the code that performs the same task refers to a sequential programming model. By this we mean the transposition of the signal filtering into a conventional program written in the C language. In this case the first thing to do is to define the function which achieves the convolution product and corresponds to the parallel code kernel, as shown below. As we can observe, such a function is similar to the kernel described above. Here the thread index disappears and is replaced by an index with the same name, but referring to a for loop in which each element y(id) is computed in a sequential way.

void wavelet(float *sign, float *fil, float *y, int m) {
    for (int id = 0; id < sample; id++) {
        int P = m * sample;
        for (int i = 0; i < filtro; i++)
            if (id - i > 0) y[P + id] += sign[id - i] * fil[i];
            else            y[P + id] += sign[i - id] * fil[i];
        int div = (int) pow(2, m);
        sign[id] = sign[id - (id % div)];
    }
    return;
}
Listing 5: sequential function

Clearly all the typical CUDA instructions have disappeared; nevertheless, in order to compare this code fairly with the parallel one, we need to replace them with similar instructions that emulate the same behavior. This explains why, instead of cudaMalloc, we find three additional variables d_sign_host, d_fil_host and d_y_host allocated in host memory, which substitute the arrays dev_sign, dev_fil and dev_y. Then, after the execution finishes, the cudaMemcpy calls are replaced by the assignments listed below, unnecessary for the results but useful to take into account in the evaluation of time (and performance).

d_sign_host = sign_host;
d_fil_host  = fil_host;
d_y_host    = y_host;

Finally, in this model we use the clock() function in order to measure the execution time of the program; a minimal timing loop is sketched below.
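The sequential driver itself is not listed in the paper; a plausible minimal form of it, under the same naming assumptions used above, is simply:

/* Hypothetical sequential driver (not listed in the paper): run the
 * wavelet() function of Listing 5 once per decomposition level and
 * measure the elapsed CPU time with clock() from time.h. */
clock_t t0 = clock();
for (int m = 0; m < num_iter; m++)
    wavelet(sign_host, fil_host, y_host, m);
double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;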
The goal now is to benchmark the two codes and compare the results.

VI. EXPERIMENTAL RESULTS

We now analyse the performance of the sequential and parallel codes. In order to pursue this target, we change the number of samples that makes up the examined signal and we evaluate the resulting time trend as the number of samples gradually increases. The simulations of both codes are conducted on two computers with different technical specifications. In fact, the first machine is used as a normal PC for student laboratory applications; the other one is dedicated to specific applications as a server, and has high reliability, increased performance and additional features. For this reason, in order to distinguish them, we denote the first computer as PC and the second one as Server.

A. Technical specifications

We describe the main features of the two machines and of their respective CPU and graphics card.

1) PC: It has an AMD Athlon 64 X2 CPU with a clock frequency of up to 2.9 GHz and an NVIDIA GeForce GTX 480 GPU. This GPU shows the following engine and memory specifications [24]: it has 480 CUDA cores, a graphics clock of about 700 MHz and a texture fill rate of 42 billion texels per second. The interfaces are designed for a PCI Express 2.0 x16 accelerator, with a peak bandwidth of up to 2 GByte per second and with 1536 MByte of standard GDDR5 memory, whose locations are connected through a 384-bit memory interface.

2) Server: In the server we find an Intel Xeon CPU and an NVIDIA Tesla Kepler K10 graphics card; the card has on-board memory of which 4 GByte is dedicated to the GPU, and it supports a PCI Express Gen 3 bus interface.

B. Simulations

Since the technical features of the PC GPU are very different from the Server one, it is clear that for the parallel code we have to adjust the number of threads per block according to the type of GPU used: if we work with the GeForce, this parameter has to be equal to 480; if the code is executed on the server with the Tesla Kepler K10 card, then we should set it equal to 1536.

The simulation of the parallel code on the PC gives the result represented graphically in Figure 3a. Figure 3a shows an almost linear trend throughout the range from 1000 to 10 million samples, with a computation time varying from 150 to 300 ms for the first 10000 samples. Up to 5 million samples the computing time increases by about 20 seconds; in a range of equal width, from 5 to 10 million samples, the computing time increment is about 15 seconds, compared with the theoretically expected 20 s that we would obtain if the curve had not undergone a downward level shift. A similar trend is given in Figure 3b for the simulations of the parallel code on the server. In this case the computing time is initially larger by about one order of magnitude, and grows roughly linearly up to 9 million samples, where the execution time of the program is 17.4 s. Subsequently, from 9 to 10 million samples, the computing time increases by only 200 ms.

Fig. 3: Computing time of the parallel code in PC and Server; (a) computing time of the parallel code in PC for a varying amount of samples, (b) computing time of the parallel code in Server for a varying amount of samples

A comparison between the PC and server runs, for the parallel and the sequential version, is shown in Figures 4a and 4b. We can see that, for the PC, for a number of samples lower than 15000 the sequential code timing is below the curve of the parallel code; in the range between 20000 and 70000 samples the two curves intersect several times, showing an almost swinging behavior; from 70000 samples onward, instead, the CPU line is above the GPU one and, as the number of samples increases, their distance gets larger and larger, reaching 25 seconds at 10 million samples. Server simulations show, instead, that the parallel code starts performing better than the sequential one only when 3 million samples are reached, a value much higher than the 70000 seen for the PC. We should note that in this case the gap between the two curves is very small: in fact, in the range up to 3 million samples the sequential code is faster than the parallel one by about 2 s; once the point where the two codes have similar performance is crossed, the GPU starts to work better, gaining a time advantage of about 4.5 s in the neighborhood of 10 million samples.

Fig. 4: Comparison between GPU and CPU performance in PC and Server; (a) parallel and sequential program in PC, (b) parallel and sequential program in Server
Figure 5 gives a global overview of the behavior of the CPUs and GPUs in PC and Server. As clearly shown in this graph, once a high enough amount of samples is reached (almost 3 million), the best performance in terms of saved execution time is obtained when we implement and execute the parallel code using the server GPU: this result was quite predictable, because with this card we use 1536 cores compared to the 480 of the GeForce GPU. Nevertheless, on careful observation, although the server GPU performance is better than the GeForce one, being able to gain between 10 and 25 s over the latter, it is equally true that there are no considerable differences with respect to what happens by exploiting the server CPU potential rather than the Kepler GPU: in this case the separation between the Intel Xeon CPU curve and the Tesla Kepler one is only a few seconds. This result is due to the fact that the server, because of its features and applications, has more advanced hardware equipment than a normal PC; in fact the Xeon CPU, with its 6 cores per socket and a clock frequency of up to 3 GHz, can provide very satisfactory results, not very different from those offered by the Tesla Kepler GPU. On the other hand, the difference between the use of a parallel code and of a sequential one is more pronounced when we look at a PC with an average CPU and compare it with the GeForce GTX 480 GPU, which belongs to the high-end market and, with its 1536 MByte of installed RAM, its GDDR5 video memory and its high operating frequencies, is certainly one of the most important products made by the NVIDIA corporation [24].

Fig. 5: Comparison among parallel and sequential programs on PC and Server

According to what has been argued so far, some significant information can be derived by plotting all the different ratio combinations between GPU and CPU timings for PC and server. From Figure 6a we clearly observe that for the first 20000 samples the sequential code on the CPU is faster than the GPU program; the situation is reversed past this value of samples, beyond which the parallel code becomes approximately 1.5 times faster than the sequential one.

Fig. 6: Ratio CPU/GPU for PC and Server; (a) ratio between AMD Athlon CPU and NVIDIA GeForce GPU times in PC, (b) ratio between Intel Xeon CPU and NVIDIA Tesla K10 GPU times in Server

As shown in figure 6b, for the server we have a similar situation.
However, here the GPU shows its advantages only after passing 3 million samples. Moreover, in this graph the performance of the parallel code is better by about a factor of 1.5, compared with the sequential one, at the upper limit of our observation range.

The situation is a bit different when we take mixed ratios between the GPU performance of one machine and the CPU performance of the other one. If we take into account figure 7a, we see a downward trend: for an amount of samples equal to 1000, the times of the server GPU are greater than those of the PC CPU, even by over two orders of magnitude. The performance of the examined GPU and CPU becomes equal when we reach 500000 samples. Going forward, this ratio becomes smaller and smaller, until it settles around 0.3 in the range from 9 to 10 million samples.

Fig. 7: Mixed ratios among GPUs and CPUs; (a) mixed ratio between NVIDIA Tesla K10 GPU and AMD Athlon CPU times, (b) mixed ratio between NVIDIA GeForce GPU and Intel Xeon CPU times

A monotonically similar trend is shown in graph 7b. However, for the reasons seen previously, the ratio between the GPU time of the PC and the CPU time of the server always remains higher than one, demonstrating that the GeForce performance is always inferior when compared with the Intel Xeon CPU. Finally, if we compare the execution times of the two GPUs with each other, we observe that initially the curve remains below unity in the range from 1000 to about 800000 samples, indicating that the PC GPU works better than the Server one, with values up to 10 times lower. Instead, in the remaining part of the observation range, the GeForce execution time is about twice that of the Tesla Kepler K10 (see figure 8).

Fig. 8: Ratio between PC and Server GPU

VII. RELATED WORKS

We now mention some works closer to this one that have been carried out in the parallel computing context, in particular reflecting on the potential of GPU cards applied to wavelet filters. In [25] the authors deepen this concept with an algorithm centered on GPU cards to reconstruct 3D wavelets, using fragment programs to minimize the data transfer and the fragment processing overhead; for this purpose they have proposed a novel scheme that uses tileboards as the primary layout to organize the 3D wavelet coefficients.
In article [26], the authors present an algorithm that performs the complete SIMD discrete wavelet transform (DWT) convolution on a GPU, which brings a significant performance improvement on a normal PC, without additional costs. Although the forward and inverse wavelet transforms are mathematically different, the algorithm they propose unifies them into an almost identical process that can be efficiently implemented on a GPU. The DWT has a wide range of applications, from signal processing to video and image compression. For this reason, in [27] the authors have shown that this transformation, by means of the lifting scheme, can be performed in a memory- and computation-efficient way on the GPU, through Nvidia's CUDA computing paradigm. In particular, their design exploits three major hardware architectures for the 2D DWT (row-column, line-based, block-based) and describes a hybrid method between the row-column and the block-based methods.

Another work, [28], presents an improved version of an algorithm suitable to compute the 2D DWT on the GPU. In their work the authors have adapted this method for an existing algorithm computing the vertical transform pass. By taking advantage of the computational power of a GPU when implementing a wavelet transform, they demonstrate that the computation time can be substantially reduced.

Work [29] proposes a method which analyzes the behavior of several parallel algorithms developed to compute the two-dimensional DWT, in particular using both OpenMP over a multicore platform and CUDA over a GPU. They have found that significant execution time improvements are achieved on both multicore platforms and GPUs. Likewise, in the paper [30] the authors explore the potential of OpenCL in accelerating the DWT computation and analyze the programmability, portability and performance aspects of this language; their experimental analysis is done using NVIDIA and AMD drivers that support OpenCL.

The authors of paper [31] present an overview of wavelet-based multiresolution analysis, discussing the continuous wavelet transform in its simplest form and looking at orthogonal, biorthogonal and semiorthogonal wavelets. Many authors have applied the strategy of parallel computing using wavelets to the image encoding field. In the paper [32] the computation steps in JPEG2000 are examined, particularly in Tier-1, and novel, GPGPU-compatible, parallel processing methods through wavelets for the sample-level coding of the images are developed. In work [33], close to the previous one, the authors focus their attention on accelerating a JPEG2000 encoder by using GPGPU.

Finally, [34] presents a GPGPU framework with the corresponding parallel computing solution for wavelet-based image denoising, using off-the-shelf consumer-grade programmable GPUs. Their experimental results show that the framework gains applicability in data parallelism and satisfactory performance in accelerating computations for wavelet-based denoising.

VIII. CONCLUSIONS

In this work we have analyzed the computational capacity of a parallel code, written in the CUDA environment, which implements the filtering of a random signal using a biorthogonal wavelet filter, and we have compared it with a sequential code that performs the same task; we have executed both codes on two computers having different technical specifications.

The simulations make it clear that, except in the case in which the number of samples is very low, the adoption of a parallel programming technique is certainly advantageous in terms of execution time. In particular, beyond 3 million samples the server GPU performance predominates over the CPU of the same computer and over both the GPU and the CPU of the PC, confirming the strong potential of parallel computing in terms of efficiency, computational capacity and execution time. The potential for future GPU performance increases presents great opportunities for demanding applications, including computational graphics, computer vision, and a wide range of high-performance computing applications [35].

ACKNOWLEDGMENT

The authors would like to thank Giacomo Capizzi and Christian Napoli for leading them into the field of parallel GPU computing with CUDA and for their support in drafting and reviewing this paper.
REFERENCES

[1] G. Pappalardo and E. Tramontana, "Automatically discovering design patterns and assessing concern separations for applications," in Proceedings of the ACM Symposium on Applied Computing (SAC), Dijon, France, April 2006, pp. 1591–1596.
[2] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappala, "A Cloud-Distributed GPU Architecture for Pattern Identification in Segmented Detectors Big-Data Surveys," The Computer Journal, 2014.
[3] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, "A novel cloud-distributed toolbox for optimal energy dispatch management from renewables in IGSS by using WRNN predictors and GPU parallel solutions," in Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), 2014 International Symposium on. IEEE, 2014, pp. 1077–1084.
[4] F. Bonanno, G. Capizzi, G. Lo Sciuto, C. Napoli, G. Pappalardo, and E. Tramontana, "A cascade neural network architecture investigating surface plasmon polaritons propagation for thin metals in OpenMP," in Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC), ser. Springer LNCS, vol. 8467, Zakopane, Poland, June 2014, pp. 22–33.
[5] C. Napoli, G. Pappalardo, E. Tramontana, R. Nowicki, J. Starczewski, and M. Woźniak, "Toward work groups classification based on probabilistic neural network approach," in Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC), ser. Springer LNCS, vol. 9119, Zakopane, Poland, June 2015, pp. 79–89.
[6] M. Woźniak, D. Połap, M. Gabryel, R. Nowicki, C. Napoli, and E. Tramontana, "Can we process 2D images using artificial bee colony?" in Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC), ser. Springer LNCS, vol. 9119, Zakopane, Poland, June 2015, pp. 660–671.
[7] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, NVIDIA.
[8] Introduction to parallel programming. [Online]. Available: courses.cs.washington.edu
[9] E. Tramontana, "Managing Evolution Using Cooperative Designs and a Reflective Architecture," in Reflection and Software Engineering, ser. Springer LNCS, vol. 1826, 2000.
[10] O. Rioul and M. Vetterli, "Wavelets and Signal Processing," IEEE SP Magazine, October 1991.
[11] C. Napoli, G. Pappalardo, and E. Tramontana, "An agent-driven semantical identifier using radial basis neural networks and reinforcement learning," in XV Workshop "From Objects to Agents" (WOA), vol. 1260, Catania, Italy: CEUR-WS, September 2014.
[12] J. Akhtar, Optimization of biorthogonal wavelet filters for signal and image compression, February 2001.
[13] C. Napoli, G. Pappalardo, and E. Tramontana, "A Hybrid Neuro-Wavelet Predictor for QoS Control and Stability," in Proceedings of Advances in Artificial Intelligence (AI*IA), ser. Springer LNCS, vol. 9119, Torino, Italy, December 2013, pp. 527–538.
[14] [Online]. Available: http://www.bssaonline.org
[15] M. A. Salem, N. Ghamry, and B. Meffert, "Daubechies versus biorthogonal wavelets for moving object detection in traffic monitoring systems," 2009.
[16] S. Mallat, A Wavelet Tour of Signal Processing, October 2008.
[17] M. P. Chaudhary and G. Lalit, "Lifting Scheme Using HAAR and Biorthogonal Wavelets for Image Compression," International Journal of Engineering Research and Applications, 2003.
[18] "CUDA C Programming Guide, version 4.2," pp. 1–5.
[19] Z. Yang, Y. Zhu, and Y. Pu, "Parallel Image Processing Based on CUDA," in Proceedings of the International Conference on Computer Science and Software Engineering, 2008.
[20] NVIDIA CUDA C Programming Guide, April 2012.
[21] A. Fornaia, C. Napoli, G. Pappalardo, and E. Tramontana, "An AO system for OO-GPU programming," in Proceedings of the 16th Workshop "From Objects to Agents" (WOA15), Naples, Italy, June 2015.
[22] L. Verdoliva, "La trasformata wavelet," 2009–2010.
[23] [Online]. Available: http://wavelets.pybytes.com/wavelet/bior3.7
[24] [Online]. Available: http://www.nvidia.it/object/geforce-family-it.html
[25] A. Garcia and H.-W. Shen, "GPU-Based 3D Wavelet Reconstruction with Tileboarding," The Visual Computer, vol. 21, no. 8–10, pp. 755–763, 2005.
[26] T.-T. Wong, C.-S. Leung, P.-A. Heng, and J. Wang, "Discrete Wavelet Transform on Consumer-Level Graphics Hardware," IEEE Transactions on Multimedia, vol. 9, no. 3, pp. 668–673, 2007.
[27] W. van der Laan, A. Jalba, and J. Roerdink, "Accelerating Wavelet Lifting on Graphics Hardware Using CUDA," IEEE Transactions on Parallel and Distributed Systems, vol. 22, August 2010.
[28] H. Moen, "Wavelet transforms and efficient implementation on the GPU," May 2007.
[29] V. Galiano, O. Lopez, M. Malumbres, and H. Migallon, "Parallel strategies for 2D Discrete Wavelet Transform in shared memory systems and GPUs," Springer Science, 2012.
[30] B. Sharma and N. Vydyanathan, "Parallel discrete wavelet transform using the Open Computing Language: a performance and portability study," in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, April 2010.
[31] B. Jawerth and W. Sweldens, "An Overview of Wavelet Based Multiresolution Analyses," SIAM Review, vol. 36, no. 3, pp. 377–412, 1994.
[32] R. Le, I. Bahar, and J. Mundy, "A Novel Parallel Tier-1 Coder for JPEG2000 Using GPUs," in Application Specific Processors, IEEE, June 2011.
[33] M. Ahmadvand and A. Ezhdehakosh, "GPU-based implementation of JPEG2000 encoder," in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2012.
[34] Y. Su and Z. Xu, "Parallel implementation of wavelet-based image denoising on programmable PC-grade graphics hardware," Signal Processing, vol. 90, no. 8, pp. 2396–2411, 2010.
[35] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, "GPUs and the Future of Parallel Computing," IEEE Computer Society, September/October 2011.