Simulating Peer to Peer Networks Using GPU High Performance Support
                                                Giovanni Susinni, Giuseppe Greco
                                                         University of Catania
                                                Viale A. Doria 6, 95125 Catania, Italy
                                          giovannisusinni@gmail.com, grecoba1984@libero.it


Abstract—Peer-to-Peer networks are used by many applications to share resources between nodes. We propose a parallel version of a simulator for some aspects of a peer-to-peer network performing file sharing. Since this analysis is computationally expensive for contemporary CPUs, the computing power of Graphics Processing Units allows a great gain in performance during simulation. Specifically, we have used the NVIDIA Compute Unified Device Architecture programming model to simulate the behaviour of a peer-to-peer network performing data transfers, and to compute communication delays as well as data updates.

Index Terms—P2P networks, GPU computing, CUDA, Performance, Parallel

I. INTRODUCTION

In the last few decades, Peer-To-Peer (P2P) technology and its many variants, such as BitTorrent, have become popular Internet applications due to their scalability and robustness [1], [2]. Today, the number of devices connected to the Internet is increasing fast. Therefore, it is important to analyze the interactions between peers (or nodes, hereafter) and some file sharing issues. The potentially huge number of devices in a P2P network may pose a performance problem during the simulation of data transfers; on the other hand, simulating the behaviour of novel models for selecting source and destination peers is a current research field.

A sequential simulation of the behaviour of a P2P network is viable for a small amount of data and few nodes transferring data: the results can be achieved in a relatively short time. However, for an extensive analysis the sequential approach is not ideal, since it has to cope with a lot of data and transferring nodes. One possible solution is a parallel approach using a Graphics Processing Unit (GPU) architecture. The multi-core architecture of a GPU, organized in threads, blocks and grids, shows its great potential for the said simulation thanks to a computation time lower than that of the sequential approach, particularly when increasing the number of nodes and of data transfers (these often concern file fragments). In this paper, we propose to use GPUs to perform a simulation of a P2P network that is closer to the real case. To this end, our work focuses on the implementation and optimization of exponentiation operations on GPUs.

The contribution of this work is threefold: (i) first, we have developed and analyzed a sequential algorithm, implemented in the C language, that simulates some aspects of a P2P network; (ii) second, we have implemented a corresponding GPU algorithm exploiting the capabilities of a GPU architecture through the Compute Unified Device Architecture (CUDA) C/C++ language; (iii) finally, we have compared the results of the two approaches.

Our paper is organized as follows: Section 2 provides an overview of the P2P model, while Section 3 provides an overview of the CUDA architecture. Section 4 describes the simulation approaches for the sequential and GPU parallel versions. Section 5 shows the experimental results and the comparison of the measured execution times. In Section 6 we analyze the related work, and finally Section 7 provides conclusions and possible future work.

Copyright © 2016 held by the authors.

II. BACKGROUND ON P2P NETWORKS

An effective network architecture should exhibit the following features [3]:
• Provably good performance;
• Low overhead;
• Quality of Service (QoS) of peers;
• Autonomy;
• Robustness.

Over time, the more traditional client-server model based on a centralized approach has given way to the decentralized architecture of P2P networks [4]. Indeed, the continuous growth of users and bandwidth has been accompanied by increasing requests from a diversified wealth of applications, with a remarkable effort in terms of the resources needed to meet these challenges.

A. Definition and Properties of P2P Networks

A distributed network architecture is a P2P network if its participants share a part of their own hardware resources, such as processing power, network link capability and storage capacity [5]. P2P networks are organized as overlay topologies on top of the underlying physical network, and are formed by peers connecting to each other in either a structured or unstructured manner [6].

The P2P network must have the following properties [5]:
• Resource sharing: each peer contributes system resources to the operation of the P2P system. Ideally, this resource sharing is proportional to the peer's use of the P2P system, but many systems suffer from the free rider problem.
• Networked: all nodes are interconnected with other nodes in the P2P system, and the full set of nodes are members of a connected graph. When the graph is no longer connected, the overlay network becomes partitioned.
• Decentralization: the behavior of the P2P system is determined by the collective actions of peer nodes, and there is no central control point. Some systems, however, secure the P2P system using a central login server. The ability to manage the overlay [7] and monetize its operation may require centralized elements.
• Symmetry: nodes assume equal roles in the operation of the P2P system. In many designs this property is relaxed by the use of special peer roles such as super peers or relay peers.
• Autonomy: participation of the peer in the P2P system is determined locally, and there is no single administrative context for the P2P system.
• Scalable: this is a pre-requisite of operating P2P systems with millions of simultaneous nodes, and means that the resources used at each peer exhibit a growth rate, as a function of overlay size, that is less than linear. It also means that the response time does not grow more than linearly as a function of overlay size.
• Self-organization: the organization of the P2P system increases over time using local knowledge and local operations at each peer, and no peer dominates the system. Biskupski, Dowling, and Sacha [8] argue that existing P2P systems do not exhibit most of the properties of self-organization.
• Stability: within a maximum variability rate, the P2P system should be stable, i.e., it should maintain its connected graph and be able to route deterministically within practical hop-count bounds.

Fig. 1. A P2P network

B. System model

Figure 1 provides a conceptual representation of the P2P overlay topology. Since P2P networks are fault-tolerant and not susceptible to a single point of failure, P2P overlay topologies are multiply-connected. Broadly, there are three classes of P2P systems:
• Pure P2P Systems, in which two nodes/devices interact with each other without requiring the intervention of any central server or service.
• Hybrid P2P Systems, in which peers rely partially on a central server to provide certain services, although the interaction between peers still takes place independently.
• Federated P2P Systems, in which peer interactions take place inside pre-defined domains, such as within an organization.
In this classification we have an increasing degree of centralization.

Moreover, P2P systems can either be: (i) structured [9], where the overlay graph is well structured and a mathematical scheme (e.g. Distributed Hash Tables) is applied to make sure that nodes are added in a manner which maintains the structure; or (ii) unstructured, where the overlay is a random graph and new nodes are added to the network in an unpredictable manner. Structured P2P systems are formed on the basis of node identifiers, guaranteeing information retrieval in bounded time for simple queries, and are self-organizing in the face of failures; unstructured P2P systems can support large complex queries, but do not guarantee information retrieval in bounded time, and have less efficient self-organization capabilities.

In this paper we consider a Structured and Pure P2P network.
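To illustrate the DHT-based placement mentioned above, the following sketch is an illustrative example of ours, not taken from the paper: the FNV-1a hash and the 16-bit ring are arbitrary choices (real DHTs such as Chord use SHA-1). Node identifiers are hashed onto a circular space, and a key is assigned to the first node that follows it clockwise.

#include <stdint.h>

#define RING_BITS 16
#define RING_SIZE (1u << RING_BITS)

/* toy hash (FNV-1a) mapping an identifier onto the ring */
uint32_t ring_hash(const char *s) {
  uint32_t h = 2166136261u;
  while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
  return h % RING_SIZE;
}

/* a key is stored on the first node id clockwise from hash(key) */
uint32_t successor(uint32_t key_id, const uint32_t *node_ids, int n) {
  uint32_t best = node_ids[0], best_dist = RING_SIZE;
  for (int i = 0; i < n; i++) {
    uint32_t d = (node_ids[i] - key_id) % RING_SIZE;  /* clockwise distance */
    if (d < best_dist) { best_dist = d; best = node_ids[i]; }
  }
  return best;
}

With such a scheme, lookups can be routed in a bounded number of hops, which is what guarantees retrieval in bounded time for simple queries.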
III. CUDA

CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables an increase in computing performance by harnessing the power of GPUs. With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for GPU computing with CUDA. Here are a few examples: (i) identifying hidden plaque in arteries (heart attacks are the leading cause of death worldwide); (ii) analyzing air traffic flow: the National Airspace System manages the nationwide coordination of air traffic flow, and computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently. Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds. The speed-up is the result of the parallel GPU architecture, which however requires developers to port the compute-intensive portions of the application to the GPU using the CUDA Toolkit.

CUDA works, conceptually, according to the architectural model shown in Figure 2. In the CUDA model, the graphic chip consists of a series of multiprocessors, called Streaming MultiProcessors. The number of multiprocessors depends on the specific class and performance characteristics of each GPU. Each processor can perform a mathematical operation (sum, multiplication, subtraction, etc.) on integers or single-precision (32-bit) floating point numbers. In each multiprocessor there are also two special function units
(these perform transcendental operations such as sine, cosine and reciprocal) and, only for chips based on the GT200 architecture, a single double-precision (64-bit) floating-point unit is available. In a multiprocessor there is a shared memory accessible by all streaming processors, as well as a cache for instructions. The other types of memory, as shown in the picture, are accessible by each processor and represent the main repository for the large amounts of data to be saved in the GPU. In the following we analyze the parts of code run on the GPU using the libraries provided by NVIDIA.

Fig. 2. GPU architecture
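For illustration, the number of Streaming MultiProcessors and the amount of per-block shared memory of the installed GPU can be queried through the CUDA runtime; the snippet below is a sketch of ours using the standard cudaGetDeviceProperties() call, and is not part of the simulator.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
  printf("%s: %d multiprocessors, %zu bytes of shared memory per block\n",
         prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock);
  return 0;
}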
A. Kernel and hierarchy of threads

CUDA programming denotes as a kernel a function, or portion of code, that runs in parallel N times. Each individual run of the kernel is performed by a simple unit called a thread. CUDA threads are simpler than CPU threads, so the code is considered to be faster. To determine the number of threads that run a single kernel there is a logical organization with a two-level hierarchy: at the top level, threads depend on the size of a grid; the grid is a two-dimensional level comprising blocks, which in turn have a three-dimensional structure specified by the number of threads.

Once a kernel is launched, it receives two structures specifying the additional parameters for the size of grid and blocks. Listing 1 shows the syntax to launch a kernel function named kernelfunction().

dim3 blocksize(x, y, z);
dim3 gridsize(x, y);
kernelfunction<<<gridsize, blocksize>>>(parameters);

Listing 1. Kernel syntax

The logical division in grids and blocks is a crucial aspect of CUDA design and coding. In addition to the limits imposed by the hardware, there are other precautions that should be taken into account to make the code more efficient.
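To make this division concrete, the following sketch (our illustrative example; the 16x16 block shape and the problem sizes are assumptions, not values from the paper) configures a grid that covers an N x M array, rounding the number of blocks upwards so that every element is assigned a thread; threads whose indexes fall outside N or M must then be masked inside the kernel.

#define N 1000   /* peers, illustrative value */
#define M 1000   /* fragments, illustrative value */

dim3 blocksize(16, 16, 1);   /* 256 threads per block */
dim3 gridsize((N + blocksize.x - 1) / blocksize.x,   /* ceil(N/16) blocks along x */
              (M + blocksize.y - 1) / blocksize.y);  /* ceil(M/16) blocks along y */
kernelfunction<<<gridsize, blocksize>>>(parameters);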
IV. P2P SIMULATION

A P2P-based application is a distributed application that distributes tasks or work loads among peers, which all have the same privileges. Peers make a portion of their resources available to other network participants, and they do not need a central coordination server. A P2P system can be used in many application domains, e.g. in social contexts, like sharing files and resources. Today, given the large diffusion of P2P networks, we need to simulate and assess the performance of novel transfer protocols, in terms of computation time and of how many nodes are involved in the communication. Generally, the computation time increases with the number of nodes and files.

A. Issues of P2P Network Simulation

In our simulation, we suppose that each peer communicates with the others. Each peer has some file fragments and shares them, while at the same time it receives the missing parts of some files. The peers and file fragments are represented in an array with N peers and M fragments, while the latency times of communication between nodes (namely, the ping times) are in an N*M array, named p.
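As an illustration of this layout (a sketch of ours: the paper loads the actual inputs from files, so the random placeholder values below are assumptions), the two N*M arrays can be allocated and filled in row-major order; note that the listings use N as the row stride, which is consistent here because all the experiments set N equal to M.

#include <stdlib.h>

int   *f;   /* f[i*N + j] == 1 iff peer i holds fragment j */
float *p;   /* p[i*N + j]: ping latency associated to that entry */

void initState(void) {
  f = malloc(N * M * sizeof(int));
  p = malloc(N * M * sizeof(float));
  for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++) {
      f[i*N + j] = rand() % 2;               /* placeholder availability */
      p[i*N + j] = (rand() % 1000) / 10.0f;  /* placeholder ping, in ms */
    }
}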

Fig. 3. An example of state update for nodes

The top of Figure 3 shows the initial state of the peers, in terms of availability of file fragments. We can see that a single node has some file fragments: 0 represents the absent state, while 1 represents the available state. The bottom shows the final state of the peers, after the fragments have been transferred between peers.

B. Sequential simulation

The first approach is the sequential one. Listing 2 shows the algorithm in the C language, with the selection of the peers that will receive the file fragments and the update of their state (from absent to available). Constant N is the number of peers, and M is the number of fragments. Array rics contains the destination peer indexes. There are MAXINS destination peers for a source peer, which are sequentially counted starting from the index of the source peer.

The update of file fragments consists of comparing, for each source peer having index tidx, its fragment availability, as recorded in array f, with the fragment availability of the peer having index rics[i]. If the difference between source and destination is equal to zero, there will be no update; otherwise the source will send the missing fragments to the destination peer. To approximate the real case, in the simulation we use the latency times for each fragment transfer, both in download and in upload. In this analysis we assume that the transfer is ideal, i.e. no loss of packets occurs. The simulation has to consider each source peer, file fragment and destination peer, therefore we have three nested for loops.

This approach is viable for a few peers and fragments, i.e. the results can be obtained in a small amount of time. However, the sequential approach is not appropriate for many peers and fragments; the solution is then a parallel approach using a GPU architecture.

void chooseNode(int tidx, int *rics){
  /* select MAXINS destination peers, counted sequentially after the source index */
  for (int i = 0; i < MAXINS; i++)
    rics[i] = (tidx + i + 1) % N;
}

int checkFr(int *f, int tidx, int *rics, int tidy, int i){
  /* 1 when source and destination differ on the availability of fragment tidy */
  if (f[tidx*N + tidy] - f[rics[i]*N + tidy]) return 1;
  return 0;
}

void selection(float *p, int *f) {
  int rics[MAXINS];
  unsigned long tidx;
  unsigned long tidy;

  for (tidx = 0; tidx < N; tidx++) {       /* each source peer */
    chooseNode(tidx, rics);
    for (tidy = 0; tidy < M; tidy++)       /* each fragment */
      for (int i = 0; i < MAXINS; i++)     /* each destination peer */
        if (checkFr(f, tidx, rics, tidy, i))
          if (p[tidx*N + tidy] <= tstep*(tidx*N + tidy))
            f[rics[i]*N + tidy] = 1;
  }
}

Listing 2. Sequential algorithm updating fragments availability

C. GPU Simulation

The multi-core architecture of a GPU, organized in threads, blocks and grids [10]–[15], shows its great potential by achieving a lower execution time than the sequential approach, particularly with the increase of nodes and file fragments. CUDA is essentially the C/C++ language with a few extensions that allow one to execute functions on the GPU using many threads in parallel [10]. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code that runs on the host can manage memory on both the host and the device, and also launches kernels, which are functions executed on the device. Kernels are executed by many GPU threads in parallel [10], [16]. The built-in variables blockIdx.x, threadIdx.x and blockDim.x allow us to take advantage of the indexes of threads and blocks [14].

Indeed, the kernel is executed in parallel by threads, which are organized in blocks, with a 3D structure, and grids, with a 2D structure. For simulating a P2P network we use a 2D array, with threadIdx.x indicating the nodes and threadIdx.y indicating the fragments. As in the sequential algorithm, we check on each peer whether fragments are available. We use the device function checkFr() for this, whose body is the same as that of the corresponding function in Listing 2. If the fragment is available, the state of the peer can be updated in due time. This global function runs in parallel on each GPU core thanks to the indexes tidx and tidy.

__global__ void selection(float *p, int *f){
  int rics[MAXINS];
  unsigned long tidx = blockIdx.x*blockDim.x + threadIdx.x;
  unsigned long tidy = blockIdx.y*blockDim.y + threadIdx.y;

  chooseNode(tidx, rics);

  for (int i = 0; i < MAXINS; i++)
    if (checkFr(f, tidx, rics, tidy, i))
      if (p[tidx*N + tidy] <= tstep*(tidx*N + tidy))
        f[rics[i]*N + tidy] = 1;
}

Listing 3. Parallel algorithm in CUDA C

The keyword __global__ indicates a kernel function that runs on the device, is called from host code, and runs in multiple instances on several blocks and threads. Listing 3 shows the kernel function for our simulation.

cudaMalloc((void**)&dev_p, N*M*sizeof(float));
cudaMalloc((void**)&dev_f, N*M*sizeof(int));
cudaMalloc((void**)&dev_dt, N*M*sizeof(float));

Listing 4. Library function cudaMalloc()

cudaMemcpy(dev_p, ping, N*M*sizeof(float),
    cudaMemcpyHostToDevice);
cudaMemcpy(output, dev_f, N*M*sizeof(int),
    cudaMemcpyDeviceToHost);

Listing 5. Library function cudaMemcpy()

Library function cudaMalloc() allocates a given size in bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory (see Listing 4). The allocated memory is suitably aligned for any kind of variable. Function cudaMalloc() returns cudaErrorMemoryAllocation in case of failure [17]. Therefore, we can allocate the space necessary to transfer the data from the CPU to the GPU. In CUDA C/C++, library function cudaMemcpy() allows us to transfer data from host to device and vice versa [17] (see Listing 5).

In function main() we transfer the data from host to device and then invoke our kernel function selection() (see Listing 6). This performs the GPU computation; then a cudaMemcpy() call copies from device to host the data indicating the updated nodes.

selection<<<gridsize, blocksize>>>(dev_p, dev_f);

Listing 6. Calling a kernel function from the CPU program

The input arrays of ping times and of node fragment availability are initially loaded from files. At every time-step the kernel is called, and the output array is updated according to the latency times.
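Putting Listings 3 to 6 together, a minimal host-side driver could look as follows. This is a sketch under our assumptions: the loadInputs() helper, the STEPS bound and the 16x16 launch configuration are hypothetical, and error checking is omitted.

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
  float *ping   = malloc(N * M * sizeof(float));
  int   *output = malloc(N * M * sizeof(int));
  loadInputs(ping, output);   /* hypothetical helper reading the input files */

  float *dev_p; int *dev_f;
  cudaMalloc((void**)&dev_p, N * M * sizeof(float));
  cudaMalloc((void**)&dev_f, N * M * sizeof(int));
  cudaMemcpy(dev_p, ping, N * M * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_f, output, N * M * sizeof(int), cudaMemcpyHostToDevice);

  dim3 blocksize(16, 16);
  dim3 gridsize((N + 15) / 16, (M + 15) / 16);
  for (int t = 0; t < STEPS; t++)   /* the kernel is invoked at every time-step */
    selection<<<gridsize, blocksize>>>(dev_p, dev_f);

  cudaMemcpy(output, dev_f, N * M * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dev_p); cudaFree(dev_f);
  free(ping); free(output);
  return 0;
}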
V. RESULTS FOR THE SEQUENTIAL AND GPU VERSIONS

From the analysis of the sequential and GPU execution times, we can see that the results differ greatly for a high number of nodes and file fragments. With a small number of nodes, the sequential computation is faster than the parallel one, because the memory allocation that CUDA performs on the GPU is an expensive process; however, this impact becomes smaller as nodes and data increase. Each execution time has been obtained as the average of ten measures.
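Per-run timings such as those reported below can be collected, for instance, with CUDA events; the snippet is a sketch of the standard technique, and not necessarily the measurement harness used for these experiments.

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
selection<<<gridsize, blocksize>>>(dev_p, dev_f);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                /* wait for the kernel to finish */
cudaEventElapsedTime(&ms, start, stop);    /* elapsed time in milliseconds */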
The sequential algorithm for processing data was executed on an AMD(R) Athlon 64 X2 processor with up to 2.9 GHz clock speed and 4 GB of DDR3L-1333 RAM, while for the GPU algorithm we used an NVIDIA(R) GeForce(TM) GTX 480 with 480 CUDA cores and 1536 MB of GDDR5 video RAM [18].

Figures 4, 5 and 6 show the execution times for the sequential (CPU) and parallel (GPU) versions, when considering a number of destination nodes equal to 1/25, 1/10 and 1/5, respectively, of the total nodes N. Indeed, with a low N the performance of the CPU version is better, because for the GPU version the memory allocation from host to device is a costly operation.

Fig. 4. GPU and CPU execution time with MAXINS=1/25 of total nodes
Fig. 5. GPU and CPU execution time with MAXINS=1/10 of total nodes
Fig. 6. GPU and CPU execution time with MAXINS=1/5 of total nodes

Figure 7 shows the comparison of the three simulations, to appreciate the increasing difference in terms of execution time between the two approaches, with a strong dependence on the number of destination hosts MAXINS. Table I shows the actual measured times for all the considered scenarios.

Figure 8 shows the ratio between the execution times on CPU and GPU when varying MAXINS. The ratio is highest when the number of destination peers is highest; this indicates that the parallel version performs much better than the sequential one. Figure 9 provides the execution times on CPU and GPU with 50 destination nodes, when varying the number of peers. The GPU performance is always better than the CPU performance, and the gain obtained by resorting to the GPU is largest when the number of peers is highest. Table II shows the actual measured times for the latter scenarios.

Fig. 7. GPU and CPU execution time: comparison between three cases
Fig. 8. Ratio CPU and GPU execution times when varying the number of peers for download
Fig. 9. GPU and CPU execution times with MAXINS=50 nodes

TABLE I
EXECUTION TIMES IN GPU AND CPU

Nodes x Fragments   MAXINS   GPU (s)   CPU (s)   CPU/GPU
100x100                 4      0.11      0.05      0.48
                       10      0.13      0.06      0.46
                       20      0.16      0.07      0.43
250x250                10      0.19      0.23      1.22
                       25      0.20      0.26      1.32
                       50      0.20      0.35      1.74
500x500                20      0.54      0.80      1.48
                       50      0.57      1.33      2.34
                      100      0.59      2.14      3.62
750x750                30      1.24      2.23      1.79
                       75      1.24      3.91      3.13
                      150      1.35      7.03      5.20
1000x1000              40      2.01      4.64      2.30
                      100      2.04      8.44      4.14
                      200      2.28     19.49      8.56
1250x1250              50      3.20      8.29      2.59
                      125      3.41     19.27      5.65
                      250      3.53     39.22     11.09
1500x1500              60      4.52     13.56      2.99
                      150      4.79     35.64      7.43
                      300      5.41     66.86     12.36
1750x1750              70      5.88     21.30      3.61
                      175      6.67     56.41      8.45
                      350      7.52    107.74     14.32
2000x2000              80      8.36     32.41      3.87
                      200      8.58     70.94      8.26
                      400      9.69    135.77     14.00
TABLE II
EXECUTION TIMES IN GPU AND CPU WITH MAXINS=50

Nodes x Fragments   GPU (s)   CPU (s)
100x100               0.10      0.10
250x250               0.21      0.35
500x500               0.70      1.29
750x750               1.15      2.96
1000x1000             2.14      5.25
1250x1250             3.12      8.36
1500x1500             4.44     11.82
1750x1750             6.29     16.51
2000x2000             7.88     21.51

VI. RELATED WORKS

Nowadays, computational resources are available on demand and tools are needed to coordinate their use [19]–[21]. Several works aim at proposing ways to make development easier in such complex systems [22]–[27]. Several studies have analysed the behaviour of P2P networks from the point of view of shared files, i.e. how to have users contribute contents that can then be uploaded by other users, respecting the latency times in download and upload.

Some works have studied the problem of computation in P2P networks, comparing the GPU approach with the CPU approach. In [6] the authors observe that heterogeneous nodes can have multiple types of computing elements, and that the performance and characteristics of each computing element can be very different. For the simulation of the shared file fragments, we assume an ideal case without giving priority to any file fragment. Instead, in [2], [28] the authors propose to apply a mathematical model for the diffusion of fragments on a P2P network, in order to take into account both the effects of peer distances and the changing availability of peers over time.

In this paper we developed two approaches: the first one is the sequential approach, which we have implemented in the C language; the second is the parallel approach, which uses the CUDA C/C++ language. In [29] the authors present FLAP, a tool that uses patterns to generate parallel CUDA code from sequential C code.

In our paper an important focus is the performance in terms of time, with a comparison between GPU and CPU computation. In [13] the authors proposed three parallel algorithms that maximize the parallelism of processes, i.e., the power of GPUs is fully utilized. With their implementation of the algorithms, they are able to achieve up to 12 times the speedup over the highly optimized CPU counterpart, using the NVIDIA GPU and the CUDA programming model.
VII. CONCLUSION

We have presented a sequential and a parallel solution for the simulation of a P2P network, tested them in an ideal scenario, i.e. without collisions or missing data, and achieved considerable performance gains by using parallelism. Indeed, in each simulated setting we observed that the GPU solution has a lower execution time than the sequential one for updating a peer state. This becomes more evident as peers and fragments increase. In particular, the performance of the sequential approach worsens when increasing the number of destination peers, showing an exponential increase of the execution time, versus the linear trend of the GPU solution.

In the future we can improve the simulation by including non-ideal effects such as collisions, higher latency times in uploads and downloads, and the transfer of larger data beyond our binary data.

REFERENCES

[1] K. Zhao, X. Chu, M. Wang, and Y. Jiang, "Speeding Up Homomorphic Hashing Using GPUs," in Proceedings of IEEE International Conference on Communications (ICC), 2009, pp. 1–5.
[2] C. Napoli, G. Pappalardo, and E. Tramontana, "Improving files availability for BitTorrent using a diffusion model," in Proceedings of IEEE International WETICE Conference, Parma, Italy, June 2014, pp. 191–196.
[3] Y.-K. R. Kwok, Peer-to-Peer Computing: Applications, Architecture, Protocols, and Challenges. CRC Press, 2011.
[4] R. Steinmetz and K. Wehrle, Peer-to-Peer Systems and Applications, ser. Information Systems and Applications. Springer, 2005, vol. 2485.
[5] X. Shen, H. Yu, J. Buford, and M. Akan, Handbook of Peer-to-Peer Networking. Springer, 2010.
[6] A. Gupta and L. K. Awasthi, "Peer-To-Peer Networks and Computation: Current Trends and Future Perspectives," Computing and Informatics, vol. 30, no. 3, pp. 559–594, 2012.
[7] J. Buford, "Management of peer-to-peer overlays," International Journal of Internet Protocol Technology, vol. 3, no. 1, pp. 2–12, 2008.
[8] B. Biskupski, J. Dowling, and J. Sacha, "Properties and mechanisms of self-organizing MANET and P2P systems," ACM Transactions on Autonomous and Adaptive Systems (TAAS), vol. 2, no. 1, p. 1, 2007.
[9] M. Bawa, B. F. Cooper, A. Crespo, N. Daswani, P. Ganesan, H. Garcia-Molina, S. Kamvar, S. Marti, M. Schlosser, Q. Sun et al., "Peer-to-peer research at Stanford," ACM SIGMOD Record, vol. 32, no. 3, pp. 23–28, 2003.
[10] NVIDIA CUDA. [Online]. Available: http://developer.nvidia.com/object/cuda.html
[11] Compute Unified Device Architecture: Programming Guide, 2nd ed., June 2008.
[12] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," in Proceedings of the IEEE, vol. 96, no. 5, 2008, pp. 879–899.
[13] X. Chu, K. Zhao, and M. Wang, "Massively Parallel Network Coding on GPUs," in Proceedings of IEEE International Performance, Computing and Communications Conference (IPCCC), 2008, pp. 144–151.
[14] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
[15] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing With GPUs. Morgan Kaufmann, 2012.
[16] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2008, pp. 73–82.
[17] NVIDIA CUDA Library Documentation. [Online]. Available: http://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/index.html
[18] NVIDIA GeForce GTX 480. [Online]. Available: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/specifications
[19] G. Borowik, M. Woźniak, A. Fornaia, R. Giunta, C. Napoli, G. Pappalardo, and E. Tramontana, "A software architecture assisting workflow executions on cloud resources," International Journal of Electronics and Telecommunications, vol. 61, no. 1, pp. 17–23, 2015.
[20] C. Napoli, G. Pappalardo, and E. Tramontana, "An agent-driven semantical identifier using radial basis neural networks and reinforcement learning," in XV Workshop "From Objects to Agents" (WOA), vol. 1260. Catania, Italy: CEUR-WS, September 2014.
[21] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappalà, "A cloud-distributed GPU architecture for pattern identification in segmented detectors big-data surveys," Computer Journal, 2014. [Online]. Available: http://dx.doi.org/10.1093/comjnl/bxu147
[22] R. Giunta, G. Pappalardo, and E. Tramontana, "Using Aspects and Annotations to Separate Application Code from Design Patterns," in Proceedings of ACM Symposium on Applied Computing (SAC), Sierre, Switzerland, March 2010.
[23] ——, "Aspects and annotations for controlling the roles application classes play for design patterns," in Proceedings of IEEE Asia Pacific Software Engineering Conference (APSEC), Ho Chi Minh, Vietnam, December 2011, pp. 306–314.
[24] A. Calvagna and E. Tramontana, "Delivering dependable reusable components by expressing and enforcing design decisions," in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013, pp. 493–498.
[25] R. Giunta, G. Pappalardo, and E. Tramontana, "Superimposing roles for design patterns into application classes by means of aspects," in Proceedings of ACM Symposium on Applied Computing (SAC), Riva del Garda, Italy, March 2012, pp. 1866–1868.
[26] E. Tramontana, "Automatically characterising components with concerns and reducing tangling," in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013, pp. 499–504.
[27] A. Fornaia, C. Napoli, G. Pappalardo, and E. Tramontana, "An AO system for OO-GPU programming," in XVI Workshop "From Objects to Agents" (WOA), vol. 1382. Napoli, Italy: CEUR-WS, June 2015, pp. 24–31.
[28] C. Napoli, G. Pappalardo, and E. Tramontana, "A mathematical model for file fragment diffusion and a neural predictor to manage priority queues over BitTorrent," International Journal of Applied Mathematics and Computer Science, vol. 26, no. 1, 2016.
[29] E. Hernandez Rubio, A. Meneses Viveros, P. M. C. Perez, S. D. H. Zavala, and H. M. M. Rios, "FLAP: Tool to generate CUDA code from sequential C code," in Proceedings of International Conference on Electronics, Communications and Computers (CONIELECOMP). IEEE, 2014, pp. 35–40.
