Simulating Peer to Peer Networks Using GPU High Performance Support
                                                Giovanni Susinni, Giuseppe Greco
                                                         University of Catania
                                                Viale A. Doria 6, 95125 Catania, Italy
                                          giovannisusinni@gmail.com, grecoba1984@libero.it


Abstract—Peer-to-Peer networks are used by many applications to share resources between nodes. We propose a parallel version of a simulator for some aspects of a peer-to-peer network performing file sharing. Since this analysis is computationally expensive for contemporary CPUs, the computing power of Graphics Processing Units allows a great gain in performance during simulation. Specifically, we have used the NVIDIA Compute Unified Device Architecture programming model to simulate the behaviour of a peer-to-peer network performing data transfers, and to compute communication delays as well as data updates.

Index Terms—P2P networks, GPU computing, CUDA, Performance, Parallel

I. INTRODUCTION

In the last few decades, Peer-To-Peer (P2P) technology and its many variants, such as BitTorrent, have become popular Internet applications due to their scalability and robustness [1], [2]. Today, the number of devices connected to the Internet is increasing fast. Therefore, it is important to analyze the interactions between peers (or nodes, hereafter) and some file sharing issues. The potentially huge number of devices in a P2P network may pose a performance problem during the simulation of data transfers; on the other hand, simulating the behaviour of novel models for selecting source and destination peers is a current research field.

A sequential simulation of the behaviour of a P2P network is viable for a small amount of data and few nodes transferring data: the results can be achieved in a relatively short time. However, for an extensive analysis the sequential approach is not ideal, since it has to cope with a lot of data and transferring nodes. One possible solution is a parallel approach using a Graphics Processing Unit (GPU) architecture. The multi-core architecture of a GPU, organized in threads, blocks and grids, shows its great potential for the said simulation thanks to a computation time lower than that of the sequential approach, particularly when increasing the number of nodes and of data transfers (these often concern file fragments). In this paper, we propose to use GPUs to perform a simulation of a P2P network that is closer to the real case. To this end, our work focuses on the implementation and optimization of exponentiation operations on GPUs.

The contribution of this work is threefold: (i) first, we have developed and analyzed a sequential algorithm, implemented in the C language, that simulates some aspects of a P2P network; (ii) second, we have implemented a corresponding GPU algorithm exploiting the capabilities of a GPU architecture through the Compute Unified Device Architecture (CUDA) C/C++ language; (iii) finally, we have compared the results of the two approaches.

Our paper is organized as follows: Section 2 provides an overview of the P2P model, while Section 3 provides an overview of the CUDA architecture. Section 4 describes the simulation approaches for the sequential and GPU parallel versions. Section 5 shows the experimental results and the comparison of the measured execution times. In Section 6 we analyze the related work, and finally Section 7 provides conclusions and possible future work.

Copyright © 2016 held by the authors.

II. BACKGROUND ON P2P NETWORKS

An effective network architecture should exhibit the following features [3]:
• Provably good performance;
• Low overhead;
• Quality of Service (QoS) of peers;
• Autonomy;
• Robustness.

Over time, the more traditional client-server model based on a centralized approach has given way to the decentralized architecture of P2P networks [4]. Indeed, the continuous growth of users and bandwidth has been accompanied by increasing requests from a diversified wealth of applications, with a remarkable effort in terms of the resources needed to meet these challenges.

A. Definition and Properties of P2P Networks

A distributed network architecture is a P2P network if its participants share a part of their own hardware resources, such as processing power, network link capability and storage capacity [5]. P2P networks are organized as overlay topologies on top of the underlying physical network, and are formed by peers connecting to each other in either a structured or unstructured manner [6].

The P2P network must have the following properties [5]:
• Resource sharing: each peer contributes system resources to the operation of the P2P system. Ideally, this resource sharing is proportional to the peer's use of the P2P system, but many systems suffer from the free rider problem.
• Networked: all nodes are interconnected with other nodes in the P2P system, and the full set of nodes are members of a connected graph. When the graph is no longer connected, the overlay network becomes partitioned.
• Decentralization: the behavior of the P2P system is determined by the collective actions of peer nodes, and there is no central control point. Some systems, however, secure the P2P system using a central login server. The ability to manage the overlay [7] and monetize its operation may require centralized elements.
• Symmetry: nodes assume equal roles in the operation of the P2P system. In many designs this property is relaxed by the use of special peer roles such as super peers or relay peers.
• Autonomy: participation of the peer in the P2P system is determined locally, and there is no single administrative context for the P2P system.
• Scalable: this is a pre-requisite of operating P2P systems with millions of simultaneous nodes, and means that the resources used at each peer exhibit a growth rate, as a function of overlay size, that is less than linear. It also means that the response time does not grow more than linearly as a function of overlay size.
• Self-organization: the organization of the P2P system increases over time using local knowledge and local operations at each peer, and no peer dominates the system. Biskupski, Dowling, and Sacha [8] argue that existing P2P systems do not exhibit most of the properties of self-organization.
• Stability: within a maximum variability rate, the P2P system should be stable, i.e., it should maintain its connected graph and be able to route deterministically within practical hop-count bounds.

Fig. 1. A P2P network

B. System model

Figure 1 provides a conceptual representation of the P2P overlay topology. Since P2P networks are fault-tolerant and not susceptible to a single point of failure, P2P overlay topologies are multiply-connected. Broadly, there are three classes of P2P systems:
• Pure P2P Systems, in which two nodes/devices interact with each other without requiring the intervention of any central server or service.
• Hybrid P2P Systems, in which peers rely partially on a central server to provide certain services, although the interaction between peers still takes place independently.
• Federated P2P Systems, in which peer interactions take place inside pre-defined domains, such as within an organization.
In this classification we have an increasing degree of centralization.

Moreover, P2P systems can either be: (i) structured [9], where the overlay graph is well structured and a mathematical scheme (e.g. Distributed Hash Tables) is applied to make sure that nodes are added in a manner which maintains the structure; or (ii) unstructured, where the overlay is a random graph and new nodes are added to the network in an unpredictable manner. Structured P2P systems are formed on the basis of node identifiers, guaranteeing information retrieval in bounded time for simple queries, and are self-organizing in the face of failures; unstructured P2P systems can support large complex queries, but do not guarantee information retrieval in bounded time, and have less efficient self-organization capabilities.

In this paper we consider a Structured and Pure P2P network.
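To illustrate the DHT-based placement mentioned above, the following sketch is an illustrative example of ours, not taken from the paper: the FNV-1a hash and the 16-bit ring are arbitrary choices (real DHTs such as Chord use SHA-1). Node identifiers are hashed onto a circular space, and a key is assigned to the first node that follows it clockwise.

#include <stdint.h>

#define RING_BITS 16
#define RING_SIZE (1u << RING_BITS)

/* toy hash (FNV-1a) mapping an identifier onto the ring */
uint32_t ring_hash(const char *s) {
  uint32_t h = 2166136261u;
  while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
  return h % RING_SIZE;
}

/* a key is stored on the first node id clockwise from hash(key) */
uint32_t successor(uint32_t key_id, const uint32_t *node_ids, int n) {
  uint32_t best = node_ids[0], best_dist = RING_SIZE;
  for (int i = 0; i < n; i++) {
    uint32_t d = (node_ids[i] - key_id) % RING_SIZE;  /* clockwise distance */
    if (d < best_dist) { best_dist = d; best = node_ids[i]; }
  }
  return best;
}

With such a scheme, lookups can be routed in a bounded number of hops, which is what guarantees retrieval in bounded time for simple queries.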
III. CUDA

CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables an increase in computing performance by harnessing the power of GPUs. With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for GPU computing with CUDA. Here are a few examples: (i) identifying hidden plaque in arteries (heart attacks are the leading cause of death worldwide); (ii) analyzing air traffic flow: the National Airspace System manages the nationwide coordination of air traffic flow, and computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently. Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds. The speed-up is the result of the parallel GPU architecture, which however requires developers to port the compute-intensive portions of the application to the GPU using the CUDA Toolkit.

CUDA works, conceptually, according to the architectural model shown in Figure 2. In the CUDA model, the graphic chip consists of a series of multiprocessors, called Streaming MultiProcessors. The number of multiprocessors depends on the specific class and performance characteristics of each GPU. Each processor can perform a mathematical operation (sum, multiplication, subtraction, etc.) on integers or single-precision (32-bit) floating point numbers. In each multiprocessor there are also two special function units
(these perform transcendental operations such as sine, cosine and reciprocal) and, only for chips based on the GT200 architecture, a single double-precision (64-bit) floating-point unit is available. In a multiprocessor there is a shared memory accessible by all streaming processors, as well as a cache for instructions. The other types of memory, as shown in the picture, are accessible by each processor and represent the main repository for the large amounts of data to be saved in the GPU. In the following we analyze the parts of code run on the GPU using the libraries provided by NVIDIA.

Fig. 2. GPU architecture
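For illustration, the number of Streaming MultiProcessors and the amount of per-block shared memory of the installed GPU can be queried through the CUDA runtime; the snippet below is a sketch of ours using the standard cudaGetDeviceProperties() call, and is not part of the simulator.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
  printf("%s: %d multiprocessors, %zu bytes of shared memory per block\n",
         prop.name, prop.multiProcessorCount, prop.sharedMemPerBlock);
  return 0;
}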
A. Kernel and hierarchy of threads

CUDA programming denotes as a kernel a function, or portion of code, that runs in parallel N times. Each individual run of the kernel is performed by a simple unit called a thread. CUDA threads are simpler than CPU threads, so the code is considered to be faster. To determine the number of threads that run a single kernel there is a logical organization with a two-level hierarchy: at the top level, threads depend on the size of a grid; the grid is a two-dimensional level comprising blocks, which in turn have a three-dimensional structure specified by the number of threads.

Once a kernel is launched, it receives two structures specifying the additional parameters for the size of grid and blocks. Listing 1 shows the syntax to launch a kernel function named kernelfunction().

dim3 blocksize(x, y, z);
dim3 gridsize(x, y);
kernelfunction<<<gridsize, blocksize>>>(parameters);

Listing 1. Kernel syntax

The logical division in grids and blocks is a crucial aspect of CUDA design and coding. In addition to the limits imposed by the hardware, there are other precautions that should be taken into account to make the code more efficient.
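To make this division concrete, the following sketch (our illustrative example; the 16x16 block shape and the problem sizes are assumptions, not values from the paper) configures a grid that covers an N x M array, rounding the number of blocks upwards so that every element is assigned a thread; threads whose indexes fall outside N or M must then be masked inside the kernel.

#define N 1000   /* peers, illustrative value */
#define M 1000   /* fragments, illustrative value */

dim3 blocksize(16, 16, 1);   /* 256 threads per block */
dim3 gridsize((N + blocksize.x - 1) / blocksize.x,   /* ceil(N/16) blocks along x */
              (M + blocksize.y - 1) / blocksize.y);  /* ceil(M/16) blocks along y */
kernelfunction<<<gridsize, blocksize>>>(parameters);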
IV. P2P SIMULATION

A P2P-based application is a distributed application that distributes tasks or work loads among peers, which all have the same privileges. Peers make a portion of their resources available to other network participants, and they do not need a central coordination server. A P2P system can be used in many application domains, e.g. in social contexts, like sharing files and resources. Today, given the large diffusion of P2P networks, we need to simulate and assess the performance of novel transfer protocols, in terms of computation time and of how many nodes are involved in the communication. Generally, the computation time increases with the number of nodes and files.

A. Issues of P2P Network Simulation

In our simulation, we suppose that each peer communicates with the others. Each peer has some file fragments and shares them, while at the same time it receives the missing parts of some files. The peers and file fragments are represented in an array with N peers and M fragments, while the latency times of communication between nodes (namely, the ping times) are in an N*M array, named p.
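As an illustration of this layout (a sketch of ours: the paper loads the actual inputs from files, so the random placeholder values below are assumptions), the two N*M arrays can be allocated and filled in row-major order; note that the listings use N as the row stride, which is consistent here because all the experiments set N equal to M.

#include <stdlib.h>

int   *f;   /* f[i*N + j] == 1 iff peer i holds fragment j */
float *p;   /* p[i*N + j]: ping latency associated to that entry */

void initState(void) {
  f = malloc(N * M * sizeof(int));
  p = malloc(N * M * sizeof(float));
  for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++) {
      f[i*N + j] = rand() % 2;               /* placeholder availability */
      p[i*N + j] = (rand() % 1000) / 10.0f;  /* placeholder ping, in ms */
    }
}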

Fig. 3. An example of state update for nodes

The top of Figure 3 shows the initial state of the peers, in terms of availability of file fragments. We can see that a single node has some file fragments: 0 represents the absent state, while 1 represents the available state. The bottom shows the final state of the peers, after the fragments have been transferred between peers.

B. Sequential simulation

The first approach is the sequential one. Listing 2 shows the algorithm in the C language, with the selection of the peers that will receive the file fragments and the update of their state (from absent to available). Constant N is the number of peers, and M is the number of fragments. Array rics contains the destination peer indexes. There are MAXINS destination peers for a source peer, which are sequentially counted starting from the index of the source peer.

The update of file fragments consists of comparing, for each source peer having index tidx, its fragment availability, as recorded in array f, with the fragment availability of the peer having index rics[i]. If the difference between source and destination is equal to zero, there will be no update; otherwise the source will send the missing fragments to the destination peer. To approximate the real case, in the simulation we use the latency times for each fragment transfer, both in download and in upload. In this analysis we assume that the transfer is ideal, i.e. no loss of packets occurs. The simulation has to consider each source peer, file fragment and destination peer, therefore we have three nested for loops.

This approach is viable for a few peers and fragments, i.e. the results can be obtained in a small amount of time. However, the sequential approach is not appropriate for many peers and fragments; the solution is then a parallel approach using a GPU architecture.

void chooseNode(int tidx, int *rics){
  /* select MAXINS destination peers, counted sequentially after the source index */
  for (int i = 0; i < MAXINS; i++)
    rics[i] = (tidx + i + 1) % N;
}

int checkFr(int *f, int tidx, int *rics, int tidy, int i){
  /* 1 when source and destination differ on the availability of fragment tidy */
  if (f[tidx*N + tidy] - f[rics[i]*N + tidy]) return 1;
  return 0;
}

void selection(float *p, int *f) {
  int rics[MAXINS];
  unsigned long tidx;
  unsigned long tidy;

  for (tidx = 0; tidx < N; tidx++) {       /* each source peer */
    chooseNode(tidx, rics);
    for (tidy = 0; tidy < M; tidy++)       /* each fragment */
      for (int i = 0; i < MAXINS; i++)     /* each destination peer */
        if (checkFr(f, tidx, rics, tidy, i))
          if (p[tidx*N + tidy] <= tstep*(tidx*N + tidy))
            f[rics[i]*N + tidy] = 1;
  }
}

Listing 2. Sequential algorithm updating fragments availability

C. GPU Simulation

The multi-core architecture of a GPU, organized in threads, blocks and grids [10]–[15], shows its great potential by achieving a lower execution time than the sequential approach, particularly with the increase of nodes and file fragments. CUDA is essentially the C/C++ language with a few extensions that allow one to execute functions on the GPU using many threads in parallel [10]. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code that runs on the host can manage memory on both the host and the device, and also launches kernels, which are functions executed on the device. Kernels are executed by many GPU threads in parallel [10], [16]. The built-in variables blockIdx.x, threadIdx.x and blockDim.x allow us to take advantage of the indexes of threads and blocks [14].

Indeed, the kernel is executed in parallel by threads, which are organized in blocks, with a 3D structure, and grids, with a 2D structure. For simulating a P2P network we use a 2D array, with threadIdx.x indicating the nodes and threadIdx.y indicating the fragments. As in the sequential algorithm, we check on each peer whether fragments are available. We use the device function checkFr() for this, whose body is the same as that of the corresponding function in Listing 2. If the fragment is available, the state of the peer can be updated in due time. This global function runs in parallel on each GPU core thanks to the indexes tidx and tidy.

__global__ void selection(float *p, int *f){
  int rics[MAXINS];
  unsigned long tidx = blockIdx.x*blockDim.x + threadIdx.x;
  unsigned long tidy = blockIdx.y*blockDim.y + threadIdx.y;

  chooseNode(tidx, rics);

  for (int i = 0; i < MAXINS; i++)
    if (checkFr(f, tidx, rics, tidy, i))
      if (p[tidx*N + tidy] <= tstep*(tidx*N + tidy))
        f[rics[i]*N + tidy] = 1;
}

Listing 3. Parallel algorithm in CUDA C

The keyword __global__ indicates a kernel function that runs on the device, is called from host code, and runs in multiple instances on several blocks and threads. Listing 3 shows the kernel function for our simulation.

cudaMalloc((void**)&dev_p, N*M*sizeof(float));
cudaMalloc((void**)&dev_f, N*M*sizeof(int));
cudaMalloc((void**)&dev_dt, N*M*sizeof(float));

Listing 4. Library function cudaMalloc()

cudaMemcpy(dev_p, ping, N*M*sizeof(float),
    cudaMemcpyHostToDevice);
cudaMemcpy(output, dev_f, N*M*sizeof(int),
    cudaMemcpyDeviceToHost);

Listing 5. Library function cudaMemcpy()

Library function cudaMalloc() allocates a given size in bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory (see Listing 4). The allocated memory is suitably aligned for any kind of variable. Function cudaMalloc() returns cudaErrorMemoryAllocation in case of failure [17]. Therefore, we can allocate the space necessary to transfer the data from the CPU to the GPU. In CUDA C/C++, library function cudaMemcpy() allows us to transfer data from host to device and vice versa [17] (see Listing 5).

In function main() we transfer the data from host to device and then invoke our kernel function selection() (see Listing 6). This performs the GPU computation; then a cudaMemcpy() call copies from device to host the data indicating the updated nodes.

selection<<<gridsize, blocksize>>>(dev_p, dev_f);

Listing 6. Calling a kernel function from the CPU program

The input arrays of ping times and of node fragment availability are initially loaded from files. At every time-step the kernel is called, and the output array is updated according to the latency times.
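Putting Listings 3 to 6 together, a minimal host-side driver could look as follows. This is a sketch under our assumptions: the loadInputs() helper, the STEPS bound and the 16x16 launch configuration are hypothetical, and error checking is omitted.

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
  float *ping   = malloc(N * M * sizeof(float));
  int   *output = malloc(N * M * sizeof(int));
  loadInputs(ping, output);   /* hypothetical helper reading the input files */

  float *dev_p; int *dev_f;
  cudaMalloc((void**)&dev_p, N * M * sizeof(float));
  cudaMalloc((void**)&dev_f, N * M * sizeof(int));
  cudaMemcpy(dev_p, ping, N * M * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_f, output, N * M * sizeof(int), cudaMemcpyHostToDevice);

  dim3 blocksize(16, 16);
  dim3 gridsize((N + 15) / 16, (M + 15) / 16);
  for (int t = 0; t < STEPS; t++)   /* the kernel is invoked at every time-step */
    selection<<<gridsize, blocksize>>>(dev_p, dev_f);

  cudaMemcpy(output, dev_f, N * M * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dev_p); cudaFree(dev_f);
  free(ping); free(output);
  return 0;
}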
V. RESULTS FOR THE SEQUENTIAL AND GPU VERSIONS

From the analysis of the sequential and GPU execution times, we can see that the results differ greatly for a high number of nodes and file fragments. With a small number of nodes, the sequential computation is faster than the parallel one, because the memory allocation that CUDA performs on the GPU is an expensive process; however, this impact becomes smaller as nodes and data increase. Each execution time has been obtained as the average of ten measures.
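Per-run timings such as those reported below can be collected, for instance, with CUDA events; the snippet is a sketch of the standard technique, and not necessarily the measurement harness used for these experiments.

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
selection<<<gridsize, blocksize>>>(dev_p, dev_f);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                /* wait for the kernel to finish */
cudaEventElapsedTime(&ms, start, stop);    /* elapsed time in milliseconds */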
The sequential algorithm for processing data was executed on an AMD(R) Athlon 64 X2 processor with up to 2.9 GHz clock speed and 4 GB of DDR3L-1333 RAM, while for the GPU algorithm we used an NVIDIA(R) GeForce(TM) GTX 480 with 480 CUDA cores and 1536 MB of GDDR5 video RAM [18].

Figures 4, 5 and 6 show the execution times for the sequential (CPU) and parallel (GPU) versions, when considering a number of destination nodes equal to 1/25, 1/10 and 1/5, respectively, of the total nodes N. Indeed, with a low N the performance of the CPU version is better, because for the GPU version the memory allocation from host to device is a costly operation.

Fig. 4. GPU and CPU execution time with MAXINS=1/25 of total nodes
Fig. 5. GPU and CPU execution time with MAXINS=1/10 of total nodes
Fig. 6. GPU and CPU execution time with MAXINS=1/5 of total nodes

Figure 7 shows the comparison of the three simulations, to appreciate the increasing difference in terms of execution time between the two approaches, with a strong dependence on the number of destination hosts MAXINS. Table I shows the actual measured times for all the considered scenarios.

Figure 8 shows the ratio between the execution times on CPU and GPU when varying MAXINS. The ratio is highest when the number of destination peers is highest; this indicates that the parallel version performs much better than the sequential one. Figure 9 provides the execution times on CPU and GPU with 50 destination nodes, when varying the number of peers. The GPU performance is always better than the CPU performance, and the gain obtained by resorting to the GPU is largest when the number of peers is highest. Table II shows the actual measured times for the latter scenarios.

Fig. 7. GPU and CPU execution time: comparison between three cases
Fig. 8. Ratio CPU and GPU execution times when varying the number of peers for download
Fig. 9. GPU and CPU execution times with MAXINS=50 nodes

TABLE I
EXECUTION TIMES IN GPU AND CPU

Nodes x Fragments   MAXINS   GPU (s)   CPU (s)   CPU/GPU
100x100                 4      0.11      0.05      0.48
                       10      0.13      0.06      0.46
                       20      0.16      0.07      0.43
250x250                10      0.19      0.23      1.22
                       25      0.20      0.26      1.32
                       50      0.20      0.35      1.74
500x500                20      0.54      0.80      1.48
                       50      0.57      1.33      2.34
                      100      0.59      2.14      3.62
750x750                30      1.24      2.23      1.79
                       75      1.24      3.91      3.13
                      150      1.35      7.03      5.20
1000x1000              40      2.01      4.64      2.30
                      100      2.04      8.44      4.14
                      200      2.28     19.49      8.56
1250x1250              50      3.20      8.29      2.59
                      125      3.41     19.27      5.65
                      250      3.53     39.22     11.09
1500x1500              60      4.52     13.56      2.99
                      150      4.79     35.64      7.43
                      300      5.41     66.86     12.36
1750x1750              70      5.88     21.30      3.61
                      175      6.67     56.41      8.45
                      350      7.52    107.74     14.32
2000x2000              80      8.36     32.41      3.87
                      200      8.58     70.94      8.26
                      400      9.69    135.77     14.00
TABLE II
EXECUTION TIMES IN GPU AND CPU WITH MAXINS=50

Nodes x Fragments   GPU (s)   CPU (s)
100x100               0.10      0.10
250x250               0.21      0.35
500x500               0.70      1.29
750x750               1.15      2.96
1000x1000             2.14      5.25
1250x1250             3.12      8.36
1500x1500             4.44     11.82
1750x1750             6.29     16.51
2000x2000             7.88     21.51

VI. RELATED WORKS

Nowadays, computational resources are available on demand and tools are needed to coordinate their use [19]–[21]. Several works aim at proposing ways to make development easier in such complex systems [22]–[27]. Several studies have analysed the behaviour of P2P networks from the point of view of shared files, i.e. how to have users contribute contents that can then be uploaded by other users, respecting the latency times in download and upload.

Some works have studied the problem of computation in P2P networks, comparing the GPU approach with the CPU approach. In [6] the authors observe that heterogeneous nodes can have multiple types of computing elements, and that the performance and characteristics of each computing element can be very different. For the simulation of the shared file fragments, we assume an ideal case without giving priority to any file fragment. Instead, in [2], [28] the authors propose to apply a mathematical model for the diffusion of fragments on a P2P network, in order to take into account both the effects of peer distances and the changing availability of peers over time.

In this paper we developed two approaches: the first one is the sequential approach, which we have implemented in the C language; the second is the parallel approach, which uses the CUDA C/C++ language. In [29] the authors present FLAP, a tool that uses patterns to generate parallel CUDA code from sequential C code.

In our paper an important focus is the performance in terms of time, with a comparison between GPU and CPU computation. In [13] the authors proposed three parallel algorithms that maximize the parallelism of processes, i.e., the power of GPUs is fully utilized. With their implementation of the algorithms, they are able to achieve up to 12 times the speedup over the highly optimized CPU counterpart, using the NVIDIA GPU and the CUDA programming model.
VII. CONCLUSION

We have presented a sequential and a parallel solution for the simulation of a P2P network, tested them in an ideal scenario, i.e. without collisions or missing data, and achieved considerable performance gains by using parallelism. Indeed, in each simulated setting we observed that the GPU solution has a lower execution time than the sequential one for updating a peer state. This becomes more evident as peers and fragments increase. In particular, the performance of the sequential approach worsens when increasing the number of destination peers, showing an exponential increase of the execution time, versus the linear trend of the GPU solution.

In the future we can improve the simulation by including non-ideal effects such as collisions, higher latency times in uploads and downloads, and the transfer of larger data beyond our binary data.

REFERENCES

[1] K. Zhao, X. Chu, M. Wang, and Y. Jiang, "Speeding Up Homomorphic Hashing Using GPUs," in Proceedings of IEEE International Conference on Communications (ICC), 2009, pp. 1–5.
[2] C. Napoli, G. Pappalardo, and E. Tramontana, "Improving files availability for BitTorrent using a diffusion model," in Proceedings of IEEE International WETICE Conference, Parma, Italy, June 2014, pp. 191–196.
[3] Y.-K. R. Kwok, Peer-to-Peer Computing: Applications, Architecture, Protocols, and Challenges. CRC Press, 2011.
[4] R. Steinmetz and K. Wehrle, Peer-to-Peer Systems and Applications, ser. Information Systems and Applications. Springer, 2005, vol. 2485.
[5] X. Shen, H. Yu, J. Buford, and M. Akan, Handbook of Peer-to-Peer Networking. Springer, 2010.
[6] A. Gupta and L. K. Awasthi, "Peer-To-Peer Networks and Computation: Current Trends and Future Perspectives," Computing and Informatics, vol. 30, no. 3, pp. 559–594, 2012.
[7] J. Buford, "Management of peer-to-peer overlays," International Journal of Internet Protocol Technology, vol. 3, no. 1, pp. 2–12, 2008.
[8] B. Biskupski, J. Dowling, and J. Sacha, "Properties and mechanisms of self-organizing MANET and P2P systems," ACM Transactions on Autonomous and Adaptive Systems (TAAS), vol. 2, no. 1, p. 1, 2007.
[9] M. Bawa, B. F. Cooper, A. Crespo, N. Daswani, P. Ganesan, H. Garcia-Molina, S. Kamvar, S. Marti, M. Schlosser, Q. Sun et al., "Peer-to-peer research at Stanford," ACM SIGMOD Record, vol. 32, no. 3, pp. 23–28, 2003.
[10] NVIDIA CUDA. [Online]. Available: http://developer.nvidia.com/object/cuda.html
[11] Compute Unified Device Architecture: Programming Guide, 2nd ed., June 2008.
[12] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," in Proceedings of the IEEE, vol. 96, no. 5, 2008, pp. 879–899.
[13] X. Chu, K. Zhao, and M. Wang, "Massively Parallel Network Coding on GPUs," in Proceedings of IEEE International Performance, Computing and Communications Conference (IPCCC), 2008, pp. 144–151.
[14] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
[15] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing With GPUs. Morgan Kaufmann, 2012.
[16] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2008, pp. 73–82.
[17] NVIDIA CUDA Library Documentation. [Online]. Available: http://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/index.html
[18] NVIDIA GeForce GTX 480. [Online]. Available: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/specifications
[19] G. Borowik, M. Woźniak, A. Fornaia, R. Giunta, C. Napoli, G. Pappalardo, and E. Tramontana, "A software architecture assisting workflow executions on cloud resources," International Journal of Electronics and Telecommunications, vol. 61, no. 1, pp. 17–23, 2015.
[20] C. Napoli, G. Pappalardo, and E. Tramontana, "An agent-driven semantical identifier using radial basis neural networks and reinforcement learning," in XV Workshop "From Objects to Agents" (WOA), vol. 1260. Catania, Italy: CEUR-WS, September 2014.
[21] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappalà, "A cloud-distributed GPU architecture for pattern identification in segmented detectors big-data surveys," Computer Journal, 2014. [Online]. Available: http://dx.doi.org/10.1093/comjnl/bxu147
[22] R. Giunta, G. Pappalardo, and E. Tramontana, "Using Aspects and Annotations to Separate Application Code from Design Patterns," in Proceedings of ACM Symposium on Applied Computing (SAC), Sierre, Switzerland, March 2010.
[23] ——, "Aspects and annotations for controlling the roles application classes play for design patterns," in Proceedings of IEEE Asia Pacific Software Engineering Conference (APSEC), Ho Chi Minh, Vietnam, December 2011, pp. 306–314.
[24] A. Calvagna and E. Tramontana, "Delivering dependable reusable components by expressing and enforcing design decisions," in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013, pp. 493–498.
[25] R. Giunta, G. Pappalardo, and E. Tramontana, "Superimposing roles for design patterns into application classes by means of aspects," in Proceedings of ACM Symposium on Applied Computing (SAC), Riva del Garda, Italy, March 2012, pp. 1866–1868.
[26] E. Tramontana, "Automatically characterising components with concerns and reducing tangling," in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013, pp. 499–504.
[27] A. Fornaia, C. Napoli, G. Pappalardo, and E. Tramontana, "An AO system for OO-GPU programming," in XVI Workshop "From Objects to Agents" (WOA), vol. 1382. Napoli, Italy: CEUR-WS, June 2015, pp. 24–31.
[28] C. Napoli, G. Pappalardo, and E. Tramontana, "A mathematical model for file fragment diffusion and a neural predictor to manage priority queues over BitTorrent," International Journal of Applied Mathematics and Computer Science, vol. 26, no. 1, 2016.
[29] E. Hernandez Rubio, A. Meneses Viveros, P. M. C. Perez, S. D. H. Zavala, and H. M. M. Rios, "FLAP: Tool to generate CUDA code from sequential C code," in Proceedings of International Conference on Electronics, Communications and Computers (CONIELECOMP). IEEE, 2014, pp. 35–40.
