=Paper=
{{Paper
|id=Vol-1543/p3
|storemode=property
|title=Simulating Peer to Peer Networks Using GPU High Performance Support
|pdfUrl=https://ceur-ws.org/Vol-1543/p3.pdf
|volume=Vol-1543
|authors=Giovanni Susinni,Giuseppe Greco
|dblpUrl=https://dblp.org/rec/conf/system/SusinniG15
}}
==Simulating Peer to Peer Networks Using GPU High Performance Support==
Simulating Peer to Peer Networks Using GPU High Performance Support

Giovanni Susinni, Giuseppe Greco
University of Catania, Viale A. Doria 6, 95125 Catania, Italy
giovannisusinni@gmail.com, grecoba1984@libero.it

Abstract—Peer-to-Peer networks are used by many applications to share resources between nodes. We propose a parallel version of a simulator for some aspects of a peer-to-peer network performing file sharing. Since this analysis is computationally expensive for contemporary CPUs, the computing power of Graphics Processing Units allows a great gain in performance during simulation. Specifically, we have used the NVIDIA Compute Unified Device Architecture programming model to simulate the behaviour of a peer-to-peer network performing data transfers, and to compute communication delays as well as data updates.

Index Terms—P2P networks, GPU computing, CUDA, Performance, Parallel

I. INTRODUCTION

In the last few decades, Peer-To-Peer (P2P) technology and its many variants, such as BitTorrent, have become popular Internet applications thanks to their scalability and robustness [1], [2]. Today, the number of devices connected to the Internet is increasing fast. It is therefore important to analyze the interactions between peers (or nodes hereafter) and some file sharing issues. The potentially huge number of devices in a P2P network may pose a performance problem during the simulation of data transfers; at the same time, simulating the behaviour of novel models for selecting source and destination peers is a current research field.

A sequential simulation of the behaviour of a P2P network is viable for a small amount of data and a small number of transferring nodes: the results can be achieved in a relatively short time. However, for an extensive analysis, which has to cope with a lot of data and transferring nodes, the sequential approach is not ideal. One possible solution is a parallel approach using a Graphics Processing Unit (GPU) architecture. The multi-core architecture of a GPU, organized in threads, blocks and grids, shows its great potential for the said simulation thanks to a computation time lower than that of the sequential approach, particularly when increasing the number of nodes and data transfers (the transferred data are often file fragments). In this paper, we propose to use GPUs to perform a simulation of a P2P network that is closer to the real case. To this end, our work focuses on the implementation and optimization of exponentiation operations on GPUs.

The contribution of this work is threefold: (i) first, we have developed and analyzed a sequential algorithm, implemented in C, that simulates some aspects of a P2P network; (ii) second, we have implemented a corresponding GPU algorithm exploiting the GPU architecture through the Compute Unified Device Architecture (CUDA) C/C++ language; (iii) finally, we have compared the results of the two approaches.

Our paper is organized as follows: Section 2 provides an overview of the P2P model, while Section 3 provides an overview of the CUDA architecture. Section 4 describes the simulation approaches for the sequential and GPU parallel versions. Section 5 shows the experimental results and compares the measured execution times. In Section 6 we analyze the related work, and finally Section 7 provides conclusions and possible future work.

II. BACKGROUND ON P2P NETWORKS

An effective network architecture should exhibit the following features [3]:
• Provably good performance;
• Low overhead;
• Quality of Service (QoS) for peers;
• Autonomy;
• Robustness.

Over time, the traditional client-server model based on a centralized approach has given way to the decentralized architecture of P2P networks [4]. Indeed, the continuous growth of users and bandwidth has been accompanied by increasing requests from a diversified wealth of applications, with a remarkable effort in terms of the resources needed to meet these challenges.
A. Definition and Properties of P2P Networks

A distributed network architecture is a P2P network if its participants share a part of their own hardware resources, such as processing power, network link capacity and storage capacity [5]. P2P networks are organized as overlay topologies on top of the underlying physical network and are formed by peers connecting to each other in either a structured or an unstructured manner [6].

A P2P network must have the following properties [5]:
• Resource sharing: each peer contributes system resources to the operation of the P2P system. Ideally, this resource sharing is proportional to the peer's use of the P2P system, but many systems suffer from the free rider problem.
• Networked: all nodes are interconnected with other nodes in the P2P system, and the full set of nodes are members of a connected graph. When the graph is no longer connected, the overlay network becomes partitioned.
• Decentralization: the behavior of the P2P system is determined by the collective actions of peer nodes, and there is no central control point. Some systems, however, secure the P2P system using a central login server. The ability to manage the overlay [7] and monetize its operation may require centralized elements.
• Symmetry: nodes assume equal roles in the operation of the P2P system. In many designs this property is relaxed by the use of special peer roles such as super peers or relay peers.
• Autonomy: participation of a peer in the P2P system is determined locally, and there is no single administrative context for the P2P system.
• Scalability: this is a pre-requisite for operating P2P systems with millions of simultaneous nodes, and means that the resources used at each peer exhibit a growth rate, as a function of overlay size, that is less than linear. It also means that the response time does not grow more than linearly as a function of overlay size.
• Self-organization: the organization of the P2P system increases over time using local knowledge and local operations at each peer, and no peer dominates the system. Biskupski, Dowling, and Sacha [8] argue that existing P2P systems do not exhibit most of the properties of self-organization.
• Stability: within a maximum variability rate, the P2P system should be stable, i.e., it should maintain its connected graph and be able to route deterministically within practical hop-count bounds.

B. System model

Figure 1 provides a conceptual representation of the P2P overlay topology. Since P2P networks are fault-tolerant, with no single point of failure, P2P overlay topologies are multiply-connected.

Fig. 1. A P2P network

Broadly, there are three classes of P2P systems:
• Pure P2P Systems, in which two nodes/devices interact with each other without requiring the intervention of any central server or service.
• Hybrid P2P Systems, in which peers rely partially on a central server to provide certain services, although the interaction between peers still takes place independently.
• Federated P2P Systems, in which peer interactions take place inside pre-defined domains, such as within an organization.

In this classification we have an increasing degree of centralization. Moreover, P2P systems can be either: (i) structured [9], where the overlay graph is well structured and a mathematical scheme (e.g. Distributed Hash Tables) is applied to make sure that nodes are added in a manner which maintains the structure; or (ii) unstructured, where the overlay is a random graph and new nodes are added to the network in an unpredictable manner. Structured P2P systems are formed on the basis of node identifiers, guaranteeing information retrieval in bounded time for simple queries, and are self-organizing in the face of failures; unstructured P2P systems can support large complex queries, but do not guarantee information retrieval in bounded time and have less efficient self-organization capabilities.

In this paper we consider a Structured and Pure P2P network.
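To make the node-identifier mechanism of structured overlays concrete, the following minimal C sketch (our illustration, not code from the paper; the ring size, node identifiers and hash are arbitrary) hashes a fragment key onto the same identifier ring as the nodes and routes the query to the first node whose identifier is not smaller than the hashed key, so a simple lookup completes in bounded time:

#include <stdio.h>

#define RING 1024   /* size of the identifier space */
#define NODES 5

/* a toy hash mapping a key onto the identifier ring */
static unsigned hash_id(unsigned key) { return (key * 2654435761u) % RING; }

/* successor lookup: first node id >= key id, wrapping around the ring;
   node_ids must be sorted */
static unsigned successor(const unsigned *node_ids, int n, unsigned id) {
    for (int i = 0; i < n; i++)
        if (node_ids[i] >= id) return node_ids[i];
    return node_ids[0];   /* wrap around the ring */
}

int main(void) {
    unsigned node_ids[NODES] = {101, 317, 489, 702, 930};   /* sorted */
    unsigned key = 42;
    unsigned id = hash_id(key);
    printf("fragment key %u -> id %u -> stored on node %u\n",
           key, id, successor(node_ids, NODES, id));
    return 0;
}

In an unstructured overlay, by contrast, no such mapping exists, and a query has to flood or randomly walk the random graph; this is why retrieval time cannot be bounded there.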
III. CUDA

CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables an increase in computing performance by harnessing the power of GPUs. With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for GPU computing with CUDA. Here are a few examples: (i) identifying hidden plaque in arteries (heart attacks are the leading cause of death worldwide); (ii) analyzing air traffic flow: the National Airspace System manages the nationwide coordination of air traffic flow, and computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently. Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds. The speed-up is the result of the parallel GPU architecture, which however requires developers to port the compute-intensive portions of the application to the GPU using the CUDA Toolkit.

CUDA works, conceptually, according to the architectural model shown in Figure 2. In the CUDA model, the graphics chip consists of a series of multiprocessors, called Streaming Multiprocessors. The number of multiprocessors depends on the class and performance characteristics of each GPU. Each processor can perform a mathematical operation (sum, multiplication, subtraction, etc.) on integers or single-precision (32-bit) floating point numbers. Each multiprocessor also contains two special function units, which compute transcendental operations such as sine, cosine and reciprocal; in addition, only on chips based on the GT200 architecture, a double-precision (64-bit) floating-point unit is available. In a multiprocessor there is a shared memory accessible by all its streaming processors, as well as an instruction cache. The other types of memory, as shown in the figure, are accessible by each processor and represent the main repository for the large amounts of data to be stored on the GPU.

Fig. 2. GPU architecture
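Since the number of Streaming Multiprocessors and the memory sizes vary with the GPU class, it can help to see how these characteristics are inspected at run time. The following sketch (ours, not from the paper) uses the standard CUDA runtime call cudaGetDeviceProperties():

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    /* query the properties of device 0 */
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("device: %s\n", prop.name);
    printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("global memory: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}

On the GeForce GTX 480 used for the experiments below, for instance, the 480 CUDA cores are organized as 15 multiprocessors of 32 cores each.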
In the following we analyze the parts of the code that run on the GPU, using the libraries provided by NVIDIA.

A. Kernel and hierarchy of threads

CUDA programming calls a kernel a function whose code runs in parallel, N times. Each individual run of the kernel is performed by a simple unit called a thread. CUDA threads are simpler than CPU threads, so they can be created and scheduled with little overhead. To determine the number of threads that run a single kernel there is a logical organization with a two-level hierarchy. At the top level, threads depend on the size of a grid: the grid is a two-dimensional level comprising blocks, which in turn have a three-dimensional structure specified by the number of threads.

Once a kernel is launched, it receives two structures specifying the additional parameters for the size of the grid and of the blocks. Listing 1 shows the syntax to launch a kernel function named kernelfunction().

dim3 blocksize(x, y, z);
dim3 gridsize(x, y);
kernelfunction<<<gridsize, blocksize>>>(parameters);

Listing 1. Kernel syntax

The logical division into grids and blocks is a crucial aspect of CUDA design and coding. In addition to the limits imposed by the hardware, other precautions should be taken into account to make the code more efficient.
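As a concrete instance of the syntax in Listing 1, the sketch below (our example; the paper does not state its actual launch configuration, so the 16x16 block size is an assumption) sizes a 2D grid to cover a domain of N nodes by M fragments, rounding up with an integer ceiling division:

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1000   /* nodes */
#define M 1000   /* fragments */

/* trivial kernel: each thread marks its own cell of an N x M matrix */
__global__ void touch(int *out) {
    unsigned x = blockIdx.x * blockDim.x + threadIdx.x;   /* node index */
    unsigned y = blockIdx.y * blockDim.y + threadIdx.y;   /* fragment index */
    if (x < N && y < M)            /* guard: the grid may overshoot */
        out[x * M + y] = 1;
}

int main(void) {
    int *dev_out;
    cudaMalloc((void**)&dev_out, N * M * sizeof(int));
    dim3 blocksize(16, 16, 1);     /* 256 threads per block */
    dim3 gridsize((N + blocksize.x - 1) / blocksize.x,    /* ceil(N/16) */
                  (M + blocksize.y - 1) / blocksize.y);   /* ceil(M/16) */
    touch<<<gridsize, blocksize>>>(dev_out);
    cudaDeviceSynchronize();
    printf("launched %u x %u blocks of %u x %u threads\n",
           gridsize.x, gridsize.y, blocksize.x, blocksize.y);
    cudaFree(dev_out);
    return 0;
}

The bounds check inside the kernel matters because the rounded-up grid generally launches more threads than there are cells.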
IV. P2P SIMULATION

A P2P-based application is a distributed application that assigns tasks or workloads to peers, which all have the same privileges. Peers make a portion of their resources available to other network participants, without needing a central coordination server. A P2P system can be used in many application domains, e.g. in social contexts, such as sharing files and resources. Today, given the large diffusion of P2P networks, we need to simulate and assess the performance of novel transfer protocols, in terms of computation time and of how many nodes are involved in the communication. Generally, the computation time increases with the number of nodes and files.

A. Issues of P2P Network Simulation

In our simulation, we suppose that each peer communicates with the others. Each peer holds some file fragments and shares them, while at the same time receiving the missing parts of some files. The peers and file fragments are represented in an array with N peers and M fragments, while the communication latencies between nodes (namely, the pings) are stored in an N*M array named p.

The top of Figure 3 shows the initial state of the peers in terms of availability of file fragments. We can see that a single node holds some file fragments: 0 represents the absent state, while 1 represents the available state. The bottom shows the final state of the peers, after the fragments have been transferred between peers.

Fig. 3. An example of state update for nodes

B. Sequential simulation

The first approach is the sequential one. Listing 2 shows the algorithm in C, with the selection of the peers that will receive the file fragments and the update of their state (from absent to available). Constant N is the number of peers, and M is the number of fragments. Array rics contains the destination peer indexes. There are MAXINS destination peers for each source peer, counted sequentially starting from the index of the source peer.

The update of file fragments consists of comparing, for each source peer having index tidx, its fragment availability, as recorded in array f, with the fragment availability of the peer having index rics[i]. If the difference between source and destination is zero, there is no update; otherwise the source sends the missing fragments to the destination peer. To approximate the real case, the simulation uses the latency times for each fragment transfer, both in download and in upload. In this analysis we assume that the transfer is ideal, i.e. no packet loss occurs. The simulation has to consider each source peer, file fragment and destination peer; therefore we have three nested for loops.

This approach is viable for a few peers and fragments, i.e. the results can be obtained in a small amount of time. However, the sequential approach is not appropriate for many peers and fragments; the solution is then a parallel approach using a GPU architecture.

void chooseNode(int tidx, int *rics) {
  /* destinations are the MAXINS peers following the source index */
  for (int i = 0; i < MAXINS; i++)
    rics[i] = (tidx + i + 1) % N;
}

int checkFr(int *f, int tidx, int *rics, int tidy, int i) {
  /* non-zero when source and destination availability differ */
  if (f[tidx*N + tidy] - f[rics[i]*N + tidy]) return 1;
  return 0;
}

void selection(float *p, int *f) {
  int rics[MAXINS];
  unsigned long tidx;
  unsigned long tidy;
  /* f and p are N*M matrices; the experiments always use N == M */
  for (tidx = 0; tidx < N; tidx++) {
    chooseNode(tidx, rics);
    for (tidy = 0; tidy < M; tidy++)
      for (int i = 0; i < MAXINS; i++)
        if (checkFr(f, tidx, rics, tidy, i))
          if (p[tidx*N + tidy] <= tstep*(tidx*N + tidy))
            f[rics[i]*N + tidy] = 1;
  }
}

Listing 2. Sequential algorithm updating fragments availability
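To show how the code of Listing 2 could be driven, here is a minimal harness of our own (the paper loads its inputs from files and does not show its main(); the small constants, the tstep value and the initial availability pattern are assumptions). It fills the ping array with small random latencies, gives each peer one fragment, and then runs selection() for a number of time steps:

#include <stdio.h>
#include <stdlib.h>

#define N 8        /* peers (kept small for illustration) */
#define M 8        /* fragments; the paper's runs use N == M */
#define MAXINS 2   /* destination peers per source */

float tstep = 1.0f;   /* assumed time-step scale; not given in the paper */

/* chooseNode(), checkFr() and selection() as in Listing 2 */
void chooseNode(int tidx, int *rics) {
  for (int i = 0; i < MAXINS; i++)
    rics[i] = (tidx + i + 1) % N;
}

int checkFr(int *f, int tidx, int *rics, int tidy, int i) {
  if (f[tidx*N + tidy] - f[rics[i]*N + tidy]) return 1;
  return 0;
}

void selection(float *p, int *f) {
  int rics[MAXINS];
  for (int tidx = 0; tidx < N; tidx++) {
    chooseNode(tidx, rics);
    for (int tidy = 0; tidy < M; tidy++)
      for (int i = 0; i < MAXINS; i++)
        if (checkFr(f, tidx, rics, tidy, i))
          if (p[tidx*N + tidy] <= tstep*(tidx*N + tidy))
            f[rics[i]*N + tidy] = 1;
  }
}

int main(void) {
  static float p[N*M];   /* ping latencies between peers   */
  static int f[N*M];     /* fragment availability, 0 or 1  */
  for (int i = 0; i < N*M; i++)
    p[i] = (float)(rand() % 100) / 1000.0f;   /* 0..0.099 s */
  for (int i = 0; i < N; i++)
    f[i*N + i] = 1;      /* each peer starts with one fragment */

  for (int step = 0; step < 10; step++)       /* ten simulated time steps */
    selection(p, f);

  int avail = 0;
  for (int i = 0; i < N*M; i++) avail += f[i];
  printf("available entries after simulation: %d of %d\n", avail, N*M);
  return 0;
}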
C. GPU Simulation

The multi-core architecture of the GPU, organized in threads, blocks and grids [10]–[15], shows its great potential by achieving a lower execution time than the sequential approach, particularly as the numbers of nodes and file fragments increase. CUDA is essentially the C/C++ language with a few extensions that allow one to execute functions on the GPU using many threads in parallel [10]. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code that runs on the host can manage memory on both the host and the device, and also launches kernels, which are functions executed on the device. Kernels are executed by many GPU threads in parallel [10], [16]. The tokens blockIdx.x, threadIdx.x and blockDim.x allow us to take advantage of the indexes of threads and blocks [14]. Indeed, the kernel is executed in parallel by threads, which are organized in blocks, with a 3D structure, and grids, with a 2D structure.

For simulating a P2P network we use a 2D arrangement, with the x thread index indicating the nodes and the y thread index indicating the fragments. As in the sequential algorithm, we check on each peer whether fragments are available. We use the device function checkFr() for this, whose body is the same as that of the corresponding function in Listing 2. If the fragment is available, the state of the peer can be updated in due time. This global function runs in parallel on each GPU core thanks to the indexes tidx and tidy.

__global__ void selection(float *p, int *f) {
  int rics[MAXINS];
  unsigned long tidx = blockIdx.x*blockDim.x + threadIdx.x;
  unsigned long tidy = blockIdx.y*blockDim.y + threadIdx.y;
  chooseNode(tidx, rics);
  for (int i = 0; i < MAXINS; i++)
    if (checkFr(f, tidx, rics, tidy, i))
      if (p[tidx*N + tidy] <= tstep*(tidx*N + tidy))
        f[rics[i]*N + tidy] = 1;
}

Listing 3. Parallel algorithm in CUDA C

The keyword __global__ indicates a kernel function that runs on the device, is called from host code, and runs in multiple instances on several blocks and threads. Listing 3 shows the kernel function for our simulation.

cudaMalloc((void**)&dev_p, N*M*sizeof(float));
cudaMalloc((void**)&dev_f, N*M*sizeof(int));
cudaMalloc((void**)&dev_dt, N*M*sizeof(float));

Listing 4. Library function cudaMalloc()

Library function cudaMalloc() allocates a given number of bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory (see Listing 4). The allocated memory is suitably aligned for any kind of variable. Function cudaMalloc() returns cudaErrorMemoryAllocation in case of failure [17]. Therefore, we can allocate the space necessary to transfer the data from CPU to GPU. In the CUDA C/C++ library, function cudaMemcpy() allows us to transfer data from host to device and vice versa [17] (see Listing 5).

cudaMemcpy(dev_p, ping, N*M*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(output, dev_f, N*M*sizeof(float), cudaMemcpyDeviceToHost);

Listing 5. Library function cudaMemcpy()

In function main() we invoke our kernel function selection() (see Listing 6). This performs the GPU computation; a final cudaMemcpy() call then copies from device to host the data indicating the updated nodes.

selection<<<gridsize, blocksize>>>(dev_p, dev_f);

Listing 6. Calling a kernel function from the CPU program

The input arrays of ping times and of per-node fragment availability are initially loaded from files. At every time step the kernel is called and the output array is updated according to the latency times.
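Putting Listings 3–6 together, the host-side flow described above would look roughly like the following sketch (ours, not the authors' code; the error-checking macro and the function boundary are assumptions, while dev_p, dev_f, ping and the call sequence follow the listings):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t e = (call);                                           \
    if (e != cudaSuccess) {                                           \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e));     \
      exit(1);                                                        \
    }                                                                 \
  } while (0)

__global__ void selection(float *p, int *f);   /* the kernel of Listing 3 */

/* one simulated time step: copy inputs in, run the kernel, copy results out */
void run_step(const float *ping, int *f_host, int n, int m,
              dim3 gridsize, dim3 blocksize) {
  float *dev_p;
  int *dev_f;
  CUDA_CHECK(cudaMalloc((void**)&dev_p, n*m*sizeof(float)));   /* Listing 4 */
  CUDA_CHECK(cudaMalloc((void**)&dev_f, n*m*sizeof(int)));
  CUDA_CHECK(cudaMemcpy(dev_p, ping, n*m*sizeof(float),
                        cudaMemcpyHostToDevice));              /* Listing 5 */
  CUDA_CHECK(cudaMemcpy(dev_f, f_host, n*m*sizeof(int),
                        cudaMemcpyHostToDevice));

  selection<<<gridsize, blocksize>>>(dev_p, dev_f);            /* Listing 6 */
  CUDA_CHECK(cudaGetLastError());        /* launch-configuration errors */
  CUDA_CHECK(cudaDeviceSynchronize());   /* errors raised during the run */

  CUDA_CHECK(cudaMemcpy(f_host, dev_f, n*m*sizeof(int),
                        cudaMemcpyDeviceToHost));   /* updated availability */
  CUDA_CHECK(cudaFree(dev_p));
  CUDA_CHECK(cudaFree(dev_f));
}

Checking both cudaGetLastError() and cudaDeviceSynchronize() separates launch-configuration failures from faults raised while the kernel executes; in a time-stepped simulation one would also hoist the allocations out of the loop, since, as the next section shows, host-device memory traffic dominates for small problem sizes.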
V. RESULTS FOR THE SEQUENTIAL AND GPU VERSIONS

From the analysis of the sequential and GPU execution times, we can see that the results differ greatly for a high number of nodes and file fragments. With a small number of nodes, the sequential computation is faster than the parallel one, because allocating memory on the GPU and transferring data to it is an expensive process; this cost, however, weighs less and less as nodes and data increase. Each execution time reported below is the average of ten measurements.

The sequential algorithm is executed on an AMD Athlon 64 X2 processor with up to 2.9 GHz clock speed and 4 GB of DDR3L-1333 RAM, while for the GPU algorithm we used an NVIDIA GeForce GTX 480 with 480 CUDA cores and 1536 MB of GDDR5 video RAM [18].

Figures 4, 5 and 6 show the execution times for the sequential (CPU) and parallel (GPU) versions, when considering a number of destination nodes equal to 1/25, 1/10 and 1/5, respectively, of the total nodes N. Indeed, with a low N the performance of the CPU version is better, because the GPU version pays the costly memory allocation and transfer from host to device.

Fig. 4. GPU and CPU execution time with MAXINS=1/25 of total nodes
Fig. 5. GPU and CPU execution time with MAXINS=1/10 of total nodes
Fig. 6. GPU and CPU execution time with MAXINS=1/5 of total nodes

Figure 7 compares the three simulations, to appreciate the growing difference in execution time between the two approaches, with a strong dependence on the number of destination hosts MAXINS. Table I shows the actual measured times for all considered scenarios.

Fig. 7. GPU and CPU execution time: comparison between three cases

TABLE I. EXECUTION TIMES IN GPU AND CPU

Nodes x Fragments | MAXINS | GPU (s) | CPU (s) | CPU/GPU
100x100   |   4 | 0.11 |   0.05 |  0.48
100x100   |  10 | 0.13 |   0.06 |  0.46
100x100   |  20 | 0.16 |   0.07 |  0.43
250x250   |  10 | 0.19 |   0.23 |  1.22
250x250   |  25 | 0.20 |   0.26 |  1.32
250x250   |  50 | 0.20 |   0.35 |  1.74
500x500   |  20 | 0.54 |   0.80 |  1.48
500x500   |  50 | 0.57 |   1.33 |  2.34
500x500   | 100 | 0.59 |   2.14 |  3.62
750x750   |  30 | 1.24 |   2.23 |  1.79
750x750   |  75 | 1.24 |   3.91 |  3.13
750x750   | 150 | 1.35 |   7.03 |  5.2
1000x1000 |  40 | 2.01 |   4.64 |  2.3
1000x1000 | 100 | 2.04 |   8.44 |  4.14
1000x1000 | 200 | 2.28 |  19.49 |  8.56
1250x1250 |  50 | 3.20 |   8.29 |  2.59
1250x1250 | 125 | 3.41 |  19.27 |  5.65
1250x1250 | 250 | 3.53 |  39.22 | 11.09
1500x1500 |  60 | 4.52 |  13.56 |  2.99
1500x1500 | 150 | 4.79 |  35.64 |  7.43
1500x1500 | 300 | 5.41 |  66.86 | 12.36
1750x1750 |  70 | 5.88 |  21.30 |  3.61
1750x1750 | 175 | 6.67 |  56.41 |  8.45
1750x1750 | 350 | 7.52 | 107.74 | 14.32
2000x2000 |  80 | 8.36 |  32.41 |  3.87
2000x2000 | 200 | 8.58 |  70.94 |  8.26
2000x2000 | 400 | 9.69 | 135.77 | 14.00

Figure 8 shows the ratio between CPU and GPU execution times when varying MAXINS. The ratio is highest when the number of destination peers is highest, which indicates that the parallel version performs much better than the sequential one. Figure 9 provides the execution times on CPU and GPU with 50 destination nodes, when varying the number of peers. GPU performance is always at least as good as CPU performance, and the gain from resorting to the GPU is largest when the number of peers is highest. Table II shows the actual measured times for the latter scenarios.

Fig. 8. Ratio of CPU to GPU execution times when varying the number of peers for download
Fig. 9. GPU and CPU execution times with MAXINS=50 nodes

TABLE II. EXECUTION TIMES IN GPU AND CPU WITH MAXINS=50

Nodes x Fragments | GPU (s) | CPU (s)
100x100   | 0.10 |  0.10
250x250   | 0.21 |  0.35
500x500   | 0.7  |  1.29
750x750   | 1.15 |  2.96
1000x1000 | 2.14 |  5.25
1250x1250 | 3.12 |  8.36
1500x1500 | 4.44 | 11.82
1750x1750 | 6.29 | 16.51
2000x2000 | 7.88 | 21.51
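The paper does not show how the times in the tables were taken; a common way to time a kernel on the device, given here as a sketch of our own (empty_kernel is a stand-in for the selection() kernel), is with CUDA events:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void empty_kernel(void) { }   /* stand-in for selection() */

int main(void) {
    const int runs = 10;   /* average over ten measures, as in the paper */
    float total_ms = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int r = 0; r < runs; r++) {
        cudaEventRecord(start);
        empty_kernel<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   /* wait until the kernel has finished */
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    printf("mean kernel time: %f ms\n", total_ms / runs);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

cudaEventElapsedTime() reports milliseconds measured on the GPU itself, so the figure excludes host-side overhead unless the events also bracket the memory transfers.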
VI. RELATED WORK

Nowadays, computational resources are available on demand and tools are needed to coordinate their use [19]–[21]. Several works aim at making development easier in such complex systems [22]–[27]. Several studies have analysed the behaviour of P2P networks from the point of view of shared files, i.e. how to have users contribute contents that can be uploaded by other users, respecting the latency times in download and upload.

Some works have studied the problem of computation in P2P networks comparing the GPU approach with the CPU approach. The authors of [6] observe that heterogeneous nodes can have multiple types of computing elements, and that the performance and characteristics of each computing element can be very different. For the simulation of the shared file fragments, we assume an ideal case, without giving priority to any file fragment. Instead, in [2], [28] the authors propose to apply a mathematical model for the diffusion of fragments on a P2P network, in order to take into account both the effects of peer distances and the changing availability of peers over time.

In this paper we developed two approaches: the first is the sequential one, implemented in C; the second is the parallel approach, which uses the CUDA C/C++ language. In [29] the authors present FLAP, a tool that uses patterns to generate parallel CUDA code from sequential C code.

In our paper an important focus is the performance in terms of time, with a comparison between GPU and CPU computation. In [13] the authors proposed three parallel algorithms that maximize the parallelism of processes, i.e., the power of GPUs is fully utilized. With their implementation of the algorithms, they are able to achieve up to 12 times the speedup over a highly optimized CPU counterpart, using an NVIDIA GPU and the CUDA programming model.

VII. CONCLUSION

We have presented a sequential and a parallel solution for the simulation of a P2P network, tested them in an ideal scenario, i.e. without collisions or missing data, and achieved considerable performance gains by using parallelism. Indeed, in each simulated setting we observed that the GPU solution has a lower execution time than the sequential one for updating a peer's state. This becomes more evident as peers and fragments increase. In particular, the performance of the sequential approach is worst when increasing the number of destination peers, showing an exponential increase of the execution time, versus the linear trend of the GPU solution.

In the future we can improve the simulation by including non-ideal effects such as collisions and higher latency times in uploads and downloads, and by transferring larger data beyond our binary availability data.
REFERENCES

[1] K. Zhao, X. Chu, M. Wang, and Y. Jiang, "Speeding Up Homomorphic Hashing Using GPUs," in Proceedings of IEEE International Conference on Communications (ICC), 2009, pp. 1–5.
[2] C. Napoli, G. Pappalardo, and E. Tramontana, "Improving files availability for BitTorrent using a diffusion model," in Proceedings of IEEE International WETICE Conference, Parma, Italy, June 2014, pp. 191–196.
[3] Y.-K. R. Kwok, Peer-to-Peer Computing: Applications, Architecture, Protocols, and Challenges. CRC Press, 2011.
[4] R. Steinmetz and K. Wehrle, Peer-to-Peer Systems and Applications, ser. Information Systems and Applications. Springer, 2005, vol. 2485.
[5] X. Shen, H. Yu, J. Buford, and M. Akan, Handbook of Peer-to-Peer Networking. Springer, 2010.
[6] A. Gupta and L. K. Awasthi, "Peer-To-Peer Networks and Computation: Current Trends and Future Perspectives," Computing and Informatics, vol. 30, no. 3, pp. 559–594, 2012.
[7] J. Buford, "Management of peer-to-peer overlays," International Journal of Internet Protocol Technology, vol. 3, no. 1, pp. 2–12, 2008.
[8] B. Biskupski, J. Dowling, and J. Sacha, "Properties and mechanisms of self-organizing MANET and P2P systems," ACM Transactions on Autonomous and Adaptive Systems (TAAS), vol. 2, no. 1, p. 1, 2007.
[9] M. Bawa, B. F. Cooper, A. Crespo, N. Daswani, P. Ganesan, H. Garcia-Molina, S. Kamvar, S. Marti, M. Schlosser, Q. Sun et al., "Peer-to-peer research at Stanford," ACM SIGMOD Record, vol. 32, no. 3, pp. 23–28, 2003.
[10] NVIDIA CUDA. [Online]. Available: http://developer.nvidia.com/object/cuda.html
[11] Compute Unified Device Architecture: Programming Guide, 2nd ed., June 2008.
[12] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," in Proceedings of the IEEE, vol. 96, no. 5, 2008, pp. 879–899.
[13] X. Chu, K. Zhao, and M. Wang, "Massively Parallel Network Coding on GPUs," in Proceedings of IEEE International Performance, Computing and Communications Conference (IPCCC), 2008, pp. 144–151.
[14] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.
[15] S. Cook, CUDA Programming: A Developer's Guide to Parallel Computing With GPUs. Morgan Kaufmann, 2012.
[16] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2008, pp. 73–82.
[17] NVIDIA CUDA Library Documentation. [Online]. Available: http://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/html/index.html
[18] NVIDIA GeForce GTX 480. [Online]. Available: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/specifications
[19] G. Borowik, M. Woźniak, A. Fornaia, R. Giunta, C. Napoli, G. Pappalardo, and E. Tramontana, "A software architecture assisting workflow executions on cloud resources," International Journal of Electronics and Telecommunications, vol. 61, no. 1, pp. 17–23, 2015.
[20] C. Napoli, G. Pappalardo, and E. Tramontana, "An agent-driven semantical identifier using radial basis neural networks and reinforcement learning," in XV Workshop "From Objects to Agents" (WOA), vol. 1260. Catania, Italy: CEUR-WS, September 2014.
[21] C. Napoli, G. Pappalardo, E. Tramontana, and G. Zappalà, "A cloud-distributed GPU architecture for pattern identification in segmented detectors big-data surveys," Computer Journal, 2014. [Online]. Available: http://dx.doi.org/10.1093/comjnl/bxu147
[22] R. Giunta, G. Pappalardo, and E. Tramontana, "Using Aspects and Annotations to Separate Application Code from Design Patterns," in Proceedings of ACM Symposium on Applied Computing (SAC), Sierre, Switzerland, March 2010.
[23] R. Giunta, G. Pappalardo, and E. Tramontana, "Aspects and annotations for controlling the roles application classes play for design patterns," in Proceedings of IEEE Asia Pacific Software Engineering Conference (APSEC), Ho Chi Minh City, Vietnam, December 2011, pp. 306–314.
[24] A. Calvagna and E. Tramontana, "Delivering dependable reusable components by expressing and enforcing design decisions," in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013, pp. 493–498.
[25] R. Giunta, G. Pappalardo, and E. Tramontana, "Superimposing roles for design patterns into application classes by means of aspects," in Proceedings of ACM Symposium on Applied Computing (SAC), Riva del Garda, Italy, March 2012, pp. 1866–1868.
[26] E. Tramontana, "Automatically characterising components with concerns and reducing tangling," in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013, pp. 499–504.
[27] A. Fornaia, C. Napoli, G. Pappalardo, and E. Tramontana, "An AO system for OO-GPU programming," in XVI Workshop "From Objects to Agents" (WOA), vol. 1382. Napoli, Italy: CEUR-WS, June 2015, pp. 24–31.
[28] C. Napoli, G. Pappalardo, and E. Tramontana, "A mathematical model for file fragment diffusion and a neural predictor to manage priority queues over BitTorrent," International Journal of Applied Mathematics and Computer Science, vol. 26, no. 1, 2016.
[29] E. Hernandez Rubio, A. Meneses Viveros, P. M. C. Perez, S. D. H. Zavala, and H. M. M. Rios, "FLAP: tool to generate CUDA code from sequential C code," in Proceedings of International Conference on Electronics, Communications and Computers (CONIELECOMP). IEEE, 2014, pp. 35–40.