Data Protection in the Organization of Numerical Experiments in Distributed Computing 1

Ilya Kurochkin and Daniil Popov

Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia
kurochkin@iitp.ru

1 This work was funded by the Russian Science Foundation (№ 16-11-10352).

Abstract. An approach to improving the efficiency of a distributed computing system is proposed. The approach reduces the number of initial replication copies by validating the result computed by a single copy. A method for verifying data integrity using cryptographic hash functions is described, together with its adaptation for desktop grids based on the BOINC platform.

Keywords: Distributed computing, Desktop grid, Replication copies, BOINC, Data protection, Cryptographic hash function.

1 Introduction

Distributed computing is one of the ways to solve complex computational problems. It has performed well in organizing large computational experiments in the interests of scientific groups and laboratories. The use of distributed computing systems to solve scientific problems may become an alternative to the use of supercomputers, since the need for large-scale computation far exceeds the capacity of the available supercomputers and clusters.

Grid systems built from personal computers (desktop grids) are a common tool for organizing large scientific experiments. There are a number of platforms (BOINC, HTCondor, Legion, Globus) for organizing calculations in desktop grids. Of these, the most popular is BOINC.

BOINC is an open software platform that makes it possible to use distributed computing for large computational experiments [1]. Such experiments are referred to as distributed computing projects. BOINC has a client-server architecture and allows not only personal computers but also servers, mobile devices and clusters to be connected to a desktop grid.

At the moment, there are dozens of different scientific projects on the BOINC platform. For example, the LHC@home project [2] models the behavior of charged-particle beams under various parameters of the accelerator control magnets acting on them, and the Einstein@home project [3] searches for spinning neutron stars (also called pulsars) using data from the LIGO and GEO gravitational wave detectors, as well as the Arecibo radio observatory.

2 Desktop grid features

Desktop grids have a number of features that should be taken into consideration when organizing the calculations:

• Heterogeneity of the nodes of the distributed computing system;
• Unreliability of connections and possible shutdown of computational nodes;
• Autonomy of the calculations on different nodes and the impossibility of coordinating calculations between nodes;
• Irregular periods of continuous node operation;
• Errors or delays in the calculations.

Taking these features into account, we can formulate several requirements for organizing numerical experiments on a distributed system:

• The whole numerical experiment should be divided into small independent tasks;
• The computational complexity of a single task should be small, within several hours of calculation on an average personal computer;
• Each task should be calculated on several independent computing nodes (replication copies);
• Increasing the number of copies improves the reliability of correctly calculating an individual task and the speed of obtaining a correct result (on average over the entire set of tasks), but reduces the computational capacity of the entire system (see the sketch after this list);
• A system for checking the calculations must be developed.
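To make the last trade-off concrete, the following minimal sketch (an illustration, not part of the original experiments; the per-copy success probability p = 0.8 is an assumed value) computes the reliability gained and the capacity lost as the replication factor n grows:

```python
# Illustrative sketch of the replication trade-off described above.
# p is an assumed probability that a single copy returns a correct result.

def replication_tradeoff(p: float, n: int) -> tuple[float, float]:
    """Return (probability that at least one of n copies succeeds,
    fraction of the grid's capacity left for distinct tasks)."""
    reliability = 1.0 - (1.0 - p) ** n
    effective_capacity = 1.0 / n  # every task occupies n nodes instead of one
    return reliability, effective_capacity

if __name__ == "__main__":
    for n in (1, 2, 3):
        r, c = replication_tradeoff(p=0.8, n=n)
        print(f"n={n}: reliability={r:.3f}, capacity={c:.2f}")
```

With these assumed numbers, going from n = 1 to n = 2 raises the reliability from 0.80 to 0.96 but halves the capacity; this is exactly the tension that the validation scheme below tries to relax.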
2.1 Splitting into independent tasks

It is assumed that the original problem can be split into a large number of parts, many times more than the number of computational nodes in the distributed system.

There is a certain class of scientific problems that can be split into many independent tasks: the calculation algorithm remains unchanged, and only the input data change. This approach to splitting the original computational problem is called a data split or "bag of tasks" [4]. Examples of this class of problems are simulation modelling, the SAT problem [8], complete enumeration problems [9], combinatorial problems [10], etc.

2.2 Complexity of the tasks

It is necessary to choose the computational complexity of a task correctly. On the one hand, it should be possible to solve the task completely on a computational node that is regularly shut down (for example, a node that is switched on only during working hours). On the other hand, the task should be complex enough to keep the share of overhead costs small (the time to transfer the input data and results, to unpack and record the data, etc.).

Typically, for distributed systems, the execution time of a task on a computing node should range from several minutes to 6 hours. There are exceptions when the runtime of a single task can exceed 10 hours or even several days, but in this case it is necessary to ensure the regular saving of intermediate results.

2.3 Replication copies

To increase the likelihood of successful execution of a task, given that the compute nodes of a distributed system may shut down, it is necessary to send several copies of the same task to different nodes. If n copies of each task are sent, the computational capacity of the system is reduced by a factor of n. Increasing the number of copies without adjusting the other parameters of the task distribution system can significantly reduce the computational capacity of the distributed system while still failing to achieve the required reliability or speed of obtaining correct results.

The principle by which copies of a single task are issued is also important. For example, copies of a single task should not be issued to the computing nodes of one user, in order to avoid the simultaneous shutdown of the nodes holding all copies of the task.

2.4 Results verification

Results verification that takes the specifics of the tasks into account is an important tool for reducing the number of copies. The correctness of a task calculation can be verified by comparing the results obtained from several computational nodes, but this approach requires at least two results. If the result is meaningful (e.g., the results of simulation modelling), its correctness can be verified by partial recalculation; note, however, that dedicated computational power is needed to verify the results in this way. In the case of a search or combinatorial problem, there is no such opportunity.

The problems solved in a distributed computing system can be divided, according to the nature of the results obtained, into the following classes:

• Simulation with different initial conditions (large meaningful results of variable length);
• Exhaustive search for one or several solutions (extremely small results, possibly 1 bit);
• Multiple computation of the values of a complicated function (small results of equal length);
• The branch-and-bound method: enumeration with pruning (tasks of varying complexity; small results of variable length);
• Search over parts of a large range (the results are extremely small; a guarantee of error-free computation is needed).

One solution to this problem is validation (a data integrity check) not only of the results themselves, but of the results together with the input and intermediate data. Verification can then be carried out on a single result.

2.5 BOINC results validation

By default, the server part of BOINC validates results by bitwise comparison of several results received from different computing nodes (hosts). The BOINC project server makes several copies of a task, sends them to several hosts, and compares the returned results bitwise. If N results are identical, the server considers the result correct. If no N identical results are obtained [5], the server generates new copies of the task and sends them to additional hosts. This method reduces the computational capacity of the grid system by a factor of at least N.

Instead, a mechanism can be used that verifies the integrity of the results using only one copy of the task. It can be implemented with cryptographic hash functions [6]. Hashing is the convolution of the original data into a combination of fixed length by some hash function. The hash function must be cryptographic, i.e., one-way: nobody should be able to find suitable input data given the value of the hash [7].

However, a single copy of the task still cannot be relied on alone, since there is a chance that the result will not be computed by the user within the specified period. Nevertheless, with hash-based validation far fewer copies are needed than without it.

In addition to verifying the integrity of the results, the integrity of the input and intermediate data must also be ensured. With the integrated use of hashing, full data integrity control can be achieved on the BOINC client side. Let us consider the different types of attacks on the BOINC client side.

3 Scenarios of attacks on an unprotected scheme

3.1 List of attacks

When only one copy of a task is used, the integrity and secrecy of all data, including input, output and intermediate data, must be guaranteed, because there are many attack scenarios against unprotected data:

1. When a user receives files from BOINC, the user accidentally or intentionally modifies the input file so that the application still starts the modelling but produces an incorrect result.
2. During the modelling, the application crashes after creating one or more checkpoint files. The user then accidentally or intentionally changes a checkpoint file so that the application still resumes the modelling but produces an incorrect result, or simply substitutes a checkpoint file from another task.
3. The user may change a finished result file.
4. The user may replace the finished result file with an arbitrary file.
5. The user may accidentally or intentionally modify the configuration file so that the modelling starts and completes, but the result is incorrect.

3.2 Hash functions

Hash functions are used to protect against attacks on data integrity. The hash sums of the input and configuration files are calculated on the server side and sent to the host, as in the sketch below.
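As an illustration of this server-side step, a minimal sketch follows (the file names input.dat and config.xml are hypothetical; the paper does not fix a specific hash function, though it cites Keccak/SHA-3 [7], which is used here):

```python
import hashlib

def file_hash(path: str) -> str:
    """Compute the hex digest of a file, reading it in 64 KiB chunks."""
    h = hashlib.sha3_256()  # SHA3-256 (Keccak), cf. [7]
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical file names; these hash sums accompany the task sent to the host.
hash_sums = {name: file_hash(name) for name in ("input.dat", "config.xml")}
```

The host repeats the same computation on the files it receives and compares the digests bitwise, as described in Section 4.2.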
On the host side, during the modelling, the hash sums of the checkpoint file and the output file are additionally calculated and, together with the rest of the hash sums and the finished result, sent to the server.

To hide the data, two types of encryption are used: stream and asymmetric. Stream encryption is used for the input configuration file, the checkpoint file, the intermediate output file and the files with hash sums, while asymmetric encryption is used for the finished output file.

3.3 Standard methods to protect the BOINC platform integrity

The BOINC platform has a built-in function for signing files with an electronic signature on the server before they are sent to the user. A private and a public key are used for the signature. The private key remains on the server after signing, and the public key is sent to the user together with the files so that their signatures can be verified. This method protects against only part of attacks 1-5.

4 Changed scheme of BOINC project

The scheme of BOINC project functioning was implemented and modified. For clarity, the scheme is divided into two parts: the scenario on the server side and the scenario on the host side.

4.1 Scenario on the server side

Sending

When a task is sent to a user, the files storing the hash sums are sent in addition to the input and configuration files. The input and configuration files are sent in an archive, which is encrypted by the stream encryption algorithm before sending.

Receiving

The server receives from the host the finished output file and an archive with the hash sums of the input, configuration and output files. The server decrypts the received data, then calculates the hash sums of the input and configuration files (copies of which have been stored on the server since the task was sent to the user) and of the finished output file. These hash sums are compared bitwise with the hash sums from the archive. If at least one of the sums does not match, the result is considered incorrect.

4.2 Scenario on the host side

After the input and configuration files and an archive with their hash sums are received from the server, the application is launched. The application decrypts and extracts the input data and the archive with the hash sums, calculates the hash sums of the received input and configuration files and compares them bitwise with the hash sums from the archive. If they match, the modelling starts.

During the modelling, a checkpoint file is periodically created and an intermediate output file is saved. Before these files are saved, their hash sums are calculated and stored in the hash sum archive. Also before saving, the checkpoint file and the intermediate output file are archived and encrypted by the stream encryption algorithm.

In addition, since there are cases when files are saved incorrectly, immediately after saving the stored data are read back and compared bitwise with the copies that the application still keeps in memory (a sketch of this save-and-verify step follows at the end of this subsection).

If the application crashes, then the next time it is launched it decrypts and extracts the checkpoint file, calculates its hash sum and compares it bitwise with the one stored in the archive with the hash sums. If the sums match, the application continues the modelling from the point at which the checkpoint file was created; otherwise the modelling starts from the beginning.

After the modelling is complete, the finished output file is also archived and encrypted, in this case with the asymmetric encryption algorithm.
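The save-and-verify step above can be sketched as follows. This is an illustrative Python stand-in, not the project's actual code (which uses SharpZipLib, implying C#): zlib stands in for the archiver, ChaCha20 for the unnamed stream cipher, and the 32-byte key is assumed to be embedded in the application, as described in Section 4.4.

```python
import os
import zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

def save_encrypted(path: str, plaintext: bytes, key: bytes) -> None:
    """Archive, encrypt with a stream cipher, write to disk,
    then read the file back and verify it against memory."""
    compressed = zlib.compress(plaintext, 2)  # compression level 2, as in Sect. 4.3
    nonce = os.urandom(16)  # this library's ChaCha20 takes a 16-byte nonce
    encryptor = Cipher(algorithms.ChaCha20(key, nonce), mode=None).encryptor()
    blob = nonce + encryptor.update(compressed)
    with open(path, "wb") as f:
        f.write(blob)
    # Read-back check: guard against an incorrectly saved file by comparing
    # the on-disk bytes with the copy still held in memory.
    with open(path, "rb") as f:
        if f.read() != blob:
            raise IOError("checkpoint file was saved incorrectly")
```

Decryption on restart reverses the steps: read the file, split off the nonce, decrypt, decompress, and compare the hash sum with the one stored in the hash sum archive.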
In the end, after the modelling is completed, the host sends the output file and the archive with the hash sums of the input, output and configuration files to the server.

Fig. 1. General scheme of interaction between the server and the host

Fig. 2. Scheme for reading the input and checkpoint files

Fig. 3. Scheme for recording the checkpoint file and interim results

Fig. 4. Scheme of recording the completed output file

4.3 Archiving

The files are archived before encryption every time, because otherwise there is some probability that even an encrypted file will remain usable by the application after accidental or intentional changes; this can happen if, for example, decryption changes only one digit to another. The second reason is that archiving before encryption makes the encryption harder to crack. The third reason is that the finished output file can be rather large to send over the Internet, and archiving reduces its size by roughly a factor of 10. The open source library SharpZipLib (compression level 2) is used for archiving.

4.4 Encryption keys

For stream encryption, a key embedded in the application source code is used, because during the application's operation the intermediate output file and the checkpoint file are periodically encrypted and decrypted.

4.5 Application of the proposed methodology

To test this methodology, a private BOINC project was developed, and an implementation of a simulation model of the functioning of a telecommunication network [11] was used as the computational application. Different model networks were taken as the input data.

5 Conclusion

The use of cryptographic hash functions and the archiving of results makes it possible to maintain data integrity. A result can be verified from a single copy, and the number of initial copies of each task can be reduced without loss of reliability. As a result, the computing power of the distributed system can be increased. However, the use of encryption of data and results in volunteer distributed computing projects can adversely affect their popularity (and hence their computational capacity), since a policy of openness is mandatory for public distributed computing projects.

References

1. Anderson, D. (2004). BOINC: a system for public-resource computing and storage. In Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing. IEEE.
2. LHC@home. URL: http://lhcathome.web.cern.ch/
3. Einstein@home. URL: https://einsteinathome.org/ru/home
4. Bertin, R., Hunold, S., Legrand, A., & Touati, C. (2014). Fair scheduling of bag-of-tasks applications using distributed Lagrangian optimization. Journal of Parallel and Distributed Computing, 74(1), 1914-1929.
5. Afanasiev, A. P., Bychkov, I. V., Manzyuk, M. O., Posypkin, M. A., Semenov, A. A., & Zaikin, O. S. (2015). Technology for integrating idle computing cluster resources into volunteer computing projects. In Proceedings of the 5th International Workshop on Computer Science and Engineering, Moscow, Russia (pp. 109-114).
6. Rogaway, P., & Shrimpton, T. (2004). Cryptographic hash-function basics: Definitions, implications, and separations for preimage resistance, second-preimage resistance, and collision resistance. In International Workshop on Fast Software Encryption (pp. 371-388). Springer, Berlin, Heidelberg.
7. Bertoni, G., Daemen, J., Peeters, M., & Van Assche, G. (2011). The Keccak SHA-3 submission.
8. Posypkin, M., Semenov, A., & Zaikin, O. (2012). Using BOINC desktop grid to solve large scale SAT problems. Computer Science, 13(1), 25-34.
9. Vatutin, E. I., Valyaev, S. Y., & Titov, V. S. (2015). Comparison of sequential methods for getting separations of parallel logic control algorithms using volunteer computing. In Proceedings of the BOINC:FAST conference.
10. Vatutin, E. I., Zaikin, O. S., Zhuravlev, A. D., Manzyuk, M. O., Kochemazov, S. E., & Titov, V. S. (2016). Using grid systems for enumerating combinatorial objects on example of diagonal Latin squares. In Distributed Computing and Grid-Technologies in Science and Education (GRID'16): Book of Abstracts of the 7th International Conference (pp. 114-115). Dubna: JINR.
11. Kurochkin, I., & Grinberg, Ya. (2015). Different criteria of dynamic routing. Procedia Computer Science, 66, 166-173.