=Paper=
{{Paper
|id=Vol-2667/paper35
|storemode=property
|title=Implementation of frequency analysis of Twitter microblogging in a hybrid cloud based on the Binder, Everest platform and the Samara University virtual desktop service
|pdfUrl=https://ceur-ws.org/Vol-2667/paper35.pdf
|volume=Vol-2667
|authors=Sergey Vostokin,Irina Bobyleva
}}
==Implementation of frequency analysis of Twitter microblogging in a hybrid cloud based on the Binder, Everest platform and the Samara University virtual desktop service ==
Sergey Vostokin
Samara National Research University, Samara, Russia
easts@mail.ru

Irina Bobyleva
Samara National Research University; Joint Stock Company Space Rocket Center Progress, Samara, Russia
ikazakova90@gmail.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract—The paper proposes the architecture of a distributed data processing application in a hybrid cloud environment. The application was studied using a distributed algorithm for determining the frequency of words in messages of English-language microblogs published on Twitter. The possibility of aggregating computing resources using many-task computing technology for data processing in hybrid cloud environments is shown. The architecture has proven to be technically simple and efficient in terms of performance.

Keywords—parallel computing model, algorithmic skeleton, parallel algorithm, actor model, workflow, many-task computing

I. INTRODUCTION

Advances in artificial intelligence and data processing technologies have contributed to the intensive development of tools for building scientific applications. One of the popular tools in this area is the JupyterLab integrated development environment [1]. An important functional feature of JupyterLab is the ability to perform calculations remotely on a specially configured computer. The user can interact with the JupyterLab environment from any other computer over the Internet in a web browser. Cloud services, in particular the Binder service [2] used in this study, amplify this feature. The Binder service automatically prepares the necessary configuration of the application with the JupyterLab environment as a docker image [3] and runs the application on a free virtual machine, for example, in the Google Cloud. Thus, all control over the application is carried out entirely through a web browser, so there is no need for a specially configured physical computer.

This method of working with an application is convenient for demonstration purposes, teaching students, solving computationally simple tasks, and collaborative research. However, along with ease of implementation, the application is limited in computing resources. Obviously, when deploying the application in free computing environments, the user should expect a minimal quota for memory and processor time. The user is also limited in networking capabilities and other rights on the free virtual machine. These limitations make it difficult or even impossible to organize high-performance data processing in the way that is standard for JupyterLab.

The article discusses an approach to overcome the described limitations. The essence of the approach is that the calculations are not performed on the machine where JupyterLab is installed. Instead, only the control part of the many-task application is installed together with JupyterLab. The control part of the application communicates with the auxiliary cloud platform Everest [4] and uses its computing part deployed to other cloud systems. The computing part includes the required number of virtual machines. This solves the problem of obtaining the necessary processor resources and memory.

In this research (1) we propose the distributed application architecture and a deployment method; (2) we study the possible speedup of calculations for the given application while solving an actual problem from the field of data processing.

II. METHOD FOR FREQUENCY ANALYSIS IN A HYBRID CLOUD ENVIRONMENT

As a model problem for assessing the effectiveness of the proposed distributed application architecture, the problem of calculating the frequency of words in a text array was chosen. The source text array for analysis was taken from a weekly dump of all messages transmitted through the Twitter microblogging service. We chose the frequency analysis problem because it has many practical applications, for example, market analysis [5] or clustering [6]. On the other hand, this problem is important in our study for assessing the data volume that can be processed in a reasonable time. It is also important for estimating the typical times of processing and sending text data over the network, which affect the efficiency of the calculations.

The specifics of computing in a hybrid cloud are that application components are limited in their possibilities of interaction: (1) they communicate not directly, but only through some intermediate service; (2) they are connected by communication channels with medium bandwidth and/or latency. These features require organizing the data processing in a special way.

An adequate model for organizing computations in a hybrid cloud is the task model of computations [7]. According to the task model, calculations are understood as the periodic launch of non-interacting tasks. The reason for replenishing the set of tasks launched at a certain moment of observation may be the completion of another task.

Our implementation of the frequency analysis algorithm is based on launching two types of tasks from the control part of the application. A task of the first type receives a Twitter dump in the form of a JSON file, extracts English text messages from it, splits them into words, and forms an alphabetically ordered list of words with their local frequencies for the given dump. This list is stored in a file. A task of the second type uses two files built by tasks of the first type. It combines the files into one ordered list of words and frequencies, then splits this list in half. The content of the first file is replaced by the first half, and the content of the second file is replaced by the second half of the ordered list.
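The two task types can be sketched in Python. This is a minimal illustration, not the actual task executables registered on the Everest platform; the one-JSON-message-per-line dump layout, the `lang`/`text` fields, and the `word frequency` line format of the output files are assumptions.

```python
# Sketch of the two task types (illustrative file formats, see lead-in).
import json
from collections import Counter

def first_type_task(dump_path, out_path):
    """Extract words of English messages from a Twitter JSON dump and
    store an alphabetically ordered word/frequency list."""
    counts = Counter()
    with open(dump_path, encoding="utf-8") as f:
        for line in f:  # assumption: one JSON message per line
            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                continue
            if msg.get("lang") == "en":  # keep English messages only
                counts.update(msg.get("text", "").lower().split())
    with open(out_path, "w", encoding="utf-8") as f:
        for word in sorted(counts):
            f.write(f"{word} {counts[word]}\n")

def second_type_task(path_a, path_b):
    """Merge two ordered word/frequency files into one ordered list,
    then split it in half: first half to path_a, second to path_b."""
    merged = Counter()
    for path in (path_a, path_b):
        with open(path, encoding="utf-8") as f:
            for entry in f:
                word, freq = entry.split()
                merged[word] += int(freq)
    items = [f"{w} {merged[w]}\n" for w in sorted(merged)]
    mid = len(items) // 2
    for path, half in ((path_a, items[:mid]), (path_b, items[mid:])):
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(half)
```

Note how the second-type task both merges the frequency counts of equal words and keeps each file roughly half the merged size, which is what keeps file transfers small.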
Note that the common method of merging the files into a single file also solves the frequency analysis problem, but then the size of the resulting files constantly increases during processing. Partitioning the files is necessary, since the tasks should operate on small files in order to facilitate file transfer over the network.

It is easy to notice that the periodic application of the second-type task to individually ordered files (built by the tasks of the first type) will order the entire file array. For example, if the total number of files in the array is N, then the procedure

for i from 1 to N-1
    for j from 0 to i-1
        perform the second-type task for file j and file i

orders the file array. The work of the control part of the application consists in the parallel execution of this procedure. In the experiments we applied a small optimization to reduce the diameter of the task dependency graph and speed up the calculations.

Parallelization of the task invocation procedure can be performed as explained in [8]. The idea is to simulate the behavior of the computing part of the application on the JupyterLab virtual machine using a master-worker actor system. The master actor decomposes the problem into tasks of the first and second types and distributes the tasks to workers. The worker actors accept tasks, pass completion messages back to the master, and request more work. However, instead of performing the tasks on the JupyterLab machine, the worker actors send the tasks for the initial processing of Twitter dump files (tasks of the first type) and for the pairwise consolidation of the resulting files (tasks of the second type) to the Everest server.
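The claim that this iterative procedure orders the whole file array can be checked with a small in-memory simulation, in which sorted lists of numbers stand in for the word-frequency files and a merge-then-split function plays the role of the second-type task (the real tasks run remotely and also merge frequency counts):

```python
# In-memory simulation of the ordering procedure from the text.
def merge_split(a, b):
    """Stand-in for the second-type task: merge two sorted lists,
    then split the result in half."""
    merged = sorted(a + b)
    mid = len(merged) // 2
    return merged[:mid], merged[mid:]

def order_files(files):
    """Iterative procedure: for i in 1..N-1, apply the second-type
    task to the pairs (file j, file i) for all j < i."""
    for i in range(1, len(files)):
        for j in range(i):
            files[j], files[i] = merge_split(files[j], files[i])
    return files
```

After `order_files`, the concatenation of the lists is globally sorted, mirroring the paper's setup where every word ends up in exactly one result file.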
The actor system was implemented in the C++ programming language using the Templet parallel computing system [9].

The optimization is based on the observation that the task dependency graph for the parallel version of the procedure is asymmetric. Although we cannot change the total number of vertices in the task graph (it is equal to N(N-1)/2), we can make it symmetrical with a smaller diameter. To do this, we use a recursive procedure for listing the pairwise merging tasks instead of the original iterative procedure. Detailed descriptions of the parallel data processing algorithm and its optimization can be found in [10].

III. DEPLOYMENT AND OPERATION OF THE FREQUENCY ANALYSIS APPLICATION

The main feature of the proposed distributed application architecture is its optimization for computing in a hybrid cloud environment. Such a computing environment is built on the basis of free computing resources available in public and academic cloud systems. In addition, the application is fully deployed and launched through a web browser.

Fig. 1. Application deployment and execution diagram.

Let us take a look at the steps to deploy and run the application shown in Figure 1. The steps explain the implementation of the main architectural features.

Step 1. Registration of the computing resources of the application on the Everest platform. Obtaining access tokens for agent programs through the web interface of the Everest platform.

Step 2. Installing the application components for the primary processing of Twitter dump files (the task of the first type) and pairwise merging of the resulting files (the task of the second type). This installation is performed through the web interface of the Everest platform.

Step 3. Running Windows 7 virtual machines in the corporate cloud of Samara University. Installing agent programs on them using the access tokens obtained in Step 1. Verifying the activity of the agent programs through the web interface of the Everest platform.

Step 4. Uploading the data set as Twitter text files in JSON format to a file server in the corporate cloud of Samara University. This upload can be performed through one of the virtual machines started in Step 3.

Step 5. Launching the control part of the application (the so-called application orchestrator) from the GitHub code repository via the web interface.
Step 6. Automatic access to the Binder service (after completing Step 5) to build a docker container with the application orchestrator running in the JupyterLab environment.

Step 7. Deploying the docker container (from Step 6) in the Google Cloud. The link to the web interface of the application orchestrator is returned to the web terminal of the application user.

Step 8. Launching the application orchestrator by the user via the web interface obtained in Step 7. Automatic data processing starts.

Step 9. The application orchestrator sends commands to launch the next tasks to the Everest platform server and polls the status of previously launched tasks.

Step 10. The Everest platform server distributes tasks for execution to free virtual machines through the resource agent programs (installed in Step 3). The calculation ends when N tasks of the first type and N(N-1)/2 tasks of the second type have been started and completed, where N is the number of files in the information array.

Note that if a series of experiments to determine word frequencies is performed, the computing part of the application (Steps 1-4) is set up only once. For the second and subsequent manual launches, only Step 8 is performed.
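The submit-and-poll interaction of Steps 9 and 10 can be sketched as follows; the `EverestClient` class and its method names are purely illustrative stand-ins, not the real Everest platform API:

```python
# Illustrative submit-and-poll loop (Steps 9-10). EverestClient is a
# mock: a real orchestrator would talk to the Everest platform server,
# which dispatches tasks to the agent programs on virtual machines.
import time

class EverestClient:
    """Mock task-execution service client (hypothetical interface)."""
    def __init__(self):
        self._status = {}
    def submit(self, task_type, files):
        job_id = len(self._status)
        self._status[job_id] = "DONE"  # a real service runs tasks asynchronously
        return job_id
    def status(self, job_id):
        return self._status[job_id]

def run_tasks(client, tasks, poll_interval=0.0):
    """Launch every task, then poll until all of them complete."""
    pending = {client.submit(task_type, files) for task_type, files in tasks}
    while pending:
        pending = {j for j in pending if client.status(j) != "DONE"}
        if pending:
            time.sleep(poll_interval)
    return True
```

In the real application the orchestrator additionally decides which second-type tasks become ready as earlier tasks finish, following the task model of Section II.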
The initial launch of JupyterLab is fully automatic. The sequence of steps starting from Step 5 and ending with Step 7 is initiated by the user of the application by pressing a special button in the graphical interface of GitHub. All basic actions, including resolving application dependencies, building an image in the form of a Docker container, and searching for a free virtual machine in public clouds (currently Google Cloud, OVH, GESIS notebooks, and Turing Institute clouds), are fully automated by the MyBinder platform. The developer only needs to follow the special format for the git repository of the application orchestrator.

A restriction of the deployment method for the computing part of the application in the test implementation (Steps 1-4) is the need for access to a shared file system from all the virtual machines deployed in Step 3. However, it is technically possible to transfer files directly from the application orchestrator through the Everest server to the worker virtual machines. One can also use the IPFS distributed file system client program for file transfer [11]. These methods were not considered in this study.

IV. PERFORMANCE MEASUREMENT OF THE FREQUENCY ANALYSIS APPLICATION

The purpose of the experiments was to verify the functionality of the application and to confirm the effectiveness of calculations using the proposed distributed application architecture. The architecture is appropriate to consider effective if the calculations are completed in a reasonable time and the use of several virtual machines in the computing part of the application leads to faster calculations. The results of the experiment are shown in Table 1.

TABLE 1. THE SPEEDUP OF DATA PROCESSING ON A HYBRID CLOUD COMPUTING ENVIRONMENT

Number of files, N | Sequential processing time, s | Parallel processing time, s | Speedup estimated by the three measurements
2 | 79.226; 68.4562; 79.0123 | 53.7776; 59.9446; 63.2141 | 1.29
3 | 145.436; 114.838; 140.661 | 85.1115; 94.4716; 86.5152 | 1.54
4 | 213.356; 183.547; 212.875 | 106.638; 110.084; 104.538 | 1.90
5 | 304.123; 245.916; 347.438 | 140.8; 140.235; 140.065 | 2.14
6 | 468.526; 330.048; 458.023 | 171.275; 175.948; 180.496 | 2.38
7 | 562.377; 409.327; 599.395 | 198.964; 192.46; 185.208 | 2.73
8 | 670.223; 484.539; 586.413 | 204.349; 220.411; 223.25 | 2.69
9 | 736.98; 763.388; 845.347 | 247.328; 239.363; 246.543 | 3.20
10 | 889.71; 936.003; 983.331 | 273.798; 276.348; 274.694 | 3.40
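The speedup column of Table 1 can be reproduced approximately from the three measurement pairs per row. The exact averaging method is not stated in the paper, so taking the mean of the per-measurement ratios is an assumption:

```python
# Estimate speedup from Table 1 (times in seconds). Assumption: the
# reported value is the mean of the three sequential/parallel ratios.
def speedup(sequential, parallel):
    ratios = [s / p for s, p in zip(sequential, parallel)]
    return sum(ratios) / len(ratios)

# N = 10 files processed on 10 virtual machines
s10 = speedup([889.71, 936.003, 983.331], [273.798, 276.348, 274.694])
# s10 is close to the reported maximum speedup of 3.40
```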
For the experiments, we used a fragment of a 5.88 GB data array collected in [5]. The array consisted of 10 files. The size of the input JSON files ranged from 524 MB to 849 MB. The result of processing was stored in an array of 10 text files with a total size of 1.83 MB. 148885 words were found (including word forms and neologisms). The size of the resulting files ranged from 158 KB to 223 KB. The resulting files consisted of word-frequency entries. The entries were sorted through the entire set of 10 files (words beginning with 'a' were stored in the first file, and words beginning with 'z' in the last file).

We carried out a series of experiments. The number of processed files and the number of virtual machines on which the computing part of the application was run varied from 2 to 10. The observed time for completing tasks of the first type in all experiments varied from about 24 to 36 seconds, and the time for completing tasks of the second type ranged from about 6 to 26 seconds. The maximum speedup of 3.4x (relative to the version with one virtual machine in the computing part of the application) was achieved when processing 10 files on 10 virtual machines. The sequential processing of the data array on one virtual machine took 983 seconds in the worst case (~16 minutes). The parallel processing using 10 virtual machines took about 270 seconds (~4.5 minutes). The absolute reduction in processing time was approximately 11.5 minutes. Thus, the studied distributed application architecture can be considered effective.

V. CONCLUSION

The paper proposes the architecture of a distributed data processing application for computing in a hybrid cloud environment built as a combination of free public and academic cloud services. In a computational experiment to determine the frequency of words in Twitter messages, the possibility of speeding up the calculations was demonstrated, which confirms the effectiveness of the proposed architecture.

The practical advantage of the architecture is the possibility of high-performance data processing using only a web browser. This feature is well suited for demonstrations, training, and collaborative research.

REFERENCES

[1] B. Granger, C. Colbert and I. Rose, "JupyterLab: The next generation jupyter frontend," JupyterCon, 2017.
[2] P. Jupyter, M. Bussonnier, J. Forde, J. Freeman, B. Granger, T. Head, C. Holdgraf, K. Kelley, G. Nalvarte, A. Osheroff, M. Pacer, Y. Panda, F. Perez, B. Ragan-Kelley and C. Willing, "Binder 2.0 - Reproducible, interactive, sharable environments for science at scale," Proceedings of the 17th Python in Science Conference, pp. 113-120, 2018.
[3] D. Merkel, "Docker: lightweight linux containers for consistent development and deployment," Linux Journal, no. 239, 2014.
[4] O. Sukhoroslov, S. Volkov and A. Afanasiev, "A Web-Based Platform for Publication and Distributed Execution of Computing Applications," 14th International Symposium on Parallel and Distributed Computing, Limassol, pp. 175-184, 2015.
[5] D.A. Vorobiev and V.G. Litvinov, "Automated system for forecasting the behavior of the foreign exchange market using analysis of the emotional coloring of messages in social networks," Advanced Information Technologies (PIT), pp. 416-419, 2018.
[6] I.A. Rycarev, D.V. Kirsh and A.V. Kupriyanov, "Clustering of media content from social networks using bigdata technology," Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[7] P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar, K. Hasanov, P. Gschwandtner, P. Lemarinier, S. Markidis, H. Jordan, T. Fahringer, K. Katrinis, E. Laure and D. S. Nikolopoulos, "A taxonomy of task-based parallel programming technologies for high-performance computing," The Journal of Supercomputing, vol. 74, no. 4, pp. 1422-1434, 2018.
[8] S.V. Vostokin, O.V. Sukhoroslov, I.V. Bobyleva and S.N. Popov, "Implementing computations with dynamic task dependencies in the desktop grid environment using Everest and Templet Web," CEUR Workshop Proceedings, vol. 2267, pp. 271-275, 2018.
[9] S.V. Vostokin, "The Templet parallel computing system: Specification, implementation, applications," Procedia Engineering, vol. 201, pp. 684-689, 2017.
[10] S.V. Vostokin and I.V. Bobyleva, "Asynchronous round-robin tournament algorithms for many-task data processing applications," International Journal of Open Information Technologies, vol. 8, no. 4, pp. 45-53, 2020.
[11] J. Benet, "IPFS - content addressed, versioned, P2P file system," arXiv preprint arXiv:1407.3561, 2014.

VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)