Implementation of frequency analysis of Twitter microblogging in a hybrid cloud based on the Binder, Everest platform and the Samara University virtual desktop service

Sergey Vostokin
Samara National Research University
Samara, Russia
easts@mail.ru

Irina Bobyleva
Samara National Research University; Joint Stock Company Space Rocket Center Progress
Samara, Russia
ikazakova90@gmail.com

Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Abstract—The paper proposes the architecture of a distributed data processing application in a hybrid cloud environment. The application was studied using a distributed algorithm for determining the frequency of words in messages of English-language microblogs published on Twitter. The possibility of aggregating computing resources using many-task computing technology for data processing in hybrid cloud environments is shown. The architecture has proven to be technically simple and efficient in terms of performance.

Keywords—parallel computing model, algorithmic skeleton, parallel algorithm, actor model, workflow, many-task computing

I. INTRODUCTION

Advances in artificial intelligence and data processing technologies have contributed to the intensive development of tools for building scientific applications. One of the popular tools in this area is the JupyterLab integrated development environment [1]. An important feature of JupyterLab is the ability to perform calculations remotely on a specially configured computer. The user can interact with the JupyterLab environment from any other computer over the Internet in a web browser. Cloud services, in particular the Binder service [2] used in this study, magnify this feature. The Binder service automatically prepares the necessary configuration of the application with the JupyterLab environment as a Docker image [3] and runs the application on a free virtual machine, for example, in the Google Cloud. Thus, all control over the application is carried out entirely through a web browser, so no specially configured physical computer is needed.

This method of working with an application is convenient for demonstration purposes, teaching students, solving computationally simple tasks, and collaborative research. However, this ease of use comes with limited computing resources. Obviously, when deploying the application in free computing environments, the user should expect a minimal quota for memory and processor time. The user is also restricted in networking and other rights on the free virtual machine. These limitations make it difficult or even impossible to organize high-performance data processing in the way that is standard for JupyterLab.

The article discusses an approach to overcoming these limitations. The essence of the approach is that the calculations are not performed on the machine where JupyterLab is installed. Instead, only the control part of the many-task application is installed together with JupyterLab. The control part of the application communicates with the auxiliary cloud platform Everest [4] and uses its computing part deployed to other cloud systems. The computing part includes the required number of virtual machines. This solves the problem of obtaining the necessary processor resources and memory.

In this research we (1) propose the distributed application architecture and deployment method, and (2) study the possible speedup of calculations for the given application while solving an actual problem from the field of data processing.

II. METHOD FOR FREQUENCY ANALYSIS IN A HYBRID CLOUD ENVIRONMENT

As a model problem for assessing the effectiveness of the proposed distributed application architecture, the problem of calculating the frequency of words in a text array was chosen. The source text array for analysis was taken from a weekly dump of all messages transmitted through the Twitter microblogging service. We chose the frequency analysis problem because it has many practical applications, for example, market analysis [5] or clustering [6]. On the other hand, this problem is important in our study for assessing the data volume that can be processed in a reasonable time. It is also important to estimate the typical processing and sending times of text data over the network, which affect the efficiency of the calculations.

The specifics of computing in a hybrid cloud are that application components are limited in how they can interact: (1) they communicate not directly, but only through some intermediate service; (2) they are connected by communication channels with moderate bandwidth and/or latency. These features require organizing data processing in a special way.

An adequate model for organizing computations in a hybrid cloud is the task model of computations [7]. According to the task model, calculations are understood as the periodic launch of non-interacting tasks. The reason for replenishing the set of tasks launched at a certain moment of observation may be the completion of another task.

Our implementation of the frequency analysis algorithm is based on launching two types of tasks from the control part of the application. A task of the first type receives a Twitter dump in the form of a JSON file, extracts English text messages from it, splits them into words, and forms an alphabetically ordered list of words with their local frequencies for the given dump. This list is stored in a file. A task of the second type uses two files built by tasks of the first type. It combines the files into one ordered list of words and frequencies, then splits this list in half. Further, the content of the first file is replaced by the first half, and the content of the
second file is replaced by the second half of the ordered list. Note that the common method of merging the files into a single file would also solve the frequency analysis problem, but the size of the resulting files would constantly increase during processing. Partitioning the files is necessary because the tasks should operate on files that are not large, in order to facilitate file transfer over the network.
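For illustration, the two task types can be sketched in Python as follows. This is only a minimal sketch, not the code used in the study; in particular, the assumption that the dump stores one JSON object per line with 'lang' and 'text' fields, as well as the helper and file names, are introduced here for illustration only.

    import json
    import re
    from collections import Counter

    WORD_RE = re.compile(r"[a-z']+")

    def first_type_task(dump_path, out_path):
        """Count the words of English messages found in one Twitter dump file."""
        counts = Counter()
        with open(dump_path, encoding="utf-8") as dump:
            for line in dump:
                try:
                    message = json.loads(line)
                except json.JSONDecodeError:
                    continue                      # skip malformed records
                if message.get("lang") != "en":
                    continue                      # keep English messages only
                counts.update(WORD_RE.findall(message.get("text", "").lower()))
        write_freq_file(out_path, sorted(counts.items()))

    def second_type_task(path_a, path_b):
        """Merge two ordered word/frequency files and split the result in half."""
        merged = Counter(dict(read_freq_file(path_a)))
        merged.update(dict(read_freq_file(path_b)))
        entries = sorted(merged.items())          # one alphabetically ordered list
        middle = len(entries) // 2
        write_freq_file(path_a, entries[:middle]) # first half replaces the first file
        write_freq_file(path_b, entries[middle:]) # second half replaces the second file

    def read_freq_file(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, count = line.split()
                yield word, int(count)

    def write_freq_file(path, entries):
        with open(path, "w", encoding="utf-8") as f:
            for word, count in entries:
                f.write(f"{word} {count}\n")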
It is easy to notice that the periodic application of the second-type task to the individually ordered files (built by tasks of the first type) will order the entire file array. For example, if the total number of files in the array is N, then the procedure

    for all i from 1 to N-1 execute
        for all j from 0 to i-1 perform
            the second-type task for file j and file i
    end of i loop

orders the file array. The work of the control part of the application consists in the parallel execution of this procedure. In the experiments we applied a small optimization to reduce the diameter of the task dependency graph and speed up the calculations.
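A direct sequential transcription of this procedure in Python, reusing the hypothetical second_type_task helper from the sketch above, could look like this (the file names in the usage line are again illustrative):

    def order_file_array(files):
        # Sequential form of the procedure above; in the real application each
        # call becomes a second-type task submitted to the Everest platform.
        n = len(files)
        for i in range(1, n):
            for j in range(i):
                second_type_task(files[j], files[i])   # pairwise merge and split

    # usage example: order_file_array([f"freq_{k}.txt" for k in range(10)])

In total the procedure performs N(N-1)/2 pairwise merges.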




Fig. 1. Application deployment and execution diagram.

Parallelization of the task invocation procedure can be performed as explained in [8]. The idea is to simulate the behavior of the computing part of the application on the JupyterLab virtual machine using a master-worker actor system. The master actor decomposes the problem into tasks of the first and second type and distributes the tasks to workers. The worker actors accept tasks, pass completion messages back to the master, and request more work. However, instead of performing the tasks on the JupyterLab machine, the worker actors send the tasks for the initial processing of Twitter dump files (tasks of the first type) and for the pairwise consolidation of the resulting files (tasks of the second type) to the Everest server. The actor system was implemented in the C++ programming language using the Templet parallel computing system [9].
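The following Python sketch conveys the master-worker idea in a simplified form. It is not the actor system of the study (which is written in C++ on top of Templet [9]); submit_to_everest and release_dependents are hypothetical placeholders for handing a task over to the remote platform and for the master logic that releases tasks whose dependencies are satisfied.

    import queue
    import threading

    def run_master_worker(initial_tasks, release_dependents, submit_to_everest,
                          num_workers=4):
        work = queue.Queue()
        for task in initial_tasks:
            work.put(task)

        def worker():
            while True:
                task = work.get()
                if task is None:                 # stop signal
                    break
                submit_to_everest(task)          # remote execution of the task
                for successor in release_dependents(task):
                    work.put(successor)          # master logic: release ready successors
                work.task_done()

        workers = [threading.Thread(target=worker) for _ in range(num_workers)]
        for w in workers:
            w.start()
        work.join()                              # all released tasks have completed
        for _ in workers:
            work.put(None)
        for w in workers:
            w.join()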
                                                                              Step 3. Running Windows 7 virtual machines in the
The optimization is based on the observation that the task dependency graph for the parallel version of the procedure is asymmetric. Although we cannot change the total number of vertices in the task graph (it is equal to N(N-1)/2), we can make the graph symmetrical with a smaller diameter. To do this, we use a recursive procedure for listing the pairwise merging tasks instead of the original iterative procedure. Detailed descriptions of the parallel data processing algorithm and its optimization can be found in [10].
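To illustrate the idea of grouping the merging tasks into short parallel rounds, the sketch below enumerates all N(N-1)/2 pairs with the classical circle method for round-robin scheduling; the pairs inside one round touch disjoint files and can therefore run concurrently. This is only an illustration of a more symmetric task graph with a smaller diameter; the exact recursive enumeration used in the study is described in [10].

    def round_robin_rounds(n):
        # Circle-method round-robin schedule (a sketch): for n files it yields
        # n - 1 rounds; the pairs inside one round touch disjoint files and can
        # therefore be merged in parallel. In total n*(n-1)/2 pairs are listed,
        # the same number of second-type tasks as in the iterative procedure.
        assert n % 2 == 0, "the sketch assumes an even number of files"
        circle = list(range(n))
        rounds = []
        for _ in range(n - 1):
            rounds.append([(min(circle[k], circle[n - 1 - k]),
                            max(circle[k], circle[n - 1 - k]))
                           for k in range(n // 2)])
            circle = [circle[0]] + [circle[-1]] + circle[1:-1]   # rotate all but position 0
        return rounds

    # round_robin_rounds(4) == [[(0, 3), (1, 2)], [(0, 2), (1, 3)], [(0, 1), (2, 3)]]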
III. DEPLOYMENT AND OPERATION OF THE FREQUENCY ANALYSIS APPLICATION

The main feature of the proposed distributed application architecture is its optimization for computing in a hybrid cloud environment. Such a computing environment is built on the basis of free computing resources available in public and academic cloud systems. In addition, the application is fully deployed and launched through a web browser. Let us take a look at the steps to deploy and run the application shown in Fig. 1. The steps explain the implementation of the main architectural features.

Step 1. Registration of the computing resources of the application on the Everest platform. Access tokens for the agent programs are obtained through the web interface of the Everest platform.

Step 2. Installation of the application components for the primary processing of Twitter dump files (the task of the first type) and the pairwise merging of the resulting files (the task of the second type). This installation is performed through the web interface of the Everest platform.

Step 3. Running Windows 7 virtual machines in the corporate cloud of Samara University and installing agent programs on them using the access tokens obtained in Step 1. The activity of the agent programs is verified through the web interface of the Everest platform.

Step 4. Uploading the data set, as Twitter text files in JSON format, to a file server in the corporate cloud of Samara University. This upload can be performed through one of the virtual machines started in Step 3.

Step 5. Launching the control part of the application (the so-called application orchestrator) from the GitHub code repository via the web interface.

Step 6. Automatic access to the Binder service (after completing Step 5) to build a docker container with the application orchestrator running in the JupyterLab environment.

Step 7. Deployment of the docker container (from Step 6) in the Google Cloud. A link to the web interface of the
application orchestrator is returned to the web terminal of the application user.

Step 8. Launching the application orchestrator by the user via the web interface obtained in Step 7. Automatic data processing starts.

Step 9. The application orchestrator sends commands to launch the next tasks to the Everest platform server and polls the status of the previously launched tasks (a sketch of this submit-and-poll loop is given after Step 10).

Step 10. The Everest platform server distributes the tasks for execution to free virtual machines through the resource agent programs installed in Step 3. The calculation ends when N tasks of the first type and N(N-1)/2 tasks of the second type have been started and completed, where N is the number of files in the information array.
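The following Python sketch illustrates the submit-and-poll pattern of Step 9. It is not the Everest client API: submit_task, get_task_state, and release_dependents are hypothetical placeholders standing for whatever client calls and dependency bookkeeping the orchestrator actually uses.

    import time

    def run_all(initial_tasks, release_dependents, submit_task, get_task_state,
                poll_interval=5.0):
        # Submit the initially ready tasks, then poll their states; whenever a
        # task finishes, submit the tasks that became ready because of it.
        running = {submit_task(task): task for task in initial_tasks}
        while running:
            time.sleep(poll_interval)
            for task_id, task in list(running.items()):
                state = get_task_state(task_id)
                if state == "DONE":
                    del running[task_id]
                    for successor in release_dependents(task):
                        running[submit_task(successor)] = successor
                elif state == "FAILED":
                    raise RuntimeError(f"task {task_id} failed")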
Note that if a series of experiments to determine word frequencies is performed, the computing part of the application (Steps 1-4) is set up only once. For the second and subsequent manual launches, only Step 8 is performed.

The initial launch of JupyterLab is fully automatic. The sequence of steps from Step 5 to Step 7 is initiated by the user of the application by pressing a special button in the GitHub user interface. All basic actions, including resolving the application dependencies, building the image in the form of a Docker container, and searching for a free virtual machine in public clouds (currently Google Cloud, OVH, GESIS Notebooks and the Turing Institute clouds), are fully automated by the MyBinder platform. The developer only needs to follow the special format for the git repository of the application orchestrator.
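For reference, this "special format" amounts to keeping the orchestrator and its dependency specification in a public git repository that Binder's build system can recognize. A minimal layout might look as follows; the file names are assumptions for illustration, not the actual repository of the study.

    orchestrator-repo/
        orchestrator.ipynb    # notebook with the application orchestrator (hypothetical name)
        requirements.txt      # Python dependencies installed into the Docker image
        README.md             # launch button: a link of the form
                              # https://mybinder.org/v2/gh/<user>/<repo>/<branch>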
A restriction of the method of deploying the computing part of the application in the test implementation (Steps 1-4) is the need for access to a shared file system from all the virtual machines deployed in Step 3. However, it is technically possible to transfer files directly from the application orchestrator through the Everest server to the worker virtual machines. One can also use the IPFS distributed file system client program for file transfer [11]. These methods were not considered in this study.
IV. PERFORMANCE MEASUREMENT OF THE FREQUENCY ANALYSIS APPLICATION

The purpose of the experiments was to verify the functionality of the application and to confirm the effectiveness of calculations using the proposed distributed application architecture. The architecture is considered effective if the calculations are completed in a reasonable time and if the use of several virtual machines in the computing part of the application leads to faster calculations. The results of the experiments are shown in Table 1.

For the experiments, we used a fragment of a 5.88 GB data array collected in [5]. The array consisted of 10 files. The size of the input JSON files ranged from 524 MB to 849 MB. The result of processing was stored in an array of 10 text files with a total size of 1.83 MB. 148885 words were found (including word forms and neologisms). The size of the resulting files ranged from 158 KB to 223 KB. The resulting files consisted of entries pairing each word with its frequency. The entries were sorted across the entire set of 10 files (words starting with 'a' were stored in the first file, and words starting with 'z' in the last file).

TABLE 1. THE SPEEDUP OF DATA PROCESSING IN A HYBRID CLOUD COMPUTING ENVIRONMENT
(each cell lists the three measured runs)

Number of    Sequential processing        Parallel processing          Speedup estimated by
files, N     time, s                      time, s                      the three measurements
    2        79.226,  68.4562, 79.0123    53.7776, 59.9446, 63.2141    1.29
    3        145.436, 114.838, 140.661    85.1115, 94.4716, 86.5152    1.54
    4        213.356, 183.547, 212.875    106.638, 110.084, 104.538    1.90
    5        304.123, 245.916, 347.438    140.8,   140.235, 140.065    2.14
    6        468.526, 330.048, 458.023    171.275, 175.948, 180.496    2.38
    7        562.377, 409.327, 599.395    198.964, 192.46,  185.208    2.73
    8        670.223, 484.539, 586.413    204.349, 220.411, 223.25     2.69
    9        736.98,  763.388, 845.347    247.328, 239.363, 246.543    3.20
   10        889.71,  936.003, 983.331    273.798, 276.348, 274.694    3.40

We carried out a series of experiments in which the number of processed files and the number of virtual machines running the computing part of the application varied from 2 to 10. The observed time for completing tasks of the first type in all experiments varied from about 24 to 36 seconds, and the time for completing tasks of the second type ranged from about 6 to 26 seconds. The maximum speedup of 3.4x (relative to the version with one virtual machine in the computing part of the application) was achieved when processing 10 files on
10 virtual machines. The sequential processing of the data array on one virtual machine took 983 seconds in the worst case (about 16 minutes). The parallel processing using 10 virtual machines took about 270 seconds (about 4.5 minutes). The absolute reduction in processing time was approximately 11.5 minutes. Thus, the studied distributed application architecture can be considered effective.

V. CONCLUSION

The paper proposes the architecture of a distributed data processing application for computing in a hybrid cloud environment built as a combination of free and academic cloud services. In a computational experiment to determine the frequency of words in Twitter messages, the possibility of speeding up the calculations was demonstrated, which confirms the effectiveness of the proposed architecture.

The practical advantage of the architecture is the possibility of high-performance data processing using only a web browser. This feature is well suited for demonstrations, training, and collaborative research.

REFERENCES
[1] B. Granger, C. Colbert and I. Rose, "JupyterLab: The next generation jupyter frontend," JupyterCon, 2017.
[2] P. Jupyter, M. Bussonnier, J. Forde, J. Freeman, B. Granger, T. Head, C. Holdgraf, K. Kelley, G. Nalvarte, A. Osheroff, M. Pacer, Y. Panda, F. Perez, B. Ragan-Kelley and C. Willing, "Binder 2.0 - Reproducible, interactive, sharable environments for science at scale," Proceedings of the 17th Python in Science Conference, pp. 113-120, 2018.
[3] D. Merkel, "Docker: lightweight Linux containers for consistent development and deployment," Linux Journal, no. 239, 2014.
[4] O. Sukhoroslov, S. Volkov and A. Afanasiev, "A Web-Based Platform for Publication and Distributed Execution of Computing Applications," 14th International Symposium on Parallel and Distributed Computing, Limassol, pp. 175-184, 2015.
[5] D.A. Vorobiev and V.G. Litvinov, "Automated system for forecasting the behavior of the foreign exchange market using analysis of the emotional coloring of messages in social networks," Advanced Information Technologies (PIT), pp. 416-419, 2018.
[6] I.A. Rycarev, D.V. Kirsh and A.V. Kupriyanov, "Clustering of media content from social networks using BigData technology," Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[7] P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar, K. Hasanov, P. Gschwandtner, P. Lemarinier, S. Markidis, H. Jordan, T. Fahringer, K. Katrinis, E. Laure and D.S. Nikolopoulos, "A taxonomy of task-based parallel programming technologies for high-performance computing," The Journal of Supercomputing, vol. 74, no. 4, pp. 1422-1434, 2018.
[8] S.V. Vostokin, O.V. Sukhoroslov, I.V. Bobyleva and S.N. Popov, "Implementing computations with dynamic task dependencies in the desktop grid environment using Everest and Templet Web," CEUR Workshop Proceedings, vol. 2267, pp. 271-275, 2018.
[9] S.V. Vostokin, "The Templet parallel computing system: Specification, implementation, applications," Procedia Engineering, vol. 201, pp. 684-689, 2017.
[10] S.V. Vostokin and I.V. Bobyleva, "Asynchronous round-robin tournament algorithms for many-task data processing applications," International Journal of Open Information Technologies, vol. 8, no. 4, pp. 45-53, 2020.
[11] J. Benet, "IPFS - content addressed, versioned, P2P file system," arXiv preprint arXiv:1407.3561, 2014.



