Implementation of frequency analysis of Twitter microblogging in a hybrid cloud based on the Binder, Everest platform and the Samara University virtual desktop service

Sergey Vostokin
Samara National Research University
Samara, Russia
easts@mail.ru

Irina Bobyleva
Samara National Research University; Joint Stock Company Space Rocket Center Progress
Samara, Russia
ikazakova90@gmail.com

Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Abstract—The paper proposes the architecture of a distributed data processing application in a hybrid cloud environment. The application was studied using a distributed algorithm for determining the frequency of words in messages of English-language microblogs published on Twitter. The possibility of aggregating computing resources using many-task computing technology for data processing in hybrid cloud environments is shown. The architecture has proven to be technically simple and efficient in terms of performance.

Keywords—parallel computing model, algorithmic skeleton, parallel algorithm, actor model, workflow, many-task computing

I. INTRODUCTION

Advances in artificial intelligence and data processing technologies have contributed to the intensive development of tools for building scientific applications. One of the popular tools in this area is the JupyterLab integrated development environment [1]. An important feature of JupyterLab is the ability to perform calculations remotely on a specially configured computer. The user can interact with the JupyterLab environment from any other computer over the Internet in a web browser. Cloud services, in particular the Binder service [2] used in this study, magnify this feature. The Binder service automatically prepares the necessary configuration of the application with the JupyterLab environment as a Docker image [3] and runs the application on a free virtual machine, for example, in the Google Cloud. Thus, all control over the application is carried out entirely through a web browser, so no specially configured physical computer is needed.

This method of working with an application is convenient for demonstration purposes, teaching students, solving computationally simple tasks, and collaborative research. However, this ease of use comes with limited computing resources. Obviously, when deploying the application in free computing environments, the user should expect a minimal quota for memory and processor time. The user is also restricted in networking and other rights on the free virtual machine. These limitations make it difficult or even impossible to organize high-performance data processing in the way that is standard for JupyterLab.

The article discusses an approach to overcoming these limitations. The essence of the approach is that the calculations are not performed on the machine where JupyterLab is installed. Instead, only the control part of the many-task application is installed together with JupyterLab. The control part of the application communicates with the auxiliary cloud platform Everest [4] and uses its computing part deployed to other cloud systems. The computing part includes the required number of virtual machines. This solves the problem of obtaining the necessary processor resources and memory.

In this research we (1) propose the distributed application architecture and deployment method, and (2) study the possible speedup of calculations for the given application while solving an actual problem from the field of data processing.

II. METHOD FOR FREQUENCY ANALYSIS IN A HYBRID CLOUD ENVIRONMENT

As a model problem for assessing the effectiveness of the proposed distributed application architecture, the problem of calculating the frequency of words in a text array was chosen. The source text array for analysis was taken from a weekly dump of all messages transmitted through the Twitter microblogging service. We chose the frequency analysis problem because it has many practical applications, for example, market analysis [5] or clustering [6]. On the other hand, this problem is important in our study for assessing the data volume that can be processed in a reasonable time. It is also important to estimate the typical processing and sending times of text data over the network, which affect the efficiency of the calculations.

The specifics of computing in a hybrid cloud are that application components are limited in how they can interact: (1) they communicate not directly, but only through some intermediate service; (2) they are connected by communication channels with moderate bandwidth and/or latency. These features require organizing data processing in a special way.

An adequate model for organizing computations in a hybrid cloud is the task model of computations [7]. According to the task model, calculations are understood as the periodic launch of non-interacting tasks. The reason for replenishing the set of tasks launched at a certain moment of observation may be the completion of another task.

Our implementation of the frequency analysis algorithm is based on launching two types of tasks from the control part of the application. A task of the first type receives a Twitter dump in the form of a JSON file, extracts English text messages from it, splits them into words, and forms an alphabetically ordered list of words with their local frequencies for the given dump. This list is stored in a file. A task of the second type uses two files built by tasks of the first type. It combines the files into one ordered list of words and frequencies, then splits this list in half. Further, the content of the first file is replaced by the first half, and the content of the
second file is replaced by the second half of the ordered list. Note that the common method of merging the files into a single file would also solve the frequency analysis problem, but the size of the resulting files would constantly increase during processing. Partitioning the files is necessary because the tasks should operate on files that are not large, in order to facilitate file transfer over the network.
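For illustration, the two task types can be sketched in Python as follows. This is only a minimal sketch, not the code used in the study; in particular, the assumption that the dump stores one JSON object per line with 'lang' and 'text' fields, as well as the helper and file names, are introduced here for illustration only.

    import json
    import re
    from collections import Counter

    WORD_RE = re.compile(r"[a-z']+")

    def first_type_task(dump_path, out_path):
        """Count the words of English messages found in one Twitter dump file."""
        counts = Counter()
        with open(dump_path, encoding="utf-8") as dump:
            for line in dump:
                try:
                    message = json.loads(line)
                except json.JSONDecodeError:
                    continue                      # skip malformed records
                if message.get("lang") != "en":
                    continue                      # keep English messages only
                counts.update(WORD_RE.findall(message.get("text", "").lower()))
        write_freq_file(out_path, sorted(counts.items()))

    def second_type_task(path_a, path_b):
        """Merge two ordered word/frequency files and split the result in half."""
        merged = Counter(dict(read_freq_file(path_a)))
        merged.update(dict(read_freq_file(path_b)))
        entries = sorted(merged.items())          # one alphabetically ordered list
        middle = len(entries) // 2
        write_freq_file(path_a, entries[:middle]) # first half replaces the first file
        write_freq_file(path_b, entries[middle:]) # second half replaces the second file

    def read_freq_file(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, count = line.split()
                yield word, int(count)

    def write_freq_file(path, entries):
        with open(path, "w", encoding="utf-8") as f:
            for word, count in entries:
                f.write(f"{word} {count}\n")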
It is easy to notice that the periodic application of the second-type task to the individually ordered files (built by tasks of the first type) will order the entire file array. For example, if the total number of files in the array is N, then the procedure

    for all i from 1 to N-1 execute
        for all j from 0 to i-1 perform
            the second-type task for file j and file i
    end of i loop

orders the file array. The work of the control part of the application consists in the parallel execution of this procedure. In the experiments we applied a small optimization to reduce the diameter of the task dependency graph and speed up the calculations.
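A direct sequential transcription of this procedure in Python, reusing the hypothetical second_type_task helper from the sketch above, could look like this (the file names in the usage line are again illustrative):

    def order_file_array(files):
        # Sequential form of the procedure above; in the real application each
        # call becomes a second-type task submitted to the Everest platform.
        n = len(files)
        for i in range(1, n):
            for j in range(i):
                second_type_task(files[j], files[i])   # pairwise merge and split

    # usage example: order_file_array([f"freq_{k}.txt" for k in range(10)])

In total the procedure performs N(N-1)/2 pairwise merges.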




Fig. 1. Application deployment and execution diagram.

Parallelization of the task invocation procedure can be performed as explained in [8]. The idea is to simulate the behavior of the computing part of the application on the JupyterLab virtual machine using a master-worker actor system. The master actor decomposes the problem into tasks of the first and second type and distributes the tasks to workers. The worker actors accept tasks, pass completion messages back to the master, and request more work. However, instead of performing the tasks on the JupyterLab machine, the worker actors send the tasks for the initial processing of Twitter dump files (tasks of the first type) and for the pairwise consolidation of the resulting files (tasks of the second type) to the Everest server. The actor system was implemented in the C++ programming language using the Templet parallel computing system [9].
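The following Python sketch conveys the master-worker idea in a simplified form. It is not the actor system of the study (which is written in C++ on top of Templet [9]); submit_to_everest and release_dependents are hypothetical placeholders for handing a task over to the remote platform and for the master logic that releases tasks whose dependencies are satisfied.

    import queue
    import threading

    def run_master_worker(initial_tasks, release_dependents, submit_to_everest,
                          num_workers=4):
        work = queue.Queue()
        for task in initial_tasks:
            work.put(task)

        def worker():
            while True:
                task = work.get()
                if task is None:                 # stop signal
                    break
                submit_to_everest(task)          # remote execution of the task
                for successor in release_dependents(task):
                    work.put(successor)          # master logic: release ready successors
                work.task_done()

        workers = [threading.Thread(target=worker) for _ in range(num_workers)]
        for w in workers:
            w.start()
        work.join()                              # all released tasks have completed
        for _ in workers:
            work.put(None)
        for w in workers:
            w.join()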
                                                                              Step 3. Running Windows 7 virtual machines in the
The optimization is based on the observation that the task dependency graph for the parallel version of the procedure is asymmetric. Although we cannot change the total number of vertices in the task graph (it is equal to N(N-1)/2), we can make the graph symmetrical with a smaller diameter. To do this, we use a recursive procedure for listing the pairwise merging tasks instead of the original iterative procedure. Detailed descriptions of the parallel data processing algorithm and its optimization can be found in [10].
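To illustrate the idea of grouping the merging tasks into short parallel rounds, the sketch below enumerates all N(N-1)/2 pairs with the classical circle method for round-robin scheduling; the pairs inside one round touch disjoint files and can therefore run concurrently. This is only an illustration of a more symmetric task graph with a smaller diameter; the exact recursive enumeration used in the study is described in [10].

    def round_robin_rounds(n):
        # Circle-method round-robin schedule (a sketch): for n files it yields
        # n - 1 rounds; the pairs inside one round touch disjoint files and can
        # therefore be merged in parallel. In total n*(n-1)/2 pairs are listed,
        # the same number of second-type tasks as in the iterative procedure.
        assert n % 2 == 0, "the sketch assumes an even number of files"
        circle = list(range(n))
        rounds = []
        for _ in range(n - 1):
            rounds.append([(min(circle[k], circle[n - 1 - k]),
                            max(circle[k], circle[n - 1 - k]))
                           for k in range(n // 2)])
            circle = [circle[0]] + [circle[-1]] + circle[1:-1]   # rotate all but position 0
        return rounds

    # round_robin_rounds(4) == [[(0, 3), (1, 2)], [(0, 2), (1, 3)], [(0, 1), (2, 3)]]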
III. DEPLOYMENT AND OPERATION OF THE FREQUENCY ANALYSIS APPLICATION

The main feature of the proposed distributed application architecture is its optimization for computing in a hybrid cloud environment. Such a computing environment is built on the basis of free computing resources available in public and academic cloud systems. In addition, the application is fully deployed and launched through a web browser. Let us take a look at the steps to deploy and run the application shown in Fig. 1. The steps explain the implementation of the main architectural features.

Step 1. Registration of the computing resources of the application on the Everest platform. Access tokens for the agent programs are obtained through the web interface of the Everest platform.

Step 2. Installation of the application components for the primary processing of Twitter dump files (the task of the first type) and the pairwise merging of the resulting files (the task of the second type). This installation is performed through the web interface of the Everest platform.

Step 3. Running Windows 7 virtual machines in the corporate cloud of Samara University and installing agent programs on them using the access tokens obtained in Step 1. The activity of the agent programs is verified through the web interface of the Everest platform.

Step 4. Uploading the data set, as Twitter text files in JSON format, to a file server in the corporate cloud of Samara University. This upload can be performed through one of the virtual machines started in Step 3.

Step 5. Launching the control part of the application (the so-called application orchestrator) from the GitHub code repository via the web interface.

Step 6. Automatic access to the Binder service (after completing Step 5) to build a docker container with the application orchestrator running in the JupyterLab environment.

Step 7. Deployment of the docker container (from Step 6) in the Google Cloud. A link to the web interface of the
application orchestrator is returned to the web terminal of the application user.

Step 8. Launching the application orchestrator by the user via the web interface obtained in Step 7. Automatic data processing starts.

Step 9. The application orchestrator sends commands to launch the next tasks to the Everest platform server and polls the status of the previously launched tasks (a sketch of this submit-and-poll loop is given after Step 10).

Step 10. The Everest platform server distributes the tasks for execution to free virtual machines through the resource agent programs installed in Step 3. The calculation ends when N tasks of the first type and N(N-1)/2 tasks of the second type have been started and completed, where N is the number of files in the information array.
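The following Python sketch illustrates the submit-and-poll pattern of Step 9. It is not the Everest client API: submit_task, get_task_state, and release_dependents are hypothetical placeholders standing for whatever client calls and dependency bookkeeping the orchestrator actually uses.

    import time

    def run_all(initial_tasks, release_dependents, submit_task, get_task_state,
                poll_interval=5.0):
        # Submit the initially ready tasks, then poll their states; whenever a
        # task finishes, submit the tasks that became ready because of it.
        running = {submit_task(task): task for task in initial_tasks}
        while running:
            time.sleep(poll_interval)
            for task_id, task in list(running.items()):
                state = get_task_state(task_id)
                if state == "DONE":
                    del running[task_id]
                    for successor in release_dependents(task):
                        running[submit_task(successor)] = successor
                elif state == "FAILED":
                    raise RuntimeError(f"task {task_id} failed")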
Note that if a series of experiments to determine word frequencies is performed, the computing part of the application (Steps 1-4) is set up only once. For the second and subsequent manual launches, only Step 8 is performed.

The initial launch of JupyterLab is fully automatic. The sequence of steps from Step 5 to Step 7 is initiated by the user of the application by pressing a special button in the GitHub user interface. All basic actions, including resolving the application dependencies, building the image in the form of a Docker container, and searching for a free virtual machine in public clouds (currently Google Cloud, OVH, GESIS Notebooks and the Turing Institute clouds), are fully automated by the MyBinder platform. The developer only needs to follow the special format for the git repository of the application orchestrator.
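For reference, this "special format" amounts to keeping the orchestrator and its dependency specification in a public git repository that Binder's build system can recognize. A minimal layout might look as follows; the file names are assumptions for illustration, not the actual repository of the study.

    orchestrator-repo/
        orchestrator.ipynb    # notebook with the application orchestrator (hypothetical name)
        requirements.txt      # Python dependencies installed into the Docker image
        README.md             # launch button: a link of the form
                              # https://mybinder.org/v2/gh/<user>/<repo>/<branch>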
A restriction of the method of deploying the computing part of the application in the test implementation (Steps 1-4) is the need for access to a shared file system from all the virtual machines deployed in Step 3. However, it is technically possible to transfer files directly from the application orchestrator through the Everest server to the worker virtual machines. One can also use the IPFS distributed file system client program for file transfer [11]. These methods were not considered in this study.
IV. PERFORMANCE MEASUREMENT OF THE FREQUENCY ANALYSIS APPLICATION

The purpose of the experiments was to verify the functionality of the application and to confirm the effectiveness of calculations using the proposed distributed application architecture. The architecture is considered effective if the calculations are completed in a reasonable time and if the use of several virtual machines in the computing part of the application leads to faster calculations. The results of the experiments are shown in Table 1.

For the experiments, we used a fragment of a 5.88 GB data array collected in [5]. The array consisted of 10 files. The size of the input JSON files ranged from 524 MB to 849 MB. The result of processing was stored in an array of 10 text files with a total size of 1.83 MB. 148885 words were found (including word forms and neologisms). The size of the resulting files ranged from 158 KB to 223 KB. The resulting files consisted of entries pairing each word with its frequency. The entries were sorted across the entire set of 10 files (words starting with 'a' were stored in the first file, and words starting with 'z' in the last file).

TABLE 1. THE SPEEDUP OF DATA PROCESSING IN A HYBRID CLOUD COMPUTING ENVIRONMENT
(each cell lists the three measured runs)

Number of    Sequential processing        Parallel processing          Speedup estimated by
files, N     time, s                      time, s                      the three measurements
    2        79.226,  68.4562, 79.0123    53.7776, 59.9446, 63.2141    1.29
    3        145.436, 114.838, 140.661    85.1115, 94.4716, 86.5152    1.54
    4        213.356, 183.547, 212.875    106.638, 110.084, 104.538    1.90
    5        304.123, 245.916, 347.438    140.8,   140.235, 140.065    2.14
    6        468.526, 330.048, 458.023    171.275, 175.948, 180.496    2.38
    7        562.377, 409.327, 599.395    198.964, 192.46,  185.208    2.73
    8        670.223, 484.539, 586.413    204.349, 220.411, 223.25     2.69
    9        736.98,  763.388, 845.347    247.328, 239.363, 246.543    3.20
   10        889.71,  936.003, 983.331    273.798, 276.348, 274.694    3.40

We carried out a series of experiments in which the number of processed files and the number of virtual machines running the computing part of the application varied from 2 to 10. The observed time for completing tasks of the first type in all experiments varied from about 24 to 36 seconds, and the time for completing tasks of the second type ranged from about 6 to 26 seconds. The maximum speedup of 3.4x (relative to the version with one virtual machine in the computing part of the application) was achieved when processing 10 files on
10 virtual machines. The sequential processing of the data array on one virtual machine took 983 seconds in the worst case (about 16 minutes). The parallel processing using 10 virtual machines took about 270 seconds (about 4.5 minutes). The absolute reduction in processing time was approximately 11.5 minutes. Thus, the studied distributed application architecture can be considered effective.

V. CONCLUSION

The paper proposes the architecture of a distributed data processing application for computing in a hybrid cloud environment built as a combination of free and academic cloud services. In a computational experiment to determine the frequency of words in Twitter messages, the possibility of speeding up the calculations was demonstrated, which confirms the effectiveness of the proposed architecture.

The practical advantage of the architecture is the possibility of high-performance data processing using only a web browser. This feature is well suited for demonstrations, training, and collaborative research.

REFERENCES
[1] B. Granger, C. Colbert and I. Rose, "JupyterLab: The next generation jupyter frontend," JupyterCon, 2017.
[2] P. Jupyter, M. Bussonnier, J. Forde, J. Freeman, B. Granger, T. Head, C. Holdgraf, K. Kelley, G. Nalvarte, A. Osheroff, M. Pacer, Y. Panda, F. Perez, B. Ragan-Kelley and C. Willing, "Binder 2.0 - Reproducible, interactive, sharable environments for science at scale," Proceedings of the 17th Python in Science Conference, pp. 113-120, 2018.
[3] D. Merkel, "Docker: lightweight Linux containers for consistent development and deployment," Linux Journal, no. 239, 2014.
[4] O. Sukhoroslov, S. Volkov and A. Afanasiev, "A Web-Based Platform for Publication and Distributed Execution of Computing Applications," 14th International Symposium on Parallel and Distributed Computing, Limassol, pp. 175-184, 2015.
[5] D.A. Vorobiev and V.G. Litvinov, "Automated system for forecasting the behavior of the foreign exchange market using analysis of the emotional coloring of messages in social networks," Advanced Information Technologies (PIT), pp. 416-419, 2018.
[6] I.A. Rycarev, D.V. Kirsh and A.V. Kupriyanov, "Clustering of media content from social networks using BigData technology," Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[7] P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar, K. Hasanov, P. Gschwandtner, P. Lemarinier, S. Markidis, H. Jordan, T. Fahringer, K. Katrinis, E. Laure and D.S. Nikolopoulos, "A taxonomy of task-based parallel programming technologies for high-performance computing," The Journal of Supercomputing, vol. 74, no. 4, pp. 1422-1434, 2018.
[8] S.V. Vostokin, O.V. Sukhoroslov, I.V. Bobyleva and S.N. Popov, "Implementing computations with dynamic task dependencies in the desktop grid environment using Everest and Templet Web," CEUR Workshop Proceedings, vol. 2267, pp. 271-275, 2018.
[9] S.V. Vostokin, "The Templet parallel computing system: Specification, implementation, applications," Procedia Engineering, vol. 201, pp. 684-689, 2017.
[10] S.V. Vostokin and I.V. Bobyleva, "Asynchronous round-robin tournament algorithms for many-task data processing applications," International Journal of Open Information Technologies, vol. 8, no. 4, pp. 45-53, 2020.
[11] J. Benet, "IPFS - content addressed, versioned, P2P file system," arXiv preprint arXiv:1407.3561, 2014.



