<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Implementation of frequency analysis of Twitter microblogging in a hybrid cloud based on the Binder, Everest platform and the Samara University virtual desktop service</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergey Vostokin</string-name>
          <email>easts@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina Bobyleva</string-name>
          <email>ikazakova90@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Samara National Research University; Joint Stock Company Space Rocket Center Progress</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>162</fpage>
      <lpage>165</lpage>
      <abstract>
        <p>The paper proposes the architecture of a distributed data processing application in a hybrid cloud environment. The application was studied using a distributed algorithm for determining the frequency of words in messages of English-language microblogs published on Twitter. The possibility of aggregating computing resources using many-task computing technology for data processing in hybrid cloud environments is shown. This architecture has proven to be technically simple and efficient in terms of performance.</p>
      </abstract>
      <kwd-group>
        <kwd>parallel computing model</kwd>
        <kwd>algorithmic skeleton</kwd>
        <kwd>parallel algorithm</kwd>
        <kwd>actor model</kwd>
        <kwd>workflow</kwd>
        <kwd>many-task computing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Advances in artificial intelligence and data processing
technologies have contributed to the intensive development of
tools for building scientific applications. One popular
tool in this area is the JupyterLab integrated development
environment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An important functional feature of
JupyterLab is the ability to perform calculations remotely on a specially
configured computer: the user can interact with the
JupyterLab environment from any other computer over the
Internet in a web browser. Cloud services, in particular the
Binder service [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used in this study, extend this capability. The
Binder service automatically prepares the
required application configuration with the JupyterLab
environment as a Docker image [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and runs the application on
a free virtual machine, for example, in the Google Cloud.
All control over the application is thus carried out entirely
through a web browser, so there is no need for a
specially configured physical computer.
      </p>
      <p>This way of working with an application is convenient
for demonstrations, teaching students, solving
computationally simple tasks, and collaborative research.
However, with this ease of deployment, the application is
limited in computing resources. When deploying
an application in free computing environments, the user
should expect minimal quotas for memory and processor
time. The user is also restricted in networking options
and other privileges on the free virtual machine. These limitations
make it difficult or even impossible to organize
high-performance data processing in the way that is standard for
JupyterLab.</p>
      <p>
        This article discusses an approach to overcoming the
described limitations. The essence of the approach is that the
calculations are not performed on the machine where
JupyterLab is installed. Instead, only the control part of the
many-task application is installed together with JupyterLab. The
control part of the application communicates with the auxiliary
cloud platform Everest [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and uses the computing part
deployed to other cloud systems. The computing part includes
the required number of virtual machines. This solves the
problem of obtaining the necessary processor resources and
memory.
      </p>
      <p>In this research, (1) we propose the distributed application
architecture and deployment method, and (2) we study the
possible speedup of calculations for this application while
solving a practical problem from the field of data processing.</p>
      <p>II. METHOD FOR FREQUENCY ANALYSIS IN A HYBRID CLOUD ENVIRONMENT</p>
      <p>
        As a model problem for assessing the effectiveness of the
proposed distributed application architecture, the problem of
calculating the frequency of words in a text array was chosen.
The source text array for analysis was taken from a weekly
dump of all messages transmitted through the Twitter
microblogging service. We chose the frequency analysis
problem, since it has many practical applications, for example,
for market analysis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or clustering [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. On the other hand,
this problem is important in our study for assessing the data
volume that can be processed in a reasonable time. It is also
important to estimate the typical times for processing text data
and sending it over the network, which affect the efficiency of the
calculations.
      </p>
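      <p>A frequency count of this kind can be sketched in Python (the language of the JupyterLab environment hosting the orchestrator). This is a minimal, hypothetical illustration: only "text" and "lang" fields of each tweet object are assumed, the real dump format is richer, and file I/O is omitted:</p>

```python
import json
import re
from collections import Counter

def word_frequencies(json_lines):
    """Count word frequencies over English-language tweets.

    Sketch only: each input line is assumed to be a JSON object
    with "text" and "lang" fields (a simplification of the real
    Twitter dump format)."""
    counts = Counter()
    for line in json_lines:
        tweet = json.loads(line)
        if tweet.get("lang") != "en":
            continue
        # split the message into lowercase words
        counts.update(re.findall(r"[a-z']+", tweet["text"].lower()))
    return counts

sample = [
    '{"lang": "en", "text": "cloud computing in the cloud"}',
    '{"lang": "de", "text": "kein englischer Text"}',
]
freq = word_frequencies(sample)
ordered = sorted(freq.items())  # alphabetically ordered list to store
```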
      <p>A specific feature of computing in a hybrid cloud is that
application components are limited in how they can
interact: (1) they communicate not directly, but only
through an intermediate service; (2) they are connected by
communication channels with moderate bandwidth and/or
latency. These features require organizing data
processing in a special way.</p>
      <p>
        An adequate model for organizing computations in a
hybrid cloud is the task model of computations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. According
to this model, a calculation is understood
as the periodic launch of non-interacting tasks. The set of tasks
running at a given moment of observation is replenished,
for example, when another task completes.
      </p>
      <p>Our implementation of the frequency analysis algorithm is
based on launching two types of tasks from the control part of
the application. A task of the first type receives a Twitter
dump in the form of a JSON file, extracts English text
messages from it, splits them into words, and forms an
alphabetically ordered list of words with their local frequencies
for the given dump. This list is stored in a file. A task of the
second type uses two files built by tasks of the first type. It
combines the files into one ordered list of words and
frequencies, then splits this list in half. The content of the
first file is replaced by the first half, and the content of the
second file is replaced by the second half of the ordered list.
Note that the common method of merging files into a single file also
solves the frequency analysis problem, but the size of the
resulting file then grows constantly during processing.
Partitioning the files is necessary because tasks should operate on
small files in order to facilitate file transfer over the
network.</p>
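      <p>The second-type task can be sketched in Python as follows (an in-memory sketch under stated assumptions: word lists are represented as sorted (word, count) tuples, and file I/O is omitted):</p>

```python
from itertools import groupby

def merge_and_split(list_a, list_b):
    """Second-type task (sketch): merge two alphabetically
    ordered (word, count) lists, summing the counts of identical
    words, then split the combined list in half."""
    merged = sorted(list_a + list_b)
    combined = [(word, sum(c for _, c in grp))
                for word, grp in groupby(merged, key=lambda wc: wc[0])]
    half = (len(combined) + 1) // 2
    # first half replaces the first file, second half the second
    return combined[:half], combined[half:]

first, second = merge_and_split(
    [("apple", 1), ("cloud", 2)],
    [("cloud", 3), ("zebra", 1)],
)
```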
      <p>It is easy to see that periodically applying the
second-type task to the individually ordered files (built by tasks of
the first type) will order the entire file array. For example, if
the total number of files in the array is N, then the procedure
for all i from 1 to N-1:
    for all j from 0 to i-1:
        perform the second-type task for file j and file i
orders the file array. The work of the control part of the
application consists in the parallel execution of this procedure.
In the experiments we applied a small optimization to reduce
the diameter of the task dependency graph and speed up the
calculations.</p>
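      <p>The sequential form of this procedure can be illustrated in Python (an in-memory sketch with the same assumed (word, count) list representation as above; the distributed version runs the pairwise steps in parallel):</p>

```python
from itertools import groupby

def merge_and_split(a, b):
    """Merge two sorted (word, count) lists, sum equal words,
    split the result in half (sketch of the second-type task)."""
    merged = sorted(a + b)
    combined = [(w, sum(c for _, c in g))
                for w, g in groupby(merged, key=lambda wc: wc[0])]
    half = (len(combined) + 1) // 2
    return combined[:half], combined[half:]

def order_file_array(files):
    """Apply the second-type task to every pair (j, i), j before i,
    as in the procedure above; after all N(N-1)/2 steps the word
    lists are globally ordered across the whole array."""
    n = len(files)
    for i in range(1, n):
        for j in range(i):
            files[j], files[i] = merge_and_split(files[j], files[i])
    return files

files = order_file_array([
    [("zebra", 1)],
    [("apple", 2)],
    [("cloud", 1)],
])
```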
      <p>
        Parallelization of the task invocation procedure can be
performed as explained in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The idea is to
simulate the behavior of the computing part of the application
on the JupyterLab virtual machine using a master-worker
actor system. The master actor decomposes the problem into
tasks of the first and second types and distributes the tasks to
workers. The worker actors accept tasks, pass completion
messages back to the master, and request more work. But
instead of performing tasks on the JupyterLab machine, the
worker actors send the tasks for the initial processing of Twitter
dump files (tasks of the first type) and pairwise consolidation
of the resulting files (tasks of the second type) to the Everest server.
The actor system was implemented in the C++ programming
language using the Templet parallel computing system [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
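      <p>The master-worker pattern underlying this scheme can be sketched in Python with a task queue (a minimal illustration, not the actual Templet/Everest implementation; the hypothetical <bold>submit</bold> callable stands in for submitting a task to the Everest server and waiting for its completion):</p>

```python
import queue
import threading

def run_master_worker(tasks, n_workers, submit):
    """Minimal master-worker sketch: the master puts independent
    tasks on a queue; each worker takes a task, hands it to
    `submit` (a stand-in for remote task execution), and records
    the completion result."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = work.get()
            if task is None:  # sentinel: no more work
                break
            res = submit(task)
            with lock:
                results.append(res)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in tasks:       # master distributes tasks
        work.put(task)
    work.join()              # wait until all tasks are completed
    for _ in threads:        # shut the workers down
        work.put(None)
    for t in threads:
        t.join()
    return results

out = run_master_worker(list(range(5)), 3, lambda t: t * 2)
```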
      <p>
        The optimization is based on the observation that the task
dependency graph for the parallel version of the procedure is
asymmetric. Although we cannot change the total number of
vertices in the task graph (it is equal to N(N-1)/2), we can
make the graph symmetric with a smaller diameter. To do this, we
use a recursive procedure for enumerating the pairwise merging
tasks instead of the original iterative procedure. Detailed
descriptions of the parallel data processing algorithm and
its optimization can be found in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
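      <p>The idea behind such a symmetric schedule can be illustrated by the standard circle-method round-robin tournament, which arranges all N(N-1)/2 pairings into N-1 rounds of independent pairs (a sketch of the general idea only; the exact recursive procedure used in the experiments is the one described in [10]):</p>

```python
def round_robin_rounds(n):
    """Circle-method round-robin schedule for even n: all
    n*(n-1)/2 pairwise merge tasks are arranged into n-1 rounds
    of n/2 pairs, and pairs within a round share no files, so
    they can run in parallel."""
    players = list(range(n))
    rounds = []
    for _ in range(n - 1):
        pairs = [(players[i], players[n - 1 - i]) for i in range(n // 2)]
        rounds.append(pairs)
        # rotate all positions except the first
        players = [players[0]] + [players[-1]] + players[1:-1]
    return rounds

rounds = round_robin_rounds(4)
```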
      <p>III. DEPLOYMENT AND OPERATION OF THE FREQUENCY ANALYSIS APPLICATION</p>
      <p>The main feature of the proposed distributed application
architecture is its optimization for computing in a hybrid cloud
environment. Such a computing environment is built on the
basis of free computing resources available in public and
academic cloud systems. In addition, the application is fully
deployed and launched through a web browser.</p>
      <p>Let us take a look at the steps to deploy and run the application,
shown in Figure 1. The steps illustrate the implementation of
the main architectural features.</p>
      <p>Step 1. Registration of computing resources of the
application on the Everest platform. Obtaining access tokens
for agent programs through the web interface of the Everest
platform.</p>
      <p>Step 2. Installing application components for the primary
processing of Twitter dump files (a task of the first type) and
pairwise merging of the resulting files (a task of the second
type). This installation is performed through the web interface
of the Everest platform.</p>
      <p>Step 3. Running Windows 7 virtual machines in the
corporate cloud of Samara University. Installing agent
programs on them using the access tokens obtained in Step 1.
Verifying the activity of agent programs through the web
interface on the Everest platform.</p>
      <p>Step 4. Uploading the data set as Twitter text files in JSON
format to a file server in the corporate cloud of Samara
University. This upload can be performed through one of the
virtual machines that you started in Step 3.</p>
      <p>Step 5. Launching the control part of the application (the
so-called application orchestrator) from the GitHub code
repository via the web interface.</p>
      <p>Step 6. Automatic access to the Binder service (after
completing Step 5), which builds a docker container with the
application orchestrator running in the JupyterLab
environment.</p>
      <p>Step 7. Deploying the docker container (from Step 6) in the
Google Cloud. The link to the web interface of the
application orchestrator is returned to the web terminal of the
application user.</p>
      <p>Step 8. Launching the application orchestrator by the user via
the web interface obtained in Step 7. Starting automatic data
processing.</p>
      <p>Step 9. The application orchestrator sends commands to the
Everest platform server to launch the next tasks and polls the
status of previously launched tasks.</p>
      <p>Step 10. The Everest platform server distributes tasks for
execution to free virtual machines through resource agent
programs (installed in Step 3). The calculation ends when N
tasks of the first type and N(N-1)/2 tasks of the second type
are started and completed, where N is the number of files in
the information array.</p>
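      <p>The task count from Step 10 can be written as a one-line formula (a trivial helper, named here only for illustration):</p>

```python
def total_tasks(n):
    """Total tasks launched for n dump files: n first-type tasks
    plus n*(n-1)/2 second-type merge tasks."""
    return n + n * (n - 1) // 2
```

For the 10-file array used in the experiments this gives 10 + 45 = 55 tasks in total.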
      <p>Note that if a series of experiments to determine word
frequencies is performed, the computing part of the
application (Steps 1-4) is set up only once. For the second and
subsequent manual launches, only Step 8 is performed.</p>
      <p>The initial launch of JupyterLab is fully automatic. The
sequence of steps from Step 5 through Step 7
is initiated by the user of the application by pressing
a special button in the graphical interface of GitHub. All basic
actions, including resolving application dependencies,
building an image in the form of a Docker container, and searching
for a free virtual machine in public clouds (currently Google
Cloud, OVH, GESIS Notebooks, and Turing Institute clouds) are
fully automated by the MyBinder platform. The developer only
needs to follow the special format for the git repository of the
application orchestrator.</p>
      <p>
        A restriction of the method used to deploy the computing
part of the application in the test implementation (Steps 1-4)
is that all the virtual machines
deployed in Step 3 must have access to a shared file system.
However, it is technically possible to
transfer files directly from the application orchestrator through
the Everest server to the worker virtual machines. One can also
use the IPFS distributed file system client program for file
transfer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. These methods were not considered in this study.
      </p>
      <p>IV. PERFORMANCE MEASUREMENT OF THE FREQUENCY ANALYSIS APPLICATION</p>
      <p>The purpose of the experiments was to verify the
functionality of the application and to confirm the
effectiveness of calculations using the proposed distributed
application architecture. The architecture can be
considered effective if the calculations are completed in a
reasonable time and the use of several virtual machines in the
computing part of the application speeds up the calculations.
The results of the experiment are shown in Table 1.</p>
      <p>
        For the experiments, we used a fragment of a 5.88 GB data
array collected in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The array consisted of 10 files. The size
of the input JSON files ranged from 524 MB to 849 MB. The
result of processing was stored in an array of 10 text files with
a total size of 1.83 MB. 148885 words were found (including
word forms and neologisms). The size of the resulting files
ranged from 158 KB to 223 KB. The resulting files consisted
of entries in the format:
&lt;word 1&gt; &lt;total number of repetitions of word 1 in 10 files&gt; &lt;CR&gt;
&lt;word 2&gt; &lt;total number of repetitions of word 2 in 10 files&gt; &lt;CR&gt;
...
&lt;word M&gt; &lt;total number of repetitions of word M in 10 files&gt; &lt;CR&gt;.
      </p>
      <p>The entries were sorted across the entire set of 10 files
(words beginning with ‘a’ were stored in the first file, and
words beginning with ‘z’ were stored in the last file).</p>
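      <p>Reading such a result file back can be sketched in a few lines of Python (an assumption about the on-disk layout: one "word count" record per line, as the format above suggests; the exact layout is not specified in detail here):</p>

```python
def parse_result_file(text):
    """Parse a result file of '<word> <total count>' records
    into a word-to-frequency dictionary (layout assumed)."""
    freq = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            word, count = parts
            freq[word] = int(count)
    return freq

freq = parse_result_file("cloud 5\ndata 3\n")
```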
      <p>We carried out a series of experiments. The number of
processed files and the number of virtual machines on which
the computing part of the application was run varied from 2
to 10. The observed completion time for tasks of the first type
in all experiments varied from about 24 to 36 seconds, and the
completion time for tasks of the second type ranged from
about 6 to 26 seconds. The maximum speedup of 3.4x (relative
to the version with one virtual machine in the computing part
of the application) was achieved when processing 10 files on
10 virtual machines. Sequential processing of the data
array on one virtual machine took 983 seconds in the worst
case (~16 minutes). Parallel processing using 10 virtual
machines took about 270 seconds (~4.5 minutes). The
absolute reduction in processing time was approximately 11.5
minutes. Thus, the studied distributed application architecture
can be considered effective.</p>
    </sec>
    <sec id="sec-2">
      <title>V. CONCLUSION</title>
      <p>The paper proposes the architecture of a distributed data
processing application for computing in a hybrid cloud
environment built as a combination of free public and academic
cloud services. In a computational experiment to determine
the frequency of words in Twitter messages, a speedup of the
calculations was demonstrated, which confirms the
effectiveness of the proposed architecture.</p>
      <p>The practical advantage of the architecture is the
possibility of high-performance data processing using only a
web browser. This feature is well suited for demonstrations,
teaching, and collaborative research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Colbert</surname>
          </string-name>
          and
          <string-name>
            <surname>I. Rose</surname>
          </string-name>
          , “
          <article-title>JupyterLab: The next generation jupyter frontend</article-title>
          ,” JupyterCon,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jupyter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bussonnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Forde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Head</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Holdgraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kelley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Nalvarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Osheroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pacer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Panda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ragan-Kelley</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Willing</surname>
          </string-name>
          , “Binder 2.0 - Reproducible, interactive, sharable environments for science at scale,”
          <source>Proceedings of the 17th Python in Science Conference</source>
          , vol.
          <volume>113</volume>
          , pp.
          <fpage>120</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Merkel</surname>
          </string-name>
          , “
          <article-title>Docker: lightweight linux containers for consistent development and deployment</article-title>
          ,”
          <source>Linux Journal, no. 239</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sukhoroslov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Volkov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Afanasiev</surname>
          </string-name>
          , “
          <article-title>A Web-Based Platform for Publication and Distributed Execution of Computing Applications</article-title>
          ,” 14th
          <source>International Symposium on Parallel and Distributed Computing</source>
          , Limassol, pp.
          <fpage>175</fpage>
          -
          <lpage>184</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.A.</given-names>
            <surname>Vorobiev</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.G.</given-names>
            <surname>Litvinov</surname>
          </string-name>
          , “
          <article-title>Automated system for forecasting the behavior of the foreign exchange market using analysis of the emotional coloring of messages in social networks</article-title>
          ,
          <source>” Advanced Information Technologies (PIT)</source>
          , pp.
          <fpage>416</fpage>
          -
          <lpage>419</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.A.</given-names>
            <surname>Rycarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.V.</given-names>
            <surname>Kirsh</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.V.</given-names>
            <surname>Kupriyanov</surname>
          </string-name>
          , “
          <article-title>Clustering of media content from social networks using bigdata technology</article-title>
          ,
          <source>” Computer Optics</source>
          , vol.
          <volume>42</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>921</fpage>
          -
          <lpage>927</lpage>
          ,
          <year>2018</year>
          . DOI: 10.18287/2412-6179-2018-42-5-921-927.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thoman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Iakymchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Aguilar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hasanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gschwandtner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lemarinier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Markidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Katrinis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Laure</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Nikolopoulos</surname>
          </string-name>
          , “
          <article-title>A taxonomy of task-based parallel programming technologies for high-performance computing,”</article-title>
          <source>The Journal of Supercomputing</source>
          , vol.
          <volume>74</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1422</fpage>
          -
          <lpage>1434</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.V.</given-names>
            <surname>Vostokin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Sukhoroslov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.V.</given-names>
            <surname>Bobyleva</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.N.</given-names>
            <surname>Popov</surname>
          </string-name>
          , “
          <article-title>Implementing computations with dynamic task dependencies in the desktop grid environment using Everest and Templet Web,”</article-title>
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2267</volume>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>275</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.V.</given-names>
            <surname>Vostokin</surname>
          </string-name>
          , “
          <article-title>The Templet parallel computing system: Specification, implementation, applications</article-title>
          ,” Procedia Engineering, vol.
          <volume>201</volume>
          , pp.
          <fpage>684</fpage>
          -
          <lpage>689</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.V.</given-names>
            <surname>Vostokin</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.V.</given-names>
            <surname>Bobyleva</surname>
          </string-name>
          , “
          <article-title>Asynchronous round-robin tournament algorithms for many-task data processing applications</article-title>
          ,”
          <source>International Journal of Open Information Technologies</source>
          , vol.
          <volume>8</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Benet</surname>
          </string-name>
          , “
          <article-title>Ipfs-content addressed, versioned, p2p file system</article-title>
          ,
          <source>” arXiv preprint: 1407.3561</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>