Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

USING DISTRIBUTED CLOUDS FOR SCIENTIFIC COMPUTING

N.A. Kutovskiy 1, I.S. Pelevanyuk 1,3,a, D.N. Zaborov 2

1 Joint Institute for Nuclear Research, 6 Joliot-Curie St., Dubna, Moscow Region, Russia, 141980
2 Institute for Nuclear Research, Russian Academy of Sciences, 7a 60-letiya Oktyabrya Prospekt, Moscow, 117312
3 Plekhanov Russian University of Economics, 36 Stremyanny Lane, Moscow, 117997

E-mail: a pelevanyuk@jinr.ru

Nowadays, cloud resources are among the most flexible tools for providing the infrastructure on which services and applications are deployed. They are, however, also a valuable resource for scientific computing. At the Joint Institute for Nuclear Research, the computing cloud was integrated with the DIRAC system, which made it possible to submit scientific computing jobs directly to the cloud. Building on this experience, the cloud resources of several organizations from the JINR Member States were integrated in the same way. This increased the total amount of cloud resources accessible in a uniform way through DIRAC within the so-called Distributed Information and Computing Environment (DICE). Folding@Home tasks related to the SARS-CoV-2 virus were submitted to all available cloud resources. In addition to useful scientific results, this exercise provided information about the performance, limitations, strengths, and weaknesses of the combined system. Based on the experience gained, the DICE infrastructure was tuned to successfully run real user jobs performing Monte-Carlo simulation for the Baikal-GVD experiment.

Keywords: data processing, cloud computing, distributed computing, GRID applications

Nikolay Kutovskiy, Igor Pelevanyuk, Dmitry Zaborov

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The Joint Institute for Nuclear Research (JINR) is an international intergovernmental organization. It is developing as a large multidisciplinary international scientific center that combines basic research in modern nuclear physics, the development and application of high technologies, and university education in the relevant fields of knowledge. Currently, JINR has 18 Member States and 6 countries participating in JINR's activities on the basis of bilateral agreements signed at the governmental level.

The cloud infrastructure [1] deployed at JINR was created in 2013 to manage the IT services and servers of the Meshcheryakov Laboratory of Information Technologies (MLIT) more efficiently using modern technologies, to combine resources for solving common tasks, to increase the efficiency of hardware utilization and service reliability, to simplify access to application software and optimize the use of proprietary software, and to provide a modern computing facility for JINR users.

Cloud infrastructures are a flexible tool for many different tasks. One of these tasks is massive scientific computing: clouds can provide a resource capable of handling high-throughput computing workloads. JINR takes part in several experiments where extensive Monte-Carlo generation is required.
The cloud infrastructures of the JINR Member States can also be used for such workloads. The integration of the cloud resources of JINR and its Member States was performed with the help of the DIRAC Interware [2]. DIRAC is a platform that provides basic tools and methods for combining heterogeneous infrastructures and operating them for large sets of computing tasks. A special module was developed at JINR in order to integrate clouds into the system [3]. At the moment, 9 clouds from JINR and its Member States are included in the DIRAC system. Together, these distributed heterogeneous computing resources represent a valuable and powerful asset. The main features of these clouds are shown in Table 1. The major questions related to this unique resource are how to use it efficiently for scientific computing, what the limitations of the combined system are, how to keep it in an operational state, and what to do with possible failures.

Table 1. Cloud resources accessible through DIRAC at JINR

Organization | Location | CPU cores | RAM, GB
Joint Institute for Nuclear Research | Russia | 80 | 320
Plekhanov Russian University of Economics | Russia | 132 | 608
Astana branch of the Institute of Nuclear Physics | Kazakhstan | 84 | 840
Institute of Physics of the National Academy of Sciences of Azerbaijan | Azerbaijan | 16 | 96
North Ossetian State University | Russia | 84 | 672
Academy of Scientific Research & Technology - Egyptian National STI Network | Egypt | 98 | 704
Institute for Nuclear Research and Nuclear Energy | Bulgaria | 20 | 64
Sofia University “St. Kliment Ohridski” | Bulgaria | 48 | 250
Scientific Research Institute for Nuclear Problems of Belarusian State University | Belarus | 132 | 290
Total | | 614 | 3524

2. Features of the distributed cloud infrastructure

The DIRAC Interware platform is used to integrate different clouds into a unified infrastructure. For this purpose, a dedicated user account is created on each cloud. All resources provided to this account within its quota are available for creating virtual machines.

DIRAC provides a job queue to which users can submit their jobs. A job is described either using the DIRAC Python API or with the Job Description Language; the API is the most convenient way to submit jobs massively. The parameters of a job include the following information: the name of the executable on the virtual machine, the arguments passed to the executable, the list of files to be uploaded with the job (“Input Sandbox”), the list of files to be downloaded back (“Output Sandbox”), and, optionally, the list of clouds suitable for this particular job. It is not possible to include large files (more than 5 MB) in the input and output sandboxes; the general recommendation is to put only logs and configuration files in sandboxes. The standard approach is to create a shell script that describes the whole workflow of a particular job. Generally, each job described in the shell script consists of several major steps: initial configuration, input data download, processing, output data upload, and finalization. Any step can be omitted if not required. The shell script and the necessary configuration files are included in the input sandbox. The standard output of the job is automatically redirected to a file that is included in the output sandbox.
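As an illustration, a minimal job description using the DIRAC Python API could look like the sketch below. The script name, sandbox file names, and cloud site names (e.g. CLOUD.JINR.ru) are placeholders rather than the actual names used in production.

```python
# Minimal sketch of a DIRAC job description via the Python API.
# File names and site names below are illustrative placeholders.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)   # initialize the DIRAC environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("mc_simulation_example")
# Shell script describing the whole workflow of the job (configuration,
# input download, processing, output upload, finalization).
job.setExecutable("run_job.sh", arguments="job.conf")
# Small files shipped with the job; large data must go to a storage element.
job.setInputSandbox(["run_job.sh", "job.conf"])
job.setOutputSandbox(["job.log", "StdOut", "StdErr"])
# Optionally restrict the job to clouds known to be suitable for it
# (hypothetical site names).
job.setDestination(["CLOUD.JINR.ru", "CLOUD.PRUE.ru"])

dirac = Dirac()
print(dirac.submitJob(job))
```

Submitting thousands of such jobs then reduces to a loop over parameter sets, which is why the API is the preferred route for massive submissions.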
All data files should be placed on storage elements integrated with DIRAC. Currently, only systems that support the root protocol for access and VOMS (Virtual Organization Membership Service) for authentication can be used as storage elements in DIRAC. The use of VOMS ensures that a user with the correct membership in a VOMS group is able to access the storage elements from anywhere.

User jobs are not submitted directly to the clouds. When jobs suitable for the clouds appear, DIRAC creates virtual machines on one of the appropriate clouds with special instructions to be executed after the boot process. These instructions take care of the installation of the DIRAC Pilot. The DIRAC Pilot is a special process that performs basic checks of the resource it is running on and sends the results to the DIRAC Matcher service. The Matcher service chooses a job that can be completed on a resource with the reported parameters. This job and the corresponding input sandbox are downloaded by the Pilot, after which the user job starts as a child process of the Pilot. This scheme filters out resources that cannot complete a given job, for example, due to a lack of RAM or an unsuitable OS. After completing the first job, the Pilot can request another one.

3. Limitations on job execution in the cloud

The main concern during job submission is that the network connection between the cloud and the storage element is a limited resource. Even if a single job completes successfully during the testing stage, there is no guarantee that hundreds of similar jobs will behave in the same way. All these jobs need to retrieve the user software, download input files, and upload results to a storage element. This can become a bottleneck for the whole batch of jobs and increase the execution time, which in turn decreases the efficiency of CPU utilization, since jobs spend a substantial amount of time waiting for data transfers.

The CernVM File System (CVMFS) is used to distribute software across the computing resources of all participating organizations. Requested files are cached in a dedicated directory on a local CVMFS caching node, and subsequent requests for the same files are served from this local cache. The client-side CVMFS cache also helps to lower the network load when multiple similar jobs reusing the same CVMFS files are executed one after another on the same virtual machine; however, in clouds where virtual machines are frequently spawned and deleted, the benefits of such caching are less pronounced. DIRAC provides the necessary options for configuring the IP addresses of the caching CVMFS servers for different clouds.

The cloud infrastructure does not provide a standardized solution for caching input data files, so custom solutions may need to be considered. For example, some input data can be downloaded once and cached in a temporary directory on the local file system of the virtual machine; jobs requested by the DIRAC Pilot after the first job can then use the cached data. The same can be achieved by placing the input data on the CVMFS file system, but due to the limited frequency of CVMFS synchronization, this only works for very static data.
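A minimal sketch of such a custom per-VM cache is given below. It assumes that the VM local disk has enough space, that a root-protocol client such as xrdcp is available in the image, and that the Pilot runs jobs sequentially; the cache path, URL, and function name are illustrative only.

```python
import os
import subprocess

# Hypothetical cache location on the VM local disk; reused by all jobs that
# the same DIRAC Pilot runs on this virtual machine.
CACHE_DIR = "/tmp/dice-input-cache"

def get_cached_input(remote_url, file_name):
    """Download an input file once per VM and reuse it for subsequent jobs.

    Any root-protocol client available in the VM image (here xrdcp) could
    perform the transfer. No locking is done, assuming the Pilot executes
    jobs one after another.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    local_path = os.path.join(CACHE_DIR, file_name)
    if not os.path.exists(local_path):
        subprocess.run(["xrdcp", remote_url, local_path], check=True)
    return local_path

# Example with an illustrative storage element URL:
# input_file = get_cached_input("root://se.example.org//mc/input.dat", "input.dat")
```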
Mainly due to these network bandwidth limitations, clouds are not currently proposed as resources for massive data processing. On the other hand, clouds can be highly efficient at executing a wide range of CPU-intensive jobs with modest storage requirements. This is typical of Monte-Carlo simulation jobs in particle physics experiments, which normally require very little input data but are highly CPU-intensive. During Monte-Carlo generation, the only data that definitely requires a real network transfer is the output. The output data size is usually predictable, so it is possible to estimate the number of jobs that can reasonably be executed simultaneously on a single cloud, given the network bandwidth restrictions.

The described problem does not apply to every cloud. Some clouds were designed as tools for massive data processing and possess a high-bandwidth external network connection, while others were designed to host services that are less demanding in terms of network and have a very limited external bandwidth. Sometimes it is possible to negotiate a network upgrade with the cloud owners; demonstrating that their cloud can participate in computing for large scientific collaborations can be a good argument for such an upgrade. With this in mind, we prepared two successful use cases of scientific workload execution in the clouds.

4. Folding@Home jobs in the clouds

Folding@Home is a community of volunteers, researchers, and organizations that contribute their expertise and computing resources to understanding the dynamics of proteins, their functions and dysfunctions, in order to find new proteins and drugs. Anyone can install the Folding@Home client on their home computer and run scientific jobs. When SARS-CoV-2 appeared, Folding@Home created a queue devoted to the study of this virus, and many people and organizations worldwide donated their resources to this cause. We decided that Folding@Home jobs were a perfect way to demonstrate the performance of the distributed clouds, not with artificial tests but with real jobs. The main feature of Folding@Home jobs is that the input and output are small while the amount of computing work is large, which is an ideal workload profile for cloud resources.

The Folding@Home software was installed in advance in all images used by DIRAC on all clouds. A special DIRAC job was designed for this task: it starts the Folding@Home client and processes exactly one work unit. Once the work unit has been processed, the job is considered completed, and a new DIRAC job can take its place. The total number of successfully completed Folding@Home jobs exceeds 13 thousand. The distribution of the jobs across the clouds is shown in Figure 1. Each job took 15 hours on average, and only single-core jobs were executed. The total amount of normalized computing power consumed is around 135 kHS06 days.

Figure 1. Total number of Folding@Home jobs completed by different clouds
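Before a more data-intensive production is launched, the bandwidth reasoning of Section 3 can be turned into a rough planning estimate of how many concurrent jobs a cloud's external link can sustain before transfers start to dominate. The sketch below uses purely illustrative numbers: the link speed, per-job data volumes, and allowed link share are assumptions, not measured values.

```python
def max_concurrent_jobs(link_gbit_s, job_hours, input_gb, output_gb, link_share=0.5):
    """Steady-state upper bound on concurrent jobs before data transfers
    occupy a given share of the cloud's external link."""
    bits_per_job = (input_gb + output_gb) * 1e9 * 8        # data moved per job
    per_job_rate = bits_per_job / (job_hours * 3600.0)     # average bit/s per running job
    usable_rate = link_gbit_s * 1e9 * link_share           # bandwidth allowed for transfers
    return int(usable_rate / per_job_rate)

# Illustrative numbers only: 1 Gbit/s uplink, 6-hour jobs,
# 2 GB of input and 0.4 GB of output per job, half the link reserved for transfers.
print(max_concurrent_jobs(1.0, 6, 2.0, 0.4))   # -> 562
```

In practice, transfers are bursty (inputs at job start, outputs at job end) rather than uniform, so the sustainable concurrency is lower; such an estimate only indicates the order of magnitude.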
5. Baikal-GVD jobs in the clouds

Baikal-GVD is a cubic-kilometer scale underwater neutrino detector currently under construction in Lake Baikal (Russia) [4]. At JINR, Baikal-GVD is the first large particle physics experiment to massively use the cloud resources combined by DIRAC for Monte-Carlo simulation purposes. The majority of jobs were part of a large-scale Monte-Carlo production that involved simulating the propagation of high-energy muons in water and the detector response.

The standard workflow of a user job with input data download and output data upload was used. Each job needed to download an input file of about 2 GB, and the result of the Monte-Carlo generation had an average size of 370 MB. Only two cloud resources were ready for this type of workload. The PRUE cloud was busy at the time of production, so most of the jobs were completed by the JINR cloud. Only single-core jobs were executed. The total number of successfully executed jobs is 67.5 thousand. The utilization of the JINR cloud was around 80% during this simulation campaign, as shown in Figure 2. Each job took 6 hours on average. The total amount of normalized computing power consumed is around 280 kHS06 days.

Figure 2. Number of running Baikal-GVD jobs at each moment during production

6. Conclusion

Clouds are a valuable resource for scientific computing. Their flexibility provides a unique opportunity to tune the working environment to the requirements of different jobs: the OS, the amount of local storage, and the amount of RAM per core can all be configured in the cloud. The external network bandwidth of a participating organization is the main parameter limiting the number of jobs that can be executed simultaneously.

We presented an example of cloud integration via the DIRAC Interware. User jobs have to be adapted to run under DIRAC, which requires a certain effort from users. However, once this is done, the amount of accessible computing resources can be greatly increased. This benefits not only users but also the clouds: DIRAC jobs are a good way to improve the utilization of cloud resources and an opportunity to participate in computing for scientific collaborations. Folding@Home jobs were successfully executed in the clouds, and this experience paved the way for running the Baikal-GVD Monte-Carlo jobs.

References

[1] Balashov N.A. et al. Present Status and Main Directions of the JINR Cloud Development // Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC'2019), CEUR Workshop Proceedings, ISSN 1613-0073, vol. 2507 (2019), pp. 185-189.

[2] Korenkov V., Pelevanyuk I., Tsaregorodtsev A. Integration of the JINR Hybrid Computing Resources with the DIRAC Interware for Data Intensive Applications // Data Analytics and Management in Data Intensive Domains. 2020. pp. 31-46. DOI: 10.1007/978-3-030-51913-1_3.

[3] Balashov N., Kuchumov R., Kutovskiy N., Pelevanyuk I., Petrunin V., Tsaregorodtsev A. // CEUR Workshop Proceedings, vol. 2507 (2019), pp. 256-260. URL: http://ceur-ws.org/Vol-2507/256-260-paper-45.pdf

[4] Baikal-GVD Collaboration: Allakhverdyan V.A. et al. Measuring muon tracks in Baikal-GVD using a fast reconstruction algorithm // submitted to EPJ C, arXiv:2106.06288.