<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Container-Based Job Management System for Utilization of Idle Supercomputer Resources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stanislav Polyakov</string-name>
          <email>s.p.polyakov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Dubenskaya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Skobeltsyn Institute of Nuclear Physics, M.V.Lomonosov Moscow State University (SINP MSU)</institution>
          ,
          <addr-line>1(2), Leninskie gory, GSP-1, Moscow 119991</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a system for utilization of idle computational resources of supercomputers. The system executes additional low-priority jobs inside containers on idle nodes and uses container migration tools to interrupt the execution and resume it later, possibly on different nodes. We implemented a prototype of the proposed system for Docker containers and demonstrated that it performs the necessary operations successfully. Our experiments based on simulation show that the proposed system can increase the effective utilization of supercomputer resources, in some cases by as much as 10%.</p>
      </abstract>
      <kwd-group>
        <kwd>Data processing • Supercomputer scheduling • Average load • Container • Container migration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Supercomputers are very expensive to build and maintain, so it is
important to prevent any unnecessary losses in their performance. In many cases
the average load of supercomputers is significantly lower than 100%. For example,
the Lomonosov and Lomonosov-2 supercomputers in 2017 had average loads of
92.3% and 88.7%, respectively [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The Titan supercomputer in 2015 had at least
10% of its capacity unused [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        One possible way to address this issue is adding low-priority jobs to be
executed by the computational nodes that would otherwise remain idle. There are
numerous problems in physics and other scientific fields that can produce a large
number of jobs, e.g. series of Monte Carlo simulations with varying parameters.
This approach was used, for example, to run ATLAS jobs on Titan
supercomputer [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>Conveniently, a lot of jobs that can be used to fill idle supercomputer nodes
do not require parallel execution and thus can be started on an arbitrary number
of idle nodes. However, filling all idle nodes will increase the average load at the
cost of performance with respect to regular jobs: the idle nodes which can be
used to execute additional low-priority jobs (filler jobs) would not otherwise
necessarily remain idle for the duration of the execution, particularly if time
requirements of the filler jobs are high. Usually it is not even possible to know
in advance how long a node will remain idle, because newly submitted jobs or
jobs completed ahead of time can change the schedule. So in order to reduce
the negative impact of filler jobs on supercomputer performance with respect to
regular jobs, filler jobs with relatively short execution time should be used.</p>
      <p>
        We propose to run additional low-priority jobs in containers and use container
migration tools to interrupt the job execution and resume it later, possibly on a
different node. We describe a two-component system for managing and executing
these jobs that can work alongside a supercomputer scheduler. We present the
results of simulating the supercomputer workflow with and without the proposed
system to demonstrate its potential advantages. We previously discussed this
idea in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A similar idea for utilizing idle resources of supercomputers was used
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] where the authors proposed a quasi scheduler to manage jobs of a special type
(distinct from the type we use here).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Saving and Restoring Containers</title>
      <p>Container virtualization is a method of isolating groups of processes and their
environment, based on the Unix chroot mechanism. Containers are in many
ways similar to virtual machines but they use the same OS kernel as the host
machine. Several implementations of container virtualization offer tools for
container migration. Some of the differences between the implementations of
container migration are significant for our proposed use of containers, and there is
some terminological difference as well. For simplicity, we will focus on the Docker
implementation of container migration.</p>
      <p>
        In Docker [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], container migration is implemented as an experimental feature.
It uses the CRIU package (CRIU stands for Checkpoint/Restore in Userspace),
which allows freezing running applications and saving them as collections of files. These
collections of files, or checkpoints, can later be used to restore the application.
The respective Docker feature can be used to create checkpoints for running
containers, saving the processes running in these containers. Restoring the
container from a checkpoint requires an identical container without any running
processes. Our tests confirm that this experimental Docker feature works
reliably and meets the requirements of our project: the checkpoints can be put into
storage, restored on a different host machine, and a single container can be saved
and restored multiple times consecutively.
      </p>
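      <p>As an illustration (not the code of our prototype), the following minimal sketch drives the experimental Docker checkpoint feature from Python. The image, container, and checkpoint names are arbitrary, and it assumes a Docker daemon with experimental features enabled and the CRIU package installed.</p>
      <preformat>
# Minimal sketch: checkpoint and restore a Docker container from Python.
# The image, container name, and checkpoint name below are illustrative.
import subprocess

CONTAINER = "filler-job-1"
CHECKPOINT = "cp-1"

def docker(*args):
    """Run a docker CLI command and fail loudly on errors."""
    subprocess.run(["docker", *args], check=True)

# Start a long-running container (a stand-in for a filler job).
docker("run", "-d", "--name", CONTAINER, "busybox",
       "sh", "-c", "i=0; while true; do i=$((i+1)); sleep 1; done")

# Freeze the container's processes and save them as a checkpoint;
# the container stops after this call.
docker("checkpoint", "create", CONTAINER, CHECKPOINT)

# Later (possibly after moving the checkpoint to another host that holds an
# identical, stopped container), resume execution from the checkpoint.
docker("start", "--checkpoint", CHECKPOINT, CONTAINER)
      </preformat>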
      <p>
        We also conducted a series of tests to measure the time needed to start a
container, create a checkpoint, and restart a container from a checkpoint. It
is typical for container operations like creating a new container to take little
time. In our tests, starting a container took 1.25 s on average. However, creating
a checkpoint involves copying to a permanent storage device all contents of a
container, including RAM used by its processes, so we expected the creation
time to grow approximately linearly with the size of the checkpoint. This was
con rmed by our tests: creating a checkpoint for a container using 1 MB memory
on average takes approximately 1 second, and at 200 MB memory and higher it
takes approximately 5 seconds per 100 MB. It is more than 4 times slower than
the sequential write speed of the HDD used in our tests (87 MB/s). Restoring
containers from checkpoints also proved to be slow: over 1 second for a container
using 1 MB memory and approximately 4 seconds per 100 MB for containers
using over 200 MB memory. (By contrast, LXC [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] implementation of saving and
restoring containers is several times faster, possibly because of its use of the ZFS
file system. However, a persistent error prevents an LXC container
from being saved and restored a second time after this has been done once.)
      </p>
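      <p>A measurement of this kind can be reproduced in outline by timing the checkpoint call directly, as in the sketch below; the container setup and memory size are illustrative and do not replicate our exact test environment.</p>
      <preformat>
# Rough sketch: measure how long `docker checkpoint create` takes for a
# container that holds a given amount of memory (values are illustrative).
import subprocess, time

def checkpoint_seconds(container: str, checkpoint: str) -> float:
    """Time a single checkpoint-creation call, in seconds."""
    t0 = time.monotonic()
    subprocess.run(["docker", "checkpoint", "create", container, checkpoint],
                   check=True)
    return time.monotonic() - t0

# A container that allocates roughly 200 MB and then sleeps (hypothetical).
subprocess.run(["docker", "run", "-d", "--name", "mem200", "python:3-slim",
                "python", "-c",
                "import time; b = bytearray(200 * 1024 * 1024); time.sleep(3600)"],
               check=True)
print(checkpoint_seconds("mem200", "cp-0"))
      </preformat>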
      <p>Of course, these figures are not enough for an accurate estimate of the time
required by checkpoint operations. Supercomputers can use better hardware, and
write speed can be increased by the use of RAIDs. It is also possible that Docker
developers will find a way to make checkpoint operations less time-consuming.
On the other hand, the network of a supercomputer is a possible bottleneck, and
if the computational nodes cannot access the local storage of other nodes, then
shared storage is another one. Furthermore, utilizing multiple CPU cores of a
single computational node requires sharing access to checkpoint storage between
multiple single-core jobs. Therefore we can reasonably expect to lose, on average,
at least several minutes of CPU time due to container operations and inactivity
each time a computational node is used to run filler jobs.</p>
      <p>For the purposes of the proposed system, several minutes is a significant delay.
Because of this delay, we cannot let filler jobs run on a computational node until
it is required for regular jobs: creating checkpoints for all filler jobs running on a
node requires too much time. But there are several other possible approaches: for
example, it is possible to periodically create checkpoints for running filler jobs
without stopping the containers and stop the jobs on demand without creating
new checkpoints and without losing too much progress. In this paper, we propose
a different approach: to allot a time interval during which the node can run filler
jobs and save the unfinished jobs at the end of the interval.</p>
    </sec>
    <sec id="sec-3">
      <title>Managing Filler Jobs to Utilize Idle Supercomputer</title>
    </sec>
    <sec id="sec-4">
      <title>Nodes</title>
      <p>We propose a two-component system to manage filler jobs and execute them
inside containers on a supercomputer. The first component is an agent program.
An instance of the program is launched on a computational node, creates or
restores containers with filler jobs, and saves the containers before the allotted
time is over. The second component is a control program that maintains the
queue of filler jobs, stores information about job status, distributes filler jobs
between instances of the agent program, and interacts with the supercomputer
scheduler.</p>
      <p>
        Knowing the number of CPU cores per node, available memory, and the time
it takes to start, checkpoint, and restore containers depending on the memory
they use, it is possible to choose a minimum execution time for the agent
program. However, an agent program instance can run longer than the minimum
time, increasing the effective use of the node. This makes containerized filler jobs
different from regular jobs: the scheduler does not have to abide by the requested
execution time and can set its own execution time instead. In this, our filler jobs
are similar to the jobs managed by the quasi scheduler proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] where the
jobs do not have a fixed number of nodes they need, instead allowing the quasi
scheduler to determine the number of nodes they can use.
      </p>
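      <p>As a simple illustration, a lower bound on the agent's requested time can be computed from these parameters; the function below and all numeric values in it are assumptions, not parameters measured for our prototype.</p>
      <preformat>
# Illustrative estimate of the minimum execution time for an agent instance:
# enough time to start or restore each job, checkpoint it again at the end,
# and still leave a useful amount of compute time.
def min_agent_time(n_jobs: int, t_start: float, t_restore: float,
                   t_checkpoint: float, t_useful: float) -> float:
    overhead = n_jobs * (max(t_start, t_restore) + t_checkpoint)
    return overhead + t_useful

# Example: 8 single-core jobs, ~75 s to restore and ~75 s to checkpoint each,
# plus at least 30 minutes of useful work (all values assumed).
print(min_agent_time(8, 5.0, 75.0, 75.0, 30 * 60) / 60, "minutes")
      </preformat>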
      <p>
        Thus we can choose an approach to setting the execution time for filler jobs.
Based on tradeoffs between node utilization and performance with respect to
regular jobs, we can find in advance an optimal execution time and use it for all
filler jobs. However, this can result in filler jobs retaining nodes even when there
are regular jobs that can use them. For example, suppose there is a regular job J
waiting in the queue that requires 10 nodes. Another job terminates, releasing 2
nodes. Since 2 nodes are not enough to start J, they are given to filler jobs instead.
Then 8 more nodes are released by other filler jobs, but that is still not enough to
start J, so they are given to filler jobs once again. As a result, regular jobs can lose access
to a significant number of nodes. Another idea is to set the termination time for
filler jobs to match some other running job or jobs. However, user estimates of
execution time are very inaccurate (see, e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) so usually the termination time
of a regular job cannot be predicted accurately. On the other hand, we control the
termination time of filler jobs supervised by the agent program. Synchronizing
the termination of agents can be as simple as having them all stop at the end of
each hour (or any other period).
      </p>
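      <p>A minimal sketch of such synchronized termination follows; the period length is a configuration choice rather than a value fixed by the system.</p>
      <preformat>
# Compute the next wall-clock boundary (e.g. the end of the hour) at which
# all agent instances stop and save their filler jobs.
import time

def next_termination(period_s: int = 3600) -> float:
    """Return the next period boundary as a Unix timestamp."""
    now = int(time.time())
    return float((now // period_s + 1) * period_s)

deadline = next_termination(3600)
remaining = deadline - time.time()  # time budget left for running filler jobs
      </preformat>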
      <p>Now we can describe the components of the system to utilize idle
supercomputer resources in more detail. Please note that both components of the system
deal with filler jobs only. An instance of the agent program must (a condensed
sketch of this loop is given after the list):</p>
      <p>check the remaining time and stay idle if there is not enough time to start
(or resume) a single job and then create a checkpoint for it, with some time left
to actually run the job,</p>
      <p>request jobs from a control program,</p>
      <p>start or restart the jobs one by one inside containers; a job should only be
started or restarted if this would leave enough time to create checkpoints for all
running jobs,</p>
      <p>periodically check the running jobs and report the completed ones back to the
control program,</p>
      <p>create checkpoints for the remaining jobs when there is just enough time left
for it and report their status back to the control program,</p>
      <p>wait until the termination time and exit.
(The program needs the upper estimates of the time required to start or restart
a job and create a checkpoint as its parameters, as well as the minimal execution
time for filler jobs.)</p>
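      <p>The following condensed sketch shows how these steps can fit together in one loop. The control-program calls (request_job, report, and so on) and the timing constants are placeholders for illustration; they are not part of the published prototype.</p>
      <preformat>
# Condensed sketch of the agent's main loop (placeholders, not the prototype).
import time

T_START = 10.0        # assumed upper estimate: start or restore one job, s
T_CHECKPOINT = 120.0  # assumed upper estimate: checkpoint one job, s
MIN_RUN = 300.0       # assumed minimal useful execution time per job, s
MARGIN = 30.0         # safety margin, s
POLL_S = 5.0          # polling interval, s

def agent_loop(deadline, request_job, start_or_restore, is_finished,
               checkpoint, report):
    running = []
    while True:
        budget = deadline - time.time()
        # Stop when only just enough time remains to checkpoint everything.
        if T_CHECKPOINT * max(len(running), 1) + MARGIN >= budget:
            break
        # Take a new job only if all jobs (including it) can still be
        # checkpointed and it would get a useful amount of run time.
        if budget > T_START + T_CHECKPOINT * (len(running) + 1) + MIN_RUN:
            job = request_job()
            if job is not None:
                start_or_restore(job)
                running.append(job)
        # Report completed jobs back to the control program.
        for job in list(running):
            if is_finished(job):
                report(job, "done")
                running.remove(job)
        time.sleep(POLL_S)
    # Checkpoint the remaining jobs and report their status, then idle
    # until the synchronized termination time (omitted here).
    for job in running:
        checkpoint(job)
        report(job, "saved")
      </preformat>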
      <p>The control program must (a sketch of these duties is given below):</p>
      <p>submit jobs consisting of agent program instances to the supercomputer
scheduler with the lowest priority and maintain their number according to the
settings,</p>
      <p>maintain the queue of non-parallel low-priority jobs with records of their
status,</p>
      <p>upon receiving a request from an agent, provide it with the jobs from the
queue,</p>
      <p>track the jobs given to agents and update their status according to the
information from the agents.</p>
      <p>The supercomputer scheduler does not need any changes: filler jobs are
submitted as regular (lowest-priority) jobs with the requested time equal to the
interval between synchronized termination of the agents.</p>
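      <p>As an illustration of the control program's outward-facing duties, the sketch below maintains a queue of filler jobs and submits agent instances to a SLURM-like scheduler; the script name, flags, and helper functions are assumptions rather than details of our implementation.</p>
      <preformat>
# Illustrative sketch of the control program (assumes a SLURM-like scheduler;
# the agent wrapper script and submission flags are hypothetical).
import subprocess
from collections import deque

filler_queue = deque()         # records of pending non-parallel filler jobs
AGENT_SCRIPT = "run_agent.sh"  # hypothetical wrapper that launches an agent

def submit_agent(interval="01:00:00"):
    """Submit one agent instance with the lowest priority; the requested
    time equals the interval between synchronized agent terminations."""
    subprocess.run(["sbatch", "--nodes=1", "--nice=10000",
                    "--time=" + interval, AGENT_SCRIPT], check=True)

def next_filler_job():
    """Hand the next queued filler job to a requesting agent, if any."""
    return filler_queue.popleft() if filler_queue else None
      </preformat>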
      <p>We implemented a prototype of the proposed system, working with Docker
containers. Our tests show that it successfully launches filler jobs inside
containers on idle computers and creates checkpoints for them before the allotted time
is over. Filler jobs can be restored and saved multiple times on different nodes.
We have not yet fully implemented a version of the agent program capable of
running multiple filler jobs on a single node. Implementing and testing this part
of the functionality will be a part of future work.</p>
    </sec>
    <sec id="sec-5">
      <title>Simulation Results</title>
      <p>
        Using a simulated job queue with a parameter distribution based on data from
Lomonosov supercomputers, we performed a series of experiments to estimate
the benefits of the proposed system. A detailed description of the experiments
was given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; here we give a brief summary with some additional results.
      </p>
      <p>The experiments consisted of two series. In the first series, the main scheduler
queue was assumed to be full (always kept at 100 regular jobs), and in the
second series, the regular jobs were added according to the Poisson distribution
with the parameters chosen to approximate the historical average load of Lomonosov
supercomputers. The average load in the first series was high even with the basic
scheduler only: 96.9% with 1024 nodes, 99.2% with 4000 nodes. We assumed that
container and checkpoint operations cost 10 minutes every time a node runs filler
jobs (if Docker checkpoint operations are used, this approximately corresponds
to 8 jobs per node each using 1.5 GB memory on hardware similar to that used
in our tests). With this assumption, our system with a one-hour interval between
synchronized agent terminations increased the effective utilization to 99.0% with
1024 nodes and 99.5% with 4000 nodes. With a different type of job queue, the
results were even better: the increase was from 95.5% to 98.9% with 1024 nodes
and from 99.1% to 99.6% with 4000 nodes. For the first type of job queue, adding
filler jobs decreased the average number of nodes running regular jobs (by 0.56%
with 1024 nodes and 0.19% with 4000 nodes). For the second type, the decrease
was negligible.</p>
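      <p>For completeness, arrivals of this kind can be generated as a Poisson process with exponential inter-arrival times; the snippet below is a toy illustration, not our simulator, and the rate is an arbitrary example value.</p>
      <preformat>
# Toy illustration: regular-job arrival times drawn from a Poisson process.
import random

def poisson_arrivals(rate_per_hour: float, horizon_hours: float):
    """Yield arrival times (in hours) over the given horizon."""
    t = 0.0
    while True:
        t += random.expovariate(rate_per_hour)
        if t > horizon_hours:
            return
        yield t

arrivals = list(poisson_arrivals(rate_per_hour=12.0, horizon_hours=24.0))
      </preformat>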
      <p>In the experiments with the full queue, changing the interval between agent
terminations can sometimes improve the results, but one hour is a good default
choice. In the second series of experiments, a one-hour interval is too short: the
effective resource utilization can be increased significantly by choosing 4 or even
6 hours. With a 4-hour interval, the effective utilization is increased from 92.3% to
98.9% (first type of queue, 4000 nodes) and from 88.8% to 99.3% (second type
of queue, 1500 nodes). There is no significant decrease in the average number of
nodes running regular jobs.</p>
      <p>In our additional experiments, we estimated the average waiting time for a
regular job. Regular jobs were added according to the Poisson distribution. The
waiting time for the first type of queue with 4000 nodes was 32 minutes with
only regular jobs and 44 to 63 minutes with filler jobs, depending on the interval
between agent terminations. For the second type of queue with 1500 nodes, the
average waiting time was 89 minutes without filler jobs and 98 to 114 minutes
with filler jobs. The termination interval for agent instances was between 0.5 and
4 hours. As these experiments show, adding filler jobs increases the waiting time
for regular jobs even without reducing the number of nodes executing them. The
additional delay is significant but relatively short given the increase in average
load. By comparison, in the experiments where our system is not used and the
average load is increased by adding regular jobs at a higher rate, a waiting
time between 44 and 63 minutes approximately corresponds to an average load of
93–95% for the first type of queue, and a waiting time between 98 and 114
minutes approximately corresponds to an average load of 89–90% for the second
type of queue.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Our tests of container tools and the prototype demonstrate that the proposed
system can work as intended. Docker containers are suitable for the purposes of
the system, although in the current implementation the checkpoint operations
are very slow. Our simulation experiments show that the system can increase
the effective utilization of supercomputer resources, possibly by as much as 10%.
The average effective utilization was increased in all our experiments. Possible
downsides include the increased waiting time for regular jobs and, in some cases,
a decrease in the number of nodes running regular jobs. However, in many
cases the decrease is not significant.</p>
      <p>Acknowledgements. The work was supported by the Russian Foundation for
Basic Research grant 18-37-00502 "Development and research of methods for
increasing the performance of supercomputers based on job migration using
container virtualization". We would also like to thank Sergey Zhumatiy for the
provided information and useful discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Leonenkov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhumatiy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Supercomputer Efficiency:
          <article-title>Complex Approach Inspired by Lomonosov-2 History Evaluation</article-title>
          .
          <source>In: Russian Supercomputing Days: Proceedings of the International Conference (September</source>
          <volume>24</volume>
          –
          <fpage>25</fpage>
          ,
          <year>2018</year>
          , Moscow, Russia), pp.
          <volume>518</volume>
          –
          <fpage>528</fpage>
          . Moscow State University (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>De</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          et al. [ATLAS Collaboration]
          <article-title>: Integration of PanDA workload management system with Titan supercomputer at OLCF</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          ,
          <volume>664</volume>
          (
          <issue>9</issue>
          ) p.
          <fpage>092020</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barreiro Megino</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          et al. [ATLAS Collaboration]
          <article-title>: Integration of Titan supercomputer at OLCF with ATLAS Production System</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          ,
          <volume>898</volume>
          (
          <issue>9</issue>
          ) p.
          <fpage>092002</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dubenskaya</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polyakov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Improving the effective utilization of supercomputer resources by adding low-priority containerized jobs</article-title>
          . In: Kryukov, A., Haungs, A. (eds.) 3rd
          <source>International Workshop on Data Life Cycle in Physics 2019, CEUR Workshop Proceedings</source>
          , vol.
          <volume>2406</volume>
          , pp.
          <volume>43</volume>
          –
          <fpage>53</fpage>
          . M. Jeusfeld c/o Redaktion Sun SITE, Informatik V, RWTH Aachen
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baranov</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiselev</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lyakhovets</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>The quasi scheduler for utilization of multiprocessing computing system's idle resources under control of the Management System of the Parallel Jobs</article-title>
          . Bulletin of the South Ural State University.
          <source>Series "Computational Mathematics and Software Engineering" 3</source>
          (
          <issue>4</issue>
          ),
          <volume>75</volume>
          –
          <fpage>84</fpage>
          (
          <year>2014</year>
          )
          <article-title>(in Russian)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. https://docker.com/</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. https://linuxcontainers.org/</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tsafrir</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etsion</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feitelson</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Modeling user runtime estimates</article-title>
          . In: Feitelson,
          <string-name>
            <surname>D.G.</surname>
          </string-name>
          et al.
          <source>(eds.) 11th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP)</source>
          ,
          <source>LNCS</source>
          , vol.
          <volume>3834</volume>
          , pp.
          <volume>1</volume>
          –
          <fpage>35</fpage>
          . Springer-Verlag (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>