<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>POPULATION ANNEALING METHOD AND HYBRID SUPERCOMPUTER ARCHITECTURE</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>L.N. Shchur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lev Shchur</string-name>
          <email>lev@landau.ac.ru</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HSE University</institution>
          ,
          <addr-line>101000, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Landau Institute for Theoretical Physics</institution>
          ,
          <addr-line>142432, Chernogolovka</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>Population annealing is a general-purpose algorithm applicable to statistical mechanics systems and optimization problems, and it is potentially scalable on any parallel architecture. We review recent developments in the area, emphasizing the implementation of the algorithm as a hybrid parallel program combining CUDA and MPI. The main difficulty is to keep all general-purpose graphics processing units as busy as possible by efficiently redistributing replicas. We provide details of tests on Intel Skylake/Nvidia V100 hardware, running more than two million replicas of the Ising model in parallel. As the complexity of the simulated system increases, the acceleration grows toward perfect scalability. This work was done under Grant No. 19-11-00286 from the Russian Science Foundation and was supported in part through computational resources of HPC facilities at HSE University [1].</p>
      </abstract>
      <kwd-group>
        <kwd>population annealing</kwd>
        <kwd>GPU</kwd>
        <kwd>CUDA</kwd>
        <kwd>MPI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Previous work</title>
      <p>The population annealing algorithm was proposed by Hukushima and Iba [2] to simulate statistical
mechanics systems with a complex energy landscape. The central idea is to simulate a very large
number of system replicas, splitting each cooling step into two stages: resampling the replicas,
and then equilibrating them independently at the current temperature. An attractive feature of the
algorithm is that it estimates the free energy at each cooling step from an average over the
replicas. Machta [3] showed that, using weighted averages, the behavior of the statistical and
systematic errors can be estimated as a function of the number of replicas $R$. The statistical
errors decay as $1/R^{1/2}$, and the systematic errors decay as $1/R$ for a large enough number of
replicas $R$.</p>
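      <p>The resampling stage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of [2], [3], or [10]: the function name resample, the stochastic rounding of the expected number of copies, and the energy shift used for numerical stability are choices made here. Each replica with energy $E_i$ receives on average $\tau_i = e^{-\Delta\beta E_i}/Q$ copies, where $Q$ is the population mean of $e^{-\Delta\beta E_i}$ and $\Delta\beta$ is the step in inverse temperature; the returned $\ln Q$ is the per-step contribution that is summed over cooling steps to obtain the free-energy estimate.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def resample(energies, beta_old, beta_new, rng):
    """One resampling stage of population annealing (illustrative sketch).

    Returns (copies, log_q): copies[i] is the number of copies of replica i
    kept in the new population, log_q is ln Q for this cooling step.
    """
    d_beta = beta_new - beta_old
    e_min = min(energies)  # shift energies so the exponentials stay bounded
    weights = [math.exp(-d_beta * (e - e_min)) for e in energies]
    mean_w = sum(weights) / len(weights)
    # ln Q = ln[(1/R) * sum_i exp(-d_beta * E_i)], undoing the shift
    log_q = -d_beta * e_min + math.log(mean_w)

    copies = []
    for w in weights:
        tau = w / mean_w            # expected number of copies
        n = math.floor(tau)
        if rng.random() < tau - n:  # stochastic rounding keeps the mean at tau
            n += 1
        copies.append(n)
    return copies, log_q


if __name__ == "__main__":
    rng = random.Random(1)
    energies = [rng.gauss(-100.0, 5.0) for _ in range(1000)]
    copies, log_q = resample(energies, 0.10, 0.11, rng)
    print(sum(copies), log_q)  # population size fluctuates around 1000
```
]]></preformat>
      <p>Because the number of copies is random, the population size fluctuates from step to step, which is the source of the load imbalance discussed in Section 2.</p>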
      <p>The method has been applied successfully to spin glasses [4,5], molecular dynamics
[6], first-order phase transitions [7], and optimization problems [8]. A detailed description of the
method and the analysis of the accuracy dependence on the essential parameters of the simulation can
be found in the recent publication [9].</p>
      <p>The algorithm was successfully implemented using CUDA [10], and it was found that the
optimal number of replicas per V100 GPU is about ten times larger than the number
of threads. This makes it possible to run about 20 thousand replicas on a single GPU.</p>
      <p>
        Recently, we extended the range of population annealing (PA) simulations to more than
two million replicas running in parallel on the HSE University supercomputer cHARISMa with 104
Nvidia V100 GPUs [
        <xref ref-type="bibr" rid="ref2">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. CUDA/MPI realization of PA algorithm</title>
      <p>The main question in the multi-node realization of the PA algorithm is how to keep the
number of replicas on each node as uniform as possible. During cooling, the number of replicas
may grow on some nodes and become very small on others. In that case the computing time
differs strongly between nodes, and the simulation becomes inefficient.</p>
      <p>
        A possible solution is to group replicas into blocks of moderate size, 1024 replicas each,
and to use twenty blocks per GPU [
        <xref ref-type="bibr" rid="ref2">11</xref>
        ]. Replicas are redistributed between nodes in whole blocks. Before the
redistribution step, the algorithm calculates for each node either the excess or the shortage of
blocks, that is, the difference between the number of blocks on the node and the optimal number
of blocks with the allowed window taken into account, depending on which of the two is positive.
The decision is made at the master node, which then starts the redistribution of replicas
according to the excess and shortage values.
      </p>
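      <p>The decision made at the master node can be illustrated with a small planning routine. This is only a sketch under assumptions made here, not the scheme of [11]: the function name plan_transfers, the parameters optimal and window, and the greedy rule (move one block at a time from the most loaded GPU to the least loaded one) are hypothetical. In an MPI program the master would first gather the block counts from all ranks and then send the resulting transfer list back to them.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def plan_transfers(blocks, optimal=20, window=1):
    """Plan block transfers between GPUs (illustrative greedy sketch).

    blocks[g] is the current number of blocks on GPU g. A GPU is out of
    balance if it holds more than optimal + window blocks (excess) or
    fewer than optimal - window blocks (shortage).
    Returns (transfers, load): transfers is a sorted list of
    (source_gpu, dest_gpu, n_blocks), load is the block count after them.
    """
    load = list(blocks)
    moves = Counter()
    while True:
        src = max(range(len(load)), key=load.__getitem__)  # most loaded GPU
        dst = min(range(len(load)), key=load.__getitem__)  # least loaded GPU
        out_of_window = (load[src] > optimal + window
                         or load[dst] < optimal - window)
        # Stop when every GPU is inside the window, or when moving one more
        # block could no longer reduce the spread between GPUs.
        if not out_of_window or load[src] - load[dst] < 2:
            break
        load[src] -= 1
        load[dst] += 1
        moves[(src, dst)] += 1
    transfers = [(s, d, n) for (s, d), n in sorted(moves.items())]
    return transfers, load


if __name__ == "__main__":
    # GPU 0 has an excess of blocks, GPU 2 a shortage.
    print(plan_transfers([25, 20, 15, 20]))
```
]]></preformat>
      <p>With a window of one block, the example above moves four blocks from GPU 0 to GPU 2 and leaves the remaining GPUs untouched, so only the out-of-window nodes take part in communication.</p>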
      <p>
        With the block algorithm, the number of replicas per node stays relatively flat during the
simulation and fluctuates within the allowed window of one or two excess or shortage blocks [
        <xref ref-type="bibr" rid="ref2">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Testing at HSE supercomputing facilities</title>
      <p>
        The block PA algorithm was tested in [
        <xref ref-type="bibr" rid="ref2">11</xref>
        ] using the Ising model on a square lattice
with $64^2$ spins, running the GPU code published in [7] on the HSE cluster with 26 nodes and 104 GPUs
available. The software environment was OPENMP 4.0.1, CUDA version 10.2, and NVIDIA driver
version 440.33.01. The scalability is shown in Figure 1: each GPU simulates an average of 20 blocks
of 1024 replicas, so the volume of computations grows with the number of GPUs from 1 to 104.
The HSE computer therefore simulated up to $1024 \times 20 \times 104 = 2\,129\,920$ replicas of the
Ising model in parallel. More than 50 percent of the possible simulation power was used. Note that
the value of the relaxation parameter in these simulations was relatively moderate; with a larger
value of the relaxation parameter the efficiency is expected to be higher, and the same can be
deduced for the simulation of systems that take more time for equilibration or need
more time for the calculation at each step.
[Figure 1. Scalability of the block PA algorithm as a function of the number of GPUs (axis ticks from 16 to 104).]
      </p>
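      <p>The replica count quoted above follows directly from the block layout, and the same arithmetic shows why the test is a weak-scaling one: the total work grows in proportion to the number of GPUs. The short check below is a sketch; the constant and function names are introduced here for illustration only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
REPLICAS_PER_BLOCK = 1024  # block size used in the tests
BLOCKS_PER_GPU = 20        # average number of blocks per GPU


def total_replicas(n_gpus):
    """Average number of replicas simulated in parallel on n_gpus GPUs."""
    return REPLICAS_PER_BLOCK * BLOCKS_PER_GPU * n_gpus


if __name__ == "__main__":
    for n_gpus in (1, 26, 104):
        print(n_gpus, total_replicas(n_gpus))
```
]]></preformat>
      <p>For a single GPU this gives 20 480 replicas, and for all 104 GPUs it gives 2 129 920 replicas.</p>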
    </sec>
    <sec id="sec-4">
      <title>4. Future plans</title>
      <p>The HSE supercomputer cHARISMa has since been upgraded and now has 184 GPUs
available. We plan to use the block PA algorithm to simulate the complex behavior of
statistical mechanics systems at low temperatures.</p>
      <p>In addition, there is an extension of the PA algorithm in which temperature as the control
parameter is replaced by energy [12]. This algorithm is based on the GPU program code from
paper [10] and demonstrates the possibility of capturing the nonequilibrium properties of
models. We are working on a combined GPU/MPI implementation in the spirit of
the block PA algorithm.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Kostenetskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Chulkevich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Kozyrev</surname>
          </string-name>
          ,
          <source>J. Phys. Conf. Ser</source>
          .
          <volume>1740</volume>
          ,
          <issue>012050</issue>
          (
          <year>2021</year>
          ) [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hukushima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iba</surname>
          </string-name>
          ,
          <source>AIP Conf. Proc.</source>
          <volume>690</volume>
          ,
          <issue>200</issue>
          (
          <year>2003</year>
          ) [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Machta</surname>
          </string-name>
          ,
          <source>Phys. Rev. E</source>
          <volume>82</volume>
          ,
          <issue>026704</issue>
          (
          <year>2010</year>
          ) [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Machta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.G.</given-names>
            <surname>Katzgraber</surname>
          </string-name>
          ,
          <source>Phys. Rev. E</source>
          <volume>92</volume>
          ,
          <issue>063307</issue>
          (
          <year>2015</year>
          ) [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barzegar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pattison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.G.</given-names>
            <surname>Katzgraber</surname>
          </string-name>
          ,
          <source>Phys. Rev. E</source>
          <volume>98</volume>
          ,
          <issue>053308</issue>
          (
          <year>2018</year>
          ) [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Christiansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weigel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Janke</surname>
          </string-name>
          ,
          <source>Phys. Rev. Lett</source>
          .
          <volume>122</volume>
          ,
          <issue>060602</issue>
          (
          <year>2019</year>
          ) [7]
          <string-name>
            <given-names>L. Yu.</given-names>
            <surname>Barash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.N.</given-names>
            <surname>Shchur</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Janke</surname>
          </string-name>
          ,
          <source>Eur. Phys. J. Spec. Top</source>
          .
          <volume>226</volume>
          ,
          <issue>595</issue>
          (
          <year>2017</year>
          ) [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Askarzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.E.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.C.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <source>2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC)</source>
          ,
          <pub-id pub-id-type="doi">10.1109/SMC.2016.7844961</pub-id>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Weigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shchur</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Janke</surname>
          </string-name>
          ,
          <source>Phys. Rev. E</source>
          <volume>103</volume>
          ,
          <issue>053301</issue>
          (
          <year>2021</year>
          ) [10]
          <string-name>
            <given-names>L. Yu.</given-names>
            <surname>Barash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Borovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Janke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Shchur</surname>
          </string-name>
          ,
          <source>Comp. Phys. Comm</source>
          .
          <volume>220</volume>
          ,
          <issue>341</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Russkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chulkevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shchur</surname>
          </string-name>
          ,
          <source>Comp. Phys. Comm</source>
          .
          <volume>261</volume>
          ,
          <issue>107786</issue>
          (
          <year>2021</year>
          ) [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rose</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Machta</surname>
          </string-name>
          ,
          <source>Phys. Rev. E</source>
          <volume>100</volume>
          ,
          <issue>063304</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>