Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

MULTI-GPU TRAINING AND PARALLEL CPU COMPUTING FOR THE MACHINE LEARNING EXPERIMENTS USING ARIADNE LIBRARY

P. Goncharov1, A. Nikolskaia2, G. Ososkov1, E. Rezvaya1, D. Rusov1 and E. Shchavelev2,a

1 Joint Institute for Nuclear Research, 6 Joliot-Curie street, 141980, Dubna, Moscow region, Russia
2 Saint Petersburg State University, 7-9 Universitetskaya emb., Saint Petersburg, 199034, Russia

E-mail: a egor.schavelev@gmail.com

Modern machine learning (ML) tasks and neural network (NN) architectures require large amounts of GPU computing resources and demand high CPU parallelization for data preprocessing. At the same time, the Ariadne library, which aims to solve complex high-energy physics tracking tasks with the help of deep neural networks, lacks multi-GPU training and efficient parallel data preprocessing on the CPU. In this work, we present our approach to multi-GPU training in the Ariadne library. We present efficient data caching, parallel CPU data preprocessing, and a generic ML experiment setup for prototyping, training, and inference of deep neural network models. Results in terms of speed-up and performance for the existing neural network approaches are presented with the help of GOVORUN computing resources.

Keywords: Machine Learning, Tracking, Python library, CPU optimizations, GPU optimizations

Pavel Goncharov, Anastasiia Nikolskaia, Gennady Ososkov, Ekaterina Rezvaya, Daniil Rusov, Egor Shchavelev

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Motivation

Modern high-energy physics (HEP) experiments produce large amounts of data and require specific computer software to operate. Particle tracking is an important part of the software of HEP experiments. Many algorithms exist for this task, and one of the most well-proven tracking approaches is based on the Kalman filter. Unfortunately, it does not scale sufficiently to perform efficient computations on modern hardware such as graphics processing units (GPUs). At the same time, studies [1, 2] indicate that machine learning (ML) and deep neural networks (NN) can be an efficient replacement for the well-known tracking algorithms: their authors achieve competitive results in terms of track reconstruction accuracy, while being orders of magnitude faster in terms of processing speed.

Modern ML approaches are mostly developed in the Python programming language and use specific tensor-based libraries to implement NN models and deploy them to the GPU. Given the novelty of ML-based tracking, there is no generally known Python library whose goal is to study deep learning for HEP tracking tasks. Considering all of the above, we decided to start the development of the Ariadne [3] library, the first open-source Python library for particle tracking based on deep learning methods. The goal of Ariadne is to help researchers investigate their ML-based tracking methods with a simple but standardized setup. Ariadne is still in development but has already provided great benefits for our tasks. The initial description and motivation of Ariadne can be found in [3].
2. Current state of Ariadne

The current Ariadne application programming interface (API), from the researcher's point of view, is shown in Figure 1.

Figure 1. Ariadne API

For an experimental run, a researcher implements the following components: preprocessor, model, and dataset; in addition, already implemented components such as parse, transforms, criterion, and optimizer can be overridden. After the implementation, the user runs the 'prepare' phase, which computes the needed preprocessing steps. The initial data processing steps are shown in Figure 2a. Later, one can train the NN model with the preprocessed data.

3. Caching and Multi-CPU prepare

Since the previous work [3], five different NN approaches have already been developed with the help of Ariadne. Every approach shares the common library API but implements its own preprocessing and training components. During the implementation and investigation of a potential approach, researchers often run the parsing, preprocessing, and training phases sequentially, one by one, in a single Python process. So, for example, after any change in the preprocessing algorithms, all training data (which can occupy hundreds of gigabytes of disk space) must be recomputed from scratch. Running such scripts in a single Python process is extremely time-consuming and does not scale with the available hardware.

In this work, we reimplemented the 'prepare' core scripts with the help of multiprocessing. A comparison of the old and new implementations is shown in Figure 2. The implementation consists of three main parts:
● Caching module – on-the-fly memoization of any processing unit (such as parsing, coordinate transformations, and any other data mutation procedure);
● Multiprocessing of the target preprocessing routine – the preprocessor is run in a worker pool in parallel with the help of the Python multiprocessing framework;
● Data serialization – with the help of the HDF5 format [4], data can be efficiently read from and written to disk.

Figure 2. Comparison of the old (a) and new (b) 'prepare' core implementation

In the new implementation, the worker pool creates as many parallel worker processes as there are CPU cores on the target hardware.

4. Batch bucketing and Multi-GPU training

With the help of the new caching module, we implemented a batch bucketing routine. Batch bucketing is a common algorithm [5] for efficient dataset processing that places NN inputs with equal dimensions into the same training batch. Such a routine can considerably speed up training time on a single GPU device and allows the use of batch sizes that would not fit in GPU memory otherwise. We also enabled multi-GPU training with the help of the PyTorch Lightning [6] library. Researchers can now run their NN training on up to 8 GPUs in parallel, which also greatly reduces model training times.
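As an illustration of the parallel 'prepare' phase described in Section 3, the following minimal sketch shows how per-event preprocessing could be distributed over a CPU worker pool and serialized to HDF5. The preprocess_event function, its toy hit data, and the file layout are illustrative assumptions only and do not reproduce the actual Ariadne preprocessor API.

```python
# Illustrative sketch: hypothetical per-event preprocessing parallelized
# across CPU cores and serialized to HDF5, in the spirit of the revised
# 'prepare' phase. Names and data layout are assumptions, not Ariadne code.
import multiprocessing as mp

import h5py
import numpy as np


def preprocess_event(event_id: int) -> np.ndarray:
    """Hypothetical per-event preprocessing (e.g. coordinate transforms)."""
    rng = np.random.default_rng(event_id)
    hits = rng.random((100, 3))          # stand-in for detector hits (x, y, z)
    return hits * 2.0 - 1.0              # e.g. rescale coordinates to [-1, 1]


def prepare(event_ids, output_path="prepared.h5"):
    # One worker process per available CPU core, as in the new 'prepare' core.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        processed = pool.map(preprocess_event, event_ids)

    # Serialize the preprocessed events to HDF5 for fast training-time reads.
    with h5py.File(output_path, "w") as f:
        for event_id, hits in zip(event_ids, processed):
            f.create_dataset(f"event_{event_id}", data=hits)


if __name__ == "__main__":
    prepare(range(1000))
```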
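The batch bucketing routine of Section 4 can likewise be illustrated with a minimal PyTorch Lightning setup in which sample indices are grouped by input length so that every batch is homogeneous in shape. The ToyTrackDataset, ToyModel, and bucketed_batches helper below are hypothetical stand-ins rather than the GraphNet model or the Ariadne training code; the multi-GPU option is indicated only in a comment, using the PyTorch Lightning 1.4 Trainer arguments.

```python
# Illustrative sketch: bucketing variable-length inputs into equal-shape
# batches and training with PyTorch Lightning. All classes below are
# placeholders, not the Ariadne implementation.
from collections import defaultdict

import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class ToyTrackDataset(Dataset):
    """Stand-in dataset of variable-length 'track candidate' point sets."""
    def __init__(self, n_samples: int = 1024):
        lengths = torch.randint(3, 10, (n_samples,))
        self.samples = [torch.randn(int(n), 3) for n in lengths]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = self.samples[idx]
        return x, x.sum(dim=1, keepdim=True)  # dummy regression target


def bucketed_batches(dataset, batch_size):
    """Group sample indices by input length so each batch has equal shapes."""
    buckets = defaultdict(list)
    for idx in range(len(dataset)):
        buckets[dataset[idx][0].shape[0]].append(idx)
    batches = []
    for indices in buckets.values():
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    return batches


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(3, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = ToyTrackDataset()
    # Each batch contains only samples of one length, so the default collate
    # function can stack them without padding.
    loader = DataLoader(dataset, batch_sampler=bucketed_batches(dataset, 32))
    # For multi-GPU data parallelism, PyTorch Lightning 1.4 accepts e.g.
    # pl.Trainer(gpus=4, accelerator="ddp"); a distributed-aware sampler
    # would then be required instead of the plain batch_sampler above.
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(ToyModel(), loader)
```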
5. Measured performance impact

After applying the new functionality described above, we measured typical researcher workflows on two target hardware configurations:
● Laptop (MacBook Pro 13) – Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz (8 cores);
● HybriLIT (JINR GOVORUN) – Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (80 cores).

In Table 1, one can observe a more than 25x event processing speed-up compared to the initial implementation and an up to 6x faster 'prepare' phase. In Figure 3, one can observe a significant NN training speed improvement in terms of the f1_score metric for the revised implementation compared to the original: for the same one-hour training on the same data, the same GraphNet model converges much more rapidly with multi-GPU or batch bucketing training.

Table 1. Processing speed and full dataset preparation time for the initial (Old) and revised (New) implementation

Machine  | Processing speed, events per second  | Full dataset (250k events) preparation, minutes
         | Old  | New                           | Old         | New
MacBook  | ~17  | ~120 (x7 speed-up)            | n/a         | n/a
HybriLIT | ~26  | ~630 (x25 speed-up)           | 396 minutes | 62 minutes (x6 speed-up)

Figure 3. F1 score metric for the original implementation (graphnet original) and the revised implementation (graphnet bucketing – training with the batch bucketing routine, graphnet multi-gpu (2) – training on 2 GPU units in parallel, graphnet multi-gpu (4) – training on 4 GPU units in parallel)

6. Conclusion

In our work, we successfully implemented a new 'prepare' module for Ariadne. The module can now run in parallel, utilizing all CPU cores on the target hardware, which led to up to 25x faster event processing for the GraphNet NN model. For the 'training' module, we enabled multi-GPU training and the batch bucketing algorithm, which greatly reduce training time for the existing NN model implementation. These results show great potential for future implementations of other NN approaches within the Ariadne library: users can now utilize more hardware resources, thereby increasing processing capacity for more complex neural network models and preprocessing routines. The source code is available at [7].

7. Acknowledgment

The reported study was funded by RFBR according to research project № 18-02-40101. The calculations were carried out on the basis of the HybriLIT heterogeneous computing platform (LIT, JINR) [8].

References

[1] Goncharov P., Shchavelev E., Ososkov G., Baranov D. BM@N Tracking with Novel Deep Learning Methods // EPJ Web Conf. 2020. Vol. 226. P. 03009. DOI: https://doi.org/10.1051/epjconf/202022603009
[2] Farrell S. et al. Novel deep learning methods for track reconstruction // 4th International Workshop Connecting The Dots 2018 (CTD2018), Seattle, Washington, USA, March 20-22, 2018. arXiv:1810.06111 [hep-ex].
[3] Goncharov P., Schavelev E., Nikolskaya A., Ososkov G. Ariadne: PyTorch library for particle track reconstruction using deep learning // AIP Conference Proceedings. AIP Publishing LLC, 2021. Vol. 2377, No. 1. P. 040004.
[4] The HDF Group. Hierarchical Data Format, version 5, 1997-NNNN. https://www.hdfgroup.org/HDF5/
[5] Khomenko V., Shyshkov O., Radyvonenko O., Bokhan K. Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization. 2016. DOI: 10.1109/DSMP.2016.7583516
[6] Falcon W. and The PyTorch Lightning team. PyTorch Lightning (Version 1.4) [Computer software]. 2019. https://doi.org/10.5281/zenodo.3828935
[7] Ariadne, GitHub repository: https://github.com/t3hseus/ariadne
[8] Adam G. et al. // CEUR Workshop Proceedings. 2018. Vol. 2267. P. 638-644.