Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021



TRACKNETV3 WITH OPTIMIZED INFERENCE FOR BM@N TRACKING

A. Nikolskaia 2,a, P. Goncharov 1, G. Ososkov 1, E. Rezvaya 1, D. Rusov 1, E. Shchavelev 2, D. Baranov 1

1 Joint Institute for Nuclear Research, 6 Joliot-Curie street, 141980, Dubna, Moscow region, Russia
2 Saint Petersburg State University, 7-9 Universitetskaya emb., Saint Petersburg, 199034, Russia

E-mail: a nastia.nikolskaya@gmail.com
Tracking is an important task in the field of High Energy Physics. Modern experiments produce
enormous amounts of data, and classical tracking algorithms cannot reach the required computing
efficiency. This has led to the need for new methods, some of which use neural network models.
In our work we present modifications of the previously developed TrackNETv2 model. This model and its
descendants showed great results on Monte-Carlo simulations of experiments with microstrip-based
GEM detectors: BESIII and BM@N RUN6. In this work we adapt it to the more complex scenario of
BM@N RUN7. The work revealed limitations in the architecture and training procedure, which we
rework in what follows.


Keywords: Deep learning, particle tracking, neural networks, TrackNETv3



Anastasiia Nikolskaia, Pavel Goncharov, Gennady Ososkov, Ekaterina Rezvaya, Daniil Rusov, Egor Shchavelev, Dmitry Baranov

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).








1. Introduction
       Particle tracking is a key part of event reconstruction in High Energy Physics experiments.
The task is to group a set of registered signal points (hits) into subsets, each of which represents one
track.
        Scientists face many difficulties when trying to solve this task. For example, in many
experiments the position of the primary vertex (the interaction point of the beams) is unknown. Some
experiments contain tracks of different lengths, so we need to decide which hit is the last one in a given
track. Other challenges derive from the construction of the experiments; for example, microstrip GEM
detectors produce a large number of fake hits, which do not belong to any particle.
       Fortunately, there are many classical algorithms with tracking efficiency good enough to
overcome the aforementioned difficulties. But most of them are based on the Kalman filter (KF), which
brings new challenges. More and more high-luminosity experiments are being run and many more are
under development. Each produces enormous amounts of data, so classical algorithms cannot
provide a high enough processing speed due to the KF's poor parallelization capacity and high
computational complexity.
       This leads to a new generation of tracking algorithms that involve deep learning methods.
Neural networks have proved their ability to model complex dependencies in different domains. This
approach also has advantages in computational complexity, because all operations boil down to
sequences of matrix multiplications and hence are easily parallelized.
      This approach has been successfully used in several works [1-3] dedicated to particle tracking
in modern experiments. For example, graph and recurrent networks show high tracking efficiency on
Monte-Carlo simulations of the BESIII experiment.
        One of the recurrent networks, TrackNETv2, combines recurrent and convolutional layers to
extract spatial dependencies. In this work the recurrent approach is adapted to the more complex
scenario of the Monte-Carlo simulation of the BM@N experiment. Earlier studies showed that tracks
of different lengths lead to poor results and very slow training; to overcome this, a bucketing
procedure applied before training was proposed in [3].
        The present work shows that this approach cannot be used, because some real tracks get cut
and the model learns unrealistic features of the data. Another problem is that the convolutions in the
"old" model used future hits to predict the current one, so a more natural training procedure and
corresponding changes to the model need to be proposed.

2. Training procedure
         A detailed description of the TrackNETv2 model was first given in [4], but for the sake
of clarity we dive into some details of the architecture in order to highlight its drawbacks. The
input of the network is the coordinates of the points of a track-candidate. The goal of TrackNETv2
is to predict the center of the ellipse on the next coordinate plane in which to search for the
track-candidate's continuation, and to predict the semiaxes of that ellipse. "Next" means the station
after the one that contains the last hit of the track-candidate passed as input to the model. If there
is a hit inside the predicted ellipse, the model is run again to predict an ellipse on the next
coordinate plane, and so on until the track-candidate flies out of the detector's sensitive area or the
stations come to an end. When all track-candidates are built, a special classifier filters out fake
tracks while preserving the real ones. The TrackNETv2 model can thus be treated as a Kalman filter
analogue powered by neural networks.
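For illustration, here is a minimal PyTorch sketch of this kind of architecture (PyTorch is the framework underlying the Ariadne library [9] used in this work): a recurrent encoder over the hit sequence with two output heads, one for the ellipse center and one for its semiaxes. The layer sizes, the choice of GRU cells and the Softplus activation are our assumptions for the sketch, not the exact published configuration.

```python
import torch
import torch.nn as nn

class TrackNetSketch(nn.Module):
    """Illustrative TrackNET-style model: hit sequence in, search ellipses out."""
    def __init__(self, input_features=3, hidden_size=32):
        super().__init__()
        # Recurrent encoder over the sequence of track-candidate hits.
        self.rnn = nn.GRU(input_features, hidden_size,
                          num_layers=2, batch_first=True)
        # Head 1: (x, y) center of the search ellipse on the next station.
        self.center = nn.Linear(hidden_size, 2)
        # Head 2: ellipse semiaxes; Softplus keeps them strictly positive.
        self.semiaxes = nn.Sequential(nn.Linear(hidden_size, 2), nn.Softplus())

    def forward(self, hits):              # hits: (batch, time, 3)
        out, _ = self.rnn(hits)           # out: (batch, time, hidden)
        return self.center(out), self.semiaxes(out)
```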
        The first version of TrackNETv2 was trained on a regression task. The inputs are the sets of
hits of the track-candidates, and the goal of the model is to predict the next hit based on the prior
information in the input. But tracks contain different numbers of hits, so a bucketing procedure
was used for efficient training. Tracks were sorted by length, and each group of tracks with a given
number of hits was reformed to balance the sizes of all groups: randomly selected long tracks were
cut and pushed into a bucket with a smaller track length. Thus, buckets with different track lengths,
from 3 up to the number of stations, were created. Then the last hits of the tracks in every group
were selected as labels for the model prediction and loss calculation, while all remaining hits
served as inputs.
         Such a bucketing procedure is unnatural for training recurrent neural networks (RNNs), not
to mention the need to balance the buckets. Given the structure of RNNs, we decided to rework the
training procedure. The new training algorithm consists of the following steps. First, the whole true
track from the Monte-Carlo data, except the last hit, is passed to the input of the model. At the same
time, all hits except the first one are used as labels for the loss calculation. The model makes a
prediction for each hit: for the first hit it tries to predict the ellipse in which to search for the
second hit, for the first two hits it tries to guess where the third hit is, and so on. Thereby the
network predicts ellipses for every timestamp, and we calculate the loss for every model prediction.
Then we average the loss value across all timestamps and do backpropagation. Such a training
strategy is called many-to-many.
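A minimal sketch of this input/label pairing (tensor shapes are illustrative assumptions):

```python
import torch

track = torch.randn(7, 3)         # one true Monte-Carlo track: 7 hits, (x, y, z)
inputs = track[:-1].unsqueeze(0)  # hits 1..6 -> model input,     shape (1, 6, 3)
labels = track[1:].unsqueeze(0)   # hits 2..7 -> per-step labels, shape (1, 6, 3)
# For each prefix of `inputs` the model predicts an ellipse for the
# corresponding entry of `labels`; the loss is averaged over all 6 steps.
```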
        A single batch of inputs can contain samples of different sizes, so we need to pad the
shorter track-candidates to the length of the longest one in the batch. Usually the padding value
equals zero, so if the batch includes two samples, of length 4 and length 6, the candidate with
four hits will be extended by two hits with all three coordinates set to zero. During the loss
estimation, a special padding mask is calculated, and these timestamps are excluded from the
optimization process.
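The following sketch (not the exact Ariadne code) shows one way to implement such a masked loss; the mean-squared-error choice is an assumption for illustration:

```python
import torch

def masked_mse(pred, target, lengths):
    # pred, target: (batch, time, dim); lengths: real number of steps per sample.
    steps = torch.arange(pred.size(1), device=pred.device)
    mask = (steps[None, :] < lengths[:, None]).float()   # 1 = real step, 0 = padding
    per_step = ((pred - target) ** 2).sum(-1)            # (batch, time)
    return (per_step * mask).sum() / mask.sum()          # mean over real steps only

pred = torch.randn(2, 6, 2)
target = torch.randn(2, 6, 2)
lengths = torch.tensor([4, 6])  # the first sample was padded from 4 to 6 steps
loss = masked_mse(pred, target, lengths)
```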
           We named the improved model TrackNETv3 to highlight the difference from the previous
version.

3. Causality as a consequence
        The TrackNETv2 model uses a standard convolution block to expand the number of input
features for the RNN layers. The receptive field of the TrackNETv2 convolution kernel is 3. With a
receptive field of 3 and "same" padding (1 in this case), at each timestamp the model is forced to
look one step ahead into the future. Thus, if we try to train the network using the many-to-many
strategy, the standard convolution will let the model cheat during training, and the evaluation will
fail.
        From all this we can conclude that the model should be causal, namely that the network can
make a prediction based only on the current and previous timestamps. There are several options for
making the TrackNETv2 model causal. The first and simplest way is to drop the convolution layer and
make the model fully recurrent; looking ahead, this approach allowed us to achieve the best results.
The second option is to replace the standard convolution with a causal convolution [5]. Causal
convolutions restrict the model from violating the ordering in which we model the data. In our case
we use a masking technique to skip the future timestamps.
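One standard way to obtain a causal convolution (used, e.g., in WaveNet [5]; the masking technique mentioned above achieves the same effect) is to pad the sequence only on the left by kernel_size - 1, so the output at step t depends on steps <= t only. A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.left_pad = kernel_size - 1      # pad the past, never the future
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # left padding only => causality
        return self.conv(x)                  # output length equals input length
```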

4. Inference optimization
        During prediction, the TrackNETv2 model had to check all hits on the corresponding station
for each candidate ellipse, which is of O(n^2) complexity. Also, to create the inputs in the
inference phase we were required to use all possible combinations of hits from the first two stations,
which is clearly excessive and is ~O(n^2) as well. Considering these drawbacks, we tried to
optimize our inference procedure.
        Firstly, we can refuse to use all combinations of hits from the first two stations as input seeds.
The new training procedure allows the model to make a prediction already from the first point: the
data encapsulate the statistics required to roughly fit the ellipse starting from the first hit. The
ellipse will be large, but nevertheless much smaller than the station.
        Secondly, we want a fast search for the hits caught by the predicted ellipse. Faiss [6] is a
library for efficient similarity search and clustering of dense vectors. While the model makes
ellipse predictions, we simultaneously put all event hits into a Faiss search index. The next step is
to use the ellipse centers to find the K nearest hits for each center. After finding the K nearest
hits for every ellipse center, we check which of them fall inside the ellipse. Eventually we prolong
the accepted candidates and pass them to the model again. As a result, O(NK) complexity for the
search for track-candidate continuations is reached, where K << N.
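A sketch of this search step in Python; IndexFlatL2, add and search are real Faiss calls, while the data, the 2-D in-plane coordinates and the fixed semiaxes are illustrative assumptions:

```python
import numpy as np
import faiss

K = 5                                                # nearest neighbors per center
hits = np.random.rand(500, 2).astype('float32')      # (x, y) hits on one station
centers = np.random.rand(100, 2).astype('float32')   # predicted ellipse centers
semiaxes = np.full((100, 2), 0.05, dtype='float32')  # predicted semiaxes (a, b)

index = faiss.IndexFlatL2(2)      # exact L2 index over 2-D hit coordinates
index.add(hits)
_, nn_idx = index.search(centers, K)                 # K nearest hits per center

# Of the K neighbors, keep only the hits inside their ellipse:
# ((x - cx)/a)^2 + ((y - cy)/b)^2 <= 1.
for i, (center, ab) in enumerate(zip(centers, semiaxes)):
    candidates = hits[nn_idx[i]]                     # (K, 2)
    inside = (((candidates - center) / ab) ** 2).sum(axis=1) <= 1.0
    accepted = candidates[inside]  # hits used to prolong this track-candidate
```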
        The number of nearest neighbors (NNs) is a parameter that plays a significant role: a small
number of NNs gives lower efficiency but works faster, while a greater number of NNs gives higher
efficiency but slower inference. We discovered that 5 NNs is the optimal value for the
efficiency-speed tradeoff.


5. Experiments
         In the first part of this section we present the validation results of TrackNETv3. We compare
different datasets, loss coefficients, optimizers and models in order to choose the best combination.
The last part of the section is dedicated to the results of testing the model on new data unseen during
training.
        Two metrics are used for the TrackNETv3 validation: hit efficiency and ellipse area. Hit
efficiency is the fraction of hits that are located inside the ellipses predicted by the model. Hit
efficiency should tend to one while, on the contrary, the ellipse area must be as small as possible.
So we should find a tradeoff between ellipse area and hit efficiency, keeping in mind that the
efficiency must be near 0.99.
        The first part of the comparison concerns the datasets. For the training, validation and
evaluation phases we used hits from the Monte-Carlo simulation of RUN 7 of the BM@N experiment [7].
We generated about 750K events for training and validation of the model. We used events with a beam
energy of 3.2 GeV, considering interactions of an argon beam with a lead target (ArPb). The magnet
current was set to 1250 A. The target position was generated with the following parameters: X
(mean: 0.7 cm, std: 0.33 cm), Y (mean: -3.7 cm, std: 0.33 cm), Z (center: -1.1 cm, thickness: 0.25 cm).
RUN 7 has a detector configuration with 6 GEM stations and 3 silicon stations before the GEM
ones, i.e. 9 detector planes in total. The multiplicity of events reaches up to 100 tracks per event,
with 37 tracks per event on average. The number of hits can exceed 500 per station, and about 68% of
all hits are fakes. In our experiments we preprocessed the data and created three versions of the
dataset: “balanced”, “unbalanced”, and “normalized”. The “balanced” dataset is the version of the
original data in which we balance the sizes of the groups of tracks of different lengths. The
“unbalanced” dataset is a preprocessed version of the original data and may be considered the
original dataset. The “normalized” dataset contains hit coordinates normalized to the boundaries of
the detector area. It should be noted that the “normalized” dataset is also balanced.
         We also varied the alpha parameter of the loss function, which weights the cost of ellipse
fitting against the ellipse area. The optimal value for the “normalized” dataset should be much
smaller, otherwise the predicted ellipses will be too big. In figures 1a-1b we can see that the
“unbalanced” version of the dataset with alpha equal to 0.95 is slightly better than the others, but
these plots do not show the ellipse area, which is crucial too. Since the ellipse areas for the
“normalized” dataset are smaller relative to the station size, we chose the “normalized” dataset.
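To make the role of alpha concrete, here is a hedged sketch of such a weighted objective; the exact terms of the TrackNETv2/v3 loss may differ, this only illustrates how alpha trades the fit term against the area term:

```python
import torch

def weighted_ellipse_loss(center, semiaxes, target, alpha=0.95):
    # Fit term (assumed form): squared distance of the true hit from the ellipse
    # center, scaled by the semiaxes so a hit inside the ellipse scores below 1.
    fit = (((target - center) / semiaxes) ** 2).sum(-1).mean()
    # Area term (up to the constant pi): penalizes overly large ellipses.
    area = (semiaxes[..., 0] * semiaxes[..., 1]).mean()
    return alpha * fit + (1.0 - alpha) * area
```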








   Figure 1. Validation results of the hit efficiency at different steps of the training phase. a)
 Comparison of the different datasets. b) Comparison of the different alpha values for the “balanced”
 dataset. c) Comparison of the different optimizers. d) Comparison of the different types of models.

        Figure 1c shows the comparison between different optimizers: Adam, AdamW and AdamP
[8]. We trained the model with Adam longer than with the other optimizers; nevertheless, the results
of Adam are the same as or better than those of AdamP and AdamW, and in addition Adam performs
faster in terms of wall-clock time.
        We also used two types of models: OnlyRNN and CNNRNN. Plot 1d illustrates the hit
efficiency of the models. The OnlyRNN model is slightly better in terms of efficiency. When
comparing the ellipse sizes we found that the OnlyRNN model, with an ellipse area of 3.947, is much
better than CNNRNN with 4.678.
       So the best dataset-alpha-optimizer-model combination, in our opinion, is the “normalized”
dataset with alpha 0.01, the Adam optimizer, and the OnlyRNN model.
         To evaluate the resulting model we measured two metrics: efficiency and purity. Efficiency is
the fraction of tracks for which 70% of the hits or more were reconstructed correctly. Purity is the
fraction of true tracks among all tracks that the model considered to be true; purity is the inverse of
the ghost rate, since ghost rate = 1 - purity. We measured the metrics on 2000 events. The results of
the best model are: efficiency 0.9830, purity 0.0209.
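A minimal sketch of these two metrics as defined above (the matching of reconstructed candidates to true tracks is assumed to be done beforehand; names are illustrative):

```python
def efficiency(hit_fractions, threshold=0.7):
    # hit_fractions: per true track, the share of its hits reconstructed correctly.
    return sum(f >= threshold for f in hit_fractions) / len(hit_fractions)

def purity(n_true_candidates, n_all_candidates):
    # Share of true tracks among all found candidates; ghost rate = 1 - purity.
    return n_true_candidates / n_all_candidates
```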
        All experiments were done with the tools provided by the Ariadne library [9].






6. Conclusion
        We revised our training approach and optimized the inference procedure in order to reduce
its algorithmic complexity. The new TrackNETv3 model achieves an efficiency of 0.9830 and a purity
of 0.0209 on the BM@N RUN7 Monte-Carlo data. At the moment we have not yet developed a track-
candidate classifier that would work as a filter for fake tracks and increase the purity, but we will
develop it in the future.


7. Acknowledgment
        The calculations were carried out on the basis of the HybriLIT heterogeneous computing
platform (LIT, JINR) [10].
        The reported study was funded by RFBR according to the research project № 18-02-40101.


References
[1] Rezvaya E., Goncharov P., Schavelev E., Denisenko I., Ososkov G., Zhemchugov A. The LOOT
model for primary vertex finding in the BES-III inner tracking detector // AIP Conference
Proceedings. – AIP Publishing LLC, 2021. – Vol. 2377. – No. 1. – P. 060005.
[2] Nikolskaya A., Bakina O., Denisenko I., Goncharov P., Nefedov Y., Ososkov G., Shchavelev E.,
Zhemchugov A. Global strategy of tracking on the basis of graph neural network for BES-III CGEM
inner detector // AIP Conference Proceedings. – AIP Publishing LLC, 2021. – Vol. 2377. – No. 1. –
P. 060001.
[3] Nikolskaia A., Schavelev E., Goncharov P., Ososkov G., Nefedov Y., Zhemchugov A.,
Denisenko I. Local strategy of particle tracking with TrackNETv2 on the BES-III CGEM inner
detector // AIP Conference Proceedings. – AIP Publishing LLC, 2021. – Vol. 2377. – No. 1. –
P. 060004.
[4] Goncharov P., Ososkov G., Baranov D. Particle track reconstruction with the TrackNETv2 //
AIP Conference Proceedings. – AIP Publishing LLC, 2019. – Vol. 2163. – No. 1. – P. 040003.
[5] Oord A. et al. WaveNet: A generative model for raw audio // arXiv preprint arXiv:1609.03499. –
2016.
[6] Johnson J., Douze M., Jégou H. Billion-scale similarity search with GPUs // IEEE Transactions
on Big Data. – 2019.
[7] Kapishin M. et al. Studies of baryonic matter at the BM@N experiment (JINR) // Nuclear
Physics A. – 2019. – Vol. 982. – pp. 967-970.
[8] Heo B. et al. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant
weights // arXiv preprint arXiv:2006.08217. – 2020.
[9] Goncharov P., Schavelev E., Nikolskaya A., Ososkov G. Ariadne: PyTorch library for particle
track reconstruction using deep learning // AIP Conference Proceedings. – AIP Publishing LLC,
2021. – Vol. 2377. – No. 1. – P. 040004.
[10] Adam G. et al. // CEUR Workshop Proceedings. – 2018. – Vol. 2267. – pp. 638-644.



