Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

ANALYSIS OF THE EFFECTIVENESS OF VARIOUS METHODS FOR PARALLELIZING DATA PROCESSING IMPLEMENTED IN THE ROOT PACKAGE

T.M. Solovjeva

Joint Institute for Nuclear Research, 6 Joliot-Curie St., Dubna, Moscow Region, 141980, Russia

E-mail: tanysha@jinr.ru

The ROOT software package is currently being upgraded in several ways to improve data processing performance. This paper considers several tools implemented in this framework for calculations on modern heterogeneous computing architectures. PROOF (Parallel ROOT Facility) divides the common work into small chunks, i.e. packets. The size of the first packet used for calibration, the minimum and maximum packet size, and the degree of data structuring all affect the processing speed. When processing large amounts of data, the read and write speed can be crucial. The asynchronous file merge feature of the TBufferMerger class allows data to be written in parallel from multiple threads to a single output file. Our calculations show good scalability of the macro execution time with the number of processor cores used.

Keywords: ROOT, PROOF, parallelization

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The ROOT [1] software package plays a central role in the analysis of high-energy physics data. At the dawn of its development, ROOT was designed as a single-threaded application. However, with the growth of the amount of processed data and the development of computing technology, it became obvious that the framework needed to be modernized to take advantage of modern computing architectures. At present, multiprocessor computing systems, which are sets of fairly powerful computers with distributed memory and distributed control, are widespread. To use them effectively, one has to take into account the peculiarities of data structuring and the sequence of processing stages for each specific analysis.

This article explores various parallelization techniques implemented in the ROOT package to improve data processing performance. Performance tests were carried out on the HybriLIT heterogeneous cluster [2] of the Meshcheryakov Laboratory of Information Technologies of the Joint Institute for Nuclear Research.

2. Data structuring for processing

PROOF [3], the Parallel ROOT Facility, is a special ROOT tool that exploits the natural parallelism of data structures located in files of a special format that provides direct access to any individual value. In our previous work [4], we compared the efficiency of data processing using PROOF, a special threading class, and the OpenMP technology. That study showed the high performance of PROOF when the analysis contained a large number of mathematical operations and was performed on a large amount of data. The present article studies whether the PROOF performance depends on the way the data file is organized, and also considers the possibilities of speeding up data writing when this process is parallelized. The calculations were performed with ROOT 6.13 and PROOF-Lite.
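As an illustration of the processing setup, a minimal PROOF-Lite sketch is given below. The tree name, the input file pattern and the selector MySelector are hypothetical placeholders, not the actual macros used in our tests; the second argument of TProof::Open is one common way to fix the number of workers.

```cpp
// Minimal PROOF-Lite sketch (hypothetical file names and selector)
#include "TChain.h"
#include "TProof.h"

void run_prooflite()
{
   // Start a local PROOF-Lite session with six worker processes
   TProof::Open("lite://", "workers=6");

   // Chain the input files and route their processing through PROOF;
   // "tree" and the file pattern are placeholders for the real data
   TChain chain("tree");
   chain.Add("data_*.root");
   chain.SetProof();

   // MySelector is a hypothetical TSelector compiled on the fly with ACLiC
   chain.Process("MySelector.C+");
}
```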
For these calculations, six CPU cores were reserved on the HybriLIT cluster. PROOF was run with different numbers of workers. As long as the number of started workers did not exceed the number of cores, each worker process fully occupied a processor core. If PROOF was started with more workers than the number of reserved cores, several workers naturally shared one processor core. The acceleration factor (the ratio of the execution time of a macro in the non-parallel version to its execution time in the parallel version) grew actively as the number of workers increased from two to six, after which its value stabilized. When PROOF was started with more than 12 workers, the acceleration factor decreased smoothly. The data shown in the graphs were obtained with the number of workers equal to six.

To represent data, ROOT provides the special class TTree, which is optimized to reduce disk space and increase the data access rate. All variables are presented in the form of leaves, which are combined into branches, and the collection of branches forms a tree. The tree storage approach suits parallel architectures well, since each branch can be read independently of the others.

A tree can have different structures, and we were interested in how the organization of data in a tree affects the effectiveness of using PROOF. We considered a simple tree with a list of variables, a tree with structures, and a tree with class objects (see the sketch at the end of this section). All of our trees contained twelve variables structured in different ways. The simple tree had twelve branches, each with a single floating-point variable as a leaf. In the second case, the tree had three branches: the first two contained three leaves each, combined into structures, and the third branch was a structure of six float variables. In the third case, the tree had one branch that contained an object consisting of all twelve variables. The data were contained in ten files, each approximately 1.6 GB in size.

We performed two analyses. In the first, data were read from all branches, processed, and the result was written to a new file. In the second, only two variables were read; for the tree with structures we considered two reading options: in the first, the two variables belonged to the same structure, and in the second, they belonged to different structures. The processing included calculations of various physical quantities commonly used in different types of physical analysis.

Figure 1 shows the acceleration factor of the macro execution as a function of the amount of processed data: on the left for the first analysis, on the right for the second analysis.

Figure 1. Dependence of the macro acceleration on the type of data structuring

As seen from the graphs, when all data are read and processed using PROOF, the acceleration factor reaches its maximum value if the data are organized into structures. If only individual columns of the ROOT files are read and processed, the maximum speedup factor is achieved when the data are organized in the form of a tree containing objects.
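To make the three branching schemes compared above concrete, the following sketch shows one possible way to declare them with the TTree interface. The variable names are invented for illustration, and TVector3 merely stands in for the twelve-variable class used in the actual tests.

```cpp
// Sketch of the three branching schemes (illustrative names only)
#include "TFile.h"
#include "TString.h"
#include "TTree.h"
#include "TVector3.h"

struct Triplet { float x, y, z; };   // stand-in for the structures used in the tests

void make_trees()
{
   TFile f("structured.root", "RECREATE");

   // 1) Simple tree: twelve branches, one float leaf each
   float v[12] = {0};
   TTree flat("flat", "twelve independent float branches");
   for (int i = 0; i < 12; ++i)
      flat.Branch(TString::Format("v%d", i), &v[i], TString::Format("v%d/F", i));

   // 2) Tree with structures: leaves grouped into branches via leaf lists
   Triplet a, b;
   float c[6] = {0};
   TTree grouped("grouped", "leaves grouped into structures");
   grouped.Branch("a", &a, "ax/F:ay/F:az/F");
   grouped.Branch("b", &b, "bx/F:by/F:bz/F");
   grouped.Branch("c", c, "c0/F:c1/F:c2/F:c3/F:c4/F:c5/F");

   // 3) Tree with an object: a single branch holding a class instance
   //    (here TVector3 replaces the twelve-variable class of the tests)
   TVector3 *vec = new TVector3();
   TTree objtree("objtree", "single object branch");
   objtree.Branch("vec", &vec);

   flat.Write();
   grouped.Write();
   objtree.Write();
   f.Close();
}
```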
3. Parallel reading and writing

Different types of analysis spend different fractions of time on the individual processing steps. When large volumes are processed, the time spent on reading or writing data sometimes comes to the fore. New possibilities for parallelizing these processes are actively used in the optimization of HEP software [5-7].

A distinctive feature of ROOT that makes it convenient to use is its columnar data format. The user can read the data of only those columns that are of interest. In the process of reading, the data are unpacked and deserialized. Starting from version 6.08, functions that perform these actions in parallel have been implemented in ROOT. The user only has to specify the number of threads; all implementation details are hidden, which is why this approach is called Implicit Multithreading.

For parallel writing, the TBufferMerger class, available in ROOT since version 6.10, is used (a minimal code sketch is given at the end of this section). This class implements the following scheme. The user specifies how many threads writing data to a single output file should be created. The spawned worker threads fill separate buffers, in each of which the data are serialized and compressed. These operations occur independently in each buffer, so they can be performed in parallel. The buffers are merged before the output file is closed. In the original version, the merge was performed by a dedicated output thread; later a scheme was implemented in which the merge is done on demand by the worker threads themselves, which made it possible to reduce the required computing resources.

Our performance tests consisted of generating data, organizing it into a tree with twelve branches, and then writing it to a file. Figure 2 shows, on the left, the dependence of the running time of the macro on the number of worker threads and the number of generated events; the dependence of the acceleration factor on the same quantities is shown on the right.

Figure 2. Dependence of the running time of the macro and the acceleration factor on the number of worker threads

The results presented in the graphs show that the processing time decreases as the number of created threads grows. When the number of worker threads exceeds the number of cores in use, the reduction in macro runtime stops. The acceleration factor is proportional to the number of available processor cores and does not depend on the number of generated events.
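A minimal sketch of such a parallel-writing macro is given below. It assumes a recent ROOT release, where TBufferMerger lives in the ROOT namespace (in older releases it was ROOT::Experimental::TBufferMerger), and for brevity it fills a single branch rather than the twelve branches used in our tests.

```cpp
// Sketch: several threads writing one output file through TBufferMerger
#include <thread>
#include <vector>
#include "ROOT/TBufferMerger.hxx"
#include "TROOT.h"
#include "TRandom3.h"
#include "TTree.h"

void write_parallel(int nWorkers = 6, int nEventsPerWorker = 1000000)
{
   ROOT::EnableThreadSafety();                   // required before spawning threads
   ROOT::TBufferMerger merger("parallel_out.root");

   auto work = [&](int seed) {
      auto file = merger.GetFile();              // thread-local in-memory file
      float x = 0.f;
      TTree tree("events", "generated events");  // attached to the merger file
      tree.Branch("x", &x, "x/F");
      TRandom3 rng(seed);
      for (int i = 0; i < nEventsPerWorker; ++i) {
         x = rng.Gaus();
         tree.Fill();
      }
      file->Write();                             // queue this buffer for merging
   };

   std::vector<std::thread> workers;
   for (int w = 0; w < nWorkers; ++w)
      workers.emplace_back(work, w + 1);
   for (auto &t : workers)
      t.join();                                  // merger flushes when it goes out of scope
}
```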
4. Conclusion

Currently, parallel computing is the main reserve for increasing the speed of calculations, so the parallelization of the ROOT package is of great importance. Modern architectures of computing systems make it possible to use various methods of parallelizing the processing of experimental data, and choosing the optimal method in each specific case significantly reduces the task execution time. Our research has shown that when processing data with PROOF, it is desirable to structure the primary data as much as possible. When working with PROOF-Lite, it is reasonable to set the number of workers equal to the number of reserved cores. Writing to a file in parallel is easy to implement, and the resulting speedup depends on the number of cores used.

5. Acknowledgement

The author is grateful to her colleagues for a fruitful discussion of the issues, as well as to the HybriLIT group of MLIT JINR for the provided resources and technical support.

References

[1] Brun R., Rademakers F. ROOT – An object oriented data analysis framework // Nuclear Instruments and Methods in Physics Research A. 1997. V. 389. P. 81-86.
[2] Adam G., Korenkov V., Podgainy D., Streltsova O., Strizh T., Zrelov P. HybriLIT – The main component of the MICC for heterogeneous computations at JINR // CEUR Workshop Proceedings (CEUR-WS.org). 2017. V. 2023. P. 351-356.
[3] Ballintijn M., Roland G., Brun R., Rademakers F. The PROOF distributed parallel analysis framework based on ROOT // Computing in High Energy and Nuclear Physics, 24-28 March 2003, La Jolla, California.
[4] Solovjeva T., Soloviev A. Comparative study of the effectiveness of PROOF with other parallelization methods implemented in the ROOT software package // Computer Physics Communications. 2018. V. 233. P. 41-43.
[5] Amadio G., Bockelman B., Canal P., Piparo D., Tejedor E., Zhang Z. Increasing Parallelism in the ROOT I/O Subsystem // Journal of Physics: Conference Series. 2018. V. 1085. P. 032014.
[6] Riley D., Jones C. Multi-threaded Output in CMS using ROOT // EPJ Web of Conferences. 2019. V. 214. P. 02016.
[7] Amadio G., Canal P., Guiraud E., Piparo D. Writing ROOT Data in Parallel with TBufferMerger // EPJ Web of Conferences. 2019. V. 214. P. 05037.