Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


           PERFORMANCE TESTING FRAMEWORK FOR THE ATLAS EVENTINDEX
                               E. Cherepanova1,a, F. Prokoshin1
                          On behalf of the Software and Computing Activity
               1 Joint Institute for Nuclear Research, Joliot-Curie 6, RU-141980 Dubna, Russia

                                E-mail: a elizaveta.cherepanova@cern.ch

The ATLAS EventIndex is going to be upgraded in advance of LHC Run 3. A framework for testing
the performance of both the existing system and the new system has been developed. It generates
various queries (event lookup, trigger searches, etc.) on sets of EventIndex data and measures the
response times. The dependence of the response time on the amount of requested data, and on the
data sample type and size, can be studied. Performance tests run regularly on the existing EventIndex
and will run on the new system when it is ready. The results of the regular tests are displayed on
monitoring dashboards, and they can raise alarms in case (part of) the system misbehaves or becomes
unresponsive.

Keywords: Scientific computing, BigData, Hadoop, EventIndex


                                                                 Elizaveta Cherepanova, Fedor Prokoshin


                                                              Copyright © 2021 for this paper by its authors.
                     Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1. Introduction
        The ATLAS EventIndex [1] is the tool that collects, checks and stores information about the
main properties of all real or simulated events that were collected, processed or generated by the
ATLAS experiment [2], and points to the files that contain them. The current storage implementation
is based on storing the full data in the Hadoop [3] system and reduced information in Oracle [4] tables.
The Hadoop system runs a variety of tasks, such as importing and cataloguing data, running
consistency checks, establishing links between related datasets and responding to users' queries.
Interactions with users are high-priority tasks: the system should be available and accessible under
various conditions and respond within a time appropriate to the request. The daily
access statistics are shown in Figure 1. To track the current system state and evaluate its
performance, a variety of tests were developed.


              Figure 1. Daily access statistics of the Hadoop system by different EventIndex
                                  services between May and June 2021

2. Data for tests
        The tests are carried out using two types of input data:
       • a key, defined as a runNumber-eventNumber pair, e.g., “278880-251772208”;
       • a full dataset name, e.g.,
         “data16_13TeV.00299584.physics_Main.deriv.DAOD_HIGG1D1.r9264_p3083_p4096”.
        A list of about 50 000 keys prepared for a physics analysis was used for the tests. It
includes real data recorded in 2015-2018. For the tests, several samples were made: samples with keys
from all years, and samples for several selected datasets with a total size of 1 million events.
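Both input types have a regular structure. The sketch below splits them into their parts; it is an illustration, not part of the EventIndex client. It assumes that a key is always two non-negative integers joined by a dash, and that a dataset name has the six dot-separated fields seen in the example above (project, run number, stream, production step, data type, tags). The names parse_key, parse_dataset_name, EventKey and DatasetName are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed key format: "<runNumber>-<eventNumber>", both non-negative integers.
KEY_PATTERN = re.compile(r"^(\d+)-(\d+)$")


class EventKey(NamedTuple):
    run_number: int
    event_number: int


class DatasetName(NamedTuple):
    project: str      # e.g. "data16_13TeV"
    run_number: int   # e.g. 299584 (zero-padded in the name)
    stream: str       # e.g. "physics_Main"
    prod_step: str    # e.g. "deriv"
    data_type: str    # e.g. "DAOD_HIGG1D1"
    tags: list[str]   # e.g. ["r9264", "p3083", "p4096"]


def parse_key(text: str) -> EventKey:
    """Split a runNumber-eventNumber key into its two integers."""
    match = KEY_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a runNumber-eventNumber key: {text!r}")
    return EventKey(int(match.group(1)), int(match.group(2)))


def parse_dataset_name(text: str) -> DatasetName:
    """Split a dot-separated dataset name into six assumed fields."""
    parts = text.strip().split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated fields, got {len(parts)}")
    project, run, stream, prod_step, data_type, tags = parts
    return DatasetName(project, int(run), stream, prod_step, data_type, tags.split("_"))
```

For example, parse_key("278880-251772208") gives run 278880 and event 251772208, and the dataset name above yields the data type "DAOD_HIGG1D1" with tags r9264, p3083 and p4096. Rejecting malformed input early keeps bad keys out of the test lists.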

3. Types of queries
      The EventIndex allows searching data using different types of input information and different
commands. Two commands are used in the tests:
       • EventLookup (el), for fast search of the physical datasets corresponding to an event (specified
         by a key, i.e. a pair of run number and event number).
       • EventIndex (ei), to search all datasets using either direct searches or complex Map/Reduce
         jobs. It accepts either a key or a dataset name as input.
        All performed queries can be split into two groups:
       • Event picking (performed using the “el” and “ei” commands)
         - Fast search of events using keys.
         - Returns the GUIDs (Globally Unique Identifiers) of the files containing the events, the full
             dataset names and the data types (RAW, AOD, DAOD).
       • Search for events in a specified dataset (performed using the “ei” command)
         - Returns full information about each event (GUID, data type, production step, luminosity
             block, time, trigger chains, etc.).
         - Different filters can be applied.
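The relation between input type, client command and query group described above fits in a small lookup table. The following is a hypothetical helper for a test driver, not part of the EventIndex client: it assumes that a key matches digits-dash-digits and that any other string containing dots is a dataset name. The names input_type, commands_for and query_group are illustrative.

```python
import re

# Assumed formats: a key is "<runNumber>-<eventNumber>"; a dataset name is dot-separated.
KEY_RE = re.compile(r"^\d+-\d+$")

# Input types accepted by each client command, as listed in Section 3.
ACCEPTED_INPUTS = {
    "el": {"key"},
    "ei": {"key", "dataset"},
}

# Query group implied by each input type.
QUERY_GROUP = {
    "key": "event picking",
    "dataset": "search for events in a specified dataset",
}


def input_type(text: str) -> str:
    """Classify an input string as 'key' or 'dataset' (hypothetical heuristic)."""
    text = text.strip()
    if KEY_RE.match(text):
        return "key"
    if "." in text:
        return "dataset"
    raise ValueError(f"unrecognised input: {text!r}")


def commands_for(text: str) -> list[str]:
    """Return the client commands that accept this input, sorted by name."""
    kind = input_type(text)
    return sorted(cmd for cmd, kinds in ACCEPTED_INPUTS.items() if kind in kinds)


def query_group(text: str) -> str:
    """Return the query group implied by this input."""
    return QUERY_GROUP[input_type(text)]
```

A key such as “278880-251772208” can be sent to both “el” and “ei”, whereas a dataset name can only go to “ei”.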




4. Performance tests
4.1 EventLookup
         The event lookup is performed through the “el” client command. The search runs every
hour as a cron job. Each time, eight lists of keys are selected randomly to avoid using cached
results. A summary of the data used and of the test results is given in Table 1.

                        Table 1. Data used for the EventLookup tests and test results, 2021/06/18 – 2021/06/25
    Data               Source of keys              Number of keys           Average execution time, s
data 2015       1 million events dataset        1 000                      73.8
data 2015       several datasets                1 000                      37.7
data 2017       several datasets                1 000                      21.3
data 2018       1 million events dataset        10                         4.2
data 2018       1 million events dataset        100                        7.5
data 2018       1 million events dataset        1 000                      30.5
data 2018       1 million events dataset        10 000                     399.0
data 2018       1 million events dataset        50 000                     964.2
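The random selection of the eight key lists can be sketched as follows. This is an assumption-based illustration, not the framework's code: the plan mirrors the eight rows of Table 1, while the pools of available keys per (year, source), the function name draw_key_lists and the error handling are hypothetical. Keys are drawn without replacement with the standard random.sample.

```python
import random

# Hypothetical plan mirroring the eight rows of Table 1: (year, key source, number of keys).
TEST_PLAN = [
    ("2015", "1M dataset", 1_000),
    ("2015", "several datasets", 1_000),
    ("2017", "several datasets", 1_000),
    ("2018", "1M dataset", 10),
    ("2018", "1M dataset", 100),
    ("2018", "1M dataset", 1_000),
    ("2018", "1M dataset", 10_000),
    ("2018", "1M dataset", 50_000),
]


def draw_key_lists(pools, plan, rng=None):
    """Draw one fresh random key list per plan entry.

    pools maps (year, source) to the keys available for that sample.
    Keys are drawn without replacement, so each list holds distinct keys,
    and a new draw on every run makes repeated cached answers unlikely.
    """
    if rng is None:
        rng = random.Random()
    lists = []
    for year, source, n_keys in plan:
        pool = pools[(year, source)]
        if n_keys > len(pool):
            raise ValueError(f"pool {year}/{source} has only {len(pool)} keys")
        lists.append(rng.sample(pool, n_keys))
    return lists
```

An hourly wrapper would call draw_key_lists once per run and pass each list to the “el” command; how the keys are handed to the client is outside this sketch.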

        The results of the tests are displayed on Grafana [5] dashboards. Figure 2 shows the response
times of the Hadoop server to event lookup queries selecting 1000 events out of a dataset with one
million records or out of a mixture of several datasets, as a function of time. The occasional glitches are due
to other activities on the servers at the time of the queries. The response times are dominated by the
query time for low numbers of events, and by the transmission time of the output for large numbers of
events.
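The effect of the fixed query cost is visible directly in Table 1. The short computation below divides each average execution time for the 2018 single-dataset rows by the number of keys requested; it only rearranges the published numbers and introduces no new measurements. The names TABLE1_2018 and seconds_per_key are illustrative.

```python
# Average execution times (s) from the 2018 "1 million events dataset" rows of Table 1,
# keyed by the number of keys requested.
TABLE1_2018 = {10: 4.2, 100: 7.5, 1_000: 30.5, 10_000: 399.0, 50_000: 964.2}


def seconds_per_key(times):
    """Divide each average execution time by the number of keys it covered."""
    return {n_keys: t / n_keys for n_keys, t in times.items()}


if __name__ == "__main__":
    for n_keys, cost in seconds_per_key(TABLE1_2018).items():
        print(f"{n_keys:>6} keys: {cost:.4f} s/key")
```

The per-key cost falls from 0.4200 s at 10 keys to 0.0305 s at 1000 keys and stays of that order for larger requests (0.0399 s and 0.0193 s), which fits a fixed per-query overhead dominating small requests.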


 Figure 2. Response times of the EventIndex Hadoop server to event lookup queries selecting 1000 events
  out of a dataset with 1 million records or a mixture of datasets as a function of time, recorded between
                                        2021/06/18 and 2021/06/25
4.2 EventIndex queries. No selection
         Event picking and full dataset information retrieval run once a day as cron jobs.
These queries are performed through the “ei” client command. The key search uses one
list of 10 keys for each year. The full dataset information retrieval uses datasets of 10k
(50k), 100k, 1M and 10M events for 2015-2018 data.
        Figure 3 shows the results of the tests for the 2015 and 2017 datasets. The response times are
dominated by the setup time of the Map/Reduce job for small numbers of events, and by the
transmission time of the output records for large numbers of events.
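Which term dominates can be made explicit with a simple two-term model. This is an illustrative assumption, not a fit to the measured times: t_0 stands for the fixed Map/Reduce setup time and c for the output transmission time per record, so the response time for N requested events is roughly

```latex
t(N) \approx t_0 + c\,N,
\qquad
t(N) \approx
\begin{cases}
t_0, & N \ll N^{*},\\[2pt]
c\,N, & N \gg N^{*},
\end{cases}
\qquad
N^{*} = \frac{t_0}{c}.
```

Below the crossover N^* the setup term dominates; above it the response time grows roughly linearly with N.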




Figure 3. Response times of the EventIndex Hadoop server to queries using Map/Reduce jobs to retrieve
information on all events from datasets containing 10k, 100k, 1M and 10M events as a function of time,
recorded between 2021/06/11 and 2021/06/25. Left: 2015 data, right: 2017 data
4.3 EventIndex queries. Trigger selection
        The retrieval of full dataset information for events that passed a trigger selection runs
as a cron job. These queries are performed through the “ei” client command. A summary of the
data used is given in Table 2.
                                              Table 2. Data used for EventIndex tests with trigger selection
    Data                Total number of events     % of events passing trigger     Running
data 2015              500k, 1M, 10M, 50M          <1                              twice a day
data 2016              500k, 1M, 10M, 50M          <1                              twice a day
data 2017              500k, 1M, 10M, 50M          <1                              twice a day
data 2018              500k, 1M, 10M, 50M          <1                              twice a day
data 2018 fast         500k                        <1                              hourly
data 2018 long         50M                         <40                             once a day
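Table 2 also fixes the daily load that the trigger-selection tests put on the cluster. The sketch below counts queries per day from the table rows, under the explicit assumption (not stated in the table) that every listed dataset size is queried once per scheduled run. The names RUNS_PER_DAY, TABLE2 and queries_per_day are illustrative.

```python
# Runs per day for each schedule label in Table 2.
RUNS_PER_DAY = {"hourly": 24, "twice a day": 2, "once a day": 1}

# Rows of Table 2: (data, dataset sizes, schedule).
TABLE2 = [
    ("data 2015", ["500k", "1M", "10M", "50M"], "twice a day"),
    ("data 2016", ["500k", "1M", "10M", "50M"], "twice a day"),
    ("data 2017", ["500k", "1M", "10M", "50M"], "twice a day"),
    ("data 2018", ["500k", "1M", "10M", "50M"], "twice a day"),
    ("data 2018 fast", ["500k"], "hourly"),
    ("data 2018 long", ["50M"], "once a day"),
]


def queries_per_day(rows):
    """Count trigger-selection queries per day.

    Assumption: every listed dataset size is queried once per scheduled run.
    """
    return sum(len(sizes) * RUNS_PER_DAY[schedule] for _, sizes, schedule in rows)
```

Under that assumption, queries_per_day(TABLE2) gives 57 queries per day: 32 from the four yearly rows, 24 from the hourly “data 2018 fast” probe and 1 from the “data 2018 long” search.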

        “Data 2018 fast” is used to check the system availability. A 2018 dataset with 500k events,
of which less than 1% pass a specified trigger, was chosen because of its short execution time of
about 2 minutes (see Figure 4).
        “Data 2018 long” is used to check how the system handles a heavy task. A 2018 dataset with
50M events, of which less than 40% pass a specified trigger, was chosen because of its long
execution time of about 4 hours (see Figure 4).
         The response times of the Hadoop server to queries searching events that passed specified
triggers from datasets containing 500k and 50M events are shown in Figure 4 as a function of time.
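The paper does not specify how alarms are derived from the availability probe. A minimal sketch of one possible rule, assuming a failed run or a run much slower than recent ones should raise an alarm, is shown below. The tolerance factor of 3, the fallback to the roughly 2-minute duration quoted above, and the name availability_alarm are hypothetical choices.

```python
from statistics import median
from typing import Optional, Sequence

# Reference duration for the "data 2018 fast" query (about 2 minutes, Section 4.3).
TYPICAL_FAST_SECONDS = 120.0


def availability_alarm(recent: Sequence[float], latest: Optional[float],
                       factor: float = 3.0) -> bool:
    """Return True if the newest fast-query run suggests a problem.

    recent: durations (s) of earlier successful runs.
    latest: duration (s) of the newest run, or None if it failed or timed out.
    factor: hypothetical tolerance relative to the baseline duration.
    """
    if latest is None:
        return True  # no answer at all: treat as unavailable
    # Use the median of recent runs as the baseline, falling back to ~2 minutes.
    baseline = median(recent) if recent else TYPICAL_FAST_SECONDS
    return latest > factor * baseline
```

Using the median rather than the mean keeps a single slow run, such as those caused by other activity on the servers, from inflating the baseline.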


Figure 4. Response times of the EventIndex Hadoop server to queries using Map/Reduce jobs searching
events that passed specified triggers from datasets containing 500k and 50M events as a function of time,
recorded between 2021/06/18 and 2021/06/25. The right axis applies to the long search, the left axis applies
to the other queries




4.4 Dependence of the response time on the way queries are run
        There are two ways of running queries: locally on the Hadoop cluster or remotely through the
Internet via a Tomcat server [6]. The tests described above run locally; however, a regular user
usually runs queries remotely.
       For comparison, some of the tests were also launched remotely. Figure 5 shows the response
times for event lookup queries on 2015 data and for EventIndex queries with trigger selection on
2018 data.


Figure 5. The dependence of response time on the way of running queries. Left: Event lookup queries for
2015 data: Blended – keys retrieved from several datasets, 1M Dataset – keys retrieved from one dataset
with a million events. Right: EventIndex queries with trigger search for 2018 data
        Remote execution is faster for almost all event lookup queries. The larger the number of keys
searched, the closer the response times of local and remote queries become. For EventIndex
queries with trigger selection, local and remote response times are similar.
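The comparison above is qualitative. One way to make it reproducible is to label each paired measurement by its relative difference, as in the hypothetical helper below. The 10% tolerance for calling two times "similar" and the function names are assumptions, and the values in any example call are illustrative, not measurements.

```python
def relative_difference(local_s: float, remote_s: float) -> float:
    """Return (remote - local) / local; negative means the remote run was faster."""
    if local_s <= 0:
        raise ValueError("local time must be positive")
    return (remote_s - local_s) / local_s


def compare(pairs, tolerance: float = 0.1) -> list[str]:
    """Label each (local, remote) pair of response times in seconds.

    Pairs within the hypothetical tolerance are labelled 'similar'.
    """
    labels = []
    for local_s, remote_s in pairs:
        d = relative_difference(local_s, remote_s)
        if abs(d) <= tolerance:
            labels.append("similar")
        elif d < 0:
            labels.append("remote faster")
        else:
            labels.append("local faster")
    return labels
```

Applied to the stored local and remote results, such labels would let the observation that remote event lookups are usually faster be tracked over time rather than read off the plots.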
5. Conclusions
         Performance tests for the existing ATLAS EventIndex system have been developed. Three
types of tests run as cron jobs:
       • event lookup key search (once per hour);
       • EventIndex key search and EventIndex full dataset information retrieval (once per day);
       • EventIndex dataset search with trigger selection (hourly, twice a day or once a day,
         depending on the dataset).
         The results of the first and third types are displayed on Grafana dashboards, where the current
state of the system is easy to track. Most of the locally running tests are also run remotely. The
response times show that, for event picking tasks, running queries locally can be slower than running
them remotely. The results of the tests are stored in Hadoop and can be used for comparison with the
performance of the newly developed system implementation.
References
[1] Barberis D et al 2015 The ATLAS EventIndex: architecture, design choices, deployment and first
operation experience, J. Phys.: Conf. Ser. 664 042003, doi:10.1088/1742-6596/664/4/042003
[2] ATLAS Collaboration 2008 The ATLAS Experiment at the CERN Large Hadron Collider, JINST 3
S08003, doi:10.1088/1748-0221/3/08/S08003
[3] Hadoop and associated tools: http://hadoop.apache.org
[4] Oracle: https://www.oracle.com
[5] Grafana: https://grafana.com
[6] Tomcat: https://tomcat.apache.org

