=Paper=
{{Paper
|id=Vol-3041/223-228-paper-41
|storemode=property
|title=Development of the ATLAS Event Picking Server
|pdfUrl=https://ceur-ws.org/Vol-3041/223-228-paper-41.pdf
|volume=Vol-3041
|authors=Evgeny Alexandrov,Igor Alexandrov,Dario Barberis,Fedor Prokoshin,Aleksandr Yakovlev
}}
==Development of the ATLAS Event Picking Server==
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

E.I. Alexandrov 1,a, I.N. Alexandrov 1, D. Barberis 2, F.V. Prokoshin 1, A.V. Yakovlev 1, on behalf of the ATLAS Software and Computing Activity

1 Joint Institute for Nuclear Research, Joliot-Curie 6, RU-141980 Dubna, Russia
2 Università di Genova and INFN, Via Dodecaneso 33, I-16146 Genova, Italy

E-mail: a aleksand@jinr.ru

During LHC Run 2 the ATLAS experiment collected almost 20 billion real data events and produced about three times more simulated events. During physics analysis it is often necessary to retrieve one or more events to inspect their properties in detail and check their reconstruction parameters. Occasionally it is also necessary to select larger samples of events in RAW format and reconstruct them with enhanced code. The new Event Picking Server automates the procedure of finding the location of the events using the EventIndex and submitting the Grid jobs that retrieve the individual events from the files in which they are stored.

Keywords: Scientific computing, BigData, EventIndex

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1. Introduction===

The ATLAS experiment is one of the four large experiments at the CERN LHC accelerator [1]. The ATLAS EventIndex [2], the catalogue of all ATLAS recorded and simulated events, has been operating since 2015 and is continuously improved to keep up with the increasing data flow. Its main use case is event picking: finding one or more events among the many billions produced by the ATLAS experiment, stored in several million files, and extracting them. One of the latest additions to the EventIndex project infrastructure is a new Event Picking Service that automates the event picking procedure for cases when the amount of requested data is large.

Several physics analysis workflows can use massive event picking to select a set of interesting events from the wealth of ATLAS data and reprocess them with enhanced algorithms, or to save additional variables that can help downstream analysis steps. One example in 2019 was the “γγ→WW” analysis, which required the extraction of 50k events in RAW format out of the 18 billion events of Run 2 (about 10 million files). All the steps needed to look up the events in the EventIndex, submit the PanDA [3] event picking jobs, monitor them and retry them when timing out (because of long tape staging delays) were executed manually. A second round of that analysis now requires 136k events. The main goal of the Event Picking Service is to perform all these actions automatically. A user only has to supply the information needed to find the requested events: a file containing the run and event numbers, the data format of the requested data, the project name, the trigger stream if raw data are requested, and the version of the requested events if data other than raw are requested.
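As an illustration, such a request could contain a file of run-event number pairs plus a few parameters. The sketch below is invented for this write-up; the exact file layout and parameter names of the real service may differ:

 # run_number  event_number   (one pair per line; all values invented)
 358031  1234567890
 358031  1234567891
 364292  2224448886
 
 data format:    RAW
 project:        data18_13TeV
 trigger stream: physics_Main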
===2. Architecture of the Event Picking Service===

After analyzing the current request, and taking into account the possibility of other similar requests, the following requirements for the Event Picking Service were formulated:
● the service consists of two parts, the Web Server and the Daemon, which are independent and communicate through a database;
● only the Web Server communicates with the client;
● only the Daemon communicates with protected systems (like PanDA and Rucio [4]);
● the Daemon has a flexible workflow, which can be modified only through the database;
● the results of all steps of the workflow are stored in the database;
● the input data of any step are the output of the previous one; the input of the first step is the client's request.

Figure 1. The architecture of the Event Picking Service

The architecture of the Event Picking Service is shown in Figure 1. A client submits a request to the server using a form on an HTML page and receives an identifier that can later be used to retrieve the details of that request. The Web Server writes the new request to the database. The Daemon monitors the database for new requests; when it detects one, it starts performing tasks for that request according to the workflow read from the database. Input and output data, logs and results of all tasks are stored in the database and can be viewed by a user or an administrator through the Web Server. After all tasks of the workflow are completed, or critical errors occurred during their execution, the Daemon sends an HTTP request to the Web Server. In response, the Web Server collects all the necessary information from the database and generates a message to the user (or administrator) about the results of the request.

===3. Workflow===

The Event Picking Service is not intended to solve one specific task but a whole class of event picking tasks, so the steps needed for different requests may differ. As practice shows, it can even become necessary to modify the workflow while a request is being executed; it is therefore important that the workflow can be changed dynamically. After analyzing typical requests, two types of tasks were identified. Tasks of the first type use the full input data and must be performed sequentially. Tasks of the second type use only a portion of the input and can be performed in parallel. Usually whole chains of sequential tasks, rather than individual tasks, run in parallel: within such a chain each task uses all the data of the previous task, while the first task of the chain uses only a portion of the data of the parent task. In the Event Picking Service a task of the first type is called a Job; a chain of sequential tasks that can run in parallel is called a Chain. Accordingly, the service has two types of workflow, the Job workflow and the Chain workflow, stored in separate database tables. The basic idea behind both workflows is that each entry consists of an input state, an output state, and the name of the method that executes the task. The service searches the workflow for the entry whose input state matches the current state of the selected task, dynamically starts the corresponding code and, upon completion, changes the status to the output state of that entry, as illustrated by the sketch below.
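The following minimal Java sketch illustrates this state-driven dispatch. All class, method, table and column names are assumptions made for the example; they are not taken from the actual service code.

 // Minimal sketch of the state-driven dispatch (assumed names, not the actual
 // service code): each workflow row maps an input state to a method name and
 // an output state, and the named method is started by reflection.
 import java.lang.reflect.Method;
 import java.sql.*;
 
 public class WorkflowDispatcher {
     /** One row of a workflow table: input_state -> method_name -> output_state. */
     record Step(String inputState, String methodName, String outputState) {}
 
     private final Connection db;
     private final Object tasks;  // object whose methods implement the workflow tasks
 
     public WorkflowDispatcher(Connection db, Object tasks) {
         this.db = db;
         this.tasks = tasks;
     }
 
     /** Find the workflow entry whose input state matches the current state. */
     Step findStep(String workflowTable, String currentState) throws SQLException {
         String sql = "SELECT input_state, method_name, output_state FROM "
                    + workflowTable + " WHERE input_state = ?";
         try (PreparedStatement ps = db.prepareStatement(sql)) {
             ps.setString(1, currentState);
             try (ResultSet rs = ps.executeQuery()) {
                 if (!rs.next()) return null;  // no entry: terminal state, nothing to run
                 return new Step(rs.getString(1), rs.getString(2), rs.getString(3));
             }
         }
     }
 
     /** Run one step: invoke the named method by reflection, then advance the state. */
     String executeStep(Step step, long jobId) throws Exception {
         Method m = tasks.getClass().getMethod(step.methodName(), long.class);
         m.invoke(tasks, jobId);     // the task reads its input from, and writes its output to, the database
         return step.outputState(); // the new state to be recorded for the job
     }
 }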
Figure 2. The workflow of event picking for the “γγ→WW” analysis

Figure 2 shows the event picking workflow for the “γγ→WW” analysis. The first job sorts the entire input list of run-event number pairs. The second one splits it by run number and processes the per-run subsets in parallel. For each run the following tasks are performed: get the GUIDs from the run and event numbers, get the dataset template from the GUIDs, start copying the events using PanDA, and check the output files to make sure that all events have been copied. During the second stage of the analysis, manual intervention in the service was needed for some datasets to resolve errors. Most of these errors can be resolved automatically after an upgrade of the service, which we intend to implement in the next update cycles.

===4. Database===

As mentioned earlier, the Event Picking Service has two main components, the Web Server and the Daemon, which interact through a PostgreSQL database [5]. In addition, the database stores the input and output data, the states of all requests, jobs, chains and tasks, the logs, and all workflow definitions. Figure 3 shows the structure of the database of the Event Picking Service. All main elements of the workflow have their own tables storing the current state, links to the input/output data and so on. Input and output data are stored in separate tables. Logs are stored only for real tasks.

Figure 3. The structure of the database of the Event Picking Service

===5. Daemon===

The Daemon performs all actions not directly related to the user; this requires access to ATLAS protected systems like PanDA, Rucio and the EventIndex. The Daemon of the Event Picking Service is written as a Java application and must run continuously; all new data and commands are retrieved from the database. The structure of the Daemon is shown in Figure 4. The main class is the Root controller, which monitors new requests and creates the first job of each new request. The Job controller keeps track of new jobs, finds the workflow for each of them and starts executing it. The Chain controller finds the chain workflow and starts the execution of the tasks corresponding to that workflow. The sketch after Figure 4 illustrates this polling pattern.

Figure 4. The structure of the Event Picking Daemon
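A minimal Java sketch of one controller's polling loop follows. The table and column names (request, job, state) and the state values are assumptions for this example; the actual schema of the service is the one outlined in Figure 3.

 // Minimal sketch of the Root controller's polling loop (assumed schema, not
 // the actual service code): find requests in state NEW, create their first
 // job and mark them as submitted, then sleep and poll again.
 import java.sql.*;
 
 public class RootController implements Runnable {
     private final Connection db;
 
     public RootController(Connection db) { this.db = db; }
 
     @Override
     public void run() {
         while (!Thread.currentThread().isInterrupted()) {
             try {
                 pollNewRequests();
                 Thread.sleep(30_000);     // poll the database periodically
             } catch (InterruptedException e) {
                 Thread.currentThread().interrupt();
             } catch (SQLException e) {
                 e.printStackTrace();      // a real daemon would log the error and retry
             }
         }
     }
 
     /** Find requests in state NEW and create the first job for each of them. */
     private void pollNewRequests() throws SQLException {
         try (PreparedStatement ps = db.prepareStatement(
                  "SELECT id FROM request WHERE state = 'NEW'");
              ResultSet rs = ps.executeQuery()) {
             while (rs.next()) {
                 long requestId = rs.getLong("id");
                 createFirstJob(requestId);  // the Job controller picks this job up later
                 markSubmitted(requestId);
             }
         }
     }
 
     private void createFirstJob(long requestId) throws SQLException {
         try (PreparedStatement ps = db.prepareStatement(
                  "INSERT INTO job (request_id, state) VALUES (?, 'NEW')")) {
             ps.setLong(1, requestId);
             ps.executeUpdate();
         }
     }
 
     private void markSubmitted(long requestId) throws SQLException {
         try (PreparedStatement ps = db.prepareStatement(
                  "UPDATE request SET state = 'SUBMITTED' WHERE id = ?")) {
             ps.setLong(1, requestId);
             ps.executeUpdate();
         }
     }
 }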
===6. Web Server===

The Web Server is implemented on Apache Tomcat and is based on the Web Application Lego Toolkit (WALT) [6]. It does not need to run on the same machine as the Daemon; its only requirement is database access. The Web Server provides two forms, one for creating a new request and one for monitoring. To create a new request, the user specifies several parameters and uploads the file with the run-event number pairs. On the monitoring page the user can see the list of all requests with their states, as well as the details of a selected request (Figure 5). The details include the resulting dataset name (if the request completed correctly) and information on each job. Selecting a job shows information about its chains and about the tasks of those chains; this page also provides the error logs of the tasks.

Figure 5. Monitoring of requests for the Event Picking Service

===7. Conclusions===

A prototype of the Event Picking Service has been developed and was used for the second stage of the “γγ→WW” analysis (136k events). The experience gained in this stage showed that some errors still require manual intervention; the Event Picking Service will be upgraded to handle such problems automatically.

===References===

[1] ATLAS Collaboration 2008 The ATLAS Experiment at the CERN Large Hadron Collider, JINST 3 S08003, doi:10.1088/1748-0221/3/08/S08003

[2] Barberis D et al 2015 The ATLAS EventIndex: architecture, design choices, deployment and first operation experience, J. Phys.: Conf. Ser. 664 042003, doi:10.1088/1742-6596/664/4/042003

[3] Barreiro Megino F et al 2017 ATLAS WORLD-cloud and networking in PanDA, J. Phys.: Conf. Ser. 898 052011, doi:10.1088/1742-6596/898/5/052011

[4] Barisits M et al 2019 Rucio: Scientific Data Management, Comput. Softw. Big Sci. 3 11, doi:10.1007/s41781-019-0026-3

[5] PostgreSQL: https://www.postgresql.org/

[6] Sokolov I et al 2021 WALT Platform for Web Application Development, these proceedings