Data Aggregation Platform for Experiments of Astroparticle Physics⋆

Victoria Tokareva[0000-0001-6699-830X], Andreas Haungs[0000-0002-9638-7574], Donghwa Kang[0000-0002-5149-9767], Frank Polgart[0000-0002-9324-7146], Doris Wochele[0000-0001-6121-0632], and Jürgen Wochele[0000-0003-3854-4890]⋆⋆

Karlsruhe Institute of Technology, Institute for Nuclear Physics, 76021 Karlsruhe, Germany
victoria.tokareva@kit.edu

Abstract. The big data revolution has overturned well-established approaches to data analysis and intensified the demand for access to heterogeneous data. Modern methods allow for the extraction of new knowledge from the data, which enables researchers to approach many previously unsolved mysteries. This trend, observed in many areas of human activity, is also tangible in the field of astroparticle physics. Combined analysis of various experimental data allows researchers to derive deeper insights into the processes occurring in the universe and to extend the borders of our knowledge about nature. Providing the infrastructure for such investigations is a topical issue for the astroparticle physics community. In this report we examine a service for the aggregated retrieval of heterogeneous data from the distributed storages of numerous astroparticle physics experiments. We describe its architecture, the available data, its principles of operation, and its interaction with users and data centers.

Keywords: Astroparticle physics · Data engineering · Data curation · Asynchronous data processing · Big data.

⋆ This work was supported by the Helmholtz Society Grant HRSF-0027.
⋆⋆ The authors acknowledge the help of the colleagues from the projects Astroparticle Physics Data Storage (APPDS) (esp. A. Kryukov[0000-0002-1624-6131], A. Mikhailov[0000-0003-4057-4511], M.D. Nguyen[0000-0002-5003-3623], A. Shigarov[0000-0001-5572-5349]), Tunka-Rex Virtual Observatory (TrVO) (esp. Dmitriy Kostunin[0000-0002-0487-0076]) and the SCC GridKa infrastructure at KIT.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Since the moment of their discovery in 1912, cosmic rays (CR) have been studied quite thoroughly. Nevertheless, many mysteries related to them remain the subject of active community research, including the CR spectrum and mass composition, the behaviour of matter at ultra-high energies not achievable in terrestrial accelerators, and the mechanisms of CR acceleration and propagation as well as their origin. Numerous astroparticle physics experiments around the world record the particle cascades generated in the interactions of relativistic cosmic rays with the atmosphere of planet Earth.

A valuable trend in the search for insights into the above-mentioned matters, which includes the development of methods for the combined analysis of observations from multiple components (messengers) of cosmic radiation, is called multi-messenger astroparticle physics. Other modes of combined analysis are encouraging as well, as they increase the statistics of the observables and positively influence the accuracy of the analysis.

Thus, the development of a unified infrastructure for aggregated access to heterogeneous experimental data becomes of much importance.
Such an infrastructure would integrate data from heterogeneous, geographically distributed storages, be resilient and horizontally scalable, and support the entire data life cycle from creation to archiving or destruction, i.e., it would satisfy the FAIR (findable-accessible-interoperable-reusable) [1] model of data curation.

In the framework of the international German-Russian Data Life Cycle Initiative [2], a data aggregation system has been developed that provides access to data from diverse data centers located in Germany and Russia, which store data from astroparticle physics experiments. In this report we consider the infrastructure that aggregates data from the KASCADE Cosmic Ray Data Centre (KCDC) [3], the Tunka-133 experiment [4] and the Tunka-Rex Virtual Observatory (TrVO) [5].

2 Data aggregation framework

2.1 Framework structure and components

The developed data aggregation system includes the following elements: the setup (facility), the data center of the experiment (storage), the data extractor, the adaptor, the application programming interface by GRADLC (GRADLC API) and the metadata database (MDDB). In a simplified form, the interaction of these elements is shown in Fig. 1.

Let us examine the elements from Fig. 1 and their interaction in more detail. A storage is a data center or a database of an astroparticle physics experiment, where data of any reconstruction level registered by the experimental setups (facilities) are recorded.

Infrastructures for data storage can be loaded with internal search queries of collaborators, with the processing of new data, and with other computationally intensive tasks. Under these conditions, performing complex search queries on the storage side is undesirable, since it can lead to a decrease in the performance of the storage. To reduce the search load, an approach was proposed to use a simplified database on the side of the aggregator, called the metadata database (MDDB), which stores metadata for the data from the experimental storages.

Fig. 1. The architecture design of the GRADLC platform

This approach is based on the two-level metadata system described in [2], which includes physical-level metadata, such as datetime, file type etc., and event-level metadata, such as event id, setup, energy, and many more.

A program called the extractor is used to transfer information from the repositories to the MDDB. The schedule for running the extractor depends on the frequency of data updates in the repository.

When a request is received to retrieve data from the system, a search for the relevant parameters is performed in the MDDB in order to find the universally unique identifiers (UUIDs) of the events that correspond to the selection criteria. Next, a list of unique identifiers for each experiment is passed to the corresponding adaptor, which connects to the repository, extracts the necessary data and transfers it to the aggregator.

The interaction of the user with the system and with individual components within the system is organized through the API, which will be discussed in detail in section 2.4.

It is important to note that, in order to maintain simplicity, Fig. 1 shows the interaction in the case of a single data storage. However, the approach we developed allows interaction with multiple repositories, and new ones can be added easily: in this case, a custom extractor and adapter correspond to each new data storage.
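To make the role of the MDDB more concrete, the following sketch shows how a lookup of event UUIDs against event-level metadata could look. It is purely illustrative: the table layout, the field names and the use of SQLite are our assumptions and are not taken from the actual GRADLC implementation.

```python
import sqlite3
import uuid

# Illustrative schema: one row of event-level metadata per registered event.
# The field names (setup, datetime, energy, zenith) mirror the cut parameters
# listed in section 2.2; the real MDDB schema may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS event_metadata (
    uuid     TEXT PRIMARY KEY,
    setup    TEXT,      -- e.g. 'KASCADE', 'Tunka-133'
    datetime TEXT,      -- ISO 8601 timestamp
    energy   REAL,      -- assumed unit: log10(E/eV)
    zenith   REAL       -- degrees
);
"""

def find_uuids(conn, setup, energy_range, zenith_range):
    """Return the UUIDs of events of one setup that pass the given cuts."""
    rows = conn.execute(
        "SELECT uuid FROM event_metadata "
        "WHERE setup = ? AND energy BETWEEN ? AND ? AND zenith BETWEEN ? AND ?",
        (setup, *energy_range, *zenith_range),
    )
    return [r[0] for r in rows]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Insert a couple of fake metadata records for demonstration only.
    for setup, energy, zenith in [("KASCADE", 15.2, 12.0), ("Tunka-133", 16.1, 25.0)]:
        conn.execute(
            "INSERT INTO event_metadata VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), setup, "2012-10-25T00:00:00", energy, zenith),
        )
    print(find_uuids(conn, "KASCADE", (14.0, 17.0), (0.0, 20.0)))
```

Only the resulting UUID lists leave the MDDB; the bulky experimental data themselves stay in the storages and are fetched later by the adaptors, which keeps the search load away from the data centers.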
Currently, the users of the service have access to the data of the Tunka-133 experiment, KCDC, and TrVO. Users of the interrelated APPDS project [6] also have certain access to the data from the TAIGA HiScore and TAIGA IACT setups [7]. The available data will be discussed in more detail in the next section.

2.2 Data available

Currently, the system provides access to the data from the following experiments: KASCADE, KASCADE-GRANDE [8], LOPES [9], Tunka-133 [4] and Tunka-Rex [10], to special compilations of these data (the dataset COMBINED, which includes data from the KASCADE and KASCADE-GRANDE experiments, based on the analysis by Sven Schoo [12] and published in the KCDC PENTARUS 1.0 release [11]), as well as to simulations for some of these setups. The observables used for making basic cuts are committed to the MDDB. Thanks to the two-level data access model, users can access extremely heterogeneous data such as Tunka-Rex binary traces, ROOT files with simulation results, or high-level reconstruction data, e.g. for the Tunka-133 or KASCADE-GRANDE experiments. Some data, like raw data traces or energy deposits of separate detectors, are not included in the available cuts.

Table 1. Possible data cuts for available experimental data

Setup      | Data cuts available in GRADLC aggregator
KASCADE    | Datetime, Energy, Electron number, Muon number, Zenith, Azimuth, Core distance, Shower age
GRANDE     | Datetime, Zenith, Azimuth, Number of charged particles, Muon number, Core distance, Shower age
COMBINED   | Datetime, Energy, Zenith, Azimuth, Core distance, Electron number, Muon number, Shower age
LOPES      | LopesCompID, Datetime, Azimuth EW, Azimuth NS, CCheight EW, CCheight NS, ConeAngle EW, ConeAngle NS, EfieldMaxAbs, Eps EW, Eps NS, Elevation EW, Elevation NS, Eta EW, Eta NS, Geomagnetic Angle, Geomagnetic AngleG, NCCbeamAntennas EW, NCCbeamAntennas NS, Reconstruction, RmsCCbeam EW, RmsCCbeam NS, Xheight EW, Xheight NS
Tunka-133  | Datetime, Energy, Zenith, Azimuth, Core distance, X max
Tunka-Rex  | Datetime, station id

The system of joint data retrieval allows the user to upload data from several setups in an overlapping range of values. The list of available cuts for all the data available through the site is presented in Table 1. The available ranges of values can be found in the official GRADLC API documentation [13]. A list of possible selection parameters for simulations is presented in Table 2.

Table 2. Possible parameters for simulation retrieval

Parameter | Possible values
Datasets  | KASCADE, KASCADE-GRANDE, COMBINED
Particles | 'gamma', 'proton', 'helium', 'carbon', 'silicon', 'iron'
Models    | QGSjet-II-02, QGSjet-II-04, EPOS 1.99, EPOS LHC, SIBYLL 2.1, SIBYLL 2.3, SIBYLL 2.3c
Energy    | 1.0 × 10^14 to 1.0 × 10^18 eV, 5.62 × 10^17 to 3.16 × 10^18 eV

To analyse the retrieved data, one can use the analysis framework [14], which was also integrated into KCDC within the GRADLC project. A more detailed description of the data can be found on the official websites of the experiments and related data centers or virtual observatories, as well as in the official articles of the mentioned collaborations.
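As an illustration of the joint data retrieval over an overlapping range of values, the sketch below builds one cut specification and applies it to two setups. The key names only mirror the cuts of Table 1 and the date range of the example in Fig. 4; the literal parameter names and valid ranges are defined in the GRADLC API documentation [13].

```python
import json

def joint_selection(setups, datetime_range, energy_range, zenith_range):
    """Build one cut specification applied to several setups in an
    overlapping range of values, as described in section 2.2."""
    cuts = {
        "datetime": list(datetime_range),
        "energy_log10_eV": list(energy_range),   # assumed unit: log10(E/eV)
        "zenith_deg": list(zenith_range),        # assumed unit: degrees
    }
    # The same cuts are applied to every requested setup.
    return {setup: dict(cuts) for setup in setups}

if __name__ == "__main__":
    spec = joint_selection(
        ["KASCADE", "Tunka-133"],
        ("2010-10-25T00:00:00", "2012-12-10T00:00:00"),
        (15.0, 17.0),
        (0.0, 20.0),
    )
    print(json.dumps(spec, indent=2))
```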
2.3 Asynchronous data processing and interaction with storages

Let us examine in more detail the behaviour of the system when it receives a data retrieval request. As shown in Fig. 2, the user sends a request to the aggregator using the GRADLC API. On the aggregator side, a unique request identifier is generated, which is later used by the system to perform all actions associated with processing the request. After that, the request is added to the aggregator database with the request status "Scheduled", and the request identifier is returned to the user to allow further actions associated with it (see section 2.4).

On the backend, the scheduler daemon notices that a new query has been added to the aggregator database and starts its processing. For this purpose, if there are free computing resources on the server, a process is created to retrieve the requested data from the storages, as described in more detail in Fig. 3. The status of the request is changed to "Running". It should be noted that the length of the queue and the number of simultaneously launched tasks are limited both on the side of the aggregator and on the side of the storages for technical reasons, and this may constitute a bottleneck of the system.

Fig. 2. Action diagram of the asynchronous parallel data processing.

The scheduler monitors the running processes and, upon completion of a process, changes its request status in the system to "Failed" in case of failure or to "Finished" in case of successful completion. For successfully completed processes, the query results are written to the server and archived for further download by the user. For unsuccessful processes, the intermediate files created during the processing are removed.

Let us take a closer look at the process of data extraction by user request, shown in Fig. 3. The scheduler passes the query parameters to an instance of the data aggregator process, which executes the query to the MDDB and receives the list of UUIDs corresponding to the specified parameters as its response. Next, the aggregator requests the adapters of the necessary data storages for the direct upload of the required experimental data. It should be noted that new repositories can be included quite easily: this requires a relatively easy-to-write repository adapter, a metadata extractor (see section 2.1) and an indexation of the repository data with UUIDs.

Fig. 3. Aggregation of distributed heterogeneous data from distributed data storages

2.4 Interfaces

Application programming interface. In order to provide user interaction with the aggregator, an application programming interface, the GRADLC API, was implemented, employing the JSON-RPC 2.0 [15] remote procedure call protocol with data transmission over HTTP. There are five methods one can request through the aggregator API: data extraction, request status, list of requests, cancelling a request, and data download. Examples of the requests and detailed explanations can be found in the official documentation [13]. Two days after a request was made, the related data is deleted from the server automatically and the query gets the status "Expired". There are six possible statuses a query can have in the system: "Scheduled", "Running", "Finished", "Failed", "Expired", and "Deleted".
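The following sketch shows how a client could drive this asynchronous workflow over JSON-RPC 2.0: schedule an extraction, poll the request status, and fetch the download link once the request is finished. The endpoint URL, the method names ("extract_data", "get_status", "download") and the parameter layout are assumptions made for illustration; the actual method names and request examples are given in the official documentation [13].

```python
import time
import requests

# Placeholder endpoint; see the GRADLC API documentation [13] for the real one.
API_URL = "http://gradlc.example.org/api"

def rpc(method, params, request_id=1):
    """Send one JSON-RPC 2.0 call over HTTP and return its 'result' field."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    response = requests.post(API_URL, json=payload, timeout=30)
    response.raise_for_status()
    body = response.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]

if __name__ == "__main__":
    # 1. Schedule a data-extraction task and remember its identifier.
    task = rpc("extract_data", {"setup": "Tunka-133", "zenith_deg": [0.0, 20.0]})
    task_id = task["task_id"]

    # 2. Poll the request status while the asynchronous processing is running.
    status = "Scheduled"
    while status in ("Scheduled", "Running"):
        time.sleep(10)
        status = rpc("get_status", {"task_id": task_id})["status"]

    # 3. Retrieve the download link for a successfully finished request.
    if status == "Finished":
        print(rpc("download", {"task_id": task_id})["link"])
    else:
        print(f"Request ended with status: {status}")
```

The polling loop reflects the status lifecycle described above; a client only needs to remember the request identifier returned by the first call.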
Web user interface. To achieve a better user experience, an extended web graphical user interface (Web GUI) was developed. Some examples of user interaction with it are shown in Figs. 4 and 5.

Fig. 4. Building a new data upload query in the GRADLC web user interface.

Fig. 4 shows an example of creating a data upload request. To do so, one proceeds to the "New task" window of the web GUI. Then one opens the tab with the name of the experiment of interest, ticks its name inside the tab and specifies the desired values of the request parameters. Once all the desired values are chosen in all the necessary tabs, a request can be sent by pressing the "Create data upload task" button. To automate data uploads, it may also be useful to generate a correct JSON-RPC request from the GUI. This can be done by pressing the "Get JSON" button after the parameter selection.

To manage one's data upload requests, the "Requests list" section of the web GUI is used, which is shown in Fig. 5. Here one can obtain the list of all data upload queries that one has made. In the table one can see all requests, their status, and their start and completion times. A request can be cancelled by clicking the "Delete" link in the interface. Clicking on the task id shows the details of the requested data.

It is worth noting as well that the web GUI provides the user with some additional possibilities, such as account management in the system (i.e. registration, login, password changes, sign out) and documentation. The web interface can be especially valuable for improving the user experience for a general audience without a strong programming background, who feel more comfortable interacting with graphical interfaces. Thus, the developed aggregation system can also be used for outreach purposes.

Fig. 5. Request management in the GRADLC web user interface.

3 Outlook

The developed software supports the aggregation of heterogeneous data from a variety of geographically spread data storages mentioned above. The features of the service, such as faster data retrieval employing a metadata search concept, multiple filters for data search, asynchronous multithreaded data processing, as well as the different possibilities of interaction between the system and the user, make it a valuable product, which could be used as a service for automatic data retrieval in large-scale research projects as well as a stand-alone application for individual outreach projects.
Our future plans include employing advanced data management tools for message brokering and system status monitoring, as well as the possible integration of other physics experiments into the GRADLC data aggregation platform. More information about the project can be found on the project page [16].

References

1. Haungs, A.: Towards a global analysis and data centre in Astroparticle Physics. In: this proceedings, pp. ?–?. Publisher, Location (2020)
2. Bychkov, I., et al.: Russian-German Astroparticle Data Life Cycle Initiative. Data 3(4), 56 (2018)
3. KASCADE Cosmic Ray Data Centre, https://kcdc.ikp.kit.edu. Last accessed 27 June 2020
4. Antokhonov, B.V., Berezhnev, S.F., Besson, D., et al.: TUNKA-133: A new array for the study of ultra-high energy cosmic rays. Bull. Russ. Acad. Sci. Phys. 75, 367 (2011)
5. Bezyazeekov, P.A., et al. (Tunka-Rex Collaboration): In: Proceedings of the 3rd International Workshop on Data Life Cycle in Physics, Irkutsk, Russia, 2019. CEUR-WS 2406, 3 (2019)
6. Kryukov, A.: AstroDS - A Distributed Storage for Astrophysics of Cosmic Rays. Current Status. In: this proceedings, pp. ?–?. Publisher, Location (2020)
7. Budnev, N., et al. (TAIGA Collaboration): TAIGA the Tunka Advanced Instrument for cosmic ray physics and Gamma Astronomy - present status and perspectives. JINST 9, C09021 (2014)
8. Homepage of KASCADE-Grande, https://web.ikp.kit.edu/KASCADE/. Last accessed 27 June 2020
9. LOPES - A LOFAR Prototype Station Homepage, https://www.astro.ru.nl/lopes/. Last accessed 27 June 2020
10. Schröder, F.G., et al.: Tunka-Rex: Status, Plans, and Recent Results. EPJ Web of Conferences 135, 01003 (2017)
11. KCDC Announcements - ChangeLogs, https://kcdc.ikp.kit.edu/announcements/changeLogs/. Last accessed 27 June 2020
12. Schoo, S.: Energy Spectrum and Mass Composition of Cosmic Rays and How to Publish Air-Shower Data. PhD thesis, KIT, Karlsruhe (2016)
13. GRADLC API Documentation, http://141.52.67.147:55000/web/doc/. Last accessed 27 June 2020
14. Polgart, F.: Analysis framework for KCDC. In: this proceedings, pp. ?–?. Publisher, Location (2020)
15. JSON-RPC 2.0 Specification, https://www.jsonrpc.org/specification. Last accessed 27 June 2020
16. GRADLC project web page, http://141.52.67.147:55000. Last accessed 27 June 2020