=Paper=
{{Paper
|id=Vol-2679/short10
|storemode=property
|title=The Current Design and Implementation of the AstroDS Data Aggregation Service
|pdfUrl=https://ceur-ws.org/Vol-2679/short10.pdf
|volume=Vol-2679
|authors=Minh-Duc Nguyen,Alexander Kryukov,Andrey Mikhailov
}}
==The Current Design and Implementation of the AstroDS Data Aggregation Service==
The Current Design and Implementation of the AstroDS Data Aggregation Service

Minh-Duc Nguyen1[0000-0002-5003-3623], Alexander Kryukov1[0000-0002-1624-6131], and Andrey Mikhailov2[0000-0003-4057-4511]

1 Skobeltsyn Institute of Nuclear Physics, Lomonosov Moscow State University, Moscow, Russia, nguyendmitri@gmail.com
2 Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences, Irkutsk, Russia

Supported by the Russian Science Foundation, grant #18-41-06003. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. AstroDS is a distributed storage for cosmic-ray astrophysics. The primary goal of AstroDS is to gather the data measured by the instruments of various physical experiments, such as TAIGA, TUNKA, and KASCADE, into global storage and to provide users with a standardized, user-friendly interface for finding the datasets that match certain conditions. AstroDS consists of a set of distributed microservice components that communicate with each other over the Internet via a REST API. The core component of AstroDS is the Data Aggregation Service, which orchestrates the other components to provide access to data. The development of AstroDS started in 2019. This paper describes the current design and implementation of the Data Aggregation Service and the benefits it brings to the astrophysical community at this early stage.

Keywords: Distributed storage · Data warehouse · Data acquisition · Astroparticle physics.

1 Introduction

The Russian–German Astroparticle Data Life Cycle Initiative [1] is an international project whose aim is to develop an open science system called ASTROPARTICLE.ONLINE that enables users to publish, store, search, select, and analyze astroparticle data coming from various experiments located worldwide. One of the main parts of the system is the Astroparticle Physics Distributed Data Storage (AstroDS) [2], which gathers the data measured by the instruments of physical experiments such as TAIGA [4] and KASCADE-Grande [5] into global storage and allows users to search for a specific dataset and retrieve it via a standardized, storage-independent API.

AstroDS consists of a set of distributed microservices that interact with each other to provide users with a smooth data acquisition experience. One of the core services of AstroDS is the Data Aggregation Service (the Service), which coordinates the others to fulfill data queries from users [3]. This paper describes the current design and implementation of the Service.

2 Design

Fig. 1. The architecture design of the Data Aggregation Service

The overall view of the Service's components is presented in Fig. 1. On the frontend side, the Service receives data lookup queries from users through the Web User Interface. The content of a query is a set of search filters defined by the user. For example, a query can contain IACT01 (the code name of a facility of the TAIGA experiment), two timestamps identifying the start and the end of a period of interest, and the weather conditions under which the data were collected at the experiment's site. On the backend side, the Web User Interface is served by an HTTP Server, which acts as a proxy layer that passes the queries on to the different instances of the Service.
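To make the query flow concrete, the sketch below shows how a client could submit such a data lookup query to the HTTP Server, which then forwards it to a Service instance. This is a minimal illustration only: the endpoint URL and the filter names in the query are assumptions, not the Service's actual schema. As described in Section 3, real queries are expressed in GraphQL and sent as POST requests.

    import requests

    # Hypothetical GraphQL query; the field names (files, facility,
    # dateFrom, dateTo, weather) are illustrative assumptions.
    QUERY = """
    {
      files(facility: "IACT01",
            dateFrom: "2019-11-01T00:00:00Z",
            dateTo: "2019-11-30T23:59:59Z",
            weather: "clear") {
        name
        size
      }
    }
    """

    # The HTTP Server acts as a proxy and forwards the POST request
    # to one of the Service instances (hypothetical host and path).
    response = requests.post("https://astrods.example.org/api/query",
                             json={"query": QUERY})
    print(response.json())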
Within each instance of the Service, the Core Controller is responsible for processing queries. Upon receiving a query, the Core Controller first checks with the Cache Manager whether a response has already been generated for that query. To identify a query uniquely, the Core Controller computes an MD5 hash of the query's content. A response to a query is a list of files containing the data that match the filters defined in the query. If the response to a query is not cached, the Core Controller forwards the query to the Metadata Catalog, which stores the metadata of all available datasets. The Metadata Catalog, in turn, runs a search query against its database and returns a response to the Core Controller. After receiving the response from the Metadata Catalog, the Core Controller passes it to the Cache Manager for caching and requests the Repository Manager to generate the files listed in the response using data from the Storage Instances, each of which is mounted to the machine where the Service is running. File generation is carried out asynchronously.

3 Implementation

The Web UI was implemented as a single-page application based on React [6] and Material UI [7] and packed into a Docker container. In the Web UI, the user creates a data query using the search menu. The GraphQL query language [8] was used to implement the queries. The structure of a query reflects the structure of the entities that describe the data of a physical experiment, such as the experiment site, facility name, instrument name, detector, and data channel. To compose a query, the user defines a list of filters and their corresponding values. Each facility of a physical experiment has its own set of filters. When the user chooses multiple facilities, only the filters common to all of them are shown; facility-specific filters are hidden. Since the history of all queries made by the user is stored in the browser's local storage, the user can review the response to any past query. While reviewing responses, the user can build a list of files drawn from different queries simply by selecting them; the selected files can later be downloaded as a single archive. An illustration of the Web UI is shown in Fig. 2.

Fig. 2. The Web UI of the Data Aggregation Service

The Core Controller was implemented as a microservice based on the Django framework [9] and packed into a Docker container. All communication between the Web UI and the Core Controller goes through a set of REST API [10] endpoints, each of which is responsible for one type of query. Data lookup queries are formed using the GraphQL syntax and sent to the Core Controller as POST requests; the others are standard REST API requests. File generation queries are an exception: since files are generated asynchronously, the Core Controller sends a short text message to the Web UI via a WebSocket channel [11] when the files for a query are ready. The Web UI can also check the current status of a file generation query with a GET request during the initialization stage.

All queries are cached by the Cache Manager to reduce the response time. The cache was implemented as a large hash table whose keys are the MD5 hashes of queries and whose values contain the response to a query together with metadata such as the creation time and the last-used timestamp. Entries are discarded from the cache according to the Least Frequently Used (LFU) policy: the Cache Manager counts how often each query is used, and those used least often are discarded first. Currently, a cached query expires after seven days.
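The following is a minimal sketch of the caching scheme just described: a hash table keyed by the MD5 hash of the query content, with LFU eviction and a seven-day expiration. The class name and the capacity limit are assumptions made for illustration; the actual Cache Manager is a component of the Django-based service.

    import hashlib
    import time

    EXPIRATION = 7 * 24 * 3600  # cached queries expire after seven days

    class QueryCache:
        """Illustrative LFU cache keyed by the MD5 hash of a query."""

        def __init__(self, max_entries=10000):  # capacity is an assumption
            self.max_entries = max_entries
            self.entries = {}  # md5 hex digest -> (response, created, hits)

        @staticmethod
        def key(query: str) -> str:
            # The query content itself uniquely identifies the entry.
            return hashlib.md5(query.encode("utf-8")).hexdigest()

        def get(self, query: str):
            k = self.key(query)
            entry = self.entries.get(k)
            if entry is None:
                return None
            response, created, hits = entry
            if time.time() - created > EXPIRATION:
                del self.entries[k]  # expired after seven days
                return None
            self.entries[k] = (response, created, hits + 1)
            return response

        def put(self, query: str, response) -> None:
            if len(self.entries) >= self.max_entries:
                # LFU policy: discard the least frequently used entry.
                coldest = min(self.entries,
                              key=lambda k: self.entries[k][2])
                del self.entries[coldest]
            self.entries[self.key(query)] = (response, time.time(), 0)

A Core Controller would call get() before querying the Metadata Catalog and put() after receiving a fresh response, which matches the flow described in Section 2.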
The Repository Manager was implemented as a separate microservice with its own REST API. After receiving a file generation request from the Core Controller, the Repository Manager creates an empty CernVM-FS repository, copies the original files from the Storage Instances to the newly created repository, and publishes it. When a repository is ready, the Repository Manager sends a short text message to the Core Controller via the WebSocket channel.

To implement the communication via WebSocket, the RabbitMQ message broker [12] was used. Text messages are formatted using the STOMP protocol [13].
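As a rough illustration of this flow, the sketch below drives the standard cvmfs_server tool to create and publish a repository and uses the stomp.py client to send a notification through RabbitMQ's STOMP listener. The host names, credentials, queue name, and message format are assumptions; the actual Repository Manager exposes this logic behind its REST API.

    import shutil
    import subprocess

    import stomp  # stomp.py client; RabbitMQ listens for STOMP on 61613

    def publish_repository(repo, file_paths):
        # Create an empty CernVM-FS repository, then open a transaction,
        # copy the original files in, and publish the result.
        subprocess.run(["cvmfs_server", "mkfs", repo], check=True)
        subprocess.run(["cvmfs_server", "transaction", repo], check=True)
        for path in file_paths:
            shutil.copy(path, "/cvmfs/%s/" % repo)
        subprocess.run(["cvmfs_server", "publish", repo], check=True)

    def notify_core_controller(repo):
        # Short STOMP text message telling the Core Controller that the
        # files are ready (broker address and queue name are assumptions).
        conn = stomp.Connection([("rabbitmq.example.org", 61613)])
        conn.connect("guest", "guest", wait=True)
        conn.send(destination="/queue/file-generation",
                  body="ready: " + repo)
        conn.disconnect()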
4 Current Status

The facilities of the physical experiments mentioned above (TAIGA, KASCADE-Grande) have been successfully integrated into the AstroDS system. Using the Web UI of the Data Aggregation Service, users can already make complex queries that combine data from different facilities of these experiments. The KASCADE-Grande experiment has its own service similar to the Repository Manager, so the components of the Data Aggregation Service work with that service to generate files. Work is ongoing to make the Data Aggregation Service stable for production use.

In the future, there are plans to integrate more physical experiments into the AstroDS platform and to implement more filters that give users a flexible way to search for data. There are also plans to conduct outreach activities to increase the number of users.

References

1. Haungs, A., Bychkov, I., Dubenskaya, J., Fedorov, O., Heiss, A., Kang, D., Kazarina, Y., Korosteleva, E.E., Kostunin, D., Kryukov, A., Mikhailov, A., Nguyen, M.D., Polgart, F., Polyakov, S., Postnikov, E., Shigarov, A., Shipilov, D., Streit, A., Tokareva, V., Wochele, D., Wochele, J., Zhurov, D.: Russian–German Astroparticle Data Life Cycle Initiative. Data 2018, 3, 56. https://doi.org/10.3390/data3040056
2. Kryukov, A., Nguyen, M.-D.: A Distributed Storage for Astroparticle Physics. EPJ Web of Conferences 2019, vol. 207, 08003. https://doi.org/10.1051/epjconf/201920708003
3. Nguyen, M.-D., Kryukov, A., Dubenskaya, J., Korosteleva, E., Bychkov, I., Mikhailov, A., Shigarov, A.: Data Aggregation in the Astroparticle Physics Distributed Data Storage. CEUR Workshop Proceedings 2019, vol. 2406, pp. 84-89.
4. Budnev, N. et al.: The TAIGA experiment: From cosmic-ray to gamma-ray astronomy in the Tunka valley. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, February 2017, vol. 845, pp. 330-333. https://doi.org/10.1016/j.nima.2016.06.041
5. Apel, W.D. et al.: The KASCADE-Grande Experiment. Nuclear Instruments and Methods in Physics Research Section A, April 2010, vol. 620, pp. 202-216. https://doi.org/10.1016/j.nima.2010.03.147
6. Facebook Open Source: React, https://reactjs.org. Last accessed 28 Jun 2020
7. Material-UI, https://material-ui.com. Last accessed 28 Jun 2020
8. The GraphQL Foundation: GraphQL, https://graphql.org. Last accessed 28 Jun 2020
9. Django Software Foundation: Django, https://www.djangoproject.com. Last accessed 28 Jun 2020
10. Fielding, R.T.: Architectural Styles and the Design of Network-based Software Architectures. https://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf
11. Internet Engineering Task Force: The WebSocket Protocol, https://tools.ietf.org/html/rfc6455. Last accessed 28 Jun 2020
12. Pivotal Software, Inc.: RabbitMQ Server Documentation, https://www.rabbitmq.com/admin-guide.html. Last accessed 28 Jun 2020
13. STOMP Protocol Specification, https://stomp.github.io/stomp-specification-1.2.html. Last accessed 28 Jun 2020