Towards a Global Analysis and Data Centre in Astroparticle Physics*

Andreas Haungs**[0000-0002-9638-7574]

Karlsruhe Institute of Technology, KIT-IKP, 76021 Karlsruhe, Germany
andreas.haungs@kit.edu
http://www.kceta.kit.edu

Abstract. Astroparticle physics is a young and evolving research area in which the amount of data continuously increases due to modern instruments. Because of the geographical and experimental diversity of the field, a dedicated strategy for its digitalisation is fundamental. For an effective implementation with benefits for science and society, four strategic points were defined: (i) establishment of one or more global data centres; (ii) development of methods for the conservation of the measurement data; (iii) development and application of modern methods in data analysis; (iv) expansion of courses for the training of young scientists in modern analytical methods. The aim of the initiative for an analysis and data centre in astroparticle physics is to develop and implement an interdisciplinary concept that meets the needs of the digitisation of the research field and is also attractive to society. The goal is to enable a more efficient analysis of the data recorded at different locations around the world (multi-messenger analyses), as well as modern training for Big Data scientists in the synergy between basic research and the information society.

Keywords: Analysis and Data Centre · Data Infrastructure · Astroparticle Physics.

1 Introduction: The High-Energy Universe

Understanding the high-energy Universe in the context of astroparticle physics means first and foremost answering the urgent question of the origin of high-energy cosmic rays. This question has not been answered by roughly 100 years of measurements of cosmic rays since their discovery in 1912 in a series of balloon flights by Victor Hess [1].
* Supported by KRAD, the Karlsruhe-Russian Astroparticle Data Life Cycle Initiative (Helmholtz HRSF-0027).
** The author acknowledges the help of and cooperation with the colleagues of the projects KCDC and KRAD (esp. D. Wochele, J. Wochele, F. Polgart, V. Tokareva, D. Kang, D. Kostunin), the APPDS initiative (esp. A. Kryukov, M. Nguyen, Y. Kazarina) and the SCC GridKa infrastructure at KIT (esp. A. Heiss, A. Streit).

Fig. 1. Motivation for a global data and analysis centre in astroparticle physics. Cosmic rays, neutrinos and gamma rays of galactic (Milky Way) or extra-galactic (e.g. supermassive black holes, named AGN) origin reach the Earth and are observed by different kinds of instruments. For multi-messenger analyses these data have to be combined in a dedicated infrastructure.

Whereas cosmic rays at lower energies can be measured directly by balloon- or satellite-based experiments and are interpreted as being of Galactic origin, i.e. generated and accelerated within the Milky Way, above 100 TeV primary energy the low flux no longer allows direct measurements. Instead, they are studied through the detection of extensive air showers: cascades of particles produced when a high-energy cosmic ray enters the atmosphere and interacts. Experiments measuring these air showers could prove that the highest-energy cosmic rays (above ca. 8 EeV) are of extra-galactic origin [2]. It is unknown which sources in the deep Universe are responsible for these particles and at exactly which energy the transition from cosmic rays of galactic to extra-galactic origin happens.

Only charged particles can be accelerated in the source regions of the Universe. However, these acceleration processes also generate secondary neutral products, such as gamma rays and neutrinos. Once produced, these particles travel in straight lines from the source to Earth. Neutrinos interact only very weakly with matter (and detectors) and are therefore difficult to measure.
Gamma-ray measurements are motivated, among other reasons, by using gamma rays as tracers of charged cosmic-ray sources, but they suffer from absorption by the infrared and microwave background of the Universe. Due to all these difficulties it became clear that only a combination of measurements of the different tracers - not to forget the rare, but meanwhile detected, catastrophic events in our Universe generating gravitational waves - will bring us closer to an understanding of the astrophysics of the high-energy Universe. This approach is called multi-messenger astroparticle physics (fig. 1).

2 Multi-Messenger Astroparticle Physics

From the theory point of view, multi-messenger astroparticle physics aims to connect the physics of the sources with the observations in gamma rays, neutrinos and cosmic rays, and to use these observations to learn about the generation, acceleration and propagation properties [3]. From the experimental point of view, the multi-messenger idea is mainly pursued by sending real-time alerts from the observatory of one tracer to many others. This is partly organised by the individual experiments (e.g. via the Astrophysical Multimessenger Observatory Network (AMON) [4]). Two recent examples have shown the first real successes of experimental multi-messenger physics: (i) the detection of a neutron star merger in parallel by a gravitational wave detector and by many telescopes observing the event across the entire electromagnetic spectrum [5]. This event and the corresponding publication are widely seen as the birth of multi-messenger astronomy (as an extension of multi-wavelength astronomy), even though no particles (cosmic rays or neutrinos) were observed from this event. (ii) An alert from the IceCube neutrino observatory for the detection of a high-energy neutrino led to the corresponding observation of a high-energy gamma-ray flare of a distant blazar [6,7].
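The alert-driven coincidence search behind such observations can be illustrated with a short sketch: given a neutrino alert and a list of gamma-ray detections, keep only those compatible in arrival direction and time. This is a minimal illustration; the dictionary layout, thresholds and function names are hypothetical and not the format used by AMON or any observatory:

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation (degrees) between two sky positions
    given as right ascension / declination in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    dra = ra2 - ra1
    # Vincenty-style formula, numerically stable also for small angles
    num = math.hypot(
        math.cos(dec2) * math.sin(dra),
        math.cos(dec1) * math.sin(dec2)
        - math.sin(dec1) * math.cos(dec2) * math.cos(dra),
    )
    den = (math.sin(dec1) * math.sin(dec2)
           + math.cos(dec1) * math.cos(dec2) * math.cos(dra))
    return math.degrees(math.atan2(num, den))

def find_coincidences(alert, candidates, max_sep_deg=1.0, max_dt_s=3600.0):
    """Return the candidates compatible with the alert in direction
    (angular separation) and time (absolute difference)."""
    return [
        c for c in candidates
        if angular_separation(alert["ra"], alert["dec"],
                              c["ra"], c["dec"]) <= max_sep_deg
        and abs(alert["time"] - c["time"]) <= max_dt_s
    ]

# Hypothetical neutrino alert and gamma-ray candidates (times in seconds)
nu_alert = {"ra": 77.36, "dec": 5.69, "time": 0.0}
gamma = [
    {"ra": 77.35, "dec": 5.70, "time": 1200.0},   # close in space and time
    {"ra": 120.0, "dec": -30.0, "time": 100.0},   # wrong direction
    {"ra": 77.36, "dec": 5.69, "time": 90000.0},  # too late
]
print(find_coincidences(nu_alert, gamma))  # only the first candidate survives
```

In practice such matching additionally folds in the angular uncertainty of each instrument and the statistical significance of a chance coincidence, which this sketch deliberately omits.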
These examples show that the multi-messenger approach is indeed the most promising route to new insights into the high-energy Universe. More parallel observations of the same source region will certainly be provided in the coming years, in particular by the new or enhanced large-scale observatories CTA [8], IceCube(-Gen2) [9,10], the Pierre Auger Observatory (AugerPrime) [11,12], and LIGO/Virgo [13,14]. All these observatories are operated independently by large international collaborations with hundreds, partly thousands, of physicists. However, the maximum information can only be extracted from the measurements if a common analysis based on high-accuracy measurements is achieved. The initiative presented here aims to provide and validate new tools and methods for a sophisticated multi-messenger analysis strategy. An important part of this is to develop the current first steps into a comprehensive analysis and data centre for astroparticle physics [15,16].

3 An Analysis and Data Centre for Astroparticle Physics

With the help of modern information technology, Big Data analytics and research data management, the basic goal is to enter a new era of multi-messenger astroparticle physics. This can only be reached if several aspects of a coherent development are considered, among which a dedicated analysis and data centre is a very important ingredient. It will not only provide the tools and environment to take the step into this new physics era, but also allow the digitised society to be taken along the path - a must in modern Big Data science. Nowadays, basic research in the fields of particle physics, astroparticle physics, nuclear physics, astrophysics and astronomy is performed in large international collaborations with partly huge infrastructures producing large amounts of valuable scientific data.

Fig. 2. The main pillars of a possible global Analysis and Data Centre in Astroparticle Physics.
To efficiently use all this information to solve the still mysterious question of the origin of matter and the Universe, broad, simple and sustainable access to the scientific data from these infrastructures has to be provided. In general, such a global data centre has to provide a vast set of functionalities, at least covering the following pillars (fig. 2):

– Data availability: All participating researchers of the individual experiments or facilities need fast and simple access to the relevant data.
– Analysis: Fast access to the big data from measurements and simulations is needed.
– Simulations & methods development: To prepare the analyses of the data, researchers need a powerful computing environment for the production of relevant simulations and the development of new methods, e.g. by deep machine learning.
– Education in data science: The handling of the centre as well as the processing of the data requires specialised education in Big Data science.
– Open access: It becomes more and more important to provide the scientific data not only to the internal research community, but also to the interested public: public data for public money!
– Data archive: The valuable scientific data need to be preserved for later (re-)use.

All the present and future large-scale observatories mentioned above will provide their scientific data via sophisticated infrastructures and data centres for internal and also external use. However, information from various experiments and various messengers, such as charged particles, gamma rays or neutrinos, measured by different globally distributed large-scale facilities, has to be combined. For this, a diverse set of astrophysical data needs to be made available and public, as well as a framework for developing tools and methods to handle the (open and collaborative) data.

Fig. 3. Data life cycle in Astroparticle Physics.
Data are generated in the specific observatories or experiments, and the final goal of each analysis is a (journal) publication of the analysis results. A global analysis and data centre has to provide the tools and infrastructure to perform each individual part of the cycle.

We aim to extend the current activities of the individual observatories to an experiment-overarching, global and international level. This includes the validation of both providing public access to (an initial part of) the scientific data and using the computing environment by the involved researchers to perform multi-messenger analyses. A further goal is to standardise the data, to make the publication FAIR [17], and thereby to make it more attractive for a broader user community. The FAIR principles require that all parts of a data life cycle (fig. 3) are considered equally important in the realisation of an open data centre. The move to state-of-the-art computing, storage and data access concepts will also open the possibility of developing specific analysis methods (e.g. deep learning) and corresponding simulations in one environment, opening a new technological opportunity for the entire research field.

4 Steps towards the Analysis and Data Centre

Whereas in astronomy and particle physics data centres are already established which fulfil parts of the above-mentioned requirements (although not the same parts), in astroparticle physics only first attempts are presently under development. For example, the KASCADE experiment [18] has initiated KCDC for a first public release of scientific data. In addition, some public IceCube or Auger data can already be found in the astronomical virtual observatories, such as GAVO [19].

It is obvious that astroparticle physics has become a data-intensive science with many terabytes of data and often with tens of measured parameters associated with each observation.
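The FAIR publication goal discussed above can be made concrete as a simple completeness check on dataset metadata before release. The following sketch assumes an illustrative metadata layout; the field names are hypothetical and not an actual KCDC schema:

```python
# Minimal FAIR-oriented metadata completeness check.
# Field names are illustrative only, not an actual KCDC schema.

REQUIRED_FIELDS = {
    "identifier",    # Findable: a persistent identifier, e.g. a DOI
    "title",         # Findable: a human-readable description
    "access_url",    # Accessible: where the data can be retrieved
    "format",        # Interoperable: a standard format, e.g. HDF5
    "license",       # Reusable: clear terms of reuse
    "provenance",    # Reusable: detector and reconstruction description
}

def missing_fields(metadata: dict) -> set:
    """Return the required metadata fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not metadata.get(f)}

record = {
    "identifier": "doi:10.0000/example",   # hypothetical DOI
    "title": "Air-shower event selection, high-level observables",
    "access_url": "https://example.org/datashop",
    "format": "HDF5",
    "license": "CC BY 4.0",
    # "provenance" intentionally missing
}

print(missing_fields(record))  # {'provenance'}
```

A check of this kind could gate every data release, ensuring that no dataset is published without the documentation that makes it findable and reusable.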
Moreover, new highly complex and massively large datasets are expected from novel and more complex scientific instruments, as well as the simulated data needed for their interpretation; these will become available in the next decades and will probably be used by the community far beyond the next decade. Handling and exploring these new data volumes, and actually making unexpected scientific discoveries, poses a considerable technical challenge that requires the adoption of new approaches in using heterogeneous computing and storage resources and in organising scientific collaborations, scientific education and science communication, where sophisticated public data centres will play the key role. Based on the experience gained with KCDC [20] and the GridKa Tier-1 environment at KIT [21], a global Astroparticle Physics Data and Analysis Centre (presently under construction and based on KCDC and GridKa) is, in the long term, being established as a large-scale infrastructure.

5 KCDC - The KASCADE Cosmic-ray Data Centre

Here, a brief summary of the realisation of KCDC is given. When publishing data, it is not enough to put some ASCII files on a plain webpage. To ensure that potential users can actually use the data, extensive documentation on how the data have been obtained is needed. Depending on the kind of data, this metadata comprises at least a description of the detector and of the reconstruction procedures employed, but can also include parameters determined from the raw data as well as event identifiers such as UUIDs. Since this information, in addition to a licence agreement, is not expected to change often, creating static pages is a viable option.

Another important aspect is user and access management. For KCDC, the concept is that only registered users get access to the data and to the detailed data shop. Therefore, it is sufficient to check whether an account is active and the user is authenticated.
While there is already a basic implementation of permission-based access limitation, a useful categorisation of the users into - possibly hierarchical - groups is needed to use it effectively (no administrator should manually manage the privileges of single users).

The heart of the data centre is the data shop. Its design goals are easy access to a user-defined subset of the whole dataset, a natural way to configure additional detector components or observables without changing the code base, and a clear overview of previous requests with the possibility to use these selections as templates for future requests. For KCDC, the data are stored in a MongoDB. Its schema-less design allows us to collect all available information of an event, even though the available detector components may vary from event to event. In principle, KCDC can use any kind of input format, as these are implemented as plugins. Currently three output formats are supported: HDF5, ROOT and ASCII. These, too, can be extended by adding further plugins. The requests are processed using a Celery-based task queue. Simultaneous processing of requests is achieved by adding more worker processes, which can be distributed among several machines. Together with running the MongoDB on a sharded cluster, this scales the available processing power with demand.

6 The German-Russian Astroparticle Data Life Cycle Initiative

KCDC is a web portal where the KASCADE scientific data are made available to the interested public. In Russia, the TAIGA and Tunka-Rex facilities are in operation [22,23,24]; for many reasons, combined Tunka-Rex and KASCADE data analyses with sophisticated Big Data science methods (e.g. deep learning) are advantageous for solving physics questions.
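The plugin mechanism for output formats described in the previous section can be sketched as a simple registry: each format plugin knows how to serialise a list of schema-less event dictionaries, and new formats are added without touching the core. This is a minimal illustration under assumed names, not the actual KCDC code; a real HDF5 or ROOT plugin would wrap h5py or PyROOT in the same way:

```python
# Sketch of a plugin registry for export formats; names are illustrative.
OUTPUT_PLUGINS = {}

def output_plugin(name):
    """Decorator registering a serialiser under a format name."""
    def register(func):
        OUTPUT_PLUGINS[name] = func
        return func
    return register

@output_plugin("ASCII")
def to_ascii(events, fields):
    """Whitespace-separated table; components missing for an event
    become 'NaN', mirroring the schema-less event storage."""
    header = " ".join(fields)
    rows = (" ".join(str(ev.get(f, "NaN")) for f in fields) for ev in events)
    return "\n".join([header, *rows])

@output_plugin("JSON")
def to_json(events, fields):
    import json
    return json.dumps([{f: ev.get(f) for f in fields} for ev in events])

# A real deployment would register "HDF5" and "ROOT" plugins the same way.

def export(events, fields, fmt):
    """Dispatch a data-shop request to the plugin for the chosen format."""
    try:
        return OUTPUT_PLUGINS[fmt](events, fields)
    except KeyError:
        raise ValueError(f"no plugin for format {fmt!r}") from None

events = [
    {"E": 1.2e15, "Ne": 1.1e6, "theta": 18.3},
    {"E": 3.4e15, "Ne": 4.0e6},          # 'theta' not measured here
]
print(export(events, ["E", "Ne", "theta"], "ASCII"))
```

In a deployment along the lines described above, `export` would run inside a Celery worker task, so that adding workers scales the number of requests served in parallel.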
These high-statistics experiments can be used as a testbed for future multi-messenger astroparticle physics analyses based on the data of the big observatories coming into operation in the next years. The project aims, for the first time, at a common data portal of two independent observatories and, at the same time, at the consolidation and maturation of an astroparticle data centre. The German part of the GRADLCI project [25] focuses on four items, all of which are initial ingredients of the envisaged global data and analysis centre:

Extension of KCDC: The existing data centre KCDC will be extended with scientific data from TAIGA, allowing on-the-fly multi-messenger analyses. The experimental setups in Russia and Germany generate, or have generated, large amounts of primary data. These have to be specified, and a common language for the data description has to be defined. This requires a careful preparation of the data and an extension of the KCDC software and web interface. Within this project, we will adapt the Tunka-Rex data to the KCDC concept and provide an extended public data centre [26]. This central and distinct work will apply the concept and software of KCDC to follow the request of the funding agencies to make (at least the high-level) scientific data public in a FAIR way. In addition, specific data will be included for the experts on both sides to combine the data in common analysis methods. The extended data centre and the experience gained within this project will serve for data releases of present and forthcoming large-scale experiments in astroparticle physics.

Advancement of Big Data Science Software: The data centre shall not only allow access to the data, but also provide the possibility of developing specific analysis methods and performing corresponding simulations. The concept to reach this goal is the installation of a dedicated Data Life Cycle Lab. To this end, the basic software of the KCDC data centre first needs to be improved.
Then the data centre has to be moved to the Big Data environment provided by KIT-SCC. The software is being built as a modular, flexible framework with good scalability (e.g. to large computing centres). The configuration is kept simple and can also be done via a web interface. The entire software will be based solely on open-source software (Python, Django, MongoDB, HDF5, ROOT, etc.). Having done this step, a dedicated Data Life Cycle Lab for the astroparticle physics community can be installed and published. Dedicated access, storage and interface software has to be developed. In addition, the appropriate hardware has to be installed and commissioned. The described concept and working plan are also valid for the next steps, i.e. to generalise and consolidate the software package of a global astroparticle data centre for public and scientific use.

Multi-Messenger Data Analysis: Specific analyses of the data provided by the new data centre will be performed to test the entire concept, giving important contributions and confidence to the centre as a valuable scientific tool. For a detailed reliability test we aim at common data analyses using the complementary information of independent experiments, in particular applying deep learning (machine learning) methods to multi-messenger analyses [27]. The data centre will also be very useful for theoreticians to interpret experimental results.

Go for the public: Coherent outreach for the project, including example applications for all levels of users - from pupils to the directly involved scientists - with detailed tutorials and documentation, is an important ingredient of any activity publishing scientific data. In particular, the documentation, i.e. the metadata, of any released dataset has to be prepared with reasonable care. The goal of having detailed tutorials, i.e.
an education portal, is to make the data accessible also to the general public, in the sense of a visible outreach of astroparticle physics. The target groups for the tutorials are teachers and pupils in high schools. A tutorial has to provide at least: i) basic knowledge of the experiment, astrophysics and related topics; ii) the required software and KCDC data (preferably as a pre-selection); iii) a step-by-step explanation of a simple data analysis; iv) a code example in a modern programming language; v) the interpretation and discussion of the outcome. With an increasing number of users, the user management system (e.g. Q&A sections, discussion blogs, etc.) also has to be extended and maintained. This part also includes creating and designing outreach materials, conducting routine website maintenance, and preparing texts and images. In addition, web pages, press releases, and the distribution of progress reports and news in social networks such as Facebook, Twitter, etc. need to be managed.

7 Conclusion

Several initiatives have been started towards a dedicated and global Analysis and Data Centre in Astroparticle Physics. The aim of these initiatives is to develop and implement an interdisciplinary framework that meets the needs of the digitisation of the research field and is also attractive to society. One of the immediate goals is to enable a more efficient analysis of the data recorded at different locations around the world for coherent multi-messenger studies, in order to better understand the high-energy processes in our Universe.

References

1. Victor F. Hess; Über Beobachtungen der durchdringenden Strahlung bei sieben Freiballonfahrten; Phys.Z. 13 (1912) 1084-1091
2. R. Aloisio, V. Berezinsky, A. Gazizov; Transition from galactic to extragalactic cosmic rays; Astropart.Phys. 39-40 (2012) 129-143
3. A. Palladino, W. Winter; A Multi-Component Model for the Observed Astrophysical Neutrinos; Astron.Astrophys. 615 (2018) A168
4. M.W.E.
Smith et al.; The Astrophysical Multimessenger Observatory Network (AMON); Astropart.Phys. 45 (2013) 56-70
5. B.P. Abbott et al.; Multi-messenger Observations of a Binary Neutron Star Merger; Astrophys.J. 848 (2017) no.2, L12
6. M.G. Aartsen et al.; Multimessenger observations of a flaring blazar coincident with high-energy neutrino IceCube-170922A; Science 361 (2018) 6398, 1378
7. M.G. Aartsen et al.; Neutrino emission from the direction of the blazar TXS 0506+056 prior to the IceCube-170922A alert; Science 361 (2018) 6398, 147
8. J. Hinton, S. Sarkar, D. Torres, J. Knapp (eds.); Seeing the High-Energy Universe with the Cherenkov Telescope Array - The Science Explored with CTA; Astropart.Phys. 43 (2013) 1-356
9. M.G. Aartsen et al.; Evidence for High-Energy Extraterrestrial Neutrinos at the IceCube Detector; Science 342 (2013) 1242856
10. J. van Santen et al.; IceCube-Gen2: the next-generation neutrino observatory for the South Pole; PoS ICRC2017 (2018) 991
11. A. Aab et al.; The Pierre Auger Cosmic Ray Observatory; Nucl.Instrum.Meth. A798 (2015) 172-213
12. R. Engel et al. (Pierre Auger Collaboration); Upgrade of the Pierre Auger Observatory (AugerPrime); PoS ICRC2015 (2016) 686
13. B.P. Abbott et al.; Exploring the Sensitivity of Next Generation Gravitational Wave Detectors; Class.Quant.Grav. 34 (2017) no.4, 044001
14. F. Acernese et al.; Advanced Virgo: a second-generation interferometric gravitational wave detector; Class.Quant.Grav. 32 (2015) no.2, 024001
15. M. Spiro; Open data policy and data sharing in Astroparticle Physics: the case for high-energy multi-messenger astronomy; J.Phys.Conf.Ser. 718 (2016) no.2, 022016
16. A. Haungs et al.; The KASCADE Cosmic-ray Data Centre KCDC: Granting Open Access to Astroparticle Physics Research Data; Eur.Phys.J. C 78 (2018) 741
17. M.D. Wilkinson et al.; The FAIR Guiding Principles for scientific data management and stewardship; Sci. Data 3 (2016) 160018
18. T.
Antoni et al.; The Cosmic-Ray Experiment KASCADE; Nucl.Instrum.Meth. A513 (2003) 490-510
19. GAVO: see https://www.g-vo.org/
20. KCDC: see https://kcdc.ikp.kit.edu/
21. GridKa: see https://www.gridka.de/
22. N. Budnev et al.; The TAIGA experiment: From cosmic-ray to gamma-ray astronomy in the Tunka valley; Nucl.Instrum.Meth. A845 (2017) 330-333
23. F.G. Schröder et al.; Tunka-Rex: Status, Plans, and Recent Results; EPJ Web Conf. 135 (2017) 01003
24. P.A. Bezyazeekov et al.; Measurement of cosmic-ray air showers with the Tunka Radio Extension (Tunka-Rex); Nucl.Instrum.Meth. A802 (2015) 89-96
25. I. Bychkov et al.; Russian-German Astroparticle Data Life Cycle Initiative; Data 3(4) (2018) 56
26. D. Wochele et al.; Data Structure Adaption from Large-Scale Experiment for Public Re-Use; (2019) these proceedings
27. V. Tokareva; Data infrastructure development for a global data analysis centre; (2019) these proceedings