=Paper= {{Paper |id=Vol-2679/short11 |storemode=property |title=An Analysis Framework for KCDC |pdfUrl=https://ceur-ws.org/Vol-2679/short11.pdf |volume=Vol-2679 |authors=Frank Polgart,Andreas Haungs,Donghwa Kang,Doris Wochele,Jürgen Wochele,Victoria Tokareva }} ==An Analysis Framework for KCDC== https://ceur-ws.org/Vol-2679/short11.pdf
              An Analysis Framework for KCDC

    Frank Polgart1[0000−0002−9324−7146] , Andreas Haungs1[0000−0002−9638−7574] ,
    Donghwa Kang1[0000−0002−5149−9767] , Doris Wochele1[0000−0001−6121−0632] ,
                    Jürgen Wochele1[0000−0003−3854−4890] , and
                     Victoria Tokareva1[0000−0001−6699−830X]

    Karlsruhe Institute of Technology, Institute for Nuclear Physics, 76021 Karlsruhe,
                                        Germany
                                frank.polgart@kit.edu



         Abstract. We will introduce a data analysis extension for the KAS-
         CADE Cosmic-ray Data Center (KCDC), based on the Jupyterhub/notebook
         ecosystem. A user-friendly interface, easy access to data from KCDC, and
         modern analysis software are of special interest. This contribution will
         discuss the service architecture, followed by a brief usage example.

         Keywords: Astroparticle Physics · Open Data · Data Analysis · Jupyter

1      Motivation

   The KASCADE Cosmic-ray Data Center (KCDC) [2] is a public data shop
developed at the Institute for Nuclear Physics at the Karlsruhe Institute of
Technology. KCDC provides access to reconstructed cosmic-ray data taken by
the KASCADE [1] experiment.
   We identified improvements and extensions that will benefit our users:
 – The data-sets can be very large, depending on the cut and selection criteria,
   and personal disk space and bandwidth for downloading the desired data-set
   quickly become a bottleneck.
 – Analyses usually rely on specialised frameworks and toolkits, some of which
   are very cumbersome to find, install, use and/or maintain.
   One obvious solution is, instead of bringing the data to the researcher, to
bring the researcher to the data.


2      Design
2.1     Requirements
In order to find the best possible technology, we formulated the following re-
quirements.
    Copyright  ©2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
 – Accessibility: we want people to actually use the service, not to build yet
   another prestige project.
 – Usability: the environment should provide all of the tools necessary to
   conduct an analysis of KASCADE data.
 – Administration: the entire software stack should be easy to install, maintain
   and update.
 – Don’t reinvent the wheel: most of the work has already been done.

2.2   Solution
Accessibility We decided to use the Jupyterhub / Jupyter Notebook [3] ecosys-
tem for our analysis framework, as it is already a very common analysis tool
across all industries and research branches. Almost anyone who has done some
data analysis in the last couple of years has come across or even used Jupyter
notebooks.

Usability Python is the current go-to language for data analysis, and there is
a plethora of third-party analysis packages. We also have to provide ROOT [5],
as many example analyses still rely on it and it supplies the standard file format
in particle physics.

Administration Notebooks allow arbitrary code execution by design, so it is
sensible to isolate the notebook servers of the users from one another and from
the operating system. It should come as no surprise that Docker [4] was the
best choice with respect to containerization, scalability, resource management,
ease of use, etc.
    Though some custom code is needed to attach Jupyter to KCDC, the
majority of components are off-the-shelf industry standards. It is unlikely that
we will have to invest heavily in in-house development to keep these running,
or even to extend and improve them over time.

2.3   Architecture
Jupyter notebooks are a special kind of file format for cell-based, interactive
programming. The standard user interface is implemented as a web service
(notebook server); the important features in the context of this document are
the file-system-like contents manager and the notebook text editor / execution
environment ("kernel" in IPython jargon).
     The notebook servers are docker containers created on demand with ex-
clusive access by a single user each. They are proxied by another container
running the jupyterhub software; the jupyterhub handles authentication and
creation/deletion of notebook servers.
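    The hub-plus-spawner setup described above can be sketched as a JupyterHub
configuration file. This is a minimal illustration, not our actual deployment: the
image name, the volume layout, and the kcdc_authenticator module are assump-
tions standing in for the real components.

```python
# jupyterhub_config.py -- illustrative sketch of the architecture described
# in the text: JupyterHub proxies per-user notebook servers, each running
# in its own Docker container.  All concrete names here are assumptions,
# not the actual KCDC deployment.
c = get_config()  # noqa: F821 -- injected by JupyterHub at startup

# One isolated notebook container per user, created on demand.
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
c.DockerSpawner.image = 'jupyter/scipy-notebook'   # placeholder image
c.DockerSpawner.remove = True                      # delete container when it stops

# Persist each user's dedicated work directory in a named Docker volume.
c.DockerSpawner.notebook_dir = '/home/jovyan/work'
c.DockerSpawner.volumes = {'jupyterhub-user-{username}': '/home/jovyan/work'}

# Hypothetical authenticator module that checks credentials against KCDC.
c.JupyterHub.authenticator_class = 'kcdc_authenticator.KCDCAuthenticator'
```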
     Choosing Docker as the virtualisation technology has many benefits, first
and foremost process isolation, which is a good idea when your service requires
arbitrary code execution. By realizing the service as a Docker stack, we not only
gain easy scalability for many users across multiple physical host machines, but
also a text-based infrastructure description, which makes deployment and version
control trivial.
    For simplicity, users can use their KCDC credentials for authentication.
    The contents manager has been extended for transparent access to the KCDC
download area, and it is comparatively simple to add further data shops, depend-
ing, of course, on the interfaces exposed by each data shop.
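    In simplified form, the idea behind the extended contents manager is that of
an overlay: the user's work directory and the download area appear in a single
listing, from which remote files can be pulled into the work directory. The real
extension subclasses Jupyter's ContentsManager; the class and method names
below are illustrative assumptions only, with a local directory standing in for
the KCDC download area.

```python
# Self-contained sketch of the overlay idea behind the extended contents
# manager.  A local directory stands in for the KCDC download area; the
# names OverlayListing, list and fetch are illustrative, not the real API.
import os
import shutil


class OverlayListing:
    """Merge a notebook work directory with a read-only download area."""

    PREFIX = 'kcdc/'  # virtual folder under which the download area appears

    def __init__(self, workdir, download_area):
        self.workdir = workdir              # the notebook's dedicated work directory
        self.download_area = download_area  # stand-in for the KCDC download area

    def list(self):
        """List local files plus download-area entries under the virtual prefix."""
        local = sorted(os.listdir(self.workdir))
        remote = [self.PREFIX + n for n in sorted(os.listdir(self.download_area))]
        return local + remote

    def fetch(self, name):
        """Copy a download-area file into the work directory and return its path."""
        src = os.path.join(self.download_area, name[len(self.PREFIX):])
        dst = os.path.join(self.workdir, os.path.basename(src))
        shutil.copyfile(src, dst)
        return dst
```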
    The current default notebook is configured with kernels for both Python and
C++ (via the ROOT interpreter).
    A graphical representation of the proposed analysis framework with KCDC
integration can be found in Fig. 1.




Fig. 1. A graphical representation of the proposed analysis framework with KCDC
integration. Each user has their own containerized (via Docker) analysis environment.
The KCDC download area, where the user's requested data-sets reside, can be accessed
directly from the notebook interface to view the data or download them to the note-
book's dedicated work directory.



3   Example
This section is generated from the IPython notebook presentation at DLC 2020 [6],
where it was used to showcase the capabilities of the analysis framework; it is
used here to demonstrate how easy it is to convert an analysis or presentation
into a paper.
    The original presentation also showed the import of data from KCDC into
the analysis framework, and although this is a key design aspect of the described
framework, the ease of use of this feature does not lend itself to an enlightening
description on paper, short of an uninspired screenshot or the remark "there's a
button for that".
    Successful acquisition of the data is therefore presupposed in the following
analysis example.
       We are going to apply the KCDC example analysis to one of the preselected
    data-sets, for reasons of reproducibility. After importing both the data file and
    the analysis script, create a new notebook in the same directory.
[1]: import os
     from zipfile import ZipFile

     ZipFile('KASCADE_SmallDataSample_wA_runs_0877-7417_ROOT.zip').extractall()
     os.listdir()

[1]: ['.ipynb_checkpoints',
      'KCDC_analyze_example.C',
      'slides.ipynb',
      'KASCADE_SmallDataSample_wA_runs_0877-7417_ROOT.zip',
      'info.txt',
      'events.root',
      'EULA.pdf']

        Along with the preselected data-sets, KCDC also provides a basic example
    analysis in C++ with ROOT. In order to run the KCDC example analysis, switch
    the notebook kernel to C++ and enter:
[1]: .L KCDC_analyze_example.C

[2]: run()

[2]: Input file:events.root
     KCDC-Entries read from files: 1080295
     KCDCM-Entries:     1080295
     Array Entries:     986577
     Calor Entries:     250981
     Grande Entries:    88259
     General Entries:   1080295
     KCDCN-Entries to be evaluated: 1080295
      processing event No: 0 of 1080295
      processing event No: 100000 of 1080295
      processing event No: 200000 of 1080295
      processing event No: 300000 of 1080295
      processing event No: 400000 of 1080295
      processing event No: 500000 of 1080295
      processing event No: 600000 of 1080295
      processing event No: 700000 of 1080295
      processing event No: 800000 of 1080295
      processing event No: 900000 of 1080295
      processing event No: 1000000 of 1080295
     Entries survived:: 1080295 out of 1080295
     general id >0   :: 1080295
     array id >0     :: 986577
     calorimter id >0:: 250981
     grande id >0    :: 88259
     (int) 0




       Switch back to Python, and use PyROOT to open and display the result.




[1]: import ROOT
     f = ROOT.TFile('KCDC_Test.root')
     keys = [_.GetName() for _ in f.GetListOfKeys()]
     c = ROOT.TCanvas("foo", "bar", 1920, 1080*len(keys)//4)
     c.Divide(2,len(keys)//2)
     c.SetLogy()
     pad = 0
     logspectra = ['h6202', 'h6302', 'h7202']
     for key in keys:
         pad+=1
         c.cd(pad)
         if key in logspectra:
              ROOT.gPad.SetLogy()
         f.Get(key).Draw()
     c.Draw()




[1]: Welcome to JupyROOT 6.20/04
   The result of this example analysis shows some statistics of the data set (e.g.
events over run number) and some spectra (e.g. energy deposits in the µ-detectors),
among other interesting things.
4    Outlook

We consider the integration of our data shop and analysis framework a suc-
cessful technology demonstration. Going public and making the service avail-
able to all KCDC users is the current goal. We also plan to extend the func-
tionality: it is comparatively simple to add further off-site data shops, there is
the possibility of multi-core and GPU support, and we need to evaluate more
third-party analysis packages and notebook extensions. We also anticipate that
user feedback will be another driving force behind the long-term improvement
of this service.


References
1. T. Antoni et al.; The Cosmic-Ray Experiment KASCADE; Nucl. Instr. and Meth.
   A 513 (2003) 490-510
2. A. Haungs et al; The KASCADE Cosmic-ray Data Centre KCDC: Granting Open
   Access to Astroparticle Physics Research Data; Eur. Phys. J. C (2018) 78:741;
   https://doi.org/10.1140/epjc/s10052-018-6221-2
3. Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian E. Granger,
   Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica B. Hamrick, Ja-
   son Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol
   Willing et al.; Jupyter Notebooks - a publishing format for reproducible com-
   putational workflows; in: Positioning and Power in Academic Publishing: Play-
   ers, Agents and Agendas; IOS Press (2016) 87-90
4. Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development
   and deployment. Linux J. 2014, 239; Article 2 (March 2014)
5. Rene Brun and Fons Rademakers; ROOT - An Object Oriented Data Analysis
   Framework; Proceedings AIHENP’96 Workshop, Lausanne, Sep. 1996, Nucl. Inst.
   & Meth. in Phys. Res. A 389 (1997) 81-86. See also http://root.cern.ch/
6. IV International Workshop "Data Life Cycle in Physics", DLC-2020;
   https://indico.scc.kit.edu/event/806/