<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tao Lin (on behalf of the JUNO collaboration)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tao Lin</string-name>
          <email>lintao@ihep.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of High Energy Physics, Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing, 100049</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>The Jiangmen Underground Neutrino Observatory (JUNO) experiment is mainly designed to determine the neutrino mass hierarchy and precisely measure oscillation parameters by detecting reactor antineutrinos. The total event rate from the DAQ is about 1 kHz and the estimated volume of raw data is about 2 PB/year, whereas the event rate of reactor antineutrinos is only about 60 per day. One of the challenges for data analysis is therefore to select sparse physics signal events from a very large amount of data, whose volume cannot be reduced using the traditional data streaming method. To improve the speed of data analysis, a new correlated data analysis method has been implemented based on event index data. The index data, which are produced during event pre-processing with JUNO's SNiPER-based offline software, contain the address of each event in the original data files as well as all the information needed for event selection. The index data are subsequently filtered with refined selection criteria using Spark, so that their volume is further reduced. At the final stage of data analysis, only the events within the time window are loaded according to the event addresses in the index data. A performance study shows that this method achieves a 14-fold speedup compared to correlation analysis that reads all the events. This contribution introduces the detailed software design for event index-based correlation analysis and presents the performance measured with a prototype system.</p>
      </abstract>
      <kwd-group>
        <kwd>JUNO</kwd>
        <kwd>time correlation</kwd>
        <kwd>analysis tool</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Jiangmen Underground Neutrino Observatory (JUNO) experiment, under construction in
southern China, will have a rich physics program in addition to determining the neutrino mass ordering and precisely
measuring oscillation parameters [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]. The JUNO detector is located 700 m underground.
As shown in Figure 1, it consists of a central detector, a water Cherenkov detector and a top tracker. The
innermost part of the central detector contains 20 kton of liquid scintillator, surrounded by 17,612 20-inch
PMTs and 25,600 3-inch PMTs.
      </p>
      <p>
        As one of the important systems, the JUNO offline software [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is used to process the
roughly 2 PB/year of data coming from the detector. As shown in Figure 2, the offline software includes an
underlying framework, external libraries and several applications. The applications consist of event
generators, simulation, calibration, reconstruction and analysis tools. In order to support all these
applications, SNiPER (Software for Non-collider Physics ExpeRiment) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is adopted as the
underlying data processing framework.
      </p>
      <p>
        The main challenges in the analysis are the rare signals and the time correlation, which are quite
different from the situation in collider experiments. The total event rate is about 1 kHz, while the event rate of reactor
antineutrinos is about 60 per day, so most of the events are backgrounds for the analysis. If
there were no time correlation between the events, all the background events could be discarded individually. However,
an antineutrino is detected via the inverse beta decay (IBD) process, which produces a prompt positron signal
and a delayed neutron capture signal, with an average neutron capture time of 200 μs. Hence all the events in
the same time window are needed. This time correlation makes it difficult to apply big data
technologies directly. In order to improve the speed of data analysis and the ability to analyze data
interactively [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an event index-based method has been proposed. The key idea of this method is to
reduce the I/O by loading only the events within the time window, on demand, according to the event
address in the selected index. This paper presents the design and implementation of this method.
      </p>
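      <p>The imbalance between signal and background can be made explicit with the two rates quoted above. This is a back-of-the-envelope estimate rather than a figure from the analysis, and the symbols N_IBD and N_total are introduced here only for illustration:</p>
      <disp-formula>
        <tex-math><![CDATA[
\frac{N_{\mathrm{IBD}}}{N_{\mathrm{total}}}
\approx \frac{60\ \mathrm{day}^{-1}}{10^{3}\ \mathrm{s}^{-1} \times 86400\ \mathrm{s\,day}^{-1}}
\approx 7 \times 10^{-7}
]]></tex-math>
      </disp-formula>
      <p>In other words, fewer than one event in a million is expected to be a reactor antineutrino signal.</p>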
    </sec>
    <sec id="sec-2">
      <title>2. Design and Implementation</title>
      <sec id="sec-2-1">
        <title>2.1 The event index-based method and the analysis event index</title>
        <p>An event index contains the address of an event in the original data as well as the
information needed by the event selection. The event index-based method has three stages, as
shown in Figure 3. The first stage is the generation of the event index: the event index data are produced
during event pre-processing by the SNiPER framework. The second stage is the reduction of the event
index using big data technologies such as Spark: the event index is filtered with refined selection criteria,
which reduces its volume. The third stage is the time correlation analysis in the
correlation analysis framework: the event addresses are read from the event index and the events
are loaded automatically according to these addresses.</p>
        <p>As already mentioned, the event index contains two parts. The first part is the address of the
event in the data, which consists of a reference to the event data file and a reference to the entry in
that file. This part is used internally by the analysis framework. The second part is the
user-defined event-level information, which is used for the single-event selection. In this study,
the reconstructed energy, the reconstructed vertex and the event time are stored in the second part.</p>
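        <p>The two-part layout described above can be sketched as a small record type together with a single-event selection. This is a minimal illustration rather than the JUNO implementation: the class name EventIndex, the field names, the units, the function select_single and the use of a spherical radius cut as the fiducial volume are all assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class EventIndex:
    """One index record (illustrative layout, not the JUNO file format)."""
    # Part 1: event address, used internally by the analysis framework.
    file_id: int   # reference to the event data file
    entry: int     # entry number inside that file
    # Part 2: user-defined event-level information, used for selection.
    energy: float                       # reconstructed energy (MeV assumed)
    vertex: Tuple[float, float, float]  # reconstructed vertex (m assumed)
    time: float                         # event time (s assumed)


def radius(vertex: Tuple[float, float, float]) -> float:
    """Distance of a vertex from the detector centre."""
    return math.sqrt(sum(c * c for c in vertex))


def select_single(indices: Iterable[EventIndex], e_min: float, e_max: float,
                  r_max: float) -> List[EventIndex]:
    """Keep records passing an energy cut and a (hypothetical) radius cut."""
    return [
        rec for rec in indices
        if e_min <= rec.energy <= e_max and radius(rec.vertex) <= r_max
    ]
```
]]></preformat>
        <p>Because the selection only touches the second part of each record, it can run on the index alone, without opening any event data file. The cut values passed to select_single are left to the caller, since the text does not quote them.</p>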
        <p>
          The event index can be stored in the plain text format, the ROOT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] format and
the HDF5 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] format, which are supported by both ROOT and the big data technologies. The
data-frame-like structure of the event index can be easily analyzed and processed by these technologies. In the
current implementation, the event index is written in plain text format by ROOT and then processed
by Spark and ROOT.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 The event index-based correlation analysis framework</title>
        <p>The event index-based correlation analysis framework has been developed based on SNiPER.
One of the essential features of SNiPER is its ability to manage multiple events in the same time
window using an event buffer service. Thanks to the modular design of the framework, the
user-developed event selection algorithms are the same whether or not the event index is used. The only components that change
are the event loop and the event buffer service, which gain event index support, as shown in Figure 4.</p>
        <p>For the analysis without the event index, the event loop is driven by the ROOT-based event data.
Events are read by the ROOT I/O and put into the event buffer at the beginning of the processing of each event.
The event selection algorithms access the event data from the event buffer. When the processing
of the current event is done, the framework reads and analyzes the next event, until all the events
are processed.</p>
        <p>For the analysis with the event index, the event loop is driven by the event index instead of the
event data. At the beginning of the processing of each event, the index-based event buffer service loads
an index via the index I/O. Using the file reference in the index, the service then checks whether
the file is already open; if not, the ROOT I/O is used to open it.
When the file is ready, the event is loaded according to the entry number in the event
index. The next step is to load the other events in the same time window via the ROOT I/O. When all
the events in the time window are ready, the event selection algorithms can access them. For the next
event, the framework loads the next index instead of the next event. With this
method, the background events that are not in the time window of any selected index can be skipped.</p>
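        <p>The index-driven loading steps described above can be sketched in plain Python. This is a simplified stand-in, not the SNiPER event buffer service: the class IndexedEventBuffer and its methods are hypothetical, in-memory lists play the role of ROOT files, events are assumed to be time-ordered within a file, and a time window is assumed not to cross file boundaries.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

Event = Tuple[float, float]  # (event time in s, energy in MeV), illustrative


class IndexedEventBuffer:
    """Hypothetical index-driven buffer; lists stand in for ROOT files."""

    def __init__(self, storage: Dict[str, List[Event]], window: float):
        self._storage = storage                   # "files on disk"
        self._open: Dict[str, List[Event]] = {}   # files opened so far
        self._window = window                     # time window in s
        self.entries_read = 0                     # counts simulated reads

    def _read(self, file_name: str, entry: int) -> Event:
        # Open the file lazily, only when an index first refers to it.
        if file_name not in self._open:
            self._open[file_name] = self._storage[file_name]
        self.entries_read += 1
        return self._open[file_name][entry]

    def load(self, file_name: str, entry: int) -> List[Event]:
        """Load the indexed event plus its neighbours inside the window."""
        anchor = self._read(file_name, entry)
        n_entries = len(self._open[file_name])
        events = [anchor]
        k = entry - 1
        while k >= 0:                      # walk backwards in time
            ev = self._read(file_name, k)
            if anchor[0] - ev[0] > self._window:
                break
            events.insert(0, ev)
            k -= 1
        k = entry + 1
        while k < n_entries:               # walk forwards in time
            ev = self._read(file_name, k)
            if ev[0] - anchor[0] > self._window:
                break
            events.append(ev)
            k += 1
        return events
```
]]></preformat>
        <p>In this sketch the event loop would call load once per selected index, so entries far from any selected index are never read. Note that one extra entry is read on each side to detect the edge of the window, which is one source of overhead relative to the ideal case.</p>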
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Performance</title>
      <p>To evaluate the performance of the correlation analysis framework, two cases are
studied: one considers only the I/O, without time correlation analysis, and the other
includes the time correlation analysis. The performance of event loading for different selection ratios is
shown in Figure 5. In this test, the complete event index is read by the index I/O and events are then
selected at random according to the given ratio. If an event index is selected, the corresponding event
data are loaded; the other events in the same time window are not loaded. To reduce
the uncertainty, all the measurements are repeated 30 times. As shown in the figure, even though there
are overheads, the event index speeds up event loading by reducing the ROOT I/O.</p>
      <p>As the neutrino events are rare, radioactivity background samples in the liquid scintillator are
generated to mimic the IBD signals. All the isotopes in the decay chains are considered, and the intervals
between two events are sampled according to the event rates. A fiducial volume cut and an energy cut
are applied in the selection of single events. An energy cut, a time interval cut and a distance cut are then
applied in the selection of correlated events. In the test, about 2.5% of the events are selected in the
event index according to the selection criteria and about 5% of the events are loaded from the event data. The
performance of the time correlation analysis is shown in Figure 6. Compared to the analysis without
index data, there is about a 14-fold speedup. The method can provide a further speedup if fewer events are
selected in the future.</p>
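      <p>The correlated-event selection described above (energy cut, time interval cut and distance cut) can be illustrated with a short prompt-delayed pairing routine. This is a sketch under stated assumptions, not the selection code used in the study: the function find_pairs, the event tuple layout and all cut parameters are hypothetical, and events are assumed to be sorted by time.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Event = Tuple[float, float, Vertex]  # (time in s, energy in MeV, vertex in m)


def distance(a: Vertex, b: Vertex) -> float:
    """Euclidean distance between two reconstructed vertices."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_pairs(events: Sequence[Event],
               prompt_e: Tuple[float, float],
               delayed_e: Tuple[float, float],
               max_dt: float,
               max_dist: float) -> List[Tuple[int, int]]:
    """Return (prompt, delayed) index pairs passing all three cuts."""
    pairs = []
    for i, (t_p, e_p, v_p) in enumerate(events):
        if not prompt_e[0] <= e_p <= prompt_e[1]:
            continue                       # energy cut on the prompt signal
        for j in range(i + 1, len(events)):
            t_d, e_d, v_d = events[j]
            if t_d - t_p > max_dt:
                break                      # time interval cut (sorted input)
            if (delayed_e[0] <= e_d <= delayed_e[1]
                    and distance(v_p, v_d) <= max_dist):
                pairs.append((i, j))       # energy and distance cuts passed
    return pairs
```
]]></preformat>
      <p>Because the inner loop stops as soon as the time interval cut fails, only the events inside the time window of each prompt candidate are examined, which is the same access pattern the event index is designed to serve.</p>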
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this study, an event index-based correlation analysis method has been developed and applied
to the JUNO analysis. By reducing the I/O of event data, this method improves the speed of the
data analysis: the speedup is about 14-fold when 5% of the events are actually loaded. To further speed
up the analysis, a parallelized version is under development.</p>
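      <p>As a rough consistency check, suppose the run time is dominated by event loading and scales linearly with the fraction f of events loaded. This is a simplifying assumption introduced here, not a model from the study, and the symbols S and f are illustrative. Under it, the speedup S is bounded by the inverse of that fraction:</p>
      <disp-formula>
        <tex-math><![CDATA[
S \le \frac{1}{f} = \frac{1}{0.05} = 20
]]></tex-math>
      </disp-formula>
      <p>The measured speedup of about 14 lies below this bound, which would be consistent with the overheads of the index-based loading noted in Section 3.</p>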
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgement</title>
      <p>This work is supported by the National Natural Science Foundation of China (NSFC 11805223)
and the Xie Jialin Fund.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>An</surname>
            <given-names>F.P.</given-names>
          </string-name>
          et al. [JUNO Collaboration].
          <article-title>Neutrino Physics with JUNO</article-title>
          //
          <source>J.Phys.G</source>
          <volume>43</volume>
          (
          <year>2016</year>
          )
          <issue>3</issue>
          ,
          <fpage>030401</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Abusleme</surname>
            <given-names>A.</given-names>
          </string-name>
          et al. [JUNO Collaboration].
          <article-title>JUNO Physics and Detector</article-title>
          // accepted by
          <source>Progr. Part. Nucl. Phys.</source>
          arXiv:2104.02565
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Huang</surname>
            <given-names>X.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zou</surname>
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>W.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            <given-names>Z.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            <given-names>G.F.</given-names>
          </string-name>
          <article-title>Offline Data Processing Software for the JUNO Experiment</article-title>
          //
          <source>PoS ICHEP2016</source>
          (
          <year>2017</year>
          ),
          <fpage>1051</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Zou</surname>
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            <given-names>X.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>W.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            <given-names>Z.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            <given-names>G.F.</given-names>
          </string-name>
          <article-title>SNiPER: an offline software framework for non-collider physics experiments</article-title>
          //
          <source>J.Phys.Conf.Ser.</source>
          <volume>664</volume>
          (
          <year>2015</year>
          )
          <issue>7</issue>
          ,
          <fpage>072053</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Lin</surname>
            <given-names>T.</given-names>
          </string-name>
          [JUNO Collaboration]
          <article-title>Jupyter-based service for JUNO analysis</article-title>
          //
          <source>EPJ Web Conf.</source>
          <volume>245</volume>
          (
          <year>2020</year>
          )
          <fpage>07011</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Brun</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rademakers</surname>
            <given-names>F.</given-names>
          </string-name>
          <article-title>ROOT - An Object Oriented Data Analysis Framework</article-title>
          //
          <source>Nucl. Inst. &amp; Meth. in Phys. Res. A</source>
          <volume>389</volume>
          (
          <year>1997</year>
          )
          <fpage>81</fpage>
          -
          <lpage>86</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>