<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Metadata Extraction from Raw Astroparticle Data of TAIGA Experiment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Igor Bychkov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Dubenskaya</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Korosteleva</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandr Kryukov</string-name>
          <email>kryukov@theory.sinp.msu.ru</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Mikhailov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Duc Nguyen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexey Shigarov</string-name>
          <email>shigarov@icc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Mathematics, Economics and Informatics, Irkutsk State University</institution>
          ,
          <addr-line>Gagarin Blvd. 20, Irkutsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of Russian Academy of Sciences</institution>
          ,
          <addr-line>Lermontov St. 134, Irkutsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Skobeltsyn Institute of Nuclear Physics, Lomonosov Moscow State University</institution>
          ,
          <addr-line>Leninskiye Gory 1(2), Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Today, the operating TAIGA (Tunka Advanced Instrument for cosmic rays and Gamma Astronomy) experiment continuously produces and accumulates a large volume of raw astroparticle data. To be available to the scientific community, these data should be well described and formally characterized. The use of metadata makes it possible to search for and aggregate digital objects (e.g. events and runs) by time and equipment through a unified access interface. An important part of the metadata is hidden and scattered in folder and file names and in package headers. Such metadata should be extracted from binary files, transformed into a unified form of digital objects, and loaded into the catalog. To address this challenge, we developed a concept of a metadata extractor that can be extended by facility-specific extraction modules. It is designed to automatically collect descriptive metadata from raw data files of all TAIGA formats.</p>
      </abstract>
      <kwd-group>
        <kwd>Data life cycle management</kwd>
        <kwd>Astroparticle data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Nowadays, large-scale setups used in experimental astroparticle physics
generate a large volume of data. This trend gives rise to a number of emerging issues
of big data management [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Some activities should be carried out continuously
across all stages of the astroparticle data life cycle [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and open science [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ],
the model of free access to data, in astroparticle physics.
      </p>
      <p>
        The Russian-German astroparticle data life cycle initiative<sup>4</sup> [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims at
developing an open science system to support the collection, storage, analysis, sharing,
and reuse of data produced by the TAIGA<sup>5</sup> experimental facilities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p><sup>4</sup> https://astroparticle.online</p>
    </sec>
    <sec id="sec-2">
      <p>
        This
system is designed to serve both as a common data portal for two independent observatories
and as a means of consolidating analysis across astroparticle physics
experiments.
      </p>
      <p>One of the important issues is how to efficiently manage raw astroparticle
data to support their availability and future reuse. The long-term preservation
of raw data as originally generated is essential for re-running analyses and
reproducing research results. To be accessible to the scientific community, raw data
should be well described by descriptive, structural, and administrative
metadata. The use of such metadata makes it possible to search for and aggregate
raw astroparticle data through a unified access interface.</p>
      <p>
        Metadata are useful at all stages of the data life cycle considered in our
initiative [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: data availability covers user requests to data through metadata;
data analysis can enrich metadata; simulations can generate metadata; open
access is implemented by metadata describing ownership and rights; education
in data science uses metadata of educational resources; and data archiving provides
long-term data preservation alongside metadata.
      </p>
      <p>
        Currently, the TAIGA experimental facilities use five unique binary file
formats for representing raw data produced by the gamma ray setups:
TAIGA-HiSCORE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and TAIGA-IACT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the cosmic ray setups: Tunka-133 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
Tunka-Grande [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and Tunka-Rex [9]. They are not accompanied by well-organized
metadata. Some scattered metadata are hidden in the names and package headers of
raw data files. There is neither a conventional terminology used in the experiments
nor a unified interface for access to the hidden metadata. The main challenge
considered in this paper is how to extract metadata from raw data.
      </p>
      <p>We propose a concept of a metadata extractor designed to be used in the
astroparticle data storage [10]. The extractor aims to automatically collect
descriptive metadata from raw data files of all TAIGA formats and put them
into a catalog of the storage. Its architecture is extensible by facility-specific
extraction modules (add-ons) that can be implemented with a framework for
binary data format description such as "Kaitai Struct"<sup>6</sup> or "FlexT"<sup>7</sup>. The extracted
metadata should support searching for and aggregating meaningful chunks of
raw data by both time and equipment.</p>
      <sec id="sec-2-1">
        <title>Background</title>
        <p>We define metadata extraction as a characterization of digital objects that
are concealed in raw data. The digital objects significant for the purposes of our
initiative are events registered by detectors. An event as a digital object is
composed of a structured sequence of bits/bytes in binary files. The sequence of
bits/bytes defining such an event can be accessed using a set of unique identifiers
represented in package headers.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <p><sup>5</sup> https://taiga-experiment.info; <sup>6</sup> http://kaitai.io; <sup>7</sup> http://hmelnov.icc.ru/flext</p>
      <p>Our digital object is also characterized by the time and equipment of the run
in which it is registered. These properties are scattered over the folder structure,
file names, and package headers. There are some software tools that can be used
for metadata extraction from binary files. We consider here two kinds of
them: (i) tools for harvesting metadata from binary files and (ii) frameworks for
binary data format description.</p>
      <p>There are several contemporary tools for harvesting metadata from binary
files, including the following: "NLNZ Metadata Extraction Tool"<sup>8</sup>, "JHOVE2"<sup>9</sup>,
"FITS"<sup>10</sup>, and "GNU Libextractor"<sup>11</sup>. Typically, such tools support some
widespread file formats (e.g. JPEG, MP3, ZIP) as a source. They store extracted
metadata in XML, JSON, or delimited text files. Their functionality can be
extended by plug-ins or modules for processing specific binary formats.</p>
      <p>A workflow for the characterization of digital objects can include the following
steps: identification, i.e. determining a presumptive format used for the
representation of a digital object; validation, i.e. determining the conformance of a digital
object to the identified format; extraction, i.e. deriving metadata of a digital
object significant for the purposes of classification, analysis, and use; and
assessment, i.e. determining the acceptability of a digital object for a specific use. The
architecture of our metadata extractor is designed on the basis of this workflow.</p>
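      <p>As a minimal sketch of this workflow, the Python fragment below runs the four steps on an in-memory byte string. The package layout used here (a 4-byte magic value, a little-endian 64-bit timestamp in nanoseconds, and a 16-bit detector identifier) is purely hypothetical and does not correspond to any actual TAIGA format; in the extractor itself, these steps would rely on the reading libraries generated from the format specifications.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import struct

# Hypothetical header: 4-byte magic, uint64 timestamp (ns), uint16 detector id.
HEADER = struct.Struct("<4sQH")
MAGIC = b"TGA0"  # assumed value, for illustration only


def identify(data: bytes) -> bool:
    """Identification: guess the format from the leading magic bytes."""
    return data[:4] == MAGIC


def validate(data: bytes) -> bool:
    """Validation: check that the format matches and a full header is present."""
    return identify(data) and len(data) >= HEADER.size


def extract(data: bytes) -> dict:
    """Extraction: derive descriptive metadata from the header."""
    _, timestamp_ns, detector = HEADER.unpack_from(data)
    return {"timestamp_ns": timestamp_ns, "detector": detector}


def assess(meta: dict) -> bool:
    """Assessment: accept only objects with a non-zero timestamp."""
    return meta["timestamp_ns"] > 0


def characterize(data: bytes):
    """Run the four steps; return the metadata, or None if a step fails."""
    if not validate(data):
        return None
    meta = extract(data)
    return meta if assess(meta) else None


sample = HEADER.pack(MAGIC, 1_546_300_800_000_000_000, 7) + b"payload"
print(characterize(sample))
```
]]></preformat>
      <p>Keeping the steps as separate functions mirrors the workflow above and lets a facility-specific module replace any single step without touching the others.</p>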
      <p>State-of-the-art frameworks, such as "Kaitai Struct" or "FlexT", provide
formal languages for describing binary data formats [11]. They are a
satisfactory solution for documenting, parsing, and verifying raw data. Our
previous work [11] demonstrates the applicability of binary file format description
languages to specifying, parsing, and verifying raw data of the TAIGA experiment. The
formal specifications implemented for the five formats of the experiment make it
possible to automatically generate the source code of libraries for accessing data
in one of the target programming languages (e.g. C++, Java, or Python). These libraries
demonstrate good performance and allow us to locate files with corrupted data.</p>
      <p>Such frameworks can facilitate the extraction of metadata from binary files.
We use format specifications in the formal languages to implement
facility-specific modules that extend the capabilities of our metadata extractor. Each
module relies on a format-oriented data reading library generated automatically
from the corresponding specification by the framework. This allows us to identify
the file format, to validate raw data, and to extract the descriptive attributes of
digital objects.</p>
      <p>Among the existing solutions for describing binary data formats, "Kaitai
Struct" and "FlexT" are the most suitable ones in our case [11]. Both provide
declarative languages for representing file format specifications. Both
treat a specification as a set of data type definitions. They support
bit-oriented data (bit fields) and variant blocks. Both allow one to generate the
source code of the reading libraries for the raw data formats from specifications.
The "FlexT" language [12] is more expressive, but "Kaitai Struct" is based
on a well-known format, namely YAML. Moreover, "Kaitai Struct" supports more
programming languages for source code generation.</p>
      <p><sup>8</sup> http://meta-extractor.sourceforge.net; <sup>9</sup> https://bitbucket.org/jhove2/main/wiki/Home; <sup>10</sup> https://projects.iq.harvard.edu/fits; <sup>11</sup> https://www.gnu.org/software/libextractor</p>
      <sec id="sec-3-1">
        <title>Metadata of TAIGA raw data</title>
        <p>An important part of the available metadata characterizing events is hidden
in the folder and file names and in the package headers of raw data files. There are two
important dimensions, time and equipment, describing each event registered in
a run of a facility (see Fig. 3). These dimensions define a hierarchical structure
of folders with raw data for each facility as follows: a season of measuring (the
folder "YYYY-YY": a start and end year), a moonless period of measuring
in a month (the folder "mmmYY": a month abbreviation and year), a night
(the folder "ddmmyy.NN": a date and a number of the run in the night), a cluster
or station (the folder "fNNN": a facility and a number), and a raw data file
(the file "ddmmyy.NNN": a date and a portion of the raw data writing). Both
dimensions are also present in package headers inside raw data files. A package
header can include a timestamp with an accuracy from millisecond to nanosecond
depending on the facility, as well as a detector and channel identifier of the
equipment.</p>
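        <p>The folder hierarchy described above can be turned into time and equipment attributes with a few regular expressions. The following Python sketch is only an illustration: the concrete patterns, the lower-case facility letter, and the example path are our assumptions about how the naming scheme might look on disk, not a specification of the actual TAIGA layout.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# One pattern per level of the hierarchy (assumed concrete forms).
SEASON = re.compile(r"^(\d{4})-(\d{2})$")    # YYYY-YY
PERIOD = re.compile(r"^([a-z]{3})(\d{2})$")  # mmmYY
NIGHT = re.compile(r"^(\d{6})\.(\d{2})$")    # ddmmyy.NN
STATION = re.compile(r"^([a-z]+)(\d{3})$")   # fNNN
PORTION = re.compile(r"^(\d{6})\.(\d{3})$")  # ddmmyy.NNN
PATTERNS = (SEASON, PERIOD, NIGHT, STATION, PORTION)


def parse_path(path: str) -> dict:
    """Collect time and equipment attributes from one raw data file path."""
    parts = PurePosixPath(path).parts[-5:]
    if len(parts) != 5:
        raise ValueError(f"path is too short: {path}")
    matches = [p.match(s) for p, s in zip(PATTERNS, parts)]
    if not all(matches):
        raise ValueError(f"unexpected path layout: {path}")
    night = datetime.strptime(matches[2].group(1), "%d%m%y").date()
    return {
        "season": parts[0],
        "period": parts[1],
        "night": night.isoformat(),
        "run": int(matches[2].group(2)),
        "facility": matches[3].group(1),
        "station": int(matches[3].group(2)),
        "portion": int(matches[4].group(2)),
    }


example = parse_path("2018-19/nov18/051118.01/h012/051118.003")
print(example)
```
]]></preformat>
        <p>Attributes obtained this way cover only the folder and file names; timestamps and detector identifiers from package headers still have to be read from the binary data.</p>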
        <p>Moreover, the runs and events are characterized by some environmental data.
Each run folder can be accompanied by various supplementary files with a
facility-specific description of equipment configuration, triggering,
synchronization, errors, calibration (e.g. pedestal, current, count rate), and meteorological
measurements. Some facility-specific attributes (e.g. stop-trigger position, detector
number, optical line length, error package status) are also contained in raw data
files. Fig. 3 shows which of the general and facility-specific attributes can be
extracted from raw data to describe events and runs.</p>
        <p>Another part of the metadata can be derived by processing TAIGA raw data
to describe the following properties: validity (determining correct and
corrupted chunks of data), reliability (calculating checksums), availability
(checking whether downloading data from the storage is possible), accessibility
(specifying user rights), and popularity (registering unique user requests and downloads).
Moreover, some artificial neural network models [13] can also enrich metadata
with knowledge of the types and energies of detected particles.</p>
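        <p>As an illustration of one derived attribute, the sketch below computes a checksum of a raw data stream in fixed-size chunks, so that large files need not be loaded into memory at once. The choice of SHA-256 and the shape of the resulting record are our assumptions; the text above does not prescribe a particular algorithm.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import hashlib
import io


def sha256_of(stream, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def reliability_record(stream) -> dict:
    """Build a (hypothetical) metadata record describing the checksum."""
    return {"algorithm": "sha256", "value": sha256_of(stream)}


payload = b"raw astroparticle data" * 1000
record = reliability_record(io.BytesIO(payload))
print(record["algorithm"], record["value"][:16])
```
]]></preformat>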
        <p>In the general case, one event can be represented by several sequences of bytes
in different files produced by one run of a facility. Such parts should be aggregated
into events to be appropriate for the purposes of classification, analysis, and use.
The derived metadata can describe how an event is composed of parts. These
metadata can be separated into three levels as follows: L1 (files), L2 (parts
of events linked with files), and L3 (events linked with parts).</p>
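        <p>The three levels can be pictured as a small object model. The Python sketch below is one possible reading of it: the field names (event_id, offset, length) and the grouping by a shared event identifier are hypothetical and serve only to show how parts stored in different files could be linked into one event.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RawFile:
    """L1: a raw data file."""
    path: str


@dataclass(frozen=True)
class EventPart:
    """L2: a byte sequence inside a file that belongs to some event."""
    event_id: int
    file: RawFile
    offset: int
    length: int


@dataclass
class Event:
    """L3: an event composed of one or more parts."""
    event_id: int
    parts: list = field(default_factory=list)


def aggregate(parts):
    """Group L2 parts into L3 events by their shared event identifier."""
    grouped = defaultdict(list)
    for part in parts:
        grouped[part.event_id].append(part)
    return [Event(eid, grouped[eid]) for eid in sorted(grouped)]


f1, f2 = RawFile("h012/051118.003"), RawFile("h013/051118.003")
parts = [
    EventPart(1, f1, 0, 64),
    EventPart(2, f1, 64, 64),
    EventPart(1, f2, 0, 64),
]
events = aggregate(parts)
print([(e.event_id, len(e.parts)) for e in events])
```
]]></preformat>
        <p>In this sketch, event 1 is assembled from parts in two files, while event 2 has a single part.</p>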
        <p>The extracted and derived metadata can populate the catalog of the
astroparticle data storage that we are developing. Thanks to these metadata, user
queries to TAIGA raw data along the two dimensions become possible in the following form:</p>
        <preformat>GET data WHERE time ==
    range = time between start and end (less than a night)
    run = a specified run | a calibration run
    night = a specified date
    moonless month = a period of time (not calendar month)
    summer = a summer period of time
GET data WHERE equipment ==
    facility = a specified facility
    cluster = a specified cluster (station) of a facility</preformat>
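        <p>To make the two query dimensions concrete, the following Python sketch filters a list of in-memory catalog records by a time range and by equipment. The record fields and the sample values are invented for illustration; an actual catalog would answer such requests through its own query interface rather than a list comprehension.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from datetime import datetime

# Hypothetical catalog records; the field names are assumptions.
records = [
    {"facility": "Tunka-133", "cluster": 5, "time": datetime(2018, 11, 5, 22, 10)},
    {"facility": "TAIGA-HiSCORE", "cluster": 2, "time": datetime(2018, 11, 5, 23, 40)},
    {"facility": "Tunka-133", "cluster": 7, "time": datetime(2018, 11, 6, 21, 0)},
]


def query(items, start=None, end=None, facility=None, cluster=None):
    """Select records by time range and/or equipment; None means 'any'.

    The time range is half-open: start <= time < end.
    """
    def keep(r):
        return (
            (start is None or r["time"] >= start)
            and (end is None or r["time"] < end)
            and (facility is None or r["facility"] == facility)
            and (cluster is None or r["cluster"] == cluster)
        )
    return [r for r in items if keep(r)]


# Time dimension: one night, from evening to the next morning.
night = query(records, start=datetime(2018, 11, 5, 18), end=datetime(2018, 11, 6, 6))
# Equipment dimension: one facility, optionally one cluster.
tunka = query(records, facility="Tunka-133")
print(len(night), len(tunka))
```
]]></preformat>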
      </sec>
      <sec id="sec-3-2">
        <title>Metadata extractor</title>
        <p>We propose a concept of a metadata extractor for harvesting attributes of
events and runs from binary files in facility-specific formats. It implements
an extensible architecture with pluggable facility-specific modules (add-ons) as
shown in Fig. 4. Such modules can be developed based on a framework for
binary data format description, e.g. "Kaitai Struct" or "FlexT". In this case,
the development includes the following steps: exploring raw data,
writing file format specifications, generating the source code of the software
libraries for binary file parsing, and incorporating the generated source code into
the corresponding facility-specific module.</p>
        <p>Fig. 4 shows the workflow for the metadata extractor. The workflow starts
with selecting a module that is appropriate to process the input raw data in a
facility-specific format. The selected module crawls the input structure of folders
and files to collect the attributes available in the folder and file names. It
identifies the format of each input file, then parses and validates binary data by using an
appropriate format-specific library to extract metadata from package headers.
The module also collects attributes from the input supplementary files (e.g.
a facility configuration file). All extracted metadata are used to build instances of
an event object model. Finally, the extractor generates JSON data from these
instances to upload them into the metadata catalog.</p>
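        <p>The extensible architecture can be sketched as a small plug-in registry. In the Python fragment below, the class and method names (ExtractionModule, accepts, extract, MetadataExtractor) are hypothetical, and the demo module only records the file size; a real facility-specific module would call a parser generated from the corresponding format specification at that point.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import json
from abc import ABC, abstractmethod


class ExtractionModule(ABC):
    """Base class for a facility-specific add-on (hypothetical interface)."""

    facility = ""

    @abstractmethod
    def accepts(self, path: str) -> bool:
        """Tell whether this module can process the given file."""

    @abstractmethod
    def extract(self, path: str, data: bytes) -> dict:
        """Return the metadata extracted from one file."""


class DemoModule(ExtractionModule):
    facility = "demo"

    def accepts(self, path):
        return path.endswith(".demo")

    def extract(self, path, data):
        # A real module would call a generated format-specific parser here.
        return {"file": path, "facility": self.facility, "size": len(data)}


class MetadataExtractor:
    def __init__(self):
        self._modules = []

    def register(self, module: ExtractionModule):
        self._modules.append(module)

    def run(self, files: dict) -> str:
        """Map each file to the first accepting module; return JSON."""
        out = []
        for path, data in sorted(files.items()):
            module = next((m for m in self._modules if m.accepts(path)), None)
            if module is not None:
                out.append(module.extract(path, data))
        return json.dumps(out, sort_keys=True)


extractor = MetadataExtractor()
extractor.register(DemoModule())
result = extractor.run({"a.demo": b"1234", "notes.txt": b""})
print(result)
```
]]></preformat>
        <p>Files that no registered module accepts are skipped here; an extractor could equally report them, which is a design choice left open by this sketch.</p>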
        <p>As a preliminary step, we used "Kaitai Struct" to formally describe the file formats
of the TAIGA experiment [11]. The implemented format specifications allowed us to
generate source code in C/C++, Python, and Java for parsing and validating binary
data. The libraries were tested on real data: 89K files of the Tunka-133,
Tunka-Grande, and Tunka-Rex formats and 120K files of the TAIGA-HiSCORE and TAIGA-IACT
formats. They can be adapted for use in the workflow of the metadata
extractor.</p>
        <p>The metadata extractor can be incorporated in the micro-service architecture
of the distributed storage of astroparticle data [14, 15] that we are developing. The
architecture involves placing instances of the metadata extractor locally on file storage
nodes. This allows an instance to request raw data through the CernVM<sup>12</sup>
file system without transferring them among nodes of the distributed storage.
Each operating instance populates the centralized catalog with extracted
metadata. The interaction with the catalog is provided by a GraphQL<sup>13</sup> API
(application programming interface). The architecture implements this interface via
the Graphene-Python<sup>14</sup> library. It also uses object-relational mapping based on
SQLAlchemy<sup>15</sup> on the catalog side. Since all digital objects (events and runs)
we consider are characterized by time, the design of the architecture suggests
using TimeScale<sup>16</sup>, a time series database management system, for organizing
metadata stored in the catalog.</p>
        <p><sup>12</sup> https://cernvm.cern.ch/portal/filesystem; <sup>13</sup> https://graphql.org</p>
      </sec>
      <sec id="sec-3-3">
        <title>Conclusion and further work</title>
        <p>The best practices of scientific data maintenance recommend keeping raw data.
This facilitates the reproducibility of published results and future reuse
with more advanced data analysis and processing. The TAIGA experiment
produces and accumulates a large volume of raw astroparticle data. To be available
to the scientific community, they should be accompanied by metadata with a
unified access interface.</p>
        <p>In our case, an important part of the metadata is hidden and scattered in
raw data. Such metadata should be extracted from binary files, transformed
into a unified form of digital objects, and loaded into the catalog. To address
this challenge, we have developed a concept of a metadata extractor that can
be extended by facility-specific extraction modules. The extractor aims to
automatically collect descriptive metadata from raw data files of all TAIGA
formats.</p>
        <p><sup>14</sup> https://graphene-python.org; <sup>15</sup> https://www.sqlalchemy.org; <sup>16</sup> https://www.timescale.com</p>
        <p>Further work on the incorporation of metadata into the astroparticle data life
cycle requires the following steps: unifying the terminology (agreeing on a
thesaurus); determining a set of user requests to the metadata catalog; determining
a set of hidden and derived attributes describing the digital objects;
implementing the metadata extractor; and developing the metadata catalog with a
unified access interface.</p>
        <p>We believe that metadata will be useful at all stages of the astroparticle
data life cycle that we consider in our initiative. Metadata can also simplify
software development for exchanging and aggregating astroparticle data from
various sources in the case of multi-messenger analysis. We plan to share our
experience of extracting metadata from raw data with other scientific
collaborations.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Acknowledgments</title>
        <p>This work was financially supported by the Russian Scientific Foundation (Grant
No. 18-41-06003).</p>
        <p>9. P. A. Bezyazeekov et al., "Measurement of cosmic-ray air showers with the
Tunka Radio Extension (Tunka-Rex)," Nucl. Instrum. Meth., vol. A802, pp. 89-96, 2015.</p>
        <p>10. A. P. Kryukov and A. P. Demichev, "Architecture of distributed data storage
for astroparticle physics," Lobachevskii Journal of Mathematics, vol. 39, no. 9,
pp. 1199-1206, 2018.</p>
        <p>11. I. Bychkov et al., "Using binary file format description languages for
documenting, parsing, and verifying raw data in TAIGA experiment," CoRR,
vol. abs/1812.01324, 2018.</p>
        <p>12. A. Khmel'nov, I. Bychkov, and A. Mikhailov, "A declarative language FlexT for analysis and
documenting of binary data formats," Proceedings of ISP RAS, vol. 28, no. 5,
pp. 239-268, 2016.</p>
        <p>13. E. B. Postnikov, A. P. Kryukov, S. P. Polyakov, D. A. Shipilov, and D. P.
Zhurov, "Gamma/hadron separation in imaging air Cherenkov telescopes using
deep learning libraries TensorFlow and PyTorch," Journal of Physics: Conference
Series, vol. 1181, p. 012048, 2019.</p>
        <p>14. A. P. Kryukov and A. P. Demichev, "Architecture of distributed data storage
for astroparticle physics," Lobachevskii Journal of Mathematics, vol. 39, no. 9,
pp. 1199-1206, 2018.</p>
        <p>15. A. Kryukov and M.-D. Nguyen, "A distributed storage for astroparticle physics,"
EPJ Web of Conferences, vol. 207, p. 08003, 2019.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Demchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Grosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wibisono</surname>
          </string-name>
          , and C. de Laat,
          <article-title>"Addressing big data challenges for scientific data infrastructure,"</article-title>
          <source>in 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings</source>
          , pp.
          <fpage>614</fpage>
          -
          <lpage>617</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. I. Bychkov et al.,
          <article-title>"Russian-German astroparticle data life cycle initiative,"</article-title>
          <source>Data</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>4</issue>
          , p.
          <fpage>56</fpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P. A.</given-names>
            <surname>David</surname>
          </string-name>
          ,
          <article-title>"Understanding the emergence of 'open science' institutions: functionalist economics in historical context,"</article-title>
          <source>Indus. &amp; Corp. Change</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>571</fpage>
          -
          <lpage>589</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Nosek</surname>
          </string-name>
          et al.,
          <article-title>"Promoting an open research culture,"</article-title>
          <source>Science</source>
          , vol.
          <volume>348</volume>
          , no.
          <issue>6242</issue>
          , pp.
          <fpage>1422</fpage>
          -
          <lpage>1425</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>N.</given-names> <surname>Budnev</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Astapov</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Bezyazeekov</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Bogdanov</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Boreyko</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Buker</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Bruckner</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Chiavassa</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Chvalaev</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Gress</surname></string-name>
          et al.,
          <article-title>"The TAIGA experiment: from cosmic ray to gamma-ray astronomy in the Tunka valley,"</article-title>
          <source>J. Phys. Conf. Ser.</source>
          , vol.
          <volume>718</volume>
          , no.
          <issue>5</issue>
          , p.
          <fpage>052006</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Prosin</surname>
          </string-name>
          et al.,
          <article-title>"Results from Tunka-133 (5 years observation) and from the Tunka-HiSCORE prototype,"</article-title>
          <source>EPJ Web Conf.</source>
          , vol.
          <volume>121</volume>
          , p.
          <fpage>03004</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Kuzmichev</surname>
          </string-name>
          et al.,
          <article-title>"TAIGA Gamma Observatory: Status and Prospects,"</article-title>
          <source>Phys. Atom. Nucl.</source>
          , vol.
          <volume>81</volume>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>507</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Monkhoev</surname>
          </string-name>
          et al.,
          <article-title>"The Tunka-Grande experiment: Status and prospects,"</article-title>
          <source>Bull. Russ. Acad. Sci.</source>
          , vol.
          <volume>81</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>468</fpage>
          -
          <lpage>470</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>