Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC'2019)
Budva, Becici, Montenegro, September 30 – October 4, 2019

DATA KNOWLEDGE BASE: METADATA INTEGRATION SYSTEM FOR HENP EXPERIMENTS

M. Golosova 1, M. Grigorieva 2, V. Aulov 1, A. Kaida 3, M. Borodin 4

1 NRC "Kurchatov Institute", 1 Akademika Kurchatova sq., Moscow, 123182, Russia
2 Lomonosov Moscow State University, 1 Leninskie Gory, Moscow, 119991, Russia
3 NR Tomsk Polytechnic University, 30 Lenina prospekt, Tomsk, 634050, Russia
4 University of Iowa, Iowa, USA

E-mail: golosova_mv@nrcki.ru

HENP experiments, especially long-lived ones like the ATLAS experiment at the LHC, have a diverse and evolving ecosystem of information systems that help scientists to organize research processes – such as data handling (including data taking, simulation, processing, storage, and access), preparation and discussion of publications, etc. With time all the components of the ecosystem grow, develop into complex structures, accumulate metadata and become more independent and less flexible. Automated information integration becomes a pressing need for effective operation within the ecosystem. This contribution is dedicated to the meta-system known as the Data Knowledge Base (DKB), designed to integrate information from multiple independent sources and provide fast and flexible access to the integrated knowledge. Over the last two years the system has been successfully integrated with the production system of the ATLAS experiment, including the extension of the production system web interface with functionality built upon the unified metadata provided by DKB.

Keywords: information integration, online analytics, metadata, mega-science

Marina Golosova, Maria Grigorieva, Anastasiia Kaida, Vasilii Aulov, Mikhail Borodin

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Present-day HENP scientific experiments at the forefront of research often involve complicated facilities (like the LHC [1] or NICA [2]) and massive apparatus (like the ATLAS detector [3] at the LHC or the European XFEL [4]), and run for years and even decades. Every project of such scale (a mega-science project) must be accompanied by an information infrastructure to manage all operations of the research process. While unique for every particular project, each such infrastructure addresses a similar set of tasks: experimental data storage, search and access, modelled data production, software development, execution of computing tasks on shared computing resources, resource allocation for different scientific groups, etc. For every type of activity, the executed operations often produce and/or require auxiliary information about data or processes – metadata. These metadata are also used for monitoring purposes and for tracking the current state of specific processes. Combined together, different types of metadata can additionally be used to obtain summary information about the whole infrastructure – in order to detect possible or actual problems, find ways to solve them and determine directions for further development. The more complicated the project is and the more diversified the tasks the infrastructure must serve, the more complex the metadata management becomes.
In very simple cases metadata may be managed by a single information system that takes care of consistent storage and timely updates and provides users and analytical tools with a convenient representation of the information for every use case; yet for a mega-science project the volumes and diversity of metadata lead to the development of largely independent systems, each operating with a different scope of metadata. These metadata scopes, although related to different parts of the project, remain semantically connected – and for analytical purposes it often proves useful to bring them together and treat information from multiple scopes in terms of a unified metadata model. In the case of a diverse infrastructure, where a wide variety and vast volumes of information are handled by multiple different systems, this task requires special attention, as straightforward information integration and aggregation "on the fly" may take a lot of time, effectively restricting interaction with the analytical tools to offline mode. Offline analytics is widely used for the regular generation of pre-defined reports, but it is inefficient for analytical research on the operation of the infrastructure as a whole.

This paper presents the authors' view on the problem of interactive analytics for complex information infrastructures of HENP experiments and describes an approach applied to the development of a meta-information system prototype, aimed at serving multi-scope analytical tasks, in the case of the ATLAS experiment at the LHC.

2. Metadata concept hierarchy

In the case of the ATLAS experiment at the LHC, online analytics is mostly concerned with the metadata related to physics data production (modelling and preparation for analysis), storage and analytical processing. The high-level management is performed in global terms, such as "Monte Carlo simulation campaign" or "data sample", but metadata are managed at the level of smaller objects – such as "dataset" (a data storage unit defined in the Rucio meta-catalog [5]), "production request" (a user- or group-generated request for massive data processing) or "task" (a processing unit defined in Production System 2 [6]). The high-level objects are mostly defined in human-readable form and used by people to simplify communication; they carry valuable semantic meaning but very little of the metrics information useful for analytical processing. In contrast, the low-level objects are less semantically charged, yet provide plenty of metrics for the events and objects directly managed by the corresponding systems. Figure 1 represents the concept hierarchies for both data storage and computing in terms of the ATLAS information infrastructure; it also provides examples of metrics related to specific concepts, and links to the systems that manage this information: JIRA, DEFT, JEDI, AMI and Rucio.

Figure 1. Metadata concept hierarchies and data sources (in the case of the ATLAS experiment)

The most common analytical request is to calculate some low-level metric for a high-level object. In terms of the ATLAS information infrastructure, such a request can be, for example, to calculate the storage size of a Data Sample (as the sum of the storage sizes of all included Datasets), or the CPU resources spent on its production (as the sum of the CPU usage metric of all Jobs executed to produce the included Datasets, at every processing Step of the Task Chains, starting either from the first Step (modelled data generation, or event reconstruction for the real data taken from the detector) or from any other specific one).
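As a minimal illustration of how such a request reduces to aggregation over low-level objects, the following Python sketch computes two high-level metrics for a Data Sample from per-task records; the record structure and field names (sample, dataset_bytes, job_cpu_seconds) are hypothetical and serve only to convey the idea, not the actual ATLAS metadata schema:

# Hypothetical per-task metadata records, as they could be collected from the low-level systems.
tasks = [
    {"sample": "Sample_A", "step": "simul", "dataset_bytes": 7_500_000_000, "job_cpu_seconds": 420_000},
    {"sample": "Sample_A", "step": "recon", "dataset_bytes": 3_200_000_000, "job_cpu_seconds": 310_000},
    {"sample": "Sample_B", "step": "simul", "dataset_bytes": 5_100_000_000, "job_cpu_seconds": 280_000},
]

def sample_metrics(records, sample):
    """Calculate low-level metrics (storage size, CPU usage) for one high-level object (a Data Sample)."""
    selected = [r for r in records if r["sample"] == sample]
    return {
        "storage_size_bytes": sum(r["dataset_bytes"] for r in selected),
        "cpu_seconds": sum(r["job_cpu_seconds"] for r in selected),
    }

print(sample_metrics(tasks, "Sample_A"))
# {'storage_size_bytes': 10700000000, 'cpu_seconds': 730000}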
3. Data Knowledge Base

The Data Knowledge Base is a meta-information system designed to bring together different scopes of metadata, reconstruct missing or implicit links between objects and provide fast and flexible access to the information, in particular about the high-level objects. The prototype, developed for the Production System [7] of the ATLAS collaboration, uses the Elasticsearch full-text search engine [8] to store integrated metadata at the level of Task and Dataset objects.

The data model used to index information about these objects assumes that Task object properties include the properties of the Task itself, references to the higher-level objects (like Task Chain or Campaign), and some properties aggregated over the low-level objects (Jobs), such as the actual CPU usage, while the Dataset properties contain only the Dataset object properties (both storage information and data characteristics) and no additional information. Each object within Elasticsearch is represented as a document (with object properties as the document fields), and Task/Dataset documents are connected as parent/child entities, where the Task is the parent and its output Datasets are child documents.

The indexing model described above made it possible to implement metadata integration as a two-branched ETL (Extract, Transform, Load) process, where one branch is responsible for integrating and indexing the metadata of a Task object, and the other processes the Dataset objects related to the Task. However, this division of information into two different types (Task and Dataset metadata), while making the prototype development simpler, has also imposed some restrictions on the metadata usage scenarios. The implementation of real-life use cases for the Production System users (production managers) revealed that these restrictions do affect the response time of the prototype, since in many scenarios Dataset properties are treated as if they belonged to the Task object. Such scenarios require additional effort for information retrieval, making request execution less performant. Depending on the number of documents in the selection, some of the implemented requests might take tens of seconds, while for seamless interactive communication with the system the response time should not exceed 10 seconds [9].

4. Indexing model improvement

To improve the performance of user operations, a new metadata indexing model was suggested. It takes into account the specifics of the use cases addressed so far, and its most noticeable change is that the output dataset properties are now stored together with the Task object, in the form of nested documents (instead of parent/child documents).
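To make the difference between the two indexing models concrete, the sketch below (Python with the Python Elasticsearch client, 7.x-style API) defines the Task index in both variants; the index and field names (taskid, request_id, campaign, output_datasets, data_format, bytes, events) are illustrative assumptions, not the actual DKB schema:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Original model: Tasks and Datasets as separate documents linked by a parent/child join field.
parent_child_mapping = {
    "mappings": {
        "properties": {
            "doc_relation": {"type": "join", "relations": {"task": "output_dataset"}},
            "taskid": {"type": "long"},
            "request_id": {"type": "long"},
            "campaign": {"type": "keyword"},
            "data_format": {"type": "keyword"},
            "bytes": {"type": "long"},
            "events": {"type": "long"},
        }
    }
}
es.indices.create(index="tasks_parent_child", body=parent_child_mapping)

# New model: output dataset properties stored inside the Task document as nested sub-documents.
nested_mapping = {
    "mappings": {
        "properties": {
            "taskid": {"type": "long"},
            "request_id": {"type": "long"},
            "campaign": {"type": "keyword"},
            "output_datasets": {
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "data_format": {"type": "keyword"},
                    "bytes": {"type": "long"},
                    "events": {"type": "long"},
                },
            },
        }
    }
}
es.indices.create(index="tasks_nested", body=nested_mapping)

# A Task document now carries its output datasets with it.
es.index(
    index="tasks_nested",
    id=17000001,
    body={
        "taskid": 17000001,
        "request_id": 12345,
        "campaign": "mc16",
        "output_datasets": [
            {"name": "dataset_1.AOD", "data_format": "AOD", "bytes": 7500000000, "events": 2000000},
            {"name": "dataset_1.DAOD", "data_format": "DAOD", "bytes": 320000000, "events": 2000000},
        ],
    },
)

With the parent/child variant every output Dataset is indexed as a separate child document routed to its parent Task, so cross-object requests have to join documents at query time; the nested variant avoids this by keeping a Task and its output datasets in one document.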
Operations with nested documents show better performance because all related documents are stored not only in the same shard, but in the same Lucene block. It also means that re-indexing one of the documents leads to re-indexing of all related ones; but when, as in this case, the indexed documents contain only object metadata (and not the object content, which may be sizeable), their sizes are quite small and re-indexing is not a very expensive operation.

To test the new model against the one originally used in the prototype, a dedicated single-node Elasticsearch instance was used (heap space: 4 GB; index volume: 4M records (~2M tasks, ~2M datasets), about 5 GB). During the testing all caching mechanisms were disabled where possible, and the Elasticsearch and disk caches that could not be disabled were cleared after every request: reading data from memory (and, even more so, getting request results from the cache) would almost eliminate the effect of the scheme change at the given data volumes, reducing the request execution time to an almost immediate response.

Figure 2. Test results: response time in relation to the number of tasks or documents (both tasks and output datasets) matching requests 1-4 (1 - keyword search; 2 - calculation of derivation process efficiency; 3 - calculation of general statistics for a derivation or reprocessing request by output data format; 4 - calculation of general statistics for a campaign by processing steps)

The comparison of the two indexing models was made in 4 different cases:
● keyword (Google-like) search of tasks and related datasets;
● calculation of derivation process efficiency statistics in accordance with the output data format (aggregation over all input and output datasets of tasks selected by specific properties);
● general statistics (resource usage, number of processed events, …) for:
  ○ an MC production campaign (by steps);
  ○ a reprocessing or derivation request (by output data format; one such aggregation is sketched at the end of this section).

The test results (Fig. 2) show that the new indexing model reduces the response time in all test cases and also scales better. This means that introducing this model into the prototype will allow users to perform interactive analysis of greater volumes of metadata with the same configuration of the backend Elasticsearch cluster, reducing the resource requirements.
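For illustration, the sketch below shows how such an aggregation request could be expressed against the nested indexing model; it continues the hypothetical index and field names used above and does not reflect the actual DKB schema:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# General statistics for one (hypothetical) reprocessing/derivation request,
# grouped by the output data format of the nested dataset sub-documents.
query = {
    "size": 0,
    "query": {"term": {"request_id": 12345}},
    "aggs": {
        "datasets": {
            "nested": {"path": "output_datasets"},
            "aggs": {
                "by_format": {
                    "terms": {"field": "output_datasets.data_format"},
                    "aggs": {
                        "total_bytes": {"sum": {"field": "output_datasets.bytes"}},
                        "total_events": {"sum": {"field": "output_datasets.events"}},
                    },
                },
            },
        },
    },
}

response = es.search(index="tasks_nested", body=query)
for bucket in response["aggregations"]["datasets"]["by_format"]["buckets"]:
    print(bucket["key"], bucket["total_bytes"]["value"], bucket["total_events"]["value"])

Since every Task document already carries its output datasets, such a request is served from a single index without parent/child lookups at query time, which is consistent with the response time reduction observed in the tests.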
5. Conclusion

The Data Knowledge Base prototype, successfully integrated with the Production System of the ATLAS experiment, allows the execution of complex analytical requests that require information from different information systems and from different levels of abstraction. Some of the metadata in use may belong to a computing Job executed on a single computing node, while other metadata describe a Data Sample, which may include petabytes of data produced under the same conditions (software version, configuration parameters, etc.). Although the prototype allows the implementation of many different scenarios, providing flexible access to the integrated metadata, in some cases the response time still exceeds the upper limit usually considered acceptable for interactive operations. A possible solution is to improve the metadata indexing and storage scheme, bringing pieces of information that are often used together into a single document (with nested sub-documents). Performance testing of both schemes (the currently used one and the one developed according to the suggested solution) has shown that the suggested changes in the storage scheme do improve the performance of user operations and also make the system more scalable.

Currently the DKB development team is working on updating the operating DKB prototype installed at CERN to apply these changes: in addition to re-indexing all the stored data with the new scheme, all the metadata integration scenarios (ETL processes) responsible for filling and regularly updating the Elasticsearch storage are being updated accordingly; the update also requires improvements of the user interface to make this (and any possible future) change in the storage scheme transparent to the end users.

6. Acknowledgement

This work is supported by the Russian Science Foundation under contract № 18-37-20003.

References

[1] LHC, Large Hadron Collider // CERN Publication, European Laboratory for Particle Physics, June 1990
[2] Agapov N. et al. Design and Construction of Nuclotron-based Ion Collider fAcility (NICA). Conceptual Design Report, edited by I. Meshkov and A. Sidorin // JINR, Dubna, 2008. Available at: http://nica.jinr.ru/files/NICA_CDR.pdf (accessed 06.11.2019)
[3] ATLAS Collaboration. The ATLAS Experiment at the CERN Large Hadron Collider // JINST 3 S08003, 2008. Available at: http://nordberg.web.cern.ch/nordberg/PAPERS/JINST08.pdf (accessed 06.11.2019)
[4] Altarelli M. et al. XFEL: The European X-Ray Free-Electron Laser – Technical Design Report // DESY 2006-097, 2006. Available at: https://xfel.desy.de/localfsExplorer_read?currentPath=/afs/desy.de/group/xfel/wof/EPT/TDR/XFEL-TDR-final.pdf (accessed 06.11.2019)
[5] Rucio – Scientific Data Management // https://rucio.cern.ch (accessed 06.11.2019)
[6] Barreiro F.H., Borodin M., De K., Golubkov D., Klimentov A., Maeno T., Mashinistov R., Padolski S., Wenaus T. on behalf of the ATLAS Collaboration. The ATLAS Production System Evolution: New Data Processing and Analysis Paradigm for the LHC Run2 and High-Luminosity // IOP Conf. Series: Journal of Physics: Conf. Series 898 (2017) 052016, doi:10.1088/1742-6596/898/5/052016. Available at: http://inspirehep.net/record/1638474/files/pdf.pdf (accessed 10.10.2019)
[7] Golosova M.V., Aulov V.A., Grigoryeva M.A., Kaida A.Y. Data Knowledge Base for the ATLAS collaboration // Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10-14, 2018
[8] Elasticsearch // https://www.elastic.co/products/elasticsearch (accessed 17.10.2018)
[9] Nielsen J. Usability Engineering // New York: Academic Press, 1993