=Paper=
{{Paper
|id=Vol-3041/256-260-paper-47
|storemode=property
|title=Data Knowledge Base Current Status and Operation
|pdfUrl=https://ceur-ws.org/Vol-3041/256-260-paper-47.pdf
|volume=Vol-3041
|authors=Viktor Kotliar
}}
==Data Knowledge Base Current Status and Operation==
<pdf width="1500px">https://ceur-ws.org/Vol-3041/256-260-paper-47.pdf</pdf>
<pre>
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


       DATA KNOWLEDGE BASE CURRENT STATUS AND
                    OPERATION
                                             V. Kotliar a
 Institute for High Energy Physics named by A.A. Logunov of National Research Center “Kurchatov
                 Institute”, Nauki Square 1, Protvino, Moscow region, Russia, 142281

                                    E-mail: a viktor.kotliar@ihep.ru


The Data Knowledge Base (DKB) project aims at knowledge acquisition and metadata integration. It
provides fast response for a variety of complicated queries, such as summary reports and monitoring
tasks (aggregation queries) and multi-system join queries. Such queries are not easy to implement in a
timely manner and, obviously, are less efficient than a query to a single system with integrated and
pre-processed information would be. This work describes the status of the project as well as its
integration with the ATLAS Workflow Management and future perspectives.

Keywords: information integration, metadata integration, metadata, workflow pipelines


                                                                                             Viktor Kotliar


                                                             Copyright © 2021 for this paper by its authors.
                    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1. Introduction
         The Data Knowledge Base (DKB) project aims at knowledge acquisition and metadata
integration [1]. It started in 2016 with the following main purposes: to integrate and link pieces of
information from independent sources (pdf, indico, wiki page, etc.); to reconstruct connections
between research results and data samples; and to provide fast and flexible access to everything
people might want to know about some process or object. From 2018, the main goal of the project
became the creation of a universal tool for multi-source queries. A Python library, pyDKB [2], was
created to address the needs of workflow pipelines adapted to High Energy Physics (HEP) projects.
The ATLAS [3] dataflow system is built on the developed software, which consists of:
               ETL (Extract, Transform, Load) pipeline flow [4] based on scripts and library;
               System to run and check the flow;
               NoSQL database to store results;
               REST API to access system;
               Frontend UI for users.
       This system is used in the production system of the ATLAS experiment to operate on GRID
computing metadata and to prepare for LHC Run 3.


2. DKB environment overview
        The DKB project has a distributed environment spread over several virtual machines hosted
by the CERN OpenStack infrastructure [5]. These machines are managed by the computing center's
virtual machine management system, which includes Puppet and Foreman profiles. The whole environment
is split into production, quality assurance and development servers. The CentOS 7 x86_64 operating
system is used as the base OS for all services. The production system is shown in figure 1.

        [Figure: es.atlas-dkb.cern.ch -> Nginx proxy -> ES on aiatlas171 (master);
         api.atlas-dkb.cern.ch -> Nginx server -> ES on aiatlas172;
         two-copy replication between the two ES nodes]
                                Figure 1. DKB production environment

        It consists of two servers, aiatlas171 (master) and aiatlas172, with the load-balanced names
es.atlas-dkb.cern.ch and api.atlas-dkb.cern.ch assigned to them respectively. The Elasticsearch [6]
engine is used for data preservation and is configured for two-copy replication mode. This mode
provides good read speed and data safety. Two nginx servers are used to access DKB from outside.
The first one works as a proxy that provides direct access to the Elasticsearch engine, with access
levels for users with read-only or read-write permissions. The second one works as an HTTP server
for the DKB API software, which is based on a Python FastCGI program. The main DKB workflow pipeline
is configured to run only on the master node, leaving the slave node only for serving API requests.
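
        As an illustration of the two-copy mode, the sketch below builds Elasticsearch index
settings in which every primary shard has one replica, so that each document exists in two copies.
Reading "two-copy replication" as number_of_replicas = 1 is an assumption, and the helper names and
the shard count are hypothetical rather than taken from the DKB configuration.

```python
def two_copy_index_settings(primary_shards=1):
    """Return index settings that keep two copies of every shard.

    Assumption: "two-copy replication" means one primary plus one replica.
    """
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                # One replica per primary: 1 primary + 1 replica = 2 copies.
                "number_of_replicas": 1,
            }
        }
    }


def copies_per_document(settings):
    """Number of shard copies holding each document (primary + replicas)."""
    return 1 + settings["settings"]["index"]["number_of_replicas"]
```

With two data nodes, Elasticsearch places a primary and its replica on different nodes, which is
what allows reads to be served from either server.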
       The project sources are available on GitHub [1], and the development-to-production workflow
goes through GitHub pull requests [fig. 2].

        [Figure: https://github.com/PanDAWMS/dkb; new branch -> pull request with review -> master;
         master -> API on aiatlas172; master -> tag release -> data4es-prod on aiatlas171]
                                 Figure 2. DKB development workflow

        After a new feature or bug fix has been added and tested in a new git branch, a pull
request is created to merge the changes into the master branch. Every pull request is carefully
reviewed by another project member and only then merged into master. The master branch is applied
automatically to the API server (through a Puppet profile), while for the production workflow
server it is tagged manually and applied as the data4es-prod branch. At the moment DKB provides API
version 0.3.3, and the DKB production workflow runs version 0.2-0.
       The current environment stores about 15 GB of data for ATLAS production tasks and 50 GB of
data for ATLAS analysis tasks. Every hour it loads and stores metadata for around 1500 tasks and
5000 datasets from the ATLAS experiment.


3. Metadata integration
        At present DKB serves the ATLAS collaboration Production System [7] as an integration
service for metadata at the level of Task and Dataset objects [fig. 3].


                              Figure 3. DKB ATLAS metadata integration

       Information updates are based on the “task timestamp” from the ProdSys database. The main
information comes from DEFT (Database Engine for Tasks) and is extended with additional metadata
from other systems such as:
               AMI - ATLAS Metadata Interface;
               JEDI - Job Execution and Definition Interface;
               Rucio - scientific data management system.
        Finally, once the new integrated metadata is stored in a single Elasticsearch instance,
search queries across all of these systems become simpler. Such queries are integrated into the
ProdSys user interface through web access [6].
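
        The timestamp-based update and enrichment described above can be sketched as follows. This
is a minimal illustration and not the DKB implementation: the record layout, the field names
(taskid, task_timestamp, ami_tag, output_bytes) and the sample values are all hypothetical.

```python
from datetime import datetime


def select_updated(tasks, last_run):
    """Return tasks whose timestamp is newer than the previous run."""
    return [t for t in tasks if t["task_timestamp"] > last_run]


def integrate(task, *extra_sources):
    """Merge metadata from additional sources into a copy of the task record.

    Each source is a dict keyed by task id; fields already present in the
    task record are kept when a source provides the same key.
    """
    merged = dict(task)
    for source in extra_sources:
        for key, value in source.get(task["taskid"], {}).items():
            merged.setdefault(key, value)
    return merged


# Hypothetical sample records standing in for DEFT, AMI and Rucio output.
deft_tasks = [
    {"taskid": 1, "task_timestamp": datetime(2021, 7, 5, 10, 0), "status": "done"},
    {"taskid": 2, "task_timestamp": datetime(2021, 7, 5, 12, 0), "status": "running"},
]
ami = {2: {"ami_tag": "hypothetical-tag"}}
rucio = {2: {"output_bytes": 1024}}

last_run = datetime(2021, 7, 5, 11, 0)
documents = [integrate(t, ami, rucio) for t in select_updated(deft_tasks, last_run)]
```

Only the task changed after the previous run is selected, and the resulting document carries the
fields of all sources, so a single query can answer what would otherwise need a multi-system join.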
         From the implementation point of view, this workflow pipeline is an ETL process, which is
shown in figure 4. It is implemented as a single Linux bash script that calls the different stages
(workflow parts). The stages may use any software internally, but the DKB Python library is used to
simplify both the communication between them and the building of new stages.
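
        One way such stages could communicate is sketched below: each stage reads one JSON record
per line and writes transformed records in the same form, so that a shell script can chain stages
with pipes. The line-delimited JSON protocol, the function names and the processed_by field are
assumptions made for this illustration; they are not the pyDKB API.

```python
import io
import json
import sys


def run_stage(transform, infile=sys.stdin, outfile=sys.stdout):
    """Read one JSON record per line, transform it, and write the result.

    Empty lines are skipped; records for which transform returns None
    are dropped.
    """
    for line in infile:
        line = line.strip()
        if not line:
            continue
        result = transform(json.loads(line))
        if result is not None:
            outfile.write(json.dumps(result) + "\n")


def add_source_label(record):
    """Example transform: tag each record with a hypothetical field."""
    record["processed_by"] = "example-stage"
    return record


# In-memory demonstration; in a pipeline the stage would use stdin/stdout.
demo_in = io.StringIO('{"taskid": 1}\n\n{"taskid": 2}\n')
demo_out = io.StringIO()
run_stage(add_source_label, demo_in, demo_out)
output_records = [json.loads(line) for line in demo_out.getvalue().splitlines()]
```

Keeping the input and output handling in one shared function means that a new stage only has to
supply its transform, which is the kind of simplification a common library is meant to provide.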


                             Figure 4. DKB ATLAS ETL workflow pipeline


4. Recent changes and plans
        Several changes have been made to the DKB project recently, mainly aimed at improving
system performance and upgrading to a new version of the Elasticsearch engine. To improve the
performance of user operations, a new metadata indexing model has been implemented for ATLAS
integrated metadata. It takes into account the specifics of the use cases already addressed. The
most noticeable change is that the properties of output datasets are now stored together with the
Task object as nested documents (instead of parent/child documents). This simplifies the queries to
the Elasticsearch index used in the most problematic requests from those use cases. The internal
communication protocol between DKB stages has also been investigated, with the aim of using batch
processing instead of the serial processing currently used in production. The nearest plan for DKB
is to migrate fully to the CERN production Elasticsearch infrastructure and to move data storage
out of the project into a dedicated external service [fig. 5].
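
        To illustrate the nested-document model, the sketch below shows an Elasticsearch mapping in
which output datasets are nested inside the task document, together with a nested query that selects
tasks by a property of their output datasets. The field names (taskid, output_dataset, name, bytes)
and the helper functions are hypothetical; only the "nested" field type and the "nested" query are
standard Elasticsearch features.

```python
def task_mapping():
    """Mapping in which output datasets are nested inside the task document."""
    return {
        "mappings": {
            "properties": {
                "taskid": {"type": "long"},
                "output_dataset": {
                    "type": "nested",
                    "properties": {
                        "name": {"type": "keyword"},
                        "bytes": {"type": "long"},
                    },
                },
            }
        }
    }


def tasks_with_large_output(min_bytes):
    """Query for tasks with at least one output dataset of min_bytes or more."""
    return {
        "query": {
            "nested": {
                "path": "output_dataset",
                "query": {
                    "range": {"output_dataset.bytes": {"gte": min_bytes}}
                },
            }
        }
    }
```

Because the datasets live inside the task document, such a request is a single query against one
index and does not need the join between parent and child documents.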
        [Figure: left, the current setup of figure 1 (es.atlas-dkb.cern.ch with Nginx proxy and
         api.atlas-dkb.cern.ch with Nginx server; ES on aiatlas171 (master) and aiatlas172 with
         two-copy replication); right, the target setup (api.atlas-dkb.cern.ch with Nginx server;
         aiatlas171 (dataflow) and aiatlas172; CERN ESK service 7.10,
         https://atlas-dkb.cern.ch/es)]
                            Figure 5. DKB migration to CERN Elasticsearch



5. Conclusion
        The Data Knowledge Base project is successfully integrated with the Production System of the
ATLAS experiment. It allows complex analytical requests, which require information from different
information systems and from different levels of abstraction, to be executed in a timely manner. The
developed library and recent changes allow multiple different scenarios for metadata integration to
be implemented, providing a flexible tool for building metadata workflow pipelines. Stable operation
in production, together with good availability and accessibility, has made it possible to use the
DKB metadata integration service in processing ATLAS metadata for tasks and datasets in preparation
for LHC Run 3.


6. Acknowledgement
       The DKB ATLAS metadata integration service is supported by CERN IT and by the CERN
ATLAS IT support team. The UI for DKB was added by M. Borodin of the ATLAS ProdSys development
team.
        The DKB project is supported by NRC "Kurchatov Institute".
        Special thanks to M. Golosova and V. Aulov for the project development.


References
[1] Grigoryeva M., Golosova M., Klimentov A., Wenaus T. Data Knowledge Base for HENP
Scientific Collaborations // Journal of Physics: Conference Series, vol. 1085, issue 3, 2018
[2] The Data Knowledge Base for HENP experiments [DKB]. Available at:
https://github.com/PanDAWMS/dkb (accessed 22.09.2021)
[3] ATLAS Collaboration. The ATLAS Experiment at the CERN Large Hadron Collider [ATLAS].
Available at: https://nordberg.web.cern.ch/nordberg/PAPERS/JINST08.pdf (accessed 22.09.2021)
[4] Extract, transform, load procedure in computing [ETL]. Available at:
https://en.wikipedia.org/wiki/Extract,_transform,_load (accessed 22.09.2021)
[5] CERN OpenStack Private Cloud Guide [CERN OpenStack]. Available at:
https://clouddocs.web.cern.ch/ (accessed 22.09.2021)
[6] The Elastic Stack [ESK]. Available at: https://www.elastic.co/elastic-stack/ (accessed
22.09.2021)
[7] The ATLAS collaboration Production System [ProdSys]. Available at:
https://prodtask.cern.ch/dkb/ (accessed 22.09.2021)



</pre>