<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DATA KNOWLEDGE BASE CURRENT STATUS AND OPERATION</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor Kotliar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for High Energy Physics named by A.A. Logunov of National Research Center “Kurchatov Institute”</institution>
          ,
          <addr-line>Nauki Square 1, Protvino, Moscow region, Russia, 142281</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>The Data Knowledge Base (DKB) project aims at knowledge acquisition and metadata integration. It provides fast response for a variety of complicated queries, such as summary reports and monitoring tasks (aggregation queries) and multi-system join queries. Such queries are not easy to implement in a timely manner and, obviously, are less efficient than a query to a single system with integrated and pre-processed information would be. This work describes the status of the project as well as its integration with the ATLAS Workflow Management and future perspectives.</p>
      </abstract>
      <kwd-group>
        <kwd>information integration</kwd>
        <kwd>metadata integration</kwd>
        <kwd>metadata</kwd>
        <kwd>workflow pipelines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Data Knowledge Base (DKB) project aims at knowledge acquisition and metadata
integration [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It started in 2016 with three main purposes: to integrate and link pieces of information from
independent sources (PDF, Indico, wiki pages, etc.); to reconstruct connections between research results
and data samples; and to provide fast and flexible access to everything people might want to know about
a given process or object. In 2018 the main goal of the project changed to creating a universal tool for
multi-source queries. A Python library, pyDKB [2], was created to address the needs of workflow
pipelines adapted to High Energy Physics (HEP) projects. The ATLAS [3] dataflow system is
built on the developed software, which consists of:
      </p>
      <p>ETL (Extract, Transform, Load) pipeline flow [4] based on scripts and library;</p>
      <p>System to run and check the flow;</p>
      <p>NoSQL database to store results;</p>
      <p>REST API to access system;</p>
      <p>Frontend UI for users.</p>
      <p>This system is used in the production system of the ATLAS experiment to work with GRID
computing metadata and to prepare for LHC Run 3.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DKB environment overview</title>
      <p>The DKB project runs in a distributed environment of several virtual machines hosted on the CERN
OpenStack infrastructure [5]. These machines are managed by the computing center's virtual machine
management system, which includes Puppet and Foreman profiles. The whole environment is
split into production, quality assurance and development servers. CentOS 7 x86_64 is
the base operating system for all services. The production system is shown in figure 1.</p>
      <p>[Figure 1. DKB production system: es.atlas-dkb.cern.ch (nginx proxy, Elasticsearch on aiatlas171, master) and api.atlas-dkb.cern.ch (nginx server, Elasticsearch on aiatlas172), with two-copy replication between the Elasticsearch nodes.]</p>
      <sec id="sec-2-4">
        <p>The production system consists of two servers, aiatlas171 (master) and aiatlas172, with the load-balanced
names es.atlas-dkb.cern.ch and api.atlas-dkb.cern.ch assigned to them respectively. The Elasticsearch [6] engine
is used for data storage and is configured in two-copy replication mode, which gives good read
performance and protects the data. Two nginx servers provide outside access to DKB. The first works
as a proxy that enforces access levels to the Elasticsearch engine for users with read-only or read-write
permissions. The second works as an HTTP server for the DKB API software, which is a Python FastCGI
program. The main DKB workflow pipeline is configured to run only on the master node, leaving the
second node to serve API requests only.</p>
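        <p>As a minimal sketch of what two-copy replication can look like in Elasticsearch terms, the settings below request one replica per primary shard, so that each shard exists in two copies. The shard count, the index name in the comment and the helper names are assumptions made for illustration; they are not taken from the DKB configuration.</p>
        <preformat>
```python
import json


def two_copy_index_settings(primary_shards=1):
    """Build index settings with one replica per primary shard (two copies in total).

    `number_of_shards` and `number_of_replicas` are standard Elasticsearch
    index settings; the values chosen here are illustrative assumptions.
    """
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": 1,  # one primary + one replica = two copies
            }
        }
    }


def copies_per_shard(index_body):
    """Return how many copies of each shard exist (primary plus replicas)."""
    return 1 + index_body["settings"]["index"]["number_of_replicas"]


if __name__ == "__main__":
    body = two_copy_index_settings()
    # Such a body would be sent when creating an index,
    # e.g. PUT /hypothetical-index (index name is an assumption).
    print(json.dumps(body, indent=2))
    print("copies per shard:", copies_per_shard(body))
```
        </preformat>
        <p>With two data nodes and one replica per shard, each node can hold a full copy of the data, which is consistent with the read-speed and safety goals stated above.</p>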
        <p>
          The project sources are available on GitHub [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and the development-to-production workflow goes
through GitHub pull requests [fig. 2].
        </p>
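        <p>[Figure 2. Development-to-production workflow: a new branch is merged into master through a reviewed pull request; master is applied to the API server (aiatlas172); a tagged release is applied as data4es-prod on aiatlas171. Repository: https://github.com/PanDAWMS/dkb]</p>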
        <p>After new functionality or a bug fix has been added and tested in a new git branch, a pull
request is created to merge the changes into the master branch. Every pull request is reviewed
by another project member and is merged into master only after that review. The master branch
is applied automatically to the API server (through a Puppet profile); for the production workflow server
it is tagged manually and applied as the data4es-prod branch. At present DKB provides API
version 0.3.3, and the DKB production workflow runs version 0.2-0.</p>
        <p>The current environment stores about 15 GB of data for ATLAS production tasks and 50 GB of data for
ATLAS analysis tasks. Every hour it loads and stores metadata for about 1500 tasks
and 5000 datasets from the ATLAS experiment.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Metadata integration</title>
      <p>
        At present DKB serves the ATLAS collaboration Production System [
        <xref ref-type="bibr" rid="ref2">7</xref>
        ] as a metadata
integration service at the level of Task and Dataset objects [fig. 3].
      </p>
      <p>Information updates are driven by the “task timestamp” from the ProdSys database. The main information
comes from DEFT (Database Engine for Tasks) and is extended with additional metadata from other
systems, such as:</p>
      <p>JEDI - Job Execution and Definition Interface;</p>
      <p>Rucio - scientific data management system.</p>
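      <p>The update logic described above can be sketched as follows: select the tasks whose timestamp is newer than the previous run, then extend each record with metadata looked up in the other sources. All function names, field names and sample records here are hypothetical, and the real DEFT, JEDI and Rucio interfaces are not shown.</p>
      <preformat>
```python
from datetime import datetime


def select_updated(tasks, last_run):
    """Keep only tasks whose 'task_timestamp' is newer than the previous run."""
    return [t for t in tasks if t["task_timestamp"] > last_run]


def enrich(task, extra_sources):
    """Merge additional metadata (looked up by task id) into a copy of the task."""
    merged = dict(task)
    for source_name, records in extra_sources.items():
        # A source with no record for this task contributes an empty dict.
        merged[source_name] = records.get(task["taskid"], {})
    return merged


if __name__ == "__main__":
    last_run = datetime(2021, 9, 22, 10, 0)
    # Hypothetical records standing in for rows read from DEFT.
    deft_tasks = [
        {"taskid": 1, "task_timestamp": datetime(2021, 9, 22, 9, 30)},
        {"taskid": 2, "task_timestamp": datetime(2021, 9, 22, 10, 15)},
    ]
    # Hypothetical lookups standing in for JEDI and Rucio responses.
    extra = {
        "jedi": {2: {"status": "done"}},
        "rucio": {2: {"datasets": ["hypothetical.dataset.name"]}},
    }
    for task in select_updated(deft_tasks, last_run):
        print(enrich(task, extra))
```
      </preformat>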
      <p>Once the integrated metadata is stored in a single Elasticsearch instance, search queries across
all of these systems become simpler. Such queries are integrated into the ProdSys user interface
through web access [6].</p>
      <p>From the implementation point of view, this workflow pipeline is an ETL process, which is
shown in figure 4. It is implemented as a single Linux bash script that calls the individual stages (workflow
parts). A stage may use any software internally, but the DKB Python library is used to simplify both
communication between stages and the building of new stages.</p>
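      <p>The idea of independent stages chained into one flow can be illustrated with a short sketch. It assumes that stages exchange newline-delimited JSON records over text streams; this exchange format, the stage functions and the record fields are assumptions for illustration only, and the sketch does not reproduce the pyDKB interfaces.</p>
      <preformat>
```python
import io
import json


def run_stage(transform, in_stream, out_stream):
    """Read one JSON record per line, apply `transform`, write one JSON record per line."""
    for line in in_stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        out_stream.write(json.dumps(transform(json.loads(line))) + "\n")


def add_source(record):
    """Hypothetical transform stage: tag the record with its origin."""
    return {**record, "source": "hypothetical-extractor"}


def to_document(record):
    """Hypothetical load-preparation stage: wrap the record as an indexable document."""
    return {"_id": record["taskid"], "doc": record}


def run_pipeline(stages, text):
    """Chain stages the way a shell pipe would: the output of one is the input of the next."""
    for stage in stages:
        out = io.StringIO()
        run_stage(stage, io.StringIO(text), out)
        text = out.getvalue()
    return text


if __name__ == "__main__":
    extracted = '{"taskid": 1}\n{"taskid": 2}\n'
    print(run_pipeline([add_source, to_document], extracted), end="")
```
      </preformat>
      <p>In a shell-driven flow each stage would typically be a separate process reading standard input and writing standard output, so that stages written with different software can still be combined.</p>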
    </sec>
    <sec id="sec-4">
      <title>4. Recent changes and plans</title>
      <p>Several changes have been made to the DKB project recently, mainly aimed at improving
system performance and upgrading it to a new version of the Elasticsearch engine. To improve the
performance of user operations, a new metadata indexing model has been implemented for ATLAS integrated
metadata. It takes into account the specifics of the use-cases already addressed. The most
noticeable change is that the properties of output datasets are now stored together with the Task object, in
the form of nested documents (instead of parent/child documents). This simplifies the queries to the
Elasticsearch index that are used in the most problematic requests from those use-cases. The
internal communication protocol between DKB stages is also being investigated, with the aim of using batch
processing instead of the serial processing currently in production. The nearest-term plan for DKB is to fully migrate to
the CERN production Elasticsearch infrastructure and to move data storage out of the project into a
dedicated external service [fig. 5].</p>
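      <p>To make the nested-document model more concrete, the sketch below builds an Elasticsearch mapping in which output datasets are nested objects inside the task document, together with a nested query that finds tasks by an output dataset name. The field names (taskid, output_dataset, name, bytes) are hypothetical, and the typeless mapping layout assumes Elasticsearch 7 or later; the actual DKB index schema is not reproduced here.</p>
      <preformat>
```python
import json


def nested_task_mapping():
    """Mapping in which output datasets are nested objects inside the task document."""
    return {
        "mappings": {
            "properties": {
                "taskid": {"type": "integer"},
                "output_dataset": {
                    "type": "nested",  # each dataset is indexed as its own hidden sub-document
                    "properties": {
                        "name": {"type": "keyword"},
                        "bytes": {"type": "long"},
                    },
                },
            }
        }
    }


def tasks_with_dataset(name):
    """Nested query: match tasks that have an output dataset with the given name."""
    return {
        "query": {
            "nested": {
                "path": "output_dataset",
                "query": {"term": {"output_dataset.name": name}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(nested_task_mapping(), indent=2))
    print(json.dumps(tasks_with_dataset("hypothetical.dataset.name"), indent=2))
```
      </preformat>
      <p>Because a task and its datasets live in one document, a single query of this kind can filter on dataset properties without the join step that parent/child documents require.</p>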
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The Data Knowledge Base project has been successfully integrated with the Production System of the
ATLAS experiment. It allows complex analytical requests, which need information
from different information systems and from different levels of abstraction, to be executed in a timely manner. The
developed library and recent changes allow multiple scenarios of metadata integration to be
implemented, providing a flexible tool for building metadata workflow pipelines. Stable operation in
production, together with good availability and accessibility, has allowed the DKB metadata integration service to be used for
processing ATLAS task and dataset metadata in preparation for LHC Run 3.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgement</title>
      <p>The DKB ATLAS metadata integration service is supported by CERN IT and by the CERN
ATLAS IT support team. The UI for DKB was added by M. Borodin of the ATLAS ProdSys development team.</p>
      <p>The DKB project is supported by NRC "Kurchatov Institute".</p>
      <p>Special thanks to M. Golosova and V. Aulov for the project development.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Grigoryeva</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golosova</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimentov</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wenaus</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Data Knowledge Base for HENP Scientific Collaborations</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          , vol.
          <volume>1085</volume>
          , issue 3,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="bib2">
        <mixed-citation>[2] The Data Knowledge Base for HENP experiments [DKB]. Available at: <uri>https://github.com/PanDAWMS/dkb</uri> (accessed 22.09.2021).</mixed-citation>
      </ref>
      <ref id="bib3">
        <mixed-citation>[3] ATLAS Collaboration. The ATLAS Experiment at the CERN Large Hadron Collider [ATLAS]. Available at: <uri>https://nordberg.web.cern.ch/nordberg/PAPERS/JINST08.pdf</uri> (accessed 22.09.2021).</mixed-citation>
      </ref>
      <ref id="bib4">
        <mixed-citation>[4] Extract, transform, load procedure in computing [ETL]. Available at: <uri>https://en.wikipedia.org/wiki/Extract,_transform,_load</uri> (accessed 22.09.2021).</mixed-citation>
      </ref>
      <ref id="bib5">
        <mixed-citation>[5] CERN OpenStack Private Cloud Guide [CERN OpenStack]. Available at: <uri>https://clouddocs.web.cern.ch/</uri> (accessed 22.09.2021).</mixed-citation>
      </ref>
      <ref id="bib6">
        <mixed-citation>[6] The Elastic Stack [ESK]. Available at: <uri>https://www.elastic.co/elastic-stack/</uri> (accessed 22.09.2021).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[7] The ATLAS collaboration Production System [ProdSys]. Available at: <uri>https://prodtask.cern.ch/dkb/</uri> (accessed 22.09.2021).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>