<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The ATLAS BigPanDA Monitoring System Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>T. Korchuganova</string-name>
          <email>tatiana.korchuganova@cern.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Padolski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>T. Wenaus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Klimentov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Alekseev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <collab>on behalf of the ATLAS Collaboration</collab>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
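        <aff id="aff0">
          <label>0</label>
          <institution>Tomsk Polytechnic University</institution>
          ;
          <institution>Brookhaven National Laboratory</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>81</fpage>
      <lpage>85</lpage>
      <permissions>
        <copyright-statement>© 2018 Tatiana Korchuganova, Siarhei Padolski, Torre Wenaus, Alexei Klimentov, Aleksandr Alekseev</copyright-statement>
        <copyright-year>2018</copyright-year>
      </permissions>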
      <abstract>
        <p>Large-scale scientific projects now in operation involve unprecedented amounts of data and computing power. For example, the ATLAS experiment at the Large Hadron Collider (LHC) collected 140 PB of data over the course of Run 1, and this volume kept growing at a rate of ~800 MB/s during Run 2. Processing and analysing such amounts of data requires the development of complex operational workflow and payload management systems, along with cutting-edge computing facilities. In the ATLAS experiment, a key element of payload management is the Production and Distributed Analysis system (PanDA). It consists of several components, one of which is BigPanDA monitoring. This component is responsible for providing a comprehensive and coherent view of the tasks and jobs executed by the system, from high-level summaries to detailed drill-down job diagnostics. BigPanDA monitoring has been in production since mid-2014 and continuously evolves to satisfy increasing demands for functionality and growing payload scales. Today, in its largest instance, that of the ATLAS experiment, it keeps track of more than 2 million jobs per day distributed over 170 computing centers worldwide. In this paper we describe the monitoring architecture and its principal features.</p>
      </abstract>
      <kwd-group>
        <kwd>monitoring</kwd>
        <kwd>PanDA</kwd>
        <kwd>data aggregation</kwd>
        <kwd>Django application</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Tomsk Polytechnic University</title>
    </sec>
    <sec id="sec-2">
      <title>2 Brookhaven National Laboratory</title>
      <sec id="sec-2-1">
        <title>1. Introduction</title>
        <p>
          The BigPanDA monitoring system is one of the components of the PanDA workload management system [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
It provides many services, including an overview of the system state characterized by the parameters of its
objects, tracking of operational progress, and a source of detailed data for troubleshooting.
The input information for these views is scattered across real-time data and historical
archives. Various levels of detail and grouping are required to satisfy the needs of four
groups of users: physicists who run their own analyses; managers who operate simulation and
data processing campaigns on behalf of a physics group or the whole experiment; shifters who monitor
the health of the overall distributed computing resources, chasing failures in a timely manner; and
developers who use the monitor as a window into the PanDA system. Because the computational and
functional needs keep growing, the monitoring has to be developed as a scalable system with
extensible functionality. This paper describes the requirements for the system as well as its architecture
and structure.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2. Input</title>
        <p>
          The monitoring system was originally developed for the ATLAS experiment [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] at the LHC. The
system described here is its latest generation, developed to address continuously increasing
demands [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For this reason the system is capable of aggregating massive volumes of data in near
real time. One of the most important and most demanding tasks of the monitoring
system is to aggregate and expose information on every job handled by PanDA. The properties of a
job entity are assembled from a number of PanDA database tables: a jobs table, which stores the
primary properties, and tables with related objects such as events, files, datasets, tasks, and the computing
sites where jobs are processed.
        </p>
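        <p>As a minimal sketch of how a job record might be assembled from several tables, the Python example below joins a jobs table with task and site tables in an in-memory SQLite database. The table names, column names and sample rows are hypothetical and chosen for illustration only; they are not the PanDA schema.</p>
        <preformat>
```python
import sqlite3

# Hypothetical, simplified schema: NOT the real PanDA database layout.
SCHEMA = """
CREATE TABLE tasks (taskid INTEGER PRIMARY KEY, taskname TEXT);
CREATE TABLE sites (sitename TEXT PRIMARY KEY, cloud TEXT);
CREATE TABLE jobs (
    pandaid INTEGER PRIMARY KEY,
    taskid INTEGER REFERENCES tasks(taskid),
    sitename TEXT REFERENCES sites(sitename),
    status TEXT,
    nevents INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database filled with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows can be addressed by column name
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO tasks VALUES (1, 'mc.simul.demo')")
    conn.execute("INSERT INTO sites VALUES ('SITE_A', 'CLOUD_X')")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [(1001, 1, "SITE_A", "finished", 500),
         (1002, 1, "SITE_A", "failed", 0)],
    )
    return conn


def get_job(conn, pandaid):
    """Assemble one job: primary properties plus related task and site."""
    row = conn.execute(
        """
        SELECT j.pandaid, j.status, j.nevents,
               t.taskname, s.sitename, s.cloud
        FROM jobs j
        JOIN tasks t ON t.taskid = j.taskid
        JOIN sites s ON s.sitename = j.sitename
        WHERE j.pandaid = ?
        """,
        (pandaid,),
    ).fetchone()
    return dict(row) if row is not None else None


conn = build_demo_db()
job = get_job(conn, 1001)
print(job)
```
        </preformat>
        <p>In this sketch the join is performed by the database, so the application receives a single flat record per job instead of querying each related table separately.</p>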
        <p>An event is a distinct particle collision recorded by the detector or simulated by
Monte Carlo software. A job is a payload that processes a number of input events or
produces them starting from initial random generator seeds. A task is a collection of jobs united by
the same data sample, which is split among them. A request can contain a single task, a set of tasks, or a
chain of tasks. A campaign is a set of requests united by a physics objective. On average, this hierarchy of
objects has the following statistics: a job contains hundreds of events, a task contains
hundreds of thousands of events, a request contains more than a billion events, and a
campaign contains a trillion events. Similarly, the average time to process an event varies from seconds
to several minutes, a job lasts from a few hours to a couple of days, a task can take up to several weeks,
a request is processed for more than a month, and a campaign generally lasts for a year. These
estimates show that the architecture of the monitoring system must be able to analyse and
represent data over an extremely wide range of scales.</p>
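        <p>To make the range of scales concrete, the short Python sketch below turns the average event counts quoted above into approximate fan-out ratios between the levels of the hierarchy. The round numbers are illustrative assumptions (for example, "hundreds" is taken as 100), not measured values.</p>
        <preformat>
```python
# Representative orders of magnitude based on the averages quoted in the text.
# The exact values are illustrative assumptions (e.g. "hundreds" taken as 100).
EVENTS_PER_LEVEL = {
    "job": 10**2,        # hundreds of events
    "task": 10**5,       # hundreds of thousands of events
    "request": 10**9,    # more than a billion events
    "campaign": 10**12,  # a trillion events
}


def fan_out(levels):
    """Return how many objects of each level fit into the next level up."""
    names = list(levels)
    ratios = {}
    for lower, upper in zip(names, names[1:]):
        ratios[f"{lower}s per {upper}"] = levels[upper] // levels[lower]
    return ratios


ratios = fan_out(EVENTS_PER_LEVEL)
for name, value in ratios.items():
    print(f"{name}: {value:,}")

# Total number of jobs in one campaign under these assumptions.
jobs_per_campaign = EVENTS_PER_LEVEL["campaign"] // EVENTS_PER_LEVEL["job"]
print(f"jobs per campaign: {jobs_per_campaign:,}")
```
        </preformat>
        <p>Under these assumptions a single campaign corresponds to roughly ten billion jobs, which illustrates why per-job views and campaign-level summaries cannot be served by the same unaggregated queries.</p>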
      </sec>
      <sec id="sec-2-3">
        <title>3. Architecture</title>
        <p>
          The BigPanDA monitor is built as a web application. The architecture of the system is shown
in Figure 1. The application backend is based on the Django model-view-controller framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
which is a powerful open source package written in Python. The monitor is hosted by Apache [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] web
servers through the Web Server Gateway Interface (WSGI) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The system supports various relational
database (DB) backends using abstraction layers provided by Django. However, relational DBs are not
the only data source for the monitoring system. It also acquires data from non-relational sources like
ElasticSearch (ES) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Redis cache instances [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In addition, ATLAS’s Rucio data management
system [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] provides logs, and the Dashboard service provides historical histograms [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>The data flow of the system is shown in Figure 2. The raw data from the PanDA system is stored in
the DB. Because a user request may involve processing millions of DB rows,
the data aggregation algorithms are split between the DB and the web server backends. This split makes it
possible to benefit from both engines, reduce data transfers, and increase performance. In addition, the
monitoring system has an advanced caching system. Data prepared for display is divided into a
common part and a user-specific part. The common part is proactively cached, whereas user-specific
data is processed individually for each user request. Pre-aggregated data from the DB or the cache
storage is loaded through the standard Django interface, as well as indexed data from ES. The user-specific
data can contain either settings for a page or a list of references to the most relevant pages,
determined by analysing the BigPanDA browsing history. The user-specific data is protected by several
policies, in particular single sign-on (SSO) authentication and the HTTPS protocol.</p>
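        <p>The split between database-side and server-side aggregation, together with the distinction between cached common data and per-request user data, can be sketched in Python as follows. This is a simplified illustration that uses an in-memory SQLite table and a plain dictionary as the cache; the table layout, the function names and the cache lifetime are assumptions and do not reproduce the BigPanDA implementation.</p>
        <preformat>
```python
import sqlite3
import time

# Hypothetical table and data; the real PanDA schema and cache backend differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (pandaid INTEGER, site TEXT, status TEXT, owner TEXT)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?)",
    [(1, "SITE_A", "finished", "alice"), (2, "SITE_A", "failed", "bob"),
     (3, "SITE_A", "finished", "bob"), (4, "SITE_B", "finished", "alice")],
)

_cache = {}        # key -> (expiry timestamp, value); stands in for a real cache
CACHE_TTL = 60.0   # seconds; arbitrary value for the sketch


def db_counts():
    """Step 1 (database side): reduce raw rows to per-site, per-status counts."""
    sql = "SELECT site, status, COUNT(*) FROM jobs GROUP BY site, status"
    return conn.execute(sql).fetchall()


def site_summary():
    """Step 2 (web server side): derive failure rates from the small result."""
    summary = {}
    for site, status, n in db_counts():
        entry = summary.setdefault(site, {"total": 0, "failed": 0})
        entry["total"] += n
        if status == "failed":
            entry["failed"] += n
    for entry in summary.values():
        entry["failure_rate"] = entry["failed"] / entry["total"]
    return summary


def cached_site_summary(now=None):
    """Common part: identical for all users, so it is served from the cache."""
    now = time.time() if now is None else now
    hit = _cache.get("site_summary")
    if hit is not None and hit[0] > now:
        return hit[1]
    value = site_summary()
    _cache["site_summary"] = (now + CACHE_TTL, value)
    return value


def user_jobs(owner):
    """User-specific part: computed for each request, never shared."""
    sql = "SELECT pandaid FROM jobs WHERE owner = ? ORDER BY pandaid"
    return [row[0] for row in conn.execute(sql, (owner,))]


def warm_cache():
    """Proactive caching: refresh the common part ahead of user requests."""
    _cache.pop("site_summary", None)
    return cached_site_summary()


warm_cache()
page = {"common": cached_site_summary(), "user": user_jobs("bob")}
print(page)
```
        </preformat>
        <p>Here the database returns only a handful of grouped rows instead of every job record, and the web server layer adds the derived quantities. In a deployment, the dictionary would be replaced by a shared cache and the warm-up step would be triggered periodically.</p>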
        <p>
          Data is displayed using Foundation CSS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] containers or dynamic sortable DataTables [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
Data for DataTables is delivered asynchronously using jQuery [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The containers can include static
tables, responsive menus, and visualizations generated either on the client side using D3.js or on the server
side using the matplotlib library [
          <xref ref-type="bibr" rid="ref14 ref15 ref16">14-16</xref>
          ]. Besides self-generated visualizations, Kibana dashboards and
histograms provided by the Dashboard service can be embedded directly into pages. The monitoring
system also provides aggregated data in JSON format, so that it can serve as a programmatic
source of information. We also plan to integrate the monitoring with the Data Knowledge Base of the
experiment in the future [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
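        <p>The idea of exposing the same aggregate both to a page and to scripts can be illustrated with a few lines of Python. The sketch below is an assumption about how such a dual output might look; the record fields and the render_summary function are hypothetical and are not part of the BigPanDA code.</p>
        <preformat>
```python
import json
from collections import Counter

# Made-up job records; field names are hypothetical.
JOBS = [
    {"pandaid": 1, "site": "SITE_A", "status": "finished"},
    {"pandaid": 2, "site": "SITE_A", "status": "failed"},
    {"pandaid": 3, "site": "SITE_B", "status": "finished"},
]


def render_summary(jobs, as_json):
    """Return the same aggregate either as JSON text or as plain-text lines."""
    counts = Counter(job["status"] for job in jobs)
    summary = {"njobs": len(jobs), "by_status": dict(counts)}
    if as_json:
        # sort_keys makes the output stable, which helps scripted clients.
        return json.dumps(summary, sort_keys=True)
    lines = [f"{status}: {n}" for status, n in sorted(counts.items())]
    return "\n".join(lines)


# A script would request the JSON form and parse it directly.
payload = render_summary(JOBS, as_json=True)
data = json.loads(payload)
print(data["by_status"]["finished"])
```
        </preformat>
        <p>Keeping a single aggregation step behind both outputs means that scripted clients and human users see consistent numbers.</p>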
        <p>
          The principal views of the monitoring system display the key objects of data processing and
analysis, such as jobs, tasks, and files, and their aggregates. These views are unified into a single module
that defines the system’s core (see Figure 3). More specialized views and accounting views are implemented as
separate plugins. This approach enables customization of the system by plugging existing
components in and out or by implementing new ones. For example, the basic BigPanDA monitor core is installed on
Amazon Elastic Compute Cloud (EC2) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] to serve different experiments that use PanDA for
workload management. For the COMPASS experiment at the SPS [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], an extra module was developed
and deployed in addition to the core ones. The largest instance, that of the ATLAS experiment, includes
the core and 9 more specific modules, such as the ATLAS Release Tester (ART) monitor, which shows the test
results of nightly software builds, and Reports, which provides a wide overview of a campaign computation.
        </p>
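        <p>The core-plus-plugins structure described above can be sketched as a small registry in Python. The Monitor class, its method names and the module names used here are hypothetical; in a Django application, an equivalent effect could be obtained by including or omitting applications in the configuration.</p>
        <preformat>
```python
class Monitor:
    """A core with built-in views plus optional plugin modules (sketch only)."""

    def __init__(self):
        # Core views are always available.
        self._views = {"jobs": lambda: "job list", "tasks": lambda: "task list"}
        self._plugins = {}

    def plug_in(self, name, views):
        """Register a module; `views` maps a view name to a callable."""
        clash = set(views).intersection(self._views)
        if clash:
            raise ValueError(f"view names already in use: {sorted(clash)}")
        self._plugins[name] = list(views)
        self._views.update(views)

    def plug_out(self, name):
        """Remove a module and every view it contributed."""
        for view in self._plugins.pop(name):
            del self._views[view]

    def render(self, view):
        return self._views[view]()

    def available_views(self):
        return sorted(self._views)


# Hypothetical per-experiment configuration: same core, different plugins.
atlas = Monitor()
atlas.plug_in("art", {"art_overview": lambda: "nightly test results"})
atlas.plug_in("reports", {"campaign_report": lambda: "campaign overview"})

basic = Monitor()  # core only

print(atlas.available_views())
print(basic.available_views())
```
        </preformat>
        <p>Because every module only adds entries to the registry, removing one leaves the core views untouched, which mirrors the plug-in and plug-out behaviour described in the text.</p>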
      </sec>
      <sec id="sec-2-4">
        <title>4. Summary</title>
        <p>The BigPanDA monitoring system has been in production since the middle of 2014, and thanks to its
flexible architecture the functionality of the system continuously evolves without interrupting the
service. In September 2018 the BigPanDA monitoring system in ATLAS handled more than 35
thousand requests per day, 77% of which were for the key views, in particular jobs, tasks, sites, and files.
The successful ATLAS experience has made this product of interest to other experiments as well: at
the time of writing, 3 instances of the BigPanDA monitor serve payload monitoring for
ATLAS, COMPASS, and other experiments beyond High Energy Physics.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Acknowledgements</title>
        <p>The work of the TPU team is supported by a Russian Science Foundation grant under contract
№16-11-10280.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Maeno</surname>
            <given-names>T.</given-names>
          </string-name>
          et al.,
          <year>2017</year>
          ,
          <article-title>PanDA for ATLAS distributed computing in the next decade</article-title>
          ,
          <source>J. Phys. Conf. Ser.</source>
          , Vol.
          <volume>898</volume>
          , 052002
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>ATLAS Collaboration</collab>
          ,
          <year>2008</year>
          ,
          <article-title>The ATLAS Experiment at the CERN Large Hadron Collider</article-title>
          ,
          <source>JINST</source>
          , Vol.
          <volume>3</volume>
          , S08003
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Schovancova</surname>
            <given-names>J.</given-names>
          </string-name>
          et al.,
          <year>2014</year>
          ,
          <article-title>The new Generation of the ATLAS PanDA Monitoring System</article-title>
          , 035, doi:
          <pub-id pub-id-type="doi">10.22323/1.210.0035</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <source>Django Documentation</source>
          . Available at: https://docs.djangoproject.com/en/1.11/ (accessed on
          <date-in-citation content-type="access-date">5.07.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <source>Apache HTTP Server Version 2.4 Documentation</source>
          . Available at: https://httpd.apache.org/docs/2.4/ (accessed on
          <date-in-citation content-type="access-date">10.08.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <source>WSGI</source>
          . Available at: https://wsgi.readthedocs.io/en/latest/ (accessed on
          <date-in-citation content-type="access-date">9.08.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>Elastic Stack and Product Documentation</source>
          . Available at: https://www.elastic.co/guide/index.html (accessed on
          <date-in-citation content-type="access-date">13.07.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Nelson</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2016</year>
          ,
          <source>Mastering Redis</source>
          ,
          <publisher-loc>Birmingham</publisher-loc>
          :
          <publisher-name>Packt Publishing</publisher-name>
          , 340 p.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Lassnig</surname>
            <given-names>M.</given-names>
          </string-name>
          et al.,
          <year>2015</year>
          ,
          <article-title>Monitoring and Controlling ATLAS data management: The Rucio web user interface</article-title>
          ,
          <source>J. Phys. Conf. Ser.</source>
          , Vol.
          <volume>664</volume>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Andreeva</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campana</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karavakis</surname>
            <given-names>E.</given-names>
          </string-name>
          et al.,
          <year>2012</year>
          ,
          <article-title>ATLAS job monitoring in the Dashboard Framework</article-title>
          ,
          <source>J. Phys. Conf. Ser.</source>
          , Vol.
          <volume>396</volume>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <source>ZURB Foundation</source>
          . Available at: http://foundation.zurb.com/sites/docs/ (accessed on
          <date-in-citation content-type="access-date">3.07.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <source>DataTables</source>
          . Available at: http://datatables.net/ (accessed on
          <date-in-citation content-type="access-date">25.07.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <source>jQuery - Asynchronous JavaScript Library</source>
          . Available at: http://jquery.com/ (accessed on
          <date-in-citation content-type="access-date">24.05.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <source>Data-driven documents</source>
          . Available at: https://d3js.org/ (accessed on
          <date-in-citation content-type="access-date">17.06.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <source>Matplotlib Overview</source>
          . Available at: https://matplotlib.org/contents.html (accessed on
          <date-in-citation content-type="access-date">17.06.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Padolski</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korchuganova</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wenaus</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grigorieva</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alexeev</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Titov</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimentov</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2018</year>
          ,
          <article-title>Data visualization and representation in ATLAS BigPanDA monitoring</article-title>
          ,
          <source>Scientific Visualization</source>
          , №
          <volume>10</volume>
          , p.
          <fpage>69</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Grigorieva</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aulov</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gubin</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimentov</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2016</year>
          ,
          <article-title>Data knowledge base for scientific experiment</article-title>
          ,
          <source>Open Systems, DBMS</source>
          , Vol.
          <volume>4</volume>
          , p.
          <fpage>42</fpage>
          -
          <lpage>44</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <source>Amazon EC2</source>
          . Available at: https://aws.amazon.com/ec2/ (accessed on
          <date-in-citation content-type="access-date">17.09.2018</date-in-citation>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Abbon</surname>
            <given-names>P.</given-names>
          </string-name>
          et al.,
          <year>2007</year>
          ,
          <article-title>The COMPASS experiment at CERN</article-title>
          ,
          <source>Nucl. Instrum. Meth.</source>
          , Vol.
          <volume>A577</volume>
          , p.
          <fpage>455</fpage>
          -
          <lpage>518</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>