<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>THE ATLAS PRODUCTION SYSTEM PREDICTIVE ANALYTICS SERVICE: AN APPROACH FOR INTELLIGENT TASK ANALYSIS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M.A. Titov</string-name>
          <email>mikhail.titov@cern.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M.S. Borodin</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>D.V. Golubkov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.A. Klimentov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>on behalf of the ATLAS Collaboration</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Brookhaven National Laboratory</institution>
          ,
          <addr-line>P.O. Box 5000, Upton, NY, 11973</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for High Energy Physics of NRC «Kurchatov Institute»</institution>
          ,
          <addr-line>1 pl. Nauki, Protvino, Moscow region, 142281</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Research Centre «Kurchatov Institute»</institution>
          ,
          <addr-line>1 pl. Akademika Kurchatova, Moscow, 123182</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Iowa</institution>
          ,
          <addr-line>108 Calvin Hall, Iowa City, IA, 52242</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>124</fpage>
      <lpage>128</lpage>
      <abstract>
        <p>The second generation of the Production System (ProdSys2) of the ATLAS experiment (LHC, CERN), in conjunction with the workload management system PanDA (Production and Distributed Analysis), is a complex set of computing components responsible for defining, organizing, scheduling, starting and executing payloads in a distributed computing infrastructure. ProdSys2/PanDA handle all stages of (re)processing, analysis and modeling of raw and derived data, as well as simulation of physical processes and of the detector's operation using Monte Carlo methods. The prototype of the ProdSys2 Predictive Analytics (P2PA) service is an essential part of the growing analytical service for ProdSys2 and will play a key role in ATLAS distributed computing. P2PA applies tools such as Time-To-Complete (TTC) estimation to units of processing (i.e., tasks, chains and groups of tasks) to control the processing state and rate, and to highlight abnormal operations and executions (e.g., to discover stalled processes). It uses machine learning methods and techniques to obtain predictive models and metrics that characterize the current state of the system and its changes over a short period of time.</p>
      </abstract>
      <kwd-group>
        <kwd>predictive analytics</kwd>
        <kwd>production system</kwd>
        <kwd>Apache Spark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The evolution of the Production System (ProdSys2) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of the ATLAS experiment [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] extends
its capabilities by using not only technical and engineering solutions but also techniques and
methods of intelligent analysis based on data mining and machine learning. Such analysis is applied
to the management and execution of computing tasks, as well as to operational management
processes. New components and services are designed to enhance the task processing workflow and to
increase automation in decision-making processes [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>The current key components of ProdSys2, such as the Database Engine for Tasks (DEfT) and
the Job Execution and Definition Interface (JEDI), serve as the main sources of information about
computing tasks (a set of parameters per task or chain of tasks) and their processing states. A computing
task, in ATLAS terms, is a logical grouping of computing jobs that execute an
algorithm/transformation on input files and generate output files (dynamic job
definition and execution are performed by JEDI). A thorough understanding of the task lifecycle will
improve the processing workflow and optimize the usage of computing resources.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem statement</title>
      <p>The new analytical service, which collects and processes information
about tasks for in-depth analysis and provides operational metrics for ProdSys2, is based on
predictive modeling and analysis and is called the Predictive Analytics service. The ultimate goal of
this service is to address the following problems: i) discover and handle key task features that
impact the workflow; ii) regulate task processing/execution at a given stage; iii) predict the
metrics of a task and its next state (e.g., normal execution, stalled, etc.).</p>
      <p>The next step in automating task processing management raises questions that are
expected to be answered by a decision-making system, which will be part of the Predictive Analytics
service and will use the core tools of the service for in-depth analysis of computing tasks. These questions include
estimating the correlation between task parameters and descriptive parameters of computing
resources (e.g., selection and reservation of available computing capacities, determination of resources
of a particular type for a clustered group of tasks), and mining sequences of task reassignments (e.g.,
keeping full track of task lifecycle stages and states, and of task progress).</p>
    </sec>
    <sec id="sec-3">
      <title>3. ProdSys2 Predictive Analytics service</title>
      <p>
        The current implementation of the service includes two packages which represent key
components (fig. 1) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>[Figure 1: components of the P2PA service. (a) Predictive model handling package prodsys-pa-model, running on the analytics cluster (analytix.cern.ch) and fed by [ProdSys2] DEfT/JEDI: Collector (by Sqoop, Pig), Predictor (by Spark MLlib), Distributor (by DEfT/P2PA APIs). (b) Web application package prodsys-pa-web, running on the manager node (prodsys-pa-ui.cern.ch, VM): Core Control Unit (manage processing service jobs, track performance metrics, adjust service thresholds), alert/notification module, and monitor and management tools (UI).]</p>
      <p>
        The predictive model handling package (prodsys-pa-model) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is designed as an independent set
of tools for task analysis: a task information collector (extracts the requested task parameters from
DEfT/JEDI); an analyzer of task operational parameters (creates a predictive model and uses it to
generate time-to-complete (TTC) predictions for each new task); and a distributor of the obtained results (uses
the DEfT and P2PA APIs to distribute predictions). This package runs on an analytics cluster that
provides HDFS and the parallel processing framework Apache Spark<sup>1</sup> (e.g., the analytix cluster at the
CERN Computing Center), and it is adapted to be a part of the service. The web application package
(prodsys-pa-web) consolidates the monitoring and management tools and provides an interface for
interacting with the task analysis process. It is built using the Django<sup>2</sup> web framework and related
Python libraries (Django REST framework<sup>3</sup>, Celery<sup>4</sup>).
      </p>
      <p>The P2PA service also collects certain task timing parameters to evaluate the applied prediction
methods and the chosen set of parameters (i.e., the quality of the feature selection process). The current
implementation of the prediction generation process uses the Random Forest regression method from
Spark MLlib; adding other libraries with new methods is a long-term plan. Prediction
models are evaluated with two basic metrics: mean squared error (MSE) and root mean
squared error (RMSE). The quality/accuracy of generated predictions is characterized by a
corresponding metric, a confidence coefficient, whose evaluation relies on tracking the task
execution progress (the "state control" process in figure 1b).</p>
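      <p>As an illustration of the two evaluation metrics named above, the following minimal sketch computes MSE and RMSE in plain Python. It is not the Spark MLlib code of the service; the function names and the sample durations are hypothetical and chosen only for this example.</p>
      <preformat>
```python
import math


def mean_squared_error(actual, predicted):
    """Return the mean of squared differences between two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Return the square root of the MSE (same units as the inputs)."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical observed task durations (hours) and the matching predictions.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mse = mean_squared_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)
print(f"MSE = {mse:.2f} h^2, RMSE = {rmse:.2f} h")
```
      </preformat>
      <p>RMSE is expressed in the same units as the task duration (hours in this example), which makes it easier to compare with typical task durations than MSE.</p>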
      <p>The full set of generated data, along with the evaluation metrics (including operational metrics
that estimate the performance of the applied methods), is presented to the user in the monitoring part of the
service. Figures 2 and 3 show screenshots of the web application, which provides information about operational
processes (e.g., predictive model creation, prediction generation) and task profiles with extracted
parameters and estimated metrics, such as the predicted TTC and the description of the block of generated
predictions with the corresponding confidence coefficient and MSE.</p>
      <p>1 Apache Spark, https://spark.apache.org [accessed on 2018-10-25]</p>
      <p>2 Django project (version 1.11), https://www.djangoproject.com/ [accessed on 2018-10-25]</p>
      <p>3 Django REST framework, https://www.django-rest-framework.org/ [accessed on 2018-10-25]</p>
      <p>4 Celery: Distributed Task Queue, http://www.celeryproject.org/ [accessed on 2018-10-25]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis of a computing task</title>
      <p>Generated predictions and obtained metrics are planned to be used in decision-making
processes to regulate ProdSys2 behaviour and resource consumption. It is therefore important to identify
the essential features that influence and reflect the system behaviour. The key metric per computing
task in ProdSys2 is TTC, which is used as an indicator of task condition (e.g., faster than average,
longer than average, etc.); its further exploration will reveal the reasons for deviations in task
processing, which is important for forecasting the state of ProdSys2 in general. It is not yet planned to use TTC
as a pre-task-definition check for finding optimal parameters (e.g., computing center), but this
possibility will be considered as the service improves.</p>
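      <p>The use of a duration-based metric as an indicator of task condition can be sketched as follows. This is only an illustration under stated assumptions: the function name, the relative tolerance band of 20% and the label for typical tasks are hypothetical and are not taken from the service.</p>
      <preformat>
```python
def task_condition(duration_hours, group_average_hours, tolerance=0.2):
    """Label a task relative to the average duration of comparable tasks.

    `tolerance` is a hypothetical relative band around the average within
    which a task is treated as typical.
    """
    if not group_average_hours > 0:
        raise ValueError("group average must be positive")
    ratio = duration_hours / group_average_hours
    if ratio > 1 + tolerance:
        return "longer than average"
    if 1 - tolerance > ratio:
        return "faster than average"
    return "average"


# Hypothetical durations compared with a 20-hour group average.
for hours in (10.0, 21.0, 30.0):
    print(hours, "->", task_condition(hours, 20.0))
```
      </preformat>
      <p>A task that stays in the "longer than average" band for several consecutive checks would be a natural candidate for closer inspection, for example as a possibly stalled task.</p>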
      <p>The estimation of task TTC involves several steps, each of which refines the
results obtained in the previous ones.</p>
      <p>Steps for task TTC estimation:</p>
      <list list-type="order">
        <list-item><p>definition of the value range: the 95th percentile of task duration is used per group of tasks, where groups are distinguished by a set of features. The current implementation uses the following features: projectName, productionStep, workingGroup;</p></list-item>
        <list-item><p>prediction of the task duration based on the descriptive/initial parameters of the task;</p></list-item>
        <list-item><p>a periodically repeated step that uses dynamic parameters (of the task and of the computing environment, including the computing site used for processing) to adjust the earlier predicted TTC and, eventually, the task duration.</p></list-item>
      </list>
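      <p>The first step above, the definition of the value range per group of tasks, can be sketched in plain Python as shown below. The grouping features are the ones named in the text; everything else is an assumption made for illustration: the nearest-rank percentile definition (the text does not say which definition is used), the duration_hours key, the sample group values and the clipping of a predicted duration to the range.</p>
      <preformat>
```python
import math
from collections import defaultdict

# Grouping features named in the text.
GROUP_FEATURES = ("projectName", "productionStep", "workingGroup")


def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile for an integer `percent` between 1 and 100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def duration_upper_bounds(tasks, percent=95):
    """Map each feature group to the given percentile of its task durations."""
    durations = defaultdict(list)
    for task in tasks:
        key = tuple(task[name] for name in GROUP_FEATURES)
        durations[key].append(task["duration_hours"])  # hypothetical field name
    return {key: percentile_nearest_rank(vals, percent)
            for key, vals in durations.items()}


def clip_prediction(predicted_hours, upper_bound_hours):
    """Keep a predicted duration inside the assumed range [0, upper bound]."""
    return min(max(predicted_hours, 0.0), upper_bound_hours)


# Hypothetical group of 20 finished tasks with durations of 1..20 hours.
tasks = [
    {"projectName": "projA", "productionStep": "simul",
     "workingGroup": "groupX", "duration_hours": float(h)}
    for h in range(1, 21)
]

bounds = duration_upper_bounds(tasks)
group = ("projA", "simul", "groupX")
print(bounds[group], clip_prediction(25.0, bounds[group]))
```
      </preformat>
      <p>In this sketch the range only bounds the prediction produced in the second step; the periodic adjustment in the third step would update the prediction itself as dynamic parameters change.</p>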
      <p>
        The current choice of controlled parameters/metrics is motivated by their correlation with
the corresponding possible failure states of ProdSys2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As their quality/accuracy increases
and the need to introduce new ones arises, a comparative analysis of these parameters/metrics will be conducted.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgement</title>
      <p>This work has been carried out using computing resources of the federal collective usage
center “Complex for Simulation and Data Processing for Mega-science Facilities” at NRC “Kurchatov
Institute”, http://ckp.nrcki.ru/. NRC KI researchers have been funded by the Russian Ministry of
Science and Higher Education under contract No. 14.Z50.31.0024.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The ProdSys2 Predictive Analytics service is designed to enhance workflow control in the ATLAS
Production System and to detect and highlight abnormal operations and executions. Its
prototype demonstrates the usefulness of the provided metrics and of the state control mechanism. It still
lacks most of the evaluation metrics needed to fine-tune the prediction process, which would increase the quality of
generated predictions and operational metrics.</p>
      <p>Furthermore, the future decision-making system should rely on the generated quality metrics,
since it will be responsible for regulating resource consumption. The quality of the obtained metrics
(estimated values of controlled parameters) is constantly improving, and new evaluation parameters
and metrics will be introduced for task analysis and mining processes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Borodin</surname>
            <given-names>M.</given-names>
          </string-name>
          et al.
          <article-title>The ATLAS Production System Evolution: New Data Processing and Analysis Paradigm for the LHC Run2 and High-Luminosity //</article-title>
          <source>Journal of Physics: Conference Series</source>
          <volume>898</volume>
          (
          <year>2017</year>
          )
          <fpage>052016</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>ATLAS Collaboration</collab>
          ,
          <source>2008 JINST 3 S08003</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Titov</surname>
            <given-names>M.</given-names>
          </string-name>
          et al.
          <article-title>Predictive analytics as an essential mechanism for situational awareness at the ATLAS Production System //</article-title>
          <source>CEUR Workshop Proceedings</source>
          <volume>2023</volume>
          (
          <year>2017</year>
          ) pp.
          <fpage>61</fpage>
          -
          <lpage>67</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Titov</surname>
            <given-names>M.</given-names>
          </string-name>
          et al.
          <article-title>Advanced Analytics service to enhance workflow control at the ATLAS Production System //</article-title>
          <source>Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP)</source>
          , Sofia, Bulgaria, 9-13 July
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Duellmann</surname>
            <given-names>D.</given-names>
          </string-name>
          et al.
          <article-title>Hadoop and friends - first experience at CERN with a new platform for high throughput analysis steps //</article-title>
          <source>Journal of Physics: Conference Series</source>
          <volume>898</volume>
          (
          <year>2017</year>
          )
          <fpage>072034</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>Predictive model handling package</article-title>
          ,
          <uri>https://github.com/XDatum/prodsys-pa-model</uri>
          [accessed on 2018-10-25]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>