=Paper=
{{Paper
|id=Vol-2267/124-128-paper-22
|storemode=property
|title=The ATLAS Production System Predictive Analytics Service: An Approach for Intelligent Task Analysis
|pdfUrl=https://ceur-ws.org/Vol-2267/124-128-paper-22.pdf
|volume=Vol-2267
|authors=Mikhail A. Titov,Mikhail S. Borodin,Dmitry V. Golubkov,Alexei A. Klimentov
}}
==The ATLAS Production System Predictive Analytics Service: An Approach for Intelligent Task Analysis==
Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10-14, 2018

THE ATLAS PRODUCTION SYSTEM PREDICTIVE ANALYTICS SERVICE: AN APPROACH FOR INTELLIGENT TASK ANALYSIS

M.A. Titov 1,a, M.S. Borodin 2, D.V. Golubkov 1,3, A.A. Klimentov 1,4, on behalf of the ATLAS Collaboration

1 National Research Centre «Kurchatov Institute», 1 pl. Akademika Kurchatova, Moscow, 123182, Russia
2 University of Iowa, 108 Calvin Hall, Iowa City, IA, 52242, USA
3 Institute for High Energy Physics of NRC «Kurchatov Institute», 1 pl. Nauki, Protvino, Moscow region, 142281, Russia
4 Brookhaven National Laboratory, P.O. Box 5000, Upton, NY, 11973, USA

E-mail: a mikhail.titov@cern.ch

The second generation of the Production System (ProdSys2) of the ATLAS experiment (LHC, CERN), in conjunction with the workload management system PanDA (Production and Distributed Analysis), represents a complex set of computing components responsible for defining, organizing, scheduling, starting and executing payloads in a distributed computing infrastructure. ProdSys2/PanDA are responsible for all stages of (re)processing, analysis and modeling of raw and derived data, as well as for the simulation of physics processes and of detector operation using Monte Carlo methods. The prototype of the ProdSys2 Predictive Analytics (P2PA) service is an essential part of the growing analytical service for ProdSys2 and will play a key role in ATLAS distributed computing. P2PA uses tools such as Time-To-Complete (TTC) estimation for units of processing (i.e., tasks, chains and groups of tasks) to control the processing state and rate, and to highlight abnormal operations and executions (e.g., to discover stalled processes). It uses machine learning methods and techniques to obtain predictive models and metrics that characterize the current state of the system and its changes over a short period of time.

Keywords: predictive analytics, production system, Apache Spark

© 2018 Mikhail A. Titov, Mikhail S. Borodin, Dmitry V. Golubkov, Alexei A. Klimentov

1. Introduction

The evolution of the Production System (ProdSys2) [1] of the ATLAS experiment [2] extends its capabilities not only through technical and engineering solutions, but also through techniques and methods of intelligent analysis based on data mining and machine learning. Such analysis is applied to the management and execution of computing tasks, as well as to operational management processes. New components and services are designed to enhance the task processing workflow and to increase automation in decision-making processes [3, 4].

The current key components of ProdSys2, the Database Engine for Tasks (DEfT) and the Job Execution and Definition Interface (JEDI), serve as the main sources of information about computing tasks (a set of parameters per task or chain of tasks) and their processing states. A computing task, in ATLAS terms, represents a logical grouping of computing jobs that execute an algorithm/transformation on input files and generate output files (dynamic job definition and execution are performed by JEDI).
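To make the notion of a task record concrete, the following minimal sketch (in Python) models a task as a flat set of parameters; the field names echo the features used later in section 4 (projectName, productionStep, workingGroup), while the remaining fields and all example values are illustrative assumptions rather than the actual DEfT/JEDI schema.

<pre>
# A minimal, illustrative model of a computing task as a flat record of
# parameters; the real DEfT/JEDI task schema is far richer (assumption).
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskRecord:
    task_id: int
    project_name: str                 # cf. feature "projectName" in section 4
    production_step: str              # cf. feature "productionStep"
    working_group: str                # cf. feature "workingGroup"
    status: str                       # processing state, e.g. "running"
    duration: Optional[float] = None  # wall-clock duration once finished

# Hypothetical example instance; all values are made up for illustration.
task = TaskRecord(task_id=1, project_name="mc16_13TeV",
                  production_step="simul", working_group="AP_PHYS",
                  status="running")
</pre>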
A profound understanding of the task lifecycle will improve the processing workflow and optimize the usage of computing resources.

2. Problem statement

The new analytical service, which is aimed at collecting and processing information about tasks for their deep analysis and at providing operational metrics for ProdSys2, is based on predictive modeling and analysis and is called the Predictive Analytics service. The ultimate goal of this service is to address the following problems: i) discover and handle the key task features that impact the workflow; ii) regulate task processing/execution at a given stage; iii) predict task metrics and the next task state (e.g., normal execution, stalled, etc.).

The next step in the automation of task processing management raises questions that are expected to be solved by a decision-making system, which will be part of the Predictive Analytics service and will use the service's core tools for deep analysis of computing tasks. This includes estimating the correlation between task parameters and the descriptive parameters of computing resources (e.g., selection and reservation of available computing capacities, determination of resources of a particular type for a clustered group of tasks), and mining sequences of task reassignments (e.g., keeping full track of task lifecycle stages and states, and of task progress).

3. ProdSys2 Predictive Analytics service

The current implementation of the service includes two packages, which represent its key components (fig. 1) [4].

Figure 1. The architecture (a) and the communication (b) schemes of the P2PA service (analytics cluster “analytix” with highlighted services as provided by CERN-IT [5])

The predictive model handling package (prodsys-pa-model) [6] is designed as an independent set of tools for task analysis: a task information collector (extracts the requested task parameters from DEfT/JEDI); analysis of task operational parameters (creates a predictive model and uses it to generate time-to-complete/TTC predictions per new task); and delivery of the obtained results (uses the DEfT and P2PA APIs for prediction distribution). This package runs on an analytics cluster that provides HDFS and the parallel processing framework Apache Spark (https://spark.apache.org, accessed 2018-10-25), e.g., the analytix cluster at the CERN Computing Center, and it is adapted to be a part of the service.

The web application package (prodsys-pa-web) consolidates the monitoring and management tools and provides an interface for interacting with the task analysis process. It is built using the Django web framework (version 1.11; https://www.djangoproject.com/, accessed 2018-10-25) and related service Python libraries: Django REST framework (https://www.django-rest-framework.org/, accessed 2018-10-25) and Celery (http://www.celeryproject.org/, accessed 2018-10-25).
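As an illustration of how prodsys-pa-web might expose generated predictions, the following minimal Django REST framework sketch (targeting the Django 1.11 line mentioned above) defines a read-only TTC endpoint; the URL layout, view name and in-memory lookup are hypothetical, not the actual P2PA web API.

<pre>
# A minimal sketch of a read-only TTC endpoint, assuming Django 1.11 and
# Django REST framework are installed; the URL layout, view and in-memory
# lookup below are hypothetical, not the actual P2PA API.
from django.conf.urls import url
from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

# Hypothetical stand-in for a lookup in the ProdSys2PA database.
_PREDICTIONS = {42: {"ttc_hours": 36.5, "confidence": 0.87}}

class TaskTTCView(APIView):
    """Return the stored TTC prediction for a single task."""

    def get(self, request, task_id):
        prediction = _PREDICTIONS.get(int(task_id))
        if prediction is None:
            return Response({"detail": "no prediction for this task"},
                            status=status.HTTP_404_NOT_FOUND)
        return Response({"task_id": int(task_id),
                         "ttc_hours": prediction["ttc_hours"],
                         "confidence": prediction["confidence"]})

urlpatterns = [
    url(r"^api/tasks/(?P<task_id>\d+)/ttc/$", TaskTTCView.as_view()),
]
</pre>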
The P2PA service also collects certain task timing parameters to evaluate the applied prediction methods and the chosen set of parameters (i.e., the quality of the feature selection process). The current implementation of the prediction generation process uses the Random Forests regression method from Spark MLlib, but there is a long-term plan to add other libraries with new methods. The evaluation of prediction models uses the following basic metrics: mean squared error (MSE) and root mean squared error (RMSE). The quality/accuracy of the generated predictions is characterized by a corresponding metric, a confidence coefficient, whose evaluation relies on tracking the task execution progress (the "state control" process in figure 1b). The full set of generated data, along with the evaluation metrics (including operational metrics to estimate the performance of the applied methods), is presented to the user through the monitoring part of the service.

Figures 2 and 3 show screenshots of the web application, which provides information about operational processes (e.g., predictive model creation, prediction generation) and task profiles with extracted parameters and estimated metrics, such as the predicted TTC and the description of the block of generated predictions with the corresponding confidence coefficient and MSE.

Figure 2. P2PA web application (UI) screenshots for operation process(es)

Figure 3. P2PA web application (UI) screenshots for task profile(s) with generated TTC estimation(s)

4. Analysis of a computing task

The generated predictions and obtained metrics are planned to be used in decision-making processes to regulate ProdSys2 behaviour and resource consumption. Thus, it is important to identify the essential features that reflect the system's behaviour. The key metric per computing task in ProdSys2 is TTC, which is used as an indicator of task condition (e.g., faster than average, longer than average); its further exploration will help reveal the reasons for deviations in task processing, which is important for forecasting the state of ProdSys2 in general. It is not yet planned to use it as a pre-task-definition check for finding optimal parameters (e.g., the computing center), but such a possibility will be considered as the service improves.

Task TTC estimation proceeds in several steps, each of which enhances the quality of the results obtained from the previous ones (sketches of the first two steps follow the list):
- definition of the value range: the 95th percentile of task duration is computed per group of tasks distinguished by a set of features; the current implementation uses the features projectName, productionStep, workingGroup;
- prediction of the task duration based on the descriptive/initial parameters of the task;
- a periodically repeated step that uses dynamic parameters (of the task and of the computing environment, including the computing site used for processing) to adjust the earlier predicted TTC and, eventually, the task duration.
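A minimal PySpark sketch of the first step, assuming a DataFrame of finished tasks with the three grouping features and a duration column (column names follow the paper; the toy rows and duration units are assumptions):

<pre>
# Step 1 sketch: the 95th percentile of task duration per feature-defined
# group serves as the upper bound of the TTC value range. Column names
# follow the paper; the toy rows and duration units (hours) are assumptions.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ttc-value-range").getOrCreate()

tasks = spark.createDataFrame(
    [("mc16", "simul", "AP_PHYS", 30.0),
     ("mc16", "simul", "AP_PHYS", 55.0),
     ("data17", "recon", "AP_REPR", 12.0)],
    ["projectName", "productionStep", "workingGroup", "duration"])

# percentile_approx is a built-in Spark SQL aggregate (Spark >= 2.1).
value_range = (tasks
    .groupBy("projectName", "productionStep", "workingGroup")
    .agg(F.expr("percentile_approx(duration, 0.95)").alias("ttc_upper")))

value_range.show()
</pre>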
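The second step can be sketched with the DataFrame-based Spark ML API (the paper names Random Forests regression from Spark MLlib; the exact API flavour, the categorical feature encoding and the toy training data below are assumptions), including the MSE/RMSE evaluation described in section 3:

<pre>
# Step 2 sketch: predict task duration from initial (descriptive) parameters
# with Random Forests regression, and evaluate with MSE/RMSE as in section 3.
# The feature encoding and toy data are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("ttc-prediction").getOrCreate()

features = ("projectName", "productionStep", "workingGroup")
tasks = spark.createDataFrame(
    [("mc16", "simul", "AP_PHYS", 30.0),
     ("mc16", "simul", "AP_PHYS", 55.0),
     ("mc16", "recon", "AP_PHYS", 20.0),
     ("data17", "recon", "AP_REPR", 12.0)] * 25,   # toy data, repeated
    list(features) + ["duration"])

# Encode categorical task parameters as numeric indices, assemble a feature
# vector, and fit a Random Forests regressor on the task duration label.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in features]
assembler = VectorAssembler(inputCols=[c + "_idx" for c in features],
                            outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="duration",
                           numTrees=50)

train, test = tasks.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=indexers + [assembler, rf]).fit(train)
predictions = model.transform(test)

# Basic evaluation metrics named in the paper: MSE and RMSE.
for metric in ("mse", "rmse"):
    value = RegressionEvaluator(labelCol="duration",
                                predictionCol="prediction",
                                metricName=metric).evaluate(predictions)
    print(metric.upper(), "=", value)
</pre>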
The current choice of controlled parameters/metrics is motivated by their correlation with possible ProdSys2 failure states [3]; as their quality/accuracy further improves and the need to introduce new ones arises, a comparative analysis of these parameters will be conducted.

5. Acknowledgement

This work has been carried out using computing resources of the federal collective usage center Complex for Simulation and Data Processing for Mega-science Facilities at NRC "Kurchatov Institute", http://ckp.nrcki.ru/. NRC KI researchers have been funded by the Russian Ministry of Science and Higher Education under contract No. 14.Z50.31.0024.

6. Conclusion

The ProdSys2 Predictive Analytics service is designed to enhance workflow control in the ATLAS Production System and to detect and highlight abnormal operations and executions. Its prototype demonstrates the usefulness of the provided metrics and of the state control mechanism. It still lacks most of the evaluation metrics needed to fine-tune the prediction process, which would increase the quality of the generated predictions and operational metrics. Furthermore, the future decision-making system should rely on the generated quality metrics, since it is responsible for regulating resource consumption. The quality of the obtained metrics (estimated values of controlled parameters) is constantly improving, and new evaluation parameters and metrics will be introduced for task analysis and mining processes.

References

[1] Borodin M. et al. The ATLAS Production System Evolution: New Data Processing and Analysis Paradigm for the LHC Run2 and High-Luminosity // Journal of Physics: Conference Series 898 (2017) 052016
[2] ATLAS Collaboration. The ATLAS Experiment at the CERN Large Hadron Collider // JINST 3 (2008) S08003
[3] Titov M. et al. Predictive analytics as an essential mechanism for situational awareness at the ATLAS Production System // CEUR Workshop Proceedings 2023 (2017) pp. 61-67
[4] Titov M. et al. Advanced Analytics service to enhance workflow control at the ATLAS Production System // Proceedings of the 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2018), Sofia, Bulgaria, 9-13 July 2018
[5] Duellmann D. et al. Hadoop and friends - first experience at CERN with a new platform for high throughput analysis steps // Journal of Physics: Conference Series 898 (2017) 072034
[6] Predictive model handling package, https://github.com/XDatum/prodsys-pa-model [accessed on 2018-10-25]