Meta-monitoring system for ensuring fault tolerance of the
intelligent high-performance computing environment

                I A Sidorov1, T V Sidorova2 and Ya V Kurzibova3
                1
                  Matrosov Institute for System Dynamics and Control Theory of SB RAS, Lermontov
                St. 134, Irkutsk, Russia, 664033
                2
                  Limnological Institute of SB RAS, Ulan-Batorskaya St. 3, Irkutsk, Russia, 664033
                3
                  Irkutsk State University, Karl Marks St. 1, Irkutsk, Russia, 664003

                ivan.sidorov@icc.ru

                Abstract. High-performance computing systems include a large number of hardware and
                software components that can cause failures. The well-known approaches to monitoring
                high-performance computing systems and ensuring their fault tolerance do not currently
                provide a fully integrated solution to these problems. The aim of this paper is to develop
                methods and tools for identifying abnormal situations during large-scale computational
                experiments in high-performance computing environments, localizing these malfunctions,
                troubleshooting them automatically where possible, and automatically reconfiguring the
                computing environment otherwise. The proposed approach is based on the idea of
                integrating the monitoring systems used in different nodes of the environment into a unified
                meta-monitoring system. The proposed approach minimizes the time needed for diagnostics
                and troubleshooting through the use of parallel operations. It also improves the resiliency of
                the computing environment processes through preventive measures for diagnosing and
                troubleshooting failures. These advantages increase the reliability and efficiency of the
                environment functioning. The novelty of the proposed approach lies in the following
                elements: mechanisms for the decentralized collection, storage, and processing of
                monitoring data; a new technique for decision-making when reconfiguring the environment;
                and support for fault tolerance and reliability not only of software and hardware, but also of
                the environment management systems.


1. Introduction
The development of a comprehensive monitoring system that would ensure the collection of data from
the large number of heterogeneous components included in a modern intelligent high-performance
computing environment (IHPCE) is a difficult task because of the lack of appropriate standardized
formats and protocols for obtaining the necessary information. There is a large number of software
solutions that allow us to monitor the necessary components of an IHPCE separately. In this regard,
the most expedient and promising direction of research in creating integrated monitoring systems for
IHPCEs is the integration of existing local monitoring systems within a unified meta-monitoring
system [1]. In such a system, the local monitoring systems act as data suppliers. Data collection and
unification, expert analysis of the obtained information, and definition of the necessary control
actions are assigned to the meta-monitoring system.
   The monitoring of the IHPCE components can be conventionally divided into the following
categories:
   • Monitoring and analysis of the software execution efficiency in the IHPCE (control of the
     current state of computational processes and their individual copies, evaluation of the
     efficiency of allocated resource use, etc.),
   • Monitoring, testing, and diagnostics of the hardware components of nodes (disks, processors,
     RAM, network interfaces, etc.),
   • Monitoring of the IHPCE engineering infrastructure (uninterruptible power supply systems,
     climatic equipment, fire-fighting systems, etc.),
   • Monitoring of the IHPCE computing infrastructure (monitoring of the current load of
     computing nodes, control of communication, transport, and service networks, data storage
     systems, etc.),
   • Monitoring of the IHPCE middleware (monitoring of the functioning of system services, task
     queues, agents, various subsystems, etc.).
   The paper suggests an approach to comprehensive monitoring of the IHPCE with multi-agent
control of computations [2, 3]. It is based on collecting and analyzing data received from a set of
local monitoring systems that control the operation of the hardware and software components of the
environment. In addition, control actions on the IHPCE functioning are developed within the
proposed approach.

2. Related work
Tools for monitoring and analyzing the efficiency of program execution in distributed computing
environments. A large number of systems have been accumulated in this category. They include
program profilers, tools for monitoring the utilization of computational resources by copies of
programs executed in the nodes of distributed computing environments, and tools for monitoring the
utilization of network components. These systems are described in Table 1. Their comparative
analysis is given in detail in [4, 5].
            Table 1. Systems for the monitoring and analysis of program performance
 NWPerf: A system for analyzing the performance of a parallel program with the ability to provide
   data on its individual blocks. It is implemented in Python. Reference:
   https://github.com/EMSL-MSC/NWPerf
 Arm MAP: A parallel, multi-threaded, and sequential profiler that provides comprehensive analysis
   for a specific set of metrics. It can analyze C, C++, and Fortran programs. Reference:
   https://www.arm.com/products/development-tools/server-and-hpc/forge/map
 LAPTA: Tools for the multidimensional analysis of dynamic characteristics of programs focused on
   supercomputers. It provides various types of graphical reports. Reference:
   http://hpc.msu.ru/node/84
 mpiP: A lightweight profiler of MPI programs. It can analyze programs in C, C++, and Fortran.
   Reference: http://mpip.sourceforge.net/
 Integrated Performance Monitoring (IPM): An advanced profiler of parallel programs with the
   ability to analyze data transfer processes, memory access, the communication network, and disks.
   It supports various implementations of the MPI library. Reference: http://ipm-hpc.sourceforge.net/
 Intel® VTune™ Amplifier: Commercial software for analyzing program performance. It supports
   the analysis of the performance and scalability of programs, communication network bandwidth,
   and data caching. Reference: https://software.intel.com/en-us/intel-vtune-amplifier-xe
 Tuning and Analysis Utilities (TAU): Tools for analyzing and visualizing the execution of parallel
   programs. It can analyze programs in C, C++, Fortran, UPC, Java, and Python. Reference:
   http://tau.uoregon.edu
 HPCToolkit: Tools for the automatic detection of inefficient blocks of a parallel program with
   reference to its source code. It is focused on use in computing environments comprising tens and
   hundreds of thousands of nodes. Reference: http://hpctoolkit.org
 Paraver: A program performance analyzer based on event tracing that allows detailed analysis of
   the changes and distribution of a specific set of metrics. It supports the prediction of program
   behavior in different scenarios. Reference: http://www.bsc.es/paraver
 Scalasca: Tools for optimizing parallel programs by measuring and analyzing their behavior during
   execution. The main emphasis in identifying inefficient blocks is placed on the synchronization of
   parallel programs. Reference: http://www.scalasca.org

   From the authors' point of view, the open-source packages NWPerf and Paraver are the most
functional and promising solutions for analyzing the efficiency of parallel program execution in
distributed computing environments.
   Monitoring, testing, and diagnostics of the hardware components of computational nodes.
Unfortunately, only a small number of systems intended for detecting defects in the hardware
components of distributed computing environment nodes are known. These systems are described in
Table 2.
       Table 2. Monitoring, testing, and diagnostic systems for hardware components of nodes
 Disparity: A software package that launches an MPI program on target nodes in order to detect
   possible malfunctions. It supports multiple modes of testing nodes (fast, advanced, etc.).
   Reference: [6]
 Coordinated Infrastructure for Fault Tolerant Systems (CIFTS): The system implements
   consistent processes for exchanging information about faults between nodes in order to develop a
   holistic picture of the system state. Reference: https://wiki.mcs.anl.gov/cifts/index.php/CIFTS
    The most interesting of them is the Disparity software tool, which can detect malfunctions of the
components of a computing node during the downtime between runs of instances of computational
processes.
    Systems for monitoring the engineering infrastructure of distributed computing environments. The
systems represented in Table 3 are used to monitor the engineering infrastructure of supercomputer
and data processing centers. However, almost all of them are proprietary and tied to specialized
equipment. Thus, they usually do not have sufficient flexibility for monitoring the IHPCE
infrastructure.
Table 3. Systems for monitoring the engineering infrastructure of distributed computing environments
 ClustrX: A resource management system that supports the automatic shutdown of equipment in the
   event of a failure of hardware and software components. The description of the monitored
   components is written in the Erlang language. Reference:
   http://www.t-platforms.com/products/software/clustrxproductfamily/clustrxwatch.html
 EMC ViPR SRM: Software for monitoring a corporate storage of information resources and
   automating the generation of reports about their status. It is designed to monitor specialized
   equipment only. Reference: http://russia.emc.com/data-center-management/vipr-srm.htm
 Bright Cluster Manager: A toolkit for automating the creation and control of compute clusters in
   data centers or cloud platforms. It provides a variety of reports. Reference:
   http://www.brightcomputing.com/products
 Moab Cloud HPC Suite: A resource management system for supercomputers. It supports the
   automation of planning, control, monitoring, and reporting. Reference:
   http://www.adaptivecomputing.com/moab-hpc-basic-edition/
 IBM cluster system management: A software complex for managing large-scale computing
   clusters. It is used predominantly on computing clusters manufactured by IBM. Reference:
   https://www-01.ibm.com/common/ssi/cgi-bin/ssialias

   At present, non-commercial software products that could provide a universal description of the
heterogeneous engineering equipment of a supercomputer center, the creation of new objects, and the
setting of rules for their monitoring are not known to the authors. The monitoring systems Nagios [7]
and Zabbix [8] provide a set of tools for monitoring the engineering infrastructure of distributed
computing environments, but in each specific case these tools must be significantly extended.
   Monitoring the computation infrastructure of distributed computing environments. Today, there
are a large number of comprehensive solutions in this category. The most popular of them are
represented in Table 4.
                   Table 4. Systems for monitoring the computation infrastructure
 Ganglia: A scalable distributed monitoring system with a hierarchical structure for the resources of
   computing clusters and cloud platforms. It is the most common system used in computer centers.
   Reference: http://ganglia.sourceforge.net
 Nagios: A monitoring system for computing systems and networks that supports a wide range of
   functional capabilities for notifying an operator of possible malfunctions. It is often used to
   monitor telecommunication networks. Reference: https://www.nagios.org
 Zabbix: A system for monitoring and tracking the state of the software and hardware of
   telecommunication networks, including network servers and services. It supports various
   databases for data storage. Reference: https://www.zabbix.org
 ZenOSS: A monitoring software package that supports the automatic detection and configuration of
   monitoring parameters of various systems. It is focused on cloud applications. Reference:
   https://www.zenoss.com/
 Ovis2: A comprehensive monitoring system that provides high scalability and integration with other
   monitoring tools. Reference: http://ovis.ca.sandia.gov/
   The most popular system in this category is Ganglia. However, its standard set of functions does
not meet the growing needs for monitoring the computation infrastructure of distributed computing
environments. Often, this limited set of functions leads to the need for additional monitoring systems,
such as Zabbix or Nagios. The most promising system in this category, from the authors' point of
view, is Ovis2, which provides high scalability and wide possibilities for connecting various data
sources.
   Monitoring the middleware of distributed computing environments. This category includes the
Nagios and Zabbix monitoring systems described above, as well as the more specialized tools
represented in Table 5.
          Table 5. Systems for monitoring the middleware of distributed computing environments
 Xymon: A software complex for monitoring the functioning of the system services of computing
   systems. Its basic principle is to check the availability of network ports. Reference:
   http://xymon.sourceforge.net/
 Failure Testing Service: A toolkit for testing applications in a cloud environment. It allows testing
   within the framework of continuous integration. Reference: [9]
 CloudRift: A system for testing microservice applications for cloud platforms. It can identify
   failures in individual segments of cloud programs. Reference: [10]

    In addition to the aforementioned systems, environment administrators usually develop specialized
utilities to track the correct functioning of individual subsystems included in the middleware. Such
utilities are often implemented as scripts run on a schedule using the cron service.
    The results of a comparative analysis of the functionality of the developed meta-monitoring system
with the capabilities of the key local monitoring systems described above are represented in Table 6.
These results show the obvious advantages of the meta-monitoring system.
    Table 6. Comparison of the developed meta-monitoring system with local monitoring systems
 Feature                                                   LAPTA  Disparity  ClustrX  Zabbix  Ganglia  Nagios  Ovis2  Xymon  Meta-monitoring system
 Analysis of the effectiveness of program implementation     +       –         –        –       –        –       –      –       +
 Monitoring and diagnostics of computing nodes               –       +         –        +       +        +       +      –       +
 Engineering infrastructure monitoring                       –       –         +        +       –        +       –      –       +
 Computing infrastructure monitoring                         +       –         –        +       +        +       +      –       +
 Testing of services and control subsystems                  –       –         –        –       –        –       +      +       +

3. Scheme of the environment component control
The general scheme of the IHPCE component control using the meta-monitoring system is shown in
Figure 1. In this scheme, the IHPCE component acts as the control object. The administrator
configures the operation of the job management system, which handles the flow w of user tasks,
using the vector c of configuration parameters. The administrator also applies control actions u1 to
the control parameters of the IHPCE component. The job management system determines the
computational load l of the component in accordance with the flow w. The external disturbances d of
the environment arise because of the actions of local users of the environment or events that occur
during the operation of the engineering infrastructure.
    The monitoring system collects the information i about the IHPCE component and the
computation management system with the help of measuring tools and local monitoring systems.
This information is formed on the basis of the characteristics h1 of the component status and the
information h2 about the functioning of the computation management system. Based on the collected
information, the meta-monitoring system assesses the current computational situation, predicts its
development, and forms the control actions u2 and u3 on the IHPCE component and the computation
management system in order to prevent or partially eliminate failures of hardware and software. In
the event of a critical situation, when such actions cannot be performed automatically, the meta-
monitoring system sends the corresponding notification s to the environment administrator.




                         Figure 1. Scheme of the IHPCE component control.
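
   To make the data flow concrete, the following minimal Python sketch models one pass of this
control cycle. The signal names follow Figure 1, while all thresholds, metric keys, and action names
are illustrative assumptions rather than the system's actual rule base.

    from dataclasses import dataclass

    @dataclass
    class MonitoringInfo:
        h1: dict  # characteristics of the IHPCE component status
        h2: dict  # information on the computation management system functioning

    def control_cycle(info, notify_admin):
        """One pass: assess the situation, form actions u2/u3, or send signal s."""
        u2, u3 = [], []  # actions on the component / on the management system
        if info.h1.get("loadavg5", 0) > 40:              # hypothetical threshold
            u3.append("suspend_job_dispatch")            # act on the job manager
        if info.h1.get("memory_used_percent", 0) > 95:   # hypothetical threshold
            u2.append("restart_leaking_service")         # act on the component
        if info.h1.get("node_down") and not (u2 or u3):  # nothing applies automatically
            notify_admin("critical situation, manual intervention required")  # signal s
        return u2, u3

    # Example: a node with high load triggers an action on the management system.
    print(control_cycle(MonitoringInfo(h1={"loadavg5": 43}, h2={}), print))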

4. Meta-monitoring system architecture
The meta-monitoring system architecture is based on the principles of organization of multi-agent
systems [11] and includes the following main components:
     • An interface for user access to the components of the meta-monitoring system, which allows
       working with them in batch or interactive mode,
     • An access-level subsystem, which differentiates access to the requested data,
     • Agents that operate on the IHPCE nodes, carry out data collection and processing, and
       interact with other agents.
    A software agent installed on an IHPCE node is a program executed in the background. The
agent collects data from the local monitoring systems, unifies the received data, and saves it in a
local DBMS. It includes a subsystem for failure diagnostics and environment reconfiguration. In
addition, the agent has a control subsystem that executes control actions and interacts with the agents
of upper levels.
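
   The following minimal Python sketch illustrates the agent behaviour described above: collect
from local monitoring systems, unify, store locally, diagnose on the node, and escalate to upper-level
agents only when needed. All function and object names are hypothetical stand-ins, not the system's
actual API.

    import time

    def unify(raw_records):
        """Bring heterogeneous records to a common schema (illustrative)."""
        return [{"ts": time.time(), "data": r} for r in raw_records]

    def agent_loop(local_sources, local_db, diagnose, control, poll_period=30):
        while True:
            raw = [src.read() for src in local_sources]  # local monitoring systems
            unified = unify(raw)
            local_db.write(unified)                      # save in the node-local DBMS
            for problem in diagnose(unified):            # diagnostics on the node itself
                if problem.can_fix_locally:
                    control.execute(problem.fix)         # control subsystem acts locally
                else:
                    control.escalate(problem)            # contact upper-level agents
            time.sleep(poll_period)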
    The possibility of analyzing data and making the necessary decisions on the side of the computing
node is a key difference between the presented approach and existing solutions. In the well-known
monitoring systems, the client installed on the nodes performs the functions of data collection and
periodic transmission of the data to the control node, while centralized processing and analysis of the
collected data are performed on the control node. This creates additional load on the network protocol
stack, consumes CPU time, and causes scalability problems.
   Agents of the developed meta-monitoring system consume about 37% less processor time in
comparison with the Ganglia agents at the same sensor-polling frequency. They transfer data to the
central node of the IHPCE or to neighbouring agents only when necessary or on request. The
processor time spent on data analysis on the node is less than the time spent on forming network
packets and controlling their integrity. Thus, this approach reduces the load on the network stack and
on the central node of the monitoring system. In addition, it makes it possible to reduce the negative
impact of the monitoring agent on the computational tasks performed on the nodes.
   The measurement of node state metrics (processor, memory, etc.) is implemented using the
functions of the SIGAR library [12]. This cross-platform library provides unified access to the
necessary information.
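
   As an illustration of the kind of unified access involved, the sketch below gathers comparable
node metrics with psutil, an analogous cross-platform Python library used here as a stand-in for
SIGAR (this is an assumption for illustration, not the tooling used in the paper).

    import psutil

    def node_metrics():
        """Collect basic node state metrics in a unified dictionary."""
        load1, load5, load15 = psutil.getloadavg()
        memory = psutil.virtual_memory()
        disk = psutil.disk_usage("/")
        return {
            "loadavg5": load5,
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_used_percent": memory.percent,
            "disk_free_bytes": disk.free,
        }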
   The integration of the meta-monitoring system with local monitoring systems is carried out in a
specialized language that is a subset of the ECMAScript language [13]. This specialized language
supports calling external commands, network interaction, output-stream processing, regular
expressions, and a number of other mechanisms for the rapid implementation of non-standard sensors.
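
   The integration language itself is not reproduced in the paper; the Python sketch below only
illustrates what a typical non-standard sensor does in this scheme: call an external command, parse
its output stream with a regular expression, and return a metric value. The smartctl invocation and
attribute line are illustrative assumptions, and real output formats vary by drive.

    import re
    import subprocess

    def sensor_disk_temperature(device="/dev/sda"):
        """Read a disk temperature by calling smartctl and parsing its output."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "Temperature_Celsius" in line:
                m = re.search(r"-\s+(\d+)", line)  # RAW_VALUE after the '-' column
                return int(m.group(1)) if m else None
        return None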
   The subsystem for data collection and processing is based on the principles of a round-robin
database. The volume of such a database does not change over time: its fixed size is achieved through
a predefined number of records used cyclically to store data.
   Nowadays, there are many implementations of cyclic databases (MRTG, RRDtool, etc.). At the
same time, our tests revealed a number of drawbacks in such systems, related primarily to
unacceptable performance in reading and writing data. We tried to create a cyclic database prototype
on the basis of the lightweight embedded relational database SQLite. However, the conducted
experiments showed its lower performance in comparison with RRDtool.
   In this regard, we decided to create our own implementation of the cyclic database, which uses a
specialized XML-based format for storing structured information. We developed mechanisms for
reading and writing data, aggregating data over a given time interval, displacing outdated data,
sampling data according to specified criteria, and caching data in memory. The developed database
has demonstrated its efficiency in comparison with MRTG and RRDtool.
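
   As a minimal sketch of the round-robin principle underlying this subsystem, the Python class
below keeps a fixed number of slots used cyclically, so that the storage volume never grows and
outdated records are displaced by new ones. The authors' XML on-disk format and caching
machinery are not reproduced here.

    class RoundRobinStore:
        def __init__(self, slots=1440):         # e.g. one slot per minute for 24 h
            self.slots = [None] * slots
            self.head = 0                       # position of the oldest record

        def write(self, record):
            self.slots[self.head] = record      # displaces the outdated record
            self.head = (self.head + 1) % len(self.slots)

        def select(self, predicate):
            """Data sampling according to a given criterion."""
            return [r for r in self.slots if r is not None and predicate(r)]

        def aggregate(self, key, func=max):
            """Aggregate one metric over the whole retained interval."""
            values = [r[key] for r in self.slots if r is not None and key in r]
            return func(values) if values else None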

5. Practical application
The developed methods and tools for meta-monitoring the IHPCE have been successfully tested in
the Irkutsk Supercomputer Center of SB RAS [14]. The IHPCE included three pools of nodes:
     • 20 computing nodes with Intel Xeon E5-2695 v4 "Broadwell" processors (720 cores in total),
     • 10 computing nodes with AMD Opteron 6276 "Bulldozer"/"Interlagos" processors (320 cores
       in total),
     • 20 computing nodes with Intel Xeon 5345 EM64T 2.33 GHz "Clovertown" processors
       (160 cores in total).
   During the study of the IHPCE by the meta-monitoring system, a list of hardware and software
resources whose components were in a near-critical state or functioning with errors was compiled.
The list of nodes, the diagnostic messages of the meta-monitoring system, and the node state
descriptions corresponding to the detected faults are given in Table 7.
                                   Table 7. Meta-monitoring results
 Pool 1, node 4. Message: «warning node-4.matrosov.icc.ru loadavg5 43». The average node load for
   the last 5 minutes exceeded 43 points.
 Pool 1, node 13. Message: «critical node-13.matrosov.icc.ru cpu-sys-p 77». At the node, the loading
   of processor cores by the tasks of the operating system prevails.
 Pool 2, node 112. Message: «critical sm112.matrosov.icc.ru filesystem /home wtime 583816».
   Writing to the /home directory is too slow.
 Pool 1, node 14. Message: «error node-14.matrosov.icc.ru down». The node is not available.
 Pool 2, node 102. Message: «critical sm102.tesla.icc.ru memory-used-ten 97». 97% of the node
   RAM is used.
 Pool 3, node 7. Message: «error node node-7.blackford.icc.ru filesystem /store du-free 0». The node
   has run out of free disk space in the /store directory.
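   The messages in Table 7 follow a visible "severity host detail" pattern; the Python sketch below
parses them under that assumption (the exact message grammar is inferred from the table, not a
published specification).

    import re

    MSG = re.compile(r"^(warning|critical|error)\s+(\S+)\s*(.*)$")

    def parse_diagnostic(message):
        m = MSG.match(message.strip("«»"))
        if not m:
            return None
        severity, host, detail = m.groups()
        return {"severity": severity, "host": host, "detail": detail}

    print(parse_diagnostic("«warning node-4.matrosov.icc.ru loadavg5 43»"))
    # {'severity': 'warning', 'host': 'node-4.matrosov.icc.ru', 'detail': 'loadavg5 43'}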
   The analysis of the data on the state of the IHPCE hardware and software resources collected by
the meta-monitoring system made it possible to reveal the inefficient operation of user applications,
optimize the load of computing resources, and improve the reliability of the IHPCE operation. For
example, when solving an important practical task of annotating the Synedra acus genome with the
help of the MAKER software package [15], the prevalence of read-write operations in a network
directory over computational operations performed on processor cores was revealed. In accordance
with the detected inefficient use of resources, the package parameters indicating the location of the
directories for writing the results of calculations were automatically corrected. Local directories of
the nodes (for example, /tmp) were assigned as such directories, which increased the efficiency of
using processor cores in this package by more than 30%.
   Another illustrative example of the successful application of the developed meta-monitoring
system is a significant improvement in power saving for one of the IHPCE pools, whose nodes are
outdated but continue to be operated by users. These users solve their problems with the help of
applications tailored to the software and hardware features of the nodes in this pool.
   The PBS Torque [16] system is used to manage the execution of tasks in this pool. In order to
automate the control of power consumption in the pool nodes, the following meta-monitoring
operations and rules have been developed:
     • Operations to collect data from the sensors of the PBS Torque system about the used
       resources and the tasks placed in the queue, to enable and disable pool nodes, and to change
       the pool configuration parameters,
     • A set of inference rules for the expert subsystem that define the conditions for applying these
       operations.
   When a task is added to the PBS Torque queue on the pool's management node, the number of
pool nodes required to solve it is automatically powered on using the Intelligent Platform
Management Interface (IPMI) protocol. These nodes are then quickly tested, and computational
processes are launched on them. After the task is completed, the system waits for new tasks for a
specified period of time (usually 1-2 hours). If no new tasks arrive, the nodes are automatically
switched off using the same IPMI protocol. As a result of this automation, the daily power
consumption of the pool was reduced by 34%.
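
   A minimal sketch of this power-management automation is given below. It assumes the standard
ipmitool CLI for the IPMI power commands, while the queue inspection and node bookkeeping are
simplified hypothetical placeholders rather than the actual PBS Torque integration.

    import subprocess
    import time

    def ipmi_power(bmc_host, action, user="admin", password="secret"):
        """Power a node 'on' or 'off' via its BMC using the stock ipmitool CLI."""
        subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", bmc_host,
             "-U", user, "-P", password, "chassis", "power", action],
            check=True)

    def manage_pool(queued_jobs, pool_nodes, wait_hours=2):
        # power on one idle node per queued job (simplified placeholder logic)
        idle = [n for n in pool_nodes if n.is_idle()]
        for node in idle[:len(queued_jobs)]:
            ipmi_power(node.bmc_address, "on")
        # ... nodes are tested and jobs are executed by PBS Torque ...
        time.sleep(wait_hours * 3600)        # wait for new tasks
        for node in pool_nodes:
            if node.is_idle():               # no new tasks arrived for this node
                ipmi_power(node.bmc_address, "off")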
   The meta-monitoring system is of great importance for evaluating the efficiency of the functioning
of the multi-agent system for distributed computing management [1, 2]. Permanent monitoring of the
work of this multi-agent system has shown that it has higher fault tolerance to failures of the software
and hardware resources of the IHPCE in comparison with other similar systems [17].

6. Conclusions
The paper addresses the relevant problem of monitoring high-performance computing systems and
ensuring their fault tolerance. We proposed a new approach to monitoring the IHPCE (an
environment with multi-agent management of distributed computing) and developed a specialized
meta-monitoring system. The developed meta-monitoring system provides control, diagnostics,
localization, and troubleshooting of the IHPCE components. In addition, the automatic
reconfiguration of the IHPCE in a finite number of steps minimizes the time of diagnosis and
troubleshooting through the parallel execution of their operations. Increasing the fault tolerance of
nodes by means of preventive diagnosis and troubleshooting improves the reliability and efficiency
of the IHPCE.
   The novelty of the presented approach includes the following elements:
     • A special mechanism for the decentralized collection, storage, and processing of monitoring
       data,
     • Decentralized decision-making for the environment reconfiguration,
     • Ensuring fault tolerance and reliability both for the hardware and software of the
       environment and for the environment management system itself.

Acknowledgment. The study is supported by the Russian Foundation for Basic Research, project
no. 19-07-00097 (reg. no. АААА-А19-119062590002-7). This work was also supported in part by the
Basic Research Program of SB RAS, project no. IV.38.1.1 (reg. no. АААА-А17-117032210078-4).

References

[1]    Bychkov I V, Oparin G A, Novopashin A P 2015 Agent-Based Approach to Monitoring and
          Control of Distributed Computing Environment Lecture Notes in Computer Science 253-257
[2]    Bychkov I, Feoktistov A, Sidorov I, Kostromin R 2017 Job Flow Management for Virtualized
          Resources of Heterogeneous Distributed Computing Environment Procedia Engineering 201
          534-542
[3]    Feoktistov A, Sidorov I, Tchernykh A, Edelev A, Zorkalzev V, Gorsky S, Kostromin R,
          Bychkov I, Avetisyan A 2018 Multi-Agent Approach for Dynamic Elasticity of Virtual
          Machines Provisioning in Heterogeneous Distributed Computing Environment Proc. of the
          Int. Conf. on High Performance Computing and Simulation (IEEE) pp 909-916
[4]    Benedict S 2013 Performance issues and performance analysis tools for HPC cloud
          applications: a survey Computing 89-108
[5]    Mohr B 2014 Scalable parallel performance measurement and analysis tools – state-of-the-art
          and future challenges Supercomputing frontiers and innovations 1(2) 108-123
[6]    Desai N, Bradshaw R, Lusk E 2008 Disparity: Scalable Anomaly Detection for Clusters Proc.
          of the 37th International Conference on Parallel Processing pp 116-120
[7]    Josephsen D 2007 Building a Monitoring Infrastructure with Nagios p 255
[8]    Zabbix. Available at: https://www.zabbix.org (accessed: 19.06.2019)
[9]    Gunawi H S 2011 FATE and DESTINI: a framework for cloud recovery testing Proc. of the 8th
          USENIX conference on Networked systems design and implementation pp 238-252
[10]   Savchenko D, Radchenko G, Taipale O 2015 Microservices validation: Mjolnirr platform case
          study Proceedings of the 38th International Convention MIPRO (IEEE) pp 248-253
[11]   Wooldridge M, Jennings N 1995 Intelligent Agents: Theory and Practice. The Knowledge
          Engineering Review 10(2) 115-152
[12]   System Information Gatherer and Reporter API. Available at:
          https://github.com/AlexYaruki/sigar (accessed: 19.06.2019).
[13]   Standard ECMA-262: ECMAScript Language Specification. Available at:
          http://es5.javascript.ru/ (accessed: 19.06.2019)
[14]   Irkutsk Supercomputer Center of SB RAS. Available at: http://hpc.icc.ru (accessed:
          19.06.2019).
[15]   MAKER – genome annotation pipeline. Available at: http://gmod.org/wiki/MAKER (accessed:
          19.06.2019).
[16]   PBS Torque. Available at: https://github.com/adaptivecomputing/torque (accessed: 19.06.2019).
[17]   Bychkov I, Feoktistov A, Kostromin R, Sidorov I, Edelev A, Gorsky S 2018 Machine Learning
          in a Multi-Agent System for Distributed Computing Management Data Science. Information
          Technology and Nanotechnology 2018 (CEUR-WS Proceedings) 2212 89-97