=Paper= {{Paper |id=Vol-2341/paper-01 |storemode=property |title=Approach to the Algorithms for the Analysis and Processing of Data from IT-Services Monitoring System |pdfUrl=https://ceur-ws.org/Vol-2341/paper-01.pdf |volume=Vol-2341 |authors=Maksim A. Bolshakov,Sergei V. Pugachev,Igor A. Molodkin,Nikolay N. Teslya }} ==Approach to the Algorithms for the Analysis and Processing of Data from IT-Services Monitoring System== https://ceur-ws.org/Vol-2341/paper-01.pdf
Approach to the Analysis and Processing of Data from IT-Services Monitoring System

Maksim A. Bolshakov
Saint Petersburg Information and Computing Centre JSC Russian Railways
Saint Petersburg, Russia
bolshakovm@yandex.ru

Sergei V. Pugachev
Emperor Alexander I St. Petersburg State Transport University
Saint Petersburg, Russia
nki-pugachev@yandex.ru

Igor A. Molodkin
Emperor Alexander I St. Petersburg State Transport University
Saint Petersburg, Russia
imolodkin@gmail.com

Nikolay N. Teslya
Laboratory of computer aided integrated systems, St.Petersburg Institute for Informatics and Automation of the RAS
Saint Petersburg, Russia
teslya@iias.spb.su

Copyright © by the papers’ authors. Copying permitted for private and academic purposes.
In: B. V. Sokolov, A. D. Khomonenko, A. A. Bliudov (eds.): Selected Papers of the Workshop Computer Science and Engineering in the framework of the 5th International Scientific-Methodical Conference "Problems of Mathematical and Natural-Scientific Training in Engineering Education", St.-Petersburg, Russia, 8–9 November, 2018, published at http://ceur-ws.org

Abstract

Various instrumental IT infrastructure monitoring systems are considered and compared: Zabbix, Nagios, ManageEngine OpManager, Hewlett Packard Operations Manager, Naumen Network Manager and IBM Tivoli. The functions to be performed by the IT infrastructure monitoring and management system in its target state are specified. The current state of the IT infrastructure monitoring system in JSC Russian Railways is described. A mathematical formulation of the problem of determining control values of metrics is given, together with an example of developing a neural network to determine these values and recommendations for its improvement.

Introduction

The coverage of technological processes by automation systems in JSC Russian Railways is constantly growing. By now, these systems have accumulated, and use in their work, enormous quantities of data. New data processing tools enable more comprehensive use of the accumulated data to solve a large variety of current problems [Rrw17]. As a result, Big Data technology is gaining ever-growing acceptance and demand. A special feature of Big Data as applied to JSC Russian Railways is that information comes from both external and internal sources. The data volumes involved fully conform to the currently accepted “7 Vs” of Big Data: Volume, Velocity, Variety, Veracity, Variability, Visualization, Value [Cuk14].

One of the most important ways of increasing the operating efficiency of the company is reducing the failure rate and operating costs. Methods for solving these issues are generally similar for railway infrastructure and IT infrastructure. For example, one approach to railway infrastructure condition monitoring is the IRV (Instrumented Revenue Vehicles) concept, which consists of equipping cars in active service with means for infrastructure condition monitoring. In locomotive facilities, information about the operation of traction equipment units and parts is read directly by sensors located on the locomotive [Gol17]. In IT infrastructure, a similar approach is applied: readings on the operation of all assembly units are taken in a number of different ways, where the assembly units are the elements of computing resources employed in IT service creation.

When selecting a specific vendor of a monitoring system, the decision should be based on how well the functionality of the solution matches the problems to be solved, and particularly their scale, on the IT landscape existing in the organization.

1 IT Infrastructure Monitoring Systems

Currently, the following instrumental systems are most frequently used to achieve full coverage by an IT infrastructure monitoring system.
  • Zabbix is an open-source system featuring sufficiently high efficiency and readiness for scalability up to corporate-level data. Because of its open-source technology and wide applicability, it has a sufficiently active developer community, with whose help even a novice monitoring system administrator can quickly become familiar with its installation and maintenance.
However, the implementation itself (setting up and using data collection metrics) can present serious difficulties for an administrator at the first stages of use, as this tool has no ready-made monitoring agents. Additional disadvantages include poorly designed visualization tools.
  • Nagios is an open-source tool with characteristics similar to those of Zabbix (both positive and negative), except for how quickly new data collection settings can be put into operation. While in Zabbix data collection algorithms can be reorganized in online mode, in Nagios the system should be rebooted after changes are made [Sha18].
  • ManageEngine OpManager is a tool which supports, among other things, automated response to abnormal situations. It has a convenient and understandable interface but limited scalability in terms of data, so its use is fully justified at small and medium-sized enterprises. At a larger scale, it is significantly behind its competitors in terms of performance.
  • Hewlett Packard Operations Manager is an example of a complete centralized monitoring system with high characteristics in terms of both user interface and data handling quality for various volumes.
  • Naumen Network Manager is a Russian product with all the necessary parameters for a complete IT infrastructure monitoring solution. Under conditions where many state companies aim for import substitution of IT tools, such products should positively meet all client needs. At the present time, this tool has good characteristics in terms of convenience of deployment, implementation and support, and in terms of collecting data and providing summary data after the necessary processing.
  • IBM Tivoli is a tool for creating a centralized monitoring system. It features simple installation, but the initial configuration process requires highly qualified specialists. Generally, this is connected with its application at the corporate data level, where configuration and setup require a large amount of work. In operation, the system is intuitive and has a vast set of ready-made functions supported by the developer; monitoring agents can also be built at your own discretion using Agent Builder. The line of monitoring products from this vendor contains tools for all monitoring functions, from data collection by means of various agents and processing of the collected data to automated response to failures and presentation of the necessary information via data displays for all user levels.

However, selecting one vendor of monitoring tools does not result in drastically different approaches to implementation. A working IT infrastructure monitoring and management system should perform the following functions in its target state:
  • Detection of IT components and determination of dependencies between them, including automatic monitoring agent installation and subsequent information collection, among other things in order to search for relations between the metrics characterizing the work of IT components.
  • Performance management: data collection in order to forecast utilization of resources and form proposals on load redistribution/balancing.
  • Correlation and event management: reception of messages from all monitoring data sources and their subsequent analysis in order to enrich data about an event, identify data similarity and determine the root cause.
  • Automated response to events in order to restore operability of a component or the entire service.

2 Current State of IT Infrastructure Monitoring System in JSC Russian Railways

Presently, JSC Russian Railways has implemented an umbrella solution as a centralized monitoring system, where data are transmitted by various “probes” at the lower level of information collection (monitoring agents from different vendors) and then processed centrally by IBM Tivoli. Because of the wide variety of ready-for-service monitoring agents from IBM, Tivoli “probes” make up about 90% of all information collection tools at the lower level of monitoring throughout JSC Russian Railways; the next largest tools in terms of coverage are Zabbix tools. It is the umbrella-type structure of the monitoring system that allows coverage of absolutely all elements of the IT infrastructure and processing of this data according to a single logic, avoiding various local and unrelated monitoring systems, for instance one for each type of equipment or geographical location.

Thus, the main factors affecting vendor selection are the readiness to work with the existing volumes of the client’s IT infrastructure and the cost of the solution. It is the transition from simple IT infrastructure monitoring to IT infrastructure management and IT services monitoring that requires the highest expenditures, because for virtually all vendors the product line elements that implement data processing and forecasting are the most expensive. That is why (among other causes) companies often stop developing their monitoring system at the level of equipment information collection and generation of events relating to failures of this equipment, without considering the relations between these elements.

In the target state, by contrast, the resource and service model is determined by taking inventory of all resources employed in IT service provisioning and then determining the influence of elements on each other and on the key performance parameters of the whole service. This concept is key to evaluating the possible consequences of failures/faults in individual infrastructure elements for the feasibility of final IT service provisioning. In this case, the condition of each element is characterized by metrics taken automatically [Oht06].

Obtained values of metrics are compared with reference ones, and when they exceed acceptable limits
the monitoring system informs appropriate technical support personnel about the abnormal state of the service. This creates the problem of correctly determining these reference or marginal values of normal condition for each metric and their mutual interrelations.

Initially, these values are determined with the help of experts; however, the risk of subjectivity in this case cannot be totally excluded. That is why keeping marginal values and their mutual interrelations up to date for a large number of metrics is a rather complex problem, and, additionally, permanently engaging a group of experts in this task is cost intensive [Sup16, Aya07, Mar09].

At the current level of IT monitoring system development, the data volume in MCC JSC Russian Railways is 11 terabytes, and this value will only increase, because automated prioritizing of metrics taken from the IT infrastructure and creation of resource and service models require a data storage horizon longer than the current one. At the present time, when aggregated data are stored, the storage horizon for most metrics is in the interval of 1-3 months. Raw data is actually stored for a substantially shorter period.

In this case, the specialists responsible for IT infrastructure monitoring and support have at their disposal 1318 unique metrics, combinations of which are applied to selected IT services. Special attention should be paid to the heterogeneity of data for these metrics. For example, for each IT service the following metrics should be analyzed: numerical values of processor utilization expressed in percent, remaining free space on the virtual server in megabytes, response time of network equipment via ICMP in milliseconds, and text values of responses for Blade chassis operation status.

This heterogeneity, together with the large volume of data, makes it impossible to solve the problem based just on the knowledge, skills and competence of the experts involved in the support of the IT infrastructure on which MCC JSC Russian Railways IT services are deployed [Ort91].

3 Mathematical Formulation of Problem

The mathematical formulation of the problem of determining reference values of metrics can be represented in the following form: for each metric $M_i$ ($i \in [1, m]$) it is necessary to determine a reference value $K_i$ which makes it possible to unambiguously split the whole array of values of the metric $M_i$ into normal and abnormal subsets (the boundary condition can be derived).

To do this, it is necessary to find the vector $\vec{K} = (K_1, \ldots, K_i, \ldots, K_m)$ under the condition:

\[
\vec{K} \times M \to \vec{F},
\]

where $M = (M_{ji})$ is the matrix of values of the i-th metric ($i \in [1, m]$) in the j-th period of time ($j \in [1, n]$); $\vec{F} = (F_1, \ldots, F_j, \ldots, F_n)$ is the resulting condition of the service, and $F_j$ is the condition of the service at each j-th point of time:

\[
F_j =
\begin{cases}
0; \\
1,
\end{cases}
\]

where 0 is the normal condition and 1 is a fault.

The problem under consideration can be solved using classic methods for solving systems of simultaneous linear equations:

\[
\begin{cases}
K_1 \times M_{11} + \cdots + K_i \times M_{1i} + \cdots + K_m \times M_{1m} = F_1; \\
\cdots \\
K_1 \times M_{j1} + \cdots + K_i \times M_{ji} + \cdots + K_m \times M_{jm} = F_j; \\
\cdots \\
K_1 \times M_{n1} + \cdots + K_i \times M_{ni} + \cdots + K_m \times M_{nm} = F_n.
\end{cases}
\]

At the same time, this problem can be solved using the apparatus of artificial neural networks [Aya07]. These networks have various structures, and each one has advantages and disadvantages for solving different kinds of problems. Assuming that at the j-th point of time the values of metrics are given for all previous measurements and the task is to estimate the probability of a change in the resulting condition of the IT service (failure/degradation), it is possible to speak about a forecasting problem, i.e. about a special case of a regression problem. For such problems, the use of forward propagation neural networks (perceptrons) is the most justified [Naz03].

In this case, the number of layers of the forward propagation network and the number of neurons in each layer are the values on which, on the one hand, the speed and, on the other hand, the quality of learning of the proposed neural network depend. The degree of network architecture complication and the increase in the number of neurons, in turn, depend on existing computational capability limitations.

Under the conditions of MCC JSC Russian Railways operations, the speed of these calculations is no less important than the quality of the $K_i$ coefficient calculations, because it is absolutely necessary to organize periodic model recalculation in order to keep the model up to date.

To achieve this goal, and to mitigate possible negative influence on the IT services provided by MCC JSC Russian Railways to consumers, the periodic learning of the neural network should be executed at the time of minimal utilization of the computing system, for instance in daily relearning mode at night (according to the Moscow time zone).

The learning process itself should be built in the format of supervised learning (learning by instruction), that is, learning by means of the presentation of multiple available examples of input data $M$ and reference solutions $\vec{F}$. As stated above, this problem is, in substance, a forecasting problem, which is most often solved using a back propagation learning algorithm. Its main disadvantage is that the learning process is too long, which makes it essentially unusable for the given problem. At the present time, there are a number of faster algorithms, such as the conjugate gradient method, RProp, the method of Levenberg–
Marquardt, etc.

The optimal choice for solving the problem is the RProp (Resilient Propagation) method, known as the method of resilient error propagation. It outperforms the standard back propagation method in terms of learning time, particularly with regard to the heterogeneity of available data [Nov16].

At the beginning of learning, all weight factors $K_i$ are set in a random manner (as small values close to zero), and then, as examples are input, the network error is minimized.

In the learning process using the RProp algorithm, the signs of partial derivatives are used to adjust weight factors. For each weight factor $K_i$ in the chain for the k-th neuron, a separate modifier value $\Delta_{ik}$ is introduced and used to calculate the size of the correction for that weight factor.

To determine the correction value, the following convention is used in each step:

\[
\Delta_{ik}^{j} =
\begin{cases}
\eta^{+} \Delta_{ik}^{(j-1)}, & \text{if } \dfrac{\partial E^{j-1}}{\partial K_{ik}} \dfrac{\partial E^{j}}{\partial K_{ik}} > 0; \\[2ex]
\eta^{-} \Delta_{ik}^{(j-1)}, & \text{if } \dfrac{\partial E^{j-1}}{\partial K_{ik}} \dfrac{\partial E^{j}}{\partial K_{ik}} < 0,
\end{cases}
\]

where $0 < \eta^{-} < 1 < \eta^{+}$.

Specific values of modifiers can be different, but most often the values proposed in [Rie93] and tested on multiple examples are used:

\[
\eta^{-} = 0.5; \quad \eta^{+} = 1.2;
\]

$\dfrac{\partial E^{j}}{\partial K_{ik}}$ is the partial derivative of the error function with respect to the weight factor at the j-th point of time.

If in the current step the partial derivative with respect to the corresponding weight $K_{ik}$ has changed its sign, then the last change was too large and the algorithm has stepped over the local minimum. Consequently, the amount of change should be decreased and the previous weight factor value should be restored; in other words, a “rollback” should be performed. When the derivative retains its sign, the modifier value should be additionally increased to accelerate convergence.

After the values of modifiers have been updated, the factors themselves are changed according to the convention:

\[
\Delta K_{ik}^{j} =
\begin{cases}
-\Delta_{ik}^{j}, & \text{if } \dfrac{\partial E^{j}}{\partial K_{ik}} > 0; \\[2ex]
+\Delta_{ik}^{j}, & \text{if } \dfrac{\partial E^{j}}{\partial K_{ik}} < 0; \\[2ex]
0, & \text{otherwise.}
\end{cases}
\]

\[
K_{ik}^{j+1} = K_{ik}^{j} + \Delta K_{ik}^{j}.
\]

The discussed example is a suitable variant of network generation and its subsequent learning for the stated problem, and in its terms the key effect of the use of the considered technologies becomes clear: initial resource and service model building in online mode without expert engagement.

Conclusion

The key effect of the use of the considered technologies is initial resource and service model building in online mode without the engagement of experts.

In our opinion, it is practical to continue further studies in the direction of detailed configuration of the neural network architecture to solve the problem in question, and of the use of available computational capabilities for periodic neural network relearning based on constantly updated training samples.

In this case, effective use of the computing system, namely of its dynamically distributed resources, should be built on the principles of parallel processing of calculation tasks, using algorithms employed in the distribution of work in multiprocessor computing systems [Mol19].

This will make it possible to develop adequate models that do not require substantial debugging by a group of experts in the operation of IT services, and to keep these models up to date, which ultimately provides an obvious improvement in the quality of real-time evaluation of MCC JSC Russian Railways IT services.

Further development of the monitoring system should be performed in relation to the results obtained.

References

[Aya07] N. Ayachitula. IT Service Management Automation - a Hybrid Methodology to Integrate and Orchestrate Collaborative Human Centric and Automation Centric Workflows / N. Ayachitula, M. J. Buco, Y. Diao, M. Surendra, R. Pavuluri, L. Shwartz, C. Ward. In IEEE SCC, 2007. Pp. 574–581.

[Cuk14] K. Cukier. A Revolution That Will Transform How We Live / K. Cukier, V. Mayer-Schonberger. NY: Mariner Books, 2014. 240 p.

[Gol17] A.S. Golubev. Digital Railway is Reality / A.S. Golubev, A.V. Skryabin. Russia, Eurasia News. 2017, №12.

[Mar09] P. Marcu. Managing Faults in the Service Delivery Process of Service Provider Coalitions / P. Marcu, L. Shwartz, G. Grabarnik, D. Loewenstern. In IEEE SCC, 2009. Pp. 65–72.

[Mol19] I.A. Molodkin, S.G. Svistunov. Comparative Analysis of Scheduling Algorithms in Multiprocessor Systems, Intellectual
Technologies on Transport. 2018, №2. Pp. 41–46.

[Naz03] A.V. Nazarov. Neural Network Algorithms of Forecasting and Optimization of Systems / A.V. Nazarov, A.I. Loskutov. Saint-Petersburg: Science and technique, 2003. 384 p.

[Nov16] P.A. Novikov. Software for Mobile Indoor Navigation Using Neural Networks / P.A. Novikov, A.D. Khomonenko, E.L. Yakovlev. Information Management System. 2016, №1. Pp. 32–39.

[Ort91] J. Ortega. Introduction to Parallel and Vector Solutions of Linear Systems / J. Ortega. Russia, Moscow: World, 1991. 367 p.

[Oht06] M.Yu. Ohtilev. Intelligent Technologies for Monitoring and Control of Structural Dynamics of Complex Technical Objects / M.Yu. Ohtilev, B.V. Sokolov, R.M. Yusupov. Moscow: Science, 2006. 410 p.

[Rie93] M.A. Riedmiller. Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm / M. Riedmiller, H. Braun. In IEEE Conf. on Neural Networks. San Francisco, 1993. Pp. 586–591.

[Rrw17] The Concept of Implementation of the Complex Scientific and Technical Project "Digital Railway". Russia, Moscow, 2017. 92 p.

[Sha18] K.S. Shardakov. Comparative Analysis of the Popular Monitoring Systems for Network Equipment Distributed Under the GPL License / K.S. Shardakov, V.P. Bubnov. Intellectual Technologies on Transport. 2018, №1. Pp. 44–48.

[Sup16] M. Supriya. Monitoring and Evaluation in Adaptation / M. Supriya, S. Truck, P. Davies. Darwin, 2016. 56 p.
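
Illustrative sketch: comparing metric values with reference values. The paper describes how obtained metric values are compared with reference ones and how the service condition F_j takes the value 0 (normal) or 1 (fault). The Python sketch below is one possible reading of that comparison, not the authors' implementation. It assumes that every reference value K_i is an upper limit and that a fault is raised when any metric exceeds its limit; the function names, metrics and numbers are hypothetical.

```python
def service_condition(metric_values, reference_values):
    """Return F_j: 1 (fault) if any metric exceeds its reference value, else 0 (normal).

    Assumption: every reference value is an upper limit.
    """
    return int(any(value > limit
                   for value, limit in zip(metric_values, reference_values)))


def evaluate_service(M, K):
    """Apply the comparison at every point of time; M[j][i] is metric i at time j."""
    return [service_condition(row, K) for row in M]


# Hypothetical reference values K_i: CPU utilization (%), ICMP response time (ms).
K = [90.0, 200.0]

# Hypothetical measurements, one row per point of time j.
M = [
    [35.0, 12.0],    # both metrics within limits
    [97.5, 15.0],    # CPU utilization above its reference value
    [40.0, 450.0],   # response time above its reference value
]

F = evaluate_service(M, K)
print(F)  # prints [0, 1, 1]
```

A metric such as remaining free disk space would need a lower limit instead, which is one reason the paper treats data heterogeneity as a separate difficulty.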




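
Illustrative sketch: one RProp step. The update convention given in Section 3 (modifier values scaled by 0.5 or 1.2 depending on the sign of the product of successive partial derivatives, then a weight change opposite to the sign of the current derivative) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' code. It follows the written formulas only and omits the "rollback" of the previous weight value mentioned in the text. The bounds `delta_min` and `delta_max`, the initial modifier `delta0` and the toy error function are assumptions introduced for the example.

```python
ETA_MINUS = 0.5  # decrease factor for the modifier, as quoted in the paper
ETA_PLUS = 1.2   # increase factor for the modifier, as quoted in the paper


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, deltas,
               delta_min=1e-6, delta_max=50.0):
    """Perform one RProp update over a flat list of weight factors.

    delta_min and delta_max are assumed bounds that keep the modifiers finite.
    """
    new_weights, new_deltas = [], []
    for k, g, g_prev, d in zip(weights, grads, prev_grads, deltas):
        product = g_prev * g
        if product > 0:        # derivative kept its sign: enlarge the modifier
            d = min(d * ETA_PLUS, delta_max)
        elif product < 0:      # derivative changed sign: shrink the modifier
            d = max(d * ETA_MINUS, delta_min)
        new_deltas.append(d)
        new_weights.append(k - sign(g) * d)  # move against the derivative sign
    return new_weights, new_deltas


def minimize(grad_fn, weights, steps=100, delta0=0.1):
    """Repeat RProp steps; grad_fn returns the partial derivatives of the error."""
    deltas = [delta0] * len(weights)
    prev_grads = [0.0] * len(weights)
    for _ in range(steps):
        grads = grad_fn(weights)
        weights, deltas = rprop_step(weights, grads, prev_grads, deltas)
        prev_grads = grads
    return weights


def toy_gradient(k):
    """Gradient of the toy error E(K) = (K1 - 3)^2 + (K2 + 1)^2."""
    return [2.0 * (k[0] - 3.0), 2.0 * (k[1] + 1.0)]


result = minimize(toy_gradient, [0.0, 0.0], steps=200)
print(result)
```

Because only the sign of each derivative is used, the step size does not depend on the scale of the metric behind a weight, which is consistent with the paper's remark about heterogeneous data.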