=Paper= {{Paper |id=Vol-2341/paper-01 |storemode=property |title=Approach to the Algorithms for the Analysis and Processing of Data from IT-Services Monitoring System |pdfUrl=https://ceur-ws.org/Vol-2341/paper-01.pdf |volume=Vol-2341 |authors=Maksim A. Bolshakov,Sergei V. Pugachev,Igor A. Molodkin,Nikolay N. Teslya }} ==Approach to the Algorithms for the Analysis and Processing of Data from IT-Services Monitoring System== https://ceur-ws.org/Vol-2341/paper-01.pdf

Approach to the Analysis and Processing of Data from IT-Services
Monitoring System

Maksim A. Bolshakov
Saint Petersburg Information and Computing Centre, JSC Russian Railways
Saint Petersburg, Russia
bolshakovm@yandex.ru

Sergei V. Pugachev
Emperor Alexander I St. Petersburg State Transport University
Saint Petersburg, Russia
nki-pugachev@yandex.ru

Igor A. Molodkin
Emperor Alexander I St. Petersburg State Transport University
Saint Petersburg, Russia
imolodkin@gmail.com

Nikolay N. Teslya
Laboratory of computer aided integrated systems, St.Petersburg Institute for Informatics and Automation of the RAS
Saint Petersburg, Russia
teslya@iias.spb.su

In: B. V. Sokolov, A. D. Khomonenko, A. A. Bliudov (eds.): Selected Papers of the Workshop Computer Science and Engineering in the framework of the 5th International Scientific-Methodical Conference "Problems of Mathematical and Natural-Scientific Training in Engineering Education", St.-Petersburg, Russia, 8–9 November, 2018, published at http://ceur-ws.org
Copyright © by the papers’ authors. Copying permitted for private and academic purposes.

Abstract

Various instrumental IT infrastructure monitoring systems are considered and compared: Zabbix, Nagios, ManageEngine OpManager, Hewlett Packard Operations Manager, Naumen Network Manager and IBM Tivoli. The functions to be performed by the IT infrastructure monitoring and management system in its target state are specified. The current state of the IT infrastructure monitoring system in JSC Russian Railways is described. A mathematical formulation of the problem of determining control values of metrics is given, together with an example of developing a neural network to determine control values of metrics and recommendations for its improvement.

Introduction

The scope of the coverage of technological processes by automation systems in JSC Russian Railways is constantly growing. By now, automation systems have accumulated, and use in their work, enormous quantities of data. New capabilities of data processing tools enable more comprehensive use of the accumulated data to solve a large variety of current problems [Rrw17]. Thus, Big Data technology is receiving ever-growing acceptance and demand. It should be noted that a special feature of Big Data as applied to JSC Russian Railways is that information comes from both external and internal sources. The data volumes fully conform to the currently accepted “7 Vs” of Big Data: Volume, Velocity, Variety, Veracity, Variability, Visualization, Value [Cuk14].

One of the most important ways of increasing the company's operating efficiency is reducing the failure rate and operating costs. Methods for solving these issues are generally similar for railway infrastructure and IT infrastructure. Thus, one approach to monitoring the condition of railway infrastructure is the IRV (Instrumented Revenue Vehicles) concept, which consists of equipping cars in active service with means for monitoring the infrastructure condition. In locomotive facilities, information about the operation of traction equipment units and parts is read directly by sensors located on the locomotive [Gol17]. In IT infrastructure, a similar approach is applied: readings about the operation of all assembly units are taken in a number of different ways, where the elements of computing resources employed in IT service creation play the role of assembly units.

When selecting a specific vendor of a monitoring system, the decision should be based on how well the functionality of the solution in question matches the problems, and particularly the scale of the problems, to be solved in the organization's existing IT landscape.

1 IT Infrastructure Monitoring Systems

Currently, the following instrumental systems are most frequently used to solve the problem of full coverage by an IT infrastructure monitoring system.
• Zabbix is an open-source system featuring sufficiently high efficiency and readiness to scale up to corporate-level data. Because of its open-source technology and wide applicability, there is a sufficiently active developer community, with whose help even a novice monitoring system administrator can quickly become familiar with its installation and maintenance.
However, implementation itself (setting up and using data collection metrics) can present serious difficulties for an administrator in the first stages of use, as this tool has no ready-made monitoring agents. Its disadvantages also include poorly designed visualization tools.
• Nagios is an open-source tool with characteristics (both positive and negative) similar to those of Zabbix, except for how quickly new data collection settings take effect. While in Zabbix data collection algorithms can be reorganized in online mode, in Nagios the system should be restarted after changes are made [Sha18].
• ManageEngine OpManager is a tool that supports, among other things, automated response to abnormal situations. It has a convenient and understandable interface but scales poorly with data volume, so its use is fully justified at small and medium-sized enterprises. At a larger scale, it is significantly behind its competitors in performance.
• Hewlett Packard Operations Manager is an example of a complete centralized monitoring system that performs well both in user interface and in the quality of data handling for various volumes.
• Naumen Network Manager is a Russian product with all the necessary parameters for a complete solution for IT infrastructure monitoring processes. Under conditions where many state companies aim for import substitution of IT tools, such products should meet all client needs. At present, this tool has good characteristics both in convenience of deployment, implementation and support, and in collecting data and providing summaries after the necessary processing.
• IBM Tivoli is a tool for building a centralized monitoring system. Installation is simple, but the initial configuration requires highly qualified specialists. Generally, this is because it is applied at the corporate data level, where configuration and setup require a large amount of work. In operation, the system is intuitive, with a vast set of ready-made functions supported by the developer; custom monitoring agents can be built using Agent Builder. This vendor's line of monitoring products contains tools for all monitoring functions, from data collection by means of various agents and processing of the collected data to automated response to failures and presentation of the necessary information via data displays for all user levels.

However, the choice of a particular vendor of monitoring tools does not lead to drastically different approaches to implementation. A working IT infrastructure monitoring and management system should perform the following functions in its target state:
• Detection of IT components and determination of dependencies between them: this includes automatic installation of monitoring agents and subsequent collection of information, among other things in order to search for relations between the metrics that characterize the work of IT components.
• Performance management: data collection in order to forecast resource utilization and form proposals on load redistribution/balancing.
• Correlation and event management: receiving messages from all monitoring data sources and analyzing them in order to enrich data about an event, identify similar data and determine the root cause.
• Automated response to events in order to restore the operability of a component or the entire service.

2 Current State of IT Infrastructure Monitoring System in JSC Russian Railways

Presently, in JSC Russian Railways, an umbrella solution is implemented as a centralized monitoring system, where data are transmitted by means of various “probes” at the lower level of information collection (monitoring agents from different vendors) and then processed centrally by IBM Tivoli. Because of the wide variety of ready-for-service monitoring agents from IBM, Tivoli “probes” make up about 90% of all information collection tools at the lower level of monitoring throughout JSC Russian Railways; the next largest in terms of coverage are Zabbix tools. It is the umbrella-type structure of the monitoring system that makes it possible to cover absolutely all elements of the IT infrastructure and to process these data according to a single logic, avoiding various local, unrelated monitoring systems, for instance one for each type of equipment or geographical location.

Thus, the main factors affecting vendor selection are readiness to work with the existing volume of the client’s IT infrastructure and the cost of the solution. The transition from simple IT infrastructure monitoring to IT infrastructure management and IT services monitoring requires the highest expenditures, because for virtually all vendors the product line components that implement data processing and forecasting are the most expensive. That is why (among other causes) companies often stop developing their monitoring system at the level of collecting equipment information and generating events about equipment failures, without considering the relations between these elements.

In the target state, by contrast, the resource and service model is determined by taking inventory of all resources employed in IT service provisioning and then determining the influence of elements on each other and on key performance parameters of the whole service. This concept is key to evaluating the possible consequences of failures or faults of individual infrastructure elements for the feasibility of providing the final IT service. In this case, the condition of each element is characterized by metrics taken automatically [Oht06].

The obtained metric values are compared with reference values, and when they exceed acceptable limits
the monitoring system informs the appropriate technical support personnel about the abnormal state of the service.

This creates the problem of correctly determining these reference or marginal values of the normal condition for each metric, as well as their mutual interrelations.

Initially, these values are determined with the help of experts; however, the risk of subjectivity cannot be totally excluded in this case. Keeping the marginal values and their mutual interrelations up to date for a large number of metrics is therefore a rather complex problem, and permanently engaging a group of experts for this task is cost-intensive [Sup16, Aya07, Mar09].

At the current level of IT monitoring system development, the data volume in MCC JSC Russian Railways is 11 terabytes, and this value will only increase, because automated prioritization of the metrics taken from the IT infrastructure and the creation of resource and service models require a data storage horizon longer than the current one. At present, aggregated data are stored for 1-3 months for most metrics. Raw data are stored for a substantially shorter period.

The specialists responsible for IT infrastructure monitoring and support have at their disposal 1318 unique metrics, combinations of which are mapped onto selected IT services. Special attention should be paid to the heterogeneity of these metrics. For example, for each IT service the following metrics should be analyzed: numerical values of processor utilization in percent, remaining free space on the virtual server in megabytes, response time of network equipment via ICMP in milliseconds, and text values of responses for Blade chassis operation status.

This heterogeneity, together with the large volume of data, makes it impossible to solve the problem based only on the knowledge, skills and competence of the experts who support the IT infrastructure on which the IT services of MCC JSC Russian Railways are deployed [Ort91].

3 Mathematical Formulation of Problem

The problem of determining reference metric values can be formulated mathematically as follows: for each metric $M_i$ ($i \in [1, m]$) it is necessary to determine a reference value $K_i$ that makes it possible to unambiguously split the whole array of values of metric $M_i$ into normal and abnormal subsets (the boundary condition can be derived).

To do this, it is necessary to find the vector $\vec{K} = (K_1, \ldots, K_i, \ldots, K_m)$ under the condition

$$\vec{K} \times M \to \vec{F},$$

where $M = (M_{ji})$ is the matrix whose entry $M_{ji}$ is the value of the i-th metric ($i \in [1, m]$) in the j-th period of time ($j \in [1, n]$); $\vec{F} = (F_1, \ldots, F_j, \ldots, F_n)$ is the resulting condition of the service; $F_j$ is the condition of the service at each j-th point of time:

$$F_j = \begin{cases} 0, \\ 1, \end{cases}$$

where 0 is the normal condition and 1 is a fault.

The problem under consideration can be solved using classical methods for solving systems of simultaneous linear equations:

$$
\begin{cases}
K_1 \times M_{11} + \cdots + K_i \times M_{1i} + \cdots + K_m \times M_{1m} = F_1; \\
\cdots \\
K_1 \times M_{j1} + \cdots + K_i \times M_{ji} + \cdots + K_m \times M_{jm} = F_j; \\
\cdots \\
K_1 \times M_{n1} + \cdots + K_i \times M_{ni} + \cdots + K_m \times M_{nm} = F_n.
\end{cases}
$$

At the same time, this problem can be solved using artificial neural networks [Aya07]. These networks have various structures, and each one has advantages and disadvantages for different kinds of problems. If at the j-th point of time the metric values are known for all previous measurements, and the task is to estimate the probability that the resulting condition of the IT service will change (failure/degradation), then this is a forecasting problem, i.e. a special case of a regression problem. For such problems, the use of feedforward (forward propagation) neural networks (perceptrons) is the most justified [Naz03].

In this case, the number of layers of the feedforward network and the number of neurons in each layer determine, on the one hand, the speed and, on the other hand, the quality of learning of the proposed neural network. How far the network architecture can be complicated and the number of neurons increased depends, in turn, on the available computational capabilities.

Under the operating conditions of MCC JSC Russian Railways, the speed of these calculations is no less important than the quality of the calculated coefficients $K_i$, because the model must be recalculated periodically in order to keep it up to date.

To achieve this goal, and to mitigate possible negative influence on the IT services provided by MCC JSC Russian Railways to consumers, the periodic learning of the neural network should be executed at the time of minimal utilization of the computing system, for instance, in daily relearning mode at night (according to the Moscow time zone).

The learning process itself should follow the supervised format (learning by instruction), that is, learning by means of the presentation of multiple available examples of input data $M$ and reference solutions $\vec{F}$. As stated above, this problem is in substance a forecasting problem, which is most often solved using a back propagation learning algorithm. Its main disadvantage is that the learning process is too long, which makes it essentially unusable for the given problem. At present, sufficiently faster algorithms exist, such as the conjugate gradient method, RProp, the method of Levenberg–
Marquardt, etc.

The optimal choice for solving the problem is the RProp (Resilient Propagation) method, known as the method of resilient error propagation. It outperforms the standard back propagation method in learning time, particularly given the heterogeneity of the available data [Nov16].

At the beginning of learning, all weight factors $K_i$ are set randomly (as small values close to zero); then, as examples are input, the network error is minimized.

In the learning process using the RProp algorithm, the signs of partial derivatives are used to adjust the weight factors. For each weight factor $K_i$ in the chain for the k-th neuron (denoted $K_{ik}$), a separate modifier value $\Delta_{ik}$ is introduced, which is used to calculate the size of the correction for that weight factor.

To determine the correction value, the following rule is used at each step:

$$
\Delta_{ik}^{j} =
\begin{cases}
\eta^{+} \Delta_{ik}^{(j-1)}, & \text{if } \dfrac{\partial E^{j-1}}{\partial K_{ik}} \dfrac{\partial E^{j}}{\partial K_{ik}} > 0; \\
\eta^{-} \Delta_{ik}^{(j-1)}, & \text{if } \dfrac{\partial E^{j-1}}{\partial K_{ik}} \dfrac{\partial E^{j}}{\partial K_{ik}} < 0,
\end{cases}
$$

where $0 < \eta^{-} < 1 < \eta^{+}$.

Specific values of the modifiers can differ, but most often the values proposed in [Rie93] and tested on multiple examples are used:

$$\eta^{-} = 0.5; \quad \eta^{+} = 1.2.$$

Here $\partial E^{j} / \partial K_{ik}$ is the partial derivative of the error function with respect to the weight factor at the j-th point of time.

If in the current step the partial derivative with respect to the corresponding weight $K_{ik}$ has changed its sign, the last change was too large and the algorithm has overshot the local minimum. Consequently, the amount of change should be decreased and the previous weight factor value should be restored; in other words, a “rollback” should be performed. When the derivative retains its sign, the modifier value should be increased further to accelerate convergence.

After the values of the modifiers have been updated, the weight factors themselves are changed according to the rule:

$$
\Delta K_{ik}^{j} =
\begin{cases}
-\Delta_{ik}^{j}, & \text{if } \dfrac{\partial E^{j}}{\partial K_{ik}} > 0; \\
\Delta_{ik}^{j}, & \text{if } \dfrac{\partial E^{j}}{\partial K_{ik}} < 0; \\
0, & \text{otherwise.}
\end{cases}
$$

$$K_{ik}^{j+1} = K_{ik}^{j} + \Delta K_{ik}^{j}.$$

The discussed example is a suitable way to build the network and subsequently train it for the stated problem, and it makes clear the key effect of the considered technologies: the initial resource and service model is built in online mode without engaging experts.

Conclusion

The key effect of the considered technologies is that the initial resource and service model is built in online mode without engaging experts.

In our opinion, it is practical to continue studies toward detailed configuration of the neural network architecture for the problem in question and toward using the available computational capabilities for periodic relearning of the neural network on constantly updated training sets.

In this case, the effective use of the computing system, namely its dynamically distributed resources, should rely on the principles of parallel processing of computational tasks and on the algorithms used to distribute work in multiprocessor computing systems [Mol19].

This will make it possible to develop adequate models of IT service operation that do not require substantial debugging by a group of experts, and to keep these models up to date, which ultimately improves the quality of real-time evaluation of MCC JSC Russian Railways IT services.

Further development of the monitoring system should be guided by the results obtained.

References

[Aya07] N. Ayachitula. IT Service Management Automation - a Hybrid Methodology to Integrate and Orchestrate Collaborative Human Centric and Automation Centric Workflows / N. Ayachitula, M. J. Buco, Y. Diao, M. Surendra, R. Pavuluri, L. Shwartz, C. Ward. In IEEE SCC, 2007. Pp. 574–581.

[Cuk14] K. Cukier. A Revolution That Will Transform How We Live / K. Cukier, V. Mayer-Schonberger. NY: Mariner Books, 2014. 240 p.

[Gol17] A.S. Golubev. Digital Railway is Reality / A.S. Golubev, A.V. Skryabin. Russia, Eurasia News. 2017, №12.

[Mar09] P. Marcu. Managing Faults in the Service Delivery Process of Service Provider Coalitions / P. Marcu, L. Shwartz, G. Grabarnik, D. Loewenstern. In IEEE SCC, 2009. Pp. 65–72.

[Mol19] I.A. Molodkin, S.G. Svistunov. Comparative Analysis of Scheduling Algorithms in Multiprocessor Systems, Intellectual
Technologies on Transport. 2018, № 2. Pp. 41–46.

[Naz03] A.V. Nazarov. Neural Network Algorithms of Forecasting and Optimization of Systems / A.V. Nazarov, A.I. Loskutov. Saint-Petersburg: Science and technique. 2003. 384 p.

[Nov16] P.A. Novikov. Software for Mobile Indoor Navigation Using Neural Networks / P.A. Novikov, A.D. Khomonenko, E.L. Yakovlev. Information Management System. 2016, №1. Pp. 32–39.

[Ort91] J. Ortega. Introduction to Parallel and Vector Solutions of Linear Systems / J. Ortega. Russia, Moscow: World. 1991. 367 p.

[Oht06] M.Yu. Ohtilev. Intelligent Technologies for Monitoring and Control of Structural Dynamics of Complex Technical Objects / M.Yu. Ohtilev, B.V. Sokolov, R.M. Yusupov. Moscow: Science, 2006. 410 p.

[Rie93] M.A. Riedmiller. Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm / M. Riedmiller, H. Braun. In IEEE Conf. on Neural Networks. San Francisco, 1993. Pp. 586–591.

[Rrw17] The Concept of Implementation of the Complex Scientific and Technical Project “Digital Railway”. Russia, Moscow, 2017. 92 p.

[Sha18] K.S. Shardakov. Comparative Analysis of the Popular Monitoring Systems for Network Equipment Distributed Under the GPL License / K.S. Shardakov, V.P. Bubnov. Intellectual Technologies on Transport. 2018, №1. Pp. 44–48.

[Sup16] M. Supriya. Monitoring and Evaluation in adaptation / M. Supriya, S. Truck, P. Davies. Darwin, 2016. 56 p.
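
As an illustration of the RProp update rules described in Section 3, the following minimal sketch trains a single sigmoid neuron so that the weighted sum of metric values approximates the service condition F (0 for normal, 1 for fault). It is not the authors' implementation. The function names (`forward`, `error_and_gradient`, `train_rprop`), the mean squared error, the constant bias column, the initial modifier value and its bounds, and the toy data are all assumptions made for the example; only the factors 0.5 and 1.2 and the sign-based rules with rollback come from the text.

```python
import math
import random


def forward(K, row):
    """Output of one sigmoid neuron for a single row of metric values."""
    z = sum(k * x for k, x in zip(K, row))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def error_and_gradient(K, M, F):
    """Mean squared error E over all rows and its partial derivatives dE/dK_i."""
    n = len(M)
    error = 0.0
    grad = [0.0] * len(K)
    for row, target in zip(M, F):
        out = forward(K, row)
        diff = out - target
        error += diff * diff / n
        for i, x in enumerate(row):
            grad[i] += 2.0 * diff * out * (1.0 - out) * x / n
    return error, grad


def train_rprop(M, F, steps=300, eta_minus=0.5, eta_plus=1.2,
                delta_init=0.1, delta_min=1e-6, delta_max=1.0, seed=0):
    """Fit weight factors K with sign-based RProp steps and rollback."""
    rng = random.Random(seed)
    m = len(M[0])
    K = [rng.uniform(-0.05, 0.05) for _ in range(m)]  # small random start
    delta = [delta_init] * m      # per-weight modifier values (assumed start)
    prev_grad = [0.0] * m
    prev_step = [0.0] * m
    for _ in range(steps):
        _, grad = error_and_gradient(K, M, F)
        for i in range(m):
            sign_product = prev_grad[i] * grad[i]
            if sign_product > 0:
                # Derivative kept its sign: enlarge the modifier.
                delta[i] = min(delta[i] * eta_plus, delta_max)
            elif sign_product < 0:
                # Sign changed: shrink the modifier and roll back the last step.
                delta[i] = max(delta[i] * eta_minus, delta_min)
                K[i] -= prev_step[i]
                prev_grad[i] = 0.0
                prev_step[i] = 0.0
                continue
            if grad[i] > 0:
                step = -delta[i]
            elif grad[i] < 0:
                step = delta[i]
            else:
                step = 0.0
            K[i] += step
            prev_step[i] = step
            prev_grad[i] = grad[i]
    return K


# Toy data (hypothetical): columns are bias, CPU load and response time,
# the last two scaled to [0, 1]; F marks the faulty periods.
M_demo = [
    [1.0, 0.10, 0.20], [1.0, 0.20, 0.10], [1.0, 0.30, 0.30], [1.0, 0.25, 0.15],
    [1.0, 0.90, 0.80], [1.0, 0.80, 0.95], [1.0, 0.95, 0.70], [1.0, 0.85, 0.90],
]
F_demo = [0, 0, 0, 0, 1, 1, 1, 1]

initial_error, _ = error_and_gradient([0.0, 0.0, 0.0], M_demo, F_demo)
K_demo = train_rprop(M_demo, F_demo)
final_error, _ = error_and_gradient(K_demo, M_demo, F_demo)
predicted = [1 if forward(K_demo, row) >= 0.5 else 0 for row in M_demo]
print("error before:", initial_error, "after:", final_error)
print("predicted conditions:", predicted)
```

Only the sign of each partial derivative influences the step, which is why metrics with very different scales (percent, megabytes, milliseconds) do not need a shared learning rate. In practice the heterogeneous metrics would still have to be encoded numerically before they can form the rows of M.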