=Paper=
{{Paper
|id=Vol-2341/paper-01
|storemode=property
|title=Approach to the Algorithms for the Analysis and Processing of Data from IT-Services Monitoring System
|pdfUrl=https://ceur-ws.org/Vol-2341/paper-01.pdf
|volume=Vol-2341
|authors=Maksim A. Bolshakov,Sergei V. Pugachev,Igor A. Molodkin,Nikolay N. Teslya
}}
==Approach to the Algorithms for the Analysis and Processing of Data from IT-Services Monitoring System==
Maksim A. Bolshakov, Saint Petersburg Information and Computing Centre, JSC Russian Railways, Saint Petersburg, Russia, bolshakovm@yandex.ru

Sergei V. Pugachev, Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia, nki-pugachev@yandex.ru

Igor A. Molodkin, Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia, imolodkin@gmail.com

Nikolay N. Teslya, Laboratory of computer aided integrated systems, St.Petersburg Institute for Informatics and Automation of the RAS, Saint Petersburg, Russia, teslya@iias.spb.su

In: B. V. Sokolov, A. D. Khomonenko, A. A. Bliudov (eds.): Selected Papers of the Workshop Computer Science and Engineering in the framework of the 5th International Scientific-Methodical Conference "Problems of Mathematical and Natural-Scientific Training in Engineering Education", St.-Petersburg, Russia, 8–9 November, 2018, published at http://ceur-ws.org. Copyright © by the papers' authors. Copying permitted for private and academic purposes.

===Abstract===
Various instrumental IT infrastructure monitoring systems are considered and compared: Zabbix, Nagios, ManageEngine OpManager, Hewlett Packard Operations Manager, Naumen Network Manager and IBM Tivoli. The functions to be performed by the IT infrastructure monitoring and management system in its target state are specified. The current state of the IT infrastructure monitoring system in JSC Russian Railways is described. A mathematical formulation of the problem of determining control values of metrics is given, together with an example of developing a neural network to determine these control values and recommendations for its improvement.

===Introduction===
The coverage of technological processes by automation systems in JSC Russian Railways is constantly growing. These automation systems have already accumulated, and use in their work, enormous quantities of data. New data processing tools make it possible to use the accumulated data more comprehensively to solve a large variety of current problems [Rrw17]. As a result, Big Data technology is receiving ever-growing acceptance and demand. A special feature of Big Data as applied to JSC Russian Railways is that information comes from both external and internal sources. The data volumes fully conform to the currently accepted “7 Vs” of Big Data: Volume, Velocity, Variety, Veracity, Variability, Visualization, Value [Cuk14].

One of the most important ways of increasing the operating efficiency of the company is reducing the failure rate and operating costs. Methods for solving these issues are generally similar for railway infrastructure and IT infrastructure. One approach to monitoring the condition of railway infrastructure is the IRV (Instrumented Revenue Vehicles) concept, in which cars in active service are equipped with means for monitoring the infrastructure condition. In locomotive facilities, information about the operation of traction equipment aggregates and parts is read directly by sensors located on the locomotive [Gol17]. A similar approach is applied in IT infrastructure: readings about the operation of all assembly units are taken in a number of different ways, where the assembly units are the elements of computing resources employed in IT service creation.

When selecting a specific vendor of a monitoring system, the decision should be based on how well the functionality of the solution matches the problems to be solved, and particularly their scale, on the IT landscape existing in the organization.

===1 IT Infrastructure Monitoring Systems===
Currently, the following instrumental systems are most frequently used to achieve full coverage by an IT infrastructure monitoring system.

• Zabbix is an open-source system featuring sufficiently high efficiency and readiness to scale up to corporate-level data. Because of its open-source technology and wide applicability, it has a sufficiently active developer community, with whose help even a novice monitoring system administrator can quickly become familiar with its installation and maintenance. However, the implementation itself (setting up and using data collection metrics) can present serious difficulties for an administrator at the first stages of use, as this tool has no ready-made monitoring agents. Its disadvantages also include poorly designed visualization tools.

• Nagios is an open-source tool with characteristics (both positive and negative) similar to those of Zabbix, except for how quickly new data collection settings take effect. In Zabbix, data collection algorithms can be reorganized online, whereas in Nagios the system has to be restarted after changes are made [Sha18].

• ManageEngine OpManager is a tool that supports, among other things, automated response to abnormal situations. It has a convenient and understandable interface but is limited in scalability in terms of data, so its use is fully justified at small and medium-sized enterprises. At a higher level, it is significantly behind its competitors in terms of performance.

• Hewlett Packard Operations Manager is an example of a complete centralized monitoring system with high characteristics both in terms of user interface and data handling quality for various volumes.

• Naumen Network Manager is a Russian product with all the necessary parameters for a complete solution for organizing IT infrastructure monitoring processes. Under conditions where many state companies aim for import substitution of IT tools, such products should positively meet all client needs. At the present time, this tool has good characteristics both in terms of convenience of deployment, implementation and support, and in terms of collecting and providing summary data after the necessary processing.

• IBM Tivoli is a tool for building a centralized monitoring system. Installation is simple, but the initial configuration requires highly qualified specialists. Generally, this is connected with its application at the corporate data level, where configuration and setup require a large amount of work. In operation, the system is intuitive and offers a vast set of ready-made functions supported by the developer; monitoring agents can also be built at the user's own discretion using Agent Builder. The line of monitoring products from this vendor contains tools for all monitoring functions: from data collection by means of various agents and processing of the collected data, to automated response to failures and presentation of the necessary information via data displays for all user levels.

However, selecting a particular vendor of monitoring tools does not lead to drastically different approaches to implementation. A working IT infrastructure monitoring and management system should perform the following functions in its target state:

• Detection of IT components and determination of dependencies between them, including automatic installation of monitoring agents and subsequent information collection, among other things in order to search for relations between the metrics characterizing the work of IT components.

• Performance management: data collection in order to forecast utilization of resources and form proposals on load redistribution/balancing.

• Correlation and event management: reception of messages from all monitoring data sources and their subsequent analysis in order to enrich data about an event, identify data similarity and determine the root cause.

• Automated response to events in order to restore operability of a component or of the entire service.

===2 Current State of IT Infrastructure Monitoring System in JSC Russian Railways===
Presently, JSC Russian Railways uses an umbrella solution implemented as a centralized monitoring system: data are transmitted by various “probes” at the lower level of information collection (monitoring agents from different vendors) and then processed centrally by IBM Tivoli. Because of the wide variety of ready-for-service monitoring agents from IBM, Tivoli “probes” make up about 90% of all information collection tools at the lower level of monitoring throughout JSC Russian Railways; the next largest tools in terms of coverage are Zabbix tools. It is the umbrella-type structure of the monitoring system that allows coverage of absolutely all elements of IT infrastructure and processing of this data according to a single logic, avoiding various local and unrelated monitoring systems, for instance one for each type of equipment or geographical location.

Thus, the main factors affecting the choice of vendor are the readiness to work with the existing volumes of the client's IT infrastructure and the cost of the solution. The transition from simple IT infrastructure monitoring to IT infrastructure management and IT services monitoring requires the highest expenditures, because for virtually all vendors the components of the product line that implement data processing and forecasting are the most expensive. That is why (among other causes) companies often stop developing their monitoring system at the level of collecting equipment information and generating events relating to failures of this equipment, without considering the relations between these elements.

In the target state, by contrast, the resource and service model is determined by taking inventory of all resources employed in IT service provisioning and then determining the influence of the elements on each other and on the key performance parameters of the whole service. This concept is key to evaluating the possible consequences of failures/faults in the operation of individual infrastructure elements for the feasibility of final IT service provisioning. In this case, the condition of each element is characterized by metrics taken automatically [Oht06].

The obtained metric values are compared with reference ones, and when they exceed acceptable limits the monitoring system informs the appropriate technical support personnel about the abnormal state of the service. This creates the problem of correctly determining these reference or marginal values of the normal condition for each metric, as well as their mutual interrelations.

Initially, these values are determined with the help of experts; however, the risk of subjectivity in this case cannot be totally excluded. Keeping the marginal values and their mutual interrelations up to date for a large number of metrics is therefore rather complex, and when a group of experts is permanently engaged this task is also cost intensive [Sup16, Aya07, Mar09].

At the current level of IT monitoring system development, the data volume in MCC JSC Russian Railways is 11 terabytes. This value will only increase, since automated prioritization of the metrics taken from the IT infrastructure and the creation of resource and service models require a storage horizon longer than the current one. At the present time, aggregated data for most metrics are stored for 1–3 months. Raw data are stored for a substantially shorter period.

The specialists responsible for IT infrastructure monitoring and support have at their disposal 1318 unique metrics, combinations of which are applied to selected IT services. Special attention should be paid to the heterogeneity of these metrics. For example, for each IT service the following metrics should be analyzed: numerical values of processor utilization expressed in percent, remaining free space on the virtual server in megabytes, response time of network equipment via ICMP in milliseconds, and text values of responses for Blade chassis operation status.

This heterogeneity, together with the large volume of data, makes it impossible to solve the problem based solely on the knowledge, skills and competence of the experts involved in the support of the IT infrastructure on which MCC JSC Russian Railways IT services are deployed [Ort91].

===3 Mathematical Formulation of Problem===
The problem of determining reference metric values can be formulated as follows: for each metric <math>M_i</math>, <math>i \in [1, m]</math>, it is necessary to determine a reference value <math>K_i</math> that makes it possible to unambiguously split the whole array of values of the metric <math>M_i</math> into normal and abnormal subsets (the boundary condition can be derived).

To do this, it is necessary to find the vector <math>\vec{K} = (K_1, \ldots, K_i, \ldots, K_m)</math> under the condition

<math>\vec{K} \times M \to \vec{F},</math>

where <math>M = (M_{ji})</math> is the matrix of values of the <math>i</math>-th metric (<math>i \in [1, m]</math>) in the <math>j</math>-th period of time (<math>j \in [1, n]</math>); <math>\vec{F} = (F_1, \ldots, F_j, \ldots, F_n)</math> is the resulting condition of the service; <math>F_j</math> is the condition of the service at the <math>j</math>-th point of time:

<math>F_j = \begin{cases} 0; \\ 1, \end{cases}</math>

where 0 is the normal condition and 1 is a fault.

The problem under consideration can be solved using classic methods for solving systems of simultaneous linear equations:

<math>\begin{cases} K_1 \times M_{11} + \cdots + K_i \times M_{1i} + \cdots + K_m \times M_{1m} = F_1; \\ \cdots \\ K_1 \times M_{j1} + \cdots + K_i \times M_{ji} + \cdots + K_m \times M_{jm} = F_j; \\ \cdots \\ K_1 \times M_{n1} + \cdots + K_i \times M_{ni} + \cdots + K_m \times M_{nm} = F_n. \end{cases}</math>

At the same time, this problem can be solved using artificial neural networks [Aya07]. These networks have various structures, and each one has advantages and disadvantages for different kinds of problems. Suppose that at the <math>j</math>-th point of time the metric values are given for all previous measurements, and the probability of a change in the resulting condition of the IT service (failure/degradation) has to be estimated. This is a forecasting problem, i.e. a special case of a regression problem. For such problems, the use of feedforward (forward propagation) neural networks (perceptrons) is the most justified [Naz03].

In this case, the number of layers of the feedforward network and the number of neurons in each layer determine, on the one hand, the speed and, on the other hand, the quality of learning of the proposed neural network. The degree to which the network architecture can be complicated and the number of neurons increased depends, in turn, on the existing computational capability limitations.

Under the operating conditions of MCC JSC Russian Railways, the speed of these calculations is no less important than the quality of the calculation of the coefficients <math>K_i</math>, because for this problem it is absolutely necessary to organize periodic recalculation of the model in order to keep it up to date.

To achieve this goal, and to mitigate possible negative influence on the IT services provided by MCC JSC Russian Railways to consumers, periodic learning of the neural network should be executed at the time of minimal utilization of the computing system, for instance in daily relearning mode at night (according to the Moscow time zone).

The learning process itself should be organized as supervised learning, that is, learning by presenting multiple available examples of input data <math>M</math> and reference solutions <math>\vec{F}</math>. As stated above, this problem is in substance a forecasting problem, which is most often solved using the back propagation learning algorithm. Its main disadvantage is that the learning process takes too long, which makes it essentially unusable for the given problem. At the present time, a sufficient number of faster algorithms are available, such as the conjugate gradient method, RProp, the Levenberg–Marquardt method, etc.

The optimal choice for solving the problem is the RProp (Resilient Propagation) method, known as the method of resilient error propagation. It outperforms the standard back propagation method in terms of learning time, particularly with regard to the heterogeneity of the available data [Nov16].

At the beginning of learning, all weight factors <math>K_i</math> are set randomly (as small values close to zero); then, as examples are input, the network error is minimized.

In the learning process using the RProp algorithm, the signs of the partial derivatives are used to adjust the weight factors. For each weight factor <math>K_i</math> in the chain for the <math>k</math>-th neuron, a separate modifier value <math>\Delta_{ik}</math> is introduced, which is used to calculate the size of the correction for the relevant weight factor.

To determine the correction value, the following rule is used in each step:

<math>\Delta^{j}_{ik} = \begin{cases} \eta^{+} \Delta^{(j-1)}_{ik}, & \text{if } \dfrac{\partial E^{j-1}}{\partial K_{ik}} \dfrac{\partial E^{j}}{\partial K_{ik}} > 0; \\ \eta^{-} \Delta^{(j-1)}_{ik}, & \text{if } \dfrac{\partial E^{j-1}}{\partial K_{ik}} \dfrac{\partial E^{j}}{\partial K_{ik}} < 0, \end{cases}</math>

where <math>0 < \eta^{-} < 1 < \eta^{+}</math>.

Specific values of the modifiers can be different, but most often the values proposed in [Rie93] and tested on multiple examples are used:

<math>\eta^{-} = 0.5; \quad \eta^{+} = 1.2;</math>

<math>\dfrac{\partial E^{j}}{\partial K_{ik}}</math> is the partial derivative of the error function <math>E</math> with respect to the weight factor at the <math>j</math>-th point of time.

If in the current step the partial derivative with respect to the corresponding weight <math>K_{ik}</math> has changed its sign, then the last change was too large and the algorithm has overshot the local minimum. Consequently, the amount of change should be decreased and the previous value of the weight factor should be restored; in other words, a “rollback” should be performed. When the derivative retains its sign, the modifier value should be increased to accelerate convergence.

After the values of the modifiers have been updated, the weight factors themselves are changed according to the rule:

<math>\Delta K^{j}_{ik} = \begin{cases} -\Delta^{j}_{ik}, & \text{if } \dfrac{\partial E^{j}}{\partial K_{ik}} > 0; \\ \Delta^{j}_{ik}, & \text{if } \dfrac{\partial E^{j}}{\partial K_{ik}} < 0; \\ 0, & \text{otherwise.} \end{cases}</math>

<math>K^{j+1}_{ik} = K^{j}_{ik} + \Delta K^{j}_{ik}.</math>

The discussed example is a suitable variant of network generation and subsequent learning for the stated problem. It makes clear the key effect of using the considered technologies: the initial resource and service model is built online without expert engagement.

===Conclusion===
The key effect of using the considered technologies is that the initial resource and service model is built online without engagement from experts.

In our opinion, it is practical to continue further studies in the direction of detailed configuration of the neural network architecture for the problem in question, and of the use of available computational capabilities for periodic relearning of the neural network based on constantly updated training sets.

In this case, the effective use of the computing system, namely of its dynamically distributed resources, should be built on the principles of parallel processing of calculation tasks, using the algorithms employed for distributing work in multiprocessor computing systems [Mol19].

This will make it possible to develop adequate models of IT service operation that do not require substantial debugging by a group of experts, and to keep these models up to date, which ultimately improves the quality of real-time evaluation of MCC JSC Russian Railways IT services.

Further development of the monitoring system should be performed in relation to the results obtained.

===References===
[Aya07] N. Ayachitula, M. J. Buco, Y. Diao, M. Surendra, R. Pavuluri, L. Shwartz, C. Ward. IT Service Management Automation: a Hybrid Methodology to Integrate and Orchestrate Collaborative Human Centric and Automation Centric Workflows. In IEEE SCC, 2007. Pp. 574–581.

[Cuk14] K. Cukier, V. Mayer-Schonberger. A Revolution That Will Transform How We Live. NY: Mariner Books, 2014. 240 p.

[Gol17] A.S. Golubev, A.V. Skryabin. Digital Railway is Reality. Russia, Eurasia News. 2017, No. 12.

[Mar09] P. Marcu, L. Shwartz, G. Grabarnik, D. Loewenstern. Managing Faults in the Service Delivery Process of Service Provider Coalitions. In IEEE SCC, 2009. Pp. 65–72.

[Mol19] I.A. Molodkin, S.G. Svistunov. Comparative Analysis of Scheduling Algorithms in Multiprocessor Systems. Intellectual Technologies on Transport. 2018, No. 2. Pp. 41–46.

[Naz03] A.V. Nazarov, A.I. Loskutov. Neural Network Algorithms of Forecasting and Optimization of Systems. Saint-Petersburg: Science and technique, 2003. 384 p.

[Nov16] P.A. Novikov, A.D. Khomonenko, E.L. Yakovlev. Software for Mobile Indoor Navigation Using Neural Networks. Information Management System. 2016, No. 1. Pp. 32–39.

[Oht06] M.Yu. Ohtilev, B.V. Sokolov, R.M. Yusupov. Intelligent Technologies for Monitoring and Control of Structural Dynamics of Complex Technical Objects. Moscow: Science, 2006. 410 p.

[Ort91] J. Ortega. Introduction to Parallel and Vector Solutions of Linear Systems. Russia, Moscow: World, 1991. 367 p.

[Rie93] M. Riedmiller, H. Braun. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. In IEEE Int. Conf. on Neural Networks. San Francisco, 1993. Pp. 586–591.

[Rrw17] The Concept of Implementation of the Complex Scientific and Technical Project "Digital Railway". Russia, Moscow, 2017. 92 p.

[Sha18] K.S. Shardakov, V.P. Bubnov. Comparative Analysis of the Popular Monitoring Systems for Network Equipment Distributed Under the GPL License. Intellectual Technologies on Transport. 2018, No. 1. Pp. 44–48.

[Sup16] M. Supriya, S. Truck, P. Davies. Monitoring and Evaluation in adaptation. Darwin, 2016. 56 p.
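As a supplement to the RProp description in Section 3, the following is a minimal sketch of the modifier and weight update rules applied to a toy version of the problem. It is not the authors' implementation. Several things are assumptions made for illustration only: a single linear neuron with a squared-error function stands in for the network, the step-size bounds `DELTA_MIN` and `DELTA_MAX` and the initial modifier `delta0` are not given in the text, the 0.5 decision threshold in `predict` is arbitrary, the toy matrix `M` and vector `F` are hypothetical, and the "rollback" step mentioned in the text is omitted. Only the factors 0.5 and 1.2 come from the paper.

```python
import random

ETA_MINUS, ETA_PLUS = 0.5, 1.2      # modifier factors quoted in the text from [Rie93]
DELTA_MIN, DELTA_MAX = 1e-6, 50.0   # assumed step-size bounds, not given in the text


def error_gradient(K, M, F):
    """Gradient of the assumed error E(K) = 0.5 * sum_j (K . M[j] - F[j])**2."""
    grad = [0.0] * len(K)
    for row, target in zip(M, F):
        residual = sum(k * x for k, x in zip(K, row)) - target
        for i, x in enumerate(row):
            grad[i] += residual * x
    return grad


def rprop_fit(M, F, steps=200, delta0=0.1, seed=0):
    """Fit weight factors K with the sign-based RProp rules (no rollback step)."""
    rng = random.Random(seed)
    m = len(M[0])
    K = [rng.uniform(-0.01, 0.01) for _ in range(m)]  # small random start
    delta = [delta0] * m          # one modifier per weight factor
    prev_grad = [0.0] * m
    for _ in range(steps):
        grad = error_gradient(K, M, F)
        for i in range(m):
            product = prev_grad[i] * grad[i]
            if product > 0:       # derivative kept its sign: enlarge the modifier
                delta[i] = min(delta[i] * ETA_PLUS, DELTA_MAX)
            elif product < 0:     # sign change: the last step was too large
                delta[i] = max(delta[i] * ETA_MINUS, DELTA_MIN)
            # Move against the sign of the derivative by the modifier value.
            if grad[i] > 0:
                K[i] -= delta[i]
            elif grad[i] < 0:
                K[i] += delta[i]
        prev_grad = grad
    return K


def predict(K, row, threshold=0.5):
    """Return 1 (fault) or 0 (normal); the 0.5 threshold is an assumption."""
    return 1 if sum(k * x for k, x in zip(K, row)) >= threshold else 0


# Toy data: rows are points of time, columns are metrics (hypothetical values).
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
F = [1, 0, 1]
K = rprop_fit(M, F)
print(K, [predict(K, row) for row in M])
```

Because only the sign of each derivative is used, metrics measured on very different scales (percent, megabytes, milliseconds) do not directly set the step size, which is one plausible reading of why the text recommends RProp for heterogeneous data.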