Comparative analysis of machine learning methods to assess the quality of IT services

Maksim A. Bolshakov, Igor A. Molodkin and Sergei V. Pugachev

Saint Petersburg Railway Transport University of Emperor Alexander I, 9 Moskovsky Ave., Saint Petersburg, 190031, Russia

Abstract
The article considers the problem of choosing a machine learning method for the applied task of assessing the current state of the quality of IT services. The choice is made through a comparative analysis of the generally accepted machine learning methods against a set of criteria that characterise their effectiveness and efficiency. The main criterion is the F-measure, as a generalising measure of the completeness and precision of classification, computed for each class of states separately, together with the duration of the training and prediction procedure. All experiments were carried out on the same dataset, namely the data of the centralized monitoring and management system for the IT infrastructure of Russian Railways for the ETRAN IT service. Because of their heterogeneity, and taking into account the practice of applying the Barlow hypothesis, the initial data underwent preliminary processing, the algorithm of which is also described in detail in the article.

Keywords
machine learning, Barlow hypothesis, F-measure, dataset, data heterogeneity

Models and Methods for Researching Information Systems in Transport, Dec. 11-12, St. Petersburg, Russia
EMAIL: bolshakovm@yandex.ru (Maksim A. Bolshakov); molodkin@pgups.ru (Igor A. Molodkin); nki.pugachev@yandex.ru (Sergei V. Pugachev)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The requirement for software developers and device manufacturers to provide mandatory monitoring of the performance of their products is an established norm. As a result, most of the IT services provided by organizations can be assessed not only by the final failure state, but also by the current local characteristics of their operation.

Consider the currently operating centralized system for monitoring and managing the IT infrastructure of Russian Railways, which is operated by the Main Computer Center. The huge array of accumulated and constantly updated data of this monitoring system suggests that machine learning methods can successfully be used to assess the quality of IT services by determining the current state of the infrastructure and of the service applications implemented on it.

The state of the IT infrastructure at a given point in time can be assessed by predicting its final performance from the current data of the monitoring system. To do this, it is necessary to solve the problem of classifying the final state of the IT infrastructure, that is, to determine the class label (the type of the final state: incident or regular operation) from the current values of the monitoring system metrics. For the choice of the implementation method the number of available classes is not particularly important: binary classification is a special case of multiclass classification and is not a decisive characteristic when choosing a learning method. Typically, it is the nature of the input data, namely its types and formats, that has a significant impact on the choice of machine learning methods. At the same time, the data preprocessing algorithms do not depend on the chosen training method and must ensure the correct use of the initial data as training and test samples for all the methods considered below. [1]
2. Primary data processing

Primary data is understood as the whole set of characteristics of the IT infrastructure involved in the operability of the system in question, collected by the current monitoring and management system, together with the set of service characteristics that make it possible to treat this automated system as an IT service. The quality of the obtained primary data, namely its heterogeneity, requires additional processing regardless of the choice of a specific machine learning method.

For the correct formation of the training sample, Python 3.7.4 was used with the imported Pandas and NumPy libraries; these libraries must be installed on the script execution server as site-packages. Initially, the data received from the monitoring system is a CSV file with the separator | and the following columns:

mon_obj — monitoring object name;
metric_name — metric name;
metric_value — metric value at the time of data collection;
value_time — date/time when the data was collected;
isIncident — critical state indicator (at the first stage a binary classification is used: 1 - critical state, 0 - normal operation).

Using the variable small_columns_list = ['mon_obj', 'metric_name', 'metric_value', 'value_time', 'isIncident'] and the pandas library, a DataFrame suitable for further work is formed:

d = pd.read_csv('data/data.csv', sep='|', encoding='utf-8',
                skipinitialspace=True, skiprows=1,
                names=small_columns_list, low_memory=False)

Next, a new composite_metric_name column is defined in the DataFrame d; for each row it holds a composite name built from the monitoring object and the metric name:

d['composite_metric_name'] = d['mon_obj'] + '_' + d['metric_name']

Then a new variable column_names is declared, consisting of the unique values of the newly generated column:

column_names = d['composite_metric_name'].unique()

These steps are necessary to unambiguously match a metric and its values at any given time. Further, after transposing the DataFrame into a more convenient form (one row per collection time, one column per composite metric) and removing unnecessary columns, static data can be removed as a way to reduce the dimensionality and optimize the use of computing resources without losing data quality:

def delete_cols_w_static_value():
    column_names_with_static_value = []
    for col_name in column_names:
        if merged_df[col_name].nunique() == 1:
            column_names_with_static_value.append(col_name)
    if 'isIncident' in column_names_with_static_value:
        column_names_with_static_value.remove('isIncident')
    merged_df.drop(column_names_with_static_value, axis=1, inplace=True)

After reducing the dimensionality in this way, it is necessary to recode the text values, that is, to replace the existing text values of the metrics with numeric ones by introducing new arguments of the objective function: for each distinct text value a separate column with a value of 1 or 0 can be created in the dataset, thus obtaining data that are correct for analysis and comparable. The columns holding string values are identified first:

def find_strings_column():
    df_cols = merged_df.columns
    ret_col_names = []
    for col in df_cols:
        if merged_df[col].dtypes == 'object':
            ret_col_names.append(col)
    return ret_col_names

The variable string_columns stores the list of names of the columns that contain string values. The next step is to form a dictionary of string values; the values in the columns are then replaced by the indexes of the corresponding elements of the dictionary:

def make_dict_with_strings_column():
    ret = {}
    for sc in strings_column:
        ret[sc] = list(merged_df[sc].unique())
    return ret

Using the save function of the NumPy library, this dictionary is written to a file:

np.save('data/dict_of_string_values.npy', dict_of_string_values)

And further, using the function

def modify_value_string_column():
    for c in dict_of_string_values.keys():
        merged_df[c] = merged_df[c].map(lambda x: dict_of_string_values[c].index(x))

all text values are changed to numeric ones.

At the end of the primary data processing the values of the data frame are scaled: either standardized so that the mean of each series is 0 and the variance is 1, using the built-in StandardScaler tool, or rescaled to the range [0, 1] with a self-written function:

def normalize_df():
    cols = list(merged_df.columns)
    for cl in cols:
        x_min = min(list(merged_df[cl]))
        x_max = max(list(merged_df[cl]))
        merged_df[cl] = merged_df[cl].map(lambda x: (x - x_min) / (x_max - x_min))

For convenience, the final result is saved to a file:

merged_df.to_csv(path_or_buf='data/normalised_df.csv', sep='|', index=False)

As a result of the actions performed, a dataset suitable for applying various machine learning models was obtained, with the following characteristics:

• the number of records in the training set is 10815 (9323 of class 0, 1492 of class 1);
• the number of records in the test set is 3605 (3297 of class 0, 308 of class 1).
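Two steps are referred to above but not shown in the article: the transposition that produces merged_df (it has to precede the delete_cols_w_static_value() call) and the split into the training and test samples just listed. The following is only a minimal sketch of how these missing steps might look; the pivot parameters, the variable names X_train/X_test and the roughly 3:1 split proportion are assumptions of this sketch, not the authors' code.

from sklearn.model_selection import train_test_split

# Transposition producing merged_df: one row per collection time, one column
# per composite metric ('first' assumes a single reading per metric and time).
merged_df = d.pivot_table(index='value_time',
                          columns='composite_metric_name',
                          values='metric_value',
                          aggfunc='first')
# Re-attach the class label of each timestamp and drop the time index.
merged_df['isIncident'] = d.groupby('value_time')['isIncident'].max()
merged_df = merged_df.reset_index(drop=True)

# ... delete_cols_w_static_value(), the string recoding and normalize_df()
#     from the listings above are applied to merged_df at this point ...

# Split into training and test samples in roughly the 3:1 proportion
# implied by the record counts above.
feature_cols = [c for c in merged_df.columns if c != 'isIncident']
X_train, X_test, y_train, y_test = train_test_split(
    merged_df[feature_cols], merged_df['isIncident'],
    test_size=0.25, random_state=10)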
3. Criteria for comparing results

When solving any supervised learning problem there is no single most suitable machine learning algorithm: the "No Free Lunch" theorem states, in essence, that the best machine learning method for a specific task cannot be named unambiguously in advance. This theorem applies, among other things, to the well-studied area of binary classification problems, therefore it is necessary to consider a set of methods and to evaluate the effectiveness and applicability of each of them from the results of practical tests. [2]

The effectiveness and efficiency will primarily be assessed by the most common and most easily understood criterion, Accuracy. Accuracy is the proportion of correct decisions of the classifier:

Accuracy = P / N,

where P is the number of states correctly identified by the system and N is the total number of states in the test sample.

However, for the problem being solved this criterion is not sufficient, since it assigns the same weight to all final states (classes), which may be incorrect in the considered case of a non-uniform distribution of time moments over the final states. Thus, for a more correct comparison, and taking into account the fact that in the example under consideration the share of normal states is much greater than the share of critical states, the classifiers should also be assessed by the criteria Precision and Recall (completeness).

These criteria are calculated separately for each final class. Precision is the proportion of situations that really belong to a class among all situations that the system has assigned to that class. Recall is the proportion of situations of a class found by the classifier among all situations of that class in the test sample. These criteria are conveniently presented through the contingency table, also built separately for each final class [3]:

                          Class in the sample (1/0)
                          Positive        Negative
Result      Positive      TP              FP
            Negative      FN              TN

These values are used directly in the calculation of the criteria of the classifier:

Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}.

As can be seen from the way they are calculated, these criteria give a more complete picture of the quality of the classifier's work. It would be logical to say that the higher the precision and the recall, the better, but in reality maximum precision and maximum recall are not achievable at the same time, and a balance between these characteristics has to be found. For this the F-measure is used:

F = (\beta^2 + 1) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall},

where 0 < β < 1 if precision is given greater weight when selecting the classifier, and β > 1 if recall is preferred. In the case β = 1 the balanced F-measure is obtained:

F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}.

In addition, for all classifiers we estimate the duration of the work as wall time, counted from the moment the model is created and training begins until the moment the quality metrics are produced from the comparison of the predicted results with the test sample.

All experiments are carried out on a computer with the following characteristics: Intel Core i9-9980XE 3.00 GHz, 128 GB RAM, 4×NVIDIA RTX 2080 Ti 11 GB, 1 TB PCIe SSD.
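Purely for illustration, the experiments in the following sections can be reproduced with a small helper of the following kind, built on the X_train/X_test split sketched in section 2; the evaluate function and the use of classification_report are assumptions of this sketch, not the authors' code.

import time
from sklearn.metrics import accuracy_score, classification_report

def evaluate(model, X_train, y_train, X_test, y_test):
    # Wall time is counted from the start of training to the moment the
    # quality metrics are produced on the test sample.
    start = time.time()
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predicted))
    # Precision, Recall and F1-measure, reported separately for classes 0 and 1.
    print(classification_report(y_test, predicted, digits=2))
    print('Wall time: %.2f s' % (time.time() - start))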
4. K-nearest neighbors (kNN) algorithm

The algorithm works as follows. Let a training sample of pairs "object (state characteristics) - response (state class)" be given:

X^m = \{(x_1, y_1), \dots, (x_m, y_m)\},

and let a distance function \rho(x, x') be defined on the set of objects. This function should be a reasonably adequate model of object similarity: the larger its value, the less similar the two objects x and x' are [4]. For an arbitrary object μ we arrange the objects of the training sample in order of increasing distance to μ:

\rho(\mu, x_{1;\mu}) \le \rho(\mu, x_{2;\mu}) \le \dots \le \rho(\mu, x_{m;\mu}),

where x_{i;\mu} denotes the object of the training sample that is the i-th neighbor of the object μ. We introduce a similar notation y_{i;\mu} for the response of the i-th neighbor. Thus, an arbitrary object μ generates its own renumbering of the sample. In its most general form the nearest neighbors algorithm is:

a(\mu) = \arg\max_{y \in Y} \sum_{i=1}^{m} [y_{i;\mu} = y] \cdot w(i, \mu),

where w(i, μ) is a given weight function that estimates the importance of the i-th neighbor for classifying the object μ; this function must be non-negative and non-increasing in i. For the k nearest neighbors method:

w(i, \mu) = [i \le k].

The qualitative characteristics of the KNeighborsClassifier algorithm with the parameter n_neighbors = 10 are as follows:

Accuracy: 0.949
Class    Precision    Recall    F1 score
1        0.63         0.97      0.77
0        1.0          0.95      0.97
Wall time: 1 min 15 s

It is clearly seen that the Accuracy estimate alone is not enough for a comparative analysis: with a value of 0.949, i.e. almost 95% accuracy, the precision of determining the final state 1 (inoperability) is only 63%. As a result of this run, the F-measure for each class and the total duration of the classifier's work are recorded.
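A sketch of this experiment, reusing the illustrative evaluate helper from section 3; only the n_neighbors value is taken from the article.

from sklearn.neighbors import KNeighborsClassifier

# k = 10 nearest neighbors, as in the run reported above.
knn = KNeighborsClassifier(n_neighbors=10)
evaluate(knn, X_train, y_train, X_test, y_test)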
5. Logistic regression

To predict the probability of occurrence of a certain event from the values of a set of features, a dependent variable Y taking values 0 or 1 is introduced, together with a set of independent variables x_1, ..., x_n, from whose values the probability of the dependent variable taking a particular value has to be calculated. [5]

Let the objects be described by numerical features f_j: X \to R, j = 1, \dots, n, so that the space of feature descriptions is X = R^n. Let Y be the set of class labels and let a training set of object-response pairs X^m = \{(x_1, y_1), \dots, (x_m, y_m)\} be given. Consider the case of two classes, Y = \{-1, +1\}. Logistic regression builds a linear classification algorithm a: X \to Y of the form

a(x, w) = \mathrm{sign}\left( \sum_{j=1}^{n} w_j f_j(x) - w_0 \right) = \mathrm{sign}\langle x, w \rangle,

where w_j is the weight of the j-th feature, w_0 is the decision threshold, w = (w_0, \dots, w_n) is the vector of weights and \langle x, w \rangle is the dot product of the feature description of the object with the vector of weights. It is assumed that a zero feature is artificially introduced: f_0(x) = -1.

The task of training the linear classifier is to adjust the weight vector w from the sample X^m. In logistic regression this is done by minimizing the empirical risk with the loss function

Q(w) = \sum_{i=1}^{m} \ln(1 + \exp(-y_i \langle x_i, w \rangle)) \to \min_w.

After the solution w is found, it becomes possible not only to classify an arbitrary object x, but also to estimate the posterior probabilities of its belonging to the classes:

P\{y \mid x\} = \sigma(y \langle x, w \rangle), \quad y \in Y,

where \sigma(z) = \frac{1}{1 + e^{-z}} is the sigmoid function.

The qualitative characteristics of this model with the parameters LogisticRegression(multi_class='ovr', solver='lbfgs') are:

Accuracy: 0.971
Class    Precision    Recall    F1 score
1        0.76         0.98      0.85
0        1.0          0.97      0.98
Wall time: 1 min 10 s

The results shown by this classifier are better than those of the previous method at a comparable duration of operation; however, the quality of identifying faulty situations (class 1) still does not guarantee satisfactory operation in industrial mode.
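A sketch of this experiment, reusing the illustrative evaluate helper from section 3; the two parameters are those reported in the article.

from sklearn.linear_model import LogisticRegression

# Parameters as reported in the article; everything else is left at the
# scikit-learn defaults (multi_class is accepted by the scikit-learn
# versions contemporary with the article, recent versions deprecate it).
logreg = LogisticRegression(multi_class='ovr', solver='lbfgs')
evaluate(logreg, X_train, y_train, X_test, y_test)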
6. Naive Bayesian classifier

The Bayesian classification is based on the maximum posterior probability hypothesis: an object d is considered to belong to the class c_j (c_j \in C) for which the posterior probability \max P(c_j \mid d) is achieved. [6]

Bayes' formula:

P(c_j \mid d) = \frac{P(c_j) \cdot P(d \mid c_j)}{P(d)} \propto P(c_j) \cdot P(d \mid c_j),

where P(d \mid c_j) is the probability of encountering an object d among objects of class c_j, and P(c_j), P(d) are the prior probabilities of the class c_j and of the object d.

Under the "naive" assumption that all features describing the classified objects are equally important and independent of each other, P(d \mid c_j) can be calculated as the product of the probabilities of encountering each feature x_i (x_i \in X) among objects of class c_j:

P(d \mid c_j) = \prod_{i=1}^{|x|} P(x_i \mid c_j),

where P(x_i \mid c_j) is the probabilistic estimate of the contribution of the feature x_i to the fact that d \in c_j. [7]

In practice, multiplying very small conditional probabilities can lead to a loss of significant digits, therefore the logarithms of the probabilities P(x_i \mid c_j) should be used instead of the probability estimates themselves. The logarithm is a monotonically increasing function, so the class c_j with the largest value of the logarithm remains the most probable. The decision rule of the naive Bayesian classifier then takes the form:

C^* = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_{i=1}^{|x|} \log P(x_i \mid c_j) \right].

The resulting values of the MultinomialNB classifier from the sklearn library turned out to be the following:

Accuracy: 0.874
Class    Precision    Recall    F1 score
1        0.40         0.96      0.57
0        1.0          0.87      0.93
Wall time: 289 ms

This classifier, as its description suggests, works in a fundamentally different way and is much faster; however, the qualitative criteria, above all the F-measure, are inferior to the previous classifiers.
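A sketch of this experiment, reusing the illustrative evaluate helper from section 3; the article reports only the classifier name, so all parameters are left at the defaults.

from sklearn.naive_bayes import MultinomialNB

# MultinomialNB requires non-negative features; the [0, 1] scaling produced
# by normalize_df() in section 2 satisfies this (a StandardScaler-transformed
# frame would not).
nb = MultinomialNB()
evaluate(nb, X_train, y_train, X_test, y_test)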
"greedy" growth, each time adding the summand, The described procedure is applied to subsets Si and which is the most optimal algorithm, to the sum. further, until the values of the criterion cease to At the step when the optimal classifier F(m-1) of length increase significantly with new partitions or a different m - 1 has already been assembled, the task is reduced stopping condition is met. In this case, when in the to finding a pair of the most optimal parameters process of building a tree, an "empty" node is {am,bm} for the classifier of length m: obtained, where not a single example will fall, then it must be converted into a leaf that is associated with 𝐹𝑀 (𝑥) = 𝐹𝑚−1 (𝑥) + 𝑏𝑚 ℎ(𝑥, 𝑎𝑚 ), 𝑏𝑚 ∈ 𝑅, 𝑎𝑚 ∈ 𝐴 the class most often found in the immediate ancestor of this node. Optimality is understood here in accordance with the The DecisionTreeRegressor classifier with parameters principles of explicit maximization of margins - this Random_state = 15 and Min_samples_leaf = 25 means that a certain loss function L(yi,Fm (xi)) → min showed the following characteristics: is introduced, showing how much the predicted answer Fm (xi) differs from the correct answer yi. Next, you Accuracy 0.964 need to minimize the functionality of this error: Precision Recall F1 score 1 0.71 0.97 0.85 𝑁∑ 0 1.0 0.97 0.98 𝑄 = ∑ 𝐿(𝑦𝑖 , 𝐹𝑚 (𝑥𝑖 )) → 𝑚𝑖𝑛 Wall time: 975ms 𝑖=1 When working with the decision tree method, the results are similar to the logistic regression method, It should be noted that the error functional Q is a real however, the duration of training and forecasting in the function depending on the points {𝐹𝑚 (𝑥𝑖 )}𝑁 𝑖=1 in the N- decision tree method is much longer, which, with dimensional space, and this function is minimized by equal qualitative characteristics, puts the results of this the gradient descent method. As the point for which method higher than others. the optimal increment should be found, we define 𝐹𝑚−1 and the error gradient is expressed as follows: 8. Gradient boosting method 𝑁 𝜕𝑄 Gradient boosting is a machine learning method that 𝛻𝑄 = [ (𝑥 )] creates a decisive forecasting model in the form of an 𝜕𝐹𝑚−1 𝑖 𝑖=1 𝑁 ensemble of weak forecasting models, usually decision 𝜕(∑𝑁 𝑖=1 𝐿(𝑦𝑖 , 𝐹𝑚−1 )) trees, essentially developing the decision tree method. =[ (𝑥𝑖 )] = 𝜕𝐹𝑚−1 𝑖=1 During boosting, the model is built in stages - an 𝑁 arbitrary differentiable loss function is also optimized 𝜕𝐿(𝑦𝑖 , 𝐹𝑚−1 ) =[ (𝑥𝑖 )] in stages. [9] 𝜕𝐹𝑚−1 𝑖=1 For the problem of object recognition from a multidimensional space X with a label space Y, a By virtue of the gradient descent method, it is most training sample {𝑥𝑖 }𝑁 𝑖=1 is given, where 𝑥𝑖 ∈ 𝑋. In beneficial to add a new term as follows: addition, the true values of the class labels for each 𝐹𝑚 = 𝐹𝑚−1 − 𝑏𝑚 𝛻𝑄, 𝑏𝑚 ∈ 𝑅, object {𝑦𝑖 }𝑁 𝑖=1 are known, where yi∈Y. The solution to the prediction problem is reduced in this case to the where bm is selected by linear search over real search for a recognizing operator who can predict the numbers R: labels as accurately as possible for each new object 𝑁 from the set X. Let a family of basic algorithms H be given, each 𝑏𝑚 = 𝑎𝑟𝑔𝑚𝑖𝑛 ∑ 𝐿(𝐹𝑚−1 (𝑥𝑖 ) − 𝑏𝛻𝑄𝑖 ) element h(x,a)∈H:X→R of which defined by some 𝑖=1 vector of parameters a∈A. 
8. Gradient boosting method

Gradient boosting is a machine learning method that builds a predictive model in the form of an ensemble of weak prediction models, usually decision trees, and thus essentially develops the decision tree method. During boosting the model is built in stages, and an arbitrary differentiable loss function is optimized in stages as well. [9]

For the problem of recognizing objects from a multidimensional space X with a label space Y, a training sample \{x_i\}_{i=1}^{N} is given, where x_i \in X. In addition, the true class labels \{y_i\}_{i=1}^{N} of the objects are known, where y_i \in Y. The prediction problem is then reduced to the search for a recognizing operator that predicts the labels of each new object from X as accurately as possible.

Let a family of basic algorithms H be given, each element h(x, a) \in H: X \to R of which is defined by some vector of parameters a \in A. The final classification algorithm has to be found in the form of the composition

F_M(x) = \sum_{m=1}^{M} b_m h(x, a_m), \quad b_m \in R, \ a_m \in A.

However, selecting the optimal set of parameters \{a_m, b_m\}_{m=1}^{M} directly is an extremely time-consuming task, therefore the composition is built by "greedy" growth, each time adding to the sum the summand that is optimal at that step. At the step when the optimal classifier F_{m-1} of length m - 1 has already been assembled, the task is reduced to finding the pair of optimal parameters \{a_m, b_m\} for the classifier of length m:

F_m(x) = F_{m-1}(x) + b_m h(x, a_m), \quad b_m \in R, \ a_m \in A.

Optimality is understood here in accordance with the principle of explicit margin maximization: a loss function L(y_i, F_m(x_i)) \to \min is introduced, showing how much the predicted answer F_m(x_i) differs from the correct answer y_i. The error functional to be minimized is

Q = \sum_{i=1}^{N} L(y_i, F_m(x_i)) \to \min.

It should be noted that the error functional Q is a real function of the points \{F_m(x_i)\}_{i=1}^{N} of an N-dimensional space, and this function is minimized by gradient descent. Taking F_{m-1} as the point for which the optimal increment is sought, the error gradient is

\nabla Q = \left[ \frac{\partial Q}{\partial F_{m-1}(x_i)} \right]_{i=1}^{N} = \left[ \frac{\partial \left( \sum_{i=1}^{N} L(y_i, F_{m-1}) \right)}{\partial F_{m-1}(x_i)} \right]_{i=1}^{N} = \left[ \frac{\partial L(y_i, F_{m-1})}{\partial F_{m-1}(x_i)} \right]_{i=1}^{N}.

By virtue of the gradient descent method, it is most beneficial to add the new term as

F_m = F_{m-1} - b_m \nabla Q, \quad b_m \in R,

where b_m is selected by a line search over the real numbers:

b_m = \arg\min_b \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) - b \nabla Q_i).

However, \nabla Q is only a vector of optimal values for the individual objects x_i, and not a basic algorithm from the family H defined for all x \in X, so it is necessary to find the h(x, a_m) \in H that is most similar to \nabla Q. To do this, the error functional is minimized once more using the principle of explicit margin minimization:

a_m = \arg\min_a \sum_{i=1}^{N} L(\nabla Q_i, h(x_i, a)) \equiv train(\{x_i\}_{i=1}^{N}, \{\nabla Q_i\}_{i=1}^{N}),

which in turn corresponds to the basic learning algorithm. Next, the coefficient b_m is found by a line search:

b_m = \arg\min_b \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) - b \cdot h(x_i, a_m)).

The GradientBoostingClassifier implemented according to this algorithm, with the parameters learning_rate = 0.3, n_estimators = 10, verbose = 1, subsample = 0.5, showed the following results:

Accuracy: 0.993
Class    Precision    Recall    F1 score
1        0.94         0.99      0.97
0        1.0          0.99      1.0
Wall time: 3.95 s

The gradient boosting method, like the decision tree method, enumerates classification parameters, which makes the two methods comparable in terms of training duration. However, the additional time spent in boosting on building the ensemble of decision trees has a colossal effect: the accuracy with respect to the final state classes is the highest for this method.
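A sketch of this experiment with the parameters reported in the article, reusing the illustrative evaluate helper from section 3.

from sklearn.ensemble import GradientBoostingClassifier

# Parameters as reported in the article; everything else is left at the defaults.
gb = GradientBoostingClassifier(learning_rate=0.3, n_estimators=10,
                                subsample=0.5, verbose=1)
evaluate(gb, X_train, y_train, X_test, y_test)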
9. Neural network

Taking into account the analysis of the data sources of the considered IT infrastructure monitoring system and the nature of these data, the MLP (Multi-Layer Perceptron) type of neural network was chosen: since there are no video surveillance systems among the sources, and consequently no video recognition problem, a classical feed-forward neural network is sufficient to determine the effectiveness of applying neural networks. [10]

An MLP network is any multilayer neural network in which every hidden neuron applies, to a linear combination of its inputs, a common monotone bounded activation function g_v that depends only on the quantity t = s(v) - w_v and has the form of a "smoothed step": as a rule, the hyperbolic tangent

\tanh(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}

or the logistic function

\sigma(t) = \frac{\tanh(t/2) + 1}{2} = \frac{1}{1 + e^{-t}}.

The activation function of the output neurons can also be such a "smoothed step", or it can be the identity g_v(t) = t; that is, each neuron v calculates the function

u(v) = g_v\left( \left( \sum_e w_e u(e) \right) - w_v \right).

The parameters of the edges w_e are called weights, and the parameters of the vertices w_v are called biases. Which activation function is chosen, the hyperbolic tangent or the logistic one, does not matter: for any multilayer perceptron with the activation function tanh calculating the function F_tanh(w, x), the same perceptron in which the activation function of the intermediate layers is replaced by the logistic function σ calculates the same function for some other value of the parameter w':

F_{\tanh}(w, x) = F_{\sigma}(w', x).

In accordance with the ideology of minimizing the empirical risk with regularization, training the perceptron that calculates the function F(w, x) is the search for the vector of weights and biases that minimizes the regularized total error

E_T(w) = \varphi(w) + \sum_{i=1}^{N} E(F(w, x_i), y_i)

on some training set T = ((x_1, y_1), \dots, (x_N, y_N)). [11]

Training is most often carried out by the classical gradient descent method; for it to be applicable, the activation functions of all neurons and the error and regularization functions must be differentiable. Practice shows that the speed of this algorithm is often inferior to others because of the huge dimension of the parameter w and the absence of explicit formulas for the derivatives of the function F with respect to w. [11]

The results of applying MLPClassifier (max_iter = 100, random_state = 10) are as follows:

Accuracy: 0.992
Class    Precision    Recall    F1 score
1        0.93         0.99      0.95
0        1.0          0.99      1.0
Wall time: 1 min 5 s

The neural network, as is often expected, has shown high results of the qualitative assessment, essentially equal to the results of gradient boosting. However, the duration of training and forecasting for the neural network is much longer than for the methods based on building ensembles of decision trees: about a minute versus several seconds.
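A sketch of this experiment with the parameters reported in the article, reusing the illustrative evaluate helper from section 3.

from sklearn.neural_network import MLPClassifier

# max_iter and random_state as reported in the article; the hidden layer
# architecture is left at the scikit-learn default.
mlp = MLPClassifier(max_iter=100, random_state=10)
evaluate(mlp, X_train, y_train, X_test, y_test)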
10. Conclusion

The results of applying the various machine learning methods clearly confirm the postulate of the "No Free Lunch" theorem: it is experimental tests that allow the most appropriate algorithm to be chosen for a specific problem with its specific initial data. It should be noted once again that the Accuracy characteristic is practically useless for comparing the results; it is more correct to evaluate the results by the F-measure, and to do so separately for each class.

Based on the applied meaning of the task, namely to provide better monitoring of the IT infrastructure operation, the characteristics of the methods for the data of class "1" are the more important ones, that is, for the cases of real failures and infrastructure faults. Errors for class "0", in effect, produce additional incidents and therefore additional work for technical support specialists, which is certainly significant, but less important than missing real failures. The time parameters of the methods are also worth noting: the spread is truly colossal, from 289 milliseconds to 1 minute 15 seconds.

Comparing the criteria, it is clearly seen that the gradient boosting method showed the best results: with a higher speed, this algorithm was able to learn better than the other algorithms. When the application is replicated on much larger data sets (all IT services, a longer analysis horizon), training time becomes extremely important. This, together with the nature of the initial data, namely the absence of video and photo images, allows us to conclude that the gradient boosting method is more than sufficient for solving the problem, and that using a neural network, which showed similar results with a longer training duration, is not required at this stage of development of the considered IT infrastructure monitoring system and its method of collecting information.

References

[1] M.A. Bolshakov, Preparation of a monitoring system for IT infrastructure for models of critical states based on neural networks, Science-intensive Technologies in Space Research of the Earth, 2019, No. 4, pp. 65-71.
[2] D.H. Wolpert, W.G. Macready, No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation, 1997, No. 1, pp. 67-82.
[3] G. Upton, Analysis of Contingency Tables, translated from English with a preface by Yu. P. Adler, Finance and Statistics, Moscow, 1982, 143 p.
[4] M. Parsian, Data Algorithms, O'Reilly Media, Inc., Newton, 2015, 778 p.
[5] D.W. Hosmer, S. Lemeshow, Applied Logistic Regression, 2nd ed., John Wiley & Sons, Inc., New York, 2000, 397 p.
[6] S. Ranganathan, K. Nakai, C. Schonbach, Bayes' Theorem and Naive Bayes Classifier, Encyclopedia of Bioinformatics and Computational Biology, Volume 1, Elsevier, 2017, pp. 403-412.
[7] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.
[8] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2008, 764 p.
[9] J.H. Friedman, Stochastic Gradient Boosting, Technical report, Dept. of Statistics, Stanford University, 1999.
[10] T. Windeatt, Ensemble MLP Classifier Design, in: L.C. Jain, M. Sato-Ilic, M. Virvou, G.A. Tsihrintzis, V.E. Balas, C. Abeynayake (eds.), Computational Intelligence Paradigms, Studies in Computational Intelligence, vol. 137, Springer, Berlin, Heidelberg.
[11] V. Vapnik, Principles of Risk Minimization for Learning Theory, Advances in Neural Information Processing Systems 4 (NIPS 1991), 1992, No. 4, pp. 831-838.
[12] A.V. Gasnikov, Modern Numerical Optimization Methods. The Universal Gradient Descent Method: A Tutorial, 2nd ed., MFTI, Moscow, 2018, 286 p.