Comparative analysis of machine learning methods to assess the quality of IT services

Maksim A. Bolshakov, Igor A. Molodkin and Sergei V. Pugachev

Saint Petersburg Railway Transport University of Emperor Alexander I, 9 Moskovsky Ave., Saint Petersburg, 190031, Russia

Abstract
The article considers the problem of choosing a machine learning method for the applied task of assessing the current state of the quality of IT services. The choice is made through a comparative analysis of the generally accepted machine learning methods against a set of criteria that characterise their effectiveness and efficiency. The main criterion is the F-measure, as a generalising measure of the completeness and precision of classification, computed for each class of states separately, together with the duration of the training and prediction procedure. All experiments were carried out on the same dataset, namely the data of the centralized monitoring and management system for the IT infrastructure of Russian Railways for the ETRAN IT service. Because of their heterogeneity, and taking into account the practice of applying the Barlow hypothesis, the initial data underwent preliminary processing, the algorithm of which is also described in detail in the article.

Keywords
machine learning, Barlow hypothesis, F-measure, dataset, data heterogeneity

Models and Methods for Researching Information Systems in Transport, Dec. 11-12, St. Petersburg, Russia
EMAIL: bolshakovm@yandex.ru (Maksim A. Bolshakov); molodkin@pgups.ru (Igor A. Molodkin); nki.pugachev@yandex.ru (Sergei V. Pugachev)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The requirement for software developers and device manufacturers to provide mandatory monitoring of the performance of their products is an established norm. As a result, most of the IT services provided by organizations can be assessed not only by the final failure state, but also by the current local characteristics of their operation.

Consider the currently operating centralized system for monitoring and managing the IT infrastructure of Russian Railways, which is operated by the Main Computer Center. The huge array of accumulated and constantly updated data of this monitoring system suggests that machine learning methods can successfully be used to assess the quality of IT services by determining the current state of the infrastructure and of the service applications implemented on it.

The state of the IT infrastructure at a given point in time can be assessed by predicting its final performance from the current data of the monitoring system. To do this, it is necessary to solve the problem of classifying the final state of the IT infrastructure, that is, to determine the class label (the type of the final state: incident or regular operation) from the current values of the monitoring system metrics. For the choice of the implementation method the number of available classes is not particularly important: binary classification is a special case of multiclass classification and is not a decisive characteristic when choosing a learning method. Typically, it is the nature of the input data, namely its types and formats, that has a significant impact on the choice of machine learning methods. At the same time, the data preprocessing algorithms do not depend on the chosen training method and must ensure the correct use of the initial data as training and test samples for all the methods considered below. [1]
2. Primary data processing

Primary data is understood as the whole set of characteristics of the IT infrastructure involved in the operability of the system in question, collected by the current monitoring and management system, together with the set of service characteristics that make it possible to treat this automated system as an IT service. The quality of the obtained primary data, namely its heterogeneity, requires additional processing regardless of the choice of a specific machine learning method.

For the correct formation of the training sample, Python 3.7.4 was used with the imported Pandas and NumPy libraries; these libraries must be installed on the script execution server as site-packages. Initially, the data received from the monitoring system is a CSV file with the separator | and the following columns:

mon_obj — monitoring object name;
metric_name — metric name;
metric_value — metric value at the time of data collection;
value_time — date/time when the data was collected;
isIncident — critical state indicator (at the first stage a binary classification is used: 1 - critical state, 0 - normal operation).

Using the variable small_columns_list = ['mon_obj', 'metric_name', 'metric_value', 'value_time', 'isIncident'] and the pandas library, a DataFrame suitable for further work is formed:

d = pd.read_csv('data/data.csv', sep='|', encoding='utf-8',
                skipinitialspace=True, skiprows=1,
                names=small_columns_list, low_memory=False)

Next, a new composite_metric_name column is defined in the DataFrame d; for each row it holds a composite name built from the monitoring object and the metric name:

d['composite_metric_name'] = d['mon_obj'] + '_' + d['metric_name']

Then a new variable column_names is declared, consisting of the unique values of the newly generated column:

column_names = d['composite_metric_name'].unique()

These steps are necessary to unambiguously match a metric and its values at any given time. Further, after transposing the DataFrame into a more convenient form (one row per collection time, one column per composite metric) and removing unnecessary columns, static data can be removed as a way to reduce the dimensionality and optimize the use of computing resources without losing data quality:

def delete_cols_w_static_value():
    column_names_with_static_value = []
    for col_name in column_names:
        if merged_df[col_name].nunique() == 1:
            column_names_with_static_value.append(col_name)
    if 'isIncident' in column_names_with_static_value:
        column_names_with_static_value.remove('isIncident')
    merged_df.drop(column_names_with_static_value, axis=1, inplace=True)

After reducing the dimensionality in this way, it is necessary to recode the text values, that is, to replace the existing text values of the metrics with numeric ones by introducing new arguments of the objective function: for each distinct text value a separate column with a value of 1 or 0 can be created in the dataset, thus obtaining data that are correct for analysis and comparable. The columns holding string values are identified first:

def find_strings_column():
    df_cols = merged_df.columns
    ret_col_names = []
    for col in df_cols:
        if merged_df[col].dtypes == 'object':
            ret_col_names.append(col)
    return ret_col_names

The variable string_columns stores the list of names of the columns that contain string values. The next step is to form a dictionary of string values; the values in the columns are then replaced by the indexes of the corresponding elements of the dictionary:

def make_dict_with_strings_column():
    ret = {}
    for sc in strings_column:
        ret[sc] = list(merged_df[sc].unique())
    return ret

Using the save function of the NumPy library, this dictionary is written to a file:

np.save('data/dict_of_string_values.npy', dict_of_string_values)

And further, using the function

def modify_value_string_column():
    for c in dict_of_string_values.keys():
        merged_df[c] = merged_df[c].map(lambda x: dict_of_string_values[c].index(x))

all text values are changed to numeric ones.

At the end of the primary data processing the values of the data frame are scaled: either standardized so that the mean of each series is 0 and the variance is 1, using the built-in StandardScaler tool, or rescaled to the range [0, 1] with a self-written function:

def normalize_df():
    cols = list(merged_df.columns)
    for cl in cols:
        x_min = min(list(merged_df[cl]))
        x_max = max(list(merged_df[cl]))
        merged_df[cl] = merged_df[cl].map(lambda x: (x - x_min) / (x_max - x_min))

For convenience, the final result is saved to a file:

merged_df.to_csv(path_or_buf='data/normalised_df.csv', sep='|', index=False)

As a result of the actions performed, a dataset suitable for applying various machine learning models was obtained, with the following characteristics:

• the number of records in the training set is 10815 (9323 of class 0, 1492 of class 1);
• the number of records in the test set is 3605 (3297 of class 0, 308 of class 1).
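Two steps are referred to above but not shown in the article: the transposition that produces merged_df (it has to precede the delete_cols_w_static_value() call) and the split into the training and test samples just listed. The following is only a minimal sketch of how these missing steps might look; the pivot parameters, the variable names X_train/X_test and the roughly 3:1 split proportion are assumptions of this sketch, not the authors' code.

from sklearn.model_selection import train_test_split

# Transposition producing merged_df: one row per collection time, one column
# per composite metric ('first' assumes a single reading per metric and time).
merged_df = d.pivot_table(index='value_time',
                          columns='composite_metric_name',
                          values='metric_value',
                          aggfunc='first')
# Re-attach the class label of each timestamp and drop the time index.
merged_df['isIncident'] = d.groupby('value_time')['isIncident'].max()
merged_df = merged_df.reset_index(drop=True)

# ... delete_cols_w_static_value(), the string recoding and normalize_df()
#     from the listings above are applied to merged_df at this point ...

# Split into training and test samples in roughly the 3:1 proportion
# implied by the record counts above.
feature_cols = [c for c in merged_df.columns if c != 'isIncident']
X_train, X_test, y_train, y_test = train_test_split(
    merged_df[feature_cols], merged_df['isIncident'],
    test_size=0.25, random_state=10)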
3. Criteria for comparing results

When solving any supervised learning problem there is no single most suitable machine learning algorithm: the "No Free Lunch" theorem states, in essence, that the best machine learning method for a specific task cannot be named unambiguously in advance. This theorem applies, among other things, to the well-studied area of binary classification problems, therefore it is necessary to consider a set of methods and to evaluate the effectiveness and applicability of each of them from the results of practical tests. [2]

The effectiveness and efficiency will primarily be assessed by the most common and most easily understood criterion, Accuracy. Accuracy is the proportion of correct decisions of the classifier:

Accuracy = P / N,

where P is the number of states correctly identified by the system and N is the total number of states in the test sample.

However, for the problem being solved this criterion is not sufficient, since it assigns the same weight to all final states (classes), which may be incorrect in the considered case of a non-uniform distribution of time moments over the final states. Thus, for a more correct comparison, and taking into account the fact that in the example under consideration the share of normal states is much greater than the share of critical states, the classifiers should also be assessed by the criteria Precision and Recall (completeness).

These criteria are calculated separately for each final class. Precision is the proportion of situations that really belong to a class among all situations that the system has assigned to that class. Recall is the proportion of situations of a class found by the classifier among all situations of that class in the test sample. These criteria are conveniently presented through the contingency table, also built separately for each final class [3]:

                          Class in the sample (1/0)
                          Positive        Negative
Result      Positive      TP              FP
            Negative      FN              TN

These values are used directly in the calculation of the criteria of the classifier:

Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}.

As can be seen from the way they are calculated, these criteria give a more complete picture of the quality of the classifier's work. It would be logical to say that the higher the precision and the recall, the better, but in reality maximum precision and maximum recall are not achievable at the same time, and a balance between these characteristics has to be found. For this the F-measure is used:

F = (\beta^2 + 1) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall},

where 0 < β < 1 if precision is given greater weight when selecting the classifier, and β > 1 if recall is preferred. In the case β = 1 the balanced F-measure is obtained:

F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}.

In addition, for all classifiers we estimate the duration of the work as wall time, counted from the moment the model is created and training begins until the moment the quality metrics are produced from the comparison of the predicted results with the test sample.

All experiments are carried out on a computer with the following characteristics: Intel Core i9-9980XE 3.00 GHz, 128 GB RAM, 4×NVIDIA RTX 2080 Ti 11 GB, 1 TB PCIe SSD.
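Purely for illustration, the experiments in the following sections can be reproduced with a small helper of the following kind, built on the X_train/X_test split sketched in section 2; the evaluate function and the use of classification_report are assumptions of this sketch, not the authors' code.

import time
from sklearn.metrics import accuracy_score, classification_report

def evaluate(model, X_train, y_train, X_test, y_test):
    # Wall time is counted from the start of training to the moment the
    # quality metrics are produced on the test sample.
    start = time.time()
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predicted))
    # Precision, Recall and F1-measure, reported separately for classes 0 and 1.
    print(classification_report(y_test, predicted, digits=2))
    print('Wall time: %.2f s' % (time.time() - start))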
4. K-nearest neighbors (kNN) algorithm

The algorithm works as follows. Let a training sample of pairs "object (state characteristics) - response (state class)" be given:

X^m = \{(x_1, y_1), \dots, (x_m, y_m)\},

and let a distance function \rho(x, x') be defined on the set of objects. This function should be a reasonably adequate model of object similarity: the larger its value, the less similar the two objects x and x' are [4]. For an arbitrary object μ we arrange the objects of the training sample in order of increasing distance to μ:

\rho(\mu, x_{1;\mu}) \le \rho(\mu, x_{2;\mu}) \le \dots \le \rho(\mu, x_{m;\mu}),

where x_{i;\mu} denotes the object of the training sample that is the i-th neighbor of the object μ. We introduce a similar notation y_{i;\mu} for the response of the i-th neighbor. Thus, an arbitrary object μ generates its own renumbering of the sample. In its most general form the nearest neighbors algorithm is:

a(\mu) = \arg\max_{y \in Y} \sum_{i=1}^{m} [y_{i;\mu} = y] \cdot w(i, \mu),

where w(i, μ) is a given weight function that estimates the importance of the i-th neighbor for classifying the object μ; this function must be non-negative and non-increasing in i. For the k nearest neighbors method:

w(i, \mu) = [i \le k].

The qualitative characteristics of the KNeighborsClassifier algorithm with the parameter n_neighbors = 10 are as follows:

Accuracy: 0.949
Class    Precision    Recall    F1 score
1        0.63         0.97      0.77
0        1.0          0.95      0.97
Wall time: 1 min 15 s

It is clearly seen that the Accuracy estimate alone is not enough for a comparative analysis: with a value of 0.949, i.e. almost 95% accuracy, the precision of determining the final state 1 (inoperability) is only 63%. As a result of this run, the F-measure for each class and the total duration of the classifier's work are recorded.
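A sketch of this experiment, reusing the illustrative evaluate helper from section 3; only the n_neighbors value is taken from the article.

from sklearn.neighbors import KNeighborsClassifier

# k = 10 nearest neighbors, as in the run reported above.
knn = KNeighborsClassifier(n_neighbors=10)
evaluate(knn, X_train, y_train, X_test, y_test)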
5. Logistic regression

To predict the probability of occurrence of a certain event from the values of a set of features, a dependent variable Y taking values 0 or 1 is introduced, together with a set of independent variables x_1, ..., x_n, from whose values the probability of the dependent variable taking a particular value has to be calculated. [5]

Let the objects be described by numerical features f_j: X \to R, j = 1, \dots, n, so that the space of feature descriptions is X = R^n. Let Y be the set of class labels and let a training set of object-response pairs X^m = \{(x_1, y_1), \dots, (x_m, y_m)\} be given. Consider the case of two classes, Y = \{-1, +1\}. Logistic regression builds a linear classification algorithm a: X \to Y of the form

a(x, w) = \mathrm{sign}\left( \sum_{j=1}^{n} w_j f_j(x) - w_0 \right) = \mathrm{sign}\langle x, w \rangle,

where w_j is the weight of the j-th feature, w_0 is the decision threshold, w = (w_0, \dots, w_n) is the vector of weights and \langle x, w \rangle is the dot product of the feature description of the object with the vector of weights. It is assumed that a zero feature is artificially introduced: f_0(x) = -1.

The task of training the linear classifier is to adjust the weight vector w from the sample X^m. In logistic regression this is done by minimizing the empirical risk with the loss function

Q(w) = \sum_{i=1}^{m} \ln(1 + \exp(-y_i \langle x_i, w \rangle)) \to \min_w.

After the solution w is found, it becomes possible not only to classify an arbitrary object x, but also to estimate the posterior probabilities of its belonging to the classes:

P\{y \mid x\} = \sigma(y \langle x, w \rangle), \quad y \in Y,

where \sigma(z) = \frac{1}{1 + e^{-z}} is the sigmoid function.

The qualitative characteristics of this model with the parameters LogisticRegression(multi_class='ovr', solver='lbfgs') are:

Accuracy: 0.971
Class    Precision    Recall    F1 score
1        0.76         0.98      0.85
0        1.0          0.97      0.98
Wall time: 1 min 10 s

The results shown by this classifier are better than those of the previous method at a comparable duration of operation; however, the quality of identifying faulty situations (class 1) still does not guarantee satisfactory operation in industrial mode.
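A sketch of this experiment, reusing the illustrative evaluate helper from section 3; the two parameters are those reported in the article.

from sklearn.linear_model import LogisticRegression

# Parameters as reported in the article; everything else is left at the
# scikit-learn defaults (multi_class is accepted by the scikit-learn
# versions contemporary with the article, recent versions deprecate it).
logreg = LogisticRegression(multi_class='ovr', solver='lbfgs')
evaluate(logreg, X_train, y_train, X_test, y_test)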
6. Naive Bayesian classifier

The Bayesian classification is based on the maximum posterior probability hypothesis: an object d is considered to belong to the class c_j (c_j \in C) for which the posterior probability \max P(c_j \mid d) is achieved. [6]

Bayes' formula:

P(c_j \mid d) = \frac{P(c_j) \cdot P(d \mid c_j)}{P(d)} \propto P(c_j) \cdot P(d \mid c_j),

where P(d \mid c_j) is the probability of encountering an object d among objects of class c_j, and P(c_j), P(d) are the prior probabilities of the class c_j and of the object d.

Under the "naive" assumption that all features describing the classified objects are equally important and independent of each other, P(d \mid c_j) can be calculated as the product of the probabilities of encountering each feature x_i (x_i \in X) among objects of class c_j:

P(d \mid c_j) = \prod_{i=1}^{|x|} P(x_i \mid c_j),

where P(x_i \mid c_j) is the probabilistic estimate of the contribution of the feature x_i to the fact that d \in c_j. [7]

In practice, multiplying very small conditional probabilities can lead to a loss of significant digits, therefore the logarithms of the probabilities P(x_i \mid c_j) should be used instead of the probability estimates themselves. The logarithm is a monotonically increasing function, so the class c_j with the largest value of the logarithm remains the most probable. The decision rule of the naive Bayesian classifier then takes the form:

C^* = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_{i=1}^{|x|} \log P(x_i \mid c_j) \right].

The resulting values of the MultinomialNB classifier from the sklearn library turned out to be the following:

Accuracy: 0.874
Class    Precision    Recall    F1 score
1        0.40         0.96      0.57
0        1.0          0.87      0.93
Wall time: 289 ms

This classifier, as its description suggests, works in a fundamentally different way and is much faster; however, the qualitative criteria, above all the F-measure, are inferior to the previous classifiers.
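A sketch of this experiment, reusing the illustrative evaluate helper from section 3; the article reports only the classifier name, so all parameters are left at the defaults.

from sklearn.naive_bayes import MultinomialNB

# MultinomialNB requires non-negative features; the [0, 1] scaling produced
# by normalize_df() in section 2 satisfies this (a StandardScaler-transformed
# frame would not).
nb = MultinomialNB()
evaluate(nb, X_train, y_train, X_test, y_test)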
"greedy" growth, each time adding the summand, The described procedure is applied to subsets Si and which is the most optimal algorithm, to the sum. further, until the values of the criterion cease to At the step when the optimal classifier F(m-1) of length increase significantly with new partitions or a different m - 1 has already been assembled, the task is reduced stopping condition is met. In this case, when in the to finding a pair of the most optimal parameters process of building a tree, an "empty" node is {am,bm} for the classifier of length m: obtained, where not a single example will fall, then it must be converted into a leaf that is associated with 𝐹𝑀 (𝑥) = 𝐹𝑚−1 (𝑥) + 𝑏𝑚 ℎ(𝑥, 𝑎𝑚 ), 𝑏𝑚 ∈ 𝑅, 𝑎𝑚 ∈ 𝐴 the class most often found in the immediate ancestor of this node. Optimality is understood here in accordance with the The DecisionTreeRegressor classifier with parameters principles of explicit maximization of margins - this Random_state = 15 and Min_samples_leaf = 25 means that a certain loss function L(yi,Fm (xi)) → min showed the following characteristics: is introduced, showing how much the predicted answer Fm (xi) differs from the correct answer yi. Next, you Accuracy 0.964 need to minimize the functionality of this error: Precision Recall F1 score 1 0.71 0.97 0.85 𝑁∑ 0 1.0 0.97 0.98 𝑄 = ∑ 𝐿(𝑦𝑖 , 𝐹𝑚 (𝑥𝑖 )) → 𝑚𝑖𝑛 Wall time: 975ms 𝑖=1 When working with the decision tree method, the results are similar to the logistic regression method, It should be noted that the error functional Q is a real however, the duration of training and forecasting in the function depending on the points {𝐹𝑚 (𝑥𝑖 )}𝑁 𝑖=1 in the N- decision tree method is much longer, which, with dimensional space, and this function is minimized by equal qualitative characteristics, puts the results of this the gradient descent method. As the point for which method higher than others. the optimal increment should be found, we define 𝐹𝑚−1 and the error gradient is expressed as follows: 8. Gradient boosting method 𝑁 𝜕𝑄 Gradient boosting is a machine learning method that 𝛻𝑄 = [ (𝑥 )] creates a decisive forecasting model in the form of an 𝜕𝐹𝑚−1 𝑖 𝑖=1 𝑁 ensemble of weak forecasting models, usually decision 𝜕(∑𝑁 𝑖=1 𝐿(𝑦𝑖 , 𝐹𝑚−1 )) trees, essentially developing the decision tree method. =[ (𝑥𝑖 )] = 𝜕𝐹𝑚−1 𝑖=1 During boosting, the model is built in stages - an 𝑁 arbitrary differentiable loss function is also optimized 𝜕𝐿(𝑦𝑖 , 𝐹𝑚−1 ) =[ (𝑥𝑖 )] in stages. [9] 𝜕𝐹𝑚−1 𝑖=1 For the problem of object recognition from a multidimensional space X with a label space Y, a By virtue of the gradient descent method, it is most training sample {𝑥𝑖 }𝑁 𝑖=1 is given, where 𝑥𝑖 ∈ 𝑋. In beneficial to add a new term as follows: addition, the true values of the class labels for each 𝐹𝑚 = 𝐹𝑚−1 − 𝑏𝑚 𝛻𝑄, 𝑏𝑚 ∈ 𝑅, object {𝑦𝑖 }𝑁 𝑖=1 are known, where yi∈Y. The solution to the prediction problem is reduced in this case to the where bm is selected by linear search over real search for a recognizing operator who can predict the numbers R: labels as accurately as possible for each new object 𝑁 from the set X. Let a family of basic algorithms H be given, each 𝑏𝑚 = 𝑎𝑟𝑔𝑚𝑖𝑛 ∑ 𝐿(𝐹𝑚−1 (𝑥𝑖 ) − 𝑏𝛻𝑄𝑖 ) element h(x,a)∈H:X→R of which defined by some 𝑖=1 vector of parameters a∈A. 
8. Gradient boosting method

Gradient boosting is a machine learning method that builds a predictive model in the form of an ensemble of weak prediction models, usually decision trees, and thus essentially develops the decision tree method. During boosting the model is built in stages, and an arbitrary differentiable loss function is optimized in stages as well. [9]

For the problem of recognizing objects from a multidimensional space X with a label space Y, a training sample \{x_i\}_{i=1}^{N} is given, where x_i \in X. In addition, the true class labels \{y_i\}_{i=1}^{N} of the objects are known, where y_i \in Y. The prediction problem is then reduced to the search for a recognizing operator that predicts the labels of each new object from X as accurately as possible.

Let a family of basic algorithms H be given, each element h(x, a) \in H: X \to R of which is defined by some vector of parameters a \in A. The final classification algorithm has to be found in the form of the composition

F_M(x) = \sum_{m=1}^{M} b_m h(x, a_m), \quad b_m \in R, \ a_m \in A.

However, selecting the optimal set of parameters \{a_m, b_m\}_{m=1}^{M} directly is an extremely time-consuming task, therefore the composition is built by "greedy" growth, each time adding to the sum the summand that is optimal at that step. At the step when the optimal classifier F_{m-1} of length m - 1 has already been assembled, the task is reduced to finding the pair of optimal parameters \{a_m, b_m\} for the classifier of length m:

F_m(x) = F_{m-1}(x) + b_m h(x, a_m), \quad b_m \in R, \ a_m \in A.

Optimality is understood here in accordance with the principle of explicit margin maximization: a loss function L(y_i, F_m(x_i)) \to \min is introduced, showing how much the predicted answer F_m(x_i) differs from the correct answer y_i. The error functional to be minimized is

Q = \sum_{i=1}^{N} L(y_i, F_m(x_i)) \to \min.

It should be noted that the error functional Q is a real function of the points \{F_m(x_i)\}_{i=1}^{N} of an N-dimensional space, and this function is minimized by gradient descent. Taking F_{m-1} as the point for which the optimal increment is sought, the error gradient is

\nabla Q = \left[ \frac{\partial Q}{\partial F_{m-1}(x_i)} \right]_{i=1}^{N} = \left[ \frac{\partial \left( \sum_{i=1}^{N} L(y_i, F_{m-1}) \right)}{\partial F_{m-1}(x_i)} \right]_{i=1}^{N} = \left[ \frac{\partial L(y_i, F_{m-1})}{\partial F_{m-1}(x_i)} \right]_{i=1}^{N}.

By virtue of the gradient descent method, it is most beneficial to add the new term as

F_m = F_{m-1} - b_m \nabla Q, \quad b_m \in R,

where b_m is selected by a line search over the real numbers:

b_m = \arg\min_b \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) - b \nabla Q_i).

However, \nabla Q is only a vector of optimal values for the individual objects x_i, and not a basic algorithm from the family H defined for all x \in X, so it is necessary to find the h(x, a_m) \in H that is most similar to \nabla Q. To do this, the error functional is minimized once more using the principle of explicit margin minimization:

a_m = \arg\min_a \sum_{i=1}^{N} L(\nabla Q_i, h(x_i, a)) \equiv train(\{x_i\}_{i=1}^{N}, \{\nabla Q_i\}_{i=1}^{N}),

which in turn corresponds to the basic learning algorithm. Next, the coefficient b_m is found by a line search:

b_m = \arg\min_b \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) - b \cdot h(x_i, a_m)).

The GradientBoostingClassifier implemented according to this algorithm, with the parameters learning_rate = 0.3, n_estimators = 10, verbose = 1, subsample = 0.5, showed the following results:

Accuracy: 0.993
Class    Precision    Recall    F1 score
1        0.94         0.99      0.97
0        1.0          0.99      1.0
Wall time: 3.95 s

The gradient boosting method, like the decision tree method, enumerates classification parameters, which makes the two methods comparable in terms of training duration. However, the additional time spent in boosting on building the ensemble of decision trees has a colossal effect: the accuracy with respect to the final state classes is the highest for this method.
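A sketch of this experiment with the parameters reported in the article, reusing the illustrative evaluate helper from section 3.

from sklearn.ensemble import GradientBoostingClassifier

# Parameters as reported in the article; everything else is left at the defaults.
gb = GradientBoostingClassifier(learning_rate=0.3, n_estimators=10,
                                subsample=0.5, verbose=1)
evaluate(gb, X_train, y_train, X_test, y_test)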
9. Neural network

Taking into account the analysis of the data sources of the considered IT infrastructure monitoring system and the nature of these data, the MLP (Multi-Layer Perceptron) type of neural network was chosen: since there are no video surveillance systems among the sources, and consequently no video recognition problem, a classical feed-forward neural network is sufficient to determine the effectiveness of applying neural networks. [10]

An MLP network is any multilayer neural network in which every hidden neuron applies, to a linear combination of its inputs, a common monotone bounded activation function g_v that depends only on the quantity t = s(v) - w_v and has the form of a "smoothed step": as a rule, the hyperbolic tangent

\tanh(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}

or the logistic function

\sigma(t) = \frac{\tanh(t/2) + 1}{2} = \frac{1}{1 + e^{-t}}.

The activation function of the output neurons can also be such a "smoothed step", or it can be the identity g_v(t) = t; that is, each neuron v calculates the function

u(v) = g_v\left( \left( \sum_e w_e u(e) \right) - w_v \right).

The parameters of the edges w_e are called weights, and the parameters of the vertices w_v are called biases. Which activation function is chosen, the hyperbolic tangent or the logistic one, does not matter: for any multilayer perceptron with the activation function tanh calculating the function F_tanh(w, x), the same perceptron in which the activation function of the intermediate layers is replaced by the logistic function σ calculates the same function for some other value of the parameter w':

F_{\tanh}(w, x) = F_{\sigma}(w', x).

In accordance with the ideology of minimizing the empirical risk with regularization, training the perceptron that calculates the function F(w, x) is the search for the vector of weights and biases that minimizes the regularized total error

E_T(w) = \varphi(w) + \sum_{i=1}^{N} E(F(w, x_i), y_i)

on some training set T = ((x_1, y_1), \dots, (x_N, y_N)). [11]

Training is most often carried out by the classical gradient descent method; for it to be applicable, the activation functions of all neurons and the error and regularization functions must be differentiable. Practice shows that the speed of this algorithm is often inferior to others because of the huge dimension of the parameter w and the absence of explicit formulas for the derivatives of the function F with respect to w. [11]

The results of applying MLPClassifier (max_iter = 100, random_state = 10) are as follows:

Accuracy: 0.992
Class    Precision    Recall    F1 score
1        0.93         0.99      0.95
0        1.0          0.99      1.0
Wall time: 1 min 5 s

The neural network, as is often expected, has shown high results of the qualitative assessment, essentially equal to the results of gradient boosting. However, the duration of training and forecasting for the neural network is much longer than for the methods based on building ensembles of decision trees: about a minute versus several seconds.
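A sketch of this experiment with the parameters reported in the article, reusing the illustrative evaluate helper from section 3.

from sklearn.neural_network import MLPClassifier

# max_iter and random_state as reported in the article; the hidden layer
# architecture is left at the scikit-learn default.
mlp = MLPClassifier(max_iter=100, random_state=10)
evaluate(mlp, X_train, y_train, X_test, y_test)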
10. Conclusion

The results of applying the various machine learning methods clearly confirm the postulate of the "No Free Lunch" theorem: it is experimental tests that allow the most appropriate algorithm to be chosen for a specific problem with its specific initial data. It should be noted once again that the Accuracy characteristic is practically useless for comparing the results; it is more correct to evaluate the results by the F-measure, and to do so separately for each class.

Based on the applied meaning of the task, namely to provide better monitoring of the IT infrastructure operation, the characteristics of the methods for the data of class "1" are the more important ones, that is, for the cases of real failures and infrastructure faults. Errors for class "0", in effect, produce additional incidents and therefore additional work for technical support specialists, which is certainly significant, but less important than missing real failures. The time parameters of the methods are also worth noting: the spread is truly colossal, from 289 milliseconds to 1 minute 15 seconds.

Comparing the criteria, it is clearly seen that the gradient boosting method showed the best results: with a higher speed, this algorithm was able to learn better than the other algorithms. When the application is replicated on much larger data sets (all IT services, a longer analysis horizon), training time becomes extremely important. This, together with the nature of the initial data, namely the absence of video and photo images, allows us to conclude that the gradient boosting method is more than sufficient for solving the problem, and that using a neural network, which showed similar results with a longer training duration, is not required at this stage of development of the considered IT infrastructure monitoring system and its method of collecting information.

References

[1] M.A. Bolshakov, Preparation of a monitoring system for IT infrastructure for models of critical states based on neural networks, Science-intensive Technologies in Space Research of the Earth, 2019, No. 4, pp. 65-71.
[2] D.H. Wolpert, W.G. Macready, No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation, 1997, No. 1, pp. 67-82.
[3] G. Upton, Analysis of Contingency Tables, translated from English with a preface by Yu. P. Adler, Finance and Statistics, Moscow, 1982, 143 p.
[4] M. Parsian, Data Algorithms, O'Reilly Media, Inc., Newton, 2015, 778 p.
[5] D.W. Hosmer, S. Lemeshow, Applied Logistic Regression, 2nd ed., John Wiley & Sons, Inc., New York, 2000, 397 p.
[6] S. Ranganathan, K. Nakai, C. Schonbach, Bayes' Theorem and Naive Bayes Classifier, Encyclopedia of Bioinformatics and Computational Biology, Volume 1, Elsevier, 2017, pp. 403-412.
[7] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.
[8] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2008, 764 p.
[9] J.H. Friedman, Stochastic Gradient Boosting, Technical report, Dept. of Statistics, Stanford University, 1999.
[10] T. Windeatt, Ensemble MLP Classifier Design, in: L.C. Jain, M. Sato-Ilic, M. Virvou, G.A. Tsihrintzis, V.E. Balas, C. Abeynayake (eds.), Computational Intelligence Paradigms, Studies in Computational Intelligence, vol. 137, Springer, Berlin, Heidelberg.
[11] V. Vapnik, Principles of Risk Minimization for Learning Theory, Advances in Neural Information Processing Systems 4 (NIPS 1991), 1992, No. 4, pp. 831-838.
[12] A.V. Gasnikov, Modern Numerical Optimization Methods. The Universal Gradient Descent Method: A Tutorial, 2nd ed., MFTI, Moscow, 2018, 286 p.