<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparative analysis of machine learning methods to assess the quality of IT services</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maksim A. Bolshakov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor A. Molodkin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergei V. Pugachev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Saint Petersburg Railway Transport University of Emperor Alexander I</institution>
          ,
          <addr-line>9 Moskovsky Ave., Saint-Petersburg, 190031</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>142</fpage>
      <lpage>149</lpage>
      <abstract>
<p>The article considers the issue of choosing a machine learning method for solving the applied problem of assessing the current state of the quality of IT services. As the method of choice, a comparative analysis of the generally accepted machine learning methods was carried out using a set of criteria that makes it possible to evaluate their effectiveness and efficiency. The main criteria are the F-measure, as a generalization of the completeness (recall) and accuracy (precision) of classification, computed for each class of states separately, and the duration of the training and prediction procedures. All operations were carried out on the same dataset, namely, the data of the centralized monitoring and management system for the IT infrastructure of Russian Railways in terms of the ETRAN IT service. Due to its heterogeneity and taking into account the practice of applying the Barlow hypothesis, the initial data went through preliminary processing, the algorithm of which is also described in detail in the article.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>Barlow hypothesis</kwd>
        <kwd>F-measure</kwd>
        <kwd>dataset</kwd>
        <kwd>data heterogeneity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The requirement that software developers and device
manufacturers provide mandatory monitoring of
the performance of their products is an established
norm. As a result, most of the IT services provided by
operating organizations can be assessed not only by
the final failure state, but also by the current local
characteristics of their work.</p>
      <p>Consider the currently operating centralized system for
monitoring and managing the IT infrastructure of
Russian Railways, which is operated by the Main
Computer Center. The huge array of accumulated and
constantly updated data of the specified monitoring
system allows us to assume the success of using
machine learning methods to solve the problem of
assessing the quality of IT services by determining the
current state of the specified infrastructure and the
service applications implemented on it.</p>
      <p>
        It is possible to assess the state of the IT infrastructure
at a certain point in time by predicting its final
performance based on the current data of the
monitoring system. To do this, it is necessary to solve
the problem of classifying the final state of the IT
infrastructure, that is, to determine the class label (the
type of the final state of the incident / regular
operation) based on the current values of the
monitoring system metrics. At the same time, for the
choice of the implementation method, the number of
available classes is not so important - the binary
classification is a special case of the multiclass
classification, and is not a decisive characteristic when
choosing a training method. Typically, the nature of
the input data, namely its types and formats, has a
significant impact on the choice of machine learning
methods. At the same time, data preprocessing
algorithms do not depend on the chosen training
method itself and must ensure the correct use of the
initial data as training and test samples for all further
methods of solving the problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Primary data processing</title>
      <p>Primary data is understood as the whole set of the
characteristics of the IT infrastructure involved in the
operability of the specified system, taken by the
current monitoring and management system, and the
set of service characteristics that make it possible to
define this automated system as an IT service. The
quality of the obtained primary data, namely its
heterogeneity, implies additional processing regardless
of the choice of a specific machine learning method.
For the correct formation of the training sample, the
most suitable tools are Python 3.7.4 and the imported
Pandas and NumPy libraries, which must be installed
on the script execution server as site-packages.
Initially, the data received from the monitoring system
is presented as a csv file with the separator | and the
following columns, in this order:
mon_obj — the monitoring object name;
metric_name — the metric name;
metric_value — the metric value at the time of data collection;
value_time — the date/time when the data was collected;
isIncident — the critical state indicator (at the first stage a
binary classification is used: 1 - critical state, 0 - normal operation).</p>
      <p>After that, using the variable small_columns_list =
['mon_obj', 'metric_name', 'metric_value', 'value_time', 'isIncident']
and the pandas library, a new DataFrame suitable for further use is formed:
d = pd.read_csv('data/data.csv', sep='|', encoding='utf8',
    skipinitialspace=True, skiprows=1,
    names=small_columns_list, low_memory=False)
Next, a new column composite_metric_name is defined in DataFrame d,
whose value is a composite name built from the monitoring object and
the metric name for each current row:
d['composite_metric_name'] = d['mon_obj'] + '_' + d['metric_name']
Then a new variable column_names is declared, consisting of the
unique values of the newly generated column 'composite_metric_name':
column_names = d['composite_metric_name'].unique()
These steps are necessary to unambiguously match a metric
with its values at any given time.
Further, after transposing (pivoting) the DataFrame into a more
convenient form and removing unnecessary columns, static data can be
removed as a way to reduce the dimensionality and optimize the use of
computing resources without losing data quality:
def delete_cols_w_static_value():
    column_names_with_static_value = []
    for col_name in column_names:
        if merged_df[col_name].nunique() == 1:
            column_names_with_static_value.append(col_name)
    if 'isIncident' in column_names_with_static_value:
        column_names_with_static_value.remove('isIncident')
    merged_df.drop(column_names_with_static_value, axis=1, inplace=True)
After reducing the dimensionality in this way, it is necessary to
recode the text values, that is, to replace the existing text values of
the metrics with numeric ones by introducing new arguments of the
objective function, thus obtaining correct and comparable data for
analysis.
def find_strings_column():
    ret_col_names = []
    for col in merged_df.columns:
        if merged_df[col].dtypes == 'object':
            ret_col_names.append(col)
    return ret_col_names
The variable strings_column stores the list of names of the columns
that contain string values. The next step is to form a dictionary of
string values; the values in these columns are then replaced by the
indexes of the elements from the dictionary.
def make_dict_with_strings_column():
    ret = {}
    for sc in strings_column:
        ret[sc] = list(merged_df[sc].unique())
    return ret
Using the save function of the NumPy library, this dictionary is
saved to a file:
np.save('data/dict_of_string_values.npy', dict_of_string_values)
And further, using the function
def modify_value_string_column():
    for c in dict_of_string_values.keys():
        merged_df[c] = merged_df[c].map(lambda x: dict_of_string_values[c].index(x))
all text values are changed to numeric values.</p>
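      <p>The transposition (pivot) step itself is not shown in the article; the following is only a possible sketch of it, assuming the long-format frame d with the columns described above, that one row per value_time is desired, and that the result is named merged_df (the aggregation choice is also an assumption):
# Pivot the long-format data so that each composite metric becomes a column
# and each row corresponds to one collection timestamp.
merged_df = d.pivot_table(index='value_time',
                          columns='composite_metric_name',
                          values='metric_value',
                          aggfunc='first')
# Carry the critical-state label over to the transposed frame.
merged_df['isIncident'] = d.groupby('value_time')['isIncident'].max()
merged_df = merged_df.reset_index(drop=True)</p>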
      <p>At the end of the primary data processing, the data
frame values are normalized: either standardized so that the variance
is unitary and the mean of the series is 0, using the built-in
StandardScaler tool, or scaled to the [0, 1] range through a
self-written function:
def normalize_df():
    cols = list(merged_df.columns)
    for cl in cols:
        x_min = min(list(merged_df[cl]))
        x_max = max(list(merged_df[cl]))
        merged_df[cl] = merged_df[cl].map(lambda x: (x - x_min) / (x_max - x_min))
For convenience, the final result should be saved to a file:
merged_df.to_csv(path_or_buf='data/normalised_df.csv', sep='|', index=False)
As a result of the actions performed, a dataset was obtained that is
suitable for applying various machine learning models, with the
following characteristics:</p>
      <p>• The number of records in the training set is 10815</p>
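      <p>Before applying the classifiers considered below, the normalized dataset must be divided into training and test samples; the article does not show this step, so the following is only a sketch using scikit-learn (the 80/20 ratio and the stratify option are assumptions):
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the normalized dataset produced by the preprocessing step.
df = pd.read_csv('data/normalised_df.csv', sep='|')
X = df.drop(columns=['isIncident'])   # metric values are the features
y = df['isIncident']                  # 1 - critical state, 0 - normal operation

# Hold out a test sample, preserving the class proportions (assumed choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y)</p>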
    </sec>
    <sec id="sec-3">
      <title>3. Criteria for comparing results</title>
      <p>When solving any supervised learning problem, there
is no single most suitable machine learning algorithm: this is the
essence of the "No Free Lunch" theorem, according to which it is
impossible to unambiguously name the best machine learning
method for a specific task in advance.</p>
      <p>The
applicability of this theorem extends, among other
things, to the well-studied area of problems of binary
classification, therefore, it is imperative to consider a
set of methods and, based on the results of practical
tests, evaluate the effectiveness and applicability of
them.</p>
      <p>
        [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
        The effectiveness and efficiency of each method will primarily be
assessed by the most common and user-understandable criterion,
Accuracy, a measure that indicates the proportion of
correct decisions of the classifier:
      </p>
      <p>$\text{Accuracy} = \frac{P}{N}$,
where P is the number of states correctly identified by
the system, and N is the total number of states in the
test sample.
However, for the problem being solved, this criterion
is not enough, since in its calculation it assigns the
same weight to all final states (classes), which may be
incorrect in the considered case of a non-uniform
distribution of time moments over the final states.
Thus, for a more correct comparison, and taking into
consideration that the share of normal states is much
greater than the share of critical states, the assessment
of the classifiers should also be based on the following
criteria: Precision (accuracy, in a calculation other than
Accuracy) and Recall (completeness).</p>
      <p>
        These criteria are calculated separately for each
final class, where Precision is the proportion of
situations that really belong to this class relative to all
situations that the system has assigned to this class.
System completeness (Recall) is the proportion of
situations found by the classifier that belong to a class
relative to all situations of this class in the test sample.
More clearly, these criteria can be presented through
the contingency table, built
separately for each final class [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
      </p>
      <p>For each class, the contingency table cross-tabulates the
classifier result (positive / negative) with the actual label in the
training sample (positive / negative), giving the counts of true
positives (TP), false positives (FP), false negatives (FN) and true
negatives (TN). These results are used directly in the calculation of
the criteria for the classifier as follows:
$\text{Precision} = \frac{TP}{TP + FP}$,
$\text{Recall} = \frac{TP}{TP + FN}$.
As can be seen from the calculation algorithm, these
criteria provide a more complete understanding of the
quality of the classifier's work. It would be logical to
say that the higher the precision and recall, the better,
but in reality maximum precision and recall are not
achievable at the same time, and it is necessary to find
a balance between these characteristics. To do this, the
F-measure is used, which is the following calculated value:
$F_\beta = (\beta^2 + 1) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$,
where 0 &lt; β &lt; 1 if precision has the greater weight for the
selection of the classifier, and β &gt; 1 if recall does.</p>
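      <p>As an illustration, the per-class criteria described above could be obtained from a fitted classifier with scikit-learn; this is only a sketch, assuming the X_test, y_test split from the preprocessing step and an already trained model clf (a hypothetical name):
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_pred = clf.predict(X_test)

# Overall share of correct decisions.
print('Accuracy:', accuracy_score(y_test, y_pred))

# Precision, Recall and F-measure computed separately for classes 1 and 0
# (beta=1.0 weights precision and recall equally).
precision, recall, f_measure, _ = precision_recall_fscore_support(
    y_test, y_pred, labels=[1, 0], beta=1.0)
for label, p, r, f in zip([1, 0], precision, recall, f_measure):
    print(f'class {label}: precision={p:.2f} recall={r:.2f} F-measure={f:.2f}')</p>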
    </sec>
    <sec id="sec-4">
      <title>4. K-nearest neighbors (knn) algorithm</title>
      <p>The specified algorithm works as follows. Let a
training sample of pairs "object (state characteristic) -
response (state class)" be given:</p>
      <p>
        $X^m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$,
and let a distance function $\rho(x, x')$ be given on the set
of objects. This function should be a reasonably
adequate model of object similarity: the larger the
value of this function, the less similar the two objects $x$, $x'$
are [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For an arbitrary object $\mu$, we arrange the
objects of the training sample in order of
increasing distance to $\mu$:
      </p>
      <p>$\rho(\mu, x_{1;\mu}) \le \rho(\mu, x_{2;\mu}) \le \ldots \le \rho(\mu, x_{m;\mu})$,
where $x_{i;\mu}$ denotes the object of the training
sample that is the i-th neighbor of the object $\mu$. We
introduce a similar notation for the answer of the i-th
neighbor, $y_{i;\mu}$. Thus, an arbitrary object $\mu$ generates
a new renumbering of the sample. In its most general
form, the nearest neighbors algorithm is
$a(\mu) = \arg\max_{y \in Y} \sum_{i=1}^{m} [y_{i;\mu} = y] \cdot w(i, \mu)$,
where $w(i, \mu)$ is a given weight function that estimates
the degree of importance of the i-th neighbor for
classifying the object $\mu$; this function must be non-negative
and non-increasing in i.
For the k nearest neighbors method:
$w(i; \mu) = [i \le k]$.
For the problem under consideration, the quality of
determining the final state 1 (inoperability) through the
Precision characteristic is only 63%. As a result of the
work, one should record the F-measure value for each
class and the total duration of the classifier's work.</p>
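      <p>A possible scikit-learn implementation of this classifier for the prepared dataset might look as follows (k = 5 and uniform weights are the library defaults, not values stated in the article; the data split names are carried over from the sketch above):
from time import time
from sklearn.neighbors import KNeighborsClassifier

# k nearest neighbors with uniform weights: each of the k neighbors counts equally.
knn = KNeighborsClassifier(n_neighbors=5)

start = time()
knn.fit(X_train, y_train)        # memorize the training sample
y_pred = knn.predict(X_test)     # classify by majority vote of the k neighbors
print('Wall time:', time() - start, 's')</p>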
    </sec>
    <sec id="sec-5">
      <title>5. Logistic regression</title>
      <p>
        To predict the probability of occurrence of a certain
event from the values of a set of features, a dependent
variable Y is introduced, which takes values 0 or 1 and
a set of independent variables x1, ... xn, based on the
values of which it is required to calculate the
probability of accepting a particular value of the
dependent variable. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
Let the objects be specified by numerical features
$f_j: X \to \mathbb{R}$, $j = 1 \ldots n$,
so that the space of feature descriptions is $X = \mathbb{R}^n$.
Let Y be the set of class labels, and let a training
set of object-response pairs be given:
      </p>
      <p>$X^m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$.</p>
      <p>Consider the case of two classes: $Y = \{-1, +1\}$. In
logistic regression, a linear classification algorithm
$a: X \to Y$ of the following form is built:
$a(x, w) = \mathrm{sign}\left(\sum_{j=1}^{n} w_j f_j(x) - w_0\right) = \mathrm{sign}\langle x, w \rangle$,
where $w_0$ is the threshold, $w = (w_0, \ldots, w_n)$ is the weight
vector, and $\langle x, w \rangle$ is the dot
product of the feature description of an object with the
vector of weights. It is assumed that the zero feature is
artificially introduced: $f_0(x) = -1$.</p>
      <p>The task of training a linear classifier is to adjust the
weight vector w based on the sample $X^m$. In logistic
regression, for this, the problem of minimizing the
empirical risk is solved with a loss function of the
following form:
$Q(w) = \sum_{i=1}^{m} \ln\left(1 + \exp(-y_i \langle x_i, w \rangle)\right) \to \min_{w}$.
After the solution w is found, it becomes possible not
only to perform classification for an arbitrary object x,
but also to estimate the posterior probabilities of its
belonging to the existing classes:</p>
      <p>$P\{y \mid x\} = \sigma(y \langle x, w \rangle)$, $y \in Y$,
where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
The qualitative characteristics of applying the
specified model with the parameters
LogisticRegression(multi_class='ovr', solver='lbfgs') are as follows:</p>
      <sec id="sec-5-1">
        <title>Accuracy 0.971</title>
        <p>1
0</p>
      </sec>
      <sec id="sec-5-2">
        <title>Wall time: 1 min 10 s</title>
        <sec id="sec-5-2-1">
          <title>Precision 0.76 1.0</title>
        </sec>
        <sec id="sec-5-2-2">
          <title>Recall</title>
          <p>0.98
0.97
operation, however, the work on identifying faulty
situations (class 1) does not yet guarantee satisfactory
operation in industrial mode.</p>
        </sec>
      </sec>
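      <p>A sketch of how the above configuration could be trained and evaluated (the data split names are assumptions carried over from the preprocessing step):
from sklearn.linear_model import LogisticRegression

# Parameters as stated in the article: one-vs-rest scheme and the lbfgs solver.
log_reg = LogisticRegression(multi_class='ovr', solver='lbfgs')
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
# Posterior probabilities P{y|x} for each class are also available:
y_proba = log_reg.predict_proba(X_test)</p>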
    </sec>
    <sec id="sec-6">
      <title>6. Naive Bayesian classifier</title>
      <p>The Bayesian classification is based on the maximum
posterior probability hypothesis, that is, an object d is
considered to belong to the class $c_j$ ($c_j \in C$) for which
the highest posterior probability is achieved:</p>
      <p>
        $P(c_j \mid d)$ [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The Bayes formula gives:
$P(c_j \mid d) = \frac{P(c_j) \cdot P(d \mid c_j)}{P(d)} \approx P(c_j) \cdot P(d \mid c_j)$,
where $P(d \mid c_j)$ is the probability of encountering an
object d among objects of class $c_j$, and $P(c_j)$, $P(d)$ are the
prior probabilities of the class $c_j$ and the object d.
      </p>
      <p>
        Under the "naive" assumption that all features
describing the classified objects are completely equal
and not related to each other, $P(d \mid c_j)$ can be
calculated as the product of the probabilities of
encountering a feature $x_i$ ($x_i \in X$) among objects of
class $c_j$:
$P(d \mid c_j) = \prod_{i=1}^{|X|} P(x_i \mid c_j)$,
where $P(x_i \mid c_j)$ is a probabilistic assessment of the
contribution of the feature $x_i$ to the fact that $d \in c_j$ [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
In practice, when multiplying very small conditional
probabilities, a loss of significant digits can occur, and
therefore the logarithm of this product is used instead; the
logarithm is a monotonically increasing function, and, therefore, the
class $c_j$ with the largest value of the logarithm
will remain the most probable. In this case, the decision
rule of the naive Bayesian classifier takes the form:
$c^{*} = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_{i=1}^{|X|} \log P(x_i \mid c_j) \right]$.
      </p>
      <p>The resulting values of the MultinomialNB classifier
from the Sklearn library turned out to be the following:</p>
      <sec id="sec-6-1">
        <title>Accuracy 0.874</title>
        <p>1
0</p>
      </sec>
      <sec id="sec-6-2">
        <title>Wall time: 289 ms</title>
        <p>works fundamentally differently - the speed of its
work is much higher - however, the qualitative criteria,
the main of which F-measure, are inferior to past
classifiers.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Decision tree methodology</title>
      <p>
        With this algorithm, the tree is built from top to
bottom - from the root node to the leaves. At the first
step of training, an "empty" tree is formed, which
consists only of the root node, which in turn contains
the entire training set. Next, you need to split the root
node into subsets, from which the descendant nodes
will be formed. For this, one of the attributes is
selected and rules are formed that divide the training
set into subsets, the number of which is equal to the
number of unique values of the selected attribute. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
As a result of splitting, p (according to the number of
attribute values) subsets are obtained and, therefore, p
descendants of the root node are formed, each of
which is assigned its own subset. Then this procedure
is recursively applied to all subsets until the stop
condition is reached.
      </p>
      <p>For example, a partitioning rule should be applied to
the training set, in which the attribute A, which can
take p values: a1, a2, ..., ap, creates p subsets S1, S2, ...,
Sp, where examples will be distributed, in which the
attribute A takes the corresponding value.</p>
      <p>Let $N(C_j, S)$ be the number of examples of the
class $C_j$ in the set S; then the probability of the class $C_j$
in this set is determined by the expression
$p_j = \frac{N(C_j, S)}{N(S)}$,
where N(S) is the total number of examples in the set S.</p>
      <p>The entropy of the set S is expressed as:
$\mathrm{Info}(S) = -\sum_{j} \frac{N(C_j, S)}{N(S)} \cdot \log_2\left(\frac{N(C_j, S)}{N(S)}\right)$.
It demonstrates the average amount of information
required to determine the class of an example from the
set S.</p>
      <p>The same estimate, obtained after partitioning the set S
by attribute A, can be written as:
$\mathrm{Info}_A(S) = \sum_{i=1}^{p} \frac{N(S_i)}{N(S)} \cdot \mathrm{Info}(S_i)$,
where $S_i$ is the i-th node obtained by splitting by attribute
A. After that, to choose the best branching attribute,
one should use a criterion of the form:
$\mathrm{Gain}(A) = \mathrm{Info}(S) - \mathrm{Info}_A(S)$.
This criterion is called the criterion of information
gain. This value is calculated for all potential split
attributes and the one that maximizes the specified
criterion is selected for the division operation.
The described procedure is applied to subsets Si and
further, until the values of the criterion cease to
increase significantly with new partitions or a different
stopping condition is met. In this case, when in the
process of building a tree, an "empty"
node is
obtained, where not a single example will fall, then it
must be converted into a leaf that is associated with
the class most often found in the immediate ancestor
of this node.</p>
      <p>The DecisionTreeRegressor classifier with parameters
random_state = 15 and min_samples_leaf = 25
showed the following characteristics:</p>
      <sec id="sec-7-1">
        <title>Accuracy 0.964</title>
        <p>1
0</p>
      </sec>
      <sec id="sec-7-2">
        <title>Wall time: 975ms</title>
        <p>When working with the decision tree method, the
results are similar to the logistic regression method,
however, the duration of training and forecasting in the
decision tree</p>
        <p>method is much longer, which, with
equal qualitative characteristics, puts the results of this
method higher than others.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Gradient boosting method</title>
      <p>
        Gradient boosting is a machine learning method that
builds a predictive model in the form of an
ensemble of weak predictive models, usually decision
trees, essentially developing the decision tree method.
During boosting, the model is built in stages - an
arbitrary differentiable loss function is also optimized
in stages. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
For the problem of object recognition from a
multidimensional space X with a label space Y, a
training sample $\{x_i\}_{i=1}^{n}$ is given, where $x_i \in X$. In
addition, the true values of the class labels for each
object, $\{y_i\}_{i=1}^{n}$, are known, where $y_i \in Y$. The solution to
the prediction problem is reduced in this case to the
search for a recognizing operator that can predict the
labels as accurately as possible for each new object
from the set X.
      </p>
      <p>Let a family of basic algorithms H be given, each
element $h(x, a) \in H$, $h: X \to \mathbb{R}$, of which is defined by some
vector of parameters $a \in A$.</p>
      <p>In this case, it is necessary to find the final
classification algorithm in the form of the following composition:
$F_M(x) = \sum_{m=1}^{M} b_m h(x, a_m)$, $b_m \in \mathbb{R}$.
However, the simultaneous selection of all parameters $\{a_m, b_m\}$
is a very time-consuming task, therefore the construction of this
composition should be carried out by means of
"greedy" growth, each time adding to the sum the summand
that is the most optimal algorithm.</p>
      <p>At the step when the optimal classifier $F_{m-1}$ of length
m - 1 has already been assembled, the task is reduced
to finding the pair of most optimal parameters
$\{a_m, b_m\}$ for the classifier of length m:
$F_m(x) = F_{m-1}(x) + b_m h(x, a_m)$, $a_m \in A$, $b_m \in \mathbb{R}$.
Optimality is understood here in accordance with the
principle of explicit maximization of margins: a certain
loss function $L(y_i, F_m(x_i)) \to \min$ is introduced,
showing how much the predicted answer $F_m(x_i)$
differs from the correct answer $y_i$.</p>
      <p>Next, one needs to minimize the functional of this error:
$Q(F) = \sum_{i=1}^{n} L(y_i, F(x_i)) \to \min$.
It should be noted that the error functional Q is defined on a
finite-dimensional real space, and this function is minimized by
the gradient descent method. Taking $F_{m-1}$ as the point for which
the optimal increment should be found, the error gradient is
expressed as follows:
$\nabla Q = \left[ \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} \right]_{i=1}^{n}$.
By virtue of the gradient descent method, it is most
beneficial to add a new term as follows:
$F_m = F_{m-1} - b_m \nabla Q$, $b_m \in \mathbb{R}$,
where $b_m$ is selected by linear search over the real numbers $\mathbb{R}$.
However, $\nabla Q$ is only a vector of optimal values for
each object $x_i$, and not a basic algorithm from the
family H defined for all $x \in X$, so it is necessary to find the
$h(x, a_m) \in H$ that is most similar to $-\nabla Q$. To do this, it is
necessary to re-minimize the error functional using an
algorithm based on the principle of explicit
minimization of margins:
$a_m = \arg\min_{a \in A} \sum_{i=1}^{n} \left( h(x_i, a) + \nabla Q_i \right)^2$,
which in turn corresponds to the basic learning algorithm.
Next, one needs to find the coefficient $b_m$ using linear search:
$b_m = \arg\min_{b \in \mathbb{R}} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) + b \cdot h(x_i, a_m)\right)$.
The gradient boosting method, like the decision tree
method, is a method of enumerating classification
parameters, which, in turn, determines their relative
comparability in terms of training duration. However,
the time spent in boosting on determining the ensemble
of decision trees has a colossal effect: the accuracy in
terms of the final state classes is the highest for this
method.</p>
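      <p>The article does not list the boosting parameters used, so the sketch below simply applies the scikit-learn implementation with its default settings (an ensemble of shallow trees fitted stage by stage to the negative gradient of the loss):
from sklearn.ensemble import GradientBoostingClassifier

# Each stage fits a small decision tree h(x, a_m) to the negative gradient of the
# loss and adds it to the composition with a learning-rate-scaled coefficient.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)</p>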
    </sec>
    <sec id="sec-9">
      <title>9. Neural network</title>
      <p>Taking into account the available analysis of the data
sources of the considered IT infrastructure monitoring
system and the nature of this data, the MLP
(MultiLayer Perceptron) type was chosen as the neural
network: due to the absence of video surveillance
systems among the sources, and, consequently, of video
recognition problems, a classical multilayer feedforward
neural network will be sufficient to
determine the effectiveness of its application [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
An MLP network is any multilayer neural network
with a linear conductance function and a monotone
bounded activation function $g_v$, common to all hidden
neurons, that depends only on the quantity $t = s(v) - w_v$
and is a "smoothed step": as a rule, the hyperbolic tangent
$g(t) = \frac{e^{t} - e^{-t}}{e^{t} + e^{-t}}$
or the logistic function
$\sigma(t) = \frac{1}{1 + e^{-t}}$.
      </p>
      <p>The activation function of the output neurons can also be
the same "smoothed step", or it can be the identity
$g_v(t) = t$; that is, each neuron v calculates the function
$s(v) = g_v\left(\left(\sum_{u} w_{uv} s(u)\right) - w_v\right)$.</p>
      <p>The parameters of the edges $w_e$ are called weights,
and the parameters of the vertices $w_v$ are called
biases. In this case, which activation function is
chosen, hyperbolic tangent or logistic, is immaterial:
for any multilayer perceptron with the tanh activation
function calculating the function $F_{\tanh}(w, x)$, the same
perceptron in which the activation function in the
intermediate layers is replaced by the logistic function
$\sigma$ calculates the same function for some other value of
the parameter $w'$:</p>
      <p>$F_{\tanh}(w, x) = F_{\sigma}(w', x)$.</p>
      <p>
        In accordance with the ideology of minimizing the
empirical risk with regularization, training of the
perceptron calculating the function F(w, x) is the
search for a vector of weights and biases that
minimizes the regularized total error
$Q_T(w) = R(w) + \sum_{i=1}^{n} L(F(w, x_i), y_i)$
on some training set $T = ((x_1, y_1), \ldots, (x_n, y_n))$ [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Training is most often carried out by the classical
method of gradient descent; for its applicability, the
activation functions of all neurons and the error and
regularization functions must be differentiable.
      </p>
      <p>
        Practice shows that the speed of this algorithm is often
inferior to others because of the huge dimension of the
parameter w and the absence of explicit formulas for
the derivatives of the function F with respect to w. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
The results of applying the MLPClassifier (max_iter =
100, random_state = 10) are as follows:
      </p>
      <sec id="sec-9-1">
        <title>Accuracy 0.992</title>
        <p>1
0</p>
        <sec id="sec-9-1-1">
          <title>Precision</title>
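      <p>A sketch of applying the perceptron with the parameters given above (the hidden layer size is the scikit-learn default of 100 neurons, an assumption not stated in the article; the data split names are carried over from the preprocessing sketch):
from sklearn.neural_network import MLPClassifier

# max_iter and random_state as stated in the article; training uses
# gradient-based minimization of the regularized error.
mlp = MLPClassifier(max_iter=100, random_state=10)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)</p>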
          <p>However, the duration of training and forecasting for a
neural network is much longer than for methods based
on building ensembles of decision trees - 1 minute
versus several seconds.
The results of the application of various machine
learning methods clearly prove the postulate of the
"No Free Lunch" theorem: it is experimental tests
that allow one to choose the most appropriate
algorithm for solving a specific problem, taking into
account specific initial data. In this case, it should be
noted once again that the Accuracy characteristic is
practically useless in comparing the results - it is more
correct to evaluate the results by F-measure, and this
should be done separately for each class.</p>
          <p>Based on the applied sense of the task - to provide
better monitoring of the IT infrastructure operation -
the characteristics of the training methods for class "1"
data are more important, that is, for cases of real
failures and infrastructure malfunctions. At the same time,
errors for class "0" will, in fact, become additional incidents
and, therefore, require additional labor from technical
support specialists, which is certainly critical, but less
important in comparison with missing real failures and
malfunctions. It is also worth noting the time
parameters of the methods: the spread is truly
colossal, from 289 milliseconds to 1 minute 15
seconds.</p>
          <p>Comparing by the chosen criteria, it is clearly
seen that the gradient boosting method showed the
best results: at a higher speed, this algorithm was able
to learn better than the other algorithms. When
replicating the application on larger data sets (all IT
services, a larger analysis horizon), training time is
extremely important. Understanding this and the nature
of the initial data, namely the absence of video and
photo images in it, allows us to conclude that the
gradient boosting method is more than sufficient for
solving the problem, and that using a neural network
(which showed similar results with a longer training
duration) is not required at this stage of development
of the considered IT infrastructure monitoring system,
given its method of collecting information.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bolshakov</surname>
            <given-names>M.A.</given-names>
          </string-name>
          <article-title>Preparation of a monitoring system for IT infrastructure for models of critical states based on neural networks // Science-intensive technologies in space research of the Earth</article-title>
          .
          <year>2019</year>
          . No.
          <volume>4</volume>
          . p.
          <fpage>65</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.H.</given-names>
            <surname>Wolpert</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.G.</surname>
          </string-name>
          <article-title>Macready No Free Lunch Theorems for Optimization /</article-title>
          / IEEE Transactions on Evolutionary Computation.
          <year>1997</year>
          . No. 1, p.
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>G. Upton.</surname>
          </string-name>
          <article-title>Analysis of contingency tables. Translated from English and preface by Yu</article-title>
          . P. Adler. Moscow. Finance and statistics.
          <year>1982</year>
          . 143 p.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Parsian Data Algorithms. Newton: O'Reilly Media</surname>
          </string-name>
          , Inc.,
          <year>2015</year>
          .
          <volume>778</volume>
          p.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.W.</given-names>
            <surname>Hosmer</surname>
          </string-name>
          , S. Lemeshow Applied Logistic Regression. 2nd Ed. New York: John Wiley &amp; sons, INC,
          <year>2000</year>
          .
          <volume>397</volume>
          p.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nakai</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Schonbach Bayes' Theorem and Naive Bayes Classifier</article-title>
          .
          <source>Encyclopedia of Bioinformatics and Computational Biology</source>
          , Volume
          <volume>1</volume>
          , Elsevier, pp.
          <fpage>403</fpage>
          -
          <lpage>412</lpage>
          .
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Barber</surname>
            <given-names>D</given-names>
          </string-name>
          .
          <source>Bayesian Reasoning and Machine Learning</source>
          . Cambridge University Press,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>T. Hastie</surname>
          </string-name>
          <article-title>The Elements of Statistical Learning /</article-title>
          / Trevor Hastie, Robert Tibshirani, Jerome Friedman,
          2nd ed. Springer,
          <year>2008</year>
          .
          <volume>764</volume>
          p.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          , Stochastic Gradient Boosting.
          <source>Technical report</source>
          . Dept. of Statistics, Stanford University,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Windeatt Ensemble MLP Classifier</surname>
          </string-name>
          <article-title>Design</article-title>
          . In: Jain L.C.,
          <string-name>
            <surname>Sato-Ilic</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Virvou</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsihrintzis</surname>
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balas</surname>
            <given-names>V.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abeynayake</surname>
            <given-names>C</given-names>
          </string-name>
          .
          <article-title>(eds) Computational Intelligence Paradigms</article-title>
          .
          <source>Studies in Computational Intelligence</source>
          , vol
          <volume>137</volume>
          . Springer, Berlin, Heidelberg
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          <article-title>Principles of Risk Minimization for Learning</article-title>
          <source>Theory // Advances in Neural Information Processing Systems 4 (NIPS</source>
          <year>1991</year>
          ).
          <year>1992</year>
          . No. 4, pp.
          <volume>831</volume>
          -
          <fpage>838</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Gasnikov</surname>
            <given-names>A.V.</given-names>
          </string-name>
          <article-title>Modern numerical optimization methods</article-title>
          .
          <source>Universal Gradient Descent Method: A Tutorial</source>
          . Moscow: MFTI,
          <year>2018</year>
          . 286 p. 2nd ed.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>