=Paper=
{{Paper
|id=Vol-2803/paper20
|storemode=property
|title=Comparative analysis of machine learning methods to assess the quality of IT services (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2803/paper20.pdf
|volume=Vol-2803
|authors=Maksim A. Bolshakov,Igor A. Molodkin,Sergei V. Pugachev
}}
==Comparative analysis of machine learning methods to assess the quality of IT services (short paper)==
Comparative analysis of machine learning methods to assess the quality of IT services
Maksim A. Bolshakov, Igor A. Molodkin and Sergei V. Pugachev
Saint Petersburg Railway Transport University of Emperor Alexander I, 9 Moskovsky Ave., Saint Petersburg, 190031, Russia
Abstract
The article considers the issue of choosing a machine learning method for solving the applied problem of assessing the current state of the quality of IT services. As the selection method, a comparative analysis of generally accepted machine learning methods was carried out using a set of criteria that make it possible to evaluate their effectiveness and efficiency. The main criterion is the F-measure, as a concept generalizing the completeness and accuracy of classification, calculated for each class of states separately, together with the duration of the training and prediction procedure. All operations were carried out on the same dataset, namely the data of the centralized monitoring and management system for the IT infrastructure of Russian Railways in terms of the ETRAN IT service. Due to their heterogeneity, and taking into account the practice of applying the Barlow hypothesis, the initial data went through preliminary processing, the algorithm of which is also described in detail in the article.

Keywords
machine learning, Barlow hypothesis, F-measure, dataset, data heterogeneity.
Models and Methods for Researching Information Systems in Transport, Dec. 11-12, St. Petersburg, Russia
EMAIL: bolshakovm@yandex.ru (Maksim A. Bolshakov); molodkin@pgups.ru (Igor A. Molodkin); nki.pugachev@yandex.ru (Sergei V. Pugachev)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The requirement for software developers and device manufacturers to provide mandatory monitoring of the performance of their products is an established norm. As a result, most of the IT services provided by operating organizations can be assessed not only by the final failure state, but also by the current local characteristics of their work.
Consider the currently operating centralized system for monitoring and managing the IT infrastructure of Russian Railways, which is operated by the Main Computer Center. The huge array of accumulated and constantly updated data of this monitoring system suggests that machine learning methods can be applied successfully to the problem of assessing the quality of IT services by determining the current state of the specified infrastructure and of the service applications implemented on it.
The state of the IT infrastructure at a certain point in time can be assessed by predicting its final performance based on the current data of the monitoring system. To do this, it is necessary to solve the problem of classifying the final state of the IT infrastructure, that is, to determine the class label (the type of the final state: incident / regular operation) from the current values of the monitoring system metrics. For the choice of the implementation method, the number of available classes is not decisive: binary classification is a special case of multiclass classification. Typically, it is the nature of the input data, namely its types and formats, that has a significant impact on the choice of machine learning methods. At the same time, the data preprocessing algorithms do not depend on the chosen training method and must ensure the correct use of the initial data as training and test samples for all the methods considered further [1].
2. Primary data processing

Primary data is understood as the whole set of characteristics of the IT infrastructure involved in the operability of the specified system, taken by the current monitoring and management system, together with the set of service characteristics that make it possible to define this automated system as an IT service. The quality of the obtained primary data, namely their heterogeneity, implies additional processing regardless of the choice of a specific machine learning method.
For the correct formation of the training sample, the most suitable programming language is Python version 3.7.4 with the Pandas and NumPy libraries, which must be installed on the script execution server as site-packages.
Initially, the data received from the monitoring system is presented as a csv file with the separator | and the following columns:
mon_obj — monitoring object name;
metric_name — metric name;
metric_value — metric value at the time of data collection;
value_time — date/time when the data was collected;
isIncident — critical state indicator (at the first stage, a binary classification is used: 1 - critical state, 0 - normal operation).
After that, via the variable small_columns_list = ['mon_obj', 'metric_name', 'metric_value', 'value_time', 'isIncident'] and using the pandas library, a new DataFrame suitable for further use is formed:

import pandas as pd
import numpy as np

d = pd.read_csv('data/data.csv', sep='|', encoding='utf-8', skipinitialspace=True, skiprows=1, names=small_columns_list, low_memory=False)

Next, a new composite_metric_name column is defined in the DataFrame d, holding a composite name built from the monitoring object and the metric name for each row:

d['composite_metric_name'] = d['mon_obj'] + '_' + d['metric_name']

Then a new variable column_names is declared, consisting of the unique values of the newly generated column 'composite_metric_name':

column_names = d['composite_metric_name'].unique()

These steps are necessary to unambiguously match a metric and its values at any given time.
Further, after transposing the DataFrame into a more convenient form and removing unnecessary columns, static data can be removed as a way to reduce the dimensionality and optimize the use of computing resources without losing data quality:

def delete_cols_w_static_value():
    column_names_with_static_value = []
    for col_name in column_names:
        if merged_df[col_name].nunique() == 1:
            column_names_with_static_value.append(col_name)
    if 'isIncident' in column_names_with_static_value:
        column_names_with_static_value.remove('isIncident')
    merged_df.drop(column_names_with_static_value, axis=1, inplace=True)

After reducing the dimension in this way, it is necessary to recode the text values, that is, replace the existing text values of the metrics with numeric ones by introducing new arguments of the objective function. For each current text value, a separate column is created in the current dataset and a value of 1 or 0 is determined, thus obtaining correct and comparable data for analysis.

def find_strings_column():
    df_cols = merged_df.columns
    ret_col_names = []
    for col in df_cols:
        if merged_df[col].dtypes == 'object':
            ret_col_names.append(col)
    return ret_col_names

The variable strings_column stores the list of names of the columns that contain string values. The next step is to form a dictionary of string values; the values in the columns are then replaced by the indexes of the elements from the dictionary.

def make_dict_with_strings_column():
    ret = {}
    for sc in strings_column:
        ret[sc] = list(merged_df[sc].unique())
    return ret

Using the save function of the NumPy library, this dictionary is saved to a file:

np.save('data/dict_of_string_values.npy', dict_of_string_values)

And further, using the function

def modify_value_string_column():
    for c in dict_of_string_values.keys():
        merged_df[c] = merged_df[c].map(lambda x: dict_of_string_values[c].index(x))

all text values are changed to numeric values.
At the end of the primary data processing, the data frame values are standardized so that the variance is unitary and the mean of the series is 0 - this can be done using the built-in StandardScaler tool or through a self-written function:

def normalize_df():
    cols = list(merged_df.columns)
    for cl in cols:
        x_min = min(list(merged_df[cl]))
        x_max = max(list(merged_df[cl]))
        merged_df[cl] = merged_df[cl].map(lambda x: (x - x_min) / (x_max - x_min))
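The transposition of the DataFrame d into the merged_df form used by the functions above is only described in words. A minimal sketch of how it could be done, assuming one row per value_time and the composite_metric_name column built earlier (the pivot and aggregation choices are assumptions, not the authors' code), is:

import pandas as pd  # d is the DataFrame built from data.csv above

# Hypothetical reconstruction of the transposition step (not the authors' code):
# one row per collection time, one column per composite metric.
merged_df = d.pivot_table(index='value_time',
                          columns='composite_metric_name',
                          values='metric_value',
                          aggfunc='first')
# One incident label per timestamp: the worst case over the raw rows.
merged_df['isIncident'] = d.groupby('value_time')['isIncident'].max()
merged_df = merged_df.reset_index(drop=True)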
For convenience, the final result should be saved to a file:

merged_df.to_csv(path_or_buf='data/normalised_df.csv', sep='|', index=False)

As a result of the actions performed, a dataset was obtained that is suitable for applying various machine learning models, with the following characteristics:
• the number of records in the training set is 10815 (9323 of class 0, 1492 of class 1);
• the number of records in the test set is 3605 (3297 of class 0, 308 of class 1).
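The article reports only the resulting record counts; a minimal sketch of producing such a training/test split from the saved file (file path, split function and random seed are assumptions) could be:

import pandas as pd
from sklearn.model_selection import train_test_split

# File path and random seed are assumptions; only the resulting counts are given in the article.
df = pd.read_csv('data/normalised_df.csv', sep='|')
X = df.drop(columns=['isIncident'])
y = df['isIncident'].astype(int)

# A 75/25 split matches the quoted 10815 / 3605 record counts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))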
3. Criteria for comparing results

When solving any supervised learning problem, there is no single most suitable machine learning algorithm: the "No Free Lunch" theorem states that it is impossible to unambiguously name the best machine learning method for a specific task in advance. The applicability of this theorem extends, among other things, to the well-studied area of binary classification problems; therefore, it is imperative to consider a set of methods and, based on the results of practical tests, evaluate the effectiveness and applicability of each of them [2].
The effectiveness and efficiency will be primarily assessed by the most common and user-understandable criterion, Accuracy.
Accuracy is a measure that indicates the proportion of correct decisions of the classifier:

$\mathrm{Accuracy} = \frac{P}{N}$,

where P is the number of states correctly identified by the system, and N is the total number of states in the test sample.
However, for the problem being solved, this criterion is not enough, since it assigns the same weight to all final states (classes), which may be incorrect in the considered case of a non-uniform distribution of time moments over the final states.
Thus, for a more correct comparison, and taking into account the fact that in the example under consideration the share of normal states is much greater than the share of critical states, the assessment of the classifiers should also be based on the following criteria: Precision (accuracy, in a calculation other than Accuracy) and Recall (completeness).
These criteria are calculated separately for each final class. Precision is the proportion of situations that really belong to a class relative to all situations that the system has assigned to this class. Recall (completeness) is the proportion of situations found by the classifier that belong to a class relative to all situations of this class in the test sample. More clearly, these criteria can be presented through a contingency table, also built separately for each final class [3]:

                      Training sample class (1/0)
                      Positive    Negative
Result   Positive     TP          FP
         Negative     FN          TN

These results are used directly in the calculation of the criteria for the classifier as follows:

$\mathrm{Precision} = \frac{TP}{TP+FP}; \qquad \mathrm{Recall} = \frac{TP}{TP+FN}$

As can be seen from the calculation algorithm, these criteria provide a more complete understanding of the quality of the classifier's work. It would be logical to say that the higher the precision and recall, the better; but in reality maximum precision and recall are not achievable at the same time, and a balance between these characteristics must be found. To do this, the F-measure is used, which is calculated as:

$F = (\beta^2 + 1) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$,

where 0 < β < 1 if precision is given greater weight in the selection of the classifier, and β > 1 if completeness is preferred. In the case β = 1, a balanced F-measure is obtained, denoted as:

$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.

In addition, for all classifiers we will estimate the duration of execution (Wall time), counting the time from the moment the model is created and training begins until the moment the performance assessment metrics are produced from the comparison of the predicted results with the test sample.
All work is carried out on a computer with the following characteristics: Intel Core i9-9980XE 3.00 GHz, 128 GB RAM, 4x NVIDIA RTX 2080Ti 11 GB, 1 TB PCIe SSD.
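For illustration, the criteria described in this section can be computed per class with standard scikit-learn utilities; the following sketch (the helper function and sample labels are illustrative, not from the article) prints Accuracy, the contingency table and per-class Precision, Recall and F1:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

def report(y_true, y_pred):
    # Overall Accuracy = P / N.
    print('Accuracy:', accuracy_score(y_true, y_pred))
    # Contingency table (rows: actual class, columns: predicted class), for classes 1 and 0.
    print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
    # Precision, Recall and the balanced F-measure (beta = 1), separately for each class.
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 0])
    for cls, p, r, f in zip([1, 0], prec, rec, f1):
        print(f'class {cls}: Precision={p:.2f} Recall={r:.2f} F1={f:.2f}')

# Tiny illustrative call on made-up labels (not the article's data).
report(np.array([1, 0, 0, 1, 0]), np.array([1, 0, 1, 1, 0]))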
4. K-nearest neighbors (kNN) algorithm

The specified algorithm works as follows. Let a training sample of pairs "object (state characteristic) - response (state class)" be given:

$X^m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$,
and let a distance function $\rho(x, x')$ be given on the set of objects. This function should be a reasonably adequate model of object similarity: the larger its value, the less similar the two objects x, x' are [4]. For an arbitrary object μ, we arrange the objects of the training sample in order of increasing distance to μ:

$\rho(\mu, x_{1;\mu}) \le \rho(\mu, x_{2;\mu}) \le \ldots \le \rho(\mu, x_{m;\mu})$,

where $x_{i;\mu}$ denotes the object of the training sample that is the i-th neighbor of the object μ. We introduce a similar notation for the answer of the i-th neighbor, $y_{i;\mu}$. Thus, an arbitrary object μ generates a new renumbering of the sample. In its most general form, the nearest neighbors algorithm is

$a(\mu) = \arg\max_{y \in Y} \sum_{i=1}^{m} [y_{i;\mu} = y] \cdot w(i, \mu)$,

where w(i, μ) is a given weight function that estimates the degree of importance of the i-th neighbor for classifying the object μ; this function cannot be negative and does not increase with respect to i.
For the k nearest neighbors method:

$w(i; \mu) = [i \le k]$

The qualitative characteristics of the KNeighborsClassifier algorithm with the parameter n_neighbors = 10 are as follows:

Accuracy 0.949
Class  Precision  Recall  F1 score
1      0.63       0.97    0.77
0      1.0        0.95    0.97
Wall time: 1 min 15 s

It is clearly seen that an assessment in the form of Accuracy alone is not enough for a comparative analysis: with a value of this indicator of 0.949, almost 95% accuracy, the Precision of determining the final state 1 (inoperability) is only 63%. As a result, one should record the F-measure value for each class and the total duration of the classifier's work.
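A minimal sketch of this kind of evaluation with scikit-learn's KNeighborsClassifier (data path, split and timing code are assumptions carried over from the sketches above; only n_neighbors = 10 is taken from the article) could look as follows:

import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

df = pd.read_csv('data/normalised_df.csv', sep='|')              # path assumed
X, y = df.drop(columns=['isIncident']), df['isIncident'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

start = time.perf_counter()                                       # stand-in for the Wall time estimate
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)  # n_neighbors as in the article
print(classification_report(y_test, knn.predict(X_test), digits=2))
print(f'Wall time: {time.perf_counter() - start:.1f} s')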
5. Logistic regression

To predict the probability of occurrence of a certain event from the values of a set of features, a dependent variable Y is introduced, which takes the values 0 or 1, and a set of independent variables x1, ..., xn, based on the values of which it is required to calculate the probability of accepting a particular value of the dependent variable [5].
Let the objects be specified by numerical features

$f_j : X \to \mathbb{R}, \quad j = 1, \ldots, n$,

so that the space of feature descriptions is $X = \mathbb{R}^n$. Let Y be the set of class labels, and let a training set of object-response pairs be given:

$X^m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$.

Consider the case of two classes: Y = {-1, +1}. In logistic regression, a linear classification algorithm a: X → Y of the following form is constructed:

$a(x, w) = \mathrm{sign}\left(\sum_{j=1}^{n} w_j f_j(x) - w_0\right) = \mathrm{sign}\langle x, w \rangle$,

where $w_j$ is the weight of the j-th feature, $w_0$ the decision threshold, $w = (w_0, \ldots, w_n)$ the weight vector, and $\langle x, w \rangle$ the dot product of the feature description of an object with the vector of weights. It is assumed that a zero feature is artificially introduced: $f_0(x) = -1$.
The task of training a linear classifier is to adjust the weight vector w based on the sample $X^m$. In logistic regression, this is done by minimizing the empirical risk with a loss function of the following form:

$Q(w) = \sum_{i=1}^{m} \ln\left(1 + \exp(-y_i \langle x_i, w \rangle)\right) \to \min_{w}$

After the solution w is found, it becomes possible not only to classify an arbitrary object x, but also to estimate the posterior probabilities of its belonging to the existing classes:

$P\{y|x\} = \sigma(y \langle x, w \rangle), \quad y \in Y$,

where $\sigma(z) = \frac{1}{1 + \exp(-z)}$ is the sigmoid function.
The qualitative characteristics of the specified model with the parameters LogisticRegression(multi_class='ovr', solver='lbfgs') are:

Accuracy 0.971
Class  Precision  Recall  F1 score
1      0.76       0.98    0.85
0      1.0        0.97    0.98
Wall time: 1 min 10 s

The results shown by this classifier are better than those of the previous method with a comparable duration of operation; however, the quality of identifying faulty situations (class 1) does not yet guarantee satisfactory operation in industrial mode.
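A corresponding sketch for logistic regression, using the parameters quoted above (note that multi_class is deprecated in recent scikit-learn releases) and additionally showing the posterior probabilities P{y|x}; data path and split are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv('data/normalised_df.csv', sep='|')              # path assumed
X, y = df.drop(columns=['isIncident']), df['isIncident'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

logreg = LogisticRegression(multi_class='ovr', solver='lbfgs')    # parameters as quoted above
logreg.fit(X_train, y_train)
print(classification_report(y_test, logreg.predict(X_test), digits=2))
# Posterior probabilities P{y|x}, i.e. the sigmoid of the learned linear function.
print(logreg.predict_proba(X_test)[:5])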
6. Naive Bayesian classifier

The Bayesian classification is based on the maximum posterior probability hypothesis, that is, an object d is considered to belong to the class $c_j$ ($c_j \in C$) for which the highest posterior probability is achieved:

$\max P(c_j | d)$ [6]

By the Bayes formula:

$P(c_j | d) = \frac{P(c_j) \cdot P(d | c_j)}{P(d)} \approx P(c_j) P(d | c_j)$,

where $P(d | c_j)$ is the probability of encountering an object d among objects of class $c_j$, and $P(c_j)$, $P(d)$ are the prior probabilities of the class $c_j$ and of the object d.
Under the "naive" assumption that all features describing the classified objects are completely equal and not related to each other, $P(d | c_j)$ can be calculated as the product of the probabilities of encountering a feature $x_i$ ($x_i \in X$) among objects of class $c_j$:

$P(d | c_j) = \prod_{i=1}^{|x|} P(x_i | c_j)$,

where $P(x_i | c_j)$ is a probabilistic assessment of the contribution of a feature $x_i$ to the fact that $d \in c_j$ [7].
In practice, when multiplying very small conditional probabilities, significant digits can be lost; therefore, instead of the estimates of the probabilities $P(x_i | c_j)$, their logarithms should be used. The logarithm is a monotonically increasing function, and therefore the class $c_j$ with the largest value of the logarithm remains the most probable. In this case, the decision rule of the naive Bayesian classifier takes the following form:

$C^* = \arg\max_{c_j \in C} \left[\log P(c_j) + \sum_{i=1}^{|x|} \log P(x_i | c_j)\right]$

The resulting values of the MultinomialNB classifier from the Sklearn library turned out to be the following:

Accuracy 0.874
Class  Precision  Recall  F1 score
1      0.40       0.96    0.57
0      1.0        0.87    0.93
Wall time: 289 ms

Based on its description, the considered classifier works fundamentally differently and the speed of its work is much higher; however, the qualitative criteria, chief among them the F-measure, are inferior to the previous classifiers.
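A hedged sketch for the naive Bayesian classifier (MultinomialNB, as named above; data loading and the split are assumptions) which also exposes the log-probability form of the decision rule:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

df = pd.read_csv('data/normalised_df.csv', sep='|')              # path assumed
X, y = df.drop(columns=['isIncident']), df['isIncident'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# MultinomialNB needs non-negative features; the min-max scaling of Section 2 satisfies this.
nb = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test), digits=2))
# The decision rule works with log P(c_j) + sum_i log P(x_i|c_j); exposed as log-probabilities.
print(nb.predict_log_proba(X_test)[:3])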
7. Decision tree methodology

With this algorithm, the tree is built from top to bottom, from the root node to the leaves. At the first step of training, an "empty" tree is formed, which consists only of the root node and contains the entire training set. Next, the root node has to be split into subsets, from which the descendant nodes will be formed. For this, one of the attributes is selected and rules are formed that divide the training set into subsets, the number of which is equal to the number of unique values of the selected attribute [8].
As a result of splitting, p subsets (according to the number of attribute values) are obtained and, therefore, p descendants of the root node are formed, each of which is assigned its own subset. Then this procedure is recursively applied to all subsets until the stop condition is reached.
For example, let a partitioning rule be applied to the training set in which the attribute A, which can take p values a1, a2, ..., ap, creates p subsets S1, S2, ..., Sp, over which the examples are distributed according to the value taken by attribute A.
Further, if N(Cj, S) is the number of examples of the class Cj in the set S, then the probability of the class Cj in this set is determined by the expression

$P = \frac{N(C_j, S)}{N(S)}$,

where N(S) is the total number of examples in the set S.
The entropy of the set S is then expressed as

$\mathrm{Info}(S) = -\sum_{j=1}^{m} \frac{N(C_j, S)}{N(S)} \cdot \log\left(\frac{N(C_j, S)}{N(S)}\right)$

It shows the average amount of information required to determine the class of an example from the set S.
The same estimate, obtained after partitioning the set S by attribute A, can be written as

$\mathrm{Info}_A(S) = \sum_{i=1}^{k} \frac{N(S_i)}{N(S)} \cdot \mathrm{Info}(S_i)$,

where $S_i$ is the i-th node obtained by splitting by attribute A. After that, to choose the best branching attribute, the following criterion is used:
$\mathrm{Gain}(A) = \mathrm{Info}(S) - \mathrm{Info}_A(S)$

This criterion is called the information gain criterion. Its value is calculated for all potential split attributes, and the one that maximizes it is selected for the division operation.
The described procedure is applied to the subsets $S_i$ and further, until the values of the criterion cease to increase significantly with new partitions or another stopping condition is met. If, in the process of building the tree, an "empty" node is obtained into which not a single example falls, it is converted into a leaf associated with the class most often found in the immediate ancestor of this node.
The DecisionTreeRegressor classifier with the parameters random_state = 15 and min_samples_leaf = 25 showed the following characteristics:

Accuracy 0.964
Class  Precision  Recall  F1 score
1      0.71       0.97    0.85
0      1.0        0.97    0.98
Wall time: 975 ms

With the decision tree method, the results are similar to the logistic regression method; however, the duration of training and forecasting is much shorter, which, with equal qualitative characteristics, puts the results of this method above the others.
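The following sketch ties the formulas above to code: a small entropy helper corresponding to Info(S), plus a fit of a decision tree with the parameter values quoted in the text. DecisionTreeClassifier is used here instead of the DecisionTreeRegressor named in the article, since the task is classification; the data path and split are assumptions:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

def info(labels):
    # Entropy Info(S) of a set of class labels, as in the formula above.
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

df = pd.read_csv('data/normalised_df.csv', sep='|')              # path assumed
X, y = df.drop(columns=['isIncident']), df['isIncident'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Classifier used instead of the regressor named in the text; parameter values as quoted.
tree = DecisionTreeClassifier(random_state=15, min_samples_leaf=25).fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test), digits=2))
print('Info(S) of the training labels:', info(y_train.to_numpy()))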
8. Gradient boosting method

Gradient boosting is a machine learning method that creates a predictive model in the form of an ensemble of weak predictive models, usually decision trees, essentially developing the decision tree method. During boosting, the model is built in stages, and an arbitrary differentiable loss function is also optimized in stages [9].
For the problem of object recognition from a multidimensional space X with a label space Y, a training sample $\{x_i\}_{i=1}^{N}$ is given, where $x_i \in X$. In addition, the true values of the class labels $\{y_i\}_{i=1}^{N}$ are known for each object, where $y_i \in Y$. The prediction problem is reduced in this case to the search for a recognizing operator that can predict the labels as accurately as possible for each new object from the set X.
Let a family of basic algorithms H be given, each element $h(x, a) \in H: X \to \mathbb{R}$ of which is defined by some vector of parameters $a \in A$. It is necessary to find the final classification algorithm in the form of the following composition:

$F_M(x) = \sum_{m=1}^{M} b_m h(x, a_m), \quad b_m \in \mathbb{R}, \; a_m \in A$.

However, the selection of the optimal set of parameters $\{a_m, b_m\}_{m=1}^{M}$ is an extremely time-consuming task; therefore, the composition should be constructed by means of "greedy" growth, each time adding to the sum the term that is currently optimal.
At the step when the optimal classifier $F_{m-1}$ of length m − 1 has already been assembled, the task is reduced to finding the pair of optimal parameters $\{a_m, b_m\}$ for the classifier of length m:

$F_m(x) = F_{m-1}(x) + b_m h(x, a_m), \quad b_m \in \mathbb{R}, \; a_m \in A$

Optimality is understood here in accordance with the principle of explicit margin maximization: a loss function $L(y_i, F_m(x_i)) \to \min$ is introduced, showing how much the predicted answer $F_m(x_i)$ differs from the correct answer $y_i$. Next, the error functional has to be minimized:

$Q = \sum_{i=1}^{N} L(y_i, F_m(x_i)) \to \min$

It should be noted that the error functional Q is a real function depending on the points $\{F_m(x_i)\}_{i=1}^{N}$ in the N-dimensional space, and this function is minimized by the gradient descent method. Taking $F_{m-1}$ as the point for which the optimal increment should be found, the error gradient is expressed as follows:

$\nabla Q = \left[\frac{\partial Q}{\partial F_{m-1}}(x_i)\right]_{i=1}^{N} = \left[\frac{\partial \left(\sum_{i=1}^{N} L(y_i, F_{m-1})\right)}{\partial F_{m-1}}(x_i)\right]_{i=1}^{N} = \left[\frac{\partial L(y_i, F_{m-1})}{\partial F_{m-1}}(x_i)\right]_{i=1}^{N}$

By virtue of the gradient descent method, it is most beneficial to add a new term as follows:

$F_m = F_{m-1} - b_m \nabla Q, \quad b_m \in \mathbb{R}$,

where $b_m$ is selected by linear search over the real numbers:

$b_m = \arg\min_{b} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) - b \nabla Q_i)$
However, ∇Q is only a vector of optimal values for each object $x_i$, and not a basic algorithm from the family H defined for all $x \in X$; so it is necessary to find the $h(x, a_m) \in H$ that is most similar to ∇Q. To do this, the error functional is minimized again, using an algorithm based on the principle of explicit margin maximization:

$a_m = \arg\min_{a} \sum_{i=1}^{N} L(\nabla Q_i, h(x_i, a)) \equiv \mathrm{train}(\{x_i\}_{i=1}^{N}, \{\nabla Q_i\}_{i=1}^{N})$,

which in turn corresponds to the basic learning algorithm.
Next, the coefficient $b_m$ is found by linear search:

$b_m = \arg\min_{b} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) - b \cdot h(x_i, a_m))$

The GradientBoostingClassifier implemented according to this algorithm with the parameters learning_rate = 0.3, n_estimators = 10, verbose = 1, subsample = 0.5 showed the following results:

Accuracy 0.993
Class  Precision  Recall  F1 score
1      0.94       0.99    0.97
0      1.0        0.99    1.0
Wall time: 3.95 s

The gradient boosting method, like the decision tree method, is a method of enumerating classification parameters, which in turn makes their training durations comparable. However, the time spent in boosting on determining the ensemble of decision trees pays off dramatically: the accuracy in terms of the final state classes is the highest for this method.
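A sketch of GradientBoostingClassifier with the parameters quoted above; staged_predict is used to illustrate the stage-by-stage construction F_m(x) = F_{m-1}(x) + b_m h(x, a_m). The data path, split and timing code are assumptions:

import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, f1_score

df = pd.read_csv('data/normalised_df.csv', sep='|')              # path assumed
X, y = df.drop(columns=['isIncident']), df['isIncident'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

start = time.perf_counter()
gbc = GradientBoostingClassifier(learning_rate=0.3, n_estimators=10,
                                 verbose=1, subsample=0.5)        # parameters as quoted above
gbc.fit(X_train, y_train)
print(classification_report(y_test, gbc.predict(X_test), digits=2))
print(f'Wall time: {time.perf_counter() - start:.2f} s')
# F_m is built stage by stage; staged_predict shows the quality after each added term.
for m, y_m in enumerate(gbc.staged_predict(X_test), start=1):
    print(m, f1_score(y_test, y_m, pos_label=1))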
9. Neural network

Taking into account the available analysis of the data sources of the considered IT infrastructure monitoring system and the nature of this data, the MLP (Multi-Layer Perceptron) type of neural network was chosen: since there are no video surveillance systems among the sources, and consequently no video recognition problem, a classical feed-forward neural network will be sufficient to determine the effectiveness of its application [10].
An MLP network is a multilayer neural network with a linear propagation function and a monotone bounded activation function $g_v$, common to all hidden neurons, that depends only on the quantity $t = s(v) - w_v$ and is a "smoothed step", as a rule a hyperbolic tangent

$\tanh(t) = \frac{e^{t} - e^{-t}}{e^{t} + e^{-t}}$

or the logistic function

$\sigma(t) = \frac{\tanh(t/2) + 1}{2} = \frac{1}{1 + e^{-t}}$.

The activation function of the output neurons can also be the same "smoothed step", or it can be the identity $g_v(t) = t$; that is, each neuron v calculates the function

$u(v) = g_v\left(\left(\sum_{e} w_e u(e)\right) - w_v\right)$.

The parameters of the edges $w_e$ are called weights, and the parameters of the vertices $w_v$ are called biases. Which activation function is chosen, hyperbolic tangent or logistic, is immaterial: for any multilayer perceptron with the activation function tanh calculating the function $F_{\tanh}(w, x)$, the same perceptron in which the activation function in the intermediate layers is replaced by the logistic function σ calculates the same function for some other value of the parameter w′:

$F_{\tanh}(w, x) = F_{\sigma}(w', x)$.

In accordance with the ideology of minimizing empirical risk with regularization, training of the perceptron calculating the function F(w, x) is the search for a vector of weights and biases that minimizes the regularized total error

$E_T(w) = \varphi(w) + \sum_{i=1}^{N} E(F(w, x_i), y_i)$

on some training set $T = ((x_1, y_1), \ldots, (x_N, y_N))$ [11].
Training is most often carried out by the classical method of gradient descent; for its applicability, the activation functions of all neurons and the error and regularization functions must be differentiable. Practice shows that the speed of this algorithm is often inferior to others because of the huge dimension of the parameter w and the absence of explicit formulas for the derivatives of the function F with respect to w [11].
The results of applying the MLPClassifier (max_iter = 100, random_state = 10) are as follows:

Accuracy 0.992
Class  Precision  Recall  F1 score
1      0.93       0.99    0.95
0      1.0        0.99    1.0
Wall time: 1 min 5 s

The neural network, as the user often expects, has shown high results in the qualitative assessment; they are essentially equal to the results of gradient boosting.
However, the duration of training and forecasting for the neural network is much longer than for the methods based on building ensembles of decision trees: 1 minute versus several seconds.
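A sketch of the MLPClassifier run with the parameters quoted in Section 9 (data path, split and timing are assumptions; with max_iter = 100 scikit-learn may emit a convergence warning):

import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

df = pd.read_csv('data/normalised_df.csv', sep='|')              # path assumed
X, y = df.drop(columns=['isIncident']), df['isIncident'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

start = time.perf_counter()
mlp = MLPClassifier(max_iter=100, random_state=10).fit(X_train, y_train)  # parameters as quoted
print(classification_report(y_test, mlp.predict(X_test), digits=2))
print(f'Wall time: {time.perf_counter() - start:.1f} s')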
10. Conclusion

The results of applying the various machine learning methods clearly confirm the postulate of the "No Free Lunch" theorem: it is experimental tests that allow choosing the most appropriate algorithm for solving a specific problem with specific initial data. It should be noted once again that the Accuracy characteristic is practically useless for comparing the results; it is more correct to evaluate the results by the F-measure, and to do so separately for each class.
Based on the applied sense of the task - to provide better monitoring of IT infrastructure operation - the characteristics of the training methods for the data class "1" are more important, that is, for cases of real faults and infrastructure failures. At the same time, errors for class "0" will, in fact, appear as additional incidents and therefore require additional labor from technical support specialists, which is certainly significant, but less important than missing real faults and failures. It is also worth noting the time parameters of the methods: the spread is truly colossal, from 289 milliseconds to 1 minute 15 seconds.
Comparing the criteria, it is clearly seen that the gradient boosting method showed the best results: with a higher speed, this algorithm was able to learn better than the other algorithms. When replicating the application on much larger data sets (all IT services, a longer analysis horizon), training time is extremely important. Understanding this, together with the nature of the initial data, namely the absence of video and photo images, allows us to conclude that the gradient boosting method is more than sufficient for solving the problem, and that at this stage of development of the considered IT infrastructure monitoring system, in terms of its method of collecting information, a neural network (which showed similar results with a longer training duration) is not required.

References

[1] M.A. Bolshakov. Preparation of a monitoring system for IT infrastructure for models of critical states based on neural networks. Science-intensive technologies in space research of the Earth, 2019, No. 4, pp. 65-71.
[2] D.H. Wolpert, W.G. Macready. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation, 1997, No. 1, p. 1.
[3] G. Upton. Analysis of contingency tables. Translated from English, with a preface by Yu. P. Adler. Moscow: Finance and Statistics, 1982. 143 p.
[4] M. Parsian. Data Algorithms. Newton: O'Reilly Media, Inc., 2015. 778 p.
[5] D.W. Hosmer, S. Lemeshow. Applied Logistic Regression. 2nd ed. New York: John Wiley & Sons, Inc., 2000. 397 p.
[6] S. Ranganathan, K. Nakai, C. Schonbach. Bayes' Theorem and Naive Bayes Classifier. Encyclopedia of Bioinformatics and Computational Biology, Volume 1, Elsevier, 2017, pp. 403-412.
[7] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
[8] T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning. 2nd ed. Springer, 2008. 764 p.
[9] J.H. Friedman. Stochastic Gradient Boosting. Technical report. Dept. of Statistics, Stanford University, 1999.
[10] T. Windeatt. Ensemble MLP Classifier Design. In: Jain L.C., Sato-Ilic M., Virvou M., Tsihrintzis G.A., Balas V.E., Abeynayake C. (eds) Computational Intelligence Paradigms. Studies in Computational Intelligence, vol 137. Springer, Berlin, Heidelberg.
[11] V. Vapnik. Principles of Risk Minimization for Learning Theory. Advances in Neural Information Processing Systems 4 (NIPS 1991), 1992, pp. 831-838.
[12] A.V. Gasnikov. Modern Numerical Optimization Methods. The Universal Gradient Descent Method: A Tutorial. 2nd ed. Moscow: MFTI, 2018. 286 p.