=Paper=
{{Paper
|id=Vol-2842/paper_8
|storemode=property
|title=The Regression Analysis of the Data to Determine the Buffer Size when Serving a Self-similar Packets Flow
|pdfUrl=https://ceur-ws.org/Vol-2842/paper_7.pdf
|volume=Vol-2842
|authors=Gennadiy Linets,Roman Voronkin,Svetlana Govorova,Ilya Palkanov,Carlos Grilo
}}
==The Regression Analysis of the Data to Determine the Buffer Size when Serving a Self-similar Packets Flow==
Gennadiy Linets a, Roman Voronkin a, Svetlana Govorova a, Ilya Palkanov a, Carlos Grilo b
a North Caucasus Federal University, 2 Kulakova str, Stavropol, 355029, Russia
b Instituto Politécnico de Leiria, Rua General Norton de Matos, Apartado 4133, 2411-901 Leiria, Portugal
Abstract
Using the methods of regression analysis on the basis of simulation data, a model is constructed for predicting the queue size of an input self-similar packet flow, distributed according to the Pareto law, when it is transformed into a flow with an exponential distribution. Since the amount of losses in the general case gives no information about how efficiently the buffer memory space is used in the process of transforming a self-similar packet flow, a quality metric (penalty), which is a complex score, was introduced to assess the quality of the models after training. This criterion considers both packet loss during functional transformations and ineffective use of buffer space in switching nodes. The best model for predicting the queue size when servicing a self-similar packet flow was chosen using the following characteristics: the coefficient of determination, the root-mean-square regression error, the mean absolute error, and the penalty score. The best models in terms of the investigated characteristics are those using isotonic regression and support vector regression.
Keywords
Telecommunication network, self-similar traffic, Hurst exponent, Pareto distribution, packet loss,
regression analysis, quality metrics, penalty score, machine learning.
1. Introduction
The main reason leading to a buffer overflow is the presence of a long-term dependence in network traffic
due to its self-similarity, as a result of which the total cumulative effect in a wide range of delays can
significantly differ from that observed in a short-term dependent process [1]. To eliminate self-similarity of
network traffic, various models and traffic transformation devices are used, one of which is the asynchronous
simulation model described in [2-4], for which there is a software implementation [5].
An important indicator of the operation of this model is the queue size used in the traffic transformation process. Since, due to limited computer resources, the queue cannot be of infinite size, the problem arises of predicting the queue size depending on the measure of self-similarity of the input traffic, the Hurst exponent.
The solution to the problem of finding the optimal buffer size for a given value of the Hurst exponent H
can be found using the methods of regression analysis, based on simulation data obtained using the
developed software [5].
2. Statement of the problem
Using machine learning methods, it is necessary to develop a model to predict the queue size depending
on the Hurst exponent value based on the data obtained when performing the transformation of an input self-
similar flow distributed according to the Pareto law into a flow having an exponential distribution.
YRID-2020: International Workshop on Data Mining and Knowledge Engineering, October 15-16, 2020, Stavropol, Russia
EMAIL: kbytw@mail.ru (Gennadiy Linets); roman.voronkin@gmail.com (Roman Voronkin); mitnik2@yandex.ru (Svetlana Govorova);
ilya0693@yandex.ru (Ilya Palkanov); carlos.grilo@ipleiria.pt (Carlos Grilo)
ORCID: 0000-0002-2279-3887 (Gennadiy Linets); 0000-0002-7345-579X (Roman Voronkin); 0000-0002-3225-1088 (Svetlana Govorova);
0000-0003-0751-3928 (Ilya Palkanov); 0000-0001-9727-905X (Carlos Grilo)
© 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
Since machine learning includes many methods, at the initial stage it is advisable, for later comparison with more complex models built, in particular, with deep learning methods, to consider only the methods of pairwise regression analysis, isotonic regression and support vector machines.
Let us introduce a quality metric (penalty), which is a complex score and considers both packet loss during traffic transformation and inefficient use of buffer space.
Next, we choose the best model for predicting the queue size, depending on the Hurst exponent of the input flow, using the following quality metrics: the coefficient of determination; the root mean square error of regression; the mean absolute error; the penalty score value.
When setting the problem, special attention should be paid to testing the resulting models. In this case, the classical approach of dividing the entire data set into training and test samples is not acceptable: on the test sample we would obtain only an estimate of the number of lost packets, and therefore an estimated penalty, based on the difference between the predicted and actual buffer sizes. Instead, the obtained models must be tested by simulating the traffic transformation with a queue size limitation, where the queue size is determined by applying the tested model to the sequence being converted. This task is not trivial and is not discussed in this article.
3. The solution of the problem
The simulation model presented in [2] transforms the input flow of packets, which is known to be self-similar, into a flow with a given distribution law, in particular an exponential one. The object of transformation is the one-dimensional distribution density of the time intervals between packets of the input flow. Using the developed model, 11,000 tests were carried out and data were obtained for statistical analysis.
Since the amount of losses in the general case gives no information about how efficiently the queue is used in the process of traffic transformation, to assess the quality of the resulting model we introduce a quality metric, the penalty score, which takes into account not only the amount of losses but also irrational use of buffer memory.
Let us define y_i as the true value of the queue size in the sample and ŷ_i as the predicted value of the queue size corresponding to the true value y_i. If y_i > ŷ_i, we will penalize the learning system by α·(y_i − ŷ_i). If y_i < ŷ_i, the amount of the penalty depends on the value of the difference Δ_i = ŷ_i − y_i: for Δ_i > δ the amount of the penalty is β·(Δ_i − δ), and 0 otherwise. Let us illustrate this with an example (Figure 1).
Figure 1: Graph of the dependence of the amount of the penalty on the buffer volume
Consider three cases, each of which corresponds to a true queue size y_1, y_2 and y_3. Suppose the predicted values of the queue sizes coincide in each of the three cases, in other words ŷ_1 = ŷ_2 = ŷ_3 = ŷ. Then, in the first case, y_1 > ŷ and the amount of the penalty is determined as α·(y_1 − ŷ). In the second case ŷ > y_2 and ŷ − y_2 ≤ δ, so the penalty is 0; ŷ is assumed to be an acceptable queue size for y_2. In the third case ŷ − y_3 > δ, and the amount of the penalty is determined from the expression β·(ŷ − y_3 − δ).
Thus, the amount of the penalty score will be determined from the equation:
yi yˆ , if yi yˆi ,
pi yˆi yi , if yˆi yi ,
0, otherwise.
The total penalty for all trials is determined as the arithmetic mean of the penalties for each trial:
p = (1/n) · Σ_{i=1}^{n} p_i,
where n is the number of tests. In the process of training the model, it is necessary to ensure the minimum value of the penalty over all tests, in other words p → min.
The presented system of penalties introduces three hyperparameters: α, β and δ, where α > 0 and β > 0; their values are fixed before training.
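As a sketch, the penalty score defined above can be implemented in a few lines of NumPy; the default hyperparameter values used here (α = β = 1, δ = 0.5) are illustrative assumptions, not necessarily the paper's settings:

```python
import numpy as np

def penalty_score(y_true, y_pred, alpha=1.0, beta=1.0, delta=0.5):
    """Mean penalty over all trials, per the piecewise rule above.

    Under-prediction (y_i > ŷ_i) risks packet loss and costs alpha·(y_i − ŷ_i);
    over-prediction wastes buffer memory and costs beta·(ŷ_i − y_i − delta)
    once the slack ŷ_i − y_i exceeds the tolerance delta, and 0 otherwise.
    The default hyperparameter values are illustrative assumptions.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    under = y_true - y_pred            # positive when the buffer is too small
    over = y_pred - y_true             # positive when the buffer is too large
    p = np.where(under > 0, alpha * under,
                 np.where(over > delta, beta * (over - delta), 0.0))
    return p.mean()
```

For instance, with a true queue size of 10, a prediction of 8 is penalized by α·2, a prediction of 10.2 stays within the tolerance δ and costs nothing, and a prediction of 14 is penalized by β·(4 − δ).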
3.1. The initial data analysis
Figure 2 shows a scatter plot of the queue size as a function of the Hurst exponent. The figure clearly shows that there is a certain correlation between the Hurst exponent and the buffer size [4].
Let us first group the tests by the value of the Hurst exponent and then select 30 groups to estimate the
spread of the queue size.
Figure 2: The scatter plot of queue size in dependence of the Hurst exponent
Figure 3: The box-plot of the 30 analyzed groups
Next, we can build a box-plot for each group. It follows from Figure 3 that the largest number of upper outliers is observed in the first 10 groups, which correspond to Hurst exponent values close to 0.5. Consequently, at these values of the Hurst exponent, losses may occur because the required buffer size will be greater than the predicted one.
For groups 28 to 30, there are significant lower outliers, which leads to inefficient use of buffer memory.
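The grouping step of this section can be sketched with pandas; the data below are synthetic stand-ins for the 11,000 simulation tests, and the column names are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the simulation results (hypothetical columns).
df = pd.DataFrame({"hurst": rng.uniform(0.5, 1.0, 11_000)})
df["queue"] = 1000 * (df["hurst"] - 0.5) ** 2 + rng.normal(0, 20, len(df))

# Split the Hurst range into 30 equal-width groups, as in the text.
df["group"] = pd.cut(df["hurst"], bins=30, labels=False)

# Per-group quartiles of the queue size (these feed a box-plot like Figure 3).
spread = df.groupby("group")["queue"].quantile([0.25, 0.5, 0.75]).unstack()
```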
3.2. The regression analysis
Machine learning is a subset of artificial intelligence that studies algorithms able to learn without direct programming. Linear regression is a typical representative of machine learning algorithms [7].
There are the following tasks solved by machine learning: supervised learning, unsupervised learning, reinforcement learning, active learning, knowledge transfer, etc. Regression (as well as classification) belongs to the class of supervised learning problems, where a certain target variable must be predicted for a given set of features of the observed object. As a rule, in supervised learning problems, experience E is represented as a set of pairs of features and target variables: D = {(x_i, y_i)}, i = 1, …, n. In the case of linear regression, the feature description of an object is a real vector x ∈ R^m, where R is the set of real numbers, and the target variable is a scalar y ∈ R. The simplest measure of the quality L for the regression problem is
L(y, ŷ) = (y − ŷ)²,
where ŷ is an estimate of the real value of the target variable [7, 8].
Let us restore the dependence shown in Figure 2 using the methods of regression analysis.
The basis of regression analysis is the ordinary least squares (OLS) method, according to which the function y = f(x) is taken as the regression equation such that the sum of the squared differences is minimized:
S = Σ_{i=1}^{n} (y_i − ŷ_i)² → min.
Using the methods of pairwise regression analysis, we will carry out a statistical analysis of the data
obtained by transforming an input self-similar flow distributed according to the Pareto law into a flow having
an exponential distribution. Let us examine the methods widely used in practice, which allow finding the
buffer size for the input flow with a given Hurst exponent.
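As a minimal illustration of the least squares criterion for paired regression, the single-feature OLS estimates can be computed in closed form (toy data, not the simulation results):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS estimates for the paired model y ≈ b0 + b1·x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Toy check: points on the exact line y = 2 + 3x are recovered exactly.
b0, b1 = ols_fit([0, 1, 2, 3], [2, 5, 8, 11])
```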
3.3. The linear regression analysis
In this case, the relationship between the Hurst exponent H and the queue size ŷ is determined according to the linear equation:
ŷ = b0 + b1·H.
We obtain the regression equation using the least squares method:
(1)
The result of the fitting for the current model is shown on Figure 4.
Figure 4: The linear model report, built using the statsmodels package of the Python programming language
Thus, we obtained a statistically significant result. The quality metrics values for the obtained linear regression model are shown in Table 1.
Table 1
Linear regression model quality metrics
Quality Metric Value
Coefficient of determination R2 0.584
Root mean square error of regression RMSE 130.908
Mean absolute error MAE 96.808
Penalty score p 55.710
The obtained value of the coefficient of determination indicates that the linear model explains only about 58% of the variation of the queue size with the Hurst exponent. This result is unsatisfactory for practice; therefore, in the simplest case, it makes sense to consider other methods based on the linearization of nonlinear dependencies. A nonlinear dependence can thereby be reduced to a linear one, after which the least squares method can be applied.
3.4. The hyperbolic regression
For the hyperbolic regression, the relationship between H and ŷ can be described as follows:
ŷ = b0 + b1/H.
The linearization of the hyperbolic equation is achieved by replacing 1/H with a new variable, which we denote by z [6]. Then the hyperbolic regression equation takes the form ŷ = b0 + b1·z. We obtain the regression equation using the least squares method:
ŷ = 875.438 − 489.379/H.    (2)
The result of the fitting for the current model is shown on Figure 5.
Figure 5: The first hyperbolic model report, built using the statsmodels package of the Python programming language
Thereby, we obtained a statistically significant result. The quality metrics values for the obtained first hyperbolic regression model are shown in Table 2.
Table 2
Quality metrics of a hyperbolic regression model
Quality Metric Value
Coefficient of determination R2 0.453
Root mean square error of regression RMSE 150.218
Mean absolute error MAE 110.511
Penalty score p 63.841
The obtained value of the coefficient of determination indicates that this model explains about 45% of the variation of the queue size, which is considerably worse than the coefficient of determination of the linear model. For this reason, it makes sense to consider a different hyperbolic regression model:
ŷ = 1/(b0 + b1·H).
Using the least squares method, we obtain the regression equation for this model:
ŷ = 1/(0.0399 − 0.0397·H).    (3)
The result of the fitting for the current model is shown on Figure 6.
Figure 6: The second hyperbolic model report, built using the statsmodels package of the Python programming language
Accordingly, we obtained a statistically significant result. The quality metrics values for the obtained second hyperbolic regression model are shown in Table 3.
Table 3
Quality metrics of the modified hyperbolic model
Quality Metric Value
Coefficient of determination R2 0.591
Root mean square error of regression RMSE 223.798
Mean absolute error MAE 77.543
Penalty score p 39.537
The obtained value of the coefficient of determination is about 59%, which is slightly better than that of the linear model.
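Both hyperbolic forms reduce to ordinary least squares through a change of variable. A sketch of the first form on synthetic data (the generating coefficients merely echo equation (2) and are not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.uniform(0.5, 1.0, 400)
y = 875.0 - 490.0 / H + rng.normal(0, 10, 400)  # synthetic data echoing Eq. (2)

z = 1.0 / H                                      # linearizing substitution z = 1/H
A = np.column_stack([np.ones_like(z), z])        # design matrix for ŷ = b0 + b1·z
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]    # OLS on the linearized model
y_hat = b0 + b1 / H                              # back-substituted predictions
```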
3.5. The power regression
In the case of the power regression, the relationship between H and ŷ is:
ŷ = b0·H^b1.
This equation is nonlinear in the coefficient b1 and belongs to the class of regression models that can be reduced to linear form using transformations [6]:
ln y = ln b0 + b1·ln H.
The power function is internally linear; therefore, estimates of the unknown parameters of its linearized form can be calculated using the classical least squares method. The regression equation is:
(4)
The result of the fitting for the current model is shown on Figure 7.
Figure 7: The power model report, built using the statsmodels package of the Python programming language
Thus, we obtained a statistically significant result. The quality metrics values for the obtained power regression model are shown in Table 4.
Table 4
Power regression model quality metrics
Quality Metric Value
Coefficient of determination R2 0.699
Root mean square error of regression RMSE 128.675
Mean absolute error MAE 72.823
Penalty score p 53.042
The obtained value of the coefficient of determination is about 70%, which is noticeably better than the coefficient of determination of the linear model.
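The log-log linearization of the power model can be sketched in the same way; the data below are synthetic and the generating coefficients are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.uniform(0.5, 1.0, 400)
y = 600.0 * H ** 3.0 * np.exp(rng.normal(0, 0.05, 400))  # synthetic power law

# ln y = ln b0 + b1·ln H is linear in ln H, so OLS applies after the transform.
A = np.column_stack([np.ones_like(H), np.log(H)])
c0, b1 = np.linalg.lstsq(A, np.log(y), rcond=None)[0]
b0 = np.exp(c0)                                   # undo the log of the intercept
```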
3.6. The exponential regression
For the exponential regression, the relationship between H and ŷ is:
ŷ = b0·e^(b1·H).
This equation is nonlinear with respect to the coefficient b1 and belongs to the class of regression models that are reduced to a linear form using transformations [6]:
ln ŷ = ln b0 + b1·H.
The exponential function is internally linear; therefore, estimates of the unknown parameters of its linearized form can be calculated using the classical least squares method. The regression equation is:
ŷ = 2.926·e^(5.089·H).    (5)
The result of the fitting for the current model is shown in Figure 8. In this way, we obtained a statistically significant result. The quality metrics values for the obtained exponential regression model are shown in Table 5.
Figure 8: The exponential model report, built using the statsmodels package of the Python programming language
Table 5
Exponential regression model quality metrics
Quality Metric Value
Coefficient of determination R2 0.745
Root mean square error of regression RMSE 112.443
Mean absolute error MAE 65.199
Penalty score p 46.768
The obtained value of the coefficient of determination indicates that the exponential model explains about 74% of the variation of the queue size, which is the best result among the paired regression models in terms of the coefficient of determination, RMSE and MAE.
Let us carry out a comparative analysis of the results obtained and build graphs of the regression equations (1)-(5) (Figure 9). It is evident that the exponential regression fits the relationship between the Hurst exponent and the buffer size most closely.
Figure 9: The comparative analysis of the results of paired regression analysis
The trivial paired regression models described above do not describe the dependence of the queue size on the Hurst exponent adequately, so we complicate the model. One possible way is to use isotonic regression.
3.7. The isotonic regression
In statistics, isotonic (monotonic) regression is a method of fitting a free-form line to a sequence of observations under the following constraints: the fitted line must be non-decreasing (or non-increasing) over the domain and must lie as close as possible to the observations [13]. In the process of constructing an isotonic curve, the following problem is solved [13]:
Σ_i w_i·(y_i − ŷ_i)² → min subject to ŷ_1 ≤ ŷ_2 ≤ … ≤ ŷ_n,
where the weighting factors satisfy w_i > 0. This gives a vector consisting of non-decreasing elements that are closest to the observations in terms of the root mean square error. In practice, this list of elements forms a piecewise linear function.
Let us train the isotonic regression model using the scikit-learn package of the Python 3 programming language [9] and build a graph corresponding to the model built using isotonic regression (Figure 10).
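A minimal sketch of the isotonic fit with scikit-learn, on synthetic data with an assumed increasing trend:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(4)
H = rng.uniform(0.5, 1.0, 1000)
y = 1000 * (H - 0.5) ** 2 + rng.normal(0, 20, 1000)  # synthetic increasing trend

# Non-decreasing fit: the constraint matches the monotone H -> queue relation.
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
y_hat = iso.fit_transform(H, y)

# Predictions for new Hurst values follow the fitted piecewise-linear curve.
pred = iso.predict(np.array([0.55, 0.75, 0.95]))
```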
In Table 6 the quality metrics values are shown for the obtained isotonic regression model.
Table 6
Isotonic regression quality metrics
Quality Metric Value
Coefficient of determination R2 0.928
Root mean square error of regression RMSE 54.437
Mean absolute error MAE 39.501
Penalty score p 21.269
The obtained value of the coefficient of determination indicates that this model explains about 93% of the variation of the queue size, which is much better than the models built using paired regression methods. Moreover, the penalty value for isotonic regression is about half the corresponding value for the best paired regression.
Figure 10: Plotting an isotonic curve to a dataset
3.8. The support vector regression
The Support Vector Machine (SVM) is a linear algorithm used in classification and regression problems (for regression problems it is called SVR, Support Vector Regression). The main idea of the method is to construct a hyperplane that separates the sampled objects in an optimal way [10-12].
Support vector machines maximize the margin between the objects and the separating hyperplane, which is closely related to minimizing the likelihood of overfitting. Moreover, the method makes it easy to pass to the construction of a nonlinear separating surface by means of the kernel trick [10, 11].
Let us train the model based on SVR. The nonlinear nature of the relationship between the Hurst exponent and the queue size indicates the need to choose a radial basis function (RBF) kernel for the SVR model. The model was trained using the scikit-learn package of the Python 3 programming language [12]. Figure 11 shows the graph of the relationship between the queue size and the Hurst exponent.
Figure 11: The plot corresponding to the trained support vector machine
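A minimal sketch of the SVR fit with an RBF kernel using scikit-learn; the data are synthetic and the C and epsilon values are illustrative defaults, not the paper's settings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(5)
H = rng.uniform(0.5, 1.0, 1000).reshape(-1, 1)        # feature matrix (n, 1)
y = 1000 * (H.ravel() - 0.5) ** 2 + rng.normal(0, 20, 1000)

# RBF kernel for the nonlinear H -> queue-size relation; C and epsilon are
# illustrative defaults, not the values used in the paper.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=1.0))
svr.fit(H, y)
pred = svr.predict(np.array([[0.55], [0.95]]))
```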
In Table 7 the quality metrics values are shown for the obtained support vector regression model.
Table 7
Support vector model quality metrics
Quality Metric Value
Coefficient of determination R2 0.901
Root mean square error of regression RMSE 63.868
Mean absolute error MAE 52.506
Penalty score p 18.374
The obtained value of the coefficient of determination is about 90%, which is slightly worse than that of the isotonic regression. However, the penalty score for this method is lower than for isotonic regression.
3.9. Comparative analysis of models
The research results for estimating and choosing the best method for predicting the queue size from the Hurst exponent are presented in Table 8.
Based on the data of this summary table, it can be concluded that the model with the best predictive ability according to the introduced quality metric is the one built using the support vector machine. Within the framework of this study, it can also be concluded that complicating the model by the SVR transition to a higher-dimensional feature space does not lead to an improvement in the quality of learning.
Table 8
The comparison between the considered regression methods for δ = 0.5