<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Analysis of the Application of Stochastic Gradient Descent to Detect Network Violations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nataliya</forename><surname>Boyko</surname></persName>
							<email>nataliya.i.boyko@lpnu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Iryna</forename><surname>Khomyshyn</surname></persName>
							<email>khomyshyni@gmail.com</email>
						</author>
						<author>
							<persName><forename type="first">Natalia</forename><surname>Ortynska</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Viktoria</forename><surname>Terletska</surname></persName>
							<email>viktoriia.o.terletska@lpnu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Oksana</forename><surname>Bilyk</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Oleksandra</forename><surname>Hasko</surname></persName>
							<email>oleksandra.i.hasko@lpnu.ua</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>Profesorska Street 1</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">International Conference on Computational Linguistics and Intelligent Systems</orgName>
								<address>
									<addrLine>May 12-13</addrLine>
									<postCode>2022</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Analysis of the Application of Stochastic Gradient Descent to Detect Network Violations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B65006E2367B6DC6FFF773ED88A0D14B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Machine Learning</term>
					<term>Artificial Intelligence</term>
					<term>stochastic gradient descent</term>
					<term>stochastic algorithms</term>
					<term>negative log likelihood</term>
					<term>convolutional neural network</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Over the past decade, data sizes have grown faster than processor speeds. In this context, the capabilities of statistical machine learning methods are limited by computational time, not sample size. A more accurate analysis reveals qualitatively different trade-offs in the case of small and large learning problems. This work considers the occurrence of a large-scale case involving the basic algorithm's computational complexity and the implementation of optimization in non-trivial ways. Optimization algorithms such as stochastic gradient descent, which demonstrate exceptional performance for large-scale problems, are considered. The work compares the algorithms of gradient descent, stochastic gradient, and standard deterministic gradient descent. Here are given ways to apply them in machine learning and considered a formal description of the operation of gradient descent, stochastic gradient, and standard deterministic gradient descent algorithms. Here is analyzed the difference in the process of algorithms. The advantages and disadvantages of each of them are given. A comparison of both approaches is considered: deterministic and Stochastic algorithms. This work has analyzed the effectiveness of these two approaches. Convolutional (convolutional) neural networks are considered, which, unlike conventional deep neural networks, use convolutions to find patterns in images, thereby improving prediction in image recognition problems. A model of a convolutional neural network is presented. Here are analyzed the parameters of training the input sample and its testing. As a result of using a simulation of the usual gradient descent, when the weight is updated only after viewing all the elements, the value of the loss function is considered. The analysis of the selected problem and model also justifies the results of its use.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>As a sub-branch of artificial intelligence, machine learning is actively developing in various areas of technology and human life. It is used in several computational problems in which the development and programming of explicit algorithms with good performance are difficult or impossible. Examples of its applications include email filtering, detecting network criminals or malicious insiders, Optical Character Recognition, ranking training, computer vision, etc.</p><p>For so-called" training", machine learning often uses statistical techniques and algorithms, one of which is the gradient descent method.</p><p>However, when training the model, a massive sample of data is most often used, as a result of which the standard deterministic gradient descent does not show the best time results. A modification of deterministic gradient descent is stochastic gradient descent, which provides accelerated learning without significant loss of efficiency.</p><p>This work aims to analyze and reveal the operation of stochastic gradient descent, explore the application of this algorithm in machine learning training, and compare it with a deterministic algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Review of literature sources</head><p>Almost all machine learning problems are optimization problems, which is to find the extremum of the objective function <ref type="bibr" target="#b0">[1]</ref>. Finding a model and the correct objective function is the first step in solving the machine learning problem. Then you need to define analytical and numerical methods that will help solve the optimization problem <ref type="bibr" target="#b1">[2]</ref>.</p><p>The paper discusses first-order methods, namely methods that use the gradient of a function. In the author's work <ref type="bibr" target="#b2">[3]</ref>, the general optimization method is the usual gradient descent. The author of <ref type="bibr" target="#b4">[5]</ref> describes a formal description and description of the algorithm for solving the gradient descent method. In his work <ref type="bibr" target="#b4">[5]</ref>, the parameters are updated iteratively in the direction of the gradients of the objective function. The mechanism of the gradient descent method is described in the paper <ref type="bibr" target="#b1">[2]</ref>. The gradient descent method is easy to implement. The solution is a global minimum or maximum when the objective function is convex <ref type="bibr" target="#b5">[6]</ref>. It often converges more slowly if the variable is closer to the optimal solution, which requires more thorough iterations <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b5">6]</ref>. Due to the high computational complexity in each iteration, stochastic gradient descent (SGD) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b18">19]</ref> is proposed for a large amount of data. Therefore, the authors <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b9">10]</ref> present an analysis of the application of this algorithm in the process of machine learning training and its comparison with a deterministic algorithm.</p><p>The manually adjusted learning rate strongly affects the SGD effect. This is a complex problem of setting the appropriate learning rate value. It is described in the authors' works <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref>. The study uses popular machine learning libraries for the Python language (such as PyTorch, TensorFlow, and SkLearn). The standard algorithm is not even implemented, but only its stochastic version exists <ref type="bibr" target="#b11">[12]</ref><ref type="bibr" target="#b12">[13]</ref><ref type="bibr" target="#b13">[14]</ref><ref type="bibr" target="#b14">[15]</ref>.</p><p>Using a simulation of the usual gradient descent described in <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b15">16]</ref>, the model's accuracy becomes very low when the weight is updated only after viewing all the elements. It is described that with this approach, gradients are updated more accurately. In this paper, the authors conduct a study with many epochs, where gradient descent shows the best accuracy of training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Materials and methods</head><p>Gradient descent is one of the most popular optimization algorithms and the most common way to optimize neural networks. At the same time, each modern deep learning library contains implementations of various algorithms for optimizing gradient descent. However, these algorithms are often used as black-box optimizers because it is difficult to find practical explanations for their strengths and weaknesses.</p><p>The study aims to provide an analysis of the behaviour of various algorithms for optimizing Gradient Descent, which will help to apply them. First, you should consider different options for gradient descent. The next step is to summarize the problems during training. Therefore, this work presents the most common optimization algorithms that demonstrate their motivation to solve these problems and lead to the conclusion of update rules.</p><p>Gradient descent is a way to minimize an objective function</p><formula xml:id="formula_0">) ( J that is parameterized through model parameters d R </formula><p> by updating parameters in the opposite direction of the objective function</p><formula xml:id="formula_1">) (  J </formula><p>gradient to parameters. The learning rate  determines the size of the steps that need to be taken to reach the (local) minimum. When analyzing, you need to monitor the direction of inclination of the surface down, which is created by the objective function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Gradient Descent</head><p>Gradient descent is an iterative first-order optimization algorithm in which steps are taken to find the local minimum of a function proportional to the opposite value of the process's gradient (or approximate gradient) at the current point.</p><p>In short, since the gradient shows us the direction and speed of movement of the function relative to all arguments, to find the local minimum, we must move opposite to the action of the process (if it increases, we move it to decrease, if it declines, we move even more to the minimum).</p><p>For an intuitive understanding of gradient descent, we can imagine the movement of the ball (arguments) along with a parabolic bucket (function) in the direction of the bottom (minimum). Still, our algorithm will perform its role (Figure <ref type="figure" target="#fig_0">1</ref>). When the ball is on the left side (i.e., the function decreases), the algorithm must "push" the ball forward to reach the bottom. When it is, on the right side (the function increases), as shown in Figure <ref type="figure" target="#fig_0">1</ref>, then the algorithm, on the contrary, "pulls" it down. This movement continues step by step until the ball reaches the bottom. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Standard deterministic gradient descent</head><p>Deterministic refers to an algorithm that returns the same result each time, regardless of the number of its runs.</p><p>Normal gradient descent can return the same values because it is performed on the total amount of data that comes to its input (that is, it works with the same input data and also serves the same steps on it without any accidents, which gives the same result at the end).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Stochastic gradient descent</head><p>Stochastic refers to an algorithm that returns different values each time, i.e. the algorithm contains a certain stochastic (random) relationship.</p><p>Different output values occur because stochastic gradient descent is not performed on the entire volume of input data but stochastically selects only a certain predefined number of records for each epoch. Such a set of individual records is called a batch.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Application in machine learning</head><p>Gradient descent is widely used in machine learning because it allows the model to learn from its mistakes and inaccuracies.</p><p>When training the model, a loss or error function is introduced, which provides information about the difference between the model's expected results. Since we cannot change the input data from the dataset, we must optimize the coefficients of our model.</p><p>The gradient of this function shows us how each coefficient affects the final error function. Applying a gradient descent to the loss function after this allows us to achieve the minimum of this function by adjusting the coefficients (i.e., make the error minimal). In this way, the configured model will enable us to use it for our tasks.</p><p>Thus, it becomes clear that gradient descent itself (stochastic and standard) is not used in machine learning. Its implementation is only an auxiliary tool for minimizing loss or error functions in machine learning algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Formal description of gradient descent algorithms 4.1. Mathematics in gradient descent</head><p>The gradient descent algorithm looks like this: 1. The calculation of new values of variables is presented in equation 1: </p><formula xml:id="formula_2"> = − = − = n i i w f grad n w w f grad w w 1 )), (<label>(</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Mathematics in Stochastic Gradient Descent</head><p>The stochastic gradient descent algorithm looks like this: 1. Randomly shuffle the dataset. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Difference of algorithms</head><p>As can be seen from points 4.1 and 4.2, there is no difference in the formula for calculating new values for both algorithms. Differences appear in the approach to updating variables. While a standard gradient descent updates variables only after all n training samples have been passed, the stochastic algorithm updates these variables after passing only randomly selected models, which increases the number of updates per epoch.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Comparison of both approaches</head><p>On the one hand, a deterministic algorithm behaves more precisely since it processes a complete dataset to update data, but such training takes a very long time. On the other hand, the stochastic algorithm is not as accurate after each update, but its speed allows you to achieve the required accuracy in much less time.</p><p>Since machine learning works with massive data sets, the stochastic algorithm is much more efficient for its tasks. The standard algorithm is not even implemented in most popular machine learning libraries for the Python language (such as PyTorch, TensorFlow, and SkLearn), but only its stochastic version exists.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Recognition by convolutional neural networks trained using stochastic gradient descent 5.1. Analysis and justification of the selected task and model</head><p>Digit recognition is still a relevant topic today, which can be applied in various areas of our lives, starting from recognising bank card numbers in bank applications and ending with recognising mathematical expressions.</p><p>As you know, convolutional neural networks are best suited for image recognition, which, unlike conventional deep neural networks, use convolutions to find patterns in images that allow for improved prediction. Convolutions also compress images, which significantly increases the speed of analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Description and analysis of the selected dataset</head><p>To recognize numbers, a well-known dataset is called MNIST. MNIST (short for Mixed National Institute of Standards and Technology)a dataset of samples of handwritten writing of numbers. It is a standard proposed by the US National Institute of standards and technology to calibrate and compare image recognition methods using machine learning, primarily based on artificial neural networks. The dataset contains 60,000 images for training and 10,000 images for testing (Figure <ref type="figure" target="#fig_3">2</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Convolutional neural network model</head><p>To implement a convolutional neural network, I chose the following architecture, a diagram of which can be seen in Figure <ref type="figure" target="#fig_5">3</ref>:  </p><formula xml:id="formula_3">• input 28x28x1 • convolutional layer 24x24x10 • Max-</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Training and testing parameters</head><p>The data is randomly divided into 60,000 training images and 10,000 test images to prevent the possibility of retraining, when the model shows good results in training, but then cannot distinguish samples that were not seen during training.</p><p>As an optimizer for training, a stochastic gradient descent was chosen with the experimentally selected best learning rate learningrate = 0.01.</p><p>The training will last for 3 epochs with a different number of images in each betcha for the step of the stochastic gradient algorithm.</p><p>The negative logarithmic likelihood function</p><formula xml:id="formula_4">) log( ) ( y y L − =</formula><p>is chosen as the loss function, which is simple, but at the same time powerful in classification problems.</p><p>Accuracy testing will reflect the ratio of the number of correctly predicted images to the total number of images in the dataset</p><formula xml:id="formula_5">      = % 100 * N N Accuracy true .</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Work results</head><p>The graph shown in Figure <ref type="figure" target="#fig_6">4</ref> shows the change in the value of the loss function relative to the number of elements seen during a pure stochastic gradient descent (1 element per batch). At the same time, the accuracy of the neural network for 3 epochs was only 8%, which means that only about 800 out of 10,000 images were correctly identified by the network. The final error was about 4.2315. The graph shown in Figure <ref type="figure" target="#fig_7">5</ref> shows the change in the value of the loss function relative to the number of elements seen with a batch size of 64 elements. At the same time, the accuracy of the neural network for 3 epochs was 97%. The final error was about 0.327. The graph shown in Figure <ref type="figure" target="#fig_8">6</ref> shows the change in the value of the loss function relative to the number of elements seen with a batch size of 128 elements. At the same time, the accuracy of the neural network for 3 epochs was 96%. The final error was about 0.421. The graph shown in Figure <ref type="figure" target="#fig_9">7</ref> shows the change in the value of the loss function relative to the number of elements seen with a batch size of 256 elements. At the same time, the accuracy of the neural network for 3 epochs was 93%. The final error was about 0.686. Since there is no ready-made implementation of normal gradient descent among known libraries, we can only stimulate it with stochastic gradient descent by setting the number of elements in the batch to 60,000 (i.e., to update the algorithm weights, we will need to view 60,000 images. The graph shown in Figure <ref type="figure" target="#fig_10">8</ref> shows the change in the value of the loss function relative to the number of elements seen during the simulated normal gradient descent. At the same time, the accuracy of the neural network for 3 epochs was only 9%. The final margin of error was about 2.284. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Discussion of results</head><p>When we consider only one random element to change the weights in pure stochastic gradient descent, the loss function has the highest values. One random component is not enough to describe the trend of movement of the loss function and the corresponding gradients for all other weights. It can be seen that the value of the loss function is constantly moving up and down. Obviously, with this movement of the process, the accuracy of neural network predictions was very low (only 8%).</p><p>When we use mini-batch stochastic gradient descent for a different number of elements in a batch (64/128/256), the value of the loss function increases and decreases alternately. However, in all cases, it still moves to a minimum. The only difference with a different number of batch elements is the frequency of these movements since the weights are updated with a different number of viewed features. At the same time, mini-batching showed the highest accuracy of predictions, which was 97% for a 64-element batch.</p><p>As a result of using the standard gradient descent simulation, when the weight update takes place only after viewing all the elements, we see that the value of the loss function also falls. Still, since the weight update is so rare, this movement is prolonged, so the model's accuracy also becomes very low. However, the gradients with this approach are updated more accurately. Presumably, with more epochs, gradient descent would show better training accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusions</head><p>Looking at the results of stochastic gradient descent and comparing them with the simulated average gradient, we can see that SGD performs training much faster because only three epochs are enough to achieve 97% accuracy with the correct number of elements in the batch. At the same three epochs, gradient descent reduced the value of the loss function by several tenths.</p><p>At the same time, pure GHS also shows poor accuracy results since it cannot describe the change in weights using a single random element.</p><p>Considering the previous thesis, we can conclude that using a mini-party stochastic gradient is more effective and more in demand since it performs the fastest and best training of the model.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Intuitive image of the gradient descent algorithm</figDesc><graphic coords="3,177.00,319.40,255.00,159.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example of the appearance of images from the MNIST dataset</figDesc><graphic coords="5,227.00,398.60,155.00,155.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: CNN diagram</figDesc><graphic coords="6,86.20,100.60,453.40,206.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 1 element</figDesc><graphic coords="7,173.20,72.00,262.60,197.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 64 elements</figDesc><graphic coords="7,146.40,371.80,316.20,236.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 128 elements</figDesc><graphic coords="8,150.80,72.00,307.60,230.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 256 elements</figDesc><graphic coords="8,155.60,405.00,297.80,221.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 60,000 elements (simulation of a normal gradient descent)</figDesc><graphic coords="9,159.60,72.00,290.00,216.60" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Another approach to polychotomous classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">The Elements of Statistical Learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
			<publisher>SpringerVerlag</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A look trough methods of intellectual data analysis and their applying in informational systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Boyko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">XIth International Scientific and Technical Conference Computer Sciences and Information Technologies (CSIT)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="183" to="185" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Prediction Games and Arcing Algorithms</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
		<idno type="DOI">10.1162/089976699300016106</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="1493" to="1517" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An ensemble learning method for classification of multiple-label data</title>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Pan</surname></persName>
		</author>
		<idno type="DOI">10.12733/jcis14783</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Computational Information Systems</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="4539" to="4546" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Characteristics of the quality of the regression model</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">O</forename><surname>Pererodov</surname></persName>
		</author>
		<ptr target="https://www.ereading.club/chapter.php/1002275/18/" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gallinari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1737" to="1754" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The Tradeoffs of Large Scale Learning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bousquet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="161" to="168" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">On-line Learning for Very Large Datasets</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Stochastic Models in Business and Industry</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="137" to="151" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Application of Ensemble Methods of Strengthening in Search of Legal Information</title>
		<author>
			<persName><forename type="first">N</forename><surname>Boyko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kmetyk-Podubinska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Andrusiak</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-82014-5_13</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-82014-5_13" />
	</analytic>
	<monogr>
		<title level="j">Lecture Notes on Data Engineering and Communications Technologies</title>
		<imprint>
			<biblScope unit="volume">77</biblScope>
			<biblScope unit="page" from="188" to="200" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bousquet</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<pubPlace>Palaiseau, France</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Ecole Polytechnique</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Thèse de doctorat</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Training Linear SVMs in Linear Time</title>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th ACM SIGKDD</title>
				<meeting>the 12th ACM SIGKDD</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Evaluation of the effectiveness of different image skeletonization methods in biometric security systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nazarkevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dmytruk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Hrytsyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vozna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kuza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Shevchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sheketa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Sensors Wireless Communications and Control</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="542" to="552" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICML</title>
				<meeting>ICML</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">RCV1: A New Benchmark Collection for Text Categorization Research</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">G</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="361" to="397" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Trust region Newton methods for large-scale logistic regression</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Keerthi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICML</title>
				<meeting>ICML</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="561" to="568" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Some applications of concentration inequalities to Statistics</title>
		<author>
			<persName><forename type="first">P</forename><surname>Massart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Annales de la Faculté des Sciences de Toulouse series</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="245" to="303" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">SVM optimization: inverse dependence on training set size</title>
		<author>
			<persName><forename type="first">S</forename><surname>Shalev-Shwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Srebro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ICML</title>
				<meeting>the ICML</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="928" to="935" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent</title>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
