
Online

1880-5558

10.4108/eai.16-4

Multi-Criteria Method for Comparing the Effectiveness of Gradient Descent Modifications on Benchmark Functions

Viktor Morozov

viktor.morozov@knu.ua

Vladyslav Deineha

Danylo Kovalchuk

Taras Shevchenko National University of Kyiv, 24, Bohdan Gavrilishin Str., Kyiv, 04116, Ukraine

2013

15 4 123 137

This article presents an analysis of optimization methods, in particular first-order methods, which are actively used to minimize loss functions in various problems. Optimization is a key component in both business and research, particularly in machine learning, where it plays a critical role in training models and improving their performance. The main focus is on gradient descent and its modifications: Momentum, Heavy Ball, and Nesterov. First-order methods, such as Standard Gradient Descent, are attractive because they are simple to implement, but they can suffer from oscillation and slow convergence. Modifications such as Momentum, Heavy Ball, and Nesterov can in some cases significantly improve the optimization process. A review of recent publications confirmed the importance of these methods for applied problems requiring high accuracy and efficiency. A comparison of gradient descent modifications showed that each method has its own characteristics and that the choice of the optimal approach depends on the specifics of the case. In particular, methods such as Nesterov Accelerated Gradient can significantly reduce training time in real-world conditions. In the practical part of this work, several optimization methods were tested on benchmark functions, including Himmelblau, Rosenbrock, Rastrigin, Ackley, and Beale. The methods implemented were Standard Gradient Descent, Momentum, Heavy Ball, and Nesterov. Applying the Analytic Hierarchy Process enabled a thorough evaluation of each method based on key criteria: the number of iterations, average time per iteration, total execution time, and the function value at the final iteration. This structured approach allowed for a clearer, more precise comparison, aiding in the selection of the most effective method for various optimization challenges.
According to the experimental results, the Heavy Ball method demonstrated the best results on most functions, while the Standard Gradient Descent and other methods showed mixed results, depending on the properties of the functions.

Keywords: Optimization, Machine learning, Standard Gradient Descent, Momentum method, Heavy Ball method, Nesterov method, Analytic Hierarchy Process

1. Introduction

Optimization is an important aspect in many fields, ranging from economics and engineering to bioinformatics and artificial intelligence [1]. It makes it possible to find the best solutions to complex problems by minimizing or maximizing a certain function, depending on the task at hand. In the real world, optimization is used to find the most efficient ways to use resources, improve technological processes, and solve optimal management problems [2]. Optimization plays a particularly important role in machine learning, where it is used to train models on large amounts of data by tuning the parameters of models such as neural networks, regression models, and support vector machines (SVMs). These models are used for forecasting, classification, decision making, etc. [3]. For example, neural networks can be used to solve problems related to IT projects [4].

One of the key optimization tools in machine learning is gradient descent methods, which are used to minimize loss functions. They allow quick and efficient training of machine learning models by gradually reducing the error based on the calculation of gradients. Due to their simplicity and efficiency, gradient descent methods have become the basis of many modern optimization algorithms. However, for more complex or non-uniform loss functions, standard gradient descent may not be fast enough or stable enough.

In this paper, several modifications of gradient descent, including Momentum, Heavy Ball, and Nesterov, are reviewed and compared. These methods offer different approaches to speeding up the optimization process and increasing its resistance to local minima. The main focus will be on comparing their effectiveness in terms of the number of iterations, time to reach the minimum, average time per iteration, and accuracy of the result.

2. Using optimization in machine learning

Machine learning is a vast field of study that includes data analysis and model building techniques used to solve various problems, including time series forecasting [ 5 ]. Optimization in machine learning is a key element that determines the efficiency and accuracy of models. The process of training machine learning models is to find parameters that minimize the loss function, a mathematical representation of the errors between predicted and actual values. The goal of optimization is to find the model parameters that provide the best results on new, unknown data [ 6 ].

In most machine learning tasks, the goal is to minimize the loss function that represents the model error on the training data. To achieve this, various optimization methods are used to gradually adjust the model parameters to reduce the error. One of the most common optimization methods in machine learning is gradient descent and its modifications. Gradient descent uses the derivative of the loss function to determine the direction in which the parameters should be adjusted to reduce the value of the function. Classical gradient descent, as well as its advanced versions, such as Momentum, are among the tools used to optimize neural networks and many other models. During optimization, the choice of hyperparameters, such as learning rate, number of iterations, and other parameters that affect the speed and stability of the learning process, plays an important role. Automated methods, such as grid search, help to automate this process.

Optimization challenges in machine learning:
• High dimensionality of the parameter space. In complex models, such as deep neural networks, the number of parameters can reach millions or even billions. Optimization in such a large space is computationally challenging, and therefore first-order methods are the most appropriate due to their efficiency.
• The presence of local minima. In nonlinear models, local minima are often present, which can prevent the global minimum of the loss function from being reached. Gradient descent modifications can help solve this problem.
• Convergence speed. Optimization can be time-consuming for large models and data. Therefore, optimizers that use acceleration, such as Nesterov Accelerated Gradient, can significantly reduce model training time.

Thus, optimization is a fundamental component of machine learning that determines the success of a model in solving real-world problems. Effective use of optimization methods makes it possible to create models capable of finding complex patterns in data and making accurate predictions.

3. Analysis of recent research and publications

This work [7] provides an overview of various gradient descent-based optimization algorithms and explains their strengths and weaknesses. The author sought to provide practical intuitions for understanding the behavior of these algorithms so that the user could apply them more effectively. The article discusses three main variants of gradient descent, of which the most popular is the mini-batch gradient descent method. The author also analyzes in detail the most common algorithms for optimizing stochastic gradient descent, including Momentum, accelerated Nesterov gradient, and other methods, including adaptive ones. In addition, various algorithms for optimizing asynchronous SGD are investigated, as well as additional improvement strategies such as data shuffling, curriculum learning, batch normalization, and early stopping. The main conclusion of the article is that various variants and modifications of gradient descent can be adapted for certain machine learning tasks depending on the specifics of the data and model architecture. This review is useful for our article, as our paper also compares different modifications of gradient descent, including Momentum and Nesterov. The importance of choosing the right optimization strategy for a particular task is emphasized, which resonates with our analysis of the effectiveness of these methods in different settings.

Article [8] is devoted to improving the method of gradient descent with momentum, which is widely used to minimize loss functions in machine learning. The authors consider the method with the so-called Nesterov acceleration, where the gradient is calculated not at the current position in the parameter space, but at the expected position after one step. They propose a new modification, super-acceleration, in which the gradient is taken at a position several steps ahead and the look-ahead distance is controlled by a new hyperparameter. The results show that super-acceleration of the momentum method is useful not only for the idealized problem, but also for the MNIST classification task using neural networks. An important conclusion is that this modification of gradient descent with momentum improves the convergence speed and efficiency of minimizing loss functions, which is especially relevant for large models in machine learning. In the context of our paper, this approach is relevant because our analysis also includes modifications of gradient descent, such as the momentum and Nesterov acceleration methods. The proposal to use the gradient from positions several steps ahead may provide additional advantages over standard methods.

Work [9] is devoted to the use of heavy ball momentum to accelerate gradient descent in optimization problems. The author first explains the concept of pathological curvature arising in different regions of a function and gives an overview of standard gradient descent, then demonstrates the problems that arise when gradient descent is applied to an example function. The main idea is that without momentum, gradient descent may converge too slowly due to the characteristics of the function. To solve this problem, momentum is used to adjust the current step in the direction of the previous one, speeding up convergence. Using the same example, the author shows that momentum improves the learning process and reaches the minimum much faster. This article is important for our topic because it demonstrates the heavy ball method, which is also analyzed in our study.

Paper [10] presents a new optimization method based on control theory called Controlled Gradient Descent (CGD). This approach is aimed at overcoming the shortcomings of optimization algorithms, in particular, the problems associated with the choice of an appropriate geometric structure. The effectiveness of CGD is demonstrated using various test functions, such as the Rosenbrock benchmark function, as well as a non-planar objective function and a semi-convex objective function, which are often encountered in machine learning problems. This approach is suitable for solving large-scale problems and shows promise for further development of optimization methods. The Rosenbrock function, which will also be used in our practical part of the paper, is an important tool for demonstrating the effectiveness of the method.

Paper [11] is devoted to the use of stochastic gradient descent with momentum (SGDM) for training deep neural networks (DNNs) and recurrent neural networks (RNNs), which was previously considered a difficult task due to problems with optimizing such models. The authors show that with proper initialization and careful use of parameters, both DNNs and RNNs can be trained successfully, achieving results that were previously only possible with complex second-order methods such as Hessian-Free (HF). An important aspect of the study is that improperly initialized networks cannot be trained effectively using momentum, and that absent or poorly tuned momentum significantly reduces performance. The researchers also showed that well-tuned momentum can successfully solve deep and recurrent network training tasks that previously required second-order methods. This is directly related to our topic, as our work also considers various modifications of gradient descent, including the Momentum and Nesterov methods. The article emphasizes that even first-order methods, such as SGD with momentum, can achieve optimization performance similar to second-order methods, which is especially important for training complex models. This study confirms the importance of careful tuning of the momentum parameters and the benefits of momentum for speeding up convergence and improving optimization quality.

Article [12] is devoted to the use of artificial intelligence (AI) and transfer learning techniques to automate e-waste sorting in smart cities. The authors emphasize the importance of digitalization in the context of the circular economy and consider automated e-waste processing as one of the key steps towards sustainable development. The study uses the AlexNet model with the transfer learning technique. Particular attention is paid to tuning the gradient descent optimizer and selecting the learning rate, which is directly related to our topic, since various modifications of gradient descent are also analyzed. The results show that using SGDM with a properly tuned learning rate yields an accuracy of almost 98%, which emphasizes the effectiveness of this approach. The paper also addresses overfitting issues and applies data augmentation techniques to improve model generalization, which is also useful for our study. This study demonstrates that the use of optimization algorithms such as gradient descent can improve the efficiency and accuracy of processing systems, contributing to the development of circular smart cities.

In this article [13], a new method of accelerated gradient descent is proposed that combines Taylor expansion and conjugate direction with the Nesterov accelerated gradient method. The goal was to increase the speed of convergence of optimization processes on the example of optimizing the thickness of an oil film to minimize the friction coefficient on a textured surface. Nesterov method is known for its faster convergence than standard first-order methods, but the authors improved it by including additional terms through the Taylor expansion, which allows for a more accurate approximation of the solution. The use of conjugate directions makes the method more efficient for large-scale problems, where it has advantages over the gradient descent method and is less memory intensive than Newton's method. The results of numerical experiments conducted using the finite element method in FreeFEM++ show that the proposed method has faster convergence than the Nesterov method and is capable of finding deeper solutions. Moreover, the method is easy to implement and suitable for large-scale continuous optimization problems. The useful conclusions relate to the improvement of gradient descent methods, in particular the Nesterov method. The proposed method demonstrates that a combination of techniques, such as Taylor decomposition and conjugate directions, can significantly improve the convergence rate and efficiency of optimization algorithms. This is directly related to our topic, as our paper also discusses accelerated gradient descent methods and their modifications, analyzing their effectiveness for complex optimization problems. The methods considered in this paper, such as the Nesterov method, are relevant for large and complex optimization problems, as they provide faster convergence. This approach can be particularly useful in our study to compare gradient descent modifications used in complex machine learning systems.

4. Standard Gradient Descent

Standard gradient descent is one of the simplest and most common optimization methods. Its basic idea is to gradually update the model parameters based on the gradient of the loss function. The negative gradient indicates the direction in which the value of the loss function decreases the fastest. The goal is to find the minimum of the function by adjusting the model parameters in this direction.

The main stages of Standard Gradient Descent:

1. Gradient calculation. At each step of the method, the gradient of the loss function is calculated for all model parameters. The gradient is a vector of partial derivatives of the loss function with respect to each parameter, which indicates the direction of the largest increase in the function.

2. Parameter update. After calculating the gradient, the model parameters are updated according to the formula [14]:

$\theta = \theta - \alpha \cdot \nabla J(\theta)$, (1)

where θ are the parameters of the model, α is the learning rate, and ∇J(θ) is the gradient of the loss function J(θ).

3. Learning rate. This is a key parameter of the method. If the step is too small, the optimization process will be too slow, and if it is too large, the method may overshoot the minimum or diverge.

4. Iterations. The process of updating the parameters is repeated many times until a stopping condition is reached (a convergence criterion or a specified number of iterations).
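The stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `gradient_descent`, the gradient-norm stopping tolerance `tol`, and the quadratic test function are all introduced here for illustration only.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def gradient_descent(
    grad: Callable[[Vector], Vector],
    theta0: Vector,
    alpha: float = 0.05,
    max_iter: int = 1000,
    tol: float = 1e-8,
) -> Tuple[Vector, int]:
    """Repeat theta <- theta - alpha * grad(theta) until the gradient is small."""
    theta = list(theta0)
    for iteration in range(max_iter):
        g = grad(theta)
        # Convergence criterion (an assumption): stop when the gradient norm is tiny.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            return theta, iteration
        # Parameter update: step against the gradient, scaled by the learning rate.
        theta = [ti - alpha * gi for ti, gi in zip(theta, g)]
    return theta, max_iter


def quadratic_grad(theta: Vector) -> Vector:
    """Gradient of the illustrative function J(theta) = theta_0^2 + 10 * theta_1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]


theta_min, n_iter = gradient_descent(quadratic_grad, [3.0, -2.0])
print(theta_min, n_iter)
```

The returned iteration count corresponds to the "number of iterations" criterion; a larger `alpha` would shorten the run on this function up to the point where the steepest coordinate starts to diverge.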

Advantages of Standard Gradient Descent:

• Easy to implement. Standard Gradient Descent is easy to implement because each iteration only requires calculating the gradient and updating the model parameters.
• Efficiency for smooth functions. If the loss function is smooth and convex, the method can efficiently find the global minimum.

Disadvantages of Standard Gradient Descent:
• Problems with the choice of learning rate. An incorrect choice of learning rate can lead to very slow convergence or, conversely, to divergence.
• Delay due to computation. Standard Gradient Descent requires calculating the gradient on all data at each step, which can be slow when working with large datasets.
• Oscillations in areas of saddle points. In areas where the gradient is very small or varies, the method can oscillate around the minimum without reaching it effectively.

The following sections look at the modifications of gradient descent, namely Momentum, Heavy Ball and Nesterov, which were developed to overcome some of the shortcomings of Standard Gradient Descent.

5. Momentum method

The Momentum gradient descent method is an improved version of the Standard Gradient Descent method that helps speed up convergence and avoid problems associated with oscillations. It does so by adding a momentum term to the parameter update: the accumulated effect of previous gradients. This makes it possible to maintain the direction of movement even if the gradients change slightly or oscillate.

In classical gradient descent, each update of the model parameters depends only on the current gradient. In the Momentum method, inertia is added, which is accumulated from previous gradients and smooths out oscillations when parameters fluctuate around the minimum.

Momentum algorithm. At each step, the model parameters are updated using the following formulas [15]:

1. The velocity is updated:

$v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot \nabla J(\theta_t)$, (2)

where $v_t$ is the velocity at iteration t, β is the coefficient of inertia, and $\nabla J(\theta_t)$ is the gradient of the loss function at the current iteration.

2. After that, the model parameters are updated to reflect the velocity:

$\theta_{t+1} = \theta_t - \alpha \cdot v_t$, (3)

where $\theta_t$ are the current parameters of the model, α is the learning rate, and $v_t$ is the velocity used to update the parameters.

Advantages of Momentum:
• Accelerated convergence. The Momentum method approaches the minimum quickly in convex problems, especially in gentle sections of the function. Momentum accumulation keeps the movement from slowing down in the direction where the gradient remains unchanged.
• Reducing oscillations. One of the key problems with Standard Gradient Descent is the oscillation of parameters in directions where gradients often change sign. Momentum smooths out these oscillations by accumulating inertia and helps avoid stopping at saddle points or on surfaces with small gradients.
• Better performance on curved surfaces. On difficult surfaces where the minimum is surrounded by deep valleys or hills, Momentum makes it possible to continue in the selected direction even when the current gradient is too small or changes too quickly.

Disadvantages of Momentum:
• Setting up hyperparameters. For the method to work efficiently, it is necessary to properly tune its hyperparameters. A value that is too small may not provide a sufficient acceleration effect, while a value that is too large can make the method become unstable, leading to divergence or oscillations around the minimum.
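The velocity update and the parameter update of the Momentum method can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `momentum_descent`, the zero initial velocity, the stopping rule, and the quadratic test function are assumptions made here.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def momentum_descent(
    grad: Callable[[Vector], Vector],
    theta0: Vector,
    alpha: float = 0.05,
    beta: float = 0.9,
    max_iter: int = 5000,
    tol: float = 1e-8,
) -> Tuple[Vector, int]:
    """Gradient descent with an exponentially averaged velocity."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)  # assumption: start with zero velocity
    for iteration in range(max_iter):
        g = grad(theta)
        # Stopping rule (an assumption): gradient norm below the tolerance.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            return theta, iteration
        # v_t = beta * v_{t-1} + (1 - beta) * grad J(theta_t)
        velocity = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(velocity, g)]
        # theta_{t+1} = theta_t - alpha * v_t
        theta = [ti - alpha * vi for ti, vi in zip(theta, velocity)]
    return theta, max_iter


def quadratic_grad(theta: Vector) -> Vector:
    """Gradient of the illustrative function J(theta) = theta_0^2 + 10 * theta_1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]


theta_min, n_iter = momentum_descent(quadratic_grad, [3.0, -2.0])
print(theta_min, n_iter)
```

Because the gradient enters with the factor (1 - β), the effective step is smaller than α, so α and β generally need to be tuned together.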

Gradient descent with momentum is widely used in neural networks and large machine learning models. It helps to cope more efficiently with large parameter spaces and complex loss functions, making it one of the most popular optimization methods. Momentum is also the basis for many modern modifications, such as Nesterov Accelerated Gradient, which further improve optimization performance.

The next step is to consider the Heavy Ball and Nesterov methods, which build on the ideas of Momentum, adding their own improvements for even greater optimization efficiency.

6. Heavy Ball method

The Heavy Ball method is one of the modifications of the gradient descent, which is based on similar principles as Momentum. The main idea is to add inertia to the process of updating parameters, which helps to speed up convergence and reduce oscillations. The name of the method comes from the physical analogy of moving a heavy ball on an inclined plane, where inertia helps to move in the direction of the minimum, overcoming obstacles such as local minima and plateaus.

In this method, each new step takes into account not only the current gradient, but also the previous step, which helps to approach the minimum more efficiently. By analogy with physics, this is similar to how a heavy object continues to move under the influence of inertia even after the force (gradient) stops acting on it.

The parameters in the Heavy Ball method are updated using the following formula [16]:

$w_{k+1} = w_k - \alpha_k \cdot \nabla f(w_k) + \beta_k \cdot (w_k - w_{k-1})$, (4)

where $w_{k+1}$ is the new value (updated parameter), $w_k$ is the current value (current parameter), $w_{k-1}$ is the previous value (previous parameter), $\alpha_k$ is the step (learning rate), $\beta_k$ is the momentum parameter, and $\nabla f(w_k)$ is the gradient of the function f(w) at point $w_k$.

Advantages of Heavy Ball:
• Speed up convergence. As in the Momentum method, inertia helps to move faster to the minimum, especially on flat parts of the function where Standard Gradient Descent can be slow. The accumulation of speed helps to keep moving even when the gradient becomes small.
• Oscillation smoothing. The Heavy Ball method smooths out oscillations that can occur in conventional gradient descent, especially in cases with high curvature or highly elongated minima.
• Movement through flat regions. At points where the gradient is very small, Heavy Ball keeps moving forward due to inertia.

Disadvantages of Heavy Ball:
• The need for careful tuning. The method requires the correct choice of both the learning rate and the momentum parameter. A poor choice can lead to too slow convergence, or the method can overshoot the minimum and start oscillating around it instead of achieving stable convergence.
• Delays are possible in complex landscapes. Although the method works well on smooth functions, in very complex landscapes with numerous local minima, inertia can prevent the fastest possible finding of the global minimum.
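The Heavy Ball update, which adds a multiple of the previous displacement to a plain gradient step, can be sketched as follows. The sketch assumes constant step and momentum parameters; the function name `heavy_ball`, the choice of starting with zero displacement, the stopping rule, and the quadratic test function are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def heavy_ball(
    grad: Callable[[Vector], Vector],
    w0: Vector,
    alpha: float = 0.02,
    beta: float = 0.8,
    max_iter: int = 5000,
    tol: float = 1e-8,
) -> Tuple[Vector, int]:
    """w_{k+1} = w_k - alpha * grad f(w_k) + beta * (w_k - w_{k-1})."""
    w = list(w0)
    w_prev = list(w0)  # assumption: no inertia on the very first step
    for iteration in range(max_iter):
        g = grad(w)
        # Stopping rule (an assumption): gradient norm below the tolerance.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            return w, iteration
        w_next = [
            wi - alpha * gi + beta * (wi - wpi)
            for wi, gi, wpi in zip(w, g, w_prev)
        ]
        w_prev, w = w, w_next
    return w, max_iter


def quadratic_grad(w: Vector) -> Vector:
    """Gradient of the illustrative function f(w) = w_0^2 + 10 * w_1^2."""
    return [2.0 * w[0], 20.0 * w[1]]


w_min, n_iter = heavy_ball(quadratic_grad, [3.0, -2.0])
print(w_min, n_iter)
```

Only the current and previous points are stored, so the memory cost stays close to that of Standard Gradient Descent.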

The Heavy Ball method is used in problems that require faster convergence than Standard Gradient Descent. It is suitable for problems with a large number of parameters, such as neural network optimization, especially in situations where the loss function has a complex shape with wide minima or plateaus. Heavy Ball is a good option for problems where the speed of convergence is important, but the stability of the optimization process cannot be sacrificed.

The next section considers the Nesterov Accelerated Gradient method, another advanced version of gradient descent, which is based on the ideas of momentum and inertia but adds its own features for even greater efficiency.

7. Nesterov method

The Nesterov Accelerated Gradient (NAG) method is an advanced modification of the gradient descent based on the idea of Momentum, but with additional acceleration. The main innovation of the method is that it updates the parameters not only based on the current gradient, but also taking into account the predicted future state. This allows the model to take into account where it will move in advance and adjust the steps more accurately.

Unlike the classical Momentum method, where the gradient is calculated based on current parameters, the Nesterov method calculates the gradient based on the future position. This allows the step to be corrected before it is taken.

In a physical analogy, this is similar to a heavy ball (which moves due to inertia) that looks ahead along its path and adjusts its movement in advance.

Model parameters in the Nesterov method can be updated in the following steps [17]:

$v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot \nabla f(\theta_{fp})$, (5)

$\theta_{t+1} = \theta_t - \alpha \cdot v_t$, (6)

where $\theta_t$ are the parameters at iteration t, $v_t$ is the velocity at iteration t, α is the learning rate, β is the momentum term, and $\nabla f(\theta_{fp})$ is the gradient at the future position.

Advantages of the Nesterov method:
• Faster convergence. Since the method uses the predicted position to calculate the gradient, it makes better use of the information about the direction of movement, which contributes to faster convergence compared to the classic Momentum.
• Better adaptation to the function landscape. The Nesterov method is more sensitive to changes in the landscape, because the gradient is evaluated to match the predicted position. This allows it to better cope with complex landscapes with numerous local minima.
• Oscillation reduction. Similar to Momentum, the Nesterov method helps reduce oscillations, especially in problems with large curvature or saddle points. However, due to the look-ahead gradient, it does so more effectively.
• Better stability in complex problems. The Nesterov method is less prone to situations where accumulated inertia carries the parameters past the minimum, which makes it more stable in complex optimization problems.

Disadvantages of the Nesterov method:
• Complicated computation. Although the method provides better convergence, it requires additional computations to estimate the predicted position of the parameters. This can increase computational complexity, especially when working with large models.
• Adjusting hyperparameters. As with Momentum, the Nesterov method requires careful tuning of its hyperparameters; a poor choice can lead to divergence or slow convergence.
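The look-ahead idea can be sketched as follows. The text does not spell out how the future position is computed, so this sketch assumes the common choice of moving the current parameters along the previous velocity, θ_fp = θ_t − α·β·v_{t−1}; that formula, the function name `nesterov_descent`, the stopping rule, and the quadratic test function are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def nesterov_descent(
    grad: Callable[[Vector], Vector],
    theta0: Vector,
    alpha: float = 0.05,
    beta: float = 0.9,
    max_iter: int = 5000,
    tol: float = 1e-8,
) -> Tuple[Vector, int]:
    """Momentum update whose gradient is taken at a predicted future position."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)  # assumption: start with zero velocity
    for iteration in range(max_iter):
        # Extra gradient evaluation, used only for the stopping test (an assumption).
        g_now = grad(theta)
        if sum(gi * gi for gi in g_now) ** 0.5 < tol:
            return theta, iteration
        # Assumed look-ahead point: theta_fp = theta_t - alpha * beta * v_{t-1}
        theta_fp = [ti - alpha * beta * vi for ti, vi in zip(theta, velocity)]
        g_fp = grad(theta_fp)
        # v_t = beta * v_{t-1} + (1 - beta) * grad f(theta_fp)
        velocity = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(velocity, g_fp)]
        # theta_{t+1} = theta_t - alpha * v_t
        theta = [ti - alpha * vi for ti, vi in zip(theta, velocity)]
    return theta, max_iter


def quadratic_grad(theta: Vector) -> Vector:
    """Gradient of the illustrative function f(theta) = theta_0^2 + 10 * theta_1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]


theta_min, n_iter = nesterov_descent(quadratic_grad, [3.0, -2.0])
print(theta_min, n_iter)
```

The second gradient call per iteration exists only for the stopping test in this sketch; an implementation that stops on a different signal could avoid it.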

The Nesterov Accelerated Gradient method is widely used in neural networks and complex machine learning models where loss functions have a rough or complex landscape. It can significantly speed up training, especially in problems where classical gradient descent methods face difficulties in stability and convergence speed.

Nesterov method is one of the most popular optimization algorithms due to its ability to accelerate learning and efficient use of gradient direction information. It is often used in combination with other optimization methods to provide even more efficient and faster model training.

8. Experimental research

In the practical part of the paper, an experimental comparison of the optimization methods discussed above (Standard Gradient Descent, Momentum method, Heavy Ball method, and Nesterov method) was conducted on various mathematical functions. The main goal of the study was to investigate the effectiveness of these methods on complex functions using the Analytic Hierarchy Process. Effectiveness was evaluated on several key criteria: the number of iterations, average time per iteration, total execution time, and the function value at the final iteration. The following functions were used for this purpose:
• Himmelblau's function is a multimodal nonlinear function known for its four global minima [18].
• The Rosenbrock function is a standard test function for optimization, with a hard-to-find global minimum [19].
• The Rastrigin function is a strongly oscillating function with many local minima, which is used to test the stability of optimization methods [20].
• The Ackley function is a nonlinear function with a large number of local minima, which poses difficulties for gradient methods [21].
• The Beale function is a function of two variables that has one global minimum and several local ones, representing a problem with high nonlinearity [22].
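For reference, the five benchmark functions can be written down in their commonly used two-variable forms. The constants below (for example a = 1 and b = 100 for Rosenbrock) are the usual textbook choices and are an assumption here, since the exact variants used in the experiments are not stated; the helper `numerical_gradient` is a hypothetical convenience for feeding these functions to a gradient-based method.

```python
import math
from typing import Callable, Tuple


def himmelblau(x: float, y: float) -> float:
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2


def rosenbrock(x: float, y: float, a: float = 1.0, b: float = 100.0) -> float:
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def rastrigin(x: float, y: float) -> float:
    return (20 + x ** 2 - 10 * math.cos(2 * math.pi * x)
            + y ** 2 - 10 * math.cos(2 * math.pi * y))


def ackley(x: float, y: float) -> float:
    return (-20 * math.exp(-0.2 * math.sqrt(0.5 * (x ** 2 + y ** 2)))
            - math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
            + math.e + 20)


def beale(x: float, y: float) -> float:
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)


def numerical_gradient(
    f: Callable[[float, float], float], x: float, y: float, h: float = 1e-6
) -> Tuple[float, float]:
    """Central-difference approximation of the gradient of f at (x, y)."""
    return (
        (f(x + h, y) - f(x - h, y)) / (2 * h),
        (f(x, y + h) - f(x, y - h)) / (2 * h),
    )


# Each function evaluates to (numerically) zero at a known global minimizer.
print(himmelblau(3, 2), rosenbrock(1, 1), rastrigin(0, 0), ackley(0, 0), beale(3, 0.5))
```

Analytic gradients would be faster and more accurate than finite differences; the numerical version is shown only because it works unchanged for all five functions.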

Each of the methods was applied to solving optimization problems. For each combination of method and function, the main criteria values were measured, and then the values were normalized using the minimax method. If the minimum and maximum values of the criteria coincided (which could happen, for example, if the methods did not reach the minimum), the normalized values were set to 1.

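The criteria are as follows:
• $N_{iter}$: the number of iterations,
• $T_{avg}$: the average time per iteration,
• $T_{total}$: the total execution time,
• $f(x_{final})$: the value of the function at the last iteration.

Perform normalization of values:

$\hat{N}_{iter} = \frac{N_{iter} - \min(N_{iter})}{\max(N_{iter}) - \min(N_{iter})}$, (7)

$\hat{T}_{avg} = \frac{T_{avg} - \min(T_{avg})}{\max(T_{avg}) - \min(T_{avg})}$, (8)

$\hat{T}_{total} = \frac{T_{total} - \min(T_{total})}{\max(T_{total}) - \min(T_{total})}$, (9)

$\hat{f}(x_{final}) = \frac{f(x_{final}) - \min(f(x_{final}))}{\max(f(x_{final})) - \min(f(x_{final}))}$. (10)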
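The normalization step can be sketched as a small helper. This assumes the standard min-max form, (value − min) / (max − min), applied across the methods for one criterion on one function, together with the rule stated above that all values become 1 when the minimum and maximum coincide. The function name `minmax_normalize` and the sample numbers are hypothetical, and whether the study additionally inverted the scale for "lower is better" criteria is not stated, so no inversion is applied here.

```python
from typing import List


def minmax_normalize(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; return all ones if every value is identical."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Degenerate case described in the text: min and max coincide.
        return [1.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical iteration counts of three methods on one function.
normalized_iterations = minmax_normalize([10.0, 20.0, 30.0])
# Hypothetical case where no method reached the minimum within the limit.
normalized_equal = minmax_normalize([5000.0, 5000.0])
print(normalized_iterations, normalized_equal)
```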

In the practical part of the study, the Analytic Hierarchy Process was applied to evaluate and compare the effectiveness of different optimization methods: Standard Gradient Descent, Momentum method, Heavy Ball method and Nesterov method. This approach made it possible to consider the optimization methods systematically in terms of several criteria, which contributed to a more informed choice of the best method for specific problems. The stages of applying the Analytic Hierarchy Process are presented below.

Building a tree of alternatives: In the context of this stage, a hierarchical structure was formed, where at the top level was the overall objective of the study - to evaluate the effectiveness of optimization methods. Below it were the key criteria: number of iterations, average time per iteration, total execution time and function value at the last iteration. At the lowest level of the tree were the alternatives, which are optimization methods.

Constructing a matrix of pairwise comparisons of criteria: A matrix of pairwise comparisons of criteria was created for further analysis. Each criterion was ranked relative to the others in terms of its importance for achieving the overall objective. This allowed the priorities of the criteria to be fixed and their weight to be taken into account in subsequent calculations.

Construction of matrices of pairwise comparisons of alternatives: Next, a matrix of pairwise comparisons of alternatives was constructed for each criterion. In this matrix, the optimization methods were evaluated using normalized values of the criteria. This approach allowed each method to be compared in the context of all criteria.

Matrix Analysis: In this step, the matrices obtained were analyzed, resulting in a vector of criteria weights and vectors of alternative weights for each criterion.
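One common way to obtain a weight vector from a pairwise comparison matrix is to approximate its principal eigenvector by power iteration. The text does not state which procedure was used in the study, so the sketch below is an assumption; the function name `ahp_priorities` and the example matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def ahp_priorities(matrix: Matrix, iterations: int = 100) -> List[float]:
    """Approximate the normalized principal eigenvector by power iteration."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [
            sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)
        ]
        total = sum(product)
        # Renormalize so that the weights always sum to 1.
        weights = [p / total for p in product]
    return weights


# Hypothetical, perfectly consistent comparison of three criteria:
# the first is twice as important as the second and four times the third.
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
criteria_weights = ahp_priorities(pairwise)
print(criteria_weights)
```

For a perfectly consistent matrix such as this one, the weights are proportional to any column of the matrix (here 4 : 2 : 1); for inconsistent judgments the eigenvector gives a compromise.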

Determination of weights of alternatives: Based on the weights obtained in the previous step, the final weights of the alternatives in terms of achieving the objective were determined. For this purpose, a calculation was made using the following formula [23]:

$W_k = \sum_{i=1}^{n} (w_i \cdot p_{ik})$,

where $W_k$ is the total weight of method k, $w_i$ is the weight of the i-th criterion, $p_{ik}$ is the weight of method k by criterion i, and n is the total number of criteria.
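The weighted sum above is straightforward to compute. In this minimal sketch the function name `final_weights` and all numbers are hypothetical; `priorities[i][k]` plays the role of the weight of method k by criterion i.

```python
from typing import List


def final_weights(
    criteria_weights: List[float], priorities: List[List[float]]
) -> List[float]:
    """Return W_k = sum_i w_i * p_ik for every method k."""
    n_methods = len(priorities[0])
    return [
        sum(w_i * row[k] for w_i, row in zip(criteria_weights, priorities))
        for k in range(n_methods)
    ]


# Hypothetical example with two criteria and two methods.
w = [0.75, 0.25]
p = [
    [0.5, 0.5],    # weights of the two methods by criterion 1
    [0.25, 0.75],  # weights of the two methods by criterion 2
]
totals = final_weights(w, p)
print(totals)
```

When the criteria weights sum to 1 and the method weights sum to 1 within each criterion, the final weights also sum to 1, which makes results on different functions directly comparable.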

Thus, the final weight $W_k$ was calculated for each method, where a higher value indicates a better result. After completing the calculations, the trajectories along which the methods moved to the minimum of each function were visualized, which gave a visual representation of the behavior of the different optimization methods in each case. This information is presented below in the form of tables and figures.


Based on these tables, several conclusions can be made about the results of the optimization methods applied to the Himmelblau, Rosenbrock, Rastrigin, Ackley, and Beale functions. The results for each function are analyzed below.

• Himmelblau function: The Heavy Ball method has the highest weight of 0.528656, indicating its superiority in terms of performance on this function. The Standard Gradient Descent has the lowest weight of 0.065604, showing weaker results compared to the other methods.
• Rosenbrock function: Heavy Ball again shows the best result with the highest weight of 0.699239, demonstrating its stability and efficiency in achieving optimal values. Other methods, such as Momentum (0.085858) and Nesterov (0.073100), show lower weights, indicating their less efficient performance on this function.
• Rastrigin function: Standard Gradient Descent has the highest weight of 0.498951, indicating the best performance on this function, while Momentum has the lowest weight of 0.078360, showing weaker performance compared to other methods.
• Ackley function: On this function, Standard Gradient Descent has the highest weight of 0.357109, indicating its effectiveness. Momentum and Nesterov also show good results, while the Heavy Ball method has the lowest weight of 0.112995, showing relatively weak results on this function.
• Beale function: On this function, the Heavy Ball method showed the highest weight of 0.707523, demonstrating high performance. The other methods performed less efficiently compared to Heavy Ball.

The analysis showed that the considered optimization methods demonstrate different efficiency on various test functions.

The Heavy Ball method has the highest weight on three of the five functions (Himmelblau, Rosenbrock, and Beale), although it has the lowest weight on the Ackley function. Its high accuracy and low execution time result in high weight values, which makes it one of the most efficient methods for most of the tasks considered.

Standard Gradient Descent shows significant variation in its results. On some functions (Rastrigin and Ackley) it works very effectively, while on others (e.g., Himmelblau) its results are significantly inferior to those of the other methods.

The Momentum and Nesterov methods show poor to average results depending on the function. On the Rosenbrock, Rastrigin, and Beale functions their weights are low, while on the Himmelblau and Ackley functions they are more competitive.

The Analytic Hierarchy Process made it possible to evaluate the methods' efficiency while taking into account several important aspects at the same time: the number of iterations, the average time per iteration, the total execution time of the method, and the accuracy of finding the minimum. This made it possible to clearly assess the advantages and disadvantages of each method under different conditions and to compare them.
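The four criteria can be collected with a small measurement harness such as the sketch below. The function `measure`, its stopping rule (stop when the iterate moves less than `tol`), and the quadratic test problem are assumptions made for illustration; the stopping condition and timing setup actually used in the experiments are not stated in this section.

```python
import time


def measure(step, objective, start, tol=1e-6, max_iter=10_000):
    """Run an iterative optimizer and record the four comparison criteria.

    step: function mapping the current point to the next point.
    objective: function returning the objective value at a point.
    tol, max_iter: assumed stopping parameters (hypothetical values).
    """
    point = start
    iterations = 0
    t0 = time.perf_counter()
    for _ in range(max_iter):
        new_point = step(point)
        iterations += 1
        moved = max(abs(a - b) for a, b in zip(new_point, point))
        point = new_point
        if moved < tol:                    # iterate has effectively stopped
            break
    total = time.perf_counter() - t0
    return {
        "iterations": iterations,
        "avg_time_per_iteration": total / iterations,
        "total_time": total,
        "final_value": objective(point),
    }


# Illustrative problem: f(x, y) = x^2 + 10 y^2 with plain gradient descent.
def objective(p):
    return p[0] ** 2 + 10 * p[1] ** 2


def gd_step(p, lr=0.05):
    return (p[0] - lr * 2 * p[0], p[1] - lr * 20 * p[1])


result = measure(gd_step, objective, start=(1.0, 1.0))
print(result)
```

Timing values depend on the machine and on the current load, so in practice each run would typically be repeated and averaged before the criteria are compared.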

In general, the results of the study demonstrate that the choice of optimization method depends on the specific task and function. The Heavy Ball method showed the best overall results, while other methods, such as Standard Gradient Descent, can be effective in certain conditions.

9. Conclusions

In this article, the topic of optimization was analyzed, in particular first-order methods, which are widely used to minimize loss functions in various problems. Optimization is an important aspect of both business and research, especially in machine learning, where it plays a key role in training models and improving their accuracy. Gradient descent, one of the main approaches, was considered together with its main modifications: the Momentum, Heavy Ball, and Nesterov methods.

First-order methods, such as Standard Gradient Descent, are attractive because of their simplicity and low cost per iteration, but they often suffer from oscillations and slow convergence. Modifications such as Momentum, Heavy Ball, and Nesterov may be better suited to specific tasks.
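The difference between these update rules can be sketched on an ill-conditioned quadratic, where plain gradient descent converges slowly along the flat direction. The code below uses common textbook forms of the Heavy Ball (Polyak) and Nesterov updates; the exact formulations and hyperparameters used in this work may differ, and the separate Momentum variant is not reproduced because its definition is not given in this section. The values of `lr`, `beta`, and the test function are assumptions.

```python
def make_gd(grad, lr):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    def step(x):
        return [xi - lr * gi for xi, gi in zip(x, grad(x))]
    return step


def make_heavy_ball(grad, lr, beta):
    """Heavy Ball: x <- x - lr * grad(x) + beta * (x - x_prev)."""
    prev = None
    def step(x):
        nonlocal prev
        last = x if prev is None else prev   # no momentum on the first step
        new = [xi - lr * gi + beta * (xi - pi)
               for xi, gi, pi in zip(x, grad(x), last)]
        prev = x
        return new
    return step


def make_nesterov(grad, lr, beta):
    """Nesterov: gradient is evaluated at the look-ahead point x + beta * v."""
    velocity = None
    def step(x):
        nonlocal velocity
        if velocity is None:
            velocity = [0.0] * len(x)
        lookahead = [xi + beta * vi for xi, vi in zip(x, velocity)]
        velocity = [beta * vi - lr * gi
                    for vi, gi in zip(velocity, grad(lookahead))]
        return [xi + vi for xi, vi in zip(x, velocity)]
    return step


# Assumed test problem: f(x, y) = 0.5 * (x^2 + 100 y^2).
def f(p):
    return 0.5 * (p[0] ** 2 + 100 * p[1] ** 2)


def grad_f(p):
    return [p[0], 100 * p[1]]


results = {}
for name, step in [
    ("gd", make_gd(grad_f, lr=0.01)),
    ("heavy_ball", make_heavy_ball(grad_f, lr=0.01, beta=0.9)),
    ("nesterov", make_nesterov(grad_f, lr=0.01, beta=0.9)),
]:
    x = [1.0, 1.0]
    for _ in range(200):
        x = step(x)
    results[name] = f(x)      # objective value after 200 iterations

print(results)
```

On this particular quadratic both momentum-based variants reach a lower objective value than plain gradient descent after the same number of iterations. This is a property of the chosen problem and hyperparameters, not a general ranking of the methods.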

In addition, recent publications on the application of these methods in machine learning and other fields were reviewed. The review confirmed the importance of first-order optimization methods for solving applied problems that require high accuracy and fast model training.

In general, the comparison of different modifications of gradient descent showed that each method has its own strengths and weaknesses, and the choice of the optimal approach depends on the specific conditions of the problem. The use of more sophisticated methods, such as Nesterov Accelerated Gradient, can significantly improve results, reduce training time, and increase the stability of optimization in real-world projects.

In the practical part of the work, several optimization methods were compared on classical test functions: Himmelblau, Rosenbrock, Rastrigin, Ackley, and Beale. Standard Gradient Descent, Momentum, Heavy Ball, and Nesterov were implemented. The Analytic Hierarchy Process allowed a comprehensive evaluation of each method by several significant criteria: the number of iterations, the average time per iteration, the total execution time, and the function value at the last iteration. This approach provided a structured comparison, which facilitated a more accurate selection of the best method for different optimization problems. According to the evaluation results, the Heavy Ball method showed the best performance on most functions, while Standard Gradient Descent and the other methods had mixed results depending on the specifics of the functions.

Declaration on Generative AI

The authors did not use Generative AI tools in preparing the content, analyzing the data or creating the figures presented in the paper. All ideas, conclusions and figures are based on standard research and analysis methods.

[1] Daoud, M.S., Shehab, M., Al-Mimi, H.M. et al. Gradient-Based Optimizer (GBO): A Review, Theory, Variants, and Applications. Arch Computat Methods Eng 30, 2431-2449 (2023). https://doi.org/10.1007/s11831-022-09872-y

[3] Mustapha, Aatila & Lachgar, Mohamed & Ali, Kartit. (2020). An Overview of Gradient Descent Algorithm Optimization in Machine Learning: Application in the Ophthalmology Field. 10.1007/978-3-030-45183-7_27.

[4] Morozov, V., Kalnichenko, O. (2020). The method of interaction modeling on basis of deep -projects. International Journal of Computing 19(1), 88-96.

[5] Morozov, V., Deineha, V., Khlevnyi, A. (2023). Research on the Use of Machine Learning Methods for Forecasting Time Series when Making Management Decisions in IT Projects Under Martial Law. CEUR Workshop Proceedings, 3624, pp. 192-204.

[6] How Machines Learn: The Power of Gradient Descent. URL: https://towardsai.net/p/artificial-intelligence/how-machines-learn-the-power-of-gradientdescent