                                Understanding the Adam Optimization Algorithm in
                                Machine Learning
                                Oles Hospodarskyy1,*,†, Vasyl Martsenyuk2,†, Nataliia Kukharska3,†, Andriy
                                Hospodarskyy4,†, Sofiia Sverstiuk5,†

                                1 Lviv Polytechnic National University, Bandery St. 12, Lviv, 79001, Ukraine

                                2 University of Bielsko-Biala, Willowa St. 2, Bielsko-Biala, 43-300, Poland

                                3 Ternopil National Ivan Puluj Technical University, Rus'ka St. 56, Ternopil, 46001, Ukraine

                                4 I. Horbachevsky Ternopil National Medical University, Maidan Voli, 1, Ternopil, 46002, Ukraine

                                5 Ternopil National Pedagogical University, 2 Maxyma Kryvonosa St., Ternopil, 46027, Ukraine




                                                Abstract
                                                Machine learning and artificial intelligence are significant areas of interest in both contemporary
                                                science and society. Various optimization algorithms are used for training models. An algorithm's speed
                                                depends on the size of the dataset, the number of model parameters, and the number of iterations.
                                                Standard gradient descent requires computing the gradient of the cost function over the entire
                                                dataset, which can be resource-intensive, especially with large datasets. In Adam, a separate
                                                learning rate is maintained for each parameter weight, which is adapted and updated
                                                individually. The algorithm selects a smaller learning rate for frequently updated parameters and
                                                a larger one for parameters corresponding to rare features. To measure the effectiveness and
                                                universality of the Adam algorithm, we compared it with other optimization algorithms. Analysis of the
                                                experimental results obtained on various datasets indicates a significant advantage of the Adam
                                                optimization algorithm. To make sure our model works well for our specific needs, we made a
                                                small dataset ourselves. The famous MNIST dataset, created by American researchers, might not
                                                match our handwritten numbers perfectly. The results appear promising, with the model
                                                achieving an accuracy of 97%, meaning it correctly predicted 97 out of 100 images. This level of
                                                accuracy suggests that the model is performing well on our custom dataset, demonstrating its
                                                effectiveness in recognizing and classifying our handwritten numbers. Experiments on various
                                                datasets showed that the Adam algorithm is capable of achieving good results across a wide range
                                                of machine learning tasks.

                                                Keywords
                                                Adam algorithm, machine learning, artificial intelligence, loss function, gradient descent



                                1. Introduction
                                Machine learning and artificial intelligence are significant areas of interest in both
                                contemporary science and society [1]. They represent some of the most advanced


                                CITI'2024: 2nd International Workshop on Computer Information Technologies in Industry 4.0, June 12-14, 2024, Ternopil, Ukraine.
                                ∗ Corresponding author.
                                † These authors contributed equally.
                                oles.hospodarskyi.kb.2021@lpnu.ua (O. Hospodarskyy); vmartsenyuk@ath.bielsko.pl (V. Martsenyuk); nataliia.p.kukharska@lpnu.ua (N. Kukharska); hospodarskyy@tdmu.edu.ua (A. Hospodarskyy); khrystynasofia@gmail.com (S. Sverstiuk)
                                ORCID: 0009-0005-9088-3015 (O. Hospodarskyy); 0000-0001-5622-1038 (V. Martsenyuk); 0000-0002-0896-8361 (N. Kukharska); 0000-0002-9394-2675 (A. Hospodarskyy); 0000-0001-5595-4918 (S. Sverstiuk)
                                           © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




technologies applicable across various industries, including healthcare, finance,
transportation and entertainment. The theoretical foundations of machine learning have
been explored and expanded upon by some of the greatest figures in the field. Geoffrey
Hinton, often revered as the "Godfather of Deep Learning," has laid the groundwork for
many modern machine learning techniques with his groundbreaking research on neural
networks.
   Yann LeCun's work on convolutional neural networks (CNNs) has revolutionized
computer vision and pattern recognition, while Yoshua Bengio's contributions to neural
network models have greatly advanced natural language processing and unsupervised
learning. Ian Goodfellow's work on generative adversarial networks (GANs) has opened up
new avenues in unsupervised learning and generative modeling, while Juergen
Schmidhuber's contributions to recurrent neural networks (RNNs) and long short-term
memory (LSTM) networks have propelled advancements in sequential learning and AI [2].
   Each year the number of scientific publications dedicated to algorithms and machine
learning methods continues to increase. However, despite significant progress made by
researchers in this field, a range of unresolved and insufficiently studied issues persists.
These include challenges related to optimization, which entail the search for optimal model
parameters to achieve maximum prediction accuracy.
   The main objective of this article is to investigate the Adam optimization algorithm,
compare its effectiveness with other well-known algorithms on standard datasets such as
MNIST and FashionMNIST, and assess its accuracy in recognizing and classifying
handwritten characters [3].

2. General principles of optimization to find the best algorithm in
   machine learning
When an input signal is received by the model, it is processed through a function and, after
a series of computations, transformed into an output value. The model then compares the
generated output with the actual output value and computes the loss function. The loss
function is a measure of how well the model performs on a given task. For example, a popular
loss function, the mean squared error (MSE), computes the average squared difference
between predicted and actual values:
                               $MSE = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$,                                    (1)
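   As a brief illustration (a minimal sketch added for clarity, not part of the original experiments; the values are arbitrary), the MSE can be computed directly in Python:

    from numpy import asarray, mean

    # mean squared error between true values y and predictions y_hat
    def mse(y, y_hat):
      return mean((asarray(y) - asarray(y_hat)) ** 2.0)

    print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # prints approximately 0.02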

   Now we need an algorithm that will adjust the parameters of the model (w1, w2, ..., wn)
to minimize the loss function. That is the basic concept of optimization in machine
learning.
   There are various optimization algorithms used for this task. These algorithms are
iterative, meaning they update the model parameters at each epoch of the training
process.
3. The description of the gradient descent algorithm as a fundamental
   optimization technique
The idea of gradient descent is to update the parameters of the model (weights) by moving
in the direction of the steepest descent of the loss function [5].
    At first, the algorithm initializes the weights randomly (or with zeros). Then it calculates
the loss function over the whole dataset, for example, the MSE.
    Next, gradient descent calculates the derivative of the loss function with respect to each
model parameter (weight) to determine the direction of the update.
    After computing the gradient of the loss function with respect to each weight, the algorithm
updates the weights by subtracting a fraction of the gradient from the current value of each
weight. This fraction is known as the learning rate, denoted by 𝛼, and it controls the size of
the step taken in the direction of the negative gradient.

                                  $w = w - \alpha\,\frac{\partial Loss}{\partial w}$,                                   (2)

   This process is repeated iteratively, and with each iteration, the algorithm progressively
approaches a local minimum of the loss function, where the weight values are optimal
(Figure 1).




Figure 1: The relationship between the loss function and the weight value w
Source: created by the author
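   To make the update rule (2) concrete, here is a minimal sketch (added for illustration) of gradient descent on a one-dimensional quadratic loss, loss(w) = (w - 3)^2, whose derivative is 2(w - 3):

    w = 0.0        # initial weight
    alpha = 0.1    # learning rate
    for step in range(50):
      gradient = 2.0 * (w - 3.0)   # dLoss/dw for loss(w) = (w - 3)**2
      w = w - alpha * gradient     # move against the gradient
    print(w)  # approaches the minimum at w = 3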

   The algorithm's speed depends on the size of the dataset, the number of model
parameters, and the number of iterations. Typically, larger datasets and more complex
models require more time and resources for training. Additionally, the speed of the
algorithm can be influenced by the choice of learning rate. A learning rate that is too large
may cause the algorithm to overshoot and fail to find the minimum of the loss function
(Figure 2), while one that is too small may prolong the training process.




Figure 2: Demonstrative example where the learning rate is too large.
Source: created by the author

4. Stochastic gradient descent, SGD
Standard gradient descent requires computing the gradient of the cost function over the
entire dataset, which can be resource-intensive, especially with large datasets. Therefore,
in such cases, stochastic gradient descent (SGD) is applied, which is more efficient for
optimizing models with a large amount of data [4].
   Standard gradient descent updates the model weights at each iteration using the entire
dataset to compute the gradient of the loss function. However, stochastic gradient descent
computes the gradient and updates the weights for each data sample in the dataset
separately. That is, on each iteration, SGD uses only one data sample instead of the entire
dataset, allowing for quick weight updates and more efficient processing of large datasets
[5].
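   The per-sample update can be sketched as follows (an illustrative example added here, fitting the single weight of the model y = w*x with the MSE loss; the toy data are an assumption):

    from numpy import asarray
    from numpy.random import permutation

    X = asarray([1.0, 2.0, 3.0, 4.0])   # toy inputs
    Y = asarray([2.0, 4.0, 6.0, 8.0])   # targets generated by y = 2*x

    w, alpha = 0.0, 0.01
    for epoch in range(100):
      for i in permutation(len(X)):          # visit samples in random order
        error = w * X[i] - Y[i]              # error for a single sample
        gradient = 2.0 * error * X[i]        # gradient of (w*x - y)**2 w.r.t. w
        w = w - alpha * gradient             # update using this one sample
    print(w)  # close to 2.0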
   However, SGD can be sensitive to the initial values of the weights, causing it to get stuck
in a local minimum and fail to find the global minimum of the loss function (Figure 3). Data
normalization could help mitigate this issue for linear models, however, in more complex
models like neural networks, normalization may not be sufficient.
Figure 3: Graphical example where the algorithm failed to find the global minimum of the
function
Source: created by the author

    That’s where Adam comes in handy. It uses the history of previous gradients to
adaptively adjust the learning rates for each parameter, helping to overcome the limitations
of SGD.

5. Adam
Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University
of Toronto in their 2015 ICLR paper (poster) titled “Adam: A Method for Stochastic
Optimization”. The name Adam is derived from adaptive moment estimation.
   Adam differs from classical stochastic gradient descent. Standard stochastic gradient
descent uses a single learning rate (alpha) for updating weights, and this learning rate
remains constant throughout training [6].
   In Adam, a separate learning rate is maintained for each parameter weight, which is
adapted and updated individually. The algorithm selects a smaller learning rate for
frequently updated parameters and a larger one for parameters corresponding to rare
features.
   The authors describe Adam as combining the advantages of two other extensions of
stochastic gradient descent. Specifically:

   •   The Adaptive Gradient Algorithm (AdaGrad), which is particularly effective with
       sparse gradients, such as in natural language processing (NLP) and computer
       vision tasks. It employs a method that maintains an individual learning rate for each
       parameter, facilitating efficient updates for rarely used parameters. However,
       AdaGrad may encounter the issue of rapidly decreasing learning rates, which can
       prematurely halt the learning process.
  •    The Root Mean Square Propagation (RMSProp) algorithm, which, unlike AdaGrad,
       mitigates the problem of decreasing learning rates. It utilizes a method that sustains
       individually adjusted learning rates for each parameter, adapted based on the recent
       average of gradient magnitudes for weights. This algorithm performs effectively on
       online tasks and tasks where parameters may change over time (non-stationary
       tasks).
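   The difference between the two methods above can be summarised in code (a simplified, per-parameter sketch added for illustration; g denotes the current gradient and cache the accumulated squared gradients):

    from math import sqrt

    eps = 1e-8

    # AdaGrad: the accumulator only grows, so the effective step keeps shrinking
    def adagrad_step(w, g, cache, alpha=0.01):
      cache = cache + g ** 2
      return w - alpha * g / (sqrt(cache) + eps), cache

    # RMSProp: a decaying average of squared gradients keeps the step size alive
    def rmsprop_step(w, g, cache, alpha=0.01, rho=0.9):
      cache = rho * cache + (1.0 - rho) * g ** 2
      return w - alpha * g / (sqrt(cache) + eps), cache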

  The hyperparameters of Adam include:

  1.   Learning rate: Determines the size of the step by which the model weights will
        change during each iteration. A learning rate that is too large can lead to unstable model
        training, while one that is too small can slow down the learning process. Typically, an
       initial learning rate is chosen, but Adam automatically adapts it over time.
  2.   Beta1 and Beta2: These parameters control the exponential smoothing of previous
       gradients and their squares, respectively. Beta1 is responsible for smoothing
       gradients, while Beta2 handles the smoothing of gradient squares. Typically, values
       such as Beta1 = 0.9 and Beta2 = 0.999 work well, but they can be manually adjusted
       if needed.
  3.   Epsilon: A small numerical value added to the denominator in the Adam formula to
       avoid division by zero.
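   For reference, the commonly used default values of these hyperparameters can be written down as constants (the same names appear in the walkthrough in Section 6; the exact values are, of course, task-dependent):

    alpha = 0.001   # learning rate (step size)
    beta1 = 0.9     # decay rate for the first moment (average of gradients)
    beta2 = 0.999   # decay rate for the second moment (average of squared gradients)
    eps = 1e-8      # small constant added to the denominator to avoid division by zero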

   The Adam algorithm computes the exponential moving average of the gradient (first
moment) and the squared gradient (second moment) of the weights, where the parameters
beta1 and beta2 control the smoothing rates of these moving averages [7].
   In the context of exponential moving average (Figure 4), smoothing occurs by assigning
more weight to newer data. Thus, the model responds more to the recent changes in data
than to older values, allowing it to adapt more quickly to any new data trends.




Figure 4: Example of exponential moving average (blue line)
Source: created by the author
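   A minimal example (added for illustration) shows how an exponential moving average gives more weight to newer values; the smoothing factor beta plays the same role as beta1 and beta2 in Adam:

    # exponential moving average: each new value is blended with the running average
    def ema(values, beta=0.9):
      average, smoothed = 0.0, []
      for v in values:
        average = beta * average + (1.0 - beta) * v
        smoothed.append(average)
      return smoothed

    print(ema([1, 1, 1, 10, 1, 1]))
    # the spike at 10 is damped; note how the early values are pulled towards the
    # initial 0.0 - this is the bias that Adam later corrects for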
   The first and the second moments are statistical concepts. The first moment of data is
their mean value, and the second moment is the variance, which indicates how spread out
the data is around the mean value.
   In the context of the Adam optimization algorithm, the first moment of gradients is used
to estimate the mean value of gradients (which can be viewed as the "rate of change" of
model parameters), and the second moment of gradients is used to estimate the variance of
gradients (reflecting how gradients are spread out around the mean value).
   The main idea behind using moments in the Adam algorithm is to provide the algorithm
with additional information about previous weight updates and the gradient direction,
enabling better control over the optimization process.

                             $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$,                                (3)
                             $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

   As mt and vt are initialized as vectors of zeros, they tend to be biased towards zero,
especially during the initial time steps, and especially when the decay rates are small (i.e. β1
and β2 are close to 1). In the Adam algorithm, a bias correction is applied by dividing the
estimates $m_t$ and $v_t$ by $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ respectively, where t represents the current step. This
reduces the bias towards zero, ensuring that the initial parameter updates are more
accurate.

                                     $\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}$                                                (4)
                                     $\hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$                                                (5)

   Taking this correction into account, the parameter update rule takes the following form:

                               $w_{t+1} = w_t - \dfrac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$,                              (6)
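   Putting equations (3)-(6) together, a single update of one weight can be transcribed into code as follows (a minimal sketch added for clarity; t is the step counter starting from 1, and the default values are the common choices listed above):

    from math import sqrt

    def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
      # equation (3): biased first and second moment estimates
      m = beta1 * m + (1.0 - beta1) * g
      v = beta2 * v + (1.0 - beta2) * g ** 2
      # equations (4) and (5): bias correction
      mhat = m / (1.0 - beta1 ** t)
      vhat = v / (1.0 - beta2 ** t)
      # equation (6): parameter update
      w = w - alpha * mhat / (sqrt(vhat) + eps)
      return w, m, v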

6. Simple Adam example
In this section, we will provide an example of how the Adam algorithm works in its simplest
form. We will consider a scenario where we have a simple function that needs to be
optimized, and we will apply the Adam algorithm to find its minimum.
    Firstly, we need to define the loss function. We will use a simple two-dimensional
function that sums the squares of its inputs, and we will restrict the input range to [-1.0, 1.0]:
    def loss_function(x, y):
      return x ** 2.0 + y ** 2.0
    To visually observe the progress of the function, let's create a 2D plot:
    from numpy import asarray, arange, meshgrid
    from matplotlib import pyplot

    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
    yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
    x, y = meshgrid(xaxis, yaxis)
    results = loss_function(x, y)   # evaluate the loss on the grid
    pyplot.contourf(x, y, results, levels=50, cmap='jet')
    pyplot.show()

   Executing this code snippet generates a two-dimensional contour plot of the objective
loss function (Figure 5). This plot will serve as a visual representation of the points
investigated throughout the search for the local minimum of the function.




Figure 5: Two-dimensional plot of the loss function using Adam
Source: created by the author

   Let's move on to the Adam algorithm. First, we initialize the first and second moments
as zeros:
   m = [0.0 for _ in range(bounds.shape[0])]
   v = [0.0 for _ in range(bounds.shape[0])]

   After that, we compute the gradient (derivative) of the data:
   gradient = derivative(w[0], w[1])
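   The derivative function is not listed in the snippet above; for the loss x^2 + y^2 used here the gradient is simply [2x, 2y], so it can be defined as:

    def derivative(x, y):
      return asarray([x * 2.0, y * 2.0])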

   Now we need to apply the Adam parameter update rule. While in practice, a matrix
method is typically utilized for computation, for the sake of clarity in this example, we'll
employ an iterative approach. Given we have two parameters, we'll use a loop to update
both of them:
    for i in range(bounds.shape[0]):
      m[i] = beta1 * m[i] + (1.0 - beta1) * gradient[i]
      v[i] = beta2 * v[i] + (1.0 - beta2) * gradient[i]**2
   Then, still inside the loop over the parameters, we apply the bias correction (t is the iteration counter, starting from 0):
  mhat = m[i] / (1.0 - beta1**(t+1))
  vhat = v[i] / (1.0 - beta2**(t+1))
  In the end we update each parameter and, after the loop, calculate the loss:
  w[i] = w[i] - alpha * mhat / (sqrt(vhat) + eps)
  score = loss_function(w[0], w[1])
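   For convenience, the fragments above can be assembled into one runnable script (a minimal sketch; the random starting point, the number of iterations and the hyperparameter values below are common choices rather than the exact settings used for the figures):

    from math import sqrt
    from numpy import asarray
    from numpy.random import rand, seed

    def loss_function(x, y):
      return x ** 2.0 + y ** 2.0

    def derivative(x, y):
      return asarray([x * 2.0, y * 2.0])

    seed(1)
    bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
    alpha, beta1, beta2, eps = 0.02, 0.9, 0.999, 1e-8

    # random starting point inside the bounds
    w = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]

    for t in range(60):
      gradient = derivative(w[0], w[1])
      for i in range(bounds.shape[0]):
        m[i] = beta1 * m[i] + (1.0 - beta1) * gradient[i]
        v[i] = beta2 * v[i] + (1.0 - beta2) * gradient[i] ** 2
        mhat = m[i] / (1.0 - beta1 ** (t + 1))
        vhat = v[i] / (1.0 - beta2 ** (t + 1))
        w[i] = w[i] - alpha * mhat / (sqrt(vhat) + eps)
      score = loss_function(w[0], w[1])
      print('>%d f(%s) = %.5f' % (t, w, score))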

   Figure 6 illustrates the outcome of executing the code. The "Score" indicates the
value of the loss function.




Figure 6: Two-dimensional graph of the loss function optimization using Adam
Source: created by the author

  For comparison, the gradient descent algorithm, with the same function and the same
number of iterations, achieved significantly worse results (Figure 7).
Figure 7: Graph illustrating the performance of the gradient descent algorithm
Source: created by the author

7. Comparing Adam with other algorithms
To measure the effectiveness and universality of the Adam algorithm, we compared it with other
optimization algorithms on two datasets, MNIST and FashionMNIST. We chose them
because they are often used as a benchmark for testing new machine learning algorithms
and models.
   MNIST (Modified National Institute of Standards and Technology) is a classic dataset
consisting of 60,000 grayscale images of handwritten digits in the training set and
10,000 images in the test set.
   FashionMNIST is another popular dataset that contains 60,000 images in the training set
and 10,000 images in the test set. The images represent various types of clothing items such
as T-shirts, dresses, trousers, etc. This dataset was created for use in image classification tasks.
   We chose to test the Adam algorithm on both MNIST and FashionMNIST datasets
because they contain different kinds of data. MNIST has images of handwritten digits, while
FashionMNIST consists of more complicated pictures of clothing items. By evaluating
Adam's performance on these diverse datasets, we can see how well it works across
different types of data, showing its usefulness in various situations.
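   As an illustration of how such a comparison can be set up (a hedged sketch, not the exact code behind Figures 8-11; the PyTorch framework, the small fully connected network and the learning rates are assumptions made here):

    from torch import nn, optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def make_model():
      return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

    train_data = datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor())
    loader = DataLoader(train_data, batch_size=64, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()

    optimizers = {
      'SGD': lambda params: optim.SGD(params, lr=0.01),
      'Adagrad': lambda params: optim.Adagrad(params, lr=0.01),
      'RMSprop': lambda params: optim.RMSprop(params, lr=0.001),
      'Adam': lambda params: optim.Adam(params, lr=0.001),
    }

    for name, make_optimizer in optimizers.items():
      model = make_model()
      optimizer = make_optimizer(model.parameters())
      for images, labels in loader:            # one epoch per optimizer, for brevity
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
      print(name, 'loss after one epoch:', loss.item())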
   Figure 8 illustrates the training process of models on the MNIST dataset using various
optimization algorithms. The results indicate that, despite the task not being very
complicated, Adam successfully demonstrated the best performance among all the
algorithms. This is further evidenced in Figure 9, where the test accuracy of Adam surpasses
that of all other algorithms.
Figure 8: Graph illustrating the performance of different algorithms on the MNIST dataset
Source: created by the author




Figure 9: Test accuracy of different algorithms on the MNIST dataset
Source: created by the author

   Figure 10 illustrates the training process of models on the FashionMNIST dataset, which
is slightly more complex. Despite this complexity, Adam managed to outperform other
algorithms, demonstrating its effectiveness even for more challenging tasks. This is further
supported by Figure 11, where it is shown that the test accuracy of Adam exceeds that of all
other algorithms.
Figure 10: Graph illustrating the performance of different algorithms on the FashionMNIST
dataset
Source: created by the author




Figure 11: Test accuracy of different algorithms on the FashionMNIST dataset
Source: created by the author

    Analysis of the experimental results obtained on various datasets, including MNIST and
FashionMNIST, indicates a significant advantage of the Adam optimization algorithm. Its
effectiveness was demonstrated regardless of the complexity of object structures and the
diversity of classes in the datasets. Interestingly, while some algorithms may have shown
slightly better results on datasets with simpler structures and fewer classes, Adam proved
to be more efficient in all modeled scenarios. Overall, Adam provided faster and higher-
quality solutions to classification tasks compared to most other algorithms, confirming its
advantages in machine learning.
    To make sure our model works well for our specific needs, we made a small dataset
ourselves. The famous MNIST dataset, created by American researchers, might not match
our handwritten numbers perfectly. So, we wanted to see if our model could still understand
and categorize our handwritten characters correctly. This way, we could check if our model
is flexible and reliable for our purposes, not just for standard datasets.
    The dataset consists of 100 handwritten numbers from 0 to 9 (Figure 12).




Figure 12: An example of our handwritten dataset
Source: photographed by the author

   The results appear promising, with the model achieving an accuracy of 97%, meaning it
correctly predicted 97 out of 100 images. This level of accuracy suggests that the model is
performing well on our custom dataset, demonstrating its effectiveness in recognizing and
classifying our handwritten numbers.
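   As an illustration of how such a check can be performed (a hedged sketch; the folder name, the file-naming convention with the true digit as a prefix, and the 28x28 preprocessing are assumptions, and model_predict stands for whatever trained classifier is used):

    from pathlib import Path
    from numpy import asarray
    from PIL import Image

    def evaluate_on_custom_digits(model_predict, folder='custom_digits'):
      correct, total = 0, 0
      for path in Path(folder).glob('*.png'):
        label = int(path.name.split('_')[0])                 # e.g. "7_03.png" -> 7
        image = Image.open(path).convert('L').resize((28, 28))
        pixels = asarray(image, dtype='float32') / 255.0     # scale pixel values to [0, 1]
        prediction = model_predict(pixels)                   # should return a digit 0-9
        correct += int(prediction == label)
        total += 1
      return correct / total if total else 0.0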
   In further research, it is planned to use the Adam algorithm to analyze data from
cyber-physical systems [8, 9] and biosensors [10], as well as the results of cardiac signal
processing [11].

8. Conclusions
In this study, the Adam algorithm was investigated in the context of optimization in machine
learning. The main conclusions and results of the study are as follows:

   1.   The Adam algorithm is an effective optimization method that combines ideas from
        other algorithms such as RMSProp and AdaGrad.
   2.   Experiments on various datasets, such as MNIST and FashionMNIST, showed that the
        Adam algorithm is capable of achieving good results across a wide range of machine
        learning tasks.
   3.   The Adam algorithm is effective for optimizing tasks involving both large and small
        datasets, as demonstrated by experimental results.

9. References
[1] D. P. Kingma and J. L. Ba, Adam: a method for stochastic optimization,
     arXiv:1412.6980v9 [cs.LG], 2015.
[2] R. Zaheer and H. Shaziya, A Study of the Optimization Algorithms in Deep Learning,
     March 2020.
[3] H. Xiao, K. Rasul, and R. Vollgraf, Fashion-mnist: a novel image dataset for
     benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747, 2017.
[4] S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint
     arXiv:1609.04747, 2016.
[5] S. Wang, C. Li, X. Ding, Demystifying Parallel and Distributed Deep Learning: An In-
     Depth Concurrency Analysis, arXiv:1802.09941v2 [cs.LG], 15 Sep 2018.
[6] J. Brownlee, Gentle Introduction to the Adam Optimization Algorithm for Deep Learning,
     2017, https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
[7] J. Brownlee, Code Adam Optimization Algorithm From Scratch, 2021,
     https://machinelearningmastery.com/adam-optimization-from-scratch/
[8] V. Martsenyuk, A. Sverstiuk, A. Klos-Witkowska, N. Kozodii, O. Bagriy-Zayats, I.
     Zubenko, Numerical analysis of results simulation of cyber-physical biosensor systems.
     CEUR Workshop Proceedings, 2019, 2516, pp. 149–164.
[9] V. Martsenyuk, A. Sverstiuk, O. Bahrii-Zaiats, A. Kłos-Witkowska, Qualitative and
     Quantitative Comparative Analysis of Results of Numerical Simulation of Cyber-
     Physical Biosensor Systems. (2022) CEUR Workshop Proceedings, 3309, pp. 134 – 149.
[10] V. Martsenyuk, A. Klos-Witkowska, S. Dzyadevych, A. Sverstiuk, Nonlinear Analytics for
     Electrochemical Biosensor Design Using Enzyme Aggregates and Delayed Mass Action.
     Sensors, 2022, 22(3), 980.
[11] V. Trysnyuk, A. Zozulia, S. Lupenko, I. Lytvynenko, A. Sverstiuk, Methods of rhythm-
     cardio signals processing based on a mathematical model in the form of a vector of
     stationary and stationary connected random sequences. CEUR Workshop
     Proceedings, 2021, 3021, pp. 197–205.