<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparison of neural network efficiency and learning process in relation to various activation functions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paulina Hałatek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paweł Noras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, Gliwice</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Artificial intelligence and machine learning play a great role in today's world. Many new systems are based on machine learning, such as systems for detecting threats, detecting malicious network traffic, and even creating an image or a response from a given input. All those systems have one common feature: all were created using neural networks. A neural network is a subset of the machine learning process that helps make advanced predictions about a given data set. The idea of creating a neural network was inspired by the biological nature of the neurons in a human brain. The concept and its biological counterpart are much alike, because the main purpose of a neural network is to try to mimic the way biological neurons signal each other. Our paper consists of understanding the concept of a neural network and implementing a simple neural network to conclude how the use of different activation functions (sigmoid, tanh, ReLU and Gaussian) affects the learning process of our network.</p>
      </abstract>
      <kwd-group>
        <kwd>Neural network</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>binary classification</kwd>
        <kwd>forward propagation</kwd>
        <kwd>backward propagation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Artificial intelligence has become a trending topic in recent years and has captured the attention of
many researchers and developers around the world. It has the potential to revolutionize the way we
live and work and it has already begun to transform many industries. One of the components of
Artificial intelligence is a neural network, which is a type of machine learning algorithm.
Neural networks are inspired by how the human brain works; they are made up of layers, which consist
of interconnected nodes, or neurons. By adjusting the parameters of each connection between nodes, a
network can learn to recognize patterns in given data and can be used to make classifications based on its
training data. Nowadays neural networks are commonly used in various situations, such as image
classification, where the most commonly used type of neural network is the convolutional neural network [1].
Furthermore, speech recognition, which is widely used in some text editors and commonly known as speech to text
[2], likewise uses neural network models. Moreover, they can be found in the automotive industry,
especially in self-driving autonomous cars, which have many sensors that serve vast input data to
neural networks [3].</p>
      <p>Although neural networks have a lot of potential, they can be computationally expensive [5] and
require large amounts of training data to make accurate predictions, especially in more
complex tasks. Nevertheless, researchers and developers are continuously working to overcome the
current weaknesses of neural networks and to develop new algorithms and techniques, so the
capabilities of neural networks are only likely to grow in the future.</p>
      <p>2023 Copyright for this paper by its authors. CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>Activation functions are a crucial component of neural networks, as they determine whether a neuron
should be activated or not; they directly influence the output of each neuron and tell us whether it is
important in the process of prediction. In addition, another factor that can impact the efficiency and
accuracy of a neural network is the learning rate [4]. By carefully selecting these two parameters, it is
possible to achieve better results.</p>
      <p>In our paper we want to focus on the performance of different activation functions with several
learning rate values. At the beginning we will describe which data set will be used and what techniques and
algorithms were used during the creation of our neural network. After that there will be an analysis of the
collected results, whose goal is to find the best activation function for each of the examined learning
rates, draw conclusions based on the results of the experiments, and possibly find some relationships
between each activation function and the learning rate value.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data set</title>
      <p>For the neural network we decided to take a simple data set describing people's gender based on two
features: weight (lbs) and height (inch). The data set is available on kaggle.com and contains 10,000
samples. Since we wanted to create a simple neural network, we decided on binary
classification, so before we started to implement the neural network, we first replaced the text gender with
0/1 values: 0 for Male and 1 for Female. Next, we ensured that our data set did not contain null values.
Subsequently, we shuffled, normalized and split the data set by dividing it into 70% as a training set
and 30% as a validation set.</p>
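As a minimal sketch (our own code, not the authors'), the preprocessing steps above could look like the following; the data here is a synthetic stand-in for the actual Kaggle file.

```python
# Sketch of the preprocessing described above; the synthetic samples and the
# column layout are assumptions, not taken from the real data set.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic [weight(lbs), height(inch)] samples with a text gender label.
n = 1000
heights = rng.normal(66.0, 4.0, n)
weights = rng.normal(161.0, 32.0, n)
genders = np.where(heights > 66.0, "Male", "Female")

X = np.column_stack([weights, heights])
y = np.where(genders == "Female", 1.0, 0.0)   # 0 for Male, 1 for Female

# Shuffle, normalize (z-score per feature), and split 70% / 30%.
idx = rng.permutation(n)
X, y = X[idx], y[idx]
X = (X - X.mean(axis=0)) / X.std(axis=0)

split = int(0.7 * n)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]
print(X_train.shape, X_val.shape)   # (700, 2) (300, 2)
```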
    </sec>
    <sec id="sec-3">
      <title>3. Neural network</title>
      <p>Before we go into the details of building a neural network, we need to explain two crucial things first:
weights and biases. Each connection between nodes in a neural network has a weight that determines the strength of the
signal transmitted through the connection. Bias, on the other hand, helps to shift the output of each
neuron in a neural network, allowing it to better fit the data. Both weights and biases are extremely
important in a neural network, since the main purpose of training a neural network is to adjust those values to
optimize the network's performance. In our model we initialized all weights with random values and
all biases with zeros.</p>
      <p>Our neural network consists of one input layer with two nodes, weight and height, one hidden layer
that has four nodes, h1 to h4, and one output layer that carries information about the predicted value
for the given input values.</p>
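As an illustrative sketch (names and random seed are ours), the layer sizes and initialization described above could look like:

```python
# Shape of the 2-4-1 network described above: weights random, biases zero.
import numpy as np

rng = np.random.default_rng(42)
W1 = rng.normal(size=(2, 4))   # weights between input layer and hidden nodes h1..h4
b1 = np.zeros(4)               # one bias per hidden node, initialized to zero
W2 = rng.normal(size=(4, 1))   # weights between hidden layer and output node
b2 = np.zeros(1)               # output-node bias
```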
      <p>A neural network relies on repetition and improvement. That is why the next steps that we will
explain are repeated a given number of times. The more iterations the model performs, the better
adjusted the weights and biases become and the more accurate the model is.</p>
      <sec id="sec-3-1">
        <title>3.1. Forward propagation</title>
        <p>The first step in implementing a neural network is forward propagation. For simplicity, we will
explain it on a single sample and a single node, but everything works the same way for all the other
nodes. In the input layer we have two inputs x1 and x2 with the corresponding weights w1.1 and w2.1;
each node in the hidden layer is connected to the inputs x1 and x2 with different weights, and
each node performs its own calculation to obtain the corresponding value in the hidden layer.
To get the h1 value we simply need to use this equation:</p>
        <p>h1 = (x1 * w1.1 + x2 * w2.1) + b1 (1)</p>
        <p>where h1 is the weighted value, x1 and x2 are the input values, w1.1 is the weight between x1 and h1
(input layer and hidden layer), w2.1 is the weight between x2 and h1, and b1 is h1's bias between the
input layer and the hidden layer.</p>
        <p>Next, we apply the selected activation function. After that, we do the same thing as in the previous
step, now between the hidden layer and the output layer: the input value is the activated value from the
activation function. Note that using a single hidden node is only for simplicity; in fact the hidden layer
has four nodes, which gives us another three weights between the hidden layer and the output layer.</p>
        <p>y1 = (h1 * wH1.1) + bH1.1 (2)</p>
        <p>where y1 is the predicted value, h1 is the activated value in the hidden layer, wH1.1 is h1's weight
between the hidden layer and the output layer, and bH1.1 is h1's bias between the hidden layer and the
output layer.</p>
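The forward pass above, vectorized over the whole 2-4-1 network, can be sketched as follows (our own code; sigmoid is used as the activation here, and the sample values are illustrative):

```python
# Equations (1) and (2) applied across all four hidden nodes at once.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output

def forward(x):
    z1 = x @ W1 + b1        # eq. (1) for each hidden node h1..h4
    h = sigmoid(z1)         # activated hidden values
    z2 = h @ W2 + b2        # eq. (2)
    return sigmoid(z2)      # final sigmoid keeps the prediction in (0, 1)

x = np.array([0.5, -1.2])   # one normalized (weight, height) sample
y_hat = forward(x)
print(float(y_hat))         # a probability between 0 and 1
```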
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Activation functions</title>
        <p>Activation functions decide whether a neuron's input to the network is important in the operation of
determining a prediction. In our article we will use several functions: Sigmoid, TanH, ReLU
and Gaussian. We will attempt to conclude which of those functions works best in a neural
network for binary classification. They will be used in forward propagation to compute a predicted value,
but also in backward propagation, where their derivatives are used to adjust the weights and biases so
that the model fits better.</p>
        <p>It is worth mentioning that we use the binary cross-entropy loss function, which is commonly used for
binary classification. In binary classification we only have two possible outcomes, hence to ensure
that the final output value is between 0 and 1, we need to use an activation function that maps the output
into that range. Therefore we will use the sigmoid function as the last activation function.</p>
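The four activation functions compared in the paper, together with the derivatives used later in backward propagation, can be sketched as follows (our own code; we assume the Gaussian meant here is exp(-x²), a common choice):

```python
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0.0, x)
def gaussian(x): return np.exp(-x ** 2)          # assumed form of "Gaussian"

# Derivatives, used when adjusting weights and biases in backward propagation.
def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)
def tanh_prime(x):     return 1.0 - np.tanh(x) ** 2
def relu_prime(x):     return np.where(x > 0, 1.0, 0.0)
def gaussian_prime(x): return -2.0 * x * np.exp(-x ** 2)
```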
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Calculating loss</title>
        <p>A loss function calculates the difference, called the "loss", between the predicted output of the
network and the actual output. It is a measure of how well the network's predictions match the true
values. Since our model has only two possible outputs, we use the binary cross-entropy loss function,
which is dedicated to such cases. It measures the difference between the predicted probability of the
positive class and the true probability of the positive class.</p>
        <p>L = -(1/N) * Σ(i=1..N) [ y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i) ] (3)</p>
        <p>where y_i is the data set output (0 or 1), ŷ_i is the predicted probability of the positive class (the
class labeled as 1), and N is the output size.</p>
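Equation (3) translates directly into code; in the sketch below (ours, not the authors') we clip the predictions with a small epsilon so that log(0) never occurs:

```python
# Mean binary cross-entropy loss; the epsilon clipping is our addition.
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y     = np.array([1.0, 0.0, 1.0])   # true labels
y_hat = np.array([0.9, 0.1, 0.8])   # predicted probabilities of class 1
print(bce_loss(y, y_hat))           # small loss: predictions match well
```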
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Backward propagation</title>
        <p>Backward propagation is a learning algorithm that allows a neural network to adjust its weights based
on the difference between the predicted output and the actual output. The algorithm works by
computing the gradient of the loss function with respect to each weight in the neural network. The
gradient tells us how much the output of the neural network will change when we change the given
weight.</p>
        <p>This is an iterative process that is repeated many times throughout the learning process, adjusting
weights of the neural network after each iteration to reduce the error and help achieve better
predictions, as the accuracy of the network improves over time.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.4.1. Finding gradient</title>
        <p>In order to establish which node is responsible for the most loss in every layer, and therefore which
node’s weight value must be changed, we need to find the gradient of the loss function. To compute
the gradient, we use the chain rule of calculus. The chain rule allows us to break down the derivative
of the output into smaller derivatives that are easier to compute. We then use these smaller derivatives
to calculate the gradient of the loss function.</p>
        <p>We will try to explain it on a simple figure:</p>
        <p>In order to calculate delta_0 we need to use the following formula:</p>
        <p>delta_0 = w * delta_1 * f'(z) (4)</p>
        <p>To get the loss of the h1 node we need to multiply three components: w, which is the weight between
the hidden layer and the output layer; delta_1, which is already known, since it is simply the delta between
the predicted value and the real output value; and f'(z), which is the derivative of the activation function
applied to the value obtained through forward propagation. We do the same calculation for all the other
nodes, with the aim of discovering the loss at every node.</p>
        <p>Once we have obtained the deltas for our neural network, we are able to calculate the partial
derivative of the loss function with respect to each weight.</p>
        <p>We simply need to multiply the z value, which is obtained through forward propagation, with delta_1,
which is the loss at the unit on the other end of the weighted link. We do the same calculation on all
the weights, but it is worth mentioning that the z value for an input node is simply the input value itself.</p>
        <p>The partial derivative of the loss function with respect to bias is straightforward, since the input
value of a neuron is simply x * w + b and the partial derivative of that with respect to the bias is simply 1.
So when we are calculating the partial derivative of the loss function with respect to the bias, we simply
ignore the output (z) and multiply the delta by one.</p>
        <p>∂L/∂w = z * delta_1 (5)</p>
        <p>∂L/∂b = delta_1 (6)</p>
        <p>where z is the value obtained through forward propagation and delta_1 is the loss at the unit on the
other end of the weighted link.</p>
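The delta and gradient computations of this subsection can be sketched for the 2-4-1 network as follows (our own code; the network values and the sample are illustrative, and sigmoid is used throughout):

```python
# Eq. (4) propagates the output delta back to the hidden layer;
# eq. (5)-(6) then give the weight and bias gradients.
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

x, y = np.array([0.5, -1.2]), 1.0       # one training sample

# Forward pass, keeping the intermediate z values for the derivatives.
z1 = x @ W1 + b1; h = sigmoid(z1)
z2 = h @ W2 + b2; y_hat = sigmoid(z2)

# Output delta: with sigmoid + binary cross-entropy this simplifies to (y_hat - y).
delta_1 = y_hat - y
# Eq. (4): delta at each hidden node = w * delta_1 * f'(z).
delta_0 = (W2 @ delta_1) * sigmoid(z1) * (1.0 - sigmoid(z1))

# Eq. (5)-(6): gradient = z at one end of the link times delta at the other.
dW2, db2 = np.outer(h, delta_1), delta_1
dW1, db1 = np.outer(x, delta_0), delta_0
```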
      </sec>
      <sec id="sec-3-6">
        <title>3.4.2. Gradient descent</title>
        <p>Once we have calculated the gradient of the loss function, we can adjust the weights of the neural
network in the direction of the negative gradient. This process is known as gradient descent, and it
helps to reduce the error and improve the accuracy of the neural network. By repeating this process
many times, the neural network can learn to make better predictions and improve its accuracy over
time. With the undermentioned formulas we are able to adjust weights and biases in our neural
network:
  =   −  ∗
  =   −  ∗


  
  
Wn
Bn
lr

  


weights between n and n – 1 layer
biases between n and n – 1 layer
learning rate
the partial derivative of cost function with respect to  
the partial derivative of cost function with respect to</p>
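The update rule above is a one-liner in code; the sketch below (ours; the parameter names and gradient values are illustrative) applies one gradient-descent step to a weight matrix and its bias vector:

```python
# One gradient-descent step: move each parameter against its gradient.
import numpy as np

def gradient_descent_step(W, b, dW, db, lr):
    """One update: Wn = Wn - lr * dL/dWn, Bn = Bn - lr * dL/dBn."""
    return W - lr * dW, b - lr * db

W  = np.ones((2, 4))
b  = np.zeros(4)
dW = 0.5 * np.ones((2, 4))   # pretend gradients from backward propagation
db = 0.1 * np.ones(4)

W_new, b_new = gradient_descent_step(W, b, dW, db, lr=0.1)
```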
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In order to properly determine which of the selected activation functions yields the greatest
efficiency in the neural network, we performed numerous tests with different learning rate values.
The learning rate is an essential component, since it determines how much the model's parameters are
adjusted in response to the error gradient calculated during backward propagation. It is crucial since it
establishes the step size taken while searching for an optimal set of weights that minimizes the loss
function. If the learning rate is too large, the model may overshoot the optimal solution and fail
to converge. On the other hand, if the learning rate is too small, the model may converge very slowly
or get stuck in a local optimum.</p>
      <p>Hence, we performed a series of tests on the same data group to establish the best learning rate for
our neural network model. The following charts show the selected learning rates, the chosen activation
functions, and how the loss value changes over the iterations of the learning process.</p>
      <p>Since we have several results for different learning rates, we will compare each learning rate
separately in order to determine which activation function did best. We will compare two
metrics: the loss function value at the end of 20,000 iterations and the accuracy of the overall model
using a given activation function.</p>
      <p>In the above figure, for a learning rate equal to 1.0, we can see that at 20,000 iterations the ReLU
activation function had the lowest loss value, but Sigmoid and Tanh were really close to it. Gaussian
clearly stands behind the other three when it comes to loss value and accuracy. On the ReLU, Tanh
and Gaussian plots we can see jumps at the beginning of the learning process,
which means that the learning rate for those activation functions was too high and caused the
model to overshoot the optimal solution.</p>
      <p>Another odd thing happened for the Gaussian function; we suppose that the learning rate is too high
for this particular function, and therefore the model was getting worse over time. We can claim that the
learning rate was too high and caused the model to overshoot the optimal solution. In the end Tanh,
Sigmoid and ReLU performed well for the given learning rate, but since Sigmoid had the highest accuracy
and was the only one that did not struggle at the beginning, we decided that it was the best activation
function for a learning rate equal to 1.0.</p>
      <p>For a learning rate equal to 0.4, we can conclude that the Tanh function has the best results, but, as
before, Tanh, Sigmoid and ReLU have very similar results at the end. We can see that there are no
longer any jumps in the Tanh and ReLU plots, but instead Sigmoid seemed to struggle a little bit more
than last time. Gaussian is still overshooting the optimal solution and clearly performs the worst of
all the activation functions.</p>
      <p>For a learning rate equal to 0.3 we can see very similar results to the ones before, at a learning rate
equal to 0.4. Again Tanh has the best results of them all, Sigmoid seemed to struggle a little bit
at the beginning as well, and Gaussian still struggles to find the optimal solution, so even a 0.3 learning
rate might be too large for it.</p>
      <p>The first thing that we noticed after changing the learning rate to 0.1 is that the Gaussian function
finally does not overshoot the optimal solution and its loss function ultimately
decreases. Accordingly, the accuracy for Gaussian increases as well, and it even matches the accuracy
of the Tanh function. All of the functions performed very similarly to each other, but Sigmoid's
performance looks like it is decreasing with a lower learning rate. Since Tanh has the lowest loss
function value, but ReLU has slightly higher accuracy, and the plots of these functions look relatively
the same, we have decided that both of them can be considered the best for a 0.1 learning rate value.</p>
      <p>In the above image we can see that at the beginning all functions learn a little bit slower
compared to the previously examined learning rates. Clearly the worst performing activation function is
Sigmoid; our assumption from the previous analysis turned out to be true, that a lower learning rate
indicates weaker performance results for it. Tanh again has the lowest loss value of all the functions, but
as before it falls a little bit short on accuracy compared to the ReLU function; as before, both of these
functions clearly performed better than the other two.</p>
      <p>The last comparison is for a learning rate equal to 0.001. From the given table it can be
concluded that the Gaussian function is the best, since it has the lowest loss function value and the
greatest accuracy. For the other activation functions we can determine that when we significantly
decreased the learning rate value, the model for each activation function was not able to make accurate
predictions; the loss decreased very slowly, which means that it took a
lot of time for the model to properly learn from the given data. Taking into account all the figures, we
can presume that the Gaussian activation function works best when the learning rate decreases.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Through analysis of the attained results for all activation functions, we concluded that there is no
single answer to which of the given activation functions works best;
therefore we decided that the best option is to determine it for each learning rate
separately.</p>
      <p>We are now aware that the learning rate is a crucial element in creating a neural network model:
when the learning rate is too high, the model may overshoot the optimal
solution and fail to converge. This is what happened with, for example, the Gaussian activation
function; when we decreased the value, the model started to work properly. On the other hand, if the
learning rate is too small, the model may converge very slowly. We could see this when the learning rate
was equal to 0.01 or 0.001: the accuracy of the models was significantly worse compared to other
learning rate values.</p>
      <p>Overall, we can identify some dependencies of the examined activation functions on the learning rate
value. Firstly, the most obvious one is that the performance of the Gaussian function increases with a
lower learning rate value. Conversely, it could be observed that the Sigmoid function's results tend to
be weaker as the learning rate gets lower. From these observations we can draw some conclusions. When
one deals with a high learning rate, Sigmoid might be the best activation function to choose. When
dealing with lower learning rates, Gaussian may turn out to be the best performing one. If one has to
deal with something in the middle, one can't go wrong with ReLU or Tanh, because these two seemed
to have the most consistent performance throughout all experiments, but fall a little bit short at very
high and very low learning rate values.</p>
      <p>This project was interesting to create, since it was our first experience with neural networks, and the
work and effort that went into completing this article are practical and applicable. This research
offered an opportunity to learn and expand our knowledge about the immeasurable possibilities offered
by neural networks; likewise, creating this article helped us better understand how neural networks
work, and the analysis allowed us to acquire practical experience in data analysis.</p>
      <p>In the future we want to broaden our knowledge of neural networks by creating new experiments
using a deep neural network with more complex data sets. It would be truly interesting to learn how the
size of the hidden layers affects the learning process or the effectiveness of the entire model. Another
idea that would be an interesting way to develop our model is to create some kind of visualisations that
would represent how fast a model is learning. With a large knowledge base about neural networks we
could create an application that takes an input given by the user and, based on it, draws conclusions
from the obtained results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Neha</given-names>
            <surname>Sharma</surname>
          </string-name>
          , Vibhor Jain,
          <string-name>
            <given-names>Anju</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <article-title>An Analysis Of Convolutional Neural Networks For Image Classification</article-title>
          ,
          <source>Procedia Computer Science</source>
          , Volume
          <volume>132</volume>
          ,
          <year>2018</year>
          , Pages
          <fpage>377</fpage>
          -
          <lpage>384</lpage>
          , ISSN 1877-0509.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sanket</given-names>
            <surname>Shah</surname>
          </string-name>
          , Hardik Dudhrejia,
          <article-title>Speech Recognition using Neural Networks</article-title>
          ,
          <source>INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH TECHNOLOGY (IJERT)</source>
          Volume
          <volume>07</volume>
          , Issue 10 (October -
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Tian</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jana</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2017</year>
          ). DeepTest: Automated Testing of Deep-NeuralNetworkdriven Autonomous Cars.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Igiri</given-names>
            <surname>Chinwe</surname>
          </string-name>
          , Uzoma Anyama, Silas Abasiama. (
          <year>2021</year>
          ).
          <source>Effect of Learning Rate on Artificial Neural Network in Machine Learning</source>
          .
          <source>International Journal of Engineering Research</source>
          ,
          <volume>4</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Livni</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shalev-Shwartz</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shamir</surname>
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>On the computational efficiency of training neural networks</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>27</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] https://www.v7labs.com/blog/neural-networks-activation-functions</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] https://medium.com/analytics-vidhya/what-do-you-mean-by-forward-propagation-in-ann9a89c80dac1b</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] https://towardsdatascience.com/part-2-gradient-descent-and-backpropagation-bf90932c066a</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] https://towardsdatascience.com/calculating-gradient-descent-manually-6d9bee09aa0b</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] https://en.wikipedia.org/wiki/Activation_function</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] https://www.linkedin.com/advice/3/how-does-cross-entropy-mean-squared-error-affect</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] https://www.askpython.com/python/examples/backpropagation-in-python</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] https://builtin.com/machine-learning/backpropagation-neural-network</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>