<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weighted Shifted S-shaped Activation Functions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergiy Popov</string-name>
          <email>serhii.popov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Pikhulya</string-name>
          <email>dmytro.pikhulia@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky Ave. 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ProfIT AI 2024: 4</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we propose a family of activation functions (AFs) that can be considered as smooth approximations of bounded ReLU and similar AFs. These AFs are constructed by using a shifted origin-aligned S-shaped function as a basis and weighing it with another S-shaped function, similar to how the SiLU/GELU AFs weigh the f(x) = x function. The use of both regular and adaptive variants of such AFs is explored. The performance of the proposed family of AFs is evaluated in terms of image classification accuracy with CNN models by comparing their multiple variants with popular existing AFs on the CIFAR-10 and Fashion-MNIST datasets using the Adam and stochastic gradient descent (SGD) optimizers with different learning rates. Overall, 28 variants of the proposed AFs are compared with 21 variants of popular existing AFs (including ReLU-like functions such as ReLU, Leaky ReLU, SiLU, GELU, PReLU, Swish, etc. and some S-shaped AFs), and 6 shifted S-shaped AFs. The experiments have shown that in most cases the adaptive versions of the proposed AFs provide a pronounced image classification accuracy advantage over all considered existing AFs when the Adam optimizer is used, and no consistent advantage with the SGD optimizer. Further research regarding the use of these AFs with the SGD optimizer, and the use of their non-adaptive variants, is required.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Nowadays, the ReLU activation function is among the most popular choices of AFs for convolutional neural networks (CNNs), along with its variations, such as Leaky ReLU [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], SiLU [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], GELU [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], ELU [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], etc. In this paper, such functions
are called ReLU-like ones for convenience.
      </p>
      <p>
        The ReLU-like functions effectively solve the vanishing gradient problem [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], which is a typical
problem with S-shaped AFs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In many cases, using ReLU-like functions leads to better model
effectiveness in solving image classification tasks with CNNs than S-shaped functions like
Sigmoid and Tanh [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        It has been shown that bounding of the ReLU function can be beneficial for training stability and
classification accuracy with functions like BReLU, BLReLU [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. At the same time, there are some
improved variants of ReLU (LReLU, GELU, SiLU, PReLU) that can show better results, but these
functions are not bounded; hence there is potential in exploring bounded versions of such
functions to see whether this could produce a cumulative improvement effect that would lead to better
results than any of these functions alone.
      </p>
      <p>0000-0002-1274-5830 (S. Popov); 0009-0003-6280-1890 (D. Pikhulya). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>
        The way that BReLU or similar (ReLU-6 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) functions are bounded causes the function to take a
fixed value with a zero derivative after a certain value of its argument, which could degrade the
network’s training process. Alternative functions, which are formally not bounded but still limit the
function’s growth after a certain value of its argument, are functions like BLReLU [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
PLU [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The BReLU, BLReLU, and PLU AFs are, however, piecewise linear functions. In this work it
is assumed that smoothing the transitions in such functions by replacing the piecewise linear functions
with smoother approximation functions could improve the overall network’s approximation
capabilities. The intuition behind this assumption is that real-world data would presumably
be diverse enough to have no or few hard edges in the distributions of most of its aspects.
      </p>
      <p>
        Some of the functions that can be useful for creating smooth approximations of bounded
ReLU-like functions are the shifted S-shaped functions. The work [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] shows that a modified version of
the Tanh function, which is shifted horizontally and vertically while still maintaining an intersection
with the origin, called Shifted Tanh, achieves better performance than the Tanh function, and can
show performance that is similar to or slightly higher than that of the ReLU AF.
      </p>
      <p>
        In this paper, we take a further look at such shifted S-shaped functions by first evaluating the
performance of shifted variants of the Atan and Asinh functions, and then modifying them to
represent better smoothed approximations of bounded ReLU-like functions. In this paper the
respective shifted AFs conventionally have the “So” prefix added to them (meaning “shifted,
origin-aligned”): SoTanh (same as Shifted Tanh in [
        <xref ref-type="bibr" rid="ref11">11</xref>
          ]), SoAtan, SoAsinh. Regarding the Asinh function in
particular, it is worth noting that, unlike most S-shaped functions, it is unbounded, which
could potentially be a useful property for mitigating the vanishing gradient problem, as it tends to
have higher first-derivative values for a wider range of x than bounded functions like Tanh and
Atan.
      </p>
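      <p>The derivative-decay comparison above can be checked numerically from the closed-form derivatives (a quick illustrative sketch, not taken from the paper): tanh′(x) = 1 − tanh²(x) decays exponentially, atan′(x) = 1/(1 + x²) quadratically, and asinh′(x) = 1/√(1 + x²) slowest of the three.</p>

```python
import numpy as np

# Closed-form first derivatives of the three base S-shaped functions,
# evaluated a few units away from the origin (illustrative check only).
x = 3.0
d_tanh = 1 - np.tanh(x) ** 2        # decays exponentially
d_atan = 1 / (1 + x ** 2)           # decays quadratically (= 0.1 at x = 3)
d_asinh = 1 / np.sqrt(1 + x ** 2)   # decays slowest: Asinh keeps the largest gradient
```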
      <p>More specifically, shifting an S-shaped function to the right is seen as potentially beneficial
for the following reasons:</p>
      <p>
        This makes the negative part of the function closer to the x axis, similar to the shape of ReLU-like
functions. It is informally assumed in this work that the proximity of ReLU-like functions to
0 plays a certain role in their effectiveness with CNNs. One of the explanations might be the
assumption that such an AF property encourages learning sparse representations of the network’s
inputs [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ].
      </p>
      <p>This gives the function, in its positive part, a longer range where the function’s value
is close to the f(x) = x function, compared to a regular unshifted version of the same
S-function, which also makes their shape closer to that of the Bounded ReLU or BLReLU AFs, while
also having smooth transitions. The proximity of the positive part of the function to the f(x) = x
function is hypothesized to contribute to preventing both vanishing and exploding
gradients in deep networks.</p>
      <p>After testing the performance of the SoTanh, SoAtan, and SoAsinh functions, we introduce their
modified versions, which make them closer to bounded ReLU-like functions. One notable
difference of the SoTanh, SoAtan, and SoAsinh functions from functions like ReLU, SiLU, and GELU is that the
negative part of the SoTanh, SoAtan, and SoAsinh AFs is notably farther away from the x axis than that of
ReLU/SiLU/GELU. Hence, it is hypothesized that these shifted S-shaped functions might fail to
introduce the activation sparsity that can be seen with ReLU/SiLU/GELU. Thus, there could be
potential for improving their performance by “pushing” their negative part closer to the x axis. Since
we strive to create smooth approximations of bounded ReLU-like functions, in this work we choose to explore
the same method of “pushing” the negative part to the x axis as the one used by the SiLU and GELU
functions. As a result, we create shifted S-shaped functions that are weighted by
another S-shaped function with a range of (0; 1) and a value of 0.5 at x = 0. In this paper we
call such a family of functions weighted shifted origin-aligned S-shaped functions (WSoS
functions). By using two variants of weight functions and three variants of base S-shaped functions,
in this work we introduce and investigate the following specific WSoS AFs: SiSoTanh, SiSoAtan,
SiSoAsinh, GeSoTanh, GeSoAtan, GeSoAsinh (see section 2.2).</p>
      <p>Considering the fact that bounded functions are prone to causing the vanishing gradient problem,
in this work we try to mitigate the likelihood of this problem by prolonging the range in the function’s
positive part where the function’s values stay close to the f(x) = x function. We do this by
introducing scaling parameters. Besides, there is one more adjustable parameter that defines
the amount by which the base S-function is shifted along the x axis.</p>
      <p>Finding a good combination among permutations of all the AF parameters can potentially be hard,
so we first perform experiments with certain fixed parameter values, and then make these
parameters trainable by creating adaptive versions of these AFs. In the case of the adaptive versions of
the AFs, we share the parameters across the entire model rather than introducing a different trainable
parameter set per network neuron.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Shifted origin-aligned S-shaped AFs</title>
        <p>
          The notion of shifted origin-aligned S-shaped functions is not new (the Shifted Tanh function
was explored in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], but we include them in the comparison to see how the AFs proposed in this
work stack up against them along with other existing AFs. Besides, in addition to the Shifted Tanh
function (which is called SoTanh in this work for brevity and consistency with the proposed AFs),
this work also explores the shifted versions of Atan and Asinh functions, which follow the same
pattern, and evaluates the performance of their adaptive variants.
        </p>
        <p>The general form of such shifted origin-aligned S-shaped functions SoS(x) used in this work can
be described with the following formula:
SoS(x) = S(x − α) + S(α),
(1)
where</p>
        <p>S is an arbitrary S-shaped function, which is also called a base function in this paper, and
α is a value by which the function is shifted horizontally.</p>
        <p>This results in the following three AFs, which represent the shifted versions of Tanh, Atan, and
Asinh functions (see Table 1). Examples of such AFs can be seen in Figure 1.</p>
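      <p>Formula (1) can be sketched directly as follows (a minimal illustration; since Tanh, Atan, and Asinh are odd functions, S(−α) + S(α) = 0, so each shifted variant still passes through the origin):</p>

```python
import numpy as np

# Formula (1): SoS(x) = S(x - alpha) + S(alpha).
# For an odd base function S, SoS(0) = S(-alpha) + S(alpha) = 0,
# i.e. the shifted function remains origin-aligned.
def shifted(S, alpha):
    return lambda x: S(x - alpha) + S(alpha)

so_tanh = shifted(np.tanh, 1.0)      # SoTanh (Shifted Tanh)
so_atan = shifted(np.arctan, 1.0)    # SoAtan
so_asinh = shifted(np.arcsinh, 1.0)  # SoAsinh
```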
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Weighted shifted origin-aligned S-shaped AFs</title>
        <p>The family of AFs proposed in this work (by convention called as WSoS AFs family in this paper)
contains the modifications of shifted origin-aligned S-shaped functions (see section 2.1), whose
negative part is softly pushed closer to the x axis by weighing them with another S-shaped function,
as can be described in a general form by this formula:
WSoS(x) = γ · β · SoS(x / β) · W(x),
(2)
where
γ is the function’s vertical scaling parameter,
W(x) is any S-shaped function in range (0; 1), symmetric with respect to the point (0, 0.5), and
β is a horizontal and vertical scaling parameter for the SoS function.</p>
        <p>SoS(x) is a shifted origin-aligned S-shaped function (1).</p>
        <p>The way of weighing a shifted origin-aligned S-shaped function chosen in this work
is similar to the way the f(x) = x function is weighed in the SiLU and GELU AFs, where the logistic
sigmoid and the Gauss error function are used respectively as the weight function W(x). The respective
variants used in this work, informally called Si and Ge, are shown in Table 2.</p>
        <p>With the three variants of shifted S-shaped functions listed in Table 1, this results in 6 specific
AFs that belong to the class of WSoS AFs, which are explored in this work (see Table 3).</p>
        <p>In an attempt to identify some concrete efficient AF variants, each of these functions is tested
with several different sets of the α, β, and γ parameters. Some examples of these functions with different
values of the α, β, γ parameters can be seen in Figure 2 and Figure 3.</p>
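      <p>A minimal sketch of one WSoS member (SiSoTanh-like) follows; the composition used here, WSoS(x) = γ · β · SoS(x/β) · W(x) with the logistic sigmoid as the weight W, is an assumption made for illustration and may differ in detail from the exact parameterization of formula (2):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed composition (illustrative): WSoS(x) = gamma * beta * SoS(x/beta) * W(x),
# where SoS is the shifted origin-aligned Tanh from formula (1) and W is the
# logistic sigmoid (the "Si" weight variant).
def si_so_tanh(x, alpha=1.0, beta=1.0, gamma=1.0):
    so = np.tanh(x / beta - alpha) + np.tanh(alpha)  # origin-aligned shifted Tanh
    return gamma * beta * so * sigmoid(x)            # sigmoid-weighted, as in SiLU
```

Because SoS(0) = 0, the weighted function also passes through the origin, while the sigmoid weight pushes the negative part toward the x axis.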
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Adaptive AF variants</title>
        <p>In addition to the functions mentioned in Table 1 and Table 3, this work considers the respective
adaptive variants of these AFs, which use the same AF formulas but treat α, β, γ as trainable
parameters shared across the whole model. The resulting adaptive variants of the shifted
origin-aligned S-shaped AFs are ASoTanh, ASoAtan, and ASoAsinh. The adaptive variants of the WSoS AFs
(later called AWSoS functions for conciseness) are as follows: ASiSoTanh, ASiSoAtan, ASiSoAsinh,
AGeSoTanh, AGeSoAtan, AGeSoAsinh.</p>
        <p>The ASoTanh, ASoAtan, ASoAsinh functions are tested with one variant of the initial α
parameter’s value for each AF, and the ASiSoTanh, ASiSoAtan, ASiSoAsinh, AGeSoTanh,
AGeSoAtan, AGeSoAsinh AFs are tested with several sets of initial parameter values.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Experimental setup</title>
        <p>All activation functions are tested and compared on the image classification task with various
CNN models, datasets, and hyperparameters. Below are the details on the respective experiments
that are performed.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.4.1. Activation functions being compared</title>
        <p>In this paper, we compare the proposed activation functions belonging to the WSoS/AWSoS
family with a set of existing activation functions by evaluating the average best test accuracy for
each AF over several runs. The comparison includes the following functions:</p>
        <p>All the proposed AWSoS AFs share the trainable α, β, γ parameters across the entire model in
this work. This means that using an adaptive variant of the AFs adds just a single set of three trainable
variables to the entire model, so the memory footprint from using these AFs remains
practically unaffected.</p>
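        <p>The parameter sharing described above can be sketched as a single parameter object read by every activation call (an illustrative sketch, not the authors’ implementation; the activation composition is the same assumed one as earlier):</p>

```python
import numpy as np

# One shared set of (alpha, beta, gamma) for the whole model: every layer's
# activation reads the same three scalars, so an adaptive AWSoS AF adds exactly
# three trainable variables regardless of network depth or width.
class SharedAFParams:
    def __init__(self, alpha=1.0, beta=1.0, gamma=1.0):
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

params = SharedAFParams()

def act(x, p=params):
    # ASiSoTanh-style activation (assumed composition, for illustration only)
    so = np.tanh(x / p.beta - p.alpha) + np.tanh(p.alpha)
    return p.gamma * p.beta * so / (1.0 + np.exp(-x))
```

Updating `params` once changes the activation in every layer simultaneously, which is what keeps the memory overhead at three extra scalars.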
      </sec>
      <sec id="sec-2-6">
        <title>2.4.2. Testing metrics and configurations</title>
        <p>Each of the functions is tested several times in each of the test configurations listed in Table 5.
Within the same testing configuration, for every AF, an average image classification accuracy (as
well as the standard deviation) across several test runs is evaluated for each training epoch. Only the
accuracies obtained from the test dataset (not the training dataset) are used. The maximum average
accuracy value that was achieved by a certain AF on any of the training epochs in a certain test
configuration is considered as the accuracy of this AF in this test configuration. After obtaining
accuracies for each AF in each configuration, a common comparison chart for each of the
configurations is made, where accuracies of all AFs can be compared with each other within the
respective testing configuration.</p>
        <p>Table 5
AF testing configurations</p>
        <p>Configuration Name | Dataset | Model | Optimizer | Learning Rate | Batch Size | No of epochs | No of runs
CAL | CIFAR-10 | CIFAR10-cls | Adam | 0.001 | 32 | 30 | 10
CAH | CIFAR-10 | CIFAR10-cls | Adam | 0.002 | 32 | 30 | 3
CSL | CIFAR-10 | CIFAR10-cls | SGD | 0.03 | 32 | 30 | 10
CSH | CIFAR-10 | CIFAR10-cls | SGD | 0.06 | 32 | 30 | 3
FAL | Fashion-MNIST | FMNIST-cls | Adam | 0.001 | 32 | 30 | 10
FAH | Fashion-MNIST | FMNIST-cls | Adam | 0.01 | 32 | 30 | 3
FSL | Fashion-MNIST | FMNIST-cls | SGD | 0.03 | 32 | 30 | 10
FSH | Fashion-MNIST | FMNIST-cls | SGD | 0.3 | 32 | 30 | 3</p>
        <sec id="sec-2-6-6">
          <p>In all cases, CNN kernel weights are initialized with the Glorot uniform weight initialization
method, and biases are initialized with zeros.</p>
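          <p>The Glorot uniform initialization mentioned above draws kernel weights from U(−limit, limit) with limit = √(6 / (fan_in + fan_out)); a minimal sketch:</p>

```python
import numpy as np

# Glorot (Xavier) uniform initialization: weights ~ U(-limit, limit),
# limit = sqrt(6 / (fan_in + fan_out)); biases start at zero.
def glorot_uniform(fan_in, fan_out, seed=0):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(64, 32)  # limit = sqrt(6/96) = 0.25
b = np.zeros(32)            # biases initialized with zeros
```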
          <p>Besides evaluating AF classification accuracies for each configuration, this work explores whether
some AFs tend to have better/worse accuracy across configurations. In order to make this possible
while dealing with different datasets, models, and hyperparameters, which can result in different
accuracy ranges, a notion of AF accuracy rank is introduced. For any given AF, its accuracy rank
R_AF,c within a specific configuration c is defined as a 1-based index in a list of AFs sorted by their
classification accuracy in ascending order within this configuration c. Provided that all
configurations are performed over the same set of AFs, for any set C of multiple test configurations
c_1 … c_n, a combined rank R_AF,C for each AF can be calculated by averaging the respective
per-configuration ranks for this AF:</p>
          <p>R_AF,C = (1/n) · Σ_{i=1}^{n} R_AF,c_i,
(3)</p>
          <p>In this work, combined AF ranks are calculated for three configuration combinations:
the configurations from Table 5 that use the Adam optimizer, the configurations from Table 5 that use
the stochastic gradient descent (SGD) optimizer, and all configurations from Table 5.</p>
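          <p>The rank computation of formula (3) can be sketched as follows (the accuracy values below are made up for illustration and are not results from this paper):</p>

```python
# Formula (3): rank AFs 1..N by ascending accuracy within each configuration,
# then average the per-configuration ranks over a configuration set C.
# Accuracy values are illustrative only.
accs = {
    "CAL": {"ReLU": 0.752, "SiLU": 0.761, "ASiSoAsinh": 0.776},
    "CAH": {"ReLU": 0.741, "SiLU": 0.748, "ASiSoAsinh": 0.782},
}

def ranks(config_accs):
    ordered = sorted(config_accs, key=config_accs.get)   # ascending accuracy
    return {af: i + 1 for i, af in enumerate(ordered)}   # 1-based ranks

def combined_rank(af, config_names):
    return sum(ranks(accs[c])[af] for c in config_names) / len(config_names)
```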
        </sec>
      </sec>
      <sec id="sec-2-7">
        <title>2.4.3. CNN models used</title>
        <p>As can be seen in Table 5, each dataset is used with its respective model. The CIFAR-10 dataset is
used with the CIFAR10-cls model (see Figure 4), and Fashion-MNIST dataset is used with the
FMNIST-cls model (see Figure 5).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>The image classification accuracies that were identified in experiments for each AF in Table 4 in each of the test configurations listed in Table 5 can be seen in</title>
        <p>AF</p>
      </sec>
      <sec id="sec-3-2">
        <title>CIFAR-10 Fashion-MNIST Adam SGD Dataset</title>
        <p>AF</p>
      </sec>
      <sec id="sec-3-3">
        <title>SoTanh(1.0)</title>
      </sec>
      <sec id="sec-3-4">
        <title>SoAtan(1.0)</title>
      </sec>
      <sec id="sec-3-5">
        <title>SoAsinh(1.0)</title>
      </sec>
      <sec id="sec-3-6">
        <title>ASoTanh(1.0)</title>
      </sec>
      <sec id="sec-3-7">
        <title>ASoAtan(1.0)</title>
      </sec>
      <sec id="sec-3-8">
        <title>ASoAsinh(1.0)</title>
      </sec>
      <sec id="sec-3-9">
        <title>Sigmoid</title>
      </sec>
      <sec id="sec-3-10">
        <title>Tanh</title>
      </sec>
      <sec id="sec-3-11">
        <title>Atan</title>
      </sec>
      <sec id="sec-3-12">
        <title>Asinh</title>
      </sec>
      <sec id="sec-3-13">
        <title>Softsign</title>
      </sec>
      <sec id="sec-3-14">
        <title>ReLU</title>
      </sec>
      <sec id="sec-3-15">
        <title>ReLU6</title>
      </sec>
      <sec id="sec-3-16">
        <title>LeakyReLU(0.01)</title>
      </sec>
      <sec id="sec-3-17">
        <title>LeakyReLU(0.1)</title>
      </sec>
      <sec id="sec-3-18">
        <title>LeakyReLU(0.3)</title>
      </sec>
      <sec id="sec-3-19">
        <title>LeakyReLU(0.5)</title>
      </sec>
      <sec id="sec-3-20">
        <title>PReLU</title>
      </sec>
      <sec id="sec-3-21">
        <title>PreLU_shared(0.01)</title>
      </sec>
      <sec id="sec-3-22">
        <title>PreLU_shared(0.2)</title>
      </sec>
      <sec id="sec-3-23">
        <title>PreLU_shared(0.4)</title>
      </sec>
      <sec id="sec-3-24">
        <title>SiLU</title>
        <p>Swish(0.33)
Swish(1.0)
Swish(3.0)</p>
      </sec>
      <sec id="sec-3-25">
        <title>GELU</title>
        <p>FAL</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Analysis of the results</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1. The advantage of adaptive AWSoS AFs with the Adam optimizer</title>
        <p>Overall, reviewing the results from the experiments made in this work shows that the adaptive
AWSoS AFs perform notably better than all other AFs when the model is trained with the Adam
optimizer. The following observations can be made in this regard:</p>
        <p>The adaptive AWSoS AF variants have a pronounced advantage in image classification
accuracy over the existing popular ReLU-like AFs in all testing configurations that use the Adam
optimizer. With a few exceptions, all AWSoS AFs have resulted in higher image
classification accuracies than all other AFs considered in this work in all testing
configurations that use the Adam optimizer. This can in particular be seen from the respective
combined AF ranks in Figure 8.</p>
        <p>The classification accuracy advantage of the AWSoS AFs is even more
pronounced with higher learning rates when the Adam optimizer is used. In the CAH testing
configuration, the highest-accuracy AF, AGeSoTanh(1, 1, 1), shows an accuracy of 78.19%,
which is ~3% higher than the 75.16% of the highest-accuracy standard AF, LeakyReLU(0.1), in this
configuration. In comparison, a similar configuration with a lower learning rate, CAL, shows
a lower advantage of ~0.8%, which the highest-accuracy AWSoS AF, ASiSoAsinh(1, 1, 1)
(77.63%), has over the highest-accuracy existing AF, LeakyReLU(0.1) (76.81%). A similar
tendency can be seen on models trained for Fashion-MNIST image classification with low
and high learning rates.</p>
        <p>The choice of initial parameter values for the AWSoS AFs is seen to have little or no decisive
effect with the Adam optimizer, and these AFs show higher accuracy than the
existing AFs in most cases. Nevertheless, the choice of their parameter values is still
important for fine-tuning the level of accuracy that can be achieved.</p>
        <p>In 6 out of 8 testing configurations, the ASiSoAsinh(1, 1, 1) AF provided a classification
accuracy higher than that of all considered existing AFs. Moreover, in 4 out of 8 configurations,
this particular AF showed an accuracy higher than that of all other compared AFs.</p>
        <p>In the testing configurations that use the SGD optimizer, the AWSoS AFs do not have a
consistent advantage over the existing AFs. The adaptive versions of these AFs are in many cases
not stable in the configurations using SGD, where they often fail to converge during training.
A few exceptions are the ASiSoAsinh(1, 1, 1) and AGeSoAsinh(1, 1, 1) AFs, which in three of the
four SGD-related configurations provided a higher accuracy than most of the standard
AFs.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.1.2. Comparing non-adaptive WSoS functions to existing AFs</title>
        <p>In many cases the proposed WSoS AFs provide image classification accuracy similar to that of the
existing ReLU-like functions. Their performance is very sensitive to the choice of the α, β, γ
parameter values, so choosing suitable parameter values requires due attention.
The results of this work do not provide sufficient data to make recommendations about potentially
more suitable parameter values, and this topic requires further research.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.1.3. Observations related to shifted S-shaped AFs</title>
        <p>
          The regular (non-weighted) shifted S-shaped AFs can be seen to provide an image classification
accuracy that is comparable to ReLU-like AFs and typically higher than that of the ReLU AF. This is
in line with the observations made in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which explored the Shifted Tanh function (named
SoTanh in this paper). The experiments made in this work show that the other modifications of such
a function, which are based on the Atan and Asinh functions, can also significantly improve the
classification accuracy compared to their regular unshifted variants.
        </p>
        <p>The experiments also show that the adaptive versions of these AFs (ASoTanh, ASoAtan,
ASoAsinh), which use the value of horizontal shift as a trainable parameter, in most cases provide
an additional notable improvement in classification accuracy over the non-adaptive forms of these
AFs (e.g., see Figure 6–Figure 10).</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.2. Computational performance considerations</title>
        <p>This work primarily focuses on investigating the performance of the proposed AFs in terms of
classification accuracy in comparison with the existing ones. An analysis of the computational
performance of the proposed functions, which would measure the time required to train the model and
the time required to perform a forward pass when using the model in a production environment,
was not the target of this work. Nevertheless, a preliminary analysis confirms the intuitive assumption
that using a function which requires more computational resources, like the AWSoS functions,
takes more time. Preliminary measurements show that, when training on CPU, the AWSoS
functions can take from ~10% more time than the Swish AF (for the ASiSoTanh AF) to ~80% more time
than the Swish AF (for the ASiSoAsinh AF), but a more thorough study is required to identify the
cost of using the WSoS/AWSoS AFs relative to the existing ones.</p>
        <p>Besides, additional research is needed to evaluate the training speed of the proposed AFs in terms
of the number of epochs required to reach certain accuracy, which, in combination with the
assessment of the relative computational cost per one epoch, could allow a more realistic evaluation
of the actual training speed that the proposed AFs can provide.</p>
        <p>Nevertheless, the advantage that the AWSoS AFs can provide in terms of the classification
accuracy can be important in some applications by itself regardless of the extra computational cost
that might be required to train the model that achieves a higher performance, or use it in a production
environment.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.3. Future work</title>
        <p>As was mentioned above, a notable tendency of the AWSoS functions is their pronounced
advantage over the considered existing AFs with the Adam optimizer, but a not as good performance
with the SGD optimizer. This difference requires further research to identify ways of improving
their performance with the SGD optimizer. One hypothesis that might explain this issue is that the
weight initialization method used in this research (Glorot uniform) might lead a model with these
AFs to poor convergence, preventing it from finding a global minimum, which is mitigated by
the Adam optimizer, but not by SGD.</p>
        <p>Other directions of further research include exploring the possibility of more computationally
efficient variants of WSoS/AWSoS functions, exploring whether some parameter configurations for
WSoS functions can be recommended as potentially more efficient ones, and exploring how the
WSoS/AWSoS functions perform in significantly deeper networks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work proposes a class of weighted shifted origin-aligned S-shaped activation functions (WSoS
AFs) and explores their performance in image classification tasks using CNNs in comparison with a
range of existing AFs. An emphasis is made on comparing the proposed AFs with ReLU-like AFs,
which are the most popular choice of AFs with CNNs.</p>
      <p>
        These functions are considered as an evolution of shifted origin-aligned S-shaped functions (e.g.
the ones similar to Shifted Tanh in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), and are at the same time viewed as softly bounded versions
of the ReLU-like functions GELU and SiLU. The results of the experiments show that they can
indeed be used to improve upon the classification accuracy of shifted S-shaped functions, and can compete
with most ReLU-like functions, but the classification accuracy that they provide significantly
depends on the choice of their three parameters, which can be challenging.
      </p>
      <p>At the same time, a notable result of this work is that the adaptive versions of the WSoS AFs
(AWSoS AFs) in most of the tested configurations show a clear advantage over all tested existing
AFs including the existing adaptive ones, but this advantage holds only when the training is done
with the Adam optimizer, and not the SGD optimizer, where the training is often not stable with
these AFs.</p>
      <p>
        Further research is needed to explore ways of achieving similar advantages of AWSoS AFs with
the SGD optimizer, which, according to preliminary experiments, could be made with changing the
weight initialization method. Besides, more computationally efficient forms of WSoS/AWSoS
functions can also be explored in future research. Another line of future research would consider
AWSoS AFs in combination with other learning algorithms, including their robust modifications
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and other neural network architectures [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Maas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Hannun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Rectifier Nonlinearities Improve Neural Network Acoustic Models</article-title>
          ,
          <source>in: Proceedings of the 30th International Conference on Machine Learning</source>
          , volume
          <volume>28</volume>
          ,
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Elfwing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Uchibe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Doya</surname>
          </string-name>
          ,
          <article-title>Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning</article-title>
          ,
          <source>Neural Networks</source>
          , volume
          <volume>107</volume>
          ,
          <year>2018</year>
          ,
          <fpage>3</fpage>
          -
          <lpage>11</lpage>
          . doi:10.1016/j.neunet.2017.12.012.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>D.</given-names> <surname>Hendrycks</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Gimpel</surname></string-name>,
          <article-title>Gaussian Error Linear Units (GELUs)</article-title>,
          arXiv, <year>2016</year>. doi:10.48550/arXiv.1606.08415.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>D.</given-names> <surname>Clevert</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Unterthiner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name>,
          <article-title>Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)</article-title>,
          <source>in: International Conference on Learning Representations</source>, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>X.</given-names> <surname>Glorot</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Bordes</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name>,
          <article-title>Deep Sparse Rectifier Neural Networks</article-title>,
          <source>in: International Conference on Artificial Intelligence and Statistics</source>, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>T.</given-names> <surname>Szandała</surname></string-name>,
          <article-title>Review and Comparison of Commonly Used Activation Functions for Deep Neural Networks</article-title>,
          in: A. K. Bhoi, P. K. Mallick, C. Liu, V. E. Balas (Eds.),
          <source>Bio-inspired Neurocomputing</source>,
          Springer Singapore, <year>2021</year>. doi:10.1007/978-981-15-5495-7.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>S. R.</given-names> <surname>Dubey</surname></string-name>,
          <string-name><given-names>S. K.</given-names> <surname>Singh</surname></string-name>,
          <string-name><given-names>B. B.</given-names> <surname>Chaudhuri</surname></string-name>,
          <article-title>Activation functions in deep learning: A comprehensive survey and benchmark</article-title>,
          <source>Neurocomputing</source>, volume <volume>503</volume>, <year>2022</year>,
          pp. <fpage>92</fpage>-<lpage>108</lpage>. doi:10.1016/j.neucom.2022.06.111.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>S. S.</given-names> <surname>Liew</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Khalil-Hani</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Bakhteri</surname></string-name>,
          <article-title>Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems</article-title>,
          <source>Neurocomputing</source>, volume <volume>216</volume>, <year>2016</year>,
          pp. <fpage>718</fpage>-<lpage>734</lpage>. doi:10.1016/j.neucom.2016.08.037.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>,
          <article-title>Convolutional Deep Belief Networks on CIFAR-10</article-title>, <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>A.</given-names> <surname>Nicolae</surname></string-name>,
          <article-title>PLU: The Piecewise Linear Unit Activation Function</article-title>,
          arXiv, <year>2018</year>. doi:10.48550/arXiv.1809.09534.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>D.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>,
          <article-title>Tanh Works Better With Asymmetry</article-title>,
          <source>in: NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>,
          article no. 549, <year>2024</year>, pp. <fpage>12536</fpage>-<lpage>12554</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Ye.</given-names> <surname>Bodyanskiy</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Popov</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Titov</surname></string-name>,
          <article-title>Robust Learning Algorithm for Networks of Neuro-Fuzzy Units</article-title>,
          in: T. Sobh (Ed.),
          <source>Innovations and Advances in Computer Sciences and Engineering</source>,
          Springer, Dordrecht, <year>2010</year>. doi:10.1007/978-90-481-3658-2_59.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Ye.</given-names> <surname>Bodyanskiy</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Popov</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Rybalchenko</surname></string-name>,
          <article-title>Feedforward neural network with a specialized architecture for estimation of the temperature influence on the electric load</article-title>,
          <source>in: Proc. 2008 4th International IEEE Conference Intelligent Systems, Varna, Bulgaria</source>,
          <year>2008</year>, pp. 7-14-7-18. doi:10.1109/IS.2008.4670444.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>