<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weighted Shifted S-shaped Activation Functions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergiy Popov</string-name>
          <email>serhii.popov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Pikhulya</string-name>
          <email>dmytro.pikhulia@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky Ave. 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ProfIT AI 2024: 4</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we propose a family of activation functions (AFs) that can be considered as smooth approximations of bounded ReLU and similar AFs. These AFs are constructed by using a shifted origin-aligned S-shaped function as a basis and weighing it with another S-shaped function, similar to how the SiLU/GELU AFs weigh the f(x) = x function. The use of both regular and adaptive variants of such AFs is explored. The performance of the proposed family of AFs is evaluated in terms of image classification accuracy with CNN models by comparing their multiple variants with popular existing AFs on the CIFAR-10 and Fashion-MNIST datasets using the Adam and stochastic gradient descent (SGD) optimizers with different learning rates. Overall, 28 variants of the proposed AFs are compared with 21 variants of popular existing AFs (including ReLU-like functions such as ReLU, Leaky ReLU, SiLU, GELU, PReLU, Swish, etc. and some S-shaped AFs), and 6 shifted S-shaped AFs. The experiments have shown that in most cases the adaptive versions of the proposed AFs provide a pronounced image classification accuracy advantage over all considered existing AFs when the Adam optimizer is used, and no consistent advantage with the SGD optimizer. Further research regarding the use of these AFs with the SGD optimizer, and the use of their non-adaptive variants, is required.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Nowadays, the ReLU activation function is among the most popular choices of AFs for convolutional neural networks (CNNs), along with its variations, such as Leaky ReLU [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], SiLU [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], GELU [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], ELU [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], etc. In this paper, such functions
are called ReLU-like ones for convenience.
      </p>
      <p>
        The ReLU-like functions effectively solve the vanishing gradient problem [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], which is a typical
problem with S-shaped AFs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In many cases, using ReLU-like functions leads to better model
effectiveness in solving image classification tasks with CNNs than S-shaped functions like
Sigmoid and Tanh [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        It has been shown that bounding of the ReLU function can be beneficial for training stability and
classification accuracy with functions like BReLU, BLReLU [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. At the same time, there are some
improved variants of ReLU (LReLU, GELU, SiLU, PReLU) that can show better results, but these
functions are not bounded; hence there is potential in exploring bounded versions of such
functions to see whether this could produce a cumulative improvement effect that would lead to better
results than any of these functions alone.
      </p>
      <p>0000-0002-1274-5830 (S. Popov); 0009-0003-6280-1890 (D. Pikhulya). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>
        The way that BReLU or similar (ReLU-6 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) functions are bounded causes the function to take a
fixed value with a zero derivative after a certain value of its argument, which could degrade the
network’s training process. Alternative functions, which are formally not bounded but still limit the
function’s growth after a certain value of its argument, are functions like BLReLU [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
PLU [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The BReLU, BLReLU, and PLU AFs are, however, piecewise linear functions. In this work it
is assumed that smoothing the transitions in such functions by replacing the piecewise linear functions
with smoother approximation functions could improve the overall network’s approximation
capabilities. The intuition behind this assumption is that real-world data would presumably
be diverse enough to have no or few hard edges in the distributions of most of its aspects.
      </p>
      <p>
        Some of the functions that can be useful for creating smooth approximations of bounded
ReLU-like functions are the shifted S-shaped functions. The work [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] shows that a modified version of
the Tanh function, which is shifted horizontally and vertically while still maintaining an intersection
with the origin, called Shifted Tanh, achieves better performance than the Tanh function, and can
show performance that is similar to or slightly higher than that of the ReLU AF.
      </p>
      <p>
        In this paper, we take a further look at such shifted S-shaped functions by first evaluating the
performance of shifted variants of the Atan and Asinh functions, and then modifying them to
represent better smoothed approximations of bounded ReLU-like functions. In this paper the
respective shifted AFs conventionally have the “So” prefix added to them (meaning “shifted,
origin-aligned”): SoTanh (same as Shifted Tanh in [
        <xref ref-type="bibr" rid="ref11">11</xref>
          ]), SoAtan, SoAsinh. Regarding the Asinh function in
particular, it is worth noting that, unlike most S-shaped functions, it is unbounded, which
could potentially be a useful property for mitigating the vanishing gradient problem, as it tends to
have higher first-derivative values for a wider range of x than bounded functions like Tanh and
Atan.
      </p>
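      <p>The derivative-decay comparison above can be checked numerically from the closed-form derivatives (a quick illustrative sketch, not taken from the paper): tanh′(x) = 1 − tanh²(x) decays exponentially, atan′(x) = 1/(1 + x²) quadratically, and asinh′(x) = 1/√(1 + x²) slowest of the three.</p>

```python
import numpy as np

# Closed-form first derivatives of the three base S-shaped functions,
# evaluated a few units away from the origin (illustrative check only).
x = 3.0
d_tanh = 1 - np.tanh(x) ** 2        # decays exponentially
d_atan = 1 / (1 + x ** 2)           # decays quadratically (= 0.1 at x = 3)
d_asinh = 1 / np.sqrt(1 + x ** 2)   # decays slowest: Asinh keeps the largest gradient
```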
      <p>More specifically, shifting an S-shaped function to the right is seen as potentially beneficial
for the following reasons:</p>
      <p>
        This makes the negative part of the function closer to the x axis, similar to the shape of ReLU-like
functions. It is informally assumed in this work that the proximity of ReLU-like functions to
0 plays a certain role in their effectiveness with CNNs. One of the explanations might be the
assumption that such an AF property encourages learning sparse representations of the network’s
inputs [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ].
      </p>
      <p>This gives the function, in its positive part, a longer range where the function’s value
is close to the f(x) = x function, compared to a regular unshifted version of the same
S-function, which also makes their shape closer to that of the Bounded ReLU or BLReLU AFs, while
also having smooth transitions. The proximity of the positive part of the function to the f(x) = x
function is hypothesized to contribute to preventing both vanishing and exploding
gradients in deep networks.</p>
      <p>After testing the performance of the SoTanh, SoAtan, and SoAsinh functions, we introduce their
modified versions, which make them closer to bounded ReLU-like functions. One notable
difference of the SoTanh, SoAtan, and SoAsinh functions from functions like ReLU, SiLU, and GELU is that the
negative part of the SoTanh, SoAtan, and SoAsinh AFs is notably farther away from the x axis than that of
ReLU/SiLU/GELU. Hence, it is hypothesized that these shifted S-shaped functions might fail to
introduce the activation sparsity that can be seen with ReLU/SiLU/GELU. Thus, there could be
potential for improving their performance by “pushing” their negative part closer to the x axis. Since
we strive to create smooth approximations of bounded ReLU-like functions, in this work we choose to explore
the same method of “pushing” the negative part to the x axis as the one used by the SiLU and GELU
functions. As a result, we create shifted S-shaped functions that are weighted by
another S-shaped function with a range of (0; 1) and a value of 0.5 at x = 0. In this paper we
call such a family of functions weighted shifted origin-aligned S-shaped functions (WSoS
functions). By using two variants of weight functions and three variants of base S-shaped functions,
in this work we introduce and investigate the following specific WSoS AFs: SiSoTanh, SiSoAtan,
SiSoAsinh, GeSoTanh, GeSoAtan, GeSoAsinh (see section 2.2).</p>
      <p>Considering the fact that bounded functions are prone to causing the vanishing gradient problem,
in this work we try to mitigate the likelihood of this problem by prolonging the range in the function’s
positive part where the function’s values stay close to the f(x) = x function. We do this by
introducing scaling parameters. Besides, there is one more adjustable parameter that defines
the amount by which the base S-function is shifted along the x axis.</p>
      <p>Finding a good combination among permutations of all the AF parameters can potentially be hard,
so we first perform experiments with certain fixed parameter values, and then make these
parameters trainable by creating adaptive versions of these AFs. In the case of the adaptive versions of
the AFs, we share the parameters across the entire model rather than introducing a different trainable
parameter set per network neuron.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Shifted origin-aligned S-shaped AFs</title>
        <p>
          The notion of shifted origin-aligned S-shaped functions is not new (the Shifted Tanh function
was explored in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], but we include them in the comparison to see how the AFs proposed in this
work stack up against them along with other existing AFs. Besides, in addition to the Shifted Tanh
function (which is called SoTanh in this work for brevity and consistency with the proposed AFs),
this work also explores the shifted versions of Atan and Asinh functions, which follow the same
pattern, and evaluates the performance of their adaptive variants.
        </p>
        <p>The general form of such shifted origin-aligned S-shaped functions SoS(x) used in this work can
be described with the following formula:
SoS(x) = S(x − α) + S(α),
(1)
where</p>
        <p>S is an arbitrary S-shaped function, which is also called a base function in this paper, and
α is a value by which the function is shifted horizontally.</p>
        <p>This results in the following three AFs, which represent the shifted versions of Tanh, Atan, and
Asinh functions (see Table 1). Examples of such AFs can be seen in Figure 1.</p>
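      <p>Formula (1) can be sketched directly as follows (a minimal illustration; since Tanh, Atan, and Asinh are odd functions, S(−α) + S(α) = 0, so each shifted variant still passes through the origin):</p>

```python
import numpy as np

# Formula (1): SoS(x) = S(x - alpha) + S(alpha).
# For an odd base function S, SoS(0) = S(-alpha) + S(alpha) = 0,
# i.e. the shifted function remains origin-aligned.
def shifted(S, alpha):
    return lambda x: S(x - alpha) + S(alpha)

so_tanh = shifted(np.tanh, 1.0)      # SoTanh (Shifted Tanh)
so_atan = shifted(np.arctan, 1.0)    # SoAtan
so_asinh = shifted(np.arcsinh, 1.0)  # SoAsinh
```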
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Weighted shifted origin-aligned S-shaped AFs</title>
        <p>The family of AFs proposed in this work (by convention called as WSoS AFs family in this paper)
contains the modifications of shifted origin-aligned S-shaped functions (see section 2.1), whose
negative part is softly pushed closer to the x axis by weighing them with another S-shaped function,
as can be described in a general form by this formula:
WSoS(x) = γ · β · SoS(x / β) · W(x),
(2)
where
γ is the function’s vertical scaling parameter,
W(x) is any S-shaped function in range (0; 1), symmetric with respect to the point (0, 0.5), and
β is a horizontal and vertical scaling parameter for the SoS function.</p>
        <p>SoS(x) is a shifted origin-aligned S-shaped function (1).</p>
        <p>The way of weighing a shifted origin-aligned S-shaped function chosen in this work
is similar to the way the f(x) = x function is weighed in the SiLU and GELU AFs, where the logistic
sigmoid and the Gauss error function are used respectively as the weight function W(x). The respective
variants used in this work, informally called Si and Ge, are shown in Table 2.</p>
        <p>With the three variants of shifted S-shaped functions listed in Table 1, this results in 6 specific
AFs that belong to the class of WSoS AFs, which are explored in this work (see Table 3).</p>
        <p>In an attempt to identify some concrete efficient AF variants, each of these functions is tested
with several different sets of the α, β, and γ parameters. Some examples of these functions with different
values of the α, β, γ parameters can be seen in Figure 2 and Figure 3.</p>
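      <p>A minimal sketch of one WSoS member (SiSoTanh-like) follows; the composition used here, WSoS(x) = γ · β · SoS(x/β) · W(x) with the logistic sigmoid as the weight W, is an assumption made for illustration and may differ in detail from the exact parameterization of formula (2):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed composition (illustrative): WSoS(x) = gamma * beta * SoS(x/beta) * W(x),
# where SoS is the shifted origin-aligned Tanh from formula (1) and W is the
# logistic sigmoid (the "Si" weight variant).
def si_so_tanh(x, alpha=1.0, beta=1.0, gamma=1.0):
    so = np.tanh(x / beta - alpha) + np.tanh(alpha)  # origin-aligned shifted Tanh
    return gamma * beta * so * sigmoid(x)            # sigmoid-weighted, as in SiLU
```

Because SoS(0) = 0, the weighted function also passes through the origin, while the sigmoid weight pushes the negative part toward the x axis.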
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Adaptive AF variants</title>
        <p>In addition to the functions mentioned in Table 1 and Table 3, this work considers the respective
adaptive variants of these AFs, which use the same AF formulas but treat α, β, γ as trainable
parameters shared across the whole model. The resulting adaptive variants of the shifted
origin-aligned S-shaped AFs are ASoTanh, ASoAtan, and ASoAsinh. The adaptive variants of the WSoS AFs
(later called AWSoS functions for conciseness) are as follows: ASiSoTanh, ASiSoAtan, ASiSoAsinh,
AGeSoTanh, AGeSoAtan, AGeSoAsinh.</p>
        <p>The ASoTanh, ASoAtan, ASoAsinh functions are tested with one variant of the initial α
parameter’s value for each AF, and the ASiSoTanh, ASiSoAtan, ASiSoAsinh, AGeSoTanh,
AGeSoAtan, AGeSoAsinh AFs are tested with several sets of initial parameter values.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Experimental setup</title>
        <p>All activation functions are tested and compared on the image classification task with various
CNN models, datasets, and hyperparameters. Below are the details on the respective experiments
that are performed.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.4.1. Activation functions being compared</title>
        <p>In this paper, we compare the proposed activation functions belonging to the WSoS/AWSoS
family with a set of existing activation functions by evaluating the average best test accuracy for
each AF over several runs. The comparison includes the following functions:</p>
        <p>All the proposed AWSoS AFs share the trainable α, β, γ parameters across the entire model in
this work. This means that using an adaptive variant of the AFs adds just a single set of three trainable
variables to the entire model, so the memory footprint from using these AFs remains
practically unaffected.</p>
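        <p>The parameter sharing described above can be sketched as a single parameter object read by every activation call (an illustrative sketch, not the authors’ implementation; the activation composition is the same assumed one as earlier):</p>

```python
import numpy as np

# One shared set of (alpha, beta, gamma) for the whole model: every layer's
# activation reads the same three scalars, so an adaptive AWSoS AF adds exactly
# three trainable variables regardless of network depth or width.
class SharedAFParams:
    def __init__(self, alpha=1.0, beta=1.0, gamma=1.0):
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

params = SharedAFParams()

def act(x, p=params):
    # ASiSoTanh-style activation (assumed composition, for illustration only)
    so = np.tanh(x / p.beta - p.alpha) + np.tanh(p.alpha)
    return p.gamma * p.beta * so / (1.0 + np.exp(-x))
```

Updating `params` once changes the activation in every layer simultaneously, which is what keeps the memory overhead at three extra scalars.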
      </sec>
      <sec id="sec-2-6">
        <title>2.4.2. Testing metrics and configurations</title>
        <p>Each of the functions is tested several times in each of the test configurations listed in Table 5.
Within the same testing configuration, for every AF, an average image classification accuracy (as
well as the standard deviation) across several test runs is evaluated for each training epoch. Only the
accuracies obtained from the test dataset (not the training dataset) are used. The maximum average
accuracy value that was achieved by a certain AF on any of the training epochs in a certain test
configuration is considered as the accuracy of this AF in this test configuration. After obtaining
accuracies for each AF in each configuration, a common comparison chart for each of the
configurations is made, where accuracies of all AFs can be compared with each other within the
respective testing configuration.</p>
        <p>Table 5
AF testing configurations</p>
        <p>Configuration Name | Dataset | Model | Optimizer | Learning Rate | Batch Size | No of epochs | No of runs
CAL | CIFAR-10 | CIFAR10-cls | Adam | 0.001 | 32 | 30 | 10
CAH | CIFAR-10 | CIFAR10-cls | Adam | 0.002 | 32 | 30 | 3
CSL | CIFAR-10 | CIFAR10-cls | SGD | 0.03 | 32 | 30 | 10
CSH | CIFAR-10 | CIFAR10-cls | SGD | 0.06 | 32 | 30 | 3
FAL | Fashion-MNIST | FMNIST-cls | Adam | 0.001 | 32 | 30 | 10
FAH | Fashion-MNIST | FMNIST-cls | Adam | 0.01 | 32 | 30 | 3
FSL | Fashion-MNIST | FMNIST-cls | SGD | 0.03 | 32 | 30 | 10
FSH | Fashion-MNIST | FMNIST-cls | SGD | 0.3 | 32 | 30 | 3</p>
        <sec id="sec-2-6-6">
          <p>In all cases, CNN kernel weights are initialized with the Glorot uniform weight initialization
method, and biases are initialized with zeros.</p>
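          <p>The Glorot uniform initialization mentioned above draws kernel weights from U(−limit, limit) with limit = √(6 / (fan_in + fan_out)); a minimal sketch:</p>

```python
import numpy as np

# Glorot (Xavier) uniform initialization: weights ~ U(-limit, limit),
# limit = sqrt(6 / (fan_in + fan_out)); biases start at zero.
def glorot_uniform(fan_in, fan_out, seed=0):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(64, 32)  # limit = sqrt(6/96) = 0.25
b = np.zeros(32)            # biases initialized with zeros
```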
          <p>Besides evaluating AF classification accuracies for each configuration, this work explores whether
some AFs tend to have better/worse accuracy across configurations. In order to make this possible
while dealing with different datasets, models, and hyperparameters, which can result in different
accuracy ranges, a notion of AF accuracy rank is introduced. For any given AF, its accuracy rank
R_AF,c within a specific configuration c is defined as a 1-based index in a list of AFs sorted by their
classification accuracy in ascending order within this configuration c. Provided that all
configurations are performed over the same set of AFs, for any set C of multiple test configurations
c_1 … c_n, a combined rank R_AF,C for each AF can be calculated by averaging the respective
per-configuration ranks for this AF:</p>
          <p>R_AF,C = (1/n) · Σ_{i=1}^{n} R_AF,c_i,
(3)</p>
          <p>In this work, combined AF ranks are calculated for three configuration combinations:
the configurations from Table 5 that use the Adam optimizer, the configurations from Table 5 that use
the stochastic gradient descent (SGD) optimizer, and all configurations from Table 5.</p>
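          <p>The rank computation of formula (3) can be sketched as follows (the accuracy values below are made up for illustration and are not results from this paper):</p>

```python
# Formula (3): rank AFs 1..N by ascending accuracy within each configuration,
# then average the per-configuration ranks over a configuration set C.
# Accuracy values are illustrative only.
accs = {
    "CAL": {"ReLU": 0.752, "SiLU": 0.761, "ASiSoAsinh": 0.776},
    "CAH": {"ReLU": 0.741, "SiLU": 0.748, "ASiSoAsinh": 0.782},
}

def ranks(config_accs):
    ordered = sorted(config_accs, key=config_accs.get)   # ascending accuracy
    return {af: i + 1 for i, af in enumerate(ordered)}   # 1-based ranks

def combined_rank(af, config_names):
    return sum(ranks(accs[c])[af] for c in config_names) / len(config_names)
```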
        </sec>
      </sec>
      <sec id="sec-2-7">
        <title>2.4.3. CNN models used</title>
        <p>As can be seen in Table 5, each dataset is used with its respective model. The CIFAR-10 dataset is
used with the CIFAR10-cls model (see Figure 4), and Fashion-MNIST dataset is used with the
FMNIST-cls model (see Figure 5).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>The image classification accuracies that were identified in experiments for each AF in Table 4 in each of the test configurations listed in Table 5 can be seen in</title>
        <p>AF</p>
      </sec>
      <sec id="sec-3-2">
        <title>CIFAR-10 Fashion-MNIST Adam SGD Dataset</title>
        <p>AF</p>
      </sec>
      <sec id="sec-3-3">
        <title>SoTanh(1.0)</title>
      </sec>
      <sec id="sec-3-4">
        <title>SoAtan(1.0)</title>
      </sec>
      <sec id="sec-3-5">
        <title>SoAsinh(1.0)</title>
      </sec>
      <sec id="sec-3-6">
        <title>ASoTanh(1.0)</title>
      </sec>
      <sec id="sec-3-7">
        <title>ASoAtan(1.0)</title>
      </sec>
      <sec id="sec-3-8">
        <title>ASoAsinh(1.0)</title>
      </sec>
      <sec id="sec-3-9">
        <title>Sigmoid</title>
      </sec>
      <sec id="sec-3-10">
        <title>Tanh</title>
      </sec>
      <sec id="sec-3-11">
        <title>Atan</title>
      </sec>
      <sec id="sec-3-12">
        <title>Asinh</title>
      </sec>
      <sec id="sec-3-13">
        <title>Softsign</title>
      </sec>
      <sec id="sec-3-14">
        <title>ReLU</title>
      </sec>
      <sec id="sec-3-15">
        <title>ReLU6</title>
      </sec>
      <sec id="sec-3-16">
        <title>LeakyReLU(0.01)</title>
      </sec>
      <sec id="sec-3-17">
        <title>LeakyReLU(0.1)</title>
      </sec>
      <sec id="sec-3-18">
        <title>LeakyReLU(0.3)</title>
      </sec>
      <sec id="sec-3-19">
        <title>LeakyReLU(0.5)</title>
      </sec>
      <sec id="sec-3-20">
        <title>PReLU</title>
      </sec>
      <sec id="sec-3-21">
        <title>PreLU_shared(0.01)</title>
      </sec>
      <sec id="sec-3-22">
        <title>PreLU_shared(0.2)</title>
      </sec>
      <sec id="sec-3-23">
        <title>PreLU_shared(0.4)</title>
      </sec>
      <sec id="sec-3-24">
        <title>SiLU</title>
        <p>Swish(0.33)
Swish(1.0)
Swish(3.0)</p>
      </sec>
      <sec id="sec-3-25">
        <title>GELU</title>
        <p>FAL</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Analysis of the results</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1. The advantage of adaptive AWSoS AFs with the Adam optimizer</title>
        <p>Overall, reviewing the results from the experiments made in this work shows that the adaptive
AWSoS AFs perform notably better than all other AFs when the model is trained with the Adam
optimizer. The following observations can be made in this regard:</p>
        <p>The adaptive AWSoS AF variants have a pronounced advantage in image classification
accuracy over the existing popular ReLU-like AFs in all testing configurations that use the Adam
optimizer. With a few exceptions, all AWSoS AFs have resulted in higher image
classification accuracies than all other AFs considered in this work in all testing
configurations that use the Adam optimizer. This can in particular be seen from the respective
combined AF ranks in Figure 8.</p>
        <p>The classification accuracy advantage of the AWSoS AFs is even more
pronounced with higher learning rates when the Adam optimizer is used. In the CAH testing
configuration, the highest-accuracy AF, AGeSoTanh(1, 1, 1), shows an accuracy of 78.19%,
which is ~3% higher than the 75.16% of the highest-accuracy standard AF, LeakyReLU(0.1), in this
configuration. In comparison, a similar configuration with a lower learning rate, CAL, shows
a lower advantage of ~0.8%, which the highest-accuracy AWSoS AF, ASiSoAsinh(1, 1, 1)
(77.63%), has over the highest-accuracy existing AF, LeakyReLU(0.1) (76.81%). A similar
tendency can be seen on models trained for Fashion-MNIST image classification with low
and high learning rates.</p>
        <p>The choice of initial parameter values for the AWSoS AFs is seen to have little or no decisive
effect with the Adam optimizer, and these AFs show higher accuracy than the
existing AFs in most cases. Nevertheless, the choice of their parameter values is still
important for fine-tuning the level of accuracy that can be achieved.</p>
        <p>In 6 out of 8 testing configurations, the ASiSoAsinh(1, 1, 1) AF provided a classification
accuracy higher than that of all considered existing AFs. Moreover, in 4 out of 8 configurations,
this particular AF showed an accuracy higher than that of all other compared AFs.</p>
        <p>In the testing configurations that use the SGD optimizer, the AWSoS AFs do not have a
consistent advantage over the existing AFs. The adaptive versions of these AFs are in many cases
not stable in the configurations using SGD, where they often fail to converge during training.
A few exceptions are the ASiSoAsinh(1, 1, 1) and AGeSoAsinh(1, 1, 1) AFs, which in three of the
four SGD-related configurations provided a higher accuracy than most of the standard
AFs.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.1.2. Comparing non-adaptive WSoS functions to existing AFs</title>
        <p>In many cases the proposed WSoS AFs provide image classification accuracy similar to that of the
existing ReLU-like functions. Their performance is very sensitive to the choice of the α, β, γ
parameter values, so choosing suitable parameter values requires due attention.
The results of this work do not provide sufficient data to make recommendations about potentially
more suitable parameter values, and this topic requires further research.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.1.3. Observations related to shifted S-shaped AFs</title>
        <p>
          The regular (non-weighted) shifted S-shaped AFs can be seen to provide an image classification
accuracy that is comparable to ReLU-like AFs and typically higher than that of the ReLU AF. This is
in line with the observations made in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which explored the Shifted Tanh function (named
SoTanh in this paper). The experiments made in this work show that the other modifications of such
a function, which are based on the Atan and Asinh functions, can also significantly improve the
classification accuracy compared to their regular unshifted variants.
        </p>
        <p>The experiments also show that the adaptive versions of these AFs (ASoTanh, ASoAtan,
ASoAsinh), which use the value of horizontal shift as a trainable parameter, in most cases provide
an additional notable improvement in classification accuracy over the non-adaptive forms of these
AFs (e.g., see Figure 6–Figure 10).</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.2. Computational performance considerations</title>
        <p>This work primarily focuses on investigating the performance of the proposed AFs in terms of
classification accuracy in comparison with the existing ones. An analysis of the computational
performance of the proposed functions, which would measure the time required to train the model and
the time required to perform a forward pass when using the model in a production environment,
was not the target of this work. Nevertheless, a preliminary analysis confirms the intuitive assumption
that using a function which requires more computational resources, like the AWSoS functions,
takes more time. Preliminary measurements show that, when training on CPU, the AWSoS
functions can take from ~10% more time than the Swish AF (for the ASiSoTanh AF) to ~80% more time
than the Swish AF (for the ASiSoAsinh AF), but a more thorough study is required to identify the
cost of using the WSoS/AWSoS AFs relative to the existing ones.</p>
        <p>Besides, additional research is needed to evaluate the training speed of the proposed AFs in terms
of the number of epochs required to reach certain accuracy, which, in combination with the
assessment of the relative computational cost per one epoch, could allow a more realistic evaluation
of the actual training speed that the proposed AFs can provide.</p>
        <p>Nevertheless, the advantage that the AWSoS AFs can provide in terms of the classification
accuracy can be important in some applications by itself regardless of the extra computational cost
that might be required to train the model that achieves a higher performance, or use it in a production
environment.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.3. Future work</title>
        <p>As was mentioned above, a notable tendency of the AWSoS functions is their pronounced
advantage over the considered existing AFs with the Adam optimizer, but a not as good performance
with the SGD optimizer. This difference requires further research to identify ways of improving
their performance with the SGD optimizer. One hypothesis that might explain this issue is that the
weight initialization method used in this research (Glorot uniform) might lead a model with these
AFs to poor convergence, preventing it from finding a global minimum, which is mitigated by
the Adam optimizer, but not by SGD.</p>
        <p>Other directions of further research include exploring the possibility of more computationally
efficient variants of WSoS/AWSoS functions, exploring whether some parameter configurations for
WSoS functions can be recommended as potentially more efficient ones, and exploring how the
WSoS/AWSoS functions perform in significantly deeper networks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work proposes a class of weighted shifted origin-aligned S-shaped activation functions (WSoS
AFs) and explores their performance in image classification tasks using CNNs in comparison with a
range of existing AFs. An emphasis is made on comparing the proposed AFs with ReLU-like AFs,
which are the most popular choice of AFs with CNNs.</p>
      <p>
        These functions are considered as an evolution of shifted origin-aligned S-shaped functions (e.g.
the ones similar to Shifted Tanh in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), and are at the same time viewed as softly bounded versions
of the ReLU-like functions GELU and SiLU. The results of the experiments show that they can
indeed be used to improve upon the classification accuracy of shifted S-shaped functions, and can compete
with most ReLU-like functions, but the classification accuracy that they provide significantly
depends on the choice of their three parameters, which can be challenging.
      </p>
      <p>At the same time, a notable result of this work is that the adaptive versions of the WSoS AFs
(AWSoS AFs) in most of the tested configurations show a clear advantage over all tested existing
AFs including the existing adaptive ones, but this advantage holds only when the training is done
with the Adam optimizer, and not the SGD optimizer, where the training is often not stable with
these AFs.</p>
      <p>
        Further research is needed to explore ways of achieving similar advantages of AWSoS AFs with
the SGD optimizer, which, according to preliminary experiments, could be made with changing the
weight initialization method. Besides, more computationally efficient forms of WSoS/AWSoS
functions can also be explored in future research. Another line of future research would consider
AWSoS AFs in combination with other learning algorithms, including their robust modifications
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and other neural network architectures [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Maas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Hannun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Rectifier Nonlinearities Improve Neural Network Acoustic Models</article-title>
          ,
          <source>in: Proceedings of the 30th International Conference on Machine Learning</source>
          , volume
          <volume>28</volume>
          ,
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Elfwing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Uchibe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Doya</surname>
          </string-name>
          ,
          <article-title>Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning</article-title>
          ,
          <source>Neural Networks</source>
          , volume
          <volume>107</volume>
          ,
          <year>2018</year>
          ,
          <fpage>3</fpage>
          -
          <lpage>11</lpage>
          . doi:10.1016/j.neunet.2017.12.012.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>D.</given-names> <surname>Hendrycks</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Gimpel</surname></string-name>,
          <article-title>Gaussian Error Linear Units (GELUs)</article-title>,
          arXiv, <year>2016</year>. doi:10.48550/arXiv.1606.08415.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>D.</given-names> <surname>Clevert</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Unterthiner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hochreiter</surname></string-name>,
          <article-title>Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)</article-title>,
          <source>in: International Conference on Learning Representations</source>, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>X.</given-names> <surname>Glorot</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Bordes</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name>,
          <article-title>Deep Sparse Rectifier Neural Networks</article-title>,
          <source>in: International Conference on Artificial Intelligence and Statistics</source>, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>T.</given-names> <surname>Szandała</surname></string-name>,
          <article-title>Review and Comparison of Commonly Used Activation Functions for Deep Neural Networks</article-title>,
          in: A. K. Bhoi, P. K. Mallick, C. Liu, V. E. Balas (Eds.),
          <source>Bio-inspired Neurocomputing</source>,
          Springer Singapore, <year>2021</year>. doi:10.1007/978-981-15-5495-7.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>S. R.</given-names> <surname>Dubey</surname></string-name>,
          <string-name><given-names>S. K.</given-names> <surname>Singh</surname></string-name>,
          <string-name><given-names>B. B.</given-names> <surname>Chaudhuri</surname></string-name>,
          <article-title>Activation functions in deep learning: A comprehensive survey and benchmark</article-title>,
          <source>Neurocomputing</source>, volume <volume>503</volume>, <year>2022</year>,
          pp. <fpage>92</fpage>-<lpage>108</lpage>. doi:10.1016/j.neucom.2022.06.111.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>S. S.</given-names> <surname>Liew</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Khalil-Hani</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Bakhteri</surname></string-name>,
          <article-title>Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems</article-title>,
          <source>Neurocomputing</source>, volume <volume>216</volume>, <year>2016</year>,
          pp. <fpage>718</fpage>-<lpage>734</lpage>. doi:10.1016/j.neucom.2016.08.037.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>,
          <article-title>Convolutional Deep Belief Networks on CIFAR-10</article-title>, <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>A.</given-names> <surname>Nicolae</surname></string-name>,
          <article-title>PLU: The Piecewise Linear Unit Activation Function</article-title>,
          arXiv, <year>2018</year>. doi:10.48550/arXiv.1809.09534.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>D.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kim</surname></string-name>,
          <article-title>Tanh Works Better With Asymmetry</article-title>,
          <source>in: NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>,
          article no. 549, <year>2024</year>, pp. <fpage>12536</fpage>-<lpage>12554</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Ye.</given-names> <surname>Bodyanskiy</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Popov</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Titov</surname></string-name>,
          <article-title>Robust Learning Algorithm for Networks of Neuro-Fuzzy Units</article-title>,
          in: T. Sobh (Ed.),
          <source>Innovations and Advances in Computer Sciences and Engineering</source>,
          Springer, Dordrecht, <year>2010</year>. doi:10.1007/978-90-481-3658-2_59.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>Ye.</given-names> <surname>Bodyanskiy</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Popov</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Rybalchenko</surname></string-name>,
          <article-title>Feedforward neural network with a specialized architecture for estimation of the temperature influence on the electric load</article-title>,
          <source>in: Proc. 2008 4th International IEEE Conference Intelligent Systems, Varna, Bulgaria</source>,
          <year>2008</year>, pp. 7-14-7-18. doi:10.1109/IS.2008.4670444.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>