<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Weighted Shifted S-shaped Activation Functions</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sergiy</forename><surname>Popov</surname></persName>
							<email>serhii.popov@nure.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Kharkiv National University of Radio Electronics</orgName>
								<address>
									<addrLine>Nauky Ave. 14</addrLine>
									<postCode>61166</postCode>
									<settlement>Kharkiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dmytro</forename><surname>Pikhulya</surname></persName>
							<email>dmytro.pikhulia@nure.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Kharkiv National University of Radio Electronics</orgName>
								<address>
									<addrLine>Nauky Ave. 14</addrLine>
									<postCode>61166</postCode>
									<settlement>Kharkiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Weighted Shifted S-shaped Activation Functions</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C0FB76269C34AFB79F8C71DE41132015</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Convolutional neural network</term>
					<term>Image classification</term>
					<term>Activation function</term>
					<term>Bounded activation function</term>
					<term>Activation function modifications</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we propose a family of activation functions (AFs) that can be considered as smooth approximations of bounded ReLU and similar AFs. These AFs are constructed by using a shifted originaligned S-shaped function as a basis, and weighing it with another S-shaped function, similar to how SiLU/GELU AFs weigh the 𝑓(𝑥) = 𝑥 function. The use of both regular and adaptive variants of such AFs is explored. The performance of the proposed family of AFs is evaluated in terms of the image classification accuracy with CNN models by comparing their multiple variants with the popular existing AFs on CIFAR-10 and Fashion-MNIST datasets using Adam and stochastic gradient descent (SGD) optimizers with different learning rates. Overall, 28 variants of the proposed AFs are compared with 21 variants of popular existing AFs (including the ReLU-like functions such as ReLU, Leaky ReLU, SiLU, GELU, PReLU, Swish, etc. and some S-shaped AFs), and 6 shifted S-shaped AFs. The experiments have shown that in most cases the adaptive versions of the proposed AFs provide a pronounced image classification accuracy advantage over all existing AFs that were considered when the Adam optimizer is used, and no consistent advantage with the SGD optimizer. Further research regarding the use of these AFs with the SGD optimizer, and the use of their non-adaptive variants is required.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Artificial intelligence (AI) as a field that strives to replicate different kinds of cognitive functions pertaining to humans inevitably has to deal with many kinds of computer vision (CV) tasks. The fact that CV-related tasks represent a broad part of AI tasks is a natural consequence of the fact that humans strongly rely on vision in many of their daily routines, which are in turn eventually being targeted for solution by AI. Convolutional neural networks (CNNs) are a class of artificial neural networks that proved to be very effective at solving many CV-related tasks, such as image classification, object detection, semantic segmentation, etc.</p><p>The performance of neural network models can vary depending on many factors such as the task being solved, the network's architecture being used, scale of the network, hyperparameters involved in tuning the model, etc. The choice of activation functions (AFs) is one of such hyperparameters that can significantly influence the network's capability to perform a certain task. In case of CNNs, like in many other cases, ReLU AF is a popular choice, along with other AFs that can be said as being its variations, such as Leaky ReLU <ref type="bibr" target="#b0">[1]</ref>, SiLU <ref type="bibr" target="#b1">[2]</ref>, GELU <ref type="bibr" target="#b2">[3]</ref>, ELU <ref type="bibr" target="#b3">[4]</ref>, etc. In this paper, such functions are called ReLU-like ones for convenience.</p><p>The ReLU-like functions effectively solve the vanishing gradient problem <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>, which is a typical problem with S-shaped AFs <ref type="bibr" target="#b5">[6]</ref>. 
In many cases, using ReLU-like functions leads to better model effectiveness in image classification tasks with CNNs than S-shaped functions like Sigmoid and Tanh <ref type="bibr" target="#b6">[7]</ref>.</p><p>It has been shown that bounding the ReLU function can benefit training stability and classification accuracy, with functions like BReLU and BLReLU <ref type="bibr" target="#b7">[8]</ref>. At the same time, some improved variants of ReLU (LReLU, GELU, SiLU, PReLU) can show better results, but these functions are not bounded; hence there is potential in exploring bounded versions of such functions to see whether this could produce a cumulative improvement effect leading to better results than any of these functions.</p><p>The way that BReLU or similar functions (ReLU-6 <ref type="bibr" target="#b8">[9]</ref>) are bounded causes the function to have a fixed value with zero derivative beyond a certain argument value, which could degrade the network's training process. Alternative functions, which are formally not bounded but still limit the function's growth beyond a certain argument value, include BLReLU <ref type="bibr" target="#b7">[8]</ref> and PLU <ref type="bibr" target="#b9">[10]</ref>. The BReLU, BLReLU, and PLU AFs are piecewise linear functions, though. In this work, it is assumed that smoothing the transitions in such functions, by replacing the piecewise linear functions with smoother approximations, could improve the network's overall approximation capabilities. The intuition behind this assumption is that real-world data is presumably diverse enough to have few or no hard edges in the distributions of most of its aspects.</p><p>Among the functions that can be useful for creating smooth approximations of bounded ReLU-like functions are the shifted S-shaped functions. 
The work <ref type="bibr" target="#b10">[11]</ref> shows that the modified version of the Tanh function, which is shifted horizontally and vertically while still maintaining an intersection with the origin, called Shifted Tanh, achieves a better performance than the Tanh function, and can show a performance that is similar or slightly higher than that of the ReLU AF.</p><p>In this paper, we take a further look at such shifted S-shaped functions by first evaluating the performance of shifted variants of the Atan and Asinh functions, and then modifying them to represent better smoothed approximations of bounded ReLU-like functions. In this paper the respective shifted AFs conventionally have the "So" prefix added to them (meaning "shifted, originaligned"): SoTanh (same as Shifted Tanh in <ref type="bibr" target="#b10">[11]</ref>), SoAtan, SoAsinh. Regarding the Asinh function in particular, it is worth noting that unlike most S-shaped functions, it is an unbounded one, which could potentially be a useful property for mitigating the vanishing gradient problem as it tends to have higher first derivative values for a wider range of 𝑥 than the bounded functions like Tanh and Atan.</p><p>More specifically, shifting an S-shaped function to the right is seen to potentially be beneficial due to the following reasons:</p><p>• This makes the negative function's part closer to the 𝑥 axis, similar to the shape of ReLU-like functions. It is informally assumed in this work that the proximity of ReLU-like functions to 0 plays a certain role in their effectiveness with CNNs. One of the explanations might be the assumption that such AF's property encourages learning sparse representations of network's inputs <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b10">11]</ref>. 
• It gives the function's positive part a longer range where its value stays close to the 𝑓(𝑥) = 𝑥 function, compared to the regular unshifted version of the same S-shaped function. This also makes its shape closer to that of the Bounded ReLU or BLReLU AFs, while retaining smooth transitions. The proximity of the positive part to the 𝑓(𝑥) = 𝑥 function is hypothesized to help prevent both vanishing and exploding gradients in deep networks.</p><p>After testing the performance of the SoTanh, SoAtan, and SoAsinh functions, we introduce their modified versions, which make them closer to bounded ReLU-like functions. One notable difference of the SoTanh, SoAtan, and SoAsinh AFs from functions like ReLU, SiLU, and GELU is that their negative part is notably farther away from the 𝑥 axis. Hence, it is hypothesized that these shifted S-shaped functions might fail to introduce the activation sparsity seen with ReLU/SiLU/GELU, so there could be potential for improving their performance by "pushing" their negative part closer to the 𝑥 axis. Since we strive to create smooth approximations of bounded ReLU-like functions, we explore the same method of "pushing" the negative part toward the 𝑥 axis as the one used by the SiLU and GELU functions. As a result, we create shifted S-shaped functions that are weighted by another S-shaped function with a range of (0; 1) and a value of 0.5 at 𝑥 = 0. In this paper, we call such a family of functions weighted shifted origin-aligned S-shaped functions (WSoS functions). 
By using two variants of weight functions and three variants of base S-shaped functions, in this work we introduce and investigate the following specific WSoS AFs: SiSoTanh, SiSoAtan, SiSoAsinh, GeSoTanh, GeSoAtan, GeSoAsinh (see section 2.2).</p><p>Since bounded functions are prone to causing the vanishing gradient problem, in this work we try to mitigate this problem by extending the range in the function's positive part where its values stay close to the 𝑓(𝑥) = 𝑥 function. We do this by introducing scaling parameters. In addition, one more adjustable parameter defines the amount by which the base S-shaped function is shifted along the 𝑥 axis.</p><p>Finding a good combination among permutations of all the AF's parameters can be hard, so we first perform experiments with certain fixed parameter values, and then make these parameters trainable by creating adaptive versions of these AFs. For the adaptive versions, we share the parameters across the entire model rather than introducing a different trainable parameter set for each of the network's neurons.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Shifted origin-aligned S-shaped AFs</head><p>The notion of shifted origin-aligned S-shaped functions is not new (the Shifted Tanh function was explored in <ref type="bibr" target="#b10">[11]</ref>), but we include them into the comparison to see how the AFs proposed in this work stack up against them along with other existing AFs. Besides, in addition to the Shifted Tanh function (which is called SoTanh in this work for brevity and consistency with the proposed AFs), this work also explores the shifted versions of Atan and Asinh functions, which follow the same pattern, and evaluates the performance of their adaptive variants.</p><p>The general form of such shifted origin-aligned S-shaped functions 𝑆𝑜(𝑥) used in this work can be described with the following formula:</p><formula xml:id="formula_0">𝑆𝑜(𝑥) = 𝑆(𝑥 − 𝛼) + 𝑆(𝛼),<label>(1)</label></formula><p>where S is an arbitrary S-shaped function, which is also called a base function in this paper, and α is a value by which the function is shifted horizontally. This results in the following three AFs, which represent the shifted versions of Tanh, Atan, and Asinh functions (see Table <ref type="table">1</ref>). Examples of such AFs can be seen in Figure <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1 Shifted origin-aligned AFs evaluated in this work</head><p>Base function Shifted, origin-aligned AF formula Tanh 𝑆𝑜𝑇𝑎𝑛ℎ(𝑥) = 𝑇𝑎𝑛ℎ(𝑥 − 𝛼) + 𝑇𝑎𝑛ℎ(𝛼)</p><formula xml:id="formula_1">Atan 𝑆𝑜𝐴𝑡𝑎𝑛(𝑥) = 𝐴𝑡𝑎𝑛(𝑥 − 𝛼) + 𝐴𝑡𝑎𝑛(𝛼) Asinh 𝑆𝑜𝐴𝑠𝑖𝑛ℎ(𝑥) = 𝐴𝑠𝑖𝑛ℎ(𝑥 − 𝛼) + 𝐴𝑠𝑖𝑛ℎ(𝛼)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Weighted shifted origin-aligned S-shaped AFs</head><p>The family of AFs proposed in this work (by convention called as WSoS AFs family in this paper) contains the modifications of shifted origin-aligned S-shaped functions (see section 2.1), whose negative part is softly pushed closer to the 𝑥 axis by weighing them with another S-shaped function, as can be described in a general form by this formula:</p><formula xml:id="formula_2">𝑊𝑆𝑜(𝑥) = 𝛾𝑊(𝑥)𝛽𝑆𝑜( 1 𝛽 𝑥),<label>(2)</label></formula><p>where 𝛾 is a function's vertical scaling parameter. 𝑊(𝑥) is any S-shaped function in range (0; 1) symmetric with respect to the point (0, 0.5).</p><p>𝛽 is a horizontal and vertical scaling parameter for the 𝑆𝑜 function. 𝑆𝑜(𝑥) is a shifted origin-aligned S-shaped function <ref type="bibr" target="#b0">(1)</ref>.  <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2 Variants of the weight function W(x) used in this work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Weight function Formula</head><p>Si</p><formula xml:id="formula_3">𝑆𝑖(𝑥) = 𝜎(𝑥) = 1 1 + 𝑒 !" Ge 𝐺𝑒(𝑥) = 1 2 (𝐸𝑟𝑓(𝑥) + 1)</formula><p>With the three variants of shifted S-shaped functions listed in Table <ref type="table">1</ref>, this results in 6 specific AFs that belong to the class of WSoS AFs, which are explored in this work (see Table <ref type="table">3</ref>).</p><p>In an attempt to identify some concrete efficient AF variants, each of these functions is tested with several different sets of 𝛼, 𝛽, and 𝛾 parameters. Some examples of these functions with different values of α, β, 𝛾 parameters can be seen in Figure <ref type="figure">2</ref> and Figure <ref type="figure">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3 The proposed WSoS AFs</head><p>Activation function Formula </p><formula xml:id="formula_4">SiSoTanh 𝑆𝑖𝑆𝑜𝑇𝑎𝑛ℎ(𝑥) = 𝛾𝛽𝜎(𝑥) B𝑇𝑎𝑛ℎ B 1 𝛽 𝑥 − 𝛼C + 𝑇𝑎𝑛ℎ(𝛼)C SiSoAtan 𝑆𝑖𝑆𝑜𝐴𝑡𝑎𝑛(𝑥) = 𝛾𝛽𝜎(𝑥) B𝐴𝑡𝑎𝑛 B 1 𝛽 𝑥 − 𝛼C + 𝐴𝑡𝑎𝑛(𝛼)C SiSoAsinh 𝑆𝑖𝑆𝑜𝐴𝑠𝑖𝑛ℎ(𝑥) = 𝛾𝛽𝜎(𝑥) B𝐴𝑠𝑖𝑛ℎ B 1 𝛽 𝑥 − 𝛼C + 𝐴𝑠𝑖𝑛ℎ(𝛼)C GeSoTanh 𝐺𝑒𝑆𝑜𝑇𝑎𝑛ℎ(𝑥) = 1 2 𝛾𝛽(𝐸𝑟𝑓(𝑥) + 1) B𝑇𝑎𝑛ℎ B 1 𝛽 𝑥 − 𝛼C + 𝑇𝑎𝑛ℎ(𝛼)C GeSoAtan 𝐺𝑒𝑆𝑜𝐴𝑡𝑎𝑛(𝑥) = 1 2 𝛾𝛽(𝐸𝑟𝑓(𝑥) + 1) B𝐴𝑡𝑎𝑛 B 1 𝛽 𝑥 − 𝛼C + 𝐴𝑡𝑎𝑛(𝛼)C GeSoAsinh 𝐺𝑒𝑆𝑜𝐴𝑠𝑖𝑛ℎ(𝑥) = 1 2 𝛾𝛽(𝐸𝑟𝑓(𝑥) + 1) B𝐴𝑠𝑖𝑛ℎ B 1 𝛽 𝑥 − 𝛼C + 𝐴𝑠𝑖𝑛ℎ(𝛼)C</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Adaptive AF variants</head><p>In addition to the functions mentioned in Table <ref type="table">1</ref> and Table <ref type="table">3</ref>, this work considers respective adaptive variants of these AFs, which use the same AF formulas, but treat α, β, 𝛾 as trainable parameters, which are shared across the whole model. The resulting adaptive variants of shifted origin-aligned S-shaped AFs: ASoTanh, ASoAtan, ASoAsinh. The adaptive variants of the WSoS AFs (later called AWSoS functions for conciseness) are as follows: ASiSoTanh, ASiSoAtan, ASiSoAsinh, AGeSoTanh, AGeSoAtan, AGeSoAsinh.</p><p>The ASoTanh, ASoAtan, ASoAsinh functions are tested with one variant of the initial α parameter's value for each AF, and the ASiSoTanh, ASiSoAtan, ASiSoAsinh, AGeSoTanh, AGeSoAtan, AGeSoAsinh AFs are tested with several sets of initial parameter values.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Experimental setup</head><p>All activation functions are tested and compared on the image classification task with various CNN models, datasets, and hyperparameters. Below are the details on the respective experiments that are performed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.1.">Activation functions being compared</head><p>In this paper, we compare the proposed activation functions belonging to the WSoS/AWSoS family with a set of existing activation functions by evaluating the average best test accuracy for each AF over several runs. The comparison includes the following functions: All the proposed AWSoS AFs share the trainable α, β, 𝛾 parameters across the entire model in this work. This means that using an adaptive variant of the AFs add just a single set of three trainable variables to the entire model, which means that the memory footprint from using these AFs remains practically unaffected.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.2.">Testing metrics and configurations</head><p>Each of the functions is tested several times in each of the test configurations listed in Table <ref type="table" target="#tab_1">5</ref>. Within the same testing configuration, for every AF, an average image classification accuracy (as well as the standard deviation) across several test runs is evaluated for each training epoch. Only the accuracies obtained from the test dataset (not the training dataset) are used. The maximum average accuracy value that was achieved by a certain AF on any of the training epochs in certain test configuration is considered as the accuracy of this AF in this test configuration. After obtaining accuracies for each AF in each configuration, a common comparison chart for each of the configurations is made, where accuracies of all AFs can be compared with each other within the respective testing configuration. Besides evaluating AF classification accuracies for each configuration, this work explores whether some AFs tend to have better/worse accuracy across configurations. In order to make this possible while dealing with different datasets, models, and hyperparameters, which can result in different accuracy ranges, a notion of AF accuracy rank is introduced. For any given AF, its accuracy rank 𝑅 #$,&amp; within a specific configuration 𝐶 is defined as a 1-based index in a list of AFs sorted by their classification accuracy in an ascending order within this configuration 𝐶. Provided that all configurations are performed over the same set of AFs, for any set 𝑀 of multiple test configurations 𝐶 ' … 𝐶 ( , a combined rank for each AF 𝑅 #$,) can be calculated by averaging the respective perconfiguration ranks for this AF:</p><formula xml:id="formula_5">𝑅 #$,) = 1 𝑛 L 𝑅 #$,&amp; ! 
( *+' ,<label>(3)</label></formula><p>In this work, combined AF ranks are calculated for three configuration combinations: Configurations from Table <ref type="table" target="#tab_1">5</ref> that use the Adam optimizer. Configurations from Table <ref type="table" target="#tab_1">5</ref> that use the stochastic gradient descent (SGD) optimizer. All configurations from Table <ref type="table" target="#tab_1">5</ref>.</p></div>
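The rank computation of formula (3) amounts to a double argsort per configuration followed by averaging; a small sketch (illustrative only, with made-up accuracy values):

```python
import numpy as np

def accuracy_ranks(accuracies):
    """1-based rank of each AF within one configuration (ascending by accuracy)."""
    return np.argsort(np.argsort(accuracies)) + 1

def combined_ranks(acc_matrix):
    """R_{AF,M}: average of the per-configuration ranks -- formula (3).

    acc_matrix has shape (n_configs, n_afs): one row per test configuration.
    """
    per_config = np.vstack([accuracy_ranks(row) for row in acc_matrix])
    return per_config.mean(axis=0)

# Three hypothetical AFs over two configurations:
accs = np.array([[0.70, 0.75, 0.72],
                 [0.88, 0.90, 0.85]])
print(combined_ranks(accs))  # the middle AF ranks highest in both configurations
```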
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.3.">CNN models used</head><p>As can be seen in Table <ref type="table" target="#tab_1">5</ref>, each dataset is used with its respective model. The CIFAR-10 dataset is used with the CIFAR10-cls model (see Figure <ref type="figure" target="#fig_2">4</ref>), and Fashion-MNIST dataset is used with the FMNIST-cls model (see Figure <ref type="figure" target="#fig_3">5</ref>).  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head><p>The image classification accuracies that were identified in experiments for each AF in Table <ref type="table" target="#tab_0">4</ref> in each of the test configurations listed in Table <ref type="table" target="#tab_1">5</ref> can be seen in Table <ref type="table" target="#tab_2">6</ref> (for the proposed WSoS and AWSoS AFs) and Table <ref type="table" target="#tab_3">7</ref> (for existing AFs).</p><p>The AF measurements sorted by classification accuracy for the configurations CAL and CSL in Table <ref type="table" target="#tab_1">5</ref> are visualized on charts depicted in Figure <ref type="figure" target="#fig_4">6</ref> and Figure <ref type="figure">7</ref> respectively.      <ref type="table" target="#tab_1">5</ref> (lower is better), shows some AWSoS and shifted S-shaped AFs that are on average better than most Afs</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Analysis of the results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">The advantage of adaptive AWSoS AFs with the Adam optimizer</head><p>Overall, reviewing the results from the experiments made in this work shows that the adaptive AWSoS AFs perform notably better than all other AFs when the model is trained with the Adam optimizer. Here are the respective notes that can be made regarding such observations:</p><p>• The adaptive AWSoS AF variants have a pronounced advantage in image classification accuracy over existing popular ReLU-like AFs in all testing configurations that use the Adam optimizer. With a few of exceptions all AWSoS AFs have resulted in higher image classification accuracies than all other AFs considered in this work in all testing configurations that use the Adam optimizer. This can in particular be seen by the respective combined AF ranks in Figure <ref type="figure">8</ref>. • The classification accuracy advantage of the AWSoS AFs can be seen to be even more pronounced with higher learning rates when using the In the CAH testing configuration, the highest-accuracy AF AGeSoTanh(1, 1, 1) shows an accuracy of 78.19%, which is ~3% higher than the highest-accuracy standard AF LeakyReLU(0.1) of 75.16% in this configuration. In comparison, a similar configuration with a lower learning rate CAL shows a lower advantage of ~0.8% which the highest-accuracy AWSoS AF ASiSoAsinh(1, 1, 1) (77.63%) has over the highest-accuracy existing AF LeakyReLU(0.1) (76.81%). A similar tendency can be seen on models trained for Fashion-MNIST image classification with low and high learning rates. • The choice of initial parameter values for the AWSoS AFs is seen to have no or little decisive effect with the Adam optimizer, and they consistently show higher accuracy than the existing AFs in most cases. Nevertheless, the choice of their parameter values is still important to fine tune the level of accuracy that can be achieved. 
• In 6 out of 8 testing configurations, the ASiSoAsinh(1, 1, 1) AF provided a classification accuracy higher than that of all considered existing AFs. Moreover, in 4 out of 8 configurations this particular AF showed an accuracy higher than all other compared AFs. • In the testing configurations that use the SGD optimizer, the AWSoS AFs do not have a consistent advantage over the existing AFs. Adaptive versions of these AFs are in many cases not stable in the configurations using SGD, where they often fail to converge during training. A few exceptions are the ASiSoAsinh(1, 1, 1) and AGeSoAsinh(1, 1, 1) AFs, which in three of the four SGD-related configurations provided a higher accuracy than most of the standard AFs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Comparing non-adaptive WSoS functions to existing AFs</head><p>In many cases the proposed WSoS AFs provide image classification accuracy similar to the existing ReLU-like functions. Their performance is very sensitive to the choice of the α, β, 𝛾 parameter values, so they require respective attention for choosing the suitable parameter values. The results of this work don't provide sufficient data to make recommendations about the potentially more suitable parameter values and this topic requires further research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3.">Observations related to shifted S-shaped AFs</head><p>The regular (non-weighted) shifted S-shaped AFs can be seen to provide an image classification accuracy that is comparable to ReLU-like AFs and typically higher than that of the ReLU AF. This is in line with the observations made in <ref type="bibr" target="#b10">[11]</ref>, which was exploring the Shifted Tanh function (named SoTanh in this paper). The experiments made in this work show that the other modifications of such a function, which are based on Atan and Asinh functions, can also significantly improve the classification accuracy compared to their regular unshifted variants.</p><p>The experiments also show that the adaptive versions of these AFs (ASoTanh, ASoAtan, ASoAsinh), which use the value of horizontal shift as a trainable parameter, in most cases provide an additional notable improvement in classification accuracy over the non-adaptive forms of these AFs (e.g., see Figure <ref type="figure" target="#fig_4">6</ref>-Figure <ref type="figure" target="#fig_0">10</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Computational performance considerations</head><p>This work primarily focuses on investigating the performance of the proposed AFs in terms of the classification accuracy in comparison with the existing ones. The analysis of the computational performance of the proposed functions, which measures the time required to train the model, and the time required to perform a forward pass when using the model in a production environment, was not the target of this work. Nevertheless, preliminary analysis confirms the intuitive assumption that using a function which requires more computations resources, like the AWSoS functions, requires more time. Preliminary measurements show that, when training on CPU, the AWSoS functions can take from ~10% more time than the Swish AF (for ASiSoTanh AF) to ~80% more time than the Swish AF (for ASiSoAsinh AF), but a more thorough study is required to identify the relative cost of using the WSoS/AWSoS AFs relative to the existing ones.</p><p>Besides, additional research is needed to evaluate the training speed of the proposed AFs in terms of the number of epochs required to reach certain accuracy, which, in combination with the assessment of the relative computational cost per one epoch, could allow a more realistic evaluation of the actual training speed that the proposed AFs can provide.</p><p>Nevertheless, the advantage that the AWSoS AFs can provide in terms of the classification accuracy can be important in some applications by itself regardless of the extra computational cost that might be required to train the model that achieves a higher performance, or use it in a production environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Future work</head><p>As was mentioned above, a notable tendency about the AWSoS function is their pronounced advantage over the considered existing AFs with the Adam optimizer, but a not as good performance with the SGD optimizer. This difference requires further research to try to identify ways to improve their performance with the SGD optimizer. One hypothesis that might explain this issue is that the weight initialization method used in this research (Glorot uniform) might lead the model with these AFs to poor convergence while preventing it from finding a global minimum, which is mitigated by the Adam optimizer, but not SGD.</p><p>Other directions of further research include exploring the possibility of more computationally efficient variants of WSoS/AWSoS functions, exploring whether some parameter configurations for WSoS functions can be recommended as potentially more efficient ones, and exploring how the WSoS/AWSoS functions perform in significantly deeper networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This work proposes a class of weighted shifted origin-aligned S-shaped activation functions (WSoS AFs) and explores their performance in image classification tasks using CNNs in comparison with a range of existing AFs. An emphasis is made on comparing the proposed AFs with ReLU-like AFs, which are the most popular choice of AFs with CNNs.</p><p>These functions are considered as an evolution of shifted origin-aligned S-shaped functions (e.g. the ones similar to Shifted Tanh in <ref type="bibr" target="#b10">[11]</ref>), and are at the same time viewed as softly-bounded versions of ReLU-like functions GELU and SiLU in this work. The results of experiments show that they can indeed be used to improve the classification accuracy of shifted S-shaped functions and can compete with most of ReLU-like functions, but the classification accuracy that they provide significantly depends on the choice of their three parameters, which can be challenging.</p><p>At the same time, a notable result of this work is that the adaptive versions of the WSoS AFs (AWSoS AFs) in most of the tested configurations show a clear advantage over all tested existing AFs including the existing adaptive ones, but this advantage holds only when the training is done with the Adam optimizer, and not the SGD optimizer, where the training is often not stable with these AFs.</p><p>Further research is needed to explore ways of achieving similar advantages of AWSoS AFs with the SGD optimizer, which, according to preliminary experiments, could be made with changing the weight initialization method. Besides, more computationally effective forms of WSoS/AWSoS functions can also be explored in the future research. 
Another line of future research would consider AWSoS AFs in combination with other learning algorithms, including their robust modifications <ref type="bibr" target="#b11">[12]</ref>, and other neural network architectures <ref type="bibr" target="#b12">[13]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Shifted origin-aligned S-shaped functions with 𝛼 = 1.0</figDesc><graphic coords="4,133.13,191.16,350.57,213.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :Figure 3 :</head><label>23</label><figDesc>Figure 2: Weighted shifted origin-aligned S-shaped functions with α, β, 𝛾 all equal to 1.0</figDesc><graphic coords="5,133.13,292.45,350.57,213.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4:</head><label>4</label><figDesc>Figure 4: The CIFAR10-cls model used for CIFAR-10 image classification in this work</figDesc><graphic coords="8,103.00,187.49,397.00,103.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 5:</head><label>5</label><figDesc>Figure 5: The FMNIST-cls model used for Fashion-MNIST image classification in this work</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 6:</head><label>6</label><figDesc>Figure 6: CIFAR-10 classification accuracies for all AFs with the Adam optimizer and learning rate of 0.001 (testing configuration CAL); demonstrates an advantage of AWSoS AFs</figDesc><graphic coords="8,91.95,436.33,451.00,300.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 7: Figure 8:</head><label>78</label><figDesc>Figure 7: CIFAR-10 classification accuracies for all AFs with the SGD optimizer and learning rate of 0.03 (testing configuration CSL); shows that there is no consistent advantage of AWSoS AFs with the SGD optimizer</figDesc><graphic coords="11,77.75,63.55,451.00,300.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 9: Figure 10:</head><label>910</label><figDesc>Figure 9: Combined accuracy ranks for all testing configurations using the SGD optimizer (configurations CSL, CSH, FSL, FSH; lower is better); demonstrates poor performance of WSoS AFs with the SGD optimizer</figDesc><graphic coords="12,77.75,63.55,451.00,300.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 4 Activation functions compared in this work</head><label>4</label><figDesc></figDesc><table><row><cell>Category of AFs</cell><cell>AFs</cell><cell>Sets of parameter values/initial parameter values</cell></row><row><cell>The proposed WSoS and</cell><cell>SiSoTanh, SiSoAtan, SiSoAsinh,</cell><cell>𝛼 = 1, 𝛽 = 1, 𝛾 = 1</cell></row><row><cell>AWSoS AFs</cell><cell>GeSoTanh, GeSoAtan, GeSoAsinh,</cell><cell>𝛼 = 1, 𝛽 = 10, 𝛾 = 2.6</cell></row><row><cell></cell><cell>ASiSoTanh, ASiSoAtan, ASiSoAsinh,</cell><cell></cell></row><row><cell></cell><cell>AGeSoTanh, AGeSoAtan, AGeSoAsinh</cell><cell></cell></row><row><cell></cell><cell>SiSoTanh, GeSoTanh</cell><cell>𝛼 = 1, 𝛽 = 1.5, 𝛾 = 3.64</cell></row><row><cell></cell><cell>SiSoAtan, GeSoAtan</cell><cell>𝛼 = 1, 𝛽 = 1.2, 𝛾 = 3.2</cell></row><row><cell>Shifted origin-aligned S-</cell><cell>SoTanh, ASoTanh</cell><cell>𝛼 = 1</cell></row><row><cell>shaped AFs, and their</cell><cell>SoAtan, ASoAtan</cell><cell>𝛼 = 1</cell></row><row><cell>adaptive variants</cell><cell>SoAsinh, ASoAsinh</cell><cell>𝛼 = 1</cell></row><row><cell>Popular existing AFs</cell><cell>ReLU, ReLU-6, SiLU, GELU, ELU, Softsign,</cell><cell>N/A</cell></row><row><cell></cell><cell>Sigmoid, Tanh, Arctan, Asinh</cell><cell></cell></row><row><cell></cell><cell>Leaky ReLU</cell><cell>0.01</cell></row><row><cell></cell><cell></cell><cell>0.1</cell></row><row><cell></cell><cell></cell><cell>0.3</cell></row><row><cell></cell><cell></cell><cell>0.6</cell></row><row><cell></cell><cell>PReLU (with per-neuron trainable param.)</cell><cell>0</cell></row><row><cell></cell><cell>PReLU</cell><cell>0.01</cell></row><row><cell></cell><cell>(with trainable parameter shared across</cell><cell>0.2</cell></row><row><cell></cell><cell>the whole model)</cell><cell>0.4</cell></row><row><cell></cell><cell>Swish</cell><cell>0.33</cell></row><row><cell></cell><cell>(with trainable parameter shared across</cell><cell>1</cell></row><row><cell></cell><cell>the whole model)</cell><cell>3</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 5 AF testing configurations</head><label>5</label><figDesc>In all cases, CNN kernel weights are initialized with the Glorot uniform weight initialization method, and biases are initialized with zeros.</figDesc><table><row><cell>Configuration Name</cell><cell>Dataset</cell><cell>Model</cell><cell>Optimizer</cell><cell>Learning Rate</cell><cell>Batch Size</cell><cell>No of epochs</cell><cell>No of runs</cell></row><row><cell>CAL</cell><cell>CIFAR-10</cell><cell>CIFAR10-cls</cell><cell>Adam</cell><cell>0.001</cell><cell>32</cell><cell>30</cell><cell>10</cell></row><row><cell>CAH</cell><cell>CIFAR-10</cell><cell>CIFAR10-cls</cell><cell>Adam</cell><cell>0.002</cell><cell>32</cell><cell>30</cell><cell>3</cell></row><row><cell>CSL</cell><cell>CIFAR-10</cell><cell>CIFAR10-cls</cell><cell>SGD</cell><cell>0.03</cell><cell>32</cell><cell>30</cell><cell>10</cell></row><row><cell>CSH</cell><cell>CIFAR-10</cell><cell>CIFAR10-cls</cell><cell>SGD</cell><cell>0.06</cell><cell>32</cell><cell>30</cell><cell>3</cell></row><row><cell>FAL</cell><cell>Fashion-MNIST</cell><cell>FMNIST-cls</cell><cell>Adam</cell><cell>0.001</cell><cell>32</cell><cell>30</cell><cell>10</cell></row><row><cell>FAH</cell><cell>Fashion-MNIST</cell><cell>FMNIST-cls</cell><cell>Adam</cell><cell>0.01</cell><cell>32</cell><cell>30</cell><cell>3</cell></row><row><cell>FSL</cell><cell>Fashion-MNIST</cell><cell>FMNIST-cls</cell><cell>SGD</cell><cell>0.03</cell><cell>32</cell><cell>30</cell><cell>10</cell></row><row><cell>FSH</cell><cell>Fashion-MNIST</cell><cell>FMNIST-cls</cell><cell>SGD</cell><cell>0.3</cell><cell>32</cell><cell>30</cell><cell>3</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 6 Image classification accuracies for the proposed WSoS and AWSoS AFs along with std. deviations, %</head><label>6</label><figDesc></figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 7 Image classification accuracies for existing AFs along with std. deviations, %</head><label>7</label><figDesc></figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Rectifier Nonlinearities Improve Neural Network Acoustic Models</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Maas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Hannun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th International Conference on Machine Learning</title>
				<meeting>the 30th International Conference on Machine Learning</meeting>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page">3</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Elfwing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Uchibe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Doya</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neunet.2017.12.012</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Networks</title>
		<imprint>
			<biblScope unit="volume">107</biblScope>
			<biblScope unit="page" from="3" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gimpel</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1606.08415</idno>
		<title level="m">Gaussian Error Linear Units (GELUs)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)</title>
		<author>
			<persName><forename type="first">D</forename><surname>Clevert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep Sparse Rectifier Neural Networks</title>
		<author>
			<persName><forename type="first">X</forename><surname>Glorot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Statistics</title>
				<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Review and Comparison of Commonly Used Activation Functions for Deep Neural Networks</title>
		<author>
			<persName><forename type="first">T</forename><surname>Szandała</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-981-15-5495-7</idno>
	</analytic>
	<monogr>
		<title level="m">Bio-inspired Neurocomputing, Lectures on Embedded Systems</title>
				<editor>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Bhoi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Mallick</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><forename type="middle">E</forename><surname>Balas</surname></persName>
		</editor>
		<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Activation functions in deep learning: A comprehensive survey and benchmark</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Dubey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">B</forename><surname>Chaudhuri</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neucom.2022.06.111</idno>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">503</biblScope>
			<biblScope unit="page" from="92" to="108" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Liew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khalil-Hani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bakhteri</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neucom.2016.08.037</idno>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">216</biblScope>
			<biblScope unit="page" from="718" to="734" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Convolutional Deep Belief Networks on CIFAR-10</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Nicolae</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1809.09534</idno>
		<title level="m">PLU: The Piecewise Linear Unit Activation Function</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Tanh Works Better With Asymmetry</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NIPS &apos;23: Proceedings of the 37th International Conference on Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">549</biblScope>
			<biblScope unit="page" from="12536" to="12554" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Robust Learning Algorithm for Networks of Neuro-Fuzzy Units</title>
		<author>
			<persName><forename type="first">Ye</forename><surname>Bodyanskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Popov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Titov</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-90-481-3658-2_59</idno>
	</analytic>
	<monogr>
		<title level="m">Innovations and Advances in Computer Sciences and Engineering</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Sobh</surname></persName>
		</editor>
		<meeting><address><addrLine>Dordrecht</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Feedforward neural network with a specialized architecture for estimation of the temperature influence on the electric load</title>
		<author>
			<persName><forename type="first">Ye</forename><surname>Bodyanskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Popov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rybalchenko</surname></persName>
		</author>
		<idno type="DOI">10.1109/IS.2008.4670444</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. 2008 4th International IEEE Conference Intelligent Systems</title>
				<meeting>2008 4th International IEEE Conference Intelligent Systems<address><addrLine>Varna, Bulgaria</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="7" to="14" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
