<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assigning different activation functions in artificial neural networks with the goal of achieving higher prediction accuracy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gytis Baravykas</string-name>
          <email>gytis.baravykas@ktu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justas Kardoka</string-name>
          <email>justas.kardoka@ktu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domas Grigaliunas</string-name>
          <email>domas.grigaliunas@ktu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Darius Naujokaitis</string-name>
          <email>darius.naujokaitis@ktu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Kaunas University of Technology</institution>
          ,
          <addr-line>Studentu 50, 51368 Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Smart Grids and Renewable Energy Laboratory, Lithuanian Energy Institute</institution>
          ,
          <addr-line>44403 Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The research paper explores the concept of using multiple activation functions in artificial neural networks and investigates their impact on model performance. The experiments conducted on various models such as AlexNet, ResNet50, TuNet, and SimpleNN reveal insights into the effectiveness of different activation function combinations. The results indicate that using multiple activation functions can lead to modest improvements in model performance, particularly in image segmentation tasks where modifications to the UNet architecture show significant enhancements. However, for time series regression/forecasting tasks, the experiments demonstrate that using multiple activation functions does not significantly improve prediction accuracy. Therefore, the paper concludes that while there are some benefits to using multiple activation functions in certain scenarios, the choice of activation function should be based on the specific task and dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Activation functions</kwd>
        <kwd>artificial neural networks</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>Activation functions in an ANN are used to introduce non-linear relations to the data, so that the
network can better fit the results and improve the accuracy on a given task. They are a very common
part of ANNs and are often omitted from neural network structure diagrams. Many mathematical
functions have been introduced to achieve non-linearity, such as ReLU, Tanh, Sigmoid and others,
each tailored to specific tasks. In this paper we entertain the idea of using not one activation function
per layer or network, but multiple, assigning a different one to each neuron.</p>
      <p>
        The importance of activation functions is discussed in many recent works. Their importance is
based on their wide-spread usage in ANN architectures. Dubey has published a comprehensive
overview of the most common activation functions, along with their characteristics and a
performance comparison between them [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They have found that different activation functions
are more suited for certain machine learning tasks, and that in certain cases, alternative choices
must be considered. Although there are some common choices, new activation functions are
constantly being developed [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6">2,3,4,5,6</xref>
        ]. Yu has created a modified activation function based on
ReLU, with the goal of increasing the accuracy of classification tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Wang developed an
activation function as a better alternative to other commonly used activation functions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The
developed activation function, Smish, performed better than other common activation functions in
classification tasks on open datasets. Wuraola has developed a family of activation functions that
are to be used in embedded systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The proposed activation functions were shown to be
computationally faster, and their use resulted in higher accuracy results than other common
activation functions in recurrent neural networks and logistic regression models. Kaytan has
introduced a new non-monotonic activation function capable of achieving better results than
other activation functions like Swish, Mish and others for image classification tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Chai
developed a new model based on LSTM capable of achieving higher accuracy for short-term PV
generation forecasts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The model uses a newly proposed activation function that helps solve the
gradient disappearance problem and ensures a high accuracy of the prediction results for the task of
short-term PV generation. There are also works in which the activation functions of the default
implementation of model architectures are switched with other, alternative activation functions.
Anami performed experiments in which they compared prediction results obtained by
switching the default activation function with other common activation functions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Wang has performed experiments in which they tried to use alternative activation functions in
VGG16, ResNet50 and LeNet architectures, achieving superior results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Essai Ali has tried to
modify an LSTM by changing its Tanh functions to different activation functions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The author
has achieved his aim of increasing the classification accuracy from 86% to 88% using the Weather
Reports dataset, and from 93% to 97% using the Japanese Vowels dataset.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Activation functions</title>
      <p>In a fully connected layer each neuron computes a weighted sum of its inputs, to which an activation function is then applied. With per-neuron assignment, different neurons in the same layer apply different functions:</p>
      <p>h = ∑_{i=1}^{n} w_i ⋅ x_i + b (1)
a_1 = σ(h_1) (2)
a_2 = tanh(h_2) (3)
o_1 = a_1 + a_2 (4)
where h – hidden-layer pre-activations, w – weights, x – inputs, b – bias, a – activation function results and o –
outputs. In an artificial convolutional neural network activations play a similar role, but because
there are no actual neurons in a convolutional layer, a different application is required. For the
convolution layer, 2 approaches were introduced.</p>
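      <p>As a minimal sketch of the first, per-channel approach (using NumPy; the cyclic assignment of functions to channels and all names here are our illustrative assumptions, not the paper's implementation):</p>

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def per_channel_activation(x, activations):
    # Channel c of x (batch, channels, height, width) is passed through
    # activations[c % len(activations)], i.e. the chosen functions are
    # assigned to the channels in cyclic order.
    out = np.empty_like(x)
    for c in range(x.shape[1]):
        out[:, c] = activations[c % len(activations)](x[:, c])
    return out

# Feature maps as they would come out of a convolution layer.
x = np.random.randn(2, 4, 8, 8)
y = per_channel_activation(x, [relu, np.tanh])
```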
      <p>In regular CNN architectures there is often only one activation function in a convolution layer.
As displayed in the diagram in Figure 2.1, a different activation function can be applied to each channel
after the convolution layer. The second diagram, Figure 2.2, refers to another idea: applying multiple
activation functions to each matrix column. In the case of a 3x3 matrix, there are 3 columns in each
channel. Every slice has a specific activation applied to it.</p>
      <p>Some CNN architectures have a linear neuron layer, which typically has only one activation
function. The idea displayed in Figure 3 is to leave one activation in the convolution layers and only
have multiple activation functions in the linear neuron layers, specifically an activation function for
each neuron. As displayed in the diagram, boxes (1-4) can each have a specific function assigned,
creating a spectrum of variations: (1-tanh, 2-relu, 3-sigmoid, 4-softmax), (1-relu, 2-tanh, 3-sigmoid,
4-relu) and so on.</p>
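      <p>The per-neuron idea for a linear layer can be sketched as follows (a hypothetical NumPy sketch; the paper does not prescribe an implementation, so the function names and shapes are our assumptions):</p>

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def per_neuron_forward(x, W, b, acts):
    # h holds one pre-activation per neuron; neuron i then applies its
    # own activation function acts[i] instead of a shared one.
    h = x @ W + b
    return np.array([acts[i](h[i]) for i in range(h.shape[0])])

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # 3 inputs feeding 4 hidden neurons
b = np.zeros(4)
x = rng.normal(size=3)
# One variation from the spectrum described above: (1-tanh, 2-relu, 3-sigmoid, 4-relu)
y = per_neuron_forward(x, W, b, [np.tanh, relu, sigmoid, relu])
```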
      <p>For linear layers it is also possible to have a complete list of activation functions assigned. This
idea is experimented with later in this paper. The number of combinations of such a list can be calculated as
follows: in this case, 2 activation functions (ReLU, Tanh) raised to the power of 4 neurons equals 16 variations:
V = A^N (5)
where V – number of variations, A – number of selected activation functions and N – number of neurons.</p>
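      <p>Equation (5) can be checked by enumerating the assignments directly (an illustrative Python sketch; the variable names are ours):</p>

```python
from itertools import product

activations = ("ReLU", "Tanh")  # A = 2 selected activation functions
neurons = 4                     # N = 4 neurons in the linear layer

# Every possible per-neuron assignment of an activation to a neuron:
combos = list(product(activations, repeat=neurons))
assert len(combos) == len(activations) ** neurons  # V = A**N = 16
```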
      <p>It must also be noted that various activation functions can be used; the choice is not limited to the
most common activation functions such as ReLU, Tanh, Sigmoid, etc. The range of activation functions
tested in this work is detailed in the experiments section.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Models</title>
      <p>
There has been a vast selection of CNN models proposed for image classification; many of them
have complex implementations and long training times. The models chosen for this paper are of low
to mid-range complexity, to test out the theory. Starting with SimpleNN, a simple neural network
with one hidden layer of N neurons. TuNet is a CNN with 2 convolutions, 2 pooling layers and 3
linear layers [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. AlexNet is a convolutional neural network (CNN) architecture that consists of
five convolutional layers, three fully connected layers, and two pooling layers [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
convolutional layers extract features from the input images, while the pooling layers reduce the
dimensionality of the feature maps. The fully connected layers learn a mapping from the extracted
features to the output classes. Some of the key innovations introduced by AlexNet include the use of
rectified linear unit (ReLU) activation functions, dropout regularization, and data augmentation
techniques.
      </p>
      <p>ResNet50 derives its name from its depth, incorporating 50 layers [11]. Notably, ResNet50
addresses the challenge of training deep networks by introducing residual connections that enable
the direct flow of information across layers. This innovation mitigates the vanishing gradient
problem, allowing for the successful training of extremely deep networks.</p>
      <p>The architecture comprises building blocks known as residual blocks, each containing skip
connections that bypass one or more layers. These skip connections facilitate the smooth
propagation of gradients during backpropagation, enhancing the model's ability to capture intricate
features. Additionally, ResNet50 employs batch normalization to accelerate training convergence
and improve generalization performance.</p>
      <p>UNet was used for image segmentation tasks [12]. It is a popular model with several
modifications over the years [13,14,15]. The model improved on the results of previous image
segmentation models through its architecture, consisting of a contracting path used for capturing context
and a symmetric expanding path that enables precise localization [12]. The resulting
architecture consists of 23 convolutional layers and utilizes the ReLU activation
function. The model also heavily utilizes image augmentation, which enables it to achieve high
accuracy without relying on many training images.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Datasets</title>
    </sec>
    <sec id="sec-6a">
      <title>3.3.1. Images</title>
      <p>Several image datasets are popular for testing the performance of CNN models. CIFAR-100 is a
dataset containing 60 000 32x32 color images with 100 classes (600 images per class). It is a subset of
the Tiny Images dataset and is commonly used for fine-grained image classification [16]. The
dataset contains a wide variety of images of objects, animals, and textures. The images are labeled
with both fine-grained and coarse labels. The fine-grained labels correspond to the specific object or
scene in the image, while the coarse labels correspond to the superclass of the object or scene.</p>
      <p>The German Traffic Sign Benchmark is a multi-class, single-image classification challenge held at
the International Joint Conference on Neural Networks (IJCNN) 2011 [17]. The following dataset
includes 43 classes of traffic signs and more than 50,000 images.</p>
      <p>The Cityscapes dataset is a popular image segmentation dataset that consists of 25 000 images
captured from a moving vehicle [13,14,15]. The images were taken in different cities in Germany
during different weather conditions. The dataset consists of 50 different classes. Each dataset item
consists of a horizontally joined image, in which the left image is the original photograph,
while the right image is the semantically segmented version of the image.</p>
    </sec>
    <sec id="sec-6b">
      <title>3.3.2. Tabular</title>
      <p>Two tabular datasets were incorporated in this paper: breast cancer and iris flower classification.
The Breast cancer dataset features are computed from a digitized image of a fine needle aspirate (FNA)
of a breast mass [18]. They describe characteristics of the cell nuclei present in the image. A few of
the images can be found at http://www.cs.wisc.edu/~street/images/.</p>
      <p>Iris flowers dataset is one of the earliest datasets used in literature on classification methods and
widely used in statistics and machine learning [19]. The data set contains 3 classes of 50 instances
each, where each class refers to a type of iris plant. One class is linearly separable from the other 2;
the latter are not linearly separable from each other. When performing experiments, Obaid’s work
was used as a benchmark for the comparison of results [20].</p>
    </sec>
    <sec id="sec-7">
      <title>3.3.3. Timeseries</title>
      <p>Timeseries data for Amazon stocks with stock price, closing price and other attributes was used
[21]. Additionally, a custom photovoltaic (PV) panel generation dataset was used. The data consists
of about a year of meteorological and PV generation data. The PV generation data was retrieved
from a PV station in Kaunas, Lithuania, meanwhile the publicly available meteorological data was
retrieved from Oikolab and from the Lithuanian Hydrometeorological Service. It was also attempted
to include METAR data on cloud conditions at different altitudes, but utilizing this data did not
provide any improvement to the results, so it was left out from the dataset. Based on the observed
linear relationships between different meteorological features and PV generation, certain
meteorological features were chosen to be used in the experiments (see Figure 4).</p>
      <p>As can be seen from the relationships between different features, a strong linear relationship
between PV generation and both air temperature and surface solar radiation has been observed. It was noted
that using other meteorological data improved the results, although these features did not seem to
have a linear relationship with the PV generation data. In total, the dataset consists of the following
11 features (see Table 1).</p>
    </sec>
    <sec id="sec-8">
      <title>3.4. Environment</title>
      <p>The Google Colab environment with a single NVIDIA Tesla T4 GPU was used for experiments with
AlexNet and ResNet50 on CIFAR-100. For the GTSRB, UNet and LSTM experiments, the models were
trained on a two Tesla T4 GPU setup. Amazon stock close predictions were performed on a Kaggle-provided CPU.</p>
    </sec>
    <sec id="sec-9">
      <title>4. Experiments and results</title>
    </sec>
    <sec id="sec-10">
      <title>4.1. Image classification</title>
    </sec>
    <sec id="sec-11">
      <title>4.1.1. CIFAR-100 with AlexNet</title>
      <p>Inspired by Sharma’s work [22], we chose AlexNet as the primary target. The main reason for
choosing this architecture was that it has linear layers alongside convolution blocks. We began
experiments with the OriginalAlexNet implementation as a baseline with Tanh. Next, we
experimented with changing only the linear layers – changing one layer, then changing both. The
change was that instead of applying a single activation function, we applied 2 or 3 in cyclic order.
The best results were with the Tanh and Softmax combination of functions – a 1.14% improvement in
testing accuracy compared to the ReLU baseline; however, the Tanh baseline was still superior.</p>
      <p>Later, we expanded experimentation to modifying the convolutional (CNN) layers.
Here the implementation consisted of changing activation functions per channel. This showed
marginally better results than the OriginalAlexNet with ReLU – a 0.36% improvement.</p>
      <p>For experimentation, the hyperparameters were the following: learning rate – 0.0001, batch size –
256 and number of epochs – 40.</p>
    </sec>
    <sec id="sec-11b">
      <title>4.1.2. CIFAR-100 with ResNet50</title>
      <p>The tested variants were the ResNet50 baseline and four modified models: ResNet50CustomResiduala
with Tanh; ResNet50CustomResidualb with ReLU and Tanh; ResNet50CustomResidualc with ReLU and
SoftMax; and ResNet50CustomResidualr with a random list of activation functions drawn from Tanh,
Softmax and ReLU.</p>
    </sec>
    <sec id="sec-12">
      <title>4.1.3 GTSRB with TuNet</title>
      <p>The images to be classified are pre-processed in the same manner and with the same training parameters
as in the previous experiments, while the fixed image size is 32 by 32 pixels. The training
parameters for TuNet are as follows: optimizer – Adam, learning rate – 0.001, loss function – cross
entropy and batch size – 32. As can be seen in Table 4, the results of the TuNet baseline are
generally worse than those of the modified architecture:</p>
    </sec>
    <sec id="sec-13">
      <title>4.2. Cityscapes with UNet</title>
      <p>For the image segmentation task, the popular Cityscapes dataset was chosen alongside the UNet
model. The following parameters were the same for all the experiments using UNet: Adam
optimizer with a learning rate of 0.001, the mean-squared error as the loss function, a batch size of 4
and 20 as the number of epochs for training.</p>
      <p>As it can be seen from the results of the experiments, a significant Dice metric increase of about
10% was achieved by various activation function combinations (see Table 5).</p>
      <p>As can be seen from the table, using almost any combination of activation functions can result
in better prediction results in the case of UNet. It is also observed that even changing the activation
in the baseline model from ReLU to Tanh improved the results by a significant amount as well.</p>
    </sec>
    <sec id="sec-14">
      <title>4.3 Time series regression/forecasting</title>
    </sec>
    <sec id="sec-15">
      <title>4.3.1 Simple NN on Amazon stock prediction</title>
      <p>Experiments were performed on Amazon stock timeseries data to predict the closing price for the
next day. An architecture named SimpleNN was used. It is a neural network with 1 input cell, 14
hidden layer cells and 1 output. The following parameters were used in the experiment: optimizer –
Adam, learning rate – 0.001, loss function – mean-squared error, batch size – 16, lag values – 7 and
number of training epochs – 5.</p>
      <p>The experiment compares the same model architecture in two settings: one activation function for the
whole network versus a different activation per neuron (see Table 6).</p>
      <p>Among the tested activation configurations were (ReLU, Softmax) and per-neuron lists such as
(ReLU, ReLU, ReLU, ReLU, ReLU, Sigmoid, ReLU, ReLU, Sigmoid, ReLU, ReLU, Sigmoid, ReLU, Sigmoid).</p>
      <p>Additionally, all possible combinations of different activation functions sets have been tested (see
model PerNeuronList).</p>
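      <p>The exhaustive search over activation lists can be sketched as follows (a hypothetical sketch: train_and_score is a dummy stand-in for training SimpleNN once per combination, and only 4 neurons are enumerated here for brevity):</p>

```python
from itertools import product

ACTIVATIONS = ("relu", "sigmoid")

def train_and_score(combo):
    # Dummy stand-in: the real procedure would train SimpleNN with this
    # per-neuron activation list and return its validation error.
    return sum(1 for a in combo if a == "sigmoid")

# Enumerate every per-neuron list (4 neurons here for brevity; SimpleNN's
# 14-neuron hidden layer gives 2**14 lists) and keep the best-scoring one.
best = min(product(ACTIVATIONS, repeat=4), key=train_and_score)
```

Note that the search cost grows as A^N, which is why exhaustive enumeration is only practical for small layers or few candidate functions.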
      <p>As can be seen from the results, there is an increase in accuracy in certain cases, and it can also
be observed that finding the best possible set of activation functions yielded the best results out of
the experiments.</p>
    </sec>
    <sec id="sec-16">
      <title>4.3.2 Custom PV dataset with LSTM</title>
      <p>Experiments were performed using a time-series dataset for forecasting PV generation. An LSTM
model was used, as it is often utilized for solving PV generation forecast tasks [23,24,25,26,27]. For
performing the forecasts, the output of the previous step is used as the input of the following
training step. The following parameters were used for the experiments: Adam optimizer with a
learning rate of 0.001, mean-squared error for the error metric, a batch size of 8, 12 lag values for the
PV data, and 20 training epochs.</p>
      <p>The parameters for the experiments were chosen based on experiments performed using
different sets of parameters. The batch size refers to the number of predictions retrieved from the
model output, and the lag values refer to the number of previous predictions to use as input for the
next prediction. Based on tests using different lag values, a value of 12 was noticed to be one of the
best values for this parameter, although this parameter did not seem to have much impact on the
accuracy of predictions. Regarding transformations of data, the training data has been standardized
so that the ranges of values would be the same for all features.</p>
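      <p>The standardization step can be sketched as a per-feature z-score (an illustrative NumPy sketch; the feature values are made up):</p>

```python
import numpy as np

# Made-up training matrix: rows are timestamps, columns are two features
# on very different scales (e.g. temperature and solar radiation).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma  # each column now has zero mean and unit variance
```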
      <p>As can be seen from Table 7, there is no significant improvement based on testing RMSLE.
Although many experiments yielded similar results to the baseline, there was not a single
experiment which yielded better results than the baseline. It can also be observed that an increase in
the number of different activation functions used does not improve the forecast results either.</p>
    </sec>
    <sec id="sec-17">
      <title>4.4 Tabular</title>
      <p>Tabular data is still widely used in machine learning tasks. In this paper we chose two datasets
to experiment with: Iris flowers and Breast cancer classification. Both experiments
have the following training parameters: optimizer – SGD, learning rate – 0.01, loss function – cross
entropy loss and number of training epochs – 200.</p>
      <p>From the results displayed in Table 8, comparing one activation versus multiple for this Iris flowers
classification task, there is no improvement compared to the best-suited activation function.</p>
      <p>Experiments performed on the breast cancer dataset are visible in Table 9, which shows the testing
results after training. As can be seen, there is a slight improvement for the model with multiple
activation functions.</p>
      <p>Additionally, an activation function set from a large number of combinations was selected, and the
accuracy using it is better compared to one activation function (see Table 10).</p>
      <p>It should also be noted that better results were achieved than from the SVM described in Obaid’s
work. As can be seen from the results, there is a significant accuracy increase for the PerNeuron
models, whilst the most significant increase can be seen when finding the best activation function
list from all possible combinations.</p>
    </sec>
    <sec id="sec-18">
      <title>5. Conclusions and discussion</title>
      <p>The research paper explores the concept of using multiple activation functions in artificial neural
networks. It discusses the role of activation functions in introducing non-linear relations to improve
the accuracy of tasks. The paper investigates different approaches to incorporating multiple
activation functions, including assigning a different function to each neuron or channel.</p>
      <p>The experiments included using models such as AlexNet, ResNet50, TuNet, and SimpleNN. In the
AlexNet experiment, different activation function combinations were tested in both linear layers
and convolutional neural network (CNN) layers. The results showed that using OriginalAlexNet
with Tanh activation function yielded the best overall performance. The ResNet50 experiments
resulted in one combination performing marginally better than any of the single-function baselines. The
TuNet and SimpleNN experiments aimed to evaluate the performance of these specific architectures
on their respective datasets. Overall, the experiments provided insights into the impact of activation
function combinations on model performance, with modest improvements observed compared to
using a single activation function. The datasets used in the experiments included CIFAR-100,
GTSRB, Breast Cancer Wisconsin (Diagnostic), Iris flowers, and Amazon stocks. In image
segmentation tasks, modifying the UNet architecture with different activation function
combinations led to significant improvements in the Dice metric. Even changing the activation
function in the baseline model from ReLU to Tanh showed improved results. For time series
regression/forecasting tasks, the experiments show that using multiple activation functions does not
significantly improve the accuracy of predictions. This paper also hints at the idea of a full list of
activation functions, which would learn a relation with the specific data each neuron receives – an
idea which requires further analysis.</p>
      <p>Overall, the paper concludes that while using multiple activation functions can have some
benefits in certain scenarios, the improvements are not substantial compared to using a single
activation function. The choice of activation function should be based on the specific task, dataset
and its features.</p>
      <p>[10] https://doi.org/10.1145/3065386.
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, (2015). https://doi.org/10.48550/arXiv.1512.03385.
[12] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, (2015). https://doi.org/10.48550/arXiv.1505.04597.
[13] H. Bai, L. Liu, Q. Han, Y. Zhao, Y. Zhao, A novel UNet segmentation method based on deep learning for preferential flow in soil, Soil &amp; Tillage Research. 233 (2023) 105792. https://doi.org/10.1016/j.still.2023.105792.
[14] K.K. Wong, A. Zhang, K. Yang, S. Wu, D.N. Ghista, GCW-UNet segmentation of cardiac magnetic resonance images for evaluation of left atrial enlargement, Computer Methods and Programs in Biomedicine. 221 (2022) 106915. https://doi.org/10.1016/j.cmpb.2022.106915.
[15] G. Rani, P. Thakkar, A. Verma, V. Mehta, R. Chavan, V.S. Dhaka, R.K. Sharma, E. Vocaturo, E. Zumpano, KUB-UNet: Segmentation of Organs of Urinary System from a KUB X-ray Image, Computer Methods and Programs in Biomedicine. 224 (2022) 107031. https://doi.org/10.1016/j.cmpb.2022.107031.
[16] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009. https://www.semanticscholar.org/paper/Learning-Multiple-Layers-of-Features-from-TinyKrizhevsky/5d90f06bb70a0a3dced62413346235c02b1aa086 (accessed January 17, 2024).
[17] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel, Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition, Neural Networks. 32 (2012) 323–332. https://doi.org/10.1016/j.neunet.2012.02.016.
[18] W.N. Street, W.H. Wolberg, O.L. Mangasarian, Nuclear feature extraction for breast tumor diagnosis, in: R.S. Acharya, D.B. Goldgof (Eds.), San Jose, CA, 1993: pp. 861–870. https://doi.org/10.1117/12.148698.
[19] A. Unwin, K. Kleinman, The Iris Data Set: In Search of the Source of Virginica, Significance. 18 (2021) 26–29. https://doi.org/10.1111/1740-9713.01589.
[20] O. Ibrahim Obaid, M. Mohammed, M.K. Abd Ghani, S. Mostafa, F. Al-Dhief, Evaluating the Performance of Machine Learning Techniques in the Classification of Wisconsin Breast Cancer, International Journal of Engineering and Technology. 7 (2018) 160–166. https://doi.org/10.14419/ijet.v7i4.36.23737.
[21] Amazon.com, Inc. (AMZN) Stock Historical Prices &amp; Data - Yahoo Finance, (2024). https://finance.yahoo.com/quote/AMZN/history/ (accessed January 17, 2024).
[22] N. Sharma, V. Jain, A. Mishra, An Analysis Of Convolutional Neural Networks For Image Classification, Procedia Computer Science. 132 (2018) 377–384. https://doi.org/10.1016/j.procs.2018.05.198.
[23] T. Limouni, R. Yaagoubi, K. Bouziane, K. Guissi, E.H. Baali, Accurate one step and multistep forecasting of very short-term PV power using LSTM-TCN model, Renewable Energy. 205 (2023) 1010–1024. https://doi.org/10.1016/j.renene.2023.01.118.
[24] L. Wang, M. Mao, J. Xie, Z. Liao, H. Zhang, H. Li, Accurate solar PV power prediction interval method based on frequency-domain decomposition and LSTM model, Energy (Oxford). 262 (2023) 125592. https://doi.org/10.1016/j.energy.2022.125592.
[25] X. Huang, Q. Li, Y. Tai, Z. Chen, J. Liu, J. Shi, W. Liu, Time series forecasting for hourly photovoltaic power using conditional generative adversarial network and Bi-LSTM, Energy (Oxford). 246 (2022) 123403. https://doi.org/10.1016/j.energy.2022.123403.
[26] H. Gao, S. Qiu, J. Fang, N. Ma, J. Wang, K. Cheng, H. Wang, Y. Zhu, D. Hu, H. Liu, J. Wang, Short-Term Prediction of PV Power Based on Combined Modal Decomposition and NARXLSTM-LightGBM, Sustainability (Basel, Switzerland). 15 (2023) 8266. https://doi.org/10.3390/su15108266.
[27] J. Ospina, A. Newaz, M.O. Faruque, Forecasting of PV plant output using hybrid wavelet-based LSTM-DNN structure model, IET Renewable Power Generation. 13 (2019) 1087–1095. https://doi.org/10.1049/iet-rpg.2018.5779.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.R.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.B.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <article-title>Activation functions in deep learning: A comprehensive survey and benchmark</article-title>
          ,
          <source>Neurocomputing (Amsterdam)</source>
          .
          <volume>503</volume>
          (
          <year>2022</year>
          )
          <fpage>92</fpage>
          -
          <lpage>108</lpage>
          . https://doi.org/10.1016/j.neucom.
          <year>2022</year>
          .
          <volume>06</volume>
          .111.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Adu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Anokye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.A.</given-names>
            <surname>Ayidzoe</surname>
          </string-name>
          ,
          <article-title>RMAF: ReLU-Memristor-Like Activation Function for Deep Learning</article-title>
          ,
          <source>IEEE Access</source>
          .
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>72727</fpage>
          -
          <lpage>72741</lpage>
          . https://doi.org/10.1109/ACCESS.2020.2987829.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Smish: A Novel Activation Function for Deep Learning Methods</article-title>
          ,
          <source>Electronics (Basel)</source>
          .
          <volume>11</volume>
          (
          <year>2022</year>
          )
          <fpage>540</fpage>
          . https://doi.org/10.3390/electronics11040540.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wuraola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Nguang</surname>
          </string-name>
          ,
          <article-title>Efficient activation functions for embedded inference engines</article-title>
          ,
          <source>Neurocomputing</source>
          .
          <volume>442</volume>
          (
          <year>2021</year>
          )
          <fpage>73</fpage>
          -
          <lpage>88</lpage>
          . https://doi.org/10.1016/j.neucom.2021.02.030.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaytan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>İ.B.</given-names>
            <surname>Aydilek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yeroğlu</surname>
          </string-name>
          ,
          <article-title>Gish: a novel activation function for image classification</article-title>
          ,
          <source>Neural Comput &amp; Applic</source>
          .
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>24259</fpage>
          -
          <lpage>24281</lpage>
          . https://doi.org/10.1007/s00521-023-09035-5.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>PV Power Prediction Based on LSTM With Adaptive Hyperparameter Adjustment</article-title>
          ,
          <source>IEEE Access</source>
          .
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>115473</fpage>
          -
          <lpage>115486</lpage>
          . https://doi.org/10.1109/ACCESS.2019.2936597.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.S.</given-names>
            <surname>Anami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.V.</given-names>
            <surname>Sagarnal</surname>
          </string-name>
          ,
          <article-title>Influence of Different Activation Functions on Deep Learning Models in Indoor Scene Images Classification</article-title>
          ,
          <source>Pattern Recognition and Image Analysis</source>
          .
          <volume>32</volume>
          (
          <year>2022</year>
          )
          <fpage>78</fpage>
          -
          <lpage>88</lpage>
          . https://doi.org/10.1134/S1054661821040039.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yizhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yaqin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhili</surname>
          </string-name>
          ,
          <article-title>The Role of Activation Function in CNN</article-title>
          ,
          <source>in: 2020 2nd International Conference on Information Technology and Computer Application (ITCA)</source>
          ,
          <year>2020</year>
          : pp.
          <fpage>429</fpage>
          -
          <lpage>432</lpage>
          . https://doi.org/10.1109/ITCA52113.2020.00096.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.H.</given-names>
            <surname>Essai Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.B.</given-names>
            <surname>Abdel-Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.A.</given-names>
            <surname>Badry</surname>
          </string-name>
          ,
          <article-title>Developing Novel Activation Functions Based Deep Learning LSTM for Classification</article-title>
          ,
          <source>IEEE Access</source>
          .
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>97259</fpage>
          -
          <lpage>97275</lpage>
          . https://doi.org/10.1109/ACCESS.2022.3205774.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>ImageNet Classification with Deep Convolutional Neural Networks</article-title>
          ,
          <source>Neural Information Processing Systems</source>
          .
          <volume>25</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>