=Paper=
{{Paper
|id=Vol-3302/paper14
|storemode=property
|title=Comparison of Deep Neural Network Learning Algorithms for Biomedical Image Processing
|pdfUrl=https://ceur-ws.org/Vol-3302/paper7.pdf
|volume=Vol-3302
|authors=Oleh Berezsky,Petro Liashchynskyi,Oleh Pitsun,Pavlo Liashchynskyi,Mykola Berezkyy
|dblpUrl=https://dblp.org/rec/conf/iddm/BerezskyLPLB22
}}
==Comparison of Deep Neural Network Learning Algorithms for Biomedical Image Processing==
Oleh Berezsky a, Petro Liashchynskyi a, Oleh Pitsun a, Pavlo Liashchynskyi a and Mykola Berezkyy a

a West Ukrainian National University, 11 Lvivska st., Ternopil, 46009, Ukraine

Abstract
In recent years, the popularity of deep neural networks used for various problem-solving tasks has increased dramatically. The main tasks include image classification and synthesis using convolutional and generative-adversarial neural networks. These types of networks need large amounts of training data to achieve the required accuracy and performance. In addition, these networks have a long training time. The authors of the paper analyzed and compared gradient-based neural network learning algorithms. Biomedical image classification with a convolutional neural network of a given architecture was carried out. A comparison of learning algorithms (SGD, Adadelta, RMSProp, Adam, Adamax, Adagrad, and Nadam) was made according to the following parameters: training time, training loss, training accuracy, test loss, and test accuracy. For the experiments, the authors used the Python programming language, the Keras machine learning library, and the Google Colaboratory development environment, which provides free use of the Nvidia Tesla K80 graphics processor. For experiment tracking and logging, the authors used the Weights & Biases service.

Keywords
Machine learning, CNN, GAN, optimization algorithms, biomedical images.

1. Introduction

Neural networks are powerful tools used for solving a wide range of problems. A typical deep neural network consists of an input layer, several hidden layers, and an output layer. Any neural network optimizes a certain objective function depending on the type of problem. In recent years, the problems of image classification and synthesis with the use of convolutional and generative-adversarial neural networks have become relevant.
The use of neural networks to solve a specific problem involves the following tasks:
- selection of a training dataset;
- dataset preprocessing and augmentation if needed;
- selection of a neural network architecture or designing it from scratch;
- selection of a learning algorithm;
- further architecture optimization and tuning.
Training a neural network means optimizing its parameters to achieve the minimum error value. Optimization of neural network parameters can be performed by various algorithms, which are called learning algorithms. The purpose of this work is to compare neural network learning algorithms for the classification of biomedical images using a convolutional neural network.

IDDM-2022: 5th International Conference on Informatics & Data-Driven Medicine, November 18–20, 2022, Lyon, France
EMAIL: ob@wunu.edu.ua (A. 1); p.liashchynskyi@st.wunu.edu.ua (A. 2); o.pitsun@wunu.edu.ua (A. 3); pavloksmfcit@gmail.com (A. 4); mykolaberezkyy@gmail.com (A. 5)
ORCID: 0000-0001-9931-4154 (A. 1); 0000-0003-0646-7448 (A. 2); 0000-0003-0280-8786 (A. 3); 0000-0001-8371-1534 (A. 4); 0000-0001-6507-9117 (A. 5)
©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Literature review

Learning algorithms are divided into first-order, second-order, and evolutionary algorithms. First-order algorithms are based on the calculation of the first derivative of the error function.
Therefore, these algorithms are also called gradient algorithms. Second-order algorithms use the second derivative to select the direction of error minimization. Evolutionary algorithms are built on the basis of genetic algorithms.
In [1], the author analyzed the known gradient learning methods and provided their visualization. The authors of article [2] compared three evolutionary algorithms using a hybrid neural network for forecasting downstream river flow based on areal precipitation. In article [3], the authors compared several gradient optimization methods for a simple convolutional neural network; the Nadam algorithm showed the best results. In article [4], the author described the implementation of neural networks in an FPGA environment. This implementation speeds up the training of neural networks due to parallel processing; as a learning algorithm, the author used simple gradient descent. In the research study [5], the author substantiated the relevance of improving neural network training methods for object classification and segmentation problems and developed a method, based on nonlinear dynamics, that reduces the training time of neural networks. The improved method is based on gradient descent with delayed feedback.
In these publications, researchers mostly paid attention to the analysis of existing algorithms, and only some of the authors compared learning algorithms. The limitation of these publications is therefore that they address the comparison of learning algorithms only partially; most of them merely review learning algorithms while solving a larger problem.
The main goal of any learning algorithm is to minimize the learning error and optimize the network parameters. Modern classifiers [6, 7] require large amounts of training data to achieve high accuracy. In [8], the authors described the process of biomedical image classification and synthesis using convolutional and generative-adversarial neural networks. Training these networks is time-consuming, and the training time can be reduced by an adequate selection of the learning algorithm. Therefore, comparing learning algorithms for biomedical image classification is a relevant task.

3. Analysis of learning algorithms

Modern algorithms for training neural networks are based on error backpropagation and the gradient descent method. These algorithms are called gradient or first-order algorithms. An important parameter of these algorithms is the learning rate. This parameter controls how far to move in the direction opposite to the gradient of the function in one step. If the learning rate is low, the training time of the neural network can increase significantly. If the learning rate is high, the neural network may not reach the minimum error value [9]. Formally, gradient descent can be presented as follows:

\theta = \theta - \alpha \cdot \nabla_{\theta} J(\theta),    (1)

where \theta refers to the neural network parameters, \alpha is the learning rate, and \nabla_{\theta} J(\theta) is the gradient of the optimization (loss) function.
The disadvantage of gradient descent is that the network parameters can be updated only after passing the full training dataset. Among other gradient learning algorithms, stochastic gradient descent and mini-batch gradient descent are distinguished. Stochastic gradient descent (SGD) differs from the usual one in that the network parameters are updated after each training example [10]. Therefore, when using this learning algorithm, the parameters of the neural network are updated much more often.
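As an illustration of equation (1) and of the difference between full-batch gradient descent and SGD, the following is a minimal NumPy sketch that is not part of the original paper; the least-squares objective, function names, and hyperparameter values are illustrative assumptions only.

```python
import numpy as np

def grad_mse(theta, X, y):
    """Gradient of the least-squares loss J(theta) = ||X @ theta - y||^2 / (2 n)."""
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n

def batch_gradient_descent(X, y, alpha=0.1, epochs=100):
    """Full-batch gradient descent: one update of theta per pass over the whole dataset (eq. 1)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        theta -= alpha * grad_mse(theta, X, y)
    return theta

def sgd(X, y, alpha=0.01, epochs=100, seed=0):
    """Stochastic gradient descent: one update of theta per (shuffled) training example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            xi, yi = X[i:i + 1], y[i:i + 1]
            theta -= alpha * grad_mse(theta, xi, yi)
    return theta

# Toy usage: both variants recover the coefficients of y = 2*x1 - 3*x2 + noise,
# but SGD performs many more (noisier) parameter updates per epoch.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=200)
print(batch_gradient_descent(X, y))
print(sgd(X, y))
```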
Mini-batch gradient descent uses batches of data to update the parameters [11]. The training dataset is divided into batches of the same size. Each batch is fed to the network input, the gradient is calculated, and the parameters are updated. Equation (1) can then be represented in the following way:

\theta = \theta - \alpha \cdot \nabla_{\theta} J(\theta; x^{(i:i+n)}, y^{(i:i+n)}),

where x^{(i:i+n)}, y^{(i:i+n)} is a batch of n training examples.
Let us analyze the variations of gradient descent methods.
Adagrad. The essence of this algorithm is that the learning rate adapts to the network parameters [12]. The algorithm sets a lower learning rate for parameters that are associated with frequent features in the dataset. The update rule at iteration t then has the following form:

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot g_t,

where G_t is a diagonal matrix in which each diagonal element is the sum of the squares of the gradients with respect to the parameters \theta over all previous iterations, including t; \epsilon is a parameter with a small value that prevents division by 0 (usually on the order of 10^{-8}); and g_t = \nabla_{\theta} J(\theta_t) is the gradient of the optimization function. The advantage of this algorithm is that a researcher does not need to set the learning rate manually. The authors use the default value of the learning rate, which is 1.0 [12]. The disadvantage of the algorithm is the accumulation of gradients from previous iterations, which leads to a decrease in the learning rate and to minor updates of the network parameters.
Adadelta. This algorithm is an improved version of the previous one. The Adadelta algorithm restricts the accumulation of past gradients to a window of fixed size [13]:

\Delta\theta_t = - \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} \cdot g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t,

where RMS is the root mean square value and g_t = \nabla_{\theta} J(\theta_t) is the gradient of the optimization function. The advantage of this algorithm is that there is no need to set an initial learning rate.
RMSProp. This algorithm is similar to Adadelta. It was developed by Geoffrey Hinton [14]. The equations that describe the operation of the algorithm are as follows:

E[g^2]_t = 0.9 \, E[g^2]_{t-1} + 0.1 \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t.

RMSProp was developed at around the same time as Adadelta. Both algorithms solve the problem of the monotonically decreasing learning rate in the Adagrad algorithm.
Adam and Adamax. Unlike the two previous algorithms, in addition to the squares of the previous gradients, the Adam algorithm also stores the previous gradients themselves:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,

where m_t and v_t are estimates of the mean and variance of the gradients, respectively [15]. The rule for updating the parameters in this algorithm is as follows:

\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t,

where \hat{m}_t = m_t / (1 - \beta_1^t) and \hat{v}_t = v_t / (1 - \beta_2^t) are the bias-corrected estimates. The authors suggest the following values for the parameters: \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}. Adamax is a variant of the Adam algorithm.
Nadam. Nadam combines the Adam and Nesterov accelerated gradient (NAG) algorithms: like Adam, it accumulates both the squares of the gradient values and the values of the previous gradients, and it applies the momentum term in the Nesterov manner.

4. Dataset and augmentation

A training set of cytological images with a size of 64x64 pixels was used for the experiments. The initial dataset contains about 100 images; therefore, it was expanded to approximately 800 images using affine distortions. The Python programming language and the Rudi library [14] were used to expand the training dataset. Cytological images form a subset of biomedical images; they are images of the cells of an organism. Examples of cytological images are shown in Figure 1.

Figure 1: Cytological images

Cytological image processing and analysis are covered in works [16-18].

5. CNN architecture design

To compare the gradient descent-based training methods, a convolutional neural network model was built.
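The following is a minimal Keras sketch, not the authors' code, of a block-based CNN consistent with the layer summary given in Table 1 below; the stride-2 convolutions and "same" padding are assumptions made to reproduce the listed output shapes, and the final Dense layer combines the Dense and Softmax rows of the table.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=4):
    """Block-based CNN: Conv (stride 2) -> BatchNorm -> LeakyReLU, with MaxPool and Dropout."""
    return models.Sequential([
        layers.Input(shape=(64, 64, 3)),
        layers.Conv2D(64, 5, strides=2, padding="same"),   # 32x32x64
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 5, strides=2, padding="same"),  # 16x16x128
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.MaxPooling2D(),                             # 8x8x128
        layers.Dropout(0.5),
        layers.Conv2D(256, 3, strides=2, padding="same"),  # 4x4x256
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2D(512, 3, strides=2, padding="same"),  # 2x2x512
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.5),
        layers.Flatten(),                                  # 2048
        layers.Dense(num_classes, activation="softmax"),   # 4-class output
    ])

# Swapping the optimizer string ("sgd", "rmsprop", "nadam", "adadelta", "adagrad",
# "adamax") reproduces the comparison setup with TensorFlow's default parameters.
model = build_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```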
As an input, the network accepts color cytological images with a size of 64x64 pixels and outputs a class label. The sequence of layers is given in Table 1.

Table 1
Model summary

Layer        Output shape   Layer param
Input        64x64x3
Conv         32x32x64       kernel size = 5
Batch Norm   32x32x64
Leaky Relu   32x32x64       slope = 0.2
Conv         16x16x128      kernel size = 5
Batch Norm   16x16x128
Leaky Relu   16x16x128      slope = 0.2
Max Pool     8x8x128
Dropout      8x8x128        rate = 0.5
Conv         4x4x256        kernel size = 3
Batch Norm   4x4x256
Leaky Relu   4x4x256        slope = 0.2
Conv         2x2x512        kernel size = 3
Batch Norm   2x2x512
Leaky Relu   2x2x512        slope = 0.2
Dropout      2x2x512        rate = 0.5
Flatten      2048
Dense        4
Softmax      4

As can be seen from Table 1, the network consists of several repeating blocks. Each block consists of a sequence of a convolution layer, batch normalization, an activation layer, and a dropout layer. Each convolutional layer halves the spatial dimensions of its input. The model is compiled with the categorical cross-entropy loss function. The number of training epochs is 30. The dataset is split into training and test sets in a ratio of 80% to 20%. The batch size is set to 64. The TensorFlow 2 library and the Python programming language were used to build the model and conduct the experiments. The experiments were conducted in the Google Colaboratory environment on an Nvidia Tesla K80 graphics processor.

6. Experiments

The gradient descent-based optimizers and the learning rates used for them are listed in Table 2.

Table 2
Optimizers

Name       Learning rate
Adam       0.001
SGD        0.01
RMSprop    0.001
Nadam      0.001
Adadelta   0.001
Adagrad    0.001
Adamax     0.001

All optimizer parameters are set to the default values specified in the TensorFlow library. The comparison of the learning algorithms was made on the basis of the network training time, the value of the loss function, and the classification accuracy on the test dataset. The results of the experiments are shown in Table 3 and in the figures below.

Table 3
Experimental results

Name       Training time (min)   Loss   Test loss   Accuracy   Test accuracy
Adam       4.10                  0.05   0.77        0.9797     0.7625
SGD        2.50                  0.07   1.29        0.9731     0.7462
RMSprop    3.39                  0.08   0.45        0.9719     0.895
Nadam      3.55                  0.04   2.79        0.9822     0.66
Adadelta   4.35                  0.64   0.45        0.7319     0.8425
Adagrad    3.49                  0.13   0.12        0.9491     0.9488
Adamax     4.45                  0.06   0.30        0.9775     0.9162

Figure 2: Model with Adam optimizer

Figure 2 shows the loss and accuracy graphs on the training and test datasets for the Adam optimizer. The accuracy curves show that the network overfits: there is a significant difference in accuracy between the training and test datasets. Accuracy on the training dataset was ~98%, and accuracy on the test dataset was ~76%.

Figure 3: Model with SGD optimizer

Figure 3 also shows a significant difference between the accuracy values on the training and test datasets: ~97% and ~75%, respectively.

Figure 4: Model with RMSprop optimizer

The classification accuracy for the model with the RMSprop optimizer on the training and test datasets was ~97% and ~89%, respectively (Fig. 4). The model with the Nadam optimizer also demonstrates significant overfitting: accuracy on the training and test datasets is ~98% and ~66%, respectively (Fig. 5).

Figure 5: Model with Nadam optimizer

Figure 6: Model with Adadelta optimizer

As can be seen from Figure 6, the loss and accuracy graphs are quite smooth. However, the classification accuracy on the training dataset was only ~73%. This is explained by another problem, namely underfitting.
There are several options for solving it, such as increasing the number of training epochs or increasing the complexity of the model. The accuracy values on the training and test datasets are almost the same for the model with the Adagrad optimizer and are equal to ~95% (Fig. 7).

Figure 7: Model with Adagrad optimizer

Figure 8: All optimizers visualization

Figure 8 shows that the loss and accuracy curves on the test dataset are not smooth for almost all models, so the networks overfit. This is evidenced by the significant difference in accuracy values between the training and test datasets. This problem can be solved by simplifying the neural network model or by increasing the number of images in the training dataset. On the other hand, the model with the Adadelta optimizer has the inverse problem, underfitting. To solve this problem, several techniques can be applied, such as increasing the complexity of the model or increasing the number of training epochs.

7. Conclusions

The results of the study are as follows:
1. The authors conducted a comparative analysis of the gradient descent-based algorithms for optimizing neural network parameters (Adam, SGD, RMSprop, Nadam, Adadelta, Adagrad, and Adamax). The comparison was made according to the criteria of network training time, loss function values, and classification accuracy on the cytological image dataset.
2. Based on the cytological image dataset and the developed convolutional neural network model, the four best optimizers were selected according to the val_accuracy parameter: Adamax, Adadelta, Adagrad, and RMSprop.
3. The graphs of val_loss and val_accuracy on the test dataset for all optimizers except Adadelta are not smooth. Unlike the other algorithms, Adadelta is an optimization algorithm with an adaptive learning rate. Therefore, the parameters are updated with a smaller step, and during training there is no overfitting problem. As a result, the accuracy and loss curves on the training and test datasets almost coincide and are smooth.
4. Since Adadelta is an adaptive algorithm, it is necessary to use a higher learning rate at the beginning. This will significantly reduce the training time and ensure the convergence of the model.
In this work, only one neural network model was used. In future research, it is planned to apply the discussed optimizers to larger and more complex models. Further research will also cover the application of these optimizers to generative-adversarial networks.

8. References

[1] Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747 (2016). URL: https://arxiv.org/abs/1609.04747
[2] X.Y. Chen, K.W. Chau, A.O. Busari. A comparative study of population-based optimization algorithms for downstream river flow forecasting by a hybrid neural network model. Engineering Applications of Artificial Intelligence, Volume 46, Part A, 2015, pp. 258-268. https://doi.org/10.1016/j.engappai.2015.09.010
[3] E. M. Dogo, O. J. Afolabi, N. I. Nwulu, B. Twala and C. O. Aigbavboa, "A Comparative Analysis of Gradient Descent-Based Optimization Algorithms on Convolutional Neural Networks," 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), 2018, pp. 92-99. https://ieeexplore.ieee.org/document/8769211/
[4] Khuzhakhmetova, A. Sh., A. V. Semenyutina, and V. A. Semenyutina.
"Deep neural network elements and their implementation in models of protective forest stands with the participation of shrubs." International journal of advanced trends in computer science and engineering 9.4 (2020): 6742-6746. [5] Smorodin A. V. Methods of learning neural networks based on nonlinear dynamics: diss. dr. Philos. Sciences: 122 / Smorodin Andriy Vyacheslavovich – Odesa, 2022. – 169 p. [16] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large- scale image recognition." arXiv preprint arXiv:1409.1556 (2014). [7] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. https://doi.org/10.1109/CVPR.2016.90 [8] Berezsky, O., Pitsun, O., Liashchynskyi, P., Derysh, B., Batryn, N. (2023). Computational Intelligence in Medicine. In: Babichev, S., Lytvynenko, V. (eds) Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making. ISDMCI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 149. Springer, Cham. https://doi.org/10.1007/978-3-031- 16203-9_28 [9] Amari, Shun-ichi. "Backpropagation and stochastic gradient descent method." Neurocomputing 5.4-5 (1993): 185-196. https://doi.org/10.1016/0925-2312(93)90006-O [10] Ketkar, Nikhil. "Stochastic gradient descent." Deep learning with Python. Apress, Berkeley, CA, 2017. 113-132. [11] Hinton, Geoffrey, Nitish Srivastava, and Kevin Swersky. "Neural networks for machine learning lecture 6a overview of mini-batch gradient descent." Cited on 14.8 (2012): 2. [12] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from http://jmlr.org/papers/v12/duchi11a.html [13] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. URL: http://arxiv.org/abs/1212.5701 [14] Hinton, G., Srivastava, N., Swersky, K. Overview of mini-batch gradient descent. URL: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf [15] Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13. https://arxiv.org/abs/1412.6980 [16] Berezsky, O., Pitsun, O., T. Dolynyuk, T., Dubchak, L., Savka, N., Melnyk, G. & Teslyuk, V. “Cytological Image Classification Using Data Reduction”. II International Workshop Informatics & Data-Driven Medicine (IDDM 2019). Lviv, Ukraine. November 11-13, 2019. http://ceur-ws.org/Vol- 2488/paper2.pdf. [17] Berezsky, O., Pitsun, O., Datsko, T., Derysh, B., Tsmots, I. & Tesluk, V. “Specified diagnosis of breast cancer on the basis of immunogistochemical images analysis”, IDDM’2020: 3rd International Conference on Informatics & Data-Driven Medicine, November 19–21, 2020, Växjö, Sweden. pp. 129-135. http://ceur-ws.org/Vol-2753/short5.pdf, [18] Berezsky, O., Pitsun, O., Dubchak, L., Berezka, K., Dolynyuk, T. & Derish, B. Cytological Images Clustering. In: Shakhovska N., Medykovskyy M.O. (eds) Advances in Intelligent Systems and Computing V. CSIT 2020. Advances in Intelligent Systems and Computing. 2021; vol 1293. Springer, Cham. https://doi.org/10.1007/978-3-030-63270-0_12.