Bagging of Convolutional Neural Networks for Diagnostic of Eye Diseases Mahmoud Smaida1[0000-0002-5552-2768], Serhii Yaroshchak 2[0000-0001-9576-2929] The national university of water and Environmental Engineering, Revine, Ukraine Smaida20012001@gmail.com The national university of water and Environmental Engineering, Revine, Ukraine s.v.yaroshchak@nuwm.edu.ua Abstract- Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the structure of the human brain itself, learn from large amounts of data. In this paper, we will introduce the part of the tech- niques of deep learning to perform multi-class classification, in order to classify eye diseases. One of the biggest issues in image recognition is the classification of medical images, and it aims to classify medical images into different catego- ries to help doctors diagnose the disease. But the most important idea will be addressed in our paper is the evaluation performance model using a bagging en- semble. In this study, we will compare three models of the convolutional neural network, CNN, Vgg16 and InceptionV3 in order to evaluate the performance of the models using bagging ensemble. In our work, a deep learning convolutional network based on Keras and Tensor Flow is deployed using python for image classification. A number of different medical images have been used as a data set to diagnose eye diseases, which contain four types of diseases such as, Diabetic retinopathy, Glaucoma, Myopia and Normal. CNN, VGG16 and InceptionV3 neural network structures are compared singly and together using bagging ensemble, in order to diagnose eye diseases. All experiments were applied and the result was obtained. It has been shown that using a bagging ensemble yields better predictive efficiency than can be obtained using learning algorithms alone. Moreover, the use of the con- fusion matrix in our experiments shows us where our classifiers are confused when it makes predictions. Keywords. InceptionV3, Vgg16, eye diseases, ensemble learning, Deep Learn- ing, Diabetic retinopathy, Glaucoma, Myopia, bagging. 1 Introduction. Diabetic retinopathy, glaucoma and myopia are some of the most common eye diseases and one of the most common causes of blindness in the world if they are not detected at an early stage. In recent years, the diagnosis of diseases of the human visual system has advanced greatly to technological innovations and developments in the field of artificial intelli- gence. Taking into account the diversity and complexity of eye functions, a large Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). number of diagnostic equipment, tools, methods and algorithms have been developed. Sometimes a doctor can discover a specific disease after a visual analysis of the im- age. However, in a large number of cases, the diagnosis is not made due to many fac- tors, such as bad experience, fatigue, a variety of shapes, similarities, poor image quality, etc. In these cases, the second opinion is very important and useful, which comes from another expert who uses advanced information technology and algo- rithms to accurately analyze the image to diagnose eye diseases [1] bagging ensemble is a kind of ensemble learning, it is a set of machine learning models combined to- gether to obtain better results. In this study, we focus on bagging ensemble to improve the model's prediction and make it better. We are not talking about creating a new algorithm, but instead assembling together several different algorithms or several different models to create an ensemble learner, called bagging, in order to increase the accuracy of the model. In general, predicting the target variable using any deep learn- ing method leads to a difference between actual and expected values, due to noise, variance and bias. The Bagging ensemble helps reduce variance. In summary, as shown in Figure 1, different and same algorithms are used in ensemble learning to achieve a better prediction efficiency that can be achieved from any of the constituent learning algorithms alone [2]. 2 Formal problem statement Eye diseases have a wide range of shapes, sometimes the textures are difficult to identify and recognize by an ophthalmologist. Therefore, information technology must be used to provide maximum comfort to the patient and ophthalmologist, and improve health care system. In this paper, we will use bagging ensemble to evaluate three different CNN struc- tures to identify eye diseases, Diabetic retinopathy, Glaucoma, Myopia, and Normal. 3 Ensemble learning Ensemble learning models are a technique that combines several base models to create a perfect predictive model, and it is divided into two groups: Simple ensemble Techniques and advanced ensemble techniques [3]. 3.1 Simple ensemble Techniques [3]: a. Max Voting: each model in max voting makes a prediction and votes for each sample. The category with the most votes will be the last predictive cat- egory. b. Averaging: It is the process of creating many models and combining them to get the desired result. The result will be better average performance than sin- gle model. (1) Fig. 1. Types of ensemble learning [5] c. Weighted Averaging: is an extension of a model averaging ensemble where the contribution of each member to the final evaluated is weighted by the performance of the model. A weighted average mean value takes the form of a sum on quantum energy states, rather than continuous integration. (2) 3.2 Advanced ensemble techniques [4], [15]: a. Bagging: type of ensemble learning that relies on creating a number of sub-datasets called bagging. b. Boosting: Is a fairly simple variation on bagging that strives to improve the learners by focusing on areas where the system is not performing well. c. Stacking: in stacking all models are trained based on a complete data set, and the output used as input features to train ensemble function. 4 Bagging: In this paper we will focus on bagging ensemble which is a type of ensemble learning that relies on creating a number of sub-datasets called bagging; each bag is a subset of the original dataset, which contains a number of different instances. Inside each bag a set of instances of random data with replacement. We use each of these collections of data (bag) to train a different model. Finally, collect all of the outputs (predicts) and calculate the average or voting value as shown in fig.2. Fig. 2. Bagging example has three steps Bagging ensemble learning model is category bases on its use; the details are ad- dressed in Table.1 Table 1. bagging ensemble learning basis of it’s use [5]: Partitioning of Methods Function to Goal to the data into where this is combine single achieve subsets used models Mini- Bag- Random (weighted) Random mize vari- ging subspace average ance 5 Related work Many researchers have used bagging ensemble techniques using neural networks in their research, and most of these studies have been done recently, focusing on re- cent research. A few reviews are as follows: Ju, Cheng, Aurélien Bibaut and Mark van der Laan. [6] In this work, authors in this work are used neural network, VGG, GoogleNet and ResNet to applied some of ensemble learning technical, including: Max voting, unweighted average, bayes opti- mal and super learner. the authors trained their models based on same and different Networks. Ensemble of the same and different networks has been trained multiple times. the results obtained and listed based on the best performance on the testing set. all learners used CIFAR 10 as a dataset, and the unweighted average provided the best result when the performance of the base learners is comparable. Huang, Jonathan, et al. [7] four deep neural networks has been applied in this work in order to improve the accuracy. These networks are, Vgg12, ResNet50, AclNet and AclSincNet, all these models were pre-trained with audio dataset. Ensemble learning was achieved in all these models and the result obtained over the validation set. The best accuracies were achieved when all the networks combined together based on ensemble average by score 83.01%. Mo, Weilong, et al. [8] the authors suggest an image recognition algorithm based on the ensemble learning algorithm and the structure of the ELA-CNN to solve a problem that single model can not correctly predict. They used the bagging ensemble to train their models. the networks structure was used are combines of ResNet, DenseNet, DenseNet-BC and Inception-Resnet-v2 architecture. in their experiments they used cifar-10 as images dataset, it consists of 60,000 color images. These images were divided into 50,000 in the training set and 10,000 in the test set. The final result was the average probability of the prediction vector. Kumar, Ashnil, et al. [9] For classification of medical images based on diagnosis, training, and biomedical research, a set of convolutional neuronal networks of fine- tuned were used to classify medical images. They used 6,776 training images and 4166 test images. The authors used two different CNN designs, AlexNet and Goog- leNet, to images classification. The experiments were performed using individual models and ensemble models. By the end result, the ensemble method reached an accuracy corresponding to the best accuracy among other methods of the overall method of 96.59%. Beluch, William H., et al. [10] the authors in this paper explore some of the recent- ly proposed active learning methods that contain big data and CNN classifiers. They compare ensemble-based methods against Monte-Carlo Dropout and geometric ap- proaches.. They have found that the ensemble learning better and leads to a more predictable uncertainty, which is the basis of many active training algorithms of con- volution Neural Networks, such as S-CNN, K-CNN, DenseNet, InceptionV3 and ResNet -50 to classify Diabetic retinopathy. The dataset was used with MNIST, CIFAR, and ImageNet. They found that ensembles which based on several active learning algorithms were better predicted and achieved a set test accuracy of 90% of the approximately 12,200 images presented. Minetto, Rodrigo, Maurício Pamplona Segundo, and Sudeep Sarkar. [11] Hydra: An Ensemble of Convolutional Neural Networks for Geospatial Land Classification in satellite image. Hydra is an initial CNN that is coarsely optimized, which will serve as the Hydra’s body. in this article, authors created ensembles for their experiments us- ing two state-of-the-art CNN architectures, ResNet and DenseNet. they demonstrated their application of Hydra framework in two datasets, FMOW and NWPU- RESISC45. The final result ensemble was achieved accuracy around 94.51%. 6 Data Description Kaggle is a data science website that contains a variety of interesting data sets. In its main menu, you can find all kinds of specialized data sets, from the Ramen classi- fications to basketball and animal licenses data in Seattle [11]. We used our data from competition in kaggle Diabetic Retinopathy Detection [17] and iChallenge-GON Comprehension which is a large collection of 1,200 retinal fun- dus images for both subjects without glaucoma (90%) and glaucoma patients (10%). The data set includes more than 35 types of eye diseases. To simplify, we will re- duce the data set with 4 main breeds. The dataset includes images of glaucoma, myo- pia, diabetic retinopathy, and Normal eye provided as a subset of photos from a large dataset of 2781 Retinal Image as it shown in table.2. All the images were collected in total from Kaggle dataset and iChallenge-GON Comprehension, in high resolution images. Fig. 3. Normal Fundus, Glaucoma, Diabetic retinopathy and Myopia Images will be the entrance to CNN architecture. our images divided into training dataset, validation and test dataset; each type of image has a separate folder and each image has a file name, which is its unique identifier. Python will be used to achieve our goal using Google Colab. Table 2. Number of images according to eye diseases Diabetic retinopathy Glaucoma Myopia Normal eye 975 721 486 599 7 Research Methodology The three proposed methodologies will be used in our experiment are CNN, VGG16 and InceptionV3 in order to evaluate singly and using bagging ensembles to identify eye diseases. First, a set of image data is prepared step-by-step; there are 4 folders in the data set, which contain 2781 images of diabetic retinopathy, glaucoma, myopia and normal, where 1951 images were used for training, 415 images were used for tests, and 415 images were used for validation. In the next steps, fitting our CNN model, then, obtain the accuracy of the data set for different CNN structures and final- ly, compare these accuracies separately and using bagging ensemble to measure per- formance. This article covers three ways to evaluate the performance of our learners:  CNN based on three hidden layers, pooling layers and fully connected layers.  Pre-trained CNN based on VGG 16 algorithms using the last block layer training (Block 5).  Pre-trained CNN based Inception v3 algorithms using the last block layer training ('mixed6). 7.1 Convolution Neural Network The size of the input image is 150 * 150 pixels with 3 channels (RGB). To extract the image features, we used 32 filters 3 * 3 pixels. And 2 * 2 pixel window, used to minimize the size of image(Pooling layer). Next, we applied another convolution layer used 32 filters with a size of 3 * 3 and a max pooling size of 2 * 2. In the last convolution layers, 64 filters of 3 * 3 are used with a max pooling of 2 * 2., then we use the Fully connected layer (64 dense units) and softmax layer (4 units) to predict eye diseases. CNN networks adjust the weight of the filters during the back propaga- tion, which means that after forwarding, the network can look at the loss function and carry out the backward transfer process to update the weight. (3) (4) 7.2 VGG 16. This is a convolutional neural network structure developed by the University of Ox- ford's Visual Engineering group in 2014. This model loads a set of pre-trained weights into ImageNet using a 16-layer network. The size of the images entered on the VGG16 network is 224x224 RGB, the imag- es are passed through 5 blocks of convolutional layers, with each block consisting of an increasing number of 3x3 filters, stride is fixed to 1 while the convolutional layer inputs are padded. The blocks are separated by the max pooling layers. The max pool- ing is made on 2 * 2 windows with stride 2. Five blocks of convolutional layers are followed by three fully connected (FC) layers. The last layer is a soft max layer repre- senting the output layer [12]. 7.3 Inception V3. Is a convolutional neural network consisting of 48 deep layers trained in over a mil- lion images in an ImageNet database. It can categorize images into 1000 categories of objects [13], [14]. Inception-v3 is one of the most popular models that can be used for transfer learn- ing. This allows us to retrain the last layers of existing models, which leads to a sig- nificant reduction in training time. inception-v3 has been trained in over a million images from the ImageNet database, which means that the model had learned during its original training and could be applied to smaller dataset with highly accurate clas- sifications without the need of training all the model. The Inception Layers is a mixture of a set of layers (i.e. 1 × 1 convolutional layer, 3 × 3 convolutional layer, 5 × 5 convolutional layer) with combinations of output filters combined into one output vector, forming the inputs for the next step. 7.4 Selected Measures. In this section, we officially describe the most common measures used to compare our works. The various measures are based on the marginal rate of the confusion ma- trix. In this article, comparisons will be made using confusion matrix to measure model accuracy. Accuracy: This is a measure of how much the classifier predicted the class correctly. Accuracy = TP / TP+FP+FN+TN (5) 8 Experiments and Results. All the models above are applied using Python; the dataset is a set of fundus imag- es representing eye diseases, such as diabetic retinopathy, glaucoma, myopia and normal. In our experiments, we compare the empirical performance for bagging en- semble method which we mentioned before to obtain different models to get better accuracy for eye diseases detection. 8.1 Results on CNN, VGG16 and InceptionV3 individually: CNN, VGG16 with fine-tune the final layers and InceptionV3 with pre-training the final layers uses eye diseases dataset including bagging have been applied. Table.3 shows the result on the test dataset. Table 3. Prediction accuracy in individual models Number of Prediction Accu- Model Epoch racy CNN 50 71.57% VGG16 50 83.86 % InceptionV3 50 87.71 From the above models, there are three classification accuracies obtained as shown in Table.3. These accuracies are graphically represented in the graphs below, where each model structure is shown with epochs and accuracies. Fig. 4. Accuracy on train and test set in CNN Fig. 5. Accuracy on train and test set in VGG architecture Fig. 6. Accuracy on train and test set in InceptionV3 architecture 8.2 Results on CNN, VGG16 and InceptionV3 using bagging ensemble learning:. Bagging ensembles with different and the same structures have been applied. Three models were trained by CNN, VGG16 and InceptionV3 to implement a bag- ging ensemble. Therefore, compare the performance for all the bagging ensemble methods, and the results presented in Table.4 of each net on the test set. Table 4. Prediction accuracy on the testing set for same and different models using ensemble learning Model Type of ensemble Prediction Accuracy Three model of CNN Bagging 77.1% Three model of VGG16 Bagging 79.8% Three model of InceptionV3 Bagging 87.2% CNN, VGG16 and InceptionV3 Bagging 86.5% Table 4, shows the accuracy of bagging ensemble with same and different architec- tures, these accuracies are graphically represented in below graphs, where each model structure is shown with epochs and accuracies. Fig. 7. Accuracy on train and test set in bagging ensemble models using CNN Fig. 8. Accuracy on train and test set in bagging ensemble models using VGG16 Fig. 9. Accuracy on train and test set in bagging ensemble models using InceptionV3 Fig. 10. Accuracy on train and test set in bagging ensemble models using CNN, VGG and InceptionV3 We compare accuracies of graphs above, and find out the following:  Combining Inception v3 and All the models as a bagging ensemble which shown in Fig.9 and Fig.10 gives the best accuracy 87.20 % and 86.50%, which is far bet- ter than accuracy of graph in Fig.7 and graph in Fig.8.  Due to the varied results between the models alone. We used the CNN which has poor accuracy compared to the other models. Therefore, we recommend using deep learning networks such as Alex Net or ResNet with Inception V3 to obtain the best accuracy.  The confusion matrix in Fig.11 shows that all classification models are confused with Glaucoma and Normal eye when it makes prediction. Therefore, this problem must be addressed to optimize the classification. Fig. 11. Confusion matrix shows where our model confuse 9 Conclusion. We studied the relative performance used bagging ensemble methods with deep con- volutional neural networks as base learners on eye diseases data set, for image classification. In this work we have applied three systems for multi-class classification using a bagging ensemble, and we found that assembly of deep neural network mod- els can outperform traditional methods that rely on learning algorithms alone. Three models of multiclass classification CNN, VGG16 and Inception V3 have been compared in order to measure the accuracy and to know the effects of models assembly compared with learning algorithms alone. Due to the small number of the training datasets, we implemented the Fine-tuning and data augmentation to increase the accuracy of experiments in the test set. All the models mentioned above are de- ployed using python for multiclass image classification. We compared these three different structures of CNN on GPU systems using google Colab. With experiments, as shown in table.4 we obtained results for each combination and observed that bag- ging ensemble based on Inception V3 combination gives better classification accuracy (87.20 %) than any other models. We recommend using deep learning networks such as AlexNet or ResNet with In- ception V3 to obtain better accuracy. Confusion matrix has been applied in our exper- iment to know in which class our models were confused. The results show that all classification models in varying proportions are confused with Glaucoma when it makes prediction as it shown in Fig.11. Therefore, this problem must be addressed to optimize the classification. References 1. American Macular Degeneration Foundation www.macular.org, 2019. 2. Ensembling ConvNets using Keras https://towardsdatascience.com/ensembling-convnets- using-keras-237d429157eb, 01/2020. 3. Ensemble averaging (machine learning), https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning), 01/2020. 4. A Comprehensive Guide to Ensemble Learning, https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble- models/, 01/2020. 5. Ensemble Learning- The heart of Machine learning, https://medium.com/ml-research- lab/ensemble-learning-the-heart-of-machine-learning-b4f59a5f9777, 01/2020. 6. Ju, Cheng, Aurélien Bibaut, and Mark van der Laan. "The relative performance of ensem- ble methods with deep convolutional neural networks for image classification." Journal of Applied Statistics 45.15 (2018): 2800-2818. 7. Huang, Jonathan, et al. "Acoustic scene classification using deep learning-based ensemble averaging." (2019). 8. Mo, Weilong, et al. "Image recognition using convolutional neural network combined with ensemble learning algorithm." Journal of Physics: Conference Series. Vol. 1237. No. 2. IOP Publishing, 2019. 9. Kumar, Ashnil, et al. "An ensemble of fine-tuned convolutional neural networks for medi- cal image classification." IEEE journal of biomedical and health informatics 21.1 (2016): 31-40. 10. Beluch, William H., et al. "The power of ensembles for active learning in image classifica- tion." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. 11. Minetto, Rodrigo, Maurício Pamplona Segundo, and Sudeep Sarkar. "Hydra: an ensemble of convolutional neural networks for geospatial land classification." IEEE Transactions on Geoscience and Remote Sensing (2019). 12. Tindall, Lucas, Cuong Luong, and Andrew Saad. "Plankton classification using vgg16 network." (2015). 13. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE con- ference on computer vision and pattern recognition. 2015. 14. ImageNet. http://www.image-net.org/ ,01/2020 15. The interests of truth require a diversity of opinions, https://www.zest.ai/blog/many-heads- are-better-than-one-making-the-case-for-ensemble-learning, 12/2019 16. https://subscription.packtpub.com/book/big_data_and_business_intelligence/97817891366 09/2/ch02lvl1sec16/max-voting, 01/2020 17. Diabetic Retinopathy Detection, https://www.kaggle.com/c/diabetic-retinopathy- detection/data, 01/2020 18. Visa, Sofia, et al. "Confusion Matrix-based Feature Selection." MAICS 710 (2011): 120- 127. 19. Nezami, Omid Mohamad, et al. "Automatic Recognition of Student Engagement using Deep Learning and Facial Expression." arXiv preprint arXiv:1808.02324 (2018). 20. Loussaief, Sehla, and Afef Abdelkrim. "Machine learning framework for image classifica- tion." 2016 7th International Conference on Sciences of Electronics, Technologies of In- formation and Telecommunications (SETIT). IEEE, 2016. 21. Bizios, Dimitrios, et al. "Machine learning classifiers for glaucoma diagnosis based on classification of retinal nerve fibre layer thickness parameters measured by Stratus OCT." Acta ophthalmologica 88.1 (2010): 44-52. 22. LeCun, Y., Bengio, Y., et al. (1995). Convolutional networksforimag- es,speech,andtimeseries. Thehandbook of brain theory and neural networks, 3361(10):1995. 23. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105. 24. Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recur- rent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee inter- national conference on, pages 6645– 6649. IEEE. 25. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. 26. Open Dataset Finders, https://lionbridge.ai/datasets/the-50-best-free-datasets-for-machine- learning/,2019 27. Tindall, Lucas, Cuong Luong, and Andrew Saad. "Plankton classification using vgg16 network." (2015). 28. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE con- ference on computer vision and pattern recognition. 2015. 29. Smolyakov, V. "Ensemble learning to improve machine learning results." (2017).