=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_16
|storemode=property
|title=Transfer Learning with CNN Architectures for Classifying Gastrointestinal Diseases and Anatomical Landmarks
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_16.pdf
|volume=Vol-2283
|authors=Danielle Dias,Ulisses Dias
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DiasD18
}}
==Transfer Learning with CNN Architectures for Classifying Gastrointestinal Diseases and Anatomical Landmarks==
Danielle Dias, Ulisses Dias
University of Campinas, Brazil
danielle.dias@ic.unicamp.br, ulisses@ft.unicamp.br

ABSTRACT

Transfer learning is an approach where a model trained for a given task is used as a starting point on a second task. Many advanced deep learning architectures have been pre-trained on ImageNet and are currently available, which makes this technique very popular. We evaluate 10 pre-trained architectures on the task of finding gastrointestinal diseases and anatomical landmarks in images collected in hospitals. Our analysis considers both processing time and accuracy. We also study whether global image features bring advantages to the pre-trained models for the problem of gastrointestinal medical image classification. Our best models achieved accuracy and F1-score values of 0.988 and 0.908, respectively. Our fastest model classifies an input instance in 0.037 seconds and yields accuracy and F1-score of 0.983 and 0.866, respectively.

1 INTRODUCTION

The Medico Task proposes the challenge of predicting diseases based on multimedia data collected in hospitals [7, 8]. The images are frames from videos captured by inserting a camera into the gastrointestinal tract. The main purpose is to identify anomalies that can be detected visually, even before they become symptomatic. More details can be found in the task overview [9].

To solve the task we created several models based on features extracted from deep convolutional architectures and on global image features. The deep architectures were trained on ImageNet. The strategy was to create three kinds of models: (i) those having only features extracted from deep architectures as input, (ii) those that considered only global image features, and (iii) those created with all features available.

The approach of extracting features from pre-trained models is usually referred to as transfer learning. It has become popular because many pre-trained models are readily available. We selected 10 architectures trained on ImageNet as extractors and compared their performance on classifying images for the Medico Task.

The architectures differ in several characteristics, which impacts the time needed to compute the features. Since efficiency is an important matter in this task, we measured the processing time for each test image and made efforts to reach a balance between solution quality and running time.

2 RELATED WORK

Transfer learning has been used for several problems in a number of domains. In 2017, Agrawal et al. [1] used transfer learning for the Medico Task when the challenge had 8 classes and achieved good results. They restricted the analysis to two architectures: InceptionV3 and VGGNet. They also analysed how a model performs if it uses features extracted from both architectures as input. The results were better only by a small factor, which leads us to believe that the gain is not worth the extra processing time. We extended the work of Agrawal et al. [1] by using 10 architectures and by considering both solution quality and efficiency in our analysis.

3 APPROACH

The development and test datasets contain 5,293 and 8,740 images, respectively. For each image, visual features were extracted and provided as feature vectors by the task organizers, namely: JCD, Tamura, ColorLayout, EdgeHistogram, AutoColorCorrelogram, and PHOG [6]. These feature vectors are sequences of floating point values, and their lengths sum to 1,185 per image. These values were joined to form a table used as input to our models, where rows represent images. We removed 19 columns because they either had the same value for all images or were duplicated.

We used 10 architectures trained on ImageNet: DenseNet121 [5], DenseNet169 [5], DenseNet201 [5], InceptionResNetV2 [11], InceptionV3 [12], MobileNet [3], ResNet [4], VGG16 [10], VGG19 [10], and Xception [2]. Each architecture requires a particular pre-processing step and returns a vector of floating point numbers. The vector sizes are: DenseNet121 (1024), DenseNet169 (1664), DenseNet201 (1920), InceptionResNetV2 (1536), InceptionV3 (2048), MobileNet (1024), ResNet (2048), VGG16 (512), VGG19 (512), and Xception (2048).
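The paper does not state which framework was used; the following is a minimal sketch of the feature-extraction step, assuming Keras with TensorFlow and using DenseNet121 (1024-value vectors) as an example extractor. The file path and input size are illustrative.

```python
# A minimal sketch of feature extraction with a pre-trained network,
# assuming Keras/TensorFlow; the paper does not name its framework.
import numpy as np
from tensorflow.keras.applications.densenet import DenseNet121, preprocess_input
from tensorflow.keras.preprocessing import image

# Pre-trained on ImageNet; dropping the classification top and average-pooling
# the last convolutional block yields the 1024-value vector reported above.
extractor = DenseNet121(weights="imagenet", include_top=False, pooling="avg")

def extract_features(path):
    # Each architecture expects its own input size and preprocess_input step.
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]  # shape: (1024,)
```

Swapping DenseNet121 for any of the other nine architectures changes only the import, the preprocessing function, and the expected input size.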
The input layer of each classifier has the same number of nodes as its feature vector. Thus, a model that uses only global features has an input layer with 1,166 nodes, and a model that combines global features with a given architecture as feature extractor has an input layer whose size is the feature vector length of that architecture plus 1,166.

The best model for most of the input features uses one hidden layer with 512 nodes, each using ReLU as the activation function. We added dropout of 50% during the training stage and L2 regularization to prevent overfitting; models with more layers tend to overfit very easily within just a few epochs. It would be possible to create simpler models for small feature vectors such as those of the VGG architectures, but we decided to report the same network for all input vectors for comparison purposes. The output layer has 16 nodes (one for each class) and uses softmax activation to classify the image into one of the classes.
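As a sketch of the classifier head just described, again assuming Keras; the L2 strength and the optimizer are assumptions, since the paper reports neither.

```python
# A minimal sketch of the classifier head: 512 ReLU units, 50% dropout,
# L2 regularization, and a 16-way softmax output, as described in the text.
from tensorflow.keras import Sequential, layers, regularizers

def build_classifier(input_dim, num_classes=16, weight_decay=1e-4):
    # weight_decay is an assumption; the paper only says L2 regularization was used.
    return Sequential([
        layers.Dense(512, activation="relu", input_shape=(input_dim,),
                     kernel_regularizer=regularizers.l2(weight_decay)),
        layers.Dropout(0.5),  # active during training only
        layers.Dense(num_classes, activation="softmax"),
    ])

# Example: DenseNet121 features (1024 values) plus the 1,166 global features.
model = build_classifier(1024 + 1166)
model.compile(optimizer="adam",  # optimizer choice is an assumption
              loss="categorical_crossentropy", metrics=["accuracy"])
```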
4 RESULTS AND ANALYSIS

During the training stage we split the development set of 5,293 images into train (3,038 images), validation (1,722 images), and test (531 images) sets. We used the train and validation sets to train and tune the classifiers. The test set was used only once, to generate the results in Table 1, after we were satisfied with the validation scores.

The model that uses only global image features yields an accuracy of 0.813 and an F1-score of 0.782, which we consider a baseline result. Table 1 summarizes the results using transfer learning on our test set of 531 images. We based our decision about which models to submit on the F1-score and accuracy metrics.

Table 1: F1-score and accuracy of the transfer learning models on our test set (531 images). The table also reports the average time (in seconds) to classify an image after the model is loaded in memory.

                             No Global Features   Global Features
Architecture      Time (s)   ACC      F1          ACC      F1
DenseNet121       0.163      0.904    0.856       0.915    0.868
DenseNet169       0.209      0.915    0.899       0.919    0.903
DenseNet201       0.242      0.923    0.905       0.925    0.871
InceptResnetV2    0.349      0.908    0.894       0.904    0.858
InceptionV3       0.213      0.883    0.876       0.889    0.845
Mobilenet         0.037      0.889    0.841       0.896    0.850
Resnet            0.115      0.909    0.916       0.894    0.883
VGG16             0.372      0.879    0.837       0.870    0.828
VGG19             0.406      0.881    0.867       0.866    0.858
Xception          0.257      0.877    0.835       0.898    0.855

Our first decision was to disregard the models that combine transfer learning with global features, because the improvement after adding the global features was negligible. For example, DenseNet201 showed only a small increase in accuracy, from 0.923 to 0.925; these were the best models by the accuracy metric. If we consider the F1-score, the models with global image features were even worse in several cases. Taking into account that the models with global features have 1,166 more inputs than their counterparts, we decided to keep the simpler models.

DenseNet201 and ResNet were selected as the two best models considering accuracy and F1-score, respectively. MobileNet was selected because it is remarkably fast, and efficiency is an important matter in this task. Indeed, we consider this model the best trade-off, since its accuracy of 0.889 is not far from the 0.923 of DenseNet201 (the best accuracy model) and it runs 6.5 times faster. DenseNet121 was the last model selected because it lies somewhere between the best accuracy model (DenseNet201) and the fastest model (MobileNet).

The selected models were submitted and evaluated against the task test dataset with 8,740 images. On this dataset, our results were much better than we anticipated. The results are shown in Table 2, where we also report the official competition ranking indicator Rk.

Table 2: F1-score, accuracy, and Rk of the selected models on the task test dataset (8,740 images).

Architecture     ACC      F1       Rk
DenseNet121      0.987    0.903    0.893
DenseNet201      0.988    0.908    0.898
Mobilenet        0.983    0.866    0.853
Resnet           0.988    0.906    0.896

ResNet and the DenseNet models achieved accuracy higher than 0.987. MobileNet yields an accuracy of 0.983, which is very close to the top accuracy of 0.988 achieved by DenseNet201 and ResNet. The F1-score shows that ResNet and DenseNet201 are the best models and that MobileNet is somewhat worse than the others. However, we believe MobileNet offers the best trade-off, considering that it returns a solution in well under a second. DenseNet121 does not appear to be a good choice because it is slower than ResNet and presents somewhat worse results; therefore, ResNet should be preferred over it in any situation. DenseNet201 reached the top accuracy and F1-score, but its margin over ResNet is so small that it is difficult to argue that it compensates for being twice as slow.

5 DISCUSSION AND OUTLOOK

We believe these models can be improved, and the confusion matrix shown in Figure 1 provides some insights. We analysed how the models performed on each of the 16 classes and found that the class "out-of-patient" is particularly problematic, since it has only 4 instances in the development set and 5 instances in the test set. Furthermore, none of our models was able to classify these 5 instances correctly, which barely affects accuracy but degrades the F1-score. In the future, data augmentation should be performed to improve this class; a sketch of such a step follows.
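The paper leaves augmentation as future work; the sketch below is one hedged illustration, assuming Keras' ImageDataGenerator, with transformations chosen by us rather than by the paper.

```python
# A minimal sketch of data augmentation for the rare "out-of-patient" class,
# assuming Keras; the transformations are illustrative, not the paper's choices.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,     # endoscopic frames have no canonical orientation
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.1,
)

# flow() would yield augmented copies of the few class instances, which could
# then be passed through the feature extractor as usual. The array name
# out_of_patient_images is hypothetical:
# batches = augmenter.flow(out_of_patient_images, batch_size=4)
```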
Figure 1: Confusion matrix of DenseNet201 on the test dataset with 8,740 images. (Figure not reproduced here.)

Another class we need to study is "esophagitis": 170 of its instances were classified as "normal-z-line", which accounts for 30.57% of the instances in this class. The classes "stool-plenty" and "colon-clear" contain most of the instances, and our models did a good job of classifying them correctly, which boosted the scores.

ACKNOWLEDGMENTS

We thank CAPES and CNPq (grant 400487/2016-0).

REFERENCES

[1] Taruna Agrawal, Rahul Gupta, Saurabh Sahu, and Carol Y. Espy-Wilson. 2017. SCL-UMD at the Medico Task-MediaEval 2017: Transfer Learning based Classification of Medical Images. In Proceedings of the MediaEval 2017 Workshop.
[2] François Chollet. 2016. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR abs/1610.02357. http://arxiv.org/abs/1610.02357
[3] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR abs/1704.04861. http://arxiv.org/abs/1704.04861
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385
[5] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. CoRR abs/1608.06993. http://arxiv.org/abs/1608.06993
[6] Mathias Lux and Savvas A. Chatzichristofis. 2008. Lire: Lucene Image Retrieval: An Extensible Java CBIR Library. In Proceedings of the 16th ACM International Conference on Multimedia (MM '08). ACM, New York, NY, USA, 1085–1088.
[7] Konstantin Pogorelov, Kristin Ranheim Randel, Thomas de Lange, Sigrun Losada Eskeland, Carsten Griwodz, Dag Johansen, Concetto Spampinato, Mario Taschwer, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Nerthus: A Bowel Preparation Quality Video Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17). ACM, New York, NY, USA, 170–174. https://doi.org/10.1145/3083187.3083216
[8] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17). ACM, New York, NY, USA, 164–169. https://doi.org/10.1145/3083187.3083212
[9] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018. In Proceedings of the MediaEval 2018 Workshop.
[10] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556. http://arxiv.org/abs/1409.1556
[11] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. CoRR abs/1602.07261. http://arxiv.org/abs/1602.07261
[12] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. CoRR abs/1512.00567. http://arxiv.org/abs/1512.00567