An Application of Residual Network and Faster R-CNN for Medico: Multimedia Task at MediaEval 2018

Trung-Hieu Hoang¹, Hai-Dang Nguyen², Thanh-An Nguyen¹, Vinh-Tiep Nguyen³, Minh-Triet Tran¹
¹ Faculty of Information Technology, University of Science, VNU-HCM, Vietnam
² Eurecom, France
³ University of Information Technology, VNU-HCM, Vietnam
{hthieu,ntan}@selab.hcmus.edu.vn, nguyenhd@eurecom.fr, tiepnv@uit.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
The Medico: Multimedia Task focuses on developing an efficient framework for predicting and classifying abnormalities in endoscopic images of the gastrointestinal (GI) tract. We present the HCMUS team's approach, which employs a combination of a Residual Neural Network and a Faster R-CNN model to classify endoscopic images. We submit multiple runs with different modifications of the parameters in our combined model. Our methods show promising results in our experiments.

1 INTRODUCTION
The Medico: Multimedia Task at MediaEval 2018 [4] aims to bring new achievements in computer vision, image processing and machine learning to the next level of computer- and multimedia-assisted diagnosis. The goal of the challenge is to predict abnormalities and diseases efficiently with as little training data as possible [5]. The task organizers also provide a priority list for the classes to accommodate the single-class classification challenge. This leads to some modifications of our model, which are described in detail in Section 3.

In our approach, we introduce a stacked model consisting of two deep networks: a Residual Neural Network (Resnet) [2] followed by a Faster Region-based Convolutional Neural Network (Faster R-CNN) [7]. Since Resnet mostly focuses on deep global features of an image, it fails to classify images in which symptoms of diseases or instruments appear as small objects on diverse backgrounds. This is the reason for using Faster R-CNN to re-classify the images of the classes that Resnet usually misclassifies.

2 RELATED WORK
In the field of medical image processing, deep neural networks have been used to solve several problems related to endoscopic images of the gastrointestinal (GI) tract. In particular, for localizing and identifying polyps under real-time constraints, deep CNNs have recently shown impressive potential, achieving up to 96.4% accuracy, as published in 2018 by Urban et al. [9]. Another interesting article, by Shichijo et al. [8], applies multiple deep CNNs to diagnose Helicobacter pylori gastritis based on endoscopic images. Further, gastrointestinal bleeding detection using deep CNNs on endoscopic images has been successfully demonstrated by Jia et al. [3].

3 APPROACH
3.1 Dataset Preparation
3.1.1 Disease region localization. For the Faster R-CNN model to be trained, objects in the image have to be tagged with bounding boxes and passed to the model as input. We annotate the signals of disease in all images of the following classes: dyed-resection-margins, dyed-lifted-polyps, instruments and polyps.

3.1.2 Re-labeling the Medico development dataset. After training with the development set, we find some training samples whose labels are inappropriate according to the priority list. Therefore, so that our model learns with the least confusion, we apply new labels, predicted by the trained model, to these images.

3.1.3 Instruments dataset augmentation. Instruments, the second-highest-priority class, has only 36 images in the development set, with limited background context. To maintain the balance between all of the classes and to improve the diversity of the instruments images, we generate more images for this class from the given development set by placing instruments in the foreground of other diseases' backgrounds. Among the 36 instruments images, we carefully select 24 and crop the instruments along their edges. Then, we randomly select 20% of the images from the dyed-lifted-polyps, dyed-resection-margins and ulcerative-colitis classes, and use them as backgrounds for the cropped instruments. With this method, we are able to generate more than 800 images for the instruments class.
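The following sketch illustrates the cut-and-paste augmentation described in 3.1.3. It is a minimal Python example rather than the authors' actual pipeline: the directory layout, the scale range and the assumption that the cropped instruments are stored as RGBA cut-outs with an alpha channel are all illustrative.

```python
# Sketch of the instrument cut-and-paste augmentation (Section 3.1.3).
# Paths and parameters are illustrative assumptions. Requires Pillow.
import random
from pathlib import Path
from PIL import Image

INSTRUMENT_DIR = Path("crops/instruments")   # 24 hand-cropped RGBA cut-outs
BACKGROUND_DIR = Path("backgrounds")         # 20% sample of the three classes
OUTPUT_DIR = Path("augmented/instruments")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

crops = sorted(INSTRUMENT_DIR.glob("*.png"))
backgrounds = sorted(BACKGROUND_DIR.glob("*.jpg"))

for i in range(800):  # the paper reports generating 800+ composites
    bg = Image.open(random.choice(backgrounds)).convert("RGB")
    fg = Image.open(random.choice(crops)).convert("RGBA")

    # Randomly rescale the instrument relative to the background width.
    scale = random.uniform(0.3, 0.7)
    w = int(bg.width * scale)
    h = int(fg.height * w / fg.width)
    fg = fg.resize((w, h))

    # Paste at a random position, using the alpha channel as the mask
    # so only the instrument pixels overwrite the background.
    x = random.randint(0, max(0, bg.width - w))
    y = random.randint(0, max(0, bg.height - h))
    bg.paste(fg, (x, y), mask=fg)

    bg.save(OUTPUT_DIR / f"instrument_aug_{i:04d}.jpg")
```

Compositing through the alpha mask keeps the surrounding disease background intact, which is how roughly 24 crops combined with a 20% background sample can expand to more than 800 images.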
3.2 Method
3.2.1 Fine-tuning deep neural networks for medical images. In our approach, both a Residual Network with 101 layers and a Faster R-CNN [1] (both pre-trained on ImageNet) are fine-tuned on our modified development dataset. When using convolutional neural networks for medical images, transferring knowledge from natural images to medical images is possible, even though there is a large difference between the source and target domains. It is especially useful in the case of the small image dataset provided [6]. Our experimental results also support this idea: fine-tuning the ImageNet pre-trained model significantly improves the efficiency of the classification model.

3.2.2 First run. The Residual network with 101 layers is fine-tuned on the original development set provided by the task organizers, together with our augmented instruments dataset. After passing through Resnet101, images classified into certain special classes become the input of the Faster R-CNN network, which is trained to detect instruments in images. Two cases are handled, as shown in the sketch after this list:
• First case: Images predicted as instruments by Resnet101 are double-checked. If Faster R-CNN does not detect instruments in those images, they are re-labeled with the class that has the second-highest score proposed by Resnet101.
• Second case: Images predicted as dyed-lifted-polyps, dyed-resection-margins or ulcerative-colitis by Resnet101 are fed through the Faster R-CNN network to detect instruments. They are classified as instruments if one is detected; otherwise they keep the original prediction.
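To make the two-stage decision rule concrete, here is a schematic sketch. The functions `resnet_probs` and `detect_instruments` are hypothetical stand-ins for the fine-tuned Resnet101 classifier and the instrument-trained Faster R-CNN detector; only the branching logic follows the paper.

```python
# Schematic sketch of the Run01 decision rule (Section 3.2.2).
# `resnet_probs(image)` is assumed to return a dict mapping the 16
# class names to softmax scores; `detect_instruments(image)` is
# assumed to return True iff Faster R-CNN finds an instrument box.

DOUBLE_CHECK = {"dyed-lifted-polyps", "dyed-resection-margins",
                "ulcerative-colitis"}

def classify(image, resnet_probs, detect_instruments):
    probs = resnet_probs(image)
    ranked = sorted(probs, key=probs.get, reverse=True)
    label = ranked[0]

    # First case: verify an "instruments" prediction with the detector;
    # if no instrument is found, fall back to the second-best class.
    if label == "instruments":
        return label if detect_instruments(image) else ranked[1]

    # Second case: classes that often co-occur with instruments are
    # re-checked; a positive detection overrides Resnet101's label.
    if label in DOUBLE_CHECK and detect_instruments(image):
        return "instruments"

    return label
```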
3.2.3 Second run. Feeding a large number of images from the three classes through Faster R-CNN creates a bottleneck in inference time, as Faster R-CNN has high time complexity. Therefore, in this second run, we limit the images passed through Faster R-CNN by performing only the first case of the first run.

3.2.4 Third run. The configuration of the third run is the same as the second run. Instead of the original training set mentioned in the first run, we train our model on the re-labeled development set combined with the augmented instruments set.

3.2.5 Fourth run. In this run, we reduce the number of images used for training by randomly selecting 75% of the images of each class from the same training set as the third run. The other processing steps are configured in the same way.

3.2.6 Fifth run. Throughout our experiments, normal-z-line and esophagitis are the most confusing classes to distinguish, not only for Resnet101 but also for humans. In the priority list, esophagitis has a higher rank than normal-z-line. Thus, after evaluating our model on the development dataset several times, we propose a condition for these two classes when they are predicted by Resnet101. As Resnet101 provides a probability distribution over the 16 classes for each image, whenever normal-z-line appears as the highest-scoring class, we add a small bias of 0.3 to the probability of esophagitis. Hence, the model is more likely to emit the esophagitis class. Intuitively, this means that our model prefers esophagitis over normal-z-line when it is confused between these two classes.
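A minimal sketch of the fifth run's bias condition follows, assuming `probs` is Resnet101's softmax output as a mapping from the 16 class names to scores; the 0.3 bias is the value reported above.

```python
# Sketch of the Run05 esophagitis bias (Section 3.2.6). `probs` is
# assumed to map the 16 class names to Resnet101 softmax scores.
ESOPHAGITIS_BIAS = 0.3  # bias value reported in the paper

def apply_esophagitis_bias(probs):
    biased = dict(probs)
    top = max(biased, key=biased.get)
    # Intervene only when normal-z-line wins the argmax: boost the
    # higher-priority esophagitis class so near-ties flip toward it.
    if top == "normal-z-line":
        biased["esophagitis"] += ESOPHAGITIS_BIAS
    return max(biased, key=biased.get)
```

Because the bias is added only after normal-z-line has won the argmax, predictions for all other classes are left untouched.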
4 RESULTS
Table 1: Official evaluation results for both sub-tasks (provided by the organizers) and speed in frames per second (FPS) on a Tesla K80 GPU; metric values are in percent.

RunID   PREC    REC     ACC     F1      MCC     RK      FPS
Run01   94.245  94.245  99.281  94.245  93.861  93.590   6.589
Run02   93.959  93.959  99.245  93.959  93.556  93.273  23.191
Run03   94.600  94.600  99.325  94.600  94.240  93.987  23.148
Run04   93.043  93.043  99.130  93.043  92.579  92.257  22.654
Run05   94.508  94.508  99.314  94.508  94.142  93.884  21.413

[Figure 1: Confusion matrix of our best run (Run03).]

There is a trade-off between speed and accuracy when comparing the results of Run01 and Run02. In Run02, we reduce the large number of images passing through Faster R-CNN for the sake of time, so its performance is slightly worse than Run01's.

As we mentioned earlier in Section 3, data pre-processing plays an important role in building a deep neural network model. Throughout our experiments, in the case of little training data, the augmented dataset helps us improve the performance of the deep neural network model. Run03 and Run05 show impressive results compared to the first two runs. This implies that training on our re-labeled development set provides better models.

On the other hand, the Residual neural network cannot efficiently separate the two classes esophagitis and normal-z-line. The same problem also occurs between the dyed-resection-margins and dyed-lifted-polyps classes. This can be observed in the confusion matrices of the two pairs (Figure 1). These two confusions are the main sources of negative impact on our results. Additionally, as mentioned in Section 3, the configuration of Run05 intuitively prefers esophagitis over normal-z-line, which may lead to an increase in false-positive cases in the result.

Compared to the others, Run04 has the lowest precision since it uses only 75% of the training data. Decreasing the number of training samples naturally affects the performance of deep-learning models. Nevertheless, the result is still acceptable: it drops by only a few percentage points while the configuration is the same as Run03's. This suggests that we could even reduce the data by up to 50% when shorter training time is preferred over accuracy.

5 CONCLUSION AND FUTURE WORKS
Medico image classification is a challenging problem because of fine-grained images, little training data and the requirement of high accuracy. In our current approach, we focus on training a combination of a Residual Neural Network and Faster R-CNN with different modifications of the training set. Additionally, an object detection method is applied to detect small symptoms of diseases, which provide useful evidence for the classification task. The accuracy and inference time that we reach are acceptable and appropriate for real-time constraints. However, for future work, we need a more robust approach to exploit the distinction between easily confused classes, e.g., esophagitis and normal-z-line, or dyed-lifted-polyps and dyed-resection-margins.

REFERENCES
[1] Xinlei Chen and Abhinav Gupta. 2017. An Implementation of Faster RCNN with Study for Region Sampling. arXiv preprint arXiv:1702.02138 (2017).
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.90
[3] Xiao Jia and Max Q.-H. Meng. 2016. A deep convolutional neural network for bleeding detection in Wireless Capsule Endoscopy images. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). https://doi.org/10.1109/embc.2016.7590783
[4] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018. In Proceedings of the MediaEval 2018 Workshop.
[5] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17). ACM, New York, NY, USA, 164–169. https://doi.org/10.1145/3083187.3083212
[6] Adnan Qayyum, Syed Anwar, Muhammad Majid, Muhammad Awais, and Majdi Alnowami. 2017. Medical Image Analysis using Convolutional Neural Networks: A Review. 42 (Sept. 2017).
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 91–99.
[8] Satoki Shichijo, Shuhei Nomura, Kazuharu Aoyama, Yoshitaka Nishikawa, Motoi Miura, Takahide Shinagawa, Hirotoshi Takiyama, Tetsuya Tanimoto, Soichiro Ishihara, Keigo Matsuo, and Tomohiro Tada. 2017. Application of Convolutional Neural Networks in the Diagnosis of Helicobacter pylori Infection Based on Endoscopic Images. EBioMedicine 25 (Nov. 2017), 106–111. https://doi.org/10.1016/j.ebiom.2017.10.014
[9] Gregor Urban, Priyam Tripathi, Talal Alkayali, Mohit Mittal, Farid Jalali, William Karnes, and Pierre Baldi. 2018. Deep Learning Localizes and Identifies Polyps in Real Time With 96% Accuracy in Screening Colonoscopy. Gastroenterology 155, 4 (2018). https://doi.org/10.1053/j.gastro.2018.06.037