SCL-UMD at the Medico Task - MediaEval 2017: Transfer Learning Based Classification of Medical Images

Taruna Agrawal*, Rahul Gupta*, Saurabh Sahu+, Carol Espy-Wilson+
+ Speech and Communication Lab, University of Maryland, College Park
* Independent authors.
taruna3@gmail.com, rahul.1987iit@gmail.com, ssahu89@umd.edu, espy@isr.umd.edu

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT

Detecting landmarks in medical images can aid medical diagnosis and is a widely researched problem. The Medico task at MediaEval 2017 addresses the problem of detecting gastrointestinal landmarks, taking into consideration both the amount of training data and the speed of the detection system. Since medical data is obtained from real-world patients, access to large amounts of data for training models can be restricted. We therefore focus on a transfer learning approach, in which we borrow image representations yielded by other image classification/detection systems and train a supervised learning scheme on the available annotated medical data. We borrow state-of-the-art deep learning classification models (the VGGNet and Inception-V3 networks) to obtain representations for the medical images and use them in addition to the provided set of features. A joint model trained on all of these features yields a Matthews Correlation Coefficient (MCC) of 0.826, with accuracy and F1-score values of 0.961 and 0.847, respectively.

KEYWORDS

Image classification, transfer learning, Convolutional Neural Networks.

1 INTRODUCTION

The Medico task addresses the problem of detecting diseases based on image signals from the gastrointestinal (GI) tract. The goal of the task is to advance the application of machine learning tools within the medical domain, with a specific focus on the detection of GI landmarks from images. Our approach to this task involves leveraging established frameworks for the detection of real-world objects in images. Specifically, we borrow state-of-the-art deep learning models for object classification to aid the classification of medical images. Models such as VGGNet [16] and Inception-V3 [19] contain several convolutional, pooling and fully connected layers and are typically trained on large datasets. Training on these datasets yields models that can capture various geometric patterns in the input images and translate them into feature vectors that are then consumed by a final soft-max layer for class prediction. We aim to harness the capability of such deep networks by retaining the initial convolution filters and pooling layers of these networks and extracting the representations they yield toward their final layers. This approach can be particularly useful in cases with a limited amount of training data. Since medical-domain data is often obtained from real-world patients, training models with limited resources is a requirement. We motivate our approach through a discussion of related work in the next section, followed by descriptions of the database and the methodology.

2 RELATED WORK

Transfer learning involves borrowing knowledge from related domains to aid classification in a domain of interest [9]. It has been successfully applied in tasks such as human behavioral understanding [1, 8], developing deep architectures [2] and autonomous shaping [4]. Several recent works have leveraged advances in image recognition and detection techniques to improve different but related tasks. Shin et al. [15] provide an overview of CNN architectures, dataset characteristics and transfer learning. Li et al. [7] perform domain adaptation for object localization using VGGNet. Zheng et al. [21] provide good practices for CNN feature transfer, which we follow in our work. Other applications that have used VGGNet- and Inception-V3-based architectures include Alzheimer's disease classification [14], disambiguation for large-scale scene classification [20] and plant classification [6]. The success of these CNN-based transfer learning approaches inspires our experiments.

3 DATABASE

We use the dataset provided as part of the 2017 Multimedia for Medicine (Medico) task of the MediaEval benchmarking initiative [10, 13]. The dataset consists of 8000 images of the GI tract, annotated and verified by experienced medical doctors into eight different anatomical landmark classes. We use the suggested split of 4000 images as the training set and the remaining images as the test set. The training dataset contains a balanced number of instances per class. More details regarding the dataset can be found in [13].

4 METHODOLOGY

Deep learning models have achieved state-of-the-art performance in several image classification tasks. In particular, Convolutional Neural Networks (CNNs) have provided the best performances on tasks such as object classification [17], detection [12] and tracking [5]. Inspired by these developments, we obtain a set of features from popular CNN designs, in addition to the provided set of features. First, we describe the set of features used in our experiments, followed by the classification setup.

4.1 Features

We use an assembly of features provided as part of the challenge, as well as a few CNN-based features. We discuss these features below.

Baseline features. The task provides a set of features extracted from the images, such as Tamura, ColorLayout, EdgeHistogram and AutoColorCorrelogram. Each of these features is a global descriptor of the image. Note that these features quantify a specific property of each image, which may or may not be associated with the final classification task. CNN architectures, on the other hand, learn to extract features relevant to the task at hand, although those features may be hard to interpret. More details regarding these features can be found in the task paper [13].

VGGNet based features. We use the VGGNet [16] pre-trained on the ImageNet dataset [3] as a feature extractor. Since the ImageNet dataset contains a large number of training samples, we expect the VGGNet to be able to model a large variety of shape patterns in images, and we hypothesize that this characteristic can be useful in the Medico task. We use the 16-layer configuration of VGGNet (configuration D in Table 1 of [16]), pre-trained to predict the outcomes on the ImageNet dataset. After this pre-training, we provide the Medico task images as input to the trained VGGNet. Note that each image is of a different size and is rescaled to 224×224 in order to be fed to the network. We use the outputs of the first fully connected layer, which are 4096-dimensional, as features for the classification task at hand.

Inception-V3 features. Similar to the VGGNet based features, we extract features from the Inception-V3 network [19]. Inception-V3 consists of a stack of convolutional and pooling layers, and we obtain our features from the penultimate layer, which is 2048-dimensional. We again resize the Medico task images, in this case to 139×139 pixels.
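For concreteness, the following is a minimal sketch of how both extractions could be implemented. The paper does not state which software framework was used, so the Keras applications API, layer names and preprocessing calls below are illustrative assumptions rather than the authors' actual pipeline.

```python
# Sketch: transfer features from pre-trained CNNs (framework assumed,
# not stated in the paper).
from tensorflow.keras.applications import vgg16, inception_v3
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# VGG-16: keep the full ImageNet-trained network and tap the first fully
# connected layer ("fc1"), which yields a 4096-dimensional vector per image.
vgg = vgg16.VGG16(weights="imagenet")
vgg_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc1").output)

# Inception-V3: drop the classification head and average-pool the last
# convolutional block, which yields a 2048-dimensional vector per image.
inc_extractor = inception_v3.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(139, 139, 3))

def extract(path):
    """Return (vgg_4096, inception_2048) features for one image file."""
    # VGG-16 with its fully connected layers kept expects 224x224 inputs.
    x = image.img_to_array(image.load_img(path, target_size=(224, 224)))
    v = vgg_extractor.predict(vgg16.preprocess_input(x[None]))[0]
    # Inception-V3 receives the 139x139 rescaling described above.
    y = image.img_to_array(image.load_img(path, target_size=(139, 139)))
    i = inc_extractor.predict(inception_v3.preprocess_input(y[None]))[0]
    return v, i
```

With include_top=False and pooling="avg", Keras exposes exactly the 2048-dimensional penultimate representation described above, so no custom layer surgery is needed for Inception-V3.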
We next describe the classification setup used to predict the anatomical landmarks.

4.2 Classification setup

We evaluate three different classification setups, with different combinations of features. We train a multi-class Support Vector Machine (SVM) classifier on the following combinations:

• Baseline + Inception-V3 features
• Baseline + VGGNet features
• Baseline + Inception-V3 + VGGNet features

The hyper-parameters of the SVM classifier were tuned using a five-fold cross-validation framework on the training dataset. We tuned the SVM box-constraint parameter as well as the kernel. The linear kernel performs best, suggesting that a further non-linear transformation of the features is not required. A sketch of this tuning setup is given below; in the next section, we present the obtained results.
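The following approximates the setup above with scikit-learn. The paper specifies only that the box constraint and the kernel were tuned with five-fold cross-validation, so the grid values and the feature-scaling step are our assumptions.

```python
# Sketch: multi-class SVM over fused features, tuned with 5-fold CV.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm(X_train, y_train):
    """X_train: fused feature matrix; y_train: landmark class labels."""
    pipe = make_pipeline(StandardScaler(), SVC())
    grid = {
        "svc__C": [0.01, 0.1, 1.0, 10.0, 100.0],   # box constraint
        "svc__kernel": ["linear", "rbf", "poly"],  # kernel choice
    }
    search = GridSearchCV(pipe, grid, cv=5)        # five-fold CV
    search.fit(X_train, y_train)
    return search.best_estimator_

# Feature fusion here is simple concatenation of the three feature sets:
# X_train = np.hstack([baseline_feats, inception_feats, vgg_feats])
```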
5 RESULTS

We present our results for each set of features in Table 1. The evaluation metric used in the challenge is a multi-class generalization of the Matthews Correlation Coefficient (Rk); a short sketch of its computation is given at the end of this section. From the results, we observe that the combination including all sets of features performs best. Since the evaluation metric also takes into account the amount of training data used, we consistently use only 3200 of the 4000 training samples.

Table 1: Results obtained using various feature sets. A description of the metrics can be found in [13].

Features used                        Rk      Accuracy   F1-score
Baseline + Inception-V3              0.816   0.959      0.838
Baseline + VGGNet                    0.785   0.953      0.812
Baseline + Inception-V3 + VGGNet     0.826   0.961      0.847

We also provide the class-wise confusion matrix in Table 2. We observe that most of the confusion lies between the classes Normal-z-line and Esophagitis. We aim to investigate this class confusion in the future to reduce the error rate.

Table 2: Confusion matrix for the best performing system using all the features (rows: true class, columns: predictions 1-8).

True class                  1     2     3     4     5     6     7     8
1: Polyps                 448    13     1     0     0     0    23    33
2: Normal-cecum            22   478     0     0     0     0     0    27
3: Normal-z-line            0     0   427     8   202     0     0     0
4: Normal-pylorus           4     0     5   480     5     0     0     0
5: Esophagitis              0     0    67    10   293     0     0     2
6: Dyed-resection-margins   0     0     0     0     0   406    55     1
7: Dyed-lifted-polyps       2     0     0     0     0    94   421     0
8: Ulcerative-colitis      24     9     0     2     0     0     1   437

In order to further understand the complementarity of the three feature sets used in our experiments, we performed another cross-validation experiment in which we used only one of the baseline, Inception-V3 based and VGGNet based feature sets at a time and evaluated its performance. We observed that the Inception-V3 and VGGNet based features outperform the baseline features, indicating that these CNN-based features capture better representations of the images. This may be because they are trained on a larger (albeit mismatched) corpus and can model a larger number of geometric shapes in the images, as compared to the baseline features.
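As a pointer for reproducing the evaluation, scikit-learn's matthews_corrcoef implements the multi-class generalization of the MCC (the Rk statistic); this is our reading of the challenge metric, not the official scoring tool.

```python
# Sketch: multi-class MCC (Rk) on toy predictions; the actual task has
# eight landmark classes rather than three.
from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(matthews_corrcoef(y_true, y_pred))  # 1.0 = perfect, 0.0 = chance level
```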
6 CONCLUSION

The Medico task at the MediaEval 2017 challenge addresses the problem of detecting GI landmarks from images. The task focuses on training with a limited amount of data and on fast evaluation. We address this problem by adopting a transfer learning method, borrowing pre-trained CNN architectures that have been successfully applied to other image detection and classification problems. We use features extracted from the VGGNet and Inception-V3 models and train a supervised algorithm on them along with the provided baseline features. With only 3200 training samples, we obtain an MCC value of 0.826.

In the future, we aim to add more sources of transfer learning to this task. Further modifications have recently been proposed to deep CNN architectures, such as GoogLeNet [18], ResNet and generative adversarial networks [11]. We also aim to experiment with ensemble methods to fuse the predictions from these network-based features, alongside the simple feature fusion used in this paper. Since each of the CNN architectures carries a different methodology for convolution and pooling, we also aim to understand the discriminative power of each of these feature sets independently. Finally, we aim to extend the proposed methodology to more image classification tasks within the medical domain.

REFERENCES

[1] Sabyasachee Baruah, Rahul Gupta, and Shrikanth S. Narayanan. 2017. A Knowledge Transfer and Boosting Approach to the Prediction of Affect in Movies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Yoshua Bengio and others. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.
[4] George Konidaris and Andrew Barto. 2006. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 489–496.
[5] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernández, Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. 2015. The visual object tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1–23.
[6] Sue Han Lee, Yang Loong Chang, Chee Seng Chan, and Paolo Remagnino. 2016. Plant Identification System based on a Convolutional Neural Network for the LifeCLEF 2016 Plant Classification Task. In CLEF (Working Notes). 502–510.
[7] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. 2016. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3512–3520.
[8] Qinyi Luo, Rahul Gupta, and Shrikanth Narayanan. 2017. Transfer Learning between Concepts for Human Behavior Modeling: An Application to Sincerity and Deception Prediction. In Interspeech.
[9] Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
[10] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, and others. 2017. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM Multimedia Systems Conference. ACM, 164–169.
[11] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[13] Michael Riegler, Konstantin Pogorelov, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Losada Eskeland, Duc-Tien Dang-Nguyen, Mathias Lux, and Concetto Spampinato. 2017. Multimedia for Medicine: The Medico Task at MediaEval 2017. In MediaEval, 13–15 September 2017, Dublin, Ireland.
[14] Saman Sarraf, John Anderson, Ghassem Tofighi, and others. 2016. DeepAD: Alzheimer's Disease Classification via Deep Convolutional Neural Networks using MRI and fMRI. bioRxiv (2016), 070441.
[15] Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Summers. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging 35, 5 (2016), 1285–1298.
[16] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[17] Richard Socher, Brody Huval, Bharath Bhat, Christopher D. Manning, and Andrew Y. Ng. 2012. Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems. 656–664.
[18] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
[20] Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, and Yu Qiao. 2017. Knowledge guided disambiguation for large-scale scene classification with multi-resolution CNNs. IEEE Transactions on Image Processing 26, 4 (2017), 2055–2068.
[21] Liang Zheng, Yali Zhao, Shengjin Wang, Jingdong Wang, and Qi Tian. 2016. Good practice in CNN feature transfer. arXiv preprint arXiv:1604.00133 (2016).