SCL-UMD at the Medico Task - MediaEval 2017: Transfer Learning Based Classification of Medical Images

Taruna Agrawal*, Rahul Gupta*, Saurabh Sahu+, Carol Espy-Wilson+
+ Speech and Communication Lab, University of Maryland, College Park
* Independent authors.
taruna3@gmail.com, rahul.1987iit@gmail.com, ssahu89@umd.edu, espy@isr.umd.edu

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT

Detecting landmarks in medical images can aid medical diagnosis and is a widely researched problem. The Medico task at MediaEval 2017 addresses the problem of detecting gastrointestinal landmarks, taking into consideration both the amount of training data and the speed of the detection system. Since medical data is obtained from real-world patients, access to large amounts of data for training models can be restricted. We therefore focus on a transfer learning approach, in which we borrow image representations yielded by other image classification/detection systems and train a supervised learning scheme on the available annotated medical data. We borrow state-of-the-art deep learning classification models (the VGGNet and Inception-V3 networks) to obtain representations for the medical images and use them in addition to the provided set of features. A joint model trained on all of these features yields a Matthews Correlation Coefficient (MCC) of 0.826, with accuracy and F1-score values of 0.961 and 0.847, respectively.

KEYWORDS

Image classification, transfer learning, Convolutional Neural Networks.

1 INTRODUCTION

The Medico task addresses the problem of detecting diseases based on image signals from the gastrointestinal (GI) tract. The goal of the task is to advance the application of machine learning tools within the medical domain, with a specific focus on the detection of GI landmarks from images. Our approach to this task involves leveraging established frameworks for the detection of real-world objects in images. Specifically, we borrow state-of-the-art deep learning models for object classification to aid the classification of medical images. Models such as VGGNet [16] and Inception-V3 [19] contain several convolutional, pooling and fully connected layers and are typically trained on large datasets. Training on these datasets yields models that can capture various geometric patterns in the input images and translate them into feature vectors that are then consumed by a final soft-max layer for class prediction. We aim to harness the capability of such deep networks by retaining the initial convolution filters and pooling layers of these networks and extracting the representations they yield toward their final layers. This approach can be particularly useful in cases with a limited amount of training data. Since medical-domain data is often obtained from real-world patients, training models with limited resources is a requirement. We motivate our approach through a discussion of related work in the next section, followed by descriptions of the database and the methodology.

2 RELATED WORK

Transfer learning involves borrowing knowledge from related domains to aid classification in a domain of interest [9]. It has been successfully applied in tasks such as human behavioral understanding [1, 8], developing deep architectures [2] and autonomous shaping [4]. Several recent works have leveraged advances in image recognition and detection techniques to improve different but related tasks. Shin et al. [15] provide an overview of CNN architectures, dataset characteristics and transfer learning. Li et al. [7] perform domain adaptation for object localization using VGGNet. Zheng et al. [21] provide good practices for CNN feature transfer, which we follow in our work. Other applications that have used VGGNet- and Inception-V3-based architectures include Alzheimer's disease classification [14], disambiguation for large-scale scene classification [20] and plant classification [6]. The success of these CNN-based transfer learning approaches inspires our experiments.

3 DATABASE

We use the dataset provided as part of the 2017 Multimedia for Medicine (Medico) task of the MediaEval benchmarking initiative [10, 13]. The dataset consists of 8000 images of the GI tract, annotated and verified by experienced medical doctors into eight different anatomical landmark classes. We use the suggested split of 4000 images as the training set and the remaining images as the test set. The training dataset contains a balanced number of instances per class. More details regarding the dataset can be found in [13].

4 METHODOLOGY

Deep learning models have achieved state-of-the-art performance in several image classification tasks. In particular, Convolutional Neural Networks (CNNs) have provided the best performances on tasks such as object classification [17], detection [12] and tracking [5]. Inspired by these developments, we obtain a set of features from popular CNN designs, in addition to the provided set of features. First, we describe the set of features used in our experiments, followed by the classification setup.

4.1 Features

We use an assembly of features provided as part of the challenge, as well as a few CNN-based features. We discuss these features below.

Baseline features. The task provides a set of features extracted from the images, such as Tamura, ColorLayout, EdgeHistogram and AutoColorCorrelogram. Each of these features is a global descriptor of the image. Note that these features quantify a specific property of each image, which may or may not be associated with the final classification task. CNN architectures, on the other hand, learn to extract features relevant to the task at hand, although those features may be hard to interpret. More details regarding these features can be found in the task paper [13].

VGGNet based features. We use the VGGNet [16] pre-trained on the ImageNet dataset [3] as a feature extractor. Since the ImageNet dataset contains a large number of training samples, we expect the VGGNet to be able to model a large variety of shape patterns in images, and we hypothesize that this characteristic can be useful in the Medico task. We use the 16-layer configuration of VGGNet (configuration D in Table 1 of [16]), pre-trained to predict the outcomes on the ImageNet dataset. After this pre-training, we provide the Medico task images as input to the trained VGGNet. Note that each image is of a different size and is rescaled to 224×224 in order to be fed to the network. We use the outputs of the first fully connected layer, which are 4096-dimensional, as features for the classification task at hand.

Inception-V3 features. Similar to the VGGNet based features, we extract features from the Inception-V3 network [19]. Inception-V3 consists of a stack of convolutional and pooling layers, and we obtain our features from the penultimate layer, which is 2048-dimensional. We again resize the Medico task images, in this case to 139×139 pixels.
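For concreteness, the following is a minimal sketch of how both extractions could be implemented. The paper does not state which software framework was used, so the Keras applications API, layer names and preprocessing calls below are illustrative assumptions rather than the authors' actual pipeline.

```python
# Sketch: transfer features from pre-trained CNNs (framework assumed,
# not stated in the paper).
from tensorflow.keras.applications import vgg16, inception_v3
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# VGG-16: keep the full ImageNet-trained network and tap the first fully
# connected layer ("fc1"), which yields a 4096-dimensional vector per image.
vgg = vgg16.VGG16(weights="imagenet")
vgg_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc1").output)

# Inception-V3: drop the classification head and average-pool the last
# convolutional block, which yields a 2048-dimensional vector per image.
inc_extractor = inception_v3.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(139, 139, 3))

def extract(path):
    """Return (vgg_4096, inception_2048) features for one image file."""
    # VGG-16 with its fully connected layers kept expects 224x224 inputs.
    x = image.img_to_array(image.load_img(path, target_size=(224, 224)))
    v = vgg_extractor.predict(vgg16.preprocess_input(x[None]))[0]
    # Inception-V3 receives the 139x139 rescaling described above.
    y = image.img_to_array(image.load_img(path, target_size=(139, 139)))
    i = inc_extractor.predict(inception_v3.preprocess_input(y[None]))[0]
    return v, i
```

With include_top=False and pooling="avg", Keras exposes exactly the 2048-dimensional penultimate representation described above, so no custom layer surgery is needed for Inception-V3.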
We next describe the classification setup used to predict the anatomical landmarks.

4.2 Classification setup

We evaluate three different classification setups, with different combinations of features. We train a multi-class Support Vector Machine (SVM) classifier on the following combinations:

• Baseline + Inception-V3 features
• Baseline + VGGNet features
• Baseline + Inception-V3 + VGGNet features

The hyper-parameters of the SVM classifier were tuned using a five-fold cross-validation framework on the training dataset. We tuned the SVM box-constraint parameter as well as the kernel. The linear kernel performs best, suggesting that a further non-linear transformation of the features is not required. A sketch of this tuning setup is given below; in the next section, we present the obtained results.
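The following approximates the setup above with scikit-learn. The paper specifies only that the box constraint and the kernel were tuned with five-fold cross-validation, so the grid values and the feature-scaling step are our assumptions.

```python
# Sketch: multi-class SVM over fused features, tuned with 5-fold CV.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm(X_train, y_train):
    """X_train: fused feature matrix; y_train: landmark class labels."""
    pipe = make_pipeline(StandardScaler(), SVC())
    grid = {
        "svc__C": [0.01, 0.1, 1.0, 10.0, 100.0],   # box constraint
        "svc__kernel": ["linear", "rbf", "poly"],  # kernel choice
    }
    search = GridSearchCV(pipe, grid, cv=5)        # five-fold CV
    search.fit(X_train, y_train)
    return search.best_estimator_

# Feature fusion here is simple concatenation of the three feature sets:
# X_train = np.hstack([baseline_feats, inception_feats, vgg_feats])
```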
5 RESULTS

We present our results for each set of features in Table 1. The evaluation metric used in the challenge is a multi-class generalization of the Matthews Correlation Coefficient (Rk); a short sketch of its computation is given at the end of this section. From the results, we observe that the combination including all sets of features performs best. Since the evaluation metric also takes into account the amount of training data used, we consistently use only 3200 of the 4000 training samples.

Table 1: Results obtained using various feature sets. A description of the metrics can be found in [13].

Features used                        Rk      Accuracy   F1-score
Baseline + Inception-V3              0.816   0.959      0.838
Baseline + VGGNet                    0.785   0.953      0.812
Baseline + Inception-V3 + VGGNet     0.826   0.961      0.847

We also provide the class-wise confusion matrix in Table 2. We observe that most of the confusion lies between the classes Normal-z-line and Esophagitis. We aim to investigate this class confusion in the future to reduce the error rate.

Table 2: Confusion matrix for the best performing system using all the features (rows: true class, columns: predictions 1-8).

True class                  1     2     3     4     5     6     7     8
1: Polyps                 448    13     1     0     0     0    23    33
2: Normal-cecum            22   478     0     0     0     0     0    27
3: Normal-z-line            0     0   427     8   202     0     0     0
4: Normal-pylorus           4     0     5   480     5     0     0     0
5: Esophagitis              0     0    67    10   293     0     0     2
6: Dyed-resection-margins   0     0     0     0     0   406    55     1
7: Dyed-lifted-polyps       2     0     0     0     0    94   421     0
8: Ulcerative-colitis      24     9     0     2     0     0     1   437

In order to further understand the complementarity of the three feature sets used in our experiments, we performed another cross-validation experiment in which we used only one of the baseline, Inception-V3 based and VGGNet based feature sets at a time and evaluated its performance. We observed that the Inception-V3 and VGGNet based features outperform the baseline features, indicating that these CNN-based features capture better representations of the images. This may be because they are trained on a larger (albeit mismatched) corpus and can model a larger number of geometric shapes in the images, as compared to the baseline features.
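As a pointer for reproducing the evaluation, scikit-learn's matthews_corrcoef implements the multi-class generalization of the MCC (the Rk statistic); this is our reading of the challenge metric, not the official scoring tool.

```python
# Sketch: multi-class MCC (Rk) on toy predictions; the actual task has
# eight landmark classes rather than three.
from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(matthews_corrcoef(y_true, y_pred))  # 1.0 = perfect, 0.0 = chance level
```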
6 CONCLUSION

The Medico task at the MediaEval 2017 challenge addresses the problem of detecting GI landmarks from images. The task focuses on training with a limited amount of data and on fast evaluation. We address this problem by adopting a transfer learning method, borrowing pre-trained CNN architectures that have been successfully applied to other image detection and classification problems. We use features extracted from the VGGNet and Inception-V3 models and train a supervised algorithm on them along with the provided baseline features. With only 3200 training samples, we obtain an MCC value of 0.826.

In the future, we aim to add more sources of transfer learning to this task. Further modifications have recently been proposed to deep CNN architectures, such as GoogLeNet [18], ResNet and generative adversarial networks [11]. We also aim to experiment with ensemble methods to fuse the predictions from these network-based features, alongside the simple feature fusion used in this paper. Since each of the CNN architectures carries a different methodology for convolution and pooling, we also aim to understand the discriminative power of each of these feature sets independently. Finally, we aim to extend the proposed methodology to more image classification tasks within the medical domain.

REFERENCES

[1] Sabyasachee Baruah, Rahul Gupta, and Shrikanth S. Narayanan. 2017. A Knowledge Transfer and Boosting Approach to the Prediction of Affect in Movies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] Yoshua Bengio and others. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.
[4] George Konidaris and Andrew Barto. 2006. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 489–496.
[5] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernández, Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. 2015. The visual object tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1–23.
[6] Sue Han Lee, Yang Loong Chang, Chee Seng Chan, and Paolo Remagnino. 2016. Plant Identification System based on a Convolutional Neural Network for the LifeCLEF 2016 Plant Classification Task. In CLEF (Working Notes). 502–510.
[7] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. 2016. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3512–3520.
[8] Qinyi Luo, Rahul Gupta, and Shrikanth Narayanan. 2017. Transfer Learning between Concepts for Human Behavior Modeling: An Application to Sincerity and Deception Prediction. In Interspeech.
[9] Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
[10] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, and others. 2017. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM Multimedia Systems Conference. ACM, 164–169.
[11] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91–99.
[13] Michael Riegler, Konstantin Pogorelov, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Losada Eskeland, Duc-Tien Dang-Nguyen, Mathias Lux, and Concetto Spampinato. 2017. Multimedia for Medicine: The Medico Task at MediaEval 2017. In MediaEval, 13–15 September 2017, Dublin, Ireland.
[14] Saman Sarraf, John Anderson, Ghassem Tofighi, and others. 2016. DeepAD: Alzheimer's Disease Classification via Deep Convolutional Neural Networks using MRI and fMRI. bioRxiv (2016), 070441.
[15] Hoo-Chang Shin, Holger R. Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M. Summers. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging 35, 5 (2016), 1285–1298.
[16] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[17] Richard Socher, Brody Huval, Bharath Bhat, Christopher D. Manning, and Andrew Y. Ng. 2012. Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems. 656–664.
[18] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[19] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
[20] Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, and Yu Qiao. 2017. Knowledge guided disambiguation for large-scale scene classification with multi-resolution CNNs. IEEE Transactions on Image Processing 26, 4 (2017), 2055–2068.
[21] Liang Zheng, Yali Zhao, Shengjin Wang, Jingdong Wang, and Qi Tian. 2016. Good practice in CNN feature transfer. arXiv preprint arXiv:1604.00133 (2016).