=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_16
|storemode=property
|title=Transfer Learning with CNN Architectures for Classifying Gastrointestinal Diseases and Anatomical Landmarks
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_16.pdf
|volume=Vol-2283
|authors=Danielle Dias,Ulisses Dias
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DiasD18
}}
==Transfer Learning with CNN Architectures for Classifying Gastrointestinal Diseases and Anatomical Landmarks==
Danielle Dias, Ulisses Dias
University of Campinas, Brazil
danielle.dias@ic.unicamp.br, ulisses@ft.unicamp.br

ABSTRACT

Transfer learning is an approach where a model trained for a given task is used as a starting point on a second task. Many advanced deep learning architectures have been pre-trained on ImageNet and are currently available, which makes this technique very popular. We evaluate 10 pre-trained architectures on the task of finding gastrointestinal diseases and anatomical landmarks in images collected in hospitals. Our analysis considers both processing time and accuracy. We also study whether global image features bring advantages to the pre-trained models for the problem of gastrointestinal medical image classification. Our best models achieved accuracy and F1-score values of 0.988 and 0.908, respectively. Our fastest model classifies an input instance in 0.037 seconds and yields accuracy and F1-score of 0.983 and 0.866, respectively.

1 INTRODUCTION

The Medico Task proposes the challenge of predicting diseases based on multimedia data collected in hospitals [7, 8]. The images are frames from videos captured by inserting a camera into the gastrointestinal tract. The main purpose is to identify anomalies that can be detected visually, even before they become symptomatic. More details can be found in the task overview [9].

To solve the task we created several models based on features extracted from deep convolutional architectures and on global image features. The deep architectures were trained on ImageNet. The strategy was to create three kinds of models: (i) those having only features extracted from deep architectures as input, (ii) those that considered only global image features, and (iii) those created with all features available.

The approach of extracting features from pre-trained models is usually referred to as transfer learning. It has become popular because many pre-trained models are readily available. We selected 10 architectures trained on ImageNet as extractors and compared their performance on classifying images for the Medico Task.

The architectures differ in several characteristics, which impacts the time needed to compute the features. Since efficiency is an important matter in this task, we measured the processing time for each test image and made efforts to reach a balance between solution quality and running time.

2 RELATED WORK

Transfer learning has been used for several problems in a number of domains. In 2017, Agrawal et al. [1] used transfer learning for the Medico Task when the challenge had 8 classes and achieved good results. They restricted the analysis to two architectures: InceptionV3 and VGGNet. They also analysed how a model performs if it uses features extracted from both architectures as input. The results were better only by a small factor, which leads us to believe that the gain is not worth the extra processing time. We extended the work of Agrawal et al. [1] by using 10 architectures and by considering both solution quality and efficiency in our analysis.

3 APPROACH

The development and test datasets contain 5,293 and 8,740 images, respectively. For each image, visual features were extracted and provided as feature vectors by the task organizers, namely: JCD, Tamura, ColorLayout, EdgeHistogram, AutoColorCorrelogram, and PHOG [6]. These feature vectors are sequences of floating point values, and their lengths sum to 1,185 per image. These values were joined to form a table used as input to our models, where rows represent images. We removed 19 columns because they either had the same value for all images or were duplicated.

We used 10 architectures trained on ImageNet: DenseNet121 [5], DenseNet169 [5], DenseNet201 [5], InceptionResNetV2 [11], InceptionV3 [12], MobileNet [3], ResNet [4], VGG16 [10], VGG19 [10], and Xception [2]. Each architecture requires a particular pre-processing step and returns a vector of floating point numbers. The vector sizes are: DenseNet121 (1024), DenseNet169 (1664), DenseNet201 (1920), InceptionResNetV2 (1536), InceptionV3 (2048), MobileNet (1024), ResNet (2048), VGG16 (512), VGG19 (512), and Xception (2048).
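The paper does not state which framework was used; the following is a minimal sketch of the feature-extraction step, assuming Keras with TensorFlow and using DenseNet121 (1024-value vectors) as an example extractor. The file path and input size are illustrative.

```python
# A minimal sketch of feature extraction with a pre-trained network,
# assuming Keras/TensorFlow; the paper does not name its framework.
import numpy as np
from tensorflow.keras.applications.densenet import DenseNet121, preprocess_input
from tensorflow.keras.preprocessing import image

# Pre-trained on ImageNet; dropping the classification top and average-pooling
# the last convolutional block yields the 1024-value vector reported above.
extractor = DenseNet121(weights="imagenet", include_top=False, pooling="avg")

def extract_features(path):
    # Each architecture expects its own input size and preprocess_input step.
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]  # shape: (1024,)
```

Swapping DenseNet121 for any of the other nine architectures changes only the import, the preprocessing function, and the expected input size.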
The input layer of each classifier has the same number of nodes as its feature vector. Thus, a model that uses only global features has an input layer with 1,166 nodes, and a model that combines global features with a given architecture as feature extractor has an input layer whose size is the feature vector length of that architecture plus 1,166.

The best model for most of the input features uses one hidden layer with 512 nodes, each using ReLU as the activation function. We added dropout of 50% during the training stage and L2 regularization to prevent overfitting; models with more layers tend to overfit very easily within just a few epochs. It would be possible to create simpler models for small feature vectors such as those of the VGG architectures, but we decided to report the same network for all input vectors for comparison purposes. The output layer has 16 nodes (one for each class) and uses softmax activation to classify the image into one of the classes.
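As a sketch of the classifier head just described, again assuming Keras; the L2 strength and the optimizer are assumptions, since the paper reports neither.

```python
# A minimal sketch of the classifier head: 512 ReLU units, 50% dropout,
# L2 regularization, and a 16-way softmax output, as described in the text.
from tensorflow.keras import Sequential, layers, regularizers

def build_classifier(input_dim, num_classes=16, weight_decay=1e-4):
    # weight_decay is an assumption; the paper only says L2 regularization was used.
    return Sequential([
        layers.Dense(512, activation="relu", input_shape=(input_dim,),
                     kernel_regularizer=regularizers.l2(weight_decay)),
        layers.Dropout(0.5),  # active during training only
        layers.Dense(num_classes, activation="softmax"),
    ])

# Example: DenseNet121 features (1024 values) plus the 1,166 global features.
model = build_classifier(1024 + 1166)
model.compile(optimizer="adam",  # optimizer choice is an assumption
              loss="categorical_crossentropy", metrics=["accuracy"])
```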
4 RESULTS AND ANALYSIS

During the training stage we split the development set of 5,293 images into train (3,038 images), validation (1,722 images), and test (531 images) sets. We used the train and validation sets to train and tune the classifiers. The test set was used only once, to generate the results in Table 1, after we were satisfied with the validation scores.

The model that uses only global image features yields an accuracy of 0.813 and an F1-score of 0.782, which we consider a baseline result. Table 1 summarizes the results using transfer learning on our test set of 531 images. We based our decision about which models to submit on the F1-score and accuracy metrics.

Table 1: F1-score and accuracy of the transfer learning models on our test set (531 images). The table also reports the average time (in seconds) to classify an image after the model is loaded in memory.

                             No Global Features   Global Features
Architecture      Time (s)   ACC      F1          ACC      F1
DenseNet121       0.163      0.904    0.856       0.915    0.868
DenseNet169       0.209      0.915    0.899       0.919    0.903
DenseNet201       0.242      0.923    0.905       0.925    0.871
InceptResnetV2    0.349      0.908    0.894       0.904    0.858
InceptionV3       0.213      0.883    0.876       0.889    0.845
Mobilenet         0.037      0.889    0.841       0.896    0.850
Resnet            0.115      0.909    0.916       0.894    0.883
VGG16             0.372      0.879    0.837       0.870    0.828
VGG19             0.406      0.881    0.867       0.866    0.858
Xception          0.257      0.877    0.835       0.898    0.855

Our first decision was to disregard the models that combine transfer learning with global features, because the improvement after adding the global features was negligible. For example, DenseNet201 showed only a small increase in accuracy, from 0.923 to 0.925; these were the best models by the accuracy metric. If we consider the F1-score, the models with global image features were even worse in several cases. Taking into account that the models with global features have 1,166 more inputs than their counterparts, we decided to keep the simpler models.

DenseNet201 and ResNet were selected as the two best models considering accuracy and F1-score, respectively. MobileNet was selected because it is remarkably fast, and efficiency is an important matter in this task. Indeed, we consider this model the best trade-off, since its accuracy of 0.889 is not far from the 0.923 of DenseNet201 (the best accuracy model) and it runs 6.5 times faster. DenseNet121 was the last model selected because it lies somewhere between the best accuracy model (DenseNet201) and the fastest model (MobileNet).

The selected models were submitted and evaluated against the task test dataset with 8,740 images. On this dataset, our results were much better than we anticipated. The results are shown in Table 2, where we also report the official competition ranking indicator Rk.

Table 2: F1-score, accuracy, and Rk of the selected models on the task test dataset (8,740 images).

Architecture     ACC      F1       Rk
DenseNet121      0.987    0.903    0.893
DenseNet201      0.988    0.908    0.898
Mobilenet        0.983    0.866    0.853
Resnet           0.988    0.906    0.896

ResNet and the DenseNet models achieved accuracy higher than 0.987. MobileNet yields an accuracy of 0.983, which is very close to the top accuracy of 0.988 achieved by DenseNet201 and ResNet. The F1-score shows that ResNet and DenseNet201 are the best models and that MobileNet is somewhat worse than the others. However, we believe MobileNet offers the best trade-off, considering that it returns a solution in well under a second. DenseNet121 does not appear to be a good choice because it is slower than ResNet and presents somewhat worse results; therefore, ResNet should be preferred over it in any situation. DenseNet201 reached the top accuracy and F1-score, but its margin over ResNet is so small that it is difficult to argue that it compensates for being twice as slow.

5 DISCUSSION AND OUTLOOK

We believe these models can be improved, and the confusion matrix shown in Figure 1 provides some insights. We analysed how the models performed on each of the 16 classes and found that the class "out-of-patient" is particularly problematic, since it has only 4 instances in the development set and 5 instances in the test set. Furthermore, none of our models was able to classify these 5 instances correctly, which barely affects accuracy but degrades the F1-score. In the future, data augmentation should be performed to improve this class; a sketch of such a step follows.
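The paper leaves augmentation as future work; the sketch below is one hedged illustration, assuming Keras' ImageDataGenerator, with transformations chosen by us rather than by the paper.

```python
# A minimal sketch of data augmentation for the rare "out-of-patient" class,
# assuming Keras; the transformations are illustrative, not the paper's choices.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,     # endoscopic frames have no canonical orientation
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.1,
)

# flow() would yield augmented copies of the few class instances, which could
# then be passed through the feature extractor as usual. The array name
# out_of_patient_images is hypothetical:
# batches = augmenter.flow(out_of_patient_images, batch_size=4)
```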
Figure 1: Confusion matrix of DenseNet201 on the test dataset with 8,740 images. (Figure not reproduced here.)

Another class we need to study is "esophagitis": 170 of its instances were classified as "normal-z-line", which accounts for 30.57% of the instances in this class. The classes "stool-plenty" and "colon-clear" contain most of the instances, and our models did a good job of classifying them correctly, which boosted the scores.

ACKNOWLEDGMENTS

We thank CAPES and CNPq (grant 400487/2016-0).

REFERENCES

[1] Taruna Agrawal, Rahul Gupta, Saurabh Sahu, and Carol Y. Espy-Wilson. 2017. SCL-UMD at the Medico Task-MediaEval 2017: Transfer Learning based Classification of Medical Images. In Proceedings of the MediaEval 2017 Workshop.
[2] François Chollet. 2016. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR abs/1610.02357. http://arxiv.org/abs/1610.02357
[3] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR abs/1704.04861. http://arxiv.org/abs/1704.04861
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385
[5] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. 2016. Densely Connected Convolutional Networks. CoRR abs/1608.06993. http://arxiv.org/abs/1608.06993
[6] Mathias Lux and Savvas A. Chatzichristofis. 2008. Lire: Lucene Image Retrieval: An Extensible Java CBIR Library. In Proceedings of the 16th ACM International Conference on Multimedia (MM '08). ACM, New York, NY, USA, 1085–1088.
[7] Konstantin Pogorelov, Kristin Ranheim Randel, Thomas de Lange, Sigrun Losada Eskeland, Carsten Griwodz, Dag Johansen, Concetto Spampinato, Mario Taschwer, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Nerthus: A Bowel Preparation Quality Video Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17). ACM, New York, NY, USA, 170–174. https://doi.org/10.1145/3083187.3083216
[8] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17). ACM, New York, NY, USA, 164–169. https://doi.org/10.1145/3083187.3083212
[9] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018. In Proceedings of the MediaEval 2018 Workshop.
[10] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556. http://arxiv.org/abs/1409.1556
[11] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. CoRR abs/1602.07261. http://arxiv.org/abs/1602.07261
[12] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. CoRR abs/1512.00567. http://arxiv.org/abs/1512.00567