=Paper= {{Paper |id=Vol-2283/MediaEval_18_paper_28 |storemode=property |title=Deep Learning Based Disease Detection Using Domain Specific Transfer Learning |pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_28.pdf |volume=Vol-2283 |authors=Steven A. Hicks,Pia H. Smedsrud,Pål Halvorsen,Michael Riegler |dblpUrl=https://dblp.org/rec/conf/mediaeval/HicksSHR18 }} ==Deep Learning Based Disease Detection Using Domain Specific Transfer Learning== https://ceur-ws.org/Vol-2283/MediaEval_18_paper_28.pdf
 Deep Learning Based Disease Detection Using Domain Specific
                     Transfer Learning
                        Steven A. Hicks3 , Pia H. Smedsrud1 , Pål Halvorsen1, 2, 3 , Michael Riegler1, 2, 3
                                                     1 Simula Research Laboratory 2 University of Oslo
                                                   3 Simula Metropolitan Center for Digital Engineering


ABSTRACT
In this paper, we present our approach to the Medico Multimedia Task, part of the MediaEval 2018 Benchmark [13]. Our method is based on convolutional neural networks (CNNs), where we compare how fine-tuning, in the context of transfer learning, from different source domains (general versus medical domain) affects classification performance. The preliminary results show that fine-tuning models trained on large and diverse datasets is favorable, even when the model’s source domain bears little to no resemblance to the new target.

1    INTRODUCTION
In an effort to explore how medical multimedia can be used to create performant and efficient classification algorithms, the participants in the 2018 Multimedia for Medicine Task explore the challenge of automatically detecting diseases found in the gastrointestinal (GI) tract using as little data as possible [13]. The challenge presents four tasks, of which we decided to focus on the task for classification of diseases and findings and the task for fast and efficient classification.

2    APPROACH
As the current state of the art for solving most computer vision tasks involves various implementations of deep neural networks, we decided to base our approach on this class of algorithms, specifically CNNs. However, due to the limited size of the development dataset [11, 12], training a CNN from scratch would most likely yield subpar results. To resolve this issue, we fine-tune the weights of networks previously trained on larger datasets, using the limited data we have, to fit our specific domain (classification of images taken from the GI tract). This technique is commonly referred to as transfer learning (TL) and has been shown to work well across different domains [5, 6, 16].

For this challenge, we hypothesized that adapting the weights of a model trained on data similar to our own (medical images) would yield better results than using models trained on data with little resemblance, both in terms of time to convergence and classification performance. To test this hypothesis, we compared models trained to achieve high scores on the ImageNet challenge [4] with models trained for medical image classification.

For the classification task, all models were measured by the given requirements, namely the Matthews correlation coefficient (MCC) and the number of samples used for training. Runs submitted to the efficiency task were evaluated based on their classification throughput, i.e., the time it takes for the model to classify an image.

2.1    Transfer Learning From ImageNet
For the approach of fine-tuning models based on ImageNet [4], we simply used the pre-trained networks available in our deep learning libraries of choice, Keras [3] (with a TensorFlow [1] backend) and PyTorch [10]. Both libraries include several popular CNN architectures trained on 1,000 categories containing objects from everyday life. As for our method of fine-tuning, we found that simply replacing the classification block and tuning across the entire network, without freezing any layers, gave the best results, both in terms of classification performance and time to convergence.

2.2    Transfer Learning From a Medical Dataset
For the medical-domain fine-tuning approach, we trained two models from scratch on a custom medical dataset consisting of a combination of two openly available medical datasets, LapGyn4 [8] and Cataract-101 [15]. Together they contain 57,134 images spread across 31 classes, taken from laparoscopic and cataract surgeries, respectively. Between this custom dataset and the supplied Medico development dataset, the only overlapping classes are those for detecting instruments. Similar to how we trained the ImageNet models, we fine-tuned across the entire network without freezing any layers.

2.3    Additional Training Techniques
In addition to testing our main hypothesis, we applied various techniques to offset the common issue of overfitting, which can be especially problematic when training on smaller datasets. These techniques include weighting the loss function based on class size, various data augmentation techniques [17], regularization of the classification block [7, 9], and resampling the dataset by extending the minority class based on some basic assumptions [2]. First, as the development dataset is small and highly imbalanced, with class sizes ranging from 4 to 613 samples, we weighted the network error based on class size. Second, we applied image pre-processing techniques such as rotation, zooming, flipping, shifting, and scaling. Third, we applied L1 and L2 regularization to the classification block of each trained network. Fourth, we observed that the minority class was the only one that did not contain pictures from within the human body. Based on this, we extended the out-of-patient class by adding 14 additional images depicting a typical office environment, including objects such as windows, computers, and people. Note that these techniques were used for all runs.

2.4    Techniques for Efficient Classification
For the fast and efficient classification task, our approach was simply to reuse the models trained for the classification task and see which were most efficient. We quickly observed that models implemented in PyTorch [10] achieved a much higher frame rate (frames per second, FPS) than their TensorFlow [1] counterparts, largely due to differences in how tensors are laid out between the two frameworks. This led to some models being re-implemented in PyTorch and re-evaluated.

Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France


3     RESULTS AND ANALYSIS
The initial evaluation of our internal experiments was done using 3-fold cross-validation, where each run was scored by averaging the macro-averaged classification scores of each model split. A complete overview of the internal runs for both tasks is shown in Table 1. Based on these initial findings, we selected four runs for the classification of diseases and findings task (Table 2) and three runs for the fast and efficient classification task (Table 3) as official runs to be submitted to the event organizers.

Prioritizing runs for submission was done by looking at which experiments achieved the highest metric relative to the task at hand (MCC or FPS). Additionally, we wanted to submit a variety of different models; for example, even though our fine-tuned medical-based models did not perform as well as their ImageNet-based counterparts, we still wanted to submit such a run for official evaluation. For the same reason, we also submitted a model trained on a significantly limited development dataset, i.e., a model trained on only 657 samples.

                 Internal Classification Evaluation Results
Method               MCC    F1     REC    PREC   SPEC   ACC    FPS
        ImageNet Based Transfer Learning
InceptionResnetV2    0.857  0.858  0.866  0.991  0.851  0.983  31
ResNet50             0.866  0.869  0.874  0.995  0.864  0.991  100
ResNet18             0.866  0.880  0.994  0.995  0.882  0.989  323
AlexNet              0.878  0.885  0.901  0.993  0.880  0.986  1015
DenseNet169          0.915  0.922  0.931  0.995  0.918  0.991  45
VGG11                0.901  0.908  0.923  0.995  0.905  0.990  624
(Tiny) DenseNet201   0.864  0.876  0.906  0.993  0.873  0.987  58
        Medical Based Transfer Learning
DenseNet169          0.792  0.798  0.830  0.991  0.795  0.983  52
InceptionResnetV2    0.802  0.814  0.843  0.989  0.807  0.979  30

Table 1: The classification performance results of our internal experiments. Note that the displayed metrics are averages across the K splits generated through cross-validation.

3.1    Classification Subtask Results
Looking at the results for the classification task (Table 2), we see that the best performing run is the 3-Averaged DenseNet169. This was expected, as it constitutes the averaged output of the best performing model from our internal experiments. Furthermore, as in our internal runs, the ImageNet-based model beats the medical-based one by approximately 10% when comparing MCC scores. We believe these results may be due to the difference in variety and size between the two datasets used to train the base models. Due to limited time and resources, we were only able to train a small variety of networks on the medical dataset, and we believe there is more work to be done in this respect.

Somewhat surprisingly, the submitted model trained on a severely limited training set (657 samples), (Tiny) DenseNet201, was still able to retain a relatively high MCC score. We believe this is due to the similarity between images within the same class, and to each class being quite visually distinct (with a few exceptions). This is supported by the confusion matrix shown in Table 4, where we see that the model fails on just a few categories.

              Official Classification Evaluation Results
Method                         MCC    F1     REC    PREC   SPEC   ACC
ImageNet TL DenseNet169        0.927  0.931  0.931  0.931  0.995  0.991
Medical TL InceptionResNetV2   0.830  0.841  0.841  0.841  0.989  0.980
3-Averaged DenseNet169         0.935  0.939  0.939  0.939  0.996  0.992
Tiny Dataset DenseNet201       0.890  0.897  0.897  0.897  0.993  0.987

Table 2: The official classification performance results as provided by the Medico task organizers.

          Confusion Matrix for 3-Averaged DenseNet169
     A    B    C    D    E   F    G    H     I    J    K    L     M    N    O    P
A  512    0    0    0    1   0    0    0     0    0    4    7     0    0    0    9
B    1  452   84    0    0   0    0    0     0    0    0    0     0    0    0    0
C    1  103  477    0    0   0    0    0     0    0    0    0     0    0    0    0
D    1    0    0  499   41   0    0    0     0    0    2    0     0    0    0   41
E    0    0    0   51  522   0    0    0     0    0    1    0     0    0    0   15
F    0    0    0    0    0   5    0    0     0    0    0    0     0    0    0    0
G    1    1    2    0    0   0  555    0     0    0    1    0     0    0    0    0
H    0    0    0    0    0   0    0  490     4    0    0    0     0    0    0    0
I    1    0    0    0    0   0    0    0  1961    0    0    0     0    0    0    1
J    1    0    0    0    0   0    0    0     0   37    0    0     0    0    0    0
K   17    0    0    5    0   0    6    0     0    0  357   13     0    6    0   68
L    5    0    0    1    0   0    0    0     0    0    9  564     0    0    0    3
M    1    0    0    0    0   0    0   16     0    0    0    0  1065    0    0    0
N    1    0    0    0    0   0    0    0     0    0    0    0     0  183    1    1
O    0    0    0    0    0   0    0    0     0    0    0    0     0    3  396    0
P    0    0    0    0    0   0    0    0     0    0    0    0     0    0    0  135

Table 4: Confusion matrix for the 3-Averaged DenseNet169 run. Classes: (A) ulcerative colitis; (B) esophagitis; (C) normal z-line; (D) dyed-lifted polyps; (E) dyed resection margins; (F) out of patient; (G) normal pylorus; (H) stool inclusions; (I) stool plenty; (J) blurry nothing; (K) polyps; (L) normal cecum; (M) colon clear; (N) retroflex rectum; (O) retroflex stomach; (P) instruments.

3.2    Efficiency Subtask Results
Looking at Table 3, we see the official results for the efficiency subtask. Note that all models submitted to this task were implemented in PyTorch. Of the three models, AlexNet was the most performant by quite a large margin. We believe this is due to the network’s depth and complexity, i.e., the number of layers and parameters. Additionally, the model’s MCC score is relatively high, considering that AlexNet is rather simple compared to the models we used for the classification task.

              Official Efficiency Evaluation Results
Method     MCC    F1     REC    PREC   SPEC   ACC    FPS
ResNet18   0.892  0.899  0.899  0.899  0.993  0.987  323
VGG11      0.907  0.913  0.913  0.913  0.994  0.989  624
AlexNet    0.882  0.890  0.890  0.890  0.993  0.986  1015

Table 3: The official efficiency results as provided by the Medico task organizers.

4     CONCLUSION
In this paper, we presented the work done as part of the Medico Multimedia Task, where we participated in two of the four available subtasks. Our main hypothesis for this challenge was that fine-tuned models with a medical source domain would perform better than fine-tuned ImageNet models when used for medical disease detection. Furthermore, with the goal of submitting to the efficiency task, we measured the FPS of the models. Based on our internal experiments and the official evaluation metrics received from the event organizers, we conclude that a large and varied dataset takes precedence over how similar the source domain is to the target. Additionally, we found that networks of lesser depth and complexity were generally more efficient. We admit that these results may be anecdotal, and we believe more research is required to fully explore the potential of our approach.
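For completeness, the Matthews correlation coefficient used to rank all classification runs generalizes to the multi-class case; a minimal sketch of how a run could be scored with scikit-learn's implementation (the toy labels below are ours, for illustration only):

```python
from sklearn.metrics import matthews_corrcoef

# Toy ground truth and predictions over three classes; in the internal
# evaluation, each cross-validation split is scored and the per-split
# results are averaged.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

mcc = matthews_corrcoef(y_true, y_pred)  # 1.0 is perfect, 0.0 is chance-level
```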


REFERENCES
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). https://www.tensorflow.org/ Software available from tensorflow.org.
 [2] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. 2017. A systematic study of the class imbalance problem in convolutional neural networks. Computing Research Repository abs/1710.05381 (2017). arXiv:1710.05381 http://arxiv.org/abs/1710.05381
 [3] François Chollet and others. 2015. Keras. https://keras.io. (2015).
 [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
 [5] H. G. Kim, Y. Choi, and Y. M. Ro. 2017. Modality-bridge Transfer Learning for Medical Image Classification. ArXiv e-prints (Aug. 2017). arXiv:cs.CV/1708.03111
 [6] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. 2018. Do Better ImageNet Models Transfer Better? Computing Research Repository abs/1805.08974 (2018). arXiv:1805.08974 http://arxiv.org/abs/1805.08974
 [7] Anders Krogh and John A. Hertz. 1992. A Simple Weight Decay Can Improve Generalization. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.). Morgan-Kaufmann, 950–957. http://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization.pdf
 [8] Andreas Leibetseder, Stefan Petscharnig, Manfred Jürgen Primus, Sabrina Kletz, Bernd Münzer, Klaus Schoeffmann, and Jörg Keckstein. 2018. LapGyn4: A Dataset for 4 Automatic Content Analysis Problems in the Domain of Laparoscopic Gynecology. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys ’18). ACM, New York, NY, USA, 357–362. https://doi.org/10.1145/3204949.3208127
 [9] Pushparaja Murugan and Shanmugasundaram Durairaj. 2017. Regularization and Optimization strategies in Deep Convolutional Neural Network. Computing Research Repository abs/1712.04711 (2017). arXiv:1712.04711 http://arxiv.org/abs/1712.04711
[10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[11] Konstantin Pogorelov, Kristin Ranheim Randel, Thomas de Lange, Sigrun Losada Eskeland, Carsten Griwodz, Dag Johansen, Concetto Spampinato, Mario Taschwer, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Nerthus: A Bowel Preparation Quality Video Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 170–174.
[12] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Kvasir: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 164–169.
[13] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018.
[14] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Losada Eskeland, Duc-Tien Dang-Nguyen, Olga Ostroukhova, Mathias Lux, and Concetto Spampinato. 2017. A Comparison of Deep Learning with Global Features for Gastrointestinal Disease Detection. In MediaEval.
[15] Klaus Schoeffmann, Mario Taschwer, Stephanie Sarny, Bernd Münzer, Manfred Jürgen Primus, and Doris Putzgruber. 2018. Cataract-101: Video Dataset of 101 Cataract Surgeries. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys ’18). ACM, New York, NY, USA, 421–425. https://doi.org/10.1145/3204949.3208137
[16] Chuen-Kai Shie, Chung-Hisang Chuang, Chun-Nan Chou, Meng-Hsi Wu, and Edward Y. Chang. 2015. Transfer representation learning for medical image analysis. 2015 (08 2015), 711–714.
[17] Luke Taylor and Geoff Nitschke. 2017. Improving Deep Learning using Generic Data Augmentation. Computing Research Repository abs/1708.06020 (2017). arXiv:1708.06020 http://arxiv.org/abs/1708.06020