=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_28
|storemode=property
|title=Deep Learning Based Disease Detection Using Domain Specific Transfer Learning
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_28.pdf
|volume=Vol-2283
|authors=Steven A. Hicks,Pia H. Smedsrud,Pål Halvorsen,Michael Riegler
|dblpUrl=https://dblp.org/rec/conf/mediaeval/HicksSHR18
}}
==Deep Learning Based Disease Detection Using Domain Specific Transfer Learning==
Steven A. Hicks³, Pia H. Smedsrud¹, Pål Halvorsen¹,²,³, Michael Riegler¹,²,³

¹Simula Research Laboratory, ²University of Oslo, ³Simula Metropolitan Center for Digital Engineering

===Abstract===
In this paper, we present our approach to the Medico Multimedia Task, part of the MediaEval 2018 Benchmark [13]. Our method is based on convolutional neural networks (CNNs), where we compare how fine-tuning, in the context of transfer learning, from different source domains (general versus medical) affects classification performance. The preliminary results show that fine-tuning models trained on large and diverse datasets is favorable, even when the model's source domain bears little to no resemblance to the new target.

===1 Introduction===
In an effort to explore how medical multimedia can be used to create performant and efficient classification algorithms, the participants in the 2018 Multimedia for Medicine Task tackle the challenge of automatically detecting diseases found in the gastrointestinal (GI) tract using as little data as possible [13]. The challenge presents four tasks, of which we decided to focus on the task for classification of diseases and findings and the task for fast and efficient classification.

===2 Approach===
As the current state-of-the-art methods for solving most computer vision tasks involve various implementations of deep neural networks, we decided to base our approach on this class of algorithms, specifically CNNs. However, due to the limited size of the development dataset [11, 12], training a CNN from scratch would most likely yield subpar results. Therefore, we fine-tune the weights of networks previously trained on larger datasets, using the limited data that we have, to fit our specific domain (classification of images taken from the GI tract). This technique is commonly referred to as transfer learning (TL) and has been shown to work well across different domains [5, 6, 16].

For this challenge, we hypothesized that adapting the weights of a model trained on data similar to our own (medical images) would yield better results than adapting models trained on data with little resemblance, both in terms of time to convergence and classification performance. To test this hypothesis, we compared models trained to achieve high scores on the ImageNet challenge [4] with models trained for medical image classification.

For the classification task, all models were measured by the given requirements, namely the Matthews correlation coefficient (MCC) and the number of samples used for training. Runs submitted to the efficiency task were evaluated based on their classification throughput, i.e., the time it takes for the model to classify an image.

===2.1 Transfer Learning From ImageNet===
For the approach of fine-tuning models based on ImageNet [4], we simply used the pre-trained networks available in our deep learning libraries of choice, Keras [3] (with a TensorFlow [1] backend) and PyTorch [10]. Both libraries include several popular CNN architectures trained on 1,000 categories of objects from everyday life. As for our method of fine-tuning, we found that simply replacing the classification block and tuning the entire network without freezing any layers gave the best results, both in terms of classification performance and time to convergence.

===2.2 Transfer Learning From a Medical Dataset===
For the medical-domain fine-tuning approach, we trained two models from scratch on a custom medical dataset consisting of a combination of two openly available medical datasets, LapGyn4 [8] and Cataract-101 [15]. Together they contain 57,134 images spread across 31 classes, taken from laparoscopy and cataract surgeries, respectively. Between this custom dataset and the supplied Medico development dataset, the only overlapping classes are those for detecting instruments. Similar to how we trained the ImageNet models, we fine-tuned the entire network without freezing any layers.

===2.3 Additional Training Techniques===
In addition to our main hypothesis, we applied various techniques to offset the common issue of overfitting, which can be especially problematic when training on smaller datasets. Techniques used include weighting the loss function based on class size, various data augmentation techniques [17], regularization of the classification block [7, 9], and resampling the dataset by extending the minority class based on some basic assumptions [2]. First, as the development dataset is small and highly imbalanced, with class sizes ranging from 4 to 613 samples, we weighted the network error based on class size. Second, we applied image pre-processing techniques such as rotation, zooming, flipping, shifting, and scaling. Third, we applied some L1 and L2 regularization on the classification block of each trained network. Fourth, we observed that the minority class was the only one that did not contain pictures from within the human body. Based on this, we extended the out-of-patient class by adding 14 additional images depicting a typical office environment, including objects such as windows, computers, and people. Note that these techniques were used for all runs.
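The first of the techniques in Section 2.3, weighting the loss by class size, can be sketched as follows. This is a minimal, stdlib-only sketch assuming the common inverse-frequency ("balanced") weighting scheme; the class names and counts below are illustrative, not the actual Medico distribution:

```python
# Inverse-frequency ("balanced") class weights: rare classes receive
# proportionally larger loss weights. Counts here are illustrative only.
class_counts = {"out-of-patient": 4, "polyps": 350, "normal-pylorus": 613}

def balanced_weights(counts):
    """weight_c = N / (K * n_c), so that sum over classes of n_c * w_c == N."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}

weights = balanced_weights(class_counts)
```

A dictionary like this is what would typically be handed to the training loop, e.g. via the `class_weight` argument of `fit` in Keras, or as the per-class `weight` tensor of `torch.nn.CrossEntropyLoss` in PyTorch.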
===2.4 Techniques for Efficient Classification===
For the fast and efficient classification task, we simply reused the models trained for the classification task to see which were most efficient. We quickly observed that models implemented in PyTorch [10] achieved much higher frames per second (FPS) than their TensorFlow [1] based counterparts, largely due to the difference in how tensors are laid out between the two frameworks. This led to some models being re-implemented in PyTorch and re-evaluated.

Copyright held by the owner/author(s). MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

===3 Results and Analysis===
The initial evaluation of our internal experiments was done using 3-fold cross-validation, where each run was scored by averaging the macro-average classification scores of each model split. A complete overview of the internal runs for both tasks is shown in Table 1. Based on these initial findings, we selected four runs for the classification of diseases and findings task (Table 2) and three runs for the fast and efficient classification task (Table 3) as official runs to be submitted to the event organizers. Runs were prioritized for submission by looking at which experiments achieved the highest metric relative to the task at hand (MCC or FPS).

| Method | MCC | F1 | REC | PREC | SPEC | ACC | FPS |
|---|---|---|---|---|---|---|---|
| *ImageNet Based Transfer Learning* | | | | | | | |
| InceptionResnetV2 | 0.857 | 0.858 | 0.866 | 0.991 | 0.851 | 0.983 | 31 |
| ResNet50 | 0.866 | 0.869 | 0.874 | 0.995 | 0.864 | 0.991 | 100 |
| ResNet18 | 0.866 | 0.880 | 0.994 | 0.995 | 0.882 | 0.989 | 323 |
| AlexNet | 0.878 | 0.885 | 0.901 | 0.993 | 0.880 | 0.986 | 1015 |
| DenseNet169 | 0.915 | 0.922 | 0.931 | 0.995 | 0.918 | 0.991 | 45 |
| VGG11 | 0.901 | 0.908 | 0.923 | 0.995 | 0.905 | 0.990 | 624 |
| (Tiny) DenseNet201 | 0.864 | 0.876 | 0.906 | 0.993 | 0.873 | 0.987 | 58 |
| *Medical Based Transfer Learning* | | | | | | | |
| DenseNet169 | 0.792 | 0.798 | 0.830 | 0.991 | 0.795 | 0.983 | 52 |
| InceptionResnetV2 | 0.802 | 0.814 | 0.843 | 0.989 | 0.807 | 0.979 | 30 |

Table 1: The classification performance results of our internal experiments. Note that the displayed metrics are averages across the K splits generated through cross-validation.
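The macro-average scoring used in the internal 3-fold evaluation can be sketched from a confusion matrix. This is a stdlib-only sketch; the 3-class matrix below is a toy example, not one of the paper's results:

```python
def macro_scores(cm):
    """Macro-averaged precision, recall and F1 from a square confusion
    matrix cm, where cm[i][j] counts true class i predicted as class j."""
    k = len(cm)
    precisions, recalls, f1s = [], [], []
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp  # column total minus diagonal
        fn = sum(cm[i]) - tp                       # row total minus diagonal
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    avg = lambda xs: sum(xs) / len(xs)
    return avg(precisions), avg(recalls), avg(f1s)

# Toy 3-class confusion matrix (rows = ground truth, columns = prediction).
cm = [[8, 1, 1],
      [0, 9, 1],
      [1, 0, 9]]
prec, rec, f1 = macro_scores(cm)
```

Averaging these per-split macro scores over the three cross-validation folds then yields the kind of figures reported in Table 1; in practice a library such as scikit-learn (`precision_recall_fscore_support` with `average="macro"`) would do the same computation.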
Additionally, we wanted to submit a variety of different models; for example, even though our fine-tuned medical-based models did not perform as well as their ImageNet-based counterparts, we still wanted to submit such a run for official evaluation. For the same reason, we also submitted a model trained on a significantly limited development dataset, i.e., a model trained on only 657 samples.

===3.1 Classification Subtask Results===
Looking at the results for the classification task (Table 2), we see that the best performing run is the 3-Averaged DenseNet169. This was expected, as it constitutes the averaged output of the best performing model from our internal experiments. Furthermore, as shown in our internal runs, the ImageNet-based model beats the medical-based one by approximately 10% when comparing MCC scores. We believe these results may be due to the difference in variety and size between the two datasets used to train the base models. Due to limited time and resources, we were only able to train a small variety of networks on the medical dataset, and we believe there is more work to be done in this respect.

| Method | MCC | F1 | REC | PREC | SPEC | ACC |
|---|---|---|---|---|---|---|
| ImageNet TL DenseNet169 | 0.927 | 0.931 | 0.931 | 0.931 | 0.995 | 0.991 |
| Medical TL InceptionResNetV2 | 0.830 | 0.841 | 0.841 | 0.841 | 0.989 | 0.980 |
| 3-Averaged DenseNet169 | 0.935 | 0.939 | 0.939 | 0.939 | 0.996 | 0.992 |
| Tiny Dataset DenseNet201 | 0.890 | 0.897 | 0.897 | 0.897 | 0.993 | 0.987 |

Table 2: The official classification performance results as provided by the Medico task organizers.

Somewhat surprisingly, the submitted model trained on a severely limited training set (657 samples), (Tiny) DenseNet201, was still able to retain a relatively high MCC score. We believe this is due to the similarity between images within the same class, and to how visually distinct each class is (with a few exceptions). This is supported by the confusion matrix shown in Table 4, where we see the model fails on only a few categories.

===3.2 Efficiency Subtask Results===
Looking at Table 3, we see the official results for the efficiency subtask. Note that all models submitted to this task were implemented in PyTorch. Of the three models, AlexNet was the most performant by quite a large margin. We believe this is due to the network's depth and complexity, i.e., its number of layers and parameters. Additionally, the model's MCC score is relatively high, considering that AlexNet is rather simple compared to the models we used for the classification task.

| Method | MCC | F1 | REC | PREC | SPEC | ACC | FPS |
|---|---|---|---|---|---|---|---|
| ResNet18 | 0.892 | 0.899 | 0.899 | 0.899 | 0.993 | 0.987 | 323 |
| VGG11 | 0.907 | 0.913 | 0.913 | 0.913 | 0.994 | 0.989 | 624 |
| AlexNet | 0.882 | 0.890 | 0.890 | 0.890 | 0.993 | 0.986 | 1015 |

Table 3: The official efficiency results as provided by the Medico task organizers.

|   | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 512 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 7 | 0 | 0 | 0 | 9 |
| B | 1 | 452 | 84 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C | 1 | 103 | 477 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D | 1 | 0 | 0 | 499 | 41 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 41 |
| E | 0 | 0 | 0 | 51 | 522 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 15 |
| F | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 1 | 1 | 2 | 0 | 0 | 0 | 555 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| H | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 490 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| I | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1961 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| J | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 37 | 0 | 0 | 0 | 0 | 0 | 0 |
| K | 17 | 0 | 0 | 5 | 0 | 0 | 6 | 0 | 0 | 0 | 357 | 13 | 0 | 6 | 0 | 68 |
| L | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 564 | 0 | 0 | 0 | 3 |
| M | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 1065 | 0 | 0 | 0 |
| N | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 183 | 1 | 1 |
| O | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 396 | 0 |
| P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 135 |

Table 4: Confusion matrix for the 3-Averaged DenseNet169. (A) Ulcerative colitis; (B) esophagitis; (C) normal z-line; (D) dyed-lifted polyps; (E) dyed resection margins; (F) out of patient; (G) normal pylorus; (H) stool inclusions; (I) stool plenty; (J) blurry nothing; (K) polyps; (L) normal cecum; (M) colon clear; (N) retroflex rectum; (O) retroflex stomach; (P) instruments.

===4 Conclusion===
In this paper, we presented the work done as part of the Medico Multimedia Task, where we participated in two of the four available subtasks. Our main hypothesis for this challenge was that fine-tuned models with a medical source domain would perform better than fine-tuned ImageNet models when used for medical disease detection. Furthermore, with the goal of submitting to the efficiency task, we measured the FPS of the models. Based on our internal experiments and the official evaluation metrics received from the event organizers, we conclude that a large and varied source dataset takes precedence over how similar the source domain is to the target. Additionally, we found that networks of lesser depth and complexity were generally more efficient. We admit that these results may be anecdotal, but we believe more research is required to fully explore the potential of our approach.

==References==
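Classification throughput (FPS), the metric used for the efficiency subtask, amounts to timing repeated single-input classifications. A minimal, framework-agnostic sketch of such a measurement (the stand-in `model` below is a plain function, not one of the trained networks):

```python
import time

def measure_fps(model, sample, iters=100):
    """Return average classifications per second over `iters` calls."""
    model(sample)  # warm-up call, so one-time setup costs are not counted
    start = time.perf_counter()
    for _ in range(iters):
        model(sample)
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Stand-in "model": a cheap function applied to a fake flattened image.
fake_image = [0.0] * (224 * 224)
fps = measure_fps(sum, fake_image)
```

With a real network, `sample` would be a preprocessed image tensor and `model` the forward pass; GPU frameworks additionally require synchronizing the device before reading the clock, or the measured time understates the real cost.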
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). https://www.tensorflow.org/ Software available from tensorflow.org.

[2] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. 2017. A systematic study of the class imbalance problem in convolutional neural networks. Computing Research Repository abs/1710.05381 (2017). arXiv:1710.05381 http://arxiv.org/abs/1710.05381

[3] François Chollet and others. 2015. Keras. https://keras.io. (2015).

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5] H. G. Kim, Y. Choi, and Y. M. Ro. 2017. Modality-bridge Transfer Learning for Medical Image Classification. ArXiv e-prints (Aug. 2017). arXiv:cs.CV/1708.03111

[6] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. 2018. Do Better ImageNet Models Transfer Better? Computing Research Repository abs/1805.08974 (2018). arXiv:1805.08974 http://arxiv.org/abs/1805.08974

[7] Anders Krogh and John A. Hertz. 1992. A Simple Weight Decay Can Improve Generalization. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.). Morgan-Kaufmann, 950–957. http://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization.pdf

[8] Andreas Leibetseder, Stefan Petscharnig, Manfred Jürgen Primus, Sabrina Kletz, Bernd Münzer, Klaus Schoeffmann, and Jörg Keckstein. 2018. LapGyn4: A Dataset for 4 Automatic Content Analysis Problems in the Domain of Laparoscopic Gynecology. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys ’18). ACM, New York, NY, USA, 357–362. https://doi.org/10.1145/3204949.3208127

[9] Pushparaja Murugan and Shanmugasundaram Durairaj. 2017. Regularization and Optimization strategies in Deep Convolutional Neural Network. Computing Research Repository abs/1712.04711 (2017). arXiv:1712.04711 http://arxiv.org/abs/1712.04711

[10] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).

[11] Konstantin Pogorelov, Kristin Ranheim Randel, Thomas de Lange, Sigrun Losada Eskeland, Carsten Griwodz, Dag Johansen, Concetto Spampinato, Mario Taschwer, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Nerthus: A Bowel Preparation Quality Video Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 170–174.

[12] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Kvasir: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 164–169.

[13] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018.

[14] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Losada Eskeland, Duc-Tien Dang-Nguyen, Olga Ostroukhova, Mathias Lux, and Concetto Spampinato. 2017. A Comparison of Deep Learning with Global Features for Gastrointestinal Disease Detection. In MediaEval.

[15] Klaus Schoeffmann, Mario Taschwer, Stephanie Sarny, Bernd Münzer, Manfred Jürgen Primus, and Doris Putzgruber. 2018. Cataract-101: Video Dataset of 101 Cataract Surgeries. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys ’18). ACM, New York, NY, USA, 421–425. https://doi.org/10.1145/3204949.3208137

[16] Chuen-Kai Shie, Chung-Hisang Chuang, Chun-Nan Chou, Meng-Hsi Wu, and Edward Y. Chang. 2015. Transfer representation learning for medical image analysis. 2015 (08 2015), 711–714.

[17] Luke Taylor and Geoff Nitschke. 2017. Improving Deep Learning using Generic Data Augmentation. Computing Research Repository abs/1708.06020 (2017). arXiv:1708.06020 http://arxiv.org/abs/1708.06020