Transfer learning with prioritized classification and training dataset equalization for medical objects detection

Olga Ostroukhova (1), Konstantin Pogorelov (2,3), Michael Riegler (3,4), Duc-Tien Dang-Nguyen (5), Pål Halvorsen (3,4)
1 Research Institute of Multiprocessor Computation Systems n.a. A.V. Kalyaev, Russia
2 Simula Research Laboratory, Norway
3 University of Oslo, Norway
4 Simula Metropolitan Center for Digital Engineering, Norway
5 University of Bergen, Norway
olka7lands@gmail.com, konstantin@simula.no, michael@simula.no, ductien.dangnguyen@uib.no, paalh@simula.no

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This paper presents the method proposed by the organizer team (SIMULA) for the MediaEval 2018 Multimedia for Medicine: Medico Task. We utilized a recent transfer-learning-based image classification methodology and focused on how easy it is to implement multi-class image classifiers in general and how to improve the classification performance without deep neural network model redesign. The goal was both to provide a baseline for the Medico task and to show the performance of out-of-the-box classifiers for the medical use-case scenario.

1 INTRODUCTION
This paper provides a detailed description of the methods proposed by team SIMULA for the MediaEval 2018 Multimedia for Medicine Medico Task [11]. The main goal of the task is to perform medical image classification. The use-case scenario is gastrointestinal endoscopies. The 2018 version of the task is designed as a sixteen-class classification problem. Compared to the 2017 version, which was limited to eight classes [9], the current version of the task comes with several additional challenges, such as an imbalanced number of samples per class, to make it more realistic [8, 9]. In the previous year of the task, participants proposed different methods ranging from simple handcrafted features to deep neural networks [3–6, 10, 12]. For our approach, we propose a convolutional neural network (CNN) approach in combination with transfer learning. To compensate for the imbalanced dataset, we perform prioritized classification and dataset equalization.

2 PROPOSED APPROACH
As the organizers' team for the Medico task, our aim is not to achieve the best possible classification performance. Instead, we decided to check how low the entry threshold to medical image classification and the corresponding lesion detection challenge is. To achieve this, and also to provide a baseline for the competing teams, we adopted a recent transfer-learning-based image classification methodology and checked how well we are able to (i) easily implement a multi-class image classifier and (ii) improve the classification performance without deep neural network model redesign.

Thus, for the basic classification algorithm, we used a CNN architecture and a transfer-learning-based classifier, which has been previously introduced for medical image classification in our previous work [7]. This approach is based on the Inception v3 architecture [13]. To achieve the highest possible performance on the provided limited development set, we used the model pre-trained on the ImageNet dataset [1]. We performed the model retraining using the method described in [2]. We kept all the basic convolutional layers of the network and only retrained the two top fully connected (FC) layers after random initialization of their weights. The FC layers were retrained using the RMSprop [14] optimizer, which allows an adaptive learning rate during the training process.
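This retraining setup can be summarized with the following minimal sketch, assuming a Keras/TensorFlow implementation of Inception v3; the hidden-layer size and learning rate shown here are illustrative assumptions, not the exact values used for the submitted runs.

```python
# Minimal transfer-learning sketch (illustrative hyper-parameters, not the authors' exact setup).
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop

NUM_CLASSES = 16  # the sixteen Medico 2018 classes

# Load Inception v3 pre-trained on ImageNet, without its original classification head.
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # keep all convolutional layers fixed

# Two randomly initialized fully connected layers are retrained on top of the frozen base.
x = GlobalAveragePooling2D()(base.output)
x = Dense(1024, activation="relu")(x)                   # hidden FC layer (size assumed)
outputs = Dense(NUM_CLASSES, activation="softmax")(x)   # class-probability output

model = Model(base.input, outputs)
model.compile(optimizer=RMSprop(learning_rate=1e-3),    # adaptive-learning-rate optimizer
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the convolutional base keeps the ImageNet features intact, so only the randomly initialized FC head has to converge, which is consistent with the quick retraining convergence reported in section 3.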
We did not use any additional image enhancement or pre-processing for the images provided in the datasets. In order to increase the number of training samples, we performed various augmentation operations on the images in the training set. Specifically, we performed horizontal and vertical flipping and a change of brightness in the interval of ±20%.

The initial experimental studies showed that the pre-trained Inception v3 model is able to efficiently extract high-level features from the given medical images, and that it converges quickly during the retraining process with sufficient resulting classification performance (see section 3). However, due to a heavily imbalanced training dataset, and despite the training data augmentation, the detection performance for some classes was not good enough. To solve this issue, we implemented an additional training dataset balancing procedure that equalizes the training set by randomly duplicating the training samples of the under-filled classes, such as instruments, blurry, etc. This nearly doubled the number of training samples, allowing for better classification performance for the classes with a low number of provided images.
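A minimal sketch of the augmentation settings and of the equalization step is given below, assuming the training data is handled as (path, label) pairs; the helper name and the per-class duplication target are assumptions, since the paper only states that equalization nearly doubled the training set.

```python
# Sketch of the training-set preparation: augmentation settings and class equalization
# by random duplication of samples from under-filled classes (target size is an assumption).
import random
from collections import defaultdict
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation as described above: horizontal/vertical flips and a brightness change of +/-20%.
augmenter = ImageDataGenerator(horizontal_flip=True,
                               vertical_flip=True,
                               brightness_range=(0.8, 1.2))

def equalize(samples, target_size=None):
    """Randomly duplicate samples of under-filled classes.

    samples: list of (image_path, class_label) pairs.
    target_size: per-class target; here assumed to be the size of the largest class.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    if target_size is None:
        target_size = max(len(paths) for paths in by_class.values())
    balanced = []
    for label, paths in by_class.items():
        balanced.extend((p, label) for p in paths)
        # Duplicate randomly chosen existing samples until the class reaches the target size.
        balanced.extend((random.choice(paths), label)
                        for _ in range(target_size - len(paths)))
    random.shuffle(balanced)
    return balanced
```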
An additional classifier output post-processing step was implemented in order to address the different importance of the different classes, as stated in the task dataset description [11]. Specifically, we performed a prioritized selection of the resulting output class for each image based on the model's probability output. This was implemented as the selection of the first class with a detection probability higher than a set threshold from the array of classes sorted in order of their importance.
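The selector can be sketched as follows; the paper does not state what happens when no class exceeds the threshold, so falling back to the maximum-probability class is an assumption, as is the name of the priority list.

```python
# Sketch of the prioritized class selector (post-processing of the softmax output).
import numpy as np

def prioritized_class(probs, priority_order, threshold=0.75):
    """probs: softmax probabilities; priority_order: class indices, most important first."""
    # Pick the first class, in priority order, whose probability exceeds the threshold.
    for cls in priority_order:
        if probs[cls] > threshold:
            return cls
    # Assumed fallback: plain maximum-probability decision when no class exceeds the threshold.
    return int(np.argmax(probs))
```

With the threshold set to 0.75 this selector rarely overrides the maximum-probability decision, which matches the identical scores of the prioritized and non-prioritized runs reported in table 1.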
3 RESULTS AND ANALYSIS
For the official task submission creation, two separate models were used, trained on different datasets. The first model was trained on the training set created from the development set using the data augmentation procedure described in section 2. The trained model was used to process the task's test set, and the classification output was post-processed using the prioritized classification selector with four different probability threshold settings, from 0.75 to 0.1, resulting in runs #2–#5. For run #1, we used the maximum-probability selector without class prioritization. The runs produced with the first model were submitted as the speed runs. The second model was trained using the equalized training set, and the five runs generated with the same rules were submitted as the detection runs.

The official evaluation results for all the runs are shown in table 1. As one can see, all the runs significantly outperform the ZeroR and Random baselines and show good classification performance. All the runs that utilize the equalized training set have slightly better classification performance. Surprisingly, the introduced prioritized classification method did not result in improved detection performance, neither for the original nor for the equalized training set. With the threshold of 0.75, the classification performance is equal to that of the non-prioritized runs. This means that the trained classifier is performing as well as it can, and additional re-classification using the class priorities does not make sense for this particular dataset. However, it can still be potentially interesting for bigger datasets or a higher number of classes. The best performing run was detection run #1, generated using the equalized training set and the non-prioritized classifier, with a classification performance of 0.854 for the Rk statistic (MCC for k different classes). The confusion matrix for this run is depicted in table 2, and the class imbalance and the corresponding training and classification challenges can be easily observed. The most challenging class was Instruments, which is mostly caused by the different shapes, positions and visibilities of the instruments in the images. There was also a number of misclassification cases for the Dyed classes as well as for the Esophagitis and Normal Z-line classes.

With respect to the classification performance in terms of processing speed, the proposed classifier can process approximately 43 frames per second on a GPU-enabled consumer-grade personal computer, regardless of whether the post-processing class prioritization is enabled or disabled.

Table 1: Official classification performance evaluation for Detection (D) and Speed (S) runs, including ZeroR (ZR), Random (RD) and True (TR) baseline classifiers, reporting the following cross-class averaged metrics: True Positive or Hit (TP), True Negative or Correct Rejection (TN), False Positive or False Alarm (FP), False Negative or Miss (FN), Recall or Sensitivity or Hit Rate or True Positive Rate (REC), Specificity or True Negative Rate (SPE), Precision or Positive Predictive Value (PRE), Accuracy (ACC), F1-Score (F1), Matthews Correlation Coefficient (MCC), Rk statistic or MCC for k different classes (RK), and Processing Speed in Frames per Second (FPS).

Run  TP   TN    FP   FN   REC    SPE    PRE    ACC    F1     MCC    RK     FPS
D1   474  8122   72   72  0.824  0.991  0.828  0.984  0.815  0.812  0.854  43.1
D2   474  8122   72   72  0.823  0.991  0.828  0.984  0.814  0.811  0.854  43.0
D3   470  8117   76   76  0.817  0.991  0.819  0.983  0.807  0.803  0.845  43.1
D4   440  8087  107  107  0.774  0.987  0.771  0.976  0.756  0.752  0.786  43.2
D5   333  7981  213  213  0.664  0.974  0.646  0.951  0.601  0.605  0.582  43.0
S1   469  8117   77   77  0.765  0.991  0.729  0.982  0.743  0.737  0.844  43.1
S2   469  8117   77   77  0.765  0.991  0.728  0.982  0.743  0.737  0.844  43.1
S3   465  8112   82   82  0.758  0.990  0.722  0.981  0.736  0.729  0.835  42.9
S4   430  8077  117  117  0.709  0.986  0.677  0.973  0.679  0.674  0.766  43.0
S5   313  7960  233  233  0.546  0.971  0.607  0.947  0.504  0.510  0.544  43.3
ZR    34  7681  512  512  0.063  0.938  0.004  0.883  0.007  0.0    0.0    -
RD    35  7682  511  511  0.057  0.938  0.064  0.883  0.055  0.001  0.002  -
TR   546  8193    0    0  1.0    1.0    1.0    1.0    1.0    1.0    1.0    -

Table 2: Confusion matrix for detection run #1 reported in table 1 (rows: actual class, columns: detected class). The classes are Ulcerative Colitis (A), Esophagitis (B), Normal Z-line (C), Dyed and Lifted Polyps (D), Dyed Resection Margins (E), Out of Patient images (F), Normal Pylorus (G), Stool Inclusions (H), Stool Plenty (I), Blurry Nothing of value (J), Polyps (K), Normal Cecum (L), Colon Clear (M), Retroflex Rectum (N), Retroflex Stomach (O) and Instruments (P).

      A    B    C    D    E   F    G    H     I   J    K    L     M    N    O    P
A   459    2    1    1    5   0    1    0    54   0   13   13     1    7    0    7
B     2  388   77    0    0   0    0    0     0   0    0    0     0    0    0    0
C     0  145  451    0    0   0    4    0     0   0    1    0     0    0    0    0
D     0    0    0  406   81   0    0    0     1   0    4    0     0    0    0   26
E     0    0    0  115  462   0    0    0     0   0    0    1     1    1    0   17
F     0    0    0    0    1   2    0    0     0   0    0    0     0    0    0    0
G     3   18   27    0    0   0  548    0     0   0    2    0     2    1    4    1
H    10    1    0    5    2   0    0  498    98   0    3    1    24    0    0    6
I    14    0    0    5    1   0    0    0  1771   0    5    2     1    3    0    7
J     2    0    0    0    0   3    0    1     7  37    0    0     2    1    0    0
K    22    1    6   17    2   0    7    1     8   0  316   14     1    9    0   64
L    19    0    0    2    6   0    1    0    16   0   22  551     8    3    0    4
M     3    0    1    1    0   0    0    6     4   0    5    1  1025    1    0    6
N     8    0    0    3    4   0    0    0     3   0    2    1     0  160    4    8
O     0    1    0    0    0   0    0    0     2   0    0    0     0    5  387    1
P     0    0    0    1    0   0    0    0     1   0    1    0     0    1    2  126

4 CONCLUSIONS AND FUTURE WORK
In this paper, we presented an out-of-the-box solution utilizing a modern pre-trained CNN for the task of medical image classification. The goal was to provide a baseline for the task and to show the performance of basic methods without any deep architecture modification. The best achieved performance was a Matthews correlation coefficient for k different classes (Rk) of 0.854 at a speed of 43 frames per second. This is already a quite good result for an out-of-the-box method.
REFERENCES
[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[2] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proc. of ICML, Vol. 32. 647–655.
[3] Yang Liu, Zhonglei Gu, and William K. Cheung. 2017. HKBU at MediaEval 2017 Medico: Medical multimedia task. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[4] Syed Sadiq Ali Naqvi, Shees Nadeem, Muhammad Zaid, and Muhammad Atif Tahir. 2017. Ensemble of Texture Features for Finding Abnormalities in the Gastro-Intestinal Tract. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[5] Stefan Petscharnig and Klaus Schöffmann. 2018. Learning laparoscopic video shot classification for gynecological surgery. Multimedia Tools and Applications 77, 7 (2018), 8061–8079.
[6] Stefan Petscharnig, Klaus Schöffmann, and Mathias Lux. 2017. An Inception-like CNN Architecture for GI Disease and Anatomical Landmark Classification. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[7] Konstantin Pogorelov, Sigrun Losada Eskeland, Thomas de Lange, Carsten Griwodz, Kristin Ranheim Randel, Håkon Kvale Stensland, Duc-Tien Dang-Nguyen, Concetto Spampinato, Dag Johansen, Michael Riegler, and others. 2017. A holistic multimedia system for gastrointestinal tract disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 112–123.
[8] Konstantin Pogorelov, Kristin Ranheim Randel, Thomas de Lange, Sigrun Losada Eskeland, Carsten Griwodz, Dag Johansen, Concetto Spampinato, Mario Taschwer, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Nerthus: A Bowel Preparation Quality Video Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSYS). ACM, 170–174.
[9] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, and others. 2017. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSYS). ACM, 164–169.
[10] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Eskeland, Duc-Tien Dang-Nguyen, Olga Ostroukhova, and others. 2017. A comparison of deep learning with global features for gastrointestinal disease detection. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[11] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018. In Working Notes Proceedings of the MediaEval 2018 Workshop.
[12] Michael Riegler, Konstantin Pogorelov, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Eskeland, Duc-Tien Dang-Nguyen, Mathias Lux, and others. 2017. Multimedia for medicine: the Medico Task at MediaEval 2017. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[13] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567 (2015).
[14] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4, 2 (2012).