Transfer learning with prioritized classification and training dataset equalization for medical objects detection

Olga Ostroukhova (1), Konstantin Pogorelov (2,3), Michael Riegler (3,4), Duc-Tien Dang-Nguyen (5), Pål Halvorsen (3,4)
1 Research Institute of Multiprocessor Computation Systems n.a. A.V. Kalyaev, Russia
2 Simula Research Laboratory, Norway
3 University of Oslo, Norway
4 Simula Metropolitan Center for Digital Engineering, Norway
5 University of Bergen, Norway
olka7lands@gmail.com, konstantin@simula.no, michael@simula.no, ductien.dangnguyen@uib.no, paalh@simula.no

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This paper presents the method proposed by the organizer team (SIMULA) for the MediaEval 2018 Multimedia for Medicine: Medico Task. We utilized a recent transfer-learning-based image classification methodology and focused on how easy it is to implement multi-class image classifiers in general and how to improve the classification performance without deep neural network model redesign. The goal was both to provide a baseline for the Medico task and to show the performance of out-of-the-box classifiers for the medical use-case scenario.

1 INTRODUCTION
This paper provides a detailed description of the methods proposed by team SIMULA for the MediaEval 2018 Multimedia for Medicine Medico Task [11]. The main goal of the task is to perform medical image classification. The use-case scenario is gastrointestinal endoscopies. The 2018 version of the task is designed as a sixteen-class classification problem. Compared to the 2017 version, which was limited to eight classes [9], the current version of the task comes with several additional challenges, such as an imbalanced number of samples per class, to make it more realistic [8, 9]. In the previous year of the task, participants proposed different methods ranging from simple handcrafted features to deep neural networks [3–6, 10, 12]. For our approach, we propose a convolutional neural network (CNN) approach in combination with transfer learning. To compensate for the imbalanced dataset, we perform prioritized classification and dataset equalization.

2 PROPOSED APPROACH
As the organizers' team for the Medico task, our aim is not to achieve the best possible classification performance. Instead, we decided to check how low the entry threshold to medical image classification and the corresponding lesion detection challenge is. To achieve this, and also to provide a baseline for the competing teams, we adopted a recent transfer-learning-based image classification methodology and checked how well we are able to (i) easily implement a multi-class image classifier and (ii) improve the classification performance without deep neural network model redesign.

Thus, for the basic classification algorithm, we used a CNN architecture and a transfer-learning-based classifier, which has been previously introduced for medical image classification in our previous work [7]. This approach is based on the Inception v3 architecture [13]. To achieve the highest possible performance on the provided limited development set, we used the model pre-trained on the ImageNet dataset [1]. We performed the model retraining using the method described in [2]. We kept all the basic convolutional layers of the network and only retrained the two top fully connected (FC) layers after random initialization of their weights. The FC layers were retrained using the RMSprop [14] optimizer, which allows an adaptive learning rate during the training process.
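This retraining setup can be summarized with the following minimal sketch, assuming a Keras/TensorFlow implementation of Inception v3; the hidden-layer size and learning rate shown here are illustrative assumptions, not the exact values used for the submitted runs.

```python
# Minimal transfer-learning sketch (illustrative hyper-parameters, not the authors' exact setup).
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop

NUM_CLASSES = 16  # the sixteen Medico 2018 classes

# Load Inception v3 pre-trained on ImageNet, without its original classification head.
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # keep all convolutional layers fixed

# Two randomly initialized fully connected layers are retrained on top of the frozen base.
x = GlobalAveragePooling2D()(base.output)
x = Dense(1024, activation="relu")(x)                   # hidden FC layer (size assumed)
outputs = Dense(NUM_CLASSES, activation="softmax")(x)   # class-probability output

model = Model(base.input, outputs)
model.compile(optimizer=RMSprop(learning_rate=1e-3),    # adaptive-learning-rate optimizer
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the convolutional base keeps the ImageNet features intact, so only the randomly initialized FC head has to converge, which is consistent with the quick retraining convergence reported in section 3.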
We did not use any additional image enhancement or pre-processing for the images provided in the datasets. In order to increase the number of training samples, we performed various augmentation operations on the images in the training set. Specifically, we performed horizontal and vertical flipping and a change of brightness in the interval of ±20%.

The initial experimental studies showed that the pre-trained Inception v3 model is able to efficiently extract high-level features from the given medical images, and that it converges quickly during the retraining process with sufficient resulting classification performance (see section 3). However, due to a heavily imbalanced training dataset, and despite the training data augmentation, the detection performance for some classes was not good enough. To solve this issue, we implemented an additional training dataset balancing procedure that equalizes the training set by randomly duplicating the training samples of the under-filled classes, such as instruments, blurry, etc. This nearly doubled the number of training samples, allowing for better classification performance for the classes with a low number of provided images.
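A minimal sketch of the augmentation settings and of the equalization step is given below, assuming the training data is handled as (path, label) pairs; the helper name and the per-class duplication target are assumptions, since the paper only states that equalization nearly doubled the training set.

```python
# Sketch of the training-set preparation: augmentation settings and class equalization
# by random duplication of samples from under-filled classes (target size is an assumption).
import random
from collections import defaultdict
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation as described above: horizontal/vertical flips and a brightness change of +/-20%.
augmenter = ImageDataGenerator(horizontal_flip=True,
                               vertical_flip=True,
                               brightness_range=(0.8, 1.2))

def equalize(samples, target_size=None):
    """Randomly duplicate samples of under-filled classes.

    samples: list of (image_path, class_label) pairs.
    target_size: per-class target; here assumed to be the size of the largest class.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    if target_size is None:
        target_size = max(len(paths) for paths in by_class.values())
    balanced = []
    for label, paths in by_class.items():
        balanced.extend((p, label) for p in paths)
        # Duplicate randomly chosen existing samples until the class reaches the target size.
        balanced.extend((random.choice(paths), label)
                        for _ in range(target_size - len(paths)))
    random.shuffle(balanced)
    return balanced
```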
An additional classifier output post-processing step was implemented in order to address the different importance of the different classes, as stated in the task dataset description [11]. Specifically, we performed a prioritized selection of the resulting output class for each image based on the model's probability output. This was implemented as the selection of the first class with a detection probability higher than a set threshold from the array of classes sorted in order of their importance.
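The selector can be sketched as follows; the paper does not state what happens when no class exceeds the threshold, so falling back to the maximum-probability class is an assumption, as is the name of the priority list.

```python
# Sketch of the prioritized class selector (post-processing of the softmax output).
import numpy as np

def prioritized_class(probs, priority_order, threshold=0.75):
    """probs: softmax probabilities; priority_order: class indices, most important first."""
    # Pick the first class, in priority order, whose probability exceeds the threshold.
    for cls in priority_order:
        if probs[cls] > threshold:
            return cls
    # Assumed fallback: plain maximum-probability decision when no class exceeds the threshold.
    return int(np.argmax(probs))
```

With the threshold set to 0.75 this selector rarely overrides the maximum-probability decision, which matches the identical scores of the prioritized and non-prioritized runs reported in table 1.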
3 RESULTS AND ANALYSIS
For the official task submission creation, two separate models were used, trained on different datasets. The first model was trained on the training set created from the development set using the data augmentation procedure described in section 2. The trained model was used to process the task's test set, and the classification output was post-processed using the prioritized classification selector with four different probability threshold settings, from 0.75 to 0.1, resulting in runs #2–#5. For run #1, we used the maximum-probability selector without class prioritization. The runs produced with the first model were submitted as the speed runs. The second model was trained using the equalized training set, and the five runs generated with the same rules were submitted as the detection runs.

The official evaluation results for all the runs are shown in table 1. As one can see, all the runs significantly outperform the ZeroR and Random baselines and show good classification performance. All the runs that utilize the equalized training set have slightly better classification performance. Surprisingly, the introduced prioritized classification method did not result in improved detection performance, neither for the original nor for the equalized training set. With the threshold of 0.75, the classification performance is equal to that of the non-prioritized runs. This means that the trained classifier is performing as well as it can, and additional re-classification using the class priorities does not make sense for this particular dataset. However, it can still be potentially interesting for bigger datasets or a higher number of classes. The best performing run was detection run #1, generated using the equalized training set and the non-prioritized classifier, with a classification performance of 0.854 for the Rk statistic (MCC for k different classes). The confusion matrix for this run is depicted in table 2, and the class imbalance and the corresponding training and classification challenges can be easily observed. The most challenging class was Instruments, which is mostly caused by the different shapes, positions and visibilities of the instruments in the images. There was also a number of misclassification cases for the Dyed classes as well as for the Esophagitis and Normal Z-line classes.

With respect to the classification performance in terms of processing speed, the proposed classifier can process approximately 43 frames per second on a GPU-enabled consumer-grade personal computer, regardless of whether the post-processing class prioritization is enabled or disabled.

Table 1: Official classification performance evaluation for Detection (D) and Speed (S) runs, including ZeroR (ZR), Random (RD) and True (TR) baseline classifiers, reporting the following cross-class averaged metrics: True Positive or Hit (TP), True Negative or Correct Rejection (TN), False Positive or False Alarm (FP), False Negative or Miss (FN), Recall or Sensitivity or Hit Rate or True Positive Rate (REC), Specificity or True Negative Rate (SPE), Precision or Positive Predictive Value (PRE), Accuracy (ACC), F1-Score (F1), Matthews Correlation Coefficient (MCC), Rk statistic or MCC for k different classes (RK), and Processing Speed in Frames per Second (FPS).

Run  TP   TN    FP   FN   REC    SPE    PRE    ACC    F1     MCC    RK     FPS
D1   474  8122   72   72  0.824  0.991  0.828  0.984  0.815  0.812  0.854  43.1
D2   474  8122   72   72  0.823  0.991  0.828  0.984  0.814  0.811  0.854  43.0
D3   470  8117   76   76  0.817  0.991  0.819  0.983  0.807  0.803  0.845  43.1
D4   440  8087  107  107  0.774  0.987  0.771  0.976  0.756  0.752  0.786  43.2
D5   333  7981  213  213  0.664  0.974  0.646  0.951  0.601  0.605  0.582  43.0
S1   469  8117   77   77  0.765  0.991  0.729  0.982  0.743  0.737  0.844  43.1
S2   469  8117   77   77  0.765  0.991  0.728  0.982  0.743  0.737  0.844  43.1
S3   465  8112   82   82  0.758  0.990  0.722  0.981  0.736  0.729  0.835  42.9
S4   430  8077  117  117  0.709  0.986  0.677  0.973  0.679  0.674  0.766  43.0
S5   313  7960  233  233  0.546  0.971  0.607  0.947  0.504  0.510  0.544  43.3
ZR    34  7681  512  512  0.063  0.938  0.004  0.883  0.007  0.0    0.0    -
RD    35  7682  511  511  0.057  0.938  0.064  0.883  0.055  0.001  0.002  -
TR   546  8193    0    0  1.0    1.0    1.0    1.0    1.0    1.0    1.0    -

Table 2: Confusion matrix for detection run #1 reported in table 1 (rows: actual class, columns: detected class). The classes are Ulcerative Colitis (A), Esophagitis (B), Normal Z-line (C), Dyed and Lifted Polyps (D), Dyed Resection Margins (E), Out of Patient images (F), Normal Pylorus (G), Stool Inclusions (H), Stool Plenty (I), Blurry Nothing of value (J), Polyps (K), Normal Cecum (L), Colon Clear (M), Retroflex Rectum (N), Retroflex Stomach (O) and Instruments (P).

      A    B    C    D    E   F    G    H     I   J    K    L     M    N    O    P
A   459    2    1    1    5   0    1    0    54   0   13   13     1    7    0    7
B     2  388   77    0    0   0    0    0     0   0    0    0     0    0    0    0
C     0  145  451    0    0   0    4    0     0   0    1    0     0    0    0    0
D     0    0    0  406   81   0    0    0     1   0    4    0     0    0    0   26
E     0    0    0  115  462   0    0    0     0   0    0    1     1    1    0   17
F     0    0    0    0    1   2    0    0     0   0    0    0     0    0    0    0
G     3   18   27    0    0   0  548    0     0   0    2    0     2    1    4    1
H    10    1    0    5    2   0    0  498    98   0    3    1    24    0    0    6
I    14    0    0    5    1   0    0    0  1771   0    5    2     1    3    0    7
J     2    0    0    0    0   3    0    1     7  37    0    0     2    1    0    0
K    22    1    6   17    2   0    7    1     8   0  316   14     1    9    0   64
L    19    0    0    2    6   0    1    0    16   0   22  551     8    3    0    4
M     3    0    1    1    0   0    0    6     4   0    5    1  1025    1    0    6
N     8    0    0    3    4   0    0    0     3   0    2    1     0  160    4    8
O     0    1    0    0    0   0    0    0     2   0    0    0     0    5  387    1
P     0    0    0    1    0   0    0    0     1   0    1    0     0    1    2  126

4 CONCLUSIONS AND FUTURE WORK
In this paper, we presented an out-of-the-box solution utilizing a modern pre-trained CNN for the task of medical image classification. The goal was to provide a baseline for the task and to show the performance of basic methods without any deep architecture modification. The best achieved performance was a Matthews correlation coefficient for k different classes (Rk) of 0.854 at a speed of 43 frames per second. This is already a quite good result for an out-of-the-box method.
REFERENCES
[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[2] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In Proc. of ICML, Vol. 32. 647–655.
[3] Yang Liu, Zhonglei Gu, and William K. Cheung. 2017. HKBU at MediaEval 2017 Medico: Medical multimedia task. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[4] Syed Sadiq Ali Naqvi, Shees Nadeem, Muhammad Zaid, and Muhammad Atif Tahir. 2017. Ensemble of Texture Features for Finding Abnormalities in the Gastro-Intestinal Tract. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[5] Stefan Petscharnig and Klaus Schöffmann. 2018. Learning laparoscopic video shot classification for gynecological surgery. Multimedia Tools and Applications 77, 7 (2018), 8061–8079.
[6] Stefan Petscharnig, Klaus Schöffmann, and Mathias Lux. 2017. An Inception-like CNN Architecture for GI Disease and Anatomical Landmark Classification. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[7] Konstantin Pogorelov, Sigrun Losada Eskeland, Thomas de Lange, Carsten Griwodz, Kristin Ranheim Randel, Håkon Kvale Stensland, Duc-Tien Dang-Nguyen, Concetto Spampinato, Dag Johansen, Michael Riegler, and others. 2017. A holistic multimedia system for gastrointestinal tract disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 112–123.
[8] Konstantin Pogorelov, Kristin Ranheim Randel, Thomas de Lange, Sigrun Losada Eskeland, Carsten Griwodz, Dag Johansen, Concetto Spampinato, Mario Taschwer, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Nerthus: A Bowel Preparation Quality Video Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSYS). ACM, 170–174.
[9] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, and others. 2017. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSYS). ACM, 164–169.
[10] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Eskeland, Duc-Tien Dang-Nguyen, Olga Ostroukhova, and others. 2017. A comparison of deep learning with global features for gastrointestinal disease detection. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[11] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018. In Working Notes Proceedings of the MediaEval 2018 Workshop.
[12] Michael Riegler, Konstantin Pogorelov, Pål Halvorsen, Carsten Griwodz, Thomas de Lange, Kristin Ranheim Randel, Sigrun Eskeland, Duc-Tien Dang-Nguyen, Mathias Lux, and others. 2017. Multimedia for medicine: the Medico Task at MediaEval 2017. In Working Notes Proceedings of the MediaEval 2017 Workshop (MediaEval 2017).
[13] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567 (2015).
[14] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4, 2 (2012).