An Application of Residual Network and Faster R-CNN for Medico: Multimedia Task at MediaEval 2018

Trung-Hieu Hoang¹, Hai-Dang Nguyen², Thanh-An Nguyen¹, Vinh-Tiep Nguyen³, Minh-Triet Tran¹
¹ Faculty of Information Technology, University of Science, VNU-HCM, Vietnam
² Eurecom, France
³ University of Information Technology, VNU-HCM, Vietnam
{hthieu,ntan}@selab.hcmus.edu.vn, nguyenhd@eurecom.fr, tiepnv@uit.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
The Medico: Multimedia Task focuses on developing an efficient framework for predicting and classifying abnormalities in endoscopic images of the gastrointestinal (GI) tract. We present the HCMUS team's approach, which employs a combination of a Residual Neural Network and a Faster R-CNN model to classify endoscopic images. We submit multiple runs with different modifications of the parameters in our combined model. Our methods show promising results in our experiments.

1 INTRODUCTION
The Medico: Multimedia Task at MediaEval 2018 [4] aims to bring new achievements in computer vision, image processing and machine learning to the next level of computer- and multimedia-assisted diagnosis. The goal of the challenge is to predict abnormalities and diseases efficiently with as little training data as possible [5]. The task organizers also provide a priority list for the classes to accommodate the single-class classification challenge. This leads to some modifications of our model, which are described in detail in Section 3.

In our approach, we introduce a stacked model consisting of two deep networks: a Residual Neural Network (Resnet) [2] followed by a Faster Region-based Convolutional Neural Network (Faster R-CNN) [7]. Since Resnet mostly focuses on deep global features of an image, it fails to classify images in which symptoms of diseases or instruments appear as small objects on diverse backgrounds. This is the reason for using Faster R-CNN to re-classify the images of the classes that Resnet usually misclassifies.

2 RELATED WORK
In the field of medical image processing, deep neural networks have been used to solve several problems related to endoscopic images of the gastrointestinal (GI) tract. In particular, for localizing and identifying polyps under real-time constraints, deep CNNs have recently shown impressive potential, achieving up to 96.4% accuracy, as published in 2018 by Urban et al. [9]. Another interesting article, by Shichijo et al. [8], applies multiple deep CNNs to diagnose Helicobacter pylori gastritis based on endoscopic images. Further, gastrointestinal bleeding detection using deep CNNs on endoscopic images has been successfully demonstrated by Jia et al. [3].

3 APPROACH
3.1 Dataset Preparation
3.1.1 Disease region localization. For the Faster R-CNN model to be trained, objects in the image have to be tagged with bounding boxes and passed to the model as input. We annotate the signals of disease in all images of the following classes: dyed-resection-margins, dyed-lifted-polyps, instruments and polyps.

3.1.2 Re-labeling the Medico development dataset. After training with the development set, we find some training samples whose labels are inappropriate according to the priority list. Therefore, so that our model learns with the least confusion, we apply new labels, predicted by the trained model, to these images.

3.1.3 Instruments dataset augmentation. Instruments, the second-highest-priority class, has only 36 images in the development set, with limited background context. To maintain the balance between all of the classes and to improve the diversity of the instruments images, we generate more images for this class from the given development set by placing instruments in the foreground of other diseases' backgrounds. Among the 36 instruments images, we carefully select 24 and crop the instruments along their edges. Then, we randomly select 20% of the images from the dyed-lifted-polyps, dyed-resection-margins and ulcerative-colitis classes, and use them as backgrounds for the cropped instruments. With this method, we are able to generate more than 800 images for the instruments class.
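The following sketch illustrates the cut-and-paste augmentation described in 3.1.3. It is a minimal Python example rather than the authors' actual pipeline: the directory layout, the scale range and the assumption that the cropped instruments are stored as RGBA cut-outs with an alpha channel are all illustrative.

```python
# Sketch of the instrument cut-and-paste augmentation (Section 3.1.3).
# Paths and parameters are illustrative assumptions. Requires Pillow.
import random
from pathlib import Path
from PIL import Image

INSTRUMENT_DIR = Path("crops/instruments")   # 24 hand-cropped RGBA cut-outs
BACKGROUND_DIR = Path("backgrounds")         # 20% sample of the three classes
OUTPUT_DIR = Path("augmented/instruments")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

crops = sorted(INSTRUMENT_DIR.glob("*.png"))
backgrounds = sorted(BACKGROUND_DIR.glob("*.jpg"))

for i in range(800):  # the paper reports generating 800+ composites
    bg = Image.open(random.choice(backgrounds)).convert("RGB")
    fg = Image.open(random.choice(crops)).convert("RGBA")

    # Randomly rescale the instrument relative to the background width.
    scale = random.uniform(0.3, 0.7)
    w = int(bg.width * scale)
    h = int(fg.height * w / fg.width)
    fg = fg.resize((w, h))

    # Paste at a random position, using the alpha channel as the mask
    # so only the instrument pixels overwrite the background.
    x = random.randint(0, max(0, bg.width - w))
    y = random.randint(0, max(0, bg.height - h))
    bg.paste(fg, (x, y), mask=fg)

    bg.save(OUTPUT_DIR / f"instrument_aug_{i:04d}.jpg")
```

Compositing through the alpha mask keeps the surrounding disease background intact, which is how roughly 24 crops combined with a 20% background sample can expand to more than 800 images.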
3.2 Method
3.2.1 Fine-tuning deep neural networks for medical images. In our approach, both a Residual Network with 101 layers and a Faster R-CNN [1] (both pre-trained on ImageNet) are fine-tuned on our modified development dataset. When using convolutional neural networks for medical images, transferring knowledge from natural images to medical images is possible, even though there is a large difference between the source and target domains. It is especially useful in the case of the small image dataset provided [6]. Our experimental results also support this idea: fine-tuning the ImageNet pre-trained model significantly improves the efficiency of the classification model.

3.2.2 First run. The Residual network with 101 layers is fine-tuned on the original development set provided by the task organizers, together with our augmented instruments dataset. After passing through Resnet101, images classified into certain special classes become the input of the Faster R-CNN network, which is trained to detect instruments in images. Two cases are handled, as shown in the sketch after this list:
• First case: Images predicted as instruments by Resnet101 are double-checked. If Faster R-CNN does not detect instruments in those images, they are re-labeled with the class that has the second-highest score proposed by Resnet101.
• Second case: Images predicted as dyed-lifted-polyps, dyed-resection-margins or ulcerative-colitis by Resnet101 are fed through the Faster R-CNN network to detect instruments. They are classified as instruments if one is detected; otherwise they keep the original prediction.
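To make the two-stage decision rule concrete, here is a schematic sketch. The functions `resnet_probs` and `detect_instruments` are hypothetical stand-ins for the fine-tuned Resnet101 classifier and the instrument-trained Faster R-CNN detector; only the branching logic follows the paper.

```python
# Schematic sketch of the Run01 decision rule (Section 3.2.2).
# `resnet_probs(image)` is assumed to return a dict mapping the 16
# class names to softmax scores; `detect_instruments(image)` is
# assumed to return True iff Faster R-CNN finds an instrument box.

DOUBLE_CHECK = {"dyed-lifted-polyps", "dyed-resection-margins",
                "ulcerative-colitis"}

def classify(image, resnet_probs, detect_instruments):
    probs = resnet_probs(image)
    ranked = sorted(probs, key=probs.get, reverse=True)
    label = ranked[0]

    # First case: verify an "instruments" prediction with the detector;
    # if no instrument is found, fall back to the second-best class.
    if label == "instruments":
        return label if detect_instruments(image) else ranked[1]

    # Second case: classes that often co-occur with instruments are
    # re-checked; a positive detection overrides Resnet101's label.
    if label in DOUBLE_CHECK and detect_instruments(image):
        return "instruments"

    return label
```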
3.2.3 Second run. Feeding a large number of images from the three classes through Faster R-CNN creates a bottleneck in inference time, as Faster R-CNN has high time complexity. Therefore, in this second run, we limit the images passed through Faster R-CNN by performing only the first case of the first run.

3.2.4 Third run. The configuration of the third run is the same as the second run. Instead of the original training set mentioned in the first run, we train our model on the re-labeled development set combined with the augmented instruments set.

3.2.5 Fourth run. In this run, we reduce the number of images used for training by randomly selecting 75% of the images of each class from the same training set as the third run. The other processing steps are configured in the same way.

3.2.6 Fifth run. Throughout our experiments, normal-z-line and esophagitis are the most confusing classes to distinguish, not only for Resnet101 but also for humans. In the priority list, esophagitis has a higher rank than normal-z-line. Thus, after evaluating our model on the development dataset several times, we propose a condition for these two classes when they are predicted by Resnet101. As Resnet101 provides a probability distribution over the 16 classes for each image, whenever normal-z-line appears as the highest-scoring class, we add a small bias of 0.3 to the probability of esophagitis. Hence, the model is more likely to emit the esophagitis class. Intuitively, this means that our model prefers esophagitis over normal-z-line when it is confused between these two classes.
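A minimal sketch of the fifth run's bias condition follows, assuming `probs` is Resnet101's softmax output as a mapping from the 16 class names to scores; the 0.3 bias is the value reported above.

```python
# Sketch of the Run05 esophagitis bias (Section 3.2.6). `probs` is
# assumed to map the 16 class names to Resnet101 softmax scores.
ESOPHAGITIS_BIAS = 0.3  # bias value reported in the paper

def apply_esophagitis_bias(probs):
    biased = dict(probs)
    top = max(biased, key=biased.get)
    # Intervene only when normal-z-line wins the argmax: boost the
    # higher-priority esophagitis class so near-ties flip toward it.
    if top == "normal-z-line":
        biased["esophagitis"] += ESOPHAGITIS_BIAS
    return max(biased, key=biased.get)
```

Because the bias is added only after normal-z-line has won the argmax, predictions for all other classes are left untouched.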
4 RESULTS
Table 1: Official evaluation results for both sub-tasks (provided by the organizers) and speed in frames per second (FPS) on a Tesla K80 GPU; metric values are in percent.

RunID   PREC    REC     ACC     F1      MCC     RK      FPS
Run01   94.245  94.245  99.281  94.245  93.861  93.590   6.589
Run02   93.959  93.959  99.245  93.959  93.556  93.273  23.191
Run03   94.600  94.600  99.325  94.600  94.240  93.987  23.148
Run04   93.043  93.043  99.130  93.043  92.579  92.257  22.654
Run05   94.508  94.508  99.314  94.508  94.142  93.884  21.413

[Figure 1: Confusion matrix of our best run (Run03).]

There is a trade-off between speed and accuracy when comparing the results of Run01 and Run02. In Run02, we reduce the large number of images passing through Faster R-CNN for the sake of time, so its performance is slightly worse than Run01's.

As we mentioned earlier in Section 3, data pre-processing plays an important role in building a deep neural network model. Throughout our experiments, in the case of little training data, the augmented dataset helps us improve the performance of the deep neural network model. Run03 and Run05 show impressive results compared to the first two runs. This implies that training on our re-labeled development set provides better models.

On the other hand, the Residual neural network cannot efficiently separate the two classes esophagitis and normal-z-line. The same problem also occurs between the dyed-resection-margins and dyed-lifted-polyps classes. This can be observed in the confusion matrices of the two pairs (Figure 1). These two confusions are the main sources of negative impact on our results. Additionally, as mentioned in Section 3, the configuration of Run05 intuitively prefers esophagitis over normal-z-line, which may lead to an increase in false-positive cases in the result.

Compared to the others, Run04 has the lowest precision since it uses only 75% of the training data. Decreasing the number of training samples naturally affects the performance of deep-learning models. Nevertheless, the result is still acceptable: it drops by only a few percentage points while the configuration is the same as Run03's. This suggests that we could even reduce the data by up to 50% when shorter training time is preferred over accuracy.

5 CONCLUSION AND FUTURE WORKS
Medico image classification is a challenging problem because of fine-grained images, little training data and the requirement of high accuracy. In our current approach, we focus on training a combination of a Residual Neural Network and Faster R-CNN with different modifications of the training set. Additionally, an object detection method is applied to detect small symptoms of diseases, which provide useful evidence for the classification task. The accuracy and inference time that we reach are acceptable and appropriate for real-time constraints. However, for future work, we need a more robust approach to exploit the distinction between easily confused classes, e.g., esophagitis and normal-z-line, or dyed-lifted-polyps and dyed-resection-margins.

REFERENCES
[1] Xinlei Chen and Abhinav Gupta. 2017. An Implementation of Faster RCNN with Study for Region Sampling. arXiv preprint arXiv:1702.02138 (2017).
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.90
[3] Xiao Jia and Max Q.-H. Meng. 2016. A deep convolutional neural network for bleeding detection in Wireless Capsule Endoscopy images. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). https://doi.org/10.1109/embc.2016.7590783
[4] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018. In Proceedings of the MediaEval 2018 Workshop.
[5] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. KVASIR: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17). ACM, New York, NY, USA, 164–169. https://doi.org/10.1145/3083187.3083212
[6] Adnan Qayyum, Syed Anwar, Muhammad Majid, Muhammad Awais, and Majdi Alnowami. 2017. Medical Image Analysis using Convolutional Neural Networks: A Review. 42 (Sept. 2017).
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 91–99.
[8] Satoki Shichijo, Shuhei Nomura, Kazuharu Aoyama, Yoshitaka Nishikawa, Motoi Miura, Takahide Shinagawa, Hirotoshi Takiyama, Tetsuya Tanimoto, Soichiro Ishihara, Keigo Matsuo, and Tomohiro Tada. 2017. Application of Convolutional Neural Networks in the Diagnosis of Helicobacter pylori Infection Based on Endoscopic Images. EBioMedicine 25 (Nov. 2017), 106–111. https://doi.org/10.1016/j.ebiom.2017.10.014
[9] Gregor Urban, Priyam Tripathi, Talal Alkayali, Mohit Mittal, Farid Jalali, William Karnes, and Pierre Baldi. 2018. Deep Learning Localizes and Identifies Polyps in Real Time With 96% Accuracy in Screening Colonoscopy. Gastroenterology 155, 4 (2018). https://doi.org/10.1053/j.gastro.2018.06.037