<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PATCH-BASED DEEP LEARNING APPROACHES FOR ARTEFACT DETECTION OF ENDOSCOPIC IMAGES</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaohong W. Gao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Qian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cortexcia Vision System Limited</institution>
          ,
          <addr-line>London SE1 9LQ</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Middlesex University</institution>
          ,
          <addr-line>London, NW4 4BT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our contribution to the EAD2019 competition. For segmentation (task 2) of five types of artefact, a patch-based fully convolutional network (FCN) coupled with a support vector machine (SVM) classifier is implemented, aiming to cope with small data sets (i.e., hundreds of images) and the characteristics of endoscopic images, in which artefacts (e.g. bubbles, specularity) occupy only limited regions. In comparison with a conventional CNN and other state-of-the-art approaches (e.g. DeepLab) applied to whole images, this patch-based FCN achieves the best results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Index Terms</title>
      <p>Endoscopic images, Deep neural networks, Decoder-encoder neural networks</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>This paper details our participation in the Endoscopic artefact
detection challenge (EAD2019) [1, 2], which comprises three
tasks: detection (task #1), segmentation (task #2) and
generalization (task #3). All three tasks are performed using
current state-of-the-art deep learning techniques with a number
of enhancements. For example, for segmentation (task #2), a
patch-based approach is applied: each image is divided into
25 (5×5) non-overlapping patches of equal size. Then, based on
the contents of the corresponding masks, only patches with
non-zero masks are selected for training, to limit the inclusion
of background information. Each class is first trained
individually. Then, upon the last layer of receptive fields, the
features from the five classes are trained together using an SVM
to further differentiate the subtle changes between the five
classes.</p>
      <p>For detection of bounding boxes (tasks #1 and #3), while
the above patch-based approach delivers good segmentations,
the bounding boxes derived from those segments do not agree
well with the ground truth, yielding null IoU values. Hence the
state-of-the-art faster-RCNN model with a ResNet101 backbone,
built upon the TensorFlow framework, has been applied, which
gives a ranking of 12th on the leaderboard. In addition, the
YOLOv3 model [3], using Darknet, is also evaluated; it delivers
detection ranks between 17 and 21 depending on the selected
threshold (0.5 or 0.1).</p>
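      <p>As a minimal illustration of this failure mode, a bounding box can be derived from the nonzero pixels of a predicted mask and compared with the ground-truth box via intersection over union (IoU); an empty mask yields no box at all, which is one way a null IoU can arise. The sketch below assumes binary NumPy masks and axis-aligned (x0, y0, x1, y1) boxes; the function names are illustrative, not from the challenge code.</p>
      <preformat><![CDATA[
```python
import numpy as np

def mask_to_bbox(mask):
    """Return (x0, y0, x1, y1) enclosing all nonzero pixels, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # empty prediction: no box exists, so IoU is undefined (null)
    return (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)
```
]]></preformat>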
      <p>Fig. 1: The steps applied in the proposed patch-based
segmentation.</p>
    </sec>
    <sec id="sec-3">
      <title>2. METHOD</title>
    </sec>
    <sec id="sec-4">
      <title>2.1. Segmentation</title>
      <p>Before training, each image undergoes a pre-processing stage
in which it is divided into 25 (5×5) small patches of equal size.
As a result, the training samples have widths and heights
varying from 60 to 300 pixels. Patches whose corresponding
masks have zero content are removed from the training set to
limit the influence of the background.</p>
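      <p>A minimal NumPy sketch of this pre-processing step, assuming binary masks and image dimensions divisible by the grid size (the function name is illustrative, not from our released code):</p>
      <preformat><![CDATA[
```python
import numpy as np

def extract_patches(image, mask, grid=5):
    """Split image/mask into grid x grid non-overlapping patches and keep
    only patches whose mask contains at least one labelled pixel."""
    h, w = mask.shape[:2]
    ph, pw = h // grid, w // grid  # patch size follows the image size (60-300 px here)
    kept = []
    for i in range(grid):
        for j in range(grid):
            m = mask[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            if m.any():  # discard all-background patches
                kept.append((image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw], m))
    return kept
```
]]></preformat>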
      <p>For segmentation, the training applies a conventional
fully convolutional network (FCN) [4, 5, 6] built upon
Matconvnet (https://github.com/vlfeat/matconvnet-fcn), starting
from the imageNet-vgg-verydeep-16 model. To minimise the
influence of overlapping segments, instead of training all the
classes collectively, this study trains each segmentation task
individually. The final mask for each image is then the
integration of the five individual segmentation masks after
fine-tuning using an SVM. In other words, the last layer of
features from each model is collected first; then an SVM
classifier is applied to fine-tune each segmentation class and
further differentiate the classes. Figure 1 illustrates the
proposed approach. First, each of the five classes is trained on
patches independently, to take account of overlapping classes.
Then, upon the connected features of all five classes, an SVM
classifier is trained to highlight the distinctions between the
classes. This classifier performs the final segmentation for each
of the five categories, i.e. instrument, specularity, artefact,
bubbles, and saturation. In addition, two other popular models
are evaluated: DeepLab [7] and patch-based pixel labeling [8],
which labels every pixel based on the classification result of
the patch centred on it. Table 1 presents the outcome from the
EAD2019 leaderboard after uploading the result obtained from
each deep learning model, where our patch-based FCN delivers
the best F2 and semantic scores.</p>
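      <p>The SVM fusion stage can be sketched as follows. The per-class FCN feature extraction cannot be reproduced in a few lines, so separable synthetic vectors stand in for the collected last-layer features, and scikit-learn's LinearSVC plays the role of the SVM classifier; this illustrates the idea only, not our training code.</p>
      <preformat><![CDATA[
```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in for last-layer FCN features: one feature vector per patch.
# In the pipeline these come from the five per-class FCN models; here
# five well-separated synthetic clusters simulate the five classes.
n_per_class, dim, n_classes = 20, 64, 5
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# SVM fine-tuning stage: a linear SVM over the pooled features
# sharpens the distinction between the five artefact classes.
clf = LinearSVC(C=1.0).fit(X, y)
acc = clf.score(X, y)
```
]]></preformat>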
      <p>
        Figure 2 demonstrates the steps taken when applying
DeepLab V3 using the TensorFlow model [
        <xref ref-type="bibr" rid="ref1">9</xref>
        ]. Similarly, Figure 3 shows the procedure when using the
patch-based classification model in Caffe. The patch size is
selected to be 32×32.
      </p>
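      <p>The patch-based pixel labeling scheme [8] amounts to classifying, for every pixel, the patch centred on it. A minimal sketch, with a placeholder classify_patch standing in for the trained Caffe classifier:</p>
      <preformat><![CDATA[
```python
import numpy as np

def label_pixels(image, classify_patch, patch=32):
    """Label every pixel with the class of the patch x patch window
    centred on it; classify_patch is the trained patch classifier."""
    half = patch // 2
    padded = np.pad(image, half, mode='edge')  # pad so border pixels get full windows
    out = np.zeros(image.shape[:2], dtype=int)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            out[y, x] = classify_patch(padded[y:y + patch, x:x + patch])
    return out
```
]]></preformat>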
    </sec>
    <sec id="sec-5">
      <title>Table 1: Segmentation results on the EAD2019 leaderboard</title>
      <table-wrap id="tab1">
        <table>
          <thead>
            <tr>
              <th>Model</th>
              <th>F2-score</th>
              <th>Semantic score</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Patch-based labeling</td>
              <td>0.2300</td>
              <td>0.2155</td>
            </tr>
            <tr>
              <td>deepLab</td>
              <td>0.1638</td>
              <td>0.1872</td>
            </tr>
            <tr>
              <td>Patch-based FCN</td>
              <td>0.2354</td>
              <td>0.2434</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-8">
      <title>2.2. Detection of artefact</title>
      <p>While the above patch-based segmentation model performs
well for segmentation, when it comes to detecting the bounding
boxes of the intended segments, for reasons as yet unknown, the
detected IoU values appear to be null. Hence, given the time
constraints, a number of existing state-of-the-art models are
evaluated, comprising fast-rcnn-nas and faster-rcnn-resnet101
[5] using TensorFlow
(https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md),
and YOLOv3 [3] using Darknet. Table 2 presents the evaluation
results of these models.</p>
    </sec>
    <sec id="sec-9">
      <title>3. RESULTS</title>
      <p>Table 2 presents the evaluation results of the above models.
The faster-rcnn-resnet101 model with a threshold of 0.3 appears
to perform the best and is the one submitted to the EAD2019
leaderboard, with a rank of 12. Figure 4 demonstrates the
comparative results of the above four models on two images.
Figure 5 compares the generalization results (task #3) between
the fast-rcnn-nas (top) and faster-rcnn-resnet101 (bottom)
models.</p>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSION AND DISCUSSION</title>
      <p>Taking part in the EAD2019 competition has been a very
enjoyable experience. Due to our late participation (two weeks
before the initial deadline), several ideas could not be fully
implemented. However, the final position of 12th is better than
expected, which is quite uplifting. After initial evaluation of
existing models (both in-house and in the public domain), it is
found that no model performs significantly better than the
others. A semi-supervised approach, coupled with clinical
knowledge, is recommended.</p>
      <p>Our contribution includes patch-based training. While several
existing models incorporate regions of interest for training,
some regions appear to be overwhelmingly larger than the
intended targets (&gt; 95%), hence introducing too much
background information and leaving the sampling distribution
substantially unbalanced. Because the training images vary in
size, from 300 to 1400 pixels along both width and height, a
fixed patch size may cause under- or over-sampling. Hence, in
this study, for segmentation (task #2) each image is divided
into 25 equally sized non-overlapping patches, which appears to
give good segmentation results. However, it is foreseen that
sampling with overlapping regions might deliver even better
results, which will be investigated in the future. Figure 6
depicts the learning information of whole-image-based (top) and
patch-based segmentation, as well as whole-image-based
detection (bottom).</p>
      <p>Regarding the detection tasks utilising existing models,
the challenge is to find the right threshold for the
probabilities from the last fully connected layer. Higher
thresholds might miss some intended regions, whereas lower
thresholds tend not only to over-segment but also to report some
regions several times. For example, when delineating one single
high-contrast region from one test image using the YOLOv3 model
[3], a lower threshold (0.4) delivers three bounding boxes, each
bigger one surrounding a smaller one, as illustrated in Figure
7. In summary, for medical images, medical knowledge needs to
be incorporated in order to generate more accurate results.</p>
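      <p>A common remedy for such nested duplicate boxes, though not necessarily the one used in our submission, is greedy non-maximum suppression: keep the highest-scoring box and discard any remaining box that overlaps it beyond an IoU threshold. A minimal sketch:</p>
      <preformat><![CDATA[
```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x0, y0, x1, y1) boxes.

    Returns the indices of the kept boxes, best score first."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / float(union)

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        # keep box i only if it does not heavily overlap an already-kept box
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```
]]></preformat>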
    </sec>
    <sec id="sec-11">
      <title>5. REFERENCES</title>
      <p>Fig. 6: Learning information for segmentation based on whole
image (left), patch (middle) and detection based on whole image
(right).</p>
      <p>[8] Andrew Janowczyk, Scott Doyle, Hannah Gilmore, and
Anant Madabhushi, “A resolution adaptive deep hierarchical
(RADHicaL) learning scheme applied to nuclear segmentation of
digital pathology images,” CMBBE: Imaging &amp; Visualization,
vol. 6, no. 3, pp. 270–276, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Liang-Chieh</given-names>
            <surname>Chen</surname>
          </string-name>
          , Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, “
          <article-title>Encoder-decoder with atrous separable convolution for semantic image segmentation</article-title>
          ,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14,
          <year>2018</year>
          , Proceedings, Part VII, pp.
          <fpage>833</fpage>
          -
          <lpage>851</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>