<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>DEEP LAYER AGGREGATION APPROACHES FOR REGION SEGMENTATION OF ENDOSCOPIC IMAGES</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qingtian Ning</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xu Zhao</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jingyi Wang</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <institution>Department of Automation, Shanghai Jiao Tong University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our approaches to the EAD2019 competition. For multi-class region segmentation (task 2), we utilize a deep layer aggregation network, which achieves better results than U-Net. To cover the remaining tasks of the competition, we employ the Cascade R-CNN framework for multi-class artefact detection (task 1) and multi-class artefact generalization (task 3).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In this paper, we introduce in detail our methods and results for the Endoscopic Artefact Detection challenge (EAD2019) [<xref ref-type="bibr" rid="ref2 ref3">1, 2</xref>]. The competition consists of three tasks: artefact detection (task 1), region segmentation (task 2) and generalization (task 3). Task 1 aims at the localization of bounding boxes and class labels for 7 artefact classes in given frames. Task 2 requires the algorithm to obtain a precise boundary delineation of the detected artefacts. Task 3 aims to verify that the detection performance is independent of the specific data type and source.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DETAILS ON OUR METHOD</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Detection and generalisation tasks</title>
      <sec id="sec-3-1">
        <title>2.1.1. Cascade R-CNN</title>
        <p>In object detection, an intersection-over-union (IoU) threshold is needed to define positives and negatives. An object detector trained with a low IoU threshold, e.g. 0.5, usually generates noisy detections, but detection performance degrades as the IoU threshold is increased [<xref ref-type="bibr" rid="ref4">3</xref>]. Cascade R-CNN was therefore proposed to address two problems: 1) over-fitting during training, due to exponentially vanishing positive samples, and 2) the inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses [<xref ref-type="bibr" rid="ref4">3</xref>]. It consists of a sequence of detectors trained with increasing IoU thresholds, so that successive stages are more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of one detector forms a good distribution for training the next, higher-quality detector [<xref ref-type="bibr" rid="ref4">3</xref>]. We apply the Cascade R-CNN framework directly and use an L1 loss to optimize the network.</p>
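        <p>To make the cascade idea concrete, the following Python sketch shows one way the stage-wise training could be organized. It is a minimal sketch, not the actual framework code we used; the stage heads and the assign_labels, apply_deltas and loss helpers are hypothetical placeholders.</p>
        <preformat>
# Sketch of the Cascade R-CNN training idea: each stage refines the boxes
# produced by the previous stage and redefines positives/negatives with a
# stricter IoU threshold, so later stages see higher-quality proposals.
STAGE_IOU_THRESHOLDS = [0.5, 0.6, 0.7]  # one threshold per stage, increasing

def cascade_forward(proposals, features, stage_heads):
    losses = []
    boxes = proposals
    for head, iou_thr in zip(stage_heads, STAGE_IOU_THRESHOLDS):
        # positives/negatives are re-assigned at every stage with a higher bar
        labels, reg_targets = head.assign_labels(boxes, iou_thr)
        cls_scores, box_deltas = head(features, boxes)
        losses.append(head.loss(cls_scores, box_deltas, labels, reg_targets))
        boxes = head.apply_deltas(boxes, box_deltas)  # input to the next stage
    return boxes, sum(losses)
</preformat>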
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2. Region segmentation</title>
      <sec id="sec-4-1">
        <title>2.2.1. Deep Layer Aggregation</title>
        <p>Visual recognition requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse [<xref ref-type="bibr" rid="ref5">4</xref>]. Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where [<xref ref-type="bibr" rid="ref5">4</xref>]. Deep layer aggregation (DLA) structures iteratively and hierarchically merge the feature hierarchy to yield networks with better accuracy and fewer parameters [<xref ref-type="bibr" rid="ref5">4</xref>]. For region segmentation, we use the provided DLA-60 model. In addition, we apply post-processing, such as a fully connected conditional random field [<xref ref-type="bibr" rid="ref6">5</xref>], to refine the segmentation results. One complication is that, in the ground-truth labels, a single pixel can correspond to multiple categories. To avoid this, we apply a simple procedure that assigns each pixel to exactly one class, sketched below.</p>
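        <p>As a minimal sketch, assuming the per-class ground-truth masks are stacked as a binary array of shape (C, H, W), the collapsing step could look as follows; collapse_masks and its tie-breaking rule (the first overlapping class wins) are illustrative assumptions, not necessarily the exact procedure we used.</p>
        <preformat>
import numpy as np

def collapse_masks(masks):
    # masks: (C, H, W) binary array, one channel per foreground class.
    # Pixels covered by no mask get label 0 (background); pixels covered by
    # several masks are resolved deterministically by np.argmax (first class).
    background = (masks.sum(axis=0, keepdims=True) == 0).astype(masks.dtype)
    return np.argmax(np.concatenate([background, masks], axis=0), axis=0)
</preformat>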
        <p>
          To overcome the class imbalance problem, we propose a weighted multi-class dice loss as the segmentation loss:
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math><![CDATA[\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2 \sum_{c=1}^{C} w_c \hat{Y}_{nc} Y_{nc}}{\sum_{c=1}^{C} w_c \left( \hat{Y}_{nc} + Y_{nc} \right)},]]></tex-math>
          </disp-formula>
          where Ŷ_nc denotes the predicted probability of pixel n belonging to class c (i.e. background, instrument, specularity, artifact, bubbles, saturation), Y_nc denotes the ground-truth probability, and w_c denotes a class-dependent weighting factor. Empirically, we set the weights to 1 for background, 1.5 for instrument, 2.5 for specularity, 2 for artifact, 2.5 for bubbles and 2 for saturation.</p>
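        <p>A minimal PyTorch sketch of the loss in Eq. (1) is given below, assuming probs holds per-class probabilities of shape (N, C, H, W) (e.g. after a softmax) and target is a one-hot mask of the same shape; weighted_dice_loss and the small eps term added for numerical stability are illustrative choices.</p>
        <preformat>
import torch

# class order follows the text: background, instrument, specularity,
# artifact, bubbles, saturation
CLASS_WEIGHTS = torch.tensor([1.0, 1.5, 2.5, 2.0, 2.5, 2.0])

def weighted_dice_loss(probs, target, weights=CLASS_WEIGHTS, eps=1e-6):
    dims = (0, 2, 3)  # sum over batch and spatial dimensions, keep classes
    inter = (weights * (probs * target).sum(dim=dims)).sum()
    denom = (weights * (probs + target).sum(dim=dims)).sum()
    return 1.0 - 2.0 * inter / (denom + eps)
</preformat>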
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS</title>
    </sec>
    <sec id="sec-6">
      <title>3.1. Detection and generalisation tasks</title>
      <p>For the detection and generalisation tasks, experiments are built with the Caffe framework on a single NVIDIA TITAN X GPU. We use the Adam optimizer with a learning rate of 6.25 × 10<sup>−4</sup> and a weight decay of 0.0001 for 250,000 iterations with batch size 1. Table 1 shows the evaluation results on EAD2019 detection and Table 2 shows the score gap for the generalisation task. To our surprise, our detection algorithm shows good generalization performance.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Results on EAD2019 detection and generalisation.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Method</th><th>Score_d</th><th>IoU_d</th><th>mAP_d</th></tr>
          </thead>
          <tbody>
            <tr><td>Cascade R-CNN</td><td>0.2330</td><td>0.1222</td><td>0.3068</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-7">
      <title>3.2. Region segmentation</title>
      <p>For region segmentation, experiments are built with the PyTorch framework on two NVIDIA 1080 Ti GPUs. We use the SGD optimizer with a weight decay of 0.0001, adopt the poly learning rate schedule (1 − epoch/total_epoch) with momentum 0.9, and train the model for 200 epochs with batch size 64. The starting learning rate is 0.01 and the crop size is chosen to be 256. Table 3 shows the evaluation results on EAD2019 region segmentation, and Table 4 shows the comparison of U-Net [<xref ref-type="bibr" rid="ref7">6</xref>] and DLA on our validation set.</p>
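      <p>The poly schedule can be reproduced with PyTorch's LambdaLR, as in the minimal sketch below; the stand-in model and the linear form of the decay (the text does not state a poly exponent) are our assumptions.</p>
      <preformat>
import torch

model = torch.nn.Conv2d(3, 6, kernel_size=3)  # stand-in for the DLA-60 model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)

TOTAL_EPOCHS = 200
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / TOTAL_EPOCHS)

for epoch in range(TOTAL_EPOCHS):
    # ... one training epoch over the 256-crop batches ...
    scheduler.step()  # scales the base lr by (1 - epoch/total_epoch)
</preformat>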
      <p>Overall, EAD2019 is a very meaningful competition, and we gained a lot in the process of completing it. In the end we ranked 20th, 11th and 3rd for detection, segmentation and generalization, respectively; the final result exceeded our expectations. Of course, our approach still has shortcomings. For example, for the segmentation task we force each pixel to correspond to only one class, which leads to some holes in the results. In addition, we could further employ data augmentation. All in all, we still have a lot to improve.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>Table 1. Results on EAD2019 detection and generalisation</article-title>
          .
          <source>Method Scored IoUd mAPd Cascade R-CNN 0.2330 0.1222 0</source>
          .
          <fpage>3068</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sharib</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <surname>Felix Zhou</surname>
          </string-name>
          , Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnires, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, “
          <article-title>Endoscopy artifact detection (EAD 2019) challenge dataset</article-title>
          ,
          <source>” CoRR</source>
          , vol. abs/
          <year>1905</year>
          .03209,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sharib</given-names>
            <surname>Ali</surname>
          </string-name>
          , Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, “
          <article-title>A deep learning framework for quality assessment and restoration in video endoscopy,” CoRR</article-title>
          , vol. abs/
          <year>1904</year>
          .07073,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Zhaowei</given-names>
            <surname>Cai</surname>
          </string-name>
          and Nuno Vasconcelos, “
          <string-name>
            <surname>Cascade</surname>
          </string-name>
          r-cnn:
          <article-title>Delving into high quality object detection,”</article-title>
          <source>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>6154</fpage>
          -
          <lpage>6162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Fisher</given-names>
            <surname>Yu</surname>
          </string-name>
          , Dequan Wang,
          <string-name>
            <surname>Evan Shelhamer</surname>
          </string-name>
          , and Trevor Darrell, “Deep layer aggregation,”
          <source>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2403</fpage>
          -
          <lpage>2412</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Kra</surname>
          </string-name>
          <article-title>¨henbu¨hl and Vladlen Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,”</article-title>
          <source>in Advances in neural information processing systems</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Fischer</surname>
          </string-name>
          , and Thomas Brox, “
          <article-title>U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention</article-title>
          . Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>