GRAPH-SEARCH BASED UNET-D FOR THE ANALYSIS OF ENDOSCOPIC IMAGES

Shufan Yang, Sandy Cochran
School of Engineering, University of Glasgow, Glasgow, UK

ABSTRACT

While object recognition with deep neural networks (DNNs) has shown remarkable success on natural images, endoscopic images still cannot be fully analysed with DNNs, since their analysis must account for occlusion, light reflection and image blur. UNet-based deep convolutional neural networks offer great potential for extracting high-level spatial features, thanks to their hierarchical structure with multiple levels of abstraction, which is especially useful when working with multimodal endoscopic images (white light and fluoroscopy) in the diagnosis of esophageal disease. However, the currently reported inference time for DNNs is above 200 ms, which is unsuitable for integration into robotic control loops. This work addresses real-time object detection and semantic segmentation in endoscopic devices. We show that endoscopic assistive diagnosis can achieve satisfactory detection rates with a fast inference time.

Index Terms— Endoscopic images, Deep neural networks, Decoder-encoder neural networks

1. INTRODUCTION

A common strategy in deep convolutional neural networks for semantic segmentation is to down-sample an image between convolutional and ReLU layers and then up-sample the output to match the input size [1]. Atrous convolution is designed to preserve spatial resolution after several convolution layers [2]. Although atrous convolution inserts holes into its filters, enlarging the receptive field to a greater extent than normal convolution layers, it often loses low-level information and is therefore unsuitable for a medical environment. To deal with multi-scale images, an Atrous Spatial Pyramid Pooling (ASPP) layer has been developed to let a network work on different image sizes and thus increase the flexibility of the input scale [3]. To capture more information, some networks also use the output of the convolution layers directly as low-level features, passing it into the decoder to increase accuracy [4]. However, these structures currently report an average inference time above 300 ms [4], and a fast inference time is essential for real-time image analysis.

2. METHOD

[Fig. 1: The architecture of the network.]

As shown in Fig. 1, the network architecture for this challenge is based on the UNet architecture. The convolution layers act as an encoder that abstracts low-level spatial information. A decoder is then implemented using transposed convolution. Instead of an ASPP layer, a general auto-encoder class label is kept in a dense layer. This compressed feature vector connects to a series of up-sampling layers through the coarse mask.

2.1. Algorithm

Regular classification DCNNs generate a coarse mask containing probabilities for each class in dense confidence regions using the following steps:

1. Generate a feature map using a fully convolutional neural network.
2. Initialize a segmentation from the detected features.
3. Apply transposed convolution with a confidence check to keep one weak edge on the common boundary.
4. Merge neighbouring regions (R_i and R_j) using an optimal objective function with the confidence of the whole image from the feature map.
5. Generate a new maximum-confidence map over all adjacent regions.

Here, the objective function is

    C_{image} = \sum_{l=1}^{N_r} \frac{C_\beta}{N_\gamma} (1 - P_j)^\gamma,    (1)

where C_\beta is the confidence of the current region, N_\gamma is the number of regions in the corresponding adjacent region, P_j is the probability of the j-th class, and \gamma is a free parameter that scales up the confidence level to avoid ignoring small regions.
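To make the region-merging objective of Eq. (1) concrete, here is a minimal PyTorch sketch that evaluates it over an initial segmentation. The function name image_confidence, its arguments, and the use of the per-region mean probability as the region confidence C_\beta are our own illustrative assumptions, not the released implementation.

import torch

def image_confidence(probs, regions, gamma=2.0):
    """Illustrative evaluation of Eq. (1).

    probs   : (C, H, W) tensor of per-pixel class probabilities
    regions : (H, W) integer tensor labelling the initial segmentation
    gamma   : free parameter; gamma > 0 down-weights well-classified
              regions so that small regions are not ignored
    """
    region_ids = regions.unique()
    n_gamma = float(len(region_ids))    # number of adjacent regions (N_gamma)
    c_image = torch.zeros(())
    for r in region_ids:
        mask = regions == r
        p = probs[:, mask].mean(dim=1)  # mean class probabilities in region
        p_j = p.max()                   # probability of the winning class
        c_beta = p_j                    # region confidence (assumption)
        # well-classified regions (p_j near 1) contribute little
        c_image = c_image + (c_beta / n_gamma) * (1.0 - p_j) ** gamma
    return c_image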
After calculating the dense confidence feature map, the resulting features are fed to a 1 × 1 convolution kernel with 256 filters. Finally, the result is bilinearly up-sampled to the correct dimensions. The dense confidence pyramid uses atrous convolutional layers in a cascade, with the dilation rate increasing layer by layer: layers with small dilation rates sit in the lower part and layers with large dilation rates in the upper part.

Because of the large class imbalance in this test dataset, some classes occupy many pixels in almost every image while others do not appear in some images at all. By setting \gamma > 0 we reduce the relative loss for well-classified examples and thus avoid misclassifying objects. In other words, the dense confidence layer alleviates errors through a smaller scaling factor.

2.2. Data Augmentation

An ImageNet pre-trained ResNet-50 is trained on the 320 images that the EAD2019 challenge provides for the semantic segmentation task [5, 6]. Of these images, 20% are kept for evaluation and the rest for training. The following data augmentation methods are applied: the RGB value (66.32, 76.13, 120.58) is used for normalization with batch size 4; a random flip and a rotation in the range (−50°, 50°) are applied; and the images are rescaled to 0.5–0.75 of their original size with a pad size of (600, 512) pixels. After data augmentation, about 1300 images are obtained, four times the size of the original dataset.

2.3. Training processes

Table 1 shows the hyper-parameters chosen for feature-map abstraction.

Table 1: Network architecture and layer specification.

    Layer name          Output size   Parameters
    Conv-1              H/2, W/2      8 × 8, stride 2
    Max pooling         H/4, W/4      3 × 3, stride 2
    Conv-block-1        H/4, W/4      [1 × 1, 64; 3 × 3, 512] × 3
    Dense-confi-block   H/8, W/8      [1 × 1, 64; 3 × 3, 512]

Weights are initialized from a normal distribution on the interval (0, std), where

    std = \sqrt{2 / ((1 + a^2) \, fan_{in})},    (2)

and a is the negative slope of the rectifier used after the layer, which is 0 for a ReLU activation layer.

The typical batch size for SGD is generally set to 6, 12 or 24 [7]. In this work, however, the batch size was set to 5, which best balances GPU memory use against training speed.

During training, a poly learning-rate policy is used. To begin with, the learning rate is relatively high; after several iterations the weights improve and the distance between the current and the best weights decreases, so the learning rate is reduced correspondingly to find the best weights. The decay follows

    \eta = \eta_0 \left( 1 - \frac{ep}{max_{ep}} \right)^{power},    (3)

where \eta_0 is the initial learning rate, and ep and max_{ep} are the current and the maximum epoch, the latter set to 500. The power is set to 0.9 based on previously published methods [8]. Since the training dataset includes some very similar data, a weight decay method [8] is also applied in addition to the learning-rate decay of Eq. (3); the regularizer is

    R(w) = \sum_k \sum_l w_{k,l}^2,    (4)

where w_{k,l} are the weights stored in the network. The total loss then has two parts:

    L(w) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, w), y_i) + \lambda R(w).    (5)

The first term is the loss computed by the chosen loss function; the second is the regularization term, which keeps the network simpler. If two sets of weights achieve a similar loss under the loss function, the larger weights incur a larger regularization term and therefore a larger total loss.
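As a concrete illustration of the initialization rule of Eq. (2) and the poly schedule of Eq. (3), the short Python sketch below implements both; the function names and the base_lr argument are our own, and the defaults simply restate the values given above.

import math

def init_std(fan_in, a=0.0):
    # Eq. (2): standard deviation of the normal initializer;
    # a is the negative slope of the rectifier (0 for ReLU).
    return math.sqrt(2.0 / ((1.0 + a ** 2) * fan_in))

def poly_lr(base_lr, epoch, max_epoch=500, power=0.9):
    # Eq. (3): poly learning-rate decay over the training epochs.
    return base_lr * (1.0 - epoch / max_epoch) ** power

For example, with a base rate of 0.01, poly_lr(0.01, 250) ≈ 0.0054, roughly halving the rate by mid-training.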
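In the same spirit, here is a minimal sketch of the regularized total loss of Eqs. (4) and (5), assuming a PyTorch model and an already-computed per-batch task loss; lam stands in for \lambda and its default here is an arbitrary placeholder.

import torch

def weight_decay(model):
    # Eq. (4): sum of squared weights over the whole network.
    return sum((w ** 2).sum() for w in model.parameters())

def total_loss(task_loss, model, lam=5e-4):
    # Eq. (5): data term plus the lambda-weighted regularization term.
    return task_loss + lam * weight_decay(model)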
3. EVALUATION

3.1. Semantic segmentation results

Results obtained from the trained models on the challenge validation set are shown in Fig. 2. Images of various resolutions are shown from the top to the bottom row: 1003 × 1003 pixels, 628 × 628 pixels and 576 × 576 pixels. A detailed example of segmentation results for the five classes of the endoscopic dataset is shown in the Appendix [9].

[Fig. 2: Results obtained from the validation set, shown using grey-scale values for five classes: Instrument (255), Specularity (204), Artefacts (153), Bubbles (102), Saturation (51). From left to right: (a) input, (b) UNet, (c) DeepLabV3+, (d) UNet-D.]

3.2. Training process

Figure 3 shows the loss at each validation epoch. Although the evaluation loss was not as low as the training loss, it was still acceptable. The MIoU curve rises rapidly during the initial 30 epochs and then slowly converges to its final value of 65%.

[Fig. 3: Training process performance: (a) loss rates and (b) MIoU at each evaluation epoch.]

3.3. Comparison

Our evaluation was carried out on the validation set. We use the Mean Intersection over Union (MIoU) to evaluate the capacity of the model:

    MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},    (6)

(a sketch of this computation is given at the end of this section). The prediction (p_{ii}) was made by taking the maximum over the output feature maps of the segmentation model, up-sampled by a factor of 8 using bilinear interpolation. As shown in Fig. 4, our approach (UNet-D) performed very similarly in training to state-of-the-art semantic segmentation methods. In this challenge, segmentation was evaluated using the DICE and Jaccard values; our results matched those of the other methods, as shown in Table 2 and Table 3.

[Fig. 4: Comparison among DeepLabV3+, UNet and UNet-D (our proposed approach).]

Table 2: Semantic segmentation scores in the EAD2019 Challenge.

    Method        Overlap   F2-Score   Score_s
    UNet          0.36      0.48       0.42
    Deeplab v3+   0.54      0.56       0.55
    UNet-D        0.39      0.44       0.41

Table 3: Comparison of training and inference performance.

    Model         Training time   Prediction time   Size
    UNet          20 h            213.5 ms          28.7 MB
    Deeplab-V3+   40 h            320.8 ms          182.7 MB
    UNet-D        30 h            126.3 ms          23.2 MB

However, this measure is inadequate for semantic segmentation: since DICE is calculated for binary cases, no cross-regions appear across multiple classes, and the score favours high DICE values.

The experimental environment was Windows 10, 64-bit, with an Intel Core i7-7700HQ CPU and a GeForce GTX 1080 Ti; inference times were averaged over 20 runs. Although the UNet-D network does not achieve the best score in the EAD2019 challenge [5], it has a smaller computational footprint, making it an excellent candidate for real-time semantic segmentation tasks.
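For reference, the following NumPy sketch computes Eq. (6) from a confusion matrix; the helper name mean_iou, its arguments and the guard for classes absent from both prediction and target are our own illustrative choices.

import numpy as np

def mean_iou(pred, target, num_classes):
    # Eq. (6): MIoU averaged over the k + 1 classes.
    # pred, target: integer arrays of per-pixel class labels.
    k1 = num_classes                     # k + 1 in the notation of Eq. (6)
    conf = np.zeros((k1, k1), dtype=np.int64)
    # conf[i, j] counts pixels of true class i predicted as class j (p_ij).
    np.add.at(conf, (target.ravel(), pred.ravel()), 1)
    p_ii = np.diag(conf)
    denom = conf.sum(axis=1) + conf.sum(axis=0) - p_ii
    # Classes absent from both prediction and target contribute zero.
    iou = np.where(denom > 0, p_ii / np.maximum(denom, 1), 0.0)
    return float(iou.mean())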
4. CONCLUSION

This work demonstrates that a skip connection that keeps low-level spatial information, combined with removing the connection to the ReLU layer and using a confidence relay, can reduce inference time. The UNet-D performance was not, however, outstanding in this challenge; part of the reason is that we used a small batch size to keep system memory low. On the PASCAL VOC2012 dataset, 85% MIoU was achieved at evaluation. With careful data augmentation, semantic segmentation based on deep convolutional neural networks has great potential for use in the real-time control loop of the next generation of endoscopic devices.

[Fig. 5: Sample semantic segmentation results for five classes.]

5. REFERENCES

[1] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351 of LNCS, pp. 234–241, Springer, 2015.

[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.

[3] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang, "Learning a discriminative feature network for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 1857–1866.

[4] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 3309–3318.

[5] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[6] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," CoRR, vol. abs/1904.07073, 2019.

[7] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 2013, pp. 1139–1147.

[8] Anders Krogh and John A. Hertz, "A simple weight decay can improve generalization," in Advances in Neural Information Processing Systems 4, pp. 950–957, Morgan Kaufmann, 1992.

[9] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," 2019.