<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GRAPH-SEARCH BASED UNET-D FOR THE ANALYSIS OF ENDOSCOPIC IMAGES</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shufan Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandy Cochran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Engineering, University of Glasgow</institution>
          ,
          <addr-line>Glasgow</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>While object recognition with deep neural networks (DNNs) has shown remarkable success on natural images, endoscopic images still cannot be fully analysed using DNNs, since their analysis must account for occlusion, light reflection and image blur. UNet-based deep convolutional neural networks offer great potential to extract high-level spatial features, thanks to their hierarchical nature with multiple levels of abstraction, which is especially useful for working with multimodal endoscopic images combining white light and fluoroscopy in the diagnosis of esophageal disease. However, currently reported inference times for DNNs are above 200 ms, which is unsuitable for integration into robotic control loops. This work addresses real-time object detection and semantic segmentation for endoscopic devices. We show that endoscopic assistive diagnosis can achieve satisfactory detection rates with a fast inference time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Index Terms— Endoscopic images, deep neural
networks, encoder-decoder neural networks</p>
      <p>A common strategy in deep convolutional neural networks for
semantic segmentation tasks is to down-sample an image
between convolutional and ReLU layers, and then
up-sample the output to match the input size [1]. Atrous
convolution is designed to retain spatial resolution
after several convolution layers [2]. Compared with
normal convolution layers, atrous convolution inserts
holes into its filters, enlarging the receptive field to a
greater extent; however, this method often loses low-level information
and is therefore unsuitable for a medical environment. To deal
with multi-scale images, an Atrous Spatial Pyramid
Pooling (ASPP) layer has been developed that allows the network to
work on different image sizes and thus increases the
flexibility of the input scale [3]. To capture more information, some
networks also use the output of convolution
layers directly as low-level features, passing them into the decoder
to increase accuracy [4]. However, these structures currently
report an average inference time above 300 ms [4]; a fast
inference time is essential to achieve real-time
image analysis.</p>
      <p>Fig. 1: The architecture of the network</p>
    </sec>
    <sec id="sec-3">
      <title>2. METHOD</title>
      <p>As shown in Fig. 1, the network architecture for this
challenge is based on the UNet architecture. Convolutional
network layers are used as an encoder to abstract low-level spatial
information. A decoder is then implemented using transposed
convolutions. Instead of using an ASPP layer, a general
autoencoder class label is kept in a dense layer. This compressed
feature vector connects to a series of up-sampling layers using
the coarse mask.</p>
      <p>Regular classification DCNNs generate a coarse mask
containing probabilities for each class in dense confidence regions
using the following steps:
1. Generate a feature map using a fully convolutional neural
network.
2. Initialize a segmentation with the detected features.
3. Apply transposed convolutions with a confidence check to
keep one weak edge on the common boundary.
4. Merge neighbouring regions (Ri and Rj) using an
optimal objective function with the confidence of the whole
image from the feature map.
5. Generate a new maximum confidence map over
all adjacent regions.</p>
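      <p>Steps 4 and 5 above can be sketched in Python as follows. The region encoding, the area-weighted merge objective and the confidence threshold are illustrative assumptions, not the authors' implementation.</p>

```python
# Sketch of steps 4-5: merge neighbouring regions when the combined
# confidence clears a threshold; the surviving regions carry the new
# maximum-confidence map. Data layout here is an assumption.

def merge_score(conf_i, n_i, conf_j, n_j):
    # Area-weighted confidence of the merged region (assumed objective).
    return (conf_i * n_i + conf_j * n_j) / (n_i + n_j)

def merge_regions(regions, neighbours, threshold=0.5):
    """regions: {id: (confidence, n_pixels)}; neighbours: pairs (i, j)."""
    merged = dict(regions)
    for i, j in sorted(neighbours):
        if i in merged and j in merged:
            ci, ni = merged[i]
            cj, nj = merged[j]
            score = merge_score(ci, ni, cj, nj)
            if score >= threshold:      # keep the merge only if confident
                merged[i] = (score, ni + nj)
                del merged[j]           # region j absorbed into region i
    return merged

regions = {0: (0.9, 100), 1: (0.6, 40), 2: (0.2, 10)}
merged = merge_regions(regions, {(0, 1), (1, 2)})
```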
      <sec id="sec-3-1">
        <title>Network configuration</title>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Hyper-parameters chosen for feature map abstraction.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Layer name</th><th>Output size</th></tr>
            </thead>
            <tbody>
              <tr><td>Conv-1</td><td>H/2, W/2</td></tr>
              <tr><td>Max pooling</td><td>H/4, W/4</td></tr>
              <tr><td>Conv-block-1</td><td>H/4, W/4</td></tr>
              <tr><td>Dense-confi-block</td><td>H/8, W/8</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-3">
        <title>Objective function</title>
        <p>Here, the considered objective function is:</p>
        <p>where C is the current region of confidence, N is
the number of regions in the corresponding adjacent
region, and Pj is the probability of the jth class. A free
parameter can be used to scale up the confidence level to avoid
ignoring small regions.</p>
        <p>After calculating the dense confidence feature map, the
resulting features are fed to a 1x1 convolution kernel with 256
filters. Finally, the result is bilinearly up-sampled to the
correct dimensions. The dense confidence pyramid uses atrous
convolutional layers in a cascade fashion, where the dilation
rate increases layer by layer: layers with small
dilation rates are located in the lower part, while layers with
large dilation rates are located in the upper part.</p>
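        <p>The cascade of atrous layers can be sketched in one dimension; the all-ones kernels, the dilation rates (1, 2, 4) and the 1-D setting are illustrative assumptions.</p>

```python
# 1-D sketch of cascaded atrous (dilated) convolution: the dilation rate
# increases layer by layer, so the receptive field grows with each layer.
# Kernel values and dilation rates (1, 2, 4) are illustrative assumptions.

def dilated_conv1d(x, kernel, dilation):
    # 'same'-style convolution with zero padding and holes in the filter.
    k = len(kernel)
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t in range(k):
            j = i + (t - k // 2) * dilation
            if j in range(len(x)):   # zero padding outside the signal
                acc += kernel[t] * x[j]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    # Receptive field of a cascade of dilated convolutions.
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

y = [0.0] * 8 + [1.0] + [0.0] * 8    # unit impulse
for d in (1, 2, 4):                  # small dilations low, large ones high
    y = dilated_conv1d(y, [1.0, 1.0, 1.0], d)
```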
        <p>Because of the great class imbalance in this
test dataset, some classes have a large number of pixels in
almost every image, while others do not appear in some images at
all. By setting the scaling parameter above 0, we reduce the relative
loss for well-classified examples to avoid misclassifying objects. In other
words, the dense confidence layer works to alleviate errors
using a smaller scaling factor.</p>
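        <p>The effect of the scaling parameter can be sketched as a focal-style loss. The parameter name gamma and the modulating factor (1 - p)^gamma are assumptions in the style of focal losses; the paper's own symbol and formula are not given here.</p>

```python
import math

# Sketch of down-weighting well-classified examples. The name 'gamma'
# and the factor (1 - p)**gamma are assumptions; with gamma above 0,
# confident pixels contribute a much smaller relative loss.

def scaled_loss(p_true, gamma):
    """Cross-entropy for the true class, scaled by (1 - p)**gamma."""
    return (1.0 - p_true) ** gamma * -math.log(p_true)

easy = scaled_loss(0.95, 2.0)   # well-classified pixel, heavily down-weighted
hard = scaled_loss(0.30, 2.0)   # poorly classified pixel, nearly full loss
```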
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2. Data Augmentation</title>
      <p>An ImageNet pre-trained ResNet-50 is used for training with the 320
images that the EAD2019 challenge provides for the semantic
segmentation task [5, 6]. Among these images, 20% are kept
for evaluation and the rest are used for training. The
following data augmentation methods are applied: the RGB value
(66.32, 76.13, 120.58) is used for normalization with batch
size 4. A random flip and a rotation in (-50, 50) degrees are applied,
and the picture is rescaled to 0.5-0.75 of its original size with a pad
size of (600 pixels, 512 pixels). After data augmentation,
about 1300 images are obtained, roughly 4 times the size of
the original dataset.</p>
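      <p>The augmentation settings above can be collected into a small parameter sampler; the function name and dictionary layout are assumptions, while the ranges come from the text.</p>

```python
import random

# Sampler for the augmentation parameters described above: random flip,
# rotation in (-50, 50) degrees, rescaling to 0.5-0.75 of the original
# size, and padding to (600, 512) pixels. Structure is an assumption.

def sample_augmentation(rng):
    return {
        "hflip": rng.random() > 0.5,           # random flip
        "rotation_deg": rng.uniform(-50, 50),  # rotation range from the text
        "scale": rng.uniform(0.5, 0.75),       # rescale factor
        "pad_to": (600, 512),                  # pad size in pixels
        "mean_rgb": (66.32, 76.13, 120.58),    # normalization value
    }

rng = random.Random(0)
params = [sample_augmentation(rng) for _ in range(4)]   # ~4x augmentation
```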
    </sec>
    <sec id="sec-5">
      <title>2.3. Training processes</title>
      <p>Table 1 shows the hyper-parameters chosen for
feature map abstraction.</p>
      <p>We use a normal distribution to pick a tensor from the
interval (0, std), where std is:
std = sqrt(2 / ((1 + a^2) * fan_in))
(2)</p>
      <p>where a is the negative slope of the rectifier used
after this layer, which is 0 for a ReLU activation layer.</p>
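      <p>Equation (2) is the He initialization standard deviation and can be computed directly; the fan-in value used below is only an example.</p>

```python
import math

# std = sqrt(2 / ((1 + a^2) * fan_in)) from equation (2); a = 0 for ReLU.

def init_std(fan_in, a=0.0):
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))

std = init_std(fan_in=3 * 3 * 64)   # e.g. a 3x3 kernel over 64 channels
```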
      <p>Typical batch sizes for SGD are 6, 12, or 24
[7]. However, in this work, the batch size was set to 5, which
was the optimal number to balance GPU memory use
and training speed.</p>
      <p>During the training process, a poly learning rate policy is
applied to the learning rate. To begin with, the learning
rate is relatively high; after several iterations, the weights
have improved and the distance between the current and the best
weights has decreased. The learning rate correspondingly becomes
smaller to find the best weights. The decay learning rate
policy is employed with the formula
lr = lr_0 * (1 - ep / maxep)^power,
(3)
where ep and maxep are the current epoch and the maximum
epoch, which is set to 500. Here the power is set to 0.9 based
on a previously published method [8]. Since the training dataset
includes some very similar data, a weight decay method
[8] is followed, as given in equations (4) and (5).</p>
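      <p>The poly decay policy can be written out directly; the base learning rate below is an assumed example value, while the maximum epoch of 500 and power of 0.9 come from the text.</p>

```python
# Poly learning-rate decay: lr = base_lr * (1 - ep / max_ep) ** power,
# with max_ep = 500 and power = 0.9 as in the text. base_lr is assumed.

def poly_lr(base_lr, ep, max_ep=500, power=0.9):
    return base_lr * (1.0 - ep / max_ep) ** power

schedule = [poly_lr(0.01, ep) for ep in (0, 100, 250, 499)]
```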
      <p>R(w) = sum_k sum_l w_{k,l}^2,
(4)</p>
      <p>where w_{k,l} are the weights stored in the network. The total loss
from the loss function will now have two parts:</p>
      <p>L(w) = (1/N) * sum_{i=1}^{N} L_i(f(x_i; w), y) + R(w)
(5)</p>
      <p>The first term is the loss calculated by the chosen loss
function; the second term is the regularization part,
which makes the network simpler. If two sets of weights have
a similar loss under the loss function, the larger
weights will have a larger regularization term and therefore a
larger total loss.</p>
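      <p>Equations (4) and (5) can be checked numerically; the toy losses and weights below are illustrative, and R(w) is added with weight 1 since no coefficient is given in the text.</p>

```python
# Total loss from equation (5): data term averaged over N examples plus
# the L2 term R(w) from equation (4). Toy values are illustrative only.

def l2_regularizer(weights):
    # R(w) = sum over all stored weights of w^2
    return sum(w * w for row in weights for w in row)

def total_loss(per_example_losses, weights):
    n = len(per_example_losses)
    return sum(per_example_losses) / n + l2_regularizer(weights)

small = total_loss([0.5, 0.7], [[0.1, 0.2]])   # small weights
large = total_loss([0.5, 0.7], [[1.0, 2.0]])   # bigger weights, bigger loss
```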
    </sec>
    <sec id="sec-6">
      <title>3. EVALUATION</title>
    </sec>
    <sec id="sec-7">
      <title>3.1. Semantic segmentation results</title>
      <p>Results obtained from the trained models on the challenge
validation set are shown in Figure 2. Images of various
resolutions are shown from the top to the bottom row: 1003 x
1003 pixels, 628 x 628 pixels, 576 x 576 pixels.</p>
      <p>A detailed example of segmentation results for 5 classes
from the endoscopic dataset is shown in the appendix (Fig. 5).</p>
    </sec>
    <sec id="sec-8">
      <title>3.2. Training process</title>
      <p>Figure 3 shows the loss at each validation epoch. Although
the evaluation loss was not as good as the loss at
each training epoch, it was still acceptable. The MIoU curve
increased dramatically during the first 30 epochs, then slowly
converged to its final value of 65%.</p>
      <p>The prediction (pii) was made by taking the maximum of the
output feature map of the segmentation model, and was
up-sampled by a factor of 8 using bilinear interpolation. As shown in Fig.
4, our approach (UNet-D) had very similar training performance
to state-of-the-art semantic
segmentation methods. In this challenge, the evaluation of
segmentation was based on the DICE and Jaccard values. Our
results matched those of other methods, as shown
in Table 2 and Table 3.</p>
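      <p>The DICE and Jaccard scores used for evaluation can be computed on binary masks; representing a mask as a set of pixel indices is an assumed encoding for illustration.</p>

```python
# DICE and Jaccard on binary masks encoded as sets of pixel indices.
# Note DICE = 2J / (1 + J), so rankings agree but DICE reads higher.

def dice(pred, target):
    if not pred and not target:
        return 1.0
    return 2.0 * len(pred.intersection(target)) / (len(pred) + len(target))

def jaccard(pred, target):
    if not pred and not target:
        return 1.0
    return len(pred.intersection(target)) / len(pred.union(target))

p = {1, 2, 3, 4}
t = {3, 4, 5, 6}
d, j = dice(p, t), jaccard(p, t)
```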
      <p>However, this measurement is inadequate
for semantic segmentation. Since the DICE calculation is based
on binary cases, it assumes that no region is shared across
multiple classes. Furthermore, the scoring favours
high DICE values.</p>
      <p>The experimental environment was Windows 10,
64-bit, with an Intel Core i7-7700HQ CPU and a GeForce GTX
1080 Ti. The number of inferences used to calculate the
average result was 20. Although the UNet-D network did not
achieve the best scores in the
EAD2019 challenge [5], it had a smaller computational
footprint, making it an excellent candidate for real-time semantic
segmentation tasks.</p>
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSION</title>
      <p>This work demonstrates that skip connections, which keep
low-level spatial information, together with removing the connection
with the ReLU layer and using a confidence relay, can reduce
inference time. The performance of UNet-D was not, however,
outstanding in this challenge, partly because we used a
small batch size to keep system memory use low. On the
PASCAL VOC2012 dataset, 85% MIoU was achieved during
evaluation. With careful data augmentation methods,
semantic segmentation based on deep convolutional neural
networks has great potential to be used in the real-time control
loop of the next generation of endoscopic devices.</p>
      <p>5. REFERENCES
[1] O. Ronneberger, P. Fischer, and T. Brox, “U-net:
Convolutional networks for biomedical image segmentation,”
in Medical Image Computing and Computer-Assisted
Intervention (MICCAI). 2015, vol. 9351 of LNCS, pp. 234–
241, Springer.
[2] Liang-Chieh Chen, George Papandreou, Iasonas
Kokkinos, Kevin Murphy, and Alan L. Yuille, “Deeplab:
Semantic image segmentation with deep convolutional nets,
atrous convolution, and fully connected crfs,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–
848, 2018.
[3] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao,
Gang Yu, and Nong Sang, “Learning a discriminative
feature network for semantic segmentation,” in 2018 IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018,
2018, pp. 1857–1866.
[4] Tobias Pohlen, Alexander Hermans, Markus Mathias,
and Bastian Leibe, “Full-resolution residual networks for
semantic segmentation in street scenes,” in 2017 IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017,
pp. 3309–3318.
[5] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden,
Adam Bailey, Stefano Realdon, James East, Georges
Wagnières, Victor Loschenov, Enrico Grisan, Walter
Blondel, and Jens Rittscher, “Endoscopy artifact
detection (EAD 2019) challenge dataset,” CoRR, vol.
abs/1905.03209, 2019.
[6] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden,
James East, Xin Lu, and Jens Rittscher, “A deep learning
framework for quality assessment and restoration in video
endoscopy,” CoRR, vol. abs/1904.07073, 2019.
[7] Ilya Sutskever, James Martens, George E. Dahl, and
Geoffrey E. Hinton, “On the importance of initialization
and momentum in deep learning,” in Proceedings of
the 30th International Conference on Machine Learning,
ICML 2013, Atlanta, GA, USA, 16-21 June 2013, 2013,
pp. 1139–1147.
[8] Anders Krogh and John A. Hertz, “A simple weight decay
can improve generalization,” in ADVANCES IN NEURAL
INFORMATION PROCESSING SYSTEMS 4. 1992, pp.
950–957, Morgan Kaufmann.</p>
      <p>Fig. 5: Sample semantic segmentation results for five classes</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>