GRAPH-SEARCH BASED UNET-D FOR THE ANALYSIS OF ENDOSCOPIC IMAGES

Shufan Yang, Sandy Cochran
School of Engineering, University of Glasgow, Glasgow, UK

ABSTRACT

While object recognition with deep neural networks (DNNs) has shown remarkable success on natural images, endoscopic images still cannot be fully analysed with DNNs, since their analysis must account for occlusion, light reflection and image blur. UNet-based deep convolutional neural networks offer great potential for extracting high-level spatial features, thanks to their hierarchical structure with multiple levels of abstraction, which is especially useful when working with multimodal endoscopic images (white light and fluoroscopy) in the diagnosis of esophageal disease. However, the currently reported inference time for DNNs is above 200 ms, which is unsuitable for integration into robotic control loops. This work addresses real-time object detection and semantic segmentation in endoscopic devices. We show that endoscopic assistive diagnosis can achieve satisfactory detection rates with a fast inference time.

Index Terms— Endoscopic images, Deep neural networks, Decoder-encoder neural networks

1. INTRODUCTION

A common strategy in deep convolutional neural networks for semantic segmentation is to down-sample an image between convolutional and ReLU layers and then up-sample the output to match the input size [1]. Atrous convolution is designed to preserve spatial resolution after several convolution layers [2]. Although atrous convolution inserts holes into its filters, enlarging the receptive field to a greater extent than normal convolution layers, it often loses low-level information and is therefore unsuitable for a medical environment. To deal with multi-scale images, an Atrous Spatial Pyramid Pooling (ASPP) layer has been developed to let a network work on different image sizes and thus increase the flexibility of the input scale [3]. To capture more information, some networks also use the output of the convolution layers directly as low-level features, passing it into the decoder to increase accuracy [4]. However, these structures currently report an average inference time above 300 ms [4], and a fast inference time is essential for real-time image analysis.

2. METHOD

[Fig. 1: The architecture of the network.]

As shown in Fig. 1, the network architecture for this challenge is based on the UNet architecture. The convolution layers act as an encoder that abstracts low-level spatial information. A decoder is then implemented using transposed convolution. Instead of an ASPP layer, a general auto-encoder class label is kept in a dense layer. This compressed feature vector connects to a series of up-sampling layers through the coarse mask.

2.1. Algorithm

Regular classification DCNNs generate a coarse mask containing probabilities for each class in dense confidence regions using the following steps:

1. Generate a feature map using a fully convolutional neural network.
2. Initialize a segmentation from the detected features.
3. Apply transposed convolution with a confidence check to keep one weak edge on the common boundary.
4. Merge neighbouring regions (R_i and R_j) using an optimal objective function with the confidence of the whole image from the feature map.
5. Generate a new maximum-confidence map over all adjacent regions.

Here, the objective function is

    C_{image} = \sum_{l=1}^{N_r} \frac{C_\beta}{N_\gamma} (1 - P_j)^\gamma,    (1)

where C_\beta is the confidence of the current region, N_\gamma is the number of regions in the corresponding adjacent region, P_j is the probability of the j-th class, and \gamma is a free parameter that scales up the confidence level to avoid ignoring small regions.
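To make the region-merging objective of Eq. (1) concrete, here is a minimal PyTorch sketch that evaluates it over an initial segmentation. The function name image_confidence, its arguments, and the use of the per-region mean probability as the region confidence C_\beta are our own illustrative assumptions, not the released implementation.

import torch

def image_confidence(probs, regions, gamma=2.0):
    """Illustrative evaluation of Eq. (1).

    probs   : (C, H, W) tensor of per-pixel class probabilities
    regions : (H, W) integer tensor labelling the initial segmentation
    gamma   : free parameter; gamma > 0 down-weights well-classified
              regions so that small regions are not ignored
    """
    region_ids = regions.unique()
    n_gamma = float(len(region_ids))    # number of adjacent regions (N_gamma)
    c_image = torch.zeros(())
    for r in region_ids:
        mask = regions == r
        p = probs[:, mask].mean(dim=1)  # mean class probabilities in region
        p_j = p.max()                   # probability of the winning class
        c_beta = p_j                    # region confidence (assumption)
        # well-classified regions (p_j near 1) contribute little
        c_image = c_image + (c_beta / n_gamma) * (1.0 - p_j) ** gamma
    return c_image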
After calculating the dense confidence feature map, the resulting features are fed to a 1 × 1 convolution kernel with 256 filters. Finally, the result is bilinearly up-sampled to the correct dimensions. The dense confidence pyramid uses atrous convolutional layers in a cascade, with the dilation rate increasing layer by layer: layers with small dilation rates sit in the lower part and layers with large dilation rates in the upper part.

Because of the large class imbalance in this test dataset, some classes occupy many pixels in almost every image while others do not appear in some images at all. By setting \gamma > 0 we reduce the relative loss for well-classified examples and thus avoid misclassifying objects. In other words, the dense confidence layer alleviates errors through a smaller scaling factor.

2.2. Data Augmentation

An ImageNet pre-trained ResNet-50 is trained on the 320 images that the EAD2019 challenge provides for the semantic segmentation task [5, 6]. Of these images, 20% are kept for evaluation and the rest for training. The following data augmentation methods are applied: the RGB value (66.32, 76.13, 120.58) is used for normalization with batch size 4; a random flip and a rotation in the range (−50°, 50°) are applied; and the images are rescaled to 0.5–0.75 of their original size with a pad size of (600, 512) pixels. After data augmentation, about 1300 images are obtained, four times the size of the original dataset.

2.3. Training processes

Table 1 shows the hyper-parameters chosen for feature-map abstraction.

Table 1: Network architecture and layer specification.

    Layer name          Output size   Parameters
    Conv-1              H/2, W/2      8 × 8, stride 2
    Max pooling         H/4, W/4      3 × 3, stride 2
    Conv-block-1        H/4, W/4      [1 × 1, 64; 3 × 3, 512] × 3
    Dense-confi-block   H/8, W/8      [1 × 1, 64; 3 × 3, 512]

Weights are initialized from a normal distribution on the interval (0, std), where

    std = \sqrt{2 / ((1 + a^2) \, fan_{in})},    (2)

and a is the negative slope of the rectifier used after the layer, which is 0 for a ReLU activation layer.

The typical batch size for SGD is generally set to 6, 12 or 24 [7]. In this work, however, the batch size was set to 5, which best balances GPU memory use against training speed.

During training, a poly learning-rate policy is used. To begin with, the learning rate is relatively high; after several iterations the weights improve and the distance between the current and the best weights decreases, so the learning rate is reduced correspondingly to find the best weights. The decay follows

    \eta = \eta_0 \left( 1 - \frac{ep}{max_{ep}} \right)^{power},    (3)

where \eta_0 is the initial learning rate, and ep and max_{ep} are the current and the maximum epoch, the latter set to 500. The power is set to 0.9 based on previously published methods [8]. Since the training dataset includes some very similar data, a weight decay method [8] is also applied in addition to the learning-rate decay of Eq. (3); the regularizer is

    R(w) = \sum_k \sum_l w_{k,l}^2,    (4)

where w_{k,l} are the weights stored in the network. The total loss then has two parts:

    L(w) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, w), y_i) + \lambda R(w).    (5)

The first term is the loss computed by the chosen loss function; the second is the regularization term, which keeps the network simpler. If two sets of weights achieve a similar loss under the loss function, the larger weights incur a larger regularization term and therefore a larger total loss.
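As a concrete illustration of the initialization rule of Eq. (2) and the poly schedule of Eq. (3), the short Python sketch below implements both; the function names and the base_lr argument are our own, and the defaults simply restate the values given above.

import math

def init_std(fan_in, a=0.0):
    # Eq. (2): standard deviation of the normal initializer;
    # a is the negative slope of the rectifier (0 for ReLU).
    return math.sqrt(2.0 / ((1.0 + a ** 2) * fan_in))

def poly_lr(base_lr, epoch, max_epoch=500, power=0.9):
    # Eq. (3): poly learning-rate decay over the training epochs.
    return base_lr * (1.0 - epoch / max_epoch) ** power

For example, with a base rate of 0.01, poly_lr(0.01, 250) ≈ 0.0054, roughly halving the rate by mid-training.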
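In the same spirit, here is a minimal sketch of the regularized total loss of Eqs. (4) and (5), assuming a PyTorch model and an already-computed per-batch task loss; lam stands in for \lambda and its default here is an arbitrary placeholder.

import torch

def weight_decay(model):
    # Eq. (4): sum of squared weights over the whole network.
    return sum((w ** 2).sum() for w in model.parameters())

def total_loss(task_loss, model, lam=5e-4):
    # Eq. (5): data term plus the lambda-weighted regularization term.
    return task_loss + lam * weight_decay(model)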
3. EVALUATION

3.1. Semantic segmentation results

Results obtained from the trained models on the challenge validation set are shown in Fig. 2. Images of various resolutions are shown from the top to the bottom row: 1003 × 1003 pixels, 628 × 628 pixels and 576 × 576 pixels. A detailed example of segmentation results for the five classes of the endoscopic dataset is shown in the Appendix [9].

[Fig. 2: Results obtained from the validation set, shown using grey-scale values for five classes: Instrument (255), Specularity (204), Artefacts (153), Bubbles (102), Saturation (51). From left to right: (a) input, (b) UNet, (c) DeepLabV3+, (d) UNet-D.]

3.2. Training process

Figure 3 shows the loss at each validation epoch. Although the evaluation loss was not as low as the training loss, it was still acceptable. The MIoU curve rises rapidly during the initial 30 epochs and then slowly converges to its final value of 65%.

[Fig. 3: Training process performance: (a) loss rates and (b) MIoU at each evaluation epoch.]

3.3. Comparison

Our evaluation was carried out on the validation set. We use the Mean Intersection over Union (MIoU) to evaluate the capacity of the model:

    MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},    (6)

(a sketch of this computation is given at the end of this section). The prediction (p_{ii}) was made by taking the maximum over the output feature maps of the segmentation model, up-sampled by a factor of 8 using bilinear interpolation. As shown in Fig. 4, our approach (UNet-D) performed very similarly in training to state-of-the-art semantic segmentation methods. In this challenge, segmentation was evaluated using the DICE and Jaccard values; our results matched those of the other methods, as shown in Table 2 and Table 3.

[Fig. 4: Comparison among DeepLabV3+, UNet and UNet-D (our proposed approach).]

Table 2: Semantic segmentation scores in the EAD2019 Challenge.

    Method        Overlap   F2-Score   Score_s
    UNet          0.36      0.48       0.42
    Deeplab v3+   0.54      0.56       0.55
    UNet-D        0.39      0.44       0.41

Table 3: Comparison of training and inference performance.

    Model         Training time   Prediction time   Size
    UNet          20 h            213.5 ms          28.7 MB
    Deeplab-V3+   40 h            320.8 ms          182.7 MB
    UNet-D        30 h            126.3 ms          23.2 MB

However, this measure is inadequate for semantic segmentation: since DICE is calculated for binary cases, no cross-regions appear across multiple classes, and the score favours high DICE values.

The experimental environment was Windows 10, 64-bit, with an Intel Core i7-7700HQ CPU and a GeForce GTX 1080 Ti; inference times were averaged over 20 runs. Although the UNet-D network does not achieve the best score in the EAD2019 challenge [5], it has a smaller computational footprint, making it an excellent candidate for real-time semantic segmentation tasks.
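For reference, the following NumPy sketch computes Eq. (6) from a confusion matrix; the helper name mean_iou, its arguments and the guard for classes absent from both prediction and target are our own illustrative choices.

import numpy as np

def mean_iou(pred, target, num_classes):
    # Eq. (6): MIoU averaged over the k + 1 classes.
    # pred, target: integer arrays of per-pixel class labels.
    k1 = num_classes                     # k + 1 in the notation of Eq. (6)
    conf = np.zeros((k1, k1), dtype=np.int64)
    # conf[i, j] counts pixels of true class i predicted as class j (p_ij).
    np.add.at(conf, (target.ravel(), pred.ravel()), 1)
    p_ii = np.diag(conf)
    denom = conf.sum(axis=1) + conf.sum(axis=0) - p_ii
    # Classes absent from both prediction and target contribute zero.
    iou = np.where(denom > 0, p_ii / np.maximum(denom, 1), 0.0)
    return float(iou.mean())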
4. CONCLUSION

This work demonstrates that a skip connection that keeps low-level spatial information, combined with removing the connection to the ReLU layer and using a confidence relay, can reduce inference time. The UNet-D performance was not, however, outstanding in this challenge; part of the reason is that we used a small batch size to keep system memory low. On the PASCAL VOC2012 dataset, 85% MIoU was achieved at evaluation. With careful data augmentation, semantic segmentation based on deep convolutional neural networks has great potential for use in the real-time control loop of the next generation of endoscopic devices.

[Fig. 5: Sample semantic segmentation results for five classes.]

5. REFERENCES

[1] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), vol. 9351 of LNCS, pp. 234–241, Springer, 2015.

[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.

[3] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang, "Learning a discriminative feature network for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 1857–1866.

[4] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 3309–3318.

[5] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[6] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," CoRR, vol. abs/1904.07073, 2019.

[7] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 2013, pp. 1139–1147.

[8] Anders Krogh and John A. Hertz, "A simple weight decay can improve generalization," in Advances in Neural Information Processing Systems 4, pp. 950–957, Morgan Kaufmann, 1992.

[9] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," 2019.