ARTEFACT DETECTION IN VIDEO ENDOSCOPY USING RETINANET AND FOCAL LOSS FUNCTION

Ilkay Oksuz1, James R. Clough1, Andrew P. King1*, Julia A. Schnabel1*
1 School of Biomedical Engineering & Imaging Sciences, King's College London, UK
* Joint last authors.

This work was supported by an EPSRC programme Grant (EP/P001009/1) and the Wellcome EPSRC Centre for Medical Engineering at the School of Biomedical Engineering and Imaging Sciences, King's College London (WT 203148/Z/16/Z). We acknowledge financial support from the Department of Health via the NIHR comprehensive Biomedical Research Centre award to Guy's & St Thomas' NHS Foundation Trust with KCL and King's College Hospital NHS Foundation Trust.

ABSTRACT

Endoscopic Artefact Detection (EAD) is a fundamental task for enabling the use of endoscopy images for diagnosis and treatment of diseases in multiple organs. Precise detection of specific artefacts such as pixel saturations, motion blur, specular reflections, bubbles and instruments is essential for high-quality frame restoration. This work describes our submission to the EAD 2019 challenge to detect bounding boxes for seven classes of artefacts in endoscopy videos. Our method is based on the focal loss and the RetinaNet architecture with a ResNet-152 backbone. We have generated a large derivative dataset by augmenting the original images with free-form deformations to prevent over-fitting. Our method reaches a mAP of 0.2719 and an IoU of 0.3456 for the detection task over all classes of artefact for 195 images. We report comparable performance for the generalization dataset, reaching a mAP of 0.2974 and a deviation from the detection dataset of 0.0859.

Index Terms— Endoscopic artefact detection, focal loss, RetinaNet, class imbalance

1. INTRODUCTION

Endoscopy is a procedure in which the inside of the body is examined using a long, thin, flexible tube that has a light source and camera at one end, which allows visualization of the inside of organs on a screen. It is a widely used clinical procedure for the early detection of numerous cancers as well as for therapeutic procedures and minimally invasive surgery. A major handicap of endoscopy video frames is that they are subject to heavy corruption with multiple artefacts. The Endoscopy Artefact Detection (EAD) challenge at the ISBI 2019 conference provides a multi-institutional dataset consisting of 7 different types of artefact (i.e. saturation, motion blur, specular reflections, bubbles, instrument, contrast and artifact). These artefacts not only cause difficulties in visualizing the underlying tissue during diagnosis but also affect any post-analysis methods required for follow-up (e.g. video mosaicking done for archival purposes and video-frame retrieval needed for reporting). Accurate detection of artefacts is a core challenge in a wide range of endoscopic applications addressing multiple different disease areas. Precise detection of these artefacts is essential for high-quality endoscopic frame restoration and is crucial for realising reliable computer-assisted endoscopy tools for improved patient care. An example of ground truth bounding box annotations is visualized in Figure 1a.

Existing endoscopy workflows detect only one artefact class, which is insufficient to obtain high-quality frame restoration, as detailed in a comprehensive review about image quality estimation [1]. In general, the same video frame can be corrupted with multiple artefacts; e.g. motion blur, specular reflections and low contrast can be present in the same frame. Furthermore, not all artefact types contaminate the frame equally. So, unless the multiple artefacts present in a frame are known with their precise spatial locations, clinically relevant frame restoration quality cannot be guaranteed. Another advantage of such detection is that frame quality assessments can be guided to minimise the number of frames that get discarded during automated video analysis.

2. RELATED WORKS

The existing works on endoscopic artefact detection are mainly focused on thresholding-based methods using the HSV [2] and RGB colour channels. Queiroz et al. [3] proposed a principal component analysis based detection algorithm for specular artefacts. Akbari et al. [4] proposed a non-linear SVM for specular artefact detection, using both HSV and RGB colour space information for segmentation of specular reflections. The SVM was trained with 12 statistical features including the mean and standard deviation of each channel of the RGB and HSV colour spaces.

The nature of this multi-class artefact detection challenge is in close relation to object detection challenges in computer vision (e.g. the COCO challenge [5]). The top performing algorithms on COCO and similar computer vision object detection challenges are based on convolutional neural network deep learning architectures. Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [6], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [7, 8], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [5]. Despite the success of two-stage detectors, one-stage detectors, which are applied over a regular, dense sampling of object locations, scales and aspect ratios, have also been proposed. Recent work on one-stage detectors, such as YOLO [9], demonstrates promising results, yielding faster detectors with high accuracy. In this direction, Lin et al. proposed RetinaNet [10], a one-stage object detector that matches the state-of-the-art COCO Average Precision (AP) of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [11] or variants of Faster R-CNN [7]. To achieve this result, class imbalance during training was identified as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy, and a new loss function that eliminates this barrier was proposed.

Class imbalance is one key issue in the EAD 2019 multi-artefact detection challenge [12, 13], where the classes have an imbalanced distribution in the training set (e.g. specularity 43%, blur 3.5%, artifact 12%). Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage rapidly narrows down the number of candidate object locations to a small number, filtering out most background samples. In this paper, we address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples, similar to [10]. The focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector based on RetinaNet. As highlighted in [10], when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. We use a loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples.
3. METHODS

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design proposed specifically for one-stage, dense detection. While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to their exact values, as shown in the experiments. We detail the components of RetinaNet in the following sections.

3.1. Feature Pyramid Network Backbone

We adopt the Feature Pyramid Network (FPN) from [11] as the backbone network for RetinaNet. In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections so that the network efficiently constructs a rich, multi-scale feature pyramid from a single-resolution input image. Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN), as well as from two-stage detectors such as Fast R-CNN or Mask R-CNN. Following this, we build FPN on top of the ResNet architecture [14]. We construct a pyramid with levels P3 through P7, where l indicates the pyramid level (Pl has resolution 2^l lower than the input). As in [11], all pyramid levels have C = 256 channels. Details of the pyramid can be found in [11].
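To make the pyramid construction concrete, the following is a minimal Keras/TensorFlow sketch (not the authors' released code) of how levels P3 to P7 can be assembled from three backbone feature maps with lateral 1x1 convolutions and a top-down pathway. The `build_fpn` helper and the placeholder C3-C5 input shapes (for a 512x512 image) are illustrative assumptions.

```python
# Minimal FPN sketch (illustrative, not the authors' implementation).
# C3, C4, C5 stand in for ResNet-152 stage outputs at strides 8, 16, 32.
import tensorflow as tf
from tensorflow.keras import layers

def build_fpn(c3, c4, c5, channels=256):
    """Builds pyramid levels P3-P7 from backbone features C3-C5."""
    # Lateral 1x1 convolutions reduce each stage to `channels` filters.
    p5 = layers.Conv2D(channels, 1, padding='same', name='lat_c5')(c5)
    p4 = layers.Conv2D(channels, 1, padding='same', name='lat_c4')(c4)
    p3 = layers.Conv2D(channels, 1, padding='same', name='lat_c3')(c3)

    # Top-down pathway: upsample coarser levels and add to the lateral maps.
    p4 = layers.Add()([layers.UpSampling2D(2)(p5), p4])
    p3 = layers.Add()([layers.UpSampling2D(2)(p4), p3])

    # 3x3 convolutions smooth the merged maps.
    p3 = layers.Conv2D(channels, 3, padding='same', name='p3')(p3)
    p4 = layers.Conv2D(channels, 3, padding='same', name='p4')(p4)
    p5 = layers.Conv2D(channels, 3, padding='same', name='p5')(p5)

    # P6 and P7 extend the pyramid to coarser scales (strides 64 and 128).
    p6 = layers.Conv2D(channels, 3, strides=2, padding='same', name='p6')(c5)
    p7 = layers.Conv2D(channels, 3, strides=2, padding='same', name='p7')(
        layers.ReLU()(p6))
    return [p3, p4, p5, p6, p7]

# Example: stand-in feature maps for a 512x512 input (strides 8, 16, 32).
c3 = layers.Input(shape=(64, 64, 512))
c4 = layers.Input(shape=(32, 32, 1024))
c5 = layers.Input(shape=(16, 16, 2048))
pyramid = build_fpn(c3, c4, c5)
model = tf.keras.Model([c3, c4, c5], pyramid)
```

In practice the C3-C5 tensors would be taken from intermediate layers of the ResNet-152 backbone rather than from standalone inputs.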
3.2. Anchors

We use translation-invariant anchor boxes similar to those in the original RetinaNet [10]. The anchors have areas of 32^2 to 512^2 on pyramid levels P3 to P7, respectively. As in [11], at each pyramid level we use anchors at three aspect ratios 1:2, 1:1, 2:1. For denser scale coverage, at each level we add anchors of sizes 2^0, 2^(1/3), 2^(2/3) times the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are A = 9 anchors per level, and across levels they cover the scale range 32 to 813 pixels with respect to the network's input image. Each anchor is assigned a length-K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN, but modified for multi-class detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.7, and to background if their IoU is in [0, 0.6). As each anchor is assigned to at most one object box, we set the corresponding entry in its length-K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with overlap in [0.6, 0.7), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
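The following NumPy sketch illustrates the anchor generation and IoU-based assignment rule described above, under simplifying assumptions (no image-boundary handling, no empty ground-truth case); the helper names `anchors_for_level` and `assign_targets` are our own.

```python
# Sketch of anchor generation and IoU-based target assignment (illustrative).
import numpy as np

RATIOS = [0.5, 1.0, 2.0]                       # aspect ratios 1:2, 1:1, 2:1
SCALES = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]  # octave scales per level

def anchors_for_level(base_size, stride, feat_h, feat_w):
    """Returns (feat_h*feat_w*9, 4) anchors as (x1, y1, x2, y2)."""
    shapes = []
    for scale in SCALES:
        for ratio in RATIOS:           # ratio = height / width
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            shapes.append((w, h))
    shapes = np.array(shapes)          # (9, 2)
    # Anchor centres lie on the feature-map grid, spaced by the stride.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centres = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (HW, 2)
    boxes = np.concatenate(
        [centres[:, None, :] - shapes[None] / 2,
         centres[:, None, :] + shapes[None] / 2], axis=2)
    return boxes.reshape(-1, 4)

def iou(anchors, gt):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def assign_targets(anchors, gt_boxes, gt_labels, num_classes):
    """One-hot classification targets: positive >= 0.7 IoU, negative < 0.6."""
    cls = np.zeros((len(anchors), num_classes))
    overlaps = iou(anchors, gt_boxes)
    best_gt = overlaps.argmax(axis=1)
    best_iou = overlaps.max(axis=1)
    ignore = (best_iou >= 0.6) & (best_iou < 0.7)   # ignored during training
    positive = best_iou >= 0.7
    cls[positive, gt_labels[best_gt[positive]]] = 1
    return cls, positive, ignore
```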
3.3. Classification Subnet

The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3x3 convolutional layers, each with C filters and each followed by ReLU activations, followed by a 3x3 convolutional layer with KA filters. Finally, sigmoid activations are attached to output the KA binary predictions per spatial location. We use C = 256 and A = 9 in most experiments. In contrast to RPN, our object classification subnet is deeper, uses only 3x3 convolutions, and does not share parameters with the box regression subnet. We found these higher-level design decisions to be more important than the specific values of hyperparameters.

3.4. Box Regression Subnet

In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet, except that it terminates in 4A linear outputs per spatial location. For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box. The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.
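A minimal Keras sketch of the two subnets described above, assuming K = 7 artefact classes and A = 9 anchors; the prior-based bias initialization of the final classification layer follows the scheme described later in Section 4.2 and is included here as an assumption.

```python
# Sketch of the classification and box regression subnets (illustrative).
import math
import tensorflow as tf
from tensorflow.keras import layers, initializers

K_CLASSES, A_ANCHORS, C_FILTERS = 7, 9, 256
PRIOR = 0.01  # foreground prior used to initialise the final bias (Sec. 4.2)

def classification_head():
    """Four 3x3 convs with ReLU, then K*A sigmoid outputs per location."""
    inputs = layers.Input(shape=(None, None, C_FILTERS))
    x = inputs
    for i in range(4):
        x = layers.Conv2D(
            C_FILTERS, 3, padding='same', activation='relu',
            kernel_initializer=initializers.RandomNormal(stddev=0.01),
            name=f'cls_conv{i}')(x)
    outputs = layers.Conv2D(
        K_CLASSES * A_ANCHORS, 3, padding='same', activation='sigmoid',
        kernel_initializer=initializers.RandomNormal(stddev=0.01),
        bias_initializer=initializers.Constant(-math.log((1 - PRIOR) / PRIOR)),
        name='cls_out')(x)
    return tf.keras.Model(inputs, outputs, name='classification_subnet')

def box_regression_head():
    """Same structure, terminating in 4*A linear outputs per location."""
    inputs = layers.Input(shape=(None, None, C_FILTERS))
    x = inputs
    for i in range(4):
        x = layers.Conv2D(C_FILTERS, 3, padding='same', activation='relu',
                          name=f'box_conv{i}')(x)
    outputs = layers.Conv2D(4 * A_ANCHORS, 3, padding='same',
                            name='box_out')(x)
    return tf.keras.Model(inputs, outputs, name='box_regression_subnet')

# Applying one head instance to every pyramid level shares its parameters:
# cls_head = classification_head()
# cls_outputs = [cls_head(p) for p in pyramid]
```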
3.5. Focal Loss

The focal loss [10] focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. Formally, the focal loss is a modified version of the cross entropy loss with a tunable focusing parameter γ:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

where p_t is the predicted probability of the true class and α_t is a weighting parameter. There are two important properties of the focal loss which make it appealing for the EAD 2019 challenge: (1) When an example is misclassified and p_t is small, the modulating factor is near 1 and the loss is unaffected. As p_t increases, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, the focal loss is equivalent to cross entropy, and as γ is increased the effect of the modulating factor is likewise increased. Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. The total loss is a combination of the focal loss and a regression loss on the bounding boxes.

4. IMPLEMENTATION DETAILS

RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet. We use a ResNet-152-FPN backbone to run our experiments. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 200 top-scoring predictions per FPN level, after thresholding detector confidence at 0.36. The top predictions from all levels are merged, and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections. We trained our network using the Keras framework with the TensorFlow library on an NVIDIA P6000 GPU.

4.1. Focal Loss

We use the focal loss described in Section 3.5 as the loss on the output of the classification subnet. We find that γ = 2 and α = 0.25 work well in practice and that RetinaNet is relatively robust to these values. We emphasize that when training RetinaNet, the focal loss is applied to all ~100k anchors in each sampled image. This stands in contrast to the common practice of using heuristic sampling (RPN) or hard example mining (OHEM, SSD) to select a small set of anchors for each minibatch. The total focal loss of an image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not the total number of anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. In general, α should be decreased slightly as γ is increased, as highlighted in the original RetinaNet paper.
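A sketch of how the focal loss with these settings could be implemented in TensorFlow, assuming per-anchor one-hot targets together with an `anchor_state` vector marking positive (1), negative (0) and ignored (-1) anchors as in Section 3.2; this is illustrative rather than the exact training code.

```python
# Sketch of the focal loss with alpha = 0.25, gamma = 2 (illustrative).
import tensorflow as tf

def focal_loss(y_true, y_pred, anchor_state, alpha=0.25, gamma=2.0):
    """y_true, y_pred: (num_anchors, K); anchor_state: 1 pos, 0 neg, -1 ignore."""
    # Drop anchors whose overlap falls in the ignored range [0.6, 0.7).
    keep = tf.where(tf.not_equal(anchor_state, -1))[:, 0]
    y_true = tf.gather(y_true, keep)
    y_pred = tf.gather(y_pred, keep)

    # p_t is the predicted probability of the true label at each position.
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    alpha_t = tf.where(tf.equal(y_true, 1.0),
                       alpha * tf.ones_like(y_true),
                       (1.0 - alpha) * tf.ones_like(y_true))

    # The modulating factor (1 - p_t)^gamma down-weights easy examples.
    eps = tf.keras.backend.epsilon()
    loss = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t + eps)

    # Normalise by the number of anchors assigned to a ground-truth box.
    num_positive = tf.maximum(
        tf.reduce_sum(tf.cast(tf.equal(anchor_state, 1), tf.float32)), 1.0)
    return tf.reduce_sum(loss) / num_positive
```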
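For completeness, a simplified sketch of the inference-time filtering described at the start of this section (confidence threshold 0.36, at most 200 boxes per level, NMS at 0.5). It assumes already-decoded boxes and class-agnostic NMS, and the 300-detection cap is an assumption rather than a value from the paper.

```python
# Sketch of inference-time box filtering (illustrative, class-agnostic NMS).
import tensorflow as tf

def filter_detections(boxes_per_level, scores_per_level,
                      score_threshold=0.36, top_k=200, nms_threshold=0.5):
    """boxes: list of (N_l, 4) tensors; scores: list of (N_l,) tensors."""
    kept_boxes, kept_scores = [], []
    for boxes, scores in zip(boxes_per_level, scores_per_level):
        # Confidence thresholding, then keep at most `top_k` boxes per level.
        mask = scores > score_threshold
        boxes = tf.boolean_mask(boxes, mask)
        scores = tf.boolean_mask(scores, mask)
        k = tf.minimum(top_k, tf.shape(scores)[0])
        scores, idx = tf.math.top_k(scores, k=k)
        kept_boxes.append(tf.gather(boxes, idx))
        kept_scores.append(scores)

    # Merge all levels and apply non-maximum suppression.
    boxes = tf.concat(kept_boxes, axis=0)
    scores = tf.concat(kept_scores, axis=0)
    keep = tf.image.non_max_suppression(boxes, scores,
                                        max_output_size=300,  # assumed cap
                                        iou_threshold=nms_threshold)
    return tf.gather(boxes, keep), tf.gather(scores, keep)
```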
4.2. Initialization

All new convolutional layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final convolutional layer of the classification subnet, we set the bias initialization to b = −log((1 − π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of π. We use π = 0.01 in all experiments, although results are robust to the exact value. This initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.

4.3. Optimization

RetinaNet is trained with stochastic gradient descent (SGD). We use a minibatch size of 3 images. The model is trained for 10000 iterations with an initial learning rate of 0.001, which is then divided by 10 at 5000 and again at 7500 iterations. Weight decay of 0.0001 and momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression. Training of the network took 26 hours.

Fig. 1: Example artefact detection and confidence scores from the training detection set (result using 5-fold cross-validation): (a) example ground truth bounding boxes; (b) predicted bounding boxes. The example shown was in the validation set for this setup and was not used during training of the network.

4.4. Augmentation

Our scheme of image augmentations was designed to prevent overfitting to the set of training images, and so make our method more generalisable to the images in the test set. We assessed the effect of these augmentations by training the network with just the original training data and applying it to the test set images to produce artefact detections. Consistent with this is the observation that training without augmentations produces a much smaller final loss value than training with augmentations. Having trained on the whole dataset for 10000 iterations with a batch size of 600 images, the final loss value without augmentations is 0.0082 but with augmentations is 0.0605. This clearly indicates significant over-fitting to the training dataset when augmentations are not used.

Fig. 2: Example artefact detections and confidence scores from the test detection set.
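The paper describes the augmentations only as free-form deformations; the sketch below shows one common way to generate such a deformation with a smoothed random displacement field. The parameters `alpha` and `sigma` are illustrative, and in a detection setting the bounding box coordinates would need to be remapped with the same field.

```python
# Sketch of a free-form (elastic) deformation augmentation (illustrative;
# the exact scheme is not specified in the paper, and bounding boxes would
# need to be remapped with the same displacement field).
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=30.0, sigma=8.0, rng=None):
    """Warps an (H, W, C) image with a smoothed random displacement field."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Random displacements, smoothed with a Gaussian and scaled by alpha.
    dx = gaussian_filter(rng.uniform(-1, 1, size=(h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, size=(h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = [np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)]
    warped = np.stack(
        [map_coordinates(image[..., c], coords, order=1, mode='reflect')
         for c in range(image.shape[-1])], axis=-1)
    return warped
```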
5. EXPERIMENTAL RESULTS

We used a stratified 5-fold cross-validation strategy to optimize the parameters of the network. Table 1 summarizes the quantitative results achieved over the 5 folds for each of the seven artefact classes. The data imbalance between the different classes, and how easily distinguishable each specific class is, influence the mAP score obtained for each class. We reach a 0.2719 mean average precision (mAP) score on 195 cases over all 7 artefact classes. The intersection over union (IoU) for our predictions is 0.3456 for the detection task over all classes. We report comparable performance for the 51-image generalization dataset, reaching a mAP of 0.2974 and a deviation from the detection dataset of 0.0859.

The visual result of 5-fold cross-validation for a case from the validation cohort is shown in Figure 1. The trained network is capable of generating bounding boxes with high confidence for a case that was unseen during training. A qualitative result from the detection test set is illustrated in Figure 2, with the prediction probabilities. The bubble and artefact classes are correctly identified in the example image. The ground truth is not available for this case.

Fold    specularity  saturation  artifact  blur    contrast  bubbles  instrument  mAP     IoU
0       0.5694       0.5908      0.6258    0.6053  0.6832    0.4962   0.6785      0.5245  0.4591
1       0.5619       0.5977      0.6518    0.6098  0.7037    0.5060   0.6858      0.5309  0.4562
2       0.5709       0.6193      0.6674    0.5969  0.7104    0.5112   0.6969      0.5401  0.4015
3       0.5613       0.6291      0.6689    0.6011  0.6974    0.5076   0.6897      0.5354  0.4209
4       0.5659       0.6072      0.6882    0.6185  0.7141    0.5213   0.6871      0.5442  0.4173
Mean    0.5659       0.6088      0.6604    0.6063  0.7018    0.5085   0.6876      0.5350  0.4310
D-Test  N/A          N/A         N/A       N/A     N/A       N/A      N/A         0.2719  0.3456
G-Test  N/A          N/A         N/A       N/A     N/A       N/A      N/A         0.2974  N/A

Table 1: 5-fold validation average precision (AP) per class and intersection over union (IoU) results for the seven classes. The AP results for each class, the mean AP (mAP) and the IoU are reported for the validation over 5 folds (e.g. for fold 0, the first fifth of the images in the dataset was used for validation). The IoU column is the intersection over union between the bounding boxes inferred from the fold network and the ground truth bounding boxes. The Mean row reports the per-column average over the five folds. D-Test and G-Test correspond to the detection and generalization test data, for which per-class results are not reported in the challenge.

6. DISCUSSION AND FUTURE WORK

In our experience, when artefact classes are poorly detected, a significant factor is the size and total number of bounding boxes produced. The main difference between different setups was the number of bounding boxes generated for artefacts in a given neighbourhood. One critical factor in the final mAP score is the probability threshold used to include the detected artefacts. In future work, we aim to apply our algorithm to other artefact localization tasks in medical imaging (e.g. cardiac MR), subject to the availability of training data.

7. REFERENCES

[1] Bernd Münzer, Klaus Schoeffmann, and Laszlo Böszörmenyi, "Content-based processing and analysis of endoscopic images and videos: A survey," Multimedia Tools and Applications, vol. 77, no. 1, pp. 1323–1362, 2018.

[2] Othmane Meslouhi, Mustapha Kardouchi, Hakim Allali, Taoufiq Gadi, and Yassir Benkaddour, "Automatic detection and inpainting of specular reflections for colposcopic images," Open Computer Science, vol. 1, no. 3, pp. 341–354, 2011.

[3] Fabiane Queiroz and Tsang Ing Ren, "Endoscopy image restoration: A study of the kernel estimation from specular highlights," Digital Signal Processing, 2019.

[4] Mojtaba Akbari, Majid Mohrekesh, Kayvan Najariani, Nader Karimi, Shadrokh Samavi, and SM Reza Soroushmehr, "Adaptive specular reflection detection and inpainting in colonoscopy video frames," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3134–3138.

[5] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.

[9] Joseph Redmon and Ali Farhadi, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.

[10] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.

[11] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[12] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnires, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[13] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," CoRR, vol. abs/1904.07073, 2019.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.