DETECT ARTEFACTS OF VARIOUS SIZES ON THE RIGHT SCALE FOR EACH CLASS IN VIDEO ENDOSCOPY Xiaokang Wang1 , Chunqing Wang2 1 Department of Biomedical Engineering, University of California, Davis, Davis, CA 95616, USA 2 Department of Ultrasound Imaging, Tiantan Hospital, Beijing, 100050, China ABSTRACT car, speed is a prerequisite given an acceptable detection per- formance as a self-driving car has to react instantly. In the Detecting artefacts in video filmed in endoscopy is an im- case of diagnosis in biomedical engineering, we can bear with portant problem for downstream computer-assisted diagnosis. slightly more computation time for higher accuracy. Since When tackling this problem, one challenge is that the size of the advent of R-CNN [1], which is a two stage approach, this an artefact varies in a wide range. The other challenge is that region-based detection method has become increasingly ma- labeling endoscopic images is labor- extensive and is hard to ture. Along the line, Fast R-CNN [2] introduced a RoI pool- outsource the labeling task to untrained people without the ing operation that does forward pass on all the object pro- aid of doctors. In this report, we demonstrate how the perfor- posals in an image simultaneously. Faster R-CNN [3] fur- mance of a Faster R-CNN model can be improved by scaling ther speeds up R-CNN by training a region proposal network an image to the right scale before training and testing. The (PRN) using the feature maps generated by the convolution training method overcomes the issue that a convolution neu- operations at the low level, without introducing much cost. ral network trained on one scale barely works when detecting Thus, I chose Faster R-CNN as the base framework in this the same category of objects on a different scale. The method challenge. is totally independent of the model and can be easily adapted However, two challenges have to be solved when devel- with other models. Besides, it saves time and memory by fo- oping the model. One special challenge is that the size of the cusing on the patches that include objects when training the artefacts varies in a wide range and the other one is a limited model. The source code? for this report will be made public number of labeled images (2,192 in total). The scale-related upon the publishing of my solution. challenge is associated with the architecture of a CNN. The low level feature maps of a CNN capture features like edges 1. INTRODUCTION and have a small receptive field, whereas the high-level fea- tures capture more semantic features, and have a larger re- Endoscopy is a widely used clinical procedure for the early ceptive field [4, 5, 6]. Thus, the high-level features of small detection of numerous cancers (e.g., nasopharyngeal, gastric, objects (e.g. less than 32 pixels) get mixed with features for colorectal cancers, bladder cancer etc.), therapeutic proce- background or objects nearby if the features do not disap- dures and minimally invasive surgery. Video taken by the pear due to dimension reduction caused in feature extraction. camera of an endoscope is usually heavily corrupted with e.g. For a feature stride of 32, the highest level features were multiple artefacts (e.g., pixel saturations, motion blur, defo- shrunk 32 times compared to the raw image. For very large cus, specular reflections, bubbles, fluid, debris etc.). Accu- objects, the deeper layers suffer from extracting high-level se- rate detection of artefacts is a core challenge in a wide range mantic features due to failing to integrate low-level features of endoscopic applications addressing multiple different dis- given a limited feature stride. ease areas. The importance of precise detection of these arte- To alleviate the problem caused by the wide range of ob- facts is essential for high-quality endoscopic frame restora- ject size, various solutions have been proposed. One category tion and crucial for realizing reliable computer assisted en- of solution focused on designing new CNN architectures to doscopy tools for improved patient care. exploit the features at different levels. Under this paradigm, In the last few years, convolution neural network (CNN) SSD [7] and MSCNN [8], use feature maps from different has outperformed previous non-CNN based methods in solv- layers to detect objects at different scales. Although the fea- ing the object detection problem. The dominant CNN-based tures for small objects survive in the low-layer features, they methods fall into two categories, one-stage approach and two- lack semantic information which is supposed to be encoded stage approach, with the former method shining in speed and in high level features. FPN [9], DSSD [10], STDN [11] inte- the latter in accuracy. These two methods meet the demand grate features at different layers. Another solution is to train in different fields. For example, in the field of self-driving a neural network on a multi-scale image pyramid, resulting in a scale-invariant predictor [12]. Nevertheless, the previous When cutting a patch of the image (Fig. 2 B, step 1), solutions do not change the fact that high-level feature maps the size of the patch and location of the patch was jittered, for small objects are mixed and the receptive fields for large which allows us to generate not exactly the same patch every objects are limited given an image and a CNN. Recently, a time even though the patch with the same object is selected. new training method that detects all objects at a proper scale Note that the size of the patch is always larger than the object by scaling up small objects and scaling down large objects has and a larger patch was cut if a smaller object exists in that been reported in the state-of-the-art models, SNIPER [13] and patch. Otherwise, if the object were always in the center or TridentNet [14]. same location in the patch, the model would not learn to detect In this study, we demonstrated the successful application objects but learn to localize objects assuming there is always of the idea of detecting objects of various size at the right an object, which is not true. The setting for the size of a patch scale in detecting artefacts in endoscopy. The report is orga- (spatch ) is defined by this equation: spatch = r ×sbbox , where nized in such an order: datasets, methods, results, discussion r = 4.5, 2, 1.5, and 1.2, respectively, for the cases, sbbox < 80, and conclusion. 160, 350, and > 350. If no object exists in a random patch, a fixed size of patch was cut from an image. 2. DATASETS After cutting a patch from an image, the patch was scaled up or down depending on the size of the object in that patch The training dataset consists of 2,192 endoscopic images (Fig. (Fig. 2 B, step 2). The scaling provides a zoomed-in view for 1 A), in which seven categories of artefacts were labeled [15, small object objects and a zoomed-out view for large objects. 16]. The seven categories are pixel saturation, motion blur, Thus, both high-level and low-level feature maps will exist specular reflections, bubbles, strong contrast, instrument and after an image passes a CNN. The scaling ratio (r) is inversely other artefacts. The size of an artefact varies in a range from proportional to the size (s) of the object in the patch: r = 160/s a few pixels to one thousand pixels (Fig. 1 B). The number and the size of all the objects are grouped into 6 bins. So of objects in each category is from 453 to 5835 (Fig. 1 C). the settings used here are (r=4, s < 40), (r=2, s¡80), (r=1, The performance of a model was tested on two datasets, one s < 160), (r=0.5, s < 250), (r=0.25, s < 640), and (r=0.13, collected by the same endoscope and the other collected by s > 640). The reason for choosing such settings is that there a different endoscope to test the generalization ability of a will be 5 pixels in the last layer of the encoder if a raw input model. The former and latter testing datasets comprise 195 of 160 pixels is fed into the model, which has a feature stride and 51 images, respectively. of 32. A patch, which has no objects in it, was scaled with a random ratio on the fly. After scaling up or down the patch cut from each image in 3. METHODS a batch, objects that are too large/small were excluded. The The model we built is a Faster R-CNN with a FPN as the thresholds for too small and too large objects are 32 and 2000, backbone. An FPN consists of mainly two parts, an encoder respectively. Choosing 32 as the threshold is because of the and an decoder, which is very similar to a U-Net [17] architec- feature stride of the model is 32 and choosing 2000 is just ture developed for image segmentation tasks. Considering the because it is large enough. memory capacity of our GPU (GTX 1070 16GB), We chose In each training iteration, multiple patches from multiple ResNet-50 as the workhorse of the encoder [18]. The im- images were cut and normalized as a batch by padding the plementation was based on a modularized implementation of patch with with the channel mean and concatenated, resulting mask R-CNN [19] in Pytorch [20]. The weights of the model in a batch of images, whose width and height are a multiple were initialized with the weights trained on the COCO dataset of 32 (Fig. 2 B, step 3). For a patch that is already a multiple except that the weights for the classification and regression of the stride size of the encoder, no padding was added. Since head were initialized with random weights. the padding is always on the bottom and right side if neces- When training the model, we adapted the method pro- sary, the coordinates of the bounding box does not change. posed in [13] considering the class imbalance in our dataset Collating multiple samples and unifying them in size can be and introduced data augmentation by strategically cutting a easily implemented in Pytorch1 . For other details like how patch from an image for training. In specific, given all the the bounding boxes were adjusted accordingly when cutting bounding boxes (bboxes) in an image, k+1 bboxes were sam- a patch from an image, see our code∗ on Github. pled from all the bboxes in this image (Fig. 2 A). k is the number of categories of objects in this image and 1 represents 4. RESULTS a random bbox. Such operation is to alleviate the class imbal- ance problem (Fig. 1 C) in our dataset. Then one bbox was In inference, we tested one image on all the scales used in sampled from the k+1 bboxes. Finally a patch of the image training (scales: 4, 2, 1, 0.5, 0.25 and 0.13). The coordi- was cut out and scaled up or down depending on the size of the object in that patch. 1 https://pytorch.org A bubbles Instrument specularity artifact B specularity C6000 saturation artifact 800 5000 blur contrast count of objects count of objects bubbles 4000 600 instrument 3000 400 2000 200 1000 0 0 0 200 400 600 800 1000 1200 ent blur ration ntrast ubbles rtifact ularity um mean of width and height (pixel) inst r satu co b a spe c Fig. 1. A. Two sample endoscopic images. B. The distribution of the size of each category of artefacts in the training set. C. The distribution of the counts of seven categories of artefact in the training set. nates of all the detected objects were transformed back to the 5. DISCUSSION & CONCLUSION original scale. To remove redundant bounding boxes, non- maximum suppression were conducted on all the detected The training method boosted the performance of Faster bounding boxes for each category of object. It was observed RCNN in two ways. First, as we discussed in the Intro- that many false positive, which were small objects, were de- duction section, it alleviates the scale variation problem by tected on the scale of 4, so I finally chose not to include the scaling up/down an object to the right scale to detect. Second, prediction on that scale. randomly cutting a patch which includes an object allows us to generate far more different training images compared to feeding the whole image to the model. Thus, cutting a patch serves as a data augmentation technique. Besides, it offers flexibility to deal with the class imbalance as we can choose The performance of our model was evaluated by a hybrid which patch to cut from a training image, considering the metric which was a weighted score of mean average preci- distribution of the counts of all the classes. sion (mAP) and Intersection over Union (IoU): 0.6 × mAP Since the detected bboxes on all the scales were merged, a + 0.4 × IoU. We compared the performance of two ways for bbox was called if it was detected on any of the scales. Such training the same model. One way is training the model on an integration approach tends to report more false positive. the whole image every time and the other is on patches gen- One failure case we observed is that a false positive object erated following the method described above. The threshold does look like a true object because the model decides with- for the probability when determining an object is 0.65. In the out considering the context of that object. We run into the former case, the best overall score on the two testing sets was case when an image is scaled up by 4 times. Thus the con- 0.221. For the latter case, we trained the network for 27,105 text information does matter and a context refinement proba- iterations (batch size is 8 in each iteration) and a significant bly corrects such kind of errors [21]. An alternative solution boost in performance was observed. The score we reached to solve this bias of this method can be feeding the detected was 0.293, which was among the top 10 teams on the leader- bboxes as the input for the RoI pooling layer and merging the board. features generated by the model on an image pyramid. Since A True bounding boxes Balanced bounding boxes x1 y1 x2 y2 label Sample a bbox from x1 y1 x2 y2 label 312 910 997 1720 instrument each class and add a 312 910 997 1720 instrument random bbox 0 234 99 312 artifact 1100 650 1250 790 artifact 1100 650 1250 790 artifact 510 10 620 110 background … 1196 801 1397 912 artifact Image B 0 800 1600 Pad with 0 channel mean Scale up Patch1 Bbox1 Patch2 800 Bbox2 Pad with channel mean Scale down 1600 step 1 cut step 2 scale step 3 pad Fig. 2. The method for training the model to detect objects of various size on the right scale. A. a balanced set of bounding boxes was generated by sampling a bbox from each class and adding a random bbox. B. then a single bbox was sampled and a patch including the bbox was cut from the image. The patch was scaled down or up depending on the size of the object in the patch. Finally a batch of patches with unified size were generated. the scaled down images include more context information, we [3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian expect the problem to be solved in this way. Sun. Faster r-cnn: Towards real-time object detection To further boost the detection performance, the regular with region proposal networks. In Advances in neural convolution operation in the FPN backbone can be replaced information processing systems, pages 91–99, 2015. by deformable convolution operation will enhance the trans- [4] Matthew D Zeiler and Rob Fergus. Visualizing and un- formation modeling capacity of CNNs [22] or a newly pro- derstanding convolutional networks. In European con- posed backbone designed for object detection task [?]. In ference on computer vision, pages 818–833. Springer, conclusion, there is still room for improvement and we have 2014. demonstrated the performance of a Faster R-CNN model can be improved significantly by training and detecting the ob- [5] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard jects on the right scale. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems, pages 4898– 6. REFERENCES 4906, 2016. [6] Bharat Singh and Larry S Davis. An analysis of scale in- [1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jiten- variance in object detection snip. In Proceedings of the dra Malik. Rich feature hierarchies for accurate object IEEE conference on computer vision and pattern recog- detection and semantic segmentation. In Proceedings nition, pages 3578–3587, 2018. of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014. [7] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C [2] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE Berg. Ssd: Single shot multibox detector. In European international conference on computer vision, pages conference on computer vision, pages 21–37. Springer, 1440–1448, 2015. 2016. [8] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno [19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Vasconcelos. A unified multi-scale deep convolutional Girshick. Mask r-cnn. In Proceedings of the IEEE in- neural network for fast object detection. In Euro- ternational conference on computer vision, pages 2961– pean conference on computer vision, pages 354–370. 2969, 2017. Springer, 2016. [20] Francisco Massa and Ross Girshick. maskrcnn- [9] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, benchmark: Fast, modular reference im- Bharath Hariharan, and Serge Belongie. Feature pyra- plementation of Instance Segmentation and mid networks for object detection. In Proceedings of Object Detection algorithms in PyTorch. the IEEE Conference on Computer Vision and Pattern https://github.com/facebookresearch/maskrcnn- Recognition, pages 2117–2125, 2017. benchmark, 2018. [10] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish [21] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam- Tyagi, and Alexander C Berg. Dssd: Deconvolutional Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urta- single shot detector. arXiv preprint arXiv:1701.06659, sun, and Alan Yuille. The role of context for object de- 2017. tection and semantic segmentation in the wild. In Pro- ceedings of the IEEE Conference on Computer Vision [11] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, and and Pattern Recognition, pages 891–898, 2014. Yi Xu. Scale-transferrable object detection. In Proceed- ings of the IEEE Conference on Computer Vision and [22] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Pattern Recognition, pages 528–537, 2018. Zhang, Han Hu, and Yichen Wei. Deformable convo- lutional networks. In Proceedings of the IEEE interna- [12] Edward H Adelson, Charles H Anderson, James R tional conference on computer vision, pages 764–773, Bergen, Peter J Burt, and Joan M Ogden. Pyramid meth- 2017. ods in image processing. RCA engineer, 29(6):33–41, 1984. [13] Bharat Singh, Mahyar Najibi, and Larry S Davis. Sniper: Efficient multi-scale training. In Advances in Neural Information Processing Systems, pages 9310– 9320, 2018. [14] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxi- ang Zhang. Scale-aware trident networks for object de- tection. arXiv preprint arXiv:1901.01892, 2019. [15] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnires, Victor Loschenov, Enrico Grisan, Wal- ter Blondel, and Jens Rittscher. Endoscopy arti- fact detection (EAD 2019) challenge dataset. CoRR, abs/1905.03209, 2019. [16] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learn- ing framework for quality assessment and restoration in video endoscopy. CoRR, abs/1904.07073, 2019. [17] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medi- cal image computing and computer-assisted interven- tion, pages 234–241. Springer, 2015. [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.