DETECT ARTEFACTS OF VARIOUS SIZES ON THE RIGHT SCALE FOR EACH CLASS IN
                           VIDEO ENDOSCOPY

                                           Xiaokang Wang1 , Chunqing Wang2
    1
        Department of Biomedical Engineering, University of California, Davis, Davis, CA 95616, USA
               2
                 Department of Ultrasound Imaging, Tiantan Hospital, Beijing, 100050, China


                        ABSTRACT                                  car, speed is a prerequisite given an acceptable detection per-
                                                                  formance as a self-driving car has to react instantly. In the
Detecting artefacts in video filmed in endoscopy is an im-        case of diagnosis in biomedical engineering, we can bear with
portant problem for downstream computer-assisted diagnosis.       slightly more computation time for higher accuracy. Since
When tackling this problem, one challenge is that the size of     the advent of R-CNN [1], which is a two stage approach, this
an artefact varies in a wide range. The other challenge is that   region-based detection method has become increasingly ma-
labeling endoscopic images is labor- extensive and is hard to     ture. Along the line, Fast R-CNN [2] introduced a RoI pool-
outsource the labeling task to untrained people without the       ing operation that does forward pass on all the object pro-
aid of doctors. In this report, we demonstrate how the perfor-    posals in an image simultaneously. Faster R-CNN [3] fur-
mance of a Faster R-CNN model can be improved by scaling          ther speeds up R-CNN by training a region proposal network
an image to the right scale before training and testing. The      (PRN) using the feature maps generated by the convolution
training method overcomes the issue that a convolution neu-       operations at the low level, without introducing much cost.
ral network trained on one scale barely works when detecting      Thus, I chose Faster R-CNN as the base framework in this
the same category of objects on a different scale. The method     challenge.
is totally independent of the model and can be easily adapted
                                                                      However, two challenges have to be solved when devel-
with other models. Besides, it saves time and memory by fo-
                                                                  oping the model. One special challenge is that the size of the
cusing on the patches that include objects when training the
                                                                  artefacts varies in a wide range and the other one is a limited
model. The source code? for this report will be made public
                                                                  number of labeled images (2,192 in total). The scale-related
upon the publishing of my solution.
                                                                  challenge is associated with the architecture of a CNN. The
                                                                  low level feature maps of a CNN capture features like edges
                   1. INTRODUCTION                                and have a small receptive field, whereas the high-level fea-
                                                                  tures capture more semantic features, and have a larger re-
Endoscopy is a widely used clinical procedure for the early       ceptive field [4, 5, 6]. Thus, the high-level features of small
detection of numerous cancers (e.g., nasopharyngeal, gastric,     objects (e.g. less than 32 pixels) get mixed with features for
colorectal cancers, bladder cancer etc.), therapeutic proce-      background or objects nearby if the features do not disap-
dures and minimally invasive surgery. Video taken by the          pear due to dimension reduction caused in feature extraction.
camera of an endoscope is usually heavily corrupted with          e.g. For a feature stride of 32, the highest level features were
multiple artefacts (e.g., pixel saturations, motion blur, defo-   shrunk 32 times compared to the raw image. For very large
cus, specular reflections, bubbles, fluid, debris etc.). Accu-    objects, the deeper layers suffer from extracting high-level se-
rate detection of artefacts is a core challenge in a wide range   mantic features due to failing to integrate low-level features
of endoscopic applications addressing multiple different dis-     given a limited feature stride.
ease areas. The importance of precise detection of these arte-        To alleviate the problem caused by the wide range of ob-
facts is essential for high-quality endoscopic frame restora-     ject size, various solutions have been proposed. One category
tion and crucial for realizing reliable computer assisted en-     of solution focused on designing new CNN architectures to
doscopy tools for improved patient care.                          exploit the features at different levels. Under this paradigm,
    In the last few years, convolution neural network (CNN)       SSD [7] and MSCNN [8], use feature maps from different
has outperformed previous non-CNN based methods in solv-          layers to detect objects at different scales. Although the fea-
ing the object detection problem. The dominant CNN-based          tures for small objects survive in the low-layer features, they
methods fall into two categories, one-stage approach and two-     lack semantic information which is supposed to be encoded
stage approach, with the former method shining in speed and       in high level features. FPN [9], DSSD [10], STDN [11] inte-
the latter in accuracy. These two methods meet the demand         grate features at different layers. Another solution is to train
in different fields. For example, in the field of self-driving    a neural network on a multi-scale image pyramid, resulting
in a scale-invariant predictor [12]. Nevertheless, the previous       When cutting a patch of the image (Fig. 2 B, step 1),
solutions do not change the fact that high-level feature maps     the size of the patch and location of the patch was jittered,
for small objects are mixed and the receptive fields for large    which allows us to generate not exactly the same patch every
objects are limited given an image and a CNN. Recently, a         time even though the patch with the same object is selected.
new training method that detects all objects at a proper scale    Note that the size of the patch is always larger than the object
by scaling up small objects and scaling down large objects has    and a larger patch was cut if a smaller object exists in that
been reported in the state-of-the-art models, SNIPER [13] and     patch. Otherwise, if the object were always in the center or
TridentNet [14].                                                  same location in the patch, the model would not learn to detect
    In this study, we demonstrated the successful application     objects but learn to localize objects assuming there is always
of the idea of detecting objects of various size at the right     an object, which is not true. The setting for the size of a patch
scale in detecting artefacts in endoscopy. The report is orga-    (spatch ) is defined by this equation: spatch = r ×sbbox , where
nized in such an order: datasets, methods, results, discussion    r = 4.5, 2, 1.5, and 1.2, respectively, for the cases, sbbox < 80,
and conclusion.                                                   160, 350, and > 350. If no object exists in a random patch, a
                                                                  fixed size of patch was cut from an image.
                       2. DATASETS                                    After cutting a patch from an image, the patch was scaled
                                                                  up or down depending on the size of the object in that patch
The training dataset consists of 2,192 endoscopic images (Fig.    (Fig. 2 B, step 2). The scaling provides a zoomed-in view for
1 A), in which seven categories of artefacts were labeled [15,    small object objects and a zoomed-out view for large objects.
16]. The seven categories are pixel saturation, motion blur,      Thus, both high-level and low-level feature maps will exist
specular reflections, bubbles, strong contrast, instrument and    after an image passes a CNN. The scaling ratio (r) is inversely
other artefacts. The size of an artefact varies in a range from   proportional to the size (s) of the object in the patch: r = 160/s
a few pixels to one thousand pixels (Fig. 1 B). The number        and the size of all the objects are grouped into 6 bins. So
of objects in each category is from 453 to 5835 (Fig. 1 C).       the settings used here are (r=4, s < 40), (r=2, s¡80), (r=1,
The performance of a model was tested on two datasets, one        s < 160), (r=0.5, s < 250), (r=0.25, s < 640), and (r=0.13,
collected by the same endoscope and the other collected by        s > 640). The reason for choosing such settings is that there
a different endoscope to test the generalization ability of a     will be 5 pixels in the last layer of the encoder if a raw input
model. The former and latter testing datasets comprise 195        of 160 pixels is fed into the model, which has a feature stride
and 51 images, respectively.                                      of 32. A patch, which has no objects in it, was scaled with a
                                                                  random ratio on the fly.
                                                                      After scaling up or down the patch cut from each image in
                       3. METHODS
                                                                  a batch, objects that are too large/small were excluded. The
The model we built is a Faster R-CNN with a FPN as the            thresholds for too small and too large objects are 32 and 2000,
backbone. An FPN consists of mainly two parts, an encoder         respectively. Choosing 32 as the threshold is because of the
and an decoder, which is very similar to a U-Net [17] architec-   feature stride of the model is 32 and choosing 2000 is just
ture developed for image segmentation tasks. Considering the      because it is large enough.
memory capacity of our GPU (GTX 1070 16GB), We chose                  In each training iteration, multiple patches from multiple
ResNet-50 as the workhorse of the encoder [18]. The im-           images were cut and normalized as a batch by padding the
plementation was based on a modularized implementation of         patch with with the channel mean and concatenated, resulting
mask R-CNN [19] in Pytorch [20]. The weights of the model         in a batch of images, whose width and height are a multiple
were initialized with the weights trained on the COCO dataset     of 32 (Fig. 2 B, step 3). For a patch that is already a multiple
except that the weights for the classification and regression     of the stride size of the encoder, no padding was added. Since
head were initialized with random weights.                        the padding is always on the bottom and right side if neces-
    When training the model, we adapted the method pro-           sary, the coordinates of the bounding box does not change.
posed in [13] considering the class imbalance in our dataset      Collating multiple samples and unifying them in size can be
and introduced data augmentation by strategically cutting a       easily implemented in Pytorch1 . For other details like how
patch from an image for training. In specific, given all the      the bounding boxes were adjusted accordingly when cutting
bounding boxes (bboxes) in an image, k+1 bboxes were sam-         a patch from an image, see our code∗ on Github.
pled from all the bboxes in this image (Fig. 2 A). k is the
number of categories of objects in this image and 1 represents                              4. RESULTS
a random bbox. Such operation is to alleviate the class imbal-
ance problem (Fig. 1 C) in our dataset. Then one bbox was         In inference, we tested one image on all the scales used in
sampled from the k+1 bboxes. Finally a patch of the image         training (scales: 4, 2, 1, 0.5, 0.25 and 0.13). The coordi-
was cut out and scaled up or down depending on the size of
the object in that patch.                                           1 https://pytorch.org
             A
                                           bubbles                                                                          Instrument


                                             specularity
                                artifact


            B                                               specularity
                                                                          C6000
                                                            saturation
                                                            artifact
                   800                                                                       5000
                                                            blur
                                                            contrast


                                                                          count of objects
count of objects


                                                            bubbles
                                                                                             4000
                   600                                      instrument


                                                                                             3000
                   400
                                                                                             2000

                   200
                                                                                             1000


                     0                                                                           0
                         0   200   400     600    800     1000   1200
                                                                                                      ent   blur ration ntrast ubbles rtifact ularity
                                                                                                   um
                              mean of width and height (pixel)
                                                                                             inst
                                                                                                  r          satu     co     b       a
                                                                                                                                         spe
                                                                                                                                             c


Fig. 1. A. Two sample endoscopic images. B. The distribution of the size of each category of artefacts in the training set. C.
The distribution of the counts of seven categories of artefact in the training set.


nates of all the detected objects were transformed back to the                                          5. DISCUSSION & CONCLUSION
original scale. To remove redundant bounding boxes, non-
maximum suppression were conducted on all the detected                                       The training method boosted the performance of Faster
bounding boxes for each category of object. It was observed                                  RCNN in two ways. First, as we discussed in the Intro-
that many false positive, which were small objects, were de-                                 duction section, it alleviates the scale variation problem by
tected on the scale of 4, so I finally chose not to include the                              scaling up/down an object to the right scale to detect. Second,
prediction on that scale.                                                                    randomly cutting a patch which includes an object allows us
                                                                                             to generate far more different training images compared to
                                                                                             feeding the whole image to the model. Thus, cutting a patch
                                                                                             serves as a data augmentation technique. Besides, it offers
                                                                                             flexibility to deal with the class imbalance as we can choose
    The performance of our model was evaluated by a hybrid                                   which patch to cut from a training image, considering the
metric which was a weighted score of mean average preci-                                     distribution of the counts of all the classes.
sion (mAP) and Intersection over Union (IoU): 0.6 × mAP                                          Since the detected bboxes on all the scales were merged, a
+ 0.4 × IoU. We compared the performance of two ways for                                     bbox was called if it was detected on any of the scales. Such
training the same model. One way is training the model on                                    an integration approach tends to report more false positive.
the whole image every time and the other is on patches gen-                                  One failure case we observed is that a false positive object
erated following the method described above. The threshold                                   does look like a true object because the model decides with-
for the probability when determining an object is 0.65. In the                               out considering the context of that object. We run into the
former case, the best overall score on the two testing sets was                              case when an image is scaled up by 4 times. Thus the con-
0.221. For the latter case, we trained the network for 27,105                                text information does matter and a context refinement proba-
iterations (batch size is 8 in each iteration) and a significant                             bly corrects such kind of errors [21]. An alternative solution
boost in performance was observed. The score we reached                                      to solve this bias of this method can be feeding the detected
was 0.293, which was among the top 10 teams on the leader-                                   bboxes as the input for the RoI pooling layer and merging the
board.                                                                                       features generated by the model on an image pyramid. Since
   A       True bounding boxes                                                              Balanced bounding boxes

                x1       y1     x2        y2       label            Sample a bbox from        x1     y1       x2     y2      label
               312      910    997      1720     instrument         each class and add a      312    910     997    1720   instrument
                                                                       random bbox
                0       234    99        312      artifact                                   1100    650     1250   790     artifact
               1100     650    1250      790      artifact                                    510    10      620    110    background
                                                     …
               1196     801    1397      912      artifact


           Image
   B       0                   800                1600                                          Pad with
     0                                                                                        channel mean

                                                           Scale up
                                      Patch1
                                         Bbox1
                      Patch2
    800                    Bbox2
                                                                                             Pad with
                                                                                           channel mean
                                                           Scale down


    1600
                          step 1 cut                         step 2 scale                       step 3 pad


Fig. 2. The method for training the model to detect objects of various size on the right scale. A. a balanced set of bounding
boxes was generated by sampling a bbox from each class and adding a random bbox. B. then a single bbox was sampled and a
patch including the bbox was cut from the image. The patch was scaled down or up depending on the size of the object in the
patch. Finally a batch of patches with unified size were generated.


the scaled down images include more context information, we                   [3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian
expect the problem to be solved in this way.                                      Sun. Faster r-cnn: Towards real-time object detection
    To further boost the detection performance, the regular                       with region proposal networks. In Advances in neural
convolution operation in the FPN backbone can be replaced                         information processing systems, pages 91–99, 2015.
by deformable convolution operation will enhance the trans-
                                                                              [4] Matthew D Zeiler and Rob Fergus. Visualizing and un-
formation modeling capacity of CNNs [22] or a newly pro-
                                                                                  derstanding convolutional networks. In European con-
posed backbone designed for object detection task [?]. In
                                                                                  ference on computer vision, pages 818–833. Springer,
conclusion, there is still room for improvement and we have
                                                                                  2014.
demonstrated the performance of a Faster R-CNN model can
be improved significantly by training and detecting the ob-                   [5] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard
jects on the right scale.                                                         Zemel. Understanding the effective receptive field in
                                                                                  deep convolutional neural networks. In Advances in
                                                                                  neural information processing systems, pages 4898–
                           6. REFERENCES                                          4906, 2016.
                                                                              [6] Bharat Singh and Larry S Davis. An analysis of scale in-
 [1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jiten-
                                                                                  variance in object detection snip. In Proceedings of the
     dra Malik. Rich feature hierarchies for accurate object
                                                                                  IEEE conference on computer vision and pattern recog-
     detection and semantic segmentation. In Proceedings
                                                                                  nition, pages 3578–3587, 2018.
     of the IEEE conference on computer vision and pattern
     recognition, pages 580–587, 2014.                                        [7] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian
                                                                                  Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C
 [2] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE                        Berg. Ssd: Single shot multibox detector. In European
     international conference on computer vision, pages                           conference on computer vision, pages 21–37. Springer,
     1440–1448, 2015.                                                             2016.
 [8] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno         [19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross
     Vasconcelos. A unified multi-scale deep convolutional           Girshick. Mask r-cnn. In Proceedings of the IEEE in-
     neural network for fast object detection. In Euro-              ternational conference on computer vision, pages 2961–
     pean conference on computer vision, pages 354–370.              2969, 2017.
     Springer, 2016.
                                                                [20] Francisco Massa and Ross Girshick.         maskrcnn-
 [9] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He,         benchmark:         Fast,   modular reference im-
     Bharath Hariharan, and Serge Belongie. Feature pyra-            plementation of Instance Segmentation and
     mid networks for object detection. In Proceedings of            Object      Detection    algorithms   in    PyTorch.
     the IEEE Conference on Computer Vision and Pattern              https://github.com/facebookresearch/maskrcnn-
     Recognition, pages 2117–2125, 2017.                             benchmark, 2018.
[10] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish              [21] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-
     Tyagi, and Alexander C Berg. Dssd: Deconvolutional              Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urta-
     single shot detector. arXiv preprint arXiv:1701.06659,          sun, and Alan Yuille. The role of context for object de-
     2017.                                                           tection and semantic segmentation in the wild. In Pro-
                                                                     ceedings of the IEEE Conference on Computer Vision
[11] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, and              and Pattern Recognition, pages 891–898, 2014.
     Yi Xu. Scale-transferrable object detection. In Proceed-
     ings of the IEEE Conference on Computer Vision and         [22] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong
     Pattern Recognition, pages 528–537, 2018.                       Zhang, Han Hu, and Yichen Wei. Deformable convo-
                                                                     lutional networks. In Proceedings of the IEEE interna-
[12] Edward H Adelson, Charles H Anderson, James R                   tional conference on computer vision, pages 764–773,
     Bergen, Peter J Burt, and Joan M Ogden. Pyramid meth-           2017.
     ods in image processing. RCA engineer, 29(6):33–41,
     1984.
[13] Bharat Singh, Mahyar Najibi, and Larry S Davis.
     Sniper: Efficient multi-scale training. In Advances in
     Neural Information Processing Systems, pages 9310–
     9320, 2018.
[14] Yanghao Li, Yuntao Chen, Naiyan Wang, and Zhaoxi-
     ang Zhang. Scale-aware trident networks for object de-
     tection. arXiv preprint arXiv:1901.01892, 2019.
[15] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden,
     Adam Bailey, Stefano Realdon, James East, Georges
     Wagnires, Victor Loschenov, Enrico Grisan, Wal-
     ter Blondel, and Jens Rittscher.       Endoscopy arti-
     fact detection (EAD 2019) challenge dataset. CoRR,
     abs/1905.03209, 2019.
[16] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden,
     James East, Xin Lu, and Jens Rittscher. A deep learn-
     ing framework for quality assessment and restoration in
     video endoscopy. CoRR, abs/1904.07073, 2019.
[17] Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
     U-net: Convolutional networks for biomedical image
     segmentation. In International Conference on Medi-
     cal image computing and computer-assisted interven-
     tion, pages 234–241. Springer, 2015.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
     Sun. Deep residual learning for image recognition. In
     Proceedings of the IEEE conference on computer vision
     and pattern recognition, pages 770–778, 2016.