ARTEFACT DETECTION IN VIDEO ENDOSCOPY USING RETINANET AND FOCAL LOSS FUNCTION

Ilkay Oksuz1, James R. Clough1, Andrew P. King1*, Julia A. Schnabel1*
1 School of Biomedical Engineering & Imaging Sciences, King's College London, UK
* Joint last authors.

This work was supported by an EPSRC programme Grant (EP/P001009/1) and the Wellcome EPSRC Centre for Medical Engineering at the School of Biomedical Engineering and Imaging Sciences, King's College London (WT 203148/Z/16/Z). We acknowledge financial support from the Department of Health via the NIHR comprehensive Biomedical Research Centre award to Guy's & St Thomas' NHS Foundation Trust with KCL and King's College Hospital NHS Foundation Trust.

ABSTRACT

Endoscopic Artefact Detection (EAD) is a fundamental task for enabling the use of endoscopy images for diagnosis and treatment of diseases in multiple organs. Precise detection of specific artefacts such as pixel saturations, motion blur, specular reflections, bubbles and instruments is essential for high-quality frame restoration. This work describes our submission to the EAD 2019 challenge to detect bounding boxes for seven classes of artefacts in endoscopy videos. Our method is based on the focal loss and the RetinaNet architecture with a ResNet-152 backbone. We have generated a large derivative dataset by augmenting the original images with free-form deformations to prevent over-fitting. Our method reaches a mAP of 0.2719 and an IoU of 0.3456 for the detection task over all classes of artefact for 195 images. We report comparable performance for the generalization dataset, reaching a mAP of 0.2974 and a deviation from the detection dataset of 0.0859.

Index Terms— Endoscopic artefact detection, focal loss, RetinaNet, class imbalance

1. INTRODUCTION

Endoscopy is a procedure in which the inside of the body is examined using a long, thin, flexible tube that has a light source and camera at one end, which allows visualization of the inside of organs on a screen. It is a widely used clinical procedure for the early detection of numerous cancers as well as for therapeutic procedures and minimally invasive surgery. A major handicap of endoscopy video frames is that they are subject to heavy corruption with multiple artefacts. The Endoscopy Artefact Detection (EAD) challenge at the ISBI 2019 conference provides a multi-institutional dataset consisting of 7 different types of artefact (i.e. saturation, motion blur, specular reflections, bubbles, instrument, contrast and artifact). These artefacts not only cause difficulties in visualizing the underlying tissue during diagnosis but also affect any post-analysis methods required for follow-up (e.g. video mosaicking done for archival purposes and video-frame retrieval needed for reporting). Accurate detection of artefacts is a core challenge in a wide range of endoscopic applications addressing multiple different disease areas. Precise detection of these artefacts is essential for high-quality endoscopic frame restoration and is crucial for realising reliable computer-assisted endoscopy tools for improved patient care. An example of ground truth bounding box annotations is visualized in Figure 1a.

Existing endoscopy workflows detect only one artefact class, which is insufficient to obtain high-quality frame restoration, as detailed in a comprehensive review about image quality estimation [1]. In general, the same video frame can be corrupted with multiple artefacts; e.g. motion blur, specular reflections and low contrast can be present in the same frame. Furthermore, not all artefact types contaminate the frame equally. So, unless the multiple artefacts present in a frame are known with their precise spatial locations, clinically relevant frame restoration quality cannot be guaranteed. Another advantage of such detection is that frame quality assessments can be guided to minimise the number of frames that get discarded during automated video analysis.

2. RELATED WORKS

The existing works on endoscopic artefact detection are mainly focused on thresholding-based methods using the HSV [2] and RGB colour channels. Queiroz et al. [3] proposed a principal component analysis based detection algorithm for specular artefacts. Akbari et al. [4] proposed a non-linear SVM for specular artefact detection, using both HSV and RGB colour space information for segmentation of specular reflections. The SVM was trained with 12 statistical features including the mean and standard deviation of each channel of the RGB and HSV colour spaces.

The nature of this multi-class artefact detection challenge is in close relation to object detection challenges in computer vision (e.g. the COCO challenge [5]). The top performing algorithms on COCO and similar computer vision object detection challenges are based on convolutional neural network deep learning architectures. Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [6], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [7, 8], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [5]. Despite the success of two-stage detectors, one-stage detectors, which are applied over a regular, dense sampling of object locations, scales and aspect ratios, have also been proposed. Recent work on one-stage detectors, such as YOLO [9], demonstrates promising results, yielding faster detectors with high accuracy. In this direction, Lin et al. proposed RetinaNet [10], a one-stage object detector that matches the state-of-the-art COCO Average Precision (AP) of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [11] or variants of Faster R-CNN [7]. To achieve this result, class imbalance during training was identified as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy, and a new loss function that eliminates this barrier was proposed.

Class imbalance is one key issue in the EAD 2019 multi-artefact detection challenge [12, 13], where the classes have an imbalanced distribution in the training set (e.g. specularity 43%, blur 3.5%, artifact 12%). Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage rapidly narrows down the number of candidate object locations to a small number, filtering out most background samples. In this paper, we address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples, similar to [10]. The focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector based on RetinaNet. As highlighted in [10], when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. We use a loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples.
3. METHODS

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design proposed specifically for one-stage, dense detection. While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to their exact values, as shown in the experiments. We detail the components of RetinaNet in the following sections.

3.1. Feature Pyramid Network Backbone

We adopt the Feature Pyramid Network (FPN) from [11] as the backbone network for RetinaNet. In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections so that the network efficiently constructs a rich, multi-scale feature pyramid from a single-resolution input image. Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN), as well as from two-stage detectors such as Fast R-CNN or Mask R-CNN. Following this, we build FPN on top of the ResNet architecture [14]. We construct a pyramid with levels P3 through P7, where l indicates the pyramid level (Pl has resolution 2^l lower than the input). As in [11], all pyramid levels have C = 256 channels. Details of the pyramid can be found in [11].
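To make the pyramid construction concrete, the following is a minimal Keras/TensorFlow sketch (not the authors' released code) of how levels P3 to P7 can be assembled from three backbone feature maps with lateral 1x1 convolutions and a top-down pathway. The `build_fpn` helper and the placeholder C3-C5 input shapes (for a 512x512 image) are illustrative assumptions.

```python
# Minimal FPN sketch (illustrative, not the authors' implementation).
# C3, C4, C5 stand in for ResNet-152 stage outputs at strides 8, 16, 32.
import tensorflow as tf
from tensorflow.keras import layers

def build_fpn(c3, c4, c5, channels=256):
    """Builds pyramid levels P3-P7 from backbone features C3-C5."""
    # Lateral 1x1 convolutions reduce each stage to `channels` filters.
    p5 = layers.Conv2D(channels, 1, padding='same', name='lat_c5')(c5)
    p4 = layers.Conv2D(channels, 1, padding='same', name='lat_c4')(c4)
    p3 = layers.Conv2D(channels, 1, padding='same', name='lat_c3')(c3)

    # Top-down pathway: upsample coarser levels and add to the lateral maps.
    p4 = layers.Add()([layers.UpSampling2D(2)(p5), p4])
    p3 = layers.Add()([layers.UpSampling2D(2)(p4), p3])

    # 3x3 convolutions smooth the merged maps.
    p3 = layers.Conv2D(channels, 3, padding='same', name='p3')(p3)
    p4 = layers.Conv2D(channels, 3, padding='same', name='p4')(p4)
    p5 = layers.Conv2D(channels, 3, padding='same', name='p5')(p5)

    # P6 and P7 extend the pyramid to coarser scales (strides 64 and 128).
    p6 = layers.Conv2D(channels, 3, strides=2, padding='same', name='p6')(c5)
    p7 = layers.Conv2D(channels, 3, strides=2, padding='same', name='p7')(
        layers.ReLU()(p6))
    return [p3, p4, p5, p6, p7]

# Example: stand-in feature maps for a 512x512 input (strides 8, 16, 32).
c3 = layers.Input(shape=(64, 64, 512))
c4 = layers.Input(shape=(32, 32, 1024))
c5 = layers.Input(shape=(16, 16, 2048))
pyramid = build_fpn(c3, c4, c5)
model = tf.keras.Model([c3, c4, c5], pyramid)
```

In practice the C3-C5 tensors would be taken from intermediate layers of the ResNet-152 backbone rather than from standalone inputs.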
3.2. Anchors

We use translation-invariant anchor boxes similar to those in the original RetinaNet [10]. The anchors have areas of 32^2 to 512^2 on pyramid levels P3 to P7, respectively. As in [11], at each pyramid level we use anchors at three aspect ratios 1:2, 1:1, 2:1. For denser scale coverage, at each level we add anchors of sizes 2^0, 2^(1/3), 2^(2/3) times the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are A = 9 anchors per level, and across levels they cover the scale range 32 to 813 pixels with respect to the network's input image. Each anchor is assigned a length-K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN, but modified for multi-class detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.7, and to background if their IoU is in [0, 0.6). As each anchor is assigned to at most one object box, we set the corresponding entry in its length-K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with overlap in [0.6, 0.7), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
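The following NumPy sketch illustrates the anchor generation and IoU-based assignment rule described above, under simplifying assumptions (no image-boundary handling, no empty ground-truth case); the helper names `anchors_for_level` and `assign_targets` are our own.

```python
# Sketch of anchor generation and IoU-based target assignment (illustrative).
import numpy as np

RATIOS = [0.5, 1.0, 2.0]                       # aspect ratios 1:2, 1:1, 2:1
SCALES = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]  # octave scales per level

def anchors_for_level(base_size, stride, feat_h, feat_w):
    """Returns (feat_h*feat_w*9, 4) anchors as (x1, y1, x2, y2)."""
    shapes = []
    for scale in SCALES:
        for ratio in RATIOS:           # ratio = height / width
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            shapes.append((w, h))
    shapes = np.array(shapes)          # (9, 2)
    # Anchor centres lie on the feature-map grid, spaced by the stride.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centres = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (HW, 2)
    boxes = np.concatenate(
        [centres[:, None, :] - shapes[None] / 2,
         centres[:, None, :] + shapes[None] / 2], axis=2)
    return boxes.reshape(-1, 4)

def iou(anchors, gt):
    """Pairwise IoU between (N, 4) anchors and (M, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def assign_targets(anchors, gt_boxes, gt_labels, num_classes):
    """One-hot classification targets: positive >= 0.7 IoU, negative < 0.6."""
    cls = np.zeros((len(anchors), num_classes))
    overlaps = iou(anchors, gt_boxes)
    best_gt = overlaps.argmax(axis=1)
    best_iou = overlaps.max(axis=1)
    ignore = (best_iou >= 0.6) & (best_iou < 0.7)   # ignored during training
    positive = best_iou >= 0.7
    cls[positive, gt_labels[best_gt[positive]]] = 1
    return cls, positive, ignore
```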
3.3. Classification Subnet

The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3x3 convolutional layers, each with C filters and each followed by ReLU activations, followed by a 3x3 convolutional layer with KA filters. Finally, sigmoid activations are attached to output the KA binary predictions per spatial location. We use C = 256 and A = 9 in most experiments. In contrast to RPN, our object classification subnet is deeper, uses only 3x3 convolutions, and does not share parameters with the box regression subnet. We found these higher-level design decisions to be more important than the specific values of hyperparameters.

3.4. Box Regression Subnet

In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet, except that it terminates in 4A linear outputs per spatial location. For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box. The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.
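A minimal Keras sketch of the two subnets described above, assuming K = 7 artefact classes and A = 9 anchors; the prior-based bias initialization of the final classification layer follows the scheme described later in Section 4.2 and is included here as an assumption.

```python
# Sketch of the classification and box regression subnets (illustrative).
import math
import tensorflow as tf
from tensorflow.keras import layers, initializers

K_CLASSES, A_ANCHORS, C_FILTERS = 7, 9, 256
PRIOR = 0.01  # foreground prior used to initialise the final bias (Sec. 4.2)

def classification_head():
    """Four 3x3 convs with ReLU, then K*A sigmoid outputs per location."""
    inputs = layers.Input(shape=(None, None, C_FILTERS))
    x = inputs
    for i in range(4):
        x = layers.Conv2D(
            C_FILTERS, 3, padding='same', activation='relu',
            kernel_initializer=initializers.RandomNormal(stddev=0.01),
            name=f'cls_conv{i}')(x)
    outputs = layers.Conv2D(
        K_CLASSES * A_ANCHORS, 3, padding='same', activation='sigmoid',
        kernel_initializer=initializers.RandomNormal(stddev=0.01),
        bias_initializer=initializers.Constant(-math.log((1 - PRIOR) / PRIOR)),
        name='cls_out')(x)
    return tf.keras.Model(inputs, outputs, name='classification_subnet')

def box_regression_head():
    """Same structure, terminating in 4*A linear outputs per location."""
    inputs = layers.Input(shape=(None, None, C_FILTERS))
    x = inputs
    for i in range(4):
        x = layers.Conv2D(C_FILTERS, 3, padding='same', activation='relu',
                          name=f'box_conv{i}')(x)
    outputs = layers.Conv2D(4 * A_ANCHORS, 3, padding='same',
                            name='box_out')(x)
    return tf.keras.Model(inputs, outputs, name='box_regression_subnet')

# Applying one head instance to every pyramid level shares its parameters:
# cls_head = classification_head()
# cls_outputs = [cls_head(p) for p in pyramid]
```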
3.5. Focal Loss

The focal loss [10] focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. Formally, the focal loss is a modified version of the cross entropy loss with a tunable focusing parameter γ:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

where p_t is the predicted probability of the true class and α_t is a weighting parameter. There are two important properties of the focal loss which make it appealing for the EAD 2019 challenge: (1) When an example is misclassified and p_t is small, the modulating factor is near 1 and the loss is unaffected. As p_t increases, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, the focal loss is equivalent to cross entropy, and as γ is increased the effect of the modulating factor is likewise increased. Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. The total loss is a combination of the focal loss and a regression loss on the bounding boxes.

4. IMPLEMENTATION DETAILS

RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet. We use a ResNet-152-FPN backbone to run our experiments. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 200 top-scoring predictions per FPN level, after thresholding detector confidence at 0.36. The top predictions from all levels are merged, and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections. We trained our network using the Keras framework with the TensorFlow library on an NVIDIA P6000 GPU.

4.1. Focal Loss

We use the focal loss described in Section 3.5 as the loss on the output of the classification subnet. We find that γ = 2 and α = 0.25 work well in practice and that RetinaNet is relatively robust to these values. We emphasize that when training RetinaNet, the focal loss is applied to all ~100k anchors in each sampled image. This stands in contrast to the common practice of using heuristic sampling (RPN) or hard example mining (OHEM, SSD) to select a small set of anchors for each minibatch. The total focal loss of an image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not the total number of anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. In general, α should be decreased slightly as γ is increased, as highlighted in the original RetinaNet paper.
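A sketch of how the focal loss with these settings could be implemented in TensorFlow, assuming per-anchor one-hot targets together with an `anchor_state` vector marking positive (1), negative (0) and ignored (-1) anchors as in Section 3.2; this is illustrative rather than the exact training code.

```python
# Sketch of the focal loss with alpha = 0.25, gamma = 2 (illustrative).
import tensorflow as tf

def focal_loss(y_true, y_pred, anchor_state, alpha=0.25, gamma=2.0):
    """y_true, y_pred: (num_anchors, K); anchor_state: 1 pos, 0 neg, -1 ignore."""
    # Drop anchors whose overlap falls in the ignored range [0.6, 0.7).
    keep = tf.where(tf.not_equal(anchor_state, -1))[:, 0]
    y_true = tf.gather(y_true, keep)
    y_pred = tf.gather(y_pred, keep)

    # p_t is the predicted probability of the true label at each position.
    p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
    alpha_t = tf.where(tf.equal(y_true, 1.0),
                       alpha * tf.ones_like(y_true),
                       (1.0 - alpha) * tf.ones_like(y_true))

    # The modulating factor (1 - p_t)^gamma down-weights easy examples.
    eps = tf.keras.backend.epsilon()
    loss = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t + eps)

    # Normalise by the number of anchors assigned to a ground-truth box.
    num_positive = tf.maximum(
        tf.reduce_sum(tf.cast(tf.equal(anchor_state, 1), tf.float32)), 1.0)
    return tf.reduce_sum(loss) / num_positive
```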
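For completeness, a simplified sketch of the inference-time filtering described at the start of this section (confidence threshold 0.36, at most 200 boxes per level, NMS at 0.5). It assumes already-decoded boxes and class-agnostic NMS, and the 300-detection cap is an assumption rather than a value from the paper.

```python
# Sketch of inference-time box filtering (illustrative, class-agnostic NMS).
import tensorflow as tf

def filter_detections(boxes_per_level, scores_per_level,
                      score_threshold=0.36, top_k=200, nms_threshold=0.5):
    """boxes: list of (N_l, 4) tensors; scores: list of (N_l,) tensors."""
    kept_boxes, kept_scores = [], []
    for boxes, scores in zip(boxes_per_level, scores_per_level):
        # Confidence thresholding, then keep at most `top_k` boxes per level.
        mask = scores > score_threshold
        boxes = tf.boolean_mask(boxes, mask)
        scores = tf.boolean_mask(scores, mask)
        k = tf.minimum(top_k, tf.shape(scores)[0])
        scores, idx = tf.math.top_k(scores, k=k)
        kept_boxes.append(tf.gather(boxes, idx))
        kept_scores.append(scores)

    # Merge all levels and apply non-maximum suppression.
    boxes = tf.concat(kept_boxes, axis=0)
    scores = tf.concat(kept_scores, axis=0)
    keep = tf.image.non_max_suppression(boxes, scores,
                                        max_output_size=300,  # assumed cap
                                        iou_threshold=nms_threshold)
    return tf.gather(boxes, keep), tf.gather(scores, keep)
```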
4.2. Initialization

All new convolutional layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final convolutional layer of the classification subnet, we set the bias initialization to b = −log((1 − π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of π. We use π = 0.01 in all experiments, although results are robust to the exact value. This initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.

4.3. Optimization

RetinaNet is trained with stochastic gradient descent (SGD). We use a minibatch size of 3 images. The model is trained for 10000 iterations with an initial learning rate of 0.001, which is then divided by 10 at 5000 and again at 7500 iterations. Weight decay of 0.0001 and momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression. Training of the network took 26 hours.

Fig. 1: Example artefact detection and confidence scores from the training detection set (result using 5-fold cross-validation): (a) example ground truth bounding boxes; (b) predicted bounding boxes. The example shown was in the validation set for this setup and was not used during training of the network.

4.4. Augmentation

Our scheme of image augmentations was designed to prevent overfitting to the set of training images, and so make our method more generalisable to the images in the test set. We assessed the effect of these augmentations by training the network with just the original training data and applying it to the test set images to produce artefact detections. Consistent with this is the observation that training without augmentations produces a much smaller final loss value than training with augmentations. Having trained on the whole dataset for 10000 iterations with a batch size of 600 images, the final loss value without augmentations is 0.0082 but with augmentations is 0.0605. This clearly indicates significant over-fitting to the training dataset when augmentations are not used.

Fig. 2: Example artefact detections and confidence scores from the test detection set.
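The paper describes the augmentations only as free-form deformations; the sketch below shows one common way to generate such a deformation with a smoothed random displacement field. The parameters `alpha` and `sigma` are illustrative, and in a detection setting the bounding box coordinates would need to be remapped with the same field.

```python
# Sketch of a free-form (elastic) deformation augmentation (illustrative;
# the exact scheme is not specified in the paper, and bounding boxes would
# need to be remapped with the same displacement field).
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=30.0, sigma=8.0, rng=None):
    """Warps an (H, W, C) image with a smoothed random displacement field."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Random displacements, smoothed with a Gaussian and scaled by alpha.
    dx = gaussian_filter(rng.uniform(-1, 1, size=(h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, size=(h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = [np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)]
    warped = np.stack(
        [map_coordinates(image[..., c], coords, order=1, mode='reflect')
         for c in range(image.shape[-1])], axis=-1)
    return warped
```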
5. EXPERIMENTAL RESULTS

We used a stratified 5-fold cross-validation strategy to optimize the parameters of the network. Table 1 summarizes the quantitative results achieved over the 5 folds for each of the seven artefact classes. The data imbalance between the different classes, and how easily distinguishable each specific class is, influence the mAP score obtained for each class. We reach a 0.2719 mean average precision (mAP) score on 195 cases over all 7 artefact classes. The intersection over union (IoU) for our predictions is 0.3456 for the detection task over all classes. We report comparable performance for the 51-image generalization dataset, reaching a mAP of 0.2974 and a deviation from the detection dataset of 0.0859.

The visual result of 5-fold cross-validation for a case from the validation cohort is shown in Figure 1. The trained network is capable of generating bounding boxes with high confidence for a case that was unseen during training. A qualitative result from the detection test set is illustrated in Figure 2, with the prediction probabilities. The bubble and artefact classes are correctly identified in the example image. The ground truth is not available for this case.

Fold    specularity  saturation  artifact  blur    contrast  bubbles  instrument  mAP     IoU
0       0.5694       0.5908      0.6258    0.6053  0.6832    0.4962   0.6785      0.5245  0.4591
1       0.5619       0.5977      0.6518    0.6098  0.7037    0.5060   0.6858      0.5309  0.4562
2       0.5709       0.6193      0.6674    0.5969  0.7104    0.5112   0.6969      0.5401  0.4015
3       0.5613       0.6291      0.6689    0.6011  0.6974    0.5076   0.6897      0.5354  0.4209
4       0.5659       0.6072      0.6882    0.6185  0.7141    0.5213   0.6871      0.5442  0.4173
Mean    0.5659       0.6088      0.6604    0.6063  0.7018    0.5085   0.6876      0.5350  0.4310
D-Test  N/A          N/A         N/A       N/A     N/A       N/A      N/A         0.2719  0.3456
G-Test  N/A          N/A         N/A       N/A     N/A       N/A      N/A         0.2974  N/A

Table 1: 5-fold validation average precision (AP) per class and intersection over union (IoU) results for the seven classes. The AP results for each class, the mean AP (mAP) and the IoU are reported for the validation over 5 folds (e.g. for fold 0, the first fifth of the images in the dataset was used for validation). The IoU column is the intersection over union between the bounding boxes inferred from the fold network and the ground truth bounding boxes. The Mean row reports the per-column average over the five folds. D-Test and G-Test correspond to the detection and generalization test data, for which per-class results are not reported in the challenge.

6. DISCUSSION AND FUTURE WORK

In our experience, when artefact classes are poorly detected, a significant factor is the size and total number of bounding boxes produced. The main difference between different setups was the number of bounding boxes generated for artefacts in a given neighbourhood. One critical factor in the final mAP score is the probability threshold used to include the detected artefacts. In future work, we aim to apply our algorithm to other artefact localization tasks in medical imaging (e.g. cardiac MR), subject to the availability of training data.

7. REFERENCES

[1] Bernd Münzer, Klaus Schoeffmann, and Laszlo Böszörmenyi, "Content-based processing and analysis of endoscopic images and videos: A survey," Multimedia Tools and Applications, vol. 77, no. 1, pp. 1323–1362, 2018.

[2] Othmane Meslouhi, Mustapha Kardouchi, Hakim Allali, Taoufiq Gadi, and Yassir Benkaddour, "Automatic detection and inpainting of specular reflections for colposcopic images," Open Computer Science, vol. 1, no. 3, pp. 341–354, 2011.

[3] Fabiane Queiroz and Tsang Ing Ren, "Endoscopy image restoration: A study of the kernel estimation from specular highlights," Digital Signal Processing, 2019.

[4] Mojtaba Akbari, Majid Mohrekesh, Kayvan Najariani, Nader Karimi, Shadrokh Samavi, and SM Reza Soroushmehr, "Adaptive specular reflection detection and inpainting in colonoscopy video frames," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3134–3138.

[5] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.

[9] Joseph Redmon and Ali Farhadi, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.

[10] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.

[11] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[12] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnires, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[13] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," CoRR, vol. abs/1904.07073, 2019.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.