             FOCAL LOSS FOR ARTEFACT DETECTION IN MEDICAL ENDOSCOPY

              Maxime Kayser1 , Roger D. Soberanis-Mukul1 , Shadi Albarqouni1 , Nassir Navab1
1 Chair for Computer Aided Medical Procedures (CAMP), Technische Universität München (TUM),
                                            Munich, Germany


ABSTRACT

Endoscopic video frames tend to be corrupted by various artefacts impairing their visibility. Automated detection of these artefacts will foster advances in computer-assisted diagnosis, post-examination procedures, and frame restoration software. In this work, we propose an ensemble of deep learning object detectors to automate multi-class artefact detection in video endoscopy. Our approach achieved a mean average precision (mAP) of 0.3087 and an average intersection-over-union (IoU) of 0.3997 on the EAD2019 test set. This resulted in a final score of 0.3451 and the 3rd rank on the EAD2019 object detection sub-challenge leaderboard.

Index Terms— Endoscopy, RetinaNet, artefact detection, deep learning, object detection

1. INTRODUCTION

Tissue characteristics of hollow organs, as well as the different instruments and illumination modes applied in medical endoscopy, lead to a high number of artefacts that obstruct visibility. The frames produced during endoscopic interventions can be corrupted by bubbles, instruments, and image deficiencies such as specular reflections, strong contrasts, saturated pixels, motion blur, or other artefacts. These corruptions have a negative effect on both the live diagnosis and post-intervention procedures. Successfully detecting the artefacts will thus assist endoscopic experts and will be the cornerstone of successful endoscopic video frame restoration.

Over recent years, deep learning techniques have become the state of the art in medical image analysis, and they have proved particularly successful in related medical endoscopy computer vision tasks, such as polyp detection [1]. Deep learning object detection methods can be separated into one- and two-stage methods. One-stage methods are generally faster, as they do not rely on an additional region proposal step. RetinaNet is a one-stage detector proposed by Lin et al. [2]. The novelty of this architecture is the proposed focal loss, which addresses the imbalance between foreground and background anchors that occurs in one-stage methods. RetinaNet outperforms two-stage methods such as Faster R-CNN on COCO test-dev. We use this network as the base architecture for our challenge submission. The EAD dataset [3, 4] is very unbalanced and contains objects at vastly different scales. RetinaNet's built-in Feature Pyramid Network (FPN) [5] and focal loss can effectively address these issues.

2. METHODS

Our method consists of an ensemble of seven RetinaNet architectures that vary in hyperparameters, backbone networks, transfer learning, data augmentation, and the training subset used. The models are combined based on an efficient voting scheme.

2.1. RetinaNet architecture

The RetinaNet detector consists of a backbone network for extracting a convolutional feature map and two subnetworks that perform object classification and bounding box regression via convolution. The classification loss is given by the focal loss and the regression loss by the smooth L1 loss. The sum of these two losses constitutes the overall loss that is minimized during training. RetinaNet is a one-stage method, meaning that it does not require a region proposal module. Instead, anchors at different scales and aspect ratios are densely distributed across the image, and all of them are classified by the network. In order to construct a multi-scale feature pyramid from a single-resolution input image, the backbone network is augmented by a Feature Pyramid Network (FPN) [5]. FPNs are a top-down architecture with lateral connections that allows semantically rich layers to be built at all scales at marginal computational cost. FPNs have proven especially effective in the detection of small objects and are therefore well suited to our use case. Pyramid levels and anchors were generated according to the specifications in [2]. We experimented with different IoU thresholds for assigning an anchor to a ground-truth object and validated the thresholds used in [2]. No other changes were made to the RetinaNet classification and regression subnetworks.

In our experiments, we used both VGGNet [6] and ResNet [7] convolutional neural networks (CNNs) as the backbone network in our framework. VGGNet is a CNN from 2014 with a simple architecture that consists of convolution layers, pooling layers, and fully connected layers. We tested both the 16- and 19-layer versions of VGGNet (Table 2). ResNets are much deeper CNNs that maintain their trainability and generalization capability through residual (skip) connections. Given that ResNets are much deeper and can extract more elaborate features, they generally outperform VGGNet on most public validation test sets. We experimented with the 50-, 101-, and 152-layer versions of ResNet.
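To make the anchor assignment rule concrete, the following NumPy sketch (our own illustration, not the code used for the submission; the (x1, y1, x2, y2) box format and function names are assumptions) labels each anchor as positive, negative, or ignored using the negative threshold of 0.4 and positive threshold of 0.5 that worked best in our experiments (Sec. 3.4.1).

import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4), both (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def assign_anchors(anchors, gt_boxes, neg_thr=0.4, pos_thr=0.5):
    """Label anchors: 1 = positive, 0 = negative (background), -1 = ignored during training.
    Assumes at least one ground-truth box is present in the image."""
    ious = iou_matrix(anchors, gt_boxes)
    best_iou = ious.max(axis=1)      # best overlap of each anchor with any ground-truth box
    best_gt = ious.argmax(axis=1)    # index of the matched box (used for regression targets)
    labels = np.full(len(anchors), -1, dtype=int)
    labels[best_iou < neg_thr] = 0   # clearly background
    labels[best_iou >= pos_thr] = 1  # clearly foreground
    return labels, best_gt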

2.2. Focal Loss

Focal loss is an extension of the cross-entropy loss that uses a weighting factor to prevent one-stage detection methods from being overwhelmed by the large number of 'easy' background examples. Typically, one-stage methods have around 100k anchors per image. Most of these are background anchors that are easy to classify and swamp the classifier, undermining its ability to focus on and learn from the harder, foreground examples. This imbalance is countered by a modulating factor (1 − p_t)^γ, which reduces the weight of easily classified anchors and thereby shifts the focus onto harder examples. p_t is given in [2] as:

    p_t = p        if y = 1
    p_t = 1 − p    otherwise                                   (1)

Here γ is a tunable hyperparameter that modifies the extent to which the loss function prioritizes hard examples. If γ = 0, the loss function is equal to the cross-entropy loss and no priority is given to hard examples. If, for instance, γ = 2 and p_t = 0.9 for a given anchor, then its contribution to the loss will be 100 times lower than for the standard cross-entropy loss. Our experiments (Table 3) show that setting γ = 1.5 yields the best performance.

Besides the focal loss modulating factor, a further weighting factor α was applied. Foreground anchors are weighted by α and background anchors by 1 − α, with α ∈ [0, 1]; α_t is defined analogously to p_t. According to [2], α needs to be selected together with γ. Accordingly, we set α to 0.25 for γ = 1.5.

The final α-balanced version of the focal loss is given by:

    FL(p_t) = −α_t (1 − p_t)^γ log(p_t)                        (2)
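As a concrete reference for Eqs. (1)-(2), the following NumPy sketch (our own illustration under the parameter choices above, not our training code) evaluates the α-balanced focal loss for a batch of anchor predictions with γ = 1.5 and α = 0.25.

import numpy as np

def focal_loss(p, y, gamma=1.5, alpha=0.25, eps=1e-7):
    """alpha-balanced focal loss of Eqs. (1)-(2).

    p: predicted foreground probabilities per anchor, shape (N,)
    y: binary labels, 1 for foreground anchors and 0 for background anchors
    Returns the mean loss over all anchors.
    """
    p = np.clip(p, eps, 1.0 - eps)                           # numerical stability for log()
    p_t = np.where(y == 1, p, 1.0 - p)                       # Eq. (1)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)     # Eq. (2)
    return loss.mean()

# With gamma = 2, an easy anchor with p_t = 0.9 is scaled by (1 - 0.9)^2 = 0.01, i.e. it
# contributes 100 times less than under plain cross-entropy, matching the example above.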
2.3. Ensemble Method

In order to counter the variance in the network output and increase performance, we implemented an ensemble method. Let M be the number of models used in the ensemble. Our final method used M = 7 models.

The single trained models were first ordered according to their individual test scores (Table 5). We then iterated through the first M − 1 models and compared, for each model m_i (i ≤ M − 1), its bounding box predictions to those of all subsequent models m_j (j > i) in a pairwise manner. For example, with three models, m_1 would be compared to m_2 and m_3; in the next iteration step, m_2 would be compared to m_3 (Fig. 1).

Fig. 1: Illustration of how bounding box predictions produced by different models m_i are compared to each other. Models are first ordered in descending order of test performance. Then each model m_i is compared to all subsequent models m_j for j > i.

Each prediction box from m_i forms the root of a stack, and boxes from m_j can then be assigned to that stack. Whenever a bounding box from m_j is assigned to the stack of a box in m_i, it is removed entirely to avoid assigning one box to multiple stacks. Having the most accurate bounding boxes as the roots of the stacks proved beneficial; it is therefore important to first order the models according to their test scores to achieve optimal results. Boxes are assigned to a stack based on an overlap score, a weighted sum of the average objectness confidence of the candidate box and the root of that stack and their IoU. Our experiments showed that weighting the average confidence score by 0.7 and the IoU by 0.3 yielded the best performance. Each time we compare a model m_i to a model m_j, we evaluate the overlap score between all of their bounding boxes and assign boxes to each other in descending order of overlap score. For instance, if bounding box A from m_i and bounding box B from m_j have an overlap score of 0.92 and this is the highest score between all boxes from these two models, then B is assigned to the stack of A and is no longer considered in future comparisons. We used a threshold overlap score of 0.46 to assign boxes to each other.

Fig. 2: Illustration of the overlap score computation. Displayed are two bounding box predictions, from a model m_1 and a model m_5, that both predict the class bubbles. Their confidence scores are averaged. The resulting average score together with the IoU gives an overlap score of 0.473. In our case, where the threshold for determining a positive overlap is set to 0.46, this means that both boxes 'overlap' and will be assigned to each other.

Given all the final stacks, we compute their respective summed confidence scores. If this aggregated score exceeds a voting threshold of 1.68 (corresponding to an average score of 0.24 per model for M = 7), a final detection is yielded from that stack (Fig. 3). The value 1.68 was found to optimize the trade-off between high mAP and high IoU. For each model we only considered detections with a confidence score greater than 0.2; considering detections with a lower confidence score did not increase performance but slowed down our ensemble method. The four corner points of the final detection were then calculated as the average of all bounding boxes in that stack, weighted by their respective confidence scores. In order to reward detections that were confirmed by many models, we introduced a frequency factor of 0.03 that is multiplied by the number of boxes in the stack and added to the average score of the final detection. This added a slight improvement to our scores.

Fig. 3: Illustration of the final ensemble detection obtained by averaging bounding box corner points weighted by detection confidence. Detections are given by models m_1, m_4, and m_5. As m_1 is the root, it is given higher priority in the weighted averaging, and the green box is the final detection yielded from this stack.

By visual inspection of the detection output, we noticed that many bounding boxes were drawn around smaller bounding boxes of the same class label. Deeming such nested detections redundant, we added a post-processing step based on intersection-over-area, i.e. the ratio of a box's intersection with another box to its own area: whenever a detection lay more than 90% within another detection of the same class label, we removed the outer box. This improved our scores.

We also observed that two or more detections of different classes frequently overlap almost perfectly. In these cases we attempted to remove the detection with the lower confidence score, but this did not improve our score.
2.4. Single Models

We found that our test scores were optimal when we used seven models in our ensemble method. These seven models were selected and designed with the aim of achieving high dissimilarity between the models and high individual performance. Specifically, we created the seven single models using different data augmentation techniques, different CNN depths, and different configurations of the loss function.

Unless otherwise specified, all of these models used the 50-layer version of ResNet (ResNet-50) as the feature extractor. Our ResNet-50 was not trained from scratch but uses weights pre-trained on the MS COCO dataset. Initially we used a version with weights pre-trained on the ImageNet1k dataset, but experiments showed that MS COCO weights yield better results (Table 4). The deeper backbone networks, ResNet-101 and ResNet-152, used in two of the seven ensemble models, were pre-trained on ImageNet1k. Unless stated otherwise, a training batch size of 1 was used. The number of training iterations differs between models and was generally chosen based on the validation scores and with the goal of increasing the diversity between the models.

In the following we provide the specifications of the seven models used in our ensemble method. The models are denoted m_1 to m_7 in descending order of their single-model performance (Table 5).

m_1 was our best performing model, where all configurations were optimized to the best of our knowledge. Besides changing the focal loss parameter γ from 2 to 1.5, thereby slightly reducing the extent to which the loss function prioritizes hard examples, other parameters were mostly set as specified in [2]. Baseline data augmentation consisted of a randomized combination of image rotation, translation, shear, scaling, and flipping. In each training epoch, each image was rotated, translated, and sheared by a factor of -0.1 to 0.1, scaled to between 0.9 and 1.1 of its original size, and flipped with a chance of 50% both horizontally and vertically. m_1 was trained for around 35k iterations.

Next, we experimented with how the classification and regression losses were combined to yield the overall loss. By increasing the weight of either the classification or the regression loss, we aimed to shift the focus between the two. Doubling the weight of the classification loss provided a good balance between high performance and dissimilarity, and this weighting was introduced for m_2. This model was trained for 81k iterations.

For the third model, m_3, we chose a different γ in the focal loss. We wanted this model to focus more on harder examples and set γ = 3.5 to achieve this. A box classified with a confidence score of 0.9 would thus contribute 100 times less to the loss in m_3 than in m_1. The model was trained for 58k iterations.

In the dataset we observed that the endoscopic frames were exposed to different illumination modes, leading to different colorings of the images. Hence, for model m_4 we added a data augmentation step that randomly adds values to the RGB channels. This was done in addition to the random geometric transformations applied in training all our models. In each epoch, there is a 1/9 chance for each image that a value between 50 and 200 is added to one of its RGB channels. Training was conducted with a batch size of 4 for 10k iterations.

Model m_5 was trained with a 101-layer ResNet backbone. While this model performed worse than the ResNet-50 models, we added it to the ensemble under the assumption that deeper CNNs will discover more advanced features and therefore add to the diversity of the ensemble. The model was trained for 45k iterations.

Analogously to m_5, we also added a model m_6 with a 152-layer ResNet as the feature extractor. m_6 was trained for 69k iterations.

The last model, m_7, was trained on a subsampled training set. For this model we added Gaussian noise at a scale of 127.5 (relative to 8-bit image intensity values of 0-255). Analogously to m_4, this augmentation step was added on top of the random geometric transformations and was applied to 1/9 of the images in each epoch. The model was trained using a batch size of 4 for 11k iterations.

3. EXPERIMENTS AND RESULTS

3.1. Dataset

Our dataset consists of the 2,193 endoscopic frames released by the EAD2019 challenge [3, 4]. A significant number of the frames in this dataset appear to come from the same video sequences. Furthermore, these videos differ by tissue type, illumination mode, and procedure type. In order to make sure that our train-validation split led to representative results, we had to split the dataset in a video-wise manner, meaning that one video was either entirely in the training set or entirely in the validation set, based on manual assignment of the frames to videos. Initial experiments conducted on a random train-validation split that did not respect this video-wise separation resulted in validation scores up to 50% greater than the actual test scores submitted online. Our final validation set corresponds to approximately 20% of the total released EAD training data.

3.2. Training

Models were trained with the Adam optimization algorithm [8]. We used a learning rate of 10^-5 that was reduced by a factor of 10 whenever performance plateaued. The best performance was obtained with a training batch size of 1. Training was performed on a single GPU (Tesla K80) using Google Colab. Most runs were trained for 10 to 30 epochs (equal to 18k to 54k iterations) and took less than 12 hours.

3.3. Evaluation

Scores were calculated as a weighted sum of average IoU and mAP. IoU was weighted by 0.4 and mAP by 0.6. In order to avoid overly rewarding high IoU, the IoU value entering the score was additionally capped at 1.3 times the mAP. Otherwise it would have been possible, for example, to reach an overall score of 0.4 with only one detection in the whole test set that overlapped perfectly with a ground-truth annotation.

3.4. Results

3.4.1. IoU Threshold

During training of the RetinaNet framework, anchors were considered true positives based on the correctness of the predicted class and their IoU with the corresponding ground-truth annotation. We experimented with different IoU thresholds for considering an anchor as true or false (Table 1). We found that a negative threshold of 0.4 and a positive threshold of 0.5 worked best. This means that anchors with an IoU below 0.4 were considered false, those above 0.5 were considered true, and those in between were ignored.

    IoU Threshold    Validation Score
        0.25             0.2188
        0.35             0.2402
      0.4-0.5            0.2861
        0.5              0.2669
        0.6              0.2132
        0.7              0.2531
        0.8              0.1725

Table 1: Comparison of different IoU thresholds for anchor assignment. All models were trained for 12 epochs.

3.4.2. Backbone Network

As previously stated, we tried out different CNNs as the feature extractor in our RetinaNet framework. Even without pre-training on MS COCO, ResNet50 was the most effective and outperformed the deeper ResNet models.

    Backbone     Validation Score
     VGG16            0.1181
     VGG19            0.1305
    ResNet50          0.3165
    ResNet101         0.2879

Table 2: Comparison of backbone networks. VGG16 and VGG19 were trained for fewer than 10 epochs as they stopped improving before that.
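The photometric augmentations used for m_4 and m_7 (Sec. 2.4) amount to simple per-image random perturbations. The NumPy sketch below is our own illustration of the stated settings and not the exact augmentation code; in particular, we read "either of the RGB channels" as one randomly chosen channel and the Gaussian "scale" of 127.5 as a standard deviation.

import numpy as np

rng = np.random.default_rng()

def random_channel_shift(image, prob=1/9, low=50, high=200):
    """m_4-style color augmentation: with probability `prob`, add a random value in
    [low, high] to one RGB channel of an HxWx3 uint8 image, clipping to the 8-bit range."""
    if rng.random() < prob:
        channel = rng.integers(0, 3)                 # pick one of the R, G, B channels
        shift = int(rng.integers(low, high + 1))
        shifted = image.astype(np.int32)
        shifted[..., channel] += shift
        image = np.clip(shifted, 0, 255).astype(np.uint8)
    return image

def random_gaussian_noise(image, prob=1/9, scale=127.5):
    """m_7-style augmentation: with probability `prob`, add zero-mean Gaussian noise with
    standard deviation `scale` (relative to 8-bit intensities), clipping to the 8-bit range."""
    if rng.random() < prob:
        noise = rng.normal(0.0, scale, size=image.shape)
        image = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return image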
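The scoring rule of Sec. 3.3 reduces to a one-line formula; the snippet below is our reading of the challenge score (the official EAD2019 evaluation code may differ in details), with the IoU contribution capped at 1.3 times the mAP.

def ead_score(mAP, mean_iou):
    """Weighted challenge score: 0.6 * mAP + 0.4 * IoU, with IoU capped at 1.3 * mAP."""
    return 0.6 * mAP + 0.4 * min(mean_iou, 1.3 * mAP)

# With our final submission values (Sec. 3.4.5), mAP = 0.3087 and IoU = 0.3997, the cap
# 1.3 * 0.3087 = 0.4013 is not binding and the score is 0.6 * 0.3087 + 0.4 * 0.3997 = 0.3451.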
3.4.3. Focal Loss Parameters

Tuning the focal loss parameters had the greatest effect on our single-model performance. Increasing or decreasing γ enabled us to decide to what extent we wanted the model to focus on hard examples. This was especially useful in our use case, as the data was unbalanced and some of the classes were much easier to detect than others. A γ value of 1.5 yielded the best performance for us; in the original paper [2], γ = 2 was used.

      γ       Validation Score
    1.00          0.2832
    1.25          0.2915
    1.50          0.3235
    1.75          0.2905
    2.00          0.3028
    2.50          0.2780

Table 3: Validation scores for different γ values of our baseline model after 15 epochs. For values of γ below 1, RetinaNet failed to converge.

3.4.4. Pre-training

As previously mentioned, using a ResNet-50 model pre-trained on MS COCO improved our performance substantially compared to models pre-trained on ImageNet1k.

    Pre-training    Validation Score
    ImageNet1k          0.3108
    MS COCO             0.3435

Table 4: Validation scores for different pre-trained weights of a ResNet-50 backbone network.

3.4.5. Single Model Summary and Ensemble Method

The single-model performances are summarised in Table 5. We initially used 3 models in our ensemble method. By continuously adding models to the ensemble, our score kept improving until we reached 7 models; thereafter the score decreased again.

    Model    Test Score
     m_1       0.3056
     m_2       0.3033
     m_3       0.2901
     m_4       0.2856
     m_5       0.2789
     m_6       0.2750
     m_7       0.2601

Table 5: The single-model performances of the seven models used in our ensemble method.

Our performance was further increased by an optimized combination strategy. This was largely thanks to the introduction of the overlap score, which governs how boxes are assigned to each other, and to the weighted averaging of bounding box positions together with the frequency factor. The post-processing step of removing boxes that encompass boxes of the same class provided an additional performance boost. Finally, by testing and optimizing various parameters of our ensemble method, we reached our final, highest score. These parameters include the overlap score threshold, the weighting between IoU and average score in the overlap score, the frequency factor, the score threshold of each individual model, and the overall voting threshold for each detection stack. Table 6 summarises these stepwise improvements in test score.

              Action                    Test Score
         Initial 3 Models                 30.51
             4 Models                     31.93
             5 Models                     32.63
             6 Models                     32.95
             7 Models                     33.03
  +Optimized Combination Strategy         33.88
        +Post-Processing                  33.96
     +Parameter Optimization              34.51

Table 6: Summary of how the different refinement steps led to score improvements towards our final ensemble method.

Our proposed ensemble method achieved a final score of 0.3451 on the EAD2019 test set, with an mAP of 0.3087 and an IoU of 0.3997. For this submission, the mAP on the EAD2019 generalization set was 0.2848 with a deviation score of 0.0696. In a previous submission, with slightly different ensemble parameters and the introduction of class-specific voting thresholds, we scored 33.45 on the test set and an mAP of 0.3508 on the generalization set with a deviation score of 0.0556.

3.4.6. Visualization

Figs. 4-7 depict outputs from our ensemble method and from models m_1, m_4, and m_5, respectively, on the same example image from the EAD2019 test set. The figures illustrate how different RetinaNet models are combined to produce a superior output.

Fig. 4: Example output of the combined ensemble method.

Fig. 5: Example output of our baseline model m_1.

Fig. 6: Example output of the color-augmentation model m_4.

Fig. 7: Example output of the ResNet-101 model m_5.

4. DISCUSSION AND FUTURE WORK

Our approach tackles the novel issue of multi-class artefact detection in endoscopy by proposing the application of the one-stage detection method RetinaNet. RetinaNet matches the speed of other one-stage methods, and its focal loss addresses the imbalance between easy and difficult examples.
By intelligently combining multiple models that were trained according to the specific nature of endoscopic video frames, our score improved substantially and resulted in an EAD2019 object detection score of 0.3451.

Future exploration could include the implementation of more advanced backbone networks and/or more advanced transfer learning approaches, such as pre-training on medical images.

5. REFERENCES

[1] Pu Wang, Xiao Xiao, Jeremy R Glissen Brown, Tyler M Berzin, Mengtian Tu, Fei Xiong, Xiao Hu, Peixi Liu, Yan Song, Di Zhang, et al., “Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy,” Nature Biomedical Engineering, vol. 2, no. 10, pp. 741, 2018.

[2] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[3] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden,
    Adam Bailey, Stefano Realdon, James East, Georges
    Wagnières, Victor Loschenov, Enrico Grisan, Walter
    Blondel, and Jens Rittscher, “Endoscopy artifact de-
    tection (EAD 2019) challenge dataset,” CoRR, vol.
    abs/1905.03209, 2019.
[4] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden,
    James East, Xin Lu, and Jens Rittscher, “A deep learning
    framework for quality assessment and restoration in video
    endoscopy,” CoRR, vol. abs/1904.07073, 2019.

[5] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He,
    Bharath Hariharan, and Serge Belongie, “Feature pyra-
    mid networks for object detection,” in Proceedings of the
    IEEE Conference on Computer Vision and Pattern Recog-
    nition, 2017, pp. 2117–2125.

[6] Karen Simonyan and Andrew Zisserman, “Very deep
    convolutional networks for large-scale image recogni-
    tion,” arXiv preprint arXiv:1409.1556, 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
    Sun, “Deep residual learning for image recognition,” in
    Proceedings of the IEEE conference on computer vision
    and pattern recognition, 2016, pp. 770–778.
[8] Diederik P Kingma and Jimmy Ba,        “Adam: A
    method for stochastic optimization,” arXiv preprint
    arXiv:1412.6980, 2014.