=Paper=
{{Paper
|id=Vol-2366/EAD2019_paper_5
|storemode=property
|title=Focal Loss For Artefact Detection In Medical Endoscopy
|pdfUrl=https://ceur-ws.org/Vol-2366/EAD2019_paper_5.pdf
|volume=Vol-2366
|authors=Maxime Kayser,Roger D. Soberanis-Mukul,Shadi Albarqouni,Nassir Navab
}}
==Focal Loss For Artefact Detection In Medical Endoscopy==
Maxime Kayser¹, Roger D. Soberanis-Mukul¹, Shadi Albarqouni¹, Nassir Navab¹
¹Chair for Computer Aided Medical Procedures (CAMP), Technische Universität München (TUM), Munich, Germany
ABSTRACT

Endoscopic video frames tend to be corrupted by various artefacts impairing their visibility. Automated detection of these artefacts will foster advances in computer-assisted diagnosis, post-examination procedures and frame restoration software. In this work, we propose an ensemble of deep learning object detectors to automate multi-class artefact detection in video endoscopy. Our approach achieved a mean average precision (mAP) of 0.3087 and an average intersection-over-union (IoU) of 0.3997 on the EAD2019 test set. This resulted in a final score of 0.3451 and the 3rd rank in the EAD2019 object detection sub-challenge leaderboard.

Index Terms— Endoscopy, RetinaNet, artefact detection, deep learning, object detection

1. INTRODUCTION

Tissue characteristics of hollow organs, as well as the different instruments and illumination modes applied in medical endoscopy, lead to a high number of artefacts that obstruct visibility. The frames produced during endoscopic interventions can be corrupted by bubbles, instruments and image deficiencies such as specular reflections, strong contrasts, saturated pixels, motion blur, or other artefacts. These corruptions have a negative effect on both the live diagnosis and post-intervention procedures. Successfully detecting the artefacts will thus assist endoscopic experts and will be the cornerstone of successful endoscopic video frame restoration.

Over recent years, deep learning techniques have become the state of the art in medical image analysis, and they have proved particularly successful in related medical endoscopy computer vision tasks, such as polyp detection [1]. Deep learning object detection methods can be separated into one- and two-stage methods. One-stage methods are generally faster as they do not rely on an additional region proposal step. RetinaNet is a one-stage detector proposed by Lin et al. [2]. The novelty of this architecture is the proposed focal loss, which addresses the imbalance between foreground and background anchors that occurs in one-stage methods. RetinaNet outperforms two-stage methods such as Faster R-CNN on COCO test-dev. We use this network as the base architecture for our challenge submission. The EAD dataset [3, 4] is very unbalanced and contains objects at vastly different scales; RetinaNet's built-in Feature Pyramid Network (FPN) [5] and focal loss can effectively address these issues.

2. METHODS

Our method consists of an ensemble of seven RetinaNet architectures that vary in hyperparameters, backbone networks, transfer learning, data augmentation, and the training subset used. The models are combined based on an efficient voting scheme.

2.1. RetinaNet architecture

The RetinaNet detector consists of a backbone network for extracting a convolutional feature map and two subnetworks that perform object classification and bounding box regression via convolution. The classification loss is given by the focal loss and the regression loss by the smooth L1 loss; the sum of both losses constitutes the overall loss that is minimized during training. RetinaNet is a one-stage method, meaning that it does not require a region proposal module. Instead, anchors at different scales and aspect ratios are densely distributed across the image and all of them are classified by the network. In order to construct a multi-scale feature pyramid from a single-resolution input image, the backbone network is augmented by a feature pyramid network (FPN) [5]. FPNs are a top-down architecture with lateral connections that allows semantically rich layers to be built at all scales at marginal computational cost. FPNs have proven especially effective in the detection of small objects and are therefore well suited to our use case. Pyramid levels and anchors were generated according to the specifications in [2]. We experimented with different IoU thresholds for assigning an anchor to a ground-truth object and validated the thresholds used in [2]. No other changes were made to the RetinaNet classification and regression subnetworks.

In our experiments, we used both VGGNet [6] and ResNet [7] convolutional neural networks (CNNs) as the backbone network in our framework. VGGNet is a CNN from 2014 with a simple architecture that consists of convolution layers, pooling layers, and fully connected layers; we tested both the 16- and the 19-layer VGGNet (Table 2). ResNets are much deeper CNNs that remain trainable at depth thanks to residual (skip) connections. Given that ResNets are much deeper and can extract more elaborate features, they generally outperform VGGNet on most public validation test sets. We experimented with the 50-, 101-, and 152-layer versions of ResNet.
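For readers who want a concrete starting point, a RetinaNet of this form (ResNet-50 backbone, FPN, focal-loss classification subnetwork and box regression subnetwork) can be instantiated off the shelf, for example with torchvision. This is only an illustrative sketch, not the authors' implementation; num_classes=7 assumes the seven EAD2019 artefact classes.

```python
import torch
import torchvision

# Illustrative only: torchvision's RetinaNet combines a ResNet-50 + FPN backbone
# with a focal-loss classification subnetwork and a box regression subnetwork.
model = torchvision.models.detection.retinanet_resnet50_fpn(num_classes=7)

model.eval()
with torch.no_grad():
    # One dummy RGB frame; the model returns one dict per input image with
    # 'boxes', 'scores' and 'labels' entries.
    predictions = model([torch.rand(3, 512, 512)])
```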
2.2. Focal Loss

Focal loss is an extension of the cross-entropy loss that uses a weighting factor to prevent one-stage detection methods from being overwhelmed by the large number of 'easy' background examples. Typically, one-stage methods have around 100k anchors per image. Most of these are background anchors that are easy to classify and swamp the classifier, undermining its ability to focus on and learn the harder foreground examples. This imbalance is countered by the addition of a weighting factor (1 − p_t)^γ, which reduces the weight of easily classified anchors and thereby shifts the focus onto harder examples. p_t is given in [2] as:

p_t = p if y = 1, and p_t = 1 − p otherwise    (1)

where γ is a tunable hyperparameter that modifies the extent to which the loss function prioritizes hard examples. If γ = 0, the loss function is equal to the cross-entropy loss and no priority is given to hard examples. If, for instance, γ = 2 and p_t = 0.9 for a given anchor, then its contribution to the loss is 100 times lower than for the standard cross-entropy loss. Our experiments (Table 3) show that setting γ = 1.5 yields the best performance.

Besides the focal loss weighting factor, a further weighting factor α was applied: foreground anchors are weighted by α and background anchors by 1 − α, with α ∈ [0, 1]; α_t is defined analogously to p_t. According to [2], α needs to be selected together with γ. Accordingly, we set α to 0.25 for γ = 1.5.

The final α-balanced version of the focal loss is given by:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)    (2)
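As a concrete illustration of Eqs. (1) and (2), a minimal binary focal loss can be written in a few lines of PyTorch with the values reported above (α = 0.25, γ = 1.5). This is a sketch, not the authors' training code:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=1.5):
    """Minimal sketch of Eqs. (1)-(2) for binary anchor classification.

    logits:  raw scores per anchor (and class), any shape
    targets: float tensor of the same shape, 1.0 for foreground, 0.0 for background
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # Eq. (1)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce                 # Eq. (2)
    # In [2] this is summed over all anchors and normalised by the number of
    # foreground anchors; with gamma = 2, an easy anchor with p_t = 0.9
    # contributes (1 - 0.9)^2 = 1/100 of its cross-entropy loss.
    return loss.sum()
```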
2.3. Ensemble Method

In order to counter the variance in the network output and increase performance, we implemented an ensemble method. Let M be the number of models used in the ensemble; our final method used M = 7 models.

The single trained models were first ordered according to their individual test scores (Table 5). We then iterated through the first M − 1 models and compared, for each model m_i (i ≤ M − 1), its bounding box predictions to all subsequent models m_j (j > i) in a pairwise manner. For example, with three models, m1 would be compared to m2 and m3; in the next iteration step, m2 would be compared to m3 (Fig. 1).

Fig. 1: Illustration of how bounding box predictions from different models m_i are compared to each other. Models are first ordered in descending order of test performance. Then each model m_i is compared to all subsequent models m_j for j > i.

Each prediction box from m_i forms the root of a stack, and boxes from m_j can then be assigned to that stack. Whenever a bounding box from m_j is assigned to the stack of a box from m_i, it is removed entirely to avoid assigning one box to multiple stacks. Having the most accurate bounding boxes as the roots of the stacks proved beneficial; it is therefore important to first order the models according to their test score to achieve optimal results. Boxes are assigned to a stack based on an overlap score: a weighted sum of the average of their confidence scores and their IoU with the root of that stack. Our experiments have shown that weighting the average confidence score by 0.7 and the IoU by 0.3 yielded the best performance. Each time we compare a model m_i to a model m_j, we evaluate the overlap score between all of their bounding boxes and assign boxes to each other in descending order of overlap score. For instance, if bounding box A from m_i and bounding box B from m_j have an overlap score of 0.92 and that is the highest score between all boxes from these two models, then B is assigned to the stack of A and is no longer considered in future comparisons. We used a threshold overlap score of 0.46 to assign boxes to each other.

Fig. 2: Illustration of the overlap score computation. Displayed are two bounding box predictions from a model m1 and a model m5 that both predict the class bubbles. Their confidence scores are averaged; the resulting average score together with the IoU gives an overlap score of 0.473. In our case, where the threshold for determining a positive overlap is set to 0.46, this means that both boxes 'overlap' and will be assigned to each other.
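A simplified sketch of the overlap score and of the greedy box assignment is given below. The detection format ({'box', 'score', 'label'}) and the restriction to same-class matches are our assumptions; the weights (0.7/0.3) and the threshold of 0.46 are the values reported above.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def overlap_score(det_a, det_b, w_conf=0.7, w_iou=0.3):
    """Weighted sum of the averaged confidences and the IoU (weights 0.7 / 0.3)."""
    avg_conf = 0.5 * (det_a["score"] + det_b["score"])
    return w_conf * avg_conf + w_iou * box_iou(det_a["box"], det_b["box"])

def assign_to_stacks(stacks, dets_j, threshold=0.46):
    """Greedily assign detections of a weaker model m_j to existing stacks
    (each rooted at a box from a stronger model m_i), in descending order of
    overlap score. Each box joins at most one stack."""
    candidates = []
    for s, stack in enumerate(stacks):
        root = stack[0]
        for d, det in enumerate(dets_j):
            if det["label"] != root["label"]:   # assumed: only same-class matches
                continue
            candidates.append((overlap_score(root, det), s, d))
    used = set()
    for score, s, d in sorted(candidates, reverse=True):
        if score < threshold or d in used:
            continue
        stacks[s].append(dets_j[d])
        used.add(d)
    return stacks
```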
Given all the final stacks, we compute their respective summed confidence scores. If this aggregated score exceeds a voting threshold of 1.68 (corresponding to an average score of 0.24 per model for M = 7), a final detection is yielded from that stack (Fig. 3). The value 1.68 was found to optimize the trade-off between high mAP and high IoU. For each model we only considered detections with a confidence score greater than 0.2; considering detections with a lower confidence score did not increase performance but slowed down our ensemble method. The four corner points of the final detection were then calculated as a weighted (by the respective scores) average of all the bounding boxes in that stack. In order to reward the fact that a detection was confirmed by many models, we introduced a frequency factor of 0.03 that is multiplied by the number of boxes in that stack and added to the average score of the final detection. This added a slight improvement to our scores.

Fig. 3: Illustration of the final ensemble detection by weighted averaging of bounding box points with detection confidence. Detections are given by models m1, m4, and m5. As m1 is the root, it is given higher priority in the weighted averaging, and the green box is the final detection yielded from this stack.
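The voting and fusion step for a single stack could then look as follows. Exactly how the root box is prioritised in the weighted average is not fully specified, so this sketch simply weights every box by its confidence score (the root, coming from the strongest model, typically has the highest score); the per-model 0.2 score filter is assumed to be applied before stacks are built.

```python
import numpy as np

def fuse_stack(stack, vote_threshold=1.68, freq_factor=0.03):
    """Turn one stack of assigned detections into a final detection.
    Sketch of the voting scheme above; boxes are (x1, y1, x2, y2)."""
    scores = np.array([d["score"] for d in stack], dtype=float)
    if scores.sum() <= vote_threshold:                  # stack does not win the vote
        return None
    boxes = np.array([d["box"] for d in stack], dtype=float)
    weights = scores / scores.sum()
    fused_box = (weights[:, None] * boxes).sum(axis=0)       # score-weighted corner average
    fused_score = scores.mean() + freq_factor * len(stack)   # reward multi-model agreement
    return {"box": fused_box, "score": fused_score, "label": stack[0]["label"]}
```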
By visual inspection of the detection output we noticed that many bounding boxes were drawn around smaller bounding boxes of the same class label. Deeming the enclosing boxes superfluous, we added a post-processing step that removed a final detection whenever another detection of the same class label had an intersection-over-area (i.e. the ratio of its intersection with the given box to its own area) greater than 90%. Thus, whenever a bounding box lay with more than 90% of its area within another box of the same label, we removed the outer box. This improved our scores.

We also observed that there were frequently two or more detections of different classes that seemed to overlap almost perfectly. In these scenarios we attempted to improve our score by removing the detection with the lower confidence score, but this did not improve our score.
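A sketch of this post-processing rule, using the same hypothetical detection format as above:

```python
def remove_enclosing_boxes(dets, ioa_threshold=0.9):
    """Drop a detection when a smaller detection of the same class lies almost
    entirely (intersection-over-area > 0.9) inside it."""
    def inter(a, b):
        w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        return w * h

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    keep = []
    for i, outer in enumerate(dets):
        has_inner = any(
            j != i
            and d["label"] == outer["label"]
            and area(d["box"]) < area(outer["box"])
            # > 90% of the smaller box's own area lies inside the outer box
            and inter(d["box"], outer["box"]) > ioa_threshold * area(d["box"])
            for j, d in enumerate(dets)
        )
        if not has_inner:
            keep.append(outer)
    return keep
```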
2.4. Single Models

We found that our test scores were optimal when we used seven models in our ensemble method. These seven models were selected and designed with the aim of achieving high dissimilarity between the models and high individual performance. Specifically, we created the seven single models using different data augmentation techniques, different CNN depths and different configurations of the loss function.

Unless otherwise specified, all of these models used the 50-layer version of ResNet (ResNet-50) as the feature extractor. Our ResNet-50 was not trained from scratch but uses weights pre-trained on the MS COCO dataset. Initially we used a version with pre-trained weights from the ImageNet1k dataset, but experiments showed that MS COCO weights yield a better result (Table 4). The deeper backbone networks, ResNet-101 and ResNet-152, used in two of the seven models in the ensemble, were pre-trained on ImageNet1k. Unless stated otherwise, a training batch size of 1 was used. The number of training iterations differs between models and was generally derived from the validation scores, with the goal of increasing the diversity between the models.

In the following we provide the specifications of the seven different models used in our ensemble method. The models are denoted m1 to m7 in descending order of their single-model performance (Table 5).

m1 was our best performing model, where all configurations were optimized to the best of our knowledge. Besides changing the focal loss parameter γ from 2 to 1.5, thereby slightly reducing the extent to which the loss function prioritizes hard examples, other parameters were mostly set as specified in [2]. Baseline data augmentation consisted of a randomized combination of image rotation, translation, shear, scaling and flipping: for each training epoch, each image was rotated, translated, and sheared by a factor of −0.1 to 0.1, scaled between 0.9 and 1.1 of its original size, and flipped with a chance of 50% both horizontally and vertically. m1 was trained for around 35k iterations.
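The baseline augmentation can be summarised as a small configuration. The parameter names below are ours, and the units of the rotation, translation and shear factors are not stated in the text:

```python
import random

# Illustrative summary of the baseline geometric augmentation for m1.
BASELINE_AUGMENTATION = {
    "rotation":    (-0.1, 0.1),
    "translation": (-0.1, 0.1),   # as a fraction of the image size (assumed)
    "shear":       (-0.1, 0.1),
    "scale":       (0.9, 1.1),
    "flip_x_prob": 0.5,
    "flip_y_prob": 0.5,
}

def sample_transform(cfg=BASELINE_AUGMENTATION, rng=random):
    """Draw one random geometric transform per image and epoch."""
    return {
        "rotation": rng.uniform(*cfg["rotation"]),
        "translation": (rng.uniform(*cfg["translation"]),
                        rng.uniform(*cfg["translation"])),
        "shear": rng.uniform(*cfg["shear"]),
        "scale": rng.uniform(*cfg["scale"]),
        "flip_x": rng.random() < cfg["flip_x_prob"],
        "flip_y": rng.random() < cfg["flip_y_prob"],
    }
```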
Next, we experimented with how the classification and regression losses are combined to yield the overall loss. By increasing the weight of either the classification or the regression loss we aimed to shift the focus between the two. Doubling the weight of the classification loss provided a good balance between high performance and dissimilarity, and this weighting was therefore introduced for m2. This model was trained for 81k iterations.

For the third model, m3, we chose a different γ in the focal loss. We wanted this model to focus more on harder examples and set γ = 3.5 to achieve this. If a box is classified with a confidence score of 0.9, it contributes 100 times less to the loss in m3 than in m1. The model was trained for 58k iterations.

In the dataset we observed that the endoscopic frames were exposed to different illumination modes, leading to different colorings of the images. Hence, for model m4 we added a data augmentation step that randomly adds values to the RGB channels. This was done in addition to the random geometric transformations applied in training all our models. For each epoch, there is a 1/9 chance per image that a value between 50 and 200 is added to one of the RGB channels. The training batch size was set to 4 and training was conducted for 10k iterations.
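A sketch of this color augmentation; whether the shifted values are clipped or wrapped is not stated, so we clip to the 8-bit range here:

```python
import numpy as np

def random_channel_shift(image, prob=1/9, low=50, high=200, rng=None):
    """With probability 1/9 per image and epoch, add a random offset in
    [50, 200] to one RGB channel of an 8-bit image (sketch for m4)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= prob:
        return image
    shifted = image.astype(np.int16)
    channel = rng.integers(0, 3)                 # pick one of the R, G, B channels
    shifted[..., channel] += rng.integers(low, high + 1)
    return np.clip(shifted, 0, 255).astype(np.uint8)
```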
Model m5 was trained with a 101-layer ResNet backbone. While this model performed worse than the ResNet-50 models, we added it to the ensemble under the assumption that deeper CNNs discover more advanced features and therefore add to the diversity of the ensemble. The model was trained for 45k iterations.

Analogously to m5, we also added a model m6 with a 152-layer ResNet as the feature extractor. m6 was trained for 69k iterations.

The last model, m7, was trained on a subsampled training set. For this model we added Gaussian noise at a scale of 127.5 (note the 8-bit image intensity range of 0-255). Analogously to m4, this augmentation step was added on top of the random geometric transformations and was applied to 1/9 of the images at each epoch. The model was trained using a batch size of 4 for 11k iterations.
3. EXPERIMENTS AND RESULTS

3.1. Dataset

Our dataset consists of the 2,193 endoscopic frames released by the EAD2019 challenge [3, 4]. A significant number of the frames in this dataset appear to come from the same video sequences, and these videos differ by tissue type, illumination mode and procedure type. In order to make sure that our train-validation split led to representative results, we had to split the dataset in a video-wise manner, meaning that one video was either entirely in the training set or entirely in the validation set, based on a manual assignment of the frames to videos. Initial experiments conducted on a random train-validation split that did not respect a video-wise split resulted in validation scores up to 50% greater than the actual test scores submitted online. Our final validation set corresponds to approximately 20% of the total released EAD training data.
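A minimal sketch of such a video-wise split, assuming a (manually built) mapping from frame identifiers to video identifiers:

```python
import random
from collections import defaultdict

def video_wise_split(frame_to_video, val_fraction=0.2, seed=0):
    """Split frames so that every video ends up entirely in either the training
    or the validation set (the authors assigned frames to videos by hand)."""
    videos = defaultdict(list)
    for frame, video in frame_to_video.items():
        videos[video].append(frame)

    video_ids = sorted(videos)
    random.Random(seed).shuffle(video_ids)

    train_frames, val_frames = [], []
    target = val_fraction * len(frame_to_video)
    for vid in video_ids:
        bucket = val_frames if len(val_frames) < target else train_frames
        bucket.extend(videos[vid])
    return train_frames, val_frames
```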
3.2. Training

Models were trained with the Adam optimization algorithm [8]. We used a learning rate of 10^−5 that was reduced by a factor of 10 whenever performance plateaued. The best performance was obtained with a training batch size of 1. Training was performed on a single GPU (Tesla K80) using Google Colab. Most runs were trained for 10 to 30 epochs (equal to 18k to 54k iterations) and took less than 12 hours.
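The reported optimisation setup corresponds to something like the following PyTorch sketch. The authors do not state which framework they used; the model here is only a stand-in, and the plateau patience and monitored metric are assumptions:

```python
import torch
import torchvision

# Stand-in model for the sketch (see the earlier RetinaNet example).
model = torchvision.models.detection.retinanet_resnet50_fpn(num_classes=7)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2)   # patience is an assumption

# After each epoch, step the scheduler with the validation score so that the
# learning rate is divided by 10 whenever performance plateaus, e.g.:
#   for epoch in range(30):
#       train_one_epoch(model, optimizer)       # hypothetical training loop
#       scheduler.step(validation_score(model)) # hypothetical validation routine
```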
3.3. Evaluation

Scores were calculated using a weighted sum of the average IoU and the mAP: the IoU was weighted by 0.4 and the mAP by 0.6. In order to avoid overly rewarding a high IoU, the IoU value was additionally not allowed to exceed 1.3 times the mAP. Otherwise it would have been possible, for example, to reach an overall score of 0.4 by having only one detection in the whole test set that overlapped perfectly with a ground-truth annotation.
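Our understanding of this scoring rule can be written down directly (a sketch, not the official evaluation code):

```python
def ead_score(mean_ap, mean_iou, w_map=0.6, w_iou=0.4, iou_cap=1.3):
    """Weighted score as described above: 0.6*mAP + 0.4*IoU, with the IoU
    capped at 1.3 times the mAP."""
    capped_iou = min(mean_iou, iou_cap * mean_ap)
    return w_map * mean_ap + w_iou * capped_iou

# Example with the final submission (mAP 0.3087, IoU 0.3997):
# 0.6 * 0.3087 + 0.4 * min(0.3997, 1.3 * 0.3087) = 0.3451
# (the cap is inactive here, since 1.3 * 0.3087 ≈ 0.401 > 0.3997).
```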
3.4. Results

3.4.1. IoU Threshold

During training of the RetinaNet framework, anchors were considered true positives based on the correctness of the predicted class and their IoU with the corresponding ground-truth annotation. We experimented with different IoU thresholds for considering an anchor as true or false (Table 1). We found that a negative threshold of 0.4 and a positive threshold of 0.5 worked best: anchors with an IoU below 0.4 were considered false, anchors with an IoU above 0.5 were considered true, and those in between were ignored.

IoU threshold   Validation score
0.25            0.2188
0.35            0.2402
0.4-0.5         0.2861
0.5             0.2669
0.6             0.2132
0.7             0.2531
0.8             0.1725

Table 1: Comparison of different IoU thresholds for anchor assignment. All models were trained for 12 epochs.
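A sketch of this anchor labelling rule:

```python
import numpy as np

def label_anchors(anchor_iou, neg_thresh=0.4, pos_thresh=0.5):
    """Assign each anchor a label from its best IoU with any ground-truth box:
    1 = positive, 0 = negative, -1 = ignored.

    anchor_iou: array of shape (num_anchors,) holding each anchor's highest IoU
    against the ground-truth boxes of the image.
    """
    labels = np.full(anchor_iou.shape, -1, dtype=np.int8)   # default: ignore
    labels[anchor_iou < neg_thresh] = 0                     # background
    labels[anchor_iou >= pos_thresh] = 1                    # foreground
    return labels
```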
3.4.2. Backbone Network

As previously stated, we tried out different CNNs as the feature extractor in our RetinaNet framework. Even without pre-training on MS COCO, ResNet-50 was the most effective and outperformed the deeper ResNet models.

Backbone    Validation score
VGG16       0.1181
VGG19       0.1305
ResNet50    0.3165
ResNet101   0.2879

Table 2: Comparison of backbone networks. VGG16 and VGG19 were trained for less than 10 epochs as they stopped improving before that.

3.4.3. Focal Loss Parameters

Tuning the focal loss parameters had the greatest effect on our single-model performance. Increasing or decreasing γ enabled us to decide to what extent we wanted the model to focus on hard examples. This was especially useful in our use case, as the data was unbalanced and some of the classes were much easier to detect than others. Using a γ value of 1.5 yielded the best performance for us; in the original paper [2], γ = 2 was used.

γ      Validation score
1.00   0.2832
1.25   0.2915
1.50   0.3235
1.75   0.2905
2.00   0.3028
2.50   0.2780

Table 3: Validation scores for different γ values of our baseline model after 15 epochs. For values of γ below 1, RetinaNet failed to converge.

3.4.4. Pre-training

As previously mentioned, using a ResNet-50 model pre-trained on MS COCO improved our performance substantially compared to models pre-trained on ImageNet1k.

Pre-training   Validation score
ImageNet1k     0.3108
MS COCO        0.3435

Table 4: Validation scores for different pre-trained weights of a ResNet-50 backbone network.

3.4.5. Single Model Summary and Ensemble Method

The single-model performance is summarised in Table 5. We initially used 3 models in our ensemble method. By continuously adding models to the ensemble, our score kept improving until we reached 7 models; thereafter the score decreased again. Our performance was partly increased by an optimized combination strategy, largely thanks to the introduction of the overlap score, which handled the way boxes were assigned to each other, to the frequency factor, and to the weighted averaging of bounding box positions. The post-processing step of removing boxes that encompass boxes of the same class provided an additional performance boost. Finally, by testing and optimizing various parameters of our ensemble method we reached our final, highest score. These parameters include the overlap score threshold, the weighting between IoU and average score in the overlap score, the frequency factor, the score threshold of each individual model, and the overall voting threshold for each detection stack. Table 6 summarises the described stepwise improvements in test score. Our proposed ensemble method achieved a final score of 0.3451 on the EAD2019 test set, with an mAP of 0.3087 and an IoU of 0.3997. For this submission, the mAP on the EAD2019 generalization set was 0.2848 with a deviation score of 0.0696. In a previous submission, with slightly different ensemble parameters and the introduction of class-specific voting thresholds, we scored 33.45 on the test set and an mAP of 0.3508 on the generalization set with a deviation score of 0.0556.

Model   Test score
m1      0.3056
m2      0.3033
m3      0.2901
m4      0.2856
m5      0.2789
m6      0.2750
m7      0.2601

Table 5: The single-model performances of the seven models used in our ensemble method.

Action                             Test score
Initial 3 Models                   30.51
4 Models                           31.93
5 Models                           32.63
6 Models                           32.95
7 Models                           33.03
+Optimized Combination Strategy    33.88
+Post-Processing                   33.96
+Parameter Optimization            34.51

Table 6: Summary of how the different refinement steps led to score improvements towards our final ensemble method.

3.4.6. Visualization

Figs. 4-7 depict outputs from our ensemble method and from models m1, m4 and m5, respectively, on the same example image from the EAD2019 test set. The figures illustrate how different RetinaNet models are combined to produce a superior output.

Fig. 4: Example output of the combined ensemble method.
Fig. 5: Example output of our baseline model m1.
Fig. 6: Example output of the color augmentation model m4.
Fig. 7: Example output of the ResNet-101 model m5.
4. DISCUSSION AND FUTURE WORK

Our approach tackles the novel issue of multi-class artefact detection in endoscopy by proposing the application of the one-stage detection method RetinaNet. RetinaNet matches the speed of other one-stage methods and its focal loss addresses the imbalance between easy and difficult examples. By intelligently combining multiple models that were trained according to the specific nature of endoscopic video frames, our score improved substantially and resulted in an EAD2019 object detection score of 0.3451.

Future exploration could include the implementation of more advanced backbone networks and/or more advanced transfer learning approaches, such as pre-training on medical images.

5. REFERENCES

[1] Pu Wang, Xiao Xiao, Jeremy R Glissen Brown, Tyler M Berzin, Mengtian Tu, Fei Xiong, Xiao Hu, Peixi Liu, Yan Song, Di Zhang, et al., "Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy," Nature Biomedical Engineering, vol. 2, no. 10, pp. 741, 2018.

[2] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[3] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden,
Adam Bailey, Stefano Realdon, James East, Georges
Wagnières, Victor Loschenov, Enrico Grisan, Walter
Blondel, and Jens Rittscher, “Endoscopy artifact de-
tection (EAD 2019) challenge dataset,” CoRR, vol.
abs/1905.03209, 2019.
[4] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden,
James East, Xin Lu, and Jens Rittscher, “A deep learning
framework for quality assessment and restoration in video
endoscopy,” CoRR, vol. abs/1904.07073, 2019.
[5] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He,
Bharath Hariharan, and Serge Belongie, “Feature pyra-
mid networks for object detection,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recog-
nition, 2017, pp. 2117–2125.
[6] Karen Simonyan and Andrew Zisserman, “Very deep
convolutional networks for large-scale image recogni-
tion,” arXiv preprint arXiv:1409.1556, 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[8] Diederik P Kingma and Jimmy Ba, “Adam: A
method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.