            ENDOSCOPY ARTEFACT DETECTION AND SEGMENTATION USING DEEP
                        CONVOLUTIONAL NEURAL NETWORK

                                         Haijian Chen, Chenyu Lian, Liansheng Wang

               Department of Computer Science, School of Informatics, Xiamen University, China


                                                      ABSTRACT

Endoscopy Artefact Detection and Segmentation (EAD2020) includes three sub-tasks: multi-class artefact detection, semantic segmentation, and out-of-sample generalisation. This manuscript summarizes our solution. The challenge can be treated as two independent problems: object detection and semantic segmentation. For the detection problem, we use Cascade R-CNN with FPN and Hybrid Task Cascade. For the segmentation problem, we use a DeepLab v3+ model with a bce+dice loss.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                                                 1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers. However, a major drawback of the recorded video frames is that they are heavily corrupted with multiple artifacts. Accurate detection, and even segmentation, of these artifacts is therefore very helpful for improving endoscopy tools. This task aims to localise bounding boxes, predict class labels, and produce pixel-wise segmentations of 8 different artifact classes for given frames and clinical endoscopy video clips.

                                                   2. DATASETS

The details of the Endoscopy Artifact Detection and Segmentation dataset are described well in the original papers [1, 2, 3]. The following part gives a brief analysis of the EAD2020 data.

2.1. Object detection

We combine the two phases of the dataset together. As shown in Table 1, the distribution of the different classes is very imbalanced. The counts of 'blur', 'instrument', and 'blood' are significantly smaller than the others, which could make them hard examples when training models. The counts of 'specularity' and 'artifact' are very large, and their objects are very small in size. Based on this, we pay attention to the balance of each class when we set aside 20% of the data as the validation set.

  Class          Count    Ratio      Class          Count    Ratio
  specularity     9791    36.2%      contrast        1641     6.1%
  saturation      1277     4.7%      bubbles         4670    17.3%
  artifact        8012    29.6%      instrument       470     1.7%
  blur             684     2.5%      blood            491     1.8%

          Table 1. Class distribution of the detection dataset

2.2. Semantic segmentation

Many ground-truth pixel values lie strictly between 0 and 255 in the dataset. After dividing all ground-truth pixel values by 255 and using a threshold of 0.5 to classify foreground and background pixels, we obtain the statistics shown in Table 2. Foreground pixels are significantly fewer than background pixels, and the foreground pixels of the different classes are imbalanced as well. As shown in Table 3, the most common image sizes are 512 × 512 and 1349 × 1079; the remaining images have various other sizes. We shuffle the dataset randomly and use 20% of the data as the validation set.

  Class          fg            bg            fg / Σ fg    fg / (fg+bg)
  Instrument     15997225      371567134       36.39%        4.13%
  Specularity     4700063      382864296       10.69%        1.21%
  Artefact        4100248      383464111        9.33%        1.06%
  Bubbles         8967902      378596457       20.40%        2.31%
  Saturation     10190545      377373814       23.18%        2.63%

  Table 2. Pixel distribution of the segmentation dataset (fg: foreground, bg: background)

  Size           Count    Ratio       Size       Count    Ratio
  512 × 512      138      25.36%      Smaller    129      23.71%
  1349 × 1079    118      21.69%      Bigger     159      29.23%
  Total          544      100%

  Table 3. Image sizes of the segmentation dataset (Smaller: height < 800 and width < 700; Bigger is the contrary)
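As a concrete illustration of this preprocessing, here is a minimal sketch of the binarisation used for Table 2 (the file name and the OpenCV-based loading are our own assumptions, not part of the challenge code):

```python
import cv2
import numpy as np

# Hypothetical ground-truth mask file; many of its pixel values lie
# strictly between 0 and 255.
gt = cv2.imread("instrument_mask.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Rescale to [0, 1] and binarise with a 0.5 threshold, as done for Table 2.
binary = (gt / 255.0 >= 0.5).astype(np.uint8)
fg = int(binary.sum())
bg = int(binary.size - fg)
print(fg, bg, fg / (fg + bg))
```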
                                                    3. METHODS

3.1. Object Detection

3.1.1. Model Overview

We use Cascade R-CNN [4] with a ResNeXt-101 [5] backbone and FPN [6] as the neck of the model. We also train a Hybrid Task Cascade model [7] with the same backbone and neck.

3.1.2. Loss

We use Cross Entropy Loss for classification. Smooth L1 Loss is used for bounding-box regression to improve the precision of detection.
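To make the two terms concrete, here is a minimal PyTorch sketch (the tensor shapes and the unweighted sum are illustrative assumptions rather than the exact losses inside the detection framework):

```python
import torch
import torch.nn.functional as F

# Illustrative tensors: 8 proposals, 9 logits (8 artefact classes + background),
# and 4 regression offsets per proposal.
cls_logits = torch.randn(8, 9)
cls_labels = torch.randint(0, 9, (8,))
box_pred   = torch.randn(8, 4)
box_target = torch.randn(8, 4)

cls_loss = F.cross_entropy(cls_logits, cls_labels)   # classification term
reg_loss = F.smooth_l1_loss(box_pred, box_target)    # bounding-box regression term
loss = cls_loss + reg_loss
```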
3.1.3. Augmentation

On the training data, we perform random flipping, normalization, and resizing. The images are resized to 512 × 512.

3.1.4. Implementation Details

We extract candidate bounding boxes with a region proposal network (RPN) and use non-maximum suppression (NMS) to filter the proposals. Observing that some small objects are ignored, the NMS threshold is increased from 0.7 to 0.8, which slightly improves the recall rate and mAP. Soft-NMS [8] is applied to avoid mistakenly discarding bounding boxes outright.
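For illustration, here is a minimal sketch of Gaussian Soft-NMS written directly with PyTorch and torchvision; in practice we rely on the detection framework's built-in implementation, so the decay function and the hyper-parameters below are assumptions:

```python
import torch
from torchvision.ops import box_iou

def gaussian_soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Decay the scores of overlapping boxes instead of discarding them outright.

    boxes: (N, 4) tensor of x1, y1, x2, y2; scores: (N,) confidences (N >= 1).
    """
    boxes, scores = boxes.clone(), scores.clone()
    kept_boxes, kept_scores = [], []
    while scores.numel() > 0:
        top = torch.argmax(scores)
        kept_boxes.append(boxes[top])
        kept_scores.append(scores[top])
        keep = torch.ones_like(scores, dtype=torch.bool)
        keep[top] = False
        boxes, scores = boxes[keep], scores[keep]
        if scores.numel() == 0:
            break
        # Gaussian decay: the higher the IoU with the kept box, the stronger the decay.
        ious = box_iou(kept_boxes[-1].unsqueeze(0), boxes).squeeze(0)
        scores = scores * torch.exp(-(ious ** 2) / sigma)
        # Drop candidates whose decayed score falls below the threshold.
        keep = scores > score_thresh
        boxes, scores = boxes[keep], scores[keep]
    return torch.stack(kept_boxes), torch.stack(kept_scores)
```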
We use the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. To obtain better results at convergence, we add a warm-up period in which the learning rate increases linearly to 0.0025 over the first 500 iterations. The network is trained for 13 epochs in total.
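A minimal sketch of this warm-up schedule using a PyTorch LambdaLR scheduler (the model and the training loop are stand-ins; the actual training runs inside the detection framework):

```python
import torch

model = torch.nn.Linear(16, 9)                      # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=1e-4)

# Scale the base LR linearly up to 1.0 over the first 500 iterations.
warmup_iters = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters))

for it in range(1000):                              # stand-in training loop
    loss = model(torch.randn(4, 16)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                # step once per iteration
```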
3.2. Semantic Segmentation

3.2.1. Model Overview

We use the DeepLab V3+ network [9] with a ResNet101 [5] backbone for semantic segmentation. DeepLab V3+ is an encoder-decoder network with dilated (atrous) convolutions; the ASPP module and the decoder are implemented as in the original paper.

The output of the network is activated by a sigmoid function to obtain the probability maps, since there may be overlap among the different channels of the mask. The segmentation problem is therefore treated as multiple binary segmentation tasks.
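In other words, every class channel receives an independent binary decision; a small sketch (the channel count follows Table 2 and the 0.5 threshold follows Section 4.2.1):

```python
import torch

logits = torch.randn(1, 5, 512, 512)   # stand-in logits: 5 segmentation classes
probs = torch.sigmoid(logits)          # per-channel probability maps (channels may overlap)
masks = probs >= 0.5                   # independent binary mask per class
```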
3.2.2. Loss

We evaluated different losses, including Binary Cross Entropy, Dice Loss, Lovász-Hinge Loss [10], and their combinations. Based on the results discussed in Section 4.2, we finally choose bce+dice as the loss of our model, which simply means

    L = L_{bce} + L_{dice}
      = -\, y_{gt} \log y_{pred} - (1 - y_{gt}) \log(1 - y_{pred})
        + 1 - \frac{2 \sum y_{gt} \cdot y_{pred} + \epsilon}{\sum y_{gt} + \sum y_{pred} + \epsilon}

(\epsilon = 10^{-7}; y_{gt} and y_{pred} are flattened tensors)
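A minimal PyTorch sketch of this loss as defined above (flattening over the whole batch and the mean reduction of the BCE term are our assumptions):

```python
import torch

def bce_dice_loss(logits, target, eps=1e-7):
    """L = L_bce + L_dice on flattened sigmoid probability maps."""
    y_pred = torch.sigmoid(logits).reshape(-1)
    y_gt = target.reshape(-1).float()
    bce = torch.nn.functional.binary_cross_entropy(y_pred, y_gt)
    dice = (2 * (y_gt * y_pred).sum() + eps) / (y_gt.sum() + y_pred.sum() + eps)
    return bce + (1 - dice)

# loss = bce_dice_loss(model(images), masks)
```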
3.2.3. Augmentation

We apply random brightness and contrast changes, random horizontal and vertical flips, random shift-scale-rotation, Gaussian blurring, resizing, and normalization to the images of the training set. All random transformations are applied with a probability of 0.5, using the default parameters of the Albumentations library [11]. For the validation set, we only apply image normalization.

The images are resized to 512 × 512 and 1024 × 1024 during the training phase; see Section 4.2.
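A sketch of this pipeline with Albumentations (the transform names follow the library, but the exact composition shown here is a reconstruction rather than the verbatim training code):

```python
import albumentations as A

train_tfms = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.ShiftScaleRotate(p=0.5),
    A.GaussianBlur(p=0.5),
    A.Resize(512, 512),        # or 1024 x 1024, see Section 4.2
    A.Normalize(),
])
val_tfms = A.Compose([A.Resize(512, 512), A.Normalize()])

# augmented = train_tfms(image=image, mask=mask)
# image, mask = augmented["image"], augmented["mask"]
```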

3.2.4. Implementation Details

We load weights pre-trained on ImageNet for the backbone network. The network is trained using SGD with a momentum of 0.9 and a weight decay of 0.0001, with mini-batches of size 4. The learning rate is increased linearly over a warm-up period of 5 epochs to a maximum value of 0.01, and is then adjusted by cosine annealing with warm restarts [12] with a period of 40 epochs. The images are resized to 512 × 512 to train for 200 epochs, and then resized to 1024 × 1024 to train for another 100 epochs.
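A sketch of this learning-rate schedule using PyTorch's CosineAnnealingWarmRestarts (the epoch-level warm-up below is our own simple formulation and only illustrates the shape of the schedule):

```python
import torch

model = torch.nn.Conv2d(3, 5, 3)          # stand-in for the DeepLab V3+ network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

warmup_epochs, max_lr = 5, 0.01
cosine = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=40)

for epoch in range(300):
    if epoch < warmup_epochs:
        # linear warm-up towards the maximum learning rate of 0.01
        for group in optimizer.param_groups:
            group["lr"] = max_lr * (epoch + 1) / warmup_epochs
    else:
        # cosine annealing with warm restarts, restart period of 40 epochs
        cosine.step(epoch - warmup_epochs)

    # stand-in training epoch (mini-batches of size 4 in our setting)
    loss = model(torch.randn(4, 3, 64, 64)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```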
                                                    4. RESULTS

4.1. Object Detection

Table 4 shows the mAPs of the different classes on the validation set, evaluated with COCO metrics, and Table 5 gives more details of the evaluation results. The metrics of the two models are fairly close to each other. In Figure 1, we find that the HTC model is good at detecting large objects while doing poorly on some small objects, even though its AP_small metric is slightly better than the other model's.

  Class          Cascade R-CNN     HTC         Faster R-CNN
  instrument        0.64791        0.64965        0.56197
  artifact          0.22540        0.22511        0.21733
  blood             0.10594        0.12520        0.10998
  blur              0.26506        0.26097        0.19428
  bubbles           0.11302        0.10491        0.10600
  contrast          0.40275        0.39182        0.38044
  saturation        0.27912        0.24990        0.26373
  specularity       0.09281        0.09485        0.08561

          Table 4. mAPs of different classes on the validation set

  Metric         Cascade R-CNN     HTC         Faster R-CNN
  mAP               0.267          0.263          0.240
  AP50              0.501          0.505          0.498
  AP75              0.246          0.249          0.209
  AP_small          0.082          0.091          0.086
  AP_medium         0.162          0.166          0.166
  AP_large          0.337          0.337          0.299

     Table 5. AP metrics of the evaluation results on the validation set

  Fig. 1. Predictions for two images (left: Hybrid Task Cascade; right: Cascade R-CNN)

The results in Tables 6, 7, and 8 are provided by the official leaderboard. Table 6 shows the detection scores on the first phase of the test data: the Hybrid Task Cascade network performs better in mAP while getting a lower score in IoU. Table 7 shows the scores of the final test, where we get a higher detection score with the Cascade R-CNN network. As shown in Table 8, resizing the images to 1024 × 1024 instead of 512 × 512 does not give a better detection score but contributes to generalization performance.

  Model             mAP_d      IoU_d      mAP_g      mAP_sq
  Cascade R-CNN     0.2238     0.1707     0.2405     0.3038
  HTC network       0.2393     0.0674     0.2621     0.3214

        Table 6. Detection scores in the first phase of test data

  Model             score_d     d_std      gmAP       g_dev
  Cascade R-CNN     0.2193      0.0871     0.2485     0.0552
  HTC network       0.2021      0.0901     0.2744     0.0556

               Table 7. Detection scores in the final test

  Size              score_d     d_std      gmAP       g_dev
  512 × 512         0.2193      0.0871     0.2485     0.0552
  1024 × 1024       0.2156      0.0991     0.2659     0.0764

  Table 8. Detection scores in the final test with Cascade models trained with different input sizes

4.2. Semantic Segmentation

4.2.1. Experiments with losses on the validation set

To evaluate the results of different losses, we train a DeepLab V3+ model with a ResNet101 backbone and a modified U-Net [13] model with a ResNet-34 backbone for 160 epochs. The threshold used to predict foreground pixels is 0.5, and the other configurations are the same as in Section 3.2.4. In Table 9 and Table 10, 'bce' is the Binary Cross Entropy loss, 'dice' is the Dice Loss, and 'bce+dice' is defined in Section 3.2.2. 'p' and 'r' in the tables stand for precision and recall.

  Loss         F1        F2        IoU       p         r
  bce          0.585     0.6754    0.4447    0.5014    0.8136
  dice         0.5846    0.5881    0.4874    0.6755    0.601
  bce+dice     0.6728    0.7042    0.5523    0.6699    0.7585

  Table 9. Metrics on the validation set with different losses for the DeepLab V3+ model with ResNet101 backbone (160 epochs)

  Loss         F1        F2        IoU       p         r
  bce          0.4666    0.4346    0.3743    0.7201    0.4196
  dice         0.561     0.5421    0.469     0.7188    0.5353
  bce+dice     0.6138    0.5837    0.5057    0.7415    0.5709

  Table 10. Metrics on the validation set with different losses for a modified U-Net model with ResNet-34 backbone (160 epochs)

In Table 9, the experiment shows that bce+dice gets the best scores in Dice (F1), F2, and IoU, and its precision is close to that of dice while not losing much recall. In Table 10, we can see a significant improvement of the U-Net when using bce+dice, showing the effectiveness of this loss.

After 300 epochs, the DeepLab V3+ model trained with bce+dice reaches 0.7927 in F1, 0.8386 in F2, 0.6857 in IoU, 0.7422 in precision, and 0.887 in recall. The U-Net models do not reach much better scores, as they almost converge after 165 epochs.

We also tested the Lovász-Hinge loss. In our experiments, it is hard to converge when the model is trained from scratch, so we instead use the Lovász-Hinge loss to fine-tune the DeepLab model already trained with bce+dice for 300 epochs. Table 11 shows the results of the first 20 epochs ('Epochs' means the number of training epochs with the Lovász-Hinge loss). The model converges after 30 epochs, but these results are worse than the model before fine-tuning, so we give up this method.

We choose bce+dice to train the final model.
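For reference, the pixel-level metrics reported in this section can be computed per image as in the following sketch (our own formulation, with F2 being the F-beta score at beta = 2; the official challenge evaluation scripts may differ):

```python
import numpy as np

def seg_metrics(pred, gt, beta=2.0, eps=1e-7):
    """Pixel-level precision, recall, F1, F-beta, and IoU for one pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    fbeta = (1 + beta ** 2) * p * r / (beta ** 2 * p + r + eps)
    iou = tp / (tp + fp + fn + eps)
    return p, r, f1, fbeta, iou
```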
  Epochs     F1        F2        IoU       p         r
  5          0.5491    0.5038    0.4486    0.8490    0.4828
  20         0.5115    0.4610    0.4173    0.8927    0.4367
  40         0.5373    0.4846    0.4387    0.8872    0.4594

  Table 11. Using the Lovász-Hinge loss to fine-tune a model trained with the BCE + Dice loss

4.2.2. Experiments with backbones on the validation set

Table 12 shows another experiment comparing different networks. We find that the Xception-based DeepLab V3+ converges significantly more slowly than the ResNet101-based model and does not reach better scores than the U-Net model.

  Model       F1        F2        IoU       p         r
  D-X         0.4189    0.4248    0.3167    0.486     0.4388
  D-R101      0.5823    0.5967    0.4717    0.6288    0.6313
  U-R34       0.5535    0.5209    0.4507    0.7512    0.5078

  Table 12. Metrics on the validation set with different networks (85 epochs; D: DeepLab V3+, U: U-Net, X: Xception, R: ResNet)

4.2.3. Submission results

The training parameters are listed in Section 3.2.4, and all the results in this subsection are provided by the official leaderboard. In Table 13, Model 1 is trained with 512 × 512 images and uses a threshold of 0.5. Model 2 is the same as Model 1 except that the threshold is changed to 0.7. Model 3 is trained with 1024 × 1024 images and uses a threshold of 0.7.

        F1        F2        p         r         sscore     sd
  1     0.4872    0.5027    0.5250    0.5467    0.5154     0.2327
  2     0.4802    0.5156    0.4836    0.5872    0.5167     0.2403
  3     0.5012    0.5042    0.5817    0.5390    0.5315     0.2644

  Table 13. Segmentation scores in the first phase of test data (50% of final data)

We resized the images to 512 × 512 at first. However, as discussed in Table 3, there are many bigger images, which can also be found in the first phase of the test images. Compared with the models trained only on images resized to 512 × 512, the models trained with 1024 × 1024 images get better scores on the validation set, and some predictions look smoother, as shown in Figures 2, 3, and 4.

  Fig. 2. Test image     Fig. 3. Pred-512     Fig. 4. Pred-1024

We find that adding the segmentation data of EAD2019 to the training set also helps a little, although there is potential validation data leakage, which makes the validation metrics unreliable. However, it does not help in the detection task.

We chose Model 3 to predict the final test data and got the scores shown in Table 14.

  Model                                        sscore     sstd
  3: DeepLabV3+ / ResNet101 / 1024 × 1024      0.5459     0.2682

               Table 14. Segmentation scores in the final test

                                          5. DISCUSSION & CONCLUSION

In task 1, we compare Cascade R-CNN with Hybrid Task Cascade to obtain a better detection model. FPN and Soft-NMS are used to improve the detection precision in the presence of class imbalance, and a proper NMS threshold helps to improve the recall rate of small objects.

In task 2, we select DeepLab V3+ to solve the problem and choose bce+dice as the loss function to balance precision and recall. The image sizes in the dataset are a noticeable factor in the training phase, and adjusting the prediction threshold also contributes to a more balanced model.

                                                 6. REFERENCES

 [1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

 [2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

 [3] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.
 [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
     Sun. Deep residual learning for image recognition,
     2015.
 [6] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming
     He, Bharath Hariharan, and Serge J. Belongie. Fea-
     ture pyramid networks for object detection. CoRR,
     abs/1612.03144, 2016.
 [7] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xi-
     aoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jian-
     ping Shi, Wanli Ouyang, Chen Change Loy, and Dahua
     Lin. Hybrid task cascade for instance segmentation.
     CoRR, abs/1901.07518, 2019.
 [8] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and
     Larry S. Davis. Improving object detection with one
     line of code. CoRR, abs/1704.04503, 2017.

 [9] Liang-Chieh Chen, Yukun Zhu, George Papandreou,
     Florian Schroff, and Hartwig Adam. Encoder-decoder
     with atrous separable convolution for semantic image
     segmentation, 2018.
[10] Maxim Berman, Amal Rannen Triki, and Matthew B.
     Blaschko. The Lovász-softmax loss: A tractable surro-
     gate for the optimization of the intersection-over-union
     measure in neural networks, 2017.
[11] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov,
     and A. A. Kalinin. Albumentations: fast and flexible
     image augmentations. ArXiv e-prints, 2018.
[12] Ilya Loshchilov and Frank Hutter. SGDR: stochastic
     gradient descent with restarts. CoRR, abs/1608.03983,
     2016.

[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
     U-net: Convolutional networks for biomedical im-
     age segmentation. In Medical Image Computing and
     Computer-Assisted Intervention – MICCAI 2015, pages
     234–241, 2015.