Temporal Context Framework for Endoscopy Artefact Segmentation and Detection

Haili Ye1,2, Hanpei Miao1,2, Jiang Liu1,2, Dahan Wang3 and Heng Li1,2

1 Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen 518055, China
2 Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
3 Department of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361004, China

✉ yehl@mail.sustech.edu.cn (H. Ye)

Abstract
Endoscopic video processing could facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. However, most current methods still analyse single frames, so the predictions of neighbouring frames are independent of each other, which causes temporal jitter. In this paper, we propose a temporal context framework for endoscopy artefact segmentation and detection. The framework extends general segmentation and detection models to a temporal-input form, and we add a Temporal Context Transformer (TCT) after the encoder of the model to improve its ability to construct temporal context features. Experiments on the EndoCV 2022 challenge dataset show that this framework improves the robustness of the model.

Keywords
Medical Image Analysis, Colonoscopic Image, Semantic Segmentation, Object Detection

1. Introduction

Colon cancer[1] is a common malignant tumor of the digestive tract that occurs in the colon. It is closely related to the consumption of red meat (such as beef), and its incidence ranks third among gastrointestinal tumors. Colon cancer is mainly adenocarcinoma, mucinous adenocarcinoma or undifferentiated carcinoma. Endoscopy[2] can not only clearly reveal intestinal lesions but also treat some of them: benign lesions such as intestinal polyps can be removed directly under endoscopy, intestinal bleeding can be stopped, and foreign bodies in the colon can be extracted. Endoscopic video[3] processing could facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. Computer assisted interventions[4] have the potential to enhance the surgeon's visualization and navigation capabilities and to provide postoperative analytics for surgical training and risk assessment. A necessary element for these processes is scene understanding and, in particular, anatomy and instrument detection and localization. Therefore, by segmenting and differentiating among the elements that appear in the endoscopic view, it is possible to assess tissue-instrument interactions and understand the endoscopic workflow.

Semantic segmentation[5] and object detection[6] are two hot research fields in computer vision. In medical semantic segmentation, Ronneberger et al. proposed the classic medical image segmentation model U-Net[7], whose encoder-decoder structure and skip connections have inspired much subsequent research. On this basis, a series of novel and effective models have been developed, such as U-Net++[8], nnUNet[9], DANet[10] and Deeplab[11]. For the analysis of endoscope images, PraNet[12], proposed by Fan et al., aggregates high-level features through a parallel partial decoder (PPD) to obtain context information and generate a global map. In medical object detection, Ren et al. proposed Faster R-CNN[13], which achieves end-to-end object detection with a deep two-stage structure. Cai et al. proposed Cascade R-CNN[14], which continuously refines predictions by cascading several detection networks. The Swin Transformer[15] proposed by Liu et al. is a general vision backbone built on the concept of the Transformer[16] and has achieved breakthroughs in multiple vision tasks. However, most current methods are still based on single-frame image analysis, so the analysis results are not well combined with temporal context information.
Endoscope image sequences can provide more information than single frames [17, 18], and combining the contextual temporal information of preceding and succeeding frames can effectively improve the analysis of endoscopy artefacts. Inspired by this, in this paper we propose a Temporal Context Framework for endoscopy artefact segmentation and detection. Our contributions are as follows:

∙ We introduce a general framework to extract temporal context features from sequential images and apply it to semantic segmentation and object detection.
∙ In order to improve the feature modeling ability of the framework, we design a Temporal Context Transformer (TCT) to strengthen the extraction of temporal context features.
∙ Our framework can be adapted to various types of backbone models and can be extended to similar endoscopic analysis problems.

Figure 1: Overall Temporal Context Framework for Endoscopy Artefact Segmentation and Detection

2. METHODOLOGY

In this section, we introduce the proposed temporal context framework for endoscopy artefact segmentation and detection. The overall framework is shown in Fig. 1. It includes an endoscopy artefact segmentation model and an endoscopy artefact detection model. The input of both models is an endoscope image sequence, and we set a hyperparameter 𝐿 to represent the length of the image sequence, so the 𝐿-frame input sequence can be represented as 𝐼 ∈ 𝑅^{𝐿,3,𝐻,𝑊}.

In the endoscopy artefact segmentation model, we use the classical encoder-decoder structure. In particular, the encoder of the model is similar to a traditional encoder and is responsible for extracting the features of each single frame. 𝑁 stacked temporal context transformer layers are connected to the end of the encoder to establish the correlation between the image features of each frame. Compared with general single-frame methods, this module exploits feature correlations between different frames, which can repair wrong features extracted by the model and effectively improves its robustness. We also follow UNet's skip connections and connect the corresponding encoder and decoder stages to supplement shallow features. The features integrated by the temporal context transformer then enter the decoder to obtain the segmentation mask of the endoscopy image.
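To make the sequence-to-mask data flow concrete, the following is a minimal PyTorch sketch of the segmentation branch described above. It is not the authors' released implementation: the encoder, TCT and decoder modules are placeholders for whatever backbone is plugged in, and the global-average pooling used to obtain the per-frame (L, C) tokens is our assumption.

```python
import torch
import torch.nn as nn

class TemporalContextSegmenter(nn.Module):
    """Sketch: per-frame encoder -> temporal context transformer -> mask decoder."""

    def __init__(self, encoder: nn.Module, tct: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # single-frame feature extractor (assumed to return bottleneck + skip features)
        self.tct = tct          # N stacked temporal context transformer layers on (L, C) tokens
        self.decoder = decoder  # mask decoder with UNet-style skip connections

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (L, 3, H, W) -- one L-frame endoscopic image sequence
        feats, skips = self.encoder(x)            # feats: (L, C, h, w)
        L, C, h, w = feats.shape
        tokens = feats.flatten(2).mean(dim=2)     # (L, C): one pooled token per frame (assumption)
        tokens = self.tct(tokens)                 # fuse temporal context across the L frames
        feats = feats + tokens.view(L, C, 1, 1)   # inject fused context back into the spatial features
        return self.decoder(feats, skips)         # (L, num_classes, H, W) segmentation masks
```

Only the bottleneck tokens pass through the TCT in this sketch; the skip features reach the decoder exactly as in UNet.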
The input of the endoscopy artefact detection model is also an 𝐿-frame image sequence, and the overall structure is similar to a common two-stage detector. The features of each frame are extracted by the encoder and then integrated by the 𝑁 temporal context transformer layers. A feature pyramid network can handle multi-scale variation in object detection with little extra computation, so the model uses a feature pyramid to improve its localization ability for multi-scale surgical instances. The multi-scale features extracted by the FPN[19] are fed into the corresponding detection heads for prediction. The detection head uses a region proposal network (RPN)[13] to filter out proposal boxes that may contain surgical instrument instances, and ROI Align[13] extracts the corresponding local features from the global features according to these proposal boxes. These local features are fed to an FFN to classify the artefact within the proposal box and regress its coordinates. Finally, the predictions at different scales are merged and filtered using Soft NMS[20], which removes predictions with large overlap and retains the results with high confidence. The loss function of the object detection model is the same as that of Faster R-CNN[13].

Temporal Context Transformer. In an image sequence, adjacent frames are strongly correlated. Especially when a frame is blurred or contains artefacts, introducing the features of the previous frame can effectively repair target loss or category recognition errors. In order to improve the context understanding and feature integration capabilities of the model for image sequences, we design the temporal context transformer, as shown in Fig. 2. The temporal context transformer is divided into a transformer encoder and a transformer decoder.

Figure 2: Structure of Temporal Context Transformer (TCT)

The features extracted by the encoder are input to the transformer encoder. For the transformer encoder of layer 𝑛, the input is the output 𝐸_{𝑛−1} ∈ 𝑅^{𝐿,𝐶} of the previous layer. The temporal context transformer encoder has a structure similar to the traditional Transformer encoder, but the difference is that we design a timing code 𝑇 that exploits the characteristics of the image sequence. The time difference between any two frames can be calculated from the endoscope image sequence, and the temporal encoding between different frames can be modeled by normalizing these time differences. When the image sequence length is 𝐿, the sequence encoding 𝑇_𝑠 is an 𝐿 × 𝐿 square matrix:

T = \begin{bmatrix} 0 & |t_0 - t_1| & \cdots & |t_0 - t_L| \\ |t_1 - t_0| & 0 & \cdots & |t_1 - t_L| \\ \vdots & \vdots & \ddots & \vdots \\ |t_L - t_0| & |t_L - t_1| & \cdots & 0 \end{bmatrix}    (1)

T_s = Normal(T)^{-1}    (2)

In self-attention, a query 𝑄 ∈ 𝑅^{𝐿,𝐶}, key 𝐾 ∈ 𝑅^{𝐿,𝐶} and value 𝑉 ∈ 𝑅^{𝐿,𝐶} are generated from 𝐸_{𝑛−1}. The initial self-attention weight 𝐴 ∈ 𝑅^{𝐿,𝐿} = Softmax((𝑄 ∗ 𝑊^𝐸_𝑄) ∗ (𝐾 ∗ 𝑊^𝐸_𝐾)^𝑇 / 𝜏) between the 𝐿 frames is then calculated. Next, the sequence encoding is introduced to obtain the final self-attention weight 𝐴′ ∈ 𝑅^{𝐿,𝐿} = 𝐴 ∗ 𝑇_𝑠. In this way, the temporal relevance in the original self-attention weights is strengthened. The following steps are the same as in a classical transformer[16].
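As a rough illustration of Eqs. (1)–(2) and the temporally re-weighted self-attention, consider the PyTorch sketch below. The exact normalisation of the time-difference matrix and the mapping from time differences to weights are not fully specified above, so both are assumptions here (closer frames receive larger weights).

```python
import torch
import torch.nn.functional as F

def sequence_encoding(timestamps: torch.Tensor) -> torch.Tensor:
    """Eqs. (1)-(2): pairwise absolute time differences, normalised to weights.
    timestamps: (L,) acquisition times of the L frames."""
    T = (timestamps[:, None] - timestamps[None, :]).abs()  # (L, L), zeros on the diagonal
    T = T / (T.max() + 1e-6)                               # assumed max normalisation
    return 1.0 - T                                         # assumption: smaller time gap -> larger weight

def temporal_self_attention(E_prev, W_Q, W_K, W_V, timestamps, tau=1.0):
    """Self-attention over L frame tokens, re-weighted by the sequence encoding T_s."""
    Q, K, V = E_prev @ W_Q, E_prev @ W_K, E_prev @ W_V     # each (L, C)
    A = F.softmax((Q @ K.transpose(0, 1)) / tau, dim=-1)   # initial attention A in R^{L,L}
    A = A * sequence_encoding(timestamps)                  # element-wise re-weighting, A' = A * T_s
    return A @ V                                           # temporally aggregated frame features (L, C)
```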
The transformer decoder is responsible for decoding and reconstructing the features of the transformer encoder. The input of the layer-𝑛 transformer decoder is also of the form 𝑅^{𝐿,𝐶}. Like the transformer encoder, the sequence encoding is added to the transformer decoder to improve the temporal modeling ability of the model. In the transformer decoder, the first step is masked self-attention, which constrains the prediction of the model to follow the order of the image sequence. Different from the classical transformer, we add a cross attention[16] unit at the end of the transformer decoder. The transformer decoder calculates a query 𝑄 ∈ 𝑅^{𝐿,𝐶} and key 𝐾 ∈ 𝑅^{𝐿,𝐶} from the output 𝐸_𝑛 of the transformer encoder of the same layer, and the cross attention weight matrices are computed from this 𝑄 and 𝐾. As shown in Fig. 2, there are two parallel attention modules for feature learning in this part. We expect these two attention modules to learn feature compensation and contraction respectively; therefore their parameters are not shared, and matrix addition and matrix cross product (⊗) are used respectively. The specific operations are as follows:

C′ = Softmax((Q ∗ W^D_{Q1}) ∗ (K ∗ W^D_{K1})^T / τ)    (3)

C′′ = Softmax((Q ∗ W^D_{Q2}) ∗ (K ∗ W^D_{K2})^T / τ)    (4)

D_n = Ins.Norm{ Ins.Norm{ C′ ∗ V ∗ W^D_{V1} + V } + Ins.Norm{ (C′′ ∗ V ∗ W^D_{V2}) ⊗ V } }    (5)

The above process fully fuses the features of each frame, and the temporal context transformer effectively extracts the context information of different frames. The aggregated features are reshaped to their original dimensions before being sent into the decoder.
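A compact sketch of how the two non-shared cross-attention branches of Eqs. (3)–(5) could be realised is given below. It is only an approximation of the description above: using layer normalisation as a stand-in for Ins.Norm and reading ⊗ as an element-wise product are our assumptions.

```python
import torch
import torch.nn.functional as F

def dual_cross_attention(Q, K, V, W_Q1, W_K1, W_V1, W_Q2, W_K2, W_V2, tau=1.0):
    """Sketch of Eqs. (3)-(5): two non-shared cross-attention branches,
    one additive ('compensation') and one multiplicative ('contraction'),
    fused with normalisation. Q, K, V: (L, C) frame tokens; W_*: (C, C) projections."""
    def ins_norm(x):
        # assumption: per-token normalisation stands in for the paper's Ins.Norm
        return F.layer_norm(x, x.shape[-1:])

    C1 = F.softmax((Q @ W_Q1) @ (K @ W_K1).T / tau, dim=-1)   # Eq. (3)
    C2 = F.softmax((Q @ W_Q2) @ (K @ W_K2).T / tau, dim=-1)   # Eq. (4)
    add_branch = ins_norm(C1 @ (V @ W_V1) + V)                # additive compensation branch
    mul_branch = ins_norm((C2 @ (V @ W_V2)) * V)              # element-wise branch (our reading of ⊗)
    return ins_norm(add_branch + mul_branch)                  # Eq. (5): fused decoder output D_n
```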
3. Experimental Results

In this section, we compare the proposed Temporal Context Framework for endoscopy artefact segmentation and detection with state-of-the-art models on the segmentation and detection of endoscopy artefacts.

Figure 3: Example of sequential endoscopy artefact image segmentation and detection results.

Data details and preparation. We mainly used the EndoCV2022 challenge dataset [17] of endoscopic images for endoscopy artefact detection in this work. The dataset includes five categories: nonmucosa, artefact, saturation, specularity and bubbles. EndoCV launched this as an extension of the previous artefact detection and segmentation challenges [21, 22] with a dataset specific to colonoscopy. The dataset contains 24 endoscopic video sequences for the EAD sub-challenge with a total of 1,449 endoscopic images. We split the dataset into 80% of the sequences for training and 20% for validation. For the segmentation task, we used the Dice coefficient, Jaccard coefficient and pixel accuracy (PA) for evaluation. For the detection task, we used mAP at different thresholds.

Implementation details. The deep models are implemented in PyTorch and trained on an NVIDIA Tesla V100 GPU. The segmentation model uses the SGD optimizer with a learning rate of 10^{-4}; the detection model is based on mmdetection and uses SGD with a learning rate of 10^{-2}. The batch size is set to 2, a sliding window of length 𝐿 is used to sample subsequences from the original sequences (a sketch of this sampling is given at the end of this section), and input sequence images are resized to 960×540. Since the inputs are image sequences, the batch size is relatively small. In addition, we used conventional flipping, affine transformation, contrast adjustment and other augmentations on the training set. To demonstrate the effectiveness of the method, we do not use TTA, multi-model fusion or other post-processing, but only a single model for test set prediction.

We first examined the influence of the number 𝑁 of TCT layers on model performance through comparative experiments; the results are shown in Table 1. The model performs best when 𝑁 is 2, and it overfits when 𝑁 is too large.

Table 1
Temporal context transformer layer number comparative experiment.

Model        | 𝑁 | Dice  | Jaccard | PA
UNet         | 0 | 0.525 | 0.402   | 0.872
UNet         | 1 | 0.635 | 0.491   | 0.892
UNet         | 2 | 0.653 | 0.513   | 0.897
UNet         | 3 | 0.607 | 0.469   | 0.895

Model        | 𝑁 | mAP_mean | mAP50 | mAP75
Faster R-CNN | 0 | 0.232    | 0.464 | 0.208
Faster R-CNN | 1 | 0.305    | 0.554 | 0.309
Faster R-CNN | 2 | 0.317    | 0.563 | 0.321
Faster R-CNN | 3 | 0.288    | 0.523 | 0.272

To verify the effectiveness of our method, we perform a comprehensive comparison with state-of-the-art segmentation and detection methods: segmentation methods including UNet, DANet and PraNet, and detection methods including Faster R-CNN, Cascade R-CNN and Swin Transformer, as shown in Table 2.

Table 2
Structural ablation of the temporal context transformer (√ = with TCT).

Model            | TCT | Dice  | Jaccard | PA
UNet             |     | 0.525 | 0.402   | 0.872
UNet             | √   | 0.635 | 0.491   | 0.892
DANet            |     | 0.651 | 0.597   | 0.923
DANet            | √   | 0.773 | 0.660   | 0.944
PraNet           |     | 0.716 | 0.676   | 0.936
PraNet           | √   | 0.815 | 0.721   | 0.961

Model            | TCT | mAP_mean | mAP50 | mAP75
Faster R-CNN     |     | 0.232    | 0.464 | 0.208
Faster R-CNN     | √   | 0.317    | 0.563 | 0.321
Cascade R-CNN    |     | 0.336    | 0.579 | 0.347
Cascade R-CNN    | √   | 0.395    | 0.611 | 0.401
Swin Transformer |     | 0.356    | 0.598 | 0.364
Swin Transformer | √   | 0.403    | 0.613 | 0.421

Specifically, the performance of every state-of-the-art model improves steadily after being converted to our framework. We visualize an example of inference on an endoscope image sequence for a set of models in Fig. 3. The Dice, Jaccard and PA of the segmentation models improve by 9%–12%, 5%–9% and 2%–3% respectively, and the mAP of the detection models improves by 5%–8%. The gains hold across different types of methods, which shows that our method is robust and broadly applicable.
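For reference, the sliding-window subsequence sampling mentioned in the implementation details could be realised as follows; the stride of 1 is an assumption, since the text does not state it.

```python
def sliding_window_indices(num_frames: int, L: int, stride: int = 1):
    """Yield (start, end) indices of length-L subsequences from a video of num_frames frames."""
    for start in range(0, max(num_frames - L + 1, 1), stride):
        yield start, min(start + L, num_frames)

# Example: a 10-frame video split into length-4 training windows
# list(sliding_window_indices(10, 4)) -> [(0, 4), (1, 5), ..., (6, 10)]
```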
References

[1] Q. L. Zhe Guo, Ruiyao Zhang, et al., Global cancer statistics, 2012, CA: A Cancer Journal for Clinicians 65 (2013) 87–108.
[2] P. L. Reiko Nishihara, Kana Wu, et al., Long-term colorectal-cancer incidence and mortality after lower endoscopy (2018).
[3] Y. Mori, S. Kudo, Detecting colorectal polyps via machine learning, Nature Biomedical Engineering 2 (2018) 713–714.
[4] K. T. Teppei Kanayama, Yusuke Kurose, et al., Gastric cancer detection from endoscopic images using synthesis by GAN, MICCAI 2019, Part V, 2019.
[5] S. A. Taghanaki, K. Abhishek, J. P. Cohen, J. Cohen-Adad, G. Hamarneh, Deep semantic segmentation of natural and medical images: a review, Artif. Intell. Rev. 54 (2021) 137–178. doi:10.1007/s10462-020-09854-1.
[6] G. Zamanakos, L. T. Tsochatzidis, A. Amanatiadis, I. Pratikakis, A comprehensive survey of lidar-based 3d object detection methods with deep learning for autonomous driving, Comput. Graph. 99 (2021) 153–181. doi:10.1016/j.cag.2021.07.003.
[7] O. Ronneberger, Invited talk: U-Net convolutional networks for biomedical image segmentation, in: K. H. Maier-Hein, T. M. Deserno, H. Handels, T. Tolxdorff (Eds.), Bildverarbeitung für die Medizin 2017 – Algorithmen – Systeme – Anwendungen, Informatik Aktuell, Springer, 2017, p. 3. doi:10.1007/978-3-662-54345-0_3.
[8] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, J. Liang, UNet++: Redesigning skip connections to exploit multiscale features in image segmentation, IEEE Trans. Medical Imaging 39 (2020) 1856–1867. doi:10.1109/TMI.2019.2959609.
[9] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, K. H. Maier-Hein, nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2020) 203–211. doi:10.1038/s41592-020-01008-z.
[10] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp. 3146–3154. doi:10.1109/CVPR.2019.00326.
[11] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision – ECCV 2018, Part VII, volume 11211 of Lecture Notes in Computer Science, Springer, 2018, pp. 833–851. doi:10.1007/978-3-030-01234-2_49.
[12] D. Fan, G. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, L. Shao, PraNet: Parallel reverse attention network for polyp segmentation, in: A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, L. Joskowicz (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Part VI, volume 12266 of Lecture Notes in Computer Science, Springer, 2020, pp. 263–273. doi:10.1007/978-3-030-59725-2_26.
[13] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 1137–1149. doi:10.1109/TPAMI.2016.2577031.
[14] Z. Cai, N. Vasconcelos, Cascade R-CNN: delving into high quality object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Computer Vision Foundation / IEEE Computer Society, 2018, pp. 6154–6162. doi:10.1109/CVPR.2018.00644.
[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows (2021).
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), NIPS 2017, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[17] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021).
[18] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022).
[19] Q. Wang, L. Zhou, Y. Yao, Y. Wang, J. Li, W. Yang, An interconnected feature pyramid networks for object detection, J. Vis. Commun. Image Represent. 79 (2021) 103260. doi:10.1016/j.jvcir.2021.103260.
[20] N. Bodla, B. Singh, R. Chellappa, L. S. Davis, Soft-NMS – improving object detection with one line of code, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, IEEE Computer Society, 2017, pp. 5562–5570. doi:10.1109/ICCV.2017.593.
[21] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D. Soberanis-Mukul, S. Albarqouni, X. Wang, C. Wang, S. Watanabe, I. Oksuz, Q. Ning, S. Yang, M. A. Khan, X. W. Gao, S. Realdon, M. Loshchenov, J. A. Schnabel, J. E. East, G. Wagnieres, V. B. Loschenov, E. Grisan, C. Daul, W. Blondel, J. Rittscher, An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific Reports 10 (2020). doi:10.1038/s41598-020-59413-5.
[22] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, M. Gridach, I. Voiculescu, V. Yoganand, A. Chavan, A. Raj, N. T. Nguyen, D. Q. Tran, L. D. Huynh, N. Boutry, S. Rezvy, H. Chen, Y. H. Choi, A. Subramanian, V. Balasubramanian, X. W. Gao, H. Hu, Y. Liao, D. Stoyanov, C. Daul, S. Realdon, R. Cannizzaro, D. Lamarque, T. Tran-Nguyen, A. Bailey, B. Braden, J. E. East, J. Rittscher, Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.