Temporal Context Framework for Endoscopy Artefact Segmentation and Detection

Haili Ye1,2, Hanpei Miao1,2, Jiang Liu1,2, Dahan Wang3 and Heng Li1,2

1 Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen 518055, China
2 Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
3 Department of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361004, China

✉ yehl@mail.sustech.edu.cn (H. Ye)

Abstract
Endoscopic video processing could facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. However, most current methods still analyse single frames, so the predictions of neighbouring frames are independent of each other, which causes temporal jitter. In this paper, we propose a temporal context framework for endoscopy artefact segmentation and detection. The framework extends general segmentation and detection models to a temporal-input form, and we add a Temporal Context Transformer (TCT) after the encoder of the model to improve its ability to construct temporal context features. Experiments on the EndoCV 2022 challenge dataset show that this framework improves the robustness of the model.

Keywords
Medical Image Analysis, Colonoscopic Image, Semantic Segmentation, Object Detection

1. Introduction

Colon cancer[1] is a common malignant tumor of the digestive tract that occurs in the colon. It is closely related to the consumption of red meat (such as beef), and its incidence ranks third among gastrointestinal tumors. Colon cancer is mainly adenocarcinoma, mucinous adenocarcinoma or undifferentiated carcinoma. Endoscopy[2] can not only clearly reveal intestinal lesions but also treat some of them: benign lesions such as intestinal polyps can be removed directly under endoscopy, intestinal bleeding can be stopped, and foreign bodies in the colon can be extracted. Endoscopic video[3] processing could facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. Computer assisted interventions[4] have the potential to enhance the surgeon's visualization and navigation capabilities and to provide postoperative analytics for surgical training and risk assessment. A necessary element for these processes is scene understanding and, in particular, anatomy and instrument detection and localization. Therefore, by segmenting and differentiating among the elements that appear in the endoscopic view, it is possible to assess tissue-instrument interactions and understand the endoscopic workflow.

Semantic segmentation[5] and object detection[6] are two hot research fields in computer vision. In medical semantic segmentation, Ronneberger et al. proposed the classic medical image segmentation model U-Net[7], whose encoder-decoder structure and skip connections have inspired much subsequent research. On this basis, a series of novel and effective models have been developed, such as U-Net++[8], nnUNet[9], DANet[10] and Deeplab[11]. For the analysis of endoscope images, PraNet[12], proposed by Fan et al., aggregates high-level features through a parallel partial decoder (PPD) to obtain context information and generate a global map. In medical object detection, Ren et al. proposed Faster R-CNN[13], which achieves end-to-end object detection with a deep two-stage structure. Cai et al. proposed Cascade R-CNN[14], which continuously refines predictions by cascading several detection networks. The Swin Transformer[15] proposed by Liu et al. is a general vision backbone built on the concept of the Transformer[16] and has achieved breakthroughs in multiple vision tasks. However, most current methods are still based on single-frame image analysis, so the analysis results are not well combined with temporal context information.
Endoscope image sequences can provide more information than single frames [17, 18], and combining the contextual temporal information of preceding and succeeding frames can effectively improve the analysis of endoscopy artefacts. Inspired by this, in this paper we propose a Temporal Context Framework for endoscopy artefact segmentation and detection. Our contributions are as follows:

∙ We introduce a general framework to extract temporal context features from sequential images and apply it to semantic segmentation and object detection.
∙ In order to improve the feature modeling ability of the framework, we design a Temporal Context Transformer (TCT) to strengthen the extraction of temporal context features.
∙ Our framework can be adapted to various types of backbone models and can be extended to similar endoscopic analysis problems.

Figure 1: Overall Temporal Context Framework for Endoscopy Artefact Segmentation and Detection

2. METHODOLOGY

In this section, we introduce the proposed temporal context framework for endoscopy artefact segmentation and detection. The overall framework is shown in Fig. 1. It includes an endoscopy artefact segmentation model and an endoscopy artefact detection model. The input of both models is an endoscope image sequence, and we set a hyperparameter 𝐿 to represent the length of the image sequence, so the 𝐿-frame input sequence can be represented as 𝐼 ∈ 𝑅^{𝐿,3,𝐻,𝑊}.

In the endoscopy artefact segmentation model, we use the classical encoder-decoder structure. In particular, the encoder of the model is similar to a traditional encoder and is responsible for extracting the features of each single frame. 𝑁 stacked temporal context transformer layers are connected to the end of the encoder to establish the correlation between the image features of each frame. Compared with general single-frame methods, this module exploits feature correlations between different frames, which can repair wrong features extracted by the model and effectively improves its robustness. We also follow UNet's skip connections and connect the corresponding encoder and decoder stages to supplement shallow features. The features integrated by the temporal context transformer then enter the decoder to obtain the segmentation mask of the endoscopy image.
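To make the sequence-to-mask data flow concrete, the following is a minimal PyTorch sketch of the segmentation branch described above. It is not the authors' released implementation: the encoder, TCT and decoder modules are placeholders for whatever backbone is plugged in, and the global-average pooling used to obtain the per-frame (L, C) tokens is our assumption.

```python
import torch
import torch.nn as nn

class TemporalContextSegmenter(nn.Module):
    """Sketch: per-frame encoder -> temporal context transformer -> mask decoder."""

    def __init__(self, encoder: nn.Module, tct: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # single-frame feature extractor (assumed to return bottleneck + skip features)
        self.tct = tct          # N stacked temporal context transformer layers on (L, C) tokens
        self.decoder = decoder  # mask decoder with UNet-style skip connections

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (L, 3, H, W) -- one L-frame endoscopic image sequence
        feats, skips = self.encoder(x)            # feats: (L, C, h, w)
        L, C, h, w = feats.shape
        tokens = feats.flatten(2).mean(dim=2)     # (L, C): one pooled token per frame (assumption)
        tokens = self.tct(tokens)                 # fuse temporal context across the L frames
        feats = feats + tokens.view(L, C, 1, 1)   # inject fused context back into the spatial features
        return self.decoder(feats, skips)         # (L, num_classes, H, W) segmentation masks
```

Only the bottleneck tokens pass through the TCT in this sketch; the skip features reach the decoder exactly as in UNet.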
The input of the endoscopy artefact detection model is also an 𝐿-frame image sequence, and the overall structure is similar to a common two-stage detector. The features of each frame are extracted by the encoder and then integrated by the 𝑁 temporal context transformer layers. A feature pyramid network can handle multi-scale variation in object detection with little extra computation, so the model uses a feature pyramid to improve its localization ability for multi-scale surgical instances. The multi-scale features extracted by the FPN[19] are fed into the corresponding detection heads for prediction. The detection head uses a region proposal network (RPN)[13] to filter out proposal boxes that may contain surgical instrument instances, and ROI Align[13] extracts the corresponding local features from the global features according to these proposal boxes. These local features are fed to an FFN to classify the artefact within the proposal box and regress its coordinates. Finally, the predictions at different scales are merged and filtered using Soft NMS[20], which removes predictions with large overlap and retains the results with high confidence. The loss function of the object detection model is the same as that of Faster R-CNN[13].

Temporal Context Transformer. In an image sequence, adjacent frames are strongly correlated. Especially when a frame is blurred or contains artefacts, introducing the features of the previous frame can effectively repair target loss or category recognition errors. In order to improve the context understanding and feature integration capabilities of the model for image sequences, we design the temporal context transformer, as shown in Fig. 2. The temporal context transformer is divided into a transformer encoder and a transformer decoder.

Figure 2: Structure of Temporal Context Transformer (TCT)

The features extracted by the encoder are input to the transformer encoder. For the transformer encoder of layer 𝑛, the input is the output 𝐸_{𝑛−1} ∈ 𝑅^{𝐿,𝐶} of the previous layer. The temporal context transformer encoder has a structure similar to the traditional Transformer encoder, but the difference is that we design a timing code 𝑇 that exploits the characteristics of the image sequence. The time difference between any two frames can be calculated from the endoscope image sequence, and the temporal encoding between different frames can be modeled by normalizing these time differences. When the image sequence length is 𝐿, the sequence encoding 𝑇_𝑠 is an 𝐿 × 𝐿 square matrix:

T = \begin{bmatrix} 0 & |t_0 - t_1| & \cdots & |t_0 - t_L| \\ |t_1 - t_0| & 0 & \cdots & |t_1 - t_L| \\ \vdots & \vdots & \ddots & \vdots \\ |t_L - t_0| & |t_L - t_1| & \cdots & 0 \end{bmatrix}    (1)

T_s = Normal(T)^{-1}    (2)

In self-attention, a query 𝑄 ∈ 𝑅^{𝐿,𝐶}, key 𝐾 ∈ 𝑅^{𝐿,𝐶} and value 𝑉 ∈ 𝑅^{𝐿,𝐶} are generated from 𝐸_{𝑛−1}. The initial self-attention weight 𝐴 ∈ 𝑅^{𝐿,𝐿} = Softmax((𝑄 ∗ 𝑊^𝐸_𝑄) ∗ (𝐾 ∗ 𝑊^𝐸_𝐾)^𝑇 / 𝜏) between the 𝐿 frames is then calculated. Next, the sequence encoding is introduced to obtain the final self-attention weight 𝐴′ ∈ 𝑅^{𝐿,𝐿} = 𝐴 ∗ 𝑇_𝑠. In this way, the temporal relevance in the original self-attention weights is strengthened. The following steps are the same as in a classical transformer[16].
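As a rough illustration of Eqs. (1)–(2) and the temporally re-weighted self-attention, consider the PyTorch sketch below. The exact normalisation of the time-difference matrix and the mapping from time differences to weights are not fully specified above, so both are assumptions here (closer frames receive larger weights).

```python
import torch
import torch.nn.functional as F

def sequence_encoding(timestamps: torch.Tensor) -> torch.Tensor:
    """Eqs. (1)-(2): pairwise absolute time differences, normalised to weights.
    timestamps: (L,) acquisition times of the L frames."""
    T = (timestamps[:, None] - timestamps[None, :]).abs()  # (L, L), zeros on the diagonal
    T = T / (T.max() + 1e-6)                               # assumed max normalisation
    return 1.0 - T                                         # assumption: smaller time gap -> larger weight

def temporal_self_attention(E_prev, W_Q, W_K, W_V, timestamps, tau=1.0):
    """Self-attention over L frame tokens, re-weighted by the sequence encoding T_s."""
    Q, K, V = E_prev @ W_Q, E_prev @ W_K, E_prev @ W_V     # each (L, C)
    A = F.softmax((Q @ K.transpose(0, 1)) / tau, dim=-1)   # initial attention A in R^{L,L}
    A = A * sequence_encoding(timestamps)                  # element-wise re-weighting, A' = A * T_s
    return A @ V                                           # temporally aggregated frame features (L, C)
```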
The transformer decoder is responsible for decoding and reconstructing the features of the transformer encoder. The input of the layer-𝑛 transformer decoder is also of the form 𝑅^{𝐿,𝐶}. Like the transformer encoder, the sequence encoding is added to the transformer decoder to improve the temporal modeling ability of the model. In the transformer decoder, the first step is masked self-attention, which constrains the prediction of the model to follow the order of the image sequence. Different from the classical transformer, we add a cross attention[16] unit at the end of the transformer decoder. The transformer decoder calculates a query 𝑄 ∈ 𝑅^{𝐿,𝐶} and key 𝐾 ∈ 𝑅^{𝐿,𝐶} from the output 𝐸_𝑛 of the transformer encoder of the same layer, and the cross attention weight matrices are computed from this 𝑄 and 𝐾. As shown in Fig. 2, there are two parallel attention modules for feature learning in this part. We expect these two attention modules to learn feature compensation and contraction respectively; therefore their parameters are not shared, and matrix addition and matrix cross product (⊗) are used respectively. The specific operations are as follows:

C′ = Softmax((Q ∗ W^D_{Q1}) ∗ (K ∗ W^D_{K1})^T / τ)    (3)

C′′ = Softmax((Q ∗ W^D_{Q2}) ∗ (K ∗ W^D_{K2})^T / τ)    (4)

D_n = Ins.Norm{ Ins.Norm{ C′ ∗ V ∗ W^D_{V1} + V } + Ins.Norm{ (C′′ ∗ V ∗ W^D_{V2}) ⊗ V } }    (5)

The above process fully fuses the features of each frame, and the temporal context transformer effectively extracts the context information of different frames. The aggregated features are reshaped to their original dimensions before being sent into the decoder.
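A compact sketch of how the two non-shared cross-attention branches of Eqs. (3)–(5) could be realised is given below. It is only an approximation of the description above: using layer normalisation as a stand-in for Ins.Norm and reading ⊗ as an element-wise product are our assumptions.

```python
import torch
import torch.nn.functional as F

def dual_cross_attention(Q, K, V, W_Q1, W_K1, W_V1, W_Q2, W_K2, W_V2, tau=1.0):
    """Sketch of Eqs. (3)-(5): two non-shared cross-attention branches,
    one additive ('compensation') and one multiplicative ('contraction'),
    fused with normalisation. Q, K, V: (L, C) frame tokens; W_*: (C, C) projections."""
    def ins_norm(x):
        # assumption: per-token normalisation stands in for the paper's Ins.Norm
        return F.layer_norm(x, x.shape[-1:])

    C1 = F.softmax((Q @ W_Q1) @ (K @ W_K1).T / tau, dim=-1)   # Eq. (3)
    C2 = F.softmax((Q @ W_Q2) @ (K @ W_K2).T / tau, dim=-1)   # Eq. (4)
    add_branch = ins_norm(C1 @ (V @ W_V1) + V)                # additive compensation branch
    mul_branch = ins_norm((C2 @ (V @ W_V2)) * V)              # element-wise branch (our reading of ⊗)
    return ins_norm(add_branch + mul_branch)                  # Eq. (5): fused decoder output D_n
```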
3. Experimental Results

In this section, we compare the proposed Temporal Context Framework for endoscopy artefact segmentation and detection with state-of-the-art models on the segmentation and detection of endoscopy artefacts.

Figure 3: Example of sequential endoscopy artefact image segmentation and detection results.

Data details and preparation. We mainly used the EndoCV2022 challenge dataset [17] of endoscopic images for endoscopy artefact detection in this work. The dataset includes five categories: nonmucosa, artefact, saturation, specularity and bubbles. EndoCV launched this as an extension of the previous artefact detection and segmentation challenges [21, 22] with a dataset specific to colonoscopy. The dataset contains 24 endoscopic video sequences for the EAD sub-challenge with a total of 1,449 endoscopic images. We split the dataset into 80% of the sequences for training and 20% for validation. For the segmentation task, we used the Dice coefficient, Jaccard coefficient and pixel accuracy (PA) for evaluation. For the detection task, we used mAP at different thresholds.

Implementation details. The deep models are implemented in PyTorch and trained on an NVIDIA Tesla V100 GPU. The segmentation model uses the SGD optimizer with a learning rate of 10^{-4}; the detection model is based on mmdetection and uses SGD with a learning rate of 10^{-2}. The batch size is set to 2, a sliding window of length 𝐿 is used to sample subsequences from the original sequences (a sketch of this sampling is given at the end of this section), and input sequence images are resized to 960×540. Since the inputs are image sequences, the batch size is relatively small. In addition, we used conventional flipping, affine transformation, contrast adjustment and other augmentations on the training set. To demonstrate the effectiveness of the method, we do not use TTA, multi-model fusion or other post-processing, but only a single model for test set prediction.

We first examined the influence of the number 𝑁 of TCT layers on model performance through comparative experiments; the results are shown in Table 1. The model performs best when 𝑁 is 2, and it overfits when 𝑁 is too large.

Table 1
Temporal context transformer layer number comparative experiment.

Model        | 𝑁 | Dice  | Jaccard | PA
UNet         | 0 | 0.525 | 0.402   | 0.872
UNet         | 1 | 0.635 | 0.491   | 0.892
UNet         | 2 | 0.653 | 0.513   | 0.897
UNet         | 3 | 0.607 | 0.469   | 0.895

Model        | 𝑁 | mAP_mean | mAP50 | mAP75
Faster R-CNN | 0 | 0.232    | 0.464 | 0.208
Faster R-CNN | 1 | 0.305    | 0.554 | 0.309
Faster R-CNN | 2 | 0.317    | 0.563 | 0.321
Faster R-CNN | 3 | 0.288    | 0.523 | 0.272

To verify the effectiveness of our method, we perform a comprehensive comparison with state-of-the-art segmentation and detection methods: segmentation methods including UNet, DANet and PraNet, and detection methods including Faster R-CNN, Cascade R-CNN and Swin Transformer, as shown in Table 2.

Table 2
Structural ablation of the temporal context transformer (√ = with TCT).

Model            | TCT | Dice  | Jaccard | PA
UNet             |     | 0.525 | 0.402   | 0.872
UNet             | √   | 0.635 | 0.491   | 0.892
DANet            |     | 0.651 | 0.597   | 0.923
DANet            | √   | 0.773 | 0.660   | 0.944
PraNet           |     | 0.716 | 0.676   | 0.936
PraNet           | √   | 0.815 | 0.721   | 0.961

Model            | TCT | mAP_mean | mAP50 | mAP75
Faster R-CNN     |     | 0.232    | 0.464 | 0.208
Faster R-CNN     | √   | 0.317    | 0.563 | 0.321
Cascade R-CNN    |     | 0.336    | 0.579 | 0.347
Cascade R-CNN    | √   | 0.395    | 0.611 | 0.401
Swin Transformer |     | 0.356    | 0.598 | 0.364
Swin Transformer | √   | 0.403    | 0.613 | 0.421

Specifically, the performance of every state-of-the-art model improves steadily after being converted to our framework. We visualize an example of inference on an endoscope image sequence for a set of models in Fig. 3. The Dice, Jaccard and PA of the segmentation models improve by 9%–12%, 5%–9% and 2%–3% respectively, and the mAP of the detection models improves by 5%–8%. The gains hold across different types of methods, which shows that our method is robust and broadly applicable.
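For reference, the sliding-window subsequence sampling mentioned in the implementation details could be realised as follows; the stride of 1 is an assumption, since the text does not state it.

```python
def sliding_window_indices(num_frames: int, L: int, stride: int = 1):
    """Yield (start, end) indices of length-L subsequences from a video of num_frames frames."""
    for start in range(0, max(num_frames - L + 1, 1), stride):
        yield start, min(start + L, num_frames)

# Example: a 10-frame video split into length-4 training windows
# list(sliding_window_indices(10, 4)) -> [(0, 4), (1, 5), ..., (6, 10)]
```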
References

[1] Q. L. Zhe Guo, Ruiyao Zhang, et al., Global cancer statistics, 2012, CA: A Cancer Journal for Clinicians 65 (2013) 87–108.
[2] P. L. Reiko Nishihara, Kana Wu, et al., Long-term colorectal-cancer incidence and mortality after lower endoscopy (2018).
[3] Y. Mori, S. Kudo, Detecting colorectal polyps via machine learning, Nature Biomedical Engineering 2 (2018) 713–714.
[4] K. T. Teppei Kanayama, Yusuke Kurose, et al., Gastric cancer detection from endoscopic images using synthesis by GAN, MICCAI 2019, Part V, 2019.
[5] S. A. Taghanaki, K. Abhishek, J. P. Cohen, J. Cohen-Adad, G. Hamarneh, Deep semantic segmentation of natural and medical images: a review, Artif. Intell. Rev. 54 (2021) 137–178. doi:10.1007/s10462-020-09854-1.
[6] G. Zamanakos, L. T. Tsochatzidis, A. Amanatiadis, I. Pratikakis, A comprehensive survey of lidar-based 3d object detection methods with deep learning for autonomous driving, Comput. Graph. 99 (2021) 153–181. doi:10.1016/j.cag.2021.07.003.
[7] O. Ronneberger, Invited talk: U-Net convolutional networks for biomedical image segmentation, in: K. H. Maier-Hein, T. M. Deserno, H. Handels, T. Tolxdorff (Eds.), Bildverarbeitung für die Medizin 2017 – Algorithmen – Systeme – Anwendungen, Informatik Aktuell, Springer, 2017, p. 3. doi:10.1007/978-3-662-54345-0_3.
[8] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, J. Liang, UNet++: Redesigning skip connections to exploit multiscale features in image segmentation, IEEE Trans. Medical Imaging 39 (2020) 1856–1867. doi:10.1109/TMI.2019.2959609.
[9] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, K. H. Maier-Hein, nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2020) 203–211. doi:10.1038/s41592-020-01008-z.
[10] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp. 3146–3154. doi:10.1109/CVPR.2019.00326.
[11] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision – ECCV 2018, Part VII, volume 11211 of Lecture Notes in Computer Science, Springer, 2018, pp. 833–851. doi:10.1007/978-3-030-01234-2_49.
[12] D. Fan, G. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, L. Shao, PraNet: Parallel reverse attention network for polyp segmentation, in: A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, L. Joskowicz (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Part VI, volume 12266 of Lecture Notes in Computer Science, Springer, 2020, pp. 263–273. doi:10.1007/978-3-030-59725-2_26.
[13] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 1137–1149. doi:10.1109/TPAMI.2016.2577031.
[14] Z. Cai, N. Vasconcelos, Cascade R-CNN: delving into high quality object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, Computer Vision Foundation / IEEE Computer Society, 2018, pp. 6154–6162. doi:10.1109/CVPR.2018.00644.
[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical vision transformer using shifted windows (2021).
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), NIPS 2017, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[17] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021).
[18] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022).
[19] Q. Wang, L. Zhou, Y. Yao, Y. Wang, J. Li, W. Yang, An interconnected feature pyramid networks for object detection, J. Vis. Commun. Image Represent. 79 (2021) 103260. doi:10.1016/j.jvcir.2021.103260.
[20] N. Bodla, B. Singh, R. Chellappa, L. S. Davis, Soft-NMS – improving object detection with one line of code, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, IEEE Computer Society, 2017, pp. 5562–5570. doi:10.1109/ICCV.2017.593.
[21] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D. Soberanis-Mukul, S. Albarqouni, X. Wang, C. Wang, S. Watanabe, I. Oksuz, Q. Ning, S. Yang, M. A. Khan, X. W. Gao, S. Realdon, M. Loshchenov, J. A. Schnabel, J. E. East, G. Wagnieres, V. B. Loschenov, E. Grisan, C. Daul, W. Blondel, J. Rittscher, An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific Reports 10 (2020). doi:10.1038/s41598-020-59413-5.
[22] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, M. Gridach, I. Voiculescu, V. Yoganand, A. Chavan, A. Raj, N. T. Nguyen, D. Q. Tran, L. D. Huynh, N. Boutry, S. Rezvy, H. Chen, Y. H. Choi, A. Subramanian, V. Balasubramanian, X. W. Gao, H. Hu, Y. Liao, D. Stoyanov, C. Daul, S. Realdon, R. Cannizzaro, D. Lamarque, T. Tran-Nguyen, A. Bailey, B. Braden, J. E. East, J. Rittscher, Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.