Endoscopic computer vision challenges 2.0

Sharib Ali 1,2, Noha Ghatwary 3
1 School of Computing, University of Leeds, Leeds, UK
2 Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, OX3 7DQ, Oxford, UK
3 Computer Engineering Department, Arab Academy for Science and Technology, 1029, Alexandria, Egypt

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, ITC Royal Bengal, Kolkata, India. Email: ali.sharib2002@gmail.com (S. Ali)

Abstract
Accurate detection of artefacts is a core challenge in a wide range of endoscopic applications addressing multiple disease areas. Precise detection of these artefacts is essential for high-quality endoscopic video acquisition, which in turn is crucial for realising reliable computer-assisted endoscopy tools for improved patient care. In particular, colonoscopy requires colon preparation and cleaning to achieve an improved adenoma detection rate. Computer-aided systems can help guide both expert and trainee endoscopists towards consistent, high-quality surveillance and towards detecting, localizing and segmenting the widely known cancer precursor lesions, "polyps". While deep learning has been successfully applied in medical imaging, generalization is still an open problem. The generalizability of deep learning models needs to be clearly defined and tackled to build more reliable technology for clinical translation. Inspired by the enthusiasm of participants in our previous challenges, this year we put forward a 2.0 version of two sub-challenges: Endoscopy Artefact Detection (EAD 2.0) and Polyp Generalization (PolypGen 2.0). Both sub-challenges consist of multi-center, diverse-population datasets with tasks for both detection and segmentation, and focus on assessing the generalizability of algorithms. In this challenge, we aim to add more sequence/video data and multimodality data from different centers. Participants are evaluated on both standard metrics (some already present on the leaderboard) and the generalization metrics presented in our previous challenges. However, unlike in previous challenges, in 2.0 we aim to benchmark methods on a larger test set comprising mostly video sequences, as in the real-world clinical scenario.

Keywords
Artefact, Polyp, Endoscopy, Deep learning, Generalization

1. Introduction

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers (e.g., nasopharyngeal, oesophageal adenocarcinoma, gastric, colorectal and bladder cancers), for therapeutic procedures and for minimally invasive surgery (e.g., laparoscopy). A major drawback of endoscopic video surveillance is that the acquired videos are heavily corrupted with multiple artefacts (e.g., pixel saturation, motion blur, defocus, specular reflections, bubbles, fluid, debris). These artefacts not only make it difficult to visualize the underlying tissue during diagnosis but also affect any post-analysis methods required for follow-up. This is a major problem during colonoscopy, an endoscopic surveillance procedure widely performed to identify colorectal cancer (CRC). CRC is the third most common cause of cancer mortality, with about 1.3 million new cases worldwide [1]. Adenomas and serrated polyps are the main precursors of CRC [2] and can be difficult to detect and remove because of their varying shape, size, appearance and location, and because they are often occluded by artefacts. Computer-aided detection and segmentation methods can therefore help improve colonoscopy procedures.

Even though many methods have been built to tackle automatic detection and segmentation of polyps, benchmarking and development of computer vision methods remains an open problem. This is mostly due to the lack of datasets or challenges that incorporate a highly heterogeneous dataset on which participants can test the generalization abilities of their methods [3]. Polyps are usually protrusions (lumps) occurring singly or in groups; however, they can also take on other appearances, such as sessile or flat polyps, or be hidden behind other protruded mucosal structures [1].

In addition, multiple artefacts can be present during colonoscopy, making the procedure more difficult and cancer precursor lesions such as polyps harder to detect. This challenge therefore aimed at tackling both of these problems using computer vision methods, in particular deep learning, through two sub-challenges: Endoscopy artefact detection (EAD 2.0) and polyp generalization (PolypGen 2.0). The aim of the sub-challenge EAD 2.0 is to localise bounding boxes, predict class labels and provide pixel-wise segmentation of 8 different artefact classes in clinical endoscopy video clips. The 8 classes are specularity, bubbles, saturation, contrast, blood, instrument, blur and imaging artefacts. Similarly, PolypGen 2.0 aims to benchmark detection and segmentation deep learning methods on the basis of their generalization capability to unseen colonoscopy video sequence data. We challenged the computer vision and computational medical imaging communities to participate and build methods that generalize to different clinical settings, which we believe demonstrates the adaptability of the built and trained methods to different population datasets without requiring them to be trained from scratch.
2. Dataset and challenge

Below we detail the datasets and challenge tasks used in each of our sub-challenges.

2.1. Datasets

We have curated a large multi-center dataset for both sub-challenges, acquired with endoscopes from different manufacturers, e.g. Olympus (mostly), Fujifilm and Karl Storz. This heterogeneous collection reflects real clinical practice worldwide and includes standard-definition, HD and ultra-HD video. For the EAD training dataset, please refer to our data published on Mendeley (https://data.mendeley.com/datasets/c7fjbxcgj9/3) and discussed in [4]. A total of 280 patient videos from multiple organs and institutions were used for curating this dataset, leading to over 45,478 annotations on both single-frame and sequence video data. Training data for the detection task consisted of a total of 2531 frames with 31,069 bounding boxes, while 643 frames with 7511 binary masks were provided for the segmentation task (excluding blur, blood and contrast). Sequences were required to mimic the change from large areas of artefacts to frames with small or no artefacts and vice versa, similar to their natural occurrence in endoscopic procedures. A detailed overview is also presented in our EndoCV2020 joint paper [4]. A new set of test data was curated that includes unique video sequences consisting of more than 500 frames, of which 360 were used in the leaderboard test assessment. For the "PolypGen 2.0" training data, we refer to the newly curated dataset described in [3]. The dataset includes both single-frame and sequence data with 3446 annotated polyp labels with precise delineation of polyp boundaries (pixel level for the segmentation task and bounding boxes for the detection task), verified by six senior gastroenterologists, and consists of both small and large polyps, including serrated polyps and adenomas. Expert endoscopists (with 20+ years of experience) were involved in acquiring all the data, which was obtained from routine clinical procedures. To our knowledge, this is the most comprehensive detection and segmentation dataset curated by a team of computational scientists and expert gastroenterologists. In addition to this dataset, we have curated an additional 23 unique patient video clips (> 100 frames per video), making a total of 46 sequences for PolypGen 2.0 and 24 sequences for EAD 2.0. The test phase of this challenge, which will comprise nearly 300-500 frames from multiple centers, is the most comprehensive test set, allowing for a robust generalizability test of algorithms. To make the competition relate to real-world scenarios, we have picked data centers for both sub-challenges from different countries, including Egypt, France, Italy, Norway, Sweden and the UK. The test splits will include a modality split, a population split, an endoscopy model or manufacturer split, and a polyp size split. All data (including the test data) will be released after a prospective joint journal paper; that is, all the data used in the training and testing of the challenge can be used for research and educational purposes.
Below we present the ethics and annotation strategies involved in our data collection and curation.

a) Ethical and privacy aspects of the data: Patient consenting procedures at each individual institution were performed prior to the collection. Additional review of the data collection plan by a local medical ethics committee or an institutional review board was also carried out in some centers [3, 5]. The challenge organisers performed all anonymisation of the video or image frames (including demographic information) prior to including them in any dataset. Future build-up of the new test samples presented here will follow the same ethical procedures.

b) Annotation strategy: First, a small subset of the dataset will be annotated by all clinical experts and a joint consensus will be made available. Then, the remaining subset of the dataset (https://doi.org/10.17632/c7fjbxcgj9.3) was annotated by post-doctoral researchers (working on endoscopy) and validated by clinicians at two different centers (10-fold cross-validation). Finally, all annotation validation will be achieved through a joint conference call. We will use Labelbox (https://labelbox.com) for the annotation process. During the entire procedure we aim to produce an annotation protocol and to document the whole process, which will also be released publicly. A statistical test on annotation variance between experts will also be performed and reported.

2.2. Challenge

Each sub-challenge will consist of two tasks:

1. Detection task: The aim of this task will be to test the performance of participants' methods for detection and localization on our comprehensive and sorted multi-center datasets. Participants will be tested on both a detection-based metric and a localization metric. A weighted final metric will be used to determine the best performing method.

2. Segmentation task: Similar to task 1, each participant's method will be evaluated on multi-center curated and sorted datasets. An ideal segmentation method will provide top performance on all the variabilities in the different splits and on an unseen dataset.

Please note that a generalizability assessment of each method will be conducted for both tasks and the winner will be decided based on this metric (for further details see Section 3). Results should be submitted in the same format as the provided training ground-truth annotations for each task category, as detailed below (a minimal sketch of the detection csv format follows this list):

i Category 1 (artefact detection): csv file of bounding box coordinates corresponding to each class (e.g. label, confidence, x1, y1, x2, y2).
ii Category 2 (semantic segmentation): image label masks, integer valued for each image.
iii Category 3 (generalization): csv file of bounding box coordinates corresponding to each class (e.g. label, confidence, x1, y1, x2, y2).
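To make the csv layout for categories i and iii concrete, the following minimal Python sketch writes one csv file per frame with rows of the form label, confidence, x1, y1, x2, y2. The frame naming, output directory and value formatting are illustrative assumptions and not part of the official submission specification, which follows the provided training ground-truth annotations.

import csv
from pathlib import Path

def write_detection_csv(frame_id, detections, out_dir="submission"):
    """Write one csv file per frame with rows: label, confidence, x1, y1, x2, y2."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the submission folder if needed
    csv_file = out_path / f"{frame_id}.csv"
    with open(csv_file, "w", newline="") as f:
        writer = csv.writer(f)
        for label, conf, x1, y1, x2, y2 in detections:
            # one row per predicted box, matching the layout in categories i and iii
            writer.writerow([label, f"{conf:.4f}", x1, y1, x2, y2])
    return csv_file

# Example usage with made-up artefact predictions for a single (hypothetical) frame:
if __name__ == "__main__":
    dets = [("specularity", 0.91, 120, 64, 188, 130),
            ("bubbles", 0.47, 310, 220, 402, 305)]
    print(write_detection_csv("EAD2_seq01_frame000123", dets))

A segmentation submission (category ii) would instead save one integer-valued label mask image per frame, in the same format as the training masks.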
3. Evaluation metrics and baseline

Detection task. For the detection task we aimed to use widely accepted standard metrics and a generalization metric, as detailed below:

• Standard computer vision metric: mean average precision (mAP, IoU interval [0.25:0.05:0.75]) (see the PASCAL VOC and COCO detection challenges)
• Standard intersection over union (IoU, interval [0.25:0.05:0.75])
• Final detection score (trade-off between mAP and IoU): 0.6*mAP + 0.4*IoU. (This metric has been used in our previous challenges. Using only the standard mAP can lead to very good detection but poor localisation; the proposed penalisation tackles this problem.)
• Generalization gap (Gerror): defined as the difference between the detection score and the generalization score (on unseen data) [6]
• Centroid localisation error (Lerror): defined as the distance between the centroid positions of detected boxes in consecutive frames of a video (new)
• Clinical applicability metrics: runtime (to be used post challenge only)

Segmentation task. For the segmentation task we have taken into account widely used standard metrics and a generalization metric, as detailed below:

• Standard segmentation metrics will be used, including the Dice coefficient (DSC or F1), F2-error, positive predictive value (PPV), Hausdorff distance (HD) and sensitivity (recall)
• The ranking on the leaderboard will be based on the highest mean value of DSC, PPV and sensitivity, together with the lowest HD value
• Generalizability difference (Gerror): the difference between DSC on mixed sample data and DSC on unseen data will be key in deciding the winner of this task
• Clinical applicability metrics: runtime (to be used post challenge only)

Most of the evaluation metrics are already available in our GitHub repositories (EAD: https://github.com/sharibox/EAD2019, PolypGen: https://github.com/sharibox/EndoCV2021-polyp_det_seg_gen). A sketch of how the composite scores above are combined is given below.
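The following minimal Python sketch illustrates how the composite scores described above could be computed from per-method summary numbers. It is not the official evaluation code (that lives in the repositories linked above); the function names, inputs and the assumption that boxes are already paired across consecutive frames are our own simplifications.

import math

def detection_score(mAP, mIoU):
    """Final detection score: weighted trade-off 0.6*mAP + 0.4*IoU."""
    return 0.6 * mAP + 0.4 * mIoU

def generalization_gap(score_mixed, score_unseen):
    """Gerror: difference between the score on mixed (seen-like) data and on unseen data."""
    return score_mixed - score_unseen

def centroid_localisation_error(box_prev, box_curr):
    """Lerror: Euclidean distance between box centroids in consecutive frames.
    Boxes are (x1, y1, x2, y2); pairing boxes across frames is assumed to be done beforehand."""
    cx_p, cy_p = (box_prev[0] + box_prev[2]) / 2.0, (box_prev[1] + box_prev[3]) / 2.0
    cx_c, cy_c = (box_curr[0] + box_curr[2]) / 2.0, (box_curr[1] + box_curr[3]) / 2.0
    return math.hypot(cx_c - cx_p, cy_c - cy_p)

def segmentation_rank_score(dsc, ppv, sensitivity):
    """Leaderboard ranking value for segmentation: mean of DSC, PPV and sensitivity
    (HD is considered separately; lower HD is better)."""
    return (dsc + ppv + sensitivity) / 3.0

# Example with made-up numbers:
if __name__ == "__main__":
    d_mixed = detection_score(mAP=0.42, mIoU=0.55)    # score on mixed-center test data
    d_unseen = detection_score(mAP=0.33, mIoU=0.48)   # score on unseen-center test data
    print("detection score (mixed):", round(d_mixed, 3))
    print("generalization gap Gerror:", round(generalization_gap(d_mixed, d_unseen), 3))
    print("Lerror between frames:", round(centroid_localisation_error((10, 10, 50, 50), (14, 12, 54, 52)), 2))
    print("segmentation ranking value:", round(segmentation_rank_score(0.81, 0.78, 0.84), 3))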
Conclusion tion score (on unseen data) [6] This paper summarises the motivation of challenge, data • Centroid localisation error (Lerror): defined as the collection and preparation, challenge tasks and evalua- distance between centroids positions of detected tion metrics used in EndoCV2022 challenge. However, boxes between the consecutive frames in a video some of the evaluation metrics may have not been in- (new) cluded in the leaderboard but is aimed at being used in • Clinical applicability metrics: runtime (to be used the joint-journal paper for further analysis. post challenge only) Segmentation task For segmentation task we have References taken into account widely used standard metrics and a generalization metric as detailed below: [1] F. Bray, J. Ferlay, I. Soerjomataram, R. Siegel, L. Torre, A. Jemal, Global cancer statistics 2018: GLOBOCAN • Standard segmentation metrics that include Dice estimates of incidence and mortality worldwide for coefficient (DSC or F1), F2-error, positive predic- 36 cancers in 185 countries, CA Cancer J Clin. 68 tive value (PPV), Hausdorff distance (HD) and (2018) 394–424. sensitivity (recall) will be used 4 https://github.com/sharibox/EAD2019 5 https://github.com/sharibox/EndoCV2021-polyp det seg gen [2] F. Loeve, R. Boer, A. G. Zauber, M. Van Ballegooi- jen, G. J. Van Oortmarssen, S. J. Winawer, J. D. F. Habbema, National polyp study data: evidence for regression of adenomas, Int. J. Cancer 111 (2004) 633–639. [3] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., Polypgen: A multi-center polyp detection and segmentation dataset for generalis- ability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463. [4] S. Ali, F. Y. Zhou, C. Daul, B. Braden, A. Bailey, S. Realdon, J. E. East, G. Wagnières, V. Loschenov, E. Grisan, W. Blondel, J. Rittscher, Endoscopy arti- fact detection (ead 2019) challenge dataset, ArXiv abs/1905.03209 (2019). [5] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Po- lat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical image analy- sis 70 (2021) 102002. doi:10.1016/j.media.2021. 102002. [6] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D. Soberanis- Mukul, S. Albarqouni, X. Wang, C. Wang, S. Watan- abe, I. Oksuz, Q. Ning, S. Yang, M. A. Khan, X. W. Gao, S. Realdon, M. Loshchenov, J. A. Schnabel, J. E. East, G. Wagnieres, V. B. Loschenov, E. Grisan, C. Daul, W. Blondel, J. Rittscher, An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific Reports 10 (2020). URL: https://doi.org/10.1038%2Fs41598-020-59413-5. doi:10.1038/s41598-020-59413-5.