Endoscopic computer vision challenges 2.0

Sharib Ali 1,2, Noha Ghatwary 3
1 School of Computing, University of Leeds, Leeds, UK
2 Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, OX3 7DQ, Oxford, UK
3 Computer Engineering Department, Arab Academy for Science and Technology, 1029, Alexandria, Egypt

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, ITC Royal Bengal, Kolkata, India. Email: ali.sharib2002@gmail.com (S. Ali)

Abstract
Accurate detection of artefacts is a core challenge in a wide range of endoscopic applications addressing multiple disease areas. Precise detection of these artefacts is essential for high-quality endoscopic video acquisition, which in turn is crucial for realising reliable computer-assisted endoscopy tools for improved patient care. In particular, colonoscopy requires colon preparation and cleaning to achieve an improved adenoma detection rate. Computer-aided systems can help guide both expert and trainee endoscopists towards consistent, high-quality surveillance and towards detecting, localizing and segmenting the widely known cancer precursor lesions, "polyps". While deep learning has been successfully applied in medical imaging, generalization is still an open problem. The generalizability of deep learning models needs to be clearly defined and tackled to build more reliable technology for clinical translation. Inspired by the enthusiasm of participants in our previous challenges, this year we put forward a 2.0 version of two sub-challenges: Endoscopy Artefact Detection (EAD 2.0) and Polyp Generalization (PolypGen 2.0). Both sub-challenges consist of multi-center, diverse-population datasets with tasks for both detection and segmentation, and focus on assessing the generalizability of algorithms. In this challenge, we aim to add more sequence/video data and multimodality data from different centers. Participants are evaluated on both standard metrics (some already present on the leaderboard) and the generalization metrics presented in our previous challenges. However, unlike in previous challenges, in 2.0 we aim to benchmark methods on a larger test set comprising mostly video sequences, as in the real-world clinical scenario.

Keywords
Artefact, Polyp, Endoscopy, Deep learning, Generalization

1. Introduction

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers (e.g., nasopharyngeal, oesophageal adenocarcinoma, gastric, colorectal and bladder cancers), for therapeutic procedures and for minimally invasive surgery (e.g., laparoscopy). A major drawback of endoscopic video surveillance is that the acquired videos are heavily corrupted with multiple artefacts (e.g., pixel saturation, motion blur, defocus, specular reflections, bubbles, fluid, debris). These artefacts not only make it difficult to visualize the underlying tissue during diagnosis but also affect any post-analysis methods required for follow-up. This is a major problem during colonoscopy, an endoscopic surveillance procedure widely performed to identify colorectal cancer (CRC). CRC is the third most common cause of cancer mortality, with about 1.3 million new cases worldwide [1]. Adenomas and serrated polyps are the main precursors of CRC [2] and can be difficult to detect and remove because of their varying shape, size, appearance and location, and because they are often occluded by artefacts. Computer-aided detection and segmentation methods can therefore help improve colonoscopy procedures.

Even though many methods have been built to tackle automatic detection and segmentation of polyps, benchmarking and development of computer vision methods remains an open problem. This is mostly due to the lack of datasets or challenges that incorporate a highly heterogeneous dataset on which participants can test the generalization abilities of their methods [3]. Polyps are usually protrusions (lumps) occurring singly or in groups; however, they can also take on other appearances, such as sessile or flat polyps, or be hidden behind other protruded mucosal structures [1].

In addition, multiple artefacts can be present during colonoscopy, making the procedure more difficult and cancer precursor lesions such as polyps harder to detect. This challenge therefore aimed at tackling both of these problems using computer vision methods, in particular deep learning, through two sub-challenges: Endoscopy artefact detection (EAD 2.0) and polyp generalization (PolypGen 2.0). The aim of the sub-challenge EAD 2.0 is to localise bounding boxes, predict class labels and provide pixel-wise segmentation of 8 different artefact classes in clinical endoscopy video clips. The 8 classes are specularity, bubbles, saturation, contrast, blood, instrument, blur and imaging artefacts. Similarly, PolypGen 2.0 aims to benchmark detection and segmentation deep learning methods on the basis of their generalization capability to unseen colonoscopy video sequence data. We challenged the computer vision and computational medical imaging communities to participate and build methods that generalize to different clinical settings, which we believe demonstrates the adaptability of the built and trained methods to different population datasets without requiring them to be trained from scratch.
2. Dataset and challenge

Below we detail the datasets and challenge tasks used in each of our sub-challenges.

2.1. Datasets

We have curated a large multi-center dataset for both sub-challenges, acquired with endoscopes from different manufacturers, e.g. Olympus (mostly), Fujifilm and Karl Storz. This heterogeneous collection reflects real clinical practice worldwide and includes standard-definition, HD and ultra-HD video. For the EAD training dataset, please refer to our data published on Mendeley (https://data.mendeley.com/datasets/c7fjbxcgj9/3) and discussed in [4]. A total of 280 patient videos from multiple organs and institutions were used for curating this dataset, leading to over 45,478 annotations on both single-frame and sequence video data. Training data for the detection task consisted of a total of 2531 frames with 31,069 bounding boxes, while 643 frames with 7511 binary masks were provided for the segmentation task (excluding blur, blood and contrast). Sequences were required to mimic the change from large areas of artefacts to frames with small or no artefacts and vice versa, similar to their natural occurrence in endoscopic procedures. A detailed overview is also presented in our EndoCV2020 joint paper [4]. A new set of test data was curated that includes unique video sequences consisting of more than 500 frames, of which 360 were used in the leaderboard test assessment. For the "PolypGen 2.0" training data, we refer to the newly curated dataset described in [3]. The dataset includes both single-frame and sequence data with 3446 annotated polyp labels with precise delineation of polyp boundaries (pixel level for the segmentation task and bounding boxes for the detection task), verified by six senior gastroenterologists, and consists of both small and large polyps, including serrated polyps and adenomas. Expert endoscopists (with 20+ years of experience) were involved in acquiring all the data, which was obtained from routine clinical procedures. To our knowledge, this is the most comprehensive detection and segmentation dataset curated by a team of computational scientists and expert gastroenterologists. In addition to this dataset, we have curated an additional 23 unique patient video clips (> 100 frames per video), making a total of 46 sequences for PolypGen 2.0 and 24 sequences for EAD 2.0. The test phase of this challenge, which will comprise nearly 300-500 frames from multiple centers, is the most comprehensive test set, allowing for a robust generalizability test of algorithms. To make the competition relate to real-world scenarios, we have picked data centers for both sub-challenges from different countries, including Egypt, France, Italy, Norway, Sweden and the UK. The test splits will include a modality split, a population split, an endoscopy model or manufacturer split, and a polyp size split. All data (including the test data) will be released after a prospective joint journal paper; that is, all the data used in the training and testing of the challenge can be used for research and educational purposes.
Below we present the ethics and annotation strategies involved in our data collection and curation.

a) Ethical and privacy aspects of the data: Patient consenting procedures at each individual institution were performed prior to the collection. Additional review of the data collection plan by a local medical ethics committee or an institutional review board was also carried out in some centers [3, 5]. The challenge organisers performed all anonymisation of the video or image frames (including demographic information) prior to including them in any dataset. Future build-up of the new test samples presented here will follow the same ethical procedures.

b) Annotation strategy: First, a small subset of the dataset will be annotated by all clinical experts and a joint consensus will be made available. Then, the remaining subset of the dataset (https://doi.org/10.17632/c7fjbxcgj9.3) was annotated by post-doctoral researchers (working on endoscopy) and validated by clinicians at two different centers (10-fold cross-validation). Finally, all annotation validation will be achieved through a joint conference call. We will use Labelbox (https://labelbox.com) for the annotation process. During the entire procedure we aim to produce an annotation protocol and to document the whole process, which will also be released publicly. A statistical test on annotation variance between experts will also be performed and reported.

2.2. Challenge

Each sub-challenge will consist of two tasks:

1. Detection task: The aim of this task will be to test the performance of participants' methods for detection and localization on our comprehensive and sorted multi-center datasets. Participants will be tested on both a detection-based metric and a localization metric. A weighted final metric will be used to determine the best performing method.

2. Segmentation task: Similar to task 1, each participant's method will be evaluated on multi-center curated and sorted datasets. An ideal segmentation method will provide top performance on all the variabilities in the different splits and on an unseen dataset.

Please note that a generalizability assessment of each method will be conducted for both tasks and the winner will be decided based on this metric (for further details see Section 3). Results should be submitted in the same format as the provided training ground-truth annotations for each task category, as detailed below (a minimal sketch of the detection csv format follows this list):

i Category 1 (artefact detection): csv file of bounding box coordinates corresponding to each class (e.g. label, confidence, x1, y1, x2, y2).
ii Category 2 (semantic segmentation): image label masks, integer valued for each image.
iii Category 3 (generalization): csv file of bounding box coordinates corresponding to each class (e.g. label, confidence, x1, y1, x2, y2).
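To make the csv layout for categories i and iii concrete, the following minimal Python sketch writes one csv file per frame with rows of the form label, confidence, x1, y1, x2, y2. The frame naming, output directory and value formatting are illustrative assumptions and not part of the official submission specification, which follows the provided training ground-truth annotations.

import csv
from pathlib import Path

def write_detection_csv(frame_id, detections, out_dir="submission"):
    """Write one csv file per frame with rows: label, confidence, x1, y1, x2, y2."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the submission folder if needed
    csv_file = out_path / f"{frame_id}.csv"
    with open(csv_file, "w", newline="") as f:
        writer = csv.writer(f)
        for label, conf, x1, y1, x2, y2 in detections:
            # one row per predicted box, matching the layout in categories i and iii
            writer.writerow([label, f"{conf:.4f}", x1, y1, x2, y2])
    return csv_file

# Example usage with made-up artefact predictions for a single (hypothetical) frame:
if __name__ == "__main__":
    dets = [("specularity", 0.91, 120, 64, 188, 130),
            ("bubbles", 0.47, 310, 220, 402, 305)]
    print(write_detection_csv("EAD2_seq01_frame000123", dets))

A segmentation submission (category ii) would instead save one integer-valued label mask image per frame, in the same format as the training masks.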
3. Evaluation metrics and baseline

Detection task. For the detection task we aimed to use widely accepted standard metrics and a generalization metric, as detailed below:

• Standard computer vision metric: mean average precision (mAP, IoU interval [0.25:0.05:0.75]) (see the PASCAL VOC and COCO detection challenges)
• Standard intersection over union (IoU, interval [0.25:0.05:0.75])
• Final detection score (trade-off between mAP and IoU): 0.6*mAP + 0.4*IoU. (This metric has been used in our previous challenges. Using only the standard mAP can lead to very good detection but poor localisation; the proposed penalisation tackles this problem.)
• Generalization gap (Gerror): defined as the difference between the detection score and the generalization score (on unseen data) [6]
• Centroid localisation error (Lerror): defined as the distance between the centroid positions of detected boxes in consecutive frames of a video (new)
• Clinical applicability metrics: runtime (to be used post challenge only)

Segmentation task. For the segmentation task we have taken into account widely used standard metrics and a generalization metric, as detailed below:

• Standard segmentation metrics will be used, including the Dice coefficient (DSC or F1), F2-error, positive predictive value (PPV), Hausdorff distance (HD) and sensitivity (recall)
• The ranking on the leaderboard will be based on the highest mean value of DSC, PPV and sensitivity, together with the lowest HD value
• Generalizability difference (Gerror): the difference between DSC on mixed sample data and DSC on unseen data will be key in deciding the winner of this task
• Clinical applicability metrics: runtime (to be used post challenge only)

Most of the evaluation metrics are already available in our GitHub repositories (EAD: https://github.com/sharibox/EAD2019, PolypGen: https://github.com/sharibox/EndoCV2021-polyp_det_seg_gen). A sketch of how the composite scores above are combined is given below.
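The following minimal Python sketch illustrates how the composite scores described above could be computed from per-method summary numbers. It is not the official evaluation code (that lives in the repositories linked above); the function names, inputs and the assumption that boxes are already paired across consecutive frames are our own simplifications.

import math

def detection_score(mAP, mIoU):
    """Final detection score: weighted trade-off 0.6*mAP + 0.4*IoU."""
    return 0.6 * mAP + 0.4 * mIoU

def generalization_gap(score_mixed, score_unseen):
    """Gerror: difference between the score on mixed (seen-like) data and on unseen data."""
    return score_mixed - score_unseen

def centroid_localisation_error(box_prev, box_curr):
    """Lerror: Euclidean distance between box centroids in consecutive frames.
    Boxes are (x1, y1, x2, y2); pairing boxes across frames is assumed to be done beforehand."""
    cx_p, cy_p = (box_prev[0] + box_prev[2]) / 2.0, (box_prev[1] + box_prev[3]) / 2.0
    cx_c, cy_c = (box_curr[0] + box_curr[2]) / 2.0, (box_curr[1] + box_curr[3]) / 2.0
    return math.hypot(cx_c - cx_p, cy_c - cy_p)

def segmentation_rank_score(dsc, ppv, sensitivity):
    """Leaderboard ranking value for segmentation: mean of DSC, PPV and sensitivity
    (HD is considered separately; lower HD is better)."""
    return (dsc + ppv + sensitivity) / 3.0

# Example with made-up numbers:
if __name__ == "__main__":
    d_mixed = detection_score(mAP=0.42, mIoU=0.55)    # score on mixed-center test data
    d_unseen = detection_score(mAP=0.33, mIoU=0.48)   # score on unseen-center test data
    print("detection score (mixed):", round(d_mixed, 3))
    print("generalization gap Gerror:", round(generalization_gap(d_mixed, d_unseen), 3))
    print("Lerror between frames:", round(centroid_localisation_error((10, 10, 50, 50), (14, 12, 54, 52)), 2))
    print("segmentation ranking value:", round(segmentation_rank_score(0.81, 0.78, 0.84), 3))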
Conclusion tion score (on unseen data) [6] This paper summarises the motivation of challenge, data • Centroid localisation error (Lerror): defined as the collection and preparation, challenge tasks and evalua- distance between centroids positions of detected tion metrics used in EndoCV2022 challenge. However, boxes between the consecutive frames in a video some of the evaluation metrics may have not been in- (new) cluded in the leaderboard but is aimed at being used in • Clinical applicability metrics: runtime (to be used the joint-journal paper for further analysis. post challenge only) Segmentation task For segmentation task we have References taken into account widely used standard metrics and a generalization metric as detailed below: [1] F. Bray, J. Ferlay, I. Soerjomataram, R. Siegel, L. Torre, A. Jemal, Global cancer statistics 2018: GLOBOCAN • Standard segmentation metrics that include Dice estimates of incidence and mortality worldwide for coefficient (DSC or F1), F2-error, positive predic- 36 cancers in 185 countries, CA Cancer J Clin. 68 tive value (PPV), Hausdorff distance (HD) and (2018) 394–424. sensitivity (recall) will be used 4 https://github.com/sharibox/EAD2019 5 https://github.com/sharibox/EndoCV2021-polyp det seg gen [2] F. Loeve, R. Boer, A. G. Zauber, M. Van Ballegooi- jen, G. J. Van Oortmarssen, S. J. Winawer, J. D. F. Habbema, National polyp study data: evidence for regression of adenomas, Int. J. Cancer 111 (2004) 633–639. [3] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., Polypgen: A multi-center polyp detection and segmentation dataset for generalis- ability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463. [4] S. Ali, F. Y. Zhou, C. Daul, B. Braden, A. Bailey, S. Realdon, J. E. East, G. Wagnières, V. Loschenov, E. Grisan, W. Blondel, J. Rittscher, Endoscopy arti- fact detection (ead 2019) challenge dataset, ArXiv abs/1905.03209 (2019). [5] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Po- lat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical image analy- sis 70 (2021) 102002. doi:10.1016/j.media.2021. 102002. [6] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D. Soberanis- Mukul, S. Albarqouni, X. Wang, C. Wang, S. Watan- abe, I. Oksuz, Q. Ning, S. Yang, M. A. Khan, X. W. Gao, S. Realdon, M. Loshchenov, J. A. Schnabel, J. E. East, G. Wagnieres, V. B. Loschenov, E. Grisan, C. Daul, W. Blondel, J. Rittscher, An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific Reports 10 (2020). URL: https://doi.org/10.1038%2Fs41598-020-59413-5. doi:10.1038/s41598-020-59413-5.