<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Learning using temporal information for automatic polyp detection in videos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrian Krenzer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Sodmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nico Hasler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Puppe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence and Knowledge Systems, University of Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gastroenterology department of the University Hospital of Würzburg, University of Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Previous research in the field of endoscopic computer vision has mainly focused on the detection of polyps using single images, but not videos or streams of images. The Endoscopic computer vision challenges 2.0 (EndoCV 2.0) is designed specifically to use streams of image sequences for the detection of polyps. In this paper, we describe our approach, based on Gong et al. [1], which leverages deep convolutional neural networks (CNNs) combined with temporal information to improve upon existing solutions for polyp detection. We demonstrate a detection system that combines similar ROI features across multiple frames with temporal attention to predict the final polyp detections for an emerging frame. For evaluation, we compare our approach to two classical image detection algorithms on a validation set based on training data provided by the challenge. The first one is a Single Shot Detector (SSD) called "YOLOv3", and the second one is a two-step region proposal-based CNN called "Faster R-CNN". To minimize the generalization error, we apply data augmentation and add additional open-source data for our training.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Endoscopy</kwd>
        <kwd>Automation</kwd>
        <kwd>Video object detection</kwd>
        <kwd>Attention</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The second leading cause of cancer-related deaths worldwide is colorectal cancer (CRC) [2]. An excellent method to prevent CRC is to detect pre-cancerous lesions of the disease (colorectal polyps) as early as possible, using a colonoscopy. During a colonoscopy, a long flexible tube is inserted through the rectum into the colon. The end of the tube has a small camera, allowing the physician to examine the colon thoroughly (see https://www.mayoclinic.org/tests-procedures/colonoscopy/about/pac-20393569). Computer science researchers are developing new methods to support physicians with this procedure. Polyp detection using computers is called computer-aided detection (CAD). This process of polyp detection has already been the subject of numerous publications.</p>
      <p>However, these published solutions mostly focus on detection on still images [3]. Therefore, most of the published algorithms do not consider temporal dependencies and compare themselves on benchmarks which do not consider temporal connections. To predict the final polyp detections for an emerging frame, our approach based on Gong et al. [1] utilizes temporal dependencies by combining similar ROI features across successive frames with temporal attention. Nevertheless, there are already some approaches in the literature addressing temporal dependency in polyp detection: In Itoh et al. [4], temporal information is included through a 3D-ResNet, which combines present and future frames for the detection of a new frame. Furthermore, Qadir et al. [5] work with a traditional localization model, such as SSD [6] or Faster R-CNN [7], and post-process the output with an FP Reduction Unit. This approach considers the area of the generated bounding boxes over the 7 preceding and following frames and identifies and adjusts the outliers. The use of future frames causes a small delay; however, the actual calculation of the FP Reduction Unit is fast. A second promising method by Qadir et al. uses a two-step process which aims to decrease the proportion of false predictions: a CNN flags several regions of interest (ROIs) for classification, and the marked ROIs are then compared with the corresponding ROIs of subsequent frames and classified into true positives and false positives. The underlying assumption here is that each frame in a video is similar to its adjacent frames [5]. Xu et al. [8] designed a 2D CNN detector which takes spatiotemporal information into account and uses an LSTM network to improve its polyp detection efficiency while maintaining real-time speed; the model was trained on custom data. In addition, there is another approach which includes the temporal dependencies via post-processing. It uses fast image detection algorithms like YOLO and, afterwards, combines these predictions with an efficient real-time post-processing technique that includes the predictions of polyps detected in past frames for future detections [9]. Taking these ideas forward, we implemented a polyp-detection model using the "ROI-Align Module" of Gong et al. [1]. This allows the neural network to attend to information in previous frames and to combine ROI features from different frames for new predictions.</p>
      <p>[Figure 1: The detection system. The ROI features of frame t and the most similar ROI features from frames t - 1 and t + 1 are combined via temporal attention into a temporal ROI feature, which produces the detection result.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>To train the model, we used two publicly available datasets in addition to the challenge dataset:</p>
      <p>• Kvasir-SEG [10]: 1000 polyp frames are included in the data collection, along with 1071 masks and bounding boxes. The sizes range from 332 × 487 pixels to 1920 × 1072 pixels. Gastroenterologists at Norway's Vestre Viken Health Trust confirmed the annotations. The majority of the frames show basic information on the left side, while others have a black box in the lower-left corner that contains data from ScopeGuide's endoscope position marking probe (Olympus). The data is available in the Kvasir-SEG repository (https://datasets.simula.no/kvasir-seg/).</p>
      <p>• SUN Colonoscopy Video Database [11]: This dataset comprises 49,136 polyp frames from 100 distinct polyps, all of which are thoroughly documented. These frames were taken at Showa University Northern Yokohama and annotated by Showa University's specialist endoscopists. There are also 109,554 non-polyp frames present. The frames have a resolution of 1240 × 1080 pixels. The data is available in the SUN Colonoscopy Video repository (http://sundatabase.org/).</p>
      <p>• PolypGen2.0 (Polyp Generalization) [12, 13, 14]: This dataset is one of the two sets from the challenge and an extended version of the datasets from the 2020 and 2021 challenges. Both subchallenges provide multi-center and diverse population datasets with tasks for both detection and segmentation, but the emphasis is on evaluating algorithm generalizability. The goal was to incorporate additional sequence/video data as well as multimodal data from various sites. PolypGen2.0 consists of 46 sequences with a total of 3290 images. All frames have a resolution of 1920 × 1080 pixels.</p>
      <p>We split the PolypGen2.0 dataset into training and validation. For this purpose, 20 random sequences were assigned to validation (1366 images) and the rest to training (1924 images). The resulting validation set was used for all training steps.</p>
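      <p>A minimal sketch of this sequence-level split (our illustration, not the authors' code): whole sequences, rather than single frames, are assigned to validation so that near-duplicate neighbouring frames do not leak between the two sets. The seed and data layout are assumptions.</p>
      <preformat>
import random

def split_sequences(sequences, num_val=20, seed=0):
    """sequences: mapping sequence_id -> list of frame paths.
    Assigns num_val whole sequences to validation, the rest to training."""
    ids = sorted(sequences)
    random.Random(seed).shuffle(ids)
    val_ids = set(ids[:num_val])
    train = [f for s in ids if s not in val_ids for f in sequences[s]]
    val = [f for s in val_ids for f in sequences[s]]
    return train, val
</preformat>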
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this section, we illustrate our approaches for the EndoCV2022 challenge, depicted in figure 1. All our models are trained on an NVIDIA QUADRO RTX 8000. After exploring the data, we decided to choose an algorithm which includes temporal information for the challenge, since the test data provided includes entire videos rather than just images. The model is based on Gong et al. [1] and will be explained in the following.</p>
      <p>[Figure 2: The Temporal ROI Align process. A similarity map is computed between the ROI features of the target frame and the feature maps of the support frames; temporal attention then selects and aggregates the most similar ROI features.]</p>
      <p>Most state-of-the-art single-frame object detectors use the paradigm of region-based detection. When these detectors are used directly for video object detection (VID), object appearances in videos such as motion blur, video defocus, and object occlusions can degrade detection accuracy. These are frequent problems in endoscopy videos, which make the detection of polyps more difficult. Therefore, the main challenge is to design a method that can utilize the temporal redundancy of the information efficiently for the same object instance in a sequence of images or videos. To extract ROI features, most region-based detectors use ROI Align. However, ROI Align only uses the current frame feature map to extract features for current frame proposals, resulting in ROI features that lack the temporal information of the same object instance in the video. Using feature maps of other frames to perform ROI Align for the current frame proposals is a straightforward and clear technique for using temporal information. However, since the exact placement of the current frame proposals in other frame feature maps is unknown, this basic solution is ineffective.</p>
      <p>Temporal ROI Align, on the other hand, defines a target frame as a frame in which the final prediction is made in real-time. In figure 2, the temporal ROI align process is illustrated. Temporal ROI align also allows the target frame to have multiple support frames, which are used to refine the features of the target frame. To achieve this refinement, the proposed operator selects the most comparable ROI features from the feature maps of the available support frames. The temporally redundant information of the same object instance in a video is contained in the extracted most comparable ROI characteristics. The main target now is to effectively capture diverse ROI features. Averaging is inefficient, because a polyp may seem blurry in some frames and clear in others. It is self-evident that the ROI characteristics of clear object instances should take precedence over the features of blurry instances in the aggregate. To aggregate the ROI characteristics and the most comparable ROI features, multi-temporal attention blocks are used to perform the temporal feature aggregation. A major advantage of Temporal ROI Align is that it can extract the object features from support frames even when a polyp is partially occluded in the target frame. Therefore, the visible parts are dominant, and features at these locations can still get enhanced.</p>
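      <p>A minimal sketch of this idea (not the authors' implementation): for each ROI feature of the target frame, the most similar feature vectors in the support-frame feature maps are selected via a similarity map and fused with attention weights. The top-k selection and the residual fusion at the end are our assumptions.</p>
      <preformat>
import torch
import torch.nn.functional as F

def temporal_roi_align(roi_feats, support_maps, top_k=4):
    """roi_feats: (N, C) ROI features of the target frame.
    support_maps: (T, C, H, W) feature maps of T support frames."""
    T, C, H, W = support_maps.shape
    # Flatten the support maps into (T*H*W, C) candidate feature vectors.
    candidates = support_maps.permute(0, 2, 3, 1).reshape(-1, C)
    # Similarity map between target ROI features and all candidate locations.
    sim = F.normalize(roi_feats, dim=1) @ F.normalize(candidates, dim=1).T
    # Select the most similar support features for each ROI.
    scores, idx = sim.topk(top_k, dim=1)        # (N, top_k)
    selected = candidates[idx]                  # (N, top_k, C)
    # Temporal attention: clear (highly similar) instances outweigh blurry ones.
    attn = scores.softmax(dim=1).unsqueeze(-1)  # (N, top_k, 1)
    aggregated = (attn * selected).sum(dim=1)   # (N, C)
    return roi_feats + aggregated               # fuse with the target features
</preformat>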
      <p>For our approach, the neural network is trained for 10 epochs on our full dataset and then finetuned for 3 epochs on the challenge dataset. We choose the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0001. Additionally, we use a linear training warm-up schedule for 1 epoch.</p>
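      <p>In PyTorch terms, these settings amount to the following sketch; the detector itself is omitted ("model" is a placeholder), and the number of warm-up iterations per epoch is an assumption.</p>
      <preformat>
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # placeholder for the actual detector network
steps_per_epoch = 1000      # assumption: warm-up iterations in one epoch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)
# Linear warm-up over the first epoch, constant learning rate afterwards.
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.001,
                                              total_iters=steps_per_epoch)
</preformat>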
      <p>To enhance the generalization capabilities of our model, we use the following augmentation schema: we applied a probability of 0.3 for upward and downward flips and a vertical flipping probability of 0.5. In addition, we rescaled the image with a probability of 0.64. We also use a translation along the horizontal axis with a probability of 0.5.</p>
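      <p>The paper states only the probabilities, so the following sketch with albumentations (which transforms the bounding boxes together with the frames) fills in assumed magnitudes and one possible reading of the flip directions.</p>
      <preformat>
import albumentations as A

augment = A.Compose(
    [
        A.VerticalFlip(p=0.3),    # "upward and downward" flips (our reading)
        A.HorizontalFlip(p=0.5),  # flip about the vertical axis (our reading)
        A.RandomScale(scale_limit=0.2, p=0.64),  # rescaling; magnitude assumed
        A.Affine(translate_percent={"x": (-0.1, 0.1)}, p=0.5),  # horizontal shift
    ],
    # Boxes must follow the image transforms.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
# Usage: out = augment(image=frame, bboxes=boxes, labels=class_ids)
</preformat>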
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we describe our results of the EndoCV2022 challenge. We highlight the performance of our approach and compare it to two classic benchmarking algorithms. One is an SSD algorithm called YOLOv3 [15], and the other is the ROI proposal algorithm called Faster R-CNN [16]. We trained both algorithms on the same data. For the validation, we create a validation set consisting of 20 sequences randomly chosen from the provided data (no additional data is included). We test the detection on the created validation set. To enable the comparison of our results with the other participants of the challenge, we also declare our final scores: a score(mAP) of 13.12 % and a score(mAP50) of 27.05 % are our final detection scores on the second round of the challenge evaluation. Table 1 shows our results on our created validation set for the detection task, where YOLOv3 is a benchmark SSD algorithm and Faster R-CNN is the Faster R-CNN algorithm with a ResNet-101 backbone. For the evaluation, we report the F1-score, which describes the harmonic mean of precision and recall as shown in the following equations:</p>
        <p>rithm with ResNet-101 backbone. For the evaluation, we
report the F1-score. The F1-score describes the harmonic
mean of precision and recall as shown in the following
equations:</p>
        <p>Precision =</p>
        <p>+</p>
        <p>Recall =</p>
        <p>+  
1 =
2 * Precision * Recall</p>
        <p>Precision + Recall
=</p>
        <p>2 *  
2 *   +   +</p>
        <p>Our approach</p>
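      <p>As a small worked check of these formulas (values from Table 1): for YOLOv3, a precision of 32.2 % and a recall of 30.1 % give F1 = 2 * 32.2 * 30.1 / (32.2 + 30.1), which is approximately 31.1 %. The same computation as a sketch:</p>
      <preformat>
def detection_metrics(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from box-level TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
</preformat>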
      <p>We count an annotation as true positive (TP) if the boxes of our prediction and the boxes from the ground truth overlap by at least 50 %. Additionally, we display the mean average precision (mAP) and the mAP50 with a minimum IoU of 0.5 [17].</p>
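      <p>The 50 % overlap criterion is the usual intersection over union (IoU); a minimal sketch for boxes in (x1, y1, x2, y2) format:</p>
      <preformat>
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
</preformat>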
      <p>The mAP is calculated as the area under the precision-recall curve. Thereby, all predicted boxes are first ranked by their confidence value given by the polyp detection system. Then we computed precision and recall for different thresholds of these confidence values. When reducing the confidence threshold, recall increases and precision decreases. This results in a precision-recall curve. Finally, for this precision-recall curve, the area under the curve is measured. This results in the mAP.</p>
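      <p>A sketch of this procedure (trapezoidal integration here; the COCO-style mAP [17] samples an interpolated curve instead):</p>
      <preformat>
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """confidences: scores of all predicted boxes; is_tp: 1 if a box matches
    a ground-truth polyp with IoU at least 0.5, else 0; num_gt: ground-truth count."""
    order = np.argsort(-np.asarray(confidences))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)  # precision at each threshold
    recall = cum_tp / num_gt                # recall at each threshold
    return np.trapz(precision, recall)      # area under the PR curve
</preformat>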
      <p>Table 1: Detection results on our validation set. Speed is given in frames per second (FPS).</p>
      <preformat>
             YOLOv3   Faster-RCNN   Our approach
mAP          13.8     14.2          18.8
mAP50        27.5     28.9          32.8
Precision    32.2     34.5          32.4
Recall       30.1     32.4          39.6
F1           31.1     33.4          35.6
Speed (FPS)  44       15            24
</preformat>
      <p>Table 1 shows that our approach is outperforming classical benchmarks on our validation data; this is mostly due to the temporal dependencies included in our algorithm, which are not included in the Faster R-CNN approach. Notably, SSD algorithms like YOLOv3 are still 20 FPS faster than our approach in detecting single images. Nevertheless, our approach yields a huge recall increase of 9.5 % compared to the fast YOLOv3. We especially emphasize this, as recall is one of the most important metrics in real clinical use: it is more important to find a missing polyp than to avoid additional false positive detections. Figure 3 shows a sequence of detection results with our algorithm on the test dataset provided by the challenge. Furthermore, figure 4 shows a qualitative comparison of the three detection algorithms. We can see that all algorithms are detecting the polyp. Nevertheless, YOLOv3 and Faster R-CNN are distracted by light reflections and therefore also draw wrong detections. Through temporal ROI align, our approach can incorporate the detections from previous frames and therefore does not get distracted by the light reflections.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we would like to discuss two main points: first, the limitations of our approach, and second, how to use our approach in clinically useful settings. The first limitation is the current speed of our system. With an inference performance of 24 FPS, the algorithm is not capable of detecting every image with an endoscopy processor processing at 30 FPS. This can be mitigated by pruning and quantization-aware retraining, which, on the other hand, reduces the accuracy of the algorithm. Additionally, a lot of benchmarking scores on still polyp images in the literature already exceed an 80 % F1 score [18, 19]. Nevertheless, those are not directly comparable with our evaluation, as they use different data sets and do not include sequences of images.</p>
      <p>The second and most drastic issue is that the system in its current form only works with video data and not a real-time stream of videos, due to the dependencies in the algorithm, which include preceding and future frames in the prediction. This issue may be solved by changing the algorithm to only use the preceding frames. In its current form, the algorithm can be used to evaluate endoscopies after they are completed or to detect polyps with wireless capsule endoscopy (WCE).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Overall, we demonstrate our approach to the Endoscopic computer vision challenges 2.0. We show a detection system that combines similar ROI features across frames with temporal attention to create the final polyp detections for a newly emerging frame. The system thereby uses present, past, and future features on the temporal axis to create new polyp localizations. We show that the system exceeds classical benchmark algorithms based on individual frames on our validation data from the challenge.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Compliance with ethical standards</title>
      <p>This research study was conducted retrospectively using human subject data made available in open access [10, 11, 12, 13, 14]. Ethical approval was not required, as confirmed by the license attached to the open access data.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>This research is supported using public funding from the Interdisziplinäres Zentrum für Klinische Forschung (IZKF) of the University of Würzburg.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] T. Gong, K. Chen, X. Wang, Q. Chu, F. Zhu, D. Lin, N. Yu, H. Feng, Temporal ROI align for video object recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 1442–1450.</p>
      <p>[2] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, A. Jemal, Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: A Cancer Journal for Clinicians 68 (2018) 394–424. doi:10.3322/caac.21492.</p>
      <p>[3] A. Krenzer, A. Hekalo, F. Puppe, Endoscopic detection and segmentation of gastroenterological diseases with deep convolutional neural networks, in: EndoCV@ISBI, 2020, pp. 58–63.</p>
      <p>[4] H. Itoh, H. Roth, M. Oda, M. Misawa, Y. Mori, S.-E. Kudo, K. Mori, Stable polyp-scene classification via subsampling and residual learning from an imbalanced large dataset, Healthcare Technology Letters 6 (2019) 237–242. doi:10.1049/htl.2019.0079.</p>
      <p>[5] H. A. Qadir, I. Balasingham, J. Solhusvik, J. Bergsland, L. Aabakken, Y. Shin, Improving automatic polyp detection using CNN by exploiting temporal dependency in colonoscopy video, IEEE Journal of Biomedical and Health Informatics 24 (2020) 180–193. doi:10.1109/jbhi.2019.2907434.</p>
      <p>[6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, arXiv preprint arXiv:1512.02325 (2016).</p>
      <p>[7] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 1137–1149. doi:10.1109/tpami.2016.2577031.</p>
      <p>[8] X. Liu, X. Guo, Y. Liu, Y. Yuan, Consolidated domain adaptive detection and localization framework for cross-device colonoscopic images, Medical Image Analysis 71 (2021) 102052.</p>
      <p>[9] A. Krenzer, M. Banck, K. Makowski, A. Hekalo, D. Fitting, J. Troya, B. Sudarevic, W. G. Zoller, A. Hann, F. Puppe, A real-time polyp detection system with clinical application in colonoscopy using deep convolutional neural networks (2022).</p>
      <p>[10] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, H. D. Johansen, Kvasir-SEG: A segmented polyp dataset, in: International Conference on Multimedia Modeling, Springer, 2020, pp. 451–462.</p>
      <p>[11] M. Misawa, S.-e. Kudo, Y. Mori, K. Hotta, K. Ohtsuka, T. Matsuda, S. Saito, T. Kudo, T. Baba, F. Ishida, et al., Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video), Gastrointestinal Endoscopy 93 (2021) 960–967.</p>
      <p>[12] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.</p>
      <p>[13] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. M., et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.</p>
      <p>[14] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.</p>
      <p>[15] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).</p>
      <p>[16] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).</p>
      <p>[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.</p>
      <p>[18] D. Wang, N. Zhang, X. Sun, P. Zhang, C. Zhang, Y. Cao, B. Liu, AFP-Net: Realtime anchor-free polyp detection in colonoscopy, in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2019, pp. 636–643.</p>
      <p>[19] X. Mo, K. Tao, Q. Wang, G. Wang, An efficient approach for polyps detection in endoscopic videos based on faster R-CNN, in: 2018 24th International Conference on Pattern Recognition (ICPR), IEEE, 2018, pp. 3929–3934.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>