<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Learning using temporal information for automatic polyp detection in videos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrian Krenzer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Sodmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nico Hasler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Puppe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence and Knowledge Systems, University of Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gastroenterology department of the University Hospital of Würzburg, University of Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Previous research in the field of endoscopic computer vision has mainly focused on the detection of polyps using single images, but not videos or streams of images. The Endoscopic computer vision challenges 2.0 (EndoCV 2.0) is designed specifically to use streams of image sequences for the detection of polyps. In this paper, we describe our approach, based on Gong et al. [1], which leverages deep convolutional neural networks (CNNs) combined with temporal information to improve upon existing solutions for polyp detection. We demonstrate a detection system that combines similar ROI features across multiple frames with temporal attention to predict the final polyp detections for an emerging frame. For evaluation, we compare our approach to two classical image detection algorithms on a validation set based on training data provided by the challenge. The first one is a Single Shot Detector (SSD) called "YOLOv3", and the second one is a two-step region proposal-based CNN called "Faster R-CNN". To minimize the generalization error, we apply data augmentation and add additional open-source data for our training.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Endoscopy</kwd>
        <kwd>Automation</kwd>
        <kwd>Video object detection</kwd>
        <kwd>Attention</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The second leading cause of cancer-related deaths worldwide is colorectal cancer (CRC) [2]. An excellent method to prevent CRC is to detect pre-cancerous lesions of the disease (colorectal polyps) as early as possible, using a colonoscopy. During a colonoscopy, a long flexible tube is inserted through the rectum into the colon. The end of the tube has a small camera, allowing the physician to examine the colon thoroughly (see https://www.mayoclinic.org/tests-procedures/colonoscopy/about/pac-20393569). Computer science researchers are developing new methods to support physicians with this procedure. Polyp detection using computers is called computer-aided detection (CAD). This process of polyp detection has already been the subject of numerous publications.</p>
      <p>However, these published solutions mostly focus on detection on still images [3]. Therefore, most of the published algorithms do not consider temporal dependencies and compare themselves on benchmarks which do not consider temporal connections. To predict the final polyp detections for an emerging frame, our approach based on Gong et al. [1] utilizes temporal dependencies by combining similar ROI features across successive frames with temporal attention. Nevertheless, there are already some approaches in the literature addressing temporal dependency in polyp detection: In Itoh et al. [4], temporal information is included through a 3D-ResNet, which combines present and future frames for the detection of a new frame. Furthermore, Qadir et al. [5] work with a traditional localization model, such as SSD [6] or Faster R-CNN [7], and post-process the output with an FP Reduction Unit. This approach considers the area of the generated bounding boxes over the 7 preceding and following frames and identifies and adjusts the outliers. The use of future frames causes a small delay; however, the actual calculation of the FP Reduction Unit is fast. A second promising method by Qadir et al. uses a two-step process which aims to decrease the proportion of false predictions: a CNN flags several regions of interest (ROIs) for classification, and the marked ROIs are then compared with the corresponding ROIs of subsequent frames and classified into true positives and false positives. The underlying assumption here is that each frame in a video is similar to its adjacent frames [5]. Xu et al. [8] designed a 2D CNN detector which takes spatiotemporal information into account and uses an LSTM network to improve its polyp detection efficiency while maintaining real-time speed; the model was trained on custom data. In addition, there is another approach which includes the temporal dependencies via post-processing. It uses fast image detection algorithms like YOLO and, afterwards, combines these predictions with an efficient real-time post-processing technique that includes the predictions of polyps detected in past frames for future detections [9]. Taking these ideas forward, we implemented a polyp-detection model using the "ROI-Align Module" of Gong et al. [1]. This allows the neural network to attend to information in previous frames and to combine ROI features from different frames for new predictions.</p>
      <p>[Figure 1: The detection system. The ROI features of frame t and the most similar ROI features from frames t - 1 and t + 1 are combined via temporal attention into a temporal ROI feature, which produces the detection result.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>To train the model, we used two publicly available datasets in addition to the challenge dataset:</p>
      <p>• Kvasir-SEG [10]: 1000 polyp frames are included in the data collection, along with 1071 masks and bounding boxes. The sizes range from 332 × 487 pixels to 1920 × 1072 pixels. Gastroenterologists at Norway's Vestre Viken Health Trust confirmed the annotations. The majority of the frames show basic information on the left side, while others have a black box in the lower-left corner that contains data from ScopeGuide's endoscope position marking probe (Olympus). The data is available in the Kvasir-SEG repository (https://datasets.simula.no/kvasir-seg/).</p>
      <p>• SUN Colonoscopy Video Database [11]: This dataset comprises 49,136 polyp frames from 100 distinct polyps, all of which are thoroughly documented. These frames were taken at Showa University Northern Yokohama and annotated by Showa University's specialist endoscopists. There are also 109,554 non-polyp frames present. The frames have a resolution of 1240 × 1080 pixels. The data is available in the SUN Colonoscopy Video repository (http://sundatabase.org/).</p>
      <p>• PolypGen2.0 (Polyp Generalization) [12, 13, 14]: This dataset is one of the two sets from the challenge and an extended version of the datasets from the 2020 and 2021 challenges. Both subchallenges provide multi-center and diverse population datasets with tasks for both detection and segmentation, but the emphasis is on evaluating algorithm generalizability. The goal was to incorporate additional sequence/video data as well as multimodal data from various sites. PolypGen2.0 consists of 46 sequences with a total of 3290 images. All frames have a resolution of 1920 × 1080 pixels.</p>
      <p>We split the PolypGen2.0 dataset into training and validation. For this purpose, 20 random sequences were assigned to validation (1366 images) and the rest to training (1924 images). The resulting validation set was used for all training steps.</p>
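      <p>A minimal sketch of this sequence-level split (our illustration, not the authors' code): whole sequences, rather than single frames, are assigned to validation so that near-duplicate neighbouring frames do not leak between the two sets. The seed and data layout are assumptions.</p>
      <preformat>
import random

def split_sequences(sequences, num_val=20, seed=0):
    """sequences: mapping sequence_id -> list of frame paths.
    Assigns num_val whole sequences to validation, the rest to training."""
    ids = sorted(sequences)
    random.Random(seed).shuffle(ids)
    val_ids = set(ids[:num_val])
    train = [f for s in ids if s not in val_ids for f in sequences[s]]
    val = [f for s in val_ids for f in sequences[s]]
    return train, val
</preformat>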
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this section, we illustrate our approaches for the EndoCV2022 challenge, depicted in figure 1. All our models are trained on an NVIDIA QUADRO RTX 8000. After exploring the data, we decided to choose an algorithm which includes temporal information for the challenge, since the test data provided includes entire videos rather than just images. The model is based on Gong et al. [1] and will be explained in the following.</p>
      <p>[Figure 2: The Temporal ROI Align process. A similarity map is computed between the ROI features of the target frame and the feature maps of the support frames; temporal attention then selects and aggregates the most similar ROI features.]</p>
      <p>Most state-of-the-art single-frame object detectors use the paradigm of region-based detection. When these detectors are used directly for video object detection (VID), object appearances in videos such as motion blur, video defocus, and object occlusions can degrade detection accuracy. These are frequent problems in endoscopy videos, which make the detection of polyps more difficult. Therefore, the main challenge is to design a method that can utilize the temporal redundancy of the information efficiently for the same object instance in a sequence of images or videos. To extract ROI features, most region-based detectors use ROI Align. However, ROI Align only uses the current frame feature map to extract features for current frame proposals, resulting in ROI features that lack the temporal information of the same object instance in the video. Using feature maps of other frames to perform ROI Align for the current frame proposals is a straightforward and clear technique for using temporal information. However, since the exact placement of the current frame proposals in other frame feature maps is unknown, this basic solution is ineffective.</p>
      <p>Temporal ROI Align, on the other hand, defines a target frame as a frame in which the final prediction is made in real-time. In figure 2, the temporal ROI align process is illustrated. Temporal ROI align also allows the target frame to have multiple support frames, which are used to refine the features of the target frame. To achieve this refinement, the proposed operator selects the most comparable ROI features from the feature maps of the available support frames. The temporally redundant information of the same object instance in a video is contained in the extracted most comparable ROI characteristics. The main target now is to effectively capture diverse ROI features. Averaging is inefficient, because a polyp may seem blurry in some frames and clear in others. It is self-evident that the ROI characteristics of clear object instances should take precedence over the features of blurry instances in the aggregate. To aggregate the ROI characteristics and the most comparable ROI features, multi-temporal attention blocks are used to perform the temporal feature aggregation. A major advantage of Temporal ROI Align is that it can extract the object features from support frames even when a polyp is partially occluded in the target frame. Therefore, the visible parts are dominant, and features at these locations can still get enhanced.</p>
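      <p>A minimal sketch of this idea (not the authors' implementation): for each ROI feature of the target frame, the most similar feature vectors in the support-frame feature maps are selected via a similarity map and fused with attention weights. The top-k selection and the residual fusion at the end are our assumptions.</p>
      <preformat>
import torch
import torch.nn.functional as F

def temporal_roi_align(roi_feats, support_maps, top_k=4):
    """roi_feats: (N, C) ROI features of the target frame.
    support_maps: (T, C, H, W) feature maps of T support frames."""
    T, C, H, W = support_maps.shape
    # Flatten the support maps into (T*H*W, C) candidate feature vectors.
    candidates = support_maps.permute(0, 2, 3, 1).reshape(-1, C)
    # Similarity map between target ROI features and all candidate locations.
    sim = F.normalize(roi_feats, dim=1) @ F.normalize(candidates, dim=1).T
    # Select the most similar support features for each ROI.
    scores, idx = sim.topk(top_k, dim=1)        # (N, top_k)
    selected = candidates[idx]                  # (N, top_k, C)
    # Temporal attention: clear (highly similar) instances outweigh blurry ones.
    attn = scores.softmax(dim=1).unsqueeze(-1)  # (N, top_k, 1)
    aggregated = (attn * selected).sum(dim=1)   # (N, C)
    return roi_feats + aggregated               # fuse with the target features
</preformat>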
      <p>For our approach, the neural network is trained for 10 epochs on our full dataset and then finetuned for 3 epochs on the challenge dataset. We choose the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0001. Additionally, we use a linear training warm-up schedule for 1 epoch.</p>
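      <p>In PyTorch terms, these settings amount to the following sketch; the detector itself is omitted ("model" is a placeholder), and the number of warm-up iterations per epoch is an assumption.</p>
      <preformat>
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # placeholder for the actual detector network
steps_per_epoch = 1000      # assumption: warm-up iterations in one epoch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)
# Linear warm-up over the first epoch, constant learning rate afterwards.
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.001,
                                              total_iters=steps_per_epoch)
</preformat>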
      <p>To enhance the generalization capabilities of our model, we use the following augmentation schema: we applied a probability of 0.3 for upward and downward flips and a vertical flipping probability of 0.5. In addition, we rescaled the image with a probability of 0.64. We also use a translation along the horizontal axis with a probability of 0.5.</p>
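      <p>The paper states only the probabilities, so the following sketch with albumentations (which transforms the bounding boxes together with the frames) fills in assumed magnitudes and one possible reading of the flip directions.</p>
      <preformat>
import albumentations as A

augment = A.Compose(
    [
        A.VerticalFlip(p=0.3),    # "upward and downward" flips (our reading)
        A.HorizontalFlip(p=0.5),  # flip about the vertical axis (our reading)
        A.RandomScale(scale_limit=0.2, p=0.64),  # rescaling; magnitude assumed
        A.Affine(translate_percent={"x": (-0.1, 0.1)}, p=0.5),  # horizontal shift
    ],
    # Boxes must follow the image transforms.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
# Usage: out = augment(image=frame, bboxes=boxes, labels=class_ids)
</preformat>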
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we describe our results of the EndoCV2022 challenge. We highlight the performance of our approach and compare it to two classic benchmarking algorithms. One is an SSD algorithm called YOLOv3 [15], and the other is the ROI proposal algorithm called Faster R-CNN [16]. We trained both algorithms on the same data. For the validation, we create a validation set consisting of 20 sequences randomly chosen from the provided data (no additional data is included). We test the detection on the created validation set. To enable the comparison of our results with the other participants of the challenge, we also declare our final scores: a score(mAP) of 13.12 % and a score(mAP50) of 27.05 % are our final detection scores on the second round of the challenge evaluation. Table 1 shows our results on our created validation set for the detection task, where YOLOv3 is a benchmark SSD algorithm and Faster R-CNN is the Faster R-CNN algorithm with a ResNet-101 backbone. For the evaluation, we report the F1-score, which describes the harmonic mean of precision and recall as shown in the following equations:</p>
        <p>rithm with ResNet-101 backbone. For the evaluation, we
report the F1-score. The F1-score describes the harmonic
mean of precision and recall as shown in the following
equations:</p>
        <p>Precision =</p>
        <p>+</p>
        <p>Recall =</p>
        <p>+  
1 =
2 * Precision * Recall</p>
        <p>Precision + Recall
=</p>
        <p>2 *  
2 *   +   +</p>
        <p>Our approach</p>
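      <p>As a small worked check of these formulas (values from Table 1): for YOLOv3, a precision of 32.2 % and a recall of 30.1 % give F1 = 2 * 32.2 * 30.1 / (32.2 + 30.1), which is approximately 31.1 %. The same computation as a sketch:</p>
      <preformat>
def detection_metrics(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from box-level TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
</preformat>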
      <p>We count an annotation as true positive (TP) if the boxes of our prediction and the boxes from the ground truth overlap by at least 50 %. Additionally, we display the mean average precision (mAP) and the mAP50 with a minimum IoU of 0.5 [17].</p>
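      <p>The 50 % overlap criterion is the usual intersection over union (IoU); a minimal sketch for boxes in (x1, y1, x2, y2) format:</p>
      <preformat>
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
</preformat>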
      <p>The mAP is calculated as the area under the precision-recall curve. Thereby, all predicted boxes are first ranked by their confidence value given by the polyp detection system. Then we computed precision and recall for different thresholds of these confidence values. When reducing the confidence threshold, recall increases and precision decreases. This results in a precision-recall curve. Finally, for this precision-recall curve, the area under the curve is measured. This results in the mAP.</p>
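      <p>A sketch of this procedure (trapezoidal integration here; the COCO-style mAP [17] samples an interpolated curve instead):</p>
      <preformat>
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """confidences: scores of all predicted boxes; is_tp: 1 if a box matches
    a ground-truth polyp with IoU at least 0.5, else 0; num_gt: ground-truth count."""
    order = np.argsort(-np.asarray(confidences))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)  # precision at each threshold
    recall = cum_tp / num_gt                # recall at each threshold
    return np.trapz(precision, recall)      # area under the PR curve
</preformat>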
      <p>Table 1: Detection results on our validation set. Speed is given in frames per second (FPS).</p>
      <preformat>
             YOLOv3   Faster-RCNN   Our approach
mAP          13.8     14.2          18.8
mAP50        27.5     28.9          32.8
Precision    32.2     34.5          32.4
Recall       30.1     32.4          39.6
F1           31.1     33.4          35.6
Speed (FPS)  44       15            24
</preformat>
      <p>Table 1 shows that our approach is outperforming classical benchmarks on our validation data; this is mostly due to the temporal dependencies included in our algorithm, which are not included in the Faster R-CNN approach. Notably, SSD algorithms like YOLOv3 are still 20 FPS faster than our approach in detecting single images. Nevertheless, our approach yields a huge recall increase of 9.5 % compared to the fast YOLOv3. We especially emphasize this, as recall is one of the most important metrics in real clinical use: it is more important to find a missing polyp than to avoid additional false positive detections. Figure 3 shows a sequence of detection results with our algorithm on the test dataset provided by the challenge. Furthermore, figure 4 shows a qualitative comparison of the three detection algorithms. We can see that all algorithms are detecting the polyp. Nevertheless, YOLOv3 and Faster R-CNN are distracted by light reflections and therefore also draw wrong detections. Through temporal ROI align, our approach can incorporate the detections from previous frames and therefore does not get distracted by the light reflections.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, we would like to discuss two main points: first, the limitations of our approach, and second, how to use our approach in clinically useful settings. The first limitation is the current speed of our system. With an inference performance of 24 FPS, the algorithm is not capable of detecting every image with an endoscopy processor processing at 30 FPS. This can be mitigated by pruning and quantization-aware retraining, which, on the other hand, reduces the accuracy of the algorithm. Additionally, a lot of benchmarking scores on still polyp images in the literature already exceed an 80 % F1 score [18, 19]. Nevertheless, those are not directly comparable with our evaluation, as they use different data sets and do not include sequences of images.</p>
      <p>The second and most drastic issue is that the system in its current form only works with video data and not a real-time stream of videos, due to the dependencies in the algorithm, which include preceding and future frames in the prediction. This issue may be solved by changing the algorithm to only use the preceding frames. In its current form, the algorithm can be used to evaluate endoscopies after they are completed or to detect polyps with wireless capsule endoscopy (WCE).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Overall, we demonstrate our approach to the Endoscopic computer vision challenges 2.0. We show a detection system that combines similar ROI features across frames with temporal attention to create the final polyp detections for a newly emerging frame. The system thereby uses present, past, and future features on the temporal axis to create new polyp localizations. We show that the system exceeds classical benchmark algorithms based on individual frames on our validation data from the challenge.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Compliance with ethical standards</title>
      <p>This research study was conducted retrospectively using human subject data made available in open access [10, 11, 12, 13, 14]. Ethical approval was not required, as confirmed by the license attached to the open access data.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>This research is supported using public funding from the Interdisziplinäres Zentrum für Klinische Forschung (IZKF) of the University of Würzburg.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] T. Gong, K. Chen, X. Wang, Q. Chu, F. Zhu, D. Lin, N. Yu, H. Feng, Temporal ROI align for video object recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 1442–1450.</p>
      <p>[2] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, A. Jemal, Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: A Cancer Journal for Clinicians 68 (2018) 394–424. doi:10.3322/caac.21492.</p>
      <p>[3] A. Krenzer, A. Hekalo, F. Puppe, Endoscopic detection and segmentation of gastroenterological diseases with deep convolutional neural networks, in: EndoCV@ISBI, 2020, pp. 58–63.</p>
      <p>[4] H. Itoh, H. Roth, M. Oda, M. Misawa, Y. Mori, S.-E. Kudo, K. Mori, Stable polyp-scene classification via subsampling and residual learning from an imbalanced large dataset, Healthcare Technology Letters 6 (2019) 237–242. doi:10.1049/htl.2019.0079.</p>
      <p>[5] H. A. Qadir, I. Balasingham, J. Solhusvik, J. Bergsland, L. Aabakken, Y. Shin, Improving automatic polyp detection using CNN by exploiting temporal dependency in colonoscopy video, IEEE Journal of Biomedical and Health Informatics 24 (2020) 180–193. doi:10.1109/jbhi.2019.2907434.</p>
      <p>[6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, A. C. Berg, SSD: Single shot multibox detector, arXiv preprint arXiv:1512.02325 (2016).</p>
      <p>[7] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 1137–1149. doi:10.1109/tpami.2016.2577031.</p>
      <p>[8] X. Liu, X. Guo, Y. Liu, Y. Yuan, Consolidated domain adaptive detection and localization framework for cross-device colonoscopic images, Medical Image Analysis 71 (2021) 102052.</p>
      <p>[9] A. Krenzer, M. Banck, K. Makowski, A. Hekalo, D. Fitting, J. Troya, B. Sudarevic, W. G. Zoller, A. Hann, F. Puppe, A real-time polyp detection system with clinical application in colonoscopy using deep convolutional neural networks (2022).</p>
      <p>[10] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, H. D. Johansen, Kvasir-SEG: A segmented polyp dataset, in: International Conference on Multimedia Modeling, Springer, 2020, pp. 451–462.</p>
      <p>[11] M. Misawa, S.-e. Kudo, Y. Mori, K. Hotta, K. Ohtsuka, T. Matsuda, S. Saito, T. Kudo, T. Baba, F. Ishida, et al., Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video), Gastrointestinal Endoscopy 93 (2021) 960–967.</p>
      <p>[12] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.</p>
      <p>[13] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. M., et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.</p>
      <p>[14] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.</p>
      <p>[15] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).</p>
      <p>[16] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).</p>
      <p>[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.</p>
      <p>[18] D. Wang, N. Zhang, X. Sun, P. Zhang, C. Zhang, Y. Cao, B. Liu, AFP-Net: Realtime anchor-free polyp detection in colonoscopy, in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2019, pp. 636–643.</p>
      <p>[19] X. Mo, K. Tao, Q. Wang, G. Wang, An efficient approach for polyps detection in endoscopic videos based on faster R-CNN, in: 2018 24th International Conference on Pattern Recognition (ICPR), IEEE, 2018, pp. 3929–3934.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>