<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal Context Framework for Endoscopy Artefact Segmentation and Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haili Ye</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanpei Miao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiang Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dahan Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heng Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Southern University of Science and Technology</institution>
          ,
          <addr-line>Shenzhen 518055</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer and Information Engineering, Xiamen University of Technology</institution>
          ,
          <addr-line>Xiamen 361004</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology</institution>
          ,
          <addr-line>Shenzhen 518055</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Endoscopic video processing can facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. However, most current methods still analyse individual frames, so the predictions for neighbouring frames are independent of one another, which causes temporal jitter. In this paper, we propose a temporal context framework for endoscopy artefact segmentation and detection. The framework extends general segmentation and detection models to temporal (multi-frame) input, and we add a Temporal Context Transformer (TCT) after the encoder of the model to improve its ability to construct temporal context features. Experiments on the EndoCV2022 challenge dataset show that this framework can improve the robustness of the model.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Image Analysis</kwd>
        <kwd>Colonoscopic Image</kwd>
        <kwd>Semantic Segmentation</kwd>
        <kwd>Object Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Colon cancer[1] is a common malignant tumour of the digestive tract that occurs in the colon. It is closely related to the consumption of red meat (such as beef) and ranks third in incidence among gastrointestinal tumours. Colon cancer mainly presents as adenocarcinoma, mucinous adenocarcinoma or undifferentiated carcinoma. Endoscopy[2] can not only clearly reveal intestinal lesions but also treat some of them, for example by directly removing intestinal polyps and other benign lesions, stopping intestinal bleeding, and removing foreign bodies from the colon. Endoscopic video[3] processing can facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. Computer-assisted interventions[4] have the potential to enhance the surgeon's visualization and navigation capabilities and to provide post-operative analytics for surgical training and risk assessment. A necessary element for these processes is scene understanding and, in particular, the detection and localization of anatomy and instruments. By segmenting and differentiating the elements that appear in the endoscopic view, it is therefore possible to assess tissue-instrument interactions and to understand the endoscopic workflow.</p>
      <p>Semantic segmentation[5] and object detection[6] are two active research fields in computer vision. In medical semantic segmentation, Olaf et al. proposed the classic medical image segmentation model U-Net[7], whose encoder-decoder structure and skip connections have greatly inspired subsequent work. On this basis, a series of novel and effective models have been developed, such as U-Net++[8], nnU-Net[9], DANet[10], DeepLab[11] and so on. For the analysis of endoscope images, the PraNet[12] proposed by Fan et al. aggregates high-level features through a parallel partial decoder to obtain context information and generate a global map. In medical object detection, Ross et al. proposed Faster R-CNN[13], which achieves end-to-end object detection with a two-stage deep learning structure. Cai et al. proposed Cascade R-CNN[14], which continuously refines the predictions by cascading several detection networks. The Swin Transformer[15] proposed by Liu et al. is a general vision backbone designed on the concept of the Transformer[16] and has achieved breakthroughs in multiple vision tasks. However, most current methods still analyse single frames, so their results are not well combined with temporal context information.</p>
      <p>Endoscope image sequences can provide more information than single-frame images [17, 18], and combining the temporal context of the preceding and following frames can effectively improve the analysis of endoscopy artefacts. Inspired by this, in this paper we propose a Temporal Context Framework for endoscopy artefact segmentation and detection. Our contributions are as follows:</p>
      <p>∙ We introduce a general framework that extracts temporal context features from sequential images and extends general segmentation and detection models to temporal input.</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>In this section, we introduce the proposed temporal context framework for endoscopy artefact segmentation and detection. The overall framework is shown in Fig. 1. It includes an endoscopy artefact segmentation model and an endoscopy artefact detection model. The input of both models is an endoscope image sequence, and we set a hyperparameter $L$ to represent the length of the image sequence, so the $L$-frame input sequence can be represented as $X \in \mathbb{R}^{L \times 3 \times H \times W}$.</p>
      <p>In the endoscopy artefact segmentation model, we adopt the classical encoder-decoder structure. In particular, the encoder of the model is similar to a conventional encoder and is responsible for extracting the features of each single frame. A group of $N$ temporal context transformers is connected at the end of the encoder to establish the correlation between the image features of the individual frames. Compared with general single-frame methods, this module utilizes the feature correlations between frames to improve prediction confidence. The loss function of the object detection model is the same as that of Faster R-CNN[13].</p>
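      <p>For concreteness, the following is a minimal sketch of how such a temporal wrapper could be organised for the segmentation branch: a shared per-frame encoder, a stack of temporal context blocks attending across the $L$ frames, and a shared per-frame decoder. The class and argument names are illustrative assumptions rather than the released implementation.</p>
      <preformat>
# Minimal sketch of the temporal-input wrapper (illustrative; not the authors' released code).
# Assumptions: `encoder` maps one frame to a feature map, `temporal_blocks` mix features
# across the L frames (e.g. the temporal context transformer sketched later in this section),
# and `decoder` maps a feature map back to a per-frame prediction.
import torch.nn as nn

class TemporalWrapper(nn.Module):
    def __init__(self, encoder, temporal_blocks, decoder):
        super().__init__()
        self.encoder = encoder                     # shared single-frame encoder
        self.temporal = nn.ModuleList(temporal_blocks)
        self.decoder = decoder                     # shared single-frame decoder

    def forward(self, x):
        # x: (B, L, 3, H, W) endoscope image sequence of length L
        b, l, c, h, w = x.shape
        feats = self.encoder(x.reshape(b * l, c, h, w))   # (B*L, C', H', W')
        _, cf, hf, wf = feats.shape
        seq = feats.reshape(b, l, cf * hf * wf)           # one token per frame
        for block in self.temporal:
            seq = block(seq)                              # attend across the L frames
        feats = seq.reshape(b * l, cf, hf, wf)            # restore per-frame feature maps
        out = self.decoder(feats)                         # per-frame prediction
        return out.reshape(b, l, *out.shape[1:])
      </preformat>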
      <p>Temporal Context Transformer. For an image sequence, the image data of adjacent frames are strongly correlated. Especially when a frame is blurred or corrupted by artefacts, introducing the features of the preceding frames can effectively recover lost targets or correct category recognition errors. In order to effectively improve the context understanding and feature integration capabilities of the model for image sequences, we designed the temporal context transformer, as shown in Fig. 2. The temporal context transformer is divided into a transformer encoder and a transformer decoder. The features extracted by the model encoder are input to the transformer encoder. For the transformer encoder of layer $l$, the input is the output $F_{l-1} \in \mathbb{R}^{L \times C}$ of the previous layer. The temporal context transformer encoder has a structure similar to the traditional Transformer encoder, but the difference is that we design a sequence coding $S$ that exploits the characteristics of the image sequence. The time difference between any two frames can be computed from the endoscope image sequence, and the sequence coding between different frames is modeled by normalizing these time differences. When the image sequence length is $L$, the sequence coding $S$ is an $L \times L$ square matrix:</p>
      <p>$$S = \begin{bmatrix} 0 &amp; |t_0 - t_1| &amp; \cdots &amp; |t_0 - t_{L-1}| \\ |t_1 - t_0| &amp; 0 &amp; \cdots &amp; |t_1 - t_{L-1}| \\ \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ |t_{L-1} - t_0| &amp; |t_{L-1} - t_1| &amp; \cdots &amp; 0 \end{bmatrix} \quad (1)$$</p>
      <p>$$Q, K, V = \mathrm{Linear}(F_{l-1}) \quad (2)$$</p>
      <p>The self-attention generates the query $Q \in \mathbb{R}^{L \times C}$, key $K \in \mathbb{R}^{L \times C}$ and value $V \in \mathbb{R}^{L \times C}$ from $F_{l-1}$, and the initial self-attention weight between the $L$ frames is calculated as $A' = \mathrm{softmax}\big(Q K^{\top} / \sqrt{d}\big) \in \mathbb{R}^{L \times L}$. Then the sequence coding is introduced to calculate the final self-attention weight $A = A' \odot S \in \mathbb{R}^{L \times L}$. In this way, the temporal relevance in the original self-attention weight is strengthened. The following steps are the same as for a classical transformer[16].</p>
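      <p>A compact sketch of this sequence-coded self-attention, following the reconstruction of Eqs. (1)-(2) above, is given below. The min-max normalisation of the time differences and the element-wise combination of the coding with the attention map are assumptions where the original notation is ambiguous.</p>
      <preformat>
# Sketch of the sequence-coded self-attention in the TCT encoder (illustrative).
# Assumptions: time differences are min-max normalised, and the sequence coding S
# re-weights the softmax attention map element-wise, per Eqs. (1)-(2) as reconstructed.
import math
import torch.nn as nn

class SequenceCodedSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # Eq. (2): Q, K, V from F_{l-1}
        self.proj = nn.Linear(dim, dim)
        self.dim = dim

    @staticmethod
    def sequence_coding(timestamps):
        # timestamps: (L,) frame times; S[i, j] is the normalised |t_i - t_j| of Eq. (1)
        diff = (timestamps[:, None] - timestamps[None, :]).abs()
        return diff / diff.max().clamp(min=1e-6)

    def forward(self, f, timestamps):
        # f: (L, C) one feature vector per frame
        q, k, v = self.qkv(f).chunk(3, dim=-1)
        attn = (q @ k.t()) / math.sqrt(self.dim)   # initial weights A'
        attn = attn.softmax(dim=-1)
        s = self.sequence_coding(timestamps)       # L x L sequence coding
        attn = attn * s                            # final weights A = A' (elementwise) S
        return self.proj(attn @ v)
      </preformat>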
      <p>The transformer decoder is responsible for decoding and reconstructing the features of the transformer encoder. The input of the layer-$l$ transformer decoder is $D_{l-1} \in \mathbb{R}^{L \times C}$. As in the transformer encoder, the sequence coding is added to the transformer decoder to improve the temporal modeling ability of the model. In the transformer decoder, the first step is masked self-attention, which constrains the model to predict in accordance with the order of the image sequence. Different from the classical transformer, we add a cross-attention[16] unit at the end of the transformer decoder. The transformer decoder calculates the query $Q \in \mathbb{R}^{L \times C}$ and key $K \in \mathbb{R}^{L \times C}$ using the output $E$ of the transformer encoder of the same layer, and the cross-attention weight matrices are calculated from these $Q$ and $K$. As shown in Fig. 2, there are two parallel attention modules for feature learning in this part. We hope that these two attention modules learn feature compensation and contraction respectively. Therefore, the parameters of the two modules are not shared, and matrix addition and element-wise multiplication are used respectively. The specific operations are as follows:</p>
      <p>$$A' = \mathrm{softmax}\big((Q W_1)(K W_1)^{\top} / \sqrt{d}\big) \quad (3)$$</p>
      <p>$$A'' = \mathrm{softmax}\big((Q W_2)(K W_2)^{\top} / \sqrt{d}\big) \quad (4)$$</p>
      <p>$$D_l = \mathrm{Norm}\big\{\mathrm{Norm}\{A' V W_1 + D_{l-1}\} + \mathrm{Norm}\{(A'' V W_2) \otimes D_{l-1}\}\big\} \quad (5)$$</p>
      <p>The above process fully fuses the features of the individual frames, and the temporal context transformer effectively extracts the context information of different frames. The aggregated features are reshaped back to their original dimensions before being sent to the model decoder.</p>
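      <p>Following the reconstruction of Eqs. (3)-(5), a minimal sketch of the two parallel cross-attention branches with non-shared parameters (one fused by matrix addition, one by element-wise multiplication) could look as follows. The use of LayerNorm for the Norm operator and the source of the value features are assumptions.</p>
      <preformat>
# Sketch of the dual-branch cross attention at the end of the TCT decoder (illustrative).
# Branch 1 is fused additively, branch 2 multiplicatively, mirroring Eqs. (3)-(5) as
# reconstructed; LayerNorm stands in for the Norm operator.
import math
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w1_q, self.w1_k, self.w1_v = (nn.Linear(dim, dim) for _ in range(3))
        self.w2_q, self.w2_k, self.w2_v = (nn.Linear(dim, dim) for _ in range(3))
        self.norm1, self.norm2, self.norm_out = (nn.LayerNorm(dim) for _ in range(3))
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, dec, enc):
        # dec: (L, C) decoder features D_{l-1}, enc: (L, C) encoder output E of the same layer
        a1 = ((self.w1_q(enc) @ self.w1_k(enc).t()) * self.scale).softmax(-1)   # Eq. (3)
        a2 = ((self.w2_q(enc) @ self.w2_k(enc).t()) * self.scale).softmax(-1)   # Eq. (4)
        add_branch = self.norm1(a1 @ self.w1_v(enc) + dec)      # additive fusion
        mul_branch = self.norm2((a2 @ self.w2_v(enc)) * dec)    # multiplicative fusion
        return self.norm_out(add_branch + mul_branch)           # Eq. (5)
      </preformat>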
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>In this section, we compare the performance of the proposed temporal context framework for endoscopy artefact segmentation and detection with state-of-the-art models on the segmentation and detection of endoscopy artefacts.</p>
      <sec id="sec-3-1">
        <title>Model</title>
      </sec>
      <sec id="sec-3-2">
        <title>UNet</title>
      </sec>
      <sec id="sec-3-3">
        <title>Model</title>
      </sec>
      <sec id="sec-3-4">
        <title>Faster R-CNN</title>
        <p>Table 1: Comparative experiment on the number of temporal context transformer layers.</p>
        <p>Data details and preparation. We mainly used the EndoCV2022 challenge dataset [17] of endoscopic images for endoscopy artefact detection in this work. The dataset covers five artefact categories: nonmucosa, artefact, saturation, specularity and bubbles. EndoCV launched this challenge as an extension of the previous artefact detection and segmentation challenges [21, 22], with a dataset specific to colonoscopy.</p>
        <p>The dataset contains 24 endoscopic video sequences for the EAD sub-challenge, with a total of 1,449 endoscopic images.</p>
        <p>We split the dataset into 80% of the sequences for training and 20% for validation. For the segmentation task, we used the Dice coefficient, Jaccard coefficient and pixel accuracy (PA) for evaluation. For the detection task, we used mAP at different IoU thresholds for evaluation.</p>
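        <p>For reference, a simple per-image version of these segmentation metrics on binary masks could be computed as below (a sketch; the official challenge evaluation scripts may aggregate differently).</p>
        <preformat>
# Simple per-image segmentation metrics on binary masks (illustrative sketch;
# the official EndoCV evaluation may aggregate results differently).
import numpy as np

def dice(pred, gt, eps=1e-6):
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def jaccard(pred, gt, eps=1e-6):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def pixel_accuracy(pred, gt):
    return (pred == gt).mean()
        </preformat>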
        <p>Implementation details. The deep models are implemented in PyTorch and trained on an NVIDIA Tesla V100 GPU. The artefact segmentation model is trained with the SGD optimizer and a learning rate of $10^{-4}$, and the artefact detection model is built on MMDetection and trained with the SGD optimizer and a learning rate of $10^{-2}$. The batch size is set to 2, a sliding window of length L is used to sample subsequences from each original sequence, and the input sequence images are resized to 960 × 540. Since the inputs are image sequences, the batch size is relatively small. In addition, we used conventional flipping, affine transformation, contrast adjustment and other augmentation methods to enhance the data of the training set. In order to demonstrate the effectiveness of the method, we do not use TTA, multi-model fusion or other post-processing; only a single model is used for prediction on the test set.</p>
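        <p>The sliding-window sampling mentioned above might be implemented along the following lines (a sketch; the stride and the handling of sequences shorter than L are assumptions not specified in the text).</p>
        <preformat>
# Sketch of sliding-window subsequence sampling from one endoscopic video (illustrative;
# the stride and the treatment of short sequences are assumptions).
def sliding_windows(frame_paths, window_len, stride=1):
    """Yield lists of `window_len` consecutive frame paths."""
    if len(frame_paths) >= window_len:
        for start in range(0, len(frame_paths) - window_len + 1, stride):
            yield frame_paths[start:start + window_len]

# Example: 5-frame training samples from a sorted list of frame file paths
# samples = list(sliding_windows(sorted_frame_paths, window_len=5))
        </preformat>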
        <p>Table 2: Comparison with state-of-the-art segmentation and detection models, without and with the proposed temporal context (TC) framework.</p>
        <p>Segmentation (Dice / Jaccard / PA): UNet 0.535 / 0.402 / 0.872; UNet + TC 0.625 / 0.491 / 0.892; DANet 0.673 / 0.560 / 0.923; DANet + TC 0.751 / 0.697 / 0.944; PraNet 0.715 / 0.621 / 0.936; PraNet + TC 0.816 / 0.776 / 0.961.</p>
        <p>Detection (mAP / AP50 / AP75): Faster R-CNN 0.232 / 0.464 / 0.208; Faster R-CNN + TC 0.317 / 0.563 / 0.321; Cascade R-CNN 0.336 / 0.579 / 0.347; Cascade R-CNN + TC 0.395 / 0.611 / 0.401; Swin Transformer 0.356 / 0.598 / 0.364; Swin Transformer + TC 0.403 / 0.613 / 0.421.</p>
        <p>We first compared the influence of the number N of TCT layers on the model performance through comparative experiments. The results are shown in Table 1. From the experimental results, it can be seen that the model achieves the best effect when N is 2, and the model overfits when N is too large. To verify the effectiveness of our method, we performed a comprehensive comparison with state-of-the-art segmentation and detection methods, the segmentation methods including UNet, DANet and PraNet, and the detection methods including Faster R-CNN, Cascade R-CNN and Swin Transformer, as shown in Table 2. Specifically, the performance of each SOTA model is steadily improved after being extended with our framework. We visualize the predictions of a set of models on an example endoscope image sequence in Fig. 3. The Dice, Jaccard and PA of the segmentation models are improved by 9%-12%, 5%-9% and 2%-3%, respectively. For the detection task, the models' mAP is improved by 5%-8%. The improvement holds for different types of methods, which shows that our method is robust and widely applicable.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>arXiv:2106.04463</source>
          (
          <year>2021</year>
          ). [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghatwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Isik-Polat</surname>
          </string-name>
          , G. Po-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>arXiv preprint arXiv:2202.12031</source>
          (
          <year>2022</year>
          ). [19]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          sent.
          <volume>79</volume>
          (
          <year>2021</year>
          )
          <article-title>103260</article-title>
          . URL: https://doi.org/10.1016/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          j.jvcir.
          <year>2021</year>
          .
          <volume>103260</volume>
          . doi:
          <volume>10</volume>
          .1016/j.jvcir.
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          103260. [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bodla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chellappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          , Soft-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>puter Vision</source>
          , ICCV 2017, Venice, Italy, October 22-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          29,
          <year>2017</year>
          , IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>5562</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          5570. URL: https://doi.org/10.1109/ICCV.
          <year>2017</year>
          .
          <volume>593</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>doi:10</source>
          .1109/ICCV.
          <year>2017</year>
          .
          <volume>593</volume>
          . [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Braden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bailey</surname>
          </string-name>
          , S. Yang,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>doscopy</surname>
          </string-name>
          ,
          <source>Scientific Reports</source>
          <volume>10</volume>
          (
          <year>2020</year>
          ). URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          //doi.org/10.1038%
          <fpage>2Fs41598</fpage>
          -
          <fpage>020</fpage>
          -59413-5. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <volume>1038</volume>
          /s41598-020-59413-5. [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dmitrieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghatwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bano</surname>
          </string-name>
          , G. Po-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <article-title>102002</article-title>
          . URL: https://doi.org/10.1016%
          <fpage>2Fj</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>media.</surname>
          </string-name>
          <year>2021</year>
          .
          <volume>102002</volume>
          . doi:
          <volume>10</volume>
          .1016/j.media.
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>