<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal Context Framework for Endoscopy Artefact Segmentation and Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haili Ye</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanpei Miao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiang Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dahan Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heng Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Southern University of Science and Technology</institution>
          ,
          <addr-line>Shenzhen 518055</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer and Information Engineering, Xiamen University of Technology</institution>
          ,
          <addr-line>Xiamen 361004</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology</institution>
          ,
          <addr-line>Shenzhen 518055</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Endoscopic video processing can facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. However, most current methods still analyse individual frames, so the predictions for neighbouring frames are independent of one another, which causes temporal jitter. In this paper, we propose a temporal context framework for endoscopy artefact segmentation and detection. The framework extends general segmentation and detection models to temporal (multi-frame) input, and we add a Temporal Context Transformer (TCT) after the encoder of the model to improve its ability to construct temporal context features. Experiments on the EndoCV2022 challenge dataset show that this framework can improve the robustness of the model.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Image Analysis</kwd>
        <kwd>Colonoscopic Image</kwd>
        <kwd>Semantic Segmentation</kwd>
        <kwd>Object Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Colon cancer[1] is a common malignant tumour of the digestive tract that occurs in the colon. It is closely related to the consumption of red meat (such as beef) and ranks third in incidence among gastrointestinal tumours. Colon cancer mainly presents as adenocarcinoma, mucinous adenocarcinoma or undifferentiated carcinoma. Endoscopy[2] can not only clearly reveal intestinal lesions but also treat some of them, for example by directly removing intestinal polyps and other benign lesions, stopping intestinal bleeding, and removing foreign bodies from the colon. Endoscopic video[3] processing can facilitate pre-operative planning, intra-operative image guidance and post-operative analysis of the surgical procedure. Computer-assisted interventions[4] have the potential to enhance the surgeon's visualization and navigation capabilities and to provide post-operative analytics for surgical training and risk assessment. A necessary element for these processes is scene understanding and, in particular, the detection and localization of anatomy and instruments. By segmenting and differentiating the elements that appear in the endoscopic view, it is therefore possible to assess tissue-instrument interactions and to understand the endoscopic workflow.</p>
      <p>Semantic segmentation[5] and object detection[6] are two active research fields in computer vision. In medical semantic segmentation, Olaf et al. proposed the classic medical image segmentation model U-Net[7], whose encoder-decoder structure and skip connections have greatly inspired subsequent work. On this basis, a series of novel and effective models have been developed, such as U-Net++[8], nnU-Net[9], DANet[10], DeepLab[11] and so on. For the analysis of endoscope images, the PraNet[12] proposed by Fan et al. aggregates high-level features through a parallel partial decoder to obtain context information and generate a global map. In medical object detection, Ross et al. proposed Faster R-CNN[13], which achieves end-to-end object detection with a two-stage deep learning structure. Cai et al. proposed Cascade R-CNN[14], which continuously refines the predictions by cascading several detection networks. The Swin Transformer[15] proposed by Liu et al. is a general vision backbone designed on the concept of the Transformer[16] and has achieved breakthroughs in multiple vision tasks. However, most current methods still analyse single frames, so their results are not well combined with temporal context information.</p>
      <p>Endoscope image sequences can provide more information than single-frame images [17, 18], and combining the temporal context of the preceding and following frames can effectively improve the analysis of endoscopy artefacts. Inspired by this, in this paper we propose a Temporal Context Framework for endoscopy artefact segmentation and detection. Our contributions are as follows:</p>
      <p>∙ We introduce a general framework that extracts temporal context features from sequential images and extends general segmentation and detection models to temporal input.</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>In this section, we introduce the proposed temporal context framework for endoscopy artefact segmentation and detection. The overall framework is shown in Fig. 1. It includes an endoscopy artefact segmentation model and an endoscopy artefact detection model. The input of both models is an endoscope image sequence, and we set a hyperparameter $L$ to represent the length of the image sequence, so the $L$-frame input sequence can be represented as $X \in \mathbb{R}^{L \times 3 \times H \times W}$.</p>
      <p>In the endoscopy artefact segmentation model, we adopt the classical encoder-decoder structure. In particular, the encoder of the model is similar to a conventional encoder and is responsible for extracting the features of each single frame. A group of $N$ temporal context transformers is connected at the end of the encoder to establish the correlation between the image features of the individual frames. Compared with general single-frame methods, this module utilizes the feature correlations between frames to improve prediction confidence. The loss function of the object detection model is the same as that of Faster R-CNN[13].</p>
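      <p>For concreteness, the following is a minimal sketch of how such a temporal wrapper could be organised for the segmentation branch: a shared per-frame encoder, a stack of temporal context blocks attending across the $L$ frames, and a shared per-frame decoder. The class and argument names are illustrative assumptions rather than the released implementation.</p>
      <preformat>
# Minimal sketch of the temporal-input wrapper (illustrative; not the authors' released code).
# Assumptions: `encoder` maps one frame to a feature map, `temporal_blocks` mix features
# across the L frames (e.g. the temporal context transformer sketched later in this section),
# and `decoder` maps a feature map back to a per-frame prediction.
import torch.nn as nn

class TemporalWrapper(nn.Module):
    def __init__(self, encoder, temporal_blocks, decoder):
        super().__init__()
        self.encoder = encoder                     # shared single-frame encoder
        self.temporal = nn.ModuleList(temporal_blocks)
        self.decoder = decoder                     # shared single-frame decoder

    def forward(self, x):
        # x: (B, L, 3, H, W) endoscope image sequence of length L
        b, l, c, h, w = x.shape
        feats = self.encoder(x.reshape(b * l, c, h, w))   # (B*L, C', H', W')
        _, cf, hf, wf = feats.shape
        seq = feats.reshape(b, l, cf * hf * wf)           # one token per frame
        for block in self.temporal:
            seq = block(seq)                              # attend across the L frames
        feats = seq.reshape(b * l, cf, hf, wf)            # restore per-frame feature maps
        out = self.decoder(feats)                         # per-frame prediction
        return out.reshape(b, l, *out.shape[1:])
      </preformat>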
      <p>Temporal Context Transformer. For an image sequence, the image data of adjacent frames are strongly correlated. Especially when a frame is blurred or corrupted by artefacts, introducing the features of the preceding frames can effectively recover lost targets or correct category recognition errors. In order to effectively improve the context understanding and feature integration capabilities of the model for image sequences, we designed the temporal context transformer, as shown in Fig. 2. The temporal context transformer is divided into a transformer encoder and a transformer decoder. The features extracted by the model encoder are input to the transformer encoder. For the transformer encoder of layer $l$, the input is the output $F_{l-1} \in \mathbb{R}^{L \times C}$ of the previous layer. The temporal context transformer encoder has a structure similar to the traditional Transformer encoder, but the difference is that we design a sequence coding $S$ that exploits the characteristics of the image sequence. The time difference between any two frames can be computed from the endoscope image sequence, and the sequence coding between different frames is modeled by normalizing these time differences. When the image sequence length is $L$, the sequence coding $S$ is an $L \times L$ square matrix:</p>
      <p>$$S = \begin{bmatrix} 0 &amp; |t_0 - t_1| &amp; \cdots &amp; |t_0 - t_{L-1}| \\ |t_1 - t_0| &amp; 0 &amp; \cdots &amp; |t_1 - t_{L-1}| \\ \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ |t_{L-1} - t_0| &amp; |t_{L-1} - t_1| &amp; \cdots &amp; 0 \end{bmatrix} \quad (1)$$</p>
      <p>$$Q, K, V = \mathrm{Linear}(F_{l-1}) \quad (2)$$</p>
      <p>The self-attention generates the query $Q \in \mathbb{R}^{L \times C}$, key $K \in \mathbb{R}^{L \times C}$ and value $V \in \mathbb{R}^{L \times C}$ from $F_{l-1}$, and the initial self-attention weight between the $L$ frames is calculated as $A' = \mathrm{softmax}\big(Q K^{\top} / \sqrt{d}\big) \in \mathbb{R}^{L \times L}$. Then the sequence coding is introduced to calculate the final self-attention weight $A = A' \odot S \in \mathbb{R}^{L \times L}$. In this way, the temporal relevance in the original self-attention weight is strengthened. The following steps are the same as for a classical transformer[16].</p>
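      <p>A compact sketch of this sequence-coded self-attention, following the reconstruction of Eqs. (1)-(2) above, is given below. The min-max normalisation of the time differences and the element-wise combination of the coding with the attention map are assumptions where the original notation is ambiguous.</p>
      <preformat>
# Sketch of the sequence-coded self-attention in the TCT encoder (illustrative).
# Assumptions: time differences are min-max normalised, and the sequence coding S
# re-weights the softmax attention map element-wise, per Eqs. (1)-(2) as reconstructed.
import math
import torch.nn as nn

class SequenceCodedSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # Eq. (2): Q, K, V from F_{l-1}
        self.proj = nn.Linear(dim, dim)
        self.dim = dim

    @staticmethod
    def sequence_coding(timestamps):
        # timestamps: (L,) frame times; S[i, j] is the normalised |t_i - t_j| of Eq. (1)
        diff = (timestamps[:, None] - timestamps[None, :]).abs()
        return diff / diff.max().clamp(min=1e-6)

    def forward(self, f, timestamps):
        # f: (L, C) one feature vector per frame
        q, k, v = self.qkv(f).chunk(3, dim=-1)
        attn = (q @ k.t()) / math.sqrt(self.dim)   # initial weights A'
        attn = attn.softmax(dim=-1)
        s = self.sequence_coding(timestamps)       # L x L sequence coding
        attn = attn * s                            # final weights A = A' (elementwise) S
        return self.proj(attn @ v)
      </preformat>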
      <p>The transformer decoder is responsible for decoding and reconstructing the features of the transformer encoder. The input of the layer-$l$ transformer decoder is $D_{l-1} \in \mathbb{R}^{L \times C}$. As in the transformer encoder, the sequence coding is added to the transformer decoder to improve the temporal modeling ability of the model. In the transformer decoder, the first step is masked self-attention, which constrains the model to predict in accordance with the order of the image sequence. Different from the classical transformer, we add a cross-attention[16] unit at the end of the transformer decoder. The transformer decoder calculates the query $Q \in \mathbb{R}^{L \times C}$ and key $K \in \mathbb{R}^{L \times C}$ using the output $E$ of the transformer encoder of the same layer, and the cross-attention weight matrices are calculated from these $Q$ and $K$. As shown in Fig. 2, there are two parallel attention modules for feature learning in this part. We hope that these two attention modules learn feature compensation and contraction respectively. Therefore, the parameters of the two modules are not shared, and matrix addition and element-wise multiplication are used respectively. The specific operations are as follows:</p>
      <p>$$A' = \mathrm{softmax}\big((Q W_1)(K W_1)^{\top} / \sqrt{d}\big) \quad (3)$$</p>
      <p>$$A'' = \mathrm{softmax}\big((Q W_2)(K W_2)^{\top} / \sqrt{d}\big) \quad (4)$$</p>
      <p>$$D_l = \mathrm{Norm}\big\{\mathrm{Norm}\{A' V W_1 + D_{l-1}\} + \mathrm{Norm}\{(A'' V W_2) \otimes D_{l-1}\}\big\} \quad (5)$$</p>
      <p>The above process fully fuses the features of the individual frames, and the temporal context transformer effectively extracts the context information of different frames. The aggregated features are reshaped back to their original dimensions before being sent to the model decoder.</p>
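      <p>Following the reconstruction of Eqs. (3)-(5), a minimal sketch of the two parallel cross-attention branches with non-shared parameters (one fused by matrix addition, one by element-wise multiplication) could look as follows. The use of LayerNorm for the Norm operator and the source of the value features are assumptions.</p>
      <preformat>
# Sketch of the dual-branch cross attention at the end of the TCT decoder (illustrative).
# Branch 1 is fused additively, branch 2 multiplicatively, mirroring Eqs. (3)-(5) as
# reconstructed; LayerNorm stands in for the Norm operator.
import math
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w1_q, self.w1_k, self.w1_v = (nn.Linear(dim, dim) for _ in range(3))
        self.w2_q, self.w2_k, self.w2_v = (nn.Linear(dim, dim) for _ in range(3))
        self.norm1, self.norm2, self.norm_out = (nn.LayerNorm(dim) for _ in range(3))
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, dec, enc):
        # dec: (L, C) decoder features D_{l-1}, enc: (L, C) encoder output E of the same layer
        a1 = ((self.w1_q(enc) @ self.w1_k(enc).t()) * self.scale).softmax(-1)   # Eq. (3)
        a2 = ((self.w2_q(enc) @ self.w2_k(enc).t()) * self.scale).softmax(-1)   # Eq. (4)
        add_branch = self.norm1(a1 @ self.w1_v(enc) + dec)      # additive fusion
        mul_branch = self.norm2((a2 @ self.w2_v(enc)) * dec)    # multiplicative fusion
        return self.norm_out(add_branch + mul_branch)           # Eq. (5)
      </preformat>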
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>In this section, we compare the performance of the proposed temporal context framework for endoscopy artefact segmentation and detection with state-of-the-art models on the segmentation and detection of endoscopy artefacts.</p>
      <sec id="sec-3-1">
        <title>Model</title>
      </sec>
      <sec id="sec-3-2">
        <title>UNet</title>
      </sec>
      <sec id="sec-3-3">
        <title>Model</title>
      </sec>
      <sec id="sec-3-4">
        <title>Faster R-CNN</title>
        <p>Table 1: Comparative experiment on the number of temporal context transformer layers.</p>
        <p>Data details and preparation. We mainly used the EndoCV2022 challenge dataset [17] of endoscopic images for endoscopy artefact detection in this work. The dataset covers five artefact categories: nonmucosa, artefact, saturation, specularity and bubbles. EndoCV launched this challenge as an extension of the previous artefact detection and segmentation challenges [21, 22], with a dataset specific to colonoscopy.</p>
        <p>The dataset contains 24 endoscopic video sequences for the EAD sub-challenge, with a total of 1,449 endoscopic images.</p>
        <p>We split the dataset into 80% of the sequences for training and 20% for validation. For the segmentation task, we used the Dice coefficient, Jaccard coefficient and pixel accuracy (PA) for evaluation. For the detection task, we used mAP at different IoU thresholds for evaluation.</p>
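        <p>For reference, a simple per-image version of these segmentation metrics on binary masks could be computed as below (a sketch; the official challenge evaluation scripts may aggregate differently).</p>
        <preformat>
# Simple per-image segmentation metrics on binary masks (illustrative sketch;
# the official EndoCV evaluation may aggregate results differently).
import numpy as np

def dice(pred, gt, eps=1e-6):
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def jaccard(pred, gt, eps=1e-6):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def pixel_accuracy(pred, gt):
    return (pred == gt).mean()
        </preformat>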
        <p>Implementation details. The deep models are implemented in PyTorch and trained on an NVIDIA Tesla V100 GPU. The artefact segmentation model is trained with the SGD optimizer and a learning rate of $10^{-4}$, and the artefact detection model is built on MMDetection and trained with the SGD optimizer and a learning rate of $10^{-2}$. The batch size is set to 2, a sliding window of length L is used to sample subsequences from each original sequence, and the input sequence images are resized to 960 × 540. Since the inputs are image sequences, the batch size is relatively small. In addition, we used conventional flipping, affine transformation, contrast adjustment and other augmentation methods to enhance the data of the training set. In order to demonstrate the effectiveness of the method, we do not use TTA, multi-model fusion or other post-processing; only a single model is used for prediction on the test set.</p>
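        <p>The sliding-window sampling mentioned above might be implemented along the following lines (a sketch; the stride and the handling of sequences shorter than L are assumptions not specified in the text).</p>
        <preformat>
# Sketch of sliding-window subsequence sampling from one endoscopic video (illustrative;
# the stride and the treatment of short sequences are assumptions).
def sliding_windows(frame_paths, window_len, stride=1):
    """Yield lists of `window_len` consecutive frame paths."""
    if len(frame_paths) >= window_len:
        for start in range(0, len(frame_paths) - window_len + 1, stride):
            yield frame_paths[start:start + window_len]

# Example: 5-frame training samples from a sorted list of frame file paths
# samples = list(sliding_windows(sorted_frame_paths, window_len=5))
        </preformat>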
        <p>Table 2: Comparison with state-of-the-art segmentation and detection models, without and with the proposed temporal context (TC) framework.</p>
        <p>Segmentation (Dice / Jaccard / PA): UNet 0.535 / 0.402 / 0.872; UNet + TC 0.625 / 0.491 / 0.892; DANet 0.673 / 0.560 / 0.923; DANet + TC 0.751 / 0.697 / 0.944; PraNet 0.715 / 0.621 / 0.936; PraNet + TC 0.816 / 0.776 / 0.961.</p>
        <p>Detection (mAP / AP50 / AP75): Faster R-CNN 0.232 / 0.464 / 0.208; Faster R-CNN + TC 0.317 / 0.563 / 0.321; Cascade R-CNN 0.336 / 0.579 / 0.347; Cascade R-CNN + TC 0.395 / 0.611 / 0.401; Swin Transformer 0.356 / 0.598 / 0.364; Swin Transformer + TC 0.403 / 0.613 / 0.421.</p>
        <p>We first compared the influence of the number N of TCT layers on the model performance through comparative experiments. The results are shown in Table 1. From the experimental results, it can be seen that the model achieves the best effect when N is 2, and the model overfits when N is too large. To verify the effectiveness of our method, we performed a comprehensive comparison with state-of-the-art segmentation and detection methods, the segmentation methods including UNet, DANet and PraNet, and the detection methods including Faster R-CNN, Cascade R-CNN and Swin Transformer, as shown in Table 2. Specifically, the performance of each SOTA model is steadily improved after being extended with our framework. We visualize the predictions of a set of models on an example endoscope image sequence in Fig. 3. The Dice, Jaccard and PA of the segmentation models are improved by 9%-12%, 5%-9% and 2%-3%, respectively. For the detection task, the models' mAP is improved by 5%-8%. The improvement holds for different types of methods, which shows that our method is robust and widely applicable.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>arXiv:2106.04463</source>
          (
          <year>2021</year>
          ). [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghatwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Isik-Polat</surname>
          </string-name>
          , G. Po-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>arXiv preprint arXiv:2202.12031</source>
          (
          <year>2022</year>
          ). [19]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          sent.
          <volume>79</volume>
          (
          <year>2021</year>
          )
          <article-title>103260</article-title>
          . URL: https://doi.org/10.1016/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          j.jvcir.
          <year>2021</year>
          .
          <volume>103260</volume>
          . doi:
          <volume>10</volume>
          .1016/j.jvcir.
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          103260. [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bodla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chellappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          , Soft-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>puter Vision</source>
          , ICCV 2017, Venice, Italy, October 22-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          29,
          <year>2017</year>
          , IEEE Computer Society,
          <year>2017</year>
          , pp.
          <fpage>5562</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          5570. URL: https://doi.org/10.1109/ICCV.
          <year>2017</year>
          .
          <volume>593</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>doi:10</source>
          .1109/ICCV.
          <year>2017</year>
          .
          <volume>593</volume>
          . [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Braden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bailey</surname>
          </string-name>
          , S. Yang,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>doscopy</surname>
          </string-name>
          ,
          <source>Scientific Reports</source>
          <volume>10</volume>
          (
          <year>2020</year>
          ). URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          //doi.org/10.1038%
          <fpage>2Fs41598</fpage>
          -
          <fpage>020</fpage>
          -59413-5. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <volume>1038</volume>
          /s41598-020-59413-5. [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dmitrieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghatwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bano</surname>
          </string-name>
          , G. Po-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <article-title>102002</article-title>
          . URL: https://doi.org/10.1016%
          <fpage>2Fj</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>media.</surname>
          </string-name>
          <year>2021</year>
          .
          <volume>102002</volume>
          . doi:
          <volume>10</volume>
          .1016/j.media.
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>