Improved-STCN Network with Enhanced Strategy for Sequence Polyp Segmentation

Quan He¹, Xiaobo Hu¹, Feng Sun¹, Lulu Zhou¹, Jing Wang¹ and Qiming Wan¹
¹Hangzhou Hikvision Digital Technology Co., Ltd, Hangzhou, China

Abstract
The detection of polyps helps the diagnosis of early colorectal cancer. With the rapid development of deep learning, more and more researchers apply detection and segmentation technology to assist polyp detection. This work is our solution to the polyp segmentation subtask of the EndoCV2022 challenge. Drawing on semi-supervised video object segmentation, we build on STCN [1]. STCN is designed for the setting in which the correct segmentation mask of the first video frame is given as input, after which the model simply tracks the target, whatever it is. We modify STCN into a sequence polyp segmentation network named improved-STCN, which can not only segment polyps but also track them. Because EndoCV2022 [2, 3] is a sequence challenge, images within the same sequence are very similar, which hurts performance; we therefore adopt semi-supervised learning to obtain more abundant training data. We also experiment with ways of making the segmentation results more credible, where single-frame detection and reverse-sequence information both help. Finally, on the round-II test, our system achieves a segmentation score of 0.7654 and ranked second.

Keywords
Polyp segmentation, Sequence data, Deep learning, Semi-supervised learning, Improved-STCN

1. Introduction

Colorectal cancer (CRC) is a common malignant tumor of the gastrointestinal tract. Its incidence and mortality rates rank second among digestive system cancers, followed by gastric cancer, esophageal cancer and primary liver cancer. A polyp is considered a sign of precancerous lesions; finding and removing it at any time during the precancerous stage not only reduces the mortality of colorectal cancer but also reduces its incidence. Colorectal lesions are usually diagnosed by colonoscopy, but unfortunately it is estimated that about 6–27% of lesions are missed during colonoscopy [4]. Colonoscopy image analysis and decision support systems have shown great potential for improving examination efficiency and reducing the number of missed lesions [5].

Figure 1: Example of EndoCV2022 challenge sequence data.

Deep learning is used more and more widely in the field of medical imaging. Since the MICCAI 2015 Automatic Polyp Detection in Colonoscopy Videos challenge, more and more datasets and challenges have been launched, further promoting the application of deep learning in endoscopic vision [6]. Among the models used, the most widespread is U-Net [7] and its variants. U-Net consists of two paths. The first is a compression path (also known as the encoder) that captures the context in the image; the encoder is simply a stack of traditional convolution and max-pooling layers. The second is a symmetric expanding path (also known as the decoder), used for precise localization via transposed convolutions. This structure has been proved able to segment medical images effectively. However, for sequence data in real scenes, this kind of method cannot effectively model temporal information.

In the field of video object segmentation, the model is trained to extract relationships between video frames to improve segmentation performance. MaskTrack [8] is a typical video object segmentation network. Taking the mask of the previous frame and the current frame as input, the trained model outputs the mask of the current frame with high segmentation accuracy. However, the performance of this method often depends on the accuracy of the previous frame's output, so it carries the risk of cumulative error.
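To make this propagation scheme concrete, here is a minimal PyTorch sketch of the MaskTrack-style input construction, in which the previous frame's mask is stacked with the current RGB frame as a fourth input channel. The wrapper class and the 4-channel backbone are illustrative assumptions, not code from [8]:

```python
import torch
import torch.nn as nn

class MaskTrackStyleNet(nn.Module):
    """Illustrative sketch of the MaskTrack idea [8]: the network
    receives the current RGB frame plus the previous frame's mask
    as a 4-channel input and predicts the current mask."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # `backbone` is any segmentation network whose first
        # convolution accepts 4 input channels (hypothetical here).
        self.backbone = backbone

    def forward(self, frame, prev_mask):
        # frame: (B, 3, H, W); prev_mask: (B, 1, H, W), values in [0, 1]
        x = torch.cat([frame, prev_mask], dim=1)   # (B, 4, H, W)
        return self.backbone(x)                    # predicted mask logits
```

Because each prediction is fed to the next frame, any error propagates forward, which is exactly the cumulative-error drawback noted above.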
This work is our solution to the polyp segmentation subtask of the EndoCV2022 challenge. The proposed approach is built on STCN, a semi-supervised video object segmentation network. In particular, we modify STCN into a sequence polyp segmentation network that can not only segment polyps but also track them. In short, our main contributions are as follows:

• We modify STCN into a sequence polyp segmentation network that no longer needs the first frame's mask to predict, as the original does. We also experiment with the training strategy to find a stronger model.
• We borrow from semi-supervised learning to generate more training data, since images of the same sequence are highly similar, which does not help the network's generalization and feature extraction ability.
• We propose an enhanced scheme to make the segmentation results more credible. Overall, our method proved effective in the EndoCV2022 challenge round-I and round-II.

2. Method

2.1. Overview of the framework

Figure 2: Overview of the improved-STCN.

Figure 2 shows the overall process of the improved-STCN. The network uses a ResNet50 and a ResNet18 to build a key encoder and a value encoder respectively. The key encoder encodes images into the key feature space, and the value encoder encodes both images and masks into the value feature space. Key-value pairs, corresponding one to one, are stored in a memory bank. When a new frame of the video sequence arrives, it is first encoded into the key feature space, and its similarity with the key features of previous frames stored in the memory bank is computed. The most similar features are combined into the feature space of the current frame for the model output. Here, the negative squared Euclidean distance is used as the similarity function:

    S = -||K^P - K^C||_2^2    (1)

where K^P denotes the key features of the previous frames and K^C the key features of the current frame. The aggregated readout feature V^C for the current frame is then computed as a weighted sum of the memory value features with one efficient matrix multiplication:

    V^C = V^P · S    (2)

where V^P denotes the value features stored in the memory bank; the readout is then passed to the decoder for mask generation [1].
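As an illustration of Eqs. (1) and (2), the following PyTorch sketch computes the affinities between the memory keys and the current frame's keys, then aggregates the memory values with one matrix multiplication. The softmax normalization of the affinities follows the STCN paper; the function and variable names are our own, not those of the official STCN code:

```python
import torch
import torch.nn.functional as F

def memory_readout(mem_keys, mem_values, query_key):
    """Affinity-based readout sketched from Eqs. (1)-(2).

    mem_keys:   (C_k, T*H*W)  key features of past frames in the memory bank
    mem_values: (C_v, T*H*W)  corresponding value features
    query_key:  (C_k, H*W)    key features of the current frame
    """
    # Negative squared Euclidean distance, expanded as
    # -||a - b||^2 = 2 a.b - ||a||^2 - ||b||^2; the ||b||^2 term is
    # constant per query location and cancels in the softmax below.
    ab = mem_keys.transpose(0, 1) @ query_key        # (T*H*W, H*W)
    a_sq = mem_keys.pow(2).sum(0).unsqueeze(1)       # (T*H*W, 1)
    similarity = 2 * ab - a_sq                       # Eq. (1), up to a constant

    # Normalize similarities over memory locations, then take the
    # weighted sum of memory values (Eq. (2)).
    weights = F.softmax(similarity, dim=0)           # (T*H*W, H*W)
    readout = mem_values @ weights                   # (C_v, H*W)
    return readout
```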
STCN is designed for the semi-supervised video object segmentation task, in which the correct mask of the first video frame is required. For the EndoCV2022 challenge we have specially improved STCN's structure, naming the result improved-STCN. In particular, we first hide the memory bank and the affinity computation module, then add a convolution module, obtaining the single-frame segmentation network (SFSN) shown in the red dashed box in Figure 2. In the training phase, we first train only the SFSN to make the encoder and decoder strong; the SFSN parameters then serve as pre-training parameters for STCN's training. In the inference phase, the SFSN outputs the result for the first frame, after which STCN tracks the mask and completes the predictions for all subsequent frames. In this way, improved-STCN acquires the ability to segment single frames without the help of other frames, so it can not only segment polyps but also track polyps that appeared in previous frames.

2.2. Semi-supervised learning

Due to the small field of view of the endoscope and the slow movement during endoscopy, the sequence data collected over a period of time are highly similar, as Figure 1 shows. Such near-duplicate data do little to improve the network's generalization and feature extraction ability. We therefore borrow from semi-supervised learning to generate more training data. In practice, we first use the entire EndoCV2022 challenge dataset to train a polyp tracking model with STCN. We then manually annotate the first frame of each Hyper-Kvasir video [9], and the polyp tracking model generates the pseudo labels for the remaining frames. In this way we obtain more abundant labelled sequence data, which helps our model learn.

2.3. Enhanced scheme

Figure 3: Overview of the enhanced scheme.

Although the model described in Subsection 2.1 can both segment and track polyps, we find that training two models, one to segment and one to track, gives better results. As Figure 3 shows, the SFSN derived from STCN segments the polyps in the first few frames of the sequence; meanwhile, STCN also outputs its own segmentation results. The outputs of the two models are scored with the same confidence measure, defined as the average value of the network output response over the segmentation target area. The key encoder and value encoder of STCN then encode the segmentation result with the higher confidence and store the encoded features in the memory bank, after which the predictions for all subsequent frames are completed.

Sequence information is helpful for segmentation, and usually the forward sequence information is used. For offline diagnosis, such as capsule endoscopy, we can also take advantage of backward sequence information. We therefore reverse the input sequence and let the model predict again, then fuse the forward-sequence and backward-sequence results as the final network output. The fusion rule is the same as above: compare the confidence of the two segmentation results and select the one with the higher confidence as the final result.
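A minimal sketch of this confidence measure and fusion rule, assuming the network outputs per-pixel polyp probabilities; the 0.5 threshold defining the target area is our assumption, since the paper only specifies averaging the response over that area:

```python
import torch

def mask_confidence(prob, threshold=0.5):
    """Average network response over the predicted target area
    (Subsection 2.3). `prob` is an (H, W) tensor of per-pixel
    polyp probabilities; the 0.5 threshold is an assumption."""
    target = prob > threshold
    if not target.any():
        return torch.tensor(0.0)      # nothing segmented
    return prob[target].mean()

def fuse_by_confidence(prob_a, prob_b):
    """Select the more confident of two predictions, as done both
    for the SFSN-vs-STCN choice and for forward/backward fusion."""
    if mask_confidence(prob_a) >= mask_confidence(prob_b):
        return prob_a
    return prob_b
```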
3. Experimental Results

The experiments consist of two parts: baseline experiments and the experiments used for the challenge. In part one, the baseline experiments were used to find suitable hyper-parameters and a data augmentation strategy for training improved-STCN. Besides, we carried out the semi-supervised learning described in Subsection 2.2, and we explored the effects of illumination and image size on the model's performance. In part two, we used the same training strategy as in part one to train the model on all the data we had, and tested it on the unseen EndoCV2022 challenge dataset. The enhanced scheme was adopted to obtain more credible segmentation results.

3.1. Dataset

Figure 4: EndoCV2022 challenge dataset statistics.

The EndoCV2022 organizing committee provided a total of 46 sequences to all participants. According to our statistics, the EndoCV2022 challenge dataset consists of 3348 frames sampled from real-world clinical scenarios. As Figure 4 shows, most polyps are around 400 in size while a few are larger than 800. Because polyps and images come in different sizes, we need strategies that reduce the network's sensitivity to resolution, such as multi-scale training. Although polyps have different shapes and sizes, images within the same sequence are highly similar, which does not help the network's generalization and feature extraction ability. Thus, in the baseline experiments, we split the EndoCV2022 challenge dataset by sequence into 80% for training and 20% for validation. To further enhance the generalization and feature extraction ability of our model, we also used three well-known publicly available endoscopy datasets: ETIS-Larib Polyp [10], CVC-Clinic [11], and Hyper-Kvasir. The ETIS-Larib Polyp DB was used directly as a training set. CVC-Clinic was used as a validation set, since more validation data better evaluates the generalization of the model. As the Hyper-Kvasir dataset contains only videos without labels, we adopted the method described in Subsection 2.2 to generate pseudo labels; these pseudo-labelled sequences were also used for training. In the experiments for the challenge, we used the same training strategy as in the baseline experiments and trained the model on all the data we had.

3.2. Evaluation Metrics

The EndoCV2022 organizing committee provided participants with a toolbox on GitHub to compute scores between the predicted mask and the ground-truth mask [12, 13]. The toolbox offers seven metrics: Jaccard (Jac), Dice, F2-score, Precision (Positive Predictive Value, PPV), Recall (Rec), Accuracy (Acc), and Hausdorff distance (Hdf). As these metrics are similar, and to make the experiments more efficient, we chose the two metrics most commonly used in medical image segmentation, the Jaccard index and the Dice coefficient. The Jaccard index is defined as

    Jac = TP / (TP + FP + FN)    (3)

where TP denotes true positive "polyp" pixels, while FP and FN denote false positives and false negatives respectively. Similarly, the Dice coefficient is calculated as

    Dice = 2·TP / (2·TP + FP + FN)    (4)

The EndoCV2022 leaderboard also uses the Dice coefficient as the score for evaluating model performance.
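For reference, a short NumPy sketch equivalent to Eqs. (3) and (4) on binary masks; this is our own illustration, not the official EndoCV2022 toolbox code:

```python
import numpy as np

def jaccard_and_dice(pred: np.ndarray, gt: np.ndarray):
    """Compute the Jaccard index (Eq. 3) and Dice coefficient
    (Eq. 4) from boolean masks of the same shape (True = polyp)."""
    tp = np.logical_and(pred, gt).sum()      # true positives
    fp = np.logical_and(pred, ~gt).sum()     # false positives
    fn = np.logical_and(~pred, gt).sum()     # false negatives
    jac = tp / (tp + fp + fn)                # Eq. (3)
    dice = 2 * tp / (2 * tp + fp + fn)       # Eq. (4)
    return jac, dice
```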
3.3. Training Details

We used PyTorch to train our model, and both training and inference ran on an NVIDIA Tesla V100 GPU. We minimized the cross-entropy loss with the Adam optimizer using the default momentum parameters β1 = 0.9 and β2 = 0.999. The learning rate was 0.0001 and the batch size was set to 16. The input image size was 384 × 384 pixels. As this is a sequential learning task, the maximum temporal distance between sampled frames was set to [5, 10, 15, 20, 25, 5] at [0%, 10%, 20%, 30%, 40%, 90%] of the total 20000 training iterations. We also adopted a strategy that makes the model pay more attention to difficult pixels: after 15000 iterations, only the top 20% of pixels with the highest loss were selected to compute gradients. As described in Subsection 3.1, we added a multi-scale training strategy: starting from the initial 384 × 384 input size, the model was trained with scale factors of 0.75, 1, and 1.25.
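A minimal sketch of the hard-pixel strategy described above (sometimes called bootstrapped cross-entropy); this is an illustration under our own naming, not the authors' exact code:

```python
import torch
import torch.nn.functional as F

def hard_pixel_loss(logits, target, top_p=0.2):
    """Cross-entropy restricted to the hardest pixels: keep only the
    top-`top_p` fraction of highest-loss pixels, as done after 15000
    iterations in Subsection 3.3.

    logits: (B, C, H, W) raw network outputs
    target: (B, H, W) integer class labels
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    flat = per_pixel.flatten()
    k = max(1, int(top_p * flat.numel()))
    hardest, _ = torch.topk(flat, k)      # the k highest-loss pixels
    return hardest.mean()
```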
3.4. Experimental Results

Table 1 shows the ablation study on the EndoCV2022 validation set combined with the CVC-Clinic dataset. First, when we use semi-supervised learning, the Dice coefficient on the validation set (EndoCV2022 validation + CVC-Clinic) increases by about 3%, which proves that adding more sequence data for the model to learn from does help. Second, colonoscopy images are produced under a combined light source, so the collected images are often either very bright or very dark. We set a colour jitter of (brightness=0.5, contrast=0.03, saturation=0.03) to simulate lighting changes; with it, the Dice coefficient improves to 0.7694. Figure 5 shows image cases that the base model cannot segment but that benefit from this approach. Lastly, the scale of the images affects the model's performance: the multi-scale training strategy reduces the model's sensitivity to image resolution, and the Dice coefficient improves to 0.7800.

Table 1
Ablation study on the EndoCV2022 validation set combined with CVC-Clinic

Method | Dice | IoU
base | 0.7338 | 0.6701
semi-supervised learning | 0.7613 | 0.6894
semi-supervised learning + light change | 0.7694 | 0.7058
semi-supervised learning + light change + multi-scale training | 0.7800 | 0.7237

Figure 5: Comparison of model segmentation under strong light and low light. (a) The model trained with the light-change strategy performs better; (b) the base model cannot distinguish the target.

Table 2 provides our model's segmentation results on the EndoCV2022 challenge segmentation task. First, the improved-STCN model we trained for polyp segmentation performs well on the unseen dataset, with a Dice coefficient of 0.7423; this result alone placed us in the top 5 on the leaderboard. When we adopt the two methods described in Subsection 2.3, the Dice coefficient increases by about 2% and 3% respectively, showing that the enhancement scheme does help. Unfortunately, as Figure 6 shows, our model fails to recognize objects in complex scenarios, such as dim and dark scenes.

Table 2
Results on the EndoCV2022 segmentation task round-II test set

Method | Dice | std
STCN | 0.7423 | 0.3756
STCN + SFSN | 0.7613 | 0.3571
STCN + Reverse Sequence | 0.7694 | 0.3543

Figure 6: Examples of model segmentation results on EndoCV2022 round-II. (a) An easy case for the model; (b) a hard case in a complex scenario.

4. Conclusion

In this work, we have detailed our solution to the polyp segmentation subtask of the EndoCV2022 challenge. We proposed the improved-STCN network, together with a semi-supervised learning method that improves the model's generalization and an enhanced scheme that makes the model's outputs more credible. Limited experimental results show that our method achieves consistently high Dice scores at low standard deviations, suggesting its suitability for polyp segmentation on endoscopic sequence data.

References

[1] H. K. Cheng, Y.-W. Tai, C.-K. Tang, Rethinking space-time networks with improved memory coverage for efficient video object segmentation, Advances in Neural Information Processing Systems 34 (2021).
[2] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[3] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[4] S. B. Ahn, D. S. Han, J. H. Bae, T. J. Byun, J. P. Kim, C. S. Eun, The miss rate for colorectal adenoma determined by quality-adjusted, back-to-back colonoscopies, Gut and Liver 6 (2012) 64.
[5] T. K. Lui, C. K. Hui, V. W. Tsui, K. S. Cheung, M. K. Ko, D. C. Foo, L. Y. Mak, C. K. Yeung, T. H. Lui, S. Y. Wong, et al., New insights on missed colonic lesions during colonoscopy through artificial intelligence-assisted real-time detection (with video), Gastrointestinal Endoscopy 93 (2021) 193–200.
[6] C. Yua, J. Yana, X. Lia, Parallel Res2Net-based network with reverse attention for polyp segmentation (2021).
[7] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[8] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, A. Sorkine-Hornung, Learning video object segmentation from static images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2663–2672.
[9] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, H. D. Johansen, Kvasir-SEG: A segmented polyp dataset, in: International Conference on Multimedia Modeling, Springer, 2020, pp. 451–462.
[10] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer, International Journal of Computer Assisted Radiology and Surgery 9 (2014) 283–293.
[11] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, F. Vilariño, WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Computerized Medical Imaging and Graphics 43 (2015) 99–111.
[12] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D. Soberanis-Mukul, et al., An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific Reports 10 (2020) 1–15.
[13] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.