=Paper=
{{Paper
|id=Vol-3148/paper6
|storemode=property
|title=Improved-STCN Network with Enhanced Strategy for Sequence Polyp Segmentation
|pdfUrl=https://ceur-ws.org/Vol-3148/paper5.pdf
|volume=Vol-3148
|authors=Quan He,Xiaobo Hu,Feng Sun,Lulu Zhou,Jing Wang,Qiming Wan
|dblpUrl=https://dblp.org/rec/conf/isbi/HeHSZWW22
}}
==Improved-STCN Network with Enhanced Strategy for Sequence Polyp Segmentation==
<pdf width="1500px">https://ceur-ws.org/Vol-3148/paper5.pdf</pdf>
<pre>
Improved-STCN Network with Enhanced Strategy for
Sequence Polyp Segmentation
Quan He¹, Xiaobo Hu¹, Feng Sun¹, Lulu Zhou¹, Jing Wang¹ and Qiming Wan¹
¹ Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou, China


Abstract
The detection of polyps aids the early diagnosis of colorectal cancer. With the rapid development
of deep learning, more and more researchers apply detection and segmentation techniques to assist
polyp detection. This work is our solution to the polyp segmentation subtask of the EndoCV2022
challenge. We take the idea from semi-supervised video object segmentation and build on STCN [1].
STCN is designed for the setting in which the correct segmentation mask of the first video frame
is given as input; the model then simply tracks the target, whatever it is. We modify STCN into a
sequence polyp segmentation network, named improved-STCN, which can both segment and track polyps.
Because the EndoCV2022 challenge [2] [3] is a sequence challenge, the images within a sequence are
very similar, which leads to poor performance. We therefore adopt semi-supervised learning to
obtain more abundant training data. We also carry out experiments on how to make the segmentation
results more credible, and find that single-frame detection and reverse sequence information both
help. Finally, on the round-II test, our system achieves a segmentation score of 0.7654 and ranks
second.

Keywords
Polyp segmentation, Sequence data, Deep learning, Semi-supervised learning, Improved-STCN


4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022) in
conjunction with the 19th IEEE International Symposium on Biomedical Imaging ISBI2022,
March 28th, 2022, IC Royal Bengal, Kolkata, India
whut2014hq@163.com (Q. He)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
Colorectal cancer (CRC) is a common malignant tumor of the gastrointestinal tract. Its incidence
and mortality rates rank second among digestive system cancers, followed by gastric cancer,
esophageal cancer and primary liver cancer. Polyps are considered a sign of precancerous lesions;
detecting them in time and intervening while the lesion is still precancerous therefore reduces
not only the mortality of colorectal cancer but also its incidence. Colorectal lesions are
usually diagnosed by colonoscopy, but unfortunately an estimated 6-27% of lesions are missed
during colonoscopy [4]. Colonoscopy image analysis and decision support systems have shown great
potential for improving examination efficiency and reducing the number of missed lesions [5].

Figure 1: Example of EndoCV2022 challenge sequence data

Deep learning is increasingly used in the field of medical imaging. Since the MICCAI 2015
Automatic Polyp Detection in Colonoscopy Videos challenge, more and more datasets and challenges
have been launched, further promoting the application of deep learning to endoscopic vision [6].
Among them, the most widely used deep learning model is U-Net [7] and its variants. U-Net
consists of two paths. The first is a contracting path (also known as the encoder) that captures
the context in the image; it is just a traditional stack of convolution and max-pooling layers.
The second is the symmetric expanding path (also known as the decoder), which enables precise
localization using transposed convolutions. This structure has proved able to segment medical
images effectively. However, for sequence data in real scenes, this kind of method cannot
effectively model temporal information.

In the field of video object segmentation, the model is trained to extract the relationship
between video frames to improve segmentation performance. MaskTrack [8] is a typical video object
segmentation network. Taking the mask of the previous frame and the current frame as input, the
trained model outputs the mask of the current frame with high segmentation accuracy. However, the
performance of this method often depends on the accuracy of the output for the previous frame,
which carries the risk of cumulative error. This work is our solution to the polyp segmentation
subtask in the
EndoCV2022 challenge. The proposed approach is built on STCN, a semi-supervised video object
segmentation network. In particular, we modify STCN into a sequence polyp segmentation network,
which can both segment and track polyps. In short, our main contributions are as follows:

     • We modify STCN into a sequence polyp segmentation network that no longer needs the first
       frame's mask for prediction, as the original STCN does. We also experiment with training
       strategies to find a stronger model.
     • We borrow from semi-supervised learning to generate more training data, because the images
       of the same sequence are highly similar, which is not conducive to improving the network's
       generalization and feature extraction ability.
     • We propose an enhanced scheme to make the segmentation results more credible. Overall,
       our method proved effective in EndoCV2022 challenge round-I and round-II.

2. Method

2.1. Overview of the framework
Figure 2 shows the overall process of the improved-STCN. The network uses ResNet50 and ResNet18
to build a key encoder and a value encoder, respectively. The key encoder encodes the images into
the key feature space, and the value encoder encodes both the images and the masks into the value
feature space. The keys and their corresponding values are stored one-to-one in the memory bank.
Then, when a new frame of the video sequence arrives, it is first encoded into the key feature
space, and its similarity with the key features of the previous frames stored in the memory bank
is computed. The most similar features are combined into the feature space of the current frame
for the model output. Here, the negative squared Euclidean distance is used as the similarity
function, defined as follows:

    $S = -\lVert K^P - K^C \rVert_2^2$        (1)

where $K^P$ represents the previous frames' key features and $K^C$ represents the current frame's
key features. The aggregated readout feature $V^C$ for the current frame can then be computed as a
weighted sum of the memory features with an efficient matrix multiplication:

    $V^C = C^P \cdot S$        (2)

which is then passed to the decoder for mask generation [1].

Figure 2: Overview of the improved-STCN

STCN targets the semi-supervised video object segmentation task, in which the mask of the first
video frame is required. We have specifically modified the structure of STCN for the EndoCV2022
challenge and call the result improved-STCN.
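
As an illustration of the similarity in Eq. (1) and of a readout computed as a weighted sum of
memory features, the following is a minimal sketch in plain Python for a single query location.
It is not the authors' implementation: the function names are hypothetical, and normalising the
similarities with a softmax over the memory entries is an assumption that the text above does not
state.

```python
import math


def neg_sq_l2(k_mem, k_cur):
    """Eq. (1): negative squared Euclidean distance between two key vectors."""
    return -sum((a - b) ** 2 for a, b in zip(k_mem, k_cur))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def readout(mem_keys, mem_values, k_cur):
    """Weighted sum of memory value vectors for one query key.

    Assumption: the similarities are turned into weights with a softmax
    over the memory entries before the weighted sum is taken.
    """
    sims = [neg_sq_l2(k, k_cur) for k in mem_keys]
    weights = softmax(sims)
    dim = len(mem_values[0])
    return [sum(w * v[d] for w, v in zip(weights, mem_values)) for d in range(dim)]


# Toy memory with two entries; the query matches the second key exactly.
mem_keys = [[0.0, 0.0], [1.0, 1.0]]
mem_values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]
v_c = readout(mem_keys, mem_values, query)
```

In this toy case the second memory entry dominates the readout because its key is closest to the
query, which is the behaviour the memory lookup relies on.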
In particular, we first hide the memory bank and the affinity computation module, then add a
convolution module to obtain the single-frame segmentation network (SFSN), as shown in the red
dashed box in Figure 2. In the training phase, we first train only the SFSN to strengthen the
encoder and decoder. The SFSN parameters then serve as the pre-trained parameters for training
STCN. In the inference phase, the SFSN first outputs the result for the first frame; STCN then
tracks the mask and completes the predictions for all subsequent frames. In this way,
improved-STCN gains the ability to segment a single frame without the help of other frames. As a
result, the improved-STCN can not only segment polyps but also track the polyps that appeared in
previous frames.

2.2. Semi-supervised learning
Due to the small field of view of the endoscope and its slow movement during endoscopy, the
sequence data collected over a period of time are highly similar, as Figure 1 shows. Such similar
data are not conducive to improving the network's generalization and feature extraction ability.
We borrow from semi-supervised learning to generate more training data. In practice, we first use
the whole EndoCV2022 challenge dataset and STCN to train a polyp tracking model. Then we manually
annotate the first frame of each Hyper-Kvasir video [9], and the polyp tracking model generates
the pseudo labels. In this way, we obtain more abundant labeled sequence data, which helps our
model's learning.

2.3. Enhanced scheme
Although the model described in Subsection 2.1 is able to segment and track polyps, we find that
training two models to segment and track polyps separately gives better results. As Figure 3
shows, the SFSN derived from STCN is used to segment the polyps in the first few frames of the
sequence. Meanwhile, STCN also outputs segmentation results. The same calculation is applied to
the results of both models to obtain a confidence, defined as the average value of the network
output response in the segmentation target area. The key encoder and value encoder of STCN then
encode the segmentation result with the higher confidence and store the encoding in the memory
bank. The predictions for all subsequent frames are completed next.

Figure 3: Overview of the enhanced scheme

Sequence information is helpful for segmentation. Usually, we use forward sequence information.
For offline diagnosis, such as capsule endoscopy diagnosis, we can also take advantage of
backward sequence information. We therefore reverse the input sequence and let the model predict
on it, then fuse the forward-sequence and backward-sequence results as the final output of the
network. The fusion method is the same as above: compare the confidence of the two segmentation
results and select the one with the higher confidence as the final result.

3. Experimental Results
The experiments consist of two parts: baseline experiments and experiments for the challenge. In
part one, the baseline experiments were used to find suitable hyper-parameters and a data
augmentation strategy for training improved-STCN. We also carried out the semi-supervised
learning described in Subsection 2.2 and explored the effects of illumination and size on the
model's performance. In part two, we used the same training strategy as in part one to train the
model on all the datasets we have, and tested it on the EndoCV2022 challenge unseen dataset. The
enhanced scheme was adopted to obtain more credible segmentation results.

Figure 4: EndoCV2022 challenge dataset statistics
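
The confidence-based selection described in Subsection 2.3 can be sketched as follows. This is a
minimal illustration rather than the authors' code: the function names are hypothetical, and it
assumes that the "segmentation target area" is the set of pixels whose predicted probability is at
least 0.5, that an empty target area has confidence 0, and that ties go to the first input.

```python
def confidence(prob_map, threshold=0.5):
    """Mean predicted probability over the pixels predicted as polyp.

    Assumption: the target area is every pixel with probability >= threshold.
    """
    target = [p for row in prob_map for p in row if p >= threshold]
    return sum(target) / len(target) if target else 0.0


def fuse(prob_a, prob_b, threshold=0.5):
    """Return whichever probability map has the higher target-area confidence."""
    if confidence(prob_a, threshold) >= confidence(prob_b, threshold):
        return prob_a
    return prob_b


# Toy 2x2 probability maps standing in for two predictions of the same frame,
# e.g. SFSN vs. STCN, or forward vs. reversed sequence.
sfsn_out = [[0.1, 0.9], [0.2, 0.8]]
stcn_out = [[0.1, 0.6], [0.2, 0.7]]
chosen = fuse(sfsn_out, stcn_out)
```

Here the first map is selected because its foreground pixels average 0.85 against 0.65 for the
second. The same helper would apply unchanged to the forward and backward sequence results.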
3.1. Dataset
The EndoCV2022 organizing committee provided a total of 46 sequences to all participants.
According to the statistics, the EndoCV2022 challenge dataset consists of 3348 frames sampled in
real-world clinical scenarios. As Figure 4 shows, most polyps are around 400 in size while a few
polyps are larger than 800. Because polyps and images vary in size, we need strategies that
reduce the sensitivity of the network to resolution, such as multi-scale training. Although
polyps have different shapes and sizes, the images within the same sequence are highly similar,
which is not conducive to improving the network's generalization and feature extraction ability.
Thus, in the baseline experiments, we split the EndoCV2022 challenge dataset into 80% for
training and 20% for validation in sequence. To enhance the generalization and feature
extraction ability of our model, we also utilized three well-known public endoscopy sequence
datasets: ETIS-Larib Polyp [10], CVC-Clinic [11], and Hyper-Kvasir. ETIS-Larib Polyp DB was used
directly as a training set. CVC-Clinic was used as a validation set, as more data can better
evaluate the generalization of the model. As the Hyper-Kvasir dataset has only video data and no
labels, we adopted the method described in Subsection 2.2 to generate labels. These sequences
with pseudo labels were then also used as a training set. In the experiments for the challenge,
we used the same training strategy as in the baseline experiments, and trained the model on all
the datasets we have.

3.2. Evaluation Metrics
The EndoCV2022 organizing committee provided participants with a toolbox on GitHub to calculate
the scores between the predicted mask and the ground truth mask [12, 13]. There are seven metrics
in the toolbox: Jaccard (Jac), Dice, F2-score, Precision (Positive Predictive Value, PPV), Recall
(Rec), Accuracy (Acc), and Hausdorff distance (Hdf). As these metrics are similar, and to make
the experiments more efficient, we chose the metrics most commonly used for medical image
segmentation, the Jaccard and the Dice coefficient. The Jaccard is defined as follows:

    $Jac = \frac{TP}{TP + FP + FN}$        (3)

where TP represents true positives ("polyp"), while FP and FN represent false positives and false
negatives, respectively. Similarly, the Dice coefficient is calculated as follows:

    $Dice = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$        (4)

The EndoCV2022 leaderboard also chose the Dice coefficient as the score to evaluate the
performance of the model.

3.3. Training Details
We used PyTorch to train our model, and both training and inference were run on an NVIDIA Tesla
V100 GPU. We minimized the cross-entropy loss using the Adam optimizer with default momentum
β1 = 0.9, β2 = 0.999. The learning rate was set to 0.0001 and the batch size to 16. The input
image size of the model was 384 × 384 pixels. As this was a sequential learning task, the maximum
temporal distance between frames was set to [5, 10, 15, 20, 25, 5] at the corresponding
iterations of [0%, 10%, 20%, 30%, 40%, 90%] of the total 20000 training iterations. We also
adopted a strategy that makes the model pay more attention to difficult pixels: after 15000
iterations, only the top 20% of pixels with the highest loss were selected to compute gradients.
As described in Subsection 3.1, we added a multi-scale training strategy. Starting from the
initial input size of 384 × 384 pixels, the model was trained with multi-scale factors of 0.75, 1
and 1.25.

3.4. Experimental Results
Table 1 shows the ablation study results on the EndoCV2022 validation and CVC-Clinic datasets.
First, with semi-supervised learning, the Dice coefficient of the model on the validation set
(EndoCV2022 validation + CVC-Clinic) increases by about 3% (0.7338 to 0.7613). This shows that
adding more sequence data for the model to learn from does help. Second, colonoscopy images are
produced under a combined light source, so the collected images are either very bright or very
dark. We apply color jitter (brightness=0.5, contrast=0.03, saturation=0.03) to simulate light
changes. In this way, the Dice coefficient improves to 0.7694. Figure 5 shows that images which
the base model cannot segment benefit from this approach. Lastly, the scale of the images affects
the performance of the model. The multi-scale training strategy reduces the sensitivity of the
model to image resolution, and the Dice coefficient improves to 0.7800.

Table 2 provides our model's segmentation results on the EndoCV2022 challenge segmentation task.
First, the improved-STCN model trained for polyp segmentation performs well on the unseen
dataset, with a Dice coefficient of 0.7423. This result already placed us in the top 5 on the
leaderboard. When we adopt the two methods described in Subsection 2.3, the Dice coefficient
increases by about 2% and 3%, respectively. These results show that our enhanced scheme does
help.
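
Returning to the training details of Subsection 3.3, the two heuristics described there, the
schedule for the maximum temporal distance and the restriction of the loss to the hardest pixels,
can be sketched as below. This is an illustrative sketch, not the authors' training code. The
function names are hypothetical, and two points are assumptions: that the schedule is piecewise
constant (each value holds until the next milestone), and that the loss over the selected pixels
is averaged.

```python
def max_temporal_distance(iteration, total_iters=20000):
    """Maximum frame gap allowed when sampling training frames.

    Assumption: a step schedule, where each distance applies from its
    milestone until the next milestone is reached.
    """
    percents = [0, 10, 20, 30, 40, 90]
    distances = [5, 10, 15, 20, 25, 5]
    current = distances[0]
    for pct, dist in zip(percents, distances):
        # Integer arithmetic avoids floating-point edge cases at milestones.
        if iteration >= pct * total_iters // 100:
            current = dist
    return current


def hard_pixel_loss(pixel_losses, iteration, start_iter=15000, top_percent=20):
    """Mean per-pixel loss; from start_iter on, only the hardest pixels count."""
    if iteration < start_iter:
        return sum(pixel_losses) / len(pixel_losses)
    k = max(1, len(pixel_losses) * top_percent // 100)
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k


# Toy per-pixel cross-entropy values for one image.
losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
early = hard_pixel_loss(losses, iteration=0)       # all pixels contribute
late = hard_pixel_loss(losses, iteration=15000)    # only the top 20% contribute
```

With the ten toy values, the early loss averages all pixels while the late loss averages only the
two largest, so gradients late in training come from the pixels the model still gets wrong.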
Table 1
Ablation study result of EndoCV2022 validation combined with CVC-Clinic datasets

    Method                                                            Dice      IOU
    base                                                              0.7338    0.6701
    semi-supervised learning                                          0.7613    0.6894
    semi-supervised learning + Light Change                           0.7694    0.7058
    semi-supervised learning + Light Change + Multi-scale training    0.7800    0.7237

Figure 5: Comparison of model segmentation under strong light and low light. (a) shows that the
model trained with the light change strategy performs better, while (b) cannot distinguish the
target.

Table 2
Results on EndoCV2022 segmentation task round II test set

    Method                      Dice      std
    STCN                        0.7423    0.3756
    STCN + SFSN                 0.7613    0.3571
    STCN + Reverse Sequence     0.7694    0.3543

As Figure 6 shows, however, our model does not recognize objects in complex scenarios, such as
dim and dark scenes.

Figure 6: Example of model segmentation results on EndoCV2022 round-II. (a) shows an easy case
for the model and (b) shows a hard case in a complex scenario.

4. Conclusion
In this work, we have detailed our solution for the polyp segmentation subtask of the EndoCV2022
challenge. We have proposed the improved-STCN network, with a semi-supervised learning method to
improve the model's generalization and an enhanced scheme to make the model output more credible
results. Limited experimental results show that our method achieves consistently high Dice scores
at very low standard deviations, suggesting its suitability for polyp segmentation on endoscopic
sequence data.

References
 [1] H. K. Cheng, Y.-W. Tai, C.-K. Tang, Rethinking space-time networks with improved memory
     coverage for efficient video object segmentation, Advances in Neural Information Processing
     Systems 34 (2021).
 [2] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran,
     M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based
     polyp detection and segmentation methods through a computer vision challenge, arXiv preprint
     arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
 [3] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul,
     K. V. Anonsen, M. A. Riegler, et al., Polypgen: A multi-center polyp detection and
     segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021).
     doi:10.48550/arXiv.2106.04463.
 [4] S. B. Ahn, D. S. Han, J. H. Bae, T. J. Byun, J. P. Kim, C. S. Eun, The miss rate for
     colorectal adenoma determined by quality-adjusted, back-to-back colonoscopies, Gut and liver
     6 (2012) 64.
 [5] T. K. Lui, C. K. Hui, V. W. Tsui, K. S. Cheung, M. K. Ko, D. C. Foo, L. Y. Mak, C. K. Yeung,
     T. H. Lui, S. Y. Wong, et al., New insights on missed colonic lesions during colonoscopy
     through artificial intelligence-assisted real-time detection (with video), Gastrointestinal
     Endoscopy 93 (2021) 193–200.
 [6] C. Yu, J. Yan, X. Li, Parallel res2net-based network with reverse attention for polyp
     segmentation (2021).
 [7] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image
     segmentation, in: International Conference on Medical image computing and computer-assisted
     intervention, Springer, 2015, pp. 234–241.
 [8] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, A. Sorkine-Hornung, Learning video object
     segmentation from static images, in: Proceedings of the IEEE conference on computer vision
     and pattern recognition, 2017, pp. 2663–2672.
 [9] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen,
     T. d. Lange, D. Johansen, H. D. Johansen, Kvasir-seg:
     A segmented polyp dataset, in: International Con-
     ference on Multimedia Modeling, Springer, 2020,
     pp. 451–462.
[10] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado,
     Toward embedded detection of polyps in wce im-
     ages for early diagnosis of colorectal cancer, Inter-
     national journal of computer assisted radiology and
     surgery 9 (2014) 283–293.
[11] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach,
     D. Gil, C. Rodríguez, F. Vilariño, Wm-dova maps
     for accurate polyp highlighting in colonoscopy: Val-
     idation vs. saliency maps from physicians, Com-
     puterized Medical Imaging and Graphics 43 (2015)
     99–111.
[12] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang,
     G. Cheng, P. Zhang, X. Li, M. Kayser, R. D.
     Soberanis-Mukul, et al., An objective comparison
     of detection and segmentation algorithms for arte-
     facts in clinical endoscopy, Scientific reports 10
     (2020) 1–15.
[13] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Po-
     lat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo,
     B. Matuszewski, et al., Deep learning for detec-
     tion and segmentation of artefact and disease in-
     stances in gastrointestinal endoscopy, Medical
     image analysis 70 (2021) 102002. doi:10.1016/j.
     media.2021.102002.

</pre>