KD-ResUNet++: Automatic Polyp Segmentation via Self-Knowledge Distillation

Jaeyong Kang1, Jeonghwan Gwak1,2,*
1 Department of Software, Korea National University of Transportation, Chungju 27469, South Korea
2 Department of IT · Energy Convergence (BK21 FOUR), Korea National University of Transportation, Chungju 27469, South Korea

ABSTRACT
In this paper, we present our method for the Medico automatic polyp segmentation challenge at MediaEval 2020. Our method applies the knowledge distillation technique to improve ResUNet++, an architecture that performs well on automatic polyp segmentation. In our experiments, the proposed model, called KD-ResUNet++, outperforms ResUNet++ in terms of Jaccard index, Dice similarity coefficient, and recall. Our best model achieved a Jaccard index, Dice similarity coefficient, and FPS of 0.6196, 0.7089, and 107.8797, respectively, on the official test dataset of the challenge.

1 INTRODUCTION
Automatic polyp segmentation is a challenging task due to variations in the shape and size of polyps. In this paper, we propose KD-ResUNet++, which is based on the ResUNet++ architecture [9] and knowledge distillation, for the Medico automatic polyp segmentation challenge at MediaEval 2020 [7]. Knowledge distillation is a method to transfer knowledge from one architecture (e.g., a teacher) to another (e.g., a student). In particular, we use self-knowledge distillation, where the teacher and student architectures are the same.

2 RELATED WORKS
2.1 ResUNet++
U-Net is a very popular deep learning architecture for biomedical image segmentation; it won the 2015 ISBI cell tracking challenge. ResUNet [14] is an improved U-Net architecture that takes advantage of the strengths of both the U-Net architecture and deep residual learning. ResUNet++ [9] is an improved ResUNet architecture that further takes advantage of attention blocks, Atrous Spatial Pyramidal Pooling (ASPP), and squeeze-and-excitation blocks. As reported in [9], ResUNet++ shows state-of-the-art performance on automatic polyp segmentation.

2.2 Knowledge distillation
Knowledge distillation aims at transferring dark knowledge from a teacher model, which may be wide [11], deep [2, 10, 12], or an ensemble of models [5], to a student model that is typically thin and small. Trained in this way, the student model can mimic the behavior of the teacher model, such as its class probability distribution, and achieve better performance than the same model trained independently with hard labels. Self-knowledge distillation refers to the special case where the teacher and student architectures are the same. It has been consistently reported [1, 4, 6, 13] that student models trained with self-knowledge distillation outperform their teacher models by significant margins on several language modeling and computer vision tasks.

3 METHODS
In this section, the architecture of our proposed method for automatic polyp segmentation is first presented. After that, we describe the details of its key components in the following subsections. The overall architecture of our proposed model is shown in Figure 1. First, input images are augmented by the data augmentation module. Second, the augmented images are used as the input of both the student model and the teacher model. Third, the distillation loss between the output of the student model and the output of the teacher model and the student loss between the output of the student model and the ground-truth label are calculated to train the student model.

Figure 1: Our proposed KD-ResUNet++ architecture

3.1 Data augmentation
Deep learning models require a large amount of training data to work effectively. However, the provided colonoscopy dataset is not very large. To solve this problem, data augmentation can be used to turn a relatively small dataset into a larger one. It has been reported that the performance of a deep learning model can be improved by augmenting the existing data rather than collecting new data. In our data augmentation step, we used two augmentation strategies (rotation and horizontal flipping) to generate new training sets. The rotation operation is done by randomly rotating the input by 90 degrees zero or more times, filling any area of the rotated image that contains no image pixels with black. In addition, we applied horizontal flipping to each of the rotated images.
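The paper does not include code for this augmentation module. The following is a minimal NumPy sketch of the described strategy, assuming square image and mask arrays in H x W x C layout (so that rotations by multiples of 90 degrees need no black fill); the function name augment and the choice to enumerate all rotation/flip combinations rather than sample them randomly are our own assumptions.

    import numpy as np

    def augment(image, mask):
        # Generate augmented (image, mask) pairs using the two strategies
        # from Section 3.1: rotation by multiples of 90 degrees and a
        # horizontal flip of each rotated image. The mask is transformed
        # identically so it stays aligned with its image.
        pairs = []
        for k in range(4):  # 0, 90, 180, and 270 degrees
            rot_img, rot_msk = np.rot90(image, k), np.rot90(mask, k)
            pairs.append((rot_img, rot_msk))
            pairs.append((np.fliplr(rot_img), np.fliplr(rot_msk)))
        return pairs  # 8 variants per original training pair

Enumerating all eight variants is one reading of "rotating the input by 90 degrees zero or more times"; sampling a random rotation per training step would be an equally valid implementation.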
3.2 Training with self-knowledge distillation
In our proposed approach, we use self-knowledge distillation, where the teacher network and the student network are the same; we use ResUNet++ for both. In knowledge distillation, the teacher network is first trained and then transfers its knowledge to the student network. The loss function consists of 1) the distillation loss and 2) the student loss. The distillation loss L_dist is calculated using the Dice loss between the output of the student model y_s and the output of the pre-trained teacher model y_t, and the student loss L_s is calculated using the Dice loss between the output of the student model y_s and the ground-truth label y_true, as follows:

    L_dist = 1 - Dice(y_s, y_t)        (1)
    L_s = 1 - Dice(y_s, y_true)        (2)

The total loss L_total is then calculated as a weighted combination of the distillation and student losses as follows:

    L_total = 0.1 * L_dist + L_s       (3)
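As a concrete illustration, Eqs. (1)-(3) can be written in a few lines of PyTorch. This is a sketch under our own assumptions: the paper does not specify the exact Dice formulation, so a standard soft Dice with a small epsilon for numerical stability is used here, and the teacher output is detached since only the student is updated during distillation.

    import torch

    def dice(pred, target, eps=1e-6):
        # Soft Dice coefficient between two segmentation maps in [0, 1].
        inter = (pred * target).sum()
        return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    def total_loss(y_s, y_t, y_true):
        l_dist = 1.0 - dice(y_s, y_t.detach())  # distillation loss, Eq. (1)
        l_s = 1.0 - dice(y_s, y_true)           # student loss, Eq. (2)
        return 0.1 * l_dist + l_s               # total loss, Eq. (3)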
4 EXPERIMENTS AND RESULTS
4.1 Dataset
We trained our proposed model on the Kvasir-SEG dataset [8], the benchmark dataset for the 2020 Medico automatic polyp segmentation challenge. It consists of 1,000 polyp images and their corresponding ground-truth masks, annotated by expert endoscopists from Oslo University Hospital, Norway.

4.2 Experimental Setting
The dataset is split into 88% for learning the weights and 12% for validating the model during the training step. Before training, we augment the input images using the data augmentation module described in Section 3.1 and resize them to 256x256 pixels. The validation set is only normalized. The learning rate is set to 0.001, and we use Adam as our optimizer.
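For completeness, the setting above can be sketched in PyTorch as follows. Only the 88/12 split, the 256x256 input size, and Adam with a learning rate of 0.001 come from the paper; the random tensors stand in for the loaded Kvasir-SEG pairs, and the 1x1 convolution is a placeholder for a real ResUNet++ implementation.

    import torch
    from torch.utils.data import TensorDataset, random_split

    # Placeholder tensors standing in for the 1,000 Kvasir-SEG image/mask
    # pairs, already resized to 256x256 pixels (Section 4.2).
    images = torch.rand(1000, 3, 256, 256)
    masks = torch.randint(0, 2, (1000, 1, 256, 256)).float()
    dataset = TensorDataset(images, masks)

    # 88% of the data for learning the weights, 12% for validation.
    n_train = int(0.88 * len(dataset))
    train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

    # Adam optimizer with a learning rate of 0.001; in the real pipeline,
    # student would be a ResUNet++ instance (placeholder model used here).
    student = torch.nn.Conv2d(3, 1, kernel_size=1)
    optimizer = torch.optim.Adam(student.parameters(), lr=0.001)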
4.3 Results
Our results on the validation set are presented in Table 1, and the official results on the polyp segmentation and algorithm efficiency tasks, evaluated on the official test dataset, are shown in Tables 2 and 3, respectively.

Table 1: Results on the validation set

Model          Jaccard   DSC      Recall   Precision
ResUNet++      0.7342    0.8120   0.8260   0.8892
KD-ResUNet++   0.7530    0.8310   0.8495   0.8701

Table 2: Official results on the polyp segmentation task

Model          Jaccard   DSC      Recall   Precision
KD-ResUNet++   0.6196    0.7089   0.7287   0.7914

Table 3: Official results on the algorithm efficiency task

Model          FPS        Mean time taken (s)
KD-ResUNet++   107.8797   0.0093

Table 1 shows that ResUNet++ achieved slightly better precision than KD-ResUNet++; however, KD-ResUNet++ outperforms ResUNet++ in terms of Jaccard index, Dice similarity coefficient, and recall, and the Dice similarity coefficient in particular is an important metric for the semantic segmentation task. Tables 2 and 3 show that our proposed model achieved a Jaccard index, Dice similarity coefficient, and FPS of 0.6196, 0.7089, and 107.8797, respectively, on the official test dataset. In addition, examples of three different segmentations produced by ResUNet++ and KD-ResUNet++ are depicted in Figure 2, which shows that the results of KD-ResUNet++ are more similar to the ground truth than those of ResUNet++.

Figure 2: Examples of three different segmentations produced by ResUNet++ and KD-ResUNet++

5 CONCLUSION
In this paper, we presented KD-ResUNet++ for automatic polyp segmentation. In our proposed framework, a data augmentation technique is applied to the input images, and we use self-knowledge distillation, where the teacher and student networks are the same; we use the ResUNet++ model for both. Our proposed model is evaluated on the validation set as well as the official test set. Our experimental results show that the proposed model outperforms ResUNet++ in terms of Jaccard index, Dice similarity coefficient, and recall. These results indicate that our method can capture polyp segmentation boundaries well and could potentially be used in clinical settings. In the future, we plan to use different knowledge types in our knowledge distillation. We also plan to modify the ResUNet++ architecture to incorporate a model pre-trained on a large image dataset (e.g., ImageNet [3]), both to reduce the long training time normally required to train a deep learning model from scratch and to remove the requirement of having a large training dataset.

ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant No. NRF-2020R1I1A3074141).

REFERENCES
[1] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D. Lawrence, and Zhenwen Dai. 2019. Variational Information Distillation for Knowledge Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9163-9171.
[2] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning Efficient Object Detection Models with Knowledge Distillation. In Advances in Neural Information Processing Systems. 742-751.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248-255.
[4] Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born Again Neural Networks. arXiv preprint arXiv:1805.04770 (2018).
[5] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
[6] Thi Kieu Khanh Ho and Jeonghwan Gwak. 2020. Utilizing Knowledge Distillation in Deep Learning for Classification of Chest X-Ray Abnormalities. IEEE Access 8 (2020), 160749-160761.
[7] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard D. Johansen, Dag Johansen, Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. 2020. Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation. In Proc. of the MediaEval 2020 CEUR Workshop.
[8] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D. Johansen. 2020. Kvasir-SEG: A Segmented Polyp Dataset. In Proc. of the International Conference on Multimedia Modeling (MMM). 451-462.
[9] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Dag Johansen, Thomas De Lange, Pål Halvorsen, and Håvard D. Johansen. 2019. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proc. of the International Symposium on Multimedia. 225-230.
[10] Jaeyong Kang and Jeonghwan Gwak. 2020. Ensemble Learning of Lightweight Deep Learning Models Using Knowledge Distillation for Image Classification. Mathematics 8, 10 (2020), 1652.
[11] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550 (2014).
[12] Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matt Richardson. 2016. Do Deep Convolutional Nets Really Need to Be Deep and Convolutional? arXiv preprint arXiv:1603.05691 (2016).
[13] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L. Yuille. 2019. Training Deep Neural Networks in Generations: A More Tolerant Teacher Educates Better Students. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5628-5635.
[14] Zhengxin Zhang, Qingjie Liu, and Yunhong Wang. 2018. Road Extraction by Deep Residual U-Net. IEEE Geoscience and Remote Sensing Letters 15, 5 (May 2018), 749-753. https://doi.org/10.1109/lgrs.2018.2802944