KD-ResUNet++: Automatic Polyp Segmentation via Self-Knowledge Distillation

Jaeyong Kang1, Jeonghwan Gwak1,2,*
1 Department of Software, Korea National University of Transportation, Chungju 27469, South Korea
2 Department of IT · Energy Convergence (BK21 FOUR), Korea National University of Transportation, Chungju 27469, South Korea

ABSTRACT
In this paper, we present our method for the Medico automatic polyp segmentation challenge at MediaEval 2020. Our method applies the knowledge distillation technique to improve ResUNet++, an architecture that performs well on automatic polyp segmentation. In our experiments, the proposed model, called KD-ResUNet++, outperforms ResUNet++ in terms of Jaccard index, Dice similarity coefficient, and recall. Our best model achieved a Jaccard index, Dice similarity coefficient, and FPS of 0.6196, 0.7089, and 107.8797, respectively, on the official test dataset of the challenge.

1 INTRODUCTION
Automatic polyp segmentation is a challenging task due to variations in the shape and size of polyps. In this paper, we propose KD-ResUNet++, which is based on the ResUNet++ architecture [9] and knowledge distillation, for the Medico automatic polyp segmentation challenge at MediaEval 2020 [7]. Knowledge distillation is a method to transfer knowledge from one architecture (e.g., a teacher) to another (e.g., a student). In particular, we use self-knowledge distillation, where the teacher and student architectures are the same.

2 RELATED WORKS
2.1 ResUNet++
U-Net is a very popular deep learning architecture for biomedical image segmentation; it won the 2015 ISBI cell tracking challenge. ResUNet [14] is an improved U-Net architecture that takes advantage of the strengths of both the U-Net architecture and deep residual learning. ResUNet++ [9] is an improved ResUNet architecture that further takes advantage of attention blocks, Atrous Spatial Pyramidal Pooling (ASPP), and squeeze-and-excitation blocks. As reported in [9], ResUNet++ shows state-of-the-art performance on automatic polyp segmentation.

2.2 Knowledge distillation
Knowledge distillation aims at transferring dark knowledge from a teacher model, which may be wide [11], deep [2, 10, 12], or an ensemble of models [5], to a student model that is typically thin and small. Trained in this way, the student model can mimic the behavior of the teacher model, such as its class probability distribution, and achieve better performance than the same model trained independently with hard labels. Self-knowledge distillation refers to the special case where the teacher and student architectures are the same. It has been consistently reported [1, 4, 6, 13] that student models trained with self-knowledge distillation outperform their teacher models by significant margins on several language modeling and computer vision tasks.

3 METHODS
In this section, the architecture of our proposed method for automatic polyp segmentation is first presented. After that, we describe the details of its key components in the following subsections. The overall architecture of our proposed model is shown in Figure 1. First, input images are augmented by the data augmentation module. Second, the augmented images are used as the input of both the student model and the teacher model. Third, the distillation loss between the output of the student model and the output of the teacher model and the student loss between the output of the student model and the ground-truth label are calculated to train the student model.

Figure 1: Our proposed KD-ResUNet++ architecture

3.1 Data augmentation
Deep learning models require a large amount of training data to work effectively. However, the provided colonoscopy dataset is not very large. To solve this problem, data augmentation can be used to turn a relatively small dataset into a larger one. It has been reported that the performance of a deep learning model can be improved by augmenting the existing data rather than collecting new data. In our data augmentation step, we used two augmentation strategies (rotation and horizontal flipping) to generate new training sets. The rotation operation is done by randomly rotating the input by 90 degrees zero or more times, filling any area of the rotated image that contains no image pixels with black. In addition, we applied horizontal flipping to each of the rotated images.
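The paper does not include code for this augmentation module. The following is a minimal NumPy sketch of the described strategy, assuming square image and mask arrays in H x W x C layout (so that rotations by multiples of 90 degrees need no black fill); the function name augment and the choice to enumerate all rotation/flip combinations rather than sample them randomly are our own assumptions.

    import numpy as np

    def augment(image, mask):
        # Generate augmented (image, mask) pairs using the two strategies
        # from Section 3.1: rotation by multiples of 90 degrees and a
        # horizontal flip of each rotated image. The mask is transformed
        # identically so it stays aligned with its image.
        pairs = []
        for k in range(4):  # 0, 90, 180, and 270 degrees
            rot_img, rot_msk = np.rot90(image, k), np.rot90(mask, k)
            pairs.append((rot_img, rot_msk))
            pairs.append((np.fliplr(rot_img), np.fliplr(rot_msk)))
        return pairs  # 8 variants per original training pair

Enumerating all eight variants is one reading of "rotating the input by 90 degrees zero or more times"; sampling a random rotation per training step would be an equally valid implementation.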
3.2 Training with self-knowledge distillation
In our proposed approach, we use self-knowledge distillation, where the teacher network and the student network are the same; we use ResUNet++ for both. In knowledge distillation, the teacher network is first trained and then transfers its knowledge to the student network. The loss function consists of 1) the distillation loss and 2) the student loss. The distillation loss L_dist is calculated using the Dice loss between the output of the student model y_s and the output of the pre-trained teacher model y_t, and the student loss L_s is calculated using the Dice loss between the output of the student model y_s and the ground-truth label y_true, as follows:

    L_dist = 1 - Dice(y_s, y_t)        (1)
    L_s = 1 - Dice(y_s, y_true)        (2)

The total loss L_total is then calculated as a weighted combination of the distillation and student losses as follows:

    L_total = 0.1 * L_dist + L_s       (3)
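As a concrete illustration, Eqs. (1)-(3) can be written in a few lines of PyTorch. This is a sketch under our own assumptions: the paper does not specify the exact Dice formulation, so a standard soft Dice with a small epsilon for numerical stability is used here, and the teacher output is detached since only the student is updated during distillation.

    import torch

    def dice(pred, target, eps=1e-6):
        # Soft Dice coefficient between two segmentation maps in [0, 1].
        inter = (pred * target).sum()
        return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    def total_loss(y_s, y_t, y_true):
        l_dist = 1.0 - dice(y_s, y_t.detach())  # distillation loss, Eq. (1)
        l_s = 1.0 - dice(y_s, y_true)           # student loss, Eq. (2)
        return 0.1 * l_dist + l_s               # total loss, Eq. (3)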
4 EXPERIMENTS AND RESULTS
4.1 Dataset
We trained our proposed model on the Kvasir-SEG dataset [8], the benchmark dataset for the 2020 Medico automatic polyp segmentation challenge. It consists of 1,000 polyp images and their corresponding ground-truth masks, annotated by expert endoscopists from Oslo University Hospital, Norway.

4.2 Experimental Setting
The dataset is split into 88% for learning the weights and 12% for validating the model during the training step. Before training, we augment the input images using the data augmentation module described in Section 3.1 and resize them to 256x256 pixels. The validation set is only normalized. The learning rate is set to 0.001, and we use Adam as our optimizer.
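For completeness, the setting above can be sketched in PyTorch as follows. Only the 88/12 split, the 256x256 input size, and Adam with a learning rate of 0.001 come from the paper; the random tensors stand in for the loaded Kvasir-SEG pairs, and the 1x1 convolution is a placeholder for a real ResUNet++ implementation.

    import torch
    from torch.utils.data import TensorDataset, random_split

    # Placeholder tensors standing in for the 1,000 Kvasir-SEG image/mask
    # pairs, already resized to 256x256 pixels (Section 4.2).
    images = torch.rand(1000, 3, 256, 256)
    masks = torch.randint(0, 2, (1000, 1, 256, 256)).float()
    dataset = TensorDataset(images, masks)

    # 88% of the data for learning the weights, 12% for validation.
    n_train = int(0.88 * len(dataset))
    train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

    # Adam optimizer with a learning rate of 0.001; in the real pipeline,
    # student would be a ResUNet++ instance (placeholder model used here).
    student = torch.nn.Conv2d(3, 1, kernel_size=1)
    optimizer = torch.optim.Adam(student.parameters(), lr=0.001)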
4.3 Results
Our results on the validation set are presented in Table 1, and the official results on the polyp segmentation and algorithm efficiency tasks, evaluated on the official test dataset, are shown in Tables 2 and 3, respectively.

Table 1: Results on the validation set

Model          Jaccard   DSC      Recall   Precision
ResUNet++      0.7342    0.8120   0.8260   0.8892
KD-ResUNet++   0.7530    0.8310   0.8495   0.8701

Table 2: Official results on the polyp segmentation task

Model          Jaccard   DSC      Recall   Precision
KD-ResUNet++   0.6196    0.7089   0.7287   0.7914

Table 3: Official results on the algorithm efficiency task

Model          FPS        Mean time taken (s)
KD-ResUNet++   107.8797   0.0093

Table 1 shows that ResUNet++ achieved slightly better precision than KD-ResUNet++; however, KD-ResUNet++ outperforms ResUNet++ in terms of Jaccard index, Dice similarity coefficient, and recall, and the Dice similarity coefficient in particular is an important metric for the semantic segmentation task. Tables 2 and 3 show that our proposed model achieved a Jaccard index, Dice similarity coefficient, and FPS of 0.6196, 0.7089, and 107.8797, respectively, on the official test dataset. In addition, examples of three different segmentations produced by ResUNet++ and KD-ResUNet++ are depicted in Figure 2, which shows that the results of KD-ResUNet++ are more similar to the ground truth than those of ResUNet++.

Figure 2: Examples of three different segmentations produced by ResUNet++ and KD-ResUNet++

5 CONCLUSION
In this paper, we presented KD-ResUNet++ for automatic polyp segmentation. In our proposed framework, a data augmentation technique is applied to the input images, and we use self-knowledge distillation, where the teacher and student networks are the same; we use the ResUNet++ model for both. Our proposed model is evaluated on the validation set as well as the official test set. Our experimental results show that the proposed model outperforms ResUNet++ in terms of Jaccard index, Dice similarity coefficient, and recall. These results indicate that our method can capture polyp segmentation boundaries well and could potentially be used in clinical settings. In the future, we plan to use different knowledge types in our knowledge distillation. We also plan to modify the ResUNet++ architecture to incorporate a model pre-trained on a large image dataset (e.g., ImageNet [3]), both to reduce the long training time normally required to train a deep learning model from scratch and to remove the requirement of having a large training dataset.

ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant No. NRF-2020R1I1A3074141).

REFERENCES
[1] Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D. Lawrence, and Zhenwen Dai. 2019. Variational Information Distillation for Knowledge Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9163-9171.
[2] Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning Efficient Object Detection Models with Knowledge Distillation. In Advances in Neural Information Processing Systems. 742-751.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248-255.
[4] Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born Again Neural Networks. arXiv preprint arXiv:1805.04770 (2018).
[5] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531 (2015).
[6] Thi Kieu Khanh Ho and Jeonghwan Gwak. 2020. Utilizing Knowledge Distillation in Deep Learning for Classification of Chest X-Ray Abnormalities. IEEE Access 8 (2020), 160749-160761.
[7] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard D. Johansen, Dag Johansen, Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. 2020. Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation. In Proc. of the MediaEval 2020 CEUR Workshop.
[8] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D. Johansen. 2020. Kvasir-SEG: A Segmented Polyp Dataset. In Proc. of the International Conference on Multimedia Modeling (MMM). 451-462.
[9] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Dag Johansen, Thomas De Lange, Pål Halvorsen, and Håvard D. Johansen. 2019. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proc. of the International Symposium on Multimedia. 225-230.
[10] Jaeyong Kang and Jeonghwan Gwak. 2020. Ensemble Learning of Lightweight Deep Learning Models Using Knowledge Distillation for Image Classification. Mathematics 8, 10 (2020), 1652.
[11] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550 (2014).
[12] Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matt Richardson. 2016. Do Deep Convolutional Nets Really Need to Be Deep and Convolutional? arXiv preprint arXiv:1603.05691 (2016).
[13] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L. Yuille. 2019. Training Deep Neural Networks in Generations: A More Tolerant Teacher Educates Better Students. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5628-5635.
[14] Zhengxin Zhang, Qingjie Liu, and Yunhong Wang. 2018. Road Extraction by Deep Residual U-Net. IEEE Geoscience and Remote Sensing Letters 15, 5 (May 2018), 749-753. https://doi.org/10.1109/lgrs.2018.2802944