XP-Net: An Attention Segmentation Network by Dual Teacher Hierarchical Knowledge Distillation for Polyp Generalization

Ragu B1, Antony Raj1,2, Rahul GS1,2, Sneha Chand1,2, Preejith SP1 and Mohanasankar Sivaprakasam2
1 Healthcare Technology Innovation Centre, Chennai, India
2 Department of Electrical Engineering, IIT Madras, Chennai, India

Abstract

Endoscopic imaging is widely used as the diagnostic tool for colon-polyp-induced GI tract cancer. Diagnosis via image identification requires expertise that inexperienced physicians may lack; hence, a software-aided approach to detecting these anomalies may identify tissue abnormalities more reliably. In this paper, a novel deep learning network, XP-Net, with an Effective Pyramidal Squeeze Attention (EPSA) module is proposed, trained by hierarchical adversarial knowledge distillation from a combination of two teacher networks. The dual teachers add complementary knowledge to the student network, thus improving its performance. The lightweight EPSA block enhances the network architecture by capturing multi-scale spatial information of objects at a granular level with long-range channel dependency. Compiled into the NVIDIA TensorRT engine, XP-Net gave better real-time throughput. The proposed network achieved a Dice score of 0.839 and an IoU of 0.805 on the validation dataset, and attained an average throughput of 60 fps on a mobile GPU. This deep learning-based segmentation approach is expected to help clinicians address the complications involved in identifying and removing precancerous anomalies more competently.

Keywords: Polyp, Generalization, Attention block, Knowledge distillation

1. Introduction

Colorectal polyps are one of the early indicators of lower gastro-intestinal (GI) tract cancer. These polyps are extra lumps of tissue that serve no particular function in bodily processes [1].
Although these growth tissues are often benign, they can become cancerous. Early detection and removal of polyps in the colon region may prevent these tissues from becoming cancerous. Colonoscopy is a diagnostic procedure widely used to investigate the colon region for any type of malformation and disease [2]. Generally, a trained physician visually inspects the colon region for polyps and removes them using minimally invasive endoscopic surgery. Research on visual inspection of the colon region shows that small adenomas (benign tumors) less than 5 mm in diameter have a miss rate of 27%, while adenomas greater than 10 mm have a miss rate of 6%. It has been reported that the quality of bowel preparation and the experience of colonoscopists are major contributory factors to missed polyps during a colonoscopy [3]. A quick alternative, computer-vision-based polyp detection, is a highly researched area that has been found effective in mitigating miss rates and assisting colonoscopists in faster diagnosis [4]. The addition of deep learning techniques proves much more effective, since a network like U-Net has shown promising results in biomedical imaging and is widely accepted as a state-of-the-art image-to-image translation network [5].

Figure 1: (A) shows a generic hierarchical knowledge distillation using a single teacher, and (B) is our proposed methodology using dual teachers to derive the student network.

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022) in conjunction with the 19th IEEE International Symposium on Biomedical Imaging ISBI2022, March 28th, 2022, IC Royal Bengal, Kolkata, India
ragu.b@htic.iitm.ac.in (R. B); antony.raj@htic.iitm.ac.in (A. Raj)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

In this paper, the U-Net was chosen as the baseline model because of its ability to outperform other segmentation networks with extensive data augmentation regardless of a limited dataset, as reported by Ronneberger et al. [5]. A plug-and-play EPSA module, as proposed by EPSANet [6], was implemented with the U-Net to enhance multi-scale spatial information, which results in the detection of objects over different scale factors [6]. Since the baseline U-Net with the EPSA module was found to be computationally heavy for real-time performance, model compression through knowledge distillation was implemented [7]. Among model compression approaches, knowledge distillation shows great superiority: it transfers the knowledge of a large teacher model to a small student model [8]. In our proposed student network, we implemented separable filters, reducing the model size by 78% relative to the teacher network. We implemented the hierarchical knowledge distillation technique proposed in HAD-Net [8], where a single teacher network is used to distill the knowledge. In our proposed methodology, by contrast, dual teachers transfer complementary knowledge to the student network. All models were trained on the EndoCV2022 Challenge dataset [9][10][11].

In summary, the contributions of this paper are as follows:

• A hierarchical dual teacher knowledge distillation network that transfers the complementary knowledge of both teacher networks to a student.
• A student network with a lower computational cost for real-time performance without significantly reduced accuracy.
• Experiments: evaluating the model's generality on the external Kvasir-Seg dataset [12], Dice and IoU scores of 0.782 and 0.769 are achieved, respectively.

2. XP-Net

2.1. Methodology

In our methodology, a student network is derived from two teacher networks through a hierarchical knowledge distillation process. The two teacher networks, which are computationally heavy, transfer their complementary knowledge to a lightweight student network. The baseline U-Net architecture has the ability to capture features at multiple scales. To enhance this visual perception of the U-Net, we implemented an Effective Pyramidal Squeeze Attention (EPSA) block at the first encoder of the U-Net. This attention mechanism boosts the allocation of the most informative feature expressions while suppressing the less useful ones, allowing the model to focus on clinically crucial areas [6]. The lightweight EPSA block enhanced the architecture's ability to capture multi-scale spatial information of objects at a granular level with long-range channel dependency at the initial stage of the network.

Our first teacher network comprises a U-Net with the EPSA module. Similarly, we trained the second teacher network, a baseline U-Net with the EPSA block, using the pix2pix GAN [13], which has shown promising results for image-to-image translation by learning a loss adapted to the input data and task. The proposed student network consists of separable filters in the same U-Net architecture with the EPSA module, which reduces the number of learnable parameters relative to the defined teacher networks. The hierarchical knowledge distillation technique used in our method was proposed in [8], where a single teacher is used for knowledge distillation. However, the network we have developed utilizes dual teachers via multi-step learning, as suggested in [14], to map the in-between features used to train the student network.

The input and target of the teacher and student networks are denoted by x and y. The output segmentations of the two teachers and the student are denoted by T(1,2)ŷ and Sŷ, respectively. The multi-scale feature maps of the teachers and the student are denoted by T(1,2)ŷ_latent and Sŷ_latent. In hierarchical knowledge distillation, the student loss L_S consists of a weighted combination of two terms: (a) the sum of the Dice loss [15] and the Tversky loss [16] between the student-generated segmentation Sŷ and the ground truth y, and (b) a mean-square-error adversarial loss. The overall student loss is given in Equation 2.

    DV = [Dice Loss + Tversky Loss]                                    (1)

    L_S = DV[Sŷ; y] + λ · MSE[HD(x, Sŷ, Sŷ_latent), 1]                 (2)

The hierarchical discriminator (HD) is trained using the LS-GAN loss, denoted L_HD. L_HD is made up of two mean-square-error terms: one between the HD output after being passed a "fake" data sample from the student and a tensor of all zeros [17], and the other between the HD output after being passed a "real" data sample from either teacher 1 or teacher 2 and a tensor of all ones. The overall discriminator loss is given in Equation 3.

    L_HD = MSE[HD(x, Sŷ, Sŷ_latent), 0]
           + MSE[HD(x, T(1,2)ŷ, T(1,2)ŷ_latent), 1]                    (3)

2.2. Network Architecture

Figure 2: XP-Net Network Architecture. (a) Teacher and Student Discriminator Network; (b) Teacher Network Blocks; (c) Student Network Blocks.

2.2.1. Teacher Network

CABR32-P-CBR64-P-CBR128-P-CBR256-P-CBR512-UPCONV256-CBR256-UPCONV128-CBR128-UPCONV64-CBR64-UPCONV32-CBR32-CE1

• Pool (P): a pooling layer with kernel size (2,2) and stride (2,2).
• CABRK: two stacks of convolution, batch norm and ReLU activation with K output filters, with an intermediate attention block.
• CBRK: two stacks of convolution, batch norm and ReLU activation with K output filters.
• CEK: a (1,1) convolution with K output feature maps and a sigmoid activation function.
• UPCONVK: a transpose convolution layer with kernel size (2,2), stride (2,2) and K output feature maps.

2.2.2. Student Network

CAPDBR32-P-CPDBR64-P-CPDBR128-P-CPDBR256-P-CPDBR512-UPCONV256-CPDBR256-UPCONV128-CPDBR128-UPCONV64-CPDBR64-UPCONV32-CPDBR32-CE1

• CPDBRK: a stack of (A) a point-wise convolution of kernel size (1,5) and a depth-wise convolution of kernel size (1,1), followed by batch norm and ReLU, and (B) a point-wise convolution of kernel size (5,1) and a depth-wise convolution of kernel size (1,1), followed by batch norm and ReLU. All convolution layers have K output feature maps.
• CAPDBRK: a modified version of the CPDBRK block in which an attention block is placed between the two sets of point-wise convolution, depth-wise convolution, batch norm and ReLU.

2.2.3. Discriminator Network

CAT-DC32-CAT-DC128-CAT-DC128-CAT-DC32-CAT-DC32-ENCONV

• CAT: the concatenation of two different layers from either the teacher or the student network.

Figure 3: The input image (a) and ground truth (b), with the student network's mask (e) learned from the teacher 1 (c) and teacher 2 (d) networks.

Table 1: Comparison between teacher and student networks

    Dataset          Network    No. of params   Dice score   IoU score
    EndoCV Dataset   Teach. 1   7,774,374       0.893        0.889
                     Teach. 2   7,774,374       0.871        0.884
                     Student    1,839,333       0.839        0.805
                     U-Net      7,763,041       0.841        0.812
    Kvasir Dataset   Teach. 1   7,774,374       0.812        0.809
                     Teach. 2   7,774,374       0.803        0.798
                     Student    1,839,333       0.783        0.769
                     U-Net      7,763,041       0.798        0.784
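The parameter reduction visible in Table 1 (about 1.84M student parameters versus 7.77M per teacher) comes largely from the separable CPDBR blocks. As a rough illustration of the idea, and not the authors' exact implementation, the following PyTorch sketch factorizes a 5x5 receptive field into a (1,5) stage and a (5,1) stage, each split into a depth-wise and a point-wise convolution followed by batch norm and ReLU; the grouping and the ordering of the two convolutions within each stage are our assumptions:

```python
import torch
import torch.nn as nn

class CPDBR(nn.Module):
    """Sketch of a separable CPDBR block: stage (A) uses a (1,5) kernel,
    stage (B) a (5,1) kernel, each followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.stage_a = nn.Sequential(
            # depth-wise (1,5) convolution over each input channel (assumed grouping)
            nn.Conv2d(in_ch, in_ch, kernel_size=(1, 5), padding=(0, 2), groups=in_ch),
            # point-wise (1,1) convolution mixing channels
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.stage_b = nn.Sequential(
            # depth-wise (5,1) convolution
            nn.Conv2d(out_ch, out_ch, kernel_size=(5, 1), padding=(2, 0), groups=out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stage_b(self.stage_a(x))

def param_count(m):
    # Total number of learnable parameters in a module
    return sum(p.numel() for p in m.parameters())
```

Comparing `param_count` of such a block against a dense 5x5 convolution with the same channel widths shows the kind of saving that lets the student shrink to roughly a quarter of the teacher's size while keeping the U-Net topology.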
• DCK: a stack of convolution with kernel size (3,3), padding (1,1) and stride (1,1), with instance norm and leaky ReLU with a negative slope of 0.2.

The hierarchical discriminator consists of five discriminator blocks (DC) and an end convolution (ENCONV). In our proposed model, the feature maps from encoder 1, encoder 3, decoder 1 and decoder 3 of the teacher or student network are used for hierarchical knowledge distillation. The full network architecture is described in Fig. 2.

3. Dataset and Implementation

3.1. Dataset

Automatic polyp detection and classification requires the availability of large datasets of polyp images or videos along with high-quality manual annotations provided by experts. These annotations provide the ground truth necessary to train supervised deep learning models. The EndoCV2022 challenge provided us with a sequence dataset of 2631 images with their corresponding ground truth masks [9][10][11]. Of this, we utilized more than 95% of the data for training and 5% for testing. An external dataset, Kvasir-Seg, was utilized for testing model generality.

3.1.1. Dataset augmentation

All models were trained with an input image size of 512x512. Data augmentations such as random rotation, horizontal flip, vertical flip and perspective transform were implemented. Endoscopic images are usually subjected to different light sources that may differ in brightness, contrast and hue, so the images were augmented to replicate those scenarios.

3.2. Implementation

Both teacher networks were trained using the Adam optimizer with an initial learning rate of 3e-4 and a step learning rate scheduler with gamma 0.1 and step size 30. The networks were trained for 450 epochs with a batch size of 8. The student network was trained using the Adam optimizer with β1 = 0.5 and β2 = 0.999, an initial learning rate of 1e-4, and a step learning rate scheduler with gamma 0.1 and step size 30.
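The training setup above, together with the student objective of Equations 1 and 2, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the Tversky α/β weights, λ = 1, and the stand-in `Conv2d` model are our assumptions.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|), on soft binary masks
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-6):
    # Weights false positives (alpha) and false negatives (beta) separately;
    # alpha/beta here are illustrative, not values from the paper.
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def student_loss(student_mask, gt_mask, hd_out, lam=1.0):
    # L_S = DV[Sŷ; y] + λ · MSE[HD(x, Sŷ, Sŷ_latent), 1]   (Eq. 2)
    dv = dice_loss(student_mask, gt_mask) + tversky_loss(student_mask, gt_mask)
    adv = torch.mean((hd_out - 1.0) ** 2)  # LS-GAN MSE against a tensor of ones
    return dv + lam * adv

# Reported student settings: Adam with lr 1e-4 and betas (0.5, 0.999),
# StepLR with gamma 0.1 every 30 steps; the Conv2d is a stand-in model.
model = torch.nn.Conv2d(3, 1, kernel_size=1)
torch.nn.init.kaiming_uniform_(model.weight)  # the init the authors found best
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
```

With a perfect prediction and a discriminator output of all ones, both the segmentation term and the adversarial term vanish, which is a quick sanity check on the sign conventions of Equation 2.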
After multiple experiments initializing weights with the uniform, Xavier-uniform and Kaiming-uniform schemes provided by PyTorch, we found that Kaiming-uniform weight initialization helped the model converge best. We also implemented our model with the NVIDIA TensorRT inference library for effective real-time throughput. All models were trained using an NVIDIA RTX 3090 GPU.

4. Results and Discussion

The networks were evaluated and the computed metrics are reported in Table 1. On the validation data of the EndoCV dataset, the teacher 1 model achieved 0.893 and 0.889 for Dice score and IoU, respectively. Similarly, the teacher 2 model achieved 0.871 and 0.884 for the same metrics. The student network achieved a commendable Dice of 0.839 and IoU of 0.805 even with its reduced number of learnable parameters. The trade-off here is the larger teacher networks for a minimal loss in accuracy of the lightweight student network. The same metrics were calculated for the Kvasir-Seg dataset and are reported in Table 1.

Results show that teacher 2 performs better than teacher 1 for regions with a higher amount of specular reflection. The student network thus obtains complementary knowledge from the two teacher networks. With reference to the ground truth, it is observed that the student network produced proper segmentation even when one of the teachers had missed areas in its segmentation masks, as shown in Fig. 3. These results show that knowledge from multiple teachers helps generalize segmentation better.

As part of benchmarking the network's inference time, the model was converted into a TensorRT engine for faster throughput. The model attained an average throughput of 60 fps on a GeForce RTX 3070 mobile GPU and 120 fps on an NVIDIA RTX 3090 GPU. From these results, we believe that constructing multiple teacher models that focus on various aspects of the input data can distill a superior student network.

5. Conclusion

The proposed network is lightweight and computes faster than traditional segmentation networks. Since it uses dual teachers for knowledge distillation, increasing the number of teacher networks leaves room for further performance improvement. Moreover, the sample size of the data also plays a crucial role in the accuracy of the network. Further studies can be done to design a much more intelligent network for polyps and other varieties of early cancer tissues.

References

[1] B. Levin, D. A. Lieberman, B. McFarland, K. S. Andrews, D. Brooks, J. Bond, C. Dash, F. M. Giardiello, S. Glick, D. Johnson, et al., Screening and surveillance for the early detection of colorectal cancer and adenomatous polyps, 2008: a joint guideline from the American Cancer Society, the US Multi-Society Task Force on Colorectal Cancer, and the American College of Radiology, Gastroenterology 134 (2008) 1570–1595.
[2] D. K. Rex, J. L. Petrini, T. H. Baron, A. Chak, J. Cohen, S. E. Deal, B. Hoffman, B. C. Jacobson, K. Mergener, B. T. Petersen, et al., Quality indicators for colonoscopy, Gastrointestinal Endoscopy 63 (2006) S16–S28.
[3] S. N. Bonnington, M. D. Rutter, Surveillance of colonic polyps: are we getting it right?, World Journal of Gastroenterology 22 (2016) 1925.
[4] Y. Mintz, R. Brodie, Introduction to artificial intelligence in medicine, Minimally Invasive Therapy & Allied Technologies 28 (2019) 73–81.
[5] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[6] H. Zhang, K. Zu, J. Lu, Y. Zou, D. Meng, EPSANet: An efficient pyramid squeeze attention block on convolutional neural network, arXiv preprint arXiv:2105.14447 (2021).
[7] G. Hinton, O. Vinyals, J. Dean, et al., Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015).
[8] S. Vadacchino, R. Mehta, N. M. Sepahvand, B. Nichyporuk, J. J. Clark, T. Arbel, HAD-Net: A hierarchical adversarial knowledge distillation network for improved enhanced tumour segmentation without post-contrast images, in: Medical Imaging with Deep Learning, PMLR, 2021, pp. 787–801.
[9] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, M. Gridach, I. Voiculescu, V. Yoganand, A. Chavan, A. Raj, N. T. Nguyen, D. Q. Tran, L. D. Huynh, N. Boutry, S. Rezvy, H. Chen, Y. H. Choi, A. Subramanian, V. Balasubramanian, X. W. Gao, H. Hu, Y. Liao, D. Stoyanov, C. Daul, S. Realdon, R. Cannizzaro, D. Lamarque, T. Tran-Nguyen, A. Bailey, B. Braden, J. E. East, J. Rittscher, Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[10] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[11] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[12] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov, M. Lux, D. T. D. Nguyen, et al., HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy, Scientific Data 7 (2020) 1–14.
[13] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[14] Z.-Q. Zhao, Y. Gao, Y. Ge, W. Tian, Orderly dual-teacher knowledge distillation for lightweight human pose estimation, arXiv preprint arXiv:2104.10414 (2021).
[15] X. Li, X. Sun, Y. Meng, J. Liang, F. Wu, J. Li, Dice loss for data-imbalanced NLP tasks, arXiv preprint arXiv:1911.02855 (2019).
[16] N. Nasalwai, N. S. Punn, S. K. Sonbhadra, S. Agarwal, Addressing the class imbalance problem in medical image segmentation via accelerated Tversky loss function, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2021, pp. 390–402.
[17] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, S. Paul Smolley, Least squares generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.