XP-Net: An Attention Segmentation Network by Dual Teacher Hierarchical Knowledge Distillation for Polyp Generalization

Ragu B1, Antony Raj1,2, Rahul GS1,2, Sneha Chand1,2, Preejith SP1 and Mohanasankar Sivaprakasam2
1 Healthcare Technology Innovation Centre, Chennai, India
2 Department of Electrical Engineering, IIT Madras, Chennai, India

Abstract

Endoscopic imaging is widely used as the diagnostic tool for colon-polyp-induced GI tract cancer. Diagnosis via image identification requires expertise that inexperienced physicians may lack; hence, a software-aided approach to detecting these anomalies may identify tissue abnormalities more reliably. In this paper, a novel deep learning network, XP-Net, with an Effective Pyramidal Squeeze Attention (EPSA) module is proposed, trained by hierarchical adversarial knowledge distillation from a combination of two teacher networks. The dual teachers add complementary knowledge to the student network, thus improving its performance. The lightweight EPSA block enhances the network architecture by capturing multi-scale spatial information of objects at a granular level with long-range channel dependency. Compiled into the NVIDIA TensorRT engine, XP-Net gave better real-time throughput. The proposed network achieved a Dice score of 0.839 and an IoU of 0.805 on the validation dataset, and attained an average throughput of 60 fps on a mobile GPU. This deep learning-based segmentation approach is expected to help clinicians address the complications involved in identifying and removing precancerous anomalies more competently.

Keywords: Polyp, Generalization, Attention block, Knowledge distillation

1. Introduction

Colorectal polyps are one of the early indicators of lower gastro-intestinal (GI) tract cancer. These polyps are extra lumps of tissue that serve no particular function in bodily processes [1].
Although these growth tissues are often benign, they can become cancerous. Early detection and removal of polyps in the colon region may prevent these tissues from becoming cancerous. Colonoscopy is a diagnostic procedure widely used to investigate the colon region for any type of malformation and disease [2]. Generally, a trained physician visually inspects the colon region for polyps and removes them using minimally invasive endoscopic surgery. Research on visual inspection of the colon region shows that small adenomas (benign tumors) less than 5 mm in diameter have a miss rate of 27%, while adenomas greater than 10 mm have a miss rate of 6%. It has been reported that the quality of bowel preparation and the experience of colonoscopists are major contributory factors to missed polyps during a colonoscopy [3]. A quick alternative, computer-vision-based polyp detection, is a highly researched area that has been found effective in mitigating miss rates and assisting colonoscopists in faster diagnosis [4]. The addition of deep learning techniques proves much more effective, since a network like U-Net has shown promising results in biomedical imaging and is widely accepted as a state-of-the-art image-to-image translation network [5].

Figure 1: (A) shows a generic hierarchical knowledge distillation using a single teacher, and (B) is our proposed methodology using dual teachers to derive the student network.

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022) in conjunction with the 19th IEEE International Symposium on Biomedical Imaging ISBI2022, March 28th, 2022, IC Royal Bengal, Kolkata, India
ragu.b@htic.iitm.ac.in (R. B); antony.raj@htic.iitm.ac.in (A. Raj)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

In this paper, the U-Net was chosen as the baseline model because of its ability to outperform other segmentation networks with extensive data augmentation regardless of a limited dataset, as reported by Ronneberger et al. [5]. A plug-and-play EPSA module, as proposed by EPSANet [6], was implemented with the U-Net to enhance multi-scale spatial information, which results in the detection of objects over different scale factors [6]. Since the baseline U-Net with the EPSA module was found to be computationally heavy for real-time performance, model compression through knowledge distillation was implemented [7]. Among model compression approaches, knowledge distillation shows great superiority: it transfers the knowledge of a large teacher model to a small student model [8]. In our proposed student network, we implemented separable filters, reducing the model size by 78% relative to the teacher network. We implemented the hierarchical knowledge distillation technique proposed in HAD-Net [8], where a single teacher network is used to distill the knowledge. In our proposed methodology, by contrast, dual teachers transfer complementary knowledge to the student network. All models were trained on the EndoCV2022 Challenge dataset [9][10][11].

In summary, the contributions of this paper are as follows:

• A hierarchical dual teacher knowledge distillation network that transfers the complementary knowledge of both teacher networks to a student.
• A student network with a lower computational cost for real-time performance without significantly reduced accuracy.
• Experiments: evaluating the model's generality on the external Kvasir-Seg dataset [12], Dice and IoU scores of 0.782 and 0.769 are achieved, respectively.

2. XP-Net

2.1. Methodology

In our methodology, a student network is derived from two teacher networks through a hierarchical knowledge distillation process. The two teacher networks, which are computationally heavy, transfer their complementary knowledge to a lightweight student network. The baseline U-Net architecture has the ability to capture features at multiple scales. To enhance this visual perception of the U-Net, we implemented an Effective Pyramidal Squeeze Attention (EPSA) block at the first encoder of the U-Net. This attention mechanism boosts the allocation of the most informative feature expressions while suppressing the less useful ones, allowing the model to focus on clinically crucial areas [6]. The lightweight EPSA block enhanced the architecture's ability to capture multi-scale spatial information of objects at a granular level with long-range channel dependency at the initial stage of the network.

Our first teacher network comprises a U-Net with the EPSA module. Similarly, we trained the second teacher network, a baseline U-Net with the EPSA block, using the pix2pix GAN [13], which has shown promising results for image-to-image translation by learning a loss adapted to the input data and task. The proposed student network consists of separable filters in the same U-Net architecture with the EPSA module, which reduces the number of learnable parameters relative to the defined teacher networks. The hierarchical knowledge distillation technique used in our method was proposed in [8], where a single teacher is used for knowledge distillation. However, the network we have developed utilizes dual teachers via multi-step learning, as suggested in [14], to map the in-between features used to train the student network.

The input and target of the teacher and student networks are denoted by x and y. The output segmentations of the two teachers and the student are denoted by T(1,2)ŷ and Sŷ, respectively. The multi-scale feature maps of the teachers and the student are denoted by T(1,2)ŷ_latent and Sŷ_latent. In hierarchical knowledge distillation, the student loss L_S consists of a weighted combination of two terms: (a) the sum of the Dice loss [15] and the Tversky loss [16] between the student-generated segmentation Sŷ and the ground truth y, and (b) a mean-square-error adversarial loss. The overall student loss is given in Equation 2.

    DV = [Dice Loss + Tversky Loss]                                    (1)

    L_S = DV[Sŷ; y] + λ · MSE[HD(x, Sŷ, Sŷ_latent), 1]                 (2)

The hierarchical discriminator (HD) is trained using the LS-GAN loss, denoted L_HD. L_HD is made up of two mean-square-error terms: one between the HD output after being passed a "fake" data sample from the student and a tensor of all zeros [17], and the other between the HD output after being passed a "real" data sample from either teacher 1 or teacher 2 and a tensor of all ones. The overall discriminator loss is given in Equation 3.

    L_HD = MSE[HD(x, Sŷ, Sŷ_latent), 0]
           + MSE[HD(x, T(1,2)ŷ, T(1,2)ŷ_latent), 1]                    (3)

2.2. Network Architecture

Figure 2: XP-Net Network Architecture. (a) Teacher and Student Discriminator Network; (b) Teacher Network Blocks; (c) Student Network Blocks.

2.2.1. Teacher Network

CABR32-P-CBR64-P-CBR128-P-CBR256-P-CBR512-UPCONV256-CBR256-UPCONV128-CBR128-UPCONV64-CBR64-UPCONV32-CBR32-CE1

• Pool (P): a pooling layer with kernel size (2,2) and stride (2,2).
• CABRK: two stacks of convolution, batch norm and ReLU activation with K output filters, with an intermediate attention block.
• CBRK: two stacks of convolution, batch norm and ReLU activation with K output filters.
• CEK: a (1,1) convolution with K output feature maps and a sigmoid activation function.
• UPCONVK: a transpose convolution layer with kernel size (2,2), stride (2,2) and K output feature maps.

2.2.2. Student Network

CAPDBR32-P-CPDBR64-P-CPDBR128-P-CPDBR256-P-CPDBR512-UPCONV256-CPDBR256-UPCONV128-CPDBR128-UPCONV64-CPDBR64-UPCONV32-CPDBR32-CE1

• CPDBRK: a stack of (A) a point-wise convolution of kernel size (1,5) and a depth-wise convolution of kernel size (1,1), followed by batch norm and ReLU, and (B) a point-wise convolution of kernel size (5,1) and a depth-wise convolution of kernel size (1,1), followed by batch norm and ReLU. All convolution layers have K output feature maps.
• CAPDBRK: a modified version of the CPDBRK block in which an attention block is placed between the two sets of point-wise convolution, depth-wise convolution, batch norm and ReLU.

2.2.3. Discriminator Network

CAT-DC32-CAT-DC128-CAT-DC128-CAT-DC32-CAT-DC32-ENCONV

• CAT: the concatenation of two different layers from either the teacher or the student network.

Figure 3: The input image (a) and ground truth (b), with the student network's mask (e) learned from the teacher 1 (c) and teacher 2 (d) networks.

Table 1: Comparison between teacher and student networks

    Dataset          Network    No. of params   Dice score   IoU score
    EndoCV Dataset   Teach. 1   7,774,374       0.893        0.889
                     Teach. 2   7,774,374       0.871        0.884
                     Student    1,839,333       0.839        0.805
                     U-Net      7,763,041       0.841        0.812
    Kvasir Dataset   Teach. 1   7,774,374       0.812        0.809
                     Teach. 2   7,774,374       0.803        0.798
                     Student    1,839,333       0.783        0.769
                     U-Net      7,763,041       0.798        0.784
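The parameter reduction visible in Table 1 (about 1.84M student parameters versus 7.77M per teacher) comes largely from the separable CPDBR blocks. As a rough illustration of the idea, and not the authors' exact implementation, the following PyTorch sketch factorizes a 5x5 receptive field into a (1,5) stage and a (5,1) stage, each split into a depth-wise and a point-wise convolution followed by batch norm and ReLU; the grouping and the ordering of the two convolutions within each stage are our assumptions:

```python
import torch
import torch.nn as nn

class CPDBR(nn.Module):
    """Sketch of a separable CPDBR block: stage (A) uses a (1,5) kernel,
    stage (B) a (5,1) kernel, each followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.stage_a = nn.Sequential(
            # depth-wise (1,5) convolution over each input channel (assumed grouping)
            nn.Conv2d(in_ch, in_ch, kernel_size=(1, 5), padding=(0, 2), groups=in_ch),
            # point-wise (1,1) convolution mixing channels
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.stage_b = nn.Sequential(
            # depth-wise (5,1) convolution
            nn.Conv2d(out_ch, out_ch, kernel_size=(5, 1), padding=(2, 0), groups=out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stage_b(self.stage_a(x))

def param_count(m):
    # Total number of learnable parameters in a module
    return sum(p.numel() for p in m.parameters())
```

Comparing `param_count` of such a block against a dense 5x5 convolution with the same channel widths shows the kind of saving that lets the student shrink to roughly a quarter of the teacher's size while keeping the U-Net topology.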
• DCK: a stack of convolution with kernel size (3,3), padding (1,1) and stride (1,1), with instance norm and leaky ReLU with a negative slope of 0.2.

The hierarchical discriminator consists of five discriminator blocks (DC) and an end convolution (ENCONV). In our proposed model, the feature maps from encoder 1, encoder 3, decoder 1 and decoder 3 of the teacher or student network are used for hierarchical knowledge distillation. The full network architecture is described in Fig. 2.

3. Dataset and Implementation

3.1. Dataset

Automatic polyp detection and classification requires the availability of large datasets of polyp images or videos along with high-quality manual annotations provided by experts. These annotations provide the ground truth necessary to train supervised deep learning models. The EndoCV2022 challenge provided us with a sequence dataset of 2631 images with their corresponding ground truth masks [9][10][11]. Of this, we utilized more than 95% of the data for training and 5% for testing. An external dataset, Kvasir-Seg, was utilized for testing model generality.

3.1.1. Dataset augmentation

All models were trained with an input image size of 512x512. Data augmentations such as random rotation, horizontal flip, vertical flip and perspective transform were implemented. Endoscopic images are usually subjected to different light sources that may differ in brightness, contrast and hue, so the images were augmented to replicate those scenarios.

3.2. Implementation

Both teacher networks were trained using the Adam optimizer with an initial learning rate of 3e-4 and a step learning rate scheduler with gamma 0.1 and step size 30. The networks were trained for 450 epochs with a batch size of 8. The student network was trained using the Adam optimizer with β1 = 0.5 and β2 = 0.999, an initial learning rate of 1e-4, and a step learning rate scheduler with gamma 0.1 and step size 30.
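The training setup above, together with the student objective of Equations 1 and 2, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the Tversky α/β weights, λ = 1, and the stand-in `Conv2d` model are our assumptions.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|), on soft binary masks
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def tversky_loss(pred, target, alpha=0.7, beta=0.3, eps=1e-6):
    # Weights false positives (alpha) and false negatives (beta) separately;
    # alpha/beta here are illustrative, not values from the paper.
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def student_loss(student_mask, gt_mask, hd_out, lam=1.0):
    # L_S = DV[Sŷ; y] + λ · MSE[HD(x, Sŷ, Sŷ_latent), 1]   (Eq. 2)
    dv = dice_loss(student_mask, gt_mask) + tversky_loss(student_mask, gt_mask)
    adv = torch.mean((hd_out - 1.0) ** 2)  # LS-GAN MSE against a tensor of ones
    return dv + lam * adv

# Reported student settings: Adam with lr 1e-4 and betas (0.5, 0.999),
# StepLR with gamma 0.1 every 30 steps; the Conv2d is a stand-in model.
model = torch.nn.Conv2d(3, 1, kernel_size=1)
torch.nn.init.kaiming_uniform_(model.weight)  # the init the authors found best
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)
```

With a perfect prediction and a discriminator output of all ones, both the segmentation term and the adversarial term vanish, which is a quick sanity check on the sign conventions of Equation 2.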
After multiple experiments initializing weights with the uniform, Xavier-uniform and Kaiming-uniform schemes provided by PyTorch, we found that Kaiming-uniform weight initialization helped the model converge best. We also implemented our model with the NVIDIA TensorRT inference library for effective real-time throughput. All models were trained using an NVIDIA RTX 3090 GPU.

4. Results and Discussion

The networks were evaluated and the computed metrics are reported in Table 1. On the validation data of the EndoCV dataset, the teacher 1 model achieved 0.893 and 0.889 for Dice score and IoU, respectively. Similarly, the teacher 2 model achieved 0.871 and 0.884 for the same metrics. The student network achieved a commendable Dice of 0.839 and IoU of 0.805 even with its reduced number of learnable parameters. The trade-off here is the larger teacher networks for a minimal loss in accuracy of the lightweight student network. The same metrics were calculated for the Kvasir-Seg dataset and are reported in Table 1.

Results show that teacher 2 performs better than teacher 1 for regions with a higher amount of specular reflection. The student network thus obtains complementary knowledge from the two teacher networks. With reference to the ground truth, it is observed that the student network produced proper segmentation even when one of the teachers had missed areas in its segmentation masks, as shown in Fig. 3. These results show that knowledge from multiple teachers helps generalize segmentation better.

As part of benchmarking the network's inference time, the model was converted into a TensorRT engine for faster throughput. The model attained an average throughput of 60 fps on a GeForce RTX 3070 mobile GPU and 120 fps on an NVIDIA RTX 3090 GPU. From these results, we believe that constructing multiple teacher models that focus on various aspects of the input data can distill a superior student network.

5. Conclusion

The proposed network is lightweight and computes faster than traditional segmentation networks. Since it uses dual teachers for knowledge distillation, increasing the number of teacher networks leaves room for further performance improvement. Moreover, the sample size of the data also plays a crucial role in the accuracy of the network. Further studies can be done to design a much more intelligent network for polyps and other varieties of early cancer tissues.

References

[1] B. Levin, D. A. Lieberman, B. McFarland, K. S. Andrews, D. Brooks, J. Bond, C. Dash, F. M. Giardiello, S. Glick, D. Johnson, et al., Screening and surveillance for the early detection of colorectal cancer and adenomatous polyps, 2008: a joint guideline from the American Cancer Society, the US Multi-Society Task Force on Colorectal Cancer, and the American College of Radiology, Gastroenterology 134 (2008) 1570–1595.
[2] D. K. Rex, J. L. Petrini, T. H. Baron, A. Chak, J. Cohen, S. E. Deal, B. Hoffman, B. C. Jacobson, K. Mergener, B. T. Petersen, et al., Quality indicators for colonoscopy, Gastrointestinal Endoscopy 63 (2006) S16–S28.
[3] S. N. Bonnington, M. D. Rutter, Surveillance of colonic polyps: are we getting it right?, World Journal of Gastroenterology 22 (2016) 1925.
[4] Y. Mintz, R. Brodie, Introduction to artificial intelligence in medicine, Minimally Invasive Therapy & Allied Technologies 28 (2019) 73–81.
[5] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[6] H. Zhang, K. Zu, J. Lu, Y. Zou, D. Meng, EPSANet: An efficient pyramid squeeze attention block on convolutional neural network, arXiv preprint arXiv:2105.14447 (2021).
[7] G. Hinton, O. Vinyals, J. Dean, et al., Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015).
[8] S. Vadacchino, R. Mehta, N. M. Sepahvand, B. Nichyporuk, J. J. Clark, T. Arbel, HAD-Net: A hierarchical adversarial knowledge distillation network for improved enhanced tumour segmentation without post-contrast images, in: Medical Imaging with Deep Learning, PMLR, 2021, pp. 787–801.
[9] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, M. Gridach, I. Voiculescu, V. Yoganand, A. Chavan, A. Raj, N. T. Nguyen, D. Q. Tran, L. D. Huynh, N. Boutry, S. Rezvy, H. Chen, Y. H. Choi, A. Subramanian, V. Balasubramanian, X. W. Gao, H. Hu, Y. Liao, D. Stoyanov, C. Daul, S. Realdon, R. Cannizzaro, D. Lamarque, T. Tran-Nguyen, A. Bailey, B. Braden, J. E. East, J. Rittscher, Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[10] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[11] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[12] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov, M. Lux, D. T. D. Nguyen, et al., HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy, Scientific Data 7 (2020) 1–14.
[13] P. Isola, J.-Y. Zhu, T. Zhou, A. A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[14] Z.-Q. Zhao, Y. Gao, Y. Ge, W. Tian, Orderly dual-teacher knowledge distillation for lightweight human pose estimation, arXiv preprint arXiv:2104.10414 (2021).
[15] X. Li, X. Sun, Y. Meng, J. Liang, F. Wu, J. Li, Dice loss for data-imbalanced NLP tasks, arXiv preprint arXiv:1911.02855 (2019).
[16] N. Nasalwai, N. S. Punn, S. K. Sonbhadra, S. Agarwal, Addressing the class imbalance problem in medical image segmentation via accelerated Tversky loss function, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2021, pp. 390–402.
[17] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, S. Paul Smolley, Least squares generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.