Mask Classification-based Method for Polyps Segmentation and Detection

Mariia Kokshaikyna, Yurii Yelisieiev and Mariia Dobko
The Machine Learning Lab, Ukrainian Catholic University, Lviv, Ukraine
mariia.kokshaikyna@ucu.edu.ua (M. K.); yelisieiev@ucu.edu.ua (Y. Yelisieiev); dobkom@ucu.edu.ua (M. D.)

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI2022), March 28th, 2022, IC Royal Bengal, Kolkata, India

Abstract
We introduce a mask classification model with a transformer decoder for polyp segmentation in endoscopy images. Our novel approach combines custom data pre-processing, a modified mask classification network, test time augmentations, and connected-component analysis. We show its successful performance on the polyp semantic segmentation and detection tasks of the EndoCV 2022 challenge.

1. Introduction

Endoscopy is a widely used procedure for detecting and diagnosing multiple diseases. Computer-aided endoscopic image analysis and decision support systems can help doctors with diagnosis and increase its effectiveness. Such systems are mainly used to detect, localize, and segment cancer precursor lesions, also called "polyps." The EndoCV challenge [1, 2, 3, 4] aims to tackle the generalizability aspect of such methods. In 2022, it has two sub-challenges: EAD 2.0 (Endoscopy artefact detection) and PolypGen 2.0 (Polyp generalization). Both tracks set detection and segmentation tasks on a diverse population dataset. This work describes our solution to the EndoCV 2022 challenge on the polyp segmentation and detection tracks.

The dataset of the EndoCV 2022 challenge [1, 2, 3, 4] is diverse and comprises images from various endoscope types, which presents an additional difficulty for any computer-aided system. We decided to simplify the input by cropping out the uninformative part and generalizing the input image at the pre-processing step.

Standardly, the semantic segmentation task is solved as a per-pixel classification problem, applying a classification loss to each output pixel. An alternative approach is mask classification, which, instead of classifying each pixel, predicts a set of binary masks, each associated with a single class prediction. The authors of MaskFormer [5] proposed a modern approach last year by using mask classification to solve both semantic- and instance-level segmentation tasks in a unified manner. The model predicts a set of binary masks, each corresponding to a single global class label. MaskFormer [5] outperforms per-pixel classification baselines on natural scenes.

We propose to use a mask classification-based method for polyp segmentation in endoscopy data. To the best of our knowledge, we are the first to test this model on endoscopic images. We also customize parts of the MaskFormer [5] architecture and show its successful performance for polyp detection. To increase the robustness of our solution, we add test time augmentations (TTA) and perform connected-component analysis (CCA).

Our contribution can be summed up as follows:

• Evaluated and showed the performance of a mask classification method, MaskFormer, on endoscopy data, and added custom modifications that improve the results of MaskFormer for polyp segmentation.
• Presented a step-by-step pre-processing mechanism for training and inference.
• Tested the impact of different loss functions.
• Added custom post-processing using test time augmentations and connected-component analysis.

Figure 1: Our pipeline. The main stages include pre-processing, MaskFormer with modified queries, and post-processing via test time augmentation and connected-component analysis.

2. Data Pre-processing

The PolypGen2.0 sub-challenge dataset consists of 46 sequences with 3348 images with polyp labels. Different endoscopes produced these images with various sizes and artifacts: a black section located at the left part of the image, a blue rectangle with the endoscope position, text artifacts, and others. Overall, we can distinguish 15 types of images among these sequences. Statistics about the different types are shown in Fig. 2.

Figure 2: Different endoscope image types.

For the train and validation sets, we divided the sequences into groups using manually labeled endoscope image types. For the validation set we selected sequences seq1, seq1_endocv22, seq2_endocv22, seq3, seq3_endocv22, seq5_endocv22, seq7_endocv22, seq10, seq13_endocv22, seq14_endocv22, seq15, seq17, seq19_endocv22, seq21_endocv22, and seq24_endocv22. The other sequences were used in the training set. Overall, our training set contains 3306 images and the validation set contains 649 images, which is 19.63% of the total image number.

To bring all images to the same view and use the most informative regions during training, we apply simple pre-processing and automatically crop the images, cutting the black areas on the left and right sides of the input. To do that, we take the center row of the image, sum up the values of the RGB channels in this row, and use a threshold equal to 48. Continuous left and right parts under this threshold are considered redundant and are cut. Examples of cropped images are shown in Fig. 3. This cropping improves the informativeness of the images and the model's generalization.

Figure 3: Examples of cropped images. First column: before, second column: after our pre-processing procedure.
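A minimal sketch of this cropping step is shown below. The NumPy code, the function name, and the H x W x 3 array layout are ours for illustration, not our exact implementation; the threshold of 48 is the value stated above.

```python
import numpy as np

def crop_black_borders(image: np.ndarray, threshold: int = 48) -> np.ndarray:
    """Crop continuous dark regions on the left and right borders of an RGB image.

    We take the center row, sum its RGB values per column, and treat the
    contiguous runs of columns below `threshold` at the left and right
    borders as uninformative.
    """
    center_row = image[image.shape[0] // 2]            # shape (W, 3)
    intensity = center_row.sum(axis=-1)                # per-column RGB sum, shape (W,)
    informative = np.flatnonzero(intensity >= threshold)
    if informative.size == 0:                          # fully dark image: leave unchanged
        return image
    left, right = informative[0], informative[-1]
    return image[:, left:right + 1]
```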
3. Method

We chose MaskFormer [5] as the primary model for our approach. MaskFormer treats semantic segmentation as a classification of masks. This approach is an alternative to per-pixel classification, which predominates in semantic segmentation. Instead of classifying each pixel separately, mask classification decouples semantic segmentation into partitioning the image into regions and classifying these regions. Such an approach is general enough to solve both semantic and instance segmentation problems. MaskFormer is divided into three modules: pixel-level, transformer, and segmentation.

3.1. Pixel-level module

This module is an encoder-decoder architecture typically used for the semantic segmentation task. The encoder part (a backbone) generates a high-level feature representation of the image. We then obtain per-pixel embeddings by iteratively upsampling the feature representation from the encoder. Since this is a typical setting for a per-pixel classification semantic segmentation task, any model of this type can be plugged into this module.

3.2. Transformer module

The transformer module generates N learnable positional embeddings (i.e., queries), as in DETR [6], which encode global information about each segment of the MaskFormer prediction. This module's architecture is adapted from transformers [7], popular for sequence data. In contrast to the standard transformer architecture, in which each output is predicted in an autoregressive manner, each object is decoded in parallel. The attention mechanism encodes information about the relations between these segments and enhances them with the image context.

3.3. Segmentation module

The segmentation module utilizes a linear classifier and a softmax activation function to acquire class probabilities from each query. Note that we have only two distinct categories, object and no object, in the case of the EndoCV PolypGen subtask. An MLP with two hidden layers converts the queries into mask embeddings. The dot product between the mask embeddings and the per-pixel embeddings is then used to calculate the mask predictions.
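The following is a minimal PyTorch sketch of how the segmentation module described above can combine queries and per-pixel embeddings; the shapes and variable names are illustrative and chosen by us, not taken from the MaskFormer code.

```python
import torch
import torch.nn as nn

# Illustrative sizes: Q queries, C embedding channels, H x W resolution,
# K foreground classes plus one "no object" class.
Q, C, H, W, K = 50, 64, 128, 128, 1

queries = torch.randn(Q, C)                # transformer decoder outputs, one per query
pixel_embeddings = torch.randn(C, H, W)    # per-pixel embeddings from the pixel-level module

# Linear classifier + softmax: class probabilities per query ("object" vs "no object").
classifier = nn.Linear(C, K + 1)
class_probs = classifier(queries).softmax(dim=-1)         # (Q, K + 1)

# An MLP with two hidden layers maps queries to mask embeddings.
mask_mlp = nn.Sequential(
    nn.Linear(C, C), nn.ReLU(),
    nn.Linear(C, C), nn.ReLU(),
    nn.Linear(C, C),
)
mask_embeddings = mask_mlp(queries)                       # (Q, C)

# Dot product with the per-pixel embeddings gives one binary mask per query.
mask_logits = torch.einsum("qc,chw->qhw", mask_embeddings, pixel_embeddings)
binary_masks = mask_logits.sigmoid()                      # (Q, H, W)
```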
3.4. Model training

We need a one-to-one correspondence between ground truth labels and predictions to calculate the losses. This problem is solved, as in DETR, via bipartite matching; mask and class predictions are used instead of bounding boxes to calculate the matching costs. Given the matching, the model is trained with a mask classification loss composed of a cross-entropy classification loss and a binary mask loss:

\mathcal{L}_{\text{mask-cls}}(z, z^{\text{gt}}) = \sum_{j=1}^{N} \left[ -\log p_{\sigma(j)}\left(c_j^{\text{gt}}\right) + \mathbb{1}_{c_j^{\text{gt}} \neq \varnothing}\, \mathcal{L}_{\text{mask}}\left(m_{\sigma(j)}, m_j^{\text{gt}}\right) \right], \qquad (1)

where the mask loss is a linear combination of dice and focal losses, as in MaskFormer. Since we exploited MaskFormer for binary segmentation, most ground truth classes for each query will be zero, and the cross-entropy loss will rapidly converge to zero. Therefore, we changed the cross-entropy loss to focal loss to mitigate class imbalance in the classification. We have also experimented with Boundary loss, which showed promising results in other medical imaging tasks; for our results with this loss, refer to Section 5.2.

3.5. Our modifications

MaskFormer's transformer module has the ability to reason about connections between different localities of the image and make distinct predictions for each segment. The model was designed for large datasets such as ADE20k and COCO-Stuff-10k. Since the challenge dataset is small compared to them, some model hyperparameters were changed to increase the performance and generalizability of our model. We decreased the number of queries from 100 to 50, the dimensionality of the FC layers from 2048 to 24, and the pixel embedding dimensionality from 256 to 64. We use a standard convolutional ResNet backbone (R50, with 50 layers) instead of Swin, because transformer backbones perform poorly on datasets with few samples, which was confirmed in our experiments as well. We use the same pixel decoder as described in [5]. Normalization coefficients were recalculated for the PolypGen dataset.

4. Post-processing

Test time augmentation is widely used to increase the model's robustness in deep learning. This procedure makes the final prediction by averaging the predictions obtained after several separately performed augmentations. Our TTA includes horizontal and vertical flips, rotations by 90 and 180 degrees, and scaling the input from the original size down to 50% of the original size.

Connected-component analysis. We perform connected-component analysis of the predicted labels during inference. The algorithm divides the segmentation mask into components according to the given connectivity; CCA can use a 4- or 8-connected neighborhood. We keep the largest connected component and remove all smaller parts from the prediction.
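Below is a minimal sketch of this post-processing. It assumes a `model` callable that maps an RGB image to a per-pixel polyp probability map of the same spatial size; the interface and function names are ours, and the multi-scale part of our TTA is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def tta_predict(model, image):
    """Average predictions over flip and rotation test-time augmentations."""
    preds = [
        model(image),
        np.fliplr(model(np.fliplr(image))),        # horizontal flip and un-flip
        np.flipud(model(np.flipud(image))),        # vertical flip and un-flip
        np.rot90(model(np.rot90(image, 1)), -1),   # 90-degree rotation and inverse
        np.rot90(model(np.rot90(image, 2)), -2),   # 180-degree rotation and inverse
    ]
    return np.mean(preds, axis=0)

def keep_largest_component(mask, connectivity=8):
    """Connected-component analysis: keep only the largest foreground component."""
    structure = np.ones((3, 3)) if connectivity == 8 else None  # None -> 4-connectivity
    labels, num = ndimage.label(mask, structure=structure)
    if num == 0:
        return mask
    sizes = ndimage.sum(mask, labels, index=range(1, num + 1))
    return (labels == (np.argmax(sizes) + 1)).astype(mask.dtype)

# Usage sketch: threshold the averaged probabilities, then clean up the mask.
# final_mask = keep_largest_component(tta_predict(model, image) > 0.5)
```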
5. Experiments

We compare our approach against CaraNet [8], one of the state-of-the-art methods for polyp segmentation. This model has proven to be effective on many endoscopy datasets, including Kvasir-SEG [9]. On this challenge, however, CaraNet with default parameters shows a good Precision score of 0.6041 but a much worse Dice than our proposed solution; refer to Table 1. In our experiments, MaskFormer is capable of capturing more cases of polyp presence.

Table 1
Metrics on our local validation set. MF stands for MaskFormer.

Method            Dice      Dice std   Type 2 error
CaraNet           0.37516   0.31954    0.71444
MF                0.73587   0.30823    0.28758
MF + TTA + CCA    0.75717   0.32518    0.27494

Table 2
Metrics on round 2 test data of the PolypGen2.0 track in the EndoCV2022 challenge. MF stands for MaskFormer.

Method               Dice     Dice std   Type 2 error
MF                   0.5497   0.4319     0.556
MF + boundary loss   0.3346   0.3631     0.400

5.1. TTA and CCA impact

The impact of TTA and CCA on the results for our validation set is provided in Table 1. We observe that TTA and CCA in most cases help to decrease false positive regions. For examples of images where TTA and CCA improved the predicted masks, see Fig. 4.

Figure 4: Examples of images where TTA and CCA improved predicted masks.

5.2. Boundary loss

We use a combination of a cross-entropy classification loss and a binary mask loss for each predicted segment during training. The binary mask loss is a linear combination of focal and dice losses [10]. We also experimented with other losses. Boundary loss [11] was initially proposed for highly unbalanced segmentation, for instance, when the size of the target foreground region is several times smaller than the background. It works as a distance metric on the space of contours, computing active-contour flows through a non-symmetric L2 distance on the space of contours expressed as a regional integral. This method has shown remarkable results on medical images, for example, in the task of white matter hyperintensities segmentation. However, our experiments did not show any positive impact of boundary loss for polyp segmentation; it decreased the performance severely, refer to the comparison in Table 2.
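For illustration, a common way to implement boundary loss [11] is to pre-compute a signed distance map of the ground-truth region and integrate the predicted foreground probabilities against it. The NumPy sketch below follows that recipe under our own naming and sign convention; it is not the exact implementation used in our experiments.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(gt_mask: np.ndarray) -> np.ndarray:
    """Signed distance to the ground-truth boundary: positive outside the
    object, negative inside (one common convention for boundary loss)."""
    gt = gt_mask.astype(bool)
    if not gt.any():
        return np.zeros(gt.shape, dtype=np.float32)
    outside = distance_transform_edt(~gt)  # distance of background pixels to the object
    inside = distance_transform_edt(gt)    # distance of object pixels to the background
    return (outside - inside).astype(np.float32)

def boundary_loss(pred_probs: np.ndarray, dist_map: np.ndarray) -> float:
    """Regional-integral form of the boundary loss: the mean of the predicted
    foreground probabilities weighted by the signed distance map."""
    return float((pred_probs * dist_map).mean())

# Usage sketch:
# dist = signed_distance_map(gt_mask)
# loss = boundary_loss(predicted_foreground_probs, dist)
```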
6. Discussion

We assume that including sequence information as an input to MaskFormer [5] can potentially improve the results. Since the original MaskFormer architecture starts with a regular convolution, one could combine the images of a sequence into a volume and pass it to the convolutional layer as additional channels. Another option is to use the Mask2Former [12] model, which was inspired by MaskFormer and created for video segmentation. Mask2Former [12] is based on the Masked-attention Mask Transformer for universal image and video segmentation. It is possible to incorporate their idea by combining the images from the same sequence into a single input with an additional dimension responsible for time frames.

7. Conclusion

We are the first to show the performance of a mask classification-based model on endoscopy data. We use MaskFormer [5] as the main component of our approach, adding modifications to the number of queries, for instance, decreasing it since polyp segmentation is a binary segmentation task. We also introduce a simple pre-processing technique for endoscopy images, which helps to remove redundant information from the input. This step simplifies the learning of meaningful features for the model. Moreover, we add test time augmentation and connected-component analysis at post-processing. Combining all these components achieves a 54.97 Dice score on the round 2 validation in the EndoCV2022 challenge.

In this work, we also experiment with boundary loss for MaskFormer [5] and show that it does not bring improvements in the polyp segmentation task.

References

[1] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[2] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[3] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D. Soberanis-Mukul, et al., An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific Reports 10 (2020). doi:10.1038/s41598-020-59413-5.
[4] S. Ali, F. Zhou, A. Bailey, B. Braden, J. E. East, X. Lu, J. Rittscher, A deep learning framework for quality assessment and restoration in video endoscopy, Medical Image Analysis 68 (2021) 101900. doi:10.1016/j.media.2020.101900.
[5] B. Cheng, et al., Per-pixel classification is not all you need for semantic segmentation, 2021. doi:10.48550/arXiv.2107.06278.
[6] N. Carion, et al., End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.
[7] A. Vaswani, et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[8] A. Lou, et al., CaraNet: Context axial reverse attention network for segmentation of small medical objects, arXiv preprint arXiv:2108.07368 (2021).
[9] D. Jha, et al., Kvasir-SEG: A segmented polyp dataset, in: MultiMedia Modeling, Springer International Publishing, 2019, pp. 451–462. doi:10.1007/978-3-030-37734-2_37.
[10] T.-Y. Lin, et al., Focal loss for dense object detection, 2017. doi:10.48550/arXiv.1708.02002.
[11] H. Kervadec, et al., Boundary loss for highly unbalanced segmentation, Medical Image Analysis 67 (2021) 101851. doi:10.1016/j.media.2020.101851.
[12] B. Cheng, et al., Masked-attention mask transformer for universal image segmentation, arXiv (2021).