Mask Classification-based Method for Polyps Segmentation and Detection

Mariia Kokshaikyna, Yurii Yelisieiev and Mariia Dobko
The Machine Learning Lab, Ukrainian Catholic University, Lviv, Ukraine
mariia.kokshaikyna@ucu.edu.ua (M. K.); yelisieiev@ucu.edu.ua (Y. Yelisieiev); dobkom@ucu.edu.ua (M. D.)

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI2022), March 28th, 2022, IC Royal Bengal, Kolkata, India

Abstract
We introduce a mask classification model with a transformer decoder for polyp segmentation in endoscopy images. Our novel approach combines custom data pre-processing, a modified mask classification network, test time augmentations, and connected-component analysis. We show its successful performance on the polyp semantic segmentation and detection tasks of the EndoCV 2022 challenge.

1. Introduction

Endoscopy is a widely used procedure for detecting and diagnosing multiple diseases. Computer-aided endoscopic image analysis and decision support systems can help doctors with diagnosis and increase its effectiveness. Such systems are mainly used to detect, localize, and segment cancer precursor lesions, also called "polyps." The EndoCV challenge [1, 2, 3, 4] aims to tackle the generalizability aspect of such methods. In 2022, it has two sub-challenges: EAD 2.0 (Endoscopy artefact detection) and PolypGen 2.0 (Polyp generalization). Both tracks set detection and segmentation tasks on a diverse population dataset. This work describes our solution to the EndoCV 2022 challenge on the polyp segmentation and detection tracks.

The dataset of the EndoCV 2022 challenge [1, 2, 3, 4] is diverse and comprises images from various endoscope types, which presents an additional difficulty for any computer-aided system. We decided to simplify the input by cropping out the uninformative part and generalizing the input image at the pre-processing step.

Standardly, the semantic segmentation task is solved as a per-pixel classification problem, applying a classification loss to each output pixel. An alternative approach is mask classification, which, instead of classifying each pixel, predicts a set of binary masks, each associated with a single class prediction. The authors of MaskFormer [5] proposed a modern approach last year by using mask classification to solve both semantic- and instance-level segmentation tasks in a unified manner. The model predicts a set of binary masks, each corresponding to a single global class label. MaskFormer [5] outperforms per-pixel classification baselines on natural scenes.

We propose to use a mask classification-based method for polyp segmentation in endoscopy data. To the best of our knowledge, we are the first to test this model on endoscopic images. We also customize parts of the MaskFormer [5] architecture and show its successful performance for polyp detection. To increase the robustness of our solution, we add test time augmentations (TTA) and perform connected-component analysis (CCA).

Our contribution can be summed up as follows:

• Evaluated and showed the performance of a mask classification method, MaskFormer, on endoscopy data, and added custom modifications that improve the results of MaskFormer for polyp segmentation.
• Presented a step-by-step pre-processing mechanism for training and inference.
• Tested the impact of different loss functions.
• Added custom post-processing using test time augmentations and connected-component analysis.

Figure 1: Our pipeline. The main stages include pre-processing, MaskFormer with modified queries, and post-processing via test time augmentation and connected-component analysis.

2. Data Pre-processing

The PolypGen2.0 sub-challenge dataset consists of 46 sequences with 3348 images with polyp labels. Different endoscopes produced these images with various sizes and artifacts: a black section located at the left part of the image, a blue rectangle with the endoscope position, text artifacts, and others. Overall, we can distinguish 15 types of images among these sequences. Statistics about the different types are shown in Fig. 2.

Figure 2: Different endoscope image types.

For the train and validation sets, we divided the sequences into groups using manually labeled endoscope image types. For the validation set we selected sequences seq1, seq1_endocv22, seq2_endocv22, seq3, seq3_endocv22, seq5_endocv22, seq7_endocv22, seq10, seq13_endocv22, seq14_endocv22, seq15, seq17, seq19_endocv22, seq21_endocv22, and seq24_endocv22. The other sequences were used in the training set. Overall, our training set contains 3306 images and the validation set contains 649 images, which is 19.63% of the total image number.

To bring all images to the same view and use the most informative regions during training, we apply simple pre-processing and automatically crop the images, cutting the black areas on the left and right sides of the input. To do that, we take the center row of the image, sum up the values of the RGB channels in this row, and use a threshold equal to 48. Continuous left and right parts under this threshold are considered redundant and are cut. Examples of cropped images are shown in Fig. 3. This cropping improves the informativeness of the images and the model's generalization.

Figure 3: Examples of cropped images. First column: before, second column: after our pre-processing procedure.
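A minimal sketch of this cropping step is shown below. The NumPy code, the function name, and the H x W x 3 array layout are ours for illustration, not our exact implementation; the threshold of 48 is the value stated above.

```python
import numpy as np

def crop_black_borders(image: np.ndarray, threshold: int = 48) -> np.ndarray:
    """Crop continuous dark regions on the left and right borders of an RGB image.

    We take the center row, sum its RGB values per column, and treat the
    contiguous runs of columns below `threshold` at the left and right
    borders as uninformative.
    """
    center_row = image[image.shape[0] // 2]            # shape (W, 3)
    intensity = center_row.sum(axis=-1)                # per-column RGB sum, shape (W,)
    informative = np.flatnonzero(intensity >= threshold)
    if informative.size == 0:                          # fully dark image: leave unchanged
        return image
    left, right = informative[0], informative[-1]
    return image[:, left:right + 1]
```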
3. Method

We chose MaskFormer [5] as the primary model for our approach. MaskFormer treats semantic segmentation as a classification of masks. This approach is an alternative to per-pixel classification, which predominates in semantic segmentation. Instead of classifying each pixel separately, mask classification decouples semantic segmentation into partitioning the image into regions and classifying these regions. Such an approach is general enough to solve both semantic and instance segmentation problems. MaskFormer is divided into three modules: pixel-level, transformer, and segmentation.

3.1. Pixel-level module

This module is an encoder-decoder architecture typically used for the semantic segmentation task. The encoder part (a backbone) generates a high-level feature representation of the image. We then obtain per-pixel embeddings by iteratively upsampling the feature representation from the encoder. Since this is a typical setting for a per-pixel classification semantic segmentation task, any model of this type can be plugged into this module.

3.2. Transformer module

The transformer module generates N learnable positional embeddings (i.e., queries), as in DETR [6], which encode global information about each segment of the MaskFormer prediction. This module's architecture is adapted from transformers [7], popular for sequence data. In contrast to the standard transformer architecture, in which each output is predicted in an autoregressive manner, each object is decoded in parallel. The attention mechanism encodes information about the relations between these segments and enhances them with the image context.

3.3. Segmentation module

The segmentation module utilizes a linear classifier and a softmax activation function to acquire class probabilities from each query. Note that we have only two distinct categories, object and no object, in the case of the EndoCV PolypGen subtask. An MLP with two hidden layers converts the queries into mask embeddings. The dot product between the mask embeddings and the per-pixel embeddings is then used to calculate the mask predictions.
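The following is a minimal PyTorch sketch of how the segmentation module described above can combine queries and per-pixel embeddings; the shapes and variable names are illustrative and chosen by us, not taken from the MaskFormer code.

```python
import torch
import torch.nn as nn

# Illustrative sizes: Q queries, C embedding channels, H x W resolution,
# K foreground classes plus one "no object" class.
Q, C, H, W, K = 50, 64, 128, 128, 1

queries = torch.randn(Q, C)                # transformer decoder outputs, one per query
pixel_embeddings = torch.randn(C, H, W)    # per-pixel embeddings from the pixel-level module

# Linear classifier + softmax: class probabilities per query ("object" vs "no object").
classifier = nn.Linear(C, K + 1)
class_probs = classifier(queries).softmax(dim=-1)         # (Q, K + 1)

# An MLP with two hidden layers maps queries to mask embeddings.
mask_mlp = nn.Sequential(
    nn.Linear(C, C), nn.ReLU(),
    nn.Linear(C, C), nn.ReLU(),
    nn.Linear(C, C),
)
mask_embeddings = mask_mlp(queries)                       # (Q, C)

# Dot product with the per-pixel embeddings gives one binary mask per query.
mask_logits = torch.einsum("qc,chw->qhw", mask_embeddings, pixel_embeddings)
binary_masks = mask_logits.sigmoid()                      # (Q, H, W)
```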
3.4. Model training

We need a one-to-one correspondence between ground truth labels and predictions to calculate the losses. This problem is solved, as in DETR, via bipartite matching; mask and class predictions are used instead of bounding boxes to calculate the matching costs. Given the matching, the model is trained with a mask classification loss composed of a cross-entropy classification loss and a binary mask loss:

\mathcal{L}_{\text{mask-cls}}(z, z^{\text{gt}}) = \sum_{j=1}^{N} \left[ -\log p_{\sigma(j)}\left(c_j^{\text{gt}}\right) + \mathbb{1}_{c_j^{\text{gt}} \neq \varnothing}\, \mathcal{L}_{\text{mask}}\left(m_{\sigma(j)}, m_j^{\text{gt}}\right) \right], \qquad (1)

where the mask loss is a linear combination of dice and focal losses, as in MaskFormer. Since we exploited MaskFormer for binary segmentation, most ground truth classes for each query will be zero, and the cross-entropy loss will rapidly converge to zero. Therefore, we changed the cross-entropy loss to focal loss to mitigate class imbalance in the classification. We have also experimented with Boundary loss, which showed promising results in other medical imaging tasks; for our results with this loss, refer to Section 5.2.

3.5. Our modifications

MaskFormer's transformer module has the ability to reason about connections between different localities of the image and make distinct predictions for each segment. The model was designed for large datasets such as ADE20k and COCO-Stuff-10k. Since the challenge dataset is small compared to them, some model hyperparameters were changed to increase the performance and generalizability of our model. We decreased the number of queries from 100 to 50, the dimensionality of the FC layers from 2048 to 24, and the pixel embedding dimensionality from 256 to 64. We use a standard convolutional ResNet backbone (R50, with 50 layers) instead of Swin, because transformer backbones perform poorly on datasets with few samples, which was confirmed in our experiments as well. We use the same pixel decoder as described in [5]. Normalization coefficients were recalculated for the PolypGen dataset.

4. Post-processing

Test time augmentation is widely used to increase the model's robustness in deep learning. This procedure makes the final prediction by averaging the predictions obtained after several separately performed augmentations. Our TTA includes horizontal and vertical flips, rotations by 90 and 180 degrees, and scaling the input from the original size down to 50% of the original size.

Connected-component analysis. We perform connected-component analysis of the predicted labels during inference. The algorithm divides the segmentation mask into components according to the given connectivity; CCA can use a 4- or 8-connected neighborhood. We keep the largest connected component and remove all smaller parts from the prediction.
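Below is a minimal sketch of this post-processing. It assumes a `model` callable that maps an RGB image to a per-pixel polyp probability map of the same spatial size; the interface and function names are ours, and the multi-scale part of our TTA is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def tta_predict(model, image):
    """Average predictions over flip and rotation test-time augmentations."""
    preds = [
        model(image),
        np.fliplr(model(np.fliplr(image))),        # horizontal flip and un-flip
        np.flipud(model(np.flipud(image))),        # vertical flip and un-flip
        np.rot90(model(np.rot90(image, 1)), -1),   # 90-degree rotation and inverse
        np.rot90(model(np.rot90(image, 2)), -2),   # 180-degree rotation and inverse
    ]
    return np.mean(preds, axis=0)

def keep_largest_component(mask, connectivity=8):
    """Connected-component analysis: keep only the largest foreground component."""
    structure = np.ones((3, 3)) if connectivity == 8 else None  # None -> 4-connectivity
    labels, num = ndimage.label(mask, structure=structure)
    if num == 0:
        return mask
    sizes = ndimage.sum(mask, labels, index=range(1, num + 1))
    return (labels == (np.argmax(sizes) + 1)).astype(mask.dtype)

# Usage sketch: threshold the averaged probabilities, then clean up the mask.
# final_mask = keep_largest_component(tta_predict(model, image) > 0.5)
```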
5. Experiments

We compare our approach against CaraNet [8], one of the state-of-the-art methods for polyp segmentation. This model has proven to be effective on many endoscopy datasets, including Kvasir-SEG [9]. On this challenge, however, CaraNet with default parameters shows a good Precision score of 0.6041 but a much worse Dice than our proposed solution; refer to Table 1. In our experiments, MaskFormer is capable of capturing more cases of polyp presence.

Table 1
Metrics on our local validation set. MF stands for MaskFormer.

Method            Dice      Dice std   Type 2 error
CaraNet           0.37516   0.31954    0.71444
MF                0.73587   0.30823    0.28758
MF + TTA + CCA    0.75717   0.32518    0.27494

Table 2
Metrics on round 2 test data of the PolypGen2.0 track in the EndoCV2022 challenge. MF stands for MaskFormer.

Method               Dice     Dice std   Type 2 error
MF                   0.5497   0.4319     0.556
MF + boundary loss   0.3346   0.3631     0.400

5.1. TTA and CCA impact

The impact of TTA and CCA on the results for our validation set is provided in Table 1. We observe that TTA and CCA in most cases help to decrease false positive regions. For examples of images where TTA and CCA improved the predicted masks, see Fig. 4.

Figure 4: Examples of images where TTA and CCA improved predicted masks.

5.2. Boundary loss

We use a combination of a cross-entropy classification loss and a binary mask loss for each predicted segment during training. The binary mask loss is a linear combination of focal and dice losses [10]. We also experimented with other losses. Boundary loss [11] was initially proposed for highly unbalanced segmentation, for instance, when the size of the target foreground region is several times smaller than the background. It works as a distance metric on the space of contours, computing active-contour flows through a non-symmetric L2 distance on the space of contours expressed as a regional integral. This method has shown remarkable results on medical images, for example, in the task of white matter hyperintensities segmentation. However, our experiments did not show any positive impact of boundary loss for polyp segmentation; it decreased the performance severely, refer to the comparison in Table 2.
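For illustration, a common way to implement boundary loss [11] is to pre-compute a signed distance map of the ground-truth region and integrate the predicted foreground probabilities against it. The NumPy sketch below follows that recipe under our own naming and sign convention; it is not the exact implementation used in our experiments.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(gt_mask: np.ndarray) -> np.ndarray:
    """Signed distance to the ground-truth boundary: positive outside the
    object, negative inside (one common convention for boundary loss)."""
    gt = gt_mask.astype(bool)
    if not gt.any():
        return np.zeros(gt.shape, dtype=np.float32)
    outside = distance_transform_edt(~gt)  # distance of background pixels to the object
    inside = distance_transform_edt(gt)    # distance of object pixels to the background
    return (outside - inside).astype(np.float32)

def boundary_loss(pred_probs: np.ndarray, dist_map: np.ndarray) -> float:
    """Regional-integral form of the boundary loss: the mean of the predicted
    foreground probabilities weighted by the signed distance map."""
    return float((pred_probs * dist_map).mean())

# Usage sketch:
# dist = signed_distance_map(gt_mask)
# loss = boundary_loss(predicted_foreground_probs, dist)
```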
6. Discussion

We assume that including sequence information as an input to MaskFormer [5] can potentially improve the results. Since the original MaskFormer architecture starts with a regular convolution, one could combine the images of a sequence into a volume and pass it to the convolutional layer as additional channels. Another option is to use the Mask2Former [12] model, which was inspired by MaskFormer and created for video segmentation. Mask2Former [12] is based on the Masked-attention Mask Transformer for universal image and video segmentation. It is possible to incorporate their idea by combining the images from the same sequence into a single input with an additional dimension responsible for time frames.

7. Conclusion

We are the first to show the performance of a mask classification-based model on endoscopy data. We use MaskFormer [5] as the main component of our approach, adding modifications to the number of queries, for instance, decreasing it since polyp segmentation is a binary segmentation task. We also introduce a simple pre-processing technique for endoscopy images, which helps to remove redundant information from the input. This step simplifies the learning of meaningful features for the model. Moreover, we add test time augmentation and connected-component analysis at post-processing. Combining all these components achieves a 54.97 Dice score on the round 2 validation in the EndoCV2022 challenge.

In this work, we also experiment with boundary loss for MaskFormer [5] and show that it does not bring improvements in the polyp segmentation task.

References

[1] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[2] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[3] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D. Soberanis-Mukul, et al., An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific Reports 10 (2020). doi:10.1038/s41598-020-59413-5.
[4] S. Ali, F. Zhou, A. Bailey, B. Braden, J. E. East, X. Lu, J. Rittscher, A deep learning framework for quality assessment and restoration in video endoscopy, Medical Image Analysis 68 (2021) 101900. doi:10.1016/j.media.2020.101900.
[5] B. Cheng, et al., Per-pixel classification is not all you need for semantic segmentation, 2021. doi:10.48550/arXiv.2107.06278.
[6] N. Carion, et al., End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.
[7] A. Vaswani, et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[8] A. Lou, et al., CaraNet: Context axial reverse attention network for segmentation of small medical objects, arXiv preprint arXiv:2108.07368 (2021).
[9] D. Jha, et al., Kvasir-SEG: A segmented polyp dataset, in: MultiMedia Modeling, Springer International Publishing, 2019, pp. 451–462. doi:10.1007/978-3-030-37734-2_37.
[10] T.-Y. Lin, et al., Focal loss for dense object detection, 2017. doi:10.48550/arXiv.1708.02002.
[11] H. Kervadec, et al., Boundary loss for highly unbalanced segmentation, Medical Image Analysis 67 (2021) 101851. doi:10.1016/j.media.2020.101851.
[12] B. Cheng, et al., Masked-attention mask transformer for universal image segmentation, arXiv (2021).