HCMUS at MediaEval2021: Polyps Segmentation using TransFuse with Focal Tversky Loss

Nhat-Khang Ngo 1,2, Tuan-Luc Huynh 1,2, Thanh-Danh Le 1,2, Hai-Dang Nguyen 1,2, Minh-Triet Tran 1,2,3
1 University of Science, VNU-HCM
2 Vietnam National University, Ho Chi Minh city, Vietnam
3 John von Neumann Institute, VNU-HCM
{nnkhang19,htluc19,ltdanh19}@apcs.fitus.edu.vn, nhdang@selab.hcmus.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'21, December 13-15 2021, Bergen, Norway and Online

ABSTRACT

The Medico task, MediaEval 2021, aims at developing accurate and high-performance techniques for automatic medical image segmentation. In this work, we describe an approach for tackling Tasks 1 and 2 of the challenge. We retrain TransFuse, a state-of-the-art model in medical image segmentation, with the focal Tversky loss function to segment the polyp regions in endoscopic images. The approach focuses on computational efficiency while also producing high-quality segmentation results. In evaluation, our method achieves reasonable results for both efficiency and accuracy.

1 INTRODUCTION

Medical image segmentation has become more common in recent years, thanks to important advances in artificial intelligence. This line of work mainly focuses on helping experts diagnose life-threatening cancers by detecting and segmenting polyps in medical images at an early stage. However, automatic polyp segmentation is challenging due to the diversity of polyp shapes and positions. Many studies leverage the representation power of deep learning to capture the many variations of polyps in endoscopic images. The MediaEval 2021 task Transparency in Medical Image Segmentation [5] calls for researchers to investigate methods for polyp segmentation.

This paper presents an approach that can efficiently segment the polyp regions in endoscopic images. We train from scratch TransFuse [9], a state-of-the-art model in medical image segmentation, with a generalized focal Tversky loss function [1]. TransFuse combines vision transformers [4] and convolutional neural networks in a parallel manner [9]. While the former learns to model the relations between regions in the images, the latter extracts the local details of these regions. The two processes execute in parallel. Hence, TransFuse improves time efficiency in the inference phase. To combine both kinds of information, Zhang et al. [9] propose the BiFusion module, consisting of several attention modules and convolution blocks. In addition, Kvasir-seg [2], the given dataset, is a small dataset with only 1360 samples. The dataset also contains many hard samples in which the polyps are large and have unusual locations and shapes. To address this problem, we train TransFuse with the focal Tversky loss function. We train the models with various hyperparameter settings to assess the efficacy and failures of this approach.

2 RELATED WORK

Self-attention is a key mechanism in deep learning: it enables models to capture the global context between objects in data. Self-attention is used in medical image segmentation to model the relationships between regions in the images. Oktay et al. [8] integrate Attention Gates into U-net to suppress inessential areas and emphasize salient characteristics. To further handle the global context, Chen et al. [3] proposed TransUnet, in which the encoders of U-net are replaced by the encoders of Vision Transformers [4]. Petit et al. [7] proposed a U-net architecture featuring self-attention and cross-attention between the encoder and decoder. While the preceding methods combine self-attention and CNNs sequentially, Zhang et al. [9] combine them in a parallel manner. This kind of incorporation can mitigate the loss of local details in deep CNNs and reduce the inference time.

3 APPROACH

3.1 TransFuse

As illustrated in Figure 1, TransFuse includes three branches: Transformer, CNN, and BiFusion. The Transformer branch makes use of the Vision Transformer architecture, in which an image is split into patches and embedded before being passed through a stack of multi-head self-attention and multi-layer perceptron modules. The result is reshaped into several feature maps, which are kept for later fusion. Simultaneously, the CNN branch downsamples the image into feature maps with the same size as the corresponding ones in the Transformer branch. The outputs of the two parallel branches are fused in the BiFusion module. The module contains spatial attention, channel attention, and residual blocks to perform multi-modal fusion and self-attention [9]. Finally, the fused output is upsampled to get the segmented result. In addition, deep supervision is provided at the output of the Transformer branch and the final BiFusion module. In our experiments, we use TransFuse-S proposed by Zhang et al. [9].

Figure 1: Architecture of TransFuse [9]

3.2 Focal Tversky Loss

The Tversky score extends the Dice score by flexibly weighting the false positive and false negative cases among the classes [1]. Equation 1 shows how to calculate the Tversky score. In the equation, $\alpha$ is a hyperparameter that we can tune during training. High values of $\alpha$ enhance the recall rate in highly imbalanced datasets [1]. Wider polyp regions, consequently, can be detected in images. Additionally, $\epsilon$ is a constant that stabilizes the score.

$$T = \frac{TP + \epsilon}{TP + \alpha FN + (1 - \alpha) FP + \epsilon} \qquad (1)$$

The Tversky loss $L$ equals $1 - T$. To tackle hard samples, Abraham et al. [1] adapt the loss function to a focal version. The loss is written as $FL = (1 - T)^{1/\gamma}$, where $\gamma \in [1, 3]$ is a hyperparameter. When a high Tversky score has a high number of erroneous predictions, i.e., $FN$ and $FP$, the loss decreases dramatically. By using $\alpha > 0.5$ and $\gamma > 1$, the function focuses more on misclassified samples. As a result, the model can widen the segmented polyp regions.

4 EXPERIMENTS AND RESULTS

4.1 Experiments

We train TransFuse-S with the focal Tversky loss by varying $\alpha$ in five Runs. In the first four Runs, we split the dataset into training and validation sets with a ratio of 8:2, whereas we train the model with all samples in Run 5. We use four values of $\alpha$: 0.3, 0.4, 0.6, and 0.7. In Run 1 and Run 5, $\alpha$ equals 0.7, while $\alpha$ equals 0.6, 0.4, and 0.3 in Runs 2, 3, and 4, respectively. It is worth noting that when $\alpha = 0.5$, the Tversky score becomes the Dice score. Thus, we do not use 0.5 in our experiments. In addition, we fix the value of $\gamma$ to $4/3$, which is reported to be the most effective in [1]. We use Adam [6] to optimize the loss function with a learning rate of 1e-4 and a batch size of 16. Additionally, because we use deep supervision, there are three losses $L_1$, $L_2$, and $L_3$ with the corresponding scales $\beta_1 = 0.5$, $\beta_2 = 0.2$, and $\beta_3 = 0.3$. Thus, the final loss $L$ equals $0.5 L_1 + 0.2 L_2 + 0.3 L_3$.

4.2 Results

Table 1 displays the outcomes of our submissions from Run 1 to Run 5 in the challenge's Task 1. Accuracy, Jaccard score, Dice score, F1-score, Recall, and Precision are the six metrics used to assess predictions. In Run 2, when $\alpha = 0.6$, we attain the highest Jaccard score of 0.6780. This run also produces the highest Dice score of 0.7756. All runs except Run 4 ($\alpha = 0.3$) have a higher recall than precision. This reflects the emphasis our approach places on reducing false negative predictions. We achieve the highest recall and precision of 0.8584 (Run 5) and 0.8208 (Run 4), respectively. Table 1 further shows that the accuracy values for the five runs are almost comparable.

Run ID   Acc      Jacc     DSC      F1       Rec      Prec
Run 1    0.9484   0.6684   0.7672   0.7672   0.8430   0.7628
Run 2    0.9462   0.6780   0.7756   0.7756   0.8413   0.7748
Run 3    0.9406   0.6596   0.7583   0.7583   0.8427   0.7656
Run 4    0.9441   0.6700   0.7644   0.7644   0.7814   0.8208
Run 5    0.9407   0.6689   0.7659   0.7659   0.8584   0.7569

Table 1: Results in Task 1

In this section, we additionally present the inference time for Task 2. Table 2 shows the average inference time and frame rate, as well as the Jaccard score, Recall, and Precision of Task 2's Run 1. On average, the model makes one prediction in 0.0132 seconds. Besides fast inference, our technique produces accurate findings, with a Jaccard score of 0.6692, a high Recall of 0.8586, and a high Precision of 0.7572.

Run ID   Avg-time (s)   Avg-fps   Jacc     Rec      Prec
Run 1    0.0132         75.7629   0.6692   0.8586   0.7572

Table 2: Results in Task 2

Furthermore, Figure 2 depicts the efficacy and failure of focusing on enhancing the recall rate in the dataset. We paint the polyp regions green based on the predictions to see if the borders of these regions are suitable. In the first image, the high-recall prediction is acceptable, whereas the green overlay in the second image extends beyond the polyp region.

Figure 2: Visualization (columns: Image, Prediction Map, Overlay)

5 CONCLUSION

We present an approach to automatically segment polyp regions in endoscopic images. We train TransFuse from scratch with the focal Tversky loss to tackle hard samples in an imbalanced dataset. We plan to investigate this approach more thoroughly in the future.

ACKNOWLEDGMENTS

This work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

REFERENCES

[1] Nabila Abraham and Naimul Mefraz Khan. 2019. A novel focal tversky loss function with improved attention u-net for lesion segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019). IEEE, 683–687.
[2] Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, Dag Johansen, Carsten Griwodz, Håkon K Stensland, Enrique Garcia-Ceja, Peter T Schmidt, Hugo L Hammer, Michael A Riegler, Pål Halvorsen, and Thomas de Lange. 2020. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific Data 7, 1 (2020), 283. https://doi.org/10.1038/s41597-020-00622-y
[3] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. 2021. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
[4] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and others. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[5] Steven Hicks, Debesh Jha, Vajira Thambawita, Hugo Hammer, Thomas de Lange, Sravanthi Parasa, Michael Riegler, and Pål Halvorsen. 2021. Medico Multimedia Task at MediaEval 2021: Transparency in Medical Image Segmentation. In Proceedings of MediaEval 2021 CEUR Workshop.
[6] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[7] Olivier Petit, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler. 2021. U-net transformer: self and cross attention for medical image segmentation. In International Workshop on Machine Learning in Medical Imaging. Springer, 267–276.
[8] Jo Schlemper, Ozan Oktay, Michiel Schaap, Mattias Heinrich, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. 2019. Attention gated networks: Learning to leverage salient regions in medical images. Medical image analysis 53 (2019), 197–207.
[9] Yundong Zhang, Huiye Liu, and Qiang Hu. 2021. Transfuse: Fusing transformers and cnns for medical image segmentation. arXiv preprint arXiv:2102.08005 (2021).
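
The focal Tversky loss of Section 3.2 and the weighted deep-supervision loss of Section 4.1 can be illustrated with a short plain-Python sketch. This is not the authors' implementation. The function names, the use of soft (probability-weighted) counts for TP, FN, and FP over a flattened mask, and the smoothing value eps = 1.0 are assumptions made for illustration. The defaults alpha = 0.7 and weights (0.5, 0.2, 0.3) follow the settings described in the text, and gamma = 4/3 is the setting reported as most effective in [1].

```python
from typing import Sequence


def tversky_score(pred: Sequence[float], target: Sequence[int],
                  alpha: float = 0.7, eps: float = 1.0) -> float:
    """Soft Tversky score T (Equation 1) for one flattened mask.

    pred holds foreground probabilities in [0, 1]; target holds 0/1 labels.
    eps = 1.0 is an assumed smoothing constant, not a value from the paper.
    """
    tp = sum(p * t for p, t in zip(pred, target))          # soft true positives
    fn = sum((1.0 - p) * t for p, t in zip(pred, target))  # soft false negatives
    fp = sum(p * (1 - t) for p, t in zip(pred, target))    # soft false positives
    return (tp + eps) / (tp + alpha * fn + (1.0 - alpha) * fp + eps)


def focal_tversky_loss(pred: Sequence[float], target: Sequence[int],
                       alpha: float = 0.7, gamma: float = 4.0 / 3.0,
                       eps: float = 1.0) -> float:
    """Focal Tversky loss FL = (1 - T) ** (1 / gamma)."""
    t = tversky_score(pred, target, alpha=alpha, eps=eps)
    # Clamp at zero to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - t) ** (1.0 / gamma)


def deep_supervision_loss(losses: Sequence[float],
                          weights: Sequence[float] = (0.5, 0.2, 0.3)) -> float:
    """Weighted sum of the supervised losses: 0.5*L1 + 0.2*L2 + 0.3*L3."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, losses))


if __name__ == "__main__":
    target = [1, 1, 0, 0]
    missed = [1.0, 0.0, 0.0, 0.0]   # one false negative
    extra = [1.0, 1.0, 1.0, 0.0]    # one false positive
    print(focal_tversky_loss(missed, target))
    print(focal_tversky_loss(extra, target))
```

With alpha above 0.5, the missed polyp pixel in the usage example is penalized more than the extra one, which matches the recall-oriented behaviour discussed in Section 3.2. In a real training loop the same arithmetic would be applied to tensors in a deep learning framework so that gradients can flow through the loss.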