Yunnan University at VQA-Med 2021: Pretrained BioBERT for Medical Domain Visual Question Answering

Qian Xiao, Xiaobing Zhou*, Ya Xiao and Kun Zhao
Yunnan University, Kunming, China
*Corresponding author: zhouxb@ynu.edu.cn

Abstract. This paper describes the submission of the Yunnan University team to the Visual Question Answering task of the ImageCLEF 2021 VQA-Med challenge. Based on an analysis of the dataset, we treat this task as a classification task. Firstly, we use a pre-trained VGG16 model, Global Average Pooling (GAP), and image augmentation techniques to process the images and extract visual features. Secondly, we use BioBERT, which is pre-trained on biomedical text, to extract the semantic features of the questions. BioBERT has the same model structure as BERT but performs better at extracting medical text features. Thirdly, the semantic features and image features are fused by Multi-modal Factorized High-order (MFH) pooling. Finally, the fused features are fed into a fully connected layer for classification. Our method achieved an accuracy score of 0.362 and a BLEU score of 0.402 and ranked 2nd among all participating teams in the VQA-Med task at ImageCLEF 2021. Our code is publicly available at https://github.com/huanhuan414/YNU-at-ImageCLEF-VQA-Med-2021.

Keywords: BioBERT · VGG Network · Global Average Pooling · VQA-Med

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The visual question answering (VQA) task aims to answer questions about the content of a corresponding image. It involves techniques from both computer vision (CV) and natural language processing (NLP). The dataset of the VQA-Med task is composed of medical images and related question-answer pairs; a VQA system takes an image and a question as input and predicts an answer to the question. For general-domain VQA, there are many datasets, advanced models, and techniques available to solve the task. With the increasing interest in applying artificial intelligence in medicine, VQA has also attracted attention in the medical field, because it can support doctors' clinical decisions and enhance patients' understanding of their condition from medical images, especially in patient-centered care.

Compared with general-domain VQA, VQA in the medical domain is a more challenging task. Firstly, because collecting valid data is expensive, the medical data available for training is limited, whereas for general-domain VQA thousands of images of guaranteed quality can easily be obtained. Secondly, the vocabulary used in medical questions, answers, and reports differs considerably from everyday language and is far more specialized.

In the following, we first review work related to the VQA-Med task in Section 2. The dataset provided by ImageCLEF 2021 is described in Section 3. In Section 4, we describe the details of our proposed method, and in Section 5 we describe the experiments. We conclude the paper in Section 6.

2 Related Work

VQA in the medical field is more challenging because it requires specialized medical datasets and expert doctors to understand the data.
The VQA-Med competition started in 2018, and since then it has provided a medical dataset for VQA tasks every year. In 2019, the Zhejiang University team [3] proposed a convolutional neural network based on the VGG16 network [7] and a global average pooling strategy [9] to extract visual features; the proposed method can effectively capture medical image features from a small training set. The semantic features of the questions are encoded by the BERT model [11], and a co-attention mechanism is then used to fuse the two enhanced features. Their model ranked 1st among all participating groups of ImageCLEF 2019 with an accuracy of 0.624 and a BLEU score of 0.644. In the same year, a joint team [15] from Umeå University, Sweden, and the University of Bern, Switzerland proposed a bilinear model to aggregate and combine the extracted image and question features. They also used an attention scheme to focus on the relevant input context, further enhanced by an ensemble of trained models. Their method ranked 3rd among all participating groups. In the third edition of the VQA-Med challenge in 2020, the AIML team [2] used a knowledge-inference method called skeleton-based sentence mapping (SSM). Using all the questions and answers, they derived a set of classification tasks and inferred the corresponding labels. They also proposed a classification and task-normalization method to optimize multiple tasks in a single network, which makes it possible to apply multi-scale and multi-architecture ensemble strategies for robust prediction. In the end, they ranked 1st among all participating teams in ImageCLEF 2020. The Inception team [5] used a pre-trained VGG16 model with the last (softmax) layer removed and all layers except the last four frozen, combined with data augmentation techniques such as geometric transformation, flipping, padding, or random erasing of the image. The Inception team ranked 2nd among all participating teams in ImageCLEF 2020.

3 Data Description

The VQA-Med dataset [16] provided by ImageCLEF 2021 [17] consists of 4000 radiology images with associated question-answer pairs, split into a training set of 3500 pairs and a validation set of 500 pairs. Fig. 1 shows three examples from the VQA-Med 2021 dataset.

Fig. 1. Three examples from the ImageCLEF 2021 VQA-Med dataset (Q: what is the primary abnormality in this image? A: adrenal adenoma; Q: is this a normal gastrointestinal image? A: yes; Q: is there an abnormality in the x-ray? A: no)

The ImageCLEF 2019 dataset [4] can be used as additional training data; it contains 3200 medical images and 12792 question-answer pairs associated with the images. However, unlike the VQA-Med 2021 dataset, it focuses on four main question categories: modality, plane, organ system, and abnormality. In this paper, we extend the VQA-Med 2021 training set with 473 of its images and question-answer pairs. To further expand the training data, we also add the 500 validation pairs of VQA-Med 2021 to the training set for model training.

4 Methodology

In this section, we introduce the model we submitted to the ImageCLEF VQA-Med 2021 competition. Our model consists of four parts: an image feature extractor, a text feature extractor, Multi-modal Factorized High-order (MFH) pooling feature fusion with a co-attention mechanism, and a classification model. We regard the ImageCLEF VQA-Med 2021 task as a classification task with C categories, where C is the number of distinct answers in the task data. Fig. 2 shows the structure of our model.

Fig. 2. Our model architecture: image features and question features (e.g., "what is abnormal in the ct scan?") are extracted, fused with co-attention, and classified into one of the candidate answers (e.g., unicornuate uterus, hiatal hernia, benign cystic teratoma, odontoid fracture)
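Since the label space is simply the set of distinct answers, C can be computed directly from the training question-answer pairs. The following minimal sketch illustrates this; the helper name and the in-memory (question, answer) representation are illustrative assumptions and are not taken from the released code.

```python
def build_answer_vocab(qa_pairs):
    """Map each distinct (normalised) answer string to a class index."""
    answers = sorted({answer.strip().lower() for _, answer in qa_pairs})
    return {ans: idx for idx, ans in enumerate(answers)}

# Example: qa_pairs is a list of (question, answer) tuples loaded from the training files.
qa_pairs = [
    ("what is abnormal in the ct scan?", "odontoid fracture"),
    ("is there an abnormality in the x-ray?", "no"),
]
answer_to_idx = build_answer_vocab(qa_pairs)
num_classes = len(answer_to_idx)   # this is C, the number of answer categories
```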
4.1 Image feature extractor

In our model, we use a VGG16 network pre-trained on the ImageNet dataset to extract image features, and the GAP strategy is applied to the VGG16 outputs to prevent overfitting. GAP averages each convolutional feature map over its spatial dimensions, producing one value per channel. The input has shape 224 × 224 × 3, and the convolutional blocks of VGG16 produce feature maps of shape 224 × 224 × 64, 112 × 112 × 128, 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 512, and 7 × 7 × 512. Applying GAP to the five block outputs yields five vectors of shape 1 × 1 × 64, 1 × 1 × 128, 1 × 1 × 256, 1 × 1 × 512, and 1 × 1 × 512, and we concatenate these five vectors into a 1472-dimensional image feature vector, which is passed to the next module.

4.2 Text feature extractor

BioBERT [10] is used to extract the semantic features of a given question. BioBERT is a pre-trained language representation model for the biomedical domain with the same network structure as BERT. Unlike most biomedical text mining models, which focus on a single task, BioBERT achieves state-of-the-art performance on a variety of biomedical text mining tasks. To obtain features that represent the question, we encode the question into input_ids, attention_mask, and token_type_ids and feed them into BioBERT, which outputs a 768-dimensional vector.

4.3 Feature fusion

Multi-modal feature fusion is one of the most important techniques for improving the performance of a VQA model. Most existing methods combine visual features and textual semantic features with a simple linear model. This paper uses Multi-modal Factorized High-order (MFH) pooling [8], which can fuse multi-modal features at a lower computational cost. At the same time, the co-attention mechanism helps the model attend to the important parts of the visual and textual features while ignoring irrelevant information. In this part, the 1472-dimensional image feature and the 768-dimensional text feature are fed into the MFH module with co-attention, and a 2000-dimensional fused feature is obtained and passed to the next module.

4.4 Answer prediction

Based on our analysis of the competition dataset, we regard this task as a classification task. The 2000-dimensional fused feature is first passed through a dropout layer with P = 0.3 and then through a fully connected layer for the final classification prediction. Illustrative code sketches of the four components described above are given below.
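To make Section 4.1 concrete, the following PyTorch sketch shows one way to implement the multi-scale VGG16 + GAP encoder. The class name and the block indices for torchvision's vgg16 are illustrative assumptions and are not taken verbatim from the released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16GAPEncoder(nn.Module):
    """Multi-scale image encoder: GAP over the five VGG16 convolutional blocks."""
    def __init__(self):
        super().__init__()
        # Newer torchvision versions use the `weights` argument instead of `pretrained`.
        vgg = models.vgg16(pretrained=True)
        feats = list(vgg.features.children())
        # Split torchvision's VGG16 feature stack into its five convolutional blocks.
        self.blocks = nn.ModuleList([
            nn.Sequential(*feats[:5]),     # block 1 -> 64 channels
            nn.Sequential(*feats[5:10]),   # block 2 -> 128 channels
            nn.Sequential(*feats[10:17]),  # block 3 -> 256 channels
            nn.Sequential(*feats[17:24]),  # block 4 -> 512 channels
            nn.Sequential(*feats[24:]),    # block 5 -> 512 channels
        ])
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling per channel

    def forward(self, x):                   # x: (B, 3, 224, 224)
        pooled = []
        for block in self.blocks:
            x = block(x)
            pooled.append(self.gap(x).flatten(1))  # (B, C) for each block
        return torch.cat(pooled, dim=1)    # (B, 64+128+256+512+512) = (B, 1472)
```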
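Similarly, the question encoding of Section 4.2 can be sketched with the HuggingFace Transformers library. The checkpoint name dmis-lab/biobert-base-cased-v1.1 and the use of the [CLS] hidden state as the 768-dimensional question vector are assumptions for illustration; the paper does not state which BioBERT release or pooling strategy was used.

```python
import torch
from transformers import AutoTokenizer, AutoModel

BIOBERT = "dmis-lab/biobert-base-cased-v1.1"   # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(BIOBERT)
bert = AutoModel.from_pretrained(BIOBERT)

def encode_question(question: str) -> torch.Tensor:
    """Return a single 768-dimensional vector for the question ([CLS] hidden state)."""
    enc = tokenizer(question, return_tensors="pt", padding="max_length",
                    truncation=True, max_length=32)
    # enc contains input_ids, attention_mask, and token_type_ids, as in Section 4.2.
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0, :]      # shape (1, 768)

q_vec = encode_question("what is abnormal in the ct scan?")
```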
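For Section 4.3, the sketch below implements a single simplified Multi-modal Factorized Bilinear (MFB) block. MFH [8] cascades several such blocks, and our model additionally applies co-attention; both are omitted here for brevity, so this is only a stand-in, not the actual fusion module.

```python
import torch
import torch.nn as nn

class MFBFusion(nn.Module):
    """One simplified MFB block: project, multiply element-wise, sum-pool, normalise."""
    def __init__(self, img_dim=1472, txt_dim=768, out_dim=2000, factor=5):
        super().__init__()
        self.k = factor
        self.proj_img = nn.Linear(img_dim, out_dim * factor)
        self.proj_txt = nn.Linear(txt_dim, out_dim * factor)
        self.dropout = nn.Dropout(0.1)          # fusion-internal dropout (cf. Section 5.2)

    def forward(self, img, txt):                # img: (B, 1472), txt: (B, 768)
        joint = self.proj_img(img) * self.proj_txt(txt)   # element-wise product
        joint = self.dropout(joint)
        # Sum pooling over the factor dimension, then power and L2 normalisation.
        joint = joint.view(joint.size(0), -1, self.k).sum(dim=2)
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return nn.functional.normalize(joint, dim=1)      # (B, 2000)
```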
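Finally, the pieces can be tied together with the classifier of Section 4.4 (dropout with P = 0.3 followed by a fully connected layer). The skeleton below reuses the sketches above and expects pre-computed 768-dimensional question vectors; it is a simplified assembly, not the released model.

```python
import torch
import torch.nn as nn

class MedVQAModel(nn.Module):
    """Skeleton: image encoder + question vector -> fusion -> dropout + linear classifier."""
    def __init__(self, num_answers: int):
        super().__init__()
        self.image_encoder = VGG16GAPEncoder()        # sketched above, outputs (B, 1472)
        self.fusion = MFBFusion(1472, 768, 2000)      # simplified stand-in for MFH + co-attention
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.3),                        # dropout with P = 0.3 before classification
            nn.Linear(2000, num_answers),             # one logit per candidate answer
        )

    def forward(self, images, question_features):
        img = self.image_encoder(images)              # (B, 1472)
        fused = self.fusion(img, question_features)   # (B, 2000)
        return self.classifier(fused)                 # (B, C)
```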
5 Experiments

Our model was trained for 350 epochs on an RTX 2080 Ti GPU, which took about 7 hours. In this section, we describe the training process and parameters in detail.

5.1 Training data extension

In addition to the dataset provided by the ImageCLEF 2021 VQA-Med task, we also add the abnormality subset of the ImageCLEF 2019 VQA-Med training set, which contains 473 images and question-answer pairs, to the training data for this task. From the ImageCLEF 2019 VQA-Med training set, only question-answer pairs whose questions also occur in the ImageCLEF 2021 VQA-Med test set are added as training data.

5.2 Hyperparameters

To achieve the best performance on the validation set, the hyperparameters were tuned several times. We finally use the binary cross-entropy loss function, the Adamax optimizer, dropout with P = 0.1, and an initial learning rate of 1e-3. In the multi-modal feature fusion part, the default hyperparameter settings of MFH (with co-attention) are used.

5.3 Evaluation

In VQA-Med 2021, accuracy and BLEU are used as the evaluation criteria. Accuracy is the proportion of correctly answered samples among all samples, and the BLEU score measures the similarity between the ground-truth answer and the predicted answer. The maximum accuracy we achieved on the validation set is 66.8%. Fig. 3 shows the training and validation accuracy curves over the course of training. Among a total of ten valid submissions, the VGG16 (with GAP) + BioBERT + MFH pooling (with co-attention) + linear classification layer model proposed in this paper finally achieved an accuracy score of 0.362 and a BLEU score of 0.402, which won second place in this competition. The official results are shown in Table 1; our team ID is Zhao_Ling_Ling.

Fig. 3. Accuracy and loss curves over training epochs

Table 1. Official results of ImageCLEF VQA-Med 2021

Participants             Accuracy   BLEU
duadua                   0.382      0.416
Zhao_Ling_Ling           0.362      0.402
TeamS                    0.348      0.391
jeanbenoit_delbrouck     0.348      0.384
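For reference, the training configuration of Section 5.2 corresponds to a loop of roughly the following form. Here train_loader is an assumed DataLoader yielding image tensors, pre-computed question vectors, and one-hot answer targets, and MedVQAModel and num_classes come from the earlier sketches; none of these names are taken from the released code.

```python
import torch
import torch.nn as nn

model = MedVQAModel(num_answers=num_classes)            # skeleton sketched in Section 4
criterion = nn.BCEWithLogitsLoss()                       # binary cross-entropy over one-hot targets
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-3)

for epoch in range(350):                                 # the model was trained for 350 epochs
    model.train()
    for images, question_vecs, targets in train_loader:  # targets: (B, C) one-hot vectors
        logits = model(images, question_vecs)            # (B, C)
        loss = criterion(logits, targets.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```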
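The two metrics of Section 5.3 can be approximated as follows; the official evaluation scripts apply additional answer preprocessing, so these helpers are only an illustration of the idea, not the organisers' implementation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def accuracy(predictions, references):
    """Fraction of predicted answers that exactly match the reference answers."""
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)

def mean_bleu(predictions, references):
    """Average sentence-level BLEU between predicted and reference answers."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([r.lower().split()], p.lower().split(),
                            smoothing_function=smooth)
              for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```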
"Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering." IEEE transactions on neural networks and learning systems 29, no. 12 (2018): 5947-5959. [9] Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013). [10] Alsentzer, Emily, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. "Publicly available clinical BERT embeddings." arXiv preprint arXiv:1904.03323 (2019). [11] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018). [12] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016. [13] Staudemeyer, Ralf C., and Eric Rothstein Morris. "Understanding LSTM--a tutorial into Long Short-Term Memory Recurrent Neural Networks." arXiv preprint arXiv:1909.09586 (2019). [14] Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). [15] Vu, Minh, Raphael Sznitman, Tufve Nyholm, and Tommy Löfstedt. "Ensemble of streamlined bilinear visual question answering models for the imageclef 2019 challenge in the medical domain." In CLEF 2019-Conference and Labs of the Evaluation Forum, Lugano, Switzerland, Sept 9-12, 2019, vol. 2380. 2019. [16] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org,Bucharest, Romania, 2021. [17] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.Hasan, V. Kovalev, S. Kozlovski, V. Liauchuk, Y. Dicente, O. Pelka, A. G. S. de Herrera,J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR MeetsMultilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in ComputerScience, Springer, Bucharest, Romania, 2021.