SSN MLRG at VQA-MED 2021: An Approach for VQA to Solve Abnormality Related Queries using Improved Datasets

Noor Mohamed Sheerin Sitara and Srinivasan Kavitha
Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam – 603110, India

Abstract
Visual Question Answering (VQA) in the medical domain has attained tremendous advancement in the last few years. To promote VQA research, the ImageCLEF forum is organizing the fourth edition of the VQA task in the medical domain. This year, abnormality-related VQA queries are to be answered for a given set of radiology images. In the proposed system, VGGNet (based on a transfer learning approach) and LSTM are used to extract image and text features respectively. The three extracted feature vectors (embedding, image and text) are concatenated into a sequence of vectors by an LSTM for predicting the answer. VGGNet and LSTM were selected because VGGNet performs well on complex recognition tasks and addresses the vanishing and exploding gradient problems, while LSTM solves complex sequence learning problems and overcomes long-term dependency problems. In addition, the hyperparameters are chosen appropriately, and four datasets (the given dataset and three improved variants) are used to analyze the performance of the proposed model. The improved datasets are built by collecting samples from previous ImageCLEF VQA-MED tasks. The proposed model achieved an accuracy of 0.196 and a BLEU score of 0.227 on one of the datasets, which ranked tenth among all participating groups in the ImageCLEF 2021 VQA-MED task.

Keywords
Visual Question Answering; VGGNet; Long Short Term Memory; medical domain; VQA dataset; augmented dataset; reduced dataset; ImageCLEF

1. Introduction
Recent studies from 2020 reveal that 90% of data is unlabelled and 40-50% of data is in the form of images [12]. Hence an Artificial Intelligence (AI) approach is required to analyze both image and text.
Nowadays, the advantages of AI approaches extend to different applications such as text summarization, machine translation, sentiment analysis, image captioning, and Visual Question Answering (VQA). Among these, VQA combines both image and text; real-world [1], abstract [2] and medical [3] VQA datasets have evolved in this decade. For the medical domain, ImageCLEF has organized medical image captioning and VQA tasks since 2018 [3]. Since 2020, ImageCLEF has concentrated on solving abnormality-related VQA questions [4]. A medical-domain Visual Question Answering system takes one or more abnormality-related natural language questions with the respective radiology images as input and predicts the appropriate answer as output. Some applications of medical VQA are: (i) helping partially sighted people, and (ii) supporting clinical decision making. To answer medical VQA queries, the visual information of the radiology image is extracted based on the significant textual content of the question. In other words, image features are extracted based on the text features, and finally both feature vectors are combined to answer the respective questions. Image processing techniques include Convolutional Neural Networks (CNN) and pre-trained models like VGGNet, ResNet and DenseNet; text processing techniques include Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT).

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
EMAIL: sheerinsitaran@ssn.edu.in (A. 1); kavithas@ssn.edu.in (A. 2)
ORCID: 0000-0003-1752-2107 (A. 1); 0000-0003-3439-2383 (A. 2)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

The overview of the ImageCLEF VQA-MED tasks (2018, 2019 and 2020) is summarized in Table 1.
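As a minimal illustration of how an LSTM of the kind listed above turns a question into a fixed-length text feature, the sketch below steps a single NumPy LSTM cell over a toy sequence of word embeddings. The dimensions and random weights are hypothetical placeholders, not the trained parameters of any task system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: input, forget and output gates plus a candidate cell state."""
    hidden = h.shape[0]
    z = W @ x + U @ h + b              # all four gate pre-activations, shape (4*hidden,)
    i = sigmoid(z[0:hidden])           # input gate
    f = sigmoid(z[hidden:2 * hidden])  # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])        # candidate cell state
    c_new = f * c + i * g              # cell state carries long-term memory
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy dimensions (assumed): 8-d word embeddings, 16-d hidden state, 5-word question.
rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 8, 16, 5
W = rng.normal(scale=0.1, size=(4 * hidden_dim, embed_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

# A question such as "what abnormality is seen in the image" becomes a sequence of
# word embeddings; the final hidden state serves as the fixed-length text feature.
question = rng.normal(size=(seq_len, embed_dim))
h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x in question:
    h, c = lstm_step(x, h, c, W, U, b)

text_feature = h
print(text_feature.shape)  # (16,)
```

Because the cell state is updated additively through the forget/input gates, gradients can flow across many time steps, which is the long-term dependency advantage cited for LSTM in the text.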
From the results, the observations are: (i) ImageCLEF VQA-MED 2019 achieved better performance than ImageCLEF VQA-MED 2018 because of the increased number of samples per class; (ii) in the ImageCLEF VQA-MED 2019 task, abnormality type VQA questions achieved lower performance than organ, plane and modality type questions; (iii) based on the ImageCLEF VQA-MED 2019 outcome, ImageCLEF began to concentrate on abnormality type questions from 2020, but the performance dropped. From the inference of the previous tasks (especially the ImageCLEF VQA-MED 2020 task) as tabulated in Table 1, VGGNet and LSTM (a modification of the Recurrent Neural Network, RNN) are used in the proposed model for VQA system development. In addition, VGGNet and LSTM have some advantages: (i) VGGNet performs well on complex recognition tasks and addresses the vanishing and exploding gradient problems [5]; (ii) LSTM solves complex sequence learning problems and overcomes long-term dependency problems [6].

Table 1
ImageCLEF VQA-MED task overview

Task                       | Widespread techniques (images / texts)              | Remarkable techniques (images / texts)                | Category                               | Remarkable accuracy | Remarkable BLEU score
ImageCLEF VQA-MED 2018 [7] | CNN / RNN                                           | ResNet / LSTM                                         | organ, plane, modality and abnormality | -                   | 0.162
ImageCLEF VQA-MED 2019 [8] | VGGNet or ResNet / BERT or RNN                      | CNN / BERT                                            | organ, plane, modality and abnormality | 0.624               | 0.644
ImageCLEF VQA-MED 2020 [9] | CNN, VGGNet or ResNet / BERT or modification of RNN | DenseNet and ResNet / skeleton-based sentence mapping | abnormality                            | 0.496               | 0.542

The research contributions to the ImageCLEF VQA-MED 2021 task using the proposed model are: (i) For training the model, the dataset is augmented from the ImageCLEF VQA-MED 2018, 2019 and 2020 (test set) datasets. From the 2018 and 2019 datasets, 126 samples associated with abnormality-related queries are collected and augmented.
The ImageCLEF VQA-MED 2020 test set, which consists of 500 radiology images with the respective 500 question-answer pairs, is also used for augmenting the dataset. (ii) In terms of implementation, VGGNet followed by LSTM is used for answering the medical questions related to radiology images. (iii) For building the model, hyperparameters like learning rate, number of epochs, batch size, momentum and dropout are selected, and their values are fixed based on the performance measures.

The remainder of the paper is organized as follows. In Sect. 2, the ImageCLEF VQA-MED 2021 task and its dataset are discussed and compared with the 2020 task. In Sect. 3, the design of the proposed VQA model and its implementation are explained. A brief summary of the obtained results and the performance evaluation is given in Sect. 4, with a conclusion at the end.

2. Task and Dataset Description
In this section, the ImageCLEF VQA-MED 2021 task and the given dataset are discussed, along with three types of improved datasets built from the previous VQA datasets.

2.1. ImageCLEF VQA-MED 2021 task
ImageCLEF, a part of the Conference and Labs of the Evaluation Forum, has been conducting tasks related to the medical domain since 2018. The ImageCLEF VQA-MED 2021 task concentrates on abnormality type questions for different organs, planes and modalities. In this task, 33 participants registered and 13 teams participated with 75 successful runs.

2.2. ImageCLEF VQA-MED 2021 dataset
The ImageCLEF VQA-MED 2021 dataset [10] is given as four subsets, namely the training set, validation set, new validation set and test set. The first two subsets are equivalent to the ImageCLEF VQA-MED 2020 dataset and are used for training. These subsets consist of 4500 radiology images and 4500 question-answer pairs, among which the validation set consists of 500 radiology images with the respective 500 question-answer pairs.
The new validation set consists of 500 question-answer pairs associated with 500 radiology images. Finally, the test set includes 500 radiology images and 500 questions about abnormality. The datasets used for training the proposed model are given in Table 2. The acronyms GD, GTD, AD and ARD represent the Given Dataset, the Given dataset along with the Test dataset from ImageCLEF VQA-MED 2020, the Augmented Dataset and the Augmented Reduced Dataset respectively. The Augmented Dataset consists of GTD along with the augmented samples from ImageCLEF VQA-MED 2018 and 2019. The Augmented Reduced Dataset is a modification of the AD dataset in which some samples are removed in two ways: (i) removing the least contributing samples, and (ii) identifying similar classes whose sample counts deviate from the remaining classes and reducing their number of samples.

Table 2
Dataset description

Dataset | Images | QA pairs | Classes
GD      | 4500   | 4500     | 330
GTD     | 5000   | 5000     | 366
AD      | 5126   | 5126     | 366
ARD     | 4848   | 4848     | 352

All four training sets contain different abnormality-related medical images along with the associated question-answer pairs.

The advancements in the ImageCLEF VQA-MED 2021 task compared to the 2020 task are: (i) the number of VQA samples is increased; (ii) the number of classes of abnormality type questions is increased.

3. Proposed Methodology
The proposed VQA model comprises VGGNet (used as a transfer learning approach) and LSTM to answer the VQA queries related to radiology images. This VQA model is further tuned by hyperparameter selection, as tabulated in Table 3, and supported by the three improved VQA-MED datasets (discussed in Section 2). VGGNet and LSTM are used to obtain the image features and text features respectively. These features are then combined using elementwise multiplication and used for model creation. The output of the model is the sequence of words for all possible answer classes.
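The fusion step described above (obtain an image feature and a text feature, combine them by elementwise multiplication, classify over the answer classes) can be sketched in NumPy. The projection sizes, tanh activations and random stand-in features are illustrative assumptions rather than the authors' exact configuration; only the 330 answer classes (the GD dataset) and the 4096-d size of a VGG16 fully-connected feature come from the paper and the standard VGG16 architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed dimensions: VGG16 fc-layer features are 4096-d; both modalities are
# projected to a common 256-d space before fusion; 330 answer classes as in GD.
img_dim, txt_dim, common_dim, n_classes = 4096, 256, 256, 330

image_feature = rng.normal(size=img_dim)  # stand-in for a VGG16 image feature
text_feature = rng.normal(size=txt_dim)   # stand-in for an LSTM question feature

W_img = rng.normal(scale=0.01, size=(common_dim, img_dim))
W_txt = rng.normal(scale=0.01, size=(common_dim, txt_dim))
W_out = rng.normal(scale=0.01, size=(n_classes, common_dim))

def tanh_proj(W, v):
    """Project a feature vector into the common space with a tanh nonlinearity."""
    return np.tanh(W @ v)

# Elementwise multiplication fuses the two modalities in the common space.
fused = tanh_proj(W_img, image_feature) * tanh_proj(W_txt, text_feature)

# Softmax over the answer classes gives the prediction distribution.
logits = W_out @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_class = int(np.argmax(probs))
print(probs.shape)  # (330,)
```

Elementwise multiplication keeps the fused vector at the common dimension and lets each image dimension gate the corresponding text dimension, which is one common alternative to plain concatenation in VQA models.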
Table 3
Hyperparameter selection and the respective values

Hyperparameter   | Value
Number of epochs | 800
Batch size       | 256
Momentum         | 0.9
Dropout          | 0.3
Learning rate    | 0.001

VGGNet, a pre-trained model, is used as a transfer learning approach. The transfer learning approach is adopted because of three factors: (i) higher start - a model with transfer learning outperforms a model without it; (ii) higher slope - the performance rate increases gradually in the training phase; (iii) higher asymptote - training converges smoothly.

4. Experiments and Results
The proposed model is executed on the four datasets (discussed in Section 2) and its performance is analyzed over five different runs, as given in Table 4: (i) VGG16 concatenated with LSTM, excluding the last layer, with a small number of epochs on the given dataset; (ii) same as (i) but with an increased number of epochs; (iii) similar to (ii) on the Given dataset along with the Test dataset from ImageCLEF VQA-MED 2020 (GTD); (iv) same as (ii) on the Augmented Dataset; (v) same as (ii) on the Augmented Reduced Dataset.

Table 4
Brief description of each run with performance scores

Run | Dataset | Number of epochs | Training error | Validation error | Test accuracy | Test BLEU score
1   | GD      | 31               | 0.158          | 0.231            | 0.020         | 0.049
2   | GD      | 401              | 0.130          | 0.219            | 0.172         | 0.213
3   | GTD     | 800              | 0.114          | 0.183            | 0.196         | 0.227
4   | AD      | 800              | 0.134          | 0.225            | 0.172         | 0.211
5   | ARD     | 800              | 0.132          | 0.221            | 0.170         | 0.208

The performance of the model depends on suitable hyperparameters with appropriate values, as given in Table 3. The results of the proposed model are analysed using suitable quantitative metrics for the different runs. The quantitative metrics include mean square error for the training and validation sets, and accuracy and BLEU score for the test set, as given in Table 4. The overall inferences are: (i)
Training error is lower than validation error because most of the samples are learned in the training phase and early stopping is followed. (ii) For the third run, both training and validation error are lower than in the other runs, which leads to a better prediction rate. (iii) Among the five runs, the third run achieved the best accuracy of 0.196 and BLEU score of 0.227, on the GTD dataset.

Table 5
Top 10 ranking of ImageCLEF 2021 VQA-MED

Rank | Team name            | Accuracy | BLEU score | No. of runs submitted
1    | duadua               | 0.382    | 0.416      | 10
2    | Zhao_Ling_Ling_      | 0.362    | 0.402      | 10
3    | TeamS                | 0.348    | 0.391      | 11
4    | Jeanbenoit_delbrouck | 0.348    | 0.384      | 13
5    | riven                | 0.332    | 0.361      | 1
6    | Zhao_Shi_            | 0.316    | 0.352      | 4
7    | IALab_PUC            | 0.236    | 0.276      | 7
8    | Li_Yong_             | 0.222    | 0.255      | 10
9    | silencec             | 0.220    | 0.235      | 2
10   | sheerin              | 0.196    | 0.227      | 5

The final leaderboard is given in Table 5, where our team achieved 10th place among the listed ranks. Our proposed model achieved an improved accuracy of 0.196 and BLEU score of 0.227 due to the use of time steps during the training phase. The time steps play a major role in relearning the appropriate answers of a sample based on the previously predicted answer; they also help the proposed model learn temporal patterns from a sequence of question-answer pairs based on radiology images. The overall experience from the VQA-MED 2021 task centres on the dataset itself: the task concentrates on abnormality type questions, which are harder to answer than organ, plane and modality type questions. The best accuracy dropped by 11.4 percentage points compared with the previous year for two reasons: the number of samples is large, and the number of classes also increased in the VQA-MED 2021 task.

5. Conclusion
This paper describes an approach to Visual Question Answering in the medical domain for the ImageCLEF VQA-MED 2021 dataset. ImageCLEF has concentrated on abnormality-related VQA datasets since the previous year.
Compared with the previous year, the number of samples, the number of abnormality types and the difficulty level have increased. For the VQA dataset, image features and text features are extracted using VGGNet and LSTM respectively, and finally both features are concatenated using an LSTM to predict the answer. In this VQA model, word embedding is used, which allows the model to focus on the parts of the image that are relevant to the keywords in the question. Since irrelevant parts of the radiology image are not taken into consideration, the classification accuracy is improved by reducing the chances of predicting wrong answers. To validate the model, four datasets, namely the Given Dataset (GD), the Given dataset along with the Test dataset from ImageCLEF VQA-MED 2020 (GTD), the Augmented Dataset (AD) and the Augmented Reduced Dataset (ARD), are used in five different runs. Among the five runs of the proposed model, the best result is achieved for the Given dataset along with the Test dataset from ImageCLEF VQA-MED 2020 (GTD), with an accuracy of 0.196 and a BLEU score of 0.227. Even though the 2021 dataset is complex, the appropriate parameter selection and the improved datasets help to maintain the performance of the proposed VQA system.

6. Acknowledgements
Our profound gratitude to the Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, for allowing us to utilize the High Performance Computing Laboratory and GPU server for the successful execution of this challenge.

7. References
[1] Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C.L., Parikh, D., Batra, D.: VQA: Visual Question Answering. International Journal of Computer Vision, 123(1), pp. 4-31 (2017).
[2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: VQA: Visual Question Answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425-2433 (2015).
[3] Ionescu, B., Müller, H., Péteri, R., Ben Abacha, A., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Kovalev, V., Kozlovski, S., Liauchuk, V., Dicente, Y., Pelka, O., García Seco de Herrera, A., Jacutprakart, J., Friedrich, C.M., Berari, R., Tauteanu, A., Fichou, D., Brie, P., Dogariu, M., Daniel Stefan, L., Gabriel Constantin, M., Chamberlain, J., Campello, A., Clark, A., Oliver, T.A., Moustahfid, H., Popescu, A., Deshayes-Chossart, J.: Overview of the ImageCLEF 2021: Multimedia Retrieval in Medical, Nature, Internet and Social Media Applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), Romania, September 21-24. Lecture Notes in Computer Science, Springer (2021).
[4] Sheerin Sitara, N., Kavitha, S.: ImageCLEF 2020: An Approach for Visual Question Answering using VGG-LSTM for Different Datasets. In: CLEF 2020 Working Notes, CEUR Workshop Proceedings, Greece, September 22-25 (2020).
[5] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: International Conference on Learning Representations, Canada, pp. 1-14 (2014).
[6] Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), pp. 2222-2232 (2017).
[7] Hasan, S.A., Ling, Y., Farri, O., Liu, J., Müller, H., Lungren, M.: Overview of ImageCLEF 2018 Medical Domain Visual Question Answering Task. In: CLEF 2018 Working Notes, CEUR Workshop Proceedings, Switzerland (2018).
[8] Ben Abacha, A., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. In: CLEF 2019 Working Notes, CEUR Workshop Proceedings, Switzerland (2019).
[9] Ionescu, B., Müller, H., Péteri, R., Ben Abacha, A., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Ninh, V., Le, T., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M., Lux, M., Gurrin, C., Dang-Nguyen, D., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Stefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia Retrieval in Lifelogging, Medical, Nature and Internet Applications. In: Experimental IR Meets Multilinguality, Multimodality and Interaction, Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), Greece, September 22-25. Lecture Notes in Computer Science, Springer (2020).
[10] Ben Abacha, A., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H.: Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain. In: CLEF 2021 Working Notes, CEUR Workshop Proceedings, Romania, September 21-24 (2021).
[11] Nguyen, B.D., Do, T.T., Nguyen, B.X., Do, T., Tjiputra, E., Tran, Q.D.: Overcoming Data Limitation in Medical Visual Question Answering. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 522-530 (2019).
[12] https://medium.com/pythoneers/vgg-16-architecture-implementation-and-practical-use-e0fef1d14557