ImageCLEF 2020: An approach for Visual Question Answering using VGG-LSTM for different datasets

Sheerin Sitara Noor Mohamed [0000-0003-1752-2107], Kavitha Srinivasan [0000-0003-3439-2383]
Department of CSE, SSN College of Engineering, Kalavakkam – 603110, India
sheerinsitaran@ssn.edu.in, kavithas@ssn.edu.in

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. The recent advancement and digitalization in the medical domain require an image based question answering system to support clinical decisions. Such a system also helps patients learn about their present condition quickly and with more information. As an effort to promote this development, ImageCLEF 2020 organizes the third edition of the Visual Question Answering (VQA) task, in which abnormality related questions are to be answered for a given set of radiology images. In the proposed system, VGGNet (with a transfer learning approach) and LSTM are used to extract the image and text feature vectors respectively in the encoder stage. Both feature vectors are then combined and given as input to the decoder for predicting the answer. The reasons for selecting VGGNet and LSTM are: (i) VGGNet extracts medical image features effectively from a small dataset, and (ii) LSTM is capable of retaining significant information from the text. Moreover, the proposed model is evaluated on three datasets, namely the original dataset (4500 samples), the reduced dataset (4348 samples) and the augmented reduced dataset (4626 samples). The proposed model achieved an accuracy of 0.282 and a BLEU score of 0.330 on the augmented reduced dataset, which is ranked ninth among all participating groups in the ImageCLEF 2020 VQA-Med task.

Keywords. VQA; VGGNet; LSTM; medical domain; augmented dataset; reduced dataset; ImageCLEF

1 Introduction

The amount of data generated and used in this era is increasing exponentially, and the medical domain is no exception. At the same time, people expect answers for everything they come across on the internet. Multiple search engines work towards satisfying this thirst for knowledge, but very few image based search engines are available, and those that exist are generalized and not suitable for the medical domain.

The medical domain is wide, and answering questions in it needs prior knowledge and analysis. These issues can be addressed by improving medical image based question answering systems. To enhance this research further, ImageCLEF has been conducting a VQA task in the medical domain since 2018 [1].

Visual Question Answering (VQA) in the medical domain helps people (especially the partially sighted) to better understand their condition and supports clinical decisions. The challenges of VQA in the medical domain include: (i) parameter selection and feature extraction for medical datasets, which differ from real world and abstract datasets; (ii) a specific VQA model that works for all medical categories is still in the developing stage; for example, in [2], different approaches are required to answer different medical questions: a pre-trained model followed by a BERT model answers organ, plane and modality related questions, whereas abnormality related questions are answered effectively by a sequence-to-sequence model; (iii) a single optimal model that detects all types of medical abnormalities in different regions still needs attention and effort.
However, abnormality detection restricted to a particular region is available. For example, in [4], a bifurcated structure detects four gastrointestinal abnormalities in WCE images and three dermoscopic lesions in the PH2 dataset. The two abnormality categories are detected separately, attaining accuracies of 97.8% and 97.5% respectively. (iv) Memory and time constraints.

The remaining part of the paper spans the following sections. In Sect. 2, the literature related to automation in the medical domain, inferences from VQA tasks on real world datasets and recent advancements of VQA in the medical domain are discussed. Sect. 3 gives a brief description of the ImageCLEF VQA-Med 2020 dataset and the two other proposed datasets used for analysis and validation. In Sect. 4, the design of the proposed VQA model based on the inferences attained and its implementation are explained. A brief summary of the results and the evaluation of all five runs are given in Sect. 5, and the conclusion is given at the end.

2 Related Works

Recent studies show a tremendous advancement in the medical domain. One of the most important advancements is that medical data in structured, semi-structured and unstructured formats have been digitized. Over the last decade, Artificial Intelligence (AI) has built on this digitization to bring automation to the medical domain. In [3], natural language text (medical history, physical examination results, results of X-ray, ultrasound or ECG) is collected, analysed and used to find dependencies between features, in order to improve healthcare quality in a multidisciplinary paediatric centre using deep linguistic techniques. Digitization also benefits medical imaging applications such as image classification [5], caption generation [6] and severity scoring [7]. The reliability of these applications depends on the features extracted from the images. At present, Convolutional Neural Networks (CNN) and pre-trained models like VGGNet or ResNet play a vital role in feature extraction for VQA related applications.

VQA in the medical domain emerged from the knowledge obtained on real world datasets like MSCOCO, DAQUAR, the VQA dataset, FM-IQA and Visual7W. The inferences are: (i) detailed understanding of the image and complex reasoning are required to answer visual questions, because they selectively target background details and/or the underlying context [8]; (ii) questions are arbitrary and impose many computer vision sub-problems such as object localization, detection and/or counting [9]; (iii) improvement on rare question types has negligible impact on overall performance [10]; (iv) the least contributing question types may need to be sacrificed because they pull down the overall performance [10]; (v) appropriate parameter selection (activation function, large mini-batches, smart shuffling of training data, and word embeddings from GloVe, Google Images, etc.) has its own impact on the performance of the model.
Table 1. Brief description of the ImageCLEF VQA-Med task for the last three years

Dataset | Category | Training set (images / QA pairs) | Validation set (images / QA pairs) | Test set (images / QA pairs) | Accuracy | BLEU | WBSS | CBSS
[11] | Organ, plane, modality and abnormality | 2278 / 5413 | 324 / 500 | 264 / 500 | - | 0.162 | 0.186 | 0.338
[12] | Organ, plane, modality and abnormality | 3200 / 12792 | 500 / 2000 | 500 / 500 | 0.644 | 0.624 | - | -
[13] | Abnormality | 4000 / 4000 | 500 / 500 | 500 / 500 | 0.496 | 0.542 | - | -

ImageCLEF has been conducting a VQA task in the medical domain since 2018. The VQA-Med 2018 and VQA-Med 2019 datasets contain organ, plane, modality and abnormality related visual question answer pairs. In these tasks, most researchers applied pre-trained models such as VGGNet or ResNet to encode the medical images and Recurrent Neural Networks (RNN) to generate question encodings. Some researchers applied attention based mechanisms to extract the image features relevant for answering the questions. The highest BLEU, WBSS and CBSS scores obtained in the 2018 task are 0.162, 0.186 and 0.338 respectively [11]. In 2019, along with the above approaches, different pooling strategies and transformer-based approaches were also used, attaining a highest accuracy of 0.644 and a BLEU score of 0.624 [12]. The overall summary of the ImageCLEF VQA tasks is tabulated in Table 1.

From the overall inference, VGGNet and LSTM are selected for the implementation of the VQA 2020 task. The advantages of selecting VGGNet [14] for image feature extraction include: (i) it is built on the ImageNet dataset but works for other datasets and tasks; (ii) it performs well on complex recognition tasks involving less detailed images; (iii) it addresses the vanishing and exploding gradient problems; (iv) it illustrates the importance of depth in visual representation.

The advantages of using LSTM [15] are: (i) it was developed on the TIMIT dataset, but it can solve complex sequence learning problems in handwriting recognition, speech recognition, polyphonic music modelling, etc.; (ii) the role of the hyperparameters in the LSTM structure is well studied: (a) coupling the input and forget gates simplifies the LSTM structure and reduces the number of parameters and the computational cost without significantly decreasing performance; (b) Gaussian noise is moderately helpful for the TIMIT dataset but harmful for other datasets; (c) the highest measured interactions between hyperparameters are quite small.

3 Dataset Description

In this section, three medical VQA datasets are described: the ImageCLEF VQA-Med 2020 dataset (Original Dataset, OD) and two datasets derived from it with modifications (Reduced Dataset, RD, and Augmented Reduced Dataset, ARD).

The original dataset (ImageCLEF VQA-Med 2020) is divided into three subsets, namely a training set, a validation set and a test set of 4000, 500 and 500 images with an equal number of question answer pairs. The dataset consists of abnormality related visual questions for different organs (e.g. lung, skull, spine, gastrointestinal, musculoskeletal), planes (e.g. axial, sagittal and coronal) and modalities (e.g. CT, X-ray, MRI). For better learning, the training set and validation set (4500 samples in total) are used together for training.

The Reduced Dataset (RD) consists of 4348 samples (from the training and validation sets) for training and 500 samples for testing. The reduced dataset is generated in two ways: (i) eliminating the least contributing samples, and (ii) identifying and reducing the number of samples of a class whose count deviates greatly from the remaining classes. Such samples degrade the overall performance of the system, and hence both approaches are applied (a sketch of this reduction step is given at the end of this section).

The Augmented Reduced Dataset (ARD) consists of 4626 samples (from the training and validation sets) for training and 500 samples for testing. The dataset is augmented by collecting samples from VQA-Med 2018 and 2019, which are merged with RD to generate the Augmented Reduced Dataset. Augmenting the training set improves learning and, as a result, produces a better model. The OD, RD and ARD contain 330, 316 and 316 classes respectively.
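The exact selection criteria for the reduction are not specified above, so the following is only a hypothetical sketch of such a step in Python/pandas. The column names ('question', 'answer'), the file names and the thresholds are assumptions made for illustration, not values taken from this work.

```python
# Hypothetical sketch of the dataset-reduction step described in Sect. 3.
# Column names, file names and thresholds are assumptions.
import pandas as pd

def reduce_dataset(df: pd.DataFrame,
                   min_per_class: int = 2,
                   max_per_class: int = 50,
                   seed: int = 0) -> pd.DataFrame:
    counts = df["answer"].value_counts()

    # (i) eliminate the least contributing samples: answer classes that occur
    #     fewer than `min_per_class` times in the merged training data
    keep = counts[counts >= min_per_class].index
    df = df[df["answer"].isin(keep)]

    # (ii) down-sample classes whose count deviates strongly from the rest
    balanced = (df.groupby("answer", group_keys=False)
                  .apply(lambda g: g.sample(min(len(g), max_per_class),
                                            random_state=seed)))
    return balanced.reset_index(drop=True)

# Example: merge the training and validation QA pairs, then reduce.
train = pd.read_csv("train_qa.csv")   # assumed file layout: image_id, question, answer
valid = pd.read_csv("valid_qa.csv")
reduced = reduce_dataset(pd.concat([train, valid], ignore_index=True))
print(len(reduced), "samples,", reduced["answer"].nunique(), "classes")
```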
4 System Design

In this VQA task, VGGNet and LSTM are used to answer the medical visual questions. The system design of the proposed model is shown in Fig. 1. The feature information from the medical image and its question-answer pair is extracted and concatenated by the encoder. Then, the concatenated feature vector is decoded timestamp by timestamp to generate the answer, with post-processing at the end. The proposed model consists of five modules, namely (i) pre-processing, (ii) encoder, (iii) decoder, (iv) post-processing and (v) answer prediction.

Fig. 1. System design

4.1 Pre-processing

In the pre-processing stage, the input samples are converted to the format required for effective image and text processing. As a first step, the images are reshaped to (229, 229), the input size used for the VGG16/VGG19 network. For text processing, the question-answer pairs are converted into a comma separated file. Commas that already occur within a field are converted to a related special symbol (here, a semicolon); otherwise, such commas would be treated as separators and result in misaligned fields.

4.2 Encoder

The encoder transforms the feature vectors into the format required by the model to answer the questions. This transformation is required because the type and dimension of the features extracted from an image and its respective text are different. Hence, a dimensionality mapping is required to bridge the gap before the feature vectors are concatenated. To perform this, the encoder has three sub-modules, namely (i) image processing, (ii) text processing and (iii) concatenation. The architecture of the encoder is given in Fig. 2, which shows each encoding stage along with the size of the feature vectors before and after concatenation.

Image Processing. In the proposed system, the image features are extracted by VGG16/VGG19. The last layer of the VGGNet is excluded and the resulting model is used for image feature extraction in a transfer learning approach. The last layer is excluded because VGGNet is trained on the ImageNet dataset (1000 classes), whereas the required output dimension is 1024. For this reason, dense (fully connected) layers are added after the penultimate layer to adjust the dimension of the image feature vector.

Fig. 2. Encoder

Text Processing. Text processing computes the dependencies between words and derives information from the sequence of input words. An LSTM (an advanced type of RNN) is used to generate the text feature vector. The input text is tokenized into individual words, and the minimum and maximum lengths of the questions and answers are computed. The LSTM computes the question embedding (using GloVe vectors) timestamp by timestamp for the respective samples. This vector is then given to a fully connected layer to project it to the same dimensional shape as the image feature vector.

Concatenation. The computed feature vectors (image and text feature vectors) are combined using element-wise multiplication and are later used by the decoder for model creation.

4.3 Decoder

The visual and textual features are merged into a three-dimensional tensor (a 2048-dimensional space), which is a sequence of vectors. Since the image and textual features are represented as a sequence of vectors (not as a single vector), an LSTM is used to feed the merged vector to the softmax layer. The architecture of this sub-module, the decoder, is shown in Fig. 3.

Fig. 3. Decoder
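To make the encoder-decoder pipeline of Sects. 4.1-4.3 concrete, a minimal sketch in Keras/TensorFlow is given below. It is an illustration, not the authors' exact implementation: the 1024-dimensional image and question vectors, the 2048-dimensional fused representation and the 330 answer classes follow the text, while the question length, vocabulary size, embedding dimension and LSTM widths are assumptions. The fusion is written as concatenation to match the 2048-dimensional description in Sect. 4.3 (Sect. 4.2 describes it as element-wise multiplication).

```python
# Hedged sketch of the VGG16-LSTM encoder-decoder described in Sect. 4
# (Keras / TensorFlow 2.x). Values marked "assumed" are not given in the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 330   # answer classes in the original dataset (Sect. 3)
MAX_Q_LEN   = 20    # assumed maximum question length after tokenization
VOCAB_SIZE  = 5000  # assumed question vocabulary size
EMBED_DIM   = 300   # assumed GloVe embedding dimension

# Image encoder: pre-trained VGG16 without its 1000-way ImageNet classifier,
# kept frozen (transfer learning), followed by a dense layer to reach 1024-d.
image_in = layers.Input(shape=(229, 229, 3), name="image")  # resize used in Sect. 4.1
vgg = VGG16(weights="imagenet", include_top=False, pooling="avg")
vgg.trainable = False
img_feat = layers.Dense(1024, activation="relu")(vgg(image_in))

# Question encoder: word embedding (could be initialised from GloVe) + LSTM,
# projected to the same 1024-d shape as the image feature vector (Sect. 4.2).
quest_in = layers.Input(shape=(MAX_Q_LEN,), name="question")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(quest_in)
q_feat = layers.LSTM(512)(emb)
q_feat = layers.Dense(1024, activation="relu")(q_feat)

# Fusion and decoder: the 2048-d fused representation is viewed as a short
# sequence of two 1024-d vectors and decoded by an LSTM + softmax (Sect. 4.3).
fused = layers.Concatenate()([img_feat, q_feat])   # 2048-d joint representation
seq = layers.Reshape((2, 1024))(fused)
dec = layers.LSTM(512)(seq)
dec = layers.Dropout(0.2)(dec)                      # dropout value reported in Sect. 5
answer = layers.Dense(NUM_CLASSES, activation="softmax")(dec)

model = Model(inputs=[image_in, quest_in], outputs=answer)
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

With the merged training and validation QA pairs prepared as in Sect. 3, training this sketch would follow the usual model.fit workflow, using the hyperparameters reported in Sect. 5 (batch size 256, 400 epochs).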
4.4 Post-processing

In post-processing, the generated answer is converted back to the format used in the training set: semicolons in the generated answers are converted back to commas.

4.5 Answer Prediction

In this stage, the encoder-decoder model based on VGG-LSTM is generated. The answers for the test set are predicted by this model. Further, the results are analysed and evaluated using performance metrics such as accuracy and BLEU score.

5 Experiments and Results

The proposed model is executed on the three datasets discussed in Sect. 3 and analysed using five different combinations of techniques: (i) VGG16 (excluding the last layer) followed by LSTM for the original dataset; (ii) the same as (i) for the reduced dataset; (iii) VGG16 (excluding the last layer) followed by LSTM with post-processing at the end for the augmented reduced dataset; (iv) the same as (iii), but with VGG19 as the pre-trained model; (v) the same as (i), but with post-processing included at the end. From the results, it is inferred that the proposed model with post-processing on the augmented reduced dataset gives better performance than the other combinations. In Table 2, OD, RD and ARD represent the Original Dataset, Reduced Dataset and Augmented Reduced Dataset respectively.

Table 2. Brief description of each run

Run number | Dataset | Techniques | Accuracy | BLEU score
1 | OD | VGG16 and LSTM | 0.274 | 0.321
2 | ARD | VGG16 and LSTM | 0.268 | 0.320
3 | ARD | VGG16 and LSTM (post-processing) | 0.282 | 0.330
4 | RD | VGG19 and LSTM (post-processing) | 0.248 | 0.292
5 | OD | VGG16 and LSTM (post-processing) | 0.276 | 0.323

The performance of the model also depends on appropriate parameter selection. In this model, the RMSprop optimizer is used with a learning rate of 0.001, and the batch size, number of epochs and dropout are set to 256, 400 and 0.2 respectively. Using these hyperparameters, each run took approximately 180 minutes to train on a GPU. Among the five runs, the third run achieved the best accuracy of 0.282 and a BLEU score of 0.330. The final leaderboard results are given in Table 3, where our team achieved 9th place.

Table 3. Top 10 ranking of ImageCLEF 2020 VQA-Med

Rank | Team name | Accuracy | BLEU score | No. of runs submitted
1 | z_liao | 0.496 | 0.542 | 5
2 | TheInceptionTeam | 0.480 | 0.511 | 5
3 | bumjun_jung | 0.466 | 0.502 | 5
4 | going | 0.426 | 0.462 | 5
5 | NLM | 0.400 | 0.441 | 5
6 | harendrakv | 0.378 | 0.439 | 7
7 | shengyan | 0.376 | 0.412 | 5
8 | kdevqa | 0.314 | 0.350 | 4
9 | sheerin | 0.282 | 0.330 | 5
10 | umassmednlp | 0.220 | 0.340 | 4
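The accuracy and BLEU scores reported in Tables 2 and 3 were produced by the official task evaluation. For reference, the following is a minimal sketch of how comparable metrics could be computed offline, assuming exact string matching for accuracy and NLTK's sentence-level BLEU; the official evaluation applies its own pre-processing, so these values would only approximate the leaderboard scores.

```python
# Sketch of offline accuracy and BLEU computation (assumption: NLTK sentence BLEU).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate(predictions, references):
    smooth = SmoothingFunction().method1
    exact, bleu = 0, 0.0
    for pred, ref in zip(predictions, references):
        pred, ref = pred.strip().lower(), ref.strip().lower()
        exact += int(pred == ref)                      # exact-match accuracy
        bleu += sentence_bleu([ref.split()], pred.split(),
                              smoothing_function=smooth)
    n = len(references)
    return exact / n, bleu / n

acc, bleu = evaluate(["pulmonary embolism"], ["pulmonary embolism"])
print(f"accuracy={acc:.3f}  bleu={bleu:.3f}")
```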
6 Conclusion and Future Work

In this paper, an approach for Visual Question Answering (VQA) in the medical domain is implemented for the ImageCLEF VQA-Med 2020 dataset and further analysed using two proposed datasets, the Reduced Dataset (RD) and the Augmented Reduced Dataset (ARD). The proposed model has five stages: (i) pre-processing, (ii) encoding, (iii) decoding, (iv) post-processing and (v) answer prediction. In pre-processing, the dataset is converted to the input format required by VGGNet and LSTM. Then the image and text features are extracted and concatenated, and the concatenated feature vector is decoded at the next level. In post-processing, the answer is converted to the format used in the training dataset. Finally, the generated model predicts the answers for the test set. Among the five runs of the proposed model, the best result is achieved on the augmented reduced dataset, with an accuracy of 0.282 and a BLEU score of 0.330.

In the medical VQA domain, a large amount of information needs to be extracted, and hence memory is a significant constraint. This can be addressed with the help of GPUs and the selection of optimal hyperparameters. In future, the proposed VQA model can be improved by designing a Convolutional Neural Network (CNN) tailored to medical images and fixing appropriate hyperparameters with visualization of the layers. In addition, an advanced text processing approach such as BERT, which represents each sentence as a 768-d question feature vector, can be included.

7 Acknowledgement

We express our profound gratitude to the Department of CSE, SSN College of Engineering, for allowing us to utilize the High Performance Computing Laboratory and GPU server for the successful execution of this challenge.

References

1. Ionescu, B., Müller, H., Péteri, R., Ben Abacha, A., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., Herrera, A.G.S.D., Ninh, V., Le, T., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M., Lux, M., Gurrin, C., Dang-Nguyen, D., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Stefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia Retrieval in Lifelogging, Medical, Nature and Internet Applications. In: Experimental IR Meets Multilinguality, Multimodality and Interaction, Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), Thessaloniki, Greece, September 22-25. LNCS Lecture Notes in Computer Science, Springer (2020).
2. Zhou, Y., Kang, X., Ren, F.: TUA1 at ImageCLEF 2019 VQA-Med: A Classification and Generation Model based on Transfer Learning. In: CLEF 2019 Working Notes, CEUR Workshop Proceedings (2019).
3. Baranov, A.A., Namazova-Baranova, L.S., Smirnov, I.V., Devyatkin, D.A., Shelmanov, A.O., Vishneva, E.A., Antonova, E.V., Smirnov, V.I.: Technologies for Complex Intelligent Clinical Data Analysis. In: Annals of the Russian Academy of Medical Sciences, 71(2), pp. 160-171 (2016).
4. Hajabdollahi, M., Esfandiarpoor, R., Sabeti, E., Karimi, N., Soroushmehr, S.M.R., Samavi, S.: Multiple Abnormality Detection for Automatic Medical Image Diagnosis using Bifurcated Convolutional Neural Network. In: Biomedical Signal Processing and Control, 57, pp. 101792-101802 (2020).
5. Liauchuk, V., Tarasau, A., Snezhko, E., Kovalev, V.: ImageCLEF 2018: Lesion-based TB-Descriptor for CT Image Analysis. In: CLEF 2018 Working Notes, CEUR Workshop Proceedings (2018).
6. Herrera, A.G.S.D., Eickhoff, C., Andrearczyk, V., Müller, H.: Overview of the ImageCLEF 2018 Caption Prediction Tasks. In: CLEF 2018 Working Notes, CEUR Workshop Proceedings (2018).
7. Kavitha, S., Nandhinee, P.R., Harshana, S., Srividya, J.S., Harrinei, K.: ImageCLEF 2019: A 2D Convolutional Neural Network Approach for Severity Scoring of Lung Tuberculosis using CT Images. In: CLEF 2019 Working Notes, CEUR Workshop Proceedings (2019).
8. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Zitnick, L., Batra, D., Parikh, D.: VQA: Visual Question Answering. In: International Conference on Computer Vision, pp. 2425-2433 (2015).
9. Kafle, K., Kanan, C.: Visual Question Answering: Datasets, Algorithms and Future Challenges. In: Computer Vision and Image Understanding, 163, pp. 3-20 (2016).
10. Teney, D., Anderson, P., He, X., Hengel, A.V.D.: Tips and Tricks for Visual Question Answering: Learning from the 2017 Challenge. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4223-4232 (2018).
11. Hasan, S.A., Ling, Y., Farri, O., Liu, J., Müller, H., Lungren, M.: Overview of ImageCLEF 2018 Medical Domain Visual Question Answering Task. In: CLEF 2018 Working Notes, CEUR Workshop Proceedings (2018).
12. Ben Abacha, A., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. In: CLEF 2019 Working Notes, CEUR Workshop Proceedings (2019).
13. Ben Abacha, A., Datla, V.V., Hasan, S.A., Demner-Fushman, D., Müller, H.: Overview of the VQA-Med Task at ImageCLEF 2020: Visual Question Answering and Generation in the Medical Domain. In: CLEF 2020 Working Notes, CEUR Workshop Proceedings (2020).
14. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: International Conference on Learning Representations, pp. 1-14 (2014).
15. Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: A Search Space Odyssey. In: IEEE Transactions on Neural Networks and Learning Systems, 28(10), pp. 2222-2232 (2017).
16. Nguyen, B.D., Do, T.T., Nguyen, B.X., Do, T., Tjiputra, E., Tran, Q.D.: Overcoming Data Limitation in Medical Visual Question Answering. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 522-530 (2019).