Medical Visual Question Answering at ImageCLEF 2019 - VQA-Med

Mohit Bansal, Tanmay Gadgil, Rishi Shah and Parag Verma

PricewaterhouseCoopers US Advisory, Mumbai, India
mohit.b.bansal@pwc.com, tanmay.p.gadgil@pwc.com, rishi.s.shah@pwc.com, parag.verma@pwc.com

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract. This paper describes the submission created by PwC US Advisory for the Medical Domain Visual Question Answering (VQA-Med) task of ImageCLEF 2019. The goal of the challenge was to build a visual question answering system that uses medical images as context to generate answers. Our VQA pipeline classifies the questions into two groups: the first group is answered from a fixed pool of predefined answer categories, while the second group requires generating answers that describe the abnormality seen in the image. The first model combines question embeddings from the Universal Sentence Encoder with image embeddings from ResNet, which are fed into an attention-based classifier to produce answers. The second model uses the same ResNet image embeddings together with word embeddings from a Word2Vec model pre-trained on PubMed data as input to a sequence-to-sequence model that generates descriptions of abnormalities. This methodology achieved reasonable results, with a strict accuracy of 48% and a BLEU score of 53% on the challenge's test data.

Keywords: Visual Question Answering, Sequence to Sequence Model, ImageCLEF 2019, Attention

1 Introduction

Technical advancements in the healthcare sector have made it possible to store large numbers of health records electronically, which creates opportunities to use artificial intelligence (AI) for supporting clinical decision making and improving patient engagement. Visual question answering (VQA) is one way in which recent innovation in AI can be applied for the benefit of the healthcare sector. VQA is a relatively new approach that combines computer vision and natural language processing: given an image and a question, a VQA system processes both the visual and the textual input to produce an answer that is relevant to the question. Several interesting approaches to VQA have been published over the years, but for this challenge we designed a relatively simple architecture to best address the ImageCLEF 2019 problem. In this paper, we describe the models used to obtain the image embeddings, question embeddings and answer embeddings, and how the final models were trained. In the later parts of the paper, we discuss possible applications of the VQA model in industry and possible changes to the model design for improving accuracy.

2 Literature Review

In the last few years there have been tremendous advancements in the field of AI. From neural networks to computer vision, the field has been transformed in leaps and bounds [1]. Among other things, its ability to impact the value chain has been felt unequivocally [2]. Healthcare as a domain is very fragmented, and the components of its value chain do not adhere to accepted standards [3]. Such fragmentation provides immense opportunity for consolidation and optimization in the provider market [4]. A physician scans scores of documents and images before arriving at a conclusion.
Without a heuristic process in place, he or she may often have to spend a substantial amount of time sifting through scans and vitals before arriving at a conclusion. Visual Question Answering (VQA) is a recent advancement in AI that tries to answer a set of questions about an image [5]. Tasks such as feature extraction, question understanding and answer generation are important parts of VQA. Complex systems built to extract information from images and scans need to combine VQA with natural language processing in order to leverage the huge amount of data generated at the patient-provider interface [6]. Such systems would not only supplement the information available to the clinician for making better decisions but also enable better knowledge management.

Earlier work related to VQA was concentrated mostly on image captioning [7]. Most of it relied on deep learning methods such as CNNs, LSTMs and DMNs. The proposed model aims to find a link between the semantics of the text description and the features extracted from an image.

3 Task Description and Dataset

As part of the ImageCLEF 2019 VQA-Med task [12][13], participants were given medical images accompanied by a few clinically relevant questions and were asked to answer the questions based on the visual image content. The training dataset consisted of 3,200 medical images and 12,792 question-answer pairs, while the validation set had 500 medical images and 2,000 question-answer pairs. Finally, after developing their model frameworks, participants were required to predict answers for a test set of 500 medical images with 500 questions.

4 Pipeline Overview / Model Structure

In this section we discuss our model architecture, which primarily consists of three components, as shown in Fig. 1: the Question Classifier Model, the VQA Classifier Model and the VQA Seq-2-Seq Model. Due to the complex nature of the problem, we decided to break it into simpler tasks that could be handled independently. We first sorted the questions into their respective categories, namely Modality, Plane, Organ system and Abnormality, using the Question Classifier Model. The first group, consisting of all questions in the Modality, Plane and Organ system categories along with their image pairs, was fed to the VQA Classifier Model, a multi-class classification model. The second group, consisting of all questions in the Abnormality category along with their image pairs, was fed to the VQA Seq-2-Seq Model, which generates answer sequences as predictions.

Fig. 1. Model Pipeline Overview

4.1 Question Classifier Model

To classify the questions, question embeddings were passed through a densely connected layer. This deep representation was then projected using a final dense layer followed by a softmax activation to form a distribution over all possible tags. The question embeddings were created using the Universal Sentence Encoder, as it gives strong transfer performance on a number of NLP tasks and surpasses transfer learning with word-level embeddings alone. It also handles variable-length input text seamlessly and produces a 512-dimensional output vector.
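To make the pipeline concrete, the sketch below shows a minimal Keras version of this classifier head, in line with the Keras/TensorFlow setup mentioned in Section 5. The 512-dimensional Universal Sentence Encoder embeddings are assumed to be precomputed and supplied as plain feature vectors, and the hidden layer size and training settings are illustrative assumptions rather than the exact values used in our submission.

```python
import numpy as np
from tensorflow.keras import layers, models

NUM_CATEGORIES = 4  # Modality, Plane, Organ system, Abnormality

# Precomputed 512-d Universal Sentence Encoder embeddings are the input.
question_embedding = layers.Input(shape=(512,), name="use_embedding")

# Dense "deep representation" followed by a softmax projection over the tags.
hidden = layers.Dense(256, activation="relu")(question_embedding)  # illustrative size
category_probs = layers.Dense(NUM_CATEGORIES, activation="softmax")(hidden)

question_classifier = models.Model(question_embedding, category_probs)
question_classifier.compile(optimizer="adam",
                            loss="sparse_categorical_crossentropy",
                            metrics=["accuracy"])

# Example usage with random placeholder data (stand-ins for real USE embeddings).
X = np.random.rand(32, 512).astype("float32")
y = np.random.randint(0, NUM_CATEGORIES, size=(32,))
question_classifier.fit(X, y, epochs=1, verbose=0)
```

The predicted category is then used to route each question-image pair either to the VQA Classifier Model or to the VQA Seq-2-Seq Model.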
4.2 VQA Classifier Model

Once the questions had been classified by the question classifier, questions and images from the Modality, Plane and Organ System categories were fed to the VQA classifier. The VQA Classifier Model used the Universal Sentence Encoder to create question embeddings. To encode the category to which a question belongs, the one-hot-encoded output of the Question Classifier was concatenated with the 512-dimensional vector to create a final 516-dimensional question embedding vector. A pre-trained convolutional neural network (CNN) based on the ResNet50 architecture [8] was used to embed the image.

To compute attention over the image features, we concatenated the tiled LSTM state with the image features along the depth dimension and passed the result through a 1 × 1 convolution layer of depth 512 followed by a ReLU nonlinearity. The output features were passed through another 1 × 1 convolution of depth C = 2 followed by a softmax over the spatial dimensions to compute the attention distributions. We used these distributions to compute two image glimpses as weighted averages of the image features. We then concatenated the image glimpses with the state of the LSTM and passed them through a fully connected layer of size 1024 with a ReLU nonlinearity [10]. The output was fed to a linear layer of size M = 66 followed by a softmax to produce probabilities over the sub-categorical answers of the top three categories. We used a dropout of 0.5 on the input features of all layers, including the LSTM, the convolutions and the fully connected layers, and optimized the model with the Adam optimizer. Fig. 2 shows the model architecture.

Fig. 2. VQA Classifier Architecture
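The following is a condensed Keras sketch of the glimpse-based attention described above, in the spirit of [10]. The 7 × 7 × 2048 ResNet50 feature-map size, the dropout placement and the way the 516-dimensional question vector is obtained are assumptions made only to keep the example self-contained and runnable; they are not guaranteed to match our exact implementation.

```python
from tensorflow.keras import layers, models

H, W, D = 7, 7, 2048   # assumed ResNet50 feature-map size (final pooling removed)
Q_DIM = 516            # 512-d USE embedding + 4-d one-hot question category
NUM_ANSWERS = 66       # answer sub-classes of the top three categories
NUM_GLIMPSES = 2       # C = 2 in the text

image_feats = layers.Input(shape=(H, W, D), name="resnet_features")
question_vec = layers.Input(shape=(Q_DIM,), name="question_vector")

# Tile the question state over the spatial grid and concatenate along depth.
q_tiled = layers.RepeatVector(H * W)(question_vec)
q_tiled = layers.Reshape((H, W, Q_DIM))(q_tiled)
fused = layers.Concatenate(axis=-1)([image_feats, q_tiled])

# 1x1 conv (depth 512) + ReLU, then 1x1 conv (depth 2) + softmax over positions.
att = layers.Conv2D(512, 1, activation="relu")(layers.Dropout(0.5)(fused))
att = layers.Conv2D(NUM_GLIMPSES, 1)(att)
att = layers.Reshape((H * W, NUM_GLIMPSES))(att)
att = layers.Softmax(axis=1)(att)            # attention over spatial positions

# Two glimpses: attention-weighted averages of the image features.
flat_feats = layers.Reshape((H * W, D))(image_feats)
glimpses = layers.Dot(axes=(1, 1))([att, flat_feats])   # shape: (NUM_GLIMPSES, D)
glimpses = layers.Flatten()(glimpses)

# Concatenate glimpses with the question state and classify.
joint = layers.Concatenate()([glimpses, question_vec])
joint = layers.Dense(1024, activation="relu")(layers.Dropout(0.5)(joint))
answer_probs = layers.Dense(NUM_ANSWERS, activation="softmax")(layers.Dropout(0.5)(joint))

vqa_classifier = models.Model([image_feats, question_vec], answer_probs)
vqa_classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```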
4.3 VQA Seq-2-Seq Model

Questions and images from the Abnormality category were fed to the VQA Seq-2-Seq Model shown in Fig. 3, a custom encoder-decoder architecture used to generate answer sequences; its components are discussed in detail below. The first component is a pre-trained convolutional neural network (CNN) based on the ResNet50 architecture that takes the image as input and extracts a vector representation of that image. The second component is a word embedding layer that encodes the question into a vector representation, which is passed through an LSTM network. The embeddings were created using a Word2Vec model pre-trained on PubMed data. The decoder consists of an LSTM network that takes the thought vector as its initial state and the 'Start of Sentence' token as input at the first time step, and predicts answer tokens through a softmax layer until the 'End of Sentence' token is produced [11].

Fig. 3. VQA Seq-2-Seq Architecture

4.3.1 Encoder

The encoder consists of two main components: the first obtains image embeddings from a CNN architecture, and the second is an LSTM network with a pre-trained word embedding layer that encodes the question. Fig. 4 describes the encoder architecture.

In the first component, the image embedding input layer takes the vectorized image from ResNet50. The image features are flattened and passed through a fully connected dense layer followed by a ReLU activation to create a deep 512-dimensional representation of the image. The main purpose of these two layers is to reshape and compress the feature vector so that its dimension matches the hidden layer of the LSTM.

In the second component, the semantic meaning of the question is extracted with respect to the image. A 100-dimensional pre-trained word embedding layer is used to map each word into a dense semantic space, using a Word2Vec model trained on PubMed data. This word representation is then fed to an LSTM network with 512 hidden nodes. The LSTM is a special type of recurrent neural network (RNN) designed to overcome the vanishing gradient problem. The LSTM layer uses its memory cells to store context information and has three gates (input, forget and output) that decide how each input is handled. At any time step, the inputs to the LSTM cell are the current word (x), the previous hidden state (h) and the previous memory state (c), and the outputs are the updated hidden state and memory state. These states have 512 hidden nodes. At the last time step of the sequence, the LSTM cell outputs its hidden state (h) and memory state (c). Both the hidden state and the memory state of the first LSTM layer are initialized with the image vector, which helps the LSTM learn internal representations useful for extracting information relevant to answering the question.

Fig. 4. Encoder Model Architecture

4.3.2 Decoder

The decoder model is responsible for generating the answer from the input image and question. The decoder LSTM uses three inputs to generate a token that is mapped to a dictionary: the token of the previous word, and the hidden state and memory state of the previous LSTM step, as shown in Fig. 5. At the first time step, the LSTM cell takes the 'Start of Sentence' token together with the hidden and memory states of the encoder model as input and computes the probability distribution of the target word using a softmax layer. The word with the highest probability becomes the first word of the answer; this word is then passed to the next LSTM step as input to predict the second word of the answer. The full answer is generated by repeating this process until the model predicts the 'End of Sentence' token.

Fig. 5. Decoder Model Architecture
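A compact Keras sketch of this encoder-decoder setup, including the greedy decoding loop used at inference time, is given below. The vocabulary size, maximum answer length, special-token indices and the randomly initialized embedding weights are illustrative assumptions; in the actual model the embedding layers would be loaded with PubMed-trained Word2Vec vectors and the 2048-dimensional image vector would come from ResNet50.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000      # assumed vocabulary size
EMB_DIM = 100          # Word2Vec dimensionality used in the paper
HIDDEN = 512           # LSTM hidden size used in the paper
IMG_DIM = 2048         # assumed ResNet50 pooled feature size
SOS, EOS = 1, 2        # assumed special-token indices

# --- Encoder: the compressed image vector initializes the question LSTM ---
image_in = layers.Input(shape=(IMG_DIM,), name="image_vector")
img_state = layers.Dense(HIDDEN, activation="relu")(image_in)  # compress to LSTM size

question_in = layers.Input(shape=(None,), name="question_tokens")
q_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(question_in)
_, enc_h, enc_c = layers.LSTM(HIDDEN, return_state=True)(
    q_emb, initial_state=[img_state, img_state])

# --- Decoder: predicts the next answer token from the previous one ---
answer_in = layers.Input(shape=(None,), name="answer_tokens")
a_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(answer_in)
dec_lstm = layers.LSTM(HIDDEN, return_sequences=True, return_state=True)
dec_out, _, _ = dec_lstm(a_emb, initial_state=[enc_h, enc_c])
dec_dense = layers.Dense(VOCAB_SIZE, activation="softmax")
token_probs = dec_dense(dec_out)

train_model = models.Model([image_in, question_in, answer_in], token_probs)
train_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# --- Inference models for greedy decoding ---
encoder_model = models.Model([image_in, question_in], [enc_h, enc_c])

state_h = layers.Input(shape=(HIDDEN,))
state_c = layers.Input(shape=(HIDDEN,))
step_out, step_h, step_c = dec_lstm(a_emb, initial_state=[state_h, state_c])
decoder_model = models.Model([answer_in, state_h, state_c],
                             [dec_dense(step_out), step_h, step_c])

def generate_answer(image_vec, question_tokens, max_len=20):
    """Greedy decoding: feed back the most probable token until EOS appears."""
    h, c = encoder_model.predict([image_vec, question_tokens], verbose=0)
    token, answer = np.array([[SOS]]), []
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([token, h, c], verbose=0)
        next_id = int(np.argmax(probs[0, -1]))
        if next_id == EOS:
            break
        answer.append(next_id)
        token = np.array([[next_id]])
    return answer
```

Here `image_vec` is a (1, 2048) feature array and `question_tokens` a (1, T) array of word indices; the returned list of token ids is mapped back to words through the answer vocabulary.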
5 Results

Training data and validation data for the challenge comprised 3,200 images with 12,792 question-answer (QA) pairs and 500 images with 2,000 QA pairs respectively. The test data contained 500 images and questions. We experimented with two main approaches for the challenge submissions.

Approach 1: In the first experiment, we used the Question Classifier, followed by the VQA Classifier on the top three categories and a second VQA Classifier trained only on the Abnormality category training data. We considered only the 657 most frequent answers in the abnormality classifier; other answers were ignored and did not contribute to the loss during training. This covers 74% of the answers in the validation set of the Abnormality category of the VQA dataset [9].

Approach 2: In the second experiment, we used the proposed pipeline: the Question Classifier, followed by the VQA Classifier on the top three categories and the VQA Seq-2-Seq Model on the Abnormality category questions.

The question classifier model achieved 100% accuracy on the validation set when classifying the question category. The VQA classifier trained on the first three categories (Modality, Plane and Organ Systems) delivered an overall accuracy of 76.1% when predicting answers across 66 sub-categorical classes on the validation data of the corresponding three categories. The second VQA classifier trained on the Abnormality category questions, used in Approach 1, delivered an accuracy of 19.5% when predicting answers across 657 sub-categorical classes on the validation set of the Abnormality category. The VQA Seq-2-Seq model, used in Approach 2, generated answer sequences with an accuracy of 23.4% on the validation set of the Abnormality category.

Evaluation of the test data was based on the following metrics:

1. Accuracy (strict): an adapted version of the accuracy metric from the general-domain VQA task that considers exact matching of a participant-provided answer and the ground-truth answer.
2. BLEU: a metric that captures the similarity between a system-generated answer and the ground-truth answer.

With different model hyperparameters, five valid submissions were made by our team: two from Approach 1 and three from Approach 2. The best results from both approaches were as follows:

Table 1. Performance statistics of the models on test data

             Accuracy (Strict)   BLEU Score
Approach 1   0.484               0.531
Approach 2   0.488               0.534

It is worth mentioning that neither the VQA Classifier nor the VQA Seq-2-Seq model is very expensive to train. Training took 14 seconds and 12 seconds per epoch for the VQA Classifier and the VQA Seq-2-Seq model respectively on a virtual machine (VM) equipped with an NVIDIA Tesla P4 GPU card with 8 GB of RAM. The VM ran Ubuntu with CUDA 10.1. For the implementation, we used Keras with a TensorFlow 1.13.1 backend.
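For completeness, the sketch below shows one way to approximate the two official metrics offline. It assumes NLTK's sentence-level BLEU and a simple lowercasing/whitespace normalization of answers, which may differ from the organizers' exact evaluation script.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def strict_accuracy(predictions, ground_truths):
    """Exact string match after light normalization (assumed, see lead-in)."""
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

def mean_bleu(predictions, ground_truths):
    """Average sentence-level BLEU between predicted and reference answers."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([g.lower().split()], p.lower().split(),
                            smoothing_function=smooth)
              for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)

# Toy usage with made-up answers.
preds = ["ct with contrast", "axial plane"]
golds = ["ct with contrast", "axial"]
print(strict_accuracy(preds, golds), mean_bleu(preds, golds))
```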
6 Conclusion

VQA has great potential for automating information gathering from images. The medical domain can benefit from these techniques in several ways, such as providing the coherence required to simplify the diagnosis process, giving clinicians a tool to check the validity of their diagnoses, and providing patients with details that might otherwise be overlooked during a consultation. It can help the clinician ascertain that enough information is at hand before reaching any conclusion. Models specific to a particular disease can also be built by training the algorithms on dedicated data sets, which can lead to a better scanning process. Supplementing the models with more advanced techniques such as attention and multi-task learning can further improve their accuracy. VQA can also help establish standards across the models used by different providers. Moreover, the cost benefits of such simplification can be propagated through the entire value chain. Clinicians, patients and insurance providers can all benefit from the value created by VQA-based diagnosis systems.

References

1. Koushal, K., Gour, S.: Advanced Applications of Neural Networks and Artificial Intelligence: A Review (2012)
2. Soroush, A., Nakhai, I., Bahreininejad, A.: Review on Application of Artificial Neural Networks in Supply Chain Management and its Future (2009)
3. Lawton, R.: The Health Care Value Chain: Producers, Purchasers, and Providers (2002)
4. Martin, G.: Examining the Impact of Health Care Consolidation (2018)
5. Aishwarya, A., Jiasen, L., Stanislaw, A., Margaret, M., Lawrence, C., Dhruv, B., Devi, P.: VQA: Visual Question Answering (2016)
6. Sonit, S.: Pushing the Limits of Radiology with Joint Modeling of Visual and Textual Information (2018)
7. Shuhui, Q.: Visual Question Answering Using Various Methods, California, 94305
8. Alfredo, C., Eugenio, C., Adam, P.: An Analysis of Deep Neural Network Models for Practical Applications (2017)
9. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual Question Answering. International Journal of Computer Vision (2015)
10. Vahid, K., Ali, E.: Show, Ask, Attend, and Answer: A Strong Baseline for Visual Question Answering. CoRR, abs/1704.03162 (2017)
11. Ilya, S., Oriol, V., Quoc, V.: Sequence to Sequence Learning with Neural Networks. In: Advances in Neural Information Processing Systems (NIPS) 27, pp. 3104-3112 (2014)
12. Asma, B., Sadid, A., Vivek, V., Joey, L., Dina, D., Henning, M.: VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. In: CLEF 2019 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-2380/, Lugano, Switzerland (September 9-12, 2019)
13. Bogdan, I., Henning, M., Vivek, V., Renaud, P., Yashin, D., Vitali, L., Vassili, K., Dzmitri, K., Aleh, T., Asma, B., Sadid, A., Vivek, D., Joey, L., Dina, D., Duc-Tien, D., Luca, P., Michael, R., Minh-Triet, T., Mathias, L., Cathal, G., Obioma, P., Christoph, M., Alba, G., Narciso, G., Ergina, K., Carlos, R., Carlos, C., Nikos, V., Konstantinos, K., Jon, C., Adrian, C., Antonio, C.: ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS, Springer, Lugano, Switzerland (September 9-12, 2019)