Deep Neural Networks and Decision Tree classifier for Visual Question Answering in the medical domain

Imane Allaouzi, Faculty of Sciences and Techniques, Abdelmalek Essaâdi University, Tangier, Morocco

Badr Benamrou, Faculty of Sciences and Techniques, Abdelmalek Essaâdi University, Tangier, Morocco

Mohamed Benamrou, Faculty of Sciences and Techniques, Abdelmalek Essaâdi University, Tangier, Morocco

Mohamed Ben Ahmed, Faculty of Sciences and Techniques, Abdelmalek Essaâdi University, Tangier, Morocco

Keywords: CNN, Bidirectional LSTM, Decision Tree classifier, Language modeling, Medical imaging, Visual Question Answering

This paper presents our contribution to the problem of visual question answering in the medical domain using a combination of deep neural networks and the Decision Tree classifier. In our proposed approach we treat visual question answering as a multi-label classification problem, where each label corresponds to a unique word in the answer dictionary built from the training set.

Introduction

Visual question answering (VQA) is a new and challenging task that has attracted a surge of interest from the Artificial Intelligence (AI) community, since it combines the fields of Computer Vision (CV) and Natural Language Processing (NLP). NLP and CV are two branches of AI: the former enables computers to understand and analyze human language, while the latter enables computers to understand and process images in the way a human does. The main idea of a VQA system is to predict the right answer given both an image and a natural-language question about that image. The VQA task can be treated as a classification problem if the answer is chosen from a set of candidate answers, or as a generation problem if the answer is a comprehensive and well-formed textual description.

In the last few years, Deep Neural Networks have achieved state-of-the-art results in a wide range of NLP and CV applications, including image recognition [1,2], machine translation [3,4], image captioning [5,6] and Visual Question Answering [7,8,9]. Following this trend, this paper presents our contribution to the problem of visual question answering in the medical domain [10,11] using a combination of deep neural networks (a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory network) and the Decision Tree classifier. In our proposed approach we treat VQA as a multi-label classification problem, where each label corresponds to a unique word in the answer dictionary built from the training set. The paper is organized as follows: the dataset is described in Section 2, the proposed model is described in Section 3, results are presented and discussed in Section 4, and Section 5 draws some conclusions and outlines future work.

Dataset:

VQA-Med [10] is a dataset generated using images from PubMed Central articles (essentially a subset of the ImageCLEF 2017 caption prediction task [12]). Table 1 gives its distribution.

| Question | Answer |
|---|---|
| What does the CT scan show? | A large filling defect in the left atrium. |
| Where does CT coronal section of the skull show well-defined unilocular lesion? | In the right maxillary sinus. |
| Who does CT abdomen show? | Right adrenal pheochromacytoma. |
| Is there any intra-cardiac mass identified? | No. |
| What shows the limits between the stomach and mass? | MRI. |

The Proposed Model:

VQA in the medical domain takes a medical image and a question about it as input and produces an answer. In this work we assume that an answer is a concatenation of one or more words; we therefore treat the task as a multi-label classification problem.

Our proposed model uses the pre-trained VGG-16 [13] model to extract image features, and a word embedding [14] followed by a Bidirectional Long Short-Term Memory (LSTM) network [15] to embed the question and extract textual features. The image and textual features, each produced by a fully connected layer of 512 neurons, are concatenated into a fixed-length feature vector. This vector is then used as input to a Decision Tree classifier that predicts the answer.

The model consists of three sub-models:

• Image Representation:

To extract prominent features from medical images, we use the pre-trained VGG-16 network, one of the top-performing entries of the ImageNet 2014 challenge [16], where it achieved a top-5 error rate of about 7.4% on object classification. We remove the last layer of this network to obtain an output vector of 4096 elements, which is in turn passed through a fully connected layer to get an image representation of size 512. The VGG-16 architecture is shown in Fig. 1.

• Question Representation:

Recurrent neural networks (RNNs) have recently shown great success in diverse NLP tasks [18,19]. Motivated by this success, we use a bidirectional RNN with LSTM units to process the medical questions. Bidirectional Long Short-Term Memory (BDLSTM) is an extension of the traditional LSTM; its main idea is to process sequence data in both the forward and backward directions, so that each position has access to both past and future context rather than the limited context available to a unidirectional model.
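The bidirectional idea can be illustrated with a minimal sketch. As a simplifying assumption it uses a plain tanh recurrent cell rather than an LSTM cell, with toy sizes and random weights; the helper names (`rnn_pass`, `bidirectional`, `random_weights`) are hypothetical and not taken from the paper.

```python
import math
import random

def rnn_pass(inputs, w_in, w_rec):
    """Run a tanh recurrent cell over `inputs`; return the hidden state at each step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
        states.append(hidden)
    return states

def bidirectional(inputs, fwd_weights, bwd_weights):
    """Concatenate, per time step, a left-to-right and a right-to-left pass."""
    forward = rnn_pass(inputs, *fwd_weights)
    # Process the reversed sequence, then flip the states back into input order.
    backward = rnn_pass(inputs[::-1], *bwd_weights)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def random_weights(rng, hidden_size, input_size):
    """Toy weights (assumption): input-to-hidden and hidden-to-hidden matrices."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    return w_in, w_rec

rng = random.Random(0)
HIDDEN, INPUT = 3, 2  # toy sizes, far smaller than the 512 units used in the paper
fwd = random_weights(rng, HIDDEN, INPUT)
bwd = random_weights(rng, HIDDEN, INPUT)
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs = bidirectional(sequence, fwd, bwd)  # 4 steps, each of size 2 * HIDDEN
```

In this sketch the first half of each output vector depends only on earlier inputs, while the second half depends only on later ones, so even the first position carries information about the end of the question.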

To do so, the question is first converted to a matrix of one-hot vectors and passed through an embedding layer (with a vocabulary of 3312 words and a dense embedding of size 521), which maps each word to a dense vector that captures its meaning relative to other words. The embedded question is then fed to a BDLSTM with 512 units, followed by a fully connected layer, to get a question representation of size 512.
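The embedding step can be sketched as follows. Multiplying a one-hot vector by the embedding table is the same as looking up one row of that table. The whitespace tokenizer, the toy embedding size and the helper names are assumptions made for illustration only.

```python
import random

def build_vocab(questions):
    """Map each distinct lower-cased token to an integer id (0 is reserved for padding)."""
    vocab = {"<pad>": 0}
    for question in questions:
        for token in question.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def embed(question, vocab, table):
    """Multiply each token's one-hot vector by the embedding table."""
    dim = len(table[0])
    rows = []
    for token in question.lower().split():
        hot = one_hot(vocab[token], len(vocab))
        rows.append([sum(h * table[i][d] for i, h in enumerate(hot)) for d in range(dim)])
    return rows

questions = [
    "what does the ct scan show ?",
    "is there any intra-cardiac mass identified ?",
]
vocab = build_vocab(questions)
rng = random.Random(0)
EMBED_DIM = 4  # toy size (assumption); the paper uses a much larger embedding
table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab]
embedded = embed(questions[0], vocab, table)  # one dense row per token
```

In practice an embedding layer performs the row lookup directly instead of the explicit multiplication, and its table is learned during training rather than drawn at random.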

• Answer prediction:

To predict an answer, we model the VQA-Med task as a multi-label classification problem, since we assume that an answer is a concatenation of one or more words. We therefore use a multi-label Decision Tree classifier that takes as input the outputs of both the image representation and question representation sub-models and predicts one or more predefined labels. The total number of labels is 3109, where each label corresponds to a unique word in the answer dictionary created from the training set.
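The label encoding described above can be sketched as follows, using a few answers from the training examples. This is a minimal illustration under assumed details (whitespace tokenization, alphabetical label order); the function names are hypothetical.

```python
def build_answer_dictionary(answers):
    """Assign one label index to every unique word seen in the training answers."""
    words = sorted({word for answer in answers for word in answer.lower().split()})
    return {word: index for index, word in enumerate(words)}

def encode(answer, word_to_label):
    """Turn an answer into a multi-hot target vector (1 for each word it contains)."""
    target = [0] * len(word_to_label)
    for word in answer.lower().split():
        if word in word_to_label:
            target[word_to_label[word]] = 1
    return target

def decode(prediction, word_to_label):
    """Turn a multi-hot prediction back into the list of predicted words."""
    ordered = sorted(word_to_label.items(), key=lambda item: item[1])
    return [word for word, index in ordered if prediction[index] == 1]

train_answers = [
    "in the right maxillary sinus",
    "right adrenal pheochromacytoma",
    "no",
    "mri",
]
word_to_label = build_answer_dictionary(train_answers)
target = encode("in the right maxillary sinus", word_to_label)
decoded = decode(target, word_to_label)
```

Here `decoded` comes back in dictionary order rather than in the original word order, because a multi-hot vector records which words are present but not their positions.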

In the training phase, we keep the CNN parameters frozen and train the rest of our deep neural network using a final fully connected layer with sigmoid activation, binary cross-entropy as the loss function and Adam as the optimizer. In addition, dropout with a probability of 0.5 is applied after the BDLSTM layer and before the last fully connected layer.
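The sigmoid output and binary cross-entropy loss can be sketched for a single multi-label example as follows. The 0.5 decision threshold, the clipping constant and the toy logits are assumptions for illustration; the paper does not state them.

```python
import math

def sigmoid(z):
    """Squash a raw score into a per-label probability."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(targets, logits, eps=1e-7):
    """Mean binary cross-entropy over labels, each label treated independently."""
    total = 0.0
    for t, z in zip(targets, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

def predict_labels(logits, threshold=0.5):
    """Mark a label as present when its probability reaches the threshold (assumed 0.5)."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

targets = [1, 0, 1, 0]            # multi-hot ground truth for four toy labels
logits = [2.0, -1.5, -0.3, 0.4]   # raw scores from the last fully connected layer
loss = binary_cross_entropy(targets, logits)
predicted = predict_labels(logits)
```

Because each label gets its own sigmoid, several labels can be active at once, which is what allows an answer to consist of more than one word.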

The best parameters were selected based on the validation loss, with a mini-batch size of 20 and up to 10 training epochs.

Results:

Three metrics are used to evaluate our proposed VQA-Med model: the BLEU score [20], WBSS (Word-based Semantic Similarity), and CBSS (Concept-based Semantic Similarity). BLEU is one of the most commonly used metrics for measuring the similarity between two sentences. WBSS measures semantic similarity in the biomedical domain [21]; it is based on Wu-Palmer Similarity (WUPS) [22] with the WordNet ontology in the backend. CBSS is similar to WBSS, except that instead of tokenizing the predicted and ground truth answers into words, it uses MetaMap via the pymetamap wrapper to extract biomedical concepts from the answers. Before applying the evaluation metrics, each answer undergoes the following preprocessing steps:

• Lower-case: Converts each answer to lower-case.
• Tokenization: Divides the answer into individual words.
• Stop-words: Removes punctuation and commonly encountered English words.
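The three preprocessing steps above can be sketched as follows. The stop-word list here is a small hypothetical stand-in, since the list used by the official evaluation is not given in the text. Punctuation is stripped before tokenization for simplicity.

```python
import string

# Hypothetical, deliberately small stop-word list (assumption).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and"}

def preprocess(answer):
    """Lower-case, strip punctuation, tokenize on whitespace, drop stop-words."""
    lowered = answer.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOP_WORDS]

tokens = preprocess("A large filling defect in the left atrium.")
```

With this stand-in list, `tokens` keeps only the content words of the answer, which are then compared by the evaluation metrics.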

Table 3 shows the results obtained on the test set.

Conclusion:

In this paper, we presented our contribution to the task of visual question answering in the medical domain. We treated the task as a multi-label classification problem using the Decision Tree classifier. However, the results on the test set are unsatisfactory, especially in terms of the BLEU metric, with a score of 0.054. We therefore plan to develop an LSTM model to generate answers, since the adopted classification approach ignores word order in the answer, which leads to a loss of information. We also plan to improve our visual model with an attention mechanism, which allows the model to focus on the specific image regions most relevant to the question instead of the whole image.

Fig. 1. The VGG-16 model architecture [17].

The VQA-Med dataset consists of 2278 training images and 324 validation images, accompanied by 5413 and 500 question-answer pairs respectively, and a test set of 264 medical images with 500 questions (Table 1). The answer can be a single word, a phrase of around 2 to 28 words, or a yes/no. Table 2 illustrates some examples of the training data with different types of questions and answers.

Table 1. The VQA-Med dataset distribution.

| | Images | Questions | Answers |
|---|---|---|---|
| Train | 2278 | 5413 | 5413 |
| Validation | 324 | 500 | 500 |
| Test | 264 | 500 | - |

Table 2. Some examples of the training data.

Table 3. Results of our proposed model on the test set.

| Evaluation metric | BLEU | WBSS | CBSS |
|---|---|---|---|
| Score | 0.053867018 | 0.100854295 | 0.269119831 |

As shown in Table 3, our proposed model scores higher on the CBSS metric (0.27) than on the BLEU score (0.054) and the WBSS metric (0.10). This can be explained by the large number of labels, which are not equally represented in the training set; this is known as the label imbalance problem.

References:

[1] K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[2] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
[3] K. Cho, B. van Merrienboer, C. Gulcehre. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Association for Computational Linguistics, 2014.
[4] I. Sutskever, O. Vinyals, Q. Le. Sequence to Sequence Learning with Neural Networks. 27th International Conference on Neural Information Processing Systems, 2014.
[5] O. Vinyals, A. Toshev, S. Bengio, D. Erhan. Show and tell: A neural image caption generator. CVPR, 2015.
[6] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
[7] Z. Yang, X. He, J. Gao, L. Deng, A. Smola. Stacked attention networks for image question answering. CVPR, 2016.
[8] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
[9] M. Malinowski, M. Rohrbach, M. Fritz. Ask your neurons: A deep learning approach to visual question answering. arXiv preprint arXiv:1605.02697, 2016.
[10] S. A. Hasan, Y. Ling, O. Farri, J. Liu, M. Lungren, H. Müller. Overview of ImageCLEF 2018 Medical Domain Visual Question Answering Task. CLEF working notes, CEUR, 2018.
[11] B. Ionescu, H. Müller, M. Villegas, A. García Seco de Herrera, C. Eickhoff, V. Andrearczyk, Y. Dicente Cid, V. Liauchuk, V. Kovalev, S. A. Hasan, Y. Ling, O. Farri, J. Liu, M. Lungren, D. Dang-Nguyen, L. Piras, M. Riegler, L. Zhou, M. Lux, M. Gurrin. Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), 2018.
[12] C. Eickhoff, I. Schwall, A. García Seco de Herrera, H. Müller. Overview of ImageCLEFcaption 2017 - the image caption prediction and concept extraction tasks to understand biomedical images. CLEF working notes, 2017.
[13] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.
[15] M. Schuster, K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997.
[18] A. Graves, A. Mohamed, G. Hinton. Speech recognition with deep recurrent neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[19] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, J. Cernocky. RNNLM - recurrent neural network language modeling toolkit. Proceedings of the 2011 ASRU Workshop, 2011.
[20] K. Papineni, S. Roukos, T. Ward, W. J. Zhu. BLEU: a method for automatic evaluation of machine translation. 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), Pennsylvania, 2002.
[21] G. Soğancıoğlu, H. Öztürk, A. Özgür. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 33(14), 2017.
[22] Z. Wu, M. Palmer. Verbs semantics and lexical selection. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.