An Encoder-Decoder Model for Visual Question Answering in the Medical Domain

Imane Allaouzi [0000-0002-8737-8115], Mohamed Ben Ahmed, Badr Benamrou
LIST, Faculty of Sciences and Techniques, Abdelmalek Essaâdi University, Tangier, Morocco
imane.allaouzi@gmail.com

Abstract. This paper describes our participation in the VQA-Med task of ImageCLEF 2019. We proposed an encoder-decoder model that takes a medical question-image pair as input and generates an answer as output. The encoder network consists of a pre-trained CNN model that extracts prominent features from the medical image, and a pre-trained word embedding followed by an LSTM that embeds the textual data. Answer generation is performed with a greedy search algorithm, which predicts the next word based on the previously generated words; the answer is thus built up by recursively calling the model.

Keywords: Transfer Learning, Encoder-Decoder, CNN, LSTM, Word Embedding, Language Modeling, Medical Imaging, Visual Question Answering, Greedy Search, Beam Search, NLP, Computer Vision.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

With the widespread adoption of electronic medical record (EMR) systems, a large amount of medical information is becoming available, such as doctors' reports, test results, and medical images. This health information is a gold mine for artificial intelligence (AI) researchers who seek to enhance doctors' ability to analyze medical images, to support clinical decision making, and to improve patient engagement.

One of the most exciting and challenging AI tasks is visual question answering in the medical domain (VQA-Med) [1]. The main idea of a VQA-Med system is to predict the right answer to a clinically relevant question about a given medical image. It is a difficult task because the system must understand and analyze the question (natural language processing, NLP) as well as understand and process the image (computer vision).

Different approaches have been proposed to address the VQA-Med task. Some treat it as a generative problem, producing answers as comprehensive, well-formed textual descriptions [2], while others treat it as a multi-label classification problem in which the answer is chosen from among a fixed set of choices [3, 4].

This paper describes our participation in the VQA-Med task of ImageCLEF 2019 [5]. We proposed an encoder-decoder model that takes a medical question-image pair as input and generates an answer as output. The encoder network consists of a pre-trained CNN model that extracts prominent features from the medical image, and a pre-trained word embedding followed by a Long Short-Term Memory (LSTM) network that embeds the textual data. Answer generation is performed with a greedy search algorithm, which predicts the next word based on the previously generated words; the answer is thus built up by recursively calling the model.

The rest of this paper is organized as follows. Section 2 describes the provided dataset. Section 3 gives a detailed description of the proposed system. Section 4 presents the metrics used to assess the performance of our system, along with a presentation and analysis of the experimental results. Finally, Section 5 concludes the presented work with some remarks.
2 Dataset

The VQA-Med dataset consists of a training set of 3,200 medical images with 12,792 question-answer (QA) pairs, a validation set of 500 medical images with 2,000 QA pairs, and a test set of 500 medical images with 500 questions. Four categories of questions are considered: Modality, Plane, Organ system, and Abnormality. The answer can be a single word, a phrase of 2-21 words, or yes/no. Table 1 shows examples of question-answer pairs, each associated with a medical image.

Table 1. Examples of medical images with associated question-answer pairs.

Question                                     | Answer
What part of the body is being imaged here?  | Skull and contents.
Which plane is the image shown in?           | Axial.
What abnormality is seen in the image?       | Right aortic arch with aberrant left subclavian artery.
Is this a t1 weighted image?                 | Yes.

3 Methodology

To address the problem of VQA in the medical domain, we proposed an encoder-decoder model that takes a medical question-image pair as input and generates an answer as output. As shown in Figure 1, the encoder network consists of a pre-trained DenseNet-121 model that extracts prominent features from the medical image, and a pre-trained word embedding followed by two LSTM layers that embed the question and extract textual features. The textual and image features are concatenated into a single "QI vector".

Our proposed model generates one word at a time. All words generated so far are embedded, using the same word embedding as for the questions, and these embeddings are fed into an LSTM with 1024 units. The resulting distributed representation of the words generated so far is concatenated with the QI vector to form an "encoder vector". The decoder takes the encoder vector as input to generate the next word: the vector is fed to a fully connected layer of 256 neurons and then to the final layer, which has one neuron for each word in the output vocabulary and a softmax activation that outputs the likelihood of each vocabulary word being the next word in the answer. The answer is thus built up by recursively calling the model with the previously generated words.

Fig. 1. The proposed architecture for VQA-Med 2019.

3.1 Image encoding:

Our proposed model is a deep learning network with a large number of parameters. This type of model often overfits when trained on small datasets. To limit overfitting, an effective solution is transfer learning: a CNN model pre-trained on a large, similar dataset is used as a fixed feature extractor, since we expect its higher-level features to be relevant to our dataset as well. Motivated by the results obtained by the DenseNet-121 model on medical image classification [6], and since we do not have a large dataset, we used a DenseNet-121 pre-trained on CheXpert [7], a large dataset of chest X-ray images.

The network has four dense blocks with 6, 12, 24, and 16 dense layers, respectively. A dense block consists of a series of units; each unit packs two convolutions, each preceded by batch normalization and a ReLU activation. Each unit also produces a fixed number of feature maps. This number, called the growth rate, controls the amount of new information that each layer can contribute. The layers between the dense blocks are transition layers, which down-sample the feature maps passing through the network.
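As an illustration of this feature-extraction step, the following is a minimal sketch that uses a Keras DenseNet-121 as a fixed feature extractor. The generic pre-trained weights, the 224x224 input size, and the helper name extract_image_features are illustrative assumptions standing in for the CheXpert-pretrained network used in this work.

```python
# Minimal sketch: a pre-trained DenseNet-121 used as a fixed feature extractor.
# The ImageNet weights below are a stand-in for the CheXpert-pretrained weights.
import numpy as np
import tensorflow as tf

# Global-average-pooled output of the last dense block (1024-dim for DenseNet-121).
cnn = tf.keras.applications.DenseNet121(include_top=False, weights="imagenet", pooling="avg")
cnn.trainable = False  # fixed extractor: CNN weights are not updated during training

def extract_image_features(image_path):
    """Return the 1024-dimensional DenseNet-121 feature vector of one image."""
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]       # shape (1, 224, 224, 3)
    x = tf.keras.applications.densenet.preprocess_input(x)
    return cnn.predict(x, verbose=0)[0]                         # shape (1024,)
```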
A detailed description of the DenseNet-121 architecture used in this work is given in Table 2.

Table 2. The DenseNet-121 architecture.

Layers             | Output Size    | DenseNet-121
Convolution        | 112x112        | 7x7 conv, stride 2
Pooling            | 56x56          | 3x3 max pool, stride 2
Dense Block 1      | 56x56          | [1x1 conv, 3x3 conv] x 6
Transition Layer 1 | 56x56 -> 28x28 | 1x1 conv; 2x2 average pool, stride 2
Dense Block 2      | 28x28          | [1x1 conv, 3x3 conv] x 12
Transition Layer 2 | 28x28 -> 14x14 | 1x1 conv; 2x2 average pool, stride 2
Dense Block 3      | 14x14          | [1x1 conv, 3x3 conv] x 24
Transition Layer 3 | 14x14 -> 7x7   | 1x1 conv; 2x2 average pool, stride 2
Dense Block 4      | 7x7            | [1x1 conv, 3x3 conv] x 16
Output             | 1x1            | 7x7 global average pool

3.2 Question encoding:

To capture the sequential nature of language data, we modeled the questions with an LSTM, a special type of recurrent neural network (RNN). LSTMs have demonstrated strong results on various NLP tasks and are well suited to sequential data: through the memory cell state, an LSTM retains information about previous time steps to better inform the current prediction. An LSTM consists of three main components: a forget gate, an input gate, and an output gate. These gates determine whether to let new input in (input gate), to discard information that is not important (forget gate), or to let the stored information influence the output at the current time step (output gate).

Fig. 2. Memory block in an LSTM network.

Word embeddings pre-trained on biomedical texts from MEDLINE/PubMed with gensim's Word2Vec implementation [8] are used to provide distributed representations of words. Word embeddings capture the context around words much better than a one-hot vector for every word. We used 200-dimensional word embeddings and did not fine-tune them during training, since we did not have sufficient data. These embeddings are passed into two LSTM layers with 512 and 1024 units, respectively.

3.3 Answer generation:

To predict an answer for a given image-question pair, we treated the task as text generation. The model produces a probability distribution over the output vocabulary at each step, and a decoding algorithm samples these distributions to generate the most likely sequence of words. To find the best decoding algorithm, both greedy search and beam search were evaluated (a minimal decoding sketch is given after this list).

• Greedy search: A greedy algorithm makes the locally optimal choice at each step in the hope of finding a globally optimal solution. The algorithm chooses the most likely word at each step of the output sequence and does not take the whole sentence into account; the quality of the final output sequence may therefore be far from optimal, hence the name greedy.

• Beam search: Unlike greedy search, beam search allows non-greedy local decisions that can lead to a sequence with a higher overall probability. Beam search expands all possible next steps and keeps the k most likely partial sequences, where k is a user-specified parameter that controls the number of beams, i.e., parallel searches through the sequence of probabilities.
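The sketch below illustrates the greedy decoding loop, assuming a trained Keras model that maps (image features, question tokens, answer-so-far tokens) to a next-word distribution, a fitted Keras Tokenizer, and startseq/endseq markers; these names are illustrative rather than the exact implementation. Beam search generalizes this loop by keeping the k highest-scoring partial answers at each step instead of only the single best one.

```python
# Greedy answer generation: at each step, pick the most likely next word and
# feed the growing answer back into the model (names here are illustrative).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_decode(model, tokenizer, image_features, question_seq,
                  max_len=21, start_token="startseq", end_token="endseq"):
    answer = [start_token]
    for _ in range(max_len):
        # Encode the words generated so far, padded to a fixed length.
        ans_seq = tokenizer.texts_to_sequences([" ".join(answer)])
        ans_seq = pad_sequences(ans_seq, maxlen=max_len)
        # Probability distribution over the vocabulary for the next word.
        probs = model.predict([image_features, question_seq, ans_seq], verbose=0)[0]
        next_word = tokenizer.index_word.get(int(np.argmax(probs)))
        if next_word is None or next_word == end_token:
            break
        answer.append(next_word)
    return " ".join(answer[1:])  # drop the start marker
```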
3.4 Dropout:

The proposed model is a deep neural network trained on a small dataset. As a result, it can learn statistical noise in the training data, resulting in poor performance and generalization on new test data (overfitting). To reduce overfitting and improve generalization, we used the dropout technique. Dropout is a computationally cheap and remarkably effective regularization method. It works by randomly removing, or dropping out, inputs to a layer during training, which makes the nodes in the network more robust to their inputs and prevents units from co-adapting too strongly.

4 Evaluation Methodology

Before applying the evaluation metrics, each answer undergoes the following pre-processing steps:

• Lower-casing: converts each answer to lower case.
• Tokenization: divides the answer into individual words.
• Punctuation removal: removes punctuation marks from answers.

The metrics used to evaluate our VQA-Med model are:

• Accuracy (strict): the entire predicted answer must match the ground-truth answer.
• BLEU [9]: captures the similarity between a system-generated answer and the ground-truth answer.

Three experiments were conducted to evaluate the model:

• Expr1: answers are generated using greedy search.
• Expr2: answers are generated using beam search with k=2.
• Expr3: answers are generated using beam search with k=3.

Our model is trained using the RMSprop optimizer with an initial learning rate of 0.001, which is reduced by a factor of 10 each time the validation loss plateaus after an epoch. We used a mini-batch size of 535 samples, up to 100 epochs, and categorical cross-entropy as the loss function; the best model was selected based on the validation loss.

As shown in Table 3, experiment 1 (Expr1) achieves the best results, with a strict accuracy of 0.556 and a BLEU score of 0.583. This means that, in our case, greedy search outperforms the beam search algorithm.

Table 3. Experimental results on the test dataset.

Experiment | Accuracy | BLEU
Expr1      | 0.556    | 0.583
Expr2      | 0.538    | 0.556
Expr3      | 0.526    | 0.547

Table 4 compares the results obtained by our model with the three best runs of the VQA-Med task.

Table 4. Comparison with the three best VQA-Med methods.

Model            | Accuracy | BLEU
Hanlin           | 0.624    | 0.644
yan              | 0.62     | 0.64
minhvu           | 0.616    | 0.634
Our model (LIST) | 0.556    | 0.583

As shown in Table 4, the best run achieved an accuracy of 0.624 and a BLEU score of 0.644, exceeding our model by only 0.068 in accuracy and 0.061 in BLEU. Our model therefore achieves competitive results.
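For completeness, the following is a minimal sketch of the pre-processing and scoring described above, using NLTK's sentence-level BLEU and a simple whitespace tokenizer; it is an illustration under these assumptions, not the official ImageCLEF evaluation script.

```python
# Minimal sketch of answer pre-processing and scoring (not the official
# VQA-Med evaluation script); a simple whitespace tokenizer is assumed.
import string
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def preprocess(answer):
    """Lower-case the answer, strip punctuation, and split it into tokens."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return answer.split()

def evaluate(predictions, ground_truths):
    """Return strict accuracy and average sentence-level BLEU over answer pairs."""
    smooth = SmoothingFunction().method1
    correct, bleu_total = 0, 0.0
    for pred, truth in zip(predictions, ground_truths):
        pred_tok, truth_tok = preprocess(pred), preprocess(truth)
        correct += int(pred_tok == truth_tok)
        bleu_total += sentence_bleu([truth_tok], pred_tok, smoothing_function=smooth)
    n = len(predictions)
    return correct / n, bleu_total / n
```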
5 Conclusion

In this paper, we proposed an encoder-decoder model for the task of visual question answering in the medical domain. VQA is a difficult and challenging task since it combines the fields of computer vision and NLP, and this difficulty is compounded by the specific nature of medical images. Our proposed model achieves promising results, with an accuracy of 0.556 and a BLEU score of 0.583. To further improve these results, several extensions could be explored, such as an attention mechanism that focuses on the image regions most relevant to the question rather than on the whole image.

References

1. Ben Abacha, A., Hasan, S. A., Datla, V. V., Liu, J., Demner-Fushman, D., Müller, H. (2019). VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. CLEF 2019 Working Notes, CEUR Workshop Proceedings.
2. Talafha, B., Al-Ayyoub, M. (2018). JUST at VQA-Med: A VGG-Seq2Seq Model. CLEF 2018 Labs Working Notes, CEUR Workshop Proceedings.
3. Allaouzi, I., Benamrou, B., Ben Ahmed, M. (2018). Deep Neural Networks and Decision Tree Classifier for Visual Question Answering in the Medical Domain. CLEF 2018 Labs Working Notes, CEUR Workshop Proceedings.
4. Peng, Y., Liu, F., Rosen, M. (2018). UMass at ImageCLEF Medical Visual Question Answering (Med-VQA) 2018 Task. CLEF 2018 Labs Working Notes, CEUR Workshop Proceedings.
5. Ionescu, B., Müller, H., Péteri, R., Dicente Cid, Y., Liauchuk, V., Kovalev, V., Klimuk, D., Tarasau, A., Ben Abacha, A., Hasan, S. A., Datla, V., Liu, J., Demner-Fushman, D., Dang-Nguyen, D. T., Piras, L., Riegler, M., Tran, M. T., Lux, M., Gurrin, C., Pelka, O., Friedrich, C. M., Garcia Seco de Herrera, A., Garcia, N., Kavallieratou, E., del Blanco, C. R., Cuevas Rodriguez, C., Vasillopoulos, N., Karampidis, K., Chamberlain, J., Clark, A., Campello, A. (2019). ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature. Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019).
6. Allaouzi, I., Ben Ahmed, M. (2019). A Novel Approach for Multi-Label Chest X-Ray Classification of Common Thorax Diseases. IEEE Access 7, 64279-64288.
7. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J., Mong, D. A., Halabi, S. S., Sandberg, J. K., Jones, R., Larson, D. B., Langlotz, C. P., Patel, B. N., Lungren, M. P., Ng, A. Y. (2019). CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. Thirty-Third AAAI Conference on Artificial Intelligence.
8. McDonald, R., Brokos, G., Androutsopoulos, I. (2018). Deep Relevance Ranking Using Enhanced Document-Query Interactions. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium.
9. Papineni, K., Roukos, S., Ward, T., Zhu, W. J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pp. 311-318.