LSTM in VQA-Med, is it really needed? JCE study on the ImageCLEF 2019 dataset

Avi Turner and Assaf B. Spanier
Department of Software Engineering, Azrieli College of Engineering, Jerusalem, Israel
assaf.spanier@mail.huji.ac.il

Abstract. This paper describes the contribution of the Department of Software Engineering at the Azrieli College of Engineering, Jerusalem, Israel to the ImageCLEF VQA-Med 2019 task. This task was inspired by the ever-greater recent success of visual question answering (VQA) in the general domain. Given medical images accompanied by clinically relevant questions, participating systems were tasked with answering the questions based on the image content. We explored and implemented a two-stage model. The first stage predicts the category of the textual question, while the second stage comprises 5 sub-models. Each sub-model is a classic VQA deep learning module with two branches for feature extraction: the first uses a CNN to extract image features, and the second uses embedding (and optionally an LSTM) to extract textual features. The network then combines the two feature branches to predict the appropriate answer. We found that most sub-models did not need an LSTM to achieve high scores on the validation and test datasets. We submitted 10 models to the challenge; our best submission ranked 9th out of 17 overall. All source code is available at https://github.com/turner11/VQA-Med

Keywords: VQA-Med · LSTM · ImageCLEF-2019.

1 Introduction

The ever-increasing demand for automated (AI) systems to assist clinical medical practice comes from two main audiences: doctors, who use these systems to get a second opinion on their diagnoses; and patients, who increasingly have easy access to comprehensive and detailed medical data that they find bewildering. For patients, the motivation of such systems is to help them better understand their medical condition by providing detailed explanations of the results of their medical tests and scans, something that doctors, naturally, are unable to do for every data item in every patient's file. Access to one's detailed medical file without explanation leads to the unfortunate situation in which patients turn to the Internet and online forums to better understand their condition, where they encounter misleading information and reach false conclusions. This often worries patients, either because insufficiently specific details of their health are considered or, even worse, because irrelevant, false, or inexpert information is found.

Visual question answering (VQA) [1] is a subfield of AI relevant to this kind of problem. The task of VQA is to produce textual answers to textual questions asked in the context of a specific image. This is illustrated in Fig. 1: given an image and a question, a VQA system should supply an answer relevant to the question in the context of the image.

Fig. 1. Given an image and a question, a VQA system should supply an answer in the context of the question and the given image.

A VQA system [2] takes textual questions as input, together with the images they refer to, and combines data from the image and the question text to arrive at the most relevant answer.
To produce answers to specific questions, VQA systems combine natural language processing methods with advanced computer vision techniques. Applying VQA to the field of medicine poses a twofold challenge: not only are medical texts and images significantly different from those in the general computer vision field, but the resources and labelled data available in the medical field are quite limited compared with the general domain. This is evidenced by the roughly 260,000 images of the general-domain COCO-QA challenge dataset, contrasted with the roughly 5,000 images of the VQA-Med medical image dataset. Following the recent successes of VQA in the general computer vision field, and the challenge posed by the medical domain, ImageCLEF 2019 [3] published the second round of the VQA-Med challenge [4], first held in 2018. This paper deals with the problems of VQA in the medical field.

The rest of this paper is organized as follows. First, we describe related work. Next, we describe the dataset and challenge characteristics. Then, we describe our method in detail. Results are presented in Section 5, followed by conclusions and future work in the final sections.

2 Related Work

The VQA COCO-QA challenge studies a problem very similar to the VQA-Med task; it has been held every year since 2016 [1], and its dataset is based on public-domain images. The prevalent approach to VQA uses recurrent neural networks, such as LSTMs [5], to encode the textual questions, and deep convolutional networks, such as VGG-16, to encode and extract features from the images [6]. Building on these ideas, a plethora of other methods have been proposed in the literature, including attention mechanisms, dynamic models, and even the incorporation of external databases.

In this study, we took a different approach: our objective was to use classic VQA methods [7]. We analyzed those methods in order to determine their advantages and limitations with respect to the necessity of the LSTM layer and other parameters. We used conventional VQA approaches and optimized their parameters to find the best prediction method and the corresponding image and text features, thereby providing evidence as to whether or not the LSTM layer is necessary to achieve a high score.

Fig. 2. VQA-Med: texts as well as images pertaining to the medical field are significantly different and more complex.

3 Task Description and Dataset

The challenge dataset comprises a training set of 3,200 medical images and 12,792 question-answer (QA) pairs, a validation set of 500 medical images and 2,000 QA pairs, and a test set of 500 medical images and 500 questions with the answers withheld. The questions are divided into 4 categories: Modality, Plane, Organ System and Abnormality.

The evaluation of the participating systems in the VQA-Med 2019 task was conducted based on two metrics: BLEU and accuracy (strict). Accuracy (strict) is an adapted version of the accuracy metric from the general-domain VQA task that considers exact matching of a participant-provided answer and the ground-truth answer. BLEU is used to capture the similarity between a system-generated answer and the ground-truth answer. Each answer is converted to lower case, all punctuation is removed, and the answer is tokenized into individual words. Stopwords are removed using NLTK's English stopword list, and Snowball stemming is applied to increase the coverage of overlaps.
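The following sketch illustrates this answer normalization as we understand it, assuming the NLTK package; the helper names are ours and are not part of the official evaluation script.

```python
# Sketch of the answer normalization used by the evaluation metrics (our reading of it).
import string
from nltk.corpus import stopwords                 # requires nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize           # requires nltk.download('punkt')

_STOPWORDS = set(stopwords.words('english'))
_STEMMER = SnowballStemmer('english')

def normalize(answer):
    """Lower-case, strip punctuation, tokenize, drop stopwords, apply Snowball stemming."""
    answer = answer.lower().translate(str.maketrans('', '', string.punctuation))
    return [_STEMMER.stem(t) for t in word_tokenize(answer) if t not in _STOPWORDS]

def strict_accuracy(predictions, ground_truths):
    """Fraction of answers that exactly match the ground truth after normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```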
4 Methods

The input to our method is an image and a question referring to it; the output is an answer to the question in the context of the given image (Fig. 2). The system comprises two stages: the first predicts the category of the textual question (Fig. 3), while the second is a classic VQA module that combines the question and image to predict a relevant answer (see Fig. 4 below).

The first stage classifies the question into 5 question categories: Modality, Plane, Organ System, and two Abnormality categories. Note that we subdivide the task's given Abnormality class into two categories: questions with a yes or no answer, and all other questions. This stage uses an embedding of the question and an MLPClassifier, which optimizes the log-loss function using L-BFGS or stochastic gradient descent. We used the scikit-learn package [8] for this, with its default parameters.

Fig. 3. The first stage predicts the category of the textual question, classifying it into the following 5 categories: Modality, Plane, Organ System, Abnormality Yes/No, and Abnormality Other.

The second stage is a classic VQA deep learning module, which takes the question and image as a combined input and predicts the appropriate answer. The text undergoes preprocessing and embedding, and the output of this branch is treated as the textual features; we investigated whether an LSTM layer was needed at this point or not. Image features are extracted using a CNN (VGG-19) [9]. Finally, the features from both branches are merged using fully-connected layers.

Fig. 4. A classic VQA deep learning module, which takes the question and image as a combined input and predicts the appropriate answer.
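The sketch below illustrates how such a two-stage pipeline can be wired up with scikit-learn and Keras. The input shapes, embedding dimension, TF-IDF stand-in for the question embedding, frozen ImageNet weights, and helper names are illustrative assumptions on our part; the actual implementation is available in our repository.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Stage 1: question-category classifier (Modality / Plane / Organ System /
# Abnormality Yes-No / Abnormality Other). TF-IDF stands in for the embedding step;
# MLPClassifier is used with its default parameters, as in the paper.
category_clf = make_pipeline(TfidfVectorizer(), MLPClassifier())
# category_clf.fit(train_questions, train_categories)

# Stage 2: one VQA sub-model per question category.
def build_vqa_submodel(vocab_size, num_answers, question_len=30,
                       fc_size=14, lstm_units=0):
    """VGG-19 image branch + embedding (optionally LSTM) text branch, merged by dense layers."""
    # Image branch: VGG-19 features (frozen ImageNet weights, an illustrative choice).
    image_in = layers.Input(shape=(224, 224, 3))
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet', pooling='avg')
    vgg.trainable = False
    image_feat = vgg(image_in)

    # Text branch: token-id input, embedding, then either an LSTM or simple average pooling.
    question_in = layers.Input(shape=(question_len,))
    text = layers.Embedding(vocab_size, 100)(question_in)
    if lstm_units > 0:
        text_feat = layers.LSTM(lstm_units)(text)      # used only for Abnormality Other
    else:
        text_feat = layers.GlobalAveragePooling1D()(text)

    # Merge the two branches and predict one of the candidate answers.
    merged = layers.Concatenate()([image_feat, text_feat])
    merged = layers.Dense(fc_size, activation='relu')(merged)
    out = layers.Dense(num_answers, activation='softmax')(merged)

    model = Model(inputs=[image_in, question_in], outputs=out)
    # RMSprop, softmax output and categorical crossentropy, as in our best submissions.
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```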
5 Results

We start with a short comparison of our results with those of other groups, and then focus on our own submissions. When participating groups are compared using only the best-performing submission from each group, we placed 9th out of 17 groups, achieving a strict accuracy of 0.53 compared with 0.62 achieved by the best-performing group's submission. Looking at the number of submissions, we submitted 10, while the average number of submissions per group was 4.7 (80 submissions by 17 groups). In terms of average performance across all submissions, however, we placed 8th, trading places with LIST, due to the small variation in performance between our submissions, which all ranked between places 25 and 44 (of 80 total submissions), while the LIST group's submissions ranked between places 24 and 57.

We review and analyze our submissions in order of the scores they achieved on the test set, looking particularly at the following features of the submitted models: 1) optimization function, 2) activation function, 3) loss function, 4) batch size, 5) size of the fully-connected layers, 6) number of units in the LSTM layer, and 7) whether class weights were used.

Since this paper is focused on finding out whether LSTM is needed for VQA tasks, and we wanted the effect and contribution of the LSTM to be highlighted, we chose to work with a very simple convolution network, the VGG network. When evaluating optimization functions, we found that RMSprop produced the best results across all the submitted models. The results clearly indicated that using softmax as the output activation alongside categorical crossentropy as the loss function was the best option, which was expected, as they are the most natural choice for a task like this [10]. These were the parameters used in the three highest-performing submissions (by test set scores). Batch size was 32 for all question categories except the Abnormality Yes/No category, which required a batch size of 75; submissions with lower batch sizes produced less accurate results. Let us now turn to the last three parameters:
• Size of the fully-connected layers
• Number of units in the LSTM layer
• Whether class weights were used

5.1 Submission Details

We review our ten submissions in order of their challenge test set results. Note that each submission comprises five sub-models, one per question category. In the tables presented per submission, each row represents a sub-model (for one question category), and the columns are:
– Column 1 – the question category the sub-model was trained for
– Columns 2-3 – the sub-model's validation set scores (strict accuracy and BLEU)
– Column 4 – size of the fully-connected layers (FC)
– Column 5 – number of units in the LSTM layer (LM for short)
– Column 6 – loss function used
– Column 7 – activation function used (act. for short)
– Column 8 – batch size
– Column 9 – number of epochs (epo. for short)
– Column 10 – whether class weights were used

Best performing submission. The submission with the highest test set score had the following characteristics. The fully-connected layer size was 14 for all sub-models except Abnormality Other, which used 21; this is the largest fully-connected layer size used among our submissions, and our findings indicate that larger fully-connected layers were more successful in generalizing from the validation set. An LSTM layer was used only in the sub-model handling the Abnormality Other question category. In the training and validation datasets, the frequencies of Yes and No answers were not balanced for the Abnormality Yes/No category; we therefore investigated whether class weights would improve accuracy and found that they did (a sketch of how such weights can be computed follows Table 3). See the test set results in Table 4 and the validation results in Table 1.

Table 1. Best performing submission. cross. stands for categorical crossentropy, Abnorm. for Abnormality, FC for fully-connected layer size, LM for LSTM units, epo. for epochs, act. for activation.

Category         Acc.   BLEU   FC   LM    Loss    Act.      Batch   Epo.   Class weight
Organ            0.70   0.70   14   0     cross.  softmax   32      7      no
Plane            0.74   0.74   14   0     cross.  softmax   32      7      no
Modality         0.82   0.82   14   0     cross.  softmax   32      10     no
Abnorm. Other    0.724  0.76   21   128   cross.  softmax   32      3      no
Abnorm. Yes/No   0.02   0.05   14   0     cross.  softmax   75      2      yes

2nd and 3rd best performing submissions. The differences between these models and the best submission were not great, nor were the differences in the scores they achieved (about 0.53 strict accuracy and 0.55 BLEU). Compared with the best-performing submission, these two used fewer epochs and smaller fully-connected layer sizes. See the test set results in Table 4 and the validation results in Table 2 and Table 3.

Table 2. 2nd performing submission. Abbreviations as in Table 1.

Category         Acc.   BLEU   FC   LM    Loss    Act.      Batch   Epo.   Class weight
Organ            0.66   0.68   14   0     cross.  softmax   32      7      no
Plane            0.74   0.74   8    0     cross.  softmax   32      5      no
Modality         0.72   0.75   14   0     cross.  softmax   32      10     no
Abnorm. Other    0.02   0.04   21   128   cross.  softmax   32      3      no
Abnorm. Yes/No   0.78   0.78   14   0     cross.  softmax   75      2      yes

Table 3. 3rd performing submission. Abbreviations as in Table 1.

Category         Acc.   BLEU   FC   LM    Loss    Act.      Batch   Epo.   Class weight
Organ            0.63   0.65   14   0     cross.  softmax   32      9      no
Plane            0.72   0.72   14   0     cross.  softmax   32      7      no
Modality         0.72   0.75   14   0     cross.  softmax   32      7      no
Abnorm. Other    0.02   0.02   21   128   cross.  softmax   32      3      no
Abnorm. Yes/No   0.78   0.78   14   0     cross.  softmax   75      2      yes
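As noted above for the Abnormality Yes/No sub-models, class weights were used to compensate for the imbalance between Yes and No answers. The following is a minimal sketch of how such weights can be computed and passed to training, assuming scikit-learn and Keras; the labels shown are placeholders, not our actual training data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Placeholder labels standing in for the Abnormality Yes/No training answers
# (0 = "no", 1 = "yes"); in practice these come from the training QA pairs.
y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])

weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_train), y=y_train)
class_weight = dict(zip(np.unique(y_train), weights))

# The resulting dictionary can then be passed to Keras training, e.g.:
# model.fit([images, questions], answers_one_hot, class_weight=class_weight, batch_size=75)
```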
Table 4. Test set scores of all 10 submissions.

Submission rank   Accuracy   BLEU
1                 0.54       0.55
2                 0.53       0.55
3                 0.52       0.57
4                 0.53       0.558
5                 0.52       0.55
6                 0.52       0.55
7                 0.52       0.57
8                 0.50       0.54
9                 0.50       0.51
10                0.50       0.56

Submissions 4, 5, 6, 8 and 9. These submissions either did not use class weights at all or did not restrict them to the Abnormality Yes/No category, and the size of their fully-connected layers was smaller, emphasizing the importance of these elements to the network. See the test set results in Table 4.

Submissions 7 and 10. These submissions did not include an LSTM layer at all; their lower scores underline the importance of this layer for handling complex tasks such as the Abnormality Other question category. See the test set results in Table 4.

6 Conclusions

This paper presents research done in the context of our participation in the VQA-Med challenge. We analyzed VQA classifiers and feature extraction methods for image and text classification in the context of the medical images of the VQA-Med 2019 task. We found that none of the sub-models needed an LSTM layer, except the one handling the Abnormality Other question category, the most complex task, which also required fully-connected layers of size 21, unlike all the other categories, for which fully-connected layers of size 14 were sufficient. Class weights are needed only in cases where a significant imbalance between answer class frequencies exists, as there was in this challenge in the Abnormality Yes/No question category. We submitted 10 models, with our best submission ranking 9th out of 17. All source code is available at https://github.com/turner11/VQA-Med

7 Future Work

This paper focused on the question of whether and when an LSTM layer is useful for VQA tasks. We therefore chose to work with a very simple convolution network, the VGG network. Further research on the effect and contribution of the LSTM module is needed that looks at a broader range of convolution networks, including more advanced architectures such as ResNet and Inception, and their effect on the results (a sketch of such a backbone swap is given below). We also intend to investigate the effects of using larger fully-connected layers and more epochs. Looking at batch size, we found that our best-performing submissions used a batch size of 32, with 75 for the Abnormality Yes/No sub-model, while all lower batch sizes produced less accurate results. The batch size was limited by our computing resources, and we intend to examine larger batch sizes on stronger hardware.
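To illustrate the kind of backbone swap we have in mind, the image branch of the sub-model sketched in Section 4 could be made configurable as follows; this is an illustrative sketch assuming Keras, with ResNet50 and InceptionV3 standing in for the more advanced networks mentioned above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def image_branch(backbone='vgg19'):
    """Return (input tensor, pooled feature tensor) for the chosen CNN backbone."""
    image_in = layers.Input(shape=(224, 224, 3))
    if backbone == 'resnet50':
        cnn = tf.keras.applications.ResNet50(include_top=False, weights='imagenet', pooling='avg')
    elif backbone == 'inception_v3':
        image_in = layers.Input(shape=(299, 299, 3))   # InceptionV3's default input size
        cnn = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet', pooling='avg')
    else:
        cnn = tf.keras.applications.VGG19(include_top=False, weights='imagenet', pooling='avg')
    cnn.trainable = False                              # keep the ImageNet features frozen
    return image_in, cnn(image_in)
```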
References

1. Kushal Kafle and Christopher Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 163:3–20, 2017.
2. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
3. Bogdan Ionescu, Henning Müller, Renaud Péteri, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Minh-Triet Tran, Mathias Lux, Cathal Gurrin, Yashin Dicente Cid, et al. ImageCLEF 2019: Multimedia retrieval in lifelogging, medical, nature, and security applications. In European Conference on Information Retrieval, pages 301–308. Springer, 2019.
4. Asma Ben Abacha, Sadid A. Hasan, Vivek V. Datla, Joey Liu, Dina Demner-Fushman, and Henning Müller. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes, CEUR Workshop Proceedings, Lugano, Switzerland, September 9–12, 2019. CEUR-WS.org.
5. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
6. Kushal Kafle and Christopher Kanan. Answer-type prediction for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4976–4984, 2016.
7. Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. VQA: Visual question answering. International Journal of Computer Vision, 123(1):4–31, 2017.
8. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
9. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
10. Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.