The Inception Team at VQA-Med 2020: Pretrained VGG with Data Augmentation for Medical VQA and VQG

Aisha Al-Sadi, Hana Al-Theiabat, and Mahmoud Al-Ayyoub

Jordan University of Science and Technology, Jordan
asalsadi16@cit.just.edu.jo, haaltheiabat13@cit.just.edu.jo, maalshbool@just.edu.jo

Abstract. This paper describes the methodology of The Inception team's participation in the ImageCLEF Medical 2020 tasks: Visual Question Answering (VQA) and Visual Question Generation (VQG). Based on the data type and structure of the dataset, both tasks are treated as image classification tasks and are handled using the pre-trained VGG16 model along with a data augmentation technique. In both tasks, our best approach achieves second place, with an accuracy of 48% in the VQA task and a BLEU score of 33.9% in the VQG task.

Keywords: ImageCLEF 2020 · VQA-Med · Visual Question Answering · Visual Question Generation · Medical Image Interpretation · Medical Questions and Answers · Transfer Learning · VGG Network · Augmentation

1 Introduction

Recently, enormous research efforts have been invested in merging or fusing techniques from natural language processing, signal processing, and computer vision with the aim of improving the interaction between humans and intelligent systems. This in turn gives rise to tasks such as Visual Question Answering (VQA), Visual Question Generation (VQG), Multimodal Machine Translation, Video Captioning, and Scene Understanding. Such tasks require learning across multimodal features from both visual and language data [18, 23, 25, 5, 11].

VQA and VQG are closely related tasks of significant importance [3]. VQA aims to answer a given question based on a given image. On the other hand, given an input image, VQG is concerned with generating relevant questions about this image along with their answers while relying only on the image content. Both tasks have attracted the attention of several researchers who proposed interesting models with promising results [3].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Among the most notable efforts on VQA is a set of challenges or shared tasks held every year addressing different versions of VQA. Starting from 2016, a challenge on general VQA, known as the VQA Challenge (https://visualqa.org/index.html), has been held every year to answer questions of various types (multiple-choice questions, yes/no questions, and questions with open-ended answers). For the medical domain, a challenge known as the VQA-Med challenge has been held since 2018. In its first version [6], only five teams participated in the task, which focused on answering questions about abnormalities in medical images. For the second version, VQA-Med 2019 [4], the challenge included four question categories for each medical image: the plane, the modality, the organ system, and the abnormality shown in the image.

In this paper, we propose models to solve the medical VQA-Med tasks [3] (VQA and VQG) organized by ImageCLEF 2020 [8]. All of our models are based on a pre-trained convolutional neural network (CNN), VGG16 [22], and a data augmentation technique. In both tasks, our best approach achieves second place, with an accuracy of 48% in the VQA task and a BLEU score of 33.9% in the VQG task.

The rest of this paper is organized as follows. The following section briefly surveys the literature on VQA and VQG.
Section 3 presents the description of both VQA-Med tasks (VQA and VQG) along with a detailed analysis of the dataset for each task. In Section 4, we present the proposed models for each task. Submission results and discussion are presented in Section 5. Finally, the paper is concluded in Section 6.

2 Related Work

For the general VQA task, several researchers merge visual and textual information using encoder-decoder networks [14], while others introduce attention over images to highlight the regions that are most important for answering the questions [20]. In [13], the authors proposed a co-attention mechanism to jointly apply attention to both images and questions.

For medical VQA, the task is challenging as it requires a dedicated medical dataset and expert doctors to understand the data. VQA-Med [7] is a competition first launched in 2018 that provides a medical dataset for the VQA task each year. Team UMass [17], which achieved the highest BLEU score in 2018, proposed a co-attention method between image features, extracted from ResNet-152, and text features from pre-trained word embeddings. Team JUST [24] used a simple encoder-decoder model based on Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM), where the image features were extracted by a VGG16 model.

For the second version of the VQA-Med challenge in 2019, the team of Zhejiang University [26] was ranked first among 17 teams. This team adopted the co-attention mechanism to merge both visual and text features. The image features were extracted using the VGG16 network with a Global Average Pooling strategy, while the Bidirectional Encoder Representations from Transformers (BERT) model was used to extract question features. The Multi-modal Factorized Bilinear pooling (MFB) and Multi-modal Factorized High-order pooling (MFH) methods were used to fuse the features. In the same year, our team, Team JUST [1], was ranked among the top five teams by proposing a hierarchical model composed of multiple deep learning sub-models to handle the different question types. All sub-models were based on a pre-trained VGG16 network used as an image classifier, without considering the questions as input features for the sub-models.

Visual Question Generation (VQG) is the task of developing visual understanding from images in order to generate reasonable questions, either with constraints (conditional VQG) [12] using labeled answers or without constraints (unconditional VQG) using only the image itself. A VQG dataset annotating each image with multiple questions was first collected by [15]. Various works have adopted the multimodal context of the natural language of questions/answers and the visual understanding of the image. The authors of [27] proposed an approach to understanding the semantics of the image by simultaneously training VQG and VQA models, since VQG is viewed in [12] as a dual task of VQA. In [27], the VQG model used both an RNN and a CNN to let the model learn both natural language and vision aspects. This was extended in [9], where the authors used a Variational Autoencoder (VAE) with an LSTM to generate several questions per image. On the other hand, a deep Bayesian multimodal network was proposed by [16] to generate a set of questions for each image. Regarding VQG in the medical field, Sarrouti et al. [19] proposed an approach for VQG about radiology images called VQGR. The approach relied on VAE.
To increase the dataset size, the same authors [19] applied data augmentation to both the images and the questions of the dataset of [10].

3 Tasks and Dataset Description

In this section, we discuss the details and datasets of VQA-Med 2020's two tasks: VQA and VQG.

3.1 VQA Task and Dataset Description

The dataset of VQA-Med consists of 4,000 training medical images and 500 validation medical images, where each image is associated with a Question-Answer (QA) pair. Additionally, there are 500 medical test images with their questions. The questions are of two types:

– Type 1: Questions asking about abnormalities in the image. For example, "what is the abnormality in this image". This type represents the majority of the questions (98.5% of the training data and 94.4% of the validation data).
– Type 2: Questions with yes/no answers that ask whether the image is normal or not. For example, "is this image normal" or "is this image abnormal" (1.5% of the training data and 5.6% of the validation data).

It is worth mentioning some brief statistics about the dataset:

– There are only 38 unique questions among the 4,000 training questions.
– There are only 26 unique questions among the 500 validation questions.
– There are only 332 unique answers among the 4,000 training answers.
– There are only 232 unique answers among the 500 validation answers.
– Most unique answers are each associated with 5-20 questions/images, while the most repetitive answer is associated with 80 questions/images and the least repetitive answer is associated with 3 questions/images.

Samples of the dataset are provided in Table 1. The requirement in the VQA task is to answer the question given for each image.

Table 1. VQA task dataset samples (the image column is omitted here)

Question: what is the primary abnormality in this image?
Answer: neurofibromatosis-1, nf1, nfr

Question: what is abnormal in the x-ray?
Answer: incarcerated diaphragmatic hernia presenting as colonic obstruction.

Question: is this image normal?
Answer: yes

3.2 VQG Task and Dataset Description

For the second task, VQG, the dataset consists of 780 images with 2,156 questions as training data and 141 images with 164 questions as validation data. Each image is associated with one or more questions (up to 12 questions). For the test data, there are 80 images. The answers are given for the training and validation datasets but not for the test dataset. The questions about the images are diverse and are not necessarily related to the VQA dataset (i.e., the questions are not limited to asking about the abnormality in the images). It is worth mentioning some brief statistics about the dataset:

– There are 1,942 unique questions among the 2,156 training questions.
– There are 161 unique questions among the 164 validation questions.
– Most questions occur only once, but some questions are associated with more than one image (up to 8 images).
– Some questions in the validation data do not appear in the training data.

Samples of the dataset are provided in Table 2. The requirement in the VQG task is to generate at least one question and at most seven questions for each test image.

4 Methodology

In this section, a description of the basic model is provided first, followed by the details of each task's submissions. We explore different approaches for the tasks at hand. However, most of these approaches are based on our experiments from last year's edition of the VQA-Med task [1]. We call this the basic model for any later reference. A description of this model follows.
The model is an image classification model that uses the pre-trained VGG16 network with the last layer (the Softmax layer) removed and all layers except the last four frozen. The output of this part is passed into two fully-connected layers with 1,024 hidden nodes each, followed by a normalization layer. Finally, the output is passed into a Softmax layer with the number of classes required by the task. Figure 1 shows the architecture of the basic model.

Fig. 1. Basic model architecture

The other pillar of our models is the augmentation technique. Although deep learning models achieve remarkably good performance in several fields such as computer vision, they suffer from overfitting when the learners fit the training data perfectly. Usually, to avoid overfitting, the networks need access to more data. However, many applications, such as medical image applications, lack access to big data. One solution to increase the amount of training data is data augmentation, which aims at increasing the diversity of the data without the need to collect new samples. For images, augmentation is done by applying geometric transformations, flipping, padding, or random erasing to the existing images to produce new images [21].

For the VQA-Med tasks, we apply data augmentation to the images to improve the performance of our models. We experiment with various kinds of augmentation such as rotation, width/height shift, rescaling, zoom, and ZCA whitening. ZCA whitening [2] is a whitening transformation used to decorrelate the features in the data. It is widely used for images as it removes redundancy and highlights the structure of the features. Unlike PCA whitening, ZCA preserves the arrangement of the original data, which makes it a good option for CNNs.

Table 2. VQG task dataset samples (questions associated with one sample image)

Q1: What are present in the right greater wing of the sphenoid and in the right parietal bone just above the squamosal suture?
Q2: What involves the right parietal bone just above the squamosal suture?
Q3: What are demonstrated in the skull?
Q4: What are present in the right greater wing of the sphenoid and in the left parietal bone near the vertex?

4.1 VQA Task

A common way of dealing with VQA tasks is to use sequence-to-sequence (Seq2Seq) models that merge image features and question features to predict the answer. However, due to the nature of the dataset and the repetitiveness of the questions, the contents of the questions are not expected to play a major role in answering them (aside from determining the question type, etc.). On the other hand, the image features play a much more significant role. Hence, we treat this task as an image classification task. The core of our models is the basic model illustrated previously (see Figure 1) with additional modifications. Specifically, we build two models as follows.

– A model for classifying an image as normal/abnormal, for the yes/no answers.
– A model for answering the abnormality questions.

For the first model (normal/abnormal), we use the basic model and pass the output to a Softmax layer with 2 classes (normal/abnormal). The images and QA pairs used in this model are the ones associated with questions that start with "What". For the second model, which answers the abnormality questions, we use the same architecture, but the output is passed to a Softmax layer with 330 classes (the 330 different abnormalities in the training dataset). For this task, we make five submissions; a sketch of the classification model and of the augmentation step is given below.
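To make the architecture concrete, the following is a minimal Keras sketch of the basic model with the two VQA output heads described above. It is an illustrative reconstruction rather than our exact training code: the ReLU activations, the use of BatchNormalization as the normalization layer, the mapping of the "last four" unfrozen layers onto the Keras VGG16 implementation, and the optimizer and loss are assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_basic_model(num_classes):
    """Basic model: pre-trained VGG16 without its final Softmax layer,
    two 1,024-unit fully-connected layers, a normalization layer, and a
    task-specific Softmax output."""
    base = VGG16(weights="imagenet", include_top=True)  # expects 224x224x3 inputs
    # Freeze every layer except the last four, as described above
    # (which layers these are in practice is our reading of the description).
    for layer in base.layers[:-4]:
        layer.trainable = False
    # Output of the last fully-connected layer ("fc2"), i.e. the network
    # with its Softmax ("predictions") layer removed.
    x = base.get_layer("fc2").output
    x = layers.Dense(1024, activation="relu")(x)   # activation assumed
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.BatchNormalization()(x)             # assumed normalization layer
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer="adam",                # optimizer and loss assumed
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# The two VQA models: a 2-class normal/abnormal classifier and a
# 330-class abnormality classifier.
normal_abnormal_model = build_basic_model(num_classes=2)
abnormality_model = build_basic_model(num_classes=330)
```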
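The ZCA whitening transform itself can be sketched in a few lines of NumPy. The snippet below is only an illustration on a toy batch of flattened images; Keras's ImageDataGenerator also exposes a zca_whitening option, but this snippet does not reproduce our exact augmentation pipeline.

```python
import numpy as np

def zca_whitening_matrix(X, eps=1e-5):
    """ZCA whitening matrix for X of shape (n_samples, n_features),
    where each row is a flattened image."""
    Xc = X - X.mean(axis=0)            # featurewise centering
    cov = np.cov(Xc, rowvar=False)     # feature covariance matrix
    U, S, _ = np.linalg.svd(cov)       # eigendecomposition of the covariance
    # Rotating back with U keeps the whitened data aligned with the original
    # pixel axes, which is what distinguishes ZCA from PCA whitening.
    return U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T

# Toy example with small random "images" (16x16 pixels, flattened).
rng = np.random.default_rng(seed=0)
images = rng.random((100, 16 * 16))
W = zca_whitening_matrix(images)
whitened = (images - images.mean(axis=0)) @ W.T
print(whitened.shape)  # (100, 256)
```

In our submissions, this transformation is applied as part of the image augmentation step (see Tables 3 and 4 below).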
Table 3 summarizes the differences between these five submissions, which are all based on the model described above. For Submissions 2-5, we apply data augmentation to the images to improve the performance of the second model (the abnormalities model). We experiment with various kinds of augmentation such as rotation, width/height shift, rescaling, zoom, and ZCA whitening.

Table 3. Description of the VQA submissions

Submission 1: the basic model without any data augmentation, 50 epochs.
Submission 2: the basic model + data augmentation using featurewise centering and featurewise std normalization, 50 epochs.
Submission 3: the basic model + data augmentation using ZCA whitening, 300 epochs.
Submission 4: the basic model + data augmentation using ZCA whitening, 50 epochs.
Submission 5: the basic model, using the weights of the highest-accuracy model so far (Submission 4) as pre-training, + data augmentation using ZCA whitening, 150 epochs.

4.2 VQG Task

For the VQG task, based on the data description, answers are available only for the training and validation data. Therefore, we decide to build our model based on the images only. The first idea to apply in such tasks is image captioning, where the question is considered as the caption. This model is used in the first of the five submissions we made for this task. The image captioning model is a Seq2Seq model. In each time step, the image features are concatenated with the current question word to extract the features of that time step. These features are passed to an LSTM layer, then to a dense layer followed by a Softmax layer to predict the next word. In the first step, the image with a predefined start word is used to predict the first word of the question, then the image with the first word is used to predict the second word, and so on.

Due to the poor performance we get from this approach, we resort to treating this task as an image classification task in Submissions 2 and 3. We use the same basic image classification model, except that the output is passed to a Softmax layer with 2,072 classes (the 2,072 different unique questions in the training and validation datasets combined). For Submissions 4 and 5, we use the ZCA whitening augmentation technique alongside the basic image classification model. Table 4 summarizes the differences between the five submissions, and a sketch of the question-prediction step is given after the table.

Table 4. Description of the VQG submissions

Submission 1: image captioning model, 50 epochs, predicting one question for each image.
Submission 2: image classification model, 25 epochs, predicting 7 questions for each image.
Submission 3: image classification model, 50 epochs, predicting 7 questions for each image.
Submission 4: image classification model + data augmentation using ZCA whitening, 100 epochs, predicting 7 questions for each image.
Submission 5: image classification model + data augmentation using ZCA whitening, 50 epochs, predicting 7 questions for each image.
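The question-prediction step of the classification-based submissions can be sketched as follows: the basic model is trained over the 2,072 question classes, and the seven most probable classes are mapped back to question strings for each test image. The names vqg_model, index_to_question, and test_images are illustrative placeholders, not names from our actual code.

```python
import numpy as np

def predict_questions(vqg_model, test_images, index_to_question, k=7):
    """Return the k most probable questions for each test image.

    vqg_model is the basic classification model with a 2,072-way Softmax,
    test_images is a batch of preprocessed images of shape (n, 224, 224, 3),
    and index_to_question maps a class index to its question string."""
    probs = vqg_model.predict(test_images)             # (n, 2072) class probabilities
    top_k = np.argsort(probs, axis=1)[:, ::-1][:, :k]  # indices of the k highest scores
    return [[index_to_question[i] for i in row] for row in top_k]
```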
5 Results and Discussion

5.1 VQA Task Results

Our team, The Inception Team, took second place on the leaderboard, shown in Figure 2, among the eleven participating teams, with our best accuracy reaching 48%. This is lower than the first place by only 1.6%. Our best result was obtained with Submission 4, which uses our basic model, ZCA whitening augmentation, and 50 epochs of training (see Table 3). Table 5 provides the accuracy and BLEU score of each of our submissions.

Fig. 2. VQA task leaderboard

Table 5. The Inception team VQA task submission results

Submission   Accuracy   BLEU Score
1            0.454      0.486
2            0.458      0.495
3            0.444      0.479
4            0.480      0.511
5            0.440      0.476

5.2 VQG Task Results

In the VQG task, we also took second place on the leaderboard, shown in Figure 3, among the three participating teams, with our best BLEU score of 33.9%, which is lower than the first place by 0.9%. Our best result came from Submission 4, which uses our basic model and ZCA whitening augmentation (see Table 4), predicting 7 questions for each image, with 100 epochs of training. Table 6 provides the BLEU score of each of our submissions.

Fig. 3. VQG task leaderboard

Table 6. The Inception team VQG task submission results

Submission   BLEU Score
1            0.031
2            0.314
3            0.319
4            0.339
5            0.331

5.3 Discussion

Several observations can be drawn from our experiments. For example, the positive effect of using the augmentation technique is clear in both tasks. The models trained with augmented images achieve better results than the ones trained without augmentation, especially in the VQA task. In both tasks, the performance of the models increased after applying ZCA whitening augmentation. The reason why other augmentation methods, such as rotation and shifting, worsened the model performance lies in the nature of radiology images, which are sensitive to such changes as they might distort what the images are meant to show. It is worth mentioning that neither featurewise std normalization nor featurewise centering outperformed ZCA whitening. In featurewise std normalization, the distribution of the data becomes Gaussian-like, as each image is divided by the standard deviation of all images in the dataset; in the featurewise centering method, on the other hand, the mean of each image is set to zero.

6 Conclusion

In this paper, we describe the set of models we submitted for the ImageCLEF 2020 VQA-Med tasks. We showed that our intuition (based on our analysis of the dataset) of treating the tasks as image classification tasks is more useful than including a natural language processing (NLP) component, as one would expect in VQA/VQG tasks. Our best model is based on VGG16 and augments the data using the ZCA whitening technique. We achieved a 48% accuracy in the VQA task and a 33.9% BLEU score in the VQG task. These scores gave our team second place in both tasks.

References

1. Al-Sadi, A., Talafha, B., Al-Ayyoub, M., Jararweh, Y., Costen, F.: JUST at ImageCLEF 2019 visual question answering in the medical domain. In: CLEF (Working Notes) (2019)
2. Bell, A.J., Sejnowski, T.J.: The "independent components" of natural scenes are edge filters. Vision Research 37(23), 3327–3338 (1997)
3. Ben Abacha, A., Datla, V.V., Hasan, S.A., Demner-Fushman, D., Müller, H.: Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain. In: CLEF 2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25, 2020)
4. Ben Abacha, A., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In: CLEF (Working Notes) (2019)
5. Ebrahim, M., Al-Ayyoub, M., Alsmirat, M.: Determine bipolar disorder level from patient interviews using Bi-LSTM and feature fusion. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS). pp. 182–189. IEEE (2018)
6. Hasan, S.A., Ling, Y., Farri, O., Liu, J., Lungren, M., Müller, H.: Overview of the ImageCLEF 2018 medical domain visual question answering task (September 10-14, 2018)
7. Hasan, S.A., Ling, Y., Farri, O., Liu, J., Müller, H., Lungren, M.P.: Overview of ImageCLEF 2018 medical domain visual question answering task. In: CLEF (Working Notes) (2018)
8. Ionescu, B., Müller, H., Péteri, R., Ben Abacha, A., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), Lecture Notes in Computer Science, vol. 12260. Springer, Thessaloniki, Greece (September 22-25, 2020)
9. Jain, U., Zhang, Z., Schwing, A.G.: Creativity: Generating diverse questions using variational autoencoders. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6485–6494 (2017)
10. Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 1–10 (2018)
11. Lawonn, K., Smit, N.N., Bühler, K., Preim, B.: A survey on multimodal medical data visualization. In: Computer Graphics Forum. vol. 37, pp. 413–438. Wiley Online Library (2018)
12. Li, Y., Duan, N., Zhou, B., Chu, X., Ouyang, W., Wang, X., Zhou, M.: Visual question generation as dual task of visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6116–6124 (2018)
13. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances in Neural Information Processing Systems. pp. 289–297 (2016)
14. Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: A neural-based approach to answering questions about images. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1–9 (2015)
15. Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., Vanderwende, L.: Generating natural questions about an image. arXiv preprint arXiv:1603.06059 (2016)
16. Patro, B.N., Kumar, S., Kurmi, V.K., Namboodiri, V.P.: Multimodal differential network for visual question generation. arXiv preprint arXiv:1808.03986 (2018)
17. Peng, Y., Liu, F., Rosen, M.P.: UMass at ImageCLEF medical visual question answering (Med-VQA) 2018 task. In: CLEF (Working Notes) (2018)
18. Ramachandram, D., Taylor, G.W.: Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34(6), 96–108 (2017)
19. Sarrouti, M., Ben Abacha, A., Demner-Fushman, D.: Visual question generation from radiology images. In: Proceedings of the First Workshop on Advances in Language and Vision Research. pp. 12–18 (2020)
20. Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4613–4621 (2016)
21. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning.
Journal of Big Data 6(1), 60 (2019)
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
23. Srivastava, Y., Murali, V., Dubey, S.R., Mukherjee, S.: Visual question answering using deep learning: A survey and performance analysis. arXiv preprint arXiv:1909.01860 (2019)
24. Talafha, B., Al-Ayyoub, M.: JUST at VQA-Med: A VGG-Seq2Seq model. In: CLEF (Working Notes) (2018)
25. Xu, X., Song, J., Lu, H., He, L., Yang, Y., Shen, F.: Dual learning for visual question generation. In: 2018 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2018)
26. Yan, X., Li, L., Xie, C., Xiao, J., Gu, L.: Zhejiang University at ImageCLEF 2019 visual question answering in the medical domain. In: CLEF (Working Notes) (2019)
27. Yang, Y., Li, Y., Fermuller, C., Aloimonos, Y.: Neural self talk: Image understanding via continuous questioning and answering. arXiv preprint arXiv:1512.03460 (2015)