     VQA-Med: Overview of the Medical Visual
    Question Answering Task at ImageCLEF 2019

     Asma Ben Abacha1 , Sadid A. Hasan2 , Vivek V. Datla2 , Joey Liu2 , Dina
                 Demner-Fushman1 , and Henning Müller3
               1
                  Lister Hill Center, National Library of Medicine, USA
                          2
                             Philips Research Cambridge, USA
       3
         University of Applied Sciences Western Switzerland, Sierre, Switzerland
                                asma.benabacha@nih.gov
                               sadid.hasan@philips.com



        Abstract. This paper presents an overview of the Medical Visual Ques-
        tion Answering task (VQA-Med) at ImageCLEF 2019. Participating sys-
        tems were tasked with answering medical questions based on the visual
        content of radiology images. In this second edition of VQA-Med, we fo-
        cused on four categories of clinical questions: Modality, Plane, Organ
        System, and Abnormality. These categories were designed with different
        degrees of difficulty, calling for both classification and answer generation
        approaches. We also ensured that all questions can be answered from
        the image content without requiring additional medical knowledge or
        domain-specific inference. We created a new dataset of 4,200 radiology
        images and 15,292 question-answer pairs following these guidelines. The
        challenge was well received, with 17 participating teams applying a
        wide range of approaches such as transfer learning, multi-task learning,
        and ensemble methods. The best team achieved a BLEU score of 64.4%
        and an accuracy of 62.4%. In future editions, we will consider designing
        more goal-oriented datasets and tackling new aspects such as contextual
        information and domain-specific inference.

        Keywords: Visual Question Answering, Data Creation, Deep Learning,
        Radiology Images, Medical Questions and Answers


1     Introduction

Recent advances in artificial intelligence have opened new opportunities in clinical
decision support. In particular, solutions for the automatic interpretation of medical
images are attracting growing interest due to their potential applications in image
retrieval and assisted diagnosis. Moreover, systems capable of understanding clinical
images and answering questions about their content could support clinical education,
clinical decision making, and patient education. From a
    Copyright © 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September
    2019, Lugano, Switzerland.
computational perspective, this Visual Question Answering (VQA) task presents
an exciting problem that combines natural language processing and computer
vision techniques. In recent years, substantial progress has been made on VQA
with new open-domain datasets [3, 8] and approaches [23, 7].
    However, there are challenges that need to be addressed when tackling VQA
in a specialized domain such as medicine. Ben Abacha et al. [4] analyzed some
of the issues facing medical visual question answering and described four key
challenges: (i) designing goal-oriented VQA systems and datasets, (ii) catego-
rizing the clinical questions, (iii) selecting (clinically) relevant images, and (iv)
capturing the context and the medical knowledge.
    Inspired by the success of visual question answering in the general domain,
we conducted a pilot task (VQA-Med 2018) in ImageCLEF 2018 to focus on
visual question answering in the medical domain [9]. Based on the success of the
initial edition, we continued the task this year with an enhanced focus on a
larger, well-curated dataset.
    In VQA-Med 2019, we selected radiology images and medical questions that
(i) asked about only one element and (ii) could be answered from the image
content. We targeted four main categories of questions with different difficulty
levels: Modality, Plane, Organ System, and Abnormality. In particular, the first
three categories can be tackled as a classification task, while the fourth category
(abnormality) presents an answer generation problem. We intentionally designed
the data in this manner to study the behavior and performance of different
approaches on both aspects. This design is more relevant to clinical decision
support than the common approach in open-domain VQA datasets [3, 8] where
the answers consist of one word or number (e.g. yes, no, 3, stop).
    In the following section, we present the task description with more details
and examples. We describe the data creation process and the VQA-Med-2019
dataset in Section 3. We present the evaluation methodology and discuss the
challenge results in Sections 4 and 5, respectively.


2   Task Description

In the same way as last year, given a medical image accompanied by a clinically
relevant question, participating systems in VQA-Med 2019 were tasked with an-
swering the question based on the visual image content. In VQA-Med 2019, we
specifically focused on radiology images and four main categories of questions:
Modality, Plane, Organ System, and Abnormality. We mainly considered medi-
cal questions asking about only one element, e.g., “what is the organ principally
shown in this MRI?”, “in what plane is this mammograph taken?”, “is this a
t1 weighted, t2 weighted, or flair image?”, “what is most alarming about this
ultrasound?”.
    All selected questions can be answered from the image content without re-
quiring additional domain-specific inference or context. Questions involving these
aspects will be considered in future editions of the challenge, e.g.: “Is this modality
safe for pregnant women?”, “What is located immediately inferior to the right
hemidiaphragm?”, “What can be typically visualized in this plane?”, “How would
you measure the length of the kidneys?”


3      VQA-Med-2019 Dataset
We automatically constructed the training, validation, and test sets by (i) apply-
ing several filters to select relevant images and associated annotations, and (ii)
creating patterns to generate the questions and their answers. The test set was
manually validated by two medical doctors. The dataset is publicly available4.
Figure 1 presents examples from the VQA-Med-2019 dataset.

3.1     Medical Images
We selected relevant medical images from the MedPix5 database with filters
based on their captions, modalities, planes, localities, categories, and diagnosis
methods. We retained only the cases where the diagnosis was made based on
the image. Examples of the selected diagnosis methods include: CT/MRI Imaging,
Angiography, Characteristic imaging appearance, Radiographs, Imaging features,
Ultrasound, and Diagnostic Radiology.
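As an illustration of this filtering step, the sketch below keeps only cases whose
diagnosis method is image-based and whose annotations are complete. It is not the
organizers' actual script, and the metadata field names (diagnosis_method, caption,
modality, plane, location_category) are assumptions made for the example.

# Illustrative filtering sketch (assumed field names; not the official script).
IMAGE_BASED_METHODS = {
    "CT/MRI Imaging", "Angiography", "Characteristic imaging appearance",
    "Radiographs", "Imaging features", "Ultrasound", "Diagnostic Radiology",
}

def select_images(cases):
    """Keep cases with complete annotations and an image-based diagnosis method."""
    selected = []
    for case in cases:
        if case.get("diagnosis_method") not in IMAGE_BASED_METHODS:
            continue
        # Require the metadata fields later used to generate questions and answers.
        if all(case.get(k) for k in ("caption", "modality", "plane", "location_category")):
            selected.append(case)
    return selected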

3.2     Question Categories and Patterns
We targeted the most frequent question categories: Modality, Plane, Organ Sys-
tem, and Abnormality, following the question categories of the VQA-RAD dataset [13].

      1) Modality: Yes/No, WH and closed questions. Examples:
 – was gi contrast given to the patient?
 – what is the mr weighting in this image?
 – what modality was used to take this image?
 – is this a t1 weighted, t2 weighted, or flair image?
      2) Plane: WH questions. Examples:
 – what is the plane of this mri?
 – in what plane is this mammograph taken?
      3) Organ System: WH questions. Examples:
 – what organ system is shown in this x-ray?
 – what is the organ principally shown in this mri?
      4) Abnormality: Yes/No and WH questions. Examples:
 – does this image look normal?
 – are there abnormalities in this gastrointestinal image?
 – what is the primary abnormality in the image?
 – what is most alarming about this ultrasound?
4
    github.com/abachaa/VQA-Med-2019
5
    https://medpix.nlm.nih.gov

Fig. 1: Examples from the VQA-Med-2019 test set: (a) Q: what imaging method was
used? A: us-d - doppler ultrasound; (b) Q: which plane is the image shown in? A: axial;
(c) Q: is this a contrast or non-contrast ct? A: contrast; (d) Q: what plane is this?
A: lateral; (e) Q: what abnormality is seen in the image? A: nodular opacity on the
left#metastastic melanoma; (f) Q: what is the organ system in this image? A: skull and
contents; (g) Q: which organ system is shown in the ct scan? A: lung, mediastinum,
pleura; (h) Q: what is abnormal in the gastrointestinal image? A: gastric volvulus
(organoaxial).
Planes (16): Axial; Sagittal; Coronal; AP; Lateral; Frontal; PA; Transverse;
Oblique; Longitudinal; Decubitus; 3D Reconstruction; Mammo-MLO; Mammo-
CC; Mammo-Mag CC; Mammo-XCC.
Organ Systems (10): Breast; Skull and Contents; Face, sinuses, and neck; Spine
and contents; Musculoskeletal; Heart and great vessels; Lung, mediastinum,
pleura; Gastrointestinal; Genitourinary; Vascular and lymphatic.
Modalities (36):
 – [XR]: XR-Plain Film
 – [CT]: CT-noncontrast; CT w/contrast (IV); CT-GI & IV Contrast; CTA-CT
   Angiography; CT-GI Contrast; CT-Myelogram; Tomography
 – [MR]: MR-T1W w/Gadolinium; MR-T1W-noncontrast; MR-T2 weighted;
   MR-FLAIR; MR-T1W w/Gd (fat suppressed); MR T2* gradient, GRE, MPGR,
   SWAN, SWI; MR-DWI Diffusion Weighted; MRA-MR Angiography/Venography;
   MR-Other Pulse Seq.; MR-ADC Map (App Diff Coeff); MR-PDW Proton
   Density; MR-STIR; MR-FIESTA; MR-FLAIR w/Gd; MR-T1W SPGR;
   MR-T2 FLAIR w/Contrast; MR T2* gradient GRE
 – [US]: US-Ultrasound; US-D-Doppler Ultrasound
 – [MA]: Mammograph
 – [GI]: BAS-Barium Swallow; UGI-Upper GI; BE-Barium Enema; SBFT-
   Small Bowel
 – [AG]: AN-Angiogram; Venogram
 – [PT]: NM-Nuclear Medicine; PET-Positron Emission
   Patterns: For each category, we selected question patterns from hundreds of
questions naturally asked and validated by medical students from the VQA-RAD
dataset [13].
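To make the pattern-based generation step concrete, the following minimal sketch
instantiates one question per annotated category. The patterns shown are only
illustrative stand-ins for those drawn from VQA-RAD, and the annotation keys
(modality, plane, organ_system) are assumptions.

# Illustrative pattern-based QA generation (patterns and keys are assumptions).
import random

PATTERNS = {
    "modality":     ["what modality was used to take this image?",
                     "what imaging method was used?"],
    "plane":        ["in what plane is this image taken?",
                     "what is the plane of this image?"],
    "organ_system": ["what organ system is shown in this image?",
                     "what is the organ principally shown in this image?"],
}

def generate_qa_pairs(image_id, annotations, rng=None):
    """Yield (image_id, question, answer) triples, one per annotated category."""
    rng = rng or random.Random(0)
    for category, answer in annotations.items():
        if category in PATTERNS and answer:
            question = rng.choice(PATTERNS[category])
            yield image_id, question, answer.lower()

For example, an image annotated with modality "US-Ultrasound", plane "Longitudinal",
and organ system "Genitourinary" would yield three question-answer pairs, one per
category.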

3.3   Training and Validation Sets
The training set includes 3,200 images and 12,792 question-answer (QA) pairs,
with 3 to 4 questions per image. Table 1 presents the most frequent answers per
category. The validation set includes 500 medical images with 2,000 QA pairs.
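For readers working with the distributed files, the sketch below loads question-answer
pairs and counts answer frequencies (e.g., to build a classification label set). It assumes
a pipe-delimited line format (image_id|question|answer); the actual format should be
checked against the dataset repository before use.

# Minimal loader sketch; the pipe-delimited format is an assumption to verify
# against github.com/abachaa/VQA-Med-2019.
from collections import Counter

def load_qa_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            image_id, question, answer = line.rstrip("\n").split("|", 2)
            pairs.append((image_id, question, answer))
    return pairs

def answer_distribution(pairs):
    """Count answer frequencies across the loaded QA pairs."""
    return Counter(answer for _, _, answer in pairs)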

3.4   Test Set
A medical doctor and a radiologist performed a manual double validation of the
test answers. A total of 33 answers were updated by (i) indicating an optional
part (8 answers), (ii) adding other possible answers (10), or (iii) correcting the
automatic answer (15). The 15 corrected answers correspond to 3% of the test
answers and belong to the following categories: Abnormality (8/125), Organ
System (6/125), and Plane (1/125). For abnormality questions, the correction
mainly consisted of changing the diagnosis inferred from the problem seen in the image.
Category    Most frequent answers (#)
Modality    no (554), yes (552), xr-plain film (456), t2 (217), us-ultrasound (183), t1
            (137), contrast (107), noncontrast (102), ct noncontrast (84), mr-flair
            (84), an-angiogram (78), mr-t2 weighted (56), flair (53), ct w/contrast
            (iv) (50), cta-ct angiograph (45)
Plane       axial (1558), sagittal (478), coronal (389), ap (197), lateral (151), frontal
            (120), pa (92), transverse (76), oblique (50)
Organ       skull and contents (1216), musculoskeletal (436), gastrointestinal (352),
System      lung, mediastinum, pleura (250), spine and contents (234), genitouri-
            nary (214), face, sinuses, and neck (191), vascular and lymphatic (122),
            heart and great vessels (120), breast (65)
Abnormality yes (62), no (48), meningioma (30), glioblastoma multiforme (28), pul-
            monary embolism (16), acute appendicitis (14), arteriovenous malfor-
            mation (avm) (14), arachnoid cyst (13), schwannoma (13), tuberous
            sclerosis (13), brain, cerebral abscess (12), ependymoma (12), fibrous
            dysplasia (12), multiple sclerosis (12), diverticulitis (11), langerhan cell
            histiocytosis (11), sarcoidosis (11)
Table 1: VQA-Med-2019 Training Set: the Most Frequent Answers Per Category



We expect a similar error rate in the training and validation
sets that were generated using the same automatic data creation method. The
test set consists of 500 medical images and 500 questions.


4     Evaluation Methodology

The evaluation of the systems that participated in the VQA-Med 2019 task
was conducted based on two primary metrics: Accuracy and BLEU. We use an
adapted version of the accuracy metric from the general domain VQA6 task
that strictly considers exact matches between a participant-provided answer and the
ground-truth answer. We calculate the overall accuracy scores as well as the
scores for each question category. To compensate for the strictness of the ac-
curacy metric, BLEU [15] is used to capture the word overlap-based similarity
between a system-generated answer and the ground truth answer. The overall
methodology and resources for the BLEU metric are essentially similar to last
year’s task [9].
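As a rough illustration of these two metrics, the sketch below computes exact-match
accuracy and average sentence-level BLEU using NLTK. The normalization shown
(lowercasing and stripping punctuation) is an assumption and may differ from the
official evaluation script.

# Simplified metric sketch; normalization details are assumptions.
import string
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def normalize(answer):
    table = str.maketrans("", "", string.punctuation)
    return answer.lower().translate(table).strip()

def strict_accuracy(predictions, references):
    """Exact-match accuracy after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

def mean_bleu(predictions, references):
    """Average sentence-level BLEU between each prediction and its reference."""
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([normalize(r).split()], normalize(p).split(),
                            smoothing_function=smooth)
              for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

The category-wise accuracy reported below can be obtained by applying the same
strict accuracy separately to the subset of questions in each category.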


5     Results and Discussion

Out of 104 online registrations, 61 participants submitted signed end-user agree-
ment forms. Finally, 17 groups submitted a total of 90 runs, indicating a notable
interest in the VQA-Med 2019 task. Figure 2 presents the results of the 17
participating teams. The best overall result was obtained by the Hanlin team,
6
    https://visualqa.org/evaluation.html
achieving an accuracy of 0.624 and a BLEU score of 0.644. Table 2 gives an overview
of all participants and the number of submitted runs7. The overall results of the
participating systems are presented in Tables 3 and 4 for the two metrics, in
descending order of the scores (the higher the better). Detailed results of each
run are described in the ImageCLEF 2019 lab overview paper [11].




                   Fig. 2: Results of VQA-Med 2019 on crowdAI



                    Table 3: VQA-Med 2019: Accuracy scores

     Team           Run ID Modality Plane Organ Abnormality Overall
     Hanlin          26889  0.202   0.192 0.184    0.046     0.624
     yan             26853  0.202   0.192 0.184    0.042     0.620
     minhvu          26881  0.210   0.194 0.190    0.022     0.616
     TUA1            26822  0.186   0.204 0.198    0.018     0.606
     UMMS            27306  0.168   0.190 0.184    0.024     0.566
     AIOZ            26873  0.182   0.180 0.182    0.020     0.564
     IBM Research AI 27199  0.160   0.196 0.192    0.010     0.558
     LIST            26908  0.180   0.184 0.178    0.014     0.556
     Turner.JCE         26913      0.164    0.176   0.182       0.014       0.536
     JUST19             27142      0.160    0.182   0.176       0.016       0.534
     Team Pwc Med       26941      0.148    0.150   0.168       0.022       0.488
     Techno             27079      0.082    0.184   0.170       0.026       0.462
     deepak.gupta651    27232      0.096    0.140   0.124       0.006       0.366
     ChandanReddy       26884      0.094    0.126   0.064       0.010       0.294
     Dear stranger      26895      0.062    0.140   0.000       0.008       0.210
     abhishekthanki     27307      0.122    0.000   0.028       0.010       0.160
     IITISM@CLEF        26905      0.052    0.004   0.026       0.006       0.088
7
    There was a limit of 10 run submissions per team. The table includes only the valid
    runs that were graded (80 out of 90 submissions).

                       Table 4: VQA-Med 2019: BLEU scores

                          Team           Run ID BLEU
                          Hanlin          26889 0.644
                          yan             26853 0.640
                          minhvu          26881 0.634
                          TUA1            26822 0.633
                          UMMS            27306 0.593
                          JUST19          27142 0.591
                          LIST            26908 0.583
                          IBM Research AI 27199 0.582
                          AIOZ            26833 0.579
                          Turner.JCE      26940 0.572
                          Team Pwc Med    26955 0.534
                          Techno          27079 0.486
                          abhishekthanki  26824 0.462
                          Dear stranger   26895 0.393
                          deepak.gupta651 27232 0.389
                          ChandanReddy    26946 0.323
                          IITISM@CLEF     26905 0.096


    Similar to last year, participants mainly used deep learning techniques to build
their VQA-Med systems. In particular, the best-performing systems leveraged deep
convolutional neural networks (CNNs) such as VGGNet [18] or ResNet [10], with a
variety of pooling strategies (e.g., global average pooling) to encode image features,
and transformer-based architectures such as BERT [6] or recurrent neural networks
(RNNs) to extract question features. Various attention mechanisms were then coupled
with pooling strategies such as multimodal factorized bilinear (MFB) pooling or
multimodal factorized high-order (MFH) pooling to combine the multimodal features,
followed by bilinear transformations to predict the possible answers.
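As a deliberately condensed illustration of this kind of pipeline, the following PyTorch
sketch combines a ResNet image encoder, an LSTM question encoder, and an MFB-style
bilinear fusion followed by a classifier over candidate answers. The layer sizes and the
choice of ResNet-152 with a single-layer LSTM are illustrative assumptions and do not
reproduce any particular participating system.

# Condensed sketch of a CNN + question encoder + MFB-style fusion VQA model.
import torch
import torch.nn as nn
from torchvision import models

class SimpleMedVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, q_dim=512, factor_dim=1024, k=5):
        super().__init__()
        cnn = models.resnet152(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])   # globally pooled image features
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, q_dim, batch_first=True)
        # MFB: project both modalities to k * factor_dim, multiply, then sum-pool over k.
        self.img_proj = nn.Linear(2048, factor_dim * k)
        self.q_proj = nn.Linear(q_dim, factor_dim * k)
        self.k, self.factor_dim = k, factor_dim
        self.classifier = nn.Linear(factor_dim, num_answers)

    def forward(self, images, questions):
        v = self.cnn(images).flatten(1)                         # (B, 2048)
        _, (h, _) = self.lstm(self.embed(questions))            # h: (1, B, q_dim)
        q = h.squeeze(0)
        fused = self.img_proj(v) * self.q_proj(q)               # element-wise bilinear interaction
        fused = fused.view(-1, self.k, self.factor_dim).sum(1)  # sum pooling over k factors
        fused = torch.sqrt(torch.relu(fused)) - torch.sqrt(torch.relu(-fused))  # power norm
        fused = nn.functional.normalize(fused)                  # L2 normalization
        return self.classifier(fused)

For open-ended abnormality answers, a generation decoder can replace the final
classifier, as in the hybrid classification/generation approaches mentioned below.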
    Analyses of the question category-wise8 accuracy in Table 3 suggest that, in general,
participating systems performed well on modality questions, followed by plane and
organ questions, because the set of possible answers for each of these categories was
finite. However, for abnormality questions, systems did not perform well in terms of
accuracy because of the underlying complexity of open-ended questions and possibly
due to the strictness of the accuracy metric.
8
    Note that the question category-wise accuracy scores are normalized (each divided
    by a factor of 4, since each category accounts for 125 of the 500 test questions) so
    that they sum to the overall accuracy, e.g., for Hanlin: 0.202 + 0.192 + 0.184 +
    0.046 = 0.624.
            Table 2: Participating groups in the VQA-Med 2019 task.
Team                 Institution                                                                        # Runs
abhishekthanki [20]  Manipal Institute of Technology (India)                                              8
AIOZ                 AIOZ Pte Ltd (Singapore)                                                             6
ChandanReddy         Virginia Tech (USA)                                                                  4
Dear stranger [14]   School of Information Science and Engineering, Kunming (China)                        6
deepak.gupta651      Indian Institute of Technology Patna (India)                                         1
Hanlin               Zhejiang University (China)                                                          5
IBM Research AI [12] IBM Research, Almaden (USA)                                                          4
IITISM@CLEF          Indian Institute of Technology Dhanbad (India)                                        3
JUST19 [1]           Jordan University of Science and Technology (Jordan) & University of Manchester (UK)   4
LIST [2]             Faculty of Sciences and Technologies, Tangier (Morocco)                               7
minhvu [21]          Umeå University (Sweden) & University of Bern (Switzerland)                         10
Team Pwc Med [16] PricewaterhouseCoopers US Advisory (India)                                                 5
Techno [5]           Faculty of Technology Tlemcen (Algeria)                                              2
TUA1 [24]            Tokushima University (Japan)                                                          1
Turner.JCE [19]      Azrieli College of Engineering Jerusalem (Israel)                                    10
UMMS [17]            Worcester Polytechnic Institute & University of Massachusetts Medical School (USA)    3
yan [22]             Zhejiang University (China) & National Institute of Informatics (Japan)              1




To compensate for this strictness, we computed BLEU scores to measure the similarity
between the system-generated answers and the ground-truth answers. The higher
BLEU scores achieved this year (0.644 best BLEU vs. 0.162 in 2018) further confirm
the effectiveness of the proposed deep learning-based models for the VQA task.
Overall, the results obtained this year also reflect the improved quality and robustness
of the provided dataset compared to last year's task.


6     Conclusions
We presented the VQA-Med 2019 task, the new dataset, the participating systems, and
the official results. To ensure that the questions are naturally phrased, we used patterns
from questions asked by medical students to build clinically relevant questions belong-
ing to our four target categories. We created a new dataset for the challenge9 following
goal-oriented guidelines and covering questions with varying degrees of difficulty. A
wide range of approaches was applied, such as transfer learning, multi-task learning,
ensemble methods, and hybrid approaches combining classification models and answer
generation methods. The best team achieved a BLEU score of 0.644 and an overall
accuracy of 0.624. In future editions, we will consider more complex questions that
might include contextual information or require domain-specific inference to reach the
right answer.


Acknowledgments
This work was supported by the intramural research program at the U.S. National
Library of Medicine, National Institutes of Health.
We thank Dr. James G. Smirniotopoulos and Soumya Gayen from the MedPix team
for their support.
9
    www.crowdai.org/clef_tasks/13/task_dataset_files?challenge_id=53
    github.com/abachaa/VQA-Med-2019
References
 1. Al-Sadi, A., Talafha, B., Al-Ayyoub, M., Jararweh, Y., Costen, F.: Just at imageclef
    2019 visual question answering in the medical domain. In: Working Notes of CLEF
    2019 (2019)
 2. Allaouzi, I., Benamrou, B., Ahmed, M.B.: An encoder-decoder model for visual
    question answering in the medical domain. In: Working Notes of CLEF 2019 (2019)
 3. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.:
    VQA: visual question answering. In: 2015 IEEE International Conference on Com-
    puter Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. pp. 2425–2433
    (2015), https://doi.org/10.1109/ICCV.2015.279
 4. Ben Abacha, A., Gayen, S., Lau, J.J., Rajaraman, S., Demner-Fushman, D.:
    NLM at imageclef 2018 visual question answering in the medical domain. In:
    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Fo-
    rum, Avignon, France, September 10-14, 2018. (2018), http://ceur-ws.org/Vol-
    2125/paper_165.pdf
 5. Bounaama, R., Abderrahim, M.A.: Tlemcen university at imageclef 2019 visual
    question answering task. In: Working Notes of CLEF 2019 (2019)
 6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
    tional transformers for language understanding. In: Proceedings of NAACL (2019)
 7. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multi-
    modal compact bilinear pooling for visual question answering and visual grounding.
    In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language
    Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. pp. 457–468
    (2016), http://aclweb.org/anthology/D/D16/D16-1044.pdf
 8. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in
    VQA matter: Elevating the role of image understanding in visual question an-
    swering. In: 2017 IEEE Conference on Computer Vision and Pattern Recogni-
    tion, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 6325–6334 (2017),
    https://doi.org/10.1109/CVPR.2017.670
 9. Hasan, S.A., Ling, Y., Farri, O., Liu, J., Müller, H., Lungren, M.: Overview of
    imageclef 2018 medical domain visual question answering task. In: Working Notes
    of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France,
    September 10-14, 2018. (2018), http://ceur-ws.org/Vol-2125/paper_212.pdf
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recog-
    nition. In: 2016 IEEE Conference on Computer Vision and Pattern Recogni-
    tion, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 770–778 (2016),
    https://doi.org/10.1109/CVPR.2016.90
11. Ionescu, B., Müller, H., Péteri, R., Cid, Y.D., Liauchuk, V., Kovalev, V., Klimuk,
    D., Tarasau, A., Ben Abacha, A., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman,
    D., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C.,
    Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Garcia, N., Kavallieratou, E., del
    Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K., Chamberlain,
    J., Clark, A., Campello, A.: ImageCLEF 2019: Multimedia retrieval in medicine,
    lifelogging, security and nature. In: Experimental IR Meets Multilinguality, Mul-
    timodality, and Interaction. Proceedings of the 10th International Conference of
    the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science,
    Springer, Lugano, Switzerland (September 9-12 2019)
12. Kornuta, T., Rajan, D., Shivade, C., Asseman, A., Ozcan, A.: Leveraging medical
    visual question answering with supporting facts. In: Working Notes of CLEF 2019
    (2019)
13. Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically
    generated visual questions and answers about radiology images. Scientific Data
    5(180251) (2018), https://www.nature.com/articles/sdata2018251
14. Liu, S., Ou, X., Che, J.: Vqa-med: An xception-gru model. In: Working Notes of
    CLEF 2019 (2019)
15. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic
    evaluation of machine translation. In: Proceedings of the 40th annual meeting on
    association for computational linguistics. pp. 311–318. Association for Computa-
    tional Linguistics (2002)
16. Shah, R., Gadgil, T., Bansal, M., Verma, P.: Medical visual question answering at
    imageclef 2019- vqa med. In: Working Notes of CLEF 2019 (2019)
17. Shi, L., Liu, F., Rosen, M.P.: Deep multimodal learning for medical visual question
    answering. In: Working Notes of CLEF 2019 (2019)
18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
    image recognition. In: 3rd International Conference on Learning Representations,
    ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings
    (2015), http://arxiv.org/abs/1409.1556
19. Spanier, A., Turner, A.: Lstm in vqa-med, is it really needed? validation study on
    the imageclef 2019 dataset. In: Working Notes of CLEF 2019 (2019)
20. Thanki, A., Makkithaya, K.: Mit manipal at imageclef 2019 visual question an-
    swering in medical domain. In: Working Notes of CLEF 2019 (2019)
21. Vu, M., Sznitman, R., Nyholm, T., Löfstedt, T.: Ensemble of streamlined bilinear
    visual question answering models for the imageclef 2019 challenge in the medical
    domain. In: Working Notes of CLEF 2019 (2019)
22. Yan, X., Li, L., Xie, C., Xiao, J., Gu, L.: Zhejiang university at imageclef 2019
    visual question answering in the medical domain. In: Working Notes of CLEF
    2019 (2019)
23. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for
    image question answering. In: 2016 IEEE Conference on Computer Vision and
    Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp.
    21–29 (2016), https://doi.org/10.1109/CVPR.2016.10
24. Zhou, Y., Kang, X., Ren, F.: Tua1 at imageclef 2019 vqa-med: A classification
    and generation model based on transfer learning. In: Working Notes of CLEF 2019
    (2019)