=Paper=
{{Paper
|id=Vol-2936/paper-87
|storemode=property
|title=Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-87.pdf
|volume=Vol-2936
|authors=Asma Ben Abacha,Mourad Sarrouti,Dina Demner-Fushman,Sadid A. Hasan,Henning Müller
|dblpUrl=https://dblp.org/rec/conf/clef/AbachaSDHM21
}}
==Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain==
Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain

Asma Ben Abacha (1), Mourad Sarrouti (1), Dina Demner-Fushman (1), Sadid A. Hasan (2), Henning Müller (3)
(1) National Library of Medicine, USA
(2) CVS Health, USA
(3) University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania.
Emails: asma.benabacha@nih.gov (A. Ben Abacha), mourad.sarrouti@nih.gov (M. Sarrouti), ddemner@mail.nih.gov (D. Demner-Fushman), sadidhasan@gmail.com (S. A. Hasan), henning.mueller@hevs.ch (H. Müller).
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
This paper presents an overview of the fourth edition of the Medical Visual Question Answering (VQA-Med) task at ImageCLEF 2021. VQA-Med 2021 includes a task on Visual Question Answering (VQA), where participants are tasked with answering questions based on the visual content of radiology images, and a second task on Visual Question Generation (VQG), consisting of generating relevant questions about radiology images. Thirteen teams participated in VQA-Med 2021 and submitted a total of 75 runs. The best teams achieved a BLEU score of 0.416 in the VQA task and 0.383 in the VQG task.

Keywords: Visual Question Answering, Visual Question Generation, Data Creation, Radiology Images

1. Introduction

Visual Question Answering is a challenging and promising problem that combines natural language processing (NLP) and computer vision (CV) techniques. With the increasing interest in artificial intelligence (AI) technologies to support clinical decision making and improve patient engagement, opportunities to generate and leverage algorithms for automated medical image interpretation are being explored at a faster pace. To offer more training data and evaluation benchmarks, we organized the first visual question answering (VQA) task in the medical domain in 2018 [1], and continued the task in 2019 [2] and 2020 [3]. Following the strong engagement from the research community in the previous editions of VQA in the medical domain (VQA-Med), we continued the task this year within the scope of ImageCLEF 2021 [4], with a focus on answering questions about abnormalities in radiology images. In this edition, we also organized a second task on visual question generation (VQG), consisting of generating relevant natural language questions about radiology images based on their visual content (https://www.imageclef.org/2021/medical/vqa).

2. Task Description

The two VQA-Med tasks can be described more precisely as follows:

• Visual question answering (VQA): given a radiology image accompanied by a relevant question, participating systems in VQA-Med 2021 were tasked with answering the question based on the visual image content (https://www.aicrowd.com/challenges/imageclef-2021-vqa-med-vqa).
• Visual question generation (VQG): given a radiology image, participating systems were tasked with generating relevant natural language questions about the abnormality present in the image (https://www.aicrowd.com/challenges/imageclef-2021-vqa-med-vqg).

3. Data Creation

3.1. VQA Data

For the visual question answering task, we automatically constructed the training, validation, and test sets by: (i) applying several filters to select relevant images and associated annotations, and (ii) creating patterns to generate the questions and their answers.
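The exact question patterns and answer sources are not detailed here, so the following Python sketch only illustrates the general idea of step (ii), pattern-based question-answer generation, assuming hypothetical case-record fields (image_id, modality, diagnosis) and made-up question templates.

```python
import random

# Hypothetical abnormality-question templates (illustrative only; the actual
# patterns used to build the VQA-Med 2021 data are not reproduced here).
QUESTION_TEMPLATES = [
    "what is abnormal in the {modality}?",
    "what is the primary abnormality in this image?",
    "what abnormality is seen in the image?",
]

def generate_qa_pairs(cases):
    """Turn filtered case records into (image, question, answer) triples.

    `cases` is assumed to be a list of dicts with keys `image_id`,
    `modality`, and `diagnosis` (an assumption made for this illustration).
    """
    qa_pairs = []
    for case in cases:
        question = random.choice(QUESTION_TEMPLATES).format(
            modality=case.get("modality", "image"))
        # The reference answer is taken from the case diagnosis.
        qa_pairs.append((case["image_id"], question, case["diagnosis"].lower()))
    return qa_pairs

# Example usage with a toy case record
print(generate_qa_pairs([{"image_id": "synpic1234",
                          "modality": "x-ray",
                          "diagnosis": "Enchondroma"}]))
```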
We selected relevant medical images from the MedPix database (https://medpix.nlm.nih.gov/) with filters based on their captions, localities, and diagnosis methods. We selected only the cases where the diagnosis was made based on the image. Finally, we considered the most frequent abnormality question categories to create the data set, which included a training set of 4,500 radiology images with 4,500 question-answer (QA) pairs (the same dataset used in 2020), a new validation set of 500 radiology images with 500 QA pairs, and a new test set of 500 radiology images with 500 questions about abnormality. To further ensure the quality of the data, the reference answers of the test set were manually validated by a medical doctor. Figure 1 presents examples from the VQA 2021 test set. The participants were also encouraged to use the VQA-Med 2019 and 2020 datasets as additional training data.

Figure 1: Examples from the test set of the VQA 2021 task.
(a) Q: What is the primary abnormality in this image? A: Large heterogeneously enhancing right hepatic mass with mass effect on left lobe consistent with hepatoblastoma
(b) Q: What is abnormal in the x-ray? A: enchondroma | Lytic lesion with chondroid matrix of the proximal metadiaphysis of the humerus
(c) Q: What is most alarming about this mri? A: focal nodular hyperplasia
(d) Q: What abnormality is seen in the image? A: Enhancing lesion right parietal lobe with surrounding edema

3.2. VQG Data

For the visual question generation task, we constructed the validation and test sets semi-automatically. First, we generated questions automatically from the images and their captions using two different approaches. The first approach used only the image and a variational autoencoder model called VQGR [5] trained on the VQA-RAD dataset [6], with a CNN to encode the images and an LSTM to decode the questions. The second approach used a T5-based model fine-tuned on the SQuAD and MS MARCO datasets to generate questions from the image captions. Then, a medical doctor curated the list of automatically created questions. The final curated corpus for the VQG task comprised 85 radiology images with 200 questions for validation and 100 radiology images with 302 reference questions for the test set. Figure 2 presents examples from the VQG 2021 test set.
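As a rough illustration of the caption-based approach, the snippet below generates candidate questions from an image caption with a T5 model via the Hugging Face transformers library. The checkpoint name and the "generate question:" prefix are placeholders; the actual model was fine-tuned on SQuAD and MS MARCO, and that fine-tuning is assumed to have been done beforehand.

```python
# Sketch of caption-to-question generation with a T5 model (assumptions:
# "t5-base" stands in for the SQuAD/MS MARCO fine-tuned checkpoint actually
# used, and the "generate question:" prefix is an illustrative prompt format).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def generate_questions(caption: str, num_questions: int = 5):
    """Generate candidate questions from a radiology image caption."""
    inputs = tokenizer("generate question: " + caption, return_tensors="pt",
                       truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=10,
        num_return_sequences=num_questions,  # several candidates for manual curation
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

caption = ("Axial CT shows a lytic lesion with chondroid matrix "
           "in the proximal humerus.")
for q in generate_questions(caption):
    print(q)
```

In the semi-automatic setup described above, such generated candidates would then be reviewed and curated by a medical doctor rather than used directly.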
4. Submitted Runs

Out of 48 online registrations, 33 participants submitted signed end user agreement forms, and 13 teams submitted a total of 75 successful runs, including 68 runs for the VQA task and 7 runs for the VQG task. Table 1 gives an overview of all participating teams and the number of submitted runs (a maximum of 10 runs was allowed per team).

Table 1: Participating groups in the VQA-Med 2021 tasks.

Team | Institution | # Valid Runs
Yunnan [7] | Yunnan University (China) | 10
SYSU-HCP [8] | School of Computer Science and Engineering, Sun Yat-sen University (China) | 10
TAM [9] | South China Normal University (China) | 10
TeamS [10] | D4L data4life gGmbH & Hasso Plattner Institute (Germany) | 10
jeanbenoit_delbrouck | Stanford University (USA) | 10
sheerin [11] | Siva Subramaniya Nadar College of Engineering (India) | 5
IALab_PUC [12] | IALab group of the Pontifical Catholic University (Chile) | 5
Chabbiimen [13] | REGIM Lab & Higher Institute of Informatics and Communication Technologies (Tunisia) | 5
SSN_hacML | SSN College of Engineering, Chennai (India) | 3
Lijie [14] | School of Information Science and Engineering, Yunnan University (China) | 2
sliencec | SIE of NCU, Nanchang (China) | 2
riven | SEU, Suzhou (China) | 1

Table 2: Maximum accuracy and maximum BLEU scores for the VQA task (out of each team's submitted runs).

Team | Accuracy | BLEU
SYSU-HCP | 0.382 | 0.416
Yunnan University | 0.362 | 0.402
TeamS | 0.348 | 0.391
jeanbenoit_delbrouck | 0.348 | 0.384
riven | 0.332 | 0.361
Lijie | 0.316 | 0.352
IALab_PUC | 0.236 | 0.276
TAM | 0.222 | 0.255
sliencec | 0.220 | 0.235
sheerin | 0.196 | 0.227
SSN_hacML | 0.000 | 0.002
Baseline 1 | 0.288 | 0.326
Baseline 2 | 0.134 | 0.156

Table 3: Maximum average BLEU scores for the VQG task (out of each team's submitted runs).

Team | Average BLEU
Chabbiimen | 0.383
Baseline | 0.274

5. Results

Similar to the evaluation setup of the VQA-Med 2020 challenge [3], the evaluation of the participating systems for the VQA task in VQA-Med 2021 is conducted based on two primary metrics: accuracy and BLEU. We used an adapted version of the accuracy metric from open-domain VQA (https://visualqa.org/evaluation.html) that relies on exact matching between a participant-provided answer and the ground truth answer. To compensate for the strictness of the accuracy metric, BLEU [15] is used to capture the word-overlap-based similarity between a system-generated answer and the ground truth answer. The overall methodology and resources for the BLEU metric are essentially the same as in last year's VQA task. The BLEU metric is also used to evaluate the submissions to the VQG task, computing an overlap-based average similarity score between the system-generated questions and the ground truth questions for each given test image (evaluation code: https://github.com/abachaa/VQA-Med-2021/tree/main/EvaluationCode).

We prepared three baseline systems for the VQA and VQG tasks. Our VQA baselines are based on a multi-class image classification approach using ResNet50 (baseline 1) and a variational autoencoder model (baseline 2) trained on the VQA-Med data [16]. Our VQG baseline system relies on a variational autoencoder model trained on the VQA-RAD and VQA-Med datasets [5]. The overall results of the participating systems and our baselines are presented in Table 2 and Table 3, in descending order of accuracy and average BLEU score, respectively.
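The official evaluation code is available at the GitHub link above; the following is only a minimal sketch of the two VQA metrics as described in this section, using exact string matching for accuracy and NLTK's sentence-level BLEU. The lowercasing-and-split preprocessing is a simplifying assumption; the official script may tokenize and normalize answers differently.

```python
# Minimal sketch of the VQA metrics described above (not the official script):
# accuracy via exact matching and BLEU via NLTK, after simple lowercasing.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate_vqa(predictions, references):
    """predictions/references: lists of answer strings, aligned by test image."""
    smoother = SmoothingFunction().method1  # avoid zero scores on short answers
    exact, bleu_total = 0, 0.0
    for pred, ref in zip(predictions, references):
        pred_tokens = pred.lower().split()
        ref_tokens = ref.lower().split()
        exact += int(pred_tokens == ref_tokens)
        bleu_total += sentence_bleu([ref_tokens], pred_tokens,
                                    smoothing_function=smoother)
    n = len(references)
    return {"accuracy": exact / n, "bleu": bleu_total / n}

# Example: a near-miss answer gets zero accuracy but a non-zero BLEU score.
print(evaluate_vqa(["lytic lesion of the humerus"],
                   ["lytic lesion of the proximal humerus"]))
```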
6. Discussion

The results in Table 2 show that the participating systems performed relatively well on the VQA task in comparison with the VQG results presented in Table 3, suggesting that the VQG task was more challenging. However, the participating systems achieved better BLEU scores than in last year's VQG results [3]. The participants' approaches relied on state-of-the-art deep learning techniques for the VQA and VQG tasks. Most systems used Convolutional Neural Networks (CNNs) such as VGGNet, ResNet, and DenseNet for visual feature extraction. Long short-term memory (LSTM) networks and Transformer-based models (e.g., BERT, BioBERT) were used to extract question features. Several pooling strategies, such as multimodal factorized bilinear (MFB) pooling and multimodal factorized high-order (MFH) pooling, were explored to combine image and question features and generate the answer (e.g., [7, 9]). Participating teams also applied various attention mechanisms and ensemble methods.

For instance, the SYSU-HCP team [8] designed a hierarchical feature extraction structure to capture multi-scale features of radiology images and replaced the fully-connected layers with hierarchical adaptive global average pooling layers. For training, they used three techniques: data augmentation, curriculum learning, and label smoothing. Their final system relied on a multi-architecture ensemble combining the outputs of eight models, achieving the best accuracy (0.382) and BLEU score (0.416).
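As an illustration of this kind of multimodal fusion, the PyTorch sketch below implements a basic MFB-style pooling layer over pre-extracted image and question feature vectors, followed by a classification head over a fixed answer set. The feature dimensions, factor size, and number of answer classes are arbitrary example values, not the settings of any participating system.

```python
# Illustrative MFB-style fusion of image and question features (PyTorch).
# Feature sizes, the factor k, and the number of answer classes are arbitrary
# example values, not those of any VQA-Med 2021 system.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, ques_dim=768, joint_dim=1000, k=5,
                 num_answers=330):
        super().__init__()
        self.k = k
        self.joint_dim = joint_dim
        # Project each modality into the (joint_dim * k) factorized space.
        self.img_proj = nn.Linear(img_dim, joint_dim * k)
        self.ques_proj = nn.Linear(ques_dim, joint_dim * k)
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, img_feat, ques_feat):
        # Element-wise product in the expanded space, then sum-pool over k factors.
        joint = self.img_proj(img_feat) * self.ques_proj(ques_feat)
        joint = joint.view(-1, self.joint_dim, self.k).sum(dim=2)
        # Signed square-root and L2 normalization, as in MFB.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        joint = F.normalize(joint, dim=1)
        return self.classifier(joint)

# Example with random features standing in for CNN and BERT outputs.
model = MFBFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 330])
```

Treating the answer as one of a fixed set of classes, as in this sketch, matches the classification-style formulation used by several teams and by baseline 1, rather than free-form answer generation.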
7. Conclusion

In this paper, we presented the ImageCLEF VQA-Med 2021 tasks and the official results. We created new datasets for the visual question generation and visual question answering tasks, with a more pronounced focus on questions about abnormality. For the VQG task, we explored the use of deep learning and Transformer-based models for semi-automatic question generation from the images and their captions. The VQA-Med task attracted high participation in ImageCLEF 2021. The best VQA team achieved a 0.416 BLEU score and 0.382 accuracy. For the VQG task, the best BLEU score was 0.383, outperforming the results achieved last year. We hope that these VQA and VQG datasets will encourage further research efforts in multimodal architectures and approaches for radiology image understanding.

Acknowledgments

This work was partially supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health.

References

[1] S. A. Hasan, Y. Ling, O. Farri, J. Liu, H. Müller, M. Lungren, Overview of ImageCLEF 2018 medical domain visual question answering task, in: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018.
[2] A. Ben Abacha, S. A. Hasan, V. V. Datla, J. Liu, D. Demner-Fushman, H. Müller, VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019, in: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019.
[3] A. Ben Abacha, V. V. Datla, S. A. Hasan, D. Demner-Fushman, H. Müller, Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain, in: CLEF 2020 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2020.
[4] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, V. Kovalev, S. Kozlovski, V. Liauchuk, Y. Dicente, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.
[5] M. Sarrouti, A. Ben Abacha, D. Demner-Fushman, Visual question generation from radiology images, in: Proceedings of the First Workshop on Advances in Language and Vision Research, Association for Computational Linguistics, Online, 2020, pp. 12–18. URL: https://www.aclweb.org/anthology/2020.alvr-1.3.
[6] J. J. Lau, S. Gayen, A. Ben Abacha, D. Demner-Fushman, A dataset of clinically generated visual questions and answers about radiology images, Scientific Data 5 (2018). URL: https://www.nature.com/articles/sdata2018251.
[7] Q. Xiao, X. Zhou, Y. Xiao, K. Zhao, Yunnan University at VQA-Med 2021: Pretrained BioBERT for medical domain visual question answering, in: Working Notes of CLEF 2021, 2021.
[8] H. Gong, R. Huang, G. Chen, G. Li, SYSU-HCP at VQA-Med 2021: A data-centric model with efficient training methodology for medical visual question answering, in: Working Notes of CLEF 2021, 2021.
[9] Y. Li, Z. Yang, T. Hao, TAM at VQA-Med 2021: A hybrid model with feature extraction and fusion for medical visual question answering, in: Working Notes of CLEF 2021, 2021.
[10] S. Eslami, G. de Melo, C. Meinel, TeamS at VQA-Med 2021: BBN-Orchestra for long-tailed medical visual question answering, in: Working Notes of CLEF 2021, 2021.
[11] S. S. N. Mohamed, K. Srinivasan, ImageCLEF 2021: An approach for VQA to solve abnormality related queries using improved datasets, in: Working Notes of CLEF 2021, 2021.
[12] R. Schilling, P. Messina, D. Parra, H. Lobel, PUC Chile team at VQA-Med 2021: Approaching VQA as a classification task via fine-tuning a pretrained CNN, in: Working Notes of CLEF 2021, 2021.
[13] I. Chebbi, G. Feki, C. B. Amar, REGIM Lab at VQA-Med 2021: Visual generation of relevant natural language questions from radiology images for anomaly detection, in: Working Notes of CLEF 2021, 2021.
[14] J. Li, S. Liu, Lijie at ImageCLEFmed VQA-Med 2021: Attention model based on efficient interaction between multimodality, in: Working Notes of CLEF 2021, 2021.
[15] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[16] M. Sarrouti, NLM at VQA-Med 2020: Visual question answering and generation in the medical domain, in: CLEF 2020 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, 2020.

Figure 2: Examples from the test set of the VQG 2021 task.
(a) Q1: What lesion is seen in the mediastinum? Q2: Are there any calcifications in the mediastinal mass? Q3: Where is the hypodensity consistent with necrosis seen? Q4: Are there any enlarged lymph nodes? Q5: Where is an enlarged lymph node located?
(b) Q1: Where are the exophytic lesions located? Q2: What lesions affect the femur and tibia? Q3: Do the lesions involve the knee joint? Q4: Do the lesions demonstrate medullary continuity with the bone of origin? Q5: Is the fibula deformed? Q6: For what disorder are these multiple exostoses diagnostic?
(c) Q1: What causes proptosis of the right eye? Q2: What kind of lesion is present in the right orbit? Q3: Are the optic nerves and muscles involved? Q4: Is the mass homogeneous? Q5: What is the lesion suggestive of?
(d) Q1: Where is the thrombus located? Q2: Where is collateralization demonstrated? Q3: Is the thecal sac effected?