PUC Chile team at VQA-Med 2021: approaching VQA as a classification task via fine-tuning a pretrained CNN

Ricardo Schilling, Pablo Messina, Denis Parra and Hans Löbel
Pontificia Universidad Católica, Chile
reschilling@uc.cl, pamessina@uc.cl, dparra@ing.puc.cl, halobel@ing.puc.cl

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract
This paper describes the submission of the IALab group of the Pontifical Catholic University of Chile to the Medical Domain Visual Question Answering (VQA-Med) task. Our participation was rather simple: we approached the problem as image classification. We took a DenseNet-121 with weights pre-trained on ImageNet and fine-tuned it with the VQA-Med 2020 dataset labels to predict the answer. Different answers were treated as different classes, and the questions were disregarded for simplicity, since essentially all of them ask about abnormalities. With this very simple approach we ranked 7th among 11 teams, with a test set accuracy of 0.236.

1. Introduction
ImageCLEF [1] is an initiative that aims to advance the field of image retrieval (IR) and to enhance evaluation across its various subfields. The initiative takes the form of several challenges, and it is especially aware of the changes in the IR field in recent years, which have brought about tasks requiring different types of data, such as text, images and other features, moving towards multi-modality. ImageCLEF has been running annually since 2003, and since its second edition (2004) medical images have been involved in some tasks, such as medical image retrieval. Since then, new tasks involving medical images have been integrated into the ImageCLEFmedical group of tasks [2], and that is how the medical visual question answering (VQA) task has been taking place since 2018. The goal of this task is as follows: given a medical image accompanied by a clinically relevant question, participating systems are tasked with answering the question based on the visual image content [3]. This task could help patients get a second opinion from an automated system, helping them better understand their conditions and supporting clinical decisions [2].

In this document we describe the participation of our team from the HAIVis group (http://haivis.ing.puc.cl/) within the artificial intelligence laboratory IALab (http://ialab.ing.puc.cl/) at PUC Chile (PUC Chile team) in the VQA-Med task at ImageCLEFmedical 2021 [2]. Our team earned 7th place out of 11 in the challenge, and our best submission was as simple as a DenseNet-121 encoding the visual input for an answer classifier, which did not take the questions into account and obtained a test accuracy of 0.236.

The rest of the paper is structured as follows: Section 2 presents relevant related work, Section 3 describes our dataset analysis, Section 4 provides a description of our methods, and Section 5 describes our results. Lastly, in Section 6 we conclude our article.

2. Related work
We addressed the task of medical VQA by modeling it as a classification task. Previously, there have been other attempts at medical image classification, such as the work of Kumar et al. [4], which used an ensemble of fine-tuned models to classify medical images on the ImageCLEF 2016 medical image public dataset, achieving over 82% accuracy, and the work of Li et al. [5], which addressed a similar problem but worked exclusively on lung images using the ILD dataset, reporting a significant improvement over regular image descriptors. Similar works, such as Ren et al. [6], also use a classification approach for visual question answering, in this case on the COCO-QA dataset, using a CNN together with an LSTM that predicts a single word. Other related approaches include the work of Yan et al. [7], which used a VGG network with BERT [8] encodings for the ImageCLEF 2019 VQA-Med task. Unlike these approaches, we attempted to see how a regular CNN would perform on a more general visual question answering problem, where the set of possible answers is known in advance and we rank them.

3. Dataset
Our dataset consisted of triples of {image, question, answer}, where the question asks about the possible existence of abnormalities in the image, and the answer provides a list of relevant abnormalities plus some possible additional details.

Figure 1: Example of dataset triples

Because in this task every question asks about the abnormalities present in the image, we ignored the questions and treated the problem as classification. In the 2020 dataset [9], there are a total of 3940 medical images with associated answers. After some basic preprocessing, such as simplifying whitespace, removing punctuation and so on, we obtained a total of 325 distinct labels.
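To make the mapping from raw answers to class labels concrete, the following is a minimal sketch of the kind of normalization described above (lowercasing, stripping punctuation and collapsing whitespace). The exact preprocessing used for the submission is not spelled out in detail, so the helper below and the toy answer strings are illustrative assumptions rather than the actual pipeline.

```python
import re

def normalize_answer(answer: str) -> str:
    """Map a raw answer string to a canonical class label:
    lowercase, strip punctuation and simplify whitespace."""
    answer = answer.lower()
    answer = re.sub(r"[^\w\s]", " ", answer)  # remove punctuation
    answer = re.sub(r"\s+", " ", answer)      # simplify whitespace
    return answer.strip()

# Toy answers for illustration; the real triples come from the VQA-Med 2020 data.
raw_answers = ["Pulmonary  embolism.", "pulmonary embolism", "Meningioma"]
labels = sorted({normalize_answer(a) for a in raw_answers})
label_to_index = {label: i for i, label in enumerate(labels)}
print(label_to_index)  # {'meningioma': 0, 'pulmonary embolism': 1}
```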
Figure 2: Distribution of labels in the training set

As Figure 2 shows, the number of images per label varied, with the least frequent label appearing 5 times and the most frequent label up to 80 times. We also tried including data from the 2019 dataset [10]; however, this added over a thousand new labels, for a total of over 1400 distinct labels, while only doubling the number of images.

4. Methods
We tried two distinct approaches and applied each of them to both the normal and the extended dataset. The best results were achieved with our first approach, in which we fine-tuned a DenseNet-121 [11] pre-trained on ImageNet to predict the image label. We treated each of our distinct labels as a different class and tasked the network with predicting the corresponding class index, which we then mapped back to the answer string. To achieve this, we replaced the last layer of the pretrained model with three dense layers: two of size 1024 and a final classification layer with 325 neurons. For this approach, the network performed slightly better using only the smaller dataset. We fine-tuned the network for around 100 epochs, achieving around 30% accuracy and a similar BLEU score on the validation set. We used simple techniques such as early stopping, L2 regularization and dropout to avoid unnecessary overfitting.
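A minimal PyTorch sketch of this setup is shown below, using the torchvision DenseNet-121. The two 1024-unit layers and the 325-way output follow the description above, while the activation functions, dropout rate, optimizer and learning rate are assumptions rather than the exact configuration used in our submission.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 325  # distinct answer labels after preprocessing

# DenseNet-121 pre-trained on ImageNet; torchvision exposes its classifier
# as a single Linear layer with 1024 input features.
model = models.densenet121(pretrained=True)
in_features = model.classifier.in_features  # 1024

# Replace the original classifier with three dense layers as described:
# two hidden layers of size 1024 and a final 325-way classification layer.
model.classifier = nn.Sequential(
    nn.Linear(in_features, 1024),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout against overfitting (rate assumed)
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(1024, NUM_CLASSES),
)

# L2 regularization via weight decay; optimizer and learning rate are assumptions.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()
```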
We also tried using perceptual similarity [12] in an attempt to classify the images accordingly; however, this approach failed to generalize properly and did not manage to beat our initial approach. It consisted of using a simple k-nearest-neighbors classifier to predict the corresponding image label, using an image vector obtained directly from the lpips module.
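Since the exact feature extraction from the lpips package is not described here, the sketch below substitutes pairwise LPIPS distances combined with scikit-learn's precomputed-metric k-nearest-neighbors classifier, which captures the same idea of classifying by perceptual similarity but is not necessarily the configuration we used. The AlexNet backbone, the number of neighbors, and the input tensors are assumptions.

```python
import lpips
import numpy as np
import torch
from sklearn.neighbors import KNeighborsClassifier

# LPIPS perceptual metric (AlexNet backbone assumed); it expects image
# tensors scaled to [-1, 1] with shape (N, 3, H, W).
lpips_fn = lpips.LPIPS(net="alex")

def lpips_distance_matrix(imgs_a: torch.Tensor, imgs_b: torch.Tensor) -> np.ndarray:
    """Pairwise LPIPS distances between two batches of images."""
    dists = np.zeros((len(imgs_a), len(imgs_b)), dtype=np.float32)
    with torch.no_grad():
        for i, a in enumerate(imgs_a):
            for j, b in enumerate(imgs_b):
                dists[i, j] = lpips_fn(a.unsqueeze(0), b.unsqueeze(0)).item()
    return dists

# train_images, train_labels and test_images are assumed to be prepared
# elsewhere (tensors of preprocessed images and integer label indices):
# knn = KNeighborsClassifier(n_neighbors=5, metric="precomputed")
# knn.fit(lpips_distance_matrix(train_images, train_images), train_labels)
# predictions = knn.predict(lpips_distance_matrix(test_images, train_images))
```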
Figure 3: Early results of the evolution of model accuracy for the extended dataset

5. Results
With our simple approach, we achieved 30% accuracy and a similar BLEU score on the 2021 validation set [2]; however, adding the data from 2019 decreased both metrics to around 20%. The model's performance on the test set ended up being slightly lower, at 23% accuracy and a 0.276 BLEU score, which was expected and shows the model's limited ability to generalize. As for the perceptual similarity model, the results were unfortunately very poor, not even reaching 5% accuracy. This was expected, however, as it was only an initial attempt, even though the technique has proven useful in similar tasks.

6. Conclusion
Although our approach did not attain high accuracy, we believe it is simple enough to serve as the basis for a more complex model that includes actual text generation for better generalization. In the future, we plan to also include the question as input alongside the image to output the generated answer. We strongly believe that this task can be distilled into a simpler classification or captioning problem; however, we failed to build a model that could accurately determine the exact description of the input image. One reason we believe this is the case is that our model does not currently generate new text, but it could serve as the base of a bigger, more complex model.

Acknowledgments
This work was partially funded by ANID - Millennium Science Initiative Program - Code ICN17_002 and by ANID, FONDECYT grant 1191791.

References
[1] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, V. Kovalev, S. Kozlovski, V. Liauchuk, Y. Dicente, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Stefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.
[2] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview of the VQA-Med task at ImageCLEF 2021: Visual question answering and generation in the medical domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.
[3] S. A. Hasan, Y. Ling, O. Farri, J. Liu, H. Müller, M. P. Lungren, Overview of ImageCLEF 2018 medical domain visual question answering task, in: CLEF (Working Notes), 2018.
[4] A. Kumar, J. Kim, D. Lyndon, M. Fulham, D. Feng, An ensemble of fine-tuned convolutional neural networks for medical image classification, IEEE Journal of Biomedical and Health Informatics 21 (2017) 31–40. doi:10.1109/JBHI.2016.2635663.
[5] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, M. Chen, Medical image classification with convolutional neural network, in: 2014 13th International Conference on Control Automation Robotics & Vision (ICARCV), 2014, pp. 844–848. doi:10.1109/ICARCV.2014.7064414.
[6] M. Ren, R. Kiros, R. Zemel, Exploring models and data for image question answering, 2015. arXiv:1505.02074.
[7] X. Yan, L. Li, C. Xie, J. Xiao, L. Gu, Zhejiang University at ImageCLEF 2019 visual question answering in the medical domain, in: CLEF (Working Notes), 2019.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[9] A. Ben Abacha, V. V. Datla, S. A. Hasan, D. Demner-Fushman, H. Müller, Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain, in: CLEF 2020 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2020.
[10] A. Ben Abacha, S. A. Hasan, V. V. Datla, J. Liu, D. Demner-Fushman, H. Müller, VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019, in: CLEF 2019 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland, 2019.
[11] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, 2018. arXiv:1608.06993.
[12] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, 2018. arXiv:1801.03924.