=Paper=
{{Paper
|id=Vol-2936/paper-98
|storemode=property
|title=TeamS at VQA-Med 2021: BBN-Orchestra for Long-tailed Medical Visual Question Answering
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-98.pdf
|volume=Vol-2936
|authors=Sedigheh Eslami,Gerard de Melo,Christoph Meinel
|dblpUrl=https://dblp.org/rec/conf/clef/EslamiMM21
}}
==TeamS at VQA-Med 2021: BBN-Orchestra for Long-tailed Medical Visual Question Answering==
Sedigheh Eslami (1,2), Gerard de Melo (2), Christoph Meinel (2)

(1) D4L data4life gGmbH, Charlottenstraße 109, 14467 Potsdam, Germany
(2) Hasso Plattner Institute, Prof.-Dr.-Helmert-Straße 2-3, 14482 Potsdam, Germany

Abstract

This work describes our (TeamS) participation in the Medical Domain Visual Question Answering challenge (VQA-Med) at ImageCLEF 2021. We translate the VQA problem into long-tailed multi-class image classification for categorizing abnormalities present in medical images. Our proposed BBN-Orchestra is an ensemble of bilateral-branch networks (BBNs) that effectively models the imbalanced, long-tailed image distribution while reducing overfitting to the training and validation data. In the inference phase, BBN-Orchestra employs a voting mechanism to assign the final predicted class. Our proposed method achieved a test accuracy of 34.8% and a BLEU score of 39.1%, ranking 3rd in the competition. Our source code is available at https://github.com/d4l-data4life/BBNOrchestra-for-VQAmed2021.

Keywords: Medical visual question answering, Long-tailed visual recognition, Ensemble learning, Bilateral neural network

1. Introduction

Digitized medical data brings the potential to develop multi-modal tools such as Visual Question Answering (VQA) systems that can assist patients, clinicians, and radiology trainees and thereby expedite patient care. Medical VQA systems are capable of answering questions about a given medical image and thus aid in assessing and interpreting radiology images. Despite this enormous potential, the development of medical VQA systems remains in its infancy due to complications arising from the scarcity of available training data, the distribution of this data, as well as the disparity between the natural language question and the medical image modalities.

Recent progress has been driven by the VQA-RAD [1] and SLAKE [2] datasets, as well as the ImageCLEF initiative, which has been releasing VQA datasets extracted from PubMed Central articles and hosting challenges for developing task-oriented medical VQA systems. Still, due to the wide range of possible answers and the imbalanced distribution of the training data with respect to these answers, it is non-trivial to train a model that performs well on this task. For many of the possible answers, there are only very few example instances in the training data.

In this work, we investigate the effectiveness of deep learning approaches that overcome these challenges and are able to cope with long-tailed medical VQA data. Our BBN-Orchestra approach recasts the VQA task as a long-tailed multi-class image classification problem and learns an ensemble of bilateral-branch networks (BBNs) to better model the imbalanced, long-tailed training data and mitigate overfitting. This paper presents our submissions to the VQA-Med 2021 challenge [3].
We developed an ensemble model using Bilateral-Branch Networks (BBNs) in order to simultaneously learn effective, representative image features and obtain accurate classifiers under the long-tailed class distribution. We further compare our results with single BBNs using different backbone architectures. Our model ranked 3rd in the VQA-Med 2021 challenge, with 34.8% accuracy and a 39.1% BLEU score on the test data.

2. Related Work

Medical visual question answering is a challenging problem due to the diversity of the questions, the diversity of the image data, and the insufficient and imbalanced annotated data. This task has previously been approached by classifying multi-modal features obtained by fusing encoded questions and images. Vu et al. [4] use pre-trained CNN models and Skip-Thought Vectors to encode the image and question, respectively, and combine them via attention mechanisms. Zhan et al. [5] enhance attention-based fusion strategies via a novel question-type-specific conditional reasoning module that further highlights the important segments of the questions. Nguyen et al. [6] propose to use publicly available unlabeled image datasets in an unsupervised fashion with meta-learning in order to enhance image features and thereby overcome data constraints. The winning team of the VQA-Med 2020 challenge [7] first maps similar questions onto unified backbones in order to detect the question type in a rule-based fashion; afterwards, an ensemble multi-task classification network with ResNet, ResNeXt, VGG and MobileNet backbones is applied for image classification. Kovaleva et al. [8] utilize the MIMIC-CXR dataset to create the first publicly available visual dialogue dataset for radiology, which is not only useful for medical VQA, but also draws on the medical history of patients in order to better answer image-based questions.

In the VQA-Med 2021 challenge [3] hosted by ImageCLEF [9], we propose to solve the VQA problem purely by image classification, since the dataset consists entirely of questions sharing a common semantic interpretation, despite being presented in different syntactic forms: "What abnormality is present in the image?". Inspired by the HCP-MIC team's work [10], we adopt Bilateral-Branch Networks for classification in the presence of an imbalanced, long-tailed class distribution.

3. Approach

By exploring the datasets released in the challenge, we observe that the training data includes two general types of questions:

1. "yes/no" questions, asking about the presence of a medical abnormality in an image,
2. "what" questions, asking about the category of abnormality present in an image.

Although these questions appear in different syntactic forms, their semantics fall into the two aforementioned types and can be detected by simple rule-based mechanisms. Furthermore, we notice that the questions in the validation and test sets are only of the "what" type. Therefore, we decided to translate the VQA setting in this specific challenge into a multi-class image classification problem.

Denote by $D_{\mathrm{T}} = \{(x_i, q_i, a_i)\}_{i=1}^{n_{\mathrm{T}}}$ the training dataset for a generic VQA model, where $n_{\mathrm{T}}$ is the number of samples in the training set, and $x$, $q$, $a$ represent the image, question, and answer, respectively. We can enumerate the set of answers and assume $a_i \in \{1, 2, \ldots, C\}$, where $C$ is the total number of candidate answers and is typically large. In this task, since all questions solicit the same kind of information, we relax the VQA problem to learning a function $f$ that maps each $x_i$ to its $a_i$, thereby classifying the abnormality shown in each medical image.
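To make this relaxation concrete, the following is a minimal Python sketch (our illustration, not the authors' released code) of how the answer strings could be enumerated as class indices and how the rule-based "yes"/"no" filter could be applied; the triple format and function names are hypothetical.

 def filter_what_questions(triples):
     """Rule-based filter: drop samples whose answer is "yes" or "no"."""
     return [(x, q, a) for (x, q, a) in triples
             if a.strip().lower() not in {"yes", "no"}]
 
 def build_label_map(triples):
     """Enumerate the distinct remaining answer strings as class indices 0..C-1."""
     answers = sorted({a.strip().lower() for _, _, a in triples})
     return {answer: idx for idx, answer in enumerate(answers)}
 
 def to_classification_samples(triples, label_map):
     """Discard the question and map each answer to its class index."""
     return [(x, label_map[a.strip().lower()]) for (x, _q, a) in triples]

Under this setup, C = len(label_map), which for the combined training data described in Section 4 amounts to 330 distinct answer classes.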
BBN-Orchestra is an ensemble deep learning solution based on Bilateral-Branch Networks (BBNs) [11]. As previous work [10, 11, 12] suggests, BBNs achieve strong results when classifying data with a long-tailed distribution, i.e., when a few classes account for most of the data while most classes have very few samples. A BBN consists of three main components:

1. a conventional branch for effective representation learning,
2. a re-balancing branch that models the tail of the class distribution via reverse sampling,
3. an adaptive cumulative learning component that controls how the attention is shifted between the two former branches over the epochs and trains the classifier by minimizing the training loss. (For further details on BBNs, the reader is referred to the original paper [11].)

In BBN-Orchestra, we ensemble multiple BBNs in order to prevent potential overfitting with regard to the training and validation sets. To achieve this, the training and validation splits are first combined to form $D = D_{\mathrm{T}} \cup D_{\mathrm{V}} = \{(x_i, a_i)\}_{i=1}^{n}$, where $D_{\mathrm{V}}$ is the validation set and $n = n_{\mathrm{T}} + n_{\mathrm{V}}$, since the training and validation sets are disjoint. We train $K$ different BBN models with diverse backbone networks on different random splits of $D$. In the inference phase, a voting mechanism selects the class predicted most frequently by the $K$ trained BBNs as the final label for an unseen sample. The training and inference phases of BBN-Orchestra are summarized in Algorithms 1 and 2, respectively.

 Algorithm 1 - BBN-Orchestra: Train
 Input:  D = {(x_i, a_i)}_{i=1}^{n}, K, backbone_type, criterion, n_epochs
 Output: K trained BBN models
 procedure ORCHESTRATE(D, K, backbone_type, criterion, n_epochs)
     members ← ∅
     for k in {1, ..., K} do
         train_data, val_data ← random_split(D, val_size = 0.1)
         C ← num_classes(D)
         model ← initialize_BBN(C, backbone_type)
         best_result ← 0                          ▷ reset per ensemble member
         for epoch in {1, ..., n_epochs} do
             model ← train_BBN(model, train_data, criterion, epoch)
             val_acc, val_loss ← validate(model, val_data, criterion)
             if val_acc > best_result then
                 best_result ← val_acc
                 best_model ← model
         members ← members ∪ {best_model}
     return members

 Algorithm 2 - BBN-Orchestra: Inference
 Input:  D' = {x'_i}_{i=1}^{n'}, members
 Output: {(x'_i, â_i)}_{i=1}^{n'}
 procedure PREDICT(D', members)
     predictions ← dict of empty lists
     â ← dict()
     for model in members do
         predicted_labels ← predict(model, D')
         for d in D' do
             predictions[d].append(predicted_labels[d])
     for d in D' do
         â[d] ← most_frequent(predictions[d])
     return (D', â)
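The compact Python sketch below mirrors Algorithms 1 and 2 under the assumption that the BBN-specific routines (model construction, one-epoch training, validation, and per-sample prediction) are supplied by the caller; it illustrates the orchestration and voting logic only and is not the authors' released implementation.

 import copy
 import random
 from collections import Counter
 
 def orchestrate(dataset, k_members, build_model, train_one_epoch, validate,
                 n_epochs, val_frac=0.1):
     """Algorithm 1 (sketch): train k BBN members on different random splits of the
     combined data and keep, per member, the checkpoint with the best validation accuracy.
     build_model / train_one_epoch / validate are caller-supplied BBN routines."""
     members = []
     for _ in range(k_members):
         data = list(dataset)
         random.shuffle(data)                      # fresh random train/val split per member
         n_val = int(len(data) * val_frac)
         val_data, train_data = data[:n_val], data[n_val:]
         model = build_model()
         best_acc, best_model = 0.0, model
         for epoch in range(1, n_epochs + 1):
             train_one_epoch(model, train_data, epoch)
             val_acc = validate(model, val_data)
             if val_acc > best_acc:                # keep this member's best checkpoint
                 best_acc, best_model = val_acc, copy.deepcopy(model)
         members.append(best_model)
     return members
 
 def predict_with_voting(samples, members, predict_labels):
     """Algorithm 2 (sketch): majority vote over the members' per-sample predictions."""
     votes = {i: [] for i in range(len(samples))}
     for model in members:
         labels = predict_labels(model, samples)   # one predicted class per sample
         for i, label in enumerate(labels):
             votes[i].append(label)
     return [Counter(v).most_common(1)[0][0] for v in votes.values()]

Each member's best checkpoint is retained via copy.deepcopy, and ties in the vote fall back to the first label returned by Counter.most_common.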
4. Experiments

4.1. Dataset

We conduct our experiments using the datasets released in the VQA-Med 2021 challenge. The train, validation, and test sets include 4,500, 500, and 500 images, respectively, with one question–answer pair per image in all sets. Additionally, we exploited the train, validation, and test datasets from the VQA-Med 2019 challenge [13] to increase the amount of data available for training. Since both the validation and test sets only include "what" questions asking about the type of abnormality in the image, we omit the "yes/no" questions via the simple rule of checking whether the answer is "yes" or "no" and retain only "what" questions. The final training set includes a total of 5,435 training samples with 330 distinct answers (this count includes the VQA-Med 2019 data and refers to the training set before mixing in the VQA-Med 2021 validation set).

4.2. Experimental Setup

4.2.1. Data setup and augmentation

In order to develop the ensemble models, we combine the 5,435 training and 500 validation samples. In each iteration, 10% of the combined data is randomly selected to serve as a validation set and the rest is used for training. Similar to the original BBN experiments [11], we perform random resized cropping to size 224 and random horizontal flipping with probability 0.5 for data augmentation.

 Table 1: Accuracy scores on the validation and test sets
              BBN ResNet 34   BBN ResNeSt 50   BBN Orchestra 1   BBN Orchestra 2   BBN Orchestra 3
 Validation       59.8%           61.3%            54.4%             57.7%             55.9%
 Test             29.9%           30.4%            32.2%             34.8%             32.8%

4.2.2. BBN-Orchestra setup

We evaluated three ensemble models:

1. Orchestra 1: K = 4, with a ResNet34 [14] backbone (in both the conventional learning and re-balancing branches) in all BBNs,
2. Orchestra 2: K = 4, with a ResNeSt50 [15] backbone in all BBNs,
3. Orchestra 3: K = 8, with four ResNet34-based BBNs and four ResNeSt50-based BBNs.

All backbones are trained end-to-end from scratch on the medical VQA data described above. The adaptive parameter in BBN that shifts the attention from learning universal features toward modeling the long-tailed class distribution is scheduled according to the epoch number [11]: in early epochs, BBN focuses on learning universal features, while in later epochs it learns to model the tail of the class distribution. Hence, we set the maximum number of epochs relatively high, namely 450, in order to give BBN a better chance to model the tail distribution. For all BBNs, we use the cross-entropy loss. Stochastic gradient descent is used with momentum 0.9 and a weight decay of 0.0004. The initial learning rate is set to 0.1 and decays by a factor of 0.1 at the 150th, 250th, and 300th epochs. All pipelines are implemented with the PyTorch framework [16].
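As a point of reference, the following is a minimal PyTorch sketch of the augmentation and optimization settings described above (random resized crop to 224, horizontal flip with probability 0.5, SGD with momentum 0.9 and weight decay 0.0004, learning rate 0.1 decayed by 0.1 at epochs 150, 250, and 300). The model variable is only a stand-in for a BBN backbone, and the parabolic cumulative-learning schedule is taken from the original BBN paper [11] as an assumption, not a verified detail of the authors' code.

 import torch
 from torchvision import transforms
 
 # Data augmentation, following the original BBN experiments [11].
 train_transform = transforms.Compose([
     transforms.RandomResizedCrop(224),
     transforms.RandomHorizontalFlip(p=0.5),
     transforms.ToTensor(),
 ])
 
 n_epochs = 450
 model = torch.nn.Linear(224 * 224 * 3, 330)   # stand-in for a BBN backbone + classifier
 
 criterion = torch.nn.CrossEntropyLoss()
 optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                             momentum=0.9, weight_decay=0.0004)
 # Decay the learning rate by a factor of 0.1 at epochs 150, 250, and 300.
 scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                  milestones=[150, 250, 300], gamma=0.1)
 
 def cumulative_learning_alpha(epoch, max_epochs=n_epochs):
     """Weight on the conventional branch; this parabolic decay is the schedule
     of the original BBN paper [11], assumed here rather than confirmed from this work."""
     return 1.0 - (epoch / max_epochs) ** 2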
4.3. Results and Insights

The experimental results of our submissions are given in Tables 1 and 2. We report accuracy and BLEU scores, as these are the official evaluation metrics of the VQA-Med 2021 challenge. For the BBN-Orchestra models, the reported validation scores are averaged over the K members, each evaluated on its own random validation split. In contrast, for the single BBNs, the scores are computed on the original validation set released by the challenge.

 Table 2: BLEU scores on the validation and test sets
              BBN ResNet 34   BBN ResNeSt 50   BBN Orchestra 1   BBN Orchestra 2   BBN Orchestra 3
 Validation       63.1%           64.6%            57.3%             61.9%             59.3%
 Test             33.2%           33.8%            35.8%             39.1%             36.6%

The results show that all three orchestrated BBN models achieve better test accuracy than the single BBN networks. The best performance on the test set is achieved by BBN-Orchestra 2. Comparing BBN-Orchestra 2 with a single BBN-ResNeSt 50, and BBN-Orchestra 1 with a single BBN-ResNet 34, we observe that the increase in test accuracy comes with a decrease in validation accuracy. This indicates that the orchestrated models were partially able to mitigate the potential overfitting with respect to the validation set. Furthermore, the ResNeSt backbone performs better than ResNet 34. The reason for this is two-fold: 1. the smaller number of layers in ResNet 34 leads to underfitting, and 2. as shown in an empirical analysis [15], the Split-Attention mechanism of the ResNeSt architecture improves the performance of residual networks, e.g., the mean average precision of Cascade-RCNN improves by 3% when using ResNeSt 50 instead of ResNet 50. By the same reasoning, Orchestra 3 achieves better results than Orchestra 1, since it benefits from ResNeSt, but cannot outperform Orchestra 2, since it also contains ResNet 34 models that underfit the data.

5. Conclusion

This work describes the submissions of our team (TeamS) to the VQA-Med challenge at ImageCLEF 2021. Considering the simplicity of the questions in the challenge datasets, we mapped the medical VQA problem to a multi-class image classification problem and mainly utilized Bilateral-Branch Networks to effectively address the resulting long-tailed abnormality classification task. In order to prevent potential overfitting, we developed BBN-Orchestra, an ensemble version of BBN. Our best submission used BBN-Orchestra with a ResNeSt 50 backbone, which achieved 34.8% accuracy on the test data and ranked 3rd in the competition.

Acknowledgement

We would like to thank Matthias Steinbrecher for his helpful comments and discussions.

References

[1] J. J. Lau, S. Gayen, A. B. Abacha, D. Demner-Fushman, A dataset of clinically generated visual questions and answers about radiology images, Scientific Data 5 (2018) 1–10.
[2] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, X.-M. Wu, SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering, arXiv preprint arXiv:2102.09542 (2021).
[3] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview of the VQA-Med task at ImageCLEF 2021: Visual question answering and generation in the medical domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.
[4] M. H. Vu, T. Löfstedt, T. Nyholm, R. Sznitman, A question-centric model for visual question answering in medical imaging, IEEE Transactions on Medical Imaging 39 (2020) 2856–2868.
[5] L.-M. Zhan, B. Liu, L. Fan, J. Chen, X.-M. Wu, Medical visual question answering via conditional reasoning, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2345–2354.
[6] B. D. Nguyen, T.-T. Do, B. X. Nguyen, T. Do, E. Tjiputra, Q. D. Tran, Overcoming data limitation in medical visual question answering, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 522–530.
[7] Z. Liao, Q. Wu, C. Shen, A. van den Hengel, J. Verjans, AIML at VQA-Med 2020: Knowledge inference via a skeleton-based sentence mapping approach for medical domain visual question answering, CLEF, 2020.
[8] O. Kovaleva, C. Shivade, S. Kashyap, K. Kanjaria, J. Wu, D. Ballah, A. Coy, A. Karargyris, Y. Guo, D. B. Beymer, et al., Towards visual dialog for radiology, in: Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, 2020, pp. 60–69.
[9] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, V. Kovalev, S. Kozlovski, V. Liauchuk, Y. Dicente, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.
[10] G. Chen, H. Gong, G. Li, HCP-MIC at VQA-Med 2020: Effective visual representation for medical visual question answering, CLEF, 2020.
[11] B. Zhou, Q. Cui, X.-S. Wei, Z.-M. Chen, BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9719–9728.
[12] Y. Liang, T. Qian, Recommending accurate and diverse items using bilateral branch network, arXiv preprint arXiv:2101.00781 (2021).
[13] A. B. Abacha, S. A. Hasan, V. V. Datla, J. Liu, D. Demner-Fushman, H. Müller, VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019, in: CLEF (Working Notes), 2019.
[14] L. Lei, H. Zhu, Y. Gong, Q. Cheng, A deep residual networks classification algorithm of fetal heart CT images, in: 2018 IEEE International Conference on Imaging Systems and Techniques (IST), IEEE, 2018, pp. 1–4.
[15] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He, J. Mueller, R. Manmatha, et al., ResNeSt: Split-attention networks, arXiv preprint arXiv:2004.08955 (2020).
[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.