=Paper=
{{Paper
|id=Vol-2696/paper_78
|storemode=property
|title=AIML at VQA-Med 2020: Knowledge Inference via a Skeleton-based Sentence Mapping Approach for Medical Domain Visual Question Answering
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_78.pdf
|volume=Vol-2696
|authors=Zhibin Liao,Qi Wu,Chunhua Shen,Anton van den Hengel,Johan Verjans
|dblpUrl=https://dblp.org/rec/conf/clef/LiaoWSHV20
}}
==AIML at VQA-Med 2020: Knowledge Inference via a Skeleton-based Sentence Mapping Approach for Medical Domain Visual Question Answering==
Zhibin Liao1,2, Qi Wu1, Chunhua Shen1, Anton van den Hengel1, and Johan Verjans1,2

1 Australian Institute for Machine Learning, University of Adelaide, Australia
2 South Australian Health and Medical Research Institute, Adelaide, Australia

A. van den Hengel and J. Verjans – Joint senior authorship.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. In this paper, we describe our contribution to the 2020 ImageCLEF Medical Domain Visual Question Answering (VQA-Med) challenge. Our submissions scored first place on the VQA challenge leaderboard, as well as first place on the associated Visual Question Generation (VQG) challenge leaderboard. Our VQA approach was developed using a knowledge inference methodology called Skeleton-based Sentence Mapping (SSM). Using all the questions and answers, we derived a set of classifiable tasks and inferred the corresponding labels. As a result, we were able to transform the VQA task into a multi-task image classification problem, which allowed us to focus on the image modelling aspect. We further propose a class-wise and task-wise normalization that facilitates the optimization of multiple tasks in a single network. This enabled us to apply a multi-scale and multi-architecture ensemble strategy for robust prediction. Lastly, we positioned the VQG task as a transfer learning problem using the models trained on the VQA task. The VQG task was also solved as a classification problem.

Keywords: Visual Question Answering · Visual Question Generation · Knowledge Inference · Deep Neural Networks · Skeleton-based Sentence Mapping · Class-wise and Task-wise Normalization

1 Introduction

Visual question answering (VQA) [4, 20] is a challenging task which requires a broad knowledge of image processing, natural language processing (NLP), and multi-modal learning. In the medical domain, VQA is an attractive topic showing great potential in automated medical image interpretation and machine-supported diagnoses, with potential to benefit both medical practitioners and patients. Nevertheless, medical VQA remains an unsolved problem. The ImageCLEF association [15] has been hosting the Medical Domain VQA (VQA-Med) challenges for three consecutive years since 2018 [2, 5, 10]. In the 2018 challenge, the images were extracted from PubMed Central articles, with the questions and answers automatically generated from image captions before being checked manually by human annotators. In addition to the clarity issues of the machine-generated questions reported by [1], it is also noticeable that both the questions and ground-truth answers are variable-length and free-form, both of which add difficulty to the answer generation task. The 2019 challenge [2] advanced over the previous challenge by narrowing the task scope: 1) using only radiology images; and 2) asking questions on four topics (i.e., image modality, imaging plane, visualized organ systems, and abnormality detectable from an image). As noticed by many participating teams, the 2019 challenge is solvable in a classification manner, i.e., there are 36 unique answers for the modality questions, 16 for the plane questions, and 10 for the organ questions, with the exception of over a thousand possible answers for the abnormality category.
A post-challenge, question category-wise accuracy analysis [2] suggests that the modality, plane, and organ categories achieve much better accuracy than the abnormality category.

The 2020 VQA challenge, in which our AIML team participated, was curated with questions from the abnormality category only [5]. While analyzing the questions, we found that they come in two major forms: 1) yes/no questions, e.g., "is this image normal/abnormal?", and 2) wh-questions, e.g., "what is abnormal in the image?". In comparison to the previous year's challenge, the number of unique question phrasings was reduced from 253 to 52 and the number of unique answer phrasings from 1,749 to 332, while there was a 25% increase in images (from 3,200 to 4,000 in the training set; the validation and test sets are of equal size), resulting in much richer data support for the VQA task.

Our initial attempt at the 2020 VQA-Med challenge was to fine-tune the Pythia [27] model. However, this did not yield a desirable performance, so we conducted an analysis of the predicted answers. The analysis led to the development of a novel knowledge inference method, namely Skeleton-based Sentence Mapping (SSM), that helped reverse engineer a set of question backbones. SSM helped us to determine the question categories and infer the corresponding labels, reducing the VQA problem to a pure multi-task image classification problem. As a result, we were able to focus on the image modelling aspect. In particular, we developed a class-wise and task-wise normalization method to give balanced weighting to the classes and tasks present in a mini-batch, which helps to jointly optimize multiple tasks in a single network. Finally, we applied multi-scale and multi-architecture ensemble learning. Our best submission scored 0.496 in accuracy and 0.542 in BLEU, which won first place in the 2020 VQA challenge.

For the associated Medical Domain Visual Question Generation (VQG-Med) challenge, we treated the task as a transfer learning problem, applying the models trained on the VQA-Med data as non-trainable feature extractors. The question generation is also formulated as a classification task. Our best submission scored 0.348 in BLEU, which won first place in the VQG challenge.

In the rest of the paper, we describe our VQA and VQG approaches. Each approach is presented in a self-contained section to avoid cluttering.

2 VQA-Med Challenge Participation

2.1 Literature Review

We first introduce general domain VQA methods, followed by an introduction to the methods that have been applied specifically to medical domain VQA.

General domain VQA: the goal of a VQA method is to produce an answer for a given image-question pair. Early VQA works [4, 9, 20, 24] used a general CNN-RNN framework. In brief, the CNN-RNN approach uses a Convolutional Neural Network (CNN) model (e.g., VGG-Net [26]) to process the input image and a Recurrent Neural Network (RNN) Encoder-Decoder [7] (more specifically, an LSTM [12]) to handle the language modelling. While the vision and language information fusion can also be handled by the RNN language model itself, or simply by concatenation, there are more advanced options such as Multi-modal Factorized Bilinear (MFB) pooling and High-order pooling (MFH) [38] and MUTAN [6]. Attention is also a frequently visited topic in VQA, e.g., question-guided visual attention methods [35, 37] and vision-language co-attention methods [19, 38].
Finally, semantic image representations (e.g., attribute-based image representation [31]), pretrained language representations (e.g., BERT [8]), external knowledge, and common sense knowledge [32] could all be beneficial towards solving VQA.

Medical domain VQA: a noticeable difference between medical domain and general domain VQA is the size of the dataset. General domain VQA can accumulate a sizable dataset because common-sense knowledge is sufficient for generating questions and answers. In contrast, the necessity of clinical expertise makes data collection for medical domain VQA far more difficult.

In the 2018 VQA-Med challenge, the three leading participating teams [1, 23, 39] differed in image modelling (i.e., ResNet-152 [11], Inception-ResNet-v2 [28], VGG-16), language modelling (i.e., LSTM, Bi-LSTM), vision-language fusion (i.e., MFB/MFH [38], SAN [37]), attention models (i.e., question-guided attention [35], co-attention [38]), and word embeddings (i.e., word2vec [21] or embeddings pretrained on medical articles [23]). (Footnote 3: the 2018 VQA-Med challenge employed three measures: BLEU [22], Word-based Semantic Similarity (WBSS) [33], and Concept-based Semantic Similarity (CBSS). The leading teams here refer to the BLEU and WBSS [33] rankings; CBSS can result in a different ranking.) Considering the component-wise diversity and the minor performance gaps, it is difficult to determine which component is favourable. However, we note that these three teams all treated the VQA task as a classification problem, whereas the remaining two teams treated the problem as a generation task [29] or as a classification task without fine-tuning the image model [3].

In the 2019 VQA-Med challenge, the top three teams [30, 36, 40] (with a working notes paper) all used BERT [8] for language processing. Beyond that, we point to some of the unique techniques of the top three teams. The winning team Hanlin [36] adopted Global Average Pooling (GAP) [18] shortcuts. This differs from the conventional use of GAP, which connects the last convolution layer to the classification layer: the Hanlin team placed multiple GAPs, each linked to a low-level convolution layer, and forwarded the pooled low-level features to be concatenated with the final image representation. The second-place team minhvu [30] adopted an ensemble learning approach over a variation of VQA components. The third-place team TUA1 [40] used a question classifier to determine the question category and then chose answers from a set of modality, plane, and organ classifiers, plus a generative model for abnormality answers. Note that the question classification strategy was also employed by several other participating teams; we therefore speculate that the use of BERT could have been the distinguishing factor behind the noticeable gap of 0.04 (in both accuracy and BLEU) between the third place [40] and fourth place [25] (who also used question classification and per-category answer models).

2.2 Dataset

The VQA-Med 2020 dataset is composed of 4,000 radiology images for training, 500 for validation, and 500 for testing. Each image has exactly one Question-Answer (QA) pair from the abnormality question category. We followed the official suggestion to use the VQA-Med 2019 dataset (https://github.com/abachaa/VQA-Med-2019) as additional training data. The VQA-Med 2019 dataset has 3,200 medical images for training, 500 for validation, and another 500 for testing.
The training and validation sets have 12,792 and 2,000 QA pairs respectively, giving most images exactly one QA pair in each question category (i.e., imaging modality, imaging plane, organ systems, and abnormality). For the test set, each question category has 125 images. In addition, yes-no questions appear only in the imaging modality and abnormality question categories.

2.3 Skeleton-based Sentence Mapping

As mentioned in Sec. 1, Pythia [27] was our initial attempt, from which we observed that a proportion of the yes-no questions were answered with categorical abnormality answers and vice versa. This could be a sign of insufficient question variation. To address this issue, we tried to develop a question generator to populate training questions while keeping their meaning unchanged. The Skeleton-based Sentence Mapping (SSM) method was developed to summarize questions with similar sentence structures into a unified backbone. An example of the derived sentence backbones is shown in Table 1. Taking the question backbone "is ${this alts} ${ct alts} ${normal alts}?" as an example, we call the swappable parts the skeleton variables and write them in the Shell variable style "${...}". An example can be found in Table 2. (Footnote 5: the naming was determined by choosing a representative candidate for each skeleton variable; by ignoring the "alts" suffix, a question backbone becomes readable.)

Table 1. An example of the question backbones derived from the VQA-Med 2019 and 2020 datasets. The numbers in parentheses are the question instance counts across the VQA-Med 2020 (train, val, test) and VQA-Med 2019 (train, val, test) sets.

Question backbone: is ${this alts} ${ct alts} ${normal alts}?
  is the ct scan normal? (1 3 1)
  is the mri normal? (3 2 1 6)
  is the ultrasound normal? (1 1 1)
  is the x-ray normal? (2 4 1 3)
Question backbone: what abnormality is ${imaged alts} in ${this alts} ${ct alts}?
  what abnormality is seen in the image? (1001 105 127 776 133 20)
  what abnormality is seen in this x-ray? (2)
Question backbone: what ${is being alts} ${imaged alts} in ${this alts} ${ct alts}?
  what is seen in the image? (1)
  what is seen in the x-ray? (2)
  what is seen in this ct scan? (1)
  what is shown in the x-ray? (1)

Table 2. Corresponding candidates for the skeleton variables appearing in Table 1. The candidate elements were extracted from the real VQA-Med 2019 and 2020 dataset questions and supplemented with improvised ones.

  ${this alts}: this, the
  ${ct alts}: ct, ct scan, mri, pet, x-ray, image, . . .
  ${normal alts}: normal, abnormal
  ${imaged alts}: imaged, displayed, seen, shown, . . .
  ${is being alts}: is, is being

Before applying SSM, we first removed duplicate questions from the dataset, resulting in 266 unique questions. We then applied a word-level edit distance (i.e., Levenshtein distance) to pairs of questions, finding groups of questions within 1-distance and 2-distance. For example, in Table 1, the corresponding questions of each question backbone mostly lie within 1- or 2-distance of each other inside their group, and the largest distance of 4 is between "what is shown in the x-ray?" and "what is seen in this ct scan?". The grouped questions were manually checked to see whether the dissimilar parts could be described by a unified skeleton variable. If so, the generated backbone would replace the group of questions and enter the next iteration of edit distance computation; a minimal sketch of this grouping step is given below.
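The grouping step can be illustrated as follows. This is a minimal sketch, assuming a plain word-level Levenshtein distance, a greedy grouping pass, and a distance threshold of 2; it is not the authors' exact implementation, and the threshold and question list are only illustrative.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance between two questions."""
    wa, wb = a.lower().rstrip("?").split(), b.lower().rstrip("?").split()
    # Classic dynamic-programming edit distance over word tokens.
    prev = list(range(len(wb) + 1))
    for i, wi in enumerate(wa, 1):
        cur = [i]
        for j, wj in enumerate(wb, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (wi != wj)))    # substitution
        prev = cur
    return prev[-1]

def group_questions(questions, max_dist=2):
    """Greedily group unique questions whose pairwise word-level distance is
    at most `max_dist`; each group is a candidate for manual skeletonization
    into a question backbone."""
    groups = []
    for q in questions:
        for g in groups:
            if all(word_edit_distance(q, m) <= max_dist for m in g):
                g.append(q)
                break
        else:
            groups.append([q])
    return groups

# Illustrative usage on a few questions from Table 1.
questions = [
    "is the ct scan normal?",
    "is the mri normal?",
    "what abnormality is seen in the image?",
    "what is seen in the x-ray?",
]
print(group_questions(questions))
```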
The first iteration was able to detect most of the easy question groups, leaving the later iterations with only a small number of questions. The process was run until all questions were skeletonized, resulting in 68 question backbones. We labeled the question backbones with the four aforementioned question categories, partially based on the corresponding answers. In addition, we determined two sub-categories under the imaging modality category, namely the MR modality category and the contrast imaging type category. Next, we compared our own question category annotation with the official question category annotation for the VQA-Med 2019 test set (only available for this set) and found them equivalent.

SSM was able to populate dynamic question variations (with some rule-based restrictions, e.g., changing "ct scan" in "is the ct scan normal?" to any candidate other than "ct" and "image" results in a fallacious judgement of the image modality and hence is not allowed), and the same Pythia model trained with the augmented questions was able to rectify the yes-no and wh-question cross-answering errors. Nevertheless, we found that the SSM method rendered language modelling trivial: with its help, we can solve the VQA task as an image classification task.

Label inference from question backbones: based on the question category annotation, we were able to record the paired answer annotation as the label for each mapped task. In addition, we could also extract labels from the skeleton variables. For example, for the first question "is the ct scan normal?" in Table 1, "ct" is captured by ${ct alts} and "normal" is captured by ${normal alts}, producing a coarse modality label "ct" and also a binary abnormality label "normal" if the answer is "yes". We found the same can be generalized to infer task labels from the wh-questions.

An issue with the modality labels derived from question backbones is that the detailed modality (e.g., ct with or without contrast) is unknown. To address this issue, we treat the coarse modality labels as an independent task; the answer-derived modality labels were mapped back to the coarse labels following the information provided in [2]. Next, we treated all abnormality wh-questions as having an "abnormal" label, adding to the binary abnormality labels derived from the yes-no questions. At the end of the process, we were able to produce six classification tasks: 1) fine imaging modalities; 2) coarse imaging modalities; 3) imaging plane; 4) organ systems; 5) binary abnormality; and 6) categorical abnormality.

2.4 Multi-task Image Classification

The schematic of an exemplar image classification network we used is illustrated in Fig. 1, sketched together with the knowledge inference process. The two important tasks are the binary and categorical abnormality classification tasks, while the other four can be thought of as regularization tasks. We believe that all the tasks should be strongly correlated with each other, i.e., correct imaging modality and organ judgements should provide strong prior knowledge for the correct recognition of an abnormality.

Fig. 1. The schematic of an image classification network we used and the label inference result produced by the proposed SSM method. [Figure: for the question "what is most alarming about this ct scan?" with answer "pancreatic carcinoma", the matched backbone "what is most alarming about ${this_alts} ${ct_alts}?" yields a coarse imaging modality label "ct", a binary abnormality label "abnormal", and a categorical abnormality label "pancreatic carcinoma"; the input image passes through a backbone network (e.g., ResNet, DenseNet, VGG) whose shared feature space feeds the six classification heads.]
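A minimal sketch of the shared-backbone, multi-head classifier depicted in Fig. 1 is given below, assuming a torchvision ResNet-50 backbone. The per-task class counts are illustrative only: the fine modality, plane, organ, and categorical abnormality counts reuse the unique-answer statistics quoted earlier, while the coarse modality count is a placeholder; none of these reflect the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative class counts per task; the real values are derived by SSM.
TASKS = {
    "fine_modality": 36, "coarse_modality": 8, "plane": 16,
    "organ": 10, "binary_abnormality": 2, "categorical_abnormality": 332,
}

class MultiTaskClassifier(nn.Module):
    """Shared CNN backbone with one classification head per inferred task."""
    def __init__(self, tasks=TASKS):
        super().__init__()
        backbone = models.resnet50()             # ImageNet weights could be loaded here
        feat_dim = backbone.fc.in_features       # 2048 for ResNet-50
        backbone.fc = nn.Identity()              # keep only the pooled feature vector
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, n_cls) for name, n_cls in tasks.items()}
        )

    def forward(self, x):
        feats = self.backbone(x)                 # shared feature space
        return {name: head(feats) for name, head in self.heads.items()}

# Illustrative forward pass on a random batch.
logits = MultiTaskClassifier()(torch.randn(2, 3, 224, 224))
print({name: out.shape for name, out in logits.items()})
```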
Class-wise and task-wise normalization: since only the 2019 challenge images have an (almost) complete set of four QA pairs per image, a large number of images in the joint 2019 and 2020 dataset do not have a complete label set (mainly the 2020 images). Hence, when all six tasks are jointly optimized via a mini-batch gradient method, a conventional normalization by the batch size effectively assigns a lower weight to a less populated task, e.g., for a batch of 12 images, a task that has 3 labeled images effectively receives a 0.25 weighting. In addition to the incomplete label problem, we also observed imbalanced class distributions within the tasks; for example, in the categorical abnormality question category, the number of samples per abnormality class ranges from 4 to 104. We propose to solve both issues together with a class-wise and task-wise normalization, in order to jointly optimize all six tasks. Assume that t ∈ {coarse modality, . . .} represents a task; for a set of images X and the label set Y_t, the mini-batch training loss L is computed as:

L = \sum_{t} \frac{1}{\sum_{c_t} \mathbb{1}(c_t \in Y_t)} \sum_{c_t} \frac{1}{\sum_{y_t \in Y_t} \mathbb{1}(y_t = c_t)} \sum_{x \in X,\, y_t \in Y_t} \mathbb{1}(y_t = c_t) \cdot \ell_t(x, y_t),    (1)

where x ∈ X and y_t ∈ Y_t represent an individual image and label, \mathbb{1}(\cdot) denotes the indicator function, and c_t denotes a candidate class of t (e.g., c_t ∈ {ct, . . . , x-ray} if t = coarse modality).
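One possible PyTorch reading of Eq. (1) is sketched below. It assumes the per-task logits dictionary produced by the model sketched above, per-task cross-entropy as the sample loss ℓ_t, and integer label tensors in which -1 marks an image that carries no label for a task (the incomplete-label case). This is an illustrative implementation under those assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def class_task_normalized_loss(logits, labels):
    """Eq. (1): average the per-sample cross-entropy within each class present
    in the batch, then average over those classes, then sum over tasks.
    `logits`: dict task -> (B, C_t) float tensor.
    `labels`: dict task -> (B,) long tensor, with -1 for unlabeled images."""
    total = 0.0
    for t, y in labels.items():
        valid = y >= 0
        if not valid.any():
            continue                                  # task absent from this batch
        per_sample = F.cross_entropy(logits[t][valid], y[valid], reduction="none")
        yv = y[valid]
        class_means = []
        for c in yv.unique():                         # classes present in the batch
            mask = yv == c
            class_means.append(per_sample[mask].mean())     # class-wise normalization
        total = total + torch.stack(class_means).mean()     # task-wise normalization
    return total

# Hypothetical usage with the multi-task model above:
# loss = class_task_normalized_loss(model(images), batch_labels)
```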
2.5 Multi-scale and Multi-architecture Ensemble

We adopted a multi-scale learning technique, using 128, 256, 384, and 512 as the candidate image resize options. After applying the resize operation, we randomly crop the network input image at a ratio of 87.5% along both dimensions of the resized image. Random affine transformations and horizontal flips were also used. The initial learning rate is set to 1e-3 and linearly reduced to 1e-6 over 100 epochs using the Adam optimizer. For the architectures, ResNets [11], DenseNets [14], ResNeXts [34], MobileNet [13], and VGG nets [26] were selected as the image backbone candidates. We exposed the backbone and input scale options as training script hyper-parameters, which helped us to disperse the training over several GPU stations and gradually expand the number of ensemble members.
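The training and ensembling recipe above could be sketched as follows; the affine parameter ranges, the use of a centre crop at test time, and the averaging of softmax scores across ensemble members are assumptions of this sketch rather than details stated in the paper.

```python
import torch
import torchvision.transforms as T

def build_train_transform(resize):
    """Training augmentation for one input-scale option (128/256/384/512):
    resize, random 87.5% crop, random affine, horizontal flip."""
    crop = int(resize * 0.875)
    return T.Compose([
        T.Resize((resize, resize)),
        T.RandomCrop(crop),
        T.RandomAffine(degrees=10, translate=(0.1, 0.1)),  # illustrative ranges
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])

def build_eval_transform(resize):
    """Deterministic test-time preprocessing at the member's input scale
    (centre crop is an assumption)."""
    crop = int(resize * 0.875)
    return T.Compose([T.Resize((resize, resize)), T.CenterCrop(crop), T.ToTensor()])

def make_optimizer(model, epochs=100, lr0=1e-3, lr1=1e-6):
    """Adam with a learning rate decayed linearly from 1e-3 to 1e-6 over 100 epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr0)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda e: (lr1 + (lr0 - lr1) * (1 - e / epochs)) / lr0)
    return opt, sched

@torch.no_grad()
def ensemble_predict(members, image, task="categorical_abnormality"):
    """Average the softmax scores of all (scale, architecture) ensemble members;
    `members` is a list of (model, eval_transform) pairs."""
    probs = []
    for model, tf in members:
        model.eval()
        logits = model(tf(image).unsqueeze(0))[task]
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(0)
```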
2.6 Experiment Results

We show the validation results of all trained models in Table 3; the corresponding training volume includes the 2019-{train, val, test} and 2020-train sets. Based on these results, we decided which models to train for the test evaluation. Note that the training volume was changed to all of the 2019-{train, val, test} and 2020-{train, val, test} sets for training the testing-use models. We included the 2020-test set because some partial coarse imaging modality labels (i.e., from ${ct alts}) and binary abnormality labels (i.e., only the abnormal ones from wh-question abnormality) were extractable by SSM from the questions alone, which served as a form of weak regularization for the test images. Finally, for the categorical abnormality questions, we only select the top prediction from the VQA-Med 2020 subset of the abnormality classes as the prediction.

Table 3. The accuracy evaluation on the VQA-Med 2020 validation set.

Architecture       | 128   | 256   | 384   | 512   | Multi-scale | Multi-scale & Arch.
ResNet-50          | 0.510 | 0.508 | 0.478 | 0.492 | 0.558       | 0.570
ResNet-101         | 0.486 | 0.530 | 0.508 | 0.460 | 0.566       | 0.580
ResNet-152         | 0.486 | 0.522 | 0.486 | 0.386 | 0.548       | 0.596 0.596
ResNeXt-50 32x4d   | 0.510 | 0.538 | 0.492 | 0.456 | 0.566       | 0.590 0.584
ResNeXt-101 32x8d  | 0.522 | 0.520 | -     | -     | 0.538       |
DenseNet-121       | 0.548 | 0.562 | 0.536 | 0.504 | 0.600       |
DenseNet-161       | 0.526 | 0.520 | 0.518 | -     | 0.564       |
MobileNet v2       | 0.512 | 0.512 | 0.428 | -     | 0.538       |
VGG-16 with BN     | 0.478 | 0.482 | 0.426 | 0.486 | 0.530       |
VGG-19 with BN     | 0.444 | 0.474 | 0.442 | -     | 0.502       |

Our submissions are summarized in Table 4, together with their 2020 validation and test scores. Our second submission was intended to determine the exact category of the last question backbone in Table 1, as its five instances all appear in the 2020 test set. Although all other 2020 questions were in the abnormality question category (in line with the official statement), we found that these five questions could also be interpreted as asking which organ is present. We treated the five questions as categorical abnormality questions in the first submission and as organ questions in the second submission. Given that the accuracy dropped, the ground truth should be the abnormality category. From a post-challenge point of view, our third submission already secured the leading position on the leaderboard. Our fourth submission was intended to include more DenseNet-121 instances in the ensemble, as the DenseNet-121-only multi-scale ensemble showed the highest accuracy of 0.600 in Table 3. Our fifth submission added the two VGG multi-scale groups, presenting the final ensemble result of all trained models. Nevertheless, these final attempts only pushed the performance up marginally, suggesting a performance saturation in our approach.

Table 4. The officially evaluated accuracy and BLEU scores on the VQA-Med 2020 test set. The numbers in brackets, e.g., 256x2, indicate the use of 256 as the network input size, repeated 2 times (with different initial seeds).

ID    | Ensemble Members                                               | 2020-val Accu. | 2020-test Accu. | 2020-test BLEU
67598 | ResNet-50 (256x2, 384) + ResNet-101 (256) + ResNet-152 (256)   | 0.552 | 0.446 | 0.486
67737 | Same as 67598                                                  | 0.552 | 0.442 | 0.482
67915 | All ResNets + All ResNeXts + All DenseNets + All MobileNet v2  | 0.596 | 0.494 | 0.539
68012 | 67915 + extra DenseNet-121 (128x2, 256x2, 384x2, 512)          | -     | 0.496 | 0.540
68017 | 68012 + VGG-16/19                                              | -     | 0.496 | 0.542

3 VQG-Med Challenge Participation

3.1 Challenge Overview

The VQG-Med challenge dataset is much smaller than the VQA-Med datasets. The training set contains 780 radiology images with 2,156 associated QA pairs. The validation set has 141 images with 164 QA pairs. The test set has only 80 images. The goal of the VQG challenge is to generate between 1 and 7 questions for each test image.

3.2 Methodology

The VQG challenge describes a question generation task which, in concept, is close to image captioning, but our proposed solution continued with a classification approach. The main reason is that more than one ground-truth question can be tied to each image. Unlike in a VQA task, where a question can be considered as prior knowledge on which the corresponding answer is conditionally dependent, here such prior knowledge is lacking. Generating multiple questions without it could be resolved by sampling approaches, but it can be difficult to associate a random state with a specific ground-truth question. Hence, we instead treated all observed questions of an image as its attributes and modelled the question generation task, again, as an image attribute classification task. A downside of the classification approach is that it cannot produce novel questions. Our VQG approach was built upon our VQA-Med solution with the following settings.

– Solving the question generation task as a classification task leads to a total of 2,073 classes, each corresponding to a unique observed question from the joint training and validation sets.
– We were concerned that fine-tuning the entire image model with the limited amount of data and the large number of classes might lead to over-fitting at a much faster rate, hence we did not fine-tune the backbones. However, to compensate for the non-linear capacity, we added a 2-layer batch-normalized, fully-connected (FC) multi-layer perceptron (MLP) model (512 units each, ReLU activation) before the softmax layer. The MLP also avoided a direct mapping from the image features (e.g., 2048-dimensional features) to the 2,073 classes, which would result in a computationally expensive matrix multiplication and large memory usage.
– At the training hyper-parameter level, we kept the initial learning rate at 1e-3 but adjusted the final learning rate to 1e-5. We also shortened the number of epochs to 40.
– Each training image could be associated with more than one question, resulting in a multi-label problem. We used the Stochastic Ground Truth method in [16], which treats each image with multiple observed questions as multiple one-question-for-one-image samples, converting the multi-label problem into a single-label problem.
– The multi-scale and multi-architecture ensemble was continued in the VQG approach.

These settings helped us to reuse most of the VQA-Med code base and models to develop a tangible solution within a very short time frame; a sketch of the resulting classification head is given below.
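A compact sketch of this transfer-learning head follows, under stated assumptions: features come from a frozen VQA backbone (2048-dimensional for a ResNet-50), batch normalization is placed before each ReLU, the Stochastic Ground Truth step is realised by expanding multi-question images into single-question samples, and `id_to_question` is a hypothetical lookup from class index to question text.

```python
import torch
import torch.nn as nn

NUM_QUESTIONS = 2073   # unique questions observed in the joint train+val sets

class VQGHead(nn.Module):
    """Frozen image features -> 2-layer batch-normalized FC MLP (512 units,
    ReLU) -> logits over the observed question classes (softmax at inference)."""
    def __init__(self, feat_dim=2048, hidden=512, n_classes=NUM_QUESTIONS):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, feats):                 # feats come from a frozen backbone
        return self.mlp(feats)

def stochastic_ground_truth(samples):
    """Expand (image, [q1, q2, ...]) into single-label (image, q) pairs,
    treating each observed question as its own training sample."""
    return [(img, q) for img, questions in samples for q in questions]

@torch.no_grad()
def top7_questions(head, feats, id_to_question):
    """Generate up to seven questions per image from the highest-probability classes."""
    head.eval()
    probs = torch.softmax(head(feats), dim=-1)
    top = probs.topk(7, dim=-1).indices
    return [[id_to_question[i.item()] for i in row] for row in top]
```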
Table 5. The accuracy evaluation on the VQG-Med 2020 validation set.

Architecture  | 128   | 256   | 384   | 512   | Multi-scale Ensemble
ResNet-50     | 0.067 | 0.091 | 0.098 | 0.067 | 0.091
ResNet-101    | 0.055 | 0.098 | 0.080 | 0.061 | 0.067
ResNet-152    | 0.091 | 0.067 | 0.067 | 0.073 | 0.067
DenseNet-121  | -     | 0.085 | 0.079 | -     | 0.091
DenseNet-161  | -     | 0.079 | 0.079 | -     | 0.073

3.3 Experiment Results

Similar to the VQA-Med 2020 result presentation, we show the VQG-Med 2020 validation and test results separately in Tables 5 and 6, respectively. While the official evaluation only reports a BLEU score, in our local evaluation we used top-7 accuracy to assess the validation performance. For the official testing, each of our submissions generated seven questions per image, according to the highest class probabilities.

Table 6. The VQG-Med 2020 submitted results. The numbers in brackets indicate the network input scales of the respective member models.

ID    | Ensemble Members                                                       | val Accu. | test BLEU
67984 | ResNets-50/101/152 (no 512)                                            | 0.085 | 0.335
67995 | ResNets-50/101/152 (all scales)                                        | 0.073 | 0.335
67996 | 67995 + ResNets-50/101/152 (no 512 + answer prediction)                | 0.091 | 0.326
68006 | ResNet-50/101 (256, 384) + ResNet-152 (128) + DenseNet-121 (256, 384)  | 0.110 | 0.348
68018 | 68006 + DenseNet-161 (256, 384)                                        | 0.098 | 0.338

The first two submissions tested whether the large input size models should be continued. Given the lower top-7 accuracy on the validation set and the same BLEU value on the test set, we decided not to continue the 512 input size training. In the third submission, we tried to utilize the ground-truth answer annotations by introducing answer classification as an additional regularization task, but the result dropped by 0.009. In addition, the results of the first three submissions suggested a low correlation between the validation top-7 accuracy and the test BLEU scores. Hence, in our fourth submission we made two decisions in order to push for a much larger margin in the local evaluation: 1) excluding the low-accuracy models (validation accuracy < 0.079) from the ensemble; and 2) including the DenseNet-121 architecture, given its good performance in the VQA-Med challenge. The fourth submission scored 0.110 for the validation accuracy and 0.348 for the test BLEU score, securing our leading position in the VQG-Med challenge. Finally, in the fifth submission, we further added the DenseNet-161 multi-scale models as a last-minute attempt. Given that the local evaluation dropped by 0.012, the test performance drop was expected as well.

4 Discussion and Conclusion

In this paper, we described our participation in the 2020 VQA-Med challenge and the associated VQG-Med challenge. The centre of our approach is a knowledge inference method which we named Skeleton-based Sentence Mapping (SSM). In the VQA-Med challenge, the SSM method was useful on multiple fronts: 1) it mapped questions to a set of backbones which were useful for populating dynamic question instances; 2) it replaced the need for language modelling and provided a direct selection of the corresponding answer predictor; and 3) it was used to infer six image classification tasks and the corresponding training labels. Bypassing the development of language modelling allowed us to focus on tweaking the image classification model, so we devoted more time and resources to the multi-scale and multi-architecture ensemble learning. Finally, we developed a class-wise and task-wise normalization technique for balancing the class and task populations, allowing tasks with incomplete labels to be jointly optimized in one network.

The main inspiration for SSM came from [17], where we back-translated the questions via a number of foreign languages for augmentation purposes, resulting in groups of sentences with small wording variations, from which a sentence backbone could be inferred. Nevertheless, whether the augmented questions carry the same meaning needs to be manually checked. The idea of reverse-engineering the sentence backbone was extended during our participation in the VQA-Med challenge and led to the proposal of SSM.

We are aware that SSM is not fully automated and requires further development. In addition, we understand that SSM is a form of explicit reasoning model whose effectiveness highly depends on the question regularity and dataset size, and it may not generalize well to VQA datasets containing free-form questions.

References

1. Abacha, A.B., Gayen, S., Lau, J.J., Rajaraman, S., Demner-Fushman, D.: NLM at ImageCLEF 2018 visual question answering in the medical domain. In: CLEF (Working Notes) (2018)
2. Abacha, A.B., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In: CLEF (Working Notes) (2019)
3. Allaouzi, I., Ahmed, M.B.: Deep neural networks and decision tree classifier for visual question answering in the medical domain. In: CLEF (Working Notes) (2018)
4. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: VQA: Visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp. 2425–2433 (2015)
5. Ben Abacha, A., Datla, V.V., Hasan, S.A., Demner-Fushman, D., Müller, H.: Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain. In: CLEF 2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
6. Ben-Younes, H., Cadene, R., Cord, M., Thome, N.: MUTAN: Multimodal Tucker fusion for visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp. 2612–2620 (2017)
7. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
9. Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W.: Are you talking to a machine? Dataset and methods for multilingual image question. In: Advances in neural information processing systems. pp. 2296–2304 (2015)
10. Hasan, S.A., Ling, Y., Farri, O., Liu, J., Müller, H., Lungren, M.P.: Overview of ImageCLEF 2018 medical domain visual question answering task. In: CLEF (Working Notes) (2018)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
13. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
14. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
15. Ionescu, B., Müller, H., Péteri, R., Abacha, A.B., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in medical, lifelogging, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), vol. 12260. LNCS Lecture Notes in Computer Science, Springer, Thessaloniki, Greece (September 22-25 2020)
16. Liao, Z., Girgis, H., Abdi, A., Vaseli, H., Hetherington, J., Rohling, R., Gin, K., Tsang, T., Abolmaesumi, P.: On modelling label uncertainty in deep neural networks: Automatic estimation of intra-observer variability in 2D echocardiography quality assessment. IEEE Transactions on Medical Imaging 39(6), 1868–1883 (2019)
17. Liao, Z., Liu, L., Wu, Q., Teney, D., Shen, C., van den Hengel, A., Verjans, J.: Medical data inquiry using a question answering model. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). pp. 1490–1493. IEEE (2020)
18. Lin, M., Chen, Q., Yan, S.: Network in network. International Conference on Learning Representations (2014)
19. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances in neural information processing systems. pp. 289–297 (2016)
20. Malinowski, M., Rohrbach, M., Fritz, M.: Ask your neurons: A neural-based approach to answering questions about images. In: Proceedings of the IEEE international conference on computer vision. pp. 1–9 (2015)
21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)
22. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
23. Peng, Y., Liu, F., Rosen, M.P.: UMass at ImageCLEF medical visual question answering (Med-VQA) 2018 task. In: CLEF (Working Notes) (2018)
24. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in neural information processing systems. pp. 2953–2961 (2015)
25. Shi, L., Liu, F., Rosen, M.P.: Deep multimodal learning for medical visual question answering. In: CLEF (Working Notes) (2019)
26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
27. Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8317–8326 (2019)
28. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Thirty-first AAAI conference on artificial intelligence (2017)
29. Talafha, B., Al-Ayyoub, M.: JUST at VQA-Med: A VGG-Seq2Seq model. In: CLEF (Working Notes) (2018)
30. Vu, M., Sznitman, R., Nyholm, T., Löfstedt, T.: Ensemble of streamlined bilinear visual question answering models for the ImageCLEF 2019 challenge in the medical domain. In: CLEF 2019. vol. 2380 (2019)
31. Wu, Q., Shen, C., Liu, L., Dick, A., Van Den Hengel, A.: What value do explicit high level concepts have in vision to language problems? In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 203–212 (2016)
32. Wu, Q., Wang, P., Shen, C., Dick, A., Van Den Hengel, A.: Ask me anything: Free-form visual question answering based on knowledge from external sources. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4622–4630 (2016)
33. Wu, Z., Palmer, M.: Verb semantics and lexical selection. arXiv preprint cmp-lg/9406033 (1994)
34. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1492–1500 (2017)
35. Xu, H., Saenko, K.: Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In: European Conference on Computer Vision. pp. 451–466. Springer (2016)
36. Yan, X., Li, L., Xie, C., Xiao, J., Gu, L.: Zhejiang University at ImageCLEF 2019 visual question answering in the medical domain. In: CLEF (Working Notes) (2019)
37. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 21–29 (2016)
38. Yu, Z., Yu, J., Fan, J., Tao, D.: Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp. 1821–1830 (2017)
39. Zhou, Y., Kang, X., Ren, F.: Employing Inception-ResNet-v2 and Bi-LSTM for medical domain visual question answering. In: CLEF (Working Notes) (2018)
40. Zhou, Y., Kang, X., Ren, F.: TUA1 at ImageCLEF 2019 VQA-Med: a classification and generation model based on transfer learning. In: CLEF (Working Notes) (2019)