TAM at VQA-Med 2021: A Hybrid Model with Feature
Extraction and Fusion for Medical Visual Question
Answering
Yong Li1, Zhenguo Yang2 and Tianyong Hao1
1 School of Computer Science, South China Normal University, China
2 School of Computer Science, Guangdong University of Technology, China


Abstract
This paper briefly describes our model for the ImageCLEF Medical Visual Question Answering Task 2021 (ImageCLEF VQA-Med task 2021). Our method is based on a universal VQA framework and consists of an image feature extraction module, a question feature extraction module and a feature fusion module. We employ a modified ResNet-34 as the backbone of the image feature extractor, which effectively extracts pixel-level features and enhances model performance with a deep network. For question feature extraction, we first use word embeddings to map question tokens to high-dimensional vectors and then feed them into a long short-term memory (LSTM) network to extract high-level question features. In addition, we leverage Multi-modal Factorized Bilinear pooling (MFB) with a co-attention mechanism to fuse these features and predict the final answers. Our model achieves an accuracy of 0.222 and a BLEU score of 0.255, ranking eighth among all participating teams in the VQA-Med task.

Keywords
VQA, ResNet, LSTM, Co-Attention, MFB




1. Introduction
In recent years, applications of deep learning in Computer Vision (CV) and Natural Language
Processing (NLP) have made remarkable progress. The progress of deep learning on single
modalities has encouraged researchers to explore multimodal studies, e.g., image-text retrieval [1],
image captioning [2] and Visual Question Answering (VQA) [3]. These techniques have been
successfully applied in domains such as finance [4], transportation [5] and medicine [6].
When applied to the medical domain, VQA can support automatic interpretation of radiology
images and clinical decision making, thereby alleviating the shortage of medical resources.
   In the ImageCLEF VQA-Med task 2021 [7], given a radiology image and a related question,
the class of disease indicated in the image needs to be predicted. Compared to ordinary
benchmark datasets such as VQA [8], TDIUC [9] and Visual7W [10], the ImageCLEF VQA-Med
task 2021 is more challenging. The images of ordinary datasets come with abundant prior
knowledge, such as object labels with coordinates and category information. However, the
images of the ImageCLEF VQA-Med task 2021 dataset [11] do not contain such object-level labels.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
yonglicutter@gmail.com (Y. Li); yzg@gdut.edu.cn (Z. Yang); haoty@m.scnu.edu.cn (T. Hao)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073), http://ceur-ws.org
Besides, the questions in ordinary VQA datasets usually contain many entity names, whereas the
questions of the ImageCLEF VQA-Med task 2021 contain no entity names of diseases.
An intuitive comparison of data examples from common VQA tasks and the ImageCLEF VQA-Med
task 2021 is shown in Figure 1. In addition, the amount of data in the two kinds of datasets
differs dramatically (tens of thousands of samples for the former versus only 4,500 for the latter).
   For the universal VQA benchmark datasets, the object-level and pixel-level information of
images as well as the entity information in questions can be utilized for answer prediction.
Such prior knowledge greatly enhances the performance of VQA models. Given these
limitations of the ImageCLEF VQA-Med task 2021 dataset, we use a modified ResNet-34 [12] as
the image feature extraction module. The basic structure of ResNet-34 is a convolutional neural
network (CNN), which can learn the data distribution and pixel-level features from a small number of
images. Besides, the residual structure stabilizes the information flow during training, which
benefits high-level feature extraction. We use a long short-term memory (LSTM) network [13] to
extract question features from the embedded vectors produced by the word embedding module.
After that, Multi-modal Factorized Bilinear pooling (MFB) with a co-attention mechanism [3] is
introduced to fuse the image features and question features. Finally, we predict the final answer
by applying a softmax function to the fused features.




                        (a) Data example of ImageCLEF 2021 VQA-Med dataset.




                               (b) Data example of common VQA tasks.

Figure 1: Comparison of ImageCLEF VQA-Med task 2021 dataset and common VQA datasets.
2. Related Work
CNNs have been widely used for image feature extraction in computer vision. Since LeNet
[14] was introduced to extract image features, more and more CNN variants
(AlexNet, VGG, GoogLeNet) have been applied to computer vision tasks. With solutions to the
vanishing gradient problem (residual learning [12] and dense connections [15]), deeper CNNs can be
constructed to improve model performance on different vision tasks. Recently, many researchers have
focused on utilizing the Transformer [16] as the image feature encoder and have achieved remarkable
performance. In this setting, the input images are split into patches and treated as sequences,
which are then fed into the Transformer for feature extraction. However, this requires a large amount
of training data to learn the dataset distribution, which is not effective in this task.
   Research on NLP tasks has also been greatly advanced by the Transformer. The Transformer
uses fully connected (FC) layers to build a self-attention mechanism, which replaces RNNs for
learning contextual information in long sentences. Later, Devlin
et al. [17] proposed BERT by stacking Transformer encoders. With a deeper structure, BERT
sums word, segment and position embeddings as its input to reach
better performance on NLP tasks. In the following years, XLNet [18], GPT-2 [19] and GPT-3 [20] were
proposed to further improve performance on NLP tasks. However, all of these models
require a large amount of training data to fit the distribution of the input text. In contrast, an LSTM [13]
can preserve the contextual features of long sequences even with a small amount of text.
Therefore, we use an LSTM as our question feature extractor instead.
   Simple feature fusion methods in deep neural networks include concatenating different
kinds of features along the channel dimension and computing element-wise sums or products of
feature maps of the same size. However, these operations might not be expressive enough to fully
capture the complex associations between two different modalities. Multimodal Compact Bilinear pooling
(MCB) [21] projects the image and text representations into a higher-dimensional space and then
convolves the two vectors via element-wise multiplication in Fast Fourier Transform (FFT) space.
Multi-modal Factorized Bilinear pooling (MFB) [22] introduces a co-attention mechanism to
jointly learn image and question attention. The co-attention mechanism can effectively
learn which image regions are important with respect to the question.


3. Method
An overview of the architecture of our proposed model is shown in Figure 2. Our model
consists of three components: an image feature extraction module, a question feature extraction
module and an attentional feature fusion module.

3.1. Image Feature Extraction Module
The main component of the original ResNet-34 is a convolutional neural network (CNN). With its
properties of translation invariance, translation equivariance, and robustness to scale and rotation
changes, a CNN can learn a strong prior from a small amount of data. Besides, the residual
structure has a large receptive field and keeps the gradient from vanishing during training.
Benefiting from these strengths, the image feature extraction module effectively learns pixel-level
and spatial features of the medical images. Although the small amount of data makes overfitting
likely, we alleviate this with a dropout operation. The output of the original ResNet-34 is a
1000-dimensional vector intended for 1000-class classification, which is not suitable for this task.

Figure 2: The overview of the architecture of our proposed model.
   Compared to the original ResNet-34, we remove the global average pooling (GAP) layer and the
fully connected (FC) layer, as shown in Figure 3. Before the images are fed to this module,
each image is resized to 128×128 with the INTER_AREA interpolation algorithm. To fit the input
size of the feature fusion module, we reshape the output image feature maps from
B×512×16×16 to B×64×2048, where B denotes the batch size during training. We
take the reshaped feature maps as the extracted image features.
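As an illustration, the following is a minimal PyTorch sketch of such an image feature extractor, not the authors' implementation. Removing GAP and FC and reshaping B×512×16×16 to B×64×2048 follow the description above; the stride adjustments that make 128×128 inputs yield 16×16 feature maps, the dropout rate, and the use of the torchvision backbone are our assumptions, since only the input and output shapes are reported.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


class ImageFeatureExtractor(nn.Module):
    """Modified ResNet-34 trunk with GAP and FC removed, reshaped to B x 64 x 2048.

    A minimal sketch. The paper reports B x 512 x 16 x 16 feature maps for 128 x 128
    inputs, which implies an overall downsampling factor of 8 rather than ResNet's
    default 32; setting the last two stages to stride 1 (below) and the dropout rate
    are assumptions, not details reported in the paper.
    """

    def __init__(self, dropout: float = 0.5):
        super().__init__()
        backbone = resnet34(weights=None)  # trained from scratch, as in the paper
        # Assumption: undo the stride-2 downsampling of stages 3 and 4 so that
        # 128 x 128 inputs produce 16 x 16 feature maps.
        for stage in (backbone.layer3, backbone.layer4):
            stage[0].conv1.stride = (1, 1)
            stage[0].downsample[0].stride = (1, 1)
        # Drop the global average pooling and fully connected layers.
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        self.dropout = nn.Dropout(dropout)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: B x 3 x 128 x 128 (resized beforehand, e.g. cv2.resize with INTER_AREA)
        feats = self.dropout(self.trunk(images))        # B x 512 x 16 x 16
        return feats.reshape(feats.size(0), 64, 2048)   # B x 64 x 2048 for the fusion module


if __name__ == "__main__":
    x = torch.randn(2, 3, 128, 128)
    print(ImageFeatureExtractor()(x).shape)  # torch.Size([2, 64, 2048])
```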




Figure 3: The structure of the modified ResNet-34.



3.2. Text Feature Extraction Module
In this task, we use word embeddings and an LSTM to extract the question features.
Preprocessing of the raw questions includes two steps: tokenizing the question words and
fixing the sentence length to 12 tokens. The tokenized sequences are then passed to the word
embedding module, which generates embedded word vectors of dimension B×12×600 (the tokens
are encoded using GloVe word embeddings). These embedded vectors are fed into the LSTM module
to acquire high-level question features. The input and hidden dimensions are set to 600 and 1024,
respectively, and a single LSTM layer is used in the feature extraction module. In this task,
we use the output features of the whole sequence instead of only the last token, to preserve
the information of the complete sentence structure.
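A minimal PyTorch sketch of this question feature extractor is given below. The vocabulary size is a hypothetical parameter, and initialising the 600-dimensional embedding table from pre-trained GloVe vectors is not shown, since the exact GloVe variant is not reported.

```python
import torch
import torch.nn as nn


class QuestionFeatureExtractor(nn.Module):
    """Word embedding followed by a single-layer LSTM returning per-token features.

    A minimal sketch: `vocab_size` is a hypothetical parameter, and in practice the
    600-dimensional embedding weights would be initialised from pre-trained GloVe
    vectors (not shown here).
    """

    def __init__(self, vocab_size: int, embed_dim: int = 600, hidden_dim: int = 1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: B x 12 integer token ids -> embedded: B x 12 x 600
        embedded = self.embedding(tokens)
        # Keep the output of every time step (B x 12 x 1024), not just the last token,
        # to preserve the information of the whole sentence structure.
        outputs, _ = self.lstm(embedded)
        return outputs
```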
   As the word vectors pass through the LSTM, the forget gate uses a sigmoid function to decide
how much information from the previous time step is carried forward, which allows the LSTM to keep
useful information and filter out useless information. The input gate determines which information
needs to be updated at the current time step, and the output gate passes the updated information to
the next time step or emits it as the final output. With these gate units, the LSTM has a strong
ability to store state information, which helps capture contextual correlations in long sequences.
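For reference, the standard LSTM cell equations [13] underlying this description are (textbook notation, not notation introduced in this paper):

\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \qquad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i),\\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(c_t),
\end{aligned}

where \sigma is the sigmoid function, \odot denotes element-wise multiplication, x_t is the embedded word vector at time step t, and h_t is the hidden state used as the question feature for that token.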

3.3. Attentional Feature Fusion Module
After extracting the image features and question features, we feed them together into
Multi-modal Factorized Bilinear pooling (MFB) with the co-attention mechanism to obtain the
fused features.
   MFB retains strong expressive capacity while compacting features from different modalities
by using matrix factorization tricks. The co-attention mechanism consists of a self-attention (SA)
unit and a guided-attention (GA) unit. Given features X and Y from two modalities, a self-attention
operation is first applied to X to generate attention features, denoted X_attn. Then X_attn is
used to guide the attention learning on Y, producing attention features denoted Y_attn. This
operation strengthens the connection between the two modalities during learning. By introducing
the co-attention mechanism into MFB, joint feature representation learning becomes more accurate
and effective.
   In this task, we input the image features and question features (denoted I_feat and Q_feat),
extracted by the modified ResNet-34 and the LSTM, into MFB with the co-attention mechanism,
as shown in Figure 4. We first apply a self-attention operation to Q_feat to obtain the question
attention features Q_feat_attn. We then use MFB to fuse Q_feat_attn with I_feat and guide the
image attention learning, generating the image attention features I_feat_attn. After that,
we use MFB to fuse I_feat_attn and Q_feat_attn via element-wise multiplication and project
the fused features through a linear layer. Finally, we apply a softmax function to the fused
features to predict the probability of each answer.

Figure 4: The co-Attention mechanism in our model.
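Below is a hedged PyTorch sketch of this fusion pipeline, not the authors' code. The MFB joint dimension and factor size, the exact attention parameterisation, and the 332-way answer classifier (the number of answer classes in Table 1) are assumptions filling in details not reported above; only the overall flow (question self-attention, MFB-guided image attention, MFB fusion, softmax over answers) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFB(nn.Module):
    """Multi-modal Factorized Bilinear pooling [22]: project, multiply, sum-pool, normalise.

    A minimal sketch; the joint dimension (1024) and factor k = 5 are common defaults
    from the MFB paper, not values reported in this working note.
    """

    def __init__(self, x_dim: int, y_dim: int, joint_dim: int = 1024, k: int = 5):
        super().__init__()
        self.joint_dim, self.k = joint_dim, k
        self.proj_x = nn.Linear(x_dim, joint_dim * k)
        self.proj_y = nn.Linear(y_dim, joint_dim * k)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        z = self.proj_x(x) * self.proj_y(y)                    # element-wise product
        z = z.view(-1, self.joint_dim, self.k).sum(dim=2)      # sum pooling over k factors
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)   # power normalisation
        return F.normalize(z)                                  # l2 normalisation


class CoAttentionFusion(nn.Module):
    """Question self-attention -> MFB-guided image attention -> MFB fusion -> answer scores.

    A hedged reconstruction of Section 3.3: the attention layers and the 332-way
    classifier fill in details that the paper does not report.
    """

    def __init__(self, img_dim: int = 2048, q_dim: int = 1024, num_answers: int = 332):
        super().__init__()
        self.q_attn = nn.Linear(q_dim, 1)       # self-attention scores over 12 tokens
        self.guide = MFB(img_dim, q_dim)        # fuses each image region with Q_feat_attn
        self.i_attn = nn.Linear(1024, 1)        # guided-attention scores over 64 regions
        self.fuse = MFB(img_dim, q_dim)
        self.classifier = nn.Linear(1024, num_answers)

    def forward(self, i_feat: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # i_feat: B x 64 x 2048 (image regions), q_feat: B x 12 x 1024 (question tokens)
        q_w = F.softmax(self.q_attn(q_feat), dim=1)            # B x 12 x 1
        q_attn = (q_w * q_feat).sum(dim=1)                     # Q_feat_attn: B x 1024

        # Guided attention: fuse Q_feat_attn with every region and score each region.
        b, n, _ = i_feat.shape
        q_rep = q_attn.unsqueeze(1).expand(-1, n, -1)          # B x 64 x 1024
        joint = self.guide(i_feat.reshape(b * n, -1), q_rep.reshape(b * n, -1))
        i_w = F.softmax(self.i_attn(joint).view(b, n, 1), dim=1)
        i_attn = (i_w * i_feat).sum(dim=1)                     # I_feat_attn: B x 2048

        fused = self.fuse(i_attn, q_attn)                      # B x 1024
        return self.classifier(fused)  # answer logits; softmax is applied for prediction
```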


4. Experiments
4.1. Data Description
The dataset of the ImageCLEF VQA-Med task 2021 contains a training set of 4,000 image-question
pairs, a validation set of 500 pairs and a test set of 500 pairs. The images, questions and answers
form one-to-one associated triplets. The training set of the ImageCLEF VQA-Med task 2021 is
exactly the same as the training set of the ImageCLEF VQA-Med task 2020, while the organizers
replaced the image, question and answer samples in the 2021 validation and test sets. Therefore,
we merged the validation set of the ImageCLEF VQA-Med task 2020 with the training set of the
ImageCLEF VQA-Med task 2021 to extend the training data to 4,500 image-question-answer triplets.
   By analyzing the dataset, we found that the data can be divided into two types: close-ended,
i.e., the answers are "yes/no" and the questions start with "Is/Does/...", and open-ended, i.e., the
questions do not have a fixed structure and could have multiple correct answers. According to our
statistics, there are 88 close-ended pairs in the training set and the rest are open-ended. However,
there are no close-ended data in the validation set or the test set. There are 332 answer classes in
the training set and 236 (a subset of the training classes) in the validation set. The statistics are
summarized in Table 1.
Table 1
Statistics of the ImageCLEF VQA-Med task 2021 dataset.
                            Dataset      Training set     Validation set   Test set
                       Open-ended            4412               500          500
                       Close-ended            88                 0            0
                    Classes of Answers        332               236        Unknown


4.2. Optimization
Our model was optimized with a BCE loss function on RTX 2080 GPU devices for 800 training
epochs. We trained the network from scratch without using any pretrained model weights.
We employed Adam with β1 = 0.9 and β2 = 0.999 as the optimizer, and used a dynamic learning
rate to update the weights. In terms of convergence, we visualize the optimization of the
objective of our proposed model in Figure 5, from which we can observe that the training
loss decreases while the validation accuracy and BLEU scores increase. We used the model
weights that produced the highest validation accuracy as the final model weights. We
submitted the result generated by this trained model on the test set and achieved an accuracy
of 0.222 and a BLEU score of 0.255. The top-10 results of the competition are shown in Table 2.
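As an illustration, a minimal sketch of this optimization setup is shown below. Only the BCE objective, the Adam hyperparameters and the 800 training epochs come from this section; the initial learning rate, the step decay standing in for the dynamic learning rate, the batch size and the stand-in classifier over 1024-dimensional fused features are placeholders.

```python
import torch
import torch.nn as nn

# A hedged sketch of Section 4.2: BCE loss, Adam with beta1 = 0.9 and beta2 = 0.999,
# a decaying ("dynamic") learning rate, and 800 epochs. The learning rate, its schedule,
# the batch size and the stand-in linear model are illustrative assumptions.
model = nn.Linear(1024, 332)                       # stand-in for the full VQA model
criterion = nn.BCEWithLogitsLoss()                 # BCE over the 332 answer classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

for epoch in range(800):
    # Placeholder batch: in practice this loops over the 4,500 training triplets.
    fused = torch.randn(32, 1024)                  # fused features from the fusion module
    targets = torch.zeros(32, 332)                 # multi-hot ground-truth answer vectors
    targets[torch.arange(32), torch.randint(0, 332, (32,))] = 1.0

    optimizer.zero_grad()
    loss = criterion(model(fused), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()                               # decay the learning rate every 200 epochs
```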




          (a) Train loss.             (b) Validation accuracy score.        (c) Validation bleu score.

Figure 5: Optimization of our model.



4.3. Discussion
An intuitive observation from Figure 5 (b) and (c) is that our method performs much better on
the validation set than on the test set. The reasons can be summarized as follows. The sampling
of the test set was different from that of the training and validation sets. Besides, in order to
fit the input size of the image feature extraction module, the input images were coarsely resized
to 128×128, which might lose some spatial information and introduce noise. In addition, the
questions of the dataset provide little entity information about the related classes.
Table 2
Official results of ImageCLEF VQA-Med task 2021.
                                Participants       Accuracy     BLEU
                                   duadua            0.382     0.416
                               Zhao_Ling_Ling        0.362     0.402
                                    TeamS            0.348     0.391
                            jeanbenoit_delbrouck     0.348     0.384
                                     riven           0.332     0.361
                                  Zhao_Shi           0.316     0.352
                                 IALab_PUC           0.236     0.276
                                TAM (ours)           0.222     0.255
                                   sliencec          0.220     0.235
                                   sheerin           0.196     0.227


5. Conclusion
This paper describes the model we designed for the ImageCLEF VQA-Med task 2021 competition.
We proposed a modified ResNet-34 + LSTM + MFB model with a co-attention mechanism to predict
the final answers of VQA-Med. In the image feature extraction process, a CNN was employed to learn
the image prior and a dropout operation was introduced to suppress overfitting. For the
question feature extraction module, an LSTM, which does not rely on a large amount of data, was
utilized to extract the question features. The features of the two modalities were fused by MFB
with the co-attention mechanism to generate the fused features for answer prediction. The best
result of our model is 0.222 in accuracy and 0.255 in BLEU score.


References
 [1] Q. Zhang, Z. Lei, Z. Zhang, S. Z. Li, Context-aware attention network for image-text
     retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
     Recognition, 2020, pp. 3536–3545. doi:10.1109/CVPR42600.2020.00359.
 [2] Y. Feng, L. Ma, W. Liu, J. Luo, Unsupervised image captioning, in: Proceedings of the
     IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4125–4134.
     doi:10.1109/CVPR.2019.00425.
 [3] Z. Yu, J. Yu, Y. Cui, D. Tao, Q. Tian, Deep modular co-attention networks for visual question
     answering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
     Recognition, 2019, pp. 6281–6290. doi:10.1109/CVPR.2019.00644.
 [4] G. Liu, Y. Mao, Q. Sun, H. Huang, W. Gao, X. Li, J. Shen, R. Li, X. Wang, Multi-scale
     two-way deep neural network for stock trend prediction, in: Proceedings of the Twenty-Ninth
     International Joint Conference on Artificial Intelligence (IJCAI 2020), 2020, pp. 4555–4561.
     doi:10.24963/ijcai.2020/621.
 [5] A. Prakash, K. Chitta, A. Geiger, Multi-modal fusion transformer for end-to-end au-
     tonomous driving, arXiv preprint arXiv:2104.09224 (2021).
 [6] B. D. Nguyen, T.-T. Do, B. X. Nguyen, T. Do, E. Tjiputra, Q. D. Tran, Overcoming data
     limitation in medical visual question answering, in: International Conference on Medical
     Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 522–530. doi:10.
     1007/978-3-030-32251-9_57.
 [7] B. Ionescu, H. Müller, R. Peteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A.
     Hasan, V. Kovalev, S. Kozlovski, V. Liauchuk, Y. Dicente, O. Pelka, A. G. S. de Herrera,
     J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D.
     Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid,
     A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval
     in medical, nature, internet and social media applications, in: Experimental IR Meets
     Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International
     Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer
     Science, Springer, Bucharest, Romania, 2021.
 [8] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, Vqa: Visual
     question answering, in: Proceedings of the IEEE international conference on computer
     vision, 2015, pp. 2425–2433. doi:10.1109/ICCV.2015.279.
 [9] K. Kafle, C. Kanan, An analysis of visual question answering algorithms, in: Proceedings
     of the IEEE International Conference on Computer Vision, 2017, pp. 1965–1973. doi:10.
     1109/ICCV.2017.217.
[10] Y. Zhu, O. Groth, M. Bernstein, L. Fei-Fei, Visual7w: Grounded question answering in
     images, in: Proceedings of the IEEE conference on computer vision and pattern recognition,
     2016, pp. 4995–5004. doi:10.1109/CVPR.2016.540.
[11] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview of the
     vqa-med task at imageclef 2021: Visual question answering and generation in the medical
     domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org,
     Bucharest, Romania, 2021.
[12] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in:
     European conference on computer vision, Springer, 2016, pp. 630–645. doi:10.1007/
     978-3-319-46493-0_38.
[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997)
     1735–1780. doi:10.1162/neco.1997.9.8.1735.
[14] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document
     recognition, in: Proceedings of the IEEE, volume 86, Ieee, 1998, pp. 2278–2324. doi:10.
     1109/5.726791.
[15] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolu-
     tional networks, in: Proceedings of the IEEE conference on computer vision and pattern
     recognition, 2017, pp. 4700–4708. doi:10.1109/CVPR.2017.243.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polo-
     sukhin, Attention is all you need, arXiv preprint arXiv:1706.03762 (2017).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[18] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, Xlnet: Generalized
     autoregressive pretraining for language understanding, arXiv preprint arXiv:1906.08237
     (2019).
[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are
     unsupervised multitask learners, OpenAI Blog 1 (2019).
[20] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv
     preprint arXiv:2005.14165 (2020).
[21] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact
     bilinear pooling for visual question answering and visual grounding, arXiv preprint
     arXiv:1606.01847 (2016).
[22] Z. Yu, J. Yu, J. Fan, D. Tao, Multi-modal factorized bilinear pooling with co-attention learn-
     ing for visual question answering, in: Proceedings of the IEEE international conference
     on computer vision, 2017, pp. 1821–1830.