bumjun jung at VQA-Med 2020: VQA model based on feature extraction and multi-modal feature fusion

Bumjun Jung1, Lin Gu2,1, and Tatsuya Harada1,2
1 The University of Tokyo, Japan {jung, lingu, harada}@mi.t.u-tokyo.ac.jp
2 RIKEN AIP, Japan

Abstract. This paper describes the submission of the University of Tokyo for the Medical Domain Visual Question Answering (VQA-Med) task [3] at ImageCLEF 2020 [11]. The data set for the task mostly consists of medical images and question-answer (QA) pairs concerning the abnormalities that appear in the images. We extract visual features with a VGG16 network [16] using Global Average Pooling (GAP) [14]. Whereas the model [18] that ranked first in last year's competition used BERT [6] to encode the semantic features of the questions, we use bioBERT [13], a BERT model pre-trained on biomedical text. To fuse the two feature modalities, we apply Multi-modal Factorized High-order (MFH) Pooling [20] with co-attention, which performs better than the Multi-modal Factorized Bilinear (MFB) Pooling [19] used in [18]. The fused features are then fed to a decoder that predicts the answer as a classification task. Our model achieves an accuracy of 0.466 and a BLEU score of 0.502, ranking 3rd among all participating teams in the VQA-Med task [3] at ImageCLEF 2020 [11].

Keywords: Visual Question Answering · Medical Imagery · Global Average Pooling · bioBERT · Multi-modal Factorized High-order Pooling

1 Introduction

With the many achievements and rapid progress in Artificial Intelligence (AI) related to Computer Vision (CV) and Natural Language Processing (NLP), AI technology has recently been applied in the medical domain to analyze pathological images and medical reports. Specifically, it is used to detect abnormalities or symptoms shown in the images and to generate explanations of the medical images.

The Visual Question Answering (VQA) task involves both CV and NLP techniques to process the data. A VQA data set comprises images and question-answer (QA) pairs about those images. The images and questions are the inputs to a VQA system, whose goal is to predict the answers to the given questions. Large-scale VQA data sets for the general domain [2], [8] exist, and many advanced models and techniques solve the task effectively. With increasing interest in applying AI technology in the medical field, VQA in the medical domain is drawing attention because it can support doctors' clinical decisions and enhance patients' understanding of their conditions from medical images, especially in patient-centered medical care.

VQA in the medical domain is more challenging than in the general domain. First, since the cost of collecting valid data is high, the medical data available for training are limited compared to general-domain data sets such as [2], [8], where hundreds of thousands of images and QA pairs are available. Second, the vocabulary used in QA pairs and medical reports is quite distinct from the language used in daily life. The VQA-Med data set provided by ImageCLEF 2020 consists of a training set of 4,000 radiology images with QA pairs, a validation set of 500, and a test set of 500 with questions but without answers.
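Because our system predicts the answer by classification over the answers seen during training (see the abstract and Section 3), the QA pairs can be mapped to a discrete label space. The following is a minimal illustrative sketch of that step, not the official data loader; the (image_id, question, answer) record layout and the function name are hypothetical assumptions.

```python
# Illustrative sketch only: build an answer -> class-index mapping from training QA pairs.
# The (image_id, question, answer) record layout is a hypothetical assumption,
# not the official VQA-Med 2020 file format.
from typing import Dict, List, Tuple

def build_answer_vocab(train_records: List[Tuple[str, str, str]]) -> Dict[str, int]:
    """Map each distinct training answer to a class index for answer classification."""
    answers = sorted({answer.strip().lower() for _, _, answer in train_records})
    return {answer: idx for idx, answer in enumerate(answers)}

# Example with dummy records.
records = [("img001", "what abnormality is seen in the image?", "pulmonary embolism"),
           ("img002", "what is most alarming about this ct scan?", "acute appendicitis")]
answer_to_idx = build_answer_vocab(records)  # {"acute appendicitis": 0, "pulmonary embolism": 1}
```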
As illustrated in Fig. 1, the questions in the VQA-Med 2020 data set mostly concern the abnormalities shown in the images.

Fig. 1. One example of the VQA-Med data set provided by ImageCLEF 2020

The framework proposed in this paper is shown in Fig. 2 and can be described by the following steps:
1. A VGG16 network [16] with GAP [14] (green) extracts image features from the input image.
2. The bioBERT model [13] (blue) captures the semantics of the question and encodes it into textual features.
3. The visual and textual features are fused by a fusion mechanism called MFH Pooling [20] (purple).
4. A co-attention mechanism (purple) is applied to both visual and textual features so that the model focuses on particular image regions based on the question features, and vice versa.
5. Finally, the features fused by MFH Pooling [20] with co-attention are fed to the decoder, which predicts the answer as a classification task.

Fig. 2. General pipeline of the proposed framework

Compared to the method of [18], our improvements and contributions comprise three points. First, for visual feature extraction, the dimension of the extracted feature is reduced from 1984 to 1472 to avoid over-fitting while maintaining the amount of information in the extracted features. Second, the bioBERT model [13] is used to extract textual features instead of the BERT model [6] used in [18]. While bioBERT [13] has the same network structure as BERT [6], it is pre-trained on biomedical texts, which differ from the general-domain text used to pre-train BERT [6]. Third, MFH Pooling [20], an advanced version of the MFB Pooling [19] used in [18], is used to fuse the visual and textual features; MFH Pooling [20] achieves higher performance than MFB Pooling [19].

2 Related Works

There have been many developments in the methods and models used for open-domain VQA [2], [8]. For these tasks, deep Convolutional Neural Networks (CNNs) such as VGGNet [16] and ResNet [9] are frequently used to extract image features after being pre-trained on large-scale general-domain data sets such as ImageNet [5]. For question processing, NLP models based on Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) [10] and gated recurrent units (GRU) [4], are frequently used not only to encode textual features but also to generate answers as output. Similarly, NLP models such as BERT [6], pre-trained on large-scale data, are applied to extract semantic features from text.

Attention mechanisms and multi-modal feature fusion are important components of a VQA system, since VQA is a multidisciplinary task that involves both CV and NLP approaches. Attention mechanisms have been successfully employed in image captioning [17], and NLP models such as BERT [6] adopt self-attention transformers in their network structure. Multi-modal feature fusion is essential for the VQA task since it combines information from both modalities to predict the right answer. Fusion techniques have evolved, starting from the hierarchical co-attention model (Hie+CoAtt) [15], which employs a co-attention mechanism with element-wise summation, concatenation, and fully connected layers (a simple fusion of this kind is sketched below).
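For orientation, the following is a minimal sketch of such a simple concatenation-plus-fully-connected fusion. It is our own illustration with arbitrary dimensions, not code from [15].

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Baseline fusion: concatenate the two modalities and mix them with a fully connected layer.
    The dimensions are illustrative and not taken from [15]."""
    def __init__(self, img_dim: int = 1472, txt_dim: int = 768, out_dim: int = 1024):
        super().__init__()
        self.fc = nn.Linear(img_dim + txt_dim, out_dim)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([img_feat, txt_feat], dim=1)  # (batch, img_dim + txt_dim)
        return torch.relu(self.fc(fused))               # (batch, out_dim)
```

The bilinear-style methods discussed next model multiplicative interactions between the two feature vectors instead of simply concatenating them.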
Multimodal Compact Bilinear (MCB) pooling [7] approximates the outer product between the two feature vectors so that all pairwise interactions between the features are represented, while reducing the computational cost compared to computing the outer product directly. Multi-modal Low-rank Bilinear (MLB) pooling [12] generates output features with lower dimensions and models with fewer parameters than MCB Pooling. MFB Pooling [19] addresses the slow convergence of MLB Pooling as well as its sensitivity to hyper-parameters. MFH Pooling [20] extends MFB Pooling to a generalized high-order setting to fuse the multi-modal features more effectively.

The first-ranked method of the VQA-Med challenge at ImageCLEF 2019 [1] is described in [18]; it uses a VGG16 network with GAP to extract visual features from the input images and a BERT model to extract textual features from the questions. The extracted features are fused by MFB Pooling with co-attention, and the fused features are used to predict the answer as a classification task.

In addition, bioBERT [13] is a pre-trained language representation model for the biomedical domain that shares the same network structure as BERT [6]. bioBERT is pre-trained on biomedical corpora and can capture the semantic features of biomedical texts, such as medical reports, more effectively than BERT.

3 Methodology

This section describes the whole pipeline of our VQA model submitted to the ImageCLEF 2020 VQA-Med task [3]. As shown in Fig. 2, the image features and question features are first extracted from the input image and question by the image feature extractor and the question encoder. The extracted features are then fused by a feature-fusion method with co-attention and passed to a classification network that selects the answer.

3.1 Image feature extractor

In our VQA framework, a VGG16 network pre-trained on the ImageNet data set [5] is used to extract image features. The GAP [14] strategy is applied to the VGG16 network to prevent over-fitting. GAP takes the channel-wise average of the last convolution output at each stage of the network, and the stages have different numbers of channels. If the input image shape is 224x224x3, as in our model, the output shapes of the VGG16 stages are 224x224x64, 112x112x128, 56x56x256, 28x28x512, 14x14x512, and 7x7x512, where the last number of each shape is the channel size of that stage. After averaging each output over its spatial dimensions, the dimension of the extracted feature equals the channel size of the stage. These features are concatenated to form a 1472-dimensional (64+128+256+512+512=1472) vector, which is used as the image feature and fed to the next network. The 7x7x512 output is excluded when extracting features because it is only the output of a max-pooling layer and represents the same information as the preceding 14x14x512 output.

3.2 Question encoder

bioBERT [13] is used to extract the semantic features of the given questions. bioBERT is pre-trained on biomedical text and has the same network structure as BERT [6]; when pre-trained on biomedical corpora, it largely outperforms BERT and previous state-of-the-art models on a variety of biomedical text-mining tasks. To extract textual features that represent the question sentence, we average the last layer of the bioBERT-base model to obtain a 768-dimensional question feature vector.

3.3 Feature fusion with co-attention

Fusing multi-modal features is an essential technique for improving the performance of a VQA model. As mentioned in Section 2, the Multi-modal Factorized High-order (MFH) Pooling [20] method can fuse multi-modal features with less computational cost and improved performance (a sketch of the fusion unit is given below).
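To make the fusion step concrete, the following is a minimal sketch of an MFB unit and its high-order MFH cascade in the spirit of [19], [20]. The input dimensions match our 1472-dimensional image feature and 768-dimensional question feature, but the factor size, output size, and order are illustrative assumptions, and the co-attention described next is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Multi-modal Factorized Bilinear pooling unit (sketch; sizes are illustrative)."""
    def __init__(self, img_dim=1472, txt_dim=768, factor_k=5, out_dim=1000, dropout=0.1):
        super().__init__()
        self.k, self.o = factor_k, out_dim
        self.proj_img = nn.Linear(img_dim, factor_k * out_dim)
        self.proj_txt = nn.Linear(txt_dim, factor_k * out_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, img_feat, txt_feat, prior=None):
        # Element-wise product of the projected modalities (the "expanded" representation).
        joint = self.proj_img(img_feat) * self.proj_txt(txt_feat)  # (batch, k*o)
        if prior is not None:
            joint = joint * prior  # high-order coupling used by the MFH cascade
        joint = self.drop(joint)
        # Sum-pool over the factor dimension k, then apply power and L2 normalization.
        pooled = joint.view(-1, self.o, self.k).sum(dim=2)          # (batch, o)
        pooled = torch.sign(pooled) * torch.sqrt(torch.abs(pooled) + 1e-8)
        pooled = F.normalize(pooled, dim=1)
        return pooled, joint

class MFH(nn.Module):
    """MFH: a cascade of MFB units whose pooled outputs are concatenated."""
    def __init__(self, order=2, **mfb_kwargs):
        super().__init__()
        self.blocks = nn.ModuleList(MFB(**mfb_kwargs) for _ in range(order))

    def forward(self, img_feat, txt_feat):
        pooled_outputs, prior = [], None
        for block in self.blocks:
            pooled, prior = block(img_feat, txt_feat, prior)
            pooled_outputs.append(pooled)
        return torch.cat(pooled_outputs, dim=1)  # (batch, order * out_dim)
```

With img_feat of shape (batch, 1472) and txt_feat of shape (batch, 768), MFH(order=2) returns a (batch, 2000) fused vector in this sketch.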
The co-attention mechanism helps the model learn the importance of each part of both the visual and the textual features. It uses information from both modalities to learn which parts of the features are important and to ignore irrelevant information. We therefore employ MFH Pooling with co-attention to fuse the visual and textual features.

4 Training

Our model is trained for 990 epochs on one Quadro GV100 GPU in about 4 hours. This section describes the detailed process and the parameters used in the actual training.

4.1 Train data extension

Besides the data set provided for the ImageCLEF 2020 VQA-Med task [3], we also took advantage of the VQA-Med data set of ImageCLEF 2019 [1]. From the data set in [1], only the data whose QA pairs also exist in the VQA-Med 2020 data set are used to train the model: 978 pairs from the training set and 143 pairs from the validation set of [1] are used to extend the VQA-Med 2020 data set.

4.2 Hyper-parameters

Hyper-parameters are set according to the performance on the validation data set. We used the binary cross-entropy loss as the loss function, the Adam optimizer with an initial learning rate of 3e-5, and L1 regularization with a coefficient of 5e-11. MFH Pooling [20] is used with the default parameters described in [20], except for the dropout coefficient, which is set to 0.85 to prevent over-fitting (a sketch of this training configuration is given below).
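The following is a minimal sketch of such a training configuration. The model object, data loader, and answer-target construction are hypothetical placeholders; only the loss, optimizer, and L1 term reflect the settings stated above.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, train_loader, optimizer, l1_coeff=5e-11, device="cuda"):
    """One training epoch with binary cross-entropy loss and L1 regularization (Section 4.2).
    `model` and `train_loader` are hypothetical placeholders: the model maps
    (image_feat, question_feat) to answer logits, and the loader yields
    (image_feat, question_feat, target) batches with answer-class targets."""
    criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy; applying it to logits is our assumption
    model.train()
    for image_feat, question_feat, target in train_loader:
        image_feat = image_feat.to(device)
        question_feat = question_feat.to(device)
        target = target.to(device)
        logits = model(image_feat, question_feat)
        loss = criterion(logits, target)
        # L1 regularization with the coefficient stated in Section 4.2.
        loss = loss + l1_coeff * sum(p.abs().sum() for p in model.parameters())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Adam with the stated initial learning rate; the model definition itself is assumed elsewhere.
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
```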
5 Evaluation

Two evaluation metrics were adopted for the VQA-Med 2020 competition: accuracy (strict) and BLEU score. Accuracy measures the ratio of correct predictions, and the BLEU score measures the similarity between the ground-truth answer and the predicted answer. The maximum validation accuracy of our model was 0.612, and the accuracy over training epochs is shown in Fig. 3. For the actual training of the submitted model, the validation accuracy reached 1.0 because the validation data set was also included in the training data.

Fig. 3. Accuracy transition by training epoch

Among the 5 valid submissions, the model described in this paper, which comprises VGG16 (with GAP) + bioBERT + MFH Pooling (with co-attention), achieved an accuracy of 0.466 and a BLEU score of 0.502 on the test data set. Our submission took 3rd place in the competition. Fig. 4 shows the leader-board page of the VQA-Med competition.

Fig. 4. Leader-board page of the competition. Our team ID is bumjun jung

6 Conclusion

This paper describes the model submitted to the ImageCLEF 2020 VQA-Med challenge. Our model ranked 3rd and achieved an accuracy of 0.466 and a BLEU score of 0.502 on the test data set. We applied the bioBERT model [13], which encodes biomedical text more effectively than BERT [6], to extract textual features. We also used MFH Pooling [20], which extends MFB Pooling [19] to a generalized high-order setting, to fuse the multi-modal features. In future work, we will continue to improve the current network and apply it to other data sets and tasks.

Acknowledgement

This work was supported by JSPS KAKENHI Grant Number JP20H05556, JST AIP Acceleration Research Grant Number JPMJCR20U3, and JST ACT-X Grant Number JPMJAX190D. We would like to thank Kohei Uehara, Ryohei Shimizu, Dr. Hiroaki Yamane, and Dr. Yusuke Kurose for helpful discussions.

References

1. Abacha, A.B., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In: CLEF (Working Notes) (2019)
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2425–2433 (2015)
3. Ben Abacha, A., Datla, V.V., Hasan, S.A., Demner-Fushman, D., Müller, H.: Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain. In: CLEF 2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25, 2020)
4. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
7. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847 (2016)
8. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6904–6913 (2017)
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
11. Ionescu, B., Müller, H., Péteri, R., Ben Abacha, A., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in medical, lifelogging, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), LNCS vol. 12260. Springer, Thessaloniki, Greece (September 22-25, 2020)
12. Kim, J.H., Lee, S.W., Kwak, D., Heo, M.O., Kim, J., Ha, J.W., Zhang, B.T.: Multimodal residual learning for visual QA. In: Advances in Neural Information Processing Systems. pp. 361–369 (2016)
13. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
14. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
15. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances in Neural Information Processing Systems. pp. 289–297 (2016)
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
17. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. pp. 2048–2057 (2015)
18. Yan, X., Li, L., Xie, C., Xiao, J., Gu, L.: Zhejiang University at ImageCLEF 2019 visual question answering in the medical domain. In: CLEF (Working Notes) (2019)
19. Yu, Z., Yu, J., Fan, J., Tao, D.: Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1821–1830 (2017)
20. Yu, Z., Yu, J., Xiang, C., Fan, J., Tao, D.: Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems 29(12), 5947–5959 (2018)